# Final Project 

#### Please note: due to memory constraints, we are using a 70% randomised, representative sample of the full data set for our analyses and visualisations. This subset was created in a previous exercise.

This notebook focuses on creating additional variables and dataframes to further our analysis.

###  Contents:

1. Importing libraries and files
2. Considering security implications of our data
3. Comparing customer behaviour in different geographic areas across the US
    1. Assigning states to geographic regions
    2. Analysing spending by regions
4. Creating an exclusion flag for low-activity customers (customers with fewer than 5 orders) and excluding them from the data
    1. High-activity dataframe
    2. Low-activity dataframe
5. Exporting new data frames
    

## 1. Importing Libraries and Files

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
#create usable path 
path = r'C:\Users\rutha\CareerFoundry\01-23_Instacart_Basket_Analysis'

In [3]:
#import data set
df = pd.read_pickle(os.path.join(path, '02_Data', 'Prepared_data', 'big_sample.pkl'))

In [4]:
#checking output
print('A sample of our orders and products data:')
df.head()

A sample of our orders and products data:


|  | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | _merge | ... | order_freq | order_freq_flag | gender | state | age | date_joined | n_dependants | dependants_loc | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2539329 | 1 | 1 | 2 | 8 | NaN | 14084 | 2 | 0 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 4 | 2539329 | 1 | 1 | 2 | 8 | NaN | 26405 | 5 | 0 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 5 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 7 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 12427 | 3 | 1 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 8 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 13176 | 4 | 0 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |


In [6]:
df.shape

(22705099, 31)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22705099 entries, 1 to 32435058
Data columns (total 31 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   order_id                int32         
 1   user_id                 int32         
 2   order_number            int8          
 3   orders_day_of_week      int8          
 4   order_hour_of_day       int8          
 5   days_since_prior_order  float16       
 6   product_id              int16         
 7   add_to_cart_order       int16         
 8   reordered               int16         
 9   _merge                  category      
 10  product_name            object        
 11  aisle_id                float16       
 12  department_id           float16       
 13  prices                  float64       
 14  price_range_loc         object        
 15  busiest_days            object        
 16  busiest_period_of_day   object        
 17  max_order               int8          
 18  

## 2. Considering security implications in the data

Here we check whether the data set contains any personally identifiable information that we would need to flag.

In [8]:
df.head()

|  | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | _merge | ... | order_freq | order_freq_flag | gender | state | age | date_joined | n_dependants | dependants_loc | fam_status | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2539329 | 1 | 1 | 2 | 8 | NaN | 14084 | 2 | 0 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 4 | 2539329 | 1 | 1 | 2 | 8 | NaN | 26405 | 5 | 0 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 5 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 7 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 12427 | 3 | 1 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |
| 8 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 13176 | 4 | 0 | both | ... | 20.5 | Non-frequent customer | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 |


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22705099 entries, 1 to 32435058
Data columns (total 31 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   order_id                int32         
 1   user_id                 int32         
 2   order_number            int8          
 3   orders_day_of_week      int8          
 4   order_hour_of_day       int8          
 5   days_since_prior_order  float16       
 6   product_id              int16         
 7   add_to_cart_order       int16         
 8   reordered               int16         
 9   _merge                  category      
 10  product_name            object        
 11  aisle_id                float16       
 12  department_id           float16       
 13  prices                  float64       
 14  price_range_loc         object        
 15  busiest_days            object        
 16  busiest_period_of_day   object        
 17  max_order               int8          
 18  

**Observations**

Our current merged data set contains no personally identifiable information (PII).

One of our original data sets, 'customers', contained two columns, first name and surname, which would have counted as PII. Because user_id is a better unique identifier, we decided to drop those two columns.
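The check described above can be sketched as a simple scan of column names. This is a minimal illustration, not the code used in the earlier exercise: the keyword list, the helper `find_pii_columns`, and the column names `first_name` and `surname` are assumptions, and the toy values are made up.

```python
import pandas as pd

# Hypothetical keyword list; extend it to match your own naming conventions.
PII_KEYWORDS = ("first_name", "surname", "last_name", "email", "phone", "address")


def find_pii_columns(frame, keywords=PII_KEYWORDS):
    """Return the column names that contain any of the PII-style keywords."""
    return [
        col for col in frame.columns
        if any(key in col.lower() for key in keywords)
    ]


# Toy stand-in for the original 'customers' data set (assumed column names).
customers = pd.DataFrame({
    "user_id": [1, 2],
    "first_name": ["Ann", "Bob"],
    "surname": ["Lee", "Ray"],
    "state": ["AL", "CA"],
})

pii_cols = find_pii_columns(customers)
customers_clean = customers.drop(columns=pii_cols)

print(pii_cols)
print(customers_clean.columns.tolist())
```

A name-based scan only catches columns whose names hint at their content, so it complements rather than replaces a manual review of `df.head()` and `df.info()`.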

## 3. Comparing customer behaviour across different regions

**Create a regional segmentation of the data. You’ll need to create a “Region” column based on the “State” column from your customers data set.**

**Determine whether there’s a difference in spending habits between the different U.S. regions.**

### 3.1 Assigning states to geographic regions

In [10]:
# checking which state values are present before creating the "region" column
df['state'].value_counts().sort_index()

AK    454931
AL    446684
AR    445981
AZ    458113
CA    462486
CO    447652
CT    436642
DC    429773
DE    446004
FL    440808
GA    459836
HI    443836
IA    438140
ID    425326
IL    443165
IN    439833
KS    446661
KY    443395
LA    446450
MA    453348
MD    439026
ME    447276
MI    442176
MN    453947
MO    448296
MS    444206
MT    445274
NC    456547
ND    446390
NE    438309
NH    431732
NJ    440202
NM    458744
NV    445370
NY    446026
OH    440753
OK    456549
OR    445497
PA    467164
RI    460504
SC    446097
SD    443749
TN    433210
TX    448721
UT    429469
VA    449068
VT    428174
WA    443467
WI    440226
WV    428857
WY    451009
Name: state, dtype: int64

In [11]:
#defining region lists 
region1_NE = ['ME', 'NH', 'VT', 'MA', 'RI', 'CT', 'NY', 'PA', 'NJ']
region2_MW = ['WI', 'MI', 'IL', 'IN', 'OH', 'ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO']
region3_S = ['DE', 'MD', 'DC', 'VA', 'WV', 'NC', 'SC', 'GA', 'FL', 'KY', 'TN', 'MS', 'AL', 'OK', 'TX', 'AR', 'LA']
region4_W = ['ID', 'MT', 'WY', 'NV', 'UT', 'CO', 'AZ', 'NM', 'AK', 'WA', 'OR', 'CA', 'HI']

In [12]:
#assigning region values to a new column (region) based on information from the US Census Bureau

df.loc[df['state'].isin(region1_NE), 'region'] = 'Northeast'
df.loc[df['state'].isin(region2_MW), 'region'] = 'Midwest'
df.loc[df['state'].isin(region3_S), 'region'] = 'South'
df.loc[df['state'].isin(region4_W), 'region'] = 'West'

In [12]:
#checking regions output
df['region'].value_counts(dropna = False)

South        7561212
West         5811174
Midwest      5321645
Northeast    4011068
Name: region, dtype: int64
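Since the four counts above contain no NaN row, every state was matched to a region. The same guarantee can be checked up front. The sketch below reuses the region lists from the notebook, inverts them into a single lookup dictionary, and applies it with `Series.map`. The names `regions`, `state_to_region`, and `toy` are illustrative, and 'PR' is a deliberately unlisted code used to show how unmapped values surface.

```python
import pandas as pd

region1_NE = ['ME', 'NH', 'VT', 'MA', 'RI', 'CT', 'NY', 'PA', 'NJ']
region2_MW = ['WI', 'MI', 'IL', 'IN', 'OH', 'ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO']
region3_S = ['DE', 'MD', 'DC', 'VA', 'WV', 'NC', 'SC', 'GA', 'FL', 'KY', 'TN',
             'MS', 'AL', 'OK', 'TX', 'AR', 'LA']
region4_W = ['ID', 'MT', 'WY', 'NV', 'UT', 'CO', 'AZ', 'NM', 'AK', 'WA', 'OR', 'CA', 'HI']

regions = {
    'Northeast': region1_NE,
    'Midwest': region2_MW,
    'South': region3_S,
    'West': region4_W,
}

# Flatten the lists and confirm that no state appears in two regions.
all_states = [state for states in regions.values() for state in states]
has_duplicates = len(all_states) != len(set(all_states))

# Invert the dictionary into a state -> region lookup.
state_to_region = {
    state: name for name, states in regions.items() for state in states
}

# Toy frame; 'PR' is intentionally absent from every region list.
toy = pd.DataFrame({'state': ['AL', 'CA', 'NY', 'OH', 'PR']})
toy['region'] = toy['state'].map(state_to_region)

# States that did not receive a region show up as NaN.
unmapped = toy.loc[toy['region'].isna(), 'state'].unique().tolist()
print(len(state_to_region), has_duplicates, unmapped)
```

A single `map` call also replaces the four separate `.loc` assignments, which can be convenient when the region definitions change.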

### 3.2 Analysing spending by regions

In [13]:
#creating a spending_habits crosstab
crosstab_spend = pd.crosstab(df['region'], df['spending_flag'], dropna = False)

In [14]:
crosstab_spend

| region / spending_flag | High spender | Low spender |
|---|---|---|
| Midwest | 109504 | 5212141 |
| Northeast | 75760 | 3935308 |
| South | 147243 | 7413969 |
| West | 112417 | 5698757 |
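Raw counts are hard to compare because the regions differ in size, so a row-wise share makes the regional comparison easier to read. The sketch below rebuilds the crosstab from the counts shown above and divides each row by its total. The names `counts` and `shares` are illustrative. On the full dataframe, `pd.crosstab(df['region'], df['spending_flag'], normalize='index')` should give the same proportions.

```python
import pandas as pd

# Counts copied from the crosstab output above.
counts = pd.DataFrame(
    {
        'High spender': [109504, 75760, 147243, 112417],
        'Low spender': [5212141, 3935308, 7413969, 5698757],
    },
    index=pd.Index(['Midwest', 'Northeast', 'South', 'West'], name='region'),
)

# Divide every row by its own total so that each row sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

# Show the shares as percentages, rounded for readability.
print((shares * 100).round(2))
```

With these counts the high-spender share sits at roughly 2% of rows in every region, which suggests that any regional difference in spending habits is small. Note that the shares describe order lines, not unique customers.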


## 4. Customers with Low Activity

The client asked us to exclude any customers with fewer than 5 orders. We therefore create an exclusion flag that marks these low-activity customers and then filter them out of the data.

In [15]:
# creating the exclusion flag 
df.loc[df['max_order'] < 5, 'exclusion_flag'] = 'low-activity'

In [16]:
df.loc[df['max_order'] >= 5, 'exclusion_flag'] = 'high-activity'

In [17]:
#checking new added column
df['exclusion_flag'].value_counts(dropna = False)

high-activity    21695423
low-activity      1009676
Name: exclusion_flag, dtype: int64
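As an alternative to the two `.loc` assignments, the flag can be built in one step with `np.where`. This is a minimal sketch on a made-up toy frame, not a change to the notebook's method. It assumes `max_order` has no missing values, as is the case for the int8 column here, because `np.where` would label a missing value as 'high-activity'.

```python
import numpy as np
import pandas as pd

# Toy data; only the max_order column matters for the flag.
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'max_order': [10, 10, 4, 5],
})

# Fewer than 5 orders -> low-activity, everything else -> high-activity.
toy['exclusion_flag'] = np.where(
    toy['max_order'] < 5, 'low-activity', 'high-activity'
)

print(toy['exclusion_flag'].value_counts(dropna=False))
```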

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22705099 entries, 1 to 32435058
Data columns (total 33 columns):
 #   Column                  Dtype         
---  ------                  -----         
 0   order_id                int32         
 1   user_id                 int32         
 2   order_number            int8          
 3   orders_day_of_week      int8          
 4   order_hour_of_day       int8          
 5   days_since_prior_order  float16       
 6   product_id              int16         
 7   add_to_cart_order       int16         
 8   reordered               int16         
 9   _merge                  category      
 10  product_name            object        
 11  aisle_id                float16       
 12  department_id           float16       
 13  prices                  float64       
 14  price_range_loc         object        
 15  busiest_days            object        
 16  busiest_period_of_day   object        
 17  max_order               int8          
 18  

### 4.1 Excluding low-activity customers from the data to create the high-activity subset

In [19]:
#creating a new dataframe minus low-activity customers
df_high = df.loc[df['exclusion_flag'] == 'high-activity']

In [20]:
#checking output 
df_high.head()

|  | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | _merge | ... | gender | state | age | date_joined | n_dependants | dependants_loc | fam_status | income | region | exclusion_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2539329 | 1 | 1 | 2 | 8 | NaN | 14084 | 2 | 0 | both | ... | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 | South | high-activity |
| 4 | 2539329 | 1 | 1 | 2 | 8 | NaN | 26405 | 5 | 0 | both | ... | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 | South | high-activity |
| 5 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 196 | 1 | 1 | both | ... | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 | South | high-activity |
| 7 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 12427 | 3 | 1 | both | ... | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 | South | high-activity |
| 8 | 2398795 | 1 | 2 | 3 | 7 | 15.0 | 13176 | 4 | 0 | both | ... | Female | AL | 31 | 2019-02-17 | 3 | Has dependants | married | 40423 | South | high-activity |


In [21]:
df_high.shape

(21695423, 33)

### 4.2 Excluding high-activity customers from the data to create the low-activity subset

In [22]:
#creating a new dataframe for just low-activity customers
df_low = df.loc[df['exclusion_flag'] == 'low-activity']

In [23]:
# checking output
df_low.head()

|  | order_id | user_id | order_number | orders_day_of_week | order_hour_of_day | days_since_prior_order | product_id | add_to_cart_order | reordered | _merge | ... | gender | state | age | date_joined | n_dependants | dependants_loc | fam_status | income | region | exclusion_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 360 | 2717275 | 5 | 1 | 3 | 12 | NaN | 15349 | 1 | 0 | both | ... | Female | CA | 75 | 2018-10-08 | 0 | No dependants | divorced/widowed | 115242 | West | low-activity |
| 362 | 2717275 | 5 | 1 | 3 | 12 | NaN | -16761 | 3 | 0 | both | ... | Female | CA | 75 | 2018-10-08 | 0 | No dependants | divorced/widowed | 115242 | West | low-activity |
| 363 | 2717275 | 5 | 1 | 3 | 12 | NaN | 28289 | 4 | 0 | both | ... | Female | CA | 75 | 2018-10-08 | 0 | No dependants | divorced/widowed | 115242 | West | low-activity |
| 364 | 2717275 | 5 | 1 | 3 | 12 | NaN | 8518 | 5 | 0 | both | ... | Female | CA | 75 | 2018-10-08 | 0 | No dependants | divorced/widowed | 115242 | West | low-activity |
| 367 | 2717275 | 5 | 1 | 3 | 12 | NaN | 26604 | 8 | 0 | both | ... | Female | CA | 75 | 2018-10-08 | 0 | No dependants | divorced/widowed | 115242 | West | low-activity |


In [24]:
df_low.shape

(1009676, 33)
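One value in the low-activity preview deserves a closer look: row 362 shows a `product_id` of -16761. The `df.info()` output lists `product_id` as int16, which can only hold values up to 32767, so an ID above that limit would have wrapped around when the column was downcast. The sketch below reproduces the effect with NumPy. The original ID 48775 is an assumption (it is the value that wraps to -16761 if IDs stay below 65536), and the helper `fits_dtype` is illustrative.

```python
import numpy as np

# Example IDs; 48775 is an assumed original value above the int16 limit.
ids = np.array([196, 14084, 48775], dtype=np.int64)

# Casting to int16 does not raise an error; out-of-range values wrap silently.
wrapped = ids.astype(np.int16)
print(wrapped)


def fits_dtype(values, dtype):
    """Return True if every value lies inside the integer range of dtype."""
    info = np.iinfo(dtype)
    return bool(values.min() >= info.min and values.max() <= info.max)


# A range check before downcasting would have caught the problem.
print(fits_dtype(ids, np.int16), fits_dtype(ids, np.int32))
```

If this is what happened, re-deriving `product_id` from the source data with a wider type such as int32 would be the safer fix before any product-level analysis.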

## 5. Exporting newly created data sets

In [25]:
#exporting high activity data set
df_high.to_pickle(os.path.join(path, '02_Data', 'Prepared_data', 'high_activity_customers.pkl'))

In [26]:
#exporting low activity data set
df_low.to_pickle(os.path.join(path, '02_Data', 'Prepared_data', 'low_activity_customers.pkl'))
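After exporting, a quick round-trip check can confirm that the file on disk matches the dataframe in memory. The sketch below uses a made-up toy frame and a temporary directory so that it runs anywhere. In the notebook, the same comparison would use `df_high` and the path built with `os.path.join(path, '02_Data', 'Prepared_data', ...)`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the exported dataframe.
toy_high = pd.DataFrame({
    'user_id': [1, 2],
    'exclusion_flag': ['high-activity', 'high-activity'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'high_activity_customers.pkl')
    toy_high.to_pickle(out_path)
    # Read the file back while the temporary directory still exists.
    reloaded = pd.read_pickle(out_path)

# equals() compares shape, dtypes, and values in one call.
round_trip_ok = reloaded.equals(toy_high)
print(round_trip_ok, reloaded.shape)
```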