# 4.10 Coding Etiquette & Excel Reporting

## This script contains the following points:
** **

1. Importing libraries and data

2. Considering security implications

3. Comparing customer behavior in different geographic areas
    * 3.1 Create a regional segmentation of the data
    * 3.2 Check to see if spending habits vary by U.S. regions


4. Creating an exclusion flag for low-activity customers (customers with fewer than 5 orders) and excluding them from the data
    * 4.1 Excluding low-activity customers from the data and exporting the subset
    * 4.2 Excluding high-activity customers from the data and exporting the subset


5. Customer profiling
    * 5.1 Market segmentation by demographics
        * 5.1.1 Age segmentation of Instacart customers
        * 5.1.2 Income-based segmentation of Instacart customers
        * 5.1.3 Segmentation based on the number of dependents and relationship status of Instacart customers
        * 5.1.4 Segmentation by department
    * 5.2 Market segmentation based on behavior
        * 5.2.1 Snack eaters
        * 5.2.2 Customers who do not buy alcohol
        * 5.2.3 Customers who own pets
        * 5.2.4 Vegetarian customers
        * 5.2.5 Parents with babies
        * 5.2.6 Early birds and night owls
       
6. Customer profile visualization

7. Customer profile aggregation for usage frequency and expenditure

8. Customer profile comparison in regions and departments


9. Exporting updated dataframes and charts
    * 9.1 Exporting dataframes
    * 9.2 Exporting charts
    
** **

## 1. Importing libraries and data
** **

In [1]:
# Import Libraries 

import pandas as pd 
import numpy as np
import os 
import matplotlib.pyplot as plt 
import seaborn as sns
import scipy

In [2]:
# Set a path

path = r'/Users/berk/Instacart_Grocery_Basket_Analysis'

In [3]:
# Import the most recently merged dataset (orders, products and customers dataframes)

df_merged_all = pd.read_pickle(os.path.join(path, '02_Data', 'Prepared_Data', 'df_ords_prods_custs.pkl'))

In [4]:
# Import departments dataframe

df_depts = pd.read_csv(os.path.join(path, '02_Data', 'Prepared_Data', 'departments_wrangled.csv'), index_col = False)

# 2. Considering security implications
** **

In [5]:
df_merged_all.head()

Unnamed: 0.1,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,Unnamed: 0,...,order_frequency_flag,first_name,last_name,gender,state,age,date_joined,number_dependents,fam_status,income
0,2539329,1,1,2,8,,196,1,0,195,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,195,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,195,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,195,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,195,...,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423


In [6]:
df_merged_all.dtypes

order_id                     int64
user_id                     object
order_number                 int64
orders_day_of_week           int64
order_hour_of_day            int64
days_since_prior_order     float64
product_id                   int64
add_to_cart_order            int64
reordered                    int64
Unnamed: 0                   int64
product_name                object
aisle_id                     int64
department_id                int64
prices                     float64
_merge                    category
price_range_loc             object
busiest_day                 object
busiest_days                object
busiest_period_of_day       object
max_order                    int64
loyalty_flag                object
avg_price                  float64
spending_flag               object
median_prior_orders        float64
order_frequency_flag        object
first_name                  object
last_name                   object
gender                      object
state               

### Addressing PII data

The 'first_name' and 'last_name' columns are personally identifiable information (PII): they could be used to trace records back to a particular person.

To address this and protect customer privacy, both columns are removed from the dataset.

In [7]:
# Dropping PII data 

df_merged_all = df_merged_all.drop(columns = ['first_name','last_name'])

In [8]:
# Check that PII data has been deleted from the dataset.

df_merged_all.head()

Unnamed: 0.1,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,Unnamed: 0,...,spending_flag,median_prior_orders,order_frequency_flag,gender,state,age,date_joined,number_dependents,fam_status,income
0,2539329,1,1,2,8,,196,1,0,195,...,High spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,195,...,High spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,195,...,High spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,195,...,High spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,195,...,High spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
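
Because `head()` hides the middle columns behind `...`, a visual check can miss a column that is still present. The sketch below shows one way to verify the removal programmatically. It runs on a small made-up frame (`df_demo`, `pii_columns`, and `remaining_pii` are hypothetical names, not part of the original notebook); the same pattern would apply to `df_merged_all`.

```python
import pandas as pd

# Toy frame standing in for df_merged_all (values are made up for illustration)
df_demo = pd.DataFrame({
    'user_id': ['1', '2'],
    'first_name': ['A', 'B'],
    'last_name': ['C', 'D'],
    'state': ['Alabama', 'Hawaii'],
})

pii_columns = ['first_name', 'last_name']

# errors='ignore' keeps the cell safe to re-run after the columns are already gone
df_demo = df_demo.drop(columns=pii_columns, errors='ignore')

# Explicit check: which PII columns, if any, are still present?
remaining_pii = [col for col in pii_columns if col in df_demo.columns]
print(remaining_pii)
```

An empty list means none of the listed PII columns remain in the frame.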


# 3. Comparing customer behavior in different geographic areas
** **

In [9]:
# Check the state column

df_merged_all['state'].value_counts(dropna = False)

Pennsylvania            667082
California              659783
Rhode Island            656913
Georgia                 656389
New Mexico              654494
Arizona                 653964
North Carolina          651900
Oklahoma                651739
Alaska                  648495
Minnesota               647825
Massachusetts           646358
Wyoming                 644255
Virginia                641421
Missouri                640732
Texas                   640394
Colorado                639280
Maine                   638583
North Dakota            638491
Alabama                 638003
Kansas                  637538
Louisiana               637482
Delaware                637024
South Carolina          636754
Oregon                  636425
Arkansas                636144
Nevada                  636139
New York                635983
Montana                 635265
South Dakota            633772
Illinois                633024
Hawaii                  632901
Washington              632852
Mississi

## 3.1 Create a regional segmentation of the data

In [10]:
# Lists of defined regions

region_NE = ['Maine', 'New Hampshire', 'Vermont', 'Massachusetts', 'Rhode Island', 'Connecticut', 'New York', 'Pennsylvania', 'New Jersey']
region_MW = ['Wisconsin', 'Michigan', 'Illinois', 'Indiana', 'Ohio', 'North Dakota', 'South Dakota', 'Nebraska', 'Kansas', 'Minnesota', 'Iowa', 'Missouri']
region_S = ['Delaware', 'Maryland', 'District of Columbia', 'Virginia', 'West Virginia', 'North Carolina', 'South Carolina', 'Georgia', 'Florida', 'Kentucky', 'Tennessee', 'Mississippi', 'Alabama', 'Oklahoma', 'Texas', 'Arkansas', 'Louisiana']
region_W = ['Idaho', 'Montana', 'Wyoming', 'Nevada', 'Utah', 'Colorado', 'Arizona', 'New Mexico', 'Alaska', 'Washington', 'Oregon', 'California', 'Hawaii']

In [11]:
# Assign a region to each row based on which region list contains its state

df_merged_all.loc[df_merged_all['state'].isin(region_NE), 'region'] = 'Northeast'
df_merged_all.loc[df_merged_all['state'].isin(region_MW), 'region'] = 'Midwest'
df_merged_all.loc[df_merged_all['state'].isin(region_S), 'region'] = 'South'
df_merged_all.loc[df_merged_all['state'].isin(region_W), 'region'] = 'West'

In [12]:
# Check 'region' values

df_merged_all['region'].value_counts(dropna = False)

South        10791885
West          8292913
Midwest       7597325
Northeast     5722736
Name: region, dtype: int64
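
As an alternative to four separate `.loc` assignments, the region lists can be inverted into a single state-to-region lookup and applied with `Series.map`. The sketch below uses shortened, illustrative region lists and a made-up frame (`region_lists`, `state_to_region`, `df_demo`, and `unmapped_states` are hypothetical names); in the notebook, the full `region_NE`, `region_MW`, `region_S`, and `region_W` lists would take their place.

```python
import pandas as pd

# Shortened region lists for illustration only
region_lists = {
    'Northeast': ['Maine', 'New York'],
    'Midwest': ['Ohio', 'Iowa'],
    'South': ['Texas', 'Alabama'],
    'West': ['Hawaii', 'Oregon'],
}

# Invert to a state -> region lookup
state_to_region = {
    state: region
    for region, states in region_lists.items()
    for state in states
}

# Made-up states; 'Puerto Rico' is deliberately absent from every list
df_demo = pd.DataFrame({'state': ['Maine', 'Texas', 'Hawaii', 'Ohio', 'Puerto Rico']})
df_demo['region'] = df_demo['state'].map(state_to_region)

# States missing from every list come back as NaN, so they are easy to spot
unmapped_states = df_demo.loc[df_demo['region'].isna(), 'state'].unique().tolist()
print(unmapped_states)
```

With this approach, a misspelled or missing state shows up as `NaN` in `region`, which `value_counts(dropna = False)` would also reveal.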

In [13]:
# Confirm that the new 'region' column is reflected in the dataframe's shape

df_merged_all.shape

(32404859, 33)

## 3.2 Check to see if spending habits vary by U.S. regions

In [14]:
# Create a spending_habits crosstab

crosstab_spending_habits = pd.crosstab(df_merged_all['spending_flag'], df_merged_all['region'], dropna = False)

In [15]:
crosstab_spending_habits

region          Midwest  Northeast     South     West
spending_flag
High spender    7589534    5717129  10781873  8284433
Low spender        7791       5607     10012     8480
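
Raw counts are hard to compare across regions because the regions differ in size. The sketch below rebuilds the crosstab from the counts shown above and converts each column to within-region shares; `counts` and `shares` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Counts copied from the crosstab output above
counts = pd.DataFrame(
    {
        'Midwest': [7589534, 7791],
        'Northeast': [5717129, 5607],
        'South': [10781873, 10012],
        'West': [8284433, 8480],
    },
    index=['High spender', 'Low spender'],
)

# Share of each spending flag within its region (each column sums to 1)
shares = counts / counts.sum(axis=0)

# Display as percentages
print((shares * 100).round(3))
```

On the full dataframe, passing `normalize = 'columns'` to `pd.crosstab` should give the same shares directly. With these counts, low spenders make up roughly 0.1% of rows in every region, so the split looks very similar across regions.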


# 4. Creating an exclusion flag for low-activity customers (customers with fewer than 5 orders) and excluding them from the data
** **

In [16]:
# Creating activity_flag based on the number of orders

df_merged_all.loc[df_merged_all['max_order'] < 5, 'activity_flag'] = 'Low activity'
df_merged_all.loc[df_merged_all['max_order'] >= 5, 'activity_flag'] = 'High activity'
df_merged_all.tail(5)

Unnamed: 0.1,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,Unnamed: 0,...,order_frequency_flag,gender,state,age,date_joined,number_dependents,fam_status,income,region,activity_flag
32404854,156685,106143,26,4,23,5.0,19675,1,1,19676,...,Frequent customer,Male,Hawaii,25,5/26/2017,0,single,53755,West,High activity
32404855,484769,66343,1,6,11,,47210,1,0,47214,...,Non-frequent customer,Female,Tennessee,22,9/12/2017,3,married,46151,South,Low activity
32404856,1561557,66343,2,1,11,30.0,47210,1,1,47214,...,Non-frequent customer,Female,Tennessee,22,9/12/2017,3,married,46151,South,Low activity
32404857,276317,66343,3,6,15,19.0,47210,1,1,47214,...,Non-frequent customer,Female,Tennessee,22,9/12/2017,3,married,46151,South,Low activity
32404858,2922475,66343,4,1,12,30.0,47210,1,1,47214,...,Non-frequent customer,Female,Tennessee,22,9/12/2017,3,married,46151,South,Low activity


In [17]:
# Check the 'activity_flag' column

df_merged_all['activity_flag'].value_counts(dropna = False)

High activity    30964564
Low activity      1440295
Name: activity_flag, dtype: int64
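
The two-way flag can also be built in a single vectorized step with `np.where`. This is a minimal sketch on a made-up frame (`df_demo` is a hypothetical name), not the notebook's own method:

```python
import numpy as np
import pandas as pd

# Toy data: one row per customer, with made-up max_order values
df_demo = pd.DataFrame({'max_order': [1, 4, 5, 12]})

# Fewer than 5 orders -> 'Low activity', otherwise 'High activity'
df_demo['activity_flag'] = np.where(
    df_demo['max_order'] < 5, 'Low activity', 'High activity'
)
print(df_demo)
```

One caveat: `np.where` sends any missing `max_order` value to the "otherwise" branch, whereas the two `.loc` assignments would leave such rows unflagged. Since `max_order` is an `int64` column here, it cannot hold missing values, so the two approaches should agree.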

In [18]:
df_merged_all.dtypes

order_id                     int64
user_id                     object
order_number                 int64
orders_day_of_week           int64
order_hour_of_day            int64
days_since_prior_order     float64
product_id                   int64
add_to_cart_order            int64
reordered                    int64
Unnamed: 0                   int64
product_name                object
aisle_id                     int64
department_id                int64
prices                     float64
_merge                    category
price_range_loc             object
busiest_day                 object
busiest_days                object
busiest_period_of_day       object
max_order                    int64
loyalty_flag                object
avg_price                  float64
spending_flag               object
median_prior_orders        float64
order_frequency_flag        object
gender                      object
state                       object
age                          int64
date_joined         

## 4.1 Excluding low-activity customers from the data and exporting the subset

In [19]:
# Create a subset that excludes customers with low activity

df_high = df_merged_all[df_merged_all['activity_flag'] == 'High activity']

In [20]:
df_high.head()

Unnamed: 0.1,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,Unnamed: 0,...,order_frequency_flag,gender,state,age,date_joined,number_dependents,fam_status,income,region,activity_flag
0,2539329,1,1,2,8,,196,1,0,195,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,High activity
1,2398795,1,2,3,7,15.0,196,1,1,195,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,High activity
2,473747,1,3,3,12,21.0,196,1,1,195,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,High activity
3,2254736,1,4,4,7,29.0,196,1,1,195,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,High activity
4,431534,1,5,4,15,28.0,196,1,1,195,...,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423,South,High activity


In [21]:
# Export df_high in .pkl format 

df_high.to_pickle(os.path.join(path, '02_Data', 'Prepared_Data', 'high_activity_customers.pkl'))

## 4.2 Excluding high-activity customers from the data and exporting the subset

In [22]:
# Create a subset that excludes customers with high activity

df_low = df_merged_all[df_merged_all['activity_flag'] == 'Low activity']

In [23]:
df_low.head()

Unnamed: 0.1,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,Unnamed: 0,...,order_frequency_flag,gender,state,age,date_joined,number_dependents,fam_status,income,region,activity_flag
1510,520620,120,1,3,11,,196,2,0,195,...,Regular customer,Female,Kentucky,54,3/2/2017,2,married,99219,South,Low activity
1511,3273029,120,3,2,8,19.0,196,2,1,195,...,Regular customer,Female,Kentucky,54,3/2/2017,2,married,99219,South,Low activity
1512,520620,120,1,3,11,,46149,1,0,46153,...,Regular customer,Female,Kentucky,54,3/2/2017,2,married,99219,South,Low activity
1513,3273029,120,3,2,8,19.0,46149,1,1,46153,...,Regular customer,Female,Kentucky,54,3/2/2017,2,married,99219,South,Low activity
1514,520620,120,1,3,11,,26348,3,0,26349,...,Regular customer,Female,Kentucky,54,3/2/2017,2,married,99219,South,Low activity


In [24]:
# Export df_low in .pkl format 

df_low.to_pickle(os.path.join(path, '02_Data', 'Prepared_Data', 'low_activity_customers.pkl'))
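
Before relying on the exported subsets, it may be worth confirming that the two of them partition the original data: no row is lost and no row appears in both. The sketch below shows such a check on a made-up frame (`df_demo`, `df_high_demo`, and `df_low_demo` are hypothetical names); the same assertions would apply to `df_high`, `df_low`, and `df_merged_all`.

```python
import pandas as pd

# Toy frame standing in for df_merged_all (values are made up)
df_demo = pd.DataFrame({
    'user_id': ['1', '1', '120', '120', '7'],
    'activity_flag': ['High activity', 'High activity',
                      'Low activity', 'Low activity', 'High activity'],
})

df_high_demo = df_demo[df_demo['activity_flag'] == 'High activity']
df_low_demo = df_demo[df_demo['activity_flag'] == 'Low activity']

# Every row lands in exactly one subset: sizes add up and indexes do not overlap
assert len(df_high_demo) + len(df_low_demo) == len(df_demo)
assert df_high_demo.index.intersection(df_low_demo.index).empty

print(len(df_high_demo), len(df_low_demo))
```

In the notebook, the `value_counts` output above (30964564 high-activity and 1440295 low-activity rows) already sums to the 32404859 rows reported by `shape`, which is consistent with a clean partition.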