# 4.9 IC Intro to Data Visualization with Python Task

# Table of Contents
1. Import libraries
2. Import data
3. Customer data wrangling
4. Combine customer data with existing data set
5. Export

## Part 1

### 1.0 Import Libraries

In [5]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

### 2.0 Import Data

In [7]:
# Define project folder path
path = r'/Users/sharnti/Desktop/CareerFoundry/Data Immersion/Achievement 4/Instacart Basket Analysis'

In [8]:
# Import customer dataframe
customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

In [9]:
# Check dimensions of import
customers.shape

(206209, 10)

There are 206,209 rows and 10 columns.

### 3.0 Customer data wrangling

In [12]:
# View descriptive statistics 
customers.describe()

Unnamed: 0,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


The minimum and maximum values look plausible for all columns, and the counts are identical across columns, which is a good sign (no numeric column has missing values).
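The two checks described here (matching counts, plausible minimums) can also be written as code. A minimal sketch on a small invented frame; the toy values and the "not negative" rule are assumptions for illustration, not taken from the Instacart data:

```python
import pandas as pd

# Toy stand-in for the customers dataframe (values are invented)
toy = pd.DataFrame({
    "Age": [25, 40, 60],
    "n_dependants": [0, 1, 3],
    "income": [30000, 55000, 120000],
})

stats = toy.describe()

# describe() counts only non-missing values, so identical counts
# across columns mean no numeric column has gaps
counts_equal = stats.loc["count"].nunique() == 1

# None of these columns should ever have a negative minimum
mins_non_negative = bool((stats.loc["min"] >= 0).all())

print(counts_equal, mins_non_negative)
```

On the real data, the same two expressions could be run against `customers.describe()`.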

In [14]:
# Check sample of first few rows of the dataset
customers.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [15]:
# Check sample of last few rows of the dataset
customers.tail()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
206204,168073,Lisa,Case,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Jeremy,Robbins,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Doris,Richmond,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Rose,Rollins,Female,California,27,4/1/2020,1,married,99799
206208,80148,Cynthia,Noble,Female,New York,55,4/1/2020,1,married,57095


From the above, we can see that several columns need to be renamed for readability and consistency:
- Rename __First Name__ to __first_name__
- Rename __Surnam__ to __last_name__
- Rename __Gender__ to __gender__
- Rename __STATE__ to __state__
- Rename __Age__ to __age__
- Rename __n_dependants__ to __num_dependants__

It's unclear what the fam_status column is capturing. It could simply be marital status. I will check with a frequency table before deciding whether to rename it.

In [17]:
# Create a frequency table for the values of the fam_status column to see if this can be renamed to something more suitable.
customers['fam_status'].value_counts(dropna = False)

fam_status
married                             144906
single                               33962
divorced/widowed                     17640
living with parents and siblings      9701
Name: count, dtype: int64

The fam_status column appears to capture a mixture of marital status and household composition. I can't think of a better name for this column, so it will remain as is. I'm also unsure whether this information is necessary for my analysis, but I will keep it for the time being.
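To judge how much each category matters, it may help to look at shares rather than raw counts. A small sketch that reuses the four counts printed in the frequency table; the variable names are made up for illustration. On the real column, `value_counts(normalize = True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the fam_status frequency table
fam_counts = pd.Series({
    "married": 144906,
    "single": 33962,
    "divorced/widowed": 17640,
    "living with parents and siblings": 9701,
})

# Share of customers in each category, as a percentage
fam_share = (fam_counts / fam_counts.sum() * 100).round(1)
print(fam_share)
```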

In [19]:
# Rename columns
customers.rename(columns = {'First Name' : 'first_name',
                            'Surnam' : 'last_name',
                            'Gender' : 'gender',
                            'STATE' : 'state',
                            'Age' : 'age',
                            'n_dependants' : 'num_dependants'}, inplace = True)

In [20]:
# Check renaming worked
customers.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,num_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


### 3.1 Customer data quality and consistency checks

#### 3.1.1 Mixed data types

In [23]:
# Check data types
customers.dtypes

user_id            int64
first_name        object
last_name         object
gender            object
state             object
age                int64
date_joined       object
num_dependants     int64
fam_status        object
income             int64
dtype: object

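In pandas, _object_ is the catch-all dtype for columns that are not purely numeric (most often strings), so an _object_ column can hold values of more than one type. I'll check specifically for mixed-type columns.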
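As a small illustration of why an _object_ column needs a closer look, the sketch below builds an invented column that holds a string, an integer and a missing value. The column-level dtype is still reported as object, while the individual elements have different Python types:

```python
import numpy as np
import pandas as pd

# Toy column holding a string, an integer and a missing value
mixed = pd.Series(["Deborah", 48, np.nan], dtype=object)

# The dtype describes the column as a whole
print(mixed.dtype)

# The Python type of each individual element
element_types = mixed.map(type).tolist()
print(element_types)
```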

In [25]:
# Check for mixed types
for col in customers.columns.tolist():
    # Flag rows whose Python type differs from the type of the column's first value
    weird = (customers[[col]].map(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(customers[weird]) > 0:
        print(col)

first_name


It appears that the __first_name__ column has mixed-type data within it. We will convert this to string (as this is the most appropriate data type for names).
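Before converting, it can be useful to see which types are actually mixed together, since that hints at the cause. For example, float values among strings are often missing values stored as NaN. A minimal sketch on an invented column (`toy_names` is a hypothetical stand-in, not the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a name column with one missing entry
toy_names = pd.Series(["Deborah", np.nan, "Ann", "Rose"], dtype=object)

# Count how many values of each Python type the column holds
type_counts = toy_names.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

Applied to `customers['first_name']`, the same expression would show whether the odd values are numbers, NaN, or something else.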

In [27]:
# Convert first_name from mixed types to string
customers['first_name'] = customers['first_name'].astype('str')

In [28]:
# Check again for mixed types (no output means the conversion has worked)
for col in customers.columns.tolist():
    # Flag rows whose Python type differs from the type of the column's first value
    weird = (customers[[col]].map(type) != customers[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(customers[weird]) > 0:
        print(col)

In [29]:
# As we don't want the user_id included in any descriptive statistics, I will also convert this to string.
customers['user_id'] = customers['user_id'].astype('str')

In [30]:
# Check all conversions have worked
customers.dtypes

user_id           object
first_name        object
last_name         object
gender            object
state             object
age                int64
date_joined       object
num_dependants     int64
fam_status        object
income             int64
dtype: object

#### 3.1.2 Missing values

In [32]:
# Check customers for missing values across all columns
customers.isnull().sum()

user_id           0
first_name        0
last_name         0
gender            0
state             0
age               0
date_joined       0
num_dependants    0
fam_status        0
income            0
dtype: int64

No missing values in any columns.
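One caveat, offered as an assumption rather than something the outputs here confirm: depending on the pandas version, the earlier `astype('str')` call on __first_name__ may have turned any missing names into the literal text 'nan', which `isnull()` does not count as missing. A minimal sketch on invented data of how such placeholders could be counted:

```python
import pandas as pd

# Toy column where a missing name has already become the text 'nan'
toy_names = pd.Series(["Ann", "nan", "Rose"])

# isnull() sees no gaps, because 'nan' is an ordinary string
missing_flagged = int(toy_names.isnull().sum())

# Counting the placeholder text directly reveals the hidden gap
placeholder_count = int((toy_names == "nan").sum())

print(missing_flagged, placeholder_count)
```

Running the second expression against `customers['first_name']` would show whether this applies to the real data.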

#### 3.1.3 Duplicates

In [35]:
# Create a new subset of customers (dups) containing only rows that are full duplicates
dups = customers[customers.duplicated()]

In [36]:
# View dups
dups

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,num_dependants,fam_status,income


No duplicate rows found.
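Note that `duplicated()` with no arguments only flags rows that match on every column. A stricter check, not part of the original task, would also look for repeated user_id values, since the merge later relies on one row per customer. A sketch on invented data:

```python
import pandas as pd

# Toy frame: user_id '2' appears twice, but with different names
toy = pd.DataFrame({
    "user_id": ["1", "2", "2"],
    "first_name": ["Ann", "Rose", "Lisa"],
})

# Rows identical across every column (none here)
full_row_dups = toy[toy.duplicated()]

# Rows sharing a user_id with another row (keep=False marks all copies)
same_id_dups = toy[toy.duplicated(subset=["user_id"], keep=False)]

print(len(full_row_dups), len(same_id_dups))
```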

In [38]:
# Check descriptive statistics again to ensure everything looks normal
customers.describe()

Unnamed: 0,age,num_dependants,income
count,206209.0,206209.0,206209.0
mean,49.501646,1.499823,94632.852548
std,18.480962,1.118433,42473.786988
min,18.0,0.0,25903.0
25%,33.0,0.0,59874.0
50%,49.0,1.0,93547.0
75%,66.0,3.0,124244.0
max,81.0,3.0,593901.0


### 4.0 Combine customer data with existing data set

In [40]:
# Import orders_products_combined dataframe
ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged_grouped.pkl'))

In [41]:
# Check shape of import
ords_prods_merge.shape

(32404859, 26)

In [42]:
# Check column names
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price,spender_flag,median_days_since_prior_order,frequency_flag
0,2539329,1,1,2,8,,True,196,1,0,...,Mid-range product,Regularly busy,Regularly busy days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2539329,1,1,2,8,,True,14084,2,0,...,Mid-range product,Regularly busy,Regularly busy days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,2539329,1,1,2,8,,True,12427,3,0,...,Low-range product,Regularly busy,Regularly busy days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2539329,1,1,2,8,,True,26088,4,0,...,Low-range product,Regularly busy,Regularly busy days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,2539329,1,1,2,8,,True,26405,5,0,...,Low-range product,Regularly busy,Regularly busy days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


In [43]:
# Check data types of columns
ords_prods_merge.dtypes

order_id                            int64
user_id                             int64
order_number                        int64
orders_day_of_week                  int64
order_hour_of_day                   int64
days_since_prior_order            float64
first_order                          bool
product_id                          int64
add_to_cart_order                   int64
reordered                           int64
_merge                           category
product_name                       object
aisle_id                            int64
department_id                       int64
prices                            float64
match                            category
price_range_loc                    object
busiest_day                        object
busiest_days                       object
busiest_period_of_day              object
max_order                           int64
loyalty_flag                       object
avg_price                         float64
spender_flag                      

__user_id__ is the most sensible key for combining these dataframes. However, its data type differs between them: it is now a string (object) in customers but int64 in ords_prods_merge. I will convert user_id in ords_prods_merge to string as well.
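The reason the key types need to agree can be sketched on two tiny invented frames. pandas typically refuses to merge an integer key against a string key, so the key is cast first and the merge then matches as expected. All names and values below are hypothetical:

```python
import pandas as pd

# Orders keyed by an integer user_id, customers keyed by a string user_id
toy_orders = pd.DataFrame({"user_id": [1, 1, 2], "order_id": [10, 11, 12]})
toy_customers = pd.DataFrame({"user_id": ["1", "2"], "state": ["Alabama", "Iowa"]})

# Align the key dtype before merging
toy_orders["user_id"] = toy_orders["user_id"].astype("str")

# Every order row picks up its customer's state
toy_merged = toy_orders.merge(toy_customers, on="user_id")
print(toy_merged)
```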

In [45]:
# Convert user_id from ords_prods_merge to string
ords_prods_merge['user_id'] = ords_prods_merge['user_id'].astype('str')

In [46]:
# Check user_id is now 'object' (string) data type.
ords_prods_merge.dtypes

order_id                            int64
user_id                            object
order_number                        int64
orders_day_of_week                  int64
order_hour_of_day                   int64
days_since_prior_order            float64
first_order                          bool
product_id                          int64
add_to_cart_order                   int64
reordered                           int64
_merge                           category
product_name                       object
aisle_id                            int64
department_id                       int64
prices                            float64
match                            category
price_range_loc                    object
busiest_day                        object
busiest_days                       object
busiest_period_of_day              object
max_order                           int64
loyalty_flag                       object
avg_price                         float64
spender_flag                      

In [47]:
# Drop the existing '_merge' column from previous exercises
ords_prods_merge = ords_prods_merge.drop(columns=['_merge'])

In [48]:
# Check the drop worked
ords_prods_merge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32404859 entries, 0 to 32404858
Data columns (total 25 columns):
 #   Column                         Dtype   
---  ------                         -----   
 0   order_id                       int64   
 1   user_id                        object  
 2   order_number                   int64   
 3   orders_day_of_week             int64   
 4   order_hour_of_day              int64   
 5   days_since_prior_order         float64 
 6   first_order                    bool    
 7   product_id                     int64   
 8   add_to_cart_order              int64   
 9   reordered                      int64   
 10  product_name                   object  
 11  aisle_id                       int64   
 12  department_id                  int64   
 13  prices                         float64 
 14  match                          category
 15  price_range_loc                object  
 16  busiest_day                    object  
 17  busiest_days             

In [49]:
# Drop the existing 'match' column from previous exercises
ords_prods_merge = ords_prods_merge.drop(columns=['match'])

In [50]:
# Check the drop worked
ords_prods_merge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32404859 entries, 0 to 32404858
Data columns (total 24 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   order_id                       int64  
 1   user_id                        object 
 2   order_number                   int64  
 3   orders_day_of_week             int64  
 4   order_hour_of_day              int64  
 5   days_since_prior_order         float64
 6   first_order                    bool   
 7   product_id                     int64  
 8   add_to_cart_order              int64  
 9   reordered                      int64  
 10  product_name                   object 
 11  aisle_id                       int64  
 12  department_id                  int64  
 13  prices                         float64
 14  price_range_loc                object 
 15  busiest_day                    object 
 16  busiest_days                   object 
 17  busiest_period_of_day          object 
 18  

In [51]:
# Check the shape of customers before merge
customers.shape

(206209, 10)

In [52]:
# Check the shape of ords_prods_merge before merge
ords_prods_merge.shape

(32404859, 24)

In [53]:
# Merge customers with ords_prods_merge
custs_ords_prods = ords_prods_merge.merge(customers, on = 'user_id', indicator = True)

In [54]:
# Check the output of custs_ords_prods
custs_ords_prods.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order,product_id,add_to_cart_order,reordered,...,first_name,last_name,gender,state,age,date_joined,num_dependants,fam_status,income,_merge
0,2539329,1,1,2,8,,True,196,1,0,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
1,2539329,1,1,2,8,,True,14084,2,0,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
2,2539329,1,1,2,8,,True,12427,3,0,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
3,2539329,1,1,2,8,,True,26088,4,0,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both
4,2539329,1,1,2,8,,True,26405,5,0,...,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423,both


In [55]:
# Check for a full match
custs_ords_prods['_merge'].value_counts()

_merge
both          32404859
left_only            0
right_only           0
Name: count, dtype: int64
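A note on reading this check: `merge` defaults to an inner join, which keeps only keys found in both dataframes, so the indicator can only ever read 'both'. Here the unchanged row count (32,404,859 before and after the merge) is what shows that no rows were dropped. An outer join would list unmatched keys explicitly, as this sketch on invented data suggests:

```python
import pandas as pd

# User '3' exists only on the left, user '4' only on the right
left = pd.DataFrame({"user_id": ["1", "2", "3"], "order_id": [10, 11, 12]})
right = pd.DataFrame({"user_id": ["1", "2", "4"], "state": ["Alabama", "Iowa", "Ohio"]})

# Default inner join: unmatched keys are silently dropped
inner = left.merge(right, on="user_id", indicator=True)

# Outer join: unmatched keys are kept and labelled
outer = left.merge(right, on="user_id", how="outer", indicator=True)

inner_flags = inner["_merge"].value_counts()
outer_flags = outer["_merge"].value_counts()
print(inner_flags)
print(outer_flags)
```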

In [56]:
# Check shape
custs_ords_prods.shape

(32404859, 34)

### 5.0 Export

In [58]:
# Export data to pkl
custs_ords_prods.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'custs_ords_prods_combined.pkl'))
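
If it is worth confirming that an export can be read back unchanged, a round trip can be tested on a small frame first. A minimal sketch using a temporary folder and invented data, so nothing in the project folder is touched:

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the combined dataframe
toy = pd.DataFrame({"user_id": ["1", "2"], "income": [40423, 59285]})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(toy_path)             # write
    restored = pd.read_pickle(toy_path)  # read back

# equals() compares values and dtypes of the two frames
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```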