# Exercise 4.9 Intro to Data Visualization with Python (Part 1)

## This script contains the following:
1. Importing Libraries and Data Files
2. Data Wrangling the customers.csv Data
3. Consistency Checks on the customers.csv Data
4. Converting Data Types for Optimal Performance
5. Merging Dataframes
6. Export Dataframes

# 1. Importing Libraries and Data Files

#### Import your analysis libraries, as well as your new customer data set as a dataframe.

In [1]:
# Import the necessary libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Create a string of the path for the main project folder
path = r'C:\Users\Ryan\Documents\07-17-2023 Instacart Basket Analysis'

In [3]:
# Import the “customers.csv” data set using the os library
df_custs = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

In [4]:
# Check the dimensions
df_custs.shape

(206209, 10)

In [5]:
# Check the output
df_custs.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


# 2. Data Wrangling the customers.csv Data

#### Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.

In [6]:
# Drop 'Gender' and 'date_joined' columns from df_custs
df_custs = df_custs.drop(columns = ['Gender', 'date_joined'])

In [7]:
# Rename columns in customers dataframe to appropriate naming conventions
df_custs.rename(columns = {'First Name': 'first_name',
                           'Surnam': 'last_name',
                           'STATE': 'state',
                           'Age': 'age',
                           'n_dependants': 'number_of_dependants',
                           'fam_status': 'marital_status'},
                inplace = True)

In [8]:
# Check the output
df_custs.head()

Unnamed: 0,user_id,first_name,last_name,state,age,number_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Missouri,48,3,married,165665
1,33890,Patricia,Hart,New Mexico,36,0,single,59285
2,65803,Kenneth,Farley,Idaho,35,2,married,99568
3,125935,Michelle,Hicks,Iowa,40,0,single,42049
4,130797,Ann,Gilmore,Maryland,26,1,married,40374


In [9]:
# Check the dimensions
df_custs.shape

(206209, 8)
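The drop-and-rename step can be sketched on a one-row toy frame (hypothetical data, not the customers file). By default `rename` silently ignores keys that do not match a column, so a misspelled source name would go unnoticed. Passing `errors='raise'` is one way to surface that; it is shown here as an optional safeguard, not as part of the original exercise.

```python
import pandas as pd

# Toy frame that mimics a few of the original column names (hypothetical row)
toy = pd.DataFrame({'First Name': ['Ann'], 'Surnam': ['Gilmore'], 'Gender': ['Female']})

# Drop an unneeded column, then rename; errors='raise' fails loudly on unknown keys
cleaned = (
    toy.drop(columns=['Gender'])
       .rename(columns={'First Name': 'first_name', 'Surnam': 'last_name'},
               errors='raise')
)

# 'Surname' is not a column in the toy frame, so this rename raises KeyError
try:
    toy.rename(columns={'Surname': 'last_name'}, errors='raise')
    typo_caught = False
except KeyError:
    typo_caught = True
```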

# 3. Consistency Checks on the customers.csv Data

#### Complete the fundamental data quality and consistency checks you’ve learned throughout this Achievement; for example, check for and address missing values and duplicates, and convert any mixed-type data.

#### Convert any mixed-type data

In [10]:
# Check for mixed types
for col in df_custs.columns.tolist():
    weird = (df_custs[[col]].applymap(type) != df_custs[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_custs[weird]) > 0:
        print(col)

first_name


In [11]:
# Change the data type of 'first_name' column to string
df_custs['first_name'] = df_custs['first_name'].astype('str')
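A minimal sketch of the same idea on a toy column (hypothetical data, not the customers file). A column of names usually shows up as mixed because missing entries are stored as float NaN next to the strings. In pandas versions where `astype('str')` returns an object column, as the dtypes output in this notebook indicates, those NaN values become the text 'nan', so it can help to count missing values before the cast.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical): two strings and one missing value stored as float NaN
toy = pd.DataFrame({'first_name': ['Ann', np.nan, 'Lisa']})

# A column is mixed when its values have more than one Python type
n_types = toy['first_name'].map(type).nunique()
is_mixed = n_types > 1

# Count missing values BEFORE casting, while NaN is still recognisable as missing
missing_before_cast = int(toy['first_name'].isnull().sum())

# The cast removes the type mix; how NaN is represented afterwards depends on
# the pandas version, which is why the count above is taken first
as_text = toy['first_name'].astype('str')
```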

#### Address missing values, if any

In [12]:
# Find any missing values in df_custs
df_custs.isnull().sum()

user_id                 0
first_name              0
last_name               0
state                   0
age                     0
number_of_dependants    0
marital_status          0
income                  0
dtype: int64

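isnull() reports no missing values. Because first_name was cast with astype('str') in the previous step, any NaN in that column would now be the text 'nan' and would not be counted here.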

#### Address duplicate values

In [13]:
# View dataframe that contains full duplicates from df_custs
df_custs[df_custs.duplicated()]

Unnamed: 0,user_id,first_name,last_name,state,age,number_of_dependants,marital_status,income


There are no full-row duplicates in df_custs.
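As a sketch on toy data (hypothetical rows), `duplicated()` with no arguments flags only rows that repeat in every column. Since user_id is later used as the merge key, checking that column on its own with `subset` may also be worthwhile; that extra check is an assumption about what is useful here, not something the exercise asks for.

```python
import pandas as pd

# Toy frame: row 2 repeats row 1 exactly; row 4 repeats only the user_id of row 3
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3, 3],
    'state': ['Iowa', 'Idaho', 'Idaho', 'Iowa', 'Texas'],
})

# Rows that are identical to an earlier row in every column
full_dupes = toy[toy.duplicated()]

# Rows whose user_id already appeared, regardless of the other columns
key_dupes = toy[toy.duplicated(subset='user_id')]

# Removing full duplicates keeps the first occurrence of each row
deduped = toy.drop_duplicates()
```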

#### Consistency checks

In [14]:
# Check statistics of df_custs
df_custs.describe()

Unnamed: 0,user_id,age,number_of_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


The max income (593,901) is far above the 75th percentile (124,244). Check the incomes greater than the 75th percentile.

In [15]:
# Check for income > 124244
df_custs.loc[df_custs['income'] > 124244]

Unnamed: 0,user_id,first_name,last_name,state,age,number_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Missouri,48,3,married,165665
8,69965,Jeremy,Vang,Texas,47,1,married,162432
14,516,Peter,Hunt,Colorado,51,2,married,146559
21,37744,Wanda,Salas,Arkansas,77,1,married,125977
24,57549,Shawn,Moore,Montana,72,0,divorced/widowed,135302
...,...,...,...,...,...,...,...,...
206196,139950,Gloria,Murray,Colorado,45,2,married,150954
206197,74598,Christopher,Velazquez,Minnesota,52,0,single,140700
206199,179673,Adam,Villanueva,Wyoming,77,0,divorced/widowed,162239
206204,168073,Lisa,Case,North Carolina,44,1,married,148828


There are 51,511 customers who have an income greater than 124,244, so the max income does not appear to be a mistake.
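The threshold can also be computed instead of typed in by hand. The sketch below uses toy incomes (hypothetical values). It also applies a 1.5 x IQR upper fence, a common rule of thumb that is not part of the original check, because roughly a quarter of rows sit above the 75th percentile by definition and a tighter cutoff isolates the extreme values.

```python
import pandas as pd

# Toy incomes (hypothetical), with one very large value
toy = pd.DataFrame({'income': [25000, 40000, 60000, 90000, 120000, 590000]})

# Quartiles computed from the data rather than hard-coded
q1 = toy['income'].quantile(0.25)
q3 = toy['income'].quantile(0.75)

# Same kind of filter as the notebook cell: everything above the 75th percentile
above_q3 = toy.loc[toy['income'] > q3]

# Rule-of-thumb fence: values beyond q3 + 1.5 * IQR are flagged for a closer look
upper_fence = q3 + 1.5 * (q3 - q1)
flagged = toy.loc[toy['income'] > upper_fence]
```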

In [16]:
# Check 'marital_status' column values
df_custs['marital_status'].value_counts(dropna = False)

married                             144906
single                               33962
divorced/widowed                     17640
living with parents and siblings      9701
Name: marital_status, dtype: int64

In [17]:
# Check 'state' column values
df_custs['state'].value_counts(dropna = False)

Florida                 4044
Colorado                4044
Illinois                4044
Alabama                 4044
District of Columbia    4044
Hawaii                  4044
Arizona                 4044
Connecticut             4044
California              4044
Indiana                 4044
Arkansas                4044
Alaska                  4044
Delaware                4044
Iowa                    4044
Idaho                   4044
Georgia                 4044
Wyoming                 4043
Mississippi             4043
Oklahoma                4043
Utah                    4043
New Hampshire           4043
Kentucky                4043
Maryland                4043
Rhode Island            4043
Massachusetts           4043
Michigan                4043
New Jersey              4043
Kansas                  4043
South Dakota            4043
Minnesota               4043
Tennessee               4043
New York                4043
Washington              4043
Louisiana               4043
Montana       

# 4. Converting Data Types for Optimal Performance

#### Use more optimal data types

In [18]:
# Check the data types in df_custs
df_custs.dtypes

user_id                  int64
first_name              object
last_name               object
state                   object
age                      int64
number_of_dependants     int64
marital_status          object
income                   int64
dtype: object

In [19]:
# Change the data type of 'user_id' to 32-bit unsigned integer
df_custs['user_id'] = df_custs['user_id'].astype('uint32')

In [20]:
# Change the data type of 'age' to 8-bit unsigned integer
df_custs['age'] = df_custs['age'].astype('uint8')

In [21]:
# Change the data type of 'number_of_dependants' to 8-bit unsigned integer
df_custs['number_of_dependants'] = df_custs['number_of_dependants'].astype('uint8')

In [22]:
# Change the data type of 'income' to 32-bit unsigned integer
df_custs['income'] = df_custs['income'].astype('uint32')

In [23]:
# Check the data types in df_custs
df_custs.dtypes

user_id                 uint32
first_name              object
last_name               object
state                   object
age                      uint8
number_of_dependants     uint8
marital_status          object
income                  uint32
dtype: object
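Before downcasting, it may be worth confirming that each column's range fits the target type, since an unsigned type that is too small would not hold the larger values. A minimal sketch, assuming a hypothetical helper named `fits` and toy values taken from the min and max rows of the describe() output above:

```python
import numpy as np
import pandas as pd

# Toy frame using the min, median and max values reported by describe()
toy = pd.DataFrame({'age': [18, 49, 81], 'income': [25903, 93547, 593901]})

def fits(series, int_type):
    """Hypothetical helper: True if every value lies within int_type's range."""
    info = np.iinfo(int_type)
    return bool(series.min() >= info.min and series.max() <= info.max)

age_fits_uint8 = fits(toy['age'], np.uint8)            # uint8 holds 0..255
income_fits_uint16 = fits(toy['income'], np.uint16)    # uint16 holds 0..65535
income_fits_uint32 = fits(toy['income'], np.uint32)

# Compare memory use of the numeric columns before and after the conversion
bytes_before = int(toy.memory_usage(index=False).sum())
toy['age'] = toy['age'].astype('uint8')
toy['income'] = toy['income'].astype('uint32')
bytes_after = int(toy.memory_usage(index=False).sum())
```

Here the income column would not fit in 16 bits, which is consistent with the notebook's choice of uint32 for it.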

#### Combine your customer data with the rest of your prepared Instacart data. (Hint: Make sure the key columns are the same data type!)

In [24]:
# Import the “orders_products_merged.pkl” data set using the os library
ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_merged.pkl'))

In [25]:
# Check the output
ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,price_range,busiest_days,busiest_period_of_day,max_order,loyalty_label,mean_spent,spender_label,median_days_since_prior_order,customer_frequency_level
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Mid-range product,Regularly busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Mid-range product,Least busy,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


In [26]:
# Check the dimensions
ords_prods_merge.shape

(32404859, 22)

In [27]:
# Check the data types of ords_prods_merge
ords_prods_merge.dtypes

order_id                          uint32
user_id                           uint32
order_number                       uint8
orders_day_of_week                 uint8
order_hour_of_day                  uint8
days_since_prior_order           float32
product_id                        uint16
add_to_cart_order                  uint8
reordered                          uint8
product_name                      object
aisle_id                           uint8
department_id                      uint8
prices                           float32
price_range                       object
busiest_days                      object
busiest_period_of_day             object
max_order                          uint8
loyalty_label                     object
mean_spent                       float32
spender_label                     object
median_days_since_prior_order    float32
customer_frequency_level          object
dtype: object

#### Consistency check

In [28]:
# Check statistics of ords_prods_merge
ords_prods_merge.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,aisle_id,department_id,prices,max_order,mean_spent,median_days_since_prior_order
count,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,30328760.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32399730.0,32404860.0,32404860.0,32404850.0
mean,1710745.0,102937.2,17.1423,2.738867,13.42515,11.10408,25598.66,8.352547,0.5895873,71.19612,9.919792,7.790978,33.05217,7.790991,10.39776
std,987298.8,59466.1,17.53532,2.090077,4.24638,8.541397,14084.0,7.127071,0.4919087,38.21139,6.281485,4.12004,25.15525,1.091535,6.894948
min,2.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,855947.0,51422.0,5.0,1.0,10.0,5.0,13544.0,3.0,0.0,31.0,4.0,4.2,13.0,7.378488,6.0
50%,1711049.0,102616.0,11.0,3.0,13.0,8.0,25302.0,6.0,1.0,83.0,9.0,7.4,26.0,7.811941,8.0
75%,2565499.0,154389.0,24.0,5.0,16.0,15.0,37947.0,11.0,1.0,107.0,16.0,11.3,47.0,8.229327,13.0
max,3421083.0,206209.0,99.0,6.0,23.0,30.0,49688.0,145.0,1.0,134.0,21.0,25.0,99.0,23.2,30.0


Consistency check looks good!

# 5. Merging Dataframes

#### Merge ords_prods_merge dataframe with df_custs dataframe

The user_id columns in ords_prods_merge and df_custs have the same data type, 'uint32'.

In [29]:
# Merge df_custs and ords_prods_merge
ords_prods_all = ords_prods_merge.merge(df_custs, on = 'user_id', how = 'left', indicator = True)

In [30]:
# Check the output
ords_prods_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,median_days_since_prior_order,customer_frequency_level,first_name,last_name,state,age,number_of_dependants,marital_status,income,_merge
0,2539329,1,1,2,8,,196,1,0,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Alabama,31,3,married,40423,both
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Alabama,31,3,married,40423,both
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Alabama,31,3,married,40423,both
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Alabama,31,3,married,40423,both
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Alabama,31,3,married,40423,both


In [31]:
# Obtain frequencies of '_merge' column
ords_prods_all['_merge'].value_counts(dropna = False)

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64
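How the '_merge' frequencies behave can be sketched on two toy frames (hypothetical data). With how = 'left', 'right_only' can never appear, so the check mainly shows whether any orders lack a matching customer ('left_only'). The `validate='m:1'` argument is an optional extra, not used in the notebook, that raises an error if user_id is not unique on the customer side.

```python
import pandas as pd

# Toy stand-ins for the two dataframes; user 4 has no customer record
orders = pd.DataFrame({'user_id': [1, 1, 2, 4], 'order_id': [10, 11, 12, 13]})
custs = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Alabama', 'Iowa', 'Idaho']})
orders['user_id'] = orders['user_id'].astype('uint32')
custs['user_id'] = custs['user_id'].astype('uint32')

# Key columns should share a dtype before merging
same_key_dtype = orders['user_id'].dtype == custs['user_id'].dtype

# Left merge keeps every order row; indicator=True adds the '_merge' column
merged = orders.merge(custs, on='user_id', how='left',
                      indicator=True, validate='m:1')

# '_merge' is categorical, so unused categories are still listed with a count of 0
counts = merged['_merge'].value_counts()
```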

In [32]:
# Drop '_merge' column from ords_prods_all
ords_prods_all = ords_prods_all.drop(columns = ['_merge'])

In [33]:
# Check the dimensions
ords_prods_all.shape

(32404859, 29)

# 6. Export Dataframes

#### Export this new dataframe as a pickle file so you can continue to use it in the second part of this task.

In [34]:
# Export ords_prods_all dataframe as "orders_products_all.pkl"
ords_prods_all.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_products_all.pkl'))
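One reason to export as pickle is that the compact data types set earlier are kept when the file is read back. A small round-trip sketch on a toy frame, written to a temporary folder (the file name 'toy.pkl' is hypothetical):

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same kind of downcast dtypes used in this notebook
toy = pd.DataFrame({
    'user_id': pd.Series([1, 2], dtype='uint32'),
    'age': pd.Series([31, 48], dtype='uint8'),
})

# Write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy.pkl')
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# Both the values and the dtypes should match the original frame
dtypes_preserved = restored.dtypes.equals(toy.dtypes)
values_preserved = restored.equals(toy)
```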