# [Instacart](https://www.instacart.com/store) Grocery Basket Data Analysis: Data Preparation

## Table of Contents

### 1. [Import libraries](#Import_libraries)

### 2. [Import data](#Import_data)

### 3. [Data wrangling and cleaning](#Data_wrangling_and_cleaning)

- [Wrangling and cleaning departments dataframe](#Departments)
- [Wrangling and cleaning orders dataframe](#Orders)
 - [Dropping columns](#Orders_dropping)
 - [Renaming columns](#Orders_renaming)
 - [Missing values check](#Orders_missing)
 - [Duplicates check](#Orders_duplicates)
 - [Accuracy check](#Orders_accuracy)
- [Wrangling and cleaning products dataframe](#Products)
 - [Missing values check](#Products_missing)
 - [Duplicates check](#Products_duplicates)
 - [Accuracy check](#Products_accuracy)
- [Wrangling and cleaning orders prior dataframe](#Orders_prior)
 - [Missing values check](#Prior_missing)
 - [Duplicates check](#Prior_duplicates)
- [Wrangling and cleaning customers dataframe](#Customers) (**Note:** This data set was fabricated in service of learning)
 - [Dropping columns](#Customers_dropping)
 - [Renaming columns](#Customers_renaming)
 - [Missing values check](#Customers_missing)
 - [Duplicates check](#Customers_duplicates)
 - [Accuracy check](#Customers_accuracy)
 
### 4. [Converting data types for optimal performance](#Data_types)

- [Convert data types in orders dataframe](#Orders_convert)
- [Convert data types in orders prior dataframe](#Prior_convert)
- [Convert data types in products dataframe](#Products_convert)
- [Convert data types in customers dataframe](#Customers_convert)

### 5. [Merging dataframes](#Merging_dataframes)

- [Merge orders dataframe with orders prior dataframe](#Merge1)
- [Merge orders products combined dataframe with products dataframe](#Merge2)
- [Merge orders products merged dataframe with customers dataframe](#Merge3)

### 6. [Export data](#Export_data)

<a id='Import_libraries'></a>
# 1. Import libraries

In [1]:
# Import the necessary libraries

import pandas as pd
import numpy as np
import os

<a id='Import_data'></a>
# 2. Import data

In [2]:
# Create a string of the path for the main project folder

path = r'C:\Users\Ryan\Documents\07-17-2023 Instacart Basket Analysis'

In [3]:
# Import the "orders.csv" data set using the os library

df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)

In [4]:
# Import the "products.csv" data set using the os library

df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [5]:
# Import the "departments.csv" data set using the os library

df_depts = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'departments.csv'), index_col = False)

In [6]:
# Import the "orders_products_prior.csv" data set using the os library

df_ords_prior = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders_products_prior.csv'), index_col = False)

In [7]:
# Import the "customers.csv" data set using the os library

df_custs = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

<a id='Data_wrangling_and_cleaning'></a>
# 3. Data wrangling and cleaning

<a id='Departments'></a>
## Wrangling and cleaning departments dataframe

In [8]:
# Check the output

df_depts

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


In [9]:
# Transpose the dataframe

df_depts = df_depts.T

In [10]:
# Check the output

df_depts

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [11]:
# Make 'department_id' and 'department' the headers

new_header = df_depts.iloc[0] # Take the first row of df_depts for the header
df_depts = df_depts[1:] # Drop the first row of df_depts
df_depts.columns = new_header # Set the header row as the df header
df_depts # Check the output

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk
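
The transpose-and-promote steps above can also be written so that `department_id` ends up as a regular integer column. This is a minimal sketch on a toy frame, not the notebook's own method; the name `wide` and its three departments are illustrative stand-ins for the imported file.

```python
import pandas as pd

# Toy stand-in for the wide departments file (assumed layout: one row, ids as column names)
wide = pd.DataFrame(
    {"department_id": ["department"], "1": ["frozen"], "2": ["other"], "3": ["bakery"]}
)

# Move the label column into the index, then transpose so each id becomes a row
tall = wide.set_index("department_id").T

# Tidy the axis names and turn the ids into an ordinary integer column
tall.index.name = "department_id"
tall.columns.name = None
tall = tall.reset_index()
tall["department_id"] = tall["department_id"].astype(int)

print(tall)
```

Keeping the id as a column rather than the index is a design choice; either form can be merged on later.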


<a id='Orders'></a>
## Wrangling and cleaning orders dataframe

In [12]:
# Check the output

df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [13]:
# Check the dimensions

df_ords.shape

(3421083, 7)

<a id='Orders_dropping'></a>
### Dropping columns

In [14]:
# Drop unnecessary columns for analysis

df_ords = df_ords.drop(columns = ['eval_set'])

<a id='Orders_renaming'></a>
### Renaming columns

In [15]:
# Rename the 'order_dow' column to 'orders_day_of_week'

df_ords.rename(columns = {'order_dow' : 'orders_day_of_week'}, inplace = True)

<a id='Orders_missing'></a>
### Missing values check

In [16]:
# Check for null values

df_ords.isna().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

There are 206,209 missing values in the 'days_since_prior_order' column. A likely explanation is that these are first-time orders, which have no prior order to count days from.

In [17]:
# Check the missing values from 'days_since_prior_order' column

df_ords[df_ords['days_since_prior_order'].isna()]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


In [18]:
# Check if there are 206,209 records of order_number = 1

df_ords[df_ords['order_number'] == 1]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


As expected, the missing values in the 'days_since_prior_order' column are first time orders (when order number = 1) for each user. Since first time orders do not have any days since prior order, it is correct to leave them blank. Therefore, no action will be taken to address the missing values.
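
The visual comparison of the two outputs can be backed by a direct check that the missing rows and the first-order rows are the same rows. The following is a sketch on made-up data; `orders` is a hypothetical stand-in for `df_ords`.

```python
import numpy as np
import pandas as pd

# Toy orders table (illustrative values only)
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 10.0],
})

is_missing = orders["days_since_prior_order"].isna()
is_first = orders["order_number"] == 1

# True only if every missing value is a first order and every first order is missing
same_rows = bool((is_missing == is_first).all())

# The count of missing values should also equal the number of distinct users
n_missing = int(is_missing.sum())
n_users = orders["user_id"].nunique()

print(same_rows, n_missing, n_users)
```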

<a id='Orders_duplicates'></a>
### Duplicates check

In [19]:
# Check for duplicates

df_ords[df_ords.duplicated()]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


No duplicates found in the orders dataframe.

In [20]:
# Check the dimensions

df_ords.shape

(3421083, 6)

<a id='Orders_accuracy'></a>
### Accuracy check

In [21]:
# Obtain statistics

df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


- 'order_id', 'user_id', and 'order_number' are identifiers rather than measures, so their summary statistics are not meaningful. pandas most likely inferred an integer type for them when reading the csv file.
- The min and max values for the other columns appear to be correct.

<a id='Products'></a>
## Wrangling and cleaning products dataframe

In [22]:
# Check the output

df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [23]:
# Check the dimensions

df_prods.shape

(49693, 5)

<a id='Products_missing'></a>
### Missing values check

In [24]:
# Check for null values

df_prods.isna().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [25]:
# Removing null values

df_prods = df_prods.loc[df_prods['product_name'].notna()]

In [26]:
# Check the dimensions

df_prods.shape

(49677, 5)

There are 16 fewer rows in the dataframe.

<a id='Products_duplicates'></a>
### Duplicates check

In [27]:
# Check for duplicates

df_prods[df_prods.duplicated()]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


Five duplicates found.

In [28]:
# Remove duplicates

df_prods = df_prods.drop_duplicates()

In [29]:
# Check the dimensions

df_prods.shape

(49672, 5)

There are 5 fewer rows in the dataframe.

<a id='Products_accuracy'></a>
### Accuracy check

In [30]:
# Check statistics of df_prods

df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


A price of $99,999 is implausible for a grocery product, so the prices need further investigation.

In [31]:
# Find products priced above 30

df_prods[df_prods['prices'] > 30]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


Two items have an unusually high price and will skew the results of the analysis. For the current analysis, these values will be replaced with the prices for similar products.

**The data owner should review and fix the prices for future analysis.**

In [32]:
# Find products named 'Lowfat Cottage Cheese'

df_prods.loc[df_prods['product_name'].str.contains('Lowfat Cottage Cheese')]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
194,195,Grade A Pasteurized 2% Milkfat Lowfat Cottage ...,108,16,3.3
11203,11203,"Cottage Doubles Peach, 2% Milkfat Lowfat Cotta...",108,16,2.4
15285,15285,Lowfat Cottage Cheese,108,16,5.6
20658,20657,2% Milkfat Lowfat Cottage Cheese Large Curd,100,21,7.0
22130,22129,1% Milkfat Lowfat Cottage Cheese,108,16,9.3
22731,22730,Singles 1% Lowfat Cottage Cheese,108,16,2.2
30533,30531,2% Lowfat Cottage Cheese,21,16,7.5
35776,35772,1% Lowfat Cottage Cheese,108,16,7.0
38612,38608,Lowfat Cottage Cheese 2%,108,16,10.5
41083,41079,1.5% Milkfat Grade A Pasteurized Lowfat Cottag...,120,16,11.9


In [33]:
# Find products named 'Reduced Fat Milk'

df_prods.loc[df_prods['product_name'].str.contains('Reduced Fat Milk')]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
1940,1940,Organic 2% Reduced Fat Milk,84,16,9.1
5612,5612,Reduced Fat Milk,84,16,9.2
14882,14882,2% Milkfat Reduced Fat Milk,84,16,8.4
19172,19171,100% Lactose Free Organic 2% Reduced Fat Milk,84,16,1.9
19585,19584,Dairy Pure 2% Milkfat Reduced Fat Milk,84,16,1.7
19821,19820,Grassmilk 2% Reduced Fat Milk,84,16,12.6
20852,20851,Filtered Fresh 2% Reduced Fat Milk,84,16,5.8
22960,22959,Reduced Fat Milk 100% Lactose Free,91,16,14.6
23910,23909,2% Reduced Fat Milk,84,16,9.2
24442,24441,2% Reduced Fat Milk With Vitamin A&D,84,16,7.0


- There is a 2% Milkfat Lowfat Cottage Cheese priced at 3.3.
- There is a 2% Reduced Fat Milk priced at 9.2.

We will replace the extremely high priced values with these prices.

In [34]:
# Replace the high priced values

df_prods['prices'] = df_prods['prices'].replace([14900,99999],[3.3,9.2])
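
The replacement above matches on the price value itself. An alternative, sketched here under the assumption that the two affected rows are the ones shown earlier (product_id 21553 and 33664), is to target them by `product_id`, so that no other row can be changed by accident. The frame `prods` and the dictionary `replacements` are hypothetical names for illustration.

```python
import pandas as pd

# Toy products table containing the two outliers and one normal row
prods = pd.DataFrame({
    "product_id": [21553, 33664, 23909],
    "product_name": [
        "Lowfat 2% Milkfat Cottage Cheese",
        "2 % Reduced Fat Milk",
        "2% Reduced Fat Milk",
    ],
    "prices": [14900.0, 99999.0, 9.2],
})

# Map each affected product_id to the substitute price chosen in the text
replacements = {21553: 3.3, 33664: 9.2}

for product_id, price in replacements.items():
    prods.loc[prods["product_id"] == product_id, "prices"] = price

print(prods)
```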

In [35]:
# Re-check the statistics

df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,7.680379
std,14340.705287,38.315784,5.850779,4.199348
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,25.0


The maximum price is now 25.0, so the prices look reasonable.

<a id='Orders_prior'></a>
## Wrangling and cleaning orders prior dataframe

In [36]:
# Check the output

df_ords_prior.head(10)

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
5,2,17794,6,1
6,2,40141,7,1
7,2,1819,8,1
8,2,43668,9,0
9,3,33754,1,1


In [37]:
# Check the dimensions

df_ords_prior.shape

(32434489, 4)

<a id='Prior_missing'></a>
### Missing values check

In [38]:
# Check for null values

df_ords_prior.isna().sum()

order_id             0
product_id           0
add_to_cart_order    0
reordered            0
dtype: int64

No null values found.

<a id='Prior_duplicates'></a>
### Duplicates check

In [39]:
# Check for duplicates

df_ords_prior[df_ords_prior.duplicated()]

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered


No duplicates found.

<a id='Customers'></a>
## Wrangling and cleaning customers dataframe

**Note:** The customers data set is not an actual data set from Instacart. The customers data set was fabricated by CareerFoundry in service of learning.

In [40]:
# Check the output

df_custs.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [41]:
# Check the dimensions

df_custs.shape

(206209, 10)

<a id='Customers_dropping'></a>
### Dropping columns

In [42]:
# Drop 'First Name' and 'Surnam' columns from df_custs as they contain PII

df_custs = df_custs.drop(columns = ['First Name', 'Surnam'])

<a id='Customers_renaming'></a>
### Renaming columns

In [43]:
# Rename columns in customers dataframe to appropriate naming conventions

df_custs.rename(columns = {'STATE': 'state',
                           'Age': 'age',
                           'Gender': 'gender'},
                inplace = True)

<a id='Customers_missing'></a>
### Missing values check

In [44]:
# Check for null values

df_custs.isna().sum()

user_id         0
gender          0
state           0
age             0
date_joined     0
n_dependants    0
fam_status      0
income          0
dtype: int64

No missing values found.

<a id='Customers_duplicates'></a>
### Duplicates check

In [45]:
# Check for duplicates

df_custs[df_custs.duplicated()]

Unnamed: 0,user_id,gender,state,age,date_joined,n_dependants,fam_status,income


No duplicates found.

<a id='Customers_accuracy'></a>
### Accuracy check

In [46]:
# Check statistics of df_custs

df_custs.describe()

Unnamed: 0,user_id,age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


In [47]:
# Check 'fam_status' column values

df_custs['fam_status'].value_counts(dropna = False)

married                             144906
single                               33962
divorced/widowed                     17640
living with parents and siblings      9701
Name: fam_status, dtype: int64

In [48]:
# Check 'state' column values

df_custs['state'].value_counts(dropna = False)

Florida                 4044
Colorado                4044
Illinois                4044
Alabama                 4044
District of Columbia    4044
Hawaii                  4044
Arizona                 4044
Connecticut             4044
California              4044
Indiana                 4044
Arkansas                4044
Alaska                  4044
Delaware                4044
Iowa                    4044
Idaho                   4044
Georgia                 4044
Wyoming                 4043
Mississippi             4043
Oklahoma                4043
Utah                    4043
New Hampshire           4043
Kentucky                4043
Maryland                4043
Rhode Island            4043
Massachusetts           4043
Michigan                4043
New Jersey              4043
Kansas                  4043
South Dakota            4043
Minnesota               4043
Tennessee               4043
New York                4043
Washington              4043
Louisiana               4043
Montana       

In [49]:
# Check 'age' column values

df_custs['age'].value_counts(dropna = False)

19    3329
55    3317
51    3317
56    3306
32    3305
      ... 
65    3145
25    3127
66    3114
50    3102
36    3101
Name: age, Length: 64, dtype: int64

In [50]:
# Check 'date_joined' column values

df_custs['date_joined'].value_counts(dropna = False)

9/17/2018     213
2/10/2018     212
4/1/2019      211
9/21/2019     211
12/19/2017    210
             ... 
9/1/2018      141
1/22/2018     140
11/24/2017    139
7/18/2019     138
8/6/2018      128
Name: date_joined, Length: 1187, dtype: int64

**Note:** It's important to recall that the customers data set was fabricated in service of learning. This explains all of the accuracy issues in this data set. For example:
- It's unusual for all these customers to max out at 3 dependants
- It's unusual that the customers are evenly distributed among the 50 states and the District of Columbia
- It's unusual that the customers are almost evenly distributed in different age groups
- It's unusual that the customers are almost evenly distributed in their join dates

<a id='Data_types'></a>
# 4. Converting data types for optimal performance

<a id='Orders_convert'></a>
## Convert data types in orders dataframe

In [51]:
# Check the data types

df_ords.dtypes

order_id                    int64
user_id                     int64
order_number                int64
orders_day_of_week          int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

In [52]:
# Check for mixed types in df_ords dataframe

for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_ords[weird]) > 0:
        print (col)

No mixed data types found.
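
The same mixed-type check can be wrapped in a small reusable function. This is a sketch of one possible formulation, not the notebook's code; `mixed_type_columns` and the toy frame are hypothetical.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Toy frame: one clean integer column and one column mixing int, str and float
toy = pd.DataFrame({"clean": [1, 2, 3], "mixed": [1, "two", 3.0]})

mixed = mixed_type_columns(toy)
print(mixed)
```

Note that a string column containing NaN would also be reported by this function, because NaN is stored as a float.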

In [53]:
# Obtain summary statistics

df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


We will convert the following data types:
- order_id to 32-bit unsigned integer
- user_id to 32-bit unsigned integer
- order_number to 8-bit unsigned integer
- orders_day_of_week to 8-bit unsigned integer
- order_hour_of_day to 8-bit unsigned integer
- days_since_prior_order to 16-bit float (the column contains NaN values, which an unsigned integer type cannot hold)

In [54]:
# Convert data types for optimal performance

df_ords['order_id'] = df_ords['order_id'].astype('uint32')
df_ords['user_id'] = df_ords['user_id'].astype('uint32')
df_ords['order_number'] = df_ords['order_number'].astype('uint8')
df_ords['orders_day_of_week'] = df_ords['orders_day_of_week'].astype('uint8')
df_ords['order_hour_of_day'] = df_ords['order_hour_of_day'].astype('uint8')
df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].astype('float16')

In [55]:
# Re-check the data types

df_ords.dtypes

order_id                   uint32
user_id                    uint32
order_number                uint8
orders_day_of_week          uint8
order_hour_of_day           uint8
days_since_prior_order    float16
dtype: object
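
Each integer width can be justified by comparing the column's min and max from the summary statistics against the limits of the target type. The sketch below does this on a toy frame with `np.iinfo` and reports the memory saved; the frame `toy` and the mapping `targets` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame holding the extreme values reported by describe() for three columns
toy = pd.DataFrame(
    {"order_id": [1, 3421083], "order_number": [1, 100], "order_hour_of_day": [0, 23]},
    dtype="int64",
)
targets = {"order_id": "uint32", "order_number": "uint8", "order_hour_of_day": "uint8"}

before = int(toy.memory_usage(index=False).sum())

for col, dtype in targets.items():
    limits = np.iinfo(np.dtype(dtype))
    # Refuse to convert if any value would fall outside the target range
    if toy[col].min() < limits.min or toy[col].max() > limits.max:
        raise ValueError(f"{col} does not fit in {dtype}")
    toy[col] = toy[col].astype(dtype)

after = int(toy.memory_usage(index=False).sum())
print(before, after)
```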

<a id='Prior_convert'></a>
## Convert data types in orders prior dataframe

In [56]:
# Check the data types

df_ords_prior.dtypes

order_id             int64
product_id           int64
add_to_cart_order    int64
reordered            int64
dtype: object

In [57]:
# Check for mixed type data

for col in df_ords_prior.columns.tolist():
    weird = (df_ords_prior[[col]].applymap(type) != df_ords_prior[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_ords_prior[weird]) > 0:
        print (col)

No mixed data types found.

In [58]:
# Obtain summary statistics

df_ords_prior.describe()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
count,32434490.0,32434490.0,32434490.0,32434490.0
mean,1710749.0,25576.34,8.351076,0.5896975
std,987300.7,14096.69,7.126671,0.4918886
min,2.0,1.0,1.0,0.0
25%,855943.0,13530.0,3.0,0.0
50%,1711048.0,25256.0,6.0,1.0
75%,2565514.0,37935.0,11.0,1.0
max,3421083.0,49688.0,145.0,1.0


We will convert the following data types:
- order_id to 32-bit unsigned integer
- product_id to 16-bit unsigned integer
- add_to_cart_order to 8-bit unsigned integer
- reordered to 8-bit unsigned integer

In [59]:
# Convert data types for optimal performance

df_ords_prior['order_id'] = df_ords_prior['order_id'].astype('uint32')
df_ords_prior['product_id'] = df_ords_prior['product_id'].astype('uint16')
df_ords_prior['add_to_cart_order'] = df_ords_prior['add_to_cart_order'].astype('uint8')
df_ords_prior['reordered'] = df_ords_prior['reordered'].astype('uint8')

In [60]:
# Re-check the data types

df_ords_prior.dtypes

order_id             uint32
product_id           uint16
add_to_cart_order     uint8
reordered             uint8
dtype: object

<a id='Products_convert'></a>
## Convert data types in products dataframe

In [61]:
# Check the data types

df_prods.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

In [62]:
# Check for mixed type data

for col in df_prods.columns.tolist():
    weird = (df_prods[[col]].applymap(type) != df_prods[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_prods[weird]) > 0:
        print (col)

No mixed data types found.

In [63]:
# Obtain summary statistics of df_prods

df_prods.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,7.680379
std,14340.705287,38.315784,5.850779,4.199348
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,25.0


We will convert the following data types:
- product_id to 16-bit unsigned integer
- aisle_id to 8-bit unsigned integer
- department_id to 8-bit unsigned integer
- prices to 16-bit float

In [64]:
# Converting data types

df_prods['product_id'] = df_prods['product_id'].astype('uint16')
df_prods['aisle_id'] = df_prods['aisle_id'].astype('uint8')
df_prods['department_id'] = df_prods['department_id'].astype('uint8')
df_prods['prices'] = df_prods['prices'].astype('float16')

In [65]:
# Re-check the data types

df_prods.dtypes

product_id        uint16
product_name      object
aisle_id           uint8
department_id      uint8
prices           float16
dtype: object

<a id='Customers_convert'></a>
## Convert data types in customers dataframe

In [66]:
# Check the data types in df_custs

df_custs.dtypes

user_id          int64
gender          object
state           object
age              int64
date_joined     object
n_dependants     int64
fam_status      object
income           int64
dtype: object

In [67]:
# Check for mixed types in df_custs dataframe

for col in df_custs.columns.tolist():
    weird = (df_custs[[col]].applymap(type) != df_custs[[col]].iloc[0].apply(type)).any(axis = 1)
    if len (df_custs[weird]) > 0:
        print (col)

No mixed data types found.

In [68]:
# Obtain summary statistics of df_custs

df_custs.describe()

Unnamed: 0,user_id,age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


We will convert the following data types:
- user_id to 32-bit unsigned integer
- age to 8-bit unsigned integer
- n_dependants to 8-bit unsigned integer
- income to 32-bit unsigned integer
- date_joined to datetime

In [69]:
# Change data types in df_custs to more optimal ones

df_custs['user_id'] = df_custs['user_id'].astype('uint32')
df_custs['age'] = df_custs['age'].astype('uint8')
df_custs['n_dependants'] = df_custs['n_dependants'].astype('uint8')
df_custs['income'] = df_custs['income'].astype('uint32')
df_custs['date_joined'] = pd.to_datetime(df_custs['date_joined'], format="%m/%d/%Y")

In [70]:
# Re-check the data types

df_custs.dtypes

user_id                 uint32
gender                  object
state                   object
age                      uint8
date_joined     datetime64[ns]
n_dependants             uint8
fam_status              object
income                  uint32
dtype: object

<a id='Merging_dataframes'></a>
# 5. Merging dataframes

<a id='Merge1'></a>
## Merge orders dataframe with orders prior dataframe

In [71]:
# Merge orders dataframe with orders prior dataframe

ords_prods_combined = df_ords.merge(df_ords_prior, on = 'order_id', how = 'inner', indicator = True)

In [72]:
# Check the output

ords_prods_combined.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge
0,2539329,1,1,2,8,,196,1,0,both
1,2539329,1,1,2,8,,14084,2,0,both
2,2539329,1,1,2,8,,12427,3,0,both
3,2539329,1,1,2,8,,26088,4,0,both
4,2539329,1,1,2,8,,26405,5,0,both


In [73]:
# Obtain frequencies from '_merge' column

ords_prods_combined['_merge'].value_counts(dropna = False)

both          32434489
left_only            0
right_only           0
Name: _merge, dtype: int64

In [74]:
# Drop '_merge' column

ords_prods_combined = ords_prods_combined.drop(columns = ['_merge'])

In [75]:
# Check the dimensions

ords_prods_combined.shape

(32434489, 9)

In [76]:
# Check the data types

ords_prods_combined.dtypes

order_id                   uint32
user_id                    uint32
order_number                uint8
orders_day_of_week          uint8
order_hour_of_day           uint8
days_since_prior_order    float16
product_id                 uint16
add_to_cart_order           uint8
reordered                   uint8
dtype: object

<a id='Merge2'></a>
## Merge orders products combined dataframe with products dataframe

In [77]:
# Merge orders products combined dataframe with products dataframe

ords_prods_merge = ords_prods_combined.merge(df_prods, on = 'product_id', how = 'inner', indicator = True)

In [78]:
# Check the output

ords_prods_merge.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both
1,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both
2,473747,1,3,3,12,21.0,196,1,1,Soda,77,7,9.0,both
3,2254736,1,4,4,7,29.0,196,1,1,Soda,77,7,9.0,both
4,431534,1,5,4,15,28.0,196,1,1,Soda,77,7,9.0,both


In [79]:
# Check the frequency of '_merge' column

ords_prods_merge['_merge'].value_counts()

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64
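
The row count here is 32,404,859, which is 29,630 fewer than the 32,434,489 rows before this merge. With `how = 'inner'`, the `_merge` column can only ever contain `both`, so it cannot show which rows were dropped. A sketch on toy data, assuming an outer merge is acceptable for diagnosis, shows how unmatched rows become visible; all names and values below are illustrative.

```python
import pandas as pd

# Toy order lines and a toy products table with one id missing on each side
orders = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 20, 30]})
products = pd.DataFrame(
    {"product_id": [10, 20, 40], "product_name": ["Soda", "Salt", "Tea"]}
)

inner = orders.merge(products, on="product_id", how="inner", indicator=True)
outer = orders.merge(products, on="product_id", how="outer", indicator=True)

# Inner merge: only 'both' has a non-zero count
inner_counts = inner["_merge"].value_counts()
# Outer merge: 'left_only' and 'right_only' reveal the unmatched rows
outer_counts = outer["_merge"].value_counts()

print(inner_counts)
print(outer_counts)
```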

In [80]:
# Drop '_merge' column

ords_prods_merge = ords_prods_merge.drop(columns = ['_merge'])

In [81]:
# Check the dimensions

ords_prods_merge.shape

(32404859, 13)

In [82]:
# Check the data types

ords_prods_merge.dtypes

order_id                   uint32
user_id                    uint32
order_number                uint8
orders_day_of_week          uint8
order_hour_of_day           uint8
days_since_prior_order    float16
product_id                 uint16
add_to_cart_order           uint8
reordered                   uint8
product_name               object
aisle_id                    uint8
department_id               uint8
prices                    float16
dtype: object

<a id='Merge3'></a>
## Merge orders products merged dataframe with customers dataframe

In [83]:
# Merge ords_prods_merge and df_custs

ords_prods_all = ords_prods_merge.merge(df_custs, on = 'user_id', how = 'left', indicator = True)

In [84]:
# Check the output

ords_prods_all.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,department_id,prices,gender,state,age,date_joined,n_dependants,fam_status,income,_merge
0,2539329,1,1,2,8,,196,1,0,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,both
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,both
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,both
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,both
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,7,9.0,Female,Alabama,31,2019-02-17,3,married,40423,both


In [85]:
# Obtain frequencies of '_merge' column

ords_prods_all['_merge'].value_counts(dropna = False)

both          32404859
left_only            0
right_only           0
Name: _merge, dtype: int64

In [86]:
# Drop '_merge' column from ords_prods_all

ords_prods_all = ords_prods_all.drop(columns = ['_merge'])

In [87]:
# Check the dimensions

ords_prods_all.shape

(32404859, 20)

In [88]:
# Check the data types

ords_prods_all.dtypes

order_id                          uint32
user_id                           uint32
order_number                       uint8
orders_day_of_week                 uint8
order_hour_of_day                  uint8
days_since_prior_order           float16
product_id                        uint16
add_to_cart_order                  uint8
reordered                          uint8
product_name                      object
aisle_id                           uint8
department_id                      uint8
prices                           float16
gender                            object
state                             object
age                                uint8
date_joined               datetime64[ns]
n_dependants                       uint8
fam_status                        object
income                            uint32
dtype: object

<a id='Export_data'></a>
# 6. Export data

In [89]:
# Export df_depts dataframe as "departments_wrangled.pkl"

df_depts.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'departments_wrangled.pkl'))

In [90]:
# Export ords_prods_all dataframe as "orders_products_all.pkl"

ords_prods_all.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_all.pkl'))
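
A pickle file keeps the compact data types set earlier, which a csv export would not. The following sketch writes a toy frame to a temporary folder and reads it back to confirm this; the file name `toy.pkl` and the frame are hypothetical and unrelated to the project paths.

```python
import os
import tempfile

import pandas as pd

# Toy frame using two of the compact dtypes from the notebook
toy = pd.DataFrame({
    "order_id": pd.Series([1, 2], dtype="uint32"),
    "prices": pd.Series([9.0, 4.5], dtype="float16"),
})

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(target)
    restored = pd.read_pickle(target)

# Compare dtypes column by column after the round trip
same_dtypes = bool((restored.dtypes == toy.dtypes).all())
print(same_dtypes)
```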