# Tasks<br> 

<b>Part 1: </b><br>
1. Download the customer data set and add it to your “Original Data” folder.<br>
2. Create a new notebook in your “Scripts” folder for part 1 of this task.<br>
3. Import your analysis libraries, as well as your new customer data set as a dataframe.<br>
4. Wrangle the data so that it follows consistent logic; for example, rename columns with illogical names and drop columns that don’t add anything to your analysis.<br>
5. Complete the fundamental data quality and consistency checks (check for and address missing values and duplicates, and convert any mixed-type data).<br>
6. Combine your customer data with the rest of your prepared Instacart data.<br>
7. Ensure your notebook contains logical titles, section headings, and descriptive code comments.<br>
8. Export this new dataframe as a pickle file so you can continue to use it in the second part of this task.<br>
9. Save your notebook so that you can send it to your tutor for review after completing part 2.<br><br>

<b>Steps 1 and 2 have been completed. Onwards to Step 3. </b>

# Step 3. Import libraries and dataframe

In [1]:
# Import libraries 
import pandas as pd 
import numpy as np 
import os 
import matplotlib.pyplot as plt 
import seaborn as sns 
import scipy

In [2]:
# Create path 
path = r'/Users/bentley/Documents/Instacart'

In [5]:
# Import dataframe 
df_customers = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

# Step 4. Data wrangling 

In [6]:
# Check the size of the dataframe 
df_customers.shape

(206209, 10)

In [8]:
# View the dataframe
df_customers.head(20)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [9]:
# Rename 'First Name' to follow the snake_case naming used elsewhere
df_customers.rename(columns = {'First Name' : 'first_name'}, inplace = True)

In [10]:
# Rename the misspelled 'Surnam' column to 'last_name'
df_customers.rename(columns = {'Surnam' : 'last_name'}, inplace = True)

In [11]:
# Rename 'Gender' to lowercase
df_customers.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [12]:
# Rename 'Age' to lowercase
df_customers.rename(columns = {'Age' : 'age'}, inplace = True)

In [20]:
# Rename 'n_dependants' to a more descriptive name
df_customers.rename(columns = {'n_dependants' : 'nbr_of_dependents'}, inplace = True)

In [14]:
# View dataframe 
df_customers.head()

Unnamed: 0,user_id,first_name,last_name,gender,STATE,age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [16]:
# Rename 'STATE' to lowercase
df_customers.rename(columns = {'STATE' : 'state'}, inplace = True)

In [17]:
# View dataframe 
df_customers.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


Names of columns have been checked and updated. 
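
The individual renames above could also be written as one call with a single mapping. This is only a sketch: `df_demo` is a hypothetical one-row stand-in for the customer data, not the real file, and `column_map` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical one-row stand-in that uses the original column names
df_demo = pd.DataFrame({
    'user_id': [26711],
    'First Name': ['Deborah'],
    'Surnam': ['Esquivel'],
    'Gender': ['Female'],
    'STATE': ['Missouri'],
    'Age': [48],
    'n_dependants': [3],
})

# One mapping of old name -> new name covers every rename in this step
column_map = {
    'First Name': 'first_name',
    'Surnam': 'last_name',
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'nbr_of_dependents',
}

# rename() returns a new dataframe; columns not in the mapping are left alone
df_demo = df_demo.rename(columns=column_map)
print(df_demo.columns.tolist())
```

Keeping the mapping in one place makes it easier to review every name change at a glance.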

In [18]:
# Check descriptive stats 
df_customers.describe()

Unnamed: 0,user_id,age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


In [19]:
# Convert 'user_id' to string so it is treated as an identifier, not a number
df_customers['user_id'] = df_customers['user_id'].astype('str')

In [22]:
# Check datatype 
df_customers['user_id'].dtype

dtype('O')

Values in the 'user_id' column are now stored as objects (strings).
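
As a small illustration of the conversion, the sketch below casts a hypothetical three-row ID column and confirms that every value is a Python string afterwards. `df_demo` is a stand-in, not the customer data.

```python
import pandas as pd

# Hypothetical stand-in for the 'user_id' column
df_demo = pd.DataFrame({'user_id': [26711, 33890, 65803]})

# Cast to string so pandas treats the IDs as labels rather than quantities
df_demo['user_id'] = df_demo['user_id'].astype('str')

# Every value should now be a Python str
all_strings = df_demo['user_id'].map(lambda v: isinstance(v, str)).all()
print(all_strings)
```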

In [23]:
# View dataframe 
df_customers.head(25)

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,nbr_of_dependents,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [24]:
# View information 
df_customers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   user_id            206209 non-null  object
 1   first_name         194950 non-null  object
 2   last_name          206209 non-null  object
 3   gender             206209 non-null  object
 4   state              206209 non-null  object
 5   age                206209 non-null  int64 
 6   date_joined        206209 non-null  object
 7   nbr_of_dependents  206209 non-null  int64 
 8   fam_status         206209 non-null  object
 9   income             206209 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 15.7+ MB


In [25]:
# Check frequency of states in the 'state' column 
df_customers['state'].value_counts()

Iowa                    4044
District of Columbia    4044
Colorado                4044
California              4044
Florida                 4044
Arkansas                4044
Connecticut             4044
Indiana                 4044
Alabama                 4044
Arizona                 4044
Illinois                4044
Georgia                 4044
Alaska                  4044
Hawaii                  4044
Delaware                4044
Idaho                   4044
Maine                   4043
Michigan                4043
Virginia                4043
Oregon                  4043
Wisconsin               4043
South Carolina          4043
New Mexico              4043
Vermont                 4043
New Jersey              4043
Oklahoma                4043
Maryland                4043
Montana                 4043
West Virginia           4043
Massachusetts           4043
Rhode Island            4043
New York                4043
Tennessee               4043
North Dakota            4043
Missouri      

In [26]:
# Sort the state counts alphabetically by index
df_customers['state'].value_counts().sort_index()

Alabama                 4044
Alaska                  4044
Arizona                 4044
Arkansas                4044
California              4044
Colorado                4044
Connecticut             4044
Delaware                4044
District of Columbia    4044
Florida                 4044
Georgia                 4044
Hawaii                  4044
Idaho                   4044
Illinois                4044
Indiana                 4044
Iowa                    4044
Kansas                  4043
Kentucky                4043
Louisiana               4043
Maine                   4043
Maryland                4043
Massachusetts           4043
Michigan                4043
Minnesota               4043
Mississippi             4043
Missouri                4043
Montana                 4043
Nebraska                4043
Nevada                  4043
New Hampshire           4043
New Jersey              4043
New Mexico              4043
New York                4043
North Carolina          4043
North Dakota  

All states are accounted for. There are no misspellings or duplicates.
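
The check above was done by reading the list. A rough programmatic complement is sketched below: it compares the number of distinct values before and after normalising case and whitespace. The sample values are hypothetical, with inconsistent spellings planted on purpose. This only catches case and spacing variants, not genuine misspellings.

```python
import pandas as pd

# Hypothetical sample with inconsistent spellings planted on purpose
states = pd.Series(['Iowa', 'iowa ', 'Idaho', 'Texas', 'TEXAS', 'Maine'])

# Normalise surrounding whitespace and case
normalised = states.str.strip().str.lower()

# If the two counts differ, some labels are variants of the same state
raw_unique = states.nunique()
clean_unique = normalised.nunique()
print(raw_unique, clean_unique)
```

On the real column, equal counts before and after normalising would support the conclusion that each state is spelled one way only.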

# Step 5. Fundamental data quality and consistency checks

In [29]:
# Check for missing values 
df_customers.isnull().sum()

user_id                  0
first_name           11259
last_name                0
gender                   0
state                    0
age                      0
date_joined              0
nbr_of_dependents        0
fam_status               0
income                   0
dtype: int64

There are 11,259 missing values in the 'first_name' column. There are no other missing values in other columns. 
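
Expressed as a share, 11,259 of 206,209 rows is roughly 5.5%. The sketch below shows how the count and the share can be computed side by side; `df_demo` is a hypothetical four-row frame, not the customer data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing first names
df_demo = pd.DataFrame({
    'first_name': ['Deborah', np.nan, 'Kenneth', np.nan],
    'last_name': ['Esquivel', 'Gilbert', 'Farley', 'Frost'],
})

# isnull() yields booleans: sum() counts them, mean() gives the fraction
missing_count = df_demo.isnull().sum()
missing_share = df_demo.isnull().mean()
print(missing_count)
print(missing_share)
```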

In [31]:
# Create a subset of the rows where 'first_name' is missing
df_customers_nan = df_customers[df_customers['first_name'].isnull()]

In [32]:
# View subset 
df_customers_nan.head(25)

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,nbr_of_dependents,fam_status,income
53,76659,,Gilbert,Male,Colorado,26,1/1/2017,2,married,41709
73,13738,,Frost,Female,Louisiana,39,1/1/2017,0,single,82518
82,89996,,Dawson,Female,Oregon,52,1/1/2017,3,married,117099
99,96166,,Oconnor,Male,Oklahoma,51,1/1/2017,1,married,155673
105,29778,,Dawson,Female,Utah,63,1/1/2017,3,married,151819
128,8562,,Oconnor,Male,Utah,46,1/1/2017,1,married,134898
140,149267,,Hutchinson,Male,South Carolina,20,1/1/2017,0,single,86778
149,82632,,Orr,Male,Hawaii,61,1/1/2017,1,married,118130
155,172331,,Williamson,Female,Alaska,27,1/1/2017,0,single,55047
236,182963,,Nicholson,Female,New Mexico,58,1/2/2017,1,married,163391


In [33]:
# Check descriptive stats 
df_customers.describe()

Unnamed: 0,age,nbr_of_dependents,income
count,206209.0,206209.0,206209.0
mean,49.501646,1.499823,94632.852548
std,18.480962,1.118433,42473.786988
min,18.0,0.0,25903.0
25%,33.0,0.0,59874.0
50%,49.0,1.0,93547.0
75%,66.0,3.0,124244.0
max,81.0,3.0,593901.0


In [34]:
# Check shape 
df_customers.shape

(206209, 10)

The missing values in the 'first_name' column will be left as they are, since every other field for those customers is populated (e.g. last name, gender, age, state). No columns need to be dropped at this point either.

In [35]:
# Check for duplicates 
df_customers_dups = df_customers[df_customers.duplicated()]

In [36]:
df_customers_dups

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,nbr_of_dependents,fam_status,income


There are no duplicates in the dataframe. 
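
Step 5 of the task list also mentions mixed-type data. One possible way to look for it is sketched below, under the assumption that a column counts as mixed when its non-missing values span more than one Python type. `df_demo` and `mixed_cols` are hypothetical names, and the mixed column is planted on purpose.

```python
import pandas as pd

# Hypothetical frame: 'first_name' mixes str and int on purpose
df_demo = pd.DataFrame({
    'first_name': ['Deborah', 26711, 'Kenneth'],
    'last_name': ['Esquivel', 'Hart', 'Farley'],
})

# Drop missing values first (NaN is a float and would count as a second type),
# then flag columns whose remaining values have more than one Python type
mixed_cols = [
    col for col in df_demo.columns
    if df_demo[col].dropna().map(type).nunique() > 1
]
print(mixed_cols)
```

Any column flagged this way could then be converted to a single type with `astype`, as was done for 'user_id'.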

The df_customers dataframe is wrangled and ready to be merged with the Instacart data. First, I'll export the df_customers dataframe to save it. 

In [38]:
# Export dataframe as a pickle file 
df_customers.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'customers_wrangled.pkl'))

In [39]:
# Import pickle files as dataframes to prepare for the merge between data sets 
df_ords_prods = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_flagged.pkl'))

In [40]:
df_cust = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'customers_wrangled.pkl'))

# Step 6. Combine data sets 

In [41]:
# Check shape of dataframes 
df_ords_prods.shape

(32404859, 25)

In [42]:
df_cust.shape

(206209, 10)

df_cust is significantly smaller than df_ords_prods.

In [43]:
# Check descriptive stats of dataframes 
df_ords_prods.describe()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,aisle_id,department_id,prices,max_order,spending_habits,ordering_habits
count,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,30328760.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404860.0,32404850.0
mean,1710745.0,102937.2,17.1423,2.738867,13.42515,11.10408,25598.66,8.352547,0.5895873,71.19612,9.919792,11.98023,33.05217,11.98023,10.39776
std,987298.8,59466.1,17.53532,2.090077,4.24638,8.779064,14084.0,7.127071,0.4919087,38.21139,6.281485,495.6554,25.15525,83.24227,7.131754
min,2.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,855947.0,51422.0,5.0,1.0,10.0,5.0,13544.0,3.0,0.0,31.0,4.0,4.2,13.0,7.387298,6.0
50%,1711049.0,102616.0,11.0,3.0,13.0,8.0,25302.0,6.0,1.0,83.0,9.0,7.4,26.0,7.824786,8.0
75%,2565499.0,154389.0,24.0,5.0,16.0,15.0,37947.0,11.0,1.0,107.0,16.0,11.3,47.0,8.254023,13.0
max,3421083.0,206209.0,99.0,6.0,23.0,30.0,49688.0,145.0,1.0,134.0,21.0,99999.0,99.0,25005.42,30.0


In [44]:
df_ords_prods.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404859 entries, 0 to 32404858
Data columns (total 25 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   order_id                int64   
 1   user_id                 int64   
 2   eval_set                object  
 3   order_number            int64   
 4   order_day_of_week       int64   
 5   order_hour_of_day       int64   
 6   days_since_prior_order  float64 
 7   product_id              int64   
 8   add_to_cart_order       int64   
 9   reordered               int64   
 10  _merge                  category
 11  product_name            object  
 12  aisle_id                int64   
 13  department_id           int64   
 14  prices                  float64 
 15  price_range_loc         object  
 16  busiest_day             object  
 17  busiest_days            object  
 18  busiest_period_of_day   object  
 19  max_order               int64   
 20  loyalty_flag            object  
 21  spendi

In [45]:
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   user_id            206209 non-null  object
 1   first_name         194950 non-null  object
 2   last_name          206209 non-null  object
 3   gender             206209 non-null  object
 4   state              206209 non-null  object
 5   age                206209 non-null  int64 
 6   date_joined        206209 non-null  object
 7   nbr_of_dependents  206209 non-null  int64 
 8   fam_status         206209 non-null  object
 9   income             206209 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 15.7+ MB


The key column that can be used to merge both data sets is the 'user_id' column. Before the merge, the data type of 'user_id' in df_ords_prods needs to be changed to string so that it matches df_cust. The other identifier columns ('order_id', 'product_id', 'aisle_id', 'department_id') are converted as well.
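
A minimal sketch of this idea on toy data: several identifier columns are cast in one assignment, after which the string keys on both sides line up for the merge. The frames `orders` and `customers` and the list `id_cols` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins: integer IDs on the orders side, string IDs on the customer side
orders = pd.DataFrame({
    'order_id': [2539329, 2398795],
    'user_id': [1, 1],
    'product_id': [196, 196],
})
customers = pd.DataFrame({
    'user_id': ['1', '15'],
    'state': ['Alabama', 'Indiana'],
})

# Cast all identifier columns to string in a single assignment
id_cols = ['order_id', 'user_id', 'product_id']
orders[id_cols] = orders[id_cols].astype('str')

# With matching key types, the merge pairs each order with its customer
merged = orders.merge(customers, on='user_id')
print(merged)
```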

In [46]:
# Change datatype in multiple columns 
df_ords_prods['user_id'] = df_ords_prods['user_id'].astype('str')

In [49]:
df_ords_prods['order_id'] = df_ords_prods['order_id'].astype('str')

In [50]:
df_ords_prods['product_id'] = df_ords_prods['product_id'].astype('str')

In [51]:
df_ords_prods['aisle_id'] = df_ords_prods['aisle_id'].astype('str')

In [52]:
df_ords_prods['department_id'] = df_ords_prods['department_id'].astype('str')

In [53]:
# Check dataframe 
df_ords_prods.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404859 entries, 0 to 32404858
Data columns (total 25 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   order_id                object  
 1   user_id                 object  
 2   eval_set                object  
 3   order_number            int64   
 4   order_day_of_week       int64   
 5   order_hour_of_day       int64   
 6   days_since_prior_order  float64 
 7   product_id              object  
 8   add_to_cart_order       int64   
 9   reordered               int64   
 10  _merge                  category
 11  product_name            object  
 12  aisle_id                object  
 13  department_id           object  
 14  prices                  float64 
 15  price_range_loc         object  
 16  busiest_day             object  
 17  busiest_days            object  
 18  busiest_period_of_day   object  
 19  max_order               int64   
 20  loyalty_flag            object  
 21  spendi

The data types of all columns have been checked.

In [54]:
# View dataframes 
df_ords_prods.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits,spending_flag,ordering_habits,order_freq_flag
0,2539329,1,prior,1,2,8,,196,1,0,...,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,prior,2,3,7,15.0,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,prior,3,3,12,21.0,196,1,1,...,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,prior,4,4,7,29.0,196,1,1,...,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,prior,5,4,15,28.0,196,1,1,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


In [55]:
# Remove the limit on the number of columns pandas displays 
pd.options.display.max_columns = None 

In [56]:
df_ords_prods.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits,spending_flag,ordering_habits,order_freq_flag
0,2539329,1,prior,1,2,8,,196,1,0,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,prior,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,prior,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,prior,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,prior,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


All columns are in view. 

In [57]:
df_cust.head()

Unnamed: 0,user_id,first_name,last_name,gender,state,age,date_joined,nbr_of_dependents,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


As mentioned earlier, df_cust is significantly smaller than df_ords_prods. To prepare for the merge, I need a shared column to use as a key; the 'user_id' column appears in both dataframes. The default join type is INNER, which means the resulting data set will only contain observations whose key appears in both input data sets.
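
To make the inner-join behaviour concrete, the sketch below merges two hypothetical toy frames in which one key exists only on the left and one only on the right. Neither of those keys survives the merge.

```python
import pandas as pd

# Hypothetical toy data: user '99' has orders but no customer record,
# and user '42' has a customer record but no orders
orders = pd.DataFrame({
    'user_id': ['1', '1', '15', '99'],
    'product_name': ['Soda', 'Soda', 'Coconut Water', 'Soda'],
})
customers = pd.DataFrame({
    'user_id': ['1', '15', '42'],
    'state': ['Alabama', 'Indiana', 'Iowa'],
})

# how='inner' is the default, so only keys present in both frames are kept
inner = orders.merge(customers, on='user_id')
print(inner)
```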

In [60]:
# Merge dataframes 
df_merge = df_ords_prods.merge(df_cust, on = 'user_id')

In [61]:
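# Inspect the '_merge' column; it was carried over from the earlier merge stored in df_ords_prods, since no indicator was requested here 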
df_merge['_merge'].value_counts()

both          32404859
right_only           0
left_only            0
Name: _merge, dtype: int64

In [62]:
# Merge dataframes using INNER JOIN 
df_merge_1 = df_ords_prods.merge(df_cust, on = 'user_id', how = 'inner')

In [63]:
# Inspect the '_merge' column again; it still comes from df_ords_prods, not from this merge 
df_merge_1['_merge'].value_counts()

both          32404859
right_only           0
left_only            0
Name: _merge, dtype: int64

CareerFoundry notified me that the two data sets merge fully, so I don't need to verify the merge rate. If I did, I would run the merge again with an outer join (how = 'outer'), like this: <br> 
<br><b>df_merge_2 = df_ords_prods.merge(df_cust, on = 'user_id', how = 'outer')</b>
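
An outer join alone does not report the merge rate; an indicator column is needed for that. Because df_ords_prods already contains a '_merge' column from an earlier merge, that column would need to be dropped (or the indicator given a different name) before requesting a new one. The sketch below shows the idea on hypothetical toy frames.

```python
import pandas as pd

# Hypothetical toy data; 'orders' carries a stale '_merge' column from an earlier merge
orders = pd.DataFrame({
    'user_id': ['1', '15', '99'],
    '_merge': ['both', 'both', 'both'],
})
customers = pd.DataFrame({
    'user_id': ['1', '15', '42'],
    'state': ['Alabama', 'Indiana', 'Iowa'],
})

# Drop the stale column so a fresh indicator can be created under the same name
orders = orders.drop(columns=['_merge'])

# indicator=True adds a '_merge' column describing where each row came from
outer = orders.merge(customers, on='user_id', how='outer', indicator=True)
counts = outer['_merge'].value_counts()
print(counts)
```

A full match would show every row under 'both' and zero rows under 'left_only' and 'right_only'.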

In [65]:
# Check new dataframe 
df_merge_1.head(100)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits,spending_flag,ordering_habits,order_freq_flag,first_name,last_name,gender,state,age,date_joined,nbr_of_dependents,fam_status,income
0,2539329,1,prior,1,2,8,,196,1,0,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,prior,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,prior,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,prior,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,prior,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,3317979,15,prior,5,4,15,17.0,14715,1,1,both,Coconut Water,98,7,4.0,Low-range product,Least busy,Slowest days,Most orders,22,Regular customer,3.980556,Low spender,10.0,Frequent customer,Janet,Woodard,Female,Indiana,69,6/3/2019,0,divorced/widowed,54313
96,2685110,15,prior,7,1,11,17.0,14715,3,1,both,Coconut Water,98,7,4.0,Low-range product,Regularly busy,Busiest days,Most orders,22,Regular customer,3.980556,Low spender,10.0,Frequent customer,Janet,Woodard,Female,Indiana,69,6/3/2019,0,divorced/widowed,54313
97,887727,15,prior,9,2,13,7.0,14715,1,1,both,Coconut Water,98,7,4.0,Low-range product,Regularly busy,Normal days,Most orders,22,Regular customer,3.980556,Low spender,10.0,Frequent customer,Janet,Woodard,Female,Indiana,69,6/3/2019,0,divorced/widowed,54313
98,2600170,15,prior,11,2,9,14.0,14715,1,1,both,Coconut Water,98,7,4.0,Low-range product,Regularly busy,Normal days,Most orders,22,Regular customer,3.980556,Low spender,10.0,Frequent customer,Janet,Woodard,Female,Indiana,69,6/3/2019,0,divorced/widowed,54313


In [66]:
# Remove the limit on the number of rows pandas displays 
pd.options.display.max_rows = None 

In [67]:
df_merge_1.head(100)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits,spending_flag,ordering_habits,order_freq_flag,first_name,last_name,gender,state,age,date_joined,nbr_of_dependents,fam_status,income
0,2539329,1,prior,1,2,8,,196,1,0,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,prior,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,prior,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,prior,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,prior,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
5,3367565,1,prior,6,2,7,19.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Fewest orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
6,550135,1,prior,7,1,9,20.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
7,3108588,1,prior,8,1,14,14.0,196,2,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
8,2295261,1,prior,9,1,16,0.0,196,4,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423
9,2550362,1,prior,10,4,8,30.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,2/17/2019,3,married,40423


In [68]:
# Check shape 
df_merge_1.shape

(32404859, 34)

There are 32,404,859 rows in this dataframe with 34 columns. This is a large dataframe. Both data sets are fully merged. 
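
The shape is consistent with a complete merge: the row count matches df_ords_prods, and 25 + 10 - 1 = 34 columns, because the shared key appears only once. A sketch of the same sanity check on hypothetical toy frames follows; it also passes `validate='many_to_one'`, which makes pandas raise an error if a 'user_id' were repeated on the customer side.

```python
import pandas as pd

# Hypothetical toy data: many orders per user, one customer row per user
orders = pd.DataFrame({
    'user_id': ['1', '1', '15'],
    'product_name': ['Soda', 'Soda', 'Coconut Water'],
})
customers = pd.DataFrame({
    'user_id': ['1', '15'],
    'state': ['Alabama', 'Indiana'],
})

# validate='many_to_one' raises a MergeError if keys repeat in 'customers'
merged = orders.merge(customers, on='user_id', how='inner', validate='many_to_one')

# Every order row should survive, and the key column should not be duplicated
rows_preserved = len(merged) == len(orders)
cols_expected = merged.shape[1] == orders.shape[1] + customers.shape[1] - 1
print(rows_preserved, cols_expected)
```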

# Steps 7, 8, 9 <br>

The notebook contains logical titles, section headings, and descriptive code comments.

In [71]:
# Export dataframe 
df_merge_1.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_cust.pkl'))