## Instacart Data Wrangling 
I wrangled three data sets: orders, products, and departments. 

<b>Main sections: </b>
1. Import libraries and data frames 
2. Dataframe: df_ords 
3. Dataframe: df_deps
4. Data Dictionary 
5. Dataframe: df_prods 
6. Dataframe: df_snacks 
7. Dataframe: df_prods 
8. Dataframe: df_snacks 
9. Dataframe: df_ords  
10. Dataframe: df_breakfast 
11. Dataframe: df_parties_dinner 
12. Dataframe: df_ords_user_1 

<b>Main dataframes: </b>
df_ords, df_prods, and df_deps 

<b>Subsets:</b> 
df_snacks, df_breakfast, df_parties_dinner, and df_ords_user_1 

## Importing libraries and dataframes

In [1]:
# Import libraries 
import pandas as pd 
import numpy as np 
import os 

In [4]:
# Create path
path = r'/Users/bentley/Documents/Instacart'

In [6]:
# Import df
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)

In [7]:
df_prods = pd.read_csv(os.path.join(path,'02 Data', 'Original Data', 'products.csv'), index_col = False)

In [9]:
df_deps = pd.read_csv(os.path.join(path,'02 Data', 'Original Data', 'departments.csv'), index_col = False)

In [10]:
# View head of df
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [11]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [12]:
df_deps.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


## Dataframe: df_ords

In [13]:
# Drop a column from the df (returns a new df; df_ords itself is not overwritten) 
df_ords.drop(columns = ['eval_set'])

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0
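
Notes: The output above no longer shows 'eval_set', yet later views of df_ords still include it, because drop() returns a new dataframe instead of changing the original. The sketch below illustrates this on a toy dataframe; `df_demo` and its values are made up for illustration and do not come from orders.csv.

```python
import pandas as pd

# Toy stand-in for df_ords (illustrative values only)
df_demo = pd.DataFrame({
    'order_id': [10, 11],
    'eval_set': ['prior', 'prior'],
    'order_number': [1, 2],
})

# drop() returns a new dataframe; df_demo itself is unchanged
dropped = df_demo.drop(columns=['eval_set'])

cols_before = list(df_demo.columns)   # still includes 'eval_set'
cols_after = list(dropped.columns)    # 'eval_set' is gone

# To make the change permanent, assign the result back
df_demo = df_demo.drop(columns=['eval_set'])
```

Whether to keep or permanently drop 'eval_set' is a judgment call; in this notebook the column is kept.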


In [14]:
# Check missing values 
df_ords['days_since_prior_order'].value_counts(dropna = False)

30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
NaN     206209
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: days_since_prior_order, dtype: int64
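
Notes: The 206,209 NaN values equal the largest 'user_id' in the data, which is consistent with each customer's first order having no prior order (this is an inference from the output, not something I verified). When only the number of missing values is needed, isna() gives it directly. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Illustrative values only; NaN marks an order with no prior order
s = pd.Series([np.nan, 15.0, 21.0, np.nan, 7.0], name='days_since_prior_order')

n_missing = int(s.isna().sum())   # number of missing values
share_missing = s.isna().mean()   # fraction of rows that are missing
```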

In [15]:
# Rename a column in the df
df_ords.rename(columns = {'order_dow' : 'order_day_of_week'}, inplace = True)

In [16]:
# View df
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [17]:
# Check descriptive stats 
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


In [18]:
# Change order_id to a string so it is treated as an identifier, not a numeric variable
df_ords['order_id'] = df_ords['order_id'].astype('str')

In [19]:
# Check  
df_ords['order_id'].dtype

dtype('O')
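
Notes: Casting an identifier to a string matters because describe() summarizes only numeric columns by default, so the mean or standard deviation of an ID no longer shows up. A small sketch, using a hypothetical `df_demo` with made-up values:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'order_id': [101, 102, 103],
    'order_hour_of_day': [8, 7, 12],
})

# Both columns are numeric, so both are summarized
numeric_before = list(df_demo.describe().columns)

# After the cast, order_id is no longer numeric
df_demo['order_id'] = df_demo['order_id'].astype('str')
numeric_after = list(df_demo.describe().columns)
```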

## Dataframe: df_deps 

In [21]:
# View df
df_deps

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


Notes: The dataframe is in wide format, with one column per department, which makes it hard to read. It needs to be transposed. 

In [22]:
# Transpose the df
df_deps.T

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [24]:
# Save the transposed df as a new df
df_deps_t = df_deps.T 

In [26]:
# Check  
df_deps_t

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


Notes: The dataframe looks much better now that it's transposed. 

In [27]:
# Reset the index to add a numeric index (displayed only; the result is not saved)
df_deps_t.reset_index()

Unnamed: 0,index,0
0,department_id,department
1,1,frozen
2,2,other
3,3,bakery
4,4,produce
5,5,alcohol
6,6,international
7,7,beverages
8,8,pets
9,9,dry goods pasta


In [29]:
# Take the first row of the df as the header 
new_header = df_deps_t.iloc[0]

In [30]:
# Check 
new_header

0    department
Name: department_id, dtype: object

In [31]:
# Create a new df that only copies over rows beyond the first row 
df_deps_t_new = df_deps_t[1:]

In [32]:
# Check  
df_deps_t_new

Unnamed: 0,0
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


Notes: The dataframe now lists each department, with its department ID as the index. 

In [33]:
# Set the header row as the df header 
df_deps_t_new.columns = new_header

In [34]:
# View new df
df_deps_t_new

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk
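
Notes: The transpose-and-promote-header steps above can be collected into one short sequence. The sketch below rebuilds them on a three-department toy dataframe; `df_wide`, `df_long`, and `header` are illustrative names, not variables from this notebook.

```python
import pandas as pd

# Wide layout similar to departments.csv (only three departments shown)
df_wide = pd.DataFrame(
    [['department', 'frozen', 'other', 'bakery']],
    columns=['department_id', '1', '2', '3'],
)

df_long = df_wide.T            # columns become rows
header = df_long.iloc[0]       # first row holds the real column name
df_long = df_long[1:].copy()   # keep only the data rows
df_long.columns = header       # promote that row to the header
```

The index labels stay strings ('1', '2', ...) because they started out as column names.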


## Data Dictionary

In [35]:
# Use pandas and create a data dict that contains meanings for the values in 'department_id' in df
data_dict = df_deps_t_new.to_dict('index')

In [36]:
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

Notes: 
The to_dict() method transformed the df into dict format, which I saved as a new variable, data_dict.  
Next, I use the new dict in practice with df_prods.  
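
Notes: With the 'index' orientation, each index label becomes a key that maps to a one-entry dict, and the keys are strings because the index labels are strings. A minimal sketch on a made-up two-row dataframe, which also shows a flatter alternative built from the single column:

```python
import pandas as pd

df_demo = pd.DataFrame({'department': ['frozen', 'other']}, index=['1', '2'])

# Nested form, as used for data_dict: {'1': {'department': 'frozen'}, ...}
nested = df_demo.to_dict('index')
name = nested['1']['department']

# Flat alternative: map each ID straight to the department name
flat = df_demo['department'].to_dict()

# get() returns None for an unknown key instead of raising an error
missing = nested.get('99')
```

The flat form is only a suggestion; the nested form is what the rest of this notebook uses.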

## Dataframe: df_prods

In [38]:
# View the df
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


Notes: The first row has a 'department_id' of 19. I use data_dict to look up what 19 stands for.

In [40]:
# Print
print(data_dict.get('19'))

{'department': 'snacks'}


## Dataframe: df_snacks

In [41]:
# Create a subset that only contains data from the snacks dep
df_snacks = df_prods[df_prods['department_id']==19]

In [42]:
df_snacks 

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


## Dataframe: df_prods

In [43]:
# Run only the inner condition to see the boolean mask it produces 
df_prods['department_id']==19

0         True
1        False
2        False
3        False
4        False
         ...  
49688    False
49689    False
49690    False
49691    False
49692    False
Name: department_id, Length: 49693, dtype: bool

In [44]:
# Wrap the same condition in df_prods[] to keep only the rows where it is True
df_prods[df_prods['department_id']==19]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


## Dataframe: df_snacks 

In [45]:
df_snacks.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5


Notes: The same result can be achieved with the loc indexer, optionally combined with isin(). 

In [47]:
df_snacks_2 = df_prods.loc[df_prods['department_id']==19]

In [48]:
df_snacks_3 = df_prods.loc[df_prods['department_id'].isin([19])]

In [49]:
df_snacks_2.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5


In [50]:
df_snacks_3.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
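
Notes: The three approaches (plain boolean indexing, loc, and loc with isin()) select the same rows. The sketch below checks that on a made-up four-row dataframe; `df_demo` is a hypothetical stand-in for df_prods.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'department_id': [19, 13, 7, 19],
})

mask = df_demo['department_id'] == 19   # boolean Series, one value per row

a = df_demo[mask]                                        # plain boolean indexing
b = df_demo.loc[mask]                                    # same mask through loc
c = df_demo.loc[df_demo['department_id'].isin([19])]     # isin() with a one-item list
```

isin() becomes the more convenient option once several values need to be matched at the same time.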


## Dataframe: df_ords 

Notes: I need to find other identifier variables in df_ords that should not be analyzed as numeric variables and change them to a suitable format.

In [52]:
# View
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [53]:
# Check descriptive stats 
df_ords.describe()

Unnamed: 0,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,102978.2,17.15486,2.776219,13.45202,11.11484
std,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,0.0,0.0,0.0
25%,51394.0,5.0,1.0,10.0,4.0
50%,102689.0,11.0,3.0,13.0,7.0
75%,154385.0,23.0,5.0,16.0,15.0
max,206209.0,100.0,6.0,23.0,30.0


In [54]:
# 'user_id' is an identifier, not a numeric variable, so change its dtype to str 
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [55]:
# Check 
df_ords['user_id'].dtype

dtype('O')

In [56]:
# Treat 'order_number' as an identifier rather than a numeric variable and change its dtype to str 
df_ords['order_number'] = df_ords['order_number'].astype('str')

In [57]:
# Check
df_ords['order_number'].dtype

dtype('O')

Notes: I need to rename a variable in df_ords in place, without reassigning the df.  

In [62]:
# Rename 'order_hour_of_day' to 'order_time_of_day' 
df_ords.rename(columns = {'order_hour_of_day' : 'order_time_of_day'}, inplace = True)

In [63]:
# Check 
df_ords

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_time_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
...,...,...,...,...,...,...,...
3421078,2266710,206209,prior,10,5,18,29.0
3421079,1854736,206209,prior,11,4,10,30.0
3421080,626363,206209,prior,12,1,12,18.0
3421081,2977660,206209,prior,13,1,12,7.0


Notes: The client wants to know what the busiest hour is for placing orders. I need to find the frequency of each value in 'order_time_of_day' and share the findings.

In [65]:
# Count the number of orders placed in each hour of the day 
df_ords['order_time_of_day'].value_counts(dropna = False)

10    288418
11    284728
15    283639
14    283042
13    277999
12    272841
16    272553
9     257812
17    228795
18    182912
8     178201
19    140569
20    104292
7      91868
21     78109
22     61468
23     40043
6      30529
0      22758
1      12398
5       9569
2       7539
4       5527
3       5474
Name: order_time_of_day, dtype: int64

Results: The 10 AM hour (value 10) is the busiest time of day for placing orders, with 288,418 orders. 
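
Notes: value_counts() sorts by frequency in descending order by default, so the busiest hour is the first row. It can also be pulled out programmatically with idxmax(). A minimal sketch on made-up hours:

```python
import pandas as pd

# Illustrative order hours only
hours = pd.Series([10, 10, 11, 15, 10, 11], name='order_time_of_day')

counts = hours.value_counts()    # frequency of each hour
busiest_hour = counts.idxmax()   # label (hour) with the highest count
busiest_count = counts.max()     # the count itself
```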

Notes: I need to determine the meaning of the value 4 in the 'department_id' column of df_prods using data_dict.

In [68]:
print(data_dict.get('4'))

{'department': 'produce'}


Notes: The client's sales team wants to know more about breakfast item sales.
Therefore, I need to create a subset containing only the required information. 

In [70]:
# View 
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

Notes: Breakfast items are in department 14 (produce is department 4). 

## Dataframe: df_breakfast

In [71]:
df_breakfast = df_prods[df_prods['department_id']==14]

In [72]:
df_breakfast

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6
...,...,...,...,...,...
49330,49326,Cereal Variety Fun Pack,121,14,9.1
49395,49391,Light and Fluffy Buttermilk Pancake Mix,130,14,2.0
49547,49543,Chocolate Cheerios Cereal,121,14,10.8
49637,49633,Shake 'N Pour Buttermilk Pancake Mix,130,14,14.2


Notes: The sales team also wants to see details about customers who might be throwing dinner parties. 
I need to find all observations in df_prods that belong to the alcohol, deli, beverages, and meat/seafood departments and save them as a new subset. 

In [75]:
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [78]:
# Use loc with isin() to select multiple department_id values (5 = alcohol, 20 = deli, 7 = beverages, 12 = meat seafood) 
df_parties_dinner = df_prods.loc[df_prods['department_id'].isin([5,20,7,12])]

## Dataframe: df_parties_dinner

In [79]:
df_parties_dinner

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1
...,...,...,...,...,...
49676,49672,Cafe Mocha K-Cup Packs,26,7,6.5
49679,49675,Cinnamon Dolce Keurig Brewed K Cups,26,7,14.0
49680,49676,Ultra Red Energy Drink,64,7,14.5
49686,49682,California Limeade,98,7,4.3


Notes: I keep track of the total row counts in my dataframes. <br>
How many rows does the last df have? Answer: 7650 rows in df_parties_dinner 
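
Notes: The row count can be read from the dataframe directly instead of from the display footer. A small sketch on a made-up dataframe; `df_demo` and `subset` are illustrative names.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4, 5],
    'department_id': [5, 19, 20, 7, 12],
})

# Same kind of multi-value filter as df_parties_dinner
subset = df_demo.loc[df_demo['department_id'].isin([5, 20, 7, 12])]

n_rows = subset.shape[0]   # shape is (rows, columns); len(subset) gives the same number
```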

Notes: Someone on the Instacart data engineering team thinks they've spotted something strange about customer 1. Customer 1 is recorded as '1' in 'user_id', so I need to extract all rows with that value. 

In [82]:
# View info about customer 1 
df_ords['user_id']==1

0          False
1          False
2          False
3          False
4          False
           ...  
3421078    False
3421079    False
3421080    False
3421081    False
3421082    False
Name: user_id, Length: 3421083, dtype: bool

In [83]:
df_ords[df_ords['user_id']==1]

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_time_of_day,days_since_prior_order


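Results: The filter returns no rows, but this does not mean that customer 1 has never made a purchase: the first rows of df_ords show orders for 'user_id' 1. Because 'user_id' was converted to a string earlier, comparing it with the integer 1 never matches; the comparison should use the string '1' instead. The dataframe created below uses the same comparison, so it is empty as well.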
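
Notes: The sketch below reproduces the empty result on a toy dataframe and shows the string comparison that does match. It assumes the column is stored as an object (string) column after the cast, as in the dtype checks above; `df_demo` and its values are made up.

```python
import pandas as pd

df_demo = pd.DataFrame({'user_id': [1, 1, 2], 'order_number': [1, 2, 1]})

# Same cast as applied to df_ords earlier in the notebook
df_demo['user_id'] = df_demo['user_id'].astype('str')

rows_int = df_demo[df_demo['user_id'] == 1]     # integer 1 never equals the string '1'
rows_str = df_demo[df_demo['user_id'] == '1']   # string comparison finds the rows
```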

## Dataframe: df_ords_user_1

In [85]:
df_ords_user_1 = df_ords[df_ords['user_id']==1]

In [86]:
df_ords_user_1

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_time_of_day,days_since_prior_order


In [90]:
# Export dataframes to the Prepared Data folder 
df_ords.to_csv(os.path.join(path,'02 Data', 'Prepared Data', 'orders_wrangled.csv'))

In [92]:
df_deps_t_new.to_csv(os.path.join(path,'02 Data', 'Prepared Data', 'departments_wrangled.csv'))
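
Notes: One thing to keep in mind when these files are imported again: a CSV does not store dtypes, so identifier columns that were cast to strings are read back as integers unless the dtype is given. The sketch below shows this with a temporary file; the file name and `df_demo` are made up, and `index=False` is my choice for the sketch, not what the export cells above use.

```python
import os
import tempfile

import pandas as pd

df_demo = pd.DataFrame({
    'order_id': ['2539329', '2398795'],
    'order_number': ['1', '2'],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'orders_wrangled_demo.csv')
    df_demo.to_csv(out_path, index=False)   # index=False skips writing the row index

    plain = pd.read_csv(out_path)           # dtypes are inferred: IDs become integers
    typed = pd.read_csv(out_path, dtype={'order_id': str, 'order_number': str})
```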