Table of Contents:
1. Import libraries.  
2. Import data.  
3. Drop columns. 
4. Look for missing values in a column.
5. Rename columns.
6. Change a variable's data type.
7. Transpose data.
8. Data dictionary.
9. Subsetting.
10. Wrangling procedures from the task instructions.
11. Breakfast item sales subset.
12. Dinner item sales subset.
13. Investigate customer with "user_id" of 1.
14. Overall insights on the customer's behavior.
15. Export dataframes.

1. Import libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [5]:
# Turning the project folder path into a string
path = r'/Users/samlisik/Documents/Instacart Basket Analysis'

2. Import data

In [6]:
# Import the "orders.csv" dataset
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'))
# Import the "products.csv" dataset
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))
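The import pattern above (join the folder parts, then read the file) can be tried without the project data. This is a minimal sketch: the temporary folder and the two-row orders.csv it writes are hypothetical stand-ins, not the Instacart files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the project folder (not the real Instacart data)
with tempfile.TemporaryDirectory() as demo_path:
    data_dir = os.path.join(demo_path, '02 Data', 'Original Data')
    os.makedirs(data_dir)

    # Write a tiny CSV so there is something to import
    with open(os.path.join(data_dir, 'orders.csv'), 'w') as f:
        f.write('order_id,user_id\n2539329,1\n2398795,1\n')

    # Same call shape as in the notebook: join the path parts, then read
    df_demo = pd.read_csv(os.path.join(demo_path, '02 Data', 'Original Data', 'orders.csv'))

print(df_demo.shape)
```

Building the path with os.path.join keeps the folder names separate from the path separator, so the same cell should work on macOS, Linux, and Windows.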

In [11]:
# Print the shape (rows, columns) of the orders dataframe
df_ords.shape

(3421083, 7)

In [12]:
# Print the shape (rows, columns) of the products dataframe
df_prods.shape

(49693, 5)

In [9]:
# Display the first few rows of the orders dataframe
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [10]:
# Display the first few rows of the product dataframe
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


3. Drop columns

In [54]:
# Drop the eval_set column from the orders dataframe (this only displays the result; df_ords itself is unchanged)
df_ords.drop(columns = ['eval_set'])

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


In [55]:
# Alternative: instead of overwriting df_ords, save the result without the eval_set column as a new dataframe
df_ords_2 = df_ords.drop(columns = ['eval_set'])
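The difference between displaying a dropped column and saving the result can be checked on a small frame. The sketch below uses a made-up two-row dataframe (df_demo is a hypothetical name) to show that drop() returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

# Made-up miniature of the orders dataframe
df_demo = pd.DataFrame({'order_id': [1, 2], 'eval_set': ['prior', 'prior']})

# drop() returns a new dataframe; df_demo keeps all of its columns
df_demo_2 = df_demo.drop(columns=['eval_set'])

print(list(df_demo.columns))    # still contains 'eval_set'
print(list(df_demo_2.columns))  # 'eval_set' is gone
```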

4. Look for missing values in a column

In [16]:
# Looking for missing values in the 'days_since_prior_order' column in the orders dataframe
df_ords['days_since_prior_order'].value_counts(dropna=False)

days_since_prior_order
30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
NaN     206209
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: count, dtype: int64
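When only the number of missing values is needed, isna().sum() gives it directly without scanning the whole frequency table. A minimal sketch on a made-up series (the values are illustrative, not taken from the orders data):

```python
import numpy as np
import pandas as pd

# Made-up values with two missing entries
s = pd.Series([np.nan, 15.0, 21.0, np.nan, 7.0], name='days_since_prior_order')

# Count the missing values directly
n_missing = s.isna().sum()

# value_counts(dropna=False) includes NaN as its own category
counts = s.value_counts(dropna=False)

print(n_missing)
print(counts)
```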

5. Rename columns

In [58]:
# Rename the "order_dow" column in df_ords
df_ords.rename(columns = {'order_dow':'orders_day_of_week'}, inplace=True)

In [60]:
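# Check if the "order_dow" column has been renamed (the output below comes from a run in which the new name was misspelled as "oders_day_of_week"; this is corrected in section 10)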
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,oders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0
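A rename can also be verified with an explicit check instead of reading the head() output by eye, which makes a misspelled column name easier to catch. A minimal sketch on a made-up one-row dataframe (df_demo and renamed_ok are hypothetical names):

```python
import pandas as pd

# Made-up miniature of the orders dataframe
df_demo = pd.DataFrame({'order_id': [1], 'order_dow': [2]})
df_demo = df_demo.rename(columns={'order_dow': 'orders_day_of_week'})

# True only if the new name is present and the old one is gone
renamed_ok = ('orders_day_of_week' in df_demo.columns
              and 'order_dow' not in df_demo.columns)
print(renamed_ok)
```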


6. Change a variable's data type

In [61]:
# Change the data type of the "order_id" column in df_ords
df_ords['order_id']=df_ords['order_id'].astype('str')

In [62]:
# Use the dtype attribute to return the data type of the converted "order_id" column
df_ords['order_id'].dtype

dtype('O')
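The effect of the conversion can also be checked on the values themselves rather than on the dtype label, which may be displayed differently depending on the pandas version. A minimal sketch with two made-up order IDs:

```python
import pandas as pd

# Made-up miniature of the orders dataframe
df_demo = pd.DataFrame({'order_id': [2539329, 2398795]})
df_demo['order_id'] = df_demo['order_id'].astype('str')

# Each value is now a Python string, so the column is no longer numeric
first_value = df_demo['order_id'].iloc[0]
is_numeric = pd.api.types.is_numeric_dtype(df_demo['order_id'])

print(type(first_value), is_numeric)
```

Storing identifiers as strings keeps them out of describe() and other numeric summaries, where a mean or standard deviation of IDs would carry no meaning.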

7. Transpose data

In [18]:
# Import data set departments.csv
df_dep = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'departments.csv'), index_col=False)

In [19]:
df_dep.head()

Unnamed: 0,department_id,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,department,frozen,other,bakery,produce,alcohol,international,beverages,pets,dry goods pasta,...,meat seafood,pantry,breakfast,canned goods,dairy eggs,household,babies,snacks,deli,missing


In [20]:
# Transpose df_dep
df_dep.T

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [22]:
# Save the transposed df_dep dataframe under a new name
df_dep_t = df_dep.T

In [23]:
# Check the new dataframe
df_dep_t

Unnamed: 0,0
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta


In [25]:
# Move the current index into a regular column and add a numeric index (the result is only displayed, not saved)
df_dep_t.reset_index()

Unnamed: 0,index,0
0,department_id,department
1,1,frozen
2,2,other
3,3,bakery
4,4,produce
5,5,alcohol
6,6,international
7,7,beverages
8,8,pets
9,9,dry goods pasta


In [27]:
# Create a new header for the df_dep_t dataframe
new_header = df_dep_t.iloc[0]

In [28]:
new_header

0    department
Name: department_id, dtype: object

In [29]:
# Create a new dataframe that only copies over rows beyond the first row
df_dep_t_new = df_dep_t[1:]

In [30]:
df_dep_t_new

Unnamed: 0,0
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk


In [31]:
# Set the header row as the df header
df_dep_t_new.columns = new_header

In [32]:
df_dep_t_new

department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta
10,bulk
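The whole transpose-and-promote-header sequence from this section can be condensed into a few lines. The sketch below rebuilds it on a made-up three-department frame; df_wide, df_t, and df_long are hypothetical names, and the .copy() is an addition that makes the sliced frame independent before its columns are reassigned.

```python
import pandas as pd

# Made-up wide frame shaped like departments.csv (one row, one column per ID)
df_wide = pd.DataFrame([['department', 'frozen', 'other', 'bakery']],
                       columns=['department_id', '1', '2', '3'])

# Transpose so each department becomes a row
df_t = df_wide.T

# The first row holds the header text; the remaining rows hold the data
new_header = df_t.iloc[0]
df_long = df_t[1:].copy()
df_long.columns = new_header

print(df_long)
```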


8. Data dictionary

In [36]:
# Turn the df_dep_t_new dataframe into a dictionary
data_dict = df_dep_t_new.to_dict('index')

In [37]:
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [38]:
# Use the data dictionary for checking the df_prods dataframe
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [40]:
# Check what "department_id" of 19 stands for
print(data_dict.get('19'))

{'department': 'snacks'}
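Because the department IDs came from column labels, the dictionary keys are strings, so a lookup with an integer finds nothing. A minimal sketch on a made-up two-row frame (demo_dict is a hypothetical name):

```python
import pandas as pd

# Made-up miniature of df_dep_t_new with string index labels
df_demo = pd.DataFrame({'department': ['produce', 'snacks']}, index=['4', '19'])
demo_dict = df_demo.to_dict(orient='index')

# The key must match the index label type exactly
by_string = demo_dict.get('19')
by_integer = demo_dict.get(19)

print(by_string)
print(by_integer)
```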


9. Subsetting

In [43]:
# Indexing - step by step
df_prods['department_id']==19

0         True
1        False
2        False
3        False
4        False
         ...  
49688    False
49689    False
49690    False
49691    False
49692    False
Name: department_id, Length: 49693, dtype: bool

In [46]:
df_prods[df_prods['department_id']==19]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


In [48]:
# Save the results as a new dataframe called df_snacks
df_snacks = df_prods[df_prods['department_id']==19]

In [49]:
df_snacks.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5


In [50]:
# Alternative: using the loc function
df_snacks_2=df_prods.loc[df_prods['department_id']==19]

In [51]:
df_snacks_2

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0


In [52]:
# Or
df_snacks_3=df_prods.loc[df_prods['department_id'].isin([19])]

In [53]:
df_snacks_3

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
15,16,Mint Chocolate Flavored Syrup,103,19,5.2
24,25,Salted Caramel Lean Protein & Fiber Bar,3,19,1.9
31,32,Nacho Cheese White Bean Chips,107,19,4.9
40,41,Organic Sourdough Einkorn Crackers Rosemary,78,19,6.5
...,...,...,...,...,...
49666,49662,Bacon Cheddar Pretzel Pieces,107,19,3.6
49669,49665,Super Dark Coconut Ash & Banana Chocolate Bar,45,19,6.9
49670,49666,Ginger Snaps Snacking Cookies,61,19,5.2
49675,49671,Milk Chocolate Drops,45,19,3.0
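The three subsetting variants above should select the same rows. A minimal sketch that checks this on a made-up four-product frame (the names a, b, c, and same are hypothetical):

```python
import pandas as pd

# Made-up miniature of the products dataframe
df_demo = pd.DataFrame({'product_id': [1, 2, 3, 4],
                        'department_id': [19, 13, 7, 19]})

# Boolean mask: True where the product belongs to department 19
mask = df_demo['department_id'] == 19

a = df_demo[mask]                                      # plain indexing
b = df_demo.loc[mask]                                  # loc with the same mask
c = df_demo.loc[df_demo['department_id'].isin([19])]   # isin with a one-item list

same = a.equals(b) and b.equals(c)
print(same)
```

The isin() form becomes the more convenient one as soon as several department IDs have to be matched at once.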


10. Wrangling procedures from the task instructions

In [64]:
# Find another identifier variable in the df_ords dataframe that doesn’t need to be included
df_ords.describe()

Unnamed: 0,user_id,order_number,oders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,102978.2,17.15486,2.776219,13.45202,11.11484
std,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,0.0,0.0,0.0
25%,51394.0,5.0,1.0,10.0,4.0
50%,102689.0,11.0,3.0,13.0,7.0
75%,154385.0,23.0,5.0,16.0,15.0
max,206209.0,100.0,6.0,23.0,30.0


In [65]:
# Convert the "user_id" column to string
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [66]:
# Check the data type of the new "user_id" column
df_ords['user_id'].dtype

dtype('O')

In [69]:
# Rename the unintuitive column ("days_since_prior_order") and the column where I previously made a typo ("oders_day_of_week") in df_ords
df_ords_renamed = df_ords.rename(columns={'oders_day_of_week': 'orders_day_of_week','days_since_prior_order': 'days_since_last_order'})

In [71]:
df_ords_renamed.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [72]:
# Find the frequency of orders by hour
# value_counts() returns each hour (0–23) with the number of orders placed in it, sorted from most to least frequent
# As seen below, the busiest hour is 10:00
df_ords['order_hour_of_day'].value_counts()

order_hour_of_day
10    288418
11    284728
15    283639
14    283042
13    277999
12    272841
16    272553
9     257812
17    228795
18    182912
8     178201
19    140569
20    104292
7      91868
21     78109
22     61468
23     40043
6      30529
0      22758
1      12398
5       9569
2       7539
4       5527
3       5474
Name: count, dtype: int64
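The busiest hour can also be pulled out programmatically rather than read off the top of the table. A minimal sketch on made-up hours (hour_counts, busiest_hour, and busiest_count are hypothetical names):

```python
import pandas as pd

# Made-up order hours
hours = pd.Series([10, 10, 10, 11, 11, 15], name='order_hour_of_day')

hour_counts = hours.value_counts()

# idxmax() returns the label (the hour) with the highest count
busiest_hour = hour_counts.idxmax()
busiest_count = hour_counts.max()

print(busiest_hour, busiest_count)
```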

In [73]:
# Use the already created dictionary to find the meaning of department_id = 4
print(data_dict.get('4'))

{'department': 'produce'}


11. Breakfast item sales subset

In [74]:
# Find the department_id of breakfast items ('14')
data_dict

{'1': {'department': 'frozen'},
 '2': {'department': 'other'},
 '3': {'department': 'bakery'},
 '4': {'department': 'produce'},
 '5': {'department': 'alcohol'},
 '6': {'department': 'international'},
 '7': {'department': 'beverages'},
 '8': {'department': 'pets'},
 '9': {'department': 'dry goods pasta'},
 '10': {'department': 'bulk'},
 '11': {'department': 'personal care'},
 '12': {'department': 'meat seafood'},
 '13': {'department': 'pantry'},
 '14': {'department': 'breakfast'},
 '15': {'department': 'canned goods'},
 '16': {'department': 'dairy eggs'},
 '17': {'department': 'household'},
 '18': {'department': 'babies'},
 '19': {'department': 'snacks'},
 '20': {'department': 'deli'},
 '21': {'department': 'missing'}}

In [75]:
# Create the subset for breakfast item sales
df_breakfast = df_prods[df_prods['department_id'] == 14]

In [76]:
# Verify the breakfast item sales subset
df_breakfast.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
27,28,Wheat Chex Cereal,121,14,10.1
33,34,,121,14,12.2
67,68,"Pancake Mix, Buttermilk",130,14,13.7
89,90,Smorz Cereal,121,14,3.9
210,211,Gluten Free Organic Cereal Coconut Maple Vanilla,130,14,3.6


12. Dinner item sales subset

In [77]:
# Create the subset for products that customers might use to throw dinner parties
df_dinner_party = df_prods[df_prods['department_id'].isin([12, 20, 5, 7])] # subset containing only alcohol, deli, beverages, and meat/seafood items

In [78]:
df_dinner_party.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
6,7,Pure Coconut Water With Orange,98,7,4.4
9,10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4
10,11,Peach Mango Juice,31,7,2.8
16,17,Rendered Duck Fat,35,12,17.1


In [79]:
# How many items are included in the subset
df_dinner_party.shape[0]

7650
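To see how the subset splits across the selected departments, the isin() filter can be followed by a count per department. A minimal sketch on a made-up five-product frame (party_departments and per_department are hypothetical names):

```python
import pandas as pd

# Made-up miniature of the products dataframe
df_demo = pd.DataFrame({'product_id': [1, 2, 3, 4, 5],
                        'department_id': [19, 7, 12, 5, 14]})

# alcohol (5), beverages (7), meat seafood (12), deli (20)
party_departments = [5, 7, 12, 20]
df_party = df_demo[df_demo['department_id'].isin(party_departments)]

# Number of products per department within the subset
per_department = df_party['department_id'].value_counts().sort_index()

print(len(df_party))
print(per_department)
```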

13. Investigate customer with "user_id" of 1

In [5]:
# Import pandas and os
import pandas as pd
import os

In [3]:
# Set the path variable
path = r'/Users/samlisik/Documents/Instacart Basket Analysis'

In [6]:
# Load the cleaned orders dataframe
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_clean.csv'))

In [7]:
# Check the first few rows
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [8]:
# Ensure 'user_id' is a string
df_ords['user_id'] = df_ords['user_id'].astype(str)

In [9]:
df_user1 = df_ords[df_ords['user_id'] == '1']
# (df_user1 holds only the orders of the customer with user_id '1')
df_user1.shape

(11, 7)

In [10]:
# Get info about data types, number of rows, and missing values
df_user1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11 entries, 0 to 10
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   order_id               11 non-null     int64  
 1   user_id                11 non-null     object 
 2   eval_set               11 non-null     object 
 3   order_number           11 non-null     int64  
 4   orders_day_of_week     11 non-null     int64  
 5   order_hour_of_day      11 non-null     int64  
 6   days_since_last_order  10 non-null     float64
dtypes: float64(1), int64(4), object(2)
memory usage: 704.0+ bytes


In [11]:
# Basic descriptive statistics for numeric columns
df_user1.describe()

Unnamed: 0,order_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order
count,11.0,11.0,11.0,11.0,10.0
mean,1923450.0,6.0,2.636364,10.090909,19.0
std,1071950.0,3.316625,1.286291,3.477198,9.030811
min,431534.0,1.0,1.0,7.0,0.0
25%,869017.0,3.5,1.5,7.5,14.25
50%,2295261.0,6.0,3.0,8.0,19.5
75%,2544846.0,8.5,4.0,13.0,26.25
max,3367565.0,11.0,4.0,16.0,30.0


In [12]:
# Total orders of the customer
total_orders = df_user1.shape[0]
total_orders

11

In [13]:
# Average time between orders of the customer
avg_days_between = df_user1['days_since_last_order'].mean()
avg_days_between

np.float64(19.0)

In [14]:
# Most common order day
most_common_day = df_user1['orders_day_of_week'].mode()[0]
most_common_day

np.int64(4)

In [15]:
# Most common order hour
most_common_hour = df_user1['order_hour_of_day'].mode()[0]
most_common_hour

np.int64(7)
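One caveat with mode()[0]: mode() returns every value that shares the highest count, sorted in ascending order, so [0] picks the smallest of them when there is a tie. A minimal sketch with made-up day values (all_modes and first_mode are hypothetical names):

```python
import pandas as pd

# Made-up days of the week in which 1 and 4 both occur twice
days = pd.Series([4, 1, 4, 1, 3])

# mode() returns all tied values, sorted ascending
all_modes = days.mode()
first_mode = all_modes[0]

print(list(all_modes))
print(first_mode)
```

Checking the length of the mode() result before taking [0] shows whether the "most common" value is unique.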

14. Overall insights on the customer's behavior
After extracting all available information about the customer with user_id '1', we observed the following behavior:

- Total orders: 11

- Average time between orders: 19 days

- Most common order day of the week: 4 (Thursday, assuming the week is coded with 0 = Sunday)

- Most common order hour: 7 AM

Interpretation:

The customer with user_id '1' most often orders at 7 AM, well before the overall peak hour of 10 AM found in section 10, which might indicate a habitual early shopper or possibly automated/recurring orders.

All other metrics (total orders = 11, avg days between orders = 19) are plausible and not erroneous.

The unusually early order hour is the most likely reason why the data engineers might have flagged this user for unusual behavior.

15. Export dataframes

In [96]:
# Export the wrangled orders dataframe (the default numeric index is not needed in the file)
df_ords_renamed.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index=False)

In [97]:
# Export the wrangled departments dataframe; keep the index here because it holds the department_id values
df_dep_t_new.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'departments_wrangled.csv'))
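Whether to pass index=False depends on what the index holds. The sketch below writes a made-up two-department frame to CSV text both ways; its index carries the department IDs, as in df_dep_t_new, so dropping the index drops the IDs. The names without_index, with_index, and df_back are hypothetical.

```python
import io

import pandas as pd

# Made-up miniature of df_dep_t_new: the index holds the department IDs
df_demo = pd.DataFrame({'department': ['frozen', 'other']}, index=['1', '2'])

# With no file path, to_csv() returns the CSV text as a string
without_index = df_demo.to_csv(index=False)
with_index = df_demo.to_csv()

# Reading the version with the index back restores the IDs (parsed as integers)
df_back = pd.read_csv(io.StringIO(with_index), index_col=0)

print(without_index)
print(with_index)
```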