<b>Task directions:</b><br>
1. Create a new notebook for this task. Be sure to import the relevant libraries, along with your ords_prods_merge dataframe, which should include your newly derived columns from the previous Exercise.<br><br>
2. In this Exercise, you learned how to find the aggregated mean of the “order_number” column grouped by “department_id” for a subset of your dataframe. Now, repeat this process for the entire dataframe.<br><br>
3. Analyze the result. How do the results for the entire dataframe differ from those of the subset? Include your comments in a markdown cell below the executed code.<br><br>
4. Follow the instructions in the Exercise for creating a loyalty flag for existing customers using the transform() and loc() functions.<br><br>
5. The marketing team at Instacart wants to know whether there’s a difference between the spending habits of the three types of customers you identified. Use the loyalty flag you created and check the basic statistics of the product prices for each loyalty category (Loyal Customer, Regular Customer, and New Customer). What you’re trying to determine is whether the prices of products purchased by loyal customers differ from those purchased by regular or new customers.<br><br>
6. The team now wants to target different types of spenders in their marketing campaigns. This can be achieved by looking at the prices of the items people are buying. Create a spending flag for each user based on the average price across all their orders using the following criteria:<br>
- If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”<br>
- If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”<br><br>
7. In order to send relevant notifications to users within the app (for instance, asking users if they want to buy the same item again), the Instacart team wants you to determine frequent versus non-frequent customers. Create an order frequency flag that marks the regularity of a user’s ordering behavior according to the median in the “days_since_prior_order” column. The criteria for the flag should be as follows:<br>
- If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.”<br>
- If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a “Regular customer.”<br>
- If the median is lower than or equal to 10, then the customer should be labeled a “Frequent customer.”<br><br>
8. Ensure your notebook is clean and structured and that your code is well commented.<br><br>
9. Export your dataframe as a pickle file and store it correctly in your “Prepared Data” folder.<br><br>
10. Save your notebook and submit it to your tutor for review.

# Step 1 

In [1]:
# Import libraries 
import pandas as pd
import numpy as np 
import os

In [2]:
# Create path 
path = r'/Users/bentley/Documents/Instacart'

In [4]:
# Import a pickle file as a dataframe 
df_ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_derived_new.pkl'))

In [6]:
# Check shape of dataframe
df_ords_prods_merge.shape

(32404859, 19)

In [7]:
# View dataframe 
df_ords_prods_merge.head(25)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,prior,1,2,8,,196,1,0,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders
1,2398795,1,prior,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders
2,473747,1,prior,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders
3,2254736,1,prior,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders
4,431534,1,prior,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders
5,3367565,1,prior,6,2,7,19.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Fewest orders
6,550135,1,prior,7,1,9,20.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders
7,3108588,1,prior,8,1,14,14.0,196,2,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders
8,2295261,1,prior,9,1,16,0.0,196,4,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders
9,2550362,1,prior,10,4,8,30.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Average orders


# Step 2 

In [8]:
# Find the aggregated mean of the "order_number" column grouped by "department_id" 
df_ords_prods_merge.groupby('department_id').agg({'order_number' : ['mean']})

Unnamed: 0_level_0,order_number
Unnamed: 0_level_1,mean
department_id,Unnamed: 1_level_2
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148


# Step 3 

The results for the entire dataframe in Step 2 differ from those of the subset in the lesson. The subset is limited to 1 million rows, whereas the entire dataframe (df_ords_prods_merge) covers all 32,404,859 rows and therefore all 21 departments. Because the subset only contains the first 1 million rows, its result was restricted to the departments that appear in those rows (e.g., 4, 13, 14, 16, 17, 19, and 20).
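
As a rough illustration of why a subset and the full dataframe can disagree, here is a minimal sketch on a toy dataframe. The values and the name `toy` are invented for illustration and are not taken from the Instacart data.

```python
import pandas as pd

# Toy stand-in for df_ords_prods_merge (invented values, for illustration only)
toy = pd.DataFrame({
    'department_id': [4, 4, 7, 7, 16, 16],
    'order_number':  [1, 3, 2, 6, 10, 20],
})

# Mean order_number per department for the full toy dataframe
full_means = toy.groupby('department_id')['order_number'].mean()

# Same aggregation on a "subset" made of the first three rows only
subset_means = toy.head(3).groupby('department_id')['order_number'].mean()

# Departments present in the full result but absent from the subset
missing = full_means.index.difference(subset_means.index)

print(full_means)
print(subset_means)
print(list(missing))
```

In this toy case the subset both misses a department entirely and reports a different mean for a department it only partly covers, which is the same kind of difference described above.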

# Step 4

Establishing Flag Criteria: To create a flag that identifies loyal customers based on how many orders they have made, I need some criteria. I'll use the following: <br> 
- If the maximum number of orders the user has made is over 40 <b>(>40)</b>, then the customer will be labeled a 'Loyal customer'<br>
- If the maximum number of orders the user has made is over 10 and less than or equal to 40 <b>(>10 and <=40)</b>, then the customer will be labeled a 'Regular customer'<br> 
- If the maximum number of orders the user has made is less than or equal to 10 <b>(<=10)</b>, then the customer will be labeled a 'New customer' <br>
<br> 
Now, let’s map this task onto the three-step process introduced earlier:<br>
<br>
Step 1 - Split the data into groups based on the “user_id” column.<br>
Step 2 - Apply the transform() function on the “order_number” column to generate the maximum orders for each user.<br>
Step 3 - Create a new column, “max_order,” into which you’ll place the results of your aggregation.<br>
<br>
Once this process is complete, I can use the “max_order” column to create another new column that assigns a loyalty flag to each customer according to the criteria (via the loc() function). 
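
To make the difference between aggregating and transforming concrete, here is a minimal sketch on invented data (the dataframe `toy` is hypothetical). It shows that agg() returns one row per user, while transform() returns one value per original row, which is what allows the result to be stored as a new column.

```python
import pandas as pd

# Toy data: user 1 has three orders, user 2 has two (invented values)
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# agg() collapses each group to a single row per user
per_user = toy.groupby('user_id')['order_number'].agg('max')

# transform() broadcasts the group maximum back to every row of that user
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

print(per_user)
print(toy)
```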

In [9]:
# Complete the 3-step process in a single line of code 
df_ords_prods_merge['max_order'] = df_ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')

In [10]:
# Check results in new column 
df_ords_prods_merge.head(20)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order
0,2539329,1,prior,1,2,8,,196,1,0,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10
1,2398795,1,prior,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10
2,473747,1,prior,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10
3,2254736,1,prior,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10
4,431534,1,prior,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10
5,3367565,1,prior,6,2,7,19.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Normal days,Fewest orders,10
6,550135,1,prior,7,1,9,20.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10
7,3108588,1,prior,8,1,14,14.0,196,2,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10
8,2295261,1,prior,9,1,16,0.0,196,4,1,both,Soda,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10
9,2550362,1,prior,10,4,8,30.0,196,1,1,both,Soda,77,7,9.0,Mid-range product,Least busy,Slowest days,Average orders,10


Notes: A new column ("max_order") has been created. Next, I can use it to create another column ("loyalty_flag") based on the flag criteria. 

In [11]:
# Create a flag that assigns a "loyalty" label to a user ID based on its corresponding max order value 
df_ords_prods_merge.loc[df_ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [12]:
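# Label users with more than 10 and at most 40 orders as regular customers
df_ords_prods_merge.loc[(df_ords_prods_merge['max_order'] <=40) & (df_ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [13]:
# Label users with 10 or fewer orders as new customers
df_ords_prods_merge.loc[df_ords_prods_merge['max_order'] <=10, 'loyalty_flag'] = 'New customer'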

In [14]:
# Print frequency of the new "loyalty_flag" column using the value_counts() function 
df_ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: loyalty_flag, dtype: int64

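Notes: Most rows fall under the "Regular customer" category. Because each row is a product within an order, these counts describe rows rather than unique customers, so the distribution of customers themselves may look different. 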
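
Since value_counts() counts rows, a user with many purchased products contributes many rows. A minimal sketch on invented data (the dataframe `toy` is hypothetical) shows one way to count unique users per flag instead, using nunique() on "user_id".

```python
import pandas as pd

# Toy data: one loyal user with four rows, two new users with one row each
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 1, 2, 3],
    'loyalty_flag': ['Loyal customer'] * 4 + ['New customer', 'New customer'],
})

# Rows per flag (what value_counts() reports)
row_counts = toy['loyalty_flag'].value_counts()

# Unique users per flag
user_counts = toy.groupby('loyalty_flag')['user_id'].nunique()

print(row_counts)
print(user_counts)
```

In the toy data, the loyal category has the most rows but the fewest users, so the two views can rank the categories differently.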

# Step 5

In [15]:
# Check information about the dataframe 
df_ords_prods_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32404859 entries, 0 to 32404858
Data columns (total 21 columns):
 #   Column                  Dtype   
---  ------                  -----   
 0   order_id                int64   
 1   user_id                 int64   
 2   eval_set                object  
 3   order_number            int64   
 4   order_day_of_week       int64   
 5   order_hour_of_day       int64   
 6   days_since_prior_order  float64 
 7   product_id              int64   
 8   add_to_cart_order       int64   
 9   reordered               int64   
 10  _merge                  category
 11  product_name            object  
 12  aisle_id                int64   
 13  department_id           int64   
 14  prices                  float64 
 15  price_range_loc         object  
 16  busiest_day             object  
 17  busiest_days            object  
 18  busiest_period_of_day   object  
 19  max_order               int64   
 20  loyalty_flag            object  
dtypes: cat

For this step, I'll need to use the "loyalty_flag" and "prices" columns. 

In [17]:
# Check basic, multiple statistics about 3 types of customers and their spending habits 
df_ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

Unnamed: 0_level_0,prices,prices,prices
Unnamed: 0_level_1,mean,min,max
loyalty_flag,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Loyal customer,10.386336,1.0,99999.0
New customer,13.29467,1.0,99999.0
Regular customer,12.495717,1.0,99999.0


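Notes: On average, loyal customers buy lower-priced products (mean price 10.39) than regular customers (12.50) or new customers (13.29). New and regular customers have fairly similar mean prices. The minimum (1.0) and maximum (99999.0) are identical across all three groups; a maximum that extreme likely reflects outliers that inflate the means, so these averages should be read with caution. 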
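
Because the maximum price of 99999.0 appears in every group, it may help to look at the median next to the mean. The following is a minimal sketch on invented data (the dataframe `toy` is hypothetical) showing how a single extreme price shifts the mean while leaving the median unchanged.

```python
import pandas as pd

# Toy data: both groups buy similar items, but one group contains an extreme price
toy = pd.DataFrame({
    'loyalty_flag': ['Loyal customer'] * 3 + ['New customer'] * 3,
    'prices':       [5.0, 7.0, 9.0, 5.0, 7.0, 99999.0],
})

# Compare mean, median, and max per loyalty category
stats = toy.groupby('loyalty_flag')['prices'].agg(['mean', 'median', 'max'])

print(stats)
```

Here the two groups share the same median, yet their means are far apart because of the single extreme value.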

# Step 6

Establishing Spending Flag Criteria: <br>
To create my "spending_flag" to identify and label types of spenders (Low spender/High spender), I need some criteria. I'll look at the prices of the items people are buying and create a spending flag for each user based on the average price across all their orders using the following: <br>
<br> 
- If the mean of the prices of products purchased by a user is lower than 10 <b>(<10)</b>, then I'll flag them as a 'Low spender'<br> 
- If the mean of the prices of products purchased by a user is higher than or equal to 10 <b>(>=10)</b>, then I'll flag them as a 'High spender'
<br><br>
I'll use the 3-step process as outlined in Step 4 to compute Step 6. <br>
Step 1 - Split the data into groups based on the “user_id” column.<br>
Step 2 - Apply the transform() function on the “prices” column to generate the average price of the products purchased by each user.<br>
Step 3 - Create a new column, “spending_habits,” into which I’ll place the results of my aggregation.<br>
<br>
Once this process is complete, I can use the “spending_habits” column to create another new column that assigns a spending flag to each customer according to the criteria (via the loc() function). 
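
For a two-category flag like this one, np.where() is a possible alternative to two separate loc statements. This is a minimal sketch on invented data (the dataframe `toy` is hypothetical), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: user 1 averages 6.0, user 2 averages exactly 10.0 (the boundary case)
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'prices':  [4.0, 8.0, 10.0, 10.0],
})

# Mean price per user, broadcast back to every row
toy['spending_habits'] = toy.groupby('user_id')['prices'].transform('mean')

# Below 10 is a low spender; everything else is a high spender.
# Caution: a NaN mean would also land in 'High spender' with this approach.
toy['spending_flag'] = np.where(toy['spending_habits'] < 10, 'Low spender', 'High spender')

print(toy)
```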

In [18]:
# Complete the 3-step process in a single line of code 
df_ords_prods_merge['spending_habits'] = df_ords_prods_merge.groupby(['user_id'])['prices'].transform('mean')

In [19]:
# Check results in new column 
df_ords_prods_merge.head(10)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits
0,2539329,1,prior,1,2,8,,196,1,0,...,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797
1,2398795,1,prior,2,3,7,15.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797
2,473747,1,prior,3,3,12,21.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797
3,2254736,1,prior,4,4,7,29.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797
4,431534,1,prior,5,4,15,28.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797
5,3367565,1,prior,6,2,7,19.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Normal days,Fewest orders,10,New customer,6.367797
6,550135,1,prior,7,1,9,20.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
7,3108588,1,prior,8,1,14,14.0,196,2,1,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
8,2295261,1,prior,9,1,16,0.0,196,4,1,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
9,2550362,1,prior,10,4,8,30.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797


Notes: The new column "spending_habits" has been created and cells are computed. Next, I'll check again by viewing the first 75 rows of the dataframe. 

In [20]:
# Recheck results in new column 
df_ords_prods_merge.head(75)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits
0,2539329,1,prior,1,2,8,,196,1,0,...,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797
1,2398795,1,prior,2,3,7,15.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797
2,473747,1,prior,3,3,12,21.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797
3,2254736,1,prior,4,4,7,29.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797
4,431534,1,prior,5,4,15,28.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70,2846344,98,prior,11,4,13,15.0,196,2,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,14,Regular customer,8.028000
71,747431,98,prior,12,6,13,30.0,196,4,1,...,77,7,9.0,Mid-range product,Regularly busy,Normal days,Most orders,14,Regular customer,8.028000
72,2383054,98,prior,13,4,11,30.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,14,Regular customer,8.028000
73,1993729,98,prior,14,5,8,8.0,196,2,1,...,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,14,Regular customer,8.028000


In [21]:
# Remove the limit on the maximum number of rows pandas displays 
pd.options.display.max_rows = None 

In [22]:
# Rerun the recheck results in new column 
df_ords_prods_merge.head(75)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,aisle_id,department_id,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits
0,2539329,1,prior,1,2,8,,196,1,0,...,77,7,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797
1,2398795,1,prior,2,3,7,15.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797
2,473747,1,prior,3,3,12,21.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797
3,2254736,1,prior,4,4,7,29.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797
4,431534,1,prior,5,4,15,28.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797
5,3367565,1,prior,6,2,7,19.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Normal days,Fewest orders,10,New customer,6.367797
6,550135,1,prior,7,1,9,20.0,196,1,1,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
7,3108588,1,prior,8,1,14,14.0,196,2,1,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
8,2295261,1,prior,9,1,16,0.0,196,4,1,...,77,7,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
9,2550362,1,prior,10,4,8,30.0,196,1,1,...,77,7,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797


Notes: All 75 rows are visible. This allows me to check whether my aggregation procedure was conducted successfully. Now that I've created my new column "spending_habits", it's time to set the "spending_flag" column. 
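
Setting pd.options.display.max_rows changes the display for the rest of the session. As an alternative sketch (not what this notebook does), pd.option_context() can lift the row limit only inside a with block and restore the previous setting afterwards. The dataframe `toy` is invented for illustration.

```python
import pandas as pd

# Toy dataframe with 100 rows (invented for illustration)
toy = pd.DataFrame({'user_id': range(100)})

# Remember the current setting so it can be compared afterwards
previous_max_rows = pd.get_option('display.max_rows')

# Lift the row limit only inside this block
with pd.option_context('display.max_rows', None):
    full_repr = repr(toy)

# Outside the block the earlier setting is back in place
restored_max_rows = pd.get_option('display.max_rows')

print(len(full_repr.splitlines()))
```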

In [23]:
# Create a flag that assigns a "spending" label to a user ID based on its corresponding average price value 
df_ords_prods_merge.loc[df_ords_prods_merge['spending_habits'] < 10, 'spending_flag'] = 'Low spender'

In [24]:
# Label users whose average price is 10 or higher as high spenders
df_ords_prods_merge.loc[df_ords_prods_merge['spending_habits'] >= 10, 'spending_flag'] = 'High spender'

In [25]:
# Print frequency of the new "spending_flag" column using the value_counts() function 
df_ords_prods_merge['spending_flag'].value_counts(dropna = False)

Low spender     31770742
High spender      634117
Name: spending_flag, dtype: int64

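Results: Rows belonging to low spenders outnumber rows belonging to high spenders by roughly 50 to 1. Note that these counts are rows (products purchased), not unique customers, so the share of customers who are low spenders may differ. <br>(31770742 / 634117 = 50.1) 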
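
The ratio can be double-checked with a few lines of arithmetic. This sketch uses only the two counts printed by value_counts() above; the variable names are my own.

```python
# Row counts taken from the value_counts() output above
low_rows = 31770742
high_rows = 634117

# Ratio of low-spender rows to high-spender rows
ratio = low_rows / high_rows

# Share of all rows that belong to low spenders
low_share = low_rows / (low_rows + high_rows)

print(round(ratio, 1))
print(round(low_share * 100, 1))
```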

# Step 7

Establishing Order Frequency Flag Criteria: <br>
To create my "order_freq_flag" to identify and label types of customers (Frequent/Regular/Non-frequent), I need some criteria. I'll look at users' ordering behavior and create an order frequency flag for each user based on the median in the "days_since_prior_order" column using the following: <br>
<br> 
- If the median of “days_since_prior_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.” <br>
- If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a “Regular customer.”<br>
- If the median is lower than or equal to 10, then the customer should be labeled a “Frequent customer.”<br>
<br>
I'll use the 3-step process as outlined in Steps 4 and 6 to compute Step 7. <br>
Step 1 - Split the data into groups based on the “user_id” column.<br>
Step 2 - Apply the transform() function on the “days_since_prior_order” column to generate the median number of days between orders for each user.<br>
Step 3 - Create a new column, “ordering_habits,” into which I’ll place the results of my aggregation.<br>
<br>
Once this process is complete, I can use the “ordering_habits” column to create another new column that assigns an order frequency flag to each customer according to the criteria (via the loc() function). 
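
The three criteria describe consecutive intervals, so pd.cut() is a possible alternative to three loc statements. This is a minimal sketch on invented median values (the series `medians` is hypothetical), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Invented medians covering each interval, both boundaries, and a missing value
medians = pd.Series([5.0, 10.0, 10.5, 20.0, 20.5, np.nan])

labels = ['Frequent customer', 'Regular customer', 'Non-frequent customer']

# Bins are right-inclusive by default: (-inf, 10], (10, 20], (20, inf]
flags = pd.cut(medians, bins=[-np.inf, 10, 20, np.inf], labels=labels)

print(flags)
```

With right-inclusive bins, a median of exactly 10 is labeled frequent and a median of exactly 20 is labeled regular, matching the criteria above. A missing median stays unlabeled.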

In [26]:
# Complete the 3-step process in a single line of code 
df_ords_prods_merge['ordering_habits'] = df_ords_prods_merge.groupby(['user_id'])['days_since_prior_order'].transform('median')

In [27]:
# Check results in new column 
df_ords_prods_merge.head(10)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits,spending_flag,ordering_habits
0,2539329,1,prior,1,2,8,,196,1,0,...,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797,Low spender,20.5
1,2398795,1,prior,2,3,7,15.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5
2,473747,1,prior,3,3,12,21.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5
3,2254736,1,prior,4,4,7,29.0,196,1,1,...,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5
4,431534,1,prior,5,4,15,28.0,196,1,1,...,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5
5,3367565,1,prior,6,2,7,19.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Normal days,Fewest orders,10,New customer,6.367797,Low spender,20.5
6,550135,1,prior,7,1,9,20.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
7,3108588,1,prior,8,1,14,14.0,196,2,1,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
8,2295261,1,prior,9,1,16,0.0,196,4,1,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
9,2550362,1,prior,10,4,8,30.0,196,1,1,...,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.5


Notes: "ordering_habits" column has been created and cells are computed. Next, I'll recheck results with 100 rows to see more users and their ordering habits. 

In [28]:
# Recheck results in new column with the first 100 rows
df_ords_prods_merge.head(100)

Unnamed: 0,order_id,user_id,eval_set,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,...,prices,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,spending_habits,spending_flag,ordering_habits
0,2539329,1,prior,1,2,8,,196,1,0,...,9.0,Mid-range product,Regularly busy,Normal days,Average orders,10,New customer,6.367797,Low spender,20.5
1,2398795,1,prior,2,3,7,15.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5
2,473747,1,prior,3,3,12,21.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5
3,2254736,1,prior,4,4,7,29.0,196,1,1,...,9.0,Mid-range product,Least busy,Slowest days,Fewest orders,10,New customer,6.367797,Low spender,20.5
4,431534,1,prior,5,4,15,28.0,196,1,1,...,9.0,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5
5,3367565,1,prior,6,2,7,19.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Normal days,Fewest orders,10,New customer,6.367797,Low spender,20.5
6,550135,1,prior,7,1,9,20.0,196,1,1,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
7,3108588,1,prior,8,1,14,14.0,196,2,1,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
8,2295261,1,prior,9,1,16,0.0,196,4,1,...,9.0,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
9,2550362,1,prior,10,4,8,30.0,196,1,1,...,9.0,Mid-range product,Least busy,Slowest days,Average orders,10,New customer,6.367797,Low spender,20.5


Notes: All 100 rows are visible. This allows me to check whether my aggregation procedure was conducted successfully. Now that I've created my new column "ordering_habits", it's time to set the "order_freq_flag" column.

In [29]:
# Create a flag that assigns an "order frequency" label to a user ID based on its corresponding median order frequency value 
df_ords_prods_merge.loc[df_ords_prods_merge['ordering_habits'] > 20, 'order_freq_flag'] = 'Non-frequent customer'

In [30]:
# Label users with a median above 10 and at most 20 days as regular customers
df_ords_prods_merge.loc[(df_ords_prods_merge['ordering_habits'] > 10) & (df_ords_prods_merge['ordering_habits'] <= 20), 'order_freq_flag'] = 'Regular customer'

In [31]:
# Label users with a median of 10 days or fewer as frequent customers
df_ords_prods_merge.loc[df_ords_prods_merge['ordering_habits'] <= 10, 'order_freq_flag'] = 'Frequent customer'

In [32]:
# Print frequency of the new "order_freq_flag" column using the value_counts() function 
df_ords_prods_merge['order_freq_flag'].value_counts(dropna = False)

Frequent customer        21559853
Regular customer          7208564
Non-frequent customer     3636437
NaN                             5
Name: order_freq_flag, dtype: int64

Results: More than half of the rows belong to frequent customers. Five rows have no flag (NaN). Presumably no median could be computed for them because "days_since_prior_order" is missing, for example for a user with only a single order. 
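
One plausible way a row ends up without a flag is sketched below on invented data (the dataframe `toy` is hypothetical). A user whose only "days_since_prior_order" value is missing has no median, so none of the three conditions can match. Whether this explains the 5 rows above would need to be checked against the real dataframe.

```python
import numpy as np
import pandas as pd

# Toy data: user 1 has three orders, user 2 has only a first order (NaN)
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'days_since_prior_order': [np.nan, 8.0, 12.0, np.nan],
})

# The group median skips NaN values; an all-NaN group yields NaN
toy['ordering_habits'] = toy.groupby('user_id')['days_since_prior_order'].transform('median')

# Users whose median is missing and who would therefore stay unflagged
unflagged_users = toy.loc[toy['ordering_habits'].isna(), 'user_id'].unique()

print(toy)
print(list(unflagged_users))
```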

# Steps 8, 9, 10 

The notebook is clean and organized! 

In [33]:
# Export dataframe in a pickle file 
df_ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_flagged.pkl'))
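
As an optional sanity check after exporting, a pickle file can be read back and compared with the original. This is a minimal sketch on a toy dataframe written to a temporary folder; the file name `toy_flagged.pkl` and the dataframe `toy` are invented and are not part of the project folders.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the flagged dataframe (invented values)
toy = pd.DataFrame({
    'user_id': [1, 2],
    'loyalty_flag': ['New customer', 'Loyal customer'],
})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy_flagged.pkl')
    toy.to_pickle(file_path)
    restored = pd.read_pickle(file_path)

# True when values, dtypes, and shape all match
round_trip_ok = restored.equals(toy)

print(round_trip_ok)
```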

The notebook is saved in the Scripts folder and submitted for review. 