# Creating flags 

## Table of Contents 
* [01. Importing Libraries](#01.-Importing-Libraries)
* [02. Importing File](#02.-Importing-File)
* [03. Creating loyalty flag](#03.-Creating-loyalty-flag)
* [04. Creating spending flag](#04.-Creating-spending-flag)
* [05. Creating frequency flag](#05.-Creating-frequency-flag)
* [06. Exporting File](#06.-Exporting-File)

# 01. Importing Libraries 

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import os

# 02. Importing File

In [2]:
ords_prods_merged = pd.read_pickle(r'/Users/suzandiab/Documents/Instacart Basket Analysis/02 Data/Prepared Data/ords_prods_merge_derived.pkl')

In [3]:
# Calculating mean of order number column grouped by department id column for entire df
ords_prods_merged.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1,15.457838
2,17.27792
3,17.170395
4,17.811403
5,15.215751
6,16.439806
7,17.225802
8,15.34065
9,15.895474
10,20.197148


These results differ from those obtained on the subset: every department is now included, and the averages are lower.
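The stacked `Unnamed: ..._level_...` headers in the output above come from passing a dict with a list of functions to `agg`, which produces two-level column labels. As a minimal sketch on made-up toy data (the column name `order_number_mean` is an illustrative choice, not something used elsewhere in this notebook), named aggregation gives a single flat header instead:

```python
import pandas as pd

# Toy data standing in for ords_prods_merged (values are made up)
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2],
    'order_number':  [10, 20, 5, 7],
})

# Named aggregation: output_column=(input_column, function)
summary = toy.groupby('department_id').agg(
    order_number_mean=('order_number', 'mean')
)

print(summary)
```

The result has one column, `order_number_mean`, indexed by `department_id`, which is easier to export or merge than a two-level header.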

# 03. Creating loyalty flag

In [4]:
# Creating a new column based on grouping user id column and generating max # of orders for each user
ords_prods_merged['max_order'] = ords_prods_merged.groupby(['user_id'])['order_number'].transform('max')


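The new max_order column holds the aggregation result: for every row, the highest order_number recorded for that row's user_id.
The data is grouped by the user_id column, and transform returns one value per row, so the result lines up with the original DataFrame.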
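To make the behaviour of `groupby(...).transform(...)` concrete, here is a small sketch on made-up toy data (the `toy` DataFrame and its values are assumptions for illustration only). It shows that the per-user maximum is repeated on every row belonging to that user:

```python
import pandas as pd

# Toy data standing in for ords_prods_merged (values are made up)
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# transform('max') returns a Series with the same index as toy,
# so it can be assigned directly as a new column
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

print(toy)
```

User 1 gets `max_order` 3 on all three rows and user 2 gets 2 on both rows, whereas `agg` would have returned only one row per user.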

In [5]:
# Checking first 15 rows
ords_prods_merged.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,product_name,aisle_id,department_id,prices,exists,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order
0,2539329,1,1,2,8,,196,1,0,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10
1,2398795,1,2,3,7,15.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Fewest Orders,10
2,473747,1,3,3,12,21.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders,10
3,2254736,1,4,4,7,29.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10
4,431534,1,5,4,15,28.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders,10
5,3367565,1,6,2,7,19.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10
6,550135,1,7,1,9,20.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Busiest days,Most orders,10
7,3108588,1,8,1,14,14.0,196,2,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Busiest days,Most orders,10
8,2295261,1,9,1,16,0.0,196,4,1,both,Soda,77,7,9.0,both,Mid-range product,Regularly busy,Busiest days,Most orders,10
9,2550362,1,10,4,8,30.0,196,1,1,both,Soda,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10


In [6]:
# Using loc function to assign loyalty flag
ords_prods_merged.loc[ords_prods_merged['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [7]:
# Using loc function to assign loyalty flag
ords_prods_merged.loc[(ords_prods_merged['max_order'] <= 40) & (ords_prods_merged['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [8]:
# Using loc function to assign loyalty flag
ords_prods_merged.loc[ords_prods_merged['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

Flag Criteria:

1) If the user's maximum number of orders is greater than 40, the customer is labeled a “Loyal customer.”
2) If the user's maximum number of orders is greater than 10 but less than or equal to 40, the customer is labeled a “Regular customer.”
3) If the user's maximum number of orders is less than or equal to 10, the customer is labeled a “New customer.”
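The three criteria describe consecutive, non-overlapping ranges, so they could also be applied in a single step. The sketch below is an alternative to the `loc` assignments used in this notebook, not the method used here; it runs on made-up toy values and uses `pd.cut`, whose bins are right-inclusive by default:

```python
import numpy as np
import pandas as pd

# Toy max_order values chosen to sit on and around the thresholds (made up)
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41, 99]})

# Bins are (-inf, 10], (10, 40], (40, inf) because right=True is the default
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[-np.inf, 10, 40, np.inf],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
).astype(str)

print(toy)
```

Values of exactly 10 and 40 fall into the lower category, which matches the `<=` comparisons in the criteria.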

In [9]:
# Printing the frequency of new “loyalty_flag” column
ords_prods_merged['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64

In [10]:
# First 60 rows of these 3 columns
ords_prods_merged[['user_id', 'loyalty_flag', 'order_number']].head(60)

Unnamed: 0,user_id,loyalty_flag,order_number
0,1,New customer,1
1,1,New customer,2
2,1,New customer,3
3,1,New customer,4
4,1,New customer,5
5,1,New customer,6
6,1,New customer,7
7,1,New customer,8
8,1,New customer,9
9,1,New customer,10


In [11]:
# Checking columns
ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,...,aisle_id,department_id,prices,exists,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag
0,2539329,1,1,2,8,,196,1,0,both,...,77,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10,New customer
1,2398795,1,2,3,7,15.0,196,1,1,both,...,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Fewest Orders,10,New customer
2,473747,1,3,3,12,21.0,196,1,1,both,...,77,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer
3,2254736,1,4,4,7,29.0,196,1,1,both,...,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10,New customer
4,431534,1,5,4,15,28.0,196,1,1,both,...,77,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders,10,New customer


In [12]:
# Checking customer spending habits based on loyalty flag
ords_prods_merged.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max']})

loyalty_flag,prices (mean),prices (min),prices (max)
Loyal customer,7.772831,1.0,25.0
New customer,7.80032,1.0,25.0
Regular customer,7.797431,1.0,25.0


Spending habits do not differ much between the three loyalty groups: the mean price is about 7.8 in each, and the minimum and maximum prices are identical.

# 04. Creating spending flag

In [15]:
# Calculating mean of prices column grouped by user id column for entire df
ords_prods_merged.groupby('user_id').agg({'prices': ['mean']})

user_id,prices (mean)
1,6.367797
2,7.515897
3,8.197727
4,8.205556
5,9.189189
...,...
206205,8.909375
206206,7.646667
206207,7.313453
206208,8.366617


In [16]:
# Creating a new column based on grouping user id column and generating the mean price of products purchased by each user
ords_prods_merged['avg_price_all_orders'] = ords_prods_merged.groupby(['user_id'])['prices'].transform('mean')


In [17]:
# Checking first 15 rows of df
ords_prods_merged.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,...,department_id,prices,exists,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price_all_orders
0,2539329,1,1,2,8,,196,1,0,both,...,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10,New customer,6.367797
1,2398795,1,2,3,7,15.0,196,1,1,both,...,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Fewest Orders,10,New customer,6.367797
2,473747,1,3,3,12,21.0,196,1,1,both,...,7,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797
3,2254736,1,4,4,7,29.0,196,1,1,both,...,7,9.0,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10,New customer,6.367797
4,431534,1,5,4,15,28.0,196,1,1,both,...,7,9.0,both,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797
5,3367565,1,6,2,7,19.0,196,1,1,both,...,7,9.0,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10,New customer,6.367797
6,550135,1,7,1,9,20.0,196,1,1,both,...,7,9.0,both,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
7,3108588,1,8,1,14,14.0,196,2,1,both,...,7,9.0,both,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
8,2295261,1,9,1,16,0.0,196,4,1,both,...,7,9.0,both,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797
9,2550362,1,10,4,8,30.0,196,1,1,both,...,7,9.0,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10,New customer,6.367797


In [18]:
# Using loc function to assign spending flag
ords_prods_merged.loc[ords_prods_merged['avg_price_all_orders'] < 10, 'spending_flag'] = 'Low spender'

In [20]:
# Using loc function to assign spending flag
ords_prods_merged.loc[ords_prods_merged['avg_price_all_orders'] >= 10, 'spending_flag'] = 'High spender'

Flag Criteria:
1) If the mean of the prices of products purchased by a user is lower than 10, then flag them as a “Low spender.”
2) If the mean of the prices of products purchased by a user is higher than or equal to 10, then flag them as a “High spender.”
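Because there are only two outcomes, the spending flag can be pictured as a single threshold test on the per-user mean. The following sketch uses made-up toy data and `np.where`; it is an illustrative alternative to the two `loc` assignments above. One caveat: a missing mean would land in the second branch with `np.where`, whereas `loc` would leave it unflagged.

```python
import numpy as np
import pandas as pd

# Toy data standing in for ords_prods_merged (values are made up)
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'prices':  [4.0, 8.0, 9.0, 15.0],
})

# Per-user mean price, repeated on every row of that user
toy['avg_price_all_orders'] = toy.groupby('user_id')['prices'].transform('mean')

# Below 10 is a low spender; 10 or more is a high spender
toy['spending_flag'] = np.where(
    toy['avg_price_all_orders'] < 10, 'Low spender', 'High spender'
)

print(toy)
```

User 1 averages 6.0 and is flagged as a low spender; user 2 averages 12.0 and is flagged as a high spender, even though one of their products costs less than 10.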

In [21]:
# Printing the frequency of new “spending_flag” column
ords_prods_merged['spending_flag'].value_counts(dropna = False)

spending_flag
Low spender     32285165
High spender      119694
Name: count, dtype: int64

In [22]:
# Checking columns
ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,...,prices,exists,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price_all_orders,spending_flag
0,2539329,1,1,2,8,,196,1,0,both,...,9.0,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10,New customer,6.367797,Low spender
1,2398795,1,2,3,7,15.0,196,1,1,both,...,9.0,both,Mid-range product,Regularly busy,Slowest days,Fewest Orders,10,New customer,6.367797,Low spender
2,473747,1,3,3,12,21.0,196,1,1,both,...,9.0,both,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender
3,2254736,1,4,4,7,29.0,196,1,1,both,...,9.0,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10,New customer,6.367797,Low spender
4,431534,1,5,4,15,28.0,196,1,1,both,...,9.0,both,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender


# 05. Creating frequency flag

In [24]:
# Calculating median of days_since_last_order column grouped by user id column for entire df
ords_prods_merged.groupby('user_id').agg({'days_since_last_order': ['median']})

user_id,days_since_last_order (median)
1,20.5
2,13.0
3,10.0
4,20.0
5,11.0
...,...
206205,30.0
206206,3.0
206207,16.0
206208,7.0


In [25]:
# Creating a new column based on grouping user id column and generating median days since last order for each user
ords_prods_merged['median_days_last_order'] = ords_prods_merged.groupby(['user_id'])['days_since_last_order'].transform('median')


In [26]:
# Checking first 15 rows of df
ords_prods_merged.head(15)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,...,exists,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price_all_orders,spending_flag,median_days_last_order
0,2539329,1,1,2,8,,196,1,0,both,...,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10,New customer,6.367797,Low spender,20.5
1,2398795,1,2,3,7,15.0,196,1,1,both,...,both,Mid-range product,Regularly busy,Slowest days,Fewest Orders,10,New customer,6.367797,Low spender,20.5
2,473747,1,3,3,12,21.0,196,1,1,both,...,both,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5
3,2254736,1,4,4,7,29.0,196,1,1,both,...,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10,New customer,6.367797,Low spender,20.5
4,431534,1,5,4,15,28.0,196,1,1,both,...,both,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5
5,3367565,1,6,2,7,19.0,196,1,1,both,...,both,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10,New customer,6.367797,Low spender,20.5
6,550135,1,7,1,9,20.0,196,1,1,both,...,both,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
7,3108588,1,8,1,14,14.0,196,2,1,both,...,both,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
8,2295261,1,9,1,16,0.0,196,4,1,both,...,both,Mid-range product,Regularly busy,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5
9,2550362,1,10,4,8,30.0,196,1,1,both,...,both,Mid-range product,Least busy,Slowest days,Fewest Orders,10,New customer,6.367797,Low spender,20.5


In [27]:
# Using loc function to assign frequency flag
ords_prods_merged.loc[ords_prods_merged['median_days_last_order'] > 20, 'frequency_flag'] = 'Non-frequent customer'

In [28]:
# Using loc function to assign frequency flag
ords_prods_merged.loc[(ords_prods_merged['median_days_last_order'] > 10) & (ords_prods_merged['median_days_last_order'] <= 20), 'frequency_flag'] = 'Regular customer'

In [29]:
# Using loc function to assign frequency flag
ords_prods_merged.loc[ords_prods_merged['median_days_last_order'] <= 10, 'frequency_flag'] = 'Frequent customer'

Flag Criteria:
1) If the median of “days_since_last_order” is higher than 20, then the customer should be labeled a “Non-frequent customer.”
2) If the median is higher than 10 and lower than or equal to 20, then the customer should be labeled a “Regular customer.”
3) If the median is lower than or equal to 10, then the customer should be labeled a “Frequent customer.”
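A quick way to sanity-check criteria like these is to confirm that every non-missing value satisfies exactly one rule. The sketch below does this on made-up values placed on and around the thresholds; the Series and the variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy medians on and around the 10 and 20 thresholds, plus one missing value
median_days = pd.Series([0.0, 10.0, 10.5, 20.0, 20.5, 30.0, np.nan])

# The same three conditions used for the frequency flag
non_frequent = median_days > 20
regular = (median_days > 10) & (median_days <= 20)
frequent = median_days <= 10

# Count how many rules each value satisfies
matches = non_frequent.astype(int) + regular.astype(int) + frequent.astype(int)

print(matches.tolist())
```

Every real number matches exactly one rule, so the order of the `loc` assignments does not matter. The missing value matches none, because comparisons with NaN evaluate to False, so such rows stay unflagged.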

In [30]:
# Printing the frequency of new “frequency_flag” column
ords_prods_merged['frequency_flag'].value_counts(dropna = False)

frequency_flag
Frequent customer        21559853
Regular customer          7208564
Non-frequent customer     3636437
nan                             5
Name: count, dtype: int64

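The 5 rows with a missing flag were not removed during data wrangling. A missing median means there is no recorded gap between orders for that customer, which simply indicates that the customer never ordered again after their first order.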
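The sketch below reproduces this situation on made-up toy data: a user whose only rows come from a first order has no non-missing `days_since_last_order` values, so the per-user median is NaN. The toy DataFrame and the `unflagged_users` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# User 1 has three orders; user 2 has only a first order (values are made up)
toy = pd.DataFrame({
    'user_id':               [1, 1, 1, 2],
    'days_since_last_order': [np.nan, 15.0, 21.0, np.nan],
})

# The median skips NaN, but a group containing only NaN yields NaN
toy['median_days_last_order'] = (
    toy.groupby('user_id')['days_since_last_order'].transform('median')
)

# Users whose median could not be computed
unflagged_users = toy.loc[
    toy['median_days_last_order'].isna(), 'user_id'
].unique().tolist()

print(toy)
print(unflagged_users)
```

User 1's median is 18.0 because the NaN on the first order is ignored, while user 2 ends up with NaN and would receive no frequency flag.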

In [31]:
# Checking columns
ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,_merge,...,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order,loyalty_flag,avg_price_all_orders,spending_flag,median_days_last_order,frequency_flag
0,2539329,1,1,2,8,,196,1,0,both,...,Mid-range product,Regularly busy,Regularly busy,Fewest Orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,both,...,Mid-range product,Regularly busy,Slowest days,Fewest Orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,both,...,Mid-range product,Regularly busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,both,...,Mid-range product,Least busy,Slowest days,Fewest Orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,both,...,Mid-range product,Least busy,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


# 06. Exporting File

In [32]:
ords_prods_merged.to_pickle(r'/Users/suzandiab/Documents/Instacart Basket Analysis/02 Data/Prepared Data/ords_prods_merge_flagged.pkl')