Table of Contents:  
1. Import libraries and data.
2. Create a loyalty flag for existing customers using the transform() and loc() functions.  
3. Compare the spending habits of the 3 identified types of customers.
4. Create a spending flag for each user based on the avg price across all their orders.
5. Determine frequent vs non-frequent customers.
6. Export dataframe.  

1. Import libraries and data

In [2]:
import pandas as pd
import os
import numpy as np

In [8]:
# Define path
path = r'/Users/samlisik/Documents/Instacart Basket Analysis'

In [9]:
# Import the ords_prods_merge dataframe
ords_prods_merge = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_v2.pkl'))

In [4]:
# Find the aggregated mean of the "order_number" column grouped by "department_id"
ords_prods_merge.groupby('department_id').agg({'order_number': ['mean']})

department_id,order_number (mean)
1.0,15.457838
2.0,17.27792
3.0,17.170395
4.0,17.811403
5.0,15.213779
6.0,16.439806
7.0,17.225802
8.0,15.34065
9.0,15.895474
10.0,20.197148


Analysis: Comparing full dataframe vs. subset results

The results for the full dataframe show a similar pattern to the subset. The departments with the highest and lowest average order numbers are mostly the same.

However, the mean values are slightly higher in the full dataframe. Since order_number counts a customer's orders in sequence, this suggests that the rows in the full dataset come, on average, from slightly later orders in each customer's history.

Overall, the subset gave a good indication of the general trend, but the full dataframe provides a more accurate picture since it includes all the data.

2. Create a loyalty flag for existing customers using the transform() and loc() functions

In [5]:
# Create max_order column: each row gets the highest order_number of its user
# (the string 'max' avoids the FutureWarning that pandas raises for np.max)
ords_prods_merge['max_order'] = ords_prods_merge.groupby('user_id')['order_number'].transform('max')


In [6]:
# Create a loyalty flag using loc()
ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [7]:
ords_prods_merge['loyalty_flag'].value_counts()

loyalty_flag
Regular customer    15892126
Loyal customer      10294330
New customer         6249785
Name: count, dtype: int64
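
The two steps above can be sketched on a small, made-up dataframe to show what transform() and loc do row by row. The `toy` data below is hypothetical and only illustrates the mechanics; the thresholds are the ones used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): three users with different order histories
toy = pd.DataFrame({
    'user_id':      [1, 1, 1, 2, 2, 3],
    'order_number': [1, 2, 45, 1, 12, 3],
})

# transform('max') returns one value per row, so every row of a user
# carries that user's highest order number
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')

# Same thresholds as the loyalty flag above
toy.loc[toy['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'
toy.loc[(toy['max_order'] <= 40) & (toy['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'
toy.loc[toy['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

print(toy)
```

Because the flag is assigned per row, value_counts() on the flag counts rows (products ordered), not distinct customers.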

3. Compare the spending habits of the 3 identified types of customers

In [8]:
# Aggregate basic statistics of prices by loyalty_flag
ords_prods_merge.groupby('loyalty_flag').agg({'prices': ['mean', 'min', 'max', 'median']})

loyalty_flag,prices (mean),prices (min),prices (max),prices (median)
Loyal customer,10.344655,1.0,99999.0,7.4
New customer,13.042653,1.0,99999.0,7.4
Regular customer,12.393828,1.0,99999.0,7.4


Key Insight:

Loyal customers purchase most often but have the lowest mean price (10.34, versus 12.39 for Regular and 13.04 for New customers). The median is 7.4 in all three groups, however, and every group contains the extreme value 99999.0, so the gap in means may reflect a few outlier prices rather than a real difference in spending habits.
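
To see how strongly one extreme price can move a group mean while leaving the median untouched, here is a minimal sketch on made-up data. The `toy` values are hypothetical; the cutoff of 100 is the one used in the cleaning step of this notebook.

```python
import pandas as pd

# Toy data (hypothetical): identical baskets, except one extreme price
toy = pd.DataFrame({
    'loyalty_flag': ['Loyal customer'] * 4 + ['New customer'] * 4,
    'prices':       [5.0, 7.4, 9.0, 12.0, 5.0, 7.4, 9.0, 99999.0],
})

# Mean and median per group on the raw prices
raw = toy.groupby('loyalty_flag')['prices'].agg(['mean', 'median'])

# Same statistics after leaving out prices above 100
trimmed = toy[toy['prices'] <= 100].groupby('loyalty_flag')['prices'].agg(['mean', 'median'])

print(raw)
print(trimmed)
```

In this sketch the raw mean of the second group is dominated by the single extreme row, while both medians are identical.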

4. Create a spending flag for each user based on the avg price across all their orders

In [9]:
# Step 1: Group by user_id and calculate the mean price per user
ords_prods_merge['avg_price'] = ords_prods_merge.groupby('user_id')['prices'].transform('mean')

In [10]:
# Step 2: Create the spending flag using loc()
ords_prods_merge.loc[ords_prods_merge['avg_price'] < 10, 'spending_flag'] = 'Low spender'
ords_prods_merge.loc[ords_prods_merge['avg_price'] >= 10, 'spending_flag'] = 'High spender'

In [11]:
# Check the result
ords_prods_merge[['user_id', 'avg_price', 'spending_flag']].head(20)

,user_id,avg_price,spending_flag
0,1,6.367797,Low spender
1,1,6.367797,Low spender
2,1,6.367797,Low spender
3,1,6.367797,Low spender
4,1,6.367797,Low spender
5,1,6.367797,Low spender
6,1,6.367797,Low spender
7,1,6.367797,Low spender
8,1,6.367797,Low spender
9,1,6.367797,Low spender


5. Determine frequent vs non-frequent customers

In [13]:
# Step 1: Calculate the median of 'days_since_last_order' per user
ords_prods_merge['median_days_since_last_order'] = ords_prods_merge.groupby('user_id')['days_since_last_order'].transform('median')

In [15]:
# Step 2: Create the order frequency flag from the median (> 20 days: non-frequent; > 10 and <= 20: regular; <= 10: frequent)
ords_prods_merge.loc[ords_prods_merge['median_days_since_last_order'] > 20, 'order_frequency_flag'] = 'Non-frequent customer'
ords_prods_merge.loc[(ords_prods_merge['median_days_since_last_order'] > 10) & (ords_prods_merge['median_days_since_last_order'] <= 20), 'order_frequency_flag'] = 'Regular customer'
ords_prods_merge.loc[ords_prods_merge['median_days_since_last_order'] <= 10, 'order_frequency_flag'] = 'Frequent customer'

In [16]:
# Step 3: Check the results
ords_prods_merge['order_frequency_flag'].value_counts(dropna=False)

order_frequency_flag
Frequent customer        21578719
Regular customer          7217556
Non-frequent customer     3639966
Name: count, dtype: int64
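
As an alternative to three separate loc assignments, the same bins could be expressed in one call with pd.cut. This is only a sketch on a made-up Series (`median_days` is hypothetical), not the method used in this notebook; the bin edges mirror the criteria above.

```python
import numpy as np
import pandas as pd

# Toy medians (hypothetical), including the boundary values and a missing value
median_days = pd.Series([3.0, 10.0, 10.5, 20.0, 21.0, np.nan])

# Right-closed bins: (-inf, 10], (10, 20], (20, inf)
flags = pd.cut(
    median_days,
    bins=[-np.inf, 10, 20, np.inf],
    labels=['Frequent customer', 'Regular customer', 'Non-frequent customer'],
)

print(flags)
```

Missing medians stay missing in the result, which matches the behaviour of the loc approach and is why value_counts(dropna=False) is a useful check.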

In [10]:
# Clean data: mark extreme values (prices above 100) as missing
ords_prods_merge.loc[ords_prods_merge['prices'] > 100, 'prices'] = np.nan

In [11]:
# Max-value check on the 'prices' column
ords_prods_merge['prices'].max()

25.0
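
A minimal sketch of the cleaning step on made-up data, showing how many rows it touches and how later aggregations treat the missing values. The `toy` prices are hypothetical; only the cutoff of 100 comes from the cell above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two users, two extreme prices
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'prices':  [5.0, 99999.0, 14900.0, 25.0],
})

# Count the rows that will be affected before overwriting them
n_extreme = int((toy['prices'] > 100).sum())

# Mark extreme values as missing, as in the cell above
toy.loc[toy['prices'] > 100, 'prices'] = np.nan

# max() and mean() skip NaN by default, so the extremes no longer count
toy['avg_price'] = toy.groupby('user_id')['prices'].transform('mean')

print(n_extreme, toy['prices'].max())
print(toy)
```

Any per-user average computed before the cleaning step would still include the extreme prices, so the order in which these cells run matters.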

6. Export dataframe

In [13]:
# Export the dataframe as a pickle file
ords_prods_merge.to_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'ords_prods_merge_v3.pkl'))