# Customer Segmentation with RFM Analysis

Scenario:
You work for Aetna’s Medicare Analytics team and want to segment customers by Recency, Frequency, and Monetary Value (RFM) so that campaigns can target high-value customers. The analysis addresses these questions:
- Who are the top customers?
- Which customers may leave?
- Who could be valuable customers?
- Which customers can be kept?
- Who is most likely to respond to campaigns?

In [None]:
WITH purchase_summary AS (
    SELECT 
        customer_id,
        MAX(purchase_date) AS last_purchase,
        COUNT(order_id) AS frequency,
        SUM(total_amount) AS monetary_value
    FROM transactions
    WHERE purchase_date BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY customer_id
)
SELECT 
    customer_id,
    DATEDIFF(DAY, last_purchase, '2024-01-01') AS recency, -- Days since last purchase
    frequency,
    monetary_value
FROM purchase_summary;
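
The same aggregation can be sketched in pandas for readers who do not have the SQL table at hand. This is a minimal sketch, not the team's actual pipeline: the `transactions` DataFrame below is hypothetical toy data that only reuses the column names from the query, and `snapshot_date` mirrors the `'2024-01-01'` literal.

```python
import pandas as pd

# Hypothetical transactions table; only the column names come from the SQL query.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "order_id": [101, 102, 103, 104, 105, 106],
    "purchase_date": pd.to_datetime([
        "2023-01-15", "2023-11-30", "2023-03-10",
        "2023-06-01", "2023-09-20", "2023-12-31",
    ]),
    "total_amount": [50.0, 75.0, 20.0, 10.0, 30.0, 60.0],
})

# Reference date used to measure recency, matching the SQL literal.
snapshot_date = pd.Timestamp("2024-01-01")

# Keep only 2023 purchases, mirroring the WHERE clause.
in_2023 = transactions[
    transactions["purchase_date"].between(
        pd.Timestamp("2023-01-01"), pd.Timestamp("2023-12-31")
    )
]

# One row per customer: last purchase date, order count, total spend.
rfm = (
    in_2023.groupby("customer_id")
    .agg(
        last_purchase=("purchase_date", "max"),
        frequency=("order_id", "count"),
        monetary_value=("total_amount", "sum"),
    )
    .reset_index()
)

# Recency = whole days between the last purchase and the snapshot date.
rfm["recency"] = (snapshot_date - rfm["last_purchase"]).dt.days
rfm = rfm[["customer_id", "recency", "frequency", "monetary_value"]]
print(rfm)
```

The result has the same four columns as the `rfm_data.csv` file loaded in the next section.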


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

## Get Input Data

In [68]:
df = pd.read_csv('/Users/rosiebai/Downloads/rfm_data.csv')
df.head()

Unnamed: 0,customer_id,recency,frequency,monetary_value
0,1,103,44,2137.52
1,2,349,2,1420.3
2,3,271,13,2982.13
3,4,107,40,4566.2
4,5,72,2,1092.78


In [69]:
df.shape

(1000, 4)

## Who are the top customers? 


In [70]:
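# Score customers by recency: a lower recency value means a more recent purchase
df_sorted = df.sort_values(by='recency', ascending=True)
# Create 5 equal-sized bins and assign scores from 5 (most recent 20%) down to 1 (least recent 20%)
df_sorted['recency_score'] = pd.qcut(df_sorted['recency'], q=5, labels=[5, 4, 3, 2, 1])
# Note: this filter displays the least recent 20% (score 1); the top 20% by recency is recency_score == 5
df_sorted[df_sorted['recency_score'] == 1]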

Unnamed: 0,customer_id,recency,frequency,monetary_value,recency_score
168,169,286,21,1321.04,1
318,319,286,5,1558.16,1
713,714,286,32,3795.92,1
557,558,286,49,2270.62,1
790,791,286,47,1821.10,1
...,...,...,...,...,...
556,557,361,30,1873.13,1
644,645,361,11,1272.55,1
893,894,363,32,514.59,1
627,628,364,34,710.73,1


In [71]:
df['recency'].describe()

count    1000.000000
mean      181.374000
std       103.360018
min         1.000000
25%        97.750000
50%       180.000000
75%       268.000000
max       364.000000
Name: recency, dtype: float64
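
Because the direction of the recency score is easy to get backwards, a toy example may help. This is an illustrative sketch on made-up values (the ten-element `recency` series is hypothetical), showing that reversed labels in `pd.qcut` give the smallest recency values the highest score.

```python
import pandas as pd

# Toy recency values in days since last purchase (hypothetical data).
recency = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Reversed labels: the lowest quintile of recency (most recent buyers) gets 5.
recency_score = pd.qcut(recency, q=5, labels=[5, 4, 3, 2, 1])

demo = pd.DataFrame({"recency": recency, "recency_score": recency_score})
print(demo)
```

Under this convention, `recency_score == 5` selects the most recent buyers and `recency_score == 1` selects customers whose last purchase was longest ago.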

In [72]:
# Top 20% of customers by frequency
df_sorted = df_sorted.sort_values(by='frequency', ascending=True)
# Create 5 bins and assign scores from 1 (least frequent) to 5 (most frequent)
df_sorted['frequency_score'] = pd.qcut(df_sorted['frequency'], q=5, labels=[1, 2, 3, 4, 5])
# Only the last expression in a cell is displayed, so keep the top-quintile IDs in a variable
top_frequency_ids = df_sorted[df_sorted['frequency_score'] == 5]['customer_id']
df_sorted.head()

Unnamed: 0,customer_id,recency,frequency,monetary_value,recency_score,frequency_score
752,753,51,1,4918.43,5,1
427,428,158,1,3313.8,3,1
567,568,233,1,3981.28,2,1
810,811,34,1,675.01,5,1
239,240,259,1,1407.1,2,1


In [73]:
df['frequency'].describe()

count    1000.000000
mean       25.394000
std        13.864811
min         1.000000
25%        13.000000
50%        26.000000
75%        37.000000
max        49.000000
Name: frequency, dtype: float64

In [74]:
# Top 20% of customers by monetary value
df_sorted = df_sorted.sort_values(by='monetary_value', ascending=True)
# Create 5 bins and assign scores from 1 (lowest spend) to 5 (highest spend)
df_sorted['monetary_score'] = pd.qcut(df_sorted['monetary_value'], q=5, labels=[1, 2, 3, 4, 5])
df_sorted[df_sorted['monetary_score'] == 5]['customer_id']

121    122
741    742
471    472
535    536
250    251
      ... 
886    887
555    556
370    371
597    598
107    108
Name: customer_id, Length: 200, dtype: int64
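
The three scores can also be combined so that "top customers" means strong on all dimensions rather than on one at a time. The sketch below is one possible approach, not the notebook's method: the data is randomly generated, the helper `quintile_score` is hypothetical, and the cutoff of 4 or higher on every score is an assumption. Ranking before binning is a variation that avoids duplicate bin edges when many values tie, at the cost of splitting tied values across bins arbitrarily.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's df; values are random, not the real data.
rng = np.random.default_rng(0)
n = 500
rfm = pd.DataFrame({
    "customer_id": np.arange(1, n + 1),
    "recency": rng.integers(1, 365, size=n),
    "frequency": rng.integers(1, 50, size=n),
    "monetary_value": rng.uniform(100, 5000, size=n).round(2),
})


def quintile_score(values, higher_is_better=True):
    """Hypothetical helper: score a column from 1 to 5 by quintile."""
    labels = [1, 2, 3, 4, 5] if higher_is_better else [5, 4, 3, 2, 1]
    # Rank first so ties cannot produce duplicate bin edges in qcut.
    ranked = values.rank(method="first")
    return pd.qcut(ranked, q=5, labels=labels).astype(int)


# Low recency is good, so its labels are reversed.
rfm["recency_score"] = quintile_score(rfm["recency"], higher_is_better=False)
rfm["frequency_score"] = quintile_score(rfm["frequency"])
rfm["monetary_score"] = quintile_score(rfm["monetary_value"])

score_cols = ["recency_score", "frequency_score", "monetary_score"]
rfm["rfm_total"] = rfm[score_cols].sum(axis=1)

# Assumed definition of a top customer: 4 or 5 on every dimension.
top_customers = rfm[(rfm[score_cols] >= 4).all(axis=1)]
print(top_customers.sort_values("rfm_total", ascending=False).head())
```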

## Which customers may leave?

In [75]:
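# At-risk customers have a high recency value (long since last purchase), low frequency, and low monetary value.
# Note: a high recency value corresponds to recency_score == 1; the recency_score == 5 filter below returns recent buyers with low frequency and low spend.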
df_sorted[(df_sorted['recency_score'] == 5) & (df_sorted['frequency_score']== 1) & (df_sorted['monetary_score'] == 1)]

Unnamed: 0,customer_id,recency,frequency,monetary_value,recency_score,frequency_score,monetary_score
108,109,41,5,288.65,5,1,1
583,584,39,11,339.83,5,1,1
249,250,39,8,384.68,5,1,1
922,923,34,9,401.11,5,1,1
985,986,1,4,446.89,5,1,1
603,604,7,5,635.76,5,1,1
810,811,34,1,675.01,5,1,1
634,635,39,3,836.6,5,1,1
267,268,37,3,917.75,5,1,1
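
For comparison, a filter that matches the stated definition of an at-risk customer (long time since last purchase, few purchases, low spend) could look like the sketch below. The `scored` table is hypothetical toy data that follows the notebook's scoring convention, where `recency_score` 1 means the last purchase was longest ago.

```python
import pandas as pd

# Hypothetical, already-scored customers.
# Convention: recency_score 5 = purchased most recently, 1 = longest ago.
scored = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "recency_score": [1, 5, 1, 2],
    "frequency_score": [1, 1, 5, 2],
    "monetary_score": [1, 1, 4, 1],
})

# At risk: lowest quintile on all three scores.
at_risk = scored[
    (scored["recency_score"] == 1)
    & (scored["frequency_score"] == 1)
    & (scored["monetary_score"] == 1)
]
print(at_risk["customer_id"].tolist())
```

Only customer 1 qualifies here; customer 2 has the same low frequency and spend but bought recently, which is closer to a new customer than a departing one.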


## Who could be valuable customers?

In [88]:
# Assumption: potentially valuable customers are those in the top monetary quintile
df_sorted[df_sorted['monetary_score'] == 5]

Unnamed: 0,customer_id,recency,frequency,monetary_value,recency_score,frequency_score,monetary_score
121,122,360,27,3999.18,1,3,5
741,742,358,4,3999.68,1,1,5
471,472,204,45,4003.64,3,5,5
535,536,277,13,4010.35,2,2,5
250,251,338,29,4011.76,1,3,5
...,...,...,...,...,...,...,...
886,887,14,16,4977.42,5,2,5
555,556,355,1,4981.85,1,1,5
370,371,176,42,4983.65,3,5,5
597,598,239,26,4991.82,2,3,5


## Which customers can be kept?

In [84]:
# Customers worth retaining: mid-range scores (2 to 4) on all three dimensions, neither the top nor the bottom quintile
df_sorted[(df_sorted['monetary_score'].isin([2,3,4])) & (df_sorted['recency_score'].isin([2,3,4])) & (df_sorted['frequency_score'].isin([2,3,4]))]

Unnamed: 0,customer_id,recency,frequency,monetary_value,recency_score,frequency_score,monetary_score
490,491,147,23,1027.48,4,3,2
673,674,143,33,1030.58,4,4,2
709,710,197,29,1046.75,3,3,2
81,82,213,30,1049.17,3,3,2
754,755,205,26,1049.98,3,3,2
...,...,...,...,...,...,...,...
708,709,256,13,3942.08,2,2,4
831,832,155,27,3952.81,3,3,4
317,318,277,20,3960.65,2,2,4
738,739,197,37,3974.63,3,4,4


## Who is most likely to respond to campaigns?

In [87]:
# Assumption: the most frequent purchasers (top frequency quintile) are the most likely to respond
df_sorted[df_sorted['frequency_score'] == 5]

Unnamed: 0,customer_id,recency,frequency,monetary_value,recency_score,frequency_score,monetary_score
992,993,356,41,100.15,1,5,1
78,79,252,47,198.37,2,5,1
825,826,360,46,214.29,1,5,1
65,66,106,47,255.79,4,5,1
437,438,117,44,256.28,4,5,1
...,...,...,...,...,...,...,...
470,471,238,41,4844.02,2,5,5
324,325,4,44,4879.27,5,5,5
473,474,341,47,4900.85,1,5,5
361,362,288,41,4919.21,1,5,5


## K-means clustering

In [3]:
# Apply KMeans clustering for segmentation
kmeans = KMeans(n_clusters=4, random_state = 123)
df['segment'] = kmeans.fit_predict(df[['recency','frequency','monetary_value']])

In [4]:
centroids = kmeans.cluster_centers_

In [10]:
centroids_df = pd.DataFrame(centroids, columns = ['recency', 'frequency', 'monetary_value'])
centroids_df

Unnamed: 0,recency,frequency,monetary_value
0,173.761905,26.174603,657.414246
1,184.569811,25.101887,3140.779925
2,191.995798,25.369748,1923.335504
3,175.428571,24.930612,4372.74951
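
The centroids above differ almost entirely in `monetary_value` (roughly 657 to 4,373), while recency and frequency are nearly identical across clusters. That is the pattern one would expect when K-means runs on unscaled features, because Euclidean distance is dominated by the column with the largest numeric range. A common remedy is to standardize the features first. The sketch below illustrates this on randomly generated data, not the real `rfm_data.csv`; the choice of `StandardScaler` and `n_init=10` is an assumption, not something the notebook specifies.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's df; values are random, not the real data.
rng = np.random.default_rng(123)
n = 300
rfm = pd.DataFrame({
    "recency": rng.integers(1, 365, size=n),
    "frequency": rng.integers(1, 50, size=n),
    "monetary_value": rng.uniform(100, 5000, size=n),
})

features = ["recency", "frequency", "monetary_value"]

# Standardize each feature to zero mean and unit variance so that
# no single column dominates the distance calculation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(rfm[features])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=123)
rfm["segment"] = kmeans.fit_predict(X_scaled)

# Convert centroids back to the original units for interpretation.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features
)
print(centroids.round(1))
```

With scaled inputs, clusters can separate along recency and frequency as well as spend, which makes labels such as "Loyal" or "New" easier to justify from the centroids.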


In [11]:
# Map segments to labels
segment_map = {0: 'New Customers', 1: 'Loyal Customers', 2: 'Low Value Customers', 3: 'High Value Customers'}
# Attach the label to each customer
df['segment_label'] = df['segment'].map(segment_map)

## Key Characteristics of Customer Labels

- New customer: low recency value, low monetary value, and low frequency, because they joined the brand only recently.
- Low value customer: high recency value, low or medium monetary value, and low purchase frequency. They joined a while ago but have made few purchases, and their last purchase was a long time ago.
- High value customer: high monetary value, but not necessarily high frequency (it may be low or medium) and not necessarily low recency.
- Loyal customer: high frequency and low recency value (a recent purchase), but not necessarily high monetary value.
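
The characteristics listed above can also be written as explicit rules on the quintile scores. This is only a sketch: the function `label_customer`, the score thresholds, and the order in which the rules are checked are all assumptions chosen for illustration, and customers matching no rule are left as "Unlabeled". Scores follow the notebook's convention, where `recency_score` 5 means the most recent purchase.

```python
def label_customer(recency_score, frequency_score, monetary_score):
    """Hypothetical rule-based labeling from 1-5 RFM scores.

    Rules are checked in order, so a customer matching several
    descriptions receives the first matching label.
    """
    # High value: high spend, regardless of frequency or recency.
    if monetary_score >= 4:
        return "High Value Customers"
    # Loyal: buys often and bought recently; spend need not be high.
    if frequency_score >= 4 and recency_score >= 4:
        return "Loyal Customers"
    # New: bought recently but has few purchases and low spend so far.
    if recency_score >= 4 and frequency_score <= 2 and monetary_score <= 2:
        return "New Customers"
    # Low value: last purchase long ago, few purchases, low or medium spend.
    if recency_score <= 2 and frequency_score <= 2 and monetary_score <= 3:
        return "Low Value Customers"
    return "Unlabeled"


examples = [(5, 1, 1), (1, 1, 2), (3, 2, 5), (5, 5, 2), (3, 3, 3)]
for r, f, m in examples:
    print((r, f, m), "->", label_customer(r, f, m))
```

Rules like these are transparent and easy to audit, which can complement the K-means segments when cluster centroids are hard to interpret.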

## Insights from RFM Segmentation


- High Value Customers → Target with exclusive Medicare plans.
- New Customers → Educate on additional services.
- Loyal Customers → Retain with rewards programs.
- Low Value Customers → Reactivation campaigns.