# Projects - Cohort Analysis for Assessing Customer Retention in the E-commerce Industry

## 03 - RFM and Customer Tenure Feature Engineering

In this notebook, we calculate key customer metrics:
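- **Recency:** Days between the customer's last purchase and the reference date (one day after the latest invoice in the dataset)
- **Frequency:** Number of distinct invoices (`InvoiceNo`) per customer
- **Monetary Value:** Total amount spent (sum of `TotalPrice`)
- **Customer Tenure:** Days between the customer's first recorded purchase and the reference date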

These features will be used for customer segmentation, churn prediction, and business insights.

In [2]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [6]:
# Load Data
df = pd.read_csv("../dataset/Cleaned_Dataset_ecommerce2.csv")
df.head(10)

,InvoiceNo,InvoiceDate,CustomerID,StockCode,Description,Quantity,UnitPrice,TotalPrice,Country,InvoiceMonth,InvoiceDate2,Hour
0,536365,2010-12-01 08:26:00,17850.0,SC1734,Electronics,65,10.23,664.95,Egypt,2010-12,2010-12-01,8
1,536365,2010-12-01 08:26:00,17850.0,SC2088,Furniture,95,19.61,1862.95,Mali,2010-12,2010-12-01,8
2,536365,2010-12-01 08:26:00,17850.0,SC3463,Books,78,61.49,4796.22,Mali,2010-12,2010-12-01,8
3,536365,2010-12-01 08:26:00,17850.0,SC6228,Toys,15,24.73,370.95,South Africa,2010-12,2010-12-01,8
4,536365,2010-12-01 08:26:00,17850.0,SC2149,Toys,50,38.83,1941.5,Rwanda,2010-12,2010-12-01,8
5,536365,2010-12-01 08:26:00,17850.0,SC7895,Toys,41,45.31,1857.71,Sierra Leone,2010-12,2010-12-01,8
6,536365,2010-12-01 08:26:00,17850.0,SC8608,Books,44,39.31,1729.64,Benin,2010-12,2010-12-01,8
7,536366,2010-12-01 08:28:00,17850.0,SC3216,Toys,47,77.35,3635.45,Burkina Faso,2010-12,2010-12-01,8
8,536366,2010-12-01 08:28:00,17850.0,SC1236,Kitchenware,19,35.11,667.09,Nigeria,2010-12,2010-12-01,8
9,536367,2010-12-01 08:34:00,13047.0,SC4513,Furniture,55,3.21,176.55,Cote d'Ivoire,2010-12,2010-12-01,8


In [37]:
# Parse the date columns; values that cannot be parsed become NaT
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], errors='coerce')
df['InvoiceDate2'] = pd.to_datetime(df['InvoiceDate2'], errors='coerce')
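
Because `errors='coerce'` replaces unparseable values with `NaT` instead of raising, it can be worth counting how many rows were affected before computing any date differences. The sketch below uses a small hypothetical frame (`toy`) in place of `df`, with one deliberately invalid value; it is an illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for df; the second value is deliberately invalid.
toy = pd.DataFrame(
    {"InvoiceDate": ["2010-12-01 08:26:00", "not a date", "2011-12-09 12:50:00"]}
)
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], errors="coerce")

# Rows that failed to parse are now NaT and can be counted with isna()
n_bad = int(toy["InvoiceDate"].isna().sum())
print(f"unparseable timestamps: {n_bad}")
```

Running the same `isna().sum()` check on `df['InvoiceDate']` would show whether any customer's Recency or Tenure could be affected by missing timestamps.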

In [41]:
df.dtypes

InvoiceNo               object
InvoiceDate     datetime64[ns]
CustomerID             float64
StockCode               object
Description             object
Quantity                 int64
UnitPrice              float64
TotalPrice             float64
Country                 object
InvoiceMonth            object
InvoiceDate2    datetime64[ns]
Hour                     int64
dtype: object

In [49]:
# 1. Set a reference date: one day after the latest invoice, so the smallest possible Recency is 1 day
reference_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)
print(f'ref date : {reference_date}')

ref date : 2011-12-10 12:50:00
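
To make the effect of the one-day offset concrete, here is a small sketch using the reference date printed above. The four last-purchase timestamps are hypothetical and chosen only to show how `.days` truncates to whole days.

```python
import pandas as pd

# Latest invoice implied by the printed reference date (reference minus one day)
last_invoice = pd.Timestamp("2011-12-09 12:50:00")
reference_date = last_invoice + pd.Timedelta(days=1)

# Hypothetical last-purchase timestamps for four customers
last_purchases = pd.Series(pd.to_datetime([
    "2011-12-09 12:50:00",  # exactly 1 day before the reference date
    "2011-12-09 08:00:00",  # 1 day 4 h 50 min before
    "2011-12-08 13:00:00",  # 1 day 23 h 50 min before
    "2011-12-08 12:00:00",  # 2 days 50 min before
]))

# .dt.days keeps only the whole-day part of each difference
recency = (reference_date - last_purchases).dt.days
print(recency.tolist())  # [1, 1, 1, 2]
```

Recency therefore starts at 1 rather than 0, and partial days are dropped rather than rounded.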


In [53]:
# 2. Group by CustomerID and calculate

# Recency
recency_df = (reference_date - df.groupby('CustomerID')['InvoiceDate'].max()).dt.days.to_frame('Recency')

# Frequency
frequency_df = df.groupby('CustomerID')['InvoiceNo'].nunique().to_frame('Frequency')

# Monetary
monetary_df = df.groupby('CustomerID')['TotalPrice'].sum().to_frame('Monetary')
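
As an alternative sketch, the three metrics can also be computed in a single `groupby` with pandas named aggregation, which avoids the later merges. The frame `toy` and its values are hypothetical; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical transactions: customer 1.0 has two invoices, customer 2.0 has one
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01 10:00", "2011-01-01 10:00",
        "2011-03-01 09:00", "2011-02-15 12:00",
    ]),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})
reference_date = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

# One pass over the groups: last purchase, distinct invoices, total spend
summary = toy.groupby("CustomerID").agg(
    last_purchase=("InvoiceDate", "max"),
    Frequency=("InvoiceNo", "nunique"),
    Monetary=("TotalPrice", "sum"),
)
summary["Recency"] = (reference_date - summary["last_purchase"]).dt.days
rfm_toy = summary[["Recency", "Frequency", "Monetary"]]
print(rfm_toy)
```

Note that `nunique` counts invoice `A1` once even though it spans two line items, which is why Frequency reflects orders rather than rows.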

In [59]:
# 3. Merge all three metrics into one dataset

rfm = recency_df.merge(frequency_df, on='CustomerID').merge(monetary_df, on='CustomerID')
rfm.head()

CustomerID,Recency,Frequency,Monetary
12346.0,326,2,5342.4
12347.0,2,7,431501.0
12348.0,75,4,82378.47
12349.0,19,1,176075.12
12350.0,310,1,48173.37


In [69]:
# 4. Calculate Customer Tenure

# First purchase date vs. reference date
tenure_df = df.groupby('CustomerID').agg({'InvoiceDate': lambda x: (reference_date - x.min()).days})
tenure_df.rename(columns={'InvoiceDate': 'Tenure'}, inplace=True)

# Add to RFM. Drop any Tenure column left over from an earlier run first,
# so re-executing this cell does not create Tenure_x / Tenure_y duplicates.
rfm = rfm.drop(columns=['Tenure'], errors='ignore').merge(tenure_df, on='CustomerID')
rfm

CustomerID,Recency,Frequency,Monetary,Tenure
12346.0,326,2,5342.40,326
12347.0,2,7,431501.00,367
12348.0,75,4,82378.47,358
12349.0,19,1,176075.12,19
12350.0,310,1,48173.37,310
...,...,...,...,...
18280.0,278,1,18907.36,278
18281.0,181,1,26009.01,181
18282.0,8,3,36010.74,126
18283.0,4,16,2008747.62,337
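
A few cheap consistency checks can catch problems in the feature table before it is saved. The sketch below rebuilds three rows copied from the output above into a hypothetical frame `rfm_sample` with a single `Tenure` column; the check names are illustrative. A column ending in `_x` or `_y` is the suffix pandas adds when a merge meets a column name that already exists on both sides.

```python
import pandas as pd

# Three rows copied from the notebook output, with one Tenure column
rfm_sample = pd.DataFrame(
    {
        "Recency": [326, 2, 8],
        "Frequency": [2, 7, 3],
        "Monetary": [5342.40, 431501.00, 36010.74],
        "Tenure": [326, 367, 126],
    },
    index=pd.Index([12346.0, 12347.0, 18282.0], name="CustomerID"),
)

checks = {
    # The first purchase can never be later than the last one
    "tenure_ge_recency": bool((rfm_sample["Tenure"] >= rfm_sample["Recency"]).all()),
    # Every customer in the table has at least one invoice
    "frequency_ge_1": bool((rfm_sample["Frequency"] >= 1).all()),
    # One row per customer
    "unique_customers": rfm_sample.index.is_unique,
    # No merge-suffix columns such as Tenure_x / Tenure_y
    "no_suffix_columns": not any(c.endswith(("_x", "_y")) for c in rfm_sample.columns),
}
print(checks)
```

The same dictionary of checks could be evaluated against the full `rfm` frame.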


In [67]:
# 5. Save the features

rfm.to_csv('../dataset/rfm_customer_features.csv', index=True)
#rfm.head()
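
Since the file is written with `index=True`, `CustomerID` is stored as the first column. A later notebook would probably want to restore it as the index when loading. The sketch below demonstrates the round trip on a hypothetical two-row frame written to a temporary directory, so it does not touch `../dataset/`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row feature table, indexed by CustomerID like rfm
rfm_sample = pd.DataFrame(
    {"Recency": [326, 2], "Frequency": [2, 7], "Monetary": [5342.40, 431501.00]},
    index=pd.Index([12346.0, 12347.0], name="CustomerID"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rfm_customer_features.csv")
    rfm_sample.to_csv(path, index=True)
    # index_col restores CustomerID as the index instead of a regular column
    restored = pd.read_csv(path, index_col="CustomerID")

print(restored.index.name)      # CustomerID
print(list(restored.columns))   # ['Recency', 'Frequency', 'Monetary']
```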