# RFM Analysis - CDNOW Dataset (Beginner-Friendly)

**Dataset**: CDNOW Customer Data

**Source**: GitHub mirrors, the BTYD R package, or academic dataset collections (see Section 1)

**Description**: Classic RFM teaching dataset from an online CD retailer (1997-1998)

**Complexity**: Low - Perfect for learning RFM basics

## Dataset Features
- Small, clean dataset (~70K transactions, ~23K customers)
- Simple structure: CustomerID, Date, Quantity, Amount
- No missing values (rows with a zero or negative amount, if present, are filtered out in Section 2)
- Ideal for quick RFM prototyping

## Learning Objectives
1. Basic RFM calculation
2. Scoring methods
3. Simple segmentation
4. Quick visualizations
5. Actionable insights

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Load Data

The CDNOW dataset typically has these columns:
- Customer ID
- Transaction Date
- Number of CDs purchased
- Total Amount

In [None]:
# Load CDNOW dataset
# Common sources:
# 1. GitHub: https://github.com/boboppie/CDNOW_RFM
# 2. R package: BTYD
# 3. Academic datasets

# Option 1: Load from text file (typical format)
# Column names: customer_id, date, num_cds, amount
# A raw string keeps '\s' from being read as an invalid string escape
df = pd.read_csv('CDNOW_master.txt', sep=r'\s+', header=None,
                 names=['customer_id', 'date', 'num_cds', 'amount'])

# Option 2: If CSV
# df = pd.read_csv('cdnow.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
df.head(10)
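
If the file is not at hand yet, the parsing step can be tried on a few inline rows. The sketch below assumes the whitespace-separated, header-less layout described above; the three sample rows and the names `sample_text` and `sample_df` are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical sample rows in the assumed layout:
# customer_id, date (YYYYMMDD), num_cds, amount
sample_text = """\
00001 19970105 1 11.50
00002 19970112 2 25.00
00002 19970320 1 14.99
"""

sample_df = pd.read_csv(
    io.StringIO(sample_text),
    sep=r'\s+',  # one or more whitespace characters between fields
    header=None,
    names=['customer_id', 'date', 'num_cds', 'amount'],
)

# Dates are read as integers such as 19970105, so convert them explicitly
sample_df['date'] = pd.to_datetime(sample_df['date'].astype(str), format='%Y%m%d')

print(sample_df)
print(sample_df.dtypes)
```

Note that zero-padded IDs such as `00001` are read as integers by default; pass `dtype={'customer_id': str}` to `read_csv` if the padding matters.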

In [None]:
# Basic data exploration
print("Data Info:")
df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nBasic Statistics:")
print(df.describe())
print(f"\nUnique Customers: {df['customer_id'].nunique():,}")
print(f"Total Transactions: {len(df):,}")

## 2. Data Preparation

In [None]:
# Convert date to datetime
# CDNOW dates are typically in YYYYMMDD format
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')

# Check for any data quality issues
print("Missing values:")
print(df.isnull().sum())

print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
print(f"Time span: {(df['date'].max() - df['date'].min()).days} days")

# Check for negative values
print(f"\nNegative amounts: {(df['amount'] < 0).sum()}")
print(f"Zero amounts: {(df['amount'] == 0).sum()}")

In [None]:
# Keep only transactions with a positive amount
# (drops the zero and negative amounts counted above)
df_clean = df[df['amount'] > 0].copy()
print(f"Clean dataset: {len(df_clean):,} transactions ({len(df) - len(df_clean):,} removed)")

## 3. Calculate RFM Metrics

### Step-by-step RFM calculation for learning

In [None]:
# Set analysis date (day after last transaction)
analysis_date = df_clean['date'].max() + timedelta(days=1)
print(f"Analysis Date: {analysis_date}")

# Step 1: Calculate Recency (days since last purchase)
recency = df_clean.groupby('customer_id')['date'].max().reset_index()
recency['Recency'] = (analysis_date - recency['date']).dt.days
recency = recency[['customer_id', 'Recency']]

print("\nRecency calculated:")
print(recency.head())
print(f"Recency range: {recency['Recency'].min()} to {recency['Recency'].max()} days")

In [None]:
# Step 2: Calculate Frequency (number of purchases)
frequency = df_clean.groupby('customer_id')['date'].count().reset_index()
frequency.columns = ['customer_id', 'Frequency']

print("Frequency calculated:")
print(frequency.head())
print(f"\nFrequency distribution:")
print(frequency['Frequency'].describe())

In [None]:
# Step 3: Calculate Monetary (total spend)
monetary = df_clean.groupby('customer_id')['amount'].sum().reset_index()
monetary.columns = ['customer_id', 'Monetary']

print("Monetary calculated:")
print(monetary.head())
print(f"\nMonetary distribution:")
print(monetary['Monetary'].describe())

In [None]:
# Step 4: Combine into RFM table
rfm = recency.merge(frequency, on='customer_id')
rfm = rfm.merge(monetary, on='customer_id')

print("RFM Table Created:")
print(rfm.head(10))
print(f"\nRFM Summary Statistics:")
print(rfm[['Recency', 'Frequency', 'Monetary']].describe())
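
The four steps above can also be collapsed into one `groupby` with named aggregation. The sketch below runs on a tiny made-up transaction table (the rows and the `toy*` names are illustrative, not CDNOW data) so the result can be checked by hand.

```python
import pandas as pd

# Made-up transactions for three customers
toy = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'date': pd.to_datetime(['1997-01-05', '1997-03-01', '1997-02-10',
                            '1997-01-20', '1997-02-15', '1997-03-10']),
    'amount': [10.0, 20.0, 15.0, 5.0, 5.0, 30.0],
})

# Reference date: the day after the last transaction, as in the notebook
toy_analysis_date = toy['date'].max() + pd.Timedelta(days=1)

# One pass over the data: last purchase date, purchase count, total spend
toy_rfm = toy.groupby('customer_id').agg(
    last_purchase=('date', 'max'),
    Frequency=('date', 'count'),
    Monetary=('amount', 'sum'),
).reset_index()

toy_rfm['Recency'] = (toy_analysis_date - toy_rfm['last_purchase']).dt.days
toy_rfm = toy_rfm[['customer_id', 'Recency', 'Frequency', 'Monetary']]
print(toy_rfm)
```

Customer 2, for example, last bought on 1997-02-10, which is 29 days before the reference date of 1997-03-11, with one purchase totalling 15.0.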

## 4. RFM Visualization

In [None]:
# Distribution of RFM metrics
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Recency
axes[0].hist(rfm['Recency'], bins=50, color='skyblue', edgecolor='black')
axes[0].axvline(rfm['Recency'].mean(), color='red', linestyle='--', 
                linewidth=2, label=f'Mean: {rfm["Recency"].mean():.0f} days')
axes[0].set_xlabel('Recency (days)', fontsize=12)
axes[0].set_ylabel('Number of Customers', fontsize=12)
axes[0].set_title('Recency Distribution', fontsize=14, fontweight='bold')
axes[0].legend()

# Frequency
axes[1].hist(rfm['Frequency'], bins=50, color='lightgreen', edgecolor='black')
axes[1].axvline(rfm['Frequency'].mean(), color='red', linestyle='--', 
                linewidth=2, label=f'Mean: {rfm["Frequency"].mean():.1f}')
axes[1].set_xlabel('Frequency (# purchases)', fontsize=12)
axes[1].set_ylabel('Number of Customers', fontsize=12)
axes[1].set_title('Frequency Distribution', fontsize=14, fontweight='bold')
axes[1].legend()

# Monetary
axes[2].hist(rfm['Monetary'], bins=50, color='salmon', edgecolor='black')
axes[2].axvline(rfm['Monetary'].mean(), color='red', linestyle='--', 
                linewidth=2, label=f'Mean: ${rfm["Monetary"].mean():.2f}')
axes[2].set_xlabel('Monetary ($)', fontsize=12)
axes[2].set_ylabel('Number of Customers', fontsize=12)
axes[2].set_title('Monetary Distribution', fontsize=14, fontweight='bold')
axes[2].legend()

plt.tight_layout()
plt.show()

## 5. RFM Scoring

### Method 1: Quintile-based Scoring (1-5)

In [None]:
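# Create RFM scores using quintiles
# Note: Lower recency is better, so the Recency labels are reversed (5 = most recent).
# Frequency is ranked first because purchase counts are heavily tied; ranking
# makes the quintile edges unique. Recency and Monetary are binned directly:
# with fixed labels, qcut raises a ValueError if their quintile edges collide,
# even with duplicates='drop' (the bin and label counts would no longer match).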

rfm['R_Score'] = pd.qcut(rfm['Recency'], q=5, labels=[5, 4, 3, 2, 1], duplicates='drop').astype(int)
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), q=5, labels=[1, 2, 3, 4, 5], duplicates='drop').astype(int)
rfm['M_Score'] = pd.qcut(rfm['Monetary'], q=5, labels=[1, 2, 3, 4, 5], duplicates='drop').astype(int)

# Combined RFM score
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)

# Total score
rfm['RFM_Total'] = rfm['R_Score'] + rfm['F_Score'] + rfm['M_Score']

print("RFM with Scores:")
print(rfm.head(10))

print("\nScore Distribution:")
print(f"R_Score: {rfm['R_Score'].value_counts().sort_index().to_dict()}")
print(f"F_Score: {rfm['F_Score'].value_counts().sort_index().to_dict()}")
print(f"M_Score: {rfm['M_Score'].value_counts().sort_index().to_dict()}")
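
Frequency is ranked before binning because purchase counts are heavily tied, which makes several quintile edges coincide. A minimal sketch of the difference, on a made-up frequency column (`freq` is illustrative, not CDNOW data):

```python
import pandas as pd

# Made-up frequencies: six of ten customers bought exactly once
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 4, 5])

# Binning the raw values fails: several quintile edges are equal to 1
try:
    pd.qcut(freq, q=5, labels=[1, 2, 3, 4, 5])
    raw_binning_ok = True
except ValueError as err:
    raw_binning_ok = False
    print(f"Raw qcut failed: {err}")

# Ranking first gives every row a distinct value, so the edges are unique
f_score = pd.qcut(freq.rank(method='first'), q=5, labels=[1, 2, 3, 4, 5]).astype(int)
print(f_score.value_counts().sort_index().to_dict())
```

The trade-off is that customers with identical frequency can land in different score groups, split by row order. Here the six one-time buyers are spread over scores 1 to 3.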

In [None]:
# Visualize score distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

rfm['R_Score'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='skyblue')
axes[0].set_xlabel('R Score', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_title('Recency Score Distribution', fontsize=14, fontweight='bold')
axes[0].tick_params(axis='x', rotation=0)

rfm['F_Score'].value_counts().sort_index().plot(kind='bar', ax=axes[1], color='lightgreen')
axes[1].set_xlabel('F Score', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Frequency Score Distribution', fontsize=14, fontweight='bold')
axes[1].tick_params(axis='x', rotation=0)

rfm['M_Score'].value_counts().sort_index().plot(kind='bar', ax=axes[2], color='salmon')
axes[2].set_xlabel('M Score', fontsize=12)
axes[2].set_ylabel('Count', fontsize=12)
axes[2].set_title('Monetary Score Distribution', fontsize=14, fontweight='bold')
axes[2].tick_params(axis='x', rotation=0)

plt.tight_layout()
plt.show()

## 6. Simple Customer Segmentation

In [None]:
# Simple 4-segment approach based on RFM_Total
# Total score ranges from 3 to 15

def simple_segment(score):
    if score >= 12:
        return 'Best Customers'
    elif score >= 9:
        return 'High Value'
    elif score >= 6:
        return 'Medium Value'
    else:
        return 'Low Value'

rfm['Simple_Segment'] = rfm['RFM_Total'].apply(simple_segment)

print("Simple Segmentation Distribution:")
print(rfm['Simple_Segment'].value_counts())

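# Visualize
# value_counts() sorts by count, so fix the segment order to keep each color
# attached to the intended segment (green = best, red = lowest)
segment_order = ['Best Customers', 'High Value', 'Medium Value', 'Low Value']
segment_counts = rfm['Simple_Segment'].value_counts().reindex(segment_order, fill_value=0)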

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

segment_counts.plot(kind='bar', ax=axes[0], color=['green', 'lightgreen', 'orange', 'red'])
axes[0].set_xlabel('Segment', fontsize=12)
axes[0].set_ylabel('Number of Customers', fontsize=12)
axes[0].set_title('Customer Segmentation', fontsize=14, fontweight='bold')
axes[0].tick_params(axis='x', rotation=45)

axes[1].pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%', 
            startangle=90, colors=['green', 'lightgreen', 'orange', 'red'])
axes[1].set_title('Segment Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()
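
The same thresholds can be applied without a Python-level function call per row by using `pd.cut`. This is a sketch of an alternative, not a change to the cell above; the bin edges are chosen to reproduce `simple_segment` for integer totals from 3 to 15, and `totals` and `segments` are illustrative names.

```python
import pandas as pd

# Every possible RFM_Total value
totals = pd.Series(range(3, 16))

# Right-closed bins: (2, 5], (5, 8], (8, 11], (11, 15]
segments = pd.cut(
    totals,
    bins=[2, 5, 8, 11, 15],
    labels=['Low Value', 'Medium Value', 'High Value', 'Best Customers'],
)

print(pd.DataFrame({'RFM_Total': totals, 'Segment': segments}))
```

`pd.cut` returns an ordered categorical, so a later `groupby` on the segment column follows the Low to Best order rather than alphabetical order.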

## 7. Segment Analysis

In [None]:
# Analyze each segment
segment_analysis = rfm.groupby('Simple_Segment').agg({
    'customer_id': 'count',
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': ['mean', 'sum']
}).round(2)

segment_analysis.columns = ['_'.join(col).strip() for col in segment_analysis.columns.values]
segment_analysis = segment_analysis.rename(columns={'customer_id_count': 'Customer_Count'})

print("Segment Analysis:")
print(segment_analysis)

# Calculate revenue contribution
total_revenue = rfm['Monetary'].sum()
segment_analysis['Revenue_Percentage'] = (segment_analysis['Monetary_sum'] / total_revenue * 100).round(2)

print("\nRevenue Contribution by Segment:")
print(segment_analysis[['Customer_Count', 'Monetary_sum', 'Revenue_Percentage']])

In [None]:
# Visualize segment metrics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Average metrics by segment
segment_avg = rfm.groupby('Simple_Segment')[['Recency', 'Frequency', 'Monetary']].mean()

segment_avg['Recency'].plot(kind='bar', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Average Recency by Segment', fontsize=12, fontweight='bold')
axes[0, 0].set_ylabel('Days')
axes[0, 0].tick_params(axis='x', rotation=45)

segment_avg['Frequency'].plot(kind='bar', ax=axes[0, 1], color='lightgreen')
axes[0, 1].set_title('Average Frequency by Segment', fontsize=12, fontweight='bold')
axes[0, 1].set_ylabel('Purchases')
axes[0, 1].tick_params(axis='x', rotation=45)

segment_avg['Monetary'].plot(kind='bar', ax=axes[1, 0], color='salmon')
axes[1, 0].set_title('Average Monetary by Segment', fontsize=12, fontweight='bold')
axes[1, 0].set_ylabel('$')
axes[1, 0].tick_params(axis='x', rotation=45)

# Revenue contribution
segment_revenue = rfm.groupby('Simple_Segment')['Monetary'].sum().sort_values(ascending=False)
segment_revenue.plot(kind='bar', ax=axes[1, 1], color='darkgreen')
axes[1, 1].set_title('Total Revenue by Segment', fontsize=12, fontweight='bold')
axes[1, 1].set_ylabel('$')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

## 8. Top Customers

In [None]:
# Find best customers (highest RFM scores; ties in RFM_Total broken by total spend)
top_20 = rfm.sort_values(['RFM_Total', 'Monetary'], ascending=False).head(20)[
    ['customer_id', 'Recency', 'Frequency', 'Monetary',
     'RFM_Score', 'RFM_Total', 'Simple_Segment']]

print("Top 20 Customers:")
print(top_20)

# Best customers summary
best_customers = rfm[rfm['Simple_Segment'] == 'Best Customers']
print(f"\nBest Customers Segment:")
print(f"  Count: {len(best_customers)} ({len(best_customers)/len(rfm)*100:.1f}%)")
print(f"  Total Revenue: ${best_customers['Monetary'].sum():,.2f}")
print(f"  Revenue %: {best_customers['Monetary'].sum()/rfm['Monetary'].sum()*100:.1f}%")
print(f"  Avg Value: ${best_customers['Monetary'].mean():,.2f}")

## 9. Simple Business Insights

In [None]:
# Key metrics
total_customers = len(rfm)
total_revenue = rfm['Monetary'].sum()
avg_customer_value = rfm['Monetary'].mean()
avg_frequency = rfm['Frequency'].mean()
avg_recency = rfm['Recency'].mean()

print("=" * 60)
print("CDNOW RFM ANALYSIS - KEY INSIGHTS")
print("=" * 60)

print(f"\n1. OVERALL METRICS")
print(f"   Total Customers: {total_customers:,}")
print(f"   Total Revenue: ${total_revenue:,.2f}")
print(f"   Average Customer Value: ${avg_customer_value:.2f}")
print(f"   Average Purchases per Customer: {avg_frequency:.1f}")
print(f"   Average Days Since Last Purchase: {avg_recency:.0f}")

print(f"\n2. SEGMENT BREAKDOWN")
for segment in ['Best Customers', 'High Value', 'Medium Value', 'Low Value']:
    seg_data = rfm[rfm['Simple_Segment'] == segment]
    count = len(seg_data)
    revenue = seg_data['Monetary'].sum()
    print(f"\n   {segment}:")
    print(f"   - Count: {count:,} ({count/total_customers*100:.1f}%)")
    print(f"   - Revenue: ${revenue:,.2f} ({revenue/total_revenue*100:.1f}%)")
    print(f"   - Avg Value: ${seg_data['Monetary'].mean():.2f}")

print(f"\n3. QUICK WINS")
recent_high_value = rfm[(rfm['R_Score'] >= 4) & (rfm['M_Score'] >= 4) & (rfm['F_Score'] <= 2)]
print(f"   Recent high-spenders with low frequency: {len(recent_high_value)}")
print(f"   → ACTION: Increase purchase frequency with targeted offers")

at_risk = rfm[(rfm['R_Score'] <= 2) & (rfm['F_Score'] >= 3)]
print(f"\n   At-risk customers (used to buy frequently): {len(at_risk)}")
print(f"   Historical spend of this group: ${at_risk['Monetary'].sum():,.2f}")
print(f"   → ACTION: Re-engagement campaign")

print("\n" + "=" * 60)
print("\nACTION PLAN:")
print("1. Reward 'Best Customers' with loyalty program")
print("2. Upsell to 'High Value' customers")
print("3. Re-engage 'At Risk' customers with special offers")
print("4. Convert 'Medium Value' to 'High Value' with bundles")
print("=" * 60)

## 10. Export Results

In [None]:
# Export RFM results
rfm.to_csv('cdnow_rfm_results.csv', index=False)
print("RFM results exported to: cdnow_rfm_results.csv")

# Export segment summary
segment_analysis.to_csv('cdnow_segment_summary.csv')
print("Segment summary exported to: cdnow_segment_summary.csv")

# Export top customers (ties in RFM_Total broken by total spend)
top_100 = rfm.sort_values(['RFM_Total', 'Monetary'], ascending=False).head(100)
top_100.to_csv('cdnow_top_100_customers.csv', index=False)
print("Top 100 customers exported to: cdnow_top_100_customers.csv")

## Bonus: Quick RFM Function

Here's a reusable function for quick RFM analysis on any similar dataset:

In [None]:
def quick_rfm_analysis(data, customer_col, date_col, amount_col, analysis_date=None):
    """
    Quick RFM analysis function
    
    Parameters:
    - data: DataFrame with transaction data (date_col must already be datetime)
    - customer_col: name of customer ID column
    - date_col: name of date column
    - amount_col: name of amount column
    - analysis_date: reference date (default: day after max date)
    
    Returns:
    - DataFrame with one row per customer: ID, RFM metrics, scores, and segment
    """
    if analysis_date is None:
        analysis_date = data[date_col].max() + timedelta(days=1)
    analysis_date = pd.Timestamp(analysis_date)
    
    # Calculate RFM; reset_index() keeps the customer ID as a regular column
    rfm = data.groupby(customer_col).agg(
        last_purchase=(date_col, 'max'),
        Frequency=(date_col, 'count'),
        Monetary=(amount_col, 'sum')
    ).reset_index()
    rfm['Recency'] = (analysis_date - rfm['last_purchase']).dt.days
    rfm = rfm[[customer_col, 'Recency', 'Frequency', 'Monetary']].copy()
    
    # Score: rank before binning so tied values cannot produce duplicate bin edges
    # (tied customers may then be split across adjacent scores by row order)
    def quintile(values, labels):
        return pd.qcut(values.rank(method='first'), q=5, labels=labels).astype(int)
    
    rfm['R_Score'] = quintile(rfm['Recency'], [5, 4, 3, 2, 1])
    rfm['F_Score'] = quintile(rfm['Frequency'], [1, 2, 3, 4, 5])
    rfm['M_Score'] = quintile(rfm['Monetary'], [1, 2, 3, 4, 5])
    
    rfm['RFM_Total'] = rfm['R_Score'] + rfm['F_Score'] + rfm['M_Score']
    
    # Simple segmentation
    rfm['Segment'] = rfm['RFM_Total'].apply(lambda x: 
        'Best' if x >= 12 else 'High' if x >= 9 else 'Medium' if x >= 6 else 'Low')
    
    return rfm

# Example usage:
# rfm_quick = quick_rfm_analysis(df_clean, 'customer_id', 'date', 'amount')
# print(rfm_quick.head())

## Summary

This notebook demonstrated:
1. ✅ Loading and preparing CDNOW data
2. ✅ Step-by-step RFM calculation
3. ✅ Scoring methodology
4. ✅ Simple customer segmentation
5. ✅ Visual analysis
6. ✅ Actionable business insights
7. ✅ Reusable RFM function

**Next Steps:**
- Apply this to your own dataset
- Experiment with different scoring methods
- Try more sophisticated segmentation
- Integrate with marketing automation