# Assignment 4: Customer Segmentation with Clustering

**Student Name:** [Your Name Here]

**Date:** [Date]

---

## Assignment Overview

You've been hired as a data science consultant by a UK-based online gift retailer. The company currently spends the same amount on marketing to every customer, regardless of how much that customer is worth. Your task: segment the customer base using transaction data from 2009-2011, identify distinct customer groups, and provide actionable recommendations for each segment.

---

## Step 1: Import Libraries and Load Data

In [1]:
%pip install pandas matplotlib seaborn scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("✓ Libraries imported successfully")

✓ Libraries imported successfully


In [None]:
# Load the Online Retail II dataset
# TODO: Load online_retail_II.csv from the data folder
df = None  # Replace with pd.read_csv()

# Display basic information
# TODO: Display the first few rows and basic info about the dataset


print("\n" + "="*80)
print("CHECKPOINT: Verify dataset loaded correctly")
print(f"Dataset shape: {df.shape if df is not None else 'Not loaded'}")
print("Date range: [Check InvoiceDate column]")
print("="*80)

---
## Step 2: Aggregate Transaction Data to Customer-Level RFM Features

### Clean Transaction Data

Before aggregating to customer-level, clean the transaction data:
- Remove rows with missing Customer ID
- Remove returns (negative Quantity)
- Create TotalSpend column (Quantity × Price)
- Convert InvoiceDate to datetime
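
The four steps above can be sketched on a tiny invented table before you apply them to the real data. This is a minimal illustration, not the required solution: the `toy` and `toy_clean` names and every value in the table are made up, and the column names are assumed to match the ones this notebook refers to (`Invoice`, `Quantity`, `Price`, `InvoiceDate`, `Customer ID`).

```python
import pandas as pd

# Invented transactions; column names are assumed to match the dataset's.
toy = pd.DataFrame({
    "Invoice": ["489434", "489435", "C489436", "489437"],
    "Quantity": [12, 6, -6, 3],
    "Price": [2.50, 1.25, 1.25, 4.00],
    "InvoiceDate": ["2009-12-01 07:45:00", "2009-12-01 09:06:00",
                    "2009-12-02 10:15:00", "2009-12-03 11:30:00"],
    "Customer ID": [13085.0, None, 13085.0, 13078.0],
})

# 1. Drop rows that have no customer identifier
toy_clean = toy.dropna(subset=["Customer ID"])

# 2. Keep only positive quantities, which removes returns
toy_clean = toy_clean[toy_clean["Quantity"] > 0].copy()

# 3. Spend per transaction line
toy_clean["TotalSpend"] = toy_clean["Quantity"] * toy_clean["Price"]

# 4. Parse the timestamp strings into datetime values
toy_clean["InvoiceDate"] = pd.to_datetime(toy_clean["InvoiceDate"])

print(toy_clean)
```

Taking a `.copy()` after filtering avoids pandas' chained-assignment warning when the new column is added.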

In [None]:
# Clean the data
# TODO: Remove missing Customer IDs


# TODO: Remove negative quantities (returns)


# TODO: Create TotalSpend column


# TODO: Convert InvoiceDate to datetime


print("\n" + "="*80)
print("CHECKPOINT: After data cleaning")
print(f"Remaining transactions: {len(df) if df is not None else 'N/A'}")
print(f"Unique customers: {df['Customer ID'].nunique() if df is not None else 'N/A'}")
print("="*80)

### Calculate RFM Features for Each Customer

Create three features for each customer:
- **Recency**: Days since last purchase (use December 10, 2011 as reference date)
- **Frequency**: Total number of unique invoices
- **Monetary**: Total amount spent
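
As a rough sketch of how the three features relate to a grouped aggregation, the example below builds them for two invented customers. The `toy` and `toy_rfm` names and all values are hypothetical; only the reference date and the column names come from this notebook.

```python
import pandas as pd

reference_date = pd.to_datetime("2011-12-10")

# Invented, already-cleaned transactions for two customers
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
    "InvoiceDate": pd.to_datetime(
        ["2011-11-30", "2011-11-30", "2011-12-05", "2011-10-01"]
    ),
    "TotalSpend": [10.0, 5.0, 20.0, 7.5],
})

# One row per customer: last purchase date, distinct invoices, total spend
toy_rfm = toy.groupby("Customer ID").agg(
    LastPurchase=("InvoiceDate", "max"),
    Frequency=("Invoice", "nunique"),
    Monetary=("TotalSpend", "sum"),
)

# Recency is the whole number of days between the last purchase and the reference date
toy_rfm["Recency"] = (reference_date - toy_rfm["LastPurchase"]).dt.days
toy_rfm = toy_rfm[["Recency", "Frequency", "Monetary"]]

print(toy_rfm)
```

Customer 1 has two rows on invoice `A1`, so counting rows would overstate Frequency; `nunique` counts each invoice once.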

In [None]:
# Set reference date for recency calculation
reference_date = pd.to_datetime('2011-12-10')

# TODO: For each Customer ID, calculate:
# - Recency: (reference_date - max(InvoiceDate)).days
# - Frequency: count of unique Invoice numbers
# - Monetary: sum of TotalSpend

rfm_df = None  # Replace with aggregated DataFrame

print("\n" + "="*80)
print("CHECKPOINT: RFM Features Created")
if rfm_df is not None:
    print(f"Number of customers: {len(rfm_df)}")
    print("\nRFM Summary Statistics:")
    print(rfm_df.describe())
print("="*80)

---
## Step 3: Standardize Features and Determine Optimal k

### Standardize RFM Features

K-means is sensitive to feature scale, so standardize each feature to mean 0 and standard deviation 1.
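
Standardizing replaces each value x with (x - mean) / std, computed per column. The sketch below shows the effect on a small invented RFM array; `toy_rfm` and `toy_scaled` are hypothetical names and the numbers are made up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows of [Recency, Frequency, Monetary] on very different scales
toy_rfm = np.array([
    [5.0, 2.0, 35.0],
    [70.0, 1.0, 7.5],
    [30.0, 10.0, 900.0],
    [200.0, 1.0, 15.0],
])

# fit_transform learns each column's mean and std, then rescales the column
toy_scaled = StandardScaler().fit_transform(toy_rfm)

print("Column means:", toy_scaled.mean(axis=0).round(6))
print("Column stds: ", toy_scaled.std(axis=0).round(6))
```

Without this step, Monetary (hundreds or thousands) would dominate the Euclidean distances that K-means minimizes, and Frequency would barely influence the clusters.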

In [None]:
# TODO: Use StandardScaler to standardize Recency, Frequency, and Monetary
scaler = StandardScaler()
rfm_scaled = None  # Replace with scaled features

print("\n" + "="*80)
print("CHECKPOINT: Features Standardized")
if rfm_scaled is not None:
    print(f"Scaled features shape: {rfm_scaled.shape}")
    print(f"Mean of scaled features: {rfm_scaled.mean(axis=0)}")
    print(f"Std of scaled features: {rfm_scaled.std(axis=0)}")
print("="*80)

### Elbow Method: Test k from 2 to 10

Calculate inertia (within-cluster sum of squares) for different values of k.
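
The loop can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the assignment answer: `X_toy` is generated with `make_blobs` rather than taken from the retail data, and `toy_inertias` is a hypothetical name. `n_init=10` is set explicitly so the result does not depend on the scikit-learn version's default.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled RFM matrix: 300 points, 3 features
X_toy, _ = make_blobs(n_samples=300, centers=4, n_features=3, random_state=42)

toy_inertias = []
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_toy)
    # inertia_ is the sum of squared distances from each point to its assigned center
    toy_inertias.append(model.inertia_)

for k, inertia in zip(range(2, 11), toy_inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia keeps falling as k grows, so look for the k after which the drop becomes small rather than for the minimum.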

In [None]:
# TODO: Test k values from 2 to 10
# For each k:
#   - Train KMeans(n_clusters=k, random_state=42)
#   - Store inertia value

inertias = []
k_range = range(2, 11)

# Your code here


print("\n" + "="*80)
print("CHECKPOINT: Elbow Method Calculated")
print(f"Tested k values: {list(k_range)}")
print("="*80)

In [None]:
# TODO: Plot the elbow curve
# x-axis: k values
# y-axis: inertia



### Silhouette Score Analysis

Calculate silhouette scores to validate cluster quality.
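
The silhouette score ranges from -1 to 1, and higher values indicate tighter, better-separated clusters. Below is a minimal sketch on synthetic data; `X_toy`, `toy_scores`, and `best_k` are hypothetical names, and the blobs only stand in for the scaled RFM features.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled RFM matrix
X_toy, _ = make_blobs(n_samples=300, centers=4, n_features=3,
                      cluster_std=0.8, random_state=42)

toy_scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_toy)
    # Mean silhouette coefficient over all points for this clustering
    toy_scores[k] = silhouette_score(X_toy, labels)

# The k with the highest score is a candidate, not an automatic answer
best_k = max(toy_scores, key=toy_scores.get)
print(f"Highest silhouette score at k={best_k}")
```

Treat the highest-scoring k as one piece of evidence alongside the elbow plot and how useful the resulting segments are to the business.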

In [None]:
# TODO: Calculate silhouette scores for k from 2 to 10
# For each k:
#   - Train KMeans
#   - Calculate silhouette_score(rfm_scaled, labels)

silhouette_scores = []

# Your code here


print("\n" + "="*80)
print("CHECKPOINT: Silhouette Scores Calculated")
print("="*80)

In [None]:
# TODO: Plot silhouette scores
# x-axis: k values
# y-axis: silhouette score



### Select Optimal k

**Your k selection justification (write 2-3 sentences):**

[Based on the elbow plot and silhouette scores, explain why you chose your k value. What did you observe at the elbow point? What were the silhouette scores like for different k values?]

In [None]:
# TODO: Set your chosen k value
optimal_k = None  # Replace with your chosen k (e.g., 4, 5, or 6)

print(f"Chosen k value: {optimal_k}")

---
## Step 4: Train K-Means Model and Visualize Segments

### Train Final K-Means Model

In [None]:
# TODO: Train KMeans with your optimal_k and random_state=42
kmeans = None  # Replace with trained model

# TODO: Add cluster labels to rfm_df
# rfm_df['Cluster'] = ...


print("\n" + "="*80)
print("CHECKPOINT: K-Means Model Trained")
print(f"Number of clusters: {optimal_k}")
if rfm_df is not None and 'Cluster' in rfm_df.columns:
    print("\nCluster sizes:")
    print(rfm_df['Cluster'].value_counts().sort_index())
print("="*80)

### Visualize Customer Segments

Create a 2D scatter plot showing Frequency vs Monetary, colored by cluster.
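
One way to get one color and one legend entry per cluster is to draw each cluster as its own scatter series. The sketch below uses an invented six-customer table; `toy`, the values, and the axis labels are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented customers that have already been assigned to clusters
toy = pd.DataFrame({
    "Frequency": [1, 2, 1, 15, 20, 18],
    "Monetary": [20.0, 45.0, 30.0, 2400.0, 3100.0, 2800.0],
    "Cluster": [0, 0, 0, 1, 1, 1],
})

fig, ax = plt.subplots(figsize=(10, 6))

# One scatter call per cluster gives each its own color and legend label
for cluster_id, group in toy.groupby("Cluster"):
    ax.scatter(group["Frequency"], group["Monetary"],
               label=f"Cluster {cluster_id}", alpha=0.7)

ax.set_xlabel("Frequency (unique invoices)")
ax.set_ylabel("Monetary (total spend)")
ax.set_title("Customer segments: Frequency vs Monetary")
ax.legend(title="Segment")
```

In a notebook the figure renders when the cell finishes. If a few very large spenders compress the rest of the plot, log-scaled axes (`ax.set_xscale("log")`, `ax.set_yscale("log")`) are one option.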

In [None]:
# TODO: Create scatter plot
# x-axis: Frequency
# y-axis: Monetary
# color: Cluster
# Include legend



### Calculate Cluster Centers

Show the mean RFM values for each cluster.
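
A model fitted on standardized features stores its `cluster_centers_` in standardized units, so grouping the unscaled RFM table by cluster label is usually easier to interpret. A minimal sketch on invented data follows; `toy` and `toy_summary` are hypothetical names.

```python
import pandas as pd

# Invented customers with original-unit RFM values and cluster labels
toy = pd.DataFrame({
    "Recency": [10, 20, 300, 340],
    "Frequency": [12, 8, 1, 1],
    "Monetary": [2000.0, 1000.0, 50.0, 30.0],
    "Cluster": [0, 0, 1, 1],
})

# Mean of each feature within each cluster, in days, invoices, and currency
toy_summary = toy.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()

# Cluster size helps judge how much weight each segment deserves
toy_summary["Customers"] = toy.groupby("Cluster").size()

print(toy_summary)
```

Here cluster 0 reads as recent, frequent, high-spending customers and cluster 1 as lapsed one-time buyers, which is the kind of contrast the segment names in Step 5 should capture.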

In [None]:
# TODO: Calculate mean Recency, Frequency, and Monetary for each cluster
cluster_summary = None  # Replace with grouped DataFrame

print("\n" + "="*80)
print("CLUSTER CENTERS (Mean RFM Values)")
print("="*80)
# TODO: Display cluster_summary

print("="*80)

---
## Step 5: Interpret Segments and Provide Business Recommendations

### Segment 0: [Descriptive Name]

**Customer Profile (3-5 sentences):**

[Describe this segment's characteristics. What are their R, F, M values? How do they differ from other segments? What business value do they represent? What are their purchasing patterns?]

In [None]:
# TODO: Calculate detailed statistics for Segment 0
# Show mean, median, min, max for R, F, M


### Segment 1: [Descriptive Name]

**Customer Profile (3-5 sentences):**

[Describe this segment]

In [None]:
# TODO: Calculate detailed statistics for Segment 1


### Segment 2: [Descriptive Name]

**Customer Profile (3-5 sentences):**

[Describe this segment]

In [None]:
# TODO: Calculate detailed statistics for Segment 2


### [Continue for remaining segments]

[Add sections for Segment 3, 4, etc. depending on your chosen k]

---
## Business Recommendations

### Recommendation 1: [Title]

**Which segment(s) does this target?** [Segment name(s)]

**Recommendation (3-5 sentences):**

[Provide a specific, actionable recommendation directly tied to your cluster findings. What marketing strategy should the company implement? What channels should they use? What offers or messaging would work? Why will this work for this segment?]

### Recommendation 2: [Title]

**Which segment(s) does this target?** [Segment name(s)]

**Recommendation (3-5 sentences):**

[Your second recommendation]

### Recommendation 3: [Title]

**Which segment(s) does this target?** [Segment name(s)]

**Recommendation (3-5 sentences):**

[Your third recommendation]

---
## Step 6: Submit Your Work

Before submitting:
1. Make sure all code cells run without errors
2. Verify you have:
   - RFM features properly calculated
   - Elbow method and silhouette score visualizations
   - Written justification for your k selection
   - Customer segment scatter plot
   - Descriptive names and profiles for each segment
   - Three specific business recommendations
3. Check that all visualizations display correctly

Then push to GitHub:
```bash
git add .
git commit -m 'completed customer segmentation assignment'
git push
```

Submit your GitHub repository link on the course platform.