## 🛠️ Mod5 Data Challenge 7: Churn and Activation 


**Why this activity?**  
Use real transactions to compute simple **activation** and **monthly churn** so we can reason about product health and early value.


**Dataset:** UCI Online Retail II (Excel, 2009–2011). 

**Goals — You will be able to:**
1) Load & minimally prep the Online Retail II Excel data  
2) Define a simple **activation** metric  
3) Estimate **monthly churn** (active in month *t* but not in *t+1*)  
4) Explain changes and propose concrete actions

**Interview practice:**
- Q1: Why can a small churn increase (e.g., 3% → 5%) be a big deal over a year?
- Q2: How would you define activation for a product without purchases?
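For Q1, a quick compounding calculation helps. The sketch below assumes the quoted rates are monthly, stay constant for the whole year, and apply to one starting cohort with no reactivation; `retained_after` is a hypothetical helper name, not part of this notebook.

```python
def retained_after(monthly_churn: float, months: int = 12) -> float:
    """Fraction of a starting cohort still active after `months`,
    assuming a constant monthly churn rate and no reactivation."""
    return (1 - monthly_churn) ** months

for rate in (0.03, 0.05):
    kept = retained_after(rate)
    print(f"{rate:.0%} monthly churn -> {kept:.1%} of the cohort left after 12 months")
```

Under these assumptions, 3% monthly churn keeps roughly 69% of the cohort after a year, while 5% keeps roughly 54%, so a two-point change compounds into a much larger gap.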



### 👩‍🏫 Instructor-Led Demo (25 minutes)

### Step 1 — Load the Online Retail II Excel and prepare fields (You've seen this already!)

**Notes:**  
- Read both sheets, parse dates, and keep all rows (including cancellations/returns); drop only rows with a missing `Customer ID`.  
- We’ll compute a helper column `amount` (`Quantity * Price`) and a `snapshot_date` (max invoice date + 1 day).  
- The goal is to practice metrics, not heavy cleaning.
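Because cancellations and returns stay in the data, it can help to check how many rows they account for. This is a minimal sketch on toy rows, assuming the dataset's documented convention that cancellation invoice numbers start with the letter `C`; the `is_cancellation` column and the toy values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", 489435],  # mixed str/int, as an Excel read can produce
    "Quantity": [12, -12, 6],
    "Price": [6.95, 6.95, 2.10],
})

# Cast to str first so integer-typed invoice numbers do not break the .str accessor
toy["is_cancellation"] = toy["Invoice"].astype(str).str.upper().str.startswith("C")
toy["amount"] = toy["Quantity"] * toy["Price"]

print("cancellation rows:", toy["is_cancellation"].sum())
print("cancelled amount:", toy.loc[toy["is_cancellation"], "amount"].sum())
```

Keeping these rows means a cancellation invoice still counts as customer activity in the metrics that follow, which is one of the assumptions worth stating when you report results.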




In [2]:
# import packages per usual 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import timedelta

In [3]:
#YOU'VE SEEN THIS CODE ALREADY :) 

# Path to your downloaded Excel file
DATA_PATH = "../data/online_retail_II.xlsx"  

# 1) Read both sheets and concatenate
sheets = ["Year 2009-2010", "Year 2010-2011"]
df_list = [pd.read_excel(DATA_PATH, sheet_name=s, engine="openpyxl") for s in sheets]
df = pd.concat(df_list, ignore_index=True)

# 2) Parse datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], errors="coerce")

# 3) Drop rows with a missing Customer ID (needed for any per-customer metric)
df = df.dropna(subset=["Customer ID"]).copy()
df["Customer ID"] = df["Customer ID"].astype(int)

# 4) Line total; can be negative for returns/cancellations, since all rows are kept
df["amount"] = df["Quantity"] * df["Price"]

# 5) Snapshot date: the day after the latest invoice in the data
snapshot_date = df["InvoiceDate"].max().normalize() + pd.Timedelta(days=1)

df.head()


|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom | 83.4 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81.0 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81.0 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085 | United Kingdom | 100.8 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom | 30.0 |


In [4]:
df

|   | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085 | United Kingdom | 83.40 |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81.00 |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085 | United Kingdom | 81.00 |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085 | United Kingdom | 100.80 |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085 | United Kingdom | 30.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1067366 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50:00 | 2.10 | 12680 | France | 12.60 |
| 1067367 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50:00 | 4.15 | 12680 | France | 16.60 |
| 1067368 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50:00 | 4.15 | 12680 | France | 16.60 |
| 1067369 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50:00 | 4.95 | 12680 | France | 14.85 |


### Step 2 — Customer ledger
**Notes:**  
- For each customer: first invoice (signup proxy), last invoice, number of **unique** invoices.  
- We’ll anchor activation logic to first invoice date.



In [None]:
cust = (
    df.groupby("Customer ID")
      .agg(
          first_invoice=("InvoiceDate", "min"),
          last_invoice=("InvoiceDate", "max"),
          unique_invoices=("Invoice", "nunique")
      )
      .reset_index()
)
cust.head()


|   | Customer ID | first_invoice | last_invoice | unique_invoices |
|---|---|---|---|---|
| 0 | 12346 | 2009-12-14 08:34:00 | 2011-01-18 10:17:00 | 17 |
| 1 | 12347 | 2010-10-31 14:20:00 | 2011-12-07 15:52:00 | 8 |
| 2 | 12348 | 2010-09-27 14:59:00 | 2011-09-25 13:13:00 | 5 |
| 3 | 12349 | 2009-12-04 12:49:00 | 2011-11-21 09:51:00 | 5 |
| 4 | 12350 | 2011-02-02 16:01:00 | 2011-02-02 16:01:00 | 1 |


### Step 3 — Activation (≥2 invoices within 30 days of first)

**Notes:**  
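- “Activation” marks the first moment a customer is likely to have experienced real value from the product.  
- For simplicity: a customer is **activated** if they complete **≥2 unique invoices** within **30 days** of the first invoice. The first invoice counts toward the total, so this amounts to at least one repeat invoice inside the window.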


In [6]:
# Prepare a 30-day window per customer
cust["act_window_end"] = cust["first_invoice"] + pd.Timedelta(days=30)

# Attach window to each row
df_win = df.merge(cust[["Customer ID", "first_invoice", "act_window_end"]], on='Customer ID', how='left')

# Count unique invoices in the 30-day window
in_win = (df_win["InvoiceDate"] >= df_win["first_invoice"]) & (df_win['InvoiceDate'] < df_win['act_window_end'])
win_counts = (
    df_win.loc[in_win]
          .groupby("Customer ID")["Invoice"]
          .nunique()
          .rename("invoices_30d")
          .to_frame()
)

# Merge back; users with no activity in window get 0
cust_act = cust.merge(win_counts, on="Customer ID", how="left").fillna({"invoices_30d": 0})

# Activation rule
cust_act["activated_30d"] = (cust_act["invoices_30d"] >= 2).astype(int)
activation_rate = cust_act['activated_30d'].mean() *100
print(f"Activation (≥2 invoices within 30 days): {activation_rate:.2f}%")

cust_act[["Customer ID","first_invoice","invoices_30d","activated_30d"]].head()


Activation (≥2 invoices within 30 days): 34.45%


|   | Customer ID | first_invoice | invoices_30d | activated_30d |
|---|---|---|---|---|
| 0 | 12346 | 2009-12-14 08:34:00 | 7 | 1 |
| 1 | 12347 | 2010-10-31 14:20:00 | 1 | 0 |
| 2 | 12348 | 2010-09-27 14:59:00 | 1 | 0 |
| 3 | 12349 | 2009-12-04 12:49:00 | 1 | 0 |
| 4 | 12350 | 2011-02-02 16:01:00 | 1 | 0 |
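To see how sensitive the activation rate is to the rule itself, the same logic can be wrapped in a function and run with different thresholds. This is a sketch on toy data, not the notebook's real result; `activation_rate`, `min_invoices`, and `window_days` are hypothetical names introduced for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 3, 3],
    "Invoice": ["A1", "A2", "A3", "B1", "C1", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-01", "2010-01-10", "2010-01-20",  # customer 1: three invoices in the window
        "2010-01-05",                               # customer 2: a single invoice
        "2010-02-01", "2010-03-15",                 # customer 3: second invoice after day 30
    ]),
})

def activation_rate(frame, min_invoices=2, window_days=30):
    """Percent of customers with >= min_invoices unique invoices in the
    half-open window [first invoice, first invoice + window_days)."""
    first = frame.groupby("Customer ID")["InvoiceDate"].transform("min")
    in_win = frame["InvoiceDate"] < first + pd.Timedelta(days=window_days)
    counts = frame.loc[in_win].groupby("Customer ID")["Invoice"].nunique()
    return (counts >= min_invoices).mean() * 100

for k in (1, 2, 3):
    print(f"threshold {k}: {activation_rate(toy, min_invoices=k):.2f}%")
```

With a threshold of 1 every customer is activated by definition, because the first invoice always falls inside its own window; raising the threshold can only hold the rate steady or lower it.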


### 👩‍🏫 Student-Led (20 minutes): Churn

### Step 1 — Create `year_month` column
We’ll use this to group invoices into monthly buckets.


In [13]:
# Create a year-month column

df["year_month"] = df['InvoiceDate'].dt.to_period('M')
df['year_month']

0          2009-12
1          2009-12
2          2009-12
3          2009-12
4          2009-12
            ...   
1067366    2011-12
1067367    2011-12
1067368    2011-12
1067369    2011-12
1067370    2011-12
Name: year_month, Length: 824364, dtype: period[M]

### Step 2 — Track who is active each month
You’ll create a dictionary of sets, `{month: set of customer IDs}`, so that consecutive months can be compared with set operations.



In [19]:
active_sets = (
    df.groupby('year_month')['Customer ID']
    .apply(lambda s: set(s.unique()))
    .to_dict()
)

### Step 3 — Calculate churn from each month to the next

Churn rate (%) = (customers active in month *t* but not in *t+1*) / (customers active in month *t*) × 100
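A tiny worked example of the formula, using made-up customer IDs rather than real data:

```python
active_t = {101, 102, 103, 104}   # customers with an invoice in month t
active_next = {102, 104, 105}     # customers with an invoice in month t+1

# Set difference: in month t but missing from month t+1
churners = active_t - active_next
churn_rate = len(churners) / len(active_t) * 100

print(sorted(churners), f"{churn_rate:.1f}%")
```

Customer 105 is new in month *t+1* and does not affect the result, since the denominator only counts customers who were active in month *t*.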



In [20]:
months_sorted = sorted(active_sets.keys())
rows = []
for i in range(len(months_sorted) - 1):
    m, m_next = months_sorted[i], months_sorted[i+1]
    active_m, active_next = active_sets[m], active_sets[m_next]
    if len(active_m) == 0:
        continue
    churners = active_m - active_next
    churn_rate = (len(churners)/len(active_m)) * 100
    rows.append({
        "month": str(m),
        "active_in_m": len(active_m),
        "active_in_next": len(active_next),
        "churners_to_next": len(churners),
        "monthly_churn_rate_%": round(churn_rate, 2)
    })

monthly_churn = pd.DataFrame(rows)
monthly_churn.head()

|   | month | active_in_m | active_in_next | churners_to_next | monthly_churn_rate_% |
|---|---|---|---|---|---|
| 0 | 2009-12 | 1045 | 786 | 653 | 62.49 |
| 1 | 2010-01 | 786 | 807 | 481 | 61.2 |
| 2 | 2010-02 | 807 | 1111 | 442 | 54.77 |
| 3 | 2010-03 | 1111 | 998 | 676 | 60.85 |
| 4 | 2010-04 | 998 | 1062 | 572 | 57.31 |
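One caveat when reading the later rows of this table: the latest invoices in the data are dated 2011-12-09, so the final month looks incomplete. The sketch below checks that directly; the date is hard-coded here for illustration, whereas in the notebook it would come from `df["InvoiceDate"].max()`.

```python
import pandas as pd

# Latest InvoiceDate visible in the df output earlier in this notebook
last_date = pd.Timestamp("2011-12-09 12:50:00")

last_month = last_date.to_period("M")
month_end = last_month.end_time.normalize()   # last calendar day of that month
is_partial = last_date.normalize() < month_end

print(last_month, "is partial:", is_partial)
```

If the final month is partial, the churn rate for the month before it is probably overstated, because customers had only a few days in which to return. One option is to drop that row before interpreting trends.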


## Wrap-Up (15 minutes): Activation & Churn

Student Discussion Prompts:
- What happens if activation = only 1 invoice? Or 3?
- If churn spikes in a month, what should a product team investigate?
- How would you visualize churn trends for different segments?
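For the segment question, one possible starting point is to compute the same month-to-month churn separately per segment and then plot one line per segment. This is a sketch on toy data; `churn_by_segment` is a hypothetical helper, and using `Country` as the segment is just one choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer ID": [1, 2, 1, 2, 3, 4, 4],
    "Country": ["United Kingdom"] * 4 + ["France"] * 3,
    "year_month": pd.PeriodIndex(
        ["2010-01", "2010-01", "2010-02", "2010-02", "2010-01", "2010-01", "2010-02"],
        freq="M",
    ),
})

def churn_by_segment(frame, segment_col):
    """Month-to-next-month churn (%) computed separately for each segment value."""
    rows = []
    for seg, grp in frame.groupby(segment_col):
        active = (
            grp.groupby("year_month")["Customer ID"]
               .apply(lambda s: set(s.unique()))
               .to_dict()
        )
        months = sorted(active)
        for m in months[:-1]:
            # m + 1 is the next calendar month; a month with no rows means nobody was active
            active_next = active.get(m + 1, set())
            churners = active[m] - active_next
            rows.append({
                segment_col: seg,
                "month": str(m),
                "churn_rate_%": round(len(churners) / len(active[m]) * 100, 2),
            })
    return pd.DataFrame(rows)

seg_churn = churn_by_segment(toy, "Country")
print(seg_churn)

# Possible plot, one line per segment:
# sns.lineplot(data=seg_churn, x="month", y="churn_rate_%", hue="Country")
```

Small segments produce noisy rates, so it may be worth showing `active_in_m` alongside the rate or restricting the plot to the largest segments.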


🎯 What you learned:
- How to define a simple activation metric using real invoice behavior
- How to compute churn: who *was* active and who didn’t return
- Real data doesn’t always align neatly (for example, the final month here ends on 2011-12-09), so always check the assumptions behind your metrics



🚀 Next Steps:
Try calculating **cohort churn** by signup month or country!
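As a starting point for cohort churn, here is a sketch on toy data that groups customers by the month of their first invoice (the same signup proxy used for `first_invoice`) and measures the share who did not return in the following calendar month. The column names `cohort` and `returned_next_month` are assumptions for illustration, and this covers only the first month after signup.

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-02-10",   # customer 1: signs up in Jan, returns in Feb
        "2010-01-20",                 # customer 2: signs up in Jan, never returns
        "2010-02-03", "2010-03-01",   # customer 3: signs up in Feb, returns in Mar
    ]),
})

# Integer month index makes "next calendar month" a simple +1
month_idx = toy["InvoiceDate"].dt.year * 12 + toy["InvoiceDate"].dt.month
cohort_idx = month_idx.groupby(toy["Customer ID"]).transform("min")

first_invoice = toy.groupby("Customer ID")["InvoiceDate"].transform("min")
toy["cohort"] = first_invoice.dt.to_period("M").astype(str)
toy["returned_next_month"] = month_idx == cohort_idx + 1

# One flag per customer, then the share of each cohort that did NOT return
per_customer = (
    toy.groupby(["cohort", "Customer ID"])["returned_next_month"].any().astype(int)
)
cohort_churn = (1 - per_customer.groupby(level="cohort").mean()) * 100
print(cohort_churn)
```

The most recent cohort has no following month in the data, so its churn would appear as 100% and is better excluded. For churn by country, swapping the cohort column for `Country` in the final grouping is one way to adapt this.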


***

### Answers Below (Don't look Ro, that's cheating v.v)

In [None]:
df["year_month"] = df["InvoiceDate"].dt.to_period("M")


In [None]:
active_sets = (
    df.groupby("year_month")["Customer ID"]
    .apply(lambda s: set(s.unique()))
    .to_dict()
)


In [None]:
months_sorted = sorted(active_sets.keys())
rows = []
for i in range(len(months_sorted) - 1):
    m, m_next = months_sorted[i], months_sorted[i+1]
    active_m, active_next = active_sets[m], active_sets[m_next]
    if len(active_m) == 0:
        continue
    churners = active_m - active_next
    churn_rate = (len(churners) / len(active_m)) * 100
    rows.append({
        "month": str(m),
        "active_in_m": len(active_m),
        "active_in_next": len(active_next),
        "churners_to_next": len(churners),
        "monthly_churn_rate_%": round(churn_rate, 2)
    })

monthly_churn = pd.DataFrame(rows)
monthly_churn.head()

