## 🛠️ Mod5 Data Challenge 7: Churn and Activation 


**Why this activity?**  
Use real transactions to compute simple **activation** and **monthly churn** so we can reason about product health and early value.


**Dataset:** UCI Online Retail II (Excel, 2009–2011). 

**Goals — You will be able to:**
1) Load & minimally prep the Online Retail II Excel data  
2) Define a simple **activation** metric  
3) Estimate **monthly churn** (active in month *t* but not in *t+1*)  
4) Explain changes and propose concrete actions

**Interview practice:**
- Q1: Why can a small churn increase (e.g., 3% → 5%) be a big deal over a year?
- Q2: How would you define activation for a product without purchases?
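
For Q1, one way to see the effect is to compound monthly retention over twelve months. The sketch below assumes a constant monthly churn rate and no new or returning customers; the function name `retained_after` is illustrative and not part of the notebook.

```python
def retained_after(monthly_churn: float, months: int = 12) -> float:
    """Share of a starting cohort still active after `months`,
    assuming the same churn rate every month and no reactivation."""
    return (1 - monthly_churn) ** months

low = retained_after(0.03)   # 3% monthly churn
high = retained_after(0.05)  # 5% monthly churn
print(f"3% monthly churn: {low:.1%} of the cohort remains after a year")
print(f"5% monthly churn: {high:.1%} of the cohort remains after a year")
```

Under these simplifying assumptions, roughly 69% of a cohort survives a year at 3% monthly churn versus roughly 54% at 5%, so a two-point monthly change compounds into a gap of about fifteen points.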



### 👩‍🏫 Instructor-Led Demo (25 minutes)

### Step 1 — Load the Online Retail II Excel and prepare fields (You've seen this already!)

**Notes:**  
- Read both sheets, parse dates, keep all rows (including cancellations/returns), and drop only rows with a missing `Customer ID`.  
- We’ll compute a helper `amount` and a `snapshot_date` (max date + 1 day).  
- The goal is to practice metrics, not heavy cleaning.




In [None]:
# import packages per usual 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import timedelta

In [None]:
#YOU'VE SEEN THIS CODE ALREADY :) 

# Path to your downloaded Excel file
DATA_PATH = "../data/online_retail_II.xlsx"  

# 1) Read both sheets and concatenate
sheets = ["Year 2009-2010", "Year 2010-2011"]
df_list = [pd.read_excel(DATA_PATH, sheet_name=s, engine="openpyxl") for s in sheets]
df = pd.concat(df_list, ignore_index=True)

# 2) Parse datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], errors="coerce")

# 3) Drop rows with a missing Customer ID (required for per-customer metrics)
df = df.dropna(subset=["Customer ID"]).copy()
df["Customer ID"] = df["Customer ID"].astype(int)

# 4) Line total (negative for cancellations/returns, since those rows are kept)
df["amount"] = df["Quantity"] * df["Price"]
snapshot_date = df["InvoiceDate"].max().normalize() + pd.Timedelta(days=1)

df.head()


### Step 2 — Customer ledger
**Notes:**  
- For each customer: first invoice (signup proxy), last invoice, number of **unique** invoices.  
- We’ll anchor activation logic to first invoice date.



In [None]:
cust = (
    df.groupby(None)
      .agg(
          first_invoice=("InvoiceDate", "min"),
          last_invoice=(None, "max"),
          unique_invoices=("Invoice", None)
      )
      .reset_index()
)
cust.head()
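
To make the ledger logic concrete, here is a minimal sketch on a made-up toy table that reuses the same column names. The data values and the names `toy` and `toy_cust` are assumptions for illustration only; this is one way to express the three aggregations described in the notes.

```python
import pandas as pd

# Toy transactions (made-up values, same column names as the retail data)
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3],
    "Invoice": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-01-05", "2010-01-20",
        "2010-02-01", "2010-04-15", "2010-03-10",
    ]),
})

toy_cust = (
    toy.groupby("Customer ID")
       .agg(
           first_invoice=("InvoiceDate", "min"),    # signup proxy
           last_invoice=("InvoiceDate", "max"),
           unique_invoices=("Invoice", "nunique"),  # count invoices, not line items
       )
       .reset_index()
)
print(toy_cust)
```

Using `nunique` rather than `count` matters here because one invoice usually spans several line items: customer 1 has three rows but only two invoices.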


### Step 3 — Activation (≥2 invoices within 30 days of first)

**Notes:**  
- “Activation” marks the first moment a customer has likely experienced real value from the product.  
- For simplicity: a customer is **activated** if they complete **≥2 unique invoices** within **30 days** of the first invoice (the first invoice itself counts toward the two).


In [None]:
# Prepare a 30-day window per customer
cust["act_window_end"] = cust["first_invoice"] + pd.Timedelta(None)

# Attach window to each row
df_win = df.merge(cust[["Customer ID", "first_invoice", "act_window_end"]], on=None, how=None)

# Count unique invoices in the 30-day window
in_win = (df_win["InvoiceDate"] >= df_win["first_invoice"]) & (None)
win_counts = (
    df_win.loc[in_win]
          .groupby("Customer ID")["Invoice"]
          .nunique()
          .rename("invoices_30d")
          .to_frame()
)

# Merge back; users with no activity in window get 0
cust_act = cust.merge(win_counts, on="Customer ID", how="left").fillna({"invoices_30d": 0})

# Activation rule
cust_act["activated_30d"] = (cust_act["invoices_30d"] >= 2).astype(int)
activation_rate = None
print(f"Activation (≥2 invoices within 30 days): {activation_rate:.2f}%")

cust_act[["Customer ID","first_invoice","invoices_30d","activated_30d"]].head()
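
The activation rule can be checked by hand on a small example. The sketch below uses made-up toy data and treats the 30-day window as inclusive at both ends, which is an assumption rather than something the notebook specifies; the variable names are illustrative.

```python
import pandas as pd

# Toy transactions (made-up values, same column names as the retail data)
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3],
    "Invoice": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-01-05", "2010-01-20",
        "2010-02-01", "2010-04-15", "2010-03-10",
    ]),
})

# First invoice per customer, attached to every row
first = toy.groupby("Customer ID")["InvoiceDate"].min().rename("first_invoice")
toy_win = toy.join(first, on="Customer ID")
toy_win["act_window_end"] = toy_win["first_invoice"] + pd.Timedelta(days=30)

# Assumption: the window is inclusive at both ends
in_win = toy_win["InvoiceDate"].between(
    toy_win["first_invoice"], toy_win["act_window_end"]
)
invoices_30d = toy_win.loc[in_win].groupby("Customer ID")["Invoice"].nunique()

# Activated = at least 2 unique invoices inside the window
activated = (invoices_30d >= 2).astype(int)
activation_rate = activated.mean() * 100
print(f"Activation (toy data): {activation_rate:.2f}%")
```

Only customer 1 places a second invoice within 30 days, so one of three toy customers is activated. Customer 2 does buy again, but 73 days after the first invoice, which falls outside the window.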


### 👩‍🏫 Student-Led (20 minutes): Churn

### Step 1 — Create `year_month` column
We’ll use this to group invoices into monthly buckets.


In [None]:
# Create a year-month column

df["year_month"] = df[None].dt.to_period(None)


In [None]:
df["year_month"] = df["InvoiceDate"].dt.to_period("M")


### Step 2 — Track who is active each month
You’ll create a dictionary of sets: `{month: set of customer IDs}`



In [None]:
active_sets = (
    df.groupby(None)[None]
    .apply(lambda s: set(None))
    .to_dict()
)

In [None]:
active_sets = (
    df.groupby("year_month")["Customer ID"]
    .apply(lambda s: set(s.unique()))
    .to_dict()
)


### Step 3 — Calculate churn from each month to the next

Churn rate = (customers active in month *t* but not in month *t+1*) / (customers active in month *t*)
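
As a quick sanity check of the formula, here is a tiny example with hypothetical customer IDs. Python's set difference gives the churners directly.

```python
# Hypothetical customer IDs active in two consecutive months
active_t = {101, 102, 103, 104}
active_next = {102, 104, 105}  # 105 is new in t+1 and does not affect churn for month t

churners = active_t - active_next  # set difference: active in t, absent in t+1
churn_rate = len(churners) / len(active_t) * 100
print(sorted(churners), f"{churn_rate:.1f}%")
```

Two of the four customers active in month *t* do not appear in month *t+1*, so churn for month *t* is 50%. New customers in *t+1* are ignored because the denominator only counts month *t*.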



In [None]:
months_sorted = sorted(active_sets.keys())
rows = []
for i in range(len(months_sorted) - 1):
    m, m_next = months_sorted[i], months_sorted[i+1]
    active_m, active_next = active_sets[m], active_sets[m_next]
    if len(active_m) == None:
        continue
    churners = None
    churn_rate = (None) * 100
    rows.append({
        "month": str(m),
        "active_in_m": len(active_m),
        "active_in_next": len(active_next),
        "churners_to_next": len(churners),
        "monthly_churn_rate_%": round(churn_rate, 2)
    })

monthly_churn = pd.DataFrame(rows)
monthly_churn.head()

In [None]:
months_sorted = sorted(active_sets.keys())
rows = []
for i in range(len(months_sorted) - 1):
    m, m_next = months_sorted[i], months_sorted[i+1]
    active_m, active_next = active_sets[m], active_sets[m_next]
    if len(active_m) == 0:
        continue
    churners = active_m - active_next
    churn_rate = (len(churners) / len(active_m)) * 100
    rows.append({
        "month": str(m),
        "active_in_m": len(active_m),
        "active_in_next": len(active_next),
        "churners_to_next": len(churners),
        "monthly_churn_rate_%": round(churn_rate, 2)
    })

monthly_churn = pd.DataFrame(rows)
monthly_churn.head()



## Wrap‑Up (15 mins): Activation & Churn

Student Discussion Prompts:
- What happens if activation = only 1 invoice? Or 3?
- If churn spikes in a month, what should a product team investigate?
- How would you visualize churn trends for different segments?


🎯 What you learned:
- How to define a simple activation metric using real invoice behavior
- How to compute churn: who *was* active and who didn’t return
- Real data doesn’t always align neatly — always consider your metric assumptions



🚀 Next Steps:
Try calculating **cohort churn** by signup month or country!
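
As a possible starting point, the sketch below computes a simple first-month cohort churn on made-up toy data: for each first-purchase month, the share of customers who do not purchase again in the following calendar month. The cohort definition, the column names `cohort` and `months_since`, and the data are all assumptions for illustration; other definitions of cohort churn are equally reasonable.

```python
import pandas as pd

# Toy transactions (made-up values)
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3, 3, 4],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-02-10",  # customer 1: Jan cohort, returns in Feb
        "2010-01-20",                # customer 2: Jan cohort, does not return
        "2010-02-03", "2010-03-15",  # customer 3: Feb cohort, returns in Mar
        "2010-02-25",                # customer 4: Feb cohort, does not return
    ]),
})

# Cohort = calendar month of each customer's first invoice
first = toy.groupby("Customer ID")["InvoiceDate"].transform("min")
toy["cohort"] = first.dt.to_period("M")

# Whole calendar months between the first invoice and this invoice
toy["months_since"] = (
    (toy["InvoiceDate"].dt.year - first.dt.year) * 12
    + (toy["InvoiceDate"].dt.month - first.dt.month)
)

cohort_size = toy.groupby("cohort")["Customer ID"].nunique()
returned = (
    toy[toy["months_since"] == 1]
    .groupby("cohort")["Customer ID"]
    .nunique()
    .reindex(cohort_size.index, fill_value=0)  # cohorts with no returners get 0
)
churn_m1 = (1 - returned / cohort_size) * 100
print(churn_m1)
```

In the toy data each cohort has two customers and one of them returns the next month, so both cohorts show 50% first-month churn. Swapping `cohort` for a column such as `Country` would give a segment view instead.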
