## 🛠️ Mod5 Data Challenge 5: Cohorts & Retention 


**Why this activity?**  
You’ll build **cohort retention** from real transaction data to answer: *Are customers coming back after their first purchase?* This extends today’s lecture on Cohorts & Retention to a real dataset.

**Dataset:** UCI Online Retail II ([dataset page](https://archive.ics.uci.edu/dataset/502/online+retail+ii)).

**Goals — You will be able to:**
1) Load & clean real transaction data (parse dates, drop rows with missing customer IDs).  
2) Build **time‑anchored cohorts** (first purchase period).  
3) Compute a **cohort retention matrix** (monthly or weekly).  
4) Explain differences and propose product/marketing actions.

**Interview practice:**
- **Q1:** Why can MAU look stable while retention collapses?  
- **Q2:** Day/Month‑1 retention fell for the newest cohort—what do you check first?  
- **Q3:** When would you choose weekly vs monthly cohorts?


### 👩‍🏫 Instructor-Led Demo (25 minutes)

#### Step 1:  Load & clean the Online Retail II Excel file

We will:  
1) Read **both sheets** from the Excel file and concatenate.  
2) Parse dates.  
3) Drop rows with a missing `Customer ID` and standardize types.  
4) Create an **activity date** column (we’ll use InvoiceDate).

Note: We are keeping returns and cancellations in the data for now. This may affect KPIs such as revenue. Revenue is outside the scope of this cohort exercise, but the effect is VERY important to point out.  
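To make the note above concrete, here is a minimal sketch on made-up rows of how cancellations can shift a revenue figure. It assumes the dataset's convention that a cancelled invoice number starts with "C", and it uses the `Invoice`, `Quantity`, and `Price` column names from the Excel file. The `is_cancellation` and `line_revenue` columns are hypothetical helpers, not part of the dataset.

```python
import pandas as pd

# Toy rows mimicking the dataset's columns (values are made up for illustration)
toy = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "489436"],
    "Quantity": [12, -12, 6, 3],
    "Price": [6.95, 6.95, 2.10, 0.0],
})

# Assumption: a cancellation is an invoice whose number starts with "C"
toy["is_cancellation"] = toy["Invoice"].astype(str).str.startswith("C")

# Revenue per line, then totals with and without the cleaning filters
toy["line_revenue"] = toy["Quantity"] * toy["Price"]
revenue_all = toy["line_revenue"].sum()

clean = toy[(~toy["is_cancellation"]) & (toy["Quantity"] > 0) & (toy["Price"] > 0)]
revenue_clean = clean["line_revenue"].sum()

print(f"All rows: {revenue_all:.2f} | cleaned rows: {revenue_clean:.2f}")
```

The two totals differ because the cancellation row nets out the original sale. Which figure is "right" depends on the KPI definition, which is why the choice should be stated explicitly.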


In [None]:
# import packages per usual 
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Path to your downloaded Excel file
DATA_PATH = None  

# 1) Read both sheets and concatenate
sheets = ["Year 2009-2010", "Year 2010-2011"]
df_list = [pd.read_excel(DATA_PATH, sheet_name=s, engine="openpyxl") for s in sheets]
df = pd.concat(df_list, ignore_index=True)

# 2) Change InvoiceDate to datetime
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

# 3) Drop missing Customer ID and standardize types
df = df.dropna(subset=["Customer ID"]).copy()
df["Customer ID"] = df["Customer ID"].astype(int)  # IDs load as floats (e.g. 13085.0)

# 4) Activity date (alias for clarity)
df["activity_date"] = df["InvoiceDate"]

df.head()


#### Step 2:  Filter for a country (UK) and define cohorts 

We will:  
1) Keep rows where `Country == "United Kingdom"`.  
2) Define **cohort_month** as each customer’s first activity month.  
3) Define **activity_month** for each row.  
4) Compute **month_number** = months since cohort start.

**Speaker Notes:** Cohorts are anchored by **first activity**; “Month 0” is the month of the first purchase, “Month 1” is the following month, and so on.
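The month offset has to count across year boundaries, so subtracting month numbers alone is not enough. The sketch below checks one way to compute it on a tiny made-up table; the customers and dates are invented, and the year/month arithmetic is one option among several.

```python
import pandas as pd

# Toy activity log: customer 2 returns 11 months later, across a year boundary
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2],
    "activity_date": pd.to_datetime(
        ["2010-01-15", "2010-02-03", "2010-04-20", "2010-02-10", "2011-01-05"]
    ),
})

# Cohort anchor: month of each customer's first activity
first_activity = toy.groupby("Customer ID")["activity_date"].transform("min")
toy["cohort_month"] = first_activity.dt.to_period("M")
toy["activity_month"] = toy["activity_date"].dt.to_period("M")

# Months since cohort start = 12 * (year difference) + (month difference)
toy["month_number"] = (
    (toy["activity_month"].dt.year - toy["cohort_month"].dt.year) * 12
    + (toy["activity_month"].dt.month - toy["cohort_month"].dt.month)
)

print(toy)
```

Customer 2's second purchase lands in Month 11, not Month -1, because the year difference is included.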


In [None]:
# 1) Country filter
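df_uk = df[df["Country"] == "United Kingdom"].copy()

# 2) First activity month (cohort anchor) per customer
df_uk["cohort_month"] = (
    df_uk.groupby("Customer ID")["activity_date"].transform("min").dt.to_period("M")
)

# 3) Activity month per row
df_uk["activity_month"] = df_uk["activity_date"].dt.to_period("M")

# 4) Months since cohort start (year difference * 12 + month difference)
df_uk["month_number"] = (
    (df_uk["activity_month"].dt.year - df_uk["cohort_month"].dt.year) * 12
    + (df_uk["activity_month"].dt.month - df_uk["cohort_month"].dt.month)
)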

df_uk[["Customer ID","activity_date","cohort_month","activity_month","month_number"]].head()



**Think:** Why do we see duplicate customer IDs here with the same activity date?

#### Step 3:  Build a monthly retention matrix and normalize by cohort size

We will:  
1) Count **unique active customers** by (`cohort_month`, `month_number`).  
2) Pivot to cohorts × months.  
3) Divide each row by **Month 0** to get retention fractions.

**Speaker Notes:** Using unique customers avoids overcounting heavy purchasers; we measure presence, not volume.
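The pivot-and-normalize steps can be checked by hand on a small table of counts. This is a sketch with made-up numbers; the `counts`, `matrix`, and `retention` names are illustrative only.

```python
import pandas as pd

# Made-up unique-customer counts per (cohort_month, month_number)
counts = pd.DataFrame({
    "cohort_month": pd.PeriodIndex(
        ["2010-01", "2010-01", "2010-01", "2010-02", "2010-02"], freq="M"
    ),
    "month_number": [0, 1, 2, 0, 1],
    "Customer ID": [100, 40, 25, 80, 20],
})

# Cohorts as rows, months since first purchase as columns
matrix = counts.pivot(index="cohort_month", columns="month_number", values="Customer ID")

# Divide every row by its Month 0 count (the cohort size)
cohort_size = matrix[0]
retention = matrix.div(cohort_size, axis=0)

print(retention)
```

Month 0 is always 1.0 by construction, and cells a cohort has not reached yet (here, Month 2 for the February cohort) stay `NaN` rather than 0.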


In [None]:
# 1) Unique active customers per (cohort_month, month_number)
cohort_counts_m = df_uk.groupby(["cohort_month","month_number"])["Customer ID"].nunique().reset_index()

# 2) Pivot
retention_m = cohort_counts_m.pivot(index="cohort_month", columns="month_number", values="Customer ID")

# 3) Normalize by cohort size (Month 0)
cohort_size_m = retention_m.iloc[:, 0]
retention_m_frac = retention_m.div(cohort_size_m, axis=0)

retention_m_frac

#### Step 4:  Heatmap & Interpretation

We will:  
1) Plot a seaborn heatmap of the **monthly** retention (0–1).  
2) Write 2–3 sentences on what improves/worsens by Month 1 & Month 2 for UK, and 1 action you’d take.

**Speaker Notes:** Visuals make cohort stories click; ask “what happened around low‑retention cohorts?”


In [None]:
# Plot heatmap of monthly retention (0–1)
mat_m = retention_m_frac.copy().astype(float).sort_index(axis=0).sort_index(axis=1)
plt.figure(figsize=(10, 5))
ax = sns.heatmap(mat_m, annot=True, fmt=".0%", cbar_kws={"label": "Retention (0–1)"}, vmin=0, vmax=1)
ax.set_title("United Kingdom — Monthly Cohort Retention")
ax.set_xlabel("Months Since First Purchase")
ax.set_ylabel("Cohort (Signup Month)")
plt.tight_layout()
plt.show()

### 👩‍💻 Student-Led Section (20 minutes) -- ANSWER KEY

#### Step 1:  Filter to Germany (DE) and define weekly cohorts

  
1) Keep rows where `Country == "Germany"`.  
2) Define **cohort_week** as each customer’s first activity week.  
3) Define **activity_week** for each row.  
4) Compute **week_number** = weeks since cohort start.
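Weekly offsets can be sketched the same way as monthly ones. The example below uses made-up dates and relies on pandas' default `"W"` period, which ends on Sunday, so each week runs Monday through Sunday. Taking the difference of week start dates is one possible approach, not the only one.

```python
import pandas as pd

# Toy activity log: 2010-01-04 is a Monday, 2010-01-10 is the Sunday of the same week
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "activity_date": pd.to_datetime(["2010-01-04", "2010-01-20", "2010-01-10"]),
})

# Cohort anchor: week of each customer's first activity
first_activity = toy.groupby("Customer ID")["activity_date"].transform("min")
toy["cohort_week"] = first_activity.dt.to_period("W")
toy["activity_week"] = toy["activity_date"].dt.to_period("W")

# Weeks since cohort start: days between the two week start dates, divided by 7
toy["week_number"] = (
    toy["activity_week"].dt.start_time - toy["cohort_week"].dt.start_time
).dt.days // 7

print(toy)
```

Because both start times fall on a Monday, the day difference is always a multiple of 7 and the integer division is exact.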





In [None]:
# 1) Country filter
df_de = df[df["Country"] == "Germany"].copy()

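# 2) First activity week (cohort anchor) per customer, via groupby + transform
df_de["cohort_week"] = (
    df_de.groupby("Customer ID")["activity_date"].transform("min").dt.to_period("W")
)

# 3) Activity week per row
df_de["activity_week"] = df_de["activity_date"].dt.to_period("W")

# 4) Weeks since cohort start (difference between week start dates, in whole weeks)
df_de["week_number"] = (
    df_de["activity_week"].dt.start_time - df_de["cohort_week"].dt.start_time
).dt.days // 7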

df_de[["Customer ID","activity_date","cohort_week","activity_week","week_number"]].head()

#### Step 2:  Build a **weekly** retention matrix and normalize by cohort size
  
1) Count **unique active customers** by (`cohort_week`, `week_number`).  
2) Pivot to cohorts × weeks.  
3) Divide each row by **Week 0** to get retention fractions.


In [None]:
# 1) Unique active customers per (cohort_week, week_number)
cohort_counts_w = df_de.groupby(["cohort_week", "week_number"])["Customer ID"].nunique().reset_index()

# 2) Pivot
retention_w = cohort_counts_w.pivot(index="cohort_week", columns="week_number", values="Customer ID")

# 3) Normalize by cohort size (Week 0)
cohort_size_w = retention_w.iloc[:, 0]
retention_w_frac = retention_w.div(cohort_size_w, axis=0)

retention_w_frac

#### Step 3:  Weekly heatmap & interpretation

Note: It can be hard to read all weeks together; try filtering to the first few weeks, or explore the matrix in Tableau on your own later.

1) Plot a seaborn heatmap of weekly retention (0–1).  
2) Write 2–3 sentences interpreting where Germany shows stronger/weaker weekly retention and one hypothesis for *why*.
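A weekly matrix can span a hundred or more columns, which makes an annotated heatmap unreadable. One way to narrow it is to keep only the first few weeks after first purchase. The sketch below uses a random stand-in matrix; `first_n` and `toy_frac` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for a weekly retention matrix: 4 cohorts x 30 weeks of random fractions
weeks = pd.period_range("2010-01-04", periods=4, freq="W")
rng = np.random.default_rng(0)
toy_frac = pd.DataFrame(rng.random((4, 30)), index=weeks, columns=range(30))
toy_frac[0] = 1.0  # Week 0 is 1.0 by construction

# Keep Week 0 through Week `first_n` only
first_n = 12
subset = toy_frac.loc[:, toy_frac.columns <= first_n]

print(subset.shape)
```

The same filter applied to a real weekly matrix gives a smaller frame that can be passed to `sns.heatmap` in place of the full one.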


In [None]:
# Plot heatmap of weekly retention (0–1)
mat_w = retention_w_frac.copy().astype(float).sort_index(axis=0).sort_index(axis=1)
plt.figure(figsize=(10, 5))
ax = sns.heatmap(mat_w, annot=True, fmt=".0%", cbar_kws={"label": "Retention (0–1)"}, vmin=0, vmax=1)
ax.set_title("Germany — Weekly Cohort Retention")
ax.set_xlabel("Weeks Since First Purchase")
ax.set_ylabel("Cohort (Signup Week)")
plt.tight_layout()
plt.show()

### Wrap‑Up: Reading & Acting on Cohorts (15 minutes)

**Students:  Be prepared to reflect and answer the interview questions at the top of this notebook.**  

- **Time matters:** Cohorts track the *same users* over time; MAU can hide churn if new users replace old ones.
- **Granularity choice:** Weekly cohorts reveal short‑term dynamics (promos, bugs); monthly cohorts smooth noise but may hide spikes.
- **Market differences:** UK monthly vs Germany weekly shows how market and period choice change the narrative.
- **Actions:** If Month/Week‑1 retention dips, audit onboarding, returns/shipping policy, payment UX, or marketing sources in that signup window.
- **Next:** Segment cohorts by acquisition channel or product category to find who sticks and why. Tie changes to a single success metric (e.g., +5pp in Month‑1 retention for new cohorts).
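The "Time matters" point, and interview Q1, can be illustrated with a few made-up numbers: monthly active customers stay flat while Month 1 retention falls from one cohort to the next. All counts below are invented, and cohorts are labeled by integer index for simplicity.

```python
import pandas as pd

# (cohort index, months since first purchase, active customers); numbers are made up
rows = [
    (0, 0, 100), (0, 1, 50), (0, 2, 10),
    (1, 0, 50),  (1, 1, 10),
    (2, 0, 80),
]
toy = pd.DataFrame(rows, columns=["cohort", "month_number", "active"])

# Calendar month in which the activity happened
toy["calendar_month"] = toy["cohort"] + toy["month_number"]

# MAU: everyone active in a calendar month, regardless of cohort
mau = toy.groupby("calendar_month")["active"].sum()

# Month 1 retention per cohort (cohorts without a Month 1 observation are dropped)
matrix = toy.pivot(index="cohort", columns="month_number", values="active")
month1_retention = (matrix[1] / matrix[0]).dropna()

print(mau)
print(month1_retention)
```

MAU is 100 in every month, yet Month 1 retention drops from 50% to 20%. New customers are masking the loss of earlier ones, and only the cohort view shows it.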
