## Mod 5 Lecture 7 Code-Along: Churn and Activation

**What you’ll do**
- Generate a small synthetic user‑event dataset
- Define a simple **activation** metric (users who reach a first “value” moment)
- Estimate **monthly churn** (users active in month *t* who don’t return in *t+1*)
- Read and explain the outputs

**Key ideas**
- Small changes in churn compound over time
- Activation (early value) is often the strongest driver of retention
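
To make the compounding idea concrete, here is a minimal sketch that assumes a constant monthly churn rate (a simplification; real churn usually varies by month and cohort). The function name `retained_share` and the 5% and 7% rates are illustrative choices, not values from this lecture's dataset.

```python
# Share of an initial cohort still retained after n months,
# ASSUMING the same churn rate applies every month (illustrative only).
def retained_share(monthly_churn: float, months: int) -> float:
    return (1 - monthly_churn) ** months

for churn in (0.05, 0.07):
    share = retained_share(churn, 12)
    print(f"monthly churn {churn:.0%}: {share:.1%} retained after 12 months")
```

Under this assumption, a two-point difference in monthly churn (5% vs. 7%) grows into a gap of roughly twelve points in the share retained after a year (about 54% vs. about 42%).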


## Step 1 — Create a tiny synthetic dataset

**Notes:**  
We’ll simulate a few months of app activity for ~600 users.  
Each “event” is a generic user action (login, view, click, etc.).  
We’ll keep it **simple** and reproducible (fixed random seed).


In [18]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Basic knobs
N_USERS = 600
MONTHS  = pd.period_range("2023-01", periods=6, freq="M")  # Jan..Jun
USERS   = np.arange(1, N_USERS+1)

# Simulate each user's signup month (earlier months more likely)
signup_probs = np.array([0.25, 0.22, 0.18, 0.15, 0.12, 0.08])
signup_probs = signup_probs / signup_probs.sum()
signup_month = rng.choice(MONTHS, size=N_USERS, p=signup_probs)

users = pd.DataFrame({
    'user_id' :USERS,
    'signup_month' :signup_month
})
users.head()

Unnamed: 0,user_id,signup_month
0,1,2023-03
1,2,2023-05
2,3,2023-04
3,4,2023-01
4,5,2023-02


In [19]:
# RUN CELL WITHOUT CHANGES -- UNDERSTAND THIS CODE
# For each user & month >= signup, simulate whether they were "active" that month
# Start with a base probability of being active (0.6) that decays by 15% per month since signup to mimic churn
rows = []
for uid, sgn in zip(users["user_id"], users["signup_month"]):
    for m in MONTHS:
        if m < sgn:
            continue
        # probability of being active decays with months since signup
        months_since = m.ordinal - sgn.ordinal
        base_p = 0.6 * (0.85 ** months_since)  # simple decay curve
        active = rng.random() < base_p
        if active:
            # If active, simulate a small number of events (1..4)
            n_events = rng.integers(1, 5)
            for _ in range(n_events):
                rows.append({"user_id": uid, "year_month": m, "event_type": "generic_event"})

events = pd.DataFrame(rows)
events.head()


Unnamed: 0,user_id,year_month,event_type
0,1,2023-05,generic_event
1,1,2023-05,generic_event
2,1,2023-05,generic_event
3,1,2023-06,generic_event
4,2,2023-05,generic_event
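
As a quick check on the simulation, the sketch below tabulates the probability of being active for each number of months since signup, using the same formula as the cell above (`0.6 * 0.85 ** months_since`). The names `BASE_P`, `DECAY`, and `activity_prob` are introduced here for illustration only.

```python
# Probability of being active k months after signup,
# using the same decay formula as the simulation cell: 0.6 * 0.85**k
BASE_P = 0.6   # chance of being active in the signup month
DECAY = 0.85   # multiplicative decay per month since signup

activity_prob = {k: BASE_P * DECAY ** k for k in range(6)}
for k, p in activity_prob.items():
    print(f"{k} months since signup: {p:.3f}")
```

Each month is drawn independently, so a user can be inactive in one month and active in a later one. User 1 in the output above is an example: they signed up in 2023-03, but their first events appear in 2023-05.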


## Step 2 — Define an example Activation metric

Students: What insights do you notice after running the cells in this step?

**Notes:**  
We’ll call a user **activated** if they generated **≥ 3 events** in their **signup month**.  
(Think: they explored enough to reach a first “Aha!” moment.)


In [21]:
# Attach each user's signup month (cohort) to every one of their events
events_signup = events.merge(
    users.rename(columns={"signup_month": "cohort"}),
    on='user_id',
    how='right'  # keep all users (even with zero events)
)
events_signup

Unnamed: 0,user_id,year_month,event_type,cohort
0,1,2023-05,generic_event,2023-03
1,1,2023-05,generic_event,2023-03
2,1,2023-05,generic_event,2023-03
3,1,2023-06,generic_event,2023-03
4,2,2023-05,generic_event,2023-05
...,...,...,...,...
2923,599,2023-06,generic_event,2023-04
2924,599,2023-06,generic_event,2023-04
2925,599,2023-06,generic_event,2023-04
2926,600,2023-05,generic_event,2023-02
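
The `how='right'` merge matters because it keeps users who never generated an event. Here is a small sketch on toy data (the three users and their events are hypothetical, not taken from the simulated dataset) showing what such a user looks like after the merge.

```python
import pandas as pd

# Toy data (hypothetical): user 3 signed up but never generated an event
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "cohort": ["2023-01", "2023-01", "2023-02"],
})
toy_events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "year_month": ["2023-01", "2023-02", "2023-01"],
})

# Right merge: every row of toy_users is kept, matched or not
merged = toy_events.merge(toy_users, on="user_id", how="right")
print(merged)

# User 3 survives the merge with a missing year_month
print(merged[merged["year_month"].isna()])
```

With `how='inner'` instead, user 3 would drop out of the result entirely, which is why the later step fills missing event counts with 0 rather than relying on the merge alone.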


In [22]:
# Count events within the cohort (signup) month
events_signup["in_signup_month"] = (events_signup["year_month"] == events_signup["cohort"])
signup_counts = (
    events_signup[events_signup["in_signup_month"]]
    .groupby('user_id', as_index=False)
    .size()
    .rename(columns={"size": "signup_events"})
)

# Users with no events get 0
signup_counts = users[["user_id"]].merge(signup_counts, on='user_id', how="left").fillna({"signup_events": 0})

# Activation rule: ≥ 3 events in signup month
signup_counts["activated"] = (signup_counts['signup_events']>=3).astype(int)

activation_rate = (signup_counts['activated'].mean()*100)
print(f"Activation rate (≥3 events in signup month): {activation_rate:.1f}%")

signup_counts.head()


Activation rate (≥3 events in signup month): 28.3%


Unnamed: 0,user_id,signup_events,activated
0,1,0.0,0
1,2,2.0,0
2,3,0.0,0
3,4,0.0,0
4,5,4.0,1


In [23]:
signup_counts

Unnamed: 0,user_id,signup_events,activated
0,1,0.0,0
1,2,2.0,0
2,3,0.0,0
3,4,0.0,0
4,5,4.0,1
...,...,...,...
595,596,1.0,0
596,597,1.0,0
597,598,2.0,0
598,599,0.0,0


## Step 3 — Compute monthly churn 

Students: What insights do you notice after running the cell in this step?

**Notes:**  
- A user is **active** in a month if they have at least one event.  
- **Churn from month t to t+1** = users active in *t* who **do not** appear in *t+1*.  
We’ll compute this for each pair of consecutive months: the 6 simulated months give 5 transitions (June has no following month, so it gets no row).
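
The churn definition above maps directly onto set difference. A minimal sketch with hypothetical user IDs:

```python
# Toy example (hypothetical user IDs) of the churn definition
active_t = {1, 2, 3, 4}        # users with at least one event in month t
active_t_plus_1 = {2, 4, 5}    # users with at least one event in month t+1

# Churners: active in t, absent in t+1
churners = active_t - active_t_plus_1
churn_rate = len(churners) / len(active_t) * 100

print(sorted(churners), f"{churn_rate:.1f}%")
```

User 5 is active only in month *t+1*. They do not affect the churn rate, because the denominator counts only users who were active in month *t*.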


In [25]:
# Active users per month (set for easy set arithmetic)
active_sets = (
    events.groupby("year_month")["user_id"]
    .apply(lambda s: set(s.unique()))
    .reindex(MONTHS, fill_value=set())
    .to_dict()
)

# Build churn table
rows = []
months = list(MONTHS)
for i in range(len(months) - 1):
    m, m_next = months[i], months[i+1]
    active_m, active_next = active_sets[m], active_sets[m_next]
    if len(active_m) == 0:
        continue
    churners = active_m - active_next
    churn_rate = (len(churners)/len(active_m))*100
    rows.append({
        "month": str(m),
        "active_in_m": len(active_m),
        "active_in_next": len(active_next),
        "churners_to_next": len(churners),
        "monthly_churn_rate_%": round(churn_rate, 2)
    })

churn_table = pd.DataFrame(rows)
churn_table

Unnamed: 0,month,active_in_m,active_in_next,churners_to_next,monthly_churn_rate_%
0,2023-01,80,157,35,43.75
1,2023-02,157,206,84,53.5
2,2023-03,206,221,109,52.91
3,2023-04,221,235,133,60.18
4,2023-05,235,231,145,61.7
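
When reading this table, it can help to split each month's active users into those who were retained and those who are new or returning. The sketch below rebuilds the table from the values printed above and derives those columns; the names `retained`, `new_or_returning`, and `retention_rate_%` are introduced here for illustration.

```python
import pandas as pd

# Values copied from the churn table output above
table = pd.DataFrame({
    "month": ["2023-01", "2023-02", "2023-03", "2023-04", "2023-05"],
    "active_in_m": [80, 157, 206, 221, 235],
    "active_in_next": [157, 206, 221, 235, 231],
    "churners_to_next": [35, 84, 109, 133, 145],
})

# Users active in both month m and the next month
table["retained"] = table["active_in_m"] - table["churners_to_next"]

# Users active next month who were NOT active in month m
# (new signups, or earlier users coming back)
table["new_or_returning"] = table["active_in_next"] - table["retained"]

# Retention is the complement of churn
table["retention_rate_%"] = (table["retained"] / table["active_in_m"] * 100).round(2)

print(table)
```

The active-user count keeps growing through May even though more than half of each month's active users do not appear the following month: new and returning users more than replace the churners. Because the simulation draws each month's activity independently, some users counted as churners here come back in a later month, so this metric measures month-to-month drop-off rather than permanent loss.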
