# Cohort Retention Teaching Notebook v3 (Explain-First)

This version is optimized for teaching clarity: start with grains and sanity checks, then definitions, then charts, then an exercise.

**Export rule used in this notebook:** code is visible and commented for learning.


In [None]:
# Setup: load only required files and fail fast if anything is missing.
from pathlib import Path
import json
import sys
import pandas as pd

cwd = Path.cwd().resolve()
REPO_ROOT = None
for root in [cwd] + list(cwd.parents):
    if (root / 'data_processed').exists() and (root / 'docs').exists():
        REPO_ROOT = root
        break
if REPO_ROOT is None:
    raise FileNotFoundError('Could not locate repo root with data_processed/ and docs/.')

src_root = REPO_ROOT / 'src'
if str(src_root) not in sys.path:
    sys.path.insert(0, str(src_root))

from retention.policies import HORIZON_H, MIN_COHORT_N, OBSERVED_ONLY, RIGHT_CENSOR_MODE

DP = REPO_ROOT / 'data_processed'

required = [
    DP / 'scope_receipts.json',
    DP / 'order_lines.csv',
    DP / 'orders.csv',
    DP / 'customers.csv',
    DP / 'customer_month_activity.csv',
    DP / 'chart1_logo_retention_heatmap.csv',
    DP / 'chart2_net_proxy_heatmap.csv',
    DP / 'chart2_family_scatter.csv',
    DP / 'chart3_m2_by_family.csv',
    DP / 'gate_a.json',
    DP / 'confound_m2_family_all_vs_retail.csv',
]
missing = [str(p.relative_to(REPO_ROOT)) for p in required if not p.exists()]
if missing:
    raise FileNotFoundError('Missing required artifacts: ' + ', '.join(missing))

scope = json.loads((DP / 'scope_receipts.json').read_text(encoding='ascii'))
gate_a = json.loads((DP / 'gate_a.json').read_text(encoding='ascii'))

# Load core tables.
order_lines = pd.read_csv(DP / 'order_lines.csv')
orders = pd.read_csv(DP / 'orders.csv')
customers = pd.read_csv(DP / 'customers.csv')
cma = pd.read_csv(DP / 'customer_month_activity.csv')
chart1 = pd.read_csv(DP / 'chart1_logo_retention_heatmap.csv')
chart2 = pd.read_csv(DP / 'chart2_net_proxy_heatmap.csv')
chart2_scatter = pd.read_csv(DP / 'chart2_family_scatter.csv')
chart3 = pd.read_csv(DP / 'chart3_m2_by_family.csv')
confound = pd.read_csv(DP / 'confound_m2_family_all_vs_retail.csv')

print('Loaded all required inputs from', DP)
print(f'Policies: H={HORIZON_H}, MIN_COHORT_N={MIN_COHORT_N}, OBSERVED_ONLY={OBSERVED_ONLY}, RIGHT_CENSOR_MODE={RIGHT_CENSOR_MODE}')


## 1) Grain Map (raw -> modeled tables)
Think of this as the data staircase. Each step should have a stated grain and a row count you can explain.

- Raw Excel rows: original invoice-line records from one or more sheets
- `order_lines.csv`: normalized line-item grain (should closely reconcile to raw rows)
- `orders.csv`: one row per order_id
- `customers.csv`: one row per cohort-universe customer
- `customer_month_activity.csv`: one row per customer per month 0..6 (exactly 7 rows/customer)
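
One way to make the grain map checkable is to assert that each table's stated key identifies a row exactly once. The sketch below uses small made-up tables (`toy_orders`, `toy_cma`) and a hypothetical helper `grain_is_unique`; none of these names come from the project code, and only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy tables: column names mirror the notebook, values are made up.
toy_orders = pd.DataFrame({
    'order_id': ['A1', 'A2', 'A3'],
    'customer_id': ['c1', 'c1', 'c2'],
})
toy_cma = pd.DataFrame({
    'customer_id': ['c1'] * 7 + ['c2'] * 7,
    'months_since_first': list(range(7)) * 2,
})


def grain_is_unique(df, key_cols):
    """Return True when key_cols identify each row exactly once."""
    return not df.duplicated(subset=key_cols).any()


orders_grain_ok = grain_is_unique(toy_orders, ['order_id'])
cma_grain_ok = grain_is_unique(toy_cma, ['customer_id', 'months_since_first'])
# customer_id alone is not a key of the order table: c1 placed two orders.
customer_is_order_key = grain_is_unique(toy_orders, ['customer_id'])

print('orders keyed by order_id:', orders_grain_ok)
print('cma keyed by customer_id + months_since_first:', cma_grain_ok)
print('orders keyed by customer_id alone:', customer_is_order_key)
```

If a key check like this fails on a real table, the stated grain is wrong and any row count built on it needs a second look.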


### Stop & Answer
1. What does the grain map protect you from?  
Answer: It prevents mixing incompatible row levels (line, order, customer, customer-month).

2. If `customers.csv` has fewer rows than `orders.csv`, is that expected?  
Answer: Yes. `customers.csv` has one row per unique cohort-universe customer, while a single customer can place many orders.

In [None]:
# Grain map table with row counts.
grain_map = pd.DataFrame([
    {'stage': 'raw_sum_rows (from scope receipts)', 'rows': int(scope['raw_sum_rows'])},
    {'stage': 'order_lines.csv', 'rows': int(len(order_lines))},
    {'stage': 'orders.csv', 'rows': int(len(orders))},
    {'stage': 'customers.csv', 'rows': int(len(customers))},
    {'stage': 'customer_month_activity.csv', 'rows': int(len(cma))},
])
print(grain_map.to_string(index=False))


### What pattern should you look for? (Heatmaps)
- Vertical decay: does retention/value fade as months_since_first increases?
- Cohort shifts: do newer cohort rows look better/worse than older rows at the same month index?
- Anomalies: sudden spikes/drops can indicate promo periods, returns waves, or mapping/credit issues.
- White/blank = missing/suppressed, not zero.


### You do -> Then we show: File map for joins
First try to describe each file's grain and join path yourself. Then run the cell to confirm.

In [None]:
# File map: grain, primary key, and joins used in the teaching flow.
file_map = pd.DataFrame([
    {'file': 'scope_receipts.json', 'grain': 'workbook-level receipt', 'primary_key': 'n/a', 'joins_used': 'none'},
    {'file': 'order_lines.csv', 'grain': 'order line', 'primary_key': 'order_id + sku + order_ts (practical line grain)', 'joins_used': 'roll up to orders by order_id'},
    {'file': 'orders.csv', 'grain': 'order', 'primary_key': 'order_id', 'joins_used': 'join to customers on customer_id; aggregate to months'},
    {'file': 'customers.csv', 'grain': 'cohort customer', 'primary_key': 'customer_id', 'joins_used': 'left join onto full month grid'},
    {'file': 'customer_month_activity.csv', 'grain': 'customer-month (0..6)', 'primary_key': 'customer_id + months_since_first', 'joins_used': 'source for retention metrics and chart tables'},
])
print(file_map.to_string(index=False))


## H=6 full grid = 7 rows per customer (0..6)
Every cohort-universe customer gets exactly one row for each `months_since_first` in `0..6`.
That means **7 rows per customer**, even if a later month has no observed purchases.
This guarantees structural comparability before any censoring/suppression logic in chart tables.


## Micro-Lab 1: Build the grid from scratch (for 5 customers)
**You do:** Rebuild a 0..6 month grid for 5 cohort customers using `customers.csv` + `orders.csv`, then verify core cohort invariants.

In [None]:
# CA-42 micro-lab: rebuild 7-row customer grid for five customers and assert core invariants.
sample_customers = (
    customers['customer_id']
    .astype(str)
    .drop_duplicates()
    .sort_values(kind='stable')
    .head(5)
    .tolist()
)
cust5 = customers[customers['customer_id'].astype(str).isin(sample_customers)][['customer_id', 'cohort_month']].copy()
cust5['customer_id'] = cust5['customer_id'].astype(str)
cust5['cohort_period'] = pd.PeriodIndex(cust5['cohort_month'], freq='M')
months = pd.DataFrame({'months_since_first': list(range(7))})
# Cross join: every sampled customer x every month index (requires pandas >= 1.2).
grid = cust5.merge(months, how='cross')
grid['activity_period'] = grid['cohort_period'] + grid['months_since_first']
grid['activity_month'] = grid['activity_period'].astype(str)

valid_orders = orders[(orders['is_valid_purchase'] == 1)].copy()
valid_orders['customer_id'] = valid_orders['customer_id'].astype(str)
valid_orders = valid_orders[valid_orders['customer_id'].isin(sample_customers)]
valid_orders['activity_month'] = pd.to_datetime(valid_orders['order_ts'], errors='coerce').dt.to_period('M').astype(str)
valid_counts = (
    valid_orders.groupby(['customer_id', 'activity_month'], as_index=False)
    .agg(orders_count_valid=('order_id', 'nunique'))
)

lab1 = grid.merge(valid_counts, on=['customer_id', 'activity_month'], how='left')
lab1['orders_count_valid'] = lab1['orders_count_valid'].fillna(0).astype(int)
lab1['is_retained_logo'] = (lab1['orders_count_valid'] > 0).astype(int)

pass_rows = lab1.groupby('customer_id').size().eq(7).all()
months_ok = (
    lab1.sort_values(['customer_id', 'months_since_first'], kind='stable')
    .groupby('customer_id')['months_since_first']
    .apply(list)
    .apply(lambda vals: vals == list(range(7)))
    .all()
)
month0_ok = (
    lab1[lab1['months_since_first'] == 0]
    .groupby('customer_id')['is_retained_logo']
    .max()
    .eq(1)
    .all()
)

print(('PASS' if pass_rows else 'FAIL') + ' grid_rows_per_customer')
print(('PASS' if months_ok else 'FAIL') + ' months_since_first_0_to_6')
print(('PASS' if month0_ok else 'FAIL') + ' month0_logo_retention')

cols = ['customer_id', 'cohort_month', 'activity_month', 'months_since_first', 'orders_count_valid', 'is_retained_logo']
print(lab1[cols].sort_values(['customer_id', 'months_since_first'], kind='stable').to_string(index=False))


**Expected output**

`PASS grid_rows_per_customer`

`PASS months_since_first_0_to_6`

`PASS month0_logo_retention`

...followed by a 35-row table (5 customers x 7 months).

Plain-English: This lab proves the cohort grid shape is deterministic (7 rows/customer) and that month 0 retention is structural for cohort-universe customers. If any line fails, cohort math is broken before chart interpretation.

## 2) Sanity Checks (this prevents false stories)
### Check A: raw_sum_rows vs order_lines_rows
If these two counts are not close, downstream metrics cannot be trusted.

### Check B: join explosion detection
Join explosion happens when a key is not unique and a join multiplies rows unexpectedly.
Rule used here: processed line rows > raw_sum_rows * 1.05 => **explosion suspected**.
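
To see the mechanism behind Check B, here is a minimal sketch on made-up data: a duplicated `order_id` on the right side of a join multiplies line rows. The tables `lines` and `orders_dup` are hypothetical. The `validate` argument of `DataFrame.merge` is a standard pandas feature, but its use here is a suggestion and not something the project pipeline is known to do.

```python
import pandas as pd

# Hypothetical line-item table: 3 rows.
lines = pd.DataFrame({'order_id': ['A1', 'A1', 'A2'], 'sku': ['x', 'y', 'z']})
# Simulated bug: the order table carries order_id A1 twice.
orders_dup = pd.DataFrame({
    'order_id': ['A1', 'A1', 'A2'],
    'customer_id': ['c1', 'c1', 'c2'],
})

joined = lines.merge(orders_dup, on='order_id', how='left')
# A1 contributes 2 lines x 2 order rows = 4 rows, A2 contributes 1, so 5 in total.
explosion_suspected = len(joined) > len(lines) * 1.05
print(f'rows before={len(lines)} after={len(joined)} explosion_suspected={explosion_suspected}')

# validate='many_to_one' asks pandas to raise if the right-side key is not unique.
try:
    lines.merge(orders_dup, on='order_id', how='left', validate='many_to_one')
    caught = False
except pd.errors.MergeError:
    caught = True
print('merge validation caught the duplicate key:', caught)
```

The row-count rule detects the problem after the fact; `validate` stops the join at the point where the key assumption breaks.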


### Stop & Answer
1. What would a join explosion look like in receipts?  
Answer: `order_lines_rows` materially above `raw_sum_rows` (here threshold is >5%).

2. Why is month0 retention 100% by definition?  
Answer: Cohort customers are defined by having a valid first purchase, so month 0 must include activity.

In [None]:
# Sanity checks from receipts + modeled tables.
raw_sum = int(scope['raw_sum_rows'])
line_rows = int(len(order_lines))
delta_pct = ((line_rows - raw_sum) / raw_sum) * 100 if raw_sum else 0.0

if line_rows > raw_sum * 1.05:
    sanity_flag = 'JOIN_EXPLOSION_SUSPECTED'
elif line_rows < raw_sum * 0.95:
    sanity_flag = 'UNDERCOUNT'
else:
    sanity_flag = 'RECONCILED'

rows_per_customer = cma.groupby('customer_id').size()
month0 = cma[cma['months_since_first'] == 0]

print(f'raw_sum_rows={raw_sum}')
print(f'order_lines_rows={line_rows}')
print(f'delta_pct={delta_pct:.2f}%')
print(f'sanity_flag={sanity_flag}')
print(f'full_grid_7_rows_per_customer={(rows_per_customer == 7).all()}')
print(f'month0_retention_100pct={(month0["is_retained_logo"] == 1).all()}')


### You do -> Then we show: Reconciliation assert exercise
Exercise: compute `raw_sum_rows` and `order_lines` rows, then assert they are within +/-1%.

In [None]:
# Exact exercise code: fail fast if line-item scope is not reconciled.
raw_sum_rows = int(scope['raw_sum_rows'])
order_lines_rows = int(len(order_lines))
delta_ratio = abs(order_lines_rows - raw_sum_rows) / raw_sum_rows if raw_sum_rows else 0.0
assert delta_ratio <= 0.01, (
    f'Reconciliation failed: raw_sum_rows={raw_sum_rows}, order_lines_rows={order_lines_rows}, delta_ratio={delta_ratio:.4%}'
)
print(f'PASS raw_sum_rows={raw_sum_rows} order_lines_rows={order_lines_rows} delta_ratio={delta_ratio:.4%}')


Expected output pattern:
`PASS raw_sum_rows=1067371 order_lines_rows=1067371 delta_ratio=0.0000%`

## Micro-Lab 2: Sanity + Scope Reconciliation
**You do:** Load `scope_receipts.json`, print scope fields, and assert processed line rows reconcile to raw rows within +/-1%.

In [None]:
# CA-43 micro-lab: scope and reconciliation check from receipts.
scope_lab = json.loads((DP / 'scope_receipts.json').read_text(encoding='ascii'))
raw_sum_rows = int(scope_lab['raw_sum_rows'])
processed_order_lines_rows = scope_lab['processed_order_lines_rows']
processed_order_lines_rows = int(processed_order_lines_rows) if processed_order_lines_rows is not None else 0
delta_ratio = abs(processed_order_lines_rows - raw_sum_rows) / raw_sum_rows if raw_sum_rows else 0.0
recon_ok = delta_ratio <= 0.01

print('sheets_detected=', scope_lab['sheets_detected'])
print('sheet_rows=', scope_lab['sheet_rows'])
print('raw_sum_rows=', raw_sum_rows)
print('processed_order_lines_rows=', processed_order_lines_rows)
print('reconcile_status=', scope_lab['reconcile_status'])
print((('PASS' if recon_ok else 'FAIL') + f' scope_reconciliation_within_1pct delta_ratio={delta_ratio:.4%}'))


**Expected output**

`reconcile_status= OK_BOTH_SHEETS`

`PASS scope_reconciliation_within_1pct delta_ratio=0.0000%`

Plain-English: We ingested both sheets because the workbook has two yearly tabs and receipts explicitly include both in `sheet_rows` and `raw_sum_rows`. Line-item sanity confirms processed `order_lines` reconciles to raw rows within +/-1%, so join explosion risk is not indicated.

## 3) Frozen Definitions (v1.2)
These are fixed project definitions and should be stated consistently:

- **valid purchase**: `(order_gross > 0) & (is_cancel_invoice == 0)`
- **credit-like**: `is_cancel_invoice OR (order_net_proxy < 0)`
- **net retention proxy** (cohort month t):
  `sum(net_revenue_proxy_total at t) / sum(gross_revenue_valid at month 0)`
- **denominator choice**: exclude cohorts where month-0 denominator is zero
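
The definitions above can be sketched on four made-up orders. The `toy` table and its values are hypothetical, and the sketch assumes the numerator sums net proxy over all orders in a month (credits included), which is one reading of `net_revenue_proxy_total`.

```python
import pandas as pd

# Hypothetical orders for a single cohort; values are made up.
toy = pd.DataFrame({
    'order_gross': [100.0, 50.0, 0.0, 40.0],
    'order_net_proxy': [100.0, -10.0, -20.0, 40.0],
    'is_cancel_invoice': [0, 0, 1, 0],
    'months_since_first': [0, 0, 1, 1],
})

# valid purchase: (order_gross > 0) & (is_cancel_invoice == 0)
toy['is_valid_purchase'] = ((toy['order_gross'] > 0) & (toy['is_cancel_invoice'] == 0)).astype(int)
# credit-like: is_cancel_invoice OR (order_net_proxy < 0)
toy['is_credit_like'] = ((toy['is_cancel_invoice'] == 1) | (toy['order_net_proxy'] < 0)).astype(int)

# Denominator: gross revenue of valid purchases at month 0.
m0_valid = (toy['months_since_first'] == 0) & (toy['is_valid_purchase'] == 1)
gross_m0 = toy.loc[m0_valid, 'order_gross'].sum()

# Numerator: net proxy summed per month (assumption: all orders, credits included).
net_by_month = toy.groupby('months_since_first')['order_net_proxy'].sum()

# Denominator guard: a cohort with gross_m0 == 0 gets no ratio at all.
net_retention_proxy = net_by_month / gross_m0 if gross_m0 > 0 else net_by_month * float('nan')

print(toy.to_string(index=False))
print(f'gross_m0={gross_m0}')
print(net_retention_proxy.to_string())
```

Note that the second order is both a valid purchase and credit-like: the two flags are defined on different columns and are not mutually exclusive.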


### Stop & Answer
1. Why do we separate `order_gross` and `order_net_proxy`?  
Answer: Gross defines valid purchases cleanly; net proxy captures credits/refunds for value tracking.

2. Why exclude cohorts with month-0 denominator = 0 from Chart 2?  
Answer: Division would be undefined or unstable, so those cohorts are explicitly ineligible.

In [None]:
# Show real receipts for Gate A and denominator guard context.
baseline = (
    cma[cma['months_since_first'] == 0]
    .groupby('cohort_month', as_index=False)
    .agg(baseline_gross=('gross_revenue_valid', 'sum'))
)
eligible = int((baseline['baseline_gross'] > 0).sum())
excluded = int((baseline['baseline_gross'] == 0).sum())

print(f"Gate A pct valid purchases with non-positive net: {gate_a['gate_a_pct_valid_nonpositive_net']:.4f}%")
print(f"Gate A trigger fired: {gate_a['trigger_fired']}")
print(f"Chart2 denominator guard cohorts: eligible={eligible}, excluded={excluded}")


### You do -> Then we show: Credit-like truth table
Predict outcomes first, then verify with the 4-row truth table.

In [None]:
# Truth table examples for is_credit_like = is_cancel_invoice OR (order_net_proxy < 0).
truth = pd.DataFrame([
    {'example': 'A', 'is_cancel_invoice': 0, 'order_net_proxy':  12.50},
    {'example': 'B', 'is_cancel_invoice': 0, 'order_net_proxy':  -3.20},
    {'example': 'C', 'is_cancel_invoice': 1, 'order_net_proxy':  14.00},
    {'example': 'D', 'is_cancel_invoice': 1, 'order_net_proxy':  -8.00},
])
truth['net_proxy_lt_zero'] = truth['order_net_proxy'] < 0
truth['is_credit_like'] = (truth['is_cancel_invoice'] == 1) | truth['net_proxy_lt_zero']
print(truth[['example', 'is_cancel_invoice', 'order_net_proxy', 'net_proxy_lt_zero', 'is_credit_like']].to_string(index=False))


## Micro-Lab 3: Net proxy retention (manual)
**You do:** For one cohort (Chart 2 cohort grain), manually rebuild the net proxy ratio 0..6 and validate it against `chart2_net_proxy_heatmap.csv` (loaded above as `chart2`).

In [None]:
# CA-44 micro-lab: manual Chart 2 reconstruction for one cohort and one family grain (ALL_FAMILIES).
cohort_choice = str(chart2['cohort_month'].dropna().sort_values(kind='stable').iloc[0])
family_choice = 'ALL_FAMILIES'
max_observed_month = pd.to_datetime(scope['raw_max_date'], errors='coerce').to_period('M')

cma_one = cma[cma['cohort_month'] == cohort_choice].copy()
gross_m0 = float(cma_one.loc[cma_one['months_since_first'] == 0, 'gross_revenue_valid'].sum())
net_curve = (
    cma_one.groupby('months_since_first', as_index=False)
    .agg(net_proxy=('net_revenue_proxy_total', 'sum'))
    .query('months_since_first >= 0 and months_since_first <= 6')
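    .sort_values('months_since_first', kind='stable')
    # Reset to a 0..n-1 index so the label-based .loc[i, ...] loop below stays valid after query/sort.
    .reset_index(drop=True)
)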
cohort_period = pd.Period(cohort_choice, freq='M')
net_curve['activity_period'] = cohort_period + net_curve['months_since_first'].astype(int)
net_curve['is_observed'] = net_curve['activity_period'] <= max_observed_month
net_curve['gross_m0'] = gross_m0
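# Use a float NaN (not pd.NA) for the guard so the column keeps a numeric dtype for the diff below.
net_curve['ratio'] = net_curve['net_proxy'] / gross_m0 if gross_m0 > 0 else float('nan')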

# Effective n for this point = customers with any activity signal in that month.
net_curve['n_customers'] = 0
for i in range(len(net_curve)):
    m = int(net_curve.loc[i, 'months_since_first'])
    month_rows = cma_one[cma_one['months_since_first'] == m].copy()
    has_any = (month_rows['orders_count_valid'] > 0) | (month_rows['net_revenue_proxy_total'] != 0)
    net_curve.loc[i, 'n_customers'] = int(month_rows.loc[has_any, 'customer_id'].nunique())

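# Right-censor unobserved months, then suppress small samples using the policy threshold.
net_curve.loc[~net_curve['is_observed'], 'ratio'] = float('nan')
net_curve.loc[net_curve['n_customers'] < MIN_COHORT_N, 'ratio'] = float('nan')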

cmp = chart2[chart2['cohort_month'] == cohort_choice][['months_since_first', 'net_retention_proxy']].copy()
cmp['months_since_first'] = cmp['months_since_first'].astype(int)
check = net_curve.merge(cmp, on='months_since_first', how='left')
check['abs_diff'] = (check['ratio'] - check['net_retention_proxy']).abs()
max_abs_diff = float(check['abs_diff'].fillna(0).max()) if len(check) else 0.0

denom_ok = gross_m0 > 0
match_ok = max_abs_diff <= 1e-9

print(f'selected_cohort_month={cohort_choice}')
print(f'selected_family={family_choice}')
print(f'max_observed_month={max_observed_month}')
print(('PASS' if denom_ok else 'FAIL') + ' denominator_positive')
print(('PASS' if match_ok else 'FAIL') + f' chart2_ratio_match_within_rounding max_abs_diff={max_abs_diff:.12f}')
print(check[['months_since_first', 'gross_m0', 'net_proxy', 'n_customers', 'is_observed', 'ratio']].to_string(index=False))


**Expected output**

`PASS denominator_positive`

`PASS chart2_ratio_match_within_rounding ...`

...plus a 7-row table with `months_since_first`, `gross_m0`, `net_proxy`, `n_customers`, `is_observed`, `ratio`.

Plain-English (why denominator is fixed):
- Month 0 gross valid revenue is the common baseline, so each later month is comparable on the same scale.
- Keeping denominator fixed isolates value decay/recovery in the numerator (`net_revenue_proxy_total`).
- Right-censoring sets unobserved future months to missing, which avoids treating end-of-data as real zero value.


## Right-censoring: why later cohorts have missing months
When the dataset ends (raw_max_date), some future cohort-month combinations are **unobservable**.
For those cells, we keep values as missing (`NaN`), not zero, so we do not fake a retention collapse.

Explicit note: **a statement like "2011-06 gets 'darker' later" is interpretable only for observable months; any darkening after the cutoff would be a chart artifact created by treating missing cells as zero.**
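
A small sketch of why the missing-versus-zero choice matters. The cutoff month, cohort month, and ratios below are made up for illustration; in the notebook the real cutoff comes from `scope['raw_max_date']`.

```python
import pandas as pd

# Hypothetical values for illustration only.
max_observed_month = pd.Period('2011-12', freq='M')
cohort = pd.Period('2011-09', freq='M')
months = list(range(7))

curve = pd.DataFrame({'months_since_first': months})
curve['activity_month'] = [str(cohort + m) for m in months]
curve['is_observed'] = [(cohort + m) <= max_observed_month for m in months]

# Made-up ratios; the trailing zeros mean "no data yet", not "no retention".
curve['ratio_raw'] = [1.00, 0.40, 0.30, 0.25, 0.0, 0.0, 0.0]
# Right-censoring: keep observed values, set unobserved months to NaN.
curve['ratio'] = curve['ratio_raw'].where(curve['is_observed'])

later = curve['months_since_first'] >= 1
mean_censored = curve.loc[later, 'ratio'].mean()         # NaN skipped: averages months 1..3
mean_zero_filled = curve.loc[later, 'ratio_raw'].mean()  # zeros counted: averages months 1..6

print(curve.to_string(index=False))
print(f'mean months 1+ with censoring:   {mean_censored:.4f}')
print(f'mean months 1+ with zero-filling: {mean_zero_filled:.4f}')
```

In this toy case zero-filling halves the average, which is the fabricated "retention collapse" the censoring rule is meant to prevent.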


In [None]:
# Worked example: near-end cohort observability and censoring behavior.
cohort_example = '2011-06'
raw_max = pd.to_datetime(scope['raw_max_date'], errors='coerce')
max_obs = raw_max.to_period('M')

ex = chart2[chart2['cohort_month'] == cohort_example][['cohort_month', 'months_since_first', 'net_retention_proxy', 'n_customers']].copy()
if ex.empty:
    ex = chart2[chart2['cohort_month'] != 'ALL_WEIGHTED'][['cohort_month', 'months_since_first', 'net_retention_proxy', 'n_customers']].copy()
    cohort_example = str(ex['cohort_month'].astype(str).sort_values(kind='stable').iloc[-1])
    ex = ex[ex['cohort_month'] == cohort_example].copy()

ex['cohort_period'] = pd.PeriodIndex(ex['cohort_month'].astype(str), freq='M')
ex['activity_period'] = ex['cohort_period'] + ex['months_since_first'].astype(int)
ex['activity_month'] = ex['activity_period'].astype(str)
ex['is_observable'] = ex['activity_period'] <= max_obs
ex['is_missing_value'] = ex['net_retention_proxy'].isna()

print(f'cohort_example={cohort_example}, max_observed_month={max_obs}')
print(ex[['months_since_first', 'activity_month', 'is_observable', 'n_customers', 'net_retention_proxy', 'is_missing_value']].to_string(index=False))

print('Why missing must be missing (not zero): unobservable months are outside dataset scope, so zero would be a fabricated outcome.')


Mini exercise:
1. Pick one `customer_id`.
2. Print their 7-row grid (`months_since_first` 0..6).
3. Explain which rows are censored (if `activity_month` is after max observed month) and why those rows are *missing in chart tables*, not zero behavior.


## 4) The 3 Charts (question each chart answers)
1. **Heatmap (logo retention)**: "How does repeat behavior decay across cohorts over months 0..H?"
2. **Family scatter (M2)**: "Which first-order families combine weak repeat (logo) and weak value after credits (net proxy)?"
3. **M2 by first_product_family**: "Which first families are strongest/weakest at Month 2 and how big are samples?"

Right-censor rule: months after max_observed_month are unobserved and shown as missing, not zero.


### Why we do not show 3 random cohort curves (and use a heatmap instead)
- Cherry-pick risk: picking 3 cohorts can accidentally tell the story you want instead of the story the data supports.
- Censoring distortion: late cohorts hit end-of-data, which can look like a crash unless missing is shown explicitly.
- Decision mismatch: the exec question is overall pattern and where to act, not one-off cohort anecdotes.


### Chart 1 Caption Prompt
**Write caption here:** _[Your one-sentence caption]_

<details><summary>Now compare to answer key</summary>

Open `docs/TEACHING_ANSWER_KEY.md` -> **Chart 1**.

</details>

### Chart 2 Caption Prompt
**Write caption here:** _[Your one-sentence caption]_

<details><summary>Now compare to answer key</summary>

Open `docs/TEACHING_ANSWER_KEY.md` -> **Chart 2** and map to memo section `## Plays` (refund drag / returns mitigation).

</details>

### Chart 3 Caption Prompt
**Write caption here:** _[Your one-sentence caption]_

<details><summary>Now compare to answer key</summary>

Open `docs/TEACHING_ANSWER_KEY.md` -> **Chart 3** and map to memo section `## Top 3 Target Families`.

</details>

### Stop & Answer
1. Which chart answers retention shape over time by cohort month?  
Answer: Chart 1 heatmap (cohort_month x months_since_first).

2. Which chart tells you where to test first-order family interventions?  
Answer: Chart 3 M2 retention by `first_product_family` with sample size labels.

In [None]:
# Chart-table previews (same inputs used by story artifact charts).
print('Chart1 preview (logo heatmap table):')
print(chart1.head(5).to_string(index=False))

print('\nChart2 appendix heatmap preview (net proxy by cohort-month):')
print('unique cohorts:', int(chart2['cohort_month'].nunique()))
m2 = chart2[chart2['months_since_first'] == 2].copy()
print(m2[['cohort_month', 'n_customers', 'net_retention_proxy']].head(10).to_string(index=False))

print('\nChart2 story scatter preview (family impact at M2):')
scatter_preview = chart2_scatter[['first_product_family', 'n_customers', 'x_m2_logo_retention', 'y_m2_net_retention_proxy', 'rank_priority']].copy()
print(scatter_preview.sort_values('rank_priority', kind='stable').to_string(index=False))

print('\nChart3 table:')
print(chart3.to_string(index=False))


## Worked Example: Family X
Use one family from the scatter to practice executive interpretation from numbers only.


In [None]:
# Worked mini-case: choose the weakest net-proxy family among decision-safe n.
fam = chart2_scatter.copy()
fam['n_customers'] = fam['n_customers'].astype(int)
fam = fam[fam['n_customers'] >= MIN_COHORT_N].copy()
if fam.empty:
    raise ValueError(f'No family rows with n>={MIN_COHORT_N}')

case = fam.sort_values(['y_m2_net_retention_proxy', 'rank_priority'], ascending=[True, True], kind='stable').iloc[0]
family = str(case['first_product_family'])
n_customers = int(case['n_customers'])
m2_logo = float(case['x_m2_logo_retention'])
m2_net = float(case['y_m2_net_retention_proxy'])

print(f'Family: {family}')
print(f'n_customers: {n_customers}')
print(f'M2 logo retention: {m2_logo:.3f}')
print(f'M2 net proxy retention: {m2_net:.3f}')

if m2_net < m2_logo:
    interpretation = 'Net proxy trails logo -> refund/credit drag is likely part of the weakness.'
    action = 'Start with returns-mitigation + post-purchase quality messaging for this family.'
else:
    interpretation = 'Logo is the main weakness -> repeat demand issue is likely larger than refund drag.'
    action = 'Start with replenishment timing + repeat purchase incentive for this family.'

print('Interpretation:', interpretation)
print('Action:', action)
print('Caveat: Directional not causal; validate with controlled experiments.')


Plain-English: This is the interview move. You name one family, state `n`, compare logo vs net proxy, then pick one concrete action.


### You do -> Then we show: Caption writing drill
Write your own one-sentence caption for each chart before reading the examples.

Your turn:
- Chart 1 caption: __________________________
- Chart 2 caption: __________________________
- Chart 3 caption: __________________________

Strong example captions:
- Chart 1: 'Recent cohorts show a similar month-0 start but diverge by months 2-4, indicating uneven repeat decay across acquisition windows.'
- Chart 2: 'Eligible cohorts separate on net proxy by month 2, showing that refund-adjusted value can diverge even when logo retention looks comparable.'
- Chart 3: 'Month-2 retention varies materially by first_product_family, with n-labels indicating which family gaps are decision-relevant vs too small to trust.'

## 5) Mini Exercise (runnable)
Pick 1 customer and inspect their full 7-row grid.

What to explain after running:
- Month 0 is cohort start month (first valid purchase month)
- Months 1..6 are lifecycle months after first purchase
- `is_retained_logo` flips to 1 when there is at least one valid purchase in that month
- Censoring check: if `activity_month > max_observed_month`, that month is unobservable and should be missing in right-censored chart tables


### Stop & Answer
1. Explain `months_since_first` in one sentence.  
Answer: It is the integer month distance between an activity month and the customer's cohort month, constrained to 0..6 for this project.

2. Why is month0 retention 100%?  
Answer: Cohort inclusion requires at least one valid purchase, so each included customer has month-0 valid activity by construction.
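
The "integer month distance" in the first answer can be written out as plain calendar arithmetic. The helper `months_between` below is a hypothetical name for illustration and is not taken from the project code.

```python
import pandas as pd


def months_between(cohort_month: str, activity_month: str) -> int:
    """Whole calendar months from cohort_month to activity_month (both 'YYYY-MM')."""
    c = pd.Period(cohort_month, freq='M')
    a = pd.Period(activity_month, freq='M')
    return (a.year - c.year) * 12 + (a.month - c.month)


same_month = months_between('2011-06', '2011-06')   # first purchase month itself
across_year = months_between('2010-11', '2011-02')  # crosses a year boundary
too_late = months_between('2010-11', '2011-08')     # beyond the 0..6 window

for label, d in [('same_month', same_month), ('across_year', across_year), ('too_late', too_late)]:
    in_window = 0 <= d <= 6
    print(f'{label}: months_since_first={d} in_0_to_6_window={in_window}')
```

Activity outside the 0..6 window has a well-defined distance, but it has no row in the 7-row grid.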

In [None]:
# Exercise: take the first customer in the activity table, print months 0..6, and mark censored rows.
exercise_customer_id = str(cma['customer_id'].astype(str).iloc[0])
customer_grid = cma[cma['customer_id'].astype(str) == exercise_customer_id].copy()
customer_grid = customer_grid.sort_values('months_since_first', kind='stable')

max_obs = pd.to_datetime(scope['raw_max_date'], errors='coerce').to_period('M')
customer_grid['cohort_period'] = pd.PeriodIndex(customer_grid['cohort_month'].astype(str), freq='M')
customer_grid['activity_period'] = customer_grid['cohort_period'] + customer_grid['months_since_first'].astype(int)
customer_grid['is_censored_for_charts'] = customer_grid['activity_period'] > max_obs

print('exercise_customer_id:', exercise_customer_id)
print('max_observed_month:', max_obs)
print(customer_grid[['customer_id', 'cohort_month', 'activity_month', 'months_since_first', 'orders_count_valid', 'gross_revenue_valid', 'net_revenue_proxy_total', 'is_retained_logo', 'is_censored_for_charts']].to_string(index=False))

censored_months = customer_grid.loc[customer_grid['is_censored_for_charts'], 'months_since_first'].tolist()
if censored_months:
    print(f'Explanation: months {censored_months} are censored for chart tables because they are beyond max observed month {max_obs}.')
else:
    print('Explanation: this customer has no censored months within 0..6 given dataset end.')


## 6) Expert-style wrap-up (what to say)
- Grains reconcile from raw to order_lines (no join explosion signal).
- Cohort construction passes full-grid and Month-0 checks.
- Gate receipts are in place before interpretation.
- Three chart questions map directly to one decision: prioritize first-product-family tests.


### Stop & Answer
1. What is your one-line decision frame from this analysis?  
Answer: Prioritize retention tests by first-order family using M2 logo retention plus net proxy context.

2. Why must you say "directional, not causal"?  
Answer: The analysis is observational segmentation, not randomized causal inference.

## Appendix: Net Retention Proxy Heatmap (Readable)
Appendix-only: this view is cohort-month-centric and mainly for sanity-checking the right-censor/suppression behavior.
White/blank = not observed (censored) or suppressed (n<MIN_COHORT_N). Blank is not zero.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
heat = chart2.copy()
heat['months_since_first'] = heat['months_since_first'].astype(int)
pivot = heat.pivot(index='cohort_month', columns='months_since_first', values='net_retention_proxy')
pivot = pivot.reindex(columns=list(range(HORIZON_H + 1))).sort_index()
vals = pivot.values.astype(float)
nonnull = vals[~np.isnan(vals)]
p95 = float(np.percentile(nonnull, 95)) if len(nonnull) else 1.0
vmax = max(1.0, p95)
n0 = heat[heat['months_since_first'] == 0][['cohort_month','n_customers_m0']].drop_duplicates()
n0_map = dict(zip(n0['cohort_month'].astype(str), n0['n_customers_m0'].astype(int)))

mat = np.ma.masked_invalid(vals)
fig, ax = plt.subplots(figsize=(10, 6))
img = ax.imshow(mat, aspect='auto', cmap='YlGnBu', vmin=0.0, vmax=vmax)
ax.set_xticks(range(HORIZON_H + 1))
ax.set_xticklabels([str(i) for i in range(HORIZON_H + 1)])
yt = pivot.index.astype(str).tolist()
ax.set_yticks(range(len(yt)))
ax.set_yticklabels([f"{cm} (n0={n0_map.get(cm,0)})" for cm in yt])
ax.set_xlabel('months_since_first')
ax.set_ylabel('cohort_month')
ax.set_title(f'Appendix Net Proxy Heatmap (vmax={vmax:.2f}, n>={MIN_COHORT_N})')
cbar = fig.colorbar(img, ax=ax)
cbar.set_label('net_retention_proxy')
plt.tight_layout()
plt.show()
