# Lab 6: Credible Evidence for Policy Evaluation

From correlation to causation in mobile money welfare effects

> **Expected Time**
>
> -   All students: ~75 minutes
> -   Extension activities: ~30 minutes

<figure>
<a
href="https://colab.research.google.com/github/quinfer/fin510-colab-notebooks/blob/main/labs/lab06_open_banking.ipynb"><img
src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
<figcaption>Open in Colab</figcaption>
</figure>

## The Core Challenge: Does Mobile Money Reduce Poverty?

You observe that M-Pesa users in Kenya earn £100/month while non-users
earn £80/month. Can you conclude M-Pesa **caused** the £20 difference?

**Three possible scenarios:**

1.  **True causal effect**: M-Pesa caused £20 income increase
2.  **Pure selection bias**: High earners adopted M-Pesa (zero causal
    effect)
3.  **Partial selection**: Better prospects adopted + M-Pesa helped (£10
    real effect, but we’d estimate £20)

This lab teaches you how to distinguish these scenarios using
**difference-in-differences (DiD)**, the workhorse method in policy
evaluation.
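
One way to organise the three scenarios is the standard potential-outcomes decomposition of a naive comparison. The notation is an assumption introduced here for illustration and is not used elsewhere in the lab: $Y_i(1)$ and $Y_i(0)$ are household $i$'s income with and without M-Pesa, and $D_i = 1$ marks adopters.

$$
\underbrace{E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0]}_{\text{naive difference}}
= \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{causal effect on adopters}}
+ \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{selection bias}}
$$

Read against the scenarios above: in scenario 1 the selection term is zero (20 = 20 + 0), in scenario 2 the causal term is zero (20 = 0 + 20), and in scenario 3 both contribute (20 = 10 + 10).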

## Learning Objectives

By the end of this lab, you will:

-   Understand why simple comparisons fail (selection bias)
-   Implement difference-in-differences estimation
-   Check the plausibility of the parallel trends assumption
-   Estimate heterogeneous effects (gender, income, location)
-   Interpret effect sizes and economic significance
-   Connect findings to the Suri & Jack (2016) *Science* paper

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## Part 1: The Selection Bias Problem (15 minutes)

### Simulate the Problem

Let’s create data in which M-Pesa LOOKS like it increases income but
actually has no effect.

In [None]:
np.random.seed(42)
n = 1000

# Scenario 2: Pure selection bias (M-Pesa has ZERO causal effect)
# High-ability people (£100 potential) adopt M-Pesa
# Low-ability people (£80 potential) don't adopt

ability = np.random.choice([80, 100], size=n, p=[0.7, 0.3])
adopts_mpesa = (ability == 100).astype(int)  # High ability → adopt
income = ability + np.random.normal(0, 10, n)  # Income = ability + noise

df_bias = pd.DataFrame({
    'adopts_mpesa': adopts_mpesa,
    'income': income,
    'ability': ability  # We observe this in simulation, not in real data!
})

print("Observed correlation (naive comparison):")
print(df_bias.groupby('adopts_mpesa')['income'].mean())

### YOUR TURN: Calculate the Misleading Estimate

In [None]:
# Task 1.1: Calculate the naive treatment effect
# Hint: mean(income | adopts_mpesa=1) - mean(income | adopts_mpesa=0)

mean_adopters = df_bias[df_bias['adopts_mpesa'] == 1]['income'].mean()
mean_non_adopters = df_bias[df_bias['adopts_mpesa'] == 0]['income'].mean()
naive_effect = mean_adopters - mean_non_adopters

print(f"Naive estimate: £{naive_effect:.2f}")
print(f"True causal effect: £0.00")
print(f"Bias: £{naive_effect:.2f}")

> **The Problem**
>
> We estimated an effect of roughly £20, but the true effect is £0.
> Selection bias makes M-Pesa **look** effective when it isn’t. This is
> why policymakers need causal methods, not just correlations.

## Part 2: Difference-in-Differences Solution (30 minutes)

### The DiD Logic

**Key insight**: Even if M-Pesa adopters differ from non-adopters at
baseline, we can control for those **fixed differences** by looking at
**changes over time**.
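
The arithmetic behind this idea can be sketched with four group means. The numbers below are hypothetical, chosen only to make the steps easy to follow: the treated group starts £10 higher (a fixed difference), both groups share a £3 trend, and the assumed causal effect is £12.

```python
# Hypothetical group means (£/month), chosen only to illustrate the arithmetic
means = {
    ("control", "pre"): 80.0,
    ("control", "post"): 83.0,   # control grows by 3 (common trend)
    ("treated", "pre"): 90.0,    # treated starts 10 higher (fixed difference)
    ("treated", "post"): 105.0,  # grows by 15 = 3 (trend) + 12 (assumed effect)
}

change_treated = means[("treated", "post")] - means[("treated", "pre")]
change_control = means[("control", "post")] - means[("control", "pre")]
did = change_treated - change_control
naive_post_gap = means[("treated", "post")] - means[("control", "post")]

print(f"Change in treated group: {change_treated:.1f}")   # 15.0
print(f"Change in control group: {change_control:.1f}")   # 3.0
print(f"DiD estimate:            {did:.1f}")              # 12.0
print(f"Naive post-period gap:   {naive_post_gap:.1f}")   # 22.0
```

The naive post-period gap (22) mixes the fixed difference (10) with the effect (12). Differencing over time removes the fixed difference, and subtracting the control group's change removes the common trend.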

In [None]:
np.random.seed(123)
n_households = 400
time_periods = [2008, 2010, 2012, 2014]

# Generate household characteristics
household_ability = np.random.normal(85, 15, n_households)  # Fixed characteristic
household_id = np.arange(n_households)

# Treatment assignment based on agent proximity (2010 rollout)
# High proximity households adopt in 2010
proximity = np.random.uniform(0, 1, n_households)
treats_after_2010 = (proximity > 0.5).astype(int)

# Create panel data
data = []
for t in time_periods:
    for i in range(n_households):
        # Treatment starts in 2010 for high-proximity households
        treated = 1 if (treats_after_2010[i] == 1 and t >= 2010) else 0
        post = 1 if t >= 2010 else 0
        
        # Income = baseline + time trend + treatment effect
        true_treatment_effect = 12  # £12 true causal effect
        time_trend = (t - 2008) * 0.5  # Small positive trend for everyone
        
        income = (
            household_ability[i] +  # Fixed baseline (removed by DiD)
            time_trend +  # Common time trend
            treated * true_treatment_effect +  # Causal effect
            np.random.normal(0, 8)  # Random noise
        )
        
        data.append({
            'household_id': i,
            'year': t,
            'income': income,
            'treated': treated,
            'post': post,
            'ability': household_ability[i],
            'proximity': proximity[i]
        })

df_panel = pd.DataFrame(data)
print(f"Panel data: {len(df_panel)} observations ({n_households} households × {len(time_periods)} years)")
print(df_panel.head(12))

### Visualize Pre-Treatment Trends

**Parallel trends assumption**: Treated and control groups would have
followed the same trajectory absent treatment.

In [None]:
# Group households by whether they are EVER treated (high proximity).
# Grouping by current status ('treated') leaves the treated line with no
# 2008 point, because nobody is treated before the rollout.
df_panel['treat_group'] = (df_panel['proximity'] > 0.5).astype(int)

# Calculate average income by year and treatment group
trends = df_panel.groupby(['year', 'treat_group'])['income'].mean().reset_index()
trends_pivot = trends.pivot(index='year', columns='treat_group', values='income')

plt.figure(figsize=(10, 6))
plt.plot(trends_pivot.index, trends_pivot[0], 'o-', label='Control (low proximity)', linewidth=2)
plt.plot(trends_pivot.index, trends_pivot[1], 's-', label='Treated (high proximity)', linewidth=2)
plt.axvline(x=2008, color='gray', linestyle='--', alpha=0.5, label='Pre-treatment')
plt.axvline(x=2010, color='red', linestyle='--', linewidth=2, label='M-Pesa rollout')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Income (£/month)', fontsize=12)
plt.title('Parallel Trends: Income Trajectories by Treatment Group', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

> **What to Look For**
>
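> **Similar pre-treatment paths**: Both groups should follow similar
> trajectories BEFORE 2010. This simulated panel has only one
> pre-treatment wave (2008), so the plot can show baseline levels but
> not pre-trends; checking pre-trends needs at least two pre-treatment
> waves. If pre-trends differ, the parallel trends assumption is in
> doubt and treated and control groups may not be comparable even after
> differencing.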

### YOUR TURN: Implement DiD Regression

The DiD estimator uses this regression:

$$
\text{Income}_{it} = \beta_0 + \beta_1 \text{Treated}_i + \beta_2 \text{Post}_t + \beta_3 (\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it}
$$

**The coefficient β₃ is your causal estimate.**
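
To see why β₃ is the difference-in-differences, take the expected value of the regression in each of the four group-by-period cells. This sketch assumes $\text{Treated}_i$ is a time-invariant group indicator and that $E[\varepsilon_{it} \mid \text{Treated}_i, \text{Post}_t] = 0$:

$$
\begin{aligned}
E[\text{Income}_{it} \mid \text{Treated}_i = 0, \text{Post}_t = 0] &= \beta_0 \\
E[\text{Income}_{it} \mid \text{Treated}_i = 0, \text{Post}_t = 1] &= \beta_0 + \beta_2 \\
E[\text{Income}_{it} \mid \text{Treated}_i = 1, \text{Post}_t = 0] &= \beta_0 + \beta_1 \\
E[\text{Income}_{it} \mid \text{Treated}_i = 1, \text{Post}_t = 1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3
\end{aligned}
$$

The treated group's change is $\beta_2 + \beta_3$ and the control group's change is $\beta_2$, so the difference between the two changes is $\beta_3$.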

In [None]:
from sklearn.linear_model import LinearRegression

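# Task 2.1: Create the group indicator and the interaction term
# Note: 'treated' is already 1 only for high-proximity households in
# post-rollout years, so it equals Treated_i x Post_t, not Treated_i.
# Using it next to treated*post gives two identical (collinear) columns.
df_panel['treat_group'] = (df_panel['proximity'] > 0.5).astype(int)
df_panel['treated_x_post'] = df_panel['treat_group'] * df_panel['post']

# Task 2.2: Run the regression
X = df_panel[['treat_group', 'post', 'treated_x_post']]
y = df_panel['income']

model = LinearRegression()
model.fit(X, y)

# Task 2.3: Extract the DiD estimate
did_estimate = model.coef_[2]  # Coefficient on the interaction term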

print(f"\nDiD Estimate: £{did_estimate:.2f}")
print(f"True Effect: £12.00")
print(f"Estimation Error: £{abs(did_estimate - 12):.2f}")
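
As a cross-check, the interaction coefficient from this regression should match the DiD computed directly from the four cell means. The sketch below is self-contained and builds its own panel; the parameters mirror the lab's simulation, but the names (`group`, `did_means`, `did_ols`) and the random seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_households = 400
years = np.array([2008, 2010, 2012, 2014])
T = len(years)

# One row per household-year (hypothetical panel mirroring the lab's setup)
year = np.tile(years, n_households)
ability = np.repeat(rng.normal(85, 15, n_households), T)
group = np.repeat(rng.integers(0, 2, n_households), T)  # ever-treated indicator
post = (year >= 2010).astype(int)

income = ability + 0.5 * (year - 2008) + 12 * group * post + rng.normal(0, 8, year.size)
df = pd.DataFrame({"group": group, "post": post, "income": income})

# (1) DiD from the four cell means
m = df.groupby(["group", "post"])["income"].mean()
did_means = (m.loc[(1, 1)] - m.loc[(1, 0)]) - (m.loc[(0, 1)] - m.loc[(0, 0)])

# (2) DiD from OLS with intercept, group, post, and their interaction
X = np.column_stack([np.ones(len(df)), df["group"], df["post"], df["group"] * df["post"]])
beta, *_ = np.linalg.lstsq(X, df["income"].to_numpy(), rcond=None)
did_ols = beta[3]

print(f"DiD from cell means: {did_means:.3f}")
print(f"DiD from regression: {did_ols:.3f}")
```

The two numbers agree up to floating-point error because the regression is saturated: it has one parameter per cell. The regression form becomes more useful once you add covariates or want standard errors, which `LinearRegression` does not report.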

> **Note**
>
> ### Interpretation
>
> If you get ~£12, congratulations! You’ve recovered the true causal
> effect, even if the groups differ at baseline. Each coefficient does
> a specific job:
>
> -   β₁ absorbs fixed differences between groups (such as ability)
> -   β₂ absorbs common time trends
> -   β₃, the interaction, isolates the treatment effect

## Part 3: Heterogeneous Effects (20 minutes)

Suri & Jack (2016) found **larger effects for women**. Let’s estimate
heterogeneous effects by gender.

In [None]:
# Add gender to our data (50/50 split)
df_panel['female'] = df_panel['household_id'] % 2

# Re-simulate income with gender-specific treatment effects
def income_with_gender(row):
    base = row['ability']
    time_trend = (row['year'] - 2008) * 0.5
    
    # Women benefit MORE from M-Pesa (£18 vs £8 for men)
    treatment_effect = 18 if row['female'] == 1 else 8
    treatment_gain = row['treated'] * treatment_effect
    
    return base + time_trend + treatment_gain + np.random.normal(0, 8)

df_panel['income'] = df_panel.apply(income_with_gender, axis=1)

### YOUR TURN: Estimate Effects by Gender

In [None]:
# Task 3.1: Run separate DiD regressions for men and women
# Use a time-invariant group indicator: 'treated' is switched on only
# after 2010, so pairing it with treated*post gives two collinear columns.
df_panel['treat_group'] = (df_panel['proximity'] > 0.5).astype(int)
df_panel['treated_x_post'] = df_panel['treat_group'] * df_panel['post']
did_cols = ['treat_group', 'post', 'treated_x_post']

df_male = df_panel[df_panel['female'] == 0].copy()
df_female = df_panel[df_panel['female'] == 1].copy()

# Male DiD estimate
model_male = LinearRegression()
model_male.fit(df_male[did_cols], df_male['income'])
male_effect = model_male.coef_[2]

# Female DiD estimate
model_female = LinearRegression()
model_female.fit(df_female[did_cols], df_female['income'])
female_effect = model_female.coef_[2]

print(f"\nHeterogeneous Effects:")
print(f"Male households: £{male_effect:.2f}")
print(f"Female households: £{female_effect:.2f}")
print(f"Difference: £{female_effect - male_effect:.2f}")
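
Separate regressions give two estimates but no single coefficient for the gap between them. One common alternative is a pooled regression with a triple interaction (female × group × post), whose coefficient is the female-minus-male difference in effects. The sketch below is self-contained; the simulated panel, seed, and variable names are assumptions for illustration, with the £18 and £8 effects taken from the cell above.

```python
import numpy as np

rng = np.random.default_rng(1)
n_households = 400
years = np.array([2008, 2010, 2012, 2014])
T = len(years)

ability = np.repeat(rng.normal(85, 15, n_households), T)
group = np.repeat(rng.integers(0, 2, n_households), T)   # ever-treated indicator
female = np.repeat(np.arange(n_households) % 2, T)       # alternate household heads
year = np.tile(years, n_households)
post = (year >= 2010).astype(int)

effect = np.where(female == 1, 18.0, 8.0)                # assumed true effects
income = ability + 0.5 * (year - 2008) + effect * group * post + rng.normal(0, 8, year.size)

# Fully interacted design: every DiD term, plus its interaction with 'female'
X = np.column_stack([
    np.ones(year.size), group, post, group * post,
    female, female * group, female * post, female * group * post,
])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)

male_effect_hat = beta[3]   # DiD for male-headed households
gap_hat = beta[7]           # extra effect for female-headed households

print(f"Male effect:          {male_effect_hat:.2f} (true 8)")
print(f"Female - male gap:    {gap_hat:.2f} (true 10)")
print(f"Implied female effect: {male_effect_hat + gap_hat:.2f} (true 18)")
```

Because the model is fully interacted, the point estimates equal those from the two separate regressions. The pooled form is convenient when you want to test whether the gap differs from zero.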

### Visualize Heterogeneous Effects

In [None]:
effects = pd.DataFrame({
    'Gender': ['Male-headed', 'Female-headed'],
    'Effect': [male_effect, female_effect],
    'True': [8, 18]
})

fig, ax = plt.subplots(figsize=(8, 6))
x = np.arange(len(effects))
width = 0.35

ax.bar(x - width/2, effects['Effect'], width, label='Estimated', color='steelblue', alpha=0.8)
ax.bar(x + width/2, effects['True'], width, label='True Effect', color='coral', alpha=0.8)

ax.set_ylabel('Income Effect (£/month)', fontsize=12)
ax.set_title('Heterogeneous Treatment Effects by Gender', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(effects['Gender'])
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

> **Why Women Benefit More**
>
> Suri & Jack (2016) found three mechanisms:
>
> 1.  **Consumption smoothing**: Women receive more remittances from
>     family networks
> 2.  **Savings accumulation**: Women use M-Pesa as secure savings
>     (protect from household pressure)
> 3.  **Occupational choice**: Women shift from subsistence farming to
>     business activities
>
> This isn’t just statistical—it reflects real barriers women face in
> traditional finance.

## Part 4: Economic Significance (10 minutes)

Statistical significance ≠ economic significance. Let’s assess impact
magnitude.

In [None]:
# Calculate baseline income and set the poverty line
baseline_income = df_panel[df_panel['year'] == 2008]['income'].mean()
poverty_line = 40  # £40/month extreme poverty threshold

print(f"Baseline average income: £{baseline_income:.2f}/month")
print(f"Female treatment effect: £{female_effect:.2f}/month")
print(f"Percentage increase: {(female_effect / baseline_income * 100):.1f}%")

# What share of each group is above the poverty line in 2014?
df_2014 = df_panel[df_panel['year'] == 2014]
treated_above = (df_2014[df_2014['treated'] == 1]['income'] > poverty_line).sum()
control_above = (df_2014[df_2014['treated'] == 0]['income'] > poverty_line).sum()

treated_total = (df_2014['treated'] == 1).sum()
control_total = (df_2014['treated'] == 0).sum()

print(f"\nPoverty reduction:")
print(f"Treated group above poverty line: {treated_above}/{treated_total} ({treated_above/treated_total*100:.1f}%)")
print(f"Control group above poverty line: {control_above}/{control_total} ({control_above/control_total*100:.1f}%)")
print(f"Difference: {(treated_above/treated_total - control_above/control_total)*100:.1f} percentage points")

> **Note**
>
> ### Connecting to Suri & Jack (2016)
>
> Their findings:
>
> -   194,000 households (2% of Kenya) lifted out of extreme poverty
> -   Female-headed households: 18.5% consumption increase
> -   9.2 percentage point poverty reduction (22% relative reduction)
>
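> Your simulation is not calibrated to these figures. With a £40
> poverty line and mean income near £85, almost every simulated
> household is already above the line, so expect a poverty gap close to
> zero even though the income effect is large. The lesson still
> carries over: judge an effect by its size relative to baseline, not
> only by statistical significance.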
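
The two poverty figures in the note above are linked by simple arithmetic: relative reduction = percentage-point reduction ÷ baseline rate. The sketch below backs out the baseline rate they imply. It assumes both figures refer to the same poverty measure; the implied baseline is derived here, not quoted from the paper.

```python
pp_reduction = 9.2          # percentage points, from the note above
relative_reduction = 0.22   # 22% relative reduction, from the note above

implied_baseline = pp_reduction / relative_reduction   # in percent
implied_after = implied_baseline - pp_reduction

print(f"Implied baseline poverty rate: {implied_baseline:.1f}%")   # 41.8%
print(f"Implied rate after reduction:  {implied_after:.1f}%")      # 32.6%
```

Reporting both the absolute and the relative change avoids a common misreading: a 9.2 percentage point drop is much larger than a 9.2% drop when the baseline is around 40%.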

------------------------------------------------------------------------

## Extension: Data Quality Critique of FinTech App Studies (15 min)

**Learning Objective:** Apply Week 2 measurement concepts to evaluate
open banking/fintech app research claims

> **Connection to [Week 2: Data &
> Measurement](../chapters/02_data_measurement.qmd) & [Ch 06: Open
> Banking Data
> Quality](../chapters/06_open_banking_financial_inclusion.qmd#sec-open-banking-data)**
>
> FinTech app studies face:
>
> -   **Measurement validity**: Transaction categorization errors
>     (~20-30% miscategorized)
> -   **Selection bias**: App adopters differ from non-adopters (younger,
>     tech-savvy, financially organized)
> -   **Cannot distinguish**: App effect vs user selection (do apps
>     improve behavior, or do organized people adopt apps?)

### Exercise: Critique a FinTech App Study

**Scenario:** A budgeting app (e.g., Yolt, Emma, Money Dashboard)
claims:

> *“App users save 15% more per month compared to non-users”*

**Your task:** Identify **3 data quality threats** to this claim:

1.  **Selection bias**: Who adopts budgeting apps?
    -   Hypothesis: People already motivated to save adopt apps
        (selection)
    -   Alternative: Apps cause improved saving (causal effect)
    -   **How to distinguish?**: Would need RCT (random assignment) or
        natural experiment
2.  **Measurement validity**: How is “saving” measured?
    -   Apps see bank account data only (miss cash savings, pension
        contributions)
    -   “Saving” = Income - Spending, but income is volatile (bonuses,
        freelance work)
    -   Comparison group (non-users): The app has no data on them, so
        the claim compares app users to… what baseline?
3.  **Survivorship bias**: Who stops using the app?
    -   Users who fail to save might abandon the app (and drop out of
        the “users” statistics)
    -   Only successful savers remain → inflated success rate

**Write 150-200 words** explaining one of these threats in detail and
how it biases the 15% saving claim.
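
If you pick survivorship bias, a small simulation can make the mechanism concrete. This is a minimal sketch under stated assumptions: the app is hypothetical, the savings distribution is made up, and the logistic dropout rule (low savers quit more often) is an illustrative choice, not an estimate from any real app.

```python
import numpy as np

rng = np.random.default_rng(7)
n_users = 5000

# Savings rate of everyone who ever installed the (hypothetical) app
savings_rate = rng.normal(0.10, 0.06, n_users)

# Assumed dropout rule: the lower the savings rate, the more likely a user quits
p_stay = 1 / (1 + np.exp(-(savings_rate - 0.10) / 0.03))
still_active = rng.random(n_users) < p_stay

mean_all = savings_rate.mean()
mean_active = savings_rate[still_active].mean()

print(f"Mean savings rate, everyone who installed: {mean_all:.1%}")
print(f"Mean savings rate, users still active:     {mean_active:.1%}")
print(f"Share still active:                        {still_active.mean():.0%}")
```

The app changes nobody's savings in this simulation, yet statistics computed on active users look better than statistics on everyone who installed it.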

### Quick Code Check: Simulating Selection Bias

In [None]:
# Simulate selection bias in app adoption
np.random.seed(42)
n_people = 1000

# True savings rate (unobserved individual characteristic)
people = pd.DataFrame({
    'savings_propensity': np.random.normal(0.15, 0.08, n_people),  # Mean 15%, std 8%
})

# App adoption: people with a savings propensity above 12% adopt
# (the threshold sits below the 15% mean, so most people adopt)
people['adopt_app'] = (
    people['savings_propensity'] > 0.12
).astype(int)

# Observed savings rate = propensity + noise; adoption has NO effect on it
people['savings_rate'] = people['savings_propensity'] + np.random.normal(0, 0.02, n_people)

# Calculate apparent "app effect"
app_users_savings = people[people['adopt_app'] == 1]['savings_rate'].mean()
non_users_savings = people[people['adopt_app'] == 0]['savings_rate'].mean()
apparent_effect = app_users_savings - non_users_savings

print(f"Simulated Selection Bias:")
print(f"  App users savings rate:     {app_users_savings:.1%}")
print(f"  Non-users savings rate:     {non_users_savings:.1%}")
print(f"  Apparent 'app effect':      {apparent_effect:.1%} ({apparent_effect/non_users_savings*100:.0f}% increase)")
print(f"\n⚠️  But the app did NOTHING—this is pure selection bias!")
print(f"     People with high savings propensity adopted the app.")

**Key insight:** Observational studies can show “app users save more”
even if the app has zero causal effect. The whole gap here comes from
selection.
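
For contrast, here is what random assignment does to the same comparison. The sketch is self-contained and reuses the distribution from the cell above. The coin-flip assignment, sample size, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_people = 10_000

propensity = rng.normal(0.15, 0.08, n_people)
savings_rate = propensity + rng.normal(0, 0.02, n_people)   # app has no effect

# Self-selection: people above the 12% propensity threshold adopt
self_selected = propensity > 0.12
# Random assignment: a coin flip decides who gets the app
randomised = rng.random(n_people) < 0.5

gap_selected = savings_rate[self_selected].mean() - savings_rate[~self_selected].mean()
gap_random = savings_rate[randomised].mean() - savings_rate[~randomised].mean()

print(f"Gap under self-selection:   {gap_selected:.3f}")
print(f"Gap under random assignment: {gap_random:.3f}")
```

Under random assignment the two groups have the same propensity distribution on average, so the gap shrinks to sampling noise around the true effect of zero. This is why an RCT or a credible natural experiment is needed to separate the app's effect from user selection.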

------------------------------------------------------------------------

## Summary and Reflection

### What You Learned

1.  **Selection bias** makes naive comparisons misleading
2.  **Difference-in-differences** controls for fixed differences and
    common trends
3.  **Parallel trends assumption** is critical and cannot be tested
    directly; always inspect pre-treatment trends visually
4.  **Heterogeneous effects** reveal who benefits most
5.  **Economic significance** is a separate question from statistical
    significance; assess both

### Connection to Policy

**Why this matters for Open Banking:**

If we want to evaluate whether Open Banking improves financial
inclusion, we can’t just compare users to non-users (selection bias!).
We need:

-   Panel data tracking same individuals over time
-   Variation in access (UK rollout by bank, geographic differences)
-   Credible control groups (late adopters, non-mandatory banks)

Without causal methods, we risk spending heavily on policies that do
not work.

## Extension Activities (Optional)

### Extension 1: Robustness Checks

What if parallel trends fail? Test sensitivity:

In [None]:
# Add differential pre-trends to treated group
# Re-run DiD and see how estimates change
# Discuss implications for validity
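
One possible starting point, offered as a sketch rather than a model answer: wrap the simulation in a function with an `extra_trend` parameter (a hypothetical name) that adds a treated-only yearly trend, then compare DiD estimates across values. All parameters mirror the lab's setup but are assumptions here.

```python
import numpy as np

def simulate_did(extra_trend, seed=0, n_households=400, true_effect=12.0):
    """Return the 2x2 DiD estimate when the treated group has an extra yearly trend."""
    rng = np.random.default_rng(seed)
    years = np.array([2008, 2010, 2012, 2014])
    T = len(years)
    ability = np.repeat(rng.normal(85, 15, n_households), T)
    group = np.repeat(rng.integers(0, 2, n_households), T)
    year = np.tile(years, n_households)
    post = (year >= 2010).astype(int)
    income = (
        ability
        + 0.5 * (year - 2008)                   # common trend
        + extra_trend * group * (year - 2008)   # violation: treated-only trend
        + true_effect * group * post
        + rng.normal(0, 8, year.size)
    )

    def cell_mean(g, p):
        return income[(group == g) & (post == p)].mean()

    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

for extra in [0.0, 0.5, 1.0]:
    print(f"extra trend {extra:.1f}/year -> DiD estimate {simulate_did(extra):.2f}")
```

With post-period waves at 2, 4, and 6 years after 2008, each £1 of extra yearly trend adds £4 to the DiD estimate (the average post-period gap in years). The estimator cannot tell a differential trend from a treatment effect, which is why the assumption matters.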

### Extension 2: Event Study

Plot treatment effects year-by-year to visualize dynamics:

In [None]:
# Create year dummies × treated interactions
# Estimate coefficients for each year
# Plot event study graph
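
A minimal sketch of the estimation step, assuming a panel like the lab's: with one group indicator and a full set of year dummies, the year-by-group interaction coefficients equal the treated-minus-control gap in each year, measured relative to the reference year (2008). The simulated data, seed, and names below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_households = 400
years = np.array([2008, 2010, 2012, 2014])
T = len(years)

df = pd.DataFrame({
    "year": np.tile(years, n_households),
    "group": np.repeat(rng.integers(0, 2, n_households), T),   # ever-treated
    "ability": np.repeat(rng.normal(85, 15, n_households), T),
})
df["income"] = (
    df["ability"]
    + 0.5 * (df["year"] - 2008)
    + 12 * df["group"] * (df["year"] >= 2010)
    + rng.normal(0, 8, len(df))
)

# Treated-minus-control gap in each year, relative to the 2008 gap
means = df.groupby(["year", "group"])["income"].mean().unstack("group")
gap = means[1] - means[0]
event_study = gap - gap.loc[2008]
print(event_study.round(2))
```

The 2008 value is zero by construction, and the later values should sit near the assumed £12 effect. Passing `event_study` to `plt.plot` gives the event study graph. With only one pre-treatment wave there is no pre-period coefficient to inspect, so adding earlier waves to the simulation is a natural next step.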

### Extension 3: Read the Paper

[Suri & Jack (2016)](https://doi.org/10.1126/science.aah5309) - “The
long-run poverty and gender impacts of mobile money”

Focus on:

-   How they construct the instrument (agent proximity)
-   Why IV + DiD is stronger than either alone
-   Mechanisms (consumption smoothing, savings, occupational choice)

## Assessment Connections

If your module includes assessed short answers or essays, this lab’s
framework (naïve comparison → selection bias → causal method →
heterogeneous effects → economic significance) gives you a credible
structure for policy evaluation.

If you work with panel data and can identify a meaningful “treatment”
and “control” timing, difference-in-differences is one valid
strategy—provided the identifying assumptions are plausible and
discussed.

------------------------------------------------------------------------

> **Resources**
>
> -   [Angrist & Pischke
>     (2009)](https://www.mostlyharmlesseconometrics.com/) - “Mostly
>     Harmless Econometrics” (Chapter 5: DiD)
> -   [Cunningham (2021)](https://mixtape.scunning.com/) - “Causal
>     Inference: The Mixtape” (free online textbook)
> -   [World Bank DIME
>     Wiki](https://dimewiki.worldbank.org/wiki/Main_Page) - Development
>     Impact Evaluation resources and tools