# Step 2: Why we use Difference‑in‑Differences (DID)  
DID is a statistical method that compares the **before vs after** change for “treated” packages (those that experience a major event) with the same change for untreated packages. Subtracting one from the other **controls for background trends** (like seasonality or overall npm growth) that affect both groups.  
- The `treated_post` variable (1 if the package is “treated” and the month falls after the event, 0 otherwise) isolates the **effect of the event**.  
- Package fixed effects (`C(project)`) absorb stable differences between packages, and month fixed effects (`C(month)`) absorb shocks shared by all packages in a given month. Together they reduce confounding from other factors.
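
To make the idea concrete, the simplest two-group, two-period version of DID can be worked out by hand. The sketch below uses a made-up four-row table (the `toy` frame and its numbers are hypothetical, not taken from `prepost_downloads_real.csv`) and computes the estimate as the treated change minus the control change.

```python
import pandas as pd

# Hypothetical toy data: one treated package ("a") and one control package ("b"),
# each observed once before (post=0) and once after (post=1) the event.
toy = pd.DataFrame({
    "project":   ["a", "a", "b", "b"],
    "treated":   [1, 1, 0, 0],
    "post":      [0, 1, 0, 1],
    "downloads": [100, 160, 80, 110],
})

# Mean downloads in each (treated, post) cell
means = toy.pivot_table(index="treated", columns="post",
                        values="downloads", aggfunc="mean")

change_treated = means.loc[1, 1] - means.loc[1, 0]  # before/after change, treated
change_control = means.loc[0, 1] - means.loc[0, 0]  # before/after change, control
did_estimate = change_treated - change_control      # difference of the differences

print(did_estimate)  # 30.0
```

Here the treated package gains 60 downloads and the control gains 30, so the DID estimate is 30: the part of the treated change that the background trend does not explain. The regression with `treated_post` and fixed effects generalizes this calculation to many packages and months.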


In [None]:
# 02_did_analysis.ipynb
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Default figure size for plots in this notebook
plt.rcParams['figure.figsize'] = (10, 4)

# Load prepost dataset
df = pd.read_csv('prepost_downloads_real.csv')

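# DID regression: treated_post is the interaction of the treated and post indicators
df['treated_post'] = df['treated'] * df['post']
model = smf.ols(
    'downloads ~ treated_post + C(project) + C(month)',  # two-way fixed effects
    data=df,
).fit(cov_type='cluster', cov_kwds={'groups': df['project']})  # SEs clustered by project
print(model.summary())

# Save the regression table as plain text
with open('did_summary.txt', 'w') as f:
    f.write(model.summary().as_text())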

# Plot downloads over time
# Appending '-01' assumes month is stored as a 'YYYY-MM' string
df['month_dt'] = pd.to_datetime(df['month'] + '-01')
plt.figure(figsize=(12, 6))
for p in df['project'].unique():
    sub = df[df['project'] == p].sort_values('month_dt')
    plt.plot(sub['month_dt'], sub['downloads'], label=p)
plt.legend()
plt.title('Monthly downloads by project')
plt.xlabel('Month')
plt.ylabel('Downloads')
plt.grid(True)
plt.show()

# DID Regression Results
We fit an OLS model with project and month fixed effects to estimate the effect of our "treatment event" on downloads.  

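- A positive `treated_post` coefficient means downloads of treated packages rose after the event, relative to untreated packages over the same months.
- A negative `treated_post` coefficient means they fell relative to untreated packages.
- Each project and each month gets its own fixed effect, so stable differences between packages and month-wide shocks do not feed into the estimate.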
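
As a rough illustration of how to read the coefficient off a fitted model, the sketch below simulates a small panel with a known effect of 200 downloads, then pulls out the `treated_post` estimate, its cluster-robust standard error, and a 95% confidence interval. The simulated data, the `toy` and `fit` names, and the effect size are assumptions for illustration only; `params`, `bse`, and `conf_int()` are standard attributes of statsmodels results objects.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated panel (hypothetical): 8 packages x 12 months, first 4 packages treated,
# event takes effect from month 7, true effect = 200 downloads.
rng = np.random.default_rng(0)
rows = []
for i in range(8):
    treated = int(i < 4)
    for m in range(1, 13):
        post = int(m > 6)
        downloads = (1000 + 50 * i + 10 * m          # package level + common trend
                     + 200 * treated * post          # treatment effect
                     + rng.normal(0, 20))            # noise
        rows.append({"project": f"pkg{i}", "month": f"2023-{m:02d}",
                     "treated": treated, "post": post, "downloads": downloads})
toy = pd.DataFrame(rows)
toy["treated_post"] = toy["treated"] * toy["post"]

# Integer cluster labels, one per package
groups = pd.factorize(toy["project"])[0]

fit = smf.ols("downloads ~ treated_post + C(project) + C(month)", data=toy).fit(
    cov_type="cluster", cov_kwds={"groups": groups}
)

estimate = fit.params["treated_post"]                    # DID point estimate
std_err = fit.bse["treated_post"]                        # cluster-robust standard error
ci_low, ci_high = fit.conf_int().loc["treated_post"]     # 95% confidence interval

print(f"effect = {estimate:.1f}, SE = {std_err:.1f}, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
```

The sign of `estimate` gives the direction of the effect. If the confidence interval includes zero, the data do not clearly distinguish an increase from a decrease.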

This helps us assess whether policy changes (like CSL adoption) are followed by an increase or a decrease in downloads, the outcome this model measures.
