#### Exercise 11 - Solution - Compare Synthetic Control to Event Study and Difference-in-Differences

An e-commerce company operating in 10 regions implemented a new marketing strategy in a single test region (Region 5), starting October 1, 2024 and continuing indefinitely. Because we generated the data ourselves, we know the ground truth: the treatment increased revenue in that region by 20% over what it would have been without the change.

In Exercise 10, you built a synthetic control model using Lasso weights to estimate that 20% average treatment effect. Here, use the same data but estimate the effect with an event study and with Difference-in-Differences.

In [2]:
import pandas as pd
import statsmodels.formula.api as smf 


In [3]:
# Parse the Month column as dates so comparisons against event_start are date-aware
df = pd.read_csv("../data/synthetic_control_revenue_data.csv", parse_dates=["Month"])

# First: Replace the space between "Region" and the number with an underscore
df.columns = [col.replace(" ", "_") for col in df.columns]
df.head()


Unnamed: 0,Month,Region_1,Region_2,Region_3,Region_4,Region_5,Region_6,Region_7,Region_8,Region_9,Region_10
0,2023-01-01,9322.565026,10763.201411,8379.885931,9258.455623,9450.947929,10437.677905,10060.643647,9985.189491,9328.222526,9639.947623
1,2023-02-01,9737.938268,11701.398724,11241.961029,10991.843109,10222.822882,10700.643686,10281.504076,10915.543547,10560.931888,11581.508977
2,2023-03-01,11287.917691,11792.099443,11363.351753,9207.389538,10847.371295,11319.571857,12205.97794,12243.919931,10102.988763,11022.111375
3,2023-04-01,12584.751131,12967.746822,12867.541821,11259.416674,11975.964764,12603.817098,11528.820985,11631.156162,11660.287992,12052.554474
4,2023-05-01,11816.404647,11426.054667,11258.764335,11598.348889,11540.80387,12154.049899,10765.484396,12401.222295,11194.799213,11972.444224


In [4]:
# Set parameters 
treated_region = "Region_5"
control_regions = [col for col in df.columns if col.startswith("Region") and col != treated_region]
event_start = "2024-10-01" 


#### Estimate impacts using an event study with all comparison regions
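
The regression in this section treats the control regions as predictors of the treated region and reads the effect off a post-period dummy. A minimal sketch of that idea on simulated data, using NumPy least squares rather than statsmodels. The two control series, the blend weights, and the additive effect of 2,000 are all made up for illustration and are not taken from the exercise data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_months, n_post = 24, 3

# Two hypothetical control regions (assumption: not the exercise data)
controls = rng.normal(10_000, 1_000, size=(n_months, 2))

# Post indicator: 1 for the last n_post months, 0 otherwise
post = np.zeros(n_months)
post[-n_post:] = 1.0

# Assumed data-generating process: an exact linear blend of the controls
# plus an additive shift in the post period
true_effect = 2_000.0
treated = 500.0 + controls @ np.array([0.6, 0.4]) + true_effect * post

# Design matrix: intercept, control regions, post dummy
X = np.column_stack([np.ones(n_months), controls, post])
coefs, *_ = np.linalg.lstsq(X, treated, rcond=None)

# The coefficient on the post dummy is the estimated effect
estimated_effect = coefs[-1]
print(round(float(estimated_effect), 2))
```

Because this toy series is noise-free and the shift is additive, the post coefficient matches the assumed effect up to floating-point error. In the exercise data the true effect is multiplicative (20%), so the additive coefficient is best read as an average revenue lift over the post period.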

In [5]:

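# Flag the post-treatment months (1 on or after the event start, 0 before)
df['post'] = (df['Month'] >= event_start).astype(int)
# Regress the treated region on every control region plus the post indicator
formula = f"{treated_region} ~ {' + '.join(control_regions)} + post"

event_study_model = smf.ols(formula, data=df).fit()
# The coefficient on `post` is the estimated treatment effect, in revenue units
event_study_att = event_study_model.params['post']
print(event_study_model.summary())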


# Notes: From the simulated data, we know that Region 5 is a function of Regions 1, 4, and 7.
# The Lasso model shrinks the coefficients on the other regions toward zero, whereas the
# event study model gives them small but meaningful non-zero coefficients.

                            OLS Regression Results                            
Dep. Variable:               Region_5   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.996
Method:                 Least Squares   F-statistic:                     631.8
Date:                Mon, 11 Aug 2025   Prob (F-statistic):           8.93e-16
Time:                        21:12:08   Log-Likelihood:                -138.27
No. Observations:                  24   AIC:                             298.5
Df Residuals:                      13   BIC:                             311.5
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   -184.3119    222.631     -0.828      0.4

In [6]:

# Counterfactual baseline: observed post-period mean for Region 5 minus the estimated effect
est_baseline_revenue = df[df['post'] == 1]["Region_5"].mean() - event_study_att
print(f"Estimated Baseline Revenue: {round(est_baseline_revenue, 2)}")
print(f"Event Study Pct Increase: {round(100*event_study_att/est_baseline_revenue,1)}%")

Estimated Baseline Revenue: 10792.32
Event Study Pct Increase: 19.6%
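
The conversion from an additive effect to a percent lift can be checked by hand. A minimal sketch with hypothetical round numbers rather than the notebook's estimates; `pct_lift` is an illustrative helper, not part of the notebook:

```python
def pct_lift(post_mean: float, att: float) -> float:
    """Percent increase over the estimated counterfactual baseline."""
    # Counterfactual baseline: the post-period mean with the effect removed
    baseline = post_mean - att
    return 100 * att / baseline

# Hypothetical numbers: observed post-period mean of 12,000 and an effect of 2,000
example_lift = pct_lift(post_mean=12_000, att=2_000)
print(f"{example_lift:.1f}%")
```

The denominator matters: dividing the same effect by the observed post-period mean (2,000 / 12,000) would give about 16.7%, which understates the lift relative to the no-treatment baseline of 10,000.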


#### Reshape the data (Month-x-Region for each row) and estimate impacts using DiD
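
With one treated group and one post indicator, the `treated:post` coefficient in a model of the form `Revenue ~ treated + post + treated:post` equals a difference of four group means. A minimal sketch of that identity on made-up numbers; the revenue values and the `did_from_means` helper are hypothetical and not drawn from the dataset:

```python
from statistics import mean

def did_from_means(rows):
    """rows: list of (revenue, treated, post) tuples with 0/1 flags."""
    def cell(t, p):
        # Mean revenue for one treated/post combination
        return mean(r for r, tr, po in rows if tr == t and po == p)

    treated_change = cell(1, 1) - cell(1, 0)   # post minus pre, treated group
    control_change = cell(0, 1) - cell(0, 0)   # post minus pre, control group
    return treated_change - control_change

toy_rows = [
    (100.0, 0, 0), (104.0, 0, 0),   # control, pre
    (110.0, 0, 1), (114.0, 0, 1),   # control, post
    (90.0, 1, 0), (94.0, 1, 0),     # treated, pre
    (120.0, 1, 1), (124.0, 1, 1),   # treated, post
]
toy_att = did_from_means(toy_rows)
print(toy_att)
```

Here the control group rises by 10 and the treated group by 30, so the difference-in-differences is 20. Unlike the event study, this comparison weights every control region equally rather than learning weights from the pre-period.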

In [7]:
# Reshape the DataFrame into a long format
long_df = pd.melt(df, id_vars=["Month"], value_vars=control_regions + [treated_region], 
                  var_name="Region", value_name="Revenue")


# Set up DiD
long_df['post'] = (long_df['Month'] >= event_start).astype(int)
long_df['treated'] = (long_df['Region'] == treated_region).astype(int)

# Fit a difference-in-differences model using statsmodels
did_model_long = smf.ols("Revenue ~ treated + post + treated:post", data=long_df).fit()
print(did_model_long.summary())

# Calculate the average treatment effect on the treated (ATT) from the long format model
att_long = did_model_long.params['treated:post']
treated_revenue = long_df[(long_df['post'] == 1) & (long_df['treated'] == 1)]["Revenue"].mean()
baseline_revenue = treated_revenue - att_long
print(f"Estimated Baseline Revenue (long format): {round(baseline_revenue, 2)}")
print(f"Pct Increase (long format): {round(100 * att_long / baseline_revenue, 1)}%")


                            OLS Regression Results                            
Dep. Variable:                Revenue   R-squared:                       0.036
Model:                            OLS   Adj. R-squared:                  0.023
Method:                 Least Squares   F-statistic:                     2.916
Date:                Mon, 11 Aug 2025   Prob (F-statistic):             0.0350
Time:                        21:12:08   Log-Likelihood:                -2123.3
No. Observations:                 240   AIC:                             4255.
Df Residuals:                     236   BIC:                             4268.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept     1.043e+04    123.422     84.535   