## [Week 6 Assignment] Meta-Learners

This assignment is designed to help you understand panel data and how to use it.

---
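**Assignment Guide**

- date : date
- city : city
- region : region
- treated : ad exposure (treatment variable)
- engagement : engagement (outcome variable)
- This is a hypothetical marketing dataset for examining engagement after exposure to an ad.
- Cells without blanks are there for reference; review them and move on.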
---



Submission deadline: Thursday, July 4, 2024, 11:00<br>
How to submit: share the link to your Notion page in the `#24s-causal-inference` Slack channel<br>

*Prepared by: 정성윤, 유찬영<br>
Last updated: Sunday, June 30, 2024*

# Chapter 8 - Difference-in-Differences


## 8.1 Panel Data


##### Libraries and Function Definitions

In [None]:
from toolz import *

import pandas as pd
import numpy as np

import statsmodels.formula.api as smf

import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib

from cycler import cycler

color=['0.0', '0.4', '0.8']
default_cycler = (cycler(color=color))
linestyle=['-', '--', ':', '-.']
marker=['o', 'v', 'd', 'p']

plt.rc('axes', prop_cycle=default_cycler)

##### Data

In [None]:
import pandas as pd
import numpy as np

mkt_data = (pd.read_csv("./data/Processed_New_Dataset_1.csv")
            .astype({"date":"datetime64[ns]"}))

mkt_data.head()

In [None]:
(mkt_data
 .assign(w = lambda d: d["treated"]*d["post"])
 .groupby(["w"])
 .agg({"date":["min", "max"]}))  # string names avoid the pandas warning for builtin min/max

## 8.2 Canonical Difference-in-Differences


In [None]:
did_data = (mkt_data
            .groupby(["treated", "post"])
            .agg({"engagement":"mean", "date": "min"}))           # split into treated and control groups, then take the mean

did_data

Estimate the counterfactual first, then compute the ATT.
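The arithmetic behind this step can be sketched on made-up numbers. Everything in the block below is an assumption for illustration: the `toy` table, its values, and the names `toy_y0_est` and `toy_att` are hypothetical and are not taken from `mkt_data`. The counterfactual is the treated group's pre-period mean plus the change observed in the control group, and the ATT is the gap between the treated group's observed post-period mean and that counterfactual.

```python
import pandas as pd

# Hypothetical group means (NOT the assignment data).
toy = pd.DataFrame({
    "treated":    [0, 0, 1, 1],
    "post":       [0, 1, 0, 1],
    "engagement": [10.0, 12.0, 11.0, 16.0],
}).set_index(["treated", "post"])["engagement"]

# Counterfactual for the treated group in the post period:
# treated baseline + the trend observed in the control group.
toy_y0_est = toy.loc[(1, 0)] + (toy.loc[(0, 1)] - toy.loc[(0, 0)])

# ATT = observed treated post-period mean - counterfactual.
toy_att = toy.loc[(1, 1)] - toy_y0_est

print(toy_y0_est, toy_att)
```

With these toy values the control group grows by 2, so the counterfactual is 13 and the ATT is 3.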

In [None]:
y0_est = _____



In [None]:
att = ____
att

In [None]:
mkt_data.query("post==1").query("treated==1")["tau"].mean()

### 8.2.1 Difference-in-Differences and Outcome Growth


In [None]:
pre = mkt_data.query("post==0").groupby("city")["engagement"].mean()
post = mkt_data.query("post==1").groupby("city")["engagement"].mean()

delta_y = ((post - pre)
           .rename("delta_y")
           .to_frame()
           # add the treatment dummy
           .join(mkt_data.groupby("city")["treated"].max()))

delta_y.tail()

In [None]:
(delta_y.query("treated==1")["delta_y"].mean() 
 - delta_y.query("treated==0")["delta_y"].mean())

In [None]:
did_plt = did_data.reset_index()


plt.figure(figsize=(10,4))

sns.scatterplot(data=did_plt.query("treated==0"), x="date", y="engagement", s=100, color="C0", marker="s")
sns.lineplot(data=did_plt.query("treated==0"), x="date", y="engagement", label="Control", color="C0")

sns.scatterplot(data=did_plt.query("treated==1"), x="date", y="engagement", s=100, color="C1", marker="x")
sns.lineplot(data=did_plt.query("treated==1"), x="date", y="engagement", label="Treated", color="C1",)

plt.plot(did_data.loc[1, "date"], [did_data.loc[1, "engagement"][0], y0_est], color="C2", linestyle="dashed", label="Y(0)|D=1")
plt.scatter(did_data.loc[1, "date"], [did_data.loc[1, "engagement"][0], y0_est], color="C2", s=50)

plt.xticks(rotation = 45)
plt.legend()


# Question 1
- Identify the points in the plot that represent the ATT.
- How is `Y(0)|D=1` computed?

### 8.2.2 Difference-in-Differences and OLS


Use a saturated regression model to obtain the DID estimate.

In [None]:
did_data = (mkt_data
            .groupby(["city", "post"])
            .agg({"engagement":"mean", "date": "min", "treated": "max"})
            .reset_index())

did_data.head()

In [None]:
import statsmodels.formula.api as smf

smf.ols(
    '____', data=did_data
).fit().params["____"]

### 8.2.3 Difference-in-Differences and Fixed Effects


Estimate the effect using dummy variables.

In [None]:
m = smf.ols('engagement ~ ____',
            data=did_data).fit()

m.params["treated:post"]

# Question 2

- Describe the `*` operator in your own words.
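As a starting point for this question, here is a small sketch on synthetic data. The frame `toy` and the columns `a`, `b`, and `y` are hypothetical and unrelated to the assignment data. It fits one formula written with `*` and one written out with `+` and `:`, so the two sets of coefficients can be compared side by side.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with two binary regressors (assumption for illustration).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.integers(0, 2, size=200),
    "b": rng.integers(0, 2, size=200),
})
toy["y"] = (1 + 2 * toy["a"] + 3 * toy["b"]
            + 4 * toy["a"] * toy["b"] + rng.normal(size=200))

# Formula using the `*` operator.
m_star = smf.ols("y ~ a*b", data=toy).fit()

# Formula spelling out main effects and the interaction separately.
m_long = smf.ols("y ~ a + b + a:b", data=toy).fit()

print(m_star.params)
print(m_long.params)
```

Comparing the two printed parameter tables is one way to see which terms the operator adds to the model.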

### 8.2.4 Difference-in-Differences and Block Design


In [None]:
import matplotlib.ticker as plticker


fig, (ax1, ax2) = plt.subplots(2,1, figsize=(9, 12), sharex=True)

heat_plt = (mkt_data
            .assign(treated=lambda d: d.groupby("city")["treated"].transform("max"))
            .astype({"date":"str"})
            .assign(treated=mkt_data["treated"] * mkt_data["post"])
            .pivot(index="city", columns="date", values="treated")  # modified here
            .reset_index()
            .sort_values(by=max(mkt_data["date"].astype(str)), ascending=False)  # the sort may also need fixing
            .reset_index()
            .drop(columns=["city"])
            .rename(columns={"index": "city"})
            .set_index("city"))



sns.heatmap(heat_plt, cmap="gray", linewidths=0.01, linecolor="0.5", ax=ax1, cbar=False)

ax1.set_title("Treatment Assignment")


sns.lineplot(data=mkt_data.astype({"date":"str"}),
             x="date", y="engagement", hue="treated", ax=ax2)

loc = plticker.MultipleLocator(base=2.0)
# ax2.xaxis.set_major_locator(loc)
ax2.vlines("2021-05-15", mkt_data["engagement"].min(), mkt_data["engagement"].max(), color="black", ls="dashed", label="Interv.")
ax2.set_title("Outcome Over Time")

plt.xticks(rotation = 50)

In [None]:
m = smf.ols('engagement ~ treated*post', data=mkt_data).fit()

m.params["treated:post"]

In [None]:
m = smf.ols('engagement ~ treated:post + C(city) + C(date)',
            data=mkt_data).fit()

m.params["treated:post"]

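### 8.2.5 Inference


Cluster the standard errors and check the confidence interval, then compare it with the conventional (unclustered) standard errors and with the clustered standard errors from the other dataset (`did_data`).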
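The mechanics of that comparison can be sketched on a synthetic panel. All names below (`toy`, `unit`, `t`, `y`, `plain`, `clustered`) and the data-generating process are assumptions for illustration only. The sketch relies on the `cov_type="cluster"` and `cov_kwds={"groups": ...}` arguments of statsmodels' `fit`, the same ones used in the cells that follow.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_units, n_periods = 40, 10

# Hypothetical panel: each unit carries a persistent shock, so errors
# are correlated within a unit across periods.
toy = pd.DataFrame({
    "unit": np.repeat(np.arange(n_units), n_periods),
    "t": np.tile(np.arange(n_periods), n_units),
})
toy["treated"] = (toy["unit"] < n_units // 2).astype(int)
toy["post"] = (toy["t"] >= n_periods // 2).astype(int)
unit_shock = rng.normal(scale=2.0, size=n_units)[toy["unit"].to_numpy()]
toy["y"] = (unit_shock + 1.0 * toy["treated"] * toy["post"]
            + rng.normal(size=len(toy)))

# Same model, two covariance estimators.
plain = smf.ols("y ~ treated*post", data=toy).fit()
clustered = smf.ols("y ~ treated*post", data=toy).fit(
    cov_type="cluster", cov_kwds={"groups": toy["unit"]})

print("point estimate:", plain.params["treated:post"])
print("plain SE:      ", plain.bse["treated:post"])
print("clustered SE:  ", clustered.bse["treated:post"])
```

The point estimate is identical in both fits because clustering changes only the covariance estimate, and therefore only the standard errors and confidence intervals.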

In [None]:
m = smf.ols(
    'engagement ~ ___', data=mkt_data
).fit(cov_type='cluster', cov_kwds={'groups': mkt_data['city']})

print("ATT:", m.params["treated:post"])
m.conf_int().loc["treated:post"]

In [None]:
m = smf.ols('engagement ~ ___',
            data=mkt_data).fit()

print("ATT:", m.params["treated:post"])
m.conf_int().loc["treated:post"]

In [None]:
m = smf.ols(
    'engagement ~ ___', data=did_data
).fit(cov_type='cluster', cov_kwds={'groups': did_data['city']})

print("ATT:", m.params["treated:post"])
m.conf_int().loc["treated:post"]

# Question 3

- Explain in your own words why the confidence interval for did_data differs from the one for the original data (mkt_data).

## 8.4 Effect Dynamics over Time


In [None]:
def did_date(df, date):
    df_date = (df
               .query("date==@date | post==0")
               .query("date <= @date")
               .assign(post = lambda d: (d["date"]==date).astype(int)))
    
    m = smf.ols(
        'engagement ~ ___', data=df_date
    ).fit(cov_type='cluster', cov_kwds={'groups': df_date['city']})
    
    att = ___
    ci = ___
    
    return pd.DataFrame({"att": att, "ci_low": ci[0], "ci_up": ci[1]},
                        index=[date])

In [None]:
post_dates = sorted(mkt_data["date"].unique())[1:]

atts = pd.concat([did_date(mkt_data, date)
                  for date in post_dates])

atts.head()

In [None]:

plt.figure(figsize=(10,4))
plt.plot(atts.index, atts["att"], label="Est. ATTs")

plt.fill_between(atts.index, atts["ci_low"], atts["ci_up"], alpha=0.1)

plt.vlines(pd.to_datetime("2021-05-15"), -2, 3, linestyle="dashed", label="intervention")
plt.hlines(0, atts.index.min(), atts.index.max(), linestyle="dotted")

plt.plot(atts.index, mkt_data.query("treated==1").groupby("date")[["tau"]].mean().values[1:], color="0.6", ls="-.", label="$\\tau$")

plt.xticks(rotation=45)
plt.title("DID ATTs Over Time")
plt.legend()


## 8.5 Difference-in-Differences with Covariates


In [None]:
mkt_data_all = (pd.read_csv("./data/Processed_New_Dataset_2.csv")
                .astype({"date":"datetime64[ns]"}))

In [None]:
plt.figure(figsize=(15,6))
sns.lineplot(data=mkt_data_all.groupby(["date", "region", "treated"])[["engagement"]].mean().reset_index(),
             x="date", y="engagement", hue="region", style="treated", palette="gray")

plt.vlines(pd.to_datetime("2021-05-15"), 15, 55, ls="dotted", label="Intervention")
plt.legend(fontsize=14)

plt.xticks(rotation=25)

In [None]:
print("True ATT: ", mkt_data_all.query("treated*post==1")["tau"].mean())

m = smf.ols('engagement ~ treated:post + C(city) + C(date)',
            data=mkt_data_all).fit()

print("Estimated ATT:", ___)

In [None]:
# Add the region covariate and check the result!
m = smf.ols('engagement ~ treated:post + C(city) + C(date) + _____',
            data=mkt_data_all).fit()
m.params["treated:post"] 

In [None]:
# Fill in the blank using region dummy variables!
m_saturated = smf.ols('engagement ~ ___',
                      data=mkt_data_all).fit()

atts = m_saturated.params[m_saturated.params.index.str.contains("post:treated")]
atts

In [None]:
reg_size = (mkt_data_all.groupby("region").size()
            /len(mkt_data_all["date"].unique()))

base = atts.iloc[0]  # positional access; atts and reg_size have string labels

np.array([reg_size.iloc[0]*base]+
         [(att+base)*size
          for att, size in zip(atts[1:], reg_size[1:])]
        ).sum()/sum(reg_size)

# Question 3
- In the current setting, is the approach above more practical than the one below?

In [None]:
m = smf.ols('engagement ~ post*(treated + C(region))',
            data=mkt_data_all).fit()

m.summary().tables[1]

## 8.7 Staggered Adoption

# Question 4
- Describe in your own words what needs to be considered under staggered adoption.

In [None]:
mkt_data_cohorts = (pd.read_csv("./data/Processed_New_Dataset_3.csv")
                    .astype({
                        "date":"datetime64[ns]",
                        "cohort":"datetime64[ns]"}))

mkt_data_cohorts.head()

In [None]:
plt_data = (mkt_data_cohorts
            .astype({"date": "str"})
            .assign(treated_post=lambda d: d["treated"] * (d["date"] >= d["cohort"]))
            .pivot(index="city", columns="date", values="treated_post")
            .reset_index()
            .sort_values(by=list(sorted(mkt_data_cohorts.query("cohort!='2100-01-01'")["cohort"].astype("str").unique())), ascending=False)
            .reset_index(drop=True)
            .rename_axis(None, axis=1)
            .set_index("city"))



plt.figure(figsize=(16,8))

sns.heatmap(plt_data, cmap="gray",cbar=False)
plt.text(18, 18, "Cohort$=G_{05/15}$", size=14)
plt.text(38, 65, "Cohort$=G_{06/04}$", size=14)
plt.text(55, 110, "Cohort$=G_{06/20}$", size=14)
plt.text(35, 170, "Cohort$=G_{\\infty}$", color="white", size=14, weight=3);

In [None]:
mkt_data_cohorts_w = mkt_data_cohorts.query("region=='W'")
mkt_data_cohorts_w.head()

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10))

plt_data = (mkt_data_cohorts_w
            .groupby(["date", "cohort"])
            [["engagement"]]
            .mean()
            .reset_index()
)



for color, cohort in zip(["C0", "C1", "C2", "C3"], mkt_data_cohorts_w.query("cohort!='2100-01-01'")["cohort"].unique()):
    df_cohort = plt_data.query("cohort==@cohort")
    sns.lineplot(data=df_cohort, x="date", y="engagement",
                 label=pd.to_datetime(cohort).strftime('%Y-%m-%d'), ax=ax1)
    ax1.vlines(x=cohort, ymin=25, ymax=50, color=color, ls="dotted", lw=3)
    
    
sns.lineplot(data=plt_data.query("cohort=='2100-01-01'"), x="date", y="engagement", label="$\\infty$", lw=4, ls="-.", ax=ax1)
        
ax1.legend()
ax1.set_title("Multiple Cohorts - West Region");


plt_data = (mkt_data_cohorts_w
            .assign(days_to_treatment = lambda d: (pd.to_datetime(d["date"])-pd.to_datetime(d["cohort"])).dt.days)
            .groupby(["date", "cohort"])
            [["engagement", "days_to_treatment"]]
            .mean()
            .reset_index()
)


for color, cohort in zip(["C0", "C1", "C2", "C3"], mkt_data_cohorts_w.query("cohort!='2100-01-01'")["cohort"].unique()):
    df_cohort = plt_data.query("cohort==@cohort")
    sns.lineplot(data=df_cohort, x="days_to_treatment", y="engagement",
                 label=pd.to_datetime(cohort).strftime('%Y-%m-%d'), ax=ax2)

ax2.vlines(x=0, ymin=25, ymax=50, color="black", ls="dotted", lw=3)

ax2.set_title("Multiple Cohorts (Aligned) - West Region")
ax2.legend();

plt.tight_layout()

In [None]:
twfe_model = smf.ols(
    "engagement ~ treated:post + C(date) + C(city)",
    data=mkt_data_cohorts_w
).fit()

true_tau = mkt_data_cohorts_w.query("post==1&treated==1")["tau"].mean()

print("True Effect: ", true_tau)
print("Estimated ATT:", twfe_model.params["treated:post"])

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15, 10), sharex=True)

# fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 10))

cohort_erly='2021-06-04'
cohort_late='2021-06-20'

## Early vs Late
did_df = (mkt_data_cohorts_w
            .loc[lambda d: d["date"].astype(str) < cohort_late]
            .query(f"cohort=='{cohort_late}' | cohort=='{cohort_erly}'")
            .assign(treated = lambda d: (d["cohort"] == cohort_erly)*1,
                    post = lambda d: (d["date"].astype(str) >= cohort_erly)*1))

m = smf.ols(
    "engagement ~ treated:post + C(date) + C(city)",
    data=did_df
).fit()


# print("Estimated", m.params["treated:post"])
# print("True", did_df.query("post==1 & treated==1")["tau"].mean())

plt_data = (did_df
            .assign(installs_hat_0 = lambda d: m.predict(d.assign(treated=0)))
            .groupby(["date", "cohort"])
            [["engagement", "post", "treated", "installs_hat_0"]]
            .mean()
            .reset_index())


sns.lineplot(data=plt_data, x="date", y="engagement", hue="cohort", ax=ax1)
sns.lineplot(data=plt_data.query("treated==1 & post==1"),
             x="date", y="installs_hat_0", ax=ax1, ls="-.", alpha=0.5, label="$\\hat{Y}_0|T=1$")


ax1.vlines(pd.to_datetime(cohort_erly), 26, 38, ls="dashed")
ax1.legend()
ax1.set_title("Early vs Late")


# ## Late vs Early

did_df = (mkt_data_cohorts_w
            .loc[lambda d: d["date"].astype(str) > cohort_erly]
            .query(f"cohort=='{cohort_late}' | cohort=='{cohort_erly}'")
            .assign(treated = lambda d: (d["cohort"] == cohort_late)*1,
                    post = lambda d: (d["date"].astype(str) >= cohort_late)*1))

m = smf.ols(
    "engagement ~ treated*post + C(date) + C(city)",
    data=did_df
).fit()

# print("Estimated", m.params["treated:post"])
# print("True", did_df.query("post==1 & treated==1")["tau"].mean())


plt_data = (did_df
            .assign(installs_hat_0 = lambda d: m.predict(d.assign(treated=0)))
            .groupby(["date", "cohort"])
            [["engagement", "post", "treated", "installs_hat_0"]]
            .mean()
            .reset_index())


sns.lineplot(data=plt_data, x="date", y="engagement", hue="cohort", ax=ax2)
sns.lineplot(data=plt_data.query("treated==1 & post==1"),
             x="date", y="installs_hat_0", ax=ax2, ls="-.", alpha=0.5, label="$\\hat{Y}_0|T=1$")

ax2.vlines(pd.to_datetime("2021-06-20"), 32, 45, ls="dashed")
ax2.legend()
ax2.set_title("Late vs Early")

### 8.7.1 Heterogeneous Effects over Time


In [None]:
formula = "engagement ~ ____"

twfe_model = smf.ols(formula, data=mkt_data_cohorts_w).fit()

In [None]:
df_pred = (
    mkt_data_cohorts_w
    .query("post==1 & treated==1")
    .assign(y_hat_0=lambda d: twfe_model.predict(d.assign(treated=0)))
    .assign(effect_hat=lambda d: d["engagement"] - d["y_hat_0"])
)

print("Number of param.:", len(twfe_model.params))
print("True Effect: ", df_pred["tau"].mean())
print("Pred. Effect: ", df_pred["effect_hat"].mean())

In [None]:
formula = "engagement ~ ____"

twfe_model = smf.ols(formula, data=mkt_data_cohorts_w.astype({"date":str, "cohort": str})).fit()

effects = (twfe_model.params[twfe_model.params.index.str.contains("treated")]
           .reset_index()
           .rename(columns={0:"param"})
           .assign(cohort=lambda d: d["index"].str.extract(r'C\(cohort\)\[(.*)\]:'))
           .assign(date=lambda d: d["index"].str.extract(r':C\(date\)\[(.*)\]'))
           .assign(date=lambda d: pd.to_datetime(d["date"]), cohort=lambda d: pd.to_datetime(d["cohort"])))

plt.figure(figsize=(10,4))
sns.lineplot(data=effects, x="date", y="param", hue="cohort", palette="gray")
plt.xticks(rotation=45)
plt.ylabel("Estimated Effect")
plt.legend(fontsize=12)


# Question 5
- Which experimental units is the procedure in the code below built around?

In [None]:
cohorts = sorted(mkt_data_cohorts_w["cohort"].unique())

treated_G = cohorts[:-1]
nvr_treated = cohorts[-1]

def did_g_vs_nvr_treated(df: pd.DataFrame,
                         cohort: str,
                         nvr_treated: str,
                         cohort_col: str = "cohort",
                         date_col: str = "date",
                         y_col: str = "engagement"):
    did_g = (
        df
        .loc[lambda d:(d[cohort_col] == cohort)|
                      (d[cohort_col] == nvr_treated)]
        .assign(treated = lambda d: (d[cohort_col] == cohort)*1)
        .assign(post = lambda d:(pd.to_datetime(d[date_col])>=cohort)*1)
    )
    
    att_g = smf.ols(f"{y_col} ~ treated*post",
                    data=did_g).fit().params["treated:post"]
    size = len(did_g.query("treated==1 & post==1"))
    return {"att_g": att_g, "size": size}


atts = pd.DataFrame(
    [did_g_vs_nvr_treated(mkt_data_cohorts_w, cohort, nvr_treated)
     for cohort in treated_G]
)
    
atts

In [None]:
(atts["att_g"]*atts["size"]).sum()/atts["size"].sum()

### 8.7.2 Covariates


In [None]:
formula = """
engagement ~ ___"""

twfe_model = smf.ols(formula, data=mkt_data_cohorts).fit()

In [None]:
df_pred = (
    mkt_data_cohorts
    .query("post==1 & treated==1")
    .assign(y_hat_0=lambda d: twfe_model.predict(d.assign(treated=0)))
    .assign(effect_hat=lambda d: d["engagement"] - d["y_hat_0"])
)

print("Number of param.:", len(twfe_model.params))
print("True Effect: ",  df_pred["tau"].mean())
print("Pred. Effect: ", df_pred["effect_hat"].mean())