# 効果検証入門 (Introduction to Effect Verification)
## Fundamentals of Causal Inference and Econometrics for Correct Comparisons

### 1.4 Verifying the Effect of E-mail Marketing with R

Reference: https://note.nkmk.me/python-pandas-to-csv/

In [1]:
import numpy as np
import pandas as pd
from scipy import stats

# Load the data (the commented lines download the CSV once and cache it locally)
# df = pd.read_csv('http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')
# df.to_csv('Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv', index=False)
df = pd.read_csv('Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv')

Reference: https://qiita.com/HEM_SP/items/56cd62a1c000d342bd70

In [2]:
# Prepare the data: drop the women's e-mail segment, then flag the men's e-mail group as treated
male_df = df.query('segment != "Womens E-Mail"').copy()
male_df['treatment'] = (male_df['segment'] == 'Mens E-Mail').astype(int)
male_df.head(3)

index,recency,history_segment,history,mens,womens,zip_code,newbie,channel,segment,visit,conversion,spend,treatment
1,6,3) $200 - $350,329.08,1,1,Rural,1,Web,No E-Mail,0,0,0.0,0
3,9,5) $500 - $750,675.83,1,0,Rural,1,Web,Mens E-Mail,0,0,0.0,1
8,9,5) $500 - $750,675.07,1,1,Rural,1,Phone,Mens E-Mail,0,0,0.0,1


References
- https://qiita.com/propella/items/a9a32b878c77222630ae
- https://python-analytics.hatenadiary.jp/entry/2018/04/21/182220

In [3]:
# Compare the groups by aggregation: group size, mean conversion rate, mean spend
male_df.groupby('treatment').agg({'treatment': 'count', 'conversion': 'mean', 'spend': 'mean'})

treatment,treatment (count),conversion (mean),spend (mean)
0,21306,0.005726,0.652789
1,21307,0.012531,1.422617
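
The quantity of interest in this table is the gap between the two rows. As a minimal sketch, the snippet below re-enters the group means printed above by hand (so it runs without the CSV) and computes the difference and the ratio of means. The names `summary`, `effect`, and `lift` are illustrative and do not come from the book.

```python
import pandas as pd

# Group means copied from the aggregation output (treatment 0 = control, 1 = treated).
summary = pd.DataFrame(
    {'conversion': [0.005726, 0.012531], 'spend': [0.652789, 1.422617]},
    index=pd.Index([0, 1], name='treatment'),
)

# Simple difference in means: treated minus control.
effect = summary.loc[1] - summary.loc[0]

# Ratio of means: how many times larger the treated mean is.
lift = summary.loc[1] / summary.loc[0]

print(effect)
print(lift)
```

Because the e-mail was assigned at random in this dataset, the difference in means can be read as an estimate of the effect of the e-mail.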


Reference: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html

In [4]:
# Run a significance test on the difference in mean spend (two-sample t-test, equal variances assumed)
stats.ttest_ind(
    male_df.query('treatment == 0')['spend'],
    male_df.query('treatment == 1')['spend'],
    equal_var=True,
    nan_policy='raise',
)

Ttest_indResult(statistic=-5.300090294465472, pvalue=1.163200872605869e-07)
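
With `equal_var=True`, `stats.ttest_ind` runs the standard two-sample t-test with a pooled variance. The sketch below recomputes that statistic by hand and checks it against SciPy. It uses synthetic exponential samples rather than the Hillstrom data, and the helper name `pooled_t_statistic` is hypothetical.

```python
import numpy as np
from scipy import stats


def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance (mean of a minus mean of b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Pooled variance: weighted average of the two unbiased sample variances.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (a.mean() - b.mean()) / standard_error


# Synthetic stand-ins for the control and treated spend columns (assumption).
rng = np.random.default_rng(0)
control = rng.exponential(scale=1.0, size=500)
treated = rng.exponential(scale=1.3, size=500)

manual_t = pooled_t_statistic(control, treated)
scipy_result = stats.ttest_ind(control, treated, equal_var=True)

print(manual_t, scipy_result.statistic)
```

The statistic is negative whenever the first sample has the smaller mean, which is why the result above is negative: the control group is passed first.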

References
- https://teratail.com/questions/134846
- https://pythondatascience.plavox.info/numpy/乱数を生成

In [5]:
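# Create data with selection bias
def func(row):
    # Users who are likely to buy anyway: high past spend, recent purchase, or multichannel
    if row['history'] > 300 or row['recency'] < 6 or row['channel'] == "Multichannel":
        # Keep every treated row but only about half of the control rows
        return 0.5 if row['treatment'] == 0 else 1
    else:
        # Keep every control row but only about half of the treated rows
        return 1 if row['treatment'] == 0 else 0.5

biased_data = male_df.copy()
biased_data['obs_rate'] = biased_data.apply(func, axis=1)
np.random.seed(seed=2)  # fixed seed so the subsample is reproducible
biased_data['random_number'] = [np.random.rand() for _ in range(len(biased_data))]
# Keep a row only when its uniform draw falls below its observation rate
biased_data = biased_data.query('random_number < obs_rate')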
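
The observation rate is 1 when "likely buyer" and "treated" agree and 0.5 when they disagree, so the same rule can be written without a row-wise `apply`. This is a sketch on a hand-made toy frame whose values are hypothetical; `likely_buyer` and `treated` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) covering the four combinations of the rule.
toy = pd.DataFrame({
    'history': [350.0, 50.0, 50.0, 120.0],
    'recency': [10, 3, 9, 8],
    'channel': ['Web', 'Phone', 'Web', 'Phone'],
    'treatment': [0, 1, 0, 1],
})

# Same condition as the row-wise rule, evaluated on whole columns.
likely_buyer = (
    (toy['history'] > 300)
    | (toy['recency'] < 6)
    | (toy['channel'] == 'Multichannel')
)
treated = toy['treatment'] == 1

# Rate is 1.0 when the two flags agree, 0.5 when they disagree.
toy['obs_rate'] = np.where(likely_buyer == treated, 1.0, 0.5)

print(toy)
```

The effect of the rule is that likely buyers are over-represented in the treated group and under-represented in the control group, which is what introduces the selection bias.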

In [6]:
# Compare group means on the data with selection bias
biased_data.groupby('treatment').agg({'treatment': 'count', 'conversion': 'mean', 'spend': 'mean'})
# Run a significance test on the difference in mean spend
stats.ttest_ind(
    biased_data.query('treatment == 0')['spend'],
    biased_data.query('treatment == 1')['spend'],
    equal_var=True,
    nan_policy='raise',
)

treatment,treatment (count),conversion (mean),spend (mean)
0,14650,0.005051,0.586214
1,17246,0.013278,1.517082


Ttest_indResult(statistic=-5.331882684896141, pvalue=9.786007294094342e-08)
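
Both t-tests are highly significant, so the p-value alone does not reveal the bias. The arithmetic sketch below compares the gap in mean spend across the two datasets, using the group means printed in this notebook; the variable names are illustrative.

```python
# Mean spend per group, copied from the two aggregation tables in this notebook.
rct_control, rct_treated = 0.652789, 1.422617        # randomized data (In [3])
biased_control, biased_treated = 0.586214, 1.517082  # biased subsample (In [6])

# Difference in means (treated minus control) in each dataset.
rct_effect = rct_treated - rct_control
biased_effect = biased_treated - biased_control

# How much larger the biased estimate is than the randomized one.
overestimate = biased_effect - rct_effect

print(f"RCT estimate:    {rct_effect:.6f}")
print(f"Biased estimate: {biased_effect:.6f}")
print(f"Overestimate:    {overestimate:.6f}")
```

In the biased subsample the gap in mean spend is about 0.16 larger than in the randomized data: the treated mean rises and the control mean falls, even though the e-mail itself is unchanged.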

### Links
- P.vi: https://callingbullshit.org
- P.vii: https://statistics.fas.harvard.edu/people/donald-b-rubin
- P.vii: http://bayes.cs.ucla.edu/jp_home.html
- P.viii: https://www.rieti.go.jp/jp/special/ebpm_report/002.html
- P.xiii: https://github.com/ghmagazine/cibook
- P.10: https://www.nippyo.co.jp/shop/book/8075.html
- P.24: https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html

Reference
- https://qiita.com/nekoumei/items/648726e89d05cba6f432