## Propensity Score
- Iwanami Data Science vol. 3: estimating the causal effect of TV commercial (CM) exposure on app usage (https://github.com/iwanami-datascience/vol3/tree/master/kato%26hoshino)

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

### Loading the data

In [2]:
data = pd.read_csv('./q_data_x.csv')
data.columns

Index(['cm_dummy', 'gamedummy', 'area_kanto', 'area_keihan', 'area_tokai',
       'area_keihanshin', 'age', 'sex', 'marry_dummy', 'job_dummy1',
       'job_dummy2', 'job_dummy3', 'job_dummy4', 'job_dummy5', 'job_dummy6',
       'job_dummy7', 'job_dummy8', 'inc', 'pmoney', 'fam_str_dummy1',
       'fam_str_dummy2', 'fam_str_dummy3', 'fam_str_dummy4', 'fam_str_dummy5',
       'child_dummy', 'T', 'F1', 'F2', 'F3', 'M1', 'M2', 'M3', 'TVwatch_day',
       'gamesecond', 'gamecount'],
      dtype='object')

In [3]:
X = data[['TVwatch_day', 'age', 'sex', 'marry_dummy', 'child_dummy', 'inc', 'pmoney','area_kanto', 'area_tokai', 'area_keihanshin', 
          'job_dummy1', 'job_dummy2', 'job_dummy3', 'job_dummy4', 'job_dummy5', 'job_dummy6', 'job_dummy7',
          'fam_str_dummy1', 'fam_str_dummy2', 'fam_str_dummy3', 'fam_str_dummy4']]
Z = data['cm_dummy']

### Estimating the propensity score
- $e(X_i) = p(Z_i=1|X_i)$

The probability that user $i$ is in the treatment group ($Z_i=1$), given the covariates $X_i$.

In [4]:
# Regress treatment Z on the covariates X; the predicted P(Z=1|X) is the propensity score
lr = LogisticRegression()
lr.fit(X, Z)
ps = lr.predict_proba(X)[:, 1]

In [5]:
print('AUC = {:.3f}'.format(roc_auc_score(y_true=Z, y_score=ps)))

AUC = 0.789
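As an aid to reading this number: the AUC can be interpreted as the probability that a randomly chosen treated user receives a higher score than a randomly chosen untreated user, with ties counted as one half. The sketch below checks that reading on synthetic data. The data-generating process and all variable names here are assumptions made for illustration and are not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic treatment labels and scores (illustrative only)
z = rng.integers(0, 2, size=200)
score = 0.3 * z + rng.normal(0.5, 0.2, size=200)

# Compare every treated score with every untreated score
pos = score[z == 1]
neg = score[z == 0]
diff = pos[:, None] - neg[None, :]

# Fraction of pairs where the treated score is higher (ties count as 0.5)
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
sk_auc = roc_auc_score(y_true=z, y_score=score)

print('pairwise AUC = {:.3f}'.format(pairwise_auc))
print('sklearn AUC  = {:.3f}'.format(sk_auc))
```

The two values should agree up to floating-point error.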


In [6]:
data['propensity_score'] = ps

### Estimating the causal effect

In [7]:
data_treated = data[data['cm_dummy']==1]
data_untreated = data[data['cm_dummy']==0]

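- Difference in means between the CM-exposed and CM-unexposed groups

$E(Y_1|Z=1) - E(Y_0|Z=0)$

If the assignment $Z$ is random, this difference correctly estimates the causal effect of $Z$. When assignment depends on the covariates, as it can in observational data, the difference may be biased.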
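To illustrate why randomization matters, here is a small simulation sketch. It assumes a single confounder `x`, a continuous outcome, and a true effect `tau = 1.0`; none of these come from the CM data. It compares the naive difference in means under random assignment and under assignment that depends on `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
tau = 1.0  # true causal effect (assumed for this simulation)

x = rng.normal(size=n)  # a single confounder that also drives the outcome

def naive_diff(z):
    """Simulate outcomes for assignment z and return the raw difference in group means."""
    y = x + tau * z + rng.normal(size=n)
    return y[z == 1].mean() - y[z == 0].mean()

# Random assignment: Z is independent of x
z_random = rng.binomial(1, 0.5, size=n)
# Confounded assignment: users with larger x are more likely to be treated
z_confounded = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))

diff_random = naive_diff(z_random)
diff_confounded = naive_diff(z_confounded)

print('random assignment:     {:.3f}'.format(diff_random))
print('confounded assignment: {:.3f}'.format(diff_confounded))
```

Under random assignment the difference should land close to `tau`. Under confounded assignment it overstates the effect, because treated users also have larger `x`.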

In [8]:
avgT = np.mean(data_treated['gamedummy'])
avgU = np.mean(data_untreated['gamedummy'])
diff = avgT - avgU

print('E(Y1|Z=1) = {:.3f}'.format(avgT))
print('E(Y0|Z=0) = {:.3f}'.format(avgU))
print('diff = {:.3f}'.format(diff))

E(Y1|Z=1) = 0.075
E(Y0|Z=0) = 0.073
diff = 0.002


- Average Treatment Effect (ATE)

$ATE = E(Y_1) - E(Y_0)$

(App usage if everyone were exposed to the CM) - (app usage if no one were exposed to the CM)
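Each user is observed under only one condition, so $E(Y_1)$ and $E(Y_0)$ have to be estimated. The next cell does this by inverse probability weighting with normalized weights. Written out, with $e_i = e(X_i)$ the estimated propensity score and $Y_i$ the observed outcome (`gamedummy`), the two estimates are:

$\hat{E}(Y_1) = \dfrac{\sum_i \frac{Z_i}{e_i} Y_i}{\sum_i \frac{Z_i}{e_i}}$

$\hat{E}(Y_0) = \dfrac{\sum_i \frac{1-Z_i}{1-e_i} Y_i}{\sum_i \frac{1-Z_i}{1-e_i}}$

Treated users with a low propensity score, and untreated users with a high one, receive large weights because they stand in for many similar users who ended up in the other group.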

In [9]:
# Inverse probability weights: 1/e(X) for treated users, 1/(1-e(X)) for untreated users
w1 = data['cm_dummy'] / data['propensity_score']
w0 = (1 - data['cm_dummy']) / (1 - data['propensity_score'])
# Weighted means of the outcome, normalized by the sum of the weights
E1 = np.sum(w1 * data['gamedummy']) / np.sum(w1)
E0 = np.sum(w0 * data['gamedummy']) / np.sum(w0)
ATE = E1 - E0

print('E1 = {:.3f}'.format(E1))
print('E0 = {:.3f}'.format(E0))
print('ATE = {:.3f}'.format(ATE))

E1 = 0.088
E0 = 0.062
ATE = 0.027
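As a sanity check on the weighting approach, the sketch below applies the same normalized weighting to synthetic data where the true effect is known. The helper `ipw_ate`, the data-generating process, and the true effect of 1.0 are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ate(y, z, ps):
    """Normalized inverse-probability-weighted estimate of E(Y1) - E(Y0) (hypothetical helper)."""
    y, z, ps = (np.asarray(a, dtype=float) for a in (y, z, ps))
    w1 = z / ps                # weights for treated units
    w0 = (1 - z) / (1 - ps)    # weights for untreated units
    e1 = np.sum(w1 * y) / np.sum(w1)
    e0 = np.sum(w0 * y) / np.sum(w0)
    return e1 - e0

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                      # confounder
z = rng.binomial(1, 1 / (1 + np.exp(-x)))   # treatment depends on x
y = x + 1.0 * z + rng.normal(size=n)        # true effect is 1.0 by construction

# Estimate the propensity score the same way as in the notebook
X_sim = x.reshape(-1, 1)
ps_hat = LogisticRegression().fit(X_sim, z).predict_proba(X_sim)[:, 1]

naive = y[z == 1].mean() - y[z == 0].mean()
ate_hat = ipw_ate(y, z, ps_hat)

print('naive diff = {:.3f}'.format(naive))
print('IPW ATE    = {:.3f}'.format(ate_hat))
```

In this setup the naive difference is inflated by the confounder, while the weighted estimate should land near the true effect. That holds here because the propensity model is correctly specified, which cannot be taken for granted with real data.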


- Average Treatment effect on the Treated (ATT)

$ATT=E(Y_1|Z=1)-E(Y_0|Z=1)$

The increase in app usage among users who were exposed to the CM.
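Here $E(Y_1|Z=1)$ is simply the observed mean among exposed users. The counterfactual term $E(Y_0|Z=1)$ is estimated in the next cell from the unexposed users, reweighted by the odds of the propensity score so that they resemble the exposed group. With $e_i = e(X_i)$:

$\hat{E}(Y_0|Z=1) = \dfrac{\sum_i (1-Z_i)\frac{e_i}{1-e_i} Y_i}{\sum_i (1-Z_i)\frac{e_i}{1-e_i}}$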

In [11]:
# Treated users: the observed mean is already E(Y1|Z=1)
E1 = np.mean(data_treated['gamedummy'])
# Untreated users: reweight by the odds e(X)/(1-e(X)) to match the treated group's covariates
w0 = (1 - data['cm_dummy']) * data['propensity_score'] / (1 - data['propensity_score'])
E0 = np.sum(w0 * data['gamedummy']) / np.sum(w0)
ATT = E1 - E0

print('E1 = {:.3f}'.format(E1))
print('E0 = {:.3f}'.format(E0))
print('ATT = {:.3f}'.format(ATT))

E1 = 0.075
E0 = 0.048
ATT = 0.027
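ATT and ATE coincide only when the effect is the same in the treated group as in the whole population. The sketch below uses synthetic data in which the effect grows with a confounder `x`, so the two differ. The helper `ipw_att` and the data-generating process are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_att(y, z, ps):
    """Odds-weighted estimate of E(Y1|Z=1) - E(Y0|Z=1) (hypothetical helper)."""
    y, z, ps = (np.asarray(a, dtype=float) for a in (y, z, ps))
    e1 = y[z == 1].mean()            # observed mean among treated units
    w0 = (1 - z) * ps / (1 - ps)     # odds weights, nonzero only for untreated units
    e0 = np.sum(w0 * y) / np.sum(w0)
    return e1 - e0

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                      # confounder
z = rng.binomial(1, 1 / (1 + np.exp(-x)))   # treatment is more likely for large x
effect = 1.0 + x                            # unit-level effect, larger for large x
y = x + effect * z + rng.normal(size=n)

# Estimate the propensity score with a logistic regression on x
X_sim = x.reshape(-1, 1)
ps_hat = LogisticRegression().fit(X_sim, z).predict_proba(X_sim)[:, 1]

att_hat = ipw_att(y, z, ps_hat)
att_true = effect[z == 1].mean()   # average effect among treated units
ate_true = effect.mean()           # average effect over everyone

print('estimated ATT = {:.3f}'.format(att_hat))
print('true ATT      = {:.3f}'.format(att_true))
print('true ATE      = {:.3f}'.format(ate_true))
```

Since treated units tend to have larger `x`, the true ATT exceeds the true ATE in this simulation, and the odds-weighted estimate should track the ATT rather than the ATE.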
