# Validation with RCT data

## Preparing the RCT data

**R code**

```R
# Load libraries
library(tidyverse)
library(broom)

# Read the data
email_data <- read_csv("http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv")

# Drop customers who received the women's e-mail, leaving the men's e-mail and no-e-mail groups
male_df <- email_data %>%
    filter(segment != "Womens E-Mail") %>%
    mutate(
        treatment = if_else(segment == "Mens E-Mail", 1, 0)
    )
```

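**Data description**

|Variable|Description|
|:-------|:------|
|recency|Months since the last purchase|
|history_segment|Bracket of the amount purchased in the past year|
|history|Amount purchased in the past year|
|mens|Whether the customer bought men's merchandise in the past year|
|womens|Whether the customer bought women's merchandise in the past year|
|zip_code|Area type derived from the zip code|
|newbie|Whether the customer became a new customer within the past 12 months|
|channel|Channel through which the customer purchased in the past year|
|segment|Which e-mail was sent|
|visit|Whether the customer visited the site within two weeks after the e-mail was sent|
|conversion|Whether the customer made a purchase within two weeks after the e-mail was sent|
|spend|Amount spent on the purchase|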

**Python code**

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Read the data
df_email_data = pd.read_csv("http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv")

# Drop customers who received the women's e-mail and flag the treatment group
df_male = df_email_data.query("segment != 'Womens E-Mail'").copy()
df_male["treatment"] = np.where(df_male["segment"] == "Mens E-Mail", 1, 0)
df_male.head()
```

| |recency|history_segment|history|mens|womens|zip_code|newbie|channel|segment|visit|conversion|spend|treatment|
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|1|6|3) $200 - $350|329.08|1|1|Rural|1|Web|No E-Mail|0|0|0.0|0|
|3|9|5) $500 - $750|675.83|1|0|Rural|1|Web|Mens E-Mail|0|0|0.0|1|
|8|9|5) $500 - $750|675.07|1|1|Rural|1|Phone|Mens E-Mail|0|0|0.0|1|
|13|2|2) $100 - $200|101.64|0|1|Urban|0|Web|Mens E-Mail|1|0|0.0|1|
|14|4|3) $200 - $350|241.42|0|1|Rural|1|Multichannel|No E-Mail|0|0|0.0|0|


## Summarizing the RCT data and testing for a significant difference

**R code**
```R
# Compare the groups with summary statistics
summary_by_segment <- male_df %>%
    group_by(treatment) %>%
    summarise(
        conversion_rate = mean(conversion),
        spend_mean = mean(spend),
        count = n(),
        .groups = "drop"
    )

# Spend for the treatment group
mens_mail <- male_df %>%
    filter(treatment == 1) %>%
    pull(spend)

# Spend for the control group
no_mail <- male_df %>%
    filter(treatment == 0) %>%
    pull(spend)

# Welch's t-test on the difference in mean spend between the two groups
rct_ttest <- t.test(mens_mail, no_mail, var.equal = FALSE)
rct_ttest
```

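**Python code**

```python
# Compare the groups with summary statistics.
# The "channel" column is used only to count the rows in each group.
df_summary_by_segment = (df_male
                         .groupby("treatment")
                         .agg({"conversion": "mean", "spend": "mean", "channel": "size"})
                         .reset_index()
                        )
df_summary_by_segment
```

| |treatment|conversion|spend|channel|
|:--|:--|:--|:--|:--|
|0|0|0.005726|0.652789|21306|
|1|1|0.012531|1.422617|21307|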


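```python
# Spend for the treatment group and the control group
mens_mail = df_male.query("treatment == 1")["spend"]
no_mail = df_male.query("treatment == 0")["spend"]

# Welch's t-test (unequal variances) on the difference in mean spend
rct_ttest = ttest_ind(mens_mail, no_mail, equal_var=False)
rct_ttest
```

```
Ttest_indResult(statistic=5.300140358411668, pvalue=1.1638149682254859e-07)
```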
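
To make explicit what `ttest_ind(..., equal_var=False)` is testing, here is a minimal sketch of Welch's t statistic and its Welch-Satterthwaite degrees of freedom, using only the standard library. The helper name `welch_t` and the toy spend values are assumptions for illustration and are not part of the original notebook. The sketch stops at the statistic and does not compute a p-value.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # Estimated variance of each group mean (sample variance divided by n).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


# Toy spend values, invented for illustration only.
treated = [0.0, 0.0, 29.9, 0.0, 49.5, 0.0]
control = [0.0, 0.0, 0.0, 19.9, 0.0, 0.0]
t_stat, dof = welch_t(treated, control)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

Passing `mens_mail` and `no_mail` as lists to this helper should give a statistic close to the one reported by `ttest_ind`, since both use the unequal-variance form.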

# Validation with biased data

## Preparing the biased data

**R code**

```R
set.seed(1)

# Keep only half of the samples that the selection rule below targets
obs_rate_c <- 0.5
obs_rate_t <- 0.5

# Create the biased data:
# in the control group, keep likely buyers with probability obs_rate_c;
# in the treatment group, keep everyone else with probability obs_rate_t
biased_data <- male_df %>%
    mutate(
        obs_rate_c = if_else(
            (history > 300) | (recency < 6) | (channel == "Multichannel"),
            obs_rate_c,
            1
        ),
        obs_rate_t = if_else(
            (history > 300) | (recency < 6) | (channel == "Multichannel"),
            1,
            obs_rate_t
        ),
        random_number = runif(n = NROW(male_df))
    ) %>%
    filter(
        (treatment == 0 & random_number < obs_rate_c) |
        (treatment == 1 & random_number < obs_rate_t)
    )
```
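
The notebook only shows this step in R. A rough pandas counterpart could look like the sketch below. The function name `make_biased`, the use of `numpy.random.default_rng`, and the toy frame are assumptions for illustration; the random draws will not match R's `set.seed(1)`, so row counts on the real data would differ from the R results.

```python
import numpy as np
import pandas as pd


def make_biased(df, obs_rate_c=0.5, obs_rate_t=0.5, seed=1):
    """Drop rows at random so that treatment correlates with purchase propensity."""
    rng = np.random.default_rng(seed)
    # Likely buyers: high past spend, a recent purchase, or multichannel customers.
    likely = (df["history"] > 300) | (df["recency"] < 6) | (df["channel"] == "Multichannel")
    # Control group: thin out likely buyers. Treatment group: thin out everyone else.
    keep_prob_c = np.where(likely, obs_rate_c, 1.0)
    keep_prob_t = np.where(likely, 1.0, obs_rate_t)
    keep_prob = np.where(df["treatment"] == 0, keep_prob_c, keep_prob_t)
    return df[rng.random(len(df)) < keep_prob]


# Tiny invented frame with the same column names as df_male.
toy = pd.DataFrame({
    "history": [500.0, 50.0, 500.0, 50.0],
    "recency": [10, 10, 10, 10],
    "channel": ["Web", "Web", "Phone", "Phone"],
    "treatment": [1, 1, 0, 0],
})
biased_toy = make_biased(toy)
print(biased_toy)
```

On the real data this would be called as `make_biased(df_male)`. Treated likely buyers and untreated unlikely buyers are always kept, which is what pushes the naive comparison away from the RCT result.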

## Summarizing the biased data and testing for a significant difference

**R code**

```R
# Compare the groups with summary statistics
summary_by_segment_biased <- biased_data %>%
    group_by(treatment) %>%
    summarise(
        conversion_rate = mean(conversion),
        spend_mean = mean(spend),
        count = n(),
        .groups = "drop"
    )
summary_by_segment_biased
```

|treatment|conversion_rate|spend_mean|count|
|:--|:--|:--|:--|
|`<dbl>`|`<dbl>`|`<dbl>`|`<int>`|
|0|0.004977838|0.5483062|14665|
|1|0.013431794|1.5277526|17198|


```R
# Spend for the treatment group
mens_mail_biased <- biased_data %>%
    filter(treatment == 1) %>%
    pull(spend)

# Spend for the control group
no_mail_biased <- biased_data %>%
    filter(treatment == 0) %>%
    pull(spend)

# Welch's t-test on the difference in mean spend between the two groups
rct_ttest_biased <- t.test(mens_mail_biased, no_mail_biased, var.equal = FALSE)
rct_ttest_biased
```

```
	Welch Two Sample t-test

data:  mens_mail_biased and no_mail_biased
t = 5.9164, df = 27557, p-value = 3.33e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6549631 1.3039298
sample estimates:
mean of x mean of y 
1.5277526 0.5483062 
```
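
As a quick arithmetic check, the group means printed in the RCT summary and in the biased t-test output can be compared directly. The variable names below are assumptions for illustration; the numbers are copied from the outputs shown in this notebook.

```python
# Mean spend per group, copied from the outputs shown in this notebook.
rct_treated, rct_control = 1.422617, 0.652789
biased_treated, biased_control = 1.5277526, 0.5483062

# Naive effect estimate: difference in mean spend between treatment and control.
rct_effect = rct_treated - rct_control
biased_effect = biased_treated - biased_control

print(f"RCT estimate:    {rct_effect:.4f}")
print(f"Biased estimate: {biased_effect:.4f}")
print(f"Difference:      {biased_effect - rct_effect:.4f}")
```

The difference in means is roughly 0.77 in the RCT data and roughly 0.98 in the biased data, so the simple comparison on the biased sample gives a larger estimate of the e-mail's effect than the randomized comparison does.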


# Examining the biased data with regression analysis

## Regression analysis on the biased data

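**R code**

```R
# Regress spend on the treatment indicator, with past purchase amount as a covariate
biased_reg <- lm(data = biased_data, formula = spend ~ treatment + history)

# Extract the coefficient table as a data frame
biased_reg_coef <- tidy(biased_reg)
biased_reg_coef
```

|term|estimate|std.error|statistic|p.value|
|:--|:--|:--|:--|:--|
|`<chr>`|`<dbl>`|`<dbl>`|`<dbl>`|`<dbl>`|
|(Intercept)|0.324199564|0.14443899|2.244543|0.0248043|
|treatment|0.902610917|0.174305713|5.178321|2.252514e-07|
|history|0.001092682|0.000336606|3.246176|0.001170872|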
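
The regression is only shown in R. As a sketch of what the `estimate` column represents, the same model form `spend ~ treatment + history` can be fitted by ordinary least squares with NumPy. The helper name `ols_coefficients` and the toy data are assumptions for illustration; the sketch returns point estimates only, without the standard errors and p-values that `tidy()` reports.

```python
import numpy as np


def ols_coefficients(spend, treatment, history):
    """Least-squares fit of spend ~ treatment + history.

    Returns the coefficients in the order (intercept, treatment, history).
    """
    # Design matrix: a column of ones for the intercept, then the two regressors.
    X = np.column_stack([np.ones(len(spend)), treatment, history])
    coef, *_ = np.linalg.lstsq(X, np.asarray(spend, dtype=float), rcond=None)
    return coef


# Noise-free toy data built from known coefficients, so the fit should recover them.
treatment = np.array([0, 1, 0, 1, 0, 1])
history = np.array([100.0, 150.0, 300.0, 250.0, 500.0, 400.0])
spend = 0.3 + 0.9 * treatment + 0.001 * history
print(ols_coefficients(spend, treatment, history))
```

The coefficient on `treatment` is the estimated difference in spend between the two groups at a fixed value of `history`, which is the quantity to compare against the difference in means from the RCT data.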


## What not to do in regression analysis for effect estimation