# Mobile Games Retention A/B Testing Work Out
## 1.EDA
## 2.Chi-Square Test
## 3.Proportions Z Test
## 4.Bootstrap + Independent-Samples T Test
## Conclusion

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 1.EDA

In [None]:
df = pd.read_csv('/kaggle/input/mobile-games-ab-testing/cookie_cats.csv')
df.head()

In [None]:
df.info()

In [None]:
# Check for duplicate user IDs
df.userid.duplicated().sum()

The data we have is from 90,189 players that installed the game while the AB-test was running. The variables are:

* userid - a unique number that identifies each player.
* version - whether the player was put in the control group (gate_30 - a gate at level 30) or the group with the moved gate (gate_40 - a gate at level 40).
* sum_gamerounds - the number of game rounds played by the player during the first 14 days after install.
* retention_1 - did the player come back and play 1 day after installing?
* retention_7 - did the player come back and play 7 days after installing?

When a player installed the game, he or she was randomly assigned to either gate_30 or gate_40. As a sanity check, let's see if there are roughly the same number of players in each AB group.

In [None]:
# Count the number of players in each group
df.version.value_counts()

The two groups have roughly the same number of players.

In [None]:
# Number of players retained on day 1, per group
df.groupby('version').retention_1.sum()

In [None]:
# Day-1 retention rate, per group
df.groupby('version').retention_1.mean()

In [None]:
from statsmodels.stats.proportion import proportion_confint
# 95% confidence interval for day-1 retention (retained count, group size)
lower_30,upper_30=proportion_confint(20034,44700,alpha=0.05)
print('gate_30 day-1 retention, 95% CI - lower: {:.3%} , upper: {:.3%}'.format(lower_30,upper_30))
lower_40,upper_40=proportion_confint(20119,45489,alpha=0.05)
print('gate_40 day-1 retention, 95% CI - lower: {:.3%} , upper: {:.3%}'.format(lower_40,upper_40))
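
For reference, the interval above can be reproduced by hand. The following is a minimal sketch, assuming `proportion_confint` is left at its default normal-approximation (Wald) method; the helper name `wald_interval` is hypothetical and is not part of statsmodels.

```python
from statistics import NormalDist

def wald_interval(successes, n, alpha=0.05):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value, about 1.96 for alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - half_width, p_hat + half_width

# Day-1 counts for gate_30, copied from the cell above
lower, upper = wald_interval(20034, 44700)
print('gate_30 day-1 retention, 95% CI: {:.3%} to {:.3%}'.format(lower, upper))
```

The interval is simply the observed rate plus or minus the critical value times the standard error, so it narrows as the group size grows.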

In [None]:
# Number of players retained on day 7, per group
df.groupby('version').retention_7.sum()

In [None]:
# Day-7 retention rate, per group
df.groupby('version').retention_7.mean()

In [None]:
# 95% confidence interval for day-7 retention (retained count, group size)
lower_30,upper_30=proportion_confint(8502,44700,alpha=0.05)
print('gate_30 day-7 retention, 95% CI - lower: {:.3%} , upper: {:.3%}'.format(lower_30,upper_30))
lower_40,upper_40=proportion_confint(8279,45489,alpha=0.05)
print('gate_40 day-7 retention, 95% CI - lower: {:.3%} , upper: {:.3%}'.format(lower_40,upper_40))

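## 2.Chi-Square Test
* H0: Placing the gate at level 30 versus level 40 has no significant effect on user retention
* H1: Placing the gate at level 30 versus level 40 affects user retention

### Day-1 retention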

In [None]:
A = df[df.version=='gate_30']
B = df[df.version=='gate_40']

In [None]:
retain_1d_30 = len(A[A.retention_1==True])
notretain_1d_30 = len(A[A.retention_1==False])
retain_1d_40 = len(B[B.retention_1==True])
notretain_1d_40 = len(B[B.retention_1==False])

In [None]:
observed_1d = pd.DataFrame({'gate_30' : {'retain_1d' : retain_1d_30, 'notretain_1d' : notretain_1d_30},
                         'gate_40' : {'retain_1d' : retain_1d_40, 'notretain_1d' : notretain_1d_40}
                         }) 
observed_1d

In [None]:
retain_1d = retain_1d_30 + retain_1d_40
notretain_1d = notretain_1d_30 + notretain_1d_40
print(retain_1d)
print(notretain_1d)

In [None]:
# Expected (pooled) day-1 retention rate under H0
rate_exp_1d = retain_1d/(retain_1d + notretain_1d)
rate_exp_1d

In [None]:
exp_retain_1d = len(df)*rate_exp_1d
exp_notretain_1d = len(df)*(1 - rate_exp_1d)
print(exp_retain_1d)
print(exp_notretain_1d)

In [None]:
from scipy import stats
observed = [notretain_1d_30,notretain_1d_40,retain_1d_30,retain_1d_40]
expected = [exp_notretain_1d/2,exp_notretain_1d/2,exp_retain_1d/2,exp_retain_1d/2,]
stats.chisquare(f_obs= observed , f_exp = expected)

The p-value is below α (0.05), so we reject the null hypothesis; the observed day-1 retention rate is higher in the gate_30 group than in the gate_40 group.
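
Note that the `stats.chisquare` call above treats the four cells as a goodness-of-fit problem: it uses 3 degrees of freedom, and the expected counts split each total evenly between the two groups even though the groups differ in size. A chi-square test of independence on the 2×2 table is the more usual choice for comparing two groups, and its result may differ from the one above. The following is a minimal sketch of that test without continuity correction; the helper name `chi2_independence_2x2` is hypothetical, and the same statistic should be obtainable from `scipy.stats.chi2_contingency(table, correction=False)`.

```python
import math

def chi2_independence_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Expected counts come from the row and column totals, so unequal
    group sizes are accounted for. No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: gate_30, gate_40; columns: retained, not retained (day-1 counts)
day1_table = [[20034, 24666], [20119, 25370]]
chi2_1d, p_1d = chi2_independence_2x2(day1_table)
print('chi2 = {:.3f}, p = {:.4f}'.format(chi2_1d, p_1d))
```

Swapping in the day-7 counts gives the corresponding test for 7-day retention.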

### Day-7 retention

In [None]:
retain_7d_30 = len(A[A.retention_7==True])
notretain_7d_30 = len(A[A.retention_7==False])
retain_7d_40 = len(B[B.retention_7==True])
notretain_7d_40 = len(B[B.retention_7==False])

In [None]:
observed_7d = pd.DataFrame({'gate_30' : {'retain_7d' : retain_7d_30, 'notretain_7d' : notretain_7d_30},
                         'gate_40' : {'retain_7d' : retain_7d_40, 'notretain_7d' : notretain_7d_40}
                         }) 
observed_7d

In [None]:
retain_7d = retain_7d_30 + retain_7d_40
notretain_7d = notretain_7d_30 + notretain_7d_40
print(retain_7d)
print(notretain_7d)

In [None]:
# Expected (pooled) day-7 retention rate under H0
rate_exp_7d = retain_7d/(retain_7d + notretain_7d)
rate_exp_7d

In [None]:
exp_retain_7d = len(df)*rate_exp_7d
exp_notretain_7d = len(df)*(1 - rate_exp_7d)
print(exp_retain_7d)
print(exp_notretain_7d)

In [None]:
observed = [notretain_7d_30,notretain_7d_40,retain_7d_30,retain_7d_40]
expected = [exp_notretain_7d/2,exp_notretain_7d/2,exp_retain_7d/2,exp_retain_7d/2,]
stats.chisquare(f_obs= observed , f_exp = expected)

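The p-value is below α (0.05), so we reject the null hypothesis; the observed day-7 retention rate is higher in the gate_30 group than in the gate_40 group.

* Chi-square test conclusion: compared with the gate_40 group, the gate_30 group has higher day-1 and day-7 retention rates.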

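## 3.Proportions Z Test
* H0: The retention rate with the gate at level 30 is at least 1 percentage point higher than with the gate at level 40
* H1: The retention rate with the gate at level 30 exceeds that with the gate at level 40 by less than 1 percentage point

### Day-1 retention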

In [None]:
observed_1d

In [None]:
from statsmodels.stats.proportion import proportions_ztest
z_score,p_value = proportions_ztest([20034,20119],[20034+24666,20119+25370],alternative='smaller',value=0.01)
print('Z statistic: {}  P-value: {}'.format(z_score,p_value))

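The p-value is 0.11, above 0.05, so we fail to reject the null hypothesis that day-1 retention with the gate at level 30 is at least 1 percentage point higher than with the gate at level 40.

### Day-7 retention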
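
To make the test statistic explicit, here is a minimal sketch of the pooled-variance z statistic for H1: p1 - p2 < value. It assumes this is what `proportions_ztest` computes with `alternative='smaller'` and its other settings left at their defaults; the helper name `two_proportion_ztest_smaller` is hypothetical.

```python
from statistics import NormalDist

def two_proportion_ztest_smaller(count1, n1, count2, n2, value=0.0):
    """One-sided z-test of H1: p1 - p2 < value, using the pooled proportion."""
    p1, p2 = count1 / n1, count2 / n2
    p_pool = (count1 + count2) / (n1 + n2)
    std_err = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2 - value) / std_err
    # Lower-tail probability, because the alternative is "smaller"
    p_value = NormalDist().cdf(z)
    return z, p_value

# Day-1 counts: gate_30 first, gate_40 second; hypothesized gap of 0.01
z_day1, p_day1 = two_proportion_ztest_smaller(20034, 44700, 20119, 45489, value=0.01)
print('Z statistic: {:.4f}  P-value: {:.4f}'.format(z_day1, p_day1))
```

Because `value` is subtracted from the observed difference, a large p-value here only means the data do not rule out a gap of 0.01; it does not show that the gap is that large.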

In [None]:
observed_7d

In [None]:
z_score,p_value = proportions_ztest([8502,8279],[8502+36198,8279+37210],alternative='smaller',value=0.01)
print('Z statistic: {}  P-value: {}'.format(z_score,p_value))

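The p-value is 0.24, above 0.05, so we fail to reject the null hypothesis that day-7 retention with the gate at level 30 is at least 1 percentage point higher than with the gate at level 40.

* Proportions Z test conclusion: the null hypothesis that retention with the gate at level 30 is at least 1 percentage point higher than with the gate at level 40 is not rejected.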

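## 4.Bootstrap + Independent-Samples T Test
* H0: Placing the gate at level 30 versus level 40 has no significant effect on user retention
* H1: Placing the gate at level 30 versus level 40 affects user retention

### Day-1 retention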

In [None]:
# Bootstrap resampling, following the approach used on DataCamp
boot_1d = []
for i in range(2000):
    boot_mean = df.sample(frac=1,replace=True).groupby('version')['retention_1'].mean()
    boot_1d.append(boot_mean)

boot_1d = pd.DataFrame(boot_1d)

boot_1d.plot(kind='kde')

In [None]:
boot_1d

In [None]:
prob=(boot_1d.gate_30-boot_1d.gate_40 > 0).mean()
print('Probability that gate_30 day-1 retention exceeds gate_40 day-1 retention: {:.1%}'.format(prob))

In [None]:
import matplotlib.pyplot as plt
# Normality check (Q-Q plots)
stats.probplot(boot_1d['gate_30'],dist='norm',plot=plt)
stats.probplot(boot_1d['gate_40'],dist='norm',plot=plt)

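The Q-Q plots show that both groups are approximately normally distributed.

In [None]:
# Test for homogeneity of variance
stats.levene(boot_1d['gate_30'],boot_1d['gate_40'])

The p-value is above 0.05, so we treat the two groups as having equal variances.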

In [None]:
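# Independent two-sample t-test
stats.ttest_ind(boot_1d['gate_30'],boot_1d['gate_40'],equal_var=True)

The p-value is 0, so we reject the null hypothesis: the gate_30 group has a higher day-1 retention rate than the gate_40 group.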
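
One caveat: the p-value of a t-test run on bootstrap replicates shrinks as the number of replicates grows, so it is worth summarizing the bootstrap in a second way. A common alternative is a percentile interval for the difference in retention rates. The following is a minimal sketch in plain Python, run on hypothetical toy data rather than the cookie_cats data. The helper name `bootstrap_diff_interval` is an assumption, and it resamples within each group, whereas the cell above resamples the whole data frame.

```python
import random

def bootstrap_diff_interval(group_a, group_b, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(sum(resample_a) / len(resample_a)
                     - sum(resample_b) / len(resample_b))
    diffs.sort()
    lower = diffs[int(n_boot * alpha / 2)]
    upper = diffs[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper

# Hypothetical toy data (NOT the real dataset): 1 = retained, 0 = not retained
toy_gate_30 = [1] * 450 + [0] * 550
toy_gate_40 = [1] * 440 + [0] * 560
ci_lower, ci_upper = bootstrap_diff_interval(toy_gate_30, toy_gate_40)
print('95% bootstrap interval for the difference: {:.3f} to {:.3f}'.format(ci_lower, ci_upper))
```

If the interval excludes 0, the difference is unlikely to be due to resampling noise alone; unlike the t-test p-value, the interval width reflects the sample size rather than the number of replicates.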

### Day-7 retention

In [None]:
# Bootstrap resampling, following the approach used on DataCamp
boot_7d = []
for i in range(2000):
    boot_mean = df.sample(frac=1,replace=True).groupby('version')['retention_7'].mean()
    boot_7d.append(boot_mean)

boot_7d = pd.DataFrame(boot_7d)

boot_7d.plot(kind='kde')

In [None]:
boot_7d

In [None]:
prob=(boot_7d.gate_30-boot_7d.gate_40 > 0).mean()
print('Probability that gate_30 day-7 retention exceeds gate_40 day-7 retention: {:.1%}'.format(prob))

In [None]:
# Normality check (Q-Q plots)
stats.probplot(boot_7d['gate_30'],dist='norm',plot=plt)
stats.probplot(boot_7d['gate_40'],dist='norm',plot=plt)

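The Q-Q plots show that both groups are approximately normally distributed.

In [None]:
# Test for homogeneity of variance
stats.levene(boot_7d['gate_30'],boot_7d['gate_40'])

The p-value is above 0.05, so we treat the two groups as having equal variances.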

In [None]:
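# Independent two-sample t-test (on the day-7 bootstrap replicates)
stats.ttest_ind(boot_7d['gate_30'],boot_7d['gate_40'],equal_var=True)

The p-value is 0, so we reject the null hypothesis: the gate_30 group has a higher day-7 retention rate than the gate_40 group.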

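* Bootstrap + independent-samples T test conclusion: compared with the gate_40 group, the gate_30 group has a higher retention rate.

## Conclusion
The gate_30 group has the higher retention rate (the Z test does not reject a lead of at least 1 percentage point over the gate_40 group).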