## Analyzing A/B test data to determine whether the old and new page versions differ significantly in user conversion

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
# Load the data
ABtest=pd.read_csv('ab_data.csv')

In [3]:
ABtest.head()

|  | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


In [5]:
ABtest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


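- user_id: user ID

- timestamp: timestamp of the user's visit

- group: marks the treatment and control groups

    - control: old page (control group)
    - treatment: new page (treatment group)
- landing_page: page version

- converted: whether the user converted (0 = not converted, 1 = converted)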

## Data analysis

In [7]:
# Check for missing values
ABtest.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

In [14]:
# Check for logically inconsistent rows (group and landing page do not match)
t_old_error =len(ABtest[(ABtest['group']=='treatment')&(ABtest['landing_page']=='old_page')])
t_new_error =len(ABtest[(ABtest['group']=='control')&(ABtest['landing_page']=='new_page')])

In [17]:
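print('Treatment-group rows that saw the old page: {}, control-group rows that saw the new page: {}'.format(t_old_error,t_new_error))

Treatment-group rows that saw the old page: 1965, control-group rows that saw the new page: 1928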
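
As a cross-check, the same mismatch counts can be read off a contingency table of `group` against `landing_page`. The sketch below uses a small made-up DataFrame (the rows are illustrative, not taken from `ab_data.csv`) and assumes only the two column names shown above.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
demo = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page", "new_page"],
})

# Rows = group, columns = landing_page; the mismatched combinations are
# (control, new_page) and (treatment, old_page).
table = pd.crosstab(demo["group"], demo["landing_page"])
print(table)

control_saw_new = table.loc["control", "new_page"]
treatment_saw_old = table.loc["treatment", "old_page"]
print(control_saw_new, treatment_saw_old)
```

On the real data, replacing `demo` with `ABtest` should reproduce the two counts printed above.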


In [18]:
# Keep only rows where group and landing page match
matched = (
    ((ABtest['group']=='treatment') & (ABtest['landing_page']=='new_page'))
    | ((ABtest['group']=='control') & (ABtest['landing_page']=='old_page'))
)
ABtest=ABtest[matched]

In [19]:
ABtest.shape

(290585, 5)

In [20]:
# Preview the data with duplicate user_ids dropped (the result is not assigned back yet)
ABtest.drop_duplicates(subset='user_id').head()

|  | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |
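
Previewing the de-duplicated frame does not show how many duplicates exist. A minimal sketch of a direct check with `duplicated()` follows; the toy DataFrame and the names `n_repeated_rows` and `repeated` are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only; user_id 2 appears twice.
demo = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "converted": [0, 1, 1, 0],
})

# Number of rows whose user_id already appeared earlier in the frame.
n_repeated_rows = int(demo["user_id"].duplicated().sum())

# Every row that belongs to a user_id occurring more than once.
repeated = demo[demo["user_id"].duplicated(keep=False)]

print(n_repeated_rows)
print(repeated)
```

Inspecting `repeated` before dropping anything makes it easier to decide which copy of each user to keep.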


In [21]:
# Drop duplicate user_ids; keep='first' keeps the first row in the current order for each user,
# which is the most recent record only if the data is sorted newest-first.
# Assigning the result (instead of inplace=True on a filtered frame) avoids SettingWithCopyWarning.
ABtest=ABtest.drop_duplicates(subset=['user_id'],keep='first')
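
If the intent is to keep each user's most recent record, `keep='first'` only does that when the rows are already sorted newest-first. One possible approach, sketched on made-up rows, is to parse `timestamp`, sort by it, and keep the last row per user; `demo` and `latest` are illustrative names.

```python
import pandas as pd

# Toy data for illustration only; user 10 has two visits.
demo = pd.DataFrame({
    "user_id": [10, 20, 10],
    "timestamp": ["2017-01-05 10:00:00", "2017-01-06 09:30:00", "2017-01-09 08:15:00"],
    "converted": [0, 1, 1],
})

# Parse the string timestamps so sorting is chronological.
demo["timestamp"] = pd.to_datetime(demo["timestamp"])

# Sort oldest-to-newest, keep the last (most recent) row per user,
# and rebuild the index without keeping the old one as a column.
latest = (
    demo.sort_values("timestamp")
        .drop_duplicates(subset="user_id", keep="last")
        .reset_index(drop=True)
)
print(latest)
```

Assigning the result to a new variable, rather than using `inplace=True` on a filtered frame, also avoids pandas' `SettingWithCopyWarning`.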


In [22]:
# reset_index() keeps the old index as a new 'index' column; pass drop=True to discard it
ABtest=ABtest.reset_index()

In [23]:
ABtest.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290584 entries, 0 to 290583
Data columns (total 6 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   index         290584 non-null  int64 
 1   user_id       290584 non-null  int64 
 2   timestamp     290584 non-null  object
 3   group         290584 non-null  object
 4   landing_page  290584 non-null  object
 5   converted     290584 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 13.3+ MB


In [25]:
# Share of users who saw the new page
rate=ABtest[ABtest.landing_page=='new_page'].shape[0]/ABtest.shape[0]
rate

0.5000619442226688

Users who saw the new page and users who saw the old page each make up about 50% of the sample, so traffic was split evenly.

## Hypothesis testing

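#### State the null and alternative hypotheses

- Let p1 be the conversion rate of the old page and p2 the conversion rate of the new page
- Null hypothesis: p1 ≥ p2, i.e. p1-p2 ≥ 0
- Alternative hypothesis: p1 < p2, i.e. p1-p2 < 0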

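#### Determine the sampling distribution

Each observation has only two possible outcomes, 0 or 1 (not converted or converted), so it follows a Bernoulli (0-1) distribution.

The two samples are independent and large (n > 30). The population mean and standard deviation are unknown, but with samples this large the difference in sample proportions is approximately normal, so a Z-test is used.

Significance level: 0.05 (left-tailed test, matching the alternative hypothesis p1-p2 < 0)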

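Z = (p1-p2)/sqrt[p_c*(1-p_c)*(1/n1+1/n2)], where p_c = (x1+x2)/(n1+n2) is the pooled conversion rate, x1 and x2 are the numbers of converted users, and n1 and n2 are the group sizes. The pooled rate is used because the test is evaluated at the boundary of the null hypothesis, where p1 = p2.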
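
The notebook's code computes the statistic with a pooled standard error, the usual choice when the null hypothesis is evaluated at equal rates. A minimal standard-library sketch of that calculation follows; the function name `pooled_two_proportion_z` and the example counts are made up for illustration.

```python
import math

def pooled_two_proportion_z(x1, n1, x2, n2):
    """Return the z statistic for p1 - p2 using the pooled standard error.

    x1, x2: numbers of converted users; n1, n2: group sizes.
    """
    p1 = x1 / n1
    p2 = x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Made-up counts for illustration: 100/1000 vs 130/1000 conversions.
z_demo = pooled_two_proportion_z(100, 1000, 130, 1000)
print(z_demo)
```

A negative value means the first group converted at a lower rate than the second, which is the direction the alternative hypothesis looks for.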

In [27]:
# Count users on the old and new versions (query filters rows by a condition)
n_old=ABtest.query("group=='control'").shape[0]
n_new=ABtest.query("group=='treatment'").shape[0]

In [28]:
# Number of converted users on the old and new versions
convert_old=ABtest.query("group=='control'&converted==1").shape[0]
convert_new=ABtest.query("group=='treatment'&converted==1").shape[0]

In [29]:
# Conversion rates of the old and new versions
p_old=convert_old/n_old
p_new=convert_new/n_new
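
The three cells above can also be collapsed into a single grouped summary. The sketch below is one way to do it with named aggregation; the toy DataFrame and the output column names `users`, `conversions`, and `rate` are assumptions for illustration.

```python
import pandas as pd

# Toy data for illustration only.
demo = pd.DataFrame({
    "group": ["control", "control", "control",
              "treatment", "treatment", "treatment", "treatment"],
    "converted": [0, 1, 0, 1, 1, 0, 0],
})

# Because converted is 0/1, count = users, sum = conversions, mean = conversion rate.
summary = demo.groupby("group")["converted"].agg(
    users="count", conversions="sum", rate="mean"
)
print(summary)
```

Because `converted` is coded 0/1, its mean per group is exactly the conversion rate, so no separate division is needed.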

In [32]:
# Pooled conversion rate across both groups
p_c=(convert_old+convert_new)/(n_old+n_new)

In [34]:
z=(p_old-p_new)/np.sqrt(p_c*(1-p_c)*(1/n_old +1/n_new))

In [36]:
from scipy.stats import norm

In [37]:
# Critical value for a left-tailed test at significance level 0.05
z_alpha=norm.ppf(0.05)

In [40]:
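print('Old version - total users:',n_old,'converted users:',convert_old,'conversion rate:',p_old)
print('New version - total users:',n_new,'converted users:',convert_new,'conversion rate:',p_new)
print('Pooled conversion rate:',p_c)
print('Test statistic z:',z)
print('Critical value z_alpha:',z_alpha)
# Left-tailed test: reject the null hypothesis only when z falls below the critical value
result='In the rejection region: reject the null hypothesis' if z < z_alpha else 'Fail to reject the null hypothesis'
print(result)

Old version - total users: 145274 converted users: 17489 conversion rate: 0.1203863045004612
New version - total users: 145310 converted users: 17872 conversion rate: 0.12299222352212512
Pooled conversion rate: 0.12168942543292129
Test statistic z: -2.1484056695589
Critical value z_alpha: -1.6448536269514729
In the rejection region: reject the null hypothesis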
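
The same decision can be expressed as a p-value instead of a critical-value comparison. The sketch below uses the standard library's `statistics.NormalDist` in place of `scipy.stats.norm` (a substitution made here so the example has no third-party dependencies) and plugs in the z statistic printed above.

```python
from statistics import NormalDist

z = -2.1484056695589   # test statistic printed above
alpha = 0.05           # significance level

std_normal = NormalDist()            # mean 0, standard deviation 1
p_value = std_normal.cdf(z)          # left-tailed: P(Z <= z)
z_alpha = std_normal.inv_cdf(alpha)  # counterpart of norm.ppf(0.05)

print(p_value, z_alpha)
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With scipy, `norm.cdf(z)` gives the same left-tail probability. Comparing `p_value < alpha` is equivalent to checking `z < z_alpha` for a left-tailed test.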


In [42]:
# Pooled standard deviation
# (the textbook formula divides by n_old+n_new-2; at this sample size the difference is negligible)
std_old=ABtest[ABtest.landing_page=='old_page'].converted.std()
std_new=ABtest[ABtest.landing_page=='new_page'].converted.std()
s=np.sqrt(((n_old-1)*std_old**2+(n_new-1)*std_new**2)/(n_old+n_new))
d=(p_old-p_new)/s
print("Cohen's d:",d)

Cohen's d: -0.007971019822418027


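The absolute value of Cohen's d (about 0.008) is far below 0.2, the conventional threshold for a small effect. So although the difference is statistically significant at the 5% level, its practical magnitude is negligible.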
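
Cohen's d is defined for differences in means; for two proportions, Cohen's h is a commonly used alternative. The sketch below is an optional cross-check rather than part of the original analysis: it computes h and the absolute difference from the conversion rates printed earlier. The function name `cohens_h` is made up for illustration.

```python
import math

p_old = 0.1203863045004612    # old-page conversion rate from the output above
p_new = 0.12299222352212512   # new-page conversion rate from the output above

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-square-root transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

h = cohens_h(p_old, p_new)
abs_diff = p_new - p_old      # absolute lift in conversion rate

print(h, abs_diff)
```

Both numbers point the same way as Cohen's d: the new page converts slightly better, but the gap is only a fraction of a percentage point.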