### Implementing an A/B Test
We downloaded a dataset from Kaggle that records conversions for two groups (control and treatment), each exposed to a different landing page.

For this example we are going to test the null hypothesis:

    P(conversion in the treatment group) - P(conversion in the control group) = 0

Alternative hypothesis:

    P(conversion in the treatment group) - P(conversion in the control group) != 0
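
A minimal sketch of how a two-sided test of this hypothesis pair can be computed by hand, assuming the usual pooled two-proportion z-test. The function name `two_proportion_ztest` and the counts passed to it are hypothetical and only illustrate the mechanics; they are not taken from the dataset.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test of H0: p_a - p_b = 0, using the pooled proportion."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both groups share one conversion rate, so pool the counts.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value: probability of a |z| at least this large under H0.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts, chosen only to illustrate the call.
z, p_value = two_proportion_ztest(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The null hypothesis would be rejected when the p-value falls below the chosen significance level. The `proportions_ztest` function from statsmodels, imported later in this notebook, is the library route to the same kind of test.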
    
#### Points to remember
Baseline rate: an estimate of the metric being analyzed before making any changes.

Practical significance level: the minimum change to the baseline rate that is useful to the business. For example, an increase in the conversion rate of 0.001% may not be worth the effort required to make the change, whereas a 2% increase may be.

Significance level (alpha): the probability that the null hypothesis (treatment and control are the same) is rejected when it shouldn’t be. The confidence level is 1 - alpha, so a significance level of 0.05 corresponds to a 95% confidence level.

Sensitivity (statistical power): the probability that the null hypothesis is rejected when it should be. It equals 1 - beta, where beta is the probability that the null hypothesis is not rejected when it should be.
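
To make the relationship between these quantities concrete, here is a small sketch using only the standard library. The helper `approx_power` is hypothetical and uses a normal approximation that keeps only the dominant tail, so treat its result as a rough guide. The inputs (a 12% baseline, a one-point lift, 17,000 users per group) are illustrative round numbers.

```python
from math import sqrt
from statistics import NormalDist

alpha = 0.05                  # significance level: P(reject H0 | H0 is true)
confidence_level = 1 - alpha  # the matching confidence level
power = 0.80                  # sensitivity: P(reject H0 | H0 is false)
beta = 1 - power              # P(fail to reject H0 | H0 is false)


def approx_power(p1, p2, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (dominant tail only)."""
    se = sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_crit)


# Illustrative inputs: a 12% baseline, a one-point lift, 17,000 users per group.
print(round(approx_power(0.12, 0.13, 17_000), 3))
```

Holding the effect fixed, a larger sample raises power, and a stricter (smaller) alpha lowers it.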

In [1]:
import numpy as np
import pandas as pd

import seaborn as sns
import statsmodels.stats.api as sms
import scipy.stats as stats
import matplotlib.pyplot as plt
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportions_chisquare

%matplotlib inline
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'retina'



In [2]:
raw_data = pd.ExcelFile("data_ctrl_treat_grp.xlsx")
df = raw_data.parse(raw_data.sheet_names[0])


In [3]:
print("Number of rows: ", df.shape[0], " Number of columns: ", df.shape[1])
df.head()

Number of rows:  294478  Number of columns:  5


   user_id        timestamp      group  landing_page  converted
0   851104  00:11:48.600000    control      old_page          0
1   804228  00:01:45.200000    control      old_page          0
2   661590  00:55:06.200000  treatment      new_page          0
3   853541  00:28:03.100000  treatment      new_page          0
4   864975  00:52:26.200000    control      old_page          1


In [4]:
df["group"].value_counts()

treatment    147276
control      147202
Name: group, dtype: int64

In [5]:
print(df.groupby(['group', 'landing_page']).size())
print('\n')
print(df.groupby(['group', 'converted']).size())

group      landing_page
control    new_page          1928
           old_page        145274
treatment  new_page        145311
           old_page          1965
dtype: int64


group      converted
control    0            129479
           1             17723
treatment  0            129762
           1             17514
dtype: int64


In [6]:
# Some control users saw the new_page and some treatment users saw the old_page; drop these mismatched rows
mask1 = (df["group"] == "control") & (df["landing_page"] == "new_page")
index_to_drop1 = df[mask1].index
df = df.drop(index_to_drop1)

mask2 = (df["group"] == "treatment") & (df["landing_page"] == "old_page")
index_to_drop2 = df[mask2].index
df = df.drop(index_to_drop2)
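
The same clean-up can be expressed with a single boolean mask. The sketch below assumes the same column names as the notebook and runs on a tiny made-up frame (`toy`), not on the real data. The mapping from group to expected page is written out explicitly as an assumption about the experiment design.

```python
import pandas as pd

# Tiny made-up frame with the same column names as the notebook's data.
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment"],
    "landing_page": ["old_page", "new_page", "new_page", "old_page"],
    "converted": [0, 1, 0, 1],
})

# Page each group is supposed to see (assumed experiment design).
expected_page = toy["group"].map({"control": "old_page", "treatment": "new_page"})

# Keep only rows where the page shown matches the assigned group.
aligned = toy[toy["landing_page"] == expected_page]
print(aligned)
```

One design note: filtering with a mask returns a new frame and leaves the original untouched, which makes it easy to count how many rows were discarded.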


In [7]:
df["group"].value_counts()

treatment    145311
control      145274
Name: group, dtype: int64

In [12]:
# Drop duplicate users, keeping each user's first row
df.drop_duplicates(subset="user_id", keep="first", inplace=True)
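
Before dropping duplicates it can help to count them. The group sizes printed in this notebook (145311 + 145274 = 290585 rows before, 290584 after) suggest that a single row is removed here. The sketch below shows the check on a made-up frame (`toy`); the values are illustrative only.

```python
import pandas as pd

# Made-up data: user 2 appears twice.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "converted": [0, 0, 1, 1],
})

# Count rows that repeat an earlier user_id.
n_dupes = int(toy["user_id"].duplicated(keep="first").sum())
print("duplicate rows:", n_dupes)

# Keep the first row per user; later rows for the same user are discarded.
deduped = toy.drop_duplicates(subset="user_id", keep="first")
print(deduped)
```

Note that `keep="first"` depends on row order, so the retained record for a repeated user is whichever appears first in the frame.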


In [13]:
# Show the % split between users who saw the new vs. the old page
# Count conversions and users in each group
mask = (df["group"] == "control")
conversions_control = df["converted"][mask].sum()  # total number of conversions in the control group
total_users_control = df["converted"][mask].count()  # total number of users in the control group

mask = (df["group"] == "treatment")
conversions_treatment = df["converted"][mask].sum()  # total number of conversions in the treatment group
total_users_treatment = df["converted"][mask].count()  # total number of users in the treatment group

print("Split of control users who saw old page vs treatment users who saw new page: ", 
          round(total_users_control / df["converted"].count() * 100, 2), "% ",
          round((total_users_treatment / df["converted"].count()) * 100, 2), "%")

#count number of users who converted in each group
print("Number of control users who converted on old page: ", conversions_control)
print("Percentage of control users who converted: ", round((conversions_control / total_users_control) * 100, 2), "%")

mask = (df["group"] == "treatment")
print("Number of treatment users who converted on new page: ", conversions_treatment)
print("Percentage of treatment users who converted: ", round((conversions_treatment/ total_users_treatment) * 100, 2), "%")

Split of control users who saw old page vs treatment users who saw new page:  49.99 %  50.01 %
Number of control users who converted on old page:  17489
Percentage of control users who converted:  12.04 %
Number of treatment users who converted on new page:  17264
Percentage of treatment users who converted:  11.88 %
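
From the counts printed above, the pooled conversion probability and the standard error of the difference under the null hypothesis can be sketched as follows. The variable names mirror the notebook, but the values are hard-coded here so the snippet runs on its own. The treatment group size is inferred as 290584 - 145274 from the totals printed in this notebook, which is an assumption.

```python
from math import sqrt

# Counts copied from the notebook output.
conversions_control, total_users_control = 17489, 145274
conversions_treatment = 17264
# Inferred: total rows after de-duplication minus the control group size.
total_users_treatment = 290584 - total_users_control

# Pooled probability: overall conversion rate if both groups share one rate.
pooled_probability = (conversions_control + conversions_treatment) / (
    total_users_control + total_users_treatment
)

# Observed difference in conversion rates (treatment minus control).
difference = (conversions_treatment / total_users_treatment
              - conversions_control / total_users_control)

# Standard error of the difference under the null hypothesis.
pooled_se = sqrt(pooled_probability * (1 - pooled_probability)
                 * (1 / total_users_control + 1 / total_users_treatment))

print("pooled probability:", round(pooled_probability, 4))
print("difference:", round(difference, 4))
print("pooled standard error:", round(pooled_se, 5))
```

Dividing the difference by the pooled standard error gives the z statistic for the hypothesis stated at the top of this section.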


In [14]:
print(len(mask))
print((conversions_control))
print(total_users_control)

290584
17489
145274


#### Setting test parameters
Determining the sample size

In [28]:
# Check what sample size is required
baseline_rate = conversions_control / total_users_control
print('baseline_rate', baseline_rate)
practical_significance = 0.01  # user defined: minimum absolute lift worth detecting
confidence_level = 0.05  # user defined: this is the significance level (alpha), which corresponds to a 95% confidence level
sensitivity = 0.8  # user defined: statistical power (1 - beta)

# Cohen's h effect size for the two proportions
effect_size = sms.proportion_effectsize(baseline_rate, baseline_rate + practical_significance)
# Solve for the per-group sample size of a z-test on two independent samples
sample_size = sms.NormalIndPower().solve_power(effect_size=effect_size, power=sensitivity,
                                               alpha=confidence_level, ratio=1)
print("Required sample size: ", round(sample_size), " per group")

baseline_rate 0.1203863045004612
Required sample size:  17209  per group
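
As a cross-check, the per-group sample size can be approximated in closed form without statsmodels. This sketch assumes the effect size is Cohen's h (the arcsine-transformed difference between the two proportions) and uses the standard normal-approximation formula for two equal-sized groups. The helper name `cohens_h` is hypothetical. The result should land close to the figure reported above, though small numerical differences are possible because `solve_power` solves the power equation numerically.

```python
from math import asin, sqrt
from statistics import NormalDist

baseline_rate = 0.1203863045004612  # value printed by the notebook
practical_significance = 0.01       # minimum absolute lift worth detecting
alpha = 0.05                        # significance level
power = 0.8                         # sensitivity


def cohens_h(p1, p2):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


h = cohens_h(baseline_rate, baseline_rate + practical_significance)
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
z_beta = NormalDist().inv_cdf(power)

# Two equal-sized independent groups: n per group = 2 * ((z_alpha + z_beta) / h)^2
n_per_group = 2 * ((z_alpha + z_beta) / h) ** 2
print("Approximate sample size per group:", round(n_per_group))
```

Because the required size scales with 1 / h^2, halving the practical significance roughly quadruples the number of users needed in each group.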
