# E-commerce A/B Testing

We will work with a dataset from an A/B test run by an e-commerce website. The company has developed a new web page intended to increase the number of users who "convert", meaning users who decide to buy the company's product. Our goal in this notebook is to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before deciding.

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

import random
random.seed(42)

import warnings
warnings.filterwarnings('ignore')

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.stats.api as sms

import os

In [2]:
os.chdir('/Users/neera/Documents/PROJECTS/AB Testing')
df = pd.read_csv('ab_test.csv')

In [3]:
df.head()

Unnamed: 0,id,time,con_treat,page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,0
3,853541,28:03.1,treatment,new_page,0
4,864975,52:26.2,control,old_page,1


## Data Cleaning

#### Rename the columns to clearer names

In [4]:
df.columns = ['user_id', 'timestamp', 'group', 'landing_page', 'converted']

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


#### Checking if there are any missing values

In [6]:
df.isnull().sum()

user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64

#### Checking the number of mismatched rows between 'treatment' & 'new_page' and 'control' & 'old_page'

In [7]:
num_treatment = df[df['group'] == 'treatment'].shape[0]
num_new_page = df[df['landing_page'] == 'new_page'].shape[0]
diff = abs(num_treatment - num_new_page)

print("""
Treatment = {}
New Page = {}
Difference = {}
""".format(num_treatment, num_new_page, diff))


Treatment = 147276
New Page = 147239
Difference = 37



In [8]:
num_control = df[df['group'] == 'control'].shape[0]
num_old_page = df[df['landing_page'] == 'old_page'].shape[0]
diff = abs(num_control - num_old_page)

print("""
Control = {}
Old Page = {}
Difference = {}
""".format(num_control, num_old_page, diff))


Control = 147202
Old Page = 147239
Difference = 37



The number of users assigned to the treatment group does not match the number who landed on the new page, and the same holds for the control group and the old page. This may indicate a problem with the data and needs further exploration.

In [9]:
mismatches = pd.crosstab(index=df['group'], columns=df['landing_page'])
print(mismatches)

landing_page  new_page  old_page
group                           
control           1928    145274
treatment       145311      1965


1928 users are classified as control group but landed on the new page.
<br>
1965 users are classified as getting treatment but landed on the old page.

In [10]:
df_mismatch = df[(df['group'] == 'treatment') & (df['landing_page'] == 'old_page')
               |(df['group'] == 'control') & (df['landing_page'] == 'new_page')]

num_mismatch = df_mismatch.shape[0]
percent_mismatch = round(num_mismatch / len(df) * 100, 2)

print("Number of mismatched rows: {} rows".format(num_mismatch))
print("Percent of mismatched rows: {} percent".format(percent_mismatch))

Number of mismatched rows: 3893 rows
Percent of mismatched rows: 1.32 percent


There are 3893 rows where treatment is not paired with new_page or control is not paired with old_page. For these rows we cannot be sure whether the user actually received the new or the old page.
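As a minimal sketch of the same check, the mismatch can also be expressed as a single boolean mask by mapping each group to the page it is expected to see. The `toy` frame and the names `expected_page` and `is_mismatch` are hypothetical and used only for illustration; they are not part of the notebook.

```python
import pandas as pd

# Toy rows (hypothetical data, not taken from ab_test.csv)
toy = pd.DataFrame({
    'group': ['control', 'treatment', 'treatment', 'control'],
    'landing_page': ['old_page', 'new_page', 'old_page', 'new_page'],
})

# Page that each group is expected to see
expected_page = toy['group'].map({'control': 'old_page', 'treatment': 'new_page'})

# True where the recorded page differs from the expected one
is_mismatch = toy['landing_page'] != expected_page

toy_mismatch = toy[is_mismatch]    # rows to inspect or drop
toy_matched = toy[~is_mismatch]    # rows to keep for the analysis
```

Building the mask once and negating it keeps the mismatched and matched selections consistent with each other by construction.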

#### Selecting only the matched rows

In [11]:
df_matched = df[(df['group'] == 'treatment') & (df['landing_page'] == 'new_page')
        |(df['group'] == 'control') & (df['landing_page'] == 'old_page')]

df_matched.shape

(290585, 5)

#### Double check if there are any mismatched rows

In [12]:
df_matched[(df_matched['group'] == 'treatment') & (df_matched['landing_page'] == 'old_page')
               |(df_matched['group'] == 'control') & (df_matched['landing_page'] == 'new_page')].shape[0]

0

#### Check for and drop duplicate user_id values

In [13]:
df_matched['user_id'].nunique()

290584

In [14]:
len(df_matched) - df_matched['user_id'].nunique()

1

In [15]:
df_matched[df_matched.duplicated('user_id')]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
2893,773192,55:59.6,treatment,new_page,0


In [16]:
df_matched[df_matched['user_id'] == 773192]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,37:58.8,treatment,new_page,0
2893,773192,55:59.6,treatment,new_page,0


The duplicated user ID appears to be a single user who landed on the new page twice and did not convert either time.
<br>
In this case, we can simply delete one of the entries and treat the user as non-converted.
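Dropping one entry is only safe when the repeated rows agree on the outcome. A small sketch of that check, using hypothetical toy data and illustrative variable names, might look like this:

```python
import pandas as pd

# Toy data (hypothetical): user 2 appears twice with the same outcome
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'converted': [0, 0, 0, 1],
})

# keep=False marks every row of a repeated user_id, not only the later ones
all_dupes = toy[toy.duplicated('user_id', keep=False)]

# True only if each duplicated user has a single distinct outcome
consistent = (all_dupes.groupby('user_id')['converted'].nunique() == 1).all()

# keep='first' (the default) retains the earliest row per user
deduped = toy.drop_duplicates('user_id', keep='first')
```

If `consistent` were false, which row to keep would become a modelling decision rather than a simple cleanup step.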

In [17]:
df_matched = df_matched.drop_duplicates('user_id') 

In [18]:
df_matched.shape

(290584, 5)

**Findings:**
- There are 290584 unique users
- The data does not contain missing values
- 1.32% of the rows have a group that does not match the landing page
- The data contains one duplicated user, detected by comparing the number of rows with the number of unique user IDs
<br>

We dropped the mismatched rows and the duplicated user ID.

## Probabilities

#### The probability of an individual converting regardless of the page they receive

In [19]:
converted = round(df_matched.converted.mean(),4)
converted

0.1196

#### The probability of conversion for individuals in the control vs treatment group

In [20]:
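df_matched['user_id'] = df_matched['user_id'].astype(str)
# Select 'converted' explicitly so non-numeric columns are not averaged
# (recent pandas versions raise an error when asked to average text columns)
df_matched.groupby('group')[['converted']].mean().round(4)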

group,converted
control,0.1204
treatment,0.1188


#### The probability of an individual receiving the new page vs old page

In [21]:
pd.DataFrame(round(df_matched['landing_page'].value_counts(normalize = True), 4))

Unnamed: 0,landing_page
new_page,0.5001
old_page,0.4999


- The overall probability of an individual converting, regardless of landing page type, is 11.96%
- The control group has a conversion probability of 12.04%
- The treatment group has a conversion probability of 11.88%
- The probability of a user receiving the new landing page is 50.01%

The conversion probabilities of the treatment and control groups differ only slightly. We examine later whether this difference is statistically significant evidence about the effect of the new landing page on conversions. The conversion probability in each group is also close to the overall conversion probability regardless of the page received.

## Calculating Sample Size

In [22]:
control_convertrate = df_matched[df_matched['group'] == 'control']['converted'].mean() * 100

#### Calculating effect size based on our expected rates

In [23]:
effect_size = sms.proportion_effectsize(control_convertrate/100, control_convertrate/100+0.02)

#### Calculating sample size needed

In [24]:
required_n = sms.NormalIndPower().solve_power(
    effect_size, 
    power=0.8, 
    alpha=0.05, 
    ratio=1
    )                                                  

required_n = np.ceil(required_n)                                                    
print("Required data for each group: {}".format(required_n))

Required data for each group: 4444.0


For this experiment, we need at least 4444 observations for each group.
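To make the calculation less of a black box, here is a rough sketch of the arithmetic using only the standard library. It assumes Cohen's h (the arcsine-transformed difference) as the effect size and the usual normal approximation for two equal-sized groups. The function names are illustrative, the baseline is the rounded control rate, and the 0.02 is the same absolute lift of 2 percentage points used in the cell above.

```python
from math import asin, sqrt, ceil
from statistics import NormalDist

def cohens_h(p1, p2):
    """Cohen's effect size h for two proportions (arcsine transformation)."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

def n_per_group(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided test with equal groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    h = cohens_h(p1, p2)
    return ceil(2 * ((z_alpha + z_beta) / h) ** 2)

baseline = 0.1204  # rounded control conversion rate from the notebook
n = n_per_group(baseline, baseline + 0.02)
print(n)
```

Because the baseline is rounded and the approximation ignores the far tail of the two-sided test, the result should land close to, but not necessarily exactly on, the value reported by `solve_power`.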

In [25]:
pd.DataFrame(df_matched['group'].value_counts())

Unnamed: 0,group
treatment,145310
control,145274


Here, we have around 145000 observations in each group, far more than the required 4444 per group.

## A/B Testing

- Control group has a conversion probability of 12.04%
- Treatment group has a conversion probability of 11.88%

The observed conversion rate is lower in the treatment group than in the control group, so the sample gives no sign that the new design improves conversion.
However, this does not necessarily mean that the new design performs worse than the old page; the difference may be due to chance.

**HYPOTHESIS**
<br>
To test whether the conversion rates of the control and treatment groups differ significantly, our null and alternative hypotheses are:

$$
H_0: p_{new} = p_{old}
$$
$$
H_1: p_{new} \neq p_{old}
$$

The confidence level is set to 95% (0.95). Hence, $\alpha = 1 - 0.95 = 0.05$.

Finally, we perform a statistical test to compare the two groups. Given our large sample size, we can use a z-test to calculate the p-value. With a small sample, Student's t-test could be used instead.
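For intuition, a two-proportion z-test with a pooled standard error can be sketched by hand as follows. The function name is illustrative. The conversion counts are assumptions: they are reconstructed from the rounded rates (11.88% and 12.04%) and the group sizes reported earlier, so the result only approximates what a test on the exact counts would give.

```python
from math import sqrt, erfc

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                      # common rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))                    # equals 2 * (1 - Phi(|z|))
    return z, p_value

# Group sizes from the notebook; counts are approximations (see lead-in)
n_new, n_old = 145310, 145274
x_new = round(0.1188 * n_new)
x_old = round(0.1204 * n_old)

z, p = two_proportion_ztest(x_new, n_new, x_old, n_old)
print(z, p)
```

A negative `z` here simply reflects that the new page is listed first and has the lower observed rate.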

In [26]:
convert_old = df_matched[(df_matched['converted'] == 1) & (df_matched['landing_page'] == 'old_page')]['user_id'].nunique()
convert_new = df_matched[(df_matched['converted'] == 1) & (df_matched['landing_page'] == 'new_page')]['user_id'].nunique()
n_old = df_matched[df_matched['landing_page'] == 'old_page']['user_id'].nunique()
n_new = df_matched[df_matched['landing_page'] == 'new_page']['user_id'].nunique()

In [27]:
z_score, p_value = sm.stats.proportions_ztest(
    np.array([convert_new,convert_old]),
    np.array([n_new,n_old]), 
    alternative = 'two-sided')

In [28]:
z_score, p_value 

(-1.3109241984234394, 0.18988337448195103)

Using the test above, the p-value is 0.19, which is above our $\alpha = 0.05$ threshold.

With this p-value we fail to reject the null hypothesis $H_0$: the data show no statistically significant difference between the conversion rates of the new and old pages.
Thus, there is no evidence that the new page performs better than the old design.
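One way to see what failing to reject $H_0$ means in practice is to look at an approximate confidence interval for the difference in conversion rates. The sketch below is not part of the original analysis. It assumes a simple Wald interval with an unpooled standard error and uses the rounded rates and group sizes reported earlier; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(p1, n1, p2, n2, level=0.95):
    """Approximate (Wald) confidence interval for p1 - p2."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)     # about 1.96 for 95%
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z * se, d + z * se

# Rounded treatment and control rates with their group sizes
low, high = diff_ci(0.1188, 145310, 0.1204, 145274)
print(low, high)
```

An interval that contains zero is consistent with the two-sided test not rejecting the null hypothesis at the same level.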

## Conclusions

- The observed conversion rate of the new design (11.88%) is lower than that of the old page (12.04%), so the sample shows no improvement.
- However, this does not necessarily mean that the new design performs worse than the old page.
- Using A/B testing, we checked whether there is enough statistical evidence to conclude that the new page differs significantly from the old page.
- The p-value of 0.19 is above 0.05, so we fail to reject the null hypothesis: there is no statistically significant difference between the two pages.