<a href="https://colab.research.google.com/github/yulianthyho/AB-Testing-Conversion-Rate-in-New-Landing-Page/blob/main/A_B_Testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A/B Testing**



# Overview
An e-commerce company is revamping its landing page. Before rolling the new page out to a wider audience, the company wants to run an experiment to see whether it yields a better conversion rate.

**About the dataset**

We were given the experiment results for the control group and the experimental (treatment) group. Our hypothesis is that the new page (treatment group) yields a better conversion rate.

**Goal**

Help the company decide which landing page is better: keep the old page or implement the new one.

# Import Library

In [None]:
#import library
import pandas as pd
import numpy as np

# Load the Dataset

In [None]:
#load dataset
url = 'https://docs.google.com/spreadsheets/d/18QhOSytsU9g5qsjeq5HWmmhlm-UNdnV9bhdKwfA_v0I/edit#gid=842283717'
url = url.replace('/edit#gid=', '/export?format=csv&gid=')
df = pd.read_csv(url)
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:49,control,old_page,0
1,804228,2017-01-12 8:01:45,control,old_page,0
2,661590,2017-01-11 16:55:06,treatment,new_page,0
3,853541,2017-01-08 18:28:03,treatment,new_page,0
4,864975,2017-01-21 1:52:26,control,old_page,1


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       5000 non-null   int64 
 1   timestamp     5000 non-null   object
 2   group         5000 non-null   object
 3   landing_page  5000 non-null   object
 4   converted     5000 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 195.4+ KB


# Data Cleaning

In [None]:
#checking missing values
print('Number of missing data for each column:')
print(df.isna().sum())


Number of missing data for each column:
user_id         0
timestamp       0
group           0
landing_page    0
converted       0
dtype: int64


There are no missing values in the dataset.

In [None]:
#checking duplicated data
#keep=False flags every occurrence of a repeated user_id, not only the later ones
df[df['user_id'].duplicated(keep=False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
988,698120,2017-01-22 7:09:38,control,new_page,0
1899,773192,2017-01-09 5:37:59,treatment,new_page,0
2656,698120,2017-01-15 17:13:43,control,old_page,0
2893,773192,2017-01-14 2:56:00,treatment,new_page,0


Some `user_id` values are duplicated. Since we cannot tell which group those users actually belong to (control or treatment), we will drop those rows.

In [None]:
#drop duplicated user_id
df.drop_duplicates(['user_id'],keep=False,inplace=True)
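
To make the effect of `keep=False` concrete, here is a minimal sketch on a small made-up frame. The `toy` data and its values are hypothetical and are not taken from the dataset; the point is only that every row of a repeated `user_id` is removed, rather than keeping the first or last occurrence.

```python
import pandas as pd

# Hypothetical toy data: user 2 appears twice with conflicting assignments.
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'group': ['control', 'control', 'treatment', 'treatment'],
})

# keep=False treats every occurrence of a repeated user_id as a duplicate,
# so both rows for user 2 are dropped, not just the second one.
deduped = toy.drop_duplicates(subset=['user_id'], keep=False)

print(deduped['user_id'].tolist())
```

With the default `keep='first'`, one row for user 2 would survive, which is not what we want when the user's true group is ambiguous.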

In [None]:
#checking typos
df['group'].value_counts()

control      2533
treatment    2463
Name: group, dtype: int64

In [None]:
df['landing_page'].value_counts()

old_page    2531
new_page    2465
Name: landing_page, dtype: int64

In [None]:
df['converted'].value_counts()

0    4352
1     644
Name: converted, dtype: int64

There are no typos in our dataset.

With the data cleaned, the next step is validation. The following conditions must hold for each group:
* Condition 1: the **control** group must receive the **old** landing page only
* Condition 2: the **treatment** group must receive the **new** landing page only
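
One compact way to express both conditions is to map each group to the page it is expected to see and compare that with the page actually served. The sketch below uses a hypothetical `toy` frame and a hypothetical `expected_page` mapping; only the column names `group` and `landing_page` come from the dataset.

```python
import pandas as pd

# Hypothetical toy data covering all four group/page combinations.
toy = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page'],
})

# Page each group is supposed to receive, per Condition 1 and Condition 2.
expected_page = {'control': 'old_page', 'treatment': 'new_page'}

# A row is valid when the served page matches the page expected for its group.
is_valid = toy['landing_page'] == toy['group'].map(expected_page)

n_mismatched = int((~is_valid).sum())
valid_rows = toy[is_valid]

print('Mismatched rows:', n_mismatched)
```

The same boolean mask can be used both to count violations and to filter them out.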


In [None]:
#take a look at the conversion counts for each landing page and group combination
df.groupby(['landing_page','group'])['converted'].value_counts()

landing_page  group      converted
new_page      control    0              24
                         1               5
              treatment  0            2115
                         1             321
old_page      control    0            2188
                         1             316
              treatment  0              25
                         1               2
Name: converted, dtype: int64

The result above shows some inconsistencies:
- Some users in the control group received the new landing page
- Some users in the treatment group received the old landing page


For that reason, I will build each group from only the rows that satisfy the valid conditions.

In [None]:
#group control
control = (df['group']=='control') & (df['landing_page']=='old_page')
df_control = df[control].copy()
df_control.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:49,control,old_page,0
1,804228,2017-01-12 8:01:45,control,old_page,0
4,864975,2017-01-21 1:52:26,control,old_page,1
5,936923,2017-01-10 15:20:49,control,old_page,0
7,719014,2017-01-17 1:48:30,control,old_page,0


In [None]:
df_control.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2504 entries, 0 to 4998
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       2504 non-null   int64 
 1   timestamp     2504 non-null   object
 2   group         2504 non-null   object
 3   landing_page  2504 non-null   object
 4   converted     2504 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 117.4+ KB


There are 2504 users in the control group.

In [None]:
#group treatment
treatment = (df['group']=='treatment') & (df['landing_page']=='new_page')
df_treatment = df[treatment].copy()
df_treatment.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
2,661590,2017-01-11 16:55:06,treatment,new_page,0
3,853541,2017-01-08 18:28:03,treatment,new_page,0
6,679687,2017-01-19 3:26:47,treatment,new_page,1
8,817355,2017-01-04 17:58:09,treatment,new_page,1
9,839785,2017-01-15 18:11:07,treatment,new_page,1


In [None]:
df_treatment.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2436 entries, 2 to 4999
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_id       2436 non-null   int64 
 1   timestamp     2436 non-null   object
 2   group         2436 non-null   object
 3   landing_page  2436 non-null   object
 4   converted     2436 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 114.2+ KB


There are 2436 users in the treatment group.

# Exploratory Data Analysis (EDA)

## Control Group

In [None]:
print('The timestamp of control group range in from ', df_control['timestamp'].min(), 'to', df_control['timestamp'].max())
print('Number of converted user in control group =', (df_control['converted']).sum())
print('Number of total user in control group =', len(df_control))
print('Conversion rate in control group =', (df_control['converted'].mean()*100).round(2),'%')
print('Standard deviation in control group =', (df_control['converted'].std()).round(2))

The timestamp of control group range in from  2017-01-02 13:47:13 to 2017-01-24 9:51:06
Number of converted user in control group = 316
Number of total user in control group = 2504
Conversion rate in control group = 12.62 %
Standard deviation in control group = 0.33


## Treatment Group

In [None]:
print('The timestamp of treatment group range in from', df_treatment['timestamp'].min(), 'to', df_treatment['timestamp'].max())
print('Number of converted user in treatment group =', (df_treatment['converted']).sum())
print('Number of total user in treatment group =', len(df_treatment))
print('Conversion rate in treatment group:', (df_treatment['converted'].mean()*100).round(2),'%')
print('Standard deviation in treatment group =', (df_treatment['converted'].std()).round(2))

The timestamp of treatment group range in from 2017-01-02 13:42:41 to 2017-01-24 9:18:21
Number of converted user in treatment group = 321
Number of total user in treatment group = 2436
Conversion rate in treatment group: 13.18 %
Standard deviation in treatment group = 0.34
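
A caveat on the timestamp ranges printed above: `df.info()` reports `timestamp` as `object`, so `min()` and `max()` compare the values as text. Because the hours are not zero-padded, a time such as `9:51:06` sorts after `13:42:41`, and the reported endpoints may not be the true earliest and latest times. The sketch below illustrates this on two hypothetical timestamps written in the same style as the dataset; parsing with `pd.to_datetime` before taking the range avoids the problem.

```python
import pandas as pd

# Hypothetical timestamps with unpadded hours, in the style of the dataset.
ts = pd.Series(['2017-01-24 9:51:06', '2017-01-24 13:42:41'])

# Compared as strings, '9' sorts after '1', so 9:51 looks like the maximum.
string_max = ts.max()

# Parsed as datetimes, 13:42 is correctly identified as the latest value.
datetime_max = pd.to_datetime(ts).max()

print('max as string  :', string_max)
print('max as datetime:', datetime_max)
```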


**Insight**:
- The number of users in the two groups is fairly balanced (around 2500 users each).
- The conversion rate in the treatment group (new landing page) is slightly higher than the conversion rate in the control group (old landing page). **But is this difference statistically significant?**

To answer that question, we need to run an A/B hypothesis test.

# A/B testing (Hypothesis Testing)

We can use a Z-test for the hypothesis test, since each group has roughly 2500 samples, far more than the usual rule-of-thumb minimum of 30.

Our hypotheses are:

H0 : The conversion rate of the new landing page is **the same** as the conversion rate of the old landing page.

H1 : The conversion rate of the new landing page is **greater than** the conversion rate of the old landing page.

In [None]:
from statsmodels.stats.weightstats import ztest
(stat, pvalue) = ztest(df_treatment['converted'], df_control['converted'], alternative='larger')
print('Z-score =', stat)
print('p-value =', pvalue)

Z-score = 0.5844676291126292
p-value = 0.2794528689861486
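
As a cross-check, the test can also be computed by hand from the counts reported in the EDA section (316 of 2504 conversions in control, 321 of 2436 in treatment). This is a minimal sketch of a pooled two-proportion Z-test using only the standard library, not the exact computation `ztest` performs: `ztest` estimates the variance from the samples, so the results should agree closely (a Z-score of about 0.58) but not to every decimal place. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Counts reported in the EDA section.
conv_control, n_control = 316, 2504
conv_treatment, n_treatment = 321, 2436

p_control = conv_control / n_control
p_treatment = conv_treatment / n_treatment

# Under H0 both pages share one conversion rate, estimated by pooling the groups.
p_pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
std_error = sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treatment))

z_score = (p_treatment - p_control) / std_error

# One-sided p-value for H1: treatment rate is greater than control rate.
p_value = 1 - NormalDist().cdf(z_score)

print('Z-score =', round(z_score, 3))
print('p-value =', round(p_value, 3))
```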


**Result:**

Since the p-value (about 0.28) is greater than 0.05, we do not have enough evidence to reject the null hypothesis. In other words, the data do not show that the new landing page gives a better conversion rate than the old one (**fail to reject the null hypothesis**).

**Recommendation**

Since the test found no statistically significant improvement in the conversion rate of the new landing page, the company can keep the old one to save budget while working on further improvements to increase the conversion rate.