# Conversion Analysis on Web Design

## 1. Introduction
In this project, we analyze the results of an A/B test run by an e-commerce website. The goal is to help the company decide whether to implement the new page, keep the old page, or run the experiment longer before deciding.

## 2. Import data & data cleaning

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
df=pd.read_csv('sample ab_data.csv')
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,11:48.6,control,old_page,0
1,804228,01:45.2,control,old_page,0
2,661590,55:06.2,treatment,new_page,1
3,853541,28:03.1,treatment,new_page,1
4,864975,52:26.2,control,old_page,1


Get an overview of the dataset

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
user_id         294478 non-null int64
timestamp       294478 non-null object
group           294478 non-null object
landing_page    294478 non-null object
converted       294478 non-null int64
dtypes: int64(2), object(3)
memory usage: 11.2+ MB


Check if `group` aligns with `landing_page`

In [None]:
((df.group=='treatment') & (df.landing_page=='old_page')).sum()

1965

In [None]:
((df.group=='control') & (df.landing_page=='new_page')).sum()

1928
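
The two counts above can also be read from a single cross-tabulation. The following is a minimal sketch on a small hypothetical frame (`toy` and its rows are invented for illustration; only the column names `group` and `landing_page` come from the dataset), assuming the intended pairing is control with `old_page` and treatment with `new_page`.

```python
import pandas as pd

# Toy data (hypothetical), using the same column names as the dataset above.
toy = pd.DataFrame({
    'group': ['control', 'control', 'treatment', 'treatment', 'treatment'],
    'landing_page': ['old_page', 'new_page', 'new_page', 'old_page', 'new_page'],
})

# Rows are groups, columns are landing pages.
# The control/new_page and treatment/old_page cells count the mismatches.
counts = pd.crosstab(toy['group'], toy['landing_page'])
print(counts)

# Assumed pairing: control -> old_page, treatment -> new_page.
expected_page = toy['group'].map({'control': 'old_page', 'treatment': 'new_page'})
n_misaligned = int((toy['landing_page'] != expected_page).sum())
print(n_misaligned)
```

On the real data, the same `pd.crosstab(df['group'], df['landing_page'])` call should show both mismatch counts in one table.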

In [None]:
# Flag rows where the group and the landing page disagree, then drop them
df['misaligned']=((df.group=='treatment') & (df.landing_page=='old_page')) | ((df.group=='control') & (df.landing_page=='new_page'))
df = df[~df['misaligned']].copy()

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 290585 entries, 0 to 294477
Data columns (total 6 columns):
user_id         290585 non-null int64
timestamp       290585 non-null object
group           290585 non-null object
landing_page    290585 non-null object
converted       290585 non-null int64
misaligned      290585 non-null bool
dtypes: bool(1), int64(2), object(3)
memory usage: 13.6+ MB


Check unique users

In [None]:
df.user_id.nunique()

290584

In [None]:
df['user_id'].value_counts().sort_values(ascending=False).head()

773192    2
639032    1
663620    1
778364    1
645179    1
Name: user_id, dtype: int64

Assumption: each user should be counted only once, so for the duplicated `user_id` we keep a single row and drop the other.

In [None]:
df[df['user_id']==773192]

Unnamed: 0,user_id,timestamp,group,landing_page,converted,misaligned
1899,773192,37:58.8,treatment,new_page,1,False
2893,773192,55:59.6,treatment,new_page,0,False


In [None]:
df.drop(1899, axis = 0,inplace = True)

In [None]:
df['user_id'].value_counts().sort_values(ascending=False).head()

630836    1
639032    1
663620    1
778364    1
645179    1
Name: user_id, dtype: int64
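
Dropping a row by its index label works for a single duplicate, but it does not generalize. As a hedged alternative, the sketch below uses `duplicated` and `drop_duplicates` on a small hypothetical frame (`toy` is invented for illustration). Keeping the last row per user is an assumption that mirrors dropping row 1899 above; `keep='first'` would be an equally defensible choice.

```python
import pandas as pd

# Toy data (hypothetical): user 2 appears twice.
toy = pd.DataFrame({
    'user_id': [1, 2, 2, 3],
    'converted': [0, 1, 0, 1],
})

# Show every row that belongs to a repeated user_id.
dupes = toy[toy['user_id'].duplicated(keep=False)]
print(dupes)

# Keep the last row for each user (assumption; see the note above).
deduped = toy.drop_duplicates(subset='user_id', keep='last')
print(deduped)
```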

How many users in each group

In [None]:
df[['user_id','group']].groupby('group').count()

group,user_id
control,145274
treatment,145310


Conversion rate in each group

In [None]:
df[['user_id','group','converted']].groupby('group').agg({'user_id':'count','converted':'mean'})

group,user_id,converted
control,145274,0.120386
treatment,145310,0.125353
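
If both the number of conversions and the conversion rate are wanted in one table, a plain dict passed to `agg` cannot list `converted` twice, because a later key silently overwrites an earlier one. Named aggregation is one way around this. The sketch below runs on a small hypothetical frame (`toy` and the output column names `users`, `conversions`, `conversion_rate` are invented for illustration).

```python
import pandas as pd

# Toy data (hypothetical), using the same column names as the dataset above.
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'group': ['control', 'control', 'control', 'treatment', 'treatment', 'treatment'],
    'converted': [0, 1, 0, 1, 1, 0],
})

# Named aggregation: each output column is (input column, aggregation).
summary = toy.groupby('group').agg(
    users=('user_id', 'count'),
    conversions=('converted', 'sum'),
    conversion_rate=('converted', 'mean'),
)
print(summary)
```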


## 3. Analyze results

We test, at a 5% Type I error rate, whether the new page has a higher conversion rate than the old page. The null hypothesis is that it does not, so the hypotheses are:

**null:** $p_{new} - p_{old} \le 0$

**alternative:** $p_{new} - p_{old} > 0$

In [None]:
convert_old = df[df.group=='control'].converted.sum()
convert_new = df[df.group=='treatment'].converted.sum()
n_old = len(df[df.group=='control'].converted)
n_new= len(df[df.group=='treatment'].converted)

convert_old, convert_new, n_old, n_new

(17489, 18215, 145274, 145310)

In [None]:
conversion_dic = {'Views':{'Control':n_old,'Test':n_new},'Converts': {'Control':convert_old,'Test':convert_new}}
conversion_table = pd.DataFrame(conversion_dic)
conversion_table['Conversion %'] = conversion_table['Converts'] / conversion_table['Views']
conversion_table['Conversion %'] = conversion_table['Conversion %'].map('{:.1%}'.format)
conversion_table

Unnamed: 0,Views,Converts,Conversion %
Control,145274,17489,12.0%
Test,145310,18215,12.5%


### 3.1 Z-test implemented manually

In [None]:
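def z_test(p1,p0,n1,n0):
    """Pooled two-proportion z statistic for H0: p1 - p0 <= 0."""
    delta = p1-p0  # observed difference in conversion rates
    p = (p1*n1 + p0*n0) / (n1+n0)  # pooled conversion rate under the null
    return delta / np.sqrt(p*(1-p)*(1/n1 + 1/n0))  # difference divided by its standard error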
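
Written out, `z_test` computes the pooled two-proportion z statistic. The symbols follow the function's arguments, and $\hat{p}$ (the variable `p` in the code) is the pooled conversion rate across both groups:

$$
\hat{p} = \frac{p_1 n_1 + p_0 n_0}{n_1 + n_0},
\qquad
z = \frac{p_1 - p_0}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_0}\right)}}
$$

With the counts above, $\hat{p} = (18215 + 17489) / (145310 + 145274) \approx 0.1229$. For the one-sided alternative $p_{new} - p_{old} > 0$, the p-value is the upper-tail probability $P(Z > z)$ of a standard normal variable.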

In [None]:
p1 = convert_new / n_new
p0 = convert_old / n_old
n1 = n_new
n0 = n_old

In [None]:
z_value = z_test(p1,p0,n1,n0)
z_value

4.077481782861739

In [None]:
from scipy.stats import norm
p_value = 1- norm.cdf(z_value)

p_value

2.276304781123617e-05
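
As a side note, the upper-tail probability can also be computed with `norm.sf`, SciPy's survival function, instead of `1 - norm.cdf`. This is a minimal sketch, not a change to the analysis: the `z_value` below is copied from the output above, and both calls should agree closely at this magnitude. The survival function mainly helps when z is so large that `1 - cdf` loses precision.

```python
from scipy.stats import norm

z_value = 4.077481782861739  # z statistic reported above

# Upper-tail probability P(Z > z), computed directly as the survival function.
p_value_sf = norm.sf(z_value)

# Same quantity via the complement of the CDF.
p_value_cdf = 1 - norm.cdf(z_value)

print(p_value_sf, p_value_cdf)
```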

The p-value is less than 0.05, so we reject the null hypothesis in favor of the alternative, $p_{new} - p_{old} > 0$: the new page has a higher conversion rate than the old page.

### 3.2 Z-test with statsmodels

In [None]:
import statsmodels.api as sm

In [None]:
z_score, p_value = sm.stats.proportions_ztest([convert_new, convert_old], [n_new, n_old], alternative='larger')
z_score, p_value

(4.077481782861739, 2.276304781118429e-05)

statsmodels returns the same z-score and essentially the same p-value as the manual calculation. The p-value is again less than 0.05, so we reject the null hypothesis in favor of the alternative, $p_{new} - p_{old} > 0$: the new page has a higher conversion rate than the old page.