A/B testing compares two variants to decide whether an observed difference between them is statistically significant. This notebook uses the unpaired (independent two-sample) t-test, a parametric test of whether the means of two independent groups differ. Before running the test, two assumptions must be checked:
* Normality: The values in each group are approximately normally distributed.
* Variance homogeneity: The two groups have approximately equal variances.

The dataset contains a company's website data: user reactions to ads under two different bidding options. The Control Group sheet corresponds to maximum bidding, whereas the Test Group sheet corresponds to average bidding. The goal is to determine whether average bidding is more advantageous than maximum bidding.
The variables:
* Impression: Number of ad views
* Click: Number of clicks on the ads
* Purchase: Number of purchases after clicking the ad
* Earning: Revenue earned from the purchases

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/ab-testing-dataset/ab_testing.xlsx


In [2]:
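import matplotlib.pyplot as plt
from scipy.stats import shapiro, ttest_ind
import scipy.stats as stats
# Show every column when printing DataFrames (pandas is already imported as pd above)
pd.set_option('display.max_columns', None)
# openpyxl is the engine pd.read_excel needs for .xlsx files
!pip install openpyxl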



In [3]:
Control_Group = pd.read_excel("../input/ab-testing-dataset/ab_testing.xlsx", sheet_name='Control Group')  # maximum bidding
Control_Group.head()

Unnamed: 0,Impression,Click,Purchase,Earning
0,82529.459271,6090.077317,665.211255,2311.277143
1,98050.451926,3382.861786,315.084895,1742.806855
2,82696.023549,4167.96575,458.083738,1797.827447
3,109914.400398,4910.88224,487.090773,1696.229178
4,108457.76263,5987.655811,441.03405,1543.720179


In [4]:
Test_Group = pd.read_excel("../input/ab-testing-dataset/ab_testing.xlsx", sheet_name='Test Group') # average bidding
Test_Group.head()

Unnamed: 0,Impression,Click,Purchase,Earning
0,120103.503796,3216.547958,702.160346,1939.611243
1,134775.943363,3635.082422,834.054286,2929.40582
2,107806.620788,3057.14356,422.934258,2526.244877
3,116445.275526,4650.473911,429.033535,2281.428574
4,145082.516838,5201.387724,749.860442,2781.697521


In [5]:
groupA = Control_Group["Purchase"]
groupB = Test_Group["Purchase"]
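
Before testing any assumptions, it can help to look at descriptive statistics for the two series side by side. The sketch below is only an illustration: `group_a` and `group_b` are hypothetical stand-ins built from the five `Purchase` values printed by `head()` above (rounded), not the full columns, so the numbers it prints say nothing about the real comparison.

```python
import pandas as pd

# Stand-ins for Control_Group["Purchase"] and Test_Group["Purchase"]:
# only the first five rows shown by head(), rounded to one decimal.
group_a = pd.Series([665.2, 315.1, 458.1, 487.1, 441.0], name="Purchase")
group_b = pd.Series([702.2, 834.1, 422.9, 429.0, 749.9], name="Purchase")

# describe() returns count, mean, std, min, quartiles and max for each series.
summary = pd.DataFrame({
    "control": group_a.describe(),
    "test": group_b.describe(),
})
print(summary.round(2))

# Raw difference in sample means; the t-test later asks whether a gap like
# this is larger than what sampling noise alone would produce.
mean_diff = group_b.mean() - group_a.mean()
print(f"Difference in sample means (test - control): {mean_diff:.2f}")
```

In the notebook itself, the same two calls can be applied directly to `groupA` and `groupB`.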

# Testing of Assumptions

In [6]:
# Normality test (Shapiro-Wilk). H0: the sample comes from a normal distribution. H1: it does not.

test_stat, pvalue = shapiro(groupA)
print('Test stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
# p-value > 0.05, so H0 cannot be rejected: there is no evidence against normality in the Control Group.

test_stat, pvalue = shapiro(groupB)
print('Test stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
# p-value > 0.05, so H0 cannot be rejected: there is no evidence against normality in the Test Group.

Test stat = 0.9773, p-value = 0.5891
Test stat = 0.9589, p-value = 0.1541


In [7]:
# Variance homogeneity (Levene's test). H0: the two groups have equal variances. H1: the variances differ.

test_stat, pvalue = stats.levene(groupA, groupB)
print('Test stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
# p-value > 0.05, so H0 cannot be rejected: there is no evidence that the variances differ.


Test stat = 2.6393, p-value = 0.1083
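
The two assumption checks can also be combined into one helper that picks a test based on their outcome. The sketch below is a common convention, not something this notebook prescribes: `choose_two_sample_test` is a hypothetical name, the fallbacks (Mann-Whitney U when normality is doubtful, Welch's t-test when variances look unequal) are assumptions about what one would do if a check failed, and the samples are synthetic, not the real `Purchase` data.

```python
import numpy as np
from scipy import stats


def choose_two_sample_test(a, b, alpha=0.05):
    """Pick a two-sample test from the assumption checks (hypothetical helper)."""
    # Shapiro-Wilk on each group: a small p-value is evidence against normality.
    looks_normal = all(stats.shapiro(x).pvalue > alpha for x in (a, b))
    if not looks_normal:
        # Non-parametric fallback when normality is doubtful.
        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "mann-whitney-u", result.statistic, result.pvalue

    # Levene's test: a small p-value is evidence against equal variances.
    equal_var = stats.levene(a, b).pvalue > alpha
    # equal_var=False switches ttest_ind to Welch's t-test.
    result = stats.ttest_ind(a, b, equal_var=equal_var)
    name = "student-t" if equal_var else "welch-t"
    return name, result.statistic, result.pvalue


# Synthetic samples for illustration only; sizes and parameters are arbitrary.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=550, scale=130, size=40)
sample_b = rng.normal(loc=580, scale=160, size=40)

test_name, stat, p = choose_two_sample_test(sample_a, sample_b)
print(f"{test_name}: stat = {stat:.4f}, p-value = {p:.4f}")
```

With the notebook's data, both checks passed, so a helper like this would select the same equal-variance t-test that is applied next.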


# Applying the hypothesis test

In [8]:
# Neither assumption was rejected, so the unpaired t-test can be applied to the two groups.
# H0: The mean number of purchases is the same in the Control Group and the Test Group.
# H1: The mean number of purchases differs between the Control Group and the Test Group.

res, pvalue = ttest_ind(groupA, groupB, equal_var=True)
print('Test Stat = %.4f, p-value = %.4f' % (res, pvalue))

# p-value > 0.05, so H0 cannot be rejected. The data show no statistically significant difference in mean purchases between the Control Group (maximum bidding) and the Test Group (average bidding).


Test Stat = -0.9416, p-value = 0.3493
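
A p-value alone does not say how large the difference might be. One possible follow-up, sketched below, is a confidence interval for the difference in means together with Cohen's d as an effect size. `mean_diff_ci` is a hypothetical helper, the pooled-variance formula assumes equal variances (matching `equal_var=True`), and the samples are synthetic, so the printed numbers do not describe the real groups.

```python
import numpy as np
from scipy import stats


def mean_diff_ci(a, b, confidence=0.95):
    """Pooled-variance CI for mean(a) - mean(b), plus Cohen's d (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    dof = n_a + n_b - 2

    # Pooled variance, consistent with the equal-variance Student's t-test.
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof
    diff = a.mean() - b.mean()
    std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

    # Two-sided critical value of the t distribution.
    t_crit = stats.t.ppf(0.5 + confidence / 2, dof)
    cohens_d = diff / np.sqrt(pooled_var)
    return diff, (diff - t_crit * std_err, diff + t_crit * std_err), cohens_d


# Synthetic samples for illustration only; sizes and parameters are arbitrary.
rng = np.random.default_rng(42)
sample_a = rng.normal(loc=550, scale=130, size=40)
sample_b = rng.normal(loc=580, scale=130, size=40)

diff, (low, high), d = mean_diff_ci(sample_a, sample_b)
print(f"mean difference = {diff:.2f}, 95% CI = [{low:.2f}, {high:.2f}], Cohen's d = {d:.3f}")
```

A 95% interval that contains zero corresponds to a two-sided p-value above 0.05 in the equal-variance t-test, so the interval and the test always agree on significance while the interval also shows the plausible range of the difference.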
