### Using scipy.stats to implement traditional A/B tests 

In [3]:
#importing the packages
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import chi2
from scipy.stats import chi2_contingency

In [4]:
#importing the data set
df = pd.read_csv("C:\\Users\\avra\\Downloads\\advertisement_clicks.csv", header = 0)

In [5]:
df.head()

| | advertisement_id | action |
|---|---|---|
| 0 | B | 1 |
| 1 | B | 1 |
| 2 | A | 0 |
| 3 | B | 0 |
| 4 | A | 1 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
advertisement_id    2000 non-null object
action              2000 non-null int64
dtypes: int64(1), object(1)
memory usage: 31.3+ KB


In [11]:
df.groupby("advertisement_id").count()

| advertisement_id | action |
|---|---|
| A | 1000 |
| B | 1000 |


In [12]:
df.groupby("advertisement_id").sum()

| advertisement_id | action |
|---|---|
| A | 304 |
| B | 372 |


The data set has two columns. The advertisement_id column identifies which advertisement was shown (A or B), and the action column records whether that impression led to a click (1 indicates a click and 0 indicates no click). Each ad was shown 1000 times; Ad A received 304 clicks and Ad B received 372.
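
As a quick sanity check on these counts, the click-through rate (CTR) of each ad is the mean of the 0/1 action column within each group. The sketch below rebuilds a stand-in data frame from the click totals shown above (an assumption, since the CSV itself is not reproduced here; the name demo is hypothetical). On the real data, the same groupby line can be applied to df directly.

```python
import pandas as pd

# Stand-in for df, rebuilt from the totals shown above (assumption):
# 1000 impressions per ad, 304 clicks for A and 372 clicks for B.
demo = pd.DataFrame({
    "advertisement_id": ["A"] * 1000 + ["B"] * 1000,
    "action": [1] * 304 + [0] * 696 + [1] * 372 + [0] * 628,
})

# CTR = clicks / impressions, i.e. the mean of the 0/1 action column per ad.
ctr = demo.groupby("advertisement_id")["action"].mean()
print(ctr)
```

The two rates, 304/1000 for Ad A and 372/1000 for Ad B, are the sample means that the tests below compare.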

Let us create two subsets of the data frame, one for Ad A and one for Ad B.

In [6]:
a = df[df.advertisement_id == "A"]
b = df[df.advertisement_id == "B"]

#### Test set up

The test checks which advertisement has the better click-through rate (CTR). We use a simple hypothesis test, as stated below:

                                        H0: mu_A >= mu_B
                                        H1: mu_A < mu_B

Where mu_A is the mean CTR for Ad A and mu_B is the mean CTR for Ad B.

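We will use the Student t distribution to compare the two means and test the hypothesis stated above. With equal_var = False, scipy runs Welch's t-test, which does not require the two groups to share a common variance. We treat the sample variances as good estimates of the population variances, which is reasonable with 1000 observations per group.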

In [20]:
#calculating the variance of each series
var_a = a.action.var(ddof = 1)
var_b = b.action.var(ddof = 1)

In [21]:
t,p = stats.ttest_ind(a.action, b.action, equal_var = False)
t,p

(-3.2211732138019786, 0.0012972410374001632)

#### Test result
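Since the p-value (about 0.0013) is less than 0.05, we have sufficient evidence to reject the null hypothesis. Note that stats.ttest_ind returns a two-sided p-value by default. Because the t statistic is negative (the sample mean of A is below that of B), the one-sided p-value for H1: mu_A < mu_B is half of it, about 0.00065, so the conclusion is the same.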
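
To make the mechanics of the test explicit, here is a minimal sketch that recomputes Welch's t statistic from the summary counts alone (1000 impressions per ad, 304 and 372 clicks). The variable names are illustrative and not part of the original notebook; the degrees of freedom use the Welch-Satterthwaite approximation.

```python
import math
from scipy import stats

# Summary numbers taken from the groupby output earlier in the notebook.
n_a, clicks_a = 1000, 304
n_b, clicks_b = 1000, 372

mean_a = clicks_a / n_a
mean_b = clicks_b / n_b

# Sample variance (ddof = 1) of a 0/1 column with mean m is m * (1 - m) * n / (n - 1).
var_a = mean_a * (1 - mean_a) * n_a / (n_a - 1)
var_b = mean_b * (1 - mean_b) * n_b / (n_b - 1)

# Squared standard error of each group mean.
se2_a = var_a / n_a
se2_b = var_b / n_b

# Welch's t statistic: difference in means over the combined standard error.
t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))

# Two-sided p-value (what ttest_ind reports by default) and the
# one-sided p-value for H1: mu_A < mu_B (lower tail).
p_two_sided = 2 * stats.t.sf(abs(t_stat), dof)
p_one_sided = stats.t.cdf(t_stat, dof)

print(t_stat, p_two_sided, p_one_sided)
```

Because the action column is binary, the counts fully determine the means and variances, so t_stat and p_two_sided should agree with the output of stats.ttest_ind above up to floating-point error.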

#### Testing the hypothesis with a chi-square test (no assumption about the population variance)

In [24]:
A_click = a.action.sum()
A_nclick = a.action.count() - A_click

B_click = b.action.sum()
B_nclick = b.action.count() - B_click

In [25]:
T = np.array([[A_click, A_nclick], [B_click, B_nclick]])

In [26]:
#chi2_stat avoids shadowing the chi2 distribution imported at the top
chi2_stat, p, dof, ex = stats.chi2_contingency(T, correction = False)

In [27]:
p

0.0013069502732125406

#### Test results

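Again, the p-value is below 0.05, so we can reject the null hypothesis. The chi-square test is two-sided and only tells us that the two CTRs differ; the direction comes from the observed rates (372 clicks out of 1000 for Ad B versus 304 out of 1000 for Ad A), so we conclude that Ad B has the higher CTR.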
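
For a 2x2 table without continuity correction, the chi-square statistic equals the square of the pooled two-proportion z statistic, and the z statistic keeps the sign that the chi-square test discards. The sketch below illustrates this with the counts from the notebook; the z-test is an addition for illustration, not part of the original analysis, and the variable names are hypothetical.

```python
import math
import numpy as np
from scipy import stats

# Contingency table from the counts above: rows are ads A and B,
# columns are clicks and non-clicks.
T = np.array([[304, 696], [372, 628]])

chi2_stat, p_chi2, dof, expected = stats.chi2_contingency(T, correction=False)

# Pooled two-proportion z-test on the same counts.
n_a, n_b = T.sum(axis=1)
p_a = T[0, 0] / n_a
p_b = T[1, 0] / n_b
p_pool = T[:, 0].sum() / T.sum()
z = (p_a - p_b) / math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

# One-sided p-value for H1: mu_A < mu_B (lower tail of the standard normal).
p_one_sided = stats.norm.cdf(z)

print(chi2_stat, z ** 2, p_chi2, p_one_sided)
```

Here z is negative, which reflects Ad A's lower CTR, and the one-sided p-value is half of the chi-square p-value.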

---