## A/B Testing - Advertisement

#### Business Scenario

#### Questions
* Which advertisement has the higher click-through rate?
* How do we ensure both advertisements are compared under the same conditions?
  * Run the A/B test over the same period of time
    * Group A users see advertisement A while group B users see advertisement B
  * Assign users to group A or group B at random by **user_id**, so that demographic factors such as gender, age, class, race, and education are balanced across the groups on average
* How do we tell whether the difference is statistically significant?
  * Is there any bias (e.g. bias caused by a small sample size)?
  * Statistical hypothesis testing: chi-square test of independence
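
The source does not say how users were split into groups. One common way to get a random but repeatable assignment by **user_id** is to hash the ID; the sketch below assumes that approach. The helper name `assign_variant` and the use of the test name as a salt are illustrative choices, not something taken from the dataset.

```python
import hashlib

def assign_variant(user_id: int, test_name: str = "sales_test") -> str:
    """Deterministically map a user to 'A' or 'B' (hypothetical helper)."""
    # Salt with the test name so different experiments get independent splits.
    key = f"{test_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Even hash value -> A, odd -> B, giving a roughly 50/50 split.
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same user always lands in the same group, so nobody sees both ads.
assignments = [assign_variant(uid) for uid in range(10_000)]
share_a = assignments.count("A") / len(assignments)
print(f"share of users in group A: {share_a:.3f}")
```

Because the assignment depends only on the ID and the salt, it needs no stored state and is independent of demographic attributes.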


#### Dataset Description
Date range: 2013-10-01 to 2013-10-31 (one month of data)

#### Reference
https://medium.com/swlh/how-to-run-chi-square-test-in-python-4e9f5d10249d

In [190]:
import pandas as pd
import numpy as np

df_imp = pd.read_csv('data/impression.csv')
df_imp

Unnamed: 0,log_date,app_name,test_name,test_case,user_id,transaction_id
0,2013-10-01,game-01,sales_test,B,36703,25622
1,2013-10-01,game-01,sales_test,A,44339,25623
2,2013-10-01,game-01,sales_test,B,32087,25624
3,2013-10-01,game-01,sales_test,B,10160,25625
4,2013-10-01,game-01,sales_test,B,46113,25626
...,...,...,...,...,...,...
87919,2013-10-31,game-01,sales_test,A,55838,85311
87920,2013-10-31,game-01,sales_test,B,50754,85312
87921,2013-10-31,game-01,sales_test,B,52080,85313
87922,2013-10-31,game-01,sales_test,B,57610,85314


In [191]:
df_click = pd.read_csv('data/click.csv')
df_click

Unnamed: 0,log_date,app_name,test_name,test_case,user_id,transaction_id
0,2013-10-01,game-01,sales_test,B,15021,25638
1,2013-10-01,game-01,sales_test,B,351,25704
2,2013-10-01,game-01,sales_test,B,8276,25739
3,2013-10-01,game-01,sales_test,B,1230,25742
4,2013-10-01,game-01,sales_test,B,17471,25743
...,...,...,...,...,...,...
8593,2013-10-31,game-01,sales_test,B,7238,85283
8594,2013-10-31,game-01,sales_test,B,42035,85291
8595,2013-10-31,game-01,sales_test,B,56076,85295
8596,2013-10-31,game-01,sales_test,B,52080,85313


In [192]:
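# Left-join clicks onto impressions; log_date_y is NaN when the impression had no click.
df = df_imp.merge(df_click[['transaction_id','log_date']], how='left', on='transaction_id')
df['is_null_log_date_y'] = df['log_date_y'].isna()
# 1 if the impression was clicked, 0 otherwise.
df['is_click'] = (~df['is_null_log_date_y']).astype(int)
df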

Unnamed: 0,log_date_x,app_name,test_name,test_case,user_id,transaction_id,log_date_y,is_null_log_date_y,is_click
0,2013-10-01,game-01,sales_test,B,36703,25622,,True,0
1,2013-10-01,game-01,sales_test,A,44339,25623,,True,0
2,2013-10-01,game-01,sales_test,B,32087,25624,,True,0
3,2013-10-01,game-01,sales_test,B,10160,25625,,True,0
4,2013-10-01,game-01,sales_test,B,46113,25626,,True,0
...,...,...,...,...,...,...,...,...,...
87919,2013-10-31,game-01,sales_test,A,55838,85311,,True,0
87920,2013-10-31,game-01,sales_test,B,50754,85312,,True,0
87921,2013-10-31,game-01,sales_test,B,52080,85313,2013-10-31,False,1
87922,2013-10-31,game-01,sales_test,B,57610,85314,2013-10-31,False,1
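
The left merge above yields one row per impression only if each `transaction_id` appears at most once in the click log. As a hedged alternative, a membership test with `Series.isin` never adds rows. The sketch below uses tiny made-up frames, not the real CSV files, to show the idea.

```python
import pandas as pd

# Tiny made-up frames for illustration only (not the real data).
impressions = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104],
    "test_case": ["A", "B", "A", "B"],
})
clicks = pd.DataFrame({"transaction_id": [102, 104, 104]})  # 104 is logged twice

# An impression counts as clicked if its transaction_id appears in the click log.
impressions["is_click"] = (
    impressions["transaction_id"].isin(clicks["transaction_id"]).astype(int)
)
print(impressions)
```

Here the duplicate click on transaction 104 does not duplicate the impression row, whereas a left merge would return five rows.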


In [193]:
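# Optional per-variant summary (assigned to a new name so df is not overwritten):
# summary = df.groupby('test_case').agg(no_of_impressions=('user_id', 'count'), no_of_clicks=('is_click', 'sum'), click_through_rate=('is_click', 'mean'))
# Contingency table of test_case (rows) vs. is_click (columns) for the chi-square test.
contigency = pd.crosstab(df['test_case'], df['is_click'])
contigency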

test_case,is_click=0,is_click=1
A,40592,3542
B,38734,5056
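
To answer the first question directly, the click-through rate of each variant can be read off the contingency table. The sketch below copies the four counts from the table and uses plain Python; the variable names are illustrative.

```python
# Counts copied from the contingency table: (no click, click) per variant.
counts = {"A": (40592, 3542), "B": (38734, 5056)}

ctr = {}
for variant, (no_click, click) in counts.items():
    impressions = no_click + click
    ctr[variant] = click / impressions
    print(f"{variant}: {click} clicks / {impressions} impressions = {ctr[variant]:.4f}")

# Relative lift of B over A.
lift = ctr["B"] / ctr["A"] - 1
print(f"relative lift of B over A: {lift:.1%}")
```

Advertisement B comes out at roughly 11.5% against roughly 8.0% for advertisement A; whether that gap is statistically significant is what the chi-square test checks.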


In [194]:
from scipy.stats import chi2_contingency

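# Chi-square test of independence between test_case and is_click.
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default.
c, p, dof, expected = chi2_contingency(contigency)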

print('chi2 statistics : ', c)
print('p-value : ', p)
print('degree of freedom : ', dof)
print('expected frequencies : ',expected)



chi2 statistics :  308.37505289322877
p-value :  4.934139633785632e-69
degree of freedom :  1
expected frequencies :  [[39818.18029207  4315.81970793]
 [39507.81970793  4282.18029207]]
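
As a cross-check and to make the decision explicit, the statistic can be recomputed by hand from the observed counts. The sketch below is plain Python and assumes Yates' continuity correction (the SciPy default for a 2x2 table) and a significance level of 0.05, which the notebook itself does not fix.

```python
import math

# Observed counts from the contingency table: rows A, B; columns no click, click.
observed = [[40592, 3542], [38734, 5056]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
min_expected = float("inf")
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count under independence: row total * column total / grand total.
        exp = row_totals[i] * col_totals[j] / total
        min_expected = min(min_expected, exp)
        # Yates' continuity correction: shrink |observed - expected| by 0.5.
        chi2 += (abs(obs - exp) - 0.5) ** 2 / exp

# For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

alpha = 0.05  # assumed significance level
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3g}, smallest expected count = {min_expected:.1f}")
print("reject H0: CTR depends on the variant" if p_value < alpha else "fail to reject H0")
```

The p-value is far below 0.05, so the difference in click-through rate between the two advertisements is statistically significant. The smallest expected frequency is about 4282, well above the common rule of thumb of 5, so a small sample size is not a concern for this test.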
