The biggest e-commerce company, <b>FaceZonGoogAppFlix</b>, has approached us (a data science consulting firm) as a new client!

They have designed a potential new webpage with the intention of increasing their current conversion rate of 12% by 0.35 percentage points or more. With such an ambiguous task, they have full trust in us to recommend whether to implement the new page or keep the old one. Unfortunately, they haven't built up a data science capability in their company, but they've run an external tool called 'A/B Tester' for 23 days and come back to us with a dataset. What do we do?

This notebook serves as a primer on A/B testing and goes through as many of its foundations as I could find online. In my opinion, A/B testing is rarely explored in enough depth, so hopefully this notebook fills that void. Throughout, I've cited references and consolidated pieces of their methodologies here.

    This notebook assumes only a surface-level understanding of A/B Testing - a statistical way to compare two or more versions (A or B?) to determine not only which one performs better but also whether the difference between them is statistically significant (Data Science Dojo, 2018).
    
    Furthermore, the current conversion rate of 12% is an assumption based on the control data, and the target increase of 0.35% is intentionally chosen as the smallest practically relevant value, which in turn yields the largest minimum Sample Size required. More on these later.

In [1]:
#first, load packages:
import pandas as pd
import numpy as np
import scipy.stats as ss
import math as mt
import itertools

In [2]:
#next, let's have a look at our dataset:
data = pd.read_csv('../input/ab-testing/ab_data.csv')
df = data.copy()
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


In [3]:
df.info()
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294478 entries, 0 to 294477
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294478 non-null  int64 
 1   timestamp     294478 non-null  object
 2   group         294478 non-null  object
 3   landing_page  294478 non-null  object
 4   converted     294478 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB
(294478, 5)


<u><b>Data Cleansing:</b></u>

In [4]:
#we want to analyse data of unique user's conversion rates given their unique landing page:
df['user_id'].nunique()
#290584 unique users vs 294478 rows - why the discrepancy?

290584

In [5]:
#Locate where treatment does not match with new_page or control does not match with old_page, and drop these rows
i = df[((df['group']=='treatment') ==(df['landing_page']=='new_page')) == False].index
df2 = df.drop(i)

In [6]:
# The number of unique users was returned above - how many rows do we have now?
df2.shape[0]

290585

In [7]:
#If the number of unique rows is 1 greater than the number of unique users, then we have a duplicate user somewhere. We'll find the duplicate row first:
df2[df2.duplicated(['user_id'], keep=False)]

Unnamed: 0,user_id,timestamp,group,landing_page,converted
1899,773192,2017-01-09 05:37:58.781806,treatment,new_page,0
2893,773192,2017-01-14 02:55:59.590927,treatment,new_page,0


In [8]:
#drop duplicate:
df2.drop_duplicates(subset ='user_id',keep ='first',inplace = True)

<b><u>Probabilities:</u></b>

In [9]:
#total/pooled probability of conversion:
P_pool = (df2.query('converted == 1').converted.count())/df2.shape[0]
P_pool

0.11959708724499628

In [10]:
#probability of conversion given a user was in the control group:
control_df = df2.query('group =="control"')
P_old = control_df['converted'].mean()
P_old

0.1203863045004612

In [11]:
#probability of conversion given a user was in the treatment group:
treatment_df = df2.query('group =="treatment"')
P_new = treatment_df['converted'].mean()
P_new

0.11880806551510564

We can quickly observe that our new page isn't doing too hot on conversion improvement. In fact, it's doing slightly worse than the control group!

Hypothetically we could end the test here, but there are a few checks and measures we must do before and after an A/B test to ensure our experiment was run properly.

<u><b>Metrics</b></u>

This is something I rarely see discussed in the A/B tests found on Kaggle, but it is a crucial talking point.

The purpose of Metrics (both Invariant and Evaluation Metrics) is to ensure that we didn't screw up at any point in the experiment's lifecycle. They help verify that the experiment was conducted as expected and that other factors did not influence the data which we collected.

Imagine for this particular A/B experiment, we propose to divide users exactly in half into the old and new page (50/50) BEFORE we even begin. Now imagine we don't bother checking this Metric during or after the experiment, and we conclude that the new page is better, whoopee. But what if we checked the proportion AFTER the test and it was 33/67? Maybe the A/B test software had a bug? Maybe the code to split the traffic had a bug? Either way, if this split drastically changes throughout our experiment, our data is inherently wrong.

There are two types of Metrics: 

1. Invariant (or Guardrail Metric, which seems to be the most updated term (Kohavi, Tang & Xu, 2020)) - similar to the example above, we expect Guardrail Metrics to not change.

2. Evaluation Metric - these Metrics we DO expect to change due to the nature of A/B testing. For each of these evaluation metrics, they should be related to the business objectives. 

There are many Metrics in a real A/B test we should always have our eyes on (the way we present change to a part of the population, the way we collect data, seasonality and timing of the experiment, daily active users, Click Through Probability, etc.)

One Guardrail Metric we can test given our dataset is the <b>number of unique users in each page group</b>. The experiment was designed to keep a 50/50 split, so let's see if this holds true post-A/B testing.

One Evaluation Metric we want to define will be the <b>increase in conversions</b>. For this metric, we want to define $D_{min}$: the minimum change which is practically significant to the business. In this case, the practical minimum difference would be the 0.35%, closely related to our business objectives of increasing our current conversion rates of 12% by 0.35%. Therefore, we'll set $D_{min}$ = 0.0035. The 0.0035 is also usually a sign that our clients are a big e-commerce company and such an increase in conversion would have a huge impact on their top-line (Yu, 2020).

<u><b>Guardrail Metric Check:</b></u>

As stated above, we want to verify our Guardrail Metrics before diving deeper. We want to count the proportion of users in each group and see if there is a significant difference in the proportion amounts. If a statistically significant difference is present, it implies a biased experiment and our results should not be relied on.

How we test this is through the use of Confidence Intervals (CI). If one were to google the formula for the CI of a sample mean, we would get:

$CI =
 \bar{x} \ \pm \ z_{1-\alpha/2} \ \frac{s}{\sqrt{n}}$

- $CI$ = Confidence Interval
- $\bar{x}$ = Sample Mean
- $z_{1-\alpha/2}$ = critical value for the chosen Confidence Level (for simplicity, I'll write z to denote this)
- $s$ = sample standard deviation (notice how we don't use σ, since σ represents a known standard deviation; strictly, an unknown σ estimated by s calls for the t-distribution, which is practically identical to the normal for samples this large)
- $n$ = sample size.

We can adjust this formula using the Central Limit Theorem where a large enough summed Binomial Distribution will converge towards the Normal distribution, or $X \sim N(np,\sqrt{np(1-p)} \ )$ given our sample size is large enough (typically ≥ 30). 

This means that for our sampling proportion $\hat{p} = \frac{X}{n}$: our distribution will converge towards $N( \hat{p},\sqrt{ \frac{ \hat{p} (1 - \hat{p})}{n}} )$.

Therefore, our Binomial Confidence Intervals are calculated by:

$CI =
 \hat{p} \ \pm \ z \ \sqrt{ \frac{ \hat{p} (1 - \hat{p})}{n}}$

We will now use the above to calculate our Confidence Interval for p̂ and see if we've kept the approximate 50/50 split:



In [12]:
#proportion of users seeing the new vs old page:
N_new = df2.query('landing_page == "new_page"').landing_page.count()
N_old = df2.query('landing_page == "old_page"').landing_page.count()
proportion = (N_old/df2.shape[0],N_new/df2.shape[0])
proportion

(0.4999380557773312, 0.5000619442226688)

In [13]:
#function for getting z-scores for alpha. For our experiment where alpha = 5%, keep in mind we want to input 1-alpha/2 for two-sided Confidence Intervals.
def get_z_score(alpha):
    return ss.norm.ppf(alpha)

In [14]:
#Guardrail Check on differences in proportions:
sd = round(mt.sqrt((0.5*(1-0.5))/df2.shape[0]),4)
CI = (0.5 - sd*get_z_score(1-0.05/2), 0.5 + sd*get_z_score(1-0.05/2))
print('Does the control group proportion ' + str(N_old/df2.shape[0]) + ' lie within ' + str(CI) + '?')

Does the control group proportion 0.4999380557773312 lie within (0.49823603241391395, 0.5017639675860861)?


We've calculated a 95% confidence interval around the expected proportion of 0.5, and the observed control proportion p̂ ≈ 0.4999 lies within it. This means that we've passed our Guardrail Metric check: the number of unique users in each group is consistent with a 50/50 split.
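As a cross-check, the same guardrail interval can be recomputed without rounding the standard deviation to 4 decimals. This is only a sketch using the standard library; the helper name `proportion_ci` is hypothetical, and the two input numbers are copied from the outputs printed above.

```python
import math
from statistics import NormalDist

def proportion_ci(p, n, alpha=0.05):
    """Normal-approximation CI for a proportion p over n users (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

n_total = 290584                       # unique users after cleaning (from the notebook)
observed_control = 0.4999380557773312  # control share printed in the guardrail cell

lower, upper = proportion_ci(0.5, n_total)
print(lower, upper, lower <= observed_control <= upper)
```

The unrounded interval is slightly wider than the one printed above, so the conclusion of the guardrail check does not change.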

<u><b>Sample Size</b></u>

Having a required Sample Size is one of the important cornerstones of a successful A/B Test and is dependent on 3 factors (Saleh, 2018):

1. Power of the test (usually $1 - \beta = 0.8$): Probability of detecting a difference between the conversion rates, i.e. rejecting the Null Hypothesis when it is False. Its complement $\beta$ is the probability of failing to reject a False Null Hypothesis (Type II Error).

2. Significance level of the test ($\alpha = 0.05$) - Probability that we reject the Null Hypothesis while it should NOT be rejected (Type I Error).

3. Minimal Desired Effect (MDE, or $D_{min}$) - The desired relevant difference between the rates you would like to discover - what is the minimum improvement for a test to be worthwhile?

As a discussion point, A/B Tests can begin to go bad when a Sample Size is not calculated before the test, or when we 'peek' at the results while the number of observations is still below the minimum Sample Size and conclude prematurely because we think we've achieved a positive result. Usually, if an A/B Test is running and there aren't enough users to reach the required Sample Size, the 3 factors would have to be adjusted to lower the minimum Sample Size.
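To make the 'peeking' problem concrete, here is a small A/A simulation sketch (both groups share the same true conversion rate, so every rejection is a false positive). All names and parameters (`simulate`, 300 experiments, 2,000 users per group, 10 looks) are illustrative assumptions, not values from the dataset.

```python
import math
import random
from statistics import NormalDist

def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two groups of equal size n."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se

def simulate(n_experiments=300, n_per_group=2000, looks=10, p=0.12, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate with one final look)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_group // looks
    peeking_hits = fixed_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        rejected_at_some_look = False
        for i in range(1, n_per_group + 1):
            conv_a += rng.random() < p
            conv_b += rng.random() < p
            # "peek" every `step` users and stop caring once any look is significant
            if i % step == 0 and abs(z_stat(conv_a, conv_b, i)) > z_crit:
                rejected_at_some_look = True
        peeking_hits += rejected_at_some_look
        fixed_hits += abs(z_stat(conv_a, conv_b, n_per_group)) > z_crit
    return peeking_hits / n_experiments, fixed_hits / n_experiments

peeking_rate, fixed_rate = simulate()
print(peeking_rate, fixed_rate)
```

Because the final look is one of the peeks, the peeking rate can never be lower than the fixed-sample rate; in practice it is usually well above the nominal alpha.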

This online calculator does a great job of calculating Sample Size: https://www.evanmiller.org/ab-testing/sample-size.html

However, we're going to manually calculate the required Sample Size using two different methods:

1. The statistical rule of thumb:

$n ≈
 \frac{16 \sigma^{2}}{\delta^{2}}$ (van Belle, 2008)

2. Evan Miller's Calculator above, but we'll use a formula that is the foundation of Evan Miller's Calculator (zthomas.nc, 2019)
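For intuition on where the 16 in the rule of thumb comes from: with a two-sided $\alpha = 0.05$ and power of 0.8, the factor $2(z_{1-\alpha/2} + z_{1-\beta})^2$ is roughly 15.7, which is rounded up to 16. The sketch below checks this with the standard library, taking $\sigma^2 = p(1-p)$ as in the notebook; the variable names are my own.

```python
from statistics import NormalDist

alpha, power = 0.05, 0.80
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
z_beta = NormalDist().inv_cdf(power)           # about 0.84

constant = 2 * (z_alpha + z_beta) ** 2         # the factor that the rule rounds to 16
p, d = 0.12, 0.0035                            # baseline rate and D_min from the notebook
n_unrounded = constant * p * (1 - p) / d ** 2  # per-group size with the unrounded factor
n_rule = 16 * p * (1 - p) / d ** 2             # per-group size with the rule of thumb

print(round(constant, 2), round(n_unrounded), round(n_rule))
```

Rounding the factor up to 16 makes the rule of thumb slightly conservative, which is why it asks for a few thousand more users than the more exact calculation.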


In [15]:
#1. Using statistical rule of thumb to calculate minimum sample size per variation:
16*(0.12*(1-0.12))/pow(0.0035,2)

137926.53061224488

    If using the statistical rule of thumb, the minimum sample size per variation = 137,926/variation. Given we have 2 groups (treatment and control): the total minimum sample size = 275,852. Since we have a total sample size of 290,584, our a/b test will have enough statistical power and significance.

In [16]:
#calculating the minimum sample size for the ab test:
def get_sampSize(sds,alpha,beta,d):
    n=pow((get_z_score(1-alpha/2)*sds[0]+get_z_score(1-beta)*sds[1]),2)/pow(d,2)
    return n

In [17]:
#baseline + expected change standard deviation calculations
def get_sds(p,d):
    sd1=mt.sqrt(2*p*(1-p))
    sd2=mt.sqrt(p*(1-p)+(p+d)*(1-(p+d)))
    sds=[sd1,sd2]
    return sds

In [18]:
#2. Using Evan Miller's Calculator but deriving the values ourselves:
round(get_sampSize(get_sds(0.12, 0.0035),0.05,0.2,0.0035))

135830

    If using Evan Miller's calculator, the minimum sample size per group = 135,830 users/variation. Given we have 2 groups (treatment and control): the total minimum sample size = 271,660 users. Since we have a total sample size of 290,584, our a/b test will have enough statistical power and significance.
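The sample size formula can also be inverted to ask how much power the test has at the sample size we actually collected. This is a sketch under the same normal approximation as `get_sampSize`; `achieved_power` is a hypothetical helper, and the group size of 145,274 is inferred from the control proportion printed in the guardrail check.

```python
import math
from statistics import NormalDist

def achieved_power(n, p, d, alpha=0.05):
    """Power of a two-sided test to detect an absolute lift d with n users per group."""
    sd_null = math.sqrt(2 * p * (1 - p))                       # same as sd1 in get_sds
    sd_alt = math.sqrt(p * (1 - p) + (p + d) * (1 - (p + d)))  # same as sd2 in get_sds
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf((d * math.sqrt(n) - z_alpha * sd_null) / sd_alt)

power_at_minimum = achieved_power(135830, 0.12, 0.0035)  # should recover roughly 0.80
power_at_actual = achieved_power(145274, 0.12, 0.0035)   # smaller of our two groups
print(round(power_at_minimum, 3), round(power_at_actual, 3))
```

Plugging the minimum sample size back in recovers the 80% power we asked for, and the larger actual sample gives a little more than that.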

Now, it's time to view various methods to conclude if $P_{new} = P_{old}$ or $P_{new} > P_{old}$:

<b><u>1. Binomial Proportion Confidence Intervals:</u></b>

This method is quoted as the 'most common' method for A/B testing, where we find Confidence Intervals (CI) for both $P_{new}$ and $P_{old}$. If we construct similar intervals for both and compare them, we will end up in one of two scenarios (jkndrkn, 2012):

1. The Intervals do not overlap: This implies that we can say with some level of confidence that one is better than the other, therefore providing enough evidence to reject the Null Hypothesis. This level of confidence seems to be $\approx 1 - e \, \alpha^{1.91}$ (Lan, 2011). So if there is no overlap and the 95% CIs are the same size, the difference is significant at the 99.5% level.

2. The Intervals do overlap: Then it is either a sign that our test does not have enough statistical power, or we do not have enough evidence to reject the Null Hypothesis that $P_{new} = P_{old}$.

There is a relationship between CI comparisons and hypothesis tests - given that the sample sizes are not too different and the two sets have similar standard deviations. This method wouldn't derive highly precise p-values from comparing two CIs, but there is a good write up that tries to quantify this (Lan, 2011).

Finding the 'true' conversion rate of a particular group is usually impossible or difficult, but we can use our calculated $P_{new}$ and $P_{old}$ as point estimates to find the Confidence Intervals for the 'true' $P_{new}$ and $P_{old}$. A good explanation for beginners about using point estimates to calculate a CI for the 'true' conversion rate can be found in the Appendix of <i>"A/B Testing: The Most Powerful Way to Turn Clicks into Customers"</i> (Siroker & Koomen, 2013)

In [19]:
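#note: get_z_score(1-0.025/2) is about 2.24, so these are 97.5% intervals; pass 1-0.05/2 (about 1.96) instead for 95% intervals
CI_old = (P_old - get_z_score(1-0.025/2)*mt.sqrt(P_old*(1-P_old)/N_old),P_old + get_z_score(1-0.025/2)*mt.sqrt(P_old*(1-P_old)/N_old))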
CI_new = (P_new - get_z_score(1-0.025/2)*mt.sqrt(P_new*(1-P_new)/N_new),P_new + get_z_score(1-0.025/2)*mt.sqrt(P_new*(1-P_new)/N_new))
print('Do ' + str(CI_old) + ' and ' + str(CI_new) + ' overlap?') 

Do (0.11847266343679363, 0.12229994556412876) and (0.11690554055275011, 0.12071059047746117) overlap?


The two intervals overlap substantially (from roughly 0.1185 to 0.1207), which means we do not reject the Null Hypothesis that $P_{new} = P_{old}$.

    This means that the new page is not better than the old page.

While the overlap in our case is clear, a slight overlap could tempt us to draw the same conclusion and not reject the Null Hypothesis. However, this is a common misinterpretation of overlapping CIs when comparing groups: two intervals can overlap slightly while the difference between the groups is still statistically significant. Treating any overlap as proof of no difference could result in incorrect or misleading conclusions being drawn (Tan & Tan, 2010, p. 278).
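The caveat about slight overlaps can be illustrated with a made-up example: two groups of 10,000 users converting at 12% and 13%. These numbers are hypothetical and chosen only to show that 95% intervals can overlap while a pooled z-test still rejects at the 5% level.

```python
import math
from statistics import NormalDist

def ci_95(p, n):
    """Normal-approximation 95% CI for a single proportion."""
    half = NormalDist().inv_cdf(0.975) * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

n = 10_000                 # hypothetical users per group
p_a, p_b = 0.12, 0.13      # hypothetical conversion rates

ci_a, ci_b = ci_95(p_a, n), ci_95(p_b, n)
intervals_overlap = ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0]

# pooled two-proportion z-test on the same data (two-sided)
p_pool = (p_a + p_b) / 2
se_pool = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p_b - p_a) / se_pool
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(intervals_overlap, round(z, 2), round(p_value, 4))
```

Here the intervals overlap, yet the two-sided p-value is below 0.05, so the overlap check alone would have missed a significant difference.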

<u><b>2. Z-test:</b></u>

We can use existing packages to calculate our test statistic and p-values and test for proportions based on the z-test. This is similar to the Binomial Proportion Confidence Interval Test, but it is easier to draw conclusions from because it returns a p-value:

In [20]:
import statsmodels.api as sm
#returning the total number of conversions for each group:
convert_old = df2.query("landing_page == 'old_page' and converted == 1").shape[0]
convert_new = df2.query("landing_page == 'new_page' and converted == 1").shape[0]

In [21]:
#calculating the z-score + p-value using the z-test (one-sided):
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [N_old, N_new], alternative='smaller')
z_score, p_value

(1.3109241984234394, 0.9050583127590245)

Given our p-value ≈ 0.9 > 0.05, we do not reject the Null Hypothesis.

    This means that the new page is not better than the old page.
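For readers who want to see what the package is doing, the same statistic can be reproduced by hand with the pooled standard error. This is a sketch, assuming the counts below: the conversion counts are the ones printed later in the contingency table, and the group sizes are inferred from the proportions printed in the guardrail check.

```python
import math
from statistics import NormalDist

# counts taken or inferred from the notebook's printed outputs (assumptions)
convert_old, n_old = 17489, 145274
convert_new, n_new = 17264, 145310

p_old = convert_old / n_old
p_new = convert_new / n_new
p_pool = (convert_old + convert_new) / (n_old + n_new)
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))

z = (p_old - p_new) / se_pool
# alternative='smaller' tests p_old < p_new, so the p-value is the lower tail P(Z <= z)
p_value = NormalDist().cdf(z)
print(z, p_value)
```

The values agree with the output of `proportions_ztest` above, which confirms that the one-sided alternative being tested is "the old page converts less than the new page".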

<b><u>3. Hypothesis testing on d̂ and Effect Size:</u></b>

A couple of methodologies are found in Udacity's A/B test course, where we also consider the pooled probability and standard deviations, under the assumption that the variances within each sample are equal (Shetty, 2016). The reason we do this is so we can do a z-test under the context of our Evaluation Metric $D_{min}$, and observe if our difference is practically significant to the business:


$\hat{p}_{pool} =
 \frac{X_{new} + X_{old}}{N_{new} + N_{old}}$


$SD_{pool} = 
\sqrt{\hat{p}_{pool}(1 - \hat{p}_{pool})(\frac{1}{N_{new}} + \frac{1}{N_{old}})}$


All under the null hypothesis of $\hat{d} = P_{new} - P_{old}$ where $\hat{d} \sim N(0,SD_{pool})$.

What we will perform is the same CI calculation we've done above, but using $SD_{pool}$ as our standard deviation:

$CI_{diff} =
 \hat{d} \ \pm \ z \ SD_{pool}$

After $CI_{diff}$ is calculated, the change is statistically significant if 0 lies outside the $CI_{diff}$ (equivalent to the above z-test in terms of Hypothesis Testing). 

Additionally, we can use $CI_{diff}$ to judge practical significance against our Evaluation Metric threshold $D_{min}$: the change is practically significant when the whole interval lies above $D_{min}$ (that is, $D_{min}$ is below the lower bound of $CI_{diff}$), and it is not when $D_{min}$ lies above the interval. However, there is no statistical test that can truly tell you whether the effect is large enough to be important, so some level of subject area knowledge and expertise must be applied to judge whether the effect is big enough to be meaningful (Frost, 2018).
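The two readings of $CI_{diff}$ described above can be written down as a small decision helper. This is only a sketch: `interpret_ci` and its three-way labelling (including the "inconclusive" case when $D_{min}$ falls inside the interval) are my own assumptions, not a standard API.

```python
def interpret_ci(ci, d_min):
    """Classify a CI for the difference against 0 and against the practical threshold d_min."""
    lower, upper = ci
    statistically_significant = not (lower <= 0 <= upper)
    if lower >= d_min:
        practical = "practically significant"
    elif upper < d_min:
        practical = "not practically significant"
    else:
        practical = "inconclusive"  # d_min falls inside the interval
    return statistically_significant, practical

# hypothetical interval that sits entirely above a threshold of 0.0035
print(interpret_ci((0.004, 0.009), 0.0035))
# interval that contains 0 and sits entirely below the threshold
print(interpret_ci((-0.0039, 0.0008), 0.0035))
```

The second call uses the rounded interval that this notebook obtains for its own data, where 0 is inside the interval and $D_{min}$ is above it.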

<u>F-test:</u>

First, we must test if our variances between our treatment and control are significantly different. Therefore, we'll be using the F-test in Scipy Stats. Note that the F-Test is sensitive to non-normalities of groups, but our large enough sample size allows the Central Limit Theorem to take effect.


In [22]:
SE_new = mt.sqrt(P_new*(1-P_new)/N_new)
SE_old = mt.sqrt(P_old*(1-P_old)/N_old)

Using the F-test (under the Null Hypothesis that the variances are equal) to determine if the variances within each sample are equal, noting that the p-value = 1 - CDF (Zach, 2020):


In [23]:
p = 1 - ss.f.cdf(pow(SE_new,2)/pow(SE_old,2), N_new - 1, N_old - 1)
p

0.986811599380792

Since the p-value is almost 1, we do not reject the Null Hypothesis that the variances are the same. Now that we've verified that our variances within each sample are equal, we'll continue calculating $CI_{diff}$ :

In [24]:
#Calculating pooled standard deviation and pooled CI:
d_hat = P_new - P_old
SE_pool = mt.sqrt(P_pool*(1-P_pool)*(1/N_old+1/N_new))
CI_diff = (d_hat - get_z_score(1-0.05/2)*SE_pool, d_hat + get_z_score(1-0.05/2)*SE_pool)
print('The change due to the experiment is ' + str(round(d_hat,4)*100) + '%')
print('Is 0 inside the interval of (' + str(round(CI_diff[0],4)) + ', ' + str(round(CI_diff[1],4)) +')?')
print('Is D_min = 0.0035 inside the interval of (' + str(round(CI_diff[0],4)) + ', ' + str(round(CI_diff[1],4)) + ')?')  

The change due to the experiment is -0.16%
Is 0 inside the interval of (-0.0039, 0.0008)?
Is D_min = 0.0035 inside the interval of (-0.0039, 0.0008)?


Because 0 falls within our $CI_{diff}$ interval, the change due to the experiment is not statistically significant, therefore we do not reject the Null Hypothesis that $\hat{d} = P_{new} - P_{old}$ where $\hat{d} \sim N(0,SD_{pool})$. Furthermore, given our one-sided test where $D_{min}$ is above our confidence intervals, it is also not practically significant.

    This means that the new page is not better than the old page.


<b><u>4. Sign Test:</u></b>

The last test to consider when picking a winner in an A/B test is a sign test - checking the trend of change we observe by day. If we check on how many days the conversion was higher in the Treatment vs the Control, we can use this as our number of successes for our binomial distribution. The null hypothesis of the sign test is that the population median is equal to some value $M$ - implying the number of observations greater than $M$ (denoted by $r^{+}$) is equal to the number of observations less than $M$ (denoted by $r^{-}$).

Therefore, the Null Hypothesis is: $r^{+}, r^{-} \sim Bin(n,p)$ where $p = \frac{1}{2}$ (Shier, 2004).

Note that we are only concerned with whether the number of days (observations) on which the conversion was higher for Treatment exceeds the number of days on which it was higher for Control. Therefore, this will be a one-sided test and we'll divide the two-sided p-value by 2.


In [25]:
#building out another dataframe that will be useful in a sign test (.copy() avoids pandas' SettingWithCopyWarning when we modify it below):
dfsign = df2[['timestamp','group','converted']].copy()

In [26]:
#timestamp is a concatenation of date + exact time of user landing on the page. We only want the date, or 1st 10 characters:
dfsign['timestamp'] = dfsign['timestamp'].astype(str).str[:10]


In [27]:
dfsign.head()

Unnamed: 0,timestamp,group,converted
0,2017-01-21,control,0
1,2017-01-12,control,0
2,2017-01-11,treatment,0
3,2017-01-08,treatment,0
4,2017-01-21,control,1


In [28]:
#grouping by timestamp and group will sum all the conversions for a group on a given day:
dfsign2 = dfsign.groupby(['timestamp','group']).sum().reset_index()
dfsign2.head(20)

Unnamed: 0,timestamp,group,converted
0,2017-01-02,control,359
1,2017-01-02,treatment,342
2,2017-01-03,control,750
3,2017-01-03,treatment,753
4,2017-01-04,control,802
5,2017-01-04,treatment,763
6,2017-01-05,control,792
7,2017-01-05,treatment,748
8,2017-01-06,control,762
9,2017-01-06,treatment,833


In [29]:
#pivot the table, keeping the timestamps along the rows, the groups now in the columns and the sum of conversions as our values:
dfsign3 = dfsign2.pivot(index=['timestamp'],columns=['group'],values=['converted']).reset_index()
dfsign3.head(6)

Unnamed: 0_level_0,timestamp,converted,converted
group,Unnamed: 1_level_1,control,treatment
0,2017-01-02,359,342
1,2017-01-03,750,753
2,2017-01-04,802,763
3,2017-01-05,792,748
4,2017-01-06,762,833
5,2017-01-07,799,768


In [30]:
#building out a conditional column - we want it to be 1 when the treatment did better than the control, else 0:
dfsign3['sign'] = np.where(dfsign3[('converted','treatment')] > dfsign3[('converted','control')], 1,0)
dfsign3.head()

Unnamed: 0_level_0,timestamp,converted,converted,sign
group,Unnamed: 1_level_1,control,treatment,Unnamed: 4_level_1
0,2017-01-02,359,342,0
1,2017-01-03,750,753,1
2,2017-01-04,802,763,0
3,2017-01-05,792,748,0
4,2017-01-06,762,833,1


In [31]:
#find the total number of times that our daily treatment conversion was greater than our daily control conversion
successes = dfsign3[('sign','')].sum()
p_sign = successes/dfsign3.shape[0]
print('We have ' + str(successes) + ' out of ' + str(dfsign3.shape[0]) + ' days where the total daily treatment conversions was greater than its control counterpart. This brings our probability of success to be ' + str(round(p_sign,4)) + '.')

We have 10 out of 23 days where the total daily treatment conversions was greater than its control counterpart. This brings our probability of success to be 0.4348.


It's finally time to do our sign test - it is a relatively simple test (statistically speaking) which can be done via an online calculator:

https://www.graphpad.com/quickcalcs/binomial1/

The calculator also provides a sound explanation on the stats behind it after inputting your variables. However, we'll go the extra mile to quickly work out our p-values:

In [32]:
#binomial pmf with p = 0.5 (named binom_pmf to avoid shadowing Python's built-in bin):
def binom_pmf(x,n):
    return mt.comb(n,x)*0.5**n

#sum the probabilities of 0, 1, ..., x successes:
def get_1side_pvalue(x,n):
    return sum(binom_pmf(i,n) for i in range(0,x+1))


In [33]:

print('The two-sided P-value is ' + str(round(2*get_1side_pvalue(successes,dfsign3.shape[0]),4)))
print('The one-sided P-value is ' + str(round(get_1side_pvalue(successes,dfsign3.shape[0]),4)))

The two-sided P-value is 0.6776
The one-sided P-value is 0.3388


Given our one-sided p-value = 0.3388 > 0.05, we do not reject the Null Hypothesis that there is no difference between the groups.
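One subtlety worth checking is which tail the one-sided p-value refers to. The sketch below computes both tails for 10 successes in 23 days with the standard library: the lower tail $P(X \le 10)$ matches the value above, while the upper tail $P(X \ge 10)$ is the one that corresponds to the alternative "treatment wins on more days". Which tail to report is a modelling choice, so treat this as an illustration rather than a correction.

```python
import math

def binom_pmf(k, n, p=0.5):
    """Binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n_days, wins = 23, 10  # from the notebook: treatment won on 10 of 23 days

lower_tail = sum(binom_pmf(k, n_days) for k in range(0, wins + 1))       # P(X <= 10)
upper_tail = sum(binom_pmf(k, n_days) for k in range(wins, n_days + 1))  # P(X >= 10)
print(round(lower_tail, 4), round(upper_tail, 4))
```

Either way the p-value is far above 0.05, so the conclusion of the sign test is unchanged.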
    
    This means that the new page is not better than the old page.


<u><b>5. Chi-Squared Test:</b></u>

One statistical test that came out of one of LinkedIn Learning's A/B testing courses (Wahi, 2020) is the Chi-Squared Analysis (or $\chi^{2}$ test). If we construct a 2x2 contingency table for the observed frequencies in our dataset and compare it to a 2x2 contingency table of the expected frequencies, we can perform the $\chi^{2}$ test under the Null Hypothesis that no relationship exists between conversion and the treatment/control group in the population.

For reference, our 2x2 contingency table will have two dimensions: treatment/control and converted/not converted. We want to make 4 calculations that will be in our table:
1. Treatment, converted
2. Treatment, not converted
3. Control, converted
4. Control, not converted

In [34]:
#doing the 4 calculations as above (shape[0] counts rows, whereas DataFrame.size counts rows x columns and would inflate the totals):
treatment_converted = treatment_df.converted.sum()
treatment_not_converted = treatment_df.shape[0] - treatment_df.converted.sum()
control_converted = control_df.converted.sum()
control_not_converted = control_df.shape[0] - control_df.converted.sum()

#create the array to do our chi-squared test: treatment/control along the rows and converted/not converted along the columns:
Chi = np.array([[treatment_converted,treatment_not_converted],[control_converted,control_not_converted]])
Chi

array([[ 17264, 128046],
       [ 17489, 127785]])

In [35]:
#using scipy stats to perform our chi-squared test:
print(round(ss.chi2_contingency(Chi,correction=False)[1],4))

0.1899


Given our p-value = 0.1899 > 0.05, we do not reject the Null Hypothesis that the assigned treatment/control group has no effect on conversion. Without the continuity correction, this is the two-sided counterpart of the z-test above.
    
    This means that the new page is not better than the old page.
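The link between the $\chi^{2}$ test and the z-test can be checked numerically: for a 2x2 table, the uncorrected Pearson $\chi^{2}$ statistic equals the square of the pooled two-proportion z statistic. The sketch below uses a small hypothetical table (30/70 vs 45/55) and hand-written helpers rather than the dataset.

```python
import math

def pooled_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (x1 / n1 - x2 / n2) / se

def pearson_chi2(table):
    """Uncorrected Pearson chi-squared statistic for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

table = [[30, 70], [45, 55]]  # hypothetical: [converted, not converted] per group
chi2 = pearson_chi2(table)
z = pooled_z(30, 100, 45, 100)
p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared survival function with 1 degree of freedom

print(chi2, z ** 2, p_value)
```

This is why the $\chi^{2}$ test adds no new information beyond a two-sided z-test for two groups; it becomes more useful when there are more than two variants.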

<b><u>Conclusion:</u></b>

The A/B experiment was designed to determine if <b>FaceZonGoogAppFlix</b>'s new webpage would improve the conversion rate of their users compared to their existing one. 

After going through Metric checks, minimum sample size requirements and multiple statistical methods to determine a winner of the A/B test, we've seen that <b>FaceZonGoogAppFlix</b>'s underlying goal has not been reached with their new webpage.

Therefore, we recommend that they do not proceed with the new webpage, but pursue other experiments instead.

<b><u>Discussion:</u></b>


While this was a satisfactory A/B test, A/B test data in a real-life setting will not be as clear-cut as what we've been provided.

Designing the A/B test involves many steps to prevent factors that threaten your validity:

1. Issues such as the seasonality effect need to be heavily considered. An individual user's behaviour will change a lot depending on time, what day of the week it is (weekday vs weekend) and seasonality. A solution would be the hold-back method: launch the change to everyone except for one small hold-back group of users, and continue comparing their behavior to the control group (Peng, 2017). In our case, our A/B test was run over a duration of over 3 weeks right after the New Year, which should smooth out seasonality effects.

2. Other noteworthy and more direct psychological effects are change aversion and the novelty effect which affect our treatment group. Cohort Analysis may be helpful (Peng, 2017).

3. There can be statistical issues such as Interference, where the assumption of independence is violated between treatment and control groups. Interference will result in what is often known as the overestimation bias, because resources are being co-utilized by members in both control and treatment groups (Rosidi, 2021).

4. Business Impact discussions can involve engineering costs/maintenance, customer support, opportunity cost, etc. If we run a successful A/B test and decide on changing our website that results in a worthwhile increase in our metric, but our ongoing costs off-set this, then that must be taken into consideration (Peng, 2017).

5. There are even considerably minute effects that could poison our A/B test data, such as the flicker effect: This is when the original variation flashes on the treatment user's screens before variation B gets loaded (Dube, 2019).

<b><u>References:</u></b>

Data Science Dojo (2018) <i>What is A/B Testing? | Data Science in Minutes</i>, Youtube.

Dube, S. (2019) <i>Top 6 A/B Testing Questions Answered | #CRO </i>, Invesp.

Frost, J. (2018) <i>Practical vs. Statistical Significance</i>,  Statistics by Jim.

jkndrkn (2012) <i>Safely determining sample size for A/B testing</i>, Stats.stackexchange.

Kohavi, R., Tang, D. & Xu, Y. (2020) <i>Trustworthy Online Controlled Experiments - A practical guide to A/B Testing</i>, Cambridge University Press.

Lan (2011) <i>Relation between confidence interval and testing statistical hypothesis for t-test</i>, Stats.stackexchange.

Peng, K. (2017) <i>A Summary of Udacity A/B Testing Course</i>, Towards Data Science.

Rosidi, N. (2021) <i>A/B Testing Data Science Interview Questions Guide</i>, Stratascratch.

Saleh, K. (2018) <i>How To Calculate A/B Testing Sample Size</i>, Invesp.

Shetty, N. (2016) <i>A/B Testing - AWS</i>, Rstudio.

Shier, R. (2004) <i>Statistics: 2.1 The sign test</i>, Mathematics Learning Support Centre.

Tan, S. G. & Tan, S. B. (2010) <i>The Correct Interpretation of Confidence Intervals</i>, Proceedings of Singapore Healthcare, Volume 19, Number 3.

van Belle, G. (2008) <i>Statistical Rules of Thumb. 2nd edition</i>, WileyInterscience.

Wahi, M. (2020) <i>The Data Science of Experimental Design</i>, LinkedIn Learning.

Yu, P. (2020) <i>Understanding Power Analysis in AB Testing</i>, Towards Data Science.

Zach (2020) <i>How to Perform an F-Test in Python</i>, Statology.

zthomas.nc (2019) <i>AB test sample size calculation by hand</i>, Stats.stackexchange.