
In [1]:
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns

- **Click-through rate (CTR) for the homepage.** Number of clicks on the button divided by the total number of visits to the page. Selected as a measure of how well the category title initially attracts users.

- **Drop-off rate for the category pages.** Percentage of visitors who exit the site from a given category page (Interact, Connect, Learn, etc.) without exploring any of its subpages. This metric indicates how well the category page fulfils user expectations. A lower drop-off rate is preferable, as it implies users are engaged and finding the information they need.

- **Homepage-return rate for the category pages.** Measures how often users who reach a category page (Interact, Connect, Learn, etc.) from the library homepage end up returning to the homepage. Like the drop-off rate, this metric helps us infer whether users are finding what they need on the category pages: frequent returns to the homepage suggest they are not. Ideally, we want to minimize the homepage-return rate, which would indicate that users find what they need on the first try.
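
As a minimal sketch of how these three metrics could be computed from raw counts: the helper `rate` and all the example numbers below are hypothetical, chosen only for illustration, and are not taken from the study's analytics data.

```python
def rate(events: int, total: int) -> float:
    """Return events / total, or 0.0 when there were no visits."""
    return events / total if total else 0.0

# Hypothetical counts for one category (illustration only).
homepage_visits = 1000      # visits to the library homepage
button_clicks = 20          # clicks on the category button
category_visits = 20        # visits to the category page coming from the homepage
exits_from_category = 5     # left the site without opening any subpage
returns_to_homepage = 8     # went back to the homepage

ctr = rate(button_clicks, homepage_visits)
drop_off_rate = rate(exits_from_category, category_visits)
homepage_return_rate = rate(returns_to_homepage, category_visits)

print(f"CTR: {ctr:.2%}")
print(f"Drop-off rate: {drop_off_rate:.2%}")
print(f"Homepage-return rate: {homepage_return_rate:.2%}")
```

Guarding against a zero denominator is a design choice for pages with no recorded visits; returning `None` instead of `0.0` would be an equally reasonable convention.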



- Null Hypothesis: all versions have the same CTR.

- Alternative Hypothesis: there is a difference in the CTR for the different versions.
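
Writing $p_i$ for the true click-through rate of version $i$ (a notation introduced here for illustration only), the two hypotheses can be stated as:

```latex
H_0:\; p_{\text{Interact}} = p_{\text{Connect}} = p_{\text{Learn}} = p_{\text{Help}} = p_{\text{Services}}
\qquad
H_1:\; p_i \neq p_j \ \text{for at least one pair } i \neq j
```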

In [2]:
clicks = [42, 53, 21, 38, 45]
no_clicks = [10283-42, 2742-53, 2747-21, 3180-38, 2064-45]

MSU_results = pd.DataFrame([clicks, no_clicks],
                             columns = ['Interact', 'Connect', 'Learn', 'Help', 'Services'],
                             index = ['clicks', 'no-clicks'])

MSU_results

| | Interact | Connect | Learn | Help | Services |
|---|---|---|---|---|---|
| clicks | 42 | 53 | 21 | 38 | 45 |
| no-clicks | 10241 | 2689 | 2726 | 3142 | 2019 |


In [3]:
ctr_interact = 42/(10241+42)
ctr_connect = 53/(2689+53)
ctr_learn = 21/(2726+21)
ctr_help = 38/(3142+38)  # 3142 no-clicks, matching MSU_results
ctr_services = 45/(2019+45)

In [4]:
print(f'CTR Interact: {ctr_interact:.4f}')
print(f'CTR Connect: {ctr_connect:.4f}')
print(f'CTR Learn: {ctr_learn:.4f}')
print(f'CTR Help: {ctr_help:.4f}')
print(f'CTR Services: {ctr_services:.4f}')

CTR Interact: 0.0041
CTR Connect: 0.0193
CTR Learn: 0.0076
CTR Help: 0.0119
CTR Services: 0.0218
CTR Services:  0.02180232558139535


In [5]:
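# stats was already imported from scipy in the first cell.
# chi2_contingency returns the test statistic, the p-value,
# the degrees of freedom and the table of expected counts.
chisq, pvalue, df, expected = stats.chi2_contingency(MSU_results)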

In [6]:
alpha = 0.1

In [7]:
chisq, pvalue, df, expected

(96.7432353798328,
 4.852334301093838e-20,
 4,
 array([[   97.3694804 ,    25.96393224,    26.01127712,    30.11134374,
            19.5439665 ],
        [10185.6305196 ,  2716.03606776,  2720.98872288,  3149.88865626,
          2044.4560335 ]]))
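
To make the output above less of a black box, here is a sketch that recomputes the expected counts, the chi-square statistic and the degrees of freedom by hand in plain Python. The names ending in `_manual` are introduced here for illustration. The counts are copied from `MSU_results`.

```python
# Observed counts, copied from MSU_results.
clicks = [42, 53, 21, 38, 45]
no_clicks = [10241, 2689, 2726, 3142, 2019]
observed = [clicks, no_clicks]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected_manual = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chisq_manual = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected_manual)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof_manual = (len(observed) - 1) * (len(col_totals) - 1)

print(round(chisq_manual, 4), dof_manual)
print([round(e, 2) for e in expected_manual[0]])
```

The results should agree with the values returned by `stats.chi2_contingency` for this table, since no continuity correction is applied when the table is larger than 2x2.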

In [8]:
pvalue < alpha

True

In [9]:
if pvalue > alpha:
  print("The p-value is larger than alpha, we fail to reject the null hypothesis")
else:
  print("The p-value is smaller than alpha, we reject the null hypothesis")

The p-value is smaller than alpha, we reject the null hypothesis


Because the p-value is smaller than our significance level, **we can reject the Null Hypothesis**: it is very unlikely that we would see differences this large if the clicks did not depend on the version of the website.

But does this tell us the real winner? Not yet! It only tells us that at least one version performed differently from the others. **The gap between the best version (Services) and the worst one (Interact) is large enough that we can be confident it is real, but we cannot yet say whether the differences between “Services” and “Connect” or “Learn” are significant.**

Whenever that happens (getting significant results when comparing more than 2 variants), we perform a post-hoc test, consisting of running a new chi-square test for each pair of variants.

Repeating a test many times increases the probability of incorrectly rejecting a true null hypothesis at least once (also called a **“Type I error”**). Think about it this way: if you are 90% confident you won’t make a mistake in scenario 1, and 90% confident you won’t make a mistake in scenario 2, then, assuming the two scenarios are independent, your confidence of not making a mistake in either of them is 0.9 * 0.9 = 0.81, or 81%. The same happens when running multiple tests: every time we run a single test, there is a probability of mistakenly rejecting a true null hypothesis (bounded by alpha, the significance level), and the probability of making at least one such mistake increases as we perform more tests. **In practice, that means we need to adjust our alpha (the p-value threshold for rejecting the null hypothesis).**
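
The arithmetic above generalizes to any number of tests. The sketch below computes the probability of at least one false rejection across `m` tests, under the simplifying assumption that the tests are independent. The function name `family_wise_error_rate` is introduced here for illustration.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false rejection across m independent tests."""
    return 1 - (1 - alpha) ** m

alpha = 0.1
for m in (1, 2, 10):
    print(f"{m:>2} tests: FWER = {family_wise_error_rate(alpha, m):.3f}")
```

With alpha = 0.1, two tests already give a rate of about 19%, and ten tests about 65%, which is why the threshold has to be tightened.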

There are several approaches to do that. We are going to follow the Bonferroni Adjustment, as suggested in the book Passion Driven Statistics by Alan T. Arnholt:

  *For post hoc tests following a Chi-Square, we use what is referred to as the Bonferroni Adjustment. [...] this adjustment is used to counteract the problem of Type I Error that occurs when multiple comparisons are made. Following a Chi-Square test that includes an explanatory variable with 3 or more groups, we need to subset to each possible paired comparison. When interpreting these paired comparisons, rather than setting the α-level (p-value) at 0.1\*, we divide 0.1\* by the number of paired comparisons that we will be making. The result is our new α-level (p-value).*


# Bonferroni Adjustment:

In [11]:
alpha_b = alpha / 10
alpha_b

0.01
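
The divisor 10 is the number of unordered pairs that can be formed from 5 versions. A short sketch that derives it instead of hard-coding it (the variable names `versions`, `n_comparisons` and `pairs` are introduced here for illustration):

```python
from itertools import combinations
from math import comb

versions = ['Interact', 'Connect', 'Learn', 'Help', 'Services']
alpha = 0.1

n_comparisons = comb(len(versions), 2)   # 5 choose 2 unordered pairs
alpha_b = alpha / n_comparisons          # Bonferroni-adjusted threshold

pairs = list(combinations(versions, 2))
print(n_comparisons, alpha_b)
for first, second in pairs:
    print(first, 'vs', second)
```

Deriving the count this way keeps the adjusted threshold correct if a version is added or removed later.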

In [12]:
MSU_results

| | Interact | Connect | Learn | Help | Services |
|---|---|---|---|---|---|
| clicks | 42 | 53 | 21 | 38 | 45 |
| no-clicks | 10241 | 2689 | 2726 | 3142 | 2019 |


In [20]:
MSU_connect = MSU_results[['Interact', 'Connect']]
MSU_learn = MSU_results[['Interact', 'Learn']]
MSU_help = MSU_results[['Interact', 'Help']]
MSU_services = MSU_results[['Interact', 'Services']]
MSU_connect2 = MSU_results[['Connect', 'Learn']]
MSU_connect3 = MSU_results[['Connect', 'Help']]
MSU_connect4 = MSU_results[['Connect', 'Services']]
MSU_learn2 = MSU_results[['Learn', 'Help']]
MSU_learn3 = MSU_results[['Learn', 'Services']]
MSU_help2 = MSU_results[['Help', 'Services']]


# Show each 2x2 table in turn.
for table in (MSU_connect, MSU_learn, MSU_help, MSU_services, MSU_connect2,
              MSU_connect3, MSU_connect4, MSU_learn2, MSU_learn3, MSU_help2):
    display(table)

| | Interact | Connect |
|---|---|---|
| clicks | 42 | 53 |
| no-clicks | 10241 | 2689 |


| | Interact | Learn |
|---|---|---|
| clicks | 42 | 21 |
| no-clicks | 10241 | 2726 |


| | Interact | Help |
|---|---|---|
| clicks | 42 | 38 |
| no-clicks | 10241 | 3142 |


| | Interact | Services |
|---|---|---|
| clicks | 42 | 45 |
| no-clicks | 10241 | 2019 |


| | Connect | Learn |
|---|---|---|
| clicks | 53 | 21 |
| no-clicks | 2689 | 2726 |


| | Connect | Help |
|---|---|---|
| clicks | 53 | 38 |
| no-clicks | 2689 | 3142 |


| | Connect | Services |
|---|---|---|
| clicks | 53 | 45 |
| no-clicks | 2689 | 2019 |


| | Learn | Help |
|---|---|---|
| clicks | 21 | 38 |
| no-clicks | 2726 | 3142 |


| | Learn | Services |
|---|---|---|
| clicks | 21 | 45 |
| no-clicks | 2726 | 2019 |


| | Help | Services |
|---|---|---|
| clicks | 38 | 45 |
| no-clicks | 3142 | 2019 |


In [21]:
chisq_connect, pvalue_connect, df_connect, expected_connect = stats.chi2_contingency(MSU_connect)
chisq_learn, pvalue_learn, df_learn, expected_learn = stats.chi2_contingency(MSU_learn)
chisq_help, pvalue_help, df_help, expected_help = stats.chi2_contingency(MSU_help)
chisq_services, pvalue_services, df_services, expected_services = stats.chi2_contingency(MSU_services)
chisq_connect2, pvalue_connect2, df_connect2, expected_connect2 = stats.chi2_contingency(MSU_connect2)
chisq_connect3, pvalue_connect3, df_connect3, expected_connect3 = stats.chi2_contingency(MSU_connect3)
chisq_connect4, pvalue_connect4, df_connect4, expected_connect4 = stats.chi2_contingency(MSU_connect4)
chisq_learn2, pvalue_learn2, df_learn2, expected_learn2 = stats.chi2_contingency(MSU_learn2)
chisq_learn3, pvalue_learn3, df_learn3, expected_learn3 = stats.chi2_contingency(MSU_learn3)
chisq_help2, pvalue_help2, df_help2, expected_help2 = stats.chi2_contingency(MSU_help2)

In [22]:
# Show the p-value of each pairwise test, in the same order as the tables above.
for p in (pvalue_connect, pvalue_learn, pvalue_help, pvalue_services, pvalue_connect2,
          pvalue_connect3, pvalue_connect4, pvalue_learn2, pvalue_learn3, pvalue_help2):
    display(p)

2.2250331654688293e-16

0.025419824342152637

9.03599988558687e-07

5.719451224375125e-18

0.00027678881264505827

0.02808815288948292

0.6188771123975272

0.12512753088691322

5.0540996583731365e-05

0.007370912499282061

In [23]:
pvalues = [pvalue_connect, pvalue_learn, pvalue_help, pvalue_services, pvalue_connect2, pvalue_connect3,
           pvalue_connect4, pvalue_learn2, pvalue_learn3, pvalue_help2]
names = ['connect', 'learn', 'help', 'services', 'connect2', 'connect3', 'connect4', 'learn2', 'learn3', 'help2']

In [24]:
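# Compare each pairwise p-value against the Bonferroni-adjusted alpha.
# The loop variable is named p so it does not overwrite the overall pvalue.
for p, name in zip(pvalues, names):
    if p > alpha_b:
        print(f"The p-value for {name} is larger than alpha, we fail to reject the null hypothesis")
    else:
        print(f"The p-value for {name} is smaller than alpha, we reject the null hypothesis")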

The p-value for connect is smaller than alpha, we reject the null hypothesis
The p-value for learn is larger than alpha, we fail to reject the null hypothesis
The p-value for help is smaller than alpha, we reject the null hypothesis
The p-value for services is smaller than alpha, we reject the null hypothesis
The p-value for connect2 is smaller than alpha, we reject the null hypothesis
The p-value for connect3 is larger than alpha, we fail to reject the null hypothesis
The p-value for connect4 is larger than alpha, we fail to reject the null hypothesis
The p-value for learn2 is larger than alpha, we fail to reject the null hypothesis
The p-value for learn3 is smaller than alpha, we reject the null hypothesis
The p-value for help2 is smaller than alpha, we reject the null hypothesis
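
The ten pairwise tests above can also be written as a single loop, which avoids maintaining ten sets of variable names. This is a sketch rather than the notebook's own code: the `counts` dictionary and `pairwise_pvalues` are names introduced here, with the counts copied from `MSU_results`. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, which is also what the cells above do.

```python
from itertools import combinations
from scipy import stats

# (clicks, no-clicks) per version, copied from MSU_results.
counts = {
    'Interact': (42, 10241),
    'Connect': (53, 2689),
    'Learn': (21, 2726),
    'Help': (38, 3142),
    'Services': (45, 2019),
}
alpha_b = 0.1 / 10  # Bonferroni-adjusted threshold for 10 comparisons

pairwise_pvalues = {}
for a, b in combinations(counts, 2):
    # 2x2 table: rows are clicks / no-clicks, columns are the two versions.
    table = [[counts[a][0], counts[b][0]],
             [counts[a][1], counts[b][1]]]
    _, p, _, _ = stats.chi2_contingency(table)
    pairwise_pvalues[(a, b)] = p

for (a, b), p in pairwise_pvalues.items():
    verdict = 'fail to reject' if p > alpha_b else 'reject'
    print(f"{a} vs {b}: p = {p:.3g} -> we {verdict} the null hypothesis")
```

Labelling each result with both version names also makes the output easier to read than names such as `connect2` or `learn3`.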
