# Frequentist A/B testing

![ab-testing.png](attachment:ab-testing.png)

## Learning Goal

- Understand what A/B testing is.
- Learn how A/B testing is carried out.
- See how it is used in industry.
- Run a simple example.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.stats.power as smp
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportions_chisquare
plt.rcParams['figure.figsize'] = 8, 6
sns.set_context('notebook', font_scale = 1.5, rc = {'lines.linewidth': 2.5})
sns.set_style('whitegrid')
sns.set_palette('deep')
red = sns.xkcd_rgb['vermillion']
blue = sns.xkcd_rgb['dark sky blue']
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Warm-up

## Point Estimate

- Point estimates are estimates of population parameters based on sample data.
- For instance, if we wanted to know the average age of registered voters in the U.S., we could take a survey of registered voters and then use the average age of the respondents as a point estimate of the average age of the population as a whole.
- The sample mean is usually not exactly the same as the population mean.
- **Let's investigate point estimates by generating a population of random age data and then drawing a sample from it to estimate the mean**

In [None]:
# generate some random numbers to serve as our population
np.random.seed(10)
population_ages1 = stats.poisson.rvs(loc = 18, mu = 35, size = 150000)
population_ages2 = stats.poisson.rvs(loc = 18, mu = 10, size = 100000)
population_ages = np.concatenate((population_ages1, population_ages2))
print('population mean:', np.mean(population_ages))

In [None]:
# let's take a sample of ages from our population
np.random.seed(6)
sample_ages = np.random.choice(population_ages, size = 500)
print('sample mean:', np.mean(sample_ages))

### What does this tell us?

## Sampling Distributions and The Central Limit Theorem

- Many statistical procedures assume that data follows a normal distribution, because the normal distribution has nice properties such as being symmetric and having the majority of the data clustered within a few standard deviations of the mean.
- Real-world data is often not normally distributed, and the distribution of a sample tends to mirror the distribution of the population.

In [None]:
fig = plt.figure(figsize = (12, 6))
plt.subplot(1, 2, 1)
plt.hist(population_ages)
plt.title('Population')
plt.subplot(1, 2, 2)
plt.hist(sample_ages)
plt.title('Sample')
plt.show()

- The plot reveals that the data follows a bimodal distribution with two high-density peaks.
- The sample we drew from this population has roughly the same shape and skew.
- This leads to our next topic, the **central limit theorem**.

## Central Limit Theorem

- At a high level, the theorem states that the distribution of many sample means, known as a sampling distribution, will be approximately normal when the sample size is large enough.
- The result holds even if the underlying distribution itself is not normally distributed. Therefore, we can treat the sample mean as if it were drawn from a normal distribution.

In [None]:
np.random.seed(10)
samples = 200
point_estimates = [np.random.choice(population_ages, size = 500).mean()
                   for i in range(samples)]

plt.hist(point_estimates)
plt.show()

- The sampling distribution appears to be roughly normal, despite the bimodal population distribution that the samples were drawn from. 
- In addition, the mean of the sampling distribution is close to the true population mean:

In [None]:
population_ages.mean() - np.mean(point_estimates)

- Knowing that the sampling distribution will take the shape of a normal distribution is what makes the theorem so powerful, as it is the foundation of concepts such as confidence intervals and margins of error.
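
A related consequence of the theorem is that the spread of the sampling distribution shrinks like $\sigma / \sqrt{n}$. The sketch below rebuilds the same synthetic population so that it runs on its own, then compares the empirical standard deviation of many sample means with that theoretical value. The names `n_draws`, `empirical_se` and `theoretical_se` are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np
import scipy.stats as stats

# rebuild the synthetic population used earlier in this notebook
np.random.seed(10)
ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)
ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population = np.concatenate((ages1, ages2))

sample_size = 500
n_draws = 2000  # more draws than the histogram above, for a stabler estimate

# draw many samples and record the mean of each one
sample_means = np.array([
    np.random.choice(population, size=sample_size).mean()
    for _ in range(n_draws)
])

# spread of the sample means versus sigma / sqrt(n)
empirical_se = sample_means.std(ddof=1)
theoretical_se = population.std() / np.sqrt(sample_size)
print('empirical standard error:  ', empirical_se)
print('theoretical standard error:', theoretical_se)
```

The two numbers should agree closely, which is what justifies using $\sigma / \sqrt{n}$ as the standard error of the sample mean.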

## Confidence Interval

- A point estimate can give us a rough idea of a population parameter like the mean, but estimates are prone to error. 
- **A confidence interval is a range of values above and below a point estimate that captures the true population parameter at some predetermined confidence level.**

$$
\begin{align}
\text{point estimate} \pm \text{critical value} \times \text{standard error}
\end{align}
$$

- **Critical value** is the number of standard deviations we'd have to go from the mean of the normal distribution to capture the proportion of the data associated with the desired confidence level. 
- For instance, we know that roughly 95% of the data in a normal distribution lies within 2 standard deviations of the mean, so we could use 2 as the z-critical value for a 95% confidence interval (although it is more exact to get z-critical values with `stats.norm.ppf()`).

- Generally the **standard error** for a point estimate is estimated from the data and computed using a formula.
- For example, the standard error for the sample mean is $\frac{s}{ \sqrt{n} }$, where $s$ is the sample standard deviation and $n$ is the sample size.


- **Margin of Error** = $\text{critical value} \times \text{standard error}$, the half-width of the interval on either side of the point estimate.

- **Note that the confidence interval framework can be easily adapted to any estimator that has a nearly normal sampling distribution, e.g. the sample mean, the difference of two sample means, the sample proportion and the difference of two sample proportions (as we'll see later). All we have to do is change the way we calculate the standard error.**

## Confidence Interval for mean when ${\sigma}$ is known 


In [None]:
np.random.seed(10)
sample_size = 1000
sample = np.random.choice(population_ages, size = sample_size)
sample_mean = sample.mean()

confidence = 0.95
z_critical = stats.norm.isf(.025)
print('z-critical value:', z_critical)                     

pop_stdev = population_ages.std()
margin_of_error = z_critical * (pop_stdev / np.sqrt(sample_size))
confin_Interval = sample_mean - margin_of_error, sample_mean + margin_of_error
print('point estimate:', sample_mean)
print('Confidence interval:', confin_Interval)

- Notice that the confidence interval we calculated captures the true population mean of 43.0023.
- Let's create several confidence intervals and plot them to get a better sense of what it means to "capture" the true mean:

In [None]:
# np.random.seed(12)
confidence = 0.95
sample_size = 1000

# these values do not change between samples, so compute them once
z_critical = stats.norm.ppf(q = confidence + (1 - confidence) / 2)
pop_stdev = population_ages.std()
margin_error = z_critical * (pop_stdev / np.sqrt(sample_size))

intervals = []
sample_means = []
for _ in range(100):
    sample = np.random.choice(population_ages, size = sample_size)
    sample_mean = sample.mean()
    sample_means.append(sample_mean)
    confint = sample_mean - margin_error, sample_mean + margin_error
    intervals.append(confint)

plt.figure(figsize = (20, 8))
# each interval is (lower, upper); yerr must be the non-negative half-width
plt.errorbar(x = np.arange(0.1, 100, 1), y = sample_means,
             yerr = [(top - bot) / 2 for bot, top in intervals], fmt = 'o')

plt.hlines(xmin = 0, xmax = 100,
           y = population_ages.mean(), 
           linewidth = 2.0, color = red)
plt.show()

- More formally, a 95% confidence level means that **95% of the confidence intervals created from random samples of the same size from the same population will contain the true population parameter**.
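
This coverage property can be checked empirically. The following is a minimal sketch, assuming the same synthetic population as above and a known population standard deviation: it builds many intervals and counts how often they contain the true mean. The names `n_intervals` and `coverage` are illustrative.

```python
import numpy as np
import scipy.stats as stats

# rebuild the synthetic population used earlier in this notebook
np.random.seed(10)
ages1 = stats.poisson.rvs(loc=18, mu=35, size=150000)
ages2 = stats.poisson.rvs(loc=18, mu=10, size=100000)
population = np.concatenate((ages1, ages2))
pop_mean = population.mean()
pop_std = population.std()

confidence = 0.95
sample_size = 1000
n_intervals = 2000

# the margin of error is the same for every sample because sigma is known
z_critical = stats.norm.ppf(1 - (1 - confidence) / 2)
margin = z_critical * pop_std / np.sqrt(sample_size)

covered = 0
for _ in range(n_intervals):
    sample_mean = np.random.choice(population, size=sample_size).mean()
    # does this interval capture the true population mean?
    if sample_mean - margin <= pop_mean <= sample_mean + margin:
        covered += 1

coverage = covered / n_intervals
print('empirical coverage:', coverage)
```

The empirical coverage should land near the nominal 95%, with some simulation noise.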

##  Hypothesis Testing

In a **two-sided test** the null hypothesis is rejected if the test statistic is either too small or too large. Thus the rejection region for such a test consists of two parts: one on the left and one on the right.
![two_tail.png](attachment:two_tail.png)

For a **left-tailed test**, the null hypothesis is rejected if the test statistic is too small. Thus, the rejection region for such a test consists of one part, which lies to the left of the center.
![left_tail.png](attachment:left_tail.png)

For a **right-tailed test**, the null hypothesis is rejected if the test statistic is too large. Thus, the rejection region for such a test consists of one part, which lies to the right of the center.

![right_tail.png](attachment:right_tail.png)

- Frequentist hypothesis testing uses a p-value to weigh the strength of the evidence (what the data is telling you about the population).
- The p-value is defined as **the probability of obtaining the observed or a more extreme outcome, given that the null hypothesis is true (not the probability that the alternative hypothesis is true)**.
- It is a number between 0 and 1 and is **generally** interpreted in the following way:

\begin{array}{l|l}
\hline
 \text{p-value} & \text{Evidence against } H_0   \\
\hline
\ p > 0.10 &  \text{Weak or no evidence}   \\
\ 0.05 < p \le 0.10 & \text{Moderate evidence}    \\
\ 0.01 < p \le 0.05 & \text{Strong evidence}    \\
\  p \le 0.01 & \text{Very strong evidence}    \\
\hline 
\end{array}
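
To connect the three kinds of test to the p-value, the sketch below computes left-tailed, right-tailed and two-sided p-values for a standard normal test statistic. The value `z = 1.8` is a hypothetical observed statistic chosen purely for illustration.

```python
import scipy.stats as stats

z = 1.8  # hypothetical observed test statistic

# left-tailed: probability of a value at least this small under H0
p_left = stats.norm.cdf(z)
# right-tailed: probability of a value at least this large under H0
p_right = stats.norm.sf(z)
# two-sided: probability of a value at least this far from zero, in either direction
p_two_sided = 2 * stats.norm.sf(abs(z))

print('left-tailed p-value:  {:.4f}'.format(p_left))
print('right-tailed p-value: {:.4f}'.format(p_right))
print('two-sided p-value:    {:.4f}'.format(p_two_sided))
```

Note that the same statistic can give moderate evidence in a two-sided test and stronger evidence in a right-tailed test, which is why the direction of the alternative should be fixed before looking at the data.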

# Frequentist A/B testing

## Activation

- https://hbr.org/2017/06/a-refresher-on-ab-testing

- A/B testing is essentially a simple randomized trial. 

- The key idea is that we randomize which landing page (or treatment, in the case of a randomized clinical trial) each visitor is sent to.
- Because of this randomization, after a large number of visitors the groups of people who visited the two pages are comparable, on average, with respect to all characteristics (e.g. age, gender, location, and anything else you can think of!).
- Because the two groups are comparable, we can compare the outcomes (e.g. amount of advertising revenue) between the two groups to obtain an unbiased, and fair, assessment of the relative effectiveness (in terms of our defined outcome) of the two designs.
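
The balancing effect of randomization can be illustrated with a small simulation. This is a sketch under stated assumptions: the visitor ages are synthetic (normally distributed with mean 40 and standard deviation 12), and the names `n_visitors`, `age` and `group` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visitors = 10000

# hypothetical visitor characteristic: age in years
age = rng.normal(loc=40, scale=12, size=n_visitors)

# randomly assign each visitor to page A or page B with equal probability
group = rng.choice(['A', 'B'], size=n_visitors)

mean_age_a = age[group == 'A'].mean()
mean_age_b = age[group == 'B'].mean()
print('visitors sent to A:', np.sum(group == 'A'))
print('visitors sent to B:', np.sum(group == 'B'))
print('mean age in A: {:.2f}'.format(mean_age_a))
print('mean age in B: {:.2f}'.format(mean_age_b))
```

Although age was never used in the assignment, the two groups end up with nearly identical average ages, and the same holds for characteristics we did not measure.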

## A/B Testing Process

![AB-testing-work.svg](attachment:AB-testing-work.svg)

Scenario: We ran an A/B test with two different versions of a web page, A and B, for which we count the number of visitors and whether they convert or not. We can summarize this in a contingency table showing the frequency distribution of the events:

In [None]:
data = pd.DataFrame({
    'version': ['A', 'B'],
    'not_converted': [4514, 4473],
    'converted': [486, 527]
})[['version', 'not_converted', 'converted']]
data

- The conversion rate of each version is 486/(486 + 4514) = 9.72% for A and 527/(527 + 4473) = 10.54% for B.
- With such a relatively small difference, however, can we convincingly say that version B converts better?
- To test the statistical significance of a result like this, a hypothesis test can be used.
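
The conversion rates can also be derived directly from the contingency table. The sketch below rebuilds the table so that it runs on its own; the column names `visitors` and `conversion_rate` are additions for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'version': ['A', 'B'],
    'not_converted': [4514, 4473],
    'converted': [486, 527]
})

# total visitors per version and the share of them that converted
data['visitors'] = data['not_converted'] + data['converted']
data['conversion_rate'] = data['converted'] / data['visitors']
print(data[['version', 'visitors', 'conversion_rate']])
```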

## Comparing Two Proportions

- Let's formalize our thought process a little. Suppose that we have obtained data from $n$ visitors, $n_A$ of whom have been (randomly) sent to page A, and $n_B$ of whom have been sent to page B.


- Let $X_A$ and $X_B$ denote the number of visitors for whom we obtained a 'successful' outcome in the two groups. 


- The proportion of successes in the two groups is then given by $\hat{p_A} = X_A/n_A$ and $\hat{p_B} = X_B/n_B$ respectively. 


- The estimated difference in success rates is then given by the difference in proportions, $\hat{p_A} - \hat{p_B}$.

## Hypothesis Test on Two Proportions
- To assess whether we have statistical evidence that the two pages' success rates truly differ, we can perform a hypothesis test.


- The null hypothesis that we want to test is that the two pages' true success rates are equal, whereas the alternative is that they differ.


- If $p_A$ is the true proportion of page A visitors with a successful outcome and $p_B$ is the corresponding proportion for page B, then we are interested in testing the following hypotheses:

$$
\begin{align}
H_0:p_A = p_B \text{ versus } H_A: p_A \neq p_B
\end{align}
$$

- The null hypothesis says that 'page type' and 'outcome' are statistically independent of each other. In other words, knowing which page someone is sent to tells you nothing about the chance that they will convert.

## Test Statistics for Two Proportions


For our test, the appropriate test statistic is the z-statistic for a difference in proportions:


$$
\begin{align}
Z
&= \frac{ (\hat{p_A} - \hat{p_B}) - (p_A - p_B) }{SE(\hat{p_A} - \hat{p_B})} \\
&= \frac{ (\hat{p_A} - \hat{p_B}) - 0 }{\sqrt{\hat{p} (1 - \hat{p}) \left( \frac{1}{n_A} + \frac{1}{n_B} \right)}}
\end{align}
$$

where $\hat{p} = (X_A + X_B)/(n_A + n_B)$ is the pooled proportion of successes across both groups.
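
The formula can be evaluated step by step for the counts in the scenario above. This is a minimal sketch using only NumPy and SciPy; the variable names are illustrative, and the two-sided p-value is taken from the standard normal distribution.

```python
import numpy as np
import scipy.stats as stats

# successes and group sizes from the contingency table
x_a, n_a = 486, 5000
x_b, n_b = 527, 5000

p_a = x_a / n_a
p_b = x_b / n_b
# pooled proportion under the null hypothesis p_A = p_B
p_pooled = (x_a + x_b) / (n_a + n_b)

# standard error of the difference under the null hypothesis
se = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se

# two-sided p-value: probability of a |Z| at least this large under H0
p_value = 2 * stats.norm.sf(abs(z))
print('z = {:.3f}, p-value = {:.3f}'.format(z, p_value))
```

The sign of `z` only reflects the order of subtraction (A minus B here); the two-sided p-value does not depend on it.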

In [None]:
# def two_proprotions_test(success_a, size_a, success_b, size_b):
#     """
#     A/B test for two proportions;
#     given the number of successes and the trial size of groups A and B,
#     compute the difference in proportions, the z-score and the p-value
    
#     Parameters
#     ----------
#     success_a, success_b : int
#         Number of successes in each group
        
#     size_a, size_b : int
#         Size, or number of observations in each group
    
#     Returns
#     -------
#     zscore : float
#         test statistic for the two proportion z-test

#     pvalue : float
#         p-value for the two proportion z-test
#     """
#     prop_a = success_a / size_a
#     prop_b = success_b / size_b
#     prop_pooled = (success_a + success_b) / (size_a + size_b)
#     var = prop_pooled * (1 - prop_pooled) * (1 / size_a + 1 / size_b)
#     zscore = np.abs(prop_b - prop_a) / np.sqrt(var)
#     one_side = 1 - stats.norm(loc = 0, scale = 1).cdf(zscore)
#     pvalue = one_side * 2
#     prop_diff = prop_b - prop_a

#     return prop_diff, zscore, pvalue

In [None]:
# success_a = 486
# size_a = 5000
# success_b = 527
# size_b = 5000

# prop_diff, zscore, pvalue = two_proprotions_test(success_a, size_a, success_b, size_b)
# print('prop_diff= {:.3f}, zscore = {:.3f}, pvalue = {:.3f}'.format(prop_diff,zscore, pvalue))

### Statsmodels Implementation 

In [None]:
# where we pass in the success (they call the argument counts)
# and the total number for each group (they call the argument nobs,
# number of observations)
counts = np.array([486, 527])
nobs = np.array([5000, 5000])

zscore, pvalue = proportions_ztest(counts, nobs, alternative = 'two-sided')
print('zscore = {:.3f}, pvalue = {:.3f}'.format(zscore, pvalue))
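
The confidence interval framework from the warm-up can also be applied to the difference in proportions. The sketch below is one way to do this, assuming the unpooled standard error (a common choice for intervals, though not something this notebook specifies); the names `se_unpooled` and `conf_int` are illustrative.

```python
import numpy as np
import scipy.stats as stats

counts = np.array([486, 527])
nobs = np.array([5000, 5000])

props = counts / nobs
diff = props[1] - props[0]  # conversion rate of B minus conversion rate of A

# unpooled standard error: each group contributes its own variance
se_unpooled = np.sqrt(np.sum(props * (1 - props) / nobs))

z_critical = stats.norm.ppf(0.975)  # 95% confidence level
margin_of_error = z_critical * se_unpooled
conf_int = (diff - margin_of_error, diff + margin_of_error)
print('difference: {:.4f}'.format(diff))
print('95% confidence interval: ({:.4f}, {:.4f})'.format(*conf_int))
```

An interval that contains zero is consistent with a two-sided test that fails to reject the null hypothesis at the 5% level.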

## Effect Size
- Effect size for proportions: $h = 2 \left( \arcsin\sqrt{p_1} - \arcsin\sqrt{p_2} \right)$
- [Cohen's h](https://en.wikipedia.org/wiki/Cohen%27s_h)
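
The effect size can be computed by hand from the formula, which makes the arcsine transformation explicit. This is a small sketch using the conversion rates from the scenario; the variable names are illustrative.

```python
import numpy as np

prop_a = 486 / 5000
prop_b = 527 / 5000

# Cohen's h: difference between arcsine-transformed proportions
cohens_h = 2 * (np.arcsin(np.sqrt(prop_a)) - np.arcsin(np.sqrt(prop_b)))
print("Cohen's h: {:.4f}".format(cohens_h))
```

The result is negative because A has the lower rate; only its magnitude matters for a two-sided test, and it is small by the usual rule of thumb (around 0.2 counts as a small effect).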

In [None]:
from statsmodels.stats.proportion import proportion_effectsize
effect_size = proportion_effectsize(0.0972, 0.1054)
effect_size

# Power

- Up to this point, we've been using 5,000 observations per group in the A/B testing process.
- The next question that we'll address is: in real-world scenarios, how many observations do we need in order to draw a valid verdict on the test result?

### Type 1 error
- Rejecting the null hypothesis when it is actually true is called a type 1 error. Its probability is often denoted as $\alpha$.
- Committing a type 1 error is a false positive because we end up recommending something that does not work. 

### Type 2 error

- A type 2 error occurs when we do not reject the null hypothesis when it is actually false. Its probability is often denoted as $\beta$.
- This is a false negative because we end up doing nothing when we should have taken action. 

**We need to consider both types of errors when choosing the sample size.**
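
The meaning of the type 1 error rate can be made concrete with a simulated "A/A test", in which both pages share the same true conversion rate, so every rejection is a false positive. This is a sketch under stated assumptions: the common rate `true_rate = 0.10` and the number of simulated experiments are hypothetical choices.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(42)
alpha = 0.05
n_per_group = 5000
true_rate = 0.10        # hypothetical rate shared by both pages (H0 is true)
n_experiments = 10000

# simulate the number of conversions in each group for every experiment
conv_a = rng.binomial(n_per_group, true_rate, size=n_experiments)
conv_b = rng.binomial(n_per_group, true_rate, size=n_experiments)

# two-proportion z-test with a pooled standard error, vectorized
p_a = conv_a / n_per_group
p_b = conv_b / n_per_group
p_pooled = (conv_a + conv_b) / (2 * n_per_group)
se = np.sqrt(p_pooled * (1 - p_pooled) * (2 / n_per_group))
z = (p_a - p_b) / se
pvalues = 2 * stats.norm.sf(np.abs(z))

# share of experiments that wrongly reject the null hypothesis
false_positive_rate = np.mean(pvalues < alpha)
print('false positive rate:', false_positive_rate)
```

The share of rejections should be close to the chosen `alpha`, which is exactly what a 5% significance level promises when there is no real difference.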

## Significance Level and Statistical Power
- **Significance level:** Governs the chance of a false positive. A significance level of 0.05 means that, when the null hypothesis is true, there is a 5% chance of a false positive.
- **Statistical power:** A power of 0.80 means that, if there is an effect, there is an 80% chance that we detect it (and a 20% chance that we miss it). In other words, power is equal to $1 - \beta$.

| Scenario               | $H_0$ is true                      | $H_0$ is false            |
|:----------------------:|:----------------------------------:|:-------------------------:|
|  Fail to reject $H_0$  |  Correct decision                  |  Type 2 error (1 - power) |
|  Reject $H_0$          |  Type 1 error (significance level) |  Correct decision         |

[The concepts of power and significance level can seem somewhat convoluted at first glance. A good way to get a feel for the underlying mechanics is to plot the probability distribution of $Z$ assuming that the null hypothesis is true. Then do the same assuming that the alternative hypothesis is true, and overlay the two plots.](https://rpsychologist.com/d3/NHST/)
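
The overlay described above can be turned into a direct calculation. Under the normal approximation, the test statistic is centered at 0 under the null hypothesis and at $|h| \sqrt{n/2}$ under the alternative, where $h$ is Cohen's h and $n$ is the size of each group. The sketch below computes the resulting power for the scenario in this notebook; it is a hand calculation under those assumptions, and the names `shift` and `power` are illustrative.

```python
import numpy as np
import scipy.stats as stats

alpha = 0.05
n_per_group = 5000
prop_a = 486 / 5000
prop_b = 527 / 5000

# Cohen's h for the two conversion rates
h = 2 * (np.arcsin(np.sqrt(prop_a)) - np.arcsin(np.sqrt(prop_b)))

# rejection threshold of the two-sided test under H0
z_critical = stats.norm.ppf(1 - alpha / 2)

# center of the test statistic's distribution under the alternative
shift = abs(h) * np.sqrt(n_per_group / 2)

# power: probability of landing in either rejection region under the alternative
power = stats.norm.sf(z_critical - shift) + stats.norm.cdf(-z_critical - shift)
print('power: {:.3f}'.format(power))
```

With an effect this small, 5,000 visitors per group give a power well below the conventional 0.8 target, so a real difference of this size would be missed more often than it is detected.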


## Power Example

Consider the following: $$H_0: p_A = p_B, H_1: p_A \neq p_B$$ 

- Sample size, n=5,000 (assume equal sample sizes for the control and experiment groups)


In [None]:
smp.NormalIndPower().solve_power(effect_size, nobs1=5000, alpha=0.05, ratio=1)

## Sample Size

In [None]:
smp.NormalIndPower().solve_power(effect_size, power=0.8, alpha=0.05)
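
As a rough cross-check on the solver call above, the required sample size per group can be approximated in closed form as $n \approx 2 \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{h} \right)^2$. This is a sketch that assumes equal group sizes and ignores the negligible probability of rejecting in the wrong direction; the name `n_required` is illustrative.

```python
import numpy as np
import scipy.stats as stats

alpha = 0.05
target_power = 0.8
prop_a = 486 / 5000
prop_b = 527 / 5000

# Cohen's h for the two conversion rates
h = 2 * (np.arcsin(np.sqrt(prop_a)) - np.arcsin(np.sqrt(prop_b)))

z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided critical value
z_beta = stats.norm.ppf(target_power)    # quantile matching the desired power

# approximate observations needed in each group
n_required = 2 * ((z_alpha + z_beta) / h) ** 2
print('required sample size per group:', int(np.ceil(n_required)))
```

The answer is several times larger than the 5,000 visitors per group used so far, which is consistent with the low power found for that sample size.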

## A/B Testing in Industry

- A **media company** might want to increase readership, increase the amount of time readers spend on their site, and amplify their articles with social sharing. To achieve these goals, they might test variations on:

    + Email sign-up models  
    + Recommended content  
    + Social sharing buttons  
    
- A **travel company** may want to increase the number of successful bookings completed on their website or mobile app, or may want to increase revenue from ancillary purchases. To improve these metrics, they may test variations of:

    + Homepage search models
    + Search results page
    + Ancillary product presentation

- An **e-commerce company** might want to increase the number of completed checkouts, the average order value, or increase holiday sales. To accomplish this, they may A/B test:

    + Homepage promotions
    + Navigation elements
    + Checkout funnel components
    
- A **technology company** might want to increase the number of high-quality leads for their sales team, increase the number of free trial users, or attract a specific type of buyer. They might test:

    + Lead form components
    + Free trial signup flow
    + Homepage messaging and [call-to-action](https://www.optimizely.com/optimization-glossary/call-to-action/)
