# Statistical Significance Testing

__The objective in this notebook is to demonstrate different statistical tests on the recorded metrics of a hypothetical A/B test, to see whether there is a statistically significant difference between the two groups.__


_Hypothetical experiment setup:_ We've collected data for a web-based experiment. We're testing a layout change to see if it affects the proportion of people who click on a button to go to the download page. This experiment is designed to have a cookie-based diversion, and we record two things from each user: 
- which page version they received
- whether or not they accessed the download page during the data recording period. 

In [17]:
# import packages
import numpy as np
import pandas as pd
import scipy.stats as stats
# from statsmodels.stats.weightstats import ztest
# from statsmodels.stats import proportion as proptests

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# import data
data = pd.read_csv('data/statistical_significance_data.csv')
display(data.head(5))
display(data.shape)

Unnamed: 0,condition,click
0,1,0
1,0,0
2,0,0
3,1,1
4,1,0


(999, 2)

__Observations:__ In the dataset, the 'condition' column takes a value of 0 for the control group and 1 for the experimental group. The 'click' column takes a value of 0 for no click and 1 for a click. There are 999 samples in the set.

### EDA

In [3]:
# Count participants per group and calculate mean click rates
df = pd.DataFrame()
df['participants'] = data['condition'].value_counts()
df['clickrate'] = data.groupby('condition')['click'].mean()
df

Unnamed: 0,participants,clickrate
1,508,0.112205
0,491,0.07943


---

## 1. Checking the Invariant Metric

**Invariant metrics**: Metrics that we hope will not be different between groups. Metrics in this category serve to check that the experiment is running as expected. 

In this case, we should check that the number of visitors assigned to each group is similar. We want to do a _two-sided hypothesis test_ on the proportion of visitors assigned to one of our conditions. 

There are two main approaches for that:

1. __Simulation-based approach:__ We can simulate the number of visitors that would be assigned to each group for the total number of observations, assuming an expected 50/50 split (200,000 repetitions should provide a good speed-variability balance in this case). We then count how many simulated cases show a deviation from 50/50 that is as extreme as or more extreme than the one we actually observed. The proportion of flagged simulation outcomes gives us a p-value for our observed proportion. We hope to see a large p-value, meaning there is insufficient evidence to reject the null hypothesis of an even split.

2. __Analytic approach:__ We could use the exact binomial distribution to compute a p-value for the test. The more usual approach, however, is to use the normal distribution approximation. (This is possible thanks to our large sample size and the central limit theorem.) Because we approximate a discrete distribution by a continuous one, we should perform a [continuity correction](https://en.wikipedia.org/wiki/Continuity_correction) to get a more accurate p-value. This means either adding 0.5 to or subtracting 0.5 from the observed count before computing the area underneath the curve. (E.g., if we had 415 / 850 assigned to the control group, then the normal approximation would take the area to the left of $(415 + 0.5) / 850 = 0.489$ and to the right of $(435 - 0.5) / 850 = 0.511$.)
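
To make the effect of the continuity correction concrete, here is a minimal sketch that compares the exact binomial p-value with the normal approximation, with and without the correction. It uses the hypothetical 415 / 850 split from the example above; the variable names are illustrative and not part of the notebook's own cells.

```python
import numpy as np
from scipy import stats

# Hypothetical counts from the example in the text: 415 of 850 in control.
n_total = 850
n_control = 415
p_null = 0.5

# Exact two-sided p-value. Under p = 0.5 the binomial is symmetric,
# so the two tails carry equal probability mass.
k_low = min(n_control, n_total - n_control)
p_exact = min(1.0, 2 * stats.binom.cdf(k_low, n_total, p_null))

# Normal approximation, with and without the continuity correction.
mean = n_total * p_null
sd = np.sqrt(n_total * p_null * (1 - p_null))
p_corrected = 2 * stats.norm.cdf((k_low + 0.5 - mean) / sd)
p_uncorrected = 2 * stats.norm.cdf((k_low - mean) / sd)

print(f"exact binomial:            {p_exact:.4f}")
print(f"normal, with correction:   {p_corrected:.4f}")
print(f"normal, no correction:     {p_uncorrected:.4f}")
```

With counts of this size, the corrected approximation should land noticeably closer to the exact value than the uncorrected one.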


In [4]:
# get number of trials and number of 'successes'
n_obs = data.shape[0]
n_control = data.groupby('condition').size()[0]

### Simulation-based approach

In [7]:
# simulate outcomes under null, compare to observed outcome
p = 0.5
n_trials = 200_000

samples = np.random.binomial(n_obs, p, n_trials)


print("p-value: ", (np.logical_or(samples <= n_control, samples >= (n_obs - n_control)).mean()))

p-value:  0.612905
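
Because the simulated p-value changes slightly from run to run, it can help to fix a seed and estimate how much Monte Carlo noise remains. The sketch below is one way to do that, assuming the counts reported earlier in the notebook output (999 observations, 491 in control); the seed value and the `mc_se` name are arbitrary choices made for illustration.

```python
import numpy as np

# Counts as reported in the notebook's earlier output.
n_obs = 999
n_control = 491

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch
n_trials = 200_000
samples = rng.binomial(n_obs, 0.5, n_trials)

# Two-sided: flag draws at least as far from an even split as the observed count.
extreme = (samples <= n_control) | (samples >= n_obs - n_control)
p_value = extreme.mean()

# Monte Carlo standard error of the simulated p-value (a binomial proportion).
mc_se = np.sqrt(p_value * (1 - p_value) / n_trials)

print(f"p-value: {p_value:.4f} (Monte Carlo SE about {mc_se:.4f})")
```

A standard error of roughly one thousandth suggests that 200,000 repetitions are enough to pin down the p-value to about two decimal places.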


### Analytic approach




In [8]:
# Compute a z-score and p-value
p = 0.5
sd = np.sqrt(p * (1-p) * n_obs)
z = ((n_control + 0.5) - p * n_obs) / sd

print("z-score: ", z)
print("p-value: ", 2 * stats.norm.cdf(z))

z-score:  -0.5062175977346661
p-value:  0.6127039025537114


**Results:** The p-value is around .613 under both approaches, so we fail to reject the null hypothesis of an even split. Since the difference in group sizes isn't statistically significant, we can move on to the test on the evaluation metric.

---

## 2. Checking the Evaluation Metric

__Evaluation metrics:__ The metrics by which we compare groups. Ideally, we hope to see a difference between groups that will tell us if our manipulation was a success.

Evaluation metric in this case: The click-through rate. We want to see that the experimental group has a significantly larger click-through rate than the control group. This is a _one-tailed test._


Notes on the approaches:

1. __The simulation approach__ for this metric isn't too different from the approach for the invariant metric. You'll need the overall click-through rate as the common proportion to draw simulated values from for each group. You may also want to perform more simulations since there's higher variance for this test.

2. __Analytic approaches:__ We can make use of the normal approximation again. In addition to the pooled click-through rate, we'll need a pooled standard deviation in order to compute a _z-score_. (While there is a continuity correction possible in this case as well, it's much more conservative than the p-value that a simulation will usually imply. Computing the z-score and resulting p-value without a continuity correction should be closer to the simulation's outcomes, though slightly more optimistic about there being a statistical difference between groups.)
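
Written out, the pooled quantities described above take the following form. The symbols are assumptions introduced here for illustration: $x_C, x_E$ are the click counts and $n_C, n_E$ the group sizes for control and experiment, and $\hat{p}_C = x_C / n_C$, $\hat{p}_E = x_E / n_E$ are the observed click-through rates.

$$
\hat{p} = \frac{x_C + x_E}{n_C + n_E}, \qquad
SE = \sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_C} + \frac{1}{n_E}\right)}, \qquad
z = \frac{\hat{p}_E - \hat{p}_C}{SE}
$$

For the one-tailed test, the p-value is the upper-tail area $1 - \Phi(z)$, where $\Phi$ is the standard normal CDF.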

In [9]:
# get number of trials and overall 'success' rate under null
n_control = data.groupby('condition').size()[0]
n_exper = data.groupby('condition').size()[1]
rate_null = data['click'].mean()

# select the 'click' column so each rate is a scalar rather than a Series
rate_control = data.groupby('condition')['click'].mean()[0]
rate_exper = data.groupby('condition')['click'].mean()[1]
rate_diff = float(rate_exper - rate_control)

# check
print("Difference in ctr: ", rate_diff)

Difference in ctr:  0.03277498917523293


### Simulation-based approach


In [10]:
# simulate outcomes under null, compare to observed outcome
n_trials = 200_000

control_clicks = np.random.binomial(n_control, rate_null, n_trials)
exper_clicks = np.random.binomial(n_exper, rate_null, n_trials)
samples = (exper_clicks / n_exper) - (control_clicks / n_control)

print("p-value: ", (samples >= (rate_diff)).mean())

p-value:  0.039305


### Analytic approach


In [11]:
# compute standard error, z-score, and p-value
se_p = np.sqrt(rate_null * (1-rate_null) * (1/n_control + 1/n_exper))

z = (rate_diff) / se_p
print("z-score: ", z)
print("p-value: ", 1-stats.norm.cdf(z))

z-score:  1.7571887396196666
p-value:  0.039442821974613684
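
The same calculation can be packaged as a small helper that works from raw counts. This is a sketch, not a library function: `one_sided_prop_ztest` is a hypothetical name, and the click counts 39 and 57 are back-calculated from the click rates and group sizes shown in the EDA table rather than read from the data file.

```python
import numpy as np
from scipy import stats

def one_sided_prop_ztest(clicks_control, n_control, clicks_exper, n_exper):
    """Pooled z-test for H1: experimental rate > control rate (hypothetical helper)."""
    # Pooled click-through rate under the null hypothesis of no difference.
    rate_null = (clicks_control + clicks_exper) / (n_control + n_exper)
    se = np.sqrt(rate_null * (1 - rate_null) * (1 / n_control + 1 / n_exper))
    z = (clicks_exper / n_exper - clicks_control / n_control) / se
    # sf(z) is the upper-tail area, equivalent to 1 - cdf(z).
    return z, stats.norm.sf(z)

# Counts inferred from the EDA table: 491 * 0.07943 = 39 and 508 * 0.112205 = 57.
z, p_value = one_sided_prop_ztest(39, 491, 57, 508)
print("z-score: ", z)
print("p-value: ", p_value)
```

If the inferred counts are right, this should reproduce the z-score of about 1.757 and the p-value of about .039 from the cell above.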


In [29]:
data['click'][data['condition'] == 1].mean()

0.11220472440944881

**Results:** The p-value of around .039 is below the alpha = .05 level, so we reject the null hypothesis. The experimental layout shows a statistically significant increase in click-through rate, which is the effect we were hoping for.

---