# Variance Reduction Techniques

This notebook showcases several common techniques for reducing metric variance, which in turn increases metric sensitivity in A/B testing. The dataset under investigation is provided by Starbucks and shared within the Data Scientist Nano-degree program. It contains customer promotion and purchase data, along with seven measures. You can learn more about it by visiting this [link](https://drive.google.com/file/d/18klca9Sef1Rs6q8DW4l7o349r8B70qXM/view).

In [34]:
# load libraries
import pandas as pd
import numpy as np

In [5]:
# load dataset
# A description of this dataset is available here:
# https://drive.google.com/file/d/18klca9Sef1Rs6q8DW4l7o349r8B70qXM/view
data_set = pd.read_csv('./training_ab_starbucks.csv')

## Data Exploration

In [3]:
data_set.head()

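| Unnamed: 0 | ID | Promotion | purchase | V1 | V2 | V3 | V4 | V5 | V6 | V7 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | No | 0 | 2 | 30.443518 | -1.165083 | 1 | 1 | 3 | 2 |
| 1 | 3 | No | 0 | 3 | 32.15935 | -0.645617 | 2 | 3 | 2 | 2 |
| 2 | 4 | No | 0 | 2 | 30.431659 | 0.133583 | 1 | 1 | 4 | 2 |
| 3 | 5 | No | 0 | 0 | 26.588914 | -0.212728 | 2 | 1 | 4 | 2 |
| 4 | 8 | Yes | 0 | 3 | 28.044332 | -0.385883 | 1 | 1 | 2 | 2 |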


In [32]:
# number of total users
nr_users = data_set.shape[0]

In [33]:
# number of customers who did and did not receive the promotion
group_aggr = data_set.groupby(['Promotion']).count().reset_index()
group_promoted = group_aggr.loc[group_aggr['Promotion'] == 'Yes']['ID'].iloc[0] # received
group_not_promoted = group_aggr.loc[group_aggr['Promotion'] == 'No']['ID'].iloc[0] # did not receive

# format() converts the counts to text itself, so no explicit str() is needed
print("This dataset contains {} customers: {} received the promotion and the remaining {} did not.".format(nr_users, group_promoted, group_not_promoted))


This dataset contains 84534 customers: 42364 received the promotion and the remaining 42170 did not.


Beyond the promotion flag, the dataset contains seven measures, V1 to V7, and one business metric, `purchase`, which indicates whether the customer made a purchase. The purpose of this notebook is to apply different variance reduction techniques and measure how much variance each one removes compared with a baseline that applies none.
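To make that baseline concrete, here is a minimal sketch of one way to measure it: the per-group purchase rate and the estimated variance of the difference between the two rates. It runs on a tiny made-up frame (`toy`) that only mimics the `Promotion` and `purchase` columns, and the helper name `baseline_variance` is hypothetical rather than part of this notebook.

```python
import pandas as pd

# Toy stand-in for data_set: made-up values, same column names as the real file.
toy = pd.DataFrame({
    "Promotion": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "purchase":  [1, 0, 0, 1, 0, 0, 1, 0],
})

def baseline_variance(df, group_col="Promotion", metric_col="purchase"):
    """Return (lift, variance of lift) with no variance reduction applied.

    lift is the mean metric of the 'Yes' group minus that of the 'No' group.
    """
    stats = df.groupby(group_col)[metric_col].agg(["mean", "var", "count"])
    # The variance of a sample mean is the sample variance divided by the group size.
    var_of_mean = stats["var"] / stats["count"]
    lift = stats.loc["Yes", "mean"] - stats.loc["No", "mean"]
    # The groups are independent, so the variances of the two means add up.
    return lift, var_of_mean.sum()

lift, var_lift = baseline_variance(toy)
print(f"lift = {lift:.3f}, variance of lift = {var_lift:.4f}")
```

On the real data the same call would take `data_set` in place of `toy`. A smaller variance of the lift means a smaller true effect can be detected with the same number of users.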

Bytepawn published a very helpful [article](https://bytepawn.com/five-ways-to-reduce-variance-in-ab-testing.html), which introduced five techniques:

1. Increase sample size
2. Move towards an even split
3. Reduce variance in the metric definition
4. Stratification
5. CUPED
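The first two items in the list follow from the standard formula for the variance of a difference in means between two independent groups, `sigma2 / n_treated + sigma2 / n_control`. The sketch below illustrates this under the simplifying assumption that both groups share the same per-user variance; the function name `diff_in_means_variance` and the value of `sigma2` are chosen only for illustration.

```python
def diff_in_means_variance(sigma2, n_total, treated_share):
    """Variance of (treated mean - control mean) for two independent groups
    that share the same per-user variance sigma2."""
    n_treated = n_total * treated_share
    n_control = n_total * (1 - treated_share)
    return sigma2 / n_treated + sigma2 / n_control

sigma2 = 0.01  # assumed per-user variance, for illustration only

# Technique 1: doubling the sample size halves the variance.
v_small = diff_in_means_variance(sigma2, 10_000, 0.5)
v_large = diff_in_means_variance(sigma2, 20_000, 0.5)

# Technique 2: at the same total size, a 10/90 split is noisier than 50/50.
v_uneven = diff_in_means_variance(sigma2, 10_000, 0.1)

print(v_small, v_large, v_uneven)
```

In this dataset, 42364 of the 84534 customers received the promotion, which is roughly a 50.1% share, so the split is already close to even and the second technique likely has little left to gain here.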

What I will do differently from the Bytepawn article is validate these techniques against a real-world dataset rather than simulated numbers.