# Variance Reduction Techniques

This notebook will showcase several common techniques to reduce metric variance, which is used to increase metric sensitivity for AB testing. The dataset to be investigated with is provided by Starbucks and shared within the Data Scientist Nano-degree program. It contains customer promotion and purchase data, along with seven measures. You can know more about it by visiting this [link](https://drive.google.com/file/d/18klca9Sef1Rs6q8DW4l7o349r8B70qXM/view). 

In [102]:
# load libraries
import pandas as pd
import numpy as np
import math


In [103]:
# load dataset
# Here is the introduction of this dataset: 
# https://drive.google.com/file/d/18klca9Sef1Rs6q8DW4l7o349r8B70qXM/view
data_set = pd.read_csv('./training_ab_starbucks.csv')

## Data Exploration

In [104]:
data_set.head()

Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
0,1,No,0,2,30.443518,-1.165083,1,1,3,2
1,3,No,0,3,32.15935,-0.645617,2,3,2,2
2,4,No,0,2,30.431659,0.133583,1,1,4,2
3,5,No,0,0,26.588914,-0.212728,2,1,4,2
4,8,Yes,0,3,28.044332,-0.385883,1,1,2,2


In [105]:
# no null value in the dataset
data_set.isnull().sum()

ID           0
Promotion    0
purchase     0
V1           0
V2           0
V3           0
V4           0
V5           0
V6           0
V7           0
dtype: int64

In [106]:
# number of total users
nr_users = data_set.shape[0]

In [107]:
# number of customers who received the promotion or not
group_aggr = data_set.groupby(['Promotion']).count().reset_index()
group_promoted = group_aggr.loc[group_aggr['Promotion'] == 'Yes']['ID'].iloc[0] # received
group_not_promoted = group_aggr.loc[group_aggr['Promotion'] == 'No']['ID'].iloc[0] # not received

print("This dataset contains {} customers, in which {} of them received promotion and the rest {} did not.".format(str(nr_users), str(group_promoted), str(group_not_promoted)))


This dataset contains 84534 customers, in which 42364 of them received promotion and the rest 42170 did not.


Other than that, this dataset also contains seven measures, V1 to V7, and one business metric which tells whether the customer purchase or not. The purpose of this notebook is adopting different variance reduction techniques and look at how much variance each method is able to reduce compared against adopting nothing instead.

Bytepawn published a very helpful [article](https://bytepawn.com/five-ways-to-reduce-variance-in-ab-testing.html), which introduced five techniques:

1. Increase sample size
2. Move towards an even split
3. Reduce variance in the metric definition
4. Stratification
5. CUPED

Whay will I do, differently from the article from Bytepawn, is validating these techniques against the real world dataset, rather than simulating the numbers.

Before diving into the details, let's figure out the date type of each column -

We have **Promotion** as a binary data which we can split the users into two groups - control and treatment;

**purchase** is another binary data where we know customers made purchase or not. In business, we usually aggregate it into conversion rate to evaluate the performance.

For **V1** and **V4** to **V7**, they are all integers, which we will regard them as category data.

Lastly, the **V2** and **V3** variables are floats, and we will look at the mean average to evaluate the metrics.

In [108]:
data_set.dtypes

ID             int64
Promotion     object
purchase       int64
V1             int64
V2           float64
V3           float64
V4             int64
V5             int64
V6             int64
V7             int64
dtype: object

In [109]:
# turn int columns such as 'purchase','V1','V4','V5','V6','V7' to category type
categorical_columns = ['purchase','V1','V4','V5','V6','V7']
for column in categorical_columns:
    data_set[column] = data_set[column].astype('category')

In [145]:
# classify metrincs into different list based on their types
mean_metrics = ['V2','V3']
binomial_metrics = ['purchase']
categorical_metrics = ['V1','V4','V5','V6','V7']
aggr_metric = ['Promotion']

In [110]:
# summary of the dataset
data_set.describe(include = 'all')

Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
count,84534.0,84534,84534.0,84534.0,84534.0,84534.0,84534.0,84534.0,84534.0,84534.0
unique,,2,2.0,4.0,,,2.0,4.0,4.0,2.0
top,,Yes,0.0,1.0,,,2.0,3.0,3.0,2.0
freq,,42364,83494.0,31631.0,,,57450.0,32743.0,21186.0,59317.0
mean,62970.972413,,,,29.9736,0.00019,,,,
std,36418.440539,,,,5.010626,1.000485,,,,
min,1.0,,,,7.104007,-1.68455,,,,
25%,31467.25,,,,26.591501,-0.90535,,,,
50%,62827.5,,,,29.979744,-0.039572,,,,
75%,94438.75,,,,33.344593,0.826206,,,,


In [139]:
# variance of continous data i.e. V2 and V3

def mean_variance(col):
    """
    input:
    df: the dataset we want to calculate variance of mean metrics
    aggr: column to aggregate the measures
    mean_cols: columns we evaluate the means
    
    output:
    an aggregated dataset showcasing the variance of each mean measures of the df dataset
    """
    variance_float_dataset = col.var()
    return variance_float_dataset

In [134]:
# calculate variance for categorical variables
# this stackoverflow link explains how 
# https://stats.stackexchange.com/questions/421307/variance-maybe-of-categorical-data
# a bigger entropy means a more evenly distributed or a smaller variance of the categorical counts of the variable

def category_variance(col):
    """
    input:
    an array of categorical values as one variable
    
    output:
    calculate the entropy value, which is the variance for categorical variable
    reference: https://stats.stackexchange.com/questions/421307/variance-maybe-of-categorical-data
    """
    category_counts = list(col.value_counts())
    total_freq = sum(category_counts)
    alpha = 1
    probs = []

    for count in category_counts:
        p = (count + alpha) / (total_freq + len(category_counts) * alpha)
        probs.append(p)
    
    log_sum = 0
    for p in probs:
        log_sum += p*math.log(p)
    
    entropy = 0 - log_sum    
    return entropy

#categorical_dataset = data_set[['Promotion','V1','V4','V5','V6','V7']]

variance_categorical_dataset = categorical_dataset.groupby("Promotion").agg(category_variance).reset_index()

In [135]:
# variance of binary variable, such as the purchase column
def binomial_variance(col):
    """
    input:
    an array of binomial values as one variable
    
    output:
    return the value of the variance of the array
    reference: https://stats.stackexchange.com/questions/191444/variance-in-estimating-p-for-a-binomial-distribution
    """
    value_counts = list(data_set['purchase'].value_counts())
    sum_freq = value_counts[0] + value_counts[1]
    p = value_counts[0] / sum_freq
    variance = (p * (1-p)) / sum_freq
    return variance

binomial_dataset = data_set[['Promotion','purchase']]
variance_bino_dataset = binomial_dataset.groupby("Promotion").agg(binomial_variance).reset_index()

In [150]:
# put variance of each variable together into one table

def merged(df1, df2, df3):
    """
    input: dataset we want to merged together
    
    output: a merged dataset where we have the variance of each measure
    """
    merged_df = pd.merge(df1, 
         df2, 
         on='Promotion', how='inner')

    df = pd.merge(merged_df, 
         df3, 
         on='Promotion', how='inner')
    # sort the columns
    df = df[['Promotion','purchase','V1','V2','V3','V4','V5','V6','V7']]
    return df

### 1. Increase sample size

Regardless it is mean, binomial or categorical data, the variance of each measure is influenced by the sample size.

Let's randomly take 25%, 50% and 75% of the dataset and calculate the variance of each metric.

In [115]:
data_set.head()

Unnamed: 0,ID,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7
0,1,No,0,2,30.443518,-1.165083,1,1,3,2
1,3,No,0,3,32.15935,-0.645617,2,3,2,2
2,4,No,0,2,30.431659,0.133583,1,1,4,2
3,5,No,0,0,26.588914,-0.212728,2,1,4,2
4,8,Yes,0,3,28.044332,-0.385883,1,1,2,2


In [147]:
def random_sampling(df, prop = 0.25, random_state = 42):
    """
    input: a dataframe and the share we want to randomly sample from the dataset
    
    output: returned a sampled dataset
    """
    sampled_df = df.sample(frac = prop, random_state = random_state)
    
    return sampled_df

In [189]:
# create a list of proportions we want to sample the original dataset
props = [0.05, 0.25, 0.5, 0.75, 1]

# create a dataset containing the variance of each metric at different sampling rate
outcome = pd.DataFrame()

for prop in props:
    df = random_sampling(data_set, prop = prop)
    variance_mean_dataset = df[aggr_metric + mean_metrics].groupby("Promotion").agg(mean_variance).reset_index()
    variance_categorical_dataset = df[aggr_metric + categorical_metrics].groupby("Promotion").agg(category_variance).reset_index()
    variance_binomial_dataset = df[aggr_metric + binomial_metrics].groupby("Promotion").agg(binomial_variance).reset_index()
    merged_df = merged(variance_mean_dataset,variance_categorical_dataset,variance_binomial_dataset)
    merged_df['prop_sampling'] = prop
    merged_df['sample_size'] = df.shape[0]
    outcome = outcome.append(merged_df)


In [190]:
outcome

Unnamed: 0,Promotion,purchase,V1,V2,V3,V4,V5,V6,V7,prop_sampling,sample_size
0,No,1.437455e-07,1.256658,26.010979,1.008264,0.619052,1.245876,1.385353,0.605214,0.05,4227
1,Yes,1.437455e-07,1.263755,25.405717,1.012895,0.625228,1.210419,1.386085,0.617773,0.05,4227
0,No,1.437455e-07,1.259141,24.944699,1.006074,0.62784,1.222414,1.386248,0.609784,0.25,21134
1,Yes,1.437455e-07,1.255337,25.274806,0.992484,0.628158,1.217535,1.386273,0.609498,0.25,21134
0,No,1.437455e-07,1.253815,25.146281,1.000132,0.629407,1.221187,1.386239,0.610107,0.5,42267
1,Yes,1.437455e-07,1.257019,25.110459,0.990591,0.627671,1.216101,1.386261,0.608508,0.5,42267
0,No,1.437455e-07,1.254171,25.105386,1.004506,0.626354,1.220354,1.386285,0.609656,0.75,63400
1,Yes,1.437455e-07,1.258327,25.23668,0.994788,0.627159,1.215619,1.38627,0.608631,0.75,63400
0,No,1.437455e-07,1.257585,24.967657,1.005839,0.626672,1.218653,1.386292,0.608992,1.0,84534
1,Yes,1.437455e-07,1.257582,25.245032,0.996043,0.627665,1.214831,1.386286,0.609865,1.0,84534


### 2. Move towards an even split

We can look at the dataset created from the **increasing sample size**.

### 3. Reduce variance in the metric definition

### 4. Stratification

### 5. CUPED

## Compare the reduced variance of each technique with the original variance

## Visualization

## Conclusion