# ExpAn: Experiment Analysis

ExpAn is a Python library for the statistical analysis of randomised controlled trials (A/B tests). 

The functions are standalone and can be imported and used from within other projects, and from the command line.

The library is Open Source, published under the MIT license here:

[github.com/zalando/expan](https://github.com/zalando/expan)

## Assumptions used in analysis 

1. Sample-size estimation:
  * Treatment does not affect variance
  * Variance in treatment and control is identical
  * Mean of delta is normally distributed
2. Welch t-test:
  * Mean of means is t-distributed (or normally distributed) 
3. In general:
  * Sample represents underlying population
  * Entities are independent
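
The first two groups of assumptions can be made concrete with a short sketch. This is not ExpAn's implementation: the function names `sample_size_per_variant` and `welch_t` are hypothetical, and only the Python standard library is used. The sample-size formula relies on identical variance in both groups and a normally distributed delta, as listed above.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, variance


def sample_size_per_variant(sigma, min_delta, alpha=0.05, power=0.8):
    """Entities needed per variant to detect `min_delta` with a two-sided test.

    Assumes both variants share the standard deviation `sigma` and that the
    difference of means is normally distributed.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / min_delta ** 2)


def welch_t(x, y):
    """Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)  # squared standard error of mean(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


n_per_variant = sample_size_per_variant(sigma=1.0, min_delta=0.5)
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(n_per_variant, round(t_stat, 3), round(dof, 2))
```

Unlike the sample-size formula, Welch's test does not pool the variances, which is why it stays usable when treatment and control differ in spread.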

## Main user stories

As a Data Scientist, I want to perform all the basic analysis routines that are typical of the analysis of an A/B test (a.k.a. Between-Subject Randomised Controlled Trial) while retaining access to the raw data, so that I can also perform custom analyses and answer the questions of stakeholders with little effort.

As an analyst from a different department, I want to be able to bring my own data, and easily be able to use this library to perform analysis.

# Installation

To install the library:

    $ pip install expan

For more information, start with the [README.rst](https://github.com/zalando/expan/blob/master/README.rst).

# ExpAn Architecture


## `data.loaders` separates details of data from the library

Loaders read the raw data (e.g. **`csv_fetcher.py`**) and construct an `Experiment` object.

## `core.experiment` provides the analysis functionality

The **`Experiment`** class provides methods for the analysis of experiment data. Currently only the **`deltaKPI`** computation is supported.

## `core.statistics` contains underlying statistical functions

The **`Statistics`** class provides methods for statistical computations such as:

* **`delta`** - computes the difference of means between the samples (x-y) with confidence intervals
* **`bootstrap`** - bootstraps confidence intervals
* **`chi-square`** - chi-square homogeneity test on categorical data

The class is used by the higher-level `experiment` module, and can be used directly from the CLI by passing in `Array`s.

## `core.binning` implements categorical and numerical binnings

The class keeps binning separate from the data.

The **`Binning`** class has two subclasses, `NumericalBinning` and `CategoricalBinning`. `NumericalBinning` groups data into bins defined by numerical intervals. `CategoricalBinning` bins data into categories. Both provide binning implementations which can be applied to unseen data as well.

## `core.util` contains utility methods shared by other classes

Currently it supports methods for generating random data for performing an experiment.

## `core.version` constructs versioning of the package

# Details of Components

In this section we go into the details of the components and present some usage examples.

## `data.loaders`

Data loaders can be written as needed to handle different formats (CSV, Parquet, HDF5, etc) and different internal structures, so long as they return an `ExperimentData` object.

Currently, only a simple CSV loader (`data.csv_fetcher`) has been implemented.

We'll bypass this and work with synthesized data for now:

In [1]:
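import sys, os
import numpy as np
import pandas as pd

from os.path import join
from expan.core.util import generate_random_data

# Make the local 'tests' directory importable when running from the project root.
sys.path.insert(0, join(os.getcwd(), 'tests'))

# Fix the seed so that the synthesized data is reproducible.
np.random.seed(0)
data, metadata = generate_random_data()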

ExpAn core init: v0.6.1


In [2]:
data.head()

| | entity | variant | normal_same | normal_shifted | feature | normal_shifted_by_feature | treatment_start_time | normal_unequal_variance |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | A | -1.487862 | -0.616148 | non | -1.088533 | 7 | 0.003991 |
| 1 | 1 | B | -1.125186 | 1.783682 | has | 1.167307 | 3 | -3.565511 |
| 2 | 2 | B | 0.388819 | 1.007539 | non | -1.055948 | 1 | 6.704536 |
| 3 | 3 | A | -1.173873 | -0.889252 | non | -0.152459 | 4 | 1.209668 |
| 4 | 4 | A | 1.112634 | 0.434377 | has | 0.175988 | 4 | 0.148207 |


In [3]:
metadata

{'experiment': 'random_data_generation',
 'primary_KPI': 'normal_shifted',
 'source': 'simulated'}

## core.experiment.Experiment

This class provides methods for the analysis of experiment data.

### Constructing `Experiment` 

The `Experiment` class has the following parameters to construct an experiment:

| Parameter | Description |
|---|---|
| **control_variant_name** | Indicates which of the variants is to be considered as a baseline (a.k.a. control). |
| **data** | The data you want to analyse. See above for an example of the data structure. |
| **metadata** | Specifies the experiment name (mandatory) and the data source (optional). |
| **report_kpi_names** | A list of strings specifying desired kpis to analyse (empty list by default). |
| **derived_kpis** | A dictionary structured as **{'name': `<`name_of_the_kpi`>`, 'formula': `<`formula_to_compute_kpi`>`}**, or a list of such dictionaries if more than one derived kpi is wanted (empty list by default). **`<`name_of_the_kpi`>`**: name of the kpi. **`<`formula_to_compute_kpi`>`**: formula to calculate the desired kpi. |
    
**NOTE 1**. Take care to give the derived_kpis dictionary the correct structure, including the keys **'name'** and **'formula'**. Otherwise, constructing the `Experiment` object will raise an exception.

**NOTE 2**. Specify the derived kpi name in the **report_kpi_names** if you want to see the results for it too.

**NOTE 3**. **data** must contain a column **entity**, a column **variant** and one column each for the kpis you defined.

**NOTE 4**. The fields of **metadata** are described below.

`metadata` should contain the following fields. Optional fields are wrapped in brackets.

| Field | Description |
|---|---|
|**`experiment`**| Name of the experiment, as known to stakeholders. It can be anything meaningful to you.|
|**[`sources`]**| Names of the data sources used in the preparation of this data.|

In [4]:
import expan

exp = expan.experiment.Experiment(control_variant_name='A', 
                                  data=data, 
                                  metadata=metadata, 
                                  report_kpi_names=['derived_kpi_one'],
                                  derived_kpis=[{'name':'derived_kpi_one','formula':'normal_same/normal_shifted'}])

In [5]:
print(exp)

Experiment "random_data_generation" with 1 derived kpis, 1 report kpis, 10000 entities and 2 variants: *A*, B


An incorrect input structure (e.g. missing derived_kpis dictionary keys or incorrect kpi keys) will raise an exception.

In [6]:
exp = expan.experiment.Experiment(control_variant_name='A',
                                  data=data, 
                                  metadata=metadata,
                                  report_kpi_names=['normal_shifted', 'normal_same'],
                                  derived_kpis=[{'name':'derived_kpi_1'}])

KeyError: 'Dictionary should have key "formula"'
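
The check behind this error can be pictured with a small stand-alone validator. This is only a sketch of the idea, not ExpAn's code: the function name `validate_derived_kpis` is hypothetical, and the error message is modelled on the one shown above.

```python
def validate_derived_kpis(derived_kpis):
    """Return derived_kpis as a list, raising KeyError on a malformed entry."""
    # Accept a single dictionary as well as a list of dictionaries.
    if isinstance(derived_kpis, dict):
        derived_kpis = [derived_kpis]
    for kpi in derived_kpis:
        for key in ("name", "formula"):
            if key not in kpi:
                raise KeyError('Dictionary should have key "{}"'.format(key))
    return list(derived_kpis)


valid = validate_derived_kpis(
    {"name": "derived_kpi_one", "formula": "normal_same/normal_shifted"}
)

try:
    validate_derived_kpis([{"name": "derived_kpi_1"}])
    error_message = None
except KeyError as exc:
    error_message = exc.args[0]

print(len(valid), error_message)
```

Running such a check on your own input before constructing the `Experiment` can make malformed definitions easier to spot.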

In our data we have two variants and one of them is a baseline or control:

In [7]:
print('Variants: {}'.format(exp.variant_names))
print('Control or baseline variant: {}'.format(exp.control_variant_name))

Variants: set(['A', 'B'])
Control or baseline variant: A


# Now we can start analysing!

## Let's start with a single DeltaKPI of orders:

In [8]:
import warnings
import json

warnings.simplefilter('once', UserWarning)

In [9]:
res_delta = exp.delta()
print(json.dumps(res_delta, indent=2))

{
  "kpis": [
    {
      "variants": [
        {
          "delta_statistics": {
            "treatment_mean": -4.572524000045541, 
            "control_sample_size": 6108, 
            "treatment_sample_size": 6108, 
            "delta": 0.0, 
            "confidence_interval": [
              {
                "percentile": 2.5, 
                "value": -6.445256794169719
              }, 
              {
                "percentile": 97.5, 
                "value": 6.445256794169717
              }
            ], 
            "control_mean": -4.572524000045541
          }, 
          "name": "A"
        }, 
        {
          "delta_statistics": {
            "treatment_mean": -0.007948584804651233, 
            "control_sample_size": 6108, 
            "treatment_sample_size": 3892, 
            "delta": 4.564575415240889, 
            "confidence_interval": [
              {
                "percentile": 2.5, 
                "value": -1.1450816040987393
              }, 
     

### Interpreting the output

| Metric | Description |
|---|---|
|**`treatment_mean`**| the mean of the treatment group |
|**`control_mean`**| the mean of the control group |
|**`control_sample_size`**| the sample size for the control group |
|**`treatment_sample_size`**| the sample size for the treatment group |
|**`delta`**| the difference between the treatment_mean and control_mean |
|**`confidence_interval`**| the confidence interval: **`percentile`** - lower percentile and upper percentile; **`value`** - value for each percentile |
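
The nested result can be flattened into one row per variant with a few lines of plain Python. The sketch below assumes only the keys visible in the output above (`kpis`, `variants`, `name`, `delta_statistics`, `delta`, `confidence_interval`, `percentile`, `value`). The helper `summarise_delta` and the `excludes_zero` flag are hypothetical additions for illustration, and the sample dictionary copies the values printed for variant A.

```python
def summarise_delta(result):
    """Flatten a delta result into a list of per-variant summary rows."""
    rows = []
    for kpi in result["kpis"]:
        for variant in kpi["variants"]:
            stats = variant["delta_statistics"]
            # Map each percentile to its value, e.g. {2.5: ..., 97.5: ...}.
            ci = {c["percentile"]: c["value"] for c in stats["confidence_interval"]}
            lower, upper = min(ci.values()), max(ci.values())
            rows.append({
                "variant": variant["name"],
                "delta": stats["delta"],
                "ci": ci,
                # A common reading: an interval that excludes 0 suggests a real difference.
                "excludes_zero": lower > 0 or upper < 0,
            })
    return rows


sample_result = {
    "kpis": [{
        "variants": [{
            "name": "A",
            "delta_statistics": {
                "delta": 0.0,
                "confidence_interval": [
                    {"percentile": 2.5, "value": -6.445256794169719},
                    {"percentile": 97.5, "value": 6.445256794169717},
                ],
            },
        }],
    }],
}

summary = summarise_delta(sample_result)
print(summary[0]["variant"], summary[0]["delta"], summary[0]["excludes_zero"])
```

Comparing the control variant with itself gives a delta of 0 and an interval centred on 0, which is why variant A is never flagged here.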

Currently **`deltaKPI`** supports 4 methods to compute `delta`: `fixed_horizon` (default), `group_sequential`, `bayes_factor` and `bayes_precision`. Each method takes its own additional parameters.

**`fixed_horizon`** is the default method, with the following default parameters:

* `assume_normal=True` - specifies whether normal distribution assumptions can be made.
* `percentiles=[2.5, 97.5]` - list of percentile values for confidence bounds.
* `min_observations=20` - minimum number of observations needed.
* `nruns=10000` - number of bootstrap runs; only used if `assume_normal` is False.
* `relative=False` - if relative==True, then the values will be returned as distances below and above the mean, respectively, rather than the absolute values. 
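
To illustrate how `percentiles` and `min_observations` could interact under the normal assumption, here is a minimal standard-library sketch. It is an approximation written for this guide, not ExpAn's code: the function `normal_delta_ci` is hypothetical, it uses normal rather than t quantiles, and returning NaN bounds when there are too few observations is an assumption.

```python
from math import isnan, nan, sqrt
from statistics import NormalDist, mean, variance


def normal_delta_ci(treatment, control, percentiles=(2.5, 97.5), min_observations=20):
    """Difference of means plus the values at the requested percentiles."""
    delta = mean(treatment) - mean(control)
    if min(len(treatment), len(control)) < min_observations:
        # Assumption: too little data yields undefined bounds.
        return delta, {p: nan for p in percentiles}
    # Standard error of the difference, without pooling the variances.
    se = sqrt(variance(treatment) / len(treatment) + variance(control) / len(control))
    std_normal = NormalDist()
    return delta, {p: delta + std_normal.inv_cdf(p / 100.0) * se for p in percentiles}


control = [float(i % 4) for i in range(40)]   # 0, 1, 2, 3 repeated
treatment = [v + 1.0 for v in control]        # same spread, shifted by 1

delta, bounds = normal_delta_ci(treatment, control)
_, too_few = normal_delta_ci(treatment[:5], control[:5])
print(delta, bounds, isnan(too_few[2.5]))
```

Because the bounds are keyed by percentile, asking for a single percentile gives a one-sided bound with the same function.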


**`group_sequential`**:
* `spending_function='obrien_fleming'` - currently only the O'Brien-Fleming alpha spending function is supported for the frequentist early stopping decision.
* `estimated_sample_size=None` - sample size to be achieved towards the end of experiment.
* `alpha=0.05` - type-I error rate
* `cap=8` - upper bound of the adapted z-score
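
The parameters above can be related to each other with a sketch of the O'Brien-Fleming-type alpha spending function, alpha(t) = 2 - 2 * Phi(z_(1 - alpha/2) / sqrt(t)), where t is the fraction of the estimated sample size collected so far. The function names and the way `cap` is applied are assumptions made for illustration. ExpAn's own code may differ.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obrien_fleming_alpha(information_fraction, alpha=0.05):
    """Type-I error that may be spent once a fraction t of the data is in."""
    if not 0 < information_fraction <= 1:
        raise ValueError("information_fraction must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 - 2 * _STD_NORMAL.cdf(z / sqrt(information_fraction))


def z_bound(current_size, estimated_sample_size, alpha=0.05, cap=8):
    """Two-sided z-score threshold at the current sample size, capped at `cap`."""
    # Assumption: collecting more than the estimated size counts as t = 1.
    t = min(1.0, current_size / estimated_sample_size)
    p = 1 - obrien_fleming_alpha(t, alpha) / 2
    if p >= 1.0:
        # The spent alpha underflowed to zero; fall back to the cap.
        return float(cap)
    return min(float(cap), _STD_NORMAL.inv_cdf(p))


early = z_bound(10, 1000)      # very early look: threshold hits the cap
quarter = z_bound(250, 1000)   # a quarter of the data collected
final = z_bound(1000, 1000)    # full sample: the usual two-sided threshold
print(early, round(quarter, 3), round(final, 3))
```

Early looks get a very strict threshold, which is what makes repeated peeking safe, and the cap keeps that threshold finite when almost no data has been collected.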


**`bayes_factor`**
* `distribution='normal'` - name of the KPI distribution model, which assumes a Stan model file with the same name exists.
* `num_iters=25000` - number of iterations of bayes sampling.

**`bayes_precision`**
* `distribution='normal'` - name of the KPI distribution model, which assumes a Stan model file with the same name exists.
* `posterior_width=0.08` - the stopping criterion, threshold of the posterior width.
* `num_iters=25000` - number of iterations of bayes sampling.

If you would like to change any of the default values, just pass them as parameters to delta:

In [10]:
delta_freq = exp.delta(method='fixed_horizon', assume_normal=True, percentiles=[2.5, 99.5])

In [11]:
delta_g_s = exp.delta(method='group_sequential', estimated_sample_size=1000)

In [12]:
delta_bayes_factor = exp.delta(method='bayes_factor', distribution='normal')

In [13]:
print(json.dumps(delta_bayes_factor, indent=2))

{
  "kpis": [
    {
      "variants": [
        {
          "delta_statistics": {
            "treatment_sample_size": 6108, 
            "control_sample_size": 6108, 
            "treatment_mean": -4.572524000045541, 
            "delta": 0.0, 
            "number_of_iterations": 25000, 
            "confidence_interval": [
              {
                "percentile": 2.5, 
                "value": -8.396765530428699
              }, 
              {
                "percentile": 97.5, 
                "value": 3.5794735964894677
              }
            ], 
            "stop": true, 
            "control_mean": -4.572524000045541
          }, 
          "name": "A"
        }, 
        {
          "delta_statistics": {
            "treatment_sample_size": 3892, 
            "control_sample_size": 6108, 
            "treatment_mean": -0.007948584804651233, 
            "delta": 4.564575415240889, 
            "number_of_iterations": 25000, 
            "confidence_interval": [
    

## Using Bootstrapping:

We implement bootstrapping for data which is not normally distributed.

We switch the flag `assume_normal` to False for the `delta` function:

In [14]:
res_delta = exp.delta(assume_normal=False)
print(json.dumps(res_delta, indent=2))

{
  "kpis": [
    {
      "variants": [
        {
          "delta_statistics": {
            "treatment_mean": -4.572524000045541, 
            "control_sample_size": 6108, 
            "treatment_sample_size": 6108, 
            "delta": 0.0, 
            "confidence_interval": [
              {
                "percentile": 2.5, 
                "value": -6.451719231491042
              }, 
              {
                "percentile": 97.5, 
                "value": 6.412118814781721
              }
            ], 
            "control_mean": -4.572524000045541
          }, 
          "name": "A"
        }, 
        {
          "delta_statistics": {
            "treatment_mean": -0.007948584804651233, 
            "control_sample_size": 6108, 
            "treatment_sample_size": 3892, 
            "delta": 4.564575415240889, 
            "confidence_interval": [
              {
                "percentile": 2.5, 
                "value": 0.06194124312058608
              }, 
     

You may not notice it here, but bootstrapping takes considerably longer than assuming normality. Unless there is an explicit reason to use it, it is almost always better to leave it off.

## core.binning

Defines a `Binning` class that represents a particular binning of data, such that the same binning can then be applied to unseen data.

Numerical and categorical binnings are defined.

The implementation tries to handle skewed data.

In [15]:
a = exp.data[exp.data.variant == 'A']
b = exp.data[exp.data.variant == 'B']

### Now we create the binning

This simply determines the thresholds appropriate for creating the requested number of bins...

In [16]:
import expan.core.binning as binning

bins = binning.create_binning(a.loc[:,'treatment_start_time'])

print(bins)

NumericalBinning with 8 bins:
 0: [0.0,1.0)
 1: [1.0,2.0)
 2: [2.0,3.0)
 3: [3.0,4.0)
 4: [4.0,5.0)
 5: [5.0,6.0)
 6: [6.0,8.0)
 7: [8.0,9.0]
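
One way to picture how such thresholds might be found is equal-frequency binning: cut the sorted values at quantiles and merge duplicate edges, so that heavily repeated values produce fewer, wider bins. The sketch below is an assumption about the general technique, not a description of what `create_binning` does internally. The helper `equal_frequency_edges` is hypothetical.

```python
from statistics import quantiles


def equal_frequency_edges(values, nbins):
    """Bin edges holding roughly equal numbers of values, without duplicates."""
    data = sorted(values)
    cuts = quantiles(data, n=nbins, method="inclusive") if nbins > 1 else []
    edges = [float(data[0])] + [float(c) for c in cuts] + [float(data[-1])]
    # Skewed data can yield repeated cut points; keep each edge only once.
    unique = []
    for edge in edges:
        if not unique or edge > unique[-1]:
            unique.append(edge)
    return unique


uniform_edges = equal_frequency_edges(range(100), nbins=4)
skewed_edges = equal_frequency_edges([0] * 90 + list(range(1, 11)), nbins=4)
print(uniform_edges)
print(skewed_edges)
```

With the skewed input, 90% of the values are identical, so the four requested bins collapse into a single interval instead of producing empty bins.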


### We can *apply* this binning to the same data:

In [17]:
a_bins = bins.label(a.treatment_start_time)

pd.DataFrame(a_bins).head(10)

| | 0 |
|---|---|
| 0 | [6.0,8.0) |
| 1 | [4.0,5.0) |
| 2 | [4.0,5.0) |
| 3 | [3.0,4.0) |
| 4 | [2.0,3.0) |
| 5 | [8.0,9.0] |
| 6 | [0.0,1.0) |
| 7 | [0.0,1.0) |
| 8 | [6.0,8.0) |
| 9 | [4.0,5.0) |


### And we can *apply* it to different data:

In [18]:
b_bins = bins.label(b.treatment_start_time)

pd.DataFrame(b_bins).head(10)

| | 0 |
|---|---|
| 0 | [3.0,4.0) |
| 1 | [1.0,2.0) |
| 2 | [4.0,5.0) |
| 3 | [8.0,9.0] |
| 4 | [8.0,9.0] |
| 5 | [5.0,6.0) |
| 6 | [6.0,8.0) |
| 7 | [5.0,6.0) |
| 8 | [8.0,9.0] |
| 9 | [8.0,9.0] |


Note that there is a hidden 'catch-all' bin...

This is implemented as the last entry in the arrays, making indexing very easy: an unknown bin is always -1.

In [19]:
bins.uppers

array([ 1.,  2.,  3.,  4.,  5.,  6.,  8.,  9.])

In [20]:
bins._uppers

array([  1.,   2.,   3.,   4.,   5.,   6.,   8.,   9.,  nan])
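
The indexing trick can be sketched in plain Python. Values that fall in no bin get index -1, and because the label list carries one extra trailing entry, `labels[-1]` resolves to the catch-all without any special casing. The function `bin_index` and the label text `'unknown'` are hypothetical. The bin edges are copied from the binning printed above.

```python
import math

# Edges copied from the NumericalBinning shown earlier.
LOWERS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0]
UPPERS = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0]

# Regular labels, with the last interval closed on both sides.
LABELS = ["[{},{})".format(lo, up) for lo, up in zip(LOWERS, UPPERS)]
LABELS[-1] = "[8.0,9.0]"
# Extra trailing entry for the catch-all bin (hypothetical label text).
LABELS.append("unknown")


def bin_index(x):
    """Index of the bin containing x, or -1 if x belongs to no bin."""
    if x is None or math.isnan(x):
        return -1
    last = len(UPPERS) - 1
    for i, (lo, up) in enumerate(zip(LOWERS, UPPERS)):
        # Bins are half-open, except the last one, which includes its upper edge.
        if lo <= x < up or (i == last and x == up):
            return i
    return -1


values = [7.0, 9.0, 12.0, float("nan")]
labelled = [LABELS[bin_index(v)] for v in values]
print(labelled)
```

Negative indexing does the work here: -1 always points at the final element, so the catch-all needs no separate lookup.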

### Bin labels can be arbitrarily formatted:

Without running the binning algorithm on the data again.

In [21]:
print(bins)

NumericalBinning with 8 bins:
 0: [0.0,1.0)
 1: [1.0,2.0)
 2: [2.0,3.0)
 3: [3.0,4.0)
 4: [4.0,5.0)
 5: [5.0,6.0)
 6: [6.0,8.0)
 7: [8.0,9.0]


In [22]:
print(bins.__str__('{conditions}'))

NumericalBinning with 8 bins:
 0: 0.0<=x<1.0
 1: 1.0<=x<2.0
 2: 2.0<=x<3.0
 3: 3.0<=x<4.0
 4: 4.0<=x<5.0
 5: 5.0<=x<6.0
 6: 6.0<=x<8.0
 7: 8.0<=x<=9.0


In [23]:
print(bins.__str__('{iter.uppercase}'))

NumericalBinning with 8 bins:
 0: A
 1: B
 2: C
 3: D
 4: E
 5: F
 6: G
 7: H


In [24]:
print(bins.__str__('{iter.uppercase}: From {lo:.2f} \t To {up:.2f}'))

NumericalBinning with 8 bins:
 0: A: From 0.00 	 To 1.00
 1: B: From 1.00 	 To 2.00
 2: C: From 2.00 	 To 3.00
 3: D: From 3.00 	 To 4.00
 4: E: From 4.00 	 To 5.00
 5: F: From 5.00 	 To 6.00
 6: G: From 6.00 	 To 8.00
 7: H: From 8.00 	 To 9.00
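
Templates like these can be built on Python's `str.format`, which supports both format specifications (`{lo:.2f}`) and attribute access (`{iter.uppercase}`). The sketch below is a hypothetical re-creation of the idea and covers only the `lo`, `up` and `iter.uppercase` placeholders used above. The class `_Counter` and the function `format_labels` are not part of ExpAn.

```python
from string import ascii_uppercase


class _Counter:
    """Exposes the running bin number in several styles for use in templates."""

    def __init__(self, index):
        self.integer = index
        self.uppercase = ascii_uppercase[index]  # supports up to 26 bins


def format_labels(lowers, uppers, template):
    """Render one label per bin from a str.format template."""
    return [
        template.format(lo=lo, up=up, iter=_Counter(i))
        for i, (lo, up) in enumerate(zip(lowers, uppers))
    ]


labels = format_labels(
    [0.0, 1.0, 6.0],
    [1.0, 2.0, 8.0],
    "{iter.uppercase}: From {lo:.2f} To {up:.2f}",
)
print(labels)
```

Since only the template changes, relabelling never requires recomputing the bin edges.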


## core.statistics

Here the underlying statistical functions are implemented. These are used by the higher-level `experiment` module, and can indeed be used directly by passing in NumPy `Array`s.

The more interesting functions are:

### `bootstrap`

Bootstraps the Confidence Intervals for a particular function comparing two samples. NaNs are ignored (discarded before calculation).

This function, as well as others such as `normal_sample_difference` and `delta`, takes as input a list of percentiles and returns the values corresponding to those percentiles. This implementation is very general: it lets us use the same functions for one-sided as well as two-sided tests, and recreate an output distribution more exactly (e.g. if we want to graphically depict more than just the 95% confidence interval).
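
A percentile bootstrap of the difference of means can be sketched with the standard library alone. This is a simplified illustration rather than ExpAn's `bootstrap`: the name `bootstrap_delta`, its `seed` parameter and the nearest-rank percentile lookup are assumptions. NaNs are discarded before resampling, as described above.

```python
import math
import random


def bootstrap_delta(x, y, percentiles=(2.5, 97.5), nruns=1000, seed=None):
    """Percentile-bootstrap values for mean(x) - mean(y), ignoring NaNs."""
    rng = random.Random(seed)
    x = [v for v in x if not math.isnan(v)]
    y = [v for v in y if not math.isnan(v)]
    deltas = []
    for _ in range(nruns):
        # Resample each group with replacement, keeping its original size.
        xs = rng.choices(x, k=len(x))
        ys = rng.choices(y, k=len(y))
        deltas.append(sum(xs) / len(xs) - sum(ys) / len(ys))
    deltas.sort()

    def value_at(p):
        # Nearest-rank lookup in the sorted bootstrap distribution.
        return deltas[int(round(p / 100.0 * (len(deltas) - 1)))]

    return {p: value_at(p) for p in percentiles}


gen = random.Random(0)
x = [gen.gauss(1.0, 1.0) for _ in range(200)]
y = [gen.gauss(0.0, 1.0) for _ in range(200)] + [float("nan")]

two_sided = bootstrap_delta(x, y, nruns=500, seed=1)
one_sided = bootstrap_delta(x, y, percentiles=(5.0,), nruns=500, seed=1)
print(two_sided, one_sided)
```

The second call shows the generality mentioned above: a one-sided bound is just a different list of percentiles passed to the same function.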

### `delta`

Uses either bootstrap or standard normal assumptions to compute the difference between two arrays.

## core.util

A `commons` module: contains utility methods shared by other classes.

Currently it supports methods for generating random data for performing an experiment.

# That's it! Try it out for yourself:


[github.com/zalando/expan](https://github.com/zalando/expan)