In [24]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Import data


Data source: https://github.com/nirupamaprv/Analyze-AB-test-Results

In [2]:
df = pd.read_csv('./data/ab_edited.csv')
df.head()

| Unnamed: 0 | user_id | timestamp | group | landing_page | converted |
|---|---|---|---|---|---|
| 0 | 851104 | 2017-01-21 22:11:48.556739 | control | old_page | 0 |
| 1 | 804228 | 2017-01-12 08:01:45.159739 | control | old_page | 0 |
| 2 | 661590 | 2017-01-11 16:55:06.154213 | treatment | new_page | 0 |
| 3 | 853541 | 2017-01-08 18:28:03.143765 | treatment | new_page | 0 |
| 4 | 864975 | 2017-01-21 01:52:26.210827 | control | old_page | 1 |


In [4]:
old = df.query("group == 'control'")['converted']
new = df.query("group == 'treatment'")['converted']

# A/B Test (using `statsmodels`)

In [9]:
convert_old = sum(old)
convert_new = sum(new)
n_old = len(df.query("group == 'control'"))
n_new = len(df.query("group == 'treatment'"))

In [12]:
# alternative='smaller' tests H1: p_old < p_new (the new page converts better)
z_score, p_value = sm.stats.proportions_ztest([convert_old, convert_new], [n_old, n_new], alternative='smaller')
z_score, p_value

(1.3116075339133115, 0.905173705140591)
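
To make the test above less of a black box, here is a minimal sketch of a pooled two-proportion z-test written with the standard library only. The function name `two_proportion_ztest` and the counts in the usage lines are hypothetical, and the sketch assumes the pooled-variance form of the test with the lower tail as the one-sided alternative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(count_x, n_x, count_y, n_y):
    """Pooled two-proportion z-test; returns (z, p-value for H1: p_x < p_y)."""
    p_x = count_x / n_x
    p_y = count_y / n_y
    # Pooled conversion rate under H0: p_x == p_y
    p_pool = (count_x + count_y) / (n_x + n_y)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_x + 1 / n_y))
    z = (p_x - p_y) / se
    # One-sided alternative p_x < p_y: lower tail of the standard normal
    p_value = NormalDist().cdf(z)
    return z, p_value


# Hypothetical counts, not taken from the dataset
z, p = two_proportion_ztest(count_x=120, n_x=1000, count_y=100, n_y=1000)
print(z, p)
```

A positive `z` means the first group converts more often in the sample, which pushes the lower-tail p-value above 0.5, the same pattern as in the output above.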

# A/B Test (using `numpy`)

**Assumptions**:
- X and Y are independent
- X and Y have the same variance $\sigma^2$
- X and Y are normally distributed (for binary conversions this holds only approximately, for the sample means at large $N$)
---

- For the 2nd assumption: perform Levene's test
    * Null hypothesis: the samples have equal variances
    * Test statistic:

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/d101da8a32b3a16edae40d04fa996e47dee6c120" width="300">

In [14]:
from scipy.stats import levene
levene(old, new)

LeveneResult(statistic=1.7203126672199422, pvalue=0.18965383907560016)

- **p-value**: 0.19, larger than the significance level of .05
- **Conclusion**: cannot reject the null hypothesis, so X and Y are treated as having the same population variance
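
For reference, the statistic behind Levene's test can be sketched by hand. The version below assumes median centering (the Brown-Forsythe variant, which is also the default `center='median'` in `scipy.stats.levene`); the helper name `levene_w` and the toy samples are hypothetical. It returns only the statistic $W$, which under the null hypothesis is compared against an $F(k-1, N-k)$ distribution.

```python
from statistics import mean, median


def levene_w(*groups):
    """Levene's W statistic with median centering."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group median
    z = []
    for g in groups:
        center = median(g)
        z.append([abs(x - center) for x in g])
    z_group_means = [mean(zi) for zi in z]
    z_grand_mean = sum(sum(zi) for zi in z) / n_total
    # Between-group and within-group sums of squares of the deviations
    between = sum(len(zi) * (m - z_grand_mean) ** 2
                  for zi, m in zip(z, z_group_means))
    within = sum((v - m) ** 2
                 for zi, m in zip(z, z_group_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Toy samples, not the notebook's data
w = levene_w([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(w)
```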

- Calculate $\bar{X}$ and $\bar{Y}$

In [113]:
mean_old = np.mean(old) 
mean_new = np.mean(new) 
mean_old, mean_new

(0.1203863045004612, 0.11880724790277405)

- Calculate $S_x$ and $S_y$

In [114]:
s_old = np.std(old, ddof = 1) 
s_new = np.std(new, ddof = 1) 
s_old, s_new

(0.32541384592046235, 0.32356267742508843)

- Calculate $N_x$ and $N_y$

In [115]:
n_old = len(old) 
n_new = len(new) 
n_old, n_new

(145274, 145311)

If $N_x \neq N_y$:

Pooled standard deviation: $S_p = \sqrt{\frac{(N_x-1)S_x^2+(N_y-1)S_y^2}{N_x+N_y-2}}$

Standard error of the difference in means: $S_{\bar{X} - \bar{Y}} = S_p\sqrt{1/N_x+1/N_y}$

Test statistic: $T = \frac{\bar{X} - \bar{Y}}{S_{\bar{X} - \bar{Y}}}$


Distribution: $T \sim t(N_x+N_y-2)$

---
If $N_x = N_y = N$:

Pooled standard deviation: $S_p = \sqrt{\frac{S_x^2+S_y^2}{2}}$

Standard error of the difference in means: $S_{\bar{X} - \bar{Y}} = S_p\sqrt{2/N}$

Test statistic: $T = \frac{\bar{X} - \bar{Y}}{S_{\bar{X} - \bar{Y}}}$


Distribution: $T \sim t(2N-2)$

---

- Calculate pooled standard deviation

In [116]:
s_p = np.sqrt(( (n_old - 1) * s_old * s_old + (n_new - 1) * s_new * s_new) / (n_old + n_new - 2))
s_p

0.3244894639038153

- Calculate the test statistic

In [117]:
T = (mean_old - mean_new) /(s_p * np.sqrt(1 / n_old + 1 / n_new))
T

1.3116069027029211

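- Calculate p-value (single-tailed). With $N_x+N_y-2$ degrees of freedom this large, the t distribution is practically standard normal, so the normal CDF is used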

<img src="http://www.ttable.org/uploads/2/1/7/9/21795380/published/9754276.png?1517416376" width="400">

In [118]:
from scipy.stats import norm
print(norm.cdf(T))

0.9051735985980927


Conclusion: **CANNOT** reject $H_0$ that the old and new pages have the same conversion rate
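
The manual steps above can be collected into one helper. This is a sketch using only the standard library; the function name `pooled_t_statistic` is hypothetical, and the inputs are the means, standard deviations and sample sizes copied from the cell outputs above. As in the notebook, the lower-tail p-value is approximated with the normal CDF, which assumes the degrees of freedom are large.

```python
from math import sqrt
from statistics import NormalDist


def pooled_t_statistic(mean_x, mean_y, s_x, s_y, n_x, n_y):
    """Two-sample t statistic with pooled standard deviation; returns (T, df)."""
    s_p = sqrt(((n_x - 1) * s_x**2 + (n_y - 1) * s_y**2) / (n_x + n_y - 2))
    se_diff = s_p * sqrt(1 / n_x + 1 / n_y)
    return (mean_x - mean_y) / se_diff, n_x + n_y - 2


# Values copied from the cell outputs above
T, df = pooled_t_statistic(
    mean_x=0.1203863045004612, mean_y=0.11880724790277405,
    s_x=0.32541384592046235, s_y=0.32356267742508843,
    n_x=145274, n_y=145311,
)
# Normal approximation to t(df), reasonable only because df is very large
p_lower = NormalDist().cdf(T)
print(T, df, p_lower)
```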

# Calculate Sample size / Power


<img src="https://i.ytimg.com/vi/70uNTAP1J-I/maxresdefault.jpg" width="400">

Example: https://www.evanmiller.org/ab-testing/sample-size.html

In [16]:
p_baseline = 0.2
effect_size = 0.05
sig = 0.95  # confidence level (alpha = 0.05), assume two-tail
sample_size = 1030

- Look up table: $Z_{critical} = Z(\alpha/2) = 1.96$ for $\alpha = 0.05$

    
- Generally:
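    - $\sigma^2(\bar X - \bar Y) = 2\sigma^2(\bar X) = 2\sigma^2(\frac{x_1 + x_2 + ... + x_N}{N}) = \frac{2\sigma^2_x}{N}$ (equal group size $N$, equal variance $\sigma^2_x$)
    - $\bar X - \bar Y \sim N(\cdot, \sigma_x \sqrt{2/N})$, with $\sigma_x \sqrt{2/N}$ the standard deviation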
    
      
- for single observation
    - $\sigma^2(x) = p(1-p)$
    - $s^2(x) = \hat p (1-\hat p)$
    
    
- for average/ratio/proportion:
    - $\hat p = \bar X = (x_1 + x_2 + ... + x_N) / N $
    - $Var(\hat p) = Var(\bar X) = \frac{p(1-p)}{N}$
    - $s^2(\hat p) = s^2(\bar X) =\frac{\hat p (1-\hat p)}{N}$


- for difference in average/ratio/proportion:
    - $\bar X$ and $\bar Y$: $Var(\bar X - \bar Y) = \frac{2p(1-p)}{N} $
    - $\bar X$ and $\bar Y$: $s^2(\bar X - \bar Y) = \frac{2\hat p(1-\hat p)}{N} $
    
    
- Calculate power of test
    - Refer to the figure above
    - $ ES_{N(0,1)}  = \frac{ES}{s_x}\sqrt{\frac{N}{2}} = \frac{ES}{s_{\bar X - \bar Y}}$
    - red_shade = $\Phi(Z_{critical} - ES_{N(0,1)})$, the Type II error rate $\beta$, where $\Phi$ is the standard normal CDF
    - power = 1 - red_shade
    
    

In [17]:
s_x = np.sqrt(p_baseline * (1 - p_baseline))
s_x

0.4

In [18]:
s_ES =  s_x * np.sqrt( 2 / sample_size)
s_ES

0.01762610596956927

In [19]:
effect_size_N_0_1 = effect_size / s_ES
effect_size_N_0_1

2.836701429477554

In [20]:
phi_value = 1.96 - effect_size_N_0_1
phi_value

-0.8767014294775541

In [25]:
red_shade = norm.cdf(phi_value)
red_shade

0.19032441506917608

In [26]:
power = 1 - red_shade
power

0.8096755849308239

# Case Example

## Problem Statement
- Given a feature change in the Facebook app, evaluate whether the change improves user activity.
- Given a UI component change (e.g., button color) on a page, evaluate whether more users click.
- Given a pop-up message, evaluate whether users still go on to enroll in the program.
- Given a new firewall feature in GCP

http://rajivgrover1984.blogspot.com/2015/11/ab-testing-overview.html
>*For example: An online education company tested a change where if the student clicked "start free trial", they were asked how much time they had available to devote to the course. If the student indicated 5 or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than 5 hours per week, a message would appear indicating that these courses usually require a greater time commitment for successful completion, and suggesting that the student might like to access the course materials for free.*

## Choose Subject (Unit of diversion)


Possible choices:
- User id
- Cookie
- Event
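
Whatever unit is chosen, the same unit should land in the same group every time it is seen. One common way to get that is to hash the unit's identifier; the sketch below is only an illustration, and the function name `assign_group`, the experiment label and the 50/50 share are all assumptions rather than anything taken from the dataset.

```python
import hashlib


def assign_group(unit_id, experiment="exp_demo", treatment_share=0.5):
    """Deterministically map a diversion unit (user id, cookie, ...) to a group."""
    key = f"{experiment}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # First 8 hex digits give an integer in [0, 2**32); scale it to [0, 1)
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


# The same id always maps to the same group
print(assign_group(851104), assign_group(851104))
```

Diverting by event would instead hash an event identifier, so one user could see both variants across events.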

## Choose metric

Example of pop-up message and program enrollment

Metrics that should **NOT** change:

- Number of cookies (unique cookies visiting the page)
- Number of clicks on the button (since the message is shown only after clicking)

Metrics that **MAY** change:


- $p = \frac{Number\ of\ users\ actually\ enrolled}{Number\ of\ users\ clicking\ button}$


- $p = \frac{Number\ of\ users\ remaining\ enrolled\ for\ 14\ days}{Number\ of\ users\ clicking\ button}$

## Choose size and duration

See the sample size / power calculation above

## Perform sanity check and result analysis

See above
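
One possible sanity check on an invariant metric is whether the traffic split matches the design. The sketch below assumes the intended split was 50/50, which the source does not state, and uses a normal approximation to the binomial; the function name `split_sanity_check` is hypothetical, and the group sizes are the ones printed earlier in the notebook.

```python
from math import sqrt
from statistics import NormalDist


def split_sanity_check(n_control, n_treatment, expected_share=0.5, alpha=0.05):
    """Check whether the control share is consistent with the intended split."""
    n_total = n_control + n_treatment
    observed = n_control / n_total
    # Normal approximation to the binomial under the intended split
    se = sqrt(expected_share * (1 - expected_share) / n_total)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    lower, upper = expected_share - margin, expected_share + margin
    return observed, (lower, upper), lower <= observed <= upper


# Group sizes printed earlier in the notebook; 50/50 design is an assumption
observed, interval, passed = split_sanity_check(145274, 145311)
print(observed, interval, passed)
```

If the observed share falls outside the interval, the assignment mechanism is worth investigating before reading anything into the conversion metrics.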

# Network effect

- Sample consistency: for example, in GCP, two users in one collaboration group may see two different features. Or consider adding a video chat feature, which only works if both sides have access to it.
- Sample independence (spillover effect): for example, Facebook has many connected components, so groups A and B are no longer independent.

- Possible solution: community (cluster) based A/B test that partitions nodes into groups. For a dating app with no prior connections, demographic or geographical attributes could be used instead.
- Each **cluster** is assigned a treatment, so spillover from control to treatment is unlikely
- The number of units of analysis is reduced, which results in higher variance
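
Cluster-level assignment can be sketched as follows. The helper name `assign_clusters` and the membership mapping are hypothetical; in practice the clusters would come from a graph partition or from the demographic or geographical attributes mentioned above.

```python
import random


def assign_clusters(user_to_cluster, seed=0):
    """Randomize at the cluster level; every user inherits the cluster's group."""
    rng = random.Random(seed)
    clusters = sorted(set(user_to_cluster.values()))
    rng.shuffle(clusters)
    half = len(clusters) // 2
    # First half of the shuffled clusters goes to treatment, the rest to control
    cluster_group = {
        c: ("treatment" if i < half else "control")
        for i, c in enumerate(clusters)
    }
    return {user: cluster_group[c] for user, c in user_to_cluster.items()}


# Hypothetical membership of six users in four clusters
membership = {"u1": "c1", "u2": "c1", "u3": "c2", "u4": "c2", "u5": "c3", "u6": "c4"}
assignment = assign_clusters(membership)
print(assignment)
```

Because randomization happens over four clusters rather than six users, the analysis has fewer independent units, which is the variance cost noted in the list above.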

Ref: http://web.media.mit.edu/~msaveski/projects/2016_network-ab-testing.html

<img src="http://web.media.mit.edu/~msaveski/assets/projects/2016_network-ab/main.png" width="500">
