A/B Testing

In [14]:
import pandas as pd
import datetime
from random import sample
from random import seed 

A/B testing follows this structure:
- Background
- Prerequisites
- Experiment design
- Running Experiment
- Result to Decision
- Post-launch monitoring

# Background
First it is important to understand the context of the use case at hand and define the hypothesis clearly.

For example:

Showing similar products in the shopping cart during an online purchase could increase revenue, but it could also distract customers, who may then delay or abandon the purchase.

# Prerequisites
Before running the experiments, the following need to be clarified:
- Objective and key metrics
- Product Variants
- Randomization units

## Key Metrics:
- Revenue per user
- Number of users in the control and treatment groups (should be similar)

## Product Variants:
- 1 control group
- 1 first treatment group
- 1 second treatment group
- etc.

## Randomization Units:
- Assumption: there are enough users (i.e. a large enough total number of observations) in our experiment
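
A minimal sketch of user-level random assignment, assuming the user is the randomization unit and that the variants should end up with near-equal sizes. The function name `assign_variants`, the variant labels and the seed value are illustrative assumptions and are not taken from the experiment described in this notebook.

```python
from random import Random


def assign_variants(user_ids, variants, seed_value=2):
    """Randomly assign each user to one variant, keeping group sizes balanced."""
    rng = Random(seed_value)   # local generator, for reproducible results
    shuffled = list(user_ids)
    rng.shuffle(shuffled)      # put the users in random order
    # Deal the shuffled users to the variants in turn (round-robin).
    return {uid: variants[i % len(variants)] for i, uid in enumerate(shuffled)}


# Hypothetical example: 9 users, 1 control group and 2 treatment groups.
variants = ["control", "treatment_1", "treatment_2"]
assignment = assign_variants(range(1, 10), variants)
group_sizes = {v: list(assignment.values()).count(v) for v in variants}
print(group_sizes)
```

Shuffling once and dealing round-robin guarantees that group sizes differ by at most one user, which an independent coin flip per user would not.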

# Experiment Design
To design the experiment, we first need to ask a few questions to target the right users in the experiment:

## Target population:
Who is the target population? Do we want to target all the users or a specific segment of users? To answer this question, it is possible to analyze the user's journey (funnel). For example, we could select only the users who have the intention to buy products and who started checking out.

## Practical significance boundary:
Practical significance boundary, a.k.a. effect size: for example, we could agree that an increase of USD 2/user in revenue is practically significant. As a result, if the revenue increases by USD 2/user on average, we could launch the change to production.

## Power of the test:
Probability of True Positive: detecting a given effect size when the effect is real. Probability = 1 - Beta.
Industry standard: Power = 80%

## Significance Level (alpha):
Probability of a False Positive (Type I error): mistakenly concluding that an effect is real (when it's actually due to chance). Probability = alpha
Industry standard: Significance level = 5%
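
To make the link between power and significance level concrete, here is a rough sketch that approximates the power of a two-sided two-sample z-test with equal group sizes. The function name `approx_power` is an assumption of this sketch. The inputs (sigma = 20, delta = 2, 1600 users per group) are the values used in the sample-size example of this notebook. The normal approximation is a simplification.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, sigma, delta, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (equal group sizes)."""
    std_normal = NormalDist()
    se = sigma * sqrt(2 / n_per_group)           # standard error of the difference in means
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    # Probability that the test statistic exceeds the critical value when the
    # true difference is delta (the negligible opposite tail is ignored).
    return std_normal.cdf(delta / se - z_crit)


power = approx_power(n_per_group=1600, sigma=20, delta=2)
print(f"Approximate power: {power:.2f}")
```

Under these assumptions the result lands just above the 80% industry standard. Lowering `alpha` or `n_per_group` lowers the power.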

## Sample size:
What is the sample size of the experiment? How many users should be randomly allocated to each group?

Sample size = (16*sigma^2) / delta^2
with:
- sigma: the standard deviation of the population
- delta: difference between treatment and control

Assume sigma is equal to 20 and delta is the practical significance boundary of USD 2/user. Then, the sample size is 16*20^2 / 2^2 = 1600 unique users in each variant. If there are 3 variants, we need a total of 4800 unique users.

If a smaller effect size (USD 1/user) has to be detected, or a smaller significance level (alpha=2.5%) is required, then the sample size has to be increased.
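
The rule of thumb above can be sketched in code. The factor 16 is a rounded value of 2*(z_(1-alpha/2) + z_(power))^2 for alpha = 5% (two-sided) and power = 80%, which is about 15.7. The second function uses that unrounded factor so that other values of alpha and power can be tried. Both function names are assumptions of this sketch.

```python
from math import ceil
from statistics import NormalDist


def rule_of_thumb_sample_size(sigma, delta):
    """Sample size per variant using the 16 * sigma^2 / delta^2 rule of thumb."""
    return ceil(16 * sigma**2 / delta**2)


def sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Sample size per variant from the normal approximation (two-sided test)."""
    z = NormalDist()
    factor = 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2
    return ceil(factor * sigma**2 / delta**2)


n_rule = rule_of_thumb_sample_size(sigma=20, delta=2)   # 1600, as in the text
n_exact = sample_size(sigma=20, delta=2)                # slightly below the rule of thumb
n_small_effect = sample_size(sigma=20, delta=1)         # halving delta quadruples n
n_small_alpha = sample_size(sigma=20, delta=2, alpha=0.025)
print(n_rule, n_exact, n_small_effect, n_small_alpha)
```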

## Duration:
A number of elements can affect the duration of an experiment:
- latency
- time required to get the data
- seasonality
- primacy and novelty effects

### Ramp-up plan:
A ramp-up plan helps to catch bugs early and to check that the system handles the traffic without added latency:
- Expose to a small population
- Gradually increase percentage

Example: 
- day 1: 5% of the treatment group
- day 2: increase to 10% of the treatment group
- day 3: increase to 33% of the treatment group

Assuming there are 2000 users / day entering checkout while purchasing products, the experiment needs to run for a minimum of 4 days to reach 1600 users in a variant:

- day 1: 5% => 100 users
- day 2: 10% => 200 users
- day 3: 33% => 660 users
- day 4: 33% => 660 users (cumulative: 1620)
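
The ramp-up arithmetic above can be reproduced with a short loop. This is a sketch under the stated assumptions: 2000 users per day entering checkout, a target of 1600 users per variant (from the sample-size example), and an exposure of 5%, then 10%, then 33% from day 3 onward. The function name `days_to_reach` is hypothetical.

```python
def days_to_reach(target_users, daily_users, ramp_pct, steady_pct):
    """Return (days, exposed users) needed to reach target_users in one variant."""
    if steady_pct <= 0:
        raise ValueError("steady_pct must be positive, otherwise the target is never reached")
    exposed, day = 0, 0
    while exposed < target_users:
        # Use the ramp-up percentage while it lasts, then the steady-state one.
        pct = ramp_pct[day] if day < len(ramp_pct) else steady_pct
        exposed += daily_users * pct // 100
        day += 1
    return day, exposed


days, exposed = days_to_reach(target_users=1600, daily_users=2000,
                              ramp_pct=[5, 10], steady_pct=33)
print(days, exposed)  # 4 1620
```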

### Day of week effect:
People behave differently on different days of the week. As a result, the experiment should run for at least one full week to capture this effect.

### Seasonality
Data collected during holidays cannot be used for the analysis, so the experiment must run longer.

### Primacy or novelty effects:
- Users respond to changes differently, and the duration of the experiment must be chosen carefully


## Sanity check:
- Number of users assigned to groups is respected and random
- Consistency for each variant: Latency when using the product is similar, as it could affect the results

# Running Experiment
In this notebook, a new webpage has been developed for an e-commerce platform, and we want to test the effect of the new page on purchase conversion:

## Metrics:
- purchase conversion = nb. converted users / nb. exposed users
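
As an illustration of the metric, the sketch below computes the purchase conversion per group with pandas. It assumes one row per exposed user, a `group` label and a 0/1 `converted` flag. The five rows are made-up toy data, not results from the experiment.

```python
import pandas as pd

# Toy data (hypothetical): one row per exposed user.
toy = pd.DataFrame({
    "group": ["control", "control", "treatment", "treatment", "control"],
    "converted": [0, 0, 0, 0, 1],
})

# Count exposed and converted users per group, then take the ratio.
summary = toy.groupby("group")["converted"].agg(exposed="count", converted="sum")
summary["conversion"] = summary["converted"] / summary["exposed"]
print(summary)
```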

## Assumptions:
- No interaction effects between units in control / treatment.



## Data collection
The data was collected and added to a CSV file. Each row is logged when a user is exposed to a variant:

In [15]:
df = pd.read_csv('ab_data.csv')
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted
0,851104,2017-01-21 22:11:48.556739,control,old_page,0
1,804228,2017-01-12 08:01:45.159739,control,old_page,0
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0
4,864975,2017-01-21 01:52:26.210827,control,old_page,1


Column definitions:
- timestamp: time of exposure
- group: bucket (control, treatment)
- landing_page: variants (new_page, old_page)
- converted: converted (1) or not (0)

## Duration: 
The data collected represent 3 weeks of exposure / conversion data:
- Exposure: A user is exposed to their respective variant (webpage)
- Conversion: A user purchases a product within 7 days after being exposed to the variant (webpage).

In [21]:
start_time = datetime.datetime.strptime(df['timestamp'].min(), '%Y-%m-%d %H:%M:%S.%f') # start of the data collection
end_time = datetime.datetime.strptime(df['timestamp'].max(), '%Y-%m-%d %H:%M:%S.%f') #end of the data collection
data_duration = (end_time - start_time).days #number of days of the experiment

t = df['user_id'].nunique() # Number of unique users
a = int(df[df['group'] == 'treatment'].shape[0]) # Number of entries in treatment group
b = int(df[df['group'] == 'control'].shape[0]) # Number of entries in control group

print(f"Nb. of unique user_id in experiment: {t}")
print(f"Nb. of total entries in experiment: {df.shape[0]}")
print(f"Nb. of users in treatment group: {a}")
print(f"Nb. users in control group: {b}")
print(f"Duration: {data_duration} days.")

Nb. of unique user_id in experiment: 290584
Nb. of total entries in experiment: 294478
Nb. of users in treatment group: 147276
Nb. users in control group: 147202
Duration: 21 days.
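
The group sizes printed above can be used for the sanity check on assignment described in the design section. The sketch below runs a chi-square goodness-of-fit test with one degree of freedom. It assumes that a 50/50 split was intended, which the data description does not state explicitly. The function name `srm_p_value` is hypothetical. The counts are the rows per group before any cleaning.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment):
    """p-value of a chi-square test (1 degree of freedom) against a 50/50 split."""
    expected = (n_control + n_treatment) / 2
    chi2 = ((n_control - expected) ** 2 + (n_treatment - expected) ** 2) / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


p_value = srm_p_value(n_control=147202, n_treatment=147276)
print(f"Sample ratio mismatch p-value: {p_value:.3f}")
```

A large p-value means the observed split is compatible with the intended one. A very small p-value would point to a problem in the assignment or logging.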


## Data cleaning

In [29]:
counter = df['user_id'].value_counts()
mistake = (counter > 1).value_counts()
print(mistake)
print(f"{(counter > 1).sum()} user_ids have been exposed to an old and new page. Those data should be removed as they do not represent a big part of the data collected")

False    286690
True       3894
Name: user_id, dtype: int64
3894 user_ids have been exposed to an old and new page. Those data should be removed as they do not represent a big part of the data collected


In [None]:
valid_users = pd.DataFrame(counter[counter == 1].index, columns=['user_id'])
df = df.merge(valid_users, on=['user_id'])
df.shape  # shape is an attribute, not a method
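
An equivalent way to drop every user that appears more than once is `drop_duplicates` with `keep=False`, which removes all occurrences of a duplicated `user_id` rather than keeping the first one. The sketch below uses a made-up four-row frame, not the experiment data.

```python
import pandas as pd

# Toy data (hypothetical): user 2 was logged in both groups.
toy = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "group": ["control", "control", "treatment", "treatment"],
})

# keep=False drops all rows whose user_id occurs more than once.
deduped = toy.drop_duplicates(subset="user_id", keep=False)
print(deduped["user_id"].tolist())  # [1, 3]
```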

## Data engineering

In [32]:
#Add week column:
df['week'] = df['timestamp'].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f').isocalendar()[1])
number_of_week = df['week'].value_counts()
df.head()

Unnamed: 0,user_id,timestamp,group,landing_page,converted,week
0,851104,2017-01-21 22:11:48.556739,control,old_page,0,3
1,804228,2017-01-12 08:01:45.159739,control,old_page,0,2
2,661590,2017-01-11 16:55:06.154213,treatment,new_page,0,2
3,853541,2017-01-08 18:28:03.143765,treatment,new_page,0,1
4,864975,2017-01-21 01:52:26.210827,control,old_page,1,3


In [35]:
print(number_of_week)

2    93853
3    93518
1    86058
4    21049
Name: week, dtype: int64


## Defining Variables

In [2]:
# Variables that fully define the A/B test results
a = 1000 # Number of subjects in group A
b = 2000 # Number of subjects in group B
t = a + b # Total number of subjects (A+B)

a_yes = 30 # Yes count in group A
b_yes = 100 # Yes count in group B

# Compute remaining A/B test results
t_yes = a_yes + b_yes # Total yes count
t_no = t - t_yes # Total no count
a_yes_pc = 100* a_yes / a # Yes percentage in A
b_yes_pc = 100* b_yes / b # Yes percentage in B

# A/B testing Statistic: Yes percentage difference (A-B)
ab_yes_pc = a_yes_pc - b_yes_pc 

print(f'Observed Yes rate: B: {b_yes_pc} %, A-B: {ab_yes_pc} %, \n Total counts: Yes: {t_yes}, No: {t_no}')


Observed Yes rate: B: 5.0 %, A-B: -2.0 %, 
 Total counts: Yes: 130, No: 2870


In [3]:
seed(2) # For reproducible results

bag = [1]*t_yes + [0]*t_no #S1 : create a bag with all the data

p = 1000 # number of permutations
perm_res = [0]*p # list for permutation results

for i in range(p):
    bag = sample(bag, k=len(bag)) # S2: Shuffle the bag
    a_rs = bag[:a]                # S3: Random sample A
    b_rs = bag[a:]                # S4: Random sample B
    # Step 5: Compute the test statistic
    perm_res[i] = 100*sum(a_rs)/a - 100*sum(b_rs)/b

# Print a representation of the null distribution, which can be used to compute the p-value
# n = 5
# for i in range(0, len(perm_res), n):
#     print(str(perm_res[i:i+n]).replace(",", "").replace(".0", ""))

Reminder about Power and Errors:
- Power: Probability of detecting a given effect size when the effect is real (True Positive). Probability = 1 - Beta
- Type I error (False Positive): Mistakenly concluding that an effect is real (when it's actually due to chance). Probability = Alpha
- Type II error (False Negative): Mistakenly concluding that an effect is due to chance (when it's actually real). Probability = Beta

## Counting the number of times we get the values 

In [4]:
# Transform list to pandas series:
perm_res_s = pd.Series(perm_res)
#print(pd.pivot_table(perm_res_s.value_counts().reset_index(), values = 0, columns = "index").to_string(index=False))


## Hypothesis tests

In [5]:
# Two-way hypothesis test (Null: A = B, Alternative: A != B)
extreme_count = sum(perm_res_s.abs() >= abs(ab_yes_pc))

# One-way hypothesis test (Null: A <= B, Alternative: A > B)
pos_extreme_count = sum(perm_res_s >= ab_yes_pc)

print("Number of permutations:", p, 
     "\nTwo-way: Extreme count          :", extreme_count,
     "\nTwo-way: Extreme ratio (p-value):", extreme_count / p,
     "\nOne-way: Extreme count          :", pos_extreme_count,
     "\nOne-way: Extreme ratio (p-value):", pos_extreme_count / p)

Number of permutations: 1000 
Two-way: Extreme count          : 8 
Two-way: Extreme ratio (p-value): 0.008 
One-way: Extreme count          : 998 
One-way: Extreme ratio (p-value): 0.998
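
As a cross-check of the permutation result, the same counts (30 of 1000 in A, 100 of 2000 in B) can be run through a two-proportion z-test. This is a different technique from the permutation test used in this notebook: it relies on a normal approximation, so the two p-values are expected to be close but not identical. The function name `two_proportion_z_test` is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(yes_a, n_a, yes_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups have the same yes rate."""
    p_a, p_b = yes_a / n_a, yes_b / n_b
    pooled = (yes_a + yes_b) / (n_a + n_b)                  # yes rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (p_a - p_b) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


z, p_value = two_proportion_z_test(yes_a=30, n_a=1000, yes_b=100, n_b=2000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```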


## Results to decision
It is recommended to launch a change when the result is:
- Statistically significant (based on the p-value; if the confidence interval includes 0, the result is NOT statistically significant)
- Practically significant (the point estimate exceeds the practical significance boundary, here USD 2 / user)

If this is not the case, running a follow-up test with more power would be helpful.

- Tradeoffs between different metrics also need to be considered, for example when metrics such as revenue and user engagement move in different directions.
- The costs of launching a change, such as engineering maintenance, also need to be considered. When costs are high, we need to make sure the benefit outweighs the cost (using a practical significance boundary). When costs are low, any positive change should be launched.
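
The two launch criteria (statistical and practical significance) can be written as a small helper. This is only a sketch: the function name `decide` and the example numbers are hypothetical, and the metric tradeoffs and launch costs discussed above are deliberately left out.

```python
def decide(point_estimate, ci_low, ci_high, practical_boundary=2.0):
    """Return 'launch' or 'follow-up test' from a point estimate and its confidence interval."""
    # Statistically significant only if the confidence interval excludes 0.
    statistically_significant = ci_low > 0 or ci_high < 0
    # Practically significant only if the estimate exceeds the agreed boundary.
    practically_significant = point_estimate > practical_boundary
    if statistically_significant and practically_significant:
        return "launch"
    return "follow-up test"


# Hypothetical revenue lifts in USD / user with their confidence intervals.
print(decide(point_estimate=2.5, ci_low=0.8, ci_high=4.2))   # launch
print(decide(point_estimate=1.0, ci_low=-0.5, ci_high=2.5))  # follow-up test
```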
