# Outline
## 1. What is A/B testing?
## 2. How long does it take to run an A/B test?
## 3. Multiple testing problem
## 4. Novelty and primacy effect
## 5. Interference between variants
## 6. Dealing with interference
## 7. A step-by-step walkthrough of a real A/B testing process

____________________________________

## 1. What is A/B testing?
### A/B tests (aka controlled experiments)
- used in industry to make decisions
- simplest form: control A, treatment B (control group: existing features; treatment group: new features)
- evaluate features with a subset of users

### What do data scientists do with A/B testing? 
- Typically design A/B testing with given metrics
- involved in all A/B testing components
    - developing a new hypothesis
    - designing A/B tests
    - evaluating results
    - making decisions
    
## 2. Designing an A/B test
### How long does it take to run an A/B test?
#### Step 1: Determine the sample size
The required sample size depends on three inputs:
- Type II error rate (beta: the probability of failing to reject H_0 when a real difference exists), or equivalently power (1 - beta)
- Significance level (alpha)
- Minimum detectable effect

Rule of thumb: sample size per group = 16 * sigma^2 / delta^2

(sigma: standard deviation of the population; delta: difference between treatment & control)

How do the parameters influence the sample size?
- if sigma is larger --> more samples are needed
- if delta is larger --> fewer samples are needed

How to estimate the parameters:
- sigma is estimated from existing data
- delta is unknown before the experiment, so we use the minimum detectable effect in its place

Minimum detectable effect: the smallest difference that matters in practice, e.g., a 0.1% increase in revenue
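The rule of thumb above can be sketched in a few lines of Python. This is a minimal illustration, not a full power analysis; the function name `sample_size_rule_of_thumb` and the example values for sigma and delta are assumptions made up for the example.

```python
import math


def sample_size_rule_of_thumb(sigma: float, delta: float) -> int:
    """Sample size per group from n = 16 * sigma^2 / delta^2, rounded up."""
    if delta == 0:
        raise ValueError("delta must be non-zero")
    return math.ceil(16 * sigma**2 / delta**2)


# Hypothetical inputs: standard deviation of 10, minimum detectable effect of 1.
baseline = sample_size_rule_of_thumb(sigma=10, delta=1)
noisier = sample_size_rule_of_thumb(sigma=20, delta=1)        # larger sigma
bigger_effect = sample_size_rule_of_thumb(sigma=10, delta=2)  # larger delta
print(baseline, noisier, bigger_effect)
```

Doubling sigma quadruples the required sample size, while doubling delta cuts it to a quarter, which matches the two bullet points on parameter influence.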

#### Step 2: Divide the sample size by the number of users per day --> round the duration up to whole weeks
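A rough sketch of Step 2 follows. It assumes that `daily_users` counts new unique users entering the experiment each day, and that `traffic_share` (a hypothetical parameter) is the fraction of traffic exposed to the experiment; neither name comes from the notes.

```python
import math


def experiment_duration_weeks(n_per_group: int, n_groups: int,
                              daily_users: float, traffic_share: float = 1.0) -> int:
    """Estimate the duration in whole weeks, rounding up."""
    total_needed = n_per_group * n_groups          # users needed across all groups
    users_per_day = daily_users * traffic_share    # users entering the test per day
    days = math.ceil(total_needed / users_per_day)
    return max(1, math.ceil(days / 7))             # always at least one full week


# Hypothetical traffic: 1,000 new users per day, half of them in the experiment.
print(experiment_duration_weeks(n_per_group=10_000, n_groups=2,
                                daily_users=1_000, traffic_share=0.5))
```

Rounding up to whole weeks keeps every day of the week equally represented in the data.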

## 3. Multiple testing problem
### Test multiple variants of a feature:
- colors
- homepage design
    
### Sample scenario:
Ten tests are running on different landing pages, varying colors and designs. One variant won with a p-value < 0.05. Should we make the change?

#### Answer: No 

Because there are **multiple testing variants**, the probability of a false discovery increases, so we shouldn't use the same significance level as for a single test.

### Explanation of increasing probability of false discovery
Suppose there are 3 groups, each tested at a significance level of 0.05, and the tests are independent. What is the chance of at least one false positive?

#### Answer:
Pr(no false positive) = (1 - 0.05)^3 = 0.95^3 = 0.857

Pr(at least 1 false positive) = 1 - Pr(no false positive) = 0.143

Therefore, the overall Type I error rate is over 14%, compared with 5% for a single test.
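The same calculation can be repeated for any number of tests. The sketch below assumes the tests are independent; the function name is made up for illustration.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 3, 10):
    print(n_tests, round(family_wise_error_rate(0.05, n_tests), 3))
```

With 10 tests, as in the sample scenario, the chance of at least one false positive is already about 40%.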

### Dealing with 'multiple testing' problem:
#### Method I: Bonferroni correction
- significance level = original significance level / number of tests
- in this case: significance level = 0.05/10 = 0.005

--> This method tends to be too conservative: it guards against any false positive at the cost of missing real effects.
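A minimal sketch of the Bonferroni correction applied to the 10-test scenario. The p-values are invented for illustration; only the winning variant's p-value is assumed to be below 0.05.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each test as significant at the corrected level alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Hypothetical p-values: one "winner" at 0.03, nine clearly non-significant tests.
p_values = [0.03] + [0.5] * 9
print(bonferroni_significant(p_values))
```

The winner at p = 0.03 passes the uncorrected 0.05 level but not the corrected 0.005 level, so the change would not be launched on this evidence alone.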

#### Method II: Control False Discovery Rate (FDR) -- mostly used with many metrics
FDR = E(false positives/ rejections)

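E.g., with 200 metrics and FDR controlled at 0.05:
- we accept that about 5% of the metrics flagged as significant are expected to be false positives
- we tolerate seeing at least 1 false positive among the 200 metrics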
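The notes do not name a specific procedure for controlling FDR. One common choice is the Benjamini-Hochberg step-up procedure, sketched below as an assumption rather than as the method the notes intend; the p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Find the largest rank k with p_(k) <= (k / m) * fdr.
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:  # reject every hypothesis up to that rank
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four metrics.
print(benjamini_hochberg([0.039, 0.001, 0.5, 0.008]))
```

Unlike Bonferroni, the threshold grows with the rank of the p-value, so more true effects survive when many metrics are tested at once.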
    
## 4. Novelty and Primacy effect
### Definition:
primacy effect (change aversion): people are reluctant to change

novelty effect: people welcome the change and use the product more

**Note: Neither effect lasts long**

An A/B test may show a larger or smaller initial effect because of the novelty or primacy effect

### Sample question:
After we ran an A/B test on a new feature, the test won and we launched the change. However, after a week, the treatment effect quickly declined. What's wrong?

#### Answer: Novelty effect
Repeat usage declined as the novelty effect wore off.

### Ways to rule out the possibility:
- Run tests only on first-time users
- If the test is already running: compare first-time users to returning users in the treatment group

## 5. Interference between variants
Typical A/B testing design: split users randomly, assuming that users are independent of each other

### Cases when the assumption fails:
- social networks, e.g., Facebook, LinkedIn, Twitter
- two-sided markets, e.g., Uber, Lyft, Airbnb

### Sample question:
We want to test a new feature to increase posts created per user. We assign each user randomly. The test won by 1% in terms of posts. 

What would happen after the new feature is launched to all users? Will the lift still be 1%? Assume no novelty effect.

#### Answer: The difference will not be 1%

Network effect:
- User behaviors are impacted by other users
- The effect can spill over to the control group
- The measured difference underestimates the treatment effect

Therefore, the actual difference will be more than 1%.

Two-sided markets:
- Resources are shared between the control and treatment groups, e.g., if the treatment group attracts more drivers, fewer drivers will be available for the control group
- The measured difference overestimates the treatment effect

Therefore, the actual effect will be smaller than the measured treatment effect.

## 6. Dealing with Interference
### Sample question:
A new feature provides coupons to our riders. Our goal is to increase rides by decreasing the price.

Testing strategy: how do we evaluate the effect of the new feature?

#### Main idea:
- Isolate users so that the treatment group and the control group cannot influence each other

##### Two-sided markets:
*Method I: Geo-based randomization:*
- Split by geolocation
- E.g., New York vs. San Francisco
- Drawback: high variance, since each market is unique

*Method II: Time-based randomization:*
- Split by day of week
- At any given time, assign all users to either treatment or control
- Works only when the treatment effect is short-lived
- Does not work when the treatment effect takes a long time to show, e.g., through a referral program
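Time-based randomization can be sketched as a random split of days rather than users. The helper `assign_days`, the day labels, and the seed are all hypothetical; a real setup would use calendar dates and would likely balance the split by day of week.

```python
import random


def assign_days(days, seed=42):
    """Randomly assign half of the days to treatment and the rest to control."""
    rng = random.Random(seed)          # fixed seed keeps the assignment reproducible
    shuffled = list(days)
    rng.shuffle(shuffled)
    treatment_days = set(shuffled[: len(shuffled) // 2])
    return {day: ("treatment" if day in treatment_days else "control")
            for day in days}


# Two hypothetical weeks of the experiment.
days = [f"day-{i}" for i in range(1, 15)]
assignment = assign_days(days)
print(assignment)
```

On a treatment day every rider sees the coupon, so riders in the two groups never compete for the same drivers at the same time.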

##### Social network:
*Method I: Create network clusters:*
- people interact mostly within their cluster
- assign clusters randomly

*Method II: Ego-network randomization:*
- originated at LinkedIn
- a cluster is composed of an "ego" and her "alters"
- one-out network effect: the user either has the feature or not
- it's simpler and more scalable
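Both methods randomize at the cluster level instead of the user level. A minimal sketch, assuming the clusters have already been built and each user maps to exactly one cluster; the function names, the `salt` parameter, and the sample IDs are hypothetical.

```python
import hashlib


def cluster_variant(cluster_id: str, salt: str = "exp-1") -> str:
    """Deterministically map a cluster to a variant by hashing its ID."""
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


def assign_users(user_to_cluster: dict, salt: str = "exp-1") -> dict:
    """Every user inherits the variant of the cluster they belong to."""
    return {user: cluster_variant(cluster, salt)
            for user, cluster in user_to_cluster.items()}


# Hypothetical users and clusters.
user_to_cluster = {"alice": "c1", "bob": "c1", "carol": "c2", "dave": "c3"}
variants = assign_users(user_to_cluster)
print(variants)
```

Because assignment happens per cluster, users who interact mostly with each other always see the same experience, which limits spillover between groups.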

### To sum up:
- all methods have limitations
- we need to evaluate methods based on scenarios
- it's possible to combine methods to get a more reliable result

## 7. A step-by-step walkthrough of a real A/B testing process
### Steps:
#### 1. Prerequisites
#### 2. Experiment Design
#### 3. Running Experiment
#### 4. Result to Decision
#### 5. Post-launch Monitoring
______________________________________________________________________

### 1. Prerequisites
- Understand Objective & Key Metrics
- Variants
- Randomization units

#### Key Metric
- Normalize revenue by the number of users
- Key metric: revenue per user

#### Product Variants
Control group & Treatment group

#### Randomization units
- Users (Assume enough users)

### 2. Experiment Design
#### 1). Users to target:
- all users vs specific segment of users

#### 2). Which population to select from?
- E.g., if we want to change the website checkout design, we should select users who reach the 'Checkout' page, not the ones still browsing products

#### 3). Practical significance boundary
- Assume practical significance: Revenue increase by &#36;2 per user
- Power of the test: 80%, Significance level: 5%

--> sample size = 16 * sigma^2/(delta^2)

(sigma: standard deviation of population; delta: difference between treatment & control)
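
The constant 16 follows from the 80% power and 5% significance level chosen above. As a sketch, assuming a two-sided test, equal group sizes, equal variances in both groups, and n counted per group:

```latex
n \;=\; \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\,\sigma^{2}}{\delta^{2}}
\;\approx\; \frac{2\,(1.96 + 0.84)^{2}\,\sigma^{2}}{\delta^{2}}
\;=\; \frac{15.68\,\sigma^{2}}{\delta^{2}}
\;\approx\; \frac{16\,\sigma^{2}}{\delta^{2}}
```

Here 1.96 is the normal quantile for alpha = 0.05 (two-sided) and 0.84 is the quantile for power = 0.80. A different power or significance level changes the constant.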

#### 4). Duration to run the experiment
##### Ramp-up plan: 
Goal: make sure there are no bugs and the traffic can be handled
- Expose the change to a small population at first (start with dozens of users)
- Gradually increase the percentage

Other factors that affect the duration:
- Day of week effect (e.g., people make more purchases during the weekend) --> Recommended: run the experiment for >= 1 whole week
- Seasonality (holiday season) --> data collected during holidays is not representative and cannot be used for analysis; run the experiment longer
- Primacy and novelty effects (users respond to changes differently)

##### Conclusion: 
- Run the experiment for >= 1 week
- Start with 5% of users in each group and ramp up until each group has an equal share, e.g., about 33% each for control, treatment 1, and treatment 2
- The experiment can run longer when there is a novelty or primacy effect
- Number of unique users: it is OK to get more users than required in each group; being overpowered is recommended
- Running the experiment for too long --> the precision of the results won't improve much further

### 3. Running the experiment

### 4. Results to decision
#### First, run sanity checks
Why sanity checks?
- results are unreliable if the assumptions of the experiment are violated

Things to check:
- number of users assigned to each group
- latency when loading the webpage
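
The check on the number of users per group can be sketched as a chi-square goodness-of-fit test with one degree of freedom, often called a sample ratio mismatch check. The function name, the user counts, and the 0.001 alert threshold are assumptions for illustration, not values from these notes.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """p-value of a chi-square test that the observed split matches the design."""
    total = n_control + n_treatment
    expected_t = total * expected_treatment_share
    expected_c = total - expected_t
    chi2 = ((n_control - expected_c) ** 2 / expected_c
            + (n_treatment - expected_t) ** 2 / expected_t)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for a planned 50/50 split.
p = srm_p_value(n_control=5_000, n_treatment=5_500)
print(p, "investigate assignment" if p < 0.001 else "split looks fine")
```

A very small p-value suggests that assignment or logging is broken, in which case the experiment results should not be trusted until the cause is found.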

#### Second, use hypothesis testing to make recommendations
- Recommend launching a change when the results are both statistically and practically significant
- If the results are uncertain --> recommendation: don't launch the change; run a follow-up test with more power

**statistically significant**: p-value below the significance level

**practically significant**: point estimate vs practical significance boundary
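
The two criteria can be combined into a simple decision rule. The sketch below assumes a two-sample z-test on revenue per user with large samples; the function name `recommend` and all summary statistics are hypothetical.

```python
import math
from statistics import NormalDist


def recommend(mean_c, mean_t, sd_c, sd_t, n_c, n_t,
              practical_boundary, alpha=0.05):
    """Check statistical significance (p-value) and practical significance (point estimate)."""
    diff = mean_t - mean_c                          # point estimate of the lift
    se = math.sqrt(sd_c**2 / n_c + sd_t**2 / n_t)   # standard error of the difference
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    if p_value < alpha and diff >= practical_boundary:
        decision = "launch"
    else:
        decision = "do not launch; consider a follow-up test with more power"
    return {"diff": diff, "p_value": p_value, "decision": decision}


# Hypothetical results with a practical significance boundary of 2 per user.
print(recommend(mean_c=50.0, mean_t=53.0, sd_c=20.0, sd_t=20.0,
                n_c=2_000, n_t=2_000, practical_boundary=2.0))
```

A stricter variant would compare the lower bound of the confidence interval, rather than the point estimate, against the practical significance boundary.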

### 5. Post-launch Monitoring