# I. Course Overview
## 1. Intro to A/B Testing
A/B testing is a general methodology used online when people want to test out a **new product or a feature**. 

A/B tests allow people to determine **scientifically** how to optimize a website or a mobile app by trying out possible changes and seeing what performs better with users. 

Using A/B tests means that people can get **data** to make decisions rather than relying on intuition or opinion.

The **goal** in A/B testing is to design an experiment that is **robust** and gives **repeatable** results, so that people can make a good decision about whether or not to launch that product or feature.

#### Full process of an A/B test:
- Choose a metric
- Review statistics
- Design experiments
- Analyze results

## 2. Examples of A/B Testing in Industries
- Google tested 41 different shades of blue.
- Amazon initially decided to launch their first personalized product recommendations based on an A/B test showing a huge revenue increase by adding that feature. 
- LinkedIn tested whether to use the top slot on a user's stream for top news articles or an encouragement to add more contacts. 
- Amazon determined that every 100ms increase in page load time decreased sales by 1%. 
- Google’s latency results showed a similar impact for a 100ms delay.

#### Other Examples:

- Kayak tested whether notifying users that their payment was encrypted would make users more or less likely to complete the payment.
- Khan Academy tests changes like letting students know how many other students are working on the exercise with them, or making it easier for students to fast-forward past skills they already have. 

## 3. What you can and can't do with A/B Tests
#### When can you use A/B Testing
- Add premium service
- Movie recommendation site: new ranking algorithm
- Change backend: page load time, results users see, etc.
- Test layout of initial page

#### Not suitable: 
- Online shopping company: is my site complete?
- Website selling cars: will a change increase repeat customers or referrals?
- Update brand, including main logo

## 4. Other Techniques Apart from A/B Testing
- user experience research (e.g., check user logs)
- focus groups
- surveys
- human evaluation

## 5. A/B Testing Business Case - Audacity 

##### Background:
Audacity is a website that creates online finance courses.

Its user flow (also known as **customer funnel**) looks like this: 
- Homepage visits
- Exploring the site
- Create account
- Complete

##### Change we want to make:
Change the 'Start Now' button from orange to pink

##### Experiment:
**Initial Hypothesis**: 

Changing the "Start Now" button from orange to pink will increase how many students explore Audacity courses

##### Choose a metric:
Refining the hypothesis to see which metric to use
- total number of courses completed (not practical)
- number of clicks on the "Start Now" button (not considering the number of page views)
- fraction of clicks: number of clicks/ number of page views; this number is also called **click-through-rate**, or CTR
- **click-through-probability**: number of unique visitors who click/ number of unique visitors to page

We'll use **click-through-probability** as our metric

##### Updated hypothesis:
Changing the "Start Now" button from orange to pink will increase the **click-through-probability** of the website

##### Why click-through-probability over click-through-rate?
Generally speaking, we use a rate when we want to measure the *usability* of the site, and a probability when we want to measure the *total impact*. 

Here in the example, we are interested in whether users are *progressing* to the second level of the funnel, so we picked a probability. 

##### How to compute the rate/probability?
**To compute the rate**: 

- First, work with the engineers to modify the website. Engineers need to change the website so that on every page view, the event can be captured. And then whenever a user clicks, we can also capture the click event. 
- Second, after we captured the data, we sum the page views and the clicks, and then we divide to get the rate.

**To compute the probability**:

- Match each page view with its child clicks (the clicks that happened on that page view), and count at most one child click per page view.
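
As a rough illustration of the difference between the two metrics, here is a minimal sketch. The event log format (`visitor_id`, `event_type`) is a hypothetical assumption, and the probability is computed per unique visitor, following the definition in the metric list above.

```python
# Hypothetical event log for the page with the "Start Now" button.
# Each entry is (visitor_id, event_type); the format is an assumption.
events = [
    ("u1", "pageview"), ("u1", "click"),
    ("u1", "pageview"), ("u1", "click"),  # same visitor reloads and clicks again
    ("u2", "pageview"),
    ("u3", "pageview"), ("u3", "click"),
    ("u4", "pageview"),
]


def click_through_rate(events):
    """Total clicks divided by total page views."""
    views = sum(1 for _, kind in events if kind == "pageview")
    clicks = sum(1 for _, kind in events if kind == "click")
    return clicks / views


def click_through_probability(events):
    """Unique visitors who clicked divided by unique visitors who viewed."""
    viewers = {uid for uid, kind in events if kind == "pageview"}
    clickers = {uid for uid, kind in events if kind == "click"}
    return len(clickers & viewers) / len(viewers)


print(click_through_rate(events))         # 3 clicks / 5 page views = 0.6
print(click_through_probability(events))  # 2 of 4 visitors clicked = 0.5
```

The repeated clicks from visitor `u1` raise the rate but not the probability, which is why the probability is the better fit for asking whether visitors progress down the funnel.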

##### Repeating the Experiment
Repeated measurement of click-through-probability

**For example**,
visitors = 1000, unique clicks = 100

--> click-through-probability = 10%

Suppose we repeat the experiment with another 1000 visitors. Which numbers of unique clicks would *surprise* us?

A. 100

B. 103

C. 98

D. 150

E. 900

**The answer is D & E.**

So what can we use when we decide if the number surprises us?

## 6. Review the statistics
- binomial distribution
- confidence intervals
- hypothesis testing

### How do we know how variable the estimate is likely to be?
**binomial**: two exclusive outcomes

#### Binomial Distribution
e.g., a biased coin: p = 3/4

success = heads, failure = tails

For the fraction of successes in N flips: mean = p, std dev = square root of (p * (1 - p)/ N)

Suppose N = 20, X = 16, 

then P_hat = 16/20 = 4/5
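
The standard deviation formula also gives a way to approach the earlier question about which repeated results would be surprising. The sketch below is only an illustration: the function name is made up, and using the number of standard deviations from the mean as the measure of surprise is an assumption.

```python
import math


def proportion_std_dev(p, n):
    """Standard deviation of the fraction of successes in n independent trials."""
    return math.sqrt(p * (1 - p) / n)


# Biased coin from the example: p = 3/4, N = 20.
print(round(proportion_std_dev(0.75, 20), 4))  # 0.0968

# Click example: 1000 visitors, click-through-probability 0.1.
n, p = 1000, 0.1
sd_clicks = n * proportion_std_dev(p, n)  # std dev of the click *count*
for clicks in (100, 103, 98, 150, 900):
    z = (clicks - n * p) / sd_clicks
    print(clicks, "is", round(z, 1), "standard deviations from the mean")
```

Under these assumptions, 100, 103 and 98 all fall within one standard deviation of the expected 100 clicks, while 150 and 900 are more than five standard deviations away.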

##### When can we use the binomial distribution?
1. 2 types of outcomes (success, failure)
2. Independent events
3. Identical distribution (p is the same for all)

##### Examples:
- Roll a die 50 times. Outcomes: 6 or other
- Student completion of course after 2 months. Outcomes: Complete or not

#### Confidence Intervals

##### Calculating a confidence interval

p_hat = X/N (X: # of users who clicked; N: # of total users)

X = 100, N = 1000

To use normal distribution: 

check **N * p_hat > 5** & **N * (1 - p_hat) > 5**

m = margin of error = z * std error = z * square root of (p_hat * (1 - p_hat)/ N)

z-score = 1.96 for 95% confidence interval

--> m = 0.019

--> confidence interval = (0.081, 0.119) 

--> In other words, if we run the experiment on another 1000 users, seeing 80 to 120 clicks would be reasonable

##### Example

N = 2000, X = 300, confidence level = 99%, two-tailed. What is the CI?

Answer:

z-score = 2.58, mean = 0.15, and thus m = 2.58 * square root (0.15 * 0.85 / 2000) = 0.021

So the left of CI = 0.15 - 0.021  = 0.129, and the right of CI = 0.15 + 0.021 = 0.171
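
Both confidence-interval calculations above can be reproduced with a short sketch. The function name and the choice to raise an error when the normal-approximation check fails are assumptions made for illustration; `statistics.NormalDist` is from the Python standard library (3.8+).

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(x, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion x / n."""
    p_hat = x / n
    # Rule of thumb from the notes: N * p_hat > 5 and N * (1 - p_hat) > 5.
    if n * p_hat <= 5 or n * (1 - p_hat) <= 5:
        raise ValueError("sample too small for the normal approximation")
    # Two-tailed z-score, e.g. about 1.96 for 95% and about 2.58 for 99%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


low, high = proportion_confidence_interval(100, 1000, confidence=0.95)
print(round(low, 3), round(high, 3))  # 0.081 0.119

low, high = proportion_confidence_interval(300, 2000, confidence=0.99)
print(round(low, 3), round(high, 3))  # 0.129 0.171
```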


##### Establishing Statistical Significance
**Hypothesis Testing**
* Null and alternative hypotheses
    - Null hypothesis, or H_0: p_cont = p_exp, aka p_exp - p_cont = 0
    - Alternative hypothesis, or H_A: p_exp - p_cont is not equal to 0
* Measure p_cont_hat and p_exp_hat
* Calculate P(results due to chance), that is, the probability of seeing a difference p_exp_hat - p_cont_hat at least this extreme if H_0 is true
* Reject the null if P < 0.05 (alpha)

**Two-tailed vs. one-tailed tests**

The null hypothesis and alternative hypothesis proposed here correspond to a two-tailed test, which allows us to distinguish between three cases:

- A statistically significant positive result
- A statistically significant negative result
- No statistically significant difference.

Sometimes when people run A/B tests, they will use a one-tailed test, which only allows us to distinguish between two cases:

- A statistically significant positive result
- No statistically significant result

Which one we should use depends on what action we will take based on the results. 

* If we're going to launch the experiment for a statistically significant positive change, and otherwise not, then we don't need to distinguish between a negative result and no result, so a one-tailed test is good enough. 
* If we want to learn the direction of the difference, then a two-tailed test is necessary.
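
To make the difference concrete, the sketch below computes both p-values for the same test statistic. The z value of 1.8 is a made-up illustration, not a result from the course, and the one-tailed test is assumed to look for an increase.

```python
from statistics import NormalDist


def p_values(z):
    """Return (two_tailed, one_tailed) p-values for a z statistic."""
    std_normal = NormalDist()
    one_tailed = 1 - std_normal.cdf(z)              # P(Z >= z): increase only
    two_tailed = 2 * (1 - std_normal.cdf(abs(z)))   # P(|Z| >= |z|): any change
    return two_tailed, one_tailed


two_tailed, one_tailed = p_values(1.8)  # hypothetical test statistic
print(round(two_tailed, 3))  # 0.072
print(round(one_tailed, 3))  # 0.036
```

With alpha = 0.05, this hypothetical result would count as significant under the one-tailed test but not under the two-tailed test, so the choice should be made before looking at the data.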

### Comparing two samples
Measure:
- X_cont, X_exp
- N_cont, N_exp

Calculate:
- P_pool_hat = (X_cont + X_exp)/(N_cont + N_exp)
- pooled standard error: SE_pool = square root of (P_pool_hat * (1 - P_pool_hat) * ((1/N_cont) + (1/N_exp)))
- d_hat = p_exp_hat - p_cont_hat

Given H_0: d = 0, d_hat is distributed as N(0, SE_pool)

--> If d_hat > 1.96 * SE_pool or d_hat < -1.96 * SE_pool, reject null
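
A minimal sketch of this comparison follows. The function name and the click counts in the usage example are hypothetical, chosen only to show the calculation.

```python
import math


def pooled_test(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Compare two click-through-probabilities using the pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    # Reject H_0 when d_hat lies more than z pooled standard errors from 0.
    reject_null = abs(d_hat) > z * se_pool
    return d_hat, se_pool, reject_null


# Hypothetical counts: 100 of 1000 clicked in control, 130 of 1000 in experiment.
d_hat, se_pool, reject_null = pooled_test(100, 1000, 130, 1000)
print(round(d_hat, 3), round(se_pool, 4), reject_null)  # 0.03 0.0143 True
```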

### Practical or Substantive Significance 
Practical significance and substantive significance mean the same thing.

Basically, on top of **statistical significance**, does the change matter to the **business**? 

Statistical significance is about **repeatability**. 

So we still need to pick a **practical significance boundary**. 

For example, for a business, a 2% change in the click-through-probability would be practically significant.

#### Design the experiment 
**statistical power vs size trade off**

How many page views do we need?

- alpha = P(reject null | null true)
- beta = P(fail to reject null | null false)

When collecting a small sample --> alpha is low (aka we are unlikely to launch a bad experiment), but beta is high (aka we are likely to fail to launch an experiment that actually did have a difference we care about)

1 - beta = sensitivity

We want our experiment to have a high level of sensitivity at the practical significance boundary, and people usually choose 80%. 

When collecting a big sample --> alpha stays the same, but beta is low, so the sensitivity increases (aka we are more likely to launch an experiment that actually has a difference we care about)
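
The effect of sample size on sensitivity can be sketched with a normal approximation. This is an illustration under stated assumptions, not the course's own calculation: equal group sizes, a two-tailed test at alpha = 0.05, a baseline of 0.1, a true difference of 0.02, and the small chance of rejecting in the wrong direction ignored.

```python
import math
from statistics import NormalDist


def approx_sensitivity(n_per_group, baseline, true_diff, alpha=0.05):
    """Approximate 1 - beta for a two-proportion test with equal group sizes."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    p_cont, p_exp = baseline, baseline + true_diff
    p_bar = (p_cont + p_exp) / 2
    # Standard error under H_0 (pooled) and under the true difference.
    se_null = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    se_alt = math.sqrt(
        (p_cont * (1 - p_cont) + p_exp * (1 - p_exp)) / n_per_group
    )
    # Probability that d_hat lands above the rejection threshold.
    return 1 - std_normal.cdf((z_alpha * se_null - true_diff) / se_alt)


for n in (500, 2000, 5000):
    print(n, round(approx_sensitivity(n, baseline=0.1, true_diff=0.02), 2))
```

Under these assumptions the sensitivity rises steadily with the number of page views per group, while alpha is held fixed at 0.05 throughout.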

#### Calculating number of pages views needed
- use a built-in library
- look up the answer in a table
- use an online calculator (https://www.evanmiller.org/ab-testing/sample-size.html)

##### Example
baseline conversion rate: the estimated click-through-probability before making the change = 0.1 (X = 100, N = 1000)

minimum detectable effect = practical significance level = 5%

The Minimum Detectable Effect is the smallest effect that will be detected (1-β)% of the time.

Conversion rates in the gray area (baseline conversion rate ± minimum detectable effect) will not be distinguishable from the baseline.

Statistical power = 1 - beta, it means the percent of the time the minimum effect size will be detected, assuming it exists

Significance level alpha: Percent of the time a difference will be detected, assuming one does NOT exist

##### Note on power
Statistics textbooks frequently define power to mean the same thing as sensitivity, that is, 1 - beta. 

However, conversationally power often means the probability that your test draws the correct conclusions, and this probability depends on both alpha and beta. 

In this course, we'll use the second definition, and we'll use sensitivity to refer to 1 - beta.

##### Should we increase or decrease number of page views if we want to keep everything else the same after the change?
1. Higher click-through-probability in control (but still less than 0.5) --> increase # of page views (the standard error grows as p approaches 0.5)
2. Increased practical significance level --> decrease # of page views (a larger change is easier to detect)
3. Increased confidence level (1 - alpha) --> increase # of page views
4. Higher sensitivity (1 - beta) --> increase # of page views
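
These four directions can be checked with a common normal-approximation formula for the sample size of a two-proportion test. This is a sketch under assumptions: the formula is a standard textbook approximation and may not match the online calculator exactly, `d_min` is treated as an absolute difference, and the function name is made up.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline, d_min, alpha=0.05, sensitivity=0.8):
    """Approximate page views needed in each group (two-tailed test)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(sensitivity)
    p_cont, p_exp = baseline, baseline + d_min
    p_bar = (p_cont + p_exp) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_cont * (1 - p_cont) + p_exp * (1 - p_exp))
    )
    return math.ceil((numerator / d_min) ** 2)


reference = sample_size_per_group(baseline=0.10, d_min=0.02)

# 1. Higher baseline click-through-probability (still below 0.5): more views.
print(sample_size_per_group(0.20, 0.02) > reference)                    # True
# 2. Larger practical significance level: fewer views.
print(sample_size_per_group(0.10, 0.03) < reference)                    # True
# 3. Higher confidence level (smaller alpha): more views.
print(sample_size_per_group(0.10, 0.02, alpha=0.01) > reference)        # True
# 4. Higher sensitivity: more views.
print(sample_size_per_group(0.10, 0.02, sensitivity=0.9) > reference)   # True
```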

##### Analyze the results
- Case 1: N_cont = 10,072, N_exp = 9,886, X_cont = 974, X_exp = 1242; d_min = 0.02, confidence level = 95%

--> P_pool_hat = (974 + 1242) / (10072 + 9886) = 0.111

--> SE_pool = square root of (0.111 * (1 - 0.111) * (1/10072 + 1/9886)) = 0.00445

--> d_hat = p_exp_hat - p_cont_hat = 1242/9886 - 974/10072 = 0.0289, the estimated difference

--> m = z * SE_pool = 1.96 * 0.00445 = 0.0087, the margin of error

--> CI = (d_hat - m, d_hat + m) = (0.0289 - 0.0087, 0.0289 + 0.0087) = (0.0202, 0.0376)

The lower bound of the CI is above our practical significance boundary of 0.02, so at the 95% level we can be confident that the change is at least that large. The result is both **statistically** and **practically significant**. Therefore, we would launch the new version.

- Case 2: neutral
- Case 3: statistically significant, but not practically significant

If the CI includes 0 --> it's possible that the new version has no effect at all
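
The Case 1 numbers and the three kinds of outcome can be put together in one sketch. The function name and the verdict wording are assumptions; the logic simply compares the confidence interval for the difference with 0 and with d_min, assuming we care about an increase.

```python
import math


def analyze(x_cont, n_cont, x_exp, n_exp, d_min, z=1.96):
    """Return the CI for p_exp - p_cont and a rough verdict."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d_hat = x_exp / n_exp - x_cont / n_cont
    margin = z * se_pool
    low, high = d_hat - margin, d_hat + margin
    if low <= 0 <= high:
        verdict = "neutral: the CI includes 0"
    elif high < 0:
        verdict = "statistically significant decrease"
    elif low >= d_min:
        verdict = "statistically and practically significant"
    else:
        verdict = "statistically significant, practical significance not established"
    return low, high, verdict


# Case 1 from the notes.
low, high, verdict = analyze(974, 10072, 1242, 9886, d_min=0.02)
print(round(low, 4), round(high, 4))  # 0.0202 0.0376
print(verdict)                        # statistically and practically significant
```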

##### Making decisions about uncertain data
Inform the decision-makers; they might weigh strategic business issues in addition to the data.

# II. Policy and Ethics for Experiments
