In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

## Permutation Test

|Outcome|Price A|Price B|
|-------|-------|-------|
|Conversion|200|182|
|No conversion|23539|22406|

We want to test whether the difference in conversion rate between Price A and Price B could be due to chance variation. Because a permutation test does not assume normality, we use it here.

$H_0$: there is no difference in conversion rate between Price A and Price B  
$H_1$: the conversion rate of Price A is higher than that of Price B (one-sided)

In [2]:
# Conversion rate = conversions / all visitors (conversions + non-conversions)
con_rate_pa = 200 / (200 + 23539)
con_rate_pb = 182 / (182 + 22406)
delta = con_rate_pa - con_rate_pb
round(con_rate_pa, 6), round(con_rate_pb, 6), round(delta, 6)

(0.008425, 0.008057, 0.000368)

In [3]:
sizeA, sizeB = 200 + 23539, 182 + 22406
pool = np.hstack([np.array([0] * (23539+22406)), np.array([1] * (200+182))])

In [101]:
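def permutation_test(pool, sizeA, sizeB):
    # Shuffle the pooled outcomes in place, then split them into two groups
    np.random.shuffle(pool)

    groupa = pool[:sizeA]
    groupb = pool[-sizeB:]

    # Difference in conversion rates under this random relabeling
    return groupa.sum() / sizeA - groupb.sum() / sizeB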

In [111]:
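numsample = 10000
# Difference in conversion rates for each random relabeling
estimates = [permutation_test(pool, sizeA, sizeB) for _ in range(numsample)]
# One-sided p-value: share of permuted differences larger than the observed delta
p_value = sum(x > delta for x in estimates) / numsample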

In [112]:
p_value

0.3144
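
The shuffle-and-split loop above can also be sketched without building the pooled array. Shuffling the labels and counting the conversions that fall into group A is the same as drawing that count from a hypergeometric distribution, so the sketch below samples it directly. The seed, the variable names, and the use of `np.random.default_rng` are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

# Counts taken from the table above
conv_a, conv_b = 200, 182
size_a, size_b = 200 + 23539, 182 + 22406

# Observed difference in conversion rates (Price A minus Price B)
observed = conv_a / size_a - conv_b / size_b

# Assumption: an arbitrary fixed seed so the run is repeatable
rng = np.random.default_rng(0)
n_perm = 10_000

# Conversions landing in group A under each random relabeling:
# arguments are (all conversions, all non-conversions, size of group A)
perm_conv_a = rng.hypergeometric(
    conv_a + conv_b,
    (size_a - conv_a) + (size_b - conv_b),
    size_a,
    size=n_perm,
)
perm_conv_b = (conv_a + conv_b) - perm_conv_a
perm_delta = perm_conv_a / size_a - perm_conv_b / size_b

# One-sided p-value: share of relabelings with a larger difference than observed
p_value_sim = np.mean(perm_delta > observed)
print(p_value_sim)
```

The result depends on the seed and the number of relabelings, but it should land in the same neighborhood as the p-value reported above.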

## Normal Approximation

#### Normal approximation to binomial distribution
Why we use it:
- It makes probability calculations easy.
- The binomial distribution is perfectly symmetric if $p = 0.5$ and is skewed when $p \neq 0.5$.
- The normal approximation works best when $p$ is close to 0.5 and $n$ is large.
- As a rule of thumb, the normal approximation is reasonable if both $np \geq 10$ and $n(1-p) \geq 10$.

#### Example: a sample from a binomial distribution.
For a binomial variable $X_1\sim B(n_1, p_1)$, the normal approximation is: 
<center>$X_1\sim N(n_1p_1, n_1p_1(1-p_1))$</center>
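
As a quick check of how close the approximation can be, the sketch below compares an exact binomial probability with its normal counterpart. The values of `n`, `p`, and `k` are illustrative and not taken from the data above, and the half-unit continuity correction is a standard refinement that the text does not mention.

```python
import numpy as np
import scipy.stats as st

# Illustrative values, not from the conversion data
n, p = 100, 0.5
k = 55

# Rule of thumb: both expected counts should be at least 10
rule_ok = n * p >= 10 and n * (1 - p) >= 10

mean = n * p
sd = np.sqrt(n * p * (1 - p))

# Exact P(X <= k) from the binomial distribution
exact = st.binom.cdf(k, n, p)
# Normal approximation, with a continuity correction of 0.5
approx = st.norm.cdf(k + 0.5, loc=mean, scale=sd)

print(rule_ok, exact, approx)
```

With $p$ far from 0.5 or a small $n$, the two numbers would be expected to drift apart.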

## Proportion Z-test

#### Examples: samples from binomial distributions.
- From binomial distribution $B_1(n_1, p_1)$: sample size $n_1$, successes $s_1$
- From binomial distribution $B_2(n_2, p_2)$: sample size $n_2$, successes $s_2$

#### Step 1. We want to test whether the probabilities of the two binomial distributions we sampled from are the same ($p_1$ vs. $p_2$)
#### Step 2. Normal approximation to the binomial distribution:
$S_1$ (successes in sample 1) $\sim B(n_1, p_1)$ 
- Approximation: $S_1 \sim N(n_1p_1, n_1p_1(1-p_1))$ 
- $P_1$ (proportion 1) $= \frac{S_1}{n_1} \sim N(p_1, \frac{p_1(1-p_1)}{n_1})$ 

$S_2$ (successes in sample 2) $\sim B(n_2, p_2)$ 
- Approximation: $S_2 \sim N(n_2p_2, n_2p_2(1-p_2))$ 
- $P_2$ (proportion 2) $= \frac{S_2}{n_2} \sim N(p_2, \frac{p_2(1-p_2)}{n_2})$ 

#### Step 3. The question becomes a comparison of the means of two normal distributions
For independent samples, $P_1 - P_2 \sim N(p_1 - p_2, \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2})$  

#### Step 4. Estimation of the mean and standard error of $P_1 - P_2$
Mean: 
- we use $\hat{p}_1 = \frac{s_1}{n_1}$ to estimate $p_1$ and $\hat{p}_2 = \frac{s_2}{n_2}$ to estimate $p_2$
- we use $\hat{p}_1 - \hat{p}_2$ to estimate $p_1 - p_2$

Standard error:
- Pooled version: if $H_0$ ($p_1 = p_2 = p$) is true, the variance simplifies to $p(1-p)(\frac{1}{n_1} + \frac{1}{n_2})$. We estimate $p$ with $\hat{p} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} = \frac{s_1 + s_2}{n_1 + n_2}$, which gives $SE = \sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1} + \frac{1}{n_2})}$
- Unpooled version: $SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

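#### Step 5. Z-score and CI
- $Z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE}$
- CI for $p_1 - p_2$: $(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \cdot SE$. Under $H_0$ the true difference is 0, so we fail to reject $H_0$ when $\hat{p}_1 - \hat{p}_2$ lies within $0 \pm z_{\alpha/2} \cdot SE$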

In [54]:
st.norm.ppf(0.975), st.norm.cdf(1.9599)

(1.959963984540054, 0.9749962601845973)

In [43]:
from statsmodels.stats.proportion import proportions_ztest

count = np.array([200, 182])
nobs = np.array([23739, 22588])

stat, pval = proportions_ztest(count, nobs, alternative='larger')
pval

0.33094407441560325
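
Steps 1 to 5 can be traced by hand to see where the p-value comes from. The sketch below uses the pooled standard error for the test statistic and the unpooled one for a 95% confidence interval. Variable names are illustrative, and the sketch assumes `proportions_ztest` uses the pooled estimate by default for two samples.

```python
import numpy as np
import scipy.stats as st

s1, n1 = 200, 23739   # Price A: conversions, visitors
s2, n2 = 182, 22588   # Price B: conversions, visitors

# Step 4: point estimates
p1_hat, p2_hat = s1 / n1, s2 / n2
p_pool = (s1 + s2) / (n1 + n2)

# Pooled standard error, valid under H0 (p1 = p2)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Step 5: z-score and one-sided p-value, P(Z > z)
z = (p1_hat - p2_hat) / se_pooled
p_one_sided = st.norm.sf(z)

# 95% confidence interval for p1 - p2, using the unpooled standard error
se_unpooled = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
z_crit = st.norm.ppf(0.975)
diff = p1_hat - p2_hat
ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

print(z, p_one_sided, ci)
```

The one-sided p-value should agree with the `proportions_ztest` output above, and the interval containing 0 is consistent with failing to reject $H_0$.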

## One-way ANOVA

Assumptions:
- The samples are independent simple random samples.
- The populations are normally distributed.
- The population variances are equal: $\sigma_1^2 = \sigma_2^2 = ... = \sigma_k^2 = \sigma^2$

One-way ANOVA is a statistical method that tests:  
    $H_0 : \mu_1 = \mu_2 = ... = \mu_k$  
by comparing the variability between groups to the variability within groups.
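
To make the between-group versus within-group comparison concrete, the sketch below computes the F statistic by hand and checks it against `scipy.stats.f_oneway`. The three groups are hypothetical data simulated for illustration, with equal variances to match the assumptions above. They are not part of the pricing experiment.

```python
import numpy as np
import scipy.stats as st

# Hypothetical data: three normal groups with equal variance and different means
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (5.0, 5.2, 5.9)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Variability between groups: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability within groups: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

# Same test via SciPy
f_scipy, p_scipy = st.f_oneway(*groups)
print(f_manual, f_scipy, p_manual, p_scipy)
```

A large F means the group means differ by more than the within-group spread would suggest, which counts as evidence against $H_0$.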