# Review of A/B Testing Tools

- Sample size selection for proportions
  - Single treatment
  - Multitreatment
- Hypothesis test for proportions
- Power analysis for proportions

## Review of Tests

**t-test**

- Assumptions: https://en.wikipedia.org/wiki/Student%27s_t-test#Assumptions
- Used for continuous values.
- One-sample: assumes this statistic follows a t distribution: $\frac{ \hat{\mu} - \mu_0 }{ \hat{\sigma} / \sqrt{n} }$
- Two-sample:
  - sample means are approximately normal (holds with enough data by the CLT, provided the variance is finite)
  - both groups have the same standard deviation
  - both groups have the same size
  - usually still works reasonably well unless the IID assumption is broken
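
The statistic in the t-test notes above can be computed by hand. This is a minimal sketch using only the standard library; the sample values and `mu_0` are made up for illustration and do not come from the notebook.

```python
import math
import statistics

def one_sample_t(sample, mu_0):
    """Return the one-sample t statistic (mean - mu_0) / (s / sqrt(n))."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return (mean - mu_0) / (s / math.sqrt(n))

# Illustrative data only
t_stat = one_sample_t([1, 2, 3, 4, 5], mu_0=2)
print(t_stat)
```

The resulting value would then be compared against a t distribution with `n - 1` degrees of freedom.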

**z-test**

- TODO

**chi-square**

- TODO

### Key Terms

- Power is the probability that $H_0$ is rejected given that $H_1$ is true.
- The p-value is the probability of seeing a result at least as extreme as the one observed, given that the null hypothesis is true.
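
To make the definition of power concrete, here is a rough sketch of the power of a two-sided one-sample z-test under a normal approximation. The function name `z_test_power` is hypothetical, and this is not how statsmodels computes t-test power (that uses the noncentral t distribution), so the numbers will differ slightly from `ttest_power`.

```python
import math
from statistics import NormalDist

def z_test_power(effect_size, nobs, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    effect_size is the standardized difference (difference / std dev).
    """
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)     # two-sided critical value
    shift = effect_size * math.sqrt(nobs)  # mean of the test statistic under H1
    # Probability that the statistic lands in either rejection region under H1
    return (1 - norm.cdf(crit - shift)) + norm.cdf(-crit - shift)

print(z_test_power(0.0, nobs=60, alpha=0.1))  # with no true effect, power equals alpha
print(z_test_power(0.2, nobs=60, alpha=0.1))
```

Power grows with the effect size, the sample size, and alpha, which is why exactly one of the four quantities can be solved for once the other three are fixed.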

### Resources

- https://machinelearningmastery.com/effect-size-measures-in-python/
- http://jpktd.blogspot.com/2013/03/statistical-power-in-statsmodels.html
- https://www.evanmiller.org/ab-testing/sample-size.html

In [2]:
import statsmodels.stats.power as pwr

In [4]:
?pwr.ttest_power

Signature: pwr.ttest_power(effect_size, nobs, alpha, df=None, alternative='two-sided')
Docstring:
Calculate power of a ttest

File:      ~/Documents/learn-python/.venv/lib/python3.7/site-packages/statsmodels/stats/power.py
Type:      function


In [3]:
pwr.ttest_power(0.2, nobs=60, alpha=0.1, alternative='two-sided')

0.45558175996348543

In [7]:
tt_pwr = pwr.TTestIndPower()
?tt_pwr.solve_power

Signature:
tt_pwr.solve_power(
    effect_size=None,
    nobs1=None,
    alpha=None,
    power=None,
    ratio=1.0,
    alternative='two-sided',
)
Docstring:
solve for any one parameter of the power of a two sample t-test

for t-test the keywords are:
    effect_size, nobs1, alpha, power, ratio

exactly one needs to be ``None``, all others need numeric values

Parameters
----------
effect_size : float
    standardized effect size, difference between the two means divided
    by the standard deviati

In [5]:
pwr.TTestIndPower().solve_power(0.3, power=0.75, ratio=1, alpha=0.05, alternative='larger')

120.22320283709918

Mini-example

- 500 loans a day
- Want to try new terms
- Want to detect a difference in default rates of at least 1 percentage point
- Current default rate is 15%

In [37]:
import numpy as np

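# p = 0.5 would be the worst case for a binomial (variance p * (1 - p) is largest there)
p = 0.15  # rate assumed under the null hypothesis (no change)
std_dev = np.sqrt(p * (1 - p))  # standard deviation of a single Bernoulli outcome
diff = 0.01  # minimum difference to detect: 1 percentage point
effect_size = diff / std_dev  # standardized effect size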

pwr.TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    ratio=1,
    alternative='two-sided',
)

20015.555002803725

In [40]:
pwr.zt_ind_solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    ratio=1,
    alternative='two-sided',
)

20014.59429880191
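
For intuition, the z-test sample size above can be approximated with a closed-form formula. This is a sketch under stated assumptions: equal group sizes, a two-sided test, and the far-tail rejection probability ignored, so it lands close to (not exactly on) the solver's value. The function name `n_per_group` and the 50/50 split of daily loans are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sided, two-sample z-test
    with equal group sizes."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return 2 * ((z_alpha + z_power) / effect_size) ** 2

effect_size = 0.01 / math.sqrt(0.15 * 0.85)  # same value as in the cells above
n = n_per_group(effect_size)

# Assumption for illustration: all 500 daily loans enter the test, split evenly
loans_per_day = 500
days = math.ceil(2 * math.ceil(n) / loans_per_day)
print(n, days)
```

Note that the result is the size of each group, so the experiment needs roughly twice that many loans in total.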

In [46]:
chi_pwr = pwr.GofChisquarePower()
?chi_pwr.solve_power

Signature:
chi_pwr.solve_power(
    effect_size=None,
    nobs=None,
    alpha=None,
    power=None,
    n_bins=2,
)
Docstring:
solve for any one parameter of the power of a one sample chisquare-test

for the one sample chisquare-test the keywords are:
    effect_size, nobs, alpha, power

Exactly one needs to be ``None``, all others need numeric values.

n_bins needs to be defined, a default=2 is used.


Parameters
----------
effect_size : float
    standardized effect size, according to Cohen's definition.
    see :func:`statsmodels.stats.gof.chisquare_effect

In [48]:
import statsmodels
?statsmodels.stats.gof.chisquare_effectsize

Signature:
statsmodels.stats.gof.chisquare_effectsize(
    probs0,
    probs1,
    correction=None,
    cohen=True,
    axis=0,
)
Docstring:
effect size for a chisquare goodness-of-fit test

Parameters
----------
probs0 : array_like
    probabilities or cell frequencies under the Null hypothesis
probs1 : array_like
    probabilities or cell frequencies under the Alternative hypothesis
    probs0 and probs1 need to have the same length in the ``axis`` dimension.
    and broadcast in the other dimensions
    Both probs0 and probs1 are normalized to add to one (in the 

In [51]:
effect_size

0.028005601680560196

In [52]:
statsmodels.stats.gof.chisquare_effectsize(
    probs0=[0.15, 0.85],
    probs1=[0.16, 0.84],
)

0.028005601680560224

In [53]:
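statsmodels.stats.gof.chisquare_effectsize(
    probs0=[0.15, 0.85],
    # 0.14 + 0.87 = 1.01: probs1 is normalized to sum to one, so this is not
    # an exact 1-point decrease (that would be [0.14, 0.86])
    probs1=[0.14, 0.87],
)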

0.03188756626994474
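
The two effect sizes above can be reproduced by hand. The sketch below assumes the usual definition of Cohen's w, $w = \sqrt{\sum_i (p_{1i} - p_{0i})^2 / p_{0i}}$, with both inputs normalized to sum to one as the docstring describes; `cohen_w` is a hypothetical helper, not a statsmodels function.

```python
import math

def cohen_w(probs0, probs1):
    """Cohen's w: sqrt(sum((p1 - p0)**2 / p0)) after normalizing both inputs."""
    total0, total1 = sum(probs0), sum(probs1)
    p0 = [p / total0 for p in probs0]
    p1 = [p / total1 for p in probs1]
    return math.sqrt(sum((b - a) ** 2 / a for a, b in zip(p0, p1)))

w_up = cohen_w([0.15, 0.85], [0.16, 0.84])
w_down_as_run = cohen_w([0.15, 0.85], [0.14, 0.87])  # sums to 1.01, as in the cell above
w_down = cohen_w([0.15, 0.85], [0.14, 0.86])         # exact 1-point decrease
print(w_up, w_down_as_run, w_down)
```

For two bins, an exact 1-point shift in either direction gives the same w, and that value equals `diff / std_dev` from the earlier cell; the larger value for `[0.14, 0.87]` comes from the inputs not summing to one.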

In [55]:
pwr.GofChisquarePower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
)

10007.29714939089

## Hypothesis Test

In [68]:
import statsmodels.stats.api as sm

results1 = [
    [5000, 3081], # A
    [4000, 2700], # B
]

results1[0][0] / sum(results1[0]), results1[1][0] / sum(results1[1])

(0.6187353050365054, 0.5970149253731343)

In [58]:
?sm.proportions_ztest

Signature:
sm.proportions_ztest(
    count,
    nobs,
    value=None,
    alternative='two-sided',
    prop_var=False,
)
Docstring:
Test for proportions based on normal (z) test

Parameters
----------
count : {int, array_like}
    the number of successes in nobs trials. If this is array_like, then
    the assumption is that this represents the number of successes for
    each independent sample
nobs : {int, array_like}
    the number of trials or observations, with the same length as
    count.
value : float, array_like or None, optional
    This is the value of the null hypothesis equal to the proporti

In [62]:
sm.proportions_ztest(
    count=[results1[0][0], results1[1][0]], # Successes
    nobs=[sum(results1[0]), sum(results1[1])],
    value=0, # null is no diff
    alternative='two-sided',
    prop_var=False,
)

(2.693807365822314, 0.007064097883455329)
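
The result above matches a hand-rolled two-proportion z-test that uses the pooled success rate for the variance. The sketch below is standard library only; `two_proportion_ztest` is a hypothetical helper written for illustration, not part of statsmodels.

```python
import math
from statistics import NormalDist

def two_proportion_ztest(count1, nobs1, count2, nobs2):
    """Two-sided z-test for equal proportions using the pooled variance."""
    p1, p2 = count1 / nobs1, count2 / nobs2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Same counts as results1: A = 5000 of 8081, B = 4000 of 6700
z, p_value = two_proportion_ztest(5000, 5000 + 3081, 4000, 4000 + 2700)
print(z, p_value)
```

Passing a fixed proportion as `prop_var` replaces the pooled rate in the variance, which is why the later calls give a slightly different statistic.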

In [64]:
sm.proportions_ztest(
    count=[results1[0][0], results1[1][0]], # Successes
    nobs=[sum(results1[0]), sum(results1[1])],
    value=0, # null is no diff
    alternative='two-sided',
    prop_var=results1[0][0] / sum(results1[0]), # proportion of control, sample A
)

(2.7065728058306524, 0.006798167406192918)

In [65]:
sm.proportions_ztest(
    count=[results1[0][0], results1[1][0]], # Successes
    nobs=[sum(results1[0]), sum(results1[1])],
    value=0, # null is no diff
    alternative='larger',
    prop_var=results1[0][0] / sum(results1[0]), # proportion of control, sample A
)

(2.7065728058306524, 0.003399083703096459)

In [69]:
sm.proportions_ztest(
    count=[results1[0][0], results1[1][0]], # Successes
    nobs=[sum(results1[0]), sum(results1[1])],
    value=0, # null is no diff
    alternative='smaller',
    prop_var=results1[0][0] / sum(results1[0]), # proportion of control, sample A
)

(2.7065728058306524, 0.9966009162969035)

In [74]:
sm.proportions_chisquare(
    count=[results1[0][0], results1[1][0]], # Successes
    nobs=[sum(results1[0]), sum(results1[1])],
)

(7.256598124158506,
 0.007064097883455526,
 (array([[5000, 3081],
         [4000, 2700]]),
  array([[4920.43840065, 3160.56159935],
         [4079.56159935, 2620.43840065]])))

In [81]:
sm.ttest_ind(
    # successes are coded as 0 and failures as 1, so the t statistic comes out negative
    x1=[0] * results1[0][0] + [1] * results1[0][1],
    x2=[0] * results1[1][0] + [1] * results1[1][1],
    alternative='two-sided',
    usevar='pooled',
    weights=(None, None),
    value=0,
)

(-2.694286560668368, 0.007061918115177303, 14779.0)

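The pooled z-test, chi-square test, and t-test all give nearly the same p-value (about 0.007). With one degree of freedom, the chi-square statistic is the square of the pooled z statistic (2.694² ≈ 7.257).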

## Power Calculations

In [95]:
p0 = results1[0][0] / sum(results1[0])
p1 = results1[1][0] / sum(results1[1])
diff = p0 - p1
nobs1 = sum(results1[0])
# note: statsmodels defines ratio as nobs2 / nobs1; this computes nobs1 / nobs2
ratio = nobs1 / sum(results1[1])
std = np.sqrt( p0 * (1 - p0) )
effect_size = diff / std

sm.tt_ind_solve_power(
    effect_size=effect_size,
    nobs1=nobs1,
    ratio=ratio,
    alpha=0.5,  # note: 0.5, not the conventional 0.05, so the reported power is inflated
    alternative='two-sided',
), effect_size, diff

(0.989350186020283, 0.044719986905560935, 0.021720379663371148)

In [97]:
sm.zt_ind_solve_power(
    effect_size=effect_size,
    nobs1=nobs1,
    ratio=ratio,
    alpha=0.5,
    alternative='two-sided',
)

0.9893507241486387
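
The z-test power above can be reproduced with a normal approximation, which also makes it easy to see the effect of the inputs. This sketch assumes that `alpha=0.5` may have been intended as 0.05 and that `ratio` should follow the statsmodels convention of `nobs2 / nobs1`; both are guesses about intent, and `two_sample_z_power` is a hypothetical helper, not a statsmodels function.

```python
import math
from statistics import NormalDist

def two_sample_z_power(effect_size, nobs1, ratio, alpha):
    """Normal-approximation power of a two-sided, two-sample z-test.

    ratio is nobs2 / nobs1, following the statsmodels convention.
    """
    norm = NormalDist()
    nobs2 = nobs1 * ratio
    n_eff = 1 / (1 / nobs1 + 1 / nobs2)  # effective sample size
    shift = effect_size * math.sqrt(n_eff)
    crit = norm.inv_cdf(1 - alpha / 2)
    return (1 - norm.cdf(crit - shift)) + norm.cdf(-crit - shift)

n_a, n_b = 5000 + 3081, 4000 + 2700
p_a = 5000 / n_a
effect_size = (p_a - 4000 / n_b) / math.sqrt(p_a * (1 - p_a))

# Inputs exactly as in the cell above
as_run = two_sample_z_power(effect_size, n_a, ratio=n_a / n_b, alpha=0.5)
# Conventional alpha and ratio = nobs2 / nobs1
adjusted = two_sample_z_power(effect_size, n_a, ratio=n_b / n_a, alpha=0.05)
print(as_run, adjusted)
```

The first value should agree with the recorded z-test power; the second is lower because both the smaller alpha and the smaller second group reduce power.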