# Statistical Power & Bayes

Chris Overton  
2016-09-22  

Adapted from versions by several other lecturers

Morning: winding up our frequentist statistics
* Recap: significance vs. causality
* Statistical Power

Afternoon:
* Bayesian Inference

## Statistical Power: Objectives

* Know how to avoid misuse and misinterpretation of frequentist tests
* Understand statistical power in relation to Type II errors
* Understand the tradeoffs between power, significance, effect size, and sample size/cost
* Calculate sample size for a given power and estimated effect size

## Bayesian Inference
* Frequentist vs. Bayesian approaches
* Work with Bayes rule to compute posterior probability
* Prior, likelihood and posterior distributions

## Recap: significance vs. causality
You are given the following data from a clinical trial.  
Each patient either received treatment A or not.  
An hour later, a clinical observation was taken on each patient to determine whether or not they were in state H.  

|        | A | not A | Total |
|------ |---- | ------- | ------ |
| H  | 63 | 89 | 152 |
| not H | 129 | 2104 | 2233 |
| Total | 192 | 2193 | 2385 |

* What statistical test would you apply to these data?
* What is your estimate of the likely result?
* What are the implications for the likely effect A has on H?

## This is a typical chi-square test

Null hypothesis $H_0$: outcome H is independent of treatment A

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$

Please run a $\chi^2$ test on these data.  

What is the p-value?


In [12]:
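# SciPy code: chi-square test of independence on the 2x2 contingency table
import numpy as np
import scipy.stats as st

# Rows: H, not H; columns: A, not A
cont_table = np.array([[63, 89], [129, 2104]])
print(st.chi2_contingency(cont_table))
# For a 2x2 table, chi2_contingency applies Yates' continuity correction by default
chi2, pval, dof, exp_array = st.chi2_contingency(cont_table)

print("Chi2 Value: {0}\np-value: {1}\ndof: {2}\nExpected Value Array:\n{3}".format(chi2, pval, dof, exp_array))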

(239.82740717540122, 4.2888269144842352e-54, 1, array([[   12.23647799,   139.76352201],
       [  179.76352201,  2053.23647799]]))
Chi2 Value: 239.827407175
p-value: 4.28882691448e-54
dof: 1
Expected Value Array:
[[   12.23647799   139.76352201]
 [  179.76352201  2053.23647799]]
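
The numbers above can be reproduced by hand from the table margins. The following is a minimal sketch using only the standard library; the helper name `chi_square` is illustrative, and the continuity correction is written out explicitly on the assumption that it mirrors what `chi2_contingency` does by default for a 2x2 table.

```python
import math

# Observed counts from the table above: rows (H, not H), columns (A, not A)
observed = [[63, 89], [129, 2104]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Under independence, E_ij = (row total i) * (column total j) / grand total
expected = [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(observed, expected, yates=False):
    """Sum of (|O - E| - correction)^2 / E over all cells."""
    stat = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff -= min(0.5, diff)  # continuity correction, never past zero
            stat += diff ** 2 / e
    return stat

chi2_plain = chi_square(observed, expected)
chi2_yates = chi_square(observed, expected, yates=True)

# With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
p_yates = math.erfc(math.sqrt(chi2_yates / 2))

print(expected)
print(chi2_plain, chi2_yates, p_yates)
```

The corrected statistic is the one to compare with the SciPy output; the uncorrected Pearson statistic is slightly larger.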


## Recap: significance vs. causality
**Actual**  

|        | A | not A | Total |
|------ |---- | ------- | ------ |
| H  | 63 | 89 | 152 |
| not H | 129 | 2104 | 2233 |
| Total | 192 | 2193 | 2385 | 



**Expected** (rounded to integers)  


|        | A | not A | Total |
|:------ |:---- | :------- | :------ |
| H  | 12 | 140 | 152 |
| not H | 180 | 2053 | 2233 |
| Total | 192 | 2193 | 2385 |

p = 4.3e-54   

What does this mean?

## The punchline
* A is whether the patient took aspirin
* H is whether they (still?) have a headache an hour later

### ==> Frequentist tests often create illusions of causality

In this case, we are tempted to conclude that the causality runs from A to H.

In fact, our 'H' is probably left over from an earlier 'H' that was a likely cause for our 'A'.

So the $\chi^2$ rejection of $H_0$ is correct in having detected a relationship, but the relationship may be different than our first guess.
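
To see the direction of the association the test picked up, it helps to compare the headache rate in each group. This is plain arithmetic on the table counts; the variable names are illustrative.

```python
# Counts from the contingency table: (H, not H) within each treatment group
h_given_a = 63 / (63 + 129)        # P(H | A)
h_given_not_a = 89 / (89 + 2104)   # P(H | not A)

print(round(h_given_a, 3), round(h_given_not_a, 3))
print(round(h_given_a / h_given_not_a, 1))  # relative risk of H in the A group
```

Headaches are roughly eight times more frequent in the aspirin group, which fits the reading that people who already had a headache were the ones who took aspirin.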

### Causality

Causal inference is a branch of statistics (developed by, e.g., Pearl, Rubin, and Imbens) that attempts to obtain better indicators of causality from the data by modeling 'counterfactuals': outcomes that were not observed, but that would have occurred under a different treatment.  

- Example: what is the effect of (possibly self-selected) assignment to the aspirin group, separated from the effect of receiving aspirin?  

Among traditional statisticians, the gold standard is still the randomized 'double-blind study.' For our experiment, that would have required random assignment to the A and not A groups (for example, by giving some patients placebos). Randomization is by definition absent from observational studies, and imposing it may be unethical.

### Side note on $\chi^2$:
What would the test say if the actual H and not H data had matched the 'expected' values from the prior test, **but** we had been able to collect H data for only part of the A group?

**Actual: missing data scenario** 

|        | A | not A | Total |
|:------ |:---- | :------- | :------ |
| H  | 12 | 140 | 152 |
| not H | 180 | 2053 | 2233 |
| N/A | 80 | 0 | 80 |

What are two possible interpretations for these results?

In [10]:
# Same test on the 3x2 table that includes the N/A row
import numpy as np
import scipy.stats as st

cont_table2 = np.array([[12, 140], [180, 2053], [80, 0]])
print(st.chi2_contingency(cont_table2))
chi2, pval, dof, exp_array = st.chi2_contingency(cont_table2)

print("Chi2 Value: {0}\np-value: {1}\ndof: {2}\nExpected Value Array:\n{3}".format(chi2, pval, dof, exp_array))

(666.63922292517759, 1.7423364878048128e-145, 2, array([[   16.77241379,   135.22758621],
       [  246.4       ,  1986.6       ],
       [    8.82758621,    71.17241379]]))
Chi2 Value: 666.639222925
p-value: 1.7423364878e-145
dof: 2
Expected Value Array:
[[   16.77241379   135.22758621]
 [  246.4         1986.6       ]
 [    8.82758621    71.17241379]]
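
One way to probe where this huge statistic comes from is to rerun the test on only the rows where H was actually observed. The sketch below uses an illustrative helper, `pearson_chi2`, that computes the uncorrected Pearson statistic with the standard library only.

```python
def pearson_chi2(observed):
    """Uncorrected Pearson chi-square statistic for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected under independence
            stat += (o - e) ** 2 / e
    return stat

with_missing = [[12, 140], [180, 2053], [80, 0]]
complete_only = with_missing[:2]  # drop the N/A row

print(pearson_chi2(with_missing))   # large: driven mostly by the N/A row
print(pearson_chi2(complete_only))  # close to zero
```

The test is reacting to the pattern of missing data, not to any relationship between A and H among the patients who were measured.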


## Statistical power calculations: planning ahead for how much data to collect as evidence for an estimated effect
So far, we have talked about how to **interpret** given data.  
Now, we discuss how to plan ahead to collect data that will support a desired interpretation.

## Statistical Power
Recap: the two kinds of errors: Type I, Type II

##  Type I and Type II errors

Type I error: (paranoid: see something that's not really there)

*   Rejecting $H_0$ when it is true
*   Example:
    -   $H_0:$   defendant is innocent
    -   Convicting someone who is innocent

Type II error: (asleep: don't notice something that is really there)

*   Failing to reject $H_0$ when it is false
*   Example:
    -   $H_A:$  defendant is guilty
    -   Acquitting someone who is guilty

##  $H_0$ vs. $H_A$

![$H_0$ vs. $H_A$](images/testing.png)

##  Power

Power is the probability of not making a Type II error, i.e., rejecting $H_0$ when $H_A$ is true:

*   $\alpha = \Pr[ \texttt{reject } H_0 | H_0 \texttt{ is true}]$
*   $\beta = \Pr[ \texttt{fail to reject } H_0 | H_A \texttt{ is true}]$
*   $\beta$ is the Type II counterpart of $\alpha$: also an error probability, but computed assuming $H_A$ is true
*   $\mathit{power} = 1 - \beta$
*   An experiment with high power is more likely to reject $H_0$ when it is false
*   Typically, set $\mathit{power} = 1 - \beta = 0.80$
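
As a rough illustration of these definitions, the sketch below computes $\beta$ and power for a one-sided z-test with a known standard deviation. The function name `power_one_sided` and the example numbers (effect 0.5, standard deviation 2.0) are hypothetical; `NormalDist().inv_cdf` from the standard library plays the same role as `st.norm.ppf`.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1

def power_one_sided(n, effect, sd, alpha=0.05):
    """Power of a one-sided z-test for a mean shift `effect` with known `sd`."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)   # reject H0 above this z-score
    shift = effect * n ** 0.5 / sd           # mean of the z-statistic under H_A
    beta = STD_NORMAL.cdf(z_crit - shift)    # P(fail to reject H0 | H_A is true)
    return 1 - beta

for n in (10, 50, 100, 200):
    print(n, round(power_one_sided(n, effect=0.5, sd=2.0), 3))
```

Power rises with the sample size, and with a zero effect it collapses to $\alpha$: the only rejections left are Type I errors.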

### Related terminology: Sensitivity and Specificity, Precision, Recall
* How **sensitive** is a test at detecting $H_A$ (= rejecting $H_0$) when $H_A$ is true?  
$1 - \beta$ = power = '**recall**'

* How **specific** is the test at not being positive (i.e. not rejecting $H_0$) when in real life $H_0$ is correct?  
$1 - \alpha$

Caution: '**precision**' goes backward from statistical test to reality: if test is positive (i.e. reject $H_0$), what is the probability $H_0$ is really false?
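
Unlike sensitivity and specificity, precision also depends on how often $H_A$ is true in the first place. The sketch below makes that explicit; the parameter `prior_true` (the fraction of tested hypotheses for which $H_0$ is really false) is an assumption introduced here, and the value 0.10 is hypothetical.

```python
def precision(alpha, power, prior_true):
    """P(H0 is false | test rejects H0), by Bayes' rule."""
    true_positives = power * prior_true          # real effects that are detected
    false_positives = alpha * (1 - prior_true)   # true nulls rejected by chance
    return true_positives / (true_positives + false_positives)

print(precision(alpha=0.05, power=0.80, prior_true=0.10))
```

With the conventional $\alpha = 0.05$ and power 0.80, a 10% base rate of real effects gives a precision of only about 0.64.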

![$H_0$ vs. $H_A$](images/testing.png)

##  Given distributions with (different) sample means $\mu_a$ and $\mu_b$ and sample standard deviations $s$, what are the cutoffs and areas?

![$H_0$ vs. $H_A$](images/H0_vs_HA.png)

### Background: recap of z-scores

Realizations $\{x_i\}$ of a random variable X are converted into z-scores $\{z(x_i)\}$. 

* What is the (approximate) statistical distribution of the $\{z(x_i)\}$?

* What are their (approximate) mean $\mu$ and standard deviation $s$?

* On the other hand, if we had computed averages of disjoint groups of say 100 $\{z(x_i)\}$ at a time, what would be their approximate distribution?

### Background: recap of z-scores

Realizations $\{x_i\}$ of a random variable X are converted into z-scores. 

* What is the (approximate) statistical distribution of the $\{z(x_i)\}$?   

Trick question! Answer: same distribution as before, except re-scaled.  
In particular, if X was not normally distributed, neither are the z-scores of the $\{x_i\}$

* What are their (approximate) mean $\mu$ and standard deviation $s$?  

0 and 1

* On the other hand, if we had computed averages of disjoint groups of say 100 $\{z(x_i)\}$ at a time, what would be their approximate distribution?  

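Careful! Approximately $N(0, 0.1)$, where 0.1 is the standard deviation: by the central limit theorem, averages of $n$ independent elements are approximately normal, with a standard deviation $\sqrt{n}$ times smaller.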
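
These three answers can be checked by simulation. The sketch below assumes exponentially distributed data as an example of a non-normal X; the helpers `mean`, `sd`, and `skewness` are written out so that only the standard library is needed.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def mean(values):
    return sum(values) / len(values)

def sd(values):
    m = mean(values)
    return (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5

def skewness(values):
    """Third standardized moment; 0 for a symmetric distribution."""
    m, s = mean(values), sd(values)
    return mean([((v - m) / s) ** 3 for v in values])

# Skewed (exponential) data, deliberately far from normal
xs = [random.expovariate(1.0) for _ in range(100_000)]
mu, s = mean(xs), sd(xs)
zs = [(x - mu) / s for x in xs]

# Averages of disjoint groups of 100 z-scores
group_means = [mean(zs[i:i + 100]) for i in range(0, len(zs), 100)]

print(mean(zs), sd(zs))             # close to 0 and 1
print(skewness(xs), skewness(zs))   # the skew survives the conversion
print(sd(group_means))              # close to 0.1
```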

In [None]:
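import scipy.stats as st
alpha = 0.05  # example significance level (assumed value)
st.norm.ppf(alpha)  # z_alpha: the z-score below which a fraction alpha of N(0, 1) lies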

##  Given distributions with (different) sample means $\mu_a$ and $\mu_b$ and sample standard deviations $s$, what are the cutoffs and areas?

$\alpha$: blue shaded area - desired limit chosen in advance. Cutoff related to $z_{1-\alpha}$  
$\beta$: pink shaded area. Cutoff related to $z_\beta$

![$H_0$ vs. $H_A$](images/H0_vs_HA.png)

## The mighty power formula:

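$$n \geq \left((Z_{1-\beta} - Z_\alpha) \cdot \frac{s}{\mu_b - \mu_a}\right)^2$$

Here $Z_q$ denotes the $q$-quantile of the standard normal distribution, so $Z_\alpha$ is negative for small $\alpha$. The formula is for a one-sided test.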

(Cut over to demo)

TODO: write in derivation
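
A sketch of where the formula comes from, under assumptions not stated on the slide: a one-sided z-test with known standard deviation $s$, $\mu_b > \mu_a$, and a sample mean $\bar{x}$ over $n$ observations that is approximately normal with standard error $s/\sqrt{n}$.

Reject $H_0$ when $\bar{x}$ exceeds the cutoff

$$c = \mu_a + Z_{1-\alpha} \frac{s}{\sqrt{n}}$$

Under $H_A$, $\bar{x}$ is centered at $\mu_b$. Requiring the probability of falling below the cutoff (the Type II error) to be at most $\beta$ means

$$\frac{c - \mu_b}{s/\sqrt{n}} \leq Z_\beta = -Z_{1-\beta}$$

Substituting $c$ and rearranging:

$$(Z_{1-\alpha} + Z_{1-\beta}) \frac{s}{\sqrt{n}} \leq \mu_b - \mu_a$$

$$\sqrt{n} \geq (Z_{1-\beta} + Z_{1-\alpha}) \frac{s}{\mu_b - \mu_a}$$

Because $Z_{1-\alpha} = -Z_\alpha$, squaring both sides gives $n \geq ((Z_{1-\beta} - Z_\alpha) \frac{s}{\mu_b - \mu_a})^2$.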

##  Trade-off: significance level vs. power

You must trade off significance level and power:

*   For a fixed sample size and effect size, decreasing the chance of a Type I error will increase the chance of a Type II error
*   Wise men recommend:

| Term                          | Value     |
| :-----------                  | :----     |
| Significance level ($\alpha$) | $0.05$    |
| Confidence level              | $95\%$    |
| Power ($1 - \beta$)           | $0.80$    |

##  Factors affecting measurement of a signal

To increase probability of measuring a signal (rejecting $H_0$):

*   Increase number of observations, $n$
*   Increase effect size, i.e., $\theta_A - \theta_0$
*   Decrease noise, $\sigma^2$

## Real-life 'modern' frequentist statistics: A/B testing

![A/B test](images/AB_test.png)

### Setup: A/B Test our website’s homepage (I)

Our current homepage has a signup conversion rate of 6%. (The standard deviation would be 0.24.)

We want to test a new homepage design to see if we can get a 7% signup rate.  We’ll want an experiment where alpha is 1% and power is 95%.

How many visitors must visit the new homepage in order to fulfill the requirements of this experiment?


$$n \geq 9084$$
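
The answer follows from plugging the setup into the power formula. A minimal sketch, assuming the one-sided formula above; the function name `required_sample_size` is illustrative, and `NormalDist().inv_cdf` from the standard library stands in for `st.norm.ppf`.

```python
import math
from statistics import NormalDist

def required_sample_size(mu_a, mu_b, s, alpha, power):
    """Smallest n with n >= ((Z_{1-beta} - Z_alpha) * s / (mu_b - mu_a))^2."""
    z = NormalDist().inv_cdf
    z_alpha = z(alpha)   # negative for small alpha
    z_power = z(power)   # Z_{1-beta}, since power = 1 - beta
    n = ((z_power - z_alpha) * s / (mu_b - mu_a)) ** 2
    return math.ceil(n)

# Setup (I): 6% baseline, 7% target, alpha = 1%, power = 95%, s rounded to 0.24
print(required_sample_size(0.06, 0.07, s=0.24, alpha=0.01, power=0.95))  # 9084

# Same setup with the unrounded Bernoulli standard deviation sqrt(p * (1 - p))
print(required_sample_size(0.06, 0.07, s=math.sqrt(0.06 * 0.94), alpha=0.01, power=0.95))
```

Because $n$ scales with $s^2$, rounding the standard deviation shifts the answer noticeably, so the required sample size should be read as an estimate rather than an exact threshold.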

### Setup: A/B Test our website’s homepage (II)

Our current homepage has a signup conversion rate of 1%. (The standard deviation would be about 0.1.)

We want to test a new homepage design to see if we can get a 1.2% signup rate.  We’ll want an experiment where alpha is 1% and power is 95%.

How many visitors must visit the new homepage in order to fulfill the requirements of this experiment?


$$n \geq 39427$$

### Setup: A/B Test our website’s homepage (III)

Our current homepage has a signup conversion rate of 20%. (The standard deviation would be 0.4.)

We want to test a new homepage design to see if we can get a 30% signup rate.  We’ll want an experiment where alpha is 1% and power is 95%.

How many visitors must visit the new homepage in order to fulfill the requirements of this experiment?


$$n \geq 253 $$

## Summary: Statistical Power

* Know how to avoid misuse and misinterpretation of frequentist tests
* Understand statistical power in relation to Type II errors
* Understand the tradeoffs between power, significance, effect size, and sample size/cost
* Calculate sample size for a given power and estimated effect size


## Afternoon: Bayesian Inference
* Frequentist vs. Bayesian approaches
* Work with Bayes rule to compute posterior probability
* Prior, likelihood and posterior distributions

# Three coins problem - a second look

Three coins are in an Urn.  

Coin $B_1$ has sides HH (i.e. heads on each side)  
Coin $B_2$ has sides HT (like a normal fair coin)  
Coin $B_3$ has sides TT (tails on each side.)

Pull out a coin and flip it. It comes up H.

What is the probability the same coin comes out H if you flip it a second time?

In [14]:
# As tested by simulation:
import random
import pandas as pd

coins = ['HH', 'HT', 'TT']
results = []
for _ in range(100):  # only 100 trials, so the estimates are noisy
    coin = random.choice(coins)  # draw a coin from the urn
    results.append([random.choice(coin) for _ in range(2)])  # flip it twice
df = pd.DataFrame(results, columns=['first', 'second']) == 'H'
df.groupby('first').mean()  # fraction of second flips that are H, by first flip

| first | second |
|:------ |:------ |
| False | 0.173913 |
| True | 0.851852 |


# Our old solution

$ P(X_2 = H) = 1/2 $

$ P(X_2 = H | X_1 = H) = \frac{5}{6} \ne \frac{1}{2} = P(X_2=H) $

Originally, each coin had a probability 1/3 of being picked. Now it is impossible for the coin picked to have been the third, and the first coin (HH) is now twice as likely as the second (HT) to be the one picked.

# Old solution uses conditional probability

$ P(B|A) = P(A \cap B) / P(A) $

in other words

$ P(X_2=H | X_1=H) = P(X_2 = H \cap X_1 = H) / P(X_1=H) = \frac{\frac{1}{3} + \frac{1}{3}\frac{1}{4}}{\frac{1}{2}} = \frac{5}{6}$
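
The same numbers can be obtained exactly by listing every equally likely outcome. This sketch uses exact fractions from the standard library; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

coins = ['HH', 'HT', 'TT']

# Equally likely outcomes: (coin, face shown on flip 1, face shown on flip 2)
outcomes = [(coin, f1, f2) for coin in coins for f1, f2 in product(coin, repeat=2)]
weight = Fraction(1, len(outcomes))  # 3 coins x 2 faces x 2 faces = 12 outcomes

p_first_h = sum(weight for _, f1, _ in outcomes if f1 == 'H')
p_both_h = sum(weight for _, f1, f2 in outcomes if f1 == 'H' and f2 == 'H')

print(p_first_h, p_both_h, p_both_h / p_first_h)  # 1/2 5/12 5/6
```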

Now let's look at this using Bayes formula.

## Bayes Rule
Allows us to compute $P(B|A)$ using information about $P(A|B)$

$$P(B|A) = \frac{P(A|B)P(B)}{P(A)}$$

### Proof (remember this if nothing else):  

The probability for the intersection can be obtained from either end of the equation below:

$$P(B|A) * P(A) = P(A \cap B) = P(A|B)* P(B)$$

### The reason this is helpful: often, it is easier to compute conditional probabilities going in one direction, but you really want conditional probabilities going in the other "hard" direction

## Bayes Rule
Bayes' Rule helps when all you know about $P(B)$ is an initial guess - the **prior**, and you are trying to figure out how additional evidence (A) alters this guess. 

Often it is easier to reason from B to A, i.e. to compute $P(A|B)$ (the **likelihood**), than to compute what you actually want, the **posterior probability** $P(B|A)$:

$$P(B|A) = \frac{P(A|B)P(B)}{P(A)}$$

Here, the denominator $P(A)$ might seem hard to compute, but can be obtained using the **Law of Total Probability**


# Law of Total Probability (LOTP)

If $\{B_i\}$ is a partition of a sample space $ X $, meaning $ \cup_i B_i = X$ and $B_i \cap B_j=\emptyset$  $ \forall i \ne j$

Then for any event $A \subset X$  

$ P(A) = \sum_i P(A\cap B_i) $

or

$ P(A) = \sum_i P(A|B_i) P(B_i)$  


## Back to Bayes Rule
Assuming a partition $\{B_i\}$, you can thus re-write the denominator as follows (for any i):

$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{P(A)}$$

$$ = \frac{P(A|B_i)P(B_i)}{\sum_j P(A|B_j) P(B_j)}$$

Now, all of the conditional probabilities go in the 'easy' direction from $B_i$'s to $A$.


## Back to the coin problem

Our question asks which of three disjoint events has occurred: whether the coin chosen is $B_1$ (HH), $B_2$ (HT) or $B_3$ (TT).

We want to know about the outcome A = 'the coin comes up heads'

This might seem tricky, but it is easier to reason from $B_i$ to $A$:  
$P(A|B_1) = 1$, $P(A|B_2) = 1/2$, and $P(A|B_3) = 0$

Our prior probability for each $B_i = \frac{1}{3}$

Plugging this into Bayes formula gives:

$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_j P(A|B_j) P(B_j)} = \frac{P(A|B_i)P(B_i)}{1 * 1/3 + 1/2 * 1/3 + 0 * 1/3} = P(A|B_i)P(B_i) * 2$$


## Back to the coin problem 
Note that we had already evaluated each possibility for the numerator when computing the denominator.  

It follows that $P(B_1|A) = 2/3$, $P(B_2|A) = 1/3$, and $P(B_3|A) = 0$

We can now use our *posterior* probabilities of the $B_i$ to calculate the probability of a second H coin flip:

$$P(X_2 = H | X_1 = H) = \sum_i P(X_2 = H | B_i) P(B_i | X_1 = H) = 2/3 * 1 + 1/3 * 1/2 = 5/6$$

This might seem like a long path to a result that took us fewer lines earlier, but now we have the additional estimates of probabilities for each of the coins.

Further coin flip results would continue to alter these!
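
A minimal sketch of that repeated updating, assuming every further flip also comes up heads. The function name `bayes_update` is illustrative; each posterior becomes the prior for the next flip.

```python
from fractions import Fraction

def bayes_update(prior, likelihood):
    """Posterior P(B_i | A) from priors P(B_i) and likelihoods P(A | B_i)."""
    joint = [p * lik for p, lik in zip(prior, likelihood)]
    evidence = sum(joint)  # P(A), by the Law of Total Probability
    return [j / evidence for j in joint]

p_heads = [Fraction(1), Fraction(1, 2), Fraction(0)]  # P(H | B_1), P(H | B_2), P(H | B_3)
belief = [Fraction(1, 3)] * 3                         # uniform prior over the coins

for flip in range(1, 4):  # observe heads three times in a row
    belief = bayes_update(belief, p_heads)
    next_heads = sum(b * lik for b, lik in zip(belief, p_heads))
    print(flip, [str(b) for b in belief], next_heads)
```

After the first flip this reproduces the posteriors 2/3, 1/3, 0 and the predictive probability 5/6; each additional head shifts more weight toward the HH coin.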

## 'Reliable' test for rare disease - a famously counterintuitive example

A fairly reliable diagnostic test T exists for a rare disease D. The result of the test is either positive ($T_+$) or negative ($T_-$)

|Conditional Events | Probability |
| --------- | ----------- |
| $ P(T_+|D)$ | .99 |
| $ P(T_+|\neg D)$ | .05 |
| $P(D)$ | .005 |

So for someone who tests positive, what is their probability of having the disease ($ P(D | T_+) $)?

First, give a quick rough answer!  
In particular, are they more likely to have the disease or not?

## 'Reliable' test for rare disease: Rough answer $P(D|T_+) \approx 1/11$
There are two ways to test positive: with the disease ($T_+ \cap D$) and without it ($T_+ \cap \neg D$).  

The rare events gating these are respectively $P(D) = .005$ and $ P(T_+ | \neg D) = .05$

Because $D$ and $\neg D$ partition the space, Bayes theorem says:

$$P(D|T_+) = \frac{P(T_+|D)P(D)}{P(T_+|D)P(D) + P(T_+| \neg D)P( \neg D)} \approx \frac{.005}{.005 + .05} = \frac{1}{11}$$

The quick estimate comes from ignoring the factors close to 1, namely $P(T_+|D) = .99$ and $P(\neg D) = .995$.

If the test were less reliable ($P(T_+|D) \ll 1$), we would need to include that factor in the estimate as well.

## 'Reliable' test for rare disease
This probability update of D goes in the right direction, from $0.005$ to about $0.09$.

Even so, the update may seem surprisingly small!
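
For comparison with the rough answer, the exact posterior can be computed directly. The sketch below also shows a second positive result, under the assumption (not made in the original example) that repeated tests err independently given the patient's true disease status. The function name `posterior_disease` is illustrative.

```python
def posterior_disease(prior, sensitivity=0.99, false_positive_rate=0.05):
    """P(D | T+) from P(D), P(T+ | D), and P(T+ | not D)."""
    true_pos = sensitivity * prior
    false_pos = false_positive_rate * (1 - prior)
    return true_pos / (true_pos + false_pos)

after_one = posterior_disease(0.005)
# Assumption: a second test errs independently of the first, given disease status
after_two = posterior_disease(after_one)

print(round(after_one, 4), round(after_two, 4))
```

The exact value after one positive test is close to the rough 1/11, and a second independent positive test would raise the probability to roughly two-thirds.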

## Bayesian Updating: Accumulation of evidence

In today's pairs sprint, you'll implement a discrete approximation to the following:

![Bayesian updating](images/bayesianUpdate.png)

## Bayesian Updating: Accumulation of evidence (II)

Observe how situations impossible from the data are updated to 0 (e.g. p=1 when any tails have been seen.)

After many updates, the posterior distribution starts to resemble a normal distribution centered near the estimate obtained via MOM or MLE.
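
A bare-bones sketch of such a discrete update, assuming a uniform prior on a grid of 101 candidate values for $p = P(\text{heads})$ and a hypothetical sequence of flips. It is one possible starting point, not the sprint's reference solution.

```python
# Grid of candidate values for p = P(heads), with a uniform prior
grid = [i / 100 for i in range(101)]
posterior = [1 / len(grid)] * len(grid)

def update(posterior, flip):
    """One Bayesian update of the grid posterior for a single flip, 'H' or 'T'."""
    likelihood = [p if flip == 'H' else 1 - p for p in grid]
    unnormalized = [post * lik for post, lik in zip(posterior, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

flips = 'HHTHTHHH'  # hypothetical data: 6 heads, 2 tails
for flip in flips:
    posterior = update(posterior, flip)

mode = grid[posterior.index(max(posterior))]
print(mode, posterior[0], posterior[-1])
```

The endpoints $p = 0$ and $p = 1$ drop to exactly zero as soon as a head and a tail have both been seen, and the mode sits at the observed fraction of heads.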

## The fierce ideological war between Bayesians and 'Frequentists': XKCD#1132
    
![xkcd Bayes vs Freq](images/xkcd1132.png)    

## The fierce ideological war between Bayesians and 'Frequentists': XKCD#1132
    
![xkcd Bayes vs Freq](images/xkcd1132b.png)    

#  Summary


##  Summary

**Q**:  When do you use factorial vs. permutation vs. combination?

**Q**:  What is independence?

**Q**:  What is conditional probability? How do I use Bayes's rule?

**Q**:  What are the PDF and CDF?

**Q**:  What are moments and which should you use to characterize a distribution?  How do you calculate them?

**Q**:  What is a quantile?

**Q**:  What are some common distributions?  What type of processes do they model?