# Power / Bayes
### Jack Bennetto
#### May 4, 2017

In [None]:
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt
%matplotlib inline

## Objectives

 * Define power, significance, standard deviation, effect size and sample size, and explain their relationship
 * Compute the sample size needed for an experiment
 * Describe the difference between Frequentist and Bayesian statistics
 * Use Bayes' theorem to calculate posterior probabilities 

## Agenda

In the morning we'll talk about power calculations when doing Frequentist A/B testing. In the afternoon we'll move on to Bayesian statistics.

### Morning: Power Calculation

* Review of A/B testing
* Types of Errors and Power
* Calculating sample size

### Afternoon: Bayesian Statistics

* Compare Frequentist and Bayesian approaches
* Review Bayes' theorem
* Calculate posterior probabilities

## Morning Lecture: Power Calculations

### A/B Testing

Yesterday we talked about Frequentist A/B testing. In this approach, you set up a test, measure some number of values, and decide how surprised you need to be to reject the null hypothesis. This significance level, called $\alpha$, is often something like 0.05, i.e., if there's less than a 5% chance your results (or something even more extreme) could have happened by chance given the null hypothesis, then you assume that hypothesis is wrong.


Of course, you might be wrong and reject the null hypothesis even though it's correct, just because of the sample of data you took. That is called a **false positive** or **type I error**. It's positive because you have a positive result (rejecting the null) and it's false because it's wrong. I remember that it's a type I error because "false positive" is much more common than "false negative" in casual conversation.

The lower you set $\alpha$, the less likely you'll get a type-I error.

But there's a tradeoff. The opposite situation is a **false negative** or **type II error**, when the null hypothesis is wrong but you fail to reject it. The probability of a type-II error (assuming the null hypothesis is wrong) is $\beta$.

Holding everything else constant, the lower you set $\alpha$ the higher $\beta$ will be, and vice versa.
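
To see the tradeoff in numbers, here is a minimal sketch that computes $\beta$ for a few values of $\alpha$ in a one-sided test of a sample mean. The helper name `beta_for_alpha` and the default values of `n`, `sigma`, and `effect_size` are assumptions chosen for illustration.

```python
import scipy.stats as scs

def beta_for_alpha(alpha, n=100, sigma=10.0, effect_size=2.0):
    """Type II error rate of a one-sided z-test of H0: mean = 0 (illustrative defaults)."""
    standard_error = sigma / n**0.5
    # The critical value is where the upper tail of the H0 distribution has area alpha
    critical_value = scs.norm(0, standard_error).ppf(1 - alpha)
    # beta is the area of the HA distribution that falls below the critical value
    return scs.norm(effect_size, standard_error).cdf(critical_value)

for alpha in [0.10, 0.05, 0.01]:
    print("alpha = {0:4.2f}  ->  beta = {1:6.4f}".format(alpha, beta_for_alpha(alpha)))
```

Each step down in $\alpha$ moves the critical value to the right, so more of the alternative distribution falls on the "fail to reject" side.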

**Power**, defined as $1 - \beta$, is the probability that you correctly reject the null hypothesis, assuming that it is in fact false.


|               | Reject $H_0$            | Fail to reject $H_0$
|---------------|-------------------------|---------------
|**$H_0$ false**| Correct ($1-\beta$)     | Type II error ($\beta$)
|**$H_0$ true** | Type I error ($\alpha$) | Correct ($1-\alpha$)
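
The cells of this table can also be estimated by simulation. The sketch below repeats an imaginary experiment many times, once with $H_0$ true and once with $H_0$ false, and counts how often a one-sided test rejects. The function name `simulate_rejection_rate` and all of the parameter values are assumptions for illustration.

```python
import numpy as np
import scipy.stats as scs

def simulate_rejection_rate(true_mean, n=100, sigma=10.0, alpha=0.05,
                            n_trials=20000, seed=0):
    """Fraction of simulated experiments in which a one-sided z-test rejects H0: mean = 0."""
    rng = np.random.default_rng(seed)
    standard_error = sigma / np.sqrt(n)
    critical_value = scs.norm(0, standard_error).ppf(1 - alpha)
    # The mean of n draws from N(true_mean, sigma) is distributed N(true_mean, standard_error),
    # so we draw the sample means directly rather than the raw samples
    sample_means = rng.normal(true_mean, standard_error, size=n_trials)
    return np.mean(sample_means > critical_value)

type_1_rate = simulate_rejection_rate(true_mean=0.0)  # H0 true: every rejection is a false positive
power = simulate_rejection_rate(true_mean=2.0)        # H0 false: every rejection is correct

print("type I error rate  = {0:6.4f}".format(type_1_rate))
print("power              = {0:6.4f}".format(power))
print("type II error rate = {0:6.4f}".format(1 - power))
```

The simulated type I error rate should land close to the chosen $\alpha$, and the two numbers in the "$H_0$ false" row add to one by construction.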

#### An aside on terminology

Generally confusion matrices are discussed in the context of predictive statistics, but it's the same concept so I'm including this here to put the various terms in context.

![confusion matrix](Confusion Matrix.png)

## Plotting the power

In [None]:
def plot_power(n, sigma, effect_size, critical_value):
    """Plot the distributions of the sample mean under H0 and HA; print alpha, beta, and power."""
    standard_error = sigma / n**0.5

    fig, ax = plt.subplots(figsize=(10,6))
    x = np.linspace(-3, 8, 200)
    xpos = x[x >= critical_value]   # region where we reject H0
    xneg = x[x <= critical_value]   # region where we fail to reject H0

    h0 = scs.norm(0, standard_error)
    ha = scs.norm(effect_size, standard_error)

    ax.plot(x, h0.pdf(x), color='red', label='$H_0$')
    ax.plot(x, ha.pdf(x), color='blue', label='$H_A$')
    ax.fill_between(xpos, 0, h0.pdf(xpos), color='red', alpha=0.2, label="$\\alpha$")
    ax.fill_between(xneg, 0, ha.pdf(xneg), color='blue', alpha=0.2, label="$\\beta$")
    ax.fill_between(xpos, 0, ha.pdf(xpos), color='white', hatch='////', alpha=0.2, label="Power")
    ax.axvline(critical_value, color='black')
    ax.set_xlabel("sample mean")
    ax.set_ylabel("pdf")
    ax.set_ylim(bottom=0.0)
    ax.legend()

    print("standard deviation = {0:7.4f}".format(sigma))
    print("sample size (n)    = {0:7d}".format(n))
    print("   standard error  = {0:7.4f}".format(standard_error))
    print("alpha              = {0:7.4f}".format(1-h0.cdf(critical_value)))
    print("effect size        = {0:7.4f}".format(effect_size))
    print("power              = {0:7.4f}".format(1-ha.cdf(critical_value)))
    print("   beta            = {0:7.4f}".format(ha.cdf(critical_value)))

In [None]:
plot_power(n=100, sigma=10.0, effect_size=2., critical_value=1.5)

The **effect size** is the amount of difference we hope to detect with our study. If we will only recommend a drug if it lowers blood pressure by at least 10 mm Hg, that's the effect size. In the picture above and the calculations below we take the most conservative approach and assume the true effect is exactly that size.

We usually don't know the **standard deviation** $\sigma$ beforehand, and we may need to do a pilot study to estimate it.

We've already talked about $\alpha$ and power (or $1-\beta$). When we design a study, we generally decide what values we require for these. It's common to choose 0.05 for $\alpha$ and 0.80 for power.

The remaining factor is the **sample size** $n$, the number of observations in our sample. In general we are trying to calculate this from the other factors.

### Calculating sample size

If the samples are large enough, the sample mean approximately follows a normal distribution, by the **central limit theorem**. The standard deviation of this distribution is called the **standard error** and is given by

$$SE = \frac{\sigma}{\sqrt{n}}$$

where $\sigma$ is the standard deviation of the original distribution and $n$ is the sample size.
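
As a quick check of this formula, the sketch below draws many samples from a distribution that is clearly not normal and compares the spread of their means with $\sigma / \sqrt{n}$. The choice of an exponential distribution and the particular values of `sigma`, `n`, and `n_samples` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, n_samples = 10.0, 100, 5000

# An exponential distribution with scale=sigma has standard deviation sigma
samples = rng.exponential(scale=sigma, size=(n_samples, n))
sample_means = samples.mean(axis=1)        # one mean per sample of size n

observed_se = sample_means.std(ddof=1)     # spread of the sample means
predicted_se = sigma / np.sqrt(n)          # the standard error formula

print("observed standard error  = {0:6.4f}".format(observed_se))
print("predicted standard error = {0:6.4f}".format(predicted_se))
```

A histogram of `sample_means` should also look roughly bell-shaped, even though the individual draws are strongly skewed.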

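For the minimum value of $n$, the distance from $\mu_0$ to the critical value is $SE \cdot Z_\alpha$, and the distance from the critical value to $\mu_A$ is $SE \cdot Z_\beta$, where $Z_\alpha$ and $Z_\beta$ are the (positive) numbers of standard errors that cut off a tail of probability $\alpha$ and $\beta$, respectively, in a one-sided test. So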

$$\begin{align}
\mu_A - \mu_0 & = SE \cdot (Z_\beta + Z_\alpha) \\
              & = \frac{\sigma}{\sqrt{n}} (Z_\beta + Z_\alpha)
\end{align}
$$

so

$$ n \ge \left( \frac{\sigma ( Z_\beta + Z_\alpha )}{\mu_A - \mu_0} \right)^2 $$

In [None]:
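def calculate_sample_size(sigma, effect_size, alpha=0.05, power=0.80):
    """Minimum n for a one-sided z-test; round the result up to a whole number."""
    z_alpha = scs.norm(0, 1).ppf(1 - alpha)  # standard errors from mu_0 to the critical value
    z_beta = scs.norm(0, 1).ppf(power)       # standard errors from the critical value to mu_A
    return (sigma * (z_alpha + z_beta) / effect_size)**2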
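
Here is a worked example of the formula, written as a self-contained sketch so it can be run on its own. The inputs ($\sigma = 10$, effect size 2, $\alpha = 0.05$, power 0.80) are assumptions for illustration. It rounds $n$ up to a whole number and then checks the power that this $n$ actually achieves.

```python
import math
import scipy.stats as scs

sigma, effect_size, alpha, power = 10.0, 2.0, 0.05, 0.80

z_alpha = scs.norm(0, 1).ppf(1 - alpha)   # about 1.645
z_beta = scs.norm(0, 1).ppf(power)        # about 0.842
n = math.ceil((sigma * (z_alpha + z_beta) / effect_size) ** 2)

# Check: with this n, where is the critical value and what power do we get?
standard_error = sigma / math.sqrt(n)
critical_value = z_alpha * standard_error
achieved_power = 1 - scs.norm(effect_size, standard_error).cdf(critical_value)

print("sample size (n) = {0:d}".format(n))
print("achieved power  = {0:6.4f}".format(achieved_power))
```

Because $n$ is rounded up, the achieved power comes out slightly above the 0.80 we asked for.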

### Relationship of factors

As a summary, here is how the factors relate. This should be read as, for example, "Holding everything else constant, if **Effect size** is **Larger** then **Power** is **Larger**."

| Factor             | Direction | Opposite direction |
|--------------------|-----------|--------------------|
| Power              | Larger    | Smaller            |
| Effect size        | Larger    | Smaller            |
| Sample size        | Larger    | Smaller            |
| Standard deviation | Smaller   | Larger             |
| Significance level | Larger    | Smaller            |
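
These directions can be checked numerically. The sketch below computes the power of a one-sided test for a baseline scenario and then changes one factor at a time. The function name `power_of_test` and the baseline values are assumptions for illustration. Note that raising $\alpha$ raises power, which is the same $\alpha$/$\beta$ tradeoff described in the morning lecture.

```python
import scipy.stats as scs

def power_of_test(n=100, sigma=10.0, effect_size=2.0, alpha=0.05):
    """Power of a one-sided z-test of H0: mean = 0 when the true mean is effect_size."""
    standard_error = sigma / n**0.5
    critical_value = scs.norm(0, standard_error).ppf(1 - alpha)
    return 1 - scs.norm(effect_size, standard_error).cdf(critical_value)

# Each scenario changes exactly one factor relative to the baseline
scenarios = [
    ("baseline", {}),
    ("larger effect size", {"effect_size": 3.0}),
    ("larger sample size", {"n": 200}),
    ("smaller standard deviation", {"sigma": 5.0}),
    ("larger significance level", {"alpha": 0.10}),
]
for label, changes in scenarios:
    print("{0:28s} power = {1:6.4f}".format(label, power_of_test(**changes)))
```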

# Bayesian Statistics
## Afternoon lecture

So far we've discussed Frequentist statistics. Let's review this a bit.

What is likelihood?

What is a confidence interval?

## Frequentists vs Bayesians

In general, Frequentists and Bayesians look at statistical problems in fundamentally different ways that are almost inverses of each other.

The Frequentist says "There is one true hypothesis, though we don't know what it is. The observation (or data or sample) is one of many that could have been generated."

The Bayesian says "There is no single true hypothesis, but any number of possible hypotheses, each with a probability. The only thing we can know for certain is the observation."

For a Frequentist, it's absurd to talk about the probability of some hypothesis being true; either it is true or it isn't.

### The cookie problem

Last week we talked about drawing a vanilla cookie from one of two bowls, and used that to calculate the probability that we'd picked the first bowl. Imagine that we'd continued the experiment, drawing additional cookies (perhaps with replacement) to get a better and better idea of which bowl we'd picked.

This is Bayesian statistics.

### Bayes' theorem

Suppose we're considering some hypothesis $H$ and we've collected some data $\mathbf{X}$.
$$ P(H|\mathbf{X}) = \frac{P(\mathbf{X}|H) P(H)}{P(\mathbf{X})} $$

Each term has a name.

* $P(H)$ is the *prior probability*
* $P(\mathbf{X}|H)$ is the *likelihood*.
* $P(\mathbf{X})$ is the *normalizing constant*.
* $P(H|\mathbf{X})$ is the *posterior probability*.


If there are a bunch of hypotheses $H_1, H_2, \ldots, H_n$, we could write this as

$$\begin{align}
P(H_i|\mathbf{X}) & = \frac{P(\mathbf{X}|H_i) P(H_i)}{P(\mathbf{X})}\\
         & = \frac{P(\mathbf{X}|H_i) P(H_i)}{\sum_{j=1}^{n} P(\mathbf{X}|H_j) P(H_j)}
\end{align}
$$

Here we see the normalizing constant is the likelihood times the prior, summed over all possible hypotheses. In other words, it's the constant (independent of hypothesis) that all the numerators must be divided by so that the posterior probabilities add up to one.
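
As a small illustration of this formula, here is the cookie problem as a sketch in code. The bowl contents are not given in these notes, so the likelihoods below (75% vanilla in the first bowl, 50% in the second) are assumed values, and `bayes_update` is a hypothetical helper name.

```python
import numpy as np

def bayes_update(prior, likelihood):
    """Posterior over hypotheses: prior times likelihood, divided by the normalizing constant."""
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

prior = np.array([0.5, 0.5])                # each bowl equally likely before drawing
likelihood_vanilla = np.array([0.75, 0.5])  # assumed P(vanilla | bowl 1), P(vanilla | bowl 2)

posterior = bayes_update(prior, likelihood_vanilla)
print("after one vanilla cookie :", posterior)

# Drawing a second vanilla cookie (with replacement): the old posterior is the new prior
posterior_2 = bayes_update(posterior, likelihood_vanilla)
print("after two vanilla cookies:", posterior_2)
```

With these assumed numbers, one vanilla cookie moves the probability of the first bowl from 0.5 to 0.6, and a second one moves it further in the same direction.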



### Bayesian statistics to find a mean
Let's assume you have a bunch of points drawn from a normal distribution. To make things easy, let's say you happen to know that the standard deviation is 3, and the mean $\mu \in \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$.

Humans are pretty bad at choosing random numbers, so someone will need to run this.

In [None]:
mu = scs.randint(0,10).rvs()
print(mu)

Now you need to choose a number from the distribution.

In [None]:
sd = 3
scs.norm(mu, sd).rvs()

In [None]:
datum = ???enter number here???
likelihood = []
for i in range(0,10):
    likelihood.append(scs.norm(i, sd).pdf(datum))
    print("The likelihood of N({0}, {1}) generating {2:5.2f} is {3:6.4f}"
           .format(i, sd, datum, likelihood[i]))

In [None]:
fig, ax = plt.subplots()
ax.bar(range(10), likelihood)
ax.set_xlabel('hypothesized mean')
ax.set_ylabel('likelihood')
plt.show()

Question: Do these add to one?

Which of these hypotheses has the maximum likelihood of producing the data?

If we were Frequentists, we'd go with that, and then we'd construct a confidence interval: a range built by a procedure that, had we repeated the sampling many times, would include the actual value a certain fraction of the time (maybe 95%).

But today we're all going to be Bayesians, which means we're going to assign probabilities to the hypotheses.

The tough part of being a Bayesian is that we need to start out with prior probabilities. For this, we'll assume that all the probabilities are equal. Since the mean was chosen uniformly at random, that works out, but if I'd just asked you to choose a number that would have been weird.

In [None]:
probs = np.ones(10)/10

Now we need to multiply each of these by the likelihood...

In [None]:
for i in range(10):
    probs[i] *= scs.norm(i, sd).pdf(datum)

...and then normalize them by dividing each by the sum:

In [None]:
probs /= probs.sum()

...and print out the **probabilities**.

In [None]:
for i in range(0,10):
    print("The probability of N({0}, {1}) being correct is {2:6.4f}"
           .format(i, sd, probs[i]))

fig, ax = plt.subplots()
ax.bar(range(10), probs)
ax.set_xlabel('hypothesized mean')
ax.set_ylabel('posterior probability')
plt.show()

That was great, but maybe we should get some more data. Generate another number!

In [None]:
scs.norm(mu, sd).rvs()

In [None]:
datum = ???enter number here???

for i in range(10):
    probs[i] *= scs.norm(i, sd).pdf(datum)
probs /= probs.sum()

for i in range(0,10):
    print("The probability of N({0}, {1}) being correct is {3:6.4f}"
           .format(i, sd, datum, probs[i]))

fig, ax = plt.subplots()
ax.bar(range(10), probs)
ax.set_xlabel('hypothesized mean')
ax.set_ylabel('posterior probability')
plt.show()
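
The same updating can be written without the manual copy-and-paste step. The sketch below uses three made-up observations (they are not output from the cells above) and shows that updating one datum at a time gives the same posterior as multiplying the prior by the likelihood of all the data at once. The names `means`, `update`, `sequential`, and `batch` are assumptions for illustration.

```python
import numpy as np
import scipy.stats as scs

sd = 3
means = np.arange(10)               # the ten hypothesized means
data = np.array([5.2, 3.1, 6.4])    # made-up observations, for illustration only

def update(probs, datum):
    """One Bayesian update: multiply by the likelihood of datum, then normalize."""
    unnormalized = probs * scs.norm(means, sd).pdf(datum)
    return unnormalized / unnormalized.sum()

# Sequential: the posterior after each datum becomes the prior for the next
sequential = np.ones(10) / 10
for datum in data:
    sequential = update(sequential, datum)

# Batch: the likelihood of independent data is the product of the individual likelihoods
likelihood = scs.norm(means[:, None], sd).pdf(data).prod(axis=1)
batch = np.ones(10) / 10 * likelihood
batch /= batch.sum()

print("most probable mean:", means[sequential.argmax()])
print("sequential and batch agree:", np.allclose(sequential, batch))
```

This is why it doesn't matter in what order the data arrive, or whether we update after every observation or only at the end.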

## Bayesian Priors

The biggest objection to Bayesian statistics is the necessity of the prior. In the example above, and in the cookie problem, the prior had actual meaning and the math is beyond dispute. If that's the case (or at least somewhat the case), we have an **informed prior**.

In the real world that's not always true and we have to choose an **uninformed prior**. If so we usually say something like "every possibility is equally likely" and leave it at that.

But if you have enough data the choice of prior doesn't matter all that much.
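
To illustrate that last point, the sketch below compares the posterior from a uniform prior with the posterior from a prior that strongly (and wrongly) favors a mean of 0, as more data come in. The true mean of 4, the skewed prior, and the sample sizes are all assumptions for illustration.

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(0)
sd = 3
means = np.arange(10)                        # the ten hypothesized means
data = rng.normal(4, sd, size=500)           # simulated data with an assumed true mean of 4

uniform_prior = np.ones(10) / 10
skewed_prior = np.array([0.91] + [0.01] * 9) # puts 91% of the prior on a mean of 0

def posterior(prior, data):
    """Posterior over the hypothesized means given all of data."""
    # Work with logs so the product of many small likelihoods doesn't underflow
    log_likelihood = scs.norm.logpdf(data, loc=means[:, None], scale=sd).sum(axis=1)
    log_post = np.log(prior) + log_likelihood
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

for n in [1, 10, 500]:
    gap = np.abs(posterior(uniform_prior, data[:n]) - posterior(skewed_prior, data[:n])).max()
    print("n = {0:3d}   largest difference between posteriors = {1:.6f}".format(n, gap))
```

With a single observation the two posteriors disagree noticeably; by the time there are hundreds of observations the likelihood dominates and the difference is negligible.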