In [29]:
import numpy as np

# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Stats Review & Intro to SciPy
Week 2 | Lesson 4.2




### LEARNING OBJECTIVES
*After this lesson, you will be able to:*

- Explain t-testing and demonstrate it with scipy
- Contrast t-testing with simulation solutions
- Explain Type I and Type II errors

### "Bayesian" vs. "Frequentist"?

We will talk more about Bayesian stats later in the course; for now we are focused on Frequentist statistics.

We assume that the data that we have observed is a sample from an underlying population, and that population has fixed (but unknown to us) parameters.


### Setting the Scene

We want to make a statement about ALL data points (the population) based on a SAMPLE of data points, and describe our UNCERTAINTY about that statement.

To do this, we:
- make an assumption about the population (i.e., come up with a hypothesis)
- calculate the probability of seeing the data we've seen if this assumption were true
- based on this probability, say something about whether we think our assumption / hypothesis is reasonable

#### Estimating the mean height of female professional athletes from a Frequentist approach...

As a Frequentist I believe:

- The mean height of the population of female professional athletes is an unknown but fixed, "true" value.
- I have 100 data points.  My 100 data points are a random sample. That is to say, I have collected at random 100 heights from the population pool.
- This random sampling procedure is considered infinitely repeatable. My inferences about height are based on the idea that this sample is just one of an infinite number of hypothetical population samples. 
- Each of the possible samples has a sample mean.  They may differ between samples, but the true value of height is fixed across all hypothetical samples.
- There is a distribution of possible sample means given the true fixed value.

**FREQUENTISTS** ask:
$$P(\text{data}\;|\;\text{true mean})$$

So, for example, what is the probability of our data given a true and fixed population mean?

In other words, if I assume that the true population of female professional athletes  has a mean value of 5'11,  what is the probability of observing a mean height of, say, 5'9 in my sample?

Based on this probability, what can I say about my sample / population?

### Ok, more hypothesis testing!

Let's say we are testing a new drug.

- We randomly select 50 people to be in the placebo control condition and 50 people to receive the treatment.
- Our sample is selected from the broader, unknown population pool.
- In a parallel world we could have ended up with any other random sample of 100 people from the population pool.



The null hypothesis is, in this example, that there is no difference between placebo and treatment.

*H0: The true difference is equal to zero.*

The alternative hypothesis is the other possible outcome of the experiment: the difference between the placebo and the treatment is real.

*H1: The true difference is not zero.*


Say in our experiment we follow-up with the experimental and control groups:
- 5 out of 50 patients in the control group indicate that their symptoms are better
- 7 out of 50 patients in the experimental group indicate that their symptoms are better

So does the drug work?  Is this enough to make a recommendation?

### Enter the p-value!

The p-value is the probability that, GIVEN THE NULL HYPOTHESIS IS TRUE, we would have seen data at least as extreme as the data we actually sampled.

We saw a 10% improvement rate in the placebo group, and a 14% improvement rate in the drug group.

So, for this example, the *p-value* is the probability that, even if there is in fact no effect (i.e. the means of the two underlying populations are the same), we might see a difference of 4% or MORE simply owing to the random sampling (i.e. *just by chance!*).

### What do we do with the p-value?

Let's say the p-value here is 10%.  This means that if the drug truly had no effect and we were to repeat this experiment many times, about 1 in 10 times we would see a difference of at least 4% just by chance.

Does that mean this finding is significant?  What can we say about our null hypothesis?

Typically we pick a threshold (often 5%), below which we say "if our null hypothesis were true, the probability of our seeing the data that we saw is so unlikely, that we no longer believe our null to be true."

But here, the p-value is above 5%, so we fail to reject our null.

So we do not have enough evidence to conclude that the true difference is anything other than zero, or, in other words, that the drug has an effect.
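The 10% used above is a stand-in for discussion. As a rough sketch of where a p-value like this could come from, we can simulate the experiment under the null. This assumes both groups share the pooled improvement rate of 12% (12 improvers out of 100), and it counts gaps in either direction because H1 is two-sided. With 50 patients per group, a 4 percentage point gap is a gap of 2 patients. The seed and the number of simulations are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)    # arbitrary seed, for repeatability

n_per_group = 50
observed_gap = 7 - 5              # drug improvers minus placebo improvers
pooled_rate = (5 + 7) / 100       # ASSUMPTION: common improvement rate under H0

# Simulate many experiments in which the drug truly has no effect.
n_sims = 100_000
placebo_sim = rng.binomial(n_per_group, pooled_rate, size=n_sims)
drug_sim = rng.binomial(n_per_group, pooled_rate, size=n_sims)

# Share of null experiments with a gap at least as large as ours, in either direction.
p_value = np.mean(np.abs(drug_sim - placebo_sim) >= observed_gap)
print(p_value)
```

Whatever the exact estimate, it is compared with the chosen threshold in the same way as the 10% in the discussion above.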

### Sidenote:


Strictly speaking, we only ever reject the null, or fail to reject the null.

Rejecting the null is not the same as accepting the alternative!

### Great!  But how do I calculate this p-value?

We can do this parametrically or computationally.

Assuming we take the parametric approach, we need a few building blocks...


## Law of large numbers (Probability Theory)

 
If you perform the same experiment a large number of times, the average of the results should approach the expected value.

So if I repeatedly take samples from a population, and calculate the sample mean each time, the average of these sample means should approach the population mean.


This means that for sufficiently large N, the sample mean $\bar{Y}$ is close to the population mean: $\bar{Y}\ -\ \mu \approx 0$.
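A quick way to see this is to simulate it. The sketch below rolls a fair die (expected value 3.5) and prints the running average after 10, 1,000 and 100,000 rolls. The die, the seed and the checkpoints are illustrative choices, not part of the lesson data.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for repeatability
population_mean = 3.5                   # expected value of one fair die roll

# Roll a fair six-sided die many times and track the running average.
rolls = rng.integers(1, 7, size=100_000)    # upper bound is exclusive, so values are 1..6
running_mean = np.cumsum(rolls) / np.arange(1, rolls.size + 1)

for n in (10, 1_000, 100_000):
    gap = abs(running_mean[n - 1] - population_mean)
    print(n, running_mean[n - 1], gap)
```

The gap between the running average and 3.5 should generally shrink as N grows, although it does not have to shrink at every single step.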

## Central limit theorem

- If you sample from an underlying population with mean $\mu$ and variance $\sigma^2$, then when *n* is large, the sample mean $\hat{\mu}$ is approximately normally distributed with mean $\mu$ and variance $\frac{\sigma^2}{n}$.
- If the sample sizes are large enough, this is true regardless of the underlying distribution!
- So what? 
Well, this allows us to assume that some random variables are normally distributed, and to make inferences about the likelihood of observations drawn from that distribution. For instance, it implies:

$$\frac{\hat{\mu}\ -\ \mu}{\sigma/\sqrt{n}}\ \sim \ N(0,1)$$
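We can check this claim with a small simulation. The sketch below draws repeated samples from a strongly skewed population (an exponential distribution, chosen here only as an example), standardises each sample mean as in the formula above, and compares the results with what N(0,1) would predict. The sample size, number of samples and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
n = 50                           # size of each sample
n_samples = 20_000               # number of repeated samples

# Exponential(scale=2) population: mu = 2, sigma = 2, strongly right-skewed.
mu, sigma = 2.0, 2.0
samples = rng.exponential(scale=2.0, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# Standardise each sample mean as in the formula above.
z = (sample_means - mu) / (sigma / np.sqrt(n))

print("mean of sample means:", sample_means.mean())        # should be near mu
print("variance of sample means:", sample_means.var())     # should be near sigma^2 / n
print("share of |z| < 1.96:", np.mean(np.abs(z) < 1.96))   # should be near 0.95
```

Even though individual observations are far from normal, the standardised sample means behave roughly like draws from N(0,1).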

## T-tests revisited

Does $\frac{\hat{\mu}\ -\ \mu}{\sigma/\sqrt{n}}$ look familiar? If we use the sample standard deviation in the denominator, that's the t-statistic!

$$\frac{\hat{\mu}\ -\ \mu}{s/\sqrt{n}}$$

And if n is large, then the value of the t-statistic is approximately normally distributed. (If n is small and the sample observations are normally distributed, then it has a t-distribution.)

$\frac{s}{\sqrt{n}}$ is also called the standard error, $s.e.(\hat{\mu})$

## P-values and hypothesis testing, revisited

So what if the t statistic is normally distributed?

Well, then we can calculate the probability of having observed a t-statistic at least as extreme as ours.  

Or, in terms of hypothesis testing for the difference of means (drug treatment), we can calculate the probability of having observed a difference as large as (or larger than) the one we did, if there is in fact no difference (i.e. if our null hypothesis is true).



### How do we calculate it?


![](./norm_dist_probs.jpg)
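In code, the tail areas under the standard normal curve can be read off with `scipy.stats.norm.sf`, the survival function (1 minus the CDF). The statistic value of 2.0 below is a made-up number used purely to illustrate the calculation.

```python
from scipy import stats

z = 2.0  # hypothetical test statistic, chosen only for illustration

# Area in the upper tail beyond z (one-tailed p-value).
p_one_tailed = stats.norm.sf(z)

# Area in both tails beyond |z| (two-tailed p-value).
p_two_tailed = 2 * stats.norm.sf(abs(z))

print(p_one_tailed, p_two_tailed)
```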

### Hypothesis testing with t-tests, an example

Let's say 1165 bootcamp applicants take a GA admissions test in 2017, with an average score of 60.86 and a standard deviation of 8.02. The expected score for all bootcamp applicants is 59. Do GA applicants have the same expected score as the underlying population?

The sample standard deviation *s* is 8.02.

Standard error of the estimate is then $se(\hat{\mu}) = \frac{s}{\sqrt{n}} = \frac{8.02}{\sqrt{1165}} = 0.235$

> What are our null and alternative hypotheses?

$$H_0: \mu = 59$$
$$H_a: \mu \neq 59$$



Under $H_0,\ t = \frac{\hat{\mu} - 59}{se(\hat{\mu})}$ is approximately distributed as N(0,1). If t falls far out in either tail, the p-value is low and we'll reject $H_0$.

Calculate the t-statistic and [look up its p-value](https://graphpad.com/quickcalcs/PValue1.cfm).

(for degrees of freedom, use n-1, where n is the size of the sample)


$$t = \frac{60.86 - 59}{0.2350} = 7.915$$

p-value < 0.001
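The same arithmetic can be done in a few lines of Python from the summary numbers alone. This is a sketch that uses the t-distribution with n - 1 degrees of freedom via `scipy.stats.t.sf`. The variable names are our own.

```python
import numpy as np
from scipy import stats

n = 1165            # number of applicants in the sample
sample_mean = 60.86
s = 8.02            # sample standard deviation
mu_0 = 59.0         # mean under the null hypothesis

se = s / np.sqrt(n)
t_stat = (sample_mean - mu_0) / se

# Two-tailed p-value from the t-distribution with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

print(se, t_stat, p_value)
```

The t-statistic may differ from the hand calculation in the third decimal place because the standard error is not rounded here.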

## Again, but now with scipy ('skippy'?)

In [8]:
from scipy import stats
import numpy as np
np.random.seed(7654567)  # fix seed to get the same result
rvs = stats.norm.rvs(loc=60.86, scale=8.02, size=(1165))

# Note that the mean and std of our generated data aren't exactly
# the values we specified above.
print(np.std(rvs), np.mean(rvs))


8.05037070851 61.1582143415


## Which scipy function to use?  

A few common t-tests include:
- One-sample t-test. Used to determine whether a hypothesized population
    mean differs significantly from an observed sample mean.
- Two-sample t-test. Used to determine whether the difference between sample means differs significantly from the hypothesized difference between population means.
- Paired t-test. Used to test the significance of the difference
    between paired means.

Scipy has methods for all of these, and more. Which one do we want?
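For orientation, here is a sketch of the three scipy functions that correspond to the tests listed above: `ttest_1samp`, `ttest_ind` and `ttest_rel`. All of the data below is made up for illustration, and the group sizes, means and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)   # arbitrary seed

# Made-up data, for illustration only.
before = rng.normal(loc=60, scale=8, size=40)
after = before + rng.normal(loc=1, scale=3, size=40)   # same subjects, measured again
other_group = rng.normal(loc=62, scale=8, size=45)     # an unrelated group

# One-sample: is the mean of `before` different from a hypothesised value of 59?
print(stats.ttest_1samp(before, 59.0))

# Two-sample (independent): do `before` and `other_group` have different means?
print(stats.ttest_ind(before, other_group))

# Paired: is the mean within-subject change between `before` and `after` zero?
print(stats.ttest_rel(before, after))
```

A paired test is equivalent to a one-sample test on the within-pair differences, which is why it needs both arrays to have the same length and ordering.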


## Also, tails?  

<br>

#### One tailed test
I care about direction (e.g. the mean for these students is *higher* than the population mean)
<br><br>
#### Two tailed test
I don't care about direction (e.g. the mean for these students is *different* to the population mean)

In [6]:
# Test if mean of random sample is equal to true mean

print(stats.ttest_1samp(rvs, 59.0))


Ttest_1sampResult(statistic=9.1465051826423593, pvalue=2.5558469858139369e-19)


Do we reject the hypothesis?  

Yes!

So based on the data we collected, we do not believe that the mean for the applicants who took the test is the same as the expected score for bootcamp applicants generally.

>Check:
Did scipy do a one tailed or two tailed ttest?

If you want to convert a p-value from a two-tailed test to the p-value for the equivalent one-tailed test, you can simply divide by 2, provided the observed effect is in the direction your one-tailed hypothesis predicts.
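A small sketch of that conversion, with a cross-check against the upper-tail area of the t-distribution. The sample here is a small simulated one (30 scores) invented for illustration, and the one-tailed alternative is assumed to be "the mean is greater than 59".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)                          # arbitrary seed
sample = rng.normal(loc=60.86, scale=8.02, size=30)     # small simulated sample, for illustration
mu_0 = 59.0

result = stats.ttest_1samp(sample, mu_0)                # two-tailed by default
t_stat, p_two_tailed = result.statistic, result.pvalue

# One-tailed test of H_a: mu > mu_0. Halving is only valid when the
# statistic points in the hypothesised direction (here, t_stat > 0).
if t_stat > 0:
    p_one_tailed = p_two_tailed / 2
else:
    p_one_tailed = 1 - p_two_tailed / 2

# Cross-check against the upper-tail area of the t-distribution.
p_direct = stats.t.sf(t_stat, df=sample.size - 1)
print(p_one_tailed, p_direct)
```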

<a name="ind-practice"></a>
## Independent Practice: classic t-tests (30 minutes)

In pairs or trios, look at the SAT test data from Project 1. (We'll assume it's a sample of results, rather than the population results.) Together, form null and alternative hypotheses about some of the scores. 
(E.g., H0: the mean difference between states' verbal and math scores is 0)

Choose a significance level and conduct an appropriate t-test.  Repeat for a few different hypotheses.  Be prepared to describe your findings with another group!  Think about how you might describe these findings to someone who hasn't studied statistics.

- [t-tests](http://iaingallagher.tumblr.com/post/50980987285/t-tests-in-python)
- [t distribution](http://stattrek.com/probability-distributions/t-distribution.aspx)

## Significance levels, Type I and Type II errors

Type I errors occur when the researcher rejects a null hypothesis when it is actually true. The probability of committing a Type I error is called the significance level, often denoted $\alpha$.

$$\alpha\ =\ P(Reject\ H_0\ |\ H_0\ is\ true) = P(Type\ I\ error)$$

So, if our threshold level / significance level were 5% and we repeated the experiment 20 times, we would expect to reject the null hypothesis about once just by chance, even if it is in fact true.

Also... see [p-hacking](https://projects.fivethirtyeight.com/p-hacking/)

A Type II error occurs when the researcher fails to reject a null hypothesis that is false.  The probability of committing a Type II error is often denoted by $\beta$.






$$\beta\ =\ P(Not\ rejecting\ H_0\ |\ H_a\ is\ true) = P(Type\ II\ error)$$
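Both error rates can be estimated by simulating many experiments. In the sketch below, every experiment is a one-sample t-test of H0: mu = 0 on 30 normally distributed observations. The sample size, the "true" effect of 0.3 used for the Type II case, and the seed are all made-up values for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)   # arbitrary seed
alpha = 0.05
n_experiments = 5_000
n = 30                           # observations per experiment (illustrative)

def rejection_rate(true_mean):
    """Share of simulated experiments that reject H0: mu = 0 at level alpha."""
    data = rng.normal(loc=true_mean, scale=1.0, size=(n_experiments, n))
    p_values = stats.ttest_1samp(data, 0.0, axis=1).pvalue
    return np.mean(p_values < alpha)

# H0 is true (mu really is 0): every rejection is a Type I error.
type_1_rate = rejection_rate(true_mean=0.0)

# H0 is false (mu is actually 0.3): every failure to reject is a Type II error.
type_2_rate = 1 - rejection_rate(true_mean=0.3)

print("Type I error rate:", type_1_rate)    # should be close to alpha
print("Type II error rate:", type_2_rate)
```

Lowering alpha makes Type I errors rarer but, with everything else fixed, makes Type II errors more common, which is the trade-off between the two.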

<a name="t-testing"></a>
## Demo/Guided Practice: computational approaches (10 minutes)

Now that computational power is cheap and available (and you know Python!), we have an alternative way of approaching these questions: we can iteratively calculate the probability of observing some result.  We can simulate a repeated experiment!

For example:

```Python
# Simulating a binomial variable (e.g. seeing heads in at least 20 out of 30 coin flips)
m = 0
for i in range(10000):
    trials = np.random.randint(2, size = 30)
    if (trials.sum() >= 20):
        m += 1
p = m / 10000.0
p
```

> Check: what is this doing?

In [33]:
m = 0
for i in range(10000):
    trials = np.random.randint(2, size = 30)
    if (trials.sum() >= 20):
        m += 1
p = m / 10000.0
p

0.0492

This was an example of **simulating** your experiment -- you can do this if you have an a priori model of what happens.
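Because the coin-flip model is fully known, the simulated answer can be checked against the exact binomial tail probability. A minimal sketch using `scipy.stats.binom.sf`:

```python
from scipy import stats

# P(X >= 20) for X ~ Binomial(n=30, p=0.5).
# sf(k) is P(X > k), so pass 19 to include 20 itself.
p_exact = stats.binom.sf(19, 30, 0.5)
print(p_exact)
```

The exact value is about 0.0494, so the simulated estimate of 0.0492 above is close, as expected with 10,000 repetitions.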

If you don't have an a priori model, another option is **shuffling** results:

(Example from: http://cs.nyu.edu/shasha/papers/StatisticsIsEasyExcerpt.html)


Placebo: 54 51 58 44 55 52 42 47 58 46

Drug: 54 73 53 70 73 68 52 65 65


"As you can see, the drug seems more effective on the average (the average measured improvement is 63.7 for the drug and 50.7 for the placebo). But is this difference in the average real? Formula-based statistics would use a t-test which entails certain assumptions about normality and variance, but we are going to look just at the samples themselves and shuffle the labels."


What this means can be illustrated as follows. We put all the people in a table having two columns value and label (P for placebo and D for drug).

| value | label |
|:-:|---|
|54	| P |
|51	| P |
|58	| P |
|44	| P |
|55	| P |
|52	| P |
|42	| P |
|47	| P |
|58	| P |
|46	| P |
|54	| D |
|73	| D |
|53	| D |
|70	| D |
|73	| D |
|68	| D |
|52	| D |
|65	| D |
|65	| D |



Shuffling the labels means that we will take the Ps and Ds and randomly distribute them among the patients. 

This might give:

| value | label |
|:-:|---|
|54	| P |
|51	| P |
|58	| D |
|44	| P |
|55	| P |
|52	| D |
|42	| D |
|47	| D |
|58	| D |
|46	| D |
|54	| P |
|73	| P |
|53	| P |
|70	| D |
|73	| P |
|68	| P |
|52	| D |
|65	| P |
|65	| D |



We can then look at the difference in the average P value vs. the average D value here. We get an average of 59.0 for P and 54.4 for D. We repeat this shuffle-then-measure procedure 10,000 times and ask what fraction of the time we get a difference between drug and placebo greater than or equal to the measured difference of 63.7 - 50.7 = 13. The answer in this case is under 0.001.
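The shuffle-then-measure procedure can be sketched in a few lines of numpy. This is one possible implementation, not the code from the linked source: it pools the values, permutes them, and re-splits them into groups of the original sizes, which is equivalent to shuffling the labels. The seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)   # arbitrary seed

placebo = np.array([54, 51, 58, 44, 55, 52, 42, 47, 58, 46])
drug = np.array([54, 73, 53, 70, 73, 68, 52, 65, 65])

observed_diff = drug.mean() - placebo.mean()

# Pool the values, then repeatedly shuffle and re-split into groups of the original sizes.
pooled = np.concatenate([placebo, drug])
n_shuffles = 10_000
count = 0
for _ in range(n_shuffles):
    shuffled = rng.permutation(pooled)
    shuffled_placebo = shuffled[:placebo.size]
    shuffled_drug = shuffled[placebo.size:]
    if shuffled_drug.mean() - shuffled_placebo.mean() >= observed_diff:
        count += 1

p_value = count / n_shuffles
print(observed_diff, p_value)
```

The estimate will vary a little from run to run and seed to seed, but it should be very small, in line with the figure quoted above.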


<a name="ind-practice"></a>
## Independent Practice: finding probabilities computationally (45 minutes)

In pairs or trios, design and code a computational way of finding the probability of rolling a 6 at least one-third of the time on a fair die.

Got it? Now try finding the probability of seeing a difference of at least 0.686 between the mean verbal and math scores in your SAT dataset, assuming there is, in truth, no difference between these means.  Try shuffling!

<a name="conclusion"></a>
## Conclusion (5 mins)

- We make trade-offs between risking Type I and Type II errors
- There are varieties of t-tests, and scipy methods for conducting them
- Simulations / computation strategies are an alternative to parametric statistical inference