<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Motivation-for-Hypothesis-Testing" data-toc-modified-id="Motivation-for-Hypothesis-Testing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Motivation for Hypothesis Testing</a></span></li><li><span><a href="#Experiment-Design" data-toc-modified-id="Experiment-Design-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Experiment Design</a></span><ul class="toc-item"><li><span><a href="#The-Scientific-Method" data-toc-modified-id="The-Scientific-Method-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>The Scientific Method</a></span></li><li><span><a href="#Making-a-Good-Experiment" data-toc-modified-id="Making-a-Good-Experiment-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Making a Good Experiment</a></span><ul class="toc-item"><li><span><a href="#Control-Groups" data-toc-modified-id="Control-Groups-3.2.1"><span class="toc-item-num">3.2.1&nbsp;&nbsp;</span>Control Groups</a></span></li><li><span><a href="#Random-Trials" data-toc-modified-id="Random-Trials-3.2.2"><span class="toc-item-num">3.2.2&nbsp;&nbsp;</span>Random Trials</a></span></li><li><span><a href="#Sample-Size" data-toc-modified-id="Sample-Size-3.2.3"><span class="toc-item-num">3.2.3&nbsp;&nbsp;</span>Sample Size</a></span></li><li><span><a href="#Reproducible" data-toc-modified-id="Reproducible-3.2.4"><span class="toc-item-num">3.2.4&nbsp;&nbsp;</span>Reproducible</a></span></li></ul></li><li><span><a href="#Scenarios" data-toc-modified-id="Scenarios-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Scenarios</a></span></li><li><span><a href="#High-Level-Hypothesis-Testing" data-toc-modified-id="High-Level-Hypothesis-Testing-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>High-Level Hypothesis Testing</a></span></li></ul></li><li><span><a href="#Parts-of-a-Hypothesis-Test" 
data-toc-modified-id="Parts-of-a-Hypothesis-Test-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Parts of a Hypothesis Test</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Define-Null-and-Alternative-Hypotheses" data-toc-modified-id="Define-Null-and-Alternative-Hypotheses-4.0.1"><span class="toc-item-num">4.0.1&nbsp;&nbsp;</span>Define Null and Alternative Hypotheses</a></span><ul class="toc-item"><li><span><a href="#The-Null-Hypothesis" data-toc-modified-id="The-Null-Hypothesis-4.0.1.1"><span class="toc-item-num">4.0.1.1&nbsp;&nbsp;</span>The Null Hypothesis</a></span></li></ul></li></ul></li><li><span><a href="#$p$-Values" data-toc-modified-id="$p$-Values-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>$p$-Values</a></span></li><li><span><a href="#$\alpha$" data-toc-modified-id="$\alpha$-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>$\alpha$</a></span><ul class="toc-item"><li><span><a href="#A-Caution" data-toc-modified-id="A-Caution-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>A Caution</a></span></li></ul></li></ul></li><li><span><a href="#Steps-of-a-Hypothesis-Test" data-toc-modified-id="Steps-of-a-Hypothesis-Test-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Steps of a Hypothesis Test</a></span><ul class="toc-item"><li><span><a href="#Let's-write-the-appropriate-hypotheses" data-toc-modified-id="Let's-write-the-appropriate-hypotheses-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Let's write the appropriate hypotheses</a></span></li></ul></li><li><span><a href="#Performing-a-$z$-test" data-toc-modified-id="Performing-a-$z$-test-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Performing a $z$-test</a></span><ul class="toc-item"><li><span><a href="#$z$-Tests" data-toc-modified-id="$z$-Tests-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>$z$-Tests</a></span><ul class="toc-item"><li><span><a href="#Variable-review:" data-toc-modified-id="Variable-review:-6.1.1"><span 
class="toc-item-num">6.1.1&nbsp;&nbsp;</span>Variable review:</a></span></li><li><span><a href="#Example" data-toc-modified-id="Example-6.1.2"><span class="toc-item-num">6.1.2&nbsp;&nbsp;</span>Example</a></span></li></ul></li></ul></li><li><span><a href="#Summary" data-toc-modified-id="Summary-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Summary</a></span></li></ul></div>

In [None]:
from scipy import stats
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Objectives

- Describe the basic framework and vocabulary for hypothesis testing
- Define Null and Alternative Hypotheses
- Define p-value, $\alpha$
- Perform z-tests

# Motivation for Hypothesis Testing

When we want to determine something about the world, we typically turn to science. And science is really built upon testing ideas through **experiments**. If we have an idea, but our experiments show that it's unlikely to be true, then we have learned something about our world!

<img src="https://upload.wikimedia.org/wikipedia/commons/8/89/Beaker_-_The_Noun_Project.svg" width=50%/>

Data _science_ can use this same process and it can be formalized through a statistical process called a **hypothesis test**. But before we can talk about performing these tests, we need to talk about how we design our experiments.

# Experiment Design

Experiments are how we get the data we need to determine if our observations are worthwhile! But if you have a poorly designed experiment, you can't trust the observations/data to say anything useful.

> **NOTE**
>
> We typically use the term "experiment" when doing a hypothesis test. This can be a little confusing when the data has been collected _before_ any other step. This is fine but we should consider if this experiment follows the general criteria of a "good" design.

## The Scientific Method

You should consider if the question you're looking to answer can be investigated with the **scientific method**. If it can, we can feel better that we're asking a _scientific question_ (compared to a question that is _unverifiable_).

There is no completely agreed upon "scientific method" but the following should help us know if we're on the right track:

- Question/Observation
- Background Knowledge
- Hypothesis
- Experiment
- Analysis
- Conclusions

## Making a Good Experiment

A perfectly designed experiment would test every possible answer to a question. Of course this is unrealistic, so we strive for the best experiment we can design to answer our question.

Below are a few items to consider for a good experiment. An experiment doesn't have to fulfill everything to still be useful, though the more items we can check off the list, the more certain we'll feel about our results.

### Control Groups

> Your experiment should consider other factors that could affect the outcome and try to account for (or _control_) those factors

### Random Trials

> By having random trials/samples, you're less likely to have bias in your observations/data

### Sample Size

> A large enough sample size that we can reasonably extrapolate to the population of interest

### Reproducible

> Being able to reproduce the experiment means we can test again and ensure our results are valid.

## Scenarios

- Chemistry - do inputs from two different barley fields produce different yields?
- Astrophysics - do star systems with near-orbiting gas giants have hotter stars?
- Medicine - BMI vs. Hypertension, etc.
- Business - which ad is more effective given engagement?

![img1](./img/img1.png)

![img2](./img/img2.png)

## High-Level Hypothesis Testing

1. Start with a Scientific Question (yes/no)
2. Take the skeptical stance (null hypothesis) 
3. State the complement (alternative)
4. Set a threshold for errors
5. Create a model (test statistic) of the situation *assuming the null hypothesis is true*
6. Decide whether or not to reject the null hypothesis

**Intuition** 

Suppose you have a large dataset for a population. The data is normally distributed with mean 0 and standard deviation 1.

Along comes a new sample with a sample mean of 2.9.

> The idea behind hypothesis testing is a desire to quantify our belief as to whether our sample of observations came from the same population as the original dataset. 

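According to the empirical (68–95–99.7) rule for normal distributions, only about 0.3% of values (a probability of roughly 0.003) fall 3 or more standard deviations away from the mean. Treating our sample mean like a single draw for the sake of intuition, 2.9 is roughly 3 standard deviations above the mean, so it would be very unusual to see it if the sample came from the same population.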

<img src="img/normal_sd_new.png" width="500">
 
To formalize this intuition, we define a threshold value for deciding whether we believe that the sample is from the same underlying population or not. This threshold is $\alpha$, the **significance threshold**.  

This serves as the foundation for hypothesis testing where we will reject or fail to reject the null hypothesis.
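
To put a number on the intuition above, here is a minimal sketch that asks how often a standard normal population produces a value at least as large as 2.9. It uses `scipy.stats.norm.sf` (the survival function, `1 - cdf`); the variable names are illustrative only.

```python
from scipy import stats

# Setup from the text: population is normal with mean 0 and standard deviation 1,
# and we observe a value of 2.9.
observed = 2.9

# Probability of a value at least this large (upper tail only)
upper_tail = stats.norm.sf(observed)

# Probability of a value at least this far from the mean in either direction
two_sided = 2 * stats.norm.sf(abs(observed))

print(f"P(Z >= {observed}) = {upper_tail:.4f}")
print(f"P(|Z| >= {observed}) = {two_sided:.4f}")
```

Both probabilities are small, which is what makes the new sample look suspicious.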

# Parts of a Hypothesis Test 

(alpha, p-value, null-hypothesis, etc.)

### Define Null and Alternative Hypotheses

#### The Null Hypothesis

![gmonk](https://vignette.wikia.nocookie.net/villains/images/2/2f/Ogmork.jpg/revision/latest?cb=20120217040244)
> There is NOTHING, **no** difference.

If we're testing the function of a new drug, then the null hypothesis will say that the drug has _no effect_ on patients, or anyway no effect relative to relief of the malady the drug was designed to combat. 

If we're testing whether Peeps cause dementia, then the null hypothesis will say that there is _no correlation_ between Peeps consumption and rate of dementia development.

The **alternative hypothesis** says the opposite of the null hypothesis.

## $p$-Values

The basic idea of a $p$-value is to quantify how surprising our results would be if only random chance were at work. More precisely, the $p$-value is the probability of seeing results at least as extreme as the ones we observed, _assuming the null hypothesis is true_. This is connected with the null hypothesis: If the null hypothesis is true and there is no correlation between the population variables X and Y, then of course any correlation between X and Y observed in our sample would have to be the result of mere random chance.
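
One way to make this concrete is to simulate a world where the null hypothesis is true and count how often chance alone produces a result at least as extreme as ours. The sketch below is only an illustration: the population (`mu_0 = 0`, `sigma = 1`), the sample size `n = 25`, and the observed sample mean of `0.5` are all made-up assumptions, not values from this lesson.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup (assumed for illustration)
mu_0, sigma, n = 0.0, 1.0, 25
observed_mean = 0.5

# Draw many samples from the null population and record each sample mean
null_means = rng.normal(mu_0, sigma, size=(100_000, n)).mean(axis=1)

# Simulated p-value: share of null sample means at least as large as ours
p_simulated = np.mean(null_means >= observed_mean)

# Analytic p-value from the normal distribution, for comparison
p_exact = stats.norm.sf((observed_mean - mu_0) / (sigma / np.sqrt(n)))

print(p_simulated, p_exact)
```

The two numbers should land close to each other, since both estimate the same upper-tail probability under the null.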

**How Unlikely Is Too Unlikely?**

## $\alpha$

Suppose we calculate a $p$-value for some statistic we've measured (more on this below!) and we get a $p$-value of 20%. This would mean that, if the null hypothesis were true, there would be a 20% chance of observing results at least as extreme as ours through random chance alone. Probably this is high enough that we ought _not_ to reject the null hypothesis that our variables are uncorrelated.

In practice, a $p$-value _threshold_ ($\Large \alpha$) of 5% is very often the default value for these tests of statistical significance. Thus, if results at least as extreme as ours would occur less than 1 time in 20 under the null hypothesis, then we would _reject_ the null hypothesis in favor of the alternative hypothesis.

If $p \lt \alpha$, we reject the null hypothesis.

If $p \geq \alpha$, we fail to reject the null hypothesis.

> **We never _accept_ the null hypothesis, because future experiments may yield significant results.**

* We do not throw out "failed" experiments! 
* We say "this methodology, with this data, does not produce significant results" 
    * Maybe we need more data!
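
The decision rule can be written as a tiny helper. This is a sketch only; the function name `decide` and the default `alpha=0.05` are assumptions chosen for illustration.

```python
def decide(p_value, alpha=0.05):
    """Apply the rule: reject when p < alpha, otherwise fail to reject."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    # Note the wording: we never "accept" the null
    return "fail to reject the null hypothesis"


print(decide(0.20))
print(decide(0.01))
```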
    

### A Caution

The choice of $\alpha = 0.05$ is arbitrary and has survived as a pseudo-standard largely because of traditions in teaching.

The [American Statistical Association](https://www.amstat.org) has [recently been questioning this standard](https://www.tandfonline.com/toc/utas20/73/sup1?nav=tocList&) and in fact there are movements to reject hypothesis testing in a more wholesale way.

The chief thing to keep in mind is that binary test results are often misleading. And as for an appropriate $\alpha$ level: This really depends on the case. In some scenarios, false positives are more costly than in others. We must also determine our $\alpha$ level *before* we conduct our tests. Otherwise, we will be accused of $p$-hacking.
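
To see why $\alpha$ is tied to false positives, the sketch below runs many simulated experiments in which the null hypothesis is actually true and counts how often a two-sided z-test rejects it anyway. All settings (`mu_0`, `sigma`, `n`, the number of experiments) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

alpha = 0.05
mu_0, sigma, n = 0.0, 1.0, 30   # hypothetical null population and sample size
n_experiments = 20_000

# Every sample really does come from the null population
samples = rng.normal(mu_0, sigma, size=(n_experiments, n))
z = (samples.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))

# Two-sided p-value for each simulated experiment
p_values = 2 * stats.norm.sf(np.abs(z))

# Share of experiments where we would wrongly reject the null
false_positive_rate = np.mean(p_values < alpha)
print(false_positive_rate)
```

The rejection rate should come out near `alpha`, which is why a smaller $\alpha$ makes sense when false positives are costly.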

# Steps of a Hypothesis Test

Steps in doing a one-sample z-test:

1. State the alternative hypothesis (and the null)
  * example: the sample mean is greater than the population mean (mu)
2. Specify the significance level (alpha)
  * alpha is the probability of rejecting the null even though it's true (!)
3. Calculate the test statistic (z-statistic)
  * z-stat = (x_bar - mu) / (sigma / √n)  --> for a fixed difference, more data means a larger |z|, so a real difference is more likely to be detected
4. Calculate the p-value (from a z-table)
  * for a "greater than" alternative: p = 1 - CDF(normal distribution given z-stat)
  * Probability we'd find a value at least this extreme given the null is true
  * `1 - scipy.stats.norm.cdf(z_score)`
5. Interpret the p-value
  * if p < $\alpha$, reject the null; otherwise, fail to reject it
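
The steps above can be sketched as one small function. This is a minimal illustration rather than a library routine: the name `one_sample_z_test`, its parameters, and the five sample values are hypothetical.

```python
import numpy as np
from scipy import stats


def one_sample_z_test(sample, mu_0, sigma, alpha=0.05, alternative="greater"):
    """One-sample z-test with a known population standard deviation."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size

    # Step 3: test statistic
    z = (sample.mean() - mu_0) / (sigma / np.sqrt(n))

    # Step 4: p-value, depending on the direction of the alternative
    if alternative == "greater":
        p = stats.norm.sf(z)            # same as 1 - cdf(z)
    elif alternative == "less":
        p = stats.norm.cdf(z)
    elif alternative == "two-sided":
        p = 2 * stats.norm.sf(abs(z))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")

    # Step 5: compare against the significance level
    return z, p, bool(p < alpha)


# Made-up sample, tested against a population mean of 65 and sigma of 4
z, p, reject = one_sample_z_test([68, 65, 69, 70, 70], mu_0=65, sigma=4)
print(z, p, reject)
```

Choosing the `alternative` before looking at the data matters, since it determines which tail of the distribution counts as "extreme".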
  


## Let's write the appropriate hypotheses

1. A drug manufacturer claims that a drug increases memory. It designs an experiment where both control and experimental groups are shown a series of images, and records the number of correct recollections until an error is made for each group. 

Answer:

Null: People who took the drug don't have more correct recollections than people who didn't take the drug.

Alternative: People who took the drug do have more correct recollections than people who didn't take the drug.

2. An online toystore claims that putting a 5-minute timer on the checkout page of its website decreases conversion rate. It sets up two versions of its site, one with a timer and one with no timer. 

Answer:

3. The Kansas City public school system wants to test whether the scores of students who take standardized tests under the supervision of teachers differ from the scores of students who take them in rooms with school administrators.

Answer:

4. A pest control company believes that the length of cockroach legs in colonies which have persisted after two or more insecticide treatments is longer than in colonies which have not been treated with insecticide.

Answer:

5. A healthcare company believes patients between the ages of 18 and 25 participate in annual checkups less than all other age groups.

Answer:

# Performing a $z$-test

## $z$-Tests 

A $z$-test is used when you know the population mean and standard deviation.

Our test statistic is the $z$-stat.

For a single point in relation to a distribution of points:

$z = \dfrac{{x} - \mu}{\sigma}$



<br>Our $z$-score tells us how many standard deviations away from the mean our point is.
<br>We assume that the population is normally distributed, and we are familiar with the empirical rule: <br>68:95:99.7

![](img/Empirical_Rule.png)


Because of this, we can say that a data point with a $z$-score of approximately 2 is 2 standard deviations from the mean, and a value at least that far from the mean (in either direction) has a probability of about 1-.95, or .05. 
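
As a quick check on the empirical rule, the sketch below uses `scipy.stats.norm.cdf` to compute the share of a normal distribution that lies within 1, 2, and 3 standard deviations of the mean. The dictionary name `coverage` is illustrative.

```python
from scipy import stats

# Share of a standard normal distribution within k standard deviations of the mean
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, inside in coverage.items():
    print(f"within {k} sd: {inside:.4f}   outside: {1 - inside:.4f}")
```

The "outside" column for 2 standard deviations is the rough 0.05 used above; the exact cutoff for 0.05 is a $z$-score of about 1.96.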

Recall the following example: Assume the height of women is normally distributed with a mean of 65 inches and a standard deviation of 4 inches. What is the $z$-score of a woman who is 75 inches tall?

In [None]:
z_score = (75 - 65)/4
print(z_score)

When we are working with a sampling distribution, the z score is equal to <br><br>  $\Large z = \dfrac{{\bar{x}} - \mu_{0}}{\dfrac{\sigma}{\sqrt{n}}}$

### Variable review: 

$\bar{x}$ equals the sample mean.
<br>$\mu_{0}$ is the mean associated with the null hypothesis.
<br>$\sigma$ is the population standard deviation
<br>$n$ is the sample size; dividing by $\sqrt{n}$ reflects that we are dealing with a sample of the population, not the entire population.

The denominator, $\frac{\sigma}{\sqrt{n}}$, is the standard error.

Standard error is the standard deviation of the sampling distribution of the mean. We will go into that further below.
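
The claim that the standard error is the spread of sample means can be checked by simulation. The sketch below reuses the height numbers from above (mean 65, standard deviation 4); the sample size `n = 25` and the number of simulated samples are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

mu, sigma = 65, 4   # population mean and standard deviation from the height example
n = 25              # assumed sample size

# Draw many samples of size n and record the mean of each one
sample_means = rng.normal(mu, sigma, size=(50_000, n)).mean(axis=1)

empirical_se = sample_means.std(ddof=1)   # spread of the simulated sample means
theoretical_se = sigma / np.sqrt(n)       # sigma / sqrt(n)

print(empirical_se, theoretical_se)
```

The two values should agree closely, and both are much smaller than `sigma` itself: averages vary less than individual observations.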

Once we have a z-stat, we can use a [z-table](http://www.z-table.com/) to find the associated p-value.

In [None]:
sample_female_heights = [68, 65, 69, 70, 70, 
                         61, 59, 65, 64, 66,
                         72, 71, 68, 66, 64,
                         65, 65, 70, 71, 63, 
                         72, 66, 65, 65, 72]

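x_bar = np.mean(sample_female_heights)  # sample mean
mu = 65                                 # population mean under the null
n = len(sample_female_heights)          # sample size
std = 4                                 # population standard deviation

# z-statistic: distance of the sample mean from mu, in standard errors
z = (x_bar - mu)/(std/np.sqrt(n))
z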

In [None]:
# we can use stats to calculate the percentile, P(Z <= z)
print(stats.norm.cdf(z))

# We can also use the survival function to calculate the upper-tail probability, P(Z > z) = 1 - cdf
print(stats.norm.sf(z))

### Example

Let's work with the normal distribution, since it's so useful. Suppose we are told that African elephants have weights distributed normally around a mean of 9000 lbs., with a standard deviation of 900 lbs. Pachyderm Adventures has recently measured the weights of 40 African elephants in Gabon and has calculated their average weight at 8637 lbs. They claim that these statistics on the Gabonese elephants are significant. Let's find out!

What is our null hypothesis?

What is our alternative hypothesis?

What is our alpha?

Remember we gave the formula for standard error before as $\frac{\sigma}{\sqrt{n}}$
<br> Let's calculate that with our elephant numbers.

In [None]:
se = 900 / np.sqrt(40)
se

Now let's calculate the z-score analytically.
Remember the formula for z-score:
$z = \dfrac{{\bar{x}} - \mu_{0}}{\dfrac{\sigma}{\sqrt{n}}}$

In [None]:
x_bar = 8637
mu = 9000
se = 900 / np.sqrt(40)  # standard error from the previous cell, about 142.3

z = (x_bar - mu) / se
z

In [None]:
# Now we get our p-value from the test statistic.
# z is negative here, so the CDF gives the lower-tail (one-sided) probability P(Z <= z)
stats.norm.cdf(z)
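
The p-value depends on how the alternative hypothesis is phrased. As a hedged sketch, the code below compares a one-sided ("lighter than 9000 lbs.") and a two-sided ("different from 9000 lbs.") version of the elephant test. The choice of `alpha = 0.05` is an assumption for illustration, not something the example specifies.

```python
import numpy as np
from scipy import stats

# Numbers from the elephant example
x_bar, mu_0, sigma, n = 8637, 9000, 900, 40
alpha = 0.05   # assumed significance level

z = (x_bar - mu_0) / (sigma / np.sqrt(n))

p_less = stats.norm.cdf(z)                 # alternative: mean weight is less than 9000
p_two_sided = 2 * stats.norm.sf(abs(z))    # alternative: mean weight differs from 9000

print(f"z = {z:.3f}")
print(f"one-sided p = {p_less:.4f}, reject: {p_less < alpha}")
print(f"two-sided p = {p_two_sided:.4f}, reject: {p_two_sided < alpha}")
```

The two-sided p-value is twice the one-sided one, so the direction of the alternative should be settled before looking at the data.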

# Summary 

Key Takeaways:

* A statistical hypothesis test is a method for testing a hypothesis about a parameter in a population using data measured in a sample. 
* Hypothesis tests consist of a null hypothesis and an alternative hypothesis.
* We test a hypothesis by determining the chance of obtaining a sample statistic if the null hypothesis regarding the population parameter is true. 
