# Hypothesis Testing: z-Tests

![the lord is testing me gif from giphy, originally from community](https://media.giphy.com/media/nxeAo2Q8qNdrG/giphy.gif)

## Objectives

- Describe the basic framework and vocabulary for hypothesis testing
- Define Null and Alternative Hypotheses
- Define p-value, $\alpha$
- Perform one-sample z-tests

## Intuition

Suppose we gather a sample of data. We want to know if the sample is a good representation of some estimated population. How can we make an appropriate guess about how *representative* the sample is of that population? Or, how can we know if we have evidence that this sample *does not* come from the same distribution as our estimated population?

Once we have a hypothesis and test it with some kind of experiment, we can calculate whether it's likely our data shows some significant finding, or whether it's more likely that we got something that seemed exciting just because of random chance.

## Steps of a Hypothesis Test

1. State the alternate and null hypotheses
2. Determine what type of test to run (we'll focus on $z$-tests in this lecture)
3. Specify significance level ($\alpha$)
4. Calculate test statistic (z-statistic) - aka run the test!
5. Translate, by either:
    - Translating significance level $\alpha$ into a significance threshold
    - Translating test statistic into a p-value
6. Interpret (reject or fail to reject the null hypothesis) 

Let's go through the steps of the hypothesis test one-by-one:

## STEP 1: State the Null and Alternative Hypotheses

A pretty painted picture of science posits that a scientist formulates a hypothesis that explains or generalizes from some set of observations, and then conducts some experiment, which will either confirm or refute that hypothesis.

A nice simplification, but an oversimplification. Often the confirmation of some experimental or **alternative hypothesis** is a _relative_ affair, where it is measured against some **null hypothesis**.

> The **null hypothesis** $H_0$ is what we would expect if there is no difference from our sample to our comparison group (in other words, the status quo, what we'd expect given what we already know about the subject)
> 
> The **alternative hypothesis** $H_a$  says the sample is _different_ from the comparison group. It is essentially the opposite of the null hypothesis (there is an _effect_ and we found something _significant_).

If an alternative hypothesis states that there is some significant relationship between two variables, then the null hypothesis simply states that there is no such relationship.

If we're testing the function of a new drug, then the null hypothesis will say that the drug has _no effect_ on patients, or anyway no effect relative to relief of the malady the drug was designed to combat. If we're testing whether Peeps cause dementia, then the null hypothesis will say that there is _no correlation_ between Peeps consumption and rate of dementia development.

It's important to clearly state both the **null hypothesis** $H_0$ and **alternative hypothesis** $H_a$ (or $H_1$) so we can be clear in what we can learn from our hypothesis test.

### Right-Tail, Left-Tail or Two-Tailed Tests

The direction you explore impacts how you write your hypotheses!

<img src="images/comparison-table_statsatoz.png" width=500>

[[Image Source]](https://www.statisticsfromatoz.com/blog/statistics-tip-in-a-1-tailed-test-the-alternative-hypothesis-points-in-the-direction-of-the-tail)

While we're not going to worry so much about this right now (though we will discuss this again later in this notebook!), we often write and define our **alternative** hypotheses with this kind of language:

- **Two-tail test:** **$H_a$** says that our sample shows a **difference** or that something is **different than** ($\neq$) we'd expect with the null (no clear direction, just different, so we test on both sides!)
- **Right-tail test:** **$H_a$** says that our sample shows an **increase** or that something is **greater than** ($>$) we'd expect with the null
- **Left-tail test:** **$H_a$** says that our sample shows a **decrease** or that something is **less than** ($<$) we'd expect with the null


### 🧠 Knowledge Check

1) A drug manufacturer **claims that a drug increases memory.** It designs an experiment where both control and experimental groups are shown a series of images, and records the number of correct recollections until an error is made for each group. 

**What are the null and alternate hypotheses?**

- 


2) An online toystore claims that **putting a 5 minute timer on the checkout page of its website decreases conversion rate.**

**What are the null and alternate hypotheses?**

- 


3) The Kansas City public school system wants to **test whether the scores of students who take standardized tests under the supervision of teachers differ from the scores of students who take them in rooms with school administrators.**

**What are the null and alternate hypotheses?**

- 


4) A pest control company **believes that cockroach legs in colonies which have persisted after two or more insecticide treatments are longer than those in colonies which have not been treated with insecticide.**

**What are the null and alternate hypotheses?**

- 


5) A healthcare company **believes patients between the ages of 18 and 25 participate in annual checkups less than all other age groups.**

**What are the null and alternate hypotheses?**

- 


## STEP 2: Determine What Type of Test to Run

Over the next few days we'll learn about quite a few hypothesis tests! The type of test is determined by some underlying known parameters and what we're trying to test.

In this lecture, we're specifically running **one-sample z-tests**.

**Z-tests** are run when we:
- are dealing with **normally distributed data** (on the z-distribution)
- have a sufficiently **large sample size** (typically at least 30, some say at least 50)
- know the **population mean $\mu$ and standard deviation $\sigma$**

This hypothesis test tries to answer the question: **how likely are we to observe a z-statistic as extreme as our sample's, given the null hypothesis that the sample and the population have the same mean?**

## STEP 3: Specify a Significance Level ($\text{alpha}: \alpha$)

Now that we have our hypotheses defined and we know what test we're running, we have to determine when we say an observation is **statistically significant**. Basically, how "weird" do things have to be until we reject $H_0$.

_We choose_ a threshold called the **significance level** $\alpha$. The smaller the value, the more "weirdness" we're willing to accept before rejecting the null hypothesis. Formally, $\alpha$ is the probability of rejecting the null hypothesis when it is actually true.

The most commonly used significance level is $\alpha = 0.05$. 

> When you set $\alpha = 0.05$, you're saying: "I'm okay with rejecting the null hypothesis if, assuming the null hypothesis is true, there is less than a 5% chance of seeing results at least this extreme due to randomness alone".

If the probability of observing what we found in our sample is smaller than $\alpha$, then we will reject the null hypothesis.

## STEP 4: Calculate a Test Statistic - by Running a Hypothesis Test!

With the setup from the prior steps, we can now look at our sample data. We'll want to find a **test statistic** that can be compared to the distribution we'd expect based on the null hypothesis (usually something like the normal distribution).

> "The test statistic takes your data from an experiment or survey and compares your results to the results you would expect from the null hypothesis."
> 
> -- [Statistic How-To](https://www.statisticshowto.com/test-statistic/)

Today we will focus on performing a **$z$-test** which is a hypothesis test that uses the normal curve. So we will find basically the $z$-score of our sample's mean - also known as our **$z$-statistic** in the context of hypothesis testing.

We already introduced the concept and some of the requirements of a z-test, but let's state them more formally:

For large enough sample sizes (at least $n = 30$), with known population standard deviation, the test statistic of the sample mean $\bar x$ is given by the z-statistic,

$$Z = \frac{(\bar{x} - \mu)}{\sigma/\sqrt{n}}$$

Where $\bar{x}$ is the sample mean, $\mu$ is the population mean, $\sigma$ is the known population standard deviation, and $n$ is the sample size (or, you can say that the denominator, $\sigma/\sqrt{n}$, is the _standard error of the mean_).

> Remember that our $\mu$ comes from the null hypothesis; we expect our sample to have about the same mean as the population if the null hypothesis is true.

NOTE: If you think this is basically the formula for a z-score: _you're correct!_

But note that, when we're comparing a sample mean to a population mean, we calculate our denominator as the _standard error of the mean_, taking into account the size of our sample - this helps us put the z-score in the correct units to figure out how statistically unlikely our sample is. With this formula, the resulting z-statistic (or z-score) tells us how many standard errors above or below the population mean our sample mean is.
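To make the arithmetic concrete, here is a minimal sketch of the z-statistic formula. All four numbers are hypothetical, chosen only so the calculation is easy to follow by hand; they don't come from any dataset in this lecture.

```python
import numpy as np

# Hypothetical numbers, chosen only to illustrate the arithmetic
mu = 100       # population mean under the null hypothesis
sigma = 15     # known population standard deviation
n = 36         # sample size
x_bar = 105    # observed sample mean

# Denominator: standard error of the mean (15 / 6 = 2.5)
standard_error = sigma / np.sqrt(n)

# z-statistic: how many standard errors x_bar sits from mu (5 / 2.5 = 2.0)
z_statistic = (x_bar - mu) / standard_error

print(standard_error, z_statistic)
```

Notice that the same 5-point gap would give a larger z-statistic with a bigger sample, since the standard error shrinks as $n$ grows.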

## STEP 5: Translate

Once we've calculated a test statistic, we need a way to compare it to our significance level $\alpha$.

In other words, we need to translate!

We can do this one of two ways:

1. Translate the test statistic into a **p-value** you can compare to the significance level ($\alpha$).

2. Translate the significance level ($\alpha$) into a **significance threshold** (or critical value) in the same units as a test statistic.
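As a quick illustration of the second option, the sketch below turns $\alpha$ into critical values on the z-scale using `stats.norm.ppf` (the inverse of the cdf). The variable names are just for illustration, and $\alpha = 0.05$ is assumed.

```python
from scipy import stats

alpha = 0.05

# Left-tailed test: reject the null if z_statistic < left_critical
left_critical = stats.norm.ppf(alpha)

# Right-tailed test: reject the null if z_statistic > right_critical
right_critical = stats.norm.ppf(1 - alpha)

# Two-tailed test: reject the null if |z_statistic| > two_tail_critical
two_tail_critical = stats.norm.ppf(1 - alpha / 2)

print(round(left_critical, 3), round(right_critical, 3), round(two_tail_critical, 3))
# -1.645 1.645 1.96
```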


Today we're going to focus on the first option: translating our test statistic (z-statistic) into a **p-value**

The basic idea of a p-value is to quantify how likely results like ours would be if mere random chance were the only thing at work.

This is connected with the null hypothesis: If the null hypothesis is true and there is _**no** significant_ difference between the population and our sample, then the result we see would have to be the result of mere random chance.

The p-value is the probability of observing a test statistic at least as extreme as ours by random chance, assuming that the null hypothesis is true. This tells us how _likely or unlikely_ our sample measurement is.

### Bring back the tails!

That's right - the way we designed our test has an impact here! 

<img src="images/comparison-table_statsatoz.png" width=500>


**LEFT SIDE**

If we're running a **left** tailed test, the calculation to translate our test statistic into a p-value is easy, and something we've already seen:

```
p_value = stats.norm.cdf(z_statistic)
```

This calculates the likelihood we'd see a test statistic _as small or smaller_ than the one we calculated from our test.

**RIGHT SIDE**

If instead we're running a **right** tailed test, we need to **invert** the cdf to check the likelihood on the other side:

```
p_value = 1 - stats.norm.cdf(z_statistic)
```

This is called the survival function, and we can also calculate it like:

```
p_value = stats.norm.sf(z_statistic)
```

This calculates the likelihood we'd see a test statistic _as large or larger_ than the one we calculated from our test.

**BOTH SIDES**

If we're running a **two** tailed test, it gets more complicated. Because we are checking both sides, we need to split that 5% chance of being wrong (our significance level) across the two tails to accommodate the possibility of being wrong on either side - so we would instead compare a one-tailed p-value to $\alpha / 2$!

We can accommodate this directly in our calculations when we derive our p-value using this code, and multiplying our calculation by two rather than dividing our $\alpha$:

```
p_value = stats.norm.sf(np.abs(z_statistic)) * 2
```

(Note: don't believe me? hopefully you believe the `statsmodels` library: I grabbed the above code directly from their [source code](https://www.statsmodels.org/stable/_modules/statsmodels/stats/weightstats.html#_zstat_generic)!)
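The three cases above can be gathered into one small helper. This is only a sketch: the function name `z_to_p_value` and the `tail` labels are made up for this example, not part of `scipy` or `statsmodels`.

```python
import numpy as np
from scipy import stats

def z_to_p_value(z_statistic, tail):
    """Translate a z-statistic into a p-value.

    tail is one of 'left', 'right', or 'two' (hypothetical labels).
    """
    if tail == "left":
        # Probability of a statistic as small or smaller
        return stats.norm.cdf(z_statistic)
    if tail == "right":
        # Probability of a statistic as large or larger
        return stats.norm.sf(z_statistic)
    if tail == "two":
        # Probability of a statistic at least this far from 0, on either side
        return 2 * stats.norm.sf(np.abs(z_statistic))
    raise ValueError("tail must be 'left', 'right', or 'two'")

# The same z-statistic gives very different p-values depending on the test
for tail in ("left", "right", "two"):
    print(tail, round(z_to_p_value(-2.0, tail), 4))
```

A z-statistic of -2 looks surprising for a left-tailed or two-tailed test, but not at all for a right-tailed one, which is why the direction has to be decided when the hypotheses are stated.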

## STEP 6: Interpret

Suppose we calculate a p-value for some test statistic we've measured and we get a p-value of 20%. This would mean that, if the null hypothesis were true, we'd see results at least as extreme as ours about 20% of the time through random chance alone - these results are not surprising based on what we already knew, and what we stated would be the case in the null hypothesis. This is probably high enough that we should **fail to reject the null hypothesis**.

On the other hand, if we calculated a p-value of .000001% for our measured test statistic, then it's pretty unlikely that we would've seen that statistic if the null hypothesis was true - we can then **reject the null hypothesis** in favor of the alternative hypothesis.

In short:

If $p < \alpha$, we can reject the null hypothesis.

If $p \geq \alpha$, we fail to reject the null hypothesis.
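One way to see what this decision rule costs us is to simulate many experiments where the null hypothesis really is true and count how often we reject it anyway. The sketch below assumes a hypothetical population (mean 100, standard deviation 15), samples of size 50, and a two-tailed test, using the z-statistic formula from Step 4 and the two-tailed p-value from Step 5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical population and experiment settings
mu, sigma, n = 100, 15, 50
alpha = 0.05
n_experiments = 10_000

# Every sample is drawn from the null population, so the null is TRUE here
samples = rng.normal(loc=mu, scale=sigma, size=(n_experiments, n))

# One z-statistic and one two-tailed p-value per simulated experiment
z_stats = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))
p_values = 2 * stats.norm.sf(np.abs(z_stats))

# Fraction of experiments where p < alpha, i.e. we wrongly reject the null
false_positive_rate = np.mean(p_values < alpha)
print(f"Rejected a true null in {false_positive_rate:.1%} of experiments")
```

The rate should land close to 5%, which is what it means for $\alpha$ to be the probability of rejecting the null hypothesis when it is true.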

Some notes on how we discuss our interpretations and findings!

**We never accept the alternative hypothesis, we only *reject* or *fail to reject* the null hypothesis in favor of the alternative.**

Also!

**We never _accept_ the null hypothesis, because future experiments may yield significant results.**

We do not throw out "failed" experiments! Instead we say "this methodology, with this data, does not produce significant results".

> This only tells us if there is a statistically significant difference, not to what _degree_
> ![](https://imgs.xkcd.com/comics/p_values.png)
> ☝️ _Be careful how you interpret your p-value_

### Summary

One-sample z-test steps:

1. State alternative hypothesis (and null)
    - example: sample mean is greater than population mean (mu)
    
    
2. Decide your test 
    - right now, we only know about one-sample z-tests!
    
    
3. Specify significance level ($\alpha$)
    - alpha is the probability of rejecting null even though it's true (!)
    
    
4. Calculate test statistic (z-statistic)
    - $z = \frac{\bar{x}-\mu}{\sigma/\sqrt n}$
    
    
5. Calculate p-value
    - Probability we'd find this value given null is true
        - Right: p = 1 - CDF(z-stat)
        - Left: p = CDF(z-stat)
        - Two-Tailed: p = (1 - CDF(|z-stat|))\*2
        
        
6. Interpret p-value against $\alpha$
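Here is a minimal end-to-end sketch of those six steps for a right-tailed test. Everything in it is an assumption for illustration: the population parameters are hypothetical and the sample is synthetic, generated with a fixed seed rather than read from real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Steps 1-2: H_a says the sample mean is greater than mu (right-tailed z-test)
mu, sigma = 100, 15          # hypothetical known population parameters

# Step 3: significance level
alpha = 0.05

# Synthetic sample standing in for real measurements
sample = rng.normal(loc=104, scale=sigma, size=60)

# Step 4: z-statistic, using the known population standard deviation
standard_error = sigma / np.sqrt(len(sample))
z_statistic = (sample.mean() - mu) / standard_error

# Step 5: right-tailed p-value
p_value = stats.norm.sf(z_statistic)

# Step 6: interpret against alpha
reject_null = p_value < alpha
print(f"z = {z_statistic:.2f}, p = {p_value:.4f}")
print("Reject the null" if reject_null else "Fail to reject the null")
```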


## YOUR TURN

Suppose we are told that the population of African elephants has shoulder heights distributed normally around a mean of 260 cm, with a standard deviation of 30 cm. 

Pachyderm Adventures has recently measured the shoulder height (among other things) of 217 adult African elephants in Gabon and shared their dataset with us - they think their Gabonese elephants are significantly smaller than other African elephants! Let's see what the evidence says:

**What is our alternative hypothesis?**

**What is our null hypothesis?**

**What is our significance level, alpha?**

Now let's read in the dataset given to us by Pachyderm Adventures!

([actual data source](https://vincentarelbundock.github.io/Rdatasets/doc/Stat2Data/ElephantsMF.html) - was adjusted to only include adult elephants over the age of 4)

```
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('data/adult_elephants.csv')

df['Height'].mean()
```

We were given our population parameters, $\mu$ and $\sigma$

```
mu = 260
sigma = 30
```

**Now let's calculate the z-statistic**

Remember the formula to calculate a z-statistic:

$$Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$

```
# First calculate the denominator, standard error

```

```
# Now for the full picture!

```

```
# Now we get our p-value from the test statistic:

```


```
# I'll note that there are functions to calculate test statistics for you
# HOWEVER note that this doesn't let you input the population stdev
# If you're given the pop stdev, use the formula above, NOT something like this!
from statsmodels.stats.weightstats import ztest
# Takes in the data, our population mean, and what kind of test
ztest(df['Height'], value=mu, alternative='smaller')

# Our test stat and the resulting p-value are different
```

#### Interpret!

So what? What is your result?

- 


### The Rough-and-Tumble Recap to Statistical Hypothesis Testing:

- Start with a Scientific Question (yes/no)
- Take the skeptical stance (null hypothesis)
- State the complement (alternative hypothesis)
- Decide how surprised you would need to be in order to change your mind (alpha)
- Create a model of the situation **assuming the null hypothesis is true!**

### Sidebar: What P-Values Are, and What They Aren't

There's a trend in stats right now of criticizing P-values, so you may see some criticism of using P-values to conduct tests. Yudi Pawitan, who works in Medical Epidemiology and Biostatistics at the Karolinska Institutet in Stockholm, Sweden, went on the podcast Data Skeptic to discuss his paper: _Defending the P-value_.

If you want to learn more about the controversy, and what P-values are and what they aren't, I recommend you give the episode a listen:

https://podcasts.apple.com/us/podcast/defending-the-p-value/id890348705?i=1000494460371

The point: scientists often don't do enough work thinking through what p-value _threshold_ they should use, which can lead to problems. Often the standard is 5% (.05) - but while that works fine for some areas of research, that might be too low or too high for others. 

P-values more than anything are a way of balancing between false positives and false negatives, which we'll discuss more later. But, when deciding your threshold, you should think through the cost of your false positive versus the cost of your false negative, rather than using some arbitrary standard.


To further make the point, check out this cautionary study, the weight-loss chocolate study conducted by John Bohannon: https://www.scribd.com/doc/266969860/Chocolate-causes-weight-loss

> Article on explaining the whole ordeal https://io9.gizmodo.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800

Related: ["P-hacking"](https://scienceinthenewsroom.org/resources/statistical-p-hacking-explained/)

Explained in comic form:

![p-hacking comic by xkcd](https://imgs.xkcd.com/comics/significant.png)
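The arithmetic behind the joke is short. As a rough sketch, assume 20 independent tests (one per jelly bean color), each run at $\alpha = 0.05$, with the null hypothesis true every time:

```python
alpha = 0.05
n_tests = 20  # assumption: 20 independent tests, all with a true null

# Chance that no test is "significant" is (1 - alpha) ** n_tests,
# so the chance of at least one false positive is the complement
p_at_least_one = 1 - (1 - alpha) ** n_tests

print(f"{p_at_least_one:.2f}")  # 0.64
```

So even when nothing is going on, running enough tests makes a "significant" result more likely than not.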

## Level Up: Experiment Design

When we want to be able to determine something about the world, we typically turn to science. And science is really built upon the idea of testing ideas through **experiments**. If we have an idea, but our experiments show that it's unlikely to be true (or likely to be true!), then we learned something about our world!

Experiments are how we get the data we need to determine if our observations are worthwhile! But if you have a poorly designed experiment, you can't trust the observations/data to say anything useful.

> **NOTE**
>
> We typically use the term "experiment" when doing a hypothesis test. This can be a little confusing when the data has been collected _before_ any other step. This is fine but we should consider if this experiment follows the general criteria of a "good" design.

### Making a Good Experiment

A perfectly designed experiment would test every possible answer to a question. Of course this is unrealistic, so we strive towards the best experiment we can to answer our questions.

Below are a few items to consider for a good experiment. An experiment doesn't have to fulfill everything to still be useful, though the more items we can check off the list, the more certain we'll feel about our results. 

### Control Groups

> Your experiment should consider other factors that could affect the outcome and try to account for (or *control*) those factors

### Random Trials

> By having random trials/samples, you're less likely to have bias in your observations/data

### Sample Size

> A large enough sample size that we can reasonably extrapolate to the population of interest

### Reproducible

> Being able to reproduce the experiment means we can test again and ensure our results are valid.