# Statistical Hypothesis Testing

### Learning Objectives:
- [Introduction: Statistical Significance & Probability](#Introduction:-Statistical-Significance-&-Probability)
- [Normal Distributions, Standardization & Z-tests](#Normal-Distributions,-Standardization-&-Z-tests)
- [T-tests](#T\-tests)
- [ANOVA Testing](#ANOVA-Testing)
- [Chi-squared Testing](#Chi\-squared-Testing)

# Introduction: Statistical Significance & Probability

Having covered the motivation behind statistical hypothesis testing, we will now look at the mechanisms behind this framework so that you can carry out your own hypothesis tests! We have seen that the aim of hypothesis testing is to help different scientists reach the same conclusion from the same data. But how do we bring together the ideas of _objectivity_ and _randomness_?

This is where probability and the probability distributions we have encountered come in handy. Since _probability_ is a measure of how likely something is to occur, and _probability distributions_ tell us the probability (or probability density) of every possible outcome, we can use probability distributions to tell us the __probability of observing our data given that the null hypothesis is true__. This means that if the probability of obtaining a __test statistic__ at least as extreme as the one computed from our data, assuming the null hypothesis is true, is small _enough_, we can reject the null hypothesis in favor of the alternative hypothesis. This probability is known as the __p-value__, and the _threshold_ of probability for rejecting the null hypothesis is known as the __significance level__ ($\mathbf{\alpha}$), which is chosen depending on how certain we need to be. If the p-value we compute from our data is less than our chosen significance level, our results are __statistically significant__ and we can reject the null hypothesis.

Therefore, we can formalize the steps of hypothesis testing:
1. Formulate a hypothesis 
2. Find the appropriate statistical test
3. Choose a significance level
4. Collect data and compute test statistic
5. Determine the p-value (probability)
6. Reject or fail to reject the null hypothesis
7. Make a decision

We will understand these steps in the context of four widely used statistical tests.
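
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the null mean of 170, the known population standard deviation of 5, the sample size of 25 and the sample mean of 172 are all hypothetical numbers, and a one-sided z-test is assumed to be the appropriate test.

```python
from math import sqrt
from statistics import NormalDist

# Step 1: formulate hypotheses (hypothetical example).
# H0: the population mean height is 170. H1: it is greater than 170.
null_mean = 170.0
population_std = 5.0  # assumed to be known

# Step 2: a one-sided z-test is assumed to be appropriate here.
# Step 3: choose a significance level.
alpha = 0.05

# Step 4: collect data (hypothetical sample) and compute the test statistic.
sample_mean = 172.0
n = 25
standard_error = population_std / sqrt(n)
z = (sample_mean - null_mean) / standard_error

# Step 5: p-value = probability of a z at least this large if H0 is true.
p_value = 1.0 - NormalDist().cdf(z)

# Step 6: reject H0 only if the p-value is below the significance level.
reject_null = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
# prints: z = 2.00, p-value = 0.0228, reject H0: True
```

With these made-up numbers the p-value falls below $\alpha = 0.05$, so the result would count as statistically significant.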

# Normal Distributions, Standardization & Z-tests

As we have seen before, the normal distribution is a bell-shaped curve centered around the mean, whose width is determined by its standard deviation. It is a useful distribution, as many random variables can be approximated by a normal distribution, and the _central limit theorem_ we previously encountered also allows us to make useful approximations. We will now explore the normal distribution in relation to hypothesis testing.

Let us start by considering the information that a normal distribution conveys. We will use the example of the height distribution in an entire year for three different schools, with population mean and standard deviations given by the table below. 

| School(#) | Mean | Standard deviation |
|--------|------|--------------------|
| 1      | 170  | 5                  |
| 2      | 175  | 10                 |
| 3      | 180  | 15                 |


Below we plot the __probability density functions (pdfs)__ for each of the three schools.

In [7]:
import numpy as np
import plotly.graph_objects as go

def norm_plot(mean, std, x_vals):
    # Evaluate the normal pdf with the given mean and std at every point in x_vals
    exponent = -0.5*(((x_vals-mean)/std)**2)
    f = np.exp(exponent)/(std*np.sqrt(2*np.pi))
    return f

x = np.linspace(130, 210, 1000)
f1 = norm_plot(mean=170, std=5, x_vals=x)
f2 = norm_plot(mean=175, std=10, x_vals=x)
f3 = norm_plot(mean=180, std=15, x_vals=x)

fig = go.Figure(data=go.Scatter(x=x, y=f1, name='#1'))
fig.add_trace(go.Scatter(x=x, y=f2, name='#2'))
fig.add_trace(go.Scatter(x=x, y=f3, name='#3'))

fig.update_layout(title='Normal Distribution',
                   xaxis_title='x',
                   yaxis_title='p(x)')
fig.show()

From the plot, it is clear that the standard deviation determines the _spread_ of the probability distribution and the mean determines the location of its _centre_. This makes sense: for a normal distribution the mean, being a measure of central tendency, is the most likely value, and values become less likely the further they are from it. The same logic can be applied to the standard deviation: if values are, on average, further from the mean (hence a larger standard deviation), there is a larger probability of observing values far from the mean.

When it comes to hypothesis testing, we often use the normal distribution to give us measures of probability. However, we tend to simplify the normal distribution to only have the necessary components for testing. So, what parts of the normal distribution actually contribute to the probability we compute? The first thing we notice is that the location of the mean is irrelevant. __We only care about the displacement away from the mean__. Secondly, we see from the above plots that the standard deviation affects the probability of a given displacement from the mean, meaning that __the probability depends on the displacement away from the mean relative to the standard deviation.__
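
A quick numerical check of this idea, using the three schools from the table, is sketched below. The helper `prob_within` is a hypothetical name introduced only for this illustration, and the fixed displacement of 10 height units is an arbitrary choice.

```python
from statistics import NormalDist

# (mean, standard deviation) for each school, taken from the table above.
schools = {1: (170, 5), 2: (175, 10), 3: (180, 15)}

def prob_within(mean, std, displacement):
    """Probability that a value lies within +/- displacement of the mean."""
    dist = NormalDist(mu=mean, sigma=std)
    return dist.cdf(mean + displacement) - dist.cdf(mean - displacement)

# Same absolute displacement (10 units): the probability differs per school.
fixed_displacement = {s: prob_within(m, sd, 10) for s, (m, sd) in schools.items()}

# Same relative displacement (1 standard deviation): the probability is identical.
one_std = {s: prob_within(m, sd, sd) for s, (m, sd) in schools.items()}

for s in schools:
    print(f"School {s}: within 10 units = {fixed_displacement[s]:.4f}, "
          f"within 1 std = {one_std[s]:.4f}")
```

The first column changes from school to school, while the second is roughly 0.6827 for all three, which is why only the displacement measured in standard deviations matters.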

This gives us motivation for an incredibly powerful tool, known as __standardization__. The goal of standardization is to compare any normally distributed data to the __standard normal distribution__, which is a normal distribution with a population mean $\mu = 0$ and a population standard deviation $\sigma = 1$. This is useful because, to calculate a probability, we don't need the explicit mean and standard deviation, only _how many standard deviations away from the mean a value is_.

To standardize a dataset, we carry out the following two steps:
- Subtract the mean from all points in our dataset to ensure it is centered at 0
- Divide the result by the standard deviation of the dataset, scaling the results so that the standard deviation is 1

Each standardized data value is given by:

$$z = \frac{x-\mu}{\sigma}$$

Where $x$ is our data point, referred to as a __raw score__, and $z$ is known as the __z-score__ or __standard score__. By definition, it is a measure of how many standard deviations a data point is from the mean.
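
As a minimal sketch of the two standardization steps, the snippet below standardizes a small, hypothetical list of heights and then computes the z-score of a raw score of 180 in each of the three schools from the table. The `heights` values are invented for illustration.

```python
from statistics import mean, pstdev

# Hypothetical dataset, treated as a whole population.
heights = [160.0, 165.0, 170.0, 175.0, 180.0]

mu = mean(heights)       # population mean
sigma = pstdev(heights)  # population standard deviation

# Step 1: subtract the mean. Step 2: divide by the standard deviation.
z_scores = [(x - mu) / sigma for x in heights]

print(f"mean of z-scores: {mean(z_scores):.3f}")   # approximately 0
print(f"std of z-scores:  {pstdev(z_scores):.3f}")  # approximately 1

# z-score of the same raw score (180) in each school from the table.
schools = {1: (170, 5), 2: (175, 10), 3: (180, 15)}
raw_score = 180
school_z = {s: (raw_score - m) / sd for s, (m, sd) in schools.items()}
print(school_z)  # {1: 2.0, 2: 0.5, 3: 0.0}
```

The same raw score is two standard deviations above the mean in school 1, half a standard deviation above it in school 2, and exactly at the mean in school 3.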

Why do we bother using the standard normal distribution? Couldn't we get our probabilities from the _raw_ distributions? In practice, you _could_ just use the original normal distribution, even in a statistical test. But by using z-scores, we can use _one_ common distribution to describe the probabilities of _any_ normally distributed dataset. Another reason is that, to get a probability from a pdf, we have to find the area under its curve within a given range, so we would have to carry out complex operations for each distribution. Instead, statisticians have created the __z-table__, which tells us the _cumulative probability_ of any given z-score. We can use this to compute probabilities for any given normal distribution! While this is not _as_ relevant as it was before calculators and computers could give us these probabilities, it still greatly simplifies the process.
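
In code, a z-table lookup corresponds to evaluating the cumulative distribution function of the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for the z-table, with school 2 from the table and a raw score of 180 as an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mu = 0, sigma = 1

# School 2 from the table, with an illustrative raw score.
mean, std, raw_score = 175.0, 10.0, 180.0

# Standardize, then "look up" the cumulative probability of the z-score.
z = (raw_score - mean) / std
p_from_z = standard_normal.cdf(z)

# The same probability computed directly from the raw distribution.
p_from_raw = NormalDist(mu=mean, sigma=std).cdf(raw_score)

print(f"z = {z}, P(Z <= z) = {p_from_z:.4f}, P(X <= 180) = {p_from_raw:.4f}")

# A tiny slice of a z-table: cumulative probabilities for a few z-scores.
for z_value in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"z = {z_value:.1f} -> {standard_normal.cdf(z_value):.4f}")
```

Both routes give the same cumulative probability (about 0.6915 here), which is what allows a single table to serve every normal distribution.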

# T-tests
https://blog.minitab.com/blog/michelle-paret/guinness-t-tests-and-proving-a-pint-really-does-taste-better-in-ireland

# ANOVA Testing

# Chi-squared Testing

In [11]:
1.645-1.067

0.5780000000000001