# Hypothesis testing

Hypothesis testing belongs to a branch of statistics known as *inference*. Inference is the part of statistics in which we try to *infer* the parameters of a model from data. Before we get to the tests themselves, there are some characters to introduce in this story.

## Models, parameters, and estimates

The first, and main one, is the *model*. A *model* is the set of equations and mathematical rules that govern the phenomenon we are studying. For example, if we flip a coin a fixed number of times and we want to know the number of heads we get, then our model could be "the number of heads follows a Binomial distribution".

A model has parameters. A Binomial has $n$ and $p$, a Poisson has $\lambda$, a Normal has $\mu$ and $\sigma$, and so on. We are going to refer to the set of parameters using the Greek letter $\theta$, which solves the problem of having to write "model parameters" several times throughout this text.

The problem here is that, even if we figure out which model we are dealing with, we seldom have access to its true parameters $\theta$. Instead, we use sample statistics to *estimate* parameters. We differentiate estimated parameters from the "true" ones using a hat, that is, we have estimated parameters $\hat{\theta}$. For example, we can use a sample mean as an estimate for the model mean, or: $\hat{\mu} = \bar{x}$.
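To make the difference between $\mu$ and $\hat{\mu}$ concrete, here is a minimal sketch. It assumes a Normal model whose "true" parameters (`true_mu`, `true_sigma`) are made-up values chosen only for illustration; in a real problem we would not know them.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

# "True" parameters of the model (illustrative values, normally unknown)
true_mu, true_sigma = 5.0, 2.0

estimates = {}
for n in (10, 100, 10_000):
    # Draw a sample of size n from the model
    sample = rng.normal(loc=true_mu, scale=true_sigma, size=n)
    # The sample mean is our estimate mu_hat of the model mean
    estimates[n] = sample.mean()
    print(f"n = {n:>6d}  mu_hat = {estimates[n]:.3f}")
```

The estimate is rarely equal to the true value, but it tends to get closer as the sample grows.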

With that said, we can move on to the main question:

## What is hypothesis testing, anyway?

Let's take for example a measurement of the weights of lettuces. According to a quick Google search, lettuces weigh around 300g, but they can easily weigh as little as 250g or as much as 350g. Because I don't know much about lettuces, I will just assume that regular lettuce weights follow a Normal distribution with $\mu=300$ and $\sigma=50$.

Now let's suppose I decided to grow some lettuces with a non-standard technique. This is a different population from the standard one because I purposefully intervened in it. Now we have two groups: the *control* group, which used the standard technique, and the *experimental* group, which used the non-standard technique.

We usually know things about the control group because they are well-established, but we don't know much about the experimental group because its conditions are a novelty. Hence, we will need to estimate parameters for the experimental group. How do we do this?

## Data analysis

As a non-agricultural person, I have no idea how to actually grow lettuces other than planting seeds, watering them, and hoping they will grow. There is obviously more to it, but in this text we are interested in the data that comes out of this experiment. We should weigh some lettuces of the experimental group, which will get us the following *measurements*:

* A sample mean $\bar{x} = \hat{\mu}_e$
* A sample standard deviation $s = \hat{\sigma}_e$
* A sample size $N$

We don't have direct access to these other parameters, but we suppose they exist:

* $\mu_e$, the model mean for the experimental group. We only have access to $\bar{x}$.
* $\sigma_e$, the model standard deviation for the experimental group. We only have access to $s$.
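As a sketch of how these measurements could be computed, suppose we weighed ten lettuces. The values in `weights` below are placeholders I made up for illustration, not real data.

```python
import numpy as np

# Hypothetical lettuce weights in grams (placeholder values, not real data)
weights = np.array([340, 365, 352, 338, 360, 347, 355, 349, 343, 351])

N = len(weights)        # sample size
xbar = weights.mean()   # sample mean, our estimate of mu_e
s = weights.std(ddof=1) # sample standard deviation, our estimate of sigma_e

print(N, xbar, s)
```

Using `ddof=1` divides by $N-1$ instead of $N$, which is the usual choice when the standard deviation is estimated from a sample.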

Now, we are starting to get somewhere!

## The Null Hypothesis

We have some sample measurements in the experimental group. It's likely that these sample statistics are a bit different from the model parameters of the control group. However, because we only have a finite sample from the experimental group, these measurements could hypothetically have been generated under the model parameters of the control group.

That would actually be a problem, because it would mean that the experimental group and the control group follow the same distribution, and hence are not discernible. In other words, it would mean that our novel non-standard lettuce growing technique is not different from the standard one. We are especially interested in finding differences between the population means $\mu$ and $\mu_e$, because they represent the expected values for lettuce weights.

So, let's suppose $\mu=\mu_e$. In this case, $\bar{x}$ is a bit of an extreme observation, in the sense that it is a bit far from the model mean $\mu$. This leads to our *null* hypothesis:

$$
H_0: \mu = \mu_e, 
$$

that is, we will begin with the hypothesis that both model means are equal.

Assuming the null hypothesis is true, we will calculate the probability of observing a sample mean at least as extreme as $\bar{x}$. This is a somewhat tricky part because it requires understanding how to calculate this probability.

If we use the Central Limit Theorem and assume Normal distributions all around, what we get is that, assuming the Null Hypothesis is true (and that the experimental group has the same standard deviation $\sigma$ as the control group), the sample mean in the experimental group follows a distribution:

$$
\bar{X} \sim N(\mu, \sigma^2/N)
$$

so, for a sample of size $N=10$ lettuces with an observed sample mean $\bar{x}=350$g, we would have:

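```python
import numpy as np
import scipy.stats as st

mu = 300      # mean of the control group (g)
sigma = 50    # standard deviation of the control group (g)
N = 10        # number of lettuces in the experimental sample
xbar = 350    # observed sample mean of the experimental group (g)

# Standard deviation of the sample mean under the Null Hypothesis
sigma_xbar = sigma / np.sqrt(N)

# Probability of a sample mean of at least xbar under the Null Hypothesis
p = 1-st.norm.cdf(xbar, loc=mu, scale=sigma_xbar)
print(p)
```

```
0.0007827011290012509
```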


This means that, if the Null Hypothesis ($H_0$) is true, and if our modelling assumptions hold, we have a probability $P(\bar{X} \geq \bar{x} \mid \mu_e = \mu) \approx 7.8 \times 10^{-4}$ of observing a sample mean at least as large as ours. This is called a *p-value*.
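If the formula feels like magic, one way to check it is by brute force. The sketch below simulates many experiments in which the Null Hypothesis is true and counts how often the sample mean reaches 350g. The number of simulated experiments and the seed are arbitrary choices of mine.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility

mu, sigma = 300, 50       # control-group parameters from the text
N, xbar = 10, 350         # sample size and observed sample mean from the text
n_experiments = 200_000   # number of simulated experiments (arbitrary)

# Each row is one simulated experiment with N lettuces grown under the Null
weights = rng.normal(loc=mu, scale=sigma, size=(n_experiments, N))
simulated_means = weights.mean(axis=1)

# Fraction of simulated sample means at least as large as the observed one
p_simulated = np.mean(simulated_means >= xbar)

# Analytical value for comparison (sf is the survival function, 1 - cdf)
p_exact = st.norm.sf(xbar, loc=mu, scale=sigma / np.sqrt(N))

print(p_simulated, p_exact)
```

The simulated fraction should land close to the analytical p-value, with some noise because such extreme sample means are rare.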

A common practice is to pre-define a threshold $\alpha$, which is called the *significance level*. If the p-value is below $\alpha$, then we call our effect *statistically significant* and we *reject* the Null Hypothesis. But, wait, if we *reject* the Null Hypothesis, what remains?

## The Alternative Hypothesis

When we calculated our p-value, we used the line:

`p = 1-st.norm.cdf(xbar, loc=mu, scale=sigma_xbar)`,

which calculates $P(\bar{X} \geq \bar{x} \mid \mu_e=\mu)$. We only look at the upper tail because our alternative hypothesis is that $\mu_e > \mu$. This alternative hypothesis only makes sense if we *know* that $\mu_e$ cannot be lower than $\mu$, and this entirely depends on how confident we are in our novel lettuce growing technique.

Another possible alternative hypothesis would be $\mu_e \neq \mu$. In this case, we would have to consider extreme observations on both sides of the Normal curve. Because the Normal curve is symmetric, our p-value would be twice the one we had previously calculated. These different approaches are often called *one-sided* and *two-sided* tests.

In general, we would write the alternative hypothesis as one of the following:

$$
\begin{aligned}
H_1&: \mu < \mu_e \text{ (for a one-sided test)}\\
H_1&: \mu > \mu_e \text{ (for a one-sided test)}\\
H_1&: \mu \neq \mu_e \text{ (for a two-sided test)}
\end{aligned}
$$
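Each alternative hypothesis corresponds to a different tail of the Normal curve. A minimal sketch for our lettuce numbers, written in terms of a z-score (the variable names are my own choice):

```python
import numpy as np
import scipy.stats as st

mu, sigma = 300, 50   # control-group parameters from the text
N, xbar = 10, 350     # sample size and observed sample mean from the text

# Distance between xbar and mu, in units of the standard error
z = (xbar - mu) / (sigma / np.sqrt(N))

p_greater = st.norm.sf(z)             # H1: mu < mu_e (upper tail)
p_less = st.norm.cdf(z)               # H1: mu > mu_e (lower tail)
p_two_sided = 2 * st.norm.sf(abs(z))  # H1: mu != mu_e (both tails)

print(p_greater, p_less, p_two_sided)
```

Note how the lower-tail p-value is close to 1 here: a sample mean of 350g gives no support at all to the idea that the new technique produces lighter lettuces.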

## What happens if we reject the Null?

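When we decide to reject the Null, there is a chance we are wrong. In this case, we have a *false positive*, also known as a *Type-1 Error*. Note that the p-value is *not* the probability that we made a type-1 error: it is the probability of data at least as extreme as ours, calculated assuming $H_0$ is true. What we do control is the long-run rate of type-1 errors: if we always reject when $p < \alpha$, we will wrongly reject a true Null in at most a fraction $\alpha$ of the experiments.

In other words: if we commit and decide to reject $H_0$ in favor of $H_1$, we are incorporating this possibility of error in our next inferences, knowing that our decision rule rejects a true Null with probability $\alpha$.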
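A simulation can show how often a true Null gets rejected when we use a fixed significance level. In the sketch below the Null is true by construction, so every rejection is a false positive. The value $\alpha = 0.05$, the number of experiments, and the seed are assumptions of mine.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility

mu, sigma, N = 300, 50, 10  # control-group parameters and sample size
alpha = 0.05                # assumed significance level
n_experiments = 100_000     # number of simulated experiments (arbitrary)

# Simulate experiments in which the Null is true (mu_e equals mu)
weights = rng.normal(loc=mu, scale=sigma, size=(n_experiments, N))
sample_means = weights.mean(axis=1)

# One-sided p-value for each simulated experiment
p_values = st.norm.sf(sample_means, loc=mu, scale=sigma / np.sqrt(N))

# Every rejection is a type-1 error, because the Null is true here
false_positive_rate = np.mean(p_values < alpha)
print(false_positive_rate)
```

The fraction of false positives should be close to $\alpha$, regardless of the p-value found in any single experiment.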

## Is a low p-value all you need?

It is tempting to chase ever lower p-values. However, remember that p-values only refer to their corresponding null hypothesis. In our case, the p-values refer to the hypothesis that $\mu=\mu_e$.

Hence, rejecting the null hypothesis means that our observations indicate that $\mu$ is not *strictly* equal to $\mu_e$. It says absolutely nothing about how different they are. In fact, we would need an effect size analysis for this, which is a whole different problem.

An interesting experiment is the following. If you go back to the code above and insanely increase the sample size while keeping $\bar{x}$ fixed, you will observe that $p$ quickly drops. This is one of the techniques behind the so-called *p-hacking*.

Although low p-values can make your boss, your funding agency, and ultimately yourself very happy, there is more to it. Effects can be statistically significant, but their practical significance should be evaluated as well. For example, we could well observe a very low p-value with a mean difference of 1g between lettuce crops, which represents less than half of a typical lettuce leaf... perhaps it is not even worth trying it in practice!
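The 1g example can be sketched in a few lines. Here I assume a hypothetical sample mean of 301g and use the same Normal model as before; the list of sample sizes is arbitrary.

```python
import numpy as np
import scipy.stats as st

mu, sigma = 300, 50  # control-group parameters from the text
xbar = 301           # hypothetical sample mean, only 1g above the control mean

sample_sizes = [10, 1_000, 100_000, 1_000_000]  # arbitrary choices
p_values = []
for n in sample_sizes:
    # The standard error shrinks as the sample grows
    sigma_xbar = sigma / np.sqrt(n)
    # One-sided p-value for the same 1g difference
    p = st.norm.sf(xbar, loc=mu, scale=sigma_xbar)
    p_values.append(p)
    print(f"N = {n:>9,d}  p = {p:.3g}")
```

The difference is always the same 1g, yet the p-value goes from unremarkable to tiny just because the sample grows.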

## In conclusion

Hypothesis testing is a very important, yet often misinterpreted, part of science. This text was a short review of it, and hopefully a useful one. The next step from here is to build a repertoire of tests that are useful in your field - perhaps start with the [t-test](pvalues.html)?



