#  Hypothesis Testing with Firearm Licensees

Warning: this notebook is a refresher on hypothesis testing and the t-test for those who have already seen these concepts. If this is your first time grappling with them, the ideas may be difficult to parse right away!

## Hypothesis testing

A *hypothesis* is, as per the scientific method, a statement of something that we believe to be true (or maybe false). When Galileo argued to 17th-century Europeans that the planets revolve around the Sun, not the Earth, he had a theory (heliocentrism) built up on a sequence of hypotheses. You probably did something similar many times over in your fifth-grade science class or high school "laboratory".

In statistics, hypotheses are used to determine the level to which a statistical result is reliable. The fundamental problem is that when we try to measure something about an underlying distribution using a bunch of datapoints, we are almost certainly wrong. However, some answers are more wrong than others, and the less wrong the answer, the more useful the result. Hence what we are interested in is statistical significance: the chance that something that we have measured is wrong by more than some predetermined amount.

Suppose we want to make a claim about the mean of a distribution using a bunch of data points we have. We start from a contrarian position: that the distribution has some specific, hypothesized mean. This is known as the *null hypothesis*, or $H_0$. The *alternative hypothesis*, $H_a$, is that the distribution does not have that mean. Then we select a *significance level*, $\alpha$: how much of a risk we are willing to take on of rejecting $H_0$ when it is actually true. Suppose we pick $\alpha = 0.05$ (a 5% risk of a false rejection). We compute a *test statistic* from the data, and from it a *p-value*. The p-value tells us how unlikely (on a $[0, 1]$ probabilistic scale) a result at least as extreme as ours would be if $H_0$ were true.

Suppose we got a p-value of 0.01. Since $0.01 < 0.05$, the p-value is smaller than the significance level that we chose beforehand, so we reject the null hypothesis in favor of $H_a$: this result would be so rare under $H_0$ that we're confident enough that the hypothesized mean is wrong.

This whole procedure is known as a "hypothesis test".
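As a minimal sketch of this procedure, the snippet below runs a one-sample test on synthetic data. The sample, its parameters, and the helper name `decide` are illustrative assumptions, not part of the dataset used later in this notebook. It relies on `scipy.stats.ttest_1samp`, which returns the test statistic and a two-sided p-value.

```python
import numpy as np
import scipy.stats as stats

# Synthetic sample (an assumption for illustration): 200 draws from a
# normal distribution whose true mean is 5.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=200)

def decide(sample, h_0, alpha=0.05):
    """Return the p-value and the decision for H_0: mean == h_0."""
    result = stats.ttest_1samp(sample, popmean=h_0)
    decision = "reject H_0" if result.pvalue < alpha else "fail to reject H_0"
    return result.pvalue, decision

# A hypothesized mean far from the truth should be rejected...
p_far, decision_far = decide(sample, h_0=6.0)
# ...while the true mean will usually (but not always!) survive the test.
p_true, decision_true = decide(sample, h_0=5.0)

print(p_far, decision_far)
print(p_true, decision_true)
```

The second call is a reminder of what $\alpha$ means: even when the hypothesized mean is exactly right, roughly 5% of samples would still lead to a rejection.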

## Test statistics and the t-test

Hypothesis test problems are addressed by devising an appropriate [test statistic](https://en.wikipedia.org/wiki/Test_statistic) for the distribution in question: the number that we can actually use to estimate how likely we are to be wrong. The one that's used most commonly (and taught in intro to stats classes in college) is the t-test.

The t-statistic has the following formula. Let $\hat{\beta}$ be an estimator of some population parameter $\beta$. Let $\beta_0$ be the value of $\beta$ that we hypothesize in our null hypothesis $H_0$. Let $\hat{\sigma}(\hat{\beta})$ be an estimate of the standard error of $\hat{\beta}$ (for a sample mean over $n$ observations with sample variance $s^2$, $\hat{\sigma}(\hat{\beta}) = \sqrt{\frac{s^2}{n}}$). Then $t$ is defined as:

$$t = \frac{\hat{\beta} - \beta_0}{\hat{\sigma}(\hat{\beta})}$$

Given that we fulfill certain conditions, $t$ is a valid test statistic for $\hat{\beta}$: we can use it to compute a p-value, and hence to decide whether or not to reject our null hypothesis $H_0$.
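To make the arithmetic concrete, here is a tiny worked example for a sample mean. The five data points and the hypothesized value are made up for illustration (and far too few for the test to be trustworthy), and the sample variance is used as the estimate of the variance.

```python
import numpy as np

sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
beta_0 = 2.0  # hypothesized mean under H_0 (illustrative value)

beta_hat = sample.mean()                                # 3.0
std_error = np.sqrt(sample.var(ddof=1) / len(sample))   # sqrt(2.5 / 5)
t = (beta_hat - beta_0) / std_error

print(t)  # approximately 1.414, i.e. sqrt(2)
```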

This test is appropriate for estimates on independent identically distributed variables, whenever (approximately) $n > 30$.

The variables need to be "independent": no one variable can influence any of the others. An example of the opposite, dependent variables, would be a time series of stock prices. If the stock price for MSFT is 100\$ today, it is unlikely that it will be 200\$ tomorrow, and if it's 200\$ tomorrow, it's unlikely to be 50\$ the day after that. Independent variables don't do this. For example, the number you draw from the bag in a game of lotto.

The variables need to be "identically distributed" because the test requires that the distribution the numbers are being drawn from doesn't "drift" over time.

Variables which fulfill both of these conditions are known as i.i.d. variables. Many standard estimators assume that the data they are computed from is i.i.d.
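One rough way to see the difference between independent and dependent sequences is to check how strongly each value is correlated with the one that follows it. The sketch below compares simulated i.i.d. draws against a random walk built from those same draws (a crude stand-in for a price series). Both sequences and the helper `lag1_correlation` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

iid_draws = rng.normal(size=5000)    # each draw ignores all the others
random_walk = np.cumsum(iid_draws)   # each value builds on the previous one

def lag1_correlation(x):
    """Correlation between each value and the value that follows it."""
    return np.corrcoef(x[:-1], x[1:])[0, 1]

iid_corr = lag1_correlation(iid_draws)
walk_corr = lag1_correlation(random_walk)
print(iid_corr, walk_corr)
```

The i.i.d. draws should show a correlation near zero, while the random walk's should be close to one. A near-zero lag-1 correlation does not prove independence, but a large one is a clear warning sign.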

## Central Limit Theorem

The $n > 30$ condition is interesting. This condition stems from what is known as the [Central Limit Theorem](https://en.wikipedia.org/wiki/Central_limit_theorem): given enough draws, the [normalized](https://www.kaggle.com/residentmario/nyc-buildings-part-2-feature-scales-grid-search/) mean of a sequence of i.i.d. variables (with finite variance) converges in distribution to the standard normal distribution $N(\mu=0, \sigma^2=1)$. In practice this convergence is often fast enough that the approximation is treated as accurate after just 30 or so draws, though this is a rule of thumb, and heavily skewed distributions can need more.
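A quick simulation can make this plausible. The sketch below draws many samples of size 30 from a skewed distribution, normalizes each sample mean, and checks whether the results behave like a standard normal. The exponential distribution, the sample size, and the number of repetitions are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, repetitions = 30, 20_000

# Exponential(1) has mean 1 and variance 1, and is strongly skewed.
draws = rng.exponential(scale=1.0, size=(repetitions, n))

# Normalize each sample mean: subtract the true mean, divide by sigma / sqrt(n).
z = (draws.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))

z_mean = z.mean()
z_std = z.std()
# For an exact N(0, 1), about 95% of values fall within +/- 1.96.
within_196 = np.mean(np.abs(z) < 1.96)
print(z_mean, z_std, within_196)
```

The mean and standard deviation of `z` should land near 0 and 1, and the fraction inside $\pm 1.96$ near 0.95, even though the underlying draws look nothing like a bell curve.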

## More

The theory behind why the $t$-test works is very complex (especially the Central Limit Theorem), and something that I will explore in future notebooks. The way that this material is usually taught eschews or hand-waves past *why* it works, leaving it as what it is.

## Hypothesis testing firearm licensees

In the rest of this notebook I'll apply the t-test to the Firearm Licensees dataset.

In [None]:
import pandas as pd
import numpy as np
licensees = pd.read_csv("../input/federal-firearm-licensees.csv", index_col=0)[1:]
licensees.head(3)

Suppose we're interested in measuring the mean number of gun sale licensees by ZIP code. If we look at this data, here's what we find:

In [None]:
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
licensees['Premise Zip Code'].value_counts().plot.hist(bins=50)

(note that since I'm excluding ZIP codes which do not have *any* firearm licensees, this is actually a flawed metric, but it's good enough for the purposes of demonstration!)

<!--
This chart makes it pretty obvious that the number of gun licensees per county varies wildly between different counties. That is to say, by looking at this histogram, we can conclude that the number of gun shops varies geographically.

But suppose that we want to be more authoritative about this observation. We could pose this as a hypothesis. Our null hypothesis will be that "Gun shops in the United States are equivalently distributed", while our alternative hypothesis will be that "Gun shops in the United States are not equivalently distributed".
-->

We'll take the following mean:

In [None]:
licensees['Premise Zip Code'].value_counts().mean()

Here are our hypotheses:

$$H_0: \bar{n} = 2.75$$
$$H_a: \bar{n} \neq 2.75$$

Let's set our significance level to 0.05. That is, let's say that we're willing to accept a 5% risk of rejecting the null hypothesis when it is in fact true.

Now we will implement our t-test. Here's a hand implementation first, to see what it looks like. Note that to get the p-value we'll just throw the number at the `scipy` normal distribution built-in, because the normal distribution's tail probabilities are non-trivial to compute by hand. For large samples the normal distribution is a very close approximation to the $t$ distribution.

In [None]:
X = licensees['Premise Zip Code'].value_counts()

In [None]:
import numpy as np
import scipy.stats as stats

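def t_value(X, h_0):
    # Standard error of the mean, using the sample variance (ddof=1),
    # which is also what scipy's ttest_1samp uses.
    se = np.sqrt(np.var(X, ddof=1) / len(X))
    return (np.mean(X) - h_0) / se

def p_value(t):
    # Two-sided p-value from the normal approximation, so we multiply
    # the upper-tail probability by 2.
    return stats.norm.sf(abs(t)) * 2

t = t_value(X, 2.75)
p = p_value(t)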

In [None]:
t, p

The $t$ score tells us that our sample mean is 0.36 standard errors away from the hypothesized mean of 2.75. 0.36 standard errors is not a lot at all though! Our $p$ value tells us that, if $H_0$ were true, almost 72% of samples would produce a mean estimate at least as far away from 2.75 as the one that we got.

In other words, a result like ours is entirely unremarkable under the null hypothesis.

Since $0.72 > 0.05$, we fail to reject the null hypothesis $H_0$. Note that this is not the same as proving $H_0$: we conclude only that the data is consistent with $\bar{n} = 2.75$, that is, with the mean number of gun shops per US Zip Code being almost 3!

For reference, here is the usual way of performing this test using `scipy`:

In [None]:
import scipy.stats as stats

stats.ttest_1samp(a=X, popmean=2.75)
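To see how closely a hand-rolled normal approximation tracks `scipy`'s t-distribution-based answer, the sketch below runs both on synthetic data. The Poisson counts are a made-up stand-in for the per-ZIP-code counts, not the real dataset, and the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in for the per-ZIP-code counts (an assumption for illustration).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.75, size=5000)
h_0 = 2.75

# Hand-rolled statistic with the sample variance (ddof=1).
se = np.sqrt(np.var(counts, ddof=1) / len(counts))
t_hand = (np.mean(counts) - h_0) / se
p_normal = stats.norm.sf(abs(t_hand)) * 2

# scipy's version, which uses the t distribution with n - 1 degrees of freedom.
result = stats.ttest_1samp(a=counts, popmean=h_0)

print(t_hand, result.statistic)
print(p_normal, result.pvalue)
```

The two statistics should agree to floating-point precision, and with thousands of observations the two p-values should differ only in the later decimal places.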

## Conclusion

Hypothesis testing is used extensively in the literature because it is a relatively simple and powerful tool for *making decisions*. Hypothesis testing allows us to state what level of confidence we want to have in some observation about our data, then, in testing that observation, determine whether or not we are satisfied that it is correct.

Another way of making this decision is to make it into a chart. For example, we could have randomly recomputed the mean of an increasing number of values in the dataset, and used that to determine how confident we are in our result. So for example, we'll take the mean of a single sample from the dataset; then the mean of two samples from the dataset; then three, and so on. Here's how that would look:

In [None]:
r = (licensees['Premise Zip Code']
         .value_counts()
         .sample(len(licensees['Premise Zip Code'].unique()) - 1))
pd.Series(r.cumsum() / np.arange(1, len(r) + 1)).reset_index(drop=True).plot.line(
    figsize=(12, 4), linewidth=1
)

As you can see, the mean of our values stabilizes on the "true" value over time. No matter the amount of variance at the beginning of the distribution, by the end we have a very good idea that the real result is approximately 2.75.

Hypothesis testing is merely a way of quantifying **how sure we are about this approximation**. It's important to do because a computer can't "look" at a graph. We mortals can, but we still need it too sometimes, because the graph is oftentimes ambiguous, and sometimes we need to make a lot of decisions potentially very quickly without necessarily looking at all of the graphs.

In a future notebook we'll look at the intimately related concept of confidence intervals. After that, we'll look at an alternative way of generating this same information that's oftentimes more flexible: bootstrapping. Strap in!