## Pooling
We have data on kidney cancer rates in each county of the US. This map shows the counties with the highest kidney cancer rates in blue and those with the lowest rates in red:

![](images/kidneycancerus.png)

#### At each county
How should we estimate the rate of kidney cancer in each county? Well, the MLE for a rate is the number of cases divided by the number of trials (here, the county's population), so let's go with that.

Unfortunately, though, all those red counties have literally zero cases of kidney cancer! We'd estimate that there's no chance of anyone there getting kidney cancer.
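To make the problem concrete, here is a minimal sketch of the county-by-county (no pooling) estimate. The county names and counts are made up for illustration and are not from the real data set.

```python
# Hypothetical (county, cases, population) records; the numbers are made up.
counties = [
    ("small_a", 0, 1_000),
    ("small_b", 1, 1_000),
    ("large_c", 500, 10_000_000),
]


def mle_rate(cases, population):
    """Maximum-likelihood estimate of a binomial rate: cases / trials."""
    return cases / population


# One independent estimate per county, using only that county's data.
no_pool = {name: mle_rate(cases, pop) for name, cases, pop in counties}

for name, rate in no_pool.items():
    print(f"{name}: {rate * 100_000:.1f} per 100,000")
```

With so few people in the small counties, a single case swings the estimate between zero and a rate far above the large county's.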

![](images/kidneycanceruspop.png)

#### Nationally
The other option is to dump all the observations into a pile and estimate the rate that way. That is, get the total number of cancer cases and the total population and divide them. 

In effect, this weights the high-population dots above more heavily than the low-population dots, because the high-population dots are based on larger samples and are thus less variable.
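A short sketch of why pooling amounts to population weighting, again with made-up counts: dividing total cases by total population gives the same number as averaging the county rates with weights proportional to population.

```python
# Hypothetical (cases, population) pairs; the numbers are made up.
counties = [(0, 1_000), (1, 1_000), (500, 10_000_000)]

total_cases = sum(cases for cases, _ in counties)
total_pop = sum(pop for _, pop in counties)

# Complete pooling: one rate for the whole country.
pooled_rate = total_cases / total_pop

# The same quantity as a population-weighted average of county rates.
weighted_avg = sum((pop / total_pop) * (cases / pop) for cases, pop in counties)

print(f"pooled:   {pooled_rate * 100_000:.3f} per 100,000")
print(f"weighted: {weighted_avg * 100_000:.3f} per 100,000")
```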

But now all we can say is that the national cancer rate is (making this number up) 5 out of 100,000. We don't get estimated cancer rates for particular counties, other than "probably close to the national average". And when we see counties with 200,000 people maintaining rates of 10 out of 100,000, we have to doubt that the cancer rate there really is the same as the national average (especially if the county is famous for its beta particle radiation).



## There has to be a better way
Estimating in each county is clearly wrong: counties don't actually have 0% chance of kidney cancer. Pooling the data up to the national level is clearly wrong: some counties are maintaining rates well above the average across large populations.

Instead of no pooling (county level) or complete pooling (national level), we'd like to do **partial pooling**. This feels like a regularization problem: one model overfits the county data, and the other is extremely biased and inflexible, allowing only one rate for all counties. We'd like to somehow say that high-cancer counties probably do have high cancer rates, but that our estimate should be pulled towards the national average, especially if the county is small and so based on a small sample.

## Hierarchical Models and Partial Pooling
Hierarchical models assume that the (unobserved) true cancer rate $\mu_i$ in each county is drawn from some overall distribution $p(\mu)$. In the no-pooling, county-level case each county's true cancer rate is completely separate from all other county rates [there's a $p_i(\mu)$ for each county and no interplay]. In the full-pooling, national analysis each county's true cancer rate is assumed to be equal: $p(\mu)$ always spits back a fixed value. By specifying a $p(\mu)$ that's less degenerate than those above we get a goldilocks medium where the data can inform the shape of $p(\mu)$ and the likely values of the $\mu_i$ each county got.

To write out an example:

$$
\alpha \sim \mathrm{Poisson}(5)\\
\beta \sim \mathrm{Poisson}(5)\\
\mu_i \sim \mathrm{Beta}(\alpha, \beta)\\
y_i \sim \mathrm{Binomial}(n_i, \mu_i)
$$

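$\alpha$ and $\beta$ are called hyper-parameters, because they specify the distribution of a parameter (in this case the $\mu_i$). The priors on them here are only for illustration: a Beta distribution needs strictly positive parameters, so a Poisson draw of zero would have to be excluded.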

It is important that $\alpha$ and $\beta$ be random and not fixed. If they were fixed, the distribution of $\mu_i$ would be static. It would either be able to produce $\mu_i$ that fit the data well, or it would not, but it couldn't change based on the data. We could learn the likely $\mu_i$ in each county, and these values would be regularized by whatever $p(\mu)$ we specified, but we wouldn't be partially pooling: each $\mu_i$ would be estimated independently of the other counties' data.
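The fixed-hyperparameter case can be sketched in a few lines. With a $Beta(\alpha, \beta)$ prior and a binomial likelihood, the posterior for $\mu_i$ is $Beta(\alpha + y_i, \beta + n_i - y_i)$, so its mean has a closed form. The values of `ALPHA` and `BETA` below are hypothetical (chosen so the prior mean is 5 per 100,000), and the counts are made up.

```python
# Hypothetical fixed hyperparameters: prior mean of 5 per 100,000.
ALPHA, BETA = 5.0, 99_995.0


def posterior_mean(cases, population, alpha=ALPHA, beta=BETA):
    """Mean of Beta(alpha + y, beta + n - y), the conjugate posterior for mu_i."""
    return (alpha + cases) / (alpha + beta + population)


def as_weighted_average(cases, population, alpha=ALPHA, beta=BETA):
    """Same quantity, written as a blend of the raw rate and the prior mean."""
    w = population / (alpha + beta + population)
    return w * (cases / population) + (1 - w) * (alpha / (alpha + beta))


# Hypothetical (cases, population) per county; the numbers are made up.
counties = {
    "small_a": (0, 1_000),
    "small_b": (1, 1_000),
    "large_c": (500, 10_000_000),
}

estimates = {name: posterior_mean(y, n) for name, (y, n) in counties.items()}

for name, rate in estimates.items():
    print(f"{name}: {rate * 100_000:.2f} per 100,000")
```

Small counties are pulled strongly towards the prior mean and no county gets an estimate of exactly zero. Each estimate still depends only on that county's own counts and the fixed prior, so changing one county's data never moves another county's estimate.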

It's not until $\alpha$ and $\beta$ are stochastic and learnt from data that we can update the distribution $p(\mu)$ and share information across counties.
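As a rough illustration of learning the hyperparameters from data, the sketch below swaps in a simpler technique than the full Bayesian model above: an empirical-Bayes point estimate. It picks the $\alpha$ and $\beta$ that maximize the beta-binomial marginal likelihood over a coarse grid, then plugs them into the conjugate posterior mean. The grid, the counts, and all function names are assumptions made for this example; a full treatment would put a prior on $\alpha$ and $\beta$ and sample their posterior.

```python
from math import lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal(alpha, beta, data):
    """Beta-binomial log marginal likelihood of all counties.

    The binomial coefficient is dropped because it does not depend
    on alpha or beta.
    """
    return sum(
        log_beta(alpha + y, beta + n - y) - log_beta(alpha, beta)
        for y, n in data
    )


# Hypothetical (cases, population) pairs; the numbers are made up.
data = [(0, 1_000), (1, 1_000), (3, 50_000), (30, 200_000), (500, 10_000_000)]

# Coarse grid over the prior mean m = alpha / (alpha + beta)
# and the concentration k = alpha + beta (both ranges are assumptions).
means = [i * 1e-5 for i in range(1, 31)]
concentrations = [10.0 ** e for e in range(2, 8)]

alpha_hat, beta_hat = max(
    ((m * k, (1 - m) * k) for m in means for k in concentrations),
    key=lambda ab: log_marginal(ab[0], ab[1], data),
)

prior_mean = alpha_hat / (alpha_hat + beta_hat)

# Partially pooled estimates: every county's data helped choose the prior.
shrunk = [(alpha_hat + y) / (alpha_hat + beta_hat + n) for y, n in data]

print(f"fitted prior mean: {prior_mean * 100_000:.2f} per 100,000")
for (y, n), rate in zip(data, shrunk):
    print(f"cases={y}, pop={n}: {rate * 100_000:.2f} per 100,000")
```

Because the fitted prior depends on every county's counts, each county's estimate now borrows strength from the others, which is the sharing of information that fixed hyperparameters cannot provide.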

## When to Pool
If we believed that each county's cancer rate is unique and unaffected by other counties, we'd be correct to analyze each county individually.

If we believed that each county had the same cancer rate, we'd get a much better estimate by pooling all the data together and applying that one estimate in each county.

As a rule, we want to pool as much as possible, since it means we estimate fewer parameters with more data, and only resist pooling when there is evidence that it would lump together samples that truly differ in the things being estimated. That said, with enough data to overcome the extra parameters, a partial-pooling (hierarchical) model can figure out that the rates are the same everywhere and build a posterior for $\mu$ that is very narrow and near the rate we'd find by total pooling.

As a final example, if we believed that all counties within a state had pretty much the same cancer rate but that different states have different rates, we'd want to pool the counties into an overall number of cancer cases and an overall population for each state, but keep each state's data separate.
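State-level pooling of this kind can be sketched as a simple aggregation. The state and county names and all counts below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (state, county, cases, population) rows; all values are made up.
records = [
    ("state_x", "county_1", 2, 50_000),
    ("state_x", "county_2", 0, 1_000),
    ("state_y", "county_3", 30, 200_000),
    ("state_y", "county_4", 1, 9_000),
]

# Pool cases and population within each state, keeping states separate.
totals = defaultdict(lambda: [0, 0])
for state, _county, cases, population in records:
    totals[state][0] += cases
    totals[state][1] += population

state_rate = {state: cases / pop for state, (cases, pop) in totals.items()}

# Every county inherits the pooled rate of its own state.
county_estimate = {county: state_rate[state] for state, county, _, _ in records}

for county, rate in county_estimate.items():
    print(f"{county}: {rate * 100_000:.2f} per 100,000")
```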