## The Set-up 
You took a sample of 200 people and found their average height. But you know that anything calculated from a dataset comes with error bars: what other values might you have seen if you ran the process again? What's the +/- on your calculated value? What is the _sampling distribution_ of the mean height (that is, what distribution of mean heights might we see if we re-collected the data and re-did our analysis over and over)?

Being clever, you figure 200 is approximately infinity and decide that by the LLN and CLT your result is basically a draw from a normal distribution centered at the true average and with SD $\frac{\sigma}{\sqrt{200}}$, where $\sigma$ is the SD of the heights in the US population, which maybe isn't too different from the SD of heights in your sample?

Good for you.
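Here's a minimal sketch of that CLT-style calculation. The heights are simulated stand-ins (not real data), and `mean_with_se` is just an illustrative helper name; the sample SD plays the role of $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample: 200 heights in cm (simulated here, not real data).
heights = rng.normal(loc=170.0, scale=10.0, size=200)


def mean_with_se(sample):
    """Return the sample mean and its CLT-based standard error."""
    sample = np.asarray(sample, dtype=float)
    # ddof=1 gives the usual sample SD, our stand-in for sigma.
    se = sample.std(ddof=1) / np.sqrt(sample.size)
    return sample.mean(), se


mean, se = mean_with_se(heights)
# 1.96 standard errors on either side is the usual approximate 95% range.
print(f"mean = {mean:.1f} cm, +/- {1.96 * se:.1f} cm")
```
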

You took a sample of 200 people and found their median height. Um... er... yeah. The LLN and CLT are statements about sample means; they don't directly say anything about sample medians being like true medians. But you still want to know about the sampling distribution. You still want to know what other values you might have calculated if you'd drawn a different dataset.

## Bootstrapping
Well, heck, what we'd REALLY like to do is re-collect our dataset and repeat the analysis (find the median height) again and again until we feel like we know what range of medians is possible. But NSF never, ever funds "I'm going to collect the same data hundreds of times".


## Parametric Bootstrap
We do the next best thing. Suppose we've fitted a _generative_ model to our data. Because it's a generative model we can shake up some dice and use the model to make more data. [Quick refresher: regression models are non-generative, Gaussian discriminant analysis is generative]

So that's exactly what we'll do. We'll MAKE [simulate] a dataset via the fitted model and calculate the median, and make a dataset and calculate the median again and again until we have a sense of what medians are possible.

If our model assumptions are right and the parameters properly tuned, this process is perfect. Getting new data from out in the world would be exactly like getting data from the model: the whole point of the model was to accurately generate the data. In practice our model won't be perfect, but hopefully it's not too badly wrong and there aren't glaring differences between the data it generates and real collections.

This is the **parametric bootstrap**. The process is illustrated in the diagram below, taken from Shalizi:

![](images/parabootstrap.png)
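A rough sketch of the loop, under a loud assumption: that heights follow a normal distribution. That choice of model, the simulated `heights`, and the function name `parametric_bootstrap_medians` are all illustrative, not part of the original analysis.

```python
import numpy as np


def parametric_bootstrap_medians(sample, n_boot=2000, seed=0):
    """Simulate n_boot datasets from a fitted normal model; return their medians."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Fit the ASSUMED model: a normal distribution, via maximum likelihood.
    mu_hat = sample.mean()
    sigma_hat = sample.std()  # ddof=0 is the maximum-likelihood estimate
    # Each row is one simulated dataset of the same size as the original.
    simulated = rng.normal(mu_hat, sigma_hat, size=(n_boot, n))
    return np.median(simulated, axis=1)


# Hypothetical data: 200 simulated heights in cm.
heights = np.random.default_rng(1).normal(170.0, 10.0, size=200)
para_medians = parametric_bootstrap_medians(heights)
para_lo, para_hi = np.percentile(para_medians, [2.5, 97.5])
print(f"95% of simulated medians fall in [{para_lo:.1f}, {para_hi:.1f}] cm")
```

If the normal assumption is badly off, the spread printed here inherits that error; that's the specification problem the model-based approach can't escape.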

And heck, maybe the median height is even one of the parameters plugged in to the model. We can still get a sense of the distribution of median heights we might see if we re-collected and re-analyzed the data (if we're willing to believe that the model using that median height is correctly tuned).


There are 3 sources of error with respect to the sampling distribution that come from the bootstrap:

- Simulation error: We only make M different new datasets. To get a sense of the sampling distribution we need a lot of calculated medians and thus a lot of simulated datasets. Simulation error can be made arbitrarily small by making M big.
- Specification error: the model assumptions may just be wrong. There's no fix here but to fix the assumptions or hope the error doesn't matter.
- Statistical error: We estimated a model parameter to be .33. In the real data generating process that knob is set to .37. Going ahead with .33 is inaccurate. Often, though, the distribution of an estimator around the truth (estimate minus true value) is more nearly invariant to the parameter setting than the distribution of the estimator itself, so working with differences (subtracting the original estimate from each bootstrap estimate) is a good way to reduce this error.
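The subtraction idea in the last bullet can be sketched as an interval built from differences rather than raw bootstrap estimates. This is one common way to do it, not necessarily the construction the diagram's author has in mind; `pivotal_interval` is a hypothetical helper, and the binomial setup (200 yes/no observations with estimated rate .33) is made up to match the numbers in the bullet.

```python
import numpy as np


def pivotal_interval(estimate, boot_estimates, alpha=0.05):
    """Interval for the truth, built from (bootstrap estimate - original estimate)."""
    boot_estimates = np.asarray(boot_estimates, dtype=float)
    # How far do re-estimates land from the value we simulated from?
    diffs = boot_estimates - estimate
    q_lo, q_hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    # ASSUMPTION: (estimate - truth) is distributed like these differences,
    # so q_lo <= estimate - truth <= q_hi, which rearranges to the line below.
    return estimate - q_hi, estimate - q_lo


# Hypothetical example: we estimated a rate of .33 from 200 observations.
rng = np.random.default_rng(0)
p_hat, n_obs = 0.33, 200
boot_p = rng.binomial(n_obs, p_hat, size=2000) / n_obs  # parametric resimulation
piv_lo, piv_hi = pivotal_interval(p_hat, boot_p)
print(f"approx. 95% interval for the true rate: [{piv_lo:.3f}, {piv_hi:.3f}]")
```

Notice the flip: if bootstrap estimates tend to overshoot the value they were simulated from, the interval extends further below the original estimate, not above it.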

## Non-parametric Bootstrap
What if we don't like all the assumptions about our model being correct?

Well, we can go assumption-free. What do we know about the data generating process? Well, we know that it generated our data. The best (i.e. maximum-likelihood) assumption-free way of building a new data point is to randomly choose one of the observed data points. We know the process can generate those points. To get a second point, we again choose one of the set of observed points.

Thus we just build our fake dataset by repeatedly sampling _with replacement_ from our existing dataset.

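This is the **non-parametric bootstrap**. We want to sample with replacement: drawing the same number of points without replacement would just hand back the original dataset in a different order. With replacement, every fake dataset is different, and values that are common in our data are represented more often across the datasets we create.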

Here we are using the **empirical distribution**, since it comes without any model preconceptions. This process may be illustrated so:

![](images/nonparabootstrap.png)
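A minimal sketch of the resampling loop for the median. The data are simulated stand-ins and `bootstrap_medians` is an illustrative name; the only real ingredient is drawing indices uniformly with replacement.

```python
import numpy as np


def bootstrap_medians(sample, n_boot=2000, seed=0):
    """Medians of n_boot datasets resampled with replacement from `sample`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Each row holds n indices drawn uniformly WITH replacement,
    # i.e. n draws from the empirical distribution.
    idx = rng.integers(0, n, size=(n_boot, n))
    return np.median(sample[idx], axis=1)


# Hypothetical data: 200 simulated heights in cm.
observed = np.random.default_rng(2).normal(170.0, 10.0, size=200)
np_medians = bootstrap_medians(observed)
np_lo, np_hi = np.percentile(np_medians, [2.5, 97.5])
print(f"95% of resampled medians fall in [{np_lo:.1f}, {np_hi:.1f}] cm")
```

No distributional assumption appears anywhere in the function; swap `np.median` for any other statistic and the loop is unchanged.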

Building an "as-if-we-collected-new-data" dataset by just sampling with replacement from the existing data sounds insane. Like a man lifting himself into the sky by pulling on his own bootstraps (which is literally how the technique got its name). But it works: the best assumption-free model we have of the data generating process is just drawing at random from the data it generated.

Of course, the data we happen to have can be too small to be representative or may be biased. But with a dataset large and representative enough to capture the intricacies of the true distribution, it should be clear why sampling from the dataset is as good as collecting fresh data.

Of course, we'd like to know what happens if we have a dataset that isn't all the way representative. And it turns out that resampling can still give us a fairly accurate picture of the sampling distribution.
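One way to poke at that claim is a simulation where we cheat and know the truth. In the sketch below the "true" population is an assumed Normal(170, 10), chosen purely for illustration; we compare the spread of medians from many freshly collected datasets against the spread from resampling a single dataset.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_rep = 200, 2000


def draw_population(size):
    # ASSUMED "true" data generating process, for illustration only.
    return rng.normal(170.0, 10.0, size=size)


# What we'd love to do: collect n_rep brand-new datasets and take each median.
fresh_medians = np.median(draw_population((n_rep, n)), axis=1)

# What we can actually do: collect ONE dataset and resample it with replacement.
one_sample = draw_population(n)
idx = rng.integers(0, n, size=(n_rep, n))
boot_medians = np.median(one_sample[idx], axis=1)

fresh_sd = fresh_medians.std(ddof=1)
boot_sd = boot_medians.std(ddof=1)
print(f"SD of medians, fresh datasets:     {fresh_sd:.2f} cm")
print(f"SD of medians, bootstrap resamples: {boot_sd:.2f} cm")
```

The two spreads won't match exactly, and the bootstrap medians are centered on the one sample's median rather than the true one, but the widths should be in the same ballpark. Rerunning with a different seed or a smaller `n` gives a feel for how far off the resampled picture can get.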