## Importance Sampling
Rejection sampling gave us a way of drawing samples from a target distribution $f(x)$ by sampling from a different, easier proposal distribution $g(x)$ and applying a correction. Importance sampling likewise replaces sampling from $f(x)$ with sampling from $g(x)$, but 1) is more straightforward and 2) returns weighted samples instead of unweighted samples.

#### Weighted Samples
Weighted samples [or weighted datasets] are a generalization of standard, unweighted samples. Basically, each line of data has a weight attached to it, specifying how important or common that line is. In an unweighted sample common values are repeated on multiple lines; in a weighted sample common values are given higher weights.

[This doesn't mean that each value appears only once in a weighted sample; a value can appear multiple times, but with the weight split among its appearances].

Weighted samples are more general, but less common, and sometimes harder to work with. Default statistical machinery assumes an unweighted sample, but versions for weighted data can almost always be found.

Weights can be normalized to sum to 1 or not. Any statistic that divides by the total weight, such as a weighted mean, comes out the same either way.
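As a small illustration of the points above, here is a minimal sketch in plain Python. The data values and the helper name `weighted_mean` are made up for this example. It builds an unweighted sample, the equivalent weighted sample, a version where one value's weight is split across two lines, and a normalized version, and computes the mean of each.

```python
from collections import Counter

def weighted_mean(values, weights):
    """Mean of a weighted sample; the weights need not sum to 1."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Unweighted sample: common values are repeated on multiple lines.
unweighted = [1, 1, 1, 2, 2, 5]
plain_mean = sum(unweighted) / len(unweighted)

# Equivalent weighted sample: one line per distinct value, weight = count.
counts = Counter(unweighted)
values = list(counts.keys())
weights = list(counts.values())
mean_from_counts = weighted_mean(values, weights)

# A value may still appear twice, with its weight split between the lines.
split_values = [1, 1, 2, 5]
split_weights = [1.5, 1.5, 2, 1]
mean_from_split = weighted_mean(split_values, split_weights)

# Normalizing the weights to sum to 1 leaves the weighted mean unchanged.
normalized = [w / sum(weights) for w in weights]
mean_from_normalized = weighted_mean(values, normalized)

print(plain_mean, mean_from_counts, mean_from_split, mean_from_normalized)
```

All four means agree up to floating-point rounding, which is the sense in which these are the same sample.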

#### The Idea
We're going to more or less ignore the target distribution $f(x)$ and sample from $g(x)$ instead. Naturally, we will end up with too many points where $g(x)$ exceeds $f(x)$ and too few where $g(x)$ is below $f(x)$. To compensate, we give points that are too common a low weight (below 1.0) and points that are too rare a high weight (above 1.0).

It turns out the right weighting is just $weight(x) = f(x)/g(x)$. A value $x$ is drawn with density $g(x)$ and then counted with its weight, so after normalizing the weights to sum to 1 the probability of seeing value $x$ come up is

$$P(x) \propto weight(x)\,g(x) = \frac{f(x)}{g(x)}\,g(x) = f(x)$$

The proportionality constant is 1, since $f(x)$ already integrates to 1.
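To make the weighting concrete, here is a minimal sketch using only the Python standard library. The specific distributions are assumptions chosen for illustration and are not from the text: the target $f$ is a standard normal and the proposal $g$ is a normal with standard deviation 2. The helper names `normal_pdf` and `importance_sample` are likewise hypothetical.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def importance_sample(n, rng):
    """Draw n points from g = N(0, 2^2) and weight them by f(x)/g(x),
    where the (assumed) target is f = N(0, 1)."""
    xs = [rng.gauss(0.0, 2.0) for _ in range(n)]
    ws = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0) for x in xs]
    return xs, ws

rng = random.Random(0)  # fixed seed so the run is repeatable
xs, ws = importance_sample(50_000, rng)

total = sum(ws)
mean_weight = total / len(ws)  # should be close to 1
# Weighted estimates, dividing by the total weight (i.e. normalized weights).
est_mean = sum(w * x for x, w in zip(xs, ws)) / total      # target value: 0
est_var = sum(w * x * x for x, w in zip(xs, ws)) / total   # target value: 1
# Ignoring the weights describes g instead of f.
raw_var = sum(x * x for x in xs) / len(xs)                 # close to 4

print(mean_weight, est_mean, est_var, raw_var)
```

The unweighted second moment reflects the proposal (about 4), while the weighted one should land near the target's value of 1.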

#### The Catch
That's really all there is to it: by accepting a weighted dataset instead of an unweighted one, we're able to sample from whatever distribution we want and, via weighting, transform that sample into a sample from any other distribution. Of course, we don't end up with "one sample to rule them all". There are better and worse choices of $g(x)$ for specific $f(x)$ and even for specific things you want to do with the sample.

For instance, if $g(x)$ is zero somewhere $f(x)$ is nonzero, we'll never, ever see that $x$ value come up [and we'd try to attach infinite weight to it if it did]. So in that case sampling from $g(x)$ will never adequately approximate sampling from $f(x)$.

Additionally, $g(x)$ might be absurdly tiny somewhere $f(x)$ is large. With a finite sample size we may never see that value of $x$ and the absurdly large weight we would put on it. Or we may be unlucky, see too many of that $x$, and still give each of them a huge weight. [Remember that weights are based only on the ratio of $f(x)$ to $g(x)$, which matches how often values show up in each sample only as the samples become infinitely large]. With this pathology, any finite sample from $g(x)$ may be a poor approximation to $f(x)$ because of the luck of the draw about whether, and which, rare points showed up.
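One rough way to see this pathology is to ask how much of the total weight sits on the single heaviest point. The sketch below assumes a standard normal target and compares two hypothetical proposals, one wider than the target (standard deviation 2) and one narrower (standard deviation 0.5). With the narrow proposal the weights grow like $0.5\,e^{1.5x^2}$ in the tails, so a few rare draws can dominate. The helper names are illustrative only.

```python
import math
import random

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def largest_weight_share(proposal_sd, n, rng):
    """Fraction of the total weight carried by the single heaviest point,
    for target f = N(0, 1) and proposal g = N(0, proposal_sd^2)."""
    xs = [rng.gauss(0.0, proposal_sd) for _ in range(n)]
    ws = [normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, proposal_sd) for x in xs]
    return max(ws) / sum(ws)

rng = random.Random(1)  # fixed seed so the run is repeatable
share_wide = largest_weight_share(2.0, 5_000, rng)    # g covers f's tails
share_narrow = largest_weight_share(0.5, 5_000, rng)  # g is tiny in f's tails

print(share_wide, share_narrow)
```

With the wide proposal no weight can exceed 2, so no single point matters much. With the narrow proposal the heaviest point takes a far larger share, and which point that is changes from run to run.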

Note that we can't get out of the game: because both $f(x)$ and $g(x)$ have the same mass and integrate to 1, there are always some places where $g(x)$ is below $f(x)$ (unless they're exactly equal everywhere).
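This can be spelled out in one line, assuming both densities are properly normalized:

$$\int \big(f(x) - g(x)\big)\,dx = \int f(x)\,dx - \int g(x)\,dx = 1 - 1 = 0$$

Any region where $g(x)$ sits above $f(x)$ must therefore be balanced by a region where it sits below. In terms of weights, the average weight under $g$ is exactly 1,

$$E_g[weight(x)] = \int \frac{f(x)}{g(x)}\,g(x)\,dx = \int f(x)\,dx = 1$$

so any weights below 1 have to be offset by weights above 1. [The second line also assumes $g(x)$ is nonzero wherever $f(x)$ is.]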

#### Summary
The more similar $g(x)$ is to $f(x)$, the more uniform the weights are, and the less luck-prone our sampling will be. But even if we choose a poor alternative distribution $g(x)$, as long as it is nonzero wherever $f(x)$ is, weighting $x$ values drawn from $g$ by $f(x)/g(x)$ will, in the limit, give us an accurate sample from $f$.