
**Credit**: Images lovingly stolen from http://freakonometrics.hypotheses.org/9593. Huge thanks to them for making the exact right set of graphs!
$\let\v\mathbf$

**Warning**: This approach to GLMs describes link functions as taking in the value of $\v{w}\cdot\v{x_i}$ and outputting the y-value where the mean of the distribution (i.e. the regression line) falls. Most sources on GLMs define link functions as the inverse of this: taking in the mean of the distribution and outputting the weighted sum of the predictors, which is just, like, "Wait, what? Who would ever use that?".

## Generalized Linear Models
GLMs allow us to write the model we want to write. For example, Y is predicted by an exponential curve and follows a binomial distribution at each X value, or Y is predicted by a polynomial and follows a Poisson distribution at each X value.

## Extending OLS Again
In the last notebook, we embellished the OLS put-a-line-through-points problem to specify that the observed data's deviation from the fitted line follows a normal distribution (with fixed variance) with the distribution's mean placed right on the line.

1) In this notebook we'll make the obvious extension of replacing the normal distribution with any probability distribution you'd care to name. 

2) In addition, we will allow for link functions, which transform from a straight line to an exponential or log or logistic curve without messing up the error distribution in the process.

Together, these two changes extend linear regression to generalized linear models.

### Changing the Error Distribution
To see how powerful playing with the error distribution is, consider a Bernoulli 'error' distribution. No matter what the inputs are, a Bernoulli only ever returns values of 0 and 1 (though the percentage of each does change as the mean of the distribution increases). Using a Bernoulli as our error distribution would let us faithfully model Y data that is only ever 0 or 1.

More simply, consider a lognormal error distribution. Lognormals are skewed and have long right-hand tails, allowing us to write a model where data far above the mean for any given X value are common, but data far below the mean are quite rare. As an example, this could be a reasonable model for lotto winnings: most outcomes fall near the cost of the ticket or slightly below, but large winnings are definitely possible.

#### Changing the Error Distribution in Math
For the model with normally distributed errors, we had
$$\cal{L} = p(\v{y} | \v{x}, \v{w}, \sigma) = \prod_i p(\v{y}_i | \v{x}_i, \v{w}, \sigma)=  \prod_i \frac{1}{\sigma\sqrt{2\pi}} e^{-(y_i - \v{w}\cdot\v{x_i})^2 / 2\sigma^2}$$
If we switch to a different distribution, e.g. Bernoulli, we simply yank out the Normal distribution and [naively] write in a Bernoulli parameterized to have mean 
$\v{w}\cdot\v{x_i}$, since that will put the mean directly on the regression line. Fitting via MLE/likelihood loss minimization then proceeds as normal.

$$\cal{L} = p(\v{y} | \v{x}, \v{w}) = \prod_i (p)^{y_i}(1-p)^{(1-y_i)} =\prod_i (\v{w}\cdot\v{x_i})^{y_i}(1-\v{w}\cdot\v{x_i})^{(1-y_i)}$$

Here we got lucky and the mean of the Bernoulli distribution was just $p$, so it was obvious how to set the parameters to put the distribution's mean on the regression line: replace $p$ by $\v{w}\cdot\v{x_i}$. The math to find a parameterization with a well-placed mean in the general case can be hairier, but we've got computer programs that handle it.

But wait! $\v{w}\cdot\v{x_i}$ might come out to be more than 1 or less than 0, and those are illegal values for $p$. We can either live with that and ignore those regions, or we could transform that term so that it stays between 0 and 1. A scaled arctangent or a logistic function will do that job, so the final model could be:

$$\cal{L} = \prod_i 
\left( \frac{e^{\v{w}\cdot\v{x_i}}}{1+e^{\v{w}\cdot\v{x_i}}} \right)^{y_i}\left(1-\frac{e^{\v{w}\cdot\v{x_i}}}{1+e^{\v{w}\cdot\v{x_i}}}\right)^{(1-y_i)}$$

[Also, to avoid writing out case statements we used a form of the Bernoulli PMF that only gives correct values if $y_i$ is exclusively 0 or 1. Be careful if you copy/paste this.]
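
To make the two Bernoulli likelihoods above concrete, here is a rough sketch in plain Python. The data, the weights, and the function names are all made up for illustration (nothing is fitted here). It evaluates the log-likelihood once with the naive mean $\v{w}\cdot\v{x_i}$ and once with the logistic-transformed mean.

```python
import math

def logistic(eta):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_log_lik(ys, ps):
    """Log-likelihood of 0/1 outcomes ys given per-point means ps."""
    total = 0.0
    for y, p in zip(ys, ps):
        if not 0.0 < p < 1.0:
            # Treat any p outside the open interval (0, 1) as unusable.
            return float("-inf")
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Hypothetical data: one predictor plus an intercept, y is always 0 or 1.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1]
w0, w1 = 0.2, 0.5  # illustrative weights, not fitted values

etas = [w0 + w1 * x for x in xs]  # the linear predictions w.x

naive = bernoulli_log_lik(ys, etas)                          # mean = w.x
linked = bernoulli_log_lik(ys, [logistic(e) for e in etas])  # mean = logistic(w.x)
print(naive, linked)
```

With these numbers the first linear prediction is negative, so the naive version has no usable likelihood, while the logistic version stays finite for any weights.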

**Overall**: instead of a normal distribution in the likelihood, choose whichever distribution you want, set up so that the mean is some appropriate function of the linear prediction.

#### Changing the Error Distribution in Pictures
A standard linear regression with normal errors looks like this:
![A bad line](images/glm_line_normal.png)
The yellow area is a 95% prediction interval, ignoring any variability from sampling. We can see that there are normal distributions with their means right on the regression line.

If we swap to a Poisson model of the errors, we get Poisson distributions with their means on the regression line.
![A bad line](images/glm_line_poisson.png)
In most distributions, the distribution's variance is tied to its mean (unlike the normal, where mean and variance can be set totally separately), so we see that the 95% prediction interval gets wider as the line (and thus the mean of the distribution) increases.

The mean of a Poisson distribution is supposed to always be positive, and we see the model beginning to break down as we head towards the left. As in the Bernoulli example above, we might want to apply a transformation so that we're always plugging in valid parameters to the Poisson distribution, no matter what the value of $\v{w}\cdot\v{x_i}$ is.
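
Both effects can be checked numerically. The sketch below is plain Python with a hypothetical intercept and slope and made-up helper names. It puts the Poisson mean directly on the line, computes the width of the central 95% prediction interval at a few X values, and shows what happens once the line dips below zero.

```python
import math

def poisson_pmf(k, mu):
    """P(Y = k) for a Poisson with mean mu (mu must be positive)."""
    if mu <= 0:
        raise ValueError("Poisson mean must be positive")
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def poisson_quantile(q, mu):
    """Smallest k whose cumulative probability reaches q."""
    k, cumulative = 0, poisson_pmf(0, mu)
    while cumulative < q:
        k += 1
        cumulative += poisson_pmf(k, mu)
    return k

def interval_width(mu):
    """Width of the central 95% prediction interval at mean mu."""
    return poisson_quantile(0.975, mu) - poisson_quantile(0.025, mu)

w0, w1 = 1.0, 2.0  # hypothetical intercept and slope
for x in [0.5, 2.0, 8.0, -1.0]:
    mu = w0 + w1 * x  # no transformation: the mean sits right on the line
    try:
        print(x, mu, interval_width(mu))
    except ValueError as err:
        print(x, mu, "invalid:", err)
```

The widths grow with the mean, and the last X value lands on a negative mean, which the Poisson cannot accept.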

### Link Functions
In the above discussion we often wanted to apply a transform to $\v{w}\cdot\v{x_i}$ so that the output would always be an acceptable value for the parameters of the distribution we're interested in. These transformations are called "link functions".

Before we say more about link functions interacting with distributions, let's just keep the normal distribution and apply a link function:

#### Link Functions In Math
We take the normal likelihood and edit it to use a transformation of $\v{w}\cdot\v{x_i}$ as the mean instead of $\v{w}\cdot\v{x_i}$ itself. Here we apply the logistic transform (the inverse of the logit), $\mathrm{logistic}(z) = \frac{e^{z}}{1+e^{z}}$.

$$\cal{L} = p(\v{y} | \v{x}, \v{w}, \sigma) = \prod_i \frac{1}{\sigma\sqrt{2\pi}} e^{-(y_i - \mathrm{logistic}(\v{w}\cdot\v{x_i}))^2 / 2\sigma^2}$$

The effect here is that the mean of the normal distribution will now always be between 0 and 1, no matter what the values of the Xs are. Pictorially, we've switched from a straight line as X increases with the normal distributions centered on the line to a logistic curve as X increases, with the normal distributions centered on the curve.
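
As a small sketch of that likelihood in plain Python: the readings, the weights, and $\sigma$ below are invented for illustration, not fitted. The mean at each X is the logistic function of the linear prediction, and the normal log-density is summed over the data.

```python
import math

def logistic(eta):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def normal_log_lik(ys, means, sigma):
    """Sum of log N(y_i | mean_i, sigma^2) over all data points."""
    return sum(
        -math.log(sigma * math.sqrt(2 * math.pi)) - (y - m) ** 2 / (2 * sigma ** 2)
        for y, m in zip(ys, means)
    )

# Hypothetical noisy readings of a quantity whose true value lives in (0, 1).
xs = [-4.0, -2.0, 0.0, 2.0, 4.0]
ys = [-0.03, 0.15, 0.52, 0.85, 1.04]  # individual readings may leave [0, 1]
w0, w1, sigma = 0.0, 1.0, 0.05        # illustrative values, not fitted

means = [logistic(w0 + w1 * x) for x in xs]
print(all(0.0 < m < 1.0 for m in means))  # the means never leave (0, 1)
print(normal_log_lik(ys, means, sigma))
```

Note that two of the made-up readings fall outside $[0, 1]$ and the likelihood is still perfectly well defined: only the mean is constrained, not the observations.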

Such a transformation might be important if we had a theoretical reason to believe that the mean Y value should never go below zero or above one, even if individual observations do. For instance, suppose we are measuring the amount of charge on a capacitor with some degree of Gaussian measurement error: the true amount of charge is always between 0 and 1 (if normalized), but we might measure a value that's above the maximum because our equipment is noisy.

Note that the link function above is different than applying a transform directly to the Y variable. Applying a transform directly to Y also transforms the error distribution. One is usually better off using GLMs to specify the curve and the error distribution separately.
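
A tiny numeric illustration of the difference, using two invented observations at the same X value: averaging Y targets one number, while averaging $\log Y$ and transforming back targets another, and errors that were symmetric around the mean become lopsided on the log scale.

```python
import math

# Two hypothetical observations at the same x, symmetric around a mean of 10.
ys = [4.0, 16.0]

# A model with a transformed mean and additive errors targets the mean of y.
mean_y = sum(ys) / len(ys)

# A model fit to log(y) targets the mean of log(y) instead.
mean_log_y = sum(math.log(y) for y in ys) / len(ys)
back_transformed = math.exp(mean_log_y)  # the geometric mean, roughly 8

print(mean_y, back_transformed)

# Errors: symmetric on the original scale, asymmetric on the log scale.
print([y - mean_y for y in ys])
print([math.log(y) - math.log(mean_y) for y in ys])
```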



#### Link Functions In Pictures
Remember, this is the usual OLS picture:
![A bad line](images/glm_line_normal.png)

And this is the picture when we've transformed so that the distribution means (and the overall line of fit) are exponential (i.e. the $\mu$ in the normal distribution is no longer $\v{w}\cdot\v{x_i}$ but is $e^{\v{w}\cdot\v{x_i}}$ instead):
![A bad line](images/glm_curve_normal.png)
We still have the nice homoskedastic normal distribution, but now the centers track an exponential curve instead of a straight line.

### All Together
The picture below shows a GLM using both an exponential link function (what most sources, using the inverse convention from the warning at the top, call a log link) and a Poisson error distribution. These choices play well together by ensuring the Poisson always has valid input parameters. But as we've seen, this is not the only way to have Poisson errors.
![A bad line](images/glm_curve_poisson.png)
Kind of pretty, isn't it?
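
For the curious, here is one way such a model could be fit by maximum likelihood, sketched in plain Python. It assumes a single predictor plus an intercept, Poisson errors, and a mean of $e^{w_0 + w_1 x}$, and it uses Newton's method on the log-likelihood. The function name and the counts are hypothetical, and a real analysis would more likely reach for a statistics library.

```python
import math

def fit_poisson_exp(xs, ys, steps=25):
    """Maximize the Poisson log-likelihood with mean exp(w0 + w1 * x)."""
    w0, w1 = math.log(sum(ys) / len(ys)), 0.0  # start at the flat model
    for _ in range(steps):
        mus = [math.exp(w0 + w1 * x) for x in xs]
        # Gradient of the log-likelihood: sum_i (y_i - mu_i) * (1, x_i)
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Negative Hessian: sum_i mu_i * (1, x_i)(1, x_i)^T
        h00 = sum(mus)
        h01 = sum(m * x for x, m in zip(xs, mus))
        h11 = sum(m * x * x for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step, with the 2x2 linear system solved by hand
        w0 += (h11 * g0 - h01 * g1) / det
        w1 += (h00 * g1 - h01 * g0) / det
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1, 2, 2, 4, 7, 11]  # hypothetical counts
w0, w1 = fit_poisson_exp(xs, ys)
fitted = [math.exp(w0 + w1 * x) for x in xs]
print(w0, w1)
print(fitted)
```

Because the mean is $e^{\v{w}\cdot\v{x_i}}$, every fitted mean is positive no matter where the optimizer wanders, which is exactly the "always valid input parameters" property described above.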