# Maximum Likelihood




+ How do we get the loss function that we use for logistic regression? 
+ We rely on a statistical argument called maximum likelihood (ML). 
+ Sadly, the abbreviation ML is used for both maximum likelihood and machine learning, two important topics in data science. 


## Example, coin flipping 

+ Consider a sequence of coin flips, each with its own probability of a head:

$$
p_i = P(Y_i = 1 | x_i) ~~~ 1 - p_i = P(Y_i = 0 | x_i)
$$

+ For example, $Y_i$ could indicate whether person $i$ has hypertension, and $x_i$ could be their smoking history in pack years. 
+ We'd like to estimate the probability that someone has hypertension given their pack years.
+ We could write this more compactly as:

$$
P(Y_i = j | x_i) = p_i ^ j (1 - p_i)^{1-j} ~~~ j \in \{0, 1\}.
$$
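The compact form can be checked with a few lines of Python. This is a minimal sketch: the function name `bernoulli_pmf` and the value `p = 0.3` are hypothetical choices made for illustration.

```python
def bernoulli_pmf(j, p):
    """P(Y = j) for a coin with head probability p, where j is 0 or 1."""
    if j not in (0, 1):
        raise ValueError("j must be 0 or 1")
    # For j = 1 the second factor is 1; for j = 0 the first factor is 1.
    return p ** j * (1 - p) ** (1 - j)

p = 0.3  # hypothetical probability of hypertension for one person
prob_head = bernoulli_pmf(1, p)  # reduces to p
prob_tail = bernoulli_pmf(0, p)  # reduces to 1 - p
print(prob_head, prob_tail)
```

Setting $j = 1$ switches off the $(1 - p_i)$ factor and setting $j = 0$ switches off the $p_i$ factor, which is all the exponent notation is doing.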



## A coin flip

+ Consider a dataset of outcomes $Y_1, \ldots, Y_n$ and covariates $x_1, \ldots, x_n$. 
+ Each observed value $y_1, \ldots, y_n$ is either 0 or 1. 
+ Under our model, the probability of one observed data point is

$$
P(Y_i = y_i | x_i) = p_i ^ {y_i} (1 - p_i)^{1-y_i}
$$


## Multiple coin flips

+ What about all of the data jointly? 
+ If we assume the coin flips are independent, then the probabilities multiply. 
+ The **joint** probability of our data in this case is 

$$
P(Y_1 = y_1, \ldots, Y_n = y_n ~|~ x_1, \ldots, x_n)
= \prod_{i=1}^n p_i ^ {y_i} (1 - p_i)^{1-y_i}
$$
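As a rough sketch, the joint probability is just a product of the per-observation terms. The helper name `joint_probability` and the three made-up probabilities below are assumptions for illustration only.

```python
import math

def joint_probability(ys, ps):
    """Product over i of p_i^{y_i} (1 - p_i)^{1 - y_i}, assuming independent flips."""
    return math.prod(p ** y * (1 - p) ** (1 - y) for y, p in zip(ys, ps))

ys = [1, 0, 1]        # hypothetical observed outcomes
ps = [0.2, 0.5, 0.9]  # hypothetical head probabilities, one per flip
joint = joint_probability(ys, ps)  # 0.2 * (1 - 0.5) * 0.9
print(joint)
```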

## Tying the probabilities together

+ This model doesn't say much on its own, because nothing ties these probabilities together. 
+ In our example, all we could do is estimate the probability of hypertension separately for each group of people with exactly the same pack years. 
+ It seems logical that groups with nearly the same pack years should have similar probabilities, or even better, that the probabilities vary smoothly with pack years. 
+ Our logistic regression model does this.

$$
\mathrm{logit}(p_i) = \beta_0 + \beta_1 x_i
$$

+ Now we have a model that relates the probabilities to the $x_i$ in a smooth way. 
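To see the smoothness concretely, here is a small sketch that inverts the logit to get $p_i$ from $x_i$. The coefficient values and the list of pack years are hypothetical, chosen only to produce a readable curve.

```python
import math

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def expit(eta):
    """Inverse of the logit: maps any real number to a probability."""
    return 1 / (1 + math.exp(-eta))

beta0, beta1 = -2.0, 0.1  # hypothetical coefficients, not estimates from data
pack_years = [0, 10, 20, 30, 40]
probs = [expit(beta0 + beta1 * x) for x in pack_years]

for x, p in zip(pack_years, probs):
    print(f"pack years = {x:2d}  P(hypertension) = {p:.3f}")
```

With a positive $\beta_1$ the probabilities rise gradually with pack years, and nearby values of $x_i$ get nearby probabilities.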

## Putting it together

$$
P(Y_1 = y_1, \ldots, Y_n = y_n ~|~ x_1, \ldots, x_n)
= \prod_{i=1}^n p_i ^ {y_i} (1 - p_i)^{1-y_i}
= \prod_{i=1}^n \left(\frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{y_i}
\left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right)^{1-y_i}
$$

## Simplifying

$$
\exp\left\{\beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n y_i x_i\right\}\times \prod_{i=1}^n \left(\frac{1}{1 + e^{\beta_0 + \beta_1 x_i}}\right)
$$
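The intermediate step can be filled in one observation at a time. Writing $\eta_i = \beta_0 + \beta_1 x_i$ (a shorthand introduced here only for brevity), each factor of the product collapses because the two denominators share the same base:

$$
\left(\frac{e^{\eta_i}}{1 + e^{\eta_i}}\right)^{y_i}
\left(\frac{1}{1 + e^{\eta_i}}\right)^{1-y_i}
= \frac{e^{\eta_i y_i}}{\left(1 + e^{\eta_i}\right)^{y_i}\left(1 + e^{\eta_i}\right)^{1-y_i}}
= \frac{e^{\eta_i y_i}}{1 + e^{\eta_i}}
$$

Multiplying over $i$, the numerators combine into $\exp\left\{\sum_{i=1}^n y_i (\beta_0 + \beta_1 x_i)\right\} = \exp\left\{\beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n y_i x_i\right\}$, and the denominators give the remaining product.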


+ Notice, interestingly, that with the $x_i$ treated as known, this depends on the outcomes only through $\sum_{i=1}^n y_i$ and $\sum_{i=1}^n y_i x_i$. 
+ These are called the **sufficient statistics**, since we don't actually need to know the individual outcomes, just these quantities.
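A quick numerical check of this claim, as a sketch with made-up data: two different outcome vectors that share the same two sums give exactly the same likelihood at any parameter values. The function names and the numbers are hypothetical.

```python
import math

def likelihood_product(beta0, beta1, xs, ys):
    """Likelihood computed directly as the product over observations."""
    total = 1.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
        total *= p ** y * (1 - p) ** (1 - y)
    return total

def likelihood_simplified(beta0, beta1, xs, sum_y, sum_yx):
    """Same likelihood, using only the two sums of the outcomes."""
    numerator = math.exp(beta0 * sum_y + beta1 * sum_yx)
    denominator = math.prod(1 + math.exp(beta0 + beta1 * x) for x in xs)
    return numerator / denominator

xs = [1, 2, 3, 4]
ys_a = [1, 0, 0, 1]  # sum of y = 2, sum of y * x = 1 + 4 = 5
ys_b = [0, 1, 1, 0]  # sum of y = 2, sum of y * x = 2 + 3 = 5

beta0, beta1 = -0.4, 0.3  # arbitrary parameter values for the check
lik_a = likelihood_product(beta0, beta1, xs, ys_a)
lik_b = likelihood_product(beta0, beta1, xs, ys_b)
lik_s = likelihood_simplified(beta0, beta1, xs, sum_y=2, sum_yx=5)
print(lik_a, lik_b, lik_s)
```

The $x_i$ still enter through the product in the denominator, so the sums summarize the outcomes, not the covariates.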

## Maximum likelihood

+ The principle of ML. 

*Pick the values of $\beta_0$ and $\beta_1$ that make the data that we actually observed most probable.* 

+ When you take the joint probability, plug in the $y_i$ and $x_i$ that we actually observed, and view it as a function of $\beta_0$ and $\beta_1$, it's called a **likelihood**.
+ So a likelihood is the joint probability with the observed data plugged in, and maximum likelihood finds the values of the parameters that make the data that we observed most likely.
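One crude way to act on this principle is to evaluate the likelihood on a grid of candidate parameter values and keep the best pair. The sketch below does that on a small made-up dataset; the data, the grid limits, and the grid spacing are all assumptions for illustration, and a grid search is not how the estimates are computed in practice.

```python
import math

# Hypothetical data: pack years and hypertension indicators (made up).
xs = [0, 5, 10, 15, 20, 25, 30, 35]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

def likelihood(beta0, beta1):
    """Joint probability of the observed ys, viewed as a function of the parameters."""
    value = 1.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
        value *= p ** y * (1 - p) ** (1 - y)
    return value

# Coarse grid of candidate values (limits and spacing are arbitrary choices).
grid0 = [-4 + 0.1 * i for i in range(81)]      # beta0 from -4 to 4
grid1 = [-0.5 + 0.01 * j for j in range(101)]  # beta1 from -0.5 to 0.5

best_beta0, best_beta1 = max(
    ((b0, b1) for b0 in grid0 for b1 in grid1),
    key=lambda b: likelihood(b[0], b[1]),
)
print(best_beta0, best_beta1, likelihood(best_beta0, best_beta1))
```

The data stay fixed throughout; only the parameters change from one evaluation to the next, which is what distinguishes a likelihood from a probability statement about the data.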

## Our example

+ Generally, since sums are more convenient than products, we take the natural logarithm of the likelihood, which gives the log likelihood:

$$
\beta_0 \sum_{i=1}^n y_i + \beta_1 \sum_{i=1}^n y_i x_i - \sum_{i=1}^n \log\left(1 + e^{\beta_0 + \beta_1 x_i}\right)
$$

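+ This is the function that sklearn maximizes over $\beta_1$ and $\beta_0$ to obtain the estimates when no regularization penalty is applied (note that `LogisticRegression` adds an L2 penalty by default).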
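As a sketch of what maximizing this function involves, the code below applies Newton's method to the log likelihood for a small made-up dataset. Newton's method is a technique swapped in here for illustration and is not necessarily the solver sklearn runs; the data, function names, starting values, and iteration count are all assumptions.

```python
import math

# Hypothetical data: pack years and hypertension indicators (made up).
xs = [0, 5, 10, 15, 20, 25, 30, 35]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

def log_likelihood(beta0, beta1):
    """Log likelihood written in terms of the two sums of the outcomes."""
    sum_y = sum(ys)
    sum_yx = sum(y * x for x, y in zip(xs, ys))
    penalty = sum(math.log(1 + math.exp(beta0 + beta1 * x)) for x in xs)
    return beta0 * sum_y + beta1 * sum_yx - penalty

def score(beta0, beta1):
    """Gradient of the log likelihood with respect to (beta0, beta1)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
        g0 += y - p
        g1 += (y - p) * x
    return g0, g1

def newton_step(beta0, beta1):
    """One Newton update: solve the 2x2 system (negative Hessian) * delta = gradient."""
    g0, g1 = score(beta0, beta1)
    h00 = h01 = h11 = 0.0
    for x in xs:
        p = 1 / (1 + math.exp(-(beta0 + beta1 * x)))
        w = p * (1 - p)
        h00 += w
        h01 += w * x
        h11 += w * x * x
    det = h00 * h11 - h01 * h01
    delta0 = (h11 * g0 - h01 * g1) / det
    delta1 = (h00 * g1 - h01 * g0) / det
    return beta0 + delta0, beta1 + delta1

beta0, beta1 = 0.0, 0.0  # arbitrary starting values
for _ in range(25):
    beta0, beta1 = newton_step(beta0, beta1)

print(beta0, beta1, log_likelihood(beta0, beta1))
```

At the maximum the gradient is zero, so checking `score(beta0, beta1)` is a simple way to confirm the iterations have converged.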

## Second example, linear regression

+ Linear regression can also be cast as a likelihood problem. 
+ Consider an instance where we assume that the $Y_i$ are Gaussian with a mean equal to $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$. 
+ The probability that $Y_i$ lies between the points $A$ and $B$ is, up to the normalizing constant $1/\sqrt{2\pi\sigma^2}$, governed by the equation

$$
P(Y_i \in [A, B) ~|~ x_i) \propto \int_A^B \exp\left\{ -(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\} dy_i
$$

## Density functions

+ Letting $A=-\infty$ and taking the derivative with respect to $B$, we obtain the density function, loosely the probability per unit length on an infinitesimally small interval. Up to the normalizing constant $1/\sqrt{2\pi\sigma^2}$, it is:

$$
\exp\left\{ -(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\}
$$
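The link between the density and interval probabilities can be checked numerically. The sketch below includes the normalizing constant $1/\sqrt{2\pi\sigma^2}$, which the formulas here leave out, and compares a numerical integral of the density with the Gaussian distribution function. All parameter values, the interval endpoints, and the function names are hypothetical.

```python
import math

def normal_density(y, mean, sigma):
    """Gaussian density, including the normalizing constant."""
    return math.exp(-(y - mean) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_cdf(b, mean, sigma):
    """P(Y < b) for a Gaussian, via the error function."""
    return 0.5 * (1 + math.erf((b - mean) / (sigma * math.sqrt(2))))

def interval_probability(a, b, mean, sigma, steps=10_000):
    """Approximate the integral of the density over [a, b) with the midpoint rule."""
    width = (b - a) / steps
    return width * sum(normal_density(a + (k + 0.5) * width, mean, sigma) for k in range(steps))

beta0, beta1, sigma = 1.0, 0.5, 1.5  # hypothetical parameter values
x_i = 4.0
mean = beta0 + beta1 * x_i           # mean of Y_i under the model

A, B = 2.0, 5.0
by_integration = interval_probability(A, B, mean, sigma)
by_cdf = normal_cdf(B, mean, sigma) - normal_cdf(A, mean, sigma)
print(by_integration, by_cdf)
```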

## Joint likelihood

+ Using the density evaluated at the observed data and assuming independence, the joint likelihood is, up to a constant factor that does not involve $\beta_0$ or $\beta_1$:

$$
\prod_{i=1}^n \exp\left\{ -(y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\}
= \exp\left\{ -\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 \right\}
$$

+ Since it's more convenient to deal with logs, we take the logarithm and get that the joint log likelihood is, up to an additive constant,

$$
- \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 / 2\sigma^2 
$$

## Loss function

+ Since maximizing this is the same as minimizing its negative, and the constant factor $1/2\sigma^2$ is irrelevant for choosing $\beta_1$ and $\beta_0$, we get that maximum likelihood for these parameters minimizes

$$
\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2
$$

+ which is the same thing we minimized to obtain our least squares regression estimates.
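To make the equivalence concrete, here is a sketch that computes the usual closed-form least squares estimates on a small made-up dataset and then checks that moving away from them both raises the sum of squares and lowers the Gaussian log likelihood. The data, the function names, and the value of `sigma` are assumptions for illustration.

```python
import math

# Hypothetical data, made up for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

def sum_of_squares(b0, b1):
    """Least squares loss: sum of squared residuals."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

def gaussian_log_likelihood(b0, b1, sigma):
    """Full Gaussian log likelihood, including the normalizing constant."""
    constant = -0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return constant - sum_of_squares(b0, b1) / (2 * sigma ** 2)

# Closed-form least squares estimates for a single predictor.
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1_hat = s_xy / s_xx
b0_hat = y_bar - b1_hat * x_bar

sigma = 1.0  # hypothetical; its value does not change which (b0, b1) is best
print(b0_hat, b1_hat)
print(sum_of_squares(b0_hat, b1_hat), gaussian_log_likelihood(b0_hat, b1_hat, sigma))
```

For any fixed `sigma`, the log likelihood is a decreasing function of the sum of squares, so the pair that minimizes one maximizes the other.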