# Normality

The normality assumption in linear regression is the assumption that the target variable, given the features, is normally distributed around the prediction of a model of the form:

$$ y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n $$

As seen in [section 3](3_Linear_regression_MLE.ipynb), applying maximum likelihood estimation (MLE) for $\hat\beta$ to this model yields the squared residuals, averaged over the samples, as the cost function to be minimized:

$$ \mathcal{S}(\beta) = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2 = \text{MSE} $$
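As a quick numerical check of the link between the Gaussian likelihood and this cost, here is a minimal sketch. It assumes a fixed noise standard deviation `sigma` and uses made-up values for `y` and `y_hat`; the helper names `mse` and `gaussian_nll` are illustrative and not taken from section 3.

```python
import numpy as np

def mse(y, y_hat):
    """Mean of squared residuals, i.e. the cost S(beta)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

def gaussian_nll(y, y_hat, sigma=1.0):
    """Negative log-likelihood of y under N(y_hat, sigma^2), sigma assumed fixed."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = y.size
    const = 0.5 * n * np.log(2 * np.pi * sigma ** 2)
    return const + np.sum((y - y_hat) ** 2) / (2 * sigma ** 2)

# Made-up targets and predictions, for illustration only.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.5, 2.0, 2.0])
sigma = 1.0

n = y.size
const = 0.5 * n * np.log(2 * np.pi * sigma ** 2)
# The NLL is a constant plus a positive multiple of the MSE,
# so minimizing one minimizes the other.
nll_from_mse = const + n * mse(y, y_hat) / (2 * sigma ** 2)
print(gaussian_nll(y, y_hat, sigma), nll_from_mse)
```

Because the two quantities differ only by a constant and a positive scale factor, they share the same minimizing $\beta$.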

But what if our target variable is not normally distributed? The benefit of the MLE view is that we can **change the assumption on the target variable's distribution** and continue with MLE to get an estimate for $\beta$.

## Breaking the assumption

Let's say that the target variable in our data, instead of being continuous, is a discrete variable that takes only two values, 0 or 1 (i.e. $y \in \{0,1\}$). This is very common in classification problems, where the target variable is a class label indicating which class the data point belongs to. In this case, the target variable has a Bernoulli distribution (a binomial distribution with a single trial):

$$ y \sim \text{Bernoulli}(p) $$

which means:

\begin{align*}
    y =
    \left\{
    \begin{array}{ll}
      1,&  \text{with probability } p \\
      0,&  \text{with probability } 1-p
    \end{array}
    \right.
\end{align*}
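To make the Bernoulli assumption concrete, here is a small sketch that draws samples with NumPy. The value `p_true = 0.3`, the seed, and the sample size are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
p_true = 0.3                    # arbitrary illustrative probability of y = 1

# A Bernoulli(p) draw is a binomial draw with a single trial.
samples = rng.binomial(n=1, p=p_true, size=10_000)

# Every sample is 0 or 1, and the sample mean estimates p.
print(np.unique(samples), samples.mean())
```

The sample mean is the maximum likelihood estimate of $p$ when no features are involved.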

Let us create a linear model for this by estimating the probability that the target variable takes the value 1. This can be defined as

$$
p = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n
$$

But there is an issue: the LHS is a probability, which has range $[0, 1]$, whereas the RHS has range $(-\infty, +\infty)$. This is a mismatch.
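The mismatch can be seen on a toy example. The sketch below fits an ordinary least squares line directly to 0/1 labels using `np.linalg.lstsq`; the data are made up for illustration, with a single feature `x`.

```python
import numpy as np

# Made-up data: one feature, label flips from 0 to 1 at x = 5.
x = np.arange(10, dtype=float)
y = (x >= 5).astype(float)

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# "Probabilities" predicted by the linear model.
p_hat = X @ beta
print(p_hat.min(), p_hat.max())
```

On this data the fitted line dips below 0 at the smallest `x` and rises above 1 at the largest, so its outputs cannot be read as probabilities.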

What if, instead of modeling the probability, we model the odds, i.e. $\frac{p}{1-p}$? The odds are unbounded on the upper end but still have a lower bound of zero. The odds equal 1 in the middle, indicating a probability of 0.5 for both occurrence and non-occurrence. A small range of odds, from 0 to 1, corresponds to a higher probability of failure than of success. Then there is an infinite range of odds, from 1 to infinity, which corresponds to a higher probability of success than of failure.

Because of these unbalanced ranges, and to center the odds around 0, we take a logarithmic transformation of the odds. The resulting log-odds are symmetric around 0 and range from $-\infty$ to $+\infty$.
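A short numerical illustration of this symmetry, using a handful of arbitrarily chosen probabilities:

```python
import numpy as np

# Arbitrary probabilities, chosen in complementary pairs around 0.5.
p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])

odds = p / (1 - p)       # lies in (0, inf), equals 1 at p = 0.5
log_odds = np.log(odds)  # lies in (-inf, inf), equals 0 at p = 0.5

for pi, o, lo in zip(p, odds, log_odds):
    print(f"p={pi:.2f}  odds={o:.3f}  log-odds={lo:+.3f}")
```

Complementary probabilities such as 0.1 and 0.9 give odds that are reciprocals of each other, so their logarithms have equal magnitude and opposite sign.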

So, now we have a model:

$$
\log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n = x\beta
$$

From this we can derive a model for the probability that $y=1$:

\begin{align*}
\log\Big(\frac{p}{1-p}\Big) &= x\beta & \\
\frac{p}{1-p} &= e^{x\beta} & \text{exponentiating both sides} \\
p &= (1-p)e^{x\beta} & \\
p &= e^{x\beta} - pe^{x\beta} & \\
p + pe^{x\beta} &= e^{x\beta} & \\
p(1 + e^{x\beta}) &= e^{x\beta} & \\
p &= \frac{e^{x\beta}}{(1 + e^{x\beta})} & \\
p &= \frac{1}{(\frac{1}{e^{x\beta}} + \frac{e^{x\beta}}{e^{x\beta}})} & \text{dividing numerator and denominator by } e^{x\beta} \\
p &= \frac{1}{(\frac{1}{e^{x\beta}} + 1)} & \\
p &= \frac{1}{(e^{-x\beta} + 1)} & \\
p &= \frac{1}{(1 + e^{-x\beta})} & \\
\end{align*}

That is, the probability that $y=1$ is the logistic (sigmoid) function of $x\beta$:
$$
P(y=1) = \frac{1}{(1 + e^{-x\beta})}
$$
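The result can be sketched in a few lines of NumPy. The coefficient vector `beta` and the rows of `X` below are hypothetical values chosen for illustration, and the helper names `sigmoid` and `logit` are our own.

```python
import numpy as np

def sigmoid(z):
    """P(y=1) = 1 / (1 + exp(-z)), where z = x @ beta."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds log(p / (1 - p)), the inverse of the sigmoid."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

# Hypothetical coefficients: intercept beta_0 = -1, slope beta_1 = 2.
beta = np.array([-1.0, 2.0])
# Hypothetical inputs; the first column of ones multiplies the intercept.
X = np.array([[1.0, -3.0],
              [1.0,  0.5],
              [1.0,  4.0]])

z = X @ beta    # linear predictor, unbounded
p = sigmoid(z)  # mapped into the open interval (0, 1)
print(z, p)
```

Applying `logit` to `p` recovers the linear predictor `z`, which mirrors the derivation above run in reverse. For very large negative `z` this naive form can trigger overflow warnings in `np.exp`; `scipy.special.expit` is a numerically robust alternative.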