# Regression
Regression is the statistical analysis of relationships between a dependent (or "outcome") variable and one or more independent (or "predictor") variables. 

## Linear regression
Linear regression is suitable if we suspect that a dependent variable can be expressed as a weighted sum of independent variables, as shown below.

$$
Y = \alpha 1 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_n X_n + E
$$

where $E$ is the random variable representing the error. Note that it is called "linear regression" because $Y$ is linear in the coefficients $\alpha, \beta_i$, not because it is linear in $X_i$. So $ Y = \alpha + \beta_1 X_1 + \beta_2 X^2_1 + E $ is also considered a linear relationship: one can define a new random variable $ Z = X^2_1 $ and obtain the equation $ Y = \alpha + \beta_1 X_1 + \beta_2 Z + E $. Linear regression therefore covers representations of $Y$ not only as a line/plane/hyperplane, but also as a polynomial in the independent variables.
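The substitution $Z = X^2_1$ can be tried out numerically. The following is a minimal sketch, assuming synthetic data with made-up coefficients (1, 2 and -0.5) and using NumPy's `np.linalg.lstsq` as the least-squares solver; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): Y = 1 + 2*X - 0.5*X^2 + noise
x = rng.uniform(-3.0, 3.0, size=2000)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.1, size=x.size)

# Treat Z = X^2 as just another predictor; the model stays linear in the coefficients.
z = x**2
design = np.column_stack([np.ones_like(x), x, z])  # columns: 1, X, Z

# Ordinary least squares on the design matrix.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, beta1_hat, beta2_hat = coef
print(alpha_hat, beta1_hat, beta2_hat)
```

The fitted values should land close to the coefficients used to generate the data, even though $Y$ is a curve, not a line, when plotted against $X_1$.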

Typically the **least-squares** method is used to solve for the coefficients $\alpha, \beta_i$. Note that $1$ is written next to $\alpha$ to indicate that $\alpha$ scales a constant random variable that always takes the value $1$. This may look odd, but it is useful to think about when we compute the least-squares solution.

The first step in using the least-squares method is to define the inner product (we will use the norm it induces). The inner product should be defined so that minimizing $ \langle E,E \rangle $ makes sense. A probabilistic version of the squared error, i.e., $ \langle E,E \rangle = E[E^2] $, is a good choice (i.e., $ \langle X,Y \rangle = E[XY] $). Here $E[\cdot]$ denotes expectation, while the bare $E$ is the error variable. We will later see alternatives to this inner product.

The simplest linear equation for $Y$ would be $Y = \alpha + E$. It looks silly because it does not depend on any random variable $X_i$, but it is useful for understanding more complex relationships. If we use the least-squares method to solve for $\alpha$, we get

$$
\begin{align}
\langle 1,1 \rangle \alpha &= \langle Y,1 \rangle \\
\alpha &= E[Y]
\end{align}
$$

In other words, if one wants the smallest-error prediction of $Y$ without considering its relationship with any underlying independent variable, then the best bet is to simply use $E[Y]$. For instance, suppose someone asks us for the literacy rate of Madurai district. If we don't have a detailed relationship between literacy rate and independent factors such as average household income, number of schools, etc. for Madurai, the best bet is to just use the average literacy rate of Tamil Nadu (TN). Surely Madurai would have a higher literacy rate than the TN average, as Madurai district is the cultural capital of TN. But using the average TN literacy rate is still a good idea, considering that we would use the same value when answering for every district. That is, using the average literacy rate of TN would minimize the total (squared) error in our predictions across all districts.
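The claim that the mean is the best constant guess can be checked with a brute-force search. This is a minimal sketch, assuming a made-up sample standing in for $Y$ (the numbers are not real literacy data) and a hypothetical helper `mean_squared_error`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sample standing in for Y; the values are made up for illustration.
y = rng.normal(70.0, 8.0, size=500)

def mean_squared_error(sample, c):
    """Sample version of <Y - c, Y - c> = E[(Y - c)^2] for a constant guess c."""
    return np.mean((sample - c) ** 2)

# Try many constant guesses and keep the one with the smallest squared error.
candidates = np.linspace(y.min(), y.max(), 2001)
errors = np.array([mean_squared_error(y, c) for c in candidates])
best_c = candidates[np.argmin(errors)]

print(best_c, y.mean())
```

Up to the resolution of the grid, the best constant found by the search coincides with the sample mean.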

### Simple linear regression
Going one step beyond $ Y = \alpha + E $, we add just one independent variable and try to model $Y$ as $ Y = \alpha + \beta X + E $. In a way, this is like asking ourselves how much better we can explain $Y$, or how far we can reduce the error in $Y$, beyond just using $Y = E[Y] + E$. $ Y = \alpha + \beta X + E $ can be thought of as a line with $\alpha$ as the y-intercept and $\beta$ as the slope. Using least-squares we have

$$
\begin{bmatrix} 
\langle 1,1 \rangle & \langle X,1 \rangle \\ 
\langle 1,X \rangle & \langle X,X \rangle 
\end{bmatrix} 
\begin{bmatrix}
\alpha \\
\beta
\end{bmatrix}
=
\begin{bmatrix}
\langle Y,1 \rangle \\
\langle Y,X \rangle 
\end{bmatrix}
$$

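Substituting $\langle 1,1 \rangle = 1$, $\langle X,1 \rangle = \mu_X$, $\langle X,X \rangle = E[X^2]$, $\langle Y,1 \rangle = \mu_Y$ and $\langle Y,X \rangle = E[XY]$, and solving for $\alpha$ and $\beta$, we get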

$$
\alpha = \dfrac{\mu_Y E[X^2] - \mu_X E[XY]}{E[X^2] - \mu^2_X} \\
\beta = \dfrac{E[XY]-\mu_X\mu_Y}{E[X^2] - \mu^2_X}
$$
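The $2 \times 2$ system and its closed-form solution can be compared numerically. The following is a minimal sketch, assuming synthetic data (true intercept 3 and slope 1.5 are made up) and a hypothetical helper `inner` that estimates $\langle U,V \rangle = E[UV]$ by a sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (assumed for illustration): Y = 3 + 1.5*X + noise
x = rng.normal(2.0, 1.0, size=5000)
y = 3.0 + 1.5 * x + rng.normal(0.0, 0.5, size=x.size)

def inner(u, v):
    """Sample estimate of the inner product <U, V> = E[UV]."""
    return np.mean(u * v)

# Build and solve the 2x2 system of inner products.
one = np.ones_like(x)
gram = np.array([[inner(one, one), inner(x, one)],
                 [inner(one, x),   inner(x, x)]])
rhs = np.array([inner(y, one), inner(y, x)])
alpha_hat, beta_hat = np.linalg.solve(gram, rhs)

# Closed-form expressions, evaluated with the same sample moments.
mu_x, mu_y = x.mean(), y.mean()
ex2, exy = inner(x, x), inner(x, y)
alpha_closed = (mu_y * ex2 - mu_x * exy) / (ex2 - mu_x**2)
beta_closed = (exy - mu_x * mu_y) / (ex2 - mu_x**2)

print(alpha_hat, alpha_closed, beta_hat, beta_closed)
```

Both routes should agree to floating-point precision, and both should be close to the coefficients used to generate the data.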

After a bit of massaging (using $E[X^2] - \mu^2_X = \sigma^2_X$ and $E[XY] - \mu_X\mu_Y = Cov(X,Y)$), we get simpler expressions for $\alpha$ and $\beta$:

$$
\beta = \dfrac{\sigma^2_{XY}}{\sigma^2_X} = \rho_{XY} \dfrac{\sigma_Y}{\sigma_X} \\
\alpha = \mu_Y - \rho_{XY} \dfrac{\sigma_Y}{\sigma_X} \mu_X
$$

where $\sigma^2_{XY}$ is the covariance of $X$ and $Y$.
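As a sanity check on these expressions, we can compute $\alpha$ and $\beta$ from sample moments and compare them against an independent fit. This is a minimal sketch, assuming synthetic data with made-up parameters; `np.polyfit` with `deg=1` is used only as a reference least-squares line fit.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumed for illustration): Y = -4 + 0.8*X + noise
x = rng.normal(10.0, 2.0, size=4000)
y = -4.0 + 0.8 * x + rng.normal(0.0, 1.0, size=x.size)

# Population-style moments (ddof=0) so every quantity uses the same normalization.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
sigma_x, sigma_y = x.std(), y.std()
rho = cov_xy / (sigma_x * sigma_y)

beta_hat = rho * sigma_y / sigma_x          # same as cov_xy / sigma_x**2
alpha_hat = y.mean() - beta_hat * x.mean()

# Independent check: a degree-1 least-squares polynomial fit (highest power first).
slope, intercept = np.polyfit(x, y, deg=1)

print(beta_hat, slope, alpha_hat, intercept)
```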

The resulting least-squares solution can be rewritten as:

$$
\begin{equation}
(Y - \mu_Y)\ =\ \rho_{XY} \dfrac{\sigma_Y}{\sigma_X} (X - \mu_X) + E
\label{eq:SLR_soln} \tag{1}
\end{equation}
$$

In fact, had we instead tried to find the (equivalent) relationship between the zero-mean variants $Y' = Y-\mu_Y$ and $X' = X-\mu_X$, we would have solved for $Y' = \alpha' + \beta' X' + E $ and obtained $\alpha' = 0$ and $\beta' = \rho_{XY} \dfrac{\sigma_Y}{\sigma_X}$, resulting in the exact same solution as above! One could further rearrange the equation by dividing both sides by $\sigma_Y$ and get

$$
\begin{equation}
\left( \dfrac{Y-\mu_Y}{\sigma_Y} \right) = \rho_{XY} \left( \dfrac{X-\mu_X}{\sigma_X} \right) + E'
\label{eq:SLR_standardized} \tag{2}
\end{equation}
$$

where, $E' = E/\sigma_Y $ and $\rho_{XY}$ is the correlation coefficient:

$$
\rho_{XY} = \dfrac{Cov(X,Y)}{\sigma_X \sigma_Y} = \dfrac{\sigma^2_{XY}}{\sigma_X \sigma_Y}
$$

Note that, if $X$ and $Y$ were Gaussian random variables, equation $\eqref{eq:SLR_standardized}$ represents the relationship between the "standardized" Gaussian versions of $X$ and $Y$, where both the dependent variable and the independent variable have zero mean and unit standard deviation. We will come back to this later. But let us first examine the characteristics of the error r.v. $E$ in equation $\eqref{eq:SLR_soln}$.
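Before turning to the error term, the standardized form can be checked numerically: after standardizing both variables, the fitted slope should equal the correlation coefficient and the intercept should vanish. The following is a minimal sketch, assuming synthetic Gaussian data with made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data (assumed for illustration): Y = 20 + 0.6*X + noise
x = rng.normal(50.0, 5.0, size=4000)
y = 20.0 + 0.6 * x + rng.normal(0.0, 4.0, size=x.size)

# Standardize both variables: zero mean, unit standard deviation.
x_std = (x - x.mean()) / x.std()
y_std = (y - y.mean()) / y.std()

# Sample correlation coefficient of the original variables.
rho = np.corrcoef(x, y)[0, 1]

# Least-squares line through the standardized data.
slope, intercept = np.polyfit(x_std, y_std, deg=1)

print(slope, rho, intercept)
```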

First, it is straightforward to show that the error has zero mean:

$$
\begin{align}
\mu_E\ &=\ E[(Y-\mu_Y) - \rho_{XY} \dfrac{\sigma_Y}{\sigma_X} (X-\mu_X)] \\
      &=\ E[(Y-\mu_Y)] - \rho_{XY} \dfrac{\sigma_Y}{\sigma_X} E[(X-\mu_X)] \\
      &=\ 0
\end{align}
$$

The variance of the error can be derived as below:

$$
\begin{align}
\sigma^2_E\ &=\ E[E^2] - \mu^2_E \\
            &=\ E[E^2] - 0 \\
            &=\ E[((Y-\mu_Y) - \rho_{XY} \dfrac{\sigma_Y}{\sigma_X} (X-\mu_X) )^2] \\ 
            &=\ \sigma^2_Y + \rho^2_{XY} \dfrac{\sigma^2_Y}{\sigma^2_X} \sigma^2_X - 2 \rho_{XY} \dfrac{\sigma_Y}{\sigma_X} \sigma^2_{XY} \\
            &\ \text{which, since } \sigma^2_{XY} = \rho_{XY} \sigma_X \sigma_Y \text{, reduces to}\\
\sigma^2_E\ &= \sigma^2_Y(1-\rho^2_{XY})
\end{align}
$$

$E$ is often approximately Gaussian, for instance when it aggregates many small independent effects, as the central limit theorem suggests. The above equation tells us that if, say, $\rho_{XY} = 0.7$, then $1 - \rho^2_{XY} = 0.51$, so the variance of the error $E$ will be approximately half of that of the dependent variable $Y$. One could phrase this as "half of the variance of $Y$ is explained away by $X$". Also, combined with the fact that $\mu_E = 0$, and assuming $E$ is Gaussian, we can now say things like, "We are 68% confident that the prediction error is within $\pm 0.7\sigma_Y$", since $\sigma_E = \sigma_Y \sqrt{0.51} \approx 0.7\sigma_Y$.
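These properties of the error can be illustrated with a simulation. This is a minimal sketch, assuming synthetic data that is Gaussian by construction with a target correlation of 0.7; with non-Gaussian data the variance identity would still hold, but the 68% figure need not.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

# Build (X, Y) with correlation close to 0.7 (synthetic, Gaussian by construction).
rho_target = 0.7
x = rng.normal(0.0, 1.0, size=n)
y = rho_target * x + np.sqrt(1.0 - rho_target**2) * rng.normal(0.0, 1.0, size=n)

rho = np.corrcoef(x, y)[0, 1]
sigma_x, sigma_y = x.std(), y.std()

# Residual of the least-squares line: (Y - mu_Y) - rho * (sigma_Y / sigma_X) * (X - mu_X)
residual = (y - y.mean()) - rho * (sigma_y / sigma_x) * (x - x.mean())

# Ratio of error variance to the variance of Y; should match 1 - rho^2.
var_ratio = residual.var() / sigma_y**2

# Fraction of residuals within one standard deviation; about 0.68 if Gaussian.
within_one_sigma = np.mean(np.abs(residual) <= residual.std())

print(residual.mean(), var_ratio, 1.0 - rho**2, within_one_sigma)
```

The residual mean should be zero up to rounding, the variance ratio should match $1 - \rho^2_{XY}$ (about one half here), and roughly 68% of the residuals should fall within one $\sigma_E \approx 0.7\sigma_Y$ of zero.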
           
