In the previous section we derived the cost function for linear regression using Maximum Likelihood Estimation (MLE), and in the process we made a few assumptions. These are also the assumptions of Ordinary Least Squares (OLS) linear regression. Let us list them once again here:

- **Linearity** : The relationship between the target variable and the independent variable(s) is linear in parameters
- **Normality** : The target variable, conditional on the independent variables, is normally distributed, which follows from the assumption that the error is normally distributed white noise
- **Data is a random sample from the population** : The data points are independent and identically distributed (IID)
- **Spherical errors** : The errors are homoscedastic and have no serial correlation. This means that the errors have the same finite variance and are not correlated with each other
- **No perfect multicollinearity** : There is no exact linear dependence among the independent variables, i.e. the design matrix has full column rank
- **Strict exogeneity** : The expectation of the errors conditioned on the design matrix is zero, which implies that the errors are uncorrelated with the independent variables

# Linearity

The linearity in linear regression refers to the model being linear in its parameters (coefficients), not necessarily in its variables.

What are these different types of linearity? Before we dive into them, we should first get familiar with some basic terminology.

If we have an equation of type

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n $$

then $\beta_i$ (i.e. $\beta_0, \beta_1, \beta_2, \dots, \beta_n$) are the parameters/coefficients and $x_i$ (i.e. $x_1, x_2, \dots, x_n$) are the variables.

## Linear in parameters

A model is **linear in parameters** if the parameters (the coefficients we are estimating) appear linearly, i.e. they are not multiplied or divided by each other, raised to a power, or placed inside any non-linear function.

\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 \\
y &= \beta_0 + \beta_1 x_1^2 + \beta_2 x_2^3 \\
y &= \beta_0 + \beta_1 \log{x_1} + \beta_2 \exp{x_2}
\end{align*}

All of the above models are linear in parameters, even though the variables appear with higher powers or inside non-linear functions.

In contrast, models like

\begin{align*}
y &= (\beta_0 + \beta_1 x_1)^2 \\
y &= \beta_0 + \exp{\beta_1 x_1}
\end{align*}

are non-linear in parameters, because the parameters are either raised to a higher power or sit inside a non-linear function.

## Linear in variables

A model is **linear in variables** if the predictors (the x’s) all appear to the first power — no transformations like squares, logs, exponentials, etc.

So a model like $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ is linear in variables, whereas $y = \beta_0 + \beta_1 \log{x_1} + \beta_2 \exp{x_2}$ is non-linear in variables.

## Relevance to OLS

In OLS linear regression the goal is to solve the model for the parameters. Each observation contributes one equation, in which the (possibly transformed) variables act as known coefficients, so together the observations form a system of equations that is linear in $\beta$. Therefore, we can use matrix algebra to solve for $\hat\beta$.

So OLS works as long as the model is **linear in parameters**, even if the variables are transformed (squares, logs, etc.). Therefore, we can transform our variables into a higher-dimensional feature space and still use linear regression, for example:


$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_{1}x_{2} + \beta_5 x_2^2
$$
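The quadratic model above is still linear in its six parameters, so it can be fitted with ordinary least squares once the transformed columns are placed in the design matrix. The following is a minimal sketch using NumPy; the helper names `quadratic_design_matrix` and `fit_ols`, the synthetic data, and the coefficient values in `true_beta` are assumptions made purely for illustration.

```python
import numpy as np

def quadratic_design_matrix(x1, x2):
    """Build columns 1, x1, x2, x1^2, x1*x2, x2^2 (same order as the model above)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])

def fit_ols(X, y):
    """Least-squares estimate of beta using np.linalg.lstsq."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta_hat

# Synthetic data generated from made-up coefficients (illustration only).
rng = np.random.default_rng(0)
x1 = rng.uniform(-2, 2, size=200)
x2 = rng.uniform(-2, 2, size=200)
true_beta = np.array([1.0, 2.0, -1.0, 0.5, 0.3, -0.7])

X = quadratic_design_matrix(x1, x2)
y = X @ true_beta + rng.normal(scale=0.1, size=200)

beta_hat = fit_ols(X, y)
print(np.round(beta_hat, 2))
```

The variables enter non-linearly (squares and a product), but the fit is an ordinary linear least-squares problem because each column of `X` is just a known number multiplying one parameter.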

In linear regression, we minimize the sum of squared errors:

$$
\underset{\beta}{\text{min}} \quad \mathcal{S}(\beta) = \sum^n_{i=1}(y_i - \hat{y}_i)^2 = \sum^n_{i=1}(y_i - \text{X}_i\beta)^2
$$

When the model is linear in parameters, this is a quadratic function of $\beta$ because:


\begin{align*}
\mathcal{S}(\beta) &= \sum^n_{i=1}(y_i - \text{X}_i\beta)^2 \\
\mathcal{S}(\beta) &= (y - \text{X}\beta)^T(y - \text{X}\beta) \\
\mathcal{S}(\beta) &= y^Ty - 2y^T\text{X}\beta + \beta^T\text{X}^T\text{X}\beta
\end{align*}

- The first term $y^Ty$ is a constant (it does not depend on $\beta$).
- The second term $-2y^T\text{X}\beta$ is linear in $\beta$.
- The third term $\beta^T\text{X}^T\text{X}\beta$ is quadratic in $\beta$ (because it’s $\beta$ transposed × matrix × $\beta$).

So $\mathcal{S}(\beta)$ is a quadratic function of $\beta$.
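The step from the matrix form to the expanded form can be spelled out as follows. The two cross terms are scalars, and a scalar equals its own transpose, so they can be merged into one:

\begin{align*}
(y - \text{X}\beta)^T(y - \text{X}\beta) &= y^Ty - y^T\text{X}\beta - \beta^T\text{X}^Ty + \beta^T\text{X}^T\text{X}\beta \\
\beta^T\text{X}^Ty &= (y^T\text{X}\beta)^T = y^T\text{X}\beta \\
\mathcal{S}(\beta) &= y^Ty - 2y^T\text{X}\beta + \beta^T\text{X}^T\text{X}\beta
\end{align*}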

### Why this matters

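The derivative $\frac{d\mathcal{S}}{d\beta} = -2\text{X}^Ty + 2\text{X}^T\text{X}\beta$ is linear in $\beta$, and setting it to zero gives a system of linear equations of the form $\text{A}\beta = \text{b}$ (with $\text{A} = \text{X}^T\text{X}$ and $\text{b} = \text{X}^Ty$), which can be solved to get a **unique $\hat\beta$** whenever $\text{X}^T\text{X}$ is invertible, i.e. when there is no perfect multicollinearity.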
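As a rough numerical illustration, the sketch below solves the linear system $\text{X}^T\text{X}\beta = \text{X}^Ty$ with NumPy and checks that the gradient of the sum of squared errors vanishes at the solution. The function names and the random data are assumptions made for this example only.

```python
import numpy as np

def gradient(X, y, beta):
    """Gradient of S(beta) = ||y - X beta||^2; note that it is linear in beta."""
    return -2 * X.T @ y + 2 * X.T @ X @ beta

def solve_normal_equations(X, y):
    """Solve A beta = b with A = X^T X and b = X^T y."""
    A = X.T @ X
    b = X.T @ y
    return np.linalg.solve(A, b)

# Random data for illustration: an intercept column plus two predictors.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = rng.normal(size=50)

beta_hat = solve_normal_equations(X, y)
grad_at_solution = gradient(X, y, beta_hat)
print(beta_hat, grad_at_solution)
```

`np.linalg.solve` raises an error if $\text{X}^T\text{X}$ is exactly singular, which corresponds to the case of perfect multicollinearity where no unique solution exists.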

If the model were non-linear in parameters, for example:

$$
y = \beta_0 + e^{\beta_1 x}
$$

The sum of squared residuals would be

$$
\mathcal{S}(\beta) = \sum^n_{i=1}(y_i - \beta_0 - e^{\beta_1 x_i})^2
$$

which is not a quadratic function of $\beta$. Its derivative is therefore non-linear in $\beta$, and setting it to zero generally cannot be solved algebraically.
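To see this concretely for the exponential model, take the sum over all $n$ observations $(x_i, y_i)$ and set the two partial derivatives to zero:

\begin{align*}
\frac{\partial\mathcal{S}}{\partial\beta_0} &= -2\sum^n_{i=1}(y_i - \beta_0 - e^{\beta_1 x_i}) = 0 \\
\frac{\partial\mathcal{S}}{\partial\beta_1} &= -2\sum^n_{i=1}x_i e^{\beta_1 x_i}(y_i - \beta_0 - e^{\beta_1 x_i}) = 0
\end{align*}

Because $\beta_1$ appears inside the exponentials, these equations cannot be rearranged into the form $\text{A}\beta = \text{b}$.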

Even if the model is polynomial in the parameters with degree higher than one, the sum of squared residuals has degree at least four in $\beta$, so its derivative is at least cubic. Setting it to zero may then not result in a unique solution for $\hat\beta$.

### Additional properties because of linearity in parameters

We found that, since the model is linear in parameters, we have a closed-form solution for $\hat\beta$:

$$ \hat\beta = (\text{X}^T\text{X})^{-1}\text{X}^Ty $$

For this we need to calculate $(\text{X}^T\text{X})^{-1}$. But what if our design matrix $\text{X}$ is too big to fit into memory, or calculating and inverting $\text{X}^T\text{X}$ is computationally expensive? This would prevent us from using the closed-form solution, and we would have to use an iterative method instead.

The function that we have to minimize with respect to $\beta$ is the sum of squared residuals:

$$ \mathcal{S}(\beta) = y^Ty - 2y^T\text{X}\beta + \beta^T\text{X}^T\text{X}\beta $$

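Since $\mathcal{S}(\beta)$ is quadratic in $\beta$ with Hessian $2\text{X}^T\text{X}$, which is positive semi-definite, it is also convex. We can therefore use an iterative method like gradient descent, Newton's method or a quasi-Newton method, and any minimum it converges to is a global minimum. Gradient descent converges for a suitably small step size but not in a fixed number of steps, whereas Newton's method reaches the minimum of an exactly quadratic function in a single step when $\text{X}^T\text{X}$ is invertible.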
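The sketch below compares plain gradient descent and a single Newton step against the closed-form solution on a small synthetic problem. The function names, the data, the number of iterations and the step-size rule (one over the largest eigenvalue of the Hessian) are assumptions chosen for illustration. On a problem this small the Hessian is cheap to form; for a design matrix that does not fit into memory one would need a different step-size choice and mini-batches.

```python
import numpy as np

def gradient(X, y, beta):
    """Gradient of S(beta) = ||y - X beta||^2."""
    return 2 * X.T @ (X @ beta - y)

def gradient_descent(X, y, n_steps=500):
    """Gradient descent with fixed step 1/L, where L is the largest
    eigenvalue of the Hessian 2 X^T X (assumed step-size rule)."""
    hessian = 2 * X.T @ X
    step = 1.0 / np.linalg.eigvalsh(hessian).max()
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        beta = beta - step * gradient(X, y, beta)
    return beta

def newton_step(X, y, beta):
    """One Newton update: beta - H^{-1} grad, with H = 2 X^T X."""
    hessian = 2 * X.T @ X
    return beta - np.linalg.solve(hessian, gradient(X, y, beta))

# Synthetic data with made-up coefficients (illustration only).
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

beta_closed = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form solution
beta_gd = gradient_descent(X, y)                  # many small steps
beta_newton = newton_step(X, y, np.zeros(3))      # a single step from zero

print(beta_closed, beta_gd, beta_newton)
```

All three estimates should agree up to numerical precision, which reflects the convexity of $\mathcal{S}(\beta)$: there is only one minimum for the iterative methods to find.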