# Deriving the Normal Equation for Linear Regression

## Objective:
We aim to derive the **normal equation** for linear regression by minimizing the **mean squared error (MSE)**. The goal is to find the parameter vector $ \boldsymbol{\beta} $ that minimizes the error between predicted and actual values.

---

## Linear Regression Model in Matrix Form:
The linear regression model, written for all $ n $ observations at once, is:
$$
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}
$$
Where:
- $ \mathbf{y} $: $ n \times 1 $ vector of target values.
- $ \mathbf{X} $: $ n \times (p + 1) $ matrix of features (including a column of ones for the intercept).
- $ \boldsymbol{\beta} $: $ (p + 1) \times 1 $ vector of parameters (coefficients to be estimated).
- $ \boldsymbol{\epsilon} $: $ n \times 1 $ vector of error terms (the part of $ \mathbf{y} $ not explained by $ \mathbf{X} \boldsymbol{\beta} $).
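
As a minimal sketch of how these shapes arise in practice, the snippet below builds a design matrix with NumPy. The data values and the helper name `add_intercept` are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def add_intercept(features):
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    features = np.asarray(features, dtype=float)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

# Hypothetical data: n = 4 observations, p = 2 features.
raw_features = np.array([[1.0, 2.0],
                         [2.0, 0.5],
                         [3.0, 1.5],
                         [4.0, 3.0]])
y = np.array([3.1, 3.9, 6.2, 8.8])

X = add_intercept(raw_features)
print(X.shape)  # n x (p + 1)
print(y.shape)  # stored as a 1-D array of length n rather than an n x 1 column
```

Storing $ \mathbf{y} $ as a one-dimensional array is a NumPy convenience; the algebra is the same as with an explicit $ n \times 1 $ column.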

---

## Mean Squared Error:
The mean squared error (MSE) is given by:
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
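where $ \hat{y}_i = \mathbf{x}_i^\top \boldsymbol{\beta} $ is the prediction for observation $ i $ and $ \mathbf{x}_i^\top $ is the $ i $-th row of $ \mathbf{X} $. Stacking the predictions as $ \hat{\mathbf{y}} = \mathbf{X} \boldsymbol{\beta} $ gives the matrix form: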
$$
\text{MSE} = \frac{1}{n} \|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|^2
$$

The term $\|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|^2$ is the squared Euclidean norm of the residual vector, which equals the inner product of that vector with itself:
$$
\|\mathbf{y} - \mathbf{X} \boldsymbol{\beta}\|^2 = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})
$$

Thus:
$$
\text{MSE} = \frac{1}{n} (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})
$$
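
The equivalence of the three forms can be checked numerically. The sketch below uses randomly generated data and an arbitrary candidate $ \boldsymbol{\beta} $ (all assumed for illustration) and compares the elementwise sum, the inner product, and the squared norm.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Hypothetical problem size: n observations, p features plus an intercept.
n, p = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)  # an arbitrary candidate, not the optimum

# Elementwise definition: average of squared prediction errors.
y_hat = X @ beta
mse_sum = sum((y[i] - y_hat[i]) ** 2 for i in range(n)) / n

# Inner-product form: (1/n) * (y - X beta)^T (y - X beta).
residual = y - X @ beta
mse_matrix = (residual @ residual) / n

# Norm form: (1/n) * ||y - X beta||^2.
mse_norm = np.linalg.norm(residual) ** 2 / n

print(np.isclose(mse_sum, mse_matrix), np.isclose(mse_matrix, mse_norm))
```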

---

## Objective Function:
To minimize the MSE, we drop the positive constant factor $ \frac{1}{n} $, which rescales the objective but does not change the $ \boldsymbol{\beta} $ that minimizes it, and define the objective function:
$$
J(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})
$$

---

## Expanding the Objective Function:
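Multiplying out the product in $ J(\boldsymbol{\beta}) $ gives four terms:
$$
J(\boldsymbol{\beta}) = \mathbf{y}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X} \boldsymbol{\beta} - \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}
$$
The two middle terms are scalars and transposes of each other, so $ \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} = \mathbf{y}^\top \mathbf{X} \boldsymbol{\beta} $ and they combine into a single term:
$$
J(\boldsymbol{\beta}) = \mathbf{y}^\top \mathbf{y} - 2\mathbf{y}^\top \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}
$$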

Where:
- $ \mathbf{y}^\top \mathbf{y} $: A constant term (independent of $ \boldsymbol{\beta} $).
- $ -2\mathbf{y}^\top \mathbf{X} \boldsymbol{\beta} $: Linear term in $ \boldsymbol{\beta} $.
- $ \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} $: Quadratic term in $ \boldsymbol{\beta} $.
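
A quick numerical sanity check of this expansion, sketched on random data (the sizes, seed, and variable names are assumptions for illustration), evaluates the three terms separately and compares their sum with the unexpanded objective.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)  # arbitrary point at which to evaluate J

# Unexpanded objective: (y - X beta)^T (y - X beta).
residual = y - X @ beta
j_direct = residual @ residual

# The three terms of the expanded objective.
constant_term = y @ y                   # independent of beta
linear_term = -2 * (y @ X @ beta)       # linear in beta
quadratic_term = beta @ X.T @ X @ beta  # quadratic in beta
j_expanded = constant_term + linear_term + quadratic_term

# The two cross terms are equal scalars, which is why they merge into one.
cross_a = y @ X @ beta
cross_b = beta @ X.T @ y

print(np.isclose(j_direct, j_expanded), np.isclose(cross_a, cross_b))
```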

---

## Minimizing the Objective Function:
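To find the optimal $ \boldsymbol{\beta} $, take the gradient of $ J(\boldsymbol{\beta}) $ with respect to $ \boldsymbol{\beta} $. Using the identities $ \frac{\partial}{\partial \boldsymbol{\beta}} (\mathbf{a}^\top \boldsymbol{\beta}) = \mathbf{a} $ and $ \frac{\partial}{\partial \boldsymbol{\beta}} (\boldsymbol{\beta}^\top \mathbf{A} \boldsymbol{\beta}) = 2 \mathbf{A} \boldsymbol{\beta} $ for symmetric $ \mathbf{A} $:
$$
\frac{\partial J(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = -2 \mathbf{X}^\top \mathbf{y} + 2 \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta}
$$

Set the gradient to zero:
$$
-2 \mathbf{X}^\top \mathbf{y} + 2 \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = 0
$$

Divide through by 2 and rearrange:
$$
\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y}
$$

Because the Hessian $ 2 \mathbf{X}^\top \mathbf{X} $ is positive semidefinite, $ J $ is convex, so any $ \boldsymbol{\beta} $ satisfying this system is a global minimizer.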
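
The gradient formula can be checked against central finite differences. This is a sketch on random data; the function names `objective`, `analytic_gradient`, and `numeric_gradient`, as well as the step size `h`, are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = rng.normal(size=n)

def objective(beta):
    """J(beta) = (y - X beta)^T (y - X beta)."""
    residual = y - X @ beta
    return residual @ residual

def analytic_gradient(beta):
    """Gradient from the derivation: -2 X^T y + 2 X^T X beta."""
    return -2 * X.T @ y + 2 * X.T @ X @ beta

def numeric_gradient(beta, h=1e-5):
    """Central-difference approximation, one coordinate at a time."""
    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = h
        grad[j] = (objective(beta + step) - objective(beta - step)) / (2 * h)
    return grad

beta = rng.normal(size=p + 1)  # arbitrary test point
print(np.allclose(analytic_gradient(beta), numeric_gradient(beta),
                  rtol=1e-5, atol=1e-6))
```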

---

## The Normal Equation:
When $ \mathbf{X}^\top \mathbf{X} $ is invertible, left-multiply both sides by its inverse to solve for $ \boldsymbol{\beta} $:
$$
\boldsymbol{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}
$$
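
A minimal NumPy sketch of this formula follows, on synthetic data with an assumed ground-truth coefficient vector `true_beta`. It solves the linear system $ \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^\top \mathbf{y} $ with `np.linalg.solve`, which is generally preferable numerically to forming the inverse explicitly, and also evaluates the literal formula for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])

# Hypothetical ground truth used only to generate synthetic targets.
true_beta = np.array([1.5, -2.0, 0.7])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Solve (X^T X) beta = X^T y as a linear system.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Literal transcription of the formula with an explicit inverse.
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# At the solution the gradient of J vanishes (up to rounding).
gradient_at_solution = -2 * X.T @ (y - X @ beta_hat)

print(beta_hat)
print(np.allclose(beta_hat, beta_inv))
print(np.allclose(gradient_at_solution, 0.0, atol=1e-8))
```

The estimate should land close to `true_beta`, with the gap driven by the noise level chosen above.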

---

## Key Points:
1. **Interpretation**:
   - $ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top $: The **Moore-Penrose pseudoinverse** of $ \mathbf{X} $, valid in this form when $ \mathbf{X} $ has full column rank.
   - $ \boldsymbol{\beta} $: The vector of coefficients that minimizes the sum of squared errors.

2. **Assumptions**:
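   - $ \mathbf{X}^\top \mathbf{X} $ must be invertible, which requires $ \mathbf{X} $ to have full column rank: no perfect multicollinearity among the columns and at least as many observations as parameters ($ n \geq p + 1 $).
   - The model assumes that $ \mathbf{y} $ depends linearly on the columns of $ \mathbf{X} $, up to the error term.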

3. **Application**:
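   The normal equation provides a closed-form solution for linear regression. Forming $ \mathbf{X}^\top \mathbf{X} $ and solving the system costs on the order of $ np^2 + p^3 $ operations, so iterative methods like gradient descent are often preferred when the number of features is large or the data does not fit in memory.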
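
To illustrate these points, the sketch below compares four routes to the same coefficients on synthetic data: the normal equation, NumPy's least-squares solver `np.linalg.lstsq`, the pseudoinverse via `np.linalg.pinv`, and a plain gradient descent loop on the MSE. The data, the iteration count, and the step-size rule (one over the largest eigenvalue of the MSE Hessian) are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
# Hypothetical coefficients used only to generate synthetic targets.
y = X @ np.array([0.5, 1.0, -1.0, 2.0]) + rng.normal(scale=0.2, size=n)

# 1. Normal equation, solved as a linear system.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# 2. Library least-squares solver (works directly on X).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# 3. Pseudoinverse applied to y.
beta_pinv = np.linalg.pinv(X) @ y

# 4. Gradient descent on MSE = (1/n) * ||y - X beta||^2.
hessian = (2.0 / n) * X.T @ X
step = 1.0 / np.linalg.eigvalsh(hessian)[-1]  # eigvalsh sorts ascending
beta_gd = np.zeros(p + 1)
for _ in range(500):
    grad = (2.0 / n) * X.T @ (X @ beta_gd - y)
    beta_gd -= step * grad

print(np.allclose(beta_normal, beta_lstsq))
print(np.allclose(beta_normal, beta_pinv))
print(np.allclose(beta_normal, beta_gd, atol=1e-6))
```

On a small, well-conditioned problem like this one all four agree closely. The iterative route needs only matrix-vector products with $ \mathbf{X} $, which is what makes it attractive at larger scale.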

---
