# Linear regression from scratch

The **mean squared error (MSE) loss** in linear regression can be interpreted as a special case of **maximum likelihood estimation (MLE)**, obtained under the assumption that the error term $ \epsilon $ follows a **normal distribution**.

#### 1. Linear Regression Model
The linear regression model is typically written as:

$$
Y = X\beta + \epsilon
$$

Where:
- $ Y $ is the dependent variable (observed outcomes),
- $ X $ is the matrix of independent variables (predictors),
- $ \beta $ is the vector of coefficients (parameters to be estimated),
- $ \epsilon $ is the error term (assumed to be independent and identically distributed).

#### 2. Assumption of Normality for $ \epsilon $
To connect MSE to MLE, assume that:
- Each error $ \epsilon_i \sim \mathcal{N}(0, \sigma^2) $ is normally distributed with mean 0 and variance $ \sigma^2 $, independently across observations.

This implies that, for each observation $ i $, $ Y_i | X_i \sim \mathcal{N}(X_i\beta, \sigma^2) $, meaning the conditional distribution of $ Y_i $ given $ X_i $ is normal with mean $ X_i\beta $ and variance $ \sigma^2 $.
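The model and the normality assumption can be made concrete by simulating data from them. The following is a minimal sketch, assuming NumPy is available; the function name `simulate_linear_data`, the standard-normal predictors, and the parameter values are illustrative choices, not part of the derivation.

```python
import numpy as np

def simulate_linear_data(n, beta, sigma, seed=0):
    """Draw (X, Y) from Y = X @ beta + eps with i.i.d. normal errors."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    # Predictors: one row per observation (standard normal is an arbitrary choice).
    X = rng.normal(size=(n, beta.shape[0]))
    # Error term: eps_i ~ N(0, sigma^2), independent across observations.
    eps = rng.normal(loc=0.0, scale=sigma, size=n)
    Y = X @ beta + eps
    return X, Y

# Hypothetical parameter values, chosen only for illustration.
X, Y = simulate_linear_data(n=200, beta=[2.0, -1.0], sigma=0.5)
print(X.shape, Y.shape)
```

Conditional on a row `X[i]`, the value `Y[i]` is then a draw from a normal distribution centred at `X[i] @ beta` with standard deviation `sigma`.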

#### 3. Maximum Likelihood Estimation (MLE)
In MLE, we aim to find the parameter $ \beta $ that maximizes the likelihood of the observed data.

The likelihood function for the data $ \{Y_i, X_i\}_{i=1}^n $ is:

$$
L(\beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_i - X_i\beta)^2}{2\sigma^2}\right)
$$

Taking the natural logarithm to simplify (log-likelihood):

$$
\ell(\beta, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - X_i\beta)^2
$$
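As a sanity check on the algebra, the closed-form log-likelihood can be compared numerically with the sum of the logs of the individual normal densities. This is a small sketch assuming NumPy; the function names and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def log_likelihood(beta, sigma2, X, Y):
    """Closed-form Gaussian log-likelihood of the linear model."""
    n = Y.shape[0]
    resid = Y - X @ beta
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * n * np.log(sigma2)
            - np.sum(resid ** 2) / (2 * sigma2))

def log_likelihood_from_density(beta, sigma2, X, Y):
    """Same quantity, computed as the sum of log normal densities."""
    resid = Y - X @ beta
    dens = np.exp(-resid ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return np.sum(np.log(dens))

# Toy data (illustrative): an intercept column plus one predictor.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y = np.array([1.0, 2.0, 2.0])
beta = np.array([1.0, 0.5])
sigma2 = 1.0

a = log_likelihood(beta, sigma2, X, Y)
b = log_likelihood_from_density(beta, sigma2, X, Y)
print(np.isclose(a, b))
```

The two functions should agree up to floating-point rounding, since the first is just the logarithm of the product in the likelihood written out term by term.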

#### 4. Maximizing the Log-Likelihood
To maximize the log-likelihood with respect to $ \beta $:
- The terms $ -\frac{n}{2} \log(2\pi) $ and $ -\frac{n}{2} \log(\sigma^2) $ are constants with respect to $ \beta $, so they can be ignored.
- The only term that depends on $ \beta $ is $ -\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - X_i\beta)^2 $. Maximizing it is equivalent to minimizing its negative:

$$
\frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - X_i\beta)^2
$$

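Because $ \frac{1}{2\sigma^2} $ is a positive constant that does not depend on $ \beta $, minimizing this expression is equivalent to minimizing the sum of squared errors (SSE):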

$$
\text{SSE} = \sum_{i=1}^n (Y_i - X_i\beta)^2
$$

Dividing by $ n $, which is also a positive constant and so does not change the minimizer, gives the mean squared error (MSE):

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (Y_i - X_i\beta)^2
$$
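The steps of this section can be collected into a single chain of equalities. The notation $ \hat{\beta} $ for the estimate is introduced here only for compactness, and the chain holds for any fixed $ \sigma^2 > 0 $:

$$
\hat{\beta} = \arg\max_{\beta} \ell(\beta, \sigma^2) = \arg\min_{\beta} \frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - X_i\beta)^2 = \arg\min_{\beta} \text{SSE} = \arg\min_{\beta} \text{MSE}
$$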

Thus, minimizing the MSE corresponds to maximizing the likelihood under the assumption of normally distributed errors.
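The equivalence can also be checked numerically. The sketch below, assuming NumPy, evaluates the MSE and the Gaussian log-likelihood over a grid of candidate slopes for a one-predictor model without intercept and confirms that both criteria pick the same candidate. The data, the grid, and the fixed value of `sigma2` are made up for illustration, and `np.linalg.lstsq` is used only as an independent least-squares reference.

```python
import numpy as np

# Made-up data, roughly following y = 2 * x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1])
sigma2 = 1.0              # any fixed positive value leads to the same argmax
n = y.shape[0]

# Candidate slopes and the residuals for each candidate (shape: 401 x 5).
betas = np.linspace(0.0, 4.0, 401)
resid = y[None, :] - betas[:, None] * x[None, :]

sse = np.sum(resid ** 2, axis=1)
mse = sse / n
loglik = (-0.5 * n * np.log(2 * np.pi)
          - 0.5 * n * np.log(sigma2)
          - sse / (2 * sigma2))

best_by_mse = betas[np.argmin(mse)]        # minimizes the MSE on the grid
best_by_loglik = betas[np.argmax(loglik)]  # maximizes the log-likelihood on the grid

# Least-squares solution computed directly, for comparison with the grid.
beta_ols = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]

print(best_by_mse, best_by_loglik, beta_ols)
```

Both grid searches should return the same slope, and it should lie within half a grid step of the least-squares solution. Changing `sigma2` rescales and shifts the log-likelihood curve but does not move its maximizer.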