Assuming a standard linear regression model with independent normal errors of constant variance, the ordinary least squares (OLS) estimator of $\beta$ is exactly the maximum likelihood estimator (MLE), because maximizing the Gaussian likelihood over $\beta$ is equivalent to minimizing the sum of squared residuals.

## Model and likelihood

Consider the linear model

$$
Y = X\beta + \varepsilon,
$$

where $Y \in \mathbb{R}^n$, $X$ is an $n \times p$ matrix of predictors, $\beta \in \mathbb{R}^p$ is the parameter vector, and

$$
\varepsilon \sim N(0, \sigma^2 I_n),
$$

with $\sigma^2 > 0$ and $I_n$ the $n \times n$ identity matrix.

Conditional on $X$, the distribution of $Y$ is multivariate normal:

$$
Y \mid X \sim N(X\beta, \sigma^2 I_n).
$$

The joint density (likelihood) of $Y$ given $\beta, \sigma^2$ is

$$
L(\beta, \sigma^2 \mid y)
= \frac{1}{(2\pi\sigma^2)^{n/2}}
\exp\left(
-\frac{1}{2\sigma^2}(y - X\beta)^{\top}(y - X\beta)
\right).
$$

Taking logs gives the log-likelihood:

$$
\ell(\beta, \sigma^2 \mid y)
= -\frac{n}{2}\log(2\pi)
- \frac{n}{2}\log(\sigma^2)
- \frac{1}{2\sigma^2}(y - X\beta)^{\top}(y - X\beta).
$$
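As a minimal sketch, the log-likelihood above can be evaluated directly. The function name `gaussian_log_likelihood` and the toy data are assumptions made for illustration, and the code uses only the standard library with plain lists rather than a linear algebra package.

```python
import math


def gaussian_log_likelihood(y, X, beta, sigma2):
    """Log-likelihood of y under Y | X ~ N(X beta, sigma2 * I_n).

    y: list of n floats; X: list of n rows, each a list of p floats;
    beta: list of p floats; sigma2: positive float.
    """
    n = len(y)
    # Residuals y_i - x_i^T beta.
    residuals = [
        y_i - sum(x_ij * b_j for x_ij, b_j in zip(row, beta))
        for y_i, row in zip(y, X)
    ]
    # Quadratic form (y - X beta)^T (y - X beta).
    rss = sum(r * r for r in residuals)
    return (
        -0.5 * n * math.log(2 * math.pi)
        - 0.5 * n * math.log(sigma2)
        - rss / (2 * sigma2)
    )


# Made-up toy data: an intercept column plus one predictor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
print(gaussian_log_likelihood(y, X, beta=[1.0, 2.0], sigma2=0.5))
```

Because the errors are independent, the same value is obtained by summing $n$ univariate normal log-densities, which is a convenient sanity check.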

## Equivalence to least squares

For a fixed $\sigma^2$, maximizing $\ell(\beta, \sigma^2 \mid y)$ with respect to $\beta$ is the same as minimizing the quadratic term

$$
Q(\beta) = (y - X\beta)^{\top}(y - X\beta),
$$

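because the other terms in $\ell$ do not depend on $\beta$ and $-\frac{1}{2\sigma^2}$ is a negative constant. Since $Q(\beta)$ does not involve $\sigma^2$, the maximizing $\beta$ is the same for every $\sigma^2 > 0$, whether $\sigma^2$ is known or must itself be estimated.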

But minimizing $Q(\beta)$ is precisely the ordinary least squares criterion:

$$
\hat{\beta}_{\text{OLS}} 
= \arg\min_{\beta} (y - X\beta)^{\top}(y - X\beta).
$$

To find the minimizer, expand $Q(\beta)$ and differentiate with respect to $\beta$:

$$
Q(\beta) = (y - X\beta)^{\top}(y - X\beta)
= y^{\top}y - 2\beta^{\top}X^{\top}y + \beta^{\top}X^{\top}X\beta,
$$

$$
\frac{\partial Q(\beta)}{\partial \beta}
= -2X^{\top}y + 2X^{\top}X\beta.
$$
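The gradient formula can be checked numerically. The sketch below compares the analytic gradient, written in the equivalent form $-2X^{\top}(y - X\beta)$, with central finite differences of $Q$. The helper names, the step size `h`, and the toy data are assumptions chosen for illustration.

```python
def q(y, X, beta):
    """Q(beta) = (y - X beta)^T (y - X beta)."""
    return sum(
        (y_i - sum(x_ij * b_j for x_ij, b_j in zip(row, beta))) ** 2
        for y_i, row in zip(y, X)
    )


def grad_q(y, X, beta):
    """Analytic gradient: -2 X^T y + 2 X^T X beta = -2 X^T (y - X beta)."""
    residuals = [
        y_i - sum(x_ij * b_j for x_ij, b_j in zip(row, beta))
        for y_i, row in zip(y, X)
    ]
    return [
        -2.0 * sum(row[j] * r for row, r in zip(X, residuals))
        for j in range(len(beta))
    ]


def numeric_grad_q(y, X, beta, h=1e-5):
    """Central finite-difference approximation to the gradient of Q."""
    grads = []
    for j in range(len(beta)):
        up = list(beta)
        down = list(beta)
        up[j] += h
        down[j] -= h
        grads.append((q(y, X, up) - q(y, X, down)) / (2 * h))
    return grads


# Made-up toy data: an intercept column plus one predictor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]
beta = [0.5, 1.5]

print(grad_q(y, X, beta))
print(numeric_grad_q(y, X, beta))
```

Since $Q$ is quadratic in $\beta$, the central difference agrees with the analytic gradient up to floating-point rounding.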

Setting the gradient to zero gives the normal equations:

$$
-2X^{\top}y + 2X^{\top}X\hat{\beta} = 0
\quad \Longrightarrow \quad
X^{\top}X\hat{\beta} = X^{\top}y.
$$

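The Hessian of $Q$ is $2X^{\top}X$, which is positive semidefinite, so any solution of this linear system (the normal equations) is a global minimizer of $Q$. If $X$ has full column rank, then $X^{\top}X$ is invertible and the minimizer is unique: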

$$
\hat{\beta}_{\text{MLE}} 
= \hat{\beta}_{\text{OLS}}
= (X^{\top}X)^{-1}X^{\top}y.
$$
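The equivalence can be illustrated on a small example. The sketch below solves the normal equations for a two-column design matrix using the explicit $2 \times 2$ inverse, then confirms that the resulting $\hat{\beta}$ gives a higher Gaussian log-likelihood than nearby perturbed values for several choices of $\sigma^2$. The function names, the data, and the perturbations are assumptions for illustration; a real analysis would normally use a linear algebra library instead of hand-written sums.

```python
import math


def rss(y, X, beta):
    """Q(beta) = (y - X beta)^T (y - X beta)."""
    return sum(
        (y_i - sum(x_ij * b_j for x_ij, b_j in zip(row, beta))) ** 2
        for y_i, row in zip(y, X)
    )


def log_likelihood(y, X, beta, sigma2):
    """Gaussian log-likelihood for Y | X ~ N(X beta, sigma2 * I_n)."""
    n = len(y)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss(y, X, beta) / (2 * sigma2)


def solve_normal_equations_2col(y, X):
    """Solve X^T X beta = X^T y when X has exactly two columns."""
    # Entries of the symmetric matrix X^T X = [[a, b], [b, d]].
    a = sum(row[0] * row[0] for row in X)
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X)
    # Entries of the vector X^T y = [u, v].
    u = sum(row[0] * y_i for row, y_i in zip(X, y))
    v = sum(row[1] * y_i for row, y_i in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ValueError("X^T X is singular; X lacks full column rank")
    # Apply the closed-form inverse of a 2x2 matrix.
    return [(d * u - b * v) / det, (a * v - b * u) / det]


# Made-up toy data: an intercept column plus one predictor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.1, 2.9, 5.2, 6.8]

beta_hat = solve_normal_equations_2col(y, X)
print("beta_hat:", beta_hat)

# For each fixed sigma^2, beta_hat should beat every perturbed candidate.
perturbations = [(0.1, 0.0), (0.0, -0.1), (0.05, 0.05)]
for sigma2 in (0.1, 1.0, 10.0):
    best = log_likelihood(y, X, beta_hat, sigma2)
    for d0, d1 in perturbations:
        candidate = [beta_hat[0] + d0, beta_hat[1] + d1]
        assert log_likelihood(y, X, candidate, sigma2) < best
```

The check passes for every $\sigma^2$ tried, consistent with the fact that the maximizing $\beta$ does not depend on $\sigma^2$.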

Thus, under the normal-error assumption, the least squares estimator is exactly the maximum likelihood estimator for $\beta$ in the linear regression model.