In [1]:
import numpy as np
import matplotlib.pyplot as plt

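# Sample 100 x-values from a grid on [0, 10) with step 0.01
x = np.random.choice(np.arange(0, 10, 0.01), 100)
# y follows the line y = x plus standard Gaussian noise
y = x + np.random.normal(0, 1, 100)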
plt.scatter(x, y)
plt.xlabel("X", size=15)
plt.ylabel("Y", size=15)
plt.show()

<Figure size 640x480 with 1 Axes>

In [2]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});

<IPython.core.display.Javascript object>

Given the 2D data above, let us try to model the relationship $Y \sim X$. The relationship between X and Y appears to be linear, but it is not perfect. This imperfection could be the result of some error or noise.
To begin modeling the relationship, let us first assume that the **relationship is additive** and that **the error is Gaussian with mean zero and some standard deviation (white noise)**. With this, we can write our linear model as

$$ y = \theta_1 x + \theta_0 + \epsilon $$
where,
$$ \epsilon \sim \mathcal{N}(0,\sigma^2) $$

Combining the two equations above, we can write

$$ y \sim \mathcal{N}(\theta_1 x + \theta_0, \sigma^2) $$

This model has three parameters: $\theta_0$, $\theta_1$ and $\sigma^2$. Our main goal is to find the best values for them.

This can be done using maximum likelihood estimation (MLE). For this, we write the conditional density of y given x in terms of this Gaussian.

$$ f(y \mid x; \theta_0,\theta_1,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - (\theta_1 x + \theta_0))^2}{2\sigma^2}\right) $$

Let's say we have n observed points in our data. If we assume that **each point is independent and identically distributed (iid)**, we can write the likelihood function over all of our observed points as the product of the individual probability densities.

$$ \mathcal{L}_x(\theta_0,\theta_1,\sigma^2) = \prod^n_{i=1} \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(-\frac{(y_i - (\theta_1 x_i + \theta_0))^2}{2\sigma_i^2}\right) $$

If we assume that **the standard deviation ($\sigma$) is constant for all data points (homoskedasticity)**, so that $\sigma_i = \sigma$ for every i, we can factor the normalising constant out of the product

\begin{align*}
    \mathcal{L}_x(\theta_0,\theta_1) & = \Bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\Bigg)^n \prod^n_{i=1} \exp \big(-\frac{(y_i - (\theta_1 x_i + \theta_0))^2}{2\sigma^2} \big) \\
    & = \Bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\Bigg)^n \exp \big( -\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - (\theta_1 x_i + \theta_0))^2 \big)
\end{align*}

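Maximising the likelihood directly is awkward because it is a product of exponentials. Since the logarithm is monotonically increasing, it leaves the maximiser unchanged, so we take the log of both sides and work with the log-likelihood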

\begin{align*}
    \ell_x(\theta_0,\theta_1) & = \ln \Bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\Bigg)^n + \ln \big( \exp \big( -\frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - (\theta_1 x_i + \theta_0))^2 \big) \big) \\
    & = n \ln \Bigg(\frac{1}{\sqrt{2\pi\sigma^2}}\Bigg) - \frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - (\theta_1 x_i + \theta_0))^2
\end{align*}
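As a quick numerical check of this simplification, the sketch below evaluates the closed-form log-likelihood and compares it with the sum of per-point Gaussian log densities. The function names, the fixed seed and the `x_demo`/`y_demo` arrays are illustrative assumptions, not part of the cells above.

```python
import numpy as np

def log_likelihood(theta0, theta1, sigma2, x, y):
    """Closed-form log-likelihood: n*ln(1/sqrt(2*pi*sigma2)) - SSE/(2*sigma2)."""
    n = len(x)
    residuals = y - (theta1 * x + theta0)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum(residuals**2) / (2 * sigma2)

def log_likelihood_pointwise(theta0, theta1, sigma2, x, y):
    """Same quantity, computed as the sum of the log density of each point."""
    mu = theta1 * x + theta0
    log_pdf = -0.5 * np.log(2 * np.pi * sigma2) - (y - mu) ** 2 / (2 * sigma2)
    return np.sum(log_pdf)

# Synthetic data in the spirit of the first cell (seed fixed for reproducibility)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 10, 100)
y_demo = x_demo + rng.normal(0, 1, 100)

ll_closed = log_likelihood(0.0, 1.0, 1.0, x_demo, y_demo)
ll_sum = log_likelihood_pointwise(0.0, 1.0, 1.0, x_demo, y_demo)
ll_bad = log_likelihood(5.0, -1.0, 1.0, x_demo, y_demo)  # deliberately poor parameters
print(ll_closed, ll_sum, ll_bad)
```

The two forms should agree up to floating-point error, and parameters far from the generating line should give a much lower log-likelihood.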

Since the first term does not depend on $\theta_0$ or $\theta_1$, it does not affect the optimization over them, and the problem can be rewritten as

$$ \underset{\theta_0,\theta_1}{\max} \quad - \frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - (\theta_1 x_i + \theta_0))^2 $$

or

\begin{align*}
\underset{\theta_0,\theta_1}{\min} & \quad \frac{1}{2\sigma^2}\sum^n_{i=1}(y_i - (\theta_1 x_i + \theta_0))^2 \\
\underset{\theta_0,\theta_1}{\min} & \quad \frac{n}{2\sigma^2}\frac{1}{n}\sum^n_{i=1}(y_i - (\theta_1 x_i + \theta_0))^2 \\
\underset{\theta_0,\theta_1}{\min} & \quad \frac{1}{n}\sum^n_{i=1}(y_i - (\theta_1 x_i + \theta_0))^2 \\
\underset{\theta_0,\theta_1}{\min} & \quad \text{MSE}(y, \hat{y})
\end{align*}
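The equivalence can be illustrated numerically. The sketch below evaluates the MSE and the log-likelihood on a coarse parameter grid and checks that both criteria pick the same grid point, whatever value of $\sigma^2$ is plugged in. The grid ranges, the seed and the helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x_demo = rng.uniform(0, 10, 200)
y_demo = x_demo + rng.normal(0, 1, 200)

def mse(theta0, theta1):
    """Mean squared error of the line theta1 * x + theta0 on the demo data."""
    return np.mean((y_demo - (theta1 * x_demo + theta0)) ** 2)

def log_lik(theta0, theta1, sigma2):
    """Log-likelihood written in terms of the MSE (the sum of squares is n * MSE)."""
    n = len(x_demo)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - n * mse(theta0, theta1) / (2 * sigma2)

# Coarse grid over intercept and slope
theta0_grid = np.linspace(-2, 2, 41)
theta1_grid = np.linspace(0, 2, 41)

mse_vals = np.array([[mse(t0, t1) for t1 in theta1_grid] for t0 in theta0_grid])
ll_vals_1 = np.array([[log_lik(t0, t1, 1.0) for t1 in theta1_grid] for t0 in theta0_grid])
ll_vals_4 = np.array([[log_lik(t0, t1, 4.0) for t1 in theta1_grid] for t0 in theta0_grid])

best_mse = np.unravel_index(np.argmin(mse_vals), mse_vals.shape)
best_ll_1 = np.unravel_index(np.argmax(ll_vals_1), ll_vals_1.shape)
best_ll_4 = np.unravel_index(np.argmax(ll_vals_4), ll_vals_4.shape)

print("argmin MSE:", theta0_grid[best_mse[0]], theta1_grid[best_mse[1]])
print("argmax log-likelihood:", theta0_grid[best_ll_1[0]], theta1_grid[best_ll_1[1]])
```

A grid search is used only to keep the comparison transparent. It is not an efficient way to fit the model.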

This can be extended to multivariable data with m independent variables as

$$ \underset{\theta}{min} \quad \lVert y - \text{X} \theta \rVert^2 $$

where $\theta$ is the parameter vector of size ((m+1) x 1), $y$ is the response vector of size (n x 1), and $\text{X}$ is the data matrix of size (n x (m+1)) whose first column is all ones (for the intercept).

Setting the gradient of this objective with respect to $\theta$ to zero, we get the normal equations

$$ (\text{X}^T\text{X})\theta = \text{X}^Ty $$
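For completeness, the intermediate step can be sketched by expanding the squared norm and differentiating with respect to $\theta$. This uses only standard matrix calculus identities:

\begin{align*}
    \lVert y - \text{X}\theta \rVert^2 & = y^Ty - 2\theta^T\text{X}^Ty + \theta^T\text{X}^T\text{X}\theta \\
    \nabla_\theta \lVert y - \text{X}\theta \rVert^2 & = -2\text{X}^Ty + 2\text{X}^T\text{X}\theta = 0 \\
    \Rightarrow \quad (\text{X}^T\text{X})\theta & = \text{X}^Ty
\end{align*}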

So if the (n x (m+1))-matrix $\text{X}$ has full column rank $m+1$, which is possible only when there is **no perfect multicollinearity**, then the ((m+1) x (m+1))-matrix $\text{X}^T\text{X}$ will also have rank $m+1$ and is therefore invertible (according to the second-to-last property [here](https://en.wikipedia.org/wiki/Rank_(linear_algebra)#Properties))

In this case, we get the unique solution

$$ \theta = (\text{X}^T\text{X})^{-1}\text{X}^Ty $$
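A minimal sketch of this closed-form solution in NumPy follows, assuming synthetic data like that in the first cell. The variable names and the seed are illustrative. The system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, and the result is cross-checked against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x_demo = rng.uniform(0, 10, n)
y_demo = x_demo + rng.normal(0, 1, n)

# Design matrix: first column of ones for the intercept, then the feature
X = np.column_stack([np.ones(n), x_demo])

# Solve the normal equations (X^T X) theta = X^T y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y_demo)

# Cross-check against NumPy's least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X, y_demo, rcond=None)

# Not derived in the text above: the standard MLE of the noise variance
# is the mean squared residual at the fitted parameters
sigma2_hat = np.mean((y_demo - X @ theta_normal) ** 2)

print("theta (normal equations):", theta_normal)
print("theta (lstsq):           ", theta_lstsq)
print("sigma^2 estimate:        ", sigma2_hat)
```

In practice `np.linalg.lstsq` is usually the safer choice, because forming $\text{X}^T\text{X}$ squares the condition number of the problem.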

In summary, we made the following assumptions:

* The relationship is additively linear
* The errors are white noise
* Each point is IID
* Homoskedasticity
* No perfect multicollinearity
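The last assumption can be checked numerically through the rank of the design matrix. In the sketch below, the features are hypothetical: one design matrix has unrelated columns, and the other contains a column that is an exact linear combination of the others.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Intercept column plus two unrelated features: full column rank expected
X_ok = np.column_stack([np.ones(n), x1, x2])

# Third column equals 2 * x1 + 3 * (intercept column): perfectly collinear
X_collinear = np.column_stack([np.ones(n), x1, 2 * x1 + 3])

rank_ok = np.linalg.matrix_rank(X_ok)
rank_collinear = np.linalg.matrix_rank(X_collinear)
rank_gram_ok = np.linalg.matrix_rank(X_ok.T @ X_ok)

# A very large condition number signals that X^T X is numerically singular
cond_ok = np.linalg.cond(X_ok.T @ X_ok)
cond_collinear = np.linalg.cond(X_collinear.T @ X_collinear)

print("rank(X_ok) =", rank_ok, " rank(X_collinear) =", rank_collinear)
print("cond(X^T X): ok =", cond_ok, " collinear =", cond_collinear)
```

When the rank falls below the number of columns, $\text{X}^T\text{X}$ is singular and the normal equations no longer have a unique solution.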