Simple linear regression, in a nutshell, is a very direct approach to predicting a response $Y$ from a single predictor $X$. It assumes that there is an approximately linear relationship between $X$ and $Y$. Mathematically, this is written as

$$Y \approx b_0 + b_1X $$

Here, $b_0$ is the intercept and $b_1$ is the slope (gradient). Our goal is to estimate $b_0$ and $b_1$ so that the resulting line lies as close as possible to the data points.

<img src="./images/RSS.png" width=40%>

This closeness is commonly measured by the residual sum of squares (RSS).

RSS is simply the sum of the squared differences between each target $y$ and its estimated value $\hat{y}$:
 
$$\sum(y-\hat{y})^2$$

Remember that $\hat{y}$ is equal to $b_0+b_1x$ so we can expand this further into
$$\sum(y-(b_0+b_1x))^2$$
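
To make the formula concrete, here is a minimal sketch that evaluates RSS for a candidate line. The function name `rss` and the five data points are made up for illustration and do not come from the text.

```python
def rss(b0, b1, xs, ys):
    """Residual sum of squares for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up illustration data (an assumption), roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

print(rss(0.0, 2.0, xs, ys))  # a line close to the data gives a small RSS
print(rss(0.0, 1.0, xs, ys))  # a poorer line gives a much larger RSS
```

A smaller RSS means the candidate line passes closer to the points, which is exactly the quantity the rest of this section minimizes.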

Our goal is to find the 'correct' $b_0$ and $b_1$, that is, the pair that minimizes RSS. Plotted against $b_0$ and $b_1$, RSS might look like this:

<img src="./images/RSS_vs_param.png" width=40%>

The great thing about the RSS function is that it always has a global minimum. Calculus confirms this: setting the partial derivatives of RSS to zero and solving gives candidate values of $b_0$ and $b_1$. On its own, a zero derivative could mark a maximum, a minimum, or a saddle point. The second derivatives settle the question: they show that RSS is a convex function, so the stationary point is a global minimum (and a unique one as long as the $x$ values are not all identical). Keep in mind that RSS is a special case in which setting the partial derivatives to zero yields the parameters directly. More often than not, a cost function does not allow this, so take this closed-form solution with a grain of salt.
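
For reference, the second-derivative check can be written out as follows, where $n$ is the number of data points and $H$ (a symbol introduced here, not used elsewhere in the text) collects the second partial derivatives of RSS:

$$\frac{\partial^2 \text{RSS}}{\partial b_0^2} = 2n, \qquad \frac{\partial^2 \text{RSS}}{\partial b_0 \partial b_1} = 2\sum x, \qquad \frac{\partial^2 \text{RSS}}{\partial b_1^2} = 2\sum x^2$$

$$H = 2\begin{pmatrix} n & \sum x \\ \sum x & \sum x^2 \end{pmatrix}, \qquad \det H = 4\left(n\sum x^2 - \left(\sum x\right)^2\right) = 4n\sum (x-\overline{x})^2 \ge 0$$

Since the diagonal entries are positive and the determinant is non-negative, $H$ is positive semi-definite everywhere, which is what convexity requires. The determinant is strictly positive whenever the $x$ values are not all equal, and in that case the minimum is unique.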

Next, we differentiate RSS with respect to (w.r.t.) $b_0$ and $b_1$, set both derivatives to 0, and solve for the parameters $b_0$ and $b_1$.

Let's now call the RSS function a cost function or loss function, denoted $C$ below. In a nutshell, it is the function we strive to maximize or minimize. In this specific case, we want to minimize the RSS. In other words, we are looking for the $b_0$ and $b_1$ that make RSS as small as possible. Intuitively, one would always look for the minimum of a cost function, but in some cases, e.g. logistic regression, we look for the maximum of the objective instead. It's a little bit off topic, but a little anticipation is always nice.


\begin{eqnarray}
 \frac{\partial C}{\partial b_0} &=& -2 \sum(y-b_0-b_1x)\nonumber\\
   &=& -2 \sum(y-(b_0+b_1x))\nonumber\\
\end{eqnarray}

\begin{eqnarray}
 \frac{\partial C}{\partial b_1} &=& -2 \sum(y-b_0-b_1x)(x)\nonumber\\
   &=& -2 \sum(y-(b_0+b_1x))(x)\nonumber\\
\end{eqnarray}
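
One way to gain confidence in these two derivatives is to compare them with a finite-difference approximation. The sketch below does that; the function names, the data, and the evaluation point $(0.5, 1.5)$ are illustrative assumptions.

```python
def rss(b0, b1, xs, ys):
    """Residual sum of squares for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def rss_gradient(b0, b1, xs, ys):
    """Analytic partial derivatives of RSS with respect to b0 and b1."""
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    d_b0 = -2 * sum(residuals)
    d_b1 = -2 * sum(r * x for r, x in zip(residuals, xs))
    return d_b0, d_b1


def numeric_gradient(b0, b1, xs, ys, h=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    d_b0 = (rss(b0 + h, b1, xs, ys) - rss(b0 - h, b1, xs, ys)) / (2 * h)
    d_b1 = (rss(b0, b1 + h, xs, ys) - rss(b0, b1 - h, xs, ys)) / (2 * h)
    return d_b0, d_b1


# Made-up illustration data (an assumption, not from the text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

analytic = rss_gradient(0.5, 1.5, xs, ys)
numeric = numeric_gradient(0.5, 1.5, xs, ys)
print(analytic)
print(numeric)  # should agree with the analytic values up to rounding
```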



Set the first equation to zero

\begin{eqnarray}
 0 &=& -2 \sum(y-b_0-b_1x)\nonumber\\
   &=& \sum y - n b_0 - b_1 \sum x\nonumber\\
 b_0 &=& \frac{\sum y}{n} - \frac {b_1 \sum x}{n}\nonumber\\
 &=& \overline{y} - b_1\overline{x}\nonumber\\
\end{eqnarray}

Set the second equation to zero

\begin{eqnarray}
 0 &=& -2 \sum (y-b_0-b_1x)x\nonumber\\
   &=& \sum yx - \sum b_0x - \sum b_1x^2\nonumber\\
 b_1\sum x^2 &=&  \sum yx - b_0\sum x\nonumber\\
\end{eqnarray}

But recall that $b_0 = \overline{y} - b_1\overline{x}$

\begin{eqnarray}
 b_1\sum(x^2) &=&  \sum(yx) - (\overline{y} - b_1\overline{x})\sum(x)\nonumber\\
   &=& \sum yx - (\frac {\sum y}{n} - b_1 \frac{\sum x}{n})\sum x \nonumber\\
   &=& \sum yx - \frac{1}{n} \sum y \sum x  + \frac{1}{n} b_1 \sum x \sum x\nonumber\\
 b_1 (\sum x^2 - \frac {1}{n} \sum x \sum x)  &=& \sum yx - \frac{1}{n} \sum y \sum x\nonumber\\
 b_1 &=& \frac {\sum xy - \frac{1}{n} \sum x \sum y} {\sum x^2 - \frac {1}{n} \sum x \sum x}\nonumber\\
\end{eqnarray}
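
The two closed-form expressions translate almost line for line into code. The following is a minimal sketch, assuming plain Python lists as input; the function name `fit_simple_linear` and the noiseless example data are illustrative choices.

```python
def fit_simple_linear(xs, ys):
    """Closed-form least-squares estimates (b0, b1) for y ~ b0 + b1 * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # b1 = (sum xy - (1/n) sum x sum y) / (sum x^2 - (1/n) sum x sum x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x * sum_x / n)
    # b0 = mean(y) - b1 * mean(x)
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Made-up noiseless data generated from y = 1 + 2x (an assumption).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

b0, b1 = fit_simple_linear(xs, ys)
print(b0, b1)  # prints 1.0 2.0, recovering the intercept and slope
```

Note that the denominator is zero when all $x$ values are identical, so this sketch would raise a `ZeroDivisionError` in that degenerate case.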

However, most of the time we cannot do this, because the cost function is much more complicated. That is why gradient descent is used in most cases.
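
As a rough illustration of that alternative, here is a minimal gradient descent sketch applied to the same RSS problem. The learning rate, the number of steps, the zero initialization, and the choice to average the gradient over the $n$ points are all assumptions made for this toy example, and the data is made up.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Minimize RSS / n for y ~ b0 + b1 * x with plain gradient descent."""
    n = len(xs)
    b0, b1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(steps):
        residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of RSS, divided by n to keep the step size stable.
        grad_b0 = -2 * sum(residuals) / n
        grad_b1 = -2 * sum(r * x for r, x in zip(residuals, xs)) / n
        # Step against the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Made-up illustration data (an assumption, not from the text).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

b0, b1 = gradient_descent(xs, ys)
print(b0, b1)  # should land very close to the closed-form estimates
```

For this problem the iterative result and the closed-form result should agree closely; the value of the iterative approach is that it still works when no closed form exists.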