---
title: "correlation and linear regression"
execute:
  # echo: false
  freeze: auto  # re-render only when source changes
format:
  html:
    code-fold: true
    code-summary: "Show the code"
---

The correlation coefficient is closely related to linear regression. In simple linear regression, we model the relationship between a dependent variable $Y$ and an independent variable $X$ as:

$$
Y = \beta_0 + \beta_1 X + \epsilon
$$
where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term.

## prelude: finding the intercept and slope

Let's derive the formulas for the intercept and slope of the regression line. We want to minimize the sum of squared residuals $L$:

$$
L = \sum_{i=1}^n (y_i - \hat{y}_i)^2,
\tag{1}
$$

where 

$$
\hat{y}_i = \beta_0 + \beta_1 x_i.
\tag{2}
$$

To find the optimal values of $\beta_0$ and $\beta_1$, we take the partial derivatives of $L$ with respect to $\beta_0$ and $\beta_1$, set them to zero, and solve the resulting equations.

\begin{align*}
\frac{\partial L}{\partial \beta_0} &= 0 \tag{3a}\\
\frac{\partial L}{\partial \beta_1} &= 0 \tag{3b}
\end{align*}
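As a quick numerical illustration of Eqs. (1) to (3), the sketch below evaluates $L$ on synthetic data and approximates both partial derivatives with central finite differences at a fitted line. The data, the helper name `loss`, and the use of `np.polyfit` as a reference fit are assumptions made for this example, not part of the derivation.

```python
import numpy as np

# Synthetic data (assumed for illustration): a noisy line y = 2 + 0.5 x
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)

def loss(beta0, beta1):
    """Sum of squared residuals L from Eq. (1), with y_hat from Eq. (2)."""
    y_hat = beta0 + beta1 * x
    return np.sum((y - y_hat) ** 2)

# Reference least-squares fit; np.polyfit returns the highest-degree coefficient first
beta1_fit, beta0_fit = np.polyfit(x, y, deg=1)

# Central finite differences approximate the partial derivatives in Eqs. (3a) and (3b)
h = 1e-4
dL_dbeta0 = (loss(beta0_fit + h, beta1_fit) - loss(beta0_fit - h, beta1_fit)) / (2 * h)
dL_dbeta1 = (loss(beta0_fit, beta1_fit + h) - loss(beta0_fit, beta1_fit - h)) / (2 * h)

print(f"dL/dbeta0 = {dL_dbeta0:.2e}, dL/dbeta1 = {dL_dbeta1:.2e}")
```

Both printed values should be zero up to floating-point rounding, which is what Eqs. (3a) and (3b) require at the minimum.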

### intercept

Let's start with Eq. (3a):

\begin{align*}
\frac{\partial L}{\partial \beta_0} &= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \tag{4a} \\
&= \frac{\partial}{\partial \beta_0} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \tag{4b} \\
&= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0 \tag{4c}
\end{align*}

Eliminating the constant factor $-2$ and expanding the summation, we get:

$$
\sum_{i=1}^n y_i - n \beta_0 - \beta_1 \sum_{i=1}^n x_i = 0
\tag{5}
$$

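We now divide by $n$, use the sample means $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$, and rearrange to isolate $\beta_0$: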

<div class="alert alert-primary">
$$
\beta_0 = \bar{y} - \beta_1 \bar{x}
\tag{6}
$$
</div>

Note: since $\hat{y}_i = \beta_0 + \beta_1 x_i$, we can rewrite Eq. (4c) as

$$
\sum_{i=1}^n (y_i - \hat{y}_i) = \sum_{i=1}^n \text{residual}_i = 0,
$$

that is, the residuals of the least-squares line always sum to zero.
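A minimal check of these two results, on synthetic data invented for this example and with `np.polyfit` standing in as an independent least-squares fit:

```python
import numpy as np

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(1)
x = rng.normal(5, 2, size=100)
y = 1.0 - 0.8 * x + rng.normal(0, 0.5, size=100)

# Independent least-squares fit; the slope is returned first
beta1, beta0 = np.polyfit(x, y, deg=1)

# The residuals should sum to zero up to rounding, as Eq. (4c) states
residuals = y - (beta0 + beta1 * x)
print(f"sum of residuals = {residuals.sum():.2e}")

# Eq. (6): the intercept equals y_bar - beta1 * x_bar
intercept_from_means = y.mean() - beta1 * x.mean()
print(np.isclose(beta0, intercept_from_means))
```

Eq. (6) also implies that the fitted line passes through the point $(\bar{x}, \bar{y})$.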

### slope

Now let's move on to Eq. (3b):

\begin{align*}
\frac{\partial L}{\partial \beta_1} &= \frac{\partial}{\partial \beta_1} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \tag{7a} \\
&= \frac{\partial}{\partial \beta_1} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2 \tag{7b} \\
&= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) x_i \tag{7c} \\
&= -2 \sum_{i=1}^n (y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i) x_i = 0 \tag{7d}
\end{align*}

In the last step we substituted $\beta_0$ from Eq. (6). Eliminating the constant factor $2$ and expanding the summation, we get:

$$
-\sum_{i=1}^n x_i y_i + \sum_{i=1}^n x_i \bar{y} - \beta_1 \sum_{i=1}^n x_i \bar{x} + \beta_1 \sum_{i=1}^n x_i^2 = 0
\tag{8}
$$

Let's group the terms involving $\beta_1$ on one side and the rest on the other side:

$$
\beta_1 \left( \sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i \bar{x} \right) = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \bar{y}
\tag{9}
$$

Isolating $\beta_1$, we have:

$$
\beta_1 = \frac{\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \bar{y}}{\sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i \bar{x}} = \frac{\text{numerator}}{\text{denominator}}
\tag{10}
$$

It's easier to interpret the numerator and denominator separately. To each we will add and subtract a term that will allow us to express them in simpler forms.

Numerator:

$$
\text{numerator} = \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \bar{y} + \sum_{i=1}^n \bar{x} y_i - \sum_{i=1}^n \bar{x} y_i
\tag{11}
$$

We express the third term thus:

$$
\text{third term} = \sum_{i=1}^n \bar{x} y_i = \bar{x} \sum_{i=1}^n y_i = n \bar{x} \bar{y} = \sum_{i=1}^n \bar{x} \bar{y}
\tag{12}
$$

The numerator now becomes:

\begin{align*}
\text{numerator} &= \sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \bar{y} + \sum_{i=1}^n \bar{x} \bar{y} - \sum_{i=1}^n \bar{x} y_i \tag{13a} \\
&= \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \tag{13b}
\end{align*}

Now the denominator:

$$
\text{denominator} = \sum_{i=1}^n x_i^2 - \sum_{i=1}^n x_i \bar{x} + \sum_{i=1}^n x_i\bar{x} - \sum_{i=1}^n x_i\bar{x}
\tag{14}
$$

We group the second and fourth terms, and express the third term thus:

$$
\text{third term} = \sum_{i=1}^n x_i \bar{x} = \bar{x} \sum_{i=1}^n x_i = n \bar{x}^2 = \sum_{i=1}^n \bar{x}^2
\tag{15}
$$

The denominator now becomes:

\begin{align*}
\text{denominator} &= \sum_{i=1}^n x_i^2 - 2 \sum_{i=1}^n x_i \bar{x} + \sum_{i=1}^n \bar{x}^2 \tag{16a} \\
&= \sum_{i=1}^n (x_i - \bar{x})^2 \tag{16b}
\end{align*}
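The two rewrites can be sanity-checked numerically. The sketch below compares the raw forms of the numerator and denominator in Eq. (10) with the centered forms in Eqs. (13b) and (16b), on arbitrary random data chosen for this example.

```python
import numpy as np

# Arbitrary data (assumed for illustration); the identities hold for any x and y
rng = np.random.default_rng(2)
x = rng.normal(size=30)
y = rng.normal(size=30)
x_bar, y_bar = x.mean(), y.mean()

# Numerator: raw form from Eq. (10) versus centered form from Eq. (13b)
num_raw = np.sum(x * y) - np.sum(x * y_bar)
num_centered = np.sum((x - x_bar) * (y - y_bar))

# Denominator: raw form from Eq. (10) versus centered form from Eq. (16b)
den_raw = np.sum(x ** 2) - np.sum(x * x_bar)
den_centered = np.sum((x - x_bar) ** 2)

print(np.isclose(num_raw, num_centered), np.isclose(den_raw, den_centered))
```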


Putting it all together, we have:

<div class="alert alert-primary">
$$
\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}
\tag{17}
$$
</div>
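Eqs. (6) and (17) translate almost line by line into code. The function below is a minimal sketch; the name `fit_line` and the five data points are made up for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Return (beta0, beta1) of the least-squares line, using Eqs. (6) and (17)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Eq. (17): slope = sum of centered cross-products / sum of centered squares
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Eq. (6): intercept from the means and the slope
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Small hand-checkable example: x_bar = 3, y_bar = 4, slope = 6 / 10
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
beta0, beta1 = fit_line(x, y)
print(f"intercept = {beta0:.2f}, slope = {beta1:.2f}")  # intercept = 2.20, slope = 0.60
```

For these points the centered cross-products sum to $6$ and the centered squares sum to $10$, so $\beta_1 = 0.6$ and $\beta_0 = 4 - 0.6 \cdot 3 = 2.2$.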

## slope and correlation

Let's divide both the numerator and denominator of Eq. (17) by $n-1$:

$$
\beta_1 = \frac{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
\tag{18}
$$

The numerator is the sample covariance $Cov(X, Y)$, and the denominator is the sample variance $Var(X)$:

$$
\beta_1 = \frac{Cov(X, Y)}{Var(X)}
\tag{19}
$$
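Eq. (19) can be checked with NumPy's sample statistics. In this sketch the synthetic data and variable names are assumptions; `ddof=1` selects the $n-1$ normalization used in Eq. (18).

```python
import numpy as np

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(3)
x = rng.normal(10, 3, size=200)
y = 4.0 + 1.5 * x + rng.normal(0, 2, size=200)

# Sample covariance and variance with the n-1 normalization of Eq. (18)
cov_xy = np.cov(x, y, ddof=1)[0, 1]
var_x = np.var(x, ddof=1)

# Eq. (19): slope as covariance over variance
beta1_cov = cov_xy / var_x

# Reference slope from an independent least-squares fit
beta1_fit = np.polyfit(x, y, deg=1)[0]
print(np.isclose(beta1_cov, beta1_fit))
```

Because the same factor multiplies numerator and denominator, normalizing both by $n$ instead of $n-1$ would give the same slope.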

Now, we can express the sample covariance in terms of the sample correlation coefficient $\rho_{X,Y}$ and the sample standard deviations $\sigma_X$ and $\sigma_Y$:

$$
Cov(X, Y) = \rho_{X,Y} \sigma_X \sigma_Y
\tag{20}
$$

Substituting Eq. (20) into Eq. (19) and using $Var(X) = \sigma_X^2$, we get:

$$
\beta_1 = \frac{\rho_{X,Y} \sigma_X \sigma_Y}{\sigma_X^2}
\tag{21}
$$

And finally, we have:

<div class="alert alert-primary">
$$
\beta_1 = \rho_{X,Y} \frac{\sigma_Y}{\sigma_X}
\tag{22}
$$
</div>


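This shows that the slope of the regression line is proportional to the correlation coefficient, with proportionality factor $\sigma_Y / \sigma_X$. The slope always has the same sign as $\rho_{X,Y}$, and for a fixed ratio $\sigma_Y / \sigma_X$ a larger absolute correlation gives a steeper slope, while a smaller one gives a flatter slope. In particular, when $X$ and $Y$ are both standardized ($\sigma_X = \sigma_Y = 1$), the slope equals the correlation coefficient.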
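The sketch below checks Eq. (22) numerically and then rescales $y$ by a factor of 10, which changes $\sigma_Y$ and therefore the slope, but not the correlation. The synthetic data and the scale factor are assumptions chosen for this example.

```python
import numpy as np

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(4)
x = rng.normal(0, 2, size=500)
y = 3.0 + 0.7 * x + rng.normal(0, 1, size=500)

# Sample correlation and sample standard deviations (n-1 normalization)
r = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)

# Eq. (22): slope from correlation and the ratio of standard deviations
beta1_corr = r * s_y / s_x
beta1_fit = np.polyfit(x, y, deg=1)[0]
print(np.isclose(beta1_corr, beta1_fit))

# Rescaling y multiplies sigma_Y (and the slope) by 10, but leaves r unchanged
y_scaled = 10 * y
r_scaled = np.corrcoef(x, y_scaled)[0, 1]
beta1_scaled = np.polyfit(x, y_scaled, deg=1)[0]
print(np.isclose(r, r_scaled), np.isclose(beta1_scaled, 10 * beta1_fit))
```

So two data sets with the same correlation can have very different slopes: the correlation fixes the slope only once the ratio $\sigma_Y / \sigma_X$ is known.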