<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202-SU24/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Class 06</h1>

<h2>Regression</h2>

Inspired by [SticiGui Chapter 9](https://www.stat.berkeley.edu/~stark/SticiGui/Text/regression.htm)

<h3>Learning objectives:</h3>

1. We’d like a model that predicts $Y$ using $X$
2. If the scatterplot looks like a football, then a line running through the point of averages is a good model; the question is how to choose the slope of that line
3. The SD Line — with slope $SD(Y)/SD(X)$ — is a decent choice
4. But it predicts $Y$ poorly within slices of $X$ that are far from the average
5. Instead, the slope $\beta = r \times SD(Y)/SD(X) = Cov(X,Y)/Var(X)$, which rescales the slope of the SD line by $r$, does better

To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Best practices include running the cells in order.

In [None]:
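# Install googlesheets4 only if it is not already available
if (!requireNamespace("googlesheets4", quietly = TRUE)) {
  install.packages("googlesheets4")
}
library(googlesheets4)
gs4_deauth()  # the sheet is public, so skip the Google sign-in step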

Here again are 1,078 observations of "fathers" and "sons" from a well-known training dataset based on the historical work of [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson). Please see the Class 03 notebook for more details.

Here is a direct link to the Google Sheets file loaded in the cell below: [Pearson height data.sheets](https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link)

In [None]:
sheet_url = "https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link"

pheights <- read_sheet(sheet_url,
                       range = "B13:D1091")

Calling `head()` provides a useful quick look at the top of the dataset. Calling `dim()` helps us make sure we have the right dataset loaded up in the correct way.

In [None]:
head(pheights)
dim(pheights)

Let's create a scatterplot. Here's a simple way to do it:

In [None]:
plot(pheights$father, pheights$son,
     main = "Pearson height dataset n = 1,078",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")

The call to `cor()` below passes two options. `method = "pearson"` is the default, so it is redundant here, but I'll include it anyway to be explicit. __R__ and other statistical programs tend to get finicky about missing observations: by default, `cor()` returns `NA` if either variable has a missing value. Setting `use = "complete.obs"` drops any incomplete rows before the correlation is computed.

In [None]:
r = cor(pheights$father, pheights$son,
        method = "pearson",
        use = "complete.obs"
       )
r
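
To see what `cor()` is doing under the hood, here is a minimal sketch that computes $r$ by hand as the average product of $X$ and $Y$ in standard units. The vectors `x` and `y` are toy values invented for illustration, not the Pearson heights, and the SDs here use $n$ in the denominator.

```r
# Toy data, invented for illustration (not the Pearson heights)
x <- c(63, 65, 67, 69, 71, 73)
y <- c(66, 65, 69, 68, 72, 71)

# SDs with n in the denominator
sd_x <- sqrt(mean((x - mean(x))^2))
sd_y <- sqrt(mean((y - mean(y))^2))

# Convert each variable to standard units, then average the products
zx <- (x - mean(x)) / sd_x
zy <- (y - mean(y)) / sd_y
r_by_hand <- mean(zx * zy)

# Compare with the built-in calculation
r_builtin <- cor(x, y)
all.equal(r_by_hand, r_builtin)
```

The two numbers agree because the $n$ versus $n-1$ choice cancels out of the correlation, as long as it is applied consistently to the covariance and to both SDs.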

Let's also calculate the standard deviations of $Y$ and $X$. First, let's calculate the sample size $n$ with `nrow()`. Then we'll use `sd()`, which divides by $n-1$, and multiply by $\sqrt{(n-1)/n}$ to convert the result to the SD that divides by $n$:

In [None]:
n = nrow(pheights)
n

sdY = sd(pheights$son) * sqrt( (n-1)/n )
sdY

sdX = sd(pheights$father) * sqrt( (n-1)/n )
sdX
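
As a quick check on the correction factor, the sketch below compares the rescaled `sd()` against an SD computed directly with $n$ in the denominator. The vector `x` is a toy example chosen so that the answer is a round number; it is not part of the course data.

```r
# Toy data with mean 5 and squared deviations summing to 32
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd_sample <- sd(x)                             # divides by n - 1
sd_direct <- sqrt(mean((x - mean(x))^2))       # divides by n
sd_rescaled <- sd_sample * sqrt((n - 1) / n)   # the correction used above

sd_direct     # 2
sd_rescaled   # 2, up to floating-point rounding
```

With 1,078 observations the correction factor is very close to 1, so the two versions of the SD differ only slightly for the height data.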

The slope of the $SD$ line is just the ratio of $SD(Y)$ to $SD(X)$:

In [None]:
sdlslope = sdY/sdX
sdlslope

Does the $SD$ line predict $Y$ well or poorly? A visualization will take a little bit of fussing. The $SD$ line passes through the point of averages. Here are the averages:

In [None]:
meanY = mean(pheights$son)
meanY

meanX = mean(pheights$father)
meanX

And now here's a trick to find the intercept term in the $SD$ line. We know it runs through the point of averages $(\bar{X},\bar{Y})$ and we know its slope, $b = SD(Y)/SD(X)$. Then:

$$
\bar{Y} = a + b \ \bar{X}
$$
$$
a = \bar{Y} - b \ \bar{X}
$$

In [None]:
sdlint = meanY - sdlslope * meanX
sdlint

Now we can compute the height of the $SD$ line at any $X$. Below, I choose $60$ and $75$ as the two endpoints and draw the line between them.

In [None]:
plot(pheights$father, pheights$son,
     main = "Pearson height dataset n = 1,078",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")
lines(c(60, 75), 
      c(sdlint + sdlslope*60, sdlint + sdlslope*75),
      col = "red",
      lwd = 2
     )

Visually speaking, the $SD$ line looks like it's doing a very nice job summarizing the football cloud, and it does. But within slices of $X$, especially at the extremes, the $SD$ line does a poor job predicting $Y$. At far right, the $SD$ line is much too high; at far left, it is much too low.
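
One way to check this claim numerically is to compare, within a slice of $X$, the actual average of $Y$ against what the $SD$ line predicts. The sketch below uses simulated data rather than the Pearson heights: the means, SDs, and correlation of about 0.5 are values I made up for illustration, and the slice cutoffs of 64 and 72 are arbitrary.

```r
set.seed(202)

# Simulated football-shaped cloud (invented parameters, not the real data)
n <- 5000
x <- rnorm(n, mean = 68, sd = 2.7)
y <- 69 + 0.5 * (x - 68) + rnorm(n, sd = 2.7 * sqrt(1 - 0.5^2))

# SD line: slope SD(Y)/SD(X), passing through the point of averages
# (the n vs. n - 1 correction cancels in the ratio)
sd_slope <- sd(y) / sd(x)
sd_int <- mean(y) - sd_slope * mean(x)

# Two slices far from the average of x
tall <- x > 72
short <- x < 64

# Actual average of y in each slice vs. the SD line at the slice's average x
tall_actual <- mean(y[tall])
tall_sdline <- sd_int + sd_slope * mean(x[tall])
short_actual <- mean(y[short])
short_sdline <- sd_int + sd_slope * mean(x[short])

c(tall_actual = tall_actual, tall_sdline = tall_sdline)
c(short_actual = short_actual, short_sdline = short_sdline)
```

In the simulated data the $SD$ line sits above the actual average in the right-hand slice and below it in the left-hand slice, which is the same pattern described above.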

---

Instead, consider this adjustment to the slope of the $SD$ line, $b$:

$$
\beta = r \times b = r \times \frac{SD(Y)}{SD(X)} 
$$

In [None]:
betacoef = r * sdlslope
betacoef

This $\beta$ is the least squares slope coefficient, and it is also equal to the ratio of the covariance of $X$ and $Y$ to the variance of $X$:

$$
\beta = \frac{Cov(X,Y)}{Var(X)}
$$

In [None]:
# cov() and var() both divide by n - 1, so that factor cancels in the ratio
betacoef = cov(pheights$father,pheights$son)/var(pheights$father)
betacoef
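
The two expressions for $\beta$ agree because of how $r$ is defined. Starting from the definition of the correlation and substituting into $r \times SD(Y)/SD(X)$:

$$
r = \frac{Cov(X,Y)}{SD(X) \ SD(Y)}
$$
$$
r \times \frac{SD(Y)}{SD(X)} = \frac{Cov(X,Y)}{SD(X) \ SD(Y)} \times \frac{SD(Y)}{SD(X)} = \frac{Cov(X,Y)}{SD(X)^2} = \frac{Cov(X,Y)}{Var(X)}
$$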

As before, we can find the intercept $\alpha$ using our knowledge of the slope and the point of averages:

$$
\bar{Y} = \alpha + \beta \ \bar{X}
$$
$$
\alpha = \bar{Y} - \beta \ \bar{X}
$$

In [None]:
alphacoef = meanY - betacoef * meanX
alphacoef

Finally, here is the scatterplot again, now with the $SD$ line superimposed in red and the linear regression line superimposed in blue.

In [None]:
plot(pheights$father, pheights$son,
     main = "Pearson height dataset n = 1,078",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")
lines(c(60, 75), 
      c(sdlint + sdlslope*60, sdlint + sdlslope*75),
      col = "red",
      lwd = 2
     )
lines(c(60, 75), 
      c(alphacoef + betacoef*60, alphacoef + betacoef*75),
      col = "blue",
      lwd = 2
     )

Hold on to your hats. It turns out that we can also use `lm()` to estimate the blue line using <b>ordinary least squares</b>, which we will return to later in COMPSS 202.

The syntax of `lm()` is as follows. The funny part with the tilde (`~`) is the estimation equation: the outcome goes on the left, the predictor goes on the right, a tilde stands in for the equals sign, and no coefficients are formally listed:

In [None]:
reg1 <- lm(son ~ father,
          data = pheights)
summary(reg1)

In the output here, the `Estimate` for `(Intercept)` is the constant term, $\alpha$, and the `Estimate` for `father` is $\beta$, the ordinary least squares regression coefficient.
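
If you want the coefficients as numbers rather than reading them off the printed summary, `coef()` extracts them from a fitted model. The sketch below checks that they match the hand formulas, using a small simulated data frame that I call `toy`; its values are invented and only the column names mirror the height data.

```r
set.seed(6)

# Simulated stand-in for the height data (invented values)
toy <- data.frame(father = rnorm(200, mean = 68, sd = 3))
toy$son <- 35 + 0.5 * toy$father + rnorm(200, sd = 2)

# Ordinary least squares with lm()
fit <- lm(son ~ father, data = toy)

# The same slope and intercept from the formulas in this notebook
beta_hand <- cov(toy$father, toy$son) / var(toy$father)
alpha_hand <- mean(toy$son) - beta_hand * mean(toy$father)

coef(fit)                  # named vector: (Intercept), father
c(alpha_hand, beta_hand)

all.equal(unname(coef(fit)), c(alpha_hand, beta_hand))
```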

Later in the course we will discuss what the `Std. Error` (standard error) and other columns mean. For now: the similarity between "standard deviation" and "standard error" is no accident.

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>