<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Class 07</h1>

<h2>Regression Diagnostics</h2>

Inspired by [SticiGui Chapter 10](https://www.stat.berkeley.edu/~stark/SticiGui/Text/regressionDiagnostics.htm)

<h3>Learning objectives:</h3>

1. The regression line $Y = \alpha + \beta \ X + \epsilon$ has a slope $\beta$, a constant term or “intercept” $\alpha$, and an error $\epsilon$
2. Estimates of these are sometimes written $\hat{\beta}$ and $\hat{\alpha}$
3. Once we know $\hat{\beta} = r \times SD(Y)/SD(X)$, we can find $\hat{\alpha}$
4. With $\hat{\beta}$ and $\hat{\alpha}$, we can predict $\hat{Y}$ for any $X$
5. But $\hat{Y}$ will differ from $Y$ by the error, $\epsilon$, and $\epsilon$ can be big
6. If our model is specified well, $\epsilon$ will have mean zero and will jump around randomly. If it doesn’t, then we’re in trouble

To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Best practices include running the cells in order.

In [None]:
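# Install googlesheets4 only if it is missing, then load it
if (!requireNamespace("googlesheets4", quietly = TRUE)) install.packages("googlesheets4")
library(googlesheets4)
gs4_deauth()  # work without logging in to Google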

<h2>1. Review of earlier results with the Pearson heights data</h2>

Here again are 1,078 observations of "fathers" and "sons" from a well-known training dataset based on the historical work of [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson). Please see the Class 03 notebook for more details.

Here is a direct link to the Google Sheets file loaded in the cell below: [Pearson height data.sheets](https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link)

In [None]:
sheet_url = "https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link"

pheights <- read_sheet(sheet_url,
                       range = "B13:D1091")

Calling `head()` provides a useful quick look at the top of the dataset. Calling `dim()` helps us make sure we have the right dataset loaded up in the correct way.

In [None]:
head(pheights)
dim(pheights)

Calculating the Pearson correlation cleanly takes a couple of options. `method = "pearson"` is the default for `cor()`, so it is redundant here, but I'll include it anyway to be explicit. __R__ and other statistical programs tend to get finicky about missing observations: by default, `cor()` returns `NA` if either variable has a missing value, and `use = "complete.obs"` tells it to drop incomplete rows first.

In [None]:
r = cor(pheights$father, pheights$son,
        method = "pearson",
        use = "complete.obs"
       )
r
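
As a minimal sketch of what those two options do, here is `cor()` applied to a pair of toy vectors. The vectors `x` and `y` are invented for illustration and are not the Pearson data.

```r
# Toy vectors, invented for illustration (not the Pearson data)
x <- c(60, 65, 70, 75, NA)   # one missing value
y <- c(62, 66, 69, 73, 71)

r_default  <- cor(x, y)                          # default: use = "everything"
r_complete <- cor(x, y, use = "complete.obs")    # drops the incomplete pair
r_explicit <- cor(x, y, method = "pearson", use = "complete.obs")

print(r_default)     # NA, because of the missing value
print(r_complete)
print(r_explicit)    # same as r_complete: "pearson" is the default method
```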

Let's also calculate the standard deviations of $Y$ and $X$. First, let's calculate the sample size $n$ with `nrow()`. Then we'll use `sd()`, which divides by $n-1$, and apply the sample size correction $\sqrt{(n-1)/n}$ to get the $SD$ that divides by $n$:

In [None]:
n = nrow(pheights)
n

sdY = sd(pheights$son) * sqrt( (n-1)/n )
sdY

sdX = sd(pheights$father) * sqrt( (n-1)/n )
sdX
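
To see what the correction does, here is a small sketch on a toy vector (invented for illustration, not the Pearson data). It compares `sd()` with the $SD$ computed directly as the root mean square of the deviations, and with the corrected version used above.

```r
# Toy vector, invented for illustration (not the Pearson data)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd_sample        <- sd(x)                        # divides by n - 1
sd_pop_direct    <- sqrt(mean((x - mean(x))^2))  # divides by n
sd_pop_corrected <- sd_sample * sqrt((n - 1) / n)

print(sd_sample)
print(sd_pop_direct)      # 2
print(sd_pop_corrected)   # 2, matching the direct calculation
```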

The slope of the $SD$ line is just the ratio of $SD(Y)$ to $SD(X)$:

In [None]:
sdlslope = sdY/sdX
sdlslope

The $SD$ line passes through the point of averages. Here are the averages:

In [None]:
meanY = mean(pheights$son)
meanY

meanX = mean(pheights$father)
meanX

And now here's a trick to find the intercept term in the $SD$ line. We know it runs through $\bar{X},\bar{Y}$ and we know its slope, $b = SD(Y)/SD(X)$. Then:

$$
\bar{Y} = a + b \ \bar{X}
$$
$$
a = \bar{Y} - b \ \bar{X}
$$

In [None]:
sdlint = meanY - sdlslope * meanX
sdlint

Consider this adjustment to the slope of the $SD$ line, $b$:

$$
\beta = r \times b = r \times \frac{SD(Y)}{SD(X)} 
$$

In [None]:
betacoef = r * sdlslope
betacoef

This $\beta$ is the least squares slope coefficient, and it is also equal to the ratio of the covariance of $X$ and $Y$ to the variance of $X$:

$$
\beta = \frac{Cov(X,Y)}{Var(X)}
$$

In [None]:
betacoef = cov(pheights$father,pheights$son)/var(pheights$father)
betacoef
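
A short derivation, sketched here, shows why the two formulas for the slope agree. It uses the definition of the correlation, $r = Cov(X,Y) / (SD(X) \times SD(Y))$, and the fact that $Var(X) = SD(X)^2$:

$$
\beta = r \times \frac{SD(Y)}{SD(X)} = \frac{Cov(X,Y)}{SD(X) \times SD(Y)} \times \frac{SD(Y)}{SD(X)} = \frac{Cov(X,Y)}{SD(X)^2} = \frac{Cov(X,Y)}{Var(X)}
$$

The answer does not depend on whether we divide by $n$ or by $n-1$, as long as we are consistent within each ratio, because the factor cancels.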

As before, we can find the intercept $\alpha$ using our knowledge of the slope and the point of averages:

$$
\bar{Y} = \alpha + \beta \ \bar{X}
$$
$$
\alpha = \bar{Y} - \beta \ \bar{X}
$$

In [None]:
alphacoef = meanY - betacoef * meanX
alphacoef

Finally, here is a scatterplot, now with the $SD$ line superimposed in red and the linear regression line superimposed in blue.

In [None]:
plot(pheights$father, pheights$son,
     main = "Pearson height dataset n = 1,078",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")
lines(c(60, 75), 
      c(sdlint + sdlslope*60, sdlint + sdlslope*75),
      col = "red",
      lwd = 2
     )
lines(c(60, 75), 
      c(alphacoef + betacoef*60, alphacoef + betacoef*75),
      col = "blue",
      lwd = 2
     )

As we discussed in the notebook for Class 06, it turns out that we can also use `lm()` to estimate the blue line using <b>ordinary least squares</b>, which we will return to later in COMPSS 202.

The syntax of `lm()` is as follows, where the funny part with the tilde (~) is the estimation equation, with a tilde instead of an equals sign and no coefficients formally listed:

In [None]:
reg1 <- lm(son ~ father,
          data = pheights)
summary(reg1)
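
As a quick check that `lm()` and the hand calculation agree, here is a sketch on a toy data frame. The names `toy`, `toy_fit`, `beta_hand`, and `alpha_hand` and the values are invented for illustration and are not the Pearson data.

```r
# Toy data, invented for illustration (not the Pearson data)
toy <- data.frame(father = c(60, 62, 65, 68, 70, 72, 75),
                  son    = c(63, 64, 66, 68, 69, 71, 72))

# Slope and intercept by hand, as in the cells above
beta_hand  <- cov(toy$father, toy$son) / var(toy$father)
alpha_hand <- mean(toy$son) - beta_hand * mean(toy$father)

# The same two numbers from lm(); coef() returns a named vector,
# with the intercept first and the slope second
toy_fit <- lm(son ~ father, data = toy)
print(coef(toy_fit))

# TRUE if the two approaches agree up to rounding error
print(all.equal(unname(coef(toy_fit)), c(alpha_hand, beta_hand)))
```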

In the output here, the `Estimate` for `(Intercept)` is $\hat{\alpha}$, our estimate of the constant term $\alpha$, and the `Estimate` for `father` is $\hat{\beta}$, the ordinary least squares estimate of the regression coefficient $\beta$.

Later in the course we will discuss what the `Std. Error` (standard error) and other columns mean. For now: the similarity between "standard deviation" and "standard error" is no accident.

<h2>2. Predictions and errors</h2>

When our ordinary least squares model is simple, with one $Y$ and only one $X$, our estimates $\hat{\beta}$ and $\hat{\alpha}$ based on the Pearson correlation or the covariance are also simple. We have:

$$
Y = \alpha + \beta \ X + \epsilon
$$
and our estimated equation is:
$$
\hat{Y} = \hat{\alpha} + \hat{\beta} \ X
$$
because the average of the $\epsilon$'s equals zero. With these specific data, we have:
$$
\hat{Y} = 33.9 + 0.514 \ X
$$


Suppose we look at the prediction of son's height $\hat{Y}$ when father's height $X = 70$ inches.

In [None]:
son_of_70 = alphacoef + betacoef * 70
son_of_70
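
The same kind of prediction can also be pulled from an `lm()` object with `predict()`. Here is a sketch on toy data; the names `toy`, `toy_fit`, `by_hand`, and `by_predict` and the values are invented for illustration and are not the Pearson data.

```r
# Toy data, invented for illustration (not the Pearson data)
toy <- data.frame(father = c(60, 62, 65, 68, 70, 72, 75),
                  son    = c(63, 64, 66, 68, 69, 71, 72))
toy_fit <- lm(son ~ father, data = toy)

# Prediction by hand at father = 70: intercept + slope * 70
by_hand <- unname(coef(toy_fit)[1] + coef(toy_fit)[2] * 70)

# The same prediction from predict(); the column in newdata must be
# named "father" to match the formula
by_predict <- unname(predict(toy_fit, newdata = data.frame(father = 70)))

print(by_hand)
print(by_predict)
```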

Visually, this occurs where a vertical line at $X = 70$ intersects the blue regression line:

In [None]:
plot(pheights$father, pheights$son,
     main = "Pearson height dataset n = 1,078",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")
lines(c(60, 75), 
      c(alphacoef + betacoef*60, alphacoef + betacoef*75),
      col = "blue",
      lwd = 2
     )
abline(v = 70, 
       col = "red",
       lwd = 2
       )

Does a prediction of son's height equal to $69.9$ given a father's height of $70$ seem small? Or about right?

Some of the deep insights that emerge are that linear regression
* Gives the best straight-line prediction of $Y$ given an $X$, in the least squares sense (and maybe a $Z$ and more, stay tuned)
* Imposes regression to the mean; an $X$ with a big deviation from its mean, measured in $SD$s, gives us a predicted $Y$ whose deviation from its mean, measured in $SD$s, is smaller by a factor of $r$
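
Regression to the mean can be read straight off the formulas we already have. As a sketch, substitute $\hat{\alpha} = \bar{Y} - \hat{\beta} \ \bar{X}$ into $\hat{Y} = \hat{\alpha} + \hat{\beta} \ X$ and use $\hat{\beta} = r \times SD(Y)/SD(X)$:

$$
\hat{Y} - \bar{Y} = \hat{\beta} \ (X - \bar{X}) = r \times \frac{SD(Y)}{SD(X)} \ (X - \bar{X})
$$
$$
\frac{\hat{Y} - \bar{Y}}{SD(Y)} = r \times \frac{X - \bar{X}}{SD(X)}
$$

Because $r$ is never larger than 1 in absolute value, the prediction sits no further from $\bar{Y}$, in $SD$s, than $X$ sits from $\bar{X}$.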

The estimated error terms, also called residuals, can be calculated by hand, like this:

In [None]:
pheights$errors = pheights$son - (alphacoef + betacoef * pheights$father)
hist(pheights$errors)

Or we can recover them from the `lm()` object that we created earlier:

In [None]:
hist(reg1$residuals)
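
One caution, sketched below on toy data: when the regression includes an intercept, least squares residuals average zero and are uncorrelated with $X$ by construction, so those two facts alone do not tell us the model is specified well. What matters is whether any other pattern is left over. The names `toy`, `toy_fit`, and `res` and the values are invented for illustration and are not the Pearson data.

```r
# Toy data, invented for illustration (not the Pearson data)
toy <- data.frame(father = c(60, 62, 65, 68, 70, 72, 75),
                  son    = c(63, 64, 66, 68, 69, 71, 72))
toy_fit <- lm(son ~ father, data = toy)

# residuals() is the extractor function for the residuals of a fitted model
res <- residuals(toy_fit)

print(mean(res))              # zero, up to rounding error
print(cor(toy$father, res))   # zero, up to rounding error
```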

A very good thing to do is examine how the residuals behave across $X$. For that, a scatterplot is most useful:

In [None]:
plot(pheights$father, reg1$residuals,
     main = "Residuals from the regression",
     xlab = "Height of father, X",
     ylab = "Residual")

These are good residuals. They bounce randomly all over the place. They do not follow any obvious pattern across $X$. Good stuff. Bad residuals are anything other than the white noise kind of thing shown here. If the residuals are predictable, that's bad.
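
For contrast, here is a sketch of what bad residuals can look like. It uses simulated data, not the Pearson heights: the true relationship is curved, but we fit a straight line anyway. The seed, the names `bad_fit` and `bad_res`, and the quadratic shape are all assumptions chosen for illustration. The residuals should trace out a U shape across $X$, which is exactly the kind of predictable pattern to watch for.

```r
set.seed(202)  # arbitrary seed so the simulated example is reproducible

# Simulated data: the true relationship between x and y is curved
x <- seq(-3, 3, length.out = 61)
y <- x^2 + rnorm(length(x), sd = 0.3)

bad_fit <- lm(y ~ x)             # straight-line model, misspecified on purpose
bad_res <- residuals(bad_fit)

# Residuals plotted across x, with a reference line at zero
plot(x, bad_res,
     main = "Residuals from a misspecified regression",
     xlab = "X",
     ylab = "Residual")
abline(h = 0, col = "red", lwd = 2)
```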

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>