<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202-SU24/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Class 05</h1>

<h2>Correlation</h2>

Inspired by [SticiGui Chapter 7](https://www.stat.berkeley.edu/~stark/SticiGui/Text/correlation.htm) and [SticiGui Chapter 8](https://www.stat.berkeley.edu/~stark/SticiGui/Text/computeR.htm).

<h3>Learning objectives:</h3>

1. The correlation coefficient $r$ or $R$, also called the Pearson correlation, measures the direction and strength of the linear association between $Y$ and $X$: whether they move together, and how tightly
2. The correlation has a sign and is always bounded: $-1 \leq r \leq 1$
3. The familiar metric for regression fit, $R^2$, is just the square of this $r$ when there is only one $X$. The correlation is also related to the slope of the regression line: the two always share the same sign
4. To calculate $r$, just use a statistical program. But to build intuition, we will calculate normalized deviations from the mean
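
As a rough sketch of the calculation behind objectives 3 and 4: write $\bar{X}$ and $\bar{Y}$ for the sample means and $s_X$ and $s_Y$ for the sample standard deviations (this notation is introduced here for illustration and uses the $n-1$ convention that __R__'s `sd()` follows). The correlation is then the average product of the normalized deviations:

$$
r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right)
$$

Replacing $n-1$ with $n$ in both the average and the standard deviations gives the same $r$, because the factors cancel. With a single $X$, the slope of the regression line of $Y$ on $X$ is

$$
b = r\,\frac{s_Y}{s_X}
$$

and the regression fit is $R^2 = r^2$.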

To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Best practices include running the cells in order.

In [None]:
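# Install googlesheets4 only if it is missing, then load it
if (!requireNamespace("googlesheets4", quietly = TRUE)) install.packages("googlesheets4")
library(googlesheets4)
gs4_deauth()  # skip Google sign-in; this works for sheets readable by anyone with the link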

Here again are 1,078 observations of "fathers" and "sons" from a well-known training dataset based on the historical work of [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson). Please see the Class 03 notebook for more details.

Here is a direct link to the Google Sheets file loaded in the cell below: [Pearson height data.sheets](https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link)

In [None]:
sheet_url <- "https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link"

# Read columns B to D, rows 13 to 1091: one header row plus 1,078 observations
pheights <- read_sheet(sheet_url,
                       range = "B13:D1091")

Calling `head()` provides a useful quick look at the top of the dataset.

In [None]:
head(pheights)

Let's create a scatterplot. Here's a simple way to do it:

In [None]:
plot(pheights$father, pheights$son,
     main = "Pearson height dataset n = 1,078",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")

Calculating the Pearson correlation cleanly takes a couple of options. `method = "pearson"` is the default for `cor()`, so it is redundant here, but I'll include it anyway to be explicit. __R__ and other statistical programs get finicky about missing observations: by default, `cor()` returns `NA` if either variable contains a missing value. Setting `use = "complete.obs"` drops the incomplete pairs before calculating.
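
Here is a minimal sketch of what these two options do, using a few made-up numbers rather than the Pearson data. The vectors `x` and `y` and the result names are hypothetical and exist only for this illustration.

```r
# Toy data invented for illustration; NOT from the Pearson dataset
x <- c(60, 62, 64, 66, 68, NA)
y <- c(61, 65, 64, 68, 69, 70)

# Default (use = "everything"): a single NA makes the whole result NA
cor(x, y)

# Drop the incomplete pair, then correlate the remaining five pairs
r_complete <- cor(x, y, use = "complete.obs")
r_complete

# method = "pearson" is the default, so naming it changes nothing
r_pearson <- cor(x, y, method = "pearson", use = "complete.obs")
identical(r_complete, r_pearson)
```

The first call shows why the option matters: without it, one missing height would be enough to hide the correlation entirely.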

In [None]:
r = cor(pheights$father, pheights$son,
        method = "pearson",
        use = "complete.obs"
       )
r
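
To build intuition for what `cor()` is doing, here is a sketch of the same calculation by hand, using normalized deviations from the mean. The six pairs of heights are made up for illustration and are not taken from the Pearson data; the names `zx`, `zy`, `r_by_hand`, and `fit` are likewise just for this example.

```r
# Toy heights in inches, invented for illustration; NOT the Pearson data
x <- c(63, 65, 66, 68, 70, 72)   # hypothetical fathers
y <- c(66, 65, 68, 69, 69, 72)   # hypothetical sons
n <- length(x)

# Normalized deviations: distance from the mean, in standard deviations
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# sd() divides by n - 1, so the products are averaged with n - 1 as well
r_by_hand <- sum(zx * zy) / (n - 1)
r_by_hand

# Compare with the built-in function
all.equal(r_by_hand, cor(x, y))

# With a single X, the regression slope is r * sd(y) / sd(x), and R^2 is r^2
fit <- lm(y ~ x)
all.equal(unname(coef(fit)["x"]), r_by_hand * sd(y) / sd(x))
all.equal(summary(fit)$r.squared, r_by_hand^2)
```

Each product `zx * zy` is positive when a pair sits on the same side of both means and negative otherwise, so $r$ is large and positive when most pairs are above average together or below average together.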

---

To build intuition, let's now examine just the first 100 observations in the dataset. I chose this subsample because it looks different from the full sample of 1,078 observations. As we will see, larger samples typically bring sample statistics such as the mean and the correlation closer to their population values, while smaller samples can contain a lot of noise.

First step: the visualization.

In [None]:
# Keep the rows whose observation number `num` is 100 or less
pheights100 = subset(pheights, num <= 100)
plot(pheights100$father, pheights100$son,
     main = "Pearson height dataset n = 100",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")

It definitely doesn't look the same, does it? Describe what you see.

Now let's calculate the correlation coefficient:

In [None]:
r100 = cor(pheights100$father, pheights100$son,
           method = "pearson",
           use = "complete.obs"
           )
r100

The plot shows a less clear positive relationship than the full sample does, and the lower $r$ confirms it.
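
One way to see why a subsample of 100 can give a different $r$ than the full sample is a small simulation. This is only a sketch under stated assumptions: the data are simulated rather than taken from the Pearson dataset, the population correlation `rho = 0.5` is an arbitrary value chosen for illustration, and the helper `sample_r()` is hypothetical. The first 100 rows of a dataset are also not a random draw, so treat this as an analogy rather than an explanation of this particular subsample.

```r
set.seed(202)  # arbitrary seed so the sketch is reproducible

# Draw n pairs whose population correlation is rho, and return the sample r
sample_r <- function(n, rho) {
  x <- rnorm(n)
  y <- rho * x + sqrt(1 - rho^2) * rnorm(n)
  cor(x, y)
}

rho <- 0.5  # assumed population correlation, chosen only for illustration

# 500 simulated samples at each sample size
r_small <- replicate(500, sample_r(100, rho))
r_large <- replicate(500, sample_r(1000, rho))

# The spread of r across samples shrinks as n grows
sd(r_small)
sd(r_large)
range(r_small)
range(r_large)
```

Comparing the two ranges shows how far a single sample of 100 can land from the population value, even when nothing about the underlying relationship has changed.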

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>