<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202-SU24/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Class 03</h1>

<h2>An $X$ and a $Y$ in a Scatterplot</h2>

Inspired by [SticiGui Chapter 5](https://www.stat.berkeley.edu/~stark/SticiGui/Text/scatterplots.htm)

<h3>Learning objectives:</h3>

<ol style="margin-top: 0; margin-bottom: 0;">
  <li>Scatterplots show a vertical $Y$ variable plotted against a horizontal $X$ variable and are a very common visualization
  </li>
  <li>When someone plots $Y$ against $X$, their implicit hypothesis is that $X$ might be causing $Y$, that is, $Y = f(X)$, where $f(\cdot)$ is some function
      </li>
      <li>When the spread of $Y$ differs across slices defined by $X$, that is called heteroscedasticity, and it complicates inference
   <ul style="margin-top: 0; margin-bottom: 0;">
      <li>Variances in prices, incomes, and wealth (anything in dollars) usually increase with most $X$’s
      </li>
      <li>Variances in the <b>natural logarithm</b> of these usually do not
      </li>
    </ul>
   </li>
    <li>
        Outliers are visually weird and numerically extreme points. They may reflect measurement error or they may be real, and either way they deserve some thought
    </li>
</ol>
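
The third objective, about dollars versus logarithms, can be illustrated with a small simulation. This is only a sketch on made-up data, not course data: the names `income`, `log_income`, and `slice`, the coefficients, and the choice of four slices of $X$ are all assumptions picked for illustration. It needs only base R.

```r
# Simulated illustration only: made-up data, not from the course.
set.seed(202)

n <- 5000
x <- runif(n, min = 0, max = 20)   # hypothetical X, e.g. years of experience

# Constant spread in logs: log(income) = 10 + 0.05 * x + noise
log_income <- 10 + 0.05 * x + rnorm(n, mean = 0, sd = 0.5)
income <- exp(log_income)          # the same variable measured in dollars

# Cut X into four equal-width slices and compare the spread within each
slice <- cut(x, breaks = c(0, 5, 10, 15, 20), include.lowest = TRUE)

sd_dollars <- tapply(income, slice, sd)      # spread in dollars, by slice
sd_logs    <- tapply(log_income, slice, sd)  # spread in logs, by slice

print(round(sd_dollars))
print(round(sd_logs, 2))
```

In this setup the standard deviation in dollars should grow from the lowest slice of $X$ to the highest, while the standard deviation in logs should stay about the same. That is by construction here; real data need not behave so neatly.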

To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Best practices include running the cells in order.

In [None]:
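# Install and load googlesheets4, which reads Google Sheets into R
install.packages("googlesheets4")
library(googlesheets4)
# Skip Google sign-in; this works because the sheet is shared publicly
gs4_deauth()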

Here are 1,078 observations of "fathers" and "sons" from a well-known training dataset based on the historical work of [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson), who adapted work by his mentor [Francis Galton](https://en.wikipedia.org/wiki/Francis_Galton). I've placed "fathers" and "sons" in quotation marks, because I believe the original samples included mothers and daughters as well, and some of these observations are actually mathematical translations of true observed relationships between mothers and daughters, and possibly between mothers and sons, and fathers and daughters.

Who knows what they might have done with twins or triplets! Taken an integral or something?

The other thing that bears mentioning here is that the history of statistical thinking has many direct and indirect links to <b>eugenics</b>, the shameful, racist ideology and pseudo-science that grew in prominence in the post-Darwinian period of scientific thinking. This is one of the more direct links.

---

<h2>Eugenics is a sad chapter in human thinking</h2>

There is much more to say about this, and the MaCSS curriculum in <i>Ethics, Societal Conflicts, and Data</i> will probably elaborate. In COMPSS 202, I think it's important to:
* Recognize the connections between Galton, Pearson, and others like [Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher) and the eugenics movement
* See the difference between scientific beliefs in heredity and genes, which inform modern medicine and other things, versus repressive and unjust political thinking about "desirable" people, as opposed to traits
* Celebrate the inherent value in all people and work to safeguard their rights

Connections to eugenics are closer than we imagine they are. In the year 2024, we owe it to ourselves and future generations to understand these connections and to do better. For further reading, the [Wikipedia page on eugenics](https://en.wikipedia.org/wiki/Eugenics) offers a decent overview.

---

Despite ugly truths about where these measures and this thinking ultimately led, measurements within human families of characteristics like height, which we know is influenced by genes (nature) as well as by nurture, are useful and visually powerful tools for training our statistical intuition and knowledge.

These measures are also relatively rare. Some of our modern datasets include some measurements of parents or of children, but these also are often self-reported measures, usually from the perspective of the respondent.

Here is a direct link to the Google Sheets file loaded in the cell below: [Pearson height data.sheets](https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link)

In [None]:
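# Link to the publicly shared Google Sheets file
sheet_url <- "https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link"

# Read only cells B13:D1091, the block of the sheet that holds the data
pheights <- read_sheet(sheet_url,
                       range = "B13:D1091")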

Calling `head()` provides a useful quick look at the top of the dataset. These data are in tenths of inches, a nod to how weird the dataset is. Who would ever measure tenths of inches?! What exactly is a deci-inch?

In order to enhance the visuals in a scatter plot, the data have been perturbed with small additions or subtractions. They may originally have been listed in whole inches or down to quarter inches.

In [None]:
head(pheights)

Before we plot anything, we need to decide which variable goes on which axis.

Does it make more sense for the son's height to be the $X$ or horizontal variable, or the $Y$ or vertical variable? Which variable do we think causes the other one?

A shrewd observer might say, "Neither variable causes the other; genes cause them both." That is true, and in social science, we often set aside such details when discussing how an $X$ causes a $Y$. If $X$ were a binary measure of winning a lottery, the causality running from $X$ to an outcome variable would seem a little clearer. In this case, what we mean by one person's height causing another person's height is that the first person's height is a <b>proxy</b> for their genetic contribution.

Because human reproduction requires an egg as well as sperm, there is clearly an <b>omitted</b> third variable we'd like to consider if we could: mother's height.

But in addition to that, we would also like to measure <b>nutrition</b> and <b>sicknesses</b> during the child's life. We would probably like to know <b>birth order</b> and the gender of any siblings, because second-born sons tend to be shorter than their older brothers.

These points can all be true, and they probably do not invalidate our looking at how son's height $Y$ varies with father's height $X$. (The only way in which they would is if some of these omitted variables were also systematically correlated with father's height, which would create what's called <i>omitted variable bias</i>. This is a topic for a more advanced class.)

Let's create a scatterplot. Here's a simple way to do it:

In [None]:
plot(pheights$father, pheights$son,
     main = "Pearson height dataset n = 1,078",
     xlab = "Height of the father in inches",
     ylab = "Height of the son in inches")

It's true that there are some outliers scattered around the periphery here, but nothing that implies measurement problems. What else do you see? Does it look like the variation in $Y$ is roughly the same within vertical slices of the plot?
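
One way to check the vertical-slices question numerically is to compute the standard deviation of $Y$ within bands of $X$. The sketch below is an illustration rather than part of the original analysis: the helper `sd_by_slice()` and its 2-inch default width are hypothetical, and the stand-in data are simulated with made-up parameters so that the block runs on its own.

```r
# Hypothetical helper: spread of y within equal-width vertical slices of x.
sd_by_slice <- function(x, y, width = 2) {
  # Slice edges step across the full range of x in units of `width`
  edges <- seq(floor(min(x)), ceiling(max(x)) + width, by = width)
  slice <- cut(x, breaks = edges, right = FALSE)
  data.frame(
    slice = levels(slice),
    n     = as.vector(table(slice)),         # points in each slice
    sd_y  = as.vector(tapply(y, slice, sd))  # NA when a slice has < 2 points
  )
}

# Stand-in data, simulated with made-up parameters (not the Pearson data)
set.seed(3)
father_sim <- rnorm(1078, mean = 68, sd = 2.7)
son_sim    <- 35 + 0.5 * father_sim + rnorm(1078, mean = 0, sd = 2.4)

slices <- sd_by_slice(father_sim, son_sim)
print(slices)

# With the sheet loaded above, the call would be:
# sd_by_slice(pheights$father, pheights$son)
```

Slices at the edges contain few points, so their standard deviations are noisy. The comparison is most informative across the well-populated slices in the middle.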

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>