<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202-SU24/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Guided Project 1</h1>

<h2>"He's So Tall and Educated as Hell"</h2>

Inspired by [this person you've probably never heard of](https://en.wikipedia.org/wiki/Taylor_Swift) and her lyrics in [this song](https://en.wikipedia.org/wiki/Wildest_Dreams_(Taylor_Swift_song))

Feel free to work together in teams of 2-4 students. Please answer each question below in your own original and complete sentences. Questions appear in <b><font color="red">bold red</font></b>. Be parsimonious: brief while also hitting the main points. Scores on this project are awarded for complete answers, whether or not those answers turn out to be correct.

<b>This exercise is intended to blend familiarity with confusion.</b> Some questions have clear answers. Others have many answers, some of which are probably better answers than others, but all might still be correct in one way or another.


To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Best practices include running the cells in order.

In [None]:
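# Install googlesheets4 only if it is not already available, then load it
if (!requireNamespace("googlesheets4", quietly = TRUE)) install.packages("googlesheets4")
library(googlesheets4)
gs4_deauth()  # skip Google sign-in; the sheet is read without authentication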

<h3>The data</h3>

Consider the following dataset from the 8th wave in 2006 of the [U.S. Health and Retirement Study (HRS)](https://hrs.isr.umich.edu/about). It includes observations on objectively measured height and weight, and on years of education, for 2,063 male-female coupled dyads. Males are coded as "respondents," whose variables start with the prefix "r," and females are coded as "spouses," with an "s." The next character in the variable name indexes the wave: `r8heightbio` is the respondent's objective height measured in wave 8, while an `a` in that position, as in `ragender` for the respondent's gender, marks a variable that is not tied to a single wave. Height and weight were measured by an interviewer during an enhanced face-to-face interview.

Because males and females tend to have different heights, I dropped 24 same-sex dyads from the dataset to simplify the analysis. Same-sex couples are interesting and equally deserving of research focus, but their numbers are too small within the original HRS sample for us to examine them.

I'll refer to the males as "husbands" and the females as "wives."

<b>To view the Sheets file, click here:</b> [HRS wave 8 hcouples.sheets](https://docs.google.com/spreadsheets/d/1FRsXnmTQWjZF5FdM2sPdA5Z91TkmXzvP0r8tA6bDTAM/edit?usp=sharing)

In [None]:
sheet_url = "https://docs.google.com/spreadsheets/d/1FRsXnmTQWjZF5FdM2sPdA5Z91TkmXzvP0r8tA6bDTAM/edit?usp=sharing"

# Read the header row plus 2,063 couples (rows 2 to 2064), columns A through P
h8hcpl <- read_sheet(sheet_url,
                     range = "A1:P2064")

First let's quickly examine the top of the dataset, examine its dimensions, and assign $n$ to be the number of rows, which is the sample size.

In [None]:
head(h8hcpl)
dimensions = dim(h8hcpl)
n = dimensions[1]
n

---

Run the code below to generate a histogram of the heights in inches of wives in the data.

In [None]:
hist(h8hcpl$s8heightbio_in,
    main = "Histogram of wives' heights in wave 8 of HRS",
    xlab = "Objective height in inches"
    )

<b><font color = "red">1(a). Describe the distribution of wives’ heights. Discuss its shape, referencing the right and left tail when appropriate. Take a guess at the mean.</font></b>

<b><font color = "red">1(b). Calculate the mean of wife's height by completing the code below.</font></b>

In [None]:
meanheight_wives = mean(h8hcpl$s8heightbio_in)
meanheight_wives

---

Run the code below to generate a histogram of the heights in inches of husbands in the data.

In [None]:
hist(h8hcpl$r8heightbio_in,
    main = "Histogram of husbands' heights in wave 8 of HRS",
    xlab = "Objective height in inches"
    )

<b><font color = "red">2(a). Describe the distribution of husbands’ heights. Discuss its shape, referencing the right and left tail when appropriate. Take a guess at the mean.</font></b>

<b><font color = "red">2(b). Calculate the mean of husband's height by completing the code below.</font></b>

In [None]:
meanheight_husbands = mean(h8hcpl$r8heightbio_in)
meanheight_husbands

---

Run the code below to generate a scatterplot of husbands' heights ($Y$) as a function of wives' heights ($X$), with a horizontal line and a vertical line running through the point of averages.

In [None]:
plot(h8hcpl$s8heightbio_in, h8hcpl$r8heightbio_in, 
     main = "Scatterplot of heights in HRS couples",
     xlab = "Objective height of wife in inches",
     ylab = "Objective height of husband in inches"
     )
abline(v = meanheight_wives, 
       col = "red", 
       lwd = 2
      )
abline(h = meanheight_husbands,
       col = "blue", 
       lwd = 2
      ) 

<b><font color = "red">3. Discuss what you see in the scatterplot. Do you see a positive association between $Y$ and $X$? Or a negative association? Or no association? Take a stand and briefly describe.</font></b>

Complete the code below and run it in order to find the Pearson correlation coefficient $r$; the standard deviation of $Y$, $SD(Y)$; the standard deviation of $X$, $SD(X)$; and the regression coefficient $\beta = r \times SD(Y)/SD(X)$.

In [None]:
r = cor(..., ...)
r

sdy = sd(...) * sqrt((n-1)/n)
sdy

sdx = sd(...) * sqrt((n-1)/n)
sdx

betacoef = r * sdy / sdx
betacoef
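
As a sanity check on the formula, the sketch below uses a small made-up dataset (the vectors `x_toy` and `y_toy` are hypothetical, not HRS data) to show that $r \times SD(Y)/SD(X)$ matches the slope reported by `lm()`. It also shows that the $\sqrt{(n-1)/n}$ adjustment, which rescales R's divide-by-$(n-1)$ standard deviation to the divide-by-$n$ version, cancels in the ratio.

```r
# Hypothetical toy data, not from the HRS
x_toy <- c(60, 62, 63, 65, 66, 68)
y_toy <- c(66, 69, 68, 71, 70, 73)
n_toy <- length(x_toy)

r_toy <- cor(x_toy, y_toy)

# sd() divides by n - 1; rescale to the divide-by-n (population) SD
sdx_toy <- sd(x_toy) * sqrt((n_toy - 1) / n_toy)
sdy_toy <- sd(y_toy) * sqrt((n_toy - 1) / n_toy)

beta_toy <- r_toy * sdy_toy / sdx_toy

# Slope from an ordinary least squares fit of y on x
slope_lm <- unname(coef(lm(y_toy ~ x_toy))[2])

# The two agree up to floating-point error
all.equal(beta_toy, slope_lm)

# The rescaling factor cancels, so the unadjusted SDs give the same slope
all.equal(beta_toy, r_toy * sd(y_toy) / sd(x_toy))
```

The adjustment therefore matters for the reported values of $SD(Y)$ and $SD(X)$ themselves, but not for the regression coefficient.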

---

<b><font color = "red">4. Think about what $Y$ and $X$ actually are. Does $X$ cause $Y$? Or does $Y$ cause $X$? Discuss, both in a literal sense and also figuratively (i.e., abstractly, about what Y and X represent rather than just their literal meaning).</font></b>

---

Run the code below to generate a histogram of the years of education among wives in the data.

In [None]:
hist(h8hcpl$s8edyrs,
    main = "Histogram of wives' years of education in wave 8 of HRS",
    xlab = "Years of education"
    )

<b><font color = "red">5(a). Describe the distribution of wives’ schooling. Discuss its shape, referencing the right and left tail when appropriate. Take a guess at the mean.</font></b>

<b><font color = "red">5(b). Calculate the mean of wives' years of education by completing the code below. Note that there are missing values coded as NA in the education data, so you must deal with them inside the call to `mean()`.</font></b>

In [None]:
meanedyrs_wives = mean(h8hcpl$s8edyrs, na.rm = TRUE)
meanedyrs_wives
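
A minimal illustration of why `na.rm = TRUE` is needed, using a made-up vector (`edyrs_toy` is hypothetical, not HRS data). Without it, a single missing value makes the whole mean missing.

```r
# Hypothetical toy data with two missing values, not from the HRS
edyrs_toy <- c(12, 16, NA, 14, 12, NA, 18)

mean_default <- mean(edyrs_toy)                # NA: missing values propagate
mean_observed <- mean(edyrs_toy, na.rm = TRUE) # mean of the observed values only

n_missing <- sum(is.na(edyrs_toy))             # how many values are missing
n_observed <- sum(!is.na(edyrs_toy))           # how many values enter the mean

mean_default
mean_observed
n_missing
n_observed
```

Counting the missing values first is a quick way to see how much of the sample the mean actually describes.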

---

Run the code below to generate a histogram of the years of education among husbands in the data.

In [None]:
hist(h8hcpl$raedyrs,
    main = "Histogram of husbands' years of education in wave 8 of HRS",
    xlab = "Years of education"
    )

<b><font color = "red">6(a). Describe the distribution of husbands’ schooling. Discuss its shape, referencing the right and left tail when appropriate. Take a guess at the mean.</font></b>

<b><font color = "red">6(b). Calculate the mean of husbands' years of education by completing the code below. Note that there are missing values coded as NA in the education data, so you must deal with them inside the call to `mean()`.</font></b>

In [None]:
meanedyrs_husbands = mean(h8hcpl$raedyrs, na.rm = TRUE)
meanedyrs_husbands

---

Run the code below to generate a scatterplot of husbands' height ($Y$) as a function of wives' education ($X$), with a horizontal line and a vertical line running through the point of averages.

In [None]:
plot(h8hcpl$s8edyrs, h8hcpl$r8heightbio_in, 
     main = "Scatterplot of husband's height and wife's education in HRS couples",
     xlab = "Wife's years of education",
     ylab = "Objective height of husband in inches"
     )
abline(v = meanedyrs_wives, 
       col = "red", 
       lwd = 2
      )
abline(h = meanheight_husbands,
       col = "blue", 
       lwd = 2
      ) 

<b><font color = "red">7. Discuss what you see in the scatterplot. Do you see a positive association between $Y$ and $X$? Or a negative association? Or no association? Take a stand and briefly describe.</font></b>

---

Complete the code below and run it in order to find the Pearson correlation coefficient $r$; the standard deviation of $Y$, $SD(Y)$; the standard deviation of $X$, $SD(X)$; and the regression coefficient $\beta = r \times SD(Y)/SD(X)$ for this $X$ and this $Y$.

Note the funny change in syntax for `cor()` when there are missing values: instead of `na.rm = TRUE`, it takes `use = "complete.obs"`. :/

In [None]:
r_eh = cor(..., ..., use = "complete.obs")
r_eh

sdy_h = sd(..., na.rm = TRUE) * sqrt((n-1)/n)
sdy_h

sdx_e = sd(..., na.rm = TRUE) * sqrt((n-1)/n)
sdx_e

betacoef_eh = r_eh * sdy_h / sdx_e
betacoef_eh
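
To see what `use = "complete.obs"` does, here is a sketch on made-up data (`x_toy` and `y_toy` are hypothetical, not HRS data). By default `cor()` returns `NA` when either vector has a missing value. With `use = "complete.obs"` it drops every pair in which either value is missing, which is the same as filtering the pairs by hand.

```r
# Hypothetical toy data with missing values in x, not from the HRS
x_toy <- c(12, NA, 16, 14, NA, 18, 10)
y_toy <- c(68, 70, 71, 69, 72, 73, 67)

r_default <- cor(x_toy, y_toy)                    # NA, because x has missing values
r_toy <- cor(x_toy, y_toy, use = "complete.obs")  # uses complete pairs only

# Same result after dropping incomplete pairs by hand
keep <- complete.cases(x_toy, y_toy)
r_manual <- cor(x_toy[keep], y_toy[keep])
all.equal(r_toy, r_manual)

n_used <- sum(keep)                               # number of pairs actually used
n_used
```

The number of pairs actually used can be smaller than the number of rows, which is worth keeping in mind when interpreting $r$.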

---

<b><font color = "red">8. Think about what $Y$ and $X$ actually are. Does $X$ cause $Y$? Or does $Y$ cause $X$? Discuss, both in a literal sense and also figuratively (i.e., abstractly, about what Y and X represent rather than just their literal meaning).</font></b>

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>