<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202-SU24/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Class 01</h1>

<h2>Statistics and Histograms</h2>

Inspired by [SticiGui Chapter 3](https://www.stat.berkeley.edu/~stark/SticiGui/Text/histograms.htm)

Histograms are superb tools for visualizing a distribution of some kind of measure. In this notebook, we will use the simple built-in routine `hist()` in __R__ to plot and examine histograms based on data shown in Chapter 3 of SticiGui and in wave 1 of the [National Longitudinal Study of Adolescent to Adult Health (Add Health)](https://addhealth.cpc.unc.edu/).

<h3>Learning objectives:</h3>

1. Sorted tables, like the board in ["Chutes and Ladders"](https://shop.hasbro.com/en-us/product/chutes-and-ladders-game/1095F835-5056-9047-F548-2F4D0AEF4ACC), quickly reveal percentiles
2. Histograms show the spread in measurements

To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Make sure to run the cells in order.

In [None]:
# These are standard calls to load in packages
install.packages("googlesheets4")
library(googlesheets4)
# This call allows the notebook to skip Google authorization, to access publicly viewable files 
gs4_deauth()

In [None]:
# This is the URL for the public sheet
sheet_url = "https://docs.google.com/spreadsheets/d/1EU25WfiXQrQcLyz2xjqgT8XvI8_IMEG9plPCouy_14g/edit?usp=sharing"

# This call creates a data frame called "gmeasures" containing data from the range shown
gmeasures <- read_sheet(sheet_url,
                        range = "A1:J11")

<h3>1. Looking at a table</h3>

Did Prof. Philip B. Stark perhaps also spend many hours staring at a <i>Chutes and Ladders</i> board? If you have, you may know a shortcut that adults use when playing: add each spin to your current space number. Because the board is a 10x10 grid, you can then jump straight to your next space, or use the sum to double-check your counting when stepping off squares one by one, as kids typically do.

Here I've set up the data in a 10x10 matrix already, having copied it from SticiGui. Typically, data will sit in columns of tables instead, as we will see shortly. Let's look at the table.

In [None]:
# Show the dataframe
gmeasures

Like most languages, __R__ is picky about object types. To sort the table elements from smallest to largest, we first create a matrix containing the elements of the data frame, then we vectorize it, then we sort, and finally we populate a new matrix with the sorted elements. Whew.

In [None]:
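# Convert the data frame to a numeric matrix, then flatten it to a vector
gmeasures_mat <- as.matrix(gmeasures)
gmeasures_vec <- as.vector(gmeasures_mat)
# Sort the 100 values from smallest to largest
sorted_gmeasures_vec <- sort(gmeasures_vec)
# Refill a 10x10 matrix row by row, so each row holds the next 10 values
sorted_gmeasures_mat <- matrix(sorted_gmeasures_vec, nrow = 10, ncol = 10, byrow = TRUE)
sorted_gmeasures_mat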
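
The four steps above can also be folded into one small function. This is only a minimal sketch, assuming the input holds nothing but numeric values; `sort_into_grid()`, `toy_df`, and `toy_grid` are hypothetical names used for illustration and do not come from SticiGui or from this notebook's data.

```r
# Hypothetical helper: sort every value in a numeric data frame (or vector)
# and lay the result out row by row in a grid.
sort_into_grid <- function(x, nrow = 10, ncol = 10) {
  values <- sort(as.vector(as.matrix(x)))   # flatten, then sort ascending
  stopifnot(length(values) == nrow * ncol)  # refuse to recycle or truncate
  matrix(values, nrow = nrow, ncol = ncol, byrow = TRUE)
}

# Toy data standing in for gmeasures: the numbers 100 down to 1 in 10 columns
toy_df <- as.data.frame(matrix(100:1, nrow = 10, ncol = 10))
toy_grid <- sort_into_grid(toy_df)
toy_grid[1, ]   # the first row holds the 10 smallest values
```

The length check is a design choice: `sort()` silently drops missing values, and `matrix()` would then recycle the remaining ones to fill the grid, so stopping early seems safer than returning a misleading table.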

As shown in SticiGui and in the accompanying slides, this 10x10 format is handy for quickly finding key percentiles. Square 10 on the <i>Chutes and Ladders</i> board corresponds to the upper right corner here, which holds the 10th smallest of the 100 values, so it shows the 10th percentile.

Can you find the 90th percentile?

What about the 25th and 75th percentiles? With those in hand, can you calculate the <b>interquartile range (IQR)</b>? The IQR is the difference between the third and first quartiles, a.k.a. the 75th percentile minus the 25th percentile.
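
To check answers read off the table, __R__'s built-in `quantile()` and `IQR()` can help. The sketch below uses simulated toy data rather than the SticiGui values, and it assumes a percentile is read off as the k-th smallest of the 100 values, which corresponds to `type = 1` in `quantile()`. The names `toy_vec`, `toy_grid`, `p10_table`, and `p10_quantile` are hypothetical.

```r
# Toy data standing in for the 100 sorted values (not the SticiGui numbers)
set.seed(202)
toy_vec <- rnorm(100)
toy_grid <- matrix(sort(toy_vec), nrow = 10, ncol = 10, byrow = TRUE)

# Table lookup: the 10th smallest value sits at the end of row 1
p10_table <- toy_grid[1, 10]

# type = 1 picks an actual observation (the inverse of the empirical CDF),
# so with exactly 100 values it agrees with counting squares in the table
p10_quantile <- unname(quantile(toy_vec, probs = 0.10, type = 1))
p10_table == p10_quantile

# All of the key percentiles at once, plus the IQR under the same definition
quantile(toy_vec, probs = c(0.10, 0.25, 0.50, 0.75, 0.90), type = 1)
IQR(toy_vec, type = 1)
```

Note that the default in both functions is `type = 7`, which interpolates between neighboring observations, so the defaults can give slightly different numbers than the table lookup.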

<h3>2. Histograms: The Visualization</h3>

Here is a simple call to the built-in graphics in __R__, where we are using the vector of 100 observations rather than the 10x10 data frame or the matrix.

In [None]:
hist(gmeasures_vec,
     main = "Figure 3-2: Histogram of deviations of g", 
     xlab = "Value", 
     ylab = "Frequency", 
     col = "gray", 
     border = "black")

Feel free to play around with the settings. (I often like to leave simple things simple when I can!)
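
One setting worth trying is `breaks`, which controls the bin edges. Here is a small sketch on simulated toy data (not the SticiGui values); `toy_vec`, `edges`, and `h` are hypothetical names, and the 0.5 bin width and the colors are arbitrary choices.

```r
# Toy data standing in for gmeasures_vec (hypothetical, not the SticiGui values)
set.seed(202)
toy_vec <- rnorm(100)

# Bin edges every 0.5 units, padded outward so they cover the whole range
edges <- seq(floor(min(toy_vec)), ceiling(max(toy_vec)), by = 0.5)

h <- hist(toy_vec,
          breaks = edges,          # exact bin edges instead of the default rule
          main = "Toy histogram with 0.5-wide bins",
          xlab = "Value",
          ylab = "Frequency",
          col = "steelblue",
          border = "white")

# hist() also returns the bin counts, which should add up to the sample size
sum(h$counts)
```

Passing a single number such as `breaks = 20` is also allowed, but __R__ treats it only as a suggestion for the number of bins, whereas a vector of edges is used exactly as given.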

<h3>Add Health data on heights</h3>

The [National Longitudinal Study of Adolescent to Adult Health (Add Health)](https://addhealth.cpc.unc.edu/) is a panel survey of 6,500 Americans who were enrolled in grades 7-12 in 1994-95, and who were mostly born between 1977 and 1982. In a panel survey, participants are reinterviewed periodically. This contrasts with most large-scale government surveys, which tend to interview more people but only once. The Add Health cohort has been reinterviewed about every 5 years, and extensive questions in the survey produce rich data on the developing lives of participants.

Let's examine height in inches, which was self-reported in wave 1. Height and other physical measures could also be measured objectively, by an interviewer with a measuring tape. This also happened in wave 1, and we will examine those data later. 

Here is a vector containing a random selection of 100 observations of self-reported heights among people identified as female. 
Based on the [Data Collection Instrument and User Guide](https://adatawinter.site.wesleyan.edu/files/2017/08/AddHealth-Wave_1_Questionnaire-and-Codebook.pdf), I think Add Health probably began with "biological sex" as recorded for the person by the school, which likely drew from a parental report, and then asked interviewers to verify the measure by asking the respondent whenever they judged it necessary.

In [None]:
sheet_url_ah1 = "https://docs.google.com/spreadsheets/d/1EVrb8li-wZ6UhsItF5jAHq_EKwDUKAjknurpERYpE7Y/edit?usp=sharing"

ah_height_f100 <- read_sheet(sheet_url_ah1,
                             range = "A1:C101")

The __R__ routine `head()` is useful for showing the top part of a data frame. Here, we see that height in inches is in the column `height`.

In [None]:
head(ah_height_f100)

Let's now do the same thing as above to sort the data, taking care to reference `ah_height_f100$height`, which is the `height` column in the data frame.

In [None]:
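# Pull the height column into a one-column matrix, then flatten it to a vector
ah_height_f100_mat <- as.matrix(ah_height_f100$height)
ah_height_f100_vec <- as.vector(ah_height_f100_mat)
# Sort the 100 heights from shortest to tallest
sorted_ah_height_f100_vec <- sort(ah_height_f100_vec)
# Refill a 10x10 matrix row by row
sorted_ah_height_f100_mat <- matrix(sorted_ah_height_f100_vec, nrow = 10, ncol = 10, byrow = TRUE)
sorted_ah_height_f100_mat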

Can you identify useful percentiles like the 10th, 25th, 50th, 75th, and 90th?

Here is a histogram. Because there are only 100 observations, the histogram is not a smooth bell curve.

In [None]:
hist(ah_height_f100$height,
     main = "Heights of 100 female Add Health respondents in wave 1", 
     xlab = "Inches", 
     ylab = "Frequency", 
     col = "gray", 
     border = "black")

And now here's a look at the full dataset, which includes self-reported height observed for 3,309 females.

In [None]:
sheet_url_ah2 = "https://docs.google.com/spreadsheets/d/1ppDSHu7bfJRlXZFg1pQnl8aWJ7oKxp3BjRUMSOWKY1w/edit?usp=sharing"

ah_height_f <- read_sheet(sheet_url_ah2,
                          range = "A1:C3310")

In [None]:
hist(ah_height_f$height,
     main = "Heights of 3,309 female Add Health respondents in wave 1", 
     xlab = "Inches", 
     ylab = "Frequency", 
     col = "gray", 
     border = "black")

This is much more pleasing to the eye. It's like a cross section of heights within ages 13-18, which we might see in a [CDC growth chart](https://www.cdc.gov/growthcharts/data/set2clinical/cj41c072.pdf) as "stature for age" for example.
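
The contrast between the 100-observation and 3,309-observation histograms can be mimicked with simulated data. The sketch below draws from a normal distribution; the mean of 64 inches and standard deviation of 2.7 inches are illustrative guesses rather than estimates from Add Health, and `sim_small`, `sim_large`, and `old_par` are hypothetical names.

```r
# Simulated heights: the mean and standard deviation are illustrative guesses,
# not estimates from Add Health
set.seed(202)
sim_small <- rnorm(100,  mean = 64, sd = 2.7)
sim_large <- rnorm(3309, mean = 64, sd = 2.7)

# Draw the two histograms side by side, then restore the plot layout
old_par <- par(mfrow = c(1, 2))
hist(sim_small, main = "100 simulated heights",   xlab = "Inches", col = "gray")
hist(sim_large, main = "3,309 simulated heights", xlab = "Inches", col = "gray")
par(old_par)
```

Both samples come from the same distribution, so any difference in smoothness between the two panels is due to sample size alone.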

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>