<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Class 02</h1>

<h2>Location and Spread</h2>

Inspired by [SticiGui Chapter 4](https://www.stat.berkeley.edu/~stark/SticiGui/Text/location.htm)


<h3>Learning objectives:</h3>

<ol style="margin-top: 0; margin-bottom: 0;">
  <li>The mean is what most of us know as the average
  </li>
  <li>The median is the middle measure, a.k.a. the 50th percentile
    <ul style="margin-top: 0; margin-bottom: 0;">
      <li>It can be the same as the mean
      </li>
      <li>Or if the distribution has a long right or left tail, it will be different
      </li>
    </ul>
  </li>
  <li>The mode is the highest peak of the distribution
   <ul style="margin-top: 0; margin-bottom: 0;">
      <li>Like the median, it can be above, the same as, or below the mean
      </li>
      <li>A distribution can have more than one peak, or be flat without any peaks
      </li>
    </ul>
   </li>
    <li>
        <a href=#section2>The standard deviation is a common, useful measure of variation around the mean</a>
    </li>
</ol>

To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Best practices include running the cells in order.

In [None]:
install.packages("googlesheets4")
library(googlesheets4)
gs4_deauth()

<h2>1. Looking at distributions</h2>

A good place to start is by using the best statistical tool we have: your eyes.

Here are data on 4,963 people measured in wave 4 of the [National Longitudinal Study of Adolescent to Adult Health (Add Health)](https://addhealth.cpc.unc.edu/), which are also viewable in Google Sheets in [addhealth_height_weight_bmi.sheets](https://docs.google.com/spreadsheets/d/1j7vrQ-wxB8HJFFWurUt9Ywfgbh_JX1S2IK4wgXFkHCQ/edit?usp=sharing). This extract includes measures of "sex" (which probably is gender identity mashed into a binary metric), and self-reports and objective measures of height, weight, and body mass index (BMI), which is weight in kilograms divided by the square of height in meters.
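
As a worked illustration of the BMI formula, take a hypothetical person (not drawn from the Add Health data) who weighs 70 kilograms and stands 1.75 meters tall:

$$
\text{BMI} = \frac{\text{weight in kg}}{(\text{height in m})^2} = \frac{70}{1.75^2} = \frac{70}{3.0625} \approx 22.9
$$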

These measures are collected in wave 4 of the Add Health panel study. Wave 4 was mostly completed in 2008, which is when respondents, born between 1977 and 1982, were between 26 and 31 years old. Add Health includes both men and women, and later waves and modules also explored more modern measures of gender identity.

Objective height was originally measured in tenths of centimeters, which I converted to inches with a scaling factor of $0.393701$. Self-reported or subjective height `r4height` was recorded in feet and inches, which I converted to inches.
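
Here is a minimal sketch of those two conversions. The input values are made up for illustration, the object names (`height_cm`, `feet`, `inches`) are hypothetical rather than columns in the Add Health extract, and the sketch assumes the raw objective value is expressed in centimeters.

```r
# Hypothetical objective height in centimeters (not from the data)
height_cm <- 165.1

# Centimeters to inches, using the scaling factor 0.393701 inches per cm
height_in_objective <- height_cm * 0.393701
height_in_objective

# Hypothetical self-report of 5 feet 5 inches
feet <- 5
inches <- 5

# Feet and inches to inches: 12 inches per foot
height_in_self <- 12 * feet + inches
height_in_self
```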

In [None]:
sheet_url0 = "https://docs.google.com/spreadsheets/d/1j7vrQ-wxB8HJFFWurUt9Ywfgbh_JX1S2IK4wgXFkHCQ/edit?usp=sharing"

ah_bmi <- read_sheet(sheet_url0,
                     range = "A1:I4964")

In [None]:
head(ah_bmi)

Let's plot a histogram of objective height, which is the `r4heightbio` column inside the data frame `ah_bmi`. We reference a column with `$`. The built-in `hist()` function works perfectly well and is parsimonious.

(The cool kids use `ggplot2`, I believe. But I find the syntax confusing and dense, not really worth the beauty of the picture in exchange for more opaque pedagogy!)

In [None]:
hist(ah_bmi$r4heightbio,
    main = "Histogram of Add Health w4 heights in inches",
    xlab = "Objective height in inches"
    )

As you can see, the distribution of height in the Add Health is not a perfect normal curve, probably because here we have mixed everyone together. Visually, the histogram looks like it has a longer right tail, and this might be because males are in the right tail. With the code below, we can look just at females, coded with `r4sex == 2` in the data:

In [None]:
hist(subset(ah_bmi,r4sex==2)$r4heightbio,
    main = "Histogram of Add Health w4 heights in inches, females",
    xlab = "Objective height in inches"
    )

The mode of this distribution is fairly easy to see: it is the highest peak, near 65 inches. The mean and median might be pretty similar to that, but we can also see some deviations from symmetry across the mode, evidence of a slightly longer right tail.

It can be helpful to overlay what's called a [kernel density estimate](https://en.wikipedia.org/wiki/Kernel_density_estimation), a smoothed line that shows an estimate of the probability density function likely to be responsible for the draws. Here is some code that does that:

In [None]:
hist(subset(ah_bmi,r4sex==2)$r4heightbio,
    main = "Histogram of Add Health w4 heights in inches, females",
    xlab = "Objective height in inches",
    probability = TRUE,
    ylim = c(0, 0.15)
    )
lines(density(subset(ah_bmi,r4sex==2)$r4heightbio), 
     lwd = 2)

There is still something funny near the mode, but the distribution looks fairly normal-ish. Let's now also look at the mean and median of this distribution:

In [None]:
mean(subset(ah_bmi,r4sex==2)$r4heightbio)
median(subset(ah_bmi,r4sex==2)$r4heightbio)

These are almost identical.
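
For comparison with the mean and median, one rough way to put a number on the mode is to find where a kernel density estimate peaks. The sketch below uses simulated heights (the object `sim_height` and its parameters are assumptions for illustration, not the Add Health data), and the answer will depend on the bandwidth that `density()` picks.

```r
# Simulated, roughly normal "heights" in inches (illustrative values only)
set.seed(202)
sim_height <- rnorm(5000, mean = 65, sd = 2.8)

# Kernel density estimate with the default bandwidth
kde <- density(sim_height)

# The estimated mode is the x value where the density curve is highest
mode_est <- kde$x[which.max(kde$y)]

# For symmetric data like these, all three should be close together
c(mean = mean(sim_height), median = median(sim_height), mode = mode_est)
```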

---

Now let's examine objective body mass index, or BMI. 

In [None]:
hist(subset(ah_bmi,r4sex==2)$r4bmibio,
    main = "Histogram of Add Health w4 BMI, females",
    xlab = "Objective BMI in kg/m2",
    probability = TRUE
    )
lines(density(subset(ah_bmi,r4sex==2)$r4bmibio), 
     lwd = 2)

Now this looks like a lognormal distribution, with a very clearly longer right tail.

Lognormal distributions are common in social science. A good rule of thumb is that most things denominated in currency (like U.S. dollars) will be roughly lognormally distributed. Here, BMI also looks roughly lognormal. Taking the natural log of the variable and then plotting the distribution is likely to produce a curve that looks more symmetric:

In [None]:
ah_bmi$r4logbmibio = log(ah_bmi$r4bmibio)

hist(subset(ah_bmi,r4sex==2)$r4logbmibio,
    main = "Histogram of Add Health w4 log BMI, females",
    xlab = "Log objective BMI",
    probability = TRUE
    )
lines(density(subset(ah_bmi,r4sex==2)$r4logbmibio), 
     lwd = 2)

An improvement, but still a longer right tail.
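
To see why the log transform helps with a long right tail, here is a small sketch with simulated data. The draws come from `rlnorm()` with parameters chosen to look loosely BMI-like; they are assumptions for illustration, not estimates from Add Health.

```r
# Simulated right-skewed data (parameters are illustrative assumptions)
set.seed(202)
sim_bmi <- rlnorm(5000, meanlog = log(27), sdlog = 0.25)

# A long right tail pulls the mean above the median
gap_raw <- mean(sim_bmi) - median(sim_bmi)
gap_raw

# After taking logs, the mean and median nearly coincide
gap_log <- mean(log(sim_bmi)) - median(log(sim_bmi))
gap_log
```

With exactly lognormal draws the logged values are symmetric by construction. Real data need not cooperate so neatly.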

---

<h2>2. A close look at the standard deviation</h2><a name='section2'/>

The standard deviation is a very useful measure in mathematical or formal statistics. It's also a useful empirical measure, but its key characteristic is probably how analytically useful it is. It will show up again and again during our COMPSS 202 journey. 

Let's look at a very small, simple dataset. If you see an error message below, you could try running the code at the top of this notebook again.

In [None]:
sheet_url = "https://docs.google.com/spreadsheets/d/14Fms0CPsWj2eedSucH_WmUeOua3qkCOA8pMYfelAYWI/edit?usp=sharing"

# Here are fictional income data for members of the 2003 teen show "The O.C." 
oc_income <- read_sheet(sheet_url,
                        range = "A1:F4")

# A transposed matrix is here
oc_income_t <- read_sheet(sheet_url,
                          range = "A7:D12")

Because these matrices are small, we can just call them to see their contents. The first matrix has people (characters) along the columns and years along the rows:

In [None]:
oc_income

And the transposed matrix has years along the columns and people along the rows. Like any program worth its salt, __R__ can transpose things on its own. But I found it expeditious just to pull in the transposed data rather than fuss with changing labels of columns.

In [None]:
oc_income_t

The sample size $n$ is the number of people, which is the number of rows in the transposed data, equal to 5.

In [None]:
n = nrow(oc_income_t)
n

The average income in a year across people is the sum divided by $n$:

$$
\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
$$

Here is a long way to write it out by hand in __R__:

In [None]:
mean_2016 = (1/n) * (100 + 200 + 300 + 400 + 500)
mean_2016

Here is another way, using the matrix elements. Notice that the answer comes back as a data frame rather than a plain numeric.

In [None]:
mean_2017 = (oc_income_t[1,3] + 
             oc_income_t[2,3] +
             oc_income_t[3,3] +
             oc_income_t[4,3] +
             oc_income_t[5,3]
            )/n
mean_2017

And another way for the 2017 data is by calling `sum()` on the column, which returns a numeric:

In [None]:
mean_2017 = (sum(oc_income_t$y2017)
            )/n
mean_2017
class(mean_2017)

We can also use the __R__ function `mean()` called on the column:

In [None]:
mean_2018 = mean(oc_income_t$y2018)
mean_2018

---

The formula for the standard deviation looks very similar to the formula for the mean, except that it is the <i>square root</i> of the variance of the sample, which is the average of the squared deviations from the mean:

$$
SD(x) = \sqrt{
\frac{1}{n} \sum_{i=1}^n 
\left(
x_i - \bar{x}
\right)^2
}
$$

As before, there are multiple roads to travel. For me, an easy batch of code uses __R__'s ability to subtract a numeric from a column element-by-element into a new (temporary) column:

In [None]:
sd_2016 = ( (1/n) * (sum( (oc_income_t$y2016 - mean_2016)^2 ) ) )^0.5
sd_2016
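
Written out by hand, and assuming the `y2016` column holds the same five values typed out earlier (100, 200, 300, 400, 500, with mean 300), the calculation is:

$$
SD(x) = \sqrt{\frac{1}{5}\left[(-200)^2 + (-100)^2 + 0^2 + 100^2 + 200^2\right]} = \sqrt{\frac{100000}{5}} = \sqrt{20000} \approx 141.4
$$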

---

<h2><div style="text-align: left"> <span style="font-family:Papyrus; " >And then there came GREAT CONFUSION across the land...</span></div></h2>

Unfortunately, while this is indeed the standard deviation of the data in our sample, pretty much all statistical programs will produce something a little different, called the <b>sample standard deviation</b>, which divides by $n-1$ rather than $n$ and is more handy for calculating confidence intervals and other things. 

Observe what happens when we call `sd()` in __R__:

In [None]:
sd(oc_income_t$y2016)

This is definitely different from what we found. Fortunately there is a quick fix: rescaling by the square root of the ratio of $n-1$ to $n$. This works because `sd()` gives us the sample standard deviation, which is the following measure:

$$
s(x) = \sqrt{
\frac{1}{n-1} \sum_{i=1}^n 
\left(
x_i - \bar{x}
\right)^2
}
$$

Comparing these two reveals that

$$
SD(x) = \sqrt{ \frac{n-1}{n} }\times s(x)
$$

In [None]:
sd(oc_income_t$y2016) * sqrt( (n-1)/n )

sd_2016

This fix reproduces the standard deviation as we are using it.
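
If you need this often, one option is to wrap the calculation in a small helper. The function name `pop_sd` is made up here (it is not part of base __R__), and the sketch assumes a numeric vector with no missing values.

```r
# Hypothetical helper: standard deviation that divides by n, not n - 1
pop_sd <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

# The five 2016 incomes typed out by hand earlier
x <- c(100, 200, 300, 400, 500)
n_x <- length(x)

pop_sd(x)                        # divides by n
sd(x) * sqrt((n_x - 1) / n_x)    # rescaled sd(), should match
```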

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>