<img src="https://raw.githubusercontent.com/ryanedw/COMPSS-202-SU24/main/Images/UCB-macss.jpg" width="120" align="right"/>
<h1>COMPSS 202 Class 13</h1>

<h2>Estimating Parameters</h2>

Inspired by [SticiGui Chapter 25](https://www.stat.berkeley.edu/~stark/SticiGui/Text/estimation.htm) and by [SticiGui Chapter 26](https://www.stat.berkeley.edu/~stark/SticiGui/Text/confidenceIntervals.htm)

<h3>Learning objectives:</h3>

<ol style="margin-top: 0; margin-bottom: 0;">
  <li>Suppose you’ve estimated a sample mean or proportion. Bias and sampling error can push that estimate above or below the true value.
  </li>
  <li>The standard error of an estimator is the standard deviation of that estimator, and when $x$ is continuous, $SE(\bar{x}) = \frac{s}{\sqrt{n}}=
\frac{1}{\sqrt{n}} 
\sqrt{
\frac{1}{n-1}
\sum_{i=1}^n
\left(x_i-\bar{x}
\right)^2
}$
  </li>
  <li>A confidence interval of about ±2 standard errors on either side of the sample mean is built so that roughly 95% of such intervals capture the true mean. Statistical programs usually use 1.96, a critical value of the standard normal distribution, or a slightly larger critical value from the t distribution
  </li>
</ol>
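
As a quick check on the third objective, here is a minimal sketch of where the multiplier comes from. It assumes a two-sided 95% interval and uses $n = 1,078$, the sample size in this notebook; the names `z_crit` and `t_crit` are illustrative and not used elsewhere.

```r
# Critical values for a two-sided 95% interval.
z_crit <- qnorm(0.975)               # standard normal critical value
t_crit <- qt(0.975, df = 1078 - 1)   # t critical value with n - 1 degrees of freedom

z_crit
t_crit
```

With a sample this large the two values are nearly identical, which is why rounding to 2 is a reasonable rule of thumb.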





To begin, please run the cells below to load up the libraries necessary to access data in Google Sheets. Best practices include running the cells in order.

In [None]:
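# Install googlesheets4 only if it is not already available
if (!requireNamespace("googlesheets4", quietly = TRUE)) install.packages("googlesheets4")
library(googlesheets4)
# Work without a Google login; this only reads sheets shared publicly or by link
gs4_deauth()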

<h2>1. Loading in the Pearson heights data</h2>

Here again are 1,078 observations of "fathers" and "sons" from a well-known training dataset based on the historical work of [Karl Pearson](https://en.wikipedia.org/wiki/Karl_Pearson). Please see the Class 03 notebook for more details.

Here is a direct link to the Google Sheets file loaded in the cell below: [Pearson height data.sheets](https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link)

In [None]:
sheet_url = "https://docs.google.com/spreadsheets/d/1TZhFGjT-uXd9ScucSYkT0MNARNDMCRCbAQgx4jac-X8/edit?usp=drive_link"

pheights <- read_sheet(sheet_url,
                       range = "B13:D1091")

Calling `head()` provides a useful quick look at the top of the dataset. Calling `dim()` helps us make sure we have the right dataset loaded up in the correct way.

In [None]:
head(pheights)
dim(pheights)
n = nrow(pheights)
n

<h2>2. Means and their standard errors</h2>

The sample mean is an unbiased estimator of the population mean. In the Pearson heights data, the obvious metrics to examine are the sample averages of father's height ($\bar{x}$) and son's height ($\bar{y}$). Here they are:

In [None]:
meanX = mean(pheights$father)
meanY = mean(pheights$son)

meanX
meanY

The standard errors of these means tell us how precisely the sample means estimate the true population means. Let $\mu_x$ represent the average height of all British fathers around 1900, and let $\mu_y$ be the average height of all British sons around 1900.

In other words, we see sample means of $\bar{x} = 67.7$ inches for fathers and $\bar{y} = 68.7$ inches for sons, in a sample of $n = 1,078$ observations. What is the precision of these estimates? What can we say about the likely values of the true population average heights of fathers and sons? 

The standard error of the sample mean equals the sample standard deviation divided by the square root of the sample size:

$$
SE(\bar{x}) = \frac{s}{\sqrt{n}}=
\frac{1}{\sqrt{n}} 
\sqrt{
\frac{1}{n-1}
\sum_{i=1}^n
\left(x_i-\bar{x}
\right)^2
}
$$

In [None]:
# short form
semx = sd(pheights$father) / sqrt(n)
semx

# long form
semx = ( sum( (pheights$father - meanX)^2 ) / (n-1) )^0.5 / sqrt(n)
semx

In words, the standard error of $\bar{x} = 67.7$ is $0.0836$. 

In [None]:
# short form
semy = sd(pheights$son) / sqrt(n)
semy

# long form
semy = ( sum( (pheights$son - meanY)^2 ) / (n-1) )^0.5 / sqrt(n)
semy

And here, the standard error of $\bar{y} = 68.7$ is $0.0858$.

By the Central Limit Theorem, these sample averages are approximately normally distributed, with standard deviations equal to their standard errors. We therefore expect about 95% of the sample means from repeated samples of this size to fall within about ± 2 $SE$s of the true population mean. Equivalently, an approximate 95% confidence interval runs from $\bar{x} - 2 \ SE(\bar{x})$ to $\bar{x} + 2 \ SE(\bar{x})$.
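
The coverage claim can be checked by simulation. The sketch below is illustrative only: it draws repeated samples from a hypothetical normal population (the values of `mu` and `sigma` are assumptions, not estimates from the Pearson data) and counts how often the ±2 $SE$ interval contains the true mean. It uses `n_sim` rather than `n` so that it does not overwrite the sample size defined earlier.

```r
set.seed(202)      # arbitrary seed, for reproducibility

mu    <- 67.7      # hypothetical "true" population mean (inches)
sigma <- 2.7       # hypothetical population standard deviation
n_sim <- 1078      # sample size, matching this notebook
reps  <- 2000      # number of repeated samples

covered <- replicate(reps, {
  x  <- rnorm(n_sim, mean = mu, sd = sigma)
  se <- sd(x) / sqrt(n_sim)
  # Does the interval mean(x) +/- 2 SE contain the true mean?
  (mean(x) - 2 * se <= mu) && (mu <= mean(x) + 2 * se)
})

mean(covered)      # share of intervals that capture mu; should be near 0.95
```

In practice we only ever see one sample, so the 95% describes the procedure, not any single interval.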

For father's height, that 95% confidence interval is:

In [None]:
meanX - 2*semx
meanX + 2*semx

For son's height, the 95% confidence interval is:

In [None]:
meanY - 2*semy
meanY + 2*semy

<h2>3. Hypothesis testing</h2>

In words, here is what these results mean:

Because the 95% confidence interval around father's average height runs from $67.5$ to $67.9$, we can reject any null hypothesis that father's true average height is shorter than $67.5$ or taller than $67.9$.

Because the 95% confidence interval around son's average height runs from $68.5$ to $68.9$, we can reject any null hypothesis that son's true average height is shorter than $68.5$ or taller than $68.9$.

Precisely why these are "null hypotheses" probably seems a little nebulous. A proper formulation of a null hypothesis under these conditions would be something like this:
* $H_0$: The population average father's height minus 67 is zero, $\mu_x - 67.0 = 0$
* $H'_0$: The population average son's height minus 69 is zero, $\mu_y - 69.0 = 0$

For both of these, the proposed value ($67.0$ or $69.0$) lies outside the relevant 95% confidence interval, so in each case we reject the null hypothesis that the difference is zero, or equivalently that the true means equal those proposed values. 

More succinctly, we reject the hypotheses that $\mu_x = 67.0$ or that $\mu_y = 69.0$, because those values lie outside the 95% confidence intervals.
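
The same conclusion can be reached by measuring how many standard errors separate each sample mean from its proposed value. This is a rough worked calculation that plugs in the rounded means and standard errors reported above, so the results are approximate:

$$
z_x = \frac{\bar{x} - 67.0}{SE(\bar{x})} \approx \frac{67.7 - 67.0}{0.0836} \approx 8.4,
\qquad
z_y = \frac{\bar{y} - 69.0}{SE(\bar{y})} \approx \frac{68.7 - 69.0}{0.0858} \approx -3.5
$$

Both statistics are larger than 2 in absolute value, which is another way of saying that $67.0$ and $69.0$ fall outside their respective ±2 $SE$ intervals.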

---

<h2>4. Beyond the Pale</h2>

As we will see in Class 14, ordinary least squares via `lm()` can also give us simple results like this. A regression with only an intercept, such as `father ~ 1`, estimates the sample average of that variable along with its standard error. Because father's and son's heights are stored in separate columns of the data frame, we run one intercept-only regression per column.

In [None]:
reg_f <- lm(father ~ 1,
            data = pheights
            )
summary(reg_f)

In [None]:
reg_s <- lm(son ~ 1,
            data = pheights
            )
summary(reg_s)

---

In the regression results above, `Estimate` shows the average, and `Std. Error` shows its standard error.
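
To see why this works, here is a small self-contained sketch on simulated data. The data frame `toy` and its column `h` are hypothetical and unrelated to the Pearson data. It compares the intercept-only regression output with the direct calculations from Section 2.

```r
set.seed(13)                                          # arbitrary seed
toy <- data.frame(h = rnorm(50, mean = 68, sd = 3))   # hypothetical heights

fit   <- lm(h ~ 1, data = toy)                        # intercept-only regression
coefs <- summary(fit)$coefficients

# Regression output next to the direct calculations
c(estimate = coefs[1, "Estimate"],   mean = mean(toy$h))
c(std_err  = coefs[1, "Std. Error"], se   = sd(toy$h) / sqrt(nrow(toy)))
```

Each pair of numbers should agree, because with no predictors the residual standard error equals the sample standard deviation.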

<div style="text-align: right"> <span style="font-family:Papyrus; ">And they lived happily ever after. The End.</span></div>