**SM339 &#x25aa; Applied Statistics &#x25aa; Spring 2024 &#x25aa; Uhan**

# Lesson 3. Confidence Intervals &mdash; Part 2

## Computing confidence intervals in R

__Example__.
Suppose we randomly select 8 midshipmen and record how many children are in their families.
These data values are in a CSV file `data/children.csv`, in the same folder as this notebook.

* First, let's read the CSV file into a data frame called `Children`

In [1]:
Children <- read.table('data/children.csv', header=TRUE, sep=',')
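As an aside, `read.csv()` is a wrapper around `read.table()` whose defaults already include `header = TRUE` and `sep = ","`, so for a simple file the two calls should produce the same data frame. The sketch below checks this on a small temporary file. The file contents and the names `tmp`, `a`, and `b` are made up for illustration and are not the course data.

```r
# Write a tiny CSV to a temporary file (made-up values, not data/children.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Number", "2", "3", "1"), tmp)

# Read the same file both ways
a <- read.table(tmp, header = TRUE, sep = ",")
b <- read.csv(tmp)

# TRUE if the two data frames agree
isTRUE(all.equal(a, b))

# Clean up the temporary file
unlink(tmp)
```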

* Next, let's take a look at the first few rows of `Children`

    * It's always good practice to do this, just to make sure everything looks OK

In [2]:
# Solution
head(Children)

| | Number |
|---|---|
| | `<int>` |
| 1 | 1 |
| 2 | 4 |
| 3 | 2 |
| 4 | 5 |
| 5 | 6 |
| 6 | 3 |


* The values in the `Number` column of this data frame are going to be our sample values $x_1, x_2, \dots, x_n$

* For our convenience, let's create a variable `x` to hold this column

In [3]:
# Solution
x <- Children$Number

* Let's calculate a 95% confidence interval for the average number of children in all midshipmen families

* Let's start by computing the sample mean (estimate) $\bar{x} = \displaystyle \frac{1}{n} \sum_{i=1}^{n} x_i$:

In [4]:
# Solution
xbar <- mean(x)

* Next, let's compute the sample standard deviation (estimate) $\displaystyle s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$: 

In [5]:
# Solution
s <- sd(x)
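To connect the formulas for $\bar{x}$ and $s$ to the built-in functions, here is a minimal sketch that computes both by hand and compares them with `mean()` and `sd()`. The vector `demo_x` and the other `_demo` names are made up for illustration; they are not the values in `data/children.csv`.

```r
# Made-up sample values for illustration only
demo_x <- c(2, 3, 1, 5, 4)

n_demo <- length(demo_x)

# Sample mean: sum of the values divided by n
xbar_demo <- sum(demo_x) / n_demo

# Sample standard deviation: note the division by n - 1, not n
s_by_hand <- sqrt(sum((demo_x - xbar_demo)^2) / (n_demo - 1))

# Both comparisons should report TRUE
all.equal(xbar_demo, mean(demo_x))
all.equal(s_by_hand, sd(demo_x))
```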

* We can get the sample size $n$ by taking the length of `x` (which is the same as the column `Children$Number`):

In [6]:
# Solution
n <- length(x)

* We want a $95\%$ confidence interval

* Recall that the significance level is $\alpha = 1 - \text{confidence level}$

* So, $\alpha = 1 - 0.95$:

In [7]:
# Solution
alpha <- 1 - 0.95

* Based on `alpha`, we can then compute the critical value $t_{\alpha/2, n - 1}$ from the $t$-distribution with $n - 1$ degrees of freedom

* Recall that $t_{\alpha/2, n - 1}$ is the $(1 - \alpha/2)$-quantile of the $t$-distribution with $n - 1$ degrees of freedom:

In [8]:
# Solution
t <- qt(1 - alpha/2, n - 1)
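As a quick sanity check on the quantile convention, the sketch below computes the same critical value two ways and confirms with the CDF `pt()` that it leaves area $\alpha/2$ in the upper tail. It uses $\alpha = 0.05$ and $n - 1 = 7$ degrees of freedom, matching a sample of size 8. The variable names ending in `_demo` are introduced here only for illustration.

```r
alpha_demo <- 0.05
df_demo <- 7  # n - 1 for a sample of size 8

# Critical value as the (1 - alpha/2)-quantile
t_crit <- qt(1 - alpha_demo / 2, df_demo)

# Same value, requested as an upper-tail quantile
t_crit_upper <- qt(alpha_demo / 2, df_demo, lower.tail = FALSE)

# pt() is the CDF, so this should equal 1 - alpha/2
pt(t_crit, df_demo)

# For comparison: the standard normal critical value is smaller,
# because the t-distribution has heavier tails
qnorm(1 - alpha_demo / 2)
```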

* Finally, we can compute the lower and upper endpoints of the CI, $\bar{x} \pm t_{\alpha/2, n - 1} \frac{s}{\sqrt{n}}$:

In [9]:
# Solution
xbar - t * s / sqrt(n)
xbar + t * s / sqrt(n)
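The steps above can be collected into a small function and cross-checked against R's built-in `t.test()`, which reports the same kind of $t$-based interval in its `conf.int` component. This is only a sketch: `t_interval` is a hypothetical helper written for this illustration, and `demo_counts` is made-up data, not the contents of `data/children.csv`.

```r
# Hypothetical helper: t-based confidence interval for a population mean
t_interval <- function(values, conf_level = 0.95) {
  n <- length(values)
  alpha <- 1 - conf_level
  t_crit <- qt(1 - alpha / 2, n - 1)
  margin <- t_crit * sd(values) / sqrt(n)
  c(lower = mean(values) - margin, upper = mean(values) + margin)
}

# Made-up counts for illustration only
demo_counts <- c(2, 0, 3, 1, 4, 2, 5, 3)

by_hand <- t_interval(demo_counts, conf_level = 0.95)
built_in <- t.test(demo_counts, conf.level = 0.95)$conf.int

# The two results should agree
by_hand
built_in
```

Computing the interval by hand first and then comparing with `t.test()` is a convenient way to catch mistakes such as using $n$ instead of $n - 1$ degrees of freedom.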

* Now that we've computed a 95% confidence interval, let's interpret it:

*Solution.* We are 95% confident that the average number of children in all midshipmen families is between 2.16 and 4.84.

* Which of these correctly explains what it means to be 95% confident?

    1. The probability that this interval contains the population average is 0.95.
    2. We have constructed this interval with a process that captures the population average 95% of the time.
    3. 95% of all midshipmen families have between 2.16 and 4.84 children.

*Solution.* Option 2.
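One way to see what "a process that captures the population average 95% of the time" means is to simulate it. The sketch below repeatedly draws samples of size 8 from a population whose mean we choose ourselves, builds a 95% interval from each sample, and records how often the interval contains that mean. The normal population, its parameters, the seed, and all variable names are assumptions made for this simulation; they are not taken from the example data.

```r
set.seed(339)  # arbitrary seed so the run is reproducible

true_mean <- 3.5  # assumed population mean (chosen for the simulation)
true_sd <- 1.5    # assumed population standard deviation
n_sim <- 8        # sample size, as in the example
reps <- 10000     # number of repeated samples

covers <- replicate(reps, {
  sample_values <- rnorm(n_sim, mean = true_mean, sd = true_sd)
  ci <- t.test(sample_values, conf.level = 0.95)$conf.int
  # TRUE if this sample's interval contains the true mean
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Proportion of intervals that captured the true mean
mean(covers)
```

The proportion should land close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the long-run behavior of the procedure.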

<hr style="border-top: 2px solid gray; margin-top: 1px; margin-bottom: 1px"></hr>

## Exercises

### Problem 1

Suppose the heights of midshipmen are normally distributed.
Suppose we randomly select 16 midshipmen and record their heights in inches.
These values are in a CSV file `data/heights.csv`, in the same folder as this notebook.

(a) Compute the observed sample mean and sample standard deviation.

In [10]:
# Solution 
Heights <- read.table('data/heights.csv', header=TRUE, sep=',')
head(Heights)

x <- Heights$height
mean(x)
sd(x)

| | height |
|---|---|
| | `<dbl>` |
| 1 | 72.2 |
| 2 | 67.4 |
| 3 | 74.3 |
| 4 | 72.6 |
| 5 | 70.8 |
| 6 | 76.9 |


(b) Find 90%, 95%, and 99% confidence intervals for the population (all midshipmen) mean height.

In [11]:
# Solution
# 90% CI
xbar <- mean(x)
s <- sd(x)
alpha <- 1 - 0.9
n <- length(x)
t <- qt(1 - alpha / 2, n - 1)

xbar - t * s / sqrt(n)
xbar + t * s / sqrt(n)

In [12]:
# Solution
# 95% CI
xbar <- mean(x)
s <- sd(x)
alpha <- 1 - 0.95
n <- length(x)
t <- qt(1 - alpha / 2, n - 1)

xbar - t * s / sqrt(n)
xbar + t * s / sqrt(n)

In [13]:
# Solution
# 99% CI
xbar <- mean(x)
s <- sd(x)
alpha <- 1 - 0.99
n <- length(x)
t <- qt(1 - alpha / 2, n - 1)

xbar - t * s / sqrt(n)
xbar + t * s / sqrt(n)
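Because `qt()` is vectorized, the three intervals can also be computed in one pass, which makes it easy to see how the width grows with the confidence level. This is a sketch on made-up data: `demo_heights` is not the contents of `data/heights.csv`, and the other names are introduced only for this illustration.

```r
# Made-up heights in inches for illustration only
demo_heights <- c(70.1, 68.4, 72.9, 71.3, 69.8, 74.0, 67.5, 70.6)

conf_levels <- c(0.90, 0.95, 0.99)
n_h <- length(demo_heights)
se_h <- sd(demo_heights) / sqrt(n_h)

# Critical values for all three levels at once
t_crits <- qt(1 - (1 - conf_levels) / 2, n_h - 1)

intervals <- data.frame(
  level = conf_levels,
  lower = mean(demo_heights) - t_crits * se_h,
  upper = mean(demo_heights) + t_crits * se_h
)
intervals$width <- intervals$upper - intervals$lower

# One row per confidence level; width increases down the table
intervals
```

The sample mean and standard error are the same in every row. Only the critical value changes, so higher confidence is paid for with a wider interval.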

### Problem 2

Suppose the math SAT scores of midshipmen are approximately normally distributed.
Suppose we randomly select 25 midshipmen and record their math SAT scores.
These values are in a CSV file `data/scores.csv`, in the same folder as this notebook.

(a) Compute the observed sample mean and sample standard deviation.

In [14]:
# Solution 
Scores <- read.table('data/scores.csv', header=TRUE, sep=',')
head(Scores)

x <- Scores$score
mean(x)
sd(x)

| | score |
|---|---|
| | `<int>` |
| 1 | 490 |
| 2 | 489 |
| 3 | 590 |
| 4 | 517 |
| 5 | 533 |
| 6 | 500 |


(b) Find 90%, 95%, and 99% confidence intervals for the population (all midshipmen) mean math SAT score.

In [15]:
# Solution
# 90% CI
xbar <- mean(x)
s <- sd(x)
alpha <- 1 - 0.9
n <- length(x)
t <- qt(1 - alpha / 2, n - 1)

xbar - t * s / sqrt(n)
xbar + t * s / sqrt(n)

In [16]:
# Solution
# 95% CI
xbar <- mean(x)
s <- sd(x)
alpha <- 1 - 0.95
n <- length(x)
t <- qt(1 - alpha / 2, n - 1)

xbar - t * s / sqrt(n)
xbar + t * s / sqrt(n)

In [17]:
# Solution
# 99% CI
xbar <- mean(x)
s <- sd(x)
alpha <- 1 - 0.99
n <- length(x)
t <- qt(1 - alpha / 2, n - 1)

xbar - t * s / sqrt(n)
xbar + t * s / sqrt(n)

### Problem 3

A study of 20 midshipmen reports that a 90\% confidence interval for the average IQ of all midshipmen is 118.2 to 121.8. The Superintendent wants to know the corresponding 99\% confidence interval. Find it for him, assuming that midshipmen IQs are Normally distributed.

*Solution.*  In this setting, $n = 20$. 

The 90\% confidence interval is $\Big( \bar{x} - t_{0.05, 19} \frac{s}{\sqrt{n}}, \bar{x} + t_{0.05, 19} \frac{s}{\sqrt{n}} \Big) = (118.2, 121.8)$. 
Note that for the 90\% confidence interval, $\alpha = 1 - 0.9 = 0.1$.

Therefore, $\bar{x}$ is the center of this interval, which we can find as follows:

In [18]:
# Find xbar by finding the center of the interval
xbar <- (118.2 + 121.8) / 2

Similarly, the margin of error $t_{0.05, 19} \frac{s}{\sqrt{n}}$ is half the width of the interval, which we can find like this:

In [19]:
# Find the margin of error (half the interval width)
margin_of_error <- (121.8 - 118.2) / 2

We can also find $t_{0.05, 19}$:

In [20]:
t90 <- qt(1 - 0.05, 19)

Therefore, we can solve for the standard error $\displaystyle \frac{s}{\sqrt{n}} = \frac{\text{margin of error}}{t_{0.05, 19}}$:

In [21]:
standard_error <- margin_of_error / t90

Now we can compute the 99\% confidence interval $\Big( \bar{x} - t_{0.005, 19} \frac{s}{\sqrt{n}}, \bar{x} + t_{0.005, 19} \frac{s}{\sqrt{n}} \Big)$:

In [22]:
t99 <- qt(1 - 0.005, 19)
lower <- xbar - t99 * standard_error
upper <- xbar + t99 * standard_error

lower
upper
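An equivalent shortcut: since $\bar{x}$ and $\frac{s}{\sqrt{n}}$ are the same for both intervals, the 99\% margin of error is the 90\% margin of error scaled by the ratio $t_{0.005, 19} / t_{0.05, 19}$. The sketch below uses only the numbers given in the problem. The variable names are introduced here for illustration and are kept separate from the ones used above.

```r
# Center and 90% margin of error, read off the reported interval
center <- (118.2 + 121.8) / 2
moe90 <- (121.8 - 118.2) / 2
df_iq <- 20 - 1

# Ratio of the 99% critical value to the 90% critical value
scale_factor <- qt(1 - 0.005, df_iq) / qt(1 - 0.05, df_iq)

# Rescale the margin of error and rebuild the interval
moe99 <- moe90 * scale_factor
ci99 <- c(lower = center - moe99, upper = center + moe99)
ci99
```

The resulting interval is roughly 117.02 to 122.98, centered at 120 like the original interval but wider.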