# 1.3.2 - Beginner - Confidence Intervals

COMET Team <br> *Anneke Dresselhuis, Colby Chambers, Jonathan Graves*  
2023-01-12

## Outline

### Prerequisites

-   Introduction to Jupyter
-   Introduction to R
-   Introduction to Visualization
-   Central Tendency
-   Distribution
-   Dispersion and Dependence

### Outcomes

After completing this notebook, you will be able to:

-   Interpret and report confidence intervals
-   Calculate confidence intervals under a variety of conditions
-   Understand how the scope of sampling impacts confidence intervals

### References

-   [Simulating the Construction of Confidence Intervals for Sample
    Means](https://rpubs.com/pgrosse/545955)

In [None]:
source("beginner_confidence_intervals_tests.r")

# importing typical packages
library(tidyverse)
library(haven)
library(ggplot2)

# loading the dataset
census_data <- read_dta("../datasets_beginner/01_census2016.dta")

# cleaning the dataset
census_data <- filter(census_data, !is.na(census_data$wages))
census_data <- filter(census_data, !is.na(census_data$mrkinc))
census_data <- filter(census_data, census_data$pkids != 9)

# Introduction

So far, we have developed a strong grasp of core concepts in statistics.
We’ve learned about measures of central tendency and variation, as well
as how these measures relate to distributions. We have also learned
about random sampling and how sampling distributions can shed light on
the parameters of a population distribution.

So, how can we apply this knowledge to real empirical work? In this
notebook, we will learn about a key concept which relates to how we
report our results empirically when sampling from a population. This is
the idea of a **confidence interval**.

# Confidence Intervals and Point Estimates

A **confidence interval** is an estimate that gives us a range of values
within which we expect a population parameter to fall. Put another way,
it provides a range within which we can have a certain degree of
*confidence* that a desired parameter, such as a population mean, lies.

> This is in contrast to a **point estimate**, which is a single
> estimated value for an unknown quantity, such as a population parameter.
>
> e.g. The point estimate of the population mean is the sample mean and
> the point estimate of the population standard deviation is the sample
> standard deviation.

Let’s make this concrete with an example.

## Example

**Aim:** Find the mean GPA of undergraduate students at universities
across Canada

**Method:** Instead of collecting the GPA of every single undergraduate
student in the country without error, we can collect a sample of
students and find the mean of their GPAs (the sample mean).

**Evaluation:** Make inferences about the desired, yet unobtainable,
population mean using the sample mean (point estimate).

But how can we report an estimate for the population mean GPA if we draw
a different mean GPA for every possible sample? This is where
**confidence intervals** become useful. They allow us to combine
information about central tendency and dispersion into a single object.

# Confidence Levels

The confidence interval describes the precision of the point estimate
from the sample.

To calculate this confidence interval, we must choose a **confidence
level**. The confidence level is the share of confidence intervals,
built in the same way from repeated random samples, that we expect to
contain the true population parameter (e.g. the mean).

A higher confidence level means greater certainty that an interval built
this way captures the population parameter of interest.

<sub>The most commonly chosen confidence level is 95%, but other levels (90%, 99%) are also used.</sub>

If the confidence level is established at 95% for our sample scenario,
this would mean that if we drew random samples of undergraduate students
1000 different times and got 1000 sample mean point estimates and
corresponding confidence intervals, we would expect 950 of these
confidence intervals to contain the actual average GPA of all Canadian
undergraduates.

We say that we are **95% confident** that the true mean GPA of all
Canadian undergraduates lies in this range.
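The repeated-sampling interpretation above can be checked with a small simulation. This is only a sketch: the GPA population (normal, mean 3.0, standard deviation 0.5) and the sample size of 50 are assumed values chosen for illustration, and each interval uses the $t$-based formula for a sample mean.

```r
set.seed(123)  # fixed seed so the sketch is reproducible

# Hypothetical GPA population: normal with mean 3.0 and sd 0.5 (assumed values)
true_mean <- 3.0
true_sd   <- 0.5
n_samples <- 1000  # number of repeated samples
n         <- 50    # students per sample

# For each sample, build a 95% interval and record whether it covers true_mean
covers <- replicate(n_samples, {
  gpa    <- rnorm(n, mean = true_mean, sd = true_sd)
  margin <- qt(0.975, df = n - 1) * sd(gpa) / sqrt(n)
  (mean(gpa) - margin <= true_mean) && (true_mean <= mean(gpa) + margin)
})

# Share of the 1000 intervals that contain the true mean; expected to be near 0.95
coverage <- mean(covers)
coverage
```

The share will not be exactly 0.95 in any one run, since the 95% figure describes what happens on average over many repetitions.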

# Calculating Confidence Intervals

The general form of a confidence interval is the following:

$$
(\text{point estimate} - \text{margin of error}, \text{point estimate} + \text{margin of error})
$$

or

$$ 
\text{point estimate} \pm \text{margin of error}
$$

Point estimate:

-   The sample statistic we find from our random sample.

Margin of error:

-   This is subtracted from and added to our point estimate to find the
    **lower bound** and **upper bound** of our confidence interval
    estimate. How we calculate the margin of error depends on which
    sample statistic we are looking at and what we know about our
    population. Let’s look at a few important special cases.

# Confidence Intervals for the Sample Mean

To construct a confidence interval around a sample mean we’ve found (e.g.
the mean GPA of a sample of undergraduates), we must meet the following
three conditions:

1\. Sample must be **obtained randomly** (typically found through simple
random sampling)

2\. The sampling distribution of the sample means is **approximately
normal**, either because

-   a\) The original population is normally distributed

-   b\) The sample size is \> 120 (invokes the Central Limit Theorem)

3\. Our sample observations must be **independent** either because

-   a\) we sample with replacement (when we record an observation, we
    put it back in the population with the possibility of drawing it
    again)
-   b\) our sample size is \< 10% of the population size

If each of conditions 1-3 is met, we are able to construct a valid
confidence interval around our sample mean point estimate. There are two
different cases for this construction.

## Case 1: We Know the Population Standard Deviation

In rare instances when we may know the variance (and thus standard
deviation) of our original population of interest, we use the following
formula to calculate the confidence interval:

$$
\bar x \pm z_{\alpha / 2} \cdot \frac{\sigma}{\sqrt n}
$$

where $\bar x$ is the sample mean, $z$ is the critical value (from the
standard normal distribution) for a chosen confidence level $1-\alpha$,
$\sigma$ is the population standard deviation, and $n$ is the sample
size.

Note, this case is extremely rare as it requires us to know the standard
deviation but not the mean of a population! Typically we either know
both the mean and standard deviation of the population or we know
neither.
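As a worked illustration of the Case 1 formula, the sketch below plugs in made-up numbers. The sample mean, the known population standard deviation, and the sample size are all hypothetical and do not come from the census data.

```r
# Hypothetical inputs (not from the census data)
x_bar <- 3.1   # sample mean
sigma <- 0.5   # population standard deviation, assumed known
n     <- 100   # sample size
alpha <- 0.05  # 1 - alpha = 95% confidence level

# Two-sided critical value: alpha/2 in each tail, about 1.96 for 95%
z_crit <- qnorm(1 - alpha / 2)

# Margin of error and the resulting interval
margin <- z_crit * sigma / sqrt(n)
ci_z   <- c(lower = x_bar - margin, upper = x_bar + margin)
ci_z
```

Note that `qnorm()` is asked for the $1 - \alpha/2$ quantile rather than the $1 - \alpha$ quantile, because the remaining $\alpha$ is split evenly between the two tails.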

## Case 2: We Don’t Know the Population Standard Deviation

In this case we invoke the $t$-distribution when calculating the margin
of error for our confidence intervals. When we don’t know the population
standard deviation, we use the sample standard deviation instead.
The calculation procedure otherwise follows exactly as before in Case 1.

$$
\bar x \pm t_{\alpha / 2} \cdot \frac{s}{\sqrt n}
$$

where $\bar x$ is the sample mean, $t$ is the critical value (from the
$t$-distribution) for a chosen confidence level $1-\alpha$, $s$ is the
sample standard deviation, and $n$ is the sample size.

For example, let’s construct a 95% confidence interval for the sample
mean of the variable `wages`. We can immediately calculate its mean,
which serves as our sample mean point estimate.

In [None]:
# calculating the sample mean of wages
x <- mean(census_data$wages)

Now that we have this point estimate, we can calculate our margin of
error around it. To do so, we must first find

1.  The $t$ value corresponding to a 95% confidence level
2.  The standard deviation of `wages`
3.  The sample size (the number of observations recorded for `wages`).

In [None]:
# finding the sample size and associated degrees of freedom
n <- nrow(census_data)
df <- n - 1

# finding the t critical value for a 95% confidence level: a two-sided interval
# leaves 2.5% in each tail, so we need the 97.5th percentile
# (with a sample this large, the value is almost identical to the z critical value)
t <- qt(p = 0.975, df = df)

# finding the sample standard deviation of wages
s <- sd(census_data$wages)

# calculating the lower and upper bounds of the desired confidence interval
lower_bound <- x - (t*s/sqrt(n))
upper_bound <- x + (t*s/sqrt(n))

lower_bound
upper_bound

We are 95% confident that the mean wage of all Canadians lies between
the lower and upper bounds printed above, an interval only a few hundred
dollars wide centred on a sample mean of roughly 54,500 dollars. We also
know this is a valid confidence interval estimate because our `wages`
variable and the procedure for sampling meet all three of the criteria
outlined:

1.  Random sampling: Statistics Canada (the source for this data)
    utilizes random sampling.
2.  Approximate normality: our sample size is far above the threshold in
    condition 2, so the Central Limit Theorem applies and we don’t even
    need to check the distribution of `wages`.
3.  Independence: our sample size $n$ is \< 10% of the total population
    (since the total population of Canada is about 38 million).

This is a very narrow confidence interval, a consequence of our large
sample size $n$. The narrowness of the interval indicates that our
estimate is very precise.
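As a cross-check on the manual Case 2 calculation, base R's `t.test()` reports the same $t$-based interval in its `conf.int` element. The sketch below uses simulated data as a stand-in for `wages` (the log-normal parameters are arbitrary assumptions), so it runs without the census file.

```r
set.seed(42)

# Simulated stand-in for a wage variable (arbitrary log-normal parameters)
wages_sim <- rlnorm(500, meanlog = 10.5, sdlog = 0.8)

# Manual 95% interval: x_bar +/- t * s / sqrt(n)
n_sim     <- length(wages_sim)
t_crit    <- qt(0.975, df = n_sim - 1)
margin    <- t_crit * sd(wages_sim) / sqrt(n_sim)
manual_ci <- c(mean(wages_sim) - margin, mean(wages_sim) + margin)

# Built-in interval from a one-sample t-test
builtin_ci <- as.numeric(t.test(wages_sim, conf.level = 0.95)$conf.int)

manual_ci
builtin_ci
all.equal(manual_ci, builtin_ci)  # the two approaches should agree
```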

## Exercise

Matilda takes a random sample of 10 books from a library in order to
estimate the average number of pages among all books in the library.
Let’s assume the library is very large and the library does not keep
record of the specifics of its overall population of books in terms of
their pages.

Does it make more sense for Matilda to use a standard z distribution or
student’s t distribution when calculating the margin of error for her
confidence interval?

In [None]:
answer_1 <- "X" # your answer for "z" or "t" in place of "X"

test_1()

From her sample, Matilda finds a sample mean of 280 and sample variance
of 400. She wants to construct a 90% confidence interval to estimate the
population mean number of pages. What will be the upper and lower bounds
of this interval (assuming it’s a valid confidence interval)?

In [None]:
# your code here

answer_2 <- # your answer for the lower bound here, rounded to 2 decimal places
answer_3 <- # your answer for the upper bound here, rounded to 2 decimal places

test_2()
test_3()

# Confidence Intervals for the Sample Proportion

While we’ve looked at the example of mean GPA throughout this notebook,
we can also calculate confidence intervals for sample proportions.
Let’s try another example:

Condition: A population must vote for either of the two political
parties (A or B)

Aim: Find out the proportion of the population that voted for party A

Method:

1.  Collect a sample and corresponding sample proportion
2.  Using the point estimate, we establish a confidence interval
3.  Estimate the proportion who voted for party A within a certain degree
    of confidence

Before we establish the confidence interval, we must make sure that our
sampling process again satisfies three conditions; this time, however,
the second of these three conditions will be different:

1.  **Random sample**

    -   Typically found through simple random sampling

2.  **The sampling distribution of the sample proportions is
    approximately normal**

    -   We must have at least 10 “successes” and 10 “failures” in our
        sample. This means at least 10 people in our sample voted for
        party A and at least 10 people voted for party B.

    -   Therefore, very small sample sizes (e.g. $n = 5$ or $n = 10$)
        will fail this condition.

3.  **Sample observations must be independent, either because**

    -   A\) We sample with replacement (when we record an observation, we
        put it back in the population with the possibility of drawing it
        again)

    -   B\) Our sample size is \< 10% of the population size

If conditions 1-3 are all met, we are able to construct a valid
confidence interval around our sample proportion point estimate. We now
turn to the one case we must consider when calculating the margin of
error and confidence interval for sample proportions.

## The Only Case: We Don’t Know the Population Standard Deviation

When we don’t know the standard deviation for the population, we use the
following formula to construct the confidence interval of a sample
proportion:

$$
\hat P \pm z_{\alpha / 2} \cdot \sqrt{\frac{\hat P \cdot (1 - \hat P)}{n}}
$$

where $\hat P$ is the sample proportion, $z$ is the critical value (from
the standard normal distribution) for a chosen confidence level
$1-\alpha$, and $n$ is the sample size.

<sub>Note: if we knew the population standard deviation, we would also know the population proportion and there would be no point in sampling and constructing confidence intervals to estimate it!</sub>

For example:

Let’s calculate a 95% confidence interval for the proportion of
observations in the census dataset with one or more kids in their
household (`pkids == 1`). We can immediately calculate our sample
proportion, which serves as our point estimate.

In [None]:
# calculating our sample proportion of observations with pkids == 1
p <- sum(census_data$pkids == 1) / n
p

Now that we have our sample proportion, we can find our $z$ critical
value for a 95% confidence level, as well as use our sample proportion
$\hat P$ and sample size $n$, to calculate our confidence interval.

In [None]:
# finding the z value for a confidence level of 95%
# (a two-sided interval leaves 2.5% in each tail)
z <- qnorm(p = 0.025, lower.tail = FALSE)

# calculating the lower and upper bounds of the desired confidence interval
lower_bound <- p - z*sqrt(p*(1-p)/n)
upper_bound <- p + z*sqrt(p*(1-p)/n)

lower_bound
upper_bound

From our above calculations, we can say that we are 95% confident that
the true proportion of Canadians with a child in their household lies
between the lower and upper bounds printed above, both of which are
close to 0.71 (about 71%).

-   <sub>Note: In rare cases when our sample proportion point estimate is either very high or very low and our sample size is small, we may find that the upper or lower bound of the confidence interval for a sample proportion is outside of the accepted domain of \[0, 1\]. We may choose to either report the true interval or cap our interval at 0 or 1, while noting that this does not reflect the full confidence interval found.</sub>
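To illustrate the note on bounds falling outside \[0, 1\], here is a minimal sketch of the proportion formula wrapped in a helper. The function `prop_ci()` and its argument names are hypothetical, as is the sample of 19 party A voters out of 20. That sample deliberately breaks the "10 successes and 10 failures" condition, which is exactly the situation where the bounds misbehave.

```r
# Hypothetical helper: confidence interval for a proportion, raw and capped
prop_ci <- function(successes, n, conf_level = 0.95) {
  p_hat  <- successes / n
  z_crit <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value
  margin <- z_crit * sqrt(p_hat * (1 - p_hat) / n)
  raw    <- c(lower = p_hat - margin, upper = p_hat + margin)
  # Cap the bounds so they stay inside the valid range for a proportion
  capped <- c(lower = max(raw[["lower"]], 0), upper = min(raw[["upper"]], 1))
  list(raw = raw, capped = capped)
}

# Hypothetical small sample with an extreme proportion: 19 of 20 voted for party A
result <- prop_ci(successes = 19, n = 20)
result$raw     # the upper bound exceeds 1
result$capped  # the upper bound is capped at 1
```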

## Exercise

Matilda now wants to know the proportion of students in her school who
are left-handed. Let’s assume her sampling procedure meets all of the
criteria for constructing a valid confidence interval. She takes a
sample of 200 students and finds that 22 of them are left-handed. What
are the upper and lower bounds of a 98% confidence interval for the
proportion of the school’s overall student body that is left-handed?

In [None]:
# your code here

answer_4 <- # your answer for the lower bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)
answer_5 <- # your answer for the upper bound here, rounded to 3 decimal places (in proportion form, i.e. 10% = 0.1)

test_4()
test_5()

Let’s imagine that our sample size and confidence level are fixed and
cannot be changed. What sample proportion of students who are
left-handed would result in the smallest confidence interval possible?

In [None]:
answer_6 <- # your answer for the sample proportion here (i.e. 10% = 0.1)

test_6()

# Confidence Intervals for the Sample Variance

Finally, we may want to construct confidence intervals for a sample
variance itself in order to estimate the population variance that we do
not know. The following conditions must be met for this confidence
interval to be valid:

> 1.  Sample collected randomly
> 2.  Original population is normally distributed or at least
>     symmetrically distributed without many outliers.
>     -   If this does not hold, our sample size must be \> 120 (invokes
>         the Central Limit Theorem)
> 3.  Our sample observations must be independent either because
>     -   A\) We sample with replacement (when we record an observation,
>         we put it back in the population with the possibility of
>         drawing it again)
>     -   B\) Our sample size is \< 10% of the population size

If conditions 1-3 are all met, we are able to construct a valid
confidence interval for our sample variance point estimate.

## The Only Case: We Don’t Know the Population Standard Deviation

We only need to worry about this case when calculating confidence
intervals for the sample variance, since if we knew the population
standard deviation, we would also know the population variance and
therefore would not need to construct a confidence interval to estimate
it.

Instead, we assume we have only a sample variance to rely on. The
formula works a bit differently in this case: instead of adding and
subtracting a margin of error to our point estimate, we will use our
point estimate to calculate the lower and upper bounds of our confidence
interval directly.

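$$
\left( \frac{(n - 1) \cdot s^2}{\chi^2_{\alpha/2}}, \ \frac{(n - 1) \cdot s^2}{\chi^2_{1 - \alpha/2}} \right)
$$

where $n$ is the sample size, $s^2$ is the sample variance, and
$\chi^2_{\alpha/2}$ and $\chi^2_{1 - \alpha/2}$ are the chi-squared
critical values, with $n - 1$ degrees of freedom, that leave areas of
$\alpha/2$ and $1 - \alpha/2$ respectively in the upper tail for a
chosen confidence level $1 - \alpha$. The larger critical value
therefore sits under the lower bound, and the smaller one under the
upper bound.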

> Note: Constructing this type of confidence interval is different from
> previous instances with the sample mean and sample proportion. This is
> because the sample variance follows a **non-normal distribution**: it
> is tied to the $\chi^2$ distribution rather than to a normal
> distribution like the sample mean and sample proportion.

Let’s do one final example to reinforce the calculation of confidence
intervals for this type of sample statistic. We will construct a 95%
confidence interval for the variance of `mrkinc`. Our procedure will
follow exactly the steps above, although this time we need to **use the
chi-squared distribution in place of the t or z distributions**. We can
calculate our sample variance first.

In [None]:
# calculating the variance of mrkinc
var <- var(census_data$mrkinc)
var

Now that we have our sample variance (which is quite large), we can find
the other statistics necessary to calculate our confidence interval
estimate.

In [None]:
# finding the chi-squared values for a 95% confidence level and n - 1 degrees of freedom
# (a two-sided 95% interval leaves 2.5% in each tail)
upper_chi <- qchisq(p = 0.025, df = df, lower.tail = TRUE)   # small critical value, used for the upper bound
lower_chi <- qchisq(p = 0.025, df = df, lower.tail = FALSE)  # large critical value, used for the lower bound

# calculating the upper and lower bounds of the desired confidence interval
lower_bound <- (df*var)/lower_chi
upper_bound <- (df*var)/upper_chi

lower_bound
upper_bound

Therefore, we are 95% confident that the variance of market income among
all Canadians lies between the lower and upper bounds printed above.
Both bounds are very large numbers, but this reflects the size of the
variance of this variable (which is measured in squared dollars) rather
than an imprecise estimate.
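The same chi-squared construction can be tried on simulated data where the true variance is known. This is a sketch under assumed values: the income-like variable below is drawn from a normal distribution with standard deviation 10,000, so its true variance is $10^8$.

```r
set.seed(7)

# Simulated income-like variable (assumed normal, sd = 10000, true variance 1e8)
income_sim <- rnorm(200, mean = 50000, sd = 10000)

n_sim  <- length(income_sim)
df_sim <- n_sim - 1
s2     <- var(income_sim)  # sample variance (point estimate)
alpha  <- 0.05             # 95% confidence level

# Critical values leaving alpha/2 in the upper tail and in the lower tail
chi_large <- qchisq(alpha / 2, df = df_sim, lower.tail = FALSE)
chi_small <- qchisq(alpha / 2, df = df_sim, lower.tail = TRUE)

# Dividing by the larger value gives the lower bound, and vice versa
var_ci <- c(lower = df_sim * s2 / chi_large,
            upper = df_sim * s2 / chi_small)
var_ci
```

Unlike the intervals for means and proportions, this interval is not symmetric around the point estimate `s2`, because the chi-squared distribution is skewed.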

## Exercise

Finally, Matilda wants to know the variance of weights of all cars ever
sold at her father’s car dealership.

-   Since she can’t find the variance of the thousands of cars sold, she
    takes a random sample of 40 cars and records their weights.

-   She finds that they have a sample mean weight of 5,000 pounds and a
    sample variance of 250,000.

-   Matilda wants to construct a 95% confidence interval estimate for
    the population variance.

Given the information above, what are the upper and lower bounds of this
confidence interval?

In [None]:
# your code here

answer_7 <- # your answer for the lower bound here, rounded to the nearest whole number
answer_8 <- # your answer for the upper bound here, rounded to the nearest whole number

test_7()
test_8()

Let’s now say that Matilda draws a new random sample of 40 cars and
reports 95% confidence that the population variance of car weights falls
within the confidence interval (490000, 640000). Under this sampling
procedure, what is the 95% confidence interval estimate for the standard
deviation of weights of all cars ever sold at the dealership?

In [None]:
answer_10 <- # your answer for the lower bound here
answer_11 <- # your answer for the upper bound here

test_10()
test_11()

# Factors Which Impact the Width of Confidence Intervals

We can see that no matter the parameter we are estimating, we always
need to establish

-   The confidence level

-   The sample size.

Because these numbers are chosen early on during the sampling procedure,
they can easily be changed. Let’s explore what happens to our confidence
intervals when we change each of these numbers.

# Changing the Sample Size

Let’s say we want to change our sample size $n$.

-   **If we <u>increase</u> our sample size** $n$**.**

    -   Both our margin of error and the width of our confidence interval
        will <u>decrease</u> since our sample is <u>larger</u> and our
        estimates are therefore <u>more precise</u>.

-   **If we <u>decrease</u> our sample size** $n$**.**

    -   Both our margin of error and the width of our confidence interval
        will <u>increase</u> since our sample is <u>smaller</u> and our
        estimates are therefore <u>less precise</u>.

To see this point interactively, modify the code below by changing the
input for $n$ (currently set at 30). We can see that the size of the
confidence intervals increases or decreases depending on whether we
decrease or increase the simulated sample size.

In [None]:
set.seed(2)  # set the seed before any random draws so the results are reproducible
population <- rnorm(10000, 0, 1)

# defining a function which outputs a 95% confidence interval for a given sample size
# (the population standard deviation is known to be 1, so we use the z critical value)
create_confidence_intervals <- function(n) {
    x <- mean(sample(population, n))
    z <- qnorm(p = 0.025, lower.tail = FALSE)
    lower <- x - (z*1/sqrt(n))
    upper <- x + (z*1/sqrt(n))
    return(c(lower, upper))
    }

# calling the function, tweak default sample size 30 here!
create_confidence_intervals(30)
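The effect of the sample size can also be tabulated directly from the margin of error formula. The sketch below assumes a known population standard deviation of 1 and a 95% confidence level; the sample sizes are arbitrary, chosen so that each is four times the previous one.

```r
sigma  <- 1             # population standard deviation, assumed known
z_crit <- qnorm(0.975)  # critical value for a 95% confidence level

# Each sample size is four times the previous one
sample_sizes <- c(30, 120, 480, 1920)

# Margin of error z * sigma / sqrt(n) for each sample size
margins <- z_crit * sigma / sqrt(sample_sizes)

width_table <- data.frame(n = sample_sizes,
                          margin = margins,
                          width = 2 * margins)
width_table
```

Because $n$ enters through $\sqrt n$, quadrupling the sample size only halves the margin of error.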

# Changing the Confidence Level

-   If we increase the confidence level to a higher percentage, then the
    new confidence interval will be wider.

-   If we decrease the confidence level to a lower percentage, then the
    new confidence interval will be narrower.

The logic is simple: to be more confident that our interval actually
contains the true value of the population parameter, we must make the
interval wider. Likewise, accepting a lower confidence level lets us
report a narrower interval.

<sub>Increased confidence level → Higher error bound → Wider confidence interval</sub>

<sub>Decreased confidence level → Lower error bound → Narrower confidence interval</sub>

This all occurs mathematically through an increase or decrease in our
margin of error (or bounds) respectively due to the increase or decrease
in our $z$ or $t$ critical value.

To see this point interactively, modify the code below by changing the
input for $\alpha$ (currently set at 0.05, indicating a 95% confidence
level). We can see that the width of the confidence interval increases
or decreases depending on whether we increase or decrease the simulated
confidence level.

In [None]:
set.seed(2)  # set the seed before any random draws so the results are reproducible
population <- rnorm(10000, 0, 1)

# defining a function which outputs a confidence interval for a given confidence level
# (the population standard deviation is known to be 1, so we use the z critical value)
create_confidence_intervals <- function(alpha) {
    x <- mean(sample(population, 100))
    z <- qnorm(p = alpha / 2, lower.tail = FALSE)  # two-sided interval: alpha/2 in each tail
    lower <- x - (z*1/sqrt(100))
    upper <- x + (z*1/sqrt(100))
    return(c(lower, upper))
    }

# calling the function, tweak default 0.05 alpha (95% confidence level) here!
create_confidence_intervals(0.05)
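To make the link between the confidence level and the critical value concrete, the sketch below lists the two-sided $z$ critical values for three common confidence levels, along with the resulting interval width for an assumed sample of 100 draws from a population with standard deviation 1.

```r
conf_levels <- c(0.90, 0.95, 0.99)
alphas      <- 1 - conf_levels

# Two-sided critical values: alpha/2 in each tail
z_crits <- qnorm(1 - alphas / 2)  # about 1.645, 1.960, 2.576

n     <- 100  # assumed sample size
sigma <- 1    # assumed known population standard deviation

level_table <- data.frame(conf_level = conf_levels,
                          z = round(z_crits, 3),
                          width = round(2 * z_crits * sigma / sqrt(n), 3))
level_table
```

The critical value, and with it the interval width, grows as the confidence level rises.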

## Exercise

Matilda thinks that one of her confidence intervals above is too wide
and wishes to narrow it. What could she do in order to achieve this
goal?

-   A. increase the sample size and raise the confidence level
-   B. decrease the sample size and lower the confidence level
-   C. increase the sample size and lower the confidence level
-   D. decrease the sample size and raise the confidence level

In [None]:
answer_1 <- "..." # enter your choice here

test_11()

# Common Misconceptions

Up to this point, we’ve covered what confidence intervals are, how we
calculate them, and how they’re sensitive to two key parameters. Let’s
lastly clarify a couple of misconceptions about the interpretation of
confidence intervals.

## Misconception 1:

*If we have a 95% confidence interval, this is a **concrete range**
under which our estimated population parameter **must** fall*.

-   If we repeated our sampling procedure many times and constructed a
    confidence interval each time, we would expect about 95% of these
    confidence intervals to contain our true parameter.

-   **However, this is not 100%**, since about 5% of our confidence
    intervals will not contain the true parameter. Nothing stops the
    confidence interval we actually calculate from being one of those
    5%.

-   Therefore, we cannot say with absolute certainty that our true
    parameter lies within the interval that we calculate.

The confidence interval is an *estimate* and not a guaranteed range of
possible values for the population parameter.

## Misconception 2:

*If we have a confidence level of 95%, 95% of our population data must
lie within the calculated confidence interval*.

-   This is not true since our confidence level indicates the long run
    percentage of constructed confidence intervals which contain our
    true parameter but says nothing about the spread of our actual data.

-   To find the range within which 95% of our data lie, we must look at
    the distribution of the data itself, for example with a histogram.

-   For instance, if our data is strongly bimodal (around half of our
    data is clustered far to the left of our mean, and the other half is
    clustered far to the right of our mean), our calculated 95%
    confidence interval will likely contain very little (much less than
    95%) of the data.

The confidence level does **not** determine the spread of the actual
data.
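The bimodal case can be sketched with simulated data. The two clusters below (centred at -5 and 5, each with standard deviation 1) are assumed values chosen to make the effect obvious.

```r
set.seed(1)

# Simulated bimodal data: two clusters far on either side of the overall mean
data_bimodal <- c(rnorm(500, mean = -5, sd = 1),
                  rnorm(500, mean = 5, sd = 1))

# 95% confidence interval for the mean of this data
n_b    <- length(data_bimodal)
margin <- qt(0.975, df = n_b - 1) * sd(data_bimodal) / sqrt(n_b)
ci     <- c(mean(data_bimodal) - margin, mean(data_bimodal) + margin)

# Share of the individual observations that fall inside the interval
share_inside <- mean(data_bimodal >= ci[1] & data_bimodal <= ci[2])

ci
share_inside  # far below 0.95
```

The interval is a statement about where the mean lies, so it sits in the gap between the two clusters where almost none of the observations are.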

## Misconception 3:

*If we have a confidence level of 95%, a confidence interval calculated
from a sample of 500 observations will more likely contain the true
parameter than a confidence interval calculated from a sample of 100
observations.*

-   We know from the previous section that a confidence interval
    generated from a sample of $n = 500$ will be narrower than one
    generated from $n = 100$.

-   However, a confidence level is by definition the percentage of
    calculated intervals we expect to contain the true parameter of
    interest if we calculated these intervals over and over.

-   This means any one interval from a sample of $n = 100$ has a 95%
    chance of containing the true parameter, just as any one interval
    from a sample of $n = 500$ has a 95% chance of containing the true
    parameter. Each interval (the wider one from $n = 100$ and the
    narrower one from $n = 500$) has a chance of containing the true
    parameter in relation to all other calculated intervals for that
    same sample size.

-   Hence, whether we have an interval from a sample of n=100 or n=500,
    we are still 95% confident in both cases that the true parameter
    lies within that interval.

The probability of a given interval containing the true parameter is not
affected by the sample size. This probability only changes when we
change our confidence level.
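A simulation can illustrate that the sample size does not change how often intervals capture the true parameter. In this sketch, the helper `coverage_for_n()` is hypothetical, and the population (normal with mean 0 and standard deviation 1) and the 2000 repetitions are assumed values.

```r
set.seed(2023)

# Hypothetical helper: share of 95% intervals that contain the true mean mu
coverage_for_n <- function(n, reps = 2000, mu = 0, sigma = 1) {
  hits <- replicate(reps, {
    x      <- rnorm(n, mean = mu, sd = sigma)
    margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
    abs(mean(x) - mu) <= margin  # TRUE when the interval contains mu
  })
  mean(hits)
}

coverage_100 <- coverage_for_n(100)
coverage_500 <- coverage_for_n(500)

# Both shares should be close to 0.95 despite the different interval widths
c(n_100 = coverage_100, n_500 = coverage_500)
```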

> **🔎 Let’s think critically**
>
> > 🟠 Every research context will drastically shape how confidence
> > intervals are approached. As we have seen, the volume and quality of
> > data affect how accurate data analyses can be, and many rules of
> > thumb in data science are simply that - rules of thumb, as opposed
> > to hard facts about how to report statistics.  
> > 🟠 What are some situations where you want to know that something is
> > true with nearly 100% confidence?  
> > 🟠 What are some situations where the uncertainty of a statistic is
> > maybe not so bad?  
> > 🟠 What does it *really* mean to have something within or outside of
> > a confidence interval?