# 3 Sampling Distributions

Learn how to quantify the accuracy of sample statistics using relative errors, and measure variation in your estimates by generating sampling distributions.

# Calculating relative errors

The size of the sample you take affects how accurately a point estimate reflects the corresponding population parameter. For example, when you calculate a sample mean, you want it to be close to the population mean. However, if your sample is too small, this might not be the case.

The most common metric for assessing accuracy is relative error. This is the absolute difference between the population parameter and the point estimate, all divided by the population parameter. It is sometimes expressed as a percentage.
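The definition above can be written as a small helper. This is a minimal sketch: the function name `relative_error_pct()` and the numbers passed to it are made up for illustration and are not part of the course data.

```r
# Relative error as a percentage:
# 100 * |population parameter - point estimate| / population parameter
# (function and argument names are illustrative assumptions)
relative_error_pct <- function(population_parameter, point_estimate) {
  100 * abs(population_parameter - point_estimate) / population_parameter
}

# Made-up numbers: a population proportion of 0.16 and a sample estimate of 0.20
relative_error_pct(population_parameter = 0.16, point_estimate = 0.20)
```

With these made-up inputs the absolute difference is 0.04, which is a quarter of 0.16, so the relative error is 25%.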

attrition_pop and mean_attrition_pop are available; dplyr is loaded.

# Instructions:

- Generate a simple random sample from attrition_pop of ten rows.
- Summarize to calculate the mean proportion of employee attrition (Attrition equals "Yes").
- Calculate the relative error between mean_attrition_srs10 and mean_attrition_pop as a percentage.

In [None]:
# Generate a simple random sample of 10 rows
attrition_srs10 <- attrition_pop %>%
  slice_sample(n = 10)

# Calculate the proportion of employee attrition in the sample
mean_attrition_srs10 <- attrition_srs10 %>%
  summarize(mean_attrition = mean(Attrition == "Yes")) %>%
  pull(mean_attrition)

# Calculate the relative error percentage
rel_error_pct10 <- 100 * abs(mean_attrition_pop - mean_attrition_srs10) / mean_attrition_pop

# See the result
rel_error_pct10

- Calculate the relative error percentage again. This time, use a simple random sample of one hundred rows of attrition_pop.

In [None]:
# Calculate the relative error percentage again with a sample of 100 rows
attrition_srs100 <- attrition_pop %>% 
  slice_sample(n = 100)

mean_attrition_srs100 <- attrition_srs100 %>% 
  summarize(mean_attrition = mean(Attrition == "Yes")) %>% 
  pull(mean_attrition)

rel_error_pct100 <- 100 * abs(mean_attrition_pop - mean_attrition_srs100) / mean_attrition_pop

# See the result
rel_error_pct100

# Relative error vs. sample size

The plot shows the relative error in the proportion of employee attritions, using simple random sampling, for sample sizes from 2 to 1470 (the size of the population).

Clicking "Regenerate plot" will create new samples for each sample size, and calculate the relative errors again.

Which statement about relative errors and sample sizes is true?

( ) For any given sample size, the relative error between the sample mean and the population mean is fixed at a specific value.

( ) When the sample is as large as the whole population, the relative error is small, but never zero.

( ) If the sample mean is greater than the population mean, the relative error can be less than zero.

( ) The relative error can never be greater than 100%.

(x) For small sample sizes, each additional entry in a sample can result in substantial decreases to the relative error.
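The behavior shown in the plot can be explored without the interactive widget. The sketch below is an assumption-laden stand-in: `attrition_flag` is a hypothetical logical vector of 1470 values with roughly 16% `TRUE`, used in place of the real `attrition_pop` data, and it uses base R only so that it runs without extra packages.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for Attrition == "Yes" in attrition_pop:
# 1470 employees, 235 of whom left (about 16%)
attrition_flag <- rep(c(TRUE, FALSE), times = c(235, 1235))
pop_mean <- mean(attrition_flag)

# Relative error (%) of one simple random sample of size n
rel_error_for_size <- function(n) {
  srs <- sample(attrition_flag, size = n)  # sampling without replacement
  100 * abs(pop_mean - mean(srs)) / pop_mean
}

sample_sizes <- c(2, 10, 50, 250, 1470)
rel_errors <- sapply(sample_sizes, rel_error_for_size)
data.frame(sample_size = sample_sizes, rel_error_pct = rel_errors)
```

Removing the `set.seed()` call and rerunning gives different errors for the small sample sizes, so the error is not fixed for a given size. When the sample contains all 1470 rows, the sample mean equals the population mean and the relative error is zero. The error can also exceed 100%: a sample of two employees who both left has a sample mean of 1, far above the population proportion.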

# Replicating samples

When you calculate a point estimate such as a sample mean, the value you calculate depends on the rows that were included in the sample. That means that there is some randomness in the answer. In order to quantify the variation caused by this randomness, you can create many samples and calculate the sample mean (or other statistic) for each sample.

attrition_pop is available; dplyr and ggplot2 are loaded.

# Instructions:

- Replicate the provided code so that it runs 500 times. Assign the resulting vector of sample means to mean_attritions.

In [None]:
# Replicate this code 500 times
mean_attritions <- replicate(
  n = 500,
  attrition_pop %>%
    slice_sample(n = 20) %>%
    summarize(mean_attrition = mean(Attrition == "Yes")) %>%
    pull(mean_attrition)
)

# See the result
head(mean_attritions)

- Create a tibble named sample_means with a column named sample_mean to store mean_attritions.
- Using sample_means, draw a histogram of the sample_mean column with a binwidth of 0.05.

In [None]:
# From previous step
mean_attritions <- replicate(
  n = 500,
  attrition_pop %>% 
    slice_sample(n = 20) %>% 
    summarize(mean_attrition = mean(Attrition == "Yes")) %>% 
    pull(mean_attrition)
)

# Store mean_attritions in a tibble in a column named sample_mean
sample_means <- tibble(sample_mean = mean_attritions)
# Plot a histogram of the `sample_mean` column, binwidth 0.05
ggplot(sample_means, aes(sample_mean)) +
  geom_histogram(binwidth = 0.05)

# Replication parameters

The dashboard shows a histogram of sample mean proportions of employee attrition. There are two parameters: the size of each simple random sample, and the number of replicates. It's important to understand how each of these parameters affects the result. Use the parameter sliders to explore different values and note their effect on the histogram.

Which statement about the effect of each parameter on the distribution of sample means is true?

( ) As the sample size increases, the range of calculated sample means tends to increase.

( ) As the number of replicates increases, the range of calculated sample means tends to increase.

(x) As the sample size increases, the range of calculated sample means tends to decrease.

( ) As the number of replicates increases, the range of calculated sample means tends to decrease.
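The effect of sample size can also be checked numerically rather than with the sliders. This is a rough sketch under stated assumptions: `attrition_flag` is a hypothetical stand-in for the attrition column, and `sample_mean_range()` is an illustrative helper, not a function from the course.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for Attrition == "Yes" in attrition_pop
attrition_flag <- rep(c(TRUE, FALSE), times = c(235, 1235))

# Range (max - min) of the sample means across n_replicates simple random samples
sample_mean_range <- function(sample_size, n_replicates) {
  sample_means <- replicate(
    n_replicates,
    mean(sample(attrition_flag, size = sample_size))
  )
  diff(range(sample_means))
}

range_small_samples <- sample_mean_range(sample_size = 5, n_replicates = 500)
range_large_samples <- sample_mean_range(sample_size = 200, n_replicates = 500)
c(size_5 = range_small_samples, size_200 = range_large_samples)
```

Holding the number of replicates fixed at 500 isolates the effect of the sample size; trying other values of `n_replicates` in the same helper is one way to explore the second slider.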

# Exact sampling distribution

To quantify how the point estimate (sample statistic) you are interested in varies, you need to know all the possible values it can take, and how often. That is, you need to know its distribution.

The distribution of a sample statistic is called the sampling distribution. When we can calculate this exactly, rather than using an approximation, it is known as the exact sampling distribution.
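As a small illustration of an exact sampling distribution, the sketch below enumerates every outcome for two six-sided dice and tabulates the mean roll. The two-dice setup is chosen here only because it is small enough to read in full; it uses base R's `expand.grid()` so it runs without extra packages.

```r
# All 36 equally likely outcomes for two six-sided dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
rolls$mean_roll <- (rolls$die1 + rolls$die2) / 2

# Exact sampling distribution of the mean: every possible value and its probability
exact_distribution <- table(rolls$mean_roll) / nrow(rolls)
exact_distribution
```

Because every outcome is listed and counted, the probabilities are exact rather than estimated from random samples.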

Let's take another look at the sampling distribution of dice rolls. This time, we'll look at five eight-sided dice. (These have the numbers one to eight.)



# Instructions:

- Expand a grid representing five eight-sided dice. That is, create a tibble with five columns, named die1 to die5. The rows should contain all possibilities for throwing five dice, each numbered 1 to 8.

In [None]:
# Expand a grid representing 5 8-sided dice
dice <- expand_grid(
  die1 = 1:8,
  die2 = 1:8,
  die3 = 1:8,
  die4 = 1:8,
  die5 = 1:8
)

# See the result
dice

- Add a column, mean_roll, to dice, that contains the mean of the five rolls.

In [None]:
dice <- expand_grid(
  die1 = 1:8,
  die2 = 1:8,
  die3 = 1:8,
  die4 = 1:8,
  die5 = 1:8
) %>% 
  # Add a column of mean rolls
  mutate(mean_roll = (die1 + die2 + die3 + die4 + die5) / 5)

- Using the dice dataset, plot mean_roll, converted to a factor, as a bar plot.

In [None]:
# From previous step
dice <- expand_grid(
  die1 = 1:8,
  die2 = 1:8,
  die3 = 1:8,
  die4 = 1:8,
  die5 = 1:8
) %>% 
  mutate(mean_roll = (die1 + die2 + die3 + die4 + die5) / 5)

# Using dice, draw a bar plot of mean_roll as a factor
ggplot(dice, aes(factor(mean_roll))) + geom_bar()

# Approximate sampling distribution

Calculating the exact sampling distribution is only possible in very simple situations. With just five eight-sided dice, the number of possible rolls is 8 ^ 5, which is over thirty thousand. When the dataset is more complicated, for example where a variable has hundreds or thousands of categories, the number of possible outcomes becomes too large to enumerate exactly.
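To get a feel for how quickly the number of outcomes grows, here is a small arithmetic sketch. The helper name `n_outcomes()` is an illustrative assumption.

```r
# Number of equally likely outcomes when rolling n_dice dice with n_sides sides each
n_outcomes <- function(n_sides, n_dice) n_sides ^ n_dice

n_outcomes(n_sides = 8, n_dice = 5)   # 32768, the "over thirty thousand" rolls
n_outcomes(n_sides = 8, n_dice = 10)  # 1073741824, over a billion
n_outcomes(n_sides = 8, n_dice = 20)  # about 1.15e18
```

Doubling the number of dice squares the number of rows a full grid would need, which is why enumeration stops being practical so quickly.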

In this situation, you can calculate an approximate sampling distribution by simulating the exact sampling distribution. That is, you can repeat a procedure over and over again to simulate both the sampling process and the sample statistic calculation process.

tibble and ggplot2 are loaded.

# Instructions:

- Sample one to eight, five times, with replacement. Assign to five_rolls.
- Calculate the mean of five_rolls.

In [None]:
# Sample one to eight, five times, with replacement
five_rolls <- sample(1:8, size = 5, replace = TRUE)
# Calculate the mean of five_rolls
mean(five_rolls)

- Replicate the sampling code 1000 times, assigning to sample_means_1000.

In [None]:
# Replicate the sampling code 1000 times
sample_means_1000 <- replicate(
  n = 1000,
  expr = {
    five_rolls <- sample(1:8, size = 5, replace = TRUE)
    mean(five_rolls)
  }
)

# See the result
sample_means_1000

- Create a tibble named sample_means, and store sample_means_1000 in a column named sample_mean.

In [None]:
# From previous step
sample_means_1000 <- replicate(
  n = 1000,
  expr = {
    five_rolls <- sample(1:8, size = 5, replace = TRUE)
    mean(five_rolls)
  }
)

# Wrap sample_means_1000 in the sample_mean column of a tibble
sample_means <- tibble(
  sample_mean = sample_means_1000
)

# See the result
sample_means

- Using the sample_means dataset, plot sample_mean, converted to a factor, as a bar plot.

In [None]:
# From previous steps
sample_means_1000 <- replicate(
  n = 1000,
  expr = {
    five_rolls <- sample(1:8, size = 5, replace = TRUE)
    mean(five_rolls)
  }
)
sample_means <- tibble(
  sample_mean = sample_means_1000
)

# Using sample_means, draw a bar plot of sample_mean as a factor
ggplot(sample_means, aes(factor(sample_mean))) + geom_bar()
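One way to check how close the simulation comes to the exact distribution is to compare summary statistics side by side. The sketch below is self-contained and uses base R (`expand.grid()` and `rowMeans()`) instead of the tidyverse functions from the exercises; the seed is arbitrary.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Exact: enumerate all 8^5 outcomes for five eight-sided dice
dice <- expand.grid(die1 = 1:8, die2 = 1:8, die3 = 1:8, die4 = 1:8, die5 = 1:8)
exact_means <- rowMeans(dice)

# Approximate: 1000 simulated throws of five eight-sided dice
approx_means <- replicate(1000, mean(sample(1:8, size = 5, replace = TRUE)))

# Compare the centre and spread of the two distributions.
# sd() divides by n - 1, which makes a negligible difference over 32768 rows.
c(exact_mean = mean(exact_means), approx_mean = mean(approx_means))
c(exact_sd = sd(exact_means), approx_sd = sd(approx_means))
```

The exact mean is 4.5, the average face value of an eight-sided die. The simulated values should land close to the exact ones but will change from run to run if the seed is removed.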

# Exact vs. approximate

You've seen two types of sampling distribution now (exact and approximate). You need to be clear about when each should be computed.

Is it always possible to compute the exact sampling distribution directly?

# Answer the question:

(x) No, the computational time and resources needed to look at the population of values could be too much for our problem.

( ) No, the exact sampling distribution is always unknown even for calculating the sample mean of a small number of die tosses like 2 or 3.

( ) Yes, the population will always be known ahead of time so one extra calculation is no problem.

( ) Yes, the exact sampling distribution can be generated using the replicate() function so it should be used in all circumstances.

# Population & sampling distribution means

One of the useful features of sampling distributions is that you can quantify them. In particular, you can calculate summary statistics on them. Here, we'll look at the relationship between the mean of the sampling distribution and the population parameter that the sampling is supposed to estimate.

Three sampling distributions are provided. In each case, the employee attrition dataset was sampled using simple random sampling, then the mean attrition was calculated. This was done 1000 times to get a sampling distribution of mean attritions. One sampling distribution used a sample size of 5 for each replicate, one used 50, and one used 500.

attrition_pop, sampling_distribution_5, sampling_distribution_50, and sampling_distribution_500 are available; dplyr is loaded.
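The provided sampling distributions are not constructed in this section. As a hedged sketch of how objects like them could be generated, the code below follows the procedure described above (simple random sample, sample mean, 1000 replicates). Everything in it is an assumption for illustration: `attrition_flag` is a hypothetical stand-in for the attrition column, and `make_sampling_distribution()` and the `example_distribution_*` names are invented here.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for Attrition == "Yes" in attrition_pop
attrition_flag <- rep(c(TRUE, FALSE), times = c(235, 1235))

# Build one sampling distribution: n_replicates sample means,
# each from a simple random sample of sample_size rows
make_sampling_distribution <- function(sample_size, n_replicates = 1000) {
  data.frame(
    mean_attrition = replicate(
      n_replicates,
      mean(sample(attrition_flag, size = sample_size))
    )
  )
}

example_distribution_5 <- make_sampling_distribution(5)
example_distribution_50 <- make_sampling_distribution(50)
example_distribution_500 <- make_sampling_distribution(500)

head(example_distribution_5)
```

Each result has one row per replicate and a single `mean_attrition` column, matching the column name used in the exercise code.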

# Instructions:

- Using sampling_distribution_5, calculate the mean across all the replicates of the mean_attrition column (a mean of sample means). Store this in a column called mean_mean_attrition.
- Do the same calculation using sampling_distribution_50 and sampling_distribution_500.

In [None]:
# Calculate the mean across replicates of the mean attritions in sampling_distribution_5
mean_of_means_5 <- sampling_distribution_5 %>%
  summarize(mean_mean_attrition = mean(mean_attrition))

# Do the same for sampling_distribution_50
mean_of_means_50 <- sampling_distribution_50 %>%
  summarize(mean_mean_attrition = mean(mean_attrition))

# ... and for sampling_distribution_500
mean_of_means_500 <- sampling_distribution_500 %>%
  summarize(mean_mean_attrition = mean(mean_attrition))

# See the results
mean_of_means_5
mean_of_means_50
mean_of_means_500

# Question

How does sample size affect the mean of the sample means?

( ) As the sample size increases, the mean of the sampling distribution decreases until it reaches the population mean.

( ) As the sample size increases, the mean of the sampling distribution increases until it reaches the population mean.

(x) Regardless of sample size, the mean of the sampling distribution is a close approximation to the population mean.

( ) Regardless of sample size, the mean of the sampling distribution is biased and cannot approximate the population mean.

# Population and sampling distribution variation

You just calculated the mean of the sampling distribution and saw how it is an estimate of the corresponding population parameter. Similarly, as a result of the central limit theorem, the standard deviation of the sampling distribution has an interesting relationship with the population standard deviation and the sample size.

attrition_pop, sampling_distribution_5, sampling_distribution_50, and sampling_distribution_500 are available; dplyr is loaded.

# Instructions:

- Using sampling_distribution_5, calculate the standard deviation across all the replicates of the mean_attrition column (a standard deviation of sample means). Store this in a column called sd_mean_attrition.
- Do the same calculation using sampling_distribution_50 and sampling_distribution_500.

In [None]:
# Calculate the standard deviation across replicates of the mean attritions in sampling_distribution_5
sd_of_means_5 <- sampling_distribution_5 %>%
  summarize(sd_mean_attrition = sd(mean_attrition))

# Do the same for sampling_distribution_50
sd_of_means_50 <- sampling_distribution_50 %>%
  summarize(sd_mean_attrition = sd(mean_attrition))

# ... and for sampling_distribution_500
sd_of_means_500 <- sampling_distribution_500 %>%
  summarize(sd_mean_attrition = sd(mean_attrition))

# See the results
sd_of_means_5
sd_of_means_50
sd_of_means_500

# Question

How are the standard deviations of the sampling distributions related to the population standard deviation and the sample size?

( ) The standard deviation of the sampling distribution is approximately equal to the population standard deviation, regardless of sample size.

( ) The standard deviation of the sampling distribution is approximately equal to the population standard deviation multiplied by the sample size.

( ) The standard deviation of the sampling distribution is approximately equal to the population standard deviation multiplied by the square root of the sample size.

( ) The standard deviation of the sampling distribution is approximately equal to the population standard deviation divided by the sample size.

(x) The standard deviation of the sampling distribution is approximately equal to the population standard deviation divided by the square root of the sample size.
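The relationship can be checked by simulation. The sketch below is illustrative only: `attrition_flag` is a hypothetical stand-in for the attrition column, `compare_sd()` is an invented helper, and the population standard deviation is computed with a denominator of N rather than N - 1.

```r
set.seed(2024)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for Attrition == "Yes" in attrition_pop
attrition_flag <- rep(c(TRUE, FALSE), times = c(235, 1235))

# Population standard deviation (denominator N, not N - 1)
pop_sd <- sqrt(mean((attrition_flag - mean(attrition_flag))^2))

# Simulated sd of the sample means next to pop_sd / sqrt(sample_size)
compare_sd <- function(sample_size, n_replicates = 1000) {
  sample_means <- replicate(
    n_replicates,
    mean(sample(attrition_flag, size = sample_size))
  )
  c(
    sample_size = sample_size,
    simulated_sd = sd(sample_means),
    pop_sd_over_sqrt_n = pop_sd / sqrt(sample_size)
  )
}

sd_comparison <- t(sapply(c(5, 50), compare_sd))
sd_comparison
```

The two columns should agree closely for these sample sizes. Because the samples are drawn without replacement from a finite population, the simulated value falls noticeably below the population standard deviation divided by the square root of the sample size once the sample is a large fraction of the population (for example 500 of 1470 rows); the finite population correction factor sqrt((N - n) / (N - 1)) accounts for that gap.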