# 1 Introduction to Sampling

Learn what sampling is and why it is useful, understand the problems caused by convenience sampling, and learn about the differences between true randomness and pseudo-randomness.


# Reasons for sampling

Sampling is an important technique in your statistical arsenal. It isn't always appropriate though—you need to know when to use it and when to work with the whole dataset.

Which of the following is not a good scenario to use sampling?

# Select one answer

( ) You've been handed one terabyte of data about error logs for your company's IoT device.

( ) You wish to learn information about the travel habits of all Pakistani adult citizens.

(x) You've finished collecting data on a small study on the size of the wings of 10 butterflies.

( ) You are working to predict customer turnover on a big data project for your marketing firm.

# Simple sampling with dplyr

Throughout this chapter you'll be exploring song data from Spotify. Each row of the dataset represents a song, and there are 41656 rows. Columns include the name of the song, the artists who performed it, the release year, and attributes of the song like its duration, tempo, and danceability. We'll start by looking at the durations.

Your first task is to sample the song dataset and compare a calculation on the whole population and on a sample.

spotify_population is available and dplyr is loaded.

# Instructions:

- Use View() to view the spotify_population dataset. Explore it in the viewer until you are clear on what it contains.
- Use dplyr to sample 1000 rows from spotify_population, assigning to spotify_sample.

In [None]:
# Preview the first rows of the population dataset
head(spotify_population)

# Sample 1000 rows from spotify_population
spotify_sample <- spotify_population %>% slice_sample(n = 1000)

# See the result
spotify_sample

- Using the spotify_population dataset, calculate the mean duration in minutes. Call the calculated column mean_dur.
- Using the spotify_sample dataset, perform the same calculation in another column called mean_dur.
- Look at the two values. How different are they?

In [None]:
# From previous step
spotify_sample <- spotify_population %>% 
  slice_sample(n = 1000)

# Calculate the mean duration in mins from spotify_population
mean_dur_pop <- spotify_population %>% 
  summarize(mean_dur = mean(duration_minutes))

# Calculate the mean duration in mins from spotify_sample
mean_dur_samp <- spotify_sample %>% 
  summarize(mean_dur = mean(duration_minutes))

# See the results
mean_dur_pop
mean_dur_samp

# Simple sampling with base-R

While dplyr provides great tools for sampling data frames, if you want to work with vectors you can use base-R.
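
As a minimal sketch of base-R vector sampling, the example below draws from a small, made-up vector (the `values` vector and the seed are illustrative assumptions, not part of the Spotify data). It shows the default draw without replacement and the `replace = TRUE` option.

```r
# Hypothetical vector of ten values, standing in for a data frame column
values <- c(-12.5, -8.1, -6.3, -15.0, -9.9, -7.4, -11.2, -5.8, -13.6, -10.0)

set.seed(42)  # arbitrary seed, only so the draw is repeatable

# Draw 4 values without replacement (the default)
samp_no_replace <- sample(values, size = 4)

# Draw 15 values with replacement; required when size exceeds length(values)
samp_replace <- sample(values, size = 15, replace = TRUE)

samp_no_replace
samp_replace
```

Without replacement, no element can be drawn twice, so `size` cannot exceed the length of the vector.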

Let's turn it up to eleven and look at the loudness property of each song.

spotify_population is available.

# Instructions:

- Get the loudness column of spotify_population, assigning to loudness_pop.
- Using base-R, sample loudness_pop to get 100 random values, assigning to loudness_samp.

In [None]:
# Get the loudness column of spotify_population
loudness_pop <- spotify_population$loudness

# Sample 100 values of loudness_pop
loudness_samp <- sample(loudness_pop, size = 100)

# See the results
loudness_samp

- Calculate the standard deviation of loudness_pop.
- Calculate the standard deviation of loudness_samp.
- Look at the two values. How different are they?

In [None]:
# From previous step
loudness_pop <- spotify_population$loudness
loudness_samp <- sample(loudness_pop, size = 100)

# Calculate the standard deviation of loudness_pop
sd_loudness_pop <- sd(loudness_pop)

# Calculate the standard deviation of loudness_samp
sd_loudness_samp <- sd(loudness_samp)

# See the results
sd_loudness_pop
sd_loudness_samp

# Are findings from the sample generalizable?

You just saw how convenience sampling (collecting data via the easiest method) can result in samples that aren't representative of the whole population. Equivalently, findings from such a sample are not generalizable to the whole population. Visualizing the distributions of the population and the sample can help determine whether the sample is representative of the population.
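
To make the idea concrete, here is a rough sketch on simulated data. The population, its beta distribution, and the "take the first 500 records" rule are all assumptions chosen for illustration and are unrelated to how the Spotify samples were built. It compares a convenience sample with a simple random sample.

```r
set.seed(1)  # arbitrary seed, only so the result is repeatable

# Hypothetical population: 10,000 scores in [0, 1], stored in sorted order
# (for example, a file that happens to be ordered by score)
population <- sort(rbeta(10000, shape1 = 2, shape2 = 5))

# Convenience sample: just take the first 500 records
convenience_samp <- head(population, 500)

# Simple random sample: every record is equally likely to be chosen
random_samp <- sample(population, size = 500)

# Compare each sample mean with the population mean
c(
  population  = mean(population),
  convenience = mean(convenience_samp),
  random      = mean(random_samp)
)
```

Because the first 500 records are the lowest scores, the convenience sample mean falls well below the population mean, while the random sample mean should land close to it.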

The Spotify dataset contains a column named acousticness, which is a confidence measure from zero to one of whether the track is acoustic, that is, whether it was made with instruments that aren't plugged in. Here, you'll look at acousticness in the total population of songs, and in a sample of those songs.

spotify_population and spotify_mysterious_sample are available; dplyr and ggplot2 are loaded.

# Instructions:

- Using spotify_population, draw a histogram of acousticness with binwidth of 0.01.

In [None]:
# Visualize the distribution of acousticness as a histogram with a binwidth of 0.01
ggplot(spotify_population, aes(acousticness)) +
  geom_histogram(binwidth = 0.01)


- Update the histogram code to use the spotify_mysterious_sample dataset.
- Set the x-axis limits from zero to one (for easier comparison with the previous plot).

In [None]:
# Update the histogram to use spotify_mysterious_sample with x-axis limits from 0 to 1
ggplot(spotify_mysterious_sample, aes(acousticness)) +
  geom_histogram(binwidth = 0.01) +
  xlim(0, 1)

# Question

Compare the two histograms you drew. Are the acousticness values in the sample generalizable to the general population?

# Possible answers

( ) Yes. Any sample should lead to a generalizable result about the population.

( ) Yes. The sample selected is likely a random sample of all songs in our population.

( ) No. Samples can never lead to generalizable results about the population.

(x) No. The acousticness samples are consistently higher than those in the general population.

( ) No. The acousticness samples are consistently lower than those in the general population.

In [None]:
DM.result = 4

# Are these findings generalizable?

Let's look at another sample to see if it is representative of the population. This time, you'll look at the duration_minutes column of the Spotify dataset, which contains the length of the song in minutes.

spotify_population and spotify_mysterious_sample2 are available; dplyr and ggplot2 are loaded.

# Instructions:

- Using spotify_population, draw a histogram of duration_minutes with binwidth of 0.5.



In [None]:
# Visualize the distribution of duration_minutes as a histogram with a binwidth of 0.5
ggplot(spotify_population, aes(duration_minutes)) +
  geom_histogram(binwidth = 0.5)

- Update the histogram code to use the spotify_mysterious_sample2 dataset.
- Set the x-axis limits from zero to fifteen (for easier comparison with the previous plot).

In [None]:
# Update the histogram to use spotify_mysterious_sample2 with x-axis limits from 0 to 15
ggplot(spotify_mysterious_sample2, aes(duration_minutes)) +
  geom_histogram(binwidth = 0.5) +
  xlim(0, 15)

# Question

Compare the two histograms you drew. Are the duration values in the sample generalizable to the general population?

# Possible answers

( ) Yes. Any sample should lead to a generalizable result about the population.

(x) Yes. The sample selected is likely a random sample of all songs in our population.

( ) No. Samples can never lead to generalizable results about the population.

( ) No. The duration samples are consistently higher than those in the general population.

( ) No. The duration samples are consistently lower than those in the general population.

In [None]:
DM.result = 2

# Generating random numbers

You've seen sample() and its dplyr cousin, slice_sample(), for drawing pseudo-random samples from a set of values. A related task is to generate random numbers that follow a statistical distribution, like the uniform distribution or the normal distribution.

Each random number generation function has a name beginning with "r". Its first argument is the number of values to generate; the other arguments are distribution-specific. Free hint: try args(runif) and args(rnorm) to see what arguments you need to pass to those functions.
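
The following sketch shows the general calling pattern. The seed and the distribution parameters are arbitrary choices for illustration and differ from the exercise values.

```r
# Inspect the arguments each generator accepts
args(runif)  # n, min, max
args(rnorm)  # n, mean, sd

set.seed(10)  # arbitrary seed, only so the draws below are repeatable

# Five draws from a uniform distribution on [0, 10]
u <- runif(5, min = 0, max = 10)

# Five draws from a normal distribution with mean 100 and standard deviation 15
z <- rnorm(5, mean = 100, sd = 15)

u
z
```

In both functions the first argument is the number of draws, and the remaining arguments describe the distribution.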

n_numbers is available and set to 5000; ggplot2 is loaded.

# Instructions:

- Complete the data frame of random numbers.
- Generate n_numbers from a uniform distribution ranging from -3 to 3.
- Generate n_numbers from a normal distribution with mean 5 and standard deviation 2.

In [None]:
# Generate random numbers from ...
randoms <- data.frame(
  # a uniform distribution from -3 to 3
  uniform = runif(n_numbers, min = -3, max = 3),
  # a normal distribution with mean 5 and sd 2
  normal = rnorm(n_numbers, mean = 5, sd = 2))

- Using randoms, plot a histogram of the uniform column, using binwidth 0.25.

In [None]:
# From previous step
randoms <- data.frame(
  uniform = runif(n_numbers, min = -3, max = 3),
  normal = rnorm(n_numbers, mean = 5, sd = 2)
)

# Plot a histogram of uniform values, binwidth 0.25
ggplot(randoms, aes(uniform)) + geom_histogram(binwidth = 0.25)

- Using randoms, plot a histogram of the normal column, using binwidth 0.5.

In [None]:
# From previous step
randoms <- data.frame(
  uniform = runif(n_numbers, min = -3, max = 3),
  normal = rnorm(n_numbers, mean = 5, sd = 2)
)

# Plot a histogram of normal values, binwidth 0.5
ggplot(randoms, aes(normal)) + geom_histogram(binwidth = 0.5)


# Understanding random seeds

While random numbers are important for many analyses, they create a problem: the results you get can vary slightly. This can cause awkward conversations with your boss when your script for calculating the sales forecast gives different answers each time.

Setting the seed of R's random number generator helps avoid such problems by making the random number generation reproducible.

# Instructions:

# Question

Which statement about x and y is true?

set.seed(123)
x <- rnorm(5)
y <- rnorm(5)

Possible answers:

( ) x and y have identical values.

( ) The first value of x is identical to the first value of y but other values are different.

(x) The values of x are different to those of y.

In [None]:
DM.result = 3

# Which statement about x and y is true?

set.seed(123)
x <- rnorm(5)
set.seed(123)
y <- rnorm(5)

Possible answers:

(x) x and y have identical values.

( ) The first value of x is identical to the first value of y but other values are different.

( ) The values of x are different to those of y.

In [None]:
DM.result = 1

# Which statement about x and y is true?

set.seed(123)
x <- rnorm(5)
set.seed(456)
y <- rnorm(5)

Possible answers: 

( ) x and y have identical values.

( ) The first value of x is identical to the first value of y but other values are different.

(x) The values of x are different to those of y.


In [None]:
DM.result = 3

# Which statement about x and y is true?

set.seed(123)
x <- c(rnorm(5), rnorm(5))
set.seed(123)
y <- rnorm(10)

Possible answers:

(x) x and y have identical values.

( ) The first value of x is identical to the first value of y but other values are different.

( ) The values of x are different to those of y.

In [None]:
DM.result = 1
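
The four scenarios above can be checked directly. This sketch reruns each one and compares the vectors with identical(); the numbered variable names (x1, y1, and so on) are introduced here only to keep the scenarios apart.

```r
# Scenario 1: one seed, two consecutive draws
set.seed(123)
x1 <- rnorm(5)
y1 <- rnorm(5)

# Scenario 2: the same seed is reset before each draw
set.seed(123)
x2 <- rnorm(5)
set.seed(123)
y2 <- rnorm(5)

# Scenario 3: different seeds
set.seed(123)
x3 <- rnorm(5)
set.seed(456)
y3 <- rnorm(5)

# Scenario 4: two draws of 5 versus one draw of 10 from the same seed
set.seed(123)
x4 <- c(rnorm(5), rnorm(5))
set.seed(123)
y4 <- rnorm(10)

# TRUE where the two vectors match exactly
c(
  scenario_1 = identical(x1, y1),
  scenario_2 = identical(x2, y2),
  scenario_3 = identical(x3, y3),
  scenario_4 = identical(x4, y4)
)
```

Only scenarios 2 and 4 give identical vectors: resetting the seed restarts the generator at the same point, and splitting a draw into two consecutive calls does not change the sequence of values produced.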