# 2 Sampling Methods

Learn how and when to perform the four methods of random sampling: simple, systematic, stratified, and cluster.

# Simple random sampling

The simplest method of sampling a population is the one you've seen already. It is known as simple random sampling (sometimes abbreviated to "SRS"), and involves picking rows at random, one at a time, where each row has the same chance of being picked as any other.

To make it easier to see which rows end up in the sample, it's helpful to include a row ID column in the dataset before you take the sample.

In this chapter, we'll look at sampling methods using a synthetic (fictional) employee attrition dataset from IBM, where "attrition" means leaving the company.

attrition_pop is available; dplyr and tibble are loaded (rowid_to_column() comes from tibble).

# Instructions:

- View the attrition_pop dataset. Explore it in the viewer until you are clear on what it contains.
- Set the random seed to a value of your choosing.
- Add a row ID column to the dataset, then use simple random sampling to get 200 rows.
- View the sample dataset, attrition_samp. What do you notice about the row IDs?

In [None]:
# View the attrition_pop dataset
attrition_pop

# Set the seed
set.seed(123)

attrition_samp <- attrition_pop %>% 
  # Add a row ID column
  rowid_to_column() %>% 
  # Get 200 rows using simple random sampling
  slice_sample(n = 200)

# View the attrition_samp dataset
head(attrition_samp)

# Systematic sampling

One sampling method that avoids randomness is called systematic sampling. Here, you pick rows from the population at regular intervals.

For example, if the population dataset had one thousand rows and you wanted a sample size of five, you'd pick rows 200, 400, 600, 800, and 1000.
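The arithmetic in this example can be checked with a short base R sketch. The variable names are illustrative, and the "population" here is just the row numbers 1 to 1000, not the attrition data.

```r
# Hypothetical toy numbers matching the example above
pop_size <- 1000
sample_size <- 5

# Integer division gives the gap between sampled rows
interval <- pop_size %/% sample_size

# interval, 2 * interval, ..., sample_size * interval
row_indexes <- seq_len(sample_size) * interval

print(interval)
print(row_indexes)
```

If pop_size is not an exact multiple of sample_size, integer division rounds the interval down, so the rows after sample_size * interval are never selected.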

attrition_pop is available; dplyr and tibble are loaded.

# Instructions:

- Set the sample size to 200.
- Get the population size from attrition_pop.
- Calculate the interval between rows to be sampled.

In [None]:
# Set the sample size to 200
sample_size <- 200

# Get the population size from attrition_pop
pop_size <- nrow(attrition_pop)

# Calculate the interval
interval <- pop_size %/% sample_size

- Get the row indexes for the sample as a numeric sequence of interval, 2 * interval, up to sample_size * interval.
- Systematically sample attrition_pop, assigning to attrition_sys_samp.
- Add a row ID column to attrition_pop.
- Get the rows of the population corresponding to row_indexes.

In [None]:
# From previous step
sample_size <- 200
pop_size <- nrow(attrition_pop)
interval <- pop_size %/% sample_size

# Get row indexes for the sample
row_indexes <- seq_len(sample_size) * interval

attrition_sys_samp <- attrition_pop %>% 
  # Add a row ID column
  rowid_to_column() %>% 
  # Get 200 rows using systematic sampling
  slice(row_indexes)

# See the result
head(attrition_sys_samp)

# Is systematic sampling OK?

Systematic sampling has a problem: if the data has been sorted, or there is some sort of pattern or meaning behind the row order, then the resulting sample may not be representative of the whole population. The problem can be solved by shuffling the rows, but then systematic sampling is equivalent to simple random sampling.
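To make the risk concrete, here is a minimal base R sketch using a made-up population whose row order follows a weekly cycle. The data frame toy_pop, its columns, and its values are hypothetical and are not part of the attrition data. Because the sampling interval matches the cycle length, the systematic sample only ever picks the same day of the week.

```r
# Hypothetical population: 100 weeks of daily values, in date order.
# Day 7 of each week has a much larger value than the other days.
toy_pop <- data.frame(
  day = rep(1:7, times = 100),
  value = rep(c(10, 10, 10, 10, 10, 10, 100), times = 100)
)

sample_size <- 100
interval <- nrow(toy_pop) %/% sample_size  # 700 %/% 100 = 7

# Systematic sample: rows 7, 14, ..., 700 (every one is a day-7 row)
row_indexes <- seq_len(sample_size) * interval
sys_samp <- toy_pop[row_indexes, ]

pop_mean <- mean(toy_pop$value)
sys_mean <- mean(sys_samp$value)

# Shuffling the rows first breaks the link between row order and value
set.seed(2024)
shuffled <- toy_pop[sample(nrow(toy_pop)), ]
shuffled_mean <- mean(shuffled[row_indexes, "value"])

print(c(population = pop_mean, systematic = sys_mean, shuffled = shuffled_mean))
```

The systematic sample's mean is far above the population mean because every sampled row falls on day 7, while the same row indexes applied to the shuffled data behave like a simple random sample.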

Here you'll look at how to determine whether or not there is a problem.

attrition_sys_samp is available and has been given a row ID column; dplyr, tibble, and ggplot2 are loaded.

# Instructions:

- Add a row ID column to attrition_pop.
- Using the attrition_pop_id dataset, plot YearsAtCompany versus rowid as a scatter plot, with a smooth trend line.

In [None]:
# Add a row ID column to attrition_pop
attrition_pop_id <- attrition_pop %>% 
  rowid_to_column()

# Using attrition_pop_id, plot YearsAtCompany vs. rowid
ggplot(attrition_pop_id, aes(rowid, YearsAtCompany)) +
  # Make it a scatter plot
  geom_point() +
  # Add a smooth trend line
  geom_smooth(method = "lm", se = FALSE)

- Shuffle the rows of attrition_pop.
- Add a row ID column to the shuffled dataset, assigning the result to attrition_shuffled.
- Repeat the plot of YearsAtCompany versus rowid with points and a smooth trend line, this time using attrition_shuffled.

In [None]:
# Shuffle the rows of attrition_pop then add row IDs
attrition_shuffled <- attrition_pop %>% 
  slice_sample(prop = 1) %>% 
  rowid_to_column()

# Using attrition_shuffled, plot YearsAtCompany vs. rowid
# Add points and a smooth trend line
ggplot(attrition_shuffled, aes(rowid, YearsAtCompany)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# Does a systematic sample always produce a sample similar to a simple random sample?

Possible answers

( ) Yes. All sampling (random or non-random) schemes will lead us to similar results.

( ) Yes. We should always expect a representative sample for both systematic and simple random sampling.

( ) No. This only holds if a seed has been set for both processes.

(x) No. This is not true if the data is sorted in some way.

In [None]:
DM.result = 3

# Proportional stratified sampling

If you are interested in subgroups within the population, then you may need to carefully control the counts of each subgroup within the sample. Proportional stratified sampling results in subgroup sizes within the sample that are representative of the subgroup sizes within the population. It is equivalent to performing a simple random sample on each subgroup, using the same sampling proportion for every subgroup.

attrition_pop is available; dplyr is loaded.

# Instructions:

- Get the counts of employees by Education level from attrition_pop, sorted by descending count.
- Add a percent column of percentages (100 times the count divided by the total count).
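
The two steps above, and the stratified sample they lead to, might look like the following sketch. It uses a small made-up dataset (toy_pop, with invented education labels and group sizes) rather than attrition_pop, so the counts are illustrative only. Grouping before calling slice_sample() is one way to sample the same proportion from every subgroup.

```r
library(dplyr)

# Hypothetical population with three education levels (sizes 50, 30, 20)
toy_pop <- tibble(
  Education = rep(c("Bachelor", "Master", "College"), times = c(50, 30, 20))
)

# Counts by Education, sorted by descending count, plus percentages
pop_counts <- toy_pop %>%
  count(Education, sort = TRUE) %>%
  mutate(percent = 100 * n / sum(n))

# Proportional stratified sample: take 50% of each Education group
set.seed(2024)
toy_strat <- toy_pop %>%
  group_by(Education) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()

# Percentages within the sample, computed the same way
samp_counts <- toy_strat %>%
  count(Education, sort = TRUE) %>%
  mutate(percent = 100 * n / sum(n))

print(pop_counts)
print(samp_counts)
```

In this toy case the percent column is the same in both tables, which is what "proportional" means here. With group sizes that do not divide evenly, the sample percentages would match only approximately.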