# Module 8: Random Sampling

In this module, you will learn how to randomly generate numbers in R. This is useful for selecting a simple random sample in observational studies and for assigning individuals to treatments in experiments.

## Generating Random Numbers

R has many ways of generating random numbers. In this module, we focus on the "sample()" function. This function draws some number of values from a list. To be more specific, let’s say you want to draw a sample of size n from a population of size N. You’ll first want to assign a unique ID number between 1 and N to each member of the population. 

You can create a list of the whole numbers from 1 to N in R using the ":" operator. To do this, write 1, followed by ":", followed by the number you want your list to end at. Let's make a list of the numbers from 1 to 10.

In [1]:
1:10

In order to sample from a population, we first sample from the ID numbers between 1 and N, then select the individuals in the population with the corresponding ID numbers. We therefore need to be able to draw a sample of size n from the numbers between 1 and N. We do this in R using the "sample()" function.

The "sample()" function has two required inputs: "x" and "size". The "x" input is the list of values that we want to draw a sample from (i.e., the population). The "size" input is the number of values we want to sample (i.e., the sample size). 

If we want to sample values between 1 and N, then we must set "x" equal to a list containing all the numbers between 1 and N. We can generate this list in R using ":". We then set "size" equal to n, the size of our sample.

Let's make a list of all the values from 1 to 100, then randomly sample 20 of them.

In [2]:
our.list = 1:100
sample(x=our.list, size=20)

Try re-running the last cell a few times. As you might expect, you will get a different list of numbers each time. 
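One detail worth checking for yourself: by default, "sample()" draws without replacement, so no value can be chosen twice. This is what we want for a simple random sample. The short sketch below checks this; the variable names "ids" and "too.many" are made up for illustration.

```r
# Draw 20 ID numbers from 1 to 100
ids = sample(x=1:100, size=20)

# Every ID should be distinct, and all should fall between 1 and 100
length(unique(ids)) == 20
all(ids >= 1 & ids <= 100)

# Asking for more values than the list contains causes an error,
# because values are not put back after they are drawn.
# try() catches the error so that the rest of the cell can still run.
too.many = try(sample(x=1:10, size=20), silent=TRUE)
inherits(too.many, "try-error")
```

All three checks should print TRUE.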

Sometimes we may want to be able to generate the same sample again. We are able to do this because the procedure that R uses to generate random numbers is actually not random at all, just really complicated. This is called pseudo-random number generation. R uses a 'random seed' to determine where to start this complicated procedure for generating pseudo-random numbers. If we choose a fixed random seed, then we will get the same 'random' numbers every time we re-run the code. We can fix the random seed in R using the "set.seed()" function. 

The "set.seed()" function has only one input: "seed". The "seed" input tells R which random seed to use. It can be any whole number, although typically people don't use more than eight digits.

Let's again generate 20 numbers between 1 and 100, but this time we will set the random seed to 12345678. Remember that if a function has only one input we do not need to include the name of that input.

In [3]:
set.seed(12345678)
sample(x=our.list, size=20)

You can run this cell as many times as you want; you will get the same answer every time. The "set.seed()" function allows us to make sure our work is reproducible, by making sure that we can get the same sample every time.
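If you would like R to confirm this for you, the sketch below draws the same sample twice and compares the results with "identical()". The names "first.draw", "second.draw", and "third.draw" are made up for illustration. Notice that the seed must be set again right before the second draw.

```r
# Fix the seed, then draw a sample
set.seed(12345678)
first.draw = sample(x=1:100, size=20)

# Fix the same seed again, then draw another sample
set.seed(12345678)
second.draw = sample(x=1:100, size=20)

# identical() gives TRUE when the two results match exactly
identical(first.draw, second.draw)

# Without resetting the seed, R continues from where it left off,
# so this draw will almost certainly differ from the first one
third.draw = sample(x=1:100, size=20)
identical(first.draw, third.draw)
```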

## Review Topic: Sampling From a Population in R

Suppose that we have a list of all the members of a population in R, and we need to select which ones to sample. For example, we might have a list of people in a class, and we want to ask some of them what their major is. In order to do this, we will need to index the list of names. We discussed how to index a list in Module 6. We now expand upon these ideas a bit.

Suppose that we have a list of names called "names", and we want to get the first and third elements of that list. Remember that we can get the first element using "names[1]" and the third element using "names[3]". To get both of these elements, we index our list with another list.



In [4]:
#Import our data and obtain the corresponding list
data = read.csv("Names.csv")
names = data$Name

#Index our list with another list
names[c(1,3)]

Remember that the "c()" function produces a list containing the elements inside the parentheses.
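If you do not have "Names.csv" handy, you can practice the same idea on a small list typed in by hand. The names below are made up for illustration and are not taken from the file.

```r
# A small made-up list of names (not from Names.csv)
example.names = c("Ana", "Ben", "Cara", "Dev", "Eli")

# Index with a list of positions to get the first and third names
example.names[c(1,3)]      # "Ana" "Cara"

# The positions can also be stored in a variable first
positions = c(2,5)
example.names[positions]   # "Ben" "Eli"
```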

The file "Names.csv" contains a list of 100 people. Let's choose 20 of them for our sample. Remember that the "sample()" function produces a list of numbers that can be used for indexing. Try writing some code to do this yourself before you look at the solution.

In [5]:
#Import the dataset
data = read.csv("Names.csv")
head(data)

#Generate a sample of numbers
sample.numbers = sample(x=1:100, size=20)

#Choose the corresponding people
sample.names = data$Name[sample.numbers]
print(sample.names)

Name
Sophia
Emma
Olivia
Isabella
Ava
Lily


 [1] Grace     Riley     Elizabeth Evelyn    Isabelle  Wyatt     Kylie    
 [8] Dylan     Hunter    Benjamin  Henry     Lily      Aaliyah   Ethan    
[15] Oliver    Hannah    Cameron   Landon    Jacob     Kaitlyn  
100 Levels: Aaliyah Abigail Addison Aiden Alexander Amelia Andrew ... Zoe


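Our final answer is a list of 20 names. We can ignore the part that says "100 Levels:...". It appears because R has stored the names as a factor, and it simply lists the distinct values in the column.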
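The same tools also cover the other use mentioned at the start of this module: assigning individuals to treatments in an experiment. The sketch below is only one simple way to do this. It assumes a made-up group of eight participants labelled 1 to 8 and two groups of equal size; the seed value and all variable names are chosen for illustration.

```r
# Any fixed seed makes the assignment reproducible
set.seed(2024)

# Eight hypothetical participants, labelled 1 to 8
participant.ids = 1:8

# Randomly choose 4 of the 8 ID numbers for the treatment group
treatment.ids = sample(x=participant.ids, size=4)

# Everyone who was not chosen goes to the control group
control.ids = setdiff(participant.ids, treatment.ids)

print(treatment.ids)
print(control.ids)
```

Each participant ends up in exactly one of the two groups, and chance alone decides which one.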