# Sampling Strategies and Experimental Design

## Sampling strategies, determine which

A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.

What sampling strategy has this company used?

### Answer: Stratified sample

Each zip code is a stratum, and addresses are randomly sampled within every stratum.

## Sampling strategies, choose worst

A school district has requested a survey be conducted on the socioeconomic status of their students. Their budget only allows them to conduct the survey in some of the schools, hence they need to first sample a few schools. 

Students living in this district generally attend a school in their neighborhood. The district is broken into many distinct and unique neighborhoods, some including large single-family homes and others with only low-income housing. 

Which approach would likely be the least effective for selecting the schools where the survey will be conducted?

### Answer: Cluster sampling, where each cluster is a neighborhood.

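This sampling strategy would be a poor choice because each neighborhood has its own socioeconomic profile. Cluster sampling would survey only the schools in a few selected neighborhoods, so the sample could miss entire income levels. A good study would collect information from every neighborhood.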
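
The contrast between clustering and stratifying by neighborhood can be sketched in R. This is only an illustration under assumed data: the `schools` data frame, its neighborhood labels, and the sample sizes are hypothetical and do not come from the exercise.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical district: 6 neighborhoods with 4 schools each
schools <- data.frame(
  school       = paste0("school_", 1:24),
  neighborhood = rep(paste0("nbhd_", 1:6), each = 4)
)

# Cluster sample: pick 2 neighborhoods at random, keep every school in them
picked_nbhds    <- sample(unique(schools$neighborhood), size = 2)
schools_cluster <- schools %>% filter(neighborhood %in% picked_nbhds)

# Stratified sample: pick 1 school at random from every neighborhood
schools_str <- schools %>%
  group_by(neighborhood) %>%
  sample_n(size = 1) %>%
  ungroup()

# Number of neighborhoods covered by each design
n_distinct(schools_cluster$neighborhood)
n_distinct(schools_str$neighborhood)
```

The cluster sample covers only the neighborhoods that happened to be drawn, while the stratified sample includes a school from every neighborhood.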



# Sampling in R

In [1]:
library(openintro)

Please visit openintro.org for free statistics materials

Attaching package: 'openintro'

The following object is masked from 'package:datasets':

    cars



In [2]:
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [3]:
# Load county data
data(county)

In [4]:
glimpse(county)

Observations: 3,143
Variables: 10
$ name          <fctr> Autauga County, Baldwin County, Barbour County, Bibb...
$ state         <fctr> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama...
$ pop2000       <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 112...
$ pop2010       <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 118...
$ fed_spend     <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910, 9.9...
$ poverty       <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, 20.3,...
$ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4,...
$ multiunit     <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3,...
$ income        <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916, 2057...
$ med_income    <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659, 3840...


In [5]:
# Remove DC, because it is not a state
county_noDc <- county %>% filter(state != "District of Columbia") %>% droplevels()

In [6]:
glimpse(county_noDc)

Observations: 3,142
Variables: 10
$ name          <fctr> Autauga County, Baldwin County, Barbour County, Bibb...
$ state         <fctr> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama...
$ pop2000       <dbl> 43671, 140415, 29038, 20826, 51024, 11714, 21399, 112...
$ pop2010       <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 118...
$ fed_spend     <dbl> 6.068095, 6.139862, 8.752158, 7.122016, 5.130910, 9.9...
$ poverty       <dbl> 10.6, 12.2, 25.0, 12.6, 13.4, 25.3, 25.0, 19.5, 20.3,...
$ homeownership <dbl> 77.5, 76.7, 68.0, 82.9, 82.0, 76.9, 69.0, 70.7, 71.4,...
$ multiunit     <dbl> 7.2, 22.6, 11.1, 6.6, 3.7, 9.9, 13.7, 14.3, 8.7, 4.3,...
$ income        <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916, 2057...
$ med_income    <dbl> 53255, 50147, 33219, 41770, 45549, 31602, 30659, 3840...


Scenario 1: limited resources only allow us to collect information on 150 out of over 3000 counties.

Options:
 - 1. Take a simple random sample:
  - in dplyr, use the sample_n() function

In [7]:
# Simple random sample of 150 counties
# _srs = simple random sample
county_srs <- county_noDc %>% sample_n(size = 150)

In [8]:
# Glimpse county_srs
glimpse(county_srs)

Observations: 150
Variables: 10
$ name          <fctr> Stanly County, Lehigh County, Alameda County, Hot Sp...
$ state         <fctr> North Carolina, Pennsylvania, California, Wyoming, M...
$ pop2000       <dbl> 58100, 312090, 1443741, 4882, 15069, 17943, 6556, 226...
$ pop2010       <dbl> 60585, 349497, 1510271, 4812, 14962, 18550, 6306, 216...
$ fed_spend     <dbl> 6.495981, 6.253095, 10.057915, 9.829593, 9.175511, 8....
$ poverty       <dbl> 12.7, 11.9, 11.4, 9.1, 23.1, 15.6, 20.4, 17.5, 17.1, ...
$ homeownership <dbl> 76.9, 68.8, 55.1, 64.1, 68.8, 69.1, 82.9, 79.0, 72.5,...
$ multiunit     <dbl> 7.3, 23.5, 38.0, 12.8, 4.7, 10.6, 3.6, 5.4, 7.3, 13.2...
$ income        <dbl> 21139, 27301, 33961, 25269, 15183, 24656, 15418, 2007...
$ med_income    <dbl> 44802, 53541, 69384, 42469, 28484, 36900, 29740, 3815...


Scenario 2: We want equal numbers of counties from each state, that is, 3 counties per state. A simple random sample won't ensure that. To confirm this, we can count the number of counties per state in the sample using group_by() and count().


In [9]:
# State distribution of SRS counties
county_srs %>% group_by(state) %>% count()

state,n
Alabama,2
Arkansas,3
California,2
Colorado,3
Florida,7
Georgia,8
Idaho,3
Illinois,6
Indiana,3
Iowa,3


If our goal is to sample 3 counties per state, we should use stratified sampling.

To do this in R, we write a query similar to the one for the SRS, except that we call group_by() before we request the sample.

In [10]:
# Stratified sample of 150 counties, each state is a stratum
county_str <- county_noDc %>% group_by(state) %>% sample_n(size = 3)

In [11]:
glimpse(county_str)

Observations: 150
Variables: 10
$ name          <fctr> Choctaw County, Henry County, Clarke County, Yukon-K...
$ state         <fctr> Alabama, Alabama, Alabama, Alaska, Alaska, Alaska, A...
$ pop2000       <dbl> 15922, 16310, 27867, 6551, 2697, 59322, 8547, 160026,...
$ pop2010       <dbl> 13859, 17302, 25833, 5588, 3141, 88995, 8437, 195751,...
$ fed_spend     <dbl> 10.640378, 8.788464, 9.781442, 28.110773, 4.450493, 4...
$ poverty       <dbl> 18.7, 15.1, 29.2, 23.6, 10.4, 9.9, 13.5, 20.9, 13.9, ...
$ homeownership <dbl> 85.6, 81.9, 80.0, 69.1, 59.2, 79.2, 46.9, 69.6, 66.3,...
$ multiunit     <dbl> 3.9, 3.2, 6.3, 2.9, 11.8, 10.1, 6.1, 12.5, 25.1, 5.9,...
$ income        <dbl> 17214, 19716, 17372, 18614, 22279, 27910, 21281, 1841...
$ med_income    <dbl> 31076, 38379, 27439, 33712, 54375, 67703, 48696, 4034...


In [12]:
county_str %>% group_by(state) %>% count()

state,n
Alabama,3
Alaska,3
Arizona,3
Arkansas,3
California,3
Colorado,3
Connecticut,3
Delaware,3
Florida,3
Georgia,3


## Stratified sample in R

In the previous exercise, we took a simple random sample of eight states. However, we did not have any control over how many states from each region got sampled. The goal of stratified sampling in this context is to have control over the number of states sampled from each region. Our goal for this exercise is to sample an equal number of states from each region.

The dplyr package has been loaded and us_regions is still available in your workspace.

- Use stratified sampling to select a total of eight states, where each stratum is a region. Save this sample in a data frame called states_str.

- Count the number of states from each region in your sample to confirm that each region is represented equally in your sample.

In [32]:
us_regions<- read.csv('f:/datasets/us_regions_2.csv', sep=",", header=FALSE, quote = "\'")

In [33]:
head(us_regions)

V1,V2,V3
1,Connecticut,Northeast
2,Maine,Northeast
3,Massachusetts,Northeast
4,New Hampshire,Northeast
5,Rhode Island,Northeast
6,Vermont,Northeast


In [34]:
colnames(us_regions)<-c('counter', 'state', 'region')

In [35]:
head(us_regions)

counter,state,region
1,Connecticut,Northeast
2,Maine,Northeast
3,Massachusetts,Northeast
4,New Hampshire,Northeast
5,Rhode Island,Northeast
6,Vermont,Northeast


In [36]:
# Stratified Sample
states_str <- us_regions %>% group_by(region) %>% sample_n(size = 2)

In [37]:
states_str

counter,state,region
13,Ohio,Midwest
20,North Dakota,Midwest
2,Maine,Northeast
5,Rhode Island,Northeast
23,Florida,South
33,Mississippi,South
45,Utah,West
50,Oregon,West


In [38]:
# Count States by region
states_str %>% group_by(region) %>% count()

region,n
Midwest,2
Northeast,2
South,2
West,2


In this stratified sample, each stratum (i.e., region) is represented equally.

### Compare SRS vs. stratified sample

Which method you implemented, simple random sampling or stratified sampling, ensured an equal number of states from each region?

answer: Stratified sampling

Simple random sampling would likely result in different numbers of states being sampled from each region.
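
How rarely a simple random sample happens to be balanced can be checked with a small simulation. This is a sketch under assumptions: the region sizes below (9, 12, 16, 13) are illustrative values and are not read from us_regions, and the seed and number of repetitions are arbitrary.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical region label for each of 50 states
region_names <- c("Northeast", "Midwest", "South", "West")
region_sizes <- c(9, 12, 16, 13)
region       <- rep(region_names, times = region_sizes)

# Draw many simple random samples of 8 states and record
# whether each one contains exactly 2 states per region
balanced <- replicate(1000, {
  draw   <- sample(region, size = 8)
  counts <- table(factor(draw, levels = region_names))
  all(counts == 2)
})

# Simulated share of balanced samples
mean(balanced)

# Exact probability of a balanced SRS under the same assumed sizes
exact_prob <- prod(choose(region_sizes, 2)) / choose(sum(region_sizes), 8)
exact_prob
```

Under these assumed sizes, only a small share of simple random samples is balanced across regions, whereas the stratified sample is balanced by construction.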

# Identifying components of a study

A researcher designs a study to test the effect of light and noise levels on exam performance of students. The researcher also believes that light and noise levels might have different effects on males and females, so she wants to make sure both genders are represented equally under different conditions.

Which of the below is correct?

answer: There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance).

## Experimental design terminology

___ variables are conditions you can impose on the experimental units, while ___ variables are characteristics that the experimental units come with that you would like to control for.

answer: Explanatory, blocking

## Connect blocking and stratifying

In random sampling, we use ___ to control for a variable. In random assignment, we use ___ to achieve the same goal.

answer: stratifying, blocking
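
The parallel with stratified sampling can be sketched in code: blocked random assignment uses the same group_by() pattern, but shuffles treatments within each block instead of sampling rows. The `students` data frame and the two light levels are hypothetical, and noise is left out to keep the sketch short.

```r
library(dplyr)

set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical experimental units: 12 students, 6 per gender
students <- data.frame(
  id     = 1:12,
  gender = rep(c("female", "male"), each = 6)
)

# Blocked random assignment: within each gender (the block),
# shuffle a balanced set of treatment labels
assigned <- students %>%
  group_by(gender) %>%
  mutate(light = sample(rep(c("bright", "dim"), length.out = n()))) %>%
  ungroup()

# Each gender is represented equally under each condition
assigned %>% count(gender, light)
```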