# POLSCI 3

## Week 1, Notebook Lecture 2: Subsetting Data

In this notebook, we will cover subsetting data with the `subset()` function and then taking the mean of variables within those subsets.

## Subsetting in R with `subset()`

Sometimes when we work with large datasets, we want to take a look at a specific *subset* of that data.

For our example today, we'll use a pretty cool dataset. This dataset was gathered by <a href="https://onlinelibrary.wiley.com/doi/10.1111/ajps.12618?af=R" target="_blank">Shoub et al. (2021)</a>.

Let's start by reading in a dataset, as we learned in Notebook Lecture 1 in class.

In [1]:
officerdata <- read.csv('ps3_fl_officers_large.csv')
head(officerdata) # head() just shows us the first six rows of a dataset. This dataset is too long to print!

,search_occur,contra,driver_age,driver_race,officer_female,officer_id
,<int>,<int>,<chr>,<chr>,<int>,<chr>
1,0,0,above 60,,0,4a6ccf9467
2,0,0,under 35,,0,f1bc8a7a74
3,0,0,under 35,POC,0,07e6718e7d
4,0,0,between 35 and 60,White,0,000f298db7
5,0,0,between 35 and 60,White,0,4f549e4570
6,0,0,under 35,,0,99d2f19d4f


This dataset contains real data on officers and drivers from nearly 860,000 police traffic stops. Each row represents one time that an officer stopped someone. Here is more information about the variables:

- <code>search_occur</code>: Whether or not a search was conducted by the officer at that stop (0 = no search, 1 = search)
- <code>contra</code>: Whether or not contraband (illegal items such as drugs or guns) was found by the officer at that stop (0 = no contraband, 1 = contraband). *Officers can only find contraband if they conduct a search.*
- <code>driver_age</code>: Age group of driver (<code>under 35</code>, <code>between 35 and 60</code>, or <code>above 60</code>)
- <code>driver_race</code>: Race of driver (White = white; POC = non-white)
- <code>officer_female</code>: Officer gender (0 = male, 1 = female)

### How `subset()` works

Here's how `subset()` works:

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == accepted.value)`

This line takes `original.dataset`, keeps only the rows (observations) where `variable.in.dataset` equals `accepted.value`, and saves that subset in `name.of.new.subset.dataset`.

If the variable is a **string** (letters and words) variable, you need to wrap the accepted value in quotation marks, like this (single quotes `'` and double quotes `"` both work):

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == 'accepted.value')`
`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == "accepted.value")`

### Hypothetical use of `subset()`

Suppose we have a `oskiStore` dataset:

<table>
<thead>
  <tr>
    <th>Month</th>
    <th>Sweaters</th>
    <th>Hoodies</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Jan</td>
    <td>220</td>
    <td>75</td>
  </tr>
  <tr>
    <td>Feb</td>
    <td>195</td>
    <td>90</td>
  </tr>
  <tr>
    <td>March</td>
    <td>175</td>
    <td>80</td>
  </tr>
  <tr>
    <td>April</td>
    <td>220</td>
    <td>60</td>
  </tr>
</tbody>
</table>

If we run `oski.many.sweaters <- subset(oskiStore, Sweaters == 220)`, then `oski.many.sweaters` will look like this:

<table>
<thead>
  <tr>
    <th>Month</th>
    <th>Sweaters</th>
    <th>Hoodies</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>Jan</td>
    <td>220</td>
    <td>75</td>
  </tr>
  <tr>
    <td>April</td>
    <td>220</td>
    <td>60</td>
  </tr>
</tbody>
</table>

In this example, `oski.many.sweaters` is an entirely new dataset, and we can do all the same things with it that we could do with `oskiStore`.

Likewise, if we run `oskiApril <- subset(oskiStore, Month == 'April')`, then `oskiApril` will look like this:

<table>
<thead>
  <tr>
    <th>Month</th>
    <th>Sweaters</th>
    <th>Hoodies</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td>April</td>
    <td>220</td>
    <td>60</td>
  </tr>
</tbody>
</table>
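
To try the hypothetical example yourself, here is a minimal sketch that rebuilds the `oskiStore` table as a data frame and runs both subsets. The data frame is typed in by hand with `data.frame()` purely for illustration; it is not one of the course files.

```r
# Rebuild the hypothetical oskiStore table from above as a data frame
oskiStore <- data.frame(
  Month    = c("Jan", "Feb", "March", "April"),
  Sweaters = c(220, 195, 175, 220),
  Hoodies  = c(75, 90, 80, 60)
)

# Numeric condition: no quotes around the value
oski.many.sweaters <- subset(oskiStore, Sweaters == 220)
oski.many.sweaters

# String condition: the value goes in quotes
oskiApril <- subset(oskiStore, Month == "April")
oskiApril

# The subsets are ordinary datasets, so functions like nrow() work on them
nrow(oski.many.sweaters)
nrow(oskiApril)
```

Note that `subset()` leaves `oskiStore` itself unchanged; it only creates the new, smaller datasets.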



### Real example use of `subset()`

Let's say we want to subset the `officerdata` in order to look at the data specifically from traffic stops by female officers.

*Hint: You'll need to do exactly this in Activity Notebook 1 in class!*

In [2]:
female.officer.stops <- subset(officerdata, officer_female == 1)
head(female.officer.stops)

,search_occur,contra,driver_age,driver_race,officer_female,officer_id
,<int>,<int>,<chr>,<chr>,<int>,<chr>
9,0,0,under 35,POC,1,5e29ba9f07
33,0,0,under 35,White,1,b93689b570
55,0,0,between 35 and 60,White,1,540f9fc4cc
65,0,0,under 35,White,1,b8210a822f
71,0,0,under 35,POC,1,14b640d409
78,0,0,between 35 and 60,,1,dc025d8eb8


Side note: instead of using the `subset` command, we could alternatively filter by using `[]` to indicate which rows to include:

In [3]:
female.officer.stops.alt <- officerdata[officerdata$officer_female == 1, ]
# The condition before the comma selects rows; leaving the space after the comma empty keeps all columns

# You can check that these two datasets have the same number of rows (observations) using the nrow command
nrow(female.officer.stops)
nrow(female.officer.stops.alt)

We will usually use the `subset` command for now, however.  
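
One difference between the two approaches is worth knowing about: they treat missing values (`NA`) differently. The sketch below uses a small made-up dataset, `toy.stops`, in which one `officer_female` value is missing. That missing value is an assumption for illustration only; we have not checked whether `officerdata` has any missing values in this column.

```r
# Hypothetical toy data: the third stop has a missing officer_female value
toy.stops <- data.frame(
  officer_female = c(1, 0, NA, 1),
  contra         = c(0, 1, 0, 1)
)

with.subset   <- subset(toy.stops, officer_female == 1)
with.brackets <- toy.stops[toy.stops$officer_female == 1, ]

nrow(with.subset)    # 2: the row with the missing value is dropped
nrow(with.brackets)  # 3: the missing value comes back as a row of NAs

# Wrapping the condition in which() keeps only the rows where it is TRUE
with.which <- toy.stops[which(toy.stops$officer_female == 1), ]
nrow(with.which)     # 2
```

So if the two `nrow()` results ever disagree on a real dataset, missing values in the filtering variable are a likely cause.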

We can also filter on other variables. For example, we can include only the stops where a search occurred:

In [4]:
search.occurred <- subset(officerdata, search_occur == 1)
head(search.occurred)

,search_occur,contra,driver_age,driver_race,officer_female,officer_id
,<int>,<int>,<chr>,<chr>,<int>,<chr>
54,1,0,under 35,POC,0,315004d21f
186,1,0,between 35 and 60,White,0,c07c091355
384,1,0,between 35 and 60,POC,0,da29b43b89
435,1,1,under 35,White,0,2423ae5aec
1340,1,1,between 35 and 60,White,0,780732a96b
1789,1,0,under 35,White,0,954c054210


## Gathering Statistics: Using `mean()` within a subset

To gather more statistics, we can run `mean()` on just this subset of data. Running `mean()` on subsets of the data will give us the means (of whatever variable we take the mean of) within those subsets.

#### Mean 

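Let's compare how often contraband was found in stops by female officers versus stops by male officers. Because `contra` is coded 0/1, its mean is the proportion of stops in which contraband was found.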

In [5]:
female.officer.stops <- subset(officerdata, officer_female == 1)
mean(female.officer.stops$contra)

In [6]:
male.officer.stops <- subset(officerdata, officer_female == 0)
mean(male.officer.stops$contra)
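
Why does `mean()` give us a rate here? For a variable that only takes the values 0 and 1, the mean is the number of 1s divided by the number of observations. A quick check on made-up values (`toy.contra` is hypothetical, not taken from `officerdata`):

```r
# Hypothetical toy values: 1 = contraband found, 0 = not found
toy.contra <- c(0, 0, 1, 0, 1)

mean(toy.contra)                       # 0.4
sum(toy.contra) / length(toy.contra)   # 0.4, the same number computed by hand
```

Two of the five toy stops found contraband, so the mean is 2/5 = 0.4.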

Now let's compare this contraband rate between drivers under 35 and drivers between 35 and 60.

To do this, we need to subset on a **string** variable. We can see in the `head()` output above that `driver_age` is a string: its values contain letters and words, and its column type is `<chr>`.

In [7]:
drivers.under35 <- subset(officerdata, driver_age == 'under 35')
mean(drivers.under35$contra)

In [8]:
drivers.35to60 <- subset(officerdata, driver_age == 'between 35 and 60')
mean(drivers.35to60$contra)

We can also show these proportions as percentages by multiplying them by 100:

In [9]:
prop.drivers.35to60.withcontra <- mean(drivers.35to60$contra)
prop.drivers.35to60.withcontra * 100
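
If the percentage prints with more decimal places than you want, base R's `round()` function can tidy it up. This is an optional extra, sketched here with a made-up proportion (`toy.prop` is hypothetical and stands in for a value like `mean(drivers.35to60$contra)`):

```r
# Hypothetical proportion, not a result from officerdata
toy.prop <- 0.03456

toy.prop * 100             # 3.456
round(toy.prop * 100, 1)   # 3.5 (rounded to one decimal place)
```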

## More about Thursday's class

This was *Week 1, Notebook Lecture 2*. In class, you'll work on *Week 1, Activity Notebook 1*. In that notebook, you'll use what you learned in this notebook to answer very similar problems. Because it's the first week, the notebook will not count towards your grade.

- You'll first have 30 minutes to answer the questions individually. The notebooks will be available at 8:10 AM (students with DSP accommodations can start early; we will contact you individually about this) and will be due at 8:40 AM.
- You'll then have 20 minutes to answer the *same* questions as a group from 8:45 - 9:05 AM. (For some of those questions, you'll be able to see whether you're getting the answer right or wrong.)
- Finally, we'll regroup and go over the right answers as a class.
 
In the group notebooks, some questions will tell you as you go whether your answer is right. Some notebooks will also have open-ended questions; these can't be graded automatically, so we will grade them manually later.


If you need help during class:

- Check out the R Cheat Sheet in bCourses.
- Find a GSI and ask for help.

Final reminder: you do **NOT** need to turn in this lecture notebook. You only need to turn in in-class activities and problem sets.