# Analyzing racial disparities in SF traffic stops

First, let's load the necessary libraries and data that will allow us to begin our investigation!

In [0]:
# Some initial setup
options(digits = 3)
library(tidyverse)
library(lubridate)
theme_set(theme_bw())

# Read the data
stops <- read_rds("../data/san_francisco_stop_data.rds")
pop_2015 <- read_rds("../data/sf_pop_2015.rds")

## Covering the basics

The core of `R` is the dataframe. We've given you one to start with, in the form
of `stops`. Think of dataframes like a spreadsheet: they have rows and columns.
Usually, rows are a "datapoint": in `stops`, each row corresponds to a single
stop from San Francisco. The columns are the "variables": again, in `stops`,
these are the things we know about the stop, like where the stop happened, the
age of the driver, whether an arrest was made, and so on.

We can take a peek simply by typing `stops` into an R chunk:

In [0]:
stops

## Functions

In math, functions are a way to "do something" to an input. So `f(a)=b` takes a number `a`, and applies `f()`, and gets the output `b`. In programming, we also have functions! Most of the functions we'll use allow us to manipulate our dataframe as the input. 

So if we want to find the number of rows in our dataframe, we'd use the function `nrow()`, which takes a dataframe (like `stops`) as an input, and then outputs an integer (the number of rows in `stops`).

In [0]:
nrow(stops)

### Your turn

To find the number of columns, we (unsurprisingly) use `ncol`. Try it!

In [0]:
# Find the number of columns in `stops`


To figure out what the names of our columns are, we can use `colnames()`.

In [0]:
# Find the column names in `stops`


Now, if we want to take a peek at our dataframe without printing the whole 900,000 x 13 table, we can use the functions `head()` or `tail()` to see the first few or last few rows.

In [0]:
head(stops)

**Pro-tip:** If you're ever confused about a function and want to know more about what it does or how to use it, every function has "documentation" to help! To learn more about the `head()` function, simply run a code chunk with `?head`. It provides way more information than you might want or need -- but if you scroll down to the "Examples" section, those usually help!

## Exercise 1: Stop dates

For this first exercise, let's get a better sense of what time range our stops data covers. To do this, we'll be dealing with the `date` column in our dataframe. 

1. What happens when you run `stops$date`? How about `pull(stops, date)`? What do `$` and `pull()` do?

2. What date range does our dataset cover? (Hint: Try exploring the `min()` and `max()` functions, or the `range()` function!)

In [0]:
## YOUR CODE HERE

**tidyverse tip**: The "pipe" operator, `%>%`, helps to keep our code clean. It
just places the previous item into the first argument of the function. So,
`x %>% f()` is simply `f(x)`. While in a one-function call the pipe might feel
silly and unnecessary, it's going to become _really_ helpful once we start
wanting to do multiple transformations to our data. For instance, that means
that instead of typing `range(pull(stops, date))` we can write
`stops %>% pull(date) %>% range()`, which is easier to understand.

In [0]:
# Let's confirm that these give the same answer:

range(pull(stops, date))

stops %>% 
    pull(date) %>% 
    range()


## Preparing our data

For some of our analysis, we'll want to focus on the most recent full year: 2015.

To do this we'll want to use the _year_ of each stop, but _year_ isn't currently a column in our dataset. Let's add it!

**tidyverse function: `mutate()`**

We can use the `mutate()` function to add a `yr` column to `stops`.
The `mutate()` function adds new columns to a dataframe based on old columns.
The basic setup is `mutate(DATA, NEW_COL = FUN(OLD_COL))` where `DATA` is our
dataframe, `NEW_COL` is the name of the new column we want, and `FUN` is the
function we apply to the old column, `OLD_COL`, to get it.
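
As a minimal sketch of that pattern, here is `mutate()` applied to a tiny made-up table. The name `toy_stops` and its three dates are hypothetical and are not part of the real `stops` data; only `mutate()` and lubridate's `year()` come from the text above.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data, NOT the real `stops` table
toy_stops <- tibble(
  date = as.Date(c("2014-12-31", "2015-01-01", "2015-06-15"))
)

# NEW_COL = yr, FUN = year(), OLD_COL = date
toy_with_year <- toy_stops %>%
  mutate(yr = year(date))

toy_with_year
```

The original `toy_stops` is unchanged; the new column only exists in `toy_with_year` because we saved the result under that name.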

Below we:
1. use the `year()` and `mutate()` functions to add a new column called `yr` to our `stops` dataframe, and
2. use the assignment operator `<-` (`R`'s equivalent of `=`) to overwrite `stops`.

In [0]:
stops <- stops %>% mutate(yr = year(date))

**Note:** When we write code chunks and _don't_ save our result using `<-`, that result does not overwrite or in any way change the data. To change the data, we need to use the process above (`stops <- stops %>% ...`).

Now, we can investigate this new `yr` column in a few ways. 
1. We can check it's actually there by looking at `colnames(stops)`.
2. We can compute the range of years using `range(stops$yr)`.
3. We can count the number of stops per year: `stops %>% count(yr)`. 

Let's try the last one:

In [0]:
stops %>%
    count(yr)

What do you notice about stop counts over the years?

Now let's get back to prepping our data. To get to our desired date range of the most recent full year (2015), we will 
1. Use the `filter()` function to specify the years we want, and 
2. Again use the assignment operator `<-` (`R`'s equivalent of `=`), this time to save the result as a new dataframe, `stops_2015`.

**tidyverse function: `filter()`**

The `filter()` function is used to separate rows from the dataframe that
interest us from rows that do not. In particular, `filter(DATA, CONDITION)`
returns `DATA` with all of the rows that don't satisfy `CONDITION` removed. For
instance, we might want to only look at stops from 2015. To do this, we would type `stops %>% filter(yr == 2015)`, since we only want
rows from `stops` where the `yr` column is equal to (i.e., `==`) `2015`. We can also filter on multiple conditions by separating them with commas. So, for example, if we wanted all stops between 2011 and 2015, we would write `stops %>% filter(yr >= 2011, yr <= 2015)`.
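
To see both forms of `filter()` side by side, here is a small sketch on made-up data. The table `toy_stops` and its values are hypothetical; the two `filter()` calls mirror the ones described above.

```r
library(dplyr)

# Hypothetical toy data, NOT the real `stops` table
toy_stops <- tibble(
  yr   = c(2009, 2011, 2013, 2015, 2016),
  race = c("white", "black", "hispanic", "white", "black")
)

# Single condition: keep only rows where yr equals 2015
only_2015 <- toy_stops %>%
  filter(yr == 2015)

# Multiple conditions separated by a comma: a row must satisfy all of them
between_years <- toy_stops %>%
  filter(yr >= 2011, yr <= 2015)

only_2015
between_years
```

Conditions separated by commas are combined with "and", so `between_years` keeps only the rows whose year falls in the 2011 to 2015 range.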

In [0]:
# Use the filter function to get just stops from 2015
stops_2015 <- 
    stops %>%
    filter(yr == 2015)

Just to be extra sure, let's check our date range in this new dataframe, `stops_2015`!

In [0]:
stops_2015$date %>% range()

## Exercise 2: Stops by race

For this second exercise, let's compute the racial breakdown of traffic stops. To do this, we'll need two functions that we've already seen: `count()` and `mutate()`.

1. Count how many stops per race our dataset has, saving your result to a new dataframe: `stops_by_race`. 

2. Describe in words what we'd need to do to find the proportion of stops that were of white drivers.

3. To do the above computation for each race, we can add an additional column to `stops_by_race` using the `mutate()` function. Overwrite `stops_by_race`, adding a new column `p` with the proportion of stops that were made of drivers of each race.

4. Discuss: What do these proportions mean? Are drivers of certain races being stopped more than others? What might we be missing to really start interpreting these values?

In [0]:
# YOUR CODE HERE

We see white drivers make up about one-third of stops, and drivers of each other race represent 14-18%. The by-race
stop counts are only meaningful, though, when compared to some baseline. If
the San Francisco population were about one-third white, one-third of stops
being of white drivers wouldn't be at all surprising. But if 75% of the SF population were white, then our findings might be more suspicious!

## Stop rates

In order to do this baseline comparison, we need to understand the racial
demographics in our SF population data. (Note: This is why we wanted just one full year: comparing the number of stops in a year to the population from that year.) The data as we've given it to you
has raw population numbers from 2015. To make it useful, we'll need to compute the
_proportion_ of SF residents in each demographic group. As before, we do this using the `mutate()` function.

In [0]:
pop_2015 %>%
    mutate(p = n_people / (sum(n_people)))

An eyeball comparison leads us to see that black drivers, Hispanic drivers, and drivers of other races (not black, white, Hispanic, or Asian) are being stopped
disproportionately, relative to the city's population. But let's be a bit more
rigorous about this. If we join the two tables together, we can compute stop 
rates by race (i.e., number of stops per person). 

**R function: `left_join()`**

The way we put tables together is with the `left_join()` function. We need to
input three things into this function: the two tables we'd like to combine,
along with instructions on how to join them. In this case, the two tables we
want to join are the table of stops counted according to `race` and
`pop_2015`. The instruction for combining the tables is to join rows that
contain the same race, so we also have to give `left_join()` the argument
`by = "race"`. This means that `left_join()` will look at the first table
(i.e., the table of stops counted by race) and then go to the `race` column
in each row. Then, `left_join()` will take what it finds there (in this case,
`"asian/pacific islander"`, `"black"`, `"hispanic"`, `"other/unknown"`, and
`"white"`) and look in the second table, i.e., `pop_2015`, for all the
rows that contain the same value in `pop_2015`'s `race` column. Then,
it will append the matching row from the second table to the end of the row
from the first to create a new row with information from both. What we end up
with is a dataframe with all of the columns from _both_ tables.
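
The matching step can be easier to see on a tiny example. The sketch below uses two made-up tables, `toy_counts` and `toy_pop`, with invented numbers; they are hypothetical stand-ins, not the real stop counts or population data.

```r
library(dplyr)

# Hypothetical toy tables with made-up numbers
toy_counts <- tibble(
  race = c("black", "white"),
  n    = c(30, 50)
)
toy_pop <- tibble(
  race     = c("white", "black", "hispanic"),
  n_people = c(500, 100, 200)
)

# For each row of toy_counts, find the row of toy_pop with the same `race`
# and attach its remaining columns
joined <- toy_counts %>%
  left_join(toy_pop, by = "race")

joined
```

Note that the row order of the second table doesn't matter, and that `"hispanic"` is dropped from the result because `left_join()` only keeps rows that appear in the first (left) table.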

The process is a little complicated, and we won't use it too much, so don't
worry if the abstract description doesn't make sense. To get a better
understanding of what's going on, try joining the two tables described above,
being sure to include the `by = "race"` argument.

In [0]:
stops_by_race %>%
    left_join(pop_2015, by = "race")

In [0]:
stops_by_race %>%
    left_join(pop_2015, by = "race") %>%
    mutate(stop_rate = n / n_people)

Good! Now we can divide the black (or Asian, or Hispanic, or "other") stop rate by the white stop rate to be able to make
a quantitative statement about how much more often black drivers are stopped compared to white drivers, relative to their share of the city's population.

In [0]:
# black vs white
0.3340/0.0861
# hispanic vs white
0.0929/0.0861
# asian vs white
0.0544/0.0861
# other vs white
0.3876/0.0861

Black drivers are stopped at 3.9 times the rate of white drivers, Hispanic drivers at 1.1 times the rate of white drivers, and drivers of other races at 4.5 times the rate of white drivers, while Asian drivers are stopped less often than white drivers (about 0.6 times the white rate).
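
Typing the rates in by hand works, but the same ratios can be computed in one step. The sketch below is one possible way to do it: the table `stop_rates` is built by hand from the numbers used in the calculation above, the short race labels are shorthand, and the names `white_rate` and `ratio_vs_white` are made up for illustration.

```r
library(dplyr)

# Stop rates copied from the hand calculation above;
# the race labels are shorthand, not necessarily the labels in the data
stop_rates <- tibble(
  race      = c("asian", "black", "hispanic", "other", "white"),
  stop_rate = c(0.0544, 0.3340, 0.0929, 0.3876, 0.0861)
)

# Pull out the white stop rate as a single number
white_rate <- stop_rates %>%
  filter(race == "white") %>%
  pull(stop_rate)

# Divide every group's stop rate by the white stop rate
rate_ratios <- stop_rates %>%
  mutate(ratio_vs_white = stop_rate / white_rate)

rate_ratios
```

In the notebook itself, the same `filter()`, `pull()`, and `mutate()` steps could be applied to the joined table that already contains `stop_rate`, which avoids copying numbers by hand.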

## Caveats about stop rates

While these baseline stats give us a sense that there are racial disparities in
policing practices in SF, they are not strong evidence of discrimination. The
argument against using stop rates (often called "benchmarking" or the "benchmark test") is that we haven't identified the correct
baseline to compare to. 

For the stop rate denominator, what we really want to know is what the true 
distribution is for individuals breaking traffic laws or exhibiting other 
criminal behavior in their vehicles. But using SF residential population doesn't account for commuting populations, or possible race-specific differences in driving behavior,
including amount of time spent on the road (and adherence to traffic laws, as
mentioned above).  If black drivers, hypothetically, spend more time on the
road than white drivers, that in and of itself could explain the higher stop
rates we see for black drivers, even in the absence of discrimination.

## Searches

Let's next consider how often drivers of different races were searched. Computing search rates is actually easier than stop rates because we don't need an external population benchmark.
We can use the stopped population as our baseline, defining search rate to be the proportion of stopped people who were subsequently searched.

**tidyverse functions: `group_by()` and `summarize()`**

One thing that we often want to do with data is disaggregate it. That is, we
might want to take the data and break it down into smaller subpopulations. Then,
when we ask questions, we can ask about each piece---for instance, each
demographic group, or each police district---instead of asking about the population as a whole.

The way to do this in `R` is with `group_by()` and `summarize()`. The standard way
to use `group_by()` is to call `group_by(DATA, COL_NAME)`, where `DATA` is our
dataframe and `COL_NAME` is the name of a column. What `group_by()` then does is
take all the rows in the dataframe `DATA` and put them into different groups,
one for each different value in the column `COL_NAME`. So, for instance, if we
called `group_by(stops, district)`, `R` would hand back to us the `stops`
dataframe with all of its rows broken into different groups, one for each
police district.

The second step is to do something with those groups. That's what `summarize()`
does. The way `summarize()` works is to take a dataframe broken into groups by
`group_by()` and calculate a statistic for each group. The basic syntax is
`summarize(DATA, STAT = FUN(COL_NAME))`, where `DATA` is some dataframe broken
up by `group_by()`, `STAT` is some statistic we want to calculate, `FUN` is the
function that calculates that statistic, and `COL_NAME` is the name of the
column (or columns) used to calculate the statistic.

Let's put it all together with an example. A straightforward one is: `stops %>%
group_by(race) %>% summarize(arrest_rate = mean(arrested))`. Typing
this into the `R` console will calculate arrest rates disaggregated by race.
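
Here is a sketch of the same pattern on a tiny made-up table, small enough to check the answer by hand. The table `toy_stops` and its values are hypothetical; the column name `arrested` follows the example above.

```r
library(dplyr)

# Hypothetical toy data, NOT the real `stops` table
toy_stops <- tibble(
  race     = c("black", "black", "white", "white", "white", "white"),
  arrested = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# One group per race, then one summary row per group.
# mean() of a TRUE/FALSE column is the proportion of TRUE values.
arrest_rates <- toy_stops %>%
  group_by(race) %>%
  summarize(arrest_rate = mean(arrested))

arrest_rates
```

In this toy table, 1 of the 2 black drivers and 1 of the 4 white drivers were arrested, so the two rows of `arrest_rates` should read 0.5 and 0.25.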

Using `group_by()` and `summarize()`, let's calculate search rates by race.

## Exercise 3: Search rates

1. Below are two different ways to compute search rates. Discuss both ways with your group and make sure both methods of computation make sense. The second one is kind of tricky! (And for the first one, you might want to investigate what the function `n()` does -- try `?n`.)
2. Discuss the search rate findings. Are some race groups searched more often than other race groups, relative to their share of stopped drivers?

NOTE: Since we're not comparing to population numbers, we can return to using our full dataset, with all years!

In [0]:
# Search rates, METHOD 1
stops %>%
    group_by(race) %>%
    summarize(
        n_searched = sum(searched),
        n_stopped = n(),
        search_rate = n_searched / n_stopped
    )

In [0]:
# Search rates, METHOD 2
stops %>%
    group_by(race) %>%
    summarize(
        search_rate = mean(searched)
    )

Search rates are slightly less suspect than stop rates, since among the stopped 
population, it's more reasonable to believe that people of different races
offend at equal rates. In the context of searches, this means assuming that
all races exhibit probable cause of possessing contraband at equal rates. Even so, one could claim that the stopped
population isn't a good measure of the true racial distribution of probable
cause. This is all to say that while benchmark stats (stop rates, search rates) are a good place to
start, more investigation is required before we can draw any conclusions.

## Outcome test

To circumvent the benchmarking problem, it's common to turn to the search 
decision, rather than the stop decision. This is because we have a notion of
what a "successful" search is. The legal justification for performing a search
is probable cause that the driver possesses contraband. So a successful search
is one which uncovers contraband.

We thus turn to rates of successful searches. That is, what proportion of
searches, by race, were successful? This proportion is known as the contraband
recovery rate, or the "hit rate." If racial groups have different hit rates, it
can imply that racial groups are being subjected to different standards.

As a caricatured example, suppose among white drivers who were searched, 
officers found contraband 99% of the time, while among black drivers who were
searched, officers found contraband only 1% of the time. This would lead us to
believe that officers made sure they were _certain_ white individuals had
contraband before deciding to search, but that they were searching black 
individuals on a whiff of evidence.

Let's investigate hit rates by race in SF in 2015.

## Exercise 4: Hit rates

1. Filter to drivers who were searched, and then compute the hit rate (rate of contraband recovery) by race. Remember your `group_by()` and `summarize()` functions!

2. Discuss your findings. 

In [0]:
## YOUR CODE HERE

However, what if hit rates vary by police district? If the bar for stopping
people, irrespective of race, is lower in certain police districts, and black
individuals are more likely to live in neighborhoods in those districts, then
the observed disparities may not reflect bias.

Let's compute hit rates by race _and_ district. We can do this simply by adding multiple arguments to the `group_by()` function.

In [0]:
hit_rates <- 
  stops %>% 
  filter(searched) %>% 
  group_by(race, district) %>% 
  summarize(hit_rate = mean(contraband_found))

hit_rates %>% nrow()

This is too many hit rates to compare in one table!

## Exercise 5: Visualization brainstorm

Sketch out using pen and paper how you might try to use visualizations to help us synthesize the 50 hit rates above. Start with the question we're trying to answer (Are hit rates for minority drivers lower than hit rates for white drivers?) -- and then think about what type of plot might best help you answer that question. See if you can come up with at least 3 different sketches!

## One way to visualize: scatterplots

In [0]:
# Reshape table to show hit rates of minorities vs white drivers
reshaped_hit_rates <-
  hit_rates %>% 
  spread(race, hit_rate, fill = 0) %>% 
  rename(white_hit_rate = white) %>% 
  gather(
      minority_race, minority_hit_rate, 
      c(black, hispanic, `asian/pacific islander`, other)
  ) %>%
  arrange(district)

# We'll use this just to make our axes' limits nice and even
max_hit_rate <-
  reshaped_hit_rates %>% 
  select(ends_with("hit_rate")) %>% 
  max()

# Get corresponding number of searches (to size points).
# Again, for each district we want to know the number of white+black searches
# and white+Hispanic searches. This requires the same spreading and gathering
# as our previous data-munging.
search_counts <-
  stops %>% 
  filter(searched) %>%  
  count(district, race) %>% 
  spread(race, n, fill = 0) %>% 
  rename(num_white_searches = white) %>% 
  gather(
      minority_race, num_minority_searches, 
      c(black, hispanic, `asian/pacific islander`, other)
  ) %>% 
  mutate(num_searches = num_minority_searches + num_white_searches) %>% 
  select(district, minority_race, num_searches)

# Now let's plot!
reshaped_hit_rates %>% 
  left_join(
    search_counts, 
    by = c("district", "minority_race")
  ) %>% 
  ggplot(aes(
    x = white_hit_rate,
    y = minority_hit_rate
  )) +
  geom_point(aes(size = num_searches), pch = 21) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  scale_x_continuous("White hit rate", 
    limits = c(0, max_hit_rate + 0.01),
    labels = scales::percent_format(accuracy = 1)
  ) +
  scale_y_continuous("Minority hit rate", 
    limits = c(0, max_hit_rate + 0.01),
    labels = scales::percent_format(accuracy = 1)
  ) +
  coord_fixed() +
  facet_wrap(vars(minority_race))

## Exercise 6: Plot interpretation

Explain what you see above. What does each point represent? What does the dashed line represent? What do these plots tell us about discrimination in search practices?