---
title: "Open Policing Project Tutorial"
output:
html_document:
code_folding: hide
---
## Setup
First, let's load the necessary libraries and data that will allow us to
begin our investigation!
```{r setup, message=FALSE, warning=FALSE}
## Libraries to include
library(tidyverse)
library(lubridate)
# For Veil of Darkness
library(lutz)
library(suncalc)
library(splines)
## Load the data
# Replace the path below with the path to where your data lives
data_path <- "~/Desktop/pa_philadelphia_2019_02_25.rds"
stops <- read_rds(data_path)
# Additional data and fixed values we'll be using
population_2017 <- tibble(
subject_race = c(
"asian/pacific islander", "black", "hispanic", "other/unknown","white"
),
num_people = c(110864, 648846, 221777, 39858, 548312)
) %>%
mutate(subject_race = as.factor(subject_race))
center_lat <- 39.9525839
center_lng <- -75.1652215
```
## Covering the basics
The core of `R` is the dataframe. We've given you one to start with, in the form
of `stops`. Think of dataframes like a spreadsheet: they have rows and columns.
Usually, rows are a "datapoint": in `stops`, each row corresponds to a single
stop from Philadelphia. The columns are the "variables": again, in `stops`,
these are the things we know about the stop, like where the stop happened, the
age of the stop subject, whether an arrest was made, and so on.
First, let's take a look at what columns exist in our `stops` dataframe.
```{r}
colnames(stops)
```
How many stops do we have in our dataset? (**Hint**: Try the `nrow()` function.)
```{r}
nrow(stops)
```
What date range does our data cover? One way to do this would be to examine the
`min()` and `max()` dates in our dataframe. The problem is we can't just apply
`min()` to our dataframe: doing so produces an error.
The `min()` function wants its input to be just one column from the `stops`
dataframe. To access a single column from a dataframe, use `$`: what goes before
`$` is the dataframe, and what comes after is the name of the column we want.
(**Hint**: What happens if you type `stops$date` into the console?)
```{r}
min(stops$date)
max(stops$date)
```
Since we only have four and a half months of data for 2018, let's filter it out.
Note that there are ways to deal with partial years in analysis, but to make
things easier for ourselves, let's focus on 2014-2017.
To do this we'll want to:
1. Use the `filter()` function to specify the years we want, and
2. Use the assignment operator `<-` (it's like = in `R`) to overwrite `stops`.
#### Aside: `filter()`
The `filter()` function is used to separate rows from the dataframe that
interest us from rows that do not. In particular, `filter(DATA, CONDITION)`
returns `DATA` with all of the rows that don't satisfy `CONDITION` removed. For
instance, we might want to only look at stops where the subject was arrested. To
do this, we would type `filter(stops, arrest_made == TRUE)`, since we only want
rows from `stops` where the `arrest_made` column is (i.e., `==`) `TRUE`. (**Note**:
we could actually just type `filter(stops, arrest_made)`. Do you see why?)
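To make this concrete, here's a quick sketch you can run (we store the result in an `arrests` variable of our own naming rather than overwriting `stops`):
```{r}
# Keep only the rows where an arrest was made. Since `arrest_made` is
# already a TRUE/FALSE column, the two calls below are equivalent.
arrests <- filter(stops, arrest_made == TRUE)
arrests <- filter(stops, arrest_made)
nrow(arrests)
```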
You might have noticed in the `colnames()` output that we have no `year` column (our data only has `date`). These are full dates that look like `YYYY-MM-DD`.
Luckily, with the `lubridate` package, getting the year from
the date is as easy as `year(date)`!
```{r}
stops <- filter(stops, year(date) < 2018)
```
How many stops do we have now?
```{r}
# NOTE that this is equivalent to `nrow(stops)`
stops %>% nrow()
```
**tidyverse tip**: The "pipe" operator, `%>%`, helps to keep our code clean. It
just places the previous item into the first argument of the function. So,
`x %>% f()` is simply `f(x)`. While in a one-function call the pipe might feel
silly and unnecessary, it's going to become _really_ helpful once we start
wanting to do multiple transformations to our data. For instance, instead of typing
`nrow(filter(stops, year(date) < 2018))` we can write
`stops %>% filter(year(date) < 2018) %>% nrow()`, which is easier to understand.
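As a small sketch of such a pipeline (this simply re-applies the filter from the chunk above, so the count is unchanged):
```{r}
# Each %>% feeds the result of one step into the next:
# take `stops`, keep the pre-2018 rows, then count the rows
stops %>%
  filter(year(date) < 2018) %>%
  nrow()
```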
Now, you may have noticed before that there was a column called `type`. Let's
take a closer look to see what that means.
#### Aside: `select()`
A few minutes ago, we used `filter()` to whittle our dataframe down to just the
rows that interested us. What happens if we want to whittle down the columns
instead? The `select()` function is designed to solve exactly this problem. It
takes a dataframe, and gets rid of all of the columns you don't ask for. For
instance, `select(stops, lat, long)` produces a dataframe with exactly two
columns: `lat` and `long`. If we wanted to know the time as well, we could type
`select(stops, lat, long, time)`.
Try using `select()` to look at the `type` column.
```{r}
stops %>%
select(type) %>%
head(10) # this allows us to look at just the first 10 rows
```
Ah! Some of our stops are vehicular (i.e., traffic) stops, but some are
pedestrian stops. Many of the larger cities in our database provided us with
both pedestrian and vehicular data. We've provided it all for you so that you
can dive in and uncover stories relating to pedestrian stops too! Many of the
analysis techniques we'll be covering today can be applied to the pedestrian
data as well. And we encourage you to spend some time on your own replicating the
relevant analyses on the pedestrian data. But for now, let's filter to just
vehicular stops, since one of our analyses is only relevant for traffic stops,
and since this matches what the majority of the data in our database looks like.
```{r}
stops <- stops %>% filter(type == "vehicular")
```
How many stops do we have now?
```{r}
stops %>% nrow()
```
It'd be nice to see if the 1.1 million stops were evenly distributed across
years, or if stop counts have changed. To find stop counts per year, we need to
define a notion of `year` (recall that our data only has `date`).
The `count()` function gives us an easy way to tally up observations over any
number of groupings we desire. In this case, let's count over year.
#### Aside: `mutate()`
Notice, though, that we have the same problem as before: there's no `year`
column in the `stops` dataframe. We'll use the `mutate()` function to fix that.
The `mutate()` function adds new columns to a dataframe based on old columns.
The basic setup is `mutate(DATA, NEW_COL = FUN(OLD_COL))` where `DATA` is our
dataframe, `NEW_COL` is the name of the new column we want, and `FUN` is the
function we apply to the old column, `OLD_COL`, to get it.
Use the `year()` and `mutate()` functions to add a new column called `year`
to our `stops` dataframe and then use `count()` to determine the number of stops per year.
```{r}
stops %>%
mutate(year = year(date)) %>%
count(year)
```
How about counts over race?
```{r}
stops %>%
count(subject_race)
```
Let's make another table that gives us the proportion of stops by race.
(There are a few equivalent ways to do this---choose the method that feels
most natural to you.)
```{r}
# This method builds off of using `count` as above
stops %>%
count(subject_race) %>%
mutate(prop = n / sum(n))
# This method uses the group_by/summarize paradigm
stops %>%
group_by(subject_race) %>%
summarize(
n = n(),
prop = n / nrow(.)
)
```
At first glance, we see there are far more stops of black drivers than any other
race. About two-thirds of stops in our dataset were of black drivers! This
stat on its own, though, doesn't actually say much. We'll return to this more
rigorously later on.
How about counting how many stops by year _and_ race?
```{r}
stops %>%
# notice that you can also mutate `date` *within* the count function
count(year = year(date), subject_race)
```
Now we've gotten to the point where we have too many rows to interpret in a
single table. A simple visualization could really help us out here.
```{r}
stops %>%
count(year = year(date), subject_race) %>%
ggplot(aes(x = year, y = n, color = subject_race)) +
geom_point() +
geom_line()
```
We used `ggplot` to make this plot. This package provides a lot of options for
customization, and there are many tweaks we could make to this plot to make it
nicer. Its syntax is a little complicated, though, and in our data
exploration, it's enough just to get a coarse visual like this.
From this plot we see that, at least for black and white drivers, the annual
trends are very different by race. (It's hard to tell from this plot for
drivers of other races because the counts are comparatively so much smaller.)
All races experienced a spike in enforcement in 2015, but thereafter, there
were fewer white drivers stopped (in 2016 and 2017), whereas the number of
black drivers stopped continued to _increase_ over those two years.
This is already a potential lead! We'd have to investigate further to see what
the trend looks like when adjusting for population changes over time or other
factors we haven't considered in this example. But it's often simple
explorations like this that can uncover the path to stories.
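By the way, if you want to polish a plot like the one above, here's a minimal sketch of a couple of common tweaks (the label text is just illustrative):
```{r}
stops %>%
  count(year = year(date), subject_race) %>%
  ggplot(aes(x = year, y = n, color = subject_race)) +
  geom_point() +
  geom_line() +
  # Human-readable axis and legend labels
  labs(x = "Year", y = "Number of stops", color = "Subject race") +
  # A lighter background theme
  theme_minimal()
```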
**Fun fact**: This same trend does _not_ hold true for pedestrian stops. In fact,
if we hadn't filtered to only vehicular stops, we wouldn't have noticed this,
because this same plot over the combined pedestrian and vehicular data looks
unremarkable. The takeaway is that looking at trends by sub-categories can often
be very helpful. (E.g., in Nashville, looking at the different listed stop
reasons uncovers the extent of the disparities.)
To give us time to dive into a variety of analysis methods, we'll let you
investigate Philly's annual traffic stop patterns on your own time. But because
it seems like the disparities could be changing over time, we'll filter to
only 2017 and focus on that data for the rest of our analysis (and overwrite
`stops`).
```{r}
stops <- stops %>% filter(year(date) == 2017)
```
## Benchmark test
We saw before that over two-thirds of stops were of black drivers. The by-race
stop counts are only meaningful, though, when compared to some baseline. If
the Philadelphia population was about two-thirds black, then two-thirds of stops
being of black drivers wouldn't be at all surprising.
### Stop rates
In order to do this baseline comparison, we need to understand the racial
demographics in our Philly population data. The data as we've given it to you
has raw population numbers. To make it useful, we'll need to compute the
_proportion_ of Philadelphia residents in each demographic group. (Hint: use the
`mutate()` function.)
```{r}
population_2017 %>%
mutate(prop = num_people / sum(num_people))
```
Even an eyeball comparison shows that black drivers are being stopped
disproportionately, relative to the city's population. But let's be a bit more
rigorous about this. If we join the two tables together, we can compute stop
rates by race (i.e., number of stops per person). Remember to take into account
how many years are in your stop data in order to get a true value of stops per
capita; we're using only 2017 for stops and for population, so we're in good
shape.
#### Aside: `left_join()`
The way we put tables together is with the `left_join()` function. We need to
input three things into this function: the two tables we'd like to combine,
along with instructions on how to join them. In this case, the two tables we
want to join are the table of stops counted according to `subject_race` and
`population_2017`. The instruction for combining the tables is to join rows that
contain the same race, so we have to also give `left_join()` the argument `by =
"subject_race"`. This means that `left_join()` will look at the first table---
i.e., the table stops counted by race---and then go to the `subject_race` column
in each row. Then, `left_join()` will take what it finds there---in this case,
`"asian/pacific islander"`, `"black"`, `"hispanic"`, `"other/unknown"`, and
`"white"`---and look in the second table, i.e., `population_2017`, for all the
rows that contain the same value in `population_2017`'s `subject_race` column.
Then, it will append the columns of the matching row from the second table onto
the row from the first, creating a new row with information from both. What we
end up with is a dataframe with all of the
columns from _both_ tables.
The process is a little complicated, and we won't use it too much, so don't
worry if the abstract description doesn't make sense. To get a better
understanding of what's going on, try joining the two tables described above,
being sure to include the `by = "subject_race"` argument.
```{r}
stops %>%
count(subject_race) %>%
left_join(
population_2017,
by = "subject_race"
) %>%
mutate(stop_rate = n / num_people)
```
Good! Now we can divide the black stop rate by the white stop rate to make
a quantitative statement about how much more often black drivers are stopped
compared to white drivers, relative to their share of the city's population.
Black drivers are stopped at a rate 3.4 times higher than white drivers, and
Hispanic drivers are stopped at a rate 1.5 times higher than white drivers.
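Rather than dividing by hand, we can compute these ratios in code. Here's a minimal sketch (the `stop_rates` name is our own):
```{r}
# Recreate the stop-rate table from above, then divide each group's
# stop rate by the white stop rate
stop_rates <-
  stops %>%
  count(subject_race) %>%
  left_join(population_2017, by = "subject_race") %>%
  mutate(stop_rate = n / num_people)
stop_rates %>%
  mutate(rate_vs_white = stop_rate / stop_rate[subject_race == "white"])
```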
### Search rates
Let's do the same sort of benchmark comparison for search and frisk rates. These
are easier than the last one since we don't need an external population benchmark.
We can use the stopped population as our baseline, defining the search rate
as the proportion of stopped people who were subsequently searched, and the
frisk rate as the proportion of stopped people who were subsequently frisked.
Let's get these values by race.
#### Aside: `group_by()` and `summarize()`
One thing that we often want to do with data is disaggregate it. That is, we
might want to take the data and break it down into smaller subpopulations. Then,
when we ask questions, we can ask about each piece---for instance, each
demographic group, or each police district---instead of asking about the population
as a whole.
The way to do this in `R` is with `group_by()` and `summarize()`. The standard way
to use `group_by()` is to call `group_by(DATA, COL_NAME)`, where `DATA` is our
dataframe and `COL_NAME` is the name of a column. What `group_by()` then does is
take all the rows in the dataframe `DATA` and put them into different groups,
one for each different value in the column `COL_NAME`. So, for instance, if we
called `group_by(stops, district)`, `R` would hand back to us the `stops`
dataframe with all of its columns broken into different groups, one for each
police district.
The second step is to do something with those groups. That's what `summarize()`
does. The way `summarize()` works is to take a dataframe broken into groups by
`group_by()` and calculate a statistic for each group. The basic syntax is
`summarize(DATA, STAT = FUN(COL_NAME))`, where `DATA` is some dataframe broken
up by `group_by()`, `STAT` is some statistic we want to calculate, `FUN` is the
function that calculates that statistic, and `COL_NAME` is the name of the
column (or columns) used to calculate the statistic.
Let's put it all together with an example. A straightforward one is: `stops %>%
group_by(subject_race) %>% summarize(arrest_rate = mean(arrest_made))`. Typing
this into the `R` console will calculate arrest rates disaggregated by race.
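For reference, here's that example as a runnable chunk:
```{r}
# Arrest rates disaggregated by race: group the stops by race, then
# take the mean of the TRUE/FALSE `arrest_made` column in each group
stops %>%
  group_by(subject_race) %>%
  summarize(arrest_rate = mean(arrest_made))
```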
Now, using `group_by()` and `summarize()`, let's calculate search and frisk rates by race.
```{r}
stops %>%
group_by(subject_race) %>%
summarize(
search_rate = mean(search_conducted, na.rm = T),
frisk_rate = mean(frisk_performed, na.rm = T)
)
```
Note that with functions like `mean()`, if any of the values that are being
averaged are `NA`, the output value of the mean will simply be `NA`. We don't have
to worry about that with our Philly data, because it's super clean. But you may
encounter this problem in some of your own data. The easy workaround is to pass
the argument `na.rm = T` into `mean()`---as you see in the code above. This tells
`R` to remove the `NA` values and compute the mean over all the non-NA values.
Now let's dive into these results! As with the stop rates, we can make a
quantitative claim about disparities in search and frisk rates by dividing the
minority rate by the white rate. Here we see that among drivers who were
stopped, black drivers were searched at a rate 1.5 times higher than white
drivers, and Hispanic drivers were searched at a rate 1.2 times higher than
white drivers. Black drivers were frisked at a rate 2.1 times higher than
white drivers were, and Hispanic drivers were frisked at a rate 1.5 times
higher than white drivers were.
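As with stop rates, we can compute these ratios in code rather than by hand; here's a sketch building on the chunk above:
```{r}
stops %>%
  group_by(subject_race) %>%
  summarize(
    search_rate = mean(search_conducted, na.rm = T),
    frisk_rate = mean(frisk_performed, na.rm = T)
  ) %>%
  # Divide each group's rate by the corresponding white rate
  mutate(
    search_ratio = search_rate / search_rate[subject_race == "white"],
    frisk_ratio = frisk_rate / frisk_rate[subject_race == "white"]
  )
```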
### Caveats about the benchmark test
While these baseline stats give us a sense that there are racial disparities in
policing practices in Philadelphia, they are not evidence of discrimination. The
argument against the benchmark test is that we haven't identified the correct
baseline to compare to.
For the stop rate benchmark, what we really want to know is what the true
distribution is for individuals breaking traffic laws or exhibiting other
criminal behavior in their vehicles. If black and Hispanic drivers are
disproportionately stopped relative to their rates of offending, that would be
stronger evidence. Some people have proposed using benchmarks that approximate
those offending rates, such as arrests. However, we know arrests themselves to
be racially skewed (especially for low-level drug offenses), so they wouldn't
give us the true offending population's racial
distribution. Furthermore, only `r stops %>% summarize(p = round(100 *
mean(arrest_made), 1)) %>% pull(p)`% of stops result in an arrest, so the
arrested population will naturally not match the stopped population.
An even simpler critique of the population benchmark for stop rates is that it
doesn't account for possible race-specific differences in driving behavior,
including amount of time spent on the road (and adherence to traffic laws, as
mentioned above). If black drivers, hypothetically, spend more time on the
road than white drivers, that in and of itself could explain the higher stop
rates we see for black drivers, even in the absence of discrimination.
Search and frisk rates are slightly less suspect, since among the stopped
population, it's more reasonable to believe that people of different races
offend at equal rates. In the context of searches, this means assuming that
all races exhibit probable cause of possessing contraband at equal rates. And
in the case of frisks, this means assuming that all races exhibit reasonable
articulable suspicion of possessing a weapon at equal rates. Of course, one
could also argue against these assumptions. One could claim that the stopped
population isn't a good measure of the true racial distribution of probable
cause. This is all to say that while benchmark stats are a good place to
start, more investigation is required before we can draw any conclusions.
## Outcome test
To circumvent the benchmarking problem, it's common to turn to the search
decision, rather than the stop decision. This is because we have a notion of
what a "successful" search is. The legal justification for performing a search
is probable cause that the driver possesses contraband. So a successful search
is one which uncovers contraband.
We thus turn to rates of successful searches. That is, what proportion of
searches, by race, were successful? This proportion is known as the contraband
recovery rate, or the "hit rate." If racial groups have different hit rates, it
can imply that racial groups are being subjected to different standards.
As a caricatured example, suppose among white drivers who were searched,
officers found contraband 99% of the time, while among black drivers who were
searched, officers found contraband only 1% of the time. This would lead us to
believe that officers made sure they were _certain_ white individuals had
contraband before deciding to search, but that they were searching black
individuals on a whiff of evidence.
Let's investigate hit rates by race in Philly in 2017.
```{r}
stops %>%
filter(search_conducted) %>%
group_by(subject_race) %>%
summarize(
hit_rate = mean(contraband_found, na.rm = T)
)
```
We see that hit rates are slightly lower for black and Hispanic drivers than
for white drivers.
However, what if hit rates vary by police district? If the bar for searching
people, irrespective of race, is lower in certain police districts, and black
individuals are more likely to live in neighborhoods in those districts, then
the observed disparities may not reflect bias.
### Adjusting for location
Now let's compute hit rates by race _and_ district. Name your result `hit_rates`.
```{r}
hit_rates <-
stops %>%
filter(search_conducted) %>%
group_by(subject_race, district) %>%
summarize(hit_rate = mean(contraband_found, na.rm = T))
hit_rates
```
Again, this is too many hit rates to compare in one table. To plot the hit rates
of black vs. white drivers and of Hispanic vs. white drivers, we need to
reshape our table to have each row containing a district, a minority race,
minority hit rate in that district, and white hit rate in that district. We'll
walk you through the code below that reshapes the data for us.
```{r}
# Reshape table to show hit rates of minorities vs white drivers
hit_rates <-
hit_rates %>%
filter(subject_race %in% c("black", "white", "hispanic")) %>%
spread(subject_race, hit_rate, fill = 0) %>%
rename(white_hit_rate = white) %>%
gather(minority_race, minority_hit_rate, c(black, hispanic)) %>%
arrange(district)
hit_rates
```
**Quick Tip**: You might have noticed a strange symbol in the code above:
`%in%`. This is really just shorthand for the sentence, "the `subject_race` is
`"black"` or it is `"white"` or it is `"hispanic"`." Without the `%in%` syntax, we
would have to write this in `R` as: `subject_race == "black" | subject_race ==
"white" | subject_race == "hispanic"`. The `%in%` syntax is a lot shorter and
clearer, as we think you'll agree.
Now let's plot it!
```{r}
# We'll use this just to make our axes' limits nice and even
max_hit_rate <-
hit_rates %>%
select(ends_with("hit_rate")) %>%
max()
hit_rates %>%
ggplot(aes(
x = white_hit_rate,
y = minority_hit_rate
)) +
geom_point() +
# This sets a diagonal reference line (line of equal hit rates)
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
# These next few lines just make the axes pretty and even
scale_x_continuous("White hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
scale_y_continuous("Minority hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
# This makes sure that 1% on the x-axis is the same as 1% on the y-axis
coord_fixed() +
# This allows us to compare black v. white and Hispanic v. white side by
# side, in panels
facet_grid(. ~ minority_race)
# Depending on your version of ggplot2, you may be able to use the syntax
# below (the newer ggplot2 syntax)---which is clearer, in my opinion.
# But older versions of ggplot2 will only accept the above syntax
# facet_grid(cols = vars(minority_race))
```
This is a good start. However, it's hard to tell where the mass is. That is,
maybe the districts above the dotted line (i.e., those where minority hit
rates are _higher_ than white hit rates) contain most of Philadelphia's
population, while the districts below the line represent only a few people.
While this is unlikely, it's still good to have a marker of how
much weight or emphasis to give each of the points on our plot.
Let's therefore size each of the points by number of searches.
```{r}
# Get corresponding number of searches (to size points).
# Again, for each district we want to know the number of white+black searches
# and white+Hispanic searches. This requires the same spreading and gathering
# as our previous data-munging.
search_counts <-
stops %>%
filter(
search_conducted,
subject_race %in% c("black", "white", "hispanic")
) %>%
count(district, subject_race) %>%
spread(subject_race, n, fill = 0) %>%
rename(num_white_searches = white) %>%
gather(minority_race, num_minority_searches, c(black, hispanic)) %>%
mutate(num_searches = num_minority_searches + num_white_searches) %>%
select(district, minority_race, num_searches)
hit_rates %>%
left_join(
search_counts,
by = c("district", "minority_race")
) %>%
ggplot(aes(
x = white_hit_rate,
y = minority_hit_rate
)) +
geom_point(aes(size = num_searches), pch = 21) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
scale_x_continuous("White hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
scale_y_continuous("Minority hit rate",
limits = c(0, max_hit_rate + 0.01),
labels = scales::percent
) +
coord_fixed() +
facet_grid(. ~ minority_race)
# same facet_grid syntax issue here as before!
# facet_grid(cols = vars(minority_race))
```
It seems we can now say with confidence that, even looking within location,
nearly every police district has lower hit rates for minority drivers than
for white drivers.
Note that we can also run a simple model (logistic regression) to quantify how
much lower the odds are of recovering contraband from searching a black or
Hispanic driver versus from searching a white driver, adjusting for location.
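Here's a rough sketch of what such a model might look like (this is our illustration, not part of the original analysis; we relevel `subject_race` so white drivers are the reference group):
```{r}
# Logistic regression for contraband recovery among searched drivers,
# with race and district as predictors
hit_mod <- glm(
  contraband_found ~ subject_race + as.factor(district),
  family = binomial,
  data = stops %>%
    filter(search_conducted) %>%
    mutate(subject_race = relevel(as.factor(subject_race), ref = "white"))
)
# Negative estimates mean lower odds of recovering contraband than for
# white drivers; exponentiate an estimate to get an odds ratio
summary(hit_mod)$coefficients[
  c("subject_raceblack", "subject_racehispanic"),
  c("Estimate", "Std. Error")
]
```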
### Investigating anomalies
One thing that we should always keep an eye out for, both as journalists and as
data scientists, is anomalies in the data that could change the meaning of
our results.
In the plot above, you may have noticed a rather large point (i.e., a district
with a lot of searches) that has a hit rate of essentially zero for both
white and minority drivers. That's a red flag that something's up.
Let's figure out what district that is by filtering our `hit_rates` table
to the smallest white hit rate.
```{r}
hit_rates %>%
filter(white_hit_rate == min(white_hit_rate))
```
Next, let's take a look at what addresses are usually written in for stops in
this district.
```{r}
# Hint: Note that districts in our dataset are encoded as characters,
# so 77 is actually "77"
stops %>%
filter(district == "77") %>%
count(location, sort = T)
```
Depending on how you investigated the `location`, you might have found that the
majority of the stops happen at the airport, or have location listed as `NA` or
unavailable. Indeed, if we check the Philadelphia Police's website, we'll find
that there is no District 77, and that the airport does not fall within the
jurisdiction of any particular police district. Between this information and the
fact that hit rates are essentially zero for this district, we can conclude that
the types of searches happening in this location are qualitatively different
from searches in our other locations.
Without knowing more about stops and searches in this district, it makes sense
to remove them from our hit rate analysis. Try computing citywide hit rates
again, this time filtering out district 77.
```{r}
# Remember: Districts in our dataset are encoded as characters,
# so 77 is actually "77"
stops %>%
filter(search_conducted, district != "77") %>%
group_by(subject_race) %>%
summarize(
hit_rate = mean(contraband_found, na.rm = T)
)
```
Wow! A sizeable share of searches of white individuals must have been taking
place at the airport with zero hit rates, deflating the white hit rate we
were calculating before. But this didn't influence hit rates of black or
Hispanic drivers as much. Our aggregate hit rate analysis now shows stronger
evidence of bias, similar to our by-location plot. (Note, though, that
aggregate and by-location hit rates do not necessarily always tell the same
story---a phenomenon known as Simpson's Paradox---which is why it's important
to check both to get the full picture.)
### Caveats about the outcome test
While the outcome test is simple and intuitively appealing, it doesn't allow
us to observe the actual threshold for searching someone; it only allows
us to observe outcomes. There is a subtle statistical flaw with the outcome
test, known as _infra-marginality_, which in certain cases can lead to the outcome
test indicating discrimination when equal thresholds are actually being applied,
and in other cases can lead to the outcome test indicating equal hit rates when
different thresholds are actually being applied. We thus often use the threshold
test (a more complex test that uses both search and hit rates to infer thresholds),
in order to validate the outcome test.
(**Fun fact**: The threshold test confirms the results we see in Philadelphia.
Officers are indeed applying lower thresholds when deciding to search black and
Hispanic drivers than when deciding to search white drivers.)
We don't have time to go into the threshold test now. The important takeaway,
though, is that every statistical test has its flaws. So when reporting on
data, we have to make sure not to overstate our findings.
## Veil of Darkness test
The outcome test is a good way to get around the benchmarking problem at the
search decision, but when it comes to assessing bias at the stop decision,
it's a little trickier. The outcome test usually doesn't work, since rarely
is there a notion of what a "successful" stop is. (There are, of course, some
exceptions to this, like when stop reason is listed as suspected Criminal
Possession of a Weapon (CPW)---which is the case in NYC stop-question-frisk
data.)
An alternative approach to assessing bias in stop decisions was proposed
by Grogger and Ridgeway in 2006. This approach, known as
the Veil of Darkness test, relies on the hypothesis that officers who are
engaged in racial profiling are less likely to be able to identify a driver's
race after dark than during daylight. Under this hypothesis, if stops made
after dark have a smaller proportion of black drivers than stops made
during daylight, that is evidence of racial profiling.
The naive approach is just to compare whether daytime or nighttime stops
have a higher proportion of black drivers. This, however, is problematic. More black
drivers being stopped at night could be the case for a multitude of reasons:
different deployment and enforcement patterns, different driving patterns,
etc. All of these things correlate with clock time and would be confounding our
results. So let's find a way to adjust for clock time.
To adjust for clock time, we want to compare times that were light at some
point during the year (i.e., occurred before dusk) but were dark at other points
during the year (i.e., occurred after dusk). This is called the "inter-twilight
period": the range from the earliest time dusk occurs in the year to the latest
time dusk occurs in the year.
Let's calculate the sunset and dusk times for Philly in 2017.
(Note that we distinguish here between _sunset_ and _dusk_. Sunset is the point
in time where the sun dips below the horizon. Dusk, also known as the end of
civil twilight, is the time when the sun is six degrees below the horizon, and
it is widely considered to be dark. We'll explain why this is relevant in a
moment.)
```{r}
# Get timezone for Philly
tz <- lutz::tz_lookup_coords(center_lat, center_lng, warn = F)
# Helper function
time_to_minute <- function(time) {
hour(hms(time)) * 60 + minute(hms(time))
}
# Compute sunset time for each date in our dataset
sunset_times <-
stops %>%
mutate(
lat = center_lat,
lon = center_lng
) %>%
select(date, lat, lon) %>%
distinct() %>%
getSunlightTimes(
data = .,
keep = c("sunset", "dusk"),
tz = tz
) %>%
mutate_at(vars("sunset", "dusk"), ~format(., "%H:%M:%S")) %>%
mutate(
sunset_minute = time_to_minute(sunset),
dusk_minute = time_to_minute(dusk),
date = ymd(str_sub(date, 1, 10))
) %>%
select(date, sunset, dusk, ends_with("minute"))
```
Great. Now, using `sunset_times`, what is Philly's inter-twilight period?
```{r}
sunset_times %>%
filter(dusk == min(dusk) | dusk == max(dusk))
```
So the earliest sunset in 2017 was at around 4:30 p.m., in December (and it was
fully dark about 30 minutes later, at dusk, around 5:00 p.m.). The latest sunset time
in 2017 was at around 8:30 p.m., in late June (and it was fully dark about 30
minutes later, at dusk, around 9:00 p.m.).
For a time in this range, like 6:30 p.m., it's still light during part of the year
and already dark during the rest. So we can compare the proportion of black drivers
stopped at 6:30 p.m. when it's light (usually in the summer) to the proportion of
black drivers stopped at 6:30 p.m. when it's dark (usually in the winter). This way,
we're really seeing the effect of darkness, since driving and deployment
patterns tend to change only with clock time. For example, the demographics
of commuters who are on the road at 6:30 p.m. likely stay pretty steady throughout
the year, compared to the difference in commuter demographics at 5:00 p.m. versus 9:00 p.m.
One quick note: We want to make sure to filter out stops that occurred in the
approximately 30-minute period between sunset and dusk (end of civil twilight),
since it is ambiguous whether that period is light or dark, which would muddle
up our analysis.
Let's join the sunset times to our stops data and:
1. Filter to the inter-twilight period;
2. Remove that ambiguous pre-dusk period; and
3. For simplicity, compare only black and white drivers.
```{r}
vod_stops <-
stops %>%
left_join(
sunset_times,
by = "date"
) %>%
mutate(
minute = time_to_minute(time),
minutes_after_dark = minute - dusk_minute,
is_dark = minute > dusk_minute,
min_dusk_minute = min(dusk_minute),
max_dusk_minute = max(dusk_minute),
is_black = subject_race == "black"
) %>%
filter(
# Filter to get only the intertwilight period
minute >= min_dusk_minute,
minute <= max_dusk_minute,
# Remove ambiguous period between sunset and dusk
!(minute > sunset_minute & minute < dusk_minute),
# Compare only white and black drivers
subject_race %in% c("black", "white")
)
```
How many stops are in our new, filtered data frame?
```{r}
vod_stops %>% nrow()
```
Now, to get some intuition, let's use our 6:30 p.m. example. If we filter the data
to stops that occurred in the narrow window from 6:30 p.m. to 6:45 p.m., we
can compute what proportion of stops were of black drivers for stops when it
was dark and compare that to the proportion for stops when it was not dark yet.
```{r}
vod_stops %>%
filter(time > hm("18:30"), time < hm("18:45")) %>%
group_by(is_dark) %>%
summarize(prop_black = mean(is_black))
```
Indeed, we see that the proportion is smaller when 6:30--6:45 p.m. is dark
than when it's light. According to the veil of darkness hypothesis, this is
because officers engaged in racial profiling can't determine drivers' races as
well at night. We know that the proportion of black drivers stopped dropped by
about 5% when it was dark and race was harder to identify. It's hard to say what
this means about the extent or pervasiveness of profiling, though. We *cannot* make
*any* claims about whether some small percent of the officers were engaged in
stops that were 100% profiled on race, or whether all officers were engaged in
racial profiling in some small percentage of their stops, or anything of the
sort.
Furthermore, while this stat gives us some intuition that racial profiling might
be present in Philly, we've only looked at one time. In order to make sure the
result is robust, we want to check this for _every_ time in our inter-twilight
period. To do this, we can use _logistic regression_. The basic idea here is
that we're estimating what our data imply about how darkness and clock time
influence whether a stopped driver will be black. We then extract the
coefficient of darkness. (You can think of this loosely as the measure of how
much darkness---as opposed to clock time---influences whether a stopped driver
will be black.)
```{r}
mod1 <- glm(
is_black ~ is_dark + splines::ns(minute, df = 6),
family = binomial,
data = vod_stops
)
summary(mod1)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]
```
The fact that the coefficient is negative means that darkness _lessens_ the
likelihood that a stopped driver will be black, after adjusting for clock
time. This matches our intuitive result from 6:30 p.m.
The fact that the standard error is small means that this is a _statistically significant_
finding. Another important thing to do, besides checking
statistical significance of your results, is to do a bunch of different
robustness checks. The idea is that all models make some choices about how to
turn the question at hand into math, and we want to vary those choices to make
sure that none of them change the model result---that the result holds regardless
of, for example, how you choose to control for time. In our model we used a
natural spline with six degrees of freedom, but we could have chosen to use
3 degrees of freedom, or to do something way simpler like bin time to one-minute
or five-minute windows.
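As a sketch of one such check, here's the same model with a 3-df spline; if the `is_dark` coefficient barely moves, the finding isn't an artifact of how we smoothed time:
```{r}
mod1_check <- glm(
  is_black ~ is_dark + splines::ns(minute, df = 3),
  family = binomial,
  data = vod_stops
)
summary(mod1_check)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]
```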
Another robustness check that often comes up with this type of modeling is
adding in a control for sub-geography. For the outcome test, we controlled
for police district under the proposition that maybe it was an intentional
policy decision for certain police districts to have lower search thresholds
than other police districts---and if those districts happened to be
disproportionately black, that would "explain away" the bias we saw in the
aggregate outcome test results.
In the case of the Veil of Darkness test, one could make an analogous claim:
is the result we're seeing just a product of certain districts being policed
more after dark, with those districts having a lower proportion of black drivers
than the districts policed before dark? If so, then adjusting for location
in our model would lead to the darkness coefficient being zero. But if the
coefficient on darkness doesn't change much (i.e., stays negative in a
statistically significant way), then that's another marker that our finding is
robust.
```{r}
mod2 <- glm(
is_black ~ is_dark + splines::ns(minute, df = 6) + as.factor(district),
family = binomial,
data = vod_stops
)
summary(mod2)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]
```
Looks like no matter how you slice it, the veil of darkness test shows racial
profiling of black drivers is present in Philly traffic stops.
### Caveats on the Veil of Darkness test
The veil-of-darkness test is a popular technique for assessing disparate
treatment, but, like all statistical methods, it comes with caveats.
(Hopefully that's one of the takeaways today---data and statistical analysis
can be very powerful, but it's good to know what the limitations are.) Perhaps
most importantly, darkness---after adjusting for time of day---is a function of
the date. As such, to the extent that driver behavior changes throughout the
year, and these changes are correlated with race, the test can suggest
discrimination where there is none. One way to account for seasonal effects
is to consider the brief period around the daylight saving time change. Driving
behavior likely doesn't change much from, say, 6:00 p.m. the day before the
change to 6:00 p.m. the day after; but the sky is light on one day and dark on
the next. Clever checks like this can help circumvent some of the concerns
about the veil of darkness test. However, if driving behavior is more related
to lighting than to time of day, the daylight saving check wouldn't help.
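Here's a rough sketch of what such a check might look like (the fall 2017 transition was November 5; the 30-day window is an arbitrary choice of ours):
```{r}
# Restrict to stops within 30 days of the fall-back transition and refit.
# Clock time barely shifts across this window, but darkness at a given
# minute changes abruptly.
dst_stops <- vod_stops %>%
  filter(abs(as.numeric(date - ymd("2017-11-05"))) <= 30)
mod_dst <- glm(
  is_black ~ is_dark + splines::ns(minute, df = 6),
  family = binomial,
  data = dst_stops
)
summary(mod_dst)$coefficients["is_darkTRUE", c("Estimate", "Std. Error")]
```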
Another thing to note is that artificial lighting (e.g., from street lamps)
can weaken the relationship between sunlight and visibility, and so the method
may underestimate the extent to which stops are predicated on perceived race.
Other attributes, like vehicle make, year, and model, often correlate with race
and are still visible at night, which could likewise lead the test to
underestimate the extent of racial profiling. Similarly, the test doesn't control for stop
reason, which is often correlated with both race and time of day. (Consider,
for example, pretextual broken tail light stops.)
Finally, the test only speaks to the presence of racial profiling in the
inter-twilight period; it doesn't say anything about other hours.
Nevertheless, despite these shortcomings, the test provides a useful if
imperfect measure of bias in stop decisions.
## Parting thoughts
Hopefully this gets you started on the basics of how to analyze data, and gives
you a little glimpse of some more complex modeling you can do with the data too.
While these analyses (from simple hit rates to modeling racial
profiling) can be really helpful in quantifying racial disparities, it's often
basic exploration of the data that yields the most interesting stories. So, the
number one rule in working with data is: *be curious!*
Make sure to read the [full paper](https://5harad.com/papers/100M-stops.pdf) for a
more in-depth analysis. And we look forward to reading about all of the analysis
you all do, and all the interesting stories you uncover!