# Law, Bias, and Algorithms
## Omitted/included variable bias and risk-adjusted regression

In [1]:
options(digits = 3)

library(tidyverse)

stop_df <- read_rds("../data/sqf_sample.rds")

theme_set(theme_bw())

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.0
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


The loaded data frame is a sample of stops in NYC, each recorded on a 
[UF-250 form][uf250_link].

Below is a list of columns in the data, roughly corresponding to the [UF-250 form][uf250_link]:

* Base information regarding stop:
    * `id`, `year`, `date`, `time`, `precinct`, `location_housing`, 
      `suspected_crime`

* Circumstances which led to stop:
    * `stop_reason_object`, `stop_reason_desc`, `stop_reason_casing`,
      `stop_reason_lookout`, `stop_reason_clothing`, `stop_reason_drugs`,
      `stop_reason_furtive`, `stop_reason_violent`, `stop_reason_bulge`,
      `stop_reason_other` 
    
* Suspect demographics:
    * `suspect_dob`, `suspect_id_type`, `suspect_sex`, `suspect_race`,
      `suspect_hispanic`, `suspect_age`, `suspect_height`, `suspect_weight`,
      `suspect_hair`, `suspect_eye`, `suspect_build`, `reason_explained`,
      `others_stopped`

* Whether physical force was used:
    * `force_hands`, `force_wall`, `force_ground`, `force_drawn`,
      `force_pointed`, `force_baton`, `force_handcuffs`,
      `force_pepper`, `force_other`

* Was suspect arrested?: `arrested`

* Was summons issued?: `summons_issued`

* Officer in uniform?: `officer_uniform`, `officer_verbal`, `officer_shield`

* Was person frisked?: `frisked`
    * if yes: `frisk_reason_suspected_crime`, `frisk_reason_weapons`, 
      `frisk_reason_attire`, `frisk_reason_actual_crime`, 
      `frisk_reason_noncompliance`, `frisk_reason_threats`,
      `frisk_reason_prior`, `frisk_reason_furtive`, `frisk_reason_bulge`

* Was person searched?: `searched`
    * if yes: `searched_hardobject`, `searched_outline`,
      `searched_admission`, `searched_other`

* Was weapon found?: `found_weapon`
    * if yes: `found_gun`, `found_pistol`, `found_rifle`, `found_assault`,
      `found_knife`, `found_machinegun`, `found_other`
      
* Was other contraband found?: `found_contraband`

* Additional circumstances/factors
    * `additional_report`, `additional_investigation`, `additional_proximity`, 
      `additional_evasive`, `additional_associating`, `additional_direction`, 
      `additional_highcrime`, `additional_time`, `additional_sights`, 
      `additional_other`

* Additional reports prepared: `extra_reports`

[uf250_link]: https://www.prisonlegalnews.org/media/publications/Blank%20UF-250%20Form%20-%20Stop%2C%20Question%20and%20Frisk%20Report%20Worksheet%2C%20NYPD%2C%202016.pdf

### Exercise 1: Initial exploration

* Compare columns of `stop_df` with the fields in the [UF-250 form][uf250_link].
* Explore basic statistics, e.g.,
    * What is the proportion of stops by `suspect_race`?
    * What are the five most common suspected crimes for a stop?
    * What proportion of stops result in retrieval of weapon or contraband?


In [2]:
# WRITE CODE HERE
# START solution
# Proportion of stops by suspect race
stop_df %>%
    group_by(suspect_race) %>%
    summarize(prop = n()/nrow(.))

# Top five suspected crimes
stop_df %>%
    group_by(suspected_crime) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    top_n(5, count)

# Proportion of "successful" stop (weapon or contraband found)
stop_df %>% 
    mutate(success = found_weapon | found_contraband) %>%
    summarize(p_success = mean(success))
# END solution

suspect_race,prop
white,0.127
black,0.548
hispanic,0.325


suspected_crime,count
cpw,29087
robbery,21230
burglary,10424
grand larceny auto,10005
other,8192


p_success
0.0511


## Base rate disparities in the decision to frisk

Here, we compute disparities in the police decision to frisk individuals across race groups.

### Exercise 2: manual computation of odds and odds ratios

* **Step 1**: For each race group, compute the proportion that were frisked

In [3]:
# With the stop_df data, group by suspect_race and compute the proportion (mean) of frisked == 1
# WRITE CODE HERE
# START solution
p_frisked_df <- stop_df %>%
    group_by(suspect_race) %>%
    summarize(p_frisked = mean(frisked))
# END solution

* **Step 2**: Given probability $p$ of being frisked, the *odds* of being frisked are computed as $p / (1-p)$. 
Using the proportion frisked from step 1 as an estimate of the probability of being frisked, compute the *odds* of being frisked for each race.

In [4]:
# Compute the odds, p / (1-p), where p is the proportion from step 1
# WRITE CODE HERE
# START solution
odds_df <- p_frisked_df %>%
    mutate(odds = p_frisked / (1 - p_frisked))
# END solution

* **Step 3**: A common method of comparing odds between two groups is to compute the *odds ratio*. 
This is simply the ratio between two odds. For example, if the odds of being frisked are 0.8 for whites and 1.6 for blacks, the odds ratio of being frisked for blacks vs. whites would be $1.6 / 0.8 = 2$. In other words, the odds of being frisked are twice as high for stopped blacks as for stopped whites. Note that this is a statement about odds, not about probabilities.
Using the odds computed in step 2, compute the odds ratio for minority groups (black / Hispanic) versus whites.

In [5]:
# Compute odds of frisk for minority race group / odds of frisk for whites
# WRITE CODE HERE
# START solution
# Purely-tidy solution
odds_df %>%
    select(suspect_race, odds) %>%
    spread(suspect_race, odds) %>%
    transmute(or_black = black / white, or_hispanic = hispanic / white)

# Alternative solution
odds_black <- odds_df$odds[odds_df$suspect_race == "black"]
odds_hispanic <- odds_df$odds[odds_df$suspect_race == "hispanic"]
odds_white <- odds_df$odds[odds_df$suspect_race == "white"]

cat("odds ratio for black:", odds_black / odds_white)
cat("\n")
cat("odds ratio for hispanic:", odds_hispanic / odds_white)
# END solution

or_black,or_hispanic
2.1,1.88


odds ratio for black: 2.1
odds ratio for hispanic: 1.88
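
As a check on how to read an odds ratio, the sketch below converts the illustrative odds from Step 3 (0.8 and 1.6, which are not values estimated from `stop_df`) back into probabilities. The `example_` names are hypothetical. It suggests that an odds ratio of 2 corresponds to a noticeably smaller ratio of frisk probabilities, so the two should not be used interchangeably.

```r
# Illustrative odds from the Step 3 text (not estimated from stop_df)
example_odds_white <- 0.8
example_odds_black <- 1.6

# Invert odds = p / (1 - p) to get p = odds / (1 + odds)
example_p_white <- example_odds_white / (1 + example_odds_white)
example_p_black <- example_odds_black / (1 + example_odds_black)

odds_ratio <- example_odds_black / example_odds_white  # ratio of odds
prob_ratio <- example_p_black / example_p_white        # ratio of probabilities

print(c(p_white = example_p_white, p_black = example_p_black))
print(c(odds_ratio = odds_ratio, prob_ratio = prob_ratio))
```

The gap between the two ratios grows as the underlying probabilities get larger; when the outcome is rare, the odds ratio and the ratio of probabilities are close.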

Another method for comparing differences in treatment is to use regression. 
Specifically for binary treatment, e.g., where the decision is either "frisk" or "don't frisk", logistic regression is commonly used.

In `R` we use the `glm` function to fit *generalized* linear models (e.g., logistic regression, Poisson regression). 
In its simplest form, the `glm` function is specified with a `formula`, the `data`, and a `family` which indicates what type of regression is used.
A `formula` in `R` is specified in the form: `Left-hand-side variable ~ Right-hand-side specifications`.
For example, to fit a logistic regression (which is of the `binomial` `family`) of `frisked` on the `suspect_race` variable, using the `stop_df` data, we can write:

In [6]:
base_model <- glm(frisked ~ suspect_race, data = stop_df, family = binomial)

where the first argument to `glm` is assumed to be the `formula`.
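
The call pattern can be tried on a tiny made-up data frame. In the sketch below, `toy_df`, `group`, and `outcome` are hypothetical names used only for illustration; the second call spells out every argument by name and should yield the same fit as the first.

```r
# Hypothetical toy data: 50 rows per group, with a binary outcome
toy_df <- data.frame(
    group   = factor(rep(c("a", "b"), each = 50)),
    outcome = c(rep(c(1, 0), times = c(20, 30)),   # group a: 20 of 50
                rep(c(1, 0), times = c(30, 20)))   # group b: 30 of 50
)

# Formula passed positionally as the first argument
fit_positional <- glm(outcome ~ group, data = toy_df, family = binomial)

# Same model with every argument named
fit_named <- glm(formula = outcome ~ group, family = binomial, data = toy_df)

print(coef(fit_positional))
print(all.equal(coef(fit_positional), coef(fit_named)))
```

Because `group` is a factor with levels `a` and `b`, `glm` treats the first level (`a`) as the base case and reports a single coefficient named `groupb`.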

We can inspect the coefficients of the fitted model using the `coef()` function, i.e.,

In [7]:
print(coef(base_model))

         (Intercept)    suspect_raceblack suspect_racehispanic 
               0.304                0.740                0.633 


Note that the coefficients of a logistic regression represent the change in log-odds of treatment for a unit change in the variable, compared to the base case.

In this specific example, the base case is `suspect_race = white`, and the `suspect_raceblack` coefficient represents the change in *log*-odds of being frisked for black individuals compared to the base case of white individuals. By exponentiating the coefficients, we recover the odds ratio of treatment for each race with respect to the base case (whites).
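
To make the connection explicit, the fitted model can be written out as follows. Here $p$ is the probability of being frisked, and $\beta_0$, $\beta_b$, $\beta_h$ are shorthand introduced for this note (not names from the `glm` output) for the intercept and the `suspect_raceblack` and `suspect_racehispanic` coefficients:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_b \cdot \mathbb{1}[\text{black}] + \beta_h \cdot \mathbb{1}[\text{hispanic}]
$$

For whites both indicators are zero, so $e^{\beta_0}$ is the odds of being frisked for whites. For blacks the log-odds are $\beta_0 + \beta_b$, which gives

$$
e^{\beta_b} = \frac{e^{\beta_0 + \beta_b}}{e^{\beta_0}} = \frac{\text{odds}_{\text{black}}}{\text{odds}_{\text{white}}}
$$

and the same argument applies to $\beta_h$ for Hispanic individuals.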

In [8]:
# Exponentiating the coefficients recovers the odds ratio of treatment for each variable,
# identical to what we found in Exercise 2,
# while the exponentiated intercept represents the odds of treatment for the base case (whites)
print(exp(coef(base_model)))

         (Intercept)    suspect_raceblack suspect_racehispanic 
                1.36                 2.10                 1.88 


### Exercise 3: discussion of base rate disparities

Given the results so far, what can we say about disparate impact of frisk decisions on race?
What are some issues that need to be addressed?

## Omitted variable bias

One concern is that there might be a legitimate reason for officers to frisk some stopped individuals more often than others, and that this reason happens to be highly correlated with race.

For example, one of the reasons for stopping an individual is if the officer suspects criminal possession of a weapon (encoded in the `suspected_crime` column as `cpw`).
Given that the primary justification of a frisk is concern for officer safety, it is entirely reasonable for an officer to 
frisk individuals whom they have stopped under suspicion of criminal possession of a weapon.

### Exercise 4: with `stop_df`, create a new binary column named `is_cpw` that is `TRUE` if `suspected_crime` is `cpw`.

In [9]:
# WRITE CODE TO ADD is_cpw column HERE
# START solution
stop_df <- stop_df %>%
    mutate(is_cpw = suspected_crime == "cpw")
# END solution

However, we find that individuals who are suspected of `cpw` are *not* evenly distributed across race groups.

In [10]:
stop_df %>%
  group_by(suspect_race) %>%
  summarize(p_cpw = mean(is_cpw))

suspect_race,p_cpw
white,0.117
black,0.354
hispanic,0.252


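Specifically, we find that a larger proportion of minorities than whites are stopped for `cpw`,
and if we control for `is_cpw` in our analysis, 
the disparities we measure decrease substantially: relative to the base model, the race coefficients in the model below fall from 0.740 to 0.404 (black) and from 0.633 to 0.462 (Hispanic).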

In [11]:
glm(frisked ~ suspect_race + is_cpw, data = stop_df, family = binomial)


Call:  glm(formula = frisked ~ suspect_race + is_cpw, family = binomial, 
    data = stop_df)

Coefficients:
         (Intercept)     suspect_raceblack  suspect_racehispanic  
               0.122                 0.404                 0.462  
          is_cpwTRUE  
               2.278  

Degrees of Freedom: 99999 Total (i.e. Null);  99996 Residual
Null Deviance:	    120000 
Residual Deviance: 107000 	AIC: 107000
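
The mechanism behind this drop can be reproduced in a small simulation. The sketch below uses entirely made-up data (the names `sim_df`, `group`, `flag`, and `treated`, and all probabilities, are assumptions for illustration): treatment depends only on `flag`, but `flag` is more common in group `b`, so a model that omits `flag` attributes its effect to `group`.

```r
set.seed(42)
n <- 20000

# Hypothetical groups; `flag` stands in for a variable like is_cpw
group <- factor(rep(c("a", "b"), each = n / 2))
flag  <- rbinom(n, 1, ifelse(group == "b", 0.4, 0.1)) == 1

# Treatment depends on `flag` only, never directly on `group`
treated <- rbinom(n, 1, ifelse(flag, 0.9, 0.5))

sim_df <- data.frame(group, flag, treated)

# Omitting `flag` produces an apparent group disparity ...
omitted  <- glm(treated ~ group, data = sim_df, family = binomial)
# ... which should largely disappear once `flag` is controlled for
adjusted <- glm(treated ~ group + flag, data = sim_df, family = binomial)

print(coef(omitted)["groupb"])
print(coef(adjusted)["groupb"])
```

In this simulation there is no direct effect of `group` by construction. In the stop data, by contrast, the race coefficients shrink after controlling for `is_cpw` but remain positive.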

### Exercise 5: what variables should be included?

Explore `stop_df`, and discuss what variables (columns) should be accounted for when measuring disparate impact of frisk on race.
What variables should (or should *not*) be included?
How does including different variables in the regression affect the coefficient on race?

In [12]:
# WRITE CODE HERE
# START solution
# We are only interested in the race coefficients
race_coefficients <- c("suspect_raceblack", "suspect_racehispanic")

# Example model including multiple variables
print(coef(glm(frisked ~ suspect_race + suspected_crime + location_housing, 
               data = stop_df, family = binomial))[race_coefficients])

print(coef(glm(frisked ~ suspect_race + suspected_crime + location_housing + precinct, 
               data = stop_df, family = binomial))[race_coefficients])
# END solution

   suspect_raceblack suspect_racehispanic 
               0.240                0.324 
   suspect_raceblack suspect_racehispanic 
               0.249                0.199 


## Included variable bias

One common method for measuring disparities while addressing some of the omitted variable bias concerns is to include _all_ recorded data that would have been available to the officer at the time of the decision (here, the decision to frisk an individual). This is also known as the "kitchen sink" approach.

### Exercise 6: The kitchen sink approach

For convenience, we have created a formula that includes all the variables that an officer would have had available.

* **Step 1**: Using the provided `kitchen_sink_formula`, apply the kitchen sink approach to measure the disparate impact of 
frisk on minority race groups.

In [13]:
feats <- c(
    "suspected_crime",
    "precinct",
    "location_housing",
    "suspect_sex",
    "suspect_age",
    "suspect_height",
    "suspect_weight",
    "suspect_hair",
    "suspect_eye",
    "suspect_build",
    "additional_report",
    "additional_investigation",
    "additional_proximity",
    "additional_evasive",
    "additional_associating",
    "additional_direction",
    "additional_highcrime",
    "additional_time",
    "additional_sights",
    "additional_other",
    "stop_reason_object",
    "stop_reason_desc",
    "stop_reason_casing",
    "stop_reason_lookout",
    "stop_reason_clothing",
    "stop_reason_drugs",
    "stop_reason_furtive",
    "stop_reason_violent",
    "stop_reason_bulge",
    "stop_reason_other",
    "suspect_race"
)

# This creates a formula with a specified left-hand side (frisked), and using 
# all the variables in feats on the right-hand side. 
# Constructing a formula in this way (instead of typing out all the variable names)
# is helpful for constructing multiple models that share a long list of variables in the right-hand side.
kitchen_sink_formula <- as.formula(paste("frisked ~", paste(feats, collapse = "+")))

# WRITE CODE HERE
# START solution
# We are only interested in the race coefficients
ks_model <- glm(kitchen_sink_formula, stop_df, family = binomial)
print(coef(ks_model)[race_coefficients])
# END solution

   suspect_raceblack suspect_racehispanic 
               0.191                0.173 


* **Step 2**: Note how the kitchen sink model reduces the coefficients on race (from 0.740 and 0.633 in the base model to 0.191 and 0.173), suggesting much less disparate impact than the base model.
Now carefully consider each variable that is included in `feats`. Are all of these variables justified? Which would you argue should or should _not_ be included? Why?

_Tip_: you can fit new models with different sets of features by commenting out (adding a `# ` to the beginning of) lines that define the `feats` vector and re-running the cell.
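
As an alternative to commenting out lines, features can be dropped programmatically before rebuilding the formula. The sketch below uses a shortened, hypothetical feature vector (`example_feats`) rather than the full `feats`, and the choice of which features to drop is arbitrary.

```r
# Shortened stand-in for the full `feats` vector (illustration only)
example_feats <- c("suspected_crime", "precinct", "suspect_hair",
                   "suspect_eye", "suspect_race")

# Features to exclude from this particular model (arbitrary choice)
drop_feats <- c("suspect_hair", "suspect_eye")
kept_feats <- setdiff(example_feats, drop_feats)

# Same construction as kitchen_sink_formula, on the reduced feature set
reduced_formula <- as.formula(paste("frisked ~", paste(kept_feats, collapse = "+")))
print(reduced_formula)

# stats::reformulate() builds the same kind of formula without paste()
reduced_formula_alt <- reformulate(kept_feats, response = "frisked")
print(reduced_formula_alt)
```

The resulting formula can be passed to `glm` in place of `kitchen_sink_formula`.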

## Risk-adjusted regression

In such contexts of measuring disparate impact, controlling for any variable (i.e., including it in the regression) is only justified if the variable is _predictive of the outcome we are ultimately interested in_ (in this case, recovering a weapon) and _happens to be correlated with race_. But as we see above, the extent to which each variable is justified is rarely clear.

One simple idea for addressing this concern of included variable bias is to control for a measure of **risk**, instead of controlling for individual variables.
Intuitively, we wish to know whether individuals who have _similar risk_ (of carrying a weapon) were treated (frisked) equally.

### Exercise 7: estimating risk

In order to adjust for risk, we must first estimate it. This is relatively straightforward in the context of frisk decisions in stop-and-frisk, 
because the goal of a frisk is clear---we wish to recover weapons (`found_weapon`). 
In other words, we want to predict whether a weapon would be found if an individual is frisked. 
One simple way to achieve this is to build a model with `found_weapon` on the left-hand side, restricting the data to the individuals who were frisked.

* **Step 1**: `filter` the `stop_df` data to those individuals who were frisked. We will call this new data frame `frisk_df`

In [14]:
# Subset the stop_df data to cases where the individual was frisked
# WRITE CODE HERE
# START solution
frisk_df <- stop_df %>%
    filter(frisked)
# END solution

* **Step 2**: Fit a logistic regression where the left-hand side is whether or not a weapon was found (`found_weapon`) and the right-hand side is all reasonable variables (as listed in `feats` above). Use this model to generate a column of model estimated risk (let's call it `risk`) on the original `stop_df` data. Note that we use logistic regression here for simplicity, but in reality, more complex methods would be employed, with additional measures to avoid overfitting.

_Tip_: Given a `glm` model named `risk_model`, a vector of predictions for `stop_df` can be created with the command `predict(risk_model, stop_df)`.
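
One detail worth keeping in mind is the scale of these predictions: for a `glm` fit, `predict()` returns values on the link scale by default (log-odds for a logistic regression), and `type = "response"` returns probabilities instead. The sketch below checks this on a made-up data frame; `pred_df`, `x`, `y`, and `pred_model` are hypothetical names.

```r
set.seed(1)
# Made-up data: a numeric predictor and a binary outcome
pred_df <- data.frame(x = rnorm(200))
pred_df$y <- rbinom(200, 1, plogis(-1 + pred_df$x))

pred_model <- glm(y ~ x, data = pred_df, family = binomial)

log_odds <- predict(pred_model, pred_df)                     # default: link scale
prob     <- predict(pred_model, pred_df, type = "response")  # probability scale

# The inverse logit (plogis) maps log-odds to probabilities
print(all.equal(unname(plogis(log_odds)), unname(prob)))
print(range(prob))
```

Either scale preserves the ordering of individuals by estimated risk, but the coefficient on `risk` in a later regression is interpreted differently depending on which one is used.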

In [15]:
# Using the subset of data from Step 1, fit the logistic regression model: found_weapon ~ (all variables in feats)
# WRITE CODE HERE
# START solution
risk_formula <- as.formula(paste("found_weapon ~", paste(feats, collapse = "+")))
risk_model <- glm(risk_formula, data = frisk_df, family = binomial)

stop_df <- stop_df %>%
    mutate(risk = predict(risk_model, .))
# END solution

* **Step 3**: Fit a logistic regression model to measure disparate impact on race, but only controlling for risk, i.e., with the formula
`frisked ~ suspect_race + risk`

In [19]:
glm(frisked ~ suspect_race + risk, stop_df, family = binomial)


Call:  glm(formula = frisked ~ suspect_race + risk, family = binomial, 
    data = stop_df)

Coefficients:
         (Intercept)     suspect_raceblack  suspect_racehispanic  
               1.263                 0.807                 0.666  
                risk  
               0.228  

Degrees of Freedom: 99999 Total (i.e. Null);  99996 Residual
Null Deviance:	    120000 
Residual Deviance: 117000 	AIC: 117000
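
Because the true disparity in real data is unknown, it can help to run the same three steps on simulated data where it is known. The sketch below is a simplified illustration under strong assumptions, not a reproduction of the analysis above: all names prefixed with `sim_` and all coefficients are made up, a single observed `sim_signal` drives the true risk, and group `b` is frisked more often than group `a` at equal risk (by 0.5 on the log-odds scale).

```r
set.seed(7)
sim_n <- 20000

sim_group  <- factor(rep(c("a", "b"), each = sim_n / 2))
sim_signal <- rnorm(sim_n)  # hypothetical observed indicator of risk

# True risk depends on the signal only
sim_has_weapon <- rbinom(sim_n, 1, plogis(-2 + sim_signal))
# Frisk decision depends on the signal plus a built-in disparity of 0.5 for group b
sim_frisked <- rbinom(sim_n, 1, plogis(sim_signal + 0.5 * (sim_group == "b")))
# A weapon can only be found if the individual is frisked
sim_found_weapon <- sim_frisked * sim_has_weapon

sim_df <- data.frame(group = sim_group, signal = sim_signal,
                     frisked = sim_frisked, found_weapon = sim_found_weapon)

# Step 1: restrict to frisked individuals
sim_frisk_df <- sim_df[sim_df$frisked == 1, ]

# Step 2: estimate risk on the frisked subset, then predict for everyone
sim_risk_model <- glm(found_weapon ~ signal, data = sim_frisk_df, family = binomial)
sim_df$risk <- predict(sim_risk_model, sim_df)  # log-odds scale

# Step 3: regress the frisk decision on group and estimated risk
sim_adjusted <- glm(frisked ~ group + risk, data = sim_df, family = binomial)
print(coef(sim_adjusted))
```

In this setup the coefficient on `groupb` should land near the built-in disparity of 0.5. That result relies on the simulation's assumption that, given the signal, carrying a weapon is independent of being frisked; if officers act on information that is not recorded in the data, a risk model fit on frisked individuals may not carry over to those who were not frisked.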