# Law, Bias, and Algorithms
## Included variable bias (1/2)

In this exercise, we will investigate how to use a regression model to measure disparities across different groups, and discuss some of the problems that might arise in doing so.

In [1]:
# Some initial setup
options(digits = 3)
library(tidyverse)

theme_set(theme_bw())

# Read the data
stop_df <- read_rds("../data/sqf_sample.rds")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.7
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


The loaded data frame is a sample of stops in NYC, each recorded on a 
[UF-250 form][uf250_link].

Below is a list of columns in the data, roughly corresponding to the [UF-250 form][uf250_link]:

* Base information regarding stop:
    * `id`, `year`, `date`, `time`, `precinct`, `location_housing`, 
      `suspected_crime`

* Circumstances which led to stop:
    * `stop_reason_object`, `stop_reason_desc`, `stop_reason_casing`,
      `stop_reason_lookout`, `stop_reason_clothing`, `stop_reason_drugs`,
      `stop_reason_furtive`, `stop_reason_violent`, `stop_reason_bulge`,
      `stop_reason_other` 
    
* Suspect demographics:
    * `suspect_dob`, `suspect_id_type`, `suspect_sex`, `suspect_race`,
      `suspect_hispanic`, `suspect_age`, `suspect_height`, `suspect_weight`,
      `suspect_hair`, `suspect_eye`, `suspect_build`, `reason_explained`,
      `others_stopped`

* Whether physical force was used:
    * `force_hands`, `force_wall`, `force_ground`, `force_drawn`,
      `force_pointed`, `force_baton`, `force_handcuffs`,
      `force_pepper`, `force_other`

* Was suspect arrested?: `arrested`

* Was summons issued?: `summons_issued`

* Officer in uniform?: `officer_uniform`, `officer_verbal`, `officer_shield`

* Was person frisked?: `frisked`
    * if yes: `frisk_reason_suspected_crime`, `frisk_reason_weapons`, 
      `frisk_reason_attire`, `frisk_reason_actual_crime`, 
      `frisk_reason_noncompliance`, `frisk_reason_threats`,
      `frisk_reason_prior`, `frisk_reason_furtive`, `frisk_reason_bulge`

* Was person searched?: `searched`
    * if yes: `searched_hardobject`, `searched_outline`,
      `searched_admission`, `searched_other`

* Was weapon found?: `found_weapon`
    * if yes: `found_gun`, `found_pistol`, `found_rifle`, `found_assault`,
      `found_knife`, `found_machinegun`, `found_other`
      
* Was other contraband found?: `found_contraband`

* Additional circumstances/factors
    * `additional_report`, `additional_investigation`, `additional_proximity`, 
      `additional_evasive`, `additional_associating`, `additional_direction`, 
      `additional_highcrime`, `additional_time`, `additional_sights`, 
      `additional_other`

* Additional reports prepared: `extra_reports`

[uf250_link]: https://www.prisonlegalnews.org/media/publications/Blank%20UF-250%20Form%20-%20Stop%2C%20Question%20and%20Frisk%20Report%20Worksheet%2C%20NYPD%2C%202016.pdf

## Base rate disparities in the decision to frisk

First, let's measure the disparities in police decisions to frisk individuals of different race groups.

### Exercise 1: manual computation of odds and odds ratios

* **Step 1**: For each race group, compute the proportion that were frisked

In [2]:
# With the stop_df data, group by suspect_race and compute the proportion (mean) of frisked == 1
# WRITE CODE HERE
# START solution
p_frisked_df <- stop_df %>%
    group_by(suspect_race) %>%
    summarize(p_frisked = mean(frisked))
# END solution

* **Step 2**: Given probability $p$ of being frisked, the *odds* of being frisked are computed as $p / (1-p)$. 

For example, if $p = \frac{1}{2}$, you're equally likely to be frisked or not (i.e., odds = 1); if $p = \frac{2}{3}$, you're twice as likely to be frisked as not (odds = 2).

Using the proportion frisked from Step 1 as an estimate of the probability of being frisked, compute the *odds* of being frisked for each race.
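
As a quick sanity check of the odds formula, the sketch below applies it to the two illustrative probabilities from the text. The helper name `odds` is hypothetical and not part of the exercise data or solution.

```r
# Hypothetical helper: convert a probability p into odds p / (1 - p)
odds <- function(p) p / (1 - p)

# The two illustrative probabilities used in the text
p_example <- c(1 / 2, 2 / 3)
odds_example <- odds(p_example)
print(odds_example)
```

The same expression works on a whole column of proportions at once, since the arithmetic is vectorized.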

In [3]:
# Compute the odds, p / (1-p), where p is the proportion from step 1
# WRITE CODE HERE
# START solution
odds_df <- p_frisked_df %>%
    mutate(odds = p_frisked / (1 - p_frisked))
# END solution

* **Step 3**: A common method of comparing odds between two groups is to compute the *odds ratio*. 
This is simply the ratio between two odds.

For example, if the odds of being frisked are 0.8 for white pedestrians and 1.6 for black pedestrians, the odds ratio of being frisked for black vs. white pedestrians would be $1.6 / 0.8 = 2$. In other words, we would say stopped black pedestrians have twice the odds of being frisked compared to stopped white pedestrians.
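
An odds ratio is not the same as a ratio of probabilities. As a worked illustration using the example odds above, each odds value can be converted back to a probability with $p = \text{odds} / (1 + \text{odds})$:
$$
p_{\text{white}} = \frac{0.8}{1.8} \approx 0.44, \qquad
p_{\text{black}} = \frac{1.6}{2.6} \approx 0.62, \qquad
\frac{p_{\text{black}}}{p_{\text{white}}} \approx 1.38.
$$
So in this example an odds ratio of 2 corresponds to a frisk probability that is roughly 1.4 times as large, not twice as large.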

Using the odds computed in Step 2, compute the odds ratio for minority groups (black / Hispanic) versus whites.

In [5]:
# Compute odds of frisk for minority race group / odds of frisk for whites
# WRITE CODE HERE
# START solution
# Purely-tidy solution
odds_df %>%
    select(suspect_race, odds) %>%
    spread(suspect_race, odds) %>%
    transmute(black_white_odds = black / white, hispanic_white_odds = hispanic / white)

# Alternative solution
odds_black <- odds_df$odds[odds_df$suspect_race == "black"]
odds_hispanic <- odds_df$odds[odds_df$suspect_race == "hispanic"]
odds_white <- odds_df$odds[odds_df$suspect_race == "white"]

cat("Black-white odds ratio:", odds_black / odds_white, "\n")
cat("Hispanic-white odds ratio:", odds_hispanic / odds_white)
# END solution

| black_white_odds | hispanic_white_odds |
|---|---|
| 2.1 | 1.88 |


Black-white odds ratio: 2.1 
Hispanic-white odds ratio: 1.88

### Base rate disparities with (logistic) regression

Another method for comparing differences in frisk rates is to use regression. 
Specifically, logistic regression is commonly used for binary decisions (e.g., where the decision is either "frisk" or "don't frisk").

In `R` we use the `glm` function to fit *generalized* linear models (e.g., logistic regression, Poisson regression). 
In its simplest form, the `glm` function is specified with a `formula`, the `data`, and a `family`, which indicates what type of regression is used.
A `formula` in `R` is specified in the form: `Left-hand-side variable ~ Right-hand-side specifications`.
For example, to fit a logistic regression (which is of the `"binomial"` family) of `frisked` on the `suspect_race` variable, using the `stop_df` data, we can write:

In [5]:
base_model <- glm(frisked ~ suspect_race, data = stop_df, family = binomial)

where the first argument to `glm` is assumed to be the `formula`. 

Using mathematical notation, this corresponds to the model:
$$
\Pr(\text{frisked}) = \operatorname{logit}^{-1}(
    \beta_0 + \beta_{\text{black}}\mathbb{1}_{\text{black}} + 
    \beta_{\text{Hispanic}}\mathbb{1}_{\text{Hispanic}}
).
$$

Recall that $\operatorname{logit}(p)$ for some probability $p$ is defined as the _log_-odds of $p$:
$$
\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right).
$$
So, given
$$
p = \operatorname{logit}^{-1}(x),
$$
$x$ is the _log_-odds of $p$, and $\exp(x)$ corresponds to the odds of $p$.
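
The relationship between probabilities, odds, and log-odds can be checked numerically. The sketch below uses base R's `qlogis()` (the logit) and `plogis()` (the inverse logit) on a few arbitrary probabilities chosen only for illustration.

```r
# Base R (stats package) provides the logit and its inverse:
#   qlogis(p) = log(p / (1 - p))   (logit, i.e. log-odds)
#   plogis(x) = 1 / (1 + exp(-x))  (inverse logit)
p <- c(0.1, 0.5, 2 / 3, 0.9)   # arbitrary illustrative probabilities

log_odds <- qlogis(p)          # log-odds of each probability
odds_from_x <- exp(log_odds)   # exponentiating the log-odds gives the odds
p_back <- plogis(log_odds)     # the inverse logit recovers the probability

print(data.frame(p, log_odds, odds = odds_from_x, p_back))
```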

From the above model, we can compute $\Pr(\text{frisked})$ as $\operatorname{logit}^{-1}(\beta_0)$ for white individuals and $\operatorname{logit}^{-1}(\beta_0 + \beta_{\text{black}})$ for black individuals.
It follows that $\exp(\beta_0)$ is the odds of being frisked for white individuals who have been stopped, 
while $\exp(\beta_0 + \beta_{\text{black}})$ is the odds of being frisked for black individuals who have been stopped.
Dividing the second by the first, the odds _ratio_ of being frisked for black vs. white pedestrians is $\exp(\beta_{\text{black}})$: the exponentiated coefficient on
the variable indicating whether a pedestrian's race group is black or not.
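
To see this equivalence in action, here is a minimal sketch on simulated data (not the stop-and-frisk sample). The group labels, rates, and variable names are made up for illustration. With a single categorical predictor, the exponentiated coefficients should match the manually computed odds ratios up to numerical tolerance.

```r
# Simulated (made-up) data: a binary outcome whose rate differs by group
set.seed(1)
n <- 5000
group <- factor(sample(c("a", "b", "c"), n, replace = TRUE),
                levels = c("a", "b", "c"))
p_true <- c(a = 0.55, b = 0.70, c = 0.65)
outcome <- rbinom(n, size = 1, prob = p_true[as.character(group)])

# Manual computation: proportion -> odds -> odds ratio vs. reference group "a"
p_hat <- sapply(split(outcome, group), mean)
odds_hat <- p_hat / (1 - p_hat)
manual_or <- odds_hat[c("b", "c")] / odds_hat[["a"]]

# Logistic regression with the group as the only predictor
fit <- glm(outcome ~ group, family = binomial)
model_or <- exp(coef(fit)[c("groupb", "groupc")])

print(unname(manual_or))
print(unname(model_or))
```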

We can inspect the coefficients of the fitted model using the `coef()` function.

In [6]:
print(coef(base_model))

         (Intercept)    suspect_raceblack suspect_racehispanic 
               0.304                0.740                0.633 


As we've seen above, the `(Intercept)` ($\beta_0$) term corresponds to the _log_-odds of being frisked for stopped white individuals, while the `suspect_raceblack` coefficient represents the difference in *log*-odds (the log of the odds ratio) of being frisked for black individuals compared to white individuals. By exponentiating the coefficients, we recover the odds of being frisked for whites and the odds ratio of being frisked for each minority race group relative to whites.

In [7]:
# Exponentiating the race coefficients recovers the odds ratio of treatment for each group
# relative to whites; these are identical to what we found in Exercise 1,
# while the exponentiated intercept represents the odds of treatment for the base case (whites)
print(exp(coef(base_model)))

         (Intercept)    suspect_raceblack suspect_racehispanic 
                1.36                 2.10                 1.88 
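
The fitted coefficients can also be mapped back to frisk probabilities. The sketch below hard-codes the rounded coefficients printed above (so results are only accurate to about three digits) and applies the inverse logit; the vector name `beta` is just for illustration.

```r
# Coefficients copied from the printed output of base_model (rounded to 3 digits)
beta <- c(intercept = 0.304, black = 0.740, hispanic = 0.633)

# Inverse logit of the linear predictor gives the implied frisk probability per group
p_white    <- plogis(beta[["intercept"]])
p_black    <- plogis(beta[["intercept"]] + beta[["black"]])
p_hispanic <- plogis(beta[["intercept"]] + beta[["hispanic"]])
print(round(c(white = p_white, black = p_black, hispanic = p_hispanic), 3))

# Odds ratio recomputed from those probabilities; it agrees with exp(beta)
or_black <- (p_black / (1 - p_black)) / (p_white / (1 - p_white))
print(c(from_probabilities = or_black, from_coefficient = exp(beta[["black"]])))
```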


### Exercise 2: discussion of base rate disparities

Given the results so far, what can we say about disparate impact of frisk decisions on groups defined by race?
What are some issues that need to be addressed?

## Adjusting for possible confounders

One concern is that officers might have a legitimate reason to frisk certain individuals more often; it might just be that the "legitimate reason" is also highly correlated with race.

For example, one of the reasons for stopping an individual is if the officer suspects criminal possession of a weapon (encoded in the `suspected_crime` column as `cpw`).
Given that the primary justification of a frisk is concern for officer safety, one could argue that it is reasonable for an officer to 
frisk individuals whom they have stopped under suspicion of criminal possession of a weapon.

(Whether an officer's _suspicion_ itself is justified is a different question, which we will address later.)
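
The concern can be sketched on simulated data (not the stop-and-frisk sample). All variable names, rates, and effect sizes below are assumptions chosen for illustration: a "reason" that is more common in one group and strongly drives the decision inflates the unadjusted group coefficient, and adjusting for it brings the coefficient back toward the direct effect.

```r
# Simulated (made-up) data, not the stop-and-frisk sample
set.seed(42)
n <- 20000
minority <- rbinom(n, 1, 0.5)

# The "legitimate reason" is more common in one group ...
has_reason <- rbinom(n, 1, ifelse(minority == 1, 0.35, 0.12))

# ... and strongly drives the decision, alongside a smaller direct group effect (0.4)
decision <- rbinom(n, 1, plogis(0.1 + 0.4 * minority + 2.3 * has_reason))

unadjusted <- glm(decision ~ minority, family = binomial)
adjusted   <- glm(decision ~ minority + has_reason, family = binomial)

coef_unadjusted <- coef(unadjusted)[["minority"]]
coef_adjusted   <- coef(adjusted)[["minority"]]
print(c(unadjusted = coef_unadjusted, adjusted = coef_adjusted))
```

In this simulation the adjusted coefficient should land near the direct effect of 0.4 that was built into the data, while the unadjusted one is noticeably larger.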

### Adjusting for `suspected_crime == "cpw"` 

* **Step 1**: With `stop_df`, we first create a new binary column named `is_cpw` that is `TRUE` if `suspected_crime` is `cpw`.

In [6]:
stop_df <- stop_df %>%
    mutate(is_cpw = suspected_crime == "cpw")

* **Step 2**: For each race group, we can compute the proportion of individuals who were stopped under suspicion of `cpw`:

In [7]:
stop_df %>%
  group_by(suspect_race) %>%
  summarize(p_cpw = mean(is_cpw))

| suspect_race | p_cpw |
|---|---|
| white | 0.117 |
| black | 0.354 |
| hispanic | 0.252 |


Above, we find that individuals who are suspected of `cpw` are *not* evenly distributed across race groups.

Specifically, we find that a larger proportion of minorities are stopped for `cpw` than white individuals,
and if we adjust for `is_cpw` in our analysis, 
the measured disparities shrink considerably: compared with the base model, the coefficient on `suspect_raceblack` falls from 0.740 to 0.404, and the coefficient on `suspect_racehispanic` from 0.633 to 0.462.

In [10]:
glm(frisked ~ suspect_race + is_cpw, data = stop_df, family = binomial)


Call:  glm(formula = frisked ~ suspect_race + is_cpw, family = binomial, 
    data = stop_df)

Coefficients:
         (Intercept)     suspect_raceblack  suspect_racehispanic  
               0.122                 0.404                 0.462  
          is_cpwTRUE  
               2.278  

Degrees of Freedom: 99999 Total (i.e. Null);  99996 Residual
Null Deviance:	    120000 
Residual Deviance: 107000 	AIC: 107000

Note here that we "adjust for" a variable in our data by including it in the right-hand side of our regression formula.
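
To compare the two models on the odds-ratio scale, the sketch below exponentiates the race coefficients printed above. The values are hard-coded from the rounded outputs, so the results are approximate; the object names are illustrative only.

```r
# Race coefficients copied from the printed model outputs (log-odds scale)
base_coef     <- c(black = 0.740, hispanic = 0.633)  # frisked ~ suspect_race
adjusted_coef <- c(black = 0.404, hispanic = 0.462)  # frisked ~ suspect_race + is_cpw

comparison <- data.frame(
    base_odds_ratio     = exp(base_coef),
    adjusted_odds_ratio = exp(adjusted_coef)
)
print(round(comparison, 2))
```

After adjusting for `is_cpw`, the odds ratios drop to roughly 1.50 (black vs. white) and 1.59 (Hispanic vs. white), down from 2.10 and 1.88.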

### Exercise 3: Adjusting for confounding

Following the above logic, there could be multiple legitimate factors that account for the observed disparity of being frisked between different race groups. 
How does changing the model affect the coefficients on race? How do you interpret these results?

In [10]:
# We are only interested in the race coefficients
race_coefficients <- c("suspect_raceblack", "suspect_racehispanic")

# For example, we could inspect the relevant coefficient of the example above with:
print(coef(glm(frisked ~ suspect_race + is_cpw, 
               data = stop_df, family = binomial))[race_coefficients])

# WRITE CODE HERE
# START solution
# Example models including multiple variables
print(coef(glm(frisked ~ suspect_race + suspected_crime + location_housing, 
               data = stop_df, family = binomial))[race_coefficients])

print(coef(glm(frisked ~ suspect_race + 
               suspected_crime + 
               location_housing + 
               precinct + 
               suspect_sex +
               suspect_age +
               stop_reason_object +
               stop_reason_furtive, 
               data = stop_df, family = binomial))[race_coefficients])
# END solution

   suspect_raceblack suspect_racehispanic 
               0.404                0.462 
   suspect_raceblack suspect_racehispanic 
               0.240                0.324 
   suspect_raceblack suspect_racehispanic 
               0.207                0.149 


## Included variable bias

One common method for measuring disparities while addressing some of the omitted variable bias concerns is to include _all_ recorded data that would have been available to the officer at the time of making the decision (to frisk an individual). This is also known as the "kitchen sink" approach.

### Exercise 4: The kitchen sink approach

For convenience, we have created a formula that includes all the variables that an officer would have had available when making the frisk decision.

* **Step 1**: Using the provided `kitchen_sink_formula`, apply the kitchen sink approach to measure the disparate impact of 
frisk on minority race groups.

In [13]:
feats <- c(
    "suspected_crime",
    "precinct",
    "location_housing",
    "suspect_sex",
    "suspect_age",
    "suspect_height",
    "suspect_weight",
    "suspect_hair",
    "suspect_eye",
    "suspect_build",
    "additional_report",
    "additional_investigation",
    "additional_proximity",
    "additional_evasive",
    "additional_associating",
    "additional_direction",
    "additional_highcrime",
    "additional_time",
    "additional_sights",
    "additional_other",
    "stop_reason_object",
    "stop_reason_desc",
    "stop_reason_casing",
    "stop_reason_lookout",
    "stop_reason_clothing",
    "stop_reason_drugs",
    "stop_reason_furtive",
    "stop_reason_violent",
    "stop_reason_bulge",
    "stop_reason_other",
    "suspect_race"
)

# This creates a formula with a specified left-hand side (response = "frisked"),
# and using all the variables in feats on the right-hand side. 
# Constructing a formula in this way (instead of typing out all the variable names)
# is helpful for constructing multiple models that share a long list of variables in the right-hand side.
kitchen_sink_formula <- reformulate(feats, response = "frisked")

# WRITE CODE HERE
# START solution
# We are only interested in the race coefficients
ks_model <- glm(kitchen_sink_formula, stop_df, family = binomial)
print(coef(ks_model)[race_coefficients])
# END solution

   suspect_raceblack suspect_racehispanic 
               0.191                0.173 


* **Step 2**: Note how the kitchen sink model reduces the coefficients on race (from 0.740 and 0.633 in the base model to 0.191 and 0.173), suggesting much less disparate impact than the base model.
Now carefully consider each variable that is included in `feats`. Are all of these variables justified? Which would you argue should or should _not_ be included? Why?

_Tip_: you can fit new models with different sets of features by commenting out (adding a `# ` to the beginning of) lines that define the `feats` vector and re-running the cell.
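
As an alternative to commenting out lines, a formula can be rebuilt or edited programmatically. The sketch below uses toy variable names (`y`, `a`, `b`, `c`) that are purely illustrative; the same pattern should carry over to `feats` and `kitchen_sink_formula`.

```r
# Toy variable names, purely for illustration
toy_feats <- c("a", "b", "c")

# reformulate() joins the terms with "+" and attaches the response
toy_formula <- reformulate(toy_feats, response = "y")
print(toy_formula)

# Dropping a variable: rebuild from a shorter vector ...
without_b <- reformulate(setdiff(toy_feats, "b"), response = "y")
print(without_b)

# ... or edit an existing formula with update()
also_without_b <- update(toy_formula, . ~ . - b)
print(also_without_b)
```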

The problem with including variables when measuring disparate impact is that the correlation between a feature and race is itself not necessarily justified.
An obvious example would be something like "skin color": including skin color in the regression will likely account for observed disparities across race groups,
but the correlation between skin color and race is unlikely to be justified!
A less obvious example is an officer's suspicion of `cpw`.
While it seems reasonable that an officer would more frequently frisk individuals suspected of possessing a weapon,
the suspicion itself would only be justified if, and to the degree that, it is predictive of achieving the goal of a frisk: recovering weapons.
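
The point about justification can be sketched on simulated data (not the stop-and-frisk sample). Every name, rate, and effect size below is an assumption made for illustration: suspicion is recorded more often for one group but is unrelated to actually carrying a weapon, and the decision depends only on that suspicion.

```r
# Simulated (made-up) data, not the stop-and-frisk sample
set.seed(7)
n <- 20000
minority <- rbinom(n, 1, 0.5)

# Weapons are equally rare in both groups ...
has_weapon <- rbinom(n, 1, 0.05)

# ... but "suspicion" is recorded far more often for one group,
# and is unrelated to actually carrying a weapon
suspicion <- rbinom(n, 1, ifelse(minority == 1, 0.40, 0.10))

# The decision depends only on the recorded suspicion
decision <- rbinom(n, 1, plogis(-0.5 + 2 * suspicion))

base_fit     <- glm(decision ~ minority, family = binomial)
adjusted_fit <- glm(decision ~ minority + suspicion, family = binomial)
justify_fit  <- glm(has_weapon ~ suspicion, family = binomial)

print(c(
    race_coef_base      = coef(base_fit)[["minority"]],
    race_coef_adjusted  = coef(adjusted_fit)[["minority"]],
    suspicion_on_weapon = coef(justify_fit)[["suspicion"]]
))
```

In this setup the race coefficient is clearly positive in the base model and close to zero once `suspicion` is included, even though `suspicion` does not predict finding a weapon. Adjusting for it hides a disparity that the simulated data contain by construction.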

Blindly including a variable in the regression for treatment fails to take into account this _degree_ of justification, 
often overcompensating for variables that are correlated with race.
This is the problem known as _included variable bias_. Next, we will learn one way of dealing with this included variable bias.