# Law, Bias, and Algorithms
## Included variable bias (2/2)

In [1]:
# Some initial setup
options(digits = 3)
library(tidyverse)

theme_set(theme_bw())

# Read the data
stop_df <- read_rds("../data/sqf_sample.rds")

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.0
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


## Included variable bias

One common method for measuring disparities while addressing some of the omitted variable bias concerns is to include _all_ recorded data that would have been available to the officer at the time of the decision (here, whether to frisk an individual). This is also known as the "kitchen sink" approach.

### Exercise 6: The kitchen sink approach

For convenience, we have created a formula that includes all the variables that an officer would have had available when making the frisk decision.

* **Step 1**: Using the provided `kitchen_sink_formula`, apply the kitchen sink approach to measure the disparate impact of frisk decisions on minority race groups.

In [2]:
race_coefficients <- c("suspect_raceblack", "suspect_racehispanic")

feats <- c(
    "suspected_crime",
    "precinct",
    "location_housing",
    "suspect_sex",
    "suspect_age",
    "suspect_height",
    "suspect_weight",
    "suspect_hair",
    "suspect_eye",
    "suspect_build",
    "additional_report",
    "additional_investigation",
    "additional_proximity",
    "additional_evasive",
    "additional_associating",
    "additional_direction",
    "additional_highcrime",
    "additional_time",
    "additional_sights",
    "additional_other",
    "stop_reason_object",
    "stop_reason_desc",
    "stop_reason_casing",
    "stop_reason_lookout",
    "stop_reason_clothing",
    "stop_reason_drugs",
    "stop_reason_furtive",
    "stop_reason_violent",
    "stop_reason_bulge",
    "stop_reason_other",
    "suspect_race"
)

# This creates a formula with a specified left-hand side (frisked), and using 
# all the variables in feats on the right-hand side. 
# Constructing a formula in this way (instead of typing out all the variable names)
# is helpful for constructing multiple models that share a long list of variables in the right-hand side.
kitchen_sink_formula <- as.formula(paste("frisked ~", paste(feats, collapse = "+")))

# WRITE CODE HERE
# START solution
# We are only interested in the race coefficients
ks_model <- glm(kitchen_sink_formula, stop_df, family = binomial)
print(coef(ks_model)[race_coefficients])
# END solution

   suspect_raceblack suspect_racehispanic 
               0.191                0.173 


* **Step 2**: Note how the kitchen sink model reduces the coefficients on race---suggesting much less disparate impact than the base model.
Now carefully consider each variable that is included in `feats`. Are all of these variables justified? Which would you argue should or should _not_ be included? Why?

_Tip_: you can fit new models with different sets of features by commenting out (adding a `# ` to the beginning of) lines that define the `feats` vector and re-running the cell.
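
As an alternative to commenting out lines, a formula can be built from a subset of features programmatically. The sketch below uses a shortened, hypothetical feature list (`demo_feats`) and an arbitrary choice of features to drop (`drop_feats`); neither is a recommendation about which variables are justified. It relies only on base R's `setdiff()` and `reformulate()`.

```r
# Hypothetical short feature list, standing in for the full `feats` vector
demo_feats <- c("suspected_crime", "precinct", "suspect_hair",
                "suspect_eye", "suspect_race")

# Features excluded for this illustration (an arbitrary pick, not a recommendation)
drop_feats <- c("suspect_hair", "suspect_eye")

# Keep everything in demo_feats that is not in drop_feats
kept_feats <- setdiff(demo_feats, drop_feats)

# reformulate() builds `response ~ term1 + term2 + ...` from character vectors
demo_formula <- reformulate(kept_feats, response = "frisked")
print(demo_formula)
```

The resulting formula can be passed to `glm()` in the same way as `kitchen_sink_formula`.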

The problem with including variables when measuring disparate impact is that the correlation between a feature and race is not necessarily justified.
An obvious example would be something like "skin color": including skin color in the regression will likely account for observed disparities in race,
but the correlation between skin color and race is unlikely to be justified!
A less obvious example would be an officer's suspicion of `cpw`.
While it seems reasonable that an officer would more frequently frisk individuals suspected of possessing a weapon,
the suspicion itself would only be justified if, and to the degree that, it is predictive of achieving the goal of a frisk: recovering weapons.

Blindly including a variable in the regression for treatment fails to take into account this _degree_ of justification, 
often overcompensating for variables that are correlated with race.
This is the problem known as _included variable bias_.
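
A small simulation can make this concrete. The sketch below does not use the stop-and-frisk data: it generates hypothetical data (every name, such as `sim_df`, `group`, `x`, and `z`, is made up for illustration) in which a feature `x` is correlated with group membership and drives the frisk decision, but has no relationship to whether a weapon is found.

```r
set.seed(1)
n <- 20000

sim_df <- data.frame(group = rbinom(n, 1, 0.5))    # hypothetical binary group indicator
sim_df$x <- rnorm(n, mean = sim_df$group)           # correlated with group, unrelated to weapons
sim_df$z <- rnorm(n)                                # independent of group, predictive of weapons
sim_df$weapon  <- rbinom(n, 1, plogis(-2 + sim_df$z))
sim_df$frisked <- rbinom(n, 1, plogis(-0.5 + sim_df$z + sim_df$x))

# Base model: disparity in frisk rates by group
base_sim <- glm(frisked ~ group, data = sim_df, family = binomial)

# "Kitchen sink" model: also controls for x and z
ks_sim <- glm(frisked ~ group + x + z, data = sim_df, family = binomial)

# Outcome model: is x actually predictive of finding a weapon?
outcome_sim <- glm(weapon ~ x + z, data = sim_df, family = binomial)

print(coef(base_sim)["group"])
print(coef(ks_sim)["group"])
print(coef(outcome_sim)["x"])
```

Under these assumptions, the coefficient on `group` is clearly positive in the base model and close to zero once `x` is included, even though `x` carries no information about weapon recovery. The disparity has been explained away by a variable that is not justified.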

## Risk-adjusted regression

As we briefly discussed, controlling for any variable (i.e., including it in the regression) is only justified if, and to the degree that, the variable is _predictive of the outcome we are ultimately interested in_ (in this case, recovering a weapon). But the extent to which each variable is justified is rarely clear.

One simple idea for addressing this concern of included variable bias is to control for an explicit measure of **risk**, instead of controlling for individual variables.
Intuitively, we wish to know whether individuals who have _similar risk_ (of carrying a weapon) were treated (frisked) equally.

### Exercise 7: estimating risk

In order to adjust for risk, we must first estimate it. This is relatively straightforward in the context of frisk decisions in stop-and-frisk,
because the goal of a frisk is clear---we wish to recover weapons (`found_weapon`). 
In other words, we want to predict whether a weapon would be found if an individual is frisked. 
One simple way to achieve this is to build a model with `found_weapon` on the left-hand side, restricting the data to the individuals who were frisked.

* **Step 1**: `filter` the `stop_df` data to those individuals who were frisked. We will call this new data frame `frisk_df`.

In [14]:
# Subset the stop_df data to cases where the individual was frisked
# WRITE CODE HERE
# START solution
frisk_df <- stop_df %>%
    filter(frisked)
# END solution

* **Step 2**: Fit a logistic regression where the left-hand side is whether or not a weapon was found (`found_weapon`) and the right-hand side is all reasonable variables (as listed in `feats` above). Use this model to generate a column of model estimated risk (let's call it `risk`) on the original `stop_df` data. Note that we use logistic regression here for simplicity, but in reality, more complex methods for predictive modeling would be employed, with additional measures to avoid overfitting.

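_Tip_: Given a `glm` model named `risk_model`, a vector of predictions for `stop_df` can be created with the command `predict(risk_model, stop_df)`. By default these predictions are on the logit (log-odds) scale; pass `type = "response"` to get probabilities instead.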

In [15]:
# Using the subset of data from Step 1, fit the logistic regression model:
# found_weapon ~ (all variables in feats)
# WRITE CODE HERE
# START solution
risk_formula <- as.formula(paste("found_weapon ~", paste(feats, collapse = "+")))
risk_model <- glm(risk_formula, data = frisk_df, family = binomial)

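# Add the model-estimated risk to the full stop data
# (predict() on a glm returns the linear predictor, i.e. the logit scale, by default)
stop_df <- stop_df %>%
    mutate(risk = predict(risk_model, .))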
# END solution
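
The `risk` column created above is on the logit scale, because `predict()` on a `glm` defaults to `type = "link"`. The sketch below uses a small hypothetical data set (`toy_df` and `toy_model` are not part of the exercise) to show how the logit and probability scales relate.

```r
# Toy data (hypothetical), only to show the two prediction scales
set.seed(42)
toy_df <- data.frame(x = rnorm(200))
toy_df$y <- rbinom(200, 1, plogis(0.5 * toy_df$x))
toy_model <- glm(y ~ x, data = toy_df, family = binomial)

# Default: type = "link", i.e. log-odds (any real number)
logit_risk <- predict(toy_model, toy_df)

# type = "response": probabilities between 0 and 1
prob_risk <- predict(toy_model, toy_df, type = "response")

# plogis() is the inverse logit, so the two scales agree after transformation
print(all.equal(plogis(logit_risk), prob_risk))
```

Either scale can serve as a risk measure, but the coefficient on `risk` in a later regression is interpreted differently depending on which one is used.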

In [26]:
# TODO(?): Maybe it's worth adding a short section on risk model checking? (e.g., calibration)
# TODO(?): We're being super hand-wavy here about scales (i.e., logit/probability). How much do we want/need to dive into the weeds?

Once we have a good measure of risk, we can easily compute the disparity on race, accounting for risk, by fitting a regression with the two variables: race and risk.

### Exercise 8: Risk-adjusted regression

Fit a logistic regression model to measure disparate impact on race, but only controlling for risk, i.e., with the formula
`frisked ~ suspect_race + risk`

In [21]:
# WRITE CODE HERE
# START solution
glm(frisked ~ suspect_race + risk, stop_df, family = binomial)
# END solution


Call:  glm(formula = frisked ~ suspect_race + risk, family = binomial, 
    data = stop_df)

Coefficients:
         (Intercept)     suspect_raceblack  suspect_racehispanic  
               1.263                 0.807                 0.666  
                risk  
               0.228  

Degrees of Freedom: 99999 Total (i.e. Null);  99996 Residual
Null Deviance:	    120000 
Residual Deviance: 117000 	AIC: 117000
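
To compare the two approaches, it can help to convert the race coefficients into odds ratios. The sketch below copies the coefficients from the printed output of the kitchen sink and risk-adjusted models (the vector names are introduced here for illustration) and exponentiates them. Each odds ratio is relative to whichever level of `suspect_race` serves as the reference group in the fitted models.

```r
# Race coefficients copied from the printed output of the two models above
kitchen_sink_coefs <- c(suspect_raceblack = 0.191, suspect_racehispanic = 0.173)
risk_adjusted_coefs <- c(suspect_raceblack = 0.807, suspect_racehispanic = 0.666)

# Exponentiating a logistic regression coefficient gives an odds ratio
odds_ratios <- round(
    rbind(
        kitchen_sink = exp(kitchen_sink_coefs),
        risk_adjusted = exp(risk_adjusted_coefs)
    ),
    2
)
print(odds_ratios)
# kitchen_sink:  about 1.21 (black) and 1.19 (hispanic)
# risk_adjusted: about 2.24 (black) and 1.95 (hispanic)
```

Read this way, the kitchen sink model suggests roughly 20% higher odds of being frisked, while the risk-adjusted model suggests about double the odds for individuals with similar estimated risk.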