# Law, Bias, and Algorithms
## Prediction and Disparate Impact in Law School Admissions
This notebook will provide an example of how machine learning algorithms can be used to select individual cases out of a pool of candidates based on predicted outcomes. 

The notebook uses the example of law school admissions and considers a scenario in which law schools attempt to maximize the bar passage rates of the students they admit. After all, the U.S. News & World Report law school rankings consider bar passage rate and could plausibly increase the weight given to that factor. If they did so, law schools would have a strong incentive to maximize their bar passage rates in order to improve their rankings.

This notebook proceeds by constructing a gradient boosted decision tree model and then applying the model to simulate the admission of law school applicants. It then considers whether outcomes differ between the actual admissions results and the bar-passage-optimized admissions results, and whether any differences can be deemed algorithmic bias. It also considers some alternative ways to structure an algorithmic model for law school admissions and compares the results of those approaches.

In [None]:
# Some initial setup
options(digits = 3)
library(tidyverse)
library(gbm)
theme_set(theme_bw())

# Read the data
data <- read_csv("./bar_passage_data.csv", col_types = cols())

Data entries that lacked information about the tier of law school attended or information about eventual bar result have been removed.

The dataset contains the following variables:

* An ID number:
    * `ID`
    
    
* Base demographic information about the applicant:
    * `GENDER`, `MALE`, `RACE`, `ASIAN`, `BLACK`, `OTHER` (other race, American Indian, or Alaskan Native), `HISP`
    * `RACE` is encoded as follows:
        * 1: American Indian or Alaskan Native
        * 2: Asian, Pacific Islander, Pacific American
        * 3: Black
        * 4: Mexican American
        * 5: Puerto Rican
        * 6: Other Hispanic
        * 7: White, non-Hispanic
        * 8: Other


* Outcome of interest, Bar Passage:
    * `PASS_BAR`, `BAR`
    * `PASS_BAR` is an indicator variable and is encoded as 0 regardless of why the student did not pass the exam.  They may have dropped out of law school, never taken the bar, or failed the exam. `PASS_BAR` is encoded as 1 if the student eventually passes the bar. 
    * `BAR` provides more detail about bar results and test history
    
    
* Academic Indicators
    * `UGPA` (undergraduate GPA), `LSAT` (LSAT score, scaled to be between 10 and 50)
    
    
* Tier of Law School Attended, on a range from 2-6 with higher tiers being more selective
    * `TIER`
    * Note that in the original data tier 1 covers students attending historically black colleges and universities. All individuals who attend tier 1 schools were removed as those schools are outliers in law school admissions relative to the other tiers.


* Family Income Quintile:
    * `FAM_INC_1`, `FAM_INC_2`, `FAM_INC_3`, `FAM_INC_4`, `FAM_INC_5` are indicator variables
    * `FAM_INC` provides the quintile
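
As a quick sanity check on the encoding described above, the following sketch derives race indicators from `RACE` codes on a small toy data frame (hypothetical values, not the real data). It assumes that `HISP` covers codes 4 through 6 and that `OTHER` covers codes 1 and 8; the `WHITE` column is introduced only for illustration and is not part of the dataset.

```r
# Toy data frame with one row per RACE code (hypothetical, NOT the real data)
toy <- data.frame(RACE = 1:8)

# Derive indicators from the RACE codes listed above.
# Assumption: HISP covers codes 4-6 and OTHER covers codes 1 and 8.
toy$ASIAN <- as.integer(toy$RACE == 2)
toy$BLACK <- as.integer(toy$RACE == 3)
toy$HISP  <- as.integer(toy$RACE %in% 4:6)
toy$OTHER <- as.integer(toy$RACE %in% c(1, 8))
toy$WHITE <- as.integer(toy$RACE == 7)  # illustrative column only

# Every row should fall into exactly one group
group_total <- rowSums(toy[, c("ASIAN", "BLACK", "HISP", "OTHER", "WHITE")])
print(toy)
print(all(group_total == 1))
```

Running the same kind of check against the real `data` (for example with `table(data$RACE, data$HISP)`) would confirm whether the indicator columns follow this mapping.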

### Exploratory Data Analysis

#### Exercise 1: Initial Data Exploration
Create a series of charts that illustrate the underlying data:
* a graph of law school tier by UGPA
* a graph of law school tier by LSAT
* a graph showing the distribution of LSAT and UGPA for all applicants
* a graph showing the distribution of LSAT and UGPA for minority applicants

You may also consider creating histograms of key variables of interest or graphically representing bar passage rates by law school tier, UGPA, or LSAT. 


In [None]:
#WRITE CODE HERE

#START Solution
#Law school tier by LSAT
TIER_LSAT <- ggplot(data=data, aes(x=TIER, y=LSAT)) + 
    geom_jitter(aes(color=factor(RACE)), alpha=.6, width=.3) +
    scale_color_discrete(name = "Race") +
    xlab("Law School Tier") +
    ylab("LSAT") +
    ggtitle("Law School Tier by LSAT")
TIER_LSAT

#Law school tier by UGPA
TIER_UGPA <- ggplot(data=data, aes(x=TIER, y=UGPA)) + 
    geom_jitter(aes(color=factor(RACE)), alpha=.6, width=.3) +
    scale_color_discrete(name = "Race") +
    xlab("Law School Tier") +
    ylab("Undergraduate GPA")+
    ggtitle("Law School Tier by Undergraduate GPA")
TIER_UGPA

#LSAT and UGPA distribution for all applicants
LSAT_UGPA <- ggplot(data=data, aes(x=LSAT, y=UGPA)) + 
    geom_jitter(aes(color=factor(RACE)), alpha=.3, width=.1) +
    scale_color_discrete(name = "Race") +
    xlab("LSAT") +
    ylab("Undergraduate GPA")+
    ggtitle("Distribution of LSAT and Undergraduate GPA of Law School Attendees")
LSAT_UGPA

#LSAT and UGPA distribution for minority applicants
LSAT_UGPA_minority <- ggplot(data=subset(data, RACE!=7), aes(x=LSAT, y=UGPA)) + 
    geom_jitter(aes(color=factor(RACE)), alpha=.3, width=.1) +
    scale_color_discrete(name = "Race") +
    xlab("LSAT") +
    ylab("Undergraduate GPA") +
    ggtitle("Distribution of LSAT and Undergraduate GPA of Minority Law School Attendees")
LSAT_UGPA_minority
#END Solution

#### Exercise 2: Understanding the Predictors of LSAT and UGPA 

To understand the data and the potential risks of bias, it can be helpful to look at the underlying correlations between important neutral variables (here LSAT and UGPA) and suspect class status (here race). Fit two linear regression models, one with LSAT as the outcome and one with UGPA as the outcome, using the demographic and family income variables in the dataset as predictors.

In [None]:
# WRITE CODE HERE

# START solution
#Create linear regression model for UGPA
gpa_m <- lm(UGPA ~ MALE + ASIAN + BLACK + OTHER + HISP + 
                   FAM_INC_2 + FAM_INC_3 + FAM_INC_4 + FAM_INC_5, data=data)
#Create linear regression model for LSAT
lsat_m <- lm(LSAT ~ MALE + ASIAN + BLACK + OTHER + HISP + 
                   FAM_INC_2 + FAM_INC_3 + FAM_INC_4 + FAM_INC_5, data=data)        
summary(gpa_m)
summary(lsat_m)
#END solution

### Predicting Bar Passage

We would like to predict whether individual law students will pass the bar exam. We will then use this model to simulate an admissions cycle in which the students predicted to be most likely to pass the bar are admitted to the highest-tier law schools. We will create our predictive model using a gradient boosted decision tree.

In [None]:
set.seed(12346)

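#Randomly assign 80% of the rows to the training set and the rest to the test set.
#Rows are split by position rather than by ID, because IDs need not run from 1 to nrow(data)
#once incomplete entries have been removed.
count <- nrow(data)
train_index <- sample(seq_len(count), floor(count * .8), replace = FALSE)
train <- data[train_index, ]
test <- data[-train_index, ]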

train.conditioned <- subset(train, TIER >= 5)

gbm_train <- gbm(PASS_BAR ~ LSAT + UGPA + MALE + ASIAN + BLACK + OTHER + HISP + 
             FAM_INC_2 + FAM_INC_3 + FAM_INC_4 + FAM_INC_5, data=train.conditioned,
             distribution = "gaussian", n.trees = 2000,
             shrinkage = 0.005, interaction.depth = 5, cv.folds=10)
print(gbm_train)

#Provide summary information about gradient boosted tree model
summary(gbm_train)
pretty.gbm.tree(gbm_train, i.tree = 1)
plot(gbm_train, i.var = c("LSAT","UGPA"))

### Simulating Law School Admissions

We will now use this model to simulate law school admissions on the test set. In our simulated admissions cycle, the tier 5 and 6 law schools accept the same number ($n$) of students as they did in the data, but they admit the $n$ students who are most likely to pass the bar without considering any other factors.

#### Exercise 3: Simulate Law School Admissions Based on Predicted Bar Passage
First, use the model to predict the likelihood each individual in the test set will pass the bar. Then select the $n$ students with the highest predicted likelihood of passing the bar to be selected into the tier 5 and 6 schools. These will be the admitted students in the simulation. 

Second, to help understand how the model is working, graph 
* actual bar passage versus predicted bar passage
* the distribution of LSAT and GPA for students accepted to a tier 5 or 6 school and 
* the distribution of LSAT and GPA for students not accepted to a tier 5 or 6 school.  
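
The selection step can be sketched in base R as follows. The scores and applicant names are hypothetical. The sketch admits everyone at or above the $n$-th highest score, which mirrors how `top_n` keeps ties; as a result, more than $n$ applicants can be admitted when scores tie at the cutoff.

```r
# Hypothetical prediction scores for six applicants
scores <- c(a = 0.91, b = 0.85, c = 0.85, d = 0.70, e = 0.62, f = 0.40)
n_seats <- 2

# The cutoff is the n-th highest score; everyone at or above it is admitted
cutoff <- sort(scores, decreasing = TRUE)[[n_seats]]
admitted <- names(scores)[scores >= cutoff]
rejected <- names(scores)[scores < cutoff]

# Ties at the cutoff mean three applicants are admitted for two seats
print(admitted)
print(rejected)
```

If an exact class size matters, one option is to break ties at random or by a secondary criterion before selecting.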


In [None]:
# WRITE CODE HERE

# START solution
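#Simulate law school admissions based on predicted bar passage
#Predict with the number of trees that minimizes the cross-validated error
best_iter <- gbm.perf(gbm_train, method = "cv", plot.it = FALSE)
test$pred <- predict(gbm_train, test, n.trees = best_iter)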

select_count <- nrow(subset(test, TIER >= 5))
reject_count <- nrow(test)-select_count

selected <- top_n(test, select_count, pred)  
rejected <- top_n(test, -reject_count, pred)  

#Graphically present results of simulated admissions
predict_plot <- ggplot(data=test, aes(x=PASS_BAR, y=pred)) + 
    geom_jitter(aes(color=as.factor(RACE)), alpha=.6, width=.3) +
    xlab("True Value (Jittered)") +
    ylab("Prediction Score") +
    ggtitle("Predicted Bar Passage Versus Actual Bar Passage") +
    labs(color="Race")
predict_plot

LSAT_UGPA_selected <- ggplot(data=selected, aes(x=LSAT, y=UGPA)) + 
    geom_jitter(aes(color=factor(RACE)), alpha=.6, height=.03, width=0) +
    scale_color_discrete(name = "Race") +
    xlab("LSAT") +
    ylab("Undergraduate GPA") +
    ggtitle("LSAT and GPA for Students Attending Tier 5 and 6 in Simulation") +
    ylim(c(1.5,4)) + 
    xlim(c(15,50))
LSAT_UGPA_selected

LSAT_UGPA_rejected <- ggplot(data=rejected, aes(x=LSAT, y=UGPA)) + 
    geom_jitter(aes(color=factor(RACE)), alpha=.6, height=.03, width=0) +
    scale_color_discrete(name = "Race") +
    xlab("LSAT") +
    ylab("Undergraduate GPA") +
    ggtitle("LSAT and GPA for Students Not Attending Tier 5 and 6 in Simulation") +
    xlim(c(15,50))
LSAT_UGPA_rejected
#END Solution

### Evaluating Disparate Impact

Having simulated admissions to law school, we now want to check how the number of minority students who would be admitted to tier 5 and 6 law schools if our model were used for admissions compares to the number of minority students who were actually admitted. We can consider our typical measures of bias, but because we have the bar passage data, which is an outcome measure, we can also check the accuracy of our algorithm against that outcome variable. However, it is also worth considering whether there are any problems with using bar passage as a neutral test for the validity of the algorithm.

#### Exercise 4: Comparing Admissions Rates
* Calculate the number of minority students admitted to tier 5 and 6 schools in the actual data and compare that to the number of minority students so admitted in the simulated data. This will illustrate how the simulated results compare to actual law school admissions for the period in question, but doesn't necessarily tell us how to interpret any variation.
* Next, calculate the percent of law school applicants who are minorities. Then calculate what percentage of admitted students are minorities for both the simulated and actual admissions.
* Finally, calculate the actual bar passage rate for admitted white and minority students. If admitted minority students pass the bar at higher rates than admitted white students, this suggests, under Becker's outcome test, that the algorithmic selection process is biased in favor of white students.
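
As a small worked example of the outcome test, the sketch below computes bar passage rates by group among a hypothetical set of admitted students. The data frame and its `group` column are made up for illustration.

```r
# Hypothetical admitted students (not the real data)
admitted <- data.frame(
  group    = c(rep("white", 4), rep("minority", 4)),
  PASS_BAR = c(1, 1, 0, 1, 1, 1, 1, 1)
)

# Bar passage rate among admitted students, by group
pass_rate <- tapply(admitted$PASS_BAR, admitted$group, mean)
print(pass_rate)

# Outcome-test gap: a positive value means admitted minority students
# pass at a higher rate than admitted white students
gap <- pass_rate[["minority"]] - pass_rate[["white"]]
print(gap)
```

In this toy case the gap is positive, which the outcome test would read as admitted minority students having been held to a higher standard. With real data, small samples can make such gaps noisy.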

In [None]:
# WRITE CODE HERE

# START solution
actual_selected <- subset(test, TIER >= 5)
selected_minority <- subset(selected, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1)
selected_white <- subset(selected, RACE == 7)
rejected_minority <- subset(rejected, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1)
actual_minority <- subset(actual_selected, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1)
actual_white <- subset(actual_selected, RACE == 7)

#Comparing simulated admissions to actual admissions
print("Raw numerical differences:")
print("Number of minority students admitted to tier 5 and 6 law schools in simulated admissions")
gbm_minority_count <- nrow(selected_minority)
gbm_minority_count
print("Number of minority students admitted to tier 5 and 6 law schools in actual law school admissions:")
actual_minority_count <- nrow(actual_minority)
actual_minority_count

#Comparing parity of admissions decisions by race
print("Percentage minority compared to total applicant pool:")
print("Percent of applicants who are minorities:")
percent_applicant_minority <- (nrow(selected_minority) + nrow(rejected_minority)) / nrow(test)
percent_applicant_minority
#The denominators below are the admitted classes, so each value is the minority share of admitted students
print("Percent of students admitted to tier 5 and 6 law schools in simulated admissions who are minorities:")
gbm_percent_minorty <- nrow(selected_minority) / nrow(selected)
gbm_percent_minorty
print("Percent of students admitted to tier 5 and 6 law schools in actual law school admissions who are minorities:")
actual_percent_minority <- nrow(actual_minority) / nrow(actual_selected)
actual_percent_minority

#Assessing admissions decision using the outcome test (i.e., comparing bar passage rates)
print("Comparing bar passage rates:")
print("Percent of white students admitted to tier 5 and 6 law schools who pass bar in simulated admissions")
gbm_white_bar_passage <- nrow(subset(selected_white, PASS_BAR == 1)) / nrow(selected_white)
gbm_white_bar_passage
print("Percent of minority students admitted to tier 5 and 6 law schools who pass bar in simulated admissions")
gbm_minority_bar_passage <- nrow(subset(selected_minority, PASS_BAR == 1)) / nrow(selected_minority)
gbm_minority_bar_passage

print("Percent of white students admitted to tier 5 and 6 law schools who pass bar in actual law school admissions:")
actual_white_bar_passage <- nrow(subset(actual_white, PASS_BAR == 1)) / nrow(actual_white)
actual_white_bar_passage
print("Percent of minority students admitted to tier 5 and 6 law schools who pass bar in actual law school admissions:")
actual_minority_bar_passage <- nrow(subset(actual_minority, PASS_BAR == 1)) / nrow(actual_minority)
actual_minority_bar_passage

print("Percent of all white law school applicants who pass bar in actual law school admissions:")
population_white_bar_passage <- nrow(subset(test, RACE == 7 & PASS_BAR == 1)) / nrow(subset(test, RACE == 7))
population_white_bar_passage
print("Percent of minority law school applicants who pass bar in actual law school admissions:")
population_minority_bar_passage <- nrow(subset(subset(test, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1), PASS_BAR == 1)) / nrow(subset(test, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1))
population_minority_bar_passage

#END Solution

Note that the results are similar if one chooses to model and simulate admissions for tier 6 only or for tiers 4, 5, and 6.

### Modeling Bar Passage Without Race or Gender

#### Exercise 5: Creating a Model for Bar Passage Without Race and Gender

One aspect of the above model that we might be concerned about is the fact that it specifically includes suspect class information, namely race and gender. But does leaving out race and gender impact the results of simulated law school admissions or even make the results potentially more biased?

Consider that if race and gender are left out, the model has less information to train on and we would expect it therefore, all else being equal, to be less accurate than a model with that information. And while the model will no longer explicitly categorize based on race and gender, racial and gender differences may still get picked up by the model through other correlated variables. 
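
The proxy effect described above can be illustrated with a small simulation. Everything here is hypothetical: `group`, `score`, and `pass` are simulated stand-ins, and a logistic regression is swapped in for the gradient boosted tree to keep the sketch short. The outcome depends only on `score`, yet a model that never sees `group` still produces different average predictions by group, because `score` is correlated with `group`.

```r
set.seed(1)
n <- 2000

# Simulated group membership that shifts the mean of a "neutral" test score
group <- rbinom(n, 1, 0.3)
score <- rnorm(n, mean = 35 - 5 * group, sd = 4)

# The outcome depends only on score, not directly on group
pass <- rbinom(n, 1, plogis((score - 30) / 3))

# Fit a model that never sees `group`
blind <- glm(pass ~ score, family = binomial)
pred <- predict(blind, type = "response")

# Mean predicted probability still differs by group because score acts as a proxy
mean_pred_by_group <- tapply(pred, group, mean)
print(mean_pred_by_group)
```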

Based on the model above, create a new gradient boosted tree that leaves out the applicants' race and gender. Then simulate law school admissions by selecting the top $n$ students to be admitted to tier 5 and 6 law schools. 

In [None]:
# WRITE CODE HERE

#START Solution
set.seed(12346)
gbm_train_2 <- gbm(PASS_BAR ~ LSAT + UGPA + 
             FAM_INC_2 + FAM_INC_3 + FAM_INC_4 + FAM_INC_5, data=train.conditioned,
             distribution = "gaussian", n.trees = 2000,
             shrinkage = 0.005, interaction.depth = 5, cv.folds=10)


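#Predict with the number of trees that minimizes the cross-validated error
best_iter_2 <- gbm.perf(gbm_train_2, method = "cv", plot.it = FALSE)
test$pred_2 <- predict(gbm_train_2, test, n.trees = best_iter_2)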

selected_2 <- top_n(test, select_count, pred_2)  
rejected_2 <- top_n(test, -reject_count, pred_2)
#END Solution

#### Exercise 6: Assess the racial distribution of students admitted under the new model
Use the same tests as above to assess the racial impact of this new model and whether there is evidence of algorithmic bias.

In [None]:
# WRITE CODE HERE

# START solution
selected_minority_2 <- subset(selected_2, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1)
rejected_minority_2 <- subset(rejected_2, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1)
selected_white_2 <-subset(selected_2, RACE == 7)

#Comparing simulated admissions to actual admissions
print("Raw numerical differences:")
print("Number of minority students admitted to tier 5 and 6 law schools in simulated admissions")
gbm_no_race_gender_minority_count <- nrow(selected_minority_2)
gbm_no_race_gender_minority_count
print("Number of minority students admitted to tier 5 and 6 law schools in actual law school admissions:")
actual_minority_count

#Comparing parity of admissions decisions by race
print("Percentage minority compared to total applicant pool:")
print("Percent of applicants who are minorities:")
percent_applicant_minority
print("Percent of students admitted to tier 5 and 6 law schools in simulated admissions who are minorities:")
gbm_no_race_gender_percent_minority <- nrow(selected_minority_2) / nrow(selected_2)
gbm_no_race_gender_percent_minority
print("Percent of minority students admitted to tier 5 and 6 law schools in actual law school admissions:")
actual_percent_minority

#Assessing admissions decision using the outcome test (i.e., comparing bar passage rates)
print("Comparing bar passage rates:")
print("Percent of white students admitted to tier 5 and 6 law schools who pass bar in simulated admissions")
gbm_no_race_gender_white_bar_passage <- nrow(subset(selected_white_2, PASS_BAR == 1)) / nrow(selected_white_2)
gbm_no_race_gender_white_bar_passage
print("Percent of minority students admitted to tier 5 and 6 law schools who pass bar in simulated admissions")
gbm_no_race_gender_minority_bar_passage <- nrow(subset(selected_minority_2, PASS_BAR == 1)) / nrow(selected_minority_2)
gbm_no_race_gender_minority_bar_passage

print("Percent of white students admitted to tier 5 and 6 law schools who pass bar in actual law school admissions:")
actual_white_bar_passage
print("Percent of minority students admitted to tier 5 and 6 law schools who pass bar in actual law school admissions:")
actual_minority_bar_passage

#END Solution

### Testing an Alternative Admission Process: A Top 10% Plan

Texas adopted a top ten percent admission plan in 1997 for undergraduate admissions in the state. The plan later figured prominently in the high-profile affirmative action lawsuit *Fisher v. University of Texas*, in which the plaintiff challenged the race-conscious holistic review used alongside the plan; the case was the subject of Supreme Court decisions in 2013 and 2016.

A top ten percent admissions plan works by automatically admitting all students with a GPA in the top 10% of their graduating class and then admitting students for the remaining spots based on holistic review. For our purposes, we will simulate this by admitting all students with a GPA within the top 10% of all law school applicants and then using the bar passage model to admit students for the remaining spaces in the class. Notice, however, that this approximation fails to capture how a student's GPA ranks within their own high school, which is central to how the top ten percent plan actually works. That is, under Texas's actual policy, students are compared only to other students within the same school.
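
To make the difference between the two approaches concrete, here is a minimal sketch on hypothetical applicants from two schools with different grading scales. The `apps` data frame and its `school` column are made up; the real dataset has no school identifier, which is why the notebook falls back on the pooled version.

```r
# Hypothetical applicants from two schools with different grading scales
apps <- data.frame(
  id     = 1:20,
  school = rep(c("A", "B"), each = 10),
  gpa    = c(3.1 + 0.1 * 0:9, 2.1 + 0.1 * 0:9)
)

# Pooled plan: top 10% of all applicants by GPA (the approximation used in this notebook)
n_top <- round(nrow(apps) * 0.1)
pooled <- apps$id[order(apps$gpa, decreasing = TRUE)][seq_len(n_top)]

# Within-school plan: rank each applicant inside their own school (1 = highest GPA)
rank_in_school <- ave(-apps$gpa, apps$school, FUN = rank)
size_of_school <- ave(apps$gpa, apps$school, FUN = length)
within <- apps$id[rank_in_school <= ceiling(size_of_school * 0.1)]

print(sort(pooled))  # both pooled admits come from school A
print(within)        # one admit from each school
```

The pooled rule admits only students from the school with the higher grading scale, while the within-school rule admits the top student from each school.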

#### Exercise 7: Simulate a top ten percent plan for law school admissions
Simulate an admissions cycle for tier 5 and 6 law schools assuming a centralized top ten percent plan. First select all applicants with a UGPA in the top ten percent of applicants and then round out the class with students who have the highest predicted bar passage scores.


In [None]:
#WRITE CODE HERE

#BEGIN SOLUTION
ten_percent_count <- round(nrow(test)*.1)
selected_gpa <- top_n(test, ten_percent_count, UGPA)  
unselected_gpa <- test[!(test$ID %in% selected_gpa$ID),]

remainder_select_count <- select_count - nrow(selected_gpa)
selected_remainder <-  top_n(unselected_gpa, remainder_select_count, pred)

selected_topten <- rbind(selected_remainder, selected_gpa)
rejected_topten <- test[!(test$ID %in% selected_topten$ID),]

# Note: this algorithm can slightly overadmit students because top_n keeps
# all tied rows when UGPA or prediction scores are tied at the cutoff

#END Solution

#### Exercise 8: Assess the racial distribution of students admitted under the simulated top ten percent plan

Repeat the analysis for racial impact and algorithmic bias as above but using the top-ten percent plan simulation.

In [None]:
#WRITE CODE HERE

#BEGIN SOLUTION
selected_minority_topten <- subset(selected_topten, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1)
rejected_minority_topten <- subset(rejected_topten, BLACK ==1 | ASIAN == 1 | OTHER == 1 | HISP ==1)
selected_white_topten <- subset(selected_topten, RACE == 7)

#Comparing simulated admissions to actual admissions
print("Raw numerical differences:")
print("Number of minority students admitted to tier 5 and 6 law schools in top-ten percent plan simulated admissions")
topten_minority_count <- nrow(selected_minority_topten)
topten_minority_count
print("Number of minority students admitted to tier 5 and 6 law schools in actual law school admissions:")
actual_minority_count

#Comparing parity of admissions decisions by race
print("Percentage minority compared to total applicant pool:")
print("Percent of applicants who are minorities:")
percent_applicant_minority
print("Percent of students admitted to tier 5 and 6 law schools in top-ten percent plan simulated admissions who are minorities:")
topten_percent_minorty <- nrow(selected_minority_topten) / nrow(selected_topten)
topten_percent_minorty
print("Percent of minority students admitted to tier 5 and 6 law schools in actual law school admissions:")
actual_percent_minority

#Assessing admissions decision using the outcome test (i.e., comparing bar passage rates)
print("Comparing bar passage rates:")
print("Percent of white students admitted to tier 5 and 6 law schools who pass bar in top-ten percent plan simulated admissions")
topten_white_bar_passage <- nrow(subset(selected_white_topten, PASS_BAR == 1)) / nrow(selected_white_topten)
topten_white_bar_passage
print("Percent of minority students admitted to tier 5 and 6 law schools who pass bar in top-ten percent plan simulated admissions")
topten_minority_bar_passage <- nrow(subset(selected_minority_topten, PASS_BAR == 1)) / nrow(selected_minority_topten)
topten_minority_bar_passage

print("Percent of white students admitted to tier 5 and 6 law schools who pass bar in actual law school admissions:")
actual_white_bar_passage
print("Percent of minority students admitted to tier 5 and 6 law schools who pass bar in actual law school admissions:")
actual_minority_bar_passage
#END SOLUTION

### Summarizing Results

You may find it helpful to summarize the differences between the different models for easy comparison.  

#### Exercise 9: Create a Summary Table
Create a table that summarizes the results of each model. For each model (and the full population of applicants, as applicable), include
* The number of minority students admitted
* The percent of admitted students who belong to a minority group
* The bar passage rate for white students
* The bar passage rate for minority students

In [None]:
#WRITE CODE HERE

#BEGIN SOLUTION
#Note this solution uses result variables that have been defined in the code above along the way
model <- c("Full Applicant Population","Actual Admissions","Gradient Boosted Tree", "GBTree Without Race or Gender", "Ten Percent Plan")
minority_students_admitted <- c(NA,actual_minority_count, gbm_minority_count, gbm_no_race_gender_minority_count, topten_minority_count)
percent_minority <- c(percent_applicant_minority, actual_percent_minority, gbm_percent_minorty, gbm_no_race_gender_percent_minority, topten_percent_minorty)
bar_passage_rate_white <- c(population_white_bar_passage, actual_white_bar_passage, gbm_white_bar_passage, gbm_no_race_gender_white_bar_passage, topten_white_bar_passage)
bar_passage_rate_minority <- c(population_minority_bar_passage, actual_minority_bar_passage, gbm_minority_bar_passage, gbm_no_race_gender_minority_bar_passage, topten_minority_bar_passage)

results <- data.frame(model, minority_students_admitted, percent_minority, bar_passage_rate_white, bar_passage_rate_minority)
results
#END SOLUTION

### Further Questions

Consider what some of the potential problems with this data set are. What factors are not represented in the data that might be relevant for predicting outcomes? Are there any concerns about state bar passage as an outcome measure? What factors might drive the differences between the different models? How well does this model mimic the procedure of the actual admissions process? Are there important differences between the populations of interest that may influence the model in undesirable ways? How does the performance of actual admissions officers compare to the models we have here, and to the extent there are differences in outcomes, what factors might drive those differences?