# Law, Bias, and Algorithms
## Narrow Tailoring and Disparate Impact in Law School Admissions
This notebook provides an example of how algorithms can be used to classify individuals. Specifically, it considers how top tier law schools select admits. 

This notebook uses the example of law school admissions and considers whether top law schools would be able to achieve current levels of diversity without directly using race in admissions. 

This notebook first explores a data set on law school admissions. It then reverse engineers a possible model for law school admissions and asks you to consider alternative models. Finally, the notebook considers whether predicted bar passage could serve as a better metric for law school admissions. Throughout, the notebook asks you to consider how you could implement an admissions algorithm while maintaining diversity in top law schools. 

In [1]:
# Some initial setup
options(digits = 3)
library(tidyverse)
library(gbm)
library(zoo)
theme_set(theme_bw())

# Read the data
data <- read_csv("../data/bar_passage_data.csv", col_types = cols())

“replacing previous import by ‘tibble::tibble’ when loading ‘broom’”
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
“package ‘readr’ was built under R version 3.2.5”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loaded gbm 2.1.4
“package ‘zoo’ was built under R version 3.2.5”
Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric



Each row in the data corresponds to a law school admit. The dataset contains the following variables:

* An ID number:
    * `ID`
    
    
* Base demographic information about the applicant:
    * `MINORITY`, `MALE`
    * `MINORITY` is encoded as follows:
        * 0: Non-Hispanic White
        * 1: Asian, Black, Hispanic, American Indian, Alaskan Native, or Other
    * `MALE` is coded as 1 for male applicants and 0 for female applicants
        
        
* Outcome of interest, Bar Passage:
    * `PASS_BAR`, `BAR`
    * `PASS_BAR` is an indicator variable and is encoded as 0 regardless of why the student did not pass the exam.  They may have dropped out of law school, never taken the bar, or failed the exam. `PASS_BAR` is encoded as 1 if the student eventually passes the bar. 
    * `BAR` provides more detail about bar results and test history
    
 
* Academic Indicators:
    * `UGPA` (undergraduate GPA), `LSAT` (LSAT score, scaled to be between 10 and 50)
    
    
* Tier of Law School Attended:
    * `TOP_TIER` is an indicator variable for whether an applicant ultimately attends a top tier school
    * Note that students who attend historically black colleges and universities were removed as those schools are outliers in law school admissions.


* Family Income Quintile:
    * `FAM_INC` provides the family income quintile
    * `FAM_INC_1`, `FAM_INC_2`, `FAM_INC_3`, `FAM_INC_4`, `FAM_INC_5` are indicator variables for the income quintile

Law school admits whose entries had missing data have been removed.

### Exploratory Data Analysis
#### Exercise 1: Initial Data Exploration
Create two tables showing 
* one, mean LSAT and undergraduate GPA by minority status, and
* two, the total number of law school admits, the number of minority admits, and the percentage of law school admits who are minorities

In [16]:
#WRITE CODE HERE

#BEGIN SOLUTION
data %>% 
    group_by(MINORITY) %>%
    summarize(mean(LSAT),mean(UGPA))

percent_applicant_minority <- (nrow(filter(data, MINORITY == 1)) / nrow(data))
population <- data_frame(A = "Full Applicant Population", B = nrow(data), C = sum(data$MINORITY == 1), D = round(percent_applicant_minority,3))
colnames(population) <- c("","Total Law School Admits", "Minority students admitted","Percent of population minority")
population
#END SOLUTION

MINORITY,mean(LSAT),mean(UGPA)
0,37.2,3.25
1,32.0,3.01


Unnamed: 0,Total Law School Admits,Minority students admitted,Percent of population minority
Full Applicant Population,26509,3345,0.126


The majority-minority test gap has been the subject of extensive scientific inquiry. Potential causes may include differences in school resources, poverty, family structure, environment, and discrimination.

#### Exercise 2: Current Demographic Composition and Disparities
Create a table showing the proportion of students who are minorities at top tier and non-top tier law schools.

In [17]:
#WRITE CODE HERE

#BEGIN SOLUTION
data %>% 
    group_by(TOP_TIER) %>%
    summarize(mean(MINORITY))
#END SOLUTION

TOP_TIER,mean(MINORITY)
0,0.118
1,0.149


### Reverse Engineering Current Admissions

We will now reverse engineer an estimated weighting for how law schools make their admissions decisions. We will do this by creating a model of top tier law school attendance based on LSAT, undergraduate GPA, minority status, and low-income status (`FAM_INC_1`, the lowest family income quintile). 

In R, statistical models are specified with a formula object that follows a common syntax. The formula is written as "dependent variable" ~ "independent variables", with the independent variables separated by the "+" symbol. 
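
To make the formula syntax concrete, the following minimal sketch builds the same formula used in the next cell and inspects it with base R functions. It does not need the data set, because a formula only records variable names until it is passed to a modeling function. The object name `f` is introduced here purely for illustration.

```r
# A formula is an ordinary R object that can be stored and inspected
f <- TOP_TIER ~ LSAT + UGPA + MINORITY + FAM_INC_1

class(f)      # the class of the object is "formula"
all.vars(f)   # names of every variable referenced in the formula
f[[2]]        # left-hand side: the dependent variable
f[[3]]        # right-hand side: the independent variables joined by `+`
```

Storing the formula in a variable can be convenient when the same specification is passed to several modeling functions.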

In [12]:
lm_admit <- lm(TOP_TIER ~ LSAT + UGPA + MINORITY + FAM_INC_1, data = data)
summary(lm_admit)


Call:
lm(formula = TOP_TIER ~ LSAT + UGPA + MINORITY + FAM_INC_1, data = data)

Residuals:
   Min     1Q Median     3Q    Max 
-0.792 -0.288 -0.157  0.371  1.377 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.257021   0.024141  -52.07   <2e-16 ***
LSAT         0.025529   0.000481   53.10   <2e-16 ***
UGPA         0.171431   0.006219   27.57   <2e-16 ***
MINORITY     0.223587   0.007981   28.02   <2e-16 ***
FAM_INC_1    0.069476   0.016304    4.26    2e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.406 on 26504 degrees of freedom
Multiple R-squared:  0.144,	Adjusted R-squared:  0.144 
F-statistic: 1.12e+03 on 4 and 26504 DF,  p-value: <2e-16


#### Exercise 3: 
Discuss the meaning of this model. What does it say about how law schools are admitting students? How accurate do you think it is? Do you suspect that it is misrepresenting or simplifying the law school admissions process?
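
One possible starting point for this discussion is to express each coefficient in a common unit. The sketch below copies the estimates from the `summary(lm_admit)` output above and divides each by the LSAT coefficient. Reading the result as "equivalent LSAT points" is an illustrative assumption that leans entirely on the linear form of the model; it is not a claim about how any school actually weighs applicants. The names `coefs` and `lsat_equivalents` are hypothetical.

```r
# Estimates copied from the summary(lm_admit) output above
coefs <- c(LSAT = 0.025529, UGPA = 0.171431,
           MINORITY = 0.223587, FAM_INC_1 = 0.069476)

# Express each coefficient as a multiple of the LSAT coefficient, i.e. how many
# LSAT points shift the fitted value by the same amount as a one-unit change
lsat_equivalents <- coefs / coefs["LSAT"]
round(lsat_equivalents, 2)
```

Under this reading, the fitted association with minority status is roughly the size of a 9 point difference on the LSAT scale used in this data set. The low R-squared (0.144) and the residuals above 1 in the output are reminders that this linear model explains only a small share of the variation in top tier attendance.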

### Simulating Law School Admissions

You will now create an algorithm that simulates law school admissions based on weights of your choosing. You will create a function that ranks all incoming law students based on their attributes and then admits the top $n$ students to top tier schools, where $n$ is the number of students actually admitted by top tier schools. 

#### Exercise 4: Exploring Alternative Admissions Functions

Create a function that simulates an admissions cycle by giving weights to LSAT, undergraduate GPA, minority status, and low-income status. Are you able to create admissions criteria that match the diversity of the applicant pool? Are you able to do so without explicitly using race? Recall that _Gratz_ declared that using race in a points-based way as part of college admissions is unconstitutional. 

In [None]:
#WRITE CODE HERE

#BEGIN SOLUTION

# EXERCISE 4

#END SOLUTION
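
The ranking procedure described above can be sketched as follows. This is one possible shape for such a function, not the notebook's solution. To stay self-contained it runs on a small synthetic data frame whose values are invented (only the column names match the real data set), and the helper name `simulate_admissions` and the weights are hypothetical. It uses base R rather than the tidyverse.

```r
set.seed(1)

# Synthetic stand-in for the notebook's `data` (assumption: invented values,
# only the column names match the real data set)
n <- 1000
applicants <- data.frame(
  LSAT      = round(runif(n, 10, 50)),
  UGPA      = round(runif(n, 2, 4), 2),
  MINORITY  = rbinom(n, 1, 0.13),
  FAM_INC_1 = rbinom(n, 1, 0.05),
  TOP_TIER  = rbinom(n, 1, 0.25)
)

# Hypothetical helper: score every student with a weighted sum and admit the
# n_admit highest scorers, where n_admit is the observed top tier count
simulate_admissions <- function(df, w_lsat, w_ugpa,
                                w_minority = 0, w_low_income = 0) {
  score <- w_lsat * df$LSAT + w_ugpa * df$UGPA +
    w_minority * df$MINORITY + w_low_income * df$FAM_INC_1
  n_admit <- sum(df$TOP_TIER == 1)
  # ties.method = "first" breaks ties by row order so exactly n_admit are admitted
  admitted <- rank(-score, ties.method = "first") <= n_admit
  data.frame(
    n_admitted       = sum(admitted),
    minority_admits  = sum(df$MINORITY[admitted] == 1),
    percent_minority = round(mean(df$MINORITY[admitted]), 3)
  )
}

# Weights that ignore minority status versus weights that include it
neutral   <- simulate_admissions(applicants, w_lsat = 0.03, w_ugpa = 0.17)
with_race <- simulate_admissions(applicants, w_lsat = 0.03, w_ugpa = 0.17,
                                 w_minority = 0.22)
rbind(neutral, with_race)
```

Because the synthetic scores are drawn independently of minority status, the numbers printed here say nothing about real admissions. Applying the same function to the notebook's `data` would show how the observed LSAT and GPA gaps interact with each choice of weights.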

### Using Predicted Bar Passage as a Selection Criterion

As an additional exercise, we can consider what law school admissions would look like if law schools selected students to optimize bar passage rates. This approach might be motivated from two perspectives. First, an outcome-based algorithm might allow schools to lessen the weight on LSAT scores, given the critiques of standardized tests as favoring affluent non-minority groups, and hence constitute a "workable race-neutral alternative." Second, more crudely, one of the major inputs into the U.S. News & World Report law school rankings is bar passage. Schools might want to admit a class that increases their bar passage rates, or U.S. News might increase the weight of bar passage in its rankings. Our goal here is to examine whether the adoption of such a policy is a workable alternative and whether it might have disparate impact.

#### Exercise 5:

We would now like to simulate admissions in which schools admit the individual students who are predicted to pass the bar exam. Create a model to predict bar passage and then use this model to simulate an admissions cycle where the students predicted to be the most likely to pass the bar are admitted into the highest tier law schools. We will create our predictive model using simple linear regression. (Because we have a binary outcome, technically a logistic regression would be more appropriate here, but we are using a simple linear model for ease of interpretation.) 

In [24]:
set.seed(12346)

'%!in%' <- function(x,y){
    !('%in%'(x,y))
}

count <- nrow(data)
train_index <- sample(1:count, count*.8, replace = FALSE)
train <- data[data$ID %in% train_index,]
test <- data[data$ID %!in% train_index,]
# WRITE CODE HERE

# START solution
#Simulate law school admissions based on predicted bar passage

train.conditioned <- subset(train, train$TOP_TIER == 1)

lm_pass_bar <- lm(PASS_BAR ~ LSAT + UGPA + MALE + MINORITY + 
             FAM_INC_2 + FAM_INC_3 + FAM_INC_4 + FAM_INC_5, data = train.conditioned)
summary(lm_pass_bar)

test$pred <- predict(lm_pass_bar, test)

select_count <- nrow(subset(test, TOP_TIER == 1))
selected <- top_n(test, select_count, pred)  
selected_minority <- subset(selected, MINORITY == 1)
# Count minority students among the simulated admits and their share of all admits
minority_count <- nrow(selected_minority)
percent_minority <- nrow(selected_minority) / nrow(selected)
pass_bar_results <- data_frame(A = "Linear Model for Bar Passage", B = minority_count, C = round(percent_minority,3))
colnames(pass_bar_results) <- c("Model","Minority students admitted to top tier school","Percent of admits who are minority")
pass_bar_results
#END Solution


Call:
lm(formula = PASS_BAR ~ LSAT + UGPA + MALE + MINORITY + FAM_INC_2 + 
    FAM_INC_3 + FAM_INC_4 + FAM_INC_5, data = train.conditioned)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.9803  0.0718  0.1258  0.1807  0.4479 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.28986    0.05945    4.88  1.1e-06 ***
LSAT         0.00976    0.00104    9.40  < 2e-16 ***
UGPA         0.03266    0.01360    2.40   0.0164 *  
MALE        -0.02181    0.00994   -2.20   0.0282 *  
MINORITY    -0.01269    0.01442   -0.88   0.3790    
FAM_INC_2    0.02570    0.03513    0.73   0.4645    
FAM_INC_3    0.05491    0.03270    1.68   0.0932 .  
FAM_INC_4    0.09772    0.03255    3.00   0.0027 ** 
FAM_INC_5    0.08879    0.03503    2.53   0.0113 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.354 on 5264 degrees of freedom
Multiple R-squared:  0.0367,	Adjusted R-squared:  0.0353 
F-statistic: 25.1 on 8 and 5264 DF, 

Model,Minority students admitted to top tier school,Percent of admits who are minority
Linear Model for Bar Passage,72,0.045
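
The exercise text notes that a logistic regression would technically be more appropriate for a binary outcome. The sketch below shows what that variant could look like using base R's `glm()`. It runs on synthetic data with an invented relationship between the predictors and bar passage, so its coefficients are meaningless; only the column names mirror the notebook. It also splits by row position, which is an assumption of this sketch and does not depend on the values stored in an ID column. The names `sim`, `train_sim`, `test_sim`, and `glm_pass_bar` are hypothetical.

```r
set.seed(2)

# Synthetic stand-in (assumption: invented data, column names mirror the notebook)
n <- 2000
sim <- data.frame(
  LSAT = round(runif(n, 10, 50)),
  UGPA = round(runif(n, 2, 4), 2)
)
# Invented relationship between the predictors and bar passage
sim$PASS_BAR <- rbinom(n, 1, plogis(-2 + 0.08 * sim$LSAT + 0.5 * sim$UGPA))

# Split by row position into 80% train and 20% test
train_rows <- sample(seq_len(n), size = floor(0.8 * n))
train_sim  <- sim[train_rows, ]
test_sim   <- sim[-train_rows, ]

# Logistic regression: family = binomial models the log-odds of passing
glm_pass_bar <- glm(PASS_BAR ~ LSAT + UGPA, data = train_sim, family = binomial)

# type = "response" returns predicted probabilities rather than log-odds
test_sim$pred <- predict(glm_pass_bar, newdata = test_sim, type = "response")
range(test_sim$pred)
```

Unlike the fitted values of a linear model, these predictions are probabilities between 0 and 1. Since admissions are simulated by ranking students on the prediction, the two models may still select similar classes when they order students similarly.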


Suppose an admissions office came to you with a proposal to use this model to determine which students the school would admit. How would you evaluate the model, and what would you recommend to the admissions office? If this model were used, would applicants who are rejected from top tier schools have a valid disparate impact claim?
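
One quantity that may help frame the evaluation is the admission rate within each group under a given selection rule, along with the ratio of those rates. The sketch below computes both on synthetic data in which the score is random and unrelated to group membership, so the printed numbers are placeholders rather than findings. The names `pool`, `rates`, and `impact_ratio` are hypothetical, and what ratio would matter legally is a question for the discussion, not something this code settles.

```r
set.seed(3)

# Synthetic stand-in (assumption: invented data, `pred` is a random score)
n <- 1000
pool <- data.frame(
  MINORITY = rbinom(n, 1, 0.13),
  pred     = runif(n)
)

# Admit the 250 students with the highest predicted score
pool$admitted <- rank(-pool$pred, ties.method = "first") <= 250

# Admission rate within each group, named "0" (non-minority) and "1" (minority)
rates <- tapply(pool$admitted, pool$MINORITY, mean)
rates

# Ratio of the minority admission rate to the non-minority admission rate
impact_ratio <- rates[["1"]] / rates[["0"]]
impact_ratio
```

Running the same calculation on the `selected` students from the cell above, relative to the full `test` set, would give the corresponding rates for the bar passage model.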

#### Discussion Questions

One way to characterize the use of bar passage information is as an attempt to reduce the importance of the LSAT in determining law school admissions. Does using bar passage data fulfill the goal of reducing emphasis on the LSAT?

Consider some of the potential problems with this data set. What factors are not represented in the data that might be relevant for predicting outcomes on the bar exam? For success as an attorney? Are there any concerns about state bar passage as an outcome measure? What factors might drive the differences between the different models?  

How well do these models mimic the actual admissions process? How does the performance of actual admissions officers compare to the models we have here? To the extent there are differences in outcomes, what factors might drive those differences? 

Are there important differences between the populations of interest that may influence the model in undesirable ways? Consider whether minority students are more likely to practice in jurisdictions with lower bar passage rates (e.g., NY or CA). Consider also whether stereotype threat or implicit bias might explain differences in academic or bar passage performance between white and minority students, and what implications that has for the approach you've studied above.