# Law, Order, and Algorithms
## Narrow Tailoring and Disparate Impact in Law School Admissions
In this exercise, we'll examine admissions decisions at top-tier law schools using the dataset from the _LSAC National Longitudinal Bar Passage Study_ ([Wightman and Ramsey, 1998](https://files.eric.ed.gov/fulltext/ED469370.pdf)).
This study presents national longitudinal bar passage data gathered over a 5-year period from the class that started law school in fall 1991.
In our analysis, we will focus on diversity and affirmative action policies. We'll explore a simple method to reverse engineer admissions criteria, and investigate the extent to which race-blind policies can achieve diversity. We'll also consider how diversity would be affected in a hypothetical scenario in which admissions decisions are based on the statistical likelihood of bar passage.

In [1]:
# Some initial setup
options(digits = 3)
library(tidyverse)
theme_set(theme_bw())

# Read the data
bar_data <- read_csv("../data/bar_passage_data.csv", 
                 col_types = cols(MINORITY="l", TOP_TIER="l", MALE="l", PASS_BAR="l")) %>% 
    mutate(FAM_INC = as.factor(FAM_INC))

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.0
✔ tidyr   1.1.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Each row in the data corresponds to a law school admit. The dataset contains the following variables:

* An ID number:
    * `ID`
    
    
* Base demographic information about the applicant:
    * `MINORITY` is encoded as follows:        
        * `False`: Non-Hispanic white
        * `True`: Asian, Black, Hispanic, American Indian, Alaskan Native, or Other
    * `MALE` is coded as `True` for male applicants and `False` for female applicants
        
        
* Outcome of interest, Bar Passage:
    * `PASS_BAR` is an indicator variable. It is encoded as 1 if the student eventually passes the bar, and as 0 otherwise, regardless of why the student did not pass: they may have dropped out of law school, never taken the bar, or failed the exam.
    * `BAR` provides more detail about bar results and test history
    
 
* Academic Indicators:
    * `UGPA` (undergraduate GPA), `LSAT` (LSAT score, scaled to be between 10 and 50)
    
    
* Tier of Law School Attended:
    * `TOP_TIER` is an indicator variable for whether an applicant ultimately attends a top-tier school
    * Note that students who attend historically Black colleges and universities were removed as those schools are outliers in law school admissions.


* Family Income Quintile:
    * `FAM_INC` provides the family income quintile
    * `FAM_INC_1`, `FAM_INC_2`, `FAM_INC_3`, `FAM_INC_4`, `FAM_INC_5` are indicator variables for the income quintile, where `FAM_INC_1` is the lowest-income quintile

Law school admits whose entries had missing data have been removed.

### Exploratory Data Analysis

We start our analysis by exploring class composition and racial disparities.

#### Exercise 1: Demographic Composition and Disparities

1. For both top-tier schools and the full set of schools, compute the total number of law school admits, the number of minority admits, and the percentage of law school admits who are minorities.
1. Compute the average LSAT and undergraduate GPA by minority status.

In [18]:
# WRITE CODE HERE
# START SOLUTION

# 1.
# Demographic composition
bar_data %>%
    group_by(TOP_TIER) %>%
    summarize(
        total_admits = n(),
        minority_admits = sum(MINORITY),
        minority_proportion = mean(MINORITY)
    )

# 2.
# Average LSAT and GPA by group
bar_data %>% 
    group_by(MINORITY) %>%
    summarize(
        mean_LSAT = mean(LSAT),
        mean_UGPA = mean(UGPA)
    )

# END SOLUTION

`summarise()` ungrouping output (override with `.groups` argument)


TOP_TIER,total_admits,minority_admits,minority_proportion
<lgl>,<int>,<int>,<dbl>
False,19627,2322,0.118
True,6882,1023,0.149


`summarise()` ungrouping output (override with `.groups` argument)


MINORITY,mean_LSAT,mean_UGPA
<lgl>,<dbl>,<dbl>
False,37.2,3.25
True,32.0,3.01


We note that the majority-minority test gap has been the subject of extensive scientific inquiry. Potential causes include differences in school resources, poverty, family structure, environment, and discrimination.

### Reverse Engineering Current Admissions

We now attempt to reverse engineer admissions criteria for top-tier law schools. To do so, we make three key assumptions. First, we assume that students in our dataset comprise the full set of students who _applied_ to law school. In reality, our dataset only contains students who ultimately enrolled at a law school. Second, we assume that [students accepted to top-tier law schools](https://abovethelaw.com/2013/03/which-law-schools-had-the-highest-yield-rate/) all decided to enroll at a top-tier school. Finally, we assume that admissions decisions are based on a relatively small set of factors that we have access to: LSAT score, GPA, minority status, and family income. This is a coarse approximation of actual admissions policies, but is instructive nevertheless.

Given these assumptions, we can try to reconstruct admissions policies by fitting a simple logistic regression model that predicts acceptance to a top-tier school based on the available information. 

In R, you can specify statistical models using formulas of the form `outcome variable ~ input variables`, with the input variables separated by the `+` symbol. We'll learn more about these models in the coming weeks, but for now we'll treat them (mostly) as black boxes.
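
As a minimal sketch of the formula syntax, the snippet below fits a logistic regression to a small simulated dataset. The data frame `toy` and its columns `y`, `x1`, and `x2` are invented for illustration and are not part of the LSAC data.

```r
# Toy data, invented purely for illustration (not from the LSAC study)
set.seed(1)
toy <- data.frame(
    x1 = rnorm(200),
    x2 = rnorm(200)
)
# simulate a binary outcome whose log-odds depend on x1 and x2
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.0 * toy$x1 - 0.5 * toy$x2))

# `y ~ x1 + x2` reads: model y as a function of x1 and x2
toy_model <- glm(y ~ x1 + x2, data = toy, family = "binomial")

# one coefficient per input variable, plus an intercept
coef(toy_model)
```

The formula names only columns of the data frame passed via `data`, so the same pattern carries over to any outcome and set of inputs.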

In [16]:
# fit a logistic regression to predict acceptance at a top-tier school
lr_admit <- glm(TOP_TIER ~ LSAT + UGPA + MINORITY + FAM_INC_1 + FAM_INC_2,
                    data = bar_data, family="binomial")

# summarize the model
summary(lr_admit)


Call:
glm(formula = TOP_TIER ~ LSAT + UGPA + MINORITY + FAM_INC_1 + 
    FAM_INC_2, family = "binomial", data = bar_data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.032  -0.766  -0.524   0.797   3.228  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -10.57882    0.17780  -59.50  < 2e-16 ***
LSAT           0.15764    0.00334   47.18  < 2e-16 ***
UGPA           1.04883    0.04036   25.98  < 2e-16 ***
MINORITYTRUE   1.26960    0.04979   25.50  < 2e-16 ***
FAM_INC_1      0.37514    0.10177    3.69  0.00023 ***
FAM_INC_2     -0.03270    0.05140   -0.64  0.52470    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30361  on 26508  degrees of freedom
Residual deviance: 26223  on 26503  degrees of freedom
AIC: 26235

Number of Fisher Scoring iterations: 4


The table above shows the coefficient for each covariate estimated by our logistic regression model. We can think of the coefficients as indicating how much different factors are weighted when making admissions decisions.
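
Logistic regression coefficients are on the log-odds scale, so one way to read them is to exponentiate them into odds ratios: the multiplicative change in the odds of top-tier admission for a one-unit increase in a variable, holding the others fixed. The sketch below does this for the estimates printed above. The vector `coefs` is typed in by hand from that output rather than extracted from `lr_admit`.

```r
# coefficient estimates copied from the summary output above
coefs <- c(LSAT = 0.15764, UGPA = 1.04883, MINORITYTRUE = 1.26960,
           FAM_INC_1 = 0.37514, FAM_INC_2 = -0.03270)

# exponentiating a log-odds coefficient gives an odds ratio
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

Under this model, the estimated odds of top-tier admission for a minority applicant are roughly 3.6 times those of a non-minority applicant with the same LSAT, GPA, and income indicators, and each additional LSAT point multiplies the odds by about 1.17. With the fitted model in memory, `exp(coef(lr_admit))` gives the same quantities.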

#### Exercise 2: 
Discuss the meaning of this model. What does it say about how law schools are admitting students? How accurate do you think it is? In what ways do you think it is misrepresenting or simplifying the law school admissions process?

### Simulating Law School Admissions

#### Exercise 3: Exploring Alternative Admissions Policies

You'll now create an algorithm for admitting students to top-tier schools based on any given weighting of LSAT, GPA, minority status, and low-income status. Once the weights are provided, the algorithm should sort all the applicants and return the subset of $n$ = 6,882 applicants ranked highest, where $n$ is the actual number admitted to the top-tier schools.

Explore various admissions policies. Are you able to create admissions criteria that match the nominal academic quality (as measured by GPA and LSAT scores) and diversity of the set of students actually admitted to top-tier schools? Are you able to do so without explicitly using race? Recall that _Gratz_ held that using race in a points-based way in college admissions is unconstitutional.

In [32]:
# WRITE CODE HERE
admit_n <- sum(bar_data$TOP_TIER)

# weights inferred from the logistic regression above.
# these can be modified to explore alternative policies
LSAT_wt <- 0.16
GPA_wt <- 1
MINORITY_wt <- 1.3
INC1_wt <- 0.38
INC2_wt <- 0

# START SOLUTION

# rank applicants by the given weights, and return the top admit_n
admitted <- bar_data %>% 
    mutate(score = 
               LSAT * LSAT_wt + 
               UGPA * GPA_wt + 
               MINORITY * MINORITY_wt + 
               FAM_INC_1 * INC1_wt + 
               FAM_INC_2 * INC2_wt) %>%
    arrange(desc(score)) %>%
    slice(1:admit_n)

# compute the diversity of the admitted student body
admitted %>%
    summarize(
        minority_p = mean(MINORITY),
        mean_gpa = mean(UGPA),
        mean_lsat = mean(LSAT)
    )

# END SOLUTION

minority_p,mean_gpa,mean_lsat
<dbl>,<dbl>,<dbl>
0.173,3.51,42.4
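
To compare several weightings side by side, one option is to wrap the ranking step in a function. The sketch below is written in base R and runs on a simulated applicant pool, so it can be tried without the LSAC file. The helper `simulate_policy()`, the data frame `pool`, and the particular weights in the race-blind policy are all hypothetical choices made for illustration.

```r
# Toy applicant pool, invented for illustration (not the LSAC data)
set.seed(42)
n_pool <- 1000
pool <- data.frame(
    LSAT = round(runif(n_pool, 10, 50)),
    UGPA = round(runif(n_pool, 2, 4), 2),
    MINORITY = runif(n_pool) < 0.15,
    FAM_INC_1 = runif(n_pool) < 0.10
)

# hypothetical helper: rank applicants by a weighted score, admit the top n,
# and summarize the admitted class
simulate_policy <- function(data, n, lsat_wt, gpa_wt, minority_wt, inc1_wt) {
    score <- data$LSAT * lsat_wt + data$UGPA * gpa_wt +
        data$MINORITY * minority_wt + data$FAM_INC_1 * inc1_wt
    admitted <- data[order(score, decreasing = TRUE)[seq_len(n)], ]
    data.frame(
        minority_p = mean(admitted$MINORITY),
        mean_gpa = mean(admitted$UGPA),
        mean_lsat = mean(admitted$LSAT)
    )
}

# compare a race-conscious policy with a race-blind one that instead
# places more weight on low-income status
policies <- data.frame(
    policy = c("race_conscious", "race_blind"),
    rbind(
        simulate_policy(pool, 250, 0.16, 1, 1.3, 0.38),
        simulate_policy(pool, 250, 0.16, 1, 0, 1.3)
    )
)
policies
```

Because the toy pool is random, its numbers say nothing about real admissions. Passing `bar_data` and `admit_n` in place of `pool` and `250` applies the same comparison to the actual data.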


### Using Predicted Bar Passage as a Selection Criterion

Finally, we consider what would happen if law schools selected students to optimize bar passage rates. This approach might be motivated from two perspectives. First, perhaps using an outcome-based algorithm would allow schools to lessen the weight on LSAT scores, given the critiques of standardized tests as favoring affluent non-minority groups, and hence constitute a "workable race-neutral alternative." Second, more crudely, one of the major inputs into U.S. News and World Report law school rankings is bar passage. Schools might want to admit a class to increase bar passage rates, or U.S. News might increase the weight of bar passage in its rankings. Our goal here is to examine whether the adoption of such a policy is a workable alternative and whether it might have disparate impact.

#### Exercise 4:

Create a model to predict bar passage and then use this model to simulate an admissions cycle where the students predicted as being the most likely to pass the bar are admitted into the highest tier law schools. Create the predictive model using logistic regression as shown above.

Suppose an admissions office came to you and proposed using this model to determine which students are admitted. How would you evaluate the model and what would you recommend to the admissions office? If this model were used, would there be a valid disparate impact claim for any rejected applicants?

In [33]:
# WRITE CODE HERE
# START SOLUTION

# predict bar passage rates via logistic regression
bar_model <- glm(PASS_BAR ~ LSAT + UGPA, data = bar_data, family = "binomial")

summary(bar_model)

# select students most likely to pass the bar
admitted <- bar_data %>%
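    # type = "response" returns probabilities rather than log-odds;
    # the ranking of applicants is the same on either scale
    mutate(pass_p = predict(bar_model, newdata = ., type = "response")) %>% 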
    arrange(desc(pass_p)) %>%
    slice(1:admit_n)

# compute the diversity of the admitted student body
admitted %>%
    summarize(
        minority_p = mean(MINORITY),
        mean_gpa = mean(UGPA),
        mean_lsat = mean(LSAT)
    )

# END SOLUTION


Call:
glm(formula = PASS_BAR ~ LSAT + UGPA, family = "binomial", data = bar_data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.363   0.439   0.572   0.689   1.482  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.6867     0.1401  -19.18  < 2e-16 ***
LSAT          0.0879     0.0029   30.33  < 2e-16 ***
UGPA          0.2995     0.0386    7.77  8.1e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26260  on 26508  degrees of freedom
Residual deviance: 25049  on 26506  degrees of freedom
AIC: 25055

Number of Fisher Scoring iterations: 4


minority_p,mean_gpa,mean_lsat
<dbl>,<dbl>,<dbl>
0.0584,3.48,43.1
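
One simple check that bears on disparate impact is to compare admission rates across groups under the proposed policy. The sketch below, in base R, does this on simulated data. The data frame `toy`, the group gaps built into it (which loosely echo the LSAT and GPA averages reported earlier), and the class size of 500 are assumptions made for illustration, so the resulting rates are not estimates for real applicants.

```r
# Toy data, simulated for illustration only (not the LSAC data)
set.seed(7)
n_pool <- 2000
toy <- data.frame(MINORITY = runif(n_pool) < 0.13)
toy$LSAT <- rnorm(n_pool, mean = ifelse(toy$MINORITY, 32, 37), sd = 5)
toy$UGPA <- pmin(4, rnorm(n_pool, mean = ifelse(toy$MINORITY, 3.0, 3.25), sd = 0.4))
toy$PASS_BAR <- rbinom(n_pool, 1, plogis(-2.7 + 0.09 * toy$LSAT + 0.3 * toy$UGPA)) == 1

# fit the race-blind model and predict on the probability scale
toy_model <- glm(PASS_BAR ~ LSAT + UGPA, data = toy, family = "binomial")
toy$pass_p <- predict(toy_model, newdata = toy, type = "response")

# admit the 500 applicants with the highest predicted probability
toy$admitted <- rank(-toy$pass_p, ties.method = "first") <= 500

# admission rate within each group, and the ratio between them
rates <- tapply(toy$admitted, toy$MINORITY, mean)
rates
rates[["TRUE"]] / rates[["FALSE"]]
```

Even though the model never sees `MINORITY`, the admission rates can differ sharply by group when the inputs themselves differ by group, which is the pattern a disparate impact analysis would examine.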


#### Discussion Questions

* One way to characterize the use of bar passage information is as an attempt to reduce the importance of the LSAT in determining law school admissions. Does using bar passage data fulfill the goal of reducing emphasis on the LSAT?

* Consider what some of the potential problems with this dataset are. What factors are not represented in the data that might be relevant for predicting outcomes on the bar exam? For success as an attorney? Are there any concerns about state bar passage as an outcome measure?

* How well do these models mimic the actual admissions process? How does the performance of actual admissions officers compare to the models we have here, and to the extent there are differences in outcomes, what factors might drive those differences?

* Are there important differences between the populations of interest that may influence the model in undesirable ways? Consider whether minority students are more likely to practice in jurisdictions with lower bar passage rates (e.g., NY or CA). Consider whether stereotype threat or implicit bias might explain differences in academic or bar passage performance between white and minority students, and what implications that has for the approach you've studied above.
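
As a starting point for the first question, one rough way to gauge the emphasis on the LSAT is to ask how many LSAT points offset one GPA point under each fitted model. The sketch below uses coefficient estimates typed in by hand from the two summaries above. It is only suggestive, since the admissions model also adjusts for minority status and family income while the bar passage model does not.

```r
# coefficient estimates copied from the two model summaries above
admit_coefs <- c(LSAT = 0.15764, UGPA = 1.04883)
bar_coefs   <- c(LSAT = 0.0879,  UGPA = 0.2995)

# LSAT points that offset one GPA point under each model
gpa_in_lsat_points <- c(
    admissions  = admit_coefs[["UGPA"]] / admit_coefs[["LSAT"]],
    bar_passage = bar_coefs[["UGPA"]] / bar_coefs[["LSAT"]]
)
round(gpa_in_lsat_points, 2)
```

By this measure, one GPA point is worth about 6.7 LSAT points in the admissions model but only about 3.4 in the bar passage model, which suggests that ranking by predicted bar passage places relatively more weight on the LSAT, not less.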