---
title: 'Which law schools best prepare students for the bar exam?'
author: admin
date: '2020-12-08'
slug: residual-analysis-of-law-school-and-bar-passage
categories:
- bayesian analysis
- law school
tags: []
subtitle: "Spoiler: I don't know"
summary: ''
authors: []
lastmod: '2020-12-08T19:43:42-05:00'
featured: no
image:
  caption: '<span>Photo by <a href="https://unsplash.com/@hudsoncrafted?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Debby Hudson</a> on <a href="https://unsplash.com/s/photos/studying?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></span>'
  focal_point: ''
  preview_only: no
projects: []
links:
- icon: github
  icon_pack: fab
  name: Github
  url: https://github.com/shanejorr/shanes-site/blob/main/content/post/2020-12-08-residual-analysis-of-law-school-and-bar-passage/index.en.Rmd
---
```{r setup, echo = F}
knitr::opts_chunk$set(echo = F,
message = F,
warning = F,
error = F)
```
```{r importLibraries}
library(knitr)
library(rstanarm)
library(tidybayes)
library(gridExtra)
library(kableExtra)
library(glue)
library(patchwork)
library(splines)
library(gt)
library(bayesplot)
library(tidyverse)
# stan options
# detect and use max number of cores to run models in parallel
options(mc.cores = parallel::detectCores())
```
```{r}
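# helper: aggregate takers and passers over any grouping variable(s)
# and compute the resulting pass rate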
aggregate_rates <- function(df, group_var) {
df %>%
group_by_at(group_var) %>%
summarize(takers = sum(takers),
passers = sum(passers)) %>%
mutate(pass_rate = passers / takers) %>%
ungroup()
}
# create ggplot theme for this post
post_theme <- theme_minimal() +
theme(legend.position = 'bottom',
plot.title = element_text(size = 11),
plot.subtitle = element_text(size = 10),
axis.title = element_text(size = 10))
theme_set(post_theme)
```
```{r dataImport}
# Data import and cleaning -----------------------------
# all the ABA datasets are on GitHub
# we will save the URL of the repo's data directory, for less typing
data_dir <- 'https://raw.githubusercontent.com/shanejorr/ABA-509-disclosures/main/data/'
# import the bar passage dataset for all schools
bar <- read_csv(glue("{data_dir}First-Time_Bar_Passage_(Jurisdiction-Level).csv")) %>%
select(-schoolid, -calendaryear) %>%
# rename year column for easier future reference
rename(year = firsttimebaryear) %>%
# only keep 2014 and later
# our admissions data starts at class entering in 2011, which is 2014 bar year,
# so 2014 is the earliest bar exam year we can use
filter(year >= 2014)
# import admissions data so we can extract a school's UGPA and LSAT
admiss <- read_csv(glue("{data_dir}Admissions.csv")) %>%
select(schoolname, calendaryear, uggpa75, uggpa50, uggpa25, lsat75, lsat50, lsat25) %>%
# students in these admissions classes will take bar in three years
# so, add three to year, so it matches the bar year
mutate(year = calendaryear+3) %>%
# scale gpa and lsat metrics by converting to mean of 0 and standard deviation of 1
mutate(across(uggpa75:lsat25, scale, .names = "{col}_std")) %>%
# create an indicator variable that is the sum of the standardized lsat and gpa variables
mutate(indicator = rowSums(select(., contains('_std')))) %>%
# standardize this indicator variable
mutate(indicator = scale(indicator))
# add attrition information
attrit <- read_csv(glue("{data_dir}Attrition.csv")) %>%
select(schoolname, calendaryear, pctjd1attrition) %>%
mutate(log_attrit = log(pctjd1attrition + 0.01),
log_attrit = scale(log_attrit, center = T, scale = F))
# combine attrition information with admissions data
admiss <- admiss %>%
left_join(attrit, by=c('schoolname', 'calendaryear')) %>%
select(-calendaryear)
#combine bar results and admissions datasets
bar <- left_join(bar, admiss, by=c('schoolname', 'year')) %>%
# remove WI because all law grads from WI are listed as passing the bar
filter(state != "WI",
# extreme outlier
state != "PR")
# aggregate school rates by year
school_year_rates <- aggregate_rates(bar, c('schoolname', 'year')) %>%
left_join(admiss, by = c('schoolname', 'year'))
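# overall US pass rate across all schools; used later as a reference line in plots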
us_rate <- bar %>%
mutate(overall = 'US Rate') %>%
aggregate_rates('overall') %>%
select(pass_rate) %>%
.[[1]]
# create dataset for modeling
bar_mod <- bar %>%
select(schoolname, takers, passers, schoolpasspct, state, year, contains("_std"), indicator, pctjd1attrition, log_attrit) %>%
drop_na() %>%
# only keep school / state / year combinations with at least 20 bar takers
filter(takers >= 20)
```
Evaluating the effectiveness of law schools is challenging for two reasons. The first hurdle lies in defining effectiveness. What does it mean for one law school to provide a better legal education than other schools? Generally, this centers on producing effective lawyers, although such a definition simply punts the definition of effectiveness up a level. Regardless, you cannot be an effective lawyer without actually becoming a licensed attorney. And to do this, you have to pass the bar exam. Thus, let's make 'preparing students to pass the bar exam' one measure of law school effectiveness.
Now for the second hurdle. How do we measure whether a school effectively prepares students for the bar exam? This question is causal in nature. For example, let's take a student deciding on which law school to attend. She knows she is going to take the 2023 North Carolina bar exam. In evaluating law school effectiveness, we want to know her probability of bar passage in NC, based solely on her pre-law school characteristics, given various law schools. These differences in probabilities between law schools shed light on each law school's effectiveness in preparing students for the bar exam.
This post is my attempt to answer this question. I lay out the problems with my approach at the end, but the bottom line is that I'm not convinced that it's possible to answer this question with current data sources. But, I'll let readers be the final judge.
## Overview of Method
As already stated, we want to know how well law schools prepare students for the bar exam. We'll do this by predicting each school's bar passage rate based on school-level bar passage predictors and then measure how well each school performs against its predicted passage rate. To account for student quality, we will incorporate the school's median, 25th percentile, and 75th percentile undergraduate GPA and LSAT score of incoming students.
We'll also account for first-year attrition. The reason is that schools typically flunk out poor-performing students, and these students have a low probability of bar passage. Thus, the quickest way for a school to raise its bar passage rate is to kick out more low-performing students.
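To make the approach concrete with made-up numbers: suppose a school's incoming-class credentials imply a predicted bar passage rate of 75%, but 80 of its 100 takers pass. The school outperformed its prediction by five percentage points. The sketch below just restates that arithmetic in R; all values are hypothetical.
```{r residualSketch, eval = FALSE}
# a hypothetical school: 100 takers, 80 passers, and a model-predicted
# pass rate of 75% (all numbers made up for illustration)
actual_rate    <- 80 / 100
predicted_rate <- 0.75
# a positive residual means the school outperformed its prediction
actual_rate - predicted_rate # 0.05, or five percentage points
```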
## Exploration of LSAT, undergrad GPA, and attrition as predictors
AccessLex Institute [aggregates](https://analytix.accesslex.org/DataSet) law school data. We'll use this data to find each school's yearly median, 25th percentile, and 75th percentile undergraduate GPA and LSAT score of incoming students, attrition rate, and bar passage rate by state and year. Years in this post represent the year students took the bar exam. For admissions factors such as median undergrad GPA and LSAT, this means the years represent the median values for students who entered law school three years prior to the stated year.
### Relationship between bar passage and both undergraduate GPA and LSAT scores
Prior to modeling, however, we'll examine the individual relationships between the predictors and bar passage. This provides initial confirmation that we should at least test them in models. Figure \@ref(fig:plotAdmisPredictors) shows the relationship between both median undergrad GPA (top plot) and median LSAT score (bottom plot) and bar passage. Each reveals a positive relationship between the admissions factor and bar passage.
```{r plotAdmisPredictors, fig.height = 7, fig.width = 6, fig.cap = "The top plot shows the relationship between a school's median undergrad GPA and bar passage rate. The bottom plot highlights the relationship between median LSAT and bar passage. Both median undergrad GPA and median LSAT correlate with bar passage."}
# use the same alpha and size for each plot
plot_alpha <- .5
plot_size <- .8
set2_colors <- c('#e8400c', '#0971b2')
uggpa_plot <- ggplot(school_year_rates, aes(uggpa50, pass_rate)) +
geom_point(color = set2_colors[1], alpha = plot_alpha, size = plot_size) +
facet_wrap(~year, ncol=3) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = 'Undergraduate GPA',
x = 'Median school undergraduate GPA',
y = 'School bar passage rate')
lsat_plot <- ggplot(school_year_rates, aes(lsat50, pass_rate)) +
geom_point(color = set2_colors[2], alpha = plot_alpha, size = plot_size) +
facet_wrap(~year, ncol=3) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = 'LSAT',
x = 'Median school LSAT score',
y = 'School bar passage rate')
uggpa_plot / lsat_plot
```
### Correlation between median LSAT and median undergrad GPA
We'll next look at the correlation between median undergrad GPA and median LSAT. Our prior is that these two predictors are highly correlated, as more elite schools have higher median undergrad GPAs and LSAT scores. This could cause collinearity problems within our models, which might widen the uncertainty of predictions.
Figure \@ref(fig:plotAdmisCorrelation) confirms our prior: median undergrad GPA and median LSAT are highly correlated. In the plot, both predictors are standardized to have a mean of 0 and standard deviation of 1, allowing us to compare their linear relationship - the blue line - with a perfect correlation represented by a line with a slope of 1 - the red dashed line. The plot shows an extremely strong relationship between median undergrad GPA and median LSAT. The linear relationship is only slightly flatter than the perfect relationship (red dashed line).
```{r plotAdmisCorrelation, fig.height = 4, fig.width = 6, fig.cap = "Correlation between each school's standardized median LSAT and standardized median undergrad GPA. Blue line is the fit line of the data, red line is a hypothetical perfect fit line - a slope of 1. LSAT and GPA are highly correlated, with the actual fit line mirroring the perfect fit line."}
ggplot(school_year_rates, aes(scale(uggpa50), scale(lsat50))) +
geom_abline(intercept = 0, slope = 1, linetype = 2, size = 1,
color = 'indianred2') +
geom_smooth(method = 'lm', alpha = .7, se = F) +
geom_point(alpha = .3, size = plot_size) +
facet_wrap(~year, ncol=3) +
labs(title = "Correlation Between a School's Median LSAT and Median Undergrad GPA",
subtitle = "Blue line is the fit line | Red line is a slope of 1",
x = 'Standardized median school undergraduate GPA',
y = 'Standardized median school LSAT')
```
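For a single number to go with the plot, we can compute the raw correlation directly. The sketch below pools all years using the `school_year_rates` data frame built earlier; standardizing the variables first would not change the coefficient.
```{r lsatGpaCor, eval = FALSE}
# pooled correlation between median undergrad GPA and median LSAT;
# complete.obs drops school-years with missing admissions data
cor(school_year_rates$uggpa50, school_year_rates$lsat50,
    use = "complete.obs")
```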
### Association between 1L attrition and both median LSAT and median undergrad GPA
Let's also look at the association between first year (1L) attrition and both median undergrad GPA and median LSAT. Our expectation, based on prior knowledge, is that schools with lower median undergrad GPA and LSAT scores will also have higher attrition. Figure \@ref(fig:plotAttritAdmis) confirms this prior. But, the relationship is not too strong.
```{r plotAttritAdmis, fig.height = 7, fig.width = 6, fig.cap = "The top plot shows the relationship between a school's median undergrad GPA and 1L attrition. The bottom plot shows the relationship between median LSAT and 1L attrition. Both median undergrad GPA and median LSAT correlate slightly with 1L attrition."}
y_atrit_label <- 'School 1L attrition rate (%)'
uggpa_plot <- ggplot(school_year_rates, aes(uggpa50, pctjd1attrition)) +
geom_point(color = set2_colors[1], alpha = plot_alpha, size = plot_size) +
facet_wrap(~year, ncol=3) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = 'Undergraduate GPA',
x = 'Median school undergraduate GPA',
y = y_atrit_label) +
theme(legend.position = 'none')
lsat_plot <- ggplot(school_year_rates, aes(lsat50, pctjd1attrition)) +
geom_point(color = set2_colors[2], alpha = plot_alpha, size = plot_size) +
facet_wrap(~year, ncol=3) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(title = 'LSAT',
x = 'Median school LSAT score',
y = y_atrit_label) +
theme(legend.position = 'none')
uggpa_plot / lsat_plot
```
### Association between 1L attrition and bar passage
Finally, figure \@ref(fig:plotAttritPassage) highlights the association between attrition and bar passage. Not surprisingly, it's negative: schools with higher attrition rates have lower bar passage rates. This finding is not unexpected because we saw from figure \@ref(fig:plotAdmisPredictors) that higher median undergrad GPA and LSAT scores correlate with higher bar passage rates and figure \@ref(fig:plotAttritAdmis) showed us that lower median undergrad GPA and LSAT scores are associated with higher attrition rates. Putting both findings together, we would assume that higher attrition rates correlate with lower bar passage rates.
```{r plotAttritPassage, fig.height = 4, fig.width = 6, fig.cap = "Correlation between bar passage rates and 1L attrition. There is a slight negative relationship, but there is also a lot of variance in the relationship."}
ggplot(school_year_rates, aes(pctjd1attrition, pass_rate)) +
geom_point(alpha = .3, size = plot_size) +
scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
facet_wrap(~year, ncol=3) +
labs(title = "Correlation Between a School's bar passage rate and 1L attrition rate",
x = '1L attrition rate (%)',
y = 'Bar passage rate (%)')
```
## Difference between predicted and actual bar passage
### Models with admissions factors as predictors
We now turn to the heart of this post. We'll predict each school's bar passage rate and see how actual rates stack up against predictions. But, we need a good model to do this. We know our potential predictors: median, 25th percentile, and 75th percentile undergraduate GPA and LSAT score, and first year attrition. But, we don't know the functional form of the model that provides the best predictions. Do we need all predictors or only a couple of them?
We will use two steps to search for the best model. First, we will conduct model comparison and validation on four different models that only include the undergrad GPA and LSAT predictors, along with year and state. Then, we will take this model and compare it against models with attrition as an additional predictor. Models will be compared by testing their predictions out of sample using leave-one-out cross validation (LOO). Posterior prediction checks will also be used to ensure the fit is reasonable.
All models are Bayesian hierarchical logistic regressions. The response is a school's bar passage rate for a given state and year, with the model weighted by the number of takers. All models have the same group-level predictors: state and year. They only differ in their school-level predictors. The four initial models are below:
1. *Linear:* median undergraduate GPA and median LSAT score for the given graduating class year.
```{r pca}
# calculate principal components for six undergrad GPA and LSAT variables.
bar_pca <- bar_mod %>%
select(contains("_std")) %>%
prcomp()
bar_pca_summary <- bar_pca %>%
summary(.) %>%
.[6] %>%
as.data.frame()
# get first two principal component values
bar_pca_x <- bar_pca$x %>%
as.data.frame() %>%
select(PC1, PC2)
# add princ component values to bar dataset
bar_mod <- bar_mod %>%
bind_cols(bar_pca_x)
# create column names for PCA table
pca_cols <- glue("PC {seq(1, 6)}")
var_explained <- round(bar_pca_summary[[2]][3] * 100, 0)
```
2. *PCA:* Figure \@ref(fig:plotAdmisCorrelation) showed that median undergraduate GPA and median LSAT score are highly correlated, raising fears of collinearity. The 25th, median, and 75th percentiles of each metric are also highly correlated with one another (the 25th percentile undergrad GPA is correlated with the median undergrad GPA). We'll use Principal Component Analysis (PCA) to reduce the six highly correlated percentile variables into a smaller number of uncorrelated ones, while maintaining as much of the information in the variables as possible. The table below shows that the first two principal components contain `r var_explained`% of the variance in the six LSAT and GPA variables. By only using these two principal components as predictors, we retain `r var_explained`% of the variance with only two uncorrelated variables.
```{r}
bar_pca_summary %>%
kable(col.names = pca_cols, digits = 2)
```
3. *PCA spline:* To account for possible non-linear relationships between the two PCA variables and bar passage, we'll examine a model with splines added to the two PCA variables.
4. *Spline:* This model contains median undergraduate GPA and median LSAT score like model 1, but adds splines to them.
```{r barMods, cache = T, eval = T}
# save the number of schools for later use
n_schools <- length(unique(bar_mod$schoolname))
school_mod_linear <- stan_glmer(cbind(passers, takers - passers) ~ lsat50_std + uggpa50_std +
(1 | state) + (1 | year),
prior_intercept = student_t(1.25, 1, 7),
iter = 2000,
seed = 145,
data = bar_mod,
family = binomial("logit"))
school_mod_pca <- stan_glmer(cbind(passers, takers - passers) ~ PC1 + PC2 + (1 | state) + (1 | year),
prior_intercept = student_t(1.25, 1, 7),
iter = 2000,
seed = 145,
data = bar_mod,
family = binomial("logit"))
school_mod_pca_spline <- stan_glmer(cbind(passers, takers - passers) ~ ns(PC1, df=4) + ns(PC2, df=4) +
(1 | state) + (1 | year),
prior_intercept = student_t(1.25, 1, 7),
iter = 2000,
seed = 145,
data = bar_mod,
family = binomial("logit"))
school_mod_spline <- stan_glmer(cbind(passers, takers - passers) ~ ns(lsat50_std, df=4) +
ns(uggpa50_std, df=4) +
(1 | state) + (1 | year),
prior_intercept = student_t(1.25, 1, 7),
iter = 2000,
seed = 145,
data = bar_mod,
family = binomial("logit"))
mod_list <- list(school_mod_linear, school_mod_pca, school_mod_pca_spline, school_mod_spline)
```
### Compare admissions factors models using LOO
The table below compares each model by its expected log posterior density (ELPD). The `Difference in ELPD` column shows each model's ELPD minus the ELPD of the best fitting model - the model on the first row. The PCA spline model performs best. Plus, the standard error of the difference in ELPD is small relative to the difference between the best and second best models, so we can have confidence that the PCA spline model did not perform best due to randomness in the sampling draws. Assuming the model's posterior predictive check pans out, we'll start with it when adding attrition.
```{r looCompare}
mod_loo_list <- map(mod_list, loo, k_threshold = 0.7)
mod_loo_list <- list(linear = mod_loo_list[[1]],
pca = mod_loo_list[[2]],
pca_spline = mod_loo_list[[3]],
spline = mod_loo_list[[4]])
# columns need from loo_compare
loo_cols <- c('model', 'elpd_diff', 'se_diff')
loo_col_names <- c('Model', 'Difference in ELPD', 'Std. Error of Difference')
# titles for plots, in same order as models
# used in plots and in mapping model names to descriptions
plot_titles <- c('Linear', 'PCA', 'PCA spline', 'Spline')
# mapping of model names to descriptions
mod_descriptions <- c(linear = plot_titles[1],
pca = plot_titles[2],
pca_spline = plot_titles[3],
spline = plot_titles[4])
loo_compare(mod_loo_list) %>%
as.data.frame() %>%
mutate(model = row.names(.),
model = recode(model, !!!mod_descriptions)) %>%
.[loo_cols] %>%
kable(col.names = loo_col_names, digits = 0, row.names = F)
```
### Posterior predictive checks of admissions factors models
Posterior predictive checks (PPC) simulate y values - bar passage rates - from the models. Our checks simulate 100 distributions of bar passage rates based on 100 model simulations. We can then compare these distributions to the actual distribution of rates. Ideally, they align.
Figure \@ref(fig:ppcPlots) shows the PPC plots. All models have the same problem: they overpredict at the 75% to 90% bar passage rates and underpredict at the tails. This is shown by the dark line (actual distribution) being lower than the light lines (simulated distributions) in the 75% to 90% range and higher than the light lines outside this range. The takeaway from the misfit models is that there will be too many predictions in the 75% to 90% range and not enough at each tail.
```{r postpredictions, cache = T}
# function to create posterior predictive check plot
ppc_plot <- function(mod, num_takers, pass_pct, plot_title) {
# creates posterior predictions and then creates ppc plot
pred_draws <- posterior_predict(mod, draws = 100) %>%
t()
# calculate percentage probability from number of passers in prediction
pred_draws <- t(pred_draws / num_takers)
# create plot
ppc_dens_overlay(yrep = pred_draws, y = pass_pct) +
ggtitle(plot_title)
}
ppc_mod_plots <- map2(mod_list, plot_titles,
ppc_plot, num_takers = bar_mod$takers, pass_pct = bar_mod$schoolpasspct)
```
```{r ppcPlots, fig.height = 3, fig.width = 8, fig.cap = "Posterior predictive check for all models. A problem in all models is that they fail to accurately predict at the most likely bar passage rates. We know this because at the highest points in the curve, the predicted passage rates for schools are higher than the actual rates."}
wrap_plots(ppc_mod_plots, nrow = 1)
```
### Add attrition as a predictor
Step two involves using the best model in the previous section, PCA with a spline, and adding attrition as a predictor. We'll incorporate attrition using the natural logarithm of attrition. The natural logarithm is used because attrition rates have a right skew: most schools have attrition rates between 0% and 15%, but some schools have rates greater than 30%. Using the natural logarithm removes much of the right skew and makes the distribution more symmetrical.
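Below is a minimal sketch of the transformation, using the raw `pctjd1attrition` column already in `bar_mod`; the 0.01 offset, taken from the data import chunk above, keeps schools with zero attrition from producing `log(0)`.
```{r attritLogSketch, eval = FALSE}
# raw attrition rates are right-skewed; logging pulls in the long right tail
hist(bar_mod$pctjd1attrition)
hist(log(bar_mod$pctjd1attrition + 0.01))
```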
We'll create two different models with attrition, each an extension of the PCA spline model. The first adds the natural logarithm of attrition, centered, as an individual predictor. The second also includes the natural logarithm of attrition, but wraps it in a spline to model non-linearity.
```{r runMods2, cache = T}
school_mod_pca_spline_att <- update(school_mod_pca_spline,
formula = . ~ ns(PC1, df=4) + ns(PC2, df=4) +
log_attrit + (1 | state) + (1 | year))
school_mod_pca_spline2_att <- update(school_mod_pca_spline,
formula = . ~ ns(PC1, df=4) + ns(PC2, df=4) +
ns(log_attrit, df=4) + (1 | state) + (1 | year))
```
The table below compares each model based on the ELPD using leave-one-out cross validation. We see little difference between the two models with attrition added, but these two models are better than the model without attrition.
```{r looCompare2}
mod_list2 <- list(without_attrit = school_mod_pca_spline,
with_attrit = school_mod_pca_spline_att,
with_spline = school_mod_pca_spline2_att)
mod_loo_list2 <- map(mod_list2, loo, k_threshold = 0.7)
plot_titles2 <- c('PCA spline w/o attrition', 'PCA spline with attrition', 'PCA spline with attrition spline')
# mapping of model names to descriptions
mod_descriptions2 <- c(without_attrit = plot_titles2[1],
with_attrit = plot_titles2[2],
with_spline = plot_titles2[3])
loo_compare(mod_loo_list2) %>%
as.data.frame() %>%
mutate(model = row.names(.),
model = recode(model, !!!mod_descriptions2)) %>%
.[loo_cols] %>%
kable(col.names = loo_col_names, digits = 0, row.names = F)
```
Moving on to the model check, the plot of posterior predictive checks in figure \@ref(fig:ppcPlots2) shows that the models with attrition have the same problem we encountered in figure \@ref(fig:ppcPlots). More predicted pass rates fall between 70% and 90% than would be expected given the distribution of actual bar passage rates.
Despite the non-ideal fit, the PCA spline model with attrition will be used to predict schools' bar passage rates. This model performs almost identically to the PCA spline model with attrition added as a spline, but is simpler. All things equal, we'll opt for the simpler model.
```{r ppcPlots2, fig.height = 3, fig.width = 8, fig.cap = "Posterior predictive checks. The attrition model checks reveal the same problem as the models without attrition - too many predictions in the 75% to 90% range."}
ppc_mod_plots2 <- map2(mod_list2, plot_titles2,
ppc_plot, num_takers = bar_mod$takers, pass_pct = bar_mod$schoolpasspct)
wrap_plots(ppc_mod_plots2, nrow = 1)
```
## Comparing actual and predicted bar passage rates for all US law schools
```{r}
# create fitted draws and HPI of fitted draws
# we'll create predictions and then aggregate them by school twice,
# so use a function for this task
create_predictions <- function(df, mod) {
df %>%
# create posterior predictions on dataset
add_predicted_draws(mod, n = 500, seed = 938) %>%
# sum the number of predicted passers for each school
group_by(schoolname, .draw) %>%
summarize(total_takers = sum(takers),
predicted_passers = sum(.prediction),
actual_passers = sum(passers)) %>%
# calculate actual and predicted pass rates
mutate(pass_rate = actual_passers / total_takers,
predicted_pass_rate = predicted_passers / total_takers,
# calculate residual
residual = pass_rate - predicted_pass_rate)
}
# create predictions by school
predicted_pass_draws <- create_predictions(bar_mod, school_mod_pca_spline_att)
predicted_pass_hdi <- predicted_pass_draws %>%
ungroup() %>%
# we want summary statistics by school
group_by(schoolname) %>%
median_hdi(predicted_pass_rate) %>%
# add actual pass rates to this dataset for plotting
left_join(aggregate_rates(bar_mod, 'schoolname'), by = 'schoolname')
# calculate HDI of residuals
predicted_pass_residual <- predicted_pass_draws %>%
ungroup() %>%
# we want summary statistics by school
group_by(schoolname) %>%
median_hdi(residual)
```
```{r}
# create plot of actual and expected values
# plot actual pass rates and predictions
n_rows <- nrow(predicted_pass_hdi)
# point size for all points in all plots
plot_point_size <- .7
plot_actual <- predicted_pass_hdi %>%
ggplot(aes(predicted_pass_rate, fct_reorder(schoolname,pass_rate))) +
# bayesian predicted pass rate
geom_point(alpha=.6, color = set2_colors[1], size = plot_point_size) +
geom_vline(xintercept = us_rate, linetype = 2, alpha = .7) +
annotate('text', x = us_rate - .02, y = n_rows-4,
angle = 90, alpha = .7, size = 3,
label = "Overall pass rate") +
# bayesian 95% HPI
geom_errorbarh(aes(xmin = .lower, xmax = .upper), color = set2_colors[1]) +
# actual bar passage rate
geom_point(aes(pass_rate, fct_reorder(schoolname, pass_rate)),
color = set2_colors[2], size = plot_point_size) +
scale_x_continuous(labels = scales::percent_format(accuracy=1)) +
labs(x="Predicted and actual\nBar passage rates",
y=NULL)
```
```{r}
# calculate probability that predicted value is lower than actual value, to be plotted
prob_greater_zero <- predicted_pass_draws %>%
mutate(greater_zero = residual > 0) %>%
group_by(schoolname) %>%
summarize(perc_greater_zero = sum(greater_zero) / n()) %>%
mutate(perc_greater_zero = glue("{round(perc_greater_zero*100, 0)}%")) %>%
select(schoolname, perc_greater_zero)
# create plot of residuals
plot_residuals <- predicted_pass_residual %>%
# add pass rate, so we can order by it in the plot
left_join(predicted_pass_hdi[c('schoolname', 'pass_rate')], by = 'schoolname') %>%
left_join(prob_greater_zero, by = 'schoolname') %>%
mutate_at(vars(residual, .lower, .upper), ~.*100) %>%
ggplot(aes(residual, fct_reorder(schoolname,pass_rate))) +
# bayesian predicted pass rate
geom_point(alpha=.6, color = set2_colors[1], size = plot_point_size) +
geom_vline(xintercept = 0, linetype = 2, alpha = .7) +
# bayesian 95% HPI
geom_errorbarh(aes(xmin = .lower, xmax = .upper), color = set2_colors[1]) +
geom_text(aes(x = 40, y = schoolname, label = perc_greater_zero)) +
labs(x='Actual minus predicted\nBar passage rate',
y=NULL) +
theme(axis.text.y=element_blank())
```
With technicalities behind us, let's get to the results. Figure \@ref(fig:plotPredictions) compares each law school's predicted and actual bar passage rates from 2014 to 2019. The left plot shows the 95% credible intervals for each school's predicted bar passage rate (in red) and its actual rate (blue). The right plot, meanwhile, displays the 95% credible interval for each school's residual. The residual is the actual rate minus the predicted rate. Finally, the percentages on the right represent the probability that the school's actual rate is higher than its predicted rate. Stated differently, it's the probability that the residual is positive.
```{r plotPredictions, fig.height = 25, fig.width = 10, fig.cap = "Left plot reveals actual (blue dot) and predicted (red dot) bar passage rates. Right plot shows residuals: actual minus predicted bar passage rates. The percentage on the right is the probability that the actual bar passage rate is higher than the predicted rate."}
# combine two plots and display results
plot_actual + plot_residuals +
plot_annotation(
title = 'Comparing Actual and Predicted Bar Passage Rates for All Law Schools',
subtitle = 'Bands are 95% credible intervals | Percentage is probability that actual rate is higher than predicted rate'
)
```
## Three problems with the methodology
What do we make of the plots? For starters, simply ranking each school by its residual is ill-advised for three reasons.
```{r}
# get school ranks for the following histogram and later listing of ranks
school_ranks <- predicted_pass_draws %>%
group_by(.draw) %>%
mutate(num_rank = rank(-residual))
# get each school's credible interval for ranks
school_ranks_ci <- school_ranks %>%
ungroup() %>%
#mutate(school_group = group_indices(., schoolname)) %>%
group_by(schoolname) %>%
median_hdci(num_rank, .width = .9)
# get Kentucky's ci
uk_rank <- school_ranks_ci %>%
filter(str_detect(schoolname, "^Univ.*Kentucky"))
```
1. Schools have a lot of uncertainty in their residuals. A school's ranking is its true residual plus uncertainty in the residuals; the dreaded error terms in models. The higher the uncertainty, the more a specific school's ranking simply reflects randomness. Here, the uncertainty is large enough that rankings will reflect more randomness than we are comfortable incorporating. Figure \@ref(fig:ukResidualRankPlot) illustrates this. It shows the University of Kentucky's residual ranking among all `r n_schools` law schools based on 500 simulations of the model. The median simulated ranking is `r round(uk_rank$num_rank[1], 0)` and the 90% credible interval stretches from `r uk_rank$.lower[1]` to `r uk_rank$.upper[1]`. This range is too wide to be helpful.
```{r ukResidualRankPlot, fig.cap = "The University of Kentucky's residual rankings from 500 model simulations. There is a wide degree of variance in the rankings, making them unhelpful."}
school_ranks %>%
filter(schoolname == 'University of Kentucky') %>%
ggplot(aes(x = num_rank)) +
geom_histogram(binwidth = 10) +
scale_x_continuous(limits = c(0, 200), breaks = seq(0, 200, 10)) +
labs(title = "Univ. of Kentucky Ranking out of 500 Simulations",
y = 'Number of simulations with ranking',
subtitle = 'Bins are in units of 10',
x = glue('Rank among all {n_schools} law schools'))
```
2. Schools with high LSAT and undergrad GPAs will also have high predicted bar passage rates. Since neither actual nor predicted rates can exceed 100%, this leaves schools with high LSAT and undergrad GPAs little room to outperform their predictions. When you are already at the top, you can't go much higher. (The short sketch after this list shows the ceiling numerically.)
3. The predictions are only as good as the model. As figure \@ref(fig:ppcPlots2) shows, the model is not ideal. It potentially misfits by different amounts depending on the predicted bar passage rate.
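Here is problem two's ceiling effect in miniature, with made-up predicted rates:
```{r ceilingSketch, eval = FALSE}
# neither actual nor predicted rates can exceed 100%, so the best possible
# residual is 1 minus the predicted rate (rates here are hypothetical)
predicted <- c(0.60, 0.80, 0.95)
1 - predicted # 0.40, 0.20, 0.05: high-prediction schools have little headroom
```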
## Comparing similar schools
Taking reasons two and three together, a better approach might be to compare similar schools. For example, we might compare schools with similar admissions factors where the majority of bar takers are from the same state. To highlight this method, we'll compare three North Carolina schools with similar admissions factors: Campbell University, Elon University, and North Carolina Central University.
```{r}
nc_residuals <- bar_mod %>%
filter(str_detect(schoolname, '^Campbell|^Elon|^North Carolina Central'),
state == 'NC') %>%
# make predictions for selected schools
create_predictions(school_mod_pca_spline_att) %>%
# the school names will be turned into column names when we widen dataset,
# so rename schools to better column names
mutate(schoolname = recode(schoolname, `Campbell University` = 'campbell',
`Elon University` = 'elon',
`North Carolina Central University` = 'central')) %>%
pivot_wider(id_cols = .draw, names_from = 'schoolname', values_from = 'residual') %>%
# calculate difference in residuals
mutate(`Campbell minus Elon` = campbell - elon,
`Campbell minus NC Central` = campbell - central,
`Elon minus NC Central` = elon - central) %>%
# put in long form so all differences are in the same column
pivot_longer(cols = contains("minus"), names_to = 'comparison', values_to = 'diff')
```
Figure \@ref(fig:plotResidCompare) shows the comparison's results. It shows, for each pair of schools, the difference between their residuals, incorporating only North Carolina bar exam results. To account for uncertainty, the entire probability distribution of these differences is shown.
Here's how it works. Let's say that Campbell's actual NC bar passage rate is 70% and their predicted rate is 60%, for a difference of 10 percentage points. Elon's actual NC rate and predicted rate are both 60%, for a difference of 0. In this situation, Campbell's residual - difference between actual and predicted bar passage rate - is 10 percentage points higher than Elon's (10 - 0).
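The same hypothetical numbers, restated as arithmetic:
```{r residDiffSketch, eval = FALSE}
# residuals are actual minus predicted pass rates (hypothetical values)
campbell_residual <- 0.70 - 0.60 # 10 percentage points
elon_residual     <- 0.60 - 0.60 # 0
campbell_residual - elon_residual # Campbell's residual is 10 points higher
```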
```{r plotResidCompare, fig.cap = "Difference in residuals between comparative law schools in North Carolina. Predictions and residuals are limited to the NC bar exam.", fig.height = 3, fig.width = 4.5}
# calculate HDI of differences
nc_residuals %>%
ggplot(aes(x = diff*100, y = comparison)) +
stat_halfeye(alpha = .4, color = set2_colors[2], fill = set2_colors[2]) +
geom_vline(xintercept = 0, linetype = 2, alpha = .4) +
scale_color_brewer(type = 'qualitative', palette = 'Set2') +
labs(title = 'Differences in Residuals Between\nComparative NC Schools',
x = 'Percentage point difference in residuals',
y = 'Comparison') +
theme(legend.position = 'none')
```
OK, so this *might* be useful. But, I'm still not comfortable with it due to the problems previously mentioned. The best course of action is to scrap the whole enterprise. But, in the spirit of learning from what goes wrong, I figured I would post it here for posterity.
## Addendum: the final rankings that have little value
If you ignored this post's advice and want to know the actual modeled rankings, here you go. Again, you shouldn't make much of them; I'm not. And you especially shouldn't make much of them once you see the wide 90% credible intervals. Also, going back to problem two mentioned above, notice that almost all of the top-ranked schools have low LSAT / GPA combinations and low bar passage rates.
```{r}
school_ranks_ci %>%
mutate(ordered_rank = min_rank(num_rank)) %>%
select(schoolname, ordered_rank, .lower, .upper) %>%
mutate(across(where(is.numeric), ~round(., 0))) %>%
arrange(ordered_rank) %>%
gt(rowname_col = "schoolname") %>%
cols_label(
ordered_rank = "Rank",
.lower = "Lower bound",
.upper = "Upper bound"
) %>%
tab_spanner(
label = "90% Credible Interval",
columns = vars(`.lower`, `.upper`)
)
```