-
Notifications
You must be signed in to change notification settings - Fork 0
/
model_building.Rmd
448 lines (346 loc) · 20.8 KB
/
model_building.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
---
title: "Model Building"
output:
html_document:
code_folding: hide
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
options(scipen = 999)
knitr::opts_chunk$set(
echo = TRUE,
warning = FALSE,
message = FALSE,
fig.width = 12
)
library(tidyverse)
library(readxl)
library(plotly)
library(geepack)
library(modelr)
source('gee_stepper_o.R') ## custom stepwise function
```
```{r read_data, echo = FALSE}
joined_data_bklyn_nomissing =
readRDS('./data/joined_data_bklyn_nomissing')
## integrating neighborhood as predictor
neighborhood_df =
read_excel("./data/NYC_neighborhoods_by_censustract_2010.xlsx",
skip = 5,
col_names = c("borough", "county_code", "boro_code", "census_tract", "PUMA", "nta_code", "neighborhood")
) %>%
#creating 11 digit FIPS code by pasting country code (36) with `county_code` and `census_tract`
mutate(geo_id = paste0("36", county_code, census_tract)
) %>%
rename(geoid = geo_id)
joined_data_bklyn_nomissing =
left_join(joined_data_bklyn_nomissing,
neighborhood_df) %>%
select(colnames(joined_data_bklyn_nomissing), neighborhood) %>%
mutate(neighborhood = as.factor(neighborhood),
years_since_2010 = as.factor(years_since_2010))
```
# Predictors of Eviction in Brooklyn, 2010-2016
## Overview
Using existing data, we investigated factors that are associated with - and can potentially be used to predict - eviction rates among census tracts in Brooklyn. We used Exploratory Analyses (see navbar menu above) to inform our formal hypotheses and analyses detailed in this section.
We used Generalized Estimating Equations (GEE), which can be used when data are correlated and distinction of within-group and between-group estimates is not desired. Since our data are repeated over time within census tract (2010-2016), we expect there to be autocorrelation present across years, and we can address this using GEE modeling techniques. For more technical information on models used, see the [Formal Analysis](https://simmwill.github.io/evictions/overview.html#formal_analysis) section of our project report.
Overall, we found that eviction rates are significantly higher, on average, in census tracts with higher proportions of non-white residents (we tested percents of Black, Asian, American Indian / Alaska Native, Native Hawaiian / Pacific Islander, and Hispanic residents). We also found that low-income communities are more at-risk. On average, for every ten percent increase in rent burden (percent of one's income that is paid for housing), eviction rates are expected to increase 19.29 percent, holding race, year, and population density constant. These results have been shown in previous research and have significant implications for NYC's affordable housing and eviction landscape.
***
## Analysis
Since our outcome, eviction rate, is calculated using a **count** variable (number of evictions) repeated over time within areas, we'll model it using GEE with a Poisson link function.
```{r poisson_dist, message = FALSE}
#####
## poisson distribution of counts (and rates)
joined_data_bklyn_nomissing %>%
ggplot(aes(x = eviction_rate)) +
geom_histogram(binwidth = 0.1) +
theme_light() +
labs(x = "Eviction count per 100 renter-occupied households",
y = "Count",
title = "Distribution of eviction counts, standardized by area") +
theme(plot.title = element_text(hjust = 0.5))
```
## Relevant predictors
Although we formulated specific hypotheses regarding predictors of eviction in Brooklyn based on existing research, we wanted to include as many potential predictors as possible to test against our hypothesized predictors. The data we are using includes the following variables:
* `evictions`. Number of evictions at census level.
* `eviction_filings`. Number of eviction filings at census level.
* `renter_occupied_households`. Number of renter occupied households in census tract. As mentioned above, this forms the denominator of our modelled rate and will be included in all models as an offset term (as `log(renter_occupied_households)`).
* `years_since_2010`. Since our data range from 2010 to 2016, and we did not want to assume a constant effect of time, we included year as a set of indicator variables in all models except our empty model (see below).
* `hisp`. Percent of population (at census tract level, for all race/ethnicity variables) that self-report Hispanic ethnicity.
* `white`. Percent self-reporting White race.
* `black`. Percent self-reporting Black race.
* `asian`. Percent self-reporting Asian race.
* `aian`. Percent self-reporting American Indian / Alaska Native race.
* `nhpi`. Percent self-reporting Native Hawaiian / Pacific Islander race.
* `other`. Percent self-reporting other race.
* `rent_burden`. Average percent of income spent on rent.
* `density`. Population density.
* `pct_eng`. Percent of population who speak English less than 'Very Well'. This is interpreted as a proxy for percent English as a second language (ESL) speakers.
* `median_household_income`. Median census tract household income in USD.
* `poverty_rate`. Percent living below Federal Poverty Line (FPL).
* `median_gross_rent`. Median census tract gross rent in USD.
* `pct_renter_occupied`. Percent of census tract occupied by renters.
* `median_property_value`. Median census tract property value in USD.
* `family_size`. Average family size in census tract.
* `pct_fam_households`. Percentage of census tract households that contain families.
## Hypothesis
Based on prior reading and research, we hypothesize the following predictors of eviction count at the census tract level. See above for variable definitions.
* `years_since_2010`
* `rent_burden`
* `density`
* `pct_eng`
* Race/ethnicity variables (`white`, `black`, `asian`, `aian`, `nhpi`, `other`, `hisp`)
_**For detailed model statements, see Report (link in navbar above).**_
## Correlation matrix
Before proceeding, it is important to assess crude correlation among our relevant variables, in case issues of multicollinearity arise during model development.
```{r corr_matrix}
joined_data_bklyn_nomissing %>%
mutate(year = as.numeric(year)) %>%
#select(-pct_nonwhite_racedata, -pct_af_am, am_ind_ak_native = aian) %>%
select(year, evictions,
pct_eng, rent_burden, density,
white, black, asian, aian, nhpi, other, hisp,
median_household_income, poverty_rate, median_gross_rent, median_property_value, pct_renter_occupied,
family_size, pct_fam_households) %>%
# select_if(is.numeric) %>%
cor() %>%
corrplot::corrplot(type = "lower",
method = "square",
addCoef.col = "black",
diag = FALSE,
number.cex = .6,
tl.col = "black",
tl.cex = .9,
tl.srt = 45)
```
Of relevance to our hypotheses, the following variables were highly correlated and thus may not be accurately interpreted in a model as independent predictors:
* `pct_eng` and `year` (*r* = 0.76)
* `white` and `black` (*r* = -0.87)
* `hisp` and `other` race (*r* = 0.85)
## Note on confounding
After performing exploratory data analysis and crude univariate plots (*not shown*), we found that the relationship between `pct_eng` - percent of ESL speakers - and eviction rate is confounded by `black` race.
At first, the relationship between ESL speakers and eviction rate was inverse - as ESL increased, evictions actually decreased. This is counter to our hypotheses, as non-English speaking and immigrant communities have been shown to experience much higher rates of eviction. However, we knew that Black race had the potential to confound this relationship, as Black NYC communities are overwhelmingly English-speaking and also at an increased risk of eviction due to other factors.
Once we adjusted for Black race in the relationship between ESL and evictions, the relationship flipped, and EsL was demonstrated to have a positive association with eviction rate, adjusting for Black race.
## Refining and subdefining our hypothesized model
We refined our hypothesized models based on our exploratory confounding and other analyses:
1. As noted above, ESL should not be included in a model without adjusting for Black race nor should it be interpreted independently of year due to high correlation. Thus, our first hypothesized sub-model will include ESL and percent of all race/ethnicity variables (interpreted as the independent effects of ESL and years since 2010 [highly correlated], rent burden, or population density on evictions, adjusting for each other predictor variable and race/ethnicity).
2. As also noted above, percent White race and percent Black race are highly correlated in our data (*r* = 0.87), and percent other race and percent Hispanic ethnicity are also highly correlated (*r* = 0.85). Thus, a second hypothesized sub-model will observe the independent effects of percent racial/ethnic compositions (**excluding** White and Other race), rent burden, population density, or years since 2010 on eviction rates, adjusting for other predictor variables.
## Stepwise automatic model selection
To contrast with our hypothesis-informed model building process, we wanted to test a stepwise model selection algorithm. We used the `gee_stepper` function within the `pstools` package, which performs a forward step selection process with a set of GEE predictors, using QIC to find a 'best fit' model.
We slightly modified the function code to create our own function - `gee_stepper_o` - in order to include our offset term by default in the stepwise selection process.
```{r stepwise, eval = FALSE}
## use full dataset
full_fit =
geeglm(evictions ~
offset(log(renter_occupied_households)) + years_since_2010 +
pct_eng + rent_burden + pct_nonwhite_racedata + density + ## hypothesized
black + aian + asian + nhpi + other + hisp + ## race
poverty_rate + pct_renter_occupied + median_gross_rent + median_household_income + median_property_value +
eviction_filings + family_size + pct_fam_households,
data = joined_data_bklyn_nomissing,
id = geoid,
family = poisson,
corstr = "ar1")
gee_stepper_o(full_fit, formula(full_fit)) ## customized function to automatically include offset and time covariate
```
After running stepwise selection, our GEE stepwise model includes predictors:
* [`offset`] - held constant
* `black`
* `hisp`
* `rent_burden`
* `median_gross_rent`
* `density`
* `poverty_rate`
* `median_household_income`
* `asian`
* `nhpi`
* `aian`
* `pct_renter_occupied`
In this model, to be conservative in our interpretations, `median_household_income` should not be interpreted separately from `poverty_rate` (*r* = -0.73) or `median_gross_rent` (*r* = -0.74).
## Summary of models
To summarize, we will consider the following models to predict census tract-level eviction rates:
Two hypothesized models:
1. A hypothesized model based primarily on percent ESL speakers [Equation 2, above]
2. A hypothesized model based primarily on independent effects of race/ethnicity percentages [Equation 3, above]
3. A model selected via a GEE-specific forward stepwise selection process
In addition to these three models (1, 2, 3), we propose two baseline models:
4. An empty model, in which there are no predictors, and only the outcome and offset terms are included on opposite sides of the equation, i.e. `evictions ~ offset(log(renter_occupied_households))`
5. A naive model, which only uses time to predict eviction rate, i.e. the above model, plus `years_since_2010` on the right-hand-side of the model
***
## Cross-validation of models
Since our data are modelled using GEE and are not all nested within each other, there is not a straightforward way to compare their prediction capabilities such as hypothesis testing or AIC (in this case, QIC). However, we can use **cross validation**, which allows us to compare prediction accuracy across non-nested models. While this technique does not allow us to assess statistical significance of hypotheses/models/predictors tested, we still believe it will be useful in gauging model usefulness.
Specifically, we will compare the repeat-sampled distribution of each model's **root mean squared error**, with lower values indicating lower error and better overall prediction.
(For more technical notes, see [corresponding section](https://simmwill.github.io/evictions/overview.html#cross-validation_of_models) in our report.)
```{r models}
## models
empty_model = ## no predictors
geeglm(formula = evictions ~ offset(log(renter_occupied_households)),
data = joined_data_bklyn_nomissing,
family = poisson,
id = geoid,
corstr = "ar1")
naive_model = ## just year
geeglm(formula = evictions ~ offset(log(renter_occupied_households)) + years_since_2010,
data = joined_data_bklyn_nomissing,
family = poisson,
id = geoid,
corstr = "ar1")
hyp_model_eng =
geeglm(evictions ~
offset(log(renter_occupied_households)) +
years_since_2010 +
pct_eng + rent_burden + density +
white + black + asian + aian + nhpi + other + hisp +
pct_eng*black,
data = joined_data_bklyn_nomissing,
id = geoid,
family = poisson,
corstr = "ar1")
hyp_model_race =
geeglm(evictions ~
offset(log(renter_occupied_households)) +
years_since_2010 +
rent_burden + density +
black + asian + aian + nhpi + hisp,
data = joined_data_bklyn_nomissing,
id = geoid,
family = poisson,
corstr = "ar1")
step_model =
geeglm(evictions ~
offset(log(renter_occupied_households)) +
black + hisp + rent_burden + median_gross_rent +
density + poverty_rate + median_household_income + asian +
nhpi + aian + pct_renter_occupied,
family = poisson,
data = joined_data_bklyn_nomissing,
id = geoid,
corstr = "ar1")
```
```{r cv, fig.width = 9}
set.seed(2)
## creating test-training pairs
model_data_cv =
joined_data_bklyn_nomissing %>%
crossv_mc(., 100)
## unpacking pairs
model_data_cv =
model_data_cv %>%
mutate(
train = map(train, as_tibble),
test = map(test, as_tibble))
## assessing prediction accuracy
model_data_cv =
model_data_cv %>%
mutate(empty = map(train, ~geeglm(formula = formula(empty_model),
family = poisson,
data = joined_data_bklyn_nomissing,
id = geoid,
corstr = "ar1")),
naive = map(train, ~geeglm(formula = formula(naive_model),
family = poisson,
data = joined_data_bklyn_nomissing,
id = geoid,
corstr = "ar1")),
gee_eng = map(train, ~geeglm(formula = formula(hyp_model_eng),
family = poisson,
data = joined_data_bklyn_nomissing,
id = geoid,
corstr = "ar1")),
gee_race = map(train, ~geeglm(formula = formula(hyp_model_race),
family = poisson,
data = joined_data_bklyn_nomissing,
id = geoid,
corstr = "ar1")),
gee_step = map(train, ~geeglm(formula = formula(step_model),
family = poisson,
data = joined_data_bklyn_nomissing,
id = geoid,
corstr = "ar1")))
model_data_cv =
model_data_cv %>%
mutate(rmse_empty_gee = map2_dbl(empty, test, ~rmse(model = .x, data = .y)),
rmse_naive_gee = map2_dbl(naive, test, ~rmse(model = .x, data = .y)),
rmse_eng_gee = map2_dbl(gee_eng, test, ~rmse(model = .x, data = .y)),
rmse_race_gee = map2_dbl(gee_race, test, ~rmse(model = .x, data = .y)),
rmse_step_gee = map2_dbl(gee_step, test, ~rmse(model = .x, data = .y)))
## plotting RMSEs
model_data_cv %>%
select(starts_with("rmse"), -contains('glm')) %>%
pivot_longer(
everything(),
names_to = "model",
values_to = "rmse",
names_prefix = "rmse_") %>%
mutate(model = fct_reorder(model, rmse, .desc = TRUE),
model = fct_recode(model, "Empty" = "empty_gee",
"Naive" = "naive_gee",
"Race Model" = "race_gee",
"ESL Model" = "eng_gee",
"Stepwise Model" = "step_gee")) %>%
ggplot(aes(x = model, y = rmse, fill = model)) +
geom_violin() +
theme_light() +
labs(x = "Model",
y = "RMSE",
title = "Cross-Validating Models") +
scale_fill_brewer(type = 'seq', palette = 'Purples') +
theme(legend.position = 'none',
plot.title = element_text(hjust = 0.5))
```
As we can see, our empty and naive models are clearly inferior in terms of their prediction abilities to our other hypothesized and selection-based models.
This is the only clear-cut difference, as the latter three models - our two hypothesized models, and our stepwise model - are very similar in terms of their prediction capabilities.
(*Note:* None of our models had particularly good predictive capabilities, and the differences between our models given their RMSE distributions is comparatively small. However, we will assume that the difference in RMSE is still meaningful.)
As this is the case, we will opt for model simplicity and select the simplest model with the most easily-interpretable results: the hypothesized model based primarily on the independent effects of non-White race/ethnicity percentages, rent burden, population density, and years since 2010.
## Conclusions
Our model can be estimated via GEE:
```{r final_model}
hyp_model_race =
geeglm(evictions ~
offset(log(renter_occupied_households)) +
years_since_2010 +
rent_burden + density +
black + asian + aian + nhpi + hisp,
data = joined_data_bklyn_nomissing,
id = geoid,
family = poisson,
corstr = "ar1")
hyp_model_race %>%
broom::tidy() %>%
knitr::kable()
```
Here, we can see that all hypothesized predictors are significantly associated with eviction rate in Brooklyn except for American Indian / Alaska Native race (p = 0.25) at the 5% level of significance.
Since we used a Poisson (log) link function, we should exponentiate our parameter estimates to get an interpretable estimated measure of effect, interpreted as an estimated **risk ratio**.
For example, to interpret our parameter estimate for `black`, we would exponentiate our parameter estimate, 0.01887 (e^0.01887 = 1.0190). This is equal to our estimated risk ratio. We can thus say that the population-average eviction rate (or risk of eviction) is expected to be 1.0190 times higher for each one percentage-point increase in Brooklyn's Black population, adjusting for percentage of other non-White race/ethnicity (Asian, American Indian / Alaska Native, Native Hawaiian / Pacific Islander, and Hispanic), rent burden, population density, and years since 2010.
To make our risk estimates even more meaningful, we can interpret risk ratios for multiple units. For example, given the above, e^0.01887*10 = 1.2076; for each 10 percent increase in Brooklyn's Black population, we expect the eviction rate to increase by 21 percent (RRest = 1.21), adjusting for other factors listed in the previous paragraph. Similarly, on average, for every ten percent increase in rent burden, eviction rates are expected to increase 19.29 percent (RRest = 1.1929), adjusting for other factors included in model.
_**See further discussion in our project report.**_
## Interactive prediction model
We want to enable readers to more easily understand which predictors affect eviction rate and to what extents. Thus, using the race-based GEE model we selected from our five compared models, we created an interactive prediction tool that allows readers to adjust levels of various predictors to see the expected change in eviction rate (evictions per 100 renter-occupied households). Further than the analyses above, readers can also get a sense of these effects in each neighborhood of Brooklyn.
Visit our [Prediction Tool](https://willsimmons.shinyapps.io/model_building) to learn more. (You can also find the link in the navbar above.)