New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update: new web agg write #105
Changes from 4 commits
135f1df
0581f09
e0c77c0
8a6eab3
e7e96df
de641fd
f249797
9336bee
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,169 @@ | ||
#' Alternate Aggregate UCLA data for website groupings | ||
#' | ||
#' Reads the UCLA aggregates counts for states for most recent | ||
#' data within a given window and reports either state level data or national | ||
#' data. States include values for the 50 states broken down by carceral type, | ||
#' prison, ICE, Federal, Juvenile, Psychiatric, and county. For prisons, data | ||
#' from the Marshall project is also incorporated. | ||
#' | ||
#' @param window integer, the day range of acceptable data to pull from, ignored | ||
#' if all dates is true | ||
#' @param all_dates logical, get time series data rather than just latest counts | ||
#' @param week_grouping logical, use weekly grouping for past data? else monthly | ||
#' | ||
#' @return data frame with aggregated counts at the state level by web groups | ||
#' | ||
#' @examples | ||
#' \dontrun{ | ||
#' alt_aggregate_counts() | ||
#' } | ||
#' @export | ||
|
||
alt_aggregate_counts <- function( | ||
window = 31, all_dates = FALSE, week_grouping = TRUE){ | ||
|
||
# How to round data when doing all dates | ||
round_ <- ifelse(week_grouping, "week", "month") | ||
|
||
# read in ucla data and do the appropriate grouping | ||
fac_long_df <- read_scrape_data( | ||
window = window, all_dates = all_dates, wide_data = FALSE) %>% | ||
mutate(Web.Group = case_when( | ||
Jurisdiction == "immigration" ~ "ICE", | ||
Jurisdiction == "federal" ~ "Federal", | ||
Age == "Juvenile" ~ "Juvenile", | ||
Jurisdiction == "state" ~ "Prison", | ||
Jurisdiction == "psychiatric" ~ "Psychiatric", | ||
Jurisdiction == "cou nty" ~ "County", | ||
TRUE ~ NA_character_ | ||
)) %>% | ||
filter(State != "Not Available") %>% | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thoughts on leaving these in and just keeping the state as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I like this idea but I'm not sure, practically, whether this matters. If state = "Not Available", would the numbers show up anywhere on the website? I don't think so There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just checking, since these are aggregated they wouldn't show up here right? this is only for "facilities" There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. that's my understanding! |
||
select( | ||
Name, Date, State, Measure, Web.Group, value, Population.Feb20) %>% | ||
mutate(Rate = value/Population.Feb20) | ||
|
||
# pull in the comparable MP data | ||
mp_df <- read_mpap_data( | ||
all_dates = all_dates, window = window) %>% | ||
filter(State != "Federal")%>% | ||
tidyr::pivot_longer( | ||
-(State:Date), names_to = "Measure", values_to = "MP") | ||
|
||
mp_pop_df <- read_mpap_pop_data() %>% | ||
tidyr::pivot_longer( | ||
-State, names_to = "Group", values_to = "Population.Feb20") %>% | ||
mutate(Group = ifelse( | ||
str_starts(Group, "Staff"), "Staff", "Residents")) | ||
|
||
if(all_dates){ | ||
mp_df <- mp_df %>% | ||
filter(!is.na(Date)) %>% | ||
mutate(Date = lubridate::floor_date(Date, round_)) %>% | ||
group_by(State, Date, Measure) %>% | ||
filter(MP == max_na_rm(MP)) %>% | ||
filter(1:n() == 1) %>% | ||
ungroup() | ||
} | ||
|
||
pop_threshold <- .8 | ||
|
||
# aggregate the data together | ||
if(all_dates){ | ||
state_df <- fac_long_df %>% | ||
mutate(Date = lubridate::floor_date(Date, round_)) %>% | ||
rename(UCLA = value) %>% | ||
filter(!is.na(UCLA)) %>% | ||
group_by( | ||
State, Date, Measure, Web.Group, Name, Population.Feb20) %>% | ||
summarize(UCLA = max_na_rm(UCLA), .groups = "drop_last") %>% | ||
mutate(has_statewide = "STATEWIDE" %in% Name) %>% | ||
# if state wide and other counts exist for a measure | ||
# only take max date | ||
filter(!(has_statewide) | Date == max(Date)) %>% | ||
mutate(has_statewide = "STATEWIDE" %in% Name) %>% | ||
# if state wide and other counts still exist for a measure | ||
# only use statewide | ||
filter(!(has_statewide & Name != "STATEWIDE")) %>% | ||
group_by(State, Date, Web.Group, Measure) %>% | ||
mutate(rem_thresh = | ||
mean(!is.na(Population.Feb20)) < pop_threshold) %>% | ||
mutate(Population.Feb20 = ifelse( | ||
rem_thresh, NA, Population.Feb20)) %>% | ||
select(-rem_thresh) %>% | ||
mutate(Rate = UCLA/Population.Feb20) %>% | ||
summarise( | ||
UCLA = sum_na_rm(UCLA), | ||
Rate = sum_na_rm(Rate*Population.Feb20)/ | ||
sum_na_rm(Population.Feb20), | ||
Date = max(Date), .groups = "drop") %>% | ||
mutate(Rate = ifelse(str_starts(Measure, "Staff"), NA, Rate)) %>% | ||
filter(!str_ends(Measure, "Population")) | ||
|
||
pri_df <- state_df %>% | ||
filter(Web.Group == "Prison") %>% | ||
full_join(mp_df, by = c("State", "Date", "Measure")) | ||
Comment on lines
+105
to
+107
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want the |
||
|
||
}else{ | ||
state_df <- fac_long_df %>% | ||
rename(UCLA = value) %>% | ||
filter(!is.na(UCLA)) %>% | ||
group_by(State, Measure, Web.Group) %>% | ||
mutate(has_statewide = "STATEWIDE" %in% Name) %>% | ||
# if state wide and other counts exist for a measure only take more | ||
# recently scraped data | ||
filter(!(has_statewide) | Date == max(Date)) %>% | ||
mutate(has_statewide = "STATEWIDE" %in% Name) %>% | ||
# if state wide and other counts still exist for a measure only | ||
# use statewide measures | ||
filter(!(has_statewide & Name != "STATEWIDE")) %>% | ||
group_by(State, Measure, Web.Group) %>% | ||
mutate(rem_thresh = | ||
mean(!is.na(Population.Feb20)) < pop_threshold) %>% | ||
mutate(Population.Feb20 = ifelse( | ||
rem_thresh, NA, Population.Feb20)) %>% | ||
select(-rem_thresh) %>% | ||
summarise( | ||
UCLA = sum_na_rm(UCLA), | ||
Rate = sum_na_rm(Rate*Population.Feb20)/ | ||
sum_na_rm(Population.Feb20), | ||
Date = max(Date), .groups = "drop") %>% | ||
mutate(Rate = ifelse(str_starts(Measure, "Staff"), NA, Rate)) %>% | ||
filter(!str_ends(Measure, "Population")) | ||
|
||
pri_df <- state_df %>% | ||
filter(Web.Group == "Prison") %>% | ||
full_join(select(mp_df, -Date), by = c("State", "Measure")) | ||
} | ||
|
||
# combine MP and UCLA data | ||
out_agg_df <- pri_df %>% | ||
mutate(Group = ifelse( | ||
str_starts(Measure, "Staff"), "Staff", "Residents")) %>% | ||
left_join(mp_pop_df, by = c("State", "Group")) %>% | ||
select(-Group) %>% | ||
mutate(Val = case_when( | ||
is.na(UCLA) & is.na(MP) ~ NA_real_, | ||
is.na(UCLA) ~ MP, | ||
is.na(MP) ~ UCLA, | ||
UCLA >= MP ~ UCLA, | ||
TRUE ~ MP | ||
)) %>% | ||
mutate(Rate = ifelse(is.na(Rate), UCLA/Population.Feb20, Rate)) %>% | ||
mutate( | ||
Rate = case_when( | ||
is.na(UCLA) & is.na(MP) ~ NA_real_, | ||
is.na(UCLA) ~ MP/Population.Feb20, | ||
is.na(MP) ~ Rate, | ||
UCLA >= MP ~ Rate, | ||
TRUE ~ MP/Population.Feb20 | ||
) | ||
) %>% | ||
select(-MP, -UCLA, -Population.Feb20) %>% | ||
bind_rows( | ||
state_df %>% | ||
filter(Web.Group != "Prison") %>% | ||
rename(Val = UCLA) | ||
) | ||
|
||
return(out_agg_df) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
#' A re-coding of the max function which returns NA if all values are NA | ||
#' | ||
#' A re-coding of the max function which returns NA if all values are NA. | ||
#' Normally max(na.rm=TRUE) will return -Inf if all values are NA. | ||
#' | ||
#' @param x numeric vector | ||
#' @return numeric vector of length 1 | ||
#' | ||
#' @examples | ||
#' max_na_rm(c(1:2, NA)) | ||
#' max_na_rm(c(NA, NA)) | ||
#' | ||
#' @export | ||
|
||
max_na_rm <- function(x){ | ||
if(all(is.na(x))){ | ||
return(NA) | ||
} | ||
max(x, na.rm = TRUE) | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
#' Get Feb 20, 2020 population data for state prisons from the MP | ||
#' | ||
#' Reads the MP dataset and for each prison jurisdiction pulls the staff and | ||
#' resident population data closest to Feb 20, 2020. | ||
#' | ||
#' @return data frame with jurisdiction level pop data from MP | ||
#' | ||
#' @examples | ||
#' \dontrun{ | ||
#' read_staff_popfeb20() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this should be |
||
#' } | ||
#' @export | ||
|
||
|
||
read_mpap_pop_data <- function(){ | ||
pri_pop_df <- "https://raw.githubusercontent.com/themarshallproject/" %>% | ||
str_c("COVID_prison_data/master/data/prison_populations.csv") %>% | ||
readr::read_csv(col_types = readr::cols()) %>% | ||
mutate(Date = lubridate::mdy(as_of_date)) %>% | ||
mutate(dtime = abs(as.numeric(Date - lubridate::ymd("2020-02-01")))) %>% | ||
group_by(name) %>% | ||
filter(dtime == min(dtime)) %>% | ||
filter(Date == min(Date)) %>% | ||
ungroup() %>% | ||
select(State = name, Population.Feb20 = pop) | ||
|
||
staff_pop_df <- "https://raw.githubusercontent.com/themarshallproject/" %>% | ||
str_c("COVID_prison_data/master/data/staff_populations.csv") %>% | ||
readr::read_csv(col_types = readr::cols()) %>% | ||
mutate(Date = lubridate::mdy(as_of_date)) %>% | ||
mutate(dtime = abs(as.numeric(Date - lubridate::ymd("2020-02-01")))) %>% | ||
group_by(name) %>% | ||
filter(dtime == min(dtime)) %>% | ||
filter(Date == min(Date)) %>% | ||
ungroup() %>% | ||
select(State = name, Staff.Population.Feb20 = pop) | ||
|
||
full_join(pri_pop_df, staff_pop_df, by = "State") | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
#' Get Feb 20, 2020 population data for applicable rows in the fac_data | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we use this function anywhere? // I think it's useful, but curious why it ended up here! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wanted to add it in if were to start using staff data denoms anywhere. Not using yet. |
||
#' | ||
#' Reads the UCLA dataset and for each facility pulls the staff population data | ||
#' closest to Feb 20, 2020 in the UCLA historical data. | ||
#' | ||
#' @return data frame with population data for staff by facility | ||
#' | ||
#' @examples | ||
#' \dontrun{ | ||
#' read_staff_popfeb20() | ||
#' } | ||
#' @export | ||
|
||
read_staff_popfeb20 <- function(){ | ||
read_scrape_data( | ||
window = window, all_dates = T, wide_data = FALSE) %>% | ||
filter(Measure == "Staff.Population") %>% | ||
filter(!is.na(Facility.ID) & !is.na(value)) %>% | ||
group_by(Facility.ID) %>% | ||
mutate(dtime = abs(as.numeric(Date - lubridate::ymd("2020-02-01")))) %>% | ||
filter(dtime == min(dtime)) %>% | ||
filter(Date == min(Date)) %>% | ||
ungroup() %>% | ||
select(Facility.ID, Name, State, Staff.Population.Feb20 = value) | ||
} |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
misspelling here on "county"