New aggregation function #99

nmmarquez · 2021-04-28T01:05:10Z

Currently we have one aggregation function whose primary purpose is to aggregate prison data and compare it to the Marshall Project's reported numbers for the variables we both report on. After reviewing, I think we should keep that function and add a new function that calculates aggregates by geographic region and by the jurisdiction that reports on or is responsible for each facility.

An example:

calc_jurisdiction_agg(state = TRUE)

Date	State	Jurisdiction	Measure	UCLA
2021-04-27	Alabama	State	Residents.Confirmed	10
2021-04-27	Alabama	County	Residents.Confirmed	19
...	...	...	...	...
2021-04-27	Wyoming	Youth	Staff.Deaths	1
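To make the proposed output concrete, here is a minimal base R sketch of the grouping step. It assumes a long-format, facility-level data frame with a `value` column; the input frame and its numbers are made up for illustration, and this is not the package's implementation.

```r
# Hypothetical facility-level input in long format (values are illustrative only).
fac <- data.frame(
  Date = as.Date("2021-04-27"),
  State = c("Alabama", "Alabama", "Alabama", "Wyoming"),
  Jurisdiction = c("State", "State", "County", "Youth"),
  Measure = c("Residents.Confirmed", "Residents.Confirmed",
              "Residents.Confirmed", "Staff.Deaths"),
  value = c(4, 6, 19, 1),
  stringsAsFactors = FALSE
)

# Sum facility values within each Date / State / Jurisdiction / Measure group.
agg <- aggregate(value ~ Date + State + Jurisdiction + Measure,
                 data = fac, FUN = sum)

# Rename the summed column to match the proposed output.
names(agg)[names(agg) == "value"] <- "UCLA"
agg <- agg[order(agg$State, agg$Jurisdiction, agg$Measure), ]
```

Note that the formula interface of `aggregate()` drops rows with missing values by default, so how NAs should be handled is a separate decision.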

Additionally, there should be a rates option where we also calculate rates. The trouble is that there seem to be several options we could use as the denominator: Population.Feb20, the most recent Residents.Population, or some external source that reports aggregated population totals. I think there should be an option in the function to choose which one to use when calculating rates.

calc_jurisdiction_agg(state = TRUE, rates = TRUE, denom = "Population.Feb20")

Date	State	Jurisdiction	Measure	UCLA	Rate
2021-04-27	Alabama	State	Residents.Confirmed	10	.05
2021-04-27	Alabama	County	Residents.Confirmed	19	.09
...	...	...	...	...	...
2021-04-27	Wyoming	Youth	Staff.Deaths	1	.01
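One way the `denom` option could work is sketched below. The helper name `add_rates`, the `pops` lookup table, and all numbers are hypothetical; the sketch only covers resident measures, since staff rates would need a staff denominator.

```r
# Hypothetical helper: join a population table and divide by the chosen column.
add_rates <- function(agg, pops,
                      denom = c("Population.Feb20", "Residents.Population")) {
  denom <- match.arg(denom)  # errors on an unsupported denominator name
  out <- merge(agg, pops[, c("State", "Jurisdiction", denom)],
               by = c("State", "Jurisdiction"), all.x = TRUE)
  out$Rate <- out$UCLA / out[[denom]]
  out[[denom]] <- NULL  # drop the denominator so only Rate is added
  out
}

# Illustrative aggregated counts and population totals (made-up numbers).
agg_counts <- data.frame(
  State = "Alabama",
  Jurisdiction = c("State", "County"),
  Measure = "Residents.Confirmed",
  UCLA = c(10, 19),
  stringsAsFactors = FALSE
)
pops <- data.frame(
  State = "Alabama",
  Jurisdiction = c("State", "County"),
  Population.Feb20 = c(200, 190),
  Residents.Population = c(250, 200),
  stringsAsFactors = FALSE
)

rates_feb20 <- add_rates(agg_counts, pops, denom = "Population.Feb20")
rates_recent <- add_rates(agg_counts, pops, denom = "Residents.Population")
```

Using `match.arg()` keeps the default at `Population.Feb20` while rejecting unknown denominators early.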

Here is what I'm thinking the function documentation would look like:

#' Read data extracted by webscraper
#'
#' Reads either time series or latest data from the web scraper runs.
#'
#' @param state logical, return state level data
#' @param all_dates logical, get all dates available or only most recent data?
#' @param window int, how far to go back (in days) to look for values from a given
#' facility to populate NAs for ALL scraped variables. Used when all_dates is FALSE
#' @param window_pop int, how far to go back (in days) to look for values from a given
#' facility to populate NAs in Residents.Population. Used when coalesce_pop is TRUE
#' @param coalesce_func function, how to combine redundant facilities
#' @param rates logical, include rates in the return
#' @param denom character, which denominator to use for rates, default 'Population.Feb20'
#' @param wide_data logical, return wide data as opposed to long
#'
#' @return dataframe with scraped data
#'
#' @examples
#' \dontrun{
#' calc_jurisdiction_agg(state = TRUE)
#' calc_jurisdiction_agg(state = TRUE, rates = TRUE, denom = "Population.Feb20")
#' }
#'
#' @export

Some outstanding issues I have:

I think it would be useful to have some external population sources as well, ones that are already aggregated, but I'm curious to hear if y'all feel the same.
Do we still want to do MP replacement for prison data here? I'm leaning yes.
Should we include a facility number count (difficult to do with statewide but we should be able to do it)? Other columns?


erika-tyagi · 2021-04-28T21:21:32Z

Overall, this makes a LOT of sense to me, thank you so much for scoping!!

I think it would be useful to have some external population sources as well, ones that are already aggregated, but I'm curious to hear if y'all feel the same.

I agree about the external sources, but curious which ones you had in mind. I'd love to integrate MP staff population data, because even imperfect staff rates still feel really valuable to me. A question for me is deciding if the user should specify just one, or whether there should be some coalesce hierarchy across sources.
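The coalesce hierarchy idea could look roughly like the sketch below: walk the sources in priority order and take the first non-missing value per row. The helper name, the `External.Population` column, and the numbers are hypothetical placeholders, not existing package names.

```r
# Hypothetical helper: first non-missing population per row, in priority order.
coalesce_denoms <- function(pops, hierarchy) {
  out <- rep(NA_real_, nrow(pops))
  for (src in hierarchy) {
    fill <- is.na(out)            # rows still lacking a denominator
    out[fill] <- pops[[src]][fill]
  }
  out
}

# Illustrative population sources with gaps (made-up numbers).
pop_sources <- data.frame(
  Residents.Population = c(NA, 120, NA),
  Population.Feb20 = c(100, 130, NA),
  External.Population = c(90, 125, 80)  # hypothetical external source
)

denominator <- coalesce_denoms(
  pop_sources,
  hierarchy = c("Residents.Population", "Population.Feb20", "External.Population")
)
```

Letting the user pass a single-element `hierarchy` would cover the "specify just one" case with the same code path.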

Do we still want to do MP replacement for prison data here? I'm leaning yes.

I also think yes. The vast majority of our data will inevitably be from state and federal prisons, and I think there's a lot of value in making that as comprehensive as possible. Beyond filling in gaps, I think that's also really useful for places like Ohio or Texas.

Should we include a facility number count (difficult to do with statewide but we should be able to do it)? Other columns?

I personally don't think a facility number count is super useful. One thing that I think might be nice as a (non-default!) option in the long table would be an explicit column for the various population options (so the explicit denominators in addition to the rate column).

A few other initial thoughts:

Confirming the list of jurisdiction values: state, county, federal, immigration, psychiatric, and youth...?
Would the wide version with rates just have two columns per metric (e.g. Residents.Confirmed and Residents.Confirmed.Rate)?
Thoughts on dropping some of the sparser variables (i.e. Negative, Pending, Quarantine) from here? I know it's programmatically no different, but I like the idea of reporting a leaner set of variables.
Would we still want vaccine collapsing to happen?
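For the wide-format question and the idea of dropping sparser variables, here is a rough base R sketch. It assumes the long table has `UCLA` and `Rate` columns as in the proposal; the pattern used to drop variables and all values are illustrative assumptions.

```r
# Illustrative long table with rates (made-up numbers).
long <- data.frame(
  State = "Alabama",
  Jurisdiction = c("State", "State", "State", "County", "County"),
  Measure = c("Residents.Confirmed", "Residents.Deaths", "Residents.Pending",
              "Residents.Confirmed", "Residents.Deaths"),
  UCLA = c(10, 1, 3, 19, 2),
  Rate = c(0.05, 0.005, 0.015, 0.1, 0.01),
  stringsAsFactors = FALSE
)

# Drop the sparser measures before reshaping (pattern is an assumption).
lean <- long[!grepl("Negative|Pending|Quarantine", long$Measure), ]

# One row per State / Jurisdiction, with a count and a rate column per measure.
wide <- reshape(lean, idvar = c("State", "Jurisdiction"),
                timevar = "Measure", direction = "wide", sep = ".")

# reshape() names columns "UCLA.<Measure>" and "Rate.<Measure>";
# rename them to "<Measure>" and "<Measure>.Rate".
names(wide) <- sub("^UCLA\\.", "", names(wide))
names(wide) <- sub("^Rate\\.(.*)$", "\\1.Rate", names(wide))
```

This gives two columns per metric, for example `Residents.Confirmed` and `Residents.Confirmed.Rate`.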

I'll probably have more thoughts when we chat tomorrow, but yay thank you again!!

nmmarquez added the enhancement label Apr 28, 2021

erika-tyagi mentioned this issue May 3, 2021

Updating data files uclalawcovid19behindbars/covid19_behind_bars_scrapers#229

Closed

nmmarquez linked a pull request Jun 3, 2021 that will close this issue

update: new web agg write #105

Merged

nmmarquez closed this as completed in #105 Jun 29, 2021
