New aggregation function #99

Closed
nmmarquez opened this issue Apr 28, 2021 · 1 comment · Fixed by #105
Labels
enhancement New feature or request

Comments

@nmmarquez
Contributor

Currently we have one aggregation function, whose primary purpose is to aggregate prison data and compare it against the Marshall Project's reported numbers for the variables we both report. After reviewing, I think we should keep that function and add a new one that calculates aggregates by geographic region and by the jurisdiction that reports on, or is responsible for, each facility.

An example

calc_jurisdiction_agg(state = TRUE)
Date        State    Jurisdiction  Measure              UCLA
2021-04-27  Alabama  State         Residents.Confirmed  10
2021-04-27  Alabama  County        Residents.Confirmed  19
...         ...      ...           ...                  ...
2021-04-27  Wyoming  Youth         Staff.Deaths         1
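
A minimal base R sketch of the aggregation described above. The `fac` data frame, its `value` column, and every number in it are made up for illustration; the real facility-level data and the eventual `calc_jurisdiction_agg()` implementation may look quite different.

```r
# Toy facility-level data in long format (hypothetical values).
fac <- data.frame(
  Date = as.Date("2021-04-27"),
  State = c("Alabama", "Alabama", "Alabama", "Wyoming"),
  Jurisdiction = c("State", "State", "County", "Youth"),
  Measure = c("Residents.Confirmed", "Residents.Confirmed",
              "Residents.Confirmed", "Staff.Deaths"),
  value = c(4, 6, 19, 1),
  stringsAsFactors = FALSE
)

# Sum each measure within Date x State x Jurisdiction.
# na.rm = TRUE means a facility with a missing value contributes nothing
# to the total (an assumption; other NA handling may be preferable).
agg <- aggregate(
  fac["value"],
  by = fac[c("Date", "State", "Jurisdiction", "Measure")],
  FUN = sum, na.rm = TRUE
)
names(agg)[names(agg) == "value"] <- "UCLA"
agg <- agg[order(agg$State, agg$Jurisdiction, agg$Measure), ]
print(agg)
```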

Additionally, there should be a rates option that also calculates rates. The trouble is that there seem to be several denominators we could use: Population.Feb20, the most recent Residents.Population, or some external source that reports aggregated population totals. I think the function should take an option for choosing which one to use when calculating rates.

calc_jurisdiction_agg(state = TRUE, rates = TRUE, denom = "Population.Feb20")
Date        State    Jurisdiction  Measure              UCLA  Rate
2021-04-27  Alabama  State         Residents.Confirmed  10    0.05
2021-04-27  Alabama  County        Residents.Confirmed  19    0.09
...         ...      ...           ...                  ...   ...
2021-04-27  Wyoming  Youth         Staff.Deaths         1     0.01
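
One way the `denom` option could work, sketched in base R. The helper name `add_rates()`, the `pops` lookup table, and all numbers are assumptions for illustration, not existing package code. The sketch only covers resident measures; staff measures would presumably need a staff denominator.

```r
# Hypothetical helper: attach a Rate column using a user-chosen denominator.
add_rates <- function(agg, pops,
                      denom = c("Population.Feb20", "Residents.Population")) {
  denom <- match.arg(denom)  # rejects denominators that are not supported
  out <- merge(agg, pops[c("State", "Jurisdiction", denom)],
               by = c("State", "Jurisdiction"), all.x = TRUE)
  d <- out[[denom]]
  # Leave the rate missing when the denominator is missing or not positive.
  out$Rate <- ifelse(!is.na(d) & d > 0, out$UCLA / d, NA_real_)
  out[[denom]] <- NULL
  out
}

# Toy aggregated counts and population totals (made-up values).
agg <- data.frame(
  State = c("Alabama", "Alabama"),
  Jurisdiction = c("State", "County"),
  Measure = "Residents.Confirmed",
  UCLA = c(10, 19),
  stringsAsFactors = FALSE
)
pops <- data.frame(
  State = c("Alabama", "Alabama"),
  Jurisdiction = c("State", "County"),
  Population.Feb20 = c(200, 190),
  Residents.Population = c(250, 200),
  stringsAsFactors = FALSE
)

rates_feb20 <- add_rates(agg, pops, denom = "Population.Feb20")
rates_recent <- add_rates(agg, pops, denom = "Residents.Population")
```

Using `match.arg()` keeps the set of allowed denominators explicit, so adding an external source later would mean extending one vector.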

Here is what I'm thinking the function documentation looks like:

#' Read data extracted by webscraper
#'
#' Reads either time series or latest data from the web scraper runs.
#'
#' @param state logical, return state level data
#' @param all_dates logical, get all dates available or only most recent data?
#' @param window int, how far to go back (in days) to look for values from a given
#' facility to populate NAs for ALL scraped variables. Used when all_dates is FALSE
#' @param window_pop int, how far to go back (in days) to look for values from a given
#' facility to populate NAs in Residents.Population. Used when coalesce_pop is TRUE
#' @param coalesce_func function, how to combine redundant facilities
#' @param rates logical, include rates in the return
#' @param denom character, which denominator to use for rates default 'Population.Feb20'
#' @param wide_data logical, return wide data as opposed to long
#'
#' @return dataframe with scraped data
#'
#' @examples
#' \dontrun{
#' read_scrape_data(all_dates = FALSE)
#' read_scrape_data(all_dates = TRUE, state = "Wyoming")
#' }
#'
#' @export

Some outstanding issues I have.

  1. I think it would be useful to have some external population sources as well, ones that are already aggregated, but I'm curious to hear if y'all feel the same.
  2. Do we still want to do MP replacement for prison data here? I'm leaning yes.
  3. Should we include a facility count (difficult to do statewide, but we should be able to do it)? Other columns?
@nmmarquez nmmarquez added the enhancement New feature or request label Apr 28, 2021
@erika-tyagi
Member

Overall, this makes a LOT of sense to me, thank you so much for scoping!!

> 1. I think it would be useful to have some external population sources as well, ones that are already aggregated, but I'm curious to hear if y'all feel the same.

I agree about the external sources, but I'm curious which ones you had in mind. I'd love to integrate MP staff population data, because even imperfect staff rates still feel really valuable to me. One open question for me is whether the user should specify just one source or whether there should be some coalesce hierarchy across sources.
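
To make the "coalesce hierarchy" idea concrete, here is a rough base R sketch. The helper `coalesce_denoms()`, the `External.Population` column, and the numbers are all hypothetical; the ordering of sources is only an example, not a recommendation.

```r
# Hypothetical helper: take the first non-missing population per row,
# walking the sources in priority order, and record which source was used.
coalesce_denoms <- function(df, sources) {
  out <- rep(NA_real_, nrow(df))
  used <- rep(NA_character_, nrow(df))
  for (src in sources) {
    fill <- is.na(out) & !is.na(df[[src]])
    out[fill] <- df[[src]][fill]
    used[fill] <- src
  }
  data.frame(Denominator = out, Denominator.Source = used,
             stringsAsFactors = FALSE)
}

# Toy population table with gaps (made-up values).
pops <- data.frame(
  Residents.Population = c(250, NA, NA),
  Population.Feb20 = c(200, 190, NA),
  External.Population = c(260, 195, 40)
)

denoms <- coalesce_denoms(
  pops,
  sources = c("Residents.Population", "Population.Feb20", "External.Population")
)
print(denoms)
```

Passing a single source name reduces this to the "user specifies just one" behavior, so both options could share one code path.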

> 2. Do we still want to do MP replacement for prison data here? I'm leaning yes.

I also think yes. The vast majority of our data will inevitably be from state and federal prisons, and I think there's a lot of value in making that as comprehensive as possible. Beyond filling in gaps, I think that's also really useful for places like Ohio or Texas.

> 3. Should we include a facility count (difficult to do statewide, but we should be able to do it)? Other columns?

I personally don't think a facility number count is super useful. One thing that I think might be nice as a (non-default!) option in the long table would be an explicit column for the various population options (so the explicit denominators in addition to the rate column).

A few other initial thoughts:

  1. Confirming the list of jurisdiction values: state, county, federal, immigration, psychiatric, and youth...?
  2. Would the wide version with rates just have two columns per metric (e.g. Residents.Confirmed and Residents.Confirmed.Rate)?
  3. Thoughts on dropping some of the sparser variables (i.e. Negative, Pending, Quarantine) from here? I know it's programmatically no different, but I like the idea of reporting a leaner set of variables.
  4. Would we still want vaccine collapsing to happen?
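
For the wide-format question above, a small base R sketch of what "two columns per metric" could look like. The `long` data frame and its values are made up, and the `.Rate` suffix simply follows the naming floated in the question; none of this is existing package behavior.

```r
# Toy long-format output with counts and rates (hypothetical values).
long <- data.frame(
  Date = as.Date("2021-04-27"),
  State = "Alabama",
  Jurisdiction = c("State", "State", "County", "County"),
  Measure = c("Residents.Confirmed", "Residents.Deaths",
              "Residents.Confirmed", "Residents.Deaths"),
  UCLA = c(10, 2, 19, 1),
  Rate = c(0.05, 0.01, 0.1, 0.005),
  stringsAsFactors = FALSE
)

# Spread each measure into a count column and a rate column.
wide <- reshape(
  long,
  idvar = c("Date", "State", "Jurisdiction"),
  timevar = "Measure",
  v.names = c("UCLA", "Rate"),
  direction = "wide"
)

# reshape() names columns "UCLA.<Measure>" and "Rate.<Measure>";
# rename them to "<Measure>" and "<Measure>.Rate".
nm <- names(wide)
nm <- sub("^UCLA\\.", "", nm)
nm <- sub("^Rate\\.(.*)$", "\\1.Rate", nm)
names(wide) <- nm
rownames(wide) <- NULL
print(wide)
```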

I'll probably have more thoughts when we chat tomorrow, but yay thank you again!!
