Errors in values when combining API calls #256

Closed
mbsabath opened this issue Jun 11, 2020 · 5 comments

@mbsabath

I'm using tidycensus version 0.9.9.5 and am getting incorrect values from the 2000 decennial census when requesting a large list of variables (from both sf1 and sf3). The call to the API is as follows:

library(tidycensus)
library(tidyr)  # for pivot_wider()

year <- 2000
if ("P004002" %in% dec_vars) {
  ## No idea why this is a problem, but we have to do this
  dec_data <- get_decennial("county", variables = setdiff(dec_vars, "P004002"), year = year)
  dec_data <- rbind(dec_data, get_decennial("county", variables = "P004002", year = year))
} else {
  dec_data <- get_decennial("county", variables = dec_vars, year = year)
}
dec_data <- pivot_wider(dec_data, id_cols = c(GEOID, NAME), names_from = variable, values_from = value)

Where dec_vars is the following vector:

dec_vars <- c("P004002", "P001001", "P087002", "P087001", "P087008", "P087009", "P087016", "P087017", "H076001", "P007003", "P007001", "P007002", "P007004", "P007005", "PCT025004", "PCT025005", "PCT025012", "PCT025013", "PCT025020", "PCT025021", "PCT025028", "PCT025029", "PCT025036", "PCT025037", "PCT025045", "PCT025046", "PCT025053", "PCT025054", "PCT025061", "PCT025062", "PCT025069", "PCT025070", "PCT025078", "PCT025079", "PCT025002", "PCT025043", "PCT025077", "PCT025035", "PCT025076", "P053001", "H004002", "H004001", "P013001", "P012003", "P012004", "P012005", "P012027", "P012028", "P012029", "P012001", "P012006", "P012007", "P012008", "P012009", "P012010", "P012011", "P012012", "P012013", "P012014", "P012030", "P012031", "P012032", "P012033", "P012034", "P012035", "P012036", "P012037", "P012038", "P012015", "P012016", "P012017", "P012018", "P012019", "P012039", "P012040", "P012041", "P012042", "P012043", "P012020", "P012021", "P012022", "P012023", "P012024", "P012025", "P012044", "P012045", "P012046", "P012047", "P012048", "P012049")

When we query on a smaller subset for the same year (we noticed the issue when looking at the race statistics) we get results that seem correct. The query for the correct results is as follows:

dec_data <- get_decennial("county", variables = c("P007001","P007002","P007003","P007004","P007005"), year = 2000)

@mfherman
Collaborator

Thanks for the report, @mbsabath. I can confirm this error; a slightly pared-down reprex illustrating the issue follows the excerpt below. It looks like this happens in the code referenced here, when the two calls are merged. I'll see what we need to do to fix this.

tidycensus/R/census.R

Lines 285 to 297 in 02b1dcb

if (length(variables) > 48) {
  l <- split(variables, ceiling(seq_along(variables) / 48))
  dat <- map(l, function(x) {
    d <- try(load_data_decennial(geography, x, key, year, sumfile, state, county, show_call = show_call),
             silent = TRUE)
    # If sf1 fails, try to get it from sf3
    if (inherits(d, "try-error")) {
      d <- try(suppressMessages(load_data_decennial(geography, x, key, year, sumfile = "sf3", state, county, show_call = show_call)))
    }
    d
  }) %>%
    reduce(left_join, by = c("GEOID", "NAME"))

library(tidycensus)
library(dplyr)

dec_vars <- c("P001001", "P087002", "P087001", "P087008", "P087009", "P087016",
              "P087017", "H076001", "P007003", "P007001", "P007002", "P007004",
              "P007005", "PCT025004", "PCT025005", "PCT025012", "PCT025013",
              "PCT025020", "PCT025021", "PCT025028", "PCT025029", "PCT025036",
              "PCT025037", "PCT025045", "PCT025046", "PCT025053", "PCT025054",
              "PCT025061", "PCT025062", "PCT025069", "PCT025070", "PCT025078",
              "PCT025079", "PCT025002", "PCT025043", "PCT025077", "PCT025035",
              "PCT025076", "P053001", "H004002", "H004001", "P013001", "P012003",
              "P012004", "P012005", "P012027", "P012028", "P012029", "P012001")

dec_data <- get_decennial(
  geography = "state",
  state = "NY",
  variables = dec_vars,
  year = 2000
  )
#> Getting data from the 2000 decennial Census

dec_data_sm <- get_decennial(
  geography = "state",
  state = "NY",
  variables = c("P007001", "P007002", "P007004", "P007005"),
  year = 2000
  )
#> Getting data from the 2000 decennial Census

dec_data %>% 
  inner_join(dec_data_sm, by = c("GEOID", "NAME", "variable"))
#> # A tibble: 4 x 5
#>   GEOID NAME     variable  value.x  value.y
#>   <chr> <chr>    <chr>       <dbl>    <dbl>
#> 1 36    New York P007001  18976457 18976457
#> 2 36    New York P007002  16111441 12893689
#> 3 36    New York P007004   2791904    82461
#> 4 36    New York P007005     53637  1044976

Created on 2020-06-11 by the reprex package (v0.3.0)
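
As a side note on the census.R excerpt above, here is a minimal sketch of what its chunking expression does to a variable vector just over the 48-variable threshold. The variable names are placeholders invented for illustration, not real Census codes.

```r
# Placeholder names standing in for 49 requested variables (hypothetical).
variables <- sprintf("VAR%03d", 1:49)

# Same expression as in the census.R excerpt: groups of at most 48 names.
chunks <- split(variables, ceiling(seq_along(variables) / 48))

# Number of variables that end up in each separate request.
chunk_sizes <- lengths(chunks)
print(chunk_sizes)
```

Each chunk is requested on its own, so, going by the excerpt, each one passes through its own try-then-fall-back-to-sf3 step before the results are joined.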

@walkerke
Owner

Thanks for filing! The issue here is that variable names are duplicated across SF1 and SF3 in the 2000 decennial Census. A brief example:

library(tidycensus)
library(tidyverse)

vars_sf1 <- load_variables(2000, "sf1", cache = TRUE)
vars_sf3 <- load_variables(2000, "sf3", cache = TRUE)

And to check:

> filter(vars_sf1, str_detect(name, "P007001"))
# A tibble: 1 x 3
  name    label      concept     
  <chr>   <chr>      <chr>       
1 P007001 RACE:Total P7. Race [8]
> 
> filter(vars_sf3, str_detect(name, "P007001"))
# A tibble: 1 x 3
  name    label            concept                            
  <chr>   <chr>            <chr>                              
1 P007001 Total population P7. Hispanic or Latino by Race [17]

get_decennial() tries to be helpful by guessing the summary file you want. If it encounters a variable not in SF1, it uses SF3 instead. This can cause problems, like in this case, if you are mixing variables from the two summary files.
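
To list every name that appears in both summary files rather than checking one at a time, something like the following sketch may help. It uses toy data frames in place of the real load_variables() results: only the P007001 rows come from the output above, and the other rows are invented placeholders.

```r
# Toy stand-ins for load_variables(2000, "sf1") and load_variables(2000, "sf3").
# Only the P007001 rows are taken from the thread; the rest are hypothetical.
vars_sf1_toy <- data.frame(
  name = c("P007001", "SF1ONLY01"),
  concept = c("P7. Race [8]", "hypothetical SF1-only table"),
  stringsAsFactors = FALSE
)
vars_sf3_toy <- data.frame(
  name = c("P007001", "SF3ONLY01"),
  concept = c("P7. Hispanic or Latino by Race [17]", "hypothetical SF3-only table"),
  stringsAsFactors = FALSE
)

# Inner join on name: keeps only names present in both files and puts
# each file's concept side by side so the differing meanings are visible.
shared <- merge(vars_sf1_toy, vars_sf3_toy, by = "name", suffixes = c("_sf1", "_sf3"))
print(shared)
```

Running the same merge() on the real vars_sf1 and vars_sf3 tables should show which requested names need an explicit summary file.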

@mfherman perhaps we should remove this behavior and throw an error message to avoid this issue, or maybe issue a warning so people know what they are getting.

@mfherman
Collaborator

@walkerke Aha -- that makes much more sense than what I was seeing! I didn't realize the variable names were duplicated. Maybe it's safer to force the user to set the summary file explicitly. It is convenient to try SF3 for those vars not found in SF1, but if you don't know there are duplicate variable names (like me!) you get this unexpected result.

@mbsabath
Author

I like that approach. Writing code to split up a varlist into sf1 and sf3 variables is straightforward enough. The other option could be to do the check internally, assume sf1, and throw a warning for the duplicate variables saying that sf1 was assumed.
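
A rough sketch of that splitting step, assuming the name lists come from load_variables(); the helper split_by_sumfile() and the toy names are hypothetical, and only P007001 is known from this thread to appear in both files.

```r
# Hypothetical helper: classify requested variables by which summary file
# lists them. sf1_names / sf3_names would come from load_variables()$name.
split_by_sumfile <- function(variables, sf1_names, sf3_names) {
  in_sf1 <- variables %in% sf1_names
  in_sf3 <- variables %in% sf3_names
  list(
    sf1_only  = variables[in_sf1 & !in_sf3],
    sf3_only  = variables[in_sf3 & !in_sf1],
    ambiguous = variables[in_sf1 & in_sf3],   # caller must pick a file
    unknown   = variables[!in_sf1 & !in_sf3]  # found in neither list
  )
}

# Toy name lists (invented, except P007001 which the thread shows in both).
sf1_names <- c("P007001", "AAA001")
sf3_names <- c("P007001", "BBB001")

groups <- split_by_sumfile(
  c("AAA001", "BBB001", "P007001", "ZZZ999"),
  sf1_names, sf3_names
)
str(groups)
```

Each group could then go to its own get_decennial() call with sumfile set explicitly, with the ambiguous names assigned to whichever file holds the intended table.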

mbsabath reopened this Jun 12, 2020
@walkerke
Owner

I pushed a solution to this with some commits this morning. I didn't want to introduce breaking changes in case people have functioning code that uses SF3 variables while leaving sumfile blank. Instead, get_decennial() now gives messages telling you which summary file you are using and an informative error message if you are mixing variables between summary files.

I still need to test this, however, since your examples above now run without error, possibly because of the way get_decennial() splits and then reassembles calls with many variables. I'm keeping this issue open for now.
