Errors in values when combining API calls #256

mbsabath · 2020-06-11T14:23:15Z

I'm using tidycensus version 0.9.9.5 and am getting incorrect values from the 2000 decennial census when requesting a large list of variables (from both sf1 and sf3). The call to the API is as follows:

library(tidycensus)
library(tidyr)  # for pivot_wider()

year <- 2000
if ("P004002" %in% dec_vars) {
  ## No idea why this is a problem, but we have to do this
  dec_data <- get_decennial("county", variables = setdiff(dec_vars, "P004002"), year = year)
  dec_data <- rbind(dec_data, get_decennial("county", variables = "P004002", year = year))
} else {
  dec_data <- get_decennial("county", variables = dec_vars, year = year)
}
dec_data <- pivot_wider(dec_data, id_cols = c(GEOID, NAME), names_from = variable, values_from = value)

Where dec_vars is the following vector:

dec_vars <- c("P004002", "P001001", "P087002", "P087001", "P087008", "P087009", "P087016", "P087017", "H076001", "P007003", "P007001", "P007002", "P007004", "P007005", "PCT025004", "PCT025005", "PCT025012", "PCT025013", "PCT025020", "PCT025021", "PCT025028", "PCT025029", "PCT025036", "PCT025037", "PCT025045", "PCT025046", "PCT025053", "PCT025054", "PCT025061", "PCT025062", "PCT025069", "PCT025070", "PCT025078", "PCT025079", "PCT025002", "PCT025043", "PCT025077", "PCT025035", "PCT025076", "P053001", "H004002", "H004001", "P013001", "P012003", "P012004", "P012005", "P012027", "P012028", "P012029", "P012001", "P012006", "P012007", "P012008", "P012009", "P012010", "P012011", "P012012", "P012013", "P012014", "P012030", "P012031", "P012032", "P012033", "P012034", "P012035", "P012036", "P012037", "P012038", "P012015", "P012016", "P012017", "P012018", "P012019", "P012039", "P012040", "P012041", "P012042", "P012043", "P012020", "P012021", "P012022", "P012023", "P012024", "P012025", "P012044", "P012045", "P012046", "P012047", "P012048", "P012049")

When we query on a smaller subset for the same year (we noticed the issue when looking at the race statistics) we get results that seem correct. The query for the correct results is as follows:

dec_data <- get_decennial("county", variables = c("P007001","P007002","P007003","P007004","P007005"), year = 2000)

mfherman · 2020-06-11T15:35:35Z

Thanks for the report, @mbsabath. I can confirm this error and here is a slightly pared down reprex illustrating the issue. Looks like this is happening here when the two calls are being merged. I'll see what we need to do to fix this.

tidycensus/R/census.R

Lines 285 to 297 in 02b1dcb

    
      if (length(variables) > 48) {
        l <- split(variables, ceiling(seq_along(variables) / 48))
        dat <- map(l, function(x) {
          d <- try(load_data_decennial(geography, x, key, year, sumfile, state, county, show_call = show_call),
                   silent = TRUE)
          # If sf1 fails, try to get it from sf3
          if (inherits(d, "try-error")) {
            d <- try(suppressMessages(load_data_decennial(geography, x, key, year, sumfile = "sf3", state, county, show_call = show_call)))
          }
          d
        }) %>%
          reduce(left_join, by = c("GEOID", "NAME"))
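For readers unfamiliar with the chunking idiom in the snippet above, here is a minimal base R sketch of what `split(variables, ceiling(seq_along(variables) / 48))` does. The variable names are placeholders invented for illustration, not real Census variables.

```r
# 100 placeholder variable names (hypothetical, for illustration only)
variables <- sprintf("VAR%03d", 1:100)

# Positions 1-48 map to group 1, 49-96 to group 2, 97-100 to group 3
l <- split(variables, ceiling(seq_along(variables) / 48))

# Number of variables in each chunk
lengths(l)
```

Each chunk is then requested on its own, so the "try sf1, fall back to sf3" decision in the snippet is made per chunk rather than once for the whole variable list.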

library(tidycensus)
library(dplyr)

dec_vars <- c("P001001", "P087002", "P087001", "P087008", "P087009", "P087016",
              "P087017", "H076001", "P007003", "P007001", "P007002", "P007004",
              "P007005", "PCT025004", "PCT025005", "PCT025012", "PCT025013",
              "PCT025020", "PCT025021", "PCT025028", "PCT025029", "PCT025036",
              "PCT025037", "PCT025045", "PCT025046", "PCT025053", "PCT025054",
              "PCT025061", "PCT025062", "PCT025069", "PCT025070", "PCT025078",
              "PCT025079", "PCT025002", "PCT025043", "PCT025077", "PCT025035",
              "PCT025076", "P053001", "H004002", "H004001", "P013001", "P012003",
              "P012004", "P012005", "P012027", "P012028", "P012029", "P012001")

dec_data <- get_decennial(
  geography = "state",
  state = "NY",
  variables = dec_vars,
  year = 2000
  )
#> Getting data from the 2000 decennial Census

dec_data_sm <- get_decennial(
  geography = "state",
  state = "NY",
  variables = c("P007001", "P007002", "P007004", "P007005"),
  year = 2000
  )
#> Getting data from the 2000 decennial Census

dec_data %>% 
  inner_join(dec_data_sm, by = c("GEOID", "NAME", "variable"))
#> # A tibble: 4 x 5
#>   GEOID NAME     variable  value.x  value.y
#>   <chr> <chr>    <chr>       <dbl>    <dbl>
#> 1 36    New York P007001  18976457 18976457
#> 2 36    New York P007002  16111441 12893689
#> 3 36    New York P007004   2791904    82461
#> 4 36    New York P007005     53637  1044976

Created on 2020-06-11 by the reprex package (v0.3.0)

walkerke · 2020-06-11T15:53:44Z

Thanks for filing! The issue here is that variable names are duplicated across SF1 and SF3 in the 2000 decennial Census. A brief example:

library(tidycensus)
library(tidyverse)

vars_sf1 <- load_variables(2000, "sf1", cache = TRUE)
vars_sf3 <- load_variables(2000, "sf3", cache = TRUE)

And to check:

> filter(vars_sf1, str_detect(name, "P007001"))
# A tibble: 1 x 3
  name    label      concept     
  <chr>   <chr>      <chr>       
1 P007001 RACE:Total P7. Race [8]
> 
> filter(vars_sf3, str_detect(name, "P007001"))
# A tibble: 1 x 3
  name    label            concept                            
  <chr>   <chr>            <chr>                              
1 P007001 Total population P7. Hispanic or Latino by Race [17]

get_decennial() tries to be helpful by guessing the summary file you want. If it encounters a variable not in SF1, it uses SF3 instead. This can cause problems, like in this case, if you are mixing variables from the two summary files.
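The fallback described above can be imitated without touching the API. The sketch below is only a toy model, not tidycensus code: `fetch()` and `fetch_with_fallback()` are hypothetical helpers, the two lookup vectors stand in for the summary files, and the P087001 value is made up. The two P007002 values are the ones from the reprex earlier in this thread, assigned to SF1 and SF3 on the assumption that the small request was served by SF1 and the large one fell back to SF3.

```r
# Toy stand-ins for the two summary files (not real API responses)
sf1 <- c(P007001 = 18976457, P007002 = 12893689)
sf3 <- c(P007001 = 18976457, P007002 = 16111441, P087001 = 100)  # P087001 value is made up

# Hypothetical fetcher: fails if any requested variable is absent,
# loosely mimicking a failed request
fetch <- function(vars, table) {
  missing <- setdiff(vars, names(table))
  if (length(missing) > 0) {
    stop("unknown variable(s): ", paste(missing, collapse = ", "))
  }
  table[vars]
}

# Try SF1 first; on failure, silently take the whole request from SF3
fetch_with_fallback <- function(vars) {
  out <- try(fetch(vars, sf1), silent = TRUE)
  if (inherits(out, "try-error")) out <- fetch(vars, sf3)
  out
}

small <- fetch_with_fallback(c("P007001", "P007002"))             # served by SF1
mixed <- fetch_with_fallback(c("P007001", "P007002", "P087001"))  # falls back to SF3

small[["P007002"]]  # SF1 value
mixed[["P007002"]]  # SF3 value under the same variable name
```

In this model, adding a single SF3-only variable to the request changes the value returned for P007002, because the whole request moves to the other summary file.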

@mfherman perhaps we should remove this behavior and throw an error message to avoid this issue, or maybe issue a warning so people know what they are getting.

mfherman · 2020-06-11T16:07:49Z

@walkerke Aha -- that makes much more sense than what I was seeing! I didn't realize the variable names were duplicated. Maybe it's safer to force the user to set the summary file explicitly. It is convenient to try SF3 for those vars not found in SF1, but if you don't know there are duplicate variable names (like me!) you get this unexpected result.

mbsabath · 2020-06-11T16:21:08Z

I like that approach. Writing code to split up a varlist into sf1 and sf3 variables is straightforward enough. The other option could be to do the check internally, assume sf1, and throw a warning for the duplicate variables saying that sf1 was assumed.
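A rough sketch of that kind of split, assuming the two name lists are available as character vectors (in practice they could come from the `name` column of `load_variables(2000, "sf1")` and `load_variables(2000, "sf3")`). `split_by_sumfile()` is a hypothetical helper, and the name lists below are small toy subsets, not the real contents of either summary file.

```r
# Hypothetical helper: assign each variable to a summary file, assuming SF1
# for names that exist in both and warning about them
split_by_sumfile <- function(vars, sf1_names, sf3_names) {
  in_sf1 <- vars %in% sf1_names
  in_sf3 <- vars %in% sf3_names
  ambiguous <- vars[in_sf1 & in_sf3]
  if (length(ambiguous) > 0) {
    warning("Assuming sf1 for variables found in both summary files: ",
            paste(ambiguous, collapse = ", "))
  }
  list(
    sf1 = vars[in_sf1],             # includes the ambiguous names
    sf3 = vars[!in_sf1 & in_sf3],   # names found only in SF3
    ambiguous = ambiguous,
    unknown = vars[!in_sf1 & !in_sf3]
  )
}

# Toy name lists for illustration only
sf1_names <- c("P001001", "P007001", "P007002")
sf3_names <- c("P007001", "P007002", "P087001", "PCT025004")

parts <- suppressWarnings(
  split_by_sumfile(c("P001001", "P007002", "P087001", "XXXXXX"), sf1_names, sf3_names)
)
parts$sf1
parts$sf3

# Each group could then be requested separately with an explicit summary file, e.g.
# get_decennial("county", variables = parts$sf3, year = 2000, sumfile = "sf3")
```

Keeping the ambiguous names in a separate element makes it easy to review them by hand before deciding which summary file was actually intended.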

walkerke · 2020-06-29T14:38:29Z

I pushed a solution to this with some commits this morning. I didn't want to introduce breaking changes in case people have functioning code that uses SF3 variables while leaving sumfile blank. Instead, get_decennial() now gives messages telling you which summary file you are using and an informative error message if you are mixing variables between summary files.

I still need to do some testing of this, however, as your above examples now run without error - possibly due to the way that get_decennial() splits then re-assembles calls with multiple variables, so I'm keeping this issue open for now.

mbsabath closed this as completed Jun 12, 2020

mbsabath reopened this Jun 12, 2020

kaseyzapatka mentioned this issue Jan 26, 2021

Continued 2000 Decennial sf1 api call issues #343

Closed

walkerke closed this as completed Feb 24, 2022
