vec_unique() is sensitive to the marked encoding, where base::unique() is not #553

Closed
jennybc opened this issue Aug 30, 2019 · 5 comments · Fixed by #565

jennybc commented Aug 30, 2019

See below for a more minimal reprex.

Long backstory

The troublesome object comes from https://github.com/tidyverse/tidyr/issues/722. It presents as a problem with `tidyr::unnest()` but I've narrowed it down to a very weird phenomenon with `vctrs::vec_rbind()`. I've pulled out the relevant list-column here, as just a list of tibbles.

result_ok <- list(
  tibble::tibble(
    `Max Temp Flag` = character(0),
    `Min Temp (°C)` = character(0),
    `Min Temp Flag` = character(0),
    `Mean Temp (°C)` = character(0)
  ),
  tibble::tibble(
    `Max Temp Flag` = rep(NA_character_, 6),
    `Min Temp (°C)` = rep(NA_character_, 6),
    `Min Temp Flag` = rep(NA_character_, 6),
    `Mean Temp (°C)` = rep(NA_character_, 6)
  )
)

bad_df_file <- tempfile(fileext = ".rds")
curl::curl_download(
  "https://gist.github.com/paleolimbot/ec9b62b758ae57a5b4669fa771fc40a0/raw/e96b55f54d68b1cb3877bb358b28b99dc8836ceb/bad_df.rds",
  bad_df_file
)
result_bad <- readRDS(bad_df_file)[["result"]]

The only apparent difference is in the attributes of the tibble components,
i.e. the presence of flag_info.

str(result_ok)
#> List of 2
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    0 obs. of  4 variables:
#>   ..$ Max Temp Flag : chr(0) 
#>   ..$ Min Temp (°C) : chr(0) 
#>   ..$ Min Temp Flag : chr(0) 
#>   ..$ Mean Temp (°C): chr(0) 
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  4 variables:
#>   ..$ Max Temp Flag : chr [1:6] NA NA NA NA ...
#>   ..$ Min Temp (°C) : chr [1:6] NA NA NA NA ...
#>   ..$ Min Temp Flag : chr [1:6] NA NA NA NA ...
#>   ..$ Mean Temp (°C): chr [1:6] NA NA NA NA ...
str(result_bad)
#> List of 2
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    0 obs. of  4 variables:
#>   ..$ Max Temp Flag : chr(0) 
#>   ..$ Min Temp (°C) : chr(0) 
#>   ..$ Min Temp Flag : chr(0) 
#>   ..$ Mean Temp (°C): chr(0) 
#>   ..- attr(*, "flag_info")=Classes 'tbl_df', 'tbl' and 'data.frame': 0 obs. of  2 variables:
#>   .. ..$ flag       : chr(0) 
#>   .. ..$ description: chr(0) 
#>  $ :Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  4 variables:
#>   ..$ Max Temp Flag : chr [1:6] NA NA NA NA ...
#>   ..$ Min Temp (°C) : chr [1:6] NA NA NA NA ...
#>   ..$ Min Temp Flag : chr [1:6] NA NA NA NA ...
#>   ..$ Mean Temp (°C): chr [1:6] NA NA NA NA ...
#>   ..- attr(*, "flag_info")=Classes 'tbl_df', 'tbl' and 'data.frame': 13 obs. of  2 variables:
#>   .. ..$ flag       : chr [1:13] "A" "C" "E" "F" ...
#>   .. ..$ description: chr [1:13] "Accumulated" "Precipitation occurred, amount uncertain" "Estimated" "Accumulated and estimated" ...

In particular, the sub-tibble names appear to be the same.

nms_ok <- lapply(result_ok, names)
nms_bad <- lapply(result_bad, names)
identical(nms_ok, nms_bad)
#> [1] TRUE
identical(nms_ok[[1]], nms_ok[[2]])
#> [1] TRUE
identical(nms_bad[[1]], nms_bad[[2]])
#> [1] TRUE

But we get a different result from vec_rbind(). The columns with a special
character in the name aren’t correctly “matched up” with result_bad and we
get two copies.

vctrs::vec_rbind(!!!result_ok)   # 4 variables --> correct
#> # A tibble: 6 x 4
#>   `Max Temp Flag` `Min Temp (°C)` `Min Temp Flag` `Mean Temp (°C)`
#>   <chr>           <chr>           <chr>           <chr>           
#> 1 <NA>            <NA>            <NA>            <NA>            
#> 2 <NA>            <NA>            <NA>            <NA>            
#> 3 <NA>            <NA>            <NA>            <NA>            
#> 4 <NA>            <NA>            <NA>            <NA>            
#> 5 <NA>            <NA>            <NA>            <NA>            
#> 6 <NA>            <NA>            <NA>            <NA>
vctrs::vec_rbind(!!!result_bad)  # 6 variables --> wrong
#> # A tibble: 6 x 6
#>   `Max Temp Flag` `Min Temp (°C)` `Min Temp Flag` `Mean Temp (°C)`
#>   <chr>           <chr>           <chr>           <chr>           
#> 1 <NA>            <NA>            <NA>            <NA>            
#> 2 <NA>            <NA>            <NA>            <NA>            
#> 3 <NA>            <NA>            <NA>            <NA>            
#> 4 <NA>            <NA>            <NA>            <NA>            
#> 5 <NA>            <NA>            <NA>            <NA>            
#> 6 <NA>            <NA>            <NA>            <NA>            
#> # … with 2 more variables: `Min Temp (°C)` <chr>, `Mean Temp (°C)` <chr>

Stripping the flag_info attribute doesn’t rescue this, so the attribute seems irrelevant.

result_stripped <- result_bad
attr(result_stripped[[1]], "flag_info") <- NULL
attr(result_stripped[[2]], "flag_info") <- NULL
vctrs::vec_rbind(!!!result_stripped)
#> # A tibble: 6 x 6
#>   `Max Temp Flag` `Min Temp (°C)` `Min Temp Flag` `Mean Temp (°C)`
#>   <chr>           <chr>           <chr>           <chr>           
#> 1 <NA>            <NA>            <NA>            <NA>            
#> 2 <NA>            <NA>            <NA>            <NA>            
#> 3 <NA>            <NA>            <NA>            <NA>            
#> 4 <NA>            <NA>            <NA>            <NA>            
#> 5 <NA>            <NA>            <NA>            <NA>            
#> 6 <NA>            <NA>            <NA>            <NA>            
#> # … with 2 more variables: `Min Temp (°C)` <chr>, `Mean Temp (°C)` <chr>

The problem goes away with less challenging names, even without removing the
flag_info attribute.

result_renamed <- result_bad
names(result_renamed[[1]]) <- letters[1:4]
names(result_renamed[[2]]) <- letters[1:4]
vctrs::vec_rbind(!!!result_renamed)
#> # A tibble: 6 x 4
#>   a     b     c     d    
#>   <chr> <chr> <chr> <chr>
#> 1 <NA>  <NA>  <NA>  <NA> 
#> 2 <NA>  <NA>  <NA>  <NA> 
#> 3 <NA>  <NA>  <NA>  <NA> 
#> 4 <NA>  <NA>  <NA>  <NA> 
#> 5 <NA>  <NA>  <NA>  <NA> 
#> 6 <NA>  <NA>  <NA>  <NA>

Directly assigning the exact same names fixes it.

result_renamed_direct <- result_bad
nms <- c("Max Temp Flag", "Min Temp (°C)", "Min Temp Flag", "Mean Temp (°C)")
names(result_renamed_direct[[1]]) <- nms
names(result_renamed_direct[[2]]) <- nms
vctrs::vec_rbind(!!!result_renamed_direct)
#> # A tibble: 6 x 4
#>   `Max Temp Flag` `Min Temp (°C)` `Min Temp Flag` `Mean Temp (°C)`
#>   <chr>           <chr>           <chr>           <chr>           
#> 1 <NA>            <NA>            <NA>            <NA>            
#> 2 <NA>            <NA>            <NA>            <NA>            
#> 3 <NA>            <NA>            <NA>            <NA>            
#> 4 <NA>            <NA>            <NA>            <NA>            
#> 5 <NA>            <NA>            <NA>            <NA>            
#> 6 <NA>            <NA>            <NA>            <NA>

Re-assigning the same names this way does not fix it.

result_renamed_reassign <- result_bad
names(result_renamed_reassign[[1]]) <- names(result_bad[[1]])
names(result_renamed_reassign[[2]]) <- names(result_bad[[2]])
vctrs::vec_rbind(!!!result_renamed_reassign)
#> # A tibble: 6 x 6
#>   `Max Temp Flag` `Min Temp (°C)` `Min Temp Flag` `Mean Temp (°C)`
#>   <chr>           <chr>           <chr>           <chr>           
#> 1 <NA>            <NA>            <NA>            <NA>            
#> 2 <NA>            <NA>            <NA>            <NA>            
#> 3 <NA>            <NA>            <NA>            <NA>            
#> 4 <NA>            <NA>            <NA>            <NA>            
#> 5 <NA>            <NA>            <NA>            <NA>            
#> 6 <NA>            <NA>            <NA>            <NA>            
#> # … with 2 more variables: `Min Temp (°C)` <chr>, `Mean Temp (°C)` <chr>

BUT … re-assigning the same names with one level of indirection DOES fix it.

result_renamed_indirect <- result_bad
nms <- names(result_bad[[1]])
names(result_renamed_indirect[[1]]) <- nms
names(result_renamed_indirect[[2]]) <- nms
vctrs::vec_rbind(!!!result_renamed_indirect)
#> # A tibble: 6 x 4
#>   `Max Temp Flag` `Min Temp (°C)` `Min Temp Flag` `Mean Temp (°C)`
#>   <chr>           <chr>           <chr>           <chr>           
#> 1 <NA>            <NA>            <NA>            <NA>            
#> 2 <NA>            <NA>            <NA>            <NA>            
#> 3 <NA>            <NA>            <NA>            <NA>            
#> 4 <NA>            <NA>            <NA>            <NA>            
#> 5 <NA>            <NA>            <NA>            <NA>            
#> 6 <NA>            <NA>            <NA>            <NA>

I can’t see any differences in these names with rawToChar().
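
A minimal base R sketch of why a byte-level look can come up empty: the declared encoding is stored separately from the bytes, so charToRaw() can agree while Encoding() differs. The two strings below are made-up stand-ins for one of the column names, not the objects from the gist.

```r
# Hypothetical stand-ins for one column name carried with two different marks.
marked <- "Min Temp (\u00B0C)"   # the \u escape yields a string marked as UTF-8
unknown <- marked
Encoding(unknown) <- "unknown"   # same bytes, declared encoding dropped

# The bytes are the same, so a byte-level comparison shows no difference.
same_bytes <- identical(charToRaw(marked), charToRaw(unknown))

# The declared encoding is where the two strings differ.
marks <- Encoding(c(marked, unknown))

print(same_bytes)
print(marks)
```

If the names in result_bad behave like these stand-ins, calling Encoding() on them would be the quickest way to spot the difference.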

Putting some distinguishing data in makes it easier to see that the column
names aren’t being correctly “matched up”. It also shows that the problem isn't due to the first tibble having zero rows.

result_augmented <- result_bad
result_augmented[[1]][1, ] <- rep("one", 4)
result_augmented[[2]][1, ] <- rep("two", 4)
vctrs::vec_rbind(!!!result_augmented)
#> # A tibble: 7 x 6
#>   `Max Temp Flag` `Min Temp (°C)` `Min Temp Flag` `Mean Temp (°C)`
#>   <chr>           <chr>           <chr>           <chr>           
#> 1 one             one             one             one             
#> 2 two             <NA>            two             <NA>            
#> 3 <NA>            <NA>            <NA>            <NA>            
#> 4 <NA>            <NA>            <NA>            <NA>            
#> 5 <NA>            <NA>            <NA>            <NA>            
#> 6 <NA>            <NA>            <NA>            <NA>            
#> 7 <NA>            <NA>            <NA>            <NA>            
#> # … with 2 more variables: `Min Temp (°C)` <chr>, `Mean Temp (°C)` <chr>

Created on 2019-08-30 by the reprex package (v0.3.0.9000)

@DavisVaughan

I'd imagine the fix would go here:

// Ignoring encoding for now

jennybc commented Aug 30, 2019

I could have sworn I looked at the encoding of these column names. But in any case, thanks all, and here is the most minimal reprex:

marked <- unknown <- "Max Temp (\u00B0C)"
Encoding(unknown) <- "unknown"

(x <- c(marked, unknown))
#> [1] "Max Temp (°C)" "Max Temp (°C)"

Encoding(x)
#> [1] "UTF-8"   "unknown"

unique(x)
#> [1] "Max Temp (°C)"

x[[1]] == x[[2]]
#> [1] TRUE

vctrs::vec_unique(x)
#> [1] "Max Temp (°C)" "Max Temp (°C)"

Created on 2019-08-30 by the reprex package (v0.3.0.9000)
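
Until the comparison itself ignores the mark, one possible user-side workaround is to normalize names with base R's enc2utf8() before combining. This is only a sketch, assuming a UTF-8 locale; the helper name normalize_names() is hypothetical, and this is not the fix made in vctrs.

```r
# Hypothetical helper: re-encode column names as UTF-8 so that every
# non-ASCII name carries the same mark before data frames are combined.
normalize_names <- function(df) {
  names(df) <- enc2utf8(names(df))
  df
}

marked <- "Max Temp (\u00B0C)"
unknown <- marked
Encoding(unknown) <- "unknown"

# Two one-column data frames whose names differ only in the declared encoding.
df_marked <- normalize_names(setNames(data.frame(x = 1), marked))
df_unknown <- normalize_names(setNames(data.frame(x = 2), unknown))

# In a UTF-8 locale both names should now report the same declared encoding.
print(Encoding(c(names(df_marked), names(df_unknown))))
```

ASCII-only names are left untouched by enc2utf8(), so the helper should be safe to apply to every data frame in a list.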

jennybc changed the title from "Peculiar list confuses vec_rbind()" to "vec_unique() is sensitive to the marked encoding, where base::unique() is not" on Aug 30, 2019

DavisVaughan commented Sep 4, 2019

After more research, vec_unique() and vec_equal() both did have problems with encodings, but fixing them isn't quite enough to resolve this. The real culprit is vec_match() (vec_in() will need a fix too), which is called when determining the common type in df_type2():

SEXP x_dups_pos = PROTECT(vctrs_match(x_names, y_names));
