arrange() produces incorrect row ordering when one column is a vctr and a second column is not #1354

Closed
dholstius opened this issue Apr 7, 2021 · 1 comment · Fixed by #1376
Labels
bug an unexpected problem or unintended behavior

Comments

@dholstius

dholstius commented Apr 7, 2021

Here's a smallish reprex.

When one column is a vctr, calling arrange() on it and a second column doesn't produce the expected result: the vctr column is ordered correctly, but the second column is not ordered within each group of the vctr column.

This might be a problem with arrange(), in which case, I'm happy to file the issue elsewhere.

library(tidyverse)
library(vctrs)
#> 
#> Attaching package: 'vctrs'
#> The following object is masked from 'package:dplyr':
#> 
#>     data_frame
#> The following object is masked from 'package:tibble':
#> 
#>     data_frame

# Example data.
df1 <-
  mtcars %>%
  as_tibble(rownames = "name") %>%
  separate(name, into = c("make", "model"), fill = "right", extra = "merge") %>%
  filter(make %in% c("Merc", "Fiat"))

# Nothing fancy here. Yields correct row ordering.
arrange(df1, make, mpg)
#> # A tibble: 9 x 13
#>   make  model    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fiat  X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> 2 Fiat  128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 3 Merc  450SLC  15.2     8 276.    180  3.07  3.78  18       0     0     3     3
#> 4 Merc  450SE   16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
#> 5 Merc  450SL   17.3     8 276.    180  3.07  3.73  17.6     0     0     3     3
#> 6 Merc  280C    17.8     6 168.    123  3.92  3.44  18.9     1     0     4     4
#> 7 Merc  280     19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4
#> 8 Merc  230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> 9 Merc  240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2

#
# Now turn the `make` column into a rudimentary `vctr`.
# This yields an **incorrect** row ordering:
#
# - `make` is correctly ordered; but
# - within each make, `mpg` is not.
#
df2 <- mutate(df1, make = new_vctr(make, class = c("foo", "character")))
arrange(df2, make, mpg)
#> # A tibble: 9 x 13
#>   make  model    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <foo> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fiat  128     32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#> 2 Fiat  X1-9    27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#> 3 Merc  240D    24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#> 4 Merc  230     22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#> 5 Merc  280     19.2     6 168.    123  3.92  3.44  18.3     1     0     4     4
#> 6 Merc  280C    17.8     6 168.    123  3.92  3.44  18.9     1     0     4     4
#> 7 Merc  450SE   16.4     8 276.    180  3.07  4.07  17.4     0     0     3     3
#> 8 Merc  450SL   17.3     8 276.    180  3.07  3.73  17.6     0     0     3     3
#> 9 Merc  450SLC  15.2     8 276.    180  3.07  3.78  18       0     0     3     3

Created on 2021-04-07 by the reprex package (v2.0.0)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.2 (2019-12-12)
#>  os       macOS Catalina 10.15.7      
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/Los_Angeles         
#>  date     2021-04-07                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version     date       lib source                            
#>  assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.0)                    
#>  backports     1.2.1       2020-12-09 [1] CRAN (R 3.6.2)                    
#>  broom         0.7.6       2021-04-05 [1] CRAN (R 3.6.2)                    
#>  cellranger    1.1.0       2016-07-27 [1] CRAN (R 3.6.0)                    
#>  cli           2.4.0       2021-04-05 [1] CRAN (R 3.6.2)                    
#>  colorspace    2.0-0       2020-11-11 [1] CRAN (R 3.6.2)                    
#>  crayon        1.4.1       2021-02-08 [1] CRAN (R 3.6.2)                    
#>  DBI           1.1.1       2021-01-15 [1] CRAN (R 3.6.2)                    
#>  dbplyr        2.1.1       2021-04-06 [1] CRAN (R 3.6.2)                    
#>  digest        0.6.27      2020-10-24 [1] CRAN (R 3.6.2)                    
#>  dplyr       * 1.0.5       2021-03-05 [1] CRAN (R 3.6.2)                    
#>  ellipsis      0.3.1       2020-05-15 [1] CRAN (R 3.6.2)                    
#>  evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.0)                    
#>  fansi         0.4.2       2021-01-15 [1] CRAN (R 3.6.2)                    
#>  forcats     * 0.5.1       2021-01-27 [1] CRAN (R 3.6.2)                    
#>  fs            1.5.0       2020-07-31 [1] CRAN (R 3.6.2)                    
#>  generics      0.1.0       2020-10-31 [1] CRAN (R 3.6.2)                    
#>  ggplot2     * 3.3.3       2020-12-30 [1] CRAN (R 3.6.2)                    
#>  glue          1.4.2       2020-08-27 [1] CRAN (R 3.6.2)                    
#>  gtable        0.3.0       2019-03-25 [1] CRAN (R 3.6.0)                    
#>  haven         2.3.1       2020-06-01 [1] CRAN (R 3.6.2)                    
#>  highr         0.8         2019-03-20 [1] CRAN (R 3.6.0)                    
#>  hms           1.0.0       2021-01-13 [1] CRAN (R 3.6.2)                    
#>  htmltools     0.5.1.1     2021-01-22 [1] CRAN (R 3.6.2)                    
#>  httr          1.4.2       2020-07-20 [1] CRAN (R 3.6.2)                    
#>  jsonlite      1.7.2       2020-12-09 [1] CRAN (R 3.6.2)                    
#>  knitr         1.31        2021-01-27 [1] CRAN (R 3.6.2)                    
#>  lifecycle     1.0.0       2021-02-15 [1] CRAN (R 3.6.2)                    
#>  lubridate     1.7.10      2021-02-26 [1] CRAN (R 3.6.2)                    
#>  magrittr      2.0.1       2020-11-17 [1] CRAN (R 3.6.2)                    
#>  modelr        0.1.8       2020-05-19 [1] CRAN (R 3.6.2)                    
#>  munsell       0.5.0       2018-06-12 [1] CRAN (R 3.6.0)                    
#>  pillar        1.5.1       2021-03-05 [1] CRAN (R 3.6.2)                    
#>  pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 3.6.0)                    
#>  purrr       * 0.3.4       2020-04-17 [1] CRAN (R 3.6.1)                    
#>  R6            2.5.0       2020-10-28 [1] CRAN (R 3.6.2)                    
#>  Rcpp          1.0.6       2021-01-15 [1] CRAN (R 3.6.2)                    
#>  readr       * 1.4.0       2020-10-05 [1] CRAN (R 3.6.2)                    
#>  readxl        1.3.1       2019-03-13 [1] CRAN (R 3.6.0)                    
#>  reprex        2.0.0       2021-04-02 [1] CRAN (R 3.6.2)                    
#>  rlang         0.4.10.9000 2021-04-07 [1] Github (r-lib/rlang@bd8dc5b)      
#>  rmarkdown     2.7.7       2021-04-07 [1] Github (rstudio/rmarkdown@dcc03e5)
#>  rstudioapi    0.13        2020-11-12 [1] CRAN (R 3.6.2)                    
#>  rvest         1.0.0       2021-03-09 [1] CRAN (R 3.6.2)                    
#>  scales        1.1.1       2020-05-11 [1] CRAN (R 3.6.2)                    
#>  sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 3.6.0)                    
#>  stringi       1.5.3       2020-09-09 [1] CRAN (R 3.6.2)                    
#>  stringr     * 1.4.0       2019-02-10 [1] CRAN (R 3.6.0)                    
#>  styler        1.3.2       2020-02-23 [1] CRAN (R 3.6.0)                    
#>  tibble      * 3.1.0       2021-02-25 [1] CRAN (R 3.6.2)                    
#>  tidyr       * 1.1.3       2021-03-03 [1] CRAN (R 3.6.2)                    
#>  tidyselect    1.1.0       2020-05-11 [1] CRAN (R 3.6.2)                    
#>  tidyverse   * 1.3.0       2019-11-21 [1] CRAN (R 3.6.0)                    
#>  utf8          1.2.1       2021-03-12 [1] CRAN (R 3.6.2)                    
#>  vctrs       * 0.3.7       2021-03-29 [1] CRAN (R 3.6.2)                    
#>  withr         2.4.1       2021-01-26 [1] CRAN (R 3.6.2)                    
#>  xfun          0.22        2021-03-11 [1] CRAN (R 3.6.2)                    
#>  xml2          1.3.2       2020-04-23 [1] CRAN (R 3.6.2)                    
#>  yaml          2.2.1       2020-02-01 [1] CRAN (R 3.6.0)                    
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library
@DavisVaughan
Member

DavisVaughan commented Apr 7, 2021

More minimal reprex:

library(vctrs)

x <- c("F", "F", "M", "M")
y <- c(32.4, 27.3, 24.4, 22.8)
v <- new_vctr(x, class = "foo")

df1 <- data_frame(x = x, y = y)
df1
#>   x    y
#> 1 F 32.4
#> 2 F 27.3
#> 3 M 24.4
#> 4 M 22.8

# good
vec_order(df1)
#> [1] 2 1 4 3

df2 <- data_frame(v = v, y = y)
df2
#>   v    y
#> 1 F 32.4
#> 2 F 27.3
#> 3 M 24.4
#> 4 M 22.8

# bad
vec_order(df2)
#> [1] 1 2 3 4

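This has to do with the xtfrm.vctrs_vctr method, which I've had a feeling wasn't quite right. For proxies that aren't integer or double, it calls vec_order(vec_order(proxy)), which automatically breaks ties with a sequential ordering: equal values receive distinct, increasing ranks. This causes the groups in v to no longer look like groups, so y is never consulted to break ties.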

vctrs:::xtfrm.vctrs_vctr
#> function(x) {
#>   proxy <- vec_proxy_order(x)
#> 
#>   if (is.object(proxy) && typeof(proxy) %in% c("integer", "double", "character")) {
#>     proxy <- unstructure(proxy)
#>   }
#> 
#>   # order(order(x)) ~= rank(x)
#>   if (typeof(proxy) %in% c("integer", "double")) {
#>     proxy
#>   } else {
#>     vec_order(vec_order(proxy))
#>   }
#> }
#> <bytecode: 0x7ffbbc770438>
#> <environment: namespace:vctrs>

xtfrm(x)
#> [1] 1 1 3 3
xtfrm(v)
#> [1] 1 2 3 4
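
The effect of sequential tie-breaking can be reproduced in base R alone, without vctrs. The sketch below is only an illustration, not the vctrs implementation: it reuses `x` and `y` from the reprex above, and `seq_rank` and `tie_rank` are names made up for this example. It compares a sort key built with order(order(x)) against one built with rank(), which keeps ties.

```r
# Base R only; `x` and `y` mirror the minimal reprex.
x <- c("F", "F", "M", "M")
y <- c(32.4, 27.3, 24.4, 22.8)

# order(order(x)) hands out sequential ranks, so tied values get distinct keys.
seq_rank <- order(order(x))
seq_rank
#> [1] 1 2 3 4

# rank() with ties.method = "min" gives tied values the same key.
tie_rank <- rank(x, ties.method = "min")
tie_rank
#> [1] 1 1 3 3

# With sequential keys there are no ties left for `y` to break.
order(seq_rank, y)
#> [1] 1 2 3 4

# With tie-preserving keys, `y` is ordered within each group.
order(tie_rank, y)
#> [1] 2 1 4 3
```

The last result matches the "good" vec_order(df1) output, while the sequential key reproduces the "bad" one.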

To fix the xtfrm() method, we might be able to switch to vec_rank(x, ties = "dense", na_propagate = TRUE) (or "min"; really anything that isn't "sequential"), but we have to keep in mind that it orders character vectors in the C locale.
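
As a rough picture of what a "dense" rank with NA propagation and C-locale ordering looks like, here is a base R sketch. It is an assumption-laden illustration, not how vec_rank() is implemented: `dense_rank_sketch()` is a hypothetical helper, and it relies on sort(method = "radix") ordering character vectors in the C locale.

```r
# Hypothetical helper: dense ranks, ties share a rank, NA stays NA.
dense_rank_sketch <- function(x) {
  # sort() drops NA by default; method = "radix" sorts strings in the C locale.
  keys <- sort(unique(x), method = "radix")
  # Each value's rank is its position among the sorted unique values.
  match(x, keys)
}

# In the C locale, uppercase letters sort before lowercase ones.
dense_rank_sketch(c("b", "a", NA, "B", "a"))
#> [1]  3  2 NA  1  2
```

Note that "B" ranks ahead of "a" here, which may differ from what a locale-aware sort such as the default order() on a character vector would give.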

This will disappear in general for arrange() when we switch it to use vec_order_radix(), which doesn't call xtfrm().

@DavisVaughan DavisVaughan added the bug an unexpected problem or unintended behavior label Apr 7, 2021