Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mutate(row_number(key)) differs from mutate(dplyr::row_number(key)) #2792

Closed
cuttlefish44 opened this issue May 17, 2017 · 5 comments
Closed

mutate(row_number(key)) differs from mutate(dplyr::row_number(key)) #2792

cuttlefish44 opened this issue May 17, 2017 · 5 comments
Labels
Milestone

Comments

@cuttlefish44
Copy link

@cuttlefish44 cuttlefish44 commented May 17, 2017

mutate(row_number(key)) isn't equivalent to mutate(rank(key, ties.method = "first")) and mutate(dplyr::row_number(key)). I feel it's a bug.


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

data_frame(a = c("a", "b", "C")) %>% arrange(a)
#> # A tibble: 3 × 1
#>       a
#>   <chr>
#> 1     a
#> 2     b
#> 3     C

data_frame(a = c("a", "b", "C")) %>% mutate(rank(a, ties.method = "first"))
#> # A tibble: 3 × 2
#>       a `rank(a, ties.method = "first")`
#>   <chr>                            <int>
#> 1     a                                1
#> 2     b                                2
#> 3     C                                3

data_frame(a = c("a", "b", "C")) %>% mutate(row_number(a))
#> # A tibble: 3 × 2
#>       a `row_number(a)`
#>   <chr>           <int>
#> 1     a               2
#> 2     b               3
#> 3     C               1

data_frame(a = c("a", "b", "C")) %>% mutate(dplyr::row_number(a))
#> # A tibble: 3 × 2
#>       a `dplyr::row_number(a)`
#>   <chr>                  <int>
#> 1     a                      1
#> 2     b                      2
#> 3     C                      3
@yutannihilation
Copy link
Member

@yutannihilation yutannihilation commented May 18, 2017

This seems the difference between the implementation in R and one in C++.

I found a possibly similar issue in Rcpp and it seems a bit difficult to solve: RcppCore/Rcpp#251

The detail is described here in Rcpp-FAQ: https://github.com/RcppCore/Rcpp/blob/4e0f79c8651e17438ae88ef52b37139ff1135fc9/vignettes/Rcpp-FAQ.Rnw#L1642-L1695

Loading

@hadley
Copy link
Member

@hadley hadley commented May 18, 2017

We did work around this for arrange() though, so it should be possible to apply the same fix to row_number(). (Or if it's a character vector, we could just fall back to the R implementation).

Loading

@yutannihilation
Copy link
Member

@yutannihilation yutannihilation commented May 18, 2017

Oh, thanks for the good news. Sorry for misunderstanding.

Loading

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Jun 16, 2017

I see the same behavior with dplyr 0.5.0. We'll need to postpone this until after the upcoming point release. Happy to review a PR.

Loading

foo-bar-baz-qux added a commit to foo-bar-baz-qux/dplyr that referenced this issue Jul 14, 2017
…t ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (tidyverse#2792).
@foo-bar-baz-qux
Copy link
Contributor

@foo-bar-baz-qux foo-bar-baz-qux commented Jul 14, 2017

I've had a go at fixing this bug but I'm new to this codebase so let me know if I'm missing something. Here's my outline of the issue. I've just submitted a PR also.

Outline

  • The heart of the problem is that R uses a locale-based sort, while the C++ underlying dplyr uses C-locale sort. It's been reported before, and a fix issued for arrange() in the past, as @hadley pointed out. See #325 and #1299.

    • The fix is a little unorthodox in order to fit into the existing machinery and work around not being able to directly call the C version of the R sorting function; the vector is first passed and sorted in R and returned back to C, where the sorted vector is used as the basis of a comparison function for std::sort().
  • The problem would also affect other ordering/ranking functions like ntile(). As you can see, the C version differs from the R version of ntile(). It also appears that a number of other ordering/ranking functions don't work on character vectors? This seems like a different bug, but the same issue is faced even if they did work.

library(dplyr)                                

data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(dplyr::ntile(a, 3))                    
#> # A tibble: 5 x 2
#>       a `dplyr::ntile(a, 3)`
#>   <chr>                <int>
#> 1     a                    1
#> 2     b                    1
#> 3     C                    2
#> 4     D                    2
#> 5     E                    3
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(ntile(a, 3))                           
#> # A tibble: 5 x 2
#>       a `ntile(a, 3)`
#>   <chr>         <int>
#> 1     a             2
#> 2     b             3
#> 3     C             1
#> 4     D             1
#> 5     E             2
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(dplyr::percent_rank(a))                
#> # A tibble: 5 x 2
#>       a `dplyr::percent_rank(a)`
#>   <chr>                    <dbl>
#> 1     a                     0.00
#> 2     b                     0.25
#> 3     C                     0.50
#> 4     D                     0.75
#> 5     E                     1.00
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(percent_rank(a))                       
#> Error in mutate_impl(.data, dots): STRING_ELT() can only be applied to a 'character vector', not a 'char'
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(dplyr::cume_dist(a))                   
#> # A tibble: 5 x 2
#>       a `dplyr::cume_dist(a)`
#>   <chr>                 <dbl>
#> 1     a                   0.2
#> 2     b                   0.4
#> 3     C                   0.6
#> 4     D                   0.8
#> 5     E                   1.0
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(cume_dist(a))                          
#> Error in mutate_impl(.data, dots): STRING_ELT() can only be applied to a 'character vector', not a 'char'
                                              
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(dplyr::dense_rank(a))                  
#> # A tibble: 5 x 2
#>       a `dplyr::dense_rank(a)`
#>   <chr>                  <int>
#> 1     a                      1
#> 2     b                      2
#> 3     C                      3
#> 4     D                      4
#> 5     E                      5
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(dense_rank(a))                         
#> Error in mutate_impl(.data, dots): STRING_ELT() can only be applied to a 'character vector', not a 'char'
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(dplyr::min_rank(a))                    
#> # A tibble: 5 x 2
#>       a `dplyr::min_rank(a)`
#>   <chr>                <int>
#> 1     a                    1
#> 2     b                    2
#> 3     C                    3
#> 4     D                    4
#> 5     E                    5
                                              
data_frame(a = c("a", "b", "C", "D", "E")) %>%
mutate(min_rank(a))                           
#> Error in mutate_impl(.data, dots): STRING_ELT() can only be applied to a 'character vector', not a 'char'
  • For mutate(row_number(a)) the magic happens in the C version of row_number(), while mutate(dplyr::row_number(a)) uses the R version, hence the difference in sorting outcomes.

    • The key part of the C version that decides this behavior is located in the RowNumber class in Rank.h.
    • There are different versions of process() depending on the arguments, but the sorting happens using a custom comparison function that ultimately compares strings using strcmp() from C, which is why capital 'A' appears before lower case 'z'.
    • Using a case-insensitive string comparison function works for this reported test case, but doesn't solve the broader problem of locale-dependent sorting that would align the sorting from R and C.

Approaches

I've got working versions of all 3 approaches.

Approach 1

  • First approach is to directly re-use the function developed for solving this problem in the past (see the past issues above), by checking whether the input is a character vector and then using the special function. Otherwise, continue as normal.
   # Rank.h around line 292 as an example
    if( RTYPE == STRSXP) {
      CharacterVectorOrderer data_str_order( (SEXP) slice );
      IntegerVector out = data_str_order.get();
      return out;
    }
  • This feels a little clunky and doesn't match the rest of the structure. However, it may be safer than trying to make major changes to the structure, and the changes would be self contained in the sense that we add this snippet to the relevant part of the function without changing much else.
  • Should be more efficient because it doesn't go through a second round of sorting in C (see approach 2 below for details); this may be relevant for large columns of strings though R sorting time would probably be much more significant than the second round of C sorting.
  • Need to double check it handles NAs properly.

Approach 2

  • The second approach is to use the existing classes and formats, just tweaked to make the necessary adjustments to the classes for handling character vectors. Ultimately it will incorporate the same function as in approach 1, just with a few more steps along the way.
 # Add some new typedefs at lines 241 in Rank.h
  typedef OrderCharacterVectorVisitorImpl<ascending> Visitor_CharVec;
  typedef Compare_Single_OrderVisitor<Visitor_CharVec> Comparer_CharVec;

  # Make modifications in virtual SEXP process(const SlicingIndex& index)
    if( RTYPE == STRSXP) {
      Visitor_CharVec visitor((SEXP) slice);
      std::sort(x.begin(), x.end(), Comparer_CharVec(visitor));
    } else {
      Visitor visitor(slice);
      std::sort(x.begin(), x.end(), Comparer(visitor));
    }

Approach 3 (submitted as PR)

  • Third approach tries to copy the format of arrange() and use a different class, OrderVisitors, but modified to support a single vector rather than a list of vectors.

    • The good thing about this solution is that the OrderVisitors function already supports the conditional handling of the data to decide which comparison functions to use so there is no if/else logic in the process functions.
    • The outputs also maps back nicely to what the process() function was initially expecting anyway so the later part of the code remains unchanged.
    • It may also be preferable to have all ordering/ranking functions like arrange() and row_number() using the same classes.

I'd suggest approach 3 which I've submitted as a PR. Open to discussing and changing to other approaches though. Will add tests etc. but wanted to make sure the approach was appropriate.

Results

Original test case

data_frame(a = c("a", "b", "C")) %>% arrange(a)
## # A tibble: 3 x 1
##       a
##   <chr>
## 1     a
## 2     b
## 3     C

data_frame(a = c("a", "b", "C")) %>% mutate(rank(a, ties.method = "first"))
## # A tibble: 3 x 2
##       a `rank(a, ties.method = "first")`
##   <chr>                            <int>
## 1     a                                1
## 2     b                                2
## 3     C                                3

data_frame(a = c("a", "b", "C")) %>% mutate(row_number(a))
## # A tibble: 3 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     a               1
## 2     b               2
## 3     C               3

data_frame(a = c("a", "b", "C")) %>% mutate(dplyr::row_number(a))
## # A tibble: 3 x 2
##       a `dplyr::row_number(a)`
##   <chr>                  <int>
## 1     a                      1
## 2     b                      2
## 3     C                      3

Additional examples

data_frame(a = c("a", "b", "C")) %>% mutate(row_number(a))
## # A tibble: 3 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     a               1
## 2     b               2
## 3     C               3

data_frame(a = c("[", "]", "A", "z")) %>% mutate(row_number(a))
## # A tibble: 4 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     [               1
## 2     ]               2
## 3     A               3
## 4     z               4

data_frame(a = c(1, 22, 12, 30, 3)) %>% mutate(row_number(a))
## # A tibble: 5 x 2
##       a `row_number(a)`
##   <dbl>           <int>
## 1     1               1
## 2    22               4
## 3    12               3
## 4    30               5
## 5     3               2

data_frame(a = c("a", "C", "b", "D"), b = c(1, 1, 2, 2)) %>% group_by(b) %>% mutate(row_number(a))
## # A tibble: 4 x 3
## # Groups:   b [2]
##       a     b `row_number(a)`
##   <chr> <dbl>           <int>
## 1     a     1               1
## 2     C     1               2
## 3     b     2               1
## 4     D     2               2

data_frame(a = c("a", "C", "b", "D", "E")) %>% mutate(ntile(a, 3))
## # A tibble: 5 x 2
##       a `ntile(a, 3)`
##   <chr>         <int>
## 1     a             1
## 2     C             2
## 3     b             1
## 4     D             2
## 5     E             3

data_frame(a = c("[", "]", NA, "a", "z")) %>% mutate(row_number(a))
## # A tibble: 5 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     [               1
## 2     ]               2
## 3  <NA>              NA
## 4     a               3
## 5     z               4

Loading

foo-bar-baz-qux added a commit to foo-bar-baz-qux/dplyr that referenced this issue Jul 19, 2017
…t ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (tidyverse#2792).
foo-bar-baz-qux added a commit to foo-bar-baz-qux/dplyr that referenced this issue Jul 20, 2017
…t ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (tidyverse#2792).
foo-bar-baz-qux added a commit to foo-bar-baz-qux/dplyr that referenced this issue Jul 20, 2017
…t ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (tidyverse#2792).
foo-bar-baz-qux added a commit to foo-bar-baz-qux/dplyr that referenced this issue Aug 1, 2017
…t ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (tidyverse#2792).
@krlmlr krlmlr added this to the 0.7.3 milestone Aug 16, 2017
@krlmlr krlmlr added this to the 0.7.3 milestone Aug 16, 2017
@krlmlr krlmlr closed this in #2969 Dec 27, 2017
krlmlr added a commit that referenced this issue Dec 27, 2017
*  Fix `row_number()` and `ntile()` ordering to use the locale-dependent ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (#2792).

* Extract variable instead of passing cast variable.

* Add tests for checking sorting of character vectors.

* Use shielding before wrap subsetting.

* Update tests for windowed rank functions to also test under the C collation locale when dealing with character vectors.
@lock lock bot locked as resolved and limited conversation to collaborators Jun 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants