Fix row_number() and ntile() to align R and C sorting #2969
Thanks, this approach looks good to me, and preferable over the other two you outlined in the issue. Could you please add testthat tests? Please double-check that they do fail without the change.
Please also make sure that existing tests pass: https://travis-ci.org/tidyverse/dplyr/builds/253492476#L793. (I wonder why the AppVeyor tests succeed.)
Yep, I can do that. I'm just investigating why the tests fail at the moment on Travis, as they seem to pass on my local machine.
63c3ca8 to 953cfa2
Unfortunately, I couldn't get the tests to pass with the 3rd approach. I couldn't replicate the failure on my local machine, but the Travis logs suggest the test results didn't make any sense. I'm guessing it has something to do with data being silently cast to a different data type before the sorting function is selected. Nevertheless, I got it working with the second approach, which is the more conservative option: it barely changes the structure and uses the same pattern of selecting a comparison class based on the data type. Interestingly, I think the CI environments use the C locale, so I added tests that compare the C versions against the R versions directly rather than against static values.
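To illustrate why static expected values are fragile here, the following base R sketch contrasts locale-aware ordering with C-locale ordering. It does not use dplyr: `order(x, method = "radix")` is only a stand-in for the C-side sort, chosen because base R documents radix ordering of character vectors as always using the C locale. The vector `x` is borrowed from the tests in this PR.

```r
x <- c("a", "b", "C")

# Ordering in the active collation locale (result depends on LC_COLLATE).
ord_active <- order(x)

# Ordering in the C locale: uppercase letters sort before lowercase ones.
ord_c <- order(x, method = "radix")
print(ord_c)
# [1] 3 1 2
```

In a locale such as en_US.UTF-8 the two results typically differ, whereas under LC_COLLATE=C they agree, which is why comparing the two implementations against each other is more portable than hard-coding ranks.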
I have a slight preference for the third approach you had before, with the failing Travis runs. Does that version still succeed locally if you run it with …
…t ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (tidyverse#2792).
db7edf4 to 9411ceb
Ok, reverting back to the 3rd approach. Yes, the previous version did work, with AppVeyor passing; however, it looks like the Travis build fails due to some package installation errors (like …
Thanks, this looks good. I wonder if we should run the new tests both in the active and in the C locale.
expect_equal(mutate(df, rank = min_rank(desc(x)))$rank, 10:1)
expect_equal(mutate(group_by(df, g), rank = min_rank(desc(x)))$rank, rep(5:1, 2))

expect_equal(mutate(df, rank = row_number(desc(x)))$rank, 10:1)
expect_equal(mutate(group_by(df, g), rank = row_number(desc(x)))$rank, rep(5:1, 2))

# Test character vector sorting
Do you also want to run the tests below in a different locale (via withr::with_collate())?
Makes sense, I'll create those additional tests in a C locale (in addition to the existing tests in the active locale).
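As a rough sketch of what running the same check under two collation settings could look like, the snippet below uses base R only. `with_c_collate()` is a hypothetical helper written for this illustration (it is not part of dplyr or withr, though it is meant to behave like `withr::with_collate("C", ...)`), and base `rank()` stands in for the ranking functions under test.

```r
# Hypothetical helper: evaluate `code` with LC_COLLATE temporarily set to "C",
# then restore the previous collation setting on exit.
with_c_collate <- function(code) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(code)  # `code` is a promise, so it is evaluated under the C locale
}

x <- c("[", "]", "B", "y", "a", "Z")

# Ranks in the active locale (platform and locale dependent).
rank_active <- rank(x, ties.method = "first")

# Ranks in the C locale, where strings compare by byte value:
# "B" < "Z" < "[" < "]" < "a" < "y"
rank_c <- with_c_collate(rank(x, ties.method = "first"))
print(rank_c)
# [1] 3 4 1 6 5 2
```

Only the C-locale result is stable enough to compare against fixed values; the active-locale result is better compared against another locale-aware implementation.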
res <- group_by(tmp, id) %>% mutate(var = row_number(value))
expect_equal(res$var, c(2, 3, 4, 5, 1, 5, 4, 1, 2, 3))

# Test character vector sorting by comparing C and R function outputs
# Should be careful of testing against static return values due to locale differences
Same.
tests/testthat/test-rank.R
Outdated
test_that("ntile handles character vectors consistently", {
  x1 <- c("[", "]", NA, "B", "y", "a", "Z")
  x2 <- c("a", "b", "C")
  expect_equal(ntile_h(x1, 3), ntile_h_dplyr(x1, 3))
Same.
Are there any performance implications to this change?
Yes, we seem to need to call back into R for the case of character vectors. The …
...this is two callbacks per column per group, which could be brought down to one or even zero.
Can you do some rough benchmarking? I think it would be fine if it's 20-30% slower, but if it's 100% slower we'll need to think more (and advertise more explicitly in the release notes) (and maybe wait for a bigger release)
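A very rough way to get a feel for the cost of locale-aware comparison, independent of dplyr, is to time base R's two orderings on a character vector. This is only a proxy under stated assumptions: it measures base `order()`, not the code path changed in this PR, and the vector size and contents are arbitrary choices made for the illustration.

```r
set.seed(1)
n <- 1e5

# Arbitrary mixed-case three-letter strings.
x <- paste0(sample(letters, n, replace = TRUE),
            sample(LETTERS, n, replace = TRUE),
            sample(letters, n, replace = TRUE))

# Locale-aware ordering (uses the active LC_COLLATE setting).
t_locale <- system.time(order(x))[["elapsed"]]

# C-locale ordering via the radix method.
t_c <- system.time(order(x, method = "radix"))[["elapsed"]]

cat("locale-aware:", t_locale, "s; C-locale:", t_c, "s\n")
```

For the PR itself, the more relevant measurement would be a grouped `mutate()` with `row_number()` or `ntile()` on a character column, timed before and after the change.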
…lation locale when dealing with character vectors.
Merge remote-tracking branch 'upstream/master' into fix_2792_submit
# Conflicts:
#	inst/include/dplyr/Result/Rank.h
Hey @krlmlr, I've updated the tests and merged with the latest master. AppVeyor 64bit environment seems to have an issue with the testthat package.
@krlmlr are you happy with this PR?
I need to take a closer look. @foo-bar-baz-qux: Would you mind resolving the conflicts?
Yep, I'll work on resolving those conflicts and update.
Hey @krlmlr, I've resolved the conflicts.
Thanks!
Fix row_number() and ntile() ordering to use the locale-dependent ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (#2792).
Please see issue #2792 for in-depth analysis and discussion.
Results
Original test case
Additional examples
Fixes #2792.