Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix row_number() and ntile() to align R and C sorting #2969

Merged
merged 7 commits into from Dec 27, 2017

Conversation

foo-bar-baz-qux
Copy link
Contributor

@foo-bar-baz-qux foo-bar-baz-qux commented Jul 14, 2017

Fix row_number() and ntile() ordering to use the locale-dependent ordering functions in R when dealing with character vectors, rather than always using the C-locale ordering function in C (#2792).

Please see issue #2792 for in-depth analysis and discussion.

Results

Original test case

data_frame(a = c("a", "b", "C")) %>% arrange(a)
## # A tibble: 3 x 1
##       a
##   <chr>
## 1     a
## 2     b
## 3     C

data_frame(a = c("a", "b", "C")) %>% mutate(rank(a, ties.method = "first"))
## # A tibble: 3 x 2
##       a `rank(a, ties.method = "first")`
##   <chr>                            <int>
## 1     a                                1
## 2     b                                2
## 3     C                                3

data_frame(a = c("a", "b", "C")) %>% mutate(row_number(a))
## # A tibble: 3 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     a               1
## 2     b               2
## 3     C               3

data_frame(a = c("a", "b", "C")) %>% mutate(dplyr::row_number(a))
## # A tibble: 3 x 2
##       a `dplyr::row_number(a)`
##   <chr>                  <int>
## 1     a                      1
## 2     b                      2
## 3     C                      3

Additional examples

data_frame(a = c("a", "b", "C")) %>% mutate(row_number(a))
## # A tibble: 3 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     a               1
## 2     b               2
## 3     C               3

data_frame(a = c("[", "]", "A", "z")) %>% mutate(row_number(a))
## # A tibble: 4 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     [               1
## 2     ]               2
## 3     A               3
## 4     z               4

data_frame(a = c(1, 22, 12, 30, 3)) %>% mutate(row_number(a))
## # A tibble: 5 x 2
##       a `row_number(a)`
##   <dbl>           <int>
## 1     1               1
## 2    22               4
## 3    12               3
## 4    30               5
## 5     3               2

data_frame(a = c("a", "C", "b", "D"), b = c(1, 1, 2, 2)) %>% group_by(b) %>% mutate(row_number(a))
## # A tibble: 4 x 3
## # Groups:   b [2]
##       a     b `row_number(a)`
##   <chr> <dbl>           <int>
## 1     a     1               1
## 2     C     1               2
## 3     b     2               1
## 4     D     2               2

data_frame(a = c("a", "C", "b", "D", "E")) %>% mutate(ntile(a, 3))
## # A tibble: 5 x 2
##       a `ntile(a, 3)`
##   <chr>         <int>
## 1     a             1
## 2     C             2
## 3     b             1
## 4     D             2
## 5     E             3

data_frame(a = c("[", "]", NA, "a", "z")) %>% mutate(row_number(a))
## # A tibble: 5 x 2
##       a `row_number(a)`
##   <chr>           <int>
## 1     [               1
## 2     ]               2
## 3  <NA>              NA
## 4     a               3
## 5     z               4

Fixes #2792.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this approach looks good to me, and preferable over the other two you outlined in the issue. Could you please add testthat tests? Please double-check that they do fail without the change.

@krlmlr
Copy link
Member

krlmlr commented Jul 15, 2017

Please also make sure that existing tests pass: https://travis-ci.org/tidyverse/dplyr/builds/253492476#L793. (I wonder why the AppVeyor tests succeed.)

@foo-bar-baz-qux
Copy link
Contributor Author

Yep, I can do that. I'm just investigating why the tests fail at the moment on Travis, as they seem to pass on my local machine.

@foo-bar-baz-qux foo-bar-baz-qux force-pushed the fix_2792_submit branch 3 times, most recently from 63c3ca8 to 953cfa2 Compare July 20, 2017 08:06
@foo-bar-baz-qux
Copy link
Contributor Author

Unfortunately couldn't get the tests to pass with the 3rd approach. Couldn't replicate on local machine, but the logs on Travis suggest the test results didn't make any sense. I'm guessing it's something to do with data being silently cast to a different data type before selecting the sorting function.

Nevertheless, I got it working with the second approach which is the more conservative option since it doesn't change the structure very much and uses the same format of selecting a comparison class based on the data type.

Interestingly, I think the CI environments use C-locale so I added tests to compare the C versions against the R versions directly, rather than use static values.

@krlmlr
Copy link
Member

krlmlr commented Jul 25, 2017

I have a slight preference for the third approach you had before, with the failing Travis runs. Does that version still succeed localy if you run it with LC_ALL=C ?

@foo-bar-baz-qux
Copy link
Contributor Author

Ok, reverting back to the 3rd approach. Yes, previous version did work with LC_ALL=C as well. I think I've found the problem though, forgot to use Shield<SEXP> which would explain the weird outputs; memory was probably getting wiped/overwritten at some point before the sorting and return, but it wouldn't always occur so it might work on some systems, some of the time. @krlmlr, your previous post helped clear up this (potential) problem.

AppVeyor passes, however, it looks like travis build fails due to some package installation errors(like stringi)?

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks good. I wonder if we should run the new tests both in the active and in the C locale.


expect_equal(mutate(df, rank = min_rank(desc(x)))$rank, 10:1)
expect_equal(mutate(group_by(df, g), rank = min_rank(desc(x)))$rank, rep(5:1, 2))

expect_equal(mutate(df, rank = row_number(desc(x)))$rank, 10:1)
expect_equal(mutate(group_by(df, g), rank = row_number(desc(x)))$rank, rep(5:1, 2))

# Test character vector sorting
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also want to run the tests below in a different locale as well (via withr::with_collate())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll create those additional tests in a C locale (in addition to the existing tests in the active locale).

res <- group_by(tmp, id) %>% mutate(var = row_number(value))
expect_equal(res$var, c(2, 3, 4, 5, 1, 5, 4, 1, 2, 3))

# Test character vector sorting by comparing C and R function outputs
# Should be careful of testing against static return values due to locale differences
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

test_that("ntile handles character vectors consistently", {
x1 <- c("[", "]", NA, "B", "y", "a", "Z")
x2 <- c("a", "b", "C")
expect_equal(ntile_h(x1, 3), ntile_h_dplyr(x1, 3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.

@krlmlr krlmlr requested a review from hadley August 22, 2017 09:54
@hadley
Copy link
Member

hadley commented Aug 22, 2017

Are there any performance implications to this change?

@krlmlr
Copy link
Member

krlmlr commented Aug 22, 2017

Yes, we seem to need to call back into R for the case of character vectors. The CharacterVectorOrderer could be made faster, but correctness and consistency looks more important to me.

@krlmlr
Copy link
Member

krlmlr commented Aug 22, 2017

...this is two callbacks per column per group, which could be brought down to one or even zero.

@hadley
Copy link
Member

hadley commented Aug 22, 2017

Can you do some rough benchmarking? I think it would be fine if it's 20-30% slower, but if it's 100% slower we'll need to think more (and advertise more explicitly in the release notes) (and maybe wait for a bigger release)

…lation locale when dealing with character vectors.
Merge remote-tracking branch 'upstream/master' into fix_2792_submit

# Conflicts:
#	inst/include/dplyr/Result/Rank.h
@foo-bar-baz-qux
Copy link
Contributor Author

Hey @krlmlr, I've updated the tests and merged with the latest master. AppVeyor 64bit environment seems to have an issue with the testthat package.

@hadley
Copy link
Member

hadley commented Nov 2, 2017

@krlmlr are you happy with this PR?

@krlmlr
Copy link
Member

krlmlr commented Nov 2, 2017

I need to take a closer look. @foo-bar-baz-qux: Would you mind resolving the conflicts?

@foo-bar-baz-qux
Copy link
Contributor Author

Yep, I'll work on resolving those conflicts and update.

@foo-bar-baz-qux
Copy link
Contributor Author

Hey @krlmlr, I've resolved the conflicts.

@krlmlr krlmlr merged commit 699e71e into tidyverse:master Dec 27, 2017
@krlmlr
Copy link
Member

krlmlr commented Dec 27, 2017

Thanks!

@lock lock bot locked as resolved and limited conversation to collaborators Jun 25, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants