ntile() distributes remainders across buckets differently in dplyr vs database #4995

Closed
isteves opened this issue Mar 17, 2020 · 2 comments · Fixed by #5054
Labels
bug funs 😆

@isteves isteves commented Mar 17, 2020

Hi! 👋 I'm seeing slight differences in the way ntile() distributes "remainders" across buckets in dplyr versus in the database I'm using (Redshift).

For example, when we divide mtcars (32 rows) into 5 buckets, the two remainder rows are distributed into buckets 1 and 3 (arrows mine).

library(dplyr)

mtcars_dplyr <- mtcars %>% 
  mutate(bucket = ntile(mpg, 5)) %>% 
  select(mpg, bucket)

mtcars_dplyr %>% count(bucket)
#> # A tibble: 5 x 2
#>   bucket     n
#>    <int> <int>
#> 1      1     7 <-----
#> 2      2     6
#> 3      3     7 <-----
#> 4      4     6
#> 5      5     6

Created on 2020-03-17 by the reprex package (v0.2.1)
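
For what it's worth, the 7/6/7/6/6 pattern above is what a proportional-scaling formula produces. The sketch below is an assumption about the kind of formula that yields these counts, not a quote of dplyr's source; `ntile_proportional` is a hypothetical name, and it assumes there are no missing values.

```r
# Hypothetical helper: scale each row's sorted position onto 1..n.
# Assumption: this only illustrates how the 1-and-3 pattern can arise.
ntile_proportional <- function(x, n) {
  pos <- rank(x, ties.method = "first")   # position of each row in sorted order
  as.integer(floor(n * (pos - 1) / length(x) + 1))
}

# Bucket sizes for the 32 rows of mtcars: 7 6 7 6 6
table(ntile_proportional(mtcars$mpg, 5))
```

With 32 rows and 5 buckets each bucket spans 6.4 positions, so the extra rows land wherever the rounding happens to fall rather than in the first buckets.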

When I do the same in the database, the remainders get distributed to the buckets in order. The Oracle docs describe it this way:

The remainder values (the remainder of number of rows divided by buckets) are distributed one for each bucket, starting with bucket 1.

library(dplyr)
library(dbplyr)

mtcars_db <- tbl_memdb(mtcars)

mtcars_db <- mtcars_db %>%
  mutate(bucket = ntile(mpg, 5)) %>%
  select(mpg, bucket) %>%
  collect()

mtcars_db %>% count(bucket)
#> # A tibble: 5 x 2
#>   bucket     n
#>    <int> <int>
#> 1      1     7 <-----
#> 2      2     7 <-----
#> 3      3     6
#> 4      4     6
#> 5      5     6

Created on 2020-03-17 by the reprex package (v0.2.1)
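
The rule quoted from the Oracle docs can be sketched in base R as follows. This is a minimal illustration, not the code used by dplyr or by any database; `ntile_sql` is a hypothetical name, and it assumes there are no missing values.

```r
# Hypothetical helper: every bucket gets floor(rows / n) rows, and the
# remainder is handed out one row per bucket, starting with bucket 1.
ntile_sql <- function(x, n) {
  len <- length(x)
  base_size <- len %/% n                  # rows every bucket receives
  extra <- len %% n                       # buckets 1..extra get one more row
  sizes <- base_size + (seq_len(n) <= extra)
  by_position <- rep(seq_len(n), times = sizes)  # bucket for each sorted position
  by_position[rank(x, ties.method = "first")]
}

# Bucket sizes for the 32 rows of mtcars: 7 7 6 6 6
table(ntile_sql(mtcars$mpg, 5))
```

Here 32 %/% 5 is 6 and 32 %% 5 is 2, so buckets 1 and 2 each take one extra row, matching the database result above.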

Update: I updated RSQLite and was able to generate a reprex of the db part too.

Let me know if there's anything else I can do on my end!

@hadley hadley commented Mar 26, 2020

The SQL Server docs say the same thing, so this looks like a problem with our implementation.

@hadley hadley added bug funs 😆 labels Mar 26, 2020

@isteves isteves commented Apr 2, 2020

Wow, quick turnaround. Thanks for your work!!
