-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Description
Hi! 👋 I'm seeing slight differences in the way ntile() distributes "remainders" across buckets in dplyr versus in the database I'm using (Redshift).
For example, when we divide mtcars (32 rows) into 5 buckets, the two remainders are distributed into bucket 1 & 3 (arrows mine).
library(dplyr)
mtcars_dplyr <- mtcars %>%
mutate(bucket = ntile(mpg, 5)) %>%
select(mpg, bucket)
mtcars_dplyr %>% count(bucket)
#> # A tibble: 5 x 2
#> bucket n
#> <int> <int>
#> 1 1 7 <-----
#> 2 2 6
#> 3 3 7 <-----
#> 4 4 6
#> 5 5 6Created on 2020-03-17 by the reprex package (v0.2.1)
When I do the same in the database, the remainders get distributed to the buckets in order. The Oracle docs describe it this way:
The remainder values (the remainder of number of rows divided by buckets) are distributed one for each bucket, starting with bucket 1.
library(dplyr)
library(dbplyr)
mtcars_db <- tbl_memdb(mtcars)
mtcars_db <- mtcars_db %>%
mutate(bucket = ntile(mpg, 5)) %>%
select(mpg, bucket) %>%
collect()
mtcars_db %>% count(bucket)
#> # A tibble: 5 x 2
#> bucket n
#> <int> <int>
#> 1 1 7 <-----
#> 2 2 7 <-----
#> 3 3 6
#> 4 4 6
#> 5 5 6Created on 2020-03-17 by the reprex package (v0.2.1)
Update: I updated RSQLite and was able to generate a reprex of the db part to.
Let me know if there's anything else I can do on my end!