Skip to content

ntile() distributes remainders across buckets differently in dplyr vs database #4995

@isteves

Description

@isteves

Hi! 👋 I'm seeing slight differences in the way ntile() distributes "remainders" across buckets in dplyr versus in the database I'm using (Redshift).

For example, when we divide mtcars (32 rows) into 5 buckets, the two remainders are distributed into bucket 1 & 3 (arrows mine).

library(dplyr)

mtcars_dplyr <- mtcars %>% 
  mutate(bucket = ntile(mpg, 5)) %>% 
  select(mpg, bucket)

mtcars_dplyr %>% count(bucket)
#> # A tibble: 5 x 2
#>   bucket     n
#>    <int> <int>
#> 1      1     7 <-----
#> 2      2     6
#> 3      3     7 <-----
#> 4      4     6
#> 5      5     6

Created on 2020-03-17 by the reprex package (v0.2.1)

When I do the same in the database, the remainders get distributed to the buckets in order. The Oracle docs describe it this way:

The remainder values (the remainder of number of rows divided by buckets) are distributed one for each bucket, starting with bucket 1.

library(dplyr)
library(dbplyr)

mtcars_db <- tbl_memdb(mtcars)

mtcars_db <- mtcars_db %>%
  mutate(bucket = ntile(mpg, 5)) %>%
  select(mpg, bucket) %>%
  collect()

mtcars_db %>% count(bucket)
#> # A tibble: 5 x 2
#>   bucket     n
#>    <int> <int>
#> 1      1     7 <-----
#> 2      2     7 <-----
#> 3      3     6
#> 4      4     6
#> 5      5     6

Created on 2020-03-17 by the reprex package (v0.2.1)

Update: I updated RSQLite and was able to generate a reprex of the db part to.

Let me know if there's anything else I can do on my end!

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugan unexpected problem or unintended behaviorfuns 😆

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions