Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dense_rank() ignores NAs in data frames, counts them in database tables #991

Closed
andreassoteriadesmoj opened this issue Aug 30, 2022 · 1 comment · Fixed by #1047
Closed

Comments

@andreassoteriadesmoj
Copy link

The documentation of function dense_rank currently does not mention that NA values won't be accounted for when working with data frames, but will be counted in when working with DB tables. Note the value of column dr in the two examples below.

Data frame:

mtcars$vs[1] <- NA
mtcars$vs[3] <- NA
mtcars %>% 
    dplyr::slice(1:5) %>% 
    dplyr::mutate(dr = dplyr::dense_rank(vs)) %>% 
    dplyr::arrange(dr)


#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb dr
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4  1
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2  1
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1 NA    3    1  2
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46 NA NA    4    4 NA
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61 NA  1    4    1 NA

DB table:

mtcars$vs[1] <- NA
mtcars$vs[3] <- NA
mtcars %>% 
    dplyr::slice(1:5) %>% 
    dbplyr::memdb_frame() %>% 
    dplyr::mutate(dr = dplyr::dense_rank(vs)) %>% 
    dplyr::arrange(dr)

# # Source:     SQL [5 x 12]
# # Database:   sqlite 3.39.1 [:memory:]
# # Ordered by: dr
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb    dr
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1  21       6   160   110  3.9   2.62  16.5    NA    NA     4     4     1
# 2  22.8     4   108    93  3.85  2.32  18.6    NA     1     4     1     1
# 3  21       6   160   110  3.9   2.88  17.0     0     1     4     4     2
# 4  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2     2
# 5  21.4     6   258   110  3.08  3.22  19.4     1    NA     3     1     3

This can be quite confusing when using the same code for different objects (data frames or DB tables).

@hadley hadley transferred this issue from tidyverse/dplyr Aug 30, 2022
@hadley
Copy link
Member

hadley commented Aug 30, 2022

Moving to dbplyr since this is a dbplyr bug.

Here's a rather more minimal reprex:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

db <- memdb_frame(x = c(1, 2, NA, 3))
db |> mutate(rank = dense_rank(x))
#> # Source:   SQL [4 x 2]
#> # Database: sqlite 3.39.2 [:memory:]
#>       x  rank
#>   <dbl> <int>
#> 1    NA     1
#> 2     1     2
#> 3     2     3
#> 4     3     4

Created on 2022-08-30 by the reprex package (v2.0.1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants