Inconsistent behaviour between dplyr::summarise and dbplyr::summarise with .groups argument #584

Closed
dstoeckel opened this issue Jan 27, 2021 · 2 comments · Fixed by #585
Labels: feature (a feature request or enhancement)

dstoeckel commented Jan 27, 2021

In dplyr 1.0.0 the .groups argument was added to summarise to control the grouping structure of the result when the input is grouped by multiple variables. Unfortunately, this special argument does not seem to be recognized by dbplyr's summarise implementation. Instead, passing it results in a column named .groups being added to the output.

This makes it difficult to write warning-free, generic code in the sense that it can take either a data.frame (or tibble) or an object returned from tbl.

The code below is a short illustration of the problem (using SQLite for testing purposes; the problem also exists with other databases/connectors):
library(DBI)
library(RSQLite)
library(dplyr)

test <- tibble(
  a = c("a", "a", "b", "b", "c", "c", "c", "d"),
  b = c(1, 2, 1, 1, 3, 3, 4, 1),
  y = runif(8)
)

con <- dbConnect(SQLite())
copy_to(con, name = "test", test)

my_summary <- function(df) {
  df %>%
    group_by(a, b) %>%
    summarise(y_ = min(y, na.rm = FALSE), .groups = "drop") %>%
    collect(n=Inf)
}

print(my_summary(test))
print(my_summary(tbl(con, "test")))

The first call to my_summary yields something like:

# A tibble: 6 x 3
  a         b     y_
* <chr> <dbl>  <dbl>

while the second one returns:

# A tibble: 6 x 4
# Groups:   a [4]
  a         b     y_ .groups
  <chr> <dbl>  <dbl> <chr>

with the .groups column being set to "drop". Expected behaviour: both calls should return a tibble with the same schema.

I suspect there have always been inconsistencies here that are hard to fix (i.e. the grouping being dropped completely in dbplyr/SQL after a summarise vs. only the last grouping variable being removed in dplyr), but recognizing .groups on the dbplyr side would be a nice consistency improvement.
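A possible workaround in the meantime, offered as a sketch rather than anything confirmed in this thread: leave out .groups and call ungroup() explicitly, which has methods for both data frames and lazy tables. The helper name my_summary_portable is hypothetical, and na.rm = TRUE is an assumption made here because SQL aggregates skip NULLs anyway; it changes the result for local data that contains NA.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not from the issue): avoids `.groups` entirely.
my_summary_portable <- function(df) {
  grouped <- group_by(df, a, b)
  # On local data frames, summarise() emits an informational message about
  # the remaining grouping; suppressMessages() hides it.
  summarised <- suppressMessages(
    summarise(grouped, y_ = min(y, na.rm = TRUE))
  )
  summarised %>%
    ungroup() %>%   # drop whatever grouping is left, regardless of backend
    collect()
}

# Small local example; columns a, b, y mirror the `test` table above.
local_test <- tibble(a = c("a", "a", "b"), b = c(1, 2, 1), y = c(0.1, 0.2, 0.3))
res <- my_summary_portable(local_test)
```

The same function should also accept tbl(con, "test"), since the explicit ungroup() removes the remaining grouping on either kind of input, but that path is not exercised in the snippet above.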

hadley (Member) commented Feb 2, 2021

The .groups argument is still experimental. It won't be added to dbplyr until it becomes stable in dplyr.

hadley added the "feature" label (a feature request or enhancement) Feb 2, 2021
hadley (Member) commented Feb 2, 2021

Minimal reprex:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

df <- memdb_frame(x = 1, y = 1:3)
df %>% 
  group_by(x) %>% 
  summarise(y = mean(y, na.rm = TRUE), .groups = "drop") %>% 
  collect()
#> # A tibble: 1 x 3
#>       x     y .groups
#>   <dbl> <dbl> <chr>  
#> 1     1     2 drop

Created on 2021-02-02 by the reprex package (v0.3.0.9001)
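For comparison, a sketch of the same pipeline on a local tibble, assuming dplyr >= 1.0.0. The name df_local is introduced here for illustration only. In this case .groups is consumed as an argument and no extra column appears.

```r
library(dplyr, warn.conflicts = FALSE)

# Local counterpart of the memdb_frame() above (df_local is a hypothetical name)
df_local <- tibble(x = 1, y = 1:3)

res_local <- df_local %>%
  group_by(x) %>%
  summarise(y = mean(y, na.rm = TRUE), .groups = "drop")

names(res_local)
#> [1] "x" "y"
```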

hadley pushed a commit that referenced this issue Feb 3, 2021