Inconsistent behaviour between dplyr::summarise and dbplyr::summarise with .groups argument #584

dstoeckel · 2021-01-27T09:16:58Z

In dplyr 1.0.0, the .groups argument was added to summarise to control what happens to the grouping of the result when the input is grouped by multiple variables. Unfortunately, the dbplyr summarise implementation does not seem to recognize this special argument. Passing it adds a column named .groups to the output instead.

This makes it difficult to write warning-free, generic code in the sense that it can take either a data.frame (or tibble) or an object returned from tbl.

The code below is a short illustration of the problem. It uses SQLite for testing purposes, but the problem also exists for other databases/connectors.

library(DBI)
library(RSQLite)
library(dplyr)

test <- tibble(a = c("a", "a", "b", "b", "c", "c", "c", "d"), b = c(1,2,1,1,3,3,4,1), y = runif(8))

con <- dbConnect(SQLite())
copy_to(con, name = "test", test)

my_summary <- function(df) {
  df %>%
    group_by(a, b) %>%
    summarise(y_ = min(y, na.rm = FALSE), .groups = "drop") %>%
    collect(n=Inf)
}

print(my_summary(test))
print(my_summary(tbl(con, "test")))

The first call to my_summary yields something like

# A tibble: 6 x 3
  a         b     y_
* <chr> <dbl>  <dbl>

while the second one returns

# A tibble: 6 x 4
# Groups:   a [4]
  a         b     y_ .groups
  <chr> <dbl>  <dbl> <chr>

with the .groups column being set to "drop". Expected behaviour: both calls should return a tibble with the same schema.

I suspect there have always been inconsistencies here that are hard to fix (i.e. the grouping being dropped completely in dbplyr/SQL after a summarise vs. only the last grouping variable being removed in dplyr), but recognizing .groups on the dbplyr side would be a nice consistency improvement.
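
For dbplyr versions that do not recognize .groups, one possible workaround is sketched below. It is not something proposed in this thread: it assumes that an explicit ungroup() is an acceptable replacement for .groups = "drop", and it uses the dplyr.summarise.inform option to silence the regrouping message that summarise prints when .groups is omitted. The function name my_summary_portable and the small test table are made up for illustration.

```r
library(DBI)
library(dplyr, warn.conflicts = FALSE)

# Silence the "regrouping output" message emitted when .groups is omitted.
options(dplyr.summarise.inform = FALSE)

# Hypothetical helper: avoids .groups entirely and drops the grouping with
# ungroup(), which is available for both data frames and lazy database tables.
my_summary_portable <- function(df) {
  df %>%
    group_by(a, b) %>%
    summarise(y_ = min(y, na.rm = TRUE)) %>%
    ungroup() %>%
    collect()
}

# Illustrative data, not taken from the report above.
test <- tibble(
  a = c("a", "a", "b", "b"),
  b = c(1, 2, 1, 1),
  y = c(0.4, 0.2, 0.3, 0.1)
)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, test, name = "test")

local_result  <- my_summary_portable(test)
remote_result <- my_summary_portable(tbl(con, "test"))

dbDisconnect(con)
```

Under these assumptions both calls should return an ungrouped tibble with the columns a, b and y_. The row order of the database result is not guaranteed, so add an arrange() step before comparing values.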

hadley · 2021-02-02T18:29:23Z

The .groups argument is still experimental. It won't be added to dbplyr until it becomes stable in dplyr.

hadley · 2021-02-02T18:30:37Z

Minimal reprex:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

df <- memdb_frame(x = 1, y = 1:3)
df %>% 
  group_by(x) %>% 
  summarise(y = mean(y, na.rm = TRUE), .groups = "drop") %>% 
  collect()
#> # A tibble: 1 x 3
#>       x     y .groups
#>   <dbl> <dbl> <chr>  
#> 1     1     2 drop

Created on 2021-02-02 by the reprex package (v0.3.0.9001)

mgirlich mentioned this issue Jan 27, 2021

support argument .groups in summarise() #585

Merged

hadley added the "feature" label (a feature request or enhancement) Feb 2, 2021

hadley closed this as completed in #585 Feb 3, 2021

hadley pushed a commit that referenced this issue Feb 3, 2021

support argument .groups in summarise() (#585)

abe84e6

Fixes #584
