Should complete() and expand() disallow accessing the group columns? #1299

@DavisVaughan

In #1289 we added a grouped-df method for complete(). I originally decided to pass cur_data_all() through to each ungrouped call to complete() because it seemed like people were (nonsensically) completing on grouping variables. This actually causes issues!

tidyr/R/complete.R

Lines 99 to 108 in c73cf21

  dplyr::summarise(
    data,
    complete(
      data = dplyr::cur_data_all(),
      ...,
      fill = fill,
      explicit = explicit
    ),
    .groups = "keep"
  )

I think we should pass cur_data() through instead. It excludes the grouping columns, which would avoid the issue seen below.

This would be a breaking change: people could no longer complete on a grouping variable (you'd get an error saying that the grouping variable isn't found). But I think completing on a grouping variable is a bit useless, since the whole point is to complete within that variable, so you shouldn't have access to it.
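To make the difference between the two helpers concrete, here is a minimal sketch on a small made-up data frame (the `x` column and the `all_cols` / `non_group_cols` names are illustrative only, not from tidyr). It records which column names each helper exposes per group. Note that newer dplyr releases deprecate both helpers in favor of pick(), so this may emit deprecation warnings there.

```r
library(dplyr, warn.conflicts = FALSE)

# Illustrative data: one grouping column and one ordinary column
df <- tibble(group = c(1, 1, 2), x = 1:3)
gdf <- group_by(df, group)

# cur_data_all() exposes the grouping column alongside the other columns
all_cols <- summarise(
  gdf,
  cols = paste(names(cur_data_all()), collapse = ", ")
)

# cur_data() exposes only the non-grouping columns
non_group_cols <- summarise(
  gdf,
  cols = paste(names(cur_data()), collapse = ", ")
)

all_cols
non_group_cols
```

With cur_data(), any expression that refers to `group` inside the per-group call has nothing to find, which is where the "grouping variable isn't found" error would come from.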

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  item_id = c(1:2, 2, 3),
  value1 = c(1, NA, 3, 4),
  item_name = c("a", "a", "b", "b"),
  value2 = 4:7,
  group = c(1:2, 1, 2)
)
gdf <- group_by(df, group)
gdf
#> # A tibble: 4 × 5
#> # Groups:   group [2]
#>   item_id value1 item_name value2 group
#>     <dbl>  <dbl> <chr>      <int> <dbl>
#> 1       1      1 a              4     1
#> 2       2     NA a              5     2
#> 3       2      3 b              6     1
#> 4       3      4 b              7     2

# expand "within" each group, so the group value should be repeated.
expand(gdf, item_id, item_name)
#> # A tibble: 8 × 3
#> # Groups:   group [2]
#>   group item_id item_name
#>   <dbl>   <dbl> <chr>    
#> 1     1       1 a        
#> 2     1       1 b        
#> 3     1       2 a        
#> 4     1       2 b        
#> 5     2       2 a        
#> 6     2       2 b        
#> 7     2       3 a        
#> 8     2       3 b

# (the NAs in `group` should be 1 for the rows in group 1, and 2 for the rows in group 2)
# we don't get that here because we passed the "full" data including grouping
# cols through to `complete.data.frame()`, which did a full-join resulting in
# missing values for those columns. So then `summarize()` thought it didn't
# need to add the grouping columns back on.
complete(gdf, item_id, item_name)
#> # A tibble: 8 × 5
#> # Groups:   group [3]
#>   group item_id item_name value1 value2
#>   <dbl>   <dbl> <chr>      <dbl>  <int>
#> 1     1       1 a              1      4
#> 2    NA       1 b             NA     NA
#> 3    NA       2 a             NA     NA
#> 4     1       2 b              3      6
#> 5     2       2 a             NA      5
#> 6    NA       2 b             NA     NA
#> 7    NA       3 a             NA     NA
#> 8     2       3 b              4      7

Created on 2022-01-10 by the reprex package (v2.0.1)
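The mechanism can be seen on a single group in isolation. The sketch below is only an approximation of what each per-group call receives, not tidyr's internal code. It takes the group 1 rows of `df` as a plain data frame and completes them twice, once with the `group` column present (roughly the cur_data_all() case) and once with it dropped (roughly the cur_data() case). The names `g1`, `with_group`, and `without_group` are made up for illustration.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  item_id = c(1:2, 2, 3),
  value1 = c(1, NA, 3, 4),
  item_name = c("a", "a", "b", "b"),
  value2 = 4:7,
  group = c(1:2, 1, 2)
)

# The rows of group 1 only, as an ungrouped data frame
g1 <- filter(df, group == 1)

# `group` is present, so complete() treats it as ordinary data and
# fills it with NA in the newly created rows
with_group <- complete(g1, item_id, item_name)

# `group` is absent, so nothing in the result can contradict the
# group key that summarise() would add back
without_group <- complete(select(g1, -group), item_id, item_name)

with_group
without_group
```

In the first result the rows that complete() introduces carry a missing `group`, which matches the NA rows in the grouped output above. In the second there is no `group` column at all.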
