Disallow access to group columns in `expand()` and `complete()` #1300

DavisVaughan · 2022-01-10T23:08:39Z

Closes #1299
Follow up to #1289

In #1289 we added a grouped-df method for complete() that calls complete() "within" each group. That method sent the entire group slice of data (including the grouping column) into the inner call to complete(). This turns out to be buggy: whenever the within-group expansion adds rows, the grouping columns in those new rows are filled with missing values:

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = c("x", "x", "y"),
  a = c(1, 2, 2),
  b = c(1, 2, 2)
)
gdf <- group_by(df, g)
gdf
#> # A tibble: 3 × 3
#> # Groups:   g [2]
#>   g         a     b
#>   <chr> <dbl> <dbl>
#> 1 x         1     1
#> 2 x         2     2
#> 3 y         2     2

# The `NA`s should be `"x"` values.
# Columns `a` and `b` should be completed "within" each group of `g`
complete(gdf, a, b)
#> # A tibble: 5 × 3
#> # Groups:   g [3]
#>   g         a     b
#>   <chr> <dbl> <dbl>
#> 1 x         1     1
#> 2 <NA>      1     2
#> 3 <NA>      2     1
#> 4 x         2     2
#> 5 y         2     2

We never saw this with expand(), because each call to expand() "within" each group only returns the expansion columns, and the outer summarize() then adds the group columns back on.

expand(gdf, a, b)
#> # A tibble: 5 × 3
#> # Groups:   g [2]
#>   g         a     b
#>   <chr> <dbl> <dbl>
#> 1 x         1     1
#> 2 x         1     2
#> 3 x         2     1
#> 4 x         2     2
#> 5 y         2     2

Compare this with complete(), where each inner call to complete() "within" each group returns all the columns in the data frame. This includes the grouping columns if they are passed through, and summarize() won't overwrite those, which is why we ended up with the problem above.
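The inner call can be reproduced by hand. The sketch below is only an illustration of the failure mode, not the grouped-df method itself: it assumes the same df as above and takes the "x" group slice manually (slice_x is a name introduced here), grouping column included, before passing it to complete().

```r
library(tidyr)

df <- tibble(
  g = c("x", "x", "y"),
  a = c(1, 2, 2),
  b = c(1, 2, 2)
)

# Hand-made group slice for g == "x"; it still carries the `g` column.
slice_x <- df[df$g == "x", ]

# complete() adds the missing (a, b) combinations. Columns that are not
# being completed, `g` included, are filled with NA in the new rows.
inner <- complete(slice_x, a, b)

nrow(inner)         # 4 rows: every combination of a and b in the slice
sum(is.na(inner$g)) # 2 of them were added, so `g` is NA there
```

Because the slice already returns a g column, summarize() has nothing to re-add, and the NA values survive into the final result.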

The fix is to use cur_data() rather than cur_data_all(), forcing summarize() to handle the re-addition of any group columns. This means you can no longer attempt to "complete" or "expand" on a group column, but that was pretty much undefined behavior previously, since conceptually you should be completing/expanding "within" each group, meaning you don't have access to that group info.
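As a rough analogue of the fix, the same idea can be written as a manual split / complete / combine. This is a sketch under assumptions, not tidyr's implementation: complete_one_group() is a hypothetical helper, and base split() stands in for the grouping machinery. The point it shows is that the grouping column is hidden from the inner complete() and re-attached afterwards from the group key.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = c("x", "x", "y"),
  a = c(1, 2, 2),
  b = c(1, 2, 2)
)

# Hypothetical helper, not part of tidyr: complete one group's rows
# without exposing the grouping column to complete().
complete_one_group <- function(slice, key) {
  inner <- complete(select(slice, -g), a, b) # `g` is not visible here
  mutate(inner, g = key, .before = 1)        # re-add the key afterwards
}

# One data frame per group, named by the group key.
pieces <- split(df, df$g)
manual <- bind_rows(Map(complete_one_group, pieces, names(pieces)))

manual$g # the key is filled in for every row, including the added ones
```

Since the key is attached after completion, the rows added inside a group cannot end up with a missing group value.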

DavisVaughan · 2022-01-10T23:17:37Z

tests/testthat/test-expand.R

-  # Same as split by group, expand, combine.
-  # Note that this produces duplicate rows in the result. Not ideal, but
-  # would probably break revdeps if we removed this grouped-df method.
-  expect <- vec_rbind(
-    expand(df[1,], g, a),
-    expand(df[2:3,], g, a)


I also really like this change because it means that we can no longer produce duplicate rows in the result of expand() (which I previously mentioned should be an invariant of this function). This was only an issue if we allowed expansion on a grouping column.
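A quick check of both points, assuming an installed tidyr that includes this change. The names out, no_duplicates, and errored are introduced here for illustration, and the exact error message is not asserted since it may come from dplyr rather than tidyr.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = c("x", "x", "y"),
  a = c(1, 2, 2),
  b = c(1, 2, 2)
)
gdf <- group_by(df, g)

# Expanding on non-group columns: every row of the result is unique.
out <- expand(gdf, a, b)
no_duplicates <- nrow(distinct(ungroup(out))) == nrow(out)

# Expanding on the grouping column is expected to fail, because `g` is
# no longer visible inside the per-group call.
errored <- tryCatch(
  {
    expand(gdf, g)
    FALSE
  },
  error = function(e) TRUE
)

no_duplicates
errored
```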

DavisVaughan added 3 commits January 10, 2022 17:36

Only allow access to the non-group variables in expand()

8049ce4

Only allow access to the non-group variables in complete()

2ea67e7

NEWS bullet

9311b28


DavisVaughan requested a review from hadley January 10, 2022 23:18

hadley approved these changes Jan 11, 2022


DavisVaughan mentioned this pull request Jan 11, 2022

Document which functions work with grouped data #952

Closed

DavisVaughan merged commit 49405de into tidyverse:main Jan 11, 2022

DavisVaughan deleted the fix/complete-expand-group-access branch January 11, 2022 13:37

DavisVaughan mentioned this pull request Jan 11, 2022

Don't rely on exact column ordering when grouped data is involved epiforecasts/covidregionaldata#445

Merged
