Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow access to group columns in expand() and complete() #1300

Merged

Conversation

DavisVaughan
Copy link
Member

@DavisVaughan DavisVaughan commented Jan 10, 2022

Closes #1299
Follow up to #1289

In #1289 we added a grouped-df method for complete() that calls complete() "within" each group. This would send the entire group slice of data (including the grouping column) into the inner call to complete(). This turns out to be buggy, since it can generate missing values in the grouping columns if the within-group-expansion adds rows:

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = c("x", "x", "y"),
  a = c(1, 2, 2),
  b = c(1, 2, 2)
)
gdf <- group_by(df, g)
gdf
#> # A tibble: 3 × 3
#> # Groups:   g [2]
#>   g         a     b
#>   <chr> <dbl> <dbl>
#> 1 x         1     1
#> 2 x         2     2
#> 3 y         2     2

# The `NA`s should be `"x"` values.
# Columns `a` and `b` should be completed "within" each group of `g`
complete(gdf, a, b)
#> # A tibble: 5 × 3
#> # Groups:   g [3]
#>   g         a     b
#>   <chr> <dbl> <dbl>
#> 1 x         1     1
#> 2 <NA>      1     2
#> 3 <NA>      2     1
#> 4 x         2     2
#> 5 y         2     2

We never saw this with expand(), because each call to expand() "within" each group would only return the expansion columns, then the outer summarize() would add the group columns back on.

expand(gdf, a, b)
#> # A tibble: 5 × 3
#> # Groups:   g [2]
#>   g         a     b
#>   <chr> <dbl> <dbl>
#> 1 x         1     1
#> 2 x         1     2
#> 3 x         2     1
#> 4 x         2     2
#> 5 y         2     2

Compare this with complete(), where each inner call to complete() "within" each group returns all the columns in the data frame. This includes the grouping columns if they are passed through, and summarize() won't overwrite those, which is why we ended up with the problem above.

The fix is to use cur_data() rather than cur_data_all(), forcing summarize() to handle the re-addition of any group columns. This means you can no longer attempt to "complete" or "expand" on a group column, but that was pretty much undefined behavior previously, since conceptually you should be completing/expanding "within" each group, meaning you don't have access to that group info.

Comment on lines -70 to -75
# Same as split by group, expand, combine.
# Note that this produces duplicate rows in the result. Not ideal, but
# would probably break revdeps if we removed this grouped-df method.
expect <- vec_rbind(
expand(df[1,], g, a),
expand(df[2:3,], g, a)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also really like this change because it means that we can no longer produce duplicate rows in the result of expand() (which I previously mentioned should be an invariant of this function). This was only an issue if we allow expansion on a grouping column

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Should complete() and expand() disallow accessing the group columns?
2 participants