Add grouped-df method for complete() #1289

Merged

DavisVaughan
Member

Closes #396
Closes #966
Closes #1110
Closes #1288 (Alternate approach)

Here is CRAN complete(), which demonstrates the issue from #396.

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = factor(c("x", "y", "y")),
  id = c("a", "a", "b"),
  x = c(1, 2, 3)
)

gdf <- group_by(df, g)
gdf
#> # A tibble: 3 × 3
#> # Groups:   g [2]
#>   g     id        x
#>   <fct> <chr> <dbl>
#> 1 x     a         1
#> 2 y     a         2
#> 3 y     b         3

complete(gdf, g, id)
#> # A tibble: 6 × 3
#> # Groups:   g [2]
#>   g     id        x
#>   <fct> <chr> <dbl>
#> 1 x     a         1
#> 2 y     a         2 # <- This should be NA, not 2
#> 3 x     a         1 # <- This should be NA, not 1
#> 4 x     b        NA
#> 5 y     a         2
#> 6 y     b         3

complete() on a grouped data frame should work like:

  • Split by group
  • Complete on each group
  • Combine results

In other words, we should have gotten this:

res_x <- complete(df[1,], g, id)
res_x
#> # A tibble: 2 × 3
#>   g     id        x
#>   <fct> <chr> <dbl>
#> 1 x     a         1
#> 2 y     a        NA

res_y <- complete(df[2:3,], g, id)
res_y
#> # A tibble: 4 × 3
#>   g     id        x
#>   <fct> <chr> <dbl>
#> 1 x     a        NA
#> 2 x     b        NA
#> 3 y     a         2
#> 4 y     b         3

# vec_rbind() comes from vctrs, which is not attached above, so namespace it
group_by(vctrs::vec_rbind(res_x, res_y), g)
#> # A tibble: 6 × 3
#> # Groups:   g [2]
#>   g     id        x
#>   <fct> <chr> <dbl>
#> 1 x     a         1
#> 2 y     a        NA
#> 3 x     a        NA
#> 4 x     b        NA
#> 5 y     a         2
#> 6 y     b         3
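
The split / complete / combine steps listed above can be sketched as a small helper. This is only an illustration: `complete_by_group()` is a hypothetical name, not part of tidyr, and it is not necessarily how the method added in this PR is implemented. It assumes dplyr's `group_split()` and `bind_rows()` are available.

```r
library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = factor(c("x", "y", "y")),
  id = c("a", "a", "b"),
  x = c(1, 2, 3)
)
gdf <- group_by(df, g)

# Hypothetical helper (not a tidyr function): split, complete, combine
complete_by_group <- function(data, ...) {
  groups <- group_vars(data)

  # Split: one ungrouped tibble per group
  pieces <- group_split(data)

  # Complete: run the ungrouped complete() on each piece separately
  completed <- lapply(pieces, function(piece) complete(piece, ...))

  # Combine: bind the pieces and restore the original grouping
  group_by(bind_rows(completed), across(all_of(groups)))
}

res <- complete_by_group(gdf, g, id)
res
```

On the example data this should give the same six rows as the `res_x` / `res_y` combination above, with `x` filled only where the row existed in its own group.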

The issue is that there is no grouped-df method for complete(). The expansion itself is done correctly per group, through the grouped-df method for expand(), but the join is wrong. Rather than joining each group's expansion back to the group it was expanded from, we join the combined expansion to the original data, which produces the wrongly filled values shown above.
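
A minimal sketch of why the join goes wrong, assuming the join behaves like a full join on the completed columns. To keep it independent of the installed tidyr version, the per-group expansion is typed in by hand from the `expand(gdf, g, id)` output shown in this PR rather than computed.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = factor(c("x", "y", "y")),
  id = c("a", "a", "b"),
  x = c(1, 2, 3)
)

# Combined per-group expansion, copied by hand from the grouped expand() output
expanded <- tibble(
  g = factor(c("x", "y", "x", "x", "y", "y")),
  id = c("a", "a", "a", "b", "a", "b")
)

# Joining the combined expansion to the whole data set fills x for every
# matching (g, id) pair, regardless of which group's expansion the row came from
wrong <- full_join(expanded, df, by = c("g", "id"))
wrong
```

Rows 2 and 3 pick up `x` values here because `(y, a)` and `(x, a)` exist somewhere in the full data, even though they did not exist in the group whose expansion produced them.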

I feel that we probably didn't ever need a grouped-df method for expand() for a few reasons:

  • By definition, it generates a "new" data frame and is really just a crossing() wrapper! So it should return a bare tibble.
  • I think an invariant of expand() should be that it never returns duplicated rows, but that can happen through the grouped-df method (see below).
  • We could have instead fixed this with the grouped-df method for complete(), which is added here.

That said, I think removing the grouped-df method for expand() would probably break at least one revdep (I haven't checked), so I am not doing that here.

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = factor(c("x", "y", "y")),
  id = c("a", "a", "b")
)

gdf <- group_by(df, g)

# (x, a) is duplicated, for example
expand(gdf, g, id)
#> # A tibble: 6 × 2
#> # Groups:   g [2]
#>   g     id   
#>   <fct> <chr>
#> 1 x     a    
#> 2 y     a    
#> 3 x     a    
#> 4 x     b    
#> 5 y     a    
#> 6 y     b

# that should probably just return this:
expand(df, g, id)
#> # A tibble: 4 × 2
#>   g     id   
#>   <fct> <chr>
#> 1 x     a    
#> 2 x     b    
#> 3 y     a    
#> 4 y     b
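
One way to check the "no duplicated rows" invariant is base R's `anyDuplicated()`, which returns 0 when no row repeats. In this sketch the grouped result is again typed in by hand from the output above (an assumption made so the check does not depend on the installed tidyr version).

```r
library(tidyr)

df <- tibble(
  g = factor(c("x", "y", "y")),
  id = c("a", "a", "b")
)

# Ungrouped expand(): each (g, id) combination appears exactly once
ungrouped <- expand(df, g, id)
anyDuplicated(ungrouped)     # 0 when no row is duplicated

# Rows copied by hand from the grouped expand() output shown above
grouped_rows <- tibble(
  g = factor(c("x", "y", "x", "x", "y", "y")),
  id = c("a", "a", "a", "b", "a", "b")
)
anyDuplicated(grouped_rows)  # non-zero: index of the first repeated row
```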

@DavisVaughan DavisVaughan added ask :bowtie: and removed ask :bowtie: labels Dec 21, 2021
This is the missing piece to solve the original issue in tidyverse#396. Previously, `complete()` called the grouped-df `expand()` method and then joined that result back onto the original data. This is wrong, because we need to expand each group AND join each expansion back onto its own group of data.