complete() on grouping variables gives wrong output #396

huftis · 2018-01-09T20:12:05Z

When running complete() on a grouped tibble and some of the variables completed on are also grouping variables, the resulting tibble is incorrect.

Here’s a reprex. First, I’ll create a simple tibble with three factors (gr1, gr2, splitgroup) and one numeric variable (x). The factor splitgroup is identical to gr1, so grouping on either variable should result in identical output. However, it doesn’t (I’ll remove splitgroup from the output just so that it doesn’t affect the ordering of the columns). The outputs don’t even have the same number of rows:

library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1          ✔ purrr   0.2.4     
#> ✔ tibble  1.4.1          ✔ dplyr   0.7.4.9000
#> ✔ tidyr   0.7.2.9000     ✔ stringr 1.2.0     
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

# Example data
d = tibble(
  gr1 = factor(c("A", "B", "B")),
  gr2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30),
  splitgroup = gr1
)

# Complete on grouping variable
d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2) %>% 
  select(-splitgroup) # 10 rows
#> # A tibble: 10 x 3
#> # Groups:   gr1 [2]
#>    gr1    gr2        x
#>    <fctr> <fctr> <dbl>
#>  1 A      1       10.0
#>  2 A      2       NA  
#>  3 B      1       NA  
#>  4 B      2       20.0
#>  5 B      2       30.0
#>  6 A      1       10.0
#>  7 A      2       NA  
#>  8 B      1       NA  
#>  9 B      2       20.0
#> 10 B      2       30.0

# Completing on non-grouping but identical variable (should give same output)
d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2) %>% 
  ungroup %>% select(-splitgroup) # 9 rows
#> # A tibble: 9 x 3
#>   gr1    gr2        x
#>   <fctr> <fctr> <dbl>
#> 1 A      1       10.0
#> 2 A      2       NA  
#> 3 B      1       NA  
#> 4 B      2       NA  
#> 5 A      1       NA  
#> 6 A      2       NA  
#> 7 B      1       NA  
#> 8 B      2       20.0
#> 9 B      2       30.0

# Alternative method to find the *expected* results
# (which are identical to the results from the
# `group_by(splitgroup)` approach)
d %>% 
  split(.$gr1) %>% 
  map_df(~complete(., gr1, gr2)) %>% 
  select(-splitgroup) # 9 rows
#> # A tibble: 9 x 3
#>   gr1    gr2        x
#>   <fctr> <fctr> <dbl>
#> 1 A      1       10.0
#> 2 A      2       NA  
#> 3 B      1       NA  
#> 4 B      2       NA  
#> 5 A      1       NA  
#> 6 A      2       NA  
#> 7 B      1       NA  
#> 8 B      2       20.0
#> 9 B      2       30.0


jnolis · 2018-01-16T18:14:42Z

I was just coming here to post the same issue. It seems that complete ignores the grouping variables entirely. I can imagine some valid reasons for that, but if so then tidyr should output an error or at least a warning. This has caused me issues multiple times.

kendonB · 2018-01-24T22:28:46Z

There's a natural case here to output a warning/throw an error when trying to complete a grouped tibble that is being completed based on the same groups - the operation would do nothing if complete operates on the grouped_df like other functions operate on grouped_dfs.

tbh, I cannot see a use case for using complete on a grouped df. I would advocate just ignoring the fact that it's a grouped_df altogether and have the behavior be identical regardless of initial grouping.

huftis · 2018-01-25T09:52:47Z

@kendonB Completing based on just the grouping variable(s) (e.g. completing A based on a tibble grouped by A) might be a strange thing to do, but completing based on the grouping variables and other variables (e.g. completing A and B based on a tibble grouped by A) may very well be both sensible and useful.

huftis · 2018-01-25T10:03:41Z

@kendonB, @hadley Hmm, I’m not sure I agree with myself anymore. I was sure I had an example where completing based on a grouping variable made sense. But I can’t find it anymore. And thinking about it, I am no longer convinced that it makes sense. So having an error message seems fine to me (I think an error is better than a warning, as the current behaviour of completing on a grouping variable is buggy).

hadley · 2018-01-25T13:48:43Z

complete() doesn't have a grouped_df method, so this is the result of expand.grouped_df():

#' @export
expand.grouped_df <- function(data, ...) {
  dots <- quos(...)
  dplyr::do(data, expand(., !!! dots))
}

which TBH I'm not sure about either - the use of do() is a warning sign. I think it should probably translate more like

df %>% group_by(a) %>% expand(b)
# same as
df %>% expand(nesting(a), b)

df %>% group_by(a, b) %>% expand(c)
# ->
df %>% expand(nesting(a, b), c)

df %>% group_by(a, b) %>% expand(b, c)
# ->
df %>% expand(nesting(a), b, c)

I'm not sure if that would fix the issues with complete() too.
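
For readers unfamiliar with nesting(), here is a minimal sketch of the distinction the proposed translation relies on: columns wrapped in nesting() contribute only the combinations that occur in the data, while bare columns are fully crossed. The data frame is made up for illustration, and this only shows the helper's behaviour; it is not the implementation tidyr ended up with.

```r
library(tidyr)
library(tibble)

# Hypothetical data, for illustration only
df <- tibble(
  a = c(1L, 1L, 2L),
  b = c(1L, 2L, 1L),
  c = c(2L, 1L, 1L)
)

# Every combination of a, b and c: 2 * 2 * 2 = 8 rows
crossed <- expand(df, a, b, c)

# Only the three observed (a, b) pairs, each crossed with
# both values of c: 3 * 2 = 6 rows
nested <- expand(df, nesting(a, b), c)

nrow(crossed)
#> [1] 8
nrow(nested)
#> [1] 6
```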

hadley · 2018-01-25T13:53:07Z

@huftis it's not clear to me that those commands should give the same results

To me:

d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(gr1, gr2) %>%
  group_by(gr1)

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(splitgroup, gr1, gr2) %>%
  group_by(splitgroup)

(That's not what happens currently)

gvwilson · 2019-01-04T01:15:58Z

I tripped over this one earlier today - took me a while to figure out that complete doesn't respect group_by:

library(tidyverse)

raw <- tribble(
  ~left, ~right, ~value,
  "a",   "x",    1,
  "a",   "x",    2,
  "a",   "y",    1,
  "a",   "y",    2,
  "b",   "x",    1,
  "b",   "x",    2
)

cat("The raw data table\n")
raw

cat("\nCompleting the raw data table (fills in 'b'-'y')\n")
raw %>%
  complete(left, right, fill = list(value = 0))

cat("\nSumming values for distinct combinations of 'left' and 'right'\n")
raw %>%
  group_by(left, right) %>%
  summarize(total = sum(value))

cat("\nTrying to complete summation does _not_ fill in 'b'-'y'\n")
raw %>%
  group_by(left, right) %>%
  summarize(total = sum(value)) %>%
  complete(left, right, fill = list(total = 0))

What's the recommended workaround?
Should I PR an addition to the docs for complete to warn people about this case?

hadley · 2019-01-04T13:49:02Z

Add the grouping variables to complete() and then regroup? I think this got stuck last time because no one reassured me that my thoughts about the desired behaviour were correct.
I'd prefer to just fix it, and the timeline for getting it into CRAN is likely to be the same either way.
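
One reading of this workaround, sketched on the data from the comment above: drop the grouping left over from summarize(), name every variable that should be crossed in complete(), and regroup afterwards only if the grouping is still needed. The intermediate names `totals` and `completed` are illustrative.

```r
library(dplyr)
library(tidyr)

raw <- tribble(
  ~left, ~right, ~value,
  "a",   "x",    1,
  "a",   "x",    2,
  "a",   "y",    1,
  "a",   "y",    2,
  "b",   "x",    1,
  "b",   "x",    2
)

totals <- raw %>%
  group_by(left, right) %>%
  summarize(total = sum(value)) %>%
  ungroup()                     # drop the leftover grouping by `left`

completed <- totals %>%
  complete(left, right, fill = list(total = 0)) %>%
  group_by(left)                # regroup only if still needed

completed
```

Because complete() sees an ungrouped tibble here, the result should not depend on how a given tidyr version treats grouped input; the missing 'b'-'y' combination is filled with a total of 0.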

kendonB · 2019-01-04T22:50:06Z

I still think that completing based on a grouping variable is an odd thing to do and should either throw an error or a warning. Responding to @hadley's year-old example:

He says:

d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(gr1, gr2) %>%
  group_by(gr1)

I think that complete should operate on a grouped df just like other functions do. Other functions can see the grouping variables but they always only have a single value within the group. The grouping variables are always then already "complete" within the subdataframes and there would be no reason to include them, which might warrant a warning. The above then should not be equivalent.

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(splitgroup, gr1, gr2) %>%
  group_by(splitgroup)

Again, I would not expect these to be equivalent. In the former, the user is asking tidyr to complete the subdataframes defined by the grouping variable splitgroup. The set of values for gr1 and gr2 need not be the same across different values of splitgroup, so every value from the big dataframe wouldn't necessarily appear in every group. The latter example asks tidyr to complete the entire dataframe by all three variables, ensuring that all combinations of all three variables appear, and then groups by splitgroup.
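
A small sketch of the difference described here, using made-up data (`d2` is hypothetical) in which the two groups share no gr1 or gr2 values. Character and numeric columns are used instead of factors so that factor-level expansion does not blur the comparison. Splitting by hand avoids relying on how any particular tidyr version handles grouped input.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the groups s1 and s2 have disjoint gr1/gr2 values
d2 <- tibble(
  splitgroup = c("s1", "s1", "s2"),
  gr1 = c("A", "B", "C"),
  gr2 = c(1, 2, 3),
  x   = c(10, 20, 30)
)

# Complete each sub-data-frame on its own, then stack the pieces.
# nesting(splitgroup) carries the group label into the filled rows.
per_group <- split(d2, d2$splitgroup) %>%
  lapply(function(piece) complete(piece, nesting(splitgroup), gr1, gr2)) %>%
  bind_rows()

# Complete the whole data frame on all three variables
whole <- complete(d2, splitgroup, gr1, gr2)

nrow(per_group)  # s1 contributes 2 * 2 rows, s2 contributes 1
#> [1] 5
nrow(whole)      # 2 * 3 * 3 combinations
#> [1] 18
```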

hadley · 2019-03-08T13:54:05Z

@kendonB I agree with your analysis, although I think that we can just silently ignore grouping variables that are respecified in complete(). (I think detecting them is potentially hard because both complete() and group_by() have mutate() semantics).

hadley · 2019-04-29T20:38:35Z

Or should

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)

# Be equivalent to

d %>% 
  complete(nesting(splitgroup, crossing(gr1, gr2)) %>%

hadley · 2019-04-29T20:46:22Z

That code doesn't work, I suspect because I haven't fully thought this through.

thomas-bierhance-exxcellent · 2020-01-17T14:07:33Z

I would have expected that it behaves like this

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)
# Be equivalent to
d %>%
  expand(nesting(splitgroup, gr1)) %>%
  full_join(
    d %>% expand(nesting(splitgroup, gr2)),
    by="splitgroup") %>%
  left_join(d)

This would basically behave more or less like applying complete on each group separately.

DavisVaughan · 2021-12-20T18:13:58Z

I am beginning to think that expand() should not have a grouped_df method at all, and should return a bare tibble rather than one that is reconstructed from data. complete() would still return a grouped-df through its usage of reconstruct_tibble().

I think expand() fundamentally creates a "new" data frame, like expand_grid(). It is just a small wrapper around that which also does factor expansion and makes it easier to specify where those columns come from (i.e. data or some external vector). So it should return a bare tibble.
I think the factor level expansion that occurs with expand() is at odds with the idea of expanding within the groups that come from dplyr. This is why the original example here results in something so confusing. Here is an example:

library(tidyverse)

df <- tibble(
  g1 = factor(c("A", "B", "B")),
  g2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30)
)

df
#> # A tibble: 3 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 B     2        20
#> 3 B     2        30

# Typically get a distinct combination of rows back
# (I think this should be an invariant of expand())
df %>% 
  expand(g1, g2)
#> # A tibble: 4 × 2
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2

# When we group by g1, we get duplicates in the expansion due to factor expansion
# Notice (A, 1) is duplicated! I think this is a big bug, because it goes against the invariant from above.
df %>% 
  group_by(g1) %>% 
  expand(g1, g2)
#> # A tibble: 8 × 2
#> # Groups:   g1 [2]
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2    
#> 5 A     1    
#> 6 A     2    
#> 7 B     1    
#> 8 B     2

# We get duplicates because it basically does this. The factor level expansion
# fills in the data we "lose" from splitting by the g1 groups
df %>%
  group_split(g1)
#> <list_of<
#>   tbl_df<
#>     g1: factor<a022a>
#>     g2: factor<6ab52>
#>     x : double
#>   >
#> >[2]>
#> [[1]]
#> # A tibble: 1 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 
#> [[2]]
#> # A tibble: 2 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 B     2        20
#> 2 B     2        30

df %>%
  group_split(g1) %>%
  lapply(function(df) expand(df, g1, g2))
#> [[1]]
#> # A tibble: 4 × 2
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2    
#> 
#> [[2]]
#> # A tibble: 4 × 2
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2

If expand() didn't have a grouped-df method, this theoretical code would return the exact same thing (basically, the group_by() does nothing):

df %>%
  group_by(g) %>%
  expand(g, x)

df %>%
  expand(g, x)

# -----

df %>%
  group_by(g) %>%
  expand(x)

df %>%
  expand(x)

With complete(), the actual data would still be the same, but the result in the group-by cases would be grouped.

If you do want to complete "within" groups, you can use the new-ish feature of summarize() that allows you to return multi row outputs. So, you could recreate the expected original result, which is still confusing but a little easier to reason about now.

library(tidyverse)

df <- tibble(
  g1 = factor(c("A", "B", "B")),
  g2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30)
)

df
#> # A tibble: 3 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 B     2        20
#> 3 B     2        30

# Confusing but explicit, expected, and more predictable
df %>%
  group_by(g1) %>%
  summarise(complete(cur_data_all(), g1, g2), .groups = "drop")
#> # A tibble: 9 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 A     2        NA
#> 3 B     1        NA
#> 4 B     2        NA
#> 5 A     1        NA
#> 6 A     2        NA
#> 7 B     1        NA
#> 8 B     2        20
#> 9 B     2        30

This idea makes more sense with this modified example from the test suite:

library(tidyverse)

df <- tibble(
  a = c(1L, 1L, 2L),
  b = c(1L, 2L, 1L),
  c = c(2L, 1L, 1L),
  d = c(3L, 4L, 5L)
)

df
#> # A tibble: 3 × 4
#>       a     b     c     d
#>   <int> <int> <int> <int>
#> 1     1     1     2     3
#> 2     1     2     1     4
#> 3     2     1     1     5

df %>% 
  group_by(a) %>% 
  summarise(expand(cur_data(), b, c), .groups = "drop")
#> # A tibble: 5 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     1     1
#> 2     1     1     2
#> 3     1     2     1
#> 4     1     2     2
#> 5     2     1     1

df %>% 
  group_by(a) %>% 
  summarise(complete(cur_data(), b, c), .groups = "drop")
#> # A tibble: 5 × 4
#>       a     b     c     d
#>   <int> <int> <int> <int>
#> 1     1     1     1    NA
#> 2     1     1     2     3
#> 3     1     2     1     4
#> 4     1     2     2    NA
#> 5     2     1     1     5
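
A note for readers on newer dplyr: from dplyr 1.1.0, cur_data() and cur_data_all() are deprecated in favour of pick(), and returning several rows per group from summarise() is deprecated in favour of reframe(). A sketch of the same within-group completion in that style, assuming dplyr >= 1.1.0 is installed (the name `by_group` is illustrative):

```r
library(dplyr)   # assumes dplyr >= 1.1.0 for reframe() and pick()
library(tidyr)

df <- tibble(
  a = c(1L, 1L, 2L),
  b = c(1L, 2L, 1L),
  c = c(2L, 1L, 1L),
  d = c(3L, 4L, 5L)
)

# pick(everything()) returns the current group's rows without the
# grouping column, much like cur_data() did
by_group <- df %>%
  group_by(a) %>%
  reframe(complete(pick(everything()), b, c))

by_group
```

reframe() always returns an ungrouped data frame, so no `.groups = "drop"` is needed.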

hadley · 2021-12-20T21:15:47Z

That reasoning sounds logical to me.

This is the missing piece to solve the original issue in tidyverse#396. Previously, `complete()` called the grouped-df `expand()` method, and then joined that result back on the original data. This is wrong, as we need to expand on each group AND join back on each individual group of data.

* Require dplyr >=1.0.0
* NEWS bullet
* Replace `do()` with `summarise()` in `expand()`
* Add `expand()` test for expanding on grouping variable
* Add grouped-df method for `complete()`

  This is the missing piece to solve the original issue in #396. Previously, `complete()` called the grouped-df `expand()` method, and then joined that result back on the original data. This is wrong, as we need to expand on each group AND join back on each individual group of data.
* NEWS bullet
* Add an example of using `complete()` on a grouped-df
* Mention grouped-df behavior of `complete()` in the docs

DavisVaughan · 2021-12-21T18:42:52Z

The reasoning in #396 (comment) is sort of right, but not completely. We ended up adding a grouped-df method for complete(), which is really all that was needed to fix the original issue. i.e. this now produces the correct results:

library(tidyverse)

d = tibble(
  gr1 = factor(c("A", "B", "B")),
  gr2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30),
  splitgroup = gr1
)

d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2) %>% 
  select(-splitgroup)
#> # A tibble: 9 × 3
#> # Groups:   gr1 [2]
#>   gr1   gr2       x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 A     2        NA
#> 3 B     1        NA
#> 4 B     2        NA
#> 5 A     1        NA
#> 6 A     2        NA
#> 7 B     1        NA
#> 8 B     2        20
#> 9 B     2        30

I do think that expand() probably never needed a grouped-df method. Only complete() should have ever gotten one.

* complete() is a high level verb that is a combination of expand() with full_join() and replace_na(). So adding a grouped-df method is not unreasonable here.
* expand() is a lower level verb. We typically don't special case grouped data frames in lower level verbs.
* expand() should have the invariant that it always returns unique rows. But with the grouped-df method it has the potential to return duplicate rows.

Nevertheless, the grouped-df method for expand() has been around for quite a while, and someone probably depends on it. So we decided to keep it and only fix complete().
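
A short sketch of the within-group behaviour on the data from the original report, assuming a tidyr version that ships the grouped-df method for complete() (1.2.0 or later). It groups by splitgroup rather than gr1 so that no grouping column is named inside complete(), which is the case #1332 in the timeline below reports as erroring in v1.2. The name `res` is illustrative.

```r
library(dplyr)
library(tidyr)   # assumes tidyr >= 1.2.0

d <- tibble(
  gr1 = factor(c("A", "B", "B")),
  gr2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30),
  splitgroup = gr1
)

# Each splitgroup is completed separately; factor levels of gr1 and gr2
# are still expanded inside each group, so both groups get all 4 combinations
res <- d %>%
  group_by(splitgroup) %>%
  complete(gr1, gr2)

nrow(res)        # 4 rows for group A, 5 for group B (B/2 occurs twice)
#> [1] 9
group_vars(res)  # the grouping is kept
#> [1] "splitgroup"
```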

hadley mentioned this issue Jan 24, 2018

consider a warning for complete when operating on a grouped df? #399

Closed

hadley added the bug and pivoting ♻️ labels Jan 4, 2019

hadley added the group 👨‍👨‍👦‍👦 label Mar 8, 2019

hadley added the missing values 💀 label and removed the pivoting ♻️ label May 19, 2020

This was referenced Mar 19, 2021

complete doesn't work well with grouped data frames. #966

Closed

Update complete.R #1110

Closed

DavisVaughan added the grids #️⃣ label Nov 19, 2021

DavisVaughan mentioned this issue Dec 20, 2021

Remove grouped-df method from expand(), and don't reconstruct #1288

Closed

DavisVaughan mentioned this issue Dec 21, 2021

Add grouped-df method for complete() #1289

Merged

DavisVaughan closed this as completed in #1289 Dec 21, 2021

DavisVaughan mentioned this issue Mar 9, 2022

Bug in v1.2 : complete() errors when given the grouping variable of a grouped df #1332

Closed
