Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

complete() on grouping variables gives wrong output #396

Closed
huftis opened this issue Jan 9, 2018 · 16 comments · Fixed by #1289
Closed

complete() on grouping variables gives wrong output #396

huftis opened this issue Jan 9, 2018 · 16 comments · Fixed by #1289
Labels
bug an unexpected problem or unintended behavior grids #️⃣ expanding, nesting, crossing, ... group 👨‍👨‍👦‍👦 missing values 💀

Comments

@huftis
Copy link

huftis commented Jan 9, 2018

When running complete() on a grouped tibble and some of the variables completed on are also grouping variables, the resulting tibble is incorrect.

Here’s a reprex. First, I’ll create a simple tibble with three factors (gr1, gr2, splitgroup) and one numeric variable (x). The factor splitgroup is identical to gr1, so grouping on either variable should results in identical output. However, it doesn’t (I’ll remove splitgroup from the output just so that it doesn’t effect the ordering of the columns). There’s not even the same number of rows in the output:

library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1          ✔ purrr   0.2.4     
#> ✔ tibble  1.4.1          ✔ dplyr   0.7.4.9000
#> ✔ tidyr   0.7.2.9000     ✔ stringr 1.2.0     
#> ✔ readr   1.1.1          ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

# Example data
d = tibble(
  gr1 = factor(c("A", "B", "B")),
  gr2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30),
  splitgroup = gr1
)

# Complete on grouping variable
d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2) %>% 
  select(-splitgroup) # 10 rows
#> # A tibble: 10 x 3
#> # Groups:   gr1 [2]
#>    gr1    gr2        x
#>    <fctr> <fctr> <dbl>
#>  1 A      1       10.0
#>  2 A      2       NA  
#>  3 B      1       NA  
#>  4 B      2       20.0
#>  5 B      2       30.0
#>  6 A      1       10.0
#>  7 A      2       NA  
#>  8 B      1       NA  
#>  9 B      2       20.0
#> 10 B      2       30.0

# Completing on non-grouping but identical variable (should give same output)
d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2) %>% 
  ungroup %>% select(-splitgroup) # 9 rows
#> # A tibble: 9 x 3
#>   gr1    gr2        x
#>   <fctr> <fctr> <dbl>
#> 1 A      1       10.0
#> 2 A      2       NA  
#> 3 B      1       NA  
#> 4 B      2       NA  
#> 5 A      1       NA  
#> 6 A      2       NA  
#> 7 B      1       NA  
#> 8 B      2       20.0
#> 9 B      2       30.0

# Alternative method to find the *expected* results
# (which are identical to the results from the
# `group_by(splitgroup)` approach)
d %>% 
  split(.$gr1) %>% 
  map_df(~complete(., gr1, gr2)) %>% 
  select(-splitgroup) # 9 rows
#> # A tibble: 9 x 3
#>   gr1    gr2        x
#>   <fctr> <fctr> <dbl>
#> 1 A      1       10.0
#> 2 A      2       NA  
#> 3 B      1       NA  
#> 4 B      2       NA  
#> 5 A      1       NA  
#> 6 A      2       NA  
#> 7 B      1       NA  
#> 8 B      2       20.0
#> 9 B      2       30.0
@jnolis
Copy link

jnolis commented Jan 16, 2018

I was just coming here to post the same issue. It seems that complete ignores the grouping variables entirely. I can imagine some valid reasons for that, but if so then tidyr should output an error or at least a warning. This has caused me issues multiple times.

@kendonB
Copy link

kendonB commented Jan 24, 2018

There's a natural case here to output a warning/throw an error when trying to complete a grouped tibble that is being completed based on the same groups - the operation would do nothing if complete operates on the group_df like other functions operate on group_dfs.

tbh, I can not see a use case for using complete on a grouped df. I would advocate just ignoring the fact that it's a group_df altogether and have the behavior be identical regardless of initial grouping.

@huftis
Copy link
Author

huftis commented Jan 25, 2018

@kendonB Completing based on just the grouping variable(s) (e.g. completing A based on a tibble grouped by A) might be a strange thing do, but completing based on the grouping variables and other variables (e.g. completing A and B based on a tibble grouped by A) may very well be both sensible and useful.

@huftis
Copy link
Author

huftis commented Jan 25, 2018

@kendonB, @hadley Hmm, I’m not sure I agree with myself anymore. I was sure I had an example where completing based on a grouping variable made sense. But I can’t find it anymore. And thinking about it, I no longer am convinced that it makes sense. So having an error message seems fine to me (I think an error is better than a warning, as the current behaviour of completing on a grouped variable is buggy).

@hadley
Copy link
Member

hadley commented Jan 25, 2018

complete() doesn't have a grouped_df method, so this is the result of expand.grouped_df():

#' @export
expand.grouped_df <- function(data, ...) {
  dots <- quos(...)
  dplyr::do(data, expand(., !!! dots))
}

which TBH I'm not sure about either - the use of do() is a warning sign. I think it should probably translate more like

df %>% group_by(a) %>% expand(b)
# same as
df %>% expand(nesting(a), b)

df %>% group_by(a, b) %>% expand(c)
# ->
df %>% expand(nesting(a, b), c)

df %>% group_by(a, b) %>% expand(b, c)
# ->
df %>% expand(nesting(a), b, c)

I'm not sure that if that would fix the issues with complete() too.

@hadley
Copy link
Member

hadley commented Jan 25, 2018

@huftis it's not clear to me that those commands should give the same results

To me:

d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(gr1, gr2) %>%
  group_by(gr1)

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(splitgroup, gr1, gr2)
  group_by(splitgroup)

(That's not what happens currently)

@gvwilson
Copy link

gvwilson commented Jan 4, 2019

I tripped over this one earlier today - took me a while to figure out that complete doesn't respect group_by:

library(tidyverse)

raw <- tribble(
  ~left, ~right, ~value,
  "a",   "x",    1,
  "a",   "x",    2,
  "a",   "y",    1,
  "a",   "y",    2,
  "b",   "x",    1,
  "b",   "x",    2
)

cat("The raw data table\n")
raw

cat("\nCompleting the raw data table (fills in 'b'-'y')\n")
raw %>%
  complete(left, right, fill = list(value = 0))

cat("\nSumming values for distinct combinations of 'left' and 'right'\n")
raw %>%
  group_by(left, right) %>%
  summarize(total = sum(value))

cat("\nTrying to complete summation does _not_ fill in 'b'-'y'\n")
raw %>%
  group_by(left, right) %>%
  summarize(total = sum(value)) %>%
  complete(left, right, fill = list(total = 0))
  1. What's the recommended workaround?
  2. Should I PR an addition to the docs for complete to warn people about this case?

@hadley
Copy link
Member

hadley commented Jan 4, 2019

  1. Add the grouping variables to complete() and then regroup? I think this got stuck last time because no one reassured me that my thoughts about the desired behaviour were correct.

  2. I'd prefer to just fix it, and the timeline for getting it into CRAN is likely to be the same either way.

@hadley hadley added bug an unexpected problem or unintended behavior pivoting ♻️ pivot rectangular data to different "shapes" labels Jan 4, 2019
@kendonB
Copy link

kendonB commented Jan 4, 2019

I still think that completing based on a grouping variable is an odd thing to do and should either throw an error or a warning. Responding to @hadley's year-old example:

He says:

d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(gr1, gr2) %>%
  group_by(gr1)

I think that complete should operate on a grouped df just like other functions do. Other functions can see the grouping variables but they always only have a single value within the group. The grouping variables are always then already "complete" within the subdataframes and there would be no reason to include them, which might warrant a warning. The above then should not be equivalent.

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)
# Should be equivalent to 
d %>% 
  complete(splitgroup, gr1, gr2)
  group_by(splitgroup)

Again, I would not expect these to be equivalent. In the former, the user is asking tidyr to complete the subdataframes defined by the grouping variable splitgroup. The set of values for gr1 and gr2 need not be same across different values of splitgroup so every value from the big dataframe wouldn't necessarily appear in every group. The latter example asks tidyr to complete the entire dataframe by all three variables, ensuring that all combinations of all three variables appear, then group by splitgroup.

@hadley
Copy link
Member

hadley commented Mar 8, 2019

@kendonB I agree with your analysis, although I think that we can just silently ignore grouping variables that are respecified in complete(). (I think detecting them is potentially hard because both complete() and group_by() have mutate() semantics).

@hadley
Copy link
Member

hadley commented Apr 29, 2019

Or should

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)

# Be equivalent to

d %>% 
  complete(nesting(splitgroup, crossing(gr1, gr2)) %>%

@hadley
Copy link
Member

hadley commented Apr 29, 2019

That code doesn't work, I suspect because I haven't full thought this through.

@thomas-bierhance-exxcellent
Copy link

I would have expected that it behaves like this

d %>% 
  group_by(splitgroup) %>% 
  complete(gr1, gr2)
# Be equivalent to
d %>%
  expand(nesting(splitgroup, gr1)) %>%
  full_join(
    d %>% expand(nesting(splitgroup, gr2)),
    by="splitgroup") %>%
  left_join(d)

This would basically behave more or less like applying complete on each group separately.

@hadley hadley added missing values 💀 and removed pivoting ♻️ pivot rectangular data to different "shapes" labels May 19, 2020
@DavisVaughan DavisVaughan added the grids #️⃣ expanding, nesting, crossing, ... label Nov 19, 2021
@DavisVaughan
Copy link
Member

DavisVaughan commented Dec 20, 2021

I am beginning to think that expand() should not have a grouped_df method at all, and should return a bare tibble rather than one that is reconstructed from data. complete() would still return a grouped-df through its usage of reconstruct_tibble().

  • I think expand() fundamentally creates a "new" data frame, like expand_grid(). It is just a small wrapper around that which also does factor expansion and makes it easier to specify where those columns come from (i.e. data or some external vector). So it should return a bare tibble.

  • I think the factor level expansion that occurs with expand() is at odds with the idea of expanding within the groups that come from dplyr. This is why the original example here results in something so confusing. Here is an example:

library(tidyverse)

df <- tibble(
  g1 = factor(c("A", "B", "B")),
  g2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30)
)

df
#> # A tibble: 3 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 B     2        20
#> 3 B     2        30

# Typically get a distinct combination of rows back
# (I think this should be an invariant of expand())
df %>% 
  expand(g1, g2)
#> # A tibble: 4 × 2
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2

# When we group by g1, we get duplicates in the expansion due to factor expansion
# Notice (A, 1) is duplicated! I think this is a big bug, because it goes against the invariant from above.
df %>% 
  group_by(g1) %>% 
  expand(g1, g2)
#> # A tibble: 8 × 2
#> # Groups:   g1 [2]
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2    
#> 5 A     1    
#> 6 A     2    
#> 7 B     1    
#> 8 B     2

# We get duplicates because it basically does this. The factor level expansion
# fills in the data we "lose" from splitting by the g1 groups
df %>%
  group_split(g1)
#> <list_of<
#>   tbl_df<
#>     g1: factor<a022a>
#>     g2: factor<6ab52>
#>     x : double
#>   >
#> >[2]>
#> [[1]]
#> # A tibble: 1 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 
#> [[2]]
#> # A tibble: 2 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 B     2        20
#> 2 B     2        30

df %>%
  group_split(g1) %>%
  lapply(function(df) expand(df, g1, g2))
#> [[1]]
#> # A tibble: 4 × 2
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2    
#> 
#> [[2]]
#> # A tibble: 4 × 2
#>   g1    g2   
#>   <fct> <fct>
#> 1 A     1    
#> 2 A     2    
#> 3 B     1    
#> 4 B     2

If expand() didn't have a grouped-df method, this theoretical code would return the exact same thing (basically, the group_by() does nothing):

df %>%
  group_by(g) %>%
  expand(g, x)

df %>%
  expand(g, x)

# -----

df %>%
  group_by(g) %>%
  expand(x)

df %>%
  expand(x)

With complete(), the actual data would still be the same, but the result in the group-by cases would be grouped.

If you do want to complete "within" groups, you can use the new-ish feature of summarize() that allows you to return multi row outputs. So, you could recreate the expected original result, which is still confusing but a little easier to reason about now.

library(tidyverse)

df <- tibble(
  g1 = factor(c("A", "B", "B")),
  g2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30)
)

df
#> # A tibble: 3 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 B     2        20
#> 3 B     2        30

# Confusing but explicit, expected, and more predictable
df %>%
  group_by(g1) %>%
  summarise(complete(cur_data_all(), g1, g2), .groups = "drop")
#> # A tibble: 9 × 3
#>   g1    g2        x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 A     2        NA
#> 3 B     1        NA
#> 4 B     2        NA
#> 5 A     1        NA
#> 6 A     2        NA
#> 7 B     1        NA
#> 8 B     2        20
#> 9 B     2        30

This idea makes more sense with this modified example from the test suite:

library(tidyverse)

df <- tibble(
  a = c(1L, 1L, 2L),
  b = c(1L, 2L, 1L),
  c = c(2L, 1L, 1L),
  d = c(3L, 4L, 5L)
)

df
#> # A tibble: 3 × 4
#>       a     b     c     d
#>   <int> <int> <int> <int>
#> 1     1     1     2     3
#> 2     1     2     1     4
#> 3     2     1     1     5

df %>% 
  group_by(a) %>% 
  summarise(expand(cur_data(), b, c), .groups = "drop")
#> # A tibble: 5 × 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     1     1     1
#> 2     1     1     2
#> 3     1     2     1
#> 4     1     2     2
#> 5     2     1     1

df %>% 
  group_by(a) %>% 
  summarise(complete(cur_data(), b, c), .groups = "drop")
#> # A tibble: 5 × 4
#>       a     b     c     d
#>   <int> <int> <int> <int>
#> 1     1     1     1    NA
#> 2     1     1     2     3
#> 3     1     2     1     4
#> 4     1     2     2    NA
#> 5     2     1     1     5

@hadley
Copy link
Member

hadley commented Dec 20, 2021

That reasoning sounds logical to me.

DavisVaughan added a commit to DavisVaughan/tidyr that referenced this issue Dec 21, 2021
This is the missing piece to solve the original issue in tidyverse#396. Previously, `complete()` called the grouped-df `expand()` method, and then joined that result back on the original data. This is wrong, as we need to expand on each group AND join back on each individual group of data.
DavisVaughan added a commit to DavisVaughan/tidyr that referenced this issue Dec 21, 2021
This is the missing piece to solve the original issue in tidyverse#396. Previously, `complete()` called the grouped-df `expand()` method, and then joined that result back on the original data. This is wrong, as we need to expand on each group AND join back on each individual group of data.
DavisVaughan added a commit that referenced this issue Dec 21, 2021
* Require dplyr >=1.0.0

* NEWS bullet

* Replace `do()` with `summarise()` in `expand()`

* Add `expand()` test for expanding on grouping variable

* Add grouped-df method for `complete()`

This is the missing piece to solve the original issue in #396. Previously, `complete()` called the grouped-df `expand()` method, and then joined that result back on the original data. This is wrong, as we need to expand on each group AND join back on each individual group of data.

* NEWS bullet

* Add an example of using `complete()` on a grouped-df

* Mention grouped-df behavior of `complete()` in the docs
@DavisVaughan
Copy link
Member

The reasoning in #396 (comment) is sort of right, but not completely. We ended up adding a grouped-df method for complete(), which is really all that was needed to fix the original issue. i.e. this now produces the correct results:

library(tidyverse)

d = tibble(
  gr1 = factor(c("A", "B", "B")),
  gr2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30),
  splitgroup = gr1
)

d %>% 
  group_by(gr1) %>% 
  complete(gr1, gr2) %>% 
  select(-splitgroup)
#> # A tibble: 9 × 3
#> # Groups:   gr1 [2]
#>   gr1   gr2       x
#>   <fct> <fct> <dbl>
#> 1 A     1        10
#> 2 A     2        NA
#> 3 B     1        NA
#> 4 B     2        NA
#> 5 A     1        NA
#> 6 A     2        NA
#> 7 B     1        NA
#> 8 B     2        20
#> 9 B     2        30

I do think that expand() probably never needed a grouped-df method. Only complete() should have ever gotten one.

  • complete() is a high level verb that is a combination of expand() with full_join() and replace_na(). So adding a grouped-df method is not unreasonable here
  • expand() is a lower level verb. We typically don't special case grouped data frames in lower level verbs.
  • expand() should have the invariant that it always returns unique rows. But with the grouped-df method it has the potential to return duplicate rows.

Nevertheless, the grouped-df method for expand() has been around for quite awhile, and someone probably depends on it. So we decided to keep it and only fix complete().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior grids #️⃣ expanding, nesting, crossing, ... group 👨‍👨‍👦‍👦 missing values 💀
Projects
None yet
7 participants