complete() on grouping variables gives wrong output #396
Comments
I was just coming here to post the same issue. It seems that complete ignores the grouping variables entirely. I can imagine some valid reasons for that, but if so then tidyr should output an error or at least a warning. This has caused me issues multiple times. |
There's a natural case here to output a warning/throw an error when trying to complete a grouped tibble that is being completed based on the same groups - the operation would do nothing if complete operates on the group_df like other functions operate on group_dfs. tbh, I can not see a use case for using |
@kendonB Completing based on just the grouping variable(s) (e.g. completing A based on a tibble grouped by A) might be a strange thing to do, but completing based on the grouping variables and other variables (e.g. completing A and B based on a tibble grouped by A) may very well be both sensible and useful. |
@kendonB, @hadley Hmm, I’m not sure I agree with myself anymore. I was sure I had an example where completing based on a grouping variable made sense. But I can’t find it anymore. And thinking about it, I no longer am convinced that it makes sense. So having an error message seems fine to me (I think an error is better than a warning, as the current behaviour of completing on a grouped variable is buggy). |
#' @export
expand.grouped_df <- function(data, ...) {
dots <- quos(...)
dplyr::do(data, expand(., !!! dots))
} which TBH I'm not sure about either - the use of df %>% group_by(a) %>% expand(b)
# same as
df %>% expand(nesting(a), b)
df %>% group_by(a, b) %>% expand(c)
# ->
df %>% expand(nesting(a, b), c)
df %>% group_by(a, b) %>% expand(b, c)
# ->
df %>% expand(nesting(a), b, c) I'm not sure that if that would fix the issues with |
@huftis it's not clear to me that those commands should give the same results. To me: d %>%
group_by(gr1) %>%
complete(gr1, gr2)
# Should be equivalent to
d %>%
complete(gr1, gr2) %>%
group_by(gr1)
d %>%
group_by(splitgroup) %>%
complete(gr1, gr2)
# Should be equivalent to
d %>%
complete(splitgroup, gr1, gr2) %>%
group_by(splitgroup) (That's not what happens currently) |
I tripped over this one earlier today - took me a while to figure out that
|
|
I still think that completing based on a grouping variable is an odd thing to do and should either throw an error or a warning. Responding to @hadley's year-old example: He says: d %>%
group_by(gr1) %>%
complete(gr1, gr2)
# Should be equivalent to
d %>%
complete(gr1, gr2) %>%
group_by(gr1) I think that d %>%
group_by(splitgroup) %>%
complete(gr1, gr2)
# Should be equivalent to
d %>%
complete(splitgroup, gr1, gr2) %>%
group_by(splitgroup) Again, I would not expect these to be equivalent. In the former, the user is asking |
@kendonB I agree with your analysis, although I think that we can just silently ignore grouping variables that are respecified in |
Or should d %>%
group_by(splitgroup) %>%
complete(gr1, gr2)
# Be equivalent to
d %>%
complete(nesting(splitgroup, crossing(gr1, gr2)) %>% |
That code doesn't work, I suspect because I haven't fully thought this through. |
I would have expected that it behaves like this
This would basically behave more or less like applying complete on each group separately. |
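A minimal sketch of what "applying complete on each group separately" could look like, using a small integer tibble (hypothetical data, chosen so that no factor-level expansion is involved). The pieces are split and completed by hand rather than through `group_by()`, so the sketch does not depend on how a particular tidyr version treats grouped data frames; `nesting(a)` only carries each piece's own group value into the completed rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data: `a` plays the role of the grouping variable.
df <- tibble(
  a = c(1L, 1L, 2L),
  b = c(1L, 2L, 1L),
  c = c(2L, 1L, 1L),
  d = c(3L, 4L, 5L)
)

# Complete b x c separately inside each value of `a`.
per_group <- df %>%
  split(df$a) %>%
  lapply(function(piece) complete(piece, nesting(a), b, c)) %>%
  bind_rows()

# Complete b x c across the whole table, keeping only the observed values of `a`.
whole_table <- complete(df, nesting(a), b, c)

nrow(per_group)
nrow(whole_table)
```

The per-group version has 5 rows, because the `a = 2` piece only ever sees `b = 1, c = 1`; the whole-table version has 8, because every observed `b` and `c` is crossed with every observed `a`.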
I am beginning to think that
library(tidyverse)
df <- tibble(
g1 = factor(c("A", "B", "B")),
g2 = factor(c(1, 2, 2)),
x = c(10, 20, 30)
)
df
#> # A tibble: 3 × 3
#> g1 g2 x
#> <fct> <fct> <dbl>
#> 1 A 1 10
#> 2 B 2 20
#> 3 B 2 30
# Typically get a distinct combination of rows back
# (I think this should be an invariant of expand())
df %>%
expand(g1, g2)
#> # A tibble: 4 × 2
#> g1 g2
#> <fct> <fct>
#> 1 A 1
#> 2 A 2
#> 3 B 1
#> 4 B 2
# When we group by g1, we get duplicates in the expansion due to factor expansion
# Notice (A, 1) is duplicated! I think this is a big bug, because it goes against the invariant from above.
df %>%
group_by(g1) %>%
expand(g1, g2)
#> # A tibble: 8 × 2
#> # Groups: g1 [2]
#> g1 g2
#> <fct> <fct>
#> 1 A 1
#> 2 A 2
#> 3 B 1
#> 4 B 2
#> 5 A 1
#> 6 A 2
#> 7 B 1
#> 8 B 2
# We get duplicates because it basically does this. The factor level expansion
# fills in the data we "lose" from splitting by the g1 groups
df %>%
group_split(g1)
#> <list_of<
#> tbl_df<
#> g1: factor<a022a>
#> g2: factor<6ab52>
#> x : double
#> >
#> >[2]>
#> [[1]]
#> # A tibble: 1 × 3
#> g1 g2 x
#> <fct> <fct> <dbl>
#> 1 A 1 10
#>
#> [[2]]
#> # A tibble: 2 × 3
#> g1 g2 x
#> <fct> <fct> <dbl>
#> 1 B 2 20
#> 2 B 2 30
df %>%
group_split(g1) %>%
lapply(function(df) expand(df, g1, g2))
#> [[1]]
#> # A tibble: 4 × 2
#> g1 g2
#> <fct> <fct>
#> 1 A 1
#> 2 A 2
#> 3 B 1
#> 4 B 2
#>
#> [[2]]
#> # A tibble: 4 × 2
#> g1 g2
#> <fct> <fct>
#> 1 A 1
#> 2 A 2
#> 3 B 1
#> 4 B 2 If df %>%
group_by(g) %>%
expand(g, x)
df %>%
expand(g, x)
# -----
df %>%
group_by(g) %>%
expand(x)
df %>%
expand(x) With If you do want to complete "within" groups, you can use the new-ish feature of library(tidyverse)
df <- tibble(
g1 = factor(c("A", "B", "B")),
g2 = factor(c(1, 2, 2)),
x = c(10, 20, 30)
)
df
#> # A tibble: 3 × 3
#> g1 g2 x
#> <fct> <fct> <dbl>
#> 1 A 1 10
#> 2 B 2 20
#> 3 B 2 30
# Confusing but explicit, expected, and more predictable
df %>%
group_by(g1) %>%
summarise(complete(cur_data_all(), g1, g2), .groups = "drop")
#> # A tibble: 9 × 3
#> g1 g2 x
#> <fct> <fct> <dbl>
#> 1 A 1 10
#> 2 A 2 NA
#> 3 B 1 NA
#> 4 B 2 NA
#> 5 A 1 NA
#> 6 A 2 NA
#> 7 B 1 NA
#> 8 B 2 20
#> 9 B 2 30 This idea makes more sense with this modified example from the test suite: library(tidyverse)
df <- tibble(
a = c(1L, 1L, 2L),
b = c(1L, 2L, 1L),
c = c(2L, 1L, 1L),
d = c(3L, 4L, 5L)
)
df
#> # A tibble: 3 × 4
#> a b c d
#> <int> <int> <int> <int>
#> 1 1 1 2 3
#> 2 1 2 1 4
#> 3 2 1 1 5
df %>%
group_by(a) %>%
summarise(expand(cur_data(), b, c), .groups = "drop")
#> # A tibble: 5 × 3
#> a b c
#> <int> <int> <int>
#> 1 1 1 1
#> 2 1 1 2
#> 3 1 2 1
#> 4 1 2 2
#> 5 2 1 1
df %>%
group_by(a) %>%
summarise(complete(cur_data(), b, c), .groups = "drop")
#> # A tibble: 5 × 4
#> a b c d
#> <int> <int> <int> <int>
#> 1 1 1 1 NA
#> 2 1 1 2 3
#> 3 1 2 1 4
#> 4 1 2 2 NA
#> 5 2 1 1 5 |
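For readers on newer dplyr: `cur_data()` and `cur_data_all()` were deprecated in dplyr 1.1.0 in favour of `pick()`, and multi-row results from `summarise()` in favour of `reframe()`. Assuming dplyr 1.1.0 or later, the same within-group completion might be written as below. This is a sketch of the equivalent call, not output copied from the thread.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  a = c(1L, 1L, 2L),
  b = c(1L, 2L, 1L),
  c = c(2L, 1L, 1L),
  d = c(3L, 4L, 5L)
)

# pick(everything()) returns the current group's non-grouping columns;
# reframe() allows any number of rows per group and returns an ungrouped result.
completed <- df %>%
  group_by(a) %>%
  reframe(complete(pick(everything()), b, c))

completed
```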
That reasoning sounds logical to me. |
This is the missing piece to solve the original issue in tidyverse#396. Previously, `complete()` called the grouped-df `expand()` method, and then joined that result back on the original data. This is wrong, as we need to expand on each group AND join back on each individual group of data.
* Require dplyr >=1.0.0
* NEWS bullet
* Replace `do()` with `summarise()` in `expand()`
* Add `expand()` test for expanding on grouping variable
* Add grouped-df method for `complete()`

  This is the missing piece to solve the original issue in #396. Previously, `complete()` called the grouped-df `expand()` method, and then joined that result back on the original data. This is wrong, as we need to expand on each group AND join back on each individual group of data.
* NEWS bullet
* Add an example of using `complete()` on a grouped-df
* Mention grouped-df behavior of `complete()` in the docs
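The difference between joining the expansion back onto the whole table and joining it back onto each group can be sketched by hand on the factor example from this thread. This is only an illustration, assuming the per-group expansion is the same in both cases; it is not tidyr's actual implementation, and the object names are made up for the example.

```r
library(dplyr)
library(tidyr)

d <- tibble(
  gr1 = factor(c("A", "B", "B")),
  gr2 = factor(c(1, 2, 2)),
  x = c(10, 20, 30)
)

# Split by gr1; each piece keeps the full set of factor levels.
pieces <- split(d, d$gr1)

# Expand within each piece (factor columns expand to all of their levels).
expanded <- lapply(pieces, function(piece) expand(piece, gr1, gr2))

# Strategy 1: stack the expansions and join once against the whole table.
# suppressWarnings() hides the many-to-many notice newer dplyr versions emit.
joined_whole <- suppressWarnings(
  full_join(bind_rows(expanded), d, by = c("gr1", "gr2"))
)

# Strategy 2: join each expansion back onto its own piece, then stack.
joined_per_group <- bind_rows(
  Map(
    function(grid, piece) full_join(grid, piece, by = c("gr1", "gr2")),
    expanded, pieces
  )
)

nrow(joined_whole)
nrow(joined_per_group)
```

Joining per group gives 9 rows, the same count as the grouped outputs shown elsewhere in this thread. Joining against the whole table gives 10, because the duplicated `(A, 1)` and `(B, 2)` rows of the stacked expansion each pick up every matching row of `d`.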
The reasoning in #396 (comment) is sort of right, but not completely. We ended up adding a grouped-df method for library(tidyverse)
d = tibble(
gr1 = factor(c("A", "B", "B")),
gr2 = factor(c(1, 2, 2)),
x = c(10, 20, 30),
splitgroup = gr1
)
d %>%
group_by(gr1) %>%
complete(gr1, gr2) %>%
select(-splitgroup)
#> # A tibble: 9 × 3
#> # Groups: gr1 [2]
#> gr1 gr2 x
#> <fct> <fct> <dbl>
#> 1 A 1 10
#> 2 A 2 NA
#> 3 B 1 NA
#> 4 B 2 NA
#> 5 A 1 NA
#> 6 A 2 NA
#> 7 B 1 NA
#> 8 B 2 20
#> 9 B 2 30 I do think that
Nevertheless, the grouped-df method for |
When running `complete()` on a grouped tibble and some of the variables completed on are also grouping variables, the resulting tibble is incorrect.

Here’s a reprex. First, I’ll create a simple tibble with three factors (`gr1`, `gr2`, `splitgroup`) and one numeric variable (`x`). The factor `splitgroup` is identical to `gr1`, so grouping on either variable should result in identical output. However, it doesn’t (I’ll remove `splitgroup` from the output just so that it doesn’t affect the ordering of the columns). There’s not even the same number of rows in the output: