Empty grouping levels in the "groups" attribute #5830

nathaneastwood · 2021-03-29T14:32:34Z

I recently posted this question on StackOverflow but didn't get very satisfactory answers. I believe this topic has already been discussed in several places with respect to the .drop argument (#4392 and #341, for example), but perhaps not with respect to the "groups" attribute itself. I repeat the StackOverflow question below for convenience.

Taking an example from the dplyr tests:

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
) %>%
  group_by(e, f, g, .drop = FALSE)

I don't quite understand why or how the "groups" attribute ends up defined the way it is:

attr(df, "groups")
# # A tibble: 3 x 4
#       e f         g       .rows
#   <dbl> <fct> <dbl> <list<int>>
# 1     1 1         1         [2]
# 2     1 2         2         [2]
# 3     1 3        NA         [0]

The third row doesn't make any sense to me, because it's not a valid group within the original data. I'd have thought the result would be:

# # A tibble: 3 x 4
#       e f         g       .rows
#   <dbl> <fct> <dbl> <list<int>>
# 1     1 1         1         [2]
# 2     1 2         2         [2]
# 3    NA 3        NA         [0]

The consensus seems to be that dplyr is recycling the single valued column e based on dplyr's recycling rules.

So my first question is: is this the case? And could you please point me to the documentation that says so, so that I can better educate myself about the rule?

Secondly, if this is true, I don't quite understand why this would be the case (see suggested alternative). It makes a huge assumption about the data in question which is that the column e cannot - and does not - take on any other value. It also assumes that the combination of factor level 3 for column f along with the value 1 from column e is a valid combination. To me, the result should either create all possible combinations of missing levels (i.e. including values which are both available and missing) from e, f and g or simply return only "known missing data", i.e. known factor levels (again, see suggested alternative).
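
For illustration only, here is a minimal sketch of one possible reading of the "all possible combinations" alternative, using base R's expand.grid(). This is an assumption about what such a result could look like, not something dplyr does; the name all_keys is hypothetical.

```r
df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)

# Hypothetical alternative, not dplyr behaviour: cross every observed value
# of e and g with every declared level of f (including the unused level 3).
all_keys <- expand.grid(
  e = unique(df$e),
  f = factor(levels(df$f), levels = levels(df$f)),
  g = unique(df$g)
)

# 1 value of e x 3 levels of f x 2 values of g
nrow(all_keys)
#> [1] 6
```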


romainfrancois · 2021-03-30T08:30:59Z

Let's have a look.

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)
df
#>   e f g x
#> 1 1 1 1 1
#> 2 1 1 1 2
#> 3 1 2 2 1
#> 4 1 2 2 4

When you group_by(e, f, g, .drop = FALSE), what currently happens is that you:

first split the data by e. e is not a factor, so you get one group: e == 1
then split by f. f is a factor with 3 levels and .drop is set to FALSE, so you get 3 groups, no matter what

At this stage, you'd get this:

g <- group_by(df, e, f, .drop = FALSE)
group_split(g)
#> <list_of<
#>   tbl_df<
#>     e: double
#>     f: factor<d3bfc>
#>     g: double
#>     x: double
#>   >
#> >[3]>
#> [[1]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 1         1     1
#> 2     1 1         1     2
#> 
#> [[2]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 2         2     1
#> 2     1 2         2     4
#> 
#> [[3]]
#> # A tibble: 0 x 4
#> # … with 4 variables: e <dbl>, f <fct>, g <dbl>, x <dbl>
group_data(g)
#> # A tibble: 3 x 3
#>       e f           .rows
#>   <dbl> <fct> <list<int>>
#> 1     1 1             [2]
#> 2     1 2             [2]
#> 3     1 3             [0]

Then you further split by g, which is not a factor and so is not affected by .drop. Each of the 3 groups from the previous step is divided based on the values of g observed in that group.

The first group:

#> [[1]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 1         1     1
#> 2     1 1         1     2

has only g == 1, so you get one group

The second group:

#> [[2]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 2         2     1
#> 2     1 2         2     4

only has g == 2 so you only get one group

The third group is empty.

#> [[3]]
#> # A tibble: 0 x 4
#> # … with 4 variables: e <dbl>, f <fct>, g <dbl>, x <dbl>

but still corresponds to e = 1, f = 3. The value of g is set to NA in the grouping data because there needs to be something.
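
As a quick check of the walkthrough above, the sketch below inspects the final three-variable grouping with group_keys() and group_size(). The variable names full and keys are only for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)

full <- group_by(df, e, f, g, .drop = FALSE)

# One row of keys per group; only the empty group (f == 3) has g set to NA.
keys <- group_keys(full)
is.na(keys$g)
#> [1] FALSE FALSE  TRUE

# Number of rows in each group; the group kept for the unused level is empty.
group_size(full)
#> [1] 2 2 0
```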

The notion of a "valid group in the data" is governed by the presence of factors and where they appear in the sequence of grouping variables. The group e = 1, f = 3 is valid. If we don't want it, we can either set .drop = TRUE or demote f so that it is no longer a factor.
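
Both ways of avoiding the empty group can be sketched as follows. The names dropped and demoted are only for illustration, and converting f with as.character() is just one way to demote it.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)

# Option 1: drop groups formed by unused factor levels.
dropped <- group_by(df, e, f, g, .drop = TRUE)
n_groups(dropped)
#> [1] 2

# Option 2: keep .drop = FALSE but turn f into a plain character vector,
# so there are no declared-but-unobserved levels left to expand.
demoted <- df %>%
  mutate(f = as.character(f)) %>%
  group_by(e, f, g, .drop = FALSE)
n_groups(demoted)
#> [1] 2
```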

hadley · 2021-03-31T13:36:42Z

We've worked through this at length in the past and this is the best we've been able to come up with. It's a reasonable approach that, while not perfect, works for most cases, and I don't think we want to reconsider our past decisions at this point.

nathaneastwood · 2021-03-31T14:15:34Z

@romainfrancois thanks for the explanation. It makes much more sense now and I understand the comment in the code about recursively splitting.

@hadley I'm sure it has been discussed - but where? This kind of insight into how the package works is really useful to understand.

hadley · 2021-03-31T16:41:45Z

@nathaneastwood probably some in issues and some in our private team chat. Unfortunately we don't have the time to clearly explain every development decision we make.

hadley closed this as completed Mar 31, 2021

nathaneastwood mentioned this issue Apr 18, 2021

Fix empty groups tests nathaneastwood/poorman#82

Open
