Empty grouping levels in the "groups" attribute #5830

Closed
nathaneastwood opened this issue Mar 29, 2021 · 4 comments
Comments

@nathaneastwood

nathaneastwood commented Mar 29, 2021

I recently posted this question on StackOverflow but didn't get very satisfactory answers. I believe the .drop argument has already been discussed in several places (#4392 and #341, for example), but perhaps not the "groups" attribute itself. I repeat the StackOverflow question below for convenience.

Taking an example from the dplyr tests:

library(dplyr)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
) %>%
  group_by(e, f, g, .drop = FALSE)

I don't quite understand why or how the "groups" attribute is defined the way it is:

attr(df, "groups")
# # A tibble: 3 x 4
#       e f         g       .rows
#   <dbl> <fct> <dbl> <list<int>>
# 1     1 1         1         [2]
# 2     1 2         2         [2]
# 3     1 3        NA         [0]

The third row doesn't make any sense to me: it's not a valid group within the original data. I'd have thought the result would be:

# # A tibble: 3 x 4
#       e f         g       .rows
#   <dbl> <fct> <dbl> <list<int>>
# 1     1 1         1         [2]
# 2     1 2         2         [2]
# 3    NA 3        NA         [0]

The consensus seems to be that dplyr is recycling the single valued column e based on dplyr's recycling rules.

So my first question is: is this the case? If so, could you please point me to the documentation that says as much, so that I can better understand the rule?

Secondly, if this is true, I don't quite understand why it would be the case (see my suggested alternative above). It makes a huge assumption about the data, namely that column e cannot - and does not - take on any other value. It also assumes that factor level 3 for column f combined with the value 1 from column e is a valid combination. To me, the result should either contain all possible combinations of missing levels (i.e. including values which are both available and missing) from e, f and g, or simply return only "known missing data", i.e. known factor levels (again, see my suggested alternative).

@romainfrancois
Member

Let's have a look.

library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)
df
#>   e f g x
#> 1 1 1 1 1
#> 2 1 1 1 2
#> 3 1 2 2 1
#> 4 1 2 2 4

When you group_by(e, f, g, .drop = FALSE), what currently happens is that you:

  • first split the data by e. e is not a factor, so you get one group: e == 1
  • then split by f. f is a factor with 3 levels and .drop is set to FALSE, so you get 3 groups, no matter what

At this stage, you'd get this:

g <- group_by(df, e, f, .drop = FALSE)
group_split(g)
#> <list_of<
#>   tbl_df<
#>     e: double
#>     f: factor<d3bfc>
#>     g: double
#>     x: double
#>   >
#> >[3]>
#> [[1]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 1         1     1
#> 2     1 1         1     2
#> 
#> [[2]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 2         2     1
#> 2     1 2         2     4
#> 
#> [[3]]
#> # A tibble: 0 x 4
#> # … with 4 variables: e <dbl>, f <fct>, g <dbl>, x <dbl>
group_data(g)
#> # A tibble: 3 x 3
#>       e f           .rows
#>   <dbl> <fct> <list<int>>
#> 1     1 1             [2]
#> 2     1 2             [2]
#> 3     1 3             [0]
  • then you further split by g, which is not a factor and so is not affected by .drop. Each of the 3 groups from the previous step is divided based on the values of g observed in that group

The first group:

#> [[1]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 1         1     1
#> 2     1 1         1     2

had only g == 1 so you get one group

The second group:

#> [[2]]
#> # A tibble: 2 x 4
#>       e f         g     x
#>   <dbl> <fct> <dbl> <dbl>
#> 1     1 2         2     1
#> 2     1 2         2     4

only has g == 2 so you only get one group

The third group is empty.

#> [[3]]
#> # A tibble: 0 x 4
#> # … with 4 variables: e <dbl>, f <fct>, g <dbl>, x <dbl>

It still corresponds to e = 1, f = 3, though. The value of g is set to NA in the grouping data because the g column needs some value and none was observed in this group.
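The end result of these three steps can be checked directly. The following is a minimal sketch, assuming a dplyr version with the same .drop behaviour as in this thread; the names gdf, keys and counts are only illustrative and do not come from the discussion above.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)

gdf <- group_by(df, e, f, g, .drop = FALSE)

# Number of groups and number of rows in each group;
# the empty f == 3 group is kept and has size 0.
n_groups(gdf)
group_size(gdf)

# No value of g was observed in the empty group, so its key is NA.
keys <- group_keys(gdf)
keys$g

# Summaries still return one row for the empty group.
counts <- summarise(gdf, n = n(), .groups = "drop")
counts
```

The summarise() call is where the empty group usually becomes visible in practice: it shows up as a row with n equal to 0 and g equal to NA.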

The notion of a "valid group in the data" is governed by the presence of factors and by where they appear in the sequence of grouping variables. The group e = 1, f = 3 is valid. If we don't want it, we can either set .drop = TRUE or convert f so that it is no longer a factor.
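As a rough illustration of both points (the effect of ordering, and the two ways of getting rid of the empty group), here is a hedged sketch. The object names f_then_g, g_then_f, dropped and demoted are made up for this example, and as.character() is just one possible way to make f a non-factor.

```r
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(
  e = 1,
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  g = c(1, 1, 2, 2),
  x = c(1, 2, 1, 4)
)

# Factor first: f is expanded to all of its levels,
# then g only splits on the values observed within each f group.
f_then_g <- group_by(df, f, g, .drop = FALSE)
n_groups(f_then_g)

# Non-factor first: g splits on its observed values,
# then f is expanded to all of its levels inside each g group.
g_then_f <- group_by(df, g, f, .drop = FALSE)
n_groups(g_then_f)

# Option 1: drop empty factor levels.
dropped <- group_by(df, e, f, g, .drop = TRUE)
n_groups(dropped)

# Option 2: make f a non-factor before grouping, so .drop no longer applies to it.
demoted <- df %>%
  mutate(f = as.character(f)) %>%
  group_by(e, f, g, .drop = FALSE)
n_groups(demoted)
```

Comparing the first two counts shows that swapping the order of a factor and a non-factor changes how many empty groups are created, while both options at the end leave only the groups actually observed in the data.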

@hadley
Member

hadley commented Mar 31, 2021

We've worked through this at length in the past and this is the best we've been able to come up with. It's a reasonable approach that, while not perfect, works for most cases, and I don't think we want to reconsider our past decisions at this point.

@hadley hadley closed this as completed Mar 31, 2021
@nathaneastwood
Author

@romainfrancois thanks for the explanation. It makes much more sense now and I understand the comment in the code about recursively splitting.

@hadley I'm sure it has been discussed - but where? This kind of insight into how the package works is really useful to understand.

@hadley
Member

hadley commented Mar 31, 2021

@nathaneastwood probably some in issues and some in our private team chat. Unfortunately we don't have the time to clearly explain every development decision we make.
