
WIP: empty groups #3492

Merged (78 commits) on May 14, 2018

Conversation

@romainfrancois (Member Author) commented Apr 10, 2018

For now the interface is adding `drop = FALSE` to `group_by()`. The default (`TRUE`) does what we did before. I guess we can reevaluate later whether `FALSE` should be the default.

This does the following (a rough sketch follows the list):

  • collect the indices based on the data that is present (what we did before)
  • expand the labels so that they contain all combinations of factor levels × unique values of the non-factor columns
  • fill the gaps with empty indices
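
A rough sketch in plain R of what the expansion amounts to (illustration only, not the C++ implementation in this PR; `dat` is made up for the example):

suppressPackageStartupMessages(library(dplyr))

dat <- tibble(
  f = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  y = c(1, 2, 1)
)

# indices collected from the data that is present (what we did before)
observed <- distinct(dat, f, y)

# all combinations of factor levels x unique values of the non-factor columns
all_labels <- expand.grid(
  f = factor(levels(dat$f), levels = levels(dat$f)),
  y = unique(dat$y)
)

# the remaining combinations are the groups that get empty indices:
# here (c, 1), (b, 2) and (c, 2)
anti_join(all_labels, observed, by = c("f", "y"))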

@krlmlr (Member) commented Apr 10, 2018

Looks like we need `@param drop` in the roxygen2 docs: https://travis-ci.org/tidyverse/dplyr/jobs/364675197#L1521

build_index_cpp(data_);
// when there is no drop attribute, we assume drop = TRUE,
// the default for group_by().
SEXP drop = data_.attr("drop");
Member

I remember the "drop" attribute was used for something else?

Member Author

I don't think so.

// cleanup after expand.grid
new_labels.attr("out.attrs") = R_NilValue;

IntegerVector new_labels_order = OrderVisitors(new_labels).apply();
Member

Why do we need to reorder?

Member Author

Because of what `expand.grid()` does.

Member

Maybe we can do `rev(expand.grid(!!!rev(dots)))`?

We should look into replacing expand.grid() with a home-grown visitor at some point.
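
For reference, a base-R sketch of that trick (using `do.call()` here in place of `!!!` splicing; `dots` is a made-up named list of grouping columns):

dots <- list(f1 = factor(c("a", "b", "c")), y = 1:2)

# expand.grid() varies the first variable fastest
do.call(expand.grid, dots)

# reversing the inputs and then the columns makes the last variable vary fastest,
# which is the order the labels should have, so no extra reordering pass is needed
rev(do.call(expand.grid, rev(dots)))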

Member Author

Sure. This is just a first step so we can discuss the feature; I'll polish the implementation later.

@krlmlr (Member) commented Apr 10, 2018

How do we handle NA in factors? What about logical?

@romainfrancois (Member Author)

For things that are not factors, it depends on what `unique()` does.

For factors, yes, I suppose I need to add something to handle NA if there are any.

@romainfrancois added the "wip" (work in progress) label on Apr 23, 2018
@romainfrancois changed the title from "zero length groups [wip]" to "WIP: zero length groups" on Apr 23, 2018
@romainfrancois (Member Author)

Now with `.drop` and `.expand`, we get:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  y  = rep( 1:4, each = 2)
)

show_groups <- function(.data, ...){
  attr(group_by(.data, ...), "labels")
}
show_groups(df, f1, y, .drop = TRUE , .expand = FALSE)
#>   f1 y
#> 1  a 1
#> 2  a 2
#> 3  b 3
#> 4  b 4
show_groups(df, f1, y, .drop = TRUE , .expand = TRUE)
#> Warning: Column `f1` joining factors with different levels, coercing to
#> character vector
#>   f1 y
#> 1  a 1
#> 2  a 2
#> 3  a 3
#> 4  a 4
#> 5  b 1
#> 6  b 2
#> 7  b 3
#> 8  b 4
show_groups(df, f1, y, .drop = FALSE, .expand = TRUE)
#>    f1 y
#> 1   a 1
#> 2   a 2
#> 3   a 3
#> 4   a 4
#> 5   b 1
#> 6   b 2
#> 7   b 3
#> 8   b 4
#> 9   c 1
#> 10  c 2
#> 11  c 3
#> 12  c 4
show_groups(df, f1, y, .drop = FALSE, .expand = FALSE) # error
#> Error: if `drop` is FALSE, expand must be `TRUE`

The join warning is unexpected; I'm not sure what happens there.

@romainfrancois changed the title from "WIP: zero length groups" to "WIP: empty groups" on Apr 24, 2018
@romainfrancois (Member Author) commented Apr 24, 2018

How about making the `.empty` argument take one of these values:

  • none: no empty groups, same as before
  • some: expand combinations but drop unused factor levels (probably the least useful)
  • all: expand combinations and keep all levels:
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  y  = rep( 1:4, each = 2)
)
show_groups <- function(.data){
  res <- summarise(.data, n = n()) %>%
    arrange(desc(n))

  cat( "non empty groups\n" )
  print(filter(res, n > 0))

  empties <- filter(res, n == 0)
  if(nrow(empties)){
    cat( "\n empty groups\n")
    print(empties)
  }

}
show_groups(group_by(df, f1, y, .empty = "none"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [2]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
show_groups(group_by(df, f1, y, .empty = "some"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [2]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
#> 
#>  empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [2]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         3     0
#> 2 a         4     0
#> 3 b         1     0
#> 4 b         2     0
show_groups(group_by(df, f1, y, .empty = "all"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [3]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
#> 
#>  empty groups
#> # A tibble: 8 x 3
#> # Groups:   f1 [3]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         3     0
#> 2 a         4     0
#> 3 b         1     0
#> 4 b         2     0
#> 5 c         1     0
#> 6 c         2     0
#> 7 c         3     0
#> 8 c         4     0

Created on 2018-04-24 by the reprex package (v0.2.0).

@romainfrancois (Member Author)

Current status, with automatic nesting, or at least what I think that means. This is very much in progress, so it might 💣.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
  y  = rep( 1:4, each = 2)
)

group_by(df, f1, y) %>% tally()
#> # A tibble: 5 x 3
#> # Groups:   f1 [?]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
#> 5 c        NA     0
group_by(df, f2, y) %>% tally()
#> # A tibble: 5 x 3
#> # Groups:   f2 [?]
#>   f2        y     n
#>   <fct> <int> <int>
#> 1 d         1     2
#> 2 e         2     2
#> 3 f         3     2
#> 4 g         4     2
#> 5 h        NA     0
group_by(df, f1, f2, y) %>% tally()
#> # A tibble: 15 x 4
#> # Groups:   f1, f2 [?]
#>    f1    f2        y     n
#>    <fct> <fct> <int> <int>
#>  1 a     d         1     2
#>  2 a     e         2     2
#>  3 a     f        NA     0
#>  4 a     g        NA     0
#>  5 a     h        NA     0
#>  6 b     d        NA     0
#>  7 b     e        NA     0
#>  8 b     f         3     2
#>  9 b     g         4     2
#> 10 b     h        NA     0
#> 11 c     d        NA     0
#> 12 c     e        NA     0
#> 13 c     f        NA     0
#> 14 c     g        NA     0
#> 15 c     h        NA     0

Created on 2018-04-27 by the reprex package (v0.2.0).

@romainfrancois (Member Author)

Skipping this test for now because I think the result should be different:

test_that("filter(FALSE) drops indices", {
  skip()
  out <- mtcars %>%
    group_by(cyl) %>%
    filter(FALSE) %>%
    attr("indices")
  expect_identical(out, list())
})

Right now, when regrouping after the filter, we get the sentinel NA:

> mtcars %>%
+     group_by(cyl) %>% tally()
# A tibble: 3 x 2
    cyl     n
  <dbl> <int>
1    4.    11
2    6.     7
3    8.    14
> 
> mtcars %>%
+     group_by(cyl) %>% filter(FALSE) %>% tally()
# A tibble: 1 x 2
    cyl     n
  <dbl> <int>
1    NA     0

I'd prefer that the grouping structure be preserved with all empty groups, but this is open for discussion because cyl is not a factor.
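
For illustration, the preferred result would look something like this (hypothetical output, not what this branch produces right now):

> mtcars %>%
+     group_by(cyl) %>% filter(FALSE) %>% tally()
# A tibble: 3 x 2
    cyl     n
  <dbl> <int>
1    4.     0
2    6.     0
3    8.     0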

@romainfrancois (Member Author)

Also skipping this one.

test_that("FactorVisitor handles NA. #183", {
  skip("until we can group again by a factor that has NA")
  g <- group_by(MASS::survey, M.I)
  expect_equal(g$M.I, MASS::survey$M.I)
})

Do we want to support NA in factors used in `group_by()`?

@romainfrancois (Member Author)

I had to remove this from the examples of `group_by_all()`:

group_by_all(mtcars, as.factor)

because under the new nesting design, with expansion of all factor levels, it would generate a grouping structure with

> map_int(mtcars, compose(length, unique)) %>% prod()
[1] 61393464000

I wonder if we should issue a warning when the number of groups will be much larger than the number of rows. We can get a lower bound on the number of groups (a sketch of such a check follows the code below):

> mtcars %>% keep(is.factor) %>% map_int(compose(length, unique)) %>% prod
[1] 1
> mutate_at(mtcars, vars(mpg, cyl, disp), as.factor) %>% keep(is.factor) %>% map_int(compose(length, unique)) %>% prod
[1] 2025
> mutate_all(mtcars, as.factor) %>% keep(is.factor) %>% map_int(compose(length, unique)) %>% prod
[1] 61393464000
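
A sketch of what such a check could look like (hypothetical helper and threshold, not part of this PR):

# warn when the expanded grouping would dwarf the number of rows
warn_if_group_explosion <- function(data, vars, ratio = 10) {
  n_levels <- vapply(data[vars], function(x) {
    if (is.factor(x)) nlevels(x) else length(unique(x))
  }, integer(1))
  n_groups <- prod(n_levels)
  if (n_groups > ratio * nrow(data)) {
    warning("grouping would create ", n_groups, " groups for only ", nrow(data), " rows")
  }
  invisible(n_groups)
}

warn_if_group_explosion(mutate_all(mtcars, as.factor), names(mtcars))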

@romainfrancois (Member Author)

So we don't need the `drop` attribute in grouped data frames. Can we eliminate it from `grouped_df()`, or perhaps leave it and warn that it is no longer used? It never really has been anyway. I'm not sure there are uses of the `grouped_df()` function outside of dplyr.

@romainfrancois (Member Author)

`filter()` keeps the grouping structure, which might create unwanted empty groups, e.g.

suppressPackageStartupMessages(library(dplyr))

d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     3
#> 4     2     4
#> 5     3     5
#> 6     3     6
#> 7     4     7
#> 8     4     8

df <- d %>% 
  group_by(x) %>% 
  filter(y<5)

tally(df)
#> # A tibble: 4 x 2
#>       x     n
#>   <int> <int>
#> 1     1     2
#> 2     2     2
#> 3     3     0
#> 4     4     0

The groups with x == 3 or 4 are empty, but x is not a factor. We might instead expect to end up with something like this:

suppressPackageStartupMessages(library(dplyr))

d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     3
#> 4     2     4
#> 5     3     5
#> 6     3     6
#> 7     4     7
#> 8     4     8

df <- d %>% 
  group_by(x) %>% 
  filter(y<5) %>% 
  group_by(x)

tally(df)
#> # A tibble: 2 x 2
#>       x     n
#>   <int> <int>
#> 1     1     2
#> 2     2     2

i.e. either regenerate the grouping structure after the filtering (which would be the easiest) or come up with some way to keep only the groups that the filtered data would create (more efficient, but maybe more work).

However, if x were a factor, we would want to keep all the groups.

suppressPackageStartupMessages(library(dplyr))

d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     3
#> 4     2     4
#> 5     3     5
#> 6     3     6
#> 7     4     7
#> 8     4     8

df <- d %>% 
  group_by(x = as.factor(x)) %>% 
  filter(y<5) %>% 
  group_by(x)


tally(df)
#> # A tibble: 4 x 2
#>   x         n
#>   <fct> <int>
#> 1 1         2
#> 2 2         2
#> 3 3         0
#> 4 4         0

@romainfrancois (Member Author)

I've changed the error message for implicit NA in factors so that it suggests using forcats to make them explicit.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
d <- data.frame(x = 1:3, f = factor(c("a", "b", NA)))
group_by(d,f)
#> Error in grouped_df_impl(data, unname(vars)): Column `f` contains implicit missing values. Consider `forcats::fct_explicit_na` to turn them explicit
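
For reference, the workaround the message points to looks like this (a minimal sketch, applied to the `d` above; `fct_explicit_na()` turns NA into an explicit "(Missing)" level):

d$f <- forcats::fct_explicit_na(d$f)
group_by(d, f)   # NA is now a regular level, so grouping proceeds without the error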

@krlmlr (Member) left a comment

Did you mean to remove `slice_impl()`?

@hadley (Member) commented May 2, 2018

I think stopping is too aggressive as it is likely to break existing code; a warning would be better.

@romainfrancois (Member Author)

A warning plus making them explicit automatically, then?

What do we do when we group by f1 and f2 and we see NA in f2 within only one level of f1?

Making the NA in f2 explicit would make the NA appear in all levels of f1.

@romainfrancois (Member Author)

@krlmlr I moved the slice-related code into filter.cpp because they share code.

@hadley (Member) commented May 2, 2018

I think warn, and drop the implicit NAs?

@romainfrancois (Member Author)

🤔 If we drop the rows, we end up with fewer rows after the `group_by()`.

Or "drop" in the sense that the rows do not belong to any group, so the sum of the group sizes would not equal the number of rows.

The second option is easier to implement, but I'm not sure either way.

@hadley (Member) commented May 2, 2018

Hmmm true. If we have to handle them anyway, maybe we should just do it without the warning?

@krlmlr (Member) commented May 2, 2018

Yeah, I was confused because the diff for filter.cpp was hidden.

Can we use a nested array (an array of arrays of ... of row IDs) to keep group indices? I think this might help solve the "empty groups after filter()" problem.
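
A rough illustration of that structure in plain R, for the `f1`/`y` example data used earlier in the thread (lists of lists of row IDs; the real thing would be C++):

indices <- list(
  a = list(`1` = c(1L, 2L), `2` = c(3L, 4L)),
  b = list(`3` = c(5L, 6L), `4` = c(7L, 8L)),
  c = list()   # unused factor level keeps an empty slot
)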

@romainfrancois (Member Author)

@krlmlr You mean in `group_by()`? The algorithm in place works like that, i.e. it recursively splits the indices, so that information exists at some point. I'm not sure it's worth keeping, though. It would complicate the metadata structure when I'd like to simplify it in #3489.

@romainfrancois (Member Author)

@hadley I think we should treat implicit NA differently from a true factor level, and have them create groups only where they appear. This would be in line with the old behavior:

library(dplyr)

d <- tibble(
  f1 = factor(c(1,1,2,2)), 
  f2 = factor(c(1,2,1,NA)), 
  x  = 1:4
)
group_by(d)
#> # A tibble: 4 x 3
#>   f1    f2        x
#>   <fct> <fct> <int>
#> 1 1     1         1
#> 2 1     2         2
#> 3 2     1         3
#> 4 2     <NA>      4

It would also be quite cheap to do: I just have to always allocate a vector with one extra slot and, at the end, check whether I've seen an NA or not.

Making them a true level means I have to search for NA before running the algorithm.

Removing the rows means filtering all the columns, whereas currently `group_by()` only affects the metadata.

A warning that can be silenced by an option might still be worth it though, to encourage explicit NA in factors.
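
A toy R sketch of that "one extra slot" idea (the helper name is made up; the real code is C++):

split_with_na_slot <- function(f) {
  codes <- as.integer(f)
  codes[is.na(codes)] <- nlevels(f) + 1L   # NA rows go into the extra slot
  idx <- split(seq_along(f), factor(codes, levels = seq_len(nlevels(f) + 1L)))
  # at the end, drop the extra slot if no NA was seen
  if (length(idx[[nlevels(f) + 1L]]) == 0L) idx <- idx[-(nlevels(f) + 1L)]
  idx
}

split_with_na_slot(factor(c("a", "b", NA), levels = c("a", "b", "c")))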

@romainfrancois force-pushed the feature-341-zero-length-groups branch from ebcfbcb to 5873a30 on May 3, 2018 at 07:45
@romainfrancois force-pushed the feature-341-zero-length-groups branch from 671fbf2 to 63e411d on May 14, 2018 at 07:48
@lock (bot) commented Feb 2, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock locked and limited the conversation to collaborators on Feb 2, 2019