
WIP: empty groups #3492

Merged (78 commits) on May 14, 2018

Conversation

@romainfrancois (Member Author) commented Apr 10, 2018

For now the interface is adding `drop = FALSE` to `group_by()`. The default (`TRUE`) does what we did before. I guess we can reevaluate later whether `FALSE` should be the default.

This does the following (a rough sketch follows the list):

  • collect the indices based on the data that is present (what we did before)
  • expand the labels so that they contain all combinations of factor levels × unique values of the non-factor columns
  • fill the gaps with empty indices
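
A rough sketch in plain R of what the expansion amounts to (illustration only, not the C++ implementation in this PR; `dat` is made up for the example):

suppressPackageStartupMessages(library(dplyr))

dat <- tibble(
  f = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  y = c(1, 2, 1)
)

# indices collected from the data that is present (what we did before)
observed <- distinct(dat, f, y)

# all combinations of factor levels x unique values of the non-factor columns
all_labels <- expand.grid(
  f = factor(levels(dat$f), levels = levels(dat$f)),
  y = unique(dat$y)
)

# the remaining combinations are the groups that get empty indices:
# here (c, 1), (b, 2) and (c, 2)
anti_join(all_labels, observed, by = c("f", "y"))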

@krlmlr (Member) commented Apr 10, 2018

Looks like we need `@param drop` in the roxygen2 docs: https://travis-ci.org/tidyverse/dplyr/jobs/364675197#L1521

build_index_cpp(data_);
// when there is no drop attribute, we assume drop = TRUE,
// the default for group_by().
SEXP drop = data_.attr("drop");
Member

I remember the "drop" attribute was used for something else?

Member Author

I don't think so.

// cleanup after expand.grid
new_labels.attr("out.attrs") = R_NilValue;

IntegerVector new_labels_order = OrderVisitors(new_labels).apply();
Member

Why do we need to reorder?

Member Author

Because of what `expand.grid()` does.

Member

Maybe we can do `rev(expand.grid(!!!rev(dots)))`?

We should look into replacing expand.grid() with a home-grown visitor at some point.
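
For reference, a base-R sketch of that trick (using `do.call()` here in place of `!!!` splicing; `dots` is a made-up named list of grouping columns):

dots <- list(f1 = factor(c("a", "b", "c")), y = 1:2)

# expand.grid() varies the first variable fastest
do.call(expand.grid, dots)

# reversing the inputs and then the columns makes the last variable vary fastest,
# which is the order the labels should have, so no extra reordering pass is needed
rev(do.call(expand.grid, rev(dots)))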

Member Author

Sure. This is just a first step so we can discuss the feature; I'll polish the implementation later.

@krlmlr (Member) commented Apr 10, 2018

How do we handle NA in factors? What about logical?

@romainfrancois (Member Author)

For things that are not factors, it depends on what `unique()` does.

For factors, yes, I suppose I need to add something to handle NA if there are any.

@romainfrancois added the "wip" (work in progress) label on Apr 23, 2018
@romainfrancois changed the title from "zero length groups [wip]" to "WIP: zero length groups" on Apr 23, 2018
@romainfrancois (Member Author)

Now with `.drop` and `.expand`, we get:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  y  = rep( 1:4, each = 2)
)

show_groups <- function(.data, ...){
  attr(group_by(.data, ...), "labels")
}
show_groups(df, f1, y, .drop = TRUE , .expand = FALSE)
#>   f1 y
#> 1  a 1
#> 2  a 2
#> 3  b 3
#> 4  b 4
show_groups(df, f1, y, .drop = TRUE , .expand = TRUE)
#> Warning: Column `f1` joining factors with different levels, coercing to
#> character vector
#>   f1 y
#> 1  a 1
#> 2  a 2
#> 3  a 3
#> 4  a 4
#> 5  b 1
#> 6  b 2
#> 7  b 3
#> 8  b 4
show_groups(df, f1, y, .drop = FALSE, .expand = TRUE)
#>    f1 y
#> 1   a 1
#> 2   a 2
#> 3   a 3
#> 4   a 4
#> 5   b 1
#> 6   b 2
#> 7   b 3
#> 8   b 4
#> 9   c 1
#> 10  c 2
#> 11  c 3
#> 12  c 4
show_groups(df, f1, y, .drop = FALSE, .expand = FALSE) # error
#> Error: if `drop` is FALSE, expand must be `TRUE`

The join warning is unexpected; I'm not sure what happens there.

@romainfrancois changed the title from "WIP: zero length groups" to "WIP: empty groups" on Apr 24, 2018
@romainfrancois (Member Author) commented Apr 24, 2018

How about making the `.empty` argument take one of these values:

  • none: no empty groups, same as before
  • some: expand combinations but drop unused factor levels (probably the least useful)
  • all: expand combinations and keep all levels:
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  y  = rep( 1:4, each = 2)
)
show_groups <- function(.data){
  res <- summarise(.data, n = n()) %>%
    arrange(desc(n))

  cat( "non empty groups\n" )
  print(filter(res, n > 0))

  empties <- filter(res, n == 0)
  if(nrow(empties)){
    cat( "\n empty groups\n")
    print(empties)
  }

}
show_groups(group_by(df, f1, y, .empty = "none"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [2]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
show_groups(group_by(df, f1, y, .empty = "some"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [2]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
#> 
#>  empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [2]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         3     0
#> 2 a         4     0
#> 3 b         1     0
#> 4 b         2     0
show_groups(group_by(df, f1, y, .empty = "all"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups:   f1 [3]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
#> 
#>  empty groups
#> # A tibble: 8 x 3
#> # Groups:   f1 [3]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         3     0
#> 2 a         4     0
#> 3 b         1     0
#> 4 b         2     0
#> 5 c         1     0
#> 6 c         2     0
#> 7 c         3     0
#> 8 c         4     0

Created on 2018-04-24 by the reprex package (v0.2.0).

@romainfrancois (Member Author)

Current status, with automatic nesting, or at least what I think that means. This is very much in progress, so it might 💣.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
  y  = rep( 1:4, each = 2)
)

group_by(df, f1, y) %>% tally()
#> # A tibble: 5 x 3
#> # Groups:   f1 [?]
#>   f1        y     n
#>   <fct> <int> <int>
#> 1 a         1     2
#> 2 a         2     2
#> 3 b         3     2
#> 4 b         4     2
#> 5 c        NA     0
group_by(df, f2, y) %>% tally()
#> # A tibble: 5 x 3
#> # Groups:   f2 [?]
#>   f2        y     n
#>   <fct> <int> <int>
#> 1 d         1     2
#> 2 e         2     2
#> 3 f         3     2
#> 4 g         4     2
#> 5 h        NA     0
group_by(df, f1, f2, y) %>% tally()
#> # A tibble: 15 x 4
#> # Groups:   f1, f2 [?]
#>    f1    f2        y     n
#>    <fct> <fct> <int> <int>
#>  1 a     d         1     2
#>  2 a     e         2     2
#>  3 a     f        NA     0
#>  4 a     g        NA     0
#>  5 a     h        NA     0
#>  6 b     d        NA     0
#>  7 b     e        NA     0
#>  8 b     f         3     2
#>  9 b     g         4     2
#> 10 b     h        NA     0
#> 11 c     d        NA     0
#> 12 c     e        NA     0
#> 13 c     f        NA     0
#> 14 c     g        NA     0
#> 15 c     h        NA     0

Created on 2018-04-27 by the reprex package (v0.2.0).

@romainfrancois (Member Author)

Skipping this test for now because I think the result should be different:

test_that("filter(FALSE) drops indices", {
  skip()
  out <- mtcars %>%
    group_by(cyl) %>%
    filter(FALSE) %>%
    attr("indices")
  expect_identical(out, list())
})

Right now, when regrouping after the filter, we get the sentinel NA:

> mtcars %>%
+     group_by(cyl) %>% tally()
# A tibble: 3 x 2
    cyl     n
  <dbl> <int>
1    4.    11
2    6.     7
3    8.    14
> 
> mtcars %>%
+     group_by(cyl) %>% filter(FALSE) %>% tally()
# A tibble: 1 x 2
    cyl     n
  <dbl> <int>
1    NA     0

I'd prefer that the grouping structure be preserved with all empty groups, but this is open for discussion because cyl is not a factor.
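
For illustration, the preferred result would look something like this (hypothetical output, not what this branch produces right now):

> mtcars %>%
+     group_by(cyl) %>% filter(FALSE) %>% tally()
# A tibble: 3 x 2
    cyl     n
  <dbl> <int>
1    4.     0
2    6.     0
3    8.     0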

@romainfrancois (Member Author)

Also skipping this one.

test_that("FactorVisitor handles NA. #183", {
  skip("until we can group again by a factor that has NA")
  g <- group_by(MASS::survey, M.I)
  expect_equal(g$M.I, MASS::survey$M.I)
})

Do we want to support NA in factors used in `group_by()`?

@romainfrancois (Member Author)

I had to remove this from the examples of `group_by_all()`:

group_by_all(mtcars, as.factor)

because under the new nesting design, with expansion of all factor levels, it would generate a grouping structure with

> map_int(mtcars, compose(length, unique)) %>% prod()
[1] 61393464000

I wonder if we should issue a warning when the number of groups will be much larger than the number of rows. We can get a lower bound on the number of groups (a sketch of such a check follows the code below):

> mtcars %>% keep(is.factor) %>% map_int(compose(length, unique)) %>% prod
[1] 1
> mutate_at(mtcars, vars(mpg, cyl, disp), as.factor) %>% keep(is.factor) %>% map_int(compose(length, unique)) %>% prod
[1] 2025
> mutate_all(mtcars, as.factor) %>% keep(is.factor) %>% map_int(compose(length, unique)) %>% prod
[1] 61393464000
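
A sketch of what such a check could look like (hypothetical helper and threshold, not part of this PR):

# warn when the expanded grouping would dwarf the number of rows
warn_if_group_explosion <- function(data, vars, ratio = 10) {
  n_levels <- vapply(data[vars], function(x) {
    if (is.factor(x)) nlevels(x) else length(unique(x))
  }, integer(1))
  n_groups <- prod(n_levels)
  if (n_groups > ratio * nrow(data)) {
    warning("grouping would create ", n_groups, " groups for only ", nrow(data), " rows")
  }
  invisible(n_groups)
}

warn_if_group_explosion(mutate_all(mtcars, as.factor), names(mtcars))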

@romainfrancois (Member Author)

So we don't need the `drop` attribute in grouped data frames. Can we eliminate it from `grouped_df()`, or perhaps leave it and warn that it is no longer used? It never really has been anyway. I'm not sure there are uses of the `grouped_df()` function outside of dplyr.

@romainfrancois (Member Author)

`filter()` keeps the grouping structure, which might create unwanted empty groups, e.g.

suppressPackageStartupMessages(library(dplyr))

d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     3
#> 4     2     4
#> 5     3     5
#> 6     3     6
#> 7     4     7
#> 8     4     8

df <- d %>% 
  group_by(x) %>% 
  filter(y<5)

tally(df)
#> # A tibble: 4 x 2
#>       x     n
#>   <int> <int>
#> 1     1     2
#> 2     2     2
#> 3     3     0
#> 4     4     0

The groups with x == 3 or 4 are empty, but x is not a factor. We might instead expect to end up with something like this:

suppressPackageStartupMessages(library(dplyr))

d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     3
#> 4     2     4
#> 5     3     5
#> 6     3     6
#> 7     4     7
#> 8     4     8

df <- d %>% 
  group_by(x) %>% 
  filter(y<5) %>% 
  group_by(x)

tally(df)
#> # A tibble: 2 x 2
#>       x     n
#>   <int> <int>
#> 1     1     2
#> 2     2     2

i.e. either regenerate the grouping structure after the filtering (which would be the easiest) or come up with some way to keep only the groups that the filtered data would create (more efficient, but maybe more work).

However, if x were a factor, we would want to keep all the groups.

suppressPackageStartupMessages(library(dplyr))

d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     1
#> 2     1     2
#> 3     2     3
#> 4     2     4
#> 5     3     5
#> 6     3     6
#> 7     4     7
#> 8     4     8

df <- d %>% 
  group_by(x = as.factor(x)) %>% 
  filter(y<5) %>% 
  group_by(x)


tally(df)
#> # A tibble: 4 x 2
#>   x         n
#>   <fct> <int>
#> 1 1         2
#> 2 2         2
#> 3 3         0
#> 4 4         0

@romainfrancois (Member Author)

I've changed the error message for implicit NA in factors so that it suggests using forcats to make them explicit.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
d <- data.frame(x = 1:3, f = factor(c("a", "b", NA)))
group_by(d,f)
#> Error in grouped_df_impl(data, unname(vars)): Column `f` contains implicit missing values. Consider `forcats::fct_explicit_na` to turn them explicit
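
For reference, the workaround the message points to looks like this (a minimal sketch, applied to the `d` above; `fct_explicit_na()` turns NA into an explicit "(Missing)" level):

d$f <- forcats::fct_explicit_na(d$f)
group_by(d, f)   # NA is now a regular level, so grouping proceeds without the error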

@krlmlr (Member) left a comment

Did you mean to remove `slice_impl()`?

@hadley (Member) commented May 2, 2018

I think stopping is too aggressive as it is likely to break existing code; a warning would be better.

@romainfrancois (Member Author)

A warning plus making them explicit automatically, then?

What do we do when we group by f1 and f2 and we see NA in f2 within only one level of f1?

Making the NA in f2 explicit would make the NA appear in all levels of f1.

@romainfrancois (Member Author)

@krlmlr I moved the slice-related code into filter.cpp because they share code.

@hadley (Member) commented May 2, 2018

I think warn, and drop the implicit NAs?

@romainfrancois (Member Author)

🤔 If we drop the rows, we end up with fewer rows after the `group_by()`.

Or "drop" in the sense that the rows do not belong to any group, so the sum of the group sizes would not equal the number of rows.

The second option is easier to implement, but I'm not sure either way.

@hadley (Member) commented May 2, 2018

Hmmm true. If we have to handle them anyway, maybe we should just do it without the warning?

@krlmlr (Member) commented May 2, 2018

Yeah, I was confused because the diff for filter.cpp was hidden.

Can we use a nested array (an array of arrays of ... of row IDs) to keep group indices? I think this might help solve the "empty groups after filter()" problem.
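
A rough illustration of that structure in plain R, for the `f1`/`y` example data used earlier in the thread (lists of lists of row IDs; the real thing would be C++):

indices <- list(
  a = list(`1` = c(1L, 2L), `2` = c(3L, 4L)),
  b = list(`3` = c(5L, 6L), `4` = c(7L, 8L)),
  c = list()   # unused factor level keeps an empty slot
)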

@romainfrancois (Member Author)

@krlmlr You mean in `group_by()`? The algorithm in place works like that, i.e. it recursively splits the indices, so that information exists at some point. I'm not sure it's worth keeping, though. It would complicate the metadata structure when I'd like to simplify it in #3489.

@romainfrancois (Member Author)

@hadley I think we should treat implicit NA differently from a true factor level, and have them create groups only where they appear. This would be in line with the old behavior:

library(dplyr)

d <- tibble(
  f1 = factor(c(1,1,2,2)), 
  f2 = factor(c(1,2,1,NA)), 
  x  = 1:4
)
group_by(d)
#> # A tibble: 4 x 3
#>   f1    f2        x
#>   <fct> <fct> <int>
#> 1 1     1         1
#> 2 1     2         2
#> 3 2     1         3
#> 4 2     <NA>      4

It would also be quite cheap to do: I just have to always allocate a vector with one extra slot and, at the end, check whether I've seen an NA or not.

Making them a true level means I have to search for NA before running the algorithm.

Removing the rows means filtering all the columns, whereas currently `group_by()` only affects the metadata.

A warning that can be silenced by an option might still be worth it though, to encourage explicit NA in factors.
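
A toy R sketch of that "one extra slot" idea (the helper name is made up; the real code is C++):

split_with_na_slot <- function(f) {
  codes <- as.integer(f)
  codes[is.na(codes)] <- nlevels(f) + 1L   # NA rows go into the extra slot
  idx <- split(seq_along(f), factor(codes, levels = seq_len(nlevels(f) + 1L)))
  # at the end, drop the extra slot if no NA was seen
  if (length(idx[[nlevels(f) + 1L]]) == 0L) idx <- idx[-(nlevels(f) + 1L)]
  idx
}

split_with_na_slot(factor(c("a", "b", NA), levels = c("a", "b", "c")))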

@romainfrancois force-pushed the feature-341-zero-length-groups branch from ebcfbcb to 5873a30 on May 3, 2018 at 07:45
@romainfrancois force-pushed the feature-341-zero-length-groups branch from 671fbf2 to 63e411d on May 14, 2018 at 07:48
@lock (bot) commented Feb 2, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock locked and limited the conversation to collaborators on Feb 2, 2019