Preserve zero-length groups #341

hadley · 2014-03-20T12:57:03Z

http://stackoverflow.com/questions/22523131

Not sure what the interface to this should be - probably should default to drop = FALSE.

eipi10 · 2014-03-20T16:15:21Z

Thanks for opening up this issue Hadley.

statwonk · 2014-04-28T02:24:28Z

👍 ran into the same issue today, drop = FALSE would be a big help for me!

wsurles · 2014-07-09T03:18:59Z

Any idea on the time frame for putting a .drop = FALSE equivalent into dplyr? I need this for certain rCharts to render correctly.

In the meantime I did get the answer in your link to work.
http://stackoverflow.com/questions/22523131

I grouped by two variables.

jennybc · 2014-07-23T17:52:31Z

+1 for option to not drop empty groups

hadley · 2014-08-01T14:06:56Z

May be some overlap with #486 and #413.

slackline · 2014-08-30T07:50:06Z

Not dropping empty groups would be very useful. Often needed when creating summary tables.

mcfrank · 2014-10-29T18:58:23Z

+1 - this is a deal-breaker for many analyses

bpbond · 2014-11-11T17:46:53Z

I agree with all the above--would be very useful.

hadley · 2014-11-20T19:38:22Z

@romainfrancois Currently build_index_cpp() doesn't respect the drop attribute:

t1 <- data_frame(
  x = runif(10),
  g1 = rep(1:2, each = 5),
  g2 = factor(g1, 1:3)
)
g1 <- grouped_df(t1, list(quote(g2)), drop = FALSE)
attr(g1, "group_size")
# should be c(5L, 5L, 0L)
attr(g1, "indices")
# should be list(0:4, 5:9, integer(0))

The drop attribute only applies when grouping by a factor, in which case we need to have one group per factor level, regardless of whether or not the level actually applies to the data.

This will also affect the single table verbs in the following ways:

select(): no effect
arrange(): no effect
summarise(): functions applied to zero-row groups should be given 0-length vectors, so n() should return 0 and mean(x) should return NaN
filter(): the set of groups should remain constant, even if some groups now have no rows
mutate(): don't need to evaluate expressions for empty groups
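
As a rough illustration of the summarise() semantics listed above, here is a minimal base R sketch. It assumes nothing about dplyr internals: split() keeps one piece per factor level by default, so the empty level yields a zero-length vector and each summary function is applied to it. The names x, g, pieces, sizes and means are illustrative only.

```r
# Base R illustration only; this is not dplyr's implementation.
x <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
g <- factor(rep(1:2, each = 3), levels = 1:3)  # level 3 has no rows

# split() keeps empty factor levels by default (drop = FALSE)
pieces <- split(x, g)

# Group sizes, analogous to what n() should report for each group
sizes <- vapply(pieces, length, integer(1))

# mean() of a zero-length numeric vector is NaN
means <- vapply(pieces, mean, numeric(1))

print(sizes)
print(means)
```

The third group has size 0 and a mean of NaN, which matches the behaviour proposed for zero-length groups.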

Eventually, drop = FALSE will be the default, and if it's a hassle to write both drop = FALSE and drop = TRUE branches, I'd happily drop support for drop = TRUE (since you can always re-level the factor yourself, or use a character vector instead).

Does that make sense? If it's a lot of work, we can push off to 0.4

@statwonk, @wsurles, @jennybc, @slackline, @mcfrank, @eipi10 If you'd like to help, the best thing to do would be to work on a set of test cases that exercises all the ways the different verbs might interact with zero-length groups.

romainfrancois · 2014-11-27T09:08:37Z

Ah. I think I just did not know what drop was supposed to do. That makes it clear. I don't think it's a lot of work.

bpbond · 2014-12-04T18:58:36Z

I have opened pull request #833 which tests whether the single table verbs above handle zero-length groups correctly. Most of the tests are commented out, because dplyr currently fails them, of course.

ebergelson · 2015-05-28T03:56:49Z

+1 , any status updates here? love summarise, need to keep empty levels!

wsurles · 2015-05-28T14:17:45Z

@ebergelson, Here is my current hack to get zero-length groups. I often need this so my bar charts will stack.

Here df has 3 columns: name, group, and metric

df2 <- expand.grid(name = unique(df$name), group = unique(df$group)) %>%
    left_join(df, by=c("name","group")) %>%
    mutate(metric = ifelse(is.na(metric),0,metric))
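
For readers who want to try the same idea without any packages, the following is a self-contained base R sketch of the workaround above. The data values are made up for illustration, and merge(..., all.x = TRUE) stands in for left_join().

```r
# Hypothetical data: the combination (name = "b", group = "y") has no rows
df <- data.frame(
  name   = c("a", "a", "b"),
  group  = c("x", "y", "x"),
  metric = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Build every name/group combination
grid <- expand.grid(
  name  = unique(df$name),
  group = unique(df$group),
  stringsAsFactors = FALSE
)

# Left join: keep every row of the grid, with NA where df has no match
df2 <- merge(grid, df, by = c("name", "group"), all.x = TRUE)

# Treat the missing combinations as zero
df2$metric[is.na(df2$metric)] <- 0

print(df2)
```

Note that filling with 0 is only appropriate for count-like or additive metrics; other summaries may need a different fill value.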

bpbond · 2015-05-28T14:28:43Z

I do something similar–check for missing groups, then if any generate all combinations and left_join.

Unfortunately, it doesn't seem like this issue is getting much love...perhaps because there is this straightforward workaround.

ebergelson · 2015-05-29T03:31:41Z

@wsurles, @bpbond thanks, yes i used a similar workaround to what you suggest! would love to see a built-in fix like .drop.

jalapic · 2015-06-18T14:28:39Z

Just to add and agree with everyone above - this is a super critical aspect of many analyses. Would love to see an implementation.

romainfrancois · 2015-07-14T13:26:02Z

Some more details needed here:

If I have this:

> df <- data_frame( x = c(1,1,1,2,2), f = factor( c(1,2,3,1,1) ) )
> df
Source: local data frame [5 x 2]

  x f
1 1 1
2 1 2
3 1 3
4 2 1
5 2 1

And I group by x then f, I'd end up with 6 (2x3) groups where the groups (2, 2) and (2,3) are empty. That's ok. I can manage to implement that I think.

now, what if I have this:

> df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )
> df
Source: local data frame [4 x 2]

  f x
1 1 1
2 1 2
3 2 1
4 2 4

and I want to group by f then x. What would the groups be ? @hadley

bpbond · 2015-07-14T13:51:47Z

Both stats::aggregate and plyr::ddply return 4 groups in this case (1,1; 1,2; 2,1; and 2,4), so I'd suggest that's the behavior to conform to.

huftis · 2015-07-14T18:48:14Z

Shouldn’t it agree with table() instead, i.e., return 9 groups?

> table(df$f, df$x)
  1 2 4
1 1 1 0
2 1 0 1
3 0 0 0

I would expect df %>% group_by(f, x) %>% tally to basically give the same result as with(df, as.data.frame(table(f, x))) and ddply(df, .(f, x), nrow, .drop=FALSE).
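
The two candidate behaviours can be compared side by side in base R. This is only a sketch of the difference under discussion, using the data frame from the question; the helper column n and the names observed and crossed are illustrative.

```r
df <- data.frame(
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  x = c(1, 2, 1, 4)
)

# aggregate() reports only the combinations that occur in the data
observed <- aggregate(n ~ f + x, data = transform(df, n = 1L), FUN = sum)

# table() crosses every level of f with every observed value of x
crossed <- as.data.frame(table(f = df$f, x = df$x), responseName = "n")

nrow(observed)  # 4 groups
nrow(crossed)   # 9 groups, 5 of them with n == 0
```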

mcfrank · 2015-07-20T20:28:58Z

I thought our desired behavior was to preserve zero-length groups if they are factors (like .drop in plyr), so I would imagine we'd want @huftis's suggestion. I would suggest that the default be drop = TRUE though, so that the default behavior does not change, re @bpbond's suggestion.

krlmlr · 2017-08-22T20:06:33Z

Does tidyr::complete() work for factors?

GegznaV · 2018-03-05T09:06:01Z

All factor levels and combinations of factor levels must be preserved by default. This behavior can be controlled by parameters such as drop, expand, etc. Thus the default behavior of dplyr::count() should be like this:

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
df %>% dplyr::count(x, y)
#> # A tibble: 4 x 3
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     2 1         1
#> 3     1 2         0
#> 4     2 2         0

Zero length groups (combinations of groups) can be filtered later. But for exploratory analysis we must see the full picture.

Are there any status updates on the solution to this issue?
Are there any plans to completely solve this issue?

romainfrancois · 2018-03-05T09:08:28Z

2: yes definitely
1: There are some technical implementation difficulties about this issue, but I'll look into it in the next few weeks.

romainfrancois · 2018-04-09T15:07:30Z

We might get away with this by expanding the data after the fact, something like this:

library(tidyverse)

truly_group_by <- function(data, ...){
  dots <- quos(...)
  data <- group_by( data, !!!dots )

  labels <- attr( data, "labels" )
  labnames <- names(labels)
  labels <- mutate( labels, ..index.. =  attr(data, "indices") )

  expanded <- labels %>%
    tidyr::expand( !!!dots ) %>%
    left_join( labels, by = labnames ) %>%
    mutate( ..index.. = map(..index.., ~if(is.null(.x)) integer() else .x ) )

  indices <- pull( expanded, ..index..)
  group_sizes <- map_int( indices, length)
  labels <- select( expanded, -..index..)

  attr(data, "labels")  <- labels
  attr(data, "indices") <- indices
  attr(data, "group_sizes") <- group_sizes

  data
}

df  <- data_frame(
  x = 1:2,
  y = factor(c(1, 1), levels = 1:2)
)
tally( truly_group_by(df, x, y) )
#> # A tibble: 4 x 3
#> # Groups:   x [?]
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     1 2         0
#> 3     2 1         1
#> 4     2 2         0
tally( truly_group_by(df, y, x) )
#> # A tibble: 4 x 3
#> # Groups:   y [?]
#>   y         x     n
#>   <fct> <int> <int>
#> 1 1         1     1
#> 2 1         2     1
#> 3 2         1     0
#> 4 2         2     0

obviously down the line, this would be handled internally, sans using tidyr or purrr.

romainfrancois · 2018-04-09T15:11:04Z

This seems to take care of the original question on Stack Overflow:

> df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
> df$b = factor(df$b, levels=1:3)
> df %>%
+   group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 2 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
> df %>%
+   truly_group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 3 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
3 3           0 FALSE

romainfrancois · 2018-04-09T15:26:17Z

The key here being this

 tidyr::expand( !!!dots ) %>%

which means expanding all possibilities regardless of the variables being factors or not.

I'd say we either:

expand all when drop=FALSE, potentially having lots of 0 length groups
do what we do now if drop=TRUE

perhaps have a function to toggle dropness.

This is a relatively cheap operation I'd say because it only involves manipulating the metadata, so perhaps it is less risky to do this in R first ?
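
To make the "only manipulating the metadata" point concrete, here is a small base R sketch of expanding group labels and indices without touching the data rows. The labels/indices layout is a simplified stand-in for the grouped data attributes, and key() is a hypothetical helper; none of this is the actual dplyr code.

```r
# Hypothetical grouping metadata: one row of labels per observed group,
# plus the 0-based row indices that belong to each group.
labels  <- data.frame(x = c(1L, 2L), y = c("1", "1"), stringsAsFactors = FALSE)
indices <- list(0L, 1L)

# Every combination of the grouping values, observed or not
y_levels <- c("1", "2")
expanded <- expand.grid(
  x = unique(labels$x),
  y = y_levels,
  stringsAsFactors = FALSE
)

# Look up each expanded combination among the observed labels
key <- function(d) paste(d$x, d$y, sep = "\r")
pos <- match(key(expanded), key(labels))

# Unobserved combinations get an empty index vector
new_indices <- lapply(pos, function(p) if (is.na(p)) integer(0) else indices[[p]])
group_sizes <- lengths(new_indices)

print(cbind(expanded, n = group_sizes))
```

Only the small labels table is expanded here, which is why the cost depends on the number of groups rather than the number of rows.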

krlmlr · 2018-04-10T01:12:59Z

Did you mean crossing() instead of expand()?

Looking at the internals, do you agree that we "only" need to change build_index_cpp(), specifically the generation of the labels data frame, to make this happen?

Can we perhaps start with expanding only factors with drop = FALSE? I considered a "natural" syntax, but this may be too confusing in the end (and perhaps even not powerful enough):

group_by(data, crossing(col1, col2), col3)

Semantics: Using all combinations of col1 and col2, and their existing combinations with col3.

romainfrancois · 2018-04-10T08:48:52Z

Yes, I'd say this only affects build_index_cpp and the generation of the attributes labels, indices and group_sizes which I'd like to squash in a tidy structure as part of #3489

The "only expanding factors" part of this discussion is what took so long.

What would be the results of these:

library(dplyr)

d <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
  x  = 1:8,
  y  = rep( 1:4, each = 2)
)

f <- function(data, ...){
  group_by(data, !!!quos(...))  %>%
    tally()
}
f(d, f1, f2, x)
f(d, x, f1, f2)

f(d, f1, f2, x, y)
f(d, x, f1, f2, y)

krlmlr · 2018-04-10T11:57:55Z

I think f(d, f1, f2, x) should give the same results as f(d, x, f1, f2), if row order is ignored. Same for the other two.

Also interesting:

f(d, f2, x, f1, y)
d %>% sample_frac(0.3) %>% f(...)

I like the idea of implementing full expansion only for factors. For non-character data (including logicals), we could define/use a factor-like class that inherits the respective data type. Perhaps provided by forcats? This makes it more difficult to shoot yourself in the foot.

romainfrancois · 2018-04-10T14:17:07Z

implementation in progress in #3492

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )

( res1 <- tally(group_by(df,f,x, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   f [?]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 1        1.     1
#> 2 1        2.     1
#> 3 1        4.     0
#> 4 2        1.     1
#> 5 2        2.     0
#> 6 2        4.     1
#> 7 3        1.     0
#> 8 3        2.     0
#> 9 3        4.     0
( res2 <- tally(group_by(df,x,f, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   x [?]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1    1. 1         1
#> 2    1. 2         1
#> 3    1. 3         0
#> 4    2. 1         1
#> 5    2. 2         0
#> 6    2. 3         0
#> 7    4. 1         0
#> 8    4. 2         1
#> 9    4. 3         0

all.equal( res1, arrange(res2, f, x) )
#> [1] TRUE

all.equal( filter(res1, n>0), tally(group_by(df, f, x)) )
#> [1] TRUE
all.equal( filter(res2, n>0), tally(group_by(df, x, f)) )
#> [1] TRUE

Created on 2018-04-10 by the reprex package (v0.2.0).

kenahoo · 2018-05-15T03:19:26Z

As for whether complete() solves the issue - no, not really. Whatever summaries are being computed, their behaviors on empty vectors need to be preserved, not patched up after the fact. For example:

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
     group_by(x) %>%
     summarize(min=min(y), sum=sum(y), prod=prod(y))
# Should be:
#> x       min   sum  prod
#> 1         4     9    20
#> 2       Inf     0     1

sum and prod (and to a lesser extent, min) (and various other functions) have very well-defined semantics on empty vectors, and it's not great to have to come along afterwards with complete() and re-define those behaviors.

romainfrancois · 2018-05-15T04:34:08Z

@kenahoo I'm not sure I understand. This is what you get with the current dev version. So the only thing that you don't get is the warning from min()

library(dplyr)

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
  group_by(x) %>%
  summarize(min=min(y), sum=sum(y), prod=prod(y))
#> # A tibble: 2 x 4
#>   x       min   sum  prod
#>   <fct> <dbl> <int> <dbl>
#> 1 1         4     9    20
#> 2 2       Inf     0     1

min(integer())
#> Warning in min(integer()): no non-missing arguments to min; returning Inf
#> [1] Inf
sum(integer())
#> [1] 0
prod(integer())
#> [1] 1

Created on 2018-05-15 by the reprex package (v0.2.0).

kenahoo · 2018-05-15T04:45:47Z

@romainfrancois Oh cool, I didn't realize you were already so far along on this implementation. Looks great!

lock · 2018-11-11T05:24:07Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

hadley added the enhancement label Aug 1, 2014

hadley added this to the 0.3.1 milestone Aug 1, 2014

hadley self-assigned this Aug 1, 2014

hadley mentioned this issue Aug 26, 2014

drop argument in grouped_df() not work #530

Closed

jennybc mentioned this issue Nov 5, 2014

inner_join() reorders output #684

Closed

hadley modified the milestones: 0.4, 0.3.1 Nov 20, 2014

hadley assigned romainfrancois and unassigned hadley Nov 25, 2014

bpbond mentioned this issue Dec 4, 2014

Tests for correct handling of zero-length groups #833

Closed

krlmlr mentioned this issue Aug 23, 2017

summarise_all(max) does not preserve integer type #3036

Closed

ha0ye mentioned this issue Oct 13, 2017

Biomass function weecology/portalr#53

Closed

jennybc mentioned this issue Mar 12, 2018

"chop()" a tidy-style split() function tidyverse/tidyr#434

Closed

romainfrancois mentioned this issue Apr 9, 2018

tidy grouped data attributes #3489

Closed

romainfrancois mentioned this issue Apr 10, 2018

WIP: empty groups #3492

Merged

romainfrancois mentioned this issue Apr 23, 2018

FR: group_indices() should also return vector of group representatives #2121

Closed

romainfrancois added the wip work in progress label Apr 26, 2018

romainfrancois modified the milestones: future, 0.8.0 May 3, 2018

romainfrancois closed this as completed in 325e749 May 14, 2018

romainfrancois mentioned this issue May 14, 2018

.preserve argument in filter to control group preservation #3573

Merged

This was referenced May 30, 2018

group_by is sorting and does not maintain original order #3279

Closed

complete() on factors with NA level tidyverse/tidyr#454

Closed

nesting() should preserve absent levels tidyverse/tidyr#447

Closed

jennybc mentioned this issue Oct 25, 2018

dplyr 0.8.0 release candidate post tidyverse/tidyverse.org#220

Merged

lock bot locked and limited conversation to collaborators Nov 11, 2018
