Preserve zero-length groups #341

Closed
hadley opened this Issue Mar 20, 2014 · 44 comments

@hadley
Member

hadley commented Mar 20, 2014

http://stackoverflow.com/questions/22523131

Not sure what the interface to this should be - probably should default to drop = FALSE.

@eipi10

Contributor

eipi10 commented Mar 20, 2014

Thanks for opening up this issue Hadley.

@statwonk

statwonk commented Apr 28, 2014

👍 ran into the same issue today, drop = FALSE would be a big help for me!

@wsurles

wsurles commented Jul 9, 2014

Any idea on the time frame for putting a .drop = FALSE equivalent into dplyr? I need this for certain rCharts to render correctly.

In the meantime, I DID get the answer in your link to work:
http://stackoverflow.com/questions/22523131

I grouped by two variables.

@jennybc

Member

jennybc commented Jul 23, 2014

+1 for option to not drop empty groups

@hadley

Member

hadley commented Aug 1, 2014

May be some overlap with #486 and #413.

@hadley hadley added the enhancement label Aug 1, 2014

@hadley hadley added this to the 0.3.1 milestone Aug 1, 2014

@hadley hadley self-assigned this Aug 1, 2014

@slackline

slackline commented Aug 30, 2014

Not dropping empty groups would be very useful. Often needed when creating summary tables.

@mcfrank

mcfrank commented Oct 29, 2014

+1 - this is a deal-breaker for many analyses

@bpbond

Contributor

bpbond commented Nov 11, 2014

I agree with all the above--would be very useful.

@hadley hadley modified the milestones: 0.4, 0.3.1 Nov 20, 2014

@hadley

Member

hadley commented Nov 20, 2014

@romainfrancois Currently build_index_cpp() doesn't respect the drop attribute:

t1 <- data_frame(
  x = runif(10),
  g1 = rep(1:2, each = 5),
  g2 = factor(g1, 1:3)
)
g1 <- grouped_df(t1, list(quote(g2)), drop = FALSE)
attr(g1, "group_size")
# should be c(5L, 5L, 0L)
attr(g1, "indices")
# should be list(0:4, 5:9, integer(0))

The drop attribute only applies when grouping by a factor, in which case we need to have one group per factor level, regardless of whether or not the level actually applies to the data.
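A minimal base R sketch of the metadata described above, not dplyr's implementation. It relies on `split()`, which keeps unused factor levels by default, to show what "one group per factor level" means for the group sizes and 0-based indices. The variable names are illustrative only.

```r
# Grouping columns as in the example above: level 3 of g2 never occurs.
g1 <- rep(1:2, each = 5)
g2 <- factor(g1, levels = 1:3)

# split() keeps unused factor levels unless drop = TRUE is requested,
# so the empty level 3 still gets an (empty) entry.
indices <- split(seq_along(g2) - 1L, g2)   # 0-based row indices per group
group_size <- lengths(indices)             # number of rows per group

print(group_size)
str(unname(indices))
```

The empty level appears as a zero-length integer vector rather than being removed, which is the behaviour the comment asks `build_index_cpp()` to reproduce.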

This will also affect the single table verbs in the following ways:

  • select(): no effect
  • arrange(): no effect
  • summarise(): functions applied to zero-row groups should be given 0-length vectors. n() should return 0, mean(x) should return NaN
  • filter(): the set of groups should remain constant, even if some groups now have no rows
  • mutate(): don't need to evaluate expressions for empty groups
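The expectations for summarise() and filter() can be illustrated with plain base R, as a rough sketch rather than a statement about dplyr internals. Here `split()` stands in for grouping, and the data are made up for the example.

```r
# Toy data: level "c" has no rows at all.
x <- c(2, 4, 6, 8)
g <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))

groups <- split(x, g)                    # keeps the empty level "c"

# summarise(): each function receives a zero-length vector for empty groups.
n   <- lengths(groups)                   # count per group, 0 for "c"
avg <- vapply(groups, mean, numeric(1))  # mean(numeric(0)) is NaN

# filter(): subsetting a factor keeps its levels, so the set of groups
# stays constant even when a group loses all of its rows.
keep <- x > 5
filtered <- split(x[keep], g[keep])

print(n)
print(avg)
print(lengths(filtered))
```

After filtering, group "a" has no rows left but is still present, alongside the always-empty group "c".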

Eventually, drop = FALSE will be the default, and if it's a hassle to write both drop = FALSE and drop = TRUE branches, I'd happily drop support for drop = TRUE (since you can always re-level the factor yourself, or use a character vector instead).

Does that make sense? If it's a lot of work, we can push off to 0.4

@statwonk, @wsurles, @jennybc, @slackline, @mcfrank, @eipi10 If you'd like to help, the best thing to do would be to work on a set of test cases that exercises all the ways the different verbs might interact with zero-length groups.

@hadley hadley assigned romainfrancois and unassigned hadley Nov 25, 2014

@romainfrancois

Member

romainfrancois commented Nov 27, 2014

Ah. I think I just did not know what drop was supposed to do. That makes it clear. I don't think it's a lot of work.

@bpbond

Contributor

bpbond commented Dec 4, 2014

I have opened pull request #833 which tests whether the single table verbs above handle zero-length groups correctly. Most of the tests are commented out, because dplyr currently fails them, of course.

@ebergelson

ebergelson commented May 28, 2015

+1 , any status updates here? love summarise, need to keep empty levels!

@wsurles

wsurles commented May 28, 2015

@ebergelson, Here is my current hack to get zero-length groups. I often need this so my bar charts will stack.

Here df has 3 columns: name, group, and metric

df2 <- expand.grid(name = unique(df$name), group = unique(df$group)) %>%
    left_join(df, by=c("name","group")) %>%
    mutate(metric = ifelse(is.na(metric),0,metric))
@bpbond

Contributor

bpbond commented May 28, 2015

I do something similar–check for missing groups, then if any generate all combinations and left_join.
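For readers without dplyr at hand, the same "check, expand, join" workaround might look roughly like this in base R. It swaps `left_join()` for `merge(..., all.x = TRUE)`. The data frame and its column names (name, group, metric) are made up for illustration, and filling with 0 only makes sense for count-like metrics.

```r
# Hypothetical input: the combination (b, g2) is missing.
df <- data.frame(
  name   = c("a", "a", "b"),
  group  = c("g1", "g2", "g1"),
  metric = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# All combinations of the observed names and groups.
all_combos <- expand.grid(
  name  = unique(df$name),
  group = unique(df$group),
  stringsAsFactors = FALSE
)

# Only do the join if some combination is actually missing.
has_missing <- nrow(all_combos) > nrow(unique(df[c("name", "group")]))

if (has_missing) {
  # all.x = TRUE makes this a left join from the full grid onto the data.
  df2 <- merge(all_combos, df, by = c("name", "group"), all.x = TRUE)
  df2$metric[is.na(df2$metric)] <- 0   # fill the missing groups with zero
} else {
  df2 <- df
}

print(df2)
```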

Unfortunately, it doesn't seem like this issue is getting much love...perhaps because there is this straightforward workaround.

@ebergelson

ebergelson commented May 29, 2015

@wsurles, @bpbond thanks, yes i used a similar workaround to what you suggest! would love to see a built-in fix like .drop.

@jalapic

jalapic commented Jun 18, 2015

Just to add and agree with everyone above - this is a super critical aspect of many analyses. Would love to see an implementation.

@romainfrancois

Member

romainfrancois commented Jul 14, 2015

Some more details needed here:

If I have this:

> df <- data_frame( x = c(1,1,1,2,2), f = factor( c(1,2,3,1,1) ) )
> df
Source: local data frame [5 x 2]

  x f
1 1 1
2 1 2
3 1 3
4 2 1
5 2 1

And I group by x then f, I'd end up with 6 (2x3) groups where the groups (2, 2) and (2,3) are empty. That's ok. I can manage to implement that I think.

now, what if I have this:

> df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )
> df
Source: local data frame [4 x 2]

  f x
1 1 1
2 1 2
3 2 1
4 2 4

and I want to group by f then x. What would the groups be ? @hadley

@bpbond

Contributor

bpbond commented Jul 14, 2015

Both stats::aggregate and plyr::ddply return 4 groups in this case (1,1; 1,2; 2,1; and 2,4), so I'd suggest that's the behavior to conform to.

@huftis

huftis commented Jul 14, 2015

Shouldn’t it agree with table() instead, i.e., return 9 groups?

> table(df$f, df$x)
  1 2 4
1 1 1 0
2 1 0 1
3 0 0 0

I would expect df %>% group_by(f, x) %>% tally to basically give the same result as with(df, as.data.frame(table(f, x))) and ddply(df, .(f, x), nrow, .drop=FALSE).
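The two candidate behaviours can be compared side by side in base R. This sketch reuses the data from the question; the object names `observed` and `full` are illustrative.

```r
df <- data.frame(
  f = factor(c(1, 1, 2, 2), levels = 1:3),
  x = c(1, 2, 1, 4)
)

# aggregate() only reports combinations that occur in the data.
observed <- aggregate(df$x, by = list(f = df$f, x = df$x), FUN = length)

# table() crosses every factor level with every observed value of x,
# including combinations with a count of zero.
full <- as.data.frame(table(f = df$f, x = df$x))

print(nrow(observed))      # observed combinations only
print(nrow(full))          # all level/value combinations
print(full[full$Freq == 0, ])
```

The first result has one row per observed pair, while the second has one row per cell of the 3 x 3 cross-tabulation, with zeros for the empty cells.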

@mcfrank

mcfrank commented Jul 20, 2015

I thought our desired behavior was to preserve zero-length groups if they are factors (like .drop in plyr), so I would imagine we'd want @huftis's suggestion. I would suggest that the default be drop = TRUE though, so that the default behavior does not change, re @bpbond's suggestion.

@krlmlr

Member

krlmlr commented Aug 22, 2017

Does tidyr::complete() work for factors?

@GegznaV

GegznaV commented Mar 5, 2018

All factor levels and combinations of factor levels must be preserved by default. This behavior can be controlled by parameters such as drop, expand, etc. Thus the default behavior of dplyr::count() should be like this:

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
df %>% dplyr::count(x, y)
#> # A tibble: 4 x 3
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     2 1         1
#> 3     1 2         0
#> 4     2 2         0

Zero length groups (combinations of groups) can be filtered later. But for exploratory analysis we must see the full picture.

  1. Are there any status updates on the solution to this issue?
  2. Are there any plans to completely solve this issue?
@romainfrancois

Member

romainfrancois commented Mar 5, 2018

2: yes definitely
1: There are some technical implementation difficulties about this issue, but I'll look into it in the next few weeks.

@romainfrancois

Member

romainfrancois commented Apr 9, 2018

We might get away with this by expanding the data after the fact, something like this:

library(tidyverse)

truly_group_by <- function(data, ...){
  dots <- quos(...)
  data <- group_by( data, !!!dots )

  labels <- attr( data, "labels" )
  labnames <- names(labels)
  labels <- mutate( labels, ..index.. =  attr(data, "indices") )

  expanded <- labels %>%
    tidyr::expand( !!!dots ) %>%
    left_join( labels, by = labnames ) %>%
    mutate( ..index.. = map(..index.., ~if(is.null(.x)) integer() else .x ) )

  indices <- pull( expanded, ..index..)
  group_sizes <- map_int( indices, length)
  labels <- select( expanded, -..index..)

  attr(data, "labels")  <- labels
  attr(data, "indices") <- indices
  attr(data, "group_sizes") <- group_sizes

  data
}

df  <- data_frame(
  x = 1:2,
  y = factor(c(1, 1), levels = 1:2)
)
tally( truly_group_by(df, x, y) )
#> # A tibble: 4 x 3
#> # Groups:   x [?]
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     1 2         0
#> 3     2 1         1
#> 4     2 2         0
tally( truly_group_by(df, y, x) )
#> # A tibble: 4 x 3
#> # Groups:   y [?]
#>   y         x     n
#>   <fct> <int> <int>
#> 1 1         1     1
#> 2 1         2     1
#> 3 2         1     0
#> 4 2         2     0

Obviously, down the line this would be handled internally, without using tidyr or purrr.

@romainfrancois

Member

romainfrancois commented Apr 9, 2018

This seems to take care of the original question on Stack Overflow:

> df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
> df$b = factor(df$b, levels=1:3)
> df %>%
+   group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 2 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
> df %>%
+   truly_group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 3 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
3 3           0 FALSE
@romainfrancois

Member

romainfrancois commented Apr 9, 2018

The key here being this

 tidyr::expand( !!!dots ) %>%

which means expanding all possibilities regardless of the variables being factors or not.

I'd say we either:

  • expand all when drop=FALSE, potentially having lots of 0 length groups
  • do what we do now if drop=TRUE

perhaps have a function to toggle dropness.

This is a relatively cheap operation I'd say because it only involves manipulating the metadata, so perhaps it is less risky to do this in R first ?

@krlmlr

Member

krlmlr commented Apr 10, 2018

Did you mean crossing() instead of expand()?

Looking at the internals, do you agree that we "only" need to change build_index_cpp(), specifically the generation of the labels data frame, to make this happen?

Can we perhaps start with expanding only factors with drop = FALSE? I considered a "natural" syntax, but this may be too confusing in the end (and perhaps even not powerful enough):

group_by(data, crossing(col1, col2), col3)

Semantics: use all combinations of col1 and col2, and their existing combinations with col3.

@romainfrancois

Member

romainfrancois commented Apr 10, 2018

Yes, I'd say this only affects build_index_cpp and the generation of the attributes labels, indices and group_sizes which I'd like to squash in a tidy structure as part of #3489

The "only expanding factors" part of this discussion is what took so long.

What would be the results of these:

library(dplyr)

d <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
  x  = 1:8,
  y  = rep( 1:4, each = 2)
)

f <- function(data, ...){
  group_by(data, !!!quos(...))  %>%
    tally()
}
f(d, f1, f2, x)
f(d, x, f1, f2)

f(d, f1, f2, x, y)
f(d, x, f1, f2, y)
@krlmlr

Member

krlmlr commented Apr 10, 2018

I think f(d, f1, f2, x) should give the same results as f(d, x, f1, f2), if row order is ignored. Same for the other two.
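The order-independence property can be checked on a small scale with base R. This is only a sketch for the factor-only case (f1 and f2 from the example above), using `table()` as a stand-in for fully expanded grouping. The helper `count_groups()` is hypothetical and not part of dplyr.

```r
d <- data.frame(
  f1 = factor(rep(c("a", "b"), each = 4), levels = c("a", "b", "c")),
  f2 = factor(rep(c("d", "e", "f", "g"), each = 2),
              levels = c("d", "e", "f", "g", "h"))
)

# Hypothetical helper: count every combination of factor levels, then put
# rows and columns into a canonical order so results can be compared.
count_groups <- function(data, vars) {
  out <- as.data.frame(table(data[vars]), responseName = "n")
  key <- sort(vars)
  ord <- do.call(order, unname(as.list(out[key])))
  out <- out[ord, c(key, "n")]
  rownames(out) <- NULL
  out
}

by_f1_f2 <- count_groups(d, c("f1", "f2"))
by_f2_f1 <- count_groups(d, c("f2", "f1"))

# Same groups and counts, whichever variable is listed first.
print(isTRUE(all.equal(by_f1_f2, by_f2_f1)))
```

How non-factor columns such as x and y should take part in the expansion is the open question in this thread, so they are deliberately left out here.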

Also interesting:

f(d, f2, x, f1, y)
d %>% sample_frac(0.3) %>% f(...)

I like the idea of implementing full expansion only for factors. For non-character data (including logicals), we could define/use a factor-like class that inherits the respective data type. Perhaps provided by forcats? This makes it more difficult to shoot yourself in the foot.

@romainfrancois

Member

romainfrancois commented Apr 10, 2018

implementation in progress in #3492

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )

( res1 <- tally(group_by(df,f,x, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   f [?]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 1        1.     1
#> 2 1        2.     1
#> 3 1        4.     0
#> 4 2        1.     1
#> 5 2        2.     0
#> 6 2        4.     1
#> 7 3        1.     0
#> 8 3        2.     0
#> 9 3        4.     0
( res2 <- tally(group_by(df,x,f, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   x [?]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1    1. 1         1
#> 2    1. 2         1
#> 3    1. 3         0
#> 4    2. 1         1
#> 5    2. 2         0
#> 6    2. 3         0
#> 7    4. 1         0
#> 8    4. 2         1
#> 9    4. 3         0

all.equal( res1, arrange(res2, f, x) )
#> [1] TRUE

all.equal( filter(res1, n>0), tally(group_by(df, f, x)) )
#> [1] TRUE
all.equal( filter(res2, n>0), tally(group_by(df, x, f)) )
#> [1] TRUE

Created on 2018-04-10 by the reprex package (v0.2.0).

@kenahoo

Contributor

kenahoo commented May 15, 2018

As for whether complete() solves the issue - no, not really. Whatever summaries are being computed, their behaviors on empty vectors need to be preserved, not patched up after the fact. For example:

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
     group_by(x) %>%
     summarize(min=min(y), sum=sum(y), prod=prod(y))
# Should be:
#> x       min   sum  prod
#> 1         4     9    20
#> 2       Inf     0     1

sum and prod (and to a lesser extent, min) (and various other functions) have very well-defined semantics on empty vectors, and it's not great to have to come along afterwards with complete() and re-define those behaviors.

@romainfrancois

Member

romainfrancois commented May 15, 2018

@kenahoo I'm not sure I understand. This is what you get with the current dev version. So the only thing that you don't get is the warning from min()

library(dplyr)

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
  group_by(x) %>%
  summarize(min=min(y), sum=sum(y), prod=prod(y))
#> # A tibble: 2 x 4
#>   x       min   sum  prod
#>   <fct> <dbl> <int> <dbl>
#> 1 1         4     9    20
#> 2 2       Inf     0     1

min(integer())
#> Warning in min(integer()): no non-missing arguments to min; returning Inf
#> [1] Inf
sum(integer())
#> [1] 0
prod(integer())
#> [1] 1

Created on 2018-05-15 by the reprex package (v0.2.0).

@kenahoo

Contributor

kenahoo commented May 15, 2018

@romainfrancois Oh cool, I didn't realize you were already so far along on this implementation. Looks great!

@lock

lock bot commented Nov 11, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 11, 2018
