Preserve zero-length groups #341

Closed
hadley opened this issue Mar 20, 2014 · 44 comments
Labels: feature (a feature request or enhancement), wip (work in progress)

@hadley
Member

hadley commented Mar 20, 2014

http://stackoverflow.com/questions/22523131

Not sure what the interface to this should be - probably should default to drop = FALSE.

@eipi10
Contributor

eipi10 commented Mar 20, 2014

Thanks for opening up this issue Hadley.

@statwonk

👍 ran into the same issue today, drop = FALSE would be a big help for me!

@wsurles

wsurles commented Jul 9, 2014

Any idea on the time frame for putting a .drop = FALSE equivalent into dplyr? I need this for certain rCharts to render correctly.

In the meantime, I DID get the answer in your link to work.
http://stackoverflow.com/questions/22523131

I grouped by two variables.

@jennybc
Member

jennybc commented Jul 23, 2014

+1 for option to not drop empty groups

@hadley
Member Author

hadley commented Aug 1, 2014

May be some overlap with #486 and #413.

@hadley hadley added this to the 0.3.1 milestone Aug 1, 2014
@hadley hadley self-assigned this Aug 1, 2014
@slackline

Not dropping empty groups would be very useful. Often needed when creating summary tables.

@mcfrank

mcfrank commented Oct 29, 2014

+1 - this is a deal-breaker for many analyses

@bpbond
Contributor

bpbond commented Nov 11, 2014

I agree with all the above--would be very useful.

@hadley hadley modified the milestones: 0.4, 0.3.1 Nov 20, 2014
@hadley
Member Author

hadley commented Nov 20, 2014

@romainfrancois Currently build_index_cpp() doesn't respect the drop attribute:

t1 <- data_frame(
  x = runif(10),
  g1 = rep(1:2, each = 5),
  g2 = factor(g1, 1:3)
)
g1 <- grouped_df(t1, list(quote(g2)), drop = FALSE)
attr(g1, "group_size")
# should be c(5L, 5L, 0L)
attr(g1, "indices")
# should be list(0:4, 5:9, integer(0))

The drop attribute only applies when grouping by a factor, in which case we need to have one group per factor level, regardless of whether or not the level actually applies to the data.
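
For readers who want to see the intended structure without touching dplyr internals, base R's split() already keeps one element per factor level. This is only a sketch under that analogy, not how build_index_cpp() works; note that R positions are 1-based, whereas the expected indices in the comment above are written 0-based.

```r
# Base R analogy: split() uses drop = FALSE by default, so unused
# factor levels still get an (empty) element.
g1 <- rep(1:2, each = 5)
g2 <- factor(g1, levels = 1:3)

indices <- split(seq_along(g2), g2)  # row positions per level (1-based)
group_size <- lengths(indices)       # one size per level, 0 for level "3"

print(indices)
print(group_size)
```

The element for level "3" is an empty integer vector, which corresponds to the integer(0) entry expected above.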

This will also affect the single table verbs in the following ways:

  • select(): no effect
  • arrange(): no effect
  • summarise(): functions applied to zero-row groups should be given 0-length vectors. n() should return 0, mean(x) should return NaN
  • filter(): the set of groups should remain constant, even if some groups now have no rows
  • mutate(): don't need to evaluate expressions for empty groups
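
The summarise() and filter() points can be illustrated in base R. The data and variable names below are made up for illustration, and this shows only the semantics being asked for, not dplyr's implementation.

```r
# Hypothetical data: level "c" has no rows.
x <- c(1.5, 2.5, 3.5, 4.5)
g <- factor(c("a", "a", "b", "b"), levels = c("a", "b", "c"))

groups <- split(x, g)                   # "c" maps to numeric(0)
n_per_group <- lengths(groups)          # analogue of n(): 0 for "c"
mean_per_group <- sapply(groups, mean)  # mean(numeric(0)) is NaN

# Filtering rows keeps the factor levels, so the set of groups is
# unchanged even when some groups lose all their rows.
keep <- x > 3
n_after_filter <- lengths(split(x[keep], g[keep]))

print(n_per_group)
print(mean_per_group)
print(n_after_filter)
```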

Eventually, drop = FALSE will be the default, and if it's a hassle to write both drop = FALSE and drop = TRUE branches, I'd happily drop support for drop = TRUE (since you can always re-level the factor yourself, or use a character vector instead).

Does that make sense? If it's a lot of work, we can push off to 0.4

@statwonk, @wsurles, @jennybc, @slackline, @mcfrank, @eipi10 If you'd like to help, the best thing to do would be to work on a set of test cases that exercises all the ways the different verbs might interact with zero-length groups.

@hadley hadley assigned romainfrancois and unassigned hadley Nov 25, 2014
@romainfrancois
Member

Ah. I think I just did not know what drop was supposed to do. That makes it clear. I don't think it's a lot of work.

@bpbond
Contributor

bpbond commented Dec 4, 2014

I have opened pull request #833 which tests whether the single table verbs above handle zero-length groups correctly. Most of the tests are commented out, because dplyr currently fails them, of course.

@ebergelson

+1 , any status updates here? love summarise, need to keep empty levels!

@wsurles

wsurles commented May 28, 2015

@ebergelson, Here is my current hack to get zero-length groups. I often need this so my bar charts will stack.

Here df has 3 columns: name, group, and metric

df2 <- expand.grid(name = unique(df$name), group = unique(df$group)) %>%
    left_join(df, by=c("name","group")) %>%
    mutate(metric = ifelse(is.na(metric),0,metric))
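
The same idea can be tried on a toy data set in base R, with merge(..., all.x = TRUE) standing in for left_join(). The data values are hypothetical. One caveat worth noting: unique() only recovers values that occur in the data, so for a factor with unused levels one would presumably use levels() instead.

```r
# Hypothetical data: the (b, g2) combination is missing.
df <- data.frame(
  name   = c("a", "a", "b"),
  group  = c("g1", "g2", "g1"),
  metric = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# All name x group combinations, then a left join back onto the data.
grid <- expand.grid(
  name  = unique(df$name),
  group = unique(df$group),
  stringsAsFactors = FALSE
)
df2 <- merge(grid, df, by = c("name", "group"), all.x = TRUE)

# Missing combinations come back as NA; the hack treats them as 0.
df2$metric[is.na(df2$metric)] <- 0
print(df2)
```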

@bpbond
Contributor

bpbond commented May 28, 2015

I do something similar–check for missing groups, then if any generate all combinations and left_join.

Unfortunately, it doesn't seem like this issue is getting much love...perhaps because there is this straightforward workaround.

@ebergelson

@wsurles, @bpbond thanks, yes i used a similar workaround to what you suggest! would love to see a built-in fix like .drop.

@jalapic

jalapic commented Jun 18, 2015

Just to add and agree with everyone above - this is a super critical aspect of many analyses. Would love to see an implementation.

@romainfrancois
Member

Some more details needed here:

If I have this:

> df <- data_frame( x = c(1,1,1,2,2), f = factor( c(1,2,3,1,1) ) )
> df
Source: local data frame [5 x 2]

  x f
1 1 1
2 1 2
3 1 3
4 2 1
5 2 1

And I group by x then f, I'd end up with 6 (2x3) groups where the groups (2, 2) and (2,3) are empty. That's ok. I can manage to implement that I think.

now, what if I have this:

> df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )
> df
Source: local data frame [4 x 2]

  f x
1 1 1
2 1 2
3 2 1
4 2 4

and I want to group by f then x. What would the groups be ? @hadley

@bpbond
Contributor

bpbond commented Jul 14, 2015

Both stats::aggregate and plyr::ddply return 4 groups in this case (1,1; 1,2; 2,1; and 2,4), so I'd suggest that's the behavior to conform to.

@huftis

huftis commented Jul 14, 2015

Shouldn’t it agree with table() instead, i.e., return 9 groups?

> table(df$f, df$x)
  1 2 4
1 1 1 0
2 1 0 1
3 0 0 0

I would expect df %>% group_by(f, x) %>% tally to basically give the same result as with(df, as.data.frame(table(f, x))) and ddply(df, .(f, x), nrow, .drop=FALSE).
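
The two candidate behaviours (4 groups versus 9 groups) can be reproduced side by side in base R. This sketch uses the same df as in the question above; the helper column n is added only for counting.

```r
df <- data.frame(f = factor(c(1, 1, 2, 2), levels = 1:3), x = c(1, 2, 1, 4))

# table() crosses every level of f with every observed value of x.
full <- as.data.frame(table(f = df$f, x = df$x))

# aggregate() only reports combinations that occur in the data.
df$n <- 1L
observed <- aggregate(n ~ f + x, data = df, FUN = sum)

nrow(full)      # 9
nrow(observed)  # 4
```

Filtering full down to rows with Freq > 0 leaves the same four combinations that aggregate() reports.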

@mcfrank

mcfrank commented Jul 20, 2015

I thought our desired behavior was to preserve zero-length groups if they are factors (like .drop in plyr), so I would imagine we'd want @huftis's suggestion. I would suggest that the default be drop = TRUE though, so that the default behavior does not change, re @bpbond's suggestion.

@krlmlr
Member

krlmlr commented Aug 22, 2017

Does tidyr::complete() work for factors?

@GegznaV

GegznaV commented Mar 5, 2018

All factor levels and combinations of factor levels must be preserved by default. This behavior can be controlled by parameters such as drop, expand, etc. Thus the default behavior of dplyr::count() should be like this:

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
df %>% dplyr::count(x, y)
#> # A tibble: 4 x 3
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     2 1         1
#> 3     1 2         0
#> 4     2 2         0

Zero length groups (combinations of groups) can be filtered later. But for exploratory analysis we must see the full picture.

  1. Are there any status updates on the solution to this issue?
  2. Are there any plans to completely solve this issue?

@romainfrancois
Member

romainfrancois commented Mar 5, 2018

2: yes definitely
1: There are some technical implementation difficulties about this issue, but I'll look into it in the next few weeks.

@romainfrancois
Member

We might get away with this by expanding the data after the fact, something like this:

library(tidyverse)

truly_group_by <- function(data, ...){
  dots <- quos(...)
  data <- group_by( data, !!!dots )

  labels <- attr( data, "labels" )
  labnames <- names(labels)
  labels <- mutate( labels, ..index.. =  attr(data, "indices") )

  expanded <- labels %>%
    tidyr::expand( !!!dots ) %>%
    left_join( labels, by = labnames ) %>%
    mutate( ..index.. = map(..index.., ~if(is.null(.x)) integer() else .x ) )

  indices <- pull( expanded, ..index..)
  group_sizes <- map_int( indices, length)
  labels <- select( expanded, -..index..)

  attr(data, "labels")  <- labels
  attr(data, "indices") <- indices
  attr(data, "group_sizes") <- group_sizes

  data
}

df  <- data_frame(
  x = 1:2,
  y = factor(c(1, 1), levels = 1:2)
)
tally( truly_group_by(df, x, y) )
#> # A tibble: 4 x 3
#> # Groups:   x [?]
#>       x y         n
#>   <int> <fct> <int>
#> 1     1 1         1
#> 2     1 2         0
#> 3     2 1         1
#> 4     2 2         0
tally( truly_group_by(df, y, x) )
#> # A tibble: 4 x 3
#> # Groups:   y [?]
#>   y         x     n
#>   <fct> <int> <int>
#> 1 1         1     1
#> 2 1         2     1
#> 3 2         1     0
#> 4 2         2     0

obviously down the line, this would be handled internally, sans using tidyr or purrr.

@romainfrancois
Member

This seems to take care of the original question on SO:

> df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
> df$b = factor(df$b, levels=1:3)
> df %>%
+   group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 2 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
> df %>%
+   truly_group_by(b) %>%
+   summarise(count_a=length(a), .drop=FALSE)
# A tibble: 3 x 3
  b     count_a .drop
  <fct>   <int> <lgl>
1 1           6 FALSE
2 2           6 FALSE
3 3           0 FALSE

@romainfrancois
Member

The key here being this

 tidyr::expand( !!!dots ) %>%

which means expanding all possibilities regardless of the variables being factors or not.

I'd say we either:

  • expand all when drop=FALSE, potentially having lots of 0 length groups
  • do what we do now if drop=TRUE

perhaps have a function to toggle dropness.

This is a relatively cheap operation I'd say because it only involves manipulating the metadata, so perhaps it is less risky to do this in R first ?
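
To make the "expand the metadata" idea concrete without tidyr, here is a rough base R sketch. The helper all_group_keys() is hypothetical (it is not a dplyr or tidyr function): factors contribute all their levels and other columns contribute their observed values, which is roughly what expanding every grouping variable amounts to.

```r
# Hypothetical helper: build every combination of grouping values.
all_group_keys <- function(data, vars) {
  values <- lapply(data[vars], function(col) {
    if (is.factor(col)) {
      factor(levels(col), levels = levels(col))  # all levels, used or not
    } else {
      sort(unique(col))                          # observed values only
    }
  })
  expand.grid(values, KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
}

df <- data.frame(x = 1:2, y = factor(c(1, 1), levels = 1:2))
keys <- all_group_keys(df, c("x", "y"))

# Count observed rows per key, then give unmatched keys a count of 0.
df$n <- 1L
observed <- aggregate(n ~ x + y, data = df, FUN = sum)
result <- merge(keys, observed, by = c("x", "y"), all.x = TRUE)
result$n[is.na(result$n)] <- 0L
print(result)
```

Only the key table grows here; the data rows themselves are never copied, which is why the operation stays cheap.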

@krlmlr
Member

krlmlr commented Apr 10, 2018

Did you mean crossing() instead of expand()?

Looking at the internals, do you agree that we "only" need to change build_index_cpp(), specifically the generation of the labels data frame, to make this happen?

Can we perhaps start with expanding only factors with drop = FALSE? I considered a "natural" syntax, but this may be too confusing in the end (and perhaps even not powerful enough):

group_by(data, crossing(col1, col2), col3)

Semantics: use all combinations of col1 and col2, and their existing combinations with col3.

@romainfrancois
Member

Yes, I'd say this only affects build_index_cpp and the generation of the attributes labels, indices and group_sizes which I'd like to squash in a tidy structure as part of #3489

The "only expanding factors" part of this discussion is what took so long.

What would be the results of these:

library(dplyr)

d <- data_frame(
  f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
  f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
  x  = 1:8,
  y  = rep( 1:4, each = 2)
)

f <- function(data, ...){
  group_by(data, !!!quos(...))  %>%
    tally()
}
f(d, f1, f2, x)
f(d, x, f1, f2)

f(d, f1, f2, x, y)
f(d, x, f1, f2, y)

@krlmlr
Member

krlmlr commented Apr 10, 2018

I think f(d, f1, f2, x) should give the same results as f(d, x, f1, f2), if row order is ignored. Same for the other two.

Also interesting:

f(d, f2, x, f1, y)
d %>% sample_frac(0.3) %>% f(...)

I like the idea of implementing full expansion only for factors. For non-character data (including logicals), we could define/use a factor-like class that inherits the respective data type. Perhaps provided by forcats? This makes it more difficult to shoot yourself in the foot.

@romainfrancois
Member

implementation in progress in #3492

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data_frame( f = factor( c(1,1,2,2), levels = 1:3), x = c(1,2,1,4) )

( res1 <- tally(group_by(df,f,x, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   f [?]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 1        1.     1
#> 2 1        2.     1
#> 3 1        4.     0
#> 4 2        1.     1
#> 5 2        2.     0
#> 6 2        4.     1
#> 7 3        1.     0
#> 8 3        2.     0
#> 9 3        4.     0
( res2 <- tally(group_by(df,x,f, drop = FALSE)) )
#> # A tibble: 9 x 3
#> # Groups:   x [?]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1    1. 1         1
#> 2    1. 2         1
#> 3    1. 3         0
#> 4    2. 1         1
#> 5    2. 2         0
#> 6    2. 3         0
#> 7    4. 1         0
#> 8    4. 2         1
#> 9    4. 3         0

all.equal( res1, arrange(res2, f, x) )
#> [1] TRUE

all.equal( filter(res1, n>0), tally(group_by(df, f, x)) )
#> [1] TRUE
all.equal( filter(res2, n>0), tally(group_by(df, x, f)) )
#> [1] TRUE

Created on 2018-04-10 by the reprex package (v0.2.0).

@kenahoo
Contributor

kenahoo commented May 15, 2018

As for whether complete() solves the issue - no, not really. Whatever summaries are being computed, their behaviors on empty vectors need to be preserved, not patched up after the fact. For example:

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
     group_by(x) %>%
     summarize(min=min(y), sum=sum(y), prod=prod(y))
# Should be:
#> x       min   sum  prod
#> 1         4     9    20
#> 2       Inf     0     1

sum and prod (and to a lesser extent, min) (and various other functions) have very well-defined semantics on empty vectors, and it's not great to have to come along afterwards with complete() and re-define those behaviors.
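
The difference between the two routes can be shown in base R, as a sketch only. Here merge() stands in for the complete()-style patch-up and split() stands in for grouping that keeps empty levels; the intermediate names are made up for illustration.

```r
d <- data.frame(x = factor(1, levels = 1:2), y = 4:5)

# Patch-up route: summarise observed groups, then add the missing level.
observed <- aggregate(y ~ x, data = d, FUN = sum)
keys <- data.frame(x = factor(levels(d$x), levels = levels(d$x)))
patched <- merge(keys, observed, by = "x", all.x = TRUE)  # level 2 gets NA

# Direct route: every level gets a (possibly empty) vector to summarise.
by_level <- split(d$y, d$x)
direct_sum  <- sapply(by_level, sum)   # sum(integer(0)) is 0
direct_prod <- sapply(by_level, prod)  # prod(integer(0)) is 1

print(patched)
print(direct_sum)
print(direct_prod)
```

With the patch-up route the NA has to be replaced by hand with a different fill value per summary function, which is the re-definition being objected to.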

@romainfrancois
Member

romainfrancois commented May 15, 2018

@kenahoo I'm not sure I understand. This is what you get with the current dev version. So the only thing that you don't get is the warning from min()

library(dplyr)

data.frame(x=factor(1, levels=1:2), y=4:5) %>%
  group_by(x) %>%
  summarize(min=min(y), sum=sum(y), prod=prod(y))
#> # A tibble: 2 x 4
#>   x       min   sum  prod
#>   <fct> <dbl> <int> <dbl>
#> 1 1         4     9    20
#> 2 2       Inf     0     1

min(integer())
#> Warning in min(integer()): no non-missing arguments to min; returning Inf
#> [1] Inf
sum(integer())
#> [1] 0
prod(integer())
#> [1] 1

Created on 2018-05-15 by the reprex package (v0.2.0).

@kenahoo
Contributor

kenahoo commented May 15, 2018

@romainfrancois Oh cool, I didn't realize you were already so far along on this implementation. Looks great!

@lock

lock bot commented Nov 11, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Nov 11, 2018