-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: empty groups #3492
WIP: empty groups #3492
Conversation
romainfrancois
commented
Apr 10, 2018
•
edited
Loading
edited
- group_by preserves all factor levels, even with no data in the group (Preserve zero-length groups #341)
For now the interface is, adding This does:
|
Looks like we need |
build_index_cpp(data_); | ||
// when there is no drop attribute, we assumed drop=TRUE | ||
// the default for group_by. | ||
SEXP drop = data_.attr("drop"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember the "drop"
attribute was used for something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so.
src/group_indices.cpp
Outdated
// cleanup after expand.grid | ||
new_labels.attr("out.attrs") = R_NilValue; | ||
|
||
IntegerVector new_labels_order = OrderVisitors(new_labels).apply(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to reorder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of what expand.grid does
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can do rev(expand.grid(!!!rev(dots)))
?
We should look into replacing expand.grid()
with a home-grown visitor at some point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. This is just a first step to exchange about the feature, I'll polish the implementation later.
How do we handle |
For things that are not factors, it depends on what unique does. For factors yeah i suppose i need to add something to handle na if any. |
Now with library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data_frame(
f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
y = rep( 1:4, each = 2)
)
show_groups <- function(.data, ...){
attr(group_by(.data, ...), "labels")
}
show_groups(df, f1, y, .drop = TRUE , .expand = FALSE)
#> f1 y
#> 1 a 1
#> 2 a 2
#> 3 b 3
#> 4 b 4
show_groups(df, f1, y, .drop = TRUE , .expand = TRUE)
#> Warning: Column `f1` joining factors with different levels, coercing to
#> character vector
#> f1 y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 a 4
#> 5 b 1
#> 6 b 2
#> 7 b 3
#> 8 b 4
show_groups(df, f1, y, .drop = FALSE, .expand = TRUE)
#> f1 y
#> 1 a 1
#> 2 a 2
#> 3 a 3
#> 4 a 4
#> 5 b 1
#> 6 b 2
#> 7 b 3
#> 8 b 4
#> 9 c 1
#> 10 c 2
#> 11 c 3
#> 12 c 4
show_groups(df, f1, y, .drop = FALSE, .expand = FALSE) # error
#> Error: if `drop` is FALSE, expand must be `TRUE` The join warning is unexpected, not sure what happens here. |
How about using the
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data_frame(
f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
y = rep( 1:4, each = 2)
)
show_groups <- function(.data){
res <- summarise(.data, n = n()) %>%
arrange(desc(n))
cat( "non empty groups\n" )
print(filter(res, n > 0))
empties <- filter(res, n == 0)
if(nrow(empties)){
cat( "\n empty groups\n")
print(empties)
}
}
show_groups(group_by(df, f1, y, .empty = "none"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups: f1 [2]
#> f1 y n
#> <fct> <int> <int>
#> 1 a 1 2
#> 2 a 2 2
#> 3 b 3 2
#> 4 b 4 2
show_groups(group_by(df, f1, y, .empty = "some"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups: f1 [2]
#> f1 y n
#> <fct> <int> <int>
#> 1 a 1 2
#> 2 a 2 2
#> 3 b 3 2
#> 4 b 4 2
#>
#> empty groups
#> # A tibble: 4 x 3
#> # Groups: f1 [2]
#> f1 y n
#> <fct> <int> <int>
#> 1 a 3 0
#> 2 a 4 0
#> 3 b 1 0
#> 4 b 2 0
show_groups(group_by(df, f1, y, .empty = "all"))
#> non empty groups
#> # A tibble: 4 x 3
#> # Groups: f1 [3]
#> f1 y n
#> <fct> <int> <int>
#> 1 a 1 2
#> 2 a 2 2
#> 3 b 3 2
#> 4 b 4 2
#>
#> empty groups
#> # A tibble: 8 x 3
#> # Groups: f1 [3]
#> f1 y n
#> <fct> <int> <int>
#> 1 a 3 0
#> 2 a 4 0
#> 3 b 1 0
#> 4 b 2 0
#> 5 c 1 0
#> 6 c 2 0
#> 7 c 3 0
#> 8 c 4 0 Created on 2018-04-24 by the reprex package (v0.2.0). |
Current status, with automatic nesting, at least what I think it is. This is very in progress, so might 💣. library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- data_frame(
f1 = factor( rep( c("a", "b"), each = 4 ), levels = c("a", "b", "c") ),
f2 = factor( rep( c("d", "e", "f", "g"), each = 2 ), levels = c("d", "e", "f", "g", "h") ),
y = rep( 1:4, each = 2)
)
group_by(df, f1, y) %>% tally()
#> # A tibble: 5 x 3
#> # Groups: f1 [?]
#> f1 y n
#> <fct> <int> <int>
#> 1 a 1 2
#> 2 a 2 2
#> 3 b 3 2
#> 4 b 4 2
#> 5 c NA 0
group_by(df, f2, y) %>% tally()
#> # A tibble: 5 x 3
#> # Groups: f2 [?]
#> f2 y n
#> <fct> <int> <int>
#> 1 d 1 2
#> 2 e 2 2
#> 3 f 3 2
#> 4 g 4 2
#> 5 h NA 0
group_by(df, f1, f2, y) %>% tally()
#> # A tibble: 15 x 4
#> # Groups: f1, f2 [?]
#> f1 f2 y n
#> <fct> <fct> <int> <int>
#> 1 a d 1 2
#> 2 a e 2 2
#> 3 a f NA 0
#> 4 a g NA 0
#> 5 a h NA 0
#> 6 b d NA 0
#> 7 b e NA 0
#> 8 b f 3 2
#> 9 b g 4 2
#> 10 b h NA 0
#> 11 c d NA 0
#> 12 c e NA 0
#> 13 c f NA 0
#> 14 c g NA 0
#> 15 c h NA 0 Created on 2018-04-27 by the reprex package (v0.2.0). |
skipping this test now bc I think the result should be different: test_that("filter(FALSE) drops indices", {
skip()
out <- mtcars %>%
group_by(cyl) %>%
filter(FALSE) %>%
attr("indices")
expect_identical(out, list())
}) Right now, when regrouping after the filter, we get the sentinel NA: > mtcars %>%
+ group_by(cyl) %>% tally()
# A tibble: 3 x 2
cyl n
<dbl> <int>
1 4. 11
2 6. 7
3 8. 14
>
> mtcars %>%
+ group_by(cyl) %>% filter(FALSE) %>% tally()
# A tibble: 1 x 2
cyl n
<dbl> <int>
1 NA 0 I'd prefer that the grouping structure be preserved with all empty groups, but this is open for discussion because |
Also skipping this one. test_that("FactorVisitor handles NA. #183", {
skip("until we can group again by a factor that has NA")
g <- group_by(MASS::survey, M.I)
expect_equal(g$M.I, MASS::survey$M.I)
}) Do we want to support |
I had to remove this from the examples of group_by_all(mtcars, as.factor) because it would under the new nesting design with expansion of all factor levels, it would generate a grouping structure with > map_int(mtcars, compose(length, unique)) %>% prod()
[1] 61393464000 Wondering if we should issue a warning when the number of groups will be >> than the number of rows. We can get a lower bound of the #️⃣ of groups:
|
So we don't need the |
b186c8c
to
2725bce
Compare
suppressPackageStartupMessages(library(dplyr))
d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 1 2
#> 3 2 3
#> 4 2 4
#> 5 3 5
#> 6 3 6
#> 7 4 7
#> 8 4 8
df <- d %>%
group_by(x) %>%
filter(y<5)
tally(df)
#> # A tibble: 4 x 2
#> x n
#> <int> <int>
#> 1 1 2
#> 2 2 2
#> 3 3 0
#> 4 4 0 The groups with x == 3 or 4 are empty, but suppressPackageStartupMessages(library(dplyr))
d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 1 2
#> 3 2 3
#> 4 2 4
#> 5 3 5
#> 6 3 6
#> 7 4 7
#> 8 4 8
df <- d %>%
group_by(x) %>%
filter(y<5) %>%
group_by(x)
tally(df)
#> # A tibble: 2 x 2
#> x n
#> <int> <int>
#> 1 1 2
#> 2 2 2 i.e. regenerate the grouping structure after the filtering (would be the easiest) or come up with some way to keep only the groups that this would create (more efficient, maybe more work). However if suppressPackageStartupMessages(library(dplyr))
d <- tibble(x = rep(1:4, each=2), y = 1:8)
d
#> # A tibble: 8 x 2
#> x y
#> <int> <int>
#> 1 1 1
#> 2 1 2
#> 3 2 3
#> 4 2 4
#> 5 3 5
#> 6 3 6
#> 7 4 7
#> 8 4 8
df <- d %>%
group_by(x = as.factor(x)) %>%
filter(y<5) %>%
group_by(x)
tally(df)
#> # A tibble: 4 x 2
#> x n
#> <fct> <int>
#> 1 1 2
#> 2 2 2
#> 3 3 0
#> 4 4 0 |
I've changed the error message for implicit NA in factors so that it suggests using forcats to make them explicit. library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
d <- data.frame(x = 1:3, f = factor(c("a", "b", NA)))
group_by(d,f)
#> Error in grouped_df_impl(data, unname(vars)): Column `f` contains implicit missing values. Consider `forcats::fct_explicit_na` to turn them explicit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean to remove slice_impl()
?
I think stopping is too aggressive as it is likely to break existing code; a warning would be better. |
warning + making them explicit automatically then ? What to do when we group by f1 and f2 and we see NA in f2 within only one level of f1? Making f2 na explicit would make the na appear in all levels of f1. |
@krlmlr i moved slice related code in filter.cpp because they share code |
I think warn, and drop the implicit NAs? |
🤔 if we drop the row, we end up with less rows after the group_by Or drop as in the row does not belong to any group. So the sum of the group sizes would not be equal to the nrow. The second option is easier to implement, but i’m not sure either way |
Hmmm true. If we have to handle them anyway, maybe we should just do it without the warning? |
Yeah, I was confused because the diff for Can we use a nested array (an array of arrays of ... of row IDs) to keep group indices? I think this might help solve the "empty groups after |
@hadley I think we should treat implicit NA differently from a true factor level, so have them create groups only where they appear. This would be inline with the old behavior: library(dplyr)
d <- tibble(
f1 = factor(c(1,1,2,2)),
f2 = factor(c(1,2,1,NA)),
x = 1:4
)
group_by(d)
#> # A tibble: 4 x 3
#> f1 f2 x
#> <fct> <fct> <int>
#> 1 1 1 1
#> 2 1 2 2
#> 3 2 1 3
#> 4 2 <NA> 4 Would be quite cheap to do also, I just have to always have a vector with one more item and at the end look if I've seen NA or not. Making them a true level means I have to search for NA before the algorithm. Removing the rows means filtering all the columns, when currently group_by only affects the metadata. A warning that can be silenced by an option might still be worth it though, to encourage explicit na in factors. |
ebcfbcb
to
5873a30
Compare
…e visited dataframe.
671fbf2
to
63e411d
Compare
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |