Relax grouped data frame constraint concerning number of rows in groups vs data #3837
Comments
I think what you're describing is fundamentally different from the mental construct of I get the utility of multiple permutations, it just seems separate from |
I was also thinking of potential use cases of this in
Most other functions just don't make sense in this context, and a class that built on top of this, All that to say that I think I agree with you, but I also think this could live somewhere and be useful. |
Totally agreed with it living somewhere and being useful, it's just a different unit of observation, like the table below (from strapgod, which, is it really already on CRAN?!) is a different "set" from the original iris records.
iris %>%
group_by(Species) %>%
bootstrapify(10) %>%
collect()
#> # A tibble: 1,500 x 7
#> # Groups: Species, .strap [30]
#> .strap .id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 id_1 48 4.6 3.2 1.4 0.2 setosa
#> 2 id_1 8 5 3.4 1.5 0.2 setosa
#> 3 id_1 39 4.4 3 1.3 0.2 setosa
#> 4 id_1 45 5.1 3.8 1.9 0.4 setosa
#> 5 id_1 31 4.8 3.1 1.6 0.2 setosa
#> 6 id_1 31 4.8 3.1 1.6 0.2 setosa
#> 7 id_1 50 5 3.3 1.4 0.2 setosa
#> 8 id_1 2 4.9 3 1.4 0.2 setosa
#> 9 id_1 30 4.7 3.2 1.6 0.2 setosa
#> 10 id_1 18 5.1 3.5 1.4 0.3 setosa
What about rsample? |
Do we need to remove the check? Can't you make a different subclass? |
No no, the default
Yes, I think it would go here if that is what you are asking? Also,
I'm currently doing: I don't think I can do: When I call |
Other use cases would be lazy Maybe we could define a new
|
@krlmlr I don't think it's necessary to create a completely new infrastructure at this point. @DavisVaughan does the check occur anywhere other than in the constructor? |
No, it looks like it only occurs in the grouped df constructor with signature
Here is the very first reference to why the check was included:
Here is the test for that check:
dplyr/tests/testthat/test-filter.r
Line 172 in 53ed25b
|
What about doing the check if the classes are exactly: That check only happens in the
for (int i = 0; i < ng; i++) rows_in_groups += Rf_length(idx[i]);
if (data_.nrows() != rows_in_groups) {
  bad_arg(".data", "is a corrupt grouped_df, contains {rows} rows, and {group_rows} rows in groups",
          _["rows"] = data_.nrows(), _["group_rows"] = rows_in_groups);
}
It's kind of already expensive anyway, and is not enough at all to test if the tibble is indeed a correct grouped tibble in the sense of each row should belong to only one group, and all the rows are in a group, which would be even more expensive. |
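To make the difference between the two notions of validity concrete, here is a minimal base R sketch. The helper names `count_check()` and `partition_check()` are hypothetical and not part of dplyr; `rows` is assumed to be a list of integer vectors in the style of the `.rows` column of `group_data()`, and `n` the number of rows in the data.

```r
# Hypothetical helpers for illustration only; neither exists in dplyr.

# R equivalent of the count check quoted above: compares totals only.
count_check <- function(rows, n) {
  sum(lengths(rows)) == n
}

# Stricter check: every row index 1..n appears in exactly one group.
partition_check <- function(rows, n) {
  idx <- unlist(rows, use.names = FALSE)
  length(idx) == n && identical(sort(as.integer(idx)), seq_len(n))
}

ok  <- list(1:2, 3:4)
bad <- list(1:2, 2:3)  # row 2 is in two groups, row 4 is in none

count_check(ok, 4L)       # TRUE
partition_check(ok, 4L)   # TRUE
count_check(bad, 4L)      # TRUE: the totals still match
partition_check(bad, 4L)  # FALSE
```

The stricter check has to sort (or otherwise tabulate) all indices, which is the extra cost mentioned above.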
If it's not accurate, then I'm in favour of removing it. We should probably expose a low-level |
Here is an example of how you can create a "valid" grouped tibble that should really be corrupt. Notice how row 51 is in 2 groups, but the sum of the lengths of each group equals the number of rows in the data frame so the check doesn't pick it up.
library(dplyr)
iris_g <- as_tibble(iris) %>%
group_by(Species)
# setosa mean = 5.01
iris_g %>%
summarise(x = mean(Sepal.Length))
#> # A tibble: 3 x 2
#> Species x
#> <fct> <dbl>
#> 1 setosa 5.01
#> 2 versicolor 5.94
#> 3 virginica 6.59
g_data <- group_data(iris_g)
# row 51 overlaps with the versicolor rows
g_data$.rows[[1]] <- 2:51
g_data$.rows[[1]]
#> [1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#> [24] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
#> [47] 48 49 50 51
g_data$.rows[[2]]
#> [1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
#> [18] 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84
#> [35] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
attr(iris_g, "groups") <- g_data
# setosa mean = 5.04
iris_g %>%
summarise(x = mean(Sepal.Length))
#> # A tibble: 3 x 2
#> Species x
#> <fct> <dbl>
#> 1 setosa 5.04
#> 2 versicolor 5.94
#> 3 virginica 6.59
Created on 2018-09-19 by the reprex |
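As a rough illustration of why the count check passes here, the following base R sketch works directly on the row indices from the reprex. The third group (`101:150`, the virginica rows) is not printed in the reprex and is assumed to be unchanged; the variable names are chosen for this sketch only.

```r
# Group row indices after the modification in the reprex above.
# The third element (virginica) is assumed to be untouched.
rows <- list(2:51, 51:100, 101:150)
n <- 150L  # nrow(iris)

idx <- unlist(rows)

# The totals agree (50 + 50 + 50 = 150), so the count check is satisfied.
count_ok <- sum(lengths(rows)) == n

# Rows assigned to more than one group, and rows assigned to none.
dup_rows     <- unique(idx[duplicated(idx)])
missing_rows <- setdiff(seq_len(n), idx)

count_ok      # TRUE
dup_rows      # 51
missing_rows  # 1
```

One duplicated row and one missing row cancel out in the total, which is why the setosa mean changes without any error being raised.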
Let's remove the test for now, adding a couple of lines to the documentation stating what we expect the user of this function to do. |
This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/ |
There are valid cases in which a single row can belong to multiple "virtual" groups. One such case is creating virtual groups to define bootstraps. This can result in large performance increases for operations such as summarise() or do() when compared to repeated subsetting. It also lends itself to elegant pipelines such as:
Currently, a grouped data frame check prevents this from being useful. The code linked below checks to see if the number of rows in the data is equal to the sum of the lengths of the group indices.
dplyr/src/group_indices.cpp
Line 534 in 4803fb5
It would be great if a conversation could be had about either altering this check or removing it altogether.
This check has been removed in a branch I created:
dplyr/virtual-bootstrap-groups
An example package using that branch has also been created to demonstrate the usefulness of this idea applied to bootstraps:
strapgod
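The idea of virtual bootstrap groups can be sketched in base R without dplyr or strapgod. This is only an illustration of the concept, not how either package implements it; the seed, the number of resamples, and the variable names are arbitrary choices for this sketch.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

x <- iris$Sepal.Length[iris$Species == "setosa"]
n_boot <- 10L

# Each virtual group is a vector of row indices drawn with replacement,
# so a row may appear in several groups, or several times within one group.
boot_rows <- replicate(
  n_boot,
  sample.int(length(x), replace = TRUE),
  simplify = FALSE
)

# Summarise by indexing into the original data; no data frame of
# resampled rows is ever built or kept.
boot_means <- vapply(boot_rows, function(i) mean(x[i]), numeric(1))

# The indices cover 10 * 50 = 500 positions, while the data has 50 rows.
# This mismatch is what the count check in the constructor rejects.
sum(lengths(boot_rows))
length(x)
```

Only the index vectors grow with the number of resamples; the data itself is stored once.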
See also:
#14