Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax grouped data frame constraint concerning number of rows in groups vs data #3837

Closed
DavisVaughan opened this issue Sep 18, 2018 · 13 comments
Milestone

Comments

@DavisVaughan
Copy link
Member

There are valid cases in which a single row can belong to multiple "virtual" groups. One such case is creating virtual groups to define bootstraps. This can result in large performance increases for operations such as summarise() or do() when compared to repeated subsetting. It also lends itself to elegant pipelines such as:

iris %>%
  group_by(Species) %>%
  bootstrapify(10) %>%
  summarise(per_strap_species_mean = mean(Petal.Width))

Currently, a grouped data frame check prevents this from being useful. The code linked below checks to see if the number of rows in the data is equal to the sum of the lengths of the group indices.

bad_arg(".data", "is a corrupt grouped_df, contains {rows} rows, and {group_rows} rows in groups",

It would be great if a conversation could be had about either altering this check or removing it altogether.

This check has been removed in a branch I created:
dplyr/virtual-bootstrap-groups

An example package using that branch has also been created to demonstrate the usefulness of this idea applied to bootstraps:
strapgod

See also:
#14

@batpigandme
Copy link
Contributor

I think what you're describing is fundamentally different from the mental construct of dplyr::group_by(). Grouping by a variable in the tidy data paradigm essentially replaces the need for split, in many cases. Thinking of it as a split, as opposed to several samples of, or groups of constituent observations, the mutual-exclusion of group membership is pretty foundational.

I get the utility of multiple permutations, it just seems separate from group_by().

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Sep 18, 2018

I was also thinking of potential use cases of this in dplyr, and while there are some that are really useful, there aren't that many.

  • summarise() for aggregating over bootstraps
  • do() for applying some data frame returning function/model to bootstraps (think tidy()) (the rsample+map combination does this too)

Most other functions just don't make sense in this context, and a class that built on top of this, bootstrapped_df or similar, would have to guard against their use. The main benefit is just the large improvement in speed / memory reduction + nice readability of these 2 cases.

All that to say that I think I agree with you, but I also think this could live somewhere and be useful.

@batpigandme
Copy link
Contributor

batpigandme commented Sep 18, 2018

Totally agreed with it living somewhere and being useful, it's just a different unit of observation— like the table below (from strapgod, which, is it really already on CRAN?!) is a different "set" from the original iris records.

iris %>%
  group_by(Species) %>%
  bootstrapify(10) %>%
  collect()
#> # A tibble: 1,500 x 7
#> # Groups:   Species, .strap [30]
#>    .strap   .id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>    <chr>  <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1 id_1      48          4.6         3.2          1.4         0.2 setosa 
#>  2 id_1       8          5           3.4          1.5         0.2 setosa 
#>  3 id_1      39          4.4         3            1.3         0.2 setosa 
#>  4 id_1      45          5.1         3.8          1.9         0.4 setosa 
#>  5 id_1      31          4.8         3.1          1.6         0.2 setosa 
#>  6 id_1      31          4.8         3.1          1.6         0.2 setosa 
#>  7 id_1      50          5           3.3          1.4         0.2 setosa 
#>  8 id_1       2          4.9         3            1.4         0.2 setosa 
#>  9 id_1      30          4.7         3.2          1.6         0.2 setosa 
#> 10 id_1      18          5.1         3.5          1.4         0.3 setosa 

What about rsample?

@hadley
Copy link
Member

hadley commented Sep 18, 2018

Do we need to remove the check? Can't you make a different subclass?

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Sep 18, 2018

is it really already on CRAN?!

No no, the default usethis::use_rmd() rmd has that "install from cran" line but I added # no you cannot into the code block

What about rsample?

Yes, I think it would go here if that is what you are asking? Also, rsample has some bootstrap infrastructure but it works differently.

Can't you make a different subclass?

I'm currently doing:
c("bootstrap_df", "grouped_df", "tbl_df", "data.frame")

I don't think I can do:
c("bootstrap_df", "tbl_df", "data.frame")

When I call summarise() I need to use the grouped_df version of it, but dispatch happens at the cpp level so unless my subclass inherits from grouped_df there is no way to use it. The check happens in the call to summarise()and in the print() method.

@krlmlr
Copy link
Member

krlmlr commented Sep 18, 2018

Other use cases would be lazy filter() and arrange(), and a mutate_when() which updates only rows that match a predicate.

Maybe we could define a new indirect_df class with a proper interface, where the existing grouped_df would be a user? Operations could be:

  • add_indirection()
  • set_indirections()
  • get_indirection(i)
  • get_indirections()
  • [
  • dplyr basic manipulation
  • nest()

@hadley
Copy link
Member

hadley commented Sep 19, 2018

@krlmlr I don't think it's necessary to create a completely new infrastructure at this point.

@DavisVaughan does the check occur anywhere other than in the constructor?

@DavisVaughan
Copy link
Member Author

No, it looks like it only occurs in the grouped df constructor with signature GroupedDataFrame::GroupedDataFrame(DataFrame x).

Here is the very first reference to why the check was included:
#606

Here is the test for that check:

test_that("GroupedDataFrame checks consistency of data (#606)", {

@romainfrancois
Copy link
Member

What about doing the check if the classes are exactly: c("grouped_df", "tbl_df", "data.frame").

That check only happens in the GroupedDataFrame(SEXP) constructor :

for (int i = 0; i < ng; i++) rows_in_groups += Rf_length(idx[i]);
  if (data_.nrows() != rows_in_groups) {
    bad_arg(".data", "is a corrupt grouped_df, contains {rows} rows, and {group_rows} rows in groups",
            _["rows"] = data_.nrows(), _["group_rows"] = rows_in_groups);
  }

It's kind of already expensive anyway, and is not enough at all to test if the tibble is indeed a correct grouped tibble in the sense of each row should belong to only one group, and all the rows are in a group, which would be even more expensive.

@hadley
Copy link
Member

hadley commented Sep 19, 2018

If it's not accurate, then I'm in favour of removing it.

We should probably expose a low-level new_grouped_df() constructer where we document our expectations, and say that if you violate them you'll need to provide alternative methods of mutate(), arrange() etc

@DavisVaughan
Copy link
Member Author

is not enough at all to test if the tibble is indeed a correct grouped tibble

Here is an example of how you can create a "valid" grouped tibble that should really be corrupt. Notice how row 51 is in 2 groups, but the sum of the lengths of each group equals the number of rows in the data frame so the check doesn't pick it up.

library(dplyr)

iris_g <- as_tibble(iris) %>%
  group_by(Species)

# setosa mean = 5.01
iris_g %>%
  summarise(x = mean(Sepal.Length))
#> # A tibble: 3 x 2
#>   Species        x
#>   <fct>      <dbl>
#> 1 setosa      5.01
#> 2 versicolor  5.94
#> 3 virginica   6.59

g_data <- group_data(iris_g)

# row 51 overlaps with the versicolor rows
g_data$.rows[[1]] <- 2:51
g_data$.rows[[1]]
#>  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#> [24] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
#> [47] 48 49 50 51

g_data$.rows[[2]]
#>  [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67
#> [18]  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84
#> [35]  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

attr(iris_g, "groups") <- g_data

# setosa mean = 5.04
iris_g %>%
  summarise(x = mean(Sepal.Length))
#> # A tibble: 3 x 2
#>   Species        x
#>   <fct>      <dbl>
#> 1 setosa      5.04
#> 2 versicolor  5.94
#> 3 virginica   6.59

Created on 2018-09-19 by the reprex
package
(v0.2.0).

@hadley hadley added this to the 0.8.0 milestone Oct 1, 2018
@hadley
Copy link
Member

hadley commented Oct 1, 2018

Let's remove the test for now, adding a couple of lines to the documentation stating what we expect the user of this function to do.

romainfrancois added a commit that referenced this issue Oct 8, 2018
romainfrancois added a commit that referenced this issue Oct 8, 2018
romainfrancois added a commit that referenced this issue Oct 8, 2018
@lock
Copy link

lock bot commented Apr 8, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Apr 8, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants