Relax grouped data frame constraint concerning number of rows in groups vs data #3837

DavisVaughan · 2018-09-18T19:09:08Z

There are valid cases in which a single row can belong to multiple "virtual" groups. One such case is creating virtual groups to define bootstraps. This can result in large performance increases for operations such as summarise() or do() when compared to repeated subsetting. It also lends itself to elegant pipelines such as:

iris %>%
  group_by(Species) %>%
  bootstrapify(10) %>%
  summarise(per_strap_species_mean = mean(Petal.Width))
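
The idea of "virtual" groups can be illustrated without strapgod or dplyr. The following is a minimal base R sketch, not the package's implementation: the names `species_rows`, `virtual_groups`, and `n_straps` are made up for this example. It stores only integer row indices for each bootstrap and summarises by indexing into the original column, so the data itself is never copied.

```r
# Sketch only: a base R stand-in for "virtual" bootstrap groups.
# All object names here are hypothetical and chosen for illustration.
set.seed(123)
n_straps <- 10

# Row indices of each Species group in the original data
species_rows <- split(seq_len(nrow(iris)), iris$Species)

# For every species, draw `n_straps` resamples (with replacement) of its rows.
# Only integer indices are stored; the iris rows themselves are never copied.
virtual_groups <- lapply(species_rows, function(rows) {
  replicate(n_straps, sample(rows, replace = TRUE), simplify = FALSE)
})

# Summarise each virtual group by indexing into the original column
per_strap_species_mean <- lapply(virtual_groups, function(straps) {
  vapply(straps, function(idx) mean(iris$Petal.Width[idx]), numeric(1))
})

# The group indices now cover far more rows than the data holds
total_group_rows <- sum(lengths(unlist(virtual_groups, recursive = FALSE)))
total_group_rows  # 1500, versus nrow(iris) == 150
```

Because the 30 index vectors together reference 1500 rows while the data has only 150, any check that requires the two counts to match will reject this structure.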

Currently, a grouped data frame check prevents this from being useful. The code linked below checks whether the number of rows in the data equals the sum of the lengths of the group indices.

dplyr/src/group_indices.cpp, line 534 in 4803fb5:

bad_arg(".data", "is a corrupt grouped_df, contains {rows} rows, and {group_rows} rows in groups",

It would be great if a conversation could be had about either altering this check or removing it altogether.

This check has been removed in a branch I created:
dplyr/virtual-bootstrap-groups

An example package using that branch has also been created to demonstrate the usefulness of this idea applied to bootstraps:
strapgod

See also:
#14


batpigandme · 2018-09-18T19:33:16Z

I think what you're describing is fundamentally different from the mental construct of dplyr::group_by(). Grouping by a variable in the tidy data paradigm essentially replaces the need for split, in many cases. If you think of it as a split, rather than as several samples or groups of constituent observations, the mutual exclusion of group membership is pretty foundational.

I get the utility of multiple permutations, it just seems separate from group_by().

DavisVaughan · 2018-09-18T19:42:26Z

I was also thinking of potential use cases of this in dplyr, and while there are some that are really useful, there aren't that many.

summarise() for aggregating over bootstraps
do() for applying some data frame returning function/model to bootstraps (think tidy()) (the rsample+map combination does this too)

Most other functions just don't make sense in this context, and a class that built on top of this, bootstrapped_df or similar, would have to guard against their use. The main benefit is just the large improvement in speed / memory reduction + nice readability of these 2 cases.

All that to say that I think I agree with you, but I also think this could live somewhere and be useful.

batpigandme · 2018-09-18T19:48:22Z

Totally agreed with it living somewhere and being useful; it's just a different unit of observation. For example, the table below (from strapgod, which, is it really already on CRAN?!) is a different "set" from the original iris records.

iris %>%
  group_by(Species) %>%
  bootstrapify(10) %>%
  collect()
#> # A tibble: 1,500 x 7
#> # Groups:   Species, .strap [30]
#>    .strap   .id Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>    <chr>  <int>        <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1 id_1      48          4.6         3.2          1.4         0.2 setosa 
#>  2 id_1       8          5           3.4          1.5         0.2 setosa 
#>  3 id_1      39          4.4         3            1.3         0.2 setosa 
#>  4 id_1      45          5.1         3.8          1.9         0.4 setosa 
#>  5 id_1      31          4.8         3.1          1.6         0.2 setosa 
#>  6 id_1      31          4.8         3.1          1.6         0.2 setosa 
#>  7 id_1      50          5           3.3          1.4         0.2 setosa 
#>  8 id_1       2          4.9         3            1.4         0.2 setosa 
#>  9 id_1      30          4.7         3.2          1.6         0.2 setosa 
#> 10 id_1      18          5.1         3.5          1.4         0.3 setosa

What about rsample?

hadley · 2018-09-18T19:48:44Z

Do we need to remove the check? Can't you make a different subclass?

DavisVaughan · 2018-09-18T20:07:40Z

> is it really already on CRAN?!

No no, the default usethis::use_rmd() Rmd has that "install from CRAN" line, but I added "# no you cannot" to the code block.

> What about rsample?

Yes, I think it would go here if that is what you are asking? Also, rsample has some bootstrap infrastructure but it works differently.

> Can't you make a different subclass?

I'm currently doing:
c("bootstrap_df", "grouped_df", "tbl_df", "data.frame")

I don't think I can do:
c("bootstrap_df", "tbl_df", "data.frame")

When I call summarise() I need to use the grouped_df version of it, but dispatch happens at the C++ level, so unless my subclass inherits from grouped_df there is no way to use it. The check happens in the call to summarise() and in the print() method.
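
The effect of including or omitting "grouped_df" in the class vector can be seen with plain S3 dispatch. This is only an R-level analogy under stated assumptions: `describe()` and its methods are hypothetical and stand in for a verb such as summarise(), and dplyr's own dispatch (partly done in C++ at the time) is not reproduced here.

```r
# Hypothetical generic and methods, used only to illustrate S3 dispatch order.
describe <- function(x, ...) UseMethod("describe")
describe.grouped_df <- function(x, ...) "grouped_df method"
describe.data.frame <- function(x, ...) "data.frame method"

# Subclass that keeps grouped_df in its class vector
with_grouped <- structure(
  data.frame(x = 1:3),
  class = c("bootstrap_df", "grouped_df", "tbl_df", "data.frame")
)

# Subclass that drops grouped_df
without_grouped <- structure(
  data.frame(x = 1:3),
  class = c("bootstrap_df", "tbl_df", "data.frame")
)

inherits(with_grouped, "grouped_df")     # TRUE
inherits(without_grouped, "grouped_df")  # FALSE

describe(with_grouped)     # "grouped_df method"
describe(without_grouped)  # "data.frame method"
```

Without "grouped_df" in the class vector, the grouped method is never reached unless the subclass supplies its own method for every verb it wants to support.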

krlmlr · 2018-09-18T20:21:25Z

Other use cases would be lazy filter() and arrange(), and a mutate_when() which updates only rows that match a predicate.

Maybe we could define a new indirect_df class with a proper interface, where the existing grouped_df would be a user? Operations could be:

add_indirection()
set_indirections()
get_indirection(i)
get_indirections()
[
dplyr basic manipulation
nest()

hadley · 2018-09-19T07:24:26Z

@krlmlr I don't think it's necessary to create a completely new infrastructure at this point.

@DavisVaughan does the check occur anywhere other than in the constructor?

DavisVaughan · 2018-09-19T12:17:18Z

No, it looks like it only occurs in the grouped df constructor with signature GroupedDataFrame::GroupedDataFrame(DataFrame x).

Here is the very first reference to why the check was included:
#606

Here is the test for that check:

dplyr/tests/testthat/test-filter.r, line 172 in 53ed25b:

test_that("GroupedDataFrame checks consistency of data (#606)", {

romainfrancois · 2018-09-19T12:56:34Z

What about doing the check only if the classes are exactly c("grouped_df", "tbl_df", "data.frame")?

That check only happens in the GroupedDataFrame(SEXP) constructor:

for (int i = 0; i < ng; i++) rows_in_groups += Rf_length(idx[i]);
if (data_.nrows() != rows_in_groups) {
  bad_arg(".data", "is a corrupt grouped_df, contains {rows} rows, and {group_rows} rows in groups",
          _["rows"] = data_.nrows(), _["group_rows"] = rows_in_groups);
}

The check is already somewhat expensive, and it is not nearly enough to test whether the tibble is a correct grouped tibble, in the sense that each row belongs to exactly one group and every row is in a group. Testing that properly would be even more expensive.

hadley · 2018-09-19T15:45:27Z

If it's not accurate, then I'm in favour of removing it.

We should probably expose a low-level new_grouped_df() constructor where we document our expectations, and say that if you violate them you'll need to provide alternative methods of mutate(), arrange(), etc.

DavisVaughan · 2018-09-19T15:55:08Z

> is not enough at all to test if the tibble is indeed a correct grouped tibble

Here is an example of how you can create a "valid" grouped tibble that should really be corrupt. Notice how row 51 is in 2 groups, but the sum of the lengths of each group equals the number of rows in the data frame so the check doesn't pick it up.

library(dplyr)

iris_g <- as_tibble(iris) %>%
  group_by(Species)

# setosa mean = 5.01
iris_g %>%
  summarise(x = mean(Sepal.Length))
#> # A tibble: 3 x 2
#>   Species        x
#>   <fct>      <dbl>
#> 1 setosa      5.01
#> 2 versicolor  5.94
#> 3 virginica   6.59

g_data <- group_data(iris_g)

# row 51 overlaps with the versicolor rows
g_data$.rows[[1]] <- 2:51
g_data$.rows[[1]]
#>  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#> [24] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
#> [47] 48 49 50 51

g_data$.rows[[2]]
#>  [1]  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67
#> [18]  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84
#> [35]  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

attr(iris_g, "groups") <- g_data

# setosa mean = 5.04
iris_g %>%
  summarise(x = mean(Sepal.Length))
#> # A tibble: 3 x 2
#>   Species        x
#>   <fct>      <dbl>
#> 1 setosa      5.04
#> 2 versicolor  5.94
#> 3 virginica   6.59

Created on 2018-09-19 by the reprex package (v0.2.0).
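
For comparison, a stricter check that would catch the counterexample above can be sketched in base R. Both helper functions are hypothetical and are not part of dplyr; the first mirrors the existing sum-of-lengths test, and the second verifies that every row index appears in exactly one group, at the cost of a sort over all indices.

```r
# Hypothetical helpers for illustration; neither is part of dplyr.

# Mirrors the existing check: total length of the group indices equals nrow
passes_sum_check <- function(rows, n) {
  sum(lengths(rows)) == n
}

# Stricter: every row index 1..n appears in exactly one group
is_exact_partition <- function(rows, n) {
  all_idx <- unlist(rows, use.names = FALSE)
  length(all_idx) == n && identical(sort(as.integer(all_idx)), seq_len(n))
}

n <- 150L
valid   <- list(1:50, 51:100, 101:150)
overlap <- list(2:51, 51:100, 101:150)  # row 1 is missing, row 51 appears twice

passes_sum_check(valid, n)      # TRUE
passes_sum_check(overlap, n)    # TRUE: the overlap goes unnoticed
is_exact_partition(valid, n)    # TRUE
is_exact_partition(overlap, n)  # FALSE
```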

hadley · 2018-10-01T19:08:23Z

Let's remove the test for now, adding a couple of lines to the documentation stating what we expect the user of this function to do.

closes #3837

lock · 2019-04-08T07:44:31Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

hadley added this to the 0.8.0 milestone Oct 1, 2018

krlmlr added the documentation label Oct 4, 2018

romainfrancois added a commit that referenced this issue Oct 8, 2018

➕ new_grouped_df() function.

ee066da

closes #3837

romainfrancois added a commit that referenced this issue Oct 8, 2018

➕ new_grouped_df() function.

94ca765

closes #3837

romainfrancois added a commit that referenced this issue Oct 8, 2018

➕ new_grouped_df() function.

d0addc6

closes #3837

romainfrancois closed this as completed in 12202b0 Oct 10, 2018

lock bot locked and limited conversation to collaborators Apr 8, 2019
