rbind giving errors with filter #606

jrvianna opened this issue Sep 18, 2014 · 12 comments
bug an unexpected problem or unintended behavior


Joining two data frames with rbind sometimes produces an object that looks correct but has an inconsistent internal structure, which then gives wrong results or errors when passed to filter. More details are in the code and comments below:

df1 <- data.frame(
   group = factor(rep(c("C", "G"), 5)),
   value = 1:10)
df1 <- df1 %>% group_by(group) #df1 is now tbl
df2 <- data.frame(
   group = factor(rep("G", 10)),
   value = 11:20)
df3 <- rbind(df1, df2) #df2 is data.frame
df3 %>% filter(group == "C") #returns filtered rows in df1 and all rows of df2
Source: local data frame [15 x 2]
Groups: group

  group value
1      C     1
2      C     3
3      C     5
4      C     7
5      C     9
6      G    11
7      G    12
8      G    13
9      G    14
10     G    15
11     G    16
12     G    17
13     G    18
14     G    19
15     G    20
@romainfrancois romainfrancois self-assigned this Sep 21, 2014
@romainfrancois romainfrancois added the bug an unexpected problem or unintended behavior label Sep 21, 2014
@romainfrancois romainfrancois added this to the 0.3 milestone Sep 21, 2014
rbind keeps attributes from the first data frame for some reason:

> str(df3)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of  2 variables:
 $ group: Factor w/ 2 levels "C","G": 1 2 1 2 1 2 1 2 1 2 ...
 $ value: int  1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "vars")=List of 1
  ..$ : symbol group
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 2
  ..$ : int  0 2 4 6 8
  ..$ : int  1 3 5 7 9
 - attr(*, "group_sizes")= int  5 5
 - attr(*, "biggest_group_size")= int 5
 - attr(*, "labels")='data.frame':  2 obs. of  1 variable:
  ..$ group: Factor w/ 2 levels "C","G": 1 2
  ..- attr(*, "vars")=List of 1
  .. ..$ : symbol group

I also tried to add a rbind.tbl_df method using rbind_list, but I must be dumb or something bc this is what I get:

> rbind.tbl_df <- function(...) rbind_list(...)
> rbind(df1, df2)
    group     value
df1 factor,10 Integer,10
df2 factor,10 Integer,10


Anyway, back to the original problem, one thing I could do is somehow check consistency between the attributes of a grouped data frame and the data frame itself, i.e.

> sum( attr(df3, "group_sizes") )
[1] 10
> nrow(df3)
[1] 20

Those are different, so df3 is not a valid grouped_df. However, enforcing this check might get in the way of other forms of grouping, because it assumes that every row of a grouped_df belongs to one and only one group. @hadley ?

This also has undesired effects for other verbs, e.g.:

> df3 %>% mutate( value2 = value + 1 )
Source: local data frame [20 x 3]
Groups: group

   group value         value2
1      C     1   2.000000e+00
2      G     2   3.000000e+00
3      C     3   4.000000e+00
4      G     4   5.000000e+00
5      C     5   6.000000e+00
6      G     6   7.000000e+00
7      C     7   8.000000e+00
8      G     8   9.000000e+00
9      C     9   1.000000e+01
10     G    10   1.100000e+01
11     G    11 -1.218652e-280
12     G    12 -5.964388e-181
13     G    13   1.397168e-78
14     G    14 -3.526465e+102
15     G    15   7.976847e-01
16     G    16   2.036815e-71
17     G    17  8.527626e-249
18     G    18 -3.508129e+277
19     G    19   8.292691e-88
20     G    20  6.934274e-310

I think the problem is that I'm implicitly assuming what I described above: in a grouped_df, every row is in one and only one group.

So I could either:

  • assert that early, i.e. whenever I create a GroupedDataFrame
  • fix it. This would change the implementation of mutate, where we could no longer rely on a shallow copy of the columns unless we were sure that the assumption holds.

The next problem is that asserting the assumption might be expensive. Checking group_sizes against the number of rows is cheap, but that alone would not be enough.
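To make the difference between the cheap check and the full check concrete, here is a minimal sketch in base R. The helpers `sizes_match()` and `rows_partitioned()` are hypothetical names, not part of dplyr, and the `indices` argument is assumed to be a list of 0-based row positions per group, as shown in the str(df3) output earlier in the thread.

```r
# Hypothetical helpers, not part of dplyr.
# `indices`: list of 0-based row positions, one element per group.

# Cheap check: the group sizes must add up to the number of rows.
sizes_match <- function(indices, n_rows) {
  sum(lengths(indices)) == n_rows
}

# Full check: every row 0..(n_rows - 1) appears in exactly one group.
rows_partitioned <- function(indices, n_rows) {
  all_rows <- as.integer(unlist(indices))
  length(all_rows) == n_rows &&
    identical(sort(all_rows), seq_len(n_rows) - 1L)
}

# Indices reported for df3 in the thread: 10 rows grouped, 20 rows present.
df3_indices <- list(c(0L, 2L, 4L, 6L, 8L), c(1L, 3L, 5L, 7L, 9L))
sizes_match(df3_indices, 20L)       # FALSE
rows_partitioned(df3_indices, 20L)  # FALSE

# Made-up case: sizes add up, but row 0 is listed twice and row 3 is missing.
bad <- list(c(0L, 1L), c(0L, 2L))
sizes_match(bad, 4L)       # TRUE
rows_partitioned(bad, 4L)  # FALSE
```

The full check has to look at every index (and sort them here), which is why it costs more than comparing two totals.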

So far, we've sort of worked under the assumption that we create the grouped_df object and therefore we know how to do that. The problem is that rbind creates an invalid grouped_df object.

Perhaps that is a documentation issue and we should not use rbind on a grouped_df in the first place; we have rbind_list anyway.
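As a rough illustration of avoiding rbind here, the sketch below drops the grouping before combining and regroups afterwards. It assumes a dplyr version that provides bind_rows(), which later releases offer in place of the rbind_list() mentioned above; it is one possible workaround, not necessarily what the maintainers recommend.

```r
library(dplyr)

df1 <- data.frame(
  group = factor(rep(c("C", "G"), 5)),
  value = 1:10) %>%
  group_by(group)
df2 <- data.frame(
  group = factor(rep("G", 10)),
  value = 11:20)

# Drop the stale grouping, combine, then regroup so that the grouping
# metadata is rebuilt from all 20 rows.
df3 <- bind_rows(ungroup(df1), df2) %>% group_by(group)

# Only the "C" rows from df1 should remain.
df3 %>% filter(group == "C")
```

Regrouping after the combine means the grouping attributes are computed from the final data rather than carried over from df1.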

Anyway, I'll hold off on this one until I get some guidance on what to do.

romainfrancois added a commit that referenced this issue Sep 22, 2014

  prevent some issues related to corrupt `grouped_df` objects as the one
  made by rbind (#606).

I've implemented the test in GroupedDataFrame.

if( !is_lazy ){
    // check consistency of the groups: the group sizes must add up
    // to the number of rows in the data
    int rows_in_groups = sum(group_sizes) ;
    if( data_.nrows() != rows_in_groups ){
        std::stringstream s ;
        s << "corrupt 'grouped_df', contains "
          << data_.nrows()
          << " rows, and "
          << rows_in_groups
          << " rows in groups" ;
        stop(s.str()) ;
    }
}

As said above, this does not guarantee that the grouped_df is fully coherent, but at least it rejects corrupt data such as the object produced by rbind in this case.

hadley commented Sep 22, 2014

I'll see if I can figure out how to make rbind() behave properly in this scenario.

@hadley hadley assigned hadley and unassigned romainfrancois Sep 22, 2014

Good luck with that; I'd be curious how this is done.

hadley commented Sep 22, 2014

Hmmm, ok, I guess it's unfixable due to the crazy dispatch that rbind() uses.

@arunsrinivasan do you do anything to fix this for data.table?

@hadley hadley closed this as completed Sep 22, 2014

@hadley FAQ 2.23 pretty much explains the issue and the current workaround data.table has...

hadley commented Sep 23, 2014

@arunsrinivasan and R CMD check lets you get away with that?

@hadley, that's what Matt, after exhausting all other options (as explained in the FAQ), has managed to do to get around this issue. It'd be great if someone comes up with a better fix...

R CMD check does not have to know ;)

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018