Provide rbind solution that can add list element names as a variable in the output #22

Closed
jennybc opened this Issue Aug 22, 2014 · 13 comments

jennybc commented Aug 22, 2014

Problem: you have a list of data.frames and the element names convey information. You want to row bind them together and, in the new data.frame, you want a variable for the list element each observation originated in.

[Image: 2014-08-22_rbind-and-store-as var]

Demo: fragment subset of iris into separate data.frames, stored as list.
Note: Species info carried only via list names

my_list <- lapply(split(subset(iris, select = -Species),
                        iris$Species), "[", 1:2, )

Simple rbind-y calls cannot recover Species:

do.call("rbind", my_list) # rownames have never looked so good ...
##               Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa.1               5.1         3.5          1.4         0.2
## setosa.2               4.9         3.0          1.4         0.2
## versicolor.51          7.0         3.2          4.7         1.4
## versicolor.52          6.4         3.2          4.5         1.5
## virginica.101          6.3         3.3          6.0         2.5
## virginica.102          5.8         2.7          5.1         1.9
dplyr::rbind_all(my_list)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          7.0         3.2          4.7         1.4
## 4          6.4         3.2          4.5         1.5
## 5          6.3         3.3          6.0         2.5
## 6          5.8         2.7          5.1         1.9

Current workaround: prep with mapply() to restore Species, then rbind (thanks @kara_woo for this snippet)

my_list2 <-
  mapply(`[<-`, my_list, 'Species', value = names(my_list), SIMPLIFY = FALSE)
dplyr::rbind_all(my_list2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1          5.1         3.5          1.4         0.2     setosa
## 2          4.9         3.0          1.4         0.2     setosa
## 3          7.0         3.2          4.7         1.4 versicolor
## 4          6.4         3.2          4.5         1.5 versicolor
## 5          6.3         3.3          6.0         2.5  virginica
## 6          5.8         2.7          5.1         1.9  virginica
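
For reference, the same tagging step can also be done in base R alone. This is a minimal sketch rather than the approach the thread settles on: `tagged` and `combined` are illustrative names, and it assumes every list element is a data.frame with the same columns.

```r
# Rebuild the demo list: two rows per species, Species column dropped
my_list <- lapply(split(subset(iris, select = -Species),
                        iris$Species), "[", 1:2, )

# Add each element's name to its data.frame as a Species column
tagged <- Map(function(df, nm) cbind(df, Species = nm),
              my_list, names(my_list))

# Row bind the pieces and drop the "setosa.1"-style row names
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
combined
```

Map() pairs each data.frame with its own name, so nothing depends on the list order matching a separately built vector.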

hadley closed this in b44eeb6 Aug 22, 2014

hadley commented Aug 22, 2014

No documentation yet, but unnest() should do now what you want.

I'm still unsure whether unnest() should work on both plain lists and list-columns of data frames. That might be confusing, but it's basically the same behaviour, albeit with slightly different output.

jennybc commented Aug 23, 2014

Thanks, yes, unnest() does it!

As a happy user of the grammar that runs through your packages, the words this task evokes for me are rbind and/or gather. Maybe unnest will seem more natural when I've worked with it more.

Morally, this operation seems like gathering variables into key-value pairs, with really different mechanics. Instead of different levels of a factor represented as separate variables in a data.frame, we've got them represented as separate data.frames.

hadley commented Aug 23, 2014

I think it'll be more obvious why it's called unnest() once I sketch out the other pieces, and when you see what nest() does - they're fundamentally about lists of data frames and vectors, where spread() and gather() deal with columns.
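
To make the nest()/unnest() pairing concrete, here is a small round trip. It is a sketch that assumes tidyr 1.0.0 or later, whose interface postdates this comment, so it may not match what was planned at the time. The names `nested` and `flat` are illustrative.

```r
library(tidyr)

# nest(): collapse every column except Species into a list-column of
# data frames, giving one row per species
nested <- nest(iris, data = -Species)

# unnest(): expand that list-column back into ordinary rows and columns
flat <- unnest(nested, cols = data)

dim(nested)
dim(flat)
```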

jennybc commented Aug 23, 2014

I have every faith that it will be the most natural thing when you're done. :)

dwinsemius commented Sep 20, 2014

This is what I would imagine to be the outcome of do.call( rbind.fill, df_list) but that's not what the code actually uses. It's an S3 method that seems to be first an rbind operation for the unmatched columns followed by (v)applying append column-wise matched on the basis of names. I wasn't really sure how the append_df would succeed, since it looked just like base::append with some attribute management (to handle Dates, datetimes, and factors presumably) but there is no append.data.frame. I was expecting some lapply(list, append_df), but it appears to be succeeding nonetheless, so probably it's just my confusion.

hadley commented May 18, 2015

FWIW, I've removed this experimental method because I'm now pretty sure it's a bad idea (and dplyr::bind_rows() should do the equivalent in the next version)

hadley commented May 18, 2015

Hmmm, but maybe unnest() needs to handle the case where a column of the data frame is a list of data frames:

data_frame(
  x = c(1, 2),
  y = c(3, 4),
  z = list(data_frame(a = 1), data_frame(a = 1:3))
)

(and those data frames could contain list-columns themselves, but I think you'd need a second unnest to handle that)

I'm not certain whether or not this is useful (it might crop up as an alternative way of handling relational data), but it is interesting.
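
For readers arriving later, this is roughly how that case looks with current packages. It is a sketch assuming tibble and tidyr 1.0.0 or later, with tibble() standing in for data_frame(). `df` and `flat` are illustrative names.

```r
library(tibble)
library(tidyr)

# A data frame whose z column is a list of data frames
df <- tibble(
  x = c(1, 2),
  y = c(3, 4),
  z = list(tibble(a = 1), tibble(a = 1:3))
)

# Expand z: each outer row is repeated once per row of its inner data frame
flat <- unnest(df, cols = z)
flat
```

The first outer row contributes one row and the second contributes three, so `flat` has four rows with columns x, y and a.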

hadley commented May 18, 2015

Oh that's #58

dlebauer commented Jan 6, 2017

So ... in case anyone else finds this while searching for how to add a new column with bind_rows (or rbind_list or do.call(rbind, ...)) without inelegant convulsions, I'll finish out the example to demonstrate:

By "dplyr::bind_rows() should do the equivalent in the next version", what I gather is that (using dplyr 0.5.0) the .id argument can be used to specify the new column name, so following from the example in #22 (comment):

bind_rows(my_list, .id = 'species')

returns

     species Sepal.Length Sepal.Width Petal.Length Petal.Width
1     setosa          5.1         3.5          1.4         0.2
2     setosa          4.9         3.0          1.4         0.2
3 versicolor          7.0         3.2          4.7         1.4
4 versicolor          6.4         3.2          4.5         1.5
5  virginica          6.3         3.3          6.0         2.5
6  virginica          5.8         2.7          5.1         1.9

P.S. thanks again for all of the great work you and your team are doing!

d8aninja commented Sep 28, 2017

Is there a solution for the cases where bind_rows will not accomplish this task because the constituent lists are of different lengths? In these cases I usually fall back on do.call(rbind, df), but there is obviously no .id argument to go along with that solution...
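
The question leaves the exact failing case open. If it refers to data frames with different sets of columns, one reading among several, then bind_rows() fills the missing columns with NA and still accepts .id. A minimal sketch under that assumption, with made-up data and the illustrative names `df_list` and `out`:

```r
library(dplyr)

# Two data frames with different columns: "b" has no y column
df_list <- list(
  a = data.frame(x = 1:2, y = c("p", "q"), stringsAsFactors = FALSE),
  b = data.frame(x = 3L)
)

# .id stores each element's name; the missing y values become NA
out <- bind_rows(df_list, .id = "source")
out
```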

jennybc commented Sep 28, 2017

@d8aninja Do you want to make a little example that shows exactly what you mean (I'm not entirely sure) and ask it over in the tidyverse section of community.rstudio.com? This question is a good fit.

d8aninja commented Oct 3, 2017

@jennybc will do, thanks!

dwinsemius commented Oct 12, 2017
