Better semantics for dataframeable objects #1676

Closed
lionel- opened this issue Mar 1, 2016 · 10 comments

@lionel- lionel- commented Mar 1, 2016

After thinking about tidyverse/purrr#179, I wonder if we could improve the semantics of dataframeable objects in bind_rows() by taking a vector viewpoint rather than a list viewpoint.

If we think about vectors, the current behaviour is not consistent because bind_rows() effectively binds columns together:

library(dplyr)

col_vectors <- list(
  a = c(1, 2),
  b = c(3, 4)
)

row_vectors <- list(
  c(a = 1, b = 3),
  c(a = 2, b = 4)
)
bind_rows(row_vectors)
#> Error: cannot convert object to a data frame

bind_rows(col_vectors)
#> Source: local data frame [2 x 2]
#>
#>       a     b
#>   (dbl) (dbl)
#> 1     1     3
#> 2     2     4

I think it'd be more intuitive to go the rbind() and cbind() way, and treat vectors differently depending on the direction of the binding. bind_rows() would require vectors with inner names and bind_cols() would require vectors with outer names.
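For comparison, here is a minimal base R sketch of the rbind()/cbind() behaviour referred to above. It uses no dplyr; the variable names by_row and by_col are only illustrative.

```r
# Base R only; no dplyr involved.
row_vectors <- list(
  c(a = 1, b = 3),
  c(a = 2, b = 4)
)
col_vectors <- list(
  a = c(1, 2),
  b = c(3, 4)
)

# rbind() takes column names from the names inside each vector (inner names).
by_row <- do.call(rbind, row_vectors)

# cbind() takes column names from the names of the list elements (outer names).
by_col <- do.call(cbind, col_vectors)

# Both are 2 x 2 matrices with columns "a" and "b" holding the same values.
print(by_row)
print(by_col)
```

The same data gives the same table either way; the only difference is where each function looks for the column names.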

Advantages:

  • Arguably more intuitive if we think about vectors rather than lists.
  • Better behaviour down the line, e.g. map_df()
  • This would be compatible with splicing the input as suggested in PR #992.

The last point would add a lot of flexibility. bind_cols() and bind_rows() would accept data frames, named vectors, and lists of data frames and named vectors interchangeably.

cc @jennybc

@jennybc jennybc commented Apr 19, 2016

Inspired by a recent real life thing that reminded me of this discussion.

library(dplyr)

x <- iris %>%
  select(-Species) %>% 
  lapply(summary)

I honestly didn't expect this to work. And I really didn't expect this result.

x %>%
  bind_rows()
#> Source: local data frame [6 x 4]
#> 
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width
#>          <dbl>       <dbl>        <dbl>       <dbl>
#> 1        4.300       2.000        1.000       0.100
#> 2        5.100       2.800        1.600       0.300
#> 3        5.800       3.000        4.350       1.300
#> 4        5.843       3.057        3.758       1.199
#> 5        6.400       3.300        5.100       1.800
#> 6        7.900       4.400        6.900       2.500

This is what I would expect row binding to produce.

x %>%
  do.call(rbind, .)
#>              Min. 1st Qu. Median  Mean 3rd Qu. Max.
#> Sepal.Length  4.3     5.1   5.80 5.843     6.4  7.9
#> Sepal.Width   2.0     2.8   3.00 3.057     3.3  4.4
#> Petal.Length  1.0     1.6   4.35 3.758     5.1  6.9
#> Petal.Width   0.1     0.3   1.30 1.199     1.8  2.5

@hadley hadley commented Apr 19, 2016

So effectively you want to treat lists as transposed data frames - I think that makes sense.

@lionel- lionel- commented Apr 19, 2016

> So effectively you want to treat lists as transposed data frames

But I think that should depend on the direction of the binding. The key is to think of lists as collections of vectors rather than as holistic objects (as we do when we bind data frames).

@jennybc jennybc commented Apr 27, 2016

These two tweets sound like another example of something that should be dataframeable but is awkward:

> Taken ages to figure out how to convert matrix of lists into single dataframe #rstats: x <- dplyr::rbind_all(apply(x, 1, as.data.frame))

> NB my x here is rather odd format: each column a variable, each row a species, each cell a list, but all cells in same row have same length

@lionel- lionel- commented Apr 27, 2016

This should work out of the box once bind_rows() treats lists as collections of row vectors.

@jennybc jennybc commented May 18, 2016

Another toy example inspired by real life. Am I making something hard that's not?

library(dplyr)
x <- list(
  c(a = "a1", b = "b1"),
  c(a = "a2", b = "b2"),
  c(a = "a3", c = "c3")
)
dplyr::bind_rows(x)
#> Error: cannot convert object to a data frame
x %>% 
  purrr::map(as.list) %>% 
  ## update: it's simpler than I thought; this line is unnecessary
  #purrr::map(tibble::as_data_frame) %>%
  dplyr::bind_rows()
#> Source: local data frame [3 x 3]
#> 
#>       a     b     c
#>   <chr> <chr> <chr>
#> 1    a1    b1  <NA>
#> 2    a2    b2  <NA>
#> 3    a3  <NA>    c3

I want to use bind_rows() specifically because this is so nice: "When row-binding, columns are matched by name, and any values that don't match will be filled with NA."
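The match-by-name, fill-with-NA rule quoted above can be sketched in base R for named atomic vectors. This is only an illustration of the rule: bind_named_rows() is a hypothetical helper, not a dplyr function, and it is not how bind_rows() is implemented.

```r
# Hypothetical helper (not part of dplyr): row-binds named atomic vectors,
# matching columns by name and filling missing entries with NA.
bind_named_rows <- function(vectors) {
  # Union of all names, in order of first appearance
  cols <- unique(unlist(lapply(vectors, names)))
  rows <- lapply(vectors, function(v) {
    out <- v[cols]       # indexing by an absent name yields NA
    names(out) <- cols   # restore the full set of column names
    as.data.frame(as.list(out), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

x <- list(
  c(a = "a1", b = "b1"),
  c(a = "a2", b = "b2"),
  c(a = "a3", c = "c3")
)
res <- bind_named_rows(x)
print(res)
```

For this input the result has three rows and columns a, b and c, with NA where a vector lacks that name, which is the shape the map(as.list) workaround produces.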

@hadley hadley commented May 26, 2016

I think this probably all makes sense, but I can never quite hold all the cases in my head, and when I fix one I end up breaking another. It would be really useful to develop a set of comprehensive test cases.

It may also be useful to pull all this coercion code into a separate package, so we can think more deeply about all the edge cases. The challenge would then be supplying both R and C APIs.

@hadley hadley commented Mar 5, 2017

I think this (from tidyverse/purrr#222) is another example of the same basic problem:

library(dplyr, warn.conflicts = FALSE)

x <- list(
  list(login = "mickey", public_repos = 27),
  list(login = "minnie", public_repos = 88)
)

x %>% bind_rows()
#> # A tibble: 2 × 2
#>    login public_repos
#>    <chr>        <dbl>
#> 1 mickey           27
#> 2 minnie           88
x %>% setNames(c("mickey", "minnie")) %>% bind_rows()
#> # A tibble: 2 × 2
#>      mickey    minnie
#>      <list>    <list>
#> 1 <chr [1]> <chr [1]>
#> 2 <dbl [1]> <dbl [1]>

@hadley hadley closed this Mar 5, 2017

@jennybc jennybc commented Mar 5, 2017

Did you mean to close this @hadley?

@hadley hadley commented Mar 5, 2017

No 😞

@hadley hadley reopened this Mar 5, 2017
lionel- added a commit to lionel-/dplyr that referenced this issue Apr 6, 2017
@lionel- lionel- closed this in #2621 Apr 6, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018