Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plan for flatten(), simplify() and friends #900

Closed
hadley opened this issue Aug 28, 2022 · 5 comments · Fixed by #909
Closed

Plan for flatten(), simplify() and friends #900

hadley opened this issue Aug 28, 2022 · 5 comments · Fixed by #909
Labels
feature a feature request or enhancement flatten 🌎

Comments

@hadley
Copy link
Member

hadley commented Aug 28, 2022

I suggest we at least we mark as_vector(), simplify(), simplify_all(), flatten(), and flatten_*() as questioning and remove from the function index. The semantics of these functions are not super clear, and I'm not sure it's worth completely nailing them down because many of the problems they were originally designed to solve have moved into tidyr. Their use in their purrr docs reveals few compelling uses cases:

  • as_vector() is only used in its own examples.

  • simplify() (a version of as_vector() that returns x if it can simplify) isn't used at all.

  • simplify_all() is used with the output of transpose(). The most compelling use case is with safely() to get lists of error and results. But I suspect we could tackle this a different way, perhaps with a list_transpose() inspired by the work we've done in unnest_longer()/unnest_wider() or possibly powered by vec_simplify().

  • flatten() is only used in its own examples. I can see flatten() being useful in conjunction with map() but it might be easier to create the flat_map() functions directly.

Long term, if we do discover that flatten() and simplify() are useful enough to bother fully fleshing out their semantics, I'd suggest deprecating them and introducing new list_flatten() and list_simplify().

Additionally, it might be worth deprecating flatten_dfc() and flatten_dfr() as they have poor semantics (#648, #757). They are similar to both dplyr::bind_rows(x) and vctrs::vec_rbind(!!!x). If we think it's unappealing to recommend a vctrs function we could add list_rbind <- function(x) vctrs::vec_rbind(!!!x) (and similar for list_cbind()). I think we'd probably require individual elements to be data frames.

@lionel-
Copy link
Member

lionel- commented Aug 29, 2022

I think it'd be worth clarifying the notion of "flattening". In other languages and in rlang, flattening is a tool for nested structures that transforms list(atom1, list(atom2), list(list(atom3))) to list(atom1, atom2, list(atom3)). In purrr:

  • flatten() converts all elements to lists, including atomic elements, and concatenates. This flattens lists as a side effect but unfortunately can't really be used predictably to work with nested lists because atomic elements are not preserved:

    purrr::flatten(list(1:2, list(3L))) |> str()
    #> List of 3
    #>  $ : int 1
    #>  $ : int 2
    #>  $ : int 3
  • The typed variants only concatenate without flattening.

    purrr::flatten_int(list(1L, list(3L)))
    #> Error:
    #> ! Can't coerce element 2 from a list to a integer

@hadley
Copy link
Member Author

hadley commented Aug 29, 2022

list_flatten() might look something like this:

library(purrr)

list_flatten <- function(x) {
  stopifnot(vctrs::vec_is_list(x))
  
  is_nested <- map_lgl(x, vctrs::vec_is_list)
  x[!is_nested] <- map(x[!is_nested], list)
  unlist(x, recursive = FALSE)
}
str(list_flatten(list(1:2, list(3))))
#> List of 2
#>  $ : int [1:2] 1 2
#>  $ : num 3
str(list_flatten(list(list(1), list(2, 3), list(list(4)))))
#> List of 4
#>  $ : num 1
#>  $ : num 2
#>  $ : num 3
#>  $ :List of 1
#>   ..$ : num 4
str(list_flatten(list(data.frame(x = 1), data.frame(y = 2))))
#> List of 2
#>  $ :'data.frame':    1 obs. of  1 variable:
#>   ..$ x: num 1
#>  $ :'data.frame':    1 obs. of  1 variable:
#>   ..$ y: num 2

Created on 2022-08-29 by the reprex package (v2.0.1)

@hadley
Copy link
Member Author

hadley commented Aug 29, 2022

There are two possible directions we could take to replace simplify():

  • list_unify() (name up for discussion)
  • unlist2(), + unlist_int(), unlist_dbl(), ...

In either case we want to avoid both the name and the interface of the existing simplify() since we want it to make a simpler vector or die trying, and we'd deprecate the existing flatten() functions giving us fresh start on the interface.

list_unify()

Provide a single function that wraps vctrs::vec_unchop() with a name that fits into the existing purrr naming scheme:

list_unify <- function(x, ptype = NULL) {
  vctrs::vec_unchop(x, ptype = ptype)
}

Pros: only need to add one function.

Cons: lose performant atomic versions. Hard to find a good suffix that accurately conveys the purpose and ties to existing ideas.

unlist2_*()

There are two parts to this proposal:

  • Rename flatten_int() and friends to unlist_int() etc.
  • Add unlist2() which defined in the same way as list_unify()

Pros: unlist() is a very natural name to reach for.

Cons: unlist() operates recursively by default; either have to re-implement unlist_int(), accept that we follow map_*() coercion rules not vctrs coercion rules, or update set_vector_value().

@lionel-
Copy link
Member

lionel- commented Aug 30, 2022

I think list_concat() [1] / unlist2() are an answer to purrr::flatten_int() and co (which don't flatten but concatenate). They are unrelated to simplify() which has different properties than concatenation.

simplify() respects the shape of the vector and the identity of its elements. These invariants correspond to what is implemented in map_vec() but without failing when there is no common type.

[1] I prefer list_concat() to list_unify() because I think adding a new term for this operation would bring more confusion than using an existing one.

@hadley
Copy link
Member Author

hadley commented Sep 2, 2022

Concrete proposal to consider: #909

hadley added a commit that referenced this issue Sep 8, 2022
The key idea is to introduce a new family of "combining" functions: `list_c()`, `list_rbind()`, and `list_cbind()`, which replace `flatten_lgl()`, `flatten_int()`, `flatten_dbl()`, `flatten_chr()` (now `list_c()`), `flatten_dfc()` (`list_cbind()`), and `flatten_dfr()` (`list_rbind()`). The new functions are straightforward wrappers around vctrs functions, but somehow feel natural in purrr to me. 

This leaves `flatten()`, which had a rather idiosyncratic interface. It's now been replaced by `list_flatten()` which now always removes a single layer of list hierarchy (and nothing else). While working on this I realised that this was actually what `splice()` did, so overall this feels like a major improvement in naming consistency.

With those functions in place we can deprecate `map_dfr()` and `map_dfc()` which are actually "flat" map functions because they combine, rather than simplify, the results. They have never actually belonged with `map_int()` and friends because they don't satisfy the invariant `length(map(.x, .f)) == length(.x)`, and `.f` must return a length-1 result. This also strongly implies that `flat_map()` would just be `map_c()` and is thus not necessary.

* Fixes #376 by deprecating `map_dfc()`
* Fixes #405 by clearly ruling against `map_c()`
* Fixes #472 by deprecating `map_dfr()`
* Fixes #575 by introducing `list_c()`, `list_rbind()`, and `list_cbind()`
* Fixes #757 by deprecating `flatten_dfr()` and `flatten_dfc()`
* Fixes #758 by introducing `list_rbind()` and `list_cbind()`
* Part of #900
@hadley hadley closed this as completed in 1276623 Sep 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement flatten 🌎
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants