FR: promote() to create new variable from a list column #341
Comments
Should it have an option to remove that component from the list? i.e.
Because species has been moved out of metadata? |
Probably. I think in my Game of Thrones character/book stuff I had to do exactly that. Seems like a good idea re: the DRY principle. Then I guess you need |
More notes re: the conversation. Would you want to be able to promote multiple variables at once? In the limit, you are just transposing + simplifying + column binding, I suppose. |
And do you want to be able to specify types like in the map functions? |
In my original fantasy, no, |
Hmm it feels this should have been how
If we used mutate semantics instead of select, we could be explicit by using the vector constructors from rlang:
promote(df, list_col, Species = chr(Species))
But it wouldn't work if you want to be explicit about unnesting to another list-column. Unless we only try to simplify bare symbols?
promote(df, listcol, other_listcol = Species) # Simplifies
promote(df, listcol, other_listcol = identity(Species)) # Doesn't simplify
With
promote_at(df, listcol, vars(everything()), funs(chr)) |
I also started to have an eerie feeling re: connections to |
I think
Regarding "putting the new variable in front of the old instead of at the end where I can never see it", there is a tension with the idiom that the variable last created is placed in the last position. This allows |
I think |
Some more imaginary examples:
library(tibble)
df <- tribble(
~x, ~y,
1, list(a = 1:3, b = list(X = 3, Y = 5), c = 5),
2, list(a = 4, b = list(X = 1, Y = 5), c = 7)
)
# Single value is unambiguous
# df %>% promote(y, "c")
tribble(
~x, ~y, ~c,
1, list(a = 1:3, b = list(X = 3, Y = 5)), 5,
2, list(a = 4, b = list(X = 1, Y = 5)), 7
)
# Named vector forms columns
# df %>% promote(y, "b")
tribble(
~x, ~y, ~X, ~Y,
1, list(a = 1:3, c = 5), 3, 5,
2, list(a = 4, c = 7), 1, 5
)
# Unnamed vector forms rows
# df %>% promote(y, "a")
tribble(
~x, ~y, ~a,
1, list(b = list(X = 3, Y = 5), c = 5), 1,
1, list(b = list(X = 3, Y = 5), c = 5), 2,
1, list(b = list(X = 3, Y = 5), c = 5), 3,
2, list(b = list(X = 1, Y = 5), c = 7), 4
)
I think these are basically a wrapper around a mutate (which uses |
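The single-value case above can be approximated with tools that already exist. The following is only a sketch: `promote_sketch()` is a hypothetical helper written for illustration (it is not a tidyr function), it takes the column and component names as strings, and it simplifies only when every extracted value has length 1.

```r
library(tibble)
library(purrr)

df <- tribble(
  ~x, ~y,
  1, list(a = 1:3, b = list(X = 3, Y = 5), c = 5),
  2, list(a = 4, b = list(X = 1, Y = 5), c = 7)
)

# Hypothetical helper, for illustration only (not part of tidyr).
promote_sketch <- function(data, col, name) {
  # Extract the named component from every element of the list column
  values <- map(data[[col]], name)
  # Simplify to an atomic vector only when every value has length 1
  if (all(lengths(values) == 1L)) {
    values <- unlist(values)
  }
  data[[name]] <- values
  # Drop the promoted component from each element of the list column
  data[[col]] <- map(data[[col]], function(el) el[setdiff(names(el), name)])
  data
}

out <- promote_sketch(df, "y", "c")
out$c
names(out$y[[1]])
```

The named-vector and unnamed-vector cases would need extra logic for widening and lengthening, which this sketch deliberately leaves out.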
My familiarity with list columns comes largely from tibblized JSON data, as well. However, I really liked the approach taken by the
In general, I think it would be clearer to use something akin to the
It might make sense to pull this functionality into a separate package (I like @hadley's idea of
Spread-like behavior:
tree <- tibble::tibble(
  key = c(1, 2),
  list_col = list(
    list(a = c(1, 2), b = c(3, 4)),
    list(a = c(5, 6), b = c(7, 8))
  )
)
print(tree)
#> # A tibble: 2 x 2
#> key list_col
#> <dbl> <list>
#> 1 1 <list [2]>
#> 2 2 <list [2]>
# tree %>% spread_tree(list_col,levels=1)
# Parsed with column specification:
# cols(
# a = col_list(),
# b = col_list()
# )
output_level1 <- tibble::tibble(
  key = c(1, 2),
  a = list(c(1, 2), c(5, 6)),
  b = list(c(3, 4), c(7, 8))
)
print(output_level1)
#> # A tibble: 2 x 3
#> key a b
#> <dbl> <list> <list>
#> 1 1 <dbl [2]> <dbl [2]>
#> 2 2 <dbl [2]> <dbl [2]>
# output_level1 %>% spread_tree(levels=1) # hits all list columns?
# Parsed with column specification:
# cols(
# key = col_integer(),
# a_1 = col_integer(),
# a_2 = col_integer(),
# b_1 = col_integer(),
# b_2 = col_integer()
# )
output_level2 <- tibble::tibble(
  key = c(1, 2),
  a_1 = c(1, 5),
  a_2 = c(2, 6),
  b_1 = c(3, 7),
  b_2 = c(4, 8)
)
print(output_level2)
#> # A tibble: 2 x 5
#> key a_1 a_2 b_1 b_2
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3 4
#> 2 2 5 6 7 8
Gather-like behavior:
tree <- tibble::tibble(
  key = c(1, 2),
  list_col = list(
    list("a", "b", "c"),
    list("d", "e", "f")
  )
)
#> # A tibble: 2 x 2
#> key list_col
#> <dbl> <list>
#> 1 1 <list [3]>
#> 2 2 <list [3]>
# tree %>% gather_tree(list_col)
# Parsed with column specification:
# cols(
# output = col_character()
# )
tibble::tibble(
  key = c(rep(1, 3), rep(2, 3)),
  output = c("a", "b", "c", "d", "e", "f")
)
#> # A tibble: 6 x 2
#> key output
#> <dbl> <chr>
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 2 d
#> 5 2 e
#> 6 2 f
I am glossing over several tricky things here: what inferences to make about names when not provided, how to enable the user to control those inferences, etc. One last tidbit: I would love to see this list-column implementation expanded to deal with XML, JSON, etc. in a generalized way. Curious to hear your thoughts! |
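To make the spread-like and gather-like behavior concrete, here is a rough sketch. `spread_tree_sketch()` and `gather_tree_sketch()` are hypothetical helpers written for this illustration; they handle a single level, take the column name as a string, and assume every element of the list column has the same shape. Name inference, the tricky part mentioned above, is not addressed.

```r
library(tibble)

# Hypothetical helper: turn each named component of a list column into
# its own (still list-typed) column, one level deep.
spread_tree_sketch <- function(data, col) {
  components <- names(data[[col]][[1]])
  for (nm in components) {
    data[[nm]] <- lapply(data[[col]], function(el) el[[nm]])
  }
  data[[col]] <- NULL
  data
}

# Hypothetical helper: one output row per leaf of an unnamed list column.
gather_tree_sketch <- function(data, col, values_to = "output") {
  n <- lengths(data[[col]])
  # Repeat each row once per leaf, keeping every column except `col`
  out <- data[rep(seq_len(nrow(data)), times = n), setdiff(names(data), col)]
  out[[values_to]] <- unlist(data[[col]])
  out
}

wide_in <- tibble(
  key = c(1, 2),
  list_col = list(
    list(a = c(1, 2), b = c(3, 4)),
    list(a = c(5, 6), b = c(7, 8))
  )
)
wide <- spread_tree_sketch(wide_in, "list_col")

long_in <- tibble(
  key = c(1, 2),
  list_col = list(list("a", "b", "c"), list("d", "e", "f"))
)
long <- gather_tree_sketch(long_in, "list_col")
```

Applying the spread step a second time, to expand `a` and `b` into `a_1`, `a_2`, and so on, would need a naming rule for unnamed elements, which is exactly where user control over inference matters.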
Relates to #418, I believe |
Another potential application: https://sharla.party/posts/discog-purrr/ |
Latest thoughts:
All together, I think this means we can be more precise about the use of
The main question is the interface. Should it take a column name, and the name of the components inside that column to hoist?
df %>% hoist(metadata, "species")
df %>% hoist(metadata, "films")
df %>% hoist(metadata, c("films", "species", "color"))
Or should it take a set of named pluck expressions?
The first form is less flexible, but forces a step-by-step approach to dealing with deeply nested columns that I think might be helpful (i.e. you don't need to discover the pluck expression up front). It's also nice that existing columns can be referred to without quotes, whereas the new columns require quotes. I have convinced myself that the simpler form is better, so please speak up now if you see an obvious downside! |
I think deep plucking might be useful when dealing with web metadata. Also the plucked objects are not existing columns but they are still existing objects, so the character vector syntax is less obvious. Maybe with the pluck syntax it is more obvious. For these two reasons, I think I prefer the second form. Also it seems more natural to me to define new columns with parameter syntax, as in |
@lionel- You can still use
df %>% hoist(
  species = c("metadata", "species"),
  first_film = list("metadata", "film", 1L)
)
Is equivalent to (and not much shorter than)
df %>% mutate(
  species = map_c(metadata, "species"),
  first_film = map_c(metadata, list("film", 1))
)
(assuming a |
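For readers who want to run the `mutate()` side of that comparison, here is a minimal sketch. It substitutes `purrr::map_chr()` for the `map_c()` assumed above, which only works because both extracted values are character; the toy `df` is invented for illustration.

```r
library(tibble)
library(dplyr)
library(purrr)

# Toy data invented for illustration
df <- tibble(
  id = 1:2,
  metadata = list(
    list(species = "Human", film = list("A New Hope", "Return of the Jedi")),
    list(species = "Droid", film = list("A New Hope"))
  )
)

out <- df %>% mutate(
  # A single string extracts a component by name
  species = map_chr(metadata, "species"),
  # A list of accessors plucks progressively deeper: $film, then [[1]]
  first_film = map_chr(metadata, list("film", 1))
)
```

Note that the new columns land after `metadata`, which is the placement friction raised in the original post.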
I really wince at this, because it re-aggravates people's existing confusion about
I'll be back with some examples, if the conversation doesn't move past me too quickly. |
Here's my example for point 1, re: potential to aggravate existing confusion for people mastering "how to work with lists and list-cols".
library(purrr)
library(repurrrsive)
Let’s say you’re interested in multiple fields for each GoT character.
map(got_chars[1:2], c("name", "culture", "born"))
#> [[1]]
#> NULL
#>
#> [[2]]
#> NULL
No, silly, you’ve got to map `[`:
map(got_chars[1:2], `[`, c("name", "culture", "born"))
#> [[1]]
#> [[1]]$name
#> [1] "Theon Greyjoy"
#>
#> [[1]]$culture
#> [1] "Ironborn"
#>
#> [[1]]$born
#> [1] "In 278 AC or 279 AC, at Pyke"
#>
#>
#> [[2]]
#> [[2]]$name
#> [1] "Tyrion Lannister"
#>
#> [[2]]$culture
#> [1] ""
#>
#> [[2]]$born
#> [1] "In 273 AC, at Casterly Rock"
But the proposed |
Here's my example for point 2, re: potential to aggravate existing confusion for people mastering "how to work with lists and list-cols".
library(purrr)
library(repurrrsive)
# setup, nothing to see here
names(gh_repos) <- map_chr(gh_repos, list(1, "owner", "login"))
Providing indexing info as “loose parts” does not error, but this is not correct.
map(gh_repos, 4, "owner", "login")
#> $gaborcsardi
#> $gaborcsardi$id
#> [1] 34924886
#>
#> $gaborcsardi$name
#> [1] "baseimports"
#>
#> overwhelming amount of output follows ...
What if we pack indexing info via
map(gh_repos, c(4, "owner", "login"))
#> $gaborcsardi
#> NULL
#>
#> $jennybc
#> NULL
#>
#> $jtleek
#> NULL
#>
#> $juliasilge
#> NULL
#>
#> $leeper
#> NULL
#>
#> $masalmon
#> NULL
What if we pack indexing info via
map(gh_repos, list(4, "owner", "login"))
#> $gaborcsardi
#> [1] "gaborcsardi"
#>
#> $jennybc
#> [1] "jennybc"
#>
#> $jtleek
#> [1] "jtleek"
#>
#> $juliasilge
#> [1] "juliasilge"
#>
#> $leeper
#> [1] "leeper"
#>
#> $masalmon
#> [1] "masalmon"
Created on 2019-04-24 by the reprex package (v0.2.1.9000) |
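The reason the two packings differ is worth spelling out: `c()` coerces a mix of numbers and strings to character, so the position becomes the name "4", which no element has. The sketch below shows the same effect on toy data invented for illustration, so it runs without repurrrsive.

```r
library(purrr)

# Toy data: each user owns two repos (invented for illustration)
repos <- list(
  u1 = list(list(owner = list(login = "u1")), list(owner = list(login = "u1"))),
  u2 = list(list(owner = list(login = "u2")), list(owner = list(login = "u2")))
)

# c() coerces everything to character, so 2 becomes the *name* "2"
idx_c <- c(2, "owner", "login")
# list() keeps 2 numeric, so it is used as a *position*
idx_l <- list(2, "owner", "login")

via_c <- map(repos, idx_c) # no element is named "2", so every result is NULL
via_l <- map(repos, idx_l) # second repo, then $owner, then $login
```

Any interface that accepts both character vectors and lists as accessors inherits this distinction, which is the confusion the comment above is warning about.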
I realize this conversation is about
I think it's important to view the |
How about this compromise between the two forms? We adhere closer to pluck syntax, but allow you to apply it to only a single list-col at a time (hence considerably reducing duplication):
df %>% hoist(metadata,
  species = "species",
  first_film = list("films", 1L)
)
(I've also decided it's easiest to leave the list column as is; attempting to combine removal with pluck semantics is too complicated) |
Putting on the radar here at @hadley's suggestion.
What about a function promote() that can create a simplified variable from info extracted from a list column?
Example:
What friction would promote() remove? The auto-simplification and "putting the new variable in front of the old instead of at the end where I can never see it".
Related to tidyverse/purrr#336. The new capability of purrr::pluck() also seems interesting in this context.
In my real life, both issues are motivated by dealing with tibblized JSON from an API, where I have one row per item and I'm dragging around a list-column of metadata.