pack() / unpack() #523

Closed
romainfrancois opened this issue Dec 13, 2018 · 15 comments
Labels
df-col 👜 feature a feature request or enhancement

Comments

@romainfrancois
Member

  • pack() would tidy-select columns, pack them into a data frame, and then add that data frame as a new column.
  • unpack() would "promote" the columns of a data frame column to columns of the outer df.

(very sketchy, just unloading pack from my head)

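library(tidyverse)

# Sketch: select columns with tidyselect, move them into a data frame,
# and add that data frame as a single new column called `name`.
pack <- function(.tbl, name = "data", ...) {
  all_vars <- tbl_vars(.tbl)
  # Named character vector: names are the new (inner) column names
  selected <- tidyselect::vars_select(all_vars, ...)
  not_selected <- setdiff(all_vars, selected)

  .tbl %>%
    select(all_of(not_selected)) %>%
    add_column(!!name := select(.tbl, all_of(selected)))
}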

df <- iris %>% 
  as_tibble() %>% 
  pack("Sepal", Length = Sepal.Length, Width = Sepal.Width) %>% 
  pack("Petal", Length = Petal.Length, Width = Petal.Width)
df$Petal
#> # A tibble: 150 x 2
#>    Length Width
#>     <dbl> <dbl>
#>  1    1.4   0.2
#>  2    1.4   0.2
#>  3    1.3   0.2
#>  4    1.5   0.2
#>  5    1.4   0.2
#>  6    1.7   0.4
#>  7    1.4   0.3
#>  8    1.5   0.2
#>  9    1.4   0.2
#> 10    1.5   0.1
#> # … with 140 more rows

Created on 2018-12-13 by the reprex package (v0.2.1.9000)
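
The reprex above only covers pack(). For the reverse direction, a minimal sketch of unpack() might look like the following. `unpack_sketch` and its `sep` argument are hypothetical names used for illustration, not something proposed in this thread, and prefixing the inner names with the outer name is just one possible way to avoid collisions.

```r
library(tibble)

# Hypothetical helper: promote the columns of the df-column `col`
# to columns of the outer tibble, prefixing them with the outer name.
unpack_sketch <- function(.tbl, col, sep = "_") {
  inner <- .tbl[[col]]
  stopifnot(is.data.frame(inner))

  outer <- .tbl[setdiff(names(.tbl), col)]
  names(inner) <- paste(col, names(inner), sep = sep)

  # Concatenate the two column lists and rebuild a tibble;
  # the promoted columns end up after the remaining outer columns.
  as_tibble(c(as.list(outer), as.list(inner)))
}

df <- tibble(x = 1:3, y = tibble(a = 1:3, b = 4:6))
names(unpack_sketch(df, "y"))
#> [1] "x"   "y_a" "y_b"
```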

@hadley hadley transferred this issue from tidyverse/dplyr Dec 13, 2018
@romainfrancois
Member Author

Random things:

  • Would we want to pack into multiple data frame columns at once, maybe something like this: pack(x = select(<...>), y = select(<...>))?
  • I guess we need to make grouping variables non-packable.

@lionel-
Member

lionel- commented Dec 13, 2018

pack(x = select(<...>), y = select(<...>))

I guess this would be pack(x = vars(), y = vars()) since select takes a data frame as input? Perhaps it could also nest the data frame in a list-column, following the join() proposal?

data %>% pack(
  x = vars(...),
  y = nesting(vars(...))
)

I guess we need to make grouping variables non-packable.

How about copying them inside the nested df? It sounds useful to carry them around.

@romainfrancois
Member Author

vars() sounds good, otherwise maybe pick(), pick_if(), pick_at(), ...

@romainfrancois
Member Author

I'm not sure I get the nesting() thing. Would we need to make a list column of n() one-row tibbles?
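
To make the distinction concrete, here is a small sketch contrasting a packed df-column with a list-column of one-row tibbles. It uses only tibble and base R; the proposed nesting() helper itself is not implemented here, and the variable names are illustrative.

```r
library(tibble)

df <- tibble(id = 1:3, a = c(1, 2, 3), b = c("x", "y", "z"))

# Packed: one data frame column with the same number of rows as `df`
packed <- tibble(id = df$id, data = df[c("a", "b")])

# Nested row-wise: a list-column holding one single-row tibble per row
nested <- tibble(
  id = df$id,
  data = lapply(seq_len(nrow(df)), function(i) df[i, c("a", "b")])
)

is.data.frame(packed$data)  # the packed column is itself a data frame
is.data.frame(nested$data)  # the nested column is a plain list of tibbles
```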

@lionel-
Member

lionel- commented Dec 13, 2018

Yes, which is occasionally useful. Not that important, I just thought maybe the interfaces could be made consistent, in case we include join() in dplyr.

@hadley hadley added feature a feature request or enhancement df-col 👜 labels Jan 4, 2019
@hadley
Member

hadley commented Feb 7, 2019

I propose we keep @romainfrancois's simple syntax for pack(), and if you want to pack more variables in more complicated ways, you'd use mutate() or transmute(). Does that make sense?

@yutannihilation
Member

yutannihilation commented Feb 8, 2019

I think pack() might be considered a namespaced select(): select() can also select multiple columns under one name, but it then flattens them, tweaking the column names to avoid collisions. pack(), on the other hand, keeps the namespace (and renames the inner columns nicely). Considering this, even vars() might not be needed for the simplest interface.

library(dplyr, warn.conflicts = FALSE)

as_tibble(iris) %>% 
  select(Sepal = starts_with("Sepal"),
         Petal = starts_with("Petal"))
#> # A tibble: 150 x 4
#>    Sepal1 Sepal2 Petal1 Petal2
#>     <dbl>  <dbl>  <dbl>  <dbl>
#>  1    5.1    3.5    1.4    0.2
#>  2    4.9    3      1.4    0.2
#>  3    4.7    3.2    1.3    0.2
#>  4    4.6    3.1    1.5    0.2
#>  5    5      3.6    1.4    0.2
#>  6    5.4    3.9    1.7    0.4
#>  7    4.6    3.4    1.4    0.3
#>  8    5      3.4    1.5    0.2
#>  9    4.4    2.9    1.4    0.2
#> 10    4.9    3.1    1.5    0.1
#> # … with 140 more rows

Created on 2019-02-08 by the reprex package (v0.2.1.9000)
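
For comparison, the namespaced behaviour can be sketched with the pack() and unpack() functions that tidyr exports, assuming a tidyr version that includes them (1.0.0 or later). The interface discussed in this thread may differ from what is shown here; `.names_sep` and `names_sep` are arguments of the tidyr functions, not of the sketch at the top of this issue.

```r
library(tidyr)
library(tibble)

# Pack by prefix; `.names_sep` strips "Sepal." / "Petal." from the inner names
packed <- pack(
  as_tibble(iris),
  Sepal = starts_with("Sepal"),
  Petal = starts_with("Petal"),
  .names_sep = "."
)
names(packed$Sepal)

# Unpack again, pasting outer and inner names back together
unpacked <- unpack(packed, c(Sepal, Petal), names_sep = ".")
```

Unlike the flattened select() result above, the inner columns keep their own names inside each packed data frame.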

On the other hand, I expect pack() needs to accept more complicated specs to handle nested structures. For example:

# c.f. https://github.com/yutannihilation/tiedr#auto_bundle
d <- tibble::tribble(
  ~A_a_x, ~A_a_y, ~A_b_x, ~B_a_x, ~B_b_x, ~B_b_y,
       1,      2,      3,      4,      5,      6,
       2,      3,      4,      5,      6,      7
)

d %>%
  pack(
    A = list(
      a = list(x = A_a_x, y = A_a_y),
      b = list(x = A_b_x)
    ),
    B = list(
      a = list(x = B_a_x),
      b = list(x = B_b_x, y = B_b_y)
    )
  )

Of course, users wouldn't write this by hand. Some helper functions or higher-level interfaces are probably needed. For example, the code above could be written as something like:

# something like `separate()` for colnames
spec <- guess_structure_from_colnames(d, sep = "_")
d %>%
  pack(!!!spec)

So, I agree the syntax of pack() should be kept simple (and low-level), so that we can use it in higher-level interfaces.
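
As a rough illustration of what such a helper could do, the sketch below builds the nested structure directly from the column names rather than producing a spec for pack(). `pack_by_names` is a hypothetical name, and the sketch assumes that no column name is a strict prefix of another (for example, `A` alongside `A_x` is not handled).

```r
library(tibble)

# Hypothetical helper: recursively pack columns whose names share a
# leading component, splitting names on `sep`.
pack_by_names <- function(df, sep = "_") {
  parts <- strsplit(names(df), sep, fixed = TRUE)
  heads <- vapply(parts, `[`, character(1), 1)

  out <- list()
  for (h in unique(heads)) {
    idx <- which(heads == h)
    rest <- lapply(parts[idx], `[`, -1)
    if (length(idx) == 1 && length(rest[[1]]) == 0) {
      # Leaf: nothing left to split, keep the column as is
      out[[h]] <- df[[idx]]
    } else {
      # Strip the leading component and recurse on the remaining names
      sub <- df[idx]
      names(sub) <- vapply(rest, paste, character(1), collapse = sep)
      out[[h]] <- pack_by_names(sub, sep)
    }
  }
  as_tibble(out)
}

d <- tribble(
  ~A_a_x, ~A_a_y, ~A_b_x, ~B_a_x, ~B_b_x, ~B_b_y,
       1,      2,      3,      4,      5,      6,
       2,      3,      4,      5,      6,      7
)
nested <- pack_by_names(d)
names(nested$A$a)
#> [1] "x" "y"
```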

(Sorry, I didn't notice this issue at the time of writing this post... I should have mentioned this issue!)

@hadley
Member

hadley commented Feb 8, 2019

(@yutannihilation FWIW I'm leaning towards "packed" columns rather than "bundled" columns because nest() and pack() are the same length)

@hadley
Member

hadley commented Feb 8, 2019

Alternatively, if we had sels() and morph(), then we could pack and unpack with dplyr:

packed <- iris %>%
  tibble() %>%
  morph(petal = sels(starts_with("Petal")), sepal = sels(starts_with("Sepal")))

unpacked <- packed %>%
  morph(petal, sepal)

And @yutannihilation's more complicated nested structure becomes:

d %>%
  morph(
    A = tibble(
      a = tibble(x = A_a_x, y = A_a_y),
      b = tibble(x = A_b_x)
    ),
    B = tibble(
      a = tibble(x = B_a_x),
      b = tibble(x = B_b_x, y = B_b_y),
    )
  )

This also has the nice property that guess_structure() could input a tibble and output a packed tibble.

@hadley
Member

hadley commented Feb 8, 2019

But given how much stuff we already have scheduled for 0.9.0, morph() is unlikely to happen until dplyr 1.0.0, so pack() will still be important in the interim.

@yutannihilation
Member

yutannihilation commented Feb 9, 2019

I'm leaning towards "packed" columns rather than "bundled" columns

Thanks, I don't mind. "Bundle" was just needed to write one blog post. I didn't intend to advertise the term itself. I'll use "pack" next time 👍

Using morph() seems cleaner! One thing, I expect select semantics here (morph() has mutate semantics, right?), though I'm not sure how the differences matter actually. (edit: sorry, maybe I'm getting lost in between the select and mutate semantics...)

@lionel-
Member

lionel- commented Feb 9, 2019

You'll also have something like a sels() function, which is aware of the current variables and returns a tibble, so you can easily use selection semantics:

data %>% morph(df_col = sels(starts_with("d")))

I realised yesterday that unpacking with morph() is easy:

data %>% morph(packed = tibble(foo, bar)) %>% morph(packed)  

Because unnamed tibbles will be automatically spliced.
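
morph() and sels() are proposals in this thread, not existing functions. A similar pack-then-splice round trip can be sketched with mutate() and pick(), assuming dplyr 1.1.0 or later; the data and column names below are made up for illustration.

```r
library(dplyr)
library(tibble)

data <- tibble(id = 1:2, foo = c("a", "b"), bar = c(TRUE, FALSE))

# Pack: a named data-frame result becomes a single df-column
packed_tbl <- data %>%
  mutate(packed = pick(foo, bar)) %>%
  select(id, packed)

# Unpack: an unnamed data-frame result is spliced into separate columns
unpacked_tbl <- packed_tbl %>%
  mutate(packed) %>%
  select(-packed)

names(unpacked_tbl)
```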

@jeroen

jeroen commented Mar 1, 2019

Is the idea that unpack does the same as jsonlite::flatten()?

df <- tibble(x=1:3,y=tibble(a=1:3,b=1:3),z=4:6)
colnames(df)
colnames(jsonlite::flatten(df))

@hadley hadley closed this as completed in 90a13cb Apr 23, 2019