pack() / unpack() #523

Closed
romainfrancois opened this issue Dec 13, 2018 · 15 comments

@romainfrancois

commented Dec 13, 2018

  • pack() would tidy-select columns, pack them into a data frame, and then add that data frame as a new column.
  • unpack() would "promote" the columns of a data frame column to columns of the outer df.

(very sketchy, just unloading pack from my head)

library(tidyverse)

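# Pack the tidy-selected columns into a single data frame column called `name`
pack <- function(.tbl, name = "data", ...) {
  all_vars <- tbl_vars(.tbl)
  # Named character vector: new names -> selected column names
  selected <- tidyselect::vars_select(all_vars, ...)
  not_selected <- setdiff(all_vars, selected)
  
  # Keep the unselected columns, then add the selected ones as one df-column
  .tbl %>% 
    select(not_selected) %>% 
    add_column(!!name := select(.tbl, selected))
}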

df <- iris %>% 
  as_tibble() %>% 
  pack("Sepal", Length = Sepal.Length, Width = Sepal.Width) %>% 
  pack("Petal", Length = Petal.Length, Width = Petal.Width)
df$Petal
#> # A tibble: 150 x 2
#>    Length Width
#>     <dbl> <dbl>
#>  1    1.4   0.2
#>  2    1.4   0.2
#>  3    1.3   0.2
#>  4    1.5   0.2
#>  5    1.4   0.2
#>  6    1.7   0.4
#>  7    1.4   0.3
#>  8    1.5   0.2
#>  9    1.4   0.2
#> 10    1.5   0.1
#> # … with 140 more rows

Created on 2018-12-13 by the reprex package (v0.2.1.9000)
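
For the unpack() half, which isn't sketched above, something like the following might work. This is only a minimal sketch under assumptions, not a proposed implementation: unpack_sketch() and its col and sep arguments are hypothetical names, and the tibble package is assumed to be installed.

```r
library(tibble)

# Hypothetical helper (not part of any package): promote the columns of one
# packed data frame column to columns of the outer data frame.
unpack_sketch <- function(.tbl, col, sep = "_") {
  inner <- .tbl[[col]]
  stopifnot(is.data.frame(inner))

  # Keep every outer column except the packed one
  out <- .tbl[setdiff(names(.tbl), col)]

  # Append the inner columns, prefixed with the packed column's name
  # so that they cannot collide with existing outer columns
  for (nm in names(inner)) {
    out[[paste(col, nm, sep = sep)]] <- inner[[nm]]
  }
  out
}

df <- tibble(x = 1:3, y = tibble(a = 1:3, b = 4:6), z = 7:9)
names(unpack_sketch(df, "y"))
```

Note that this sketch appends the promoted columns at the end rather than putting them where the packed column used to be; a real implementation would probably want to preserve position.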

@hadley

commented Dec 13, 2018

This feels like maybe it should belong in tidyr? I like the names.

@romainfrancois

commented Dec 13, 2018

That makes sense. I still have no luck with "transfer this issue"; would you please do it?

@hadley transferred this issue from tidyverse/dplyr Dec 13, 2018

@romainfrancois

commented Dec 13, 2018

Random things:

  • Would we want to pack into multiple data frame columns at once, maybe something like pack(x = select(<...>), y = select(<...>))?
  • I guess we need to make grouping variables non-packable.

@lionel-

commented Dec 13, 2018

pack( x = select(<...>), y = select(<...>)

I guess this would be pack(x = vars(), y = vars()) since select takes a data frame as input? Perhaps it could also nest the data frame in a list-column, following the join() proposal?

data %>% pack(
  x = vars(...),
  y = nesting(vars(...))
)

i guess we need to make grouping variables non packable.

How about copying them inside the nested df? It sounds useful to carry them around.

@romainfrancois

commented Dec 13, 2018

vars() sounds good, otherwise maybe pick(), pick_if(), pick_at(), ...

@romainfrancois

commented Dec 13, 2018

I’m not sure I get the nesting() thing. Would we need to make a list column of n() one-row tibbles?

@lionel-

commented Dec 13, 2018

Yes, which is occasionally useful. Not that important, I just thought maybe the interfaces could be made consistent, in case we include join() in dplyr.
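
To make the distinction concrete, here is a small sketch of the two shapes being discussed: a packed data frame column versus a list column of one-row tibbles. The data and column names (id, a, b, packed, nested) are made up for illustration, and this only shows the resulting structures, not any proposed pack() or nesting() interface.

```r
library(tibble)

df <- tibble(id = 1:3, a = c(1.5, 2.5, 3.5), b = c("x", "y", "z"))

# Packed: one data frame column with the same number of rows as the outer df
packed_col <- df[c("a", "b")]

# Nested: a list column holding one one-row tibble per outer row
nested_col <- lapply(seq_len(nrow(df)), function(i) packed_col[i, ])

out <- tibble(id = df$id, packed = packed_col, nested = nested_col)

is.data.frame(out$packed)  # the packed column is itself a data frame
is.list(out$nested)        # the nested column is a plain list of tibbles
```

Both columns have three entries, one per outer row; the difference is only in whether the rows live in one data frame or in separate one-row tibbles.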

@hadley

commented Feb 7, 2019

I propose we keep @romainfrancois's simple syntax for pack(), and if you want to pack more variables in more complicated ways, you'd use mutate() or transmute(). Does that make sense?
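
As a rough illustration of the mutate()/transmute() route, the following sketch packs the iris measurements by hand. It assumes dplyr 1.0.0 or later, where a named expression that returns a tibble becomes a data frame column; the column names Length and Width are just the ones used earlier in the thread.

```r
library(dplyr, warn.conflicts = FALSE)

packed <- as_tibble(iris) %>%
  transmute(
    Species,
    # Each named tibble() call becomes one packed data frame column
    Sepal = tibble(Length = Sepal.Length, Width = Sepal.Width),
    Petal = tibble(Length = Petal.Length, Width = Petal.Width)
  )

names(packed)
names(packed$Petal)
```

This is more verbose than pack() for the simple case, which is the argument for keeping a dedicated verb with a small interface.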

@yutannihilation

commented Feb 8, 2019

I think pack() might be considered a namespaced select(). select() can also select multiple columns under one name, but it then flattens them, tweaking the column names to avoid collisions. pack(), on the other hand, keeps the namespace (and renames the columns nicely). Considering this, even vars() might not be needed for the simplest interface.

library(dplyr, warn.conflicts = FALSE)

as_tibble(iris) %>% 
  select(Sepal = starts_with("Sepal"),
         Petal = starts_with("Petal"))
#> # A tibble: 150 x 4
#>    Sepal1 Sepal2 Petal1 Petal2
#>     <dbl>  <dbl>  <dbl>  <dbl>
#>  1    5.1    3.5    1.4    0.2
#>  2    4.9    3      1.4    0.2
#>  3    4.7    3.2    1.3    0.2
#>  4    4.6    3.1    1.5    0.2
#>  5    5      3.6    1.4    0.2
#>  6    5.4    3.9    1.7    0.4
#>  7    4.6    3.4    1.4    0.3
#>  8    5      3.4    1.5    0.2
#>  9    4.4    2.9    1.4    0.2
#> 10    4.9    3.1    1.5    0.1
#> # … with 140 more rows

Created on 2019-02-08 by the reprex package (v0.2.1.9000)
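
For comparison, the pack() that later shipped in tidyr takes named tidyselect specifications directly, without vars(). A short sketch, assuming tidyr 1.0.0 or later is installed; the .names_sep argument is used here to strip the common prefix from the inner names:

```r
library(tidyr)
library(tibble)

packed <- pack(
  as_tibble(iris),
  Sepal = starts_with("Sepal"),
  Petal = starts_with("Petal"),
  .names_sep = "."  # "Sepal.Length" becomes "Length" inside the Sepal column
)

names(packed$Sepal)
names(packed$Petal)
```

Unlike the select() call above, the two groups stay separate data frame columns, so both can contain a Length and a Width without any renaming to Sepal1, Sepal2, and so on.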

On the other hand, I expect pack() needs to accept more complicated specs to handle nested structures. For example:

# c.f. https://github.com/yutannihilation/tiedr#auto_bundle
d <- tibble::tribble(
  ~A_a_x, ~A_a_y, ~A_b_x, ~B_a_x, ~B_b_x, ~B_b_y,
       1,      2,      3,      4,      5,      6,
       2,      3,      4,      5,      6,      7
)

d %>%
  pack(
    A = list(
      a = list(x = A_a_x, y = A_a_y),
      b = list(x = A_b_x)
    ),
    B = list(
      a = list(x = B_a_x),
      b = list(x = B_b_x, y = B_b_y)
    )
  )

Of course, users wouldn't write this by hand. Some helper functions or higher-level interfaces are probably needed. For example, the code above could be written as something like:

# something like `separate()` for colnames
spec <- guess_structure_from_colnames(d, sep = "_")
d %>%
  pack(!!!spec)

So, I agree that the syntax of pack() should be kept simple (and low-level), so that we can use it in higher-level interfaces.

(Sorry, I didn't notice this issue at the time of writing this post... I should have mentioned this issue!)

@hadley

commented Feb 8, 2019

(@yutannihilation FWIW I'm leaning towards "packed" columns rather than "bundled" columns because nest() and pack() are the same length)

@hadley

commented Feb 8, 2019

Alternatively, if we had sels() and morph() then we can pack and unpack with dplyr:

packed <- iris %>%
  tibble() %>%
  morph(petal = sels(starts_with("Petal")), sepal = sels(starts_with("Sepal")))

unpacked <- packed %>%
  morph(petal, sepal)

And @yutannihilation's more complicated nested structure becomes:

d %>%
  morph(
    A = tibble(
      a = tibble(x = A_a_x, y = A_a_y),
      b = tibble(x = A_b_x)
    ),
    B = tibble(
      a = tibble(x = B_a_x),
      b = tibble(x = B_b_x, y = B_b_y),
    )
  )

This also has the nice property that guess_structure() could input a tibble and output a packed tibble.
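
A rough sketch of what such a tibble-in, packed-tibble-out helper might look like is below. Everything here is an assumption for illustration: pack_by_prefix() is a hypothetical name, it splits column names on a fixed separator, and it does not handle awkward cases such as a name that is both a leaf and a prefix (for example "A" alongside "A_x").

```r
library(tibble)

# Hypothetical helper: recursively pack columns by the first component of
# their names, so that A_a_x ends up at out$A$a$x.
pack_by_prefix <- function(df, sep = "_") {
  parts <- strsplit(names(df), sep, fixed = TRUE)
  heads <- vapply(parts, function(p) p[1], character(1))
  tails <- vapply(parts, function(p) paste(p[-1], collapse = sep), character(1))

  cols <- list()
  for (h in unique(heads)) {
    idx <- heads == h
    if (sum(idx) == 1 && tails[idx] == "") {
      # Leaf: no further name components, keep the column as it is
      cols[[h]] <- df[[which(idx)]]
    } else {
      # Branch: drop the prefix and pack the remaining columns recursively
      sub <- df[idx]
      names(sub) <- tails[idx]
      cols[[h]] <- pack_by_prefix(sub, sep)
    }
  }
  # Low-level constructor, so that data frame columns are stored as-is
  new_tibble(cols, nrow = nrow(df))
}

d <- tribble(
  ~A_a_x, ~A_a_y, ~A_b_x, ~B_a_x, ~B_b_x, ~B_b_y,
       1,      2,      3,      4,      5,      6,
       2,      3,      4,      5,      6,      7
)

res <- pack_by_prefix(d)
names(res)
names(res$A)
res$B$b$y
```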

@hadley

commented Feb 8, 2019

But given how much stuff we already have scheduled for 0.9.0, morph() is unlikely to happen until dplyr 1.0.0, so pack() will still be important in the interim.

@yutannihilation

commented Feb 9, 2019

I'm leaning towards "packed" columns rather than "bundled" columns

Thanks, I don't mind. "Bundle" was just needed to write one blog post. I didn't intend to advertise the term itself. I'll use "pack" next time 👍

Using morph() seems cleaner! One thing, I expect select semantics here (morph() has mutate semantics, right?), though I'm not sure how the differences matter actually. (edit: sorry, maybe I'm getting lost in between the select and mutate semantics...)

@lionel-

commented Feb 9, 2019

You'll also have something like a sels() function, which is aware of the current variables and returns a tibble, so you can easily use selection semantics:

data %>% morph(df_col = sels(starts_with("d")))

I realised yesterday that unpacking with morph() is easy:

data %>% morph(packed = tibble(foo, bar)) %>% morph(packed)  

Because unnamed tibbles will be automatically spliced.
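
morph() doesn't exist yet, but the splicing behaviour can be tried with mutate() and transmute(), assuming dplyr 1.0.0 or later, where an unnamed expression that evaluates to a data frame is spliced into the result. The data below is made up for illustration:

```r
library(dplyr, warn.conflicts = FALSE)

data <- tibble(foo = 1:3, bar = c("a", "b", "c"))

# Pack: a named tibble() becomes a single data frame column
packed <- data %>% transmute(packed = tibble(foo, bar))

# Unpack: the unnamed data frame column is spliced back into foo and bar
unpacked <- packed %>% mutate(packed)

names(packed)
names(unpacked)
```

With mutate() the original packed column is kept alongside the spliced columns, so it would still need to be dropped afterwards.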

@jeroen

commented Mar 1, 2019

Is the idea that unpack does the same as jsonlite::flatten()?

df <- tibble(x=1:3,y=tibble(a=1:3,b=1:3),z=4:6)
colnames(df)
colnames(jsonlite::flatten(df))
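
For reference, the same data frame can be run through the unpack() that later shipped in tidyr. This is a sketch assuming tidyr 1.0.0 or later; names_sep is an existing unpack() argument that prefixes the inner names with the outer column name, and "." is chosen here only to resemble a dotted flattening style:

```r
library(tidyr)
library(tibble)

df <- tibble(x = 1:3, y = tibble(a = 1:3, b = 1:3), z = 4:6)

# Without names_sep the inner names are used as they are
colnames(unpack(df, y))

# With names_sep the outer name is prepended to each inner name
colnames(unpack(df, y, names_sep = "."))
```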

@hadley closed this in 90a13cb Apr 23, 2019