Implement multi-spread #149

Open
hadley opened this Issue Dec 30, 2015 · 8 comments

hadley commented Dec 30, 2015

library(dplyr)
library(tidyr)

# From Jenny Bryan --------------------------------------------------------

input <- tribble(
  ~hw,   ~name,  ~mark,   ~pr,
  "hw1", "anna",    95,  "ok",
  "hw1", "alan",    90, "meh",
  "hw1", "carl",    85,  "ok",
  "hw2", "alan",    70, "meh",
  "hw2", "carl",    80,  "ok"
)

# Want:
input %>%
  gather(key = element, value = score, mark, pr) %>%
  unite(thing, hw, element, remove = TRUE) %>%
  spread(thing, score, convert = TRUE)

# With multispread - still have to go through untidy/molten form,
# which loses variable names
input %>%
  gather(mark, pr, key = element, value = score) %>%
  spread(c(hw, element), score, convert = TRUE)

# http://stackoverflow.com/questions/27247078 -----------------------------

df <- tribble(
  ~id, ~type,     ~transactions, ~amount,
  20,  "income",  20,            100,
  20,  "expense", 25,            95,
  30,  "income",  50,            300,
  30,  "expense", 45,            250
)

df %>%
  gather(var, val, transactions:amount) %>%
  unite(var2, type, var) %>%
  spread(var2, val)

# With multispread - still have to go through untidy/molten form
df %>%
  gather(var, val, transactions:amount) %>%
  spread(c(type, var), val)

# http://stackoverflow.com/questions/24929954 -----------------------------

df <- expand.grid(Year = 2000:2014, Product = c("A", "B"), Country = c("AI", "EI")) %>%
  as_tibble() %>%
  select(Product, Country, Year) %>%
  mutate(value = rnorm(nrow(.))) %>%
  filter((Product == "A" & Country == "AI") | (Product == "B" & Country == "EI"))

df %>%
  unite(Prod_Count, Product, Country) %>%
  spread(Prod_Count, value)

# If we had multi-spread:
df %>%
  spread(c(Product, Country), value)

hadley commented Sep 11, 2017

Maybe spread() should take one set of columns for the rows and another set for the columns. If multiple variables are given for the columns, join them together with a separator. Default rows to not-cols and cols to not-rows, but also make it possible to reduce the number of variables, eliminating the need for an intermediate select().
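For readers arriving later: tidyr 1.0.0 and newer ship pivot_wider(), whose id_cols ("rows") and names_from ("columns") arguments look close to this idea. The sketch below is only an illustration under that reading, not the design discussed in this thread; the small data frame is made up for the example.

```r
library(tidyr)

# Toy data (made up): one value per Year / Product / Country combination
df <- tibble::tibble(
  Year    = c(2000, 2000, 2001, 2001),
  Product = c("A", "B", "A", "B"),
  Country = c("AI", "EI", "AI", "EI"),
  value   = c(1, 2, 3, 4)
)

# "Rows" are identified by Year; the new column names are built from
# Product and Country, joined with the separator given in names_sep
wide <- pivot_wider(
  df,
  id_cols     = Year,
  names_from  = c(Product, Country),
  values_from = value,
  names_sep   = "_"
)
wide
```

Because id_cols is stated explicitly, any column not listed in id_cols, names_from or values_from is dropped, so no intermediate select() is needed.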

krlmlr commented Sep 14, 2017

spread() already has a sep argument. Do you want to support vars() calls for the key and value arguments?

df %>%
  spread(vars(Product, Country), value)

# for multiple value columns
df %>%
  spread(vars(Product, Country), vars(value_1:value_x))

hadley commented Sep 14, 2017

This will probably be a new verb - and yes, it will probably use vars().

krlmlr commented Jan 8, 2019

We can also use a nested data frame as an intermediate step:

library(tidyverse)

data <- tribble(
  ~hw,   ~name,  ~mark,   ~pr,
  "hw1", "anna",    95,  "ok",
  "hw1", "alan",    90, "meh",
  "hw1", "carl",    85,  "ok",
  "hw2", "alan",    70, "meh",
  "hw2", "carl",    80,  "ok"
)

fill_empty <- function(x) {
  map(x, ~if (is.null(.)) tibble(.rows = 1) else .)
}

data %>% 
  nest(-hw, -name) %>% 
  spread(name, data) %>%
  mutate_at(vars(-hw), list(fill_empty)) %>%
  unnest(.sep = "_")
#> Warning: The `.name_repair` argument to `as_tibble()` takes precedence over
#> the deprecated `validate` argument.
#> # A tibble: 2 x 7
#>   hw    alan_mark alan_pr anna_mark anna_pr carl_mark carl_pr
#>   <chr>     <dbl> <chr>       <dbl> <chr>       <dbl> <chr>  
#> 1 hw1          90 meh            95 ok             85 ok     
#> 2 hw2          70 meh            NA <NA>           80 ok

Created on 2019-01-08 by the reprex package (v0.2.1.9000)

It would be much faster if spread() supported hierarchical columns, and if we had a tie() + untie() pair of verbs equivalent to nest() + unnest() 👍

library(tidyverse)

data <- tribble(
  ~hw,   ~name,  ~mark,   ~pr,
  "hw1", "anna",    95,  "ok",
  "hw1", "alan",    90, "meh",
  "hw1", "carl",    85,  "ok",
  "hw2", "alan",    70, "meh",
  "hw2", "carl",    80,  "ok"
)

# tie
tied <- data
tied$data <- data %>% select(-hw, -name)
tied <- tied[c("hw", "name", "data")]
tied
#> # A tibble: 5 x 3
#>   hw    name  data$mark $pr  
#>   <chr> <chr>     <dbl> <chr>
#> 1 hw1   anna         95 ok   
#> 2 hw1   alan         90 meh  
#> 3 hw1   carl         85 ok   
#> 4 hw2   alan         70 meh  
#> 5 hw2   carl         80 ok

# spread
spread <- tibble(
  hw = unique(tied$hw),
  anna = tied$data[tied$name == "anna", ][1:2, ],
  alan = tied$data[tied$name == "alan", ][1:2, ],
  carl = tied$data[tied$name == "carl", ][1:2, ]
)
spread
#> # A tibble: 2 x 4
#>   hw    anna$mark $pr   alan$mark $pr   carl$mark $pr  
#>   <chr>     <dbl> <chr>     <dbl> <chr>     <dbl> <chr>
#> 1 hw1          95 ok           90 meh          85 ok   
#> 2 hw2          NA <NA>         70 meh          80 ok

# untie: omitted, straightforward

Created on 2019-01-08 by the reprex package (v0.2.1.9000)

Happy to work on fleshing out the details.
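The untie step left out above could look roughly like the following base R sketch. The function name untie() and its sep argument are hypothetical (borrowed from the naming idea in this comment, not an existing API), and the input is a small hand-built data frame with one data frame column.

```r
# Hypothetical helper: flatten data frame columns into ordinary columns,
# prefixing the inner names with the outer column name
untie <- function(df, sep = "_") {
  pieces <- lapply(names(df), function(nm) {
    col <- df[[nm]]
    if (is.data.frame(col)) {
      # data frame column: prefix the inner names with the outer name
      names(col) <- paste(nm, names(col), sep = sep)
      col
    } else {
      # ordinary column: wrap it so it can be bound with the others
      setNames(data.frame(col, stringsAsFactors = FALSE), nm)
    }
  })
  do.call(cbind, pieces)
}

# Small example with one data frame column (values taken from the thread)
tied <- data.frame(hw = c("hw1", "hw2"), stringsAsFactors = FALSE)
tied$anna <- data.frame(
  mark = c(95, NA),
  pr   = c("ok", NA),
  stringsAsFactors = FALSE
)

flat <- untie(tied)
names(flat)
```

This only handles one level of nesting and plain data frames; a real implementation would need to think about tibbles, name clashes and column ordering.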

yutannihilation commented Jan 19, 2019

A very preliminary implementation:
https://gist.github.com/yutannihilation/958d2f2eb8b2fcddf3391a32a1740d6d

krlmlr commented Jan 20, 2019

Things I learned today about the problem:

  • @jeroen uses "nested" to describe both list columns ("nested tables") and data frame columns ("nested records"). I like the way records are consistent with what vctrs describes as "records"
  • The implementation of nest_column() and unnest_column() works for the cases I tested, but we need to think about:
    • Ordering of columns in the result
    • Naming: Maybe tie() / untie() or nest_record() and unnest_record()
  • We need a more robust implementation of spread(), perhaps using group_by() %>% group_data() and repeated vec_slice()
  • For the related problem of multi-gather, we need more helpers and perhaps a better gather()
  • A blog post describing the current limitations, our options and the challenges would be a good way to tackle the problem

lionel- commented Feb 7, 2019

@krlmlr I think the notion of nesting should refer to disaggregating operations. A data frame column might be nested within groups or rows (and is then really a list column of data frames), but not necessarily. See tidyverse/dplyr#3967 for some examples.

hadley commented Feb 13, 2019

Note to self: multi-spread is about forming the key from multiple columns, which is also a natural fit to the packed data frame.
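As a rough illustration of a key formed from multiple columns and held in a packed data frame, tidyr 1.0.0 and newer provide pack() and unpack(). The sketch below assumes those functions and uses made-up data; the column name key is chosen here only for the example.

```r
library(tidyr)

# Toy data (made up)
df <- tibble::tibble(
  Year    = c(2000, 2000, 2001, 2001),
  Product = c("A", "B", "A", "B"),
  Country = c("AI", "EI", "AI", "EI"),
  value   = c(1, 2, 3, 4)
)

# Pack the two key variables into a single data frame column called `key`
packed <- pack(df, key = c(Product, Country))

# The packed column keeps both variables, so a combined label can be
# derived from it without losing the original names
key_labels <- paste(packed$key$Product, packed$key$Country, sep = "_")

# unpack() reverses the operation
restored <- unpack(packed, key)
```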
