pivot_longer() should allow for varying the columns slower than the rows #1312

Closed
DavisVaughan opened this issue Feb 2, 2022 · 2 comments · Fixed by #1347
Labels
feature (a feature request or enhancement), pivoting ♻️ (pivot rectangular data to different "shapes")

Comments

@DavisVaughan
Member

DavisVaughan commented Feb 2, 2022

I was somewhat surprised at the pivot_longer() results below. It seems to attempt to keep the row values close together (i.e. the original row 1 values became the new row 1 and 2 values), when really I wanted to keep the column values together (i.e. the original column 1 values became the new row 1 and row 2 values).

This seems very related to names_vary in pivot_wider(), but I don't quite think the name is exactly right here.

I think a good name might actually be cols_vary = "fastest" (i.e. it iterates through all the columns before moving on to the next row). This goes nicely with the cols argument.

library(tidyr)

df <- tibble(
  start = as.Date(c("2019-01-01", "2019-01-02")),
  end = as.Date(c("2019-01-03", "2019-01-04"))
)
df
#> # A tibble: 2 × 2
#>   start      end       
#>   <date>     <date>    
#> 1 2019-01-01 2019-01-03
#> 2 2019-01-02 2019-01-04

pivot_longer(df, c(start, end))
#> # A tibble: 4 × 2
#>   name  value     
#>   <chr> <date>    
#> 1 start 2019-01-01
#> 2 end   2019-01-03
#> 3 start 2019-01-02
#> 4 end   2019-01-04

# I sort of expected this here:
pivot_longer(df, c(start, end)) %>%
  dplyr::arrange(dplyr::desc(name))
#> # A tibble: 4 × 2
#>   name  value     
#>   <chr> <date>    
#> 1 start 2019-01-01
#> 2 start 2019-01-02
#> 3 end   2019-01-03
#> 4 end   2019-01-04

# This is what we get from gather
gather(df, "name", "value", start, end)
#> # A tibble: 4 × 2
#>   name  value     
#>   <chr> <date>    
#> 1 start 2019-01-01
#> 2 start 2019-01-02
#> 3 end   2019-01-03
#> 4 end   2019-01-04


df <- tibble(
  id = c(1L, 1L, 2L, 2L),
  start = as.Date(c("2019-01-01")) + 0:3,
  end = as.Date(c("2019-01-03")) + 0:3
)
df
#> # A tibble: 4 × 3
#>      id start      end       
#>   <int> <date>     <date>    
#> 1     1 2019-01-01 2019-01-03
#> 2     1 2019-01-02 2019-01-04
#> 3     2 2019-01-03 2019-01-05
#> 4     2 2019-01-04 2019-01-06

# Not this:
pivot_longer(df, c(start, end))
#> # A tibble: 8 × 3
#>      id name  value     
#>   <int> <chr> <date>    
#> 1     1 start 2019-01-01
#> 2     1 end   2019-01-03
#> 3     1 start 2019-01-02
#> 4     1 end   2019-01-04
#> 5     2 start 2019-01-03
#> 6     2 end   2019-01-05
#> 7     2 start 2019-01-04
#> 8     2 end   2019-01-06

# I actually don't want this either, because I think all of `id == 1` should
# be kept together
gather(df, "name", "value", start, end)
#> # A tibble: 8 × 3
#>      id name  value     
#>   <int> <chr> <date>    
#> 1     1 start 2019-01-01
#> 2     1 start 2019-01-02
#> 3     2 start 2019-01-03
#> 4     2 start 2019-01-04
#> 5     1 end   2019-01-03
#> 6     1 end   2019-01-04
#> 7     2 end   2019-01-05
#> 8     2 end   2019-01-06

# This is what I really wanted, and is what `cols_vary = "slowest"` would give
pivot_longer(df, c(start, end)) %>%
  dplyr::arrange(id, dplyr::desc(name))
#> # A tibble: 8 × 3
#>      id name  value     
#>   <int> <chr> <date>    
#> 1     1 start 2019-01-01
#> 2     1 start 2019-01-02
#> 3     1 end   2019-01-03
#> 4     1 end   2019-01-04
#> 5     2 start 2019-01-03
#> 6     2 start 2019-01-04
#> 7     2 end   2019-01-05
#> 8     2 end   2019-01-06

Implementation-wise, I think we need to not interleave here:

tidyr/R/pivot-long.R

Lines 259 to 265 in 48ba23d

out <- vec_c(!!!val_cols, .ptype = val_type)
# Interleave into correct order
# TODO somehow `t(matrix(x))` is _faster_ than `matrix(x, byrow = TRUE)`
# if this gets fixed in R this should use `byrow = TRUE` again
n_vals <- nrow(data) * length(val_cols)
idx <- t(matrix(seq_len(n_vals), ncol = length(val_cols)))
vals[[value]] <- vec_slice(out, as.integer(idx))
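To make the interleaving step concrete, here is a minimal base-R sketch of what the index built by `t(matrix(...))` does. The vectors `start` and `end` are hypothetical stand-ins for the entries of `val_cols`, and plain `c()` and `[` stand in for `vec_c()` and `vec_slice()`; this is an illustration of the idea, not the tidyr implementation.

```r
# Hypothetical stand-ins for two value columns from a 2-row data frame.
start <- c("s1", "s2")
end <- c("e1", "e2")

out <- c(start, end)   # columns combined end to end: s1 s2 e1 e2
n_cols <- 2L
n_vals <- length(out)

# Current behaviour: interleave so that each original row's values stay adjacent.
idx <- as.integer(t(matrix(seq_len(n_vals), ncol = n_cols)))
interleaved <- out[idx]

# Skipping the interleave keeps each column's values together instead.
column_major <- out

print(idx)           # 1 3 2 4
print(interleaved)   # "s1" "e1" "s2" "e2"
print(column_major)  # "s1" "s2" "e1" "e2"
```

Leaving `out` untouched is what "not interleaving" would amount to in this simplified setting.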

And then maybe use vec_rep_each() here instead of vec_rep() (that feels very similar to how names_vary works):

vec_rep(keys, vec_size(data)),
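The difference between the two vctrs helpers can be seen on a small example. In this sketch a character vector stands in for `keys` (an assumption made for brevity) and `n_rows` stands in for `vec_size(data)`.

```r
library(vctrs)

keys <- c("start", "end")  # stand-in for the key values, one per pivoted column
n_rows <- 2L               # stand-in for vec_size(data)

# vec_rep() repeats the whole vector, so the columns vary fastest.
fastest <- vec_rep(keys, n_rows)

# vec_rep_each() repeats each element in turn, so the columns vary slowest.
slowest <- vec_rep_each(keys, n_rows)

print(fastest)  # "start" "end" "start" "end"
print(slowest)  # "start" "start" "end" "end"
```

The first ordering lines up with interleaved values; the second lines up with values left in column order.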

@DavisVaughan added the labels feature (a feature request or enhancement) and pivoting ♻️ (pivot rectangular data to different "shapes") on Feb 2, 2022
@DavisVaughan changed the title from "pivot_longer() should allow for disabling interleaving" to "pivot_longer() should allow for varying the columns slower than the rows" on Feb 2, 2022
@eutwt

eutwt commented Feb 2, 2022

It might make sense to name the argument either by_row or rowwise. I'd guess R users may be familiar with what "by row" means because of the matrix argument (even if here the operation is in reverse), and dplyr users are familiar with what "rowwise" means. If you can use existing familiar concepts I think that's beneficial.

Edit: I didn't realize earlier that there is already names_vary in pivot_wider following the same pattern. Given that exists, I think your suggested name is best.

@hadley
Member

hadley commented Feb 23, 2022

The anscombe example in vignette("pivot") could use this argument.
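As a rough sketch of what that could look like, the following assumes a tidyr release that includes the `cols_vary` argument proposed in this issue (it is not available in versions that predate the fix). The `names_pattern` regular expression is an illustrative choice for splitting names such as `x1` and `y1`.

```r
library(tidyr)

# Assumes a tidyr version that provides `cols_vary`.
long <- pivot_longer(
  anscombe,
  everything(),
  cols_vary = "slowest",            # keep each set's rows together
  names_to = c(".value", "set"),
  names_pattern = "(.)(.)"          # e.g. "x1" -> .value = "x", set = "1"
)

head(long)
```

With `cols_vary = "slowest"`, all rows for set 1 come first, followed by set 2, and so on, rather than the sets alternating row by row.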
