Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spread does not preserve order #453

Closed
coatless opened this issue May 19, 2018 · 6 comments

Comments

@coatless
Copy link

commented May 19, 2018

When going from a wide dataset to long and then back, there seems to be an implicit sort that occurs on non key/value variables. This prevents the ability to recover an initial data set by undoing the gather() operation with spread().

Consider the following example:

library("tidyr")
(init_data = tibble::tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
))
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                1.            4.            7.
#> 2 C                2.            5.            8.
#> 3 A                3.            6.            9.

long = init_data %>% 
  gather(key = "Week", value = "Cost", -Group)

long
#> # A tibble: 9 x 3
#>   Group Week         Cost
#>   <chr> <chr>       <dbl>
#> 1 B     Cost Week 1    1.
#> 2 C     Cost Week 1    2.
#> 3 A     Cost Week 1    3.
#> 4 B     Cost Week 2    4.
#> 5 C     Cost Week 2    5.
#> 6 A     Cost Week 2    6.
#> 7 B     Cost Week 3    7.
#> 8 C     Cost Week 3    8.
#> 9 A     Cost Week 3    9.

wide = long %>%
  spread(key = "Week", value = "Cost")

wide
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 A                3.            6.            9.
#> 2 B                1.            4.            7.
#> 3 C                2.            5.            8.

init_data
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                1.            4.            7.
#> 2 C                2.            5.            8.
#> 3 A                3.            6.            9.
@hadley

This comment has been minimized.

Copy link
Member

commented May 30, 2018

Supply factor_key = TRUE to gather()

@hadley hadley closed this May 30, 2018
@hadley

This comment has been minimized.

Copy link
Member

commented May 30, 2018

Oh maybe that's a different issue

@hadley hadley reopened this May 30, 2018
@mikmart

This comment has been minimized.

Copy link

commented Aug 27, 2018

I ran into this too, but with a further complication: if there is only a single identifier column, it seems the result is always sorted (as far as I tried, anyway), but if you have multiple identifier columns, the result switches to being unsorted at some number of rows. The more duplicate identifiers, the lower the number of rows needed for the result to be unsorted.

Very confusing.

library(dplyr) # 0.7.99.9000
library(tidyr) # 0.8.1.9000

new_df <- function(n) {
  set.seed(1)
  tibble(
    i = rep(seq_len(n / 2), each = 2),
    var = rep_len(c("a", "b"), n),
    val = runif(n)
  ) %>% sample_frac(1)
}

op <- options(tibble.print_min = 4)
df_10k <- new_df(10000)

# Single identifier: always sorted
df_10k %>%
  spread(var, val)
#> # A tibble: 5,000 x 3
#>       i     a     b
#>   <int> <dbl> <dbl>
#> 1     1 0.266 0.372
#> 2     2 0.573 0.908
#> 3     3 0.202 0.898
#> 4     4 0.945 0.661
#> # ... with 4,996 more rows

new_df(5000000) %>%
  spread(var, val)
#> # A tibble: 2,500,000 x 3
#>       i     a     b
#>   <int> <dbl> <dbl>
#> 1     1 0.266 0.372
#> 2     2 0.573 0.908
#> 3     3 0.202 0.898
#> 4     4 0.945 0.661
#> # ... with 2.5e+06 more rows

# Double identifier: switches from sorted to unsorted
df_10k %>% 
  mutate(j = i) %>% 
  spread(var, val)
#> # A tibble: 5,000 x 4
#>       i     j     a     b
#>   <int> <int> <dbl> <dbl>
#> 1     1     1 0.266 0.372
#> 2     2     2 0.573 0.908
#> 3     3     3 0.202 0.898
#> 4     4     4 0.945 0.661
#> # ... with 4,996 more rows

new_df(100000) %>% 
  mutate(j = i) %>% 
  spread(var, val)
#> # A tibble: 50,000 x 4
#>       i     j      a      b
#>   <int> <int>  <dbl>  <dbl>
#> 1 35026 35026 0.224  0.158 
#> 2 35088 35088 0.0712 0.439 
#> 3 15647 15647 0.330  0.0746
#> 4 27617 27617 0.138  0.0349
#> # ... with 5e+04 more rows

# Triple identifier: switches earlier
df_10k %>% 
  mutate(j = i, k = j) %>% 
  spread(var, val)
#> # A tibble: 5,000 x 5
#>       i     j     k       a     b
#>   <int> <int> <int>   <dbl> <dbl>
#> 1   324   324   324 0.0308  0.862
#> 2  3383  3383  3383 0.722   0.868
#> 3  3677  3677  3677 0.762   0.598
#> 4   557   557   557 0.00986 0.686
#> # ... with 4,996 more rows

new_df(1000) %>% 
  mutate(j = i, k = j) %>% 
  spread(var, val)
#> # A tibble: 500 x 5
#>       i     j     k     a     b
#>   <int> <int> <int> <dbl> <dbl>
#> 1     1     1     1 0.266 0.372
#> 2     2     2     2 0.573 0.908
#> 3     3     3     3 0.202 0.898
#> 4     4     4     4 0.945 0.661
#> # ... with 496 more rows

options(op)

Created on 2018-08-27 by the reprex package (v0.2.0).

@elbersb

This comment has been minimized.

Copy link

commented Sep 4, 2018

I think this might be related to the problem that spread does not preserve the column order. For me this is a somewhat annoying problem when creating summary tables.

See here:

d <- tibble::tibble(i = c("c", "a", "b"), j = 1:3)
tidyr::spread(d, i, j)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <int> <int> <int>
#> 1     2     3     1

Created on 2018-09-04 by the reprex
package
(v0.2.0).

@mikmart

This comment has been minimized.

Copy link

commented Sep 9, 2018

I found that the discrepancy in row order between data with many identifiers vs. few comes from the switch to using character ids (rather than integers) when the number of possible id combinations is > 2^31.

Relevant source is here:

tidyr/R/id.R

Lines 16 to 29 in cbdd14e

# Calculate dimensions
ndistinct <- map_dbl(ids, attr, "n")
n <- prod(ndistinct)
if (n > 2 ^ 31) {
# Too big for integers, have to use strings, which will be much slower :(
char_id <- do.call("paste", c(ids, sep = "\r"))
res <- match(char_id, unique(char_id))
} else {
combs <- c(1, cumprod(ndistinct[-p]))
mat <- do.call("cbind", ids)
res <- c((mat - 1L) %*% combs + 1L)
}

In the above if-branches, order(res) won't be the same, which results in the varying row orders.

@hadley

This comment has been minimized.

Copy link
Member

commented Mar 3, 2019

This is handled by pivot() the successor to gather() and spread() (spread() and gather() need too many API changes, so I am introducing a new function which fixes a many issues rather than potentially breaking existing code)

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  Group = c("B", "C", "A"),
  `Cost Week 1` = c(1, 2, 3),
  `Cost Week 2` = c(4, 5, 6),
  `Cost Week 3` = c(7, 8, 9)
)

spec <- df %>% 
  pivot_spec_long(-1, measure = "cost") %>% 
  mutate(
    week = readr::parse_number(variable),
    variable = NULL
  )
spec
#> # A tibble: 3 x 3
#>   col_name    measure  week
#>   <chr>       <chr>   <dbl>
#> 1 Cost Week 1 cost        1
#> 2 Cost Week 2 cost        2
#> 3 Cost Week 3 cost        3

df %>% pivot(spec)
#> # A tibble: 9 x 3
#>   Group  week  cost
#>   <chr> <dbl> <dbl>
#> 1 B         1     1
#> 2 B         2     2
#> 3 B         3     3
#> 4 C         1     4
#> 5 C         2     5
#> 6 C         3     6
#> 7 A         1     7
#> 8 A         2     8
#> 9 A         3     9
df %>% pivot(spec) %>% pivot(spec)
#> # A tibble: 3 x 4
#>   Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#>   <chr>         <dbl>         <dbl>         <dbl>
#> 1 B                 1             2             3
#> 2 C                 4             5             6
#> 3 A                 7             8             9

Created on 2019-03-03 by the reprex package (v0.2.1.9000)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.