-
Notifications
You must be signed in to change notification settings - Fork 416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spread does not preserve order #453
Comments
Supply |
Oh maybe that's a different issue |
I ran into this too, but with a further complication: if there is only a single identifier column, it seems the result is always sorted (as far as I tried, anyway), but if you have multiple identifier columns, the result switches to being unsorted at some number of rows. The more duplicate identifiers, the lower the number of rows needed for the result to be unsorted. Very confusing. library(dplyr) # 0.7.99.9000
library(tidyr) # 0.8.1.9000
new_df <- function(n) {
set.seed(1)
tibble(
i = rep(seq_len(n / 2), each = 2),
var = rep_len(c("a", "b"), n),
val = runif(n)
) %>% sample_frac(1)
}
op <- options(tibble.print_min = 4)
df_10k <- new_df(10000)
# Single identifier: always sorted
df_10k %>%
spread(var, val)
#> # A tibble: 5,000 x 3
#> i a b
#> <int> <dbl> <dbl>
#> 1 1 0.266 0.372
#> 2 2 0.573 0.908
#> 3 3 0.202 0.898
#> 4 4 0.945 0.661
#> # ... with 4,996 more rows
new_df(5000000) %>%
spread(var, val)
#> # A tibble: 2,500,000 x 3
#> i a b
#> <int> <dbl> <dbl>
#> 1 1 0.266 0.372
#> 2 2 0.573 0.908
#> 3 3 0.202 0.898
#> 4 4 0.945 0.661
#> # ... with 2.5e+06 more rows
# Double identifier: switches from sorted to unsorted
df_10k %>%
mutate(j = i) %>%
spread(var, val)
#> # A tibble: 5,000 x 4
#> i j a b
#> <int> <int> <dbl> <dbl>
#> 1 1 1 0.266 0.372
#> 2 2 2 0.573 0.908
#> 3 3 3 0.202 0.898
#> 4 4 4 0.945 0.661
#> # ... with 4,996 more rows
new_df(100000) %>%
mutate(j = i) %>%
spread(var, val)
#> # A tibble: 50,000 x 4
#> i j a b
#> <int> <int> <dbl> <dbl>
#> 1 35026 35026 0.224 0.158
#> 2 35088 35088 0.0712 0.439
#> 3 15647 15647 0.330 0.0746
#> 4 27617 27617 0.138 0.0349
#> # ... with 5e+04 more rows
# Triple identifier: switches earlier
df_10k %>%
mutate(j = i, k = j) %>%
spread(var, val)
#> # A tibble: 5,000 x 5
#> i j k a b
#> <int> <int> <int> <dbl> <dbl>
#> 1 324 324 324 0.0308 0.862
#> 2 3383 3383 3383 0.722 0.868
#> 3 3677 3677 3677 0.762 0.598
#> 4 557 557 557 0.00986 0.686
#> # ... with 4,996 more rows
new_df(1000) %>%
mutate(j = i, k = j) %>%
spread(var, val)
#> # A tibble: 500 x 5
#> i j k a b
#> <int> <int> <int> <dbl> <dbl>
#> 1 1 1 1 0.266 0.372
#> 2 2 2 2 0.573 0.908
#> 3 3 3 3 0.202 0.898
#> 4 4 4 4 0.945 0.661
#> # ... with 496 more rows
options(op) Created on 2018-08-27 by the reprex package (v0.2.0). |
I think this might be related to the problem that See here: d <- tibble::tibble(i = c("c", "a", "b"), j = 1:3)
tidyr::spread(d, i, j)
#> # A tibble: 1 x 3
#> a b c
#> <int> <int> <int>
#> 1 2 3 1 Created on 2018-09-04 by the reprex |
I found that the discrepancy in row order between data with many identifiers vs. few comes from the switch to using character ids (rather than integers) when the number of possible id combinations is > 2^31. Relevant source is here: Lines 16 to 29 in cbdd14e
In the above |
This is handled by library(dplyr, warn.conflicts = FALSE)
library(tidyr)
df <- tibble(
Group = c("B", "C", "A"),
`Cost Week 1` = c(1, 2, 3),
`Cost Week 2` = c(4, 5, 6),
`Cost Week 3` = c(7, 8, 9)
)
spec <- df %>%
pivot_spec_long(-1, measure = "cost") %>%
mutate(
week = readr::parse_number(variable),
variable = NULL
)
spec
#> # A tibble: 3 x 3
#> col_name measure week
#> <chr> <chr> <dbl>
#> 1 Cost Week 1 cost 1
#> 2 Cost Week 2 cost 2
#> 3 Cost Week 3 cost 3
df %>% pivot(spec)
#> # A tibble: 9 x 3
#> Group week cost
#> <chr> <dbl> <dbl>
#> 1 B 1 1
#> 2 B 2 2
#> 3 B 3 3
#> 4 C 1 4
#> 5 C 2 5
#> 6 C 3 6
#> 7 A 1 7
#> 8 A 2 8
#> 9 A 3 9
df %>% pivot(spec) %>% pivot(spec)
#> # A tibble: 3 x 4
#> Group `Cost Week 1` `Cost Week 2` `Cost Week 3`
#> <chr> <dbl> <dbl> <dbl>
#> 1 B 1 2 3
#> 2 C 4 5 6
#> 3 A 7 8 9 Created on 2019-03-03 by the reprex package (v0.2.1.9000) |
When going from a wide dataset to long and then back, there seems to be an implicit sort that occurs on non key/value variables. This prevents the ability to recover an initial data set by undoing the
gather()
operation withspread()
.Consider the following example:
The text was updated successfully, but these errors were encountered: