
pivot_wider() performance #790

Merged
merged 4 commits into tidyverse:master on Nov 13, 2019

Conversation

@DavisVaughan (Member) commented Oct 23, 2019

Feel free to close this PR and just use it as something to refer to later on.

Using the new vec_group_id() together with vec_unique_loc(), we can get some nice performance improvements over the current pivot_wider() implementation.
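As a rough sketch of the idea (not the PR's actual implementation, and using hypothetical toy data): vec_group_id() maps each distinct row key and each distinct column key to an integer id, so every value can be placed into the wide result with a single indexed assignment, and vec_unique_loc() then recovers the unique keys for labeling.

```r
library(vctrs)

# Hypothetical toy data in long format
df <- data.frame(
  case  = c("001", "001", "002", "002"),
  name  = c("x", "y", "x", "y"),
  value = c(1.5, 2.5, 3.5, 4.5)
)

# One integer id per distinct row key and per distinct column key
row_id <- vec_group_id(df["case"])
col_id <- vec_group_id(df["name"])

# Fill the wide result in a single indexed assignment
out <- matrix(NA_real_, nrow = max(row_id), ncol = max(col_id))
out[cbind(row_id, col_id)] <- df$value

# vec_unique_loc() gives the first occurrence of each unique key,
# which supplies the row and column labels
dimnames(out) <- list(
  df$case[vec_unique_loc(df["case"])],
  df$name[vec_unique_loc(df["name"])]
)
```

Here out ends up as a 2 × 2 matrix with rows 001/002 and columns x/y; the real pivot_wider() builds a tibble and handles multiple value columns, but the core "group ids as matrix indices" idea is the same.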

Using #775 as an example:

library(tidyr)
library(dplyr, warn.conflicts = FALSE)
library(reshape2)

mydf <- expand_grid(
  case = sprintf("%03d", seq(1, 4000)),
  year = seq(1900, 2000),
  name = c("x", "y", "z")
) %>%
  mutate(value = rnorm(nrow(.)))

This baseline uses the current dev tidyr with the current dev vctrs. It is already slightly faster than the timings in the original example; perhaps recent vctrs improvements account for that, I'm not sure.

bench::mark(
  pivot = pivot_wider(mydf, names_from = "name", values_from = "value"),
  spread = spread(mydf, name, value),
  dcast = dcast(mydf, case + year ~ name),
  iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         815ms    881ms      1.14     115MB     2.45
#> 2 spread        430ms    504ms      2.01     431MB    11.1 
#> 3 dcast         407ms    485ms      2.07     457MB     8.82

Now with this PR:

bench::mark(
  pivot = pivot_wider(mydf, names_from = "name", values_from = "value"),
  spread = spread(mydf, name, value),
  dcast = dcast(mydf, case + year ~ name),
  iterations = 50
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pivot         535ms    617ms      1.59     117MB     3.41
#> 2 spread        419ms    512ms      1.97     431MB     8.60
#> 3 dcast         417ms    458ms      2.11     457MB     7.42

@hadley (Member) commented Oct 23, 2019

Nice!

@krlmlr (Member) commented Nov 11, 2019

Another example:

Before

library(tidyverse)

set.seed(20191111)

N <- 100

data <-
  crossing(
    key1 = factor(letters),
    key2 = factor(letters),
    key3 = Sys.Date() - seq_len(N),
    nesting(name = factor(LETTERS), value = runif(26))
  )

system.time(
  pivot_wider(data, names_from = name)
)
#>    user  system elapsed 
#>   1.359   0.056   1.415

Created on 2019-11-11 by the reprex package (v0.3.0)

After

system.time(
  pivot_wider(data, names_from = name)
)
#>    user  system elapsed 
#>   0.818   0.032   0.850


@krlmlr (Member) commented Nov 11, 2019

Note that with the CRAN versions of vctrs and tidyr, performance is much worse. This is with N <- 10:

system.time(
  pivot_wider(data, names_from = name)
)
#>    user  system elapsed 
#>   9.868   0.004   9.873


Upgrading vctrs already helps.

@hadley merged commit cb51247 into tidyverse:master on Nov 13, 2019
0 of 4 checks passed
@hadley (Member) commented Nov 13, 2019

Thanks!
