Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance of fill() after group_by() #520

Closed
albertotb opened this issue Dec 7, 2018 · 7 comments

Comments

@albertotb
Copy link

commented Dec 7, 2018

The fill() function after a group_by(), especially if the number of groups is large, is more than 10x slower than mutate() with na.locf(), from the zoo package, yet gives identical results. Maybe I'm missing something and there is another way to peform this same operation?

library(dplyr)
library(tidyr)
library(zoo)
library(tibble)

n <- 1e6
df <- tibble(a = sample(paste("id", 1:(n/4)), n, replace = T),
             b = sample(c("2012", "2013", "2014"), n, replace = T),
             c = sample(c("NA", "A", "B", "C"), n, replace = T))


t1 <- system.time(df1 <-
                    df %>% 
                      arrange(a, b) %>% 
                      group_by(a) %>% 
                      mutate(c = na.locf(c, na.rm=F))
                  )

print(t1)
#>    user  system elapsed 
#>   21.45    0.06   21.81

t2 <- system.time(df2 <-
                    df %>%
                      arrange(a, b) %>% 
                      group_by(a) %>%
                      fill(c)
                  )
print(t2)
#>    user  system elapsed 
#>  313.37    0.47  316.75
print(identical(df1, df2))
#> [1] TRUE

Created on 2018-12-07 by the reprex package (v0.2.1)

@albertotb

This comment has been minimized.

Copy link
Author

commented Dec 7, 2018

There is a typo in the reprex, it should be c = sample(c(NA, "A", "B", "C"), n, replace = T)) but the results are the same (I don't know how to edit the issue)

@albertotb albertotb changed the title Performance of fill function after group_by Performance of fill() after group_by() Dec 7, 2018
@coolbutuseless

This comment has been minimized.

Copy link
Contributor

commented Mar 5, 2019

I can confirm that this does seem rather slow!

suppressPackageStartupMessages({
  library(dplyr)
  library(tidyr)
  library(zoo)
  library(tibble)
})

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Create some dummy data
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
set.seed(2)
n <- 1e2
df <- tibble(
  a = sample(paste("id", 1:(n/4))     , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C")     , n, replace = TRUE)
) %>%
  arrange(a, b) %>%
  group_by(a)


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Simple functions to wrap 'zoo::na.locf()' and 'dplyr::fill'
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
f_zoo  <- function() { mutate(df, c = na.locf(c, na.rm=FALSE)) }
f_fill <- function() { fill(df, c)}


#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# benchmark
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
identical(f_zoo(), f_fill())
#> [1] TRUE
res <- bench::mark(f_zoo(), f_fill())

res
#> # A tibble: 2 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 f_zoo()     1.09ms  1.27ms  1.23ms  2.57ms     789.     90.4KB     7
#> 2 f_fill()   29.22ms 31.66ms 32.07ms 34.48ms      31.6   196.8KB     9
#> # … with 2 more variables: n_itr <int>, total_time <bch:tm>

plot(res)

Created on 2019-03-05 by the reprex package (v0.2.1)

@hadley

This comment has been minimized.

Copy link
Member

commented Mar 8, 2019

It's likely slow because of the current implementation:

fill.grouped_df <- function(data, ..., .direction = c("down", "up", "downup", "updown")) {
  dplyr::do(data, fill(., ..., .direction = .direction))
}

Probably the easiest experiment to make this faster would be to switch from do() to mutate_at()

@hadley

This comment has been minimized.

Copy link
Member

commented Mar 8, 2019

A quick experiment suggests that this is likely to considerably improve performance:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

set.seed(2)
n <- 1e4
df <- tibble(
  a = sample(paste("id", 1:(n/4))     , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C")     , n, replace = TRUE)
) %>% arrange(a, b) 
gf <- df %>% group_by(a)


bench::mark(
  ungrouped_fill = fill(df, c),
  grouped_fill = fill(gf, c),
  mutate_fillDown = mutate(gf, c = tidyr:::fillDown(c)),
  mutate_na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
  check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 x 4
#>   expression           min     mean   median
#>   <chr>           <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill  887.19µs    1.1ms 934.43µs
#> 2 grouped_fill       2.85s    2.85s    2.85s
#> 3 mutate_fillDown  36.79ms  42.71ms   42.8ms
#> 4 mutate_na.locf  115.99ms 120.38ms 120.29ms

Created on 2019-03-08 by the reprex package (v0.2.1.9000)

@hadley

This comment has been minimized.

Copy link
Member

commented Mar 8, 2019

Ooh, and it massively simplifies the implementation.

New performance:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

set.seed(2)
n <- 1e4
df <- tibble(
  a = sample(paste("id", 1:(n/4))     , n, replace = TRUE),
  b = sample(c("2012", "2013", "2014"), n, replace = TRUE),
  c = sample(c(NA, "A", "B", "C")     , n, replace = TRUE)
) %>% arrange(a, b) 
gf <- df %>% group_by(a)

bench::mark(
  ungrouped_fill = fill(df, c),
  grouped_fill = fill(gf, c),
  na.locf = mutate(gf, c = zoo::na.locf(c, na.rm = FALSE)),
  check = FALSE
)[1:4]
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 3 x 4
#>   expression          min     mean   median
#>   <chr>          <bch:tm> <bch:tm> <bch:tm>
#> 1 ungrouped_fill   1.24ms    1.5ms   1.37ms
#> 2 grouped_fill    27.54ms   33.8ms  31.14ms
#> 3 na.locf        114.05ms  122.6ms 122.97ms
@hadley hadley closed this in 8fdea23 Mar 8, 2019
@albertotb

This comment has been minimized.

Copy link
Author

commented Mar 8, 2019

Great! Thanks for the hard work

@Dulani

This comment has been minimized.

Copy link

commented Mar 14, 2019

Nice timing on this fix! I had a bit o' tidyverse code for a dataframe with 1.1M rows that was running just fine until I added two grouped fills. Then... it didn't seem like it was EVER going to finish. So, I reduced it to 100k rows and still had to wait >2 mins for it to finish. After a quick Google search to find this recently resolved(!) issue and a quick devtools::install_github("tidyverse/tidyr") , I'm back up and running and all 1.1 M rows are now finishing in 15 seconds! Nice work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.