Slow performance with dplyr #378

jimhester · 2021-10-04T17:48:53Z

library(dplyr)
library(vroom)
download.file("https://www.ssa.gov/oact/babynames/names.zip", "names.zip")
unzip("names.zip", exdir = "names")
files <- dir("names", pattern = "*.txt", full.names = TRUE)
rawdat <- vroom(files, col_names=c("name", "sex", "count"), id = "path")
rawdat %>%
  mutate(year = substr(path, 10, 13)) %>%
  filter(year >= 1986) %>%
  group_by(name)

DavisVaughan · 2021-10-07T18:21:38Z

I think this may be a vroom performance bug related to how ALTREP views into the original data are materialized.

I can reproduce without dplyr. The slice materializes very slowly, but the original input materializes quickly.

library(vroom)
tmp <- tempdir()
tmp_file <- fs::path(tmp, "names.zip")
exdir <- fs::path(tmp, "names")
fs::dir_create(exdir)

download.file("https://www.ssa.gov/oact/babynames/names.zip", tmp_file)
unzip(tmp_file, exdir = exdir)
files <- dir(exdir, pattern = "*.txt", full.names = TRUE)

rawdat <- vroom(files, col_names=c("name", "sex", "count"), id = "path")

# make a big slice
sliced <- rawdat$name[1:1e6]

# this is an ALTREP view into the original data
.Internal(inspect(sliced))
#> @7fdfb18cb348 16 STRSXP g0c0 [REF(65535)] vroom_chr (len=1000000, materialized=F)

# takes forever to materialize
system.time({
  vroom:::force_materialization(sliced)
})
#>    user  system elapsed 
#>  70.576   0.208  70.994

# original data is still un-materialized
.Internal(inspect(rawdat$name))
#> @7fdff1d6da08 16 STRSXP g1c0 [MARK,REF(65535)] vroom_chr (len=2020863, materialized=F)

# but materializing it is quick
system.time({
  vroom:::force_materialization(rawdat$name)
})
#>    user  system elapsed 
#>   0.284   0.004   0.288

Created on 2021-10-07 by the reprex package (v2.0.1)

DavisVaughan · 2021-10-07T18:29:36Z

Instruments suggests it has something to do with the subset_iterator when you request values with *b in this loop here:
https://github.com/r-lib/vroom/blob/07b9e7bbdad0e012e815f746372d3ab54ff099c0/src/vroom_chr.cc#L27

jimhester · 2021-10-07T20:06:59Z

OK, I have resolved this issue; it should be much better now.

library(vroom)
tmp <- tempdir()
tmp_file <- fs::path(tmp, "names.zip")
exdir <- fs::path(tmp, "names")
fs::dir_create(exdir)

download.file("https://www.ssa.gov/oact/babynames/names.zip", tmp_file)
unzip(tmp_file, exdir = exdir)
files <- dir(exdir, pattern = "*.txt", full.names = TRUE)

rawdat <- vroom(files, col_names=c("name", "sex", "count"), id = "path")
#> Rows: 2020863 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (2): name, sex
#> dbl (1): count
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# make a big slice
sliced <- rawdat$name[1:1e6]

# previously took forever to materialize, now fast
system.time({
  vroom:::force_materialization(sliced)
})
#>    user  system elapsed 
#>   0.117   0.002   0.120

# materializing the original is still quick
system.time({
  vroom:::force_materialization(rawdat$name)
})
#>    user  system elapsed 
#>   0.271   0.005   0.277

Created on 2021-10-07 by the reprex package (v2.0.1)

Previously we created a new iterator each time we returned a value. Now we store the previous index in the iterator and advance the full iterator by the difference when needed. The performance issue was caused by constructing these transient temporary iterators and moving them to the right location for every value. Fixes #378
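
To make the difference concrete, here is a toy model of the two strategies. It is only a sketch, not vroom's actual code: `CountingIterator`, `get_fresh`, `SubsetCursor`, `StepCounts` and `compare_strategies` are hypothetical names invented for this illustration, and the "cost" of positioning an iterator is assumed to be proportional to the distance it moves.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an iterator that is costly to position:
// every element it moves past is counted in *steps.
struct CountingIterator {
  const std::vector<std::string>* data;
  std::ptrdiff_t pos;
  std::size_t* steps;

  void advance(std::ptrdiff_t n) {
    pos += n;
    *steps += static_cast<std::size_t>(n < 0 ? -n : n);
  }
  const std::string& operator*() const {
    return (*data)[static_cast<std::size_t>(pos)];
  }
};

// Old strategy: build a temporary iterator at the start for every lookup
// and move it all the way to idx.
std::string get_fresh(const std::vector<std::string>& data,
                      std::ptrdiff_t idx, std::size_t& steps) {
  CountingIterator it{&data, 0, &steps};
  it.advance(idx);
  return *it;
}

// New strategy: keep one iterator plus the previously requested index,
// and move only by the difference between the two.
class SubsetCursor {
 public:
  SubsetCursor(const std::vector<std::string>& data, std::size_t& steps)
      : it_{&data, 0, &steps}, prev_(0) {}

  const std::string& get(std::ptrdiff_t idx) {
    it_.advance(idx - prev_);
    prev_ = idx;
    return *it_;
  }

 private:
  CountingIterator it_;
  std::ptrdiff_t prev_;
};

struct StepCounts {
  std::size_t fresh;
  std::size_t cursor;
};

// Read elements 0..n-1 in order with both strategies and report the work.
StepCounts compare_strategies(std::size_t n) {
  std::vector<std::string> data;
  for (std::size_t i = 0; i < n; ++i) data.push_back(std::to_string(i));

  StepCounts counts{0, 0};
  SubsetCursor cursor(data, counts.cursor);
  for (std::size_t i = 0; i < n; ++i) {
    const std::ptrdiff_t idx = static_cast<std::ptrdiff_t>(i);
    const std::string a = get_fresh(data, idx, counts.fresh);
    const std::string& b = cursor.get(idx);
    assert(a == b);  // both strategies must return the same value
  }
  return counts;
}
```

In this toy model, `compare_strategies(1000)` reports 499500 steps for the fresh-iterator strategy and 999 for the cursor, i.e. quadratic versus linear work for a sequential read. That is the same shape as the 70 s versus 0.12 s timings above, although the real cost in vroom also includes constructing the temporary iterators.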

jimhester added the performance 🚀 label Oct 4, 2021

jimhester closed this as completed in 96f7fe8 Oct 7, 2021
