Reading directly to data.frames seems slower with version 1.4.0 #309

jonkeane · 2021-02-11T16:15:39Z

Hello,

@nealrichardson and I have been working on some benchmark tooling to help us benchmark our work in the Arrow project. In the process we've seen some performance changes in vroom that surprised us: with version 1.4.0 we're seeing a pretty large (3-4x) jump in total time to read some large files into a data.frame (i.e. with altrep = FALSE). We've tried a few different CSVs to make sure it's not something specific to these files (happy to send along links to the files and more timings from them).

We've also tried separating these csvs by data type (i.e. pulling all of the character columns out of one and saving that as a separate csv and reading it in) to see if those have any impact and it looks like the problem is more pronounced for numeric and numeric-like types (integers, floats, and datetimes). The character-only csv was read in at similar speeds across the two versions (again, happy to send our code/timings for this as well).

Using the taxi data from January 2013 from the vroom benchmarks:

> library("vroom")
> packageVersion("vroom")
[1] ‘1.4.0’
> system.time(taxis <- vroom("source_data/vroom_nyctaxi.csv", altrep = FALSE))
Rows: 14,776,615
Columns: 11
Delimiter: ","
chr  [4]: medallion, hack_license, vendor_id, payment_type
dbl  [6]: fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
dttm [1]: pickup_datetime

Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
   user  system elapsed 
157.848   3.149  28.529

And with 1.3.2 (installed with install.packages("vroom", repos = "https://mran.microsoft.com/snapshot/2021-01-30")):

> library("vroom")
> packageVersion("vroom")
[1] ‘1.3.2’
> system.time(taxis <- vroom("source_data/vroom_nyctaxi.csv", altrep = FALSE))
Rows: 14,776,615
Columns: 11
Delimiter: ","
chr  [4]: medallion, hack_license, vendor_id, payment_type
dbl  [6]: fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
dttm [1]: pickup_datetime

Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
   user  system elapsed 
 27.860   1.405   9.333

jimhester · 2021-02-11T18:20:52Z

Thanks for noticing this and opening the issue!

This was an inadvertent performance regression caused by 12f7152. We added a template function to handle parsing errors and some of the arguments were being passed by value rather than being passed by reference, so lots of unnecessary copies were being made.

This didn't happen to the character data because the template function was not used in that case.

After c9021fc the performance for this case again seems roughly comparable to 1.3.2.
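To illustrate the kind of by-value versus by-reference difference described above, here is a minimal C++ sketch. It is not the vroom source: `ParseContext`, `parse_by_value`, `parse_by_ref`, and `copies_for` are hypothetical names invented for this example, and the copy counter exists only to make the extra copies visible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Counts copies of ParseContext so the difference is observable.
static std::size_t g_copies = 0;

// Hypothetical stand-in for a heavyweight argument handed to a parser.
struct ParseContext {
  std::vector<std::string> na_values{"", "NA"};
  ParseContext() = default;
  ParseContext(const ParseContext& other) : na_values(other.na_values) {
    ++g_copies;
  }
};

// Returns true only if the whole field parses as a double.
inline bool parse_double(const std::string& field, double& out) {
  try {
    std::size_t pos = 0;
    out = std::stod(field, &pos);
    return pos == field.size();
  } catch (const std::exception&) {
    return false;
  }
}

// Shared logic: NA strings become NaN, bad input triggers the error callback.
template <typename ErrFn>
double parse_impl(const std::string& field, const ParseContext& ctx,
                  ErrFn& on_error) {
  const auto& na = ctx.na_values;
  if (std::find(na.begin(), na.end(), field) != na.end()) {
    return std::numeric_limits<double>::quiet_NaN();
  }
  double out = 0.0;
  if (!parse_double(field, out)) {
    on_error(field);
    return std::numeric_limits<double>::quiet_NaN();
  }
  return out;
}

// By value: the string and the context are copied on every call.
template <typename ErrFn>
double parse_by_value(std::string field, ParseContext ctx, ErrFn on_error) {
  return parse_impl(field, ctx, on_error);
}

// By reference: no copies, same result.
template <typename ErrFn>
double parse_by_ref(const std::string& field, const ParseContext& ctx,
                    ErrFn&& on_error) {
  return parse_impl(field, ctx, on_error);
}

// Parses every field and reports how many ParseContext copies were made.
inline std::size_t copies_for(bool by_value,
                              const std::vector<std::string>& fields) {
  ParseContext ctx;
  g_copies = 0;
  std::size_t errors = 0;
  auto on_error = [&errors](const std::string&) { ++errors; };
  for (const auto& f : fields) {
    if (by_value) {
      parse_by_value(f, ctx, on_error);
    } else {
      parse_by_ref(f, ctx, on_error);
    }
  }
  return g_copies;
}
```

In this sketch the by-value variant copies the context once per field, so the cost scales with the number of parsed cells, while the by-reference variant makes no copies at all. A path that never calls the template, as with the character columns here, would not pay that cost.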

jonkeane · 2021-02-12T15:44:27Z

Thanks for the incredibly quick fix and explanation!

jimhester closed this as completed in c9021fc Feb 11, 2021
