Reading directly to data.frames seems slower with version 1.4.0 #309

Closed
jonkeane opened this issue Feb 11, 2021 · 2 comments

Comments

@jonkeane

Hello,

@nealrichardson and I have been working on some benchmark tooling to help us benchmark our work in the Arrow project. In the process, we've seen some performance changes in vroom that surprised us: with version 1.4.0 we're seeing a pretty large (3-4x) jump in total time to read some large files into a data.frame (i.e. with altrep = FALSE). We've tried a few different csvs to make sure it's not something specific to these files (happy to send along links to the files or more timings from them).

We've also tried separating these csvs by data type (i.e. pulling all of the character columns out of one and saving that as a separate csv and reading it in) to see if those have any impact and it looks like the problem is more pronounced for numeric and numeric-like types (integers, floats, and datetimes). The character-only csv was read in at similar speeds across the two versions (again, happy to send our code/timings for this as well).

Using the taxi data from January 2013 from the vroom benchmarks:

> library("vroom")
> packageVersion("vroom")
[1] ‘1.4.0’
> system.time(taxis <- vroom("source_data/vroom_nyctaxi.csv", altrep = FALSE))
Rows: 14,776,615
Columns: 11
Delimiter: ","
chr  [4]: medallion, hack_license, vendor_id, payment_type
dbl  [6]: fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
dttm [1]: pickup_datetime

Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
   user  system elapsed 
157.848   3.149  28.529 

And with 1.3.2 (installed with install.packages("vroom", repos = "https://mran.microsoft.com/snapshot/2021-01-30")):

> library("vroom")
> packageVersion("vroom")
[1] ‘1.3.2’
> system.time(taxis <- vroom("source_data/vroom_nyctaxi.csv", altrep = FALSE))
Rows: 14,776,615
Columns: 11
Delimiter: ","
chr  [4]: medallion, hack_license, vendor_id, payment_type
dbl  [6]: fare_amount, surcharge, mta_tax, tip_amount, tolls_amount, total_amount
dttm [1]: pickup_datetime

Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
   user  system elapsed 
 27.860   1.405   9.333 
@jimhester
Collaborator

jimhester commented Feb 11, 2021

Thanks for noticing this and opening the issue!

This was an inadvertent performance regression caused by 12f7152. We added a template function to handle parsing errors and some of the arguments were being passed by value rather than being passed by reference, so lots of unnecessary copies were being made.

This didn't happen to the character data because the template function was not used in that case.

After c9021fc the performance for this case again seems roughly comparable to 1.3.2.
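
To illustrate the kind of regression described above, here is a minimal C++ sketch of a template function that takes a large argument by value versus by const reference. It is not vroom's actual code: `Context`, `parse_by_value`, `parse_by_ref`, and `count_copies` are hypothetical names invented for this example, and the copy counter exists only to make the extra copies visible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a large argument (not vroom's actual type).
// It counts copy constructions so the cost of pass-by-value is visible.
struct Context {
    static std::size_t& copies() {
        static std::size_t n = 0;
        return n;
    }
    std::vector<char> buffer;
    explicit Context(std::size_t size) : buffer(size, 'x') {}
    Context(const Context& other) : buffer(other.buffer) { ++copies(); }
};

// Takes the context by value: every call copies the whole buffer.
template <typename Parser>
double parse_by_value(Context ctx, const std::string& field, Parser parse) {
    (void)ctx;  // unused here; a real parser would consult it on errors
    return parse(field);
}

// Takes the context by const reference: no copy is made per call.
template <typename Parser>
double parse_by_ref(const Context& ctx, const std::string& field, Parser parse) {
    (void)ctx;
    return parse(field);
}

struct CopyCounts {
    std::size_t by_value;
    std::size_t by_ref;
};

// Parses n_fields numeric fields both ways and reports how many
// Context copies each calling convention triggered.
inline CopyCounts count_copies(std::size_t n_fields) {
    Context ctx(1024);
    auto parse = [](const std::string& s) { return std::stod(s); };

    Context::copies() = 0;
    for (std::size_t i = 0; i < n_fields; ++i) {
        parse_by_value(ctx, "1.5", parse);
    }
    const std::size_t by_value = Context::copies();

    Context::copies() = 0;
    for (std::size_t i = 0; i < n_fields; ++i) {
        parse_by_ref(ctx, "1.5", parse);
    }
    const std::size_t by_ref = Context::copies();

    return {by_value, by_ref};
}
```

Calling `count_copies(n)` from a `main` function should report one `Context` copy per parsed field for the by-value version and none for the by-reference version. With millions of numeric fields, per-call copies of that kind add up, which is consistent with character columns being unaffected when they do not go through the template function.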

@jonkeane
Author

Thanks for the incredibly quick fix and explanation!
