@nealrichardson and I have been working on some benchmark tooling to help us benchmark our work in the Arrow project. In the process we've seen some performance changes in vroom that surprised us: with version 1.4.0 we're seeing a pretty large (3-4x) jump in total time to read some large files into a data.frame (i.e. with `altrep = FALSE`). We've tried a few different CSVs to make sure it's not something specific to these files (happy to send along links to the files/more timings from them).
We've also tried separating these CSVs by data type (i.e. pulling all of the character columns out of one, saving those as a separate CSV, and reading it in) to see whether that has any impact, and it looks like the problem is more pronounced for numeric and numeric-like types (integers, floats, and datetimes). The character-only CSV was read at similar speeds across the two versions (again, happy to send our code/timings for this as well).
This was an inadvertent performance regression caused by 12f7152. We added a template function to handle parsing errors, and some of its arguments were being passed by value rather than by reference, so lots of unnecessary copies were being made.
This didn't affect the character data because the template function was not used in that case.
After c9021fc, performance in this case again seems roughly comparable to 1.3.2.
Using the taxi data from January 2013 from the vroom benchmarks:
And with 1.3.2 (installed with `install.packages("vroom", repos = "https://mran.microsoft.com/snapshot/2021-01-30")`):