
CSV speed improvements and support for large CSV files #532

Merged
merged 5 commits into stan-dev:for-2.18 on Jul 14, 2018

Conversation

@aaronjg
Contributor

aaronjg commented May 25, 2018

Summary:

Read in CSVs line by line so that CSV files with more than
INT_MAX entries can be supported properly, since base R's
scan function does not have long vector support.

This is also more performant, since we can preallocate
the data.frame and fill it in row by row, rather than
reading in the entire CSV, converting it into a matrix,
and then into a data.frame. We also read the comment lines
and data lines in one pass through the file rather than in
separate passes.
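A minimal sketch of the approach (not the PR's exact code; the file name, the fixed row count, and the absence of comment-line handling are simplifying assumptions):

con <- file("output.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
n_rows <- 100  # assumed known (or estimated) number of draws
buffer <- as.data.frame(matrix(NA_real_, nrow = n_rows, ncol = length(header)))
names(buffer) <- header
for (i in seq_len(n_rows)) {
  # scan() reads a single line at a time, so no whole-file copy is made
  row <- scan(con, what = double(), sep = ",", nlines = 1, quiet = TRUE)
  if (length(row) == 0) break  # e.g. a trailing blank line or EOF
  buffer[i, ] <- row
}
close(con)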

Intended Effect:

Fix the size limit on read_stan_csv (#530) without requiring a patch to base R.

Side Effects:

CSV files are faster to read in even when they do not exceed the maximum size limit.

How to Verify:

Create a large file via CmdStan and read it in with read_stan_csv.
Reading 100 iterations from the following model:

parameters {
  vector[10000] theta;
} 
model {
  target += -0.5*dot_self(theta);
}
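For reference, the output file can be generated roughly as follows before timing the read (a sketch: the compiled binary name ./theta and the argument values are assumptions; adapt to your CmdStan setup):

# Compile the model above with CmdStan as ./theta, then run:
#   ./theta sample num_samples=100 output file=output.csv
# Confirm the file exists before timing read_stan_csv():
file.exists("output.csv")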

Old:

system.time(x <- read_stan_csv("output.csv"))
   user  system elapsed
  3.333   0.056   5.191

New:

system.time(x <- read_stan_csv("output.csv"))
   user  system elapsed
  2.148   0.049   3.409

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company):

Aaron Goodman

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

aaronjg added some commits May 23, 2018

CSV speed improvements and support for large CSV files
Read in CSVs line by line so that CSV files with more than
INT_MAX entries can be supported properly, since base R's
scan function does not have long vector support.

This also saves memory, since we can preallocate the
data.frame and then fill it in row by row, rather than
reading in the entire CSV, converting it into a matrix,
and then into a data.frame. We also read the comment lines
and data lines in one pass through the file rather than
separate passes.
Decrease memory usage when reading in a CSV
Use vapply on the sample data frame to compute the summary
statistics rather than apply. apply() coerces the data frame
into a matrix, which results in an extra copy of the entire
data frame. With vapply(), each column is processed one at
a time.
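A small illustration of the difference (toy data, not the PR's code):

df <- as.data.frame(matrix(rnorm(1e4), ncol = 10))
m1 <- apply(df, 2, mean)            # coerces df to a matrix: an extra full copy
m2 <- vapply(df, mean, numeric(1))  # visits one column at a time, no coercion
all.equal(m1, m2)                   # TRUE: same result, lower peak memory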
@aaronjg


Contributor

aaronjg commented May 25, 2018

My patch was applied to base R (wch/r-source@86f7ea9), so the root cause is fixed. However, this PR is still a speed improvement, and it reduces the number of times the sample data frame must be copied in memory, so you no longer need 2x+ the size of the samples in available RAM to load the sample file.

@aaronjg


Contributor

aaronjg commented May 26, 2018

It looks like this is hitting a corner case in R-3.5.0's new file buffering from wch/r-source@72d4008. I filed a bug report with R-devel, but this shouldn't be merged until the next patch release of R.

aaronjg added some commits May 26, 2018

Change buffered-file workaround to apply on R >= 3.5.0
The workaround for the buffered-file bug now applies when
using R >= 3.5.0. This fixes the issue for people on R-devel
or R-patched. Once the bug is fixed upstream, the check can be
adjusted to account for the R version in which it is resolved.
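In sketch form, the version gate looks something like this (a hypothetical check, not the PR's exact code):

# Apply the buffered-file workaround only on affected R versions;
# getRversion() returns a version object comparable with strings.
if (getRversion() >= "3.5.0") {
  # ... workaround for the file-buffering bug goes here ...
}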
@aaronjg


Contributor

aaronjg commented May 28, 2018

I added a workaround for newer versions of R, so the new CSV reading also works on R >= 3.5.0.

Use binary file connection to avoid R-3.5.0 workaround
Performance is comparable on R-3.4.4 (1.8 seconds for the
100-iteration, 10,000-parameter file), and slightly improved
on R-3.5.0 (1.4 seconds vs. 1.7 seconds).
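The gist of the binary-connection approach (a sketch under the assumption that the file is a plain ASCII CSV):

# Opening the file in binary mode ("rb") sidesteps R-3.5.0's text-mode
# buffering; readLines() and scan() still work on binary connections.
con <- file("output.csv", open = "rb")
first_line <- readLines(con, n = 1)
close(con)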

@bgoodri bgoodri merged commit b105586 into stan-dev:for-2.18 Jul 14, 2018

0 of 2 checks passed:
continuous-integration/jenkins/pr-merge: This commit cannot be built
continuous-integration/travis-ci/pr: The Travis CI build could not complete due to an error
@bgoodri


Contributor

bgoodri commented Aug 8, 2018

@aaronjg When I merged this, it was passing the tests, but a bunch of things got messed up subsequently. I tried to fix everything, but now it seems that the example and tests for read_stan_csv are failing on develop with the message "Error in row.buffer[buffer.pointer, ] <- scan(con, nlines = 1, sep = ",", : replacement has length zero". Can you take a look at it and let me know what I need to do to fix it? Sorry.

@aaronjg


Contributor

aaronjg commented Aug 9, 2018

Sure. Where are the test cases that are failing? I looked at "./rstan/example/testreadcsv.R" but that seems to be somewhat out of date.

@bgoodri


Contributor

bgoodri commented Aug 9, 2018

@aaronjg


Contributor

aaronjg commented Aug 9, 2018

There was an extra LF at the end of the example CSV. I don't think Stan generates CSVs with a trailing LF anymore, but I am handling it just in case.

Fixed with commit 1c3236e
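The guard in sketch form (variable names follow the error message quoted above; toy input, not the committed code):

con <- textConnection("1,2,3\n")      # toy CSV with a trailing LF
row.buffer <- data.frame(a = NA_real_, b = NA_real_, c = NA_real_)
buffer.pointer <- 1
row <- scan(con, what = double(), nlines = 1, sep = ",", quiet = TRUE)
if (length(row) > 0)                  # scan() returns a zero-length vector
  row.buffer[buffer.pointer, ] <- row # at an empty trailing line or EOF
close(con)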

@bgoodri


Contributor

bgoodri commented Aug 9, 2018
