This issue is used to track the work needed to use vroom as a readr backend.
Error reporting
We need a framework to report shape errors (wrong number of columns) and parsing errors (wrong type of value) in a similar format to readr. This is complicated by multi-threading and lazy reading. One way to handle this is a class that locks updates with a mutex, and probably issues a warning when the first problem is found.
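A minimal sketch of that idea, in Python rather than vroom's C++, with hypothetical names (`ProblemLog` is not a real vroom class): updates are serialized with a mutex, and a warning is emitted only when the first problem is recorded.

```python
# Illustrative sketch, not vroom's actual code: a thread-safe problem
# collector that locks updates with a mutex and warns exactly once.
import threading
import warnings

class ProblemLog:
    def __init__(self):
        self._lock = threading.Lock()
        self._problems = []

    def add(self, row, col, expected, actual):
        with self._lock:
            if not self._problems:
                # Warn only when the first issue is found.
                warnings.warn("One or more parsing issues; see problems()")
            self._problems.append((row, col, expected, actual))

    def problems(self):
        with self._lock:
            return list(self._problems)

log = ProblemLog()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    log.add(1, 2, "a double", "x")
    log.add(3, 1, "an integer", "y")
# Two problems recorded, but only one warning emitted.
assert len(log.problems()) == 2
assert len(caught) == 1
```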
This is also complicated because the code often doesn't know the row, the column, or the file it is reading. E.g. when indexing in parallel we don't know the row, because it depends on how many rows the other threads' chunks contain. When reading values from ALTREP vectors we don't know the column, because the column type doesn't store which index it corresponds to.
Embedded newlines
Currently, to handle embedded newlines we say you need to disable multi-threading. This is unfortunately not robust enough, and we should instead do something similar to Ge, Li, Eilebrecht, Chandramouli, and Kossmann, "Speculative Distributed CSV Data Parsing for Big Data Analytics" (SIGMOD 2019).
A simple way to handle this would be to throw an error if multiple threads are in use and a newline is detected inside a quoted field, with an error message saying to re-parse with num_threads = 1. Or possibly we could even re-parse automatically when this occurs.
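The automatic re-parse variant can be sketched as follows (hypothetical names, not vroom's actual indexer): multi-threaded indexing throws as soon as a newline turns up inside a quoted field, and the caller retries with a single thread.

```python
# Sketch of detect-and-re-parse for embedded newlines (illustrative only).
class QuotedNewlineError(Exception):
    pass

def find_newlines(chunk, num_threads):
    in_quote = False
    newlines = []
    for i, ch in enumerate(chunk):
        if ch == '"':
            in_quote = not in_quote
        elif ch == "\n":
            if in_quote:
                if num_threads > 1:
                    # A thread starting mid-file cannot know it is inside
                    # a quoted field, so multi-threaded indexing gives up.
                    raise QuotedNewlineError(i)
                # Single-threaded: the newline is part of the field.
            else:
                newlines.append(i)
    return newlines

def index(chunk, num_threads=4):
    try:
        return find_newlines(chunk, num_threads)
    except QuotedNewlineError:
        # Automatically restart with a single thread so the embedded
        # newline is read properly.
        return find_newlines(chunk, num_threads=1)

data = 'a,"embedded\nnewline",b\nc,d,e\n'
assert index(data) == [22, 28]  # record boundaries only
```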
Alternatively we could employ the parallel two-pass algorithm from the paper. It is slower than the speculative one, but simpler to implement.
Quoting behavior
Currently vroom handles quoting and escaping fairly naively: it allows a quote to occur anywhere in a field, rather than only immediately before or after a delimiter. With the default quote option, this unfortunately means that a field containing a single unpaired quote causes the rest of the file to be treated as quoted. We need to restrict quote handling to quotes immediately before or after a delimiter.
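The stricter rule can be sketched like this (an illustrative field splitter, not vroom's parser): a quote only opens a field when it is the first byte after a delimiter or record start, and only closes one when the next byte is a delimiter or record end, so a stray mid-field quote stays literal.

```python
# Sketch: quotes are only significant adjacent to a delimiter.
def split_record(line, delim=","):
    fields, field, i, n = [], [], 0, len(line)
    while i < n:
        if line[i] == '"' and not field:
            # Quote right after a delimiter (or record start): opens.
            i += 1
            while i < n:
                if line[i] == '"' and (i + 1 == n or line[i + 1] == delim):
                    break  # quote right before a delimiter: closes
                field.append(line[i])
                i += 1
            i += 1  # skip the closing quote
        elif line[i] == delim:
            fields.append("".join(field))
            field = []
            i += 1
        else:
            field.append(line[i])
            i += 1
    fields.append("".join(field))
    return fields

# A stray mid-field quote no longer quotes the rest of the input:
assert split_record('a,b"c,d') == ["a", 'b"c', "d"]
assert split_record('a,"x,y",d') == ["a", "x,y", "d"]
```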
On-disk indexes
The current index is always stored in memory, which can be prohibitive for very large files. Experiments using an mmapped file instead showed faster index creation with only a minimal difference in value extraction time. This would also scale to arbitrarily large files (assuming you only access part of the file). The major downside is that we then also need to hold file handles for the indexes, which makes the file handle problem below worse. There is also the question of where the index should be stored, probably tempdir(), but what happens if there is not enough space there?
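A toy version of an on-disk index, in Python for illustration (the file layout and use of a temporary file are assumptions, not vroom's actual format): newline offsets are written as fixed-width integers to a scratch file and memory-mapped back for random access, so the index never has to live in memory.

```python
# Sketch of a memory-mapped on-disk index of newline offsets.
import mmap
import struct
import tempfile

RECORD = struct.Struct("<Q")  # one little-endian 64-bit offset per newline

def build_index(data: bytes, index_file):
    for pos, byte in enumerate(data):
        if byte == ord("\n"):
            index_file.write(RECORD.pack(pos))
    index_file.flush()

def open_index(index_file):
    # Map the whole index file read-only; the OS pages it in on demand.
    return mmap.mmap(index_file.fileno(), 0, access=mmap.ACCESS_READ)

data = b"a,b,c\nd,e,f\ng,h,i\n"
with tempfile.TemporaryFile() as f:
    build_index(data, f)
    idx = open_index(f)
    n = len(idx) // RECORD.size
    offsets = [RECORD.unpack_from(idx, i * RECORD.size)[0] for i in range(n)]
    assert offsets == [5, 11, 17]
    # Extract the second record using only the mapped index:
    assert data[offsets[0] + 1 : offsets[1]] == b"d,e,f"
    idx.close()
```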
File handles
Vroom currently holds an open file handle for every index. If we use the on-disk index, each thread would have a separate file handle, so we would have num_threads * num_files potential open file handles. Ideally we would hold at most num_threads handles open at a time. One upside of keeping file handles open is that it ensures the files remain until all data is read from them; if we don't keep an open file handle, a file could be deleted out from under us.
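Capping open handles could look something like this sketch (hypothetical `HandlePool`, in Python for illustration): an LRU pool lends out at most `max_open` file objects and reopens evicted paths on demand. Note this sketch deliberately ignores the deletion hazard described above.

```python
# Sketch of bounding open file handles with an LRU pool.
import os
import tempfile
import threading
from collections import OrderedDict

class HandlePool:
    """Lends out at most `max_open` file objects, reopening on demand."""
    def __init__(self, max_open):
        self._max_open = max_open
        self._lock = threading.Lock()
        self._open = OrderedDict()  # path -> open file, in LRU order

    def get(self, path):
        with self._lock:
            if path in self._open:
                self._open.move_to_end(path)  # mark as recently used
                return self._open[path]
            if len(self._open) >= self._max_open:
                _, oldest = self._open.popitem(last=False)
                oldest.close()  # evict the least recently used handle
            fh = open(path, "rb")
            self._open[path] = fh
            return fh

# Three files, but never more than two handles open at once.
paths = []
for _ in range(3):
    fd, p = tempfile.mkstemp()
    os.write(fd, b"x")
    os.close(fd)
    paths.append(p)
pool = HandlePool(max_open=2)
for p in paths:
    assert pool.get(p).read(1) == b"x"
assert len(pool._open) == 2
for p in paths:
    os.remove(p)
```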
NA handling
Currently we only check for NA values in character and factor columns; other column types do not explicitly check for them. This wasn't a big deal until we gained better error reporting, but now these values will mistakenly show up in the problems dataset. This is somewhat tricky to do: the NA values are supplied in the native encoding, the fields are parsed as raw bytes, and we have to somehow detect the NA values without re-encoding every value.
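One way around the re-encoding cost, sketched here in Python with hypothetical names: encode the NA strings once into the file's encoding up front, then match each parsed field by raw byte comparison.

```python
# Sketch: encode NA strings once, then compare raw field bytes.
def make_na_matcher(na_values, file_encoding):
    # One-time conversion of the NA strings (supplied in the native
    # encoding) into the bytes they would have in the file.
    na_bytes = {v.encode(file_encoding) for v in na_values}

    def is_na(raw_field: bytes) -> bool:
        return raw_field in na_bytes

    return is_na

is_na = make_na_matcher(["NA", "täst"], "latin-1")
assert is_na(b"NA")
assert is_na("täst".encode("latin-1"))
assert not is_na("täst".encode("utf-8"))  # byte-level, not textual, match
```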
If a quoted newline is encountered we throw an exception, which is caught, and then restart the indexing with only a single thread so the newline is read properly.
Part of #282