Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vroom as a readr backend #282

Closed
4 of 6 tasks
jimhester opened this issue Dec 7, 2020 · 0 comments
Closed
4 of 6 tasks

vroom as a readr backend #282

jimhester opened this issue Dec 7, 2020 · 0 comments
Labels
feature a feature request or enhancement readr 📖 Issues related to readr compatibility

Comments

@jimhester
Copy link
Collaborator

jimhester commented Dec 7, 2020

This issue is uses to track issues needed to use vroom as a readr backend.

Error reporting

We need a framework report shape (wrong number of columns) and parsing errors (wrong type of value) in similar format as readr. This is complicated because of multi-threading and lazy reading. A way to handle this is a class that locks updates with a mutex, and probably issue a warning when the first issue is found.

This is also complicated because often the code doesn't know the row, or the column, or the file it is reading. e.g. when indexing in parallel we don't know the row, because it depends on how many rows are in other threads. When reading values from altrep vectors we don't know the column, because the column type doesn't store what index it corresponds to.

Embedded newlines

Currently to handle embedded newlines we say you need to disable multi-threading. This is unfortunately not robust enough, and we should instead do something similar to Ge, Chang and Li, Yinan and Eilebrecht, Eric and Chandramouli, Badrish and Kossmann, Donald, Speculative Distributed CSV Data Parsing for Big Data Analytics, SIGMOD 2019.

A simple way to handle this would be to throw an error if you are using multiple threads and a newline is detected inside a quoted field, with the error message saying you need to re-parse with num_threads = 1. Or possibly we could even automatically re-parse ourselves when this occurs.

Alternatively we could employ the parallel 2 pass algorithm from the paper. It is slower than the speculative one, but simpler and easier to implement.

Quoting behavior

Currently vroom uses a fairly naive way of handling and escaping quotes. It allows a quote to occur anywhere in a field, rather than only right before or after a delimiter. This unfortunately means with the default quote option if there is a field with only a single unpaired quote the rest of the file is treated as quoted. We need to restrict this to only looking at quotes before or after a delimiter.

On-disk indexes

The current index is always stored in memory, which can be prohibitive for very large files. Experiments using a mmapped file instead showed faster index creation times with only minimal difference in value extraction time. This would also scale to arbitrarily big files (assuming you are only using part of the file). The major downside is you then also need to hold file handles for the indexes, which makes that problem worse. There is also a question of where the index should be stored, probably tempdir(), but what happens if there is not enough space there?

File handles

Vroom currently holds an open file handle for every index. If we use the on-disk index each thread would have a separate file handle. So we would have num_threads * num_files potential open file handles. Ideally we would hold at max num_threads handles open at a time. One downside to not keeping open file handles is this ensures the files remain until all data is read from them. If we don't keep an open file handle the file could be deleted out from under us.

NA handling

currently we only check for NA values for character and factor columns. Other column types do not explicitly check for the NA values. This wasn't a big deal until we gained better error reporting, but now these values will mistakenly show up in the problems dataset. This is somewhat trick to do, as we supply the NA values in the native encoding, the values are parsed as raw bytes, and we have to somehow detect the NA values without running re-encoding every value.

@jimhester jimhester added the feature a feature request or enhancement label Dec 8, 2020
jimhester added a commit that referenced this issue Dec 8, 2020
jimhester added a commit that referenced this issue Dec 14, 2020
Because we don't want to include it as a problem in this case.

Fixes part of #282
jimhester added a commit that referenced this issue Dec 18, 2020
If a quoted newline is encountered we throw an exception which is caught
and then restart the indexing with only a single thread so the newline
is read properly.

Part of #282
@jimhester jimhester added the readr 📖 Issues related to readr compatibility label Jan 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement readr 📖 Issues related to readr compatibility
Projects
None yet
Development

No branches or pull requests

1 participant