This issue is used to track the work needed to use vroom as a readr backend.
Error reporting
We need a framework to report shape errors (wrong number of columns) and parsing errors (wrong type of value) in a similar format to readr. This is complicated by multi-threading and lazy reading. One way to handle this is a class that locks updates with a mutex, and probably issues a warning when the first problem is found.
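A minimal sketch of that idea, in Python rather than vroom's C++, with hypothetical names (`ProblemLog` is not a real vroom class): updates are serialized with a mutex, and a warning is emitted only when the first problem is recorded.

```python
# Illustrative sketch, not vroom's actual code: a thread-safe problem
# collector that locks updates with a mutex and warns exactly once.
import threading
import warnings

class ProblemLog:
    def __init__(self):
        self._lock = threading.Lock()
        self._problems = []

    def add(self, row, col, expected, actual):
        with self._lock:
            if not self._problems:
                # Warn only when the first issue is found.
                warnings.warn("One or more parsing issues; see problems()")
            self._problems.append((row, col, expected, actual))

    def problems(self):
        with self._lock:
            return list(self._problems)

log = ProblemLog()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    log.add(1, 2, "a double", "x")
    log.add(3, 1, "an integer", "y")
# Two problems recorded, but only one warning emitted.
assert len(log.problems()) == 2
assert len(caught) == 1
```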
This is also complicated because the code often doesn't know the row, the column, or the file it is reading. E.g. when indexing in parallel we don't know the row, because it depends on how many rows the other threads' chunks contain. When reading values from ALTREP vectors we don't know the column, because the column type doesn't store which index it corresponds to.
Embedded newlines
Currently, to handle embedded newlines we say you need to disable multi-threading. This is unfortunately not robust enough, and we should instead do something similar to Ge, Li, Eilebrecht, Chandramouli, and Kossmann, "Speculative Distributed CSV Data Parsing for Big Data Analytics" (SIGMOD 2019).
A simple way to handle this would be to throw an error if multiple threads are in use and a newline is detected inside a quoted field, with an error message saying to re-parse with num_threads = 1. Or possibly we could even re-parse automatically when this occurs.
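The automatic re-parse variant can be sketched as follows (hypothetical names, not vroom's actual indexer): multi-threaded indexing throws as soon as a newline turns up inside a quoted field, and the caller retries with a single thread.

```python
# Sketch of detect-and-re-parse for embedded newlines (illustrative only).
class QuotedNewlineError(Exception):
    pass

def find_newlines(chunk, num_threads):
    in_quote = False
    newlines = []
    for i, ch in enumerate(chunk):
        if ch == '"':
            in_quote = not in_quote
        elif ch == "\n":
            if in_quote:
                if num_threads > 1:
                    # A thread starting mid-file cannot know it is inside
                    # a quoted field, so multi-threaded indexing gives up.
                    raise QuotedNewlineError(i)
                # Single-threaded: the newline is part of the field.
            else:
                newlines.append(i)
    return newlines

def index(chunk, num_threads=4):
    try:
        return find_newlines(chunk, num_threads)
    except QuotedNewlineError:
        # Automatically restart with a single thread so the embedded
        # newline is read properly.
        return find_newlines(chunk, num_threads=1)

data = 'a,"embedded\nnewline",b\nc,d,e\n'
assert index(data) == [22, 28]  # record boundaries only
```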
Alternatively we could employ the parallel two-pass algorithm from the paper. It is slower than the speculative one, but simpler to implement.
Quoting behavior
Currently vroom handles quoting and escaping fairly naively: it allows a quote to occur anywhere in a field, rather than only immediately before or after a delimiter. With the default quote option, this unfortunately means that a field containing a single unpaired quote causes the rest of the file to be treated as quoted. We need to restrict quote handling to quotes immediately before or after a delimiter.
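The stricter rule can be sketched like this (an illustrative field splitter, not vroom's parser): a quote only opens a field when it is the first byte after a delimiter or record start, and only closes one when the next byte is a delimiter or record end, so a stray mid-field quote stays literal.

```python
# Sketch: quotes are only significant adjacent to a delimiter.
def split_record(line, delim=","):
    fields, field, i, n = [], [], 0, len(line)
    while i < n:
        if line[i] == '"' and not field:
            # Quote right after a delimiter (or record start): opens.
            i += 1
            while i < n:
                if line[i] == '"' and (i + 1 == n or line[i + 1] == delim):
                    break  # quote right before a delimiter: closes
                field.append(line[i])
                i += 1
            i += 1  # skip the closing quote
        elif line[i] == delim:
            fields.append("".join(field))
            field = []
            i += 1
        else:
            field.append(line[i])
            i += 1
    fields.append("".join(field))
    return fields

# A stray mid-field quote no longer quotes the rest of the input:
assert split_record('a,b"c,d') == ["a", 'b"c', "d"]
assert split_record('a,"x,y",d') == ["a", "x,y", "d"]
```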
On-disk indexes
The current index is always stored in memory, which can be prohibitive for very large files. Experiments using an mmapped file instead showed faster index creation with only a minimal difference in value extraction time. This would also scale to arbitrarily large files (assuming you only access part of the file). The major downside is that we then also need to hold file handles for the indexes, which makes the file handle problem below worse. There is also the question of where the index should be stored, probably tempdir(), but what happens if there is not enough space there?
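A toy version of an on-disk index, in Python for illustration (the file layout and use of a temporary file are assumptions, not vroom's actual format): newline offsets are written as fixed-width integers to a scratch file and memory-mapped back for random access, so the index never has to live in memory.

```python
# Sketch of a memory-mapped on-disk index of newline offsets.
import mmap
import struct
import tempfile

RECORD = struct.Struct("<Q")  # one little-endian 64-bit offset per newline

def build_index(data: bytes, index_file):
    for pos, byte in enumerate(data):
        if byte == ord("\n"):
            index_file.write(RECORD.pack(pos))
    index_file.flush()

def open_index(index_file):
    # Map the whole index file read-only; the OS pages it in on demand.
    return mmap.mmap(index_file.fileno(), 0, access=mmap.ACCESS_READ)

data = b"a,b,c\nd,e,f\ng,h,i\n"
with tempfile.TemporaryFile() as f:
    build_index(data, f)
    idx = open_index(f)
    n = len(idx) // RECORD.size
    offsets = [RECORD.unpack_from(idx, i * RECORD.size)[0] for i in range(n)]
    assert offsets == [5, 11, 17]
    # Extract the second record using only the mapped index:
    assert data[offsets[0] + 1 : offsets[1]] == b"d,e,f"
    idx.close()
```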
File handles
Vroom currently holds an open file handle for every index. If we use the on-disk index, each thread would have a separate file handle, so we would have num_threads * num_files potential open file handles. Ideally we would hold at most num_threads handles open at a time. One upside of keeping file handles open is that it ensures the files remain until all data is read from them; if we don't keep an open file handle, a file could be deleted out from under us.
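Capping open handles could look something like this sketch (hypothetical `HandlePool`, in Python for illustration): an LRU pool lends out at most `max_open` file objects and reopens evicted paths on demand. Note this sketch deliberately ignores the deletion hazard described above.

```python
# Sketch of bounding open file handles with an LRU pool.
import os
import tempfile
import threading
from collections import OrderedDict

class HandlePool:
    """Lends out at most `max_open` file objects, reopening on demand."""
    def __init__(self, max_open):
        self._max_open = max_open
        self._lock = threading.Lock()
        self._open = OrderedDict()  # path -> open file, in LRU order

    def get(self, path):
        with self._lock:
            if path in self._open:
                self._open.move_to_end(path)  # mark as recently used
                return self._open[path]
            if len(self._open) >= self._max_open:
                _, oldest = self._open.popitem(last=False)
                oldest.close()  # evict the least recently used handle
            fh = open(path, "rb")
            self._open[path] = fh
            return fh

# Three files, but never more than two handles open at once.
paths = []
for _ in range(3):
    fd, p = tempfile.mkstemp()
    os.write(fd, b"x")
    os.close(fd)
    paths.append(p)
pool = HandlePool(max_open=2)
for p in paths:
    assert pool.get(p).read(1) == b"x"
assert len(pool._open) == 2
for p in paths:
    os.remove(p)
```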
NA handling
Currently we only check for NA values in character and factor columns; other column types do not explicitly check for them. This wasn't a big deal until we gained better error reporting, but now these values will mistakenly show up in the problems dataset. This is somewhat tricky to do: the NA values are supplied in the native encoding, the fields are parsed as raw bytes, and we have to somehow detect the NA values without re-encoding every value.
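One way around the re-encoding cost, sketched here in Python with hypothetical names: encode the NA strings once into the file's encoding up front, then match each parsed field by raw byte comparison.

```python
# Sketch: encode NA strings once, then compare raw field bytes.
def make_na_matcher(na_values, file_encoding):
    # One-time conversion of the NA strings (supplied in the native
    # encoding) into the bytes they would have in the file.
    na_bytes = {v.encode(file_encoding) for v in na_values}

    def is_na(raw_field: bytes) -> bool:
        return raw_field in na_bytes

    return is_na

is_na = make_na_matcher(["NA", "täst"], "latin-1")
assert is_na(b"NA")
assert is_na("täst".encode("latin-1"))
assert not is_na("täst".encode("utf-8"))  # byte-level, not textual, match
```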
If a quoted newline is encountered we throw an exception, which is caught, and then restart the indexing with only a single thread so the newline is read properly.
Part of #282