Implement parse-dsv #246
Conversation
@rei2hu Whoa! This is awwwwwwwesssommmme! This has definitely been high on the wish list for some time! Will take a deeper look this evening!
I've pushed the tests I've been using (I realize I'll have to rewrite their structure too eventually). Six of them are failing, and I have some brief comments explaining why each one fails except for
@rei2hu I suppose initial feedback would be whether the file reading can be separated from parsing DSV. My thinking here is whether we might be able to create a parsing engine which "incrementally" parses provided data. Borrowing a bit of inspiration from

```javascript
function createParser( options ) {
    // Internal state configuration...

    return incrparse;

    function incrparse( byte ) {
        // Parse byte and update internal state...
    }

    // Various helper functions for parsing provided data...
}
```

My sense is that an incremental engine of this sort could then be used in multiple contexts (e.g., streams, reading a file, etc.), where wrapper code would read/pull data and then pass it along to the incremental parsing engine. Just a thought.
@rei2hu Considering an incremental parser, I suppose maybe better than a byte-by-byte parser would be an incremental parser which accepts arbitrarily sized chunks (per update).
Alright, I think I got something working for that kind of implementation, but instead of taking in raw bytes it takes in characters. This lets it pass the UTF-8 test without doing anything fancy. Also, I added a quick if statement to handle the test where the delimiter is `;` instead of `,`, so it passes that test now as well.
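For illustration, a minimal closure-based sketch of what a chunk-accepting incremental parser along these lines might look like (this is not the PR's actual code; `createParser`, the `onRow` callback, and the option names are hypothetical, and quoted fields are omitted for brevity):

```javascript
// Hypothetical sketch only, NOT the PR's implementation. Accepts
// arbitrarily sized string chunks and emits each completed row via a
// callback. Handles CR, LF, and CRLF line breaks, including a CRLF
// pair split across two chunks, but ignores quoted fields.
function createParser( opts ) {
    var delim = opts.delimiter || ',';
    var onRow = opts.onRow;
    var sawCR = false; // pending CR from a previous character/chunk?
    var field = '';
    var row = [];
    return incrparse;

    function incrparse( chunk ) {
        var ch;
        var i;
        for ( i = 0; i < chunk.length; i++ ) {
            ch = chunk[ i ];
            if ( sawCR ) {
                sawCR = false;
                if ( ch === '\n' ) {
                    // LF completing a CRLF pair; the row was already emitted:
                    continue;
                }
            }
            if ( ch === delim ) {
                row.push( field );
                field = '';
            } else if ( ch === '\n' || ch === '\r' ) {
                sawCR = ( ch === '\r' );
                row.push( field );
                field = '';
                onRow( row );
                row = [];
            } else {
                field += ch;
            }
        }
    }
}

// Usage: chunks of any size may be fed in, even splitting a CRLF pair:
var rows = [];
var parse = createParser({
    'delimiter': ',',
    'onRow': function onRow( r ) { rows.push( r ); }
});
parse( 'a,b\r' );
parse( '\nc,d\n' );
// rows => [ [ 'a', 'b' ], [ 'c', 'd' ] ]
```

The key design point is that all parse state (pending field, pending CR) lives in the closure, so wrapper code for files or streams only needs to call `incrparse` repeatedly.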
|
```javascript
// bytes should be an array
// an array of characters (strings with length 1)
function incrparse(chars) {
```
Would it make more sense for `chars` to be either a `string` or a `Buffer`/`Uint8Array`? I am thinking from the standpoint of reading data from disk or when wrapped in a stream, etc.
Otherwise, for something like reading data from disk, you need to `buf.toString().split( '' )` in order to feed data to the accumulator.
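For illustration only, the conversion being described, along with a caveat about how `split( '' )` handles characters outside the Basic Multilingual Plane:

```javascript
// Decoding a Buffer read from disk before splitting it into characters
// (illustrative example, not code from the PR):
var buf = Buffer.from( 'α,β\n', 'utf8' );
var chars = buf.toString( 'utf8' ).split( '' );
// chars => [ 'α', ',', 'β', '\n' ]

// Caveat: String#split( '' ) yields UTF-16 code units, so astral code
// points (e.g., many emoji) are split into surrogate halves, whereas
// Array.from() splits by code point:
var units = '🙂'.split( '' );    // surrogate pair, length 2
var points = Array.from( '🙂' ); // single code point, length 1
```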
I actually do want the inner buffer variable to be an instance of Buffer instead of []. I've only been using a normal array to take advantage of shift/unshift/splice methods for an easier time implementing things.
It might bring up issues with UTF-8 though.
@rei2hu Re: shift/unshift operations. Ah, right. Makes sense. My point was more that the values/data provided to `incrparse` can be a string/`Buffer`/`Uint8Array`, similar to the `fs` and `stream` methods found in Node.js. The internal `buffer` variable being a dynamic array is a reasonable choice, I think.
I meant to say that, if I changed the dynamic array to an instance of `Buffer`, I would be able to use `Buffer.concat`, which avoids having to transform the buffer into something else.
Although just transforming the provided buffer into something that can be appended onto the end of a dynamic array is a lot simpler, so I'll probably go with that for now.
and some other lint fixes
and small regex fixes
@rei2hu Is this ready for "formal" review or still a WIP? No pressure. Just want to make sure that you are not blocked and have not been waiting for PR feedback.
I haven't started on the other files typically found in each package (benchmarks, docs, examples), so I think it's still WIP. For the committed files so far, I plan to clean up the comments in the main file and rework some things in the tests (wording, and how the last test suite for delimiter guessing calculates things). The other existing files are complete, which just leaves the README, package.json, and the fixtures, I believe.
and make comments start with uppercase
Closing this in favor of https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/utils/dsv/base/parse.
Resolves #53.

Checklist

Description

This pull request:

Related Issues

This pull request resolves parse-dsv #53.

Questions
What did I use that I shouldn't use for this library, and what are the replacements (e.g., buffer/fs/nested functions, etc.)? I already know const/let and general style are problems.
Any implementation suggestions?
Other
The current implementation pretty strictly follows section 2 of RFC 4180 and works by reading a file byte by byte. I also assume that the delimiter will be 1 byte long. I plan on changing it to handle streams, which should be straightforward enough as it already works by reading a file byte by byte.
Some comments on differences between the current implementation and what the RFC describes:
The implementation allows for CR, LF, or CRLF for line breaks.
I'm ignoring headers for now. Right now the implementation returns a 2d array representing rows/columns.
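A toy illustration of the two deviations just described: CR, LF, or CRLF all terminate a row, and the result is a 2-d array of rows/columns with no header handling. The `toRows` helper is hypothetical and, unlike a full RFC 4180 parser, does not handle quoted fields:

```javascript
// Hypothetical helper illustrating only the described output shape;
// it does not support quoted fields.
function toRows( str, delim ) {
    return str
        .split( /\r\n|\r|\n/ ) // accept CR, LF, or CRLF line breaks
        .filter( function ( line ) { return line.length > 0; } )
        .map( function ( line ) { return line.split( delim || ',' ); } );
}

var out = toRows( 'a,b\r\nc,d\re,f\n' );
// out => [ [ 'a', 'b' ], [ 'c', 'd' ], [ 'e', 'f' ] ]
```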
@stdlib-js/reviewers