
Implement parse-dsv #246

Closed

wants to merge 70 commits

Conversation

@rei2hu (Contributor) commented Nov 5, 2018

Resolves #53.

Checklist

Please ensure the following tasks are completed before submitting this pull request.

  • Read, understood, and followed the contributing guidelines, including the relevant style guides.
  • Read and understood the Code of Conduct.
  • Read and understood the licensing terms.
  • Searched for existing issues and pull requests before submitting this pull request.
  • Filed an issue (or an issue already existed) prior to submitting this pull request.
  • Rebased onto latest develop.
  • Submitted against develop branch.

Description

What is the purpose of this pull request?

This pull request:

  • Implements the parse-dsv function

Related Issues

Does this pull request have any related issues?

This pull request resolves #53.

Questions

Any questions for reviewers of this pull request?

What did I use that I shouldn't use for this library, and what are the replacements? E.g., buffer, fs, nested functions, etc. I already know const/let and general style are problems.

Any implementation suggestions?

Other

Any other information relevant to this pull request? This may include screenshots, references, and/or implementation notes.

The current implementation pretty strictly follows section 2 of RFC 4180 and works by reading a file byte by byte. I also assume that the delimiter will be 1 byte long. I plan on changing it to handle streams, which should be straightforward enough, as it already works by reading the file byte by byte.

Some comments on differences between the current implementation and what the RFC describes:

  1. Each record is located on a separate line, delimited by a line break (CRLF).

The implementation allows for CR, LF, or CRLF for line breaks.

  2. There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type).

I'm ignoring headers for now. Right now the implementation returns a 2D array representing rows and columns; see the sketch below.
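To illustrate the mixed line-break handling and the 2D output (the export name parseDSV and its signature are hypothetical here, not the actual API):

// Hypothetical usage; the real export name and signature may differ:
var parseDSV = require( './lib' );

// CR, LF, and CRLF are all accepted as record separators:
var rows = parseDSV( 'a,b\r\nc,d\ne,f' );
// returns [ [ 'a', 'b' ], [ 'c', 'd' ], [ 'e', 'f' ] ]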


@stdlib-js/reviewers

@kgryte (Member) commented Nov 5, 2018

@rei2hu Whoa! This is awwwwwwwesssommmme! This has definitely been high on the wish list for some time! Will take a deeper look this evening!

@rei2hu (Contributor, Author) commented Nov 5, 2018

I've pushed the tests I've been using (I realize I'll have to rewrite their structure too eventually).

Six of them are failing, and I have brief comments explaining why each one fails, except for regex.csv, which seems to be far outside the bounds of the RFC.

@kgryte (Member) commented Nov 5, 2018

@rei2hu I suppose initial feedback would be whether the file reading can be separated from parsing DSV. My thinking here is whether we might be able to create a parsing engine which "incrementally" parses provided data. Borrowing a bit of inspiration from @stdlib/stats/incr/*, the idea would be something like

function createParser( options ) {
    // Internal state configuration...

    return incrparse;

    function incrparse( byte ) {
        // Parse byte and update internal state...
    }

    // Various helper functions for parsing provided data...
}

My sense is that an incremental engine of this sort could then be used in multiple contexts (e.g., streams, reading a file, etc.), where wrapper code would read/pull data and then pass it along to the incremental parsing engine.

Just a thought.
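For instance, wrapper code reading from disk might drive the engine as follows (a minimal sketch; the wiring is hypothetical and assumes the createParser shape sketched above):

// Assumes createParser as sketched above (hypothetical):
var fs = require( 'fs' );

var parse = createParser( {} );
var stream = fs.createReadStream( './data.csv' );

stream.on( 'data', function onData( chunk ) {
    var i;
    for ( i = 0; i < chunk.length; i++ ) {
        parse( chunk[ i ] ); // feed a single byte into the engine
    }
});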

@kgryte (Member) commented Nov 5, 2018

@rei2hu Considering an incremental parser, perhaps better than a byte-by-byte parser would be an incremental parser which accepts arbitrarily sized chunks (per update).
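A minimal sketch of such a chunk-oriented engine (all names hypothetical; RFC 4180 quoting omitted for brevity, and a partial trailing field stays buffered across updates until a delimiter or line break arrives):

function createParser( options ) {
    var delimiter = options.delimiter || ',';
    var rows = [ [] ];
    var field = '';

    return incrparse;

    // Accepts an arbitrarily sized string chunk per update:
    function incrparse( chunk ) {
        var ch;
        var i;
        for ( i = 0; i < chunk.length; i++ ) {
            ch = chunk.charAt( i );
            if ( ch === delimiter ) {
                rows[ rows.length-1 ].push( field );
                field = '';
            } else if ( ch === '\n' ) {
                rows[ rows.length-1 ].push( field );
                field = '';
                rows.push( [] );
            } else if ( ch !== '\r' ) {
                field += ch; // CR is skipped so that CRLF and LF behave the same
            }
        }
        return rows;
    }
}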

@rei2hu (Contributor, Author) commented Nov 5, 2018

Alright, I think I got something working for that kind of implementation, but instead of taking in raw bytes it takes in characters. This lets it pass the UTF-8 test without doing anything fancy. I also added a quick if statement to handle the test where the delimiter is ; instead of , so it passes that test now as well.
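Against the hypothetical sketch above, the delimiter option and cross-chunk buffering would behave like:

var parse = createParser( { 'delimiter': ';' } );

parse( 'a;b\nc;' );
// rows so far: [ [ 'a', 'b' ], [ 'c' ] ]

parse( 'd\n' );
// rows so far: [ [ 'a', 'b' ], [ 'c', 'd' ], [] ] (trailing empty row awaits further input)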

@kgryte added the Feature and Utilities labels on Nov 5, 2018

// chars should be an array of characters (strings with length 1)
function incrparse(chars) {
@kgryte (Member):

Would it make more sense for chars to be either a string or Buffer/Uint8Array? I am thinking from the standpoint of reading data from disk or when wrapped in a stream, etc.

@kgryte (Member):

Otherwise, for something like reading data from disk, you need to buf.toString().split( '' ) in order to feed data to the accumulator.

@rei2hu (Contributor, Author):

I actually do want the inner buffer variable to be an instance of Buffer instead of []. I've only been using a normal array to take advantage of shift/unshift/splice methods for an easier time implementing things.

It might bring up issues with UTF-8 though.

@kgryte (Member):

@rei2hu Re: shift/unshift operations. Ah, right. Makes sense. My point was more that the values/data provided to incrparse can be a string/Buffer/Uint8Array, similar to the fs and stream methods found in Node.js. The internal buffer variable being a dynamic array is a reasonable choice, I think.

@rei2hu (Contributor, Author):

I meant to say that if I changed the dynamic array to an instance of Buffer, I would be able to use Buffer.concat, which avoids having to transform the buffer into something else.

Although just transforming the provided buffer into something that can be appended onto the end of a dynamic array is a lot simpler, so I'll probably go with that for now.
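That transform might look something like this (helper name hypothetical; a sketch, not the final implementation):

// Normalize a string, Buffer, or Uint8Array chunk into single-character
// strings appended onto the parser's internal dynamic array:
function appendChunk( arr, chunk ) {
    var str;
    var i;
    if ( typeof chunk === 'string' ) {
        str = chunk;
    } else {
        // Note: decoding per chunk can split multi-byte UTF-8 sequences
        // which straddle chunk boundaries (the UTF-8 concern noted above):
        str = Buffer.from( chunk ).toString( 'utf8' );
    }
    for ( i = 0; i < str.length; i++ ) {
        arr.push( str.charAt( i ) );
    }
    return arr;
}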

@kgryte (Member) commented Nov 23, 2018

@rei2hu Is this ready for "formal" review or still a WIP? No pressure. Just want to make sure that you are not blocked and have not been waiting for PR feedback.

@rei2hu (Contributor, Author) commented Nov 24, 2018

I haven't started on the other files typically found in each package (benchmarks, docs, examples), so I think it's still WIP.

For the committed files so far, I plan to clean up the comments in the main file and rework some things in the tests (wording, and how the last test suite for delimiter guessing calculates things). The other existing files are complete, which just leaves the README, package.json, and the fixtures, I believe.

@rei2hu changed the title from "[WIP] Implement parse-dsv" to "Implement parse-dsv" on Nov 24, 2018
@kgryte closed this on Sep 15, 2022