
Implement parse-dsv #246

Closed

wants to merge 70 commits

Conversation

@rei2hu (Contributor) commented Nov 5, 2018

Resolves #53.

Checklist

Please ensure the following tasks are completed before submitting this pull request.

  • Read, understood, and followed the contributing guidelines, including the relevant style guides.
  • Read and understood the Code of Conduct.
  • Read and understood the licensing terms.
  • Searched for existing issues and pull requests before submitting this pull request.
  • Filed an issue (or an issue already existed) prior to submitting this pull request.
  • Rebased onto latest develop.
  • Submitted against develop branch.

Description

What is the purpose of this pull request?

This pull request:

  • Implements the parse-dsv function

Related Issues

Does this pull request have any related issues?

This pull request resolves #53.

Questions

Any questions for reviewers of this pull request?

What did I use that I shouldn't use for this library, and what are the replacements? E.g., buffer, fs, nested functions, etc. I already know const/let and general style are problems.

Any implementation suggestions?

Other

Any other information relevant to this pull request? This may include screenshots, references, and/or implementation notes.

The current implementation pretty strictly follows section 2 of RFC 4180 and works by reading a file byte by byte. I also assume that the delimiter will be 1 byte long. I plan on changing it to handle streams, which should be straightforward enough, as it already works by reading the file byte by byte.

Some comments on differences between the current implementation and what the RFC describes:

  1. Each record is located on a separate line, delimited by a line break (CRLF).

The implementation allows for CR, LF, or CRLF for line breaks.

  2. There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional "header" parameter of this MIME type).

I'm ignoring headers for now. Right now the implementation returns a 2D array representing rows and columns; see the sketch below.
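To illustrate the mixed line-break handling and the 2D output (the export name parseDSV and its signature are hypothetical here, not the actual API):

// Hypothetical usage; the real export name and signature may differ:
var parseDSV = require( './lib' );

// CR, LF, and CRLF are all accepted as record separators:
var rows = parseDSV( 'a,b\r\nc,d\ne,f' );
// returns [ [ 'a', 'b' ], [ 'c', 'd' ], [ 'e', 'f' ] ]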


@stdlib-js/reviewers

@kgryte (Member) commented Nov 5, 2018

@rei2hu Whoa! This is awwwwwwwesssommmme! This has definitely been high on the wish list for some time! Will take a deeper look this evening!

@rei2hu (Contributor, Author) commented Nov 5, 2018

I've pushed the tests I've been using (I realize I'll have to rewrite their structure too eventually).

Six of them are failing, and I have brief comments explaining why each one fails, except for regex.csv, which seems to be far outside the bounds of the RFC.

@kgryte (Member) commented Nov 5, 2018

@rei2hu I suppose initial feedback would be whether the file reading can be separated from parsing DSV. My thinking here is whether we might be able to create a parsing engine which "incrementally" parses provided data. Borrowing a bit of inspiration from @stdlib/stats/incr/*, the idea would be something like

function createParser( options ) {
    // Internal state configuration...

    return incrparse;

    function incrparse( byte ) {
        // Parse byte and update internal state...
    }

    // Various helper functions for parsing provided data...
}

My sense is that an incremental engine of this sort could then be used in multiple contexts (e.g., streams, reading a file, etc.), where wrapper code would read/pull data and then pass it along to the incremental parsing engine.

Just a thought.
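For instance, wrapper code reading from disk might drive the engine as follows (a minimal sketch; the wiring is hypothetical and assumes the createParser shape sketched above):

// Assumes createParser as sketched above (hypothetical):
var fs = require( 'fs' );

var parse = createParser( {} );
var stream = fs.createReadStream( './data.csv' );

stream.on( 'data', function onData( chunk ) {
    var i;
    for ( i = 0; i < chunk.length; i++ ) {
        parse( chunk[ i ] ); // feed a single byte into the engine
    }
});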

@kgryte (Member) commented Nov 5, 2018

@rei2hu Considering an incremental parser, perhaps better than a byte-by-byte parser would be an incremental parser which accepts arbitrarily sized chunks (per update).
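A minimal sketch of such a chunk-oriented engine (all names hypothetical; RFC 4180 quoting omitted for brevity, and a partial trailing field stays buffered across updates until a delimiter or line break arrives):

function createParser( options ) {
    var delimiter = options.delimiter || ',';
    var rows = [ [] ];
    var field = '';

    return incrparse;

    // Accepts an arbitrarily sized string chunk per update:
    function incrparse( chunk ) {
        var ch;
        var i;
        for ( i = 0; i < chunk.length; i++ ) {
            ch = chunk.charAt( i );
            if ( ch === delimiter ) {
                rows[ rows.length-1 ].push( field );
                field = '';
            } else if ( ch === '\n' ) {
                rows[ rows.length-1 ].push( field );
                field = '';
                rows.push( [] );
            } else if ( ch !== '\r' ) {
                field += ch; // CR is skipped so that CRLF and LF behave the same
            }
        }
        return rows;
    }
}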

@rei2hu (Contributor, Author) commented Nov 5, 2018

Alright, I think I got something working for that kind of implementation, but instead of taking in raw bytes it takes in characters. This lets it pass the UTF-8 test without doing anything fancy. I also added a quick if statement to handle the test where the delimiter is ; instead of , so it passes that test now as well.
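Against the hypothetical sketch above, the delimiter option and cross-chunk buffering would behave like:

var parse = createParser( { 'delimiter': ';' } );

parse( 'a;b\nc;' );
// rows so far: [ [ 'a', 'b' ], [ 'c' ] ]

parse( 'd\n' );
// rows so far: [ [ 'a', 'b' ], [ 'c', 'd' ], [] ] (trailing empty row awaits further input)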

@kgryte added the Feature and Utilities labels on Nov 5, 2018

// chars should be an array of characters (strings with length 1)
function incrparse(chars) {
@kgryte (Member):

Would it make more sense for chars to be either a string or Buffer/Uint8Array? I am thinking from the standpoint of reading data from disk or when wrapped in a stream, etc.

@kgryte (Member):

Otherwise, for something like reading data from disk, you need to buf.toString().split( '' ) in order to feed data to the accumulator.

@rei2hu (Contributor, Author):

I actually do want the inner buffer variable to be an instance of Buffer instead of []. I've only been using a normal array to take advantage of shift/unshift/splice methods for an easier time implementing things.

It might bring up issues with UTF-8 though.

@kgryte (Member):

@rei2hu Re: shift/unshift operations. Ah, right. Makes sense. My point was more that the values/data provided to incrparse can be a string/Buffer/Uint8Array, similar to the fs and stream methods found in Node.js. The internal buffer variable being a dynamic array is a reasonable choice, I think.

@rei2hu (Contributor, Author):

I meant to say that if I changed the dynamic array to an instance of Buffer, I would be able to use Buffer.concat, which avoids having to transform the buffer into something else.

Although just transforming the provided buffer into something that can be appended onto the end of a dynamic array is a lot simpler, so I'll probably go with that for now.
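That transform might look something like this (helper name hypothetical; a sketch, not the final implementation):

// Normalize a string, Buffer, or Uint8Array chunk into single-character
// strings appended onto the parser's internal dynamic array:
function appendChunk( arr, chunk ) {
    var str;
    var i;
    if ( typeof chunk === 'string' ) {
        str = chunk;
    } else {
        // Note: decoding per chunk can split multi-byte UTF-8 sequences
        // which straddle chunk boundaries (the UTF-8 concern noted above):
        str = Buffer.from( chunk ).toString( 'utf8' );
    }
    for ( i = 0; i < str.length; i++ ) {
        arr.push( str.charAt( i ) );
    }
    return arr;
}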

@kgryte (Member) commented Nov 23, 2018

@rei2hu Is this ready for "formal" review or still a WIP? No pressure. Just want to make sure that you are not blocked and have not been waiting for PR feedback.

@rei2hu (Contributor, Author) commented Nov 24, 2018

I haven't started on the other files typically found in each package (benchmarks, docs, examples), so I think it's still WIP.

For the committed files so far, I plan to clean up the comments in the main file and rework some things in the tests (wording, and how the last test suite for delimiter guessing calculates things). The other existing files are complete, which just leaves the README, package.json, and the fixtures, I believe.

@rei2hu changed the title from "[WIP] Implement parse-dsv" to "Implement parse-dsv" on Nov 24, 2018
@kgryte closed this on Sep 15, 2022