Implement parse-dsv #246

Closed
wants to merge 70 commits into from
70 commits
60181ef
add rough working draft
Nov 5, 2018
ce46328
modify gitattributes to not normalize csv test data line endings
Nov 5, 2018
fb851ce
add and update for tests
Nov 5, 2018
2a455e2
change to incremental parser
Nov 5, 2018
2490ea1
remove textdata check for now and reclarify
Nov 5, 2018
23e05d0
consolidate tests
Nov 5, 2018
45f86a2
add multichar delimitter handling
Nov 5, 2018
e79c7e9
another version of handling multichar delimiters
Nov 5, 2018
9d1d0a7
fix 1, handle 2, mitigate 3
Nov 5, 2018
00294d7
slightly cleaner
Nov 5, 2018
46e8a10
add weird test case
Nov 5, 2018
8a1ddf6
only allow buffers or null
Nov 6, 2018
b40e78c
add test for stream
Nov 6, 2018
9d3b87d
update failing test case numbers
Nov 6, 2018
8780871
update style for files in lib
Nov 6, 2018
90690c6
fix example in index.js
Nov 6, 2018
bada739
more formatting fixes
Nov 6, 2018
094299e
suggested changes
Nov 6, 2018
8e44690
add comments to explain each state
Nov 6, 2018
1f07469
add end of stream token
Nov 6, 2018
90ef36a
differentiate between quoted empty field and unquoted empty field
Nov 6, 2018
4c33b62
change to a class
Nov 6, 2018
2b209dc
allow picking quote character
Nov 6, 2018
271eea1
change tests so they pass
Nov 6, 2018
d7ff7a0
consolidate tests again
Nov 7, 2018
b092cba
add event emit test
Nov 7, 2018
9f2ffcf
add error tests
Nov 7, 2018
b109082
sacrifice a bit of efficiency
Nov 7, 2018
9e79be2
reordering cases
Nov 7, 2018
b3de30c
a lot of things
Nov 7, 2018
5f72e59
emit errors instead
Nov 7, 2018
524e82d
better way of quitting on error
Nov 7, 2018
6d18e58
fix: row only emitted after another token was recieved
Nov 7, 2018
29163ef
suggested changes
Nov 7, 2018
2cd8de3
naming changes
Nov 7, 2018
4f73244
add rough delimiter guesser and tests
Nov 7, 2018
f9dd715
integrate delimiter guesser with dsvparser
Nov 8, 2018
06dd5c2
add functionality to delimiter guesser
Nov 8, 2018
f8c2ee1
undo gitattributes change
Nov 8, 2018
66b7d1c
add an error state
Nov 8, 2018
6864575
refactoring and cleanup
Nov 8, 2018
bc844fd
choose symbols over alphanumeric in delimiter guesser
Nov 8, 2018
bd15a01
maybe a cleaner way to avoid emitting
Nov 9, 2018
5135640
test delimiter guesser on fixtures
Nov 9, 2018
15a9cba
update error message
Nov 10, 2018
a20b9cd
use better row count
Nov 10, 2018
b891f8a
try to generate dsv data automatically for tests
Nov 10, 2018
ad9317e
style fixes
Nov 10, 2018
83f8464
use regex to avoid quote problems
Nov 10, 2018
d04a671
initial draft for README
Nov 10, 2018
7b0215c
add implementation specifics to README
Nov 10, 2018
81e97dd
small change to tests
Nov 10, 2018
98f7449
add package.json
Nov 10, 2018
6a5e88a
merge delimiter guessing into parser
Nov 11, 2018
0c79142
add delimiter guessing technique using quotes
Nov 11, 2018
e384728
merge options like other packages do
Nov 11, 2018
d019437
add docs for threshold option and fix typo
Nov 11, 2018
7cc88dc
add docs for properties and other small README fixes
Nov 11, 2018
3eda29c
remove unnecessary check
Nov 13, 2018
483c600
suggested changes and other documentation updates
Nov 13, 2018
6824033
minify fixtures
Nov 13, 2018
f4b50cc
delete manual delim guessing tests
Nov 24, 2018
c01bc21
update last test to generate data with escaped quotes
Nov 24, 2018
b5e13cd
update comments
Nov 24, 2018
5000b5a
rename param in listener function for row event example
Nov 24, 2018
7ad493e
update examples in lib files
Nov 24, 2018
7b60ec1
create examples
Nov 24, 2018
32bd9ad
fix quotes in lib documentation
Nov 24, 2018
0448d08
add benchmarks
Nov 24, 2018
2d315b4
add docs
Nov 24, 2018
345 changes: 345 additions & 0 deletions lib/node_modules/@stdlib/utils/parse-dsv/README.md
@@ -0,0 +1,345 @@
<!--

@license Apache-2.0

Copyright (c) 2018 The Stdlib Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

-->

# Parse DSV

> Parse delimiter-separated values (DSV).

<section class="intro">

Delimiter-separated values (DSV) store tabular data as a two-dimensional array: fields within each record (row) are separated by a delimiter, and records are separated by line breaks. Because the format has no official specification, this implementation conservatively extends the CSV specification in [RFC 4180][rfc4180] to fit the DSV use case. See the [notes](#notes) for implementation specifics.

</section>

<section class="usage">

## Usage

```javascript
var DsvParser = require( '@stdlib/utils/parse-dsv' );
```

#### DsvParser( \[options] )

```javascript
var parser = new DsvParser();
// returns <DsvParser>
```

The constructor accepts the following `options`:

- **delimiter**: `string` that separates each field in a record (row). If not specified, the parser attempts to determine a delimiter from the input.
- **parseHeaders**: `boolean` indicating whether to interpret the first record (row) as header information. Default: `false`.
- **quoteCharacter**: `string` specifying the quote character. Must have length `1`. Default: `"`.
- **watermark**: `number` of records (rows) to use when determining a delimiter if `options.delimiter` is undefined.
- **threshold**: `number` specifying the fraction of rows which a candidate delimiter must satisfy when `options.delimiter` is undefined.
- **encoding**: `string` specifying the encoding of buffers passed to the parser. Default: `utf8`.

To specify DSV parser options at instantiation, provide an `options` object.

```javascript
var opts = {
    'delimiter': ',',
    'parseHeaders': true,
    'quoteCharacter': '\'',
    'watermark': 10,
    'encoding': 'utf8'
};

var parser = new DsvParser( opts );
// returns <DsvParser>
```

* * *

### Properties

<a name="property-delimiter"></a>

#### DsvParser.prototype.delimiter

The delimiter for the DSV data.

```javascript
var parser = new DsvParser({
    'delimiter': ','
});

var delimiter = parser.delimiter;
// returns ','
```

<a name="property-parseheaders"></a>

#### DsvParser.prototype.parseHeaders

Whether the first row of the DSV data should be interpreted as a header.

```javascript
var parser = new DsvParser({
    'parseHeaders': true
});

var headers = parser.parseHeaders;
// returns true
```

<a name="property-quotecharacter"></a>

#### DsvParser.prototype.quoteCharacter

The quote character for the DSV data.

```javascript
var parser = new DsvParser({
    'quoteCharacter': '"'
});

var qc = parser.quoteCharacter;
// returns '"'
```

<a name="property-watermark"></a>

#### DsvParser.prototype.watermark

The number of rows to use when determining a delimiter if a delimiter is not provided.

```javascript
var parser = new DsvParser({
    'watermark': 10
});

var watermark = parser.watermark;
// returns 10
```

<a name="property-threshold"></a>

#### DsvParser.prototype.threshold

The fraction of rows that a candidate delimiter must satisfy when a delimiter is not provided.

```javascript
var parser = new DsvParser({
    'threshold': 0.9
});

var threshold = parser.threshold;
// returns 0.9
```
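
How `watermark` and `threshold` might interact can be sketched as follows. This is only an illustration under stated assumptions, not the guesser used by the package: the function name `guessDelimiters`, its signature, and the rule "a row satisfies a candidate when it contains the candidate as many times as the first row" are all hypothetical.

```javascript
// Hypothetical helper (not part of the package API).
// Returns every candidate which appears the same (nonzero) number of times
// in at least `threshold` (a fraction in [0,1]) of the first `watermark` rows.
function guessDelimiters( rows, candidates, watermark, threshold ) {
    var sample;
    var first;
    var hits;
    var out;
    var i;
    var j;

    out = [];
    if ( rows.length < watermark ) {
        // Not enough rows have been seen to make a guess...
        return out;
    }
    sample = rows.slice( 0, watermark );
    for ( i = 0; i < candidates.length; i++ ) {
        // Count occurrences of the candidate in the first row:
        first = sample[ 0 ].split( candidates[ i ] ).length - 1;
        if ( first === 0 ) {
            continue;
        }
        hits = 0;
        for ( j = 0; j < sample.length; j++ ) {
            if ( sample[ j ].split( candidates[ i ] ).length - 1 === first ) {
                hits += 1;
            }
        }
        if ( hits / sample.length >= threshold ) {
            out.push( candidates[ i ] );
        }
    }
    return out;
}

var rows = [ 'a;b,c', 'd;e', 'f;g', 'h;i' ];
var guesses = guessDelimiters( rows, [ ',', ';', '\t' ], 4, 0.9 );
// returns [ ';' ]
```

Returning a list rather than a single value mirrors the idea that more than one delimiter can remain plausible after sampling.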

<a name="property-encoding"></a>

#### DsvParser.prototype.encoding

The encoding of the buffer that will be passed to [`DsvParser.prototype.push()`](#method-push).

```javascript
var parser = new DsvParser({
    'encoding': 'utf8'
});

var encoding = parser.encoding;
// returns 'utf8'
```

* * *

### Methods

<a name='method-push'></a>

#### DsvParser.prototype.push( data )

Pushes data to the parser. The method accepts either a buffer containing data to parse or `null` to signal the end of input and close the parser.

```javascript
var string2buffer = require( '@stdlib/buffer/from-string' );
var parser = new DsvParser();
parser.push( string2buffer( 'a,b,c' ) );
```
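
The "buffer or `null`" protocol can be illustrated with a much simpler stand-in. The sketch below is not the package's parser: `LineFeeder` is a hypothetical class which assumes a comma delimiter, LF line endings, and no quoting, and it exists only to show how chunks may be buffered until a full record is available and how `null` flushes the remainder.

```javascript
var EventEmitter = require( 'events' );

// Hypothetical stand-in for an incremental parser (comma-delimited, LF only, no quotes).
function LineFeeder() {
    EventEmitter.call( this );
    this._pending = ''; // data received so far which does not yet form a full record
}
LineFeeder.prototype = Object.create( EventEmitter.prototype );
LineFeeder.prototype.constructor = LineFeeder;

LineFeeder.prototype.push = function push( data ) {
    var lines;
    var i;
    if ( data === null ) {
        // End of input: flush any unterminated final record, then close...
        if ( this._pending.length > 0 ) {
            this.emit( 'row', this._pending.split( ',' ) );
            this._pending = '';
        }
        this.emit( 'end' );
        return;
    }
    if ( !Buffer.isBuffer( data ) ) {
        this.emit( 'error', new TypeError( 'invalid argument. Must provide either a buffer or null.' ) );
        return;
    }
    // Note: decoding chunk-by-chunk assumes no multi-byte character is split across chunks.
    this._pending += data.toString( 'utf8' );
    lines = this._pending.split( '\n' );
    this._pending = lines.pop(); // the last piece may be an incomplete record
    for ( i = 0; i < lines.length; i++ ) {
        this.emit( 'row', lines[ i ].split( ',' ) );
    }
};

var feeder = new LineFeeder();
feeder.on( 'row', function onRow( row ) {
    console.log( row );
});
feeder.push( Buffer.from( 'a,b' ) );
feeder.push( Buffer.from( ',c\nd,e,f' ) );
feeder.push( null );
```

Here the first record is emitted only once the line feed arrives in the second chunk, and the second record is emitted when `null` is pushed.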

* * *

### Events

<a name='event-delimiter'></a>

#### 'delimiter'

Event emitted when `options.delimiter` is undefined, the parser has received at least `options.watermark` records (rows), and the parser has been able to determine a delimiter. The event passes three arguments. The first argument is a string representing the determined delimiter. The second argument is a two-dimensional array containing the records parsed so far. The third argument is a list of other plausible delimiters.

```javascript
var parser = new DsvParser();

function listener( delimiter, rows, otherGuesses ) {
    console.log( 'Determined delimiter to be ' + delimiter + '.' );
    console.log( 'Parsed ' + rows.length + ' rows so far.' );
    console.log( 'Other potential delimiters are ' + otherGuesses.join( ' ' ) );
}

// Attach an event listener:
parser.on( 'delimiter', listener );
```

<a name='event-header'></a>

#### 'header'

Event emitted when `options.parseHeaders` is `true` and the parser has received the first record (row). The event passes two arguments. The first argument is an array of fields representing the record. The second argument is the zero-based row number, which is always `0`.

```javascript
var parser = new DsvParser({
'parseHeaders': true
});

function listener( headers, rowNumber ) {
    console.log( rowNumber, headers );
}

// Attach an event listener:
parser.on( 'header', listener );
```
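
One way a consumer might use header information is to pair each header with the corresponding field of later rows. The helper below is a hypothetical illustration (`toObject` is not part of the package); in practice, a `'header'` listener would store the header array and a `'row'` listener would call the helper for each record.

```javascript
// Hypothetical helper: combine a header array and a row array into an object.
function toObject( headers, row ) {
    var out = {};
    var i;
    for ( i = 0; i < headers.length; i++ ) {
        out[ headers[ i ] ] = row[ i ];
    }
    return out;
}

var obj = toObject( [ 'name', 'age' ], [ 'alice', '30' ] );
// returns { 'name': 'alice', 'age': '30' }
```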

<a name='event-row'></a>

#### 'row'

Event emitted when an entire record (row) is parsed. If `options.parseHeaders` is `true`, the [`header`](#event-header) event is emitted for the first row instead. The event passes two arguments. The first argument is an array of fields representing the record. The second argument is the zero-based row number.

```javascript
var parser = new DsvParser();

function listener( row, rowNumber ) {
    console.log( rowNumber, row );
}

// Attach an event listener:
parser.on( 'row', listener );
```

<a name='event-end'></a>

#### 'end'

Event emitted when `null` is pushed to the parser via [`DsvParser.prototype.push`](#method-push). This event is only emitted if the data passed so far is valid DSV; otherwise, the [`error`](#event-error) event is emitted.

```javascript
var parser = new DsvParser();

function listener() {
    console.log( 'The parser was closed successfully' );
}

// Attach an event listener:
parser.on( 'end', listener );
```

<a name='event-error'></a>

#### 'error'

Event emitted when an error is encountered. The event passes one argument: an `Error` instance.

```javascript
var parser = new DsvParser();

function errorHandler( error ) {
    console.log( 'An error was encountered: ' + error );
}

parser.on( 'error', errorHandler );
```

</section>

<section class="examples">

## Examples

```javascript
var string2buffer = require( '@stdlib/buffer/from-string' );
var DsvParser = require( '@stdlib/utils/parse-dsv' );

var options = {
    'delimiter': ','
};
var parser = new DsvParser( options );

function listener( row, rowNumber ) {
    console.log( rowNumber, row );
}

parser.on( 'row', listener );

parser.push( string2buffer( '1,2,3' ) );
parser.push( string2buffer( '\n' ) );
parser.push( string2buffer( '4, 5' ) );
parser.push( string2buffer( ', 6' ) );
parser.push( null );
```

</section>

<section class="notes">

## Notes

Differences between the CSV format defined in [RFC 4180][rfc4180] and this DSV implementation:

>Each record is located on a separate line, delimited by a line break (CRLF).

For the implementation, records can be delimited by a carriage return (CR), a line feed (LF), or a carriage return followed by a line feed (CRLF).

>The last record in the file may or may not have an ending line break.

No difference.

>There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file.

No difference.

>Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

For the implementation, the custom delimiter is used instead of a comma.

>Each field may or may not be enclosed in double quotes. If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

For the implementation, a quote character may be specified. This quote character still must follow this rule when appearing in the input.

>Fields containing line breaks (CRLF), double quotes, and commas, should be enclosed in double quotes.

For the implementation, line breaks include carriage returns (CR), line feeds (LF), and carriage returns followed by line feeds (CRLF). Commas are replaced with the custom delimiter. Double quotes are replaced with the custom quote character.

>If double quotes are used to enclose fields, then a double quote appearing inside a field must be escaped by preceding it with another double quote.

For the implementation, a quote character may be specified. This quote character still must follow this rule when appearing in the input.
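
The quoting and escaping rules above can be sketched for a single record as follows. This is a simplified illustration, not the package's state machine: `parseRecord` is a hypothetical helper which assumes the record has already been separated from its neighbours and which performs no validation of malformed input.

```javascript
// Hypothetical helper: split one record into fields, honoring a custom
// (possibly multi-character) delimiter and a custom quote character.
function parseRecord( record, delimiter, quote ) {
    var inQuotes = false;
    var fields = [];
    var field = '';
    var ch;
    var i;

    i = 0;
    while ( i < record.length ) {
        ch = record[ i ];
        if ( inQuotes ) {
            if ( ch === quote && record[ i+1 ] === quote ) {
                // A doubled quote character inside a quoted field is a literal quote:
                field += quote;
                i += 2;
            } else if ( ch === quote ) {
                // A lone quote character closes the quoted field:
                inQuotes = false;
                i += 1;
            } else {
                // Anything else, including the delimiter, is field content:
                field += ch;
                i += 1;
            }
        } else if ( ch === quote && field.length === 0 ) {
            // A quote character at the start of a field opens a quoted field:
            inQuotes = true;
            i += 1;
        } else if ( record.slice( i, i+delimiter.length ) === delimiter ) {
            fields.push( field );
            field = '';
            i += delimiter.length;
        } else {
            field += ch;
            i += 1;
        }
    }
    fields.push( field );
    return fields;
}

var fields = parseRecord( 'a||\'b\'\'c\'||\'d||e\'', '||', '\'' );
// returns [ 'a', 'b\'c', 'd||e' ]
```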

Other details:

- Delimiters may be longer than one byte.
- CR, LF, and CRLF are reserved. Using any of them as the quote character or within the delimiter results in undefined behaviour.
- Using the quote character within the delimiter, or the delimiter as the quote character, results in undefined behaviour.
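
Since these combinations lead to undefined behaviour, a caller may wish to guard against them before constructing a parser. The check below is a sketch of such a guard; `checkReserved` is a hypothetical helper and is not provided by the package.

```javascript
// Hypothetical guard: returns `null` if the pair is acceptable and a message otherwise.
function checkReserved( delimiter, quote ) {
    if ( quote.length !== 1 ) {
        return 'quote character must have length 1';
    }
    if ( /[\r\n]/.test( delimiter ) || /[\r\n]/.test( quote ) ) {
        return 'CR and LF are reserved';
    }
    if ( delimiter.indexOf( quote ) !== -1 ) {
        // Also covers a single-character delimiter equal to the quote character...
        return 'delimiter must not contain the quote character';
    }
    return null;
}

var msg = checkReserved( '||', '"' );
// returns null

msg = checkReserved( '","', '"' );
// returns 'delimiter must not contain the quote character'
```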

</section>

<section class="links">

[rfc4180]: https://tools.ietf.org/html/rfc4180

</section>
50 changes: 50 additions & 0 deletions lib/node_modules/@stdlib/utils/parse-dsv/benchmark/benchmark.js
@@ -0,0 +1,50 @@
/**
* @license Apache-2.0
*
* Copyright (c) 2018 The Stdlib Authors.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

'use strict';

// MODULES //

var string2buffer = require( '@stdlib/buffer/from-string' );
var fromCodePoint = require( '@stdlib/string/from-code-point' );
var bench = require( '@stdlib/bench' );
var pkg = require( './../package.json' ).name;
var DsvParser = require( './../lib' );


// MAIN //

bench( pkg, function benchmark( b ) {
    var parser;
    var str;
    var i;

    parser = new DsvParser({
        'delimiter': ','
    });
    parser.on( 'error', b.fail );

    b.tic();
    for ( i = 0; i < b.iterations; i++ ) {
        str = 'a,b,c,'+fromCodePoint( 97 + (i%26) ) + ',d\n';
        parser.push( string2buffer( str ) );
    }
    b.toc();
    b.pass( 'benchmark finished' );
    b.end();
});