Merge work on 0.1.0-rc.1
wooorm committed Jul 3, 2014
2 parents a655cf5 + edfc53e commit ea7a32a
Showing 7 changed files with 6,066 additions and 7,287 deletions.
38 changes: 38 additions & 0 deletions History.md
@@ -1,4 +1,42 @@

0.1.0-rc.1 / 2014-07-03
==================

* Moved a wrongly placed tokenizer call
* Modified a unit test testing for functionality that changed in 3169a8c
* Added a missing newline
* Removed unit tests for functionality removed in 3169a8c
* Completely rewrote the API
* The API-linter no longer checks for inconsistent return values
* Updated istanbul version to 0.2.14
* tokenizeSentence now accepts a `DELIMITER` option
* Removed support for breaking contractions into multiple “words” (e.g., gim|me to gimme)
* Removed functionality to split between alphabetic characters and numbers (e.g., 258|f to 258f, 5|S to 5S)
* tokenizeRoot and tokenizeParagraph no longer depend on global variables
* Added two unit tests for the changes in 04e8212
* Modified the benchmark to better reflect actual natural English
* Removed two unused variables from the API
* Refactored code (better performance, comments, and readability)
* Merge branch 'master' into feature/alpha
* Removed functionality to browserify unit tests by default
* Added a factory method for the API's tokenizers
* Added documentation for the changes in 04e8212 and 7dd0818
* API no longer depends on TextOM, instead returning AST objects; fixes #2
* API now exposes all tokenisation steps; fixes #4
* Update jscs dependency version
* Bring browser specs up to par with latest code 33204fd
* 0.0.24
* Removed redundant contractions
* Removed complexity-report from dependencies
* Added a unit test for white space only documents
* Refactored code to work faster and be more readable
* Added benchmarks
* Fixed newline
* Fixed newline
* Added History.md
* Fixed an ungrammatical sentence
* Updated dependencies

0.0.24 / 2014-06-29
==================

123 changes: 96 additions & 27 deletions Readme.md
@@ -6,12 +6,12 @@ See [Browser Support](#browser-support) for more information (a.k.a. don’t wor

---

**parse-english** is an English language parser in JavaScript. Built on top of [TextOM](https://github.com/wooorm/textom/). Works in NodeJS and the browser. Lots of tests (330+), including 630+ assertions. 100% coverage.
**parse-english** is an English language parser in JavaScript. Works in NodeJS and the browser. Lots of tests (330+), including 630+ assertions. 100% coverage.

Note: This project is **not** an object model for natural languages, or an extensible system for analysing and manipulating natural language; it’s an algorithm that transforms plain-text natural language into an object model. If you need the above-mentioned functionalities, use the following projects.
Note: This project is **not** an object model for natural languages, or an extensible system for analysing and manipulating natural language; it’s an algorithm that transforms plain-text natural language into an AST. If you need the above-mentioned functionalities, use the following projects.

* For a pluggable system for analysing and manipulating natural language, see [retext](https://github.com/wooorm/retext "Retext").
* For the object model used in parse-english, see [TextOM](https://github.com/wooorm/textom/ "TextOM");
* For an object model, see [TextOM](https://github.com/wooorm/textom "TextOM").

## Installation

@@ -28,41 +28,110 @@ $ component install wooorm/parse-english
## Usage

````js
var parse = require('parse-english')(),
rootNode;
var Parser = require('parse-english'),
parser = new Parser(),
root;

/* Simple sentence: */
parser.tokenizeRoot('A simple, english sentence.');
/*
* ˅ Object
* ˃ children: Array[1]
* type: "RootNode"
* ˃ __proto__: Object
*/

/* Unicode filled sentence: */
parser.tokenizeRoot('The \xC5 symbol invented by A. J. A\u030Angstro\u0308m (1814, Lo\u0308gdo\u0308, \u2013 1874) denotes the length 10\u207B\xB9\u2070 m.');
/*
* ˅ Object
* ˃ children: Array[1]
* type: "RootNode"
* ˃ __proto__: Object
*/
````

// Simple sentence:
rootNode = parse('A simple, english sentence.');
## API

// Unicode filled sentence:
rootNode = parse('The \xC5 symbol invented by A. J. A\u030Angstro\u0308m (1814, Lo\u0308gdo\u0308, \u2013 1874) denotes the length 10\u207B\xB9\u2070 m.');
### parseEnglish.tokenizeRoot(source?)

// A (plain-text) file:
rootNode = parse(require('fs').readFileSync('./document.txt', 'utf-8'));
````
```js
var Parser = require('parse-english');

new Parser().tokenizeRoot('A simple sentence.');
/*
* Object
* ├─ type: "RootNode"
* └─ children: Array[1]
* └─ 0: Object
* ├─ type: "ParagraphNode"
* └─ children: Array[1]
* └─ 0: Object
* ├─ type: "SentenceNode"
* └─ children: Array[6]
* | ...
*/
```

Note that the exported object is a function, which in turn returns brand-new parser and TextOM objects. There’s a whole slew of issues that can arise from extending prototypes like (DOM) Node, NodeList, or Array; this feature, however, allows for multiple sandboxed environments (i.e., prototypes) without those disadvantages.
Tokenize a given document into paragraphs, sentences, words, white space, and punctuation.

## API
- `source` (`null`, `undefined`, or `String`): The English document to parse.
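
The same method works for a full plain-text document read from disk; a minimal sketch (the `./document.txt` path is illustrative):

```js
var fs = require('fs');
var Parser = require('parse-english');

var parser = new Parser();

/* Tokenize the contents of a plain-text file (the path is just an example). */
var root = parser.tokenizeRoot(fs.readFileSync('./document.txt', 'utf-8'));

root.type; // "RootNode"
root.children; // top-level nodes (paragraphs, possibly separated by white-space nodes)
```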

### parseEnglish.tokenizeParagraph(source?)

```js
var Parser = require('parse-english');

new Parser().tokenizeParagraph('A simple sentence.');
/*
* Object
* ├─ type: "ParagraphNode"
* └─ children: Array[1]
* └─ 0: Object
* ├─ type: "SentenceNode"
* └─ children: Array[6]
* | ...
*/
```

Tokenize a given paragraph into sentences, words, white space, and punctuation.

- `source` (`null`, `undefined`, or `String`): The English paragraph to parse.
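
Since the returned node is a plain object, standard array methods work on its `children`; a small sketch (assuming the node types shown above) that counts the sentences in a paragraph:

```js
var Parser = require('parse-english');

var paragraph = new Parser().tokenizeParagraph(
    'A first sentence. A second sentence.'
);

/* Count sentences by type, so any white-space nodes between them are ignored. */
var sentenceCount = paragraph.children.filter(function (node) {
    return node.type === 'SentenceNode';
}).length;

sentenceCount; // 2
```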

### ParseEnglish(source?)
### parseEnglish.tokenizeSentence(source?)

```js
var parse = require('parse-english')(),
rootNode = parse('A simple sentence.');

rootNode; // RootNode
rootNode.head; // ParagraphNode
rootNode.head.head; // SentenceNode
rootNode.head.head.head; // WordNode
rootNode.head.head.head.toString(); // "A"
rootNode.head.head.tail; // PunctuationNode
rootNode.head.head.tail.toString(); // "."
var Parser = require('parse-english');

new Parser().tokenizeSentence('A simple sentence.');
/*
* Object
* ├─ type: "SentenceNode"
* └─ children: Array[6]
* ├─ 0: Object
* | ├─ type: "WordNode"
* | └─ value: "A"
* ├─ 1: Object
* | ├─ type: "WhiteSpaceNode"
* | └─ value: " "
* ├─ 2: Object
* | ├─ type: "WordNode"
* | └─ value: "simple"
* ├─ 3: Object
* | ├─ type: "WhiteSpaceNode"
* | └─ value: " "
* ├─ 4: Object
* | ├─ type: "WordNode"
* | └─ value: "sentence"
* └─ 5: Object
* ├─ type: "PunctuationNode"
* └─ value: "."
*/
```

Parses a given (English) string into an object model.
Tokenize a given sentence into words, white space, and punctuation.

- `source` (`null`, `undefined`, or `String`): The English source to parse.
- `source` (`null`, `undefined`, or `String`): The English sentence to parse.
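
The word and punctuation values can likewise be pulled straight out of the `children` array; a minimal sketch based on the structure shown above:

```js
var Parser = require('parse-english');

var sentence = new Parser().tokenizeSentence('A simple sentence.');

/* Collect the plain-text value of every word node. */
var words = sentence.children
    .filter(function (node) {
        return node.type === 'WordNode';
    })
    .map(function (node) {
        return node.value;
    });

words; // ['A', 'simple', 'sentence']
```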

## Browser Support
Pretty much every browser (available through BrowserStack) runs all parse-english unit tests.
116 changes: 66 additions & 50 deletions benchmark/index.js
@@ -1,78 +1,94 @@
'use strict';

var parseEnglish, source, tiny, small, medium, large;
var Parser, sentence, paragraph, section, article, book;

parseEnglish = require('..');

source = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam ' +
'ac ultricies diam, quis vehicula mauris. Vivamus accumsan eleifend ' +
'quam et varius. Etiam congue id magna eu fermentum. Aliquam mollis ' +
'adipiscing.\n\n';
Parser = require('..');

/* Test data */
tiny = source;
small = Array(11).join(source);
medium = Array(11).join(small);
large = Array(11).join(medium);

/* Source: http://www.gutenberg.org/files/10745/10745-h/10745-h.htm */

/* A sentence, 20 words. */
sentence = 'Where she had stood was clear, and she was gone since Sir Kay does not choose to assume my quarrel.';

/* A paragraph, 5 sentences, 100 words. */
paragraph = 'Thou art a churlish knight to so affront a lady ' +
'he could not sit upon his horse any longer. ' +
'For methinks something hath befallen my lord and that he ' +
'then, after a while, he cried out in great voice. ' +
'For that light in the sky lieth in the south ' +
'then Queen Helen fell down in a swoon, and lay. ' +
'Touch me not, for I am not mortal, but Fay ' +
'so the Lady of the Lake vanished away, everything behind. ' +
sentence;

/* A section, 10 paragraphs, 50 sentences, 1,000 words. */
section = paragraph + Array(10).join('\n\n' + paragraph);

/* An article, 100 paragraphs, 500 sentences, 10,000 words. */
article = section + Array(10).join('\n\n' + section);

/* A book, 1,000 paragraphs, 5,000 sentences, 100,000 words. */
book = article + Array(10).join('\n\n' + article);

/* Benchmarks */
suite('parse(source); // Reuse instance', function () {
var parse = parseEnglish();
suite('parser.tokenizeSentence(source);', function () {
var parser = new Parser();

bench('tiny (1 paragraph, 5 sentences, 30 words, 208 B)',
function (next) {
parse(tiny);
next();
}
);
set('mintime', 50);

bench('small (10 paragraphs, 50 sentences, 300 words, 2 kB)',
function (next) {
parse(small);
next();
}
);
bench('A sentence (20 words)', function (next) {
parser.tokenizeSentence(sentence);
next();
});
});

bench('medium (100 paragraphs, 500 sentences, 3000 words, 21 kB)',
function (next) {
parse(medium);
next();
}
);

bench('large (1000 paragraphs, 5000 sentences, 30000 words, 208 kB)',
function (next) {
parse(large);
next();
}
);
/* Benchmarks */
suite('parser.tokenizeParagraph(source);', function () {
var parser = new Parser();

set('mintime', 50);

bench('A sentence (20 words)', function (next) {
parser.tokenizeParagraph(sentence);
next();
});

bench('A paragraph (5 sentences, 100 words)', function (next) {
parser.tokenizeParagraph(paragraph);
next();
});
});

suite('parseEnglish()(source); // Create new instance', function () {
bench('tiny (1 paragraph, 5 sentences, 30 words, 208 B)',
function (next) {
parseEnglish()(tiny);
next();
}
);
/* Benchmarks */
suite('parser.tokenizeRoot(source);', function () {
var parser = new Parser();

set('mintime', 100);

bench('A paragraph (5 sentences, 100 words)', function (next) {
parser.tokenizeRoot(paragraph);
next();
});

bench('small (10 paragraphs, 50 sentences, 300 words, 2 kB)',
bench('A section (10 paragraphs, 50 sentences, 1,000 words)',
function (next) {
parseEnglish()(small);
parser.tokenizeRoot(section);
next();
}
);

bench('medium (100 paragraphs, 500 sentences, 3000 words, 21 kB)',
bench('An article (100 paragraphs, 500 sentences, 10,000 words)',
function (next) {
parseEnglish()(medium);
parser.tokenizeRoot(article);
next();
}
);

bench('large (1000 paragraphs, 5000 sentences, 30000 words, 208 kB)',
bench('A (large) book (1,000 paragraphs, 5,000 sentences, 100,000 words)',
function (next) {
parseEnglish()(large);
parser.tokenizeRoot(book);
next();
}
);
10 changes: 3 additions & 7 deletions component.json
@@ -1,18 +1,14 @@
{
"name": "parse-english",
"repository": "wooorm/parse-english",
"description": "English (natural language) parser build on TextOM",
"version": "0.0.24",
"description": "English (natural language) parser",
"version": "0.1.0-rc.1",
"keywords": [
"english",
"natural",
"language",
"parser",
"textom"
"parser"
],
"dependencies": {
"wooorm/textom": "^0.0.20"
},
"scripts" : [
"index.js"
],
