Latin-script (natural language) parser
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
lib Refactor code-style Apr 29, 2018
script Refactor code-style Apr 29, 2018
test Refactor code-style Apr 29, 2018
.editorconfig Refactor code-style Sep 18, 2016
.gitignore Add `yarn.lock` to `.gitignore` Apr 29, 2018
.npmrc Add `.npmrc` Apr 29, 2018
.prettierignore Refactor code-style Apr 29, 2018
.travis.yml Update Node in Travis Apr 29, 2018
LICENSE Refactor module Jun 21, 2016
index.js Refactor code-style Apr 29, 2018
package.json 4.1.1 Apr 29, 2018
readme.md Refactor code-style Apr 29, 2018

readme.md

parse-latin Build Status Coverage Status Chat

A Latin script language parser for retext producing NLCST nodes.

Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ ānum penninge”), Icelandic (“Hvað er að frétta”), French (“Où sont les toilettes?”), parse-latin does a good job at tokenising it.

Note also that parse-latin does a decent job at tokenising Latin-like scripts, Cyrillic (“Добро пожаловать!”), Georgian (“როგორა ხარ?”), Armenian (“Շատ հաճելի է”), and such.

Installation

npm:

npm install parse-latin

Usage

var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')

var tree = new Latin().parse('A simple sentence.')

console.log(inspect(tree))

Which, when inspecting, yields:

RootNode[1] (1:1-1:19, 0-18)
└─ ParagraphNode[1] (1:1-1:19, 0-18)
   └─ SentenceNode[6] (1:1-1:19, 0-18)
      ├─ WordNode[1] (1:1-1:2, 0-1)
      │  └─ TextNode: "A" (1:1-1:2, 0-1)
      ├─ WhiteSpaceNode: " " (1:2-1:3, 1-2)
      ├─ WordNode[1] (1:3-1:9, 2-8)
      │  └─ TextNode: "simple" (1:3-1:9, 2-8)
      ├─ WhiteSpaceNode: " " (1:9-1:10, 8-9)
      ├─ WordNode[1] (1:10-1:18, 9-17)
      │  └─ TextNode: "sentence" (1:10-1:18, 9-17)
      └─ PunctuationNode: "." (1:18-1:19, 17-18)

API

ParseLatin(value)

Exposes the functionality needed to tokenise natural Latin-script languages into a syntax tree. If value is passed here, it’s not needed to give it to #parse().

ParseLatin#tokenize(value)

Tokenise value (string) into letters and numbers (words), white space, and everything else (punctuation). The returned nodes are a flat list without paragraphs or sentences.

Returns

Array.<NLCSTNode> — Nodes.

ParseLatin#parse(value)

Tokenise value (string) into an NLCST tree. The returned node is a RootNode with in it paragraphs and sentences.

Returns

NLCSTNode — Root node.

Algorithm

Note: The easiest way to see how parse-latin tokenizes and parses, is by using the online parser demo, which shows the syntax tree corresponding to the typed text.

parse-latin splits text into white space, word, and punctuation tokens. parse-latin starts out with a pretty easy definition, one that most other tokenisers use:

  • A “word” is one or more letter or number characters
  • A “white space” is one or more white space characters
  • A “punctuation” is one or more of anything else

Then, it manipulates and merges those tokens into an NLCST syntax tree, adding sentences and paragraphs where needed.

  • Some punctuation marks are part of the word they occur in, e.g., non-profit, she’s, G.I., 11:00, N/A, &c, nineteenth- and...
  • Some full-stops do not mark a sentence end, e.g., 1., e.g., id.
  • Although full-stops, question marks, and exclamation marks (sometimes) end a sentence, that end might not occur directly after the mark, e.g., .), ."
  • And many more exceptions

License

MIT © Titus Wormer