Whether Old-English (“þā gewearþ þǣm hlāforde and þǣm hȳrigmannum wiþ
ānum penninge”), Icelandic (“Hvað er að frétta”), French (“Où sont
parse-latin does a good job at tokenising it.
Note also that parse-latin does a decent job at tokenising
Latin-like scripts: Cyrillic (“Добро пожаловать!”), Georgian (“როგორა
ხარ?”), Armenian (“Շատ հաճելի է”), and such.
npm install parse-latin
var inspect = require('unist-util-inspect')
var Latin = require('parse-latin')

var tree = new Latin().parse('A simple sentence.')

console.log(inspect(tree))
Inspecting the tree yields:
RootNode (1:1-1:19, 0-18)
└─ ParagraphNode (1:1-1:19, 0-18)
   └─ SentenceNode (1:1-1:19, 0-18)
      ├─ WordNode (1:1-1:2, 0-1)
      │  └─ TextNode: "A" (1:1-1:2, 0-1)
      ├─ WhiteSpaceNode: " " (1:2-1:3, 1-2)
      ├─ WordNode (1:3-1:9, 2-8)
      │  └─ TextNode: "simple" (1:3-1:9, 2-8)
      ├─ WhiteSpaceNode: " " (1:9-1:10, 8-9)
      ├─ WordNode (1:10-1:18, 9-17)
      │  └─ TextNode: "sentence" (1:10-1:18, 9-17)
      └─ PunctuationNode: "." (1:18-1:19, 17-18)
Exposes the functionality needed to tokenise natural Latin-script
languages into a syntax tree.
value is passed here, it’s not needed to give it to
string) into letters and numbers (words), white space, and
everything else (punctuation). The returned nodes are a flat list without
paragraphs or sentences.
Array.<NLCSTNode> — Nodes.
string) into an NLCST tree. The returned node is
a RootNode containing paragraphs and sentences.
NLCSTNode — Root node.
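To make the shape of the returned tree more concrete, here is a minimal sketch that walks a hand-written object using the node types from the usage example (`RootNode`, `ParagraphNode`, `SentenceNode`, and so on). The helpers `toText` and `collect` are hypothetical and not part of parse-latin. The object is typed out by hand and omits position information, so the block runs without the library installed.

```javascript
// Hand-written tree mirroring the node types shown for 'A simple sentence.'.
// Assumption: position data is omitted for brevity.
const tree = {
  type: 'RootNode',
  children: [
    {
      type: 'ParagraphNode',
      children: [
        {
          type: 'SentenceNode',
          children: [
            {type: 'WordNode', children: [{type: 'TextNode', value: 'A'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'simple'}]},
            {type: 'WhiteSpaceNode', value: ' '},
            {type: 'WordNode', children: [{type: 'TextNode', value: 'sentence'}]},
            {type: 'PunctuationNode', value: '.'}
          ]
        }
      ]
    }
  ]
}

// Hypothetical helper: concatenate the values of all leaf nodes, depth first.
function toText(node) {
  if (Array.isArray(node.children)) {
    return node.children.map(toText).join('')
  }
  return node.value || ''
}

// Hypothetical helper: collect every node of the given type.
function collect(node, type, found = []) {
  if (node.type === type) found.push(node)
  for (const child of node.children || []) collect(child, type, found)
  return found
}

console.log(toText(tree))
console.log(collect(tree, 'WordNode').map(toText))
```

Because every parent keeps its children in document order, serialising the leaves gives back the original text, and filtering by `type` is enough to pull out all words or all sentences.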
Note: The easiest way to see how parse-latin tokenizes and parses is to use the online parser demo, which shows the syntax tree corresponding to the typed text.
parse-latin splits text into white space, word, and punctuation tokens.
It starts out with a pretty easy definition,
one that most other tokenisers use:
- A “word” is one or more letter or number characters
- A “white space” is one or more white space characters
- A “punctuation” is one or more of anything else
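The three definitions above can be sketched with a single regular expression. This is an illustration only, not parse-latin's actual implementation: the function name `tokenizeFlat` and the token shape are assumptions, and combining marks are ignored for simplicity.

```javascript
// Hypothetical helper, not part of parse-latin's API.
// Classifies runs of characters using the three definitions above.
function tokenizeFlat(value) {
  // Group 1: letters (\p{L}) or numbers (\p{N}) form a word.
  // Group 2: white space characters.
  // Group 3: anything else counts as punctuation.
  const pattern = /([\p{L}\p{N}]+)|(\s+)|([^\p{L}\p{N}\s]+)/gu
  const tokens = []

  for (const match of value.matchAll(pattern)) {
    let type
    if (match[1] !== undefined) type = 'word'
    else if (match[2] !== undefined) type = 'whiteSpace'
    else type = 'punctuation'

    tokens.push({type, value: match[0]})
  }

  return tokens
}

console.log(tokenizeFlat('A simple sentence.'))
```

With this naive definition, a string such as `Don't` comes out as three tokens (word, punctuation, word), which is why a later merging step is needed.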
Then, it manipulates and merges those tokens into an NLCST syntax tree, adding sentences and paragraphs where needed.
- Some punctuation marks are part of the word they occur in, e.g.,
- Some full-stops do not mark a sentence end, e.g.,
- Although full-stops, question marks, and exclamation marks
(sometimes) end a sentence, that end might not occur directly
after the mark, e.g.,
- And many more exceptions
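To give a feel for the first kind of exception, here is a minimal sketch that folds a mark sitting directly between two words back into a single word. The function name, the token shape, and the choice of marks (apostrophes and hyphens) are assumptions made for illustration. The library's own rules cover many more cases and are not reproduced here.

```javascript
// Hypothetical post-processing step, not parse-latin's actual implementation.
// Assumption: only these marks join words when directly surrounded by words.
const INNER_WORD_MARKS = new Set(["'", '’', '-'])

function mergeInnerWordPunctuation(tokens) {
  const result = []

  for (let index = 0; index < tokens.length; index++) {
    const token = tokens[index]
    const previous = result[result.length - 1]
    const next = tokens[index + 1]

    const joins =
      token.type === 'punctuation' &&
      INNER_WORD_MARKS.has(token.value) &&
      previous !== undefined &&
      previous.type === 'word' &&
      next !== undefined &&
      next.type === 'word'

    if (joins) {
      // Fold the mark and the following word into the previous word.
      result[result.length - 1] = {
        type: 'word',
        value: previous.value + token.value + next.value
      }
      index++ // Skip the word that was just merged.
    } else {
      result.push(token)
    }
  }

  return result
}

const flat = [
  {type: 'word', value: 'Don'},
  {type: 'punctuation', value: "'"},
  {type: 'word', value: 't'},
  {type: 'whiteSpace', value: ' '},
  {type: 'word', value: 'stop'},
  {type: 'punctuation', value: '.'}
]

console.log(mergeInnerWordPunctuation(flat))
```

The trailing full stop stays a separate punctuation token because it is not followed by a word. Deciding whether such a mark really ends a sentence is the harder problem that the remaining exceptions deal with.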