How it Works


Yeah, I know...

    there are a lot of reasons why this shouldn't work.
compromise is a rule-based, 'Brill-inspired' natural-language processing library that prefers the smallest, least-fancy solutions for getting a text into a manageable form.

It is not the most accurate or clever NLP library, but it has found its niche as an easy, small library that can run everywhere.

We're proud of making the codebase as approachable as we can. There is seldom any fancy computer science, or any weird linguistic acronym to learn, before getting a handle on how things work. (no offence, lisp! 😝)

It was built by finding common patterns in random linguistic data (whatever we could find on the internet), then listing out the exceptions, and iterating on community bug-reports.

ok i'll admit it,
... IT'S REGEX ALL THE WAY DOWN ...

It's written in ES6 JavaScript, and compiled to ES5 with Babel. It is squashed into one file with browserify. :heart:

ok!

1) Internal word data

Because file-size is a huge constraint on the web, we ship our lexicon in a very-engineered format.

First, we try to use a suffix pattern, or a careful regex, instead of shipping a word. Most word-pattern data was learned by running nlp-thumb on WordNet.

Second, when the tagger doesn't find anything useful about a term, it falls back to a Noun. This means that, if we're careful, most of the time we don't need to actually ship nouns. Most dictionaries are about 70% nouns.
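
For a quick feel of this fallback - a tiny sketch, using a made-up word:

const nlp = require('compromise')
// 'blorkfish' is (presumably) not in the lexicon, and matches no suffix patterns,
// so the tagger falls back to calling it a Noun
nlp('the blorkfish').has('#Noun')
// true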

And when it comes to actually forming a list of known words:

  • First, we compress all the word-data using efrt, a trie-based compression library, optimised so that the data can 'pop' very quickly into a usable form (as sketched below).
  • Second, we ship only the infinitive form of verbs, the singular form of nouns, and the 'normal' form of adjectives - and then conjugate-out all possible forms of these words: conjugations, inflections, and the random obscure forms of each term. Using this strategy, we can get the mid-90s percent of English words into about 35kb. You can see our data in ./src/lexicon, and re-compress it with npm run pack.
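
As a rough sketch of the compression round-trip, assuming efrt's pack/unpack api (the real word-lists live in ./src/lexicon):

const efrt = require('efrt')
// pack a word→tag mapping into a compact trie-string
let str = efrt.pack({ walk: 'Verb', talk: 'Verb', dog: 'Noun' })
// ...ship the string, then 'pop' it back open at runtime
let obj = efrt.unpack(str)
obj['walk']
// 'Verb'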

2) Tokenization

When we get our text as input, we split it into sentences, and those into terms - even if the input is only one word.

This is the data-model we use, universally:

Text [
  Terms [ # a sentence, or a match
    Term {} # a word, its punctuation, and whitespace
    Term {}
  ]
  Terms [
    Term {}
    Term {}
  ]
]

If you input a novel, each Terms object will be a sentence, by default. If you run doc.match('one . three'), each Terms object will hold three terms.

Text and Terms classes get subclassed by other forms, like Ngram, or Person, which add new functionality (and occasionally overload some default methods). The advantage of this is that you can call any generic method, even if you're only looking at a subset of the document - and you can chain your logic endlessly.
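
For instance - a sketch of calling a generic method on a People subset (output is approximate):

const nlp = require('compromise')
let doc = nlp('Tony Hawk did a kickflip. It was great.')
// .people() returns a subclassed Text, but generic methods still work,
// and the change flows back into the full document
doc.people().toUpperCase()
doc.out('text')
// 'TONY HAWK did a kickflip. It was great.'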

Originally we allowed simpler structures, if the user knew what they wanted: nlp.adjective('cool'), etc. The problem was, in the real world, all inputs are messy, and need to be re-interpreted anyways. It became clear that the solution was to parse and tag every input in a generic way:

nlp('cookie').nouns().toPlural()
nlp('the cookie').nouns().toPlural()
nlp('double chocolate cookie').nouns().toPlural()
nlp('mayor of chicago').nouns().toPlural()

3) Normalization

The Term class stores a lot of information about each term, and how it's used. Some of this meta-data involves more computer-readable versions of the text - lowercased, with punctuation, possessives, hyphenation, unicode, and so-on stripped away. This is non-destructive, as the .out('text') method will re-create the original text pixel-perfectly.

Because compromise makes no attempt at supporting other languages (yet!), it also normalizes some unicode characters into their ASCII variants (Québec -> Quebec, etc). This may be unfair, or non-semantic in some circumstances, but is usually ok.
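
A small sketch of the difference (the exact normalized form may vary a little):

const nlp = require('compromise')
let doc = nlp('the house in QUÉBEC!')
doc.out('normal')
// 'the house in quebec' - roughly: lowercased, de-accented, punctuation dropped
doc.out('text')
// 'the house in QUÉBEC!' - the original, reproduced exactly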

4) Part-of-Speech Tagging

After tokenizing its input, compromise begins to classify each term according to the role it plays in the sentence. This works, briefly, as follows:

  • Lexicon lookup for known words and their most-frequent tag, like 'walk', or 'great britain'
  • Suffix/regex patterns to identify unknown words, like 'All mimsy were the borogoves'
  • Sentence-level post-processing to catch mis-identified terms, like 'walk the walk' (#Verb the #Noun)
  • An eventual fallback to Noun for unknown words, like 'lactobacillus'

Sometimes one rule will overwrite another. The tagger runs a few times, until things settle. It's pretty quick. You can view the high-level source here.
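
You can watch these rules play out on the 'walk the walk' example - a rough sketch:

const nlp = require('compromise')
let doc = nlp('walk the walk')
doc.match('#Verb').out('text')
// 'walk' - the first one
doc.match('#Noun').out('text')
// 'walk' - the second one, fixed-up by the sentence-level rules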

Each time a Term is tagged, compromise juggles things around, to make sure our understanding remains consistent. When something is tagged as a #Person, it will also add the tags #Singular and #Noun. If the term was formerly understood as a #Verb, it will remove any tags that aren't consistent with #Person. You can see the tag rules in ./src/tags/tags, or look here for a general overview.
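
A small sketch of this tag-consistency, using a made-up name:

const nlp = require('compromise')
let doc = nlp('glorbnik')
doc.tag('Person')
doc.has('#Noun')
// true - tagging #Person also implied #Noun and #Singular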

5) Matching and Subsets

The resulting format, named Text, is then ready to be queried and messed with. The match syntax is a way to query for any particular part of the document, according to any pattern. A .match() will return just another Text object, pivoted according to the match (as discussed above).
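
For example - a sketch using an optional term in the match (output is approximate):

const nlp = require('compromise')
let doc = nlp('we walked to the old bakery')
// '#Adjective?' means the adjective may-or-may-not be there
doc.match('to the #Adjective? #Noun').out('text')
// 'to the old bakery'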

Because the original Terms objects are split by sentence, you cannot run a .match() across two sentences without calling a .concat() method first.

Because some parts of a sentence can be delicate to retrieve, compromise ships with a bunch of predefined subsets, like .people(), .nouns() or .quotations(). These declared subsets are just wrappers around very careful .match() expressions. They handle tricky cases like clause-separations, and neighbouring POS tags.

You can look through the subset logic here.
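
For example - a rough sketch (outputs are approximate):

const nlp = require('compromise')
let doc = nlp('"I am not throwing away my shot," said Hamilton.')
doc.quotations().out('text')
// 'I am not throwing away my shot,'
doc.people().out('text')
// 'Hamilton'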

Mutability

After each change, the Text class keeps a reference to the original object, so that it may make changes to the original, like .match('#Adjective').append('really'). To prevent this behaviour, call .clone().
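
A small sketch of this behaviour:

const nlp = require('compromise')
let doc = nlp('the quick fox')
// the match-result points back at the original document
doc.match('#Adjective').toUpperCase()
doc.out('text')
// 'the QUICK fox'
let frozen = doc.clone() // detached - further changes won't touch it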

6) Running, Testing, Debugging
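
To poke at what the tagger decided, a rough sketch (the .debug() output is printed to the console):

const nlp = require('compromise')
let doc = nlp('she sells seashells by the seashore')
doc.debug() // logs each term, with its tags
doc.verbs().toPastTense()
doc.out('text')
// 'she sold seashells by the seashore'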

🔥 now you're cooking with fire!

there we are. off you go!
