spencer kelly edited this page Oct 19, 2017 · 9 revisions

compromise is not the most accurate or the most clever NLP toolkit. It is, though, pretty fun to use.



If the 80-20 rule applies to most things, the "94-6 rule" applies when working with language, by Zipf's law:

The top 10 words account for 25% of used language.

The top 100 words account for 50% of used language.

The top 1,000 words account for 80% of used language.

The top 50,000 words account for 95% of used language.
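
To check figures like these against your own text, a minimal sketch along the following lines may help. It is not part of compromise; the function name `coverage` and the sample sentence are made up for illustration, and the splitting on spaces is deliberately naive.

```javascript
// Share of all tokens accounted for by the N most frequent words.
// Hypothetical helper for illustration, not a compromise API.
function coverage(tokens, topN) {
  if (tokens.length === 0) return 0;

  // Count how often each distinct word occurs.
  const counts = new Map();
  for (const token of tokens) {
    counts.set(token, (counts.get(token) || 0) + 1);
  }

  // Sort frequencies from most to least common and sum the top N.
  const sorted = [...counts.values()].sort((a, b) => b - a);
  const covered = sorted.slice(0, topN).reduce((sum, n) => sum + n, 0);
  return covered / tokens.length;
}

// Toy input: 13 tokens, of which "the" appears 4 times.
const sampleTokens = 'the cat sat on the mat and the dog sat on the rug'.split(' ');
console.log(coverage(sampleTokens, 1)); // share covered by the single most common word
console.log(coverage(sampleTokens, 3)); // share covered by the top three words
```

On a real corpus you would want proper tokenization and case-folding before counting; the toy sentence only shows the shape of the calculation.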

On the Penn Treebank, for example, these part-of-speech tagging accuracies are possible:

  • just a 1,000-word lexicon: 45% accuracy
  • ... then falling back to nouns: 70% accuracy
  • ... then some suffix regexes: 74% accuracy
  • ... then some sentence-level postprocessing: 81% accuracy
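
The layers above can be sketched roughly as follows. This is an illustrative toy and not compromise's actual implementation: the lexicon entries, the suffix patterns, the tag names, and the single post-processing rule are all assumptions chosen to keep the example short.

```javascript
// Layer 1: a tiny hypothetical lexicon (a real one would hold ~1,000 words).
const lexicon = new Map([
  ['the', 'Determiner'],
  ['a', 'Determiner'],
  ['is', 'Verb'],
  ['walk', 'Verb'],
]);

// Layer 3: hypothetical suffix rules, tried in order for unknown words.
const suffixRules = [
  [/ly$/, 'Adverb'],
  [/(ing|ed)$/, 'Verb'],
  [/(ous|ful|able)$/, 'Adjective'],
];

function tagWord(word) {
  const w = word.toLowerCase();
  if (lexicon.has(w)) return lexicon.get(w);
  for (const [pattern, tag] of suffixRules) {
    if (pattern.test(w)) return tag;
  }
  // Layer 2: anything still unknown falls back to Noun.
  return 'Noun';
}

// Layer 4: sentence-level post-processing. The one assumed rule here:
// a word tagged Verb directly after a Determiner is re-tagged as Noun.
function tagSentence(sentence) {
  const words = sentence.split(/\s+/).filter(Boolean);
  const tags = words.map(tagWord);
  for (let i = 1; i < tags.length; i++) {
    if (tags[i - 1] === 'Determiner' && tags[i] === 'Verb') {
      tags[i] = 'Noun';
    }
  }
  return words.map((word, i) => ({ word, tag: tags[i] }));
}

console.log(tagSentence('the walk is quickly finished'));
```

Each layer only handles what the previous ones missed, which is why the lists of words and patterns can stay small.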

The process is to get some curated data, find the patterns, and list the exceptions. Bada bing, bada boom. In this way a satisfactory NLP library can be built with breathtaking lightness.

Namely, it can be run right on the user's computer instead of on a server.
