compromise is not the most accurate or the most clever NLP toolkit. It is, though, pretty fun to use.
If the 80-20 rule applies for most things, the "94-6 rule" applies when working with language, per Zipf's law:
The top 10 words account for 25% of used language.
The top 100 words account for 50% of used language.
The top 1,000 words account for 80% of used language.
The top 50,000 words account for 95% of used language.
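One way to check figures like these against your own text is to measure how much of a corpus the most frequent words cover. The following is a minimal Python sketch, not part of compromise; the function name `top_k_coverage` and the sample sentence are made up for illustration.

```python
from collections import Counter


def top_k_coverage(tokens, k):
    """Return the fraction of all tokens covered by the k most frequent words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(k))
    return covered / total


# Toy sample (an assumption for illustration); a real check needs a large corpus.
sample = "the cat sat on the mat and the dog sat on the rug".split()

for k in (1, 3, 8):
    print(k, round(top_k_coverage(sample, k), 2))
```

On a real corpus the curve flattens quickly, which is why a small, well-chosen word list goes a long way.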
On the Penn Treebank, for example, this is possible:
- just a 1,000-word lexicon: 45% accuracy
- ... then falling back to nouns: 70% accuracy
- ... then some suffix regexes: 74% accuracy
- ... then some sentence-level postprocessing: 81% accuracy
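The layers above can be sketched as a small tagging cascade. This is a toy Python illustration under stated assumptions, not compromise's actual implementation: the lexicon, the suffix patterns, the tag names, and the single postprocessing rule are all hypothetical and far smaller than anything that would reach the accuracies listed. In code, the noun fallback sits last as the catch-all, even though it appears second in the list.

```python
import re

# Hypothetical toy lexicon; a real one would hold the most frequent words.
LEXICON = {
    "the": "Determiner",
    "a": "Determiner",
    "is": "Verb",
    "run": "Verb",
    "walk": "Verb",
}

# Hypothetical suffix patterns, tried in order for words not in the lexicon.
SUFFIX_RULES = [
    (re.compile(r"ly$"), "Adverb"),
    (re.compile(r"(ous|ful|able)$"), "Adjective"),
]


def tag_word(word):
    """Tag one word: lexicon first, then suffix patterns, then default to Noun."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    for pattern, tag in SUFFIX_RULES:
        if pattern.search(w):
            return tag
    return "Noun"


def postprocess(tags):
    """Sentence-level fix: a Verb right after a Determiner is retagged as Noun."""
    fixed = list(tags)
    for i in range(1, len(fixed)):
        if fixed[i - 1] == "Determiner" and fixed[i] == "Verb":
            fixed[i] = "Noun"
    return fixed


def tag_sentence(sentence):
    """Split on whitespace, tag each word, then apply the sentence-level pass."""
    words = sentence.split()
    tags = postprocess([tag_word(w) for w in words])
    return list(zip(words, tags))


print(tag_sentence("The walk is wonderful"))
print(tag_sentence("zebras run quickly"))
```

Each layer only handles what the previous one missed, so the exceptions stay short and easy to list by hand.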
The process is to get some curated data, find the patterns, and list the exceptions. Bada bing, bada boom. In this way a satisfactory NLP library can be built with breathtaking lightness.
Namely, it can be run right on the user's computer instead of a server.