mltk

mltk - Moz Language Tool Kit. Like nltk but faster.

mltk's overall design goal is fast components that are "good enough", unlike most NLP libraries, which trade speed for accuracy. "Good enough" here typically means a 0.5-1% drop in accuracy or F score on standard benchmarks versus published state-of-the-art results, in exchange for an order of magnitude or more throughput.

Note: this is currently under development, expect undocumented API changes until things stabilize.

POS Tagger

mltk includes a C++ implementation of the part-of-speech tagger from textblob-aptagger. It implements the NLTK POS tagger interface (the tag and tag_sents methods), which assigns POS tags to word- and sentence-tokenized text. Example usage, with NLTK doing the tokenization:

from nltk import word_tokenize, sent_tokenize
from mltk.aptagger import FastPerceptronTagger

tagger = FastPerceptronTagger()

doc = "This is a document as a utf-8 encoded string.  It has two sentences."
# Sentence-tokenize the document, then word-tokenize each sentence.
tokens = [word_tokenize(sent) for sent in sent_tokenize(doc)]
tags = tagger.tag_sents(tokens)
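Since the tagger follows the NLTK interface, tag_sents returns one list of (token, tag) pairs per sentence. A tiny helper (ours, for illustration; not part of mltk) flattens that into a single token-level list:

```python
def flatten_tags(tagged_sents):
    """Flatten per-sentence (token, tag) pairs into one flat list."""
    return [pair for sent in tagged_sents for pair in sent]

# Hand-written example in the same shape tag_sents produces:
tagged = [[("This", "DT"), ("is", "VBZ")], [("Fast", "JJ")]]
flat = flatten_tags(tagged)  # [("This", "DT"), ("is", "VBZ"), ("Fast", "JJ")]
```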

NP Chunker

Our noun phrase chunker broadly follows the averaged perceptron approach of Collins (2002), using most of the features from Sha and Pereira (2003). It achieves an F score of 93.1% on the CoNLL-2000 test set (versus a published state-of-the-art value of 94.39%).
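For readers unfamiliar with the approach, the core of an averaged perceptron fits in a few lines: score each label as a sum of feature weights, bump weights on mistakes, and average the weights over training to reduce overfitting. A toy sketch (ours, not mltk's actual code, which adds richer features, hashing, and sequence decoding on top):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Toy Collins-style averaged perceptron classifier.
    Naive averaging for clarity, not speed."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = defaultdict(float)   # (feature, label) -> weight
        self.totals = defaultdict(float)    # summed weights, for averaging
        self.steps = 0

    def score(self, features, label):
        return sum(self.weights[(f, label)] for f in features)

    def predict(self, features):
        return max(self.labels, key=lambda lab: self.score(features, lab))

    def update(self, features, gold):
        guess = self.predict(features)
        if guess != gold:                    # only touch weights on mistakes
            for f in features:
                self.weights[(f, gold)] += 1.0
                self.weights[(f, guess)] -= 1.0
        self.steps += 1
        for key, w in self.weights.items():  # accumulate for averaging
            self.totals[key] += w

    def average(self):
        """Replace weights with their average over all training steps."""
        for key, total in self.totals.items():
            self.weights[key] = total / self.steps
```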

We made a few modifications to these algorithms to improve the speed at the expense of a little accuracy:

  • We use feature hashing with a hash size of 2^17, which we found to be a good tradeoff between accuracy and model size.
  • We use greedy decoding instead of beam or other search.
  • We replace the Brill-tagger-generated POS tags in the training/test data with predictions from mltk's POS tagger.
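The feature-hashing trick in the first bullet can be sketched in a few lines: instead of growing a feature-to-index dictionary, hash each feature string into a fixed 2^17-slot weight array. The hash function below is an arbitrary stand-in (we are not claiming it is the one mltk uses); the shape of the trick is the point:

```python
import zlib

HASH_BITS = 17
HASH_SIZE = 1 << HASH_BITS  # 2**17 = 131072 weight slots

def feature_index(feature):
    """Map a feature string to a slot in a fixed-size weight array.
    Collisions are tolerated: they cost a little accuracy but cap
    model size regardless of vocabulary growth."""
    return zlib.crc32(feature.encode("utf-8")) % HASH_SIZE

weights = [0.0] * HASH_SIZE
weights[feature_index("word=bank")] += 1.0  # update by hashed index
```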

Example usage:

from nltk import word_tokenize, sent_tokenize
from mltk.aptagger import FastPerceptronTagger
from mltk.np_chunker import NPChunker

tagger = FastPerceptronTagger()
chunker = NPChunker()

doc = "This is a document as a utf-8 encoded string.  It has two sentences."
tokens = [word_tokenize(sent) for sent in sent_tokenize(doc)]
tags = tagger.tag_sents(tokens)
chunks = chunker.chunk_sents(tags)
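The CoNLL-2000 task the chunker is evaluated on labels tokens with B-NP/I-NP/O tags. Assuming chunk_sents exposes labels in that style (an assumption; check the actual output format), recovering NP token spans is a short loop:

```python
def np_spans(labels):
    """Turn CoNLL-style B-NP / I-NP / O labels for one sentence
    into (start, end) token spans, end-exclusive."""
    spans, start = [], None
    for i, lab in enumerate(labels):
        if lab != "I-NP" and start is not None:  # current span ends here
            spans.append((start, i))
            start = None
        if lab == "B-NP":                        # a new span begins
            start = i
    if start is not None:                        # span runs to sentence end
        spans.append((start, len(labels)))
    return spans

labels = ["B-NP", "I-NP", "O", "B-NP"]
# np_spans(labels) -> [(0, 2), (3, 4)]
```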

Benchmarks

The script bench.py provides benchmarks for both the POS tagger and the NP chunker. The timings below were taken on a mid-2014 MacBook Pro (2.5 GHz Intel Core i7).

Our POS tagger implementation achieves 98.8% accuracy on the small Penn Treebank sample shipped with NLTK and can tag an average webpage in 1-2 milliseconds (600,000-700,000 tokens per second) after tokenization.

The NP chunker achieves a 93.1% F score on the CoNLL-2000 test set and can chunk an average web page in about 3 milliseconds (400,000-500,000 tokens per second).