pytextpreprocess
================
written by Joseph Turian
released under a BSD license
Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)
REQUIREMENTS:
* My Python common library:
    http://github.com/turian/common
  and its sub-requirements.
* NLTK, for word tokenization, e.g.:
    apt-get install python-nltk
* Splitta, if you want to do sentence tokenization.
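
The library's own API is not documented here, but the following is a
minimal sketch of the kind of pipeline described above (sentence
splitting, word tokenization, lowercasing, stemming), written with plain
NLTK. It uses NLTK's sent_tokenize in place of Splitta purely for
illustration, and assumes NLTK's "punkt" tokenizer data is available.

    # Sketch only: NOT pytextpreprocess's API. Illustrates the same
    # preprocessing steps using plain NLTK.
    import nltk
    from nltk.stem import PorterStemmer

    nltk.download("punkt", quiet=True)  # tokenizer models, if not already present

    stemmer = PorterStemmer()
    text = "Preprocessing text is useful. It makes NLP pipelines simpler."

    for sentence in nltk.sent_tokenize(text):
        # Tokenize, lowercase, and stem each word in the sentence.
        tokens = [stemmer.stem(tok.lower()) for tok in nltk.word_tokenize(sentence)]
        print(tokens)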
The English stoplist is from:
http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
However, I added some words at the top of the file (above "a").
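
As a minimal sketch of how such a stoplist can be applied (assuming the
list has been saved locally as english.stop, one word per line):

    # Load the stoplist and drop stopwords from a list of lowercased tokens.
    # The filename "english.stop" is an assumption for this example.
    with open("english.stop") as f:
        stopwords = set(line.strip() for line in f if line.strip())

    tokens = ["this", "is", "a", "simple", "example", "sentence"]
    content_tokens = [t for t in tokens if t not in stopwords]
    print(content_tokens)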