pytextpreprocess
================

Written by Joseph Turian.
Released under a BSD license.

Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)

REQUIREMENTS:
    * My Python common library:
        http://github.com/turian/common
      and its sub-requirements.
    * NLTK, for word tokenization
        e.g.
            apt-get install python-nltk
    * Splitta, for sentence tokenization
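To illustrate the kind of preprocessing this package performs, here is a rough stdlib-only sketch of a lowercase/tokenize/stem pipeline. The real textpreprocess.py uses NLTK for word tokenization (and Splitta for sentence splitting); the helper names below are illustrative, not the package's actual API, and the toy stemmer merely stands in for a real one like NLTK's PorterStemmer.

```python
import re

def tokenize(text):
    """Crude word tokenizer: words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def stem(word):
    """Toy suffix-stripping stemmer (stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, tokenize, and stem a piece of text."""
    return [stem(tok) for tok in tokenize(text.lower())]

print(preprocess("The dogs were barking loudly."))
# -> ['the', 'dog', 'were', 'bark', 'loudly', '.']
```

With NLTK installed, `tokenize` and `stem` would be replaced by `nltk.tokenize.word_tokenize` and `nltk.stem.PorterStemmer().stem`.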

The English stoplist is from:
    http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
However, I have added some extra words at the top of the file (above "a").
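Files like english.stop list one stopword per line, so using the stoplist amounts to loading it into a set and filtering tokens against it. The sketch below uses an inline sample in place of the real file so it is self-contained; `remove_stopwords` is an illustrative name, not this package's API.

```python
# Inline stand-in for english.stop; reading the real file would be:
#     stopwords = set(open("english.stop").read().split())
SAMPLE_STOPLIST = """\
a
an
the
of
"""

stopwords = set(SAMPLE_STOPLIST.split())

def remove_stopwords(tokens):
    """Drop tokens found in the stoplist (tokens assumed lowercased)."""
    return [t for t in tokens if t not in stopwords]

print(remove_stopwords(["the", "cat", "sat", "on", "a", "mat"]))
# -> ['cat', 'sat', 'on', 'mat']
```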