pytextpreprocess
================

written by Joseph Turian
released under a BSD license

Preprocess text for NLP (tokenizing, lowercasing, stemming, sentence splitting, etc.)

REQUIREMENTS:
    * My Python common library, and its sub-requirements:
        http://github.com/turian/common
    * NLTK, for word tokenization
        e.g.
            apt-get install python-nltk
    * Splitta, if you want sentence tokenization

The English stoplist is from:
    http://jmlr.csail.mit.edu/papers/volume5/lewis04a/a11-smart-stop-list/english.stop
However, I added some extra words at the top of the file (above "a").
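
As a rough illustration of how a stoplist like this is typically applied, here is a minimal sketch that uses only the Python standard library. It is not the API of textpreprocess.py: the function names load_stoplist and preprocess are hypothetical, the regex tokenizer is a stand-in for NLTK's word tokenizer, and the stoplist is assumed to hold one word per line.

```python
import re

def load_stoplist(lines):
    """Build a stop-word set from an iterable of lines (assumed: one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}

def preprocess(text, stopwords):
    """Lowercase the text, split it into simple word tokens, and drop stop words."""
    # Simplified regex tokenizer; a stand-in for a real word tokenizer such as NLTK's.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stopwords]

# Hypothetical excerpt for illustration; the real english.stop file is much longer.
stoplist = load_stoplist(["a", "the", "is", "of", ""])
print(preprocess("The cat is a friend of the dog.", stoplist))
```

To use a stoplist file on disk, the open file object can be passed straight to load_stoplist, e.g. load_stoplist(open("english.stop")), since it iterates line by line.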