Skip to content
This repository


Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Worked examples from the NLTK Book

branch: master

Commit for Python portion of WEKA based sentence genre detection pipe…

…line. Text processing is done using Scikit-Learn libraries, and the X and y matrices are converted to ARFF files for processing by WEKA.
latest commit 5f11dd870c
Sujit Pal authored
Octocat-spinner-32 nbproject After completing chapter 5 and exercises.
Octocat-spinner-32 src Commit for Python portion of WEKA based sentence genre detection pipe…
Octocat-spinner-32 Commit for Python portion of WEKA based sentence genre detection pipe…



Worked examples from the NLTK Book


A Consumer Electronics Named Entity Recognizer - uses an NLTK Maximum Entropy Classifier and IOB tags to train and predict Consumer Electronics named entities in text.


A simple tool to detect word equivalences using Wordnet. Reads a TSV file of word pairs and returns the original (LHS) word if the words don't have the same meaning. Useful (at least in my case) for checking the results of a set of regular expressions to convert words from British to American spellings and for converting Greek/Latin plurals to their singular form (both based on patterns).


A Named Entity Recognizer for Genes - uses NLTK's HMM package to build an HMM tagger to recognize Gene names from within English text.


A trigram backoff language model trained on medical XML documents, and used to estimate the normalized log probability of an unknown sentence.


A proof of concept for calculating inter-document similarities for a collection of text documents for a cheating detection system. Contains implementation of the SCAM (Standard Copy Analysis Mechanism) in order to possible near-duplicate documents.


A proof of concept to identify significant word collocations as phrases from about an hours worth of messages from the Twitter 1% feed, calculated as a log-likelihood ratio of the probability that they are dependent vs that they are independent. Based on the approach described in "Building Search Applications: Lucene, LingPipe and GATE" by Manu Konchady, but extended to handle any size N-gram.


A trigram interpolated model trained on medical and legal sentences, and used to classify a sentence as one of the two genres.


Uses the same training data as medorleg, but uses Scikit-Learn's text API and LinearSVC implementations to build a classifier that predicts the genre of an unseen sentence.

Also contains an ARFF writer to convert the X and y matrices to ARFF format for consumption by WEKA. This was done so we could reuse Scikit-Learn's text processing pipeline to build a WEKA model, which could then be used directly from within a Java based data pipeline.

Something went wrong with that request. Please try again.