## Assignment 5.2 - Document Classifier
Team: Andy Carson, Walt Wells, and Nathan Cooper
    
In this assignment we were given the task of writing a document classification program. 
We chose the Brown Corpus, available through the nltk module. The Brown Corpus consists of 500 text samples, each assigned to one of 15 categories. The high number of categories to choose from makes the task potentially more challenging.

Our approach is to use preprocessing to eliminate non-useful data from the corpus, such as punctuation 
and stop words (e.g., pronouns and articles) that are too common to distinguish between documents. Once these words are removed, we create a list of the remaining words ordered by frequency.  
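
As a small illustration of the cleaning steps described above, the sketch below applies lower-casing, punctuation removal, and stop-word filtering to a toy token list. The name `clean_tokens` and the tiny `TOY_STOPS` set are hypothetical stand-ins (the notebook itself uses NLTK's English stop-word list), and lemmatization is left out so the example runs without any downloads.

```python
# Hypothetical, tiny stand-in for NLTK's English stop-word list.
TOY_STOPS = {"the", "a", "of", "it", "in", "and"}

def clean_tokens(tokens, stops=TOY_STOPS):
    """Lower-case tokens, drop punctuation-only tokens, drop stop words."""
    lower = [t.lower() for t in tokens]
    # str.isalnum() keeps letters and digits, so numbers such as "1961" survive.
    alnum = [t for t in lower if t.isalnum()]
    return [t for t in alnum if t not in stops]

sample = ["The", "jury", "said", ",", "``", "It", "did", "''", "in", "1961", "."]
cleaned = clean_tokens(sample)
print(cleaned)
```

Only the content-bearing tokens remain, which is the property the classifier features rely on.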

Note that we originally used the Reuters Corpus, which contains a similar number of words as the Brown Corpus but many more documents (10,000 vs. 500) and more labels (90 vs. 15), with the potential for multiple labels per document. On a computer with a 2.5 GHz quad-core processor and 16 GB of memory, this program ran for several hours without returning a result. Specifically, preprocessing the corpus and training the classifier took several minutes, but the program would stall when applying the classifier to the test set. Using the Brown Corpus, with fewer, single-valued labels, allowed the program to run and produce a result for evaluation. 

In [1]:
#Import the needed modules

import nltk

from nltk.corpus import brown
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import random
import operator

In [2]:
#We only need English stop words for this corpus.
#A set makes the membership tests in the cleaning steps faster than a list.
english_stops = set(stopwords.words('english'))

In [3]:
#The Brown Corpus contains 500 texts categorized into 15 labels.

print(brown.categories())
print(brown.fileids()[:10])
print(brown.categories('ca05'))

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']
['news']


In [4]:
#From ch 6.1 pg.227 in NLPwP

documents = [(list(brown.words(fileid)), category)
            for category in brown.categories()
            for fileid in brown.fileids(category)]

#seed was set to a specific value so we could vary word counts and see how it affected model accuracy.
random.seed(42)
random.shuffle(documents)
print(len(documents))
#indexing is needed because documents are stored as a list of (word list, label) tuples.
print(documents[0][1])

500
government


In [5]:
"""
The preprocessing is adapted from the DataCamp class in NLP. We added logic that iterates through the
list of individual articles and their labels, which are stored as a list of tuples with article[0] being
the word tokens and article[1] being the label.
  
We convert all words to lower case, so upper-case versions are not counted separately.
We remove punctuation without removing letters and numbers, then we reduce 
words to their lemmas with WordNetLemmatizer.
We then pass the cleaned word list back into a list with its label.
"""

wordnet_lemmatizer = WordNetLemmatizer()

def clean_corpus(corpus):
    cleaned_corpus = []
    for article in corpus:
        #convert to lower case - article[0] is the word list
        lower_tokens = [t.lower() for t in article[0]]
            
        #drop punctuation-only tokens; isalnum() keeps letters and digits
        alpha_only = [t for t in lower_tokens if t.isalnum()]
            
        #remove stop_words
        no_stops  = [t for t in alpha_only if t not in english_stops]
            
        #Lemmatize all tokens
        lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
            
        cleaned_corpus.append((lemmatized, article[1]))#article[1] is the original label.
        
    return cleaned_corpus

cleaned_docs = clean_corpus(documents)
print(documents[100])
print(cleaned_docs[100])

(['Ambiguity', 'Nothing', 'in', 'English', 'has', 'been', 'ridiculed', 'as', 'much', 'as', 'the', 'ambiguous', 'use', 'of', 'words', ',', 'unless', 'it', 'be', 'the', 'ambiguous', 'use', 'of', 'sentences', '.', 'Ben', 'Franklin', 'said', ',', '``', 'Clearly', 'spoken', ',', 'Mr.', 'Fogg', '.', 'You', 'explain', 'English', 'by', 'Greek', "''", '.', 'Richard', 'Brinsley', 'Sheridan', 'said', ',', '``', 'I', 'think', 'the', 'interpreter', 'is', 'the', 'hardest', 'to', 'be', 'understood', 'of', 'the', 'two', "''", '.', 'And', 'a', 'witty', 'American', 'journalist', 'remarked', 'over', 'a', 'century', 'ago', 'what', 'is', 'even', 'more', 'true', 'today', ',', '``', 'Many', 'a', 'writer', 'seems', 'to', 'think', 'he', 'is', 'never', 'profound', 'except', 'when', 'he', "can't", 'understand', 'his', 'own', 'meaning', "''", '.', 'There', 'are', 'many', 'types', 'of', 'ambiguity', 'and', 'many', 'of', 'them', 'have', 'been', 'described', 'by', 'rhetoricians', 'under', 'such', 'names', 'as', 'amp

In [6]:
print(len(cleaned_docs))

500


In [7]:
"""
In this block we process the entire corpus.  
We convert all words to lower case, so upper-case versions are not counted separately.
We remove punctuation without removing letters and numbers, then we reduce 
words to their lemmas with WordNetLemmatizer.
"""

print(len(brown.words()))
#convert to lower
lower_tokens = [t.lower() for t in brown.words()]
            
#drop punctuation-only tokens; isalnum() keeps letters and digits
alpha_only = [t for t in lower_tokens if t.isalnum()]
            
#remove stop_words
no_stops  = [t for t in alpha_only if t not in english_stops]
            
#Lemmatize all tokens
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

print(len(lemmatized))

1161192
515882


In [8]:
"""
We changed the code from the NLPwP book here.
Iterating over the FreqDist() result is not guaranteed to follow frequency order, so we create a list of tuples
sorted by count so we can pull words based on their frequency in the corpus. 
"""
print(lemmatized[:10])
counted_words = nltk.FreqDist(lemmatized) #Returns a dict-like object of word:count.
sorted_counted_words = sorted(counted_words.items(), key=operator.itemgetter(1))#Convert the dict into a list of tuples sorted by ascending count.
word_features = []
#This separates the words from the (word, count) tuples.
for tup in sorted_counted_words:
    word_features.append(tup[0])
    
word_features = word_features[-2000:]#Slice off the 2000 most commonly occurring words.
#The list is in ascending order, so these are the least frequent of the 2000 words kept.
print(word_features[:10])

['fulton', 'county', 'grand', 'jury', 'said', 'friday', 'investigation', 'recent', 'primary', 'election']
['reply', 'correct', 'weakness', 'interview', 'widely', 'developing', 'advice', 'gold', 'bound', 'foundation']
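
The sort-and-slice step above can also be written with `most_common`. In NLTK 3, `FreqDist` is a subclass of `collections.Counter`, so `counted_words.most_common(2000)` should be available on it; the sketch below uses a plain `Counter` and a toy token list (both assumptions for illustration) so that it runs without the corpus.

```python
from collections import Counter

# Toy tokens standing in for the lemmatized corpus.
tokens = ["jury", "said", "county", "jury", "said", "jury", "election"]
counts = Counter(tokens)

# most_common(n) returns (word, count) pairs, most frequent first.
top_two = [word for word, _ in counts.most_common(2)]
print(top_two)
```

Note the ordering difference: `most_common` puts the most frequent word first, whereas the ascending sort in the cell above puts it last, and words with equal counts may come out in a different order.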


In [9]:
#From NLPwP: returns a dict mapping each feature word to True/False, depending on whether the word is present in a text.

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
        
    return features

#print(document_features(cleaned_docs[100][0]))
#featuresets is a list of tuples with the document features and categories.
featuresets  = [(document_features(d),c) for (d,c) in cleaned_docs]
print(featuresets[100][0])
#With 500 articles, this is an 80-20 train-test split.
train_set, test_set  = featuresets[100:], featuresets[:100]
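
To make the feature dictionary and the split concrete, here is a minimal sketch on toy data. The helper names `contains_features` and `split_by_fraction` are hypothetical and not part of the notebook; the split derives the test size from a fraction instead of hard-coding 100, which is an assumption about how one might generalize the cell above.

```python
def contains_features(document_tokens, vocabulary):
    """Map each vocabulary word to True/False for presence in the document."""
    present = set(document_tokens)
    return {"contains(%s)" % w: (w in present) for w in vocabulary}

def split_by_fraction(items, test_fraction=0.2):
    """Return (train, test), taking the first test_fraction of items as test."""
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

vocab = ["jury", "election", "rocket"]
feats = contains_features(["jury", "said", "jury"], vocab)
print(feats)

# 500 placeholder items stand in for the 500 feature sets.
train, test = split_by_fraction(list(range(500)))
print(len(train), len(test))
```

Because the documents were shuffled earlier with a fixed seed, taking the first fifth as the test set gives a random but repeatable split.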



In [10]:
nb_classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(nb_classifier, test_set))

0.52


In [11]:
dt_classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(dt_classifier, test_set))

0.25


Results seemed to hinge on the number of words included in the word feature list. The program ran too slowly to loop over a range of word limits, so we varied the limit by hand while keeping the random seed fixed. Including all words resulted in a Naive Bayes (NB) classifier accuracy of 0.33 and a Decision Tree (DT) accuracy of 0.18. The 500 most common words resulted in an NB accuracy of 0.44 and a DT accuracy of 0.20. The 2,000 most common words, used in the run shown above, resulted in an NB accuracy of 0.52 and a DT accuracy of 0.25. The 2,500 most common words resulted in an NB accuracy of 0.53 and a DT accuracy of 0.23. The 5,000 most common words resulted in an NB accuracy of 0.50 and a DT accuracy of 0.22.
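
For reference, a loop over word limits could be organized roughly as follows. This is only a sketch under stated assumptions: `sweep_feature_sizes` and `majority_baseline` are hypothetical names, `ranked_words` is assumed to be ordered most common first, and the scorer is a trivial majority-class baseline so the example runs without NLTK. It does not make classifier training itself any faster; it only avoids rebuilding each document's word set for every limit.

```python
from collections import Counter

def sweep_feature_sizes(ranked_words, docs, sizes, train_and_score, test_size):
    """Score one train/test split for several vocabulary sizes.

    ranked_words: words ordered most common first (assumption).
    docs: list of (tokens, label) pairs, already shuffled.
    train_and_score: callable(train, test) -> accuracy.
    """
    # Build each document's word set once, not once per vocabulary size.
    doc_sets = [(set(tokens), label) for tokens, label in docs]
    results = {}
    for n in sizes:
        vocab = ranked_words[:n]
        featuresets = [
            ({"contains(%s)" % w: (w in words) for w in vocab}, label)
            for words, label in doc_sets
        ]
        train, test = featuresets[test_size:], featuresets[:test_size]
        results[n] = train_and_score(train, test)
    return results

def majority_baseline(train, test):
    """Placeholder scorer: always predict the most frequent training label."""
    top_label = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == top_label for _, label in test) / len(test)

toy_docs = [(["jury", "said"], "news"), (["jury"], "news"),
            (["said", "vote"], "news"), (["joke"], "humor")]
toy_ranked = ["jury", "said", "vote", "joke"]
scores = sweep_feature_sizes(toy_ranked, toy_docs, [1, 2],
                             majority_baseline, test_size=1)
print(scores)
```

With NLTK available, the scorer could instead be something like `lambda train, test: nltk.classify.accuracy(nltk.NaiveBayesClassifier.train(train), test)`, using the same calls as the cells above.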

We also ran the NB and DT classifiers with the 25,000 least common words. This resulted in an NB accuracy of 0.11, and the DT classifier stalled. 

Note that with 15 categories, a random guess would be correct approximately 6.7% of the time ($\frac{1}{15}$), so the ~50% accuracy rate we measured is about 7.5 times better than random chance. A human would probably not agree with the corpus label on every text, since readers may have different opinions as to whether a text is, for example, adventure or science_fiction, so a person might achieve accuracy in the upper 90% range rather than 100%. On the other hand, a human reader would need far longer to read all 500 texts, whereas a laptop computer processed them in about 5 minutes. 
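
The comparison with chance can be checked with a few lines of arithmetic. The sketch below assumes a uniform random guess over the 15 categories; a majority-class baseline could be higher if the categories are unevenly sized, which is not computed here. The accuracy values are the ones reported for the 2,000-word run.

```python
n_categories = 15
chance = 1 / n_categories  # uniform random guess, about 0.067

reported = {"NB, 2000 words": 0.52, "DT, 2000 words": 0.25}
ratios = {name: acc / chance for name, acc in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times chance")
```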

## References

- Bird, Steven; Klein, Ewan; Loper, Edward. Natural Language Processing with Python (NLPwP). O'Reilly, 2009.
- Jarmul, Katharine. Natural Language Processing Fundamentals in Python. DataCamp. https://www.datacamp.com/courses/natural-language-processing-fundamentals-in-python