# Classify text

This example shows how to build a Naive Bayes classifier with NLTK, the Natural Language Toolkit for Python (http://www.nltk.org/).

In this example we build a language classifier from the top 1000 German and English words. NLTK makes it easy to build a classifier with your own features, which are represented as Python dictionaries.

To decide whether a word is German or English we use the following features:

|feature|description|
|---|---|
|word(foo)|True if the word 'foo' is found in the text|
|sufix1|The last letter of the word|
|bigram('h', 'e')|True if the word contains the bigram 'he'|
|trigram('t', 'h', 'e')|True if the word contains the trigram 'the'|
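
The bigram and trigram features are character n-grams: every run of two or three consecutive letters in a word. The sketch below uses a pure-Python stand-in for `nltk.util.ngrams` (the helper name `char_ngrams` is hypothetical and not part of NLTK) to show which tuples a word yields and why the feature keys end up looking like `bigram('t', 'h')`.

```python
def char_ngrams(word, n):
    """Return all runs of n consecutive characters in word as tuples.

    Hypothetical stand-in for nltk.util.ngrams, for illustration only.
    """
    return [tuple(word[i:i + n]) for i in range(len(word) - n + 1)]


bigrams = char_ngrams('the', 2)
trigrams = char_ngrams('the', 3)

# A feature key is a prefix concatenated with the string form of the tuple.
bigram_keys = ['bigram' + str(g) for g in bigrams]

print(bigrams)      # [('t', 'h'), ('h', 'e')]
print(trigrams)     # [('t', 'h', 'e')]
print(bigram_keys)  # ["bigram('t', 'h')", "bigram('h', 'e')"]
```

A word shorter than `n` yields no n-grams at all, so one-letter words contribute only the `word(...)` and `sufix1` features.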



In [20]:
import nltk
from nltk.util import ngrams

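def get_features(word):
    """Build the feature dictionary for a single word."""
    word = word.lower()

    feature = {'word(' + word + ')': True}
    feature['sufix1'] = word[-1:]  # last letter of the word

    # Character bigrams, e.g. "bigram('t', 'h')"
    for ngram in ngrams(word, 2):
        feature['bigram' + str(ngram)] = True

    # Character trigrams, e.g. "trigram('t', 'h', 'e')"
    for ngram in ngrams(word, 3):
        feature['trigram' + str(ngram)] = True

    return feature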
        
def get_features_from_file(file):
    # Read one word per line; the with-block closes the file afterwards.
    with open(file) as f:
        lines = [line.rstrip() for line in f]

    # One feature dictionary per word.
    return [get_features(word) for word in lines]

def get_features_from_sentenece(sentence):
    features = {}
    for word in sentence.split(' '):
        features.update(get_features(word))
    return features
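
Because the per-word dictionaries are merged with `dict.update`, a key that several words share, such as `sufix1`, is overwritten each time, while the word, bigram and trigram keys accumulate. The sketch below illustrates this with `word_features`, a simplified and hypothetical stand-in for `get_features` that leaves out the n-gram features.

```python
def word_features(word):
    # Simplified, hypothetical stand-in for get_features: only the word
    # feature and the one-letter suffix, without bigrams and trigrams.
    word = word.lower()
    return {'word(' + word + ')': True, 'sufix1': word[-1:]}


def sentence_features(sentence):
    # Same merging strategy as in the cell above: one dictionary per
    # sentence, updated word by word.
    features = {}
    for word in sentence.split(' '):
        features.update(word_features(word))
    return features


feats = sentence_features('My name is Hugo')
print(sorted(feats))
# ['sufix1', 'word(hugo)', 'word(is)', 'word(my)', 'word(name)']
print(feats['sufix1'])  # o (the suffix of the last word only)
```

For a sentence, the `sufix1` feature therefore describes only the final word.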
        
    

With the following command we build our features from the top 1000 German and English words.

In [21]:
featuresets_de = [(f, 'de') for f in get_features_from_file('data/top1000de.txt')]
featuresets_en = [(f, 'en') for f in get_features_from_file('data/top1000en.txt')]

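We shuffle the words because the most frequently used words are at the beginning of each file. Without shuffling, a test set taken from the start of each list would contain only the most common words.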

In [22]:
import random

random.shuffle(featuresets_de)
random.shuffle(featuresets_en)

print('German features', len(featuresets_de))
print('English features', len(featuresets_en))

('German features', 1000)
('English features', 1000)
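
`random.shuffle` produces a different order on every run, so the later split and the reported scores vary from run to run. A minimal sketch, assuming reproducible results are wanted: shuffle with a dedicated `random.Random` instance and a fixed seed. The helper name `shuffled_copy`, the seed value 42 and the toy word list are arbitrary choices for illustration and not part of the original example.

```python
import random


def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    result = list(items)
    rng.shuffle(result)
    return result


words = ['der', 'die', 'und', 'in', 'den', 'von']  # toy data, not the real file
first = shuffled_copy(words, seed=42)
second = shuffled_copy(words, seed=42)

print(first == second)                 # True
print(sorted(first) == sorted(words))  # True
```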


This is an example of a German feature set.

In [23]:
print(featuresets_de[0])

({"bigram('b', 'e')": True, "bigram('g', 'a')": True, "bigram('a', 'u')": True, "bigram('a', 'b')": True, "trigram('f', 'g', 'a')": True, 'sufix1': 'e', "bigram('f', 'g')": True, "trigram('g', 'a', 'b')": True, "trigram('u', 'f', 'g')": True, 'word(aufgabe)': True, "bigram('u', 'f')": True, "trigram('a', 'u', 'f')": True, "trigram('a', 'b', 'e')": True}, 'de')


This is an example of an English feature set.

In [24]:
print(featuresets_en[0])

({"trigram('w', 'h', 'a')": True, 'sufix1': 't', "bigram('w', 'h')": True, "trigram('h', 'a', 't')": True, 'word(what)': True, "bigram('a', 't')": True, "bigram('h', 'a')": True}, 'en')


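Now we split the features into a training set and a test set. The first 200 entries of each language form the test set (400 items), and the remaining 800 of each language form the training set (1600 items).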

In [25]:
test_feats = featuresets_de[:200] + featuresets_en[:200]
train_feats = featuresets_de[200:] + featuresets_en[200:]
classifier = nltk.NaiveBayesClassifier.train(train_feats)

NLTK can compute the accuracy of the classifier. To calculate it, you have to provide the test set.

In [26]:
print('accuracy', nltk.classify.accuracy(classifier, test_feats))

('accuracy', 0.8425)
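
Accuracy is the fraction of test items whose predicted label equals the reference label. The sketch below recomputes it by hand (written for Python 3, where `/` is true division). `StubClassifier` and the toy test set are hypothetical and exist only to keep the example self-contained; any object with a `classify(features)` method, such as the trained classifier above, could be passed instead.

```python
def accuracy(classifier, labeled_feats):
    """Fraction of (features, label) pairs that the classifier labels correctly."""
    correct = sum(1 for feats, label in labeled_feats
                  if classifier.classify(feats) == label)
    return correct / len(labeled_feats)


class StubClassifier:
    """Hypothetical toy classifier: answers 'en' if a 'th' bigram is present."""

    def classify(self, feats):
        return 'en' if "bigram('t', 'h')" in feats else 'de'


toy_test = [
    ({"bigram('t', 'h')": True}, 'en'),  # predicted en, correct
    ({"bigram('e', 'i')": True}, 'de'),  # predicted de, correct
    ({"bigram('o', 'w')": True}, 'en'),  # predicted de, wrong
    ({"bigram('c', 'h')": True}, 'de'),  # predicted de, correct
]

print(accuracy(StubClassifier(), toy_test))  # 0.75
```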


With the trained classifier we can now classify a text. The classifier returns whether the text is written in German (de) or English (en).

In [27]:
print(classifier.classify(get_features_from_sentenece('Mein Name ist Hugo')))
print(classifier.classify(get_features_from_sentenece('My name is Hugo')))

de
en


NLTK provides a method that displays the most informative features.

In [28]:
classifier.show_most_informative_features(10)

Most Informative Features
        bigram('e', 'a') = True               en : de     =     26.3 : 1.0
        bigram('c', 'e') = True               en : de     =     26.3 : 1.0
  trigram('t', 'e', 'n') = True               de : en     =     23.7 : 1.0
        bigram('e', 'i') = True               de : en     =     22.3 : 1.0
  trigram('e', 'i', 'n') = True               de : en     =     17.7 : 1.0
        bigram('t', 'h') = True               en : de     =     17.0 : 1.0
        bigram('o', 'w') = True               en : de     =     14.3 : 1.0
  trigram('s', 't', 'e') = True               de : en     =     13.0 : 1.0
  trigram('s', 'c', 'h') = True               de : en     =     12.2 : 1.0
  trigram('c', 'h', 'e') = True               de : en     =     12.2 : 1.0


Evaluation is key. After you have trained a classifier you should get some basic metrics, such as precision and recall, for every label. NLTK provides methods to calculate the most common metrics for evaluating a classifier.

In [29]:
import collections
import nltk.metrics

refsets = collections.defaultdict(set)
testsets = collections.defaultdict(set)
 
for i, (feats, label) in enumerate(test_feats):
    refsets[label].add(i)
    observed = classifier.classify(feats)
    testsets[observed].add(i)
    
print('DE precision:', nltk.metrics.precision(refsets['de'], testsets['de']))
print('DE recall:', nltk.metrics.recall(refsets['de'], testsets['de']))
print('DE F-measure:', nltk.metrics.f_measure(refsets['de'], testsets['de']))
print('EN precision:', nltk.metrics.precision(refsets['en'], testsets['en']))
print('EN recall:', nltk.metrics.recall(refsets['en'], testsets['en']))
print('EN F-measure:', nltk.metrics.f_measure(refsets['en'], testsets['en']))

DE precision: 0.844221105528
DE recall: 0.84
DE F-measure: 0.842105263158
EN precision: 0.8407960199
EN recall: 0.845
EN F-measure: 0.84289276808
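
These metrics are plain set operations on the index sets built above. With a reference set R (items that truly carry a label) and a test set T (items the classifier assigned that label), precision is |R ∩ T| / |T|, recall is |R ∩ T| / |R|, and the F-measure shown here is their harmonic mean. The sketch below recomputes the German scores by hand in Python 3. The set sizes are inferred from the output above (168 of the 200 German test words labeled `de`, plus 31 English words mislabeled as `de`); the concrete index values are made up for illustration, and the helper functions are simplified stand-ins without any guard against empty sets.

```python
def precision(reference, test):
    # Share of the items labeled by the classifier that are correct.
    return len(reference & test) / len(test)


def recall(reference, test):
    # Share of the truly relevant items that the classifier found.
    return len(reference & test) / len(reference)


def f_measure(reference, test):
    # Harmonic mean of precision and recall.
    p, r = precision(reference, test), recall(reference, test)
    return 2 * p * r / (p + r)


ref_de = set(range(200))                          # 200 true German items
test_de = set(range(168)) | set(range(200, 231))  # 168 hits + 31 false positives

print(round(precision(ref_de, test_de), 4))  # 0.8442
print(round(recall(ref_de, test_de), 4))     # 0.84
print(round(f_measure(ref_de, test_de), 4))  # 0.8421
```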
