# Text Classification (scikit-learn) with Naive Bayes

In this __Machine Learning Snippet__ we use scikit-learn (http://scikit-learn.org/) and ebooks from Project Gutenberg (https://www.gutenberg.org/) to build a text classifier that identifies whether a text is German, French, Dutch or English.

For our snippet we use the following ebooks:
- _'A Christmas Carol'_ by Charles Dickens (English), https://www.gutenberg.org/ebooks/46
- _'Der Weihnachtsabend'_ by Charles Dickens (German), https://www.gutenberg.org/ebooks/22465
- _'Cantique de Noël'_ by Charles Dickens (French), https://www.gutenberg.org/ebooks/16021
- _'Een Kerstlied in Proza'_ by Charles Dickens (Dutch), https://www.gutenberg.org/ebooks/28560


__Note:__
The eBooks are for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org




### Data Preparation

We prepare the English, French, German and Dutch texts. The goal is to cut off the header and footer of each ebook. Then we do some text cleaning:
- Convert to lowercase
- Tokenize the text by whitespace
- Remove special chars (new lines, etc.)
- Remove numbers



First we extract the text tokens (words) from each book: we cut off the header and the footer, convert the remaining text to lowercase and split it on whitespace.

In [1]:
import re

txt_german = open('data/pg22465.txt', 'r').read()
txt_english = open('data/pg46.txt', 'r').read()
txt_french = open('data/pg16021.txt', 'r').read()
txt_dutch = open('data/pg28560.txt', 'r').read()



def get_markers(txt, pattern=r'\*\*\*'):
    # Start offsets of every '***' marker in the text
    return [m.start(0) for m in re.finditer(pattern, txt)]

def extract_text_tokens(txt):
    indices = get_markers(txt)
    # The body lies between the second and the third '***' marker
    header = indices[1]
    footer = indices[2]

    return txt[header: footer].lower().strip().split()


feat_german = extract_text_tokens(txt_german)
feat_english = extract_text_tokens(txt_english)
feat_french = extract_text_tokens(txt_french)
feat_dutch = extract_text_tokens(txt_dutch)
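To see why the second and third `***` markers are used, here is a small sketch on a made-up miniature of a Project Gutenberg file. The layout of `toy` is an assumption about how the real files are structured (a START line and an END line, each wrapped in `***`); the actual ebooks may differ in detail.

```python
import re

# Hypothetical miniature of a Project Gutenberg file layout.
toy = (
    'Title line\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK ***\n'
    'Marley was dead\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK ***\n'
    'License text\n'
)

# Start offsets of every literal "***" in the text.
marks = [m.start() for m in re.finditer(r'\*\*\*', toy)]
print(len(marks))  # 4

# Slice from the second marker (end of the START line) to the
# third marker (beginning of the END line).
body = toy[marks[1]:marks[2]]
print(body.lower().strip().split())  # ['***', 'marley', 'was', 'dead']
```

Note that the slice starts at the marker itself, so the first token is `***`. It is reduced to an empty string once the special characters are removed.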



Next we clean the tokens by removing special characters and digits.

In [2]:
import re

def remove_special_chars(x):
    
    chars = ['_', '(', ')', '*', '"', '[', ']', '?', '!', ',', '.', '»', '«', ':', ';']
    for c in chars:
        x = x.replace(c, '')
    
    # remove digits
    x = re.sub(r'\d', '', x)
    
    return x

tokens_english = [remove_special_chars(x) for x in feat_english]
tokens_german = [remove_special_chars(x) for x in feat_german]
tokens_french = [remove_special_chars(x) for x in feat_french]
tokens_dutch = [remove_special_chars(x) for x in feat_dutch]


print('tokens (german)', len(tokens_german))
print('tokens (french)', len(tokens_french))
print('tokens (dutch)', len(tokens_dutch))
print('tokens (english)', len(tokens_english))


tokens (german) 27216
tokens (french) 32755
tokens (dutch) 31502
tokens (english) 28559


### Feature Extraction
Now we create text samples of 20 tokens (words) each. The tokens in these samples will be used to train the classifier.

In [3]:
def create_text_sample(x):
    max_tokens = 20
    data = []
    text = []
    for i, f in enumerate(x):
        text.append(f)
        # close the current sample at every index that is a multiple of max_tokens
        if i % max_tokens == 0 and i != 0:
            data.append(' '.join(text))
            text = []
    return data
    

sample_german = create_text_sample(tokens_german)
sample_french = create_text_sample(tokens_french)
sample_dutch = create_text_sample(tokens_dutch)
sample_english = create_text_sample(tokens_english)

print('samples (german)', len(sample_german))
print('samples (french)', len(sample_french))
print('samples (dutch)', len(sample_dutch))
print('samples (english)', len(sample_english))


samples (german) 1360
samples (french) 1637
samples (dutch) 1575
samples (english) 1427
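Because the loop closes a sample when the index reaches 20, 40, 60 and so on, the very first sample holds 21 tokens (indices 0 to 20) and any tokens left over at the end are dropped. If exact sample sizes matter, a slicing-based variant is one option. The sketch below is an alternative, not the function used in this snippet; `chunk_tokens` and its `keep_remainder` parameter are hypothetical names.

```python
def chunk_tokens(tokens, size=20, keep_remainder=False):
    """Join consecutive runs of `size` tokens into one string each."""
    samples = []
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        # Keep short trailing chunks only when explicitly requested.
        if len(chunk) == size or keep_remainder:
            samples.append(' '.join(chunk))
    return samples

toy_tokens = ['w%d' % i for i in range(45)]
toy_samples = chunk_tokens(toy_tokens)
print(len(toy_samples))             # 2
print(len(toy_samples[0].split()))  # 20
```

For the token counts printed earlier this variant would yield the same number of samples per language, only with shifted sample boundaries.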


A text sample looks like this.

In [4]:
print('English sample:\n------------------')
print(sample_english[100])
print('------------------')


English sample:
------------------
very night we have no doubt his liberality is well represented by his surviving partner said the gentleman presenting his
------------------


### Modeling
As the classifier we use MultinomialNB together with the TfidfVectorizer. The TfidfVectorizer uses the word analyzer and converts the text to lowercase. We need to do the following steps:
- Create the data structure for the classifier
- Split the data into test and training set
- Create the _Machine Learning Pipeline_

In [5]:
import argparse as ap

def create_data_structure(**kwargs):
    samples = {'data': [], 'target': [], 'target_names':[]}
    label = 0
    for name, value in kwargs.items():
        samples['target_names'].append(name)
        for i in value:
            samples['data'].append(i)
            samples['target'].append(label)
        label += 1
            
    
    return ap.Namespace(**samples)

data = create_data_structure(de = sample_german, en = sample_english, 
                             fr = sample_french, nl = sample_dutch)



print('target names: ', data.target_names)
print('number of observations: ', len(data.data))


target names:  ['en', 'nl', 'fr', 'de']
number of observations:  5999
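The integer in `target` is simply the position of the language in `target_names`, so the order in which the languages are passed determines the label numbers. The order printed above differs from the call order, which suggests the cell was run on an interpreter that did not preserve keyword-argument order; on Python 3.6 and later it is preserved. As a hedged alternative, the same structure can be built from a plain dict with `types.SimpleNamespace`; `build_dataset` is a hypothetical helper, not part of this snippet.

```python
from types import SimpleNamespace

def build_dataset(samples_by_label):
    """Flatten {label_name: [texts]} into parallel data/target lists."""
    data, target, target_names = [], [], []
    for label, (name, texts) in enumerate(samples_by_label.items()):
        target_names.append(name)
        data.extend(texts)
        target.extend([label] * len(texts))
    return SimpleNamespace(data=data, target=target, target_names=target_names)

toy = build_dataset({'de': ['eins zwei', 'drei vier'], 'en': ['one two']})
print(toy.target_names)  # ['de', 'en']
print(toy.target)        # [0, 0, 1]
```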


We split the data into a training set (70%) and a test set (30%).

In [18]:
from sklearn.model_selection import train_test_split
import numpy as np

x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.30, random_state=np.random.randint(low=0, high=10000))

print('train size (x, y): ', len(x_train),  len(y_train))
print('test size (x, y): ', len(x_test), len(y_test))


train size (x, y):  4199 4199
test size (x, y):  1800 1800
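Since `random_state` is itself drawn at random here, every run produces a different split. Passing a fixed integer instead makes the split repeatable. The pure-Python sketch below illustrates the idea of a seeded shuffle split; it is not scikit-learn's implementation, and `shuffle_split` and its `seed` parameter are hypothetical. Rounding the test size up reproduces the 4199/1800 sizes printed above.

```python
import math
import random

def shuffle_split(x, y, test_size=0.30, seed=42):
    """Shuffle the indices once, then cut them into test and train parts."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # seeded, so repeatable
    n_test = math.ceil(len(x) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([x[i] for i in train_idx], [x[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

toy_x = ['sample %d' % i for i in range(5999)]
toy_y = [i % 4 for i in range(5999)]
x_tr, x_te, y_tr, y_te = shuffle_split(toy_x, toy_y)
print(len(x_tr), len(x_te))  # 4199 1800
```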


We create the following _Machine Learning Pipeline_ (model).

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import model_selection
from sklearn import metrics

pipeline = Pipeline([('vect', TfidfVectorizer(analyzer='word', min_df=1, lowercase=True)),
                      ('clf', MultinomialNB()),])
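To give an intuition for what the classifier step does, here is a small pure-Python sketch of multinomial naive Bayes with add-one smoothing. It is a simplification under stated assumptions: it works on raw word counts rather than the TF-IDF weights the pipeline feeds into MultinomialNB, and `train_nb`, `predict_nb` and the toy sentences are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    model = {}
    for label in class_counts:
        # Smoothed denominator: all word occurrences plus alpha per vocab entry.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(class_counts[label] / len(labels))
        log_prob = {w: math.log((word_counts[label][w] + alpha) / total)
                    for w in vocab}
        model[label] = (log_prior, log_prob)
    return model

def predict_nb(model, text):
    """Return the label with the highest log joint probability."""
    scores = {}
    for label, (log_prior, log_prob) in model.items():
        # Words outside the training vocabulary are ignored.
        scores[label] = log_prior + sum(
            log_prob[w] for w in text.lower().split() if w in log_prob)
    return max(scores, key=scores.get)

toy_texts = ['the ghost and the door', 'der geist und die tür',
             'he said to the ghost', 'er sagte zu dem geist']
toy_labels = ['en', 'de', 'en', 'de']
toy_model = train_nb(toy_texts, toy_labels)
print(predict_nb(toy_model, 'the door'))       # en
print(predict_nb(toy_model, 'und der geist'))  # de
```

Frequent function words such as "the" or "der" dominate the per-class probabilities, which is why even short samples tend to be easy to separate by language.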


### Evaluation
In this step we evaluate the performance of our classifier in two ways:
- Evaluate the model with k-fold cross-validation on the training set
- Evaluate the final model with the holdout set (test set)

First we evaluate the model with k-fold cross-validation on the training set. The results of this step can be used to tune the model and its settings before the test set is touched.

In [12]:
folds = 6
scores = model_selection.cross_val_score(pipeline, X=x_train, y=y_train, 
                                         cv=folds, scoring='accuracy')

print('Accuracy: %0.6f (+/- %0.4f)' % (scores.mean(), scores.std() * 2))




Accuracy: 0.999523 (+/- 0.0013)
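The sketch below shows how a training set of 4199 samples can be divided into 6 folds and how the "mean (+/- 2 std)" summary is formed. It is a plain, unshuffled k-fold written for illustration: for classifiers scikit-learn stratifies the folds by class, which this sketch does not do. `kfold_indices` is a hypothetical helper and the per-fold scores are invented.

```python
import statistics

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

fold_sizes = [len(val) for _, val in kfold_indices(4199, 6)]
print(fold_sizes)  # [700, 700, 700, 700, 700, 699]

# Hypothetical per-fold accuracies, only to show the summary formula.
toy_scores = [1.0, 0.998, 1.0, 1.0, 0.999, 1.0]
mean = statistics.mean(toy_scores)
spread = 2 * statistics.pstdev(toy_scores)  # population std, like numpy's default
print('Accuracy: %0.6f (+/- %0.4f)' % (mean, spread))
```

Each sample is used for validation exactly once and for training in the other five folds.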


In [13]:
predicted = model_selection.cross_val_predict(pipeline, X=x_train, y=y_train, cv=folds)
print(metrics.classification_report(y_train, predicted, 
                                    target_names=data.target_names, digits=4))

             precision    recall  f1-score   support

         en     0.9980    1.0000    0.9990       992
         nl     1.0000    1.0000    1.0000      1129
         fr     1.0000    0.9991    0.9996      1124
         de     1.0000    0.9990    0.9995       954

avg / total     0.9995    0.9995    0.9995      4199
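The columns of the report can be derived from simple counts of true positives, false positives and false negatives. The following sketch computes them for one class on a made-up set of labels; `per_class_report` is a hypothetical helper, not the scikit-learn function.

```python
def per_class_report(y_true, y_pred, label):
    """Compute precision, recall and F1 for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels: one sample of class 0 is misclassified as class 1.
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]
print(per_class_report(toy_true, toy_pred, 0))  # precision 1.0, recall 2/3
print(per_class_report(toy_true, toy_pred, 1))  # precision 2/3, recall 1.0
```

A single misclassified sample lowers the recall of its true class and the precision of the class it was assigned to, which matches the pattern in the report above.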



Now we fit the final model on the full training set and evaluate it against the holdout set (test set).

In [10]:
text_clf = pipeline.fit(x_train, y_train)

predicted = text_clf.predict(x_test)


print(metrics.classification_report(y_test, predicted, target_names=data.target_names, digits=4))

             precision    recall  f1-score   support

         en     1.0000    1.0000    1.0000       435
         nl     1.0000    1.0000    1.0000       446
         fr     1.0000    1.0000    1.0000       513
         de     1.0000    1.0000    1.0000       406

avg / total     1.0000    1.0000    1.0000      1800



### New data
Let's try out the classifier with new data.

In [40]:
new_data = ['Hallo mein Name ist Hugo.', 
            'Hi my name is Hugo.', 
            'Bonjour mon nom est Hugo.',
            'Hallo mijn naam is Hugo.',
            'Eins, zwei und drei.',
            'One, two and three.',
            'Un, deux et trois.',
            'Een, twee en drie.'
           ]

predicted = text_clf.predict(new_data)
probs = text_clf.predict_proba(new_data)
for i, p in enumerate(predicted):
    print(new_data[i], ' --> ', data.target_names[p], ', prob:' , max(probs[i]))
    

Hallo mein Name ist Hugo.  -->  de , prob: 0.639591312356
Hi my name is Hugo.  -->  en , prob: 0.699118050854
Bonjour mon nom est Hugo.  -->  fr , prob: 0.827110536445
Hallo mijn naam is Hugo.  -->  nl , prob: 0.76476440211
Eins, zwei und drei.  -->  de , prob: 0.910709776629
One, two and three.  -->  en , prob: 0.964875844812
Un, deux et trois.  -->  fr , prob: 0.97930674257
Een, twee en drie.  -->  nl , prob: 0.956671833056
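The printed probabilities are the per-class scores normalized so that they sum to 1 over the four languages. For a naive Bayes model this amounts to exponentiating the log joint scores and dividing by their sum. The sketch below shows that normalization on invented log scores; `normalize_log_scores` is a hypothetical helper and the numbers do not come from the model above.

```python
import math

def normalize_log_scores(log_scores):
    """Turn per-class log joint scores into probabilities summing to 1."""
    peak = max(log_scores.values())
    # Subtract the maximum before exponentiating to avoid underflow.
    exp_scores = {k: math.exp(v - peak) for k, v in log_scores.items()}
    total = sum(exp_scores.values())
    return {k: v / total for k, v in exp_scores.items()}

# Hypothetical log joint scores for one sentence.
toy_log_scores = {'de': -10.0, 'en': -12.0, 'fr': -14.0, 'nl': -11.0}
toy_probs = normalize_log_scores(toy_log_scores)
best = max(toy_probs, key=toy_probs.get)
print(best, round(toy_probs[best], 3))
```

When two languages score similarly, as can happen for closely related languages or for words the model has never seen, the winning probability stays well below 1.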


Let's look at the ten highest-weighted features (words) for each language.

In [41]:
# show most informative features
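def show_top10(classifier, vectorizer, categories):
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        # feature_log_prob_ holds the per-class log probabilities that coef_ exposed in older versions
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))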


show_top10(text_clf.named_steps['clf'], text_clf.named_steps['vect'], data.target_names)

en: was in that his he it of to and the
nl: ik te dat van zijn hij de het een en
fr: qu une que les un la il et le de
de: ein das ich zu es sie er die der und
