# Text Classification (scikit-learn) with Naive Bayes

In this __Machine Learning Snippet__ we use scikit-learn (http://scikit-learn.org/) and ebooks from Project Gutenberg (https://www.gutenberg.org/) to build a text classifier that distinguishes German, French, Dutch and English documents.

We need one document per language, and we split each document into smaller chunks to train the classifier.

For our snippet we use the following ebooks:
- _'A Christmas Carol'_ by Charles Dickens (English), https://www.gutenberg.org/ebooks/46
- _'Der Weihnachtsabend'_ by Charles Dickens (German), https://www.gutenberg.org/ebooks/22465
- _'Cantique de Noël'_ by Charles Dickens (French), https://www.gutenberg.org/ebooks/16021
- _'Een Kerstlied in Proza'_ by Charles Dickens (Dutch), https://www.gutenberg.org/ebooks/28560


__Note:__
The ebooks are for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org




## Gathering data
First, let's extract the text between the Project Gutenberg header and footer of each ebook and split it on whitespace into tokens.

In [1]:
import re
import urllib.request

with urllib.request.urlopen('http://www.gutenberg.org/cache/epub/22465/pg22465.txt') as response:
   txt_german = response.read().decode('utf-8') 

with urllib.request.urlopen('https://www.gutenberg.org/files/46/46-0.txt') as response:
   txt_english = response.read().decode('utf-8') 

with urllib.request.urlopen('http://www.gutenberg.org/cache/epub/16021/pg16021.txt') as response:
   txt_french = response.read().decode('utf-8') 

with urllib.request.urlopen('http://www.gutenberg.org/cache/epub/28560/pg28560.txt') as response:
   txt_dutch = response.read().decode('utf-8') 



def get_markers(txt, begin_pattern, end_pattern):
    # locate the first occurrence of each marker
    begin_match = re.search(begin_pattern, txt)
    end_match = re.search(end_pattern, txt)
    
    # the body starts right after the begin marker and ends where the end marker starts
    return begin_match.end(), end_match.start()

def extract_text_tokens(txt, 
                        begin_pattern=r'\*\*\* START OF THIS PROJECT GUTENBERG EBOOK', 
                        end_pattern=r'\*\*\* END OF THIS PROJECT GUTENBERG EBOOK'):
    header, footer = get_markers(txt, begin_pattern, end_pattern)
    return txt[header: footer].split()


tokens_german = extract_text_tokens(txt_german)
tokens_english = extract_text_tokens(txt_english)
tokens_french = extract_text_tokens(txt_french)
tokens_dutch = extract_text_tokens(txt_dutch)

print('tokens (german)', len(tokens_german))
print('tokens (english)', len(tokens_english))
print('tokens (french)', len(tokens_french))
print('tokens (dutch)', len(tokens_dutch))

tokens (german) 27218
tokens (english) 28562
tokens (french) 32758
tokens (dutch) 31506
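
As a minimal sketch of the extraction step, the snippet below runs the same marker logic on a tiny in-memory string instead of a downloaded ebook. The text in `sample` and the helper name `body_tokens` are made up for illustration. Note that whatever follows the start marker on its own line (the title and the closing asterisks) ends up in the token list, which is also true for the cell above.

```python
import re

# Hypothetical miniature "ebook", used only for illustration.
sample = (
    "Licence header text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Marley was dead: to begin with.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Licence footer text\n"
)

BEGIN = r'\*\*\* START OF THIS PROJECT GUTENBERG EBOOK'
END = r'\*\*\* END OF THIS PROJECT GUTENBERG EBOOK'


def body_tokens(txt, begin_pattern=BEGIN, end_pattern=END):
    """Return the whitespace-separated tokens between the two markers."""
    begin_match = re.search(begin_pattern, txt)
    end_match = re.search(end_pattern, txt)
    if begin_match is None or end_match is None:
        raise ValueError('start or end marker not found')
    return txt[begin_match.end():end_match.start()].split()


tokens = body_tokens(sample)
print(tokens)
# ['DEMO', '***', 'Marley', 'was', 'dead:', 'to', 'begin', 'with.']
```

Raising a `ValueError` when a marker is missing is a design choice of this sketch; it gives a clearer message if an ebook uses a different marker wording.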


## Data Preparation
Next we do some data cleaning. This means we remove special characters and numbers.

In [2]:
import re

def remove_special_chars(x):
    
    chars = ['_', '(', ')', '*', '"', '[', ']', '?', '!', ',', '.', '»', '«', ':', ';']
    for c in chars:
        x = x.replace(c, '')
    
    # remove numbers
    x = re.sub(r'\d', '', x)
    
    return x

def clean_data(features): 
    # strip, remove special characters and numbers
    tokens = [remove_special_chars(x.strip()) for x in features]
    
    cleaned = []
    
    # only use words with length > 1
    for t in tokens:
        if len(t) > 1:
            cleaned.append(t)
            
    return cleaned

cleaned_tokens_english = clean_data(tokens_english)
cleaned_tokens_german = clean_data(tokens_german)
cleaned_tokens_french = clean_data(tokens_french)
cleaned_tokens_dutch = clean_data(tokens_dutch)


print('cleaned tokens (german)', len(cleaned_tokens_german))
print('cleaned tokens (french)', len(cleaned_tokens_french))
print('cleaned tokens (dutch)', len(cleaned_tokens_dutch))
print('cleaned tokens (english)', len(cleaned_tokens_english))


cleaned tokens (german) 27181
cleaned tokens (french) 31995
cleaned tokens (dutch) 31405
cleaned tokens (english) 27527
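
To make the effect of the cleaning step concrete, here is a small sketch applied to a handful of made-up tokens. It uses `str.translate` instead of repeated `replace` calls, which is an alternative way to drop the same characters; the names `SPECIAL_CHARS` and `clean_token` are illustrative and not part of the snippet above.

```python
import re

# Same characters as in the list above, written as one string.
SPECIAL_CHARS = '_()*"[]?!,.»«:;'
DELETE_TABLE = str.maketrans('', '', SPECIAL_CHARS)


def clean_token(token):
    """Strip whitespace, drop special characters, then drop digits."""
    return re.sub(r'\d', '', token.strip().translate(DELETE_TABLE))


raw = ['"Bah!"', 'said', 'Scrooge,', '1843', 'a', '«Humbug!»']

# Keep only tokens that still have more than one character after cleaning.
cleaned = [t for t in (clean_token(x) for x in raw) if len(t) > 1]
print(cleaned)
# ['Bah', 'said', 'Scrooge', 'Humbug']
```

The pure number `1843` becomes an empty string and the single letter `a` is too short, so both are dropped.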


Now we create 1300 text samples per language, each with 20 tokens (words). These samples will later be 
used to train and test our model.

In [3]:
from sklearn.utils import resample

max_tokens = 20
max_samples = 1300


def create_text_sample(x):
    
    data = []
    text = []
    for i, f in enumerate(x):
        text.append(f)
        # close the sample once it holds max_tokens tokens
        if (i + 1) % max_tokens == 0:
            data.append(' '.join(text))
            text = []
    return data
    

sample_german = resample(create_text_sample(cleaned_tokens_german), replace=False, n_samples=max_samples)
sample_french = resample(create_text_sample(cleaned_tokens_french), replace=False, n_samples=max_samples)
sample_dutch = resample(create_text_sample(cleaned_tokens_dutch), replace=False, n_samples=max_samples)
sample_english = resample(create_text_sample(cleaned_tokens_english), replace=False, n_samples=max_samples)

print('samples (german)', len(sample_german))
print('samples (french)', len(sample_french))
print('samples (dutch)', len(sample_dutch))
print('samples (english)', len(sample_english))


samples (german) 1300
samples (french) 1300
samples (dutch) 1300
samples (english) 1300
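
The chunking idea can also be sketched with list slicing. This is an illustrative variant, not the function used above: `chunk_tokens` is a hypothetical helper that joins every `size` consecutive tokens into one sample and drops an incomplete trailing chunk.

```python
def chunk_tokens(tokens, size):
    """Join consecutive groups of `size` tokens; drop an incomplete tail."""
    return [' '.join(tokens[i:i + size])
            for i in range(0, len(tokens) - size + 1, size)]


words = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
chunks = chunk_tokens(words, 3)
print(chunks)
# ['a b c', 'd e f']
```

With 7 tokens and a chunk size of 3, the last token `g` does not fill a chunk and is left out.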


A text sample looks like this.

In [4]:
print('English sample:\n------------------')
print(sample_english[0])
print('------------------')


English sample:
------------------
ferocious condescension and threw him into dreadful state of mind by shaking hands with him He then conveyed him and
------------------


## Choosing a model
As the classifier we use MultinomialNB together with the TfidfVectorizer.

First we create the data structure which we will use to train the model. 

```
{
    samples: {
        text: [],
        target: []
    },
    labels: []
}
```


In [5]:
class dotdict(dict):
    """dot.notation access to dictionary attributes"""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

def create_data_structure(**kwargs):
    data = dotdict({'labels':[]})
    data.samples = dotdict({'text': [], 'target': []})
    
    label = 0
    for name, value in kwargs.items():
        data.labels.append(name)
        for i in value:
            data.samples.text.append(i)
            data.samples.target.append(label)
        label += 1
            
    
    return data

data = create_data_structure(de = sample_german, en = sample_english, 
                             fr = sample_french, nl = sample_dutch)

print('labels: ', data.labels)
print('target (labels encoded): ', set(data.samples.target))
print('samples: ', len(data.samples.text))

labels:  ['de', 'en', 'fr', 'nl']
target (labels encoded):  {0, 1, 2, 3}
samples:  5200
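
The relation between `labels` and `target` can be shown on a toy input. The sketch below uses plain lists and made-up sentences; the variable names are illustrative. Each target is simply the position of the sample's language in `labels`, so a predicted number can be turned back into a language code by indexing.

```python
# Hypothetical toy input: language code -> list of text samples.
samples_by_language = {
    'de': ['guten tag'],
    'en': ['good day', 'hello there'],
}

labels = list(samples_by_language)   # ['de', 'en']
texts, targets = [], []
for index, name in enumerate(labels):
    for sample in samples_by_language[name]:
        texts.append(sample)
        targets.append(index)        # encode the label as its position

print(targets)                       # [0, 1, 1]
print([labels[t] for t in targets])  # ['de', 'en', 'en']
```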


## Training
It's important to shuffle the data and split it into a training set (70%) and a test set (30%).

In [6]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data.samples.text, data.samples.target, test_size=0.30)

print('train size (x, y): ', len(x_train),  len(y_train))
print('test size (x, y): ', len(x_test), len(y_test))


train size (x, y):  3640 3640
test size (x, y):  1560 1560
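
The split above is random, so the class counts in the two sets differ slightly from run to run. As a hedged variant, `train_test_split` also accepts `stratify` and `random_state`, which keep the class proportions equal in both sets and make the split repeatable. The toy data below is made up for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 4 languages with 10 samples each.
texts = ['%s sample %d' % (lang, i)
         for lang in ('de', 'en', 'fr', 'nl') for i in range(10)]
targets = [label for label in range(4) for _ in range(10)]

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, targets, test_size=0.30, stratify=targets, random_state=42)

print(len(x_tr), len(x_te))   # 28 12
print(sorted(Counter(y_te).items()))
# [(0, 3), (1, 3), (2, 3), (3, 3)]
```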


We connect all our parts (vectorizer and classifier) into one _Machine Learning Pipeline_. This makes it easier and faster to run through all processing steps when building a model.

The TfidfVectorizer uses the word analyzer and a minimum document frequency of 10, and it converts the text to lowercase. The cleaning step above did not lowercase the tokens, so the vectorizer takes care of that. We also provide some stop words that the model should ignore: character names that are likely to occur in every translation and therefore say little about the language.

The MultinomialNB classifier will use the default alpha value of 1.0.
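
For reference, this is the smoothed estimate that alpha controls, as described in the scikit-learn documentation for multinomial Naive Bayes. The symbols are notation introduced here, not names from the snippet:

$$
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_{c} + \alpha n}
$$

Here $N_{ci}$ is the summed weight of feature $i$ over all training samples of class $c$, $N_{c} = \sum_{i} N_{ci}$, and $n$ is the number of features. With $\alpha = 1$ (Laplace smoothing), a word that never occurs in one language still gets a small non-zero probability for that language instead of ruling it out completely.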

Here you can play around with the settings. In the next section you see how to evaluate your model.

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

stopwords = ['scrooge', 'scrooges', 'bob']

pipeline = Pipeline([('vect', TfidfVectorizer(analyzer='word', 
                            min_df=10, lowercase=True, stop_words=stopwords)),
                      ('clf', MultinomialNB(alpha=1.0))])
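
To see what `min_df`, `lowercase` and `stop_words` do to the vocabulary, here is a small sketch on three made-up documents. It uses `min_df=2` instead of 10 only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only.
docs = ['The cat sat', 'the dog sat', 'le chat noir']

# Keep only words that occur in at least 2 documents.
plain = TfidfVectorizer(analyzer='word', min_df=2, lowercase=True).fit(docs)
print(sorted(plain.vocabulary_))
# ['sat', 'the']

# Same settings, but 'the' is additionally excluded as a stop word.
filtered = TfidfVectorizer(analyzer='word', min_df=2, lowercase=True,
                           stop_words=['the']).fit(docs)
print(sorted(filtered.vocabulary_))
# ['sat']
```

Because of `lowercase=True`, `The` and `the` count as the same word and together reach the document frequency of 2.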


## Evaluation

In this step we evaluate the performance of our classifier in two ways:
- Evaluate the model with k-fold on the training set
- Evaluate the final model with the test set

Let's evaluate our model with k-fold cross-validation. The output of this evaluation can be used to tune the model and its settings.

In [9]:
from sklearn.model_selection import KFold
from sklearn import model_selection


folds = 4
kf = KFold(n_splits=folds)


for scoring in ['f1_weighted', 'accuracy']:

    scores = model_selection.cross_val_score(pipeline, X=x_train, y=y_train, 
                                         cv=kf, scoring=scoring)
    print(scoring)
    print('scores: %s' % scores )
    print(scoring + ': %0.6f (+/- %0.4f)' % (scores.mean(), scores.std() * 2))
    print()




f1_weighted
scores: [1.        0.9989011 1.        0.9989011]
f1_weighted: 0.999451 (+/- 0.0011)

accuracy
scores: [1.        0.9989011 1.        0.9989011]
accuracy: 0.999451 (+/- 0.0011)
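
The summary line printed for each metric is the mean of the fold scores plus or minus two standard deviations. The arithmetic can be checked with the standard library alone, using the rounded fold scores from the output above. `statistics.pstdev` is the population standard deviation, which is what NumPy's `std()` computes by default.

```python
import statistics

# Fold scores as printed above (rounded by the output formatting).
fold_scores = [1.0, 0.9989011, 1.0, 0.9989011]

mean = statistics.mean(fold_scores)
spread = 2 * statistics.pstdev(fold_scores)

print('%0.6f (+/- %0.4f)' % (mean, spread))
# 0.999451 (+/- 0.0011)
```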



In [11]:
from sklearn import metrics

predicted = model_selection.cross_val_predict(pipeline, X=x_train, y=y_train, cv=folds)
print(metrics.classification_report(y_train, predicted, 
                                    target_names=data.labels, digits=4))

              precision    recall  f1-score   support

          de     1.0000    0.9989    0.9995       910
          en     0.9978    1.0000    0.9989       900
          fr     1.0000    1.0000    1.0000       906
          nl     1.0000    0.9989    0.9995       924

    accuracy                         0.9995      3640
   macro avg     0.9994    0.9995    0.9994      3640
weighted avg     0.9995    0.9995    0.9995      3640



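We build the final model from the training portion of the fold that had the best score, rather than from the whole training set. This is a choice made for this snippet. The more common practice is to refit on the entire training set once the settings are fixed, because training on less data does not by itself protect against overfitting.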

In [12]:
import numpy as np

def select_best_kfold(x, y, kf, scores):

    splits = list(kf.split(x))
    
    # index of the first fold with the highest score
    score_index = np.argmax(scores)
    train_index = splits[score_index][0]
    
    return np.array(x)[train_index], np.array(y)[train_index]

x_final, y_final = select_best_kfold(x_train, y_train, kf, scores)
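
The following sketch shows what `kf.split` yields and how the training indices of the best fold are picked. The eight items and the fold scores are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

items = np.array(list('abcdefgh'))

# Without shuffling, KFold cuts the data into consecutive blocks.
kf_demo = KFold(n_splits=4)
splits = list(kf_demo.split(items))   # list of (train_index, test_index) pairs

# Hypothetical fold scores; argmax returns the first of the tied maxima.
fold_scores = np.array([0.90, 0.95, 0.95, 0.80])
best = int(np.argmax(fold_scores))    # 1

train_index, test_index = splits[best]
print(items[train_index])   # ['a' 'b' 'e' 'f' 'g' 'h']
print(items[test_index])    # ['c' 'd']
```

The selected training data is everything except the held-out block of the best fold, which is three quarters of the input when four folds are used.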

Next we build the model and evaluate the result against our test set.

In [13]:
text_clf = pipeline.fit(x_final, y_final)

predicted = text_clf.predict(x_test)

print(metrics.classification_report(y_test, predicted, target_names=data.labels, digits=4))

              precision    recall  f1-score   support

          de     1.0000    1.0000    1.0000       390
          en     1.0000    1.0000    1.0000       400
          fr     1.0000    1.0000    1.0000       394
          nl     1.0000    1.0000    1.0000       376

    accuracy                         1.0000      1560
   macro avg     1.0000    1.0000    1.0000      1560
weighted avg     1.0000    1.0000    1.0000      1560



## Examine the features of the model

Let's look at the ten highest-weighted features for each language.

In [14]:
# show most informative features
def show_top10(classifier, vectorizer, categories):

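    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        # MultinomialNB.coef_ was removed in scikit-learn 1.1; feature_log_prob_ holds the same values
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]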
        print("%s: %s" % (category, " ".join(feature_names[top10])))


show_top10(text_clf.named_steps['clf'], text_clf.named_steps['vect'], data.labels)

de: ein ich das es zu sie er die der und
en: that was in his he it of to and the
fr: vous une que un les il et la le de
nl: op te dat zijn hij van de het een en


Let's see which features our model has, and how many.

In [15]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = np.asarray(text_clf.named_steps['vect'].get_feature_names_out())

print('number of features: %d' % len(feature_names))
print('first features: %s'% feature_names[0:10])
print('last features: %s' % feature_names[-10:])

number of features: 727
first features: ['aan' 'aber' 'about' 'after' 'again' 'ai' 'air' 'al' 'all' 'alle']
last features: ['zou' 'zu' 'zwei' 'écria' 'étaient' 'était' 'été' 'één' 'être' 'über']


## New data
Let's try out the classifier with new data.

In [16]:
new_data = ['Hallo mein Name ist Hugo.', 
            'Hi my name is Hugo.', 
            'Bonjour mon nom est Hugo.',
            'Hallo mijn naam is Hugo.',
            'Eins, zwei und drei.',
            'One, two and three.',
            'Un, deux et trois.',
            'Een, twee en drie.'
           ]

predicted = text_clf.predict(new_data)
probs = text_clf.predict_proba(new_data)
for i, p in enumerate(predicted):
    print(new_data[i], ' --> ', data.labels[p], ', prob:' , max(probs[i]))
    

Hallo mein Name ist Hugo.  -->  de , prob: 0.9088014030190799
Hi my name is Hugo.  -->  en , prob: 0.7959367420741617
Bonjour mon nom est Hugo.  -->  fr , prob: 0.9316852196347436
Hallo mijn naam is Hugo.  -->  nl , prob: 0.846662875920766
Eins, zwei und drei.  -->  de , prob: 0.9082485845051248
One, two and three.  -->  en , prob: 0.972644599266688
Un, deux et trois.  -->  fr , prob: 0.9832264124082083
Een, twee en drie.  -->  nl , prob: 0.9612448470425361
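
The same predict and `predict_proba` calls can be tried in isolation with a toy pipeline. The sketch below trains on four made-up sentences in two languages, so it only illustrates the mechanics and says nothing about the quality of the real model. The names `toy_clf`, `toy_texts` and `toy_labels` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data: two sentences per language.
toy_texts = ['der hund und die katze', 'das haus und der garten',
             'the dog and the cat', 'the house and the garden']
toy_targets = [0, 0, 1, 1]
toy_labels = ['de', 'en']

toy_clf = Pipeline([('vect', TfidfVectorizer()),
                    ('clf', MultinomialNB(alpha=1.0))])
toy_clf.fit(toy_texts, toy_targets)

new_texts = ['der hund und das haus', 'the cat and the house']
probs = toy_clf.predict_proba(new_texts)   # one row per text, one column per class

for text, row in zip(new_texts, probs):
    best = row.argmax()                    # column with the highest probability
    print(text, ' --> ', toy_labels[best], ', prob:', row[best])
```

Each row of `probs` sums to 1, and the predicted class is the column with the largest value, which is what `max(probs[i])` reports in the cell above.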
