# Binary Text Classification (scikit-learn) with Naive Bayes

In this __Machine Learning Snippet__ we use scikit-learn (http://scikit-learn.org/) and ebooks from Project Gutenberg (https://www.gutenberg.org/) to build a binary text classifier that distinguishes German from English text.

For our snippet we use the following ebooks:
- Alice's Adventures in Wonderland by Lewis Carroll (English), https://www.gutenberg.org/ebooks/28885
- Alice's Abenteuer im Wunderland by Lewis Carroll (German), https://www.gutenberg.org/ebooks/19778

__Note:__
The eBooks are for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

### Data Preparation

Prepare the English and German texts. We cut off the Project Gutenberg header and footer using fixed character offsets; this is imprecise but sufficient here. The steps are:
- cut off header / footer
- convert to lowercase
- tokenize (split on whitespace)
- remove special chars
- remove numbers
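
The fixed offsets are only approximate. As a minimal sketch, assuming the files contain the marker lines that Project Gutenberg plain-text files usually carry (lines beginning with `*** START OF` and `*** END OF`), the body could instead be located by searching for those markers. The function name `strip_gutenberg_boilerplate` is hypothetical, and the exact marker wording varies between files, so the patterns below are an assumption rather than a guarantee.

```python
import re

# Assumed marker format; the wording differs between Project Gutenberg files.
START_RE = re.compile(r'^\*{3}\s*START OF', re.MULTILINE)
END_RE = re.compile(r'^\*{3}\s*END OF', re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    Falls back to the unchanged input if either marker is missing.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    if start is None or end is None or end.start() <= start.end():
        return text
    # Skip the remainder of the START marker line itself.
    body_start = text.find('\n', start.end())
    if body_start == -1 or body_start >= end.start():
        return text
    return text[body_start + 1:end.start()].strip()
```

If the markers are not found, the function returns its input unchanged, so the fixed-offset slicing can still be used as a fallback.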



In [10]:
import re

# Read both ebooks; the with-blocks close the files after reading.
with open('data/pg19778.txt', 'r') as f:
    txt_german = f.read()
with open('data/pg28885.txt', 'r') as f:
    txt_english = f.read()

feat_german = txt_german[5000: len(txt_german) - 20000].lower().strip().split()
feat_english = txt_english[5000: len(txt_english) - 20000].lower().strip().split()

def remove_special_chars(x):
    
    chars = ['_', '(', ')', '*', '"', '[', ']', '?', '!', ',', '.', '»', '«', ':', ';']
    for c in chars:
        x = x.replace(c, '')
    
    # remove numbers
    x = re.sub(r'\d', '', x)
    
    return x

feat_english = [remove_special_chars(x) for x in feat_english]
feat_german = [remove_special_chars(x) for x in feat_german]

print('tokens (german)', len(feat_german))
print('tokens (english)', len(feat_english))


tokens (german) 24934
tokens (english) 26678
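
The cleaning steps can also be written with `str.translate`, which removes all listed characters in one pass. This is a sketch only: `SPECIAL_CHARS` and `clean_tokens` are hypothetical names, and unlike the cell above it drops tokens that become empty after cleaning, so it would yield different token counts from the ones printed above.

```python
import re

# Same characters as in remove_special_chars, written as one string.
SPECIAL_CHARS = '_()*"[]?!,.»«:;'
STRIP_TABLE = str.maketrans('', '', SPECIAL_CHARS)

def clean_tokens(text):
    """Lowercase, split on whitespace, strip special chars and digits."""
    tokens = text.lower().split()
    cleaned = (re.sub(r'\d', '', t.translate(STRIP_TABLE)) for t in tokens)
    # Assumption: empty tokens (e.g. a bare number) are not useful, so drop them.
    return [t for t in cleaned if t]
```

Dropping empty tokens would also avoid the double spaces that appear when text samples are joined back together.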


### Feature Extraction
Split each token list into text samples of about 30 tokens (words) each, matching `max_tokens` in the code below.

In [11]:
def create_text_sample(x):
    max_tokens = 30
    data = []
    text = []
    for i, f in enumerate(x):
        text.append(f)
        if i % max_tokens == 0 and i != 0:
            data.append(' '.join(text))
            text = []
    return data
    

sample_german = create_text_sample(feat_german)
sample_english = create_text_sample(feat_english)

print('samples (german)', len(sample_german))
print('samples (english)', len(sample_english))


samples (german) 831
samples (english) 889
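
Note that the loop above puts 31 tokens into the first sample (indices 0 to 30) and 30 into each later one, and it drops any leftover tokens at the end. A slicing-based variant that produces exactly `max_tokens` per sample might look like the sketch below; `chunk_tokens` and its `keep_remainder` flag are hypothetical names, not part of the original snippet.

```python
def chunk_tokens(tokens, max_tokens=30, keep_remainder=False):
    """Join consecutive groups of max_tokens tokens into text samples."""
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunk = tokens[start:start + max_tokens]
        # By default a short final chunk is discarded, as in the loop above.
        if len(chunk) == max_tokens or keep_remainder:
            chunks.append(' '.join(chunk))
    return chunks
```

Usage would mirror the original cell, for example `chunk_tokens(feat_german)`.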


We will use the text samples to train our binary classifier.

In [12]:
print('English sample:\n------------------')
print(sample_english[0])
print('------------------')


English sample:
------------------
 an unusually large saucepan flew close by it and very nearly carried it off  it grunted again so violently that she looked down into its face in some alarm
------------------


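### Modeling

Combine the German and English samples into one labelled dataset, then define a pipeline that feeds TF-IDF features into a multinomial Naive Bayes classifier.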



In [13]:
import argparse as ap

def create_sample(**kwargs):
    samples = {'data': [], 'target': [], 'target_names':[]}
    label = 0
    for name, value in kwargs.items():
        samples['target_names'].append(name)
        for i in value:
            samples['data'].append(i)
            samples['target'].append(label)
        label += 1
            
    
    return ap.Namespace(**samples)

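# Labels are assigned in keyword order. Python 3.6+ preserves that order
# (de=0, en=1); older versions do not guarantee it, and the recorded output
# below shows ['en', 'de'], i.e. en=0 and de=1.
data = create_sample(de=sample_german, en=sample_english)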



print('target names: ', data.target_names)
print('number of observations: ', len(data.data))


target names:  ['en', 'de']
number of observations:  1720
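
Because the label numbers depend on the order in which `**kwargs` is iterated, the mapping between label and language can differ between Python versions. One way to make it explicit is to pass an ordered list of `(name, samples)` pairs. This is a sketch under that assumption; `build_dataset` is a hypothetical helper, not the function used above.

```python
def build_dataset(named_samples):
    """Build data, target and target_names from ordered (name, texts) pairs.

    The position of each pair in the list becomes its integer label.
    """
    data, target, target_names = [], [], []
    for label, (name, texts) in enumerate(named_samples):
        target_names.append(name)
        data.extend(texts)
        target.extend([label] * len(texts))
    return data, target, target_names
```

With `build_dataset([('en', sample_english), ('de', sample_german)])`, label 0 would always mean English and label 1 German, regardless of the Python version.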


In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import model_selection
from sklearn import metrics
import numpy as np

def shuffle(x):
    index = np.random.permutation(len(x.data))

    X = np.array(x.data)[index]
    y = np.array(x.target)[index]
    
    return X, y

X, y = shuffle(data)

folds = 4

pipeline = Pipeline([('vect', TfidfVectorizer(analyzer='word', min_df=1, lowercase=True)),
                      ('clf', MultinomialNB()),])
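
To see what the two pipeline steps do, the following sketch fits the same kind of pipeline on a tiny, made-up corpus and inspects the learned vocabulary and class probabilities. The sentences and the 0 = English, 1 = German convention are illustrative assumptions, not data from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to make the sketch runnable.
toy_texts = [
    "the cat sat on the mat",
    "she looked at the white rabbit",
    "die katze sitzt auf der matte",
    "sie sah das weiße kaninchen an",
]
toy_labels = [0, 0, 1, 1]  # toy convention: 0 = English, 1 = German

toy_pipeline = Pipeline([
    ("vect", TfidfVectorizer(analyzer="word", min_df=1, lowercase=True)),
    ("clf", MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# The vectorizer maps each distinct word to a column index.
vocabulary = toy_pipeline.named_steps["vect"].vocabulary_
print(sorted(vocabulary))

# predict_proba returns one row per text and one column per class.
queries = ["the rabbit sat", "die matte"]
print(toy_pipeline.predict(queries))
print(toy_pipeline.predict_proba(queries))
```

The vectorizer turns each text into a weighted word-count vector, and the classifier then scores how likely those words are under each class.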



### Evaluation

In [15]:
scores = model_selection.cross_val_score(pipeline, X=X, y=y, cv=folds, scoring='f1_weighted')

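# Note: with scoring='f1_weighted' the value reported here is a weighted F1
# score, despite the "Accuracy" label in the message.
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))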
print(scores)


predicted = model_selection.cross_val_predict(pipeline, data.data, data.target, cv=folds)
print(metrics.classification_report(data.target, predicted, target_names=data.target_names))

Accuracy: 1.00 (+/- 0.00)
[ 1.  1.  1.  1.]
             precision    recall  f1-score   support

         en       1.00      1.00      1.00       889
         de       1.00      1.00      1.00       831

avg / total       1.00      1.00      1.00      1720
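
The cell above prints a weighted F1 score. To report accuracy and weighted F1 side by side, `cross_validate` accepts a list of scorers. The sketch below runs on a tiny, made-up corpus with `cv=2` purely so that it is self-contained; the sentences and variable names are illustrative assumptions, and the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to make the sketch runnable.
toy_texts = [
    "the cat sat on the mat", "she looked at the white rabbit",
    "the queen was in the garden", "alice went down the hole",
    "die katze sitzt auf der matte", "sie sah das weiße kaninchen an",
    "die königin war in dem garten", "alice ging in das loch",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # toy convention: 0 = English, 1 = German

toy_pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

# One array of per-fold scores is returned per metric, keyed as "test_<name>".
results = cross_validate(toy_pipeline, toy_texts, toy_labels, cv=2,
                         scoring=["accuracy", "f1_weighted"])
for metric in ("accuracy", "f1_weighted"):
    fold_scores = results["test_" + metric]
    print("%s: %0.2f (+/- %0.2f)" % (metric, fold_scores.mean(), fold_scores.std() * 2))
```

On the notebook's data the same call would take `pipeline`, `X`, `y` and `cv=folds` in place of the toy values.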



In [16]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=np.random.randint(low=0, high=10000))

text_clf = pipeline.fit(x_train, y_train)

predicted = text_clf.predict(x_test)


print(metrics.classification_report(y_test, predicted, target_names=data.target_names))

             precision    recall  f1-score   support

         en       1.00      1.00      1.00       188
         de       1.00      1.00      1.00       156

avg / total       1.00      1.00      1.00       344



In [17]:
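# Each tuple is (text, label). Only the text is passed to predict().
# In the recorded run target_names is ['en', 'de'], so the output
# array([1, 0]) below reads as German, English.
new_data = [
    ('Die größte Macht hat das richtige Wort zur richtigen Zeit.', 0),
    ('Leadership is the art of getting someone else to do something you want done because he wants to do it.', 1)
]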

predicted = text_clf.predict([x[0] for x in new_data])

In [18]:
predicted

array([1, 0])
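
The raw output is easier to read once the integer labels are mapped back to language names. The sketch below copies `target_names` and the predicted values from the recorded outputs above into plain lists so that it runs on its own; in the notebook, `data.target_names` and the `predicted` array could be used directly.

```python
# Values copied from the recorded outputs above.
target_names = ['en', 'de']
predicted = [1, 0]
texts = [
    'Die größte Macht hat das richtige Wort zur richtigen Zeit.',
    'Leadership is the art of getting someone else to do something '
    'you want done because he wants to do it.',
]

# Look up the language name for each predicted label index.
predicted_names = [target_names[i] for i in predicted]
for name, text in zip(predicted_names, texts):
    print(name, '<-', text)
```

Read this way, the German sentence is classified as `de` and the English one as `en`.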