# Text Binary Classification (scikit-learn) with Naive Bayes

In this __Machine Learning Snippet__ we use scikit-learn (http://scikit-learn.org/) and ebooks from Project Gutenberg (https://www.gutenberg.org/) to create a binary text classifier that distinguishes German from English text.

For our snippet we use the following ebooks:
- Alice's Adventures in Wonderland by Lewis Carroll (English), https://www.gutenberg.org/ebooks/28885
- Alice's Abenteuer im Wunderland by Lewis Carroll (German), https://www.gutenberg.org/ebooks/19778

__Note:__
The eBooks are for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

### Data Preparation

Prepare the English and German text. We cut off the Project Gutenberg header and footer of each ebook using fixed character offsets; this is not precise, but it does the job. The steps are:
- cut off header / footer
- convert to lowercase
- tokenize (split on whitespace)
- remove special chars
- remove numbers



In [7]:
import re

# read both ebooks; the context managers close the files again
with open('data/pg19778.txt', 'r') as f:
    txt_german = f.read()
with open('data/pg28885.txt', 'r') as f:
    txt_english = f.read()

feat_german = txt_german[5000: len(txt_german) - 20000].lower().strip().split()
feat_english = txt_english[5000: len(txt_english) - 20000].lower().strip().split()

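def remove_special_chars(x):
    # strip punctuation and quotation marks from a single token
    chars = ['_', '(', ')', '*', '"', '[', ']', '?', '!', ',', '.', '»', '«', ':', ';']
    for c in chars:
        x = x.replace(c, '')

    # remove numbers (raw string avoids an invalid-escape warning for \d)
    x = re.sub(r'\d', '', x)

    return x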

feat_english = [remove_special_chars(x) for x in feat_english]
feat_german = [remove_special_chars(x) for x in feat_german]

print('tokens (german)', len(feat_german))
print('tokens (english)', len(feat_english))


tokens (german) 24934
tokens (english) 26678
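The fixed offsets above are a rough cut. As an alternative, here is a minimal sketch that cuts at marker lines instead. It assumes the files contain lines starting with `*** START OF` and `*** END OF`, which is common in Project Gutenberg plain-text files but not guaranteed for every ebook. The helper name `strip_gutenberg_boilerplate` is hypothetical and not part of the original notebook.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the '*** START OF' and '*** END OF' lines.

    Assumption: the file uses these marker lines. If a marker is
    missing, that side of the text is left untouched.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith('*** START OF'):
            start = i + 1   # body begins after the start marker
        elif line.startswith('*** END OF'):
            end = i         # body ends before the end marker
            break
    return '\n'.join(lines[start:end]).strip()


# Tiny made-up document to show the behaviour
demo = (
    "Title page\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "Alice was beginning\n"
    "to get very tired\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ***\n"
    "License text"
)
body = strip_gutenberg_boilerplate(demo)
print(body)
```

If the markers are spelled differently in a given file, the function returns the text unchanged, so it is worth checking the result before tokenizing.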


### Feature Extraction
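Create text samples of about 20 tokens (words) each. Because the check in the loop first fires at index 20, the first sample contains 21 tokens and every later sample contains 20; leftover tokens at the end of the book are dropped.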

In [8]:
def create_text_sample(x):
    max_tokens = 20
    data = []
    text = []
    for i, f in enumerate(x):
        text.append(f)
        if i % max_tokens == 0 and i != 0:
            data.append(' '.join(text))
            text = []
    return data
    

sample_german = create_text_sample(feat_german)
sample_english = create_text_sample(feat_english)

print('samples (german)', len(sample_german))
print('samples (english)', len(sample_english))


samples (german) 1246
samples (english) 1333
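For comparison, here is a minimal sketch of a slicing-based variant in which every sample has exactly `size` tokens. The function name `chunk_tokens` and the toy token list are hypothetical; this is not the function used to produce the counts above.

```python
def chunk_tokens(tokens, size=20):
    """Join consecutive tokens into samples of exactly `size` tokens.

    A shorter remainder at the end of the list is dropped.
    """
    full = len(tokens) - len(tokens) % size   # number of tokens that fit evenly
    return [' '.join(tokens[i:i + size]) for i in range(0, full, size)]


# 45 made-up tokens -> two samples of 20 tokens, 5 tokens dropped
tokens = ['w%d' % i for i in range(45)]
samples = chunk_tokens(tokens, size=20)
print(len(samples))
print(len(samples[0].split()))
```

Tokens that consisted only of removed characters become empty strings; filtering them first with `[t for t in tokens if t]` would keep each sample at `size` visible words.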


We will use the text samples to train our binary classifier.

In [9]:
print('English sample:\n------------------')
print(sample_english[0])
print('------------------')


English sample:
------------------
 an unusually large saucepan flew close by it and very nearly carried it off  it grunted again so violently
------------------


### Modeling

- Create the data structure
- Split the data into training and test sets
- Create the Pipeline

In [10]:
import argparse as ap

def create_data_structure(**kwargs):
    samples = {'data': [], 'target': [], 'target_names':[]}
    label = 0
    for name, value in kwargs.items():
        samples['target_names'].append(name)
        for i in value:
            samples['data'].append(i)
            samples['target'].append(label)
        label += 1
            
    
    return ap.Namespace(**samples)

data = create_data_structure(de = sample_german, en = sample_english)



print('target names: ', data.target_names)
print('number of observations: ', len(data.data))


target names:  ['en', 'de']
number of observations:  2579
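In the cell above, the numeric labels follow the iteration order of `kwargs`. The printed order `['en', 'de']` differs from the call order (`de`, `en`), which can happen on Python versions before 3.6, where keyword-argument order was not preserved; on Python 3.6 and later the same call would yield `['de', 'en']`. A minimal sketch that makes the order explicit by passing a list of pairs is shown below. The name `create_labeled_data` and the toy strings are hypothetical.

```python
from types import SimpleNamespace


def create_labeled_data(labeled_samples):
    """Build data/target lists from (name, samples) pairs.

    The label of each class is its position in `labeled_samples`,
    so the mapping does not depend on keyword-argument ordering.
    """
    data, target, target_names = [], [], []
    for label, (name, samples) in enumerate(labeled_samples):
        target_names.append(name)
        data.extend(samples)
        target.extend([label] * len(samples))
    return SimpleNamespace(data=data, target=target, target_names=target_names)


# Made-up toy samples: 'en' gets label 0, 'de' gets label 1
toy = create_labeled_data([('en', ['the cat', 'a dog']), ('de', ['die katze'])])
print(toy.target_names)
print(toy.target)
```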


Split the data into a training set (80%) and a test set (20%).

In [11]:
from sklearn.model_selection import train_test_split
import numpy as np

x_train, x_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.20, random_state=np.random.randint(low=0, high=10000))

print('train size (x, y): ', len(x_train),  len(y_train))
print('test size (x, y): ', len(x_test), len(y_test))



train size (x, y):  2063 2063
test size (x, y):  516 516
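The split above draws a new random seed on every run, so the exact partition changes each time. If a repeatable split is wanted, one option is a fixed `random_state` together with `stratify`, which keeps the class proportions equal in both parts. The sketch below uses made-up toy data, and the seed value 42 is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Made-up toy data: 5 samples per class
texts = ['sample %d' % i for i in range(10)]
labels = [0] * 5 + [1] * 5

# Fixed seed -> same partition on every run; stratify -> balanced classes
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

print(len(x_tr), len(x_te))
print(sorted(y_te))
```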


Create the Machine Learning Pipeline (model)

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn import model_selection
from sklearn import metrics

pipeline = Pipeline([('vect', TfidfVectorizer(analyzer='word', min_df=1, lowercase=True)),
                      ('clf', MultinomialNB()),])
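To see what the two pipeline steps do together, here is a minimal sketch on four made-up sentences: `TfidfVectorizer` turns each text into a weighted word-count vector, and `MultinomialNB` learns which words are typical for each class. The toy sentences and the labels (0 for English, 1 for German) are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training sentences: label 0 = English, label 1 = German
toy_texts = [
    'the cat sat on the mat',
    'she went down the rabbit hole',
    'die katze sitzt auf der matte',
    'sie fiel in das kaninchenloch',
]
toy_labels = [0, 0, 1, 1]

toy_pipeline = Pipeline([
    ('vect', TfidfVectorizer(analyzer='word')),  # text -> tf-idf vector
    ('clf', MultinomialNB()),                    # vector -> class label
])
toy_pipeline.fit(toy_texts, toy_labels)

# Predict on unseen sentences that reuse words from the training texts
pred = toy_pipeline.predict(['the rabbit sat down', 'die katze fiel'])
print(pred)
```

Because the vectorizer is part of the pipeline, raw strings can be passed to `fit` and `predict` directly.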


### Evaluation
- Evaluate the model with 5-fold cross-validation on the training set
- Evaluate the model on the holdout set (test set)

In [13]:
folds = 5
scores = model_selection.cross_val_score(pipeline, X=x_train, y=y_train, cv=folds, scoring='accuracy')

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
print(scores)



Accuracy: 1.00 (+/- 0.00)
[ 1.  1.  1.  1.  1.]
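The summary line reports the mean of the fold scores plus or minus two standard deviations. With five perfect folds the spread is zero, so the arithmetic is easier to follow on hypothetical scores. The sketch below uses made-up values and the population standard deviation, which is what NumPy's `std()` computes by default.

```python
from statistics import mean, pstdev

# Hypothetical fold scores, NOT results from this notebook
fold_scores = [0.98, 1.0, 0.99, 1.0, 0.97]

# Same formula as in the cell above: mean and two standard deviations
summary = "Accuracy: %0.2f (+/- %0.2f)" % (mean(fold_scores), pstdev(fold_scores) * 2)
print(summary)
```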


In [14]:
predicted = model_selection.cross_val_predict(pipeline, X=x_train, y=y_train, cv=folds)
print(metrics.classification_report(y_train, predicted, target_names=data.target_names))

             precision    recall  f1-score   support

         en       1.00      1.00      1.00      1061
         de       1.00      1.00      1.00      1002

avg / total       1.00      1.00      1.00      2063



In [15]:
text_clf = pipeline.fit(x_train, y_train)

predicted = text_clf.predict(x_test)


print(metrics.classification_report(y_test, predicted, target_names=data.target_names))

             precision    recall  f1-score   support

         en       1.00      1.00      1.00       272
         de       1.00      1.00      1.00       244

avg / total       1.00      1.00      1.00       516
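A classification report condenses the predictions into per-class ratios. When results are less clean than here, a confusion matrix shows which class is mistaken for which. The following minimal sketch uses made-up labels (not predictions from this notebook) to illustrate `metrics.confusion_matrix`.

```python
from sklearn import metrics

# Hypothetical labels: 0 = first class, 1 = second class
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]   # one sample of class 0 predicted as class 1

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true, y_pred)
print(cm)
print(metrics.accuracy_score(y_true, y_pred))
```

On the notebook's data the same call would be `metrics.confusion_matrix(y_test, predicted)`.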

