# Text Binary Classification (scikit-learn) with Naive Bayes

In this __Machine Learning Snippet__ we use scikit-learn (http://scikit-learn.org/) and ebooks from Project Gutenberg (https://www.gutenberg.org/) to create a binary text classifier that distinguishes German from English text.

For our snippet we use the following ebooks:
- Alice's Adventures in Wonderland by Lewis Carroll (English), https://www.gutenberg.org/ebooks/28885
- Alice's Abenteuer im Wunderland by Lewis Carroll (German), https://www.gutenberg.org/ebooks/19778

__Note:__
The eBooks are for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org

### Data Preparation

Prepare the English and German text. We cut off the header and footer of each ebook using fixed character offsets; this is not precise, but it is good enough here. The preparation steps are:
- cut off header / footer
- convert to lowercase
- tokenize (split on whitespace)
- remove special characters
- remove digits



In [67]:
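import re

def read_text(path):
    # Read a whole ebook; the files are assumed to be UTF-8 encoded.
    with open(path, 'r', encoding='utf-8') as f:
        return f.read()

txt_german = read_text('data/pg19778.txt')
txt_english = read_text('data/pg28885.txt')

# Roughly cut off the Project Gutenberg header (first 5000 characters) and
# footer (last 20000 characters), then lowercase and split on whitespace.
feat_german = txt_german[5000: len(txt_german) - 20000].lower().split()
feat_english = txt_english[5000: len(txt_english) - 20000].lower().split()

def remove_special_chars(x):
    # Strip punctuation characters from a single token.
    chars = ['_', '(', ')', '*', '"', '[', ']', '?', '!', ',', '.', '»', '«', ':', ';']
    for c in chars:
        x = x.replace(c, '')

    # Remove digits (raw string, so \d is not treated as a string escape).
    x = re.sub(r'\d', '', x)

    return x

# Tokens that consist only of special characters or digits become empty
# strings; they are kept, so they still count as tokens below.
feat_english = [remove_special_chars(x) for x in feat_english]
feat_german = [remove_special_chars(x) for x in feat_german]

print('tokens (german)', len(feat_german))
print('tokens (english)', len(feat_english))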


tokens (german) 24934
tokens (english) 26678
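
The fixed offsets used above only approximate where the book text starts and ends. As a hedged alternative, the sketch below cuts at the marker lines that Project Gutenberg plain-text files usually contain (lines beginning with `*** START OF` and `*** END OF`). The helper name `strip_gutenberg_boilerplate` and the demo string are made up for illustration, and the exact marker wording varies between files, so the patterns should be checked against the actual downloads.

```python
import re

# Assumption: the ebook contains marker lines such as
# "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***" and
# "*** END OF THIS PROJECT GUTENBERG EBOOK ... ***".
START_RE = re.compile(r'^\*\*\*\s*START OF.*$', re.MULTILINE)
END_RE = re.compile(r'^\*\*\*\s*END OF.*$', re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines.

    Hypothetical helper: if a marker is missing, that side is left uncut.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()

# Made-up miniature "ebook" to show the behaviour.
demo = (
    "License header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n"
    "Alice was beginning to get very tired.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK DEMO ***\n"
    "License footer\n"
)
print(strip_gutenberg_boilerplate(demo))
```

Cutting at markers removes the license text but not front matter inside the book itself (title page, list of illustrations), so the token counts would differ from the ones printed above.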


### Feature Extraction
Split each token list into text samples of about 200 tokens (words) each. Tokens left over at the end that do not fill a sample are discarded.

In [68]:
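def create_text_sample(x):
    # Group the token list x into space-joined text samples.
    max_tokens = 200
    data = []
    text = []
    for i, f in enumerate(x):
        text.append(f)
        # Close a sample at every index that is a multiple of max_tokens.
        # The first sample therefore holds max_tokens + 1 tokens (indices
        # 0..200); all later samples hold exactly max_tokens.
        if i % max_tokens == 0 and i != 0:
            data.append(' '.join(text))
            text = []
    # Tokens still in `text` do not fill a sample and are discarded.
    return data


sample_german = create_text_sample(feat_german)
sample_english = create_text_sample(feat_english)

print('samples (german)', len(sample_german))
print('samples (english)', len(sample_english))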


samples (german) 124
samples (english) 133
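
The loop above closes a sample whenever the index `i` is a multiple of 200, so the first sample receives 201 tokens and every later one 200. If samples of exactly equal length are wanted, a slicing-based variant is one option. The sketch below is an illustrative alternative, not the code that produced the output above; `chunk_tokens` and the demo tokens are hypothetical names.

```python
def chunk_tokens(tokens, size=200):
    """Split tokens into space-joined samples of exactly `size` tokens.

    Hypothetical alternative to create_text_sample; a trailing remainder
    shorter than `size` is dropped.
    """
    # Index up to which the token list divides evenly into full samples.
    full = len(tokens) - len(tokens) % size
    return [' '.join(tokens[i:i + size]) for i in range(0, full, size)]

# Made-up tokens: 450 of them give two full samples, the last 50 are dropped.
demo_tokens = ['w%d' % i for i in range(450)]
demo_samples = chunk_tokens(demo_tokens)
print(len(demo_samples))
print([len(s.split()) for s in demo_samples])
```

With 24934 German and 26678 English tokens this variant would also yield 124 and 133 samples; only the sample boundaries shift by one token.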


We will use the text samples to train our binary classifier.

In [69]:
print('English sample:\n------------------')
print(sample_english[0])
print('------------------')


English sample:
------------------
 an unusually large saucepan flew close by it and very nearly carried it off  it grunted again so violently that she looked down into its face in some alarm  a mad tea-party  the queen turned angrily away from him and said to the knave turn them over  the queen never left off quarrelling with the other players and shouting off with his head or off with her head  the mock turtle drew a long breath and said that's very curious  who stole the tarts  at this the whole pack rose up into the air and came flying down upon her  chapter i sidenote down the rabbit-hole alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do once or twice she had peeped into the book her sister was reading but it had no pictures or conversations in it and what is the use of a book thought alice without pictures or conversations so she was considering in her own mind as well as she could for the hot day made her feel very sleepy and 

### Modeling
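
This section has no code in the source. As a minimal sketch, and only an assumption about how the modeling step could look, bag-of-words counts from scikit-learn's `CountVectorizer` can be fed into a `MultinomialNB` classifier. The toy sentences below are made-up stand-ins for `sample_english` and `sample_german`; the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for sample_english / sample_german from the cells above.
toy_english = [
    'alice was beginning to get very tired of sitting by her sister',
    'the queen turned angrily away from him and said to the knave',
]
toy_german = [
    'alice fing an sich zu langweilen sie saß schon lange bei ihrer schwester',
    'die königin wandte sich ärgerlich von ihm ab und sagte zu dem buben',
]

texts = toy_english + toy_german
labels = ['english'] * len(toy_english) + ['german'] * len(toy_german)

# Word counts per sample feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

predictions = model.predict([
    'the rabbit said to her sister',
    'die schwester sagte zu ihm',
])
print(predictions)
```

With the real data, `texts` would be the concatenation of the two sample lists, ideally split into a training and a held-out part before fitting.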

### Evaluation
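
This section is also empty in the source. One common way to evaluate a binary classifier is to compare predictions on held-out samples with the true labels, for example via accuracy and a confusion matrix from `sklearn.metrics`. The labels below are hypothetical and chosen only to show the calls; they are not results from this snippet.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels for eight held-out samples; not results from the source.
y_true = ['english', 'english', 'english', 'english',
          'german', 'german', 'german', 'german']
y_pred = ['english', 'english', 'english', 'german',
          'german', 'german', 'german', 'german']

# Fraction of samples whose predicted label matches the true label.
accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, ordered as in `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=['english', 'german'])

print('accuracy', accuracy)
print(matrix)
```

In this made-up example one English sample is misclassified as German, which shows up as the single off-diagonal entry in the first row of the matrix.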