# Introduction
Just for fun and knowledge sharing, I made an NLP classifier to tell apart the emails of two of my colleagues. The emails were chosen at random, and obvious clues such as email signatures were removed from the dataset. Surprisingly, the classifier worked quite well with minimal hyperparameter tuning.

The approach is to vectorize each email into a bag-of-words representation using one of the following vectorizers:
* CountVectorizer with individual counts for each word
* CountVectorizer with binary counts
* TF-IDF
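
To make the difference between these three options concrete, here is a minimal sketch on a two-document toy corpus. The corpus is invented for illustration and is not part of the email dataset; the variable names are likewise assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration (not taken from the email dataset)
docs = ["please review the deck", "the deck the deck"]

count_vec = CountVectorizer()              # how many times each word occurs
binary_vec = CountVectorizer(binary=True)  # 1 if the word occurs at all, else 0
tfidf_vec = TfidfVectorizer()              # counts reweighted by how rare a word is

counts = count_vec.fit_transform(docs).toarray()
binary = binary_vec.fit_transform(docs).toarray()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# Column order of the matrices: vocabulary terms sorted by column index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts)
print(binary)
print(tfidf.round(2))
```

In the second document, "the" and "deck" each appear twice, so the count matrix holds 2 where the binary matrix holds 1. The TF-IDF rows are length-normalized by default, so their values are not directly comparable to raw counts.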

Different algorithms tend to work better with different vectorizers, so Random Forest, Multinomial Naive Bayes, and Bernoulli Naive Bayes were each tried and compared.
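
One way to compare every vectorizer/algorithm pairing in a single pass is sketched below. It is not how the cells in this notebook are organized (they switch combinations by commenting lines in and out); the toy texts, the `vectorizers` and `models` dictionaries, and the `results` name are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Hypothetical toy data standing in for the real emails
train_text = ["hi all attached is the deck", "thanks to you all",
              "hi all please see the deck", "can you check the numbers",
              "the numbers look wrong to me", "please check the report numbers"]
train_label = ['j', 'j', 'j', 'r', 'r', 'r']
test_text = ["attached is the deck thanks", "check the report"]
test_label = ['j', 'r']

# Factories so that every combination gets a fresh, unfitted instance
vectorizers = {
    'count': lambda: CountVectorizer(),
    'binary': lambda: CountVectorizer(binary=True),
    'tfidf': lambda: TfidfVectorizer(),
}
models = {
    'RandomForest': lambda: RandomForestClassifier(n_estimators=10, random_state=0),
    'MultinomialNB': lambda: MultinomialNB(),
    'BernoulliNB': lambda: BernoulliNB(),
}

results = {}
for vec_name, make_vec in vectorizers.items():
    for model_name, make_model in models.items():
        pipe = Pipeline([('vec', make_vec()), ('clf', make_model())])
        pipe.fit(train_text, train_label)
        results[(vec_name, model_name)] = accuracy_score(test_label, pipe.predict(test_text))

for combo in sorted(results):
    print('{} + {}: {:.3f}'.format(combo[0], combo[1], results[combo]))
```

With a test set this small the accuracies are very coarse, so the loop is mainly useful for spotting combinations that clearly fail.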

Model performance was assessed simply with separate training and testing sets. Hyperparameter tuning with cross-validation is not performed in this notebook, since it is not the focus of this exercise.
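
For reference, cross-validation on a text pipeline could look roughly like the sketch below. It is not run anywhere in this notebook; the toy texts, the fold count, and the choice of `MultinomialNB` are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four short "emails" per author
texts = ["hi all attached is the deck", "thanks to you all",
         "hi all please see the deck", "great support thanks all",
         "can you check the numbers", "the numbers look wrong to me",
         "please check the report numbers", "the report is wrong again"]
labels = ['j', 'j', 'j', 'j', 'r', 'r', 'r', 'r']

# Putting the vectorizer inside the pipeline means the vocabulary is
# learned from the training folds only, never from the held-out fold
pipe = Pipeline([('vec', CountVectorizer()), ('clf', MultinomialNB())])

# Stratified folds keep both authors represented in every split
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring='accuracy')

print(scores)
print('mean accuracy: {:.3f}'.format(scores.mean()))
```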

In [355]:
import pandas as pd
import os
import codecs
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import classification_report,accuracy_score
from sklearn.pipeline import Pipeline

pd.options.display.max_columns = 500
pd.options.display.max_colwidth = 200

In [219]:
j_train_path = 'data/j/training'
j_test_path = 'data/j/testing'
r_train_path = 'data/r/training'
r_test_path = 'data/r/testing'

j_train_files = os.listdir(j_train_path)
j_test_files = os.listdir(j_test_path)
r_train_files = os.listdir(r_train_path)
r_test_files = os.listdir(r_test_path)

In [220]:
os.path.join(j_train_path,'j1.txt')

'data/j/training\\j1.txt'

In [221]:
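def read_data(path,files):
    # Read each file as UTF-8 text and return the contents as a list of strings
    data=[]
    for i in files:
        # The with-block closes the file even if reading raises an error
        with codecs.open(os.path.join(path,i),'r',encoding='utf-8') as f:
            data.append(f.read())
    return data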

In [222]:
j_train_data = read_data(j_train_path,j_train_files)
j_test_data = read_data(j_test_path,j_test_files)
r_train_data = read_data(r_train_path,r_train_files)
r_test_data = read_data(r_test_path,r_test_files)

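df_train = pd.DataFrame()
df_train['text'] = j_train_data+r_train_data
# Derive the label counts from the data instead of hard-coding the number of files
df_train['label'] = ['j'] * len(j_train_data) + ['r'] * len(r_train_data)
df_test = pd.DataFrame()
df_test['text'] = j_test_data + r_test_data
df_test['label'] = ['j'] * len(j_test_data) + ['r'] * len(r_test_data)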

In [354]:
x_train = df_train['text']
x_test = df_test['text']
y_train = df_train['label']
y_test = df_test['label']

# vec = CountVectorizer(ngram_range=(1,3))
# x_train_dtm = vec.fit_transform(x_train)
# x_test_dtm = vec.transform(x_test)

vec = CountVectorizer(ngram_range=(1,3),binary=True)
x_train_dtm = vec.fit_transform(x_train)
x_test_dtm = vec.transform(x_test)

# vec = TfidfVectorizer(ngram_range=(1,3))
# x_train_dtm = vec.fit_transform(x_train)
# x_test_dtm = vec.transform(x_test)

In [358]:
# model = MultinomialNB()
# model = RandomForestClassifier()
model = BernoulliNB()
model.fit(x_train_dtm,y_train)

pred = model.predict(x_test_dtm)
# Function-call form of print runs under both Python 2 and Python 3
print(classification_report(y_test,pred))
print('accuracy: {}'.format(accuracy_score(y_test,pred)))

             precision    recall  f1-score   support

          j       1.00      0.75      0.86         4
          r       0.80      1.00      0.89         4

avg / total       0.90      0.88      0.87         8

accuracy: 0.875


In [348]:
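# Chain the vectorizer and the classifier so raw text can be passed in directly;
# fitting the pipeline refits both steps on the training text
pipe = Pipeline([('vec',vec),('NB',model)])
pipe.fit(x_train,y_train)
pred = pipe.predict(x_test)

# Function-call form of print runs under both Python 2 and Python 3
print(classification_report(y_test,pred))
print('accuracy: {}'.format(accuracy_score(y_test,pred)))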

             precision    recall  f1-score   support

          j       0.80      1.00      0.89         4
          r       1.00      0.75      0.86         4

avg / total       0.90      0.88      0.87         8

accuracy: 0.875


In [352]:
blah = ['''
Hi all,

Dave asked Steve, Robert and myself to give a brief overview of the pull system to the SLT this morning.

Attached is the deck.

It went down well with great support from Dave and Steve. Positive feedback from Scott Collins, Zhorzh, Parish, Rebecca, Doug Blackburn.

Thanks to you all,
''']

# Predicted author, then the class probabilities for the unseen email
print(pipe.predict(blah))
print(pipe.predict_proba(blah))

['j']
[[ 0.73178455  0.26821545]]
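
The columns of `predict_proba` follow the order of the classifier's `classes_` attribute, which scikit-learn stores in sorted order, so the first value above belongs to 'j' and the second to 'r'. The sketch below pairs each probability with its label explicitly; it uses a small invented training set rather than the real emails, so its numbers will differ from the output above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real emails
texts = ["hi all attached is the deck", "thanks to you all",
         "can you check the numbers", "the numbers look wrong to me"]
labels = ['j', 'j', 'r', 'r']

pipe = Pipeline([('vec', CountVectorizer(binary=True)), ('NB', BernoulliNB())])
pipe.fit(texts, labels)

# classes_ on the final step gives the label for each probability column
classes = pipe.named_steps['NB'].classes_
proba = pipe.predict_proba(["attached is the deck"])[0]

for label, p in zip(classes, proba):
    print('{}: {:.3f}'.format(label, p))
```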
