# Introduction
Just for fun and knowledge sharing, I built an NLP classifier to tell apart emails written by two of my colleagues. The emails were chosen at random, and obvious clues such as email signatures were removed from the dataset. Surprisingly, the classifier worked quite well with minimal hyperparameter tuning.

The approach is to vectorize each email into a bag-of-words representation, trying several vectorizers:
* CountVectorizer with individual counts for each word
* CountVectorizer with binary counts
* TF-IDF

Different algorithms work better with different vectorizers, so Random Forest, Multinomial Naive Bayes, and Bernoulli Naive Bayes were each tried and compared.
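
To make the three representations listed above concrete, here is a minimal sketch on a made-up three-sentence corpus (not the email data), written in Python 3 syntax. The variable names `docs`, `vectorizers`, and `matrices` are only for illustration. It prints the document-term matrix each vectorizer produces, so the difference between raw counts, binary presence flags, and TF-IDF weights is visible side by side.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = ["please send the file", "the file the file", "good morning team"]

vectorizers = {
    "counts": CountVectorizer(),             # how many times each word occurs
    "binary": CountVectorizer(binary=True),  # 1 if the word occurs at all, else 0
    "tfidf": TfidfVectorizer(),              # counts reweighted by rarity, rows L2-normalized
}

matrices = {}
for name, vec in vectorizers.items():
    dtm = vec.fit_transform(docs)
    matrices[name] = dtm.toarray()
    # Column order follows the index each word was given in vocabulary_.
    vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
    print(name, vocab)
    print(matrices[name].round(2))
```

In the second document, "file" and "the" each appear twice, so the count matrix stores 2 for both while the binary matrix stores 1.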

Model performance was assessed simply with separate training and test sets. Hyperparameter tuning with cross-validation is not performed in this notebook, since it is not the focus of this exercise.
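
The notebook switches vectorizers and models by commenting lines in and out. As a rough sketch of the same hold-out comparison done in one pass, the loop below fits every vectorizer and model combination on a training set and scores it on a separate test set. The `toy_*` data is invented, and the `n_estimators` and `random_state` values are my own choices rather than settings from the notebook (Python 3 syntax).

```python
from itertools import product

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Invented stand-in data; swap in x_train, y_train, x_test, y_test.
toy_x_train = [
    "good morning team, the data looks fine",
    "team, I'll be chasing the actuals this morning",
    "interested in the data from the team",
    "can you start the file schedule",
    "the file was not on the remote share",
    "can you send the file to start",
]
toy_y_train = ["j", "j", "j", "r", "r", "r"]
toy_x_test = ["morning team, any data yet", "can you open the file"]
toy_y_test = ["j", "r"]

# Factories, so that every combination gets fresh, unfitted objects.
vectorizers = {
    "counts": lambda: CountVectorizer(ngram_range=(1, 3)),
    "binary": lambda: CountVectorizer(ngram_range=(1, 3), binary=True),
    "tfidf": lambda: TfidfVectorizer(ngram_range=(1, 3)),
}
models = {
    "RandomForest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "MultinomialNB": lambda: MultinomialNB(),
    "BernoulliNB": lambda: BernoulliNB(),
}

results = {}
for (vname, make_vec), (mname, make_model) in product(vectorizers.items(), models.items()):
    pipe = Pipeline([("vec", make_vec()), ("clf", make_model())])
    pipe.fit(toy_x_train, toy_y_train)
    results[(vname, mname)] = accuracy_score(toy_y_test, pipe.predict(toy_x_test))

for combo, acc in sorted(results.items()):
    print(combo, acc)
```

With test sets as small as the one used here, differences between combinations should be read as rough hints rather than a ranking.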

The author names are anonymized to **R** and **J**.

In [1]:
import pandas as pd
import os
import codecs
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.metrics import classification_report,accuracy_score
from sklearn.pipeline import Pipeline

pd.options.display.max_columns = 500
pd.options.display.max_colwidth = 200

In [2]:
j_train_path = 'data/j/training'
j_test_path = 'data/j/testing'
r_train_path = 'data/r/training'
r_test_path = 'data/r/testing'

j_train_files = os.listdir(j_train_path)
j_test_files = os.listdir(j_test_path)
r_train_files = os.listdir(r_train_path)
r_test_files = os.listdir(r_test_path)

In [3]:
os.path.join(j_train_path,'j1.txt')

'data/j/training\\j1.txt'

In [4]:
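def read_data(path,files):
    # Read every file in `files` (relative to `path`) as UTF-8 text
    data=[]
    for i in files:
        # The context manager closes the file even if reading fails
        with codecs.open(os.path.join(path,i),'r',encoding='utf-8') as f:
            data.append(f.read())
    return data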

In [68]:
j_train_data = read_data(j_train_path,j_train_files)
j_test_data = read_data(j_test_path,j_test_files)
r_train_data = read_data(r_train_path,r_train_files)
r_test_data = read_data(r_test_path,r_test_files)

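df_train = pd.DataFrame()
df_train['text'] = j_train_data+r_train_data
# Build labels from the actual list lengths (10 training emails per author here)
df_train['label'] = ['j'] * len(j_train_data) + ['r'] * len(r_train_data)
df_test = pd.DataFrame()
df_test['text'] = j_test_data + r_test_data
# 4 test emails per author here
df_test['label'] = ['j'] * len(j_test_data) + ['r'] * len(r_test_data)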

In [87]:
x_train = df_train['text']
x_test = df_test['text']
y_train = df_train['label']
y_test = df_test['label']

# vec = CountVectorizer(ngram_range=(1,3))
# x_train_dtm = vec.fit_transform(x_train)
# x_test_dtm = vec.transform(x_test)

# vec = CountVectorizer(ngram_range=(1,3),binary=True)
# x_train_dtm = vec.fit_transform(x_train)
# x_test_dtm = vec.transform(x_test)

vec = TfidfVectorizer(ngram_range=(1,3))
x_train_dtm = vec.fit_transform(x_train)
x_test_dtm = vec.transform(x_test)
x_test_dtm

<8x2494 sparse matrix of type '<type 'numpy.float64'>'
	with 253 stored elements in Compressed Sparse Row format>
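
The vocabulary grows to 2494 columns because `ngram_range=(1,3)` makes every run of one, two, or three consecutive words a separate feature. A small sketch of that expansion, using the vectorizer's own analyzer on an invented sentence (Python 3 syntax):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer applies to each document:
# lowercasing, tokenizing, then generating the requested n-grams.
analyzer = CountVectorizer(ngram_range=(1, 3)).build_analyzer()

ngrams = analyzer("Can you start the file")
print(len(ngrams))
print(ngrams)
```

Five words yield 5 unigrams, 4 bigrams, and 3 trigrams, so this one short sentence already contributes 12 features.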

In [88]:
model = MultinomialNB()
# model = RandomForestClassifier()
# model = BernoulliNB()
model.fit(x_train_dtm,y_train)

pred = model.predict(x_test_dtm)
print classification_report(y_test,pred)
print 'accuracy:',accuracy_score(y_test,pred)

             precision    recall  f1-score   support

          j       0.80      1.00      0.89         4
          r       1.00      0.75      0.86         4

avg / total       0.90      0.88      0.87         8

accuracy: 0.875


The results are not bad: 7 of the 8 test emails were classified correctly. I'm quite surprised at the accuracy, given that the training set is so small (10 emails per author). Let's dig deeper to see which words each person uses more!

In [89]:
Words_df = pd.DataFrame()
Words_df['Words'] = vec.get_feature_names()
# feature_count_ sums each feature's values per class (TF-IDF weights here, so not integer counts).
# Add 1 to every entry because some sums are 0, which would cause a division by zero in the ratio.
Words_df['j frequency'] = model.feature_count_[0,:] + 1
Words_df['r frequency'] = model.feature_count_[1,:] + 1

# j/r ratio: the higher this is, the more frequently the word is used by author J than by R
Words_df['j/r ratio'] = Words_df['j frequency'] / Words_df['r frequency']

In [78]:
# Words that J uses a lot more than R
Words_df.sort_values(by='j/r ratio',ascending=False).head(30)

| | Words | j frequency | r frequency | j/r ratio |
|---|---|---|---|---|
| 291 | data | 1.512376 | 1.0 | 1.512376 |
| 1221 | team | 1.384292 | 1.0 | 1.384292 |
| 49 | actuals | 1.356092 | 1.0 | 1.356092 |
| 735 | ll | 1.345634 | 1.0 | 1.345634 |
| 29 | a3s | 1.335734 | 1.0 | 1.335734 |
| 304 | dave | 1.311846 | 1.0 | 1.311846 |
| 805 | morning | 1.311116 | 1.0 | 1.311116 |
| 693 | interested | 1.247243 | 1.0 | 1.247243 |
| 1018 | robert | 1.243797 | 1.0 | 1.243797 |
| 201 | chasing | 1.240419 | 1.0 | 1.240419 |


In [90]:
# Words that R uses a lot more than J
Words_df.sort_values(by='j/r ratio',ascending=True).head(30)

| | Words | j frequency | r frequency | j/r ratio |
|---|---|---|---|---|
| 735 | file | 1.0 | 1.420557 | 0.703949 |
| 2170 | to start | 1.032511 | 1.415384 | 0.729492 |
| 1756 | start | 1.130045 | 1.532657 | 0.737311 |
| 1308 | not | 1.0 | 1.347106 | 0.742332 |
| 355 | can you | 1.0 | 1.342611 | 0.744818 |
| 1771 | start to | 1.0 | 1.330373 | 0.751669 |
| 1633 | schedule | 1.0 | 1.328618 | 0.752662 |
| 1563 | remote | 1.0 | 1.281789 | 0.78016 |
| 1946 | the file | 1.0 | 1.271522 | 0.786459 |
| 2308 | was | 1.0 | 1.268311 | 0.78845 |

Assembling the steps into a pipeline for convenience.

In [91]:
pipe = Pipeline([('vec',vec),('NB',model)])
pipe.fit(x_train,y_train)
pred = pipe.predict(x_test)

print classification_report(y_test,pred)
print 'accuracy:',accuracy_score(y_test,pred)

             precision    recall  f1-score   support

          j       0.80      1.00      0.89         4
          r       1.00      0.75      0.86         4

avg / total       0.90      0.88      0.87         8

accuracy: 0.875


The cell below is a scratch space where I can paste in new emails from either author and test how the model performs.

In [105]:
sample = ['''
Something’s wrong with my Sharepoint access again.
Here’s the updated xlsx.
Would you copy over the existing A3 in Sharepoint please?
''']

print pipe.predict(sample)
print pipe.predict_proba(sample)

['j']
[[ 0.5490296  0.4509704]]


The prediction is correct, with the model assigning a 54.9% probability to the right author!
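
To read the `predict_proba` output without guessing which column belongs to which author, the columns can be paired with the classifier's `classes_` attribute, which holds the labels in column order. A minimal sketch on an invented toy pipeline follows; `toy_texts`, `toy_labels`, and `toy_pipe` are illustrative names, not objects from this notebook (Python 3 syntax).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented stand-in training data.
toy_texts = [
    "good morning team, the data looks fine",
    "team data review this morning",
    "please send the file",
    "can you start the file schedule",
]
toy_labels = ["j", "j", "r", "r"]

toy_pipe = Pipeline([("vec", TfidfVectorizer()), ("NB", MultinomialNB())])
toy_pipe.fit(toy_texts, toy_labels)

sample = ["please send the updated file"]
proba = toy_pipe.predict_proba(sample)[0]

# classes_ on the fitted classifier gives the label for each probability column.
by_author = dict(zip(toy_pipe.named_steps["NB"].classes_, proba))
predicted = toy_pipe.predict(sample)[0]

print(by_author)
print(predicted)
```

A probability close to 0.5, as in the example above, means the model only narrowly preferred one author over the other.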