#  E-Mail classification problem
## We have a set of emails, half of which were written by one person and the other half by another person at the same company. Our objective is to classify each email as written by one person or the other based only on the text of the email. We will start with Naive Bayes in this mini-project, and then expand in later projects to other algorithms.

# 1. Naive Bayes
Naive Bayes is well suited to text classification. When dealing with text, it’s very common to treat each unique word as a feature, and since the typical person’s vocabulary is many thousands of words, this makes for a large number of features. The relative simplicity of the algorithm and its assumption that features are independent make Naive Bayes a strong performer for classifying texts.
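To make the "each word is an independent feature" idea concrete, here is a minimal sketch of a word-count Naive Bayes classifier in plain Python. It is not what this notebook runs (the cells below use scikit-learn's GaussianNB on tf-idf features); the function names and the toy emails are invented for illustration, and add-one smoothing is an assumption of the sketch.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Count words per class; returns the pieces needed for prediction."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter(labels)       # label -> number of documents
    vocabulary = set()
    for text, label in zip(documents, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocabulary.update(words)
    return word_counts, class_counts, vocabulary


def predict_naive_bayes(text, word_counts, class_counts, vocabulary):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    total_docs = float(sum(class_counts.values()))
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids log(0) for unseen word/class pairs.
            # Each word contributes independently of all the others.
            count = word_counts[label][word] + 1
            score += math.log(count / float(total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration (0 = Sara, 1 = Chris as in this notebook)
toy_emails = [
    "please review the contract draft",
    "contract draft attached for review",
    "forecast the gas trading volumes",
    "trading desk forecast attached",
]
toy_labels = [0, 0, 1, 1]

model = train_naive_bayes(toy_emails, toy_labels)
prediction = predict_naive_bayes("review the draft", *model)
print(prediction)
```

The score for a class is just a sum of per-word terms, which is why the method stays cheap even with thousands of word features.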

In [1]:
#Import packages
import cPickle
import numpy 
from time import time
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

In [2]:
def preprocess(words_file = "word_data.pkl", 
               authors_file="email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts (by default word_data.pkl)
        and the corresponding authors (by default email_authors.pkl) and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays, which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels

    """

    ### the words (features) and authors (labels), already largely preprocessed
    ### pickle files are opened in binary mode and closed automatically
    with open(authors_file, "rb") as authors_file_handler:
        authors = cPickle.load(authors_file_handler)

    with open(words_file, "rb") as words_file_handler:
        word_data = cPickle.load(words_file_handler)

    ### test_size is the fraction of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)


    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)



    ### feature selection, because text is super high dimensional and 
    ### can be really computationally chewy as a result
    selector = SelectPercentile(f_classif, percentile=10)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print "no. of Chris training emails:", sum(labels_train)
    print "no. of Sara training emails:", len(labels_train)-sum(labels_train)
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test

In [3]:
""" Use a Naive Bayes Classifier to identify emails by their authors
    
    authors and labels:
    Sara has label 0
    Chris has label 1
"""
### features_train and features_test are the features for the training
### and testing datasets, respectively
### labels_train and labels_test are the corresponding item labels
features_train, features_test, labels_train, labels_test = preprocess()
print len(features_train)
print len(labels_train)

#Import the sklearn module for GaussianNB
from sklearn.naive_bayes import GaussianNB
#Create classifier
clf=GaussianNB()
#Start measuring training time
t0=time()
#Train the classifier
clf.fit(features_train, labels_train)
#Compute training time and print it
print "training time:", round(time()-t0,3), "s"

no. of Chris training emails: 7936
no. of Sara training emails: 7884
15820
15820
training time: 1.268 s


In [4]:
#Start measuring prediction time
t1=time()
#Use the trained classifier to predict labels for the test features
pred=clf.predict(features_test)
#Compute and print prediction time
print "prediction time:", round(time()-t1,3), "s"

prediction time: 0.216 s


In [5]:
#Compute accuracy of prediction                               
from sklearn.metrics import accuracy_score
accuracy=accuracy_score(pred, labels_test)
print accuracy

0.973265073948


Classifying our e-mails with the Naive Bayes supervised classification algorithm immediately gave us a very good accuracy of about 97.3 %. It was easy to implement and quick to run. It is worth mentioning that Naive Bayes does not do well on expressions or anything made up of more than one word: because it treats each word independently, it cannot detect relations between words, such as a meaning that only emerges from a phrase. Both training and prediction took only moments (about 1.3 s and 0.2 s), so Naive Bayes looks like a good approach to this problem. Let's see how other classifiers perform on the same dataset.
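The limitation with multi-word expressions comes from the word-level features themselves. As a small sketch (the helper names and the two sentences are invented for illustration), two sentences with opposite meanings can produce exactly the same single-word counts, while counting adjacent word pairs tells them apart:

```python
from collections import Counter


def bag_of_words(text):
    """Count single words, discarding the order they appeared in."""
    return Counter(text.lower().split())


def bag_of_bigrams(text):
    """Count adjacent word pairs, which keeps a little local word order."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


# Two invented sentences with the same words but opposite meaning
first = "the deal is good not bad"
second = "the deal is bad not good"

same_unigrams = bag_of_words(first) == bag_of_words(second)
same_bigrams = bag_of_bigrams(first) == bag_of_bigrams(second)
print(same_unigrams)  # True
print(same_bigrams)   # False
```

In scikit-learn, the `ngram_range` parameter of TfidfVectorizer is the usual way to add such word pairs as features, at the cost of a much larger feature space.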

# 2. SVM

In [6]:
#Import sklearn module for SVM
from sklearn import svm
#Create classifier using first a linear kernel
clf=svm.SVC(kernel='linear')
#Start measuring training time
t0=time()
#Train classifier
clf.fit(features_train, labels_train)
#Compute training time and print it
print "training time:", round(time()-t0,3), "s"

training time: 252.984 s


In [7]:
#Start measuring prediction time
t1=time()
#Make predictions
pred_SVM_linear_kernel=clf.predict(features_test)
print "Prediction takes: ", round(time()-t1, 3), "s"

Prediction takes:  27.041 s


In [8]:
#Accuracy measurement
accuracy_SVM_linear_kernel=accuracy_score(pred_SVM_linear_kernel, labels_test)
print accuracy_SVM_linear_kernel

0.984072810011


Even though our SVM made slightly better predictions on the dataset (about 98.4 % versus 97.3 % accuracy), it took far longer to both train (about 253 s) and predict (about 27 s). To speed up training I will use only 1 % of the training dataset. I expect some decrease in accuracy but a significant improvement in training time.

In [9]:
#Acquire 1 % of the training dataset 
#(// keeps the slice index an integer under both Python 2 and Python 3)
features_train_reduced=features_train[:len(features_train)//100]
labels_train_reduced=labels_train[:len(labels_train)//100]
#The rest is the same as before
t0=time()
clf.fit(features_train_reduced, labels_train_reduced)
print "Reduced training time: ", round(time()-t0, 3), "s"
t1=time()
pred_SVM_linear_kernel=clf.predict(features_test)
print "Prediction takes: ", round(time()-t1, 3), "s"
accuracy_SVM_linear_kernel=accuracy_score(pred_SVM_linear_kernel, labels_test)
print accuracy_SVM_linear_kernel

Reduced training time:  0.157 s
Prediction takes:  1.59 s
0.884527872582


With the reduced dataset, training took a fraction of a second and prediction under two seconds, roughly comparable to Naive Bayes, but at ~88 % we are nowhere close to its accuracy of ~97 %. Let's see what we get using a different kernel. RBF is the default kernel used when none is specified.
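The train, predict and score steps are repeated for every classifier in this notebook, so they could be wrapped in one helper. The sketch below is one possible shape for it: `time_fit_predict` and `MajorityClassifier` are hypothetical names, and the stand-in classifier exists only so the example runs without the email data. Any object with `fit` and `predict` methods, such as the scikit-learn classifiers used here, should work in its place.

```python
from time import time


def time_fit_predict(clf, features_train, labels_train,
                     features_test, labels_test):
    """Fit clf, predict on the test set, and return timings and accuracy."""
    t0 = time()
    clf.fit(features_train, labels_train)
    training_time = time() - t0

    t1 = time()
    predictions = clf.predict(features_test)
    prediction_time = time() - t1

    # Fraction of test labels predicted correctly
    correct = sum(1 for p, y in zip(predictions, labels_test) if p == y)
    accuracy = correct / float(len(labels_test))
    return training_time, prediction_time, accuracy


class MajorityClassifier(object):
    """Hypothetical stand-in that always predicts the most common label."""

    def fit(self, features, labels):
        labels = list(labels)
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return [self.majority_ for _ in features]


# Toy data, invented for illustration
train_t, predict_t, acc = time_fit_predict(
    MajorityClassifier(),
    [[0.0], [1.0], [2.0]], [1, 1, 0],
    [[3.0], [4.0]], [1, 0],
)
print("accuracy: %s" % acc)
```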

In [10]:
#Create classifier
clf_rbf=svm.SVC(kernel='rbf')
#Start measuring training time
t0=time()
#Train classifier
clf_rbf.fit(features_train_reduced, labels_train_reduced)
#Compute training time and print it
print "Training time:", round(time()-t0,3), "s"
#Start measuring prediction time
t1=time()
#Make predictions
pred_SVM_rbf_kernel=clf_rbf.predict(features_test)
print "Prediction takes: ", round(time()-t1, 3), "s"
#Accuracy measurement
accuracy_SVM_rbf_kernel=accuracy_score(pred_SVM_rbf_kernel, labels_test)
print accuracy_SVM_rbf_kernel

Training time: 0.178 s
Prediction takes:  1.821 s
0.616040955631


Well, this is not a great accuracy, is it? Let's play around with C, the penalty parameter of the error term. A larger C penalizes misclassified training points more heavily, so the decision boundary becomes more complex as C grows. The default, used above, is C=1; here we try C=10, C=100, C=1000 and C=10000.

In [11]:
#C=10
clf_rbf_C10=svm.SVC(C=10, kernel='rbf')
clf_rbf_C10.fit(features_train_reduced, labels_train_reduced)
pred_SVM_rbf_C10_kernel=clf_rbf_C10.predict(features_test)
accuracy_SVM_rbf_C10_kernel=accuracy_score(pred_SVM_rbf_C10_kernel, labels_test)
print "Accuracy with C=10:", accuracy_SVM_rbf_C10_kernel 
#C=100
clf_rbf_C100=svm.SVC(C=100, kernel='rbf')
clf_rbf_C100.fit(features_train_reduced, labels_train_reduced)
pred_SVM_rbf_C100_kernel=clf_rbf_C100.predict(features_test)
accuracy_SVM_rbf_C100_kernel=accuracy_score(pred_SVM_rbf_C100_kernel, labels_test)
print "Accuracy with C=100:", accuracy_SVM_rbf_C100_kernel 
#C=1000
clf_rbf_C1000=svm.SVC(C=1000, kernel='rbf')
clf_rbf_C1000.fit(features_train_reduced, labels_train_reduced)
pred_SVM_rbf_C1000_kernel=clf_rbf_C1000.predict(features_test)
accuracy_SVM_rbf_C1000_kernel=accuracy_score(pred_SVM_rbf_C1000_kernel, labels_test)
print "Accuracy with C=1000:", accuracy_SVM_rbf_C1000_kernel 
#C=10000
clf_rbf_C10000=svm.SVC(C=10000, kernel='rbf')
clf_rbf_C10000.fit(features_train_reduced, labels_train_reduced)
pred_SVM_rbf_C10000_kernel=clf_rbf_C10000.predict(features_test)
accuracy_SVM_rbf_C10000_kernel=accuracy_score(pred_SVM_rbf_C10000_kernel, labels_test)
print "Accuracy with C=10000:", accuracy_SVM_rbf_C10000_kernel 

Accuracy with C=10: 0.616040955631
Accuracy with C=100: 0.616040955631
Accuracy with C=1000: 0.821387940842
Accuracy with C=10000: 0.892491467577
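The four near-identical blocks in the cell above could be collapsed into a loop. The sketch below shows the pattern on synthetic data from `make_classification`, used here only as a stand-in so the example is self-contained; in the notebook the loop would run on features_train_reduced, labels_train_reduced, features_test and labels_test instead, and the accuracies would differ from those printed above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption of this sketch, not the email data)
X, y = make_classification(n_samples=200, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

accuracies = {}
for C in (10, 100, 1000, 10000):
    # Same classifier settings as above; only C changes per iteration
    clf = SVC(C=C, kernel='rbf')
    clf.fit(X_train, y_train)
    accuracies[C] = accuracy_score(y_test, clf.predict(X_test))
    print("Accuracy with C=%d: %s" % (C, accuracies[C]))
```

Keeping the results in a dictionary keyed by C makes it easy to pick the best value afterwards, for example with `max(accuracies, key=accuracies.get)`.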


Now let's go back to the full training dataset with C=10000, the best of the values tried above. The larger training set should significantly improve prediction accuracy; let's see by how much.

In [12]:
#Create classifier
clf_rbf=svm.SVC(C=10000,kernel='rbf')
#Start measuring training time
t0=time()
#Train classifier
clf_rbf.fit(features_train, labels_train)
#Compute training time and print it
print "Training time:", round(time()-t0,3), "s"
#Start measuring prediction time
t1=time()
#Make predictions
pred_SVM_rbf_kernel=clf_rbf.predict(features_test)
print "Prediction takes: ", round(time()-t1, 3), "s"
#Accuracy measurement
accuracy_SVM_rbf_kernel=accuracy_score(pred_SVM_rbf_kernel, labels_test)
print accuracy_SVM_rbf_kernel

Training time: 175.746 s
Prediction takes:  17.512 s
0.990898748578


We reached more than 99 % accuracy; however, training and prediction took far longer than with the Naive Bayes classifier (about 176 s and 17.5 s versus 1.3 s and 0.2 s).

# 3. Decision trees

Let's tackle the problem with a decision tree this time. As we have ~3800 features in our dataset, let's set the min_samples_split parameter to 40. Setting this parameter to an even smaller value would most likely lead to overfitting our data.

In [13]:
from sklearn import tree
clf=tree.DecisionTreeClassifier(min_samples_split=40)
#Start measuring training time
t0=time()
clf=clf.fit(features_train, labels_train)
#Compute training time and print it
print "Training time:", round(time()-t0,3), "s"
#Start measuring prediction time
t1=time()
#Make predictions
predict=clf.predict(features_test)
print "Prediction takes: ", round(time()-t1, 3), "s"
#Accuracy measurement
from sklearn.metrics import accuracy_score
acc=accuracy_score(predict, labels_test)
print acc

Training time: 57.775 s
Prediction takes:  0.028 s
0.978953356086


We explored how tuning an algorithm's parameters affects training time, prediction time and accuracy. Another way to control the complexity of an algorithm is via the number of features used in training and testing. The more features the algorithm has available, the more potential there is for a complex fit. Let's reduce the number of features to a tenth of what we have used so far, from 3785 to 379, by lowering the SelectPercentile percentile from 10 to 1.
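As a rough illustration of what keeping a percentile of the features means, the sketch below ranks features by a relevance score and keeps only the top share. It is a simplified stand-in, not scikit-learn's implementation: SelectPercentile computes the scores itself (with f_classif in this notebook) and may treat ties differently. The function name and the scores are invented for the example.

```python
def select_top_percentile(scores, percentile):
    """Return sorted indices of the highest-scoring `percentile` % of features."""
    n_keep = int(round(len(scores) * percentile / 100.0))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Invented relevance scores for ten features
toy_scores = [0.2, 5.1, 0.7, 3.3, 0.1, 9.8, 1.4, 0.05, 2.6, 0.9]

kept_30 = select_top_percentile(toy_scores, 30)
kept_10 = select_top_percentile(toy_scores, 10)
print(kept_30)  # [1, 3, 5]
print(kept_10)  # [5]
```

Lowering the percentile shrinks the feature matrix that every later step has to process, which is where the shorter training time comes from.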

In [14]:
def preprocess(words_file = "word_data.pkl", 
               authors_file="email_authors.pkl"):
    ### the words (features) and authors (labels), already largely preprocessed
    ### pickle files are opened in binary mode and closed automatically
    with open(authors_file, "rb") as authors_file_handler:
        authors = cPickle.load(authors_file_handler)

    with open(words_file, "rb") as words_file_handler:
        word_data = cPickle.load(words_file_handler)

    ### test_size is the fraction of events assigned to the test set
    ### (remainder go into training)
    features_train, features_test, labels_train, labels_test = train_test_split(word_data, authors, test_size=0.1, random_state=42)


    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                                 stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)


    ### setting the percentile parameter to 1 keeps 379 features instead of 3785
    selector = SelectPercentile(f_classif, percentile=1)
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    ### info on the data
    print "no. of Chris training emails:", sum(labels_train)
    print "no. of Sara training emails:", len(labels_train)-sum(labels_train)
    
    return features_train_transformed, features_test_transformed, labels_train, labels_test

In [15]:
#Training and prediction time and accuracy with 10 % of the features used before, with the same parameters as before

from sklearn import tree
features_train, features_test, labels_train, labels_test = preprocess()
clf=tree.DecisionTreeClassifier(min_samples_split=40)
#Start measuring training time
t0=time()
clf=clf.fit(features_train, labels_train)
#Compute training time and print it
print "Training time:", round(time()-t0,3), "s"
#Start measuring prediction time
t1=time()
#Make predictions
predict=clf.predict(features_test)
print "Prediction takes: ", round(time()-t1, 3), "s"
#Accuracy measurement
from sklearn.metrics import accuracy_score
acc=accuracy_score(predict, labels_test)
print acc
print len(features_train[0])

no. of Chris training emails: 7936
no. of Sara training emails: 7884
Training time: 5.14 s
Prediction takes:  0.005 s
0.966439135381
379
