# Classification of emails using Shorttext library

This notebook will test the different classification methods offered by the Shorttext library.

---
__Organization:__
1. Put the csv file in the right format for the shorttext model and split the data between train and test
2. Preprocess the text
3. Train an LDA model and classify using the topics found
4. Use a Word2Vec representation of words and classify

__What is left to be done:__
1. Put the csv file in the right format for shorttext model and split the data between train and test<br>
-- Change "Fatou_relabeled" by the final dataframe

2. Preprocess the text<br>
-- Try stemming

3. Train an LDA model and classify using the topics found<br>
3.1. Train the LDA model<br>
---- Choose the right number of topics: try different values of k and keep the one that gives the highest cross-validation score (http://scikit-learn.org/stable/modules/cross_validation.html)<br>
3.3. Classify using Scikit-Learn Classifiers<br>
---- Try different scikit-learn classifiers (GaussianNB, GradientBoostingClassifier, etc.)<br>
---- Optimize the classifier parameters (for example, the number of trees for RandomForestClassifier)

4. Use a Word2Vec representation of words and classify<br>
4.3. Classify using a Convolutional Neural Network<br>
---- Optimize parameters for the CNN (number of epochs, size, etc.) => check the shorttext GitHub repository<br>
---- Try a double CNN<br>

4.4. Classify using a C-LSTM Neural Network<br>
---- Optimize parameters for the C-LSTM (number of epochs, size, etc.) => check the shorttext GitHub repository<br>

__What can also be tested:__

- Try metrics other than accuracy, such as F1-score, precision, etc.
- Compare all the methods
- Add useful graphs

__Keep in mind:__

Unfortunately, shorttext only works with Python 2. You can create a Python 2 environment using conda <br>(see https://conda.io/docs/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands).

In [1]:
#!pip install -U shorttext
#!pip install -U spacy
#!spacy download en
import pandas as pd
import operator
import re
from nltk.corpus import stopwords

import shorttext
from shorttext.utils import text_preprocessor
from shorttext.utils import load_word2vec_model

from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import StratifiedShuffleSplit

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

#import seaborn as sns
#%matplotlib inline

Using TensorFlow backend.


In [2]:
#Helper functions

def predict(classifier, mail):
    #take a message and a shorttext classifier, and return the category with the highest score
    probas = classifier.score(mail)
    #max over the keys, ranked by score; works with both Python 2 and Python 3 dictionaries
    category = max(probas, key=probas.get)
    return category


def create_df_from_dict(dictionary, categories):
    #create a dataframe with columns "Label" and "Message" from the shorttext dictionary
    frames = []
    for cat in categories:
        messages = dictionary[cat]
        #one small dataframe per category: the label repeated once per message
        frames.append(pd.DataFrame({"Label": [cat]*len(messages), "Message": messages}))
    #concatenate once at the end and fix the column order
    return pd.concat(frames, axis=0, ignore_index=True)[["Label", "Message"]]

# 1. Put the csv file in the right format for shorttext model and split the data between train and test

The file has to obey these rules:

- there is a heading row; and
- there are at least two columns: the labels first, then the short texts (everything after the second column is ignored).
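
To make the expected layout concrete, here is a minimal sketch in plain Python (standard library only). The function `read_labelled_csv` and the sample rows are hypothetical and only imitate the idea of grouping texts by label; this is not the shorttext implementation.

```python
import csv
import io
from collections import defaultdict

def read_labelled_csv(handle):
    """Group short texts by label: first column = label, second column = text.

    Hypothetical stand-in for illustration, not the shorttext implementation.
    """
    reader = csv.reader(handle)
    next(reader)                      # skip the heading row
    classdict = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue                  # skip rows that do not have both columns
        label, text = row[0], row[1]  # any further columns are ignored
        classdict[label].append(text)
    return dict(classdict)

# Made-up sample file with a heading and one extra column
sample = io.StringIO(
    u"Label,Message,Extra\n"
    u"dsp,Re: DSP thanks for your email,ignored\n"
    u"assignments,Re: HW2 forgot to attach screenshot,ignored\n"
    u"dsp,DSP letter attached,ignored\n"
)
classdict = read_labelled_csv(sample)
```

The result maps each label to the list of its messages, which is the same shape as the dictionaries used later in this notebook.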

In [3]:
df = pd.read_csv("../data/recombined.csv")

In [4]:
#we add the category names
categories = ["miscl.", "conflicts", "attendance", "assignments", "enrollment", "internal", "dsp", "regrades"]
df["Label"] = df.Category.apply(lambda cat : categories[cat-1])

In [5]:
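#Concatenate the subject and the body
#(missing values are filled first, otherwise a NaN in either column turns the whole message into NaN)
df["Message"] = df["Subject"].fillna("") + " " + df["Body"].fillna("")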

In [29]:
df.fillna("", inplace = True)

In [30]:
#split the data between train and test
def stratified_train_test_split(X, y, test_size, seed):
    sss = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    return X_train, X_test, y_train, y_test

In [31]:
test_size = 0.3
seed = 42
X_train, X_test, y_train, y_test = stratified_train_test_split(df["Message"], df["Label"], test_size, seed)
print("X_train.shape", X_train.shape)
print("X_test.shape", X_test.shape)

('X_train.shape', (1064,))
('X_test.shape', (456,))


In [32]:
print("Classes proportions in train set")
print(y_train.value_counts(normalize=True))
print("")
print("Classes proportions in test set")
print(y_test.value_counts(normalize=True))

Classes proportions in train set
miscl.         0.379699
assignments    0.212406
conflicts      0.132519
enrollment     0.106203
dsp            0.069549
attendance     0.047932
internal       0.033835
regrades       0.017857
Name: Label, dtype: float64

Classes proportions in test set
miscl.         0.379386
assignments    0.212719
conflicts      0.131579
enrollment     0.105263
dsp            0.070175
attendance     0.048246
internal       0.032895
regrades       0.019737
Name: Label, dtype: float64
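
The near-identical class proportions in the two sets are what stratification is meant to achieve. As a rough illustration of the idea, here is a pure-Python sketch that splits each class separately. The function `stratified_split` and the toy labels are hypothetical, and the per-class rounding is a simplification; `StratifiedShuffleSplit` allocates samples differently.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                           # shuffle within the class
        n_test = int(round(len(indices) * test_size))  # test share of this class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, imbalanced label list (made up)
labels = ["miscl."] * 40 + ["assignments"] * 20 + ["regrades"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)
```

Because every class is split on its own, even a rare class such as "regrades" ends up in both sets.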


In [33]:
#final training dataframe
train = pd.concat([y_train, X_train],axis=1)
train.columns = ["Label", "Message"]
train.to_csv("../data/train_set_in_shorttext_format.csv", index=False)
train.head()

| | Label | Message |
|---|---|---|
| 710 | assignments | Re: HW2 forgot to attach screenshot of IPython... |
| 1192 | assignments | Re: Self-Grade hw 8 turned in at 12:02 am grad... |
| 418 | dsp | Re: DSP thanks for your email! please let me k... |
| 647 | assignments | Re: iPython Submission hi jodie, unfortunate... |
| 668 | assignments | Minutes Late HW hello ms. li, yesterday,... |


In [34]:
#final test dataframe
test = pd.concat([y_test, X_test],axis=1)
test.columns = ["Label", "Message"]
test.to_csv("../data/test_set_in_shorttext_format.csv", index=False)
test.head()

| | Label | Message |
|---|---|---|
| 156 | enrollment | Re: Graduating Senior: Enrollment from Waitlis... |
| 950 | miscl. | Lab 109 GSI email? hi, i am an eecs 47d stude... |
| 685 | attendance | Re: Lab excuse hi sarah, we have buffer weeks... |
| 788 | assignments | Re: Uploading Homework problem hi youdong, i ... |
| 484 | miscl. | Re: EE Final Exam , thank you so much this wo... |


# 2. Preprocess the text

- remove punctuation
- lemmatize words
- convert to lowercase
- remove stop words
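
The four steps can be sketched as a chain of small functions, each receiving the output of the previous one. This is a minimal illustration in plain Python: the stop-word set and the lemma lookup table are tiny placeholders (assumptions for the example), standing in for the NLTK resources.

```python
import re

# Placeholders for illustration only: a real run would use full NLTK resources
STOPWORDS = {"had", "in", "having"}
LEMMAS = {"dogs": "dog"}

def make_pipeline(steps):
    """Chain functions so that each one receives the previous one's output."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

steps = [
    lambda s: re.sub("[^a-zA-Z]", " ", s),                           # remove punctuation
    lambda s: " ".join(LEMMAS.get(w, w) for w in s.split()),         # lemmatize (lookup)
    lambda s: s.lower(),                                             # lowercase
    lambda s: " ".join(w for w in s.split() if w not in STOPWORDS),  # remove stop words
]
clean = make_pipeline(steps)
result = clean("  Maryland blue had crab in, having Annapolis dogs!")
```

Splitting with `s.split()` (no argument) also discards leading, trailing and repeated spaces, so the result carries no surrounding whitespace.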

In [35]:
#dictionary where key = "category" and value = list of emails in that category
trainclassdict = shorttext.data.retrieve_csvdata_as_dict('../data/train_set_in_shorttext_format.csv')
testclassdict = shorttext.data.retrieve_csvdata_as_dict('../data/test_set_in_shorttext_format.csv')

In [36]:
eng_stopwords = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

#preprocessing functions
step1fcn = lambda s: re.sub("[^a-zA-Z]", " ", s)   #replace every non-letter with a space
step2fcn = lambda s: ' '.join(lemmatizer.lemmatize(word) for word in s.split(' '))   #lemmatize each word
step3fcn = lambda s: s.lower()   #convert to lowercase
step4fcn = lambda s: re.sub(' +', ' ', " ".join(word for word in s.split(" ") if word not in eng_stopwords))   #remove stop words, collapse repeated spaces

#pipeline
pipeline = [step1fcn, step2fcn, step3fcn, step4fcn]
preprocessor = text_preprocessor(pipeline)

In [37]:
text = "  Maryland blue had crab in, having Annapolis dogs!"
preprocessor(text)

u' maryland blue crab annapolis dog '

In [1]:
#Example of cleaning
cat = "conflicts"
#print("Before : {}".format(trainclassdict[cat][0]))
#print("")
#print("After: {}".format(preprocessor(trainclassdict[cat][0])))

In [39]:
#clean the train data
for cat in categories :
    class_size = len(trainclassdict[cat])
    for i in range(class_size):
        trainclassdict[cat][i] = preprocessor(trainclassdict[cat][i])

#clean the test data       
for cat in categories :
    class_size = len(testclassdict[cat])
    for i in range(class_size):
        testclassdict[cat][i] = preprocessor(testclassdict[cat][i])

In [40]:
#create dataframe for train and test
train = create_df_from_dict(trainclassdict, categories)
test = create_df_from_dict(testclassdict, categories)

# 3. Classify with LDA model 

- We train an LDA model with k topics (k can be determined by cross-validation)
- The LDA model converts every text to a topic vector
- The cosine classifier computes the cosine similarity between the vector representing the text and the vector representing each label
- The sklearn classifier uses the components of the vector as features

__Reference__: http://shorttext.readthedocs.io/en/latest/tutorial_topic.html
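
As a rough illustration of the cosine step, the sketch below scores a text vector against one reference vector per label and keeps the best label. The vectors are made up, and how shorttext builds the label vectors internally is not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up 3-topic vectors, one per label
class_vectors = {
    "conflicts": [0.1, 0.2, 0.9],
    "enrollment": [0.8, 0.5, 0.1],
}
text_vector = [0.05, 0.30, 0.95]

scores = {label: cosine_similarity(text_vector, vec)
          for label, vec in class_vectors.items()}
best = max(scores, key=scores.get)
```

A score close to 1 means the text and the label put their weight on the same topics; a score close to 0 means they share almost none.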

## 3.1. Train the LDA model

In [72]:
#https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
topicmodeler = shorttext.generators.LDAModeler()

In [73]:
num_topics = 7
topicmodeler.train(trainclassdict, num_topics)

In [74]:
example = 'exam conflict hi based school policy offering additional accommodation option involved club sport conflict exam time staff member time midterm exam may proctored staff member supervising let know would like take accommodation thanks'

In [75]:
#topic vector representation
topicmodeler.retrieve_topicvec(example)

array([ 0.05767095,  0.33613182,  0.93294021,  0.05779112,  0.057672  ,
        0.05764608,  0.05764608])

## 3.2. Classify using cosine similarity

### 3.2.1 Train the model

In [22]:
cos_classifier = shorttext.classifiers.TopicVectorCosineDistanceClassifier(topicmodeler)

In [23]:
#predictions
cos_classifier.score(example)

{'assignments': 0.88386267,
 'attendance': 0.44776478,
 'conflicts': 0.88386267,
 'dsp': 0.47317448,
 'enrollment': 0.88386267,
 'internal': 0.46643716,
 'miscl.': 0.74053299,
 'regrades': 0.46540907}

In [24]:
predict(cos_classifier, example)

'enrollment'
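
Note that the scores printed above contain a three-way tie: "assignments", "conflicts" and "enrollment" all score 0.88386267, so which of them is returned depends only on the dictionary's iteration order. A small sketch for listing tied labels follows; `top_labels` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
def top_labels(scores, tol=1e-6):
    """Return every label whose score is within tol of the best score."""
    best = max(scores.values())
    return sorted(label for label, score in scores.items() if best - score <= tol)

# Scores copied from the output above
scores = {
    'assignments': 0.88386267,
    'attendance': 0.44776478,
    'conflicts': 0.88386267,
    'dsp': 0.47317448,
    'enrollment': 0.88386267,
    'internal': 0.46643716,
    'miscl.': 0.74053299,
    'regrades': 0.46540907,
}
tied = top_labels(scores)
```

With only 7 topics, several labels can collapse onto the same dominant topic, which may explain ties like this one.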

### 3.2.2. Accuracy on train and test set

In [25]:
train_preds = train.Message.apply(lambda x : predict(cos_classifier, x))
train_accuracy = sum(train_preds == train.Label)/float(len(train))
print("Accuracy:", train_accuracy)

('Accuracy:', 0.1830188679245283)


In [26]:
test_preds = test.Message.apply(lambda x : predict(cos_classifier, x))
test_accuracy = sum(test_preds == test.Label)/float(len(test))
print("Accuracy:", test_accuracy)

('Accuracy:', 0.16777041942604856)
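
For context, it helps to compare these accuracies with a trivial baseline that always predicts the most frequent class. The sketch below uses class counts reconstructed from the train-set proportions and size printed in section 1 (1064 messages); `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / float(len(labels))

# Counts reconstructed from the proportions printed for the 1064-message train set
train_counts = {
    "miscl.": 404, "assignments": 226, "conflicts": 141, "enrollment": 113,
    "dsp": 74, "attendance": 51, "internal": 36, "regrades": 19,
}
labels = [label for label, n in train_counts.items() for _ in range(n)]
label, baseline = majority_baseline(labels)
```

On that split the baseline is about 0.38 (always answering "miscl."), so the cosine classifier's accuracy above is well below it.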


## 3.3. Classify using Scikit-Learn Classifiers

In [55]:
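#sanity check: make sure no category is empty and count how often each text appears
from collections import defaultdict
counts = defaultdict(int)   #named "counts" so that the "test" dataframe is not overwritten
for k, v in trainclassdict.items():
    if len(v) == 0:
        print('empty category: ' + k)
    print(k)
    for i in v:
        counts[i] += 1
    counts[k] += 1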

attendance
enrollment
dsp
assignments
internal
miscl.
conflicts
regrades


In [2]:
#test

In [77]:
sklearn_classifier = RandomForestClassifier()

In [78]:
classifier = shorttext.classifiers.TopicVectorSkLearnClassifier(topicmodeler, sklearn_classifier)

In [93]:
import numpy as np

#rebuild the feature matrix by hand to find which text produces NaN
X = []
y = []
classlabels = list(trainclassdict.keys())
for classidx, classlabel in enumerate(classlabels):
    topicvecs = [topicmodeler.retrieve_topicvec(text) for text in trainclassdict[classlabel]]
    if np.any(np.isnan(topicvecs)):
        i, _ = np.where(np.isnan(topicvecs))
        print(i)
        print(trainclassdict[classlabel][i[0]])
        print(topicmodeler.retrieve_topicvec(trainclassdict[classlabel][i[0]]))
    X += topicvecs
    y += [classidx]*len(topicvecs)

[298 298 298 298 298 298 298]
 enroll ee thanks update thanks 
[ 0.37796434  0.37796434  0.37796434  0.37796525  0.37796434  0.37796434
  0.37796434]


In [85]:
ct = 0
for classidx, classlabel in zip(range(len(classlabels)), classlabels):
    v =  trainclassdict[classlabel]
    ct += len(v)
    if 798 < ct:
        print(v[798-(ct-len(v))])
        break

 enroll ee thanks update thanks 


In [140]:
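#"self" and "bow" come from the In [110] cell below
#the same bag of words gives a slightly different topic distribution on each call, and sometimes an empty one
for i in range(10):
    print(self.topicmodel[self.tfidf[bow] if self.toweigh else bow], self.tfidf[bow] if self.toweigh else bow)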

([], [(88, 1.0)])
([(0, 0.14285715), (1, 0.14285715), (2, 0.14285715), (3, 0.14285715), (4, 0.14285715), (5, 0.14285715), (6, 0.14285715)], [(88, 1.0)])
([(0, 0.14285715), (1, 0.14285715), (2, 0.14285715), (3, 0.14285715), (4, 0.14285715), (5, 0.14285715), (6, 0.14285715)], [(88, 1.0)])
([(0, 0.14285713), (1, 0.14285721), (2, 0.14285713), (3, 0.14285715), (4, 0.14285713), (5, 0.14285713), (6, 0.14285713)], [(88, 1.0)])
([], [(88, 1.0)])
([(0, 0.14285715), (1, 0.14285715), (2, 0.14285715), (3, 0.14285715), (4, 0.14285715), (5, 0.14285715), (6, 0.14285715)], [(88, 1.0)])
([(0, 0.14285715), (1, 0.14285715), (2, 0.14285715), (3, 0.14285715), (4, 0.14285715), (5, 0.14285715), (6, 0.14285715)], [(88, 1.0)])
([(0, 0.14285715), (1, 0.14285715), (2, 0.14285715), (3, 0.14285715), (4, 0.14285715), (5, 0.14285715), (6, 0.14285715)], [(88, 1.0)])
([(0, 0.14285715), (1, 0.14285715), (2, 0.14285715), (3, 0.14285715), (4, 0.14285715), (5, 0.14285715), (6, 0.14285715)], [(88, 1.0)])
([(0, 0.14285715), 

In [110]:
#replicate retrieve_topicvec step by step until it returns NaN
for i in range(1000):
    x = topicmodeler.retrieve_topicvec(' discussion attendance ')
    self = topicmodeler
    text = ' discussion attendance '   #not named "shorttext", which would shadow the module
    bow = self.retrieve_bow(text)
    topicdist = self.topicmodel[self.tfidf[bow] if self.toweigh else bow]
    topicvec = np.zeros(self.nb_topics)
    for topicid, frac in topicdist:
        topicvec[topicid] = frac
    if self.normalize:
        topicvec /= np.linalg.norm(topicvec)   #0/0 gives NaN when topicdist is empty
    print(topicvec)
    if np.any(np.isnan(topicvec)):
        print('wtf')
        break

[ 0.37796447  0.37796447  0.37796447  0.37796447  0.37796447  0.37796447
  0.37796447]
[ nan  nan  nan  nan  nan  nan  nan]
wtf




In [91]:
import numpy as np
np.array(X)[np.isnan(X)]
np.where(np.isnan(X))

(array([704, 704, 704, 704, 704, 704, 704]), array([0, 1, 2, 3, 4, 5, 6]))

In [92]:
X[704]

array([ nan,  nan,  nan,  nan,  nan,  nan,  nan])

In [42]:
classifier = shorttext.classifiers.TopicVectorSkLearnClassifier(topicmodeler, sklearn_classifier)
classifier.train(trainclassdict)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
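
The debugging cells above suggest where the NaN comes from: the topic model occasionally returns an empty topic distribution, the topic vector then stays all zeros, and normalising it divides 0 by 0. A minimal sketch of a guarded normalisation in plain Python follows. `topic_vector` is a hypothetical helper, not part of shorttext, and keeping the zero vector is only one possible fallback (dropping the text is another).

```python
import math

def topic_vector(topicdist, nb_topics, normalize=True):
    """Dense, optionally L2-normalised vector from sparse (topic_id, weight) pairs."""
    vec = [0.0] * nb_topics
    for topic_id, weight in topicdist:
        vec[topic_id] = weight
    if normalize:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:                  # guard: an empty distribution stays a zero vector
            vec = [x / norm for x in vec]
    return vec

# Uniform distribution over 7 topics, as in the output above
uniform = topic_vector([(i, 0.14285715) for i in range(7)], nb_topics=7)
# Empty distribution, the case that produced NaN
empty = topic_vector([], nb_topics=7)
```

With the guard, the empty case yields a vector of zeros that a scikit-learn classifier can accept, instead of a row of NaN.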

In [None]:
#predictions
classifier.score(example)

In [None]:
predict(classifier, example)

In [None]:
train_preds = train.Message.apply(lambda x : predict(classifier, x))
train_accuracy = sum(train_preds == train.Label)/float(len(train))
print("Accuracy:", train_accuracy)

In [None]:
train_confusion_matrix = pd.DataFrame(confusion_matrix(train.Label, train_preds, labels=categories), columns=categories, index=categories)
train_confusion_matrix

In [None]:
test_preds = test.Message.apply(lambda x : predict(classifier, x))
test_accuracy = sum(test_preds == test.Label)/float(len(test))
print("Accuracy:", test_accuracy)

In [None]:
test_confusion_matrix = pd.DataFrame(confusion_matrix(test.Label, test_preds, labels=categories), columns=categories, index=categories)
test_confusion_matrix
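
Since the classes are imbalanced ("regrades" is under 2% of the data), accuracy alone can hide poor results on rare classes. Below is a sketch of per-class precision, recall and F1 computed directly from label lists; `per_class_scores` is a hypothetical helper and the labels are toy data, not results from this notebook.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall and F1 for every label seen in either list."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

# Toy labels (made up)
y_true = ["miscl.", "miscl.", "miscl.", "regrades", "dsp", "dsp"]
y_pred = ["miscl.", "miscl.", "dsp", "miscl.", "dsp", "dsp"]
scores = per_class_scores(y_true, y_pred)
# Macro F1: unweighted mean over classes, so rare classes count as much as frequent ones
macro_f1 = sum(s["f1"] for s in scores.values()) / float(len(scores))
```

In this toy case 4 of 6 predictions are correct, yet the macro F1 is lower because "regrades" is never predicted.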

# 4. Classify with Word2Vec model 

- We load Google's pre-trained Word2Vec model
- We try a classifier that represents a text as the sum of the vectors of its words
- The cosine classifier computes the cosine similarity between the vector representing the text and the vector representing each label
- The sklearn classifier uses the components of the vector as features

__Reference__: http://shorttext.readthedocs.io/en/latest/tutorial_sumvec.html

## 4.1. Load the Word2Vec model

In [141]:
wvmodel = load_word2vec_model('../data/GoogleNews-vectors-negative300.bin.gz')

KeyboardInterrupt: 

## 4.2 Classify using shorttext.classifiers.SumEmbeddedVecClassifier
This classifier:
- represents the text as a vector that is the sum of the vectors of its words
- computes the cosine similarity between this vector and the vector of each label

__Reference__: http://shorttext.readthedocs.io/en/latest/tutorial_sumvec.html
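
The two steps can be sketched with a toy embedding table. Everything below is made up for illustration: the 3-dimensional vectors stand in for the 300-dimensional Word2Vec vectors, and the sketch assumes that each label vector is built by summing the word vectors of that label's training texts.

```python
import math

# Toy 3-d embeddings (made up); real Word2Vec vectors have 300 dimensions
EMBEDDINGS = {
    "exam": [0.9, 0.1, 0.0],
    "conflict": [0.8, 0.2, 0.1],
    "homework": [0.1, 0.9, 0.2],
    "late": [0.2, 0.8, 0.1],
}

def sum_vector(text, embeddings, size=3):
    """Sum the vectors of the known words, then scale to unit length."""
    total = [0.0] * size
    for word in text.split():
        vec = embeddings.get(word)
        if vec is None:
            continue                  # words without an embedding are skipped
        total = [t + x for t, x in zip(total, vec)]
    norm = math.sqrt(sum(t * t for t in total))
    return [t / norm for t in total] if norm > 0 else total

def dot(u, v):
    """Dot product; equals the cosine when both vectors have unit length."""
    return sum(a * b for a, b in zip(u, v))

class_texts = {
    "conflicts": ["exam conflict", "conflict exam exam"],
    "assignments": ["homework late", "late homework homework"],
}
# Assumption: one vector per label, summed over all of its training texts
class_vectors = {label: sum_vector(" ".join(texts), EMBEDDINGS)
                 for label, texts in class_texts.items()}

query = sum_vector("late homework question", EMBEDDINGS)
scores = {label: dot(query, vec) for label, vec in class_vectors.items()}
best = max(scores, key=scores.get)
```

Word order is lost in the sum, which is one reason to also try the neural network classifiers of the following sections.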

### 4.2.1 Train the model

In [None]:
#we should look for the file
classifier = shorttext.classifiers.SumEmbeddedVecClassifier(wvmodel)   
classifier.train(trainclassdict)

In [None]:
#predictions
classifier.score(example)

In [None]:
predict(classifier, example)

### 4.2.2. Accuracy on train and test set

In [None]:
train_preds = train.Message.apply(lambda x : predict(classifier, x))
train_accuracy = sum(train_preds == train.Label)/float(len(train))
print("Accuracy:", train_accuracy)

In [None]:
train_confusion_matrix = pd.DataFrame(confusion_matrix(train.Label, train_preds, labels=categories), columns=categories, index=categories)
train_confusion_matrix

In [None]:
test_preds = test.Message.apply(lambda x : predict(classifier, x))
test_accuracy = sum(test_preds == test.Label)/float(len(test))
print("Accuracy:", test_accuracy)

In [None]:
test_confusion_matrix = pd.DataFrame(confusion_matrix(test.Label, test_preds, labels=categories), columns=categories, index=categories)
test_confusion_matrix

## 4.3 Classify using a Convolutional Neural Network
This uses a convolutional neural network classifier built with keras.

__Reference__: http://shorttext.readthedocs.io/en/latest/tutorial_nnlib.html

### 4.3.1 Train the model

In [None]:
#convnet classifier
kmodel = shorttext.classifiers.frameworks.CNNWordEmbed(len(trainclassdict.keys()), vecsize=300)
#initialize the classifier
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel)

In [None]:
#train the classifier
classifier.train(trainclassdict, kmodel)

In [None]:
classifier.score(example)

### 4.3.2. Accuracy on train and test set

In [None]:
train_preds = train.Message.apply(lambda x : predict(classifier, x))
train_accuracy = sum(train_preds == train.Label)/float(len(train))
print("Accuracy:", train_accuracy)

In [None]:
train_confusion_matrix = pd.DataFrame(confusion_matrix(train.Label, train_preds, labels=categories), columns=categories, index=categories)
train_confusion_matrix

In [None]:
test_preds = test.Message.apply(lambda x : predict(classifier, x))
test_accuracy = sum(test_preds == test.Label)/float(len(test))
print("Accuracy:", test_accuracy)

In [None]:
test_confusion_matrix = pd.DataFrame(confusion_matrix(test.Label, test_preds, labels=categories), columns=categories, index=categories)
test_confusion_matrix

## 4.4 Classify using a C-LSTM Neural Network
This uses a C-LSTM Neural Network classifier built with keras.

__Reference__: http://shorttext.readthedocs.io/en/latest/tutorial_nnlib.html

### 4.4.1 Train the model

In [None]:
#C-LSTM network
kmodel = shorttext.classifiers.frameworks.CLSTMWordEmbed(len(trainclassdict.keys()), vecsize=300)
#initialize the classifier
classifier = shorttext.classifiers.VarNNEmbeddedVecClassifier(wvmodel)

In [None]:
#train the classifier
classifier.train(trainclassdict, kmodel)

In [None]:
classifier.score(example)

### 4.4.2. Accuracy on train and test set

In [None]:
train_preds = train.Message.apply(lambda x : predict(classifier, x))
train_accuracy = sum(train_preds == train.Label)/float(len(train))
print("Accuracy:", train_accuracy)

In [None]:
train_confusion_matrix = pd.DataFrame(confusion_matrix(train.Label, train_preds, labels=categories), columns=categories, index=categories)
train_confusion_matrix

In [None]:
test_preds = test.Message.apply(lambda x : predict(classifier, x))
test_accuracy = sum(test_preds == test.Label)/float(len(test))
print("Accuracy:", test_accuracy)

In [None]:
test_confusion_matrix = pd.DataFrame(confusion_matrix(test.Label, test_preds, labels=categories), columns=categories, index=categories)
test_confusion_matrix