# Classification of emails using Shorttext library

This notebook will test the different classification methods offered by the Shorttext library.

---
__Steps:__

__Keep in mind:__

The version of shorttext used in this notebook only works with Python 2. You can create a Python 2 environment with conda; see https://conda.io/docs/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands

In [3]:
import shorttext

In [5]:
#!pip install -U shorttext
#!pip install -U spacy
#!spacy download en
import pandas as pd
import operator
import re
from nltk.corpus import stopwords
import shorttext
from shorttext.utils import text_preprocessor
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import StratifiedShuffleSplit

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score

In [6]:
#Helper functions

def predict(classifier, mail):
    #takes a message and a shorttext classifier, then returns the category with the highest score
    probas = classifier.score(mail)
    category = max(probas.items(), key=operator.itemgetter(1))[0]
    return category


def create_df_from_dict(dictionary, categories):
    #create a dataframe with columns "Label" and "Message" from the shorttext dictionary
    df = pd.DataFrame()
    for cat in categories :
        class_size = len(dictionary[cat])
        labels = pd.Series([cat]*class_size)
        messages = pd.Series(dictionary[cat])
        tmp = pd.concat([pd.DataFrame(labels),pd.Series(messages)],axis=1)
        tmp.columns = ["Label", "Message"]
        df = pd.concat([df,tmp],axis=0)
    return df
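
The `predict` helper only relies on the classifier exposing a `score` method that returns a dictionary of label scores. The sketch below illustrates that contract with a hypothetical `StubClassifier` (not part of shorttext) and made-up scores, so it can be run without training anything.

```python
import operator


class StubClassifier(object):
    """Hypothetical stand-in that mimics the `score` interface used above."""

    def __init__(self, scores):
        self.scores = scores

    def score(self, text):
        # A real classifier would compute scores from the text;
        # this stub returns the same made-up scores every time.
        return dict(self.scores)


def predict_label(classifier, text):
    """Return the label whose score is highest."""
    scores = classifier.score(text)
    return max(scores.items(), key=operator.itemgetter(1))[0]


stub = StubClassifier({"assignments": 0.2, "conflicts": 0.7, "miscl.": 0.1})
label = predict_label(stub, "exam conflict")
print(label)
```

Note that when two labels share the highest score, `max` returns whichever it encounters first, so ties are broken arbitrarily.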

# 1. Put the CSV file in the format shorttext expects and split the data into train and test sets

The file has to obey these rules:

- there is a header row; and
- there are at least two columns: the first holds the labels and the second holds the short texts (everything beyond the second column is ignored).
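
As an illustration of that layout, the sketch below writes a tiny two-column file with the standard `csv` module and reads it back into a label-to-texts dictionary, which is the shape the rest of the notebook works with. The rows and the `to_class_dict` helper are hypothetical; in the notebook the actual loading is done by `shorttext.data.retrieve_csvdata_as_dict`.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows: header first, then (label, short text) pairs.
rows = [
    ("Label", "Message"),
    ("assignments", "HW1 question about problem 2"),
    ("conflicts", "Midterm 1 conflict with club sport"),
    ("assignments", "Self-grade deadline extension"),
]

# Write the rows to an in-memory CSV buffer (a file on disk works the same way).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)


def to_class_dict(csv_text):
    """Group the second column by the first, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    classdict = defaultdict(list)
    for record in reader:
        label, text = record[0], record[1]  # any extra columns are ignored
        classdict[label].append(text)
    return dict(classdict)


classdict = to_class_dict(buffer.getvalue())
print(classdict)
```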

In [45]:
df = pd.read_csv("../recombined.csv")

IOError: File ../recombined.csv does not exist

In [8]:
#we add the category names
categories = ["miscl.", "conflicts", "attendance", "assignments", "enrollment", "internal", "disability"]
df["Label"] = df.Category.apply(lambda cat : categories[cat-1])

In [9]:
#Concatenate the body and the subject
df["Message"] = df["Subject"] + " " + df["Body"]

In [1]:
#df.head()

In [10]:
#split the data between train and test
def stratified_train_test_split(X, y, test_size, seed):
    sss = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    for train_index, test_index in sss.split(X, y):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    return X_train, X_test, y_train, y_test

In [11]:
test_size = 0.3
seed = 42
X_train, X_test, y_train, y_test = stratified_train_test_split(df["Message"], df["Label"], test_size, seed)
print("X_train.shape", X_train.shape)
print("X_test.shape", X_test.shape)

('X_train.shape', (212,))
('X_test.shape', (92,))


In [12]:
print("Classes proportions in train set")
print(y_train.value_counts(normalize=True))
print("")
print("Classes proportions in test set")
print(y_test.value_counts(normalize=True))

Classes proportions in train set
assignments    0.452830
miscl.         0.382075
enrollment     0.047170
attendance     0.042453
conflicts      0.028302
disability     0.028302
internal       0.018868
Name: Label, dtype: float64

Classes proportions in test set
assignments    0.456522
miscl.         0.380435
enrollment     0.054348
attendance     0.043478
disability     0.021739
conflicts      0.021739
internal       0.021739
Name: Label, dtype: float64
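
The proportions are close in both sets because the split is stratified. The sketch below shows the idea in plain Python: group the row indices by label, shuffle each group, and send the same fraction of every group to the test set. It is a simplified illustration with made-up labels, not a reimplementation of `StratifiedShuffleSplit`, whose rounding and shuffling details differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_indices, test_indices), keeping per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Same fraction of every class goes to the test set.
        n_test = int(round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: two classes with ten rows each.
labels = ["assignments"] * 10 + ["miscl."] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.3, seed=42)
test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))
```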


In [13]:
#final training dataframe
train = pd.concat([y_train, X_train],axis=1)
train.columns = ["Label", "Message"]
train.to_csv("../data/train_set_in_shorttext_format.csv", index=False)
train.head()

| | Label | Message |
|---|---|---|
| 45 | disability | Hw3 DSP Accommodations hello, if you are rece... |
| 18 | assignments | HW1 Question hello , i just saw a piazza post... |
| 193 | assignments | HW 3 hi , yesterday i did my homework problem ... |
| 61 | assignments | Re: Minutes Late HW hey andy, we've all exper... |
| 92 | miscl. | Re: first self grade hi kavi, i believe i re... |


In [14]:
#final test dataframe
test = pd.concat([y_test, X_test],axis=1)
test.columns = ["Label", "Message"]
test.to_csv("../data/test_set_in_shorttext_format.csv", index=False)
test.head()

| | Label | Message |
|---|---|---|
| 177 | assignments | Fwd: Uploading Homework problem uploading hom... |
| 208 | miscl. | Re: Midterm 1 conflict , yes, i got your email... |
| 195 | assignments | Self-grade deadline hi, i misinterpreted the ... |
| 257 | miscl. | Re: Homework 1 hello, yes i did. theyre my ol... |
| 96 | assignments | Homework Submission dear staff, i submitted ... |


# 2. Preprocess the text

- remove punctuation and other non-letter characters
- lemmatize words
- convert to lowercase
- remove stop words

In [15]:
#dictionary where key = "category" and value = list of emails in that category
trainclassdict = shorttext.data.retrieve_csvdata_as_dict('../data/train_set_in_shorttext_format.csv')
testclassdict = shorttext.data.retrieve_csvdata_as_dict('../data/test_set_in_shorttext_format.csv')

In [21]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/joycesylo/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [24]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/joycesylo/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [25]:
eng_stopwords = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

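#preprocessing functions
#step 1: replace every non-letter character with a space
step1fcn = lambda s: re.sub("[^a-zA-Z]", " ", s)
#step 2: lemmatize each word
step2fcn = lambda s: ' '.join(map(lambda word: lemmatizer.lemmatize(word), s.split(' ')))
#step 3: convert to lowercase
step3fcn = lambda s: s.lower()
#step 4: drop stop words and collapse repeated spaces
step4fcn = lambda s: re.sub(' +',' '," ".join([word for word in s.split(" ") if word not in eng_stopwords]))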

#pipeline
pipeline = [step1fcn, step2fcn, step3fcn, step4fcn]
preprocessor = text_preprocessor(pipeline)

In [26]:
text = "  Maryland blue had crab in, having Annapolis dogs!"
preprocessor(text)

u' maryland blue crab annapolis dog '
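
The output above is consistent with the pipeline functions being applied one after another, each receiving the previous step's result. The sketch below reproduces that chaining with `functools.reduce`. It assumes, without checking the shorttext source, that `text_preprocessor` behaves this way; the `compose` helper and the three-word stop list are hypothetical, and lemmatization is left out so the example needs no NLTK data (hence `dogs` stays plural).

```python
import re
from functools import reduce


def compose(pipeline):
    """Return a function that applies every step of the pipeline in order."""
    return lambda text: reduce(lambda acc, fcn: fcn(acc), pipeline, text)


# Hypothetical, tiny stop-word list used only for this illustration.
tiny_stopwords = {"had", "in", "having"}

steps = [
    lambda s: re.sub("[^a-zA-Z]", " ", s),   # replace non-letters with spaces
    lambda s: s.lower(),                     # convert to lowercase
    lambda s: " ".join(w for w in s.split() if w not in tiny_stopwords),
]

clean = compose(steps)
result = clean("  Maryland blue had crab in, having Annapolis dogs!")
print(result)
```

Using `s.split()` without an argument also discards leading, trailing, and repeated spaces, which is why this result has no surrounding whitespace.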

In [2]:
#Example of cleaning
cat = "conflicts"
#print("Before : {}".format(trainclassdict[cat][0]))
#print("")
#print("After: {}".format(preprocessor(trainclassdict[cat][0])))

In [28]:
#clean the train data
for cat in categories :
    class_size = len(trainclassdict[cat])
    for i in range(class_size):
        trainclassdict[cat][i] = preprocessor(trainclassdict[cat][i])

#clean the test data
for cat in categories :
    class_size = len(testclassdict[cat])
    for i in range(class_size):
        testclassdict[cat][i] = preprocessor(testclassdict[cat][i])

In [29]:
#create dataframe for train and test
train = create_df_from_dict(trainclassdict, categories)
test = create_df_from_dict(testclassdict, categories)

# 4. Classify with LDA model 

- We train an LDA model with k topics (k can be chosen by cross-validation)
- The LDA model converts every text to a topic vector
- The cosine classifier computes the cosine similarity between the vector representing the text and the vector representing each label
- The sklearn classifier uses the components of the vector as features

__Reference__: http://shorttext.readthedocs.io/en/latest/tutorial_topic.html
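
To make the cosine step concrete, here is a minimal sketch that scores a topic vector against one vector per label and picks the closest label. The three-topic vectors are made up for illustration, and how shorttext builds the per-label vectors is not shown here, so `class_vecs` should be read as an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical topic vector for one text and one vector per label.
text_vec = [0.1, 0.9, 0.1]
class_vecs = {
    "conflicts": [0.0, 1.0, 0.0],
    "enrollment": [1.0, 0.0, 0.0],
}

scores = {label: cosine_similarity(text_vec, vec)
          for label, vec in class_vecs.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Because cosine similarity ignores vector length, two labels whose vectors point in nearly the same direction receive nearly the same score, whatever their magnitudes.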

## 4.1. Train the LDA model

In [30]:
#https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
topicmodeler = shorttext.generators.LDAModeler()

In [32]:
num_topics = 7
topicmodeler.train(trainclassdict, num_topics)

In [33]:
example = 'exam conflict hi based school policy offering additional accommodation option involved club sport conflict exam time staff member time midterm exam may proctored staff member supervising let know would like take accommodation thanks'

In [34]:
#topic vector representation
topicmodeler.retrieve_topicvec(example)

array([ 0.03615069,  0.99607167,  0.03614631,  0.03616544,  0.03614813,
        0.03614631,  0.03614723])

## 4.2. Classify using cosine similarity

### 4.2.1 Train the model

In [35]:
cos_classifier = shorttext.classifiers.TopicVectorCosineDistanceClassifier(topicmodeler)

In [36]:
#predictions
cos_classifier.score(example)

{'assignments': 0.99820906,
 'attendance': 0.056295462,
 'conflicts': 0.99911392,
 'disability': 0.99955291,
 'enrollment': 0.053230517,
 'internal': 0.065536596,
 'miscl.': 0.036165349}

In [37]:
predict(cos_classifier, example)

'disability'
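
The prediction is `'disability'`, yet three labels score within 0.002 of each other, so the winner is decided by a very small margin. The sketch below ranks the scores printed above and reports the gap between the top two; only the ranking code is new, the numbers are copied from the output.

```python
# Scores copied from the cosine classifier output above.
scores = {
    "assignments": 0.99820906,
    "attendance": 0.056295462,
    "conflicts": 0.99911392,
    "disability": 0.99955291,
    "enrollment": 0.053230517,
    "internal": 0.065536596,
    "miscl.": 0.036165349,
}

# Sort labels from highest to lowest score.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
(top_label, top_score), (second_label, second_score) = ranked[0], ranked[1]
margin = top_score - second_score

print(top_label, second_label, margin)
```

A margin this small suggests the topic vectors of these three labels point in almost the same direction, which may help explain the low accuracy of the cosine classifier.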

### 4.2.2. Accuracy on train and test set

In [38]:
train["Preds"] = train.Message.apply(lambda x : predict(cos_classifier, x))
accuracy = sum(train.Preds == train.Label)/float(len(train))
print("Accuracy:", accuracy)

('Accuracy:', 0.13942307692307693)


## 4.3. Classify using Scikit-Learn Classifiers

In [39]:
sklearn_classifier = GaussianNB()

In [40]:
classifier = shorttext.classifiers.TopicVectorSkLearnClassifier(topicmodeler, sklearn_classifier)
classifier.train(trainclassdict)

In [41]:
#predictions
classifier.score(example)

{'assignments': 0.0,
 'attendance': 0.0,
 'conflicts': 1.0,
 'disability': 0.0,
 'enrollment': 0.0,
 'internal': 0.0,
 'miscl.': 0.0}

In [42]:
predict(classifier, example)

'conflicts'

In [43]:
train["Preds"] = train.Message.apply(lambda x : predict(classifier, x))
accuracy = sum(train.Preds == train.Label)/float(len(train))
print("Accuracy:", accuracy)

('Accuracy:', 0.34615384615384615)
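
One way to put these accuracies in context is to compare them with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. In the training proportions printed in section 1, `assignments` makes up about 45% of the messages, so both accuracies reported here fall below that baseline. The sketch below computes both quantities on made-up labels; the `accuracy` and `majority_baseline` helpers are hypothetical names, not shorttext or scikit-learn functions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))


def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / float(len(y_true))


# Hypothetical labels and predictions, for illustration only.
y_true = ["assignments", "assignments", "miscl.", "conflicts"]
y_pred = ["assignments", "miscl.", "miscl.", "disability"]

acc = accuracy(y_true, y_pred)
base = majority_baseline(y_true)
print(acc, base)
```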
