## Data Pre-processing

I'll use the following libraries:
- pandas for data structures/data analysis
- numpy for simple scientific computing
- sklearn for machine learning techniques
- nltk for natural language processing

In [144]:
import nltk
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix

First, I converted the PDF (it was one big file) into plaintext.

I tried PyPDF2, but its output wasn't what I wanted, so I used pdfminer for the conversion instead.

Here's the package repo: ```https://github.com/pdfminer/pdfminer.six```

Then, I read in the file and tokenized it into sentences.

Each sentence is then further pre-processed by keeping only the words that are purely alphabetic and not stopwords.

I also keep a sentence only if it contains at least one noun and one verb, based on POS (part-of-speech) tags from the universal tagset.

Note: this process could be improved further by making sure a sentence also has a subject, adverb, and adjective, but this will do for now.

In [148]:
stopwords = set(nltk.corpus.stopwords.words("english"))
with open("All_MOUs.txt", "r") as f:
    text = f.read()
    sentences = nltk.sent_tokenize(text)
    cleaned_sentences = []
    for s in sentences:
        tokens = nltk.word_tokenize(s)
        # Keep alphabetic, non-stopword tokens. The stopword check runs on the
        # original casing, so capitalized stopwords such as "The" are kept.
        words = [word.lower() for word in tokens if word.isalpha() and word not in stopwords]
        pos = nltk.pos_tag(words, tagset="universal")
        tags = {tag for _, tag in pos}
        # Keep the sentence only if it has at least one noun and one verb.
        if {"NOUN", "VERB"} <= tags:
            cleaned_sentences.append(" ".join(words))
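
The same cleaning steps are repeated later for the .docx file, so they could be factored into small helpers. The sketch below is one possible way to do that, not the code that produced the results in this notebook. `clean_tokens` and `has_noun_and_verb` are hypothetical names (they are not part of nltk), and the functions take already tokenized and tagged input so the example runs without any nltk downloads. Unlike the cell above, it lowercases each token before the stopword check, which is my assumption about the intended behavior.

```python
def clean_tokens(tokens, stopwords):
    """Lowercase alphabetic tokens and drop stopwords (hypothetical helper)."""
    lowered = (tok.lower() for tok in tokens if tok.isalpha())
    return [tok for tok in lowered if tok not in stopwords]


def has_noun_and_verb(tagged):
    """Return True if the (word, tag) pairs include both a NOUN and a VERB tag."""
    tags = {tag for _, tag in tagged}
    return {"NOUN", "VERB"} <= tags


# Toy input. In the notebook the tokens would come from nltk.word_tokenize
# and the tags from nltk.pos_tag(..., tagset="universal").
stop = {"the", "shall", "a"}
tokens = ["The", "employee", "shall", "receive", "overtime", "pay", "."]
words = clean_tokens(tokens, stop)
tagged = [("employee", "NOUN"), ("receive", "VERB"), ("overtime", "NOUN"), ("pay", "NOUN")]

print(words)
print(has_noun_and_verb(tagged))
```

Lowercasing first means "The" and "the" are treated the same way, which would keep stray stopwords out of the vocabulary.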

Now, I simply create a dataframe with all my cleaned sentences.

In [150]:
text_df = pd.DataFrame(cleaned_sentences, columns=["text"])

In [155]:
text_df.head()

| | text |
|---|---|
| 0 | memorandum of understanding no |
| 1 | for submission to the city council regarding the administrative unit this memorandum of understanding made entered day april amended day march by and between the city of los angeles and the engineers and architects association july june table of contents page general provisions unit membership list new employee information work access use city facilities bulletin boards actions employee relations board employment opportunities legislative agency shop |
| 2 | grievances association security recognition parties memorandum understanding |
| 3 | implementation memorandum understanding full understanding |
| 4 | term calendar successor memorandum understanding |


Now that my sentences are ready, I use a term frequency-inverse document frequency (tf-idf) vectorizer to convert them into a tf-idf matrix.

In [156]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_df.text)

In [157]:
X

<61549x8049 sparse matrix of type '<class 'numpy.float64'>'
	with 993135 stored elements in Compressed Sparse Row format>
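
To make the weighting concrete, here is a small worked example of tf-idf in plain Python. It is a sketch under assumptions: the three sentences are made up, the tokenizer is a simple whitespace split rather than scikit-learn's own token pattern, and the idf formula is the smoothed one scikit-learn documents for its default settings, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import math
from collections import Counter

# Made-up sentences standing in for the cleaned MOU sentences.
docs = [
    "employee shall receive overtime pay",
    "employee may file grievance",
    "overtime pay rate",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency: rarer terms get larger values.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}


def tfidf_row(doc_tokens):
    """Return the L2-normalized tf-idf weights of one document as a dict."""
    counts = Counter(doc_tokens)
    raw = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that appear in fewer sentences ("shall", "grievance") end up with larger weights than terms shared across sentences ("employee", "pay"), which is what lets the clustering step separate topics.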

I can now cluster my sentences (unsupervised learning) using k-means with 5 clusters.

Note: I tried several cluster counts, and 5 seemed to be a good choice of n.
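
The note above does not say how the candidate cluster counts were compared. One common option, shown here purely as an assumption and not as the method used in this notebook, is to compare silhouette scores across values of k. The sketch uses synthetic 2-D blobs in place of the tf-idf matrix so it runs quickly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the tf-idf matrix: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    # Inertia tends to keep shrinking as k grows, so it cannot pick k on its own;
    # the silhouette score peaks when clusters are compact and well separated.
    scores[k] = silhouette_score(X_toy, model.labels_)
    print(f"k={k}  inertia={model.inertia_:.1f}  silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

On a matrix as large as the one above, the pairwise distances behind the silhouette score get expensive; `silhouette_score` accepts a `sample_size` argument to score a random subset instead.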

In [38]:
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)

In [41]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the joblib package directly.
import joblib
joblib.dump(kmeans, 'cluster_5.pkl')

['cluster_5.pkl']

I then assign cluster labels back to my text_df dataframe.

In [42]:
text_df["cluster"] = kmeans.labels_.tolist()

Here, I try to identify which cluster corresponds most closely to "employee compensation", based on the top terms of each cluster.

- Cluster 0 seems to relate to medical/leave/family.
- Cluster 1 seems to be pay/compensation (which is what we want!).
- Cluster 2 seems to be grievance related.
- Cluster 3 seems to be sick/vacation/time-off related.
- Cluster 4 seems to be legal/union/department related.

You could argue that Cluster 3 is also "employee compensation", but I wasn't sure how broadly to read the question's phrase "with regards to employee compensation".

I wanted to narrow my scope so that I could actually define a target variable, so I decided that membership in Cluster 1 would be my target.

In [205]:
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(5):
    print(f"Cluster {i} words: ")
    for ind in order_centroids[i, :10]:
        print(f"{terms[ind]}")
    print()

Cluster 0 words: 
leave
family
medical
employee
article
taken
health
child
care
shall

Cluster 1 words: 
hours
shall
time
employee
employees
compensation
day
receive
pay
holiday

Cluster 2 words: 
grievance
within
grievant
days
written
procedure
step
shall
the
business

Cluster 3 words: 
sick
leave
accrued
use
discretion
used
time
employee
subsection
vacation

Cluster 4 words: 
shall
the
employee
city
mou
article
management
employees
union
department
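
The ranking step used to print these terms can be illustrated on a toy matrix. In this sketch the centroid weights and the four-term vocabulary are made up for illustration; only the `argsort()[:, ::-1]` idiom is taken from the cell above.

```python
import numpy as np

# Hypothetical centroids: 2 clusters over a 4-term vocabulary.
terms = np.array(["grievance", "leave", "pay", "sick"])
centers = np.array([
    [0.1, 0.2, 0.9, 0.0],  # cluster 0 weights
    [0.0, 0.7, 0.1, 0.8],  # cluster 1 weights
])

# argsort sorts ascending, so reversing each row puts the largest weights first.
order = centers.argsort()[:, ::-1]

# Take the two highest-weighted terms for each cluster.
top_terms = [terms[row[:2]].tolist() for row in order]
print(top_terms)
```

Each row of `order` holds column indices, so indexing `terms` with the first few entries gives the words that most characterize that cluster.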



Now, I designate cluster 1 to be my target variable.

In [46]:
text_df["target"] = text_df["cluster"] == 1

## Data Modelling & Evaluation

Now, we can try different models for binary classification.

Below are the two models I found to work really well.
I didn't want to overcomplicate things, so I opted for simpler, more parsimonious models. As my professor used to say: "Keep it simple, stupid."

The two binary classification models are Multinomial Naive Bayes and Logistic Regression.

In [49]:
text_clf_NB = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB())])
text_clf_LR = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', LogisticRegression(solver='lbfgs'))])

Here, I test each model's accuracy using k-fold cross-validation with k = 10.

In [59]:
kf = KFold(n_splits=10, shuffle = True, random_state = 100)
for train_index, test_index in kf.split(text_df):
    X_train, X_test = text_df.text[train_index], text_df.text[test_index]
    y_train, y_test = text_df.target[train_index], text_df.target[test_index]
    text_clf_NB = text_clf_NB.fit(X_train, y_train)
    predicted = text_clf_NB.predict(X_test)
    print("{}\nAccuracy:{}".format(confusion_matrix(y_test, predicted), np.mean(predicted == y_test)))

[[4541  150]
 [ 118 1346]]
Accuracy:0.9564581640942323
[[4534  144]
 [ 105 1372]]
Accuracy:0.959545085296507
[[4507  174]
 [  93 1381]]
Accuracy:0.9566206336311942
[[4557  154]
 [ 115 1329]]
Accuracy:0.9562956945572705
[[4582  158]
 [ 101 1314]]
Accuracy:0.9579203899268887
[[4567  172]
 [ 123 1293]]
Accuracy:0.9520714865962632
[[4564  150]
 [ 101 1340]]
Accuracy:0.9592201462225832
[[4594  154]
 [ 100 1307]]
Accuracy:0.9587327376116979
[[4573  149]
 [  95 1338]]
Accuracy:0.960357432981316
[[4527  134]
 [ 104 1389]]
Accuracy:0.9613259668508287


Multinomial Naive Bayes does pretty well, with accuracy between 95.2% and 96.1% across the 10 folds. What about Logistic Regression?
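
Accuracy is only one way to read these confusion matrices. As a sketch, the snippet below takes the first Naive Bayes fold printed above and derives precision and recall for the positive class as well. It relies on scikit-learn's convention that rows are true labels and columns are predictions; the variable names are my own.

```python
import numpy as np

# First-fold confusion matrix for Multinomial Naive Bayes, copied from the output above.
# Rows are true labels (False, True); columns are predicted labels (False, True).
cm = np.array([[4541, 150],
               [118, 1346]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
positive_share = (tp + fn) / cm.sum()

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"positive share={positive_share:.4f}")
```

Since the positive class is the minority in this fold, precision and recall give a fuller picture than accuracy alone.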

In [56]:
kf = KFold(n_splits=10, shuffle = True, random_state = 100)
for train_index, test_index in kf.split(text_df):
    X_train, X_test = text_df.text[train_index], text_df.text[test_index]
    y_train, y_test = text_df.target[train_index], text_df.target[test_index]
    text_clf_LR = text_clf_LR.fit(X_train, y_train)
    predicted = text_clf_LR.predict(X_test)
    print("{}\nAccuracy:{}".format(confusion_matrix(y_test, predicted), np.mean(predicted == y_test)))

[[4671   20]
 [  40 1424]]
Accuracy:0.9902518277822908
[[4662   16]
 [  36 1441]]
Accuracy:0.9915515840779854
[[4658   23]
 [  26 1448]]
Accuracy:0.9920389926888709
[[4679   32]
 [  30 1414]]
Accuracy:0.9899268887083672
[[4715   25]
 [  33 1382]]
Accuracy:0.9905767668562144
[[4726   13]
 [  32 1384]]
Accuracy:0.9926888708367181
[[4687   27]
 [  38 1403]]
Accuracy:0.9894394800974817
[[4728   20]
 [  29 1378]]
Accuracy:0.9920389926888709
[[4708   14]
 [  34 1399]]
Accuracy:0.9922014622258326
[[4649   12]
 [  29 1464]]
Accuracy:0.993337666558336


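The logistic regression model reaches about 99% accuracy in every fold (98.9% to 99.3%), which is much better than the Multinomial Naive Bayes model. Keep in mind that the target is the k-means cluster label, so this measures how well the model reproduces the clustering.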
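
The two manual loops above could be condensed with `cross_val_score`. This is a sketch on a made-up toy corpus, not a rerun of the experiment: the sentences and labels are invented, and `TfidfVectorizer` is used in place of `CountVectorizer` followed by `TfidfTransformer`, which scikit-learn documents as equivalent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for text_df.text and text_df.target (invented sentences).
pay_texts = ["employee shall receive overtime pay hours"] * 10
other_texts = ["grievance procedure written step days"] * 10
texts = pay_texts + other_texts
labels = [True] * 10 + [False] * 10

# Vectorizer and classifier are refit inside every fold, so no test text leaks into training.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(solver="lbfgs"))
cv = KFold(n_splits=5, shuffle=True, random_state=100)

# The default scoring for a classifier is accuracy.
scores = cross_val_score(pipeline, texts, labels, cv=cv)
print(scores, scores.mean())
```

Keeping the vectorizer inside the pipeline is the important design choice here; the loop version above already does this, and `cross_val_score` simply removes the bookkeeping.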

In [60]:
final_model = text_clf_LR.fit(text_df.text, text_df.target)
joblib.dump(final_model, 'final_model.pkl')

['final_model.pkl']

## Model Usage

And now, we can use our logistic regression model on the .docx file.

I will use the python-docx library (imported as docx) to read in the Word document and clean it.

This time, however, I will also keep track of the original sentences.

In [194]:
from docx import Document

doc = Document("MOU1_Compensation.docx")
paragraphs = [i.text.strip() for i in doc.paragraphs]
sentences = nltk.sent_tokenize(" ".join(paragraphs))
cleaned_sentences = []
original_sentences = []
for s in sentences:
    tokens = nltk.word_tokenize(s)
    # Same cleaning as for the training text.
    words = [word.lower() for word in tokens if word.isalpha() and word not in stopwords]
    pos = nltk.pos_tag(words, tagset="universal")
    tags = {tag for _, tag in pos}
    if {"NOUN", "VERB"} <= tags:
        cleaned_sentences.append(" ".join(words))
        # Keep the untouched sentence so it can be shown and exported later.
        original_sentences.append(s)

In [195]:
test_df = pd.DataFrame({"original": original_sentences, "cleaned": cleaned_sentences})

test_df.head(5)

Now let's see what the logistic regression model predicts, along with the predicted probability of the positive class.

In [199]:
test_df["predicted"] = final_model.predict(test_df.cleaned)
test_df["probability"] = final_model.predict_proba(test_df.cleaned)[:, 1]

We can sort the dataframe by most likely to be "employee compensation" related based on our predicted probabilities.

In [200]:
sorted_df = test_df.sort_values("probability", ascending=False)

And here are the top 5 most likely sentences that are related to "employee compensation".

In [204]:
sorted_df.head(5)

| | original | cleaned | predicted | probability |
|---|---|---|---|---|
| 25 | Overtime compensation shall be in time off at the rate of one and one-half (1½) hours for each hour of overtime worked or at the rate of one and one-half (1½) times the employee's regular rate of pay. | overtime compensation shall time rate one hours hour overtime worked rate one times employee regular rate pay | True | 1.0 |
| 59 | NOTE: \tAn employee shall not receive court on-call overtime compensation and hour-for-hour overtime compensation for the same time period. | note an employee shall receive court overtime compensation overtime compensation time period | True | 1.0 |
| 56 | An off-duty employee shall receive a minimum of two hours overtime compensation for any court day he/she is subpoenaed to be on call or required to appear. | an employee shall receive minimum two hours overtime compensation court day subpoenaed call required appear | True | 0.999999 |
| 104 | Any employee receiving On-Call/Standby compensation for the same day shall not be eligible to receive compensation under this Article for that day. | any employee receiving compensation day shall eligible receive compensation article day | True | 0.999996 |
| 61 | In addition, he/she shall receive hour-for-hour overtime compensation for each additional hour of actual court attendance in excess of two hours. | in addition shall receive overtime compensation additional hour actual court attendance excess two hours | True | 0.999994 |


If desired, we can write the sentences predicted to be "employee compensation" related to a new Word document like so:

In [211]:
out_doc = Document()
for p in test_df[test_df.predicted].original:
    out_doc.add_paragraph(p)
out_doc.save("output.docx")

That's it! I could probably turn this into a command-line script that takes any docx/text file as input, using click or argparse (if that's what is required). I wasn't really sure.
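
As a rough idea of what such a script's entry point might look like, here is a minimal argparse sketch. Every option name and default below is hypothetical, chosen only to mirror the file names used in this notebook, and the actual cleaning, prediction, and export steps are left as a placeholder comment.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build the command-line parser; all option names here are illustrative."""
    parser = argparse.ArgumentParser(
        description="Extract compensation-related sentences from a document."
    )
    parser.add_argument("input", type=Path, help="path to a .docx or .txt file")
    parser.add_argument("-o", "--output", type=Path, default=Path("output.docx"),
                        help="where to write the matching sentences")
    parser.add_argument("-m", "--model", type=Path, default=Path("final_model.pkl"),
                        help="pickled pipeline saved with joblib")
    return parser


def main(argv=None):
    """Parse arguments and reject unsupported file types."""
    args = build_parser().parse_args(argv)
    if args.input.suffix.lower() not in {".docx", ".txt"}:
        raise SystemExit(f"unsupported file type: {args.input.suffix}")
    # The cleaning, prediction, and export steps from the notebook would go here.
    return args


# Passing a list keeps the example runnable outside a shell.
args = main(["MOU1_Compensation.docx", "-o", "compensation.docx"])
print(args.input, args.output, args.model)
```

argparse ships with Python, so this version adds no dependency; click would offer a similar interface through decorators.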

Please let me know if there are any improvements I can make to my approach, or any valuable lessons!

Thanks!