# ML Model for Actionable item classification

## 1. Load data

In [2]:
import pandas as pd
from gensim import utils
import gensim.parsing.preprocessing as gsp

from gensim.models import Word2Vec
from gensim.test.utils import common_texts, get_tmpfile

from sklearn.base import BaseEstimator
from sklearn import utils as skl_utils
from tqdm import tqdm

import multiprocessing
import numpy as np
from sklearn.metrics import classification_report

In [3]:
#Load data
actions_df = pd.read_csv('actions.csv', names = ['action_sent'])

### 1.1 explore the dataset

In [4]:
pd.options.display.max_colwidth = 1500
actions_df

Unnamed: 0,action_sent
0,Activate all who work with Transmission or have any good ideas on the subject.
1,Add more to your score by stopping in and picking up hefty load of construction supplies to win.
2,Add O'neal Winfee and George Smith to the attendees list.
3,"Additionally, send me the payment schedule for Tenaska IV this month."
4,Adjust our purchase amount from each party based on the transport allocation.
...,...
1245,Write me note about what is going on and what issues you need my help to deal with when you send the rentroll.
1246,Write verification plans specifications and documentation today and send me.
1247,you have to expand on the maintenance tools.
1248,You have to resolve Enron's ongoing concerns at any cost.


- The tagged data available is only of one class i.e action class
- So I will use one class classification.
- One-class classification is a field of machine learning that provides techniques for outlier and anomaly detection.

## 2. Data preprocessing

### 2.1. data cleaning:
- convert to lower case
  1. remove html tags
  2. remove punctuation
  3. remove extra white spaces
  4. remove stop words
  5. remove numerics
  6. stemming
  7. remove very short words
  8. ignore non unicode characters

In [5]:
def sent_clean(sent):
    sent = sent.lower()
    sent = utils.to_unicode(sent)
    for rule in cleaner:
        sent = rule(sent)
    return sent

cleaner = [gsp.strip_tags, 
           gsp.strip_punctuation,
           gsp.strip_multiple_whitespaces,
           gsp.strip_numeric,
           gsp.remove_stopwords, 
           gsp.strip_short, 
           gsp.stem_text]

In [6]:
s1 = []
for ele in actions_df["action_sent"]:
    s1.append(sent_clean(ele))



In [7]:
actions_df['cleaned'] = s1

## 3. Featurization
- i want the features to capture some context hence using Word2Vec

In [9]:
# download the pretrained model from
#https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

In [13]:
from gensim.models import KeyedVectors
model1 = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)


In [15]:
w2v_words = list(model1.wv.vocab)

  """Entry point for launching an IPython kernel.


In [16]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []
for sent in tqdm(actions_df['cleaned']):
    sent_vec = np.zeros(300) # as word vectors are of 300 length
    cnt_words =0 
    for word in sent:
        if word in w2v_words:
            vec = model1.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)

  if __name__ == '__main__':
100%|██████████████████████████████████████████████████████████████████████████████| 1250/1250 [11:17<00:00,  1.84it/s]


## 4. Training Auto Encoder 
I am using Auto Encoder to learn efficient data codings in an unsupervised manner. The aim of using autoencoder is to learn a representation (encoding) for the set of action sentence data.

### 4.1 Layer structure of the auto encoder
- Layer1: 300 features INPUT
- Layer2: 600 features
- Layer3: 150 features
- Layer4: 600 features
- Layer5: 300 features OUTPUT

<br>
Autoencoders are trained with the same data as input & output both. So, Layer 5 output is nothing but a reconstructed version of the input with some loss

In [54]:
from sklearn.neural_network import MLPRegressor

auto_en = MLPRegressor(hidden_layer_sizes=(600,150,600))
auto_en.fit(sent_vectors, sent_vectors)
predicted_vec = auto_en.predict(sent_vectors)

In [55]:
auto_en.score(predicted_vec, sent_vectors)



0.5776947222477959

The Autoencoder is able to reconstruct only 57 % variance as per 'Regression accuracy'

## 5. one-class SVM

In [56]:
from sklearn.svm import OneClassSVM

In [58]:
svm_clf = OneClassSVM(gamma='scale', nu=0.01)

In [59]:
svm_clf.fit(sent_vectors)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='scale', kernel='rbf',
            max_iter=-1, nu=0.01, shrinking=True, tol=0.001, verbose=False)

### 5.1 test metrices

In [None]:
test_data = pd.read_csv('test.csv')
test_y = test['label']
test_x = test['sentence']

In [None]:
# detect outliers in the test set

svm_yhat = model.predict(test_x)

 To evaluate the performance of the model as a binary classifier, we must change the labels in the test dataset from 0 and 1 for the majority and minority classes respectively, to +1 and -1.

In [None]:
classification_report(test_y ,svm_yhat)