# ML Model for Actionable item classification

## 1. Approach
_ I will use BERT to extract features
- BERT is a method of pretraining language models.
- I will use BERT to extract high quality language features.
- I will use pre trained BERT model but will fine tune it on our data.
- Pre trained BERT model has a lot of information already encoded in its weights.
- I will lightly tune them to use the features for classification.

> <br>
- In the past, I have Pretrained BERT on Hindi language from scratch.
<br>
- But for english so much work has already been done.I can never match the amount and quality of data, and the resources used by big research labs to pre train BERT.
<br>
- I will use a pre BERT trained Model, tune it on my data and extract features by transfer learning.
- BERT is bidirectional it learns both left and right context.

> * BERT vs Word2Vec
<br>
- Word2Vec is a context-free model in the sense that it generates a single embedding representation for each word in the vocab.
- BERT is a Contextual model
- Contextuals models generate a representation of each word that is based on the other words in the sentence.

### 1.1. BERT Features
- https://github.com/google-research/bert
- BERT is trained using word piece embeddings.
- Word piece embedding means the tokenization is not only on white spaces it can also break a word.
- example: "what is your name?" can be tokenized as "wh","##at","is", "you", "#r", "name", "?"
- BERT-Large, Uncased model has around 30 thousand word pieces as vocabulary.
- word piece embedding helps BERT to give embeddings for even unseen words.

## 2. Importing required packages

In [None]:
#install ipywidgets
#!conda install -c conda-forge ipywidgets

In [1]:
import pandas as pd
from sklearn.metrics import classification_report

import tensorflow as tf

In [2]:
# Get the GPU device name.
device_name = tf.test.gpu_device_name()

if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [3]:
import torch

# Check GPU available
if torch.cuda.is_available():    

    # set PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

    # If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
We will use the GPU: GeForce GTX 1060


>I will use transformers package by Hugging Face which will give us a pytorch interface for working with BERT.
>The library also includes task-specific classes for token classification, question answering, next sentence prediciton, etc. Using these pre-built classes simplifies the process of modifying BERT for our purposes.

## 3. Data Pre processing
- Apart from the other preprocessing
- BERT expects the inputs to be in certain format:
  - Add special tokens to the start and end of each sentence.
  - Pad & truncate all sentences to a single constant length(512)
  - Explicitly differentiate real tokens from padding tokens with the "attention mask".

In [6]:
data_df = pd.read_csv('created_data.csv')

In [8]:
data_df.head()

Unnamed: 0,sentence,label
0,There are future plans to make OWA available ...,0
1,"Finally, since a lot of the information revol...",0
2,"Among area utilities, Kansas Gas Service incre...",0
3,Let me know where the differences are.,1
4,The only Internet Email address format that wi...,0


> Bert will throw out of index error for sentences longer than 512 tokens

- I could break the longer sentences, or ignore them
- I choose to ignore them cause i found only 33 sentences which were longer than 512

In [9]:
# remove sentences longer than 512

bert_sent = []
labels = []

for i in range(len(data_df)):
    ele = data_df['sentence'][i]
    if len(ele) < 512:
        bert_sent.append(ele)
        labels.append(data_df['label'][i])
len(bert_sent)

3217

### 3.1. Tokenization
- To feed the data to BERT we have to tokenize the data and then map to index in the BERT tokenizer Vocab.

In [1]:
from transformers import BertTokenizer

# Load BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Loading BERT tokenizer...


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




>  The tokenize.encode function will do both tokenization and convert_tokens_to_ids, rather than calling tokenize and convert_tokens_to_ids separately.

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'GeForce GTX 1060'

In [13]:
data_df = pd.read_csv('created_data.csv')


In [39]:
# add special tokens for BERT to work properly

pre_sent = ["[CLS] " + sent + " [SEP]" for sent in bert_sent]
print(pre_sent[0])

[CLS] There are future plans to make OWA  available from your home or when traveling abroad. [SEP]


In [40]:
# Tokenize with BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(ele1) for ele1 in pre_sent]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])

Tokenize the first sentence:
['[CLS]', 'there', 'are', 'future', 'plans', 'to', 'make', 'ow', '##a', 'available', 'from', 'your', 'home', 'or', 'when', 'traveling', 'abroad', '.', '[SEP]']


In [49]:
MAX_LEN = 128
# Pad our input tokens
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")
# Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")


In [50]:
# Create attention masks
attention_masks = []
# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i>0) for i in seq]
    attention_masks.append(seq_mask)

In [52]:
# Use train_test_split to split our data into train and validation sets for training
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, labels, 
                                                            random_state=201, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,
                                             random_state=201, test_size=0.1)
                                             
# Convert all of our data into torch tensors, the required datatype for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

# Select a batch size for training. 
batch_size = 32

# Create an iterator of our data with torch DataLoader 
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

In [61]:
# Load BertForSequenceClassification, the pretrained BERT model with a single linear classification layer on top. 

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=nb_labels)
model.cuda()

#BERT model summary
BertForSequenceClassification(
    (bert): BertModel(
        (embeddings): BertEmbeddings(
            (word_embeddings): Embedding(30522, 768, padding_idx=0)
            (position_embeddings): Embedding(512, 768)
            (token_type_embeddings): Embedding(2, 768)
            (LayerNorm): BertLayerNorm()
            (dropout): Dropout(p=0.1)
        )
        (encoder): BertEncoder(
            (layer): ModuleList(
                (0): BertLayer(
                    (attention): BertAttention(
                        (self): BertSelfAttention(
                            (query): Linear(in_features=768, out_features=768, bias=True)
                            (key): Linear(in_features=768, out_features=768, bias=True)
                            (value): Linear(in_features=768, out_features=768, bias=True)
                            (dropout): Dropout(p=0.1)
                        )
                        (output): BertSelfOutput(
                            (dense): Linear(in_features=768, out_features=768, bias=True)
                            (LayerNorm): BertLayerNorm()
                            (dropout): Dropout(p=0.1)
                        )
                    )
                    (intermediate): BertIntermediate(
                        (dense): Linear(in_features=768, out_features=3072, bias=True)
                    )
                    (output): BertOutput(
                        (dense): Linear(in_features=3072, out_features=768, bias=True)
                        (LayerNorm): BertLayerNorm()
                        (dropout): Dropout(p=0.1)
                    )
                )
                .
                .
                .
            )
        )
        (pooler): BertPooler(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (activation): Tanh()
        )
    )
    (dropout): Dropout(p=0.1)
    (classifier): Linear(in_features=768, out_features=2, bias=True)
)

SyntaxError: invalid syntax (<ipython-input-61-f9dbf0a39144>, line 8)

In [53]:

# BERT fine-tuning parameters
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]


optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=.1)

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs 
epochs = 4

# BERT training loop
for _ in trange(epochs, desc="Epoch"):
      # Set our model to training mode
    model.train()  
    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    # Train the data for one epoch
    for step, batch in enumerate(train_dataloader):
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
        train_loss_set.append(loss.item())    
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()
        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1
    print("Train loss: {}".format(tr_loss/nb_tr_steps))
    
    ## VALIDATION

    # Put model in evaluation mode
    model.eval()
    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    # Evaluate data for one epoch
    for batch in validation_dataloader:
         # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
            
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)    
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1  
        
    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

    
# plot training performance
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

NameError: name 'model' is not defined

NameError: name 'jkhjk' is not defined

In [3]:
#Load data
actions_df = pd.read_csv('actions.csv', names = ['action_sent'])

'nvcc' is not recognized as an internal or external command,
operable program or batch file.


### 1.1 explore the dataset

In [4]:
pd.options.display.max_colwidth = 1500
actions_df

Unnamed: 0,action_sent
0,Activate all who work with Transmission or have any good ideas on the subject.
1,Add more to your score by stopping in and picking up hefty load of construction supplies to win.
2,Add O'neal Winfee and George Smith to the attendees list.
3,"Additionally, send me the payment schedule for Tenaska IV this month."
4,Adjust our purchase amount from each party based on the transport allocation.
...,...
1245,Write me note about what is going on and what issues you need my help to deal with when you send the rentroll.
1246,Write verification plans specifications and documentation today and send me.
1247,you have to expand on the maintenance tools.
1248,You have to resolve Enron's ongoing concerns at any cost.


- The tagged data available is only of one class i.e action class
- So I will use one class classification.
- One-class classification is a field of machine learning that provides techniques for outlier and anomaly detection.

## 2. Data preprocessing

### 2.1. data cleaning:
- convert to lower case
  1. remove html tags
  2. remove punctuation
  3. remove extra white spaces
  4. remove stop words
  5. remove numerics
  6. stemming
  7. remove very short words
  8. ignore non unicode characters

In [5]:
def sent_clean(sent):
    sent = sent.lower()
    sent = utils.to_unicode(sent)
    for rule in cleaner:
        sent = rule(sent)
    return sent

cleaner = [gsp.strip_tags, 
           gsp.strip_punctuation,
           gsp.strip_multiple_whitespaces,
           gsp.strip_numeric,
           gsp.remove_stopwords, 
           gsp.strip_short, 
           gsp.stem_text]

In [6]:
s1 = []
for ele in actions_df["action_sent"]:
    s1.append(sent_clean(ele))



In [7]:
actions_df['cleaned'] = s1

## 3. Featurization
- i want the features to capture some context hence using Word2Vec

In [9]:
# download the pretrained model from
#https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz

In [13]:
from gensim.models import KeyedVectors
model1 = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)


In [15]:
w2v_words = list(model1.wv.vocab)

  """Entry point for launching an IPython kernel.


In [16]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []
for sent in tqdm(actions_df['cleaned']):
    sent_vec = np.zeros(300) # as word vectors are of 300 length
    cnt_words =0 
    for word in sent:
        if word in w2v_words:
            vec = model1.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)

  if __name__ == '__main__':
100%|██████████████████████████████████████████████████████████████████████████████| 1250/1250 [11:17<00:00,  1.84it/s]


## 4. Training Auto Encoder 
I am using Auto Encoder to learn efficient data codings in an unsupervised manner. The aim of using autoencoder is to learn a representation (encoding) for the set of action sentence data.

### 4.1 Layer structure of the auto encoder
- Layer1: 300 features INPUT
- Layer2: 600 features
- Layer3: 150 features
- Layer4: 600 features
- Layer5: 300 features OUTPUT

<br>
Autoencoders are trained with the same data as input & output both. So, Layer 5 output is nothing but a reconstructed version of the input with some loss

In [54]:
from sklearn.neural_network import MLPRegressor

auto_en = MLPRegressor(hidden_layer_sizes=(600,150,600))
auto_en.fit(sent_vectors, sent_vectors)
predicted_vec = auto_en.predict(sent_vectors)

In [55]:
auto_en.score(predicted_vec, sent_vectors)



0.5776947222477959

The Autoencoder is able to reconstruct only 57 % variance as per 'Regression accuracy'

## 5. one-class SVM

In [56]:
from sklearn.svm import OneClassSVM

In [58]:
svm_clf = OneClassSVM(gamma='scale', nu=0.01)

In [59]:
svm_clf.fit(sent_vectors)

OneClassSVM(cache_size=200, coef0=0.0, degree=3, gamma='scale', kernel='rbf',
            max_iter=-1, nu=0.01, shrinking=True, tol=0.001, verbose=False)

### 5.1 test metrices

In [None]:
test_data = pd.read_csv('test.csv')
test_y = test['label']
test_x = test['sentence']

In [None]:
# detect outliers in the test set

svm_yhat = model.predict(test_x)

 To evaluate the performance of the model as a binary classifier, we must change the labels in the test dataset from 0 and 1 for the majority and minority classes respectively, to +1 and -1.

In [None]:
classification_report(test_y ,svm_yhat)