# Model Training - Basic Model

In this Notebook, we will go through building a basic PyTorch Model for Training, and training it to get results on our dataset.

### Imports

In this project, we will be using PyTorch for Deep Learning. NLP Pre-Processing, however, will be done using Keras's modules, because I prefer the implementation provided in the library. Instead of installing Keras, the relavant modules are imported in as scripts from GitHub.

In [18]:
import pandas as pd;
import numpy as np;

import torch;
from torch import nn;
from torch.utils.data import Dataset, DataLoader;
import torch.nn.functional as F;
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score;

import math;
from numpy import save, load;
import keras_sequence_preprocessing as seq_preprocessing;
import keras_text_preprocessing as text_preprocessing;

import matplotlib.pyplot as plt;

import time;

from PyTorchTools import EarlyStopping;

In [24]:
quora_train_text = pd.read_csv('data/augmented_quora_text.txt');

In [25]:
quora_train_text = quora_train_text.dropna()

### Word Embeddings

We have 2 different types of Word Embeddings we will try in this application: Glove and FastText. To use the specific embedding, run that cell and not the other, as both are loaded in with the same formatting.

In [26]:
embed_size = 300;

In [27]:
# GLOVE Embeddings

embeddings_dict = {};
with open('../Embeddings/glove.6B/glove.6B.%dd.txt'%(embed_size), 'rb') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

In [4]:
# FASTTEXT Embeddings

embeddings_dict = {};
with open('../Embeddings/crawl-%dd-2M.vec'%(embed_size), 'rb') as f:
    for line in f:
        splits = line.split();
        word = splits[0];
        vec = np.asarray(splits[1:], dtype='float32')
        
        embeddings_dict[word.decode()] = vec;

We build a Word Index from the embeddings. To quickly do this, we will simply be iterating over the dataset and assigning an integer value to each word.

In [28]:
word_index = {};

token_num = 0;
for row in quora_train_text[['cleaned_text', 'target']].iterrows():
    text, label = row[1]
    
    tokens = [token for token in text.split(' ')];
    
    for token in tokens:
        if token not in word_index:
            word_index[token] = token_num;
            token_num = token_num + 1;

In [29]:
MAX_WORDS = 200000
MAX_LEN = 70

Next, we encode the individual sentences into sequences of integers from the word index. Than Pad them to fixed lengths using post-sequence-padding.

In [30]:
def encode_sentences(sentence, word_index=word_index, max_words=MAX_WORDS):
    output = [];
    for token in sentence.split(' '):
        if (token in word_index) and (word_index[token] < max_words):
            output.append(word_index[token]);
    return output;

In [31]:
encoded_sentences = [encode_sentences(sent) for sent in quora_train_text['cleaned_text']]

In [32]:
encoded_lengths = [len(x) for x in encoded_sentences]

In [33]:
padded_sequences = seq_preprocessing.pad_sequences(encoded_sentences, maxlen=MAX_LEN, padding='post', truncating='post');

To do training / testing, we will divide the dataset into proper Training and Validation. 85% of the dataset for training, and the remaining 15% fo validation.

In [34]:
val_split = int(0.85 * len(quora_train_text));

train_ds = padded_sequences[:val_split];
val_ds = padded_sequences[val_split:];

train_y = quora_train_text.iloc[:val_split]['target'].values;
val_y = quora_train_text.iloc[val_split:]['target'].values;

train_lens = encoded_lengths[:val_split];
val_lens = encoded_lengths[val_split:];

len(train_ds), len(val_ds)

(1176815, 207674)

We build an Embeddings Matrix. Each row in the matrix is a vector from Glove / Fasttext.

In [35]:
vocab_size = min(MAX_WORDS, len(word_index))+1;
embeddings_matrix = np.zeros((vocab_size, embed_size));

for word, posit in word_index.items():
    if posit >= vocab_size:
        break;
        
    vec = embeddings_dict.get(word);
    if vec is None:
        vec = np.random.sample(embed_size);
        embeddings_dict[word] = vec;
    
    embeddings_matrix[posit] = vec;

In [36]:
embeddings_tensor = torch.Tensor(embeddings_matrix)

Build a Data Loader to iterate over during the training process in a fixed batch size:

In [14]:
class QuoraDataset(Dataset):
    def __init__(self, encoded_sentences, labels, lengths):
        self.encoded_sentences = encoded_sentences;
        self.labels = labels;
        self.lengths = lengths;
        
    def __len__(self):
        return len(self.encoded_sentences);
    
    def __getitem__(self, index):
        x = self.encoded_sentences[index, :];
        x = torch.LongTensor(x);
        
        y = self.labels[index];
        y = torch.Tensor([y]);
        
        length = self.lengths[index];
        length = torch.Tensor([length]);
        
        return x, y, length;

In [15]:
train_dataset = QuoraDataset(train_ds, train_y, train_lens);
val_dataset = QuoraDataset(val_ds, val_y, val_lens);

In [16]:
batch_size = 512;

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True);
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=True);

## Creating a Model

The Torch Model will have the following architecture:

1. Embeddings Layer
2. 1st LSTM Layer
2. 1st Dense Fully Connected Layer
3. ReLU Activation
4. 2nd LSTM Layer
5. Global Max-Average Pooling Layer
6. 2nd Dense Fully Connected Layer

In [19]:
class Model(nn.Module):
    def __init__(self, embedding_matrix, hidden_unit = 64):
        super(Model, self).__init__();
        vocab_size = embeddings_tensor.shape[0];
        embedding_dim = embeddings_tensor.shape[1];
        
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim);
        self.embedding_layer.weight = nn.Parameter(embeddings_tensor);
        self.embedding_layer.weight.requires_grad = True;
        
        self.lstm_1 = nn.LSTM(embedding_dim, hidden_unit, bidirectional=True);
        
        self.fc_1 = nn.Linear(hidden_unit*2, hidden_unit*2);
        
        self.lstm_2 = nn.LSTM(hidden_unit*2, hidden_unit, bidirectional=True);
        
        self.fc_2 = nn.Linear(hidden_unit * 2 * 2, 1);
        
    def forward(self, x):
        out = self.embedding_layer(x);
        
        out, _ = self.lstm_1(out);
        
        out = self.fc_1(out);
        
        out = torch.relu(out);
        
        out, _ = self.lstm_2(out);
        
        out_avg, out_max = torch.mean(out, 1), torch.max(out, 1)[0];
        out = torch.cat((out_avg, out_max), 1);
        
        out = self.fc_2(out);
        return out;

In [38]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [39]:
model = Model(embeddings_tensor, 64);
model = model.to(device);
model

Model(
  (embedding_layer): Embedding(200001, 300)
  (lstm_1): LSTM(300, 64, bidirectional=True)
  (fc_1): Linear(in_features=128, out_features=128, bias=True)
  (lstm_2): LSTM(128, 64, bidirectional=True)
  (fc_2): Linear(in_features=256, out_features=1, bias=True)
)

We use a Binary-Cross-Entropy Loss Function, and an Adam Optimizer with a 0.003 Learning Rate.

In [21]:
criterion = nn.BCEWithLogitsLoss();
optimizer = torch.optim.Adam(lr=0.003, params = model.parameters());

## Model Training

Now we write the methods to iterate over the data to train and evaluate our model.

In [20]:
def train(nn_model, nn_optimizer, nn_criterion, data_loader, val_loader = None, num_epochs = 5, print_ratio = 0.1, verbose=True):
    
    print_every_step = int(print_ratio * len(train_loader));
    
    if verbose:
        print('Training with model: ');
        print(nn_model);
    
    for epoch in range(num_epochs):

        epoch_time = time.time();    

        f1_scores_train = []

        # Enable Training for the model
        nn_model.train()
        running_loss = 0;

        all_ys = torch.tensor(data=[]).to(device);
        all_preds = torch.tensor(data=[]).to(device);

        for ite, (x, y, l) in enumerate(data_loader):
            init_time = time.time();

            # Convert our tensors to GPU tensors
            x = x.cuda()
            y = y.cuda()

            # Clear gradients
            nn_optimizer.zero_grad()

            # Forward Propagation and compute predictions
            preds = nn_model.forward(x, l)

            # Compute loss against actual values
            loss = nn_criterion(preds, y)

            # Add predictions and actuals into larger list for scoring
            all_preds = torch.cat([all_preds, preds]);
            all_ys = torch.cat([all_ys, y]);

            # Back Propagation and Updating weights
            loss.backward()
            nn_optimizer.step()

            running_loss = running_loss + loss.item();

            if ite % print_every_step == print_every_step-1:
                
                # Compute Sigmoid Activation and Prediction Probabilities
                preds_sigmoid = torch.sigmoid(all_preds).cpu().detach().numpy();
                
                # Compute Predictions over the Sigmoid base line
                all_preds = (preds_sigmoid > 0.5).astype(int);

                # Compute Metrics
                all_ys = all_ys.detach().cpu().numpy();

                f_score = f1_score(all_ys, all_preds);
                precision = precision_score(all_ys, all_preds);
                recall = recall_score(all_ys, all_preds);
                accuracy = accuracy_score(all_ys, all_preds);

                print('\t[%d %5d %.2f sec] loss: %.3f acc: %.3f prec: %.3f rec: %.3f f1: %.3f'%(epoch+1, ite+1, time.time() - init_time, running_loss / 2000, accuracy, precision, recall, f_score))

                all_ys = torch.tensor(data=[]).to(device);
                all_preds = torch.tensor(data=[]).to(device);
        
        print('Epoch %d done in %.2f min'%(epoch+1, (time.time() - epoch_time)/60 ));

        if val_loader is not None:
            eval(nn_model, nn_criterion, val_loader);
        
        running_loss = 0.0;

In [19]:
def eval(nn_model, nn_criterion, data_loader):

    # Disable weight updates
    with torch.no_grad():

        # Enable Model Evaluation
        nn_model.eval()
        running_loss = 0;
        
        all_ys = torch.tensor(data=[]).to(device);
        all_preds = torch.tensor(data=[]).to(device);

        init_time = time.time();

        for ite, (x, y, l) in enumerate(data_loader):

            # Convert tensors to GPU tensors
            x = x.cuda()
            y = y.cuda()

            # Forward propagation to compute predictions
            preds = nn_model.forward(x, l)

            # Compute loss on these predictions
            loss = nn_criterion(preds, y)

            all_preds = torch.cat([all_preds, preds]);
            all_ys = torch.cat([all_ys, y]);

            running_loss = running_loss + loss.item();

        # Compute Sigmoid activation on the predictions, and derive predictions over the Sigmoid base line
        preds_sigmoid = torch.sigmoid(all_preds).cpu().detach().numpy();
        all_preds = (preds_sigmoid > 0.5).astype(int);

        # Compute metrics
        all_ys = all_ys.detach().cpu().numpy();
        f_score = f1_score(all_ys, all_preds);

        precision = precision_score(all_ys, all_preds);
        recall = recall_score(all_ys, all_preds);
        accuracy = accuracy_score(all_ys, all_preds);

        print('\tEVAL: [%5d %.2f sec] loss: %.3f acc: %.3f prec: %.3f rec: %.3f f1: %.3f'%(ite+1, time.time() - init_time, running_loss / 2000, accuracy, precision, recall, f_score))

Running Training on the Model

In [86]:
train(model, optimizer, criterion, train_loader)

Training with model: 
Model(
  (embedding_layer): Embedding(100001, 100)
  (lstm_1): LSTM(100, 64, bidirectional=True)
  (fc_1): Linear(in_features=128, out_features=128, bias=True)
  (lstm_2): LSTM(128, 64, bidirectional=True)
  (fc_2): Linear(in_features=256, out_features=1, bias=True)
)
	[1   356 0.13 sec] loss: 0.038 acc: 0.938 prec: 0.548 rec: 0.004 f1: 0.008
	[1   712 0.13 sec] loss: 0.069 acc: 0.940 prec: 0.615 rec: 0.104 f1: 0.178
	[1  1068 0.13 sec] loss: 0.098 acc: 0.942 prec: 0.629 rec: 0.180 f1: 0.279
	[1  1424 0.13 sec] loss: 0.125 acc: 0.943 prec: 0.615 rec: 0.207 f1: 0.310
	[1  1780 0.14 sec] loss: 0.152 acc: 0.944 prec: 0.617 rec: 0.231 f1: 0.336
	[1  2136 0.14 sec] loss: 0.179 acc: 0.945 prec: 0.640 rec: 0.255 f1: 0.365
	[1  2492 0.14 sec] loss: 0.205 acc: 0.946 prec: 0.650 rec: 0.277 f1: 0.389
	[1  2848 0.14 sec] loss: 0.231 acc: 0.946 prec: 0.635 rec: 0.272 f1: 0.381
	[1  3204 0.13 sec] loss: 0.256 acc: 0.947 prec: 0.652 rec: 0.288 f1: 0.399
	[1  3560 0.13 sec] loss:

In [9]:
eval(model, criterion, val_loader)

	EVAL: [  764 16.99 sec] loss: 0.046 acc: 0.953 prec: 0.617 rec: 0.480 f1: 0.540


### The best training F1 score is **0.637** over 5 epochs and the evaluation F1 score is **0.540**