# CSCE670 Spotlight: Introduction to NLP using PyTorch

### Sahan Suresh Alva (UIN: 130004855) 
PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. It is primarily developed by Facebook's AI Research lab (FAIR).

At its core, PyTorch provides two main features:


1.   An n-dimensional tensor, similar to a NumPy array, that can also run on GPUs
2.   Automatic differentiation for building and training neural networks

## Why PyTorch?

There are a few reasons you might prefer PyTorch to other deep learning libraries:

1.   Unlike static-graph libraries such as TensorFlow 1.x, where you must first define an entire computational graph before you can run your model, PyTorch lets you define the graph dynamically as the code executes.
2.   PyTorch is also popular for deep learning research because models are written as ordinary Python code, which makes them flexible to modify and debug while still running fast on GPUs.



## Basics of PyTorch

When it comes to data objects, PyTorch is quite similar to NumPy. Where NumPy has multi-dimensional arrays, PyTorch has tensors, so let’s first understand what tensors are.

Tensors are multidimensional arrays, much like NumPy’s n-dimensional arrays. Unlike NumPy arrays, tensors can also be placed on a GPU, which is a major advantage when training neural networks. The examples in this section show how similar the two objects are.

PyTorch supports multiple types of tensors, including:
*   FloatTensor: 32-bit float
*   DoubleTensor: 64-bit float
*   HalfTensor: 16-bit float
*   IntTensor: 32-bit int
*   LongTensor: 64-bit int
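
As a small illustration of the types listed above, the sketch below creates tensors with explicit dtypes and moves one to a GPU when one is available. The variable names are illustrative only.

```python
import torch

# Each tensor type in the list corresponds to a dtype
x_float = torch.tensor([1, 2, 3], dtype=torch.float32)   # FloatTensor
x_double = torch.tensor([1, 2, 3], dtype=torch.float64)  # DoubleTensor
x_half = torch.tensor([1, 2, 3], dtype=torch.float16)    # HalfTensor
x_int = torch.tensor([1, 2, 3], dtype=torch.int32)       # IntTensor
x_long = torch.tensor([1, 2, 3])                          # Python ints default to int64 (LongTensor)

print(x_float.dtype, x_long.dtype)

# Convert between types and move to a GPU if one is present
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x_on_device = x_long.to(torch.float32).to(device)
print(x_on_device.dtype, x_on_device.device.type)
```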









In [2]:
import numpy as np
import torch

a = np.array(1)
b = torch.tensor(1)

print(a, type(a))
print(b, type(b))

1 <class 'numpy.ndarray'>
tensor(1) <class 'torch.Tensor'>


### Basic Mathematical Operations

We will compare some basic mathematical operations in NumPy and PyTorch. The operations are very similar, and features such as broadcasting are also available in PyTorch.


In [22]:
np_a = np.array(10)
np_b = np.array(5)
print(np_a + np_b, np_a - np_b, np_a * np_b, np_a / np_b)

pt_a = torch.tensor(10)
pt_b = torch.tensor(5)
print(pt_a + pt_b, pt_a - pt_b, pt_a * pt_b, pt_a / pt_b)

15 5 50 2.0
tensor(15) tensor(5) tensor(50) tensor(2.)
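
Broadcasting, mentioned above, works the same way in both libraries: a smaller array is stretched across the larger one without copying data by hand. A minimal sketch, with illustrative variable names:

```python
import numpy as np
import torch

# A (2, 3) matrix plus a length-3 row: the row is added to every row of the matrix
np_m = np.array([[1, 2, 3], [4, 5, 6]])
np_row = np.array([10, 20, 30])
print(np_m + np_row)

pt_m = torch.tensor([[1, 2, 3], [4, 5, 6]])
pt_row = torch.tensor([10, 20, 30])
print(pt_m + pt_row)

# A CPU tensor converts to a NumPy array with .numpy(), so the results can be compared directly
print(np.array_equal((pt_m + pt_row).numpy(), np_m + np_row))
```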


### Matrix Operations

We will compare some matrix operations in NumPy and PyTorch. Again the calls are very similar: np.dot and torch.mm perform matrix multiplication, while the add and divide functions work element-wise.

In [7]:
np.random.seed(42)
np_a = np.random.randn(3,3)
np_b = np.random.randn(3,3)

print(np.add(np_a,np_b), '\n')
print(np.dot(np_a,np_b), '\n')
print(np.divide(np_a,np_b))

[[ 1.0392742  -0.60168199  0.18195878]
 [ 1.76499213 -2.14743362 -1.95905479]
 [ 1.01692529 -0.24539639 -0.15522705]] 

[[-0.12814468 -0.62164688  0.21069439]
 [ 0.90133115 -0.02065676 -0.3790019 ]
 [ 1.30648762 -1.7246546  -2.20677932]] 

[[ 0.9155008   0.29835784 -1.39069607]
 [ 6.29449313  0.12238321  0.13573803]
 [-2.80855031 -0.75771243 -1.49396459]]


In [6]:
torch.manual_seed(42)
pt_a = torch.randn(3,3)
pt_b = torch.randn(3,3)

print(torch.add(pt_a,pt_b), '\n')
print(torch.mm(pt_a,pt_b), '\n')
print(torch.div(pt_a,pt_b))

tensor([[ 0.6040,  0.6637,  1.0438],
        [ 1.3406, -2.8127, -1.1753],
        [ 3.1662,  0.6841,  1.2788]]) 

tensor([[ 0.4576,  0.2724,  0.3367],
        [-1.3636,  1.7743,  1.1446],
        [ 0.3243,  2.8696,  2.7954]]) 

tensor([[ 1.2594,  0.2408,  0.2897],
        [ 0.2075,  0.6645,  0.1884],
        [ 2.3051, -0.4826,  0.5649]])


#### Concatenating Tensors


In [9]:
a = torch.tensor([[1,2],[3,4]])
b = torch.tensor([[5,6],[7,8]])
c = torch.cat((a,b))
print(a, '\n')
print(b, '\n')
print(c)

tensor([[1, 2],
        [3, 4]]) 

tensor([[5, 6],
        [7, 8]]) 

tensor([[1, 2],
        [3, 4],
        [5, 6],
        [7, 8]])


#### Reshaping Tensors

In [13]:
print(c.shape, "\n")
d = c.reshape(1,8)
print(d, "\n")
print(d.shape, "\n")


torch.Size([4, 2]) 

tensor([[1, 2, 3, 4, 5, 6, 7, 8]]) 

torch.Size([1, 8]) 



### Important PyTorch Modules

#### Autograd Module

PyTorch uses a technique called automatic differentiation. During the forward pass it records all the operations that we perform, and it then replays them backward to compute the gradients. This saves us from deriving and coding the gradients by hand: the graph is built on the forward pass itself, and a single backward call produces every gradient.



In [15]:
a = torch.ones((2,2), requires_grad=True)
b = a + 10
c = b.mean()
print(b,c)

tensor([[11., 11.],
        [11., 11.]], grad_fn=<AddBackward0>) tensor(11., grad_fn=<MeanBackward0>)


We added 10 to all the elements of **a** and then took the mean of the result.

Since c is the mean of four elements, the derivative of c w.r.t. each element of a is ¼, so every entry of the gradient matrix will be 0.25. Let’s verify this using PyTorch:

In [16]:
c.backward()
print(a.grad)


tensor([[0.2500, 0.2500],
        [0.2500, 0.2500]])
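
The same mechanism works for any differentiable expression. As a further sketch (the tensor x and the function y are illustrative and not part of the tutorial), the gradient of a sum of squares should come out as 2x:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Forward pass: autograd records the square and the sum
y = (x ** 2).sum()

# Backward pass: fills x.grad with dy/dx = 2 * x
y.backward()
print(x.grad)

# Operations inside no_grad() are not recorded, which is useful at evaluation time
with torch.no_grad():
    z = x * 2
print(z.requires_grad)
```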


#### Optim Module

The optim module in PyTorch provides ready-made implementations of most of the optimizers used when training a neural network. We just have to import them and pass them the model's parameters. Below are examples of creating the Adam and SGD optimizers. The lines are commented out because we haven't built the model yet.


In [None]:
from torch import optim

# Adam
# adam = optim.Adam(model.parameters(), lr=learning_rate)

# SGD
# sgd = optim.SGD(model.parameters(), lr=learning_rate)
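
To make the commented lines above concrete, here is a minimal sketch of one optimization step. The tiny linear model, the random data and the learning rate are placeholders chosen for illustration, not the model built later in this tutorial.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

model = nn.Linear(3, 1)     # placeholder model
learning_rate = 0.1         # placeholder value
sgd = optim.SGD(model.parameters(), lr=learning_rate)

inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)

before = model.weight.detach().clone()

sgd.zero_grad()                                          # clear old gradients
loss = nn.functional.mse_loss(model(inputs), targets)   # forward pass
loss.backward()                                          # compute gradients
sgd.step()                                               # update the parameters

print(loss.item())
print(torch.equal(before, model.weight.detach()))        # weights have moved
```

Swapping optim.SGD for optim.Adam changes only the line that creates the optimizer; the zero_grad, backward and step calls stay the same.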

#### nn Module
The autograd module lets PyTorch build computation graphs as the model runs. However, working with autograd alone is quite low-level when we are dealing with a complex neural network.

In those cases, we can use the nn module. It defines a set of modules, corresponding to the layers of a neural network, each of which takes the output of the previous layer as its input and produces an output.
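
As a minimal sketch of this idea, the snippet below stacks a few nn layers into a small feedforward network. The layer sizes are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Each layer takes the output of the previous one as its input
net = nn.Sequential(
    nn.Linear(4, 8),   # 4 input features -> 8 hidden units
    nn.ReLU(),
    nn.Linear(8, 1),   # 8 hidden units -> 1 output
)

batch = torch.randn(5, 4)   # a batch of 5 examples
out = net(batch)
print(out.shape)

# nn modules register their parameters so that an optimizer can find them
n_params = sum(p.numel() for p in net.parameters())
print(n_params)
```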

## Building a Sentiment Classifier in PyTorch

In this section we will build a deep learning model to detect sentiment using PyTorch and TorchText. We will use movie reviews from the IMDb dataset for this tutorial.

### Data

The code below automatically downloads the IMDb dataset and splits it into the canonical train/test splits as torchtext.datasets objects. It processes the data using the Fields that we define first. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.

The parameters of a Field specify how the data should be processed. We use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment.
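
Conceptually, a Field tokenizes each review, builds a vocabulary, and converts tokens into padded index sequences. The sketch below imitates those steps in plain Python with a whitespace tokenizer. The helper name and the sample reviews are hypothetical, and this is a simplified illustration rather than TorchText's implementation.

```python
from collections import Counter

reviews = ["a great movie", "not a great ending at all"]

# 1. Tokenize (the tutorial uses spaCy; whitespace splitting stands in here)
tokenized = [r.split() for r in reviews]

# 2. Build a vocabulary with special unknown and padding tokens
counts = Counter(tok for toks in tokenized for tok in toks)
itos = ["<unk>", "<pad>"] + sorted(counts, key=lambda t: (-counts[t], t))
stoi = {tok: i for i, tok in enumerate(itos)}

def numericalize(tokens, max_len):
    """Map tokens to indices and pad to max_len (hypothetical helper)."""
    ids = [stoi.get(t, stoi["<unk>"]) for t in tokens]
    return ids + [stoi["<pad>"]] * (max_len - len(ids))

# 3. Convert every review to indices, padded to the longest review
lengths = [len(toks) for toks in tokenized]
max_len = max(lengths)
batch = [numericalize(toks, max_len) for toks in tokenized]

print(itos)
print(batch, lengths)
```

The lengths list plays the role of the extra value that include_lengths = True returns alongside the padded text.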




In [None]:
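import torch
from torchtext import data  # Field API; in torchtext 0.9 to 0.11 it lives under torchtext.legacy

SEED = 42
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# include_lengths = True makes each batch return (padded text, true lengths)
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)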

In [2]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

aclImdb_v1.tar.gz:   0%|          | 131k/84.1M [00:00<01:14, 1.13MB/s]

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:01<00:00, 51.6MB/s]


In [3]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 25000
Number of testing examples: 25000


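We split the training data into training and validation sets. The split uses TorchText's default 70/30 ratio, which gives the 17,500/7,500 sizes printed below. We seed Python's random module with the same SEED so that the split is reproducible.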

In [4]:
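import random

# random.seed() returns None, so seed first and then pass the resulting state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state = random.getstate())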
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000


We use GloVe word embeddings as the input to our deep learning model. Instead of initializing our word embeddings randomly, we initialize them with pre-trained GloVe vectors. We get these vectors simply by specifying which ones we want and passing that name as the vectors argument to build_vocab. TorchText handles downloading the vectors and associating them with the correct words in our vocabulary. The vocabulary is capped at the 25,000 most frequent words, and words in it that have no GloVe vector are initialized from a normal distribution via unk_init.

In [5]:
MAX_VOCAB_SIZE = 25000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

.vector_cache/glove.6B.zip: 862MB [06:27, 2.23MB/s]                          
100%|█████████▉| 398252/400000 [00:15<00:00, 25449.20it/s]


Now we create the iterators, placing the tensors on the GPU if one is available.

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)

### Model

We will use a bidirectional LSTM architecture for our model.

*Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the field of deep learning. Unlike standard feedforward neural networks, LSTM has feedback connections. It can not only process single data points (such as images), but also entire sequences of data (such as speech or video).*

We use the bidirectional version of the LSTM. In addition to a first RNN that processes the words in the sentence from the first to the last (a forward RNN), we have a second RNN that processes the words from the last to the first (a backward RNN). At time step $t$ of a sentence with $T$ words, the forward RNN is processing word $x_t$, and the backward RNN is processing word $x_{T-t+1}$.



![title](Deep-Dive-into-Bidirectional-LSTM-i2tutorials.jpg)
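
To make the two directions concrete, the sketch below runs a small bidirectional nn.LSTM on random input and inspects the shapes. The dimensions are arbitrary illustration values, not those of the tutorial model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, batch_size, emb_dim, hidden_dim = 7, 4, 10, 16

lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, bidirectional=True)
x = torch.randn(seq_len, batch_size, emb_dim)   # [sent len, batch size, emb dim]

output, (hidden, cell) = lstm(x)

# output holds the forward and backward states side by side at every time step
print(output.shape)   # [seq_len, batch_size, 2 * hidden_dim]

# hidden holds the final state of each (layer, direction) pair
print(hidden.shape)   # [num_layers * 2, batch_size, hidden_dim]

# hidden[-2] is the top layer's final forward state, hidden[-1] its final backward state
final = torch.cat((hidden[-2], hidden[-1]), dim=1)
print(final.shape)    # [batch_size, 2 * hidden_dim]
```

The forward direction finishes at the last word and the backward direction finishes at the first word, which is why the two final states are concatenated before classification.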

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)
        
        self.rnn = nn.LSTM(embedding_dim, 
                           hidden_dim, 
                           num_layers=n_layers, 
                           bidirectional=bidirectional, 
                           dropout=dropout)
        
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
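    def forward(self, text, text_lengths):
        
        # text: [sent len, batch size] -> embedded: [sent len, batch size, emb dim]
        embedded = self.dropout(self.embedding(text))
        
        # pack so the LSTM skips padding; recent PyTorch versions expect the lengths on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths.cpu())
        
        packed_output, (hidden, cell) = self.rnn(packed_embedded)
        
        # unpack to a padded tensor (not used below, kept for reference)
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
        
        # concatenate the top layer's final forward (hidden[-2]) and backward (hidden[-1]) states
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
                    
        return self.fc(hidden)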

We add regularization by using dropout with p = 0.5, which randomly zeroes each activation with probability 0.5 during training so that the network cannot rely on any single unit.
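
A minimal sketch of what dropout does, using a standalone nn.Dropout layer on a tensor of ones (the sizes are illustrative). In training mode some of the values are zeroed and the survivors are scaled by 1 / (1 - p); in evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

dropout = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

dropout.train()          # dropout "turned on"
train_out = dropout(x)
print(train_out)         # every entry is either 0 or 2 (= 1 / (1 - 0.5))

dropout.eval()           # dropout "turned off"
eval_out = dropout(x)
print(eval_out)          # identical to x
```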

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            N_LAYERS, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

In [9]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')


The model has 4,810,857 trainable parameters
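
The parameter count above can be reproduced by hand. The sketch below assumes a vocabulary of 25,002 entries (MAX_VOCAB_SIZE plus the unknown and padding tokens) and an LSTM layout of four gates, each with input weights, recurrent weights and two bias vectors.

```python
vocab_size = 25_000 + 2      # assumption: 25,000 words + <unk> + <pad>
emb_dim, hidden_dim, output_dim = 100, 256, 1

def lstm_direction_params(input_dim, hidden_dim):
    """Parameters of one LSTM layer in one direction (4 gates)."""
    return 4 * (input_dim * hidden_dim      # input-to-hidden weights
                + hidden_dim * hidden_dim   # hidden-to-hidden weights
                + 2 * hidden_dim)           # two bias vectors

embedding = vocab_size * emb_dim
layer1 = 2 * lstm_direction_params(emb_dim, hidden_dim)          # forward + backward
layer2 = 2 * lstm_direction_params(2 * hidden_dim, hidden_dim)   # input is both directions of layer 1
fc = 2 * hidden_dim * output_dim + output_dim                    # weights + bias

total = embedding + layer1 + layer2 + fc
print(f'{total:,}')
```

More than half of the parameters sit in the embedding layer, which is one reason that starting from pre-trained vectors helps.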


In [None]:
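# copy the pre-trained GloVe vectors into the embedding layer
pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]

# zero the <unk> and <pad> rows so that they start without any pre-trained signal
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)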



### Training 

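As shown in the earlier Optim Module section, we use the optim module to create an Adam optimizer for our model. Adam adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates. For the loss we use binary cross-entropy with logits (BCEWithLogitsLoss), which applies the sigmoid internally, so the model outputs raw scores.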

In [None]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()

model = model.to(device)
criterion = criterion.to(device)

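We implement a function to calculate accuracy.

In [None]:
def binary_accuracy(preds, y):
    """Return the fraction of predictions in the batch that match the labels."""

    # squash the logits to [0, 1] and round to the nearest class (0 or 1)
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() 
    acc = correct.sum() / len(correct)
    return acc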
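
To see how the accuracy function behaves, the sketch below applies it to a toy batch of four logits. The function is repeated so that the snippet runs on its own, and the numbers are made up for illustration.

```python
import torch

def binary_accuracy(preds, y):
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float()
    return correct.sum() / len(correct)

logits = torch.tensor([2.0, -1.0, 0.5, -3.0])   # raw model outputs
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

# sigmoid maps positive logits above 0.5 (class 1) and negative logits below 0.5 (class 0)
acc = binary_accuracy(logits, labels)
print(acc.item())   # 3 of the 4 predictions match -> 0.75

# BCEWithLogitsLoss applies the same sigmoid internally before the cross-entropy
criterion = torch.nn.BCEWithLogitsLoss()
print(criterion(logits, labels).item())
```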

We define a function for training our model.

As we are now using dropout, we must remember to use model.train() to ensure the dropout is "turned on" while training.

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
        
        text, text_lengths = batch.text
        
        predictions = model(text, text_lengths).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


We define a function for testing our model.

As we are now using dropout, we must remember to use model.eval() to ensure the dropout is "turned off" while evaluating.

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            text, text_lengths = batch.text
            
            predictions = model(text, text_lengths).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [None]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

At each epoch, if the validation loss is the best we have seen so far, we'll save the parameters of the model and then after training has finished we'll use that model on the test set.

In [17]:
N_EPOCHS = 15

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut2-model.pt')
    
    print(f'Iteration: {epoch+1:02} | Iteration Time: {epoch_mins}m {epoch_secs}s | Train Acc: {train_acc*100:.2f}% |  Val Acc: {valid_acc*100:.2f}%')
    


Iteration: 01 | Iteration Time: 0m 42s | Train Acc: 60.39% |  Val Acc: 72.14%
Iteration: 02 | Iteration Time: 0m 41s | Train Acc: 73.51% |  Val Acc: 81.76%
Iteration: 03 | Iteration Time: 0m 41s | Train Acc: 80.30% |  Val Acc: 85.12%
Iteration: 04 | Iteration Time: 0m 41s | Train Acc: 84.56% |  Val Acc: 86.51%
Iteration: 05 | Iteration Time: 0m 41s | Train Acc: 86.39% |  Val Acc: 83.18%
Iteration: 06 | Iteration Time: 0m 41s | Train Acc: 88.58% |  Val Acc: 88.75%
Iteration: 07 | Iteration Time: 0m 41s | Train Acc: 90.65% |  Val Acc: 88.79%
Iteration: 08 | Iteration Time: 0m 41s | Train Acc: 91.08% |  Val Acc: 89.38%
Iteration: 09 | Iteration Time: 0m 41s | Train Acc: 92.29% |  Val Acc: 88.85%
Iteration: 10 | Iteration Time: 0m 41s | Train Acc: 93.03% |  Val Acc: 90.00%
Iteration: 11 | Iteration Time: 0m 41s | Train Acc: 93.89% |  Val Acc: 89.69%
Iteration: 12 | Iteration Time: 0m 41s | Train Acc: 94.62% |  Val Acc: 89.80%
Iteration: 13 | Iteration Time: 0m 41s | Train Acc: 95.21% |  Va

Using the best model that we have saved, we calculate the test accuracy. 

In [19]:
model.load_state_dict(torch.load('tut2-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')


Test Loss: 0.302 | Test Acc: 88.22%


### References:

- https://pytorch.org/tutorials/beginner/pytorch_with_examples.html
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- https://github.com/bentrevett/pytorch-sentiment-analysis