
# Homework 2 - Part 2
### Natural Language Processing and Information Extraction,  2020 WS
----
Exercise largely based on input from [Adam Kovacs](adam.kovacs@tuwien.ac.at)


## Instructions

This exercise does not include tests, visible or hidden. However, should your code raise errors that are not due to memory limitations on the JupyterHub server, we cannot spend time fixing them.

Your solution should replace placeholder lines such as:

`### YOUR CODE HERE
raise NotImplementedError()`

__IMPORTANT:__ Before submitting your solution, run your notebook with Kernel -> Restart & Run All and make sure it runs without exceptions. 

__Submission:__ You need to submit your exercise via the nbgrader interface. Note that you can submit multiple times before the deadline, but only your last submission will be graded.

Do not change the name of the ipynb file. Submission without validation is possible.

There is a total of __75 points__ to get for this exercise. Try to solve the exercise on your own. 

# Exercise Description

In this exercise you will get acquainted with PyTorch's framework for textual data, [TorchText](https://pytorch.org/text/). PyTorch is a Python library optimized for working with tensors for deep learning. To refresh your knowledge about tensors, I recommend reading "[The Poor Man's Introduction to Tensors](https://web2.ph.utexas.edu/~jcfeng/notes/Tensors_Poor_Man.pdf)" by Justin C. Feng, a very concise introduction to tensors and tensor operations.

The necessary packages are already available on the JupyterHub.

You are required, in the following, to load the data into memory (Task 1), to design a simple feed-forward network (Task 2), and then to modify it into an LSTM network (Task 3). The overall problem to solve is classification.

You are given a set of news texts (AG_NEWS) from different sources, and you have to build a classifier to make predictions on the types of articles:
* World
* Sports
* Business
* Sci/Tech

## Data loading, prerequisites

The [TorchText](https://pytorch.org/text/) framework offers you a set of built-in methods for handling text data sets. It also offers you some data sets for experimentation.

The `torchtext.data.Field` class is one of the main concepts of TorchText. It defines how the text data is to be processed and how it is converted to a tensor of token indices, which an embedding layer can later map to vectors.

The `TEXT` variable will store the processing pipeline selected for the data set in this exercise. We pass `tokenize = 'spacy'` to the `Field` constructor, telling it to use the spaCy tokenizer. The default tokenizer would otherwise split the text on whitespace.
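
To illustrate why the choice of tokenizer matters, the sketch below contrasts whitespace splitting with a punctuation-aware tokenizer. The regular expression is only a stand-in assumed for illustration; it is not how spaCy tokenizes, and `whitespace_tokenize` and `simple_tokenize` are hypothetical helpers.

```python
import re


def whitespace_tokenize(text):
    # Splits on whitespace only, so punctuation stays attached to words.
    return text.split()


def simple_tokenize(text):
    # Hypothetical stand-in for a real tokenizer:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Stocks fell, then rallied."
print(whitespace_tokenize(sentence))  # ['Stocks', 'fell,', 'then', 'rallied.']
print(simple_tokenize(sentence))      # ['Stocks', 'fell', ',', 'then', 'rallied', '.']
```

With whitespace splitting, `fell,` and `fell` would end up as two different vocabulary entries, which is one reason a proper tokenizer is usually preferred.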

Similarly, `LABEL` is a wrapper that marks specific fields in the data to be labels.

You can read [here](https://torchtext.readthedocs.io/en/latest/data.html) to understand the role `Fields` play in `torchtext`.

If you need to install `torchtext` in your environment, please use version 0.4:
`!pip install torchtext==0.4`


__Hint: it is a good idea to use Google Colab for this homework to be able to use GPUs. When you have completed the exercise, you can transfer the code to the JupyterHub.__ Take care that you do not rename notebooks.


Import the necessary libraries for this exercise. The random seed initialization is necessary for reproducibility reasons.

In [1]:
import torch
from torchtext import data
from torchtext.datasets import text_classification
import os

import random

SEED = 1234
# other imports

Initialize a `TEXT` and a `LABEL` variable with `torchtext` objects that will represent our dataset. We also set the random seed value defined above.

In [2]:
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.long)

## TASK 1. Loading the dataset, preparation for a NN training/prediction phase
### (total per task 1: 25 points)

One of the advantages of using `torchtext` is that it has methods for parallelizing data processing on GPUs by sending batches of input data that can be processed at the same time on a GPU (on a CPU, the processing happens sequentially per batch).

Here you need to load the data set using the right iterators, to define the training, validation, and test portions of the data. 

The training data set (out of which you will set out a portion as a validation set) is in the file `dataset_train.csv`. The testing data set is in `dataset_test.csv` 

### TASK 1.1  Load the data 
_(15 points)_

Use `data.TabularDataset` to load the data sets and split the training file into training and validation sets. Use the previously defined random seed when splitting the data set.

Hint: use the `data.TabularDataset.splits` method with the right parameters.


In [3]:
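def load_dataset(TEXT, LABEL):
    # The CSV columns are (label, text); the header row is skipped.
    fields = [('label', LABEL), ('text', TEXT)]

    full_train = data.TabularDataset('dataset_train.csv', 'csv', fields, skip_header=True)
    # Seed Python's RNG so that the 70/30 split is reproducible.
    random.seed(SEED)
    train, valid = full_train.split(split_ratio=0.7, random_state=random.getstate())
    test = data.TabularDataset('dataset_test.csv', 'csv', fields, skip_header=True)

    return train, valid, test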

In [4]:
train, valid, test = load_dataset(TEXT, LABEL)

In [5]:
print(f'Number of training examples: {len(train)}')
print(f'Number of validation examples: {len(valid)}')
print(f'Number of testing examples: {len(test)}')

Number of training examples: 84000
Number of validation examples: 36000
Number of testing examples: 7600


In [6]:
assert len(train) == 84000

Building a vocabulary for the textual data set we work on is an essential step. A vocabulary is a lookup table where every unique word has a corresponding index.

This is done so our machine learning model can operate on numbers instead of strings. The indexes are then used to construct embeddings for our words.
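
The lookup table described above can be sketched in plain Python. This is a minimal illustration, not the `torchtext` implementation: `build_vocab` is a hypothetical helper, and the ordering rule (special tokens first, then descending frequency, ties broken alphabetically) is an assumption made for the example.

```python
from collections import Counter


def build_vocab(tokenized_texts, specials=("<unk>", "<pad>")):
    # Count every token, then assign indices: special tokens first,
    # followed by the remaining tokens in descending frequency.
    counts = Counter(tok for text in tokenized_texts for tok in text)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]  # index -> string
    stoi = {tok: idx for idx, tok in enumerate(itos)}    # string -> index
    return itos, stoi


corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
itos, stoi = build_vocab(corpus)
print(itos)                             # ['<unk>', '<pad>', 'sat', 'the', 'cat', 'dog', 'down']
print([stoi[tok] for tok in corpus[0]])  # [3, 4, 2]
```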

In [7]:
TEXT.build_vocab(train)  
LABEL.build_vocab(train)

In [8]:
print(len(TEXT.vocab))

101328


In [9]:
print(TEXT.vocab.freqs.most_common(20))

[(',', 177777), ('the', 124554), ('.', 93195), ('to', 82695), ('-', 78301), ('a', 68652), ('of', 68399), ('in', 64781), ('and', 47862), ('on', 38844), ('for', 33975), ('#', 32867), ('(', 28282), (' ', 26845), ('39;s', 21676), ('that', 19128), ('with', 17148), ('The', 16941), ('as', 16459), (')', 16270)]


When words in the test set are not found in the vocabulary of the training set, a special token is used for them: `<unk>`.

For example, if the sentence is `"This film is great and I savoured it"` but the word `savoured` is not in the vocabulary, it becomes `"This film is great and I <unk> it"`.

We feed batches into our model, one at a time. Within a batch all sentences need to be of the same size. To do that, the shorter sentences need to be padded. With `torchtext` this is done automatically, when batches are created.
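
The following sketch shows, in plain Python, what unknown-token replacement and padding amount to. The toy vocabulary and the helpers `numericalize` and `pad_batch` are assumptions made for illustration; `torchtext` performs the equivalent steps internally when it creates batches.

```python
def numericalize(tokens, stoi, unk_token="<unk>"):
    # Tokens missing from the vocabulary fall back to the <unk> index.
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


def pad_batch(sequences, pad_index):
    # Right-pad every sequence to the length of the longest one in the batch.
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]


# Toy vocabulary, assumed for illustration only.
stoi = {"<unk>": 0, "<pad>": 1, "This": 2, "film": 3, "is": 4, "great": 5}

batch_tokens = [["This", "film", "is", "great"], ["film", "savoured"]]
batch_ids = [numericalize(tokens, stoi) for tokens in batch_tokens]
padded = pad_batch(batch_ids, pad_index=stoi["<pad>"])

print(batch_ids)  # [[2, 3, 4, 5], [3, 0]]
print(padded)     # [[2, 3, 4, 5], [3, 0, 1, 1]]
```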

![unk-example](unk-example.png)


### TASK 1.2 Construct iterators on the training, validation and test data. 
_(10 points)_

Batching allows us to process data sets that are much larger than the GPU's memory. Use `data.BucketIterator` to shuffle the data and group input sequences of similar length into the same batch, which minimizes the amount of padding needed.

Hint: Set the correct parameter values for the `splits` method: batch_size, sort_key (based on length), sort_within_batch and device
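
To see why grouping sequences of similar length helps, the sketch below counts the padding positions needed with and without length-based bucketing. It is a simplified illustration, not the actual `BucketIterator` algorithm; `make_batches` and `padding_needed` are hypothetical helpers and the sentence lengths are randomly generated toy values.

```python
import random


def make_batches(lengths, batch_size):
    # Cut a list of sentence lengths into consecutive batches.
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def padding_needed(batches):
    # Pad positions required if each batch is padded to its longest sentence.
    return sum(max(batch) * len(batch) - sum(batch) for batch in batches)


rng = random.Random(1234)
lengths = [rng.randint(5, 60) for _ in range(64)]  # toy sentence lengths

naive = make_batches(lengths, batch_size=8)             # batches in arrival order
bucketed = make_batches(sorted(lengths), batch_size=8)  # similar lengths together
rng.shuffle(bucketed)  # shuffle the order of the batches, not their contents

print("padding without bucketing:", padding_needed(naive))
print("padding with bucketing:   ", padding_needed(bucketed))
```

For equally sized batches, sorting by length before batching never needs more padding than any other grouping, which is why `sort_key` is based on the text length.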

In [10]:
def construct_iterators(train, dev, test):
    # BATCH_SIZE and device are module-level globals defined in the next cell.
    train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
        (train, dev, test), batch_size=BATCH_SIZE, sort_key=lambda x: len(x.text),
        sort_within_batch=True, device=device
    )
    return train_iterator, valid_iterator, test_iterator

In [11]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = construct_iterators(train, valid, test)

## TASK 2. Build a Feed forward Neural Network for text classification
### (total per task 2: 30 points)


In this task, you are to define a Feed-forward Neural Network for text classification. The network should consist of: 
* An embedding layer: the input batch is passed through the embedding layer to get word embeddings.
* Sentence/sequence representation: from the word embeddings produce one vector for the sentence. This could be done by taking the average of the word vectors.
* Hidden layer: feed the sentence vector through a linear layer to produce the output vector.
* Output layer: apply a softmax function on the output vector to produce probabilities for the labels.

You will also need a loss function and an optimizer.

Hints: Each batch, text, is a tensor of size `[sentence length, batch size]`. Look at the members of `torch.nn`.
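
The four steps listed above can be traced on a toy batch to check the tensor shapes at each stage. This is a minimal sketch with made-up sizes and random token indices, not the model to be submitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sizes, assumed for illustration only.
vocab_size, embedding_dim, num_classes = 50, 8, 4
sentence_len, batch_size = 6, 3

embedding = nn.Embedding(vocab_size, embedding_dim)
linear = nn.Linear(embedding_dim, num_classes)

text = torch.randint(0, vocab_size, (sentence_len, batch_size))  # [sentence length, batch size]
embedded = embedding(text)                # [sentence length, batch size, embedding dim]
sentence_vectors = embedded.mean(dim=0)   # [batch size, embedding dim]
logits = linear(sentence_vectors)         # [batch size, num classes]
log_probs = F.log_softmax(logits, dim=1)  # log-probabilities per class

print(embedded.shape, sentence_vectors.shape, log_probs.shape)
print(log_probs.exp().sum(dim=1))  # each row sums to 1, up to rounding
```

Averaging over dimension 0 is what turns a variable-length sentence into a fixed-size vector; note that in this simple form padding positions are averaged in as well.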



_(10 points)_

In [12]:
import torch.nn as nn
import torch.nn.functional as F

# defines the architecture of the FNN
class FNNClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, output_dim):
        # defines the layers of the network.
        super(FNNClassifier, self).__init__()

        self.embeddings = nn.Embedding(input_dim, embedding_dim)
        self.hidden = nn.Linear(embedding_dim, output_dim)

    def forward(self, text):
        # text: [sentence length, batch size]
        # average the word embeddings over the sentence dimension
        sentence_vectors = torch.mean(self.embeddings(text), 0)
        hidden = self.hidden(sentence_vectors)
        # returns log-probabilities over the classes (log-softmax), as expected by nn.NLLLoss
        log_probs = F.log_softmax(hidden, dim=1)
        return log_probs


In [13]:
INPUT_DIM = len(TEXT.vocab)

# how many dimensions should the word embeddings have
EMBEDDING_DIM = 100
# how many classes should the classifier put the texts into.
OUTPUT_DIM = 4

# the training model we will use is now the FNN.
model = FNNClassifier(INPUT_DIM, EMBEDDING_DIM, OUTPUT_DIM)

Before training, we have to select the optimizer (together with its learning rate) and the loss function for our FNN.

In [14]:
import torch.optim as optim

# Define the learning rate optimizer, you can experiment with various optimizers: https://pytorch.org/docs/stable/optim.html
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

# Send the tensors to GPU if available
model = model.to(device)
criterion = criterion.to(device)
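
`nn.NLLLoss` expects log-probabilities, which is why the model ends with a log-softmax. As a small sanity check on toy values (random scores, assumed for illustration only), the combination of `log_softmax` and `NLLLoss` gives the same value as `nn.CrossEntropyLoss` applied to the raw scores:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(5, 4)               # toy scores: 5 examples, 4 classes
targets = torch.tensor([0, 3, 1, 2, 2])  # toy class labels

nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
ce = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(nll, ce))  # True
```

If a model returned raw scores instead of log-probabilities, `nn.CrossEntropyLoss` would therefore be the matching criterion.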

In [15]:
from sklearn.metrics import classification_report


def class_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    rounded_preds = preds.argmax(1)
    correct = (rounded_preds == y).float()  # convert into float for division
    acc = correct.sum() / len(correct)
    return acc

### TASK 2.1 Implement the train and the evaluate functions.

Now it is time to `train` the network and then apply it to the test set by implementing an `evaluate` function.

The `train` function must:
- iterate through the dataset with the given iterator
- get the output from the model
- calculate the loss and the accuracy
- propagate the loss backward and update the parameters with the optimizer
- calculate the epoch loss

_(7 points)_
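
The order of the steps above can be sketched on a toy problem. The data, the linear model and the `toy_*` names are assumptions made for illustration; the loop only demonstrates the sequence of zeroing gradients, forward pass, loss, backward pass and parameter update.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data and model, assumed for illustration only.
inputs = torch.randn(32, 10)
labels = torch.randint(0, 4, (32,))
toy_model = nn.Linear(10, 4)
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

losses = []
for step in range(20):
    toy_optimizer.zero_grad()                        # clear gradients from the previous step
    loss = toy_criterion(toy_model(inputs), labels)  # forward pass and loss
    loss.backward()                                  # backpropagate the loss
    toy_optimizer.step()                             # update the parameters
    losses.append(loss.item())                       # .item() stores a plain Python float

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

Storing `loss.item()` rather than the loss tensor itself keeps the running total free of autograd history.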

In [16]:
from sklearn.metrics import accuracy_score
import torch.nn.functional as F


def train(model, iterator, optimizer, criterion):
    model.train()
    
    epoch_loss = 0.0
    epoch_acc = 0.0

    for batch in iterator:
        # reset the gradients accumulated in the previous step
        optimizer.zero_grad()

        outputs = model(batch.text)
        loss = criterion(outputs, batch.label)
        loss.backward()
        optimizer.step()

        # .item() extracts plain Python floats, so no autograd history is kept
        epoch_loss += loss.item()
        epoch_acc += class_accuracy(outputs, batch.label).item()
    
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The `evaluate` function must:
- set the model to `.eval()` mode
- iterate over the iterator inside a `with torch.no_grad():` block
- calculate the predictions and the loss on the given (validation or test) dataset
- calculate the epoch loss

_(7 points)_
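
The effect of `.eval()` and `torch.no_grad()` can be observed on a toy module. The dropout layer below is an assumption added purely to make the difference visible; the models in this exercise do not necessarily contain one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy module with dropout, assumed for illustration only.
toy_model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(2, 4)

toy_model.eval()       # dropout becomes a no-op in evaluation mode
with torch.no_grad():  # no computation graph is recorded
    out_a = toy_model(x)
    out_b = toy_model(x)

print(out_a.requires_grad)        # False
print(torch.equal(out_a, out_b))  # True
```

Without the graph, evaluation uses less memory, and in evaluation mode repeated passes over the same input give identical outputs.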

In [17]:
def evaluate(model, iterator, criterion):
    model.eval()

    epoch_loss = 0.0
    epoch_acc = 0.0

    # no gradients are needed for evaluation
    with torch.no_grad():
        for batch in iterator:
            outputs = model(batch.text)

            # accumulate plain Python floats
            epoch_loss += criterion(outputs, batch.label).item()
            epoch_acc += class_accuracy(outputs, batch.label).item()
    
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [18]:
import time

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We will now train the model using the `FNNClassifier` and the `train` and `evaluate` functions you have just implemented. Experiment with different setups as well: try changing the number of epochs, the dimensions and the optimizer. Summarize your experience in the next cell.

In [None]:
N_EPOCHS = 15

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'\nTest Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

### Write your findings

What are your findings while experimenting with different network settings, different loss functions, splits, etc.?

_(6 points)_

WRITE YOUR ANSWER HERE!

The FNN model achieves a test set loss of 0.410 and 90.17% accuracy with the default parameters (15 epochs, Adam optimizer, NLLLoss). The validation set loss started to increase after epoch 7 while the training set loss kept falling, which points to the model overfitting.

If we lower the number of epochs to 7, the test set loss and accuracy are 0.272 and 91.53%, respectively.

Applying F.relu to the output of the linear layer did not seem to have much effect, while sigmoid worsened the results.

Changing the optimizer to SGD with lr=1e-3 really worsened the results (loss above 1, accuracy around 45% through 15 epochs). With lr=1e-1 the results were better (0.7 and 73.81% test loss and accuracy through 15 epochs).
RMSprop with its default parameters seemed promising, but the validation set loss kept increasing from the start. With lr=1e-3, a test set loss of 0.279 and accuracy of 90.53% are achieved at epoch 5, comparable to Adam.
Adamax with default parameters performed similarly to Adam in the end, but it took 15 epochs to achieve what Adam had in 7.

If we use a 90/10 train/validation split instead of 70/30, with the other model parameters left at their defaults, a low test set loss and high accuracy (0.258, 91.91%) seem to be achieved earlier, around epoch 5.
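
The overfitting pattern noted in these findings (validation loss rising while training loss keeps falling) is often handled by stopping once the validation loss has not improved for a few epochs. The sketch below is one possible way to do this, not part of the exercise: `best_epoch_with_patience` is a hypothetical helper and the loss values are made up, not results from this notebook.

```python
def best_epoch_with_patience(valid_losses, patience=2):
    # Returns (epoch with the lowest loss so far, epoch at which training stops),
    # both 1-based. Training stops after `patience` epochs without improvement.
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(valid_losses)


# Made-up validation losses, not results from this notebook.
toy_losses = [0.60, 0.45, 0.38, 0.35, 0.36, 0.39, 0.43]
print(best_epoch_with_patience(toy_losses))  # (4, 6)
```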

## TASK 3. Change the Feed Forward network to LSTM. 
### (total per task 3: 20 points)

To use an LSTM instead of a FNN for the classification task, you will need to change your `model`. 

* Add a `hidden_dim` parameter. The LSTM layer will produce a sentence vector from the word vectors. This parameter sets the size of the LSTM layer's output.
* Apply a linear layer to transform the feature vector from the LSTM output into class scores (a softmax is still required to turn them into probabilities).
* Experiment with different layer sizes.
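
To make the role of `hidden_dim` concrete, the sketch below runs a toy batch of random "embeddings" through `nn.LSTM` and inspects the shapes. The sizes are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, assumed for illustration only.
sentence_len, batch_size, embedding_dim, hidden_dim = 6, 3, 8, 5

lstm = nn.LSTM(embedding_dim, hidden_dim)
embedded = torch.randn(sentence_len, batch_size, embedding_dim)

lstm_out, (h_n, c_n) = lstm(embedded)
print(lstm_out.shape)  # [sentence length, batch size, hidden dim]
print(h_n.shape)       # [num layers, batch size, hidden dim]

# For a single-layer, unidirectional LSTM the last time step of the output
# equals the final hidden state, so either can serve as the sentence vector.
print(torch.allclose(lstm_out[-1], h_n[-1]))  # True
```

Note that with padded batches the last time step of shorter sentences corresponds to padding tokens, which is one more reason to batch sentences of similar length together.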


_(15 points)_

In [19]:
class LSTMClassifier(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super(LSTMClassifier, self).__init__()

        self.embeddings = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.linear = nn.Linear(hidden_dim, output_dim)

    def forward(self, text):
        embeddings = self.embeddings(text)
        lstm_out, _ = self.lstm(embeddings)
        linear_out = self.linear(lstm_out[-1])
        log_probs = F.log_softmax(linear_out, dim=1)
        return log_probs

In [20]:
INPUT_DIM = len(TEXT.vocab)

EMBEDDING_DIM = 100
HIDDEN_DIM = 100
OUTPUT_DIM = 4

model = LSTMClassifier(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

Reuse the training and the evaluation functions from before!

In [21]:
# Define the learning rate optimizer, you can experiment with various optimizers: https://pytorch.org/docs/stable/optim.html
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

# Send the tensors to GPU if available
model = model.to(device)
criterion = criterion.to(device)

In [None]:
N_EPOCHS = 15

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

test_loss, test_acc = evaluate(model, test_iterator, criterion)
print(f'\nTest Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

### Write your findings here

How do the results of the LSTM and FNN compare? Timings, accuracy, etc?

_(5 points)_

LSTMClassifier took longer to train than FNN. An FNN epoch on Google Colab using GPUs took 7-8s, while an LSTM epoch took 12s.

15 epochs also caused LSTM to overfit -- the lowest validation set loss seems to be achieved around epoch 5. Even then, however, this model actually performed worse than the simple FNNClassifier. The 5-epoch test set loss was 0.315 and the accuracy was 89.76%, compared to 0.272 and 91.53% for the default FNN model trained for 7 epochs.

If we set LSTM's num_layers parameter to 2 (i.e. if we stack 2 LSTMs together), the best performance seems to happen at 3 epochs -- test set loss of 0.309 and 89.72% accuracy.