<a href="https://colab.research.google.com/gist/adaamko/0161526d638e1877f7b649b3fff8f3de/deep-learning-practical-lesson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing and Information Extraction
## Deep learning - practical session

__Nov 12, 2021__

__√Åd√°m Kov√°cs__


During this lecture we are going to use a classification dataset from a shared task: SemEval 2019 - Task 6. 
The dataset is about Identifying and Categorizing Offensive Language in Social Media.
__Preparation:__
- You will need the Semeval dataset (we will have code to download it)
- You will need to install pytorch:
    - pip install torch 
- You will also need to have pandas, torchtext, numpy and scikit learn installed.

We are going to use an open source library for building optimized deep learning models that can be run on GPUs, the library is called [Pytorch](https://pytorch.org/docs/stable/index.html). It is one of the most widely used libraries for building neural networks/deep learning models.

In this lecture we are mostly using pure PyTorch models, but there are multiple libraries available to make it even easier to build neural networks. You are free to use them in your projects.
Just to name a few:
- TorchText: https://pytorch.org/text/stable/index.html
- AllenNLP: https://github.com/allenai/allennlp

__NOTE: It is advised to use Google Colab for this laboratory for free access to GPUs, and also for reproducibility.__

In [None]:
!pip install torch

In [1]:
# Import the needed libraries
import pandas as pd
import numpy as np

## Download the dataset and load it into a pandas DataFrame

In [5]:
import os

if not os.path.isdir("./data"):
    os.mkdir("./data")

import urllib.request

u = urllib.request.URLopener()
u.retrieve(
    "https://raw.githubusercontent.com/ZeyadZanaty/offenseval/master/datasets/training-v1/offenseval-training-v1.tsv",
    "data/offenseval.tsv",
)

('data/offenseval.tsv', <http.client.HTTPMessage at 0x7f1a58cfff50>)

## Read in the dataset into a Pandas DataFrame
Use `pd.read_csv` with the correct parameters to read in the dataset. If done correctly, `DataFrame` should have 5 columns, 
`id`, `tweet`, `subtask_a`, `subtask_b`, `subtask_c`.

In [6]:
import pandas as pd
import numpy as np

In [7]:
def read_dataset():
    train_data = pd.read_csv("./data/offenseval.tsv", sep="\t")
    return train_data

In [8]:
train_data_unprocessed = read_dataset()

train_data_unprocessed

Unnamed: 0,id,tweet,subtask_a,subtask_b,subtask_c
0,86426,@USER She should ask a few native Americans wh...,OFF,UNT,
1,90194,@USER @USER Go home you‚Äôre drunk!!! @USER #MAG...,OFF,TIN,IND
2,16820,Amazon is investigating Chinese employees who ...,NOT,,
3,62688,"@USER Someone should'veTaken"" this piece of sh...",OFF,UNT,
4,43605,@USER @USER Obama wanted liberals &amp; illega...,NOT,,
...,...,...,...,...,...
13235,95338,@USER Sometimes I get strong vibes from people...,OFF,TIN,IND
13236,67210,Benidorm ‚úÖ Creamfields ‚úÖ Maga ‚úÖ Not too sh...,NOT,,
13237,82921,@USER And why report this garbage. We don't g...,OFF,TIN,OTH
13238,27429,@USER Pussy,OFF,UNT,


## Convert `subtask_a` into a binary label
The task is to classify the given tweets into two category: _offensive(OFF)_ , _not offensive (NOT)_. For machine learning algorithms you will need integer labels instead of strings. Add a new column to the dataframe called `label`, and transform the `subtask_a` column into a binary integer label.

In [9]:
def transform(train_data):
    labels = {"NOT": 0, "OFF": 1}

    train_data["label"] = [labels[item] for item in train_data.subtask_a]
    train_data["tweet"] = train_data["tweet"].str.replace("@USER", "")

    return train_data

In [10]:
train_data = transform(train_data_unprocessed)

## Train a simple neural network on this dataset

In this notebook we are going to build different neural architectures on the task:
- A simple one layered feed forward neural network (FNN) with one-hot encoded vectors
- Adding more layers to the FNN, making it a deep neural network
- Instead of using one-hot encoded vectors we are going to add embedding vectors to the architecture, that takes the sequential nature of natural texts into account
- Then we will train LSTM networks
- At last, we will also build a Transformer architecture, that currently achieves SOTA results on a lot of tasks

First we will build one-hot-encoded vectors for each sentence, and then use a simple feed forward neural network to predict the correct labels.

In [11]:
# First we need to import pytorch and set a fixed random seed number for reproducibility
import torch

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Split the dataset into a train and a validation dataset
Use the random seed for splitting. You should split the dataset into 70% training data and 30% validation data

In [12]:
from sklearn.model_selection import train_test_split as split


def split_data(train_data, random_seed):
    tr_data, val_data = split(train_data, test_size=0.3, random_state=SEED)
    return tr_data, val_data

In [13]:
tr_data, val_data = split_data(train_data, SEED)

### Use CountVectorizer to prepare the features for the sentences
_CountVectorizer_ is a great tool from _sklearn_ that helps us with basic preprocessing steps. It has lots of parameters to play with, you can check the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). It will:
- Tokenize, lowercase the text
- Filter out stopwords
- Convert the text into one-hot encoded vectors
- Select the _n_-best features

We fit CountVectorizer using _3000_ features

We will also _lemmatize_ texts using the _nltk_ package and its lemmatizer. Check the [docs](https://www.nltk.org/_modules/nltk/stem/wordnet.html) for more.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

import nltk

nltk.download("punkt")
nltk.download("wordnet")

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize


class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]


def prepare_vectorizer(tr_data):
    vectorizer = CountVectorizer(
        max_features=3000, tokenizer=LemmaTokenizer(), stop_words="english"
    )

    word_to_ix = vectorizer.fit(tr_data.tweet)

    return word_to_ix

[nltk_data] Downloading package punkt to /home/adaamko/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/adaamko/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [15]:
word_to_ix = prepare_vectorizer(tr_data)
# The vocab size is the length of the vocabulary, or the length of the feature vectors
VOCAB_SIZE = len(word_to_ix.vocabulary_)
assert VOCAB_SIZE == 3000

  'stop_words.' % sorted(inconsistent))


CountVectorizer can directly transform any sentence into a one-hot encoded vector based on the corpus it was built upon.

![onehot](https://miro.medium.com/max/886/1*_da_YknoUuryRheNS-SYWQ.png)

In [13]:
word_to_ix.transform(["Hello my name is adam"]).toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [16]:
# Initialize the correct device
# It is important that every array should be on the same device or the training won't work
# A device could be either the cpu or the gpu if it is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Prepare the DataLoader for batch processing

The __prepare_dataloader(..)__ function will take the training and the validation dataset and convert them to one-hot encoded vectors with the help of the initialized CountVectorizer.

We prepare two FloatTensors and LongTensors for the converted tweets and labels of the training and the validation data.

Then zip together the vectors with the labels as a list of tuples!

In [17]:
# Preparing the data loaders for the training and the validation sets
# PyTorch operates on it's own datatype which is very similar to numpy's arrays
# They are called Torch Tensors: https://pytorch.org/docs/stable/tensors.html
# They are optimized for training neural networks
def prepare_dataloader(tr_data, val_data, word_to_ix):
    # First we transform the tweets into one-hot encoded vectors
    # Then we create Torch Tensors from the list of the vectors
    # It is also inportant to send the Tensors to the correct device
    # All of the tensors should be on the same device when training
    tr_data_vecs = torch.FloatTensor(word_to_ix.transform(tr_data.tweet).toarray()).to(
        device
    )
    tr_labels = torch.LongTensor(tr_data.label.tolist()).to(device)

    val_data_vecs = torch.FloatTensor(
        word_to_ix.transform(val_data.tweet).toarray()
    ).to(device)
    val_labels = torch.LongTensor(val_data.label.tolist()).to(device)

    tr_data_loader = [(sample, label) for sample, label in zip(tr_data_vecs, tr_labels)]
    val_data_loader = [
        (sample, label) for sample, label in zip(val_data_vecs, val_labels)
    ]

    return tr_data_loader, val_data_loader

In [18]:
tr_data_loader, val_data_loader = prepare_dataloader(tr_data, val_data, word_to_ix)

- __We have the correct lists now, it is time to initialize the DataLoader objects!__
- __Create two DataLoader objects with the lists we have created__
- __Shuffle the training data but not the validation data!__

In [19]:
# We then define a BATCH_SIZE for our model
# Usually we don't feed the whole dataset into our model at once
# For this we have the BATCH_SIZE parameter
# Try to experiment with different sized batches and see if changing this will improve the performance or not!
BATCH_SIZE = 64

In [22]:
from torch.utils.data import DataLoader

# The DataLoader(https://pytorch.org/docs/stable/data.html) class helps us to prepare the training batches
# It has a lot of useful parameters, one of it is _shuffle_ which will randomize the training dataset in each epoch
# This can also improve the performance of our model
def create_dataloader_iterators(tr_data_loader, val_data_loader, BATCH_SIZE):
    train_iterator = DataLoader(
        tr_data_loader,
        batch_size=BATCH_SIZE,
        shuffle=True,
    )

    valid_iterator = DataLoader(
        val_data_loader,
        batch_size=BATCH_SIZE,
        shuffle=False,
    )

    return train_iterator, valid_iterator

In [23]:
train_iterator, valid_iterator = create_dataloader_iterators(
    tr_data_loader, val_data_loader, BATCH_SIZE
)
assert type(train_iterator) == torch.utils.data.dataloader.DataLoader

### Building the first PyTorch model
At first, the model will contain a single Linear layer that takes one-hot-encoded vectors and trainsforms it into the dimension of the __NUM_LABELS__(how many classes we are trying to predict). Then, run through the output on a softmax activation to produce probabilites of the classes!

In [34]:
from torch import nn


class BoWClassifier(nn.Module):  # inheriting from nn.Module!
    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Dont get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.
        # Torch defines nn.Linear(), which provides the affine map.
        # Note that we could add more Linear Layers here connected to each other
        # Then we would also need to have a HIDDEN_SIZE hyperparameter as an input to our model
        # Then, with activation functions between them (e.g. RELU) we could have a "Deep" model
        # This is just an example for a shallow network
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec, sequence_lens):
        # Ignore sequence_lens for now!
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        # Softmax will provide a probability distribution among the classes
        # We can then use this for our loss function
        return F.log_softmax(self.linear(bow_vec), dim=1)

In [35]:
# The INPUT_DIM is the size of our input vectors
INPUT_DIM = VOCAB_SIZE
# We have only 2 classes
OUTPUT_DIM = 2

In [36]:
# Init the model
# At first it is untrained, the weights are assigned random
model = BoWClassifier(OUTPUT_DIM, INPUT_DIM)

In [37]:
# Set the optimizer and the loss function!
# https://pytorch.org/docs/stable/optim.html
import torch.optim as optim

# The optimizer will update the weights of our model based on the loss function
# This is essential for correct training
# The _lr_ parameter is the learning rate
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()

In [38]:
# Copy the model and the loss function to the correct device
model = model.to(device)
criterion = criterion.to(device)

In [39]:
assert model.linear.out_features == 2

### Training and evaluating PyTorch models
- __calculate_performance__: This should calculate the batch-wise precision, recall, and fscore of your model!
- __train__ - Train your model on the training data! This function should set the model to training mode, then use the given iterator to iterate through the training samples and make predictions using the provided model. You should then propagate back the error with the loss function and the optimizer. Finally return the average epoch loss and performance!
- __evaluate__ - Evaluate your model on the validation dataset. This function is essentially the same as the trainnig function, but you should set your model to eval mode and don't propagate back the errors to your weights!

In [40]:
from sklearn.metrics import precision_recall_fscore_support


def calculate_performance(preds, y):
    """
    Returns precision, recall, fscore per batch
    """
    # Get the predicted label from the probabilities
    rounded_preds = preds.argmax(1)

    # Calculate the correct predictions batch-wise and calculate precision, recall, and fscore
    # WARNING: Tensors here could be on the GPU, so make sure to copy everything to CPU
    precision, recall, fscore, support = precision_recall_fscore_support(
        rounded_preds.cpu(), y.cpu()
    )

    return precision[1], recall[1], fscore[1]

In [41]:
import torch.nn.functional as F


def train(model, iterator, optimizer, criterion):
    # We will calculate loss and accuracy epoch-wise based on average batch accuracy
    epoch_loss = 0
    epoch_prec = 0
    epoch_recall = 0
    epoch_fscore = 0

    # You always need to set your model to training mode
    # If you don't set your model to training mode the error won't propagate back to the weights
    model.train()

    # We calculate the error on batches so the iterator will return matrices with shape [BATCH_SIZE, VOCAB_SIZE]
    for batch in iterator:
        text_vecs = batch[0]
        labels = batch[1]
        sen_lens = []
        texts = []

        # This is for later!
        if len(batch) > 2:
            sen_lens = batch[2]
            texts = batch[3]

        # We reset the gradients from the last step, so the loss will be calculated correctly (and not added together)
        optimizer.zero_grad()

        # This runs the forward function on your model (you don't need to call it directly)
        predictions = model(text_vecs, sen_lens)

        # Calculate the loss and the accuracy on the predictions (the predictions are log probabilities, remember!)
        loss = criterion(predictions, labels)

        prec, recall, fscore = calculate_performance(predictions, labels)

        # Propagate the error back on the model (this means changing the initial weights in your model)
        # Calculate gradients on parameters that requries grad
        loss.backward()
        # Update the parameters
        optimizer.step()

        # We add batch-wise loss to the epoch-wise loss
        epoch_loss += loss.item()
        # We also do the same with the scores
        epoch_prec += prec.item()
        epoch_recall += recall.item()
        epoch_fscore += fscore.item()
    return (
        epoch_loss / len(iterator),
        epoch_prec / len(iterator),
        epoch_recall / len(iterator),
        epoch_fscore / len(iterator),
    )

In [42]:
# The evaluation is done on the validation dataset
def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_prec = 0
    epoch_recall = 0
    epoch_fscore = 0
    # On the validation dataset we don't want training so we need to set the model on evaluation mode
    model.eval()

    # Also tell Pytorch to not propagate any error backwards in the model or calculate gradients
    # This is needed when you only want to make predictions and use your model in inference mode!
    with torch.no_grad():

        # The remaining part is the same with the difference of not using the optimizer to backpropagation
        for batch in iterator:
            text_vecs = batch[0]
            labels = batch[1]
            sen_lens = []
            texts = []

            if len(batch) > 2:
                sen_lens = batch[2]
                texts = batch[3]

            predictions = model(text_vecs, sen_lens)
            loss = criterion(predictions, labels)

            prec, recall, fscore = calculate_performance(predictions, labels)

            epoch_loss += loss.item()
            epoch_prec += prec.item()
            epoch_recall += recall.item()
            epoch_fscore += fscore.item()

    # Return averaged loss on the whole epoch!
    return (
        epoch_loss / len(iterator),
        epoch_prec / len(iterator),
        epoch_recall / len(iterator),
        epoch_fscore / len(iterator),
    )

In [43]:
import time

# This is just for measuring training time!
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Training loop!
Below is the training loop of our model! Try to set an EPOCH number that will correctly train your model :) (it is not underfitted but neither overfitted!

In [44]:
def training_loop(epoch_number=15):
    # Set an EPOCH number!
    N_EPOCHS = epoch_number

    best_valid_loss = float("inf")

    # We loop forward on the epoch number
    for epoch in range(N_EPOCHS):

        start_time = time.time()

        # Train the model on the training set using the dataloader
        train_loss, train_prec, train_rec, train_fscore = train(
            model, train_iterator, optimizer, criterion
        )
        # And validate your model on the validation set
        valid_loss, valid_prec, valid_rec, valid_fscore = evaluate(
            model, valid_iterator, criterion
        )

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        # If we find a better model, we save the weights so later we may want to reload it
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), "tut1-model.pt")

        print(f"Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s")
        print(
            f"\tTrain Loss: {train_loss:.3f} | Train Prec: {train_prec*100:.2f}% | Train Rec: {train_rec*100:.2f}% | Train Fscore: {train_fscore*100:.2f}%"
        )
        print(
            f"\t Val. Loss: {valid_loss:.3f} |  Val Prec: {valid_prec*100:.2f}% | Val Rec: {valid_rec*100:.2f}% | Val Fscore: {valid_fscore*100:.2f}%"
        )

In [45]:
training_loop()

  _warn_prf(average, modifier, msg_start, len(result))


Epoch: 01 | Epoch Time: 0m 0s
	Train Loss: 0.650 | Train Prec: 4.12% | Train Rec: 36.46% | Train Fscore: 6.40%
	 Val. Loss: 0.628 |  Val Prec: 3.95% | Val Rec: 57.54% | Val Fscore: 7.26%
Epoch: 02 | Epoch Time: 0m 0s
	Train Loss: 0.599 | Train Prec: 9.34% | Train Rec: 81.77% | Train Fscore: 16.34%
	 Val. Loss: 0.604 |  Val Prec: 11.87% | Val Rec: 82.45% | Val Fscore: 20.29%
Epoch: 03 | Epoch Time: 0m 0s
	Train Loss: 0.566 | Train Prec: 19.76% | Train Rec: 92.99% | Train Fscore: 31.78%
	 Val. Loss: 0.586 |  Val Prec: 16.35% | Val Rec: 82.04% | Val Fscore: 26.71%
Epoch: 04 | Epoch Time: 0m 0s
	Train Loss: 0.539 | Train Prec: 27.47% | Train Rec: 91.10% | Train Fscore: 41.35%
	 Val. Loss: 0.574 |  Val Prec: 21.29% | Val Rec: 77.86% | Val Fscore: 32.89%
Epoch: 05 | Epoch Time: 0m 0s
	Train Loss: 0.518 | Train Prec: 33.10% | Train Rec: 90.33% | Train Fscore: 47.47%
	 Val. Loss: 0.564 |  Val Prec: 25.48% | Val Rec: 78.34% | Val Fscore: 37.48%
Epoch: 06 | Epoch Time: 0m 0s
	Train Loss: 0.501 |


__NOTE: DON'T FORGET TO RERUN THE MODEL INITIALIZATION WHEN YOU ARE TRYING TO RUN THE MODEL MULTIPLE TIMES. IF YOU DON'T REINITIALIZE THE MODEL IT WILL CONTINUE THE TRAINING WHERE IT HAS STOPPED LAST TIME AND DOESN'T RUN FROM SRATCH!__

These lines:

```python
model = BoWClassifier(OUTPUT_DIM, INPUT_DIM)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.NLLLoss()
model = model.to(device)
criterion = criterion.to(device)
```

This will reinitialize the model!

In [46]:
def reinitialize(model):
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.NLLLoss()
    model = model.to(device)
    criterion = criterion.to(device)

In [47]:
reinitialize(BoWClassifier(OUTPUT_DIM, INPUT_DIM))

## Add more linear layers to the model and experiment with other hyper-parameters

### More layers

Currently we only have a single linear layers in our model. We are now adding more linear layers to the model.
We also introduce a HIDDEN_SIZE parameter that will be the size of the intermediate representation between the linear layers. Also adding a RELU activation function between the linear layers.

See more:
- https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html
- https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_nn.html

In [48]:
from torch import nn


class BoWDeepClassifier(nn.Module):
    def __init__(self, num_labels, vocab_size, hidden_size):
        super(BoWDeepClassifier, self).__init__()
        # First linear layer
        self.linear1 = nn.Linear(vocab_size, hidden_size)
        # Non-linear activation function between them
        self.relu = torch.nn.ReLU()
        # Second layer
        self.linear2 = nn.Linear(hidden_size, num_labels)

    def forward(self, bow_vec, sequence_lens):
        # Run the input vector through every layer
        output = self.linear1(bow_vec)
        output = self.relu(output)
        output = self.linear2(output)

        # Get the probabilities
        return F.log_softmax(output, dim=1)

In [49]:
HIDDEN_SIZE = 200
learning_rate = 0.001
BATCH_SIZE = 64
N_EPOCHS = 15

In [50]:
model = BoWDeepClassifier(OUTPUT_DIM, INPUT_DIM, HIDDEN_SIZE)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.NLLLoss()

model = model.to(device)
criterion = criterion.to(device)

In [51]:
training_loop()

Epoch: 01 | Epoch Time: 0m 0s
	Train Loss: 0.596 | Train Prec: 20.60% | Train Rec: 49.89% | Train Fscore: 25.92%
	 Val. Loss: 0.539 |  Val Prec: 36.23% | Val Rec: 74.23% | Val Fscore: 47.90%
Epoch: 02 | Epoch Time: 0m 0s
	Train Loss: 0.429 | Train Prec: 57.83% | Train Rec: 79.56% | Train Fscore: 66.12%
	 Val. Loss: 0.552 |  Val Prec: 48.22% | Val Rec: 69.31% | Val Fscore: 55.94%
Epoch: 03 | Epoch Time: 0m 0s
	Train Loss: 0.322 | Train Prec: 72.06% | Train Rec: 85.52% | Train Fscore: 77.73%
	 Val. Loss: 0.597 |  Val Prec: 52.68% | Val Rec: 64.65% | Val Fscore: 57.25%
Epoch: 04 | Epoch Time: 0m 0s
	Train Loss: 0.240 | Train Prec: 82.02% | Train Rec: 90.61% | Train Fscore: 85.80%
	 Val. Loss: 0.673 |  Val Prec: 51.16% | Val Rec: 63.78% | Val Fscore: 56.00%
Epoch: 05 | Epoch Time: 0m 0s
	Train Loss: 0.175 | Train Prec: 87.18% | Train Rec: 94.32% | Train Fscore: 90.37%
	 Val. Loss: 0.775 |  Val Prec: 52.20% | Val Rec: 61.79% | Val Fscore: 55.82%
Epoch: 06 | Epoch Time: 0m 0s
	Train Loss: 0.

## Implement automatic early-stopping in the training loop
Early stopping is a very easy method to avoid the overfitting of your model.

We could:
- Save the training and the validation loss of the last two epochs (if you are atleast in the third epoch)
- If the loss increased in the last two epoch on the training data but descreased or stagnated in the validation data, you should stop the training automatically!

In [38]:
# REINITIALIZE YOUR MODEL TO GET A CORRECT RUN!

## Handling class imbalance
Our data is imbalanced, the first class has twice the population of the second class.

One way of handling imbalanced data is to weight the loss function, so it penalizes errors on the smaller class.

Look at the documentation of the loss function: https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html

Set the weights based on the inverse population of the classes (so the less sample a class has, more the errors will be penalized!)

In [52]:
tr_data.groupby("label").size()

label
0    6179
1    3089
dtype: int64

In [53]:
weights = torch.Tensor([1, 2])
criterion = nn.NLLLoss(weight=weights)

## Adding an Embedding Layer to the network

- We only used one-hot-encoded vectors as our features until now
- Now we will introduce an [embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer into our network.
- We will feed the words into our network one-by-one, and the layer will learn a dense vector representation for each word

![embeddingbag](https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png)

_from pytorch.org_

In [54]:
# Get the analyzer to get the word-id mapping from CountVectorizer
an = word_to_ix.build_analyzer()

In [55]:
an("hello my name is adam")

['hello', 'adam']

In [56]:
max(word_to_ix.vocabulary_, key=word_to_ix.vocabulary_.get)

'ü§®'

In [57]:
len(word_to_ix.vocabulary_)

3000

In [58]:
def create_input(dataset, analyzer, vocabulary):
    dataset_as_indices = []

    # We go through each tweet in the dataset
    # We need to add two additional symbols to the vocabulary
    # We have 3000 features, ranged 0-2999
    # We add 3000 as an id for the "unknown" words not among the features
    # 3001 will be the symbol for padding, but about this later!
    for tweet in dataset:
        tokens = analyzer(tweet)
        token_ids = []

        for token in tokens:
            # if the token is in the vocab, we add the id
            if token in vocabulary:
                token_ids.append(vocabulary[token])
            # else we add the id of the unknown token
            else:
                token_ids.append(3000)

        # if we removed every token during preprocessing (stopword removal, lemmatization), we add the unknown token to the list so it won't be empty
        if not token_ids:
            token_ids.append(3000)
        dataset_as_indices.append(torch.LongTensor(token_ids).to(device))

    return dataset_as_indices

In [59]:
# We add the length of the tweets so sentences with similar lengths will be next to each other
# This can be important because of padding
tr_data["length"] = tr_data.tweet.str.len()
val_data["length"] = val_data.tweet.str.len()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


In [60]:
tr_data.tweet.str.len()

1224      59
9102      14
3655      28
8201      92
6141      62
        ... 
11468     63
7221     275
1318      52
8915      50
11055    167
Name: tweet, Length: 9268, dtype: int64

In [61]:
tr_data = tr_data.sort_values(by="length")
val_data = val_data.sort_values(by="length")

In [62]:
# We create the dataset as ids of tokens
dataset_as_ids = create_input(tr_data.tweet, an, word_to_ix.vocabulary_)

In [63]:
dataset_as_ids[0]

tensor([2366], device='cuda:0')

### Padding

- We didn't need to take care of input padding when using one-hot-encoded vectors
- Padding handles different sized inputs
- We can pad sequences from the left, or from the right

![padding](https://miro.medium.com/max/1218/1*zsIXWoN0_CE9PXzmY3tIjQ.png)

_image from https://towardsdatascience.com/nlp-preparing-text-for-deep-learning-model-using-tensorflow2-461428138657_

In [64]:
from torch.nn.utils.rnn import pad_sequence

# pad_sequence will take care of the padding
# we will need to provide a padding_value to it
padded = pad_sequence(dataset_as_ids, batch_first=True, padding_value=3001)

In [65]:
def prepare_dataloader_with_padding(tr_data, val_data, word_to_ix):
    # First create the id representations of the input vectors
    # Then pad the sequences so all of the input is the same size
    # We padded texts for the whole dataset, this could have been done batch-wise also!
    tr_data_vecs = pad_sequence(
        create_input(tr_data.tweet, an, word_to_ix.vocabulary_),
        batch_first=True,
        padding_value=3001,
    )
    tr_labels = torch.LongTensor(tr_data.label.tolist()).to(device)
    tr_lens = torch.LongTensor(
        [len(i) for i in create_input(tr_data.tweet, an, word_to_ix.vocabulary_)]
    )

    # We also add the texts to the batches
    # This is for the Transformer models, you wont need this in the next experiments
    tr_sents = tr_data.tweet.tolist()

    val_data_vecs = pad_sequence(
        create_input(val_data.tweet, an, word_to_ix.vocabulary_),
        batch_first=True,
        padding_value=3001,
    )
    val_labels = torch.LongTensor(val_data.label.tolist()).to(device)
    val_lens = torch.LongTensor(
        [len(i) for i in create_input(val_data.tweet, an, word_to_ix.vocabulary_)]
    )

    val_sents = val_data.tweet.tolist()

    tr_data_loader = [
        (sample, label, length, sent)
        for sample, label, length, sent in zip(
            tr_data_vecs, tr_labels, tr_lens, tr_sents
        )
    ]
    val_data_loader = [
        (sample, label, length, sent)
        for sample, label, length, sent in zip(
            val_data_vecs, val_labels, val_lens, val_sents
        )
    ]

    return tr_data_loader, val_data_loader

In [66]:
tr_data_loader, val_data_loader = prepare_dataloader_with_padding(
    tr_data, val_data, word_to_ix
)

In [67]:
def create_dataloader_iterators_with_padding(
    tr_data_loader, val_data_loader, BATCH_SIZE
):
    train_iterator = DataLoader(
        tr_data_loader,
        batch_size=BATCH_SIZE,
        shuffle=True,
    )

    valid_iterator = DataLoader(
        val_data_loader,
        batch_size=BATCH_SIZE,
        shuffle=False,
    )

    return train_iterator, valid_iterator

In [68]:
train_iterator, valid_iterator = create_dataloader_iterators_with_padding(
    tr_data_loader, val_data_loader, BATCH_SIZE
)

In [69]:
next(iter(train_iterator))

[tensor([[ 581,   21, 2786,  ..., 3001, 3001, 3001],
         [  11, 1457, 2293,  ..., 3001, 3001, 3001],
         [ 721,  208, 3000,  ..., 3001, 3001, 3001],
         ...,
         [   1, 1635, 3000,  ..., 3001, 3001, 3001],
         [2380, 3000,   21,  ..., 3001, 3001, 3001],
         [1209, 2980, 2741,  ..., 3001, 3001, 3001]], device='cuda:0'),
 tensor([0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
         1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1], device='cuda:0'),
 tensor([ 3,  6, 14, 13, 13, 24, 29, 17, 17, 14, 30,  9, 14,  2, 31,  2, 19,  3,
         25,  2, 10,  9,  8,  5,  3, 10,  7, 27,  8, 25, 22,  2, 30, 31, 14, 20,
         12,  2,  9, 25, 16, 18, 17, 12, 23,  7,  7, 20, 10,  4, 18, 11,  9,  8,
          7,  2, 32, 27,  2, 24, 16,  4,  8, 14]),
 ('      They coming. URL',
  "  That's just what Satan would say if he were threatened.",
  'And yet I see daily Ant

![embedding](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/b4efbefa47672174394a8b6a27d4e7bc193bc224/assets/sentiment8.png)

_image from bentrevett_

In [70]:
from torch import nn
import numpy as np


class BoWClassifierWithEmbedding(nn.Module):
    def __init__(self, num_labels, vocab_size, embedding_dim):
        super(BoWClassifierWithEmbedding, self).__init__()

        # We define the embedding layer here
        # It will convert a list of ids: [1, 50, 64, 2006]
        # Into a list of vectors, one for each word
        # The embedding layer will learn the vectors from the contexts
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=3001)
        # We could also load precomputed embeddings, e.g. GloVe, in some cases we don't want to train the embedding layer
        # In this case we enable the training
        self.embedding.weight.requires_grad = True

        self.linear = nn.Linear(embedding_dim, num_labels)

    def forward(self, text, sequence_lens):
        # First we create the embedded vectors
        embedded = self.embedding(text)
        # We need a pooling to convert a list of embedded words to a sentence vector
        # We could have chosen different pooling, e.g. min, max, average..
        # With LSTM we also do a pooling, just smarter
        pooled = F.max_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)
        return F.log_softmax(self.linear(pooled), dim=1)

Output of the LSTM layer..

![lstm](https://i.stack.imgur.com/SjnTl.png)

_image from stackoverflow_

In [71]:
class LSTMClassifier(nn.Module):
    def __init__(self, num_labels, vocab_size, embedding_dim, hidden_dim):
        super(LSTMClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=3001)
        self.embedding.weight.requires_grad = True

        # Define the LSTM layer
        # Documentation: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
        self.lstm = nn.LSTM(
            embedding_dim,
            hidden_dim,
            batch_first=True,
            num_layers=1,
            bidirectional=False,
        )
        self.linear = nn.Linear(hidden_dim, num_labels)
        # Dropout to overcome overfitting
        self.dropout = nn.Dropout(0.25)

    def forward(self, text, sequence_lens):
        embedded = self.embedding(text)

        # To ensure LSTM doesn't learn gradients for the id of the padding symbol
        packed = nn.utils.rnn.pack_padded_sequence(
            embedded, sequence_lens, enforce_sorted=False, batch_first=True
        )
        packed_outputs, (h, c) = self.lstm(packed)
        # extract LSTM outputs (not used here)
        lstm_outputs, lens = nn.utils.rnn.pad_packed_sequence(
            packed_outputs, batch_first=True
        )

        # We use the last hidden vector from LSTM
        y = self.linear(h[-1])
        log_probs = F.log_softmax(y, dim=1)
        return log_probs

In [72]:
INPUT_DIM = VOCAB_SIZE + 2
OUTPUT_DIM = 2
EMBEDDING_DIM = 100
HIDDEN_DIM = 20
criterion = nn.NLLLoss()

# model = BoWClassifierWithEmbedding(OUTPUT_DIM, INPUT_DIM, EMBEDDING_DIM)
model = LSTMClassifier(OUTPUT_DIM, INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM)

In [73]:
model = model.to(device)
criterion = criterion.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [74]:
training_loop(epoch_number=15)

Epoch: 01 | Epoch Time: 0m 1s
	Train Loss: 0.640 | Train Prec: 9.36% | Train Rec: 44.51% | Train Fscore: 13.57%
	 Val. Loss: 0.628 |  Val Prec: 5.59% | Val Rec: 38.91% | Val Fscore: 9.17%
Epoch: 02 | Epoch Time: 0m 1s
	Train Loss: 0.616 | Train Prec: 8.26% | Train Rec: 62.91% | Train Fscore: 14.09%
	 Val. Loss: 0.619 |  Val Prec: 13.26% | Val Rec: 57.58% | Val Fscore: 20.16%
Epoch: 03 | Epoch Time: 0m 1s
	Train Loss: 0.577 | Train Prec: 24.45% | Train Rec: 74.80% | Train Fscore: 35.52%
	 Val. Loss: 0.600 |  Val Prec: 27.24% | Val Rec: 60.82% | Val Fscore: 36.16%
Epoch: 04 | Epoch Time: 0m 1s
	Train Loss: 0.513 | Train Prec: 47.20% | Train Rec: 74.12% | Train Fscore: 56.60%
	 Val. Loss: 0.588 |  Val Prec: 38.15% | Val Rec: 59.75% | Val Fscore: 45.68%
Epoch: 05 | Epoch Time: 0m 1s
	Train Loss: 0.446 | Train Prec: 58.27% | Train Rec: 77.58% | Train Fscore: 65.68%
	 Val. Loss: 0.593 |  Val Prec: 48.63% | Val Rec: 57.97% | Val Fscore: 52.06%
Epoch: 06 | Epoch Time: 0m 1s
	Train Loss: 0.383 

## Transformers

To completely understand the transformers architecture look at this lecture held by Judit Acs (on the course of Introduction to Python and Natural Language Technologies in BME): 
- https://github.com/bmeaut/python_nlp_2021_spring/blob/main/lectures/09_Transformers_BERT.ipynb

Here I will only include and present the necessary details _from the lecture_ about transformers and BERT.

### Problems with recurrent neural networks:

Recall that we used recurrent neural cells, specifically LSTMs to encode a list of vectors into a sentence vector.

- Problem 1. No parallelism

        - LSTMs are recurrent, they rely on their left and right history, so the symbols need to be processed in order -> no parallelism.

- Problem 2. Long-range dependencies

        - Long-range dependencies are not infrequent in NLP.

        - "The people/person who called and wanted to rent your house when you go away next year are/is from California" -- Miller & Chomsky 1963

        - LSTMs have a problem capturing these because there are too many backpropagation steps between the symbols.

Introduced in [Attention Is All You Need](https://papers.nips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf) by Vaswani et al., 2017

Transformers solve Problem 1 by relying purely on attention instead of recurrence.

Not having recurrent connections means that sequence position no longer matters.

Recurrence is replaced by self attention.

- Transformers are available in the __transformers__ Python package: https://github.com/huggingface/transformers.
- There are thousands of pretrained transformers models in different languages and with different architectures. 
- With the huggingface package there is a unified interface to download and use all the models. Browse https://huggingface.co/models for more!
- There is also a great blog post to understand the architecture of transformers: https://jalammar.github.io/illustrated-transformer/

### BERT

[BERT](https://www.aclweb.org/anthology/N19-1423/): Pre-training of Deep Bidirectional Transformers for Language Understanding by Devlin et al. 2018, 17500 citations

[BERTology](https://huggingface.co/transformers/bertology.html) is the nickname for the growing amount of BERT-related research.

BERT trains a transformer model on two tasks:

- Masked language model:

    - 15% of the tokenswordpieces are selected at the beginning.
    - 80% of those are replaced with [MASK],
    - 10% are replaced with a random token,
    - 10% are kept intact.

- Next sentence prediction:
    - Are sentences A and B consecutive sentences?
    - Generate 50-50%.
    - Binary classification task.
    

### Training, Finetuning BERT

- BERT models are (masked-)language models that were usually trained on large corporas.

- e.g. BERT base model was trained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia.

#### Finetuning

- Get a trained BERT model.
- Add a small classification layer on top (typically a 2-layer MLP).
- Train BERT along with the classification layer on an annotated dataset.
- Much smaller than the data BERT was trained on
- Another option: freeze BERT and train the classification layer only.
    - Easier training regime.
    - Smaller memory footprint.
    - Worse performance.

<img src="https://production-media.paperswithcode.com/methods/new_BERT_Overall.jpg" alt="finetune" width="800px"/>

In [None]:
!pip install transformers

### WordPiece tokenizer
- BERT has its own tokenizer
- All inputs must be tokenized with BERT 
- You don't need to remove stopwords, lemmatize, preprocess the input for BERT

- It is a middle ground between word and character tokenization.

- Static vocabulary:
    - Special tokens: [CLS], [SEP], [MASK], [UNK]
    - It tokenizes everything, falling back to characters and [UNK] if necessary

In [75]:
from transformers import BertTokenizer

In [76]:
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [77]:
print(type(tokenizer))
print(len(tokenizer.get_vocab()))

tokenizer.tokenize("My shihtzu's name is Maszat.")

<class 'transformers.models.bert.tokenization_bert.BertTokenizer'>
30522


['my',
 'shi',
 '##ht',
 '##zu',
 "'",
 's',
 'name',
 'is',
 'mas',
 '##za',
 '##t',
 '.']

In [78]:
tokenizer("There are black cats and black dogs.", "Another sentence.")

{'input_ids': [101, 2045, 2024, 2304, 8870, 1998, 2304, 6077, 1012, 102, 2178, 6251, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Train a BertForSequenceClassification model on the dataset

In [82]:
from transformers import BertForSequenceClassification

__BertForSequenceClassification__ is a helper class to train transformer-based BERT models. It puts a classification layer on top of a pretrained model.

Read more in the documentation: https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification

In [83]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
_ = model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [84]:
# We only want to finetune the classification layer on top of BERT
for p in model.base_model.parameters():
    p.requires_grad = False

In [85]:
params = list(model.named_parameters())

print(f"The BERT model has {len(params)} different named parameters.")

print("==== Embedding Layer ====\n")

for p in params[0:5]:
    print(f"{p[0]} {str(tuple(p[1].size()))}")

print("\n==== First Transformer ====\n")

for p in params[5:21]:
    print(f"{p[0]} {str(tuple(p[1].size()))}")

print("\n==== Output Layer ====\n")

for p in params[-4:]:
    print(f"{p[0]} {str(tuple(p[1].size()))}")

The BERT model has 201 different named parameters.
==== Embedding Layer ====

bert.embeddings.word_embeddings.weight (30522, 768)
bert.embeddings.position_embeddings.weight (512, 768)
bert.embeddings.token_type_embeddings.weight (2, 768)
bert.embeddings.LayerNorm.weight (768,)
bert.embeddings.LayerNorm.bias (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight (768, 768)
bert.encoder.layer.0.attention.self.query.bias (768,)
bert.encoder.layer.0.attention.self.key.weight (768, 768)
bert.encoder.layer.0.attention.self.key.bias (768,)
bert.encoder.layer.0.attention.self.value.weight (768, 768)
bert.encoder.layer.0.attention.self.value.bias (768,)
bert.encoder.layer.0.attention.output.dense.weight (768, 768)
bert.encoder.layer.0.attention.output.dense.bias (768,)
bert.encoder.layer.0.attention.output.LayerNorm.weight (768,)
bert.encoder.layer.0.attention.output.LayerNorm.bias (768,)
bert.encoder.layer.0.intermediate.dense.weight (3072, 768)
bert.encoder.laye

In [86]:
N_EPOCHS = 5
optimizer = optim.Adam(model.parameters())

In [87]:
tr_data_loader, val_data_loader = prepare_dataloader_with_padding(
    tr_data, val_data, word_to_ix
)

train_iterator, valid_iterator = create_dataloader_iterators_with_padding(
    tr_data_loader, val_data_loader, BATCH_SIZE
)

In [88]:
for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_epoch_loss = 0
    train_epoch_prec = 0
    train_epoch_recall = 0
    train_epoch_fscore = 0
    model.train()

    # We use our own iterator but now use the raw texts instead of the ID tokens
    for train_batch in train_iterator:
        labels = train_batch[1]
        texts = train_batch[3]

        optimizer.zero_grad()

        # We use BERT's own tokenizer on raw texts
        # Check the documentation: https://huggingface.co/transformers/main_classes/tokenizer.html
        encoded = tokenizer(
            texts,
            truncation=True,
            max_length=128,
            padding=True,
            return_tensors="pt",
        )

        # BERT converts texts into IDs of its own vocabulary
        input_ids = encoded["input_ids"].to(device)
        # Mask to avoid performing attention on padding token indices.
        attention_mask = encoded["attention_mask"].to(device)

        # Run the model
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)

        loss = outputs[0]
        predictions = outputs[1]
        prec, recall, fscore = calculate_performance(predictions, labels)

        loss.backward()
        optimizer.step()

        train_epoch_loss += loss.item()
        train_epoch_prec += prec.item()
        train_epoch_recall += recall.item()
        train_epoch_fscore += fscore.item()

    train_loss = train_epoch_loss / len(train_iterator)
    train_prec = train_epoch_prec / len(train_iterator)
    train_rec = train_epoch_recall / len(train_iterator)
    train_fscore = train_epoch_fscore / len(train_iterator)

    # And validate your model on the validation set
    valid_epoch_loss = 0
    valid_epoch_prec = 0
    valid_epoch_recall = 0
    valid_epoch_fscore = 0
    model.eval()

    with torch.no_grad():
        for valid_batch in valid_iterator:
            labels = valid_batch[1]
            texts = valid_batch[3]

            encoded = tokenizer(
                texts,
                truncation=True,
                max_length=128,
                padding=True,
                return_tensors="pt",
            )
            input_ids = encoded["input_ids"].to(device)
            attention_mask = encoded["attention_mask"].to(device)

            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs[0]
            predictions = outputs[1]
            prec, recall, fscore = calculate_performance(predictions, labels)

            # We add batch-wise loss to the epoch-wise loss
            valid_epoch_loss += loss.item()
            valid_epoch_prec += prec.item()
            valid_epoch_recall += recall.item()
            valid_epoch_fscore += fscore.item()

    valid_loss = valid_epoch_loss / len(valid_iterator)
    valid_prec = valid_epoch_prec / len(valid_iterator)
    valid_rec = valid_epoch_recall / len(valid_iterator)
    valid_fscore = valid_epoch_fscore / len(valid_iterator)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f"Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s")
    print(
        f"\tTrain Loss: {train_loss:.3f} | Train Prec: {train_prec*100:.2f}% | Train Rec: {train_rec*100:.2f}% | Train Fscore: {train_fscore*100:.2f}%"
    )
    print(
        f"\t Val. Loss: {valid_loss:.3f} |  Val Prec: {valid_prec*100:.2f}% | Val Rec: {valid_rec*100:.2f}% | Val Fscore: {valid_fscore*100:.2f}%"
    )

  _warn_prf(average, modifier, msg_start, len(result))


Epoch: 01 | Epoch Time: 0m 21s
	Train Loss: 0.640 | Train Prec: 5.10% | Train Rec: 12.49% | Train Fscore: 5.08%
	 Val. Loss: 0.661 |  Val Prec: 0.00% | Val Rec: 0.00% | Val Fscore: 0.00%
Epoch: 02 | Epoch Time: 0m 21s
	Train Loss: 0.614 | Train Prec: 7.89% | Train Rec: 41.22% | Train Fscore: 12.05%
	 Val. Loss: 0.597 |  Val Prec: 2.25% | Val Rec: 21.98% | Val Fscore: 3.96%
Epoch: 03 | Epoch Time: 0m 22s
	Train Loss: 0.597 | Train Prec: 13.21% | Train Rec: 56.39% | Train Fscore: 19.26%
	 Val. Loss: 0.580 |  Val Prec: 16.67% | Val Rec: 70.27% | Val Fscore: 26.07%
Epoch: 04 | Epoch Time: 0m 22s
	Train Loss: 0.590 | Train Prec: 20.10% | Train Rec: 58.86% | Train Fscore: 26.62%
	 Val. Loss: 0.572 |  Val Prec: 10.38% | Val Rec: 66.87% | Val Fscore: 17.28%
Epoch: 05 | Epoch Time: 0m 22s
	Train Loss: 0.581 | Train Prec: 22.31% | Train Rec: 60.05% | Train Fscore: 30.59%
	 Val. Loss: 0.566 |  Val Prec: 13.55% | Val Rec: 69.60% | Val Fscore: 21.88%
