# Lab 03 - Transfer Learning w/ DistilBert for Relation Classification

In this lab, we'll finetune a Transformer-based language model for relation classification. The goals of this lab are the following:
- Preparing Data for BERT-based models
- Finetuning BERT-based models using HuggingFace Transformers

# Data: SemEval 2007 - Task 4: Classification of Semantic Relations Between Nominals

Recall from class that relation classification involves identifying the relationship between two or more entities. For example, given the following text:
```
The <e1>artist</e1> made the <e2>picture</e2> when he was a fourth grade student in Iowa City.
```
What is the semantic relationship between entity 1 `artist` and entity 2 `picture`? Artists produce art, so the picture is a product created by the artist. We can label this relationship as `product-producer`, where entity 2 is the product and entity 1 is the producer.
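To make the entity markup concrete, here is a minimal sketch of pulling the two marked entities out of a sentence with a regular expression. The helper name `extract_entities` is hypothetical and is not part of the lab code; it only assumes the `<e1>...</e1>` and `<e2>...</e2>` tags shown above.

```python
import re

# Hypothetical helper (not part of the lab code): matches <e1>...</e1> and <e2>...</e2>.
ENTITY_PATTERN = re.compile(r"<e([12])>(.*?)</e\1>")


def extract_entities(text: str) -> dict:
    """Return a mapping such as {"e1": "artist", "e2": "picture"}."""
    return {f"e{idx}": span for idx, span in ENTITY_PATTERN.findall(text)}


sentence = (
    "The <e1>artist</e1> made the <e2>picture</e2> "
    "when he was a fourth grade student in Iowa City."
)
entities = extract_entities(sentence)
print(entities)  # {'e1': 'artist', 'e2': 'picture'}
```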

The goal of relation classification, then, is to label the relation between a pair of entities. The [SemEval 2007 - Task 4](https://aclanthology.org/S07-1003/) (Girju et al. 2007) was one of the first shared tasks to explore relation identification. The associated dataset contains 7 relations:
```
cause-effect
instrument-agency
product-producer
origin-entity
theme_tool
part-whole
content-container
```

In this lab, we'll frame relation classification as a multi-class sequence classification task. Given a sentence containing two marked entities, the goal is to predict the relation category that holds between the entities in the context of the sentence.

Run the code below to load the data into memory. The `train`, `val`, and `test` variables are dataframes with our data.

In [None]:
!pip install transformers > /dev/null
!wget https://raw.githubusercontent.com/dhairyadalal/relation-lab-dataset/main/test.csv
!wget https://raw.githubusercontent.com/dhairyadalal/relation-lab-dataset/main/train.csv
!wget https://raw.githubusercontent.com/dhairyadalal/relation-lab-dataset/main/val.csv

--2023-01-27 11:13:56--  https://raw.githubusercontent.com/dhairyadalal/relation-lab-dataset/main/test.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81140 (79K) [text/plain]
Saving to: ‘test.csv’


2023-01-27 11:13:56 (16.0 MB/s) - ‘test.csv’ saved [81140/81140]

--2023-01-27 11:13:56--  https://raw.githubusercontent.com/dhairyadalal/relation-lab-dataset/main/train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127162 (124K) [text/plain]
Saving to: ‘train.csv’


2023-01-27 11:13:57 (6.79 MB/s) - ‘train.csv’ saved [12716

In [None]:
import pandas as pd
train = pd.read_csv("train.csv")
val = pd.read_csv("val.csv")
test = pd.read_csv("test.csv")

train.head()

Unnamed: 0,text,label
0,"""Most of the <e1>steam</e1> comes from a volca...",product-producer
1,"""The <e1>cabin passengers</e1> composed the <e...",product-producer
2,"""Well, this <e1>footballer</e1> kicked the <e2...",instrument-agency
3,"""<e1>Germanium</e1> is found in <e2>germanite<...",part-whole
4,"""Craig expressed his <e1>frustration</e1> afte...",cause-effect


# Transfer Learning with Pretrained Neural Language Models
Modern language models (BERT, ERNIE, GPT, T5, etc.) have been found to be very effective across a wide range of NLP tasks. These models are usually deep neural networks that have been pretrained on large text corpora (e.g. Wikipedia, Common Crawl, BooksCorpus) and are able to learn about various aspects of language (syntax, grammar, semantics, etc.), which can be transferred across domains and NLP tasks.

The Transformer architecture (https://jalammar.github.io/illustrated-transformer/) tends to be the backbone of most modern language models. We sort the transformer models into two categories: autoregressive models and autoencoding models. Autoregressive models (e.g. GPT, XLNet) are pretrained on next word prediction. Given a sequence (the cat sat on the [BLANK]), the model attempts to predict the likely next word in the sequence. In contrast, autoencoding models (e.g. BERT, RoBERTa, ERNIE) are trained to reconstruct corrupted sequences. A sentence like the cat sat on the mat would be corrupted by masking a random set of words, e.g. the [MASK] sat on the [MASK], and the model must predict the masked tokens. Unlike autoregressive pretraining, the model uses the context of the full input (both left and right of each mask) to predict the masked parts.
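The two pretraining objectives can be illustrated with a toy sketch in plain Python. This is a simplification for intuition only: the function names `next_word_examples` and `mask_tokens` are hypothetical, the masking probability is an illustrative choice, and real BERT-style masking operates on wordpiece tokens and does not always substitute `[MASK]`.

```python
import random


def next_word_examples(tokens):
    """Autoregressive view: each prefix is paired with the word that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mask_tokens(tokens, mask_prob=0.3, seed=0):
    """Autoencoding view: hide random tokens and remember what was hidden."""
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append("[MASK]")
            targets[position] = token  # the model must recover this token
        else:
            corrupted.append(token)
    return corrupted, targets


tokens = "the cat sat on the mat".split()

pairs = next_word_examples(tokens)
print(pairs[-1])  # (['the', 'cat', 'sat', 'on', 'the'], 'mat')

corrupted, targets = mask_tokens(tokens)
print(corrupted, targets)
```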

There are two ways to use these models for an arbitrary task. The first is feature extraction: the weights of the model are frozen and the output of the last hidden layer is used as a set of fixed features. While this method is very quick, it is limited in its efficacy. The other way is finetuning. Finetuning is the process of updating the pretrained weights in order to adapt the model to a new task and domain (e.g. sentiment classification, relation classification, etc.). Since the model has already internalized its own understanding of language, grammar, and semantics, finetuning usually only takes 1-5 epochs of additional gradient updates to adapt the model to the new task.


The HuggingFace `Transformers` library hosts implementations and trained weights for many cutting-edge Transformer models and has a unified, easy-to-use API for finetuning them. For the final part of the lab, we'll walk through how to prepare data for finetuning and train a model for our relation classification task. The `Transformers` library supports PyTorch, TensorFlow, and JAX implementations of various models. For this course we'll be using PyTorch given its widespread adoption in academic research, Pythonic paradigm, and ease of debugging with dynamic graph generation.

We'll explore finetuning the DistilBERT (https://arxiv.org/abs/1910.01108) model. DistilBERT reduces the size of the BERT model (fewer parameters and layers), which allows for quicker finetuning while still retaining about 97% of BERT's language understanding performance, as reported by its authors. For most other considerations (input encoding, training, etc.) DistilBERT works the same way as BERT; one difference is that it does not use token type ids. We recommend changing your runtime type to GPU for this section, as it will dramatically speed up training. You may need to rerun earlier cells to ensure the dataset is reloaded into memory.

### Data Preparation

In order to prepare the data for BERT finetuning, the following steps must be taken:
1. Numerically encode string labels and convert to tensor objects
2. Encode the text into wordpiece ids, pad inputs, and convert to tensor objects
3. Create a Dataset object containing the tensor inputs and labels.

### Label Encoding
The first thing we need to do is convert our string labels into a numerical representation. The easiest way is to enumerate the labels and assign each label its enumerated value.

For example:
```
Label                  Encoding
---------------       ---------------
cause-effect               0
content-container          1
instrument-agency          2
origin-entity              3 
part-whole                 4
product-producer           5
theme_tool                 6
``` 

This can be accomplished using the `LabelEncoder` from the sklearn library.




In [None]:
from sklearn.preprocessing import LabelEncoder

# 1. Load Label Encoder
le = LabelEncoder()

# 2. Fit the label encoder to the labels in our dataset
le.fit(train["label"])

# 3. Create a new column with encoded labels
train["encoded_label"] = le.transform(train["label"])
val["encoded_label"] = le.transform(val["label"])
test["encoded_label"] = le.transform(test["label"])

# Validate the mapping:
train.groupby(["label", "encoded_label"]).aggregate("count")

Unnamed: 0_level_0,Unnamed: 1_level_0,text
label,encoded_label,Unnamed: 2_level_1
cause-effect,0,126
content-container,1,126
instrument-agency,2,126
origin-entity,3,126
part-whole,4,126
product-producer,5,126
theme_tool,6,126


In [None]:
# The label encoder can also be used to transform ids back to labels
le.inverse_transform([1,4,6])

array(['content-container', 'part-whole', 'theme_tool'], dtype=object)
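For intuition, the mapping that `LabelEncoder` produces here is equivalent to sorting the unique labels and enumerating them. A minimal plain-Python sketch of the same idea follows; the variable names are illustrative and not part of the lab code.

```python
labels = [
    "product-producer", "cause-effect", "theme_tool", "part-whole",
    "origin-entity", "instrument-agency", "content-container", "cause-effect",
]

# Sort the unique labels, then assign each one its position in that order.
classes = sorted(set(labels))
label_to_id = {label: idx for idx, label in enumerate(classes)}
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = [label_to_id[label] for label in labels]
decoded = [id_to_label[idx] for idx in [1, 4, 6]]

print(label_to_id["product-producer"])  # 5
print(decoded)  # ['content-container', 'part-whole', 'theme_tool']
```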

Once we've encoded the labels, let's go ahead and convert them to PyTorch tensors.

In [None]:
import torch

train_labels = torch.tensor(train["encoded_label"].tolist())
val_labels = torch.tensor(val["encoded_label"].tolist())
test_labels = torch.tensor(test["encoded_label"].tolist())

### Text Encoding 
DistilBERT expects wordpiece ids as input. WordPiece is a subword tokenization algorithm (https://huggingface.co/course/chapter6/6?fw=pt) that learns a fixed set of token and partial-token units which can be combined to construct any word. The `AutoTokenizer` class from the `Transformers` library can be used to load the BERT wordpiece vocabulary and automatically encode any text into a sequence of wordpiece ids. Let's explore this a bit further below.

The `AutoTokenizer.from_pretrained(model_alias)` method automatically loads the tokenizer and vocabulary associated with the given transformer model.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Encoding a sentence
sent = "The quick brown fox jumped over the lazy dog."
print(f"Tokenizer output a dictionary: {tokenizer(sent)}")

# We can also decode ids to vocabulary
print(tokenizer.decode([101, 1996, 4248, 2829, 4419, 5598, 2058, 1996, 13971, 3899, 1012, 102]))

Tokenizer output a dictionary: {'input_ids': [101, 1996, 4248, 2829, 4419, 5598, 2058, 1996, 13971, 3899, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] the quick brown fox jumped over the lazy dog. [SEP]
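To see how a word can be assembled from subword units, here is a toy sketch of the greedy longest-match-first lookup that WordPiece uses at tokenization time. The vocabulary below is made up for illustration, the function name `wordpiece_tokenize` is hypothetical, and the sketch covers only the lookup step, not how the vocabulary is learned.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##break", "##able", "play", "##ing"}

print(wordpiece_tokenize("unbreakable", toy_vocab))  # ['un', '##break', '##able']
print(wordpiece_tokenize("playing", toy_vocab))      # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```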


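The tokenizer also handles some useful preprocessing steps for us, like padding, truncation, and conversion to PyTorch tensors. Most BERT models can handle up to 512 tokens; however, training a model with inputs that long tends to be computationally expensive. Since we are working with relatively short sentences, we can set `max_length` to 24 wordpiece tokens. Note that any sentence longer than this is truncated, which can cut off text near the end of the sentence (including an entity marker), so a short `max_length` trades some accuracy for speed. Additionally, if we pass all the input texts at once with `padding=True`, the tokenizer will automatically pad every input to the length of the longest sequence in the batch (after truncation). Let's try this out on the train dataset.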


In [None]:
train_encodings = tokenizer(
    train["text"].tolist(),
    padding=True,           # pad to the longest sequence in the batch
    max_length=24,          # BERT max is 512; we choose 24 for computational efficiency
    return_tensors="pt",    # Return format pytorch tensor
    truncation=True
)

`Tokenizer` will output a dictionary with the input features for training. Let's take a closer look.

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])

In [None]:
train_encodings

{'input_ids': tensor([[ 101, 1000, 2087,  ..., 1026, 1041,  102],
        [ 101, 1000, 1996,  ..., 1026, 1013,  102],
        [ 101, 1000, 2092,  ..., 3608, 1026,  102],
        ...,
        [ 101, 1000, 1996,  ..., 1026, 1013,  102],
        [ 101, 1000, 2348,  ..., 1013, 1041,  102],
        [ 101, 1000, 3402,  ..., 1013, 1041,  102]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])}
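The attention mask marks which positions hold real tokens (1) and which hold padding (0). A minimal plain-Python sketch of truncation followed by padding to the longest sequence is shown below. The helper name `pad_and_truncate` is hypothetical, the ids are made up, and the sketch ignores special tokens such as `[CLS]` and `[SEP]`, which the real tokenizer preserves.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch_ids]
    target_len = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = target_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = [[5, 6, 7], [5, 6, 7, 8, 9, 10, 11]]
encoded_batch = pad_and_truncate(batch, max_length=5)

print(encoded_batch["input_ids"])       # [[5, 6, 7, 0, 0], [5, 6, 7, 8, 9]]
print(encoded_batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

In the lab data most sentences exceed 24 wordpieces, which is why the attention masks printed above consist almost entirely of ones.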

In [None]:
val_encodings = tokenizer(
    val["text"].tolist(),
    padding=True,           # pad to the longest sequence in the batch
    max_length=24,          # BERT max is 512; we choose 24 for computational efficiency
    return_tensors="pt",    # Return format pytorch tensor
    truncation=True
)

test_encodings = tokenizer(
    test["text"].tolist(),
    padding=True,           # pad to the longest sequence in the batch
    max_length=24,          # BERT max is 512; we choose 24 for computational efficiency
    return_tensors="pt",    # Return format pytorch tensor
    truncation=True
)

Finally, we need to create a custom PyTorch `Dataset` class to store the generated encodings for each split. The code below creates the custom class and generates the datasets for the train, val, and test sets.

In [None]:
from torch.utils.data import Dataset
import torch
from sklearn.preprocessing import LabelEncoder

# Define Custom Class for DistilBert Inputs
class RelationDataset(Dataset):
    
    def __init__(self, encodings: dict):  
        self.encodings = encodings
        
    def __len__(self) -> int:
        return len(self.encodings["input_ids"])
    
    def __getitem__(self, idx: int) -> dict:
        e = {k: v[idx] for k,v in self.encodings.items()}
        return e 


# Update encodings with labels
train_encodings["labels"] = train_labels
val_encodings["labels"] = val_labels
test_encodings["labels"] = test_labels

# Generate Datasets
train_ds = RelationDataset(train_encodings)
val_ds = RelationDataset(val_encodings)
test_ds = RelationDataset(test_encodings)

In [None]:
train_ds[:2]

{'input_ids': tensor([[  101,  1000,  2087,  1997,  1996,  1026,  1041,  2487,  1028,  5492,
           1026,  1013,  1041,  2487,  1028,  3310,  2013,  1037, 12779,  1005,
           1055,  1026,  1041,   102],
         [  101,  1000,  1996,  1026,  1041,  2487,  1028,  6644,  5467,  1026,
           1013,  1041,  2487,  1028,  3605,  1996,  1026,  1041,  2475,  1028,
           7069,  1026,  1013,   102]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'labels': tensor([5, 5])}

# Model Training
HuggingFace makes it simple to finetune transformer models for many tasks. First we load the pretrained model. `AutoModelForSequenceClassification` is a generic class that combines a language model encoder with a classification head. Next we create a `TrainingArguments` object, which contains the training configuration details. Finally we create a `Trainer` object, which handles all the requisite training steps (e.g. learning rate scheduling, gradient backpropagation, etc.).

Let's first load our model. 

In [None]:
from transformers import AutoModelForSequenceClassification
from transformers import Trainer
from transformers import TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=7)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'pre_classifi

### Finetuning Strategies
#### Partial Finetuning
There are several strategies for finetuning. Most Transformer models have multiple layers. Research has shown that these layers often encode different kinds of linguistic and semantic information. Though it's difficult to know for certain what the layers learn, research suggests that the lower layers (closer to the input embeddings) learn general linguistic features and the higher layers (closer to the classification head) pick up task-specific information. You can experiment with how many layers you train.

Full finetuning involves backpropagating the loss through all the layers in the model. This training strategy tends to be the most effective when adapting to a specific task, but the resulting model may lose some of its ability to generalize to other tasks. Additionally, full finetuning takes longer and requires more compute. Partial finetuning involves strategically freezing layers (usually the lower layers, closer to the input) and only finetuning the layers closest to the classification head. This tends to be faster and reduces the risk of losing what the model learned during pretraining. Let's take a closer look at the model and its layers.

In [None]:
model

In [None]:
# Freeze embeddings
for name, param in model.distilbert.embeddings.named_parameters():
  param.requires_grad = False
  print(name, param.requires_grad)

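# Freeze transformer layers 1-4 (zero-indexed); layers 0 and 5 stay trainable
freeze_layers = [1,2,3,4]
for name, param in model.distilbert.transformer.layer.named_parameters():
  # Parameter names start with the layer index, e.g. "1.attention.q_lin.weight"
  if int(name.split(".")[0]) in freeze_layers:
    param.requires_grad = False
  print(name, param.requires_grad)  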

word_embeddings.weight False
position_embeddings.weight False
LayerNorm.weight False
LayerNorm.bias False
0.attention.q_lin.weight True
0.attention.q_lin.bias True
0.attention.k_lin.weight True
0.attention.k_lin.bias True
0.attention.v_lin.weight True
0.attention.v_lin.bias True
0.attention.out_lin.weight True
0.attention.out_lin.bias True
0.sa_layer_norm.weight True
0.sa_layer_norm.bias True
0.ffn.lin1.weight True
0.ffn.lin1.bias True
0.ffn.lin2.weight True
0.ffn.lin2.bias True
0.output_layer_norm.weight True
0.output_layer_norm.bias True
1.attention.q_lin.weight False
1.attention.q_lin.bias False
1.attention.k_lin.weight False
1.attention.k_lin.bias False
1.attention.v_lin.weight False
1.attention.v_lin.bias False
1.attention.out_lin.weight False
1.attention.out_lin.bias False
1.sa_layer_norm.weight False
1.sa_layer_norm.bias False
1.ffn.lin1.weight False
1.ffn.lin1.bias False
1.ffn.lin2.weight False
1.ffn.lin2.bias False
1.output_layer_norm.weight False
1.output_layer_norm.bias Fals
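As the printed names show, each parameter name under `model.distilbert.transformer.layer` begins with its zero-based layer index. The sketch below shows one way to choose which layers to freeze from those names alone. The helpers `layer_index` and `should_freeze` are hypothetical, and the choice of freezing layers 0-4 (so that only the layer nearest the classification head trains) is an illustrative alternative, not the configuration used in the cell above.

```python
def layer_index(param_name: str) -> int:
    """Read the leading layer index, e.g. '11.attention.k_lin.bias' -> 11."""
    return int(param_name.split(".")[0])


def should_freeze(param_name: str, frozen_layers) -> bool:
    return layer_index(param_name) in frozen_layers


# Illustrative choice: freeze layers 0-4 of a 6-layer model, train only layer 5.
frozen_layers = set(range(5))

names = [
    "0.attention.q_lin.weight",
    "1.ffn.lin1.bias",
    "5.output_layer_norm.weight",
    "11.attention.k_lin.bias",  # multi-digit index, as in a 12-layer model
]
decisions = {name: should_freeze(name, frozen_layers) for name in names}
for name, frozen in decisions.items():
    print(name, "frozen" if frozen else "trainable")
```

In a real run, the boolean would be applied as `param.requires_grad = not should_freeze(name, frozen_layers)` inside the `named_parameters()` loop.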

#### Training
We're finally ready to train the model. To train, you need to create a `TrainingArguments` object, which contains training details like the number of epochs, learning rate scheduling, batch size, etc.

The `Trainer` object handles the training loop setup and training. 
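Learning rate scheduling is one of the details configured through the training arguments. As a rough sketch of what a cosine schedule does, the function below decays the learning rate from its base value to zero along half a cosine wave. The function name `cosine_lr`, the base learning rate of 5e-5, the absence of warmup, and the step count of 100 are all assumptions made for illustration.

```python
import math


def cosine_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given step under linear warmup plus cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total_steps = 100  # illustrative value
for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate starts at the base value, falls slowly at first, drops fastest around the midpoint, and flattens out near zero at the end of training.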

In [None]:
from transformers import Trainer
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    lr_scheduler_type='cosine',
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32, 
    fp16=True,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)

trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
Using cuda_amp half precision backend
***** Running training *****
  Num examples = 882
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 140
  Number of trainable parameters = 14771719


Epoch,Training Loss,Validation Loss
1,No log,1.717285
2,No log,1.463429
3,No log,1.298611
4,No log,1.226717
5,No log,1.211797


***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-28
Configuration saved in ./results/checkpoint-28/config.json
Model weights saved in ./results/checkpoint-28/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-56
Configuration saved in ./results/checkpoint-56/config.json
Model weights saved in ./results/checkpoint-56/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-84
Configuration saved in ./results/checkpoint-84/config.json
Model weights saved in ./results/checkpoint-84/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-112
Configuration saved in ./results/checkpoint-112/config.json
Model weights saved in ./results/checkpoint-112/pytorch_model.bin
***** Running Evaluat

TrainOutput(global_step=140, training_loss=1.4396815708705357, metrics={'train_runtime': 10.4162, 'train_samples_per_second': 423.378, 'train_steps_per_second': 13.441, 'total_flos': 27385936794720.0, 'train_loss': 1.4396815708705357, 'epoch': 5.0})
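Two numbers in the training log can be sanity-checked by hand: the 140 optimization steps and the 14,771,719 trainable parameters. The sketch below reproduces both, assuming the layer layout suggested by the printed parameter names (four attention projections, two layer norms, and a two-layer feed-forward block per transformer layer) and a classification head made of a `pre_classifier` (768 to 768) and a `classifier` (768 to 7). Only transformer layers 0 and 5 and the head are trainable after the freezing step.

```python
import math

# Optimization steps: batches per epoch (last partial batch included) times epochs.
num_examples, batch_size, epochs = 882, 32, 5
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 28 140

# Trainable parameters: DistilBERT dimensions for distilbert-base-uncased.
dim, hidden_dim, num_labels = 768, 3072, 7


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


attention = 4 * linear(dim, dim)          # q_lin, k_lin, v_lin, out_lin
layer_norms = 2 * (2 * dim)               # sa_layer_norm, output_layer_norm
ffn = linear(dim, hidden_dim) + linear(hidden_dim, dim)
per_layer = attention + layer_norms + ffn

head = linear(dim, dim) + linear(dim, num_labels)  # pre_classifier + classifier
trainable = 2 * per_layer + head                   # layers 0 and 5, plus the head
print(trainable)  # 14771719
```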

### Getting prediction out of the model
You can call `trainer.predict(test_ds)` to get predictions on the test set. The model returns the logit scores across the label space for each example. The label with the highest logit is the predicted label. Using `numpy.argmax` we can get the predicted encoded labels. We then need to convert those labels back to strings for our classification report.


In [None]:
import numpy as np
from sklearn.metrics import classification_report 

preds = trainer.predict(test_ds)
print(preds)


preds = le.inverse_transform(np.argmax(preds.predictions, axis=1))
print(classification_report(test["label"].tolist(), preds))

***** Running Prediction *****
  Num examples = 549
  Batch size = 32


PredictionOutput(predictions=array([[-0.2292 , -0.501  ,  0.4688 , ...,  0.2272 ,  0.505  , -1.41   ],
       [-0.567  , -1.333  , -0.9688 , ...,  0.02377,  0.827  ,  0.01137],
       [-1.143  , -0.651  , -0.5312 , ...,  0.3447 ,  0.6216 ,  0.7617 ],
       ...,
       [ 0.58   , -1.4    , -0.9326 , ..., -0.00762,  0.1239 ,  1.32   ],
       [ 2.67   , -1.234  , -0.8555 , ...,  0.2668 , -0.522  , -1.333  ],
       [ 0.84   , -1.688  , -0.8203 , ..., -0.3723 ,  0.2014 ,  0.859  ]],
      dtype=float16), label_ids=array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6
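To make the step from logits to a label concrete, here is a small plain-Python sketch for a single example. The logit values are made up for illustration (they are not taken from the output above), and the `softmax` and `argmax` helpers are written out by hand rather than taken from a library.

```python
import math

classes = [
    "cause-effect", "content-container", "instrument-agency", "origin-entity",
    "part-whole", "product-producer", "theme_tool",
]


def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.5, -1.2, -0.8, 0.1, 0.3, -0.5, -1.3]  # illustrative scores for one sentence
probs = softmax(logits)

predicted_id = argmax(logits)
print(predicted_id, classes[predicted_id])  # 0 cause-effect
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits gives the same label as taking the argmax of the probabilities.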

#### Full Finetuning
Let's see how the model does if we fully finetune it.

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=7)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    lr_scheduler_type='cosine',
    per_device_train_batch_size = 32,
    per_device_eval_batch_size = 32, 
    fp16=True,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
)

trainer.train()

preds = trainer.predict(test_ds)
print(preds)


preds = le.inverse_transform(np.argmax(preds.predictions, axis=1))
print(classification_report(test["label"].tolist(), preds))

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--distilbert-base-uncased/snapshots/1c4513b2eedbda136f57676a34eea67aba266e5c/config.json
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version

Epoch,Training Loss,Validation Loss
1,No log,1.569944
2,No log,1.054693
3,No log,0.857287
4,No log,0.806167
5,No log,0.793413


***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-28
Configuration saved in ./results/checkpoint-28/config.json
Model weights saved in ./results/checkpoint-28/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-56
Configuration saved in ./results/checkpoint-56/config.json
Model weights saved in ./results/checkpoint-56/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-84
Configuration saved in ./results/checkpoint-84/config.json
Model weights saved in ./results/checkpoint-84/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 98
  Batch size = 32
Saving model checkpoint to ./results/checkpoint-112
Configuration saved in ./results/checkpoint-112/config.json
Model weights saved in ./results/checkpoint-112/pytorch_model.bin
***** Running Evaluat

PredictionOutput(predictions=array([[-0.2405 , -1.0205 ,  0.8    , ..., -0.5635 ,  1.097  , -1.455  ],
       [-0.739  , -1.103  , -0.3555 , ..., -0.7295 ,  1.516  , -0.7373 ],
       [-2.037  ,  0.4438 , -1.092  , ...,  1.587  , -0.304  , -0.05453],
       ...,
       [ 2.068  , -1.578  , -1.279  , ..., -0.469  ,  0.3857 ,  1.251  ],
       [ 3.812  , -0.8027 , -0.703  , ..., -0.2651 , -0.274  , -0.841  ],
       [ 3.709  , -0.92   , -0.8037 , ..., -0.3564 ,  0.0277 , -0.5684 ]],
      dtype=float16), label_ids=array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
       6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6