# Improving our 🤗 base model

After training our first model with 🤗, we want to explore ways to improve our training and make the notebook more readable.

- As a first step, we use the DistilBERT base model because it is small, fast, and good for our 🌍.

## Setup

As always, we first import the necessary libraries.

In [48]:
from datasets import Dataset, DatasetDict
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import seaborn as sns
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import TrainingArguments, Trainer
import torch

In [3]:
path = Path("feedback-prize-english-language-learning")

Load our data

In [4]:
df_ss = pd.read_csv(path/"sample_submission.csv")
df_tst = pd.read_csv(path/"test.csv")
df_trn = pd.read_csv(path/"train.csv")

We want to use the small DistilBERT model as we care about our environment 😊

In [35]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

## Preprocess

Set our labels and one-hot encode them.

In [13]:
labels = df_trn.columns[2:]
df_trn_ohe = pd.get_dummies(df_trn, columns=labels)
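
To see what `pd.get_dummies` does to the score columns, here is a minimal sketch on a toy frame. The rows and the names `toy`, `toy_labels` and `toy_ohe` are made up for illustration; only the column layout (two text columns followed by score columns) mirrors what the notebook assumes about `train.csv`.

```python
import pandas as pd

# Hypothetical stand-in for train.csv with two of the score columns.
toy = pd.DataFrame({
    "text_id": ["a", "b", "c"],
    "full_text": ["first essay", "second essay", "third essay"],
    "cohesion": [3.0, 2.5, 3.0],
    "syntax": [4.0, 4.0, 1.5],
})

# Same pattern as in the notebook: everything after the first two columns is a label.
toy_labels = toy.columns[2:]
toy_ohe = pd.get_dummies(toy, columns=toy_labels)

# Each distinct score becomes its own indicator column, e.g. "cohesion_2.5".
print(list(toy_ohe.columns))
```

Every essay ends up with exactly one active indicator per trait, which is why the task is later treated as multi-label classification.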

Create the dataset

In [17]:
ds = Dataset.from_pandas(df_trn_ohe)

Split into training and validation sets

In [22]:
ds_d = ds.train_test_split(test_size=0.25, seed=42)

To identify dataset labels at a later point:

In [24]:
labels = [label for label in ds_d['train'].features.keys() if label not in ['text_id', 'full_text']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

Preprocess the data with the help of our tokenizer 

In [33]:
def preprocess_data(examples):
    # take a batch of texts
    text = examples["full_text"]
    # encode them
    encoding = tokenizer(text, padding="max_length", truncation=True)
    # add labels
    labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
    # create numpy array of shape (batch_size, num_labels)
    labels_matrix = np.zeros((len(text), len(labels)))
    # fill numpy array
    for idx, label in enumerate(labels):
        labels_matrix[:, idx] = labels_batch[label]

    encoding["labels"] = labels_matrix.tolist()
  
    return encoding
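
The label-matrix step inside `preprocess_data` can be tried in isolation. The sketch below uses a hand-written batch (`toy_batch` and `toy_labels` are hypothetical, and the tokenizer call is left out) to show how the per-column lists turn into one float matrix of shape `(batch_size, num_labels)`.

```python
import numpy as np

# Hypothetical batch in the dict-of-lists layout that `map(..., batched=True)` passes in.
toy_labels = ["cohesion_2.5", "cohesion_3.0"]
toy_batch = {
    "full_text": ["first essay", "second essay", "third essay"],
    "cohesion_2.5": [0, 1, 0],
    "cohesion_3.0": [1, 0, 1],
}

# One row per text, one column per label, filled column by column.
labels_matrix = np.zeros((len(toy_batch["full_text"]), len(toy_labels)))
for idx, label in enumerate(toy_labels):
    labels_matrix[:, idx] = toy_batch[label]

print(labels_matrix.tolist())
```

The matrix is deliberately floating point: the binary cross-entropy loss used for multi-label classification expects float targets.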

Encode!

In [37]:
enc_ds_d = ds_d.map(preprocess_data, batched=True, remove_columns=ds_d['train'].column_names)

## Getting ready for training

Our metric is MCRMSE, the mean columnwise root mean squared error:

In [66]:
def mcrmse(y_true, y_pred):
    return np.mean(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0)))
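
As a quick sanity check of the metric, here is a small worked example with made-up numbers (the arrays are illustrative, not competition data). The RMSE is computed per column first and the column scores are then averaged.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    # RMSE per column (axis=0), then the mean over columns.
    return np.mean(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0)))

y_true = np.array([[3.0, 2.5],
                   [4.0, 3.0]])
y_pred = np.array([[3.0, 3.5],
                   [4.0, 2.0]])

# Column 0 is predicted perfectly (RMSE 0); column 1 is off by 1 in both rows (RMSE 1).
score = mcrmse(y_true, y_pred)
print(score)
```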

The metric helpers below follow [this multi-label classification tutorial](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb):

In [67]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    mcrmse_acc = mcrmse(y_true=y_true, y_pred=y_pred)
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)    
    # return as dictionary
    metrics = {'mcrmse': mcrmse_acc,
               'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
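
The sigmoid-and-threshold step in `multi_label_metrics` can be illustrated without PyTorch. This is a minimal NumPy sketch with made-up logits; the helper `sigmoid` is written out here for illustration and is not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    # Maps raw logits to probabilities in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical logits for two examples and three labels.
logits = np.array([[-0.07, 0.13, 2.0],
                   [-1.5, 0.0, -0.02]])

probs = sigmoid(logits)
y_pred = (probs >= 0.5).astype(float)
print(y_pred)
```

With a threshold of 0.5, a label is predicted exactly when its logit is at least 0, so logits that hover just below zero all produce a prediction of 0.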

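Setting up our model. Note that the model is loaded from `distilbert-base-uncased`, while the tokenizer above comes from `distilbert-base-cased`; tokenizer and model should normally share the same checkpoint so that the token ids match the model's vocabulary.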

In [68]:
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "cohesion_1.0",
    "1": "cohesion_1.5",
    "2": "cohesion_2.0",
    "3": "cohesion_2.5",
    "4": "cohesion_3.0",
    "5": "cohesion_3.5",
    "6": "cohesion_4.0",
    "7": "cohesion_4.5",
    "8": "cohesion_5.0",
    "9": "syntax_1.0",
    "10": "syntax_1.5",
    "11": "syntax_2.0",
    "12": "syntax_2.5",
    "13": "syntax_3.0",
    "14": "syntax_3.5",
    "15": "syntax_4.0",
    "16": "syntax_4.5",
    "17": "syntax_5.0",
    "18": "vocabulary_1.0

Set the hyperparameters and training arguments:

In [69]:
batch_size = 16
metric_for_best_model = "mcrmse"

In [70]:
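args = TrainingArguments(
    "outputs",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=8e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_for_best_model,
    # MCRMSE is an error, so lower is better; without this the Trainer
    # treats a custom metric as "greater is better" and keeps the worst epoch.
    greater_is_better=False,
)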

Set format to torch tensors:

In [71]:
enc_ds_d.set_format("torch")

Check for correct batch format and forward pass:

In [72]:
enc_ds_d['train'][0]['labels'].type()

'torch.FloatTensor'

In [73]:
# index the row first so that only one example is loaded, not the whole column
enc_ds_d['train'][0]['input_ids']

tensor([  101,  2722,  1110,  1141,  1104,  1103,  1211,  2712,  1282,  1107,
         1142,  1362,   119,  2722,  1144,  7424,  1104,  4333,   117,  1152,
         1138,  1992,  5862,  3514,  1176,  5230,  6331,   117,  7120,   117,
         6244,  1186,  6331,   117,  1105,  1242,  1167,   119,  1220,  1138,
         1992,  4706,  1116,  1115,  2006,  1166,  1620,   117,  1288,  1234,
          117,  1152,  1256,  1138,  1199,  3505,  1956,  1176,  1152,  1138,
         1138,  1832,  5082,  7418,  1656,  1103,  4706,   117,  1141,  1104,
         1172,  1144,   170,  4255,  1656,   119,  2857,   146,  1156,  6657,
         1120,  2722,   146,  1156,  1567,  1106,  2222,  1147,  2094,  1272,
          146,  1767,  1147,  2094,  1108,  1363,  1177,  1358,   146, 14516,
         1116, 10340,  1174,  1122,  1146,  1105,  1103,  2094,  2736,  1177,
          194,  1818,  4527,  1105, 27629, 13913,   119,   146,  1156,  1145,
         1567,  1106,  4176,  1113,   170,  3499,  1105, 11270, 

In [74]:
# forward pass on a single example; unsqueeze(0) adds the batch dimension
example = enc_ds_d['train'][0]
outputs = model(input_ids=example['input_ids'].unsqueeze(0), labels=example['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.6894, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.0697, -0.1150,  0.1347, -0.0884, -0.0258, -0.0756,  0.0942,  0.0280,
         -0.0434, -0.0496,  0.1531, -0.0448,  0.0982, -0.0008, -0.0036, -0.0964,
         -0.1115, -0.1378, -0.0158,  0.1075,  0.0232,  0.0487,  0.0406,  0.1637,
          0.0174,  0.0351, -0.0200,  0.0281, -0.0097, -0.0534,  0.0546, -0.0334,
         -0.1012,  0.0192, -0.1326,  0.0234, -0.0674,  0.0868,  0.0536, -0.0165,
          0.0937,  0.0105,  0.1397, -0.1875,  0.0375,  0.0325, -0.0408, -0.0414,
          0.0562, -0.0262, -0.0540, -0.0930,  0.0205, -0.0305]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
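
The loss of 0.6894 from this first forward pass is close to ln 2 ≈ 0.6931. That is the value binary cross-entropy takes when every logit is exactly 0, which is roughly the state of a freshly initialised classification head. The sketch below checks this with NumPy; `bce_with_logits` is a hand-written stand-in for the loss, and the target positions are arbitrary.

```python
import math
import numpy as np

def bce_with_logits(logits, targets):
    # Numerically stable binary cross-entropy on raw logits, averaged over all elements.
    per_element = (np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
    return per_element.mean()

# 54 labels, matching the number of logits printed above.
logits = np.zeros((1, 54))
targets = np.zeros((1, 54))
targets[0, [4, 13]] = 1.0  # arbitrary active labels; the result does not depend on them

loss = bce_with_logits(logits, targets)
print(loss, math.log(2))
```

A starting loss near this value suggests the forward pass and label format are wired up correctly.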

Setting up our Trainer:

In [75]:
trainer = Trainer(
    model,
    args,
    train_dataset=enc_ds_d["train"],
    eval_dataset=enc_ds_d["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

**TRAIN!**

In [76]:
trainer.train()

***** Running training *****
  Num examples = 2933
  Num Epochs = 5
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 920


| Epoch | Training Loss | Validation Loss | MCRMSE | F1 | ROC AUC | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.285174 | 0.276595 | 0.0 | 0.5 | 0.0 |
| 2 | No log | 0.277689 | 0.276595 | 0.0 | 0.5 | 0.0 |
| 3 | 0.296900 | 0.264235 | 0.277174 | 0.053526 | 0.512046 | 0.0 |
| 4 | 0.296900 | 0.263474 | 0.276842 | 0.041069 | 0.509224 | 0.0 |
| 5 | 0.296900 | 0.261615 | 0.27693 | 0.058494 | 0.513399 | 0.0 |
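
In epochs 1 and 2 above, F1 is 0.0, ROC AUC is 0.5 and accuracy is 0.0. That pattern is consistent with no probability crossing the 0.5 threshold, so that every thresholded prediction is 0 (a constant prediction also yields a ROC AUC of 0.5). The sketch below reproduces the pattern on a made-up one-hot label matrix; `toy_true` is hypothetical data, and the metrics are written in plain NumPy rather than with the scikit-learn functions used in the notebook.

```python
import numpy as np

# Hypothetical labels: 2 traits x 2 score levels, one active level per trait and row.
toy_true = np.array([[1, 0, 1, 0],
                     [1, 0, 0, 1],
                     [0, 1, 1, 0],
                     [1, 0, 1, 0]], dtype=float)
toy_pred = np.zeros_like(toy_true)  # the model predicts no label at all

# Micro-averaged F1 from global true positives, false positives and false negatives.
tp = np.sum((toy_pred == 1) & (toy_true == 1))
fp = np.sum((toy_pred == 1) & (toy_true == 0))
fn = np.sum((toy_pred == 0) & (toy_true == 1))
f1_micro = 2 * tp / (2 * tp + fp + fn)

# Subset accuracy: a row only counts if every label in it is correct.
subset_accuracy = np.mean(np.all(toy_true == toy_pred, axis=1))

# With all-zero predictions, MCRMSE depends only on how often each label is active.
mcrmse_zero = np.mean(np.sqrt(np.mean((toy_true - toy_pred) ** 2, axis=0)))
print(f1_micro, subset_accuracy, mcrmse_zero)
```

This also hints at why MCRMSE barely moves across epochs: computed on thresholded one-hot predictions, it mostly reflects the label frequencies rather than how far off the predicted scores are.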


***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs/checkpoint-184
Configuration saved in outputs/checkpoint-184/config.json
Model weights saved in outputs/checkpoint-184/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-184/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-184/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs/checkpoint-368
Configuration saved in outputs/checkpoint-368/config.json
Model weights saved in outputs/checkpoint-368/pytorch_model.bin
tokenizer config file saved in outputs/checkpoint-368/tokenizer_config.json
Special tokens file saved in outputs/checkpoint-368/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 978
  Batch size = 16
Saving model checkpoint to outputs/checkpoint-552
Configuration saved in outputs/checkpoint-552/config.json
Model weights saved in outputs/che

TrainOutput(global_step=920, training_loss=0.2778122279954993, metrics={'train_runtime': 794.4716, 'train_samples_per_second': 18.459, 'train_steps_per_second': 1.158, 'total_flos': 1944435895879680.0, 'train_loss': 0.2778122279954993, 'epoch': 5.0})
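
The step counts in the log can be checked by hand. The sketch assumes a single device, no gradient accumulation (as the log states) and that the last, smaller batch of each epoch is kept.

```python
import math

num_examples = 2933  # "Num examples" in the training log
batch_size = 16
num_epochs = 5

# The final batch of an epoch is smaller than 16, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This gives 184 steps per epoch and 920 steps in total, matching `global_step=920` and the checkpoint names `checkpoint-184`, `checkpoint-368` and `checkpoint-552`.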

In [77]:
trainer.save_model("trainer")

Saving model checkpoint to trainer
Configuration saved in trainer/config.json
Model weights saved in trainer/pytorch_model.bin
tokenizer config file saved in trainer/tokenizer_config.json
Special tokens file saved in trainer/special_tokens_map.json
