# 🤗 Inference

After training our first model with 🤗, we want to explore ways to improve our training and make the notebook more readable.

- As a first step, we use the DistilBERT base model because it is small, fast, and good for our 🌍.

## Setup

As always, we first import the necessary libraries.

In [1]:
from datasets import Dataset,DatasetDict
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import seaborn as sns
from transformers import AutoModelForSequenceClassification,AutoTokenizer
from transformers import TrainingArguments, Trainer
import torch

In [2]:
path = Path("feedback-prize-english-language-learning")

Load our data

In [3]:
df_ss = pd.read_csv(path/"sample_submission.csv")
df_tst = pd.read_csv(path/"test.csv")
df_trn = pd.read_csv(path/"train.csv")

We want to use the small DistilBERT model as we care about our environment 😊

In [4]:
tokenizer = AutoTokenizer.from_pretrained("trainer")

## Preprocess

Set our labels and one-hot encode them.

In [5]:
labels = df_trn.columns[2:]
df_trn_ohe = pd.get_dummies(df_trn, columns=labels)
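
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are invented for illustration; only the column-naming pattern carries over to our data. Note that a grade that never occurs in a column gets no indicator column, so the later reshape into 6 categories x 9 grades assumes every grade occurs at least once per category.

```python
import pandas as pd

# Toy stand-in for df_trn (values are invented for illustration)
toy = pd.DataFrame({
    "text_id": ["a", "b", "c"],
    "cohesion": [3.0, 2.5, 3.0],
    "syntax": [3.5, 2.0, 3.5],
})

toy_ohe = pd.get_dummies(toy, columns=["cohesion", "syntax"])

# One indicator column per (category, observed grade) pair
print(list(toy_ohe.columns))

# Each row has exactly one active grade per category
cohesion_cols = [c for c in toy_ohe.columns if c.startswith("cohesion_")]
print(toy_ohe[cohesion_cols].sum(axis=1).tolist())
```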

Create the dataset

In [6]:
ds = Dataset.from_pandas(df_trn_ohe)

Split into training and validation sets (25% held out):

In [7]:
ds_d = ds.train_test_split(test_size=0.25, seed=42)

To identify dataset labels at a later point:

In [8]:
labels = [label for label in ds_d['train'].features.keys() if label not in ['text_id', 'full_text']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

Preprocess the data with the help of our tokenizer 

In [9]:
def preprocess_data(examples):
    # take a batch of texts
    text = examples["full_text"]
    # encode them
    encoding = tokenizer(text, padding="max_length", truncation=True)
    # add labels
    labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
    # create numpy array of shape (batch_size, num_labels)
    labels_matrix = np.zeros((len(text), len(labels)))
    # fill numpy array
    for idx, label in enumerate(labels):
        labels_matrix[:, idx] = labels_batch[label]

    encoding["labels"] = labels_matrix.tolist()
  
    return encoding
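
The label loop in `preprocess_data` can be checked in isolation. The sketch below uses a hand-written batch (the column names and values are hypothetical) and suggests that stacking the label columns gives the same `(batch_size, num_labels)` matrix as the loop.

```python
import numpy as np

# Hypothetical batch in the dict-of-lists form passed to a batched map function
labels = ["cohesion_2.5", "cohesion_3.0"]
examples = {
    "full_text": ["first essay", "second essay"],
    "cohesion_2.5": [1, 0],
    "cohesion_3.0": [0, 1],
}

# Loop version, as in preprocess_data
labels_matrix = np.zeros((len(examples["full_text"]), len(labels)))
for idx, label in enumerate(labels):
    labels_matrix[:, idx] = examples[label]

# Equivalent vectorized version: one column per label
stacked = np.column_stack([examples[label] for label in labels]).astype(float)

print(labels_matrix.tolist())
```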

Encode!

In [10]:
enc_ds_d = ds_d.map(preprocess_data, batched=True, remove_columns=ds_d['train'].column_names)




## Getting ready for training

Our metric, MCRMSE (mean columnwise root mean squared error):

In [11]:
def mcrmse(y_true,y_pred): return np.mean(np.sqrt(np.mean((y_true-y_pred)**2,axis=0)))
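
As a quick sanity check of the metric, the sketch below evaluates it on a tiny hand-made example with two target columns (the numbers are invented). The per-column RMSEs are 0.5 and 1.0, so their mean is 0.75.

```python
import numpy as np

def mcrmse(y_true, y_pred):
    # RMSE per column (axis=0 averages over samples), then the mean over columns
    return np.mean(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0)))

# Two samples, two target columns (invented values)
y_true = np.array([[3.0, 2.0],
                   [4.0, 3.0]])
y_pred = np.array([[3.5, 3.0],
                   [3.5, 2.0]])

score = mcrmse(y_true, y_pred)
print(score)
```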

See [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb):

In [12]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    mcrmse_acc = mcrmse(y_true=y_true, y_pred=y_pred)
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)    
    # return as dictionary
    metrics = {'mcrmse': mcrmse_acc,
               'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
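
Note that `multi_label_metrics` feeds thresholded 0/1 indicators into `mcrmse`, so the reported value measures error on the indicator matrix rather than on the grade scale. If a grade-scale score is wanted, one possible approach (our assumption, not part of the referenced tutorial) is to decode each category's most likely grade first. The helper `decode_grades` is hypothetical and assumes the columns are ordered category by category, with all nine grades from 1.0 to 5.0 present.

```python
import numpy as np

GRADE_VALUES = np.arange(1.0, 5.5, 0.5)  # 1.0, 1.5, ..., 5.0 (9 grades)

def decode_grades(scores, num_cats):
    """Hypothetical helper: (n, num_cats * 9) scores -> (n, num_cats) grades via argmax."""
    scores = np.asarray(scores, dtype=float)
    per_cat = scores.reshape(scores.shape[0], num_cats, -1)
    return GRADE_VALUES[per_cat.argmax(axis=-1)]

def mcrmse(y_true, y_pred):
    return np.mean(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0)))

# Toy case: 2 essays, 1 category (invented values)
true_onehot = np.zeros((2, 9))
true_onehot[0, 4] = 1   # essay 0: grade 3.0
true_onehot[1, 2] = 1   # essay 1: grade 2.0

pred_scores = np.zeros((2, 9))
pred_scores[0, 4] = 2.0  # essay 0: highest score at grade 3.0
pred_scores[1, 4] = 1.5  # essay 1: highest score at grade 3.0

true_grades = decode_grades(true_onehot, num_cats=1)
pred_grades = decode_grades(pred_scores, num_cats=1)
grade_score = mcrmse(true_grades, pred_grades)
print(grade_score)
```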

Setting up our model

In [13]:
model = AutoModelForSequenceClassification.from_pretrained("trainer", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Set the hyperparameters and the test-time (prediction-only) arguments:

In [14]:
batch_size = 16
metric_for_best_model = "mcrmse"

In [15]:
args = TrainingArguments(
    "tst_out",
    do_train = False,
    do_predict = True,
    per_device_eval_batch_size = batch_size,
    dataloader_drop_last = False
)

Set format to torch tensors:

In [16]:
enc_ds_d.set_format("torch")

Check for correct batch format and forward pass:

In [17]:
enc_ds_d['train'][0]['labels'].type()

'torch.FloatTensor'

In [18]:
enc_ds_d['train']['input_ids'][0]

tensor([  101,  2722,  1110,  1141,  1104,  1103,  1211,  2712,  1282,  1107,
         1142,  1362,   119,  2722,  1144,  7424,  1104,  4333,   117,  1152,
         1138,  1992,  5862,  3514,  1176,  5230,  6331,   117,  7120,   117,
         6244,  1186,  6331,   117,  1105,  1242,  1167,   119,  1220,  1138,
         1992,  4706,  1116,  1115,  2006,  1166,  1620,   117,  1288,  1234,
          117,  1152,  1256,  1138,  1199,  3505,  1956,  1176,  1152,  1138,
         1138,  1832,  5082,  7418,  1656,  1103,  4706,   117,  1141,  1104,
         1172,  1144,   170,  4255,  1656,   119,  2857,   146,  1156,  6657,
         1120,  2722,   146,  1156,  1567,  1106,  2222,  1147,  2094,  1272,
          146,  1767,  1147,  2094,  1108,  1363,  1177,  1358,   146, 14516,
         1116, 10340,  1174,  1122,  1146,  1105,  1103,  2094,  2736,  1177,
          194,  1818,  4527,  1105, 27629, 13913,   119,   146,  1156,  1145,
         1567,  1106,  4176,  1113,   170,  3499,  1105, 11270, 

In [19]:
# forward pass on a single example (unsqueeze adds the batch dimension)
sample = enc_ds_d['train'][0]
outputs = model(input_ids=sample['input_ids'].unsqueeze(0), labels=sample['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.4215, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-4.2376, -4.1543, -2.9306, -2.1598, -1.3130, -0.7384, -0.6960, -1.9610,
         -3.2477, -3.9503, -3.9065, -2.9795, -2.5574, -1.0898, -0.4721, -1.1970,
         -1.9223, -3.2381, -4.6652, -4.3457, -3.6033, -2.8588, -1.1986, -0.7760,
         -0.5751, -2.0190, -2.8463, -4.6242, -4.5797, -2.7286, -2.3781, -1.3806,
         -0.7603, -0.7602, -1.5885, -3.4440, -4.4978, -4.0912, -2.8215, -2.0947,
         -1.2550, -0.6153, -1.2498, -1.5860, -3.2566, -4.2045, -4.2450, -3.0765,
         -2.4564, -0.9493, -0.8196, -0.9362, -1.8374, -3.1738]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
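
The `BinaryCrossEntropyWithLogitsBackward0` in the output indicates that the loss is binary cross-entropy computed on raw logits. As a rough illustration, the sketch below computes that loss by hand in NumPy using the numerically stable form; the logits and targets are made up. With all logits at zero every probability is 0.5, so the loss equals ln 2.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits (numerically stable form)."""
    x = np.asarray(logits, dtype=float)
    y = np.asarray(targets, dtype=float)
    # max(x, 0) - x*y + log(1 + exp(-|x|)) equals
    # -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(x)
    per_element = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return per_element.mean()

# Invented example: one sample, three labels, all logits zero
loss = bce_with_logits([[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]])
print(loss)
```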

Setting up our Trainer:

In [20]:
trainer = Trainer(
    model,
    args,
    train_dataset=enc_ds_d["train"],
    eval_dataset=enc_ds_d["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

## Predict

Set up the test dataset for prediction:

In [29]:
def tok_func(x): return tokenizer(x["full_text"],padding="max_length", truncation=True)

In [30]:
ds_tst = Dataset.from_pandas(df_tst)

In [31]:
eval_ds = ds_tst.map(tok_func, batched=True, remove_columns=["text_id","full_text"])


In [32]:
preds = trainer.predict(eval_ds)

***** Running Prediction *****
  Num examples = 3
  Batch size = 16


In [33]:
preds_tt = torch.Tensor(preds.predictions)

In [34]:
# apply sigmoid
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(preds_tt)

Reshape the flat probabilities into a NUM_TST x NUM_CATS x NUM_GRADES tensor:

In [35]:
probs_cat = torch.reshape(probs,(eval_ds.num_rows,6,-1))
probs_cat.shape

torch.Size([3, 6, 9])

Initialize a 1 x 1 x NUM_GRADES tensor of grade values so we can compute expectations along the grade axis:

In [36]:
cat_grade = torch.arange(1,5.5,0.5)[None,None]
cat_grade

tensor([[[1.0000, 1.5000, 2.0000, 2.5000, 3.0000, 3.5000, 4.0000, 4.5000,
          5.0000]]])

The physicist's best friend 🌴: Einstein summation, used here to weight each grade by its probability and sum over the grades.

In [37]:
grades = torch.einsum('ijk,ijk->ij', probs_cat, cat_grade)
grades 

tensor([[2.9371, 2.8197, 3.0020, 2.8806, 2.8220, 2.9016],
        [2.8743, 2.7761, 3.0065, 2.8705, 2.7872, 2.8324],
        [3.1504, 2.9996, 3.1287, 2.9691, 2.9335, 3.0814]])
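
The `einsum` call above is a grade-weighted sum over the last axis. It equals an expected grade only if the nine probabilities of a category sum to 1, which independent sigmoid outputs do not guarantee. A minimal sketch of the weighted sum plus an optional renormalization follows; it uses NumPy and made-up probabilities, and the renormalization step is a suggestion rather than something this notebook does.

```python
import numpy as np

grade_values = np.arange(1.0, 5.5, 0.5)  # 9 possible grades: 1.0, 1.5, ..., 5.0

# Made-up "probabilities" for 1 essay x 2 categories x 9 grades
probs = np.zeros((1, 2, 9))
probs[0, 0, 4] = 1.0   # category 0: all mass on grade 3.0
probs[0, 1, 2] = 0.4   # category 1: 0.4 on grade 2.0 ...
probs[0, 1, 4] = 0.4   # ... and 0.4 on grade 3.0 (total mass 0.8)

# Grade-weighted sum over the last axis
raw = np.einsum("ijk,k->ij", probs, grade_values)

# Optional: divide by the total mass so each category is a proper expectation
normalized = raw / probs.sum(axis=-1)

print(raw)
print(normalized)
```

In this toy case the second category moves from 2.0 to 2.5 after renormalization, which is the midpoint of the two grades carrying equal mass.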

Let's convert this to a dataframe:

In [38]:
sub = pd.DataFrame(grades.numpy())
sub.columns = df_ss.columns[1:]
sub["text_id"] = ds_tst["text_id"]
sub

|   | cohesion | syntax | vocabulary | phraseology | grammar | conventions | text_id |
|---|---|---|---|---|---|---|---|
| 0 | 2.937111 | 2.819667 | 3.002044 | 2.880621 | 2.822022 | 2.901631 | 0000C359D63E |
| 1 | 2.874267 | 2.776135 | 3.006503 | 2.870503 | 2.78722 | 2.832387 | 000BAD50D026 |
| 2 | 3.150431 | 2.999614 | 3.128673 | 2.969058 | 2.933495 | 3.081372 | 00367BB2546B |


We need to rearrange the columns to get the right format.

In [39]:
sub = sub[df_ss.columns]
sub 

|   | text_id | cohesion | syntax | vocabulary | phraseology | grammar | conventions |
|---|---|---|---|---|---|---|---|
| 0 | 0000C359D63E | 2.937111 | 2.819667 | 3.002044 | 2.880621 | 2.822022 | 2.901631 |
| 1 | 000BAD50D026 | 2.874267 | 2.776135 | 3.006503 | 2.870503 | 2.78722 | 2.832387 |
| 2 | 00367BB2546B | 3.150431 | 2.999614 | 3.128673 | 2.969058 | 2.933495 | 3.081372 |


In [40]:
sub.to_csv("submission.csv", index=False)