# Inference with 🤗 

In [this notebook](https://www.kaggle.com/code/simonveitner/a-first-model) I showed how to build a simple model with Hugging Face.

Other than that, I just used `trainer.save_model("trainer")` to save my model in the folder `trainer`.

From there I just need to load the saved model in this notebook instead of the pretrained one and set up the trainer for inference, as described [here](https://discuss.huggingface.co/t/using-trainer-at-inference-time/9378/7). Everything else is pretty much the same 😊

In [1]:
from datasets import Dataset,DatasetDict
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pathlib import Path
import seaborn as sns
from transformers import AutoModelForSequenceClassification,AutoTokenizer
import torch

# Set nice style
plt.style.use(['dark_background'])

## Setup

In [2]:
path = Path("feedback-prize-english-language-learning")

Get an overview of what is contained in our folder.

In [3]:
!ls {path}

sample_submission.csv  test.csv  train.csv


In [4]:
df_ss = pd.read_csv(path/"sample_submission.csv")
df_tst = pd.read_csv(path/"test.csv")
df_trn = pd.read_csv(path/"train.csv")

In [5]:
labels = df_trn.columns[2:]

## Preprocess

To use 🤗 Transformers we need our data as a `Dataset`.

Let's initialize our tokenizer:

In [6]:
model_nm = 'trainer'

In [7]:
tokz = AutoTokenizer.from_pretrained(model_nm)

Let's see how exactly this works:

In [8]:
tokz.tokenize("I want to get tokenized!")

['▁I', '▁want', '▁to', '▁get', '▁token', 'ized', '!']

We can encode...

In [9]:
enc = tokz.encode("I want to get tokenized!")
enc

[1, 273, 409, 264, 350, 10704, 4666, 300, 2]

... and decode

In [10]:
tokz.decode(enc)

'[CLS] I want to get tokenized![SEP]'

Now we need to do two things:
* tokenize our data, i.e. transform the text into a form the model can process
* set up the multi-label targets for the model.

Let's first tackle the problem of generating the label categories:
* We one-hot encode the grade columns.
* We use the resulting DataFrame to generate our dataset!

In [11]:
tempdf = pd.get_dummies(df_trn, columns=labels)
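To see what `pd.get_dummies` does to a grade column, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical, not taken from the competition files). Each distinct grade becomes its own indicator column named `<column>_<grade>`, which is where label names like `cohesion_1.0` come from.

```python
import pandas as pd

# Hypothetical toy data: three essays, one graded category.
toy = pd.DataFrame({
    "text_id": ["a", "b", "c"],
    "cohesion": [1.0, 1.5, 1.0],
})

# One indicator column per distinct grade value found in the data.
toy_onehot = pd.get_dummies(toy, columns=["cohesion"])

print(list(toy_onehot.columns))
# Cast to int so the printout is the same whether pandas returns bool or uint8.
print(toy_onehot.drop(columns="text_id").astype(int).values.tolist())
```

Note that only grades that actually occur in the data get a column, so the number of labels depends on the training set.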

In [12]:
ds = Dataset.from_pandas(tempdf)

Split into training and validation sets!

In [13]:
ds_d = ds.train_test_split(0.25, seed=42)

Mapping between labels and integers

In [14]:
labels = [label for label in ds_d['train'].features.keys() if label not in ['text_id', 'full_text']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels[:5]

['cohesion_1.0',
 'cohesion_1.5',
 'cohesion_2.0',
 'cohesion_2.5',
 'cohesion_3.0']
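Since every label name encodes both the category and the grade, it can be taken apart again when needed. A small sketch, using a hypothetical helper `split_label` and a shortened, made-up label list:

```python
def split_label(name):
    """Split a one-hot column name like 'cohesion_1.5' into ('cohesion', 1.5)."""
    category, grade = name.rsplit("_", 1)
    return category, float(grade)

# Hypothetical excerpt of the label list, not the full 54 labels.
example_labels = ["cohesion_1.0", "cohesion_1.5", "syntax_3.0"]
id2label_example = dict(enumerate(example_labels))
label2id_example = {label: idx for idx, label in id2label_example.items()}

print(split_label(id2label_example[1]))
print(label2id_example["syntax_3.0"])
```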

In [15]:
def preprocess_data(examples):
    # take a batch of texts
    text = examples["full_text"]
    # encode them
    encoding = tokz(text, padding="max_length", truncation=True, max_length=128)
    # add labels
    labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
    # create numpy array of shape (batch_size, num_labels)
    labels_matrix = np.zeros((len(text), len(labels)))
    # fill numpy array
    for idx, label in enumerate(labels):
        labels_matrix[:, idx] = labels_batch[label]

    encoding["labels"] = labels_matrix.tolist()
  
    return encoding
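The label part of `preprocess_data` can be tried in isolation. The sketch below assumes a made-up batch in the column-oriented layout that `map(..., batched=True)` passes to the function (one list per column) and builds the same `(batch_size, num_labels)` matrix with `np.stack` instead of a loop.

```python
import numpy as np

# Hypothetical batch: one list per column, three texts, two labels.
batch = {
    "full_text": ["first essay", "second essay", "third essay"],
    "cohesion_1.0": [1, 0, 0],
    "cohesion_1.5": [0, 1, 1],
}
label_names = ["cohesion_1.0", "cohesion_1.5"]

# Put the per-label columns side by side: shape (batch_size, num_labels).
labels_matrix = np.stack([batch[name] for name in label_names], axis=1).astype(float)

print(labels_matrix.shape)
print(labels_matrix.tolist())
```

The labels are kept as floats because the multi-label loss works on float targets.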

In [16]:
enc_ds_d = ds_d.map(preprocess_data, batched=True, remove_columns=ds_d['train'].column_names)




In [17]:
enc_ds_d

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2933
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 978
    })
})

The data is now split into a training set and a validation set; the validation split is stored under the `test` key.
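The row counts reported above (2933 and 978) are consistent with a 25% hold-out of 3911 rows. A quick arithmetic check, assuming the test size is rounded up the way scikit-learn's `train_test_split` does it (that rounding rule is an assumption about the library's internals, not something this notebook states):

```python
import math

n_total = 2933 + 978   # train rows + test rows, taken from the output above
test_fraction = 0.25

# Assumption: the test size is rounded up to the next integer.
n_test = math.ceil(n_total * test_fraction)
n_train = n_total - n_test

print(n_train, n_test)
```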

## Setting up the model & trainer

Setting up the metric...

In [18]:
def mcrmse(x,y): return np.mean(np.sqrt(np.mean((x-y)**2,axis=0)))

In [19]:
x1 = np.array([[1,1],[0,0]])
x2 = np.array([[1,0],[0,1]])

In [20]:
mcrmse(x1,x2)

0.5
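To see where the 0.5 comes from, it helps to look at the per-column RMSE before averaging. With $n$ rows and $N_t$ target columns the metric is $\frac{1}{N_t}\sum_{j}\sqrt{\frac{1}{n}\sum_{i}(x_{ij}-y_{ij})^2}$. The sketch below repeats the computation of `mcrmse` step by step on the same two arrays:

```python
import numpy as np

x1 = np.array([[1, 1], [0, 0]])
x2 = np.array([[1, 0], [0, 1]])

# RMSE per column: column 0 matches exactly, column 1 is off by 1 in both rows.
col_rmse = np.sqrt(np.mean((x1 - x2) ** 2, axis=0))

print(col_rmse)         # [0. 1.]
print(col_rmse.mean())  # 0.5
```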

In [21]:
def mcrmse_d(eval_pred): return {'mcrmse': mcrmse(*eval_pred)}

In [22]:
from transformers import TrainingArguments,Trainer

...Batch size and metric name...

In [23]:
batch_size = 8
metric_name = "mcrmse"

...the arguments...

In [24]:
args = TrainingArguments(
    "tst_out",
    do_train = False,
    do_predict = True,
    per_device_eval_batch_size = batch_size,
    dataloader_drop_last = False
)

In [25]:
from transformers import EvalPrediction

# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    mcrmse_acc = mcrmse(y_true, y_pred)
    # return as dictionary
    metrics = {'mcrmse': mcrmse_acc}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
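The thresholding step in `multi_label_metrics` can be checked without a model. A minimal NumPy sketch with made-up logits: applying a sigmoid and thresholding at 0.5 gives the same 0/1 predictions as checking whether the raw logit is non-negative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits of shape (batch_size, num_labels).
logits = np.array([[-2.0, 0.0, 0.3],
                   [1.5, -0.1, -4.0]])

y_pred = (sigmoid(logits) >= 0.5).astype(int)
print(y_pred.tolist())

# Same result without the sigmoid, because sigmoid(0) == 0.5 and sigmoid is monotonic.
same_as_sign = (logits >= 0).astype(int)
print(np.array_equal(y_pred, same_as_sign))
```

Keep in mind that the resulting `mcrmse` value is computed on 0/1 indicators, so it is not on the same scale as an MCRMSE computed on the grades themselves.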

...the model...

In [26]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_nm, 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

In [27]:
enc_ds_d.set_format("torch")

Let's verify a batch as well as a forward pass:

In [28]:
enc_ds_d['train'][0]['labels'].type()

'torch.FloatTensor'

In [29]:
enc_ds_d['train']['input_ids'][0]

tensor([    1,  4136,   269,   311,   265,   262,   370,   874,   470,   267,
          291,   447,   260,  4136,   303,  1738,   265,  1451,   261,   306,
          286,   610,  6135,  4986,   334,  2858,  8202,   261,  7934,   261,
        13381,   795,  8202,   261,   263,   386,   310,   260,   450,   286,
          610, 34772,   272,   783,   360,   803,   261,   528,   355,   261,
          306,   402,   286,   347,  1085,   830,   334,   306,   286,   286,
         1090,  8926, 16224,  1013,   262,  8425,   261,   311,   265,   349,
          303,   266,  2553,  1013,   260,  1414,   273,   338,  3753,   288,
         4136,   273,   338,   472,   264,   687,   308,   645,   401,   273,
         1331,   308,   645,   284,   397,   324,  1964,   273, 40756,   834,
        31452,   278,   322,   263,   262,   645,  1127,   324, 12516,   263,
         7741,   260,   273,   338,   327,   472,   264,  2224,   277,   266,
         2750,   263,  5771,   390,   262,   707,   260,     2])

In [30]:
#forward pass
outputs = model(input_ids=enc_ds_d['train']['input_ids'][0].unsqueeze(0), labels=enc_ds_d['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.1572, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-7.6003, -6.2087, -1.6748, -0.1617, -0.7942, -2.9857, -5.6274, -7.9416,
         -8.3018, -7.3899, -5.6492, -1.3790,  0.0663, -1.1804, -4.2006, -6.4937,
         -8.1050, -8.4808, -8.3238, -6.9928, -3.2913, -0.7201,  0.4488, -3.5915,
         -5.5989, -7.6971, -8.4161, -6.8264, -7.3330, -1.5203, -0.0205, -0.8572,
         -3.8849, -5.9465, -8.1325, -8.1105, -7.3505, -6.5058, -0.8089, -0.1246,
         -1.5866, -4.0718, -5.6008, -7.9227, -8.0366, -6.8729, -6.0449, -1.1963,
          0.0729, -1.3299, -3.5036, -5.9796, -7.8256, -8.9868]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
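The `grad_fn` of the loss in the output above indicates binary cross-entropy computed on raw logits. As an illustration of what that loss measures, here is a NumPy sketch with the hypothetical helper `bce_with_logits`; it is not the library's implementation, just the same formula in a numerically stable form.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy on raw logits, in a numerically stable form."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(targets, dtype=float)
    # max(z, 0) - z*y + log(1 + exp(-|z|)) equals
    # -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
    per_element = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_element.mean()

# A logit of 0 means probability 0.5, so the loss is log(2) for either target.
print(bce_with_logits([0.0, 0.0], [1.0, 0.0]))

# Confident and correct predictions give a loss close to 0.
print(bce_with_logits([8.0, -8.0], [1.0, 0.0]))
```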

In [31]:
trainer = Trainer(
    model,
    args,
    train_dataset=enc_ds_d["train"],
    eval_dataset=enc_ds_d["test"],
    tokenizer=tokz,
    compute_metrics=compute_metrics
)

## Predict

Set up the test dataset:

In [32]:
def tok_func(x): return tokz(x["full_text"])

In [33]:
ds_tst = Dataset.from_pandas(df_tst)

In [34]:
eval_ds = ds_tst.map(tok_func, batched=True, remove_columns=["text_id","full_text"])


In [35]:
preds = trainer.predict(eval_ds)

***** Running Prediction *****
  Num examples = 3
  Batch size = 8
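With only 3 test examples and a batch size of 8, the single batch is incomplete, which is why `dataloader_drop_last = False` matters here. A small sketch with the hypothetical helper `num_batches` shows the effect of dropping the last incomplete batch:

```python
import math

def num_batches(num_examples, batch_size, drop_last):
    """Number of batches a data loader yields for the given settings."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

print(num_batches(3, 8, drop_last=False))  # 1
print(num_batches(3, 8, drop_last=True))   # 0
```

With `drop_last=True` there would be no batch at all, so no predictions for the three test essays.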


In [36]:
preds_tt = torch.Tensor(preds.predictions)

In [37]:
# apply sigmoid to turn the logits into probabilities
sigmoid = torch.nn.Sigmoid()
probs = sigmoid(preds_tt)

In [38]:
probs.shape

torch.Size([3, 54])

In [39]:
probs_cat = torch.reshape(probs,(eval_ds.num_rows,6,-1))
probs_cat

tensor([[[4.3149e-04, 1.6425e-03, 1.4542e-01, 4.5501e-01, 3.2861e-01,
          5.1657e-02, 3.7629e-03, 3.3773e-04, 2.2183e-04],
         [5.0841e-04, 2.9085e-03, 1.8654e-01, 5.1705e-01, 2.5370e-01,
          1.5948e-02, 1.6301e-03, 2.7220e-04, 1.8518e-04],
         [2.0637e-04, 7.6860e-04, 3.1243e-02, 3.1676e-01, 6.1975e-01,
          2.9120e-02, 3.9431e-03, 4.3889e-04, 2.0195e-04],
         [9.2469e-04, 5.5310e-04, 1.6003e-01, 4.7628e-01, 3.1655e-01,
          2.2838e-02, 2.7699e-03, 2.7647e-04, 2.7412e-04],
         [5.4792e-04, 1.2285e-03, 2.7858e-01, 4.6222e-01, 1.8477e-01,
          1.9232e-02, 4.0661e-03, 3.4774e-04, 3.0245e-04],
         [8.6861e-04, 1.9523e-03, 2.1894e-01, 5.2350e-01, 2.2068e-01,
          3.2152e-02, 2.5723e-03, 3.9099e-04, 1.1256e-04]],

        [[9.6675e-04, 3.8596e-03, 2.3323e-01, 4.9773e-01, 2.5149e-01,
          3.2815e-02, 2.7687e-03, 3.8969e-04, 2.9928e-04],
         [1.2245e-03, 6.5233e-03, 3.1888e-01, 5.1605e-01, 1.5547e-01,
          1.0238e-02, 1.2

In [40]:
pred = torch.argmax(probs_cat, axis=2)
pred

tensor([[3, 3, 4, 3, 3, 3],
        [3, 3, 4, 3, 3, 3],
        [5, 5, 5, 5, 5, 5]])

Each row of `pred` is one essay and each entry is the argmax for one category. If the argmax for a category is index $i$, the predicted grade for that category is

$grade_{cat} = i \cdot 0.5 + 1$

In [41]:
grades = pred*0.5+1
grades 

tensor([[2.5000, 2.5000, 3.0000, 2.5000, 2.5000, 2.5000],
        [2.5000, 2.5000, 3.0000, 2.5000, 2.5000, 2.5000],
        [3.5000, 3.5000, 3.5000, 3.5000, 3.5000, 3.5000]])
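The reshape, argmax and grade conversion can be reproduced end to end on made-up numbers. The sketch below uses NumPy instead of torch and assumes that the 54 outputs are ordered category by category with all nine grades (1.0 to 5.0 in steps of 0.5) present for each of the six categories. If a grade were missing from the one-hot columns, the index-to-grade formula would no longer line up.

```python
import numpy as np

num_categories, num_grades = 6, 9  # 6 * 9 = 54 outputs per essay

# Made-up probabilities for one essay: the peak is at grade index 3 for
# every category except the third one, where it is at index 4.
probs = np.full((1, num_categories * num_grades), 0.01)
peaks = [3, 3, 4, 3, 3, 3]
for cat, peak in enumerate(peaks):
    probs[0, cat * num_grades + peak] = 0.9

probs_cat = probs.reshape(probs.shape[0], num_categories, -1)  # (1, 6, 9)
grade_idx = probs_cat.argmax(axis=2)                           # (1, 6)
grades = grade_idx * 0.5 + 1

print(grades.tolist())  # [[2.5, 2.5, 3.0, 2.5, 2.5, 2.5]]
```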

Let's convert this to a dataframe:

In [42]:
sub = pd.DataFrame(grades.numpy())
sub.columns = df_ss.columns[1:]
sub["text_id"] = ds_tst["text_id"]
sub

| | cohesion | syntax | vocabulary | phraseology | grammar | conventions | text_id |
|---|---|---|---|---|---|---|---|
| 0 | 2.5 | 2.5 | 3.0 | 2.5 | 2.5 | 2.5 | 0000C359D63E |
| 1 | 2.5 | 2.5 | 3.0 | 2.5 | 2.5 | 2.5 | 000BAD50D026 |
| 2 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 00367BB2546B |


In [43]:
df_ss

| | text_id | cohesion | syntax | vocabulary | phraseology | grammar | conventions |
|---|---|---|---|---|---|---|---|
| 0 | 0000C359D63E | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| 1 | 000BAD50D026 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| 2 | 00367BB2546B | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |


We need to rearrange the columns to get the right format.

In [44]:
sub = sub[df_ss.columns]
sub 

| | text_id | cohesion | syntax | vocabulary | phraseology | grammar | conventions |
|---|---|---|---|---|---|---|---|
| 0 | 0000C359D63E | 2.5 | 2.5 | 3.0 | 2.5 | 2.5 | 2.5 |
| 1 | 000BAD50D026 | 2.5 | 2.5 | 3.0 | 2.5 | 2.5 | 2.5 |
| 2 | 00367BB2546B | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 | 3.5 |
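The notebook stops at the reordered frame. A plausible last step would be to write it to disk. The sketch below assumes the file should be called `submission.csv` (the file name is an assumption, as is the one-row stand-in frame `sub_example`) and writes to a temporary directory so it can run anywhere.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the `sub` frame built above.
sub_example = pd.DataFrame({
    "text_id": ["0000C359D63E"],
    "cohesion": [2.5], "syntax": [2.5], "vocabulary": [3.0],
    "phraseology": [2.5], "grammar": [2.5], "conventions": [2.5],
})

# index=False keeps the row index out of the file, so the header matches
# the sample submission's column layout.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
sub_example.to_csv(out_path, index=False)

with open(out_path) as f:
    header = f.readline().strip()
print(header)
```

In the notebook itself this would come down to something like `sub.to_csv("submission.csv", index=False)`.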
