# Fine-tuning a transformer model using the Trainer API

The [Trainer](https://huggingface.co/docs/transformers/v4.34.1/en/main_classes/trainer#trainer) class provides a ready-made training and evaluation loop for fine-tuning any of the pre-trained models.


**NOTE:** `Trainer` also runs on a CPU, but training there is very slow. A GPU is recommended for running this notebook.

## Processing the dataset

### Loading the dataset

In [None]:
from datasets import load_dataset

In [None]:
raw_dataset = load_dataset('financial_phrasebank', 'sentences_allagree')
raw_dataset

In [None]:
raw_dataset = raw_dataset['train'].train_test_split(test_size=0.2, stratify_by_column="label")
raw_dataset

In [None]:
raw_dataset['train'].features

In [None]:
raw_dataset['train'][25]

### Preprocessing the dataset

In [None]:
from transformers import AutoTokenizer

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [None]:
def tokenize_dataset(example):
    return tokenizer(example['sentence'], truncation=True)

In [None]:
tokenized_dataset = raw_dataset.map(tokenize_dataset, batched=True)
tokenized_dataset

`Dataset.map()` preprocesses the whole dataset efficiently while keeping the `DatasetDict` structure intact. The `map()` method applies a function to each example of the dataset, or to a batch of examples at a time when `batched=True` is set, as it is here. Any other preprocessing can be applied with `map()` in the same way.

The `tokenize_dataset(example)` function above takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids` and `attention_mask`. `map()` adds these keys to the dataset as new columns.
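To make the data flow concrete, here is a minimal pure-Python sketch of what a batched mapping function receives and returns. It does not use the real tokenizer or `datasets`: `toy_tokenize_batch`, `toy_map`, and the example sentences and labels are made up for illustration, and the word-to-id scheme is only a stand-in for real subword tokenization.

```python
def toy_tokenize_batch(batch):
    """Hypothetical stand-in for a tokenizer: one integer id per lowercase word."""
    vocab = {}
    input_ids = []
    for sentence in batch["sentence"]:
        # Assign the next free id to each unseen word, reuse ids for seen words.
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in sentence.lower().split()]
        input_ids.append(ids)
    # Every real token gets attention mask 1 (no padding has been added yet).
    attention_mask = [[1] * len(ids) for ids in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


def toy_map(columns, fn):
    """Mimic a batched map over one batch: returned keys become new columns."""
    return {**columns, **fn(columns)}


# In batched mode the function sees a dict of column lists, not a single row.
columns = {"sentence": ["Profit rose sharply", "Sales fell"], "label": [2, 0]}
mapped = toy_map(columns, toy_tokenize_batch)

print(sorted(mapped.keys()))
print(mapped["input_ids"])
```

The original `sentence` and `label` columns survive, and the two new columns sit alongside them with one entry per example.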

In [None]:
from transformers import DataCollatorWithPadding

If padding had been done earlier, during tokenization, every sentence would have been padded to the length of the *longest sequence*, which is wasteful. It is more efficient to pad while building a batch: each batch is then padded to the maximum length in that batch, not the maximum length in the entire dataset. This can save a lot of time and processing power when the input lengths vary widely.

Padding while building the batch is called *dynamic padding*. It is done by a *collate function*, which is passed as a parameter when building a `DataLoader`. One can write a custom collate function or use a pre-written one such as `DataCollatorWithPadding`.
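The saving from dynamic padding can be sketched with plain Python lists. This is an illustration only, not how `DataCollatorWithPadding` is implemented: `pad_batch` and `count_padded_tokens` are hypothetical helpers, and the sequence lengths are invented.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def count_padded_tokens(sequences, batch_size, dynamic):
    """Count tokens the model would process, including padding tokens."""
    global_max = max(len(seq) for seq in sequences)
    total = 0
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        # Dynamic padding uses the batch maximum; static padding uses the dataset maximum.
        width = max(len(seq) for seq in batch) if dynamic else global_max
        total += width * len(batch)
    return total


# Made-up token sequences: three short ones and one long outlier.
sequences = [[1] * n for n in (3, 4, 5, 40)]

static_total = count_padded_tokens(sequences, batch_size=2, dynamic=False)
dynamic_total = count_padded_tokens(sequences, batch_size=2, dynamic=True)

print(pad_batch([[1, 2, 3], [4]]))
print(static_total, dynamic_total)
```

With these lengths, static padding processes 160 tokens and dynamic padding 88, because only the batch containing the 40-token outlier pays for its length.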

In [None]:
dataCollator = DataCollatorWithPadding(tokenizer=tokenizer)

## Fine-tuning the model with the Trainer API

### Defining the training arguments

In [None]:
from transformers import TrainingArguments

In [None]:
training_args = TrainingArguments(
    output_dir='./../../model/fin_sentiment_distilbert/',  # where checkpoints are saved
    evaluation_strategy='epoch',  # evaluate at the end of every epoch
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
)

training_args

### Defining the model

In [None]:
from transformers import AutoModelForSequenceClassification

In [None]:
checkpoint = 'distilbert-base-uncased'

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

### Defining the evaluation metric

In [None]:
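import numpy as np
import evaluate  # replaces the deprecated `datasets.load_metric`; install with `pip install evaluate`

# Load the metrics once, rather than on every evaluation call.
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # Macro averaging gives each of the three classes equal weight.
    f1 = f1_metric.compute(predictions=predictions, references=labels, average='macro')["f1"]
    return {"accuracy": accuracy, "f1": f1}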
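To show what `compute_metrics` reports, the sketch below recomputes accuracy and macro-averaged F1 with NumPy alone. It is an illustration of the arithmetic, not the metric library's implementation: `accuracy_and_macro_f1` is a hypothetical helper, the logits and labels are invented, and it scores every class in `range(num_labels)`, which may differ from a library result when a class never appears.

```python
import numpy as np


def accuracy_and_macro_f1(logits, labels, num_labels):
    """Accuracy and unweighted mean of per-class F1 scores."""
    predictions = np.argmax(logits, axis=-1)
    accuracy = float(np.mean(predictions == labels))

    f1_scores = []
    for c in range(num_labels):
        tp = np.sum((predictions == c) & (labels == c))  # true positives
        fp = np.sum((predictions == c) & (labels != c))  # false positives
        fn = np.sum((predictions != c) & (labels == c))  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); assumed to be 0 when the class is absent.
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)

    return {"accuracy": accuracy, "f1": float(np.mean(f1_scores))}


# Made-up logits for four examples and three classes.
logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.1, 0.2, 3.0],
    [1.2, 0.3, 0.4],
])
labels = np.array([0, 1, 2, 1])

result = accuracy_and_macro_f1(logits, labels, num_labels=3)
print(result)
```

Here three of four predictions are correct (accuracy 0.75), and the per-class F1 scores of 2/3, 2/3, and 1 average to 7/9, so a single misclassification lowers the score of both classes it touches.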

### Defining the Trainer

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    data_collator=dataCollator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

### Training the model

In [None]:
trainer.train()

## Evaluating the model

In [None]:
trainer.evaluate()