# Classification Fine Tuning

This notebook performs the <b>"fine-tuning"</b> of a pretrained model, specializing it beyond the capabilities it acquired during pre-training.

This requires a dataset specific to the domain the model is being adapted to.
During fine-tuning, the model's weights are updated starting from the values obtained in pre-training.

In [2]:
#!uv pip install datasets
#!uv pip install transformers[torch]

As an example, we fine-tune a model for sentiment analysis (binary classification) on restaurant reviews written in English.

The first step is to obtain a dataset for fine-tuning. We use Yelp Polarity, downloaded from the Hugging Face (HF) Hub.

In [3]:
from datasets import load_dataset

dataset = load_dataset("yelp_polarity") 

# Inspects the dataset to understand the structure 
print(dataset) 

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 560000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 38000
    })
})


In [4]:
# Accesses the train split 
train_dataset = dataset['train']

# Prints the first example
print(train_dataset[0])

{'text': "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.", 'label': 0}


Because the dataset contains reviews on many different topics, we need to extract only those about restaurants.<br>
We also reduce the number of examples so that supervised fine-tuning (SFT) is feasible without dedicated hardware.

In [5]:
# Selects the train and test splits 
train_dataset = dataset["train"] 
test_dataset = dataset["test"]   

# Filters for restaurant-related reviews in the train and test datasets 
restaurant_train_reviews = train_dataset.filter( 
    lambda x: "restaurant" in x["text"].lower()
)

restaurant_test_reviews = test_dataset.filter( 
    lambda x: "restaurant" in x["text"].lower()
)

number_of_reviews = 5000

# Shuffles and gets 5,000 rows 
subset_train_reviews = restaurant_train_reviews.shuffle(
    seed = 42).select(range(number_of_reviews))
subset_test_reviews = restaurant_test_reviews.shuffle(
    seed = 42).select(range(number_of_reviews))

# Collects both splits in a plain dict 
subset_dataset = {
    "train": subset_train_reviews,
    "test": subset_test_reviews
}

# Wraps the dict in a DatasetDict 
from datasets import DatasetDict
yelp_restaurant_dataset = DatasetDict(subset_dataset)

# Prints the dataset structure 
print(yelp_restaurant_dataset) 

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
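
The filter-then-sample logic can also be sketched on a plain Python list, which makes the two steps easy to test in isolation. This is only an illustration, not the `datasets` API: the helper `filter_and_sample` and the toy reviews are hypothetical. Unlike `select(range(number_of_reviews))`, which assumes at least that many rows survive the filter, the sketch caps the sample at the number of matches.

```python
import random

def filter_and_sample(records, keyword, max_rows, seed=42):
    """Keep records whose text mentions `keyword`, then draw a reproducible sample."""
    # Case-insensitive substring match, as in the filter lambda
    matches = [r for r in records if keyword in r["text"].lower()]
    # Seeded shuffle so the same subset is drawn on every run
    rng = random.Random(seed)
    rng.shuffle(matches)
    # Cap the sample size at the number of matches
    return matches[:min(max_rows, len(matches))]

# Hypothetical toy data, not taken from Yelp Polarity
toy_reviews = [
    {"text": "Great RESTAURANT, friendly staff.", "label": 1},
    {"text": "The dentist was late again.", "label": 0},
    {"text": "Worst restaurant in town.", "label": 0},
]
sample = filter_and_sample(toy_reviews, "restaurant", max_rows=5)
print(len(sample))
```

Note that a keyword match is a coarse filter: a review that merely mentions the word "restaurant" is kept even if it is about something else.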


Display the first record of the new training set.

In [6]:
yelp_restaurant_dataset['train'][0]

{'text': 'My girlfriend and I have been wanting to come here for awhile, we finally came & we had the worst experience ever. We asked our server for a few minutes to look over the menu & he never came back. 15 minutes later, someone finally came and took our order. We waited awhile and when they brought our food, they got the whole order wrong. My girlfriend ordered soup and it never came out. Worst service ever. Would not recommend this restaurant to anyone.',
 'label': 0}

### TOKENIZE DATASET

With the dataset ready, we load the tokenizer that belongs to the pretrained model (distilbert-base-uncased in this case) and tokenize the dataset.

In [6]:
from transformers import AutoTokenizer

# Loads the tokenizer of the pretrained checkpoint (DistilBERT) 
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# Function to tokenize the dataset 
def tokenize_function(examples):
    return tokenizer(examples["text"],
                     padding = "max_length",
                     truncation = True,
                     max_length = 512)

# Applies the tokenization function to the dataset 
tokenized_datasets = yelp_restaurant_dataset.map( 
                         tokenize_function,
                         batched=True)
tokenized_datasets

Map:   0%|          | 0/5000 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label', 'input_ids', 'attention_mask'],
        num_rows: 5000
    })
})
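
To make the effect of `padding = "max_length"` and `truncation = True` concrete, here is a simplified, dependency-free sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate` and the token ids are made up for illustration; the real tokenizer also handles special tokens, which this sketch ignores. The padding id 0 matches the `padding_idx=0` shown in the model summary of this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    # Truncation: drop tokens beyond max_length
    kept = token_ids[:max_length]
    # Padding: fill the remainder with pad_id
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The mask marks real tokens with 1 and padding with 0
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# A short sequence is padded, a long one is truncated
short_ids, short_mask = pad_or_truncate([7, 8, 9, 10], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every review to 512 tokens keeps all rows the same length, at the cost of extra computation on short reviews.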

Next, we download the pretrained model on which to perform SFT.
Loading it through AutoModelForSequenceClassification adds a classification head suited to the task at hand, sequence classification (assigning one of n descriptive labels to a given input sequence).

In [7]:
from transformers import AutoModelForSequenceClassification
import torch

# Loads a pretrained model for sequence classification 
model = AutoModelForSequenceClassification.from_pretrained(  
            model_checkpoint, num_labels = 2)

# Determines the device 
if torch.backends.mps.is_available(): 
    device = torch.device("mps")
else:
    device = torch.device(
        "cuda" if torch.cuda.is_available() else "cpu")

# Moves the model to the selected device 
model.to(device) 

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


## TRAINING

Training relies on two HF classes:
- TrainingArguments: holds all the hyperparameters that drive the training phase
- Trainer: uses the class above to run the training
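
As a rough sanity check on the hyperparameters, the number of optimization steps can be estimated from the dataset size, batch size and epoch count used in this notebook. The sketch below assumes a single device and no gradient accumulation; with several devices or accumulation the count would be lower.

```python
import math

num_examples = 5000   # rows in the training subset
batch_size = 16       # per_device_train_batch_size
num_epochs = 3        # num_train_epochs

# Assumption: one device, no gradient accumulation, last partial batch kept
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 313 939
```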

# WARNING!!!

DO NOT RUN ON CPU: the kernel crashes!

To be retried on Colab.

In [None]:
from transformers import Trainer, TrainingArguments

# Sets up training arguments 
training_args = TrainingArguments( 
    output_dir = "./results",  # Directory in which to save results 
    eval_strategy = "epoch",  # Evaluates model after each epoch 
    save_strategy = "epoch",  # Saves the model after each epoch 
    learning_rate = 2e-5,  # Learning rate 
    per_device_train_batch_size = 16,  # Batch size for training 
    per_device_eval_batch_size = 16,  # Batch size for evaluation 
    num_train_epochs = 3,  # Number of training epochs 
    weight_decay = 0.01,  # Weight decay for regularization 
    logging_dir = "./logs",  # Directory for logs 
    logging_steps = 10,  # Logs every 10 steps 
    save_steps = 500,  # Only used when save_strategy is "steps" 
    load_best_model_at_end = True,  # Loads the best model at the end of training 
)

# Sets up the Trainer 
trainer = Trainer( 
    model = model,
    args = training_args,
    train_dataset = tokenized_datasets["train"],
    eval_dataset = tokenized_datasets["test"],
)

# Fine-tunes the model 
trainer.train()  





Once fine-tuning is complete, the new model and its tokenizer can be saved to a dedicated folder, so that they can be reloaded later for inference.

In [None]:
# Saves the fine-tuned model and tokenizer 
model.save_pretrained("./results/final_model")  
tokenizer.save_pretrained("./results/final_tokenizer")   

The fine-tuned model can now be evaluated.

In [None]:
# Evaluates the model 
eval_results = trainer.evaluate() 
print(f"Evaluation results: {eval_results}")
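
The Trainer in this notebook is created without a metrics function, so the evaluation reports the loss rather than a task metric such as accuracy. A minimal sketch of an accuracy function follows; it is written without dependencies so that it works on plain lists as well as arrays, and it assumes the input unpacks into `(logits, labels)`. The name `compute_metrics` matches the `Trainer` argument of the same name, but this particular implementation is only an illustration.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels)."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row
    predictions = [
        max(range(len(row)), key=lambda i: row[i]) for row in logits
    ]
    # Count predictions that match the reference labels
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for three examples and their reference labels
toy_logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
toy_labels = [1, 0, 0]
result = compute_metrics((toy_logits, toy_labels))
print(result)
```

To use it, one would pass `compute_metrics = compute_metrics` when constructing the `Trainer`.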

With fine-tuning complete and the new model saved, the model can be used to run sentiment analysis on new restaurant reviews.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Reloads the model and tokenizer 
new_model = AutoModelForSequenceClassification.from_pretrained( 
                "./results/final_model")
new_tokenizer = AutoTokenizer.from_pretrained(  
                "./results/final_tokenizer")

# Moves the model to the device selected earlier 
new_model.to(device) 

sentence = '''
I had an amazing experience dining at this restaurant last night.
From the moment we walked in, the staff made us feel welcomed and
were incredibly attentive. Our server was friendly, knowledgeable,
and made great recommendations from the menu.

The food was absolutely delicious. I had the grilled salmon, and
it was cooked to perfection—tender, flavorful, and served with a
lovely citrus glaze that complemented it beautifully. The roasted 
vegetables on the side were fresh and perfectly seasoned. My
partner had the pasta, which was creamy and rich in flavor, with
just the right amount of spice.

The ambiance was warm and inviting, with cozy lighting and tasteful
decor. It was the perfect place to relax and enjoy a nice meal. The 
dessert, a decadent chocolate lava cake, was the perfect way to end
the meal.

Overall, this restaurant exceeded my expectations in every way.
Excellent food, exceptional service, and a wonderful atmosphere.
I'll definitely be back and highly recommend it to anyone looking
for a great dining experience.
'''

# Tokenizes the input sentence 
inputs = new_tokenizer(sentence, 
                       return_tensors = "pt",
                       padding = True,
                       truncation = True,
                       max_length = 512)

# Moves inputs to GPU/MPS 
inputs = {key: value.to(device) for key, value in inputs.items()} 

# Puts the model in evaluation mode 
new_model.eval() 

# Runs the model to get predictions 
with torch.no_grad():
    outputs = new_model(**inputs) 

# Gets the logits (raw scores) from the model output 
logits = outputs.logits 
# Converts logits to probabilities using Softmax 
probabilities = torch.nn.functional.softmax(logits, dim=-1) 
# Gets the predicted class (index of the maximum probability) 
predicted_class = torch.argmax(probabilities, dim=-1).item() 

# Outputs the predicted sentiment
if predicted_class == 1:
    confidence = probabilities[0][1].item()
    print(f"Sentiment: Positive (Confidence: {confidence:.2f})")
else:
    confidence = probabilities[0][0].item()
    print(f"Sentiment: Negative (Confidence: {confidence:.2f})")
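
The last step, turning logits into probabilities and a predicted class, can be reproduced in plain Python to see what softmax and argmax do. The logits below are made up for illustration and are not the output of the fine-tuned model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores in the order [negative, positive]
toy_logits = [-1.2, 2.3]
probs = softmax(toy_logits)
# Index of the largest probability, as torch.argmax does
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class, [round(p, 2) for p in probs])
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly yields the same predicted class; the probabilities are only needed for the confidence value.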