# License and Attribution

This notebook was developed by Emilio Serrano, Full Professor at the Department of Artificial Intelligence, Universidad Politécnica de Madrid (UPM), for educational purposes in UPM courses. Personal website: https://emilioserrano.faculty.bio/

📘 License: Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA)

You are free to: (1) Share: copy and redistribute the material in any medium or format; (2) Adapt: remix, transform, and build upon the material.

Under the following terms: (1) Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made; (2) NonCommercial: You may not use the material for commercial purposes; (3) ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

🔗 License details: https://creativecommons.org/licenses/by-nc-sa/4.0/

# Using and fine-tuning BERT's friends in Hugging Face

In this notebook, we will explore how to use BERT-based models from Hugging Face for various NLP tasks. We will start by using pre-trained models without fine-tuning, and then we will fine-tune a model for sentiment analysis.



## Check for GPU
Make sure you have your GPU available. You will need it.

In [3]:
%pip install torch

Collecting torch
  Downloading torch-2.9.0-cp310-none-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting filelock (from torch)
  Downloading filelock-3.20.0-py3-none-any.whl.metadata (2.1 kB)
Collecting sympy>=1.13.3 (from torch)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx>=2.5.1 (from torch)
  Downloading networkx-3.4.2-py3-none-any.whl.metadata (6.3 kB)
Collecting jinja2 (from torch)
  Downloading jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting fsspec>=0.8.5 (from torch)
  Downloading fsspec-2025.9.0-py3-none-any.whl.metadata (10 kB)
Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch)
  Downloading mpmath-1.3.0-py3-none-any.whl.metadata (8.6 kB)
Collecting MarkupSafe>=2.0 (from jinja2->torch)
  Downloading markupsafe-3.0.3-cp310-cp310-macosx_11_0_arm64.whl.metadata (2.7 kB)
Downloading torch-2.9.0-cp310-none-macosx_11_0_arm64.whl (74.5 MB)

In [4]:
import torch

# Check for GPU availability
if torch.cuda.is_available():
    print(f"GPU detected: {torch.cuda.get_device_name(0)}")
else:
    print("No GPU detected.")
    print("If you are using Google Colab, please go to 'Runtime' > 'Change runtime type' and select 'GPU' as the hardware accelerator.")



No GPU detected.
If you are using Google Colab, please go to 'Runtime' > 'Change runtime type' and select 'GPU' as the hardware accelerator.


## Installation of Libraries
Let us start by installing the libraries required for fine-tuning a model. You may need to restart the environment afterwards (Runtime → Restart Runtime).

* transformers: provides pretrained models like BERT, RoBERTa, etc., for NLP tasks
* datasets: easy access to thousands of NLP datasets with built-in processing tools
* evaluate: provides standard evaluation metrics (e.g., accuracy, F1) for model evaluation
* accelerate: utility for training models efficiently on CPU, GPU, or multi-GPU setups

In [None]:
!pip install -q transformers datasets accelerate evaluate


# Using pre-trained models without fine-tuning

In this section, we'll focus on how to use pretrained transformer models without fine-tuning, leveraging their powerful language understanding capabilities right out of the box.

We'll explore the following tasks using models from the 🤗 Hugging Face Transformers library:

* 🔍 Using pretrained models for inference: Perform NLP tasks directly with models like BERT, RoBERTa, and DistilBERT.

* 🧩 Masked Language Modeling (MLM) with DistilBERT: Fill in missing words in a sentence using a distilled version of BERT.

* 🧠 Masked Language Modeling with RoBERTa: Run the same task with a more robust model that improves on BERT.

* 🔗 Next Sentence Prediction (NSP) with BERT: Use BERT to evaluate whether two sentences are likely to follow one another.

* 📚 Natural Language Inference (NLI): Use a model like roberta-large-mnli to classify the relationship between sentence pairs (entailment, neutral, contradiction).

No training or fine-tuning is required; we're using the models as-is, just like prebuilt tools.


## Masked Language Modeling with DistilBERT
First, we will use **DistilBERT** to complete masked phrases.

In [None]:
from transformers import pipeline

# Load pre-trained model
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')

# Provide example inputs
examples = [
    "The capital of France is [MASK].",
    "The largest mammal is the [MASK].",
    "Deep learning is a subset of [MASK]."
]

# Get predictions
for example in examples:
    predictions = unmasker(example)
    print(f"Input: {example}")
    for pred in predictions:
        print(f"  {pred['sequence']} (score: {pred['score']:.4f})")
    print()

...it does work great. What about **RoBERTa**?

## Masked Language Modeling with RoBERTa




In [None]:
# Load pre-trained model
unmasker = pipeline('fill-mask', model='roberta-base')

# Provide example inputs
# DistilBERT uses [MASK] as its mask token, but RoBERTa uses <mask>.
examples = [
    "The capital of France is <mask>.",
    "The largest mammal is the <mask>.",
    "Deep learning is a subset of <mask>."
]

# Get predictions
for example in examples:
    predictions = unmasker(example)
    print(f"Input: {example}")
    for pred in predictions:
        print(f"  {pred['sequence']} (score: {pred['score']:.4f})")
    print()


...a little better. And it is not even roberta-large.

DistilBERT learns a distilled (approximate) version of BERT, retaining about 97% of its performance while **using roughly 40% fewer parameters** (66M vs. 110M in BERT-base and 340M in BERT-large).

RoBERTa (Robustly optimized BERT approach), introduced at Facebook, is a retraining of BERT with an improved training methodology, roughly ten times more data, and more compute. To improve the training procedure, RoBERTa removes the Next Sentence Prediction (NSP) task from BERT's pre-training and introduces dynamic masking, so that the masked tokens change across training epochs. Larger batch sizes were also found to help. **According to the comparison linked in this section, RoBERTa improves on BERT's performance by 2-20%.**
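The difference between static and dynamic masking can be illustrated with a toy sketch. It is not RoBERTa's actual implementation (real MLM training also leaves some selected tokens unchanged or swaps them for random ones); the helper name `mask_tokens`, the 30% rate, and the seeds are assumptions chosen only to make the effect visible on a short sentence.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, rate, rng):
    """Return a copy of `tokens` with about `rate` of the positions replaced by MASK."""
    n_mask = max(1, round(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking (simplified): the mask is drawn once during preprocessing
# and the same masked sequence is reused in every epoch.
static_example = mask_tokens(tokens, rate=0.3, rng=random.Random(0))
static_epochs = [static_example for _ in range(3)]

# Dynamic masking: a new mask is drawn every time the sequence is seen.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rate=0.3, rng=rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With dynamic masking the model will usually see different masked positions for the same sentence in different epochs, which gives it more varied prediction targets from the same data.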



[See comparison of BERT, RoBERTa, DistilBERT & XLNet](https://miro.medium.com/v2/resize:fit:1243/1*5PzGl1dNt_5jMH3_mm3PpA.png)



## Next Sentence Prediction with BERT
As seen in the previous comparison, neither DistilBERT nor RoBERTa natively supports Next Sentence Prediction (NSP): DistilBERT was designed as a lighter version of BERT and does not include the NSP task in its pre-training objectives, and RoBERTa removes NSP from pre-training altogether.

Let us use BERT.
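For NSP, BERT reads both sentences as a single sequence of the form `[CLS] A [SEP] B [SEP]`, with segment IDs (`token_type_ids`) marking which tokens belong to the first and which to the second sentence. The following toy sketch shows that layout; `pack_pair` is a hypothetical helper, and whitespace splitting stands in for BERT's real WordPiece tokenizer.

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two token lists as [CLS] A [SEP] B [SEP] with matching segment IDs."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP];
    # segment 1 covers sentence B and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids

tokens, token_type_ids = pack_pair("the tower is in paris".split(),
                                   "it is famous".split())
for tok, seg in zip(tokens, token_type_ids):
    print(f"{tok:8s} segment {seg}")
```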



In [None]:
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

def get_nsp_probability(sentence1, sentence2):
    # Tokenize the sentence pair. Calling the tokenizer directly (instead of tokenizer.encode)
    # also returns token_type_ids, which tell BERT where the second sentence starts.
    inputs = tokenizer(sentence1, sentence2, return_tensors="pt")

    # Inference only: disable gradient tracking, since we are not training the model.
    with torch.no_grad():
        # Forward pass; .logits holds the raw scores for the two classes, shape (1, 2).
        logits = model(**inputs).logits
        # Softmax over the class dimension converts the raw scores into probabilities.
        probabilities = torch.softmax(logits, dim=1)
        # Index 0 is the "next sentence" class, index 1 is "not next sentence".
        # .item() converts the scalar tensor to a Python float.
        is_next_probability = probabilities[0, 0].item()

    # Print results
    print(f"Sentence 1: {sentence1}")
    print(f"Sentence 2: {sentence2}")
    print(f"Probability of being consecutive: {is_next_probability:.4f}")
    print("-" * 60)
    return is_next_probability

# Example sentences
# Pair 1
sentence1 = "The Eiffel Tower is located in Paris."
sentence2 = "It is one of the most famous structures in the world."
get_nsp_probability(sentence1, sentence2)



# Pair 2
sentence1 = "The stock market closed higher after the latest earnings reports."
sentence2 = "Penguins can hold their breath for up to 20 minutes underwater."
get_nsp_probability(sentence1, sentence2)




While BERT includes a `Next Sentence Prediction (NSP)` objective during pretraining, its actual performance on this task can be misleading. In many practical cases, BERT tends to assign high probabilities even to sentence pairs that are clearly unrelated.
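To see how raw NSP scores turn into the probabilities printed by the cell above, here is the softmax computation done by hand. The logits are hypothetical values, not output from BERT; the point is only that a moderate gap between the two logits already yields a probability close to 1, so a high score should be read with some caution.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical NSP logits: index 0 = "is next", index 1 = "not next".
logits = [3.0, -1.0]
probs = softmax(logits)
print(f"P(is next)  = {probs[0]:.4f}")  # 0.9820
print(f"P(not next) = {probs[1]:.4f}")  # 0.0180
```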

Because of these limitations, many newer models have removed the NSP objective entirely. For example:

* RoBERTa and DistilBERT drop the NSP loss during pretraining, and ALBERT replaces it with a sentence-order prediction objective, because research has shown that NSP does not significantly benefit downstream tasks like question answering or sentence classification.

* Instead, tasks like textual entailment (e.g., with models trained on the Multi-Genre Natural Language Inference (MNLI) benchmark like `roberta-large-mnli`) offer more robust and interpretable results when comparing sentence relationships.

## Natural Language Inference (NLI)

NLI is a core task in Natural Language Processing (NLP) that involves determining the logical relationship between two sentences:

* The premise: a given statement.

* The hypothesis: another statement to be evaluated in relation to the premise.

The goal is to classify the relationship into one of three categories:

  * Entailment: the hypothesis logically follows from the premise.
  * Neutral: the hypothesis is possibly true, but not guaranteed by the premise.
  * Contradiction: the hypothesis directly contradicts the premise.

`roberta-large-mnli` is a pretrained language model developed by Facebook AI (now Meta AI). It is based on the RoBERTa architecture and fine-tuned on the MNLI dataset. **MNLI** stands for Multi-Genre Natural Language Inference. It is a large benchmark dataset designed to test a model's ability to perform NLI across multiple genres (or domains) of text, such as: Fiction, Government documents, Telephone conversations, Travel guides, Slate magazine articles...

NLI models can be used to generate high-quality semantic representations of sentences. Because they are trained to reason about sentence meaning and relationships, their internal embeddings capture rich contextual information. NLI models can also be adapted for zero-shot text classification. By phrasing class labels as hypotheses, you can evaluate how likely a sentence "entails" a label.  
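The zero-shot idea can be sketched without loading a model: each candidate label is turned into a hypothesis, the input text serves as the premise, and the label whose hypothesis gets the highest entailment score wins. In this sketch the helper names, the hypothesis template, and the scores are all placeholders; in practice the scores would come from an NLI model such as `roberta-large-mnli`.

```python
def build_nli_pairs(text, candidate_labels, template="This example is about {}."):
    """Turn each candidate label into a hypothesis paired with the same premise."""
    return [(text, template.format(label)) for label in candidate_labels]

def pick_label(candidate_labels, entailment_scores):
    """Return the label whose hypothesis received the highest entailment score."""
    best = max(range(len(candidate_labels)), key=lambda i: entailment_scores[i])
    return candidate_labels[best]

text = "The new graphics card doubles the frame rate in most games."
labels = ["technology", "cooking", "politics"]

pairs = build_nli_pairs(text, labels)
for premise, hypothesis in pairs:
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")

# Placeholder scores standing in for an NLI model's entailment probabilities.
fake_entailment_scores = [0.91, 0.02, 0.05]
print(pick_label(labels, fake_entailment_scores))  # technology
```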

In [None]:
from transformers import pipeline

# Load the NLI model
classifier = pipeline("text-classification", model="roberta-large-mnli")

# Define sentence pairs to test
examples = [
    {
        "premise": "The Eiffel Tower is located in Paris.",
        "hypothesis": "It is one of the most famous structures in the world."
    },
    {
        "premise": "The stock market closed higher after the latest earnings reports.",
        "hypothesis": "Penguins can hold their breath for up to 20 minutes underwater."
    },
    {
        "premise": "A man is playing guitar on a stage.",
        "hypothesis": "A person is making music."
    },
    {
        "premise": "A cat is sleeping on the couch.",
        "hypothesis": "The dog is barking loudly."
    }
]

# Evaluate each sentence pair
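for i, pair in enumerate(examples, 1):
    # Pass premise and hypothesis as a pair so the tokenizer inserts RoBERTa's own separator tokens
    result = classifier({"text": pair["premise"], "text_pair": pair["hypothesis"]})
    if isinstance(result, list):  # some transformers versions wrap the result in a list
        result = result[0]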

    print(f"Example {i}:")
    print(f"Premise   : {pair['premise']}")
    print(f"Hypothesis: {pair['hypothesis']}")
    print(f"Prediction: {result['label']} (score: {result['score']:.4f})")
    print("-" * 60)

# Fine-tune a pretrained model with PyTorch Trainer

Here we will fine-tune a pretrained model with the 🤗 Transformers **Trainer**. The code is based on the Hugging Face [tutorial](https://huggingface.co/docs/transformers/training).


Transformers provides a [Trainer](https://huggingface.co/docs/transformers/v4.41.3/en/main_classes/trainer#transformers.Trainer) class optimized for training 🤗 Transformers models, making it easier to start training without manually writing your own training loop. The [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) API supports a wide range of training options.

The Trainer class provides an API for feature-complete training in PyTorch. Other options include using TensorFlow with Keras or native PyTorch for the fine-tuning. Using high-level libraries has its disadvantages but tends to save time (especially in debugging).



## Load a dataset for text classification

We will use the [yelp_review_full](https://huggingface.co/datasets/Yelp/yelp_review_full) dataset, where reviews about businesses are labeled with 1 to 5 stars (labels 0 to 4).

In [None]:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")

# Print dataset example
print(dataset['train'][0])
print(dataset['train'][1])



## Preparing the dataset with AutoTokenizer

In the Hugging Face ecosystem, a `tokenizer`'s primary task is to preprocess textual data into a format that can be fed into a model.

The main tasks of a tokenizer include:

1. Tokenization: Splitting the input text into subwords or tokens. For example, the word "playing" might be split into ["play", "##ing"].
2. Conversion to IDs: Mapping tokens to their corresponding numerical IDs as per the model's vocabulary.
3. Padding: Adding padding tokens to ensure that all input sequences in a batch are of the same length.
4. Truncation: Cutting off sequences that are longer than the maximum allowed length.
5. Adding Special Tokens: Inserting special tokens such as `[CLS]` (classification token) and `[SEP]` (separator token), which are required by specific models like BERT.
6. Creating Attention Masks: Generating masks that indicate which tokens should be attended to (1 for real tokens, 0 for padding tokens).
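The steps above can be sketched with a toy encoder. This is only an illustration: the vocabulary and the `encode` helper are made up, and whitespace splitting replaces the subword tokenization that a real Hugging Face tokenizer performs.

```python
# Toy vocabulary; real tokenizers ship vocabularies with tens of thousands of entries.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "food": 5, "was": 6, "great": 7, "service": 8, "slow": 9}

def encode(text, max_length):
    """Tokenize on whitespace, map to IDs, add special tokens, truncate and pad."""
    tokens = text.lower().split()
    # Truncation: leave room for the [CLS] and [SEP] special tokens.
    tokens = tokens[: max_length - 2]
    ids = [VOCAB["[CLS]"]] + [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens] + [VOCAB["[SEP]"]]
    attention_mask = [1] * len(ids)
    # Padding: fill up to max_length and mark the padding positions with 0 in the mask.
    n_pad = max_length - len(ids)
    ids += [VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": ids, "attention_mask": attention_mask}

short = encode("The food was great", max_length=8)
long = encode("The service was slow the food was great", max_length=8)
print(short)  # padded with two [PAD] tokens
print(long)   # truncated to six words plus the two special tokens
```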

In the example, we will use the `bert-base-cased` language model (and, therefore, its tokenizer, which is retrieved using the `AutoTokenizer` class).

The 🤗 `Datasets` `map` method allows us to apply a preprocessing function over the entire dataset, here [yelp_review_full](https://huggingface.co/datasets/Yelp/yelp_review_full).

⚠️ The dataset is large (700K rows), so it can take a while. After mapping the dataset to the tokenized version, we will create a smaller subset of the full dataset to fine-tune the model on (just 1K samples). This is done to reduce training time, especially when working in environments with limited resources like Google Colab, which has time and compute restrictions. However, if you're working locally with better hardware (e.g., a dedicated GPU) or using a more powerful cloud setup, you can experiment with the full dataset to potentially achieve better performance and generalization.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Print dataset example after tokenizing. Data contain: label, text, input_ids, token_type_ids, and attention_mask
print(small_train_dataset[0])
print(small_train_dataset[1])

## Replacing the model head with AutoModelForSequenceClassification


Since our goal is to perform text classification on the [yelp_review_full dataset](https://huggingface.co/datasets/yelp_review_full#data-fields), which contains labeled reviews divided into five sentiment classes, we use the Hugging Face class `AutoModelForSequenceClassification`.

The pretrained head of the BERT model is discarded and replaced with a randomly initialized classification head. This new head is specifically designed to match the number of target classes in our task, which is why we must specify the number of classes using the `num_labels` parameter when loading the model with `AutoModelForSequenceClassification`.
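As a rough picture of what the new head looks like for BERT, the sketch below builds a single linear layer that maps the encoder's pooled output to one logit per class. It is a simplification in plain Python rather than the library's code: the random initialization (Gaussian with standard deviation 0.02) and the random `pooled` vector are assumptions standing in for the real initializer and the real encoder output.

```python
import random

hidden_size = 768   # hidden size of BERT-base
num_labels = 5      # star ratings in yelp_review_full

rng = random.Random(0)
# Randomly initialized weights and zero biases for the new head (toy initialization).
weights = [[rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
biases = [0.0] * num_labels

def classify(pooled):
    """Apply the linear head: one logit per class."""
    return [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, biases)]

# Stand-in for the pooled sentence representation produced by the encoder.
pooled = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
logits = classify(pooled)

n_params = hidden_size * num_labels + num_labels
print(len(logits), n_params)  # 5 3845
```

The head itself is tiny compared with the roughly 110M parameters of the pretrained encoder, which is part of why fine-tuning on a small dataset is feasible.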

By doing this, the model initializes a new classification layer with the appropriate output size for our problem. This new head is then trained from scratch during fine-tuning, while the pretrained base model transfers its learned knowledge to support the classification task effectively.

Alternatives in Hugging Face for fine-tuning for different tasks include:

* AutoModelForTokenClassification: for token-level tasks like Named Entity Recognition (NER).

* AutoModelForQuestionAnswering: for extractive QA tasks.

* AutoModelForMultipleChoice: for multiple choice tasks.

* AutoModelForSeq2SeqLM: for sequence-to-sequence tasks like translation or summarization.

Each of these classes replaces the pretrained head with a task-specific head initialized randomly and fine-tuned on your dataset.



⚠️ A warning is generated about some of the pretrained weights not being used and some weights being randomly initialized. The weights of the new classification head (the "head" on top of the base model) are randomly initialized, since it needs to learn from scratch for the specific task at hand.






In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)

## Selecting hyperparameters with the TrainingArguments class

Next, create a 🤗 [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) object. This class holds all the hyperparameters and configuration options for training your model, such as learning rate, batch size, number of epochs, optimizer settings, and other flags that control different training behaviors.

You can start with the default training settings, but TrainingArguments lets you customize many parameters like learning_rate, num_train_epochs, per_device_train_batch_size, and more.
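To get a feel for what these settings imply, the arithmetic below estimates the number of optimization steps for the 1K-sample subset used in this notebook. The batch size of 8 and the 3 epochs are assumed to be the library defaults at the time of writing, and a single device with no gradient accumulation is also an assumption; check the `TrainingArguments` documentation for your installed version.

```python
import math

num_samples = 1000                # size of the training subset used in this notebook
per_device_train_batch_size = 8   # assumed default
num_train_epochs = 3              # assumed default
num_devices = 1                   # assumption: a single GPU (or CPU)

steps_per_epoch = math.ceil(num_samples / (per_device_train_batch_size * num_devices))
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 125 375
```

Doubling the batch size halves the number of steps per epoch, and each extra epoch adds another full pass over the 1,000 samples.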

For this example, we will specify the `output_dir` parameter, which defines the folder where the model checkpoints and predictions will be saved. (In Google Colab, you can view this directory in the "Files" tab.) We will also specify `report_to="none"` to disable all external logging. This is especially useful to prevent the trainer from prompting you for a Weights & Biases API key during training.

⚠️ If the code throws errors, try restarting the environment (Runtime → Restart Runtime). If there are still errors, you can also check the discussion [TrainingArgument does not work on colab](https://discuss.huggingface.co/t/trainingargument-does-not-work-on-colab/43372/10).




In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_trainer",  # Directory to save model checkpoints and outputs
    report_to="none"            # Disable logging to external trackers like Weights & Biases
)

## Creating a compute_metrics function for evaluation

An evaluation function has to be passed to the Trainer to compute and report metrics, since Trainer does not automatically evaluate model performance during training. The 🤗 [Evaluate](https://huggingface.co/docs/evaluate/a_quick_tour) library provides a simple accuracy function you can load with the `evaluate.load` function.

A `compute_metrics` function is defined and passed as an argument to the `Trainer`. Inside this function, the `compute` method of the loaded `metric` object is called to calculate the metric (e.g., accuracy) based on the model's predictions and references. The `eval_strategy` parameter of the TrainingArguments (named `evaluation_strategy` in older Transformers releases) controls how often evaluation is performed and metrics are reported, for example, at the end of each epoch during fine-tuning.
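What such a function computes can be checked by hand on a few rows. The sketch below reimplements the argmax-then-accuracy logic in plain Python with made-up logits and labels; the notebook itself uses NumPy and the Evaluate library for the same purpose.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Made-up logits for four reviews and five classes (labels 0 to 4).
logits = [
    [0.1, 2.3, 0.4, -1.0, 0.0],   # predicted class 1
    [1.5, 0.2, 0.1, 0.0, -0.3],   # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 4.0],    # predicted class 4
    [0.9, 0.8, 1.1, 0.2, 0.1],    # predicted class 2
]
labels = [1, 0, 3, 2]
print(accuracy(logits, labels))  # {'accuracy': 0.75}
```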

In [None]:
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

# Load the accuracy metric from the 🤗 Evaluate library
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred): #will be passed to a Trainer object
    logits, labels = eval_pred  # Unpack predictions (logits) and true labels from the evaluation output
    # Convert logits to predicted class indices by selecting the index with the highest value for each example
    predictions = np.argmax(logits, axis=-1)
    # Compute and return the accuracy metric by comparing predictions with true labels
    return metric.compute(predictions=predictions, references=labels)

# Define training arguments including output directory, evaluation strategy, and logging
training_args = TrainingArguments(
    output_dir="test_trainer",  # Directory where model checkpoints and outputs will be saved
    eval_strategy="epoch",      # Perform evaluation at the end of each training epoch
    report_to="none"            # Keep external logging (e.g., Weights & Biases) disabled
)

## Creating a Trainer object and training

A [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) object can now be created with the objects from the previous sections: the model, the training arguments, the training and evaluation datasets, and the evaluation function. Then train!

⚠️ You should not need an API key for wandb.ai, because `report_to="none"` in the TrainingArguments disables logging to Weights & Biases (wandb). If a key is requested, run the training arguments cell above again.



In [None]:

trainer = Trainer(
    model=model,  # The model to be trained (e.g., BERT with a classification head)
    args=training_args,  # Training configuration and hyperparameters (TrainingArguments object)
    train_dataset=small_train_dataset,  # The dataset used for training the model
    eval_dataset=small_eval_dataset,  # The dataset used for evaluation during training
    compute_metrics=compute_metrics,  # Function to calculate metrics like accuracy or F1 during evaluation
)

trainer.train()  # Start the training process using the specified model, datasets, and arguments

I get an accuracy around 58% for the 5 classes. But the model is trained with a thousand samples in less than 7 minutes and improvement is seen as the epochs increase.



You can increase the number of epochs with the `num_train_epochs` parameter of the [TrainingArguments class](https://huggingface.co/docs/transformers/v4.42.0/en/main_classes/trainer#transformers.TrainingArguments). If your dataset is small, BERT might overfit quickly: more than 4 epochs can cause overfitting or catastrophic forgetting. **Catastrophic forgetting** refers to when a pre-trained model (like BERT) forgets its previously learned general knowledge during fine-tuning on a new task.
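One common way to decide when extra epochs stop helping is to watch the evaluation loss and stop once it has not improved for a few epochs. The sketch below shows that rule in plain Python; the function name, the patience value, and the loss values are made up for illustration and are not results from this notebook.

```python
def best_epoch_with_patience(eval_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a patience-based rule."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                return epoch, best_epoch
    return len(eval_losses), best_epoch

# Made-up evaluation losses: improvement for three epochs, then overfitting.
losses = [1.30, 1.05, 0.98, 1.02, 1.10, 1.25]
print(best_epoch_with_patience(losses, patience=1))  # (5, 3)
```

Here training would stop after epoch 5 and the checkpoint from epoch 3 would be the one to keep.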



## Predicting with the fine-tuned model

Tutorials do not usually include information on how to use the model in the prediction phase. This code example passes new reviews to the fine-tuned BERT model for text classification. As in machine learning in general, the input in the prediction phase must be processed in the same way as in training: with the same tokenizer.


In [None]:
# The fine-tuned `model` and its `tokenizer` from the previous cells are reused here
import torch

# New reviews to classify
new_reviews = [
    "The food was amazing and the service was excellent.",
    "I did not enjoy the food at all.",
    "The ambiance was nice but the food was just okay.",
    "Terrible experience! Will not come back.",
    "Best restaurant ever! Highly recommend.",
    "Bad and expensive food, noisy and unpleasant atmosphere, slow and rude waiters. That's what you'll find in other restaurants, but here it's completely the opposite."
]

# Tokenize the new reviews using the same settings as in training
inputs = tokenizer(new_reviews, padding="max_length", truncation=True, return_tensors="pt")

# Check if CUDA is available and move the model to GPU if it is. The model and the input tensors need to be on the same device.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Ensure model is in evaluation mode
model.eval()

# Move the inputs to the same device as the model.  The model and the input tensors need to be on the same device.
inputs = {key: value.to(device) for key, value in inputs.items()}

# Predict
with torch.no_grad():
    # Disable gradient calculations: this speeds up inference and reduces memory usage
    outputs = model(**inputs)
    # 'inputs' is a dictionary of PyTorch tensors, e.g., {'input_ids': tensor, 'attention_mask': tensor}
    # The model returns an object with 'logits', which is a tensor of shape (batch_size, num_classes)
    logits = outputs.logits
    # Use torch.argmax to get the predicted class index for each example in the batch
    # 'predictions' is a 1D tensor of shape (batch_size,), each element is an integer class index
    predictions = torch.argmax(logits, dim=-1)

# Map predictions to human-readable class labels
class_labels = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
# 'class_labels' is a Python list of strings representing each class

# Loop over the original reviews and their predicted classes
for review, pred in zip(new_reviews, predictions):
    # 'review' is a string, 'pred' is a tensor scalar, automatically converted to int for indexing
    print(f"{review}\n→ {class_labels[pred]}\n")




The predictions are pretty good, but the last one is tricky: *Bad and expensive food, noisy and unpleasant atmosphere, slow and rude waiters. That's what you'll find in other restaurants, but here it's completely the opposite.*

# Repeating with RoBERTa
Instead of google-bert/bert-base-cased, let us try `FacebookAI/roberta-base` changing just 2 lines of code: the `AutoTokenizer` and `AutoModelForSequenceClassification` with the new head.



In [None]:
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from datasets import load_dataset


###############   PREPARING DATA ##############################
dataset = load_dataset("yelp_review_full")

# Print dataset example
print(dataset['train'][0])
print(dataset['train'][1])

#tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-large-cased")
tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

# Print dataset example after tokenizing. Data contain: label, text, input_ids, and attention_mask (the RoBERTa tokenizer does not return token_type_ids)
print(small_train_dataset[0])
print(small_train_dataset[1])

###############   TRAINING  ##############################

from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

#model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-large-cased", num_labels=5)
model = AutoModelForSequenceClassification.from_pretrained("FacebookAI/roberta-base", num_labels=5)

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1) # calculates the index of the maximum value along the last axis (which corresponds to the predicted class) for each prediction. This converts the raw logits into discrete class predictions.
    return metric.compute(predictions=predictions, references=labels) #Calculates the evaluation metrics based on the predicted classes (predictions) and the true classes (labels).

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch", report_to="none")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

###############   PREDICTING  ##############################

# Import the correct tokenizer for RoBERTa
from transformers import AutoTokenizer # Already imported at the top, but good to be explicit
import torch

# New reviews to classify
new_reviews = [
    "The food was amazing and the service was excellent.",
    "I did not enjoy the food at all.",
    "The ambiance was nice but the food was just okay.",
    "Terrible experience! Will not come back.",
    "Best restaurant ever! Highly recommend.",
    "Bad and expensive food, noisy and unpleasant atmosphere, slow and rude waiters. That's what you'll find in other restaurants, but here it's completely the opposite."
]

# Use the RoBERTa tokenizer loaded earlier
# tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base") # Not needed, already loaded

# Tokenize the new reviews using the same settings as in training
inputs = tokenizer(new_reviews, padding="max_length", truncation=True, return_tensors="pt")

# Check if CUDA is available and move the model to GPU if it is. The model and the input tensors need to be on the same device.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

# Ensure model is in evaluation mode
model.eval()

# Move the inputs to the same device as the model.  The model and the input tensors need to be on the same device.
inputs = {key: value.to(device) for key, value in inputs.items()}

# Predict
with torch.no_grad():
    # Disable gradient calculations: this speeds up inference and reduces memory usage
    outputs = model(**inputs)
    # 'inputs' is a dictionary of PyTorch tensors, e.g., {'input_ids': tensor, 'attention_mask': tensor}
    # The model returns an object with 'logits', which is a tensor of shape (batch_size, num_classes)
    logits = outputs.logits
    # Use torch.argmax to get the predicted class index for each example in the batch
    # 'predictions' is a 1D tensor of shape (batch_size,), each element is an integer class index
    predictions = torch.argmax(logits, dim=-1)

# Map predictions to human-readable class labels
class_labels = ["1 star", "2 stars", "3 stars", "4 stars", "5 stars"]
# 'class_labels' is a Python list of strings representing each class

# Loop over the original reviews and their predicted classes
for review, pred in zip(new_reviews, predictions):
    # 'review' is a string, 'pred' is a tensor scalar, automatically converted to int for indexing
    print(f"{review}\n→ {class_labels[pred]}\n")

I get around 62% accuracy in around 7 minutes with RoBERTa, 4 points better than BERT (58%). RoBERTa, like BERT, does not seem to get the last, tricky review: *Bad and expensive food, noisy and unpleasant atmosphere, slow and rude waiters. That's what you'll find in other restaurants, but here it's completely the opposite.*


# Conclusions and Next Steps

This notebook explores the use of the BERT language model as well as some of its adaptations, such as DistilBERT and RoBERTa.

First, these models were used for the tasks they were pretrained on: masked language modeling and next sentence prediction.

Afterwards, BERT and RoBERTa were fine-tuned for a multi-class text classification problem (or Sequence Classification).

Thanks to the **PyTorch Trainer**, this fine-tuning requires very little source code. At the same time, it allows custom tuning of the models **with your own data**, unlike the **pipeline** function, which downloads and uses already trained models. Furthermore, comparing different language models is greatly simplified: with just two lines of code we go from BERT to RoBERTa.


**Next Steps**

*   You can try fine-tuning more efficient models like [DistilBERT](https://huggingface.co/distilbert) or [ALBERT](https://huggingface.co/albert).
*   You can try fine-tuning more powerful models like [XLNet](https://huggingface.co/xlnet) or the large versions of [BERT](https://huggingface.co/google-bert) and [RoBERTa](https://huggingface.co/FacebookAI).
*   You can lower the level of abstraction even further. This [tutorial](https://huggingface.co/docs/transformers/training) describes how to use native PyTorch or TensorFlow with Keras in the same example.
*   Check also the 🤗 [Task guides](https://huggingface.co/docs/transformers/tasks/sequence_classification) with, among others, examples of Text Classification, Token Classification, Question Answering, Translation, and Summarization.







