<a href="https://colab.research.google.com/github/ten-jampa/LLM_grind/blob/main/transformers_doc/en/pytorch/finetuning_a_yelp_score_classifer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Transformers installation
! pip install transformers evaluate accelerate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git



# Fine-tuning

Fine-tuning adapts a pretrained model to a specific task with a smaller specialized dataset. This approach requires far less data and compute than training a model from scratch, which makes it a more accessible option for many users.

Transformers provides the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) API, which offers a comprehensive set of training features, for fine-tuning any of the models on the [Hub](https://hf.co/models).

> [!TIP]
> Learn how to fine-tune models for other tasks in our Task Recipes section in Resources!

This guide will show you how to fine-tune a model with [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) to classify Yelp reviews.

Log in to your Hugging Face account with your user token to ensure you can access gated models and share your models on the Hub.

In [5]:
from huggingface_hub import login

login()


Start by loading the [Yelp Reviews](https://hf.co/datasets/yelp_review_full) dataset, then [preprocess](https://huggingface.co/docs/transformers/main/en/./fast_tokenizers#preprocess) (tokenize, pad, and truncate) it for training. Use [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) to preprocess the entire dataset in one step.

In [3]:
!pip install -U datasets



In [6]:
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("Yelp/yelp_review_full")  # load the train and test splits
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")  # tokenizer that matches the checkpoint fine-tuned below

def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize, batched=True)

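
To make concrete what `padding="max_length"` and `truncation=True` do to each review, here is a minimal pure-Python sketch. It is not the tokenizer's actual implementation: the helper `pad_and_truncate`, the `pad_id=0` default, and the tiny `max_length` are assumptions chosen for illustration, and the sketch ignores that a real tokenizer keeps the closing special token when it truncates.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    # Truncation: drop everything past max_length.
    kept = token_ids[:max_length]
    # Padding: fill the remainder with pad_id so every example has equal length.
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # The attention mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence gets padded, a long one gets cut.
short_ids, short_mask = pad_and_truncate([101, 146, 102], max_length=5)
long_ids, long_mask = pad_and_truncate([101, 146, 27438, 1142, 4202, 119, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example ends up with the same length, which is why the `input_ids` and `attention_mask` columns can be batched directly.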

> [!TIP]
> Fine-tune on a smaller subset of the full dataset to reduce the time it takes. The results won't be as good compared to fine-tuning on the full dataset, but it is useful to make sure everything works as expected first before committing to training on the full dataset.
> ```py
> small_train = dataset["train"].shuffle(seed=42).select(range(1000))
> small_eval = dataset["test"].shuffle(seed=42).select(range(1000))
> ```

In [9]:
small_train = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval = dataset["test"].shuffle(seed=42).select(range(1000))

import pandas as pd
sample_df = pd.DataFrame(small_train[:10])
sample_df

| | label | text | input_ids | token_type_ids | attention_mask |
|---|---|---|---|---|---|
| 0 | 4 | I stalk this truck. I've been to industrial p... | [101, 146, 27438, 1142, 4202, 119, 146, 112, 1... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 1 | 2 | who really knows if this is good pho or not, i... | [101, 1150, 1541, 3520, 1191, 1142, 1110, 1363... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 2 | 4 | I LOVE Bloom Salon... all of their stylist are... | [101, 146, 149, 2346, 17145, 19169, 17420, 119... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 3 | 0 | We were excited to eat here, it is difficult t... | [101, 1284, 1127, 7215, 1106, 3940, 1303, 117,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 4 | 2 | So this is a place, with food. That much canno... | [101, 1573, 1142, 1110, 170, 1282, 117, 1114, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 5 | 2 | Review for the Lounge/Club:\nEvery time I go t... | [101, 4960, 1111, 1103, 26135, 120, 1998, 131,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 6 | 3 | I've been going here a lot(pretty much a regul... | [101, 146, 112, 1396, 1151, 1280, 1303, 170, 1... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 7 | 0 | I went to Sole on the weekend and found that i... | [101, 146, 1355, 1106, 17135, 1162, 1113, 1103... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 8 | 0 | Went in on a sunday afternoon. Place was dead.... | [101, 23158, 1204, 1107, 1113, 170, 3336, 6194... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 9 | 4 | I have lived here in Phoenix for 9 months now ... | [101, 146, 1138, 2077, 1303, 1107, 6343, 1111,... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |


## Trainer

In [7]:
#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/nvBXf7s7vTI?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')



[Trainer](https://huggingface.co/docs/transformers/main/en/./trainer) is an optimized training loop for Transformers models, making it easy to start training right away without manually writing your own training code. Pick and choose from a wide range of training features in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) such as gradient accumulation, mixed precision, and options for reporting and logging training metrics.
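
Of the features listed, gradient accumulation is the easiest to illustrate without a GPU. The sketch below is a toy under stated assumptions (a one-parameter model `y_hat = w * x`, a mean-squared-error loss, and the hypothetical helper `grad_mse`), not how Trainer implements it. It only shows the underlying idea: averaging the gradients of equally sized micro-batches gives the same gradient as one large batch, so a small per-step batch can stand in for a bigger one.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]
w = 0.5

# One large batch of 4 examples.
full_batch_grad = grad_mse(w, data)

# Two micro-batches of 2 examples each: average their gradients
# before taking a single optimizer step.
micro_batches = [data[:2], data[2:]]
accumulated_grad = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)

print(full_batch_grad, accumulated_grad)
```

In `TrainingArguments`, the corresponding setting is `gradient_accumulation_steps`.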

Load a model and provide the number of expected labels (you can find this information on the Yelp Review [dataset card](https://huggingface.co/datasets/yelp_review_full#data-fields)).

In [8]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "google-bert/bert-base-cased",
    num_labels=5,  # one class per star rating (1 to 5 stars)
)
# Loading prints a warning that 'classifier.bias' and 'classifier.weight' are newly initialized.
# This is expected: the classification head is trained in the steps below.


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



> [!TIP]
> The message above is a reminder that the model's pretrained head is discarded and replaced with a randomly initialized classification head. The randomly initialized head needs to be fine-tuned on your specific task to output meaningful predictions.

With the model loaded, set up your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). Hyperparameters are variables that control the training process, such as the learning rate, batch size, and number of epochs, and they in turn affect model performance. Selecting good hyperparameters is important, and you should experiment with them to find the best configuration for your task.

For this guide, you can use the default hyperparameters, which provide a good baseline to begin with. The only settings to configure in this guide are where to save the checkpoint, how to evaluate model performance during training, and whether to push the model to the Hub.
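
For orientation, the sketch below writes out a few of the defaults this guide leans on. The values (learning rate 5e-5, batch size 8 per device, 3 epochs, linear decay to zero without warmup) are assumptions based on the `TrainingArguments` documentation at the time of writing, so check them against your installed version. The helper `linear_decay_lr` is hypothetical and only mimics the shape of the schedule.

```python
# Assumed TrainingArguments defaults; verify against your installed version.
assumed_defaults = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 3,
}


def linear_decay_lr(step, total_steps, base_lr):
    """Learning rate after `step` optimizer steps, decaying linearly to zero (no warmup)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Example run length: 1,000 examples, batch size 8, 3 epochs -> 375 steps.
total_steps = 375
for step in (0, 125, 250, 375):
    print(step, linear_decay_lr(step, total_steps, assumed_defaults["learning_rate"]))
```

The learning rate starts at its configured value and reaches zero at the last step, so later epochs make smaller updates than the first one.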

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) requires a function to compute and report your metric. For a classification task, you'll use [evaluate.load](https://huggingface.co/docs/evaluate/main/en/package_reference/loading_methods#evaluate.load) to load the [accuracy](https://hf.co/spaces/evaluate-metric/accuracy) function from the [Evaluate](https://hf.co/docs/evaluate/index) library. Gather the predictions and labels in [compute](https://huggingface.co/docs/evaluate/main/en/package_reference/main_classes#evaluate.EvaluationModule.compute) to calculate the accuracy.

In [11]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # convert the logits to their predicted class
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

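
To see what `compute_metrics` calculates without loading any library, here is a pure-Python sketch of the same two steps: take the highest-scoring class per row, then count matches against the labels. The helpers `argmax` and `accuracy` and the toy logits are made up for illustration; they stand in for `np.argmax` and the Evaluate accuracy metric rather than reproducing them.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


toy_logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.3, 0.2, 1.5],    # predicts class 2
    [-0.5, 0.9, 0.4],   # predicts class 1
    [1.2, 1.1, -0.3],   # predicts class 0
]
toy_labels = [0, 2, 2, 0]

print(accuracy(toy_logits, toy_labels))  # 3 of 4 rows match -> 0.75
```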

Set up [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) with where to save the model and when to compute accuracy during training. The example below sets `eval_strategy` to `"epoch"`, which reports the accuracy at the end of each epoch. Add `push_to_hub=True` to upload the model to the Hub after training.

In [12]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="yelp_review_classifier",
    eval_strategy="epoch",  # evaluate and report the metric at the end of every epoch
    push_to_hub=True,
)

Create a [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) instance and pass it the model, training arguments, training and test datasets, and evaluation function. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to start training.

In [15]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,
    eval_dataset=small_eval,
    compute_metrics=compute_metrics,
)
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.027822 | 0.55 |
| 2 | No log | 0.982634 | 0.598 |
| 3 | No log | 1.062458 | 0.599 |


TrainOutput(global_step=375, training_loss=0.817215087890625, metrics={'train_runtime': 397.805, 'train_samples_per_second': 7.541, 'train_steps_per_second': 0.943, 'total_flos': 789354427392000.0, 'train_loss': 0.817215087890625, 'epoch': 3.0})
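
The numbers in this output can be sanity-checked with a little arithmetic. The sketch below assumes a single device and the default batch size of 8 and 3 epochs (assumptions, since the guide never sets them explicitly), and it adds two reference points for a 5-class model that guesses uniformly, which presumes roughly balanced classes.

```python
import math

# Values taken from the run above.
num_examples = 1000      # size of small_train
train_runtime = 397.805  # seconds, from TrainOutput

# Assumed TrainingArguments defaults on a single device.
batch_size = 8
num_epochs = 3

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 125 375

steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(num_examples * num_epochs / train_runtime, 3)
print(steps_per_second, samples_per_second)  # 0.943 7.541

# Reference points for a 5-class model that guesses uniformly.
num_labels = 5
chance_accuracy = 1 / num_labels
chance_loss = -math.log(1 / num_labels)
print(chance_accuracy, round(chance_loss, 4))  # 0.2 1.6094
```

The 375 steps match `global_step`, and the throughput figures match `train_steps_per_second` and `train_samples_per_second`. If logging happens every 500 steps, which is believed to be the default, a 375-step run never logs a training loss, which would explain the `No log` entries. An accuracy near 0.6 and a validation loss near 1.0 are both clearly better than the uniform-guess reference points.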

Finally, use [push_to_hub()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.push_to_hub) to upload your model and tokenizer to the Hub.

In [21]:
# Run inference with the fine-tuned model on a few examples from the test split.
eval_samples = dataset["test"].select(range(10))  # first 10 test examples

# trainer.predict() expects a Dataset that already contains the tokenized
# features ('input_ids', 'attention_mask', ...) created by the earlier .map() call.
# It returns a PredictionOutput holding the logits, the labels, and the metrics.
predictions = trainer.predict(eval_samples)

print('raw Predictions', predictions)

# The logits are the raw scores from the classification head;
# argmax over the last axis converts them to predicted class ids.
predicted_labels = np.argmax(predictions.predictions, axis=-1)
# The true labels are returned alongside the logits.
original_labels = predictions.label_ids

print("Predicted Labels:", predicted_labels)
print("Original Labels:", original_labels)

# Optional: convert the logits to probabilities (requires SciPy).
# from scipy.special import softmax
# probabilities = softmax(predictions.predictions, axis=-1)
# print("Probabilities:", probabilities)


raw Predictions PredictionOutput(predictions=array([[ 4.592553  ,  0.7491206 , -1.6116874 , -2.1059    , -0.3231202 ],
       [ 4.461411  ,  0.6129195 , -1.6715999 , -1.7410411 ,  0.17647786],
       [ 4.5862737 ,  0.98839045, -1.6251113 , -2.1731071 , -0.35450155],
       [ 4.659207  ,  1.0205717 , -1.6037289 , -2.3557065 , -0.5622573 ],
       [ 3.883485  ,  1.9786222 , -1.2305516 , -2.5604484 , -1.9810019 ],
       [-1.7948964 ,  1.1163661 ,  1.9512619 , -0.82260233, -3.1031837 ],
       [-2.354108  , -0.06236765,  2.1837215 ,  0.2111557 , -2.6405318 ],
       [-2.1422274 , -2.0409672 , -0.20017132,  3.0585454 ,  2.8191807 ],
       [-1.2222999 , -1.733242  , -0.8023203 ,  2.2171369 ,  3.7052357 ],
       [-2.4563491 , -1.5670786 , -0.10923393,  3.3267918 ,  1.7955644 ]],
      dtype=float32), label_ids=array([0, 0, 0, 0, 0, 2, 1, 3, 3, 2]), metrics={'test_loss': 0.9176275134086609, 'test_accuracy': 0.7, 'test_runtime': 0.3658, 'test_samples_per_second': 27.339, 'test_steps_per_seco
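
The logits above are unnormalized scores, so they are hard to read as confidence. As a minimal sketch, the pure-Python `softmax` helper below (a hypothetical stand-in for a library implementation such as SciPy's) converts the first row of logits into probabilities that sum to 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First row of the predictions printed above.
first_row = [4.592553, 0.7491206, -1.6116874, -2.1059, -0.3231202]

probabilities = softmax(first_row)
predicted_class = probabilities.index(max(probabilities))

print([round(p, 3) for p in probabilities])
print(predicted_class)  # class 0, which matches the first entry of label_ids
```

Subtracting the largest logit before exponentiating does not change the result, but it avoids overflow when the logits are large.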

In [17]:
trainer.push_to_hub()


CommitInfo(commit_url='https://huggingface.co/Ten-Jampa/yelp_review_classifier/commit/2382d892b849634392553a2a1300efb2165eec25', commit_message='End of training', commit_description='', oid='2382d892b849634392553a2a1300efb2165eec25', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Ten-Jampa/yelp_review_classifier', endpoint='https://huggingface.co', repo_type='model', repo_id='Ten-Jampa/yelp_review_classifier'), pr_revision=None, pr_num=None)

In [None]:
!nvidia-smi


## TensorFlow

[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) is incompatible with Transformers TensorFlow models. Instead, fine-tune these models with [Keras](https://keras.io/) since they're implemented as a standard [tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model).

In [None]:
from transformers import TFAutoModelForSequenceClassification
from datasets import load_dataset
from transformers import AutoTokenizer

model = TFAutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels=5)
dataset = load_dataset("Yelp/yelp_review_full")  # same dataset id as in the PyTorch section
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize(examples):
    return tokenizer(examples["text"])

dataset = dataset.map(tokenize)

There are two methods to convert a dataset to [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).

- [prepare_tf_dataset()](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) is the recommended way to create a [tf.data.Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) because you can inspect the model to figure out which columns to use as inputs and which columns to discard. This allows you to create a simpler, more performant dataset.
- [to_tf_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.to_tf_dataset) is a more low-level method from the [Datasets](https://hf.co/docs/datasets/index) library that gives you more control over how a dataset is created by specifying the columns and label columns to use.

Pass the tokenizer to [prepare_tf_dataset()](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset) to pad each batch, and optionally shuffle the dataset. For more complicated preprocessing, pass the preprocessing function to the `collate_fn` parameter instead.

In [None]:
tf_dataset = model.prepare_tf_dataset(
    dataset["train"], batch_size=16, shuffle=True, tokenizer=tokenizer
)
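
Because the `tokenize` function in this section does not pad, padding happens per batch. The sketch below illustrates that idea in pure Python. It is not what `prepare_tf_dataset()` does internally, and the helper `pad_batch` and the `pad_id=0` default are assumptions made for the example.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in `batch` to the length of the longest one."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Each batch is padded only as far as its own longest sequence.
batch_a = pad_batch([[101, 146, 102], [101, 1150, 1541, 3520, 102]])
batch_b = pad_batch([[101, 102], [101, 1284, 102]])

print([len(seq) for seq in batch_a])  # [5, 5]
print([len(seq) for seq in batch_b])  # [3, 3]
```

Compared with padding every example to a fixed maximum length, per-batch padding wastes fewer positions on padding tokens when most reviews are short.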

Finally, [compile](https://keras.io/api/models/model_training_apis/#compile-method) and [fit](https://keras.io/api/models/model_training_apis/#fit-method) the model to start training.

> [!TIP]
> It isn't necessary to pass a loss argument to [compile](https://keras.io/api/models/model_training_apis/#compile-method) because Transformers automatically chooses a loss that is appropriate for the task and architecture. However, you can always specify a loss argument if you want.

In [None]:
from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(3e-5))
model.fit(tf_dataset)

## Resources

Refer to the Transformers [examples](https://github.com/huggingface/transformers/tree/main/examples) for more detailed training scripts on various tasks. You can also check out the [notebooks](https://huggingface.co/docs/transformers/main/en/./notebooks) for interactive examples.