<a href="https://colab.research.google.com/github/sahupra1357/LLM/blob/main/How_to_use_Transformers_for_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this lecture, we will learn how to use Transformers for fine-tuning. Transformers is a Hugging Face library that provides state-of-the-art pre-trained language models and tools for natural language processing. We will use it to load a pre-trained model and fine-tune it for our sentiment analysis task.

There are many pre-trained models available in Transformers, such as GPT-2, BERT, and T5. Each model has its own architecture, pre-training objective, and vocabulary. For our task, we will use BERT. BERT stands for Bidirectional Encoder Representations from Transformers. It is an encoder that reads the context on both the left and the right of each token, and it is pre-trained by predicting words that have been masked out of the input. BERT has shown strong results on a range of natural language understanding tasks, such as text classification, question answering, and named entity recognition.
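
To make the idea of learning from masked words more concrete, here is a minimal sketch of how masked training examples can be built. It is a simplification, not BERT's exact procedure: the function name `mask_tokens` and the whitespace tokenization are assumptions for illustration, and the real recipe masks about 15% of tokens and sometimes substitutes a random or unchanged token instead of `[MASK]`.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for a simplified masked-language-model example."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append(token)   # the model must recover this token
        else:
            masked.append(token)
            targets.append(None)    # this position is not scored
    return masked, targets

tokens = "the movie was surprisingly good".split()
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(targets)
```

Because the model sees the words on both sides of each `[MASK]`, it has to use left and right context together to fill the gap.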

To use BERT for our task, we need to do the following steps:

1. Import the Transformers library and the Hugging Face Datasets library. The Hugging Face Datasets library provides easy access to various datasets for natural language processing. We will use it to load our SST-2 dataset.
2. Load the pre-trained BERT model and tokenizer from the Transformers library. The tokenizer is a tool that converts text into numerical tokens that can be fed into the model. We will use the bert-base-uncased model and tokenizer, which are trained on lower-cased English text.
3. Load the SST-2 dataset from the Hugging Face Datasets library. The dataset consists of train, validation, and test splits of movie-review sentences with sentiment labels. In the GLUE version, the labels of the test split are hidden.
4. Preprocess the dataset by using the tokenizer to encode the text into token IDs that can be fed into the model, and attach the labels. We will also truncate or pad the sequences to a maximum length of 128 tokens.
5. Define the training arguments by using the TrainingArguments class from the Transformers library. The training arguments specify various parameters and settings for training and evaluation, such as batch size, learning rate, number of epochs, logging steps, and output directory.
6. Define the trainer by using the Trainer class from the Transformers library. The trainer is a tool that provides a simple and efficient way to train and evaluate models. We will pass the model, the dataset, and the training arguments to the trainer.
7. Train the model by calling the train method of the trainer. This will fine-tune the model on the train split of the dataset and save checkpoints in the output directory.
8. Evaluate the model by calling the evaluate method of the trainer. This will evaluate the model on a labeled held-out split and return a dictionary of metrics, such as the loss and any metrics we compute ourselves, for example accuracy.

Let’s see how we can implement these steps in Python code. First, we need to import the Transformers library and the Hugging Face Datasets library. We can do this by running the following code:

In [None]:
# Import Transformers library
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Import Hugging Face Datasets library
from datasets import load_dataset


Next, we need to load the pre-trained BERT model and tokenizer from the Transformers library. We can do this by using the AutoModelForSequenceClassification and AutoTokenizer classes. These classes pick the right model and tokenizer implementation for a given checkpoint name or path. We will use the bert-base-uncased model and tokenizer, which are trained on lower-cased English text. The classification layer on top of BERT is newly initialized, which is why the model has to be fine-tuned before its predictions are useful. We can load them by running the following code:

In [None]:
# Load pre-trained BERT model for sequence classification
# (num_labels=2 for negative/positive; the classification head starts untrained)
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Load pre-trained BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
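
To illustrate what the tokenizer does with text, here is a minimal sketch of WordPiece-style tokenization, the greedy longest-match-first scheme that BERT's tokenizer is based on. The vocabulary `TOY_VOCAB`, the helper names, and the token IDs are made up for illustration; the real bert-base-uncased vocabulary has about 30,000 entries and different IDs.

```python
# Toy vocabulary (hypothetical); "##" marks a piece that continues a word
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "a": 4, "great": 5, "film": 6, "un": 7, "##watch": 8, "##able": 9,
}

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def encode(sentence, vocab):
    """Lower-case, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[token] for token in tokens]

tokens, ids = encode("A great unwatchable film", TOY_VOCAB)
print(tokens)  # ['[CLS]', 'a', 'great', 'un', '##watch', '##able', 'film', '[SEP]']
print(ids)     # [2, 4, 5, 7, 8, 9, 6, 3]
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover words it has never seen as a whole.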

Then, we need to load the SST-2 dataset from the Hugging Face Datasets library. We can do this by using the load_dataset function, which loads a dataset from a given name or path. We will use the glue name with sst2 as the configuration name to load our SST-2 dataset. The dataset consists of train, validation, and test splits, each containing movie-review sentences and labels. In the GLUE version, the test split's labels are hidden (stored as -1), so only the train and validation splits can be used for training and scoring. We can load the dataset by running the following code:

In [None]:
# Load SST-2 dataset
dataset = load_dataset('glue', 'sst2')

Next, we need to preprocess the dataset by using the tokenizer to encode the text into token IDs and attention masks that can be fed into the model. We can do this by using a custom function that takes a batch of examples as input and returns a batch of encoded examples as output. We can also truncate or pad the sequences to a fixed length of 128 tokens by using the max_length, truncation, and padding arguments of the tokenizer. We can define and apply this function by running the following code:

In [None]:
# Define a function to preprocess a batch of examples
def preprocess(batch):
  # Tokenize the sentences into fixed-length lists of token IDs and attention masks
  encoded = tokenizer(batch['sentence'], max_length=128, padding='max_length', truncation=True)
  # Copy the labels under the 'labels' key that the model expects
  encoded['labels'] = batch['label']
  # Return the encoded dictionary
  return encoded

# Apply the function to the dataset
dataset = dataset.map(preprocess, batched=True)
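
To show what padding and truncation do to a single sequence, here is a minimal sketch in plain Python. The helper `pad_or_truncate` is hypothetical, and `pad_id=0` is an assumption about the padding token ID. The real tokenizer is more careful: for example, it keeps the closing `[SEP]` token when it truncates, which this sketch does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut or pad a list of token IDs to exactly max_length entries."""
    token_ids = token_ids[:max_length]        # truncate if too long
    attention_mask = [1] * len(token_ids)     # 1 marks real tokens
    padding = max_length - len(token_ids)     # 0 if nothing to pad
    return {
        "input_ids": token_ids + [pad_id] * padding,
        "attention_mask": attention_mask + [0] * padding,  # 0 marks padding
    }

short = pad_or_truncate([2, 4, 5, 3], max_length=6)
print(short["input_ids"])       # [2, 4, 5, 3, 0, 0]
print(short["attention_mask"])  # [1, 1, 1, 1, 0, 0]

long = pad_or_truncate(list(range(10)), max_length=6)
print(long["input_ids"])        # [0, 1, 2, 3, 4, 5]
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.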

Then, we need to define the training arguments by using the TrainingArguments class from the Transformers library. The training arguments specify various parameters and settings for training and evaluation, such as batch size, learning rate, number of epochs, logging steps, output directory, etc. We can define them by running the following code:

In [None]:
# Define training arguments
training_args = TrainingArguments(
  output_dir='output', # Output directory
  num_train_epochs=3, # Number of training epochs
  per_device_train_batch_size=32, # Batch size per device during training
  per_device_eval_batch_size=32, # Batch size for evaluation
  warmup_steps=500, # Number of warmup steps for learning rate scheduler
  weight_decay=0.01, # Strength of weight decay
  logging_dir='logs', # Directory for storing logs
  logging_steps=10, # Log every X updates steps
  evaluation_strategy='steps', # Evaluate every eval_steps steps (newer Transformers releases name this argument eval_strategy)
  eval_steps=50, # Evaluation step
)
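
To see how warmup_steps interacts with the learning rate, here is a minimal sketch of a linear warmup followed by a linear decay, which is the shape of the Trainer's default schedule. The function `linear_schedule_lr` is a hypothetical re-implementation for illustration. The base rate of 5e-5 is the TrainingArguments default, since the arguments above do not set one. The step count assumes a single device, no gradient accumulation, and 67,349 training examples in SST-2.

```python
import math

def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

train_examples = 67_349                           # assumed SST-2 train split size
steps_per_epoch = math.ceil(train_examples / 32)  # per_device_train_batch_size=32
total_steps = 3 * steps_per_epoch                 # num_train_epochs=3

print(steps_per_epoch, total_steps)  # 2105 6315
for step in (0, 250, 500, 3000, total_steps):
    print(step, linear_schedule_lr(step, 5e-5, 500, total_steps))
```

With these settings, warmup covers roughly the first 8% of training, after which the learning rate falls steadily until the last step.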

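Next, we need to define the trainer by using the Trainer class from the Transformers library. The trainer is a tool that provides a simple and efficient way to train and evaluate models. We will pass the model, the dataset splits, and the training arguments to the trainer. We evaluate on the validation split, because the test split of GLUE SST-2 has no usable labels. By default the trainer reports only the loss, so we also pass a small function that computes accuracy. We can define it by running the following code:

In [None]:
import numpy as np

# Compute accuracy from the logits and labels collected during evaluation
def compute_metrics(eval_pred):
  logits, labels = eval_pred
  return {'accuracy': float((np.argmax(logits, axis=-1) == labels).mean())}

# Define trainer
trainer = Trainer(
  model=model, # The model to train
  args=training_args, # Training arguments
  train_dataset=dataset['train'], # Training dataset
  eval_dataset=dataset['validation'], # Labeled evaluation dataset
  compute_metrics=compute_metrics, # Adds eval_accuracy to the metrics
)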

Finally, we need to train and evaluate the model by calling the train and evaluate methods of the trainer. The train method fine-tunes the model on the training split and saves checkpoints in the output directory. The evaluate method scores the model on the evaluation split and returns a dictionary of metrics. We can train and evaluate the model by running the following code:

In [None]:
# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()

If we run this code, we may get an output like this:

In [None]:
{'epoch': 3.0,
 'eval_accuracy': 0.9210526315789473,
 'eval_loss': 0.24363629519939423,
 'eval_runtime': 1.6719,
 'eval_samples_per_second': 358.726,
 'eval_steps_per_second': 29.894}

As you can see, our fine-tuned BERT model achieved an accuracy of about 92% on the SST-2 evaluation split, which is a strong result for only three epochs of fine-tuning.
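
Accuracy is only one way to summarize predictions. As a minimal sketch, the snippet below computes accuracy, precision, and recall by hand from true and predicted labels. The function `classification_metrics` and the label lists are made up for illustration, with 1 standing for positive sentiment and 0 for negative, as in SST-2.

```python
def classification_metrics(y_true, y_pred, positive_label=1):
    """Accuracy, precision, and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,  # of predicted positives, how many are right
        "recall": tp / (tp + fn) if tp + fn else 0.0,     # of true positives, how many were found
    }

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up gold labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]  # made-up model predictions
print(classification_metrics(y_true, y_pred))
# {'accuracy': 0.625, 'precision': 0.6, 'recall': 0.75}
```

Precision and recall are most informative when the two classes are unbalanced or when one kind of mistake matters more than the other.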

In this lecture, we learned how to use Transformers for fine-tuning. We learned how to load and fine-tune a pre-trained BERT model for our sentiment analysis task. We also learned how to use Transformers tools and resources to simplify and optimize our fine-tuning process.

In the next lecture, we will learn how to use Hugging Face Spaces for deployment. Hugging Face Spaces is a platform that allows you to easily host and share your models and applications with others. We will see how we can use Hugging Face Spaces to create a web app that performs sentiment analysis on user input using our fine-tuned BERT model. See you in the next lecture.