In [None]:
!pip install datasets



In [None]:
!pip install evaluate



This project demonstrates how to fine-tune GPT-2 so that it classifies the sentiment of tweets. The pre-trained GPT-2 model from Hugging Face is trained further on a dataset of tweets paired with their sentiment labels. The process is outlined below:

Step 1: Selection of a Pre-Trained Model and Dataset:

Fine-tuning starts with choosing a pre-trained model and a dataset that suits the task. Here GPT-2 serves as the base model, and a labeled tweet sentiment dataset provides the training data.


Step 2: Loading the Dataset:

With the model selected, the next step is to obtain high-quality data for fine-tuning. The Hugging Face Datasets library is used to import a dataset that labels each tweet as Positive, Neutral, or Negative. This labeled dataset is the foundation for the model's sentiment classification.



In [None]:
from datasets import load_dataset
import pandas as pd
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer

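# Download the tweet sentiment dataset from the Hugging Face Hub.
dataset = load_dataset("mteb/tweet_sentiment_extraction")

# Copy the train split into a DataFrame for easier inspection.
df = pd.DataFrame(dataset['train'])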
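Before fine-tuning, it can help to check how the three sentiment classes are balanced. The sketch below uses a few hypothetical rows rather than the real data, and it assumes each row carries a `label_text` field next to `text` and `label`; that schema is an assumption and should be checked against the loaded dataset.

```python
from collections import Counter

# Hypothetical rows mimicking the assumed schema (text, label, label_text).
rows = [
    {"text": "I love this phone", "label": 2, "label_text": "positive"},
    {"text": "Flight delayed again", "label": 0, "label_text": "negative"},
    {"text": "Heading to the office", "label": 1, "label_text": "neutral"},
    {"text": "Best concert ever", "label": 2, "label_text": "positive"},
]

def label_distribution(rows):
    """Return the share of each label name as a dict of fractions."""
    counts = Counter(row["label_text"] for row in rows)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

distribution = label_distribution(rows)
print(distribution)
```

On the real DataFrame, the same check would likely be `df["label_text"].value_counts(normalize=True)`, provided the column exists under that name.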


Step 3: Tokenizing the Dataset:

With the dataset loaded, the text must be converted into a form the model can consume. Large language models (LLMs) operate on tokens rather than raw text, so a tokenizer is required.
The Datasets library's map method applies a preprocessing function to the whole dataset in one call. This step therefore loads GPT-2's pre-trained tokenizer and tokenizes every split. GPT-2 has no padding token by default, so the end-of-sequence token is reused for padding.


In [None]:
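# The dataset and GPT2Tokenizer come from the previous code cell.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token, so reuse the end-of-sequence token for padding.
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # With no max_length given, padding="max_length" pads every tweet to the
    # tokenizer's model maximum (1024 tokens for GPT-2).
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# batched=True passes many rows at once to tokenize_function.
tokenized_datasets = dataset.map(tokenize_function, batched=True)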
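To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch. It is not GPT-2's byte-pair tokenizer: `toy_tokenize`, the three-word vocabulary, and the pad and unknown ids are all hypothetical, chosen only to show how `input_ids` and `attention_mask` relate.

```python
def toy_tokenize(text, vocab, max_length, pad_id=0, unk_id=1):
    """Whitespace-split, map words to ids, then truncate or pad to max_length."""
    ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    ids = ids[:max_length]              # truncation: drop tokens past max_length
    mask = [1] * len(ids)               # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding           # padding: fill up to max_length
    mask += [0] * padding               # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": mask}

vocab = {"good": 2, "morning": 3, "everyone": 4}
short_example = toy_tokenize("Good morning", vocab, max_length=4)
long_example = toy_tokenize("Good morning everyone out there", vocab, max_length=4)
print(short_example)
print(long_example)
```

The short input is padded and the long one is cut, so every example ends up with the same length, which is what lets the examples be stacked into batches.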

To reduce compute requirements, a smaller subset of the full dataset is used for fine-tuning: 1,000 shuffled examples from the train split and 1,000 from the test split.
The training subset is used to fine-tune the model, and the test subset serves as a benchmark for evaluating it on unseen data.



In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
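The shuffle-then-select pattern can be illustrated in plain Python. This is only a sketch of the idea: `sample_subset` is a hypothetical helper, it does not reproduce the exact ordering of the Datasets library's shuffle, and the `min` guard is an added precaution for splits smaller than the requested size.

```python
import random

def sample_subset(rows, n, seed=42):
    """Shuffle a copy of rows with a fixed seed and keep at most n of them."""
    shuffled = list(rows)                   # copy so the input stays untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the draw repeatable
    return shuffled[: min(n, len(shuffled))]

rows = list(range(10))
first = sample_subset(rows, 4)
second = sample_subset(rows, 4)
print(first == second)
```

A fixed seed means the same examples are drawn on every run, which keeps results comparable across experiments.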

Step 4: Initializing the Base Model:

The next step loads the base model and configures it for the number of labels in this task. The tweet sentiment dataset has three labels: Positive, Neutral, and Negative.
Passing num_labels=3 attaches a three-way classification head to GPT-2, so the model is ready to be fine-tuned for sentiment classification.



In [None]:
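from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)
# Tell the model which token is padding so it classifies from the last real token.
model.config.pad_token_id = tokenizer.eos_token_id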

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
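The warning above mentions `score.weight` because the classification head is a new linear layer on top of GPT-2, with no pre-trained values. The NumPy sketch below shows roughly what that head computes for one tweet. The random arrays are stand-ins, not real model activations or weights, and the 0.02 initialization scale is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 3    # 768 is the hidden width of GPT-2 small

# Stand-in for the hidden state of the last non-padding token of one tweet.
last_hidden_state = rng.normal(size=hidden_size)

# Stand-in for score.weight: randomly initialized, hence the warning above.
score_weight = rng.normal(scale=0.02, size=(num_labels, hidden_size))

logits = score_weight @ last_hidden_state   # one raw score per label
predicted_label = int(np.argmax(logits))
print(logits.shape, predicted_label)
```

Until the head is trained, the predicted label is essentially arbitrary, which is why fine-tuning is needed before the model can be used for inference.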


Step 5: Defining the Evaluation Method:

The Transformers library offers a Trainer class optimized for training models. By default, however, it reports only the loss during evaluation and computes no task metric such as accuracy. A metric function therefore has to be passed to the Trainer before training starts.
This function converts the model's logits into predicted labels and compares them with the reference labels, so performance can be quantified whenever the Trainer runs an evaluation.



In [None]:


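# Accuracy: the fraction of predictions that match the reference labels.
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model's logits and the true labels.
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit.
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)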
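The argmax-then-compare logic can be checked by hand on a small example. The logits and labels below are made up for illustration, and accuracy is computed directly with NumPy instead of the evaluate library.

```python
import numpy as np

# Four hypothetical examples, three class scores each.
logits = np.array([
    [2.0, 0.1, -1.0],   # largest score at index 0
    [0.2, 0.3, 1.5],    # largest score at index 2
    [0.0, 1.2, 0.4],    # largest score at index 1
    [1.1, 0.9, 0.2],    # largest score at index 0
])
labels = np.array([0, 2, 1, 1])

predictions = np.argmax(logits, axis=-1)
accuracy = float((predictions == labels).mean())
print(predictions.tolist(), accuracy)
```

Three of the four predictions match their labels, so the accuracy is 0.75.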

Step 6: Fine-Tuning Using the Trainer Method:

The final step configures the training arguments and starts training. The Trainer class supports a wide range of training options, including logging, gradient accumulation, and mixed precision.
The training arguments set the output directory, the per-device batch sizes, and the number of gradient accumulation steps. No evaluation strategy is set, so evaluation runs only when it is called explicitly. Once the arguments and the Trainer are defined, train() starts the fine-tuning run.



In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_trainer",          # checkpoints are written here
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,      # accumulate 4 batches per optimizer step
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
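The batch settings above determine how many optimizer updates the run performs. The arithmetic below is a rough sketch under two assumptions not stated in the notebook: training runs on a single device, and the number of epochs is left at the library default of 3.

```python
import math

train_examples = 1000               # size of small_train_dataset
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1                     # assumption: a single GPU or CPU
num_train_epochs = 3                # assumption: library default, not set above

# Gradients from several small batches are summed before each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)
updates_per_epoch = math.ceil(train_examples / effective_batch_size)
total_updates = updates_per_epoch * num_train_epochs
print(effective_batch_size, updates_per_epoch, total_updates)
```

Gradient accumulation keeps memory use at the level of a batch of 1 while each update still averages over 4 examples.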


Model Evaluation:

After training, the model is evaluated on the held-out test subset. The Trainer class provides an evaluate method that runs the model over eval_dataset and returns the evaluation loss together with the metrics from compute_metrics, in this case accuracy.




In [None]:

trainer.evaluate()
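The evaluate method returns a dictionary whose keys carry an `eval_` prefix. The sketch below shows one way to put the accuracy in context by comparing it with uniform random guessing over three classes. The numbers in `example_metrics` are hypothetical placeholders, not results from this notebook, and `summarize` is an illustrative helper.

```python
# Hypothetical values for illustration only; not results from this notebook.
example_metrics = {
    "eval_loss": 0.9,
    "eval_accuracy": 0.6,
    "eval_runtime": 42.0,
}

def summarize(metrics, num_labels=3):
    """Compare accuracy with the 1/num_labels score of uniform random guessing."""
    accuracy = metrics["eval_accuracy"]
    chance = 1 / num_labels
    return {"accuracy": accuracy, "chance": chance, "beats_chance": accuracy > chance}

summary = summarize(example_metrics)
print(summary)
```

In the notebook itself, the dictionary returned by trainer.evaluate() would take the place of `example_metrics`.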


These are the basic steps for fine-tuning an LLM. Fine-tuning is computationally demanding, which is why this walkthrough trains and evaluates on small subsets of the data.
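To give a rough sense of that compute cost, the estimate below counts the memory needed for the model state alone during training. It is a back-of-the-envelope sketch under stated assumptions: about 124 million parameters for GPT-2 small, float32 values throughout, and an Adam-style optimizer that keeps two extra values per parameter. Activations are not included, and they grow with batch size and sequence length.

```python
params = 124_000_000        # assumption: approximate size of GPT-2 small
bytes_per_value = 4         # assumption: float32 throughout
# Weights, gradients, and two optimizer moment estimates per parameter.
copies = 4

memory_gb = params * bytes_per_value * copies / 1e9
print(f"{memory_gb:.2f} GB before activations")
```

Larger models scale this figure proportionally, which is one reason techniques such as mixed precision and gradient accumulation are commonly used.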