# **Fine-Tuning Pre-Trained Models for Sentiment Classification**

#### **1. Title**
**Fine-Tuning Pre-Trained Models: A Comprehensive Guide with Sentiment Classification**

---

#### **2. Objective**
- To understand the concept of fine-tuning pre-trained language models.
- To explore why fine-tuning is critical for transfer learning in NLP tasks.
- To demonstrate the fine-tuning of a GPT-2 model for sentiment classification using the `mteb/tweet_sentiment_extraction` dataset.
- To evaluate the model’s performance and discuss its implications, advantages, and limitations.

---

#### **3. Metadata**
- **Author**: Your Name  
- **Date**: Current Date  
- **Frameworks**: PyTorch, Hugging Face Transformers, Datasets  
- **Dataset**: `mteb/tweet_sentiment_extraction`  

---

#### **4. Dataset Overview**
- The dataset contains tweets with sentiment labels: positive, neutral, and negative.
- **Columns**:
  - `text`: The tweet text.
  - `label`: Sentiment label (0: negative, 1: neutral, 2: positive).  
- **Objective**: Classify text into the appropriate sentiment category.
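
The label encoding listed above can be held in a pair of lookup tables. This is a minimal sketch; the names `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative and are not part of the dataset or of any library.

```python
# Label encoding from the dataset overview: 0 = negative, 1 = neutral, 2 = positive.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
# Inverse mapping, derived from the first so the two tables cannot drift apart.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def decode(label_id: int) -> str:
    """Return the sentiment name for an integer label."""
    return ID2LABEL[label_id]

print(decode(1))             # neutral
print(LABEL2ID["positive"])  # 2
```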

---

#### **5. Conceptual Overview**

##### **What is Fine-Tuning?**
Fine-tuning is the process of taking a pre-trained language model (such as GPT-2 or BERT) and training it further on a task-specific or domain-specific dataset.

##### **Why Do We Fine-Tune?**
- Pre-trained models are trained on massive generic corpora but may not be optimized for specific tasks.
- Fine-tuning leverages the pre-trained knowledge while adapting the model to solve task-specific problems.

##### **How Do We Fine-Tune?**
1. Load a pre-trained model and tokenizer.
2. Prepare the dataset for the specific task (e.g., classification, summarization).
3. Train the model on the task-specific data, starting from the pre-trained weights and updating them rather than training from scratch.
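
As a toy illustration of step 3, the sketch below uses a two-weight logistic model in NumPy instead of GPT-2. The "pre-trained" weights and the data are made up for the example; the point is only that training begins from existing weights, which gradient descent then adjusts for the new task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task data: 200 points with 2 features and a binary label.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def loss_and_grad(w, X, y):
    """Binary cross-entropy loss of a logistic model and its gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

# Hypothetical "pre-trained" weights: a starting point learned elsewhere.
pretrained_w = np.array([0.8, -0.2])
w = pretrained_w.copy()

initial_loss, _ = loss_and_grad(w, X, y)
for _ in range(100):            # a short fine-tuning run
    _, grad = loss_and_grad(w, X, y)
    w -= 0.5 * grad             # plain gradient descent step

final_loss, _ = loss_and_grad(w, X, y)
print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
print("weights moved by:", np.round(w - pretrained_w, 3))
```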


In [13]:
!pip install datasets
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("mteb/tweet_sentiment_extraction")
df = pd.DataFrame(dataset['train'])




### Code Explanation

1. **Install Library**  
   `!pip install datasets`: Installs the Hugging Face `datasets` library.

2. **Import Modules**  
   - `load_dataset`: Loads datasets from the Hugging Face repository.  
   - `pandas`: Used for data manipulation.

3. **Load Dataset**  
   `dataset = load_dataset("mteb/tweet_sentiment_extraction")`: Loads the `tweet_sentiment_extraction` dataset.

4. **Convert to DataFrame**  
   `df = pd.DataFrame(dataset['train'])`: Converts the training split of the dataset into a pandas DataFrame for easier manipulation.


In [3]:
df.head()

|   | id | text | label | label_text |
|---|---|---|---|---|
| 0 | cb774db0d1 | I`d have responded, if I were going | 1 | neutral |
| 1 | 549e992a42 | Sooo SAD I will miss you here in San Diego!!! | 0 | negative |
| 2 | 088c60f138 | my boss is bullying me... | 0 | negative |
| 3 | 9642c003ef | what interview! leave me alone | 0 | negative |
| 4 | 358bd9e861 | Sons of ****, why couldn`t they put them on t... | 0 | negative |
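
Before training, it can help to check how the labels are distributed. The sketch below counts labels over the five rows shown above using only the standard library; the `rows` list is copied from that output and is not the full dataset.

```python
from collections import Counter

# The five rows shown above, as (text, label, label_text) tuples.
rows = [
    ("I`d have responded, if I were going", 1, "neutral"),
    ("Sooo SAD I will miss you here in San Diego!!!", 0, "negative"),
    ("my boss is bullying me...", 0, "negative"),
    ("what interview! leave me alone", 0, "negative"),
    ("Sons of ****, why couldn`t they put them on t...", 0, "negative"),
]

# Count how often each sentiment name occurs.
label_counts = Counter(label_text for _, _, label_text in rows)
for name, count in label_counts.most_common():
    print(f"{name}: {count} ({count / len(rows):.0%})")
```

On the full DataFrame, `df["label_text"].value_counts()` gives the same kind of summary.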


In [4]:
from transformers import GPT2Tokenizer

# Reload the dataset as a DatasetDict so that every split can be tokenized
dataset = load_dataset("mteb/tweet_sentiment_extraction")

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 has no padding token by default, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Pad or truncate every tweet to the tokenizer's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/27481 [00:00<?, ? examples/s]

Map:   0%|          | 0/3534 [00:00<?, ? examples/s]

### Code Explanation

1. **Tokenizer Setup**  
   `from transformers import GPT2Tokenizer`: GPT-2 tokenizer is used to convert text into numerical input for the model.

2. **Load Dataset**  
   `dataset = load_dataset("mteb/tweet_sentiment_extraction")`: Loads a dataset for training a model on tweet sentiment extraction tasks.

3. **Load Tokenizer and Set Padding Token**  
   - `tokenizer = GPT2Tokenizer.from_pretrained("gpt2")`: Loads the pre-trained GPT-2 tokenizer, so inputs are encoded with the same vocabulary the model was pre-trained on.  
   - `tokenizer.pad_token = tokenizer.eos_token`: GPT-2 defines no padding token by default, so the end-of-sequence (EOS) token is reused for padding.

4. **Tokenize Text Data**  
   - `tokenize_function(examples)`: Tokenizes the `text` column. `padding="max_length"` pads every example to the tokenizer's maximum length (1024 tokens for GPT-2), and `truncation=True` cuts longer inputs, so all inputs end up with the same length.  

5. **Tokenize the Entire Dataset**  
   `tokenized_datasets = dataset.map(tokenize_function, batched=True)`: Tokenizes the dataset in batches for efficient preprocessing before training.
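
The effect of padding and truncation can be seen with a toy whitespace tokenizer. This sketch is not GPT-2's byte-pair encoding; `toy_encode`, `PAD_ID`, and the on-the-fly vocabulary are assumptions made for illustration only.

```python
PAD_ID = 0  # hypothetical padding id for this toy vocabulary

def toy_encode(text, vocab, max_length):
    """Whitespace-tokenize, truncate to max_length, then right-pad."""
    # Assign each new token the next free id (ids start at 1; 0 is padding).
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in text.split()]
    ids = ids[:max_length]                  # truncation
    mask = [1] * len(ids)                   # 1 marks real tokens
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad

vocab = {}
short_ids, short_mask = toy_encode("my boss is bullying me", vocab, max_length=8)
long_ids, long_mask = toy_encode("a b c d e f g h i j", vocab, max_length=8)
print(short_ids, short_mask)  # padded up to 8 positions
print(long_ids, long_mask)    # truncated down to 8 positions
```

The attention mask plays the same role as the `attention_mask` field returned by the real tokenizer: it tells the model which positions hold real tokens and which hold padding.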


In [5]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

### Code Explanation

1. **Create Smaller Training Dataset**  
   `small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))`:  
   - Shuffles the training dataset with a fixed seed (`42`) for reproducibility.  
   - Selects the first 1000 examples to create a smaller subset for quicker training or testing.

2. **Create Smaller Evaluation Dataset**  
   `small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))`:  
   - Shuffles the test dataset with the same seed for consistency.  
   - Selects the first 1000 examples for evaluation on a manageable dataset size.
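
The shuffle-then-select pattern can be sketched with the standard library. The helper `shuffled_subset` is hypothetical, and `random.Random` does not produce the same permutation as `datasets`' `shuffle`; the sketch only shows why a fixed seed makes the subset repeatable.

```python
import random

def shuffled_subset(n_rows, subset_size, seed):
    """Return subset_size row indices drawn after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the order is repeatable
    return indices[:subset_size]

# 27481 is the training-split size shown in the tokenization output.
first = shuffled_subset(27481, 1000, seed=42)
second = shuffled_subset(27481, 1000, seed=42)
other = shuffled_subset(27481, 1000, seed=7)
print(first == second)   # same seed, same subset
print(first == other)    # different seed, different subset
```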


In [6]:
from transformers import GPT2ForSequenceClassification

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Code Explanation

1. **Import Model**  
   `from transformers import GPT2ForSequenceClassification`: Imports GPT-2 specifically designed for sequence classification tasks.

2. **Initialize Model**  
   `model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=3)`:  
   - Loads a pre-trained GPT-2 model.  
   - Configures it for sequence classification with `num_labels=3`, meaning the model will classify inputs into one of three categories (e.g., positive, negative, neutral sentiment).
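
The `score.weight` parameter named in the warning above is the classification head: a linear layer without bias that maps a hidden state to one logit per class. The NumPy sketch below is a simplified stand-in, assuming random hidden states and using the attention mask to find the last real token; it does not reproduce the library's internal code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 3   # GPT-2 small hidden size; three sentiment classes

# Stand-in for the transformer output: (batch, sequence, hidden).
hidden_states = rng.normal(size=(2, 5, hidden_size))
# Attention mask: the second sequence ends with two padding positions.
attention_mask = np.array([[1, 1, 1, 1, 1],
                           [1, 1, 1, 0, 0]])

# Newly initialized head, analogous to `score.weight` (no bias term).
score_weight = rng.normal(scale=0.02, size=(num_labels, hidden_size))

# Index of the last real token in each sequence.
last_token = attention_mask.sum(axis=1) - 1
pooled = hidden_states[np.arange(2), last_token]   # (batch, hidden)
logits = pooled @ score_weight.T                   # (batch, num_labels)
print(logits.shape)
```

Because this head starts from random values, its outputs carry no meaning until the model has been trained, which is what the warning points out.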


In [7]:
!pip install evaluate



In [8]:
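import evaluate
import numpy as np  # needed for np.argmax in compute_metrics

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred unpacks into the model's logits and the reference labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)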

### Code Explanation

1. **Load Evaluation Metric**  
   `metric = evaluate.load("accuracy")`: Loads the accuracy metric to evaluate model performance by comparing predictions to true labels.

2. **Define Metric Computation Function**  
   `compute_metrics(eval_pred)`:  
   - Takes `eval_pred` as input, which contains model logits and true labels.  
   - `logits`: Raw predictions from the model.  
   - `labels`: True class labels.  
   - `predictions = np.argmax(logits, axis=-1)`: Converts logits to predicted class indices by selecting the class with the highest logit. Softmax preserves the ordering of logits, so this is also the most probable class.  
   - `metric.compute(predictions=predictions, references=labels)`: Computes accuracy by comparing predictions with the true labels and returns the result.
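
The same computation can be written out by hand with NumPy. The logits and labels below are made-up values for four examples, chosen only to show the argmax and the accuracy calculation.

```python
import numpy as np

# Hypothetical logits for four examples over (negative, neutral, positive).
logits = np.array([
    [2.1, 0.3, -1.0],
    [0.2, 1.5, 0.1],
    [-0.5, 0.0, 3.2],
    [1.0, 0.9, 0.8],
])
labels = np.array([0, 1, 2, 1])

predictions = np.argmax(logits, axis=-1)          # highest logit per row
accuracy = float((predictions == labels).mean())  # fraction of exact matches
print(predictions.tolist(), accuracy)             # [0, 1, 2, 0] 0.75
```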


In [10]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_trainer",
    # evaluation_strategy="epoch",
    per_device_train_batch_size=1,   # small batch to limit memory use
    per_device_eval_batch_size=1,    # small batch for evaluation as well
    gradient_accumulation_steps=4,   # effective training batch size of 4
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()


| Step | Training Loss |
|---|---|
| 500 | 0.916 |


TrainOutput(global_step=750, training_loss=0.7941213175455729, metrics={'train_runtime': 819.3669, 'train_samples_per_second': 3.661, 'train_steps_per_second': 0.915, 'total_flos': 1567794659328000.0, 'train_loss': 0.7941213175455729, 'epoch': 3.0})

### Code Explanation

1. **Import Training Modules**  
   `from transformers import TrainingArguments, Trainer`: These modules handle training and evaluation setups for Hugging Face models.

2. **Set Training Arguments**  
   `training_args = TrainingArguments(...)`: Configures various training parameters:  
   - `output_dir="test_trainer"`: Directory to save model checkpoints and outputs.  
   - `per_device_train_batch_size=1`: Sets the training batch size per device to 1 (helps with memory management).  
   - `per_device_eval_batch_size=1`: Sets the evaluation batch size per device to 1.  
   - `gradient_accumulation_steps=4`: Accumulates gradients over 4 steps before updating model weights to simulate a larger batch size.

3. **Initialize Trainer**  
   `trainer = Trainer(...)`: Creates a training instance with the following components:  
   - `model`: The GPT-2 model configured for sequence classification.  
   - `args`: Training arguments defined above.  
   - `train_dataset`: Smaller training dataset created earlier (`small_train_dataset`).  
   - `eval_dataset`: Smaller evaluation dataset (`small_eval_dataset`).  
   - `compute_metrics`: Custom function to evaluate model performance using accuracy.

4. **Train the Model**  
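   `trainer.train()`: Starts the training process using the specified model, datasets, and training configuration. Because `num_train_epochs` is not set, the default of 3 epochs applies, which matches `'epoch': 3.0` in the output.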
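
The step count reported in `TrainOutput` follows from the batch settings. The arithmetic below is a sketch that assumes a single device; since 1000 divides evenly by the effective batch size, rounding does not affect the result.

```python
import math

num_examples = 1000               # size of small_train_dataset
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_epochs = 3                    # Trainer default when num_train_epochs is not set
num_devices = 1                   # assumption: a single GPU or CPU

# One optimizer step consumes this many examples.
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(effective_batch, steps_per_epoch, total_steps)  # 4 250 750
```

The total of 750 agrees with `global_step=750` in the training output.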


In [12]:
import evaluate
import numpy as np

trainer.evaluate()


{'eval_loss': 0.9657433032989502,
 'eval_accuracy': 0.739,
 'eval_runtime': 76.5808,
 'eval_samples_per_second': 13.058,
 'eval_steps_per_second': 13.058,
 'epoch': 3.0}

### Code Explanation

1. **Import Libraries**  
   - `import evaluate`: Provides tools to load and compute evaluation metrics.  
   - `import numpy as np`: Needed by `compute_metrics`, which calls `np.argmax` during evaluation.

2. **Evaluate the Model**  
   `trainer.evaluate()`:  
   - Runs the evaluation process using the `Trainer` instance.  
   - Computes metrics (e.g., accuracy) by comparing model predictions with true labels from the evaluation dataset (`small_eval_dataset`).  
   - Returns the evaluation results, such as loss and accuracy, to assess model performance.
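
To put the reported numbers in context, the sketch below derives the count of correct predictions, a rough 95% interval for the accuracy, and the throughput. The interval uses the normal approximation to the binomial, which is an assumption of this example and not something the `Trainer` reports.

```python
import math

accuracy = 0.739        # eval_accuracy reported above
n = 1000                # size of small_eval_dataset
runtime_s = 76.5808     # eval_runtime reported above

correct = round(accuracy * n)
std_err = math.sqrt(accuracy * (1 - accuracy) / n)   # binomial standard error
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

print(f"correct predictions: {correct}")
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
print(f"throughput: {n / runtime_s:.3f} samples/s")
```

With only 1000 evaluation examples the interval spans several percentage points, so small differences in accuracy between runs should be read with caution.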


In [26]:
import torch
import numpy as np

# Example text to classify
test_text = ["The delivery was faster than expected, and the product arrived in perfect condition.The quality of the materials feels premium and worth the price. I reached out to customer support for a small clarification, and they responded promptly and professionally. This level of service is rare and highly appreciated. I would definitely shop here again and recommend it to others."

,"I was impressed with how seamless the entire process was from start to finish. The website was easy to navigate, and I found what I needed in minutes. The packaging was eco-friendly, which is a big plus for me. The product itself exceeded my expectations in both design and functionality. Overall, it was a fantastic experience."

,"I placed my order weeks ago, and it still hasn’t arrived. When I contacted customer service, they kept giving me generic responses without addressing my issue. The tracking information provided was inaccurate and unhelpful. I expected much better communication and efficiency from a company like this. This has been an incredibly frustrating experience."

,"The product did not match the description on the website at all. It was poorly made and looked cheap despite being expensive. When I tried to initiate a return, the process was overly complicated and unclear. I felt ignored and undervalued as a customer throughout the entire experience. I would not recommend this company to anyone."

,"The experience was average, meeting but not exceeding expectations."]

# Get the device the model is on (GPU if available, otherwise CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Move the model to the device
model.to(device)

# Tokenize the input text and move to the device
encoded_inputs = tokenizer(test_text, padding=True, truncation=True, return_tensors="pt").to(device)

# Tell the model which token id is padding so it can find the last real token of each sequence
model.config.pad_token_id = tokenizer.pad_token_id

# Perform inference
with torch.no_grad():  # Disable gradient computation for efficiency
    outputs = model(**encoded_inputs)
    logits = outputs.logits

# Convert logits to predicted classes
predicted_classes = np.argmax(logits.cpu().numpy(), axis=-1)  # Move logits to CPU

# Map predicted classes to sentiment labels (assuming 0: negative, 1: neutral, 2: positive)
sentiment_labels = ["negative", "neutral", "positive"]

# Print results
for i, text in enumerate(test_text):
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment_labels[predicted_classes[i]]}")
    print("-" * 20)

Text: The delivery was faster than expected, and the product arrived in perfect condition.The quality of the materials feels premium and worth the price. I reached out to customer support for a small clarification, and they responded promptly and professionally. This level of service is rare and highly appreciated. I would definitely shop here again and recommend it to others.
Sentiment: positive
--------------------
Text: I was impressed with how seamless the entire process was from start to finish. The website was easy to navigate, and I found what I needed in minutes. The packaging was eco-friendly, which is a big plus for me. The product itself exceeded my expectations in both design and functionality. Overall, it was a fantastic experience.
Sentiment: positive
--------------------
Text: I placed my order weeks ago, and it still hasn’t arrived. When I contacted customer service, they kept giving me generic responses without addressing my issue. The tracking information provided was

### Explanation of the Sentiment Classification Code

1. **Device Selection**:  
   The code determines whether a GPU (`cuda`) is available for faster computation. If not, it defaults to using the CPU. This ensures the model runs on the most efficient hardware available.

2. **Model Setup**:  
   The fine-tuned model is moved to the selected device (GPU or CPU) so that the model and its inputs are on the same device during inference.

3. **Input Tokenization**:  
   The input sentences (`test_text`) are tokenized using the same tokenizer used during training. Tokenization converts text into numerical token IDs, ensuring compatibility with the model. The padding ensures all inputs are of equal length, and truncation handles overly long inputs. The tokenized data is then moved to the selected device.

4. **Set Padding Token ID**:  
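   The `pad_token_id` is set in the model's configuration so that padding is handled correctly during inference. `GPT2ForSequenceClassification` classifies from the last non-padding token of each sequence and uses `pad_token_id` to find it; without this setting, batches containing more than one sequence raise an error.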

5. **Inference (Prediction)**:  
   Using `torch.no_grad()`, gradient computation is disabled to save memory and speed up the process. The tokenized inputs are passed through the model to obtain `logits`, which represent raw scores for each sentiment class.

6. **Prediction Conversion**:  
   The `logits` are moved back to the CPU and converted to a NumPy array. The `np.argmax` function is used to find the index of the highest score for each input, corresponding to the predicted sentiment class.

7. **Map Predictions to Sentiments**:  
   The predicted class indices are mapped to human-readable sentiment labels (`negative`, `neutral`, `positive`). The results are printed alongside the input text for interpretation.
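
When a confidence score is wanted alongside the predicted label, the logits can be passed through a softmax. The sketch below uses only the standard library; the logit values are hypothetical, and the `softmax` helper is written for this example.

```python
import math

SENTIMENT_LABELS = ["negative", "neutral", "positive"]

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input; real values come from outputs.logits.
logits = [-1.2, 0.3, 2.5]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)   # index of the largest probability
print(SENTIMENT_LABELS[best], [round(p, 3) for p in probs])
```

For tensors, `torch.softmax(logits, dim=-1)` performs the same conversion on the model output directly.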
