# Fine-tuning GPT-2 for Sentiment Detection

In this notebook, we will demonstrate how to fine-tune a pre-trained GPT-2 model for the downstream task of **Sentiment Analysis**.

## Steps:
1. Prepare and load the dataset
2. Tokenize the dataset
3. Define the model and training parameters
4. Fine-tune the model
5. Evaluate the model
6. Inference


In [1]:
!pip install --upgrade datasets -q

In [2]:
import os
os.environ["WANDB_DISABLED"] = "true"

## Step 1: Load and Prepare Dataset
We’ll use a simple sentiment dataset where each sample is labeled as positive (1) or negative (0).

In [3]:
from datasets import load_dataset

# Load IMDB dataset normally (non-streaming)
dataset = load_dataset("imdb", split="train")
dataset = dataset.train_test_split(test_size=0.1)

# Check the result
dataset

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 22500
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2500
    })
})

In [4]:
print(dataset['train']['text'][0])
print('Sentiment:',dataset['train']['label'][0])
print('------------------------------------------')
print(dataset['train']['text'][1])
print('Sentiment:',dataset['train']['label'][1])

THE MAN IN THE WHITE SUIT, like I'M ALL RIGHT JACK, takes a dim view of both labor and capital. Alec Guinness is a scientific genius - but an eccentric one (he has never gotten his university degree due to an...err...accident in a college laboratory). He manages to push himself into various industrial labs in the textile industry. When the film begins he is in Michael Gough's company, and Gough (in a memorable moment) is trying to impress his would-be father-in-law (Cecil Parker) by showing him the ship-shape firm he runs. While having lunch with Parker and Parker's daughter (Joan Greenwood), Gough gets a message regarding some problems about the lab's unexpectedly large budget problems. He reads the huge expenditures (due to Guinness's experiments), and chokes on his coffee.<br /><br />Guinness goes on to work at Parker's firm, and repeats the same tricks he did with Gough - but Parker discovers it too. Greenwood has discovered what Guinness is working on, and convinces Parker to cont

## Step 2: Tokenize the Dataset
We'll tokenize the dataset using GPT-2 tokenizer.

In [23]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(batch):
    return tokenizer(
        [f"Review: {text}\nSentiment:" for text in batch["text"]],
        padding="max_length",
        truncation=True,
        max_length=128
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)

## Step 3: Format Data for Training
We'll convert labels to be part of the input so GPT-2 learns to predict sentiment during generation.

In [12]:
# from transformers import AutoTokenizer
# import torch # Import torch
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# tokenizer.pad_token = tokenizer.eos_token

# def format_labels(example):
#     # Tokenize the full input including the prompt and the label
#     full_sequence = f"Review: {example['text']}\nSentiment: {example['label']}"
#     tokenized_input = tokenizer(
#         full_sequence,
#         padding="max_length",
#         truncation=True,
#         max_length=128,
#         return_attention_mask=True,
#         return_token_type_ids=False # GPT-2 doesn't use token_type_ids
#     )

#     # Tokenize just the prompt part to find its length
#     prompt_sequence = f"Review: {example['text']}\nSentiment:"
#     # Tokenize the prompt without padding/truncation to get the accurate prompt length in tokens
#     tokenized_prompt = tokenizer(
#         prompt_sequence,
#         add_special_tokens=False # Exclude special tokens for accurate length of the prompt text
#     )
#     prompt_length = len(tokenized_prompt["input_ids"])

#     # Create labels list and set initial values to ignore_index (-100)
#     labels = [-100] * 128 # Initialize labels with ignore_index, matching max_length

#     # Tokenize the sentiment label (with a leading space to ensure it's treated as a separate word)
#     label_tokens = tokenizer(f" {example['label']}", add_special_tokens=False)["input_ids"]

#     # Place the label tokens into the labels list, starting after the prompt
#     # Ensure we don't go out of bounds
#     for i, token_id in enumerate(label_tokens):
#          if prompt_length + i < 128:
#              labels[prompt_length + i] = token_id

#     # Update the example with the new input_ids, attention_mask, and labels
#     # Convert lists to PyTorch tensors
#     example["input_ids"] = torch.tensor(tokenized_input["input_ids"], dtype=torch.long)
#     example["attention_mask"] = torch.tensor(tokenized_input["attention_mask"], dtype=torch.long)
#     example["labels"] = torch.tensor(labels, dtype=torch.long) # Labels should be long tensor for cross_entropy

#     return example

# # Reapply the mapping with the corrected format_labels function
# # Remove batched=True since we are handling each example individually
# tokenized_datasets = dataset.map(format_labels)

# # Now, the tokenized_datasets["train"] and tokenized_datasets["test"]
# # should have 'input_ids', 'attention_mask', and 'labels' fields where
# # 'labels' is a tensor of token IDs with -100 for the prompt part and the
# # actual sentiment token IDs for the label part.

Map:   0%|          | 0/22500 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1219 > 1024). Running this sequence through the model will result in indexing errors


Map:   0%|          | 0/2500 [00:00<?, ? examples/s]

## Step 4: Define Model and Training Arguments
We'll fine-tune GPT-2 with Trainer API.

In [16]:
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="./results",
    # evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=2,
    logging_dir="./logs",
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer
)

model.resize_token_embeddings(len(tokenizer))  # Needed if pad_token was not in original tokenizer

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
  trainer = Trainer(


Embedding(50257, 768)

## Step 5: Train the Model

In [24]:
trainer.train()

ValueError: Expected input batch_size (512) to match target batch_size (4).

## Step 6: Inference
Let's test the fine-tuned model on new samples.

In [None]:
input_text = "Review: The movie was dull and boring.\nSentiment:"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))