# Fine-Tuning GPT-2 for Context-Aware Text Generation
**Internship Project – Sakshi Dhole**

**Objective:**  
To fine-tune a pre-trained GPT-2 model on a custom dataset to generate coherent and contextually relevant text.


## Step 1: Install Required Libraries
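Install Hugging Face Transformers and Datasets, together with PyTorch. Accelerate is included because recent Transformers releases require it for the PyTorch `Trainer`.


In [None]:
!pip install transformers datasets torch accelerate --quiet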


## Step 2: Import Libraries
Import all necessary libraries for model loading, training, and dataset handling.


In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling, pipeline
from datasets import Dataset
import torch


## Step 3: Load GPT-2 Model & Tokenizer
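We use `distilgpt2`, a smaller distilled version of GPT-2.
GPT-2 has no padding token, so we reuse the end-of-sequence token for padding. Because this adds no new token to the vocabulary, `resize_token_embeddings` leaves the embedding matrix unchanged here; the call only matters when new tokens are added.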


In [None]:
# Load GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")

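# Reuse the end-of-sequence token as the pad token (GPT-2 defines none)
tokenizer.pad_token = tokenizer.eos_token
# No new tokens were added, so this resize does not change the embedding size
model.resize_token_embeddings(len(tokenizer))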


## Step 4: Create Custom Dataset
We create a toy dataset of five sentences for demonstration; a real fine-tuning run needs far more text.


In [None]:
data = {
    "text": [
        "Hello, how are you?",
        "Artificial Intelligence is transforming the world.",
        "Machine learning allows computers to learn from data.",
        "Deep learning is a subset of machine learning.",
        "Natural language processing helps machines understand text."
    ]
}

dataset = Dataset.from_dict(data)
dataset


## Step 5: Tokenize Dataset
We convert the text into token IDs that GPT-2 can process.
Each sequence is truncated or padded to exactly 128 tokens.


In [None]:
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset
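
The effect of `truncation=True` and `padding="max_length"` can be illustrated without the tokenizer. The following is a minimal pure-Python sketch, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, and right-side padding with GPT-2's end-of-text ID (50256) is assumed.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # truncation: drop anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad              # padding: fill the remainder on the right
    mask += [0] * n_pad                  # 0 marks a padding position
    return ids, mask


PAD_ID = 50256  # assumed: GPT-2's end-of-text token id, reused as the pad token

# Illustrative token ids only; real ids come from the tokenizer.
short_ids, short_mask = pad_or_truncate([11, 22, 33], max_length=5, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(list(range(8)), max_length=5, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

A short sequence gains padding positions with a zero attention mask, while a long one is simply cut off. In the notebook the same idea applies with `max_length=128`.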


## Step 6: Prepare Data Collator
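We use `DataCollatorForLanguageModeling` to assemble batches and create the training labels.
Setting `mlm=False` selects causal (next-token) language modeling, which is how GPT-2 is trained, rather than masked language modeling.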


In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)
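
To make the causal objective concrete, here is a rough pure-Python sketch of what happens to one padded sequence. It assumes the collator copies `input_ids` into `labels` and replaces pad positions with `-100` (the value the loss ignores), and that the model shifts the labels by one position internally. The helper names and token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def causal_lm_labels(input_ids, pad_id):
    """Copy the inputs as labels, masking pad positions so they add no loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]


def next_token_pairs(input_ids, labels):
    """Pair each token with the label one step ahead, skipping ignored targets."""
    return [
        (current, target)
        for current, target in zip(input_ids[:-1], labels[1:])
        if target != IGNORE_INDEX
    ]


PAD_ID = 50256  # assumed: end-of-text id reused as the pad token
input_ids = [11, 22, 33, PAD_ID, PAD_ID]  # illustrative ids, already padded

labels = causal_lm_labels(input_ids, PAD_ID)
pairs = next_token_pairs(input_ids, labels)

print(labels)
print(pairs)
```

Each pair reads as "given this token (and everything before it), predict the next one". One consequence of reusing the end-of-text token for padding is that any genuine end-of-text tokens in the data would be masked out of the loss as well.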


## Step 7: Set Training Parameters
Define the training parameters such as number of epochs, batch size, output directory, and logging.


In [None]:
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=10,
    per_device_train_batch_size=2,
    save_steps=500,
    save_total_limit=2,
    logging_steps=10,
    report_to="none"
)
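
It helps to check how these settings interact with a five-example dataset. The sketch below is a back-of-the-envelope calculation, assuming a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper, not part of the `Trainer` API.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Estimate optimizer steps for a run (single device, no accumulation)."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # keep the partial batch
    return batches_per_epoch * epochs


steps = total_optimizer_steps(num_examples=5, batch_size=2, epochs=10)
periodic_logs = steps // 10        # logging_steps=10
interval_checkpoints = steps // 500  # save_steps=500

print(steps, periodic_logs, interval_checkpoints)
```

Under these assumptions the run lasts 30 steps, so the loss is logged three times and the `save_steps=500` interval is never reached during training.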


## Step 8: Train GPT-2 Model
We use the Hugging Face `Trainer` API to fine-tune the model on our dataset.


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)

trainer.train()


## Step 9: Save Fine-Tuned Model
After training, we save the model and tokenizer to a local directory, then compress that directory into a zip file for submission.


In [None]:
# Save model and tokenizer
model.save_pretrained("fine_tuned_gpt2")
tokenizer.save_pretrained("fine_tuned_gpt2")

# Create a zip file
!zip -r fine_tuned_gpt2.zip fine_tuned_gpt2
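
The `!zip` command depends on a `zip` binary being available in the environment. As a possible alternative, the standard library can build the same archive. This is a sketch only: `zip_directory` is a hypothetical helper, and the demo runs on a throwaway folder with a placeholder file rather than a real checkpoint.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_directory(directory):
    """Create <directory>.zip next to the directory and return the archive path."""
    directory = Path(directory).resolve()
    archive = shutil.make_archive(
        base_name=str(directory),         # archive path without the .zip suffix
        format="zip",
        root_dir=str(directory.parent),   # paths in the archive are relative to this
        base_dir=directory.name,          # keep the top-level folder in the archive
    )
    return Path(archive)


# Demo on a temporary folder; config.json here is an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "fine_tuned_gpt2"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")

    archive_path = zip_directory(model_dir)
    archive_name = archive_path.name
    with zipfile.ZipFile(archive_path) as zf:
        archived_files = zf.namelist()

print(archive_name, archived_files)
```

In the notebook, calling `zip_directory("fine_tuned_gpt2")` after the two `save_pretrained` calls would be expected to produce `fine_tuned_gpt2.zip` in the working directory.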


## Step 10: Generate Text Using Fine-Tuned Model
We can now prompt the model and check whether its output is coherent and resembles our dataset. Note that `max_length=50` caps the total length in tokens, prompt included.


In [None]:
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

prompt = "Artificial Intelligence is"
output = generator(prompt, max_length=50, num_return_sequences=1)

print(output[0]["generated_text"])


## Summary
- Fine-tuned GPT-2 (`distilgpt2`) on a small, five-sentence custom dataset
- Generated text from a prompt with the fine-tuned model
- Model and tokenizer saved as `fine_tuned_gpt2.zip`
- Ready for submission as an internship project
