# Transformers and Large Language Models (LLMs)

This notebook gives an overview of the Transformer architecture and works through three practical examples with Hugging Face models: fine-tuning **BERT** for text classification, and running pre-trained **T5** and **DistilBERT** checkpoints for summarization and question answering.

You’ll learn:
- The core concepts behind the Transformer architecture (Attention, Encoder-Decoder)
- How to use Hugging Face Transformers for various NLP tasks
- How to fine-tune a pre-trained model with a manual PyTorch training loop, without `TrainingArguments` or the `Trainer` API

## Prerequisites
Before proceeding, ensure you understand:
- NLP basics: tokenization, embeddings, and vectorization
- Basic PyTorch or TensorFlow concepts
- The importance of pre-trained language models

We'll use the **Hugging Face Transformers library** with **PyTorch backend**.


In [1]:
import torch
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    T5Tokenizer, T5ForConditionalGeneration,
    DistilBertTokenizer, DistilBertForQuestionAnswering
)
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.optim import AdamW
from tqdm.auto import tqdm
import numpy as np

# Understanding the Transformer Architecture

The **Transformer** is the foundation of modern LLMs like BERT, GPT, and T5.

### Key Components:
1. **Self-Attention Mechanism**
   - Each token attends to every token in the sequence, itself included.
   - Captures contextual relationships regardless of how far apart the tokens are.

2. **Multi-Head Attention**
   - Runs multiple attention operations in parallel for richer feature extraction.

3. **Positional Encoding**
   - Adds information about word order since Transformers don’t have recurrence.

4. **Feed-Forward Networks**
   - Processes information independently at each position after attention.

5. **Encoder & Decoder Blocks**
   - **BERT** uses **only Encoder** (bidirectional context)
   - **GPT** uses **only Decoder** (autoregressive)
   - **T5** uses **Encoder + Decoder** (seq2seq tasks)
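
To make the first and third components concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention applied to embeddings that carry sinusoidal positional encodings. It is an illustration under stated assumptions, not how any particular model implements it: the helper names, the tiny dimensions, and the random weights are all made up for the example, and BERT in particular learns its position embeddings rather than using the fixed sinusoids shown here.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def sinusoidal_positions(seq_len, d_model):
    # Fixed sinusoidal encoding; assumes d_model is even
    positions = np.arange(seq_len)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

def self_attention(x, w_q, w_k, w_v):
    # Project the same input into queries, keys, and values
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Scale by sqrt(d_k) so the scores do not grow with the dimension
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # one row of attention weights per token
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # illustrative sizes only
token_embeddings = rng.normal(size=(seq_len, d_model))
x = token_embeddings + sinusoidal_positions(seq_len, d_model)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)          # (4, 8)
print(weights.sum(axis=-1))  # every row sums to 1 (up to rounding)
```

Multi-head attention repeats `self_attention` with several independent sets of projection matrices and concatenates the results.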

# Section 1: Fine-Tuning BERT for Text Classification

We’ll fine-tune `bert-base-uncased` on a 2,000-review subset of the IMDB sentiment dataset.

Steps:
1. Load dataset and tokenizer
2. Tokenize text
3. Create PyTorch DataLoader
4. Fine-tune BERT manually using an optimizer
5. Evaluate the fine-tuned model


In [2]:
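# The IMDB train split is ordered by label, so an unshuffled slice such as
# "train[:2000]" would contain only negative reviews. Shuffle before subsetting.
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(2000))
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize_fn(example):
    # Pad or truncate every review to exactly 128 tokens so batches stack cleanly
    return tokenizer(example["text"], padding="max_length", truncation=True, max_length=128)

tokenized_data = dataset.map(tokenize_fn, batched=True)
# Return PyTorch tensors and keep only the columns the model needs
tokenized_data.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

dataloader = DataLoader(tokenized_data, batch_size=8, shuffle=True)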
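
The `padding="max_length"`, `truncation=True`, and `max_length=128` arguments make every example the same length. The pure-Python sketch below shows roughly what that means for `input_ids` and `attention_mask`. The helper `pad_or_truncate` and the token ids are hypothetical, and the sketch ignores special tokens such as `[CLS]` and `[SEP]`, which the real tokenizer inserts and preserves.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Hypothetical helper: force a list of token ids to exactly max_length."""
    ids = list(token_ids)[:max_length]      # truncation: drop anything past max_length
    num_padding = max_length - len(ids)     # padding: fill the remainder with pad_id
    return {
        "input_ids": ids + [pad_id] * num_padding,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * num_padding,
    }

short_example = pad_or_truncate([5, 6, 7], max_length=5)
long_example = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8, 9], max_length=5)

print(short_example)  # {'input_ids': [5, 6, 7, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long_example)   # {'input_ids': [1, 2, 3, 4, 5], 'attention_mask': [1, 1, 1, 1, 1]}
```

The attention mask is what lets the model skip the padded positions, which is why the training loop passes it alongside `input_ids`.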


## Load the Model and Optimizer

We load the pre-trained `bert-base-uncased` weights with a new two-label classification head for binary sentiment (`num_labels=2`), create an `AdamW` optimizer, and move the model to the GPU when one is available.

In [3]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=2e-5)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

In [5]:
model.train()
epochs = 1

for epoch in range(epochs):
    loop = tqdm(dataloader, leave=True)
    for batch in loop:
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["label"].to(device)
        
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())


  0%|          | 0/250 [00:00<?, ?it/s]
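
When integer class labels are passed via `labels=...`, the `outputs.loss` used in the loop should be the standard cross-entropy between the logits and the labels. The sketch below checks that relationship on made-up numbers; the logits and labels are illustrative values, not outputs of the model above.

```python
import torch
import torch.nn.functional as F

# Illustrative logits for three examples and two classes (not real model outputs)
toy_logits = torch.tensor([[2.0, -1.0], [0.5, 1.5], [0.0, 0.0]])
toy_labels = torch.tensor([0, 1, 1])

# Library version: softmax + negative log-likelihood, averaged over the batch
library_loss = F.cross_entropy(toy_logits, toy_labels)

# Manual version: take the log-probability assigned to the correct class
log_probs = F.log_softmax(toy_logits, dim=1)
manual_loss = -log_probs[torch.arange(len(toy_labels)), toy_labels].mean()

print(library_loss.item(), manual_loss.item())  # the two values agree
```

Calling `loss.backward()` differentiates this scalar with respect to every trainable weight, and `optimizer.step()` applies the resulting update.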

## Evaluate the Model
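Let's run the fine-tuned model on a single unseen review as a quick sanity check. One example is not a reliable measure of quality; a proper evaluation computes accuracy over a held-out set.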


In [6]:
model.eval()
sample = "The movie was absolutely wonderful, full of great performances."
inputs = tokenizer(sample, return_tensors="pt", truncation=True, padding=True).to(device)
with torch.no_grad():
    outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1).item()

print("Predicted label:", "Positive" if prediction == 1 else "Negative")


Predicted label: Negative
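
A single prediction says little about overall quality, so it helps to measure accuracy over many held-out examples. The function below is one possible sketch: it assumes batches shaped like the ones produced by the `DataLoader` above (`input_ids`, `attention_mask`, `label`) and a model whose output has a `.logits` attribute. To keep the example runnable without downloading anything, it is demonstrated on a hypothetical `StubClassifier` and hand-made batches.

```python
import torch
from types import SimpleNamespace

def evaluate_accuracy(model, dataloader, device):
    """Fraction of examples whose argmax prediction matches the label."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["label"].to(device)
            logits = model(input_ids, attention_mask=attention_mask).logits
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return correct / total

class StubClassifier(torch.nn.Module):
    """Hypothetical stand-in for BERT: predicts 1 when the first token id is even."""
    def forward(self, input_ids, attention_mask=None):
        is_even = (input_ids[:, 0] % 2 == 0).float()
        return SimpleNamespace(logits=torch.stack([1 - is_even, is_even], dim=1))

mask = torch.ones(2, 2, dtype=torch.long)
stub_batches = [
    {"input_ids": torch.tensor([[2, 5], [3, 7]]), "attention_mask": mask, "label": torch.tensor([1, 0])},
    {"input_ids": torch.tensor([[4, 1], [6, 9]]), "attention_mask": mask, "label": torch.tensor([1, 0])},
]

stub_accuracy = evaluate_accuracy(StubClassifier(), stub_batches, torch.device("cpu"))
print(stub_accuracy)  # 0.75 (three of the four stub predictions are correct)
```

In the notebook itself, the same function could be called with the fine-tuned `model`, `device`, and a `DataLoader` built from a shuffled subset of `load_dataset("imdb", split="test")` tokenized with `tokenize_fn`.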


# Section 2: Text Summarization using T5

The **T5 model (Text-To-Text Transfer Transformer)** treats every NLP task as a text-to-text problem.
We’ll use `t5-small` to generate summaries for input text.

Steps:
1. Load tokenizer and model
2. Prepare input text with the prefix `"summarize: "`
3. Generate the summary with beam search decoding (`num_beams=4`)


In [7]:
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")
t5_model = T5ForConditionalGeneration.from_pretrained("t5-small").to(device)

text = """
Artificial Intelligence (AI) is transforming industries by automating processes, 
enhancing decision-making, and unlocking new levels of efficiency.
"""

input_text = "summarize: " + text
inputs = t5_tokenizer(input_text, return_tensors="pt", truncation=True, padding=True).to(device)

summary_ids = t5_model.generate(
    **inputs,  # pass input_ids together with the attention_mask
    num_beams=4,
    max_length=60,
    early_stopping=True
)

summary = t5_tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:\n", summary)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


Summary:
 AI is transforming industries by automating processes, enhancing decision-making, and unlocking new levels of efficiency.
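
The `num_beams=4` argument switches decoding from greedy search to beam search. The toy sketch below illustrates the difference: greedy decoding commits to the single most likely token at each step, while beam search keeps several partial sequences alive and can find a sequence with higher overall probability. The probability table `TOY_MODEL` and the function `beam_search` are hypothetical, and the sketch leaves out details that `generate` handles, such as end-of-sequence tokens, length penalties, and early stopping.

```python
import math

# Hypothetical next-token probabilities, keyed by the prefix generated so far
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.4, "y": 0.35, "z": 0.25},
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},
}

def beam_search(model, num_beams, length):
    """Return the best (sequence, log_probability) after `length` steps."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences
        candidates.sort(key=lambda candidate: candidate[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

# With a single beam the procedure reduces to greedy decoding
greedy_seq, greedy_logp = beam_search(TOY_MODEL, num_beams=1, length=2)
beam_seq, beam_logp = beam_search(TOY_MODEL, num_beams=2, length=2)

print(greedy_seq, round(math.exp(greedy_logp), 2))  # ('A', 'x') 0.24
print(beam_seq, round(math.exp(beam_logp), 2))      # ('B', 'x') 0.36
```

Greedy decoding picks `A` first because it is locally more likely, then cannot recover; keeping two beams lets the search discover that `B` followed by `x` is the more probable sequence overall.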


# Section 3: Question Answering using DistilBERT

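DistilBERT is a distilled, lightweight version of BERT. According to its model card, it has about 40% fewer parameters than `bert-base-uncased` and runs faster while preserving over 95% of BERT's performance on the GLUE benchmark.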

We’ll use `distilbert-base-uncased-distilled-squad` for extractive question answering.


In [8]:
from transformers import pipeline

qa_model = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

context = """
The Eiffel Tower is a wrought-iron lattice tower in Paris, France. 
It was constructed from 1887 to 1889 and named after engineer Gustave Eiffel.
"""

question = "When was the Eiffel Tower constructed?"

result = qa_model(question=question, context=context)
print(f"Answer: {result['answer']} (score: {result['score']:.3f})")





Device set to use cpu


Answer: 1887 to 1889 (score: 0.596)
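
Extractive QA models output a start score and an end score for every token, and the answer is the span whose start and end pair scores highest. The NumPy sketch below shows that selection step on made-up logits. The token list, the logit values, and the helper `best_span` are illustrative assumptions; the actual pipeline also deals with sub-word tokens, the question segment, and long contexts, so its exact scoring may differ.

```python
import numpy as np

def softmax(values):
    exp = np.exp(values - np.max(values))
    return exp / exp.sum()

def best_span(start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) pair with the highest joint probability."""
    start_probs = softmax(np.asarray(start_logits, dtype=float))
    end_probs = softmax(np.asarray(end_logits, dtype=float))
    scores = np.outer(start_probs, end_probs)  # scores[i, j] = P(start=i) * P(end=j)
    scores = np.triu(scores)                        # the end cannot come before the start
    scores = np.tril(scores, k=max_answer_len - 1)  # cap the span length
    start, end = np.unravel_index(np.argmax(scores), scores.shape)
    return int(start), int(end), float(scores[start, end])

# Made-up tokens and logits that mimic the Eiffel Tower example
toy_tokens = ["it", "was", "constructed", "from", "1887", "to", "1889", "and", "named"]
toy_start_logits = [0.1, 0.0, 0.2, 0.5, 6.0, 0.3, 1.0, 0.0, 0.1]
toy_end_logits = [0.0, 0.1, 0.0, 0.2, 1.5, 0.4, 6.5, 0.1, 0.0]

start, end, score = best_span(toy_start_logits, toy_end_logits)
print(" ".join(toy_tokens[start:end + 1]))  # 1887 to 1889
```

Because the score is a product of two probabilities, it is at most 1, and values well below 1 are common even for correct answers.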


# Summary

In this notebook, you:
- Learned the fundamentals of the Transformer architecture.
- Fine-tuned **BERT** manually for sentiment classification.
- Used **T5** for text summarization.
- Applied **DistilBERT** for extractive question answering.

### Key Takeaways
- Transformers rely on **self-attention** for context understanding.
- Pre-trained models like BERT and T5 can be fine-tuned for multiple NLP tasks.
- Manual training gives deeper insight into model optimization steps.

Next Steps:
- Experiment with your own datasets.
- Explore more advanced models (RoBERTa, GPT-Neo, Falcon, Mistral).
- Integrate these models into Retrieval-Augmented Generation (RAG) pipelines.
