## Environment Setup

In [1]:
!pip install evaluate -q

In [2]:
import torch
import numpy as np
import evaluate
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    DataCollatorForLanguageModeling
)

# Device Setup (Use GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


## Task 1: Natural Language Inference (NLI) with Encoder Models

### 1.1. Loading the MultiNLI Dataset

MultiNLI (MNLI) is part of the GLUE benchmark. Because the full training split is large (roughly 393,000 sentence pairs), we select a random subset of 5,000 training examples and 500 examples from the matched validation split so that training finishes quickly in a Google Colab session.

In [22]:
# Load the MNLI dataset from GLUE
raw_datasets = load_dataset("glue", "mnli")

# Select a small subset for training efficiency
train_dataset = raw_datasets["train"].shuffle(seed=42).select(range(5000))
val_dataset = raw_datasets["validation_matched"].shuffle(seed=42).select(range(500))

# Display data samples
print("Sample Premise:", train_dataset[0]['premise'])
print("Sample Hypothesis:", train_dataset[0]['hypothesis'])
print("Label (0: Entailment, 1: Neutral, 2: Contradiction):", train_dataset[0]['label'])

Sample Premise: I'll hurry over that part.
Sample Hypothesis: "I'll be quick with that part."
Label (0: Entailment, 1: Neutral, 2: Contradiction): 0


### 1.2. Tokenizing Sentence Pairs

Encoder models like BERT and DistilBERT process both sentences in a single pass. The tokenizer merges the premise and hypothesis into one sequence, separated by the special [SEP] token. We set truncation=True and max_length=128, so longer pairs are cut to 128 tokens (well under the model's 512-token limit) and shorter pairs are padded to the same length.
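To make the pair layout concrete, here is a toy sketch of how a premise and hypothesis end up in one fixed-length sequence. It splits on whitespace instead of using DistilBERT's WordPiece vocabulary, and the helper name `encode_pair` is made up for illustration; the truncation loop only approximates the tokenizer's default longest-first strategy.

```python
def encode_pair(premise, hypothesis, max_length=16):
    """Toy stand-in for tokenizer(premise, hypothesis, truncation=True, padding="max_length")."""
    a = premise.lower().split()      # whitespace "tokens", not real WordPiece pieces
    b = hypothesis.lower().split()

    # Reserve room for [CLS] and the two [SEP] tokens, then trim the longer side first
    budget = max_length - 3
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()

    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    attention_mask = [1] * len(tokens)

    # Pad to a fixed length; padded positions get attention_mask = 0
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad
    attention_mask += [0] * pad
    return tokens, attention_mask


tokens, mask = encode_pair("I'll hurry over that part.", "I'll be quick with that part.")
print(tokens)
print(mask)
```

The attention mask is what lets the model ignore the padded tail, which is why padding every pair to the same length does not change the prediction.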

In [23]:
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def preprocess_function(examples):
    # Combine two sentences as a single input for the model
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

# Apply tokenization across the dataset
tokenized_train = train_dataset.map(preprocess_function, batched=True)
tokenized_val = val_dataset.map(preprocess_function, batched=True)


### 1.3. Initializing the Sequence Classification Model

We load a pre-trained DistilBERT model. Since MNLI has three categories, we set num_labels=3. The model uses the hidden state of the [CLS] token as the representation for the entire sentence pair to perform classification.
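As a rough illustration of what the classification head's output means, the sketch below turns three logits into probabilities with a softmax, picks the arg-max label, and computes accuracy as the fraction of arg-max predictions that match the references. The logits are invented numbers, not model outputs, and the helper names are illustrative.

```python
import math

LABELS = ["Entailment", "Neutral", "Contradiction"]  # MNLI label order used in this notebook

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Index of the largest logit (same result as arg-max over the softmax)."""
    return max(range(len(logits)), key=lambda i: logits[i])

def accuracy(batch_logits, references):
    """Fraction of examples whose arg-max prediction equals the reference label."""
    hits = sum(predict(l) == y for l, y in zip(batch_logits, references))
    return hits / len(references)

# Hypothetical logits for three sentence pairs and their gold labels
batch = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3], [-0.5, 0.2, 0.1]]
refs = [0, 1, 2]

for logits in batch:
    probs = [round(p, 3) for p in softmax(logits)]
    print(LABELS[predict(logits)], probs)
print("accuracy:", accuracy(batch, refs))
```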

In [24]:
# Load the model with a classification head for 3 labels
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=3)
model.to(device)

# Load the accuracy metric
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



### 1.4. Trainer API and Hyperparameter Configuration

We define the training parameters. A learning rate of $2 \times 10^{-5}$ is a common choice for fine-tuning BERT-style encoders. Weight decay acts as a mild regularizer that helps limit overfitting.
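For a sense of scale, the sketch below works out how many optimizer steps this setup produces (5,000 examples, batch size 16, 3 epochs) and how the learning rate changes over them. It assumes the Trainer's default linear decay with no warmup, since the notebook sets neither explicitly; the helper names are illustrative.

```python
import math

def total_optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Number of parameter updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs

def linear_lr(step, total_steps, base_lr=2e-5):
    """Learning rate under linear decay to zero with no warmup (assumed schedule)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)

steps = total_optimizer_steps(5000, 16, 3)
print(steps)  # 939

for s in (0, steps // 2, steps):
    print(s, linear_lr(s, steps))
```

The step count of 939 agrees with the global_step reported in the training output.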

In [25]:
training_args = TrainingArguments(
    output_dir="./distilbert-mnli-results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,  # replaces the deprecated tokenizer argument
    compute_metrics=compute_metrics,
)

# Start Fine-Tuning
print("Starting Training...")
trainer.train()



Starting Training...


Epoch,Training Loss,Validation Loss,Accuracy
1,0.8848,0.855824,0.626
2,0.7352,0.827886,0.65
3,0.6062,0.842772,0.652


TrainOutput(global_step=939, training_loss=0.7795832677644154, metrics={'train_runtime': 78.974, 'train_samples_per_second': 189.936, 'train_steps_per_second': 11.89, 'total_flos': 496761603840000.0, 'train_loss': 0.7795832677644154, 'epoch': 3.0})

In [10]:
# Evaluate on the validation set
results = trainer.evaluate()
print(f"Final Evaluation Accuracy: {results['eval_accuracy']*100:.2f}%")

Final Evaluation Accuracy: 63.60%


### 1.5. Evaluation and Custom Inference

After training, we evaluate the model on the validation set and create a function to test the model with new, unseen sentence pairs to verify its logical understanding.

In [12]:
# Custom Prediction Function
def predict_nli(premise, hypothesis):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    prediction = torch.argmax(logits, dim=-1).item()
    labels = ["Entailment", "Neutral", "Contradiction"]
    return labels[prediction]

# Practical Test Case
p = "A man is playing soccer."
h = "A man is outside on a field."

print(f"Premise: {p}\nHypothesis: {h}\nPrediction: {predict_nli(p, h)}")

Premise: A man is playing soccer.
Hypothesis: A man is outside on a field.
Prediction: Neutral


## Task 2: Generative Question Answering with Seq2Seq Models

### 2.1. Loading the SQuAD Dataset

SQuAD (the Stanford Question Answering Dataset) is one of the most widely used benchmarks for question answering. We use the first 5,000 training examples so that training stays fast in Google Colab while still being large enough to learn from.

In [26]:
# Load the SQuAD dataset
raw_squad = load_dataset("squad", split="train[:5000]")

# Split into training and validation sets (90% train, 10% validation)
squad_dataset = raw_squad.train_test_split(test_size=0.1, seed=42)  # fixed seed for a reproducible split

print(f"Training samples: {len(squad_dataset['train'])}")
print(f"Validation samples: {len(squad_dataset['test'])}")

Training samples: 4500
Validation samples: 500


### 2.2. Preprocessing and Tokenization (Text-to-Text Paradigm)

T5 casts every task as text-to-text, so the task is signalled in the input string itself. Here we mark the two parts of the input with the prefixes "question: " and "context: ". In Seq2Seq models the labels are text too, so the answers must also be tokenized into IDs.
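One detail worth keeping in mind when labels are padded to a fixed length: PyTorch's cross-entropy skips positions labelled -100 by default, but it does score positions labelled with the pad ID. The toy calculation below uses made-up token IDs and per-token probabilities to show how easy-to-predict padding can pull the average loss toward zero.

```python
import math

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy

def mean_nll(token_probs, labels):
    """Average negative log-likelihood over positions whose label is not ignored."""
    losses = [-math.log(p) for p, y in zip(token_probs, labels) if y != IGNORE_INDEX]
    return sum(losses) / len(losses)

# Hypothetical example: 3 real answer tokens, each predicted with probability 0.5,
# followed by 125 padding positions that the model predicts almost perfectly.
probs = [0.5] * 3 + [0.999] * 125
labels_padded = [11, 12, 1] + [0] * 125             # pad ID 0 is scored by the loss
labels_masked = [11, 12, 1] + [IGNORE_INDEX] * 125  # padding is skipped

print("loss with padding scored: ", round(mean_nll(probs, labels_padded), 4))
print("loss with padding ignored:", round(mean_nll(probs, labels_masked), 4))
```

With padding scored, the reported loss mostly reflects how well the model predicts pad tokens rather than how well it predicts the answer.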

In [27]:
model_checkpoint = "t5-small" # Using the 'small' version for Colab efficiency
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def preprocess_squad(examples):
    # Format input as a single string: "question: [Q] context: [C]"
    inputs = [f"question: {q} context: {c}" for q, c in zip(examples["question"], examples["context"])]

    # Tokenize the Inputs
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # Tokenize the Targets (Answers)
    # T5 expects labels as token IDs of the target answer text
    labels = tokenizer(
        text_target=[a["text"][0] for a in examples["answers"]],
        max_length=128,
        truncation=True,
        padding="max_length"
    )

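    # Replace padding IDs with -100 so the loss ignores padded label positions
    model_inputs["labels"] = [
        [(tok if tok != tokenizer.pad_token_id else -100) for tok in seq]
        for seq in labels["input_ids"]
    ]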
    return model_inputs

# Map the preprocessing function to the dataset
tokenized_squad = squad_dataset.map(preprocess_squad, batched=True)


### 2.3. Model Configuration and Training Arguments

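We load T5 with AutoModelForSeq2SeqLM. We use the eval_strategy parameter (the current name for evaluation_strategy) and enable predict_with_generate, which allows evaluation to call generate() and score the generated text instead of only computing the loss. Because no compute_metrics function is passed to the trainer in this notebook, the evaluation table still reports only the loss.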

In [28]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
model.to(device)

# Configure Training Parameters
training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-squad-results",
    eval_strategy="epoch",        # Using the updated naming convention
    save_strategy="epoch",
    learning_rate=3e-4,           # T5 typically requires a higher learning rate than BERT
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    predict_with_generate=True,   # Essential for generative evaluation
    fp16=torch.cuda.is_available(),  # Mixed precision speeds up training on GPU
)

### 2.4. Initializing the Seq2SeqTrainer

We initialize the Seq2SeqTrainer. Note the processing_class=tokenizer argument, which replaces the deprecated tokenizer argument in recent versions of Transformers.

In [18]:
# Initialize the Seq2Seq Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    processing_class=tokenizer, # Fixes the FutureWarning
)

# Start training process
print("Starting T5 Fine-tuning on SQuAD dataset...")
trainer.train()

Starting T5 Fine-tuning on SQuAD dataset...


Epoch,Training Loss,Validation Loss
1,0.3241,0.011864
2,0.0133,0.011164


TrainOutput(global_step=1126, training_loss=0.1511252816786351, metrics={'train_runtime': 237.6143, 'train_samples_per_second': 37.877, 'train_steps_per_second': 4.739, 'total_flos': 1218076213248000.0, 'train_loss': 0.1511252816786351, 'epoch': 2.0})

### 2.5. Model Inference (Testing)

Once trained, we can test the model by providing a new context and question. The model will use its decoder to generate the answer tokens, which we then decode into human-readable text.
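The generation loop can be sketched as follows, assuming plain greedy decoding (one beam, no sampling). The "decoder" here is a hypothetical lookup table keyed on the last token, standing in for T5's real decoder, and the function names are illustrative.

```python
def greedy_decode(next_token_scores, start_token, eos_token, max_length=50):
    """Repeatedly append the highest-scoring next token until EOS or max_length."""
    sequence = [start_token]
    while len(sequence) < max_length:
        scores = next_token_scores(sequence)   # dict: candidate token -> score
        best = max(scores, key=scores.get)
        sequence.append(best)
        if best == eos_token:
            break
    return sequence

# Hypothetical scores; a real decoder conditions on the encoder output and all previous tokens
TABLE = {
    "<pad>": {"18": 0.2, "1887": 0.7, "</s>": 0.1},
    "1887": {"1887": 0.1, "</s>": 0.9},
}

def toy_scores(sequence):
    return TABLE[sequence[-1]]

print(greedy_decode(toy_scores, "<pad>", "</s>"))
```

Decoding stops at the end-of-sequence token, and skip_special_tokens=True then strips the start and end markers from the decoded text.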

In [19]:
def get_answer(question, context):
    input_text = f"question: {question} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    # Generate the prediction (output tokens)
    outputs = model.generate(**inputs, max_new_tokens=50)

    # Decode the tokens back into text
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer

# Practical Test Case
test_context = "The Eiffel Tower is located in Paris, France. It was constructed in 1887."
test_question = "When was the Eiffel Tower constructed?"

print(f"Context: {test_context}")
print(f"Question: {test_question}")
print(f"Model Answer: {get_answer(test_question, test_context)}")

Context: The Eiffel Tower is located in Paris, France. It was constructed in 1887.
Question: When was the Eiffel Tower constructed?
Model Answer: 1887


## Task 3: Abstractive Summarization with Decoder-Only LLMs (Phi-2)

### 3.1. Loading Phi-2 in Half-Precision (FP16)

We load the model in float16. At 2 bytes per parameter, the 2.7B-parameter weights take roughly 5-6 GB of VRAM, so the model fits on Google Colab's T4 GPU (about 15 GB usable) without 4-bit quantization.
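The memory estimate is simple arithmetic: parameter count times bytes per parameter. The sketch below uses the rounded figure of 2.7 billion parameters and counts weights only; activations, gradients, and optimizer state come on top.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

NUM_PARAMS = 2.7e9  # rounded parameter count for Phi-2

print("fp32:", round(weight_memory_gib(NUM_PARAMS, 4), 2), "GiB")
print("fp16:", round(weight_memory_gib(NUM_PARAMS, 2), 2), "GiB")
```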

In [8]:
model_id = "microsoft/phi-2"

# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Load Model in FP16 precision
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto"
)

print("Model Phi-2 successfully loaded in FP16 precision.")

Model Phi-2 successfully loaded in FP16 precision.


### 3.2. Applying LoRA Configuration

We insert small, trainable adapter matrices into the model and keep the original weights frozen. Instead of updating all 2.7 billion parameters, we train only about 0.47% of them (see the printed count below), which sharply reduces the memory needed for gradients and optimizer state.

In [9]:
# Configure LoRA (Parameter-Efficient Fine-Tuning)
config = LoraConfig(
    r=16,
    lora_alpha=32,
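    # Intended to target attention and MLP layers. The trainable-parameter count printed below
    # is what adapters on fc1 and fc2 alone would give, so "Wqkv" may not match any module here.
    target_modules=["Wqkv", "fc1", "fc2"],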
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap the model with LoRA adapters
model = get_peft_model(model, config)
model.print_trainable_parameters()

trainable params: 13,107,200 || all params: 2,792,791,040 || trainable%: 0.4693
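As a sanity check on the printed count: a LoRA adapter of rank r on a linear layer adds r × (in_features + out_features) parameters. The sketch below assumes Phi-2's dimensions (hidden size 2560, MLP size 10240, 32 layers), which are not printed anywhere in this notebook, and counts adapters on fc1 and fc2 only.

```python
def lora_params(in_features, out_features, r):
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)

# Assumed Phi-2 dimensions; r matches the LoraConfig above
HIDDEN, MLP, LAYERS, R = 2560, 10240, 32, 16

per_layer = lora_params(HIDDEN, MLP, R) + lora_params(MLP, HIDDEN, R)  # fc1 + fc2
total = per_layer * LAYERS

print(total)                                   # 13107200
print(round(100 * total / 2_792_791_040, 4))   # 0.4693
```

Under these assumptions the MLP adapters alone reproduce the printed total exactly, which suggests that no attention projection received an adapter in this run.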


### 3.3. Loading the Dataset

In [11]:
# Load the first 1,000 training examples of XSum
dataset = load_dataset("EdinburghNLP/xsum", split="train[:1000]")

print(f"Dataset loaded. Sample article: {dataset[0]['document'][:100]}...")

Using the latest cached version of the dataset since EdinburghNLP/xsum couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/EdinburghNLP___xsum/default/0.0.0/b46d1408a83c7c650e4e3605e24dad5c9e06297a (last modified on Sun Jan 11 14:15:47 2026).


Dataset loaded. Sample article: The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed....


### 3.4. Data Preprocessing and Tokenization

In [12]:
def preprocess_function(sample):
    # Instruction-style prompt template
    prompt = f"Instruct: Summarize the following news article concisely.\n{sample['document']}\nOutput: {sample['summary']}"

    # Tokenize the prompt
    tokenized = tokenizer(prompt, truncation=True, max_length=512, padding="max_length")

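    # For causal LM training, labels mirror input_ids; the model shifts them internally.
    # DataCollatorForLanguageModeling(mlm=False) later rebuilds labels and masks padding with -100.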
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

# Apply tokenization and remove old columns
tokenized_dataset = dataset.map(preprocess_function, remove_columns=dataset.column_names)


### 3.5. Model Training (Fine-Tuning)

We use a per-device batch size of 1 with gradient_accumulation_steps=4, which gives an effective batch size of 4 while holding only one example's activations in GPU memory at a time.
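The idea behind gradient accumulation can be checked on a toy problem. In the sketch below (a one-parameter linear model with made-up data, not the Phi-2 training loop), summing the scaled gradients of four single-example micro-batches gives the same update direction as one batch of four.

```python
def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]  # hypothetical (x, y) pairs
w = 0.5
accum_steps = 4

# One "large" batch: gradient of the mean loss over all four examples
full_batch_grad = sum(grad(w, x, y) for x, y in data) / len(data)

# Four micro-batches of size 1: scale each gradient by 1/accum_steps and add it up
accumulated = 0.0
for x, y in data:
    accumulated += grad(w, x, y) / accum_steps

print(full_batch_grad, accumulated)

# With 1,000 examples, batch size 1 and 4 accumulation steps, one epoch has this many updates
print(1000 // (1 * accum_steps))  # 250
```

The optimizer only steps once per four forward/backward passes, which is why the run reports 250 steps for 1,000 examples.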

In [14]:
training_args = TrainingArguments(
    output_dir="./phi2-summarization-results",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    save_strategy="no",
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    processing_class=tokenizer, # Latest standard argument
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

print("Starting Task 3 Fine-Tuning...")
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.


Starting Task 3 Fine-Tuning...


Step,Training Loss
10,2.2394
20,2.0553
30,2.073
40,2.1507
50,2.1401
60,2.1898
70,2.1529
80,2.0197
90,2.1984
100,2.2665


TrainOutput(global_step=250, training_loss=2.2167973480224608, metrics={'train_runtime': 548.1132, 'train_samples_per_second': 1.824, 'train_steps_per_second': 0.456, 'total_flos': 8176800890880000.0, 'train_loss': 2.2167973480224608, 'epoch': 1.0})

### 3.6. Post-Training Inference (Testing)

We test the model by providing only the "Instruct" portion of the prompt, ending with the "Output:" tag. The model then generates the summary autoregressively after that tag; because do_sample=True, the output varies from run to run.
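To illustrate what the temperature setting does during sampling, the sketch below applies a temperature-scaled softmax to a few invented logits. Dividing the logits by a temperature below 1 sharpens the distribution, so the most likely token is sampled more often; the function name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                         # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits

for t in (1.0, 0.7):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```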

In [15]:
def generate_summary(text):
    prompt = f"Instruct: Summarize the following news article concisely.\n{text}\nOutput:"
    inputs = tokenizer(prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id
        )

    decoded = tokenizer.batch_decode(outputs)[0]
    # Clean up output to show only the generated summary
    summary = decoded.split("Output:")[-1].strip()
    return summary

# Example Test
news_text = "NASA's latest rover has discovered signs of ancient water on Mars, suggesting the planet was once habitable."
print(f"Article: {news_text}")
print(f"Model Summary: {generate_summary(news_text)}")

Article: NASA's latest rover has discovered signs of ancient water on Mars, suggesting the planet was once habitable.
Model Summary: Researchers have found evidence of ancient water on Mars.
"These are the first results of this kind from the rover," says Dr Alan Stern, the principal investigator for the Opportunity rover.
"It's amazing to look at these images and see water
