### Homework 5 (10pt): Question search engine

Remember Week01, where you used GloVe embeddings to find related questions? That was... cute. Now it's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen the `practice06.ipynb` [notebook](https://github.com/anton-selitskiy/RIT_LLM/blob/main/Week06_bert/practice06.ipynb).

This assignment is inspired by this [notebook](https://github.com/yandexdataschool/nlp_course/blob/2024/week05_transfer/homework.ipynb).

In [None]:
#%pip install --upgrade transformers datasets accelerate deepspeed
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets
import os

os.environ["TOKENIZERS_PARALLELISM"] = "false"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Load data and model

In [None]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

In [None]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

### Tokenize the data

In [None]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True, return_tensors="pt"
    )
    result['label'] = torch.tensor(examples['label'], dtype=torch.long)
    return result

qqp_preprocessed = {
    split: [preprocess_function(sample) for sample in qqp[split]] for split in ['train', 'validation', 'test']
}


In [None]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

In [None]:
print(tokenizer.decode(qqp_preprocessed['train'][0]["input_ids"].squeeze(0)))

### Task 1: evaluation (3 points)

We randomly chose a model trained on QQP - but is it any good?

One way to assess this is by measuring validation accuracy, which you will implement next.

Here’s the interface to help you get started:

In [None]:
class QQPDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        return {
            "input_ids": item["input_ids"].squeeze(0),  # Remove batch dim
            "attention_mask": item["attention_mask"].squeeze(0),
            "token_type_ids": item["token_type_ids"].squeeze(0),
            "labels": item["label"]
        }

val_set = QQPDataset(qqp_preprocessed['validation'])
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=32, shuffle=False, collate_fn=transformers.default_data_collator, num_workers=8
)

In [None]:
model.to(device)
for batch in val_loader:
    batch = {k: v.to(device) for k, v in batch.items()}  # Move batch to GPU
    break  # Read one batch only
print("Sample batch:", batch)

with torch.no_grad():
    predicted = model(
        input_ids=batch['input_ids'],
        attention_mask=batch['attention_mask'],
        token_type_ids=batch['token_type_ids']
    )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).cpu().numpy())

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU under `torch.no_grad()`
- use a batch size larger than 1
- use an optimized data loader with `num_workers > 1`
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


In [None]:
from tqdm.notebook import tqdm

In [None]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in tqdm(val_loader, desc="Evaluating"):
        # Move batch to GPU
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            token_type_ids=batch.get('token_type_ids', None)
        )

        # Predictions
        probs = torch.softmax(outputs.logits, dim=1)
        predictions = torch.argmax(probs, dim=1)

        # Compute accuracy
        correct += (predictions == batch['labels']).sum().item()
        total += batch['labels'].size(0)

accuracy = correct / total


In [None]:
assert 0.9 < accuracy < 0.91
print(f"Accuracy: {accuracy:.3f}")

### Task 2: train the model (5 points)

Fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base), but you can choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually (as we did in class) or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.
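
If you go with `transformers.Trainer`, it reports only the loss by default, which makes the comparison with the BERT accuracy harder. One possible way to get accuracy at every evaluation is a `compute_metrics` callback. This is a sketch under the assumption that the model returns only logits as its prediction output; the demo arrays are made up.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn (logits, labels) from the Trainer into an accuracy value."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Hand-made logits and labels, just to show the expected input shapes.
demo_logits = np.array([[2.0, -1.0], [0.1, 0.3], [0.5, 0.4], [-1.0, 1.0]])
demo_labels = np.array([0, 1, 1, 1])
demo_metrics = compute_metrics((demo_logits, demo_labels))
print(demo_metrics)
```

Passing `compute_metrics=compute_metrics` to `Trainer(...)` adds an `eval_accuracy` entry to the evaluation logs.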

In [63]:
model_name = "microsoft/deberta-v3-base"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    hidden_dropout_prob=0.1,  # Restoring small dropout to reduce overfitting
    attention_probs_dropout_prob=0.1
)
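# NOTE: qqp_preprocessed must be rebuilt with this tokenizer (re-run the
# preprocessing cell above) before training; token ids produced with the
# BERT vocabulary do not match the DeBERTa vocabulary.
train_dataset = QQPDataset(qqp_preprocessed['train'])
val_dataset = QQPDataset(qqp_preprocessed['validation'])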

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [68]:
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

num_epochs = 6

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=num_epochs,
    logging_dir="./logs",
    logging_steps=10,
    weight_decay=0.005,
    push_to_hub=False,
    fp16=True,  # mixed-precision training (requires a CUDA GPU)
    load_best_model_at_end=True,  # restore the checkpoint with the lowest eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

def save_custom_checkpoint(trainer, epoch):
    checkpoint_path = f"./results/checkpoint-epoch-{epoch}"
    trainer.save_model(checkpoint_path)
    print(f"Checkpoint saved at {checkpoint_path}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=1)],  # Stop after 1 bad epoch
)


In [69]:
trainer.train()

| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.4318 | 0.457519 |
| 2 | 0.3092 | 0.447268 |
| 3 | 0.4219 | 0.422246 |
| 4 | 0.4938 | 0.408628 |
| 5 | 0.3456 | 0.417349 |


TrainOutput(global_step=113705, training_loss=0.4333482313447585, metrics={'train_runtime': 10755.4905, 'train_samples_per_second': 202.973, 'train_steps_per_second': 12.686, 'total_flos': 1.1966702736167424e+17, 'train_loss': 0.4333482313447585, 'epoch': 5.0})

In [70]:
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in tqdm(val_loader, desc="Evaluating"):
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            token_type_ids=batch['token_type_ids']
        )
        probs = torch.softmax(outputs.logits, dim=1)
        predictions = torch.argmax(probs, dim=1)
        correct += (predictions == batch['labels']).sum().item()
        total += batch['labels'].size(0)

accuracy = correct / total
print(f"Validation Accuracy: {accuracy:.4f}")

Validation Accuracy: 0.8084


### Task 3: try the full pipeline (2 points)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds the top 5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 3 examples.
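
One way to structure such a function is to separate ranking from scoring. The sketch below is only a skeleton under stated assumptions: `top_k_duplicates` and `score_pairs` are hypothetical names, and `score_pairs` is assumed to take two equally long lists of questions and return one duplicate score per pair. In the real pipeline it would tokenize the pairs and return the model's probability of the "duplicate" class; here a toy word-overlap scorer stands in so the ranking logic can be run without a model.

```python
import heapq


def top_k_duplicates(query, candidates, score_pairs, k=5, batch_size=256):
    """Return the k (score, candidate) pairs with the highest duplicate score."""
    scored = []
    for start in range(0, len(candidates), batch_size):
        chunk = candidates[start:start + batch_size]
        # Score the query against a whole chunk of candidates at once.
        scores = score_pairs([query] * len(chunk), chunk)
        scored.extend(zip(scores, chunk))
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def toy_overlap_scorer(questions_a, questions_b):
    """Stand-in scorer (NOT the model): Jaccard overlap of lowercased words."""
    scores = []
    for a, b in zip(questions_a, questions_b):
        words_a = set(a.lower().replace("?", "").split())
        words_b = set(b.lower().replace("?", "").split())
        scores.append(len(words_a & words_b) / len(words_a | words_b))
    return scores


toy_candidates = [
    "How do I learn Python?",
    "What is the capital of France?",
    "How can I learn Python quickly?",
    "Best way to cook rice?",
]
top2 = top_k_duplicates("How do I learn Python fast?", toy_candidates,
                        toy_overlap_scorer, k=2, batch_size=3)
for score, text in top2:
    print(f"{score:.3f}  {text}")
```

Scoring every training question against the query is slow with a cross-encoder, which the task statement explicitly allows; the batching parameter only keeps memory use bounded.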