### Homework 5: Question search engine

Remember week01, where you used GloVe embeddings to find related questions? That was... cute, but far from state of the art. It's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [1]:
%pip install --upgrade transformers datasets accelerate deepspeed
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets




### Load data and model

In [2]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

Repo card metadata block was not found. Setting CardData to empty.




Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [3]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)


### Tokenize the data

In [4]:
MAX_LENGTH = 128
def preprocess_function(examples, tokenizer):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(lambda x: preprocess_function(x, tokenizer), batched=True)

Map: 100%|██████████| 363846/363846 [00:46<00:00, 7814.58 examples/s]
Map: 100%|██████████| 40430/40430 [00:04<00:00, 8099.39 examples/s]
Map: 100%|██████████| 390965/390965 [00:48<00:00, 8066.76 examples/s]


In [5]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Task 1: evaluation (1 point)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [7]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [8]:
for batch in val_loader:
     break  # here be your training code
print("Sample batch:", batch)

with torch.no_grad():
  predicted = model(
      input_ids=batch['input_ids'],
      attention_mask=batch['attention_mask'],
      token_type_ids=batch['token_type_ids']
  )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.numpy())

Sample batch: {'labels': tensor([0]), 'idx': tensor([0]), 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,   

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU with no_grad
- use a batch size larger than 1
- use an optimized data loader with num_workers > 1
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)
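
As a rough sanity check on the batching advice above, the number of forward passes is the dataset size divided by the batch size, rounded up. The sketch below uses the split sizes reported by the `Map` progress bars earlier in this notebook; the helper name `num_batches` is made up for illustration.

```python
import math

def num_batches(num_examples: int, batch_size: int) -> int:
    """Number of batches a data loader yields when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)

VAL_SIZE = 40430     # validation split size, from the Map output above
TRAIN_SIZE = 363846  # train split size, from the Map output above

for bs in (1, 32, 128):
    print(f"batch_size={bs:>3}: {num_batches(VAL_SIZE, bs):>6} validation batches")
```

With a batch size of 32 this gives 1264 validation batches instead of 40430 single-example passes, which is where most of the speed-up comes from.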


In [17]:
from tqdm import tqdm

def test(model, val_loader, device):
    """Compute classification accuracy of `model` over `val_loader`."""
    model.eval()  # make sure dropout is disabled
    correct, total = 0, 0
    with torch.no_grad():
        for batch in tqdm(val_loader):
            # Move the whole batch to the target device once
            batch = {k: v.to(device) for k, v in batch.items()}

            # Mixed precision: run the forward pass in reduced precision where safe
            with torch.amp.autocast(device):
                predicted = model(
                    input_ids=batch['input_ids'],
                    attention_mask=batch['attention_mask'],
                    token_type_ids=batch['token_type_ids']
                )

            # argmax over logits gives the same class as argmax over softmax probabilities
            predictions = torch.argmax(predicted.logits, dim=1)

            correct += (predictions == batch['labels']).sum().item()
            total += len(predictions)
    return correct / total

In [10]:
val_loader = torch.utils.data.DataLoader(
      val_set, batch_size=32, shuffle=False,
      collate_fn=transformers.default_data_collator, num_workers=2
  )

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)

accuracy = test(model, val_loader, device)

100%|██████████| 1264/1264 [01:06<00:00, 18.91it/s]


In [11]:
assert 0.9 < accuracy < 0.91

### Task 2: train the model (4 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second on your hardware setup, and (3) its size in megabytes. Please take care to compare models in an equal setting, e.g. the same CPU / GPU. Compile your results into a table and write a short report (~half a page, on top of the table) summarizing your findings.
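
A minimal sketch of how the "samples per second" number can be measured, assuming a callable that runs one full evaluation pass. The names `samples_per_second` and `fake_eval` are hypothetical, and the sleep is only a stand-in for a real evaluation loop. On a GPU, kernels run asynchronously, so a real measurement should also call `torch.cuda.synchronize()` before reading the clock.

```python
import time

def samples_per_second(run_fn, num_samples, warmup_fn=None):
    """Time run_fn() once and return throughput in samples per second."""
    if warmup_fn is not None:
        warmup_fn()  # e.g. one small batch, so one-time setup cost is not timed
    start = time.perf_counter()
    run_fn()
    elapsed = time.perf_counter() - start
    return num_samples / elapsed

# Stand-in workload (hypothetical): replace with a real evaluation loop.
def fake_eval():
    time.sleep(0.05)

rate = samples_per_second(fake_eval, num_samples=100)
print(f"{rate:.0f} samples/s")
```

Using the same batch size, sequence length, and device for every model keeps the numbers comparable.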

In [22]:
import time

def statistic(model_name):
    """Evaluate a fine-tuned QQP model: accuracy, throughput and parameter size."""
    def get_model_size(model):
        # Parameter memory in MiB, using each tensor's actual element size
        # (4 bytes for float32 weights); buffers are not counted.
        param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
        return param_bytes / (1024 ** 2)

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
    model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name).to(device)

    # Each model has its own tokenizer, so the data must be re-tokenized
    qqp_preprocessed = qqp.map(lambda x: preprocess_function(x, tokenizer), batched=True)
    val_set = qqp_preprocessed['validation']

    val_loader = torch.utils.data.DataLoader(
        val_set, batch_size=32, shuffle=False,
        collate_fn=transformers.default_data_collator, num_workers=2
    )

    # Wall-clock time includes data loading, so this is end-to-end throughput
    start_time = time.perf_counter()
    accuracy = test(model, val_loader, device)
    elapsed_time = time.perf_counter() - start_time
    samples_per_sec = len(val_set) / elapsed_time

    return {
        "Model": model_name,
        "Accuracy": accuracy,
        "Samples_per_sec": samples_per_sec,
        "Model_size_MB": get_model_size(model),
    }


In [23]:
results = []

In [24]:
results.append(statistic("gchhablani/bert-base-cased-finetuned-qqp"))

100%|██████████| 1264/1264 [01:08<00:00, 18.52it/s]


In [25]:
results.append(statistic("vkk1710/xlnet-base-cased-finetuned-qqp"))

100%|██████████| 1264/1264 [03:42<00:00,  5.69it/s]


In [26]:
results.append(statistic("Tomor0720/deberta-base-finetuned-qqp"))

100%|██████████| 1264/1264 [02:26<00:00,  8.62it/s]


In [27]:
results.append(statistic("M-FAC/bert-mini-finetuned-qqp"))

100%|██████████| 1264/1264 [00:19<00:00, 64.19it/s]


**vkk1710/xlnet-base-cased-finetuned-qqp** is based on the XLNet architecture. Its accuracy (0.9084) is essentially identical to BERT's, but it is the slowest model tested, at about 182 samples per second versus 592 for BERT. XLNet's permutation language modeling objective applies only during pretraining, so the slowdown at inference more likely comes from its Transformer-XL-style relative positional attention, which is costlier than BERT's absolute position embeddings.

**Tomor0720/deberta-base-finetuned-qqp** is based on DeBERTa. It achieved the highest accuracy of the models tested (0.9128), which is consistent with its disentangled attention, where token content and relative position are encoded separately. This comes at a cost: it runs at about 276 samples per second, less than half of BERT's throughput, and it is the largest model at about 531 MB.

**M-FAC/bert-mini-finetuned-qqp** is a lightweight version of BERT with fewer, narrower layers. Its accuracy (0.8703) is about 4 percentage points below the full-sized models, but it is by far the smallest (about 43 MB) and fastest (about 2052 samples per second) model, which makes it a good fit for resource-constrained settings.

In summary, if minimizing compute is the priority, BERT-mini is the best choice thanks to its small size and high throughput. For tasks that need the highest accuracy, DeBERTa performs best, at the price of lower speed. BERT remains a balanced option, with solid accuracy, reasonable speed, and moderate resource requirements.

In [28]:
import pandas as pd

results_df = pd.DataFrame(results)
print(results_df)

                                      Model  Accuracy  Samples_per_sec  Model_size_MB
0  gchhablani/bert-base-cased-finetuned-qqp  0.908410       592.362250     413.176765
1    vkk1710/xlnet-base-cased-finetuned-qqp  0.908385       181.944658     447.503914
2      Tomor0720/deberta-base-finetuned-qqp  0.912788       275.662633     530.982430
3             M-FAC/bert-mini-finetuned-qqp  0.870344      2052.316350      42.614265
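
To make the trade-offs in the table easier to read, the sketch below expresses each model relative to the BERT baseline. The numbers are copied (rounded) from the table above; the short model keys and the helper `relative_to_baseline` are made up for illustration.

```python
# Rounded values copied from the results table above.
results = {
    "bert-base":    {"acc": 0.908410, "sps": 592.36,  "mb": 413.18},
    "xlnet-base":   {"acc": 0.908385, "sps": 181.94,  "mb": 447.50},
    "deberta-base": {"acc": 0.912788, "sps": 275.66,  "mb": 530.98},
    "bert-mini":    {"acc": 0.870344, "sps": 2052.32, "mb": 42.61},
}

def relative_to_baseline(name, baseline="bert-base"):
    """Return (speed ratio, size ratio, accuracy delta in percentage points)."""
    row, base = results[name], results[baseline]
    speed_ratio = row["sps"] / base["sps"]
    size_ratio = row["mb"] / base["mb"]
    acc_delta_pp = (row["acc"] - base["acc"]) * 100
    return speed_ratio, size_ratio, acc_delta_pp

for name in results:
    speed, size, delta = relative_to_baseline(name)
    print(f"{name:<13} speed x{speed:.2f}  size x{size:.2f}  accuracy {delta:+.2f} pp")
```

Read this way, BERT-mini trades roughly 3.8 points of accuracy for about 3.5 times the throughput at about a tenth of the size.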


### Task 3: try the full pipeline (1 point)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.
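
Once every candidate has a score, picking the top-5 is a small selection problem. The sketch below shows one way to do it on plain Python lists, keeping each distinct question only once (the same question can appear in several training pairs). The function name `top_k_unique` and the toy data are made up for illustration.

```python
import heapq

def top_k_unique(questions, scores, k=5):
    """Return the k highest-scoring distinct questions as (question, score) pairs."""
    best = {}
    for question, score in zip(questions, scores):
        # Keep the best score seen for each distinct question text
        if question not in best or score > best[question]:
            best[question] = score
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])

# Toy data (made up for illustration)
questions = ["a?", "b?", "a?", "c?", "d?"]
scores = [0.2, 0.9, 0.7, 0.1, 0.5]
print(top_k_unique(questions, scores, k=3))
```

`heapq.nlargest` avoids sorting all candidates when only a handful of results are needed.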

In [21]:
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def preprocess_search_function(query, examples):
    # Pair every candidate question with the same query
    query_batch = [query] * len(examples['text1'])
    return tokenizer(
        examples['text1'], query_batch,
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )


def find_top_duplicates(query, train_set, k=5):
    """Score every `text1` question in `train_set` against `query` and return
    the k distinct questions with the highest duplicate probability."""
    search_set = train_set.map(lambda x: preprocess_search_function(query, x), batched=True)
    search_set_loader = torch.utils.data.DataLoader(
        search_set, batch_size=32, shuffle=False,
        collate_fn=transformers.default_data_collator,
        pin_memory=True
    )

    model.to(device)
    model.eval()
    scores = []

    with torch.no_grad():
        for batch in tqdm(search_set_loader):
            with torch.amp.autocast(device):
                predicted = model(
                    input_ids=batch['input_ids'].to(device),
                    attention_mask=batch['attention_mask'].to(device),
                    token_type_ids=batch['token_type_ids'].to(device)
                )
            # Probability of the "duplicate" class (label 1)
            probs = torch.softmax(predicted.logits.float(), dim=1)
            scores.append(probs[:, 1].cpu())
    scores = torch.cat(scores)

    questions = train_set['text1']  # fetch the column once, not once per index
    top_questions, seen = [], set()
    for idx in scores.argsort(descending=True).tolist():
        question = questions[idx]
        if question in seen:  # the same question can occur in several pairs
            continue
        seen.add(question)
        top_questions.append((question, scores[idx].item()))
        if len(top_questions) == k:
            break

    return top_questions

In [22]:
queries = [
    "What mistakes am I making?",
    "Which mountain holds the title of the tallest in world?",
    "Where to go on a first date?",
    "How do I become a data scientist?",
    "What is the capital city of Russia?",
]

for query in queries:
    print(f"\nQuery: {query}")
    top_duplicates = find_top_duplicates(query, qqp['train'])
    for idx, (question, score) in enumerate(top_duplicates):
        print(f"Top {idx+1} match: {question} (Similarity: {score:.4f})")


Query: What mistakes am I making?


100%|██████████| 11371/11371 [05:17<00:00, 35.86it/s]


Top 1 match: How do NEWLY OPENED FIRMS GETS FUNDS? (Similarity: 0.1037)
Top 2 match: HOW DOES AFFECTIVE MEMORY WORKS? (Similarity: 0.0202)
Top 3 match: WHAT ARE THE SALARIES PAID BY HUL COMPANY FOR NIT STUDENTS? (Similarity: 0.0163)
Top 4 match: WHAT DOES MERTA MERTA MEAN? (Similarity: 0.0136)
Top 5 match: What are the hacks in daily life? (Similarity: 0.0125)

Query: Which mountain holds the title of the tallest in world?


Map: 100%|██████████| 363846/363846 [00:43<00:00, 8406.12 examples/s] 
100%|██████████| 11371/11371 [05:18<00:00, 35.69it/s]


Top 1 match: What's the highest mountain in the world? (Similarity: 0.9849)
Top 2 match: Which is the highest mountain in the world? (Similarity: 0.9848)
Top 3 match: What is the the highest mountain in the world? (Similarity: 0.9844)
Top 4 match: What is the highest mountain in the world? (Similarity: 0.9842)
Top 5 match: Which is the highest peak of the world? (Similarity: 0.9831)

Query: Where to go on a first date?


100%|██████████| 11371/11371 [05:17<00:00, 35.77it/s]


Top 1 match: HOW DO I EARN WITH LOW INVESTMENT? (Similarity: 0.9463)
Top 2 match: How do I LIVE in the PRESENT MOMENT? (Similarity: 0.7480)
Top 3 match: HOW DOESI FUCK A LADY? (Similarity: 0.6220)
Top 4 match: What are some good ideas for a first date? (Similarity: 0.5397)
Top 5 match: I HAVE TWO WHEELER LICENSE FROM WEST BENGAL.CAN I DRIVE BIKE THROUHOUT INDIA? (Similarity: 0.5001)

Query: How do I become a data scientist?


Map: 100%|██████████| 363846/363846 [00:42<00:00, 8555.78 examples/s] 
100%|██████████| 11371/11371 [05:17<00:00, 35.84it/s]


Top 1 match: How can I become a great Data Analyst? (Similarity: 0.9914)
Top 2 match: How become master in database? (Similarity: 0.9904)
Top 3 match: What is the best track to becoming a data scientist? (Similarity: 0.9814)
Top 4 match: How do I become a great computer scientist? (Similarity: 0.9799)
Top 5 match: Jeff Hammerbacher: What is your advice for a young data scientist? (Similarity: 0.9797)

Query: What is the capital city of Russia?


Map: 100%|██████████| 363846/363846 [00:42<00:00, 8571.06 examples/s] 
100%|██████████| 11371/11371 [05:18<00:00, 35.70it/s]


Top 1 match: HOW DOES AFFECTIVE MEMORY WORKS? (Similarity: 0.9767)
Top 2 match: WHO ARE THE GODDESS NAV DURGAS? (Similarity: 0.9697)
Top 3 match: COMPLEX FORMATION METHOD IS USED EXTRACTION of WHICH METAL? (Similarity: 0.9514)
Top 4 match: WHAT ARE THE IMPORTANT EVENTS IN PROPHASE? (Similarity: 0.9499)
Top 5 match: I HAVE TWO WHEELER LICENSE FROM WEST BENGAL.CAN I DRIVE BIKE THROUHOUT INDIA? (Similarity: 0.9340)


__Bonus:__ for bonus points, try to find a way to run the function faster than just passing over all questions in a loop. For instance, you can form a short-list of potential candidates using a cheaper method, and then run your transformer on that short list. If you opt for this solution, please keep both the original implementation and the optimized one, and briefly explain the difference between them.

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit TF-IDF once on all candidate questions
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
train_tfidf = tfidf_vectorizer.fit_transform(qqp['train']['text1'])

def find_top_duplicates_optimized(query, k=5, shortlist_size=100):
    # Stage 1: cheap lexical retrieval of the shortlist_size closest questions
    query_tfidf = tfidf_vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_tfidf, train_tfidf).flatten()
    candidate_indices = cosine_similarities.argsort()[-shortlist_size:][::-1]

    # Stage 2: rerank only the shortlist with the transformer
    candidate_train = qqp['train'].select(candidate_indices)
    return find_top_duplicates(query, candidate_train, k=k)

In [61]:
queries = [
    "What mistakes am I making?",
    "Which mountain holds the title of the tallest in world?",
    "Where to go on a first date?",
    "How do I become a data scientist?",
    "What is the capital city of Russia?",
]

for query in queries:
    print(f"\nQuery: {query}")
    top_duplicates = find_top_duplicates_optimized(query)
    for idx, (question, score) in enumerate(top_duplicates):
        print(f"Top {idx+1} match: {question} (Similarity: {score:.4f})")


Query: What mistakes am I making?


Map: 100%|██████████| 100/100 [00:00<00:00, 3342.74 examples/s]
100%|██████████| 4/4 [00:00<00:00, 26.68it/s]


Top 1 match: I made many mistakes in my life? (Similarity: 0.0044)
Top 2 match: How can I learn from my mistakes? (Similarity: 0.0016)
Top 3 match: What are some of the biggest mistakes in history? (Similarity: 0.0013)
Top 4 match: How do we learn from mistakes? (Similarity: 0.0008)
Top 5 match: What are the big mistakes you made in your life? (Similarity: 0.0007)

Query: Which mountain holds the title of the tallest in world?


Map: 100%|██████████| 100/100 [00:00<00:00, 3617.84 examples/s]
100%|██████████| 4/4 [00:00<00:00, 35.66it/s]


Top 1 match: What's the highest mountain in the world? (Similarity: 0.9849)
Top 2 match: Which is the highest mountain in the world? (Similarity: 0.9848)
Top 3 match: What is the the highest mountain in the world? (Similarity: 0.9844)
Top 4 match: What is the highest mountain in the world? (Similarity: 0.9842)
Top 5 match: Which is the biggest mountain in world? (Similarity: 0.8385)

Query: Where to go on a first date?


Map: 100%|██████████| 100/100 [00:00<00:00, 3662.60 examples/s]
100%|██████████| 4/4 [00:00<00:00, 37.30it/s]


Top 1 match: What are some good ideas for a first date? (Similarity: 0.5397)
Top 2 match: I've been on a couple first dates, but I'm finally going on a date with this girl and I really like her. Where should we go for our first date? (Similarity: 0.0880)
Top 3 match: Where is the best place to go on a honeymoon? (Similarity: 0.0378)
Top 4 match: What are the best questions to ask on a first date? (Similarity: 0.0278)
Top 5 match: What are the best places to take a first date to in New York City? (Similarity: 0.0116)

Query: How do I become a data scientist?


Map: 100%|██████████| 100/100 [00:00<00:00, 3918.85 examples/s]
100%|██████████| 4/4 [00:00<00:00, 36.99it/s]


Top 1 match: What is the best track to becoming a data scientist? (Similarity: 0.9814)
Top 2 match: How do I become a great computer scientist? (Similarity: 0.9799)
Top 3 match: How do I become a computer scientist? (Similarity: 0.9777)
Top 4 match: How can I become a data scientist? (Similarity: 0.9743)
Top 5 match: How can I be a data scientist? (Similarity: 0.9701)

Query: What is the capital city of Russia?


Map: 100%|██████████| 100/100 [00:00<00:00, 3797.98 examples/s]
100%|██████████| 4/4 [00:00<00:00, 37.35it/s]

Top 1 match: What is “capital”? (Similarity: 0.0113)
Top 2 match: How much of Russia is actually inhabited? (Similarity: 0.0009)
Top 3 match: What is the biggest city? (Similarity: 0.0003)
Top 4 match: What is the capital of the U.K.? (Similarity: 0.0003)
Top 5 match: What is administrative capital? (Similarity: 0.0002)





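To speed up the duplicate search, I first shortlist candidates with TF-IDF vectors and cosine similarity, then pass only that shortlist to the BERT model. The original implementation scores all 363,846 training questions with BERT (11,371 batches, about 5 minutes 17 seconds per query); the optimized one scores only the top 100 TF-IDF candidates (4 batches, well under a second). Because every shortlisted question shares vocabulary with the query, the results also contain fewer irrelevant matches. The trade-off is that a paraphrase with little word overlap can be dropped before BERT ever sees it: for the mountain query, "Which is the highest peak of the world?" (0.9831) appears in the exhaustive results but not in the optimized ones.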
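
One rough way to quantify the difference between the two implementations is to count how many of the exhaustive search's top-5 results also appear in the shortlisted top-5. The sketch below does this for two of the queries, with the result lists copied from the outputs above. The helper `overlap_at_k` and the short dictionary keys are made up for illustration, and neither list should be read as ground truth.

```python
def overlap_at_k(full_results, shortlist_results, k=5):
    """Count questions that appear in the top-k of both result lists."""
    return len(set(full_results[:k]) & set(shortlist_results[:k]))

# Top-5 questions copied from the exhaustive search output above
full = {
    "mountain": [
        "What's the highest mountain in the world?",
        "Which is the highest mountain in the world?",
        "What is the the highest mountain in the world?",
        "What is the highest mountain in the world?",
        "Which is the highest peak of the world?",
    ],
    "data scientist": [
        "How can I become a great Data Analyst?",
        "How become master in database?",
        "What is the best track to becoming a data scientist?",
        "How do I become a great computer scientist?",
        "Jeff Hammerbacher: What is your advice for a young data scientist?",
    ],
}

# Top-5 questions copied from the TF-IDF shortlist output above
optimized = {
    "mountain": [
        "What's the highest mountain in the world?",
        "Which is the highest mountain in the world?",
        "What is the the highest mountain in the world?",
        "What is the highest mountain in the world?",
        "Which is the biggest mountain in world?",
    ],
    "data scientist": [
        "What is the best track to becoming a data scientist?",
        "How do I become a great computer scientist?",
        "How do I become a computer scientist?",
        "How can I become a data scientist?",
        "How can I be a data scientist?",
    ],
}

for query in full:
    print(f"{query}: {overlap_at_k(full[query], optimized[query])} of 5 shared")
```

A low overlap is not necessarily bad: for the data scientist query, the shortlisted results that replace the exhaustive ones are arguably closer paraphrases of the query.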