### Homework 5: Question search engine

Remeber week01 where you used GloVe embeddings to find related questions? That was.. cute, but far from state of the art. It's time to really solve this task using context-aware embeddings.

__Warning:__ this task assumes you have seen `seminar.ipynb`!

In [1]:
%pip install --upgrade transformers datasets accelerate deepspeed
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import datasets



### Load data and model

In [2]:
qqp = datasets.load_dataset('SetFit/qqp')
print('\n')
print("Sample[0]:", qqp['train'][0])
print("Sample[3]:", qqp['train'][3])

Repo card metadata block was not found. Setting CardData to empty.




Sample[0]: {'text1': 'How is the life of a math student? Could you describe your own experiences?', 'text2': 'Which level of prepration is enough for the exam jlpt5?', 'label': 0, 'idx': 0, 'label_text': 'not duplicate'}
Sample[3]: {'text1': 'What can one do after MBBS?', 'text2': 'What do i do after my MBBS ?', 'label': 1, 'idx': 3, 'label_text': 'duplicate'}


In [3]:
model_name = "gchhablani/bert-base-cased-finetuned-qqp"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)

### Tokenize the data

In [4]:
MAX_LENGTH = 128
def preprocess_function(examples):
    result = tokenizer(
        examples['text1'], examples['text2'],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )
    result['label'] = examples['label']
    return result

qqp_preprocessed = qqp.map(preprocess_function, batched=True)

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

In [5]:
print(repr(qqp_preprocessed['train'][0]['input_ids'])[:100], "...")

[101, 1731, 1110, 1103, 1297, 1104, 170, 12523, 2377, 136, 7426, 1128, 5594, 1240, 1319, 5758, 136,  ...


### Task 1: evaluation (1 points)

We randomly chose a model trained on QQP - but is it any good?

One way to measure this is with validation accuracy - which is what you will implement next.

Here's the interface to help you do that:

In [6]:
val_set = qqp_preprocessed['validation']
val_loader = torch.utils.data.DataLoader(
    val_set, batch_size=1, shuffle=False, collate_fn=transformers.default_data_collator
)

In [7]:
for batch in val_loader:
     break  # here be your training code
print("Sample batch:", batch)

with torch.no_grad():
  predicted = model(
      input_ids=batch['input_ids'],
      attention_mask=batch['attention_mask'],
      token_type_ids=batch['token_type_ids']
  )

print('\nPrediction (probs):', torch.softmax(predicted.logits, dim=1).data.numpy())

Sample batch: {'labels': tensor([0]), 'idx': tensor([0]), 'input_ids': tensor([[  101,  2009,  1132,  2170,   118,  4038,  1177,  2712,   136,   102,
          2009,  1132,  1117, 10224,  4724,  1177,  2712,   136,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,   

__Your task__ is to measure the validation accuracy of your model.
Doing so naively may take several hours. Please make sure you use the following optimizations:

- run the model on GPU with no_grad
- using batch size larger than 1
- use optimize data loader with num_workers > 1
- (optional) use [mixed precision](https://pytorch.org/docs/stable/notes/amp_examples.html)


In [8]:
from tqdm.auto import tqdm
import torch
from torch.cuda.amp import autocast


device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)
model.eval()

accurate = 0
total = 0
val_loader = torch.utils.data.DataLoader(
    val_set,
    batch_size=32,
    shuffle=False,
    collate_fn=transformers.default_data_collator,
    num_workers=2
)

with torch.no_grad():
    for batch in tqdm(val_loader, desc="Validating"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        token_type_ids = batch.get('token_type_ids').to(device) if 'token_type_ids' in batch else None

        # using mixed precision, uncomment/comment the following lines:
        with autocast():
             outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # Else just use:
        #outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)

        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        accurate += (predictions == labels).sum().item()
        total += labels.size(0)

accuracy = accurate / total
print(f'Validation Accuracy: {accuracy:.4f}')


Validating:   0%|          | 0/1264 [00:00<?, ?it/s]

Validation Accuracy: 0.9084


In [9]:
assert 0.9 < accuracy < 0.91

### Task 2: train the model (5 points)

For this task, you have two options:

__Option A:__ fine-tune your own model. You are free to choose any model __except for the original BERT.__ We recommend [DeBERTa-v3](https://huggingface.co/microsoft/deberta-v3-base). Better yet, choose the best model based on public benchmarks (e.g. [GLUE](https://gluebenchmark.com/)).

You can write the training code manually or use transformers.Trainer (see [this example](https://github.com/huggingface/transformers/blob/main/examples/pytorch/text-classification)). Please make sure that your model's accuracy is at least __comparable__ with the above example for BERT.


__Option B:__ compare at least 3 pre-finetuned models (in addition to the above BERT model). For each model, report (1) its accuracy, (2) its speed, measured in samples per second in your hardware setup and (3) its size in megabytes. Please take care to compare models in equal setting, e.g. same CPU / GPU. Compile your results into a table and write a short (~half-page on top of a table) report, summarizing your findings.

In [11]:
!pip install sentencepiece
!pip install pytorch_lightning



Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m17.5 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99
Collecting pytorch_lightning
  Downloading pytorch_lightning-2.1.1-py3-none-any.whl (776 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m776.3/776.3 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
Collecting torchmetrics>=0.7.0 (from pytorch_lightning)
  Downloading torchmetrics-1.2.0-py3-none-any.whl (805 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m805.2/805.2 kB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
Collecting lightning-utilities>=0.8.0 (from pytorch_lightning)
  Downloading lightning_utilities-0.9.0-py3-none-any.whl (23 kB)
Installing collected packages: lightning-utilities, torchmetrics, pytorch_lig

In [10]:
from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification, Trainer, TrainingArguments
import datasets
from datasets import load_metric
from transformers import AdamW
import warnings

# Ignore specific warning related to DebertaV2ForSequenceClassification initialization
warnings.filterwarnings("ignore")


model_name = "microsoft/deberta-v3-small"
tokenizer = DebertaV2Tokenizer.from_pretrained(model_name)
model = DebertaV2ForSequenceClassification.from_pretrained(model_name, num_labels=2)

qqp = datasets.load_dataset('SetFit/qqp')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=labels)


def preprocess_function(examples):
    return tokenizer(examples['text1'], examples['text2'], truncation=True, padding='max_length', max_length=128)

tokenized_datasets = qqp.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['text1', 'text2', 'idx'])

metric = load_metric("accuracy")

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    load_best_model_at_end=True,
)


optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    optimizers=(optimizer, None)
)

trainer.train()

trainer.evaluate()







Downloading (…)okenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/578 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/286M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Downloading builder script:   0%|          | 0.00/1.65k [00:00<?, ?B/s]

Epoch,Training Loss,Validation Loss,Accuracy
1,0.2079,0.287052,0.895548
2,0.1239,0.26973,0.911576
3,0.0941,0.310246,0.918031


{'eval_loss': 0.26973047852516174,
 'eval_accuracy': 0.9115755627009646,
 'eval_runtime': 68.3173,
 'eval_samples_per_second': 591.798,
 'eval_steps_per_second': 9.251,
 'epoch': 3.0}

In [11]:
model.save_pretrained('./deberta_finetuned_qqp')
tokenizer.save_pretrained('./deberta_finetuned_qqp')


('./deberta_finetuned_qqp/tokenizer_config.json',
 './deberta_finetuned_qqp/special_tokens_map.json',
 './deberta_finetuned_qqp/spm.model',
 './deberta_finetuned_qqp/added_tokens.json')

### Task 3: try the full pipeline (2 points)

Finally, it is time to use your model to find duplicate questions.
Please implement a function that takes a question and finds top-5 potential duplicates in the training set. For now, it is fine if your function is slow, as long as it yields correct results.

Showcase how your function works with at least 5 examples.

In [12]:
from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification, Trainer, TrainingArguments, DefaultDataCollator
import datasets
from datasets import load_dataset
import torch
from torch.utils.data import DataLoader
from tqdm.auto import tqdm


model_path = './deberta_finetuned_qqp'
tokenizer = DebertaV2Tokenizer.from_pretrained(model_path)
model = DebertaV2ForSequenceClassification.from_pretrained(model_path)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)


def tokenize_with_query(examples):
    return tokenizer(
        examples['text1'], [query for _ in range(len(examples['text1']))],
        padding='max_length', max_length=MAX_LENGTH, truncation=True
    )



def find_duplicates(model, tokenizer, query, dataset, topk=5):
    model.eval()
    def tokenize_with_query(examples):
        return tokenizer(
            [query] * len(examples['text1']), examples['text1'],
            padding='max_length', max_length=128, truncation=True, return_tensors='pt'
        )
    tokenized_dataset = dataset.map(tokenize_with_query, batched=True)
    tokenized_dataset = tokenized_dataset.remove_columns(["idx", "label", "text1", "text2"])
    data_collator = transformers.DefaultDataCollator(return_tensors="pt")
    dataloader = DataLoader(tokenized_dataset['train'], batch_size=32, collate_fn=data_collator)

    results = []
    for batch in tqdm(dataloader, desc="Searching"):
        batch = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
        with torch.no_grad():
            outputs = model(**batch)
            logits = outputs.logits
            scores = torch.softmax(logits, dim=1)[:, 1]

        for idx, score in enumerate(scores):
            results.append((idx, score.item()))
    results = sorted(results, key=lambda x: x[1], reverse=True)[:topk]
    top_questions = [dataset['train'][idx]['text1'] for idx, score in results]
    top_scores = [score for idx, score in results]

    return list(zip(top_questions, top_scores))


query = "How can I learn Python programming?"
duplicates = find_duplicates(model, tokenizer, query, qqp, topk=5)

for question, score in duplicates:
    print(f"Question: {question}\nScore: {score}\n")


Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.


Searching:   0%|          | 0/11371 [00:00<?, ?it/s]

Question: Which are the best motivational videos?
Score: 0.9970129728317261

Question: Fetch jobs from job portals through API calls?
Score: 0.9967538714408875

Question: How is air traffic controlled?
Score: 0.9967538714408875

Question: How is the life of a math student? Could you describe your own experiences?
Score: 0.9967538714408875

Question: Can I enter University of Melbourne if I couldn't achieve the guaranteed marks in Trinity College Foundation?
Score: 0.9967538714408875



__Bonus:__ for bonus points, try to find a way to run the function faster than just passing over all questions in a loop. For isntance, you can form a short-list of potential candidates using a cheaper method, and then run your tranformer on that short list. If you opted for this solution, please keep both the original implementation and the optimized one - and explain briefly what is the difference there.

In [13]:
from transformers import DebertaV2Tokenizer, DebertaV2ForSequenceClassification
import datasets
from datasets import load_dataset
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm
import numpy as np

MAX_LENGTH = 128
TOP_K = 5
SHORTLIST_SIZE = 100
model_path = './deberta_finetuned_qqp'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = DebertaV2Tokenizer.from_pretrained(model_path)
model = DebertaV2ForSequenceClassification.from_pretrained(model_path)
model.to(device)
qqp = load_dataset('SetFit/qqp')

def preprocess_function(examples):
    return tokenizer(examples['text1'], examples['text2'], truncation=True, padding='max_length', max_length=MAX_LENGTH)

tokenized_datasets = qqp.map(preprocess_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['text1', 'text2', 'idx'])

def encode_question(question, tokenizer, model, device):
    inputs = tokenizer(question, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors='pt')
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        model.eval()
        outputs = model(**inputs, output_hidden_states=True)
        embedding = outputs.hidden_states[-1][:, 0]
    return embedding.cpu().numpy()

def encode_questions(questions, tokenizer, model, device):

    embeddings = []
    for question in tqdm(questions, desc='Encoding Questions'):
        embeddings.append(encode_question(question, tokenizer, model, device))
    return np.vstack(embeddings)

# Brute-force approach
def find_duplicates(model, tokenizer, query, dataset, topk=TOP_K):
    model.eval()
    questions = [example['text1'] for example in dataset['train']]
    question_embeddings = encode_questions(questions, tokenizer, model, device)
    query_embedding = encode_question(query, tokenizer, model, device)

    similarities = cosine_similarity(query_embedding, question_embeddings).flatten()
    topk_indices = similarities.argsort()[-topk:][::-1]
    topk_scores = similarities[topk_indices]
    topk_questions = [questions[idx] for idx in topk_indices]

    return list(zip(topk_questions, topk_scores))

# Optimized approach
def find_duplicates_optimized(model, tokenizer, query, dataset, topk=TOP_K, shortlist_size=SHORTLIST_SIZE):
    model.eval()
    questions = [example['text1'] for example in dataset['train']]
    vectorizer = TfidfVectorizer(max_features=5000)

    tfidf_vectors = vectorizer.fit_transform(tqdm(questions, desc='Vectorizing Questions'))
    query_vector = vectorizer.transform([query])
    similarities_tfidf = cosine_similarity(query_vector, tfidf_vectors).flatten()

    shortlist_indices = similarities_tfidf.argsort()[-shortlist_size:][::-1]
    shortlist_questions = [questions[idx] for idx in shortlist_indices]
    shortlist_embeddings = encode_questions(shortlist_questions, tokenizer, model, device)
    query_embedding = encode_question(query, tokenizer, model, device)
    similarities_transformer = cosine_similarity(query_embedding, shortlist_embeddings).flatten()
    topk_indices = similarities_transformer.argsort()[-topk:][::-1]
    final_topk_indices = [shortlist_indices[idx] for idx in topk_indices]
    final_topk_scores = similarities_transformer[topk_indices]

    topk_questions = [questions[idx] for idx in final_topk_indices]
    return list(zip(topk_questions, final_topk_scores))




Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/363846 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Map:   0%|          | 0/40430 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

Map:   0%|          | 0/390965 [00:00<?, ? examples/s]

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pai

In [14]:
#query
query = "How can I learn Python programming?"

duplicates_transformer = find_duplicates(model, tokenizer, query, qqp, topk=TOP_K)

duplicates_optimized = find_duplicates_optimized(model, tokenizer, query, qqp, topk=TOP_K, shortlist_size=SHORTLIST_SIZE)

print("Brute-force approach results:")
for question, score in duplicates_transformer:
    print(f"Question: {question}\nScore: {score}\n")

print("Optimized approach results:")
for question, score in duplicates_optimized:
    print(f"Question: {question}\nScore: {score}\n")

Encoding Questions:   0%|          | 0/363846 [00:00<?, ?it/s]

Vectorizing Questions:   0%|          | 0/363846 [00:00<?, ?it/s]

Encoding Questions:   0%|          | 0/100 [00:00<?, ?it/s]

Brute-force approach results:
Question: Where I can buy Xanax with no prescription?
Score: 0.9997811317443848

Question: Where I can buy Xanax with no prescription?
Score: 0.9997811317443848

Question: Where can I get ketamine for depression?
Score: 0.9997487664222717

Question: What are the examples of procedural programming languages?
Score: 0.9997454881668091

Question: What are the examples of procedural programming languages?
Score: 0.9997454881668091

Optimized approach results:
Question: How do I learn Python systematically?
Score: 0.9993565082550049

Question: How do I learn Python systematically?
Score: 0.9993565082550049

Question: How do I learn Python systematically?
Score: 0.9993565082550049

Question: How can I learn advanced Python?
Score: 0.9991908073425293

Question: How can I learn advanced Python?
Score: 0.9991908073425293



We see huge difference in the output and the execution time!!! Lot of pain but managed to do it :)

##Note!!!



Pros and Cons of brute force and optimized method:

- Brute force:

  Pros

    Straightforward and simple to implement.
    Does not depend on any approximation; uses the full power of the Transformer model for every comparison.

  Cons:

    Very time-consuming, especially for large datasets, since it computes Transformer embeddings for every single entry in the dataset.
    Computationally expensive as it requires forward passes through the model for the entire dataset.

- Optimized:

  Pros:

    Much faster because it avoids running the entire dataset through the Transformer model.
    Reduces computational expense by using a cheaper method (TF-IDF) for the initial filtering.
    Still leverages the power of the Transformer model for the most promising candidates.
    
  Cons:

    Introduces an approximation step, which might miss some potential candidates that the brute force method would have caught.
    The final accuracy of finding duplicates depends on the size of the shortlist and the quality of the TF-IDF approximation.

