## The Dataset

In [None]:
import pandas as pd

nq_data = pd.read_csv('nq_simplified.val.tsv', sep='\t', header=None, names=['question', 'answer', 'gold_context'], quoting=3)
nq_data.head()

Unnamed: 0,question,answer,gold_context
0,what purpose did seasonal monsoon winds have o...,enabled European empire expansion into the Ame...,The westerlies (blue arrows) and trade winds (...
1,who got the first nobel prize in physics,"Wilhelm Conrad Röntgen, of Germany",The award is presented in Stockholm at an annu...
2,when is the next deadpool movie being released,"May 18, 2018","Though the original creative team of Reynolds,..."
3,where did the idea of fortnite come from,as a cross between Minecraft and Left 4 Dead,"Fortnite is set in contemporary Earth, where t..."
4,which mode is used for short wave broadcast se...,MFSK Olivia,"All one needs is a pair of transceivers, each ..."


In [None]:
print(nq_data.size)  # total number of cells (rows x columns), not the number of rows

12867


Define the accuracy benchmark: a ROUGE-1-style unigram overlap that returns precision, recall, and F1 score, aggregated over all question-answer pairs.

In [None]:
def rouge1(gold, predicted):
    assert len(gold) == len(predicted)
    def tokenize(text):
        return set(text.replace(',', ' ').replace('.', ' ').strip().split())

    n_g, n_p, n_c = 0, 0, 0

    for g, p in zip(gold, predicted):
        g = tokenize(g)
        p = tokenize(p)
        n_g += len(g)
        n_p += len(p)
        n_c += len(g.intersection(p))

    pr = n_c / n_p if n_p > 0 else 0
    re = n_c / n_g if n_g > 0 else 0
    f1 = 2 * pr * re / (pr + re) if pr > 0 and re > 0 else 0

    return pr, re, f1

def cleanup(text):
    text = text.replace(',', ' ')
    text = text.replace('.', ' ')
    return text
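
To make the metric concrete, here is a minimal single-pair sketch that mirrors the tokenization used in `rouge1` above. The helper name `rouge1_pair` and the predicted sentence are illustrative assumptions, not part of the notebook. Tokens are collected into sets, so repeated words count once, and matching is case-sensitive.

```python
def rouge1_pair(gold, predicted):
    """Set-based unigram overlap for a single gold/predicted pair."""
    def tokenize(text):
        # Same normalization as rouge1: commas and periods become spaces.
        return set(text.replace(',', ' ').replace('.', ' ').strip().split())

    g, p = tokenize(gold), tokenize(predicted)
    overlap = len(g & p)
    precision = overlap / len(p) if p else 0
    recall = overlap / len(g) if g else 0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0
    return precision, recall, f1

# Gold has 6 unique tokens, the prediction has 5, and 3 are shared
# ("on", "the", "pedestal").
precision, recall, f1 = rouge1_pair(
    "on the inner wall of the pedestal",
    "the inscription is on the pedestal",
)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.6 0.5 0.545
```

Because precision divides by the number of predicted tokens, long or repetitive generations tend to lower it even when the correct words are present.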

Download a pre-trained LLM from HuggingFace.

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, AutoModelForCausalLM, pipeline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

tokenizer = AutoTokenizer.from_pretrained("ahxt/LiteLlama-460M-1T")
model = AutoModelForCausalLM.from_pretrained("ahxt/LiteLlama-460M-1T").to(device)
text_generator = pipeline('text-generation', model=model, tokenizer=tokenizer, device=device)

cuda


Device set to use cuda


## Step 1: Evaluating an LLM on Natural Questions

In [None]:
# Get a small sample for testing
subset = nq_data.sample(10, random_state=11)
Q_NUM = 0

In [None]:
# Define a function to generate the answers
def generate_answer(question):
    # Tokenize the input question
    inputs = tokenizer(question, return_tensors="pt").to(device)
    # Generate the answer
    outputs = model.generate(**inputs, max_new_tokens=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
    # Decode the answer back to natural language
    answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return answer.split("Answer:")[-1].strip()

# Get the questions and answers
questions = subset['question'].tolist()
gold_answers = subset['answer'].tolist()

# Run the questions through the model to get the predictions
predicted_answers = [generate_answer(q) for q in questions]

In [None]:
# Print one example as a sanity check
print("Question:", questions[Q_NUM])
print("Ground Truth:", gold_answers[Q_NUM])
print("Predicted:", predicted_answers[Q_NUM], "\n")

# Print the benchmark scores
pr, re, f1 = rouge1(gold_answers, predicted_answers)
print(f"ROUGE-1 Precision: {pr}, Recall: {re}, F1: {f1}")

Question: where is the inscription on the statue of liberty
Ground Truth: on the inner wall of the pedestal
Predicted: where is the inscription on the statue of liberty.

The statue of liberty is a symbol of the freedom of the people. It is a symbol of the freedom of the people. It is a symbol of the freedom of the people. It is a symbol of the freedom of the people. 

ROUGE-1 Precision: 0.05142857142857143, Recall: 0.20930232558139536, F1: 0.08256880733944955
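
The prediction above begins by repeating the question. That is expected for causal language models: the sequence returned by `generate` starts with the prompt tokens, and the prompt used here contains no `Answer:` marker for the final `split` to cut on. One possible fix, sketched below under the assumption that only the continuation should be scored, is to drop the prompt tokens before decoding. The helper `strip_prompt` is hypothetical and uses plain lists in place of tensors.

```python
def strip_prompt(output_ids, prompt_length):
    """Keep only the token ids produced after the prompt."""
    return output_ids[prompt_length:]

# Toy ids standing in for a tokenized prompt and the model's full output.
prompt_ids = [101, 7, 42]
output_ids = prompt_ids + [8, 15, 16]
new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)  # [8, 15, 16]

# In generate_answer this would correspond to (not run here):
#   prompt_length = inputs["input_ids"].shape[1]
#   answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
```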


## Step 2: Exact kNN

In [None]:
# Read the passages
with open('passages.txt', 'r') as f:
    passages = f.readlines()

Represent the passages as vectors using a sentence-embedding model.

In [None]:
!pip install -U sentence-transformers

In [None]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')
representations = encoder.encode(passages, convert_to_tensor=True)

In [None]:
print(representations)

tensor([[ 0.0562,  0.0840,  0.0287,  ..., -0.0674,  0.0032,  0.0448],
        [ 0.0826,  0.0638, -0.0906,  ...,  0.0118, -0.0470, -0.0394],
        [-0.0299,  0.0418, -0.0216,  ..., -0.0198,  0.0784, -0.0587],
        ...,
        [ 0.0535,  0.0559, -0.0509,  ...,  0.0318, -0.0890, -0.0253],
        [-0.1011, -0.0338, -0.0413,  ..., -0.0341,  0.0146, -0.0120],
        [-0.0140,  0.0727,  0.0384,  ...,  0.0100, -0.0981, -0.0242]],
       device='cuda:0')


In [None]:
!pip install faiss-cpu


In [None]:
import faiss

embedded_passages = representations.cpu().numpy()
index = faiss.IndexFlatL2(embedded_passages.shape[1])
# Store the representations in the vector database
index.add(embedded_passages)
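
For intuition, exact search with a flat index amounts to comparing the query against every stored vector and keeping the closest ones. The pure-Python sketch below illustrates that idea with squared L2 distance on toy 2-D vectors. The function name `exact_knn` and the data are made up for illustration; FAISS performs the same exhaustive comparison far more efficiently.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exact_knn(query, vectors, k=1):
    """Return (indices, distances) of the k nearest vectors by brute force."""
    ranked = sorted(range(len(vectors)), key=lambda i: squared_l2(query, vectors[i]))
    top = ranked[:k]
    return top, [squared_l2(query, vectors[i]) for i in top]

toy_vectors = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
ids, dists = exact_knn((1.0, 2.0), toy_vectors, k=2)
print(ids, dists)  # [1, 0] [1.0, 5.0]
```

The cost grows linearly with the number of passages, which is the motivation for the approximate index in Step 3.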

In [None]:
question = questions[Q_NUM]
embedded_question = encoder.encode([question], convert_to_tensor=True).cpu().numpy()

_, ix = index.search(embedded_question, 1)
best_passage = passages[ix[0][0]]
print(question)
print(best_passage)

where is the inscription on the statue of liberty
The statue was dedicated in 1950, as one of approximately 200 replicas installed throughout the United States to commemorate the fortieth anniversary of the establishment of Boy Scouts of America. It was surveyed as part of the Smithsonian Institution's "Save Outdoor Sculpture!" program in 1994. The replica of the Statue of Liberty (Liberty Enlightening the World) is an allegorical representation of Liberty. The female figure is shown wearing a crown and robes, and holding a torch and a book or tablet. The metal sculpture measures approximately 7 ft. 4 in. x 1 ft. 10 in. x 1 ft. 10 in., and rests on a pedestal and octagonal concrete base that measures approximately 5 ft. 4 in. x 3 ft. 2 in. x 2 ft. 8 in. A plaque on the base has the inscription: "WITH THE FAITH AND COURAGE OF / THEIR FOREFATHERS WHO MADE / POSSIBLE THE FREEDOM OF THESE / UNITED STATES / THE BOY SCOUTS OF AMERICA / DEDICATE THIS REPLICA OF THE / STATUE OF LIBERTY AS A PL

In [None]:
import time
import numpy as np

latencies = []

# For each question, find the best-matching passage from the vector database
retrieved_contexts = []
for question in questions:
    embedded_question = encoder.encode([question], convert_to_tensor=True).cpu().numpy()
    start_time = time.time()
    _, ix = index.search(embedded_question, 1)
    end_time = time.time()
    latencies.append(end_time - start_time)
    retrieved_contexts.append(passages[ix[0][0]])

average_latency = np.mean(latencies) * 1000
print(f"Average latency: {average_latency:.4f} ms")

Average latency: 5.5889 ms
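
The next cell calls `generate_answer_with_context`, which is not defined in this section. Since `generate_answer` keeps only the text after the last `Answer:` marker, one plausible design is a prompt that places the context and question before that marker. The sketch below shows only the prompt construction and answer extraction; the template, the names `build_prompt` and `extract_answer`, and the `max_context_chars` limit are assumptions, not the notebook's actual implementation.

```python
def build_prompt(question, context, max_context_chars=1000):
    """Assemble a context-augmented prompt ending in an 'Answer:' marker."""
    context = context.strip()[:max_context_chars]  # crude length cap (assumed value)
    return f"Context: {context}\nQuestion: {question.strip()}\nAnswer:"

def extract_answer(generated_text):
    """Mirror generate_answer: keep the text after the last 'Answer:' marker."""
    return generated_text.split("Answer:")[-1].strip()

prompt = build_prompt(
    "where is the inscription on the statue of liberty",
    "A plaque on the base has the inscription ...",
)
# Simulate a model that echoes the prompt and then appends a continuation.
fake_output = prompt + " on the base"
print(extract_answer(fake_output))  # on the base
```

A full wrapper would pass the result of `build_prompt` to the tokenizer and `model.generate` in the same way `generate_answer` does with the bare question.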


In [None]:
questions = subset['question'].tolist()
gold_answers = subset['answer'].tolist()

predicted_answers = []
for i in range(len(questions)):
    # Send the retrieved contexts to the model instead of the Wikipedia gold contexts
    result = generate_answer_with_context(question=questions[i], context=retrieved_contexts[i])
    predicted_answers.append(result)

# Print one example as a sanity check
print("Question:", questions[Q_NUM])
print("Retrieved:", retrieved_contexts[Q_NUM])
print("Ground Truth:", gold_answers[Q_NUM])
print("Predicted:", predicted_answers[Q_NUM], "\n")

pr, re, f1 = rouge1(gold_answers, predicted_answers)
print(f"ROUGE-1 Precision: {pr}, Recall: {re}, F1: {f1}")

Question: where is the inscription on the statue of liberty
Retrieved: The statue was dedicated in 1950, as one of approximately 200 replicas installed throughout the United States to commemorate the fortieth anniversary of the establishment of Boy Scouts of America. It was surveyed as part of the Smithsonian Institution's "Save Outdoor Sculpture!" program in 1994. The replica of the Statue of Liberty (Liberty Enlightening the World) is an allegorical representation of Liberty. The female figure is shown wearing a crown and robes, and holding a torch and a book or tablet. The metal sculpture measures approximately 7 ft. 4 in. x 1 ft. 10 in. x 1 ft. 10 in., and rests on a pedestal and octagonal concrete base that measures approximately 5 ft. 4 in. x 3 ft. 2 in. x 2 ft. 8 in. A plaque on the base has the inscription: "WITH THE FAITH AND COURAGE OF / THEIR FOREFATHERS WHO MADE / POSSIBLE THE FREEDOM OF THESE / UNITED STATES / THE BOY SCOUTS OF AMERICA / DEDICATE THIS REPLICA OF THE / STAT

## Step 3: Approximate kNN

In [None]:
import time
import numpy as np

d = embedded_passages.shape[1]
nlist = 100
quantizer = faiss.IndexFlatL2(d)
index_approx = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index_approx.train(embedded_passages)
index_approx.add(embedded_passages)
index_approx.nprobe = 1

latencies = []

approx_contexts = []
for question in questions:
    embedded_question = encoder.encode([question], convert_to_tensor=True).cpu().numpy()
    start_time = time.time()
    _, ix = index_approx.search(embedded_question, 1)
    end_time = time.time()
    latencies.append(end_time - start_time)
    approx_contexts.append(passages[ix[0][0]])

questions = subset['question'].tolist()
gold_answers = subset['answer'].tolist()

predicted_approx = []
for i in range(len(questions)):
    # Send the retrieved contexts to the model instead of the Wikipedia gold contexts
    result = generate_answer_with_context(question=questions[i], context=approx_contexts[i])
    predicted_approx.append(result)


pr, re, f1 = rouge1(gold_answers, predicted_approx)
print(f"ROUGE-1 Precision: {pr}, Recall: {re}, F1: {f1}")

average_latency = np.mean(latencies) * 1000
print(f"Average latency: {average_latency:.4f} ms")

ROUGE-1 Precision: 0.0625, Recall: 0.27906976744186046, F1: 0.10212765957446808
Average latency: 0.1463 ms
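
The lower latency comes from searching only part of the index. An IVF index groups vectors into `nlist` clusters and, at query time, scans only the `nprobe` clusters whose centroids are closest to the query. With `nprobe = 1`, a true nearest neighbor that sits just across a cluster boundary can be missed. The toy sketch below illustrates this trade-off in pure Python; the centroids are fixed by hand (FAISS learns them in `train`), and all names and data are illustrative assumptions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_lists(vectors, centroids):
    """Assign each vector index to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: squared_l2(v, centroids[c]))
        lists[nearest].append(i)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe=1):
    """Scan only the nprobe clusters closest to the query; return the best index."""
    probe = sorted(range(len(centroids)), key=lambda c: squared_l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return min(candidates, key=lambda i: squared_l2(query, vectors[i]), default=None)

toy_vectors = [(3.0, 0.0), (5.5, 0.0), (0.0, 1.0), (10.0, 1.0)]
toy_centroids = [(0.0, 0.0), (10.0, 0.0)]  # hand-picked, not learned
toy_lists = build_lists(toy_vectors, toy_centroids)

query = (4.8, 0.0)
approx_hit = ivf_search(query, toy_vectors, toy_centroids, toy_lists, nprobe=1)
full_hit = ivf_search(query, toy_vectors, toy_centroids, toy_lists, nprobe=2)
print(approx_hit, full_hit)  # 0 1
```

Here the query falls in the first cluster, but its closest vector (index 1) was assigned to the second, so probing one cluster returns index 0. Raising `nprobe` recovers the exact result at the cost of scanning more vectors.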
