## Chunk, embed, and index a few Wikipedia paragraphs into a "vector DB" for prototyping and debugging

In [1]:
import pickle

import faiss
import numpy as np

import torch
import transformers
from sentence_transformers import SentenceTransformer, util


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


### Load embedding model

In [2]:
# NOTE: The query encoder is not fine-tuned, so queries and documents share one model
document_encoder = SentenceTransformer("/home/stefanwebb/models/llms/multi-qa-MiniLM-L6-cos-v1")
query_encoder = document_encoder

In [3]:
document_encoder.device

device(type='cuda', index=0)

### Pre-process data
Paragraphs from the Wikipedia article on Abraham Lincoln.

In [4]:
with open("data/abraham_lincoln.txt", "r") as file:
    # Keep lines longer than 32 characters (measured before stripping) as chunks
    chunks = [line.strip() for line in file if len(line) > 32]

with open("data/lincoln_chunks.pkl", "wb") as file:
    pickle.dump(chunks, file)

print(len(chunks))

153


### Embed and index

In [6]:
# NOTE: The number of tokens is not always equal to the number of words
tokenizer = document_encoder.tokenizer
max_token_length = tokenizer.model_max_length

In [7]:
# TODO: Vectorize for speed-up
# TODO: Break up chunks that exceed the max length
for chunk in chunks:
    # input_ids includes the special tokens added by the tokenizer
    tokens = tokenizer(chunk)['input_ids']

    token_length = len(tokens)
    if token_length > max_token_length:
        raise ValueError(f"Chunk is too large: {token_length} tokens > {max_token_length}")
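
The TODO about breaking up over-long chunks could be approached by cutting the token ID sequence into overlapping windows. What follows is only a minimal sketch under stated assumptions, not part of the original notebook: `split_token_ids`, `window`, and `overlap` are hypothetical names, and plain integers stand in for real token IDs. With the real tokenizer, one would probably tokenize without special tokens and leave room for them in `window`.

```python
from typing import List, Sequence


def split_token_ids(token_ids: Sequence[int], window: int, overlap: int = 0) -> List[List[int]]:
    """Split a token sequence into windows of at most `window` tokens.

    Consecutive windows share `overlap` tokens, so text cut at a boundary
    still appears with some context in the next window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    pieces = []
    for start in range(0, len(token_ids), step):
        pieces.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the sequence
    return pieces


# Stand-in data: integers play the role of token IDs (hypothetical example).
ids = list(range(10))
print(split_token_ids(ids, window=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each window could then be decoded back to text and embedded as its own chunk. Whether overlap helps retrieval here is an open question that the notebook does not test.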


In [8]:
# TODO: Progress bar
outer_batch_size = 32   # chunks passed to encode() per call
inner_batch_size = 32   # batch size used inside encode()
count_batches = (len(chunks) + outer_batch_size - 1) // outer_batch_size  # ceiling division
embed_dim = document_encoder.get_sentence_embedding_dimension()
index = faiss.IndexFlatL2(embed_dim)   # exact (brute-force) L2 index

for batch_idx in range(count_batches):
    if batch_idx % 5 == 0:
        print(f"{batch_idx+1} of {count_batches}")

    # Slicing clamps at the end of the list, so the last batch may be shorter
    start = batch_idx * outer_batch_size
    batch = chunks[start:start + outer_batch_size]
    embeddings = document_encoder.encode(batch, batch_size=inner_batch_size)
    index.add(embeddings)                  # add vectors to the index

1 of 5
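
The batch arithmetic in the cell above is a ceiling division: with 153 chunks and a batch size of 32, `(153 + 31) // 32` gives 5 batches, the last one holding 25 chunks. One way to avoid the index bookkeeping is a small generator. This is a sketch only: `iter_batches` is a hypothetical helper, and placeholder strings stand in for the real chunks.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 153 placeholder strings, matching the chunk count printed earlier.
placeholder_chunks = [f"chunk {i}" for i in range(153)]
batch_sizes = [len(batch) for batch in iter_batches(placeholder_chunks, 32)]
print(len(batch_sizes), batch_sizes)
# 5 [32, 32, 32, 32, 25]
```

In the notebook, each yielded batch would be passed to `document_encoder.encode` and the result added to the index, exactly as the loop above does.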


In [9]:
print(index.ntotal, len(chunks))

153 153


In [10]:
faiss.write_index(index, "data/lincoln_chunks.index")

In [11]:
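def top_k_chunks(query: str, k: int = 1) -> tuple[np.ndarray, np.ndarray]:
    """
    Find the k closest chunks for a given query.

    Returns (D, I): distances and chunk indices, each of shape (1, k).
    """
    embeddings = query_encoder.encode([query])
    D, I = index.search(embeddings, k)
    return D, I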
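
To make clear what the search call does: a flat L2 index compares the query against every stored vector and returns the smallest squared L2 distances along with their positions. The pure-Python sketch below mimics that for a single query. It is an illustration only, with a hypothetical `flat_l2_search` helper and toy 2-D vectors, and it is not how FAISS is implemented internally.

```python
from typing import List, Sequence, Tuple


def flat_l2_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> Tuple[List[float], List[int]]:
    """Exhaustively compare `query` with every stored vector.

    Returns the k smallest squared L2 distances and the positions of the
    corresponding vectors, nearest first.
    """
    scored = []
    for position, vector in enumerate(vectors):
        squared_distance = sum((v - q) ** 2 for v, q in zip(vector, query))
        scored.append((squared_distance, position))
    scored.sort()  # ascending by distance, ties broken by position
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy data (hypothetical): three stored 2-D vectors and one query.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
distances, positions = flat_l2_search(stored, [1.0, 0.0], k=2)
print(distances, positions)
# [0.0, 1.0] [1, 0]
```

The real `index.search` accepts a batch of queries and returns 2-D arrays with one row per query, which is why the cells that call `top_k_chunks` read `I[0]` and `D[0]`.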

In [12]:
D, I = top_k_chunks("Who told Lincoln to grow a beard?", k=3)
for i, d in zip(I[0], D[0]):
    print(d, chunks[i])
    print("")

0.72956383 Lincoln's portrait appears on two denominations of United States currency, the penny and the $5 bill. He appears on postage stamps across the world. While he is usually portrayed bearded, he did not grow a beard until 1860 at the suggestion of 11-year-old Grace Bedell. He was the first of five presidents to do so.

0.9431581 On February 27, 1860, powerful New York Republicans invited Lincoln to give a speech at Cooper Union, in which he argued that the Founding Fathers of the United States had little use for popular sovereignty and had repeatedly sought to restrict slavery. He insisted that morality required opposition to slavery and rejected any "groping for some middle ground between the right and the wrong". Many in the audience thought he appeared awkward and even ugly. But Lincoln demonstrated intellectual leadership, which brought him into contention. Journalist Noah Brooks reported, "No man ever before made such an impression on his first appeal to a New York audience
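
The scores printed above are squared L2 distances, so smaller means closer. If the embeddings are unit-length (an assumption here; the model name suggests it was trained for cosine similarity, but the notebook does not verify this), the squared distance and the cosine similarity are related by `d² = 2 - 2·cos`. The sketch below checks that identity on two hand-made unit vectors; `cosine_from_squared_l2` is a hypothetical helper.

```python
import math


def cosine_from_squared_l2(squared_distance: float) -> float:
    """Cosine similarity of two unit vectors, given their squared L2 distance."""
    return 1.0 - squared_distance / 2.0


# Check the identity on two unit vectors that are 60 degrees apart.
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
squared = sum((x - y) ** 2 for x, y in zip(a, b))
dot = sum(x * y for x, y in zip(a, b))
print(round(cosine_from_squared_l2(squared), 6), round(dot, 6))
# 0.5 0.5
```

Under that unit-length assumption, the top score of 0.72956383 would correspond to a cosine similarity of about 0.635.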

In [13]:
D, I = top_k_chunks("Why did Lincoln die?", k=3)
for i, d in zip(I[0], D[0]):
    print(d, chunks[i])
    print("")

0.56550646 On January 27, 1838, Abraham Lincoln, then 28 years old, delivered his first major speech at the Lyceum in Springfield, Illinois, after the murder of newspaper editor Elijah Parish Lovejoy in Alton. Lincoln warned that no trans-Atlantic military giant could ever crush the U.S. as a nation. "It cannot come from abroad. If destruction be our lot, we must ourselves be its author and finisher", said Lincoln. Prior to that, on April 28, 1836, a black man, Francis McIntosh, was burned alive in St. Louis, Missouri. Zann Gill describes how these two murders set off a chain reaction that ultimately prompted Abraham Lincoln to run for President.

0.63072133 At 10:15 in the evening Booth entered the back of Lincoln's theater box, crept up from behind, and fired at the back of Lincoln's head, mortally wounding him. Lincoln's guest, Major Henry Rathbone, momentarily grappled with Booth, but Booth stabbed him and escaped. After being attended by Doctor Charles Leale and two other doctors,