## ATNLP Lab 2 – Towards retrieval-augmented generation: from a PDF to a vector database

The objective of this tutorial is to build a basic semantic search (also called vector search) system over a PDF. We will build the utilities and data structures needed to retrieve, given a textual query, the chunks of text from a PDF document that are relevant to it. The steps are:

* Extract text from a PDF.  
* Split the text into moderately sized chunks (≈ 2–3 sentences each).  
* Encode all chunks with a small SBERT model and persist the embeddings.  
* Implement a tiny retrieval function that, given a user query, returns the most relevant chunk(s) of the original document.

#### Vector database vs Relational database

A vector database is optimized for storing and querying high-dimensional data, like vectors used in AI for similarity searches. Unlike relational databases, which organize data into tables with rows and columns for structured data, vector databases focus on efficiently managing vector data, enabling fast nearest neighbor searches. While relational databases excel in handling structured data with clear relationships, vector databases are ideal for unstructured data where relationships are defined by vector proximity. 
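
To make the contrast concrete, here is a minimal sketch in plain Python. The rows, the 2-D vectors and the `euclidean` helper are invented for illustration; real embeddings have hundreds of dimensions, and a real vector database uses specialised indexes rather than a linear scan.

```python
import math

# Relational-style lookup: find rows whose column matches a value exactly.
rows = [
    {"id": 1, "topic": "cells"},
    {"id": 2, "topic": "genetics"},
    {"id": 3, "topic": "ecology"},
]
exact_matches = [row["id"] for row in rows if row["topic"] == "genetics"]

# Vector-style lookup: find the stored vector closest to a query vector.
# These 2-D vectors are made up for the example.
vectors = {1: (0.9, 0.1), 2: (0.1, 0.9), 3: (0.7, 0.7)}
query = (0.2, 0.8)

def euclidean(a, b):
    """Straight-line distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

nearest_id = min(vectors, key=lambda i: euclidean(vectors[i], query))
print(exact_matches, nearest_id)
```

The first lookup needs an exact value to match on, while the second only needs a notion of distance between vectors.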

In this tutorial, we will take a very simple approach and just persist the vectors to disk. At the end, you will find a pointer to open-source vector databases that are used in production in many industry applications.

### 1. Imports & Global Config

In [None]:
!pip install sentence-transformers==2.2.2 pdfminer.six==20221105 nltk==3.8.1 scikit-learn==1.3.2 tqdm

In [None]:
from pathlib import Path
import pickle
import numpy as np
from typing import List, Tuple
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

from pdfminer.high_level import extract_text
import nltk
from tqdm.auto import tqdm

nltk.download('punkt')  # sentence tokenizer


### 2. Extracting text from a PDF

We need a PDF to practice on. You can download a biology textbook from [this link](https://www.basicbiology.net/wp-content/uploads/edd/2018/05/Basic-Biology-an-introduction.pdf) and save it in the `data/` folder. We will extract the raw text with a PDF utility and take a look at it.


In [None]:
PDF_PATH = "data/Basic-Biology-an-introduction.pdf"

raw_text = extract_text(PDF_PATH)

In [None]:
print(f"Raw text length: {len(raw_text):,} characters")
print(raw_text[:100]) 
print(f"\nText in the middle: ...{raw_text[10000:10342]}") 

### 3. Post-process raw text into chunks

The objective is to obtain small pieces of text that contain meaningful content, which we will embed and run vector search over later on. Is a sentence the appropriate size? Maybe too small. How about a paragraph? And how can we detect paragraphs in the text we have?

Here you should play around a bit with different ways of splitting the text of the pdf into chunks.
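
One possible starting point for paragraph detection is to split on blank lines. The sketch below assumes that paragraphs in the extracted text are separated by empty lines and that page breaks show up as form-feed characters, which is often but not always the case for PDF extraction output. `split_paragraphs` is a hypothetical helper name, not part of any library.

```python
import re
from typing import List

def split_paragraphs(text: str) -> List[str]:
    """Split text on blank lines and collapse whitespace inside each paragraph."""
    # Treat form feeds (often emitted at page breaks) as paragraph breaks.
    text = text.replace("\x0c", "\n\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        cleaned = " ".join(block.split())  # join wrapped lines, drop extra spaces
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs

sample = (
    "Cells are the basic\nunit of life.\n\n\n"
    "DNA stores\ngenetic information.\x0cPage two starts here."
)
print(split_paragraphs(sample))
```

Check the output on the real `raw_text` before relying on it: headers, captions and tables can produce very short or very long "paragraphs".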

In [None]:
def chunk_text(
    text: str,
) -> List[str]:
    """
    Split the raw text into chunks.
    """
    pass
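
As one way to fill in the stub, the sketch below groups consecutive sentences into chunks of a fixed size. It uses a naive regular-expression sentence splitter so that it runs on its own; `nltk.sent_tokenize` (the `punkt` data is downloaded in section 1) should handle abbreviations and similar cases better. The name `chunk_by_sentences` and the `sentences_per_chunk` parameter are assumptions made for this example.

```python
import re
from typing import List

def chunk_by_sentences(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """Group consecutive sentences into chunks of a fixed number of sentences."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

sample = "Cells divide. DNA is copied first. Then the cell splits. Two cells result."
print(chunk_by_sentences(sample, sentences_per_chunk=2))
```

A fixed sentence count is only one choice; overlapping chunks or a maximum character length are worth trying too.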

### 4. Embedding the chunks

Now it is time to embed those chunks into vectors. We will use a small transformer model from the `sentence-transformers` library. I highly recommend checking their [website and documentation](https://sbert.net/): they include many short, well-presented articles on how to train, evaluate and architect sentence encoders.

Because we want the embedding process to be fast, implement batching.


In [None]:
MODEL_NAME = "all-MiniLM-L6-v2"  # ~64 MB, 384-dimensional vectors
model = SentenceTransformer(MODEL_NAME, device="cpu") # if you have GPU available, play around with bigger models!

def embed_texts(texts: List[str]) -> np.ndarray:
    """
    Encode a list of texts -> numpy matrix (n_chunks × dim).
    """
    pass

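batch_size = 32  # adjust based on your RAM
chunks = chunk_text(raw_text)  # list of chunks from section 3

# Embed the chunks batch by batch, then stack the batches into one matrix.
batch_embeddings = []
for start in tqdm(range(0, len(chunks), batch_size)):
    batch_of_chunks = chunks[start:start + batch_size]
    batch_embeddings.append(embed_texts(batch_of_chunks))
embeddings = np.vstack(batch_embeddings)  # shape: (n_chunks, dim)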
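
As an alternative to a hand-written batch loop, `SentenceTransformer.encode` accepts a `batch_size` argument and batches internally. The sketch below is one possible implementation, not the reference solution. The function name `embed_with_model` and the choice to pass the model in as a parameter are assumptions made so the example is self-contained.

```python
from typing import List
import numpy as np

def embed_with_model(texts: List[str], model, batch_size: int = 32) -> np.ndarray:
    """Encode texts into an (n_texts, dim) float32 matrix using model.encode."""
    vectors = model.encode(
        texts,
        batch_size=batch_size,      # the library splits texts into batches internally
        convert_to_numpy=True,      # return a numpy array rather than tensors
        normalize_embeddings=True,  # unit-length rows: dot product equals cosine similarity
        show_progress_bar=False,
    )
    return np.asarray(vectors, dtype=np.float32)
```

With the model loaded above, a call would look like `embeddings = embed_with_model(chunks, model)`. Normalizing here is a design choice: it makes the later similarity computation a plain matrix-vector product.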

### 5. Persist the data

The end result should be a list of embeddings and a list of chunks with matching indices. We will serialize both with `pickle` as `.pkl` files (note that pickling serializes the objects but does not compress them). We also build a helper function to load them back.

In [None]:
save_folder = Path("data/biology/")
save_folder.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist yet
with open(save_folder / "chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)
with open(save_folder / "embeddings.pkl", "wb") as f:
    pickle.dump(embeddings, f)

In [None]:
def unpickle_file(file_path: str):
    with open(file_path, "rb") as f:
        data = pickle.load(f)
    return data
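
Since retrieval relies on chunks and embeddings sharing indices, one option is to store both in a single file and check their lengths before saving. The sketch below shows that idea with toy data in a temporary folder; `save_index`, `load_index` and the file name `index.pkl` are hypothetical names chosen for the example.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, List, Tuple

def save_index(folder: str, chunks: List[str], embeddings: Any) -> Path:
    """Write chunks and embeddings into one pickle file inside `folder`."""
    if len(chunks) != len(embeddings):
        raise ValueError("chunks and embeddings must have the same length")
    path = Path(folder)
    path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = path / "index.pkl"
    with open(target, "wb") as f:
        pickle.dump({"chunks": chunks, "embeddings": embeddings}, f)
    return target

def load_index(file_path: Path) -> Tuple[List[str], Any]:
    """Read back the (chunks, embeddings) pair written by save_index."""
    with open(file_path, "rb") as f:
        data = pickle.load(f)
    return data["chunks"], data["embeddings"]

# Round trip in a temporary folder with toy data.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_index(tmp, ["chunk a", "chunk b"], [[0.1, 0.2], [0.3, 0.4]])
    loaded_chunks, loaded_embeddings = load_index(saved)
print(loaded_chunks, loaded_embeddings)
```

Only unpickle files you created yourself, because loading a pickle can execute arbitrary code.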

### 6. Build a simple retrieval function

For an input sentence (or query), we need to embed it with the model and compute its cosine similarity with each of the textbook's chunk embeddings. Then we return the top 3 chunks.
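
For reference, the cosine similarity between a query embedding $q$ and a chunk embedding $c$ (the symbols are chosen here for illustration) is the dot product divided by the product of the vector norms:

$$
\cos(q, c) = \frac{q \cdot c}{\lVert q \rVert \, \lVert c \rVert}
           = \frac{\sum_i q_i c_i}{\sqrt{\sum_i q_i^2}\,\sqrt{\sum_i c_i^2}}
$$

For example, with $q = (1, 0)$ and $c = (1, 1)$ we get $q \cdot c = 1$, $\lVert q \rVert = 1$ and $\lVert c \rVert = \sqrt{2}$, so $\cos(q, c) = 1/\sqrt{2} \approx 0.707$. If all embeddings are normalized to unit length, the denominator is 1 and the similarity reduces to the dot product.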

In [None]:
embeddings = unpickle_file("data/biology/embeddings.pkl")
chunks = unpickle_file("data/biology/chunks.pkl")

# Note: this definition shadows the `cosine_similarity` imported from scikit-learn in section 1.
def cosine_similarity(vector_1, vector_2):
    pass

def search(
        query: str,
        embeddings: np.ndarray,
        chunks: List[str],
        model: SentenceTransformer,
        top_k: int = 3,
) -> List[Tuple[str, float]]:
    """
    Given a query, return the `top_k` most similar chunks with their similarity scores.
    """
    pass
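
The ranking step can be sketched with NumPy as follows. This is one possible approach rather than the reference solution. The names `cosine_scores` and `top_k_chunks` are hypothetical, and the query is passed in as an already-embedded vector (in the lab it would come from `model.encode(query)`) so that the example runs without downloading a model.

```python
from typing import List, Tuple
import numpy as np

def cosine_scores(query_vec: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and every row of matrix."""
    query_norm = np.linalg.norm(query_vec)
    row_norms = np.linalg.norm(matrix, axis=1)
    # Small epsilon guards against division by zero for all-zero vectors.
    return (matrix @ query_vec) / (row_norms * query_norm + 1e-12)

def top_k_chunks(
    query_vec: np.ndarray,
    embeddings: np.ndarray,
    chunks: List[str],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k chunks most similar to query_vec, best first."""
    scores = cosine_scores(query_vec, embeddings)
    best = np.argsort(scores)[::-1][:top_k]  # indices sorted by descending score
    return [(chunks[i], float(scores[i])) for i in best]

# Toy 2-D example with made-up vectors and chunk texts.
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_chunks = ["about cells", "about genes", "about both"]
results = top_k_chunks(np.array([1.0, 0.2]), toy_embeddings, toy_chunks, top_k=2)
print(results)
```

Sorting all scores is fine for a single textbook; for much larger collections, `np.argpartition` or an approximate nearest-neighbour index avoids the full sort.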

### 7. Final thoughts

This implementation is useful for grasping the basics, but it is a very simple version of an extremely powerful tool: vector search and vector databases. If you have a bit of time and want to push this further, I recommend taking a look at [Qdrant](https://qdrant.tech/), an open-source vector database that does what we just did, but at scale.