# Testing framework for creating the RAG pipeline

1. Create the collection of text embeddings: we need a running Qdrant container so we can ingest our documents into a collection.
Let's add 3-4 document texts about museums and their information.
2. Send the LLM prompt and use a Qdrant method to query the knowledge base:
results = client.query_points; check whether it has options for top-k or similarity type.
3. When querying the vector database, if all top-k answers are below a threshold, answer with "I do not know".

##### We have already started a Qdrant container that runs the vector database, so we only need to connect to it (it runs as a separate service, so for now we do not worry about the setup step)

In [68]:
import tiktoken
import numpy as np 

In [112]:
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

- create simple collection of vectors - choose dim=384 and cosine similarity as distance

In [111]:
from qdrant_client import models
from qdrant_client.models import Distance, VectorParams

# client.create_collection(
#     collection_name="museum_collection",
#     vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
#)

- we now need to select an embedding model to encode our text documents so we can insert them into the collection
- we can use sentence-transformers with bge-small-en

# Questions

- another question is how we are going to handle the text paragraphs of each document
- let's say we have scraped the URLs of 3 Greek museums/foundations
- each museum/foundation text clearly needs its own embedding
- additionally, how will we handle the large text paragraphs of each one?
- we can't create one embedding for each paragraph
- so we need to create several vectors (chunks) within each document
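
One way to explore the chunking question above is to group consecutive paragraphs until a size budget is reached. The sketch below is only an illustration: `split_into_chunks` and `max_words` are hypothetical names, and counting whitespace-separated words is a stand-in for a real token count.

```python
def split_into_chunks(text: str, max_words: int = 120) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_words words.

    Simplification: a single paragraph longer than max_words is kept whole.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for paragraph in paragraphs:
        n_words = len(paragraph.split())
        # Close the current chunk when this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_words
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = (
    "Acropolis Museum\n"
    "The museum was founded in 2003.\n\n"
    "It opened to the public on 20 June 2009."
)
chunks = split_into_chunks(sample, max_words=8)
```

Each chunk would then get its own vector in the collection instead of one vector per document.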

# We need to experiment for this step
- first start by studying our documents 

In [4]:
from openai import OpenAI
import os 
from dotenv import load_dotenv

load_dotenv()

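api_key = os.environ['OPENAI_API_KEY']  # raises KeyError early if the key is missing; OpenAI() also reads it from the environment
openai_client = OpenAI()  # separate name so it does not shadow the Qdrant client defined above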

- A potential idea to explore is based on the assumption that we may have metadata for each museum so we can add those metadata to each embedding part 
- We will start with this idea
- Knowing the museum/foundation name is potentially a strong assumption that needs investigation (maybe an option to use it if it is known)
- Based on the Task description we make the assumption first that we can distinguish the texts into different files so we can ingest them per file 
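
The metadata idea above could look roughly like the following. This is a sketch under assumptions: `build_payloads` and the field names (`museum`, `chunk_index`, `embed_text`) are hypothetical, and the museum name is assumed to be known for each file.

```python
def build_payloads(source: str, museum_name: str, chunks: list[str]) -> list[dict]:
    """Attach file-level metadata to every chunk of one document."""
    return [
        {
            "source": source,            # file the chunk came from
            "museum": museum_name,       # assumed to be known per file
            "chunk_index": i,            # position of the chunk in the document
            "text": chunk,               # raw chunk text, returned as context
            # Text to embed: prefixing the name keeps the entity in every vector.
            "embed_text": f"{museum_name}\n{chunk}",
        }
        for i, chunk in enumerate(chunks)
    ]


payloads = build_payloads("d1.txt", "Acropolis Museum", ["part one", "part two"])
```

If the name turns out not to be available, the same structure works with only `source` and `chunk_index`.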

In [None]:
# # create function to ingest the files' paragraphs (split into chunks) into the collection
# import os 
# data_dir = "/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data"
# for d in os.listdir(data_dir):
#     with open(os.path.join(data_dir, d), "r", encoding="utf-8") as df:
#         # call function to split document and ingest


# Study and ingest one document

In [14]:
with open("/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data/d1.txt",mode="r",encoding="utf-8") as df1:
    document_content = df1.read()

In [15]:
document_content

'Acropolis Museum\nThe Acropolis Museum (Greek: Μουσείο Ακρόπολης, Mouseio Akropolis) is an archaeological museum focused on the findings of the archaeological site of the Acropolis of Athens. The museum was built to house every artifact found on the rock and on the surrounding slopes, from the Greek Bronze Age to Roman and Byzantine Greece. The Acropolis Museum also lies over the ruins of part of Roman and early Byzantine Athens.\nThe museum was founded in 2003 while the Organization of the Museum was established in 2008. It opened to the public on 20 June 2009.[1] More than 4,250 objects are exhibited over an area of 14,000 square metres.\nThe museum is located by the southeastern slope of the Acropolis hill, on the ancient road that led up to the "sacred rock" in classical times. Set only 280 meters (310 yd), away from the Parthenon, and a 400 meters (440 yd) walking distance from it, the museum is the largest modern building erected so close to the ancient site\nThe entrance fee to

### count sample document tokens
- This preprocessing step lets us estimate whether a whole document can be encoded as a single embedding or whether we need a chunk-based approach. The check should be generic, since the result changes per document.

In [19]:
enc = tiktoken.encoding_for_model("gpt-4")

token_count = len(enc.encode(document_content))


In [20]:
token_count

483

#### Let's encode each document as a single BGE embedding to study query performance and embedding granularity

In [27]:
path = "/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data"

def get_document_content(path):
    # note: os.listdir returns the files in arbitrary order
    document_sources = os.listdir(path)
    for d in document_sources:
        with open(os.path.join(path,d),"r",encoding="utf-8") as df:
            # yield the raw text; splitting and ingestion happen later
            yield df.read()

In [28]:
gen = iter(get_document_content(path))

In [29]:
doc1 = next(gen)
doc2 = next(gen)
doc3 = next(gen)

In [51]:
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('BAAI/bge-m3',  
                       use_fp16=True,
                       devices='cpu') # Setting use_fp16 to True speeds up computation with a slight performance degradation


Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 5175.81it/s]


In [52]:
doc1

'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled collections of Cycladic a

In [78]:
doc2

'Acropolis Museum\nThe Acropolis Museum (Greek: Μουσείο Ακρόπολης, Mouseio Akropolis) is an archaeological museum focused on the findings of the archaeological site of the Acropolis of Athens. The museum was built to house every artifact found on the rock and on the surrounding slopes, from the Greek Bronze Age to Roman and Byzantine Greece. The Acropolis Museum also lies over the ruins of part of Roman and early Byzantine Athens.\nThe museum was founded in 2003 while the Organization of the Museum was established in 2008. It opened to the public on 20 June 2009.[1] More than 4,250 objects are exhibited over an area of 14,000 square metres.\nThe museum is located by the southeastern slope of the Acropolis hill, on the ancient road that led up to the "sacred rock" in classical times. Set only 280 meters (310 yd), away from the Parthenon, and a 400 meters (440 yd) walking distance from it, the museum is the largest modern building erected so close to the ancient site\nThe entrance fee to

In [79]:
doc3

'Eugenides Foundation\nThe Eugenides Foundation (Greek: Ίδρυμα Ευγενίδου) is a Greek private educational foundation. It was established in 1956 in Athens, Greece implementing the will of the late Greek benefactor Eugenios Eugenidis, who died in April 1954.\n\nThe activity of the foundation, in accordance with its articles of association, is to contribute to the scientific and technological education of young people in Greece. The foundation is administered by a committee of three persons, which is participated by each professor which is elected as a rector of the National Technical University of Athens (NTUA) until the end of his term as a rector. For its multifaceted contribution to Greek society, Eugenides Foundation was honored in December 1965 with the gold medal of the Academy of Athens.\nActivities\nThe activities and establishments of the Foundation include:\n\na scholarship program granting 20 scholarships annually\na scientific and technical library,\na museum of science and t

In [61]:
d1_embedding = model.encode([doc1])
d2_embedding = model.encode([doc2])
d3_embedding = model.encode([doc3])

In [65]:
d1_embedding['dense_vecs'].shape, d2_embedding['dense_vecs'].shape, d3_embedding['dense_vecs'].shape

((1, 1024), (1, 1024), (1, 1024))

In [94]:
query = "when was the goulandris museum founded" 
query_embedding = model.encode([query])

In [88]:
all_document_embeddings = np.stack([d1_embedding['dense_vecs'],d2_embedding['dense_vecs'],d3_embedding['dense_vecs']]).reshape(-1,1024)

In [95]:
similarity = query_embedding['dense_vecs'] @ all_document_embeddings.T


In [96]:
similarity

array([[0.6212836 , 0.43497568, 0.35743386]], dtype=float32)

#### The simplest approach is to encode each document as its own embedding in the database, so we start with this

- This seems like a good foundation for a first simple solution
- The model seems able to tell the documents apart: the Goulandris query scores 0.62 against the Goulandris document, versus 0.43 and 0.36 for the other two
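
To make the comparison above explicit, here is a small dependency-free sketch of ranking documents by cosine similarity. The function names and the toy 2-dimensional vectors are illustrative only; the real embeddings have 1024 dimensions. For unit-length vectors the plain dot product equals the cosine similarity; the sketch normalizes explicitly so it does not rely on that.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (document index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


toy_query = [1.0, 0.0]
toy_docs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
ranking = rank_documents(toy_query, toy_docs)
```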

### So now based on this idea let's encode the documents and ingest them into our collection

# create the collection first 

In [163]:
model_name = "BAAI/bge-m3"
client.create_collection(
    collection_name="museum_collection",
    vectors_config=models.VectorParams(
        size=1024,
        distance=models.Distance.COSINE
    ),  # size and distance are model dependent
)

True

# next we need to insert the documents 

In [156]:
document_sources = os.listdir("/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data")

path = "/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data"

def get_document_content(path):
    # note: os.listdir returns the files in arbitrary order
    document_sources = os.listdir(path)
    for i,d in enumerate(document_sources):
        with open(os.path.join(path,d),"r",encoding="utf-8") as df:
            # yield (id, file name, raw text); ingestion happens later
            yield i,d,df.read()

In [157]:
gen = iter(get_document_content(path))

In [158]:
metadata_with_docs = [
    {"id":x[0],"document": x[2], "source": x[1]} for x in gen 
]


In [159]:
metadata_with_docs

[{'id': 0,
  'document': 'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled 

In [164]:
collection_name = "museum_collection"
client.upsert(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=doc['id'],
            payload = {"text":doc['document'],"id":doc['id'],"source":doc['source']},
            vector=model.encode(doc['document'])['dense_vecs'].tolist(),  # numpy vector -> plain list of floats
        )
        for doc in metadata_with_docs
    ]
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

#### ok all documents for our rag pipeline are uploaded

let's try some queries 

In [169]:
query =  "when was goulandris museum created ?"
query_embedding = model.encode(query)['dense_vecs']

In [None]:
# we can try simple querying using a top-k

In [181]:
top_k = 1
query_result = client.query_points(
    collection_name=collection_name,
    query=query_embedding,
    with_vectors=True,
    with_payload=True,
    limit=top_k, # we can limit the results
    score_threshold = 0.55, # we can also set a minimum score for returning results
)



- To simplify, we limit the query with top_k=1 so we get back only one chunk
- The score threshold is useful for returning an "I do not know" answer:
- if all returned chunks score below the threshold, the query simply returns an empty list
- and we can then decide to return a string such as "please add more information / I am not sure / I do not know"
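
The fallback described above can be sketched without a running database. This is a minimal illustration, not the final pipeline: `select_context`, `answer_or_fallback`, and the `(score, text)` hit format are hypothetical, and `ask_llm` stands in for whatever function calls the model.

```python
FALLBACK_ANSWER = "I do not know"


def select_context(hits, score_threshold=0.55):
    """Join the texts of all hits at or above the threshold, or return None."""
    kept = [text for score, text in hits if score >= score_threshold]
    return "\n".join(kept) if kept else None


def answer_or_fallback(hits, ask_llm, score_threshold=0.55):
    """Skip the LLM entirely when no hit passes the threshold."""
    context = select_context(hits, score_threshold)
    if context is None:
        return FALLBACK_ANSWER
    return ask_llm(context)


def fake_llm(context):
    # Stand-in for the real LLM call; it just echoes the context size.
    return f"answer based on {len(context)} characters"


good = answer_or_fallback([(0.63, "Goulandris text")], fake_llm)
bad = answer_or_fallback([(0.40, "unrelated text")], fake_llm)
```

When `score_threshold` is passed to the query itself, as in the cell above, the filtering already happens on the database side and only the empty-list check is needed.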

- This is only part of our RAG pipeline
- having got the retrieved result, we now have to feed it to an LLM
- together with the prompt (the query text) and get back the final answer

# NOTES 
- later we can also filter the results, for example requiring that all answers share the same doc_id,
- which is a logical and required constraint if we ingest different documents and we know that 
- the concepts/entities are separated per file 
- for example, we can fetch the best results and then, if the top hit has doc_id = 1, fetch all other chunks with the same doc_id
- also, when all scores are below the threshold we can return an empty list and never call GPT 
(in this case we give full confidence to our knowledge base and none to the LLM's general knowledge)
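
The "same doc_id" idea from the notes can be illustrated in plain Python. This is a sketch under assumptions: `expand_to_best_document` and the dictionary fields `doc_id`, `score`, and `text` are hypothetical; in the real pipeline the second step would more likely be a filtered query against the collection.

```python
def expand_to_best_document(hits, all_chunks):
    """Return every chunk text that belongs to the document of the best hit."""
    if not hits:
        return []  # nothing passed the threshold, so there is no context
    best_doc_id = max(hits, key=lambda h: h["score"])["doc_id"]
    return [c["text"] for c in all_chunks if c["doc_id"] == best_doc_id]


all_chunks = [
    {"doc_id": 0, "text": "Goulandris part 1"},
    {"doc_id": 0, "text": "Goulandris part 2"},
    {"doc_id": 1, "text": "Acropolis part 1"},
]
hits = [{"doc_id": 0, "score": 0.63}, {"doc_id": 1, "score": 0.43}]
expanded = expand_to_best_document(hits, all_chunks)
```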


- we are going to use GPT-4.1 for the answer generation

In [180]:
query_result

QueryResponse(points=[ScoredPoint(id=0, version=0, score=0.63075626, payload={'text': 'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times 

### create the prompt to the llm and the enriched context

In [204]:
def system_prompt() -> str:
    """
    Generates the system prompt for the information retrieval task.

    Returns:
        str: The system prompt describing the assistant's role.
    """
    return """You are a culture assistant specialized in information about museums, cultural foundations and events. You get questions as input and need to answer with accuracy."""

In [206]:
def user_prompt(question,context) -> str:
    """
    Generates the user prompt for the information retrieval task.

    Args:
        question: The question asked by the user.
        context: The retrieved text the answer should be based on.

    Returns:
        str: A formatted user prompt containing the question and the context.
    """
    return f"""Answer the following question using the provided context. 
If you can't find the answer, do not pretend you know it, but answer "I don't know".

### Question
{question}

### Context 
{context}
"""



### Offer context information to llm using our retrieved result 

In [199]:
context = ""
context += "\n".join([x.payload['text'] for x in query_result.points])

In [207]:
context

'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled collections of Cycladic a

An important point to take into account: if the extracted context is too large to fit into the LLM's context window, we need a way to chunk or trim it before sending it.
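
One simple option for the size problem above is to keep retrieved chunks, best first, until a token budget is used up. The sketch is an assumption-laden illustration: `fit_to_budget` is a hypothetical helper, and the default counter splits on whitespace only so the example runs without extra packages.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep chunks in order until adding the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept


trimmed = fit_to_budget(["a b c", "d e", "f g h i"], max_tokens=5)
```

With the tokenizer used earlier in this notebook, the counter could be passed as `count_tokens=lambda text: len(enc.encode(text))`.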


In [202]:
token_count = len(enc.encode(context))
token_count


367

#### Now that we have the contextual information as well, we need to feed it to the LLM along with the original query and the prompts

In [203]:
import random
import time
from openai import RateLimitError


client = OpenAI()


def call_llm_with_retry(func, max_retries: int = 5, initial_delay: int = 4):
    """
    Call func() and retry with exponential backoff when the API rate limit is hit.
    """
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            print(f"Rate limit hit, retrying in {delay}s...")
            time.sleep(delay + random.uniform(0, 0.5))  # small jitter
            delay *= 2  # exponential backoff
    raise Exception("Max retries exceeded")

In [209]:
# let's ask the llm 
query = "when was goulandris museum created ?"
context

'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled collections of Cycladic a

In [210]:
completion = call_llm_with_retry(
            lambda: client.chat.completions.create(
                model="gpt-4.1",
                messages=[
                    {"role": "system", "content": system_prompt()},
                    {"role": "user", "content": user_prompt(query,context)},
                ],
                temperature=0.1,
                top_p=0.95,
            )
        )

In [228]:
completion.choices[0].message.content

'The Goulandris Museum of Cycladic Art was founded in 1986.'