# Testing framework for creating the RAG pipeline

1. Create the collection of text embeddings: we need a running Qdrant container so we can ingest our documents into a collection.
   Let's add 3-4 document texts about museums and their information.
2. Send the LLM prompt and use a Qdrant method to query the knowledge base:
   results = client.query_points(...); check whether it has options for top-k or similarity type.
3. When querying the vector database, if all top-k results are below a threshold, answer with "I do not know".

##### We have already started a Qdrant container that runs the vector database, so we only need to connect to it (it runs as a separate service, so for now we do not worry about the setup step)

In [16]:
import tiktoken
import numpy as np 
from typing import List
import re

In [141]:
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

- Create a simple collection of vectors: choose dim=384 and cosine similarity as the distance

In [3]:
from qdrant_client import models  # the cells below only use models.VectorParams, models.Distance, models.PointStruct

# client.create_collection(
#     collection_name="museum_collection",
#     vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE)
#)

- We now need to select an embedding model to encode our text documents so we can insert them into the collection.
- We can use sentence-transformers with bge-small-en, whose embeddings are 384-dimensional.

# Questions

- Another question is how we are going to handle the text paragraphs of each document.
- Let's say we have scraped the URLs of 3 Greek museums/foundations.
- Each museum/foundation text clearly needs its own embedding.
- Additionally, how will we handle the large text paragraphs of each one?
- We probably can't rely on a single embedding for each large paragraph,
- so we need to create several vectors within each document.

# We need to experiment for this step
- First, start by studying our documents.

In [4]:
from openai import OpenAI
import os 
from dotenv import load_dotenv

load_dotenv()

api_key = os.environ['OPENAI_API_KEY']
openai_client = OpenAI()  # separate name so the Qdrant `client` above is not overwritten

- A potential idea to explore: assuming we have metadata for each museum, we can attach that metadata to every chunk we embed.
- We will start with this idea.
- Knowing the museum/foundation name is potentially a strong assumption that needs investigation (it could be an option that is used only when the name is known).
- Based on the task description, we first assume that the texts can be separated into different files, so we can ingest them per file.
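
The metadata idea above can be sketched as follows. This is only a minimal illustration, assuming each file starts with the museum/foundation name on its first line (as the sample documents in this notebook do). The helper name `build_chunk_payloads` and the payload keys `museum` and `chunk_index` are hypothetical and not part of the pipeline yet.

```python
from typing import Dict, List


def build_chunk_payloads(document: str, source: str) -> List[Dict[str, object]]:
    """Split a document into paragraphs and attach metadata to each one.

    Assumption: the first non-empty line of the file is the museum name.
    """
    lines = [line.strip() for line in document.split("\n") if line.strip()]
    if not lines:
        return []
    museum_name, paragraphs = lines[0], lines[1:]
    return [
        {
            "museum": museum_name,   # hypothetical metadata key
            "source": source,        # file the paragraph came from
            "chunk_index": index,    # position inside the document
            # Prefix the name so each chunk stays self-describing when embedded.
            "text": f"{museum_name}: {paragraph}",
        }
        for index, paragraph in enumerate(paragraphs)
    ]


sample = (
    "Acropolis Museum\n"
    "The museum was founded in 2003.\n\n"
    "It opened to the public on 20 June 2009."
)
payloads = build_chunk_payloads(sample, source="d1.txt")
for payload in payloads:
    print(payload["chunk_index"], payload["text"])
```

Prefixing the name is one possible design choice; if the name is not reliably known, the same payload could simply omit the `museum` key.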

In [None]:
# # create function to ingest the files (split into chunks) paragraphs into the collection 
# import os 
# document_sources = os.listdir("/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data")
# for d in document_sources:
#     with open(os.path.join(document_sources,d),"r") as df:
#         # call function to split document and ingest


# Study and ingest one document

In [5]:
with open("/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data/d1.txt",mode="r",encoding="utf-8") as df1:
    document_content = df1.read()

In [7]:
document_content

'Acropolis Museum\nThe Acropolis Museum (Greek: Μουσείο Ακρόπολης, Mouseio Akropolis) is an archaeological museum focused on the findings of the archaeological site of the Acropolis of Athens. The museum was built to house every artifact found on the rock and on the surrounding slopes, from the Greek Bronze Age to Roman and Byzantine Greece. The Acropolis Museum also lies over the ruins of part of Roman and early Byzantine Athens.\nThe museum was founded in 2003 while the Organization of the Museum was established in 2008. It opened to the public on 20 June 2009.[1] More than 4,250 objects are exhibited over an area of 14,000 square metres.\nThe museum is located by the southeastern slope of the Acropolis hill, on the ancient road that led up to the "sacred rock" in classical times. Set only 280 meters (310 yd), away from the Parthenon, and a 400 meters (440 yd) walking distance from it, the museum is the largest modern building erected so close to the ancient site\nThe entrance fee to

### Count sample document tokens
- This preprocessing step is required to estimate whether, for example, the whole document can be encoded as a single embedding or whether a chunk-based approach is needed. It should stay generic, since the result changes per document.
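
A minimal sketch of that decision, assuming a token budget that we choose ourselves. The function name `choose_embedding_strategy` and the whitespace counter are hypothetical stand-ins; in this notebook the counter would more likely be a tiktoken-based one, and the embedding model's own tokenizer may give different counts.

```python
from typing import Callable


def choose_embedding_strategy(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> str:
    """Return "single" if the text fits the budget, otherwise "chunked"."""
    return "single" if count_tokens(text) <= max_tokens else "chunked"


def whitespace_token_count(text: str) -> int:
    # Rough stand-in tokenizer for the example only; real counts will differ.
    return len(text.split())


short_text = "The museum was founded in 2003."
long_text = "word " * 50
print(choose_embedding_strategy(short_text, whitespace_token_count, max_tokens=20))
print(choose_embedding_strategy(long_text, whitespace_token_count, max_tokens=20))
```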

In [8]:
enc = tiktoken.encoding_for_model("gpt-4")

token_count = len(enc.encode(document_content))


In [9]:
token_count

483

#### Let's encode each document as a single embedding with BGE to study query performance and embedding granularity

In [27]:
path = "/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data"

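def get_document_content(path):
    """Yield the raw text of every file found in `path`."""
    # os.listdir gives no ordering guarantee, so which file becomes
    # doc1/doc2/doc3 below depends on the order the filesystem returns
    document_sources = os.listdir(path)
    for d in document_sources:
        with open(os.path.join(path, d), "r", encoding="utf-8") as df:
            yield df.read()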

In [28]:
gen = iter(get_document_content(path))

In [29]:
doc1 = next(gen)
doc2 = next(gen)
doc3 = next(gen)

In [11]:
from FlagEmbedding import BGEM3FlagModel
# Setting use_fp16 to True speeds up computation with a slight performance degradation
model = BGEM3FlagModel('BAAI/bge-m3',
                       use_fp16=True,
                       devices='cpu')


Fetching 30 files: 100%|██████████| 30/30 [00:00<00:00, 10497.13it/s]


In [52]:
doc1

'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled collections of Cycladic a

In [78]:
doc2

'Acropolis Museum\nThe Acropolis Museum (Greek: Μουσείο Ακρόπολης, Mouseio Akropolis) is an archaeological museum focused on the findings of the archaeological site of the Acropolis of Athens. The museum was built to house every artifact found on the rock and on the surrounding slopes, from the Greek Bronze Age to Roman and Byzantine Greece. The Acropolis Museum also lies over the ruins of part of Roman and early Byzantine Athens.\nThe museum was founded in 2003 while the Organization of the Museum was established in 2008. It opened to the public on 20 June 2009.[1] More than 4,250 objects are exhibited over an area of 14,000 square metres.\nThe museum is located by the southeastern slope of the Acropolis hill, on the ancient road that led up to the "sacred rock" in classical times. Set only 280 meters (310 yd), away from the Parthenon, and a 400 meters (440 yd) walking distance from it, the museum is the largest modern building erected so close to the ancient site\nThe entrance fee to

In [79]:
doc3

'Eugenides Foundation\nThe Eugenides Foundation (Greek: Ίδρυμα Ευγενίδου) is a Greek private educational foundation. It was established in 1956 in Athens, Greece implementing the will of the late Greek benefactor Eugenios Eugenidis, who died in April 1954.\n\nThe activity of the foundation, in accordance with its articles of association, is to contribute to the scientific and technological education of young people in Greece. The foundation is administered by a committee of three persons, which is participated by each professor which is elected as a rector of the National Technical University of Athens (NTUA) until the end of his term as a rector. For its multifaceted contribution to Greek society, Eugenides Foundation was honored in December 1965 with the gold medal of the Academy of Athens.\nActivities\nThe activities and establishments of the Foundation include:\n\na scholarship program granting 20 scholarships annually\na scientific and technical library,\na museum of science and t

In [61]:
d1_embedding = model.encode([doc1])
d2_embedding = model.encode([doc2])
d3_embedding = model.encode([doc3])

In [65]:
d1_embedding['dense_vecs'].shape, d2_embedding['dense_vecs'].shape, d3_embedding['dense_vecs'].shape

((1, 1024), (1, 1024), (1, 1024))

In [94]:
query = "when was the goulandris museum founded" 
query_embedding = model.encode([query])

In [88]:
all_document_embeddings = np.stack([d1_embedding['dense_vecs'],d2_embedding['dense_vecs'],d3_embedding['dense_vecs']]).reshape(-1,1024)

In [95]:
similarity = query_embedding['dense_vecs'] @ all_document_embeddings.T


In [96]:
similarity

array([[0.6212836 , 0.43497568, 0.35743386]], dtype=float32)
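
The `@` product above equals cosine similarity only if the dense vectors are L2-normalized, which is worth verifying for the model in use (for example with `np.linalg.norm`). As a hedged illustration in plain Python, the helpers `cosine_similarity` and `rank_by_score` below are hypothetical names that show the general formula and how the three scores rank the documents.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_score(scores: Sequence[float]) -> List[int]:
    """Return document indices ordered from most to least similar."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-d vectors, made up for the example.
query_vec = [1.0, 0.0]
doc_vecs = [[0.8, 0.6], [0.0, 1.0]]
print([round(cosine_similarity(query_vec, d), 3) for d in doc_vecs])

# Scores from the cell above: index 0 (the Goulandris document) ranks first.
scores = [0.6212836, 0.43497568, 0.35743386]
print(rank_by_score(scores))
```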

#### The simplest approach is to store each document as a separate embedding in the database, so we start with this

- This seems like a good foundation for a first simple solution.
- The model seems able to tell the documents apart: for the Goulandris query, the Goulandris document scores 0.62 against 0.43 and 0.36 for the other two.

### Based on this idea, let's encode the documents and ingest them into our collection

# Create the collection first

In [163]:
model_name = "BAAI/bge-m3"
client.create_collection(
    collection_name="museum_collection",
    vectors_config=models.VectorParams(
        size=1024,
        distance=models.Distance.COSINE
    ),  # size and distance are model dependent
)

True

# Next we need to insert the documents

In [156]:
path = "/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data"

def get_document_content(path):
    """Yield (index, file name, text) for every file found in `path`."""
    document_sources = os.listdir(path)
    for i, d in enumerate(document_sources):
        with open(os.path.join(path, d), "r", encoding="utf-8") as df:
            yield i, d, df.read()

In [157]:
gen = iter(get_document_content(path))

In [158]:
metadata_with_docs = [
    {"id":x[0],"document": x[2], "source": x[1]} for x in gen 
]


In [159]:
metadata_with_docs

[{'id': 0,
  'document': 'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled 

In [164]:
collection_name = "museum_collection"
client.upsert(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=doc['id'],
            payload = {"text":doc['document'],"id":doc['id'],"source":doc['source']},
            vector=model.encode(doc['document'])['dense_vecs'].tolist())  # plain list of floats
        for doc in metadata_with_docs
    ]
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

#### OK, all documents for our RAG pipeline are uploaded

Let's try some queries

In [169]:
query =  "when was goulandris museum created ?"
query_embedding = model.encode(query)['dense_vecs']

In [None]:
# we can try simple querying using a top-k

In [181]:
top_k = 1
query_result = client.query_points(
    collection_name=collection_name,
    query=query_embedding,
    with_vectors=True,
    with_payload=True,
    limit=top_k,  # maximum number of results to return
    score_threshold=0.55,  # minimum score a result needs in order to be returned
)



- To keep things simple we set top_k=1, so we get back only one chunk.
- The score threshold is a good basis for returning an "I do not know" answer:
- if all returned chunks score below the threshold,
- the query simply returns an empty list, and
- we can then decide to return a string such as "please add more information / I am not sure / I do not know".
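
The fallback described above could look roughly like this. It is a sketch on plain dictionaries rather than Qdrant result objects; `FALLBACK_ANSWER`, `context_or_none` and `answer` are hypothetical names, and the threshold is applied client-side here only to make the example self-contained.

```python
from typing import Dict, List, Optional

FALLBACK_ANSWER = "I do not know."  # hypothetical wording for the fallback


def context_or_none(hits: List[Dict], score_threshold: float = 0.55) -> Optional[str]:
    """Join the texts of hits at or above the threshold; None if nothing qualifies."""
    kept = [hit["text"] for hit in hits if hit["score"] >= score_threshold]
    return "\n".join(kept) if kept else None


def answer(hits: List[Dict]) -> str:
    context = context_or_none(hits)
    if context is None:
        # Nothing relevant in the knowledge base: skip the LLM call entirely.
        return FALLBACK_ANSWER
    return f"(would call the LLM with {len(context)} characters of context)"


print(answer([{"score": 0.62, "text": "The museum was founded in 1986."}]))
print(answer([{"score": 0.31, "text": "Unrelated chunk."}]))
```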

- This is only part of our RAG pipeline.
- Having retrieved the result, we now have to feed it to an LLM
- together with the prompt (the query text) and get back the final answer.

# NOTES
- Later we can also filter the results, for example requiring that all answers share the same doc_id,
- which is a logical and necessary constraint if we ingest different documents and we know that
- the concepts/entities are separated per file.
- We can, for example, fetch the best results and then, if the top result has doc_id = 1, fetch all other chunks with the same doc_id.
- Also, when every score is below the threshold we can return an empty list and never call GPT
(in this case we give full confidence to our knowledge base and none to the LLM's general knowledge).
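
A sketch of the "fetch all chunks with the same doc_id" note, using in-memory lists in place of the vector database. The function name `expand_to_top_source` is hypothetical, and the `source` field stands in for the doc_id; against Qdrant the second step would presumably be a query filtered on that payload field.

```python
from typing import Dict, List


def expand_to_top_source(hits: List[Dict], all_chunks: List[Dict]) -> List[Dict]:
    """Return every stored chunk that shares the source of the best-scoring hit."""
    if not hits:
        return []
    top_source = max(hits, key=lambda hit: hit["score"])["source"]
    same_source = [chunk for chunk in all_chunks if chunk["source"] == top_source]
    # Keep the original reading order of the document.
    return sorted(same_source, key=lambda chunk: chunk["id"])


# Made-up data for the example.
stored = [
    {"id": 0, "source": "d3.txt", "text": "Goulandris chunk A"},
    {"id": 1, "source": "d3.txt", "text": "Goulandris chunk B"},
    {"id": 2, "source": "d1.txt", "text": "Acropolis chunk A"},
]
hits = [
    {"id": 1, "source": "d3.txt", "score": 0.62},
    {"id": 2, "source": "d1.txt", "score": 0.41},
]
expanded = expand_to_top_source(hits, stored)
print([chunk["id"] for chunk in expanded])
```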


- We are going to use GPT-4.1 for this

### Create the prompt for the LLM and the enriched context

In [204]:
def system_prompt() -> str:
    """
    Generates the system prompt for the information retrieval task

    Returns:
        str: The system prompt describing the assistant's role.
    """
    return """You are a culture assistant specialized in information about museums, cultural foundations and events. You get questions as input and need to answer with accuracy."""

In [206]:
def user_prompt(question, context) -> str:
    """
    Generates a user prompt for the information retrieval task

    Args:
        question: The question asked by the user
        context: The retrieved text the answer should be based on

    Returns:
        str: A formatted user prompt containing the question and the context.
    """
    return f"""Answer the following question using the provided context. 
If you can't find the answer, do not pretend you know it, but answer "I don't know".

### Question
{question}

### Context 
{context}
"""



### Offer context information to the LLM using our retrieved result

In [199]:
# join the text of every retrieved point into one context string
context = "\n".join([x.payload['text'] for x in query_result.points])

In [207]:
context

'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled collections of Cycladic a

An important point to take into account: we need a way to chunk or trim the extracted context when it is too large to fit into the LLM's context window.
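
One possible way to handle this, sketched under the assumption that the retrieved chunks arrive sorted by relevance: keep chunks greedily until a token budget is spent. `fit_context` is a hypothetical helper, and the whitespace counter again stands in for a real tokenizer.

```python
from typing import Callable, List


def fit_context(
    chunks: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> str:
    """Greedily keep chunks (assumed sorted by relevance) until the budget is spent."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # the next chunk would overflow the budget
        kept.append(chunk)
        used += cost
    return "\n".join(kept)


def count_words(text: str) -> int:
    # Stand-in for a real tokenizer; used only to keep the example runnable.
    return len(text.split())


trimmed = fit_context(["a b c", "d e", "f g h i"], count_words, max_tokens=5)
print(repr(trimmed))
```

Summarizing the chunks that do not fit would be an alternative, at the cost of an extra LLM call.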


In [202]:
token_count = len(enc.encode(context))
token_count


367

#### Now that we have the contextual information as well, we need to feed it to the LLM along with the original query and the prompts

In [203]:
import random
import time
from openai import RateLimitError


openai_client = OpenAI()  # dedicated name so the Qdrant `client` is not overwritten


def call_llm_with_retry(func, max_retries: int = 5, initial_delay: int = 4):
    """
    Wrapper around an LLM call that retries when the rate limit is hit
    """
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            print(f"Rate limit hit, retrying in {delay}s...")
            time.sleep(delay + random.uniform(0, 0.5))  # small jitter
            delay *= 2  # exponential backoff
    raise RuntimeError("Max retries exceeded")

In [209]:
# let's ask the LLM
query = "when was goulandris museum created ?"
context

'Goulandris Museum of Cycladic Art\nThe Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art.\n\nThe museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum\'s main building, erected in the centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4]\n\nThe museum\'s permanent collection includes over 3,000 items, and was described in The New York Times as "one of the world\'s most significant privately assembled collections of Cycladic a

In [210]:
completion = call_llm_with_retry(
    lambda: openai_client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": system_prompt()},
            {"role": "user", "content": user_prompt(query, context)},
        ],
        temperature=0.1,
        top_p=0.95,
    )
)

In [228]:
completion.choices[0].message.content

'The Goulandris Museum of Cycladic Art was founded in 1986.'

# More advanced chunking implementation 

In [10]:
document_content

'Acropolis Museum\nThe Acropolis Museum (Greek: Μουσείο Ακρόπολης, Mouseio Akropolis) is an archaeological museum focused on the findings of the archaeological site of the Acropolis of Athens. The museum was built to house every artifact found on the rock and on the surrounding slopes, from the Greek Bronze Age to Roman and Byzantine Greece. The Acropolis Museum also lies over the ruins of part of Roman and early Byzantine Athens.\nThe museum was founded in 2003 while the Organization of the Museum was established in 2008. It opened to the public on 20 June 2009.[1] More than 4,250 objects are exhibited over an area of 14,000 square metres.\nThe museum is located by the southeastern slope of the Acropolis hill, on the ancient road that led up to the "sacred rock" in classical times. Set only 280 meters (310 yd), away from the Parthenon, and a 400 meters (440 yd) walking distance from it, the museum is the largest modern building erected so close to the ancient site\nThe entrance fee to

In [12]:
token_count = len(enc.encode(document_content))


In [13]:
token_count

483

### We need to count the tokens and, if they exceed a threshold, split the text further

In [17]:

def sliding_window_on_text(
    paragraph_text: str,
    window_size: int = 100,
    slide: int = 90,
) -> List[List[str]]:
    """
    Function to create overlapping chunks of words from the text

    The function takes window_size words, then moves forward by slide
    words and takes window_size words again. With slide < window_size,
    consecutive chunks overlap by window_size - slide words, so we maintain
    some of the contextual information between sentences.

    Args:
        paragraph_text: the text given
        window_size: how many words each chunk will include
        slide: how many words we move forward between chunks
    Returns:
        total_tokens: the overlapping chunks of words produced (nested list)
    """

    # keeping slide close to window_size keeps the overlap small;
    # the text is split on whitespace so every element is one word
    paragraph_text_list = re.split(r"\s+", paragraph_text.strip())
    total_tokens = []

    i = 0
    while i < len(paragraph_text_list):
        # Get the window from i to i + window_size
        window = paragraph_text_list[i : i + window_size]
        total_tokens.append(window)

        # Slide the window and repeat
        i += slide

    return total_tokens
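
To make the window arithmetic concrete, here is a small sketch that only computes which word indices each window covers, mirroring the loop above. `window_bounds` is a hypothetical helper and the word counts are made up.

```python
from typing import List, Tuple


def window_bounds(n_words: int, window_size: int, slide: int) -> List[Tuple[int, int]]:
    """List the (start, end) word indices each window covers (end exclusive)."""
    return [
        (start, min(start + window_size, n_words))
        for start in range(0, n_words, slide)
    ]


# 250 words, window of 100, slide of 90: neighbours share 100 - 90 = 10 words.
bounds = window_bounds(n_words=250, window_size=100, slide=90)
print(bounds)  # [(0, 100), (90, 190), (180, 250)]

# Edge case: a text of 95 words yields a short tail already covered by the first window.
print(window_bounds(n_words=95, window_size=100, slide=90))  # [(0, 95), (90, 95)]
```

The edge case suggests that very short trailing chunks may be worth dropping or merging before embedding, although the function above keeps them.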

In [18]:
def split_overlap_window_recurse(
    chunk_text: str, max_tokens: int, window: int, slide: int
) -> List[List[str]]:
    """
    Function that recurses to produce chunks with token count less than a max length given

    If any chunk exceeds max_tokens, the text is split again with a smaller
    window and slide (both reduced by 10 words) until every chunk fits.
    Args:
        chunk_text: the text given to split
        max_tokens: user defined maximum number of tokens per chunk
        window: how many words each chunk can include
        slide: how many words we move forward between chunks
    Returns:
        overlapping_chunks: the final list of overlapping chunks, each a list of words
    """
    if window <= 0 or slide <= 0:
        # cannot shrink any further: return the whole text as a single chunk
        return [[chunk_text]]

    overlapping_chunks = sliding_window_on_text(chunk_text, window, slide)
    for x in overlapping_chunks:
        # count the tokens of the whole chunk, not only of its first word
        token_count = len(enc.encode(" ".join(x)))
        if token_count > max_tokens:
            print(f"token count {token_count} is large, should split into smaller")
            return split_overlap_window_recurse(
                chunk_text, max_tokens, window - 10, slide - 10
            )

    return overlapping_chunks

In [23]:
chunks = split_overlap_window_recurse(document_content,4000,100,90)


In [95]:
for x in chunks:
    print(" ".join(x))
    print(len(enc.encode(" ".join(x))))

Acropolis Museum The Acropolis Museum (Greek: Μουσείο Ακρόπολης, Mouseio Akropolis) is an archaeological museum focused on the findings of the archaeological site of the Acropolis of Athens. The museum was built to house every artifact found on the rock and on the surrounding slopes, from the Greek Bronze Age to Roman and Byzantine Greece. The Acropolis Museum also lies over the ruins of part of Roman and early Byzantine Athens. The museum was founded in 2003 while the Organization of the Museum was established in 2008. It opened to the public on 20 June 2009.[1] More than 4,250 objects are
145
public on 20 June 2009.[1] More than 4,250 objects are exhibited over an area of 14,000 square metres. The museum is located by the southeastern slope of the Acropolis hill, on the ancient road that led up to the "sacred rock" in classical times. Set only 280 meters (310 yd), away from the Parthenon, and a 400 meters (440 yd) walking distance from it, the museum is the largest modern building er

#### Now that we have a simple chunking technique, we can split each document
- and also add the source metadata to each chunk

In [148]:
path = "/home/ko_st/Documents/rag_with_fastapi/rag-pipeline-with-fastapi/data"


def get_document_content(path):
    """Yield (index, file name, overlapping chunks) for every file found in `path`."""
    document_sources = os.listdir(path)
    for i, d in enumerate(document_sources):
        with open(os.path.join(path, d), "r", encoding="utf-8") as df:
            # split the document into overlapping chunks before ingesting
            yield i, d, split_overlap_window_recurse(df.read(), 4000, 100, 90)

gen = iter(get_document_content(path))

In [149]:
# get all chunks and add them to collection
total_results= [(x[0], x[1], " ".join(sublist)) for x in gen for sublist in x[2]] 
    

In [150]:
metadata_with_docs = [
    {"id":i,"document": x[2], "source": x[1]} for i,x in enumerate(total_results)
]

In [151]:
metadata_with_docs

[{'id': 0,
  'document': "Goulandris Museum of Cycladic Art The Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art. The museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum's main building, erected in the centre of Athens in 1985, was designed by the Greek",
  'source': 'd3.txt'},
 {'id': 1,
  'document': 'centre of Athens in 1985, was designed by the Greek architect Ioannis Vikelas [el].[3] In 1991, the Museum acquired a new building, the neo-classical Stathatos Mansion at the corner of Vassilissis Sofias Avenue and Herodotou Street.[4] The museum\'s permanent collection includes over 3,000 items, and wa

In [152]:
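# NOTE: this reuses the collection name from the one-embedding-per-document experiment;
# if that collection still exists, delete it first (e.g. client.delete_collection)
client.create_collection(
    collection_name="museum_collection",
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE)
)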

True

In [153]:

collection_name = "museum_collection"
client.upsert(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=doc['id'],
            payload = {"text":doc['document'],"id":doc['id'],"source":doc['source']},
            vector=model.encode(doc['document'])['dense_vecs'].tolist())  # plain list of floats
        for doc in metadata_with_docs
    ]
)

UpdateResult(operation_id=0, status=<UpdateStatus.COMPLETED: 'completed'>)

### Let's run some queries using our new logic (we need to return several chunks)

In [186]:
query =  "tell me about the goulandris museum and about it's fees"
query_embedding = model.encode(query)['dense_vecs']


top_k = 3
query_result = client.query_points_groups(
    collection_name=collection_name,
    query=query_embedding,
    with_vectors=True,
    with_payload=True,
    limit=top_k,  # maximum number of groups (here: sources) to return
    score_threshold=0.55,  # minimum score a point needs in order to be returned
    group_by = "source"
)

In [187]:
query_result

GroupsResult(groups=[PointGroup(hits=[ScoredPoint(id=0, version=0, score=0.6238376, payload={'text': "Goulandris Museum of Cycladic Art The Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art. The museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum's main building, erected in the centre of Athens in 1985, was designed by the Greek", 'id': 0, 'source': 'd3.txt'}, vector=[-0.040013667, -0.03455027, -0.058284238, 0.014043867, 0.03426559, -0.033374347, -0.056560367, -0.0427775, 0.05392534, 0.019773845, 0.02913451, 0.05318233, -0.010293764, -0.0016061268, 0.016938712, -0.04130089, 0.04490174, 0.019878546, 0.00

In [250]:
def reducer(acc,item):
    key,val = item 
    acc[key] = acc.get(key,0) + val
    return acc

In [253]:
from collections import defaultdict
from functools import reduce



In [263]:
all_groups = [y.payload['source'] for x in query_result.groups for y in x.hits]
# if the difference is greater than a threshold we will return the top-1 group as the answer
top_group = all_groups[0]
all_group_counts = list(map((lambda x: (x, 1)), sorted(all_groups)))
# count how many hits each source contributed
count_dict = reduce(reducer, all_group_counts, {})
#res = count_dict.get(max(count_dict, key=count_dict.get))


In [298]:
def get_unique_max_key(counts):
    if not counts:
        return None

    max_value = max(counts.values())

    # Find all keys with the max value
    max_keys = [k for k, v in counts.items() if v == max_value]

    # Return the key only if it's unique (dict keys are distinct,
    # so a tie always produces more than one key)
    if len(max_keys) == 1:
        return max_keys[0]
    return None

In [300]:
res = get_unique_max_key(count_dict)

In [None]:
# in case there is a clear majority winner  or a max score for one certain group we need to
# find a threshold  

- In this case we need to return the chunks that belong to the majority source.

- So now we can return multiple chunks per query.
- We need a way to group or rerank them and build a new answer from them.
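
A compact sketch of that grouping step, written against plain dictionaries instead of Qdrant result objects. `chunks_from_majority_source` is a hypothetical helper; it uses `collections.Counter` rather than the `reduce`-based counting above and treats a tie as "no clear winner".

```python
from collections import Counter
from typing import Dict, List


def chunks_from_majority_source(hits: List[Dict]) -> List[str]:
    """Return texts from the source with the most hits, best score first.

    Returns an empty list when there are no hits or the top count is tied.
    """
    counts = Counter(hit["source"] for hit in hits).most_common(2)
    if not counts or (len(counts) == 2 and counts[0][1] == counts[1][1]):
        return []
    majority = counts[0][0]
    selected = [hit for hit in hits if hit["source"] == majority]
    selected.sort(key=lambda hit: hit["score"], reverse=True)
    return [hit["text"] for hit in selected]


# Made-up hits for the example.
sample_hits = [
    {"source": "d3.txt", "score": 0.58, "text": "Goulandris chunk B"},
    {"source": "d1.txt", "score": 0.56, "text": "Acropolis chunk A"},
    {"source": "d3.txt", "score": 0.62, "text": "Goulandris chunk A"},
]
print(chunks_from_majority_source(sample_hits))
```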

In [None]:
# > 0.6 threshold  return topk 

- For very general questions that return many chunks, usually from different sources (meaning no clear, unique museum/organization was given), we can define a policy: if the top-k results are very close in score and come from different sources, the question is probably too general and needs to be rephrased.
- For example, if the top-1 and top-2 results come from different sources and differ only slightly in score, this suggests that the question is missing important information to specify a named entity in the vector database.

- We can select 0.1 as the threshold for the score difference.
- On the other hand, if all returned chunks are from the same source, we should probably concatenate them and return them as context (if possible),
- or summarize each of them before merging with the others so we do not exceed the context length.
In [None]:
["Goulandris Museum of Cycladic Art The Nicholas P. Goulandris Foundation - Museum of Cycladic Art (Greek: Μουσείο Κυκλαδικής τέχνης) is a museum in Athens that houses a notable collection of artifacts of Cycladic art. The museum was founded in 1986 in order to house the collection of Cycladic and Ancient Greek art belonging to Nicholas and Dolly Goulandris.[1] Starting in the early 1960s, the couple collected Greek antiquities, with special interest in the prehistoric art from the Cyclades islands of the Aegean Sea.[2] The museum's main building, erected in the centre of Athens in 1985, was designed by the Greek",
 'public on 20 June 2009.[1] More than 4,250 objects are exhibited over an area of 14,000 square metres. The museum is located by the southeastern slope of the Acropolis hill, on the ancient road that led up to the "sacred rock" in classical times. Set only 280 meters (310 yd), away from the Parthenon, and a 400 meters (440 yd) walking distance from it, the museum is the largest modern building erected so close to the ancient site The entrance fee to the museum was €1 for the first year and €5 thereafter. As of 2024, the entrance fee during']