| Model            | Hit Rate              |         MRR          |
|------------------|-----------------------|-----------------------|
| 384_docs_4o_mini | 0.9128593040847202    | 0.7873424104891578    |
| 384_docs_llama   | 0.9131618759455371    | 0.787418053454362     |
| 768_docs_4o_mini | 0.9231467473524962    | 0.7947503782148263    |
| 768_docs_llama   | 0.9228441754916793    | 0.7948562783661121    |


sentence-transformers/multi-qa-MiniLM-L6-cos-v1\
sentence-transformers/multi-qa-distilbert-cos-v

In [1]:
import json
import pandas as pd

In [2]:
with open("../docs_with_q_4o-mini.json", "rt") as f_in:
    ds_gpt = json.load(f_in)

In [3]:
len(ds_gpt)

661

In [4]:
ds_gpt[0]

{'source': 'https://www.reddit.com/r/germany/wiki/autobahn_safety',
 'content': 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autobahn#/media/File%3AAutobahnen_in_Deutschland.svg) with a total length of more than 8,000 miles. [65%](https://en.wikipedia.org/wiki/Autobahn#Speed_limits) of the Autobahn has no speed limit. How safe can that be?\nVehicles traveled 147 billion miles on the Autobahn in 2015. 322 people died = 2.19 deaths per billion miles.\nIn the US, vehicles travelled 757 billion miles on interstate highways. 3,837 people died = 5.07 deaths per billion miles.\nThat means: If you drive on the interstate, your likelihood to die is 131% higher than for the same distance on the Autobahn.\n*sources:*\nStatistisches Bundesamt: [Unfallentwicklung auf deutschen Straßen 2015](https://www.destatis.de/DE/PresseService/Presse/Pressekonferenzen/2016/Unfallentwicklung_2015/Pressebroschuere_unfallentwicklung.pdf?__blob=publicationFile)\nNat

In [5]:
from qdrant_client import QdrantClient, models

In [6]:
client = QdrantClient("http://localhost:6333")

In [7]:
client.create_collection(
    collection_name="faq-sparse-and-dense",
    vectors_config={
        # Named dense vector for jinaai/jina-embeddings-v2-small-en
        "jina-small": models.VectorParams(
            size=512,
            distance=models.Distance.COSINE,
        ),
    },
    sparse_vectors_config={
        "bm25": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    }
)


True

In [8]:
ds_gpt[395]

{'source': 'https://www.reddit.com/r/germany/wiki/health_insurance',
 'headline': 'Private vs public insurance',
 'content': 'The ever-rising premiums are the main reason why private health insurance is such a terrible deal - it looks cheap at the beginning, but when you’re approaching and entering retirement, the premiums will be sky-high (and won’t be reduced as your income drops after retirement). Additionally, children and non-working spouses must be insured individually under private insurance, whereas they are covered for free in the public system. Therefore, if you’re seriously considering moving to Germany, make sure you exhaust every avenue available to enter public insurance before settling for private insurance.',
 'length': 2,
 'id': '576ef606-0614-4c38-ad5a-1c4d1da0a753',
 'question': 'What are the drawbacks of private health insurance compared to public insurance?'}

In [9]:
from tqdm.auto import tqdm

In [10]:
client.upsert(
    collection_name="faq-sparse-and-dense",
    points=[
        models.PointStruct(
            id=ds_gpt[i]["id"],
            vector={
                "jina-small": models.Document(
                    text=ds_gpt[i]["question"] + ' ' + ds_gpt[i]["content"],
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                "bm25": models.Document(
                    text=ds_gpt[i]["question"] + ' ' + ds_gpt[i]["content"],
                    model="Qdrant/bm25",
                ),
            },
            payload={
                "question": ds_gpt[i]["question"],
                "content": ds_gpt[i]["content"],
                "source": ds_gpt[i]["source"],
                "headline": ds_gpt[i]["headline"],
                "length": ds_gpt[i]["length"],
                "id": ds_gpt[i]["id"],
            }
        )
        for i in tqdm(range(len(ds_gpt)))
    ]
)

  0%|          | 0/661 [00:00<?, ?it/s]

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

https://github.com/DataTalksClub/llm-zoomcamp/blob/main/02-vector-search/hybrid_search.ipynb

In [11]:
qs = pd.read_csv("./gp4o-mini-questions.csv", sep=",")

In [12]:
qs.head(n=10)

Unnamed: 0,question,headline,content
0,What is the total length of the Autobahn netwo...,How safe is the Autobahn?,9d8370cf-a2c8-4c54-9f9c-476b9c09a933
1,How many deaths occurred on the Autobahn per b...,How safe is the Autobahn?,9d8370cf-a2c8-4c54-9f9c-476b9c09a933
2,What were the safety statistics for US interst...,How safe is the Autobahn?,9d8370cf-a2c8-4c54-9f9c-476b9c09a933
3,How many miles did vehicles travel on US inter...,How safe is the Autobahn?,9d8370cf-a2c8-4c54-9f9c-476b9c09a933
4,What percentage of the Autobahn has no speed l...,How safe is the Autobahn?,9d8370cf-a2c8-4c54-9f9c-476b9c09a933
5,What are the primary methods for obtaining Ger...,German citizenship by descent,6ef3b8e4-f20b-4893-bf9e-f58f800afb82
6,How can someone trace their German citizenship...,German citizenship by descent,6ef3b8e4-f20b-4893-bf9e-f58f800afb82
7,What circumstances might lead to the loss of G...,German citizenship by descent,6ef3b8e4-f20b-4893-bf9e-f58f800afb82
8,Are there any provisions for restoring German ...,German citizenship by descent,6ef3b8e4-f20b-4893-bf9e-f58f800afb82
9,Where can I seek help or additional informatio...,German citizenship by descent,6ef3b8e4-f20b-4893-bf9e-f58f800afb82


In [13]:
q_list = qs["question"].to_list()
id_list = qs["content"].to_list()

### Hybrid Search with Reciprocal Rank Fusion (RRF)

In [14]:
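def rrf_search(query: str, limit: int = 1) -> list[models.ScoredPoint]:
    """Run a dense and a sparse search and fuse both result lists with RRF."""
    # Note: `limit` only scales the prefetch size. No `limit` is passed to the
    # fused query itself, so the client's default result count applies.
    results = client.query_points(
        collection_name="faq-sparse-and-dense",
        prefetch=[
            # Dense candidates from the named vector "jina-small"
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="jinaai/jina-embeddings-v2-small-en",
                ),
                using="jina-small",
                limit=(5 * limit),
            ),
            # Sparse candidates from the BM25 vector
            models.Prefetch(
                query=models.Document(
                    text=query,
                    model="Qdrant/bm25",
                ),
                using="bm25",
                limit=(5 * limit),
            ),
        ],
        # Fusion query enables fusion on the prefetched results
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        with_payload=True,
    )

    return results.points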
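
To make the fusion step less of a black box, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not Qdrant's implementation: the function name `reciprocal_rank_fusion`, the constant `k=60` (a commonly used default), and the toy rankings are assumptions for illustration only. Each document scores `1 / (k + rank)` in every list it appears in, and the scores are summed across lists.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list (illustrative sketch).

    rankings: list of lists, each ordered from best to worst.
    k: smoothing constant; 60 is a common choice, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more for appearing near the top of a list
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result ids from a dense and a sparse search
dense_hits = ["a", "b", "c"]
sparse_hits = ["b", "d", "a"]
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Documents that appear in both lists ("a" and "b" here) rise above documents found by only one retriever, which is the behaviour hybrid search relies on.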

In [15]:
result = rrf_search(
    query="What happens to the citizenship of children born to a German father and a foreign mother if they are not married?")

In [16]:
print(result[0].payload["content"])

A German woman lost her German citizenship by marrying a foreigner during this time. If they married before your next ancestor was born, and that next ancestor was born
... before 1914: -> [outcome 7](https://www.reddit.com/r/germany/wiki/citizenship#wiki_outcome_7)
... between 1914 and 23 May 1949 -> [outcome 5](https://www.reddit.com/r/germany/wiki/citizenship#wiki_outcome_5)
... after 23 May 1949 -> [outcome 3](https://www.reddit.com/r/germany/wiki/citizenship#wiki_outcome_3)
The rules for children who were born before 24 May 1949:
- both parents were German citizens: The child was born as a German citizen, continue with the next person in line
- German father and foreign mother in wedlock: The child was born as a German citizen, continue with the next person in line
- German father and foreign mother out of wedlock: -> [outcome 5](https://www.reddit.com/r/germany/wiki/citizenship#wiki_outcome_5)
- German mother and foreign father in wedlock: -> [outcome 5](https://www.reddit.com/r/

In [17]:
from haystack import Document, Pipeline
from haystack.components.evaluators import DocumentMRREvaluator, DocumentRecallEvaluator

In [18]:
eval_pipeline = Pipeline()
eval_pipeline.add_component("doc_mrr_evaluator", DocumentMRREvaluator())
eval_pipeline.add_component("doc_rec_evaluator", DocumentRecallEvaluator())

In [19]:
result = rrf_search(
    query="What are the safety statistics comparing the Autobahn to US interstate highways?")

In [20]:
result[0].payload["content"]

'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autobahn#/media/File%3AAutobahnen_in_Deutschland.svg) with a total length of more than 8,000 miles. [65%](https://en.wikipedia.org/wiki/Autobahn#Speed_limits) of the Autobahn has no speed limit. How safe can that be?\nVehicles traveled 147 billion miles on the Autobahn in 2015. 322 people died = 2.19 deaths per billion miles.\nIn the US, vehicles travelled 757 billion miles on interstate highways. 3,837 people died = 5.07 deaths per billion miles.\nThat means: If you drive on the interstate, your likelihood to die is 131% higher than for the same distance on the Autobahn.\n*sources:*\nStatistisches Bundesamt: [Unfallentwicklung auf deutschen Straßen 2015](https://www.destatis.de/DE/PresseService/Presse/Pressekonferenzen/2016/Unfallentwicklung_2015/Pressebroschuere_unfallentwicklung.pdf?__blob=publicationFile)\nNational Highway Traffic Safety Administration: [Fatal Crashes by STATE and Road Fu

In [21]:
import copy

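# Repeat each document 5 times so the ground truth lines up with q_list.
# This assumes 5 generated questions per document, in document order.
grund_truth = [copy.deepcopy(item) for item in ds_gpt for _ in range(5)]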

In [22]:
len(q_list), len(ds_gpt), len(grund_truth)

(3305, 661, 3305)
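
The matching lengths only show that the counts agree. The alignment itself still depends on every document having exactly five questions stored in the same order as `ds_gpt`. A possibly more robust alternative, sketched here on toy data, is to look up each question's document by the id stored in the questions CSV (`id_list` in this notebook). The helper name `build_ground_truth` and the toy records are illustrative assumptions, not part of the original notebook.

```python
def build_ground_truth(docs, doc_ids):
    """Return one document per question, looked up by document id."""
    docs_by_id = {doc["id"]: doc for doc in docs}
    missing = [doc_id for doc_id in doc_ids if doc_id not in docs_by_id]
    if missing:
        # Fail loudly instead of silently misaligning questions and documents
        raise KeyError(f"{len(missing)} question(s) reference unknown document ids")
    return [docs_by_id[doc_id] for doc_id in doc_ids]


# Toy stand-ins for ds_gpt and id_list
toy_docs = [
    {"id": "a", "content": "Autobahn"},
    {"id": "b", "content": "Citizenship"},
]
toy_ids = ["a", "a", "b"]

aligned = build_ground_truth(toy_docs, toy_ids)
print([doc["content"] for doc in aligned])
```

With this approach the number of questions per document no longer matters, and a stray id is reported immediately.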

In [23]:
from haystack import Document

In [24]:
grund_truth_documents = []
for doc in grund_truth:
    # One relevant document per question
    grund_truth_documents.append([Document(content=doc["content"])])

In [25]:
grund_truth_documents[:6]

[[Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document

In [26]:
retrieved_documents = []
for question in tqdm(q_list):
    result = rrf_search(query=question)
    # Keep the top 5 fused hits for each question
    retrieved_documents.append(
        [Document(content=point.payload["content"]) for point in result[:5]]
    )

  0%|          | 0/3305 [00:00<?, ?it/s]

In [27]:
retrieved_documents[0]

[Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...'),
 Document(id=e1ff4a2ca7962cc5f90ee550dd56440832cd39697c9e4cb3545938c04980776d, content: '[Amanda:](https://web.archive.org/web/20160316041117/http://www.amiexpat.com/2009/08/20/more-real-ex...'),
 Document(id=78091b22119fd20c676a31a782300900f6b79839e87052fd2d0e17550973ff9e, content: 'While hitchhiking isn't that common any more, it should still be possible to do it. Hitchhike from A...'),
 Document(id=e2d6a731a164d29a671d54fbac5d7e71fd2c9bcaacd9eb0a36d67a077be56e50, content: 'The main ways of getting fixed (home) internet in Germany are:
 * ADSL/VSDL
 * Cable
 * Fibre
 The optio...'),
 Document(id=290e7940e47da8016a01597ea448473f6ffe668b61f8a68ccec4099bb0934b22, content: 'Also note that the law refers to the length of your stay in Germany, not the length of your stay at ...')]

In [28]:
grund_truth_documents[:15]

[[Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document(id=fc12cd232ca4a0d8446820546b3c02e2ce6b4a8b3a8d271f66ae599b7416188c, content: 'The Autobahn is a [network of interstate highways in Germany](https://en.m.wikipedia.org/wiki/Autoba...')],
 [Document

In [29]:
results = eval_pipeline.run(
    {
        "doc_mrr_evaluator": {
            "ground_truth_documents": grund_truth_documents,
            "retrieved_documents": retrieved_documents,
        },
        "doc_rec_evaluator": {
            "ground_truth_documents": grund_truth_documents,
            "retrieved_documents": retrieved_documents,
        },

    }
)

| Model            | Hit Rate              |         MRR          |
|------------------|-----------------------|-----------------------|
| 384_docs_4o_mini | 0.9128593040847202    | 0.7873424104891578    |
| 384_docs_llama   | 0.9131618759455371    | 0.787418053454362     |
| 768_docs_4o_mini | 0.9231467473524962    | 0.7947503782148263    |
| 768_docs_llama   | 0.9228441754916793    | 0.7948562783661121    |

Hybrid search: jinaai/jina-embeddings-v2-small-en (dense) + BM25 (sparse), fused with RRF. The scores below are (recall, MRR).

In [30]:
results["doc_rec_evaluator"]["score"], results["doc_mrr_evaluator"]["score"]

(0.9521936459909228, 0.8366717095310136)
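
As a sanity check on what these two numbers mean, the metrics can be recomputed by hand. The sketch below is not Haystack's implementation. It assumes exactly one relevant document per question and matches documents by exact content string. The function name `hit_rate_and_mrr` and the toy data are hypothetical. With a single relevant document per question, recall and hit rate coincide.

```python
def hit_rate_and_mrr(ground_truth, retrieved):
    """Compute hit rate and mean reciprocal rank (illustrative sketch).

    ground_truth: list of relevant content strings, one per question.
    retrieved: list of ranked lists of content strings, one per question.
    """
    if len(ground_truth) != len(retrieved):
        raise ValueError("ground_truth and retrieved must have the same length")
    hits = 0
    reciprocal_rank_sum = 0.0
    for truth, candidates in zip(ground_truth, retrieved):
        # 1-based rank of the relevant document, or None if it was not retrieved
        rank = next(
            (i for i, content in enumerate(candidates, start=1) if content == truth),
            None,
        )
        if rank is not None:
            hits += 1
            reciprocal_rank_sum += 1.0 / rank
    n = len(ground_truth)
    return hits / n, reciprocal_rank_sum / n


# Toy example: found at rank 1, found at rank 2, not found
toy_truth = ["a", "b", "c"]
toy_retrieved = [["a", "x"], ["x", "b"], ["x", "y"]]
hit_rate, mrr = hit_rate_and_mrr(toy_truth, toy_retrieved)
print(hit_rate, mrr)
```

Hit rate only asks whether the relevant document appears among the retrieved ones, while MRR also rewards placing it near the top, which is why MRR is the lower of the two scores.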