# Semestral Home Assignment 
In the semestral home assignment, you are tasked with designing and implementing a production-ready information retrieval (IR) system using Qdrant. <br>
First, you will need to implement a scalable Qdrant cluster following the principles of NoSQL (sharding, replication quorum). <br>
Then, you will implement vector search with Qdrant using the advanced features of the vector database. <br>

In [1]:
%cd ../

/Users/veranika/uni/pa195_semestral_assignment_2025


In [2]:
%load_ext autoreload
%autoreload 2

## Setup

In [3]:
import json
import os
import re
from typing import Any, cast, Callable

import numpy as np
from datasets import load_dataset
from datasets.dataset_dict import DatasetDict
from datasets.dataset_dict import Dataset
from qdrant_client import QdrantClient
from qdrant_client.models import models
from qdrant_client.http.models.models import QueryResponse
from fastembed import TextEmbedding, SparseTextEmbedding, LateInteractionTextEmbedding
from fastembed.sparse.sparse_embedding_base import SparseEmbedding
from dotenv import load_dotenv

from notebooks.utils import evaluate_retrieval

In [4]:
def build_sparse_query_text(query_text: str, filter_values: list[str]) -> str:
    # Lowercase, then replace everything except letters, digits and whitespace with a space
    base = re.sub(r"[^a-z0-9\s]", " ", query_text.lower())
    extended_parts = [base]
    # REMOVE: if filter_values: extended_parts.append(" ".join(filter_values).lower())
    tokens = base.split()
    if len(tokens) < 3:
        # Very short query: append its tokens twice, space-separated, to up-weight them
        extended_parts.append(" ".join(tokens * 2))
    return " ".join(part.strip() for part in extended_parts if part).strip()

Load environment variables. **Do not forget to create a .env file in the root directory based on the .env.example file**.

In [5]:
load_dotenv("./.env")

True

Start a local instance of Qdrant through Docker.

In [7]:
!docker run -p 6335:6333 -p 6336:6334 -d --name qdrant-server qdrant/qdrant:v1.16

docker: Error response from daemon: Conflict. The container name "/qdrant-server" is already in use by container "3b09836c69f73ba7671ae9cd1dff6e67c07dc6de57b481235d01b2411e4c1144". You have to remove (or rename) that container to be able to reuse that name.

Run 'docker run --help' for more information


Initialize the Qdrant client by connecting to the server running as a Docker container.

In [8]:
client = QdrantClient(host=os.environ["QDRANT_HOST"], port=int(os.environ["QDRANT_PORT"]))

## Dataset

### Task 1 - Data Loading
Load the data from the Hugging Face dataset [Zovi3/pa195_semestral_assignment](https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/upload/main), explore it and extract/preprocess it if necessary.

In [9]:
# TODO: Import query dataset from https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/tree/main
# Load queries from the query folder  
query_dataset: Dataset = load_dataset(
    "Zovi3/pa195_semestral_assignment", 
    data_files="query-all-MiniLM-L6-v2-100-filters-embedded-results/train.jsonl",
    split="train"
    )

In [10]:
# TODO: Import documents dataset from https://huggingface.co/datasets/Zovi3/pa195_semestral_assignment/tree/main
# Load documents from the corpus folder
documents: Dataset = load_dataset(
    "Zovi3/pa195_semestral_assignment",
    data_files="corpus-all-MiniLM-L6-v2-50K-groups-multi-vector/train.jsonl",
    split="train"
)

In [11]:
query_dataset

Dataset({
    features: ['text', 'id', 'filters', 'embedding', 'multi_vector_embedding', 'result'],
    num_rows: 100
})

In [12]:
documents


Dataset({
    features: ['text', 'id', 'embedding', 'groups', 'multi_vector_embedding'],
    num_rows: 50000
})

In [11]:
# # Preprocess multi-vector embeddings: remove zero-norm vectors
# import numpy as np

# def filter_multi_vectors(example):
#     filtered = [v for v in example['multi_vector_embedding'] if np.linalg.norm(v) > 1e-6]
#     example['multi_vector_embedding'] = filtered
#     return example
# preprocessed_documents = documents.map(filter_multi_vectors)

# print('Original multi vectors count:', len(documents[0]['multi_vector_embedding']))
# print('Filtered multi vectors count:', len(preprocessed_documents[0]['multi_vector_embedding']))


## Models Setup

### Embedding Model

Within the homework you will work with `sentence-transformers/all-MiniLM-L6-v2` from the fastembed library. <br>
These embeddings are precomputed for you in the assignment dataset, but you will need to use the model when running the queries.

In [None]:
## Document embeddings are precomputed, but the model is still needed to embed the query text at query time
embedding_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2")
embedding_model_size = 384

### Sparse Retrieval Model
Some queries require prioritizing certain keywords. <br>
Therefore, you will need to use the BM25 algorithm to boost documents containing these keywords during retrieval. <br>
Note that BM25 is not taken into account in the dataset, so you will need to apply it when uploading and indexing the data.
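
To make the keyword boosting concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is an illustration only: the whitespace tokenizer and the defaults `k1=1.2` and `b=0.75` are assumptions, and the `Qdrant/bm25` model together with Qdrant's IDF modifier may tokenize, stem and split the computation differently.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "qdrant is a vector database",
    "bm25 ranks documents by keyword overlap",
    "vector search finds similar documents",
]
scores = bm25_scores("keyword documents", docs)
print(scores)
```

The second document matches both query terms, including the rare term `keyword`, so it scores highest; the first matches none and scores zero.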

In [14]:
bm25_model = SparseTextEmbedding("Qdrant/bm25")

### Multi-Vector Model
It is generally good practice to include a reranking model in an IR system. <br>
Reranking uses a stronger model to select the most relevant documents from the initial retrieval. <br>
You will implement reranking with the multi-vector late-interaction embedding model ColBERT.

In [18]:
## Embeddings are precomputed so you can save some memory by not loading the model
# multi_vector_model = LateInteractionTextEmbedding("colbert-ir/colbertv2.0")
multi_vector_model_size = 128

## Database Configuration

### Task 2 - Data Modelling
In this task you will create a proper data model for your data, including vector representations, index configuration, distance functions and more.

#### Task 2.1 - HNSW Index Configuration
Configure the HNSW index for the retrieval. <br>
**Change the `ef_construct` parameter to 64 to speed up the build at the cost of recall.** <br>
We do this for practical reasons, so that you can iterate over the notebook faster.

In [19]:
# Change ef_construct parameter to 64 to speed the build time at the cost of the recall
ef_construct = 64
hnsw_config = models.HnswConfigDiff(ef_construct=ef_construct)

#### Task 2.2 - Collection Creation
Create a data model for your data. You should create three vector representations for your data. <br>
There should be one representation for each model defined above. <br>
For the multi-vector model, make sure to disable the vector index, since it will be used only for reranking. <br>
Also, do not forget that multi-vector similarity is not computed with cosine similarity alone (check the lecture for more info). <br>
Configure the proper modifier for the sparse vector.
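
As a reminder of why multi-vector similarity differs from a single cosine comparison, the sketch below computes a MaxSim score in plain Python: each query token vector is matched with its most similar document token vector, and the per-token maxima are summed. The helper names and the toy 2-dimensional vectors are illustrative assumptions, not part of the assignment code.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two vectors; zero-norm vectors score 0.0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_sim(query_vectors: list[list[float]], doc_vectors: list[list[float]]) -> float:
    """Sum, over query tokens, of the best-matching document token similarity."""
    return sum(max(cosine(q, d) for d in doc_vectors) for q in query_vectors)


# Toy token embeddings (hypothetical values).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first query token

score_a = max_sim(query_tokens, doc_a)
score_b = max_sim(query_tokens, doc_b)
print(score_a, score_b)
```

Here `doc_a` outranks `doc_b` because every query token finds a close match in it, which is the behavior a MaxSim comparator is meant to capture.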

In [20]:
COLLECTION_NAME = "ms_macro"

In [21]:
try:
    client.delete_collection(COLLECTION_NAME)
    print(f"Deleted existing collection: {COLLECTION_NAME}")
except Exception:
    print(f"Collection {COLLECTION_NAME} does not exist")


# Configure collection creation  
collection_created = client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config={
        "dense": models.VectorParams(
            size=embedding_model_size,
            distance=models.Distance.COSINE,
            hnsw_config=hnsw_config,
        ),
        "multi_vector": models.VectorParams(
            size=multi_vector_model_size,
            distance=models.Distance.COSINE,
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM,
            ),
            hnsw_config=models.HnswConfigDiff(m=0),  # m=0 disables HNSW graph construction for this reranking-only vector
        ),
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    },
    on_disk_payload=True
)

if collection_created:
    print(f"Created collection \"{COLLECTION_NAME}\".")
else:
    print("Collection creation failed")

Deleted existing collection: ms_macro
Created collection "ms_macro".


#### Task 2.3 - Create Payload Index & Disable Quantization
Configure a keyword payload index for the `groups` field. Make sure that the payload index is stored on disk.

In [22]:
# Create an on-disk keyword payload index for the 'groups' field
payload_index_created = client.create_payload_index(
    collection_name=COLLECTION_NAME,
    field_name="groups",
    field_schema=models.KeywordIndexParams(
        type=models.KeywordIndexType.KEYWORD,
        on_disk=True,  # keep the payload index on disk, as the task requires
    ),
)

# Disable quantization
client.update_collection(
    collection_name=COLLECTION_NAME,
    quantization_config=models.Disabled.DISABLED,
)

if payload_index_created:
    print("Payload index created for field 'groups'")

Payload index created for field 'groups'


In [23]:
collection_info = client.get_collection(COLLECTION_NAME)
print(f"Payload indices: {list(collection_info.payload_schema.keys())}")

Payload indices: ['groups']


### Task 3 - Data Upload
Upload the vector embeddings and metadata to the created collection; make sure to upload the vectors' metadata as well.

In [24]:
points: list[models.PointStruct] = []

print("Preparing points with normalized dense vectors...")
# Iterate over the documents dataset
for doc in documents:  # type: ignore    
    # Generate Sparse Vector (BM25) on the fly
    sparse_embedding: SparseEmbedding = next(iter(bm25_model.embed([doc["text"]])))
    
    point = models.PointStruct(
        id=doc["id"],
        vector={
            "dense": doc["embedding"], # Qdrant will normalize this automatically
            "sparse": models.SparseVector(
                indices=sparse_embedding.indices.tolist(),
                values=sparse_embedding.values.tolist(),
            ),
            "multi_vector": doc["multi_vector_embedding"], # Already fits our config
        },
        payload={
            "text": doc["text"],
            "groups": doc["groups"],
        },
    )
    points.append(point)

print(f"Upserting {len(points)} documents...")
# Upload in batches to avoid network timeouts
client.upload_points(collection_name=COLLECTION_NAME, points=points, batch_size=128)

print(f"Collection info: {client.get_collection(COLLECTION_NAME).points_count} points in collection")
assert client.get_collection(COLLECTION_NAME).points_count == len(documents), \
    f"Expected {len(documents)} points in collection, got {client.get_collection(COLLECTION_NAME).points_count}"

Preparing points with normalized dense vectors...
Upserting 50000 documents...
Collection info: 50000 points in collection


## Querying

### Task 4 - Design Complex Query
Your task is to design a complex query that will include hybrid search, filtering, reranking and metadata boosting. <br>
**The result of this task should be one Qdrant query (do not add any postprocessing logic outside of the Qdrant query)!**
 
**Subtasks:**
1. Define a query filter on the `groups` field; do not forget that the query can contain filter values.
    - Think about in which prefetch you should apply the filter.
2. Define the sparse and dense search prefetches; the retrieval limit should be 100 objects.
3. Define the fusion of the two rankings with Reciprocal Rank Fusion (RRF).
4. Rerank the results with the ColBERT multi-vector model; use 50 documents for reranking.
5. Boost the results with metadata weighting; use `group_1` with weight 0.05 and `group_2` with weight 0.1.
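
To illustrate subtask 3, here is a minimal plain-Python sketch of Reciprocal Rank Fusion: each document receives `1 / (k + rank)` from every ranking in which it appears, and the contributions are summed. The 1-based ranks, the constant `k=60` and the document ids are assumptions for illustration; the rank offset Qdrant uses internally may differ.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked id lists into one list sorted by descending RRF score."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document ranked high in any list contributes a larger reciprocal.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["d1", "d2", "d3"]   # hypothetical dense search result ids
sparse_ranking = ["d3", "d1", "d4"]  # hypothetical sparse search result ids

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Documents found by both retrievers (`d1`, `d3`) end up ahead of documents found by only one, regardless of how the raw dense and sparse scores are scaled.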


In [25]:
def rag_context_retrieval(query: dict[str, Any]) -> QueryResponse:
    # -----------------------------------------
    # Query text and embeddings
    # -----------------------------------------
    query_text = query["text"]
    query_dense_embedding = list(embedding_model.embed([query_text]))[0]
    filter_values: list[str] = query.get("filters", [])

    sparse_text = build_sparse_query_text(query_text, filter_values)
    sparse_emb_obj = next(bm25_model.embed([sparse_text]))
    query_sparse_embedding = models.SparseVector(
        indices=sparse_emb_obj.indices.tolist(),
        values=sparse_emb_obj.values.tolist(),
    )

    # -----------------------------------------
    # Task 4.1 — Filter (EARLY, AND semantics)
    # -----------------------------------------
    filter_condition: models.Filter | None = None
    if filter_values:
        filter_condition = models.Filter(
            must=[
                models.FieldCondition(
                    key="groups",
                    match=models.MatchValue(value=fv),
                )
                for fv in filter_values
            ]
        )

    # -----------------------------------------
    # Task 4.2 — Sparse + Dense search (limit=100)
    # -----------------------------------------
    sparse_limit = 100
    dense_limit = 100

    prefetch_sparse_and_dense_search: list[models.Prefetch] = [
        models.Prefetch(
            query=query_dense_embedding,
            using="dense",
            limit=dense_limit,
            filter=filter_condition,
        ),
        models.Prefetch(
            query=query_sparse_embedding,
            using="sparse",
            limit=sparse_limit,
            filter=filter_condition,
        ),
    ]

    # -----------------------------------------
    # Task 4.3 — RRF fusion (k = 60)
    # -----------------------------------------
    rrf_k = 60

    prefetch_fused_rankings: list[models.Prefetch] = [
        models.Prefetch(
            prefetch=prefetch_sparse_and_dense_search,
            query=models.RrfQuery(
                rrf=models.Rrf(k=rrf_k)
            ),
            limit=10,  # directly match final precision-oriented limit
        )
    ]

    # -----------------------------------------
    # Task 4.4 — Metadata boosting (kept consistent)
    # -----------------------------------------
    boost_terms = []
    for fv in filter_values:
        if fv == "group_1":
            boost_terms.append(
                models.MultExpression(
                    mult=[
                        0.05,
                        models.FieldCondition(
                            key="groups",
                            match=models.MatchAny(any=["group_1"]),
                        ),
                    ]
                )
            )
        elif fv == "group_2":
            boost_terms.append(
                models.MultExpression(
                    mult=[
                        0.1,
                        models.FieldCondition(
                            key="groups",
                            match=models.MatchAny(any=["group_2"]),
                        ),
                    ]
                )
            )

    final_query = models.FormulaQuery(
        formula=models.SumExpression(
            sum=["$score"] + boost_terms
        )
    )

    # -----------------------------------------
    # Final query
    # -----------------------------------------
    final_query_limit = 10

    final_result: QueryResponse = client.query_points(
        collection_name=COLLECTION_NAME,
        prefetch=prefetch_fused_rankings,
        query=final_query,
        limit=final_query_limit,
        with_payload=True,
    )

    return final_result
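
Subtask 4 asks for a ColBERT reranking stage, which the function above does not include: it fuses with RRF at `limit=10` and then applies the boosting formula directly. One possible way to nest all five stages is sketched below as the plain-dict shape of a Query API request body, so it can be inspected without a running server. The helper name `build_query_body`, the stage limits other than those named in the task, and the unconditional boosts are assumptions; the ColBERT query vectors are assumed to come from the precomputed `multi_vector_embedding` field of the query dataset.

```python
from typing import Any


def build_query_body(
    dense: list[float],
    sparse_indices: list[int],
    sparse_values: list[float],
    colbert: list[list[float]],
    filter_values: list[str],
    final_limit: int = 10,
) -> dict[str, Any]:
    """Build a nested hybrid-search request body (hypothetical helper)."""

    def stage(**fields: Any) -> dict[str, Any]:
        # Drop unset fields so that an absent filter is simply omitted.
        return {key: value for key, value in fields.items() if value is not None}

    # Subtask 1: AND semantics, every filter value must be present in `groups`.
    group_filter = (
        {"must": [{"key": "groups", "match": {"value": v}} for v in filter_values]}
        if filter_values
        else None
    )
    # Subtask 2: dense and sparse retrieval, 100 candidates each, filtered early.
    retrieval = [
        stage(query=dense, using="dense", limit=100, filter=group_filter),
        stage(
            query={"indices": sparse_indices, "values": sparse_values},
            using="sparse",
            limit=100,
            filter=group_filter,
        ),
    ]
    # Subtask 3: RRF fusion, keeping 50 candidates for the reranker.
    fused = stage(prefetch=retrieval, query={"fusion": "rrf"}, limit=50)
    # Subtask 4: rerank the 50 fused candidates with the multi-vector query.
    reranked = stage(prefetch=fused, query=colbert, using="multi_vector", limit=final_limit)
    # Subtask 5: add a fixed bonus when a document belongs to a boosted group.
    boost = {
        "formula": {
            "sum": [
                "$score",
                {"mult": [0.05, {"key": "groups", "match": {"any": ["group_1"]}}]},
                {"mult": [0.1, {"key": "groups", "match": {"any": ["group_2"]}}]},
            ]
        }
    }
    return stage(prefetch=reranked, query=boost, limit=final_limit, with_payload=True)


body = build_query_body([0.1] * 4, [1, 5], [0.5, 0.7], [[0.1] * 3, [0.2] * 3], ["group_1"])
print(body["prefetch"]["using"], body["prefetch"]["prefetch"]["limit"])
```

Whether this ordering improves on the reported precision would need to be measured with `evaluate_retrieval`; the sketch only shows how the stages can be nested in a single query.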


In [26]:
avg_retrieval_precision = evaluate_retrieval(rag_context_retrieval, query_dataset)

You achieved 0.9869999999999999 enough to pass ✅!
