### Homework: Search Evaluation
---
In this homework, we will evaluate the results of vector search.
> It's possible that your answers won't match exactly. If that's the case, select the closest one.

#### Required libraries
---
We will use minsearch and Qdrant. Make sure you have the most up-to-date versions:

`pip install -U minsearch qdrant_client`

minsearch should be at least 0.0.4.

#### Evaluation data
---
For this homework, we will use the same dataset we generated in the videos.
Let's load it:

In [1]:
# Install required libraries
!pip install -U minsearch qdrant_client rouge scikit-learn tqdm pandas requests


In [2]:
# Import libraries and load evaluation data
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
# Creates the full URL for the JSON file containing the documents to be used for evaluation
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
# Downloads the JSON file from the URL and loads it into a Python list of dictionaries
documents = requests.get(docs_url).json()
# Creates the full URL for the CSV file containing ground truth data for evaluation
ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
# Downloads the CSV file and loads it into a pandas DataFrame
df_ground_truth = pd.read_csv(ground_truth_url)
# Converts the DataFrame to a list of dictionaries for easier access
# Each dictionary represents a row/record in the DataFrame, with column names as keys
ground_truth = df_ground_truth.to_dict(orient='records')

Here, `documents` contains the documents from the FAQ database with unique IDs, and `ground_truth` contains generated question-answer pairs.

Also, we will need the code for evaluating retrieval:

In [3]:
# Evaluation metrics: hit_rate, mrr, and evaluate function
from tqdm.auto import tqdm

# Calculates the fraction of queries for which at least one relevant document was retrieved
def hit_rate(relevance_total):
    cnt = 0
    # For each query/line, check if any result is True (i.e., a relevant document was found)
    for line in relevance_total:
        if True in line:
            cnt = cnt + 1
    return cnt / len(relevance_total)

# Calculates the Mean Reciprocal Rank (MRR): the average of 1 / rank of the first relevant document per query
def mrr(relevance_total):
    total_score = 0.0
    # For each query, find the rank/position of the first relevant document (where line[rank] is True)
    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score = total_score + 1 / (rank + 1)
                break  # Only the first relevant result counts towards the reciprocal rank
    return total_score / len(relevance_total)

# Evaluates the search function against the ground truth data
def evaluate(ground_truth, search_function):
    relevance_total = []
    # For each query in ground_truth, get the document ID and retrieve search results
    for q in tqdm(ground_truth):
        doc_id = q['document']
        # Call the search function with the query text
        results = search_function(q)
        # Check if the result’s ID matches the ground truth document ID
        relevance = [d['id'] == doc_id for d in results]
        # Append the relevance list to the total relevance list
        relevance_total.append(relevance)
    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
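
To make the two metrics concrete, here is a small worked example on hand-made relevance lists. The three queries and their relevance flags are invented for illustration, and `hit_rate_compact` and `mrr_compact` are hypothetical compact variants of the functions above, named differently so they don't shadow them.

```python
# Toy relevance lists (made up for illustration): one list per query,
# one boolean per returned result, in rank order.
toy_relevance = [
    [False, True, False],   # relevant document at rank 2
    [False, False, False],  # relevant document not retrieved
    [True, False, False],   # relevant document at rank 1
]

def hit_rate_compact(relevance_total):
    # Fraction of queries with at least one relevant result
    return sum(any(line) for line in relevance_total) / len(relevance_total)

def mrr_compact(relevance_total):
    # Mean of 1 / rank of the first relevant result (0 if none was retrieved)
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)

toy_hit_rate = hit_rate_compact(toy_relevance)  # 2 of 3 queries have a hit
toy_mrr = mrr_compact(toy_relevance)            # (1/2 + 0 + 1) / 3
print(toy_hit_rate, toy_mrr)
```

Hit rate only asks whether the relevant document appears at all, while MRR also rewards placing it near the top.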



#### Q1. Minsearch text
---
Now let's evaluate our usual minsearch approach (*i.e. perform text search with boosting parameters for 'question' and 'section' fields*), but tweak the parameters. Let's use the following boosting params:
```python
boost = {'question': 1.5, 'section': 0.1}
```

What's the hitrate for this approach?

* 0.64
* 0.74
* 0.84
* 0.94
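
Conceptually, boosting weights each text field's match score before the scores are summed, so a match in `question` counts for more than a match in `section`. The sketch below shows only this weighting step. The per-field scores are made-up numbers, and `combine_scores` is a hypothetical helper with an assumed default weight of 1.0 for unboosted fields, not part of the minsearch API.

```python
boost = {'question': 1.5, 'section': 0.1}

# Hypothetical per-field similarity scores for two documents (invented numbers)
field_scores = {
    'doc_a': {'question': 0.40, 'text': 0.20, 'section': 0.90},
    'doc_b': {'question': 0.60, 'text': 0.20, 'section': 0.10},
}

def combine_scores(scores, boost):
    # Assumption: fields missing from the boost dict keep a weight of 1.0
    return sum(score * boost.get(field, 1.0) for field, score in scores.items())

combined = {doc: combine_scores(s, boost) for doc, s in field_scores.items()}
ranking = sorted(combined, key=combined.get, reverse=True)
print(combined, ranking)
```

With these weights, `doc_b` outranks `doc_a` even though `doc_a` has a much stronger `section` match, because the `section` score is scaled down to a tenth.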

##### Embeddings
---
The latest version of minsearch also supports vector search. We will use it:

```python
from minsearch import VectorSearch
```

We will also use TF-IDF and Singular Value Decomposition to create embeddings from texts. You can refer to our ["Create Your Own Search Engine" workshop](https://github.com/alexeygrigorev/build-your-own-search-engine) if you want to know more about it.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
```

Let's create embeddings for the "question" field:
```python
texts = []

for doc in documents:
    t = doc['question']
    texts.append(t)

pipeline = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
)
X = pipeline.fit_transform(texts)
```

In [None]:
# Q1. Minsearch text search with boosting parameters
from minsearch import Index

# Creates a search index using the minsearch library
index = Index(
    text_fields=["question", "text", "section"], # Full-text search in each document
    keyword_fields=["course", "id"] # Filtering by course or id
)
index.fit(documents) # Build the index from the list of documents

# Searches the index for documents matching the query and course
def minsearch_search(query, course):
    boost = {'question': 1.5, 'section': 0.1}

    results = index.search(
        query=query,
        filter_dict={'course': course}, # Filters results to only those matching the given course
        boost_dict=boost,
        num_results=5
    )

    return results

# Evaluate hitrate for minsearch text search
relevance_total = []
# For each query in the ground truth data, get the document id and search using the question and course
for q in ground_truth:
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results] # Check if results match the expected document id
    relevance_total.append(relevance)

In [17]:
hit_rate(relevance_total), mrr(relevance_total)

(0.848714069591528, 0.7288235717887772)

#### Q2. Vector search for question

Now let's index these embeddings with minsearch:
```python
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)
```
Evaluate this search method. What's the MRR for it?

* 0.25
* 0.35
* 0.45
* 0.55
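
Conceptually, a filtered vector search keeps only the documents whose keyword field matches, scores them against the query vector, and returns the top results. The sketch below does this by brute force on invented 2-dimensional unit vectors, using the dot product as the score. `toy_index` and `toy_vector_search` are hypothetical names for illustration, not how minsearch implements `VectorSearch`.

```python
# Invented toy data: (embedding, document) pairs with unit-length embeddings
toy_index = [
    ([1.0, 0.0], {'id': 'a', 'course': 'ml'}),
    ([0.8, 0.6], {'id': 'b', 'course': 'ml'}),
    ([0.0, 1.0], {'id': 'c', 'course': 'de'}),
]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def toy_vector_search(query_vector, course, num_results=5):
    # 1. Filter by keyword field, 2. score the remaining documents, 3. keep the top results
    scored = [
        (dot(query_vector, vec), doc)
        for vec, doc in toy_index
        if doc['course'] == course
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]

toy_results = toy_vector_search([0.6, 0.8], course='ml')
toy_ids = [d['id'] for d in toy_results]
print(toy_ids)
```

Document `c` is the closest vector to the query overall, but it is excluded because it belongs to a different course.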

In [None]:
# Q2. Vector search for question field using TF-IDF and SVD embeddings
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Create embeddings for 'question' field 
texts = []
# Extracts the value of the 'question' field from each document and appends it to the texts list
for doc in documents:
    t = doc['question']
    texts.append(t)

# Create an Embedding Pipeline
pipeline = make_pipeline(
    TfidfVectorizer(min_df=3), # TF-IDF embedding ignores words that appear in fewer than 3 documents
    TruncatedSVD(n_components=128, random_state=1) # SVD embedding reduces TF-IDF vector dimensionality to 128
 )
X = pipeline.fit_transform(texts) # Fit the pipeline to the texts and transform them into dense vectors

# Index embeddings with minsearch VectorSearch
vindex = VectorSearch(keyword_fields={'course'}) # Initialize a VectorSearch index filtering through 'course'
vindex.fit(X, documents) # Build the index using the embeddings (X) and the original documents

def vector_search(q): # Query q is a dictionary with at least 'question' and 'course' fields
    # Transform the question to embedding
    v_q = pipeline.transform([q['question']]) # Transforms the query’s question into an embedding
    results = vindex.search(v_q, {'course': q['course']}, 5) # Return the top 5 results, filtered by course
    return results

# Evaluate MRR for vector search on question field
vector_search_mrr = evaluate(ground_truth, vector_search)
print('Vector search MRR:', vector_search_mrr['mrr'])

100%|██████████| 4627/4627 [00:05<00:00, 791.72it/s]

Vector search MRR: 0.3572833369353793





#### Q3. Vector search for question and answer 

We only used question in Q2. We can use both question and answer:
```python
texts = []

for doc in documents:
    t = doc['question'] + ' ' + doc['text']
    texts.append(t)
```
Using the same pipeline (`min_df=3` for TF-IDF vectorizer and `n_components=128` for SVD), evaluate the performance of this approach.

What's the hitrate?

* 0.62
* 0.72
* 0.82
* 0.92

In [None]:
# Q3. Vector search for question and answer fields
texts_qa = [doc['question'] + ' ' + doc['text'] for doc in documents]
pipeline_qa = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
 )
X_qa = pipeline_qa.fit_transform(texts_qa)

vindex_qa = VectorSearch(keyword_fields={'course'})
vindex_qa.fit(X_qa, documents)

def vector_search_question_answer(q):
    # Only use the question field from the query, since ground_truth does not have 'text'
    v_q = pipeline_qa.transform([q['question']])
    results = vindex_qa.search(v_q, {'course': q['course']}, 5)
    return results

# Evaluate hitrate for vector search on question+answer
result_q3 = evaluate(ground_truth, vector_search_question_answer)
print('Q3 - Vector search (question+answer) hitrate:', result_q3['hit_rate'])

100%|██████████| 4627/4627 [00:07<00:00, 591.94it/s]

Q3 - Vector search (question+answer) hitrate: 0.8210503566025502





#### Q4. Qdrant 

Now let's evaluate the following settings in Qdrant:

* `text = doc['question'] + ' ' + doc['text']`
* `model_handle = "jinaai/jina-embeddings-v2-small-en"`
* `limit = 5`

What's the MRR?

* 0.65
* 0.75
* 0.85
* 0.95

In [28]:
!pip install sentence-transformers

In [27]:
# Q4. Qdrant vector search using Jina embeddings
from qdrant_client import QdrantClient
from qdrant_client.http.models import VectorParams, Distance, PointStruct

# Use Jina embeddings via sentence-transformers (requires `pip install sentence-transformers`)
from sentence_transformers import SentenceTransformer
model_handle = "jinaai/jina-embeddings-v2-small-en"
# Assumption: the Jina v2 models ship custom model code, so trust_remote_code=True is needed to load them
embedding_model = SentenceTransformer(model_handle, trust_remote_code=True)

# Prepare an in-memory Qdrant collection
client = QdrantClient(':memory:')
collection_name = "search_eval"
client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=embedding_model.get_sentence_embedding_dimension(), distance=Distance.COSINE)
)

# Index documents in Qdrant
# Qdrant point IDs must be unsigned integers or UUIDs, so the list position is used as the point ID;
# the original document ID stays in the payload, where evaluate() reads it
texts = [doc['question'] + ' ' + doc['text'] for doc in documents]
vectors = embedding_model.encode(texts)
client.upsert(
    collection_name=collection_name,
    points=[
        PointStruct(id=i, vector=vector.tolist(), payload=doc)
        for i, (doc, vector) in enumerate(zip(documents, vectors))
    ]
)

def qdrant_search(q):
    # ground_truth records have no 'text' field, so only the question is embedded
    vector = embedding_model.encode(q['question'])
    hits = client.query_points(
        collection_name=collection_name,
        query=vector.tolist(),
        limit=5,
        with_payload=True
    ).points
    return [hit.payload for hit in hits]

# Evaluate MRR for Qdrant search
result_q4 = evaluate(ground_truth, qdrant_search)
print('Q4 - Qdrant vector search MRR:', result_q4['mrr'])

ModuleNotFoundError: No module named 'sentence_transformers'

#### Q5. Cosine similarity calculation for LLM and original answers

We will calculate the average cosine similarity between LLM-generated answers and original answers using TF-IDF and SVD embeddings.
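
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. A small worked example with hand-picked vectors (chosen for illustration, not taken from the dataset; `cosine_similarity` is a hypothetical standalone helper):

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])      # perpendicular vectors
partial = cosine_similarity([3.0, 4.0], [4.0, 3.0])         # 24 / (5 * 5)
print(same_direction, orthogonal, partial)
```

Parallel vectors score close to 1, perpendicular vectors score 0, and the third pair lands in between at 24/25.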

In [None]:
# Q5. Cosine similarity calculation for LLM and original answers
results_url = url_prefix + 'rag_evaluation/data/results-gpt4o-mini.csv'
df_results = pd.read_csv(results_url)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
import numpy as np

# Fit pipeline on all text data
all_texts = pd.concat([df_results['answer_llm'], df_results['answer_orig'], df_results['question']])
pipeline_rag = make_pipeline(
    TfidfVectorizer(min_df=3),
    TruncatedSVD(n_components=128, random_state=1)
 )
pipeline_rag.fit(all_texts)

def normalize(u):
    norm = np.sqrt(u.dot(u))
    return u / norm

def cosine(u, v):
    u_norm = np.sqrt(u.dot(u))
    v_norm = np.sqrt(v.dot(v))
    return u.dot(v) / (u_norm * v_norm)

cosines = []
for _, row in df_results.iterrows():
    v_llm = pipeline_rag.transform([row['answer_llm']])[0]
    v_orig = pipeline_rag.transform([row['answer_orig']])[0]
    cos_val = cosine(v_llm, v_orig)
    cosines.append(cos_val)

avg_cosine = np.mean(cosines)
print('Q5 - Average cosine similarity:', avg_cosine)

#### Q6. ROUGE score calculation for LLM and original answers

We will compute the ROUGE-1 F1 score for the document at index 10 (`df_results.iloc[10]`) and the average ROUGE-1 F1 score for all answer pairs.
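
ROUGE-1 measures unigram overlap between a candidate and a reference: precision is the share of candidate words found in the reference, recall is the share of reference words found in the candidate, and F1 is their harmonic mean. The function below is a simplified sketch using whitespace tokens and clipped counts. `rouge1_f1` is a hypothetical helper, and the `rouge` package may tokenize and count differently, so its numbers need not match exactly.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 5 of 6 words match in both directions, so precision = recall = F1 = 5/6
toy_f1 = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(toy_f1, 3))
```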

In [None]:
# Q6. ROUGE score calculation for LLM and original answers
from rouge import Rouge

rouge_scorer = Rouge()

# ROUGE-1 F1 for the document at index 10
r = df_results.iloc[10]
scores_10 = rouge_scorer.get_scores(r.answer_llm, r.answer_orig)[0]
rouge_1_f1_10 = scores_10['rouge-1']['f']
print('Q6 - ROUGE-1 F1 for 10th document:', rouge_1_f1_10)

# Average ROUGE-1 F1 for all pairs
rouge_1_f1_scores = []
for _, row in df_results.iterrows():
    scores = rouge_scorer.get_scores(row['answer_llm'], row['answer_orig'])[0]
    rouge_1_f1_scores.append(scores['rouge-1']['f'])

avg_rouge_1_f1 = np.mean(rouge_1_f1_scores)
print('Q6 - Average ROUGE-1 F1:', avg_rouge_1_f1)