# Homework: Search Evaluation

In [1]:
!pip uninstall minsearch -y
!pip install -U minsearch qdrant_client


### Evaluation data

In [2]:
import requests
import pandas as pd

url_prefix = 'https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/03-evaluation/'
docs_url = url_prefix + 'search_evaluation/documents-with-ids.json'
documents = requests.get(docs_url).json()

ground_truth_url = url_prefix + 'search_evaluation/ground-truth-data.csv'
df_ground_truth = pd.read_csv(ground_truth_url)
ground_truth = df_ground_truth.to_dict(orient='records')

Here, `documents` contains the documents from the FAQ database with unique IDs, and `ground_truth` contains generated question-answer pairs.

### We will need the following code for evaluating retrieval:

In [3]:
from tqdm.auto import tqdm

# Hit rate: fraction of queries for which the relevant document appears in the returned results.
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

# Mean reciprocal rank (MRR): average of 1 / rank of the relevant document across queries
# (a query contributes 0 if the relevant document is not returned).
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)

    return total_score / len(relevance_total)

def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }
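
To make the two metrics concrete, here is a small worked example on a toy relevance matrix. The names `toy_relevance`, `toy_hit_rate` and `toy_mrr` are made up for this illustration and are not part of the homework code; the toy MRR stops at the first relevant result, which gives the same value as the function above whenever each row has at most one `True`.

```python
# Toy relevance matrix: one row per query, one boolean per returned result.
toy_relevance = [
    [False, True, False],   # relevant document at rank 2
    [True, False, False],   # relevant document at rank 1
    [False, False, False],  # relevant document not retrieved
]

def toy_hit_rate(rows):
    # Fraction of queries with at least one relevant result.
    return sum(any(row) for row in rows) / len(rows)

def toy_mrr(rows):
    total = 0.0
    for row in rows:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank  # reciprocal rank of the first relevant result
                break
    return total / len(rows)

print(toy_hit_rate(toy_relevance))  # 2 of the 3 queries have a hit
print(toy_mrr(toy_relevance))       # (1/2 + 1 + 0) / 3 = 0.5
```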

### Q1. Minsearch text

Now let's evaluate our usual minsearch approach, but tweak the parameters. Let's use the following boosting params:

```python
boost = {'question': 1.5, 'section': 0.1}
```
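
As a rough sketch of what boosting does, assume the final score of a document is a weighted sum of per-field similarity scores, where fields without an explicit boost get a weight of 1.0. This matches my understanding of minsearch, but treat it as an assumption. The per-field scores in `field_scores` are invented for the illustration.

```python
# Hypothetical per-field similarity scores for one query against two documents.
field_scores = {
    'doc_a': {'question': 0.2, 'text': 0.5, 'section': 0.9},
    'doc_b': {'question': 0.6, 'text': 0.4, 'section': 0.1},
}
boost = {'question': 1.5, 'section': 0.1}

def boosted_score(scores, boost):
    # Fields without an explicit boost keep a weight of 1.0 (assumed default).
    return sum(boost.get(field, 1.0) * s for field, s in scores.items())

ranking = sorted(
    field_scores,
    key=lambda doc: boosted_score(field_scores[doc], boost),
    reverse=True,
)
print(ranking)  # doc_b first: its strong 'question' match is boosted
```

Without boosting, `doc_a` would rank first (1.6 vs 1.1) because of its high `section` score; with these weights the `question` match dominates instead.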

In [4]:
from minsearch import Index

boost = {'question': 1.5, 'section': 0.1}

# Initialize the index
index = Index(
    text_fields=['question', 'text', 'section'],
    keyword_fields=[]
)
index.fit(documents)  # index the documents so they can be searched


<minsearch.minsearch.Index at 0x7f88a103d820>

In [5]:
print(f"Total queries: {len(ground_truth)}")
print("Example query:", ground_truth[0])



Total queries: 4627
Example query: {'question': 'When does the course begin?', 'course': 'data-engineering-zoomcamp', 'document': 'c02e79ef'}


In [6]:
# Search function: returns the top 10 results for one ground-truth record

def search_function(q):
    return index.search(
        query=q['question'],
        filter_dict=None,
        boost_dict=boost,
        num_results=10
    )

### Now we feed each question from `ground_truth` to our `search_function` (minsearch) and check whether the results contain the expected document.

In [7]:
metrics = evaluate(ground_truth, search_function)
print(metrics)



{'hit_rate': 0.8597363302355738, 'mrr': 0.6897542375497872}


# `Q1-Answer -> hit_rate ≈ 0.86, so the closest option is 0.84`

### Embeddings

The latest version of minsearch also supports vector search, which we will use here.

We will also use `TF-IDF (Term Frequency – Inverse Document Frequency)` and Singular Value Decomposition (SVD) to create embeddings from the texts.

#### What TF-IDF Does:

It scores each word in each document based on how often the word appears in that document and across the whole collection.

It gives more weight to words that:

- appear frequently in a specific document (high term frequency),

- and appear in few documents overall (high inverse document frequency).
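
The weighting can be illustrated with a minimal sketch of the textbook formula, `tf * log(N / df)`, on three made-up questions. Note that this is not exactly what `TfidfVectorizer` computes: scikit-learn smooths the IDF term and L2-normalizes each vector by default, so its numbers will differ. The `docs` list and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

docs = [
    "how do i install docker",
    "how do i submit homework",
    "how do i install kafka",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    tf = tokens.count(term) / len(tokens)  # term frequency within this document
    idf = math.log(n_docs / df[term])      # rarer terms get a larger idf
    return tf * idf

print(tf_idf("how", tokenized[0]))      # 0.0, since "how" appears in every document
print(tf_idf("install", tokenized[0]))  # small: "install" appears in 2 of 3 documents
print(tf_idf("docker", tokenized[0]))   # largest: "docker" appears in only 1 document
```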

In [8]:
from minsearch import VectorSearch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

#### Let's create embeddings for the "question" field:

In [9]:
# Collect the "question" text of every document.
texts = [doc['question'] for doc in documents]

pipeline = make_pipeline(
    # Keep only terms that appear in at least 3 questions (drops rare, noisy terms).
    TfidfVectorizer(min_df=3),
    # Reduce the TF-IDF vectors to 128 dimensions; a fixed random_state makes the result reproducible.
    TruncatedSVD(n_components=128, random_state=1)
)

# Fit the pipeline on the questions and transform them into embeddings.
X = pipeline.fit_transform(texts)

### Q2. Vector search for question

Now let's index these embeddings with minsearch:

In [10]:
vindex = VectorSearch(keyword_fields={'course'})
vindex.fit(X, documents)

<minsearch.vector.VectorSearch at 0x7f88a12d39a0>

#### Create the `search_function`

In [13]:

def search_function(q):
    query_vec = pipeline.transform([q['question']])
    return vindex.search(query_vec, filter_dict=None)
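
Conceptually, a vector search ranks documents by how similar their embedding is to the query embedding. The sketch below assumes cosine similarity as the measure (which is my understanding of what minsearch's `VectorSearch` uses) and works on tiny 2-dimensional vectors. `toy_vector_search`, `toy_docs` and `toy_vecs` are illustrative names, not part of minsearch.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_vector_search(query_vec, doc_vecs, docs, num_results=2):
    # Score every document against the query, then keep the best matches.
    scored = [(cosine_similarity(query_vec, v), d) for v, d in zip(doc_vecs, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:num_results]]

toy_docs = [{'id': 'a'}, {'id': 'b'}, {'id': 'c'}]
toy_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]

# The query points mostly along the first axis, so 'a' ranks first, then 'b'.
print(toy_vector_search([0.9, 0.1], toy_vecs, toy_docs))
```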

#### Now let's evaluate 

In [14]:
metrics = evaluate(ground_truth, search_function)
print(metrics)


{'hit_rate': 0.4696347525394424, 'mrr': 0.30031389257669755}


# `Q2-Answer -> mrr ≈ 0.30, so the closest option is 0.35`