# Evaluating Text Search with Ground Truth Data

In this notebook, we use the ground truth dataset generated in the previous step to evaluate text search results.  
For each query in the dataset, we check whether the relevant document is returned by the search engine and at what rank.  
We use two metrics to measure retrieval quality: **hit-rate** (recall over the top 5 results) and **Mean Reciprocal Rank (MRR)**.

In [48]:
import json

with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

## Indexing Documents in Elasticsearch

We load the processed documents and index them in Elasticsearch.  
Each document has a unique `id` field, which we use for evaluation.  
The `course` field is mapped as `keyword` so that searches can be filtered by course.

In [4]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [None]:
# Index all documents into Elasticsearch (one request per document)
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
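
The loop above sends one request per document. If indexing speed matters, the `bulk` helper from `elasticsearch.helpers` can batch the requests. The following is a minimal sketch, not part of the original notebook: the function names `generate_actions` and `bulk_index` are hypothetical, and reusing the document `id` as the Elasticsearch `_id` is an assumption made here so that re-running the cell overwrites documents instead of duplicating them.

```python
def generate_actions(documents, index_name):
    """Yield one bulk action per document (hypothetical helper)."""
    for doc in documents:
        yield {
            "_index": index_name,
            "_id": doc["id"],  # assumption: reuse our own id as the Elasticsearch _id
            "_source": doc,
        }


def bulk_index(es_client, documents, index_name):
    """Send all documents in batched requests instead of one call per document."""
    # Imported lazily so generate_actions can be inspected without the client installed.
    from elasticsearch import helpers

    # helpers.bulk returns a tuple: (number of successful actions, list of errors)
    return helpers.bulk(es_client, generate_actions(documents, index_name))


# Tiny illustrative input; in the notebook this would be the `documents` list.
sample_docs = [
    {"id": "7842b56a", "course": "data-engineering-zoomcamp",
     "question": "Q1", "text": "A1", "section": "S"},
    {"id": "63394d91", "course": "data-engineering-zoomcamp",
     "question": "Q2", "text": "A2", "section": "S"},
]
actions = list(generate_actions(sample_docs, "course-questions"))
print(actions[0]["_id"], actions[0]["_index"])
```

With a running cluster, `bulk_index(es_client, documents, index_name)` would replace the loop; the search and evaluation code would not need to change, since it only reads `_source`.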

## Defining the Search Function

We define a function to search for relevant documents in Elasticsearch, filtering by course and ranking by relevance.

In [46]:
def elastic_search(query, course):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [None]:
# Let's try searching for a sample query to see what results we get.
elastic_search(
    query="I just discovered the course. Can I still join?",
    course="data-engineering-zoomcamp"
)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

## Loading Ground Truth Data

We load the ground truth dataset, which contains queries, courses, and the relevant document IDs for evaluation.

In [None]:
import pandas as pd

df_ground_truth = pd.read_csv('ground-truth-data.csv')

ground_truth = df_ground_truth.to_dict(orient='records')
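
The evaluation code reads three keys from each record: `question`, `course`, and `document`. As a quick illustration of that shape, here is a sketch that parses an inline CSV with the standard library. The row is illustrative only (it pairs the sample query and document ID shown earlier) and is not claimed to be an actual row of `ground-truth-data.csv`.

```python
import csv
import io

# Illustrative CSV content with the columns the evaluation code expects.
sample_csv = (
    "question,course,document\n"
    '"I just discovered the course. Can I still join?",'
    "data-engineering-zoomcamp,7842b56a\n"
)

# Each row becomes a dict, similar to df.to_dict(orient='records') in pandas.
records = list(csv.DictReader(io.StringIO(sample_csv)))

for record in records:
    print(record["question"], record["course"], record["document"])
```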

## Evaluating All Queries

For each query in the ground truth data, we execute the search and check if the relevant document is among the top results.  
We store the relevance results for further metric calculation.

In [19]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = elastic_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4627 [00:00<?, ?it/s]

In [None]:
# Example of what the relevance results look like for a few queries.
# The trailing comment on each row is that query's reciprocal rank.
example = [
    [True, False, False, False, False],   # 1
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [False, False, False, False, False],  # 0
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [True, False, False, False, False],   # 1
    [False, False, True, False, False],   # 1/3
    [False, False, False, False, False],  # 0
]

# 1 => 1
# 2 => 1 / 2 = 0.5
# 3 => 1 / 3 = 0.3333
# 4 => 0.25
# 5 => 0.2
# rank => 1 / rank
# none => 0

## Hit-rate Metric

The hit-rate (recall) measures the fraction of queries for which the relevant document appears among the top 5 results.
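
Written as a formula, using notation introduced here for illustration only: let $Q$ be the set of ground truth queries, $d_q$ the relevant document for query $q$, and $R_k(q)$ the top $k$ results returned for $q$ (here $k = 5$). Then

$$
\text{hit-rate} = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\left[ d_q \in R_k(q) \right]
$$

where $\mathbf{1}[\cdot]$ is 1 when the condition holds and 0 otherwise.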

In [24]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

## Mean Reciprocal Rank (MRR) Metric

MRR measures not only whether the relevant document was retrieved, but also how high it was ranked in the results.
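
As a formula, with notation introduced here for illustration only: let $Q$ be the set of ground truth queries and $\mathrm{rank}_q$ the 1-based position of the relevant document in the results for query $q$. Then

$$
\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}
$$

where the term for $q$ is taken to be 0 when the relevant document is not among the returned results.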

In [29]:
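def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        # enumerate from 1 so `rank` is the 1-based position in the results
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total_score += 1 / rank
                break  # only the first relevant result counts toward the reciprocal rank

    return total_score / len(relevance_total)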

In [None]:
# Calculate the hit-rate for the example relevance results.
hit_rate(example)

0.5833333333333334

In [31]:
mrr(example)

0.5277777777777778
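
These two numbers can be checked by hand: 7 of the 12 example rows contain a hit, six of them at rank 1 and one at rank 3. A small sketch using exact fractions, independent of the `hit_rate` and `mrr` functions above:

```python
from fractions import Fraction

n_queries = 12      # rows in `example`
hits_at_rank_1 = 6  # rows with True in the first position
hits_at_rank_3 = 1  # the single row with True in the third position

# Hit-rate: rows with any hit, divided by the number of rows.
hit_rate_expected = Fraction(hits_at_rank_1 + hits_at_rank_3, n_queries)

# MRR: sum of 1/rank over the rows with a hit, divided by the number of rows.
mrr_expected = (
    hits_at_rank_1 * Fraction(1, 1) + hits_at_rank_3 * Fraction(1, 3)
) / n_queries

print(hit_rate_expected, float(hit_rate_expected))  # 7/12, about 0.5833
print(mrr_expected, float(mrr_expected))            # 19/36, about 0.5278
```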

In [None]:
# Calculate hit-rate and MRR for all queries using Elasticsearch.
hit_rate(relevance_total), mrr(relevance_total)

(0.7395720769397017, 0.6032418413658963)

## Evaluating with MinSearch

Now let's evaluate the same queries using MinSearch, a lightweight search library.

In [34]:
import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course", "id"]
)

index.fit(documents)

<minsearch.Index at 0x29109dae150>

## MinSearch Search Function

We define a search function for MinSearch, similar to the one used for Elasticsearch.

In [38]:
def minsearch_search(query, course):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': course},
        boost_dict=boost,
        num_results=5
    )

    return results

## Evaluating All Queries with MinSearch

We repeat the evaluation for all queries using MinSearch and calculate the metrics.

In [39]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['document']
    results = minsearch_search(query=q['question'], course=q['course'])
    relevance = [d['id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/4627 [00:00<?, ?it/s]

In [40]:
hit_rate(relevance_total), mrr(relevance_total)

(0.7722066133563864, 0.661454506159499)

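MinSearch scores higher than Elasticsearch on both metrics with these settings. The two setups are not identical (MinSearch down-weights `section` to 0.5, while the Elasticsearch query leaves it at the default boost), so treat this as a rough comparison.  
You can experiment with different boost values or fields to see if the results improve. For reference, the Elasticsearch results (hit-rate, MRR) were: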
```
(0.7395720769397017, 0.6032418413658963)
```

## Generic Evaluation Function

To make evaluation easier, we define a generic function that takes a ground truth set and a search function, and returns the metrics.

In [42]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        doc_id = q['document']
        results = search_function(q)
        relevance = [d['id'] == doc_id for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
    }

## Evaluate Both Search Engines

Now we can easily evaluate both Elasticsearch and MinSearch using the same function.

In [44]:
evaluate(ground_truth, lambda q: elastic_search(q['question'], q['course']))

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.7395720769397017, 'mrr': 0.6032418413658963}

In [45]:
evaluate(ground_truth, lambda q: minsearch_search(q['question'], q['course']))

  0%|          | 0/4627 [00:00<?, ?it/s]

{'hit_rate': 0.7722066133563864, 'mrr': 0.661454506159499}

## Experimentation and Further Improvements

You can now experiment with different search parameters, boost values, or even try other search engines.  
For more advanced experiment tracking, consider using tools like MLflow.  
Remember that the ground truth dataset itself may require further cleaning.
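
One way to organize such experiments is to evaluate a list of candidate boost settings and sort them by a metric. The sketch below is an illustration under stated assumptions, not the notebook's method: `sweep`, `make_toy_search`, and the toy documents are hypothetical, the toy scorer simply counts matching words, and `evaluate` is a compact re-implementation without the progress bar so that the block runs on its own.

```python
def evaluate(ground_truth, search_function):
    """Compute hit-rate and MRR, in the same spirit as the notebook's evaluate()."""
    hits, reciprocal_ranks = 0, 0.0
    for q in ground_truth:
        ids = [d["id"] for d in search_function(q)]
        if q["document"] in ids:
            hits += 1
            # index() returns the first match, so this is the first relevant rank
            reciprocal_ranks += 1 / (ids.index(q["document"]) + 1)
    n = len(ground_truth)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}


def sweep(ground_truth, make_search, candidates):
    """Evaluate one search function per candidate setting, best MRR first."""
    results = []
    for params in candidates:
        metrics = evaluate(ground_truth, make_search(params))
        results.append({"params": params, **metrics})
    return sorted(results, key=lambda r: r["mrr"], reverse=True)


# Toy stand-in for a search engine: two documents and a word-overlap scorer.
toy_docs = [
    {"id": "a", "question": "how to install docker", "text": "run the setup script"},
    {"id": "b", "question": "when does the course start",
     "text": "install docker before the course starts"},
]


def make_toy_search(params):
    """Build a search function that weights question and text matches by `params`."""
    def search(q):
        words = set(q["question"].lower().split())

        def score(doc):
            question_matches = len(words & set(doc["question"].split()))
            text_matches = len(words & set(doc["text"].split()))
            return params["question"] * question_matches + params["text"] * text_matches

        return sorted(toy_docs, key=score, reverse=True)
    return search


toy_ground_truth = [{"question": "install docker", "document": "a"}]
candidates = [{"question": 0.5, "text": 1.0}, {"question": 3.0, "text": 1.0}]

results = sweep(toy_ground_truth, make_toy_search, candidates)
for row in results:
    print(row)
```

With MinSearch, the factory could instead return a function that calls `index.search(...)` with `boost_dict=params`, so the same `sweep` loop would compare real boost settings on the full ground truth set.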