# Retrieval evaluation
In this notebook we evaluate different retrieval methods.
To do this we need to:
* create a ground-truth dataset by generating questions for our answers
* get search results for those questions
* measure the performance of each search method

We will try:
* minsearch.py
* elasticsearch (text search)
* elasticsearch (vector search)

We also separately evaluate search with chunking (content split into 1000-character chunks with a 100-character overlap).

To generate the ground-truth dataset we use [this script](scripts/generate_ground_truth.py).

## Preparation

In [1]:
!rm -f minsearch.py
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-09-08 17:50:44--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-09-08 17:50:44 (1.47 MB/s) - ‘minsearch.py’ saved [3832/3832]



In [2]:
import json
import math
from tqdm import tqdm

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

import minsearch



In [3]:
with open('../data/ground-truth.json', 'r') as f_in:
    ground_truth = json.load(f_in)

with open('../data/site_content.json', 'r') as f_in:
    raw_doc = json.load(f_in)

## Metric functions

In [4]:
def hit_rate(relevance_total):
    cnt = 0

    for line in relevance_total:
        if True in line:
            cnt = cnt + 1

    return cnt / len(relevance_total)

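def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score = total_score + 1 / (rank + 1)
                # Count only the first relevant result per query. Without this,
                # repeated URLs (e.g. several chunks of one page) are summed and
                # MRR can exceed the hit rate.
                break

    return total_score / len(relevance_total)


def ndcg(relevance_total):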
    def dcg(relevance):
        return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevance))
    
    def idcg(relevance):
        return dcg(sorted(relevance, reverse=True))
    
    scores = []
    for relevance in relevance_total:
        if sum(relevance) == 0:
            scores.append(0.0)
        else:
            scores.append(dcg(relevance) / idcg(relevance))
    
    return sum(scores) / len(scores)
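
To make the three metrics concrete, here is a small worked example on hand-written relevance lists. It is a standalone sketch, not the notebook's own functions: `reciprocal_rank` and `dcg` are illustrative helpers, and MRR is computed in its standard form (first relevant result only).

```python
import math

# Toy relevance lists: one list per query, one boolean per returned result.
relevance_total = [
    [False, True, False],   # first hit at rank 2
    [True, False, False],   # first hit at rank 1
    [False, False, False],  # no hit
]

def reciprocal_rank(relevance):
    """Return 1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

def dcg(relevance):
    """Discounted cumulative gain with gain 2**rel - 1 and a log2 discount."""
    return sum((2 ** int(rel) - 1) / math.log2(i + 2) for i, rel in enumerate(relevance))

hit_rate_value = sum(any(r) for r in relevance_total) / len(relevance_total)
mrr_value = sum(reciprocal_rank(r) for r in relevance_total) / len(relevance_total)
ndcg_first = dcg(relevance_total[0]) / dcg(sorted(relevance_total[0], reverse=True))

print(hit_rate_value)        # 2 of 3 queries have a hit
print(mrr_value)             # (1/2 + 1 + 0) / 3 = 0.5
print(round(ndcg_first, 4))  # 1 / log2(3), about 0.6309
```

With at most one relevant result per query, MRR can never be larger than the hit rate, because each query contributes at most 1 and only when it has a hit.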

In [5]:
def evaluate(ground_truth, search_function):
    relevance_total = []

    for q in tqdm(ground_truth):
        results = search_function(q['question'])
        relevance = [d['url'] == q['url'] for d in results]
        relevance_total.append(relevance)

    return {
        'hit_rate': hit_rate(relevance_total),
        'mrr': mrr(relevance_total),
        'ndsg': ndcg(relevance_total)
    }
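
The sketch below shows the shape of data that `evaluate` works with, using a stub instead of a real index. The records, URLs and the `stub_search` and `relevance_lists` names are hypothetical and exist only for illustration; the only assumption taken from the notebook is that each ground-truth record has `question` and `url` keys and each search result has a `url` key.

```python
# Hypothetical ground-truth records, for illustration only.
toy_ground_truth = [
    {"question": "How do I reset my password?", "url": "/account/reset"},
    {"question": "Where is the pricing page?", "url": "/pricing"},
]

def stub_search(query):
    """Pretend search: always returns the same two documents in the same order."""
    return [{"url": "/pricing"}, {"url": "/account/reset"}]

def relevance_lists(ground_truth, search_function):
    """Build one list of booleans per question: True where the result URL matches."""
    return [
        [doc["url"] == q["url"] for doc in search_function(q["question"])]
        for q in ground_truth
    ]

toy_relevance = relevance_lists(toy_ground_truth, stub_search)
print(toy_relevance)  # [[False, True], [True, False]]
```

Any function that takes a query string and returns a ranked list of dictionaries with a `url` key can be plugged in the same way.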

## Minsearch function

In [6]:
data = [{'url': k, 'header':v['header'], 'main_content':v['main_content']} for k,v in raw_doc.items()]

msearch_index = minsearch.Index(
    text_fields=["url", "header", "main_content"],
    keyword_fields=[]
)

msearch_index.fit(data)

def search_msearch(query: str) -> list:
    results = msearch_index.search(
        query=query,
        num_results=5
    )

    return results
    

In [7]:
evaluate(ground_truth, search_msearch)

100%|████████████████████████████████████████████████████████████████████████████████| 415/415 [00:01<00:00, 238.19it/s]


{'hit_rate': 0.4578313253012048,
 'mrr': 0.324859437751004,
 'ndsg': 0.3580757390552208}

### With boosting

In [8]:
data = [{'url': k, 'header':v['header'], 'main_content':v['main_content']} for k,v in raw_doc.items()]

msearch_index = minsearch.Index(
    text_fields=["url", "header", "main_content"],
    keyword_fields=[]
)

msearch_index.fit(data)

def search_msearch_boost(query: str) -> list:
    results = msearch_index.search(
        query=query,
        num_results=5,
        boost_dict = {'header':0.5}
    )

    return results
    

In [9]:
evaluate(ground_truth, search_msearch_boost)

100%|████████████████████████████████████████████████████████████████████████████████| 415/415 [00:01<00:00, 223.00it/s]


{'hit_rate': 0.4746987951807229,
 'mrr': 0.330722891566265,
 'ndsg': 0.366675213616622}

## Elasticsearch (text)

First we need to start Elasticsearch locally, if it is not running yet:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```
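
The container can take a while to accept connections, so it may help to wait for it before creating indexes. The helper below is a minimal sketch: `wait_until_ready` is a hypothetical name, and it only assumes a callable that returns `True` once the server responds, such as the client's `ping` method.

```python
import time

def wait_until_ready(ping, attempts=30, delay=2.0):
    """Call ping() until it returns True or the attempts run out."""
    for _ in range(attempts):
        try:
            if ping():
                return True
        except Exception:
            # Treat connection errors as "not ready yet" and retry.
            pass
        time.sleep(delay)
    return False

# Usage with the client created in the next cell (needs the container running):
# es_client = Elasticsearch('http://localhost:9200')
# wait_until_ready(es_client.ping)
```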

In [10]:
es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "header": {"type": "text"},
            "main_content": {"type": "text"}
        }
    }
}

index_name = "esearchtext"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'esearchtext'})

In [11]:
for doc in tqdm(data):
    es_client.index(index=index_name, document=doc)

100%|██████████████████████████████████████████████████████████████████████████████| 1030/1030 [00:04<00:00, 209.21it/s]
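
Indexing one document per request is fine at this size. For larger collections, one alternative (not what this notebook does) is the client's bulk helper. The sketch below assumes the `elasticsearch.helpers.bulk` function from the official Python client; `to_bulk_actions` and `bulk_index` are hypothetical names.

```python
def to_bulk_actions(docs, index_name):
    """Yield one bulk action per document, with the target index and the document body."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}

def bulk_index(es_client, docs, index_name):
    """Index all documents in batched requests; returns (success_count, errors)."""
    # Imported lazily so this sketch can be loaded without the client installed.
    from elasticsearch.helpers import bulk
    return bulk(es_client, to_bulk_actions(docs, index_name))

# Usage (needs a running Elasticsearch instance):
# bulk_index(es_client, data, index_name)
```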


In [12]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["url", "header", "main_content"],
                        "type": "best_fields"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [13]:
evaluate(ground_truth, elastic_search)

100%|████████████████████████████████████████████████████████████████████████████████| 415/415 [00:02<00:00, 204.65it/s]


{'hit_rate': 0.6120481927710844,
 'mrr': 0.43746987951807226,
 'ndsg': 0.48136427158055817}

In [14]:
def elastic_search_boost(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["url", "header", "main_content^3"],
                        "type": "best_fields"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [15]:
evaluate(ground_truth, elastic_search_boost)

100%|████████████████████████████████████████████████████████████████████████████████| 415/415 [00:02<00:00, 203.21it/s]


{'hit_rate': 0.619277108433735,
 'mrr': 0.45497991967871476,
 'ndsg': 0.49613324092305466}

## Elasticsearch (vector)

In [16]:
model_name = "paraphrase-MiniLM-L6-v2"
model = SentenceTransformer(model_name)


index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "header": {"type": "text"},
            "main_content": {"type": "text"},
            "header_vector": {
                "type": "dense_vector",
                "dims": model.get_sentence_embedding_dimension(),
                "index": True,
                "similarity": "cosine"
            },
            "main_content_vector": {
                "type": "dense_vector",
                "dims": model.get_sentence_embedding_dimension(),
                "index": True,
                "similarity": "cosine"
            },
            
        }
    }
}

index_name_vector = "esearchvector"

es_client.indices.delete(index=index_name_vector, ignore_unavailable=True)
es_client.indices.create(index=index_name_vector, body=index_settings)



ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'esearchvector'})

In [17]:
for doc in tqdm(data):
    doc['header_vector'] = model.encode(doc['header'])
    doc['main_content_vector'] = model.encode(doc['main_content'])

100%|███████████████████████████████████████████████████████████████████████████████| 1030/1030 [00:36<00:00, 28.16it/s]


In [18]:
for doc in tqdm(data):
    es_client.index(index=index_name_vector, document=doc)

100%|██████████████████████████████████████████████████████████████████████████████| 1030/1030 [00:05<00:00, 199.27it/s]


In [19]:
def elastic_search_vector(field, vector):
    knn = {
        "field": field,
        "query_vector": vector,
        "k": 5,
        "num_candidates": 10000,
    }

    search_query = {
        "knn": knn,
        "_source": ["url", "header", "main_content", "header_vector", "main_content_vector"]
    }

    es_results = es_client.search(
        index=index_name_vector,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [20]:
def elastic_search_vector_header(q):
    return elastic_search_vector('header_vector', model.encode(q))

In [21]:
evaluate(ground_truth, elastic_search_vector_header)

100%|█████████████████████████████████████████████████████████████████████████████████| 415/415 [00:08<00:00, 46.88it/s]


{'hit_rate': 0.27228915662650605,
 'mrr': 0.16602409638554227,
 'ndsg': 0.1922402885934874}

In [22]:
def elastic_search_vector_main_content(q):
    return elastic_search_vector('main_content_vector', model.encode(q))

In [23]:
evaluate(ground_truth, elastic_search_vector_main_content)

100%|█████████████████████████████████████████████████████████████████████████████████| 415/415 [00:08<00:00, 48.03it/s]


{'hit_rate': 0.4506024096385542,
 'mrr': 0.33329317269076303,
 'ndsg': 0.3625595595195731}

In [24]:
def elastic_search_vector_3(q):
    query_vector = model.encode(q).tolist()

    search_query = {
        "size": 5,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'main_content_vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }

    results = es_client.search(index=index_name_vector, body=search_query)
    result_docs = []
    
    for hit in results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [25]:
evaluate(ground_truth, elastic_search_vector_3)

100%|█████████████████████████████████████████████████████████████████████████████████| 415/415 [00:06<00:00, 60.50it/s]


{'hit_rate': 0.4578313253012048,
 'mrr': 0.34052208835341363,
 'ndsg': 0.3697884751822237}
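
The `script_score` query above scores every document with `cosineSimilarity(...) + 1.0`. The `+ 1.0` shifts the cosine range from [-1, 1] to [0, 2], since Elasticsearch does not accept negative scores from a script. A plain-Python sketch of the same arithmetic, with toy two-dimensional vectors instead of real embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def script_score(query_vector, doc_vector):
    """Mirror of the Painless expression: cosine similarity shifted by 1.0."""
    return cosine_similarity(query_vector, doc_vector) + 1.0

print(script_score([1.0, 0.0], [1.0, 0.0]))   # same direction -> 2.0
print(script_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 1.0
print(script_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction -> 0.0
```

Because this query scores all documents exactly, its results can differ slightly from the approximate `knn` search used in the previous cells.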

## Minsearch with chunks

In [26]:
def chunk_data(raw_doc, chunk_size=1000, overlap=100):
    def chunk_content(content, chunk_size=1000, overlap=100):
        chunks = []
        start = 0
        while start < len(content):
            end = start + chunk_size
            chunk = content[start:end]
            chunks.append(chunk)
            start = end - overlap
        return chunks

    chunked_data = []
    for k, v in raw_doc.items():
        content_chunks = chunk_content(v['main_content'], chunk_size, overlap)
        for i, chunk in enumerate(content_chunks):
            chunked_data.append({
                'url': k,
                'header': v['header'],
                'main_content': chunk,
                'chunk_index': i
            })
    
    return chunked_data
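
To see where the chunk boundaries fall, the sliding-window logic can be run on a short string with small parameters. The function below repeats the inner `chunk_content` logic so the example is standalone; the input strings are made up.

```python
def chunk_content(content, chunk_size=1000, overlap=100):
    """Sliding window: each chunk starts `chunk_size - overlap` characters after the previous one."""
    chunks = []
    start = 0
    while start < len(content):
        end = start + chunk_size
        chunks.append(content[start:end])
        start = end - overlap
    return chunks

alphabet = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
print(chunk_content(alphabet, chunk_size=10, overlap=3))
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']

# When the text ends exactly at a chunk boundary, the last chunk repeats only the overlap.
print(chunk_content("abcdefghij", chunk_size=10, overlap=3))
# ['abcdefghij', 'hij']
```

Two details follow from this logic: the final chunk can consist of overlap text that is already fully contained in the previous chunk, and `overlap` must be smaller than `chunk_size`, otherwise `start` never advances and the loop does not terminate.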

In [27]:
data_chunk = chunk_data(raw_doc)

In [28]:
msearch_index_chunk = minsearch.Index(
    text_fields=["url", "header", "main_content"],
    keyword_fields=[]
)

msearch_index_chunk.fit(data_chunk)

def search_msearch_boost_chunk(query: str) -> list:
    results = msearch_index_chunk.search(
        query=query,
        num_results=5,
        boost_dict = {'header':0.5}
    )

    return results
    

In [29]:
evaluate(ground_truth, search_msearch_boost_chunk)

100%|████████████████████████████████████████████████████████████████████████████████| 415/415 [00:01<00:00, 207.61it/s]


{'hit_rate': 0.3927710843373494,
 'mrr': 0.4765461847389559,
 'ndsg': 0.3221260677391088}
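
The reported MRR (0.4765) is higher than the hit rate (0.3928). That can only happen when the top 5 for a query contains more than one hit with the ground-truth URL and each of them contributes to the score, which chunking makes possible. One way to keep the metrics comparable with the unchunked runs is to collapse hits by URL before scoring. The sketch below is an assumption about how that could look; `dedupe_by_url` is a hypothetical helper and was not used to produce the numbers in this notebook.

```python
def dedupe_by_url(results):
    """Keep the first (highest-ranked) hit for each URL, preserving rank order."""
    seen = set()
    unique = []
    for doc in results:
        if doc["url"] not in seen:
            seen.add(doc["url"])
            unique.append(doc)
    return unique

# Made-up hits: two chunks of the same page followed by another page.
hits = [
    {"url": "/a", "chunk_index": 0},
    {"url": "/a", "chunk_index": 1},
    {"url": "/b", "chunk_index": 0},
]
print([d["url"] for d in dedupe_by_url(hits)])  # ['/a', '/b']
```

A search function could be wrapped as `lambda q: dedupe_by_url(search_msearch_boost_chunk(q))` before being passed to `evaluate`; note that the result list may then hold fewer than 5 documents.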

## Elasticsearch text search with chunks

In [30]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "header": {"type": "text"},
            "main_content": {"type": "text"}
        }
    }
}

index_name = "esearchtext_chunks"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'esearchtext_chunks'})

In [31]:
for doc in tqdm(data_chunk):
    es_client.index(index=index_name, document=doc)

100%|██████████████████████████████████████████████████████████████████████████████| 2834/2834 [00:10<00:00, 257.66it/s]


In [32]:
def elastic_search_boost_chunks(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["url", "header", "main_content^3"],
                        "type": "best_fields"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [33]:
evaluate(ground_truth, elastic_search_boost_chunks)

100%|████████████████████████████████████████████████████████████████████████████████| 415/415 [00:02<00:00, 170.92it/s]


{'hit_rate': 0.6240963855421687,
 'mrr': 0.6240562248995986,
 'ndsg': 0.5197822047592691}

## Elasticsearch vector search with chunks

In [34]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "header": {"type": "text"},
            "main_content": {"type": "text"},
            "main_content_vector": {
                "type": "dense_vector",
                "dims": model.get_sentence_embedding_dimension(),
                "index": True,
                "similarity": "cosine"
            },
            
        }
    }
}

index_name_vector = "esearchvector_chunks"

es_client.indices.delete(index=index_name_vector, ignore_unavailable=True)
es_client.indices.create(index=index_name_vector, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'esearchvector_chunks'})

In [35]:
for doc in tqdm(data_chunk):
    doc['header_vector'] = model.encode(doc['header'])
    doc['main_content_vector'] = model.encode(doc['main_content'])

100%|███████████████████████████████████████████████████████████████████████████████| 2834/2834 [02:04<00:00, 22.74it/s]


In [36]:
for doc in tqdm(data_chunk):
    es_client.index(index=index_name_vector, document=doc)

100%|██████████████████████████████████████████████████████████████████████████████| 2834/2834 [00:15<00:00, 185.60it/s]


In [37]:
def elastic_search_vector(field, vector):
    knn = {
        "field": field,
        "query_vector": vector,
        "k": 5,
        "num_candidates": 10000,
    }

    search_query = {
        "knn": knn,
        "_source": ["url", "header", "main_content", "header_vector", "main_content_vector"]
    }

    es_results = es_client.search(
        index=index_name_vector,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [38]:
def elastic_search_vector_main_content_chunks(q):
    return elastic_search_vector('main_content_vector', model.encode(q))

In [39]:
evaluate(ground_truth, elastic_search_vector_main_content_chunks)

100%|█████████████████████████████████████████████████████████████████████████████████| 415/415 [00:11<00:00, 35.23it/s]


{'hit_rate': 0.5060240963855421,
 'mrr': 0.46598393574297164,
 'ndsg': 0.3986455909302925}

### Another Sentence Transformers model

In [40]:
model_name = "all-MiniLM-L12-v2"
model = SentenceTransformer(model_name)

In [41]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "url": {"type": "text"},
            "header": {"type": "text"},
            "main_content": {"type": "text"},
            "main_content_vector": {
                "type": "dense_vector",
                "dims": model.get_sentence_embedding_dimension(),
                "index": True,
                "similarity": "cosine"
            },
            
        }
    }
}

index_name_vector = "esearchvector_chunks_2"

es_client.indices.delete(index=index_name_vector, ignore_unavailable=True)
es_client.indices.create(index=index_name_vector, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'esearchvector_chunks_2'})

In [42]:
for doc in tqdm(data_chunk):
    doc['header_vector'] = model.encode(doc['header'])
    doc['main_content_vector'] = model.encode(doc['main_content'])

100%|███████████████████████████████████████████████████████████████████████████████| 2834/2834 [03:58<00:00, 11.89it/s]


In [43]:
for doc in tqdm(data_chunk):
    es_client.index(index=index_name_vector, document=doc)

100%|██████████████████████████████████████████████████████████████████████████████| 2834/2834 [00:17<00:00, 164.99it/s]


In [44]:
evaluate(ground_truth, elastic_search_vector_main_content_chunks)

100%|█████████████████████████████████████████████████████████████████████████████████| 415/415 [00:16<00:00, 24.75it/s]


{'hit_rate': 0.5493975903614458,
 'mrr': 0.49333333333333323,
 'ndsg': 0.4261590511854974}

### Combined search (text + vector)

In [48]:
def elastic_search_combined(query):
    vector = model.encode(query)
    search_query = {
        "_source": ["url", "header", "main_content", "header_vector", "main_content_vector"],
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": query,
                            "fields": ["header", "main_content"],
                            "type": "best_fields",
                            "tie_breaker": 0.3
                        }
                    },
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "cosineSimilarity(params.query_vector, 'main_content_vector') + 1.0",
                                "params": {"query_vector": vector}
                            }
                        }
                    }
                ]
            }
        },
        "size": 5
    }
    
    es_results = es_client.search(
        index=index_name_vector,
        body=search_query
    )
    
    result_docs = [hit['_source'] for hit in es_results['hits']['hits']]
    return result_docs

In [49]:
evaluate(ground_truth, elastic_search_combined)

100%|█████████████████████████████████████████████████████████████████████████████████| 415/415 [00:12<00:00, 34.06it/s]


{'hit_rate': 0.5783132530120482,
 'mrr': 0.6417670682730925,
 'ndsg': 0.47983191351995125}
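
In a `bool` query, the scores of the matching `should` clauses are added together, so the combined score of a chunk is its text score plus the shifted cosine similarity. The sketch below illustrates that sum with made-up numbers; `combined_score` and the candidate scores are hypothetical and are not taken from the index.

```python
def combined_score(text_score, cosine):
    """Sum of the two should-clause scores; text_score is 0.0 when the text clause does not match."""
    return text_score + (cosine + 1.0)

# Hypothetical (text_score, cosine) pairs for two chunks.
candidates = {
    "chunk_a": (7.2, 0.10),  # strong keyword match, weak semantic match
    "chunk_b": (0.0, 0.95),  # no keyword match, strong semantic match
}
ranked = sorted(candidates, key=lambda name: combined_score(*candidates[name]), reverse=True)
print(ranked)  # ['chunk_a', 'chunk_b']
```

The vector term is bounded by 2.0 while the text score has no fixed upper bound, so with an unweighted sum a strong keyword match can outweigh any semantic match. Scaling one of the terms is one way to adjust that balance.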

In [50]:
def elastic_search_combined_10(query):
    vector = model.encode(query)
    search_query = {
        "_source": ["url", "header", "main_content", "header_vector", "main_content_vector"],
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": query,
                            "fields": ["header", "main_content"],
                            "type": "best_fields",
                            "tie_breaker": 0.3
                        }
                    },
                    {
                        "script_score": {
                            "query": {"match_all": {}},
                            "script": {
                                "source": "cosineSimilarity(params.query_vector, 'main_content_vector') + 1.0",
                                "params": {"query_vector": vector}
                            }
                        }
                    }
                ]
            }
        },
        "size": 10
    }
    
    es_results = es_client.search(
        index=index_name_vector,
        body=search_query
    )
    
    result_docs = [hit['_source'] for hit in es_results['hits']['hits']]
    return result_docs

In [51]:
evaluate(ground_truth, elastic_search_combined_10)

100%|█████████████████████████████████████████████████████████████████████████████████| 415/415 [00:14<00:00, 28.58it/s]


{'hit_rate': 0.6578313253012048,
 'mrr': 0.7077184930196975,
 'ndsg': 0.49965424064494396}

### Final results

The table below summarizes the search methods evaluated above: Minsearch and Elasticsearch in different configurations. The metrics are Hit Rate, MRR (Mean Reciprocal Rank), and NDCG (Normalized Discounted Cumulative Gain). Every method returns the top 5 results except the last row, which returns 10.

| Method | Hit Rate | MRR | NDCG |
|--------|----------|-----|------|
| Minsearch | 0.4578 | 0.3249 | 0.3581 |
| Minsearch (with boosting) | 0.4747 | 0.3307 | 0.3667 |
| Elasticsearch (text search) | 0.6120 | 0.4375 | 0.4814 |
| Elasticsearch (text search with boosting) | 0.6193 | 0.4550 | 0.4961 |
| Elasticsearch (kNN vector search on header) | 0.2723 | 0.1660 | 0.1922 |
| Elasticsearch (kNN vector search on main content) | 0.4506 | 0.3333 | 0.3626 |
| Elasticsearch (exact cosine similarity via script_score on main content) | 0.4578 | 0.3405 | 0.3698 |
| Minsearch (with boosting and chunks) | 0.3928 | 0.4765 | 0.3221 |
| Elasticsearch (text search with boosting and chunks) | 0.6241 | 0.6241 | 0.5198 |
| Elasticsearch (vector search with chunks, paraphrase-MiniLM-L6-v2) | 0.5060 | 0.4660 | 0.3986 |
| Elasticsearch (vector search with chunks, all-MiniLM-L12-v2) | 0.5494 | 0.4933 | 0.4262 |
| Elasticsearch (combined search with chunks, all-MiniLM-L12-v2) | 0.5783 | 0.6418 | 0.4798 |
| Elasticsearch (combined search with chunks, all-MiniLM-L12-v2, top 10) | 0.6578 | 0.7077 | 0.4997 |


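## Conclusion
The evaluation of the search methods shows the following:

- Elasticsearch text search outperforms Minsearch on every metric (hit rate 0.6120 vs 0.4578 without boosting). Text search also beats vector search alone, with and without chunking. Vector search on headers only is the weakest configuration (hit rate 0.2723).
- Chunking helps vector search the most (hit rate 0.4506 to 0.5060 with paraphrase-MiniLM-L6-v2). It changes the hit rate of Elasticsearch text search only slightly (0.6193 to 0.6241) and lowers the hit rate of Minsearch (0.4747 to 0.3928).
- For Minsearch with chunks and for the combined search, the reported MRR is higher than the hit rate (0.4765 vs 0.3928 and 0.6418 vs 0.5783). This happens because several chunks of the same page can appear in one result list and each of them contributes to the reported score, so MRR values on chunked indexes are not directly comparable with the unchunked ones.
- Switching the embedding model from paraphrase-MiniLM-L6-v2 to all-MiniLM-L12-v2 improves chunked vector search (hit rate 0.5060 to 0.5494), which suggests that vector search offers complementary strengths to text-based search.
- The combined search approach in Elasticsearch, which uses both text and vector scores, shows promising results across all metrics with 5 results (Hit Rate: 0.5783, MRR: 0.6418, NDCG: 0.4798) and reaches a hit rate of 0.6578 with 10 results. The 10-result run is not directly comparable with the top-5 runs.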

For this project, we have decided to implement the combined search method using Elasticsearch. This approach offers several advantages:

- It provides robust performance across different types of queries and content structures.
- It balances the strengths of both text-based and vector-based search methods.

By choosing this combined method, we aim to create a versatile and effective search solution that can handle a wide range of search scenarios while maintaining high relevance and accuracy in results.