# Retrieval Evaluation

The following search methods are evaluated against the ground truth data (`ground-truth-data.csv`) to find the most suitable one for the final application.

* text search/lexical search
* vector search/semantic search
* hybrid search

These evaluation metrics will be applied to each search method:
* Hit Rate (HR)
* Mean Reciprocal Rank (MRR)

The final search method is the one that scores highest on both HR and MRR.
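The decision rule above can be sketched in plain Python. The helper name `best_method` and the layout of `scores` (method name mapped to `hit_rate` and `mrr`) are assumptions made for illustration; the sketch returns `None` when no single method tops both metrics, which is the case where a tie-breaking rule would be needed.

```python
def best_method(scores):
    """Return the method that ranks first on both metrics, or None if they disagree."""
    top_hr = max(scores, key=lambda name: scores[name]["hit_rate"])
    top_mrr = max(scores, key=lambda name: scores[name]["mrr"])
    return top_hr if top_hr == top_mrr else None


# Hypothetical scores, for illustration only
example_scores = {
    "method_a": {"hit_rate": 0.80, "mrr": 0.60},
    "method_b": {"hit_rate": 0.75, "mrr": 0.65},
    "method_c": {"hit_rate": 0.85, "mrr": 0.70},
}
print(best_method(example_scores))  # method_c
```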

In [1]:
import os
import json
import pandas as pd
import numpy as np

from elasticsearch import Elasticsearch
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer

## Index data into Elasticsearch

The preprocessed data in `az900_notes_with_vectors.pkl` is first indexed into Elasticsearch so that the search methods can query it.

**NOTE**: The Elasticsearch container must be running in Docker before executing the following code blocks.

In [2]:
folder = "../data/"
documents_file = "az900_notes_with_vectors.pkl"

In [3]:
df = pd.read_pickle(f"{folder}{documents_file}")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 385 entries, 0 to 384
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   doc_id         385 non-null    object
 1   header         385 non-null    object
 2   subheader      385 non-null    object
 3   document       385 non-null    object
 4   doc_text       385 non-null    object
 5   header_vec     385 non-null    object
 6   subheader_vec  385 non-null    object
 7   doc_text_vec   385 non-null    object
dtypes: object(8)
memory usage: 24.2+ KB


In [4]:
# set doc_id as index since it will be used as keyword type in Elasticsearch
df = df.set_index("doc_id")

In [5]:
model_name = 'sentence-transformers/all-MiniLM-L12-v2'
vec_dim = SentenceTransformer(model_name).get_sentence_embedding_dimension()
vec_dim



384

In [6]:
def infer_es_mapping(df, vec_dim):
    '''
    Builds Elasticsearch index mapping properties from a pandas DataFrame
    and the vector embedding dimension.
    The named DataFrame index is mapped as keyword type.
    Columns ending with "_vec" are dense vector types, all others are text types.
    '''
    similarity = 'cosine'
    es_mapping = {}

    # map the DataFrame index (doc_id here) as a keyword field, if it is named
    if df.index.name is not None:
        es_mapping[df.index.name] = {'type': 'keyword'}

    for col in df.columns:
        if col.endswith("_vec"):
            es_mapping[col] = {'type': 'dense_vector', 'dims': vec_dim, 'index': True, 'similarity': similarity}
        else:
            es_mapping[col] = {'type': 'text'}
    return es_mapping

In [7]:
es_mapping_properties = infer_es_mapping(df, vec_dim)
es_mapping_properties

{'doc_id': {'type': 'keyword'},
 'header': {'type': 'text'},
 'subheader': {'type': 'text'},
 'document': {'type': 'text'},
 'doc_text': {'type': 'text'},
 'header_vec': {'type': 'dense_vector',
  'dims': 384,
  'index': True,
  'similarity': 'cosine'},
 'subheader_vec': {'type': 'dense_vector',
  'dims': 384,
  'index': True,
  'similarity': 'cosine'},
 'doc_text_vec': {'type': 'dense_vector',
  'dims': 384,
  'index': True,
  'similarity': 'cosine'}}

Create the Elasticsearch index `az900_course_notes`. If this index already exists, it is deleted and replaced with a new one.

In [8]:
es_client = Elasticsearch('http://localhost:9200') 
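Since every later cell depends on a reachable Elasticsearch node, a quick check right after creating the client can fail early with a clearer message. The following is a minimal sketch; the helper name `ensure_es_running` is an assumption, and it only relies on the client's `ping()` method, which returns `True` when the node responds and `False` otherwise.

```python
def ensure_es_running(client):
    """Raise early if the Elasticsearch node does not answer a ping."""
    # ping() returns True when the node responds and False otherwise
    if not client.ping():
        raise ConnectionError(
            "Elasticsearch is not reachable; is the Docker container running?"
        )
    return True


# Intended usage with the client created above:
# ensure_es_running(es_client)
```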

In [9]:
def create_es_index(es_mapping_properties, index_name):
    index_settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "mappings": {
            "properties": es_mapping_properties
        }
    }

    es_client.indices.delete(index=index_name, ignore_unavailable=True)
    es_client.indices.create(index=index_name, body=index_settings)

In [10]:
es_index_name = "az900_course_notes"
create_es_index(es_mapping_properties, es_index_name)

Convert the data from `az900_notes_with_vectors.pkl` into a list of records and index them into the Elasticsearch index `az900_course_notes`.

In [11]:
documents = pd.read_pickle(f"{folder}{documents_file}").to_dict(orient="records")

In [12]:
len(documents)

385

In [13]:
for doc in tqdm(documents):
    es_client.index(index=es_index_name, document=doc)
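The loop above sends one request per document. As an alternative sketch, the documents could be turned into bulk actions and sent in a single call with `elasticsearch.helpers.bulk`. The generator name `to_bulk_actions` and the choice to reuse `doc_id` as the Elasticsearch `_id` are assumptions, not part of the original notebook; reusing the id would make re-running the cell overwrite documents instead of duplicating them.

```python
def to_bulk_actions(documents, index_name):
    """Yield one bulk action per document, reusing doc_id as the Elasticsearch _id."""
    for doc in documents:
        yield {"_index": index_name, "_id": doc["doc_id"], "_source": doc}


# Tiny stand-in documents, for illustration only
sample_docs = [
    {"doc_id": "a1", "doc_text": "first"},
    {"doc_id": "b2", "doc_text": "second"},
]
actions = list(to_bulk_actions(sample_docs, "az900_course_notes"))
print(actions[0]["_id"])  # a1

# With a running cluster, the actions could be sent in one request:
# from elasticsearch import helpers
# helpers.bulk(es_client, to_bulk_actions(documents, es_index_name))
```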

  0%|          | 0/385 [00:00<?, ?it/s]

### Load Ground Truth data

In [14]:
ground_truth_file = "ground-truth-data.pkl"
ground_truth = pd.read_pickle(f"{folder}{ground_truth_file}").to_dict(orient="records")

In [15]:
ground_truth[0].keys()

dict_keys(['doc_id', 'question', 'question_vec'])

### Setup Evaluation Metrics

### Hit Rate (HR) or Recall at k:
* Measures the proportion of queries for which at least one relevant document is retrieved in the top k results.
* Formula: HR@k $= \displaystyle\frac{(\text{\#queries with at least one relevant document in top k})}{|Q|}$ where $|Q|$: total number of queries.
* Example: with 3 questions and 5 search results per question, if only question 1 has at least one relevant document in its results while none of the results for questions 2 and 3 are relevant, the Hit Rate is $\frac{1}{3}$.

In [16]:
def hit_rate(relevance_total):
    relevant_count = 0

    for line in relevance_total:
        if True in line:
            relevant_count = relevant_count + 1

    return relevant_count / len(relevance_total)
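The three-question example from the definition can be reproduced directly. The block below restates the hit rate logic so that it runs on its own; the relevance lists are made up to match that example.

```python
def hit_rate(relevance_total):
    """Share of queries with at least one relevant result among the retrieved ones."""
    hits = sum(1 for line in relevance_total if True in line)
    return hits / len(relevance_total)


# One list per question; True marks a relevant document among the 5 results
relevance_total = [
    [False, False, True, False, False],   # question 1: relevant document found
    [False, False, False, False, False],  # question 2: no relevant document
    [False, False, False, False, False],  # question 3: no relevant document
]
print(hit_rate(relevance_total))  # 0.3333333333333333
```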

### Mean Reciprocal Rank (MRR):
* For each query, takes the reciprocal of the rank of the first relevant document, then averages these reciprocal ranks over all queries.
* If there is no relevant document in the query result, the reciprocal rank is 0.
* Formula: MRR $\displaystyle = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{1}{\text{rank}_q}$ where $|Q|$: total number of queries, $\text{rank}_q$: rank of the first relevant document for query $q$; the term is 0 if query $q$ has no relevant document.
* Example: for a question's 5 search results, if the first relevant document is the 3rd result, its reciprocal rank is $\frac{1}{3}$.
* [Mean Reciprocal Rank (MRR) explained](https://www.evidentlyai.com/ranking-metrics/mean-reciprocal-rank-mrr)

In [17]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank] == True:
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)
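A small worked example may help to check the metric by hand. The block restates the MRR logic so that it runs on its own; the relevance lists are made up for illustration. The reciprocal ranks are $\frac{1}{3}$, $1$ and $0$, so the mean is $\frac{4}{9} \approx 0.444$.

```python
def mrr(relevance_total):
    """Mean of 1 / rank of the first relevant result; 0 for queries without one."""
    total = 0.0
    for line in relevance_total:
        for position, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / position
                break
    return total / len(relevance_total)


relevance_total = [
    [False, False, True, False, False],   # first relevant result at rank 3 -> 1/3
    [True, False, False, False, False],   # first relevant result at rank 1 -> 1
    [False, False, False, False, False],  # no relevant result -> 0
]
print(round(mrr(relevance_total), 3))  # 0.444
```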

### Store evaluations in a dictionary
Set up a dictionary to store the Hit Rate and MRR of each search method. It is converted into a DataFrame later for comparison.

In [18]:
evaluation_dict = {}

### Setup text-based search

Given a text query, perform a text-based search that retrieves the 5 best-matching results across the fields passed in `fields` (`header`, `subheader`, and `doc_text`). A field can be weighted with the `^` boost syntax: for example, `doc_text^3` gives `doc_text` 3 times the weight, since this field contains more information than `header` and `subheader`.

In [19]:
def es_text_search(question, fields, index_name):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": question,
                        "fields": fields,
                        "type": "best_fields"
                    }
                },
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs
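The `fields` argument uses Elasticsearch's `field^boost` notation to weight individual fields. As a convenience sketch, the weighted field list could be generated from a dictionary; the helper name `build_fields` is an assumption and not part of the notebook.

```python
def build_fields(weights):
    """Turn {"field": weight} into multi_match field strings such as "doc_text^3"."""
    return [
        name if weight == 1 else f"{name}^{weight}"
        for name, weight in weights.items()
    ]


print(build_fields({"header": 1, "subheader": 0.5, "doc_text": 3}))
# ['header', 'subheader^0.5', 'doc_text^3']
```

The result can be passed directly as the `fields` parameter of `es_text_search`.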

### Evaluate text-based search
* Evaluate text-based search using ground truth data.
* Save evaluation results for comparison later.

Using text fields `header`, `subheader`, `doc_text` with equal weights.

In [20]:
relevance_total = []
fields = ["header", "subheader", "doc_text"]

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_text_search(question=q['question'], fields = fields, index_name=es_index_name)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [21]:
text_base_hit_rate = hit_rate(relevance_total)
text_base_mrr = mrr(relevance_total)

evaluation_dict['text_search_baseline'] = { 'hit_rate': text_base_hit_rate,
                                         'mrr': text_base_mrr}
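The same evaluation loop is repeated for every search variant in this notebook. One way to reduce the repetition, shown here only as a sketch, is a helper that takes any search function and returns both metrics. The name `evaluate` and the single-argument `search_fn` convention are assumptions; the stand-in search function and ground truth are toy data.

```python
def evaluate(search_fn, ground_truth):
    """Run search_fn on every ground truth entry and return hit rate and MRR."""
    relevance_total = []
    for q in ground_truth:
        results = search_fn(q)
        relevance_total.append([d["doc_id"] == q["doc_id"] for d in results])

    hits = 0
    reciprocal_ranks = 0.0
    for line in relevance_total:
        if True in line:
            hits += 1
            # index(True) is the 0-based position of the first relevant result
            reciprocal_ranks += 1 / (line.index(True) + 1)

    n = len(relevance_total)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}


# Stand-in search function and ground truth, for illustration only
def fake_search(q):
    return [{"doc_id": "x"}, {"doc_id": "a"}]

toy_ground_truth = [
    {"doc_id": "a", "question": "q1"},
    {"doc_id": "b", "question": "q2"},
]
print(evaluate(fake_search, toy_ground_truth))  # {'hit_rate': 0.5, 'mrr': 0.25}
```

With the functions defined in this notebook, a call could look like `evaluate(lambda q: es_text_search(q['question'], fields, es_index_name), ground_truth)`.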

Using text fields `header`, `subheader`, `doc_text`, with `doc_text` boosted by 3.

In [22]:
relevance_total = []
fields = ["header", "subheader", "doc_text^3"]

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_text_search(question=q['question'], fields = fields, index_name=es_index_name)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [23]:
text_1_hit_rate = hit_rate(relevance_total)
text_1_mrr = mrr(relevance_total)

evaluation_dict['text_search_1'] = { 'hit_rate': text_1_hit_rate,
                                         'mrr': text_1_mrr}

Using text fields `subheader`, `doc_text`, with `doc_text` boosted by 3.

In [24]:
relevance_total = []
fields = ["subheader", "doc_text^3"]

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_text_search(question=q['question'], fields = fields, index_name=es_index_name)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [25]:
text_2_hit_rate = hit_rate(relevance_total)
text_2_mrr = mrr(relevance_total)

evaluation_dict['text_search_2'] = { 'hit_rate': text_2_hit_rate,
                                         'mrr': text_2_mrr}

Using text fields `subheader`, `doc_text`, with `subheader` boosted by 0.5 (less important) and `doc_text` boosted by 3.

In [26]:
relevance_total = []
fields = ["subheader^0.5", "doc_text^3"]

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_text_search(question=q['question'], fields = fields, index_name=es_index_name)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [27]:
text_3_hit_rate = hit_rate(relevance_total)
text_3_mrr = mrr(relevance_total)

evaluation_dict['text_search_3'] = { 'hit_rate': text_3_hit_rate,
                                         'mrr': text_3_mrr}

### Setup vector-based search (one vector field)

Given an embedded question `question_vec`, perform a vector-based (kNN) search that retrieves the 5 results whose `doc_text_vec` field best matches the question.

The fields returned in the results are specified in the `result_fields` parameter.

In [28]:
def es_vector_search_one(question_vec, index_name, result_fields):
    knn = {
        "field": "doc_text_vec",
        "query_vector": question_vec,
        "k": 5,
        "num_candidates": 10000,
    }

    search_query = {
        "knn": knn,
        "_source": result_fields
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

### Evaluate vector-based search (one vector field)
* Evaluate vector-based search using ground truth data and one vector field `doc_text_vec`.
* Save evaluation results for comparison later.

In [29]:
result_fields = ['doc_id', 'header', 'subheader', 'document', 'doc_text']

In [30]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_vector_search_one(question_vec=q['question_vec'], index_name=es_index_name, result_fields=result_fields)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [31]:
vec_1_hit_rate = hit_rate(relevance_total)
vec_1_mrr = mrr(relevance_total)
evaluation_dict['vec_search_1'] = { 'hit_rate': vec_1_hit_rate,
                                         'mrr': vec_1_mrr}

### Setup vector-based search (2 vector fields) baseline

Given an embedded question `question_vec`, perform a vector-based search that retrieves the results that best match the question on the fields `subheader_vec` and `doc_text_vec`.

The fields returned in the results are specified in the `result_fields` parameter.

**NOTE**: 
* Equal weights are assigned to `subheader_vec` and `doc_text_vec`.
* $+ 1$ is added in the script because Elasticsearch rejects negative scores with a BadRequestException, and cosineSimilarity can return negative values.
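To see how far the $+1$ offset goes, the possible range of the script score can be worked out from the weights, given that cosine similarity lies in $[-1, 1]$. The helper name `script_score_bounds` is an assumption made for this sketch. It shows that an unweighted sum of two similarities can in theory still drop below zero, whereas weights that sum to 1 keep the score within $[0, 2]$.

```python
def script_score_bounds(weights, offset=1.0):
    """Lowest and highest value of sum(w * cosine) + offset, with cosine in [-1, 1]."""
    spread = sum(abs(w) for w in weights)
    return (offset - spread, offset + spread)


print(script_score_bounds([1, 1]))           # (-1.0, 3.0)
print(script_score_bounds([0.2, 0.2, 0.6]))  # (0.0, 2.0)
```

In practice the similarities between questions and documents are rarely strongly negative, which is presumably why the offset of 1 was sufficient in these runs.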

In [32]:
def es_vector_search_two_baseline(question_vec, index_name, result_fields):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {
                                "match_all": {},
                            },
                            "script": {
                                "source": """
                                    cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {
                                    "query_vector": question_vec
                                }
                            }
                        }
                    }
                ],
            }
        },
        "_source": result_fields
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [33]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_vector_search_two_baseline(question_vec=q['question_vec'], index_name=es_index_name, result_fields=result_fields)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [34]:
vec_2_baseline_hit_rate = hit_rate(relevance_total)
vec_2_baseline_mrr = mrr(relevance_total)
evaluation_dict['vec_search_2_baseline'] = { 'hit_rate': vec_2_baseline_hit_rate,
                                         'mrr': vec_2_baseline_mrr}

### Setup vector-based search (2 vector fields)

Given an embedded question `question_vec`, perform a vector-based search that retrieves the results that best match the question on the fields `subheader_vec` and `doc_text_vec`.

The fields returned in the results are specified in the `result_fields` parameter.

**NOTE**: 
* `doc_text_vec` is given 3 times the weight of `subheader_vec`.
* $+ 1$ is added in the script because Elasticsearch rejects negative scores with a BadRequestException, and cosineSimilarity can return negative values.

In [35]:
def es_vector_search_two(question_vec, index_name, result_fields):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {
                                "match_all": {},
                            },
                            "script": {
                                "source": """
                                    cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    3 * cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {
                                    "query_vector": question_vec
                                }
                            }
                        }
                    }
                ],
            }
        },
        "_source": result_fields
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [36]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_vector_search_two(question_vec=q['question_vec'], index_name=es_index_name, result_fields=result_fields)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [37]:
vec_2_hit_rate = hit_rate(relevance_total)
vec_2_mrr = mrr(relevance_total)
evaluation_dict['vec_search_2'] = { 'hit_rate': vec_2_hit_rate,
                                         'mrr': vec_2_mrr}

### Setup vector-based search (3 vector fields) baseline

Given an embedded question `question_vec`, perform a vector-based search that retrieves the results that best match the question on the fields `header_vec`, `subheader_vec`, and `doc_text_vec`.

The fields returned in the results are specified in the `result_fields` parameter.

**NOTE**: 
* Equal weights are assigned to `header_vec`, `subheader_vec`, and `doc_text_vec`.
* $+ 1$ is added in the script because Elasticsearch rejects negative scores with a BadRequestException, and cosineSimilarity can return negative values.

In [38]:
def es_vector_search_three_baseline(question_vec, index_name, result_fields):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {
                                "match_all": {},
                            },
                            "script": {
                                "source": """
                                    cosineSimilarity(params.query_vector, 'header_vec') + 
                                    cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {
                                    "query_vector": question_vec
                                }
                            }
                        }
                    }
                ],
            }
        },
        "_source": result_fields
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

### Evaluate vector-based search (3 vector fields) baseline
* Evaluate vector-based search using ground truth data and 3 vector fields `header_vec`, `subheader_vec`, `doc_text_vec`.
* Save evaluation results for comparison later.

In [39]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_vector_search_three_baseline(question_vec=q['question_vec'], index_name=es_index_name, result_fields=result_fields)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [40]:
vec_3_baseline_hit_rate = hit_rate(relevance_total)
vec_3_baseline_mrr = mrr(relevance_total)
evaluation_dict['vec_search_3_baseline'] = { 'hit_rate': vec_3_baseline_hit_rate,
                                         'mrr': vec_3_baseline_mrr}

### Setup vector-based search (3 vector fields)

Given an embedded question `question_vec`, perform a vector-based search that retrieves the results that best match the question on the fields `header_vec`, `subheader_vec`, and `doc_text_vec`.

The fields returned in the results are specified in the `result_fields` parameter.

**NOTE**: 
* `doc_text_vec` is given 3 times the weight of `header_vec` and `subheader_vec` (0.6 versus 0.2 each).
* $+ 1$ is added in the script because Elasticsearch rejects negative scores with a BadRequestException, and cosineSimilarity can return negative values. With these weights summing to 1, the shifted score stays within $[0, 2]$.

In [41]:
def es_vector_search_three(question_vec, index_name, result_fields):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {
                                "match_all": {},
                            },
                            "script": {
                                "source": """
                                    0.2 * cosineSimilarity(params.query_vector, 'header_vec') + 
                                    0.2 * cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    0.6 * cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {
                                    "query_vector": question_vec
                                }
                            }
                        }
                    }
                ],
            }
        },
        "_source": result_fields
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [42]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_vector_search_three(question_vec=q['question_vec'], index_name=es_index_name, result_fields=result_fields)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [43]:
vec_3_hit_rate = hit_rate(relevance_total)
vec_3_mrr = mrr(relevance_total)
evaluation_dict['vec_search_3'] = { 'hit_rate': vec_3_hit_rate,
                                         'mrr': vec_3_mrr}

### Evaluation Results 1

In [44]:
df_eval = pd.DataFrame(evaluation_dict).T
df_eval.reset_index().sort_values(by=['hit_rate','mrr'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 8 | vec_search_3 | 0.770909 | 0.584442 |
| 5 | vec_search_2_baseline | 0.767792 | 0.562771 |
| 6 | vec_search_2 | 0.766753 | 0.58419 |
| 7 | vec_search_3_baseline | 0.758961 | 0.553801 |
| 0 | text_search_baseline | 0.756364 | 0.605351 |
| 4 | vec_search_1 | 0.728312 | 0.570017 |
| 1 | text_search_1 | 0.70961 | 0.585229 |
| 2 | text_search_2 | 0.70961 | 0.585229 |
| 3 | text_search_3 | 0.70961 | 0.585229 |


In [45]:
df_eval.reset_index().sort_values(by=['hit_rate'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 8 | vec_search_3 | 0.770909 | 0.584442 |
| 5 | vec_search_2_baseline | 0.767792 | 0.562771 |
| 6 | vec_search_2 | 0.766753 | 0.58419 |
| 7 | vec_search_3_baseline | 0.758961 | 0.553801 |
| 0 | text_search_baseline | 0.756364 | 0.605351 |
| 4 | vec_search_1 | 0.728312 | 0.570017 |
| 1 | text_search_1 | 0.70961 | 0.585229 |
| 2 | text_search_2 | 0.70961 | 0.585229 |
| 3 | text_search_3 | 0.70961 | 0.585229 |


In [46]:
df_eval.reset_index().sort_values(by=['mrr'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 0 | text_search_baseline | 0.756364 | 0.605351 |
| 1 | text_search_1 | 0.70961 | 0.585229 |
| 2 | text_search_2 | 0.70961 | 0.585229 |
| 3 | text_search_3 | 0.70961 | 0.585229 |
| 8 | vec_search_3 | 0.770909 | 0.584442 |
| 6 | vec_search_2 | 0.766753 | 0.58419 |
| 4 | vec_search_1 | 0.728312 | 0.570017 |
| 5 | vec_search_2_baseline | 0.767792 | 0.562771 |
| 7 | vec_search_3_baseline | 0.758961 | 0.553801 |


So far, **vector-based search using 3 vector fields with more weight on `doc_text_vec`** (`vec_search_3`) produced the best **Hit Rate**.

For **MRR**, however, **text-based search using 3 fields of equal weights** (`text_search_baseline`) performed best.

This suggests trying hybrid search to see whether it can combine the advantages of both methods.

### Setup hybrid search

Take the best vector search setup and the best text search setup from the evaluation above and merge them into a hybrid search.

The parameter `boost` is tuned to adjust the importance of text search in the hybrid search. It defaults to 1.
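In a `bool` query, the scores of the matching `must` clauses are added together, so the hybrid score is roughly `boost` times the text score plus the vector score. Because text (BM25) scores are unbounded while the vector score used here stays within a small range, `boost` controls which side dominates the ranking. The sketch below illustrates this with made-up clause scores; the function name `hybrid_score` and all numbers are hypothetical.

```python
def hybrid_score(bm25_score, vector_score, boost=1.0):
    """Combined score of a bool query with two `must` clauses: clause scores are summed."""
    return boost * bm25_score + vector_score


# Hypothetical clause scores for two documents, for illustration only
doc_a = {"bm25": 12.0, "vector": 1.2}  # strong keyword match, weaker semantic match
doc_b = {"bm25": 8.0, "vector": 1.8}   # weaker keyword match, stronger semantic match

for boost in (1, 0.05):
    a = hybrid_score(doc_a["bm25"], doc_a["vector"], boost)
    b = hybrid_score(doc_b["bm25"], doc_b["vector"], boost)
    print(boost, "doc_a" if a > b else "doc_b")
```

With `boost=1` the keyword-heavy document ranks first, while with `boost=0.05` the semantically closer document does.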

In [47]:
def es_hybrid_search(question, question_vec, index_name, result_fields, boost=1):
    keyword_query = {"query": question,
                    "fields": ["header", "subheader", "doc_text"],
                    "type": "best_fields",
                    "boost": boost,
                    }

    vector_query = {"query": {"match_all": {}},
                    "script": {
                                "source": """
                                    0.2 * cosineSimilarity(params.query_vector, 'header_vec') + 
                                    0.2 * cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    0.6 * cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {"query_vector": question_vec}
                            }
                   }
    
    hybrid_query = {
        "bool": {
            "must": [{"multi_match": keyword_query},
                     {"script_score": vector_query}
                    ]
            },
        }

    search_query = {
        "query": hybrid_query,
        "size": 5,
        "_source": result_fields
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [48]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search(q['question'], q['question_vec'], es_index_name, result_fields)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [49]:
hybrid_1_hit_rate = hit_rate(relevance_total)
hybrid_1_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_1'] = { 'hit_rate': hybrid_1_hit_rate,
                                     'mrr': hybrid_1_mrr}

In [50]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search(q['question'], q['question_vec'], es_index_name, result_fields, boost=0.5)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [51]:
hybrid_2_hit_rate = hit_rate(relevance_total)
hybrid_2_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_2'] = { 'hit_rate': hybrid_2_hit_rate,
                                     'mrr': hybrid_2_mrr}

In [52]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search(q['question'], q['question_vec'], es_index_name, result_fields, boost=2)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [53]:
hybrid_3_hit_rate = hit_rate(relevance_total)
hybrid_3_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_3'] = { 'hit_rate': hybrid_3_hit_rate,
                                     'mrr': hybrid_3_mrr}

### Evaluation Results 2

In [54]:
df_eval = pd.DataFrame(evaluation_dict).T
df_eval.reset_index().sort_values(by=['hit_rate','mrr'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 8 | vec_search_3 | 0.770909 | 0.584442 |
| 10 | hybrid_search_2 | 0.77039 | 0.616649 |
| 5 | vec_search_2_baseline | 0.767792 | 0.562771 |
| 6 | vec_search_2 | 0.766753 | 0.58419 |
| 9 | hybrid_search_1 | 0.764675 | 0.609342 |
| 11 | hybrid_search_3 | 0.761558 | 0.605723 |
| 7 | vec_search_3_baseline | 0.758961 | 0.553801 |
| 0 | text_search_baseline | 0.756364 | 0.605351 |
| 4 | vec_search_1 | 0.728312 | 0.570017 |
| 1 | text_search_1 | 0.70961 | 0.585229 |


In **hybrid_search_2**, `boost` was 0.5. Since decreasing the importance of text search in hybrid search produced not only **the best MRR** but also a **hit rate close to that of vector search with 3 vector fields**, lower boost values are evaluated next.

In [55]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search(q['question'], q['question_vec'], es_index_name, result_fields, boost=0.1)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [56]:
hybrid_4_hit_rate = hit_rate(relevance_total)
hybrid_4_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_4'] = { 'hit_rate': hybrid_4_hit_rate,
                                     'mrr': hybrid_4_mrr}

In [57]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search(q['question'], q['question_vec'], es_index_name, result_fields, boost=0.05)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [58]:
hybrid_5_hit_rate = hit_rate(relevance_total)
hybrid_5_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_5'] = { 'hit_rate': hybrid_5_hit_rate,
                                     'mrr': hybrid_5_mrr}

In [59]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search(q['question'], q['question_vec'], es_index_name, result_fields, boost=0.01)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [60]:
hybrid_6_hit_rate = hit_rate(relevance_total)
hybrid_6_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_6'] = { 'hit_rate': hybrid_6_hit_rate,
                                     'mrr': hybrid_6_mrr}

### Evaluation Results 3

In [61]:
df_eval = pd.DataFrame(evaluation_dict).T
df_eval.reset_index().sort_values(by=['hit_rate','mrr'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 13 | hybrid_search_5 | 0.848312 | 0.679853 |
| 14 | hybrid_search_6 | 0.837403 | 0.673688 |
| 12 | hybrid_search_4 | 0.81974 | 0.654753 |
| 8 | vec_search_3 | 0.770909 | 0.584442 |
| 10 | hybrid_search_2 | 0.77039 | 0.616649 |
| 5 | vec_search_2_baseline | 0.767792 | 0.562771 |
| 6 | vec_search_2 | 0.766753 | 0.58419 |
| 9 | hybrid_search_1 | 0.764675 | 0.609342 |
| 11 | hybrid_search_3 | 0.761558 | 0.605723 |
| 7 | vec_search_3_baseline | 0.758961 | 0.553801 |


In [62]:
df_eval.reset_index().sort_values(by=['hit_rate'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 13 | hybrid_search_5 | 0.848312 | 0.679853 |
| 14 | hybrid_search_6 | 0.837403 | 0.673688 |
| 12 | hybrid_search_4 | 0.81974 | 0.654753 |
| 8 | vec_search_3 | 0.770909 | 0.584442 |
| 10 | hybrid_search_2 | 0.77039 | 0.616649 |
| 5 | vec_search_2_baseline | 0.767792 | 0.562771 |
| 6 | vec_search_2 | 0.766753 | 0.58419 |
| 9 | hybrid_search_1 | 0.764675 | 0.609342 |
| 11 | hybrid_search_3 | 0.761558 | 0.605723 |
| 7 | vec_search_3_baseline | 0.758961 | 0.553801 |


In [63]:
df_eval.reset_index().sort_values(by=['mrr'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 13 | hybrid_search_5 | 0.848312 | 0.679853 |
| 14 | hybrid_search_6 | 0.837403 | 0.673688 |
| 12 | hybrid_search_4 | 0.81974 | 0.654753 |
| 10 | hybrid_search_2 | 0.77039 | 0.616649 |
| 9 | hybrid_search_1 | 0.764675 | 0.609342 |
| 11 | hybrid_search_3 | 0.761558 | 0.605723 |
| 0 | text_search_baseline | 0.756364 | 0.605351 |
| 1 | text_search_1 | 0.70961 | 0.585229 |
| 2 | text_search_2 | 0.70961 | 0.585229 |
| 3 | text_search_3 | 0.70961 | 0.585229 |


## Evaluation Results conclusion

The best retrieval method for these documents is **hybrid search** (`hybrid_search_5`, Hit Rate 0.848312, MRR 0.679853), consisting of:
* text-based search on fields `header`, `subheader`, `doc_text` with a `boost` of 0.05
* vector-based search on fields `header_vec`, `subheader_vec`, `doc_text_vec` (with `doc_text_vec` given 3 times the weight of each of the other two)

```python
    keyword_query = {"query": question,
                    "fields": ["header", "subheader", "doc_text"],
                    "type": "best_fields",
                    "boost": 0.05,
                    }

    vector_query = {"query": {"match_all": {}},
                    "script": {
                                "source": """
                                    0.2 * cosineSimilarity(params.query_vector, 'header_vec') + 
                                    0.2 * cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    0.6 * cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {"query_vector": question_vec}
                            }
                   }
```