# Hybrid Search with Document Reranking

This notebook explores whether document reranking can further improve the performance of hybrid search.

The best hybrid search method identified in the notebook `retrieval_evaluation.ipynb` is used here as the baseline. A hybrid search with document reranking is then set up and evaluated alongside the baseline for comparison.

In [1]:
import os
import json
import pandas as pd
import numpy as np

from elasticsearch import Elasticsearch
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer

## Index data into Elasticsearch

The data is reindexed here so that the notebook can also be run standalone, outside the project flow.

In [2]:
folder = "../data/"
documents_file = "az900_notes_with_vectors.pkl"

In [3]:
df = pd.read_pickle(f"{folder}{documents_file}").set_index("doc_id")

In [4]:
model_name = 'sentence-transformers/all-MiniLM-L12-v2'
vec_dim = SentenceTransformer(model_name).get_sentence_embedding_dimension()



In [5]:
def infer_es_mapping(df, vec_dim):
    '''
    Accepts a pandas DataFrame and the vector embedding dimension, and generates the Elasticsearch index mapping properties dynamically.
    The DataFrame index is mapped as a keyword type.
    Fields ending with "_vec" are mapped as dense vector types, while all others are mapped as text types.
    '''
    similarity = 'cosine'
    es_mapping = {}

    # Map the named DataFrame index (here, doc_id) as a keyword field
    if df.index.name is not None:
        es_mapping[df.index.name] = {'type': 'keyword'}

    # Vector columns become dense_vector fields; everything else is text
    for col in df.columns:
        if col.endswith("_vec"):
            es_mapping[col] = {'type': 'dense_vector', 'dims': vec_dim, 'index': True, 'similarity':similarity}
        else:
            es_mapping[col] = {'type': 'text'}
    return es_mapping

In [6]:
es_mapping_properties = infer_es_mapping(df, vec_dim)

In [7]:
es_client = Elasticsearch('http://localhost:9200') 

In [8]:
def create_es_index(es_mapping_properties, index_name):
    index_settings = {
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 0
        },
        "mappings": {
            "properties": es_mapping_properties
        }
    }

    es_client.indices.delete(index=index_name, ignore_unavailable=True)
    es_client.indices.create(index=index_name, body=index_settings)

In [9]:
es_index_name = "az900_course_notes"
create_es_index(es_mapping_properties, es_index_name)

In [10]:
documents = pd.read_pickle(f"{folder}{documents_file}").to_dict(orient="records")
for doc in tqdm(documents):
    es_client.index(index=es_index_name, document=doc)

  0%|          | 0/385 [00:00<?, ?it/s]

## Load ground truth data

In [11]:
ground_truth_file = "ground-truth-data.pkl"
ground_truth = pd.read_pickle(f"{folder}{ground_truth_file}").to_dict(orient="records")

### Set up Hybrid search (no reranking)

As mentioned above, this search serves as the baseline for comparison.

In [12]:
def es_hybrid_search(question, question_vec, index_name, boost=0.05):
    keyword_query = {"query": question,
                    "fields": ["header", "subheader", "doc_text"],
                    "type": "best_fields",
                    "boost": boost,
                    }

    vector_query = {"query": {"match_all": {}},
                    "script": {
                                "source": """
                                    0.2 * cosineSimilarity(params.query_vector, 'header_vec') + 
                                    0.2 * cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    0.6 * cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {"query_vector": question_vec}
                            }
                   }
    
    hybrid_query = {
        "bool": {
            "must": [{"multi_match": keyword_query},
                     {"script_score": vector_query}
                    ]
            },
        }

    search_query = {
        "query": hybrid_query,
        "size": 5,
        "_source": ['doc_id', 'header', 'subheader', 'document', 'doc_text']
    }

    es_results = es_client.search(
        index=index_name,
        body=search_query
    )
    
    result_docs = []
    
    for hit in es_results['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs
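
To make the scoring of the baseline query easier to follow, here is a minimal pure-Python sketch of how the two `must` clauses are expected to combine. It assumes that Elasticsearch sums the scores of `must` clauses in a `bool` query and that `boost` scales the keyword score linearly. The helper names (`cosine`, `vector_score`, `hybrid_score`), the toy vectors and the BM25 value of 8.0 are hypothetical and not taken from the index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def vector_score(query_vec, header_vec, subheader_vec, doc_text_vec):
    """Weighted cosine similarity, mirroring the script_score source above.

    The trailing + 1 shifts the range from [-1, 1] to [0, 2], which keeps
    the script score non-negative.
    """
    return (
        0.2 * cosine(query_vec, header_vec)
        + 0.2 * cosine(query_vec, subheader_vec)
        + 0.6 * cosine(query_vec, doc_text_vec)
        + 1
    )


def hybrid_score(keyword_score, vec_score, boost=0.05):
    """Assumed combination: boosted keyword score plus vector score."""
    return boost * keyword_score + vec_score


# Toy 2-dimensional vectors (assumption; the real embeddings are much larger)
query_vec = [1.0, 0.0]
vec_part = vector_score(query_vec, [1.0, 0.0], [0.0, 1.0], [1.0, 1.0])

# 8.0 is a made-up BM25 score for illustration only
total = hybrid_score(keyword_score=8.0, vec_score=vec_part)
print(round(vec_part, 4), round(total, 4))
```

With the small `boost` of 0.05, the keyword score mostly acts as a tie-breaker on top of the vector score, while the `must` clause still requires every returned document to match the keyword query.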

### Set up Hybrid Search with Document Reranking using Reciprocal Rank Fusion (RRF)

* Run the text-based search and the vector search separately, retrieving the top 10 results from each.
* Fuse the two result sets by reranking their documents using the RRF method.
* Return the top 5 results from the reranked documents.

### How RRF works

For every document that appears in a result list, each appearance is scored with a reciprocal rank term, `1 / (k + rank)`, where `rank` is the document's 1-based position in that list and `k` is a smoothing constant.
A document can appear in more than one result list, in which case the terms of all its appearances are summed, so documents that both searches rank highly receive a higher score.

RRF is used to rerank the search results for a **single query**:

* Retrieve text-based results using ES search with `text_search_query`.
* Retrieve vector-based results using ES search with `vector_search_query`.
* Initialise an empty dictionary `rrf_scores` that stores key-value pairs of `_id` : `score`. Here `_id` is the unique document ID that Elasticsearch generates during indexing, not the `doc_id`.
* For each result in the vector-based results (already sorted, top 10):
    * Compute its RRF score.
    * Write a key-value pair of `_id` : `score` into the dictionary `rrf_scores`.
* For each result in the text-based results (already sorted, top 10):
    * Compute its RRF score.
    * If `rrf_scores[_id]` already exists, add the new score to it; otherwise write a key-value pair of `_id` : `score` into the dictionary `rrf_scores`.
* Sort the dictionary `rrf_scores` by `score` in descending order and return the `_source` (which includes `doc_id`) of the top 5 documents for this query.
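
The steps above can be sketched on toy data without Elasticsearch. This is only an illustration: the helper names `rrf_term` and `fuse_rankings` and the single-letter ids are hypothetical, and the ids stand in for the `_id` values that the two searches would return.

```python
def rrf_term(rank, k=60):
    """Contribution of one appearance at a 1-based rank."""
    return 1 / (k + rank)


def fuse_rankings(rankings, k=60, top_n=5):
    """Sum the RRF terms of every appearance and keep the top_n ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # First appearance creates the entry, later ones add to it
            scores[doc_id] = scores.get(doc_id, 0.0) + rrf_term(rank, k)
    # Highest fused score first
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_n]


# Toy result lists, best hit first (assumed ids, not real Elasticsearch _id values)
vector_ranking = ["a", "b", "c"]
text_ranking = ["b", "d", "a"]

fused = fuse_rankings([vector_ranking, text_ranking], k=60, top_n=3)
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here `b` ends up first because it sits at ranks 2 and 1, which slightly outweighs `a` at ranks 1 and 3. `d` and `c` each appear in only one list and so receive a single term.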

In [13]:
def compute_rrf(rank, k=60):
    """ Our own implementation of the RRF relevance score for a single 1-based rank """
    return 1 / (k + rank)

In [14]:
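def es_hybrid_search_rrf(question, question_vec, index_name, k=10):
    # k is the RRF constant passed to compute_rrf (overriding its default of 60),
    # not the number of results; each search below retrieves its top 10 hits.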
    text_search_query = {
        "size":10,
        "query" :{
        "bool": {
            "must": {"multi_match": {"query": question,
                    "fields": ["header", "subheader", "doc_text"],
                    "type": "best_fields",
                    "boost": 0.05,
                    }
                },
            },
        }
    }

    vector_search_query = {
        "size": 10,
        "query": {
            "bool": {
                "must": [
                    {
                        "script_score": {
                            "query": {
                                "match_all": {},
                            },
                            "script": {
                                "source": """
                                    0.2 * cosineSimilarity(params.query_vector, 'header_vec') + 
                                    0.2 * cosineSimilarity(params.query_vector, 'subheader_vec') + 
                                    0.6 * cosineSimilarity(params.query_vector, 'doc_text_vec') + 1
                                """,
                                "params": {
                                    "query_vector": question_vec
                                }
                            }
                        }
                    }
                ],
            }
        },
    }

    text_results = es_client.search(
        index=index_name,
        body=text_search_query
    )['hits']['hits']

    vector_results = es_client.search(
        index=index_name,
        body=vector_search_query
    )['hits']['hits']
    

    rrf_scores = {}
    # Calculate RRF using vector search results
    for rank, hit in enumerate(vector_results):
        doc_id = hit['_id']
        rrf_scores[doc_id] = compute_rrf(rank + 1, k)

    # Adding text search result scores
    for rank, hit in enumerate(text_results):
        doc_id = hit['_id']
        if doc_id in rrf_scores:
            rrf_scores[doc_id] += compute_rrf(rank + 1, k)
        else:
            rrf_scores[doc_id] = compute_rrf(rank + 1, k)

    # Sort RRF scores in descending order
    reranked_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Get top-5 documents by the score
    final_results = []
    for doc_id, score in reranked_docs[:5]:
        doc = es_client.get(index=index_name, id=doc_id)
        final_results.append(doc['_source'])
    
    return final_results

### Set up Hit Rate and MRR evaluations

In [15]:
def hit_rate(relevance_total):
    relevant_count = 0

    for line in relevance_total:
        if True in line:
            relevant_count = relevant_count + 1

    return relevant_count / len(relevance_total)

In [16]:
def mrr(relevance_total):
    total_score = 0.0

    for line in relevance_total:
        for rank in range(len(line)):
            if line[rank]:
                total_score = total_score + 1 / (rank + 1)
                break

    return total_score / len(relevance_total)
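
As a quick sanity check of the two metrics, the sketch below restates them compactly under hypothetical names (`hit_rate_sketch`, `mrr_sketch`) and applies them to a made-up relevance matrix. Each inner list holds the relevance flags of one query's results in rank order; the data is illustrative only.

```python
def hit_rate_sketch(relevance_total):
    """Fraction of queries with at least one relevant result."""
    hits = sum(1 for line in relevance_total if any(line))
    return hits / len(relevance_total)


def mrr_sketch(relevance_total):
    """Mean of 1 / rank of the first relevant result (0 if none is found)."""
    total = 0.0
    for line in relevance_total:
        for rank, is_relevant in enumerate(line, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance_total)


toy_relevance = [
    [True, False, False],   # relevant document at rank 1, contributes 1
    [False, False, True],   # relevant document at rank 3, contributes 1/3
    [False, False, False],  # relevant document not retrieved, contributes 0
    [False, True, False],   # relevant document at rank 2, contributes 1/2
]

print(hit_rate_sketch(toy_relevance))
print(mrr_sketch(toy_relevance))
```

Three of the four toy queries retrieve their document, so the hit rate is 3/4, while the MRR is (1 + 1/3 + 0 + 1/2) / 4 = 11/24. The MRR is lower because it also penalises relevant documents that appear further down the list.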

Set up a dictionary to store the Hit Rate and MRR of each retrieval method.

In [17]:
evaluation_dict = {}

### Evaluate hybrid search baseline

In [18]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search(question=q['question'], question_vec=q['question_vec'], index_name=es_index_name)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [19]:
hybrid_hit_rate = hit_rate(relevance_total)
hybrid_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_baseline'] = { 'hit_rate': hybrid_hit_rate,
                                         'mrr': hybrid_mrr}

### Evaluate hybrid search with document reranking using RRF

In [20]:
relevance_total = []

for q in tqdm(ground_truth):
    doc_id = q['doc_id']
    results = es_hybrid_search_rrf(question=q['question'], question_vec=q['question_vec'], index_name=es_index_name)
    relevance = [d['doc_id'] == doc_id for d in results]
    relevance_total.append(relevance)

  0%|          | 0/1925 [00:00<?, ?it/s]

In [21]:
hybrid_rrf_hit_rate = hit_rate(relevance_total)
hybrid_rrf_mrr = mrr(relevance_total)
evaluation_dict['hybrid_search_rrf'] = { 'hit_rate': hybrid_rrf_hit_rate,
                                         'mrr': hybrid_rrf_mrr}

### Evaluation Results

In [22]:
df_eval = pd.DataFrame(evaluation_dict).T
df_eval.reset_index().sort_values(by=['hit_rate','mrr'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 0 | hybrid_search_baseline | 0.851429 | 0.682641 |
| 1 | hybrid_search_rrf | 0.830649 | 0.650658 |


In [23]:
df_eval.reset_index().sort_values(by=['hit_rate'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 0 | hybrid_search_baseline | 0.851429 | 0.682641 |
| 1 | hybrid_search_rrf | 0.830649 | 0.650658 |


In [24]:
df_eval.reset_index().sort_values(by=['mrr'], ascending=False)

| | index | hit_rate | mrr |
|---|---|---|---|
| 0 | hybrid_search_baseline | 0.851429 | 0.682641 |
| 1 | hybrid_search_rrf | 0.830649 | 0.650658 |


## Conclusion

Reranking with RRF did not improve on the baseline: the hit rate fell from 0.851 to 0.831 and the MRR from 0.683 to 0.651. The RAG application will therefore keep the hybrid search baseline as its retrieval method.