# Hybrid Search

Hybrid search combines traditional keyword-based search with semantic search to provide more accurate and relevant results. In the RAG application, it retrieves research articles that match a user query both on exact keywords and on semantic meaning. This makes it particularly useful for complex queries involving nuanced concepts, synonyms, and related ideas.

![Hybrid Search](images/Hybrid_Search.png)


In this notebook, we will delve into the implementation details of the hybrid search approach in the RAG application, exploring how it leverages both keyword-based and semantic search techniques to provide a more effective search experience.

Here are the steps:
* [Loading chunked dataset](#loading-the-chunks-from-the-previous-steps)
* [Sparse Index](#hybrid-search---sparse-index)
* [Dense Index](#hybrid-search---dense-index)
* [Merging Results](#hybrid-search---merging-results)
* [Generating a reply with merged results](#using-merged-results-to-generate-a-reply)



### Visual improvements

We will use the [rich library](https://github.com/Textualize/rich) to make the output more readable, and suppress warning messages.

In [1]:
from rich.console import Console
from rich_theme_manager import Theme, ThemeManager
import pathlib

theme_dir = pathlib.Path("themes")
theme_manager = ThemeManager(theme_dir=theme_dir)
dark = theme_manager.get("dark")

# Create a console with the dark theme
console = Console(theme=dark)

In [2]:
import warnings

# Suppress warnings
warnings.filterwarnings('ignore')

## Hybrid Search - Sparse Index

We will use a BM25-based sparse index to complement the semantic search over the vector database.

In [3]:
import bm25s
from bm25s.tokenization import Tokenizer, Tokenized
import Stemmer  # optional: for stemming

### Loading the chunks from the previous steps

We will use the chunks from the AI Arxiv dataset that we used before. These chunks were split using semantic chunking and enriched with context.

In [4]:
import json
# Load the chunks; the context manager closes the file afterwards
with open('data/corpus.json') as f:
    corpus_json = json.load(f)

### Creating the Sparse Index

We will build an in-memory index with BM25. Many (vector) databases support BM25 natively, and many others support indexing and searching on calculated sparse vectors.
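To make the scoring idea concrete, here is a minimal pure-Python sketch of a BM25 score. It is an illustration under assumptions, not the `bm25s` implementation: the function name `bm25_scores`, the toy documents, the Lucene-style IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all chosen here for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with a basic BM25 formula."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Lucene-style IDF (assumed variant), which stays positive for common terms
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical, already-tokenized toy corpus
toy_docs = [
    ["mixtral", "context", "size", "tokens"],
    ["context", "window", "attention"],
    ["training", "data", "quality"],
]
toy_scores = bm25_scores(["context", "mixtral"], toy_docs)
```

The first toy document matches both query terms and scores highest, the second matches only the more common term, and the third matches nothing and scores zero.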

In this example, we will also define a stemmer and stop-words to clean up the text and better select the tokens/terms that will be indexed in the sparse index.

In [5]:
corpus_text = [doc["text"] for doc in corpus_json]

# optional: create a stemmer
english_stemmer = Stemmer.Stemmer("english")

# Initialize the Tokenizer with the stemmer
sparse_tokenizer = Tokenizer(
    stemmer=english_stemmer,
    lower=True, # lowercase the tokens
    stopwords="english",  # or pass a list of stopwords
    splitter=r"\w+",  # by default r"(?u)\b\w\w+\b", can also be a function
)
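As a rough picture of what lowercasing, stop-word removal, and stemming do to a query, the following sketch uses only the standard library. The stop-word list and the suffix-stripping rule are deliberately tiny and hypothetical; the real `Stemmer` and `Tokenizer` used above apply much more careful rules.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used by bm25s)
TOY_STOPWORDS = {"what", "is", "of", "the", "a", "an"}

def toy_stem(token):
    """Very naive suffix stripping; real stemmers use far more careful rules."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def toy_tokenize(text):
    """Lowercase, split on word characters, drop stop-words, then stem."""
    tokens = re.findall(r"\w+", text.lower())
    return [toy_stem(t) for t in tokens if t not in TOY_STOPWORDS]

toy_tokens = toy_tokenize("What is the context size of Mixtral models?")
```

Here `toy_tokens` keeps only the content-bearing terms, with the plural "models" reduced to "model" so that singular and plural forms map to the same index entry.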

In [6]:
console.print(sparse_tokenizer.stopwords)

In [7]:
# Tokenize the corpus and only keep the ids (faster and saves memory)
corpus_sparse_tokens = (
    sparse_tokenizer
    .tokenize(
        corpus_text, 
        update_vocab=True, # update the vocab as we tokenize
        return_as="ids"
    )
)

# Create the BM25 retriever and attach your corpus_json to it
sparse_index = bm25s.BM25(corpus=corpus_json)
# Now, index the corpus_tokens (the corpus_json is not used yet)
sparse_index.index(corpus_sparse_tokens)

In [8]:
vocab_dict = sparse_tokenizer.get_vocab_dict()
console.print(f"The tokenizer vocabulary includes {len(vocab_dict)} tokens/terms")

focus_token = 'context'
focus_token_index = vocab_dict.get(focus_token)
console.print(f"The index of the {focus_token} is {focus_token_index}")

The tokenizer can encode (convert the text into ids) and decode (convert the ids back into text).

In [9]:
console.print(sparse_tokenizer.decode([[focus_token_index]]))

### Exploring the Sparse Index

In [10]:
console.print(sparse_index.scores)

For each token, the index holds the list of documents (chunks) that include it, and the score of that token in that document (chunk).

In [11]:
from rich.table import Table
from rich.style import Style

token_index = vocab_dict.get(focus_token)
console.print(f"Index of the token `{focus_token}` in the BM25 retriever: {token_index}")
score_index = sparse_index.scores.get('indptr')[token_index]
next_score_index = sparse_index.scores.get('indptr')[token_index+1]

table = Table(title=f"Document Scores for `{focus_token}`")

table.add_column("Document ID", justify="right", style="cyan", no_wrap=True)
table.add_column("Score", justify="right", style="bright_green")

max_score = max(sparse_index.scores['data'][score_index:next_score_index])
# Define styles for specific rows
highlight_style = Style(bgcolor="yellow")

for i in range(score_index, next_score_index):
    doc_id = sparse_index.scores['indices'][i]
    doc_score = sparse_index.scores['data'][i]
    if doc_score == max_score:
        table.add_row(
            str(doc_id),
            str(doc_score), style=highlight_style
        )
    else:
        table.add_row(
            str(doc_id),
            str(doc_score)
        )

console.print(table)
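The `indptr` / `indices` / `data` slicing in the cell above follows the usual compressed sparse column layout. The toy dictionary below is a hypothetical stand-in for `sparse_index.scores`, with made-up values, to show how the boundaries in `indptr` select the postings of one token.

```python
# Toy compressed-sparse-column layout: one "column" per token id.
# Token 0 appears in docs 0 and 2, token 1 in doc 1, token 2 in docs 0, 1 and 2.
toy_index = {
    "indptr": [0, 2, 3, 6],                   # column boundaries into indices/data
    "indices": [0, 2, 1, 0, 1, 2],            # document ids
    "data": [0.9, 0.4, 1.2, 0.3, 0.5, 0.1],   # score of the token in each document
}

def token_postings(index, token_id):
    """Return (doc_id, score) pairs for one token by slicing between two boundaries."""
    start = index["indptr"][token_id]
    end = index["indptr"][token_id + 1]
    return list(zip(index["indices"][start:end], index["data"][start:end]))

postings = token_postings(toy_index, 2)
best_doc, best_score = max(postings, key=lambda pair: pair[1])
```

For token 2 this returns three postings, and the highest-scoring one is the row that the table above would highlight.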

### Searching the Sparse Index

As with the dense index, we need to tokenize and encode the query text:

In [12]:
# Query the corpus
query = "What is context size of Mixtral?"
query_tokens = (
    sparse_tokenizer
    .tokenize(
        [query], 
        update_vocab=False, 
        return_as="ids"
    )
)

console.print(query_tokens)

And use the encoded query to search the sparse index:

In [13]:
# Query the corpus
sparse_results, sparse_scores = sparse_index.retrieve(query_tokens, k=10)

for i in range(sparse_results.shape[1]):
    doc, score = sparse_results[0, i], sparse_scores[0, i]
    console.print(f"Rank {i+1} (score: {score:.2f}): {doc}")

## Hybrid Search - Dense Index

For the Hybrid Search, we also need the dense index using the vector database, as we used in the previous steps. 

### Creating the Dense Index

In [14]:
from qdrant_client import QdrantClient
from qdrant_client.http import models
from sentence_transformers import SentenceTransformer

qdrant_client = QdrantClient(
    ":memory:"
) 

# Create the embedding encoder
dense_encoder = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

In [15]:
collection_name = "hybrid_search"

dense_index = qdrant_client.recreate_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=dense_encoder.get_sentence_embedding_dimension(), # Vector size is defined by the model used
        distance=models.Distance.COSINE
    )
)
print(dense_index)

True


In [16]:
# vectorize!
qdrant_client.upload_points(
    collection_name=collection_name,
    points=[
        models.PointStruct(
            id=idx,
            vector=dense_encoder.encode(doc["text"]).tolist(),
            payload=doc
        ) for idx, doc in enumerate(corpus_json) # corpus_json holds all the enriched chunks
    ]
)

### Searching the Dense Index

We will start with encoding the query with the dense encoder:

In [17]:
query_vector = dense_encoder.encode(query).tolist()

And use the encoded query to search the dense index:

In [18]:
dense_results = qdrant_client.search(
    collection_name=collection_name,
    query_vector=query_vector,
    limit=10
)

In [19]:
console.print(dense_results)
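The collection above is configured with cosine distance, so each hit's score reflects the cosine similarity between the query vector and a chunk vector. The sketch below shows that computation as a brute-force search over three made-up vectors; the helper names and the vectors are hypothetical, and a vector database uses optimized index structures rather than this exhaustive loop.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=2):
    """Score every document against the query and keep the k most similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional embeddings
toy_vectors = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.8, 0.6, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
toy_hits = top_k([1.0, 0.1, 0.0], toy_vectors, k=2)
```

The query points almost exactly along `doc_a`, so it ranks first, followed by `doc_b`; `doc_c` is orthogonal to the query and is left out of the top two.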

## Hybrid Search - Merging Results

There are a few options to merge the results from the two methods (sparse and dense). In this notebook, we will use a simple weighted average.

In [20]:
# Look up chunk texts by id once instead of scanning the corpus for every hit
text_by_id = {doc["id"]: doc["text"] for doc in corpus_json}

# Start from the dense hits, keyed by document id
merged = {}
for hit in dense_results:
    doc_id = hit.payload["id"]
    merged[doc_id] = {
        "id": doc_id,
        "text": text_by_id[doc_id],
        "dense_score": hit.score
    }

# Add the sparse scores, including documents found only by the sparse index
for i, result in enumerate(sparse_results[0]):
    doc_id = result["id"]
    entry = merged.setdefault(doc_id, {"id": doc_id, "text": text_by_id[doc_id]})
    entry["sparse_score"] = float(sparse_scores[0][i])

documents_with_scores = list(merged.values())




In [21]:
console.print(documents_with_scores)

We will normalize the scores of each index, and then calculate a weighted score that gives more weight (0.8) to the dense index.

In [22]:
import numpy as np

# Normalize the two types of scores (a document missing from one index gets 0 there)
dense_values = np.array([doc.get("dense_score", 0) for doc in documents_with_scores])
sparse_values = np.array([doc.get("sparse_score", 0) for doc in documents_with_scores])

def min_max_normalize(values):
    # Scale to [0, 1]; return zeros when all values are equal to avoid dividing by zero
    value_range = np.max(values) - np.min(values)
    return (values - np.min(values)) / value_range if value_range > 0 else np.zeros(len(values))

dense_scores_normalized = min_max_normalize(dense_values)
sparse_scores_normalized = min_max_normalize(sparse_values)

# Calculate a weighted score with alpha of 0.2 to the sparse score
alpha = 0.2
weighted_scores = (1 - alpha) * dense_scores_normalized + alpha * sparse_scores_normalized

# Pick up the top 3 documents with the weighted score
top_docs = sorted(
    zip(
        documents_with_scores, 
        weighted_scores
    ), 
    key=lambda x: x[1], 
    reverse=True
)[:3]
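A small worked example may help show why normalization comes before weighting: BM25 scores and cosine scores live on different scales, so each list is first mapped to the range 0 to 1. The sketch below repeats the same arithmetic on three made-up documents using plain Python lists; the helper names and the score values are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; return zeros if they are all equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def weighted_fusion(dense, sparse, alpha=0.2):
    """Weighted average of normalized scores; alpha is the weight of the sparse score."""
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    return [(1 - alpha) * d + alpha * s for d, s in zip(dense_n, sparse_n)]

# Hypothetical scores for the same three documents in both indexes
toy_dense = [0.80, 0.70, 0.60]   # cosine similarities
toy_sparse = [2.0, 10.0, 0.0]    # BM25 scores on a very different scale
toy_fused = weighted_fusion(toy_dense, toy_sparse, alpha=0.2)
```

With these values the first document ends up at about 0.84 and the second at about 0.60: the second document has by far the best sparse score, but with `alpha=0.2` the dense ranking still dominates the final order.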



In [23]:
console.print(top_docs)
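Weighted averaging is only one of the merging options. A commonly used alternative, which this notebook does not use, is reciprocal rank fusion (RRF): it ignores the raw scores and combines only the rank positions, so no normalization is needed. The sketch below is a minimal version with hypothetical document ids; the constant `k=60` is a conventional choice assumed here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids by summing 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical rankings from the dense and the sparse index
dense_ranking = ["d1", "d2", "d3"]
sparse_ranking = ["d2", "d4", "d1"]
fused_ranking = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Documents that appear near the top of both lists (`d2`, then `d1`) rise above documents that appear in only one list.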

## Using merged results to generate a reply

We can now take the merged results and call the LLM to generate the reply to the user's query.

In [24]:
# define a variable to hold the search results for the generation model
search_results = [doc[0]['text'] for doc in top_docs]

In [25]:
from dotenv import load_dotenv

load_dotenv()

True

In [26]:
# Now time to connect to the large language model
from openai import OpenAI
from rich.text import Text

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a chatbot and a research expert. Your top priority is to help users understand research papers."},
        {"role": "user", "content": query},
        {"role": "assistant", "content": str(search_results)}
    ]
)

response_text = Text(completion.choices[0].message.content)

In [27]:
from rich.panel import Panel

panel = Panel(response_text, title=f"Hybrid Search Reply to \"{query}\"")
console.print(panel)
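The call above passes the retrieved chunks as an assistant message. Another prompt layout, sketched here as an assumption rather than as the notebook's method, places the chunks as numbered context inside the user message so that the model can tell them apart from its own earlier replies. The helper `build_rag_messages` and the wording of the prompts are hypothetical.

```python
def build_rag_messages(question, passages):
    """Build a chat message list with the retrieved passages as numbered context."""
    context = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(passages, start=1))
    return [
        {
            "role": "system",
            "content": "You are a research assistant. Answer using only the numbered context passages.",
        },
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {question}",
        },
    ]

# Placeholder passages stand in for the merged search results
toy_messages = build_rag_messages(
    "What is the context size of Mixtral?",
    ["Passage one.", "Passage two."],
)
```

The resulting list has the same shape as the `messages` argument used above, so it could be passed to the same chat completion call.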

Finally, we save the retrieved documents for use in the next notebook on reranking, which demonstrates a more advanced method for merging hybrid search results.

In [28]:
import json

with open('data/dense_results.json', 'w') as f:
    json.dump([dense_result.payload for dense_result in dense_results], f, default=str)

with open('data/sparse_results.json', 'w') as f:
    json.dump([sparse_result for sparse_result in sparse_results[0]], f, default=str)
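For reference, the saved files can be read back with `json.load`. The sketch below shows the round trip on made-up records in a temporary directory, so it does not touch the real `data/` files; the record contents are hypothetical. Note that `default=str` only affects values that JSON cannot represent natively, which are written as strings.

```python
import json
import os
import tempfile

# Hypothetical records standing in for the retrieved payloads
toy_results = [{"id": 1, "text": "chunk one"}, {"id": 2, "text": "chunk two"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dense_results.json")
    # Write the records the same way as the cell above
    with open(path, "w") as f:
        json.dump(toy_results, f, default=str)
    # Read them back as plain Python lists and dictionaries
    with open(path) as f:
        reloaded = json.load(f)
```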

