# Introduction
In this tutorial, we will build a "Hybrid Search" application that retrieves the most relevant passages for a given query. 

### Why Hybrid Search?
Traditional search setups typically rely on either dense or sparse embeddings, each with its own strengths:

- **Dense embeddings** capture semantic meaning, helping the model understand contextual similarities.
- **Sparse embeddings** emphasize explicit keyword matches, making them useful for handling rare or domain-specific terms.

Hybrid search leverages both approaches, allowing for a more balanced and effective retrieval system. By adjusting the weight of each component, you can fine-tune search performance to better suit your needs.

### When is Hybrid Search Useful?

A common limitation of using only dense embeddings is that they may struggle with rare or domain-specific terms. 
Since dense models focus on semantic meaning, they might overlook **crucial keywords** that are 
essential for certain types of searches.

Hybrid search is especially useful when:

- You are using pre-trained embedding models to generate dense representations.
- Your text data contains a high number of rare or domain-specific terms (e.g., medical manuals, legal documents).

For example, in legal or medical texts, dense embeddings alone may struggle to distinguish between highly specific technical terms, reducing search precision.

By combining dense embeddings for contextual understanding with sparse embeddings for keyword precision, hybrid search can retrieve passages that match a query in both meaning and exact terminology.



### Implementation Steps
1. Setup Knowledge Base 
    - We will use the [MS MARCO dataset](https://huggingface.co/datasets/microsoft/ms_marco), which contains real Bing search queries paired with relevant passages to construct a knowledge base.
2. Embedding Generation
    - Dense Embeddings – We will use a pre-trained `sentence-transformers` model to generate dense embeddings for the queries and passages.
    - Sparse Embeddings – We will use a BERT tokenizer to build token-frequency sparse embeddings for the queries and passages.
3. Indexing for Search 
    - We will build a hybrid search index with PreciseSearch's SDK for fast and accurate retrieval using both dense and sparse embeddings.

By the end, you'll have a fully functional search system capable of understanding natural language queries.

## 1. Install dependencies

To get started, install the required Python packages:

- `datasets`: A library to load and process the MS MARCO dataset.
- `sentence-transformers`: A library for generating dense embeddings for queries and passages. It also installs `transformers`, which provides the BERT tokenizer used later for the sparse embeddings.
- `vectorstackai`: A package for building and querying semantic search indexes. It provides seamless integration with PreciseSearch, a high-performance search solution from [VectorStackAI](https://vectorstack.ai), designed for efficient and accurate retrieval.


Run the following command to install the packages (note this may take some time):


In [1]:
%pip install -q sentence-transformers datasets vectorstackai

Note: you may need to restart the kernel to use updated packages.


## 2. Download and prepare the dataset (MS MARCO)

In [5]:
# Download and prepare the dataset (MS MARCO) from Hugging Face
from datasets import load_dataset
ds = load_dataset("microsoft/ms_marco", "v1.1")

In [6]:
# We will use the first 100 samples from the training set
ds_tutorial = ds['train'].select(range(100))

sample = ds_tutorial[1]

# Each sample contains:
# query: the question
# list_of_passages: a list of passages which are relevant to the question
query = sample['query']
list_of_passages = sample['passages']['passage_text']

print('Query: ', query)
print('List of passages: ')
for passage in list_of_passages:
    print(passage)
    print('-'*100)

Query:  was ronald reagan a democrat
List of passages: 
In his younger years, Ronald Reagan was a member of the Democratic Party and campaigned for Democratic candidates; however, his views grew more conservative over time, and in the early 1960s he officially became a Republican. In November 1984, Ronald Reagan was reelected in a landslide, defeating Walter Mondale and his running mate Geraldine Ferraro (1935-), the first female vice-presidential candidate from a major U.S. political party.
----------------------------------------------------------------------------------------------------
From Wikipedia, the free encyclopedia. A Reagan Democrat is a traditionally Democratic voter in the United States, especially a white working-class Northerner, who defected from their party to support Republican President Ronald Reagan in either or both the 1980 and 1984 elections. During the 1980 election a dramatic number of voters in the U.S., disillusioned with the economic 'malaise' of the 1970

In [7]:
# We will collect all the passage texts to make our knowledge base
knowledge_base = []
for sample in ds_tutorial:
    list_of_passages = sample['passages']['passage_text']
    knowledge_base.append(list_of_passages)

# Flatten the list of passages
knowledge_base = [item for sublist in knowledge_base for item in sublist]

print('Collected ', len(knowledge_base), ' passages for our knowledge base')

Collected  814  passages for our knowledge base


## 3. Generate dense and sparse embeddings for the knowledge base

Now that we have our knowledge base, we need to generate both dense and sparse embeddings for each passage.


### Dense Embeddings

We will use the `msmarco-MiniLM-L6-cos-v5` model from `sentence-transformers` to generate dense embeddings for each passage.

In [8]:
from sentence_transformers import SentenceTransformer

# Instantiate the MiniLM-L6 model from sentence-transformers 
model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5')

# Generate embeddings for the knowledge base
print('Generating embeddings for the knowledge base...')
dense_embeddings = model.encode(knowledge_base, show_progress_bar=True)

# Convert from numpy array to List[List[float]]
dense_embeddings = dense_embeddings.tolist()


print('Embedding generation is done!')
# Embeddings are stored in `dense_embeddings` as a List[List[float]]
print(f'Generated embeddings for {len(dense_embeddings)} passages') 
print(f'Dimensions of the embeddings: {len(dense_embeddings[0])}')

Generating embeddings for the knowledge base...


Batches: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 39.03it/s]

Embedding generation is done!
Generated embeddings for 814 passages
Dimensions of the embeddings: 384





### Sparse Embeddings

Now, we will generate sparse embeddings for the knowledge base.
In general, there are multiple ways to generate sparse embeddings, from BM25 to TF-IDF, to recent state-of-the-art transformer-based models like SPLADE. All of these models transform the text into a sparse vector space.

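To keep things simple in this tutorial, we will use a BERT-based tokenizer to generate sparse embeddings: each token ID serves as an index, and the number of times the token occurs in the text is its value.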
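For comparison with the tokenizer-based approach used in this tutorial, here is a minimal sketch of TF-IDF weighting in plain Python. The function name `tfidf_sparse_embeddings`, the whitespace tokenization, and the smoothed IDF formula are illustrative assumptions; none of them come from PreciseSearch or from the rest of this tutorial.

```python
import math
from collections import Counter

def tfidf_sparse_embeddings(list_of_texts):
    """Return (indices, values, vocab) with TF-IDF weights for each text."""
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer
    tokenized = [text.lower().split() for text in list_of_texts]

    # Assign each distinct token an integer index in order of first appearance
    vocab = {}
    for tokens in tokenized:
        for token in tokens:
            vocab.setdefault(token, len(vocab))

    # Document frequency: number of texts that contain each token
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    n_docs = len(tokenized)

    # Smoothed inverse document frequency (one common variant)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}

    list_of_sparse_indices, list_of_sparse_values = [], []
    for tokens in tokenized:
        counts = Counter(tokens)
        list_of_sparse_indices.append([vocab[t] for t in counts])
        list_of_sparse_values.append([counts[t] * idf[t] for t in counts])
    return list_of_sparse_indices, list_of_sparse_values, vocab

indices, values, vocab = tfidf_sparse_embeddings(
    ["aspirin dose", "aspirin aspirin warning"]
)
print(indices)
print(values)
```

Tokens that appear in fewer texts receive a larger weight than tokens that appear everywhere. With a scheme like this, queries would need to be encoded with the same vocabulary and IDF values as the passages.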

**Important**: 

Regardless of the method used to generate the sparse embeddings, PreciseSearch expects the sparse embeddings to be represented by two lists:
- sparse_indices (List[int]): contains the token indices for the corresponding text.
- sparse_values (List[float]): contains the value for each index (in this tutorial, the frequency of each token in the corresponding text).

For example, the sparse vector [0, 0, 1.2, 0 , 2.3] can be represented by:
- sparse_indices = [2, 4]
- sparse_values = [1.2, 2.3]
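The conversion in the example above can be sketched in a few lines of Python. The helper names `to_sparse` and `to_dense` are hypothetical and exist only for illustration; they are not part of the PreciseSearch SDK.

```python
from typing import List, Tuple

def to_sparse(vector: List[float]) -> Tuple[List[int], List[float]]:
    """Return the positions and values of the non-zero entries of a vector."""
    indices = [i for i, value in enumerate(vector) if value != 0]
    values = [float(vector[i]) for i in indices]
    return indices, values

def to_dense(indices: List[int], values: List[float], size: int) -> List[float]:
    """Rebuild the full vector from its sparse representation."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense

sparse_indices, sparse_values = to_sparse([0, 0, 1.2, 0, 2.3])
print(sparse_indices)  # [2, 4]
print(sparse_values)   # [1.2, 2.3]
```

Only the non-zero positions are stored, which is what keeps the representation compact when the vocabulary is large and each text uses a small part of it.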

In [9]:
from collections import Counter
from transformers import BertTokenizer

def generate_sparse_embeddings_with_bert_tokenizer(list_of_texts):
    """
    Generate sparse embeddings for a list of texts using the BERT tokenizer.
    
    This function tokenizes input texts using the BERT tokenizer and constructs sparse 
    vector representations based on token frequency. Each token ID serves as an index 
    in the sparse vector, and its frequency in the text determines its corresponding value.

    Args:
        list_of_texts (List[str]): A list of textual inputs to be tokenized.

    Returns:
        list_of_sparse_indices (List[List[int]]): A list where each sublist contains 
            unique token indices for the corresponding text.
        list_of_sparse_values (List[List[float]]): A list where each sublist contains 
            the frequency of each token in the corresponding text.
    """
    # Load the BERT tokenizer (pretrained on 'bert-base-uncased')
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    # Lists to store sparse representations for each text
    list_of_sparse_indices = []
    list_of_sparse_values = []

    for text in list_of_texts:
        # Tokenize the text without padding, truncating to a maximum of 512 tokens
        output = tokenizer(text, padding=False, truncation=True, max_length=512, return_tensors='np')
        token_ids = output['input_ids'][0].tolist()  # Token IDs as a list of Python ints

        # Count occurrences of each token ID in the text
        token_counts = Counter(token_ids)

        # Store the unique token IDs (sparse indices) and their counts as floats (sparse values)
        list_of_sparse_indices.append(list(token_counts.keys()))
        list_of_sparse_values.append([float(count) for count in token_counts.values()])

    return list_of_sparse_indices, list_of_sparse_values

# Generate sparse embeddings for the knowledge base
sparse_indices_knowledge_base, sparse_values_knowledge_base = generate_sparse_embeddings_with_bert_tokenizer(knowledge_base)

## 4. Create a search index

With our knowledge base and its dense and sparse embeddings ready, it's time to build a search index for efficient retrieval.
We'll use **PreciseSearch** to create a high-performance index, enabling fast and accurate passage retrieval.

In [12]:
from vectorstackai import Client
import time

precise_search_client = Client(
    api_key="temp_api_key"
    ) # Replace with your own PreciseSearch API key

# Create the index 
precise_search_client.create_index(index_name="ms_marco_hybrid_index", 
                                   dimension=len(dense_embeddings[0]), # The dimension of the dense embeddings
                                   metric="dotproduct", # The metric to use for the search
                                   features_type="hybrid") # The type of features we are going to build the index on

# Wait for the index to be ready
while precise_search_client.get_index_info(index_name="ms_marco_hybrid_index")['status'] != "ready":
    time.sleep(2)
    print("Index is not ready yet. Waiting for 2 seconds...")
    
# Connect to the index
index = precise_search_client.connect_to_index(index_name="ms_marco_hybrid_index")
print('Connected to the index')

Request accepted: Index creation for 'ms_marco_hybrid_index' started.
Index is not ready yet. Waiting for 2 seconds...
Connected to the index
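The polling loop above waits indefinitely if the index never reaches the `ready` state. One possible variation, sketched below, adds a timeout. The helper `wait_until_ready` and its parameters are hypothetical; it takes any zero-argument function that returns the current status string, so the example runs with a stand-in status function instead of a real client.

```python
import time

def wait_until_ready(get_status, timeout_s=120.0, poll_interval_s=2.0):
    """Poll get_status() until it returns "ready" or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if get_status() == "ready":
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Index was not ready after {timeout_s} seconds")
        time.sleep(poll_interval_s)

# Stand-in for a real status call, used only to demonstrate the helper
fake_statuses = iter(["creating", "creating", "ready"])
wait_until_ready(lambda: next(fake_statuses), timeout_s=5.0, poll_interval_s=0.0)
print("Index is ready")

# With the client from this tutorial, the call might look like:
# wait_until_ready(
#     lambda: precise_search_client.get_index_info(
#         index_name="ms_marco_hybrid_index")['status']
# )
```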


## 5. Upload the dense and sparse embeddings to the index

Upserting into a hybrid index requires a batch of IDs, dense vectors, and sparse vectors.

- Required inputs:
  - batch_ids (List[str]): Unique identifiers for each passage.
  - batch_vectors (List[List[float]]): Dense embeddings for each passage.
  - batch_sparse_indices (List[List[int]]): Sparse indices for each passage.
  - batch_sparse_values (List[List[float]]): Sparse values for each passage.
- Optional inputs:
  - batch_metadata (List[Dict[str, Any]]): Metadata for each passage, can store any additional information about the items in the batch
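Before uploading, it can help to check that the lists line up with each other. The sketch below is a hypothetical helper, `validate_hybrid_batch`, that checks the properties implied by the input description above. These checks are assumptions about what a consistent batch looks like, not documented PreciseSearch requirements.

```python
def validate_hybrid_batch(batch_ids, batch_vectors, batch_sparse_indices,
                          batch_sparse_values, dimension):
    """Raise ValueError if the batch lists are inconsistent with each other."""
    n = len(batch_ids)
    if len(set(batch_ids)) != n:
        raise ValueError("batch_ids must be unique")
    if not (len(batch_vectors) == len(batch_sparse_indices)
            == len(batch_sparse_values) == n):
        raise ValueError("all batch lists must have the same length")

    for item_id, vector, indices, values in zip(
            batch_ids, batch_vectors, batch_sparse_indices, batch_sparse_values):
        # Every dense vector must match the dimension the index was created with
        if len(vector) != dimension:
            raise ValueError(
                f"item {item_id}: expected {dimension} dense values, got {len(vector)}")
        # Each sparse index needs exactly one sparse value
        if len(indices) != len(values):
            raise ValueError(
                f"item {item_id}: {len(indices)} sparse indices but {len(values)} sparse values")

# Tiny illustrative batch with 2-dimensional dense vectors
validate_hybrid_batch(
    batch_ids=["0", "1"],
    batch_vectors=[[0.1, 0.2], [0.3, 0.4]],
    batch_sparse_indices=[[2, 4], [7]],
    batch_sparse_values=[[1.0, 2.0], [1.0]],
    dimension=2,
)
print("Batch looks consistent")
```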



In [13]:
# Convert the data in the required format
batch_ids = [str(i) for i in range(len(knowledge_base))]
batch_vectors = dense_embeddings
batch_sparse_indices = sparse_indices_knowledge_base
batch_sparse_values = sparse_values_knowledge_base
batch_metadata = [{"text": passage} for passage in knowledge_base]

index.upsert(batch_ids=batch_ids, 
             batch_vectors=batch_vectors, 
             batch_sparse_values=batch_sparse_values,
             batch_sparse_indices=batch_sparse_indices,
             batch_metadata=batch_metadata)
print('Embeddings uploaded to the index')

Embeddings uploaded to the index


## 6. Querying the index

Now that we have uploaded the knowledge base and its embeddings to the index, we can query it.

Given a text query, we first compute its dense and sparse embeddings and then search the index for the most relevant passages.

In [15]:
query = "How should I store cooked food"

# Compute the dense embedding for the query
query_embedding = model.encode(query).tolist()

# Compute the sparse embedding for the query
query_sparse_indices, query_sparse_values = generate_sparse_embeddings_with_bert_tokenizer([query])

# Get the first and only element from the list
query_sparse_indices = query_sparse_indices[0]
query_sparse_values = query_sparse_values[0]

# Search for the most relevant passages
search_results = index.search(query_vector=query_embedding, 
                              query_sparse_indices=query_sparse_indices,
                              query_sparse_values=query_sparse_values,
                              top_k=5
                              )

# Print the results
for result in search_results:
    print(result['id'])
    print(result['similarity'])
    print(result['metadata'])
    print('-'*100)

285
13.2890625
{'text': 'Cook Pork and Pork Products to an internal temperature of at least 155°F. Pork ordered medium should be cooked to at least 155°F. Pork ordered well done should be cooked to at least 170°F. Temperatures should be taken at the thickest portion of pork. Meat should be firm, not mushy. Juices should be clear, not pink. Cook Eggs and Foods Containing Raw Eggs to an internal temperature of at least 145°F. This requirement does not apply to foods made with pasteurized eggs. Temperatures should be taken in the center of the egg-containing food. Cooked egg whites and yolks should be firm after cooking, not runny..'}
----------------------------------------------------------------------------------------------------
663
11.609375
{'text': 'Kitchen Fact: Cooked food stored in the refrigerator should be eaten in 3 to 4 days. After food is cooked, it should sit out at room temperature no more than two hours before being refrigerated to slow down bacteria growth. But once st

The search results are returned as a list of dictionaries, where each dictionary contains the following keys:

- `id`: The identifier of the matching document/vector.
- `similarity`: The similarity score for the match (higher is typically more relevant).
- `metadata`: Any additional metadata stored with the vector (only returned if `return_metadata=True`).

In the search results shown above, we can note the following:

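- The results are sorted by similarity score in descending order, so the first result has the highest score.
- The results are topically related to the query: the second result (id 663) directly addresses storing cooked food, while the top result (id 285) covers cooking temperatures.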
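Because each result is a dictionary with `id`, `similarity`, and `metadata` keys, a small helper can make long passages easier to scan. The function `summarize_results` below is a hypothetical convenience, not part of the SDK. The sample list reuses the ids, scores, and opening sentences from the output above, and treating the ids as strings is an assumption.

```python
def summarize_results(search_results, max_chars=60):
    """Return one short line per result: rank, id, score, and a text snippet."""
    lines = []
    for rank, result in enumerate(search_results, start=1):
        # Metadata may be absent, so fall back to an empty dictionary
        text = (result.get('metadata') or {}).get('text', '')
        snippet = text if len(text) <= max_chars else text[:max_chars] + '...'
        lines.append(
            f"{rank}. id={result['id']} similarity={result['similarity']:.2f} {snippet}")
    return lines

# Sample data copied from the first two results shown above (text shortened)
sample_results = [
    {'id': '285', 'similarity': 13.2890625,
     'metadata': {'text': 'Cook Pork and Pork Products to an internal temperature of at least 155°F.'}},
    {'id': '663', 'similarity': 11.609375,
     'metadata': {'text': 'Kitchen Fact: Cooked food stored in the refrigerator should be eaten in 3 to 4 days.'}},
]

for line in summarize_results(sample_results):
    print(line)
```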



## 7. Adjusting the importance of dense and sparse embeddings
When searching in a hybrid index, the similarity score between a query and a document is computed as a weighted sum of the query–document similarity in both dense and sparse vector spaces:

$$
\text{similarity} =  \text{dense\_similarity} \cdot \text{dense\_scale} + \text{sparse\_similarity} \cdot \text{sparse\_scale}
$$

where `dense_scale` and `sparse_scale` are the weights for the dense and sparse embeddings, respectively.
By default, both weights are set to 1.0. You can change the weights by calling the `set_similarity_scale` method.
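To see how the weights can change a ranking, the sketch below applies the formula to two made-up documents. The similarity values are illustrative, not output from the index, and the helper names are hypothetical. Using a dot product for the sparse part is also an assumption, chosen because the index in this tutorial was created with `metric="dotproduct"`.

```python
def sparse_dot(indices_a, values_a, indices_b, values_b):
    """Dot product of two sparse vectors given as (indices, values) pairs."""
    lookup = dict(zip(indices_b, values_b))
    return sum(value * lookup.get(i, 0.0) for i, value in zip(indices_a, values_a))

def hybrid_similarity(dense_similarity, sparse_similarity,
                      dense_scale=1.0, sparse_scale=1.0):
    """Weighted sum of the dense and sparse similarities."""
    return dense_similarity * dense_scale + sparse_similarity * sparse_scale

def rank(documents, dense_scale=1.0, sparse_scale=1.0):
    """Return document names ordered from highest to lowest hybrid similarity."""
    scored = {
        name: hybrid_similarity(dense, sparse, dense_scale, sparse_scale)
        for name, (dense, sparse) in documents.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# Made-up (dense_similarity, sparse_similarity) pairs for two documents
documents = {"A": (0.9, 1.0), "B": (0.3, 2.0)}

print(rank(documents))                    # B leads when both weights are 1.0
print(rank(documents, sparse_scale=0.5))  # A leads once sparse counts for half
```

With equal weights, document B wins on keyword overlap. Halving `sparse_scale` lets document A's stronger semantic match take first place.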

In [25]:
# To make dense embeddings twice as important as sparse embeddings, for example, set the weights as follows:
index.set_similarity_scale(dense_scale=1.0, sparse_scale=0.5)

# Search again
search_results = index.search(query_vector=query_embedding, 
                              query_sparse_indices=query_sparse_indices,
                              query_sparse_values=query_sparse_values,
                              top_k=5)

# Print the results
for result in search_results:
    print(result['id'])
    print(result['similarity'])
    print(result['metadata'])
    print('-'*100)

285
6.79296875
{'text': 'Cook Pork and Pork Products to an internal temperature of at least 155°F. Pork ordered medium should be cooked to at least 155°F. Pork ordered well done should be cooked to at least 170°F. Temperatures should be taken at the thickest portion of pork. Meat should be firm, not mushy. Juices should be clear, not pink. Cook Eggs and Foods Containing Raw Eggs to an internal temperature of at least 145°F. This requirement does not apply to foods made with pasteurized eggs. Temperatures should be taken in the center of the egg-containing food. Cooked egg whites and yolks should be firm after cooking, not runny..'}
----------------------------------------------------------------------------------------------------
663
6.109375
{'text': 'Kitchen Fact: Cooked food stored in the refrigerator should be eaten in 3 to 4 days. After food is cooked, it should sit out at room temperature no more than two hours before being refrigerated to slow down bacteria growth. But once sto

## 8. Clean up
Feel free to experiment with different queries and weights to adjust the importance of dense and sparse embeddings.
Once you are done with the tutorial, you can delete the index.

In [11]:
precise_search_client.delete_index(index_name="ms_marco_hybrid_index")

Are you sure you want to delete index 'ms_marco_hybrid_index'? This action is irreversible.
Index deletion for 'ms_marco_hybrid_index' started.
