# Introduction
In this tutorial, we will build a semantic search application that retrieves the most relevant passages for a given query using **PreciseSearch**. 

Unlike traditional retrieval setups, which require you to generate embeddings for both queries and passages yourself, **PreciseSearch** ships with an integrated embedding model, so you never have to generate embeddings manually during upsert or search.
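
For contrast, the manual workflow that an integrated embedding model replaces can be sketched roughly as follows. This is a toy illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and none of these names come from the PreciseSearch SDK.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def manual_search(query, passages, top_k=2):
    # Step 1: embed every passage yourself (normally done at upsert time).
    passage_vectors = [toy_embed(p) for p in passages]
    # Step 2: embed the query yourself (normally done at search time).
    query_vector = toy_embed(query)
    # Step 3: rank passages by similarity to the query and keep the best top_k.
    scored = [(cosine(query_vector, v), p) for v, p in zip(passage_vectors, passages)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

passages = [
    "store cooked food in the refrigerator",
    "ronald reagan was a republican president",
    "cooked food should be kept in an airtight container",
]
results = manual_search("how to store cooked food", passages)
for score, passage in results:
    print(round(score, 3), passage)
```

With an integrated embedding model, steps 1 and 2 happen on the service side, so only raw text needs to be sent.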

### Implementation Steps
1. Set Up Knowledge Base – We will use the [MS MARCO dataset](https://huggingface.co/datasets/microsoft/ms_marco), which contains real Bing search queries paired with relevant passages, to construct a knowledge base.
2. Skip Embedding Generation – Since PreciseSearch handles embeddings internally, we can skip this step entirely! 
3. Indexing for Search – Build an efficient search index with PreciseSearch's SDK for fast and accurate retrieval.

By the end, you'll have a fully functional semantic search system capable of understanding natural language queries. 🚀

## 1. Install dependencies

To get started, install the required Python packages:

- `vectorstackai`: A package for building and querying semantic search indexes. It provides seamless integration with PreciseSearch, a high-performance search solution from [VectorStackAI](https://vectorstack.ai), designed for efficient and accurate retrieval.
- `datasets`, `huggingface_hub`, `fsspec`: Libraries for loading and processing the MS MARCO dataset.

Run the following command to install the packages:


In [1]:
# Install vectorstackai
%pip install -q vectorstackai
# Install datasets, huggingface_hub and fsspec for loading the MS MARCO dataset
%pip install -q -U datasets huggingface_hub fsspec

Note: you may need to restart the kernel to use updated packages.
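
After restarting the kernel, it can be useful to confirm that the packages are visible to the interpreter. A minimal sketch using the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None

# Report the status of each package this tutorial relies on.
for name in ["vectorstackai", "datasets", "huggingface_hub", "fsspec"]:
    print(name, installed_version(name) or "NOT INSTALLED")
```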


## 2. Download and prepare the dataset (MS MARCO)

In [1]:
# Download and prepare the dataset (MS MARCO) from Hugging Face
from datasets import load_dataset
ds = load_dataset("microsoft/ms_marco", "v1.1")



In [2]:
# We will use the first 100 samples from the training set
ds_tutorial = ds['train'].select(range(100))

sample = ds_tutorial[1]

# Each sample contains (among other fields):
# query: the question
# passages: candidate passages for the question; their text is stored under 'passage_text'
query = sample['query']
list_of_passages = sample['passages']['passage_text']

print('Query: ', query)
print('List of passages: ')
for passage in list_of_passages:
    print(passage)
    print('-'*100)

Query:  was ronald reagan a democrat
List of passages: 
In his younger years, Ronald Reagan was a member of the Democratic Party and campaigned for Democratic candidates; however, his views grew more conservative over time, and in the early 1960s he officially became a Republican. In November 1984, Ronald Reagan was reelected in a landslide, defeating Walter Mondale and his running mate Geraldine Ferraro (1935-), the first female vice-presidential candidate from a major U.S. political party.
----------------------------------------------------------------------------------------------------
From Wikipedia, the free encyclopedia. A Reagan Democrat is a traditionally Democratic voter in the United States, especially a white working-class Northerner, who defected from their party to support Republican President Ronald Reagan in either or both the 1980 and 1984 elections. During the 1980 election a dramatic number of voters in the U.S., disillusioned with the economic 'malaise' of the 1970

In [3]:
# Collect all the passage texts into a single flat list to form our knowledge base
knowledge_base = []
for sample in ds_tutorial:
    knowledge_base.extend(sample['passages']['passage_text'])

print('Collected ', len(knowledge_base), ' passages for our knowledge base')

Collected  814  passages for our knowledge base
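
MS MARCO passages may repeat across queries, so a knowledge base built this way can contain duplicates. If that matters for your use case, one option is to deduplicate while preserving order. A minimal sketch on hand-written toy samples that mimic the `passages` / `passage_text` layout used above (the texts are made up for illustration):

```python
from itertools import chain

# Toy samples mimicking the structure of the MS MARCO records used above.
toy_samples = [
    {"query": "q1", "passages": {"passage_text": ["alpha", "beta"]}},
    {"query": "q2", "passages": {"passage_text": ["beta", "gamma"]}},
]

# Flatten the per-sample passage lists into one list.
flat = list(chain.from_iterable(s["passages"]["passage_text"] for s in toy_samples))

# dict.fromkeys keeps the first occurrence of each passage, in order.
unique = list(dict.fromkeys(flat))

print(len(flat), len(unique))  # 4 3
```

Note that deduplicating the real knowledge base would change the passage count reported above.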


## 3. Create a search index

With our knowledge base ready, it's time to build a search index for efficient retrieval.
We'll use **PreciseSearch** to create a high-performance index, enabling fast and accurate passage retrieval.

Because this index uses an integrated embedding model, we must specify which model to use when creating the index.
In the example below, we create a search index using the integrated embedding model `e5-small-v2`.

In [4]:
from vectorstackai import PreciseSearch
import time

precise_search_client = PreciseSearch(
    api_key="your_api_key"
    ) # Replace with your own PreciseSearch API key

# Create the index 
precise_search_client.create_index(index_name="ms_marco_dense_index", 
                                   embedding_model_name="e5-small-v2", # Name of the integrated embedding model
                                   metric="cosine") # Metric for similarity computation

# Wait for the index to be ready, polling its status every 5 seconds
while precise_search_client.index_status(index_name="ms_marco_dense_index") != "ready":
    print("Index is not ready yet. Waiting for 5 seconds...")
    time.sleep(5)
    
# Connect to the index
index = precise_search_client.connect_to_index(index_name="ms_marco_dense_index")
print('Connected to the index')

Request accepted: Index creation for 'ms_marco_dense_index' started.
Index is not ready yet. Waiting for 5 seconds...
Connected to the index
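
The polling loop above waits indefinitely if the index never becomes ready. One possible refinement is to add a timeout. The sketch below is written against a generic `get_status` callable so that it runs on its own; `wait_until_ready`, `timeout_s`, and `poll_interval_s` are hypothetical names, not part of the PreciseSearch SDK, and the `"initializing"` status is invented for the demo.

```python
import time

def wait_until_ready(get_status, timeout_s=300.0, poll_interval_s=5.0):
    """Poll get_status() until it returns "ready" or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status == "ready":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Index not ready after {timeout_s} s (last status: {status!r})"
            )
        time.sleep(poll_interval_s)

# Demo with a fake status source that becomes ready on the third poll.
# "initializing" is a made-up status value used only for this demo.
fake_statuses = iter(["initializing", "initializing", "ready"])
result = wait_until_ready(lambda: next(fake_statuses), timeout_s=5.0, poll_interval_s=0.01)
print(result)
```

With the client from this tutorial, the callable would presumably be `lambda: precise_search_client.index_status(index_name="ms_marco_dense_index")`.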


## 5. Upload passages to the index

PreciseSearch features an integrated embedding model, eliminating the need to manually generate embeddings for your knowledge base. 
During upsert, PreciseSearch automatically generates embeddings, streamlining the process.

To enable this, simply provide the passage text as metadata. PreciseSearch will handle embedding generation seamlessly.

- Required inputs:
  - `batch_ids` (`List[str]`): Unique identifiers for each passage.
  - `batch_metadata` (`List[Dict[str, Any]]`): Metadata for each passage. Contains the text that will be embedded.



In [5]:
# Convert the data in the required format
batch_ids = [str(i) for i in range(len(knowledge_base))]
batch_metadata = [{"text": passage} for passage in knowledge_base]

index.upsert(batch_ids=batch_ids, 
             batch_metadata=batch_metadata)
print('Embeddings uploaded to the index')

Embeddings uploaded to the index
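
For larger knowledge bases, it may be preferable to upsert in smaller batches rather than in a single call. The tutorial does not state whether PreciseSearch imposes a batch-size limit, so treat the batch size below as an arbitrary assumption. `iter_batches` is a hypothetical helper, and `fake_upsert` stands in for `index.upsert` so that the sketch runs without an API key.

```python
def iter_batches(ids, metadata, batch_size):
    """Yield aligned (ids, metadata) slices of at most batch_size items."""
    if len(ids) != len(metadata):
        raise ValueError("ids and metadata must have the same length")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size], metadata[start:start + batch_size]

# Stand-in for index.upsert that only records the size of each batch it receives.
received = []

def fake_upsert(batch_ids, batch_metadata):
    received.append(len(batch_ids))

toy_passages = [f"passage {i}" for i in range(7)]
toy_ids = [str(i) for i in range(len(toy_passages))]
toy_metadata = [{"text": p} for p in toy_passages]

# batch_size=3 is an arbitrary value chosen for the demo.
for ids_chunk, metadata_chunk in iter_batches(toy_ids, toy_metadata, batch_size=3):
    fake_upsert(batch_ids=ids_chunk, batch_metadata=metadata_chunk)

print(received)  # [3, 3, 1]
```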


## 6. Querying the index

Now that we have uploaded the data to the index, we can query it. 

Given a text query, PreciseSearch generates an embedding for it and then searches the index for the most relevant passages.

In [7]:
query = "How should I store cooked food"

# Search for the most relevant passages
search_results = index.search(query_text=query, 
                       top_k=5
                       )

# Print the results
for result in search_results:
    print(result['id'])
    print(result['similarity'])
    print(result['metadata'])
    print('-'*100)

664
0.89013671875
{'text': 'Cooked Food: Leftover, cooked foods should be kept in the refrigerator in an airtight container and eaten within 4-5 days. Food, whether cooked or not, should not be left at room temperature for more than 4 hours otherwise the risk of food poisoning increases. You can refrigerate them for up to 5 days as long as they are stored properly — wrapped in a paper towel and then sealed inside a plastic bag. Fresh Beans/Peas: Depending on the variety, they can be kept, tightly wrapped, in the refrigerator for 3-5 days.'}
----------------------------------------------------------------------------------------------------
291
0.88427734375
{'text': '1 Use a food thermometer to measure the internal temperature of cooked foods. 2  Check the internal temperature in several places to make sure that the meat, poultry, seafood, eggs or dishes containing eggs are cooked to safe minimum internal temperatures as shown in the Safe Cooking Temperatures Chart. To chill foods prop

The search results are returned as a list of dictionaries, where each dictionary contains the following keys:

- `id`: The identifier of the matching document/vector.
- `similarity`: The similarity score for the match (higher is typically more relevant).
- `metadata`: Any additional metadata stored with the vector (only returned if `return_metadata=True`).

In the search results shown above, we can note the following:

- The results are sorted by similarity score, meaning the first result is the most relevant.
- The top results discuss storing and handling cooked food, which matches the intent of the query.
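
Because each result is a plain dictionary, post-processing is straightforward. As an illustration, the sketch below keeps only results above a similarity threshold and prints a short preview of each. The `filter_results` helper and the 0.885 threshold are hypothetical choices; the sample list reuses the ids and scores from the output above with abbreviated text, and assumes ids come back as strings.

```python
def filter_results(results, min_similarity):
    """Keep only results whose similarity score meets the threshold."""
    return [r for r in results if r["similarity"] >= min_similarity]

# Hand-copied from the output above, with the passage text shortened.
# Assumption: ids are returned as strings, matching how they were upserted.
sample_results = [
    {
        "id": "664",
        "similarity": 0.89013671875,
        "metadata": {"text": "Cooked Food: Leftover, cooked foods should be kept in the refrigerator"},
    },
    {
        "id": "291",
        "similarity": 0.88427734375,
        "metadata": {"text": "1 Use a food thermometer to measure the internal temperature"},
    },
]

kept = filter_results(sample_results, min_similarity=0.885)
for r in kept:
    # Print the id, a rounded score, and the first 40 characters of the text.
    print(r["id"], round(r["similarity"], 3), r["metadata"]["text"][:40])
```

A sensible threshold depends on the embedding model and the data, so it is best chosen by inspecting scores for a handful of representative queries.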

## 7. Clean up
Feel free to play with the query and see the results.
Once you are done with the tutorial, you can delete the index.

In [8]:
precise_search_client.delete_index(index_name="ms_marco_dense_index")

Are you sure you want to delete index 'ms_marco_dense_index'? This action is irreversible.


Index deletion for 'ms_marco_dense_index' started.
