# Introduction

This tutorial demonstrates how to use PreciseSearch to build a semantic search application that retrieves the most relevant passages for a given query.

### Implementation Steps
1. Set Up the Knowledge Base – Build a knowledge base from the [MS MARCO dataset](https://huggingface.co/datasets/microsoft/ms_marco), which contains real Bing search queries paired with relevant passages.
2. Embedding Generation – Generate vector embeddings for queries and passages using the [Sentence Transformers library](https://sbert.net/).
3. Indexing for Search – Build an efficient search index with PreciseSearch's SDK for fast and accurate retrieval.

By the end, you'll have a fully functional semantic search system capable of understanding natural language queries. 🚀 

Note: the `sentence-transformers` library is a good choice for experimenting with different embedding models and for private deployments. If you lack GPU access, consider using an integrated embedding model from [VectorStackAI](https://vectorstack.ai) or a cloud-based API such as [OpenAI’s](https://platform.openai.com/docs/guides/embeddings) or [Google’s](https://ai.google.dev/gemini-api/docs/embeddings) for efficient inference.

## 1. Install dependencies

To get started, install the required Python packages:

- `sentence-transformers`: Sentence Transformers Python SDK for generating embeddings for queries and passages.
- `vectorstackai`: A package for building and querying semantic search indexes. It provides seamless integration with PreciseSearch, a high-performance search solution from [VectorStackAI](https://vectorstack.ai), designed for efficient and accurate retrieval.
- `datasets`, `huggingface_hub`, `fsspec`: Libraries to load and process the MS MARCO dataset.


Run the following command to install the packages (note this may take some time):


In [None]:
# Install sentence-transformers, vectorstackai 
%pip install sentence-transformers vectorstackai
# Install datasets, huggingface_hub and fsspec for loading the MS MARCO dataset
%pip install -q datasets huggingface_hub fsspec


## 2. Download and prepare the dataset (MS MARCO)

In [2]:
# Download and prepare the dataset (MS MARCO) from Hugging Face
from datasets import load_dataset
ds = load_dataset("microsoft/ms_marco", "v1.1")


In [3]:
# We will use the first 100 samples from the training set
ds_tutorial = ds['train'].select(range(100))

sample = ds_tutorial[1]

# Each sample contains:
# query: the question
# passages: a dict whose 'passage_text' entry lists the passages paired with the question
query = sample['query']
list_of_passages = sample['passages']['passage_text']

print('Query: ', query)
print('List of passages: ')
for passage in list_of_passages:
    print(passage)
    print('-'*100)

Query:  was ronald reagan a democrat
List of passages: 
In his younger years, Ronald Reagan was a member of the Democratic Party and campaigned for Democratic candidates; however, his views grew more conservative over time, and in the early 1960s he officially became a Republican. In November 1984, Ronald Reagan was reelected in a landslide, defeating Walter Mondale and his running mate Geraldine Ferraro (1935-), the first female vice-presidential candidate from a major U.S. political party.
----------------------------------------------------------------------------------------------------
From Wikipedia, the free encyclopedia. A Reagan Democrat is a traditionally Democratic voter in the United States, especially a white working-class Northerner, who defected from their party to support Republican President Ronald Reagan in either or both the 1980 and 1984 elections. During the 1980 election a dramatic number of voters in the U.S., disillusioned with the economic 'malaise' of the 1970
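
Each sample carries more than the passage text. As a minimal sketch, assuming the `passages` field also holds an `is_selected` list parallel to `passage_text` (this matches the Hugging Face v1.1 release as far as we know, but verify it against your copy), the passages marked as selected can be separated out as follows. The `toy_sample` dictionary and the `selected_passages` helper are hypothetical and only mirror that layout.

```python
# Hypothetical sample mirroring the assumed MS MARCO layout (not real data)
toy_sample = {
    "query": "example question",
    "passages": {
        "passage_text": ["first passage", "second passage", "third passage"],
        "is_selected": [0, 1, 0],  # assumed field: 1 marks a selected passage
    },
}

def selected_passages(sample):
    """Return the passages whose is_selected flag equals 1."""
    passages = sample["passages"]
    return [
        text
        for text, flag in zip(passages["passage_text"], passages["is_selected"])
        if flag == 1
    ]

print(selected_passages(toy_sample))
```

This tutorial indexes every passage regardless of the flag, so the filter is optional.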

In [4]:
# We will collect all the passage texts to make our knowledge base
knowledge_base = []
for sample in ds_tutorial:
    list_of_passages = sample['passages']['passage_text']
    knowledge_base.append(list_of_passages)

# Flatten the list of passages
knowledge_base = [item for sublist in knowledge_base for item in sublist]

print('Collected ', len(knowledge_base), ' passages for our knowledge base')

Collected  814  passages for our knowledge base
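
Flattening discards which query each passage came from. If that provenance matters later (for example, to store it as metadata), one option is to build a parallel list while flattening. The sketch below uses made-up stand-in data rather than `ds_tutorial`; the names `toy_samples`, `toy_knowledge_base`, and `toy_source_queries` are illustrative only.

```python
# Toy stand-in for ds_tutorial: each sample has a query and a list of passages
toy_samples = [
    {"query": "q1", "passages": {"passage_text": ["a", "b"]}},
    {"query": "q2", "passages": {"passage_text": ["c"]}},
]

toy_knowledge_base = []   # flat list of passage texts
toy_source_queries = []   # toy_source_queries[i] is the query passage i came from

for sample in toy_samples:
    for passage in sample["passages"]["passage_text"]:
        toy_knowledge_base.append(passage)
        toy_source_queries.append(sample["query"])

print("Collected", len(toy_knowledge_base), "passages")
```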


## 3. Generate embeddings for the knowledge base

Now that we have our knowledge base, we need to generate embeddings for each passage.

We will use the `msmarco-MiniLM-L6-cos-v5` model from the Sentence Transformers library to generate an embedding for each passage.

In [9]:
from sentence_transformers import SentenceTransformer

# Instantiate the MiniLM-L6 model from sentence-transformers 
model = SentenceTransformer('sentence-transformers/msmarco-MiniLM-L6-cos-v5')

# Generate embeddings for the knowledge base
print('Generating embeddings for the knowledge base...')
all_embeddings = model.encode(knowledge_base, show_progress_bar=True)

# Convert from numpy array to List[List[float]]
all_embeddings = all_embeddings.tolist()


print('Embedding generation is done!')
# Embeddings are now stored in `all_embeddings` as a list of lists of floats
print(f'Generated embeddings for {len(all_embeddings)} passages') 
print(f'Dimensions of the embeddings: {len(all_embeddings[0])}')

Generating embeddings for the knowledge base...


Batches: 100%|██████████████████████████████████████████████████████████████████| 26/26 [00:00<00:00, 43.61it/s]

Embedding generation is done!
Generated embeddings for 814 passages
Dimensions of the embeddings: 384





## 4. Create a search index

With our knowledge base ready, it's time to build a search index for efficient retrieval.
We'll use **PreciseSearch** to create a high-performance index, enabling fast and accurate passage retrieval.

In [10]:
from vectorstackai import PreciseSearch
import time

precise_search_client = PreciseSearch(
    api_key="your_api_key"
    ) # Replace with your own PreciseSearch API key

# Create the index 
precise_search_client.create_index(index_name="ms_marco_dense_index", 
                                   dimension=len(all_embeddings[0]), # The dimension of the embeddings
                                   metric="cosine", # The metric to use for the search
                                   features_type="dense") # The type of features to use for the search

# Wait for the index to be ready
while precise_search_client.index_status(index_name="ms_marco_dense_index") != "ready":
    print("Index is not ready yet. Waiting for 2 seconds...")
    time.sleep(2)
    
# Connect to the index
index = precise_search_client.connect_to_index(index_name="ms_marco_dense_index")
print('Connected to the index')

Request accepted: Index creation for 'ms_marco_dense_index' started.
Index is not ready yet. Waiting for 2 seconds...
Connected to the index
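
The polling loop above waits indefinitely if the index never becomes ready. A minimal sketch of a bounded wait is shown below. `wait_until` is a hypothetical helper, not part of the `vectorstackai` package, and the fake status check only stands in for a real call to `index_status`.

```python
import time

def wait_until(is_ready, timeout_s=120.0, poll_interval_s=2.0):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval_s)

# Stand-in status check that reports readiness on the third call
calls = {"n": 0}

def fake_status_is_ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(fake_status_is_ready, timeout_s=1.0, poll_interval_s=0.01))
```

With the client from this tutorial, the condition could be written as `lambda: precise_search_client.index_status(index_name="ms_marco_dense_index") == "ready"`.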


## 5. Upload dense embeddings to the index

Upserting into a dense index requires a batch of IDs and dense vectors. Optionally, you can provide metadata for each passage.

- Required inputs:
   - batch_ids (List[str]): Unique identifiers for each passage.
   - batch_vectors (List[List[float]]): Embeddings for each passage.
- Optional inputs:
   - batch_metadata (List[Dict[str, Any]]): Metadata for each passage. It can store any additional information about the items in the batch.

In [11]:
# Convert the data in the required format
batch_ids = [str(i) for i in range(len(knowledge_base))]
batch_vectors = all_embeddings
batch_metadata = [{"text": passage} for passage in knowledge_base]

index.upsert(batch_ids=batch_ids, 
             batch_vectors=batch_vectors, 
             batch_metadata=batch_metadata)
print('Embeddings uploaded to the index')

Embeddings uploaded to the index
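
Before uploading, it can help to check that the three lists line up, and for larger knowledge bases to split the upload into smaller chunks. The sketch below is pure Python and makes no calls to the SDK. `validate_batch` and `chunked` are hypothetical helpers, and whether PreciseSearch limits the batch size is not stated here, so the chunk size is an arbitrary assumption.

```python
def validate_batch(batch_ids, batch_vectors, batch_metadata=None):
    """Raise ValueError if the batch is internally inconsistent."""
    if len(batch_ids) != len(batch_vectors):
        raise ValueError("batch_ids and batch_vectors differ in length")
    if batch_metadata is not None and len(batch_metadata) != len(batch_ids):
        raise ValueError("batch_metadata and batch_ids differ in length")
    if len(set(batch_ids)) != len(batch_ids):
        raise ValueError("batch_ids contains duplicates")
    if len({len(vector) for vector in batch_vectors}) > 1:
        raise ValueError("vectors differ in dimension")

def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy data standing in for batch_ids / batch_vectors / batch_metadata
toy_ids = ["0", "1", "2"]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_metadata = [{"text": "a"}, {"text": "b"}, {"text": "c"}]

validate_batch(toy_ids, toy_vectors, toy_metadata)
id_chunks = list(chunked(toy_ids, 2))  # chunk size 2 is arbitrary
print(id_chunks)
```

Each chunk of IDs, vectors, and metadata could then be passed to `index.upsert` in turn.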


## 6. Querying the index

Now that we have uploaded the data to the index, we can query it. 

Given a text query, we first compute its embedding and then search the index for the most relevant passages.

In [13]:
query = "How should I store cooked food"

# Compute the embedding for the query
query_embedding = model.encode(query).tolist()

# Search for the most relevant passages
search_results = index.search(query_vector=query_embedding,
                              top_k=5)

# Print the results
for result in search_results:
    print(result['id'])
    print(result['similarity'])
    print(result['metadata'])
    print('-'*100)

664
0.65966796875
{'text': 'Cooked Food: Leftover, cooked foods should be kept in the refrigerator in an airtight container and eaten within 4-5 days. Food, whether cooked or not, should not be left at room temperature for more than 4 hours otherwise the risk of food poisoning increases. You can refrigerate them for up to 5 days as long as they are stored properly — wrapped in a paper towel and then sealed inside a plastic bag. Fresh Beans/Peas: Depending on the variety, they can be kept, tightly wrapped, in the refrigerator for 3-5 days.'}
----------------------------------------------------------------------------------------------------
663
0.60888671875
{'text': 'Kitchen Fact: Cooked food stored in the refrigerator should be eaten in 3 to 4 days. After food is cooked, it should sit out at room temperature no more than two hours before being refrigerated to slow down bacteria growth. But once stored in the fridge, leftovers should be eaten up within three to four days because bacter

The search results are returned as a list of dictionaries, where each dictionary contains the following keys:

- `id`: The identifier of the matching document/vector.
- `similarity`: The similarity score for the match (higher is typically more relevant).
- `metadata`: Any additional metadata stored with the vector (only returned if `return_metadata=True`).

In the search results shown above, we can note the following:

- The results are sorted by similarity score in descending order, so the first result is the most relevant.
- The results are relevant to the query: both top passages describe how long cooked food can be kept in the refrigerator.
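
To make the ranking concrete, here is a brute-force reference for cosine-similarity search over a handful of toy vectors. It is only a sketch of what a cosine ranking computes, not a description of how PreciseSearch works internally. The helper names are hypothetical, and the result keys simply mirror the `id` and `similarity` keys described above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vector, ids, vectors, top_k=5):
    """Score every vector against the query and return the top_k matches."""
    scored = [
        {"id": id_, "similarity": cosine_similarity(query_vector, vector)}
        for id_, vector in zip(ids, vectors)
    ]
    # Highest similarity first, as in the search results above
    scored.sort(key=lambda result: result["similarity"], reverse=True)
    return scored[:top_k]

# Toy 2-dimensional vectors (made-up data)
toy_ids = ["0", "1", "2"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

results = brute_force_search([1.0, 0.0], toy_ids, toy_vectors, top_k=2)
for result in results:
    print(result["id"], round(result["similarity"], 4))
```

A brute-force scan like this compares the query against every stored vector, which is fine for a few hundred passages but scales linearly with the size of the knowledge base.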

## 7. Clean up
Feel free to experiment with the query and inspect the results. You can also try other models available via the [Sentence Transformers library](https://sbert.net/docs/sentence_transformer/pretrained_models.html#).


Once you are done with the tutorial, you can delete the index.

In [14]:
precise_search_client.delete_index(index_name="ms_marco_dense_index")

Are you sure you want to delete index 'ms_marco_dense_index'? This action is irreversible.
Index deletion for 'ms_marco_dense_index' started.
