# Vector Search using vCore-based Azure Cosmos DB for MongoDB

This notebook demonstrates using an Azure OpenAI embedding model to vectorize documents already stored in Azure Cosmos DB API for MongoDB, storing the embedding vectors and the creation of a vector index. Lastly, the notebook will demonstrate how to query the vector index to find similar documents.

This lab expects the data that was loaded in Lab 2.

In [1]:
import os
import pymongo
import time
import json
from openai import AzureOpenAI
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt

## Load settings

This lab expects the `.env` file that was created in Lab 1 to obtain the connection string for the database.

Add the following entries into the `.env` file to support the connection to Azure OpenAI API, replacing the values for `<your key>` and `<your endpoint>` with the values from your Azure OpenAI API resource.

```text
AOAI_ENDPOINT="<your endpoint>"
AOAI_KEY="<your key>""
```

In [2]:
load_dotenv()
CONNECTION_STRING = os.environ.get("DB_CONNECTION_STRING")
EMBEDDINGS_DEPLOYMENT_NAME = "text-embedding-ada-002"
COMPLETIONS_DEPLOYMENT_NAME = "gpt-35-turbo"
AOAI_ENDPOINT = os.environ.get("AOAI_ENDPOINT")
AOAI_KEY = os.environ.get("AOAI_KEY")
AOAI_API_VERSION = "2023-05-15"

## Establish connectivity to the database

In [3]:
db_client = pymongo.MongoClient(CONNECTION_STRING)
# Create database to hold cosmic works data
# MongoDB will create the database if it does not exist
db = db_client.cosmic_works

## Establish Azure OpenAI connectivity

In [4]:
ai_client = AzureOpenAI(
    azure_endpoint = AOAI_ENDPOINT,
    api_version = AOAI_API_VERSION,
    api_key = AOAI_KEY
    )

## Vectorize and store the embeddings in each document

The process of creating a vector embedding field on each document only needs to be done once. However, if a document changes, the vector embedding field will need to be updated with an updated vector.

In [5]:
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(3))
def generate_embeddings(text: str):
    '''
    Generate embeddings from string of text using the deployed Azure OpenAI API embeddings model.
    This will be used to vectorize document data and incoming user messages for a similarity search with
    the vector index.
    '''
    response = ai_client.embeddings.create(input=text, model=EMBEDDINGS_DEPLOYMENT_NAME)
    embeddings = response.data[0].embedding
    time.sleep(0.5) # rest period to avoid rate limiting on AOAI
    return embeddings

In [6]:
# demonstrate embeddings generation using a test string
test = "hello, world"
print(generate_embeddings(test))

[-0.016783414, -0.006727666, -0.027430676, -0.046463147, -0.01095277, 0.01014025, -0.013910343, -0.0048393696, -0.018681461, -0.0283667, 0.028990716, 0.019799488, -0.021710536, -0.006327906, 0.009522735, 0.0066171633, 0.017589433, -0.014456357, 0.011784791, 0.018460454, -0.012330804, -1.1311802e-05, 0.009295229, -0.009893244, -0.009581236, -0.016315402, 0.006864169, -0.016874416, 0.024388602, -0.037882935, 0.00066789147, 0.0033865834, -0.016081396, -0.0064254086, 0.011115274, -0.011895293, 0.00094739837, -0.02756068, 0.02917272, -0.01138828, 0.0024960616, -0.0070396736, 0.0040821005, -0.013715338, -0.032708805, 0.012623311, 0.008768716, -0.015080372, 0.0042706053, 0.02269856, 0.021606533, 0.0015787265, -0.024401601, -0.0019939241, -0.013097823, 0.008671214, -0.035360873, 0.014703362, 0.019994494, -0.020579508, 0.01582139, 0.0038155941, -0.025324624, 0.0121748, -0.010153251, 0.010328755, 0.015990395, 0.008632213, -0.019539481, 0.01458636, 0.020735512, 0.018993469, -0.0054146335, -0.0077

### Vectorize and update all documents in the Cosmic Works database

In [7]:
def add_collection_content_vector_field(collection_name: str):
    '''
    Add a new field to the collection to hold the vectorized content of each document.
    '''
    collection = db[collection_name]
    bulk_operations = []
    for doc in collection.find():
        # remove any previous contentVector embeddings
        if "contentVector" in doc:
            del doc["contentVector"]

        # generate embeddings for the document string representation
        content = json.dumps(doc, default=str)
        content_vector = generate_embeddings(content)       
        
        bulk_operations.append(pymongo.UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {"contentVector": content_vector}},
            upsert=True
        ))
    # execute bulk operations
    collection.bulk_write(bulk_operations)

In [8]:
# Add vector field to customers documents
#IMPORTANT TO ONLY RUN THIS FOR ALL 3
from pymongo import MongoClient
from pymongo.errors import CursorNotFound
from pymongo import UpdateOne
import numpy as np  # Assuming you have numpy for generating example vectors

# Initialize MongoDB client



def add_collection_content_vector_field(collection_name):
    collection = db[collection_name]
    bulk_operations = []
    batch_size = 100  # Adjust the batch size as needed

    # Generate a placeholder vector of 1536 dimensions
    placeholder_vector = list(np.random.rand(1536))  # Replace with actual vector embeddings

    try:
        cursor = collection.find({}, batch_size=batch_size, no_cursor_timeout=True)

        for doc in cursor:
            # remove any previous contentVector embeddings
            if "contentVector" in doc:
                del doc["contentVector"]
            # Update document with new field (example)
            bulk_operations.append(UpdateOne({"_id": doc["_id"]}, {"$set": {"contentVector": placeholder_vector}}))

            if len(bulk_operations) == batch_size:
                collection.bulk_write(bulk_operations, ordered=False)
                bulk_operations = []

        if bulk_operations:
            collection.bulk_write(bulk_operations, ordered=False)

    except CursorNotFound:
        print("Cursor expired. Retrying...")
        add_collection_content_vector_field(collection_name)
    finally:
        cursor.close()

# Run the function for each collection
add_collection_content_vector_field("products")
add_collection_content_vector_field("customers")
add_collection_content_vector_field("sales")

  return Cursor(self, *args, **kwargs)


In [10]:
#DO NOT RUN THIS DO NOT RUN THIS DO NOT RUN THIS
"""from pymongo import MongoClient
from pymongo.errors import CursorNotFound
from pymongo import UpdateOne



def add_collection_content_vector_field(collection_name):
    collection = db[collection_name]
    bulk_operations = []
   

    try:
        cursor = collection.find({}, batch_size=batch_size, no_cursor_timeout=True)

        for doc in cursor:
            # remove any previous contentVector embeddings
            if "contentVector" in doc:
                del doc["contentVector"]
            # Update document with new field (example)
            bulk_operations.append(UpdateOne({"_id": doc["_id"]}, {"$set": {"contentVector": [0]}}))

            if len(bulk_operations) == batch_size:
                collection.bulk_write(bulk_operations, ordered=False)
                bulk_operations = []

        if bulk_operations:
            collection.bulk_write(bulk_operations, ordered=False)

    except CursorNotFound:
        print("Cursor expired. Retrying...")
        add_collection_content_vector_field(collection_name)
    finally:
        cursor.close()
add_collection_content_vector_field("customers")"""

UnboundLocalError: local variable 'cursor' referenced before assignment

In [11]:
#DO NOT RUN THIS DO NOT RUN THIS DO NOT RUN THIS
"""from pymongo import MongoClient
from pymongo.errors import CursorNotFound
from pymongo import UpdateOne



def add_collection_content_vector_field(collection_name):
    collection = db[collection_name]
    bulk_operations = []
    

    try:
        cursor = collection.find({}, batch_size=batch_size, no_cursor_timeout=True)

        for doc in cursor:
            # remove any previous contentVector embeddings
            if "contentVector" in doc:
                del doc["contentVector"]
            # Update document with new field (example)
            bulk_operations.append(UpdateOne({"_id": doc["_id"]}, {"$set": {"contentVector": [0]}}))

            if len(bulk_operations) == batch_size:
                collection.bulk_write(bulk_operations, ordered=False)
                bulk_operations = []

        if bulk_operations:
            collection.bulk_write(bulk_operations, ordered=False)

    except CursorNotFound:
        print("Cursor expired. Retrying...")
        add_collection_content_vector_field(collection_name)
    finally:
        cursor.close()
add_collection_content_vector_field("sales")"""

UnboundLocalError: local variable 'cursor' referenced before assignment

In [12]:
#START RUNNING AGAIN
# Create the products vector index, above didnt load for products so comment out
db.command({
  'createIndexes': 'products',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the customers vector index
db.command({
  'createIndexes': 'customers',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
        'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

# Create the sales vector index, above didnt work for sales so comment out 
db.command({
  'createIndexes': 'sales',
  'indexes': [
    {
      'name': 'VectorSearchIndex',
      'key': {
        "contentVector": "cosmosSearch"
      },
      'cosmosSearchOptions': {
       'kind': 'vector-ivf',
        'numLists': 1,
        'similarity': 'COS',
        'dimensions': 1536
      }
    }
  ]
})

{'raw': {'defaultShard': {'numIndexesBefore': 1,
   'numIndexesAfter': 2,
   'createdCollectionAutomatically': False,
   'ok': 1}},
 'ok': 1}

## Use vector search in vCore-based Azure Cosmos DB for MongoDB

Now that each document has its associated vector embedding and the vector indexes have been created on each collection, we can now use the vector search capabilities of vCore-based Azure Cosmos DB for MongoDB.

In [13]:
def vector_search(collection_name, query, num_results=3):
    """
    Perform a vector search on the specified collection by vectorizing
    the query and searching the vector index for the most similar documents.

    returns a list of the top num_results most similar documents
    """
    collection = db[collection_name]
    query_embedding = generate_embeddings(query)    
    pipeline = [
        {
            '$search': {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "contentVector",
                    "k": num_results
                },
                "returnStoredSource": True }},
        {'$project': { 'similarityScore': { '$meta': 'searchScore' }, 'document' : '$$ROOT' } }
    ]
    results = collection.aggregate(pipeline)
    return results

def print_product_search_result(result):
    '''
    Print the search result document in a readable format
    '''
    print(f"Similarity Score: {result['similarityScore']}")  
    print(f"Name: {result['document']['name']}")   
    print(f"Category: {result['document']['categoryName']}")
    print(f"SKU: {result['document']['categoryName']}")
    print(f"_id: {result['document']['_id']}\n")

In [14]:
query = "What bikes do you have?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)   

Similarity Score: -0.04840514717753619
Name: Touring Tire Tube
Category: Accessories, Tires and Tubes
SKU: Accessories, Tires and Tubes
_id: 29663491-D2E9-47B4-83AE-D9459B6B5B67

Similarity Score: -0.04840514717753619
Name: Classic Vest, L
Category: Clothing, Vests
SKU: Clothing, Vests
_id: 80D3630F-B661-4FD6-A296-CD03BB7A4A0C

Similarity Score: -0.04840514717753619
Name: Touring-3000 Blue, 50
Category: Bikes, Touring Bikes
SKU: Bikes, Touring Bikes
_id: DDD64AA0-30DC-4DC1-BCDC-2882A0FD178C

Similarity Score: -0.04840514717753619
Name: Road-750 Black, 52
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: FEEFEE3B-6CB9-4A75-B896-5182531F661B



In [15]:
query = "What do you have that is yellow?"
results = vector_search("products", query, num_results=4)
for result in results:
    print_product_search_result(result)   

Similarity Score: -0.043650975292541805
Name: Touring Tire Tube
Category: Accessories, Tires and Tubes
SKU: Accessories, Tires and Tubes
_id: 29663491-D2E9-47B4-83AE-D9459B6B5B67

Similarity Score: -0.043650975292541805
Name: Classic Vest, L
Category: Clothing, Vests
SKU: Clothing, Vests
_id: 80D3630F-B661-4FD6-A296-CD03BB7A4A0C

Similarity Score: -0.043650975292541805
Name: Touring-3000 Blue, 50
Category: Bikes, Touring Bikes
SKU: Bikes, Touring Bikes
_id: DDD64AA0-30DC-4DC1-BCDC-2882A0FD178C

Similarity Score: -0.043650975292541805
Name: Road-750 Black, 52
Category: Bikes, Road Bikes
SKU: Bikes, Road Bikes
_id: FEEFEE3B-6CB9-4A75-B896-5182531F661B



## Use vector search results in a RAG pattern with Chat GPT-3.5

In [16]:
# A system prompt describes the responsibilities, instructions, and persona of the AI.
system_prompt = """
You are a helpful, fun and friendly sales assistant for Cosmic Works, a bicycle and bicycle accessories store. 
Your name is Cosmo.
You are designed to answer questions about the products that Cosmic Works sells.

Only answer questions related to the information provided in the list of products below that are represented
in JSON format.

If you are asked a question that is not in the list, respond with "I don't know."

List of products:
"""

In [17]:
def rag_with_vector_search(question: str, num_results: int = 3):
    """
    Use the RAG model to generate a prompt using vector search results based on the
    incoming question.  
    """
    # perform the vector search and build product list
    results = vector_search("products", question, num_results=num_results)
    product_list = ""
    for result in results:
        if "contentVector" in result["document"]:
            del result["document"]["contentVector"]
        product_list += json.dumps(result["document"], indent=4, default=str) + "\n\n"

    # generate prompt for the LLM with vector results
    formatted_prompt = system_prompt + product_list

    # prepare the LLM request
    messages = [
        {"role": "system", "content": formatted_prompt},
        {"role": "user", "content": question}
    ]

    completion = ai_client.chat.completions.create(messages=messages, model=COMPLETIONS_DEPLOYMENT_NAME)
    return completion.choices[0].message.content

In [18]:
print(rag_with_vector_search("What bikes do you have?", 5))

We have the following bikes available:
1. Touring-3000 Blue, 50
2. Road-750 Black, 52


In [20]:
print(rag_with_vector_search("What was the price of the product with sku `FR-R92B-58`", 5))

I don't know.
