This script is based on: https://github.com/microsoft/AzureDataRetrievalAugmentedGenerationSamples/blob/main/Python/CosmosDB-NoSQL_VectorSearch/CosmosDB-NoSQL-Vector_AzureOpenAI_Tutorial.ipynb

# Create an Azure Cosmos DB for NoSQL resource

Let's start by creating an Azure Cosmos DB for NoSQL resource (a Cosmos DB account) by following [this section in the Quickstart guide](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/quickstart-portal#create-account).

## Get Cosmos DB Account Key and Endpoint
Once the account is provisioned, open it and navigate to the **"Settings > Keys"** section in the left-side panel. From the Keys section, make a note of the **Primary Key and the URI**; these will be used later to connect to the Cosmos DB account through the Python client.
Store the Primary Key and URI in a .env file.

# Provision Azure OpenAI resource
Next, let's set up our Azure OpenAI resource. Currently, access to this service is granted only by application. You can apply for access to Azure OpenAI by completing the form at [https://aka.ms/oai/access](https://aka.ms/oai/access).

Once you have access, complete the following steps:
1. Create an Azure OpenAI resource [following this quickstart](https://learn.microsoft.com/azure/ai-services/openai/how-to/create-resource?pivots=web-portal)
2. Deploy an embeddings model. For more information on embeddings, refer to [this article](https://learn.microsoft.com/azure/ai-services/openai/how-to/embeddings)
3. Deploy a completions model. For more information on completions, refer to [this article](https://learn.microsoft.com/azure/ai-services/openai/how-to/completions)
4. Make a note of the endpoint and key for your Azure OpenAI resource
5. Make a note of the **deployment names** of the embedding and completion models.

Store the endpoint, key, and deployment names in the .env file.
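Before going further, it can help to confirm that the .env file holds every setting the notebook reads. The sketch below is one way to do that. The key names are the ones used later in this notebook, while `find_missing_keys` and `example_config` are hypothetical names introduced only for illustration.

```python
# Key names below are the ones read later in this notebook;
# adjust the list if your .env file uses different names.
REQUIRED_KEYS = [
    "openai_api_key",
    "openai_api_endpoint",
    "openai_api_version",
    "completions_model_deployment_name",
    "embedding_model_endpoint",
    "embedding_model_key",
    "embedding_model_deployment_name",
    "embedding_model_api_version",
    "PERSONAL_COSMOSDB_CONNECTION_URI",
    "PERSONAL_COSMOSDB_KEY",
]


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in a dict-like config."""
    return [key for key in required if not config.get(key)]


# Stand-in dict for illustration; in the notebook you would pass the
# mapping returned by dotenv_values(...) instead.
example_config = {"openai_api_key": "<your-key>", "openai_api_version": "2024-10-21"}
missing = find_missing_keys(example_config)
print("Missing settings:", missing)
```

Failing early on a missing setting tends to be easier to diagnose than a `KeyError` or an authentication error several cells later.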


# Install the required libraries

In [1]:
'''
! pip install numpy
! pip install openai
! pip install python-dotenv
! pip install azure-core
! pip install azure-cosmos
'''

'\n! pip install numpy\n! pip install openai\n! pip install python-dotenv\n! pip install azure-core\n! pip install azure-cosmos\n'

# Necessary imports

In [2]:
import json
import datetime
import time
import urllib 

from azure.core.exceptions import AzureError
from azure.core.credentials import AzureKeyCredential

#Cosmos DB imports
from azure.cosmos import CosmosClient

from azure.cosmos.aio import CosmosClient as CosmosAsyncClient
from azure.cosmos import PartitionKey, exceptions

from openai import AzureOpenAI
from dotenv import load_dotenv

# Load Keys, Endpoints, and other variables from the .env file

In [3]:
import os
from dotenv import dotenv_values

'''
env_name = os.path.join(os.getcwd(), "localsettings.env")
config = dotenv_values(env_name)
print(f"this is the env file: {env_name}")
'''


# Load environment variables from the .env file
root_dir = os.path.dirname(os.getcwd())
env_path = os.path.join(root_dir, "localsettings.env")
config = dotenv_values(env_path)
print(f"this is the env file: {env_path}")

this is the env file: /workspaces/ai_accelerate_ai_database/localsettings.env


In [4]:
print(f"Printing config: {config}")

Printing config: OrderedDict([('openai_api_key', '<redacted>'), ('openai_api_endpoint', 'https://jqopenai.openai.azure.com/openai/deployments/jq-gpt-4o/chat/completions?api-version=2024-10-21'), ('openai_api_version', '2024-10-21'), ('completions_model_deployment_name', 'jq-gpt-4o'), ('embedding_model_endpoint', 'https://jqopenai.openai.azure.com/openai/deployments/jq-text-embedding-3-large/embeddings?api-version=2023-05-15'), ('embedding_model_key', '<redacted>'), ('embedding_model_deployment_name', 'jq-text-embedding-3-large'), ('embedding_model_api_version', '2024-12-01-preview'), ('PERSONAL_COSMOSDB_CONNECTION_URI', 'https://jqcosmosdb.documents.azure.com:443/'), ('PERSONAL_COSMOSDB_KEY', '<redacted>'), ('PERSONAL_COSMOSDB_CONNECTION_STRING', 'AccountEndpoint=https://jqcosmosdb.documents.azure.com:443/;AccountKey=<redacted>

In [4]:
print("openai_api_key:", config.get("openai_api_key"))
print("openai_api_endpoint:", config.get("openai_api_endpoint"))
print("cosmosdb_nosql_account_endpoint:", config.get("PERSONAL_COSMOSDB_CONNECTION_STRING"))

openai_api_key: <redacted>
openai_api_endpoint: https://jqopenai.openai.azure.com/openai/deployments/jq-gpt-4o/chat/completions?api-version=2024-10-21
cosmosdb_nosql_account_endpoint: AccountEndpoint=https://jqcosmosdb.documents.azure.com:443/;AccountKey=<redacted>;


In [5]:
# Completion Model
OPENAI_API_KEY = config['openai_api_key']
OPENAI_API_ENDPOINT = config['openai_api_endpoint']
OPENAI_API_VERSION = config['openai_api_version'] # 2024-10-21 in the config printed above
COMPLETIONS_MODEL_DEPLOYMENT_NAME = config['completions_model_deployment_name']

# Embedding Model
EMBEDDING_MODEL_DEPLOYMENT_NAME = config['embedding_model_deployment_name']
EMBEDDING_MODEL_ENDPOINT = config['embedding_model_endpoint']
EMBEDDING_MODEL_KEY = config['embedding_model_key']
EMBEDDING_MODEL_API_VERSION = config['embedding_model_api_version'] # 2024-12-01-preview in the config printed above

# CosmosDB Information
COSMOSDB_NOSQL_ACCOUNT_KEY = config['PERSONAL_COSMOSDB_KEY']
COSMOSDB_NOSQL_ACCOUNT_ENDPOINT = config['PERSONAL_COSMOSDB_CONNECTION_URI']

print("OPENAI_API_ENDPOINT: ", OPENAI_API_ENDPOINT)
print("COSMOSDB_NOSQL_ACCOUNT_ENDPOINT: ", COSMOSDB_NOSQL_ACCOUNT_ENDPOINT)
print(f"EMBEDDING_MODEL_DEPLOYMENT_NAME: {EMBEDDING_MODEL_DEPLOYMENT_NAME}")

OPENAI_API_ENDPOINT:  https://jqopenai.openai.azure.com/openai/deployments/jq-gpt-4o/chat/completions?api-version=2024-10-21
COSMOSDB_NOSQL_ACCOUNT_ENDPOINT:  https://jqcosmosdb.documents.azure.com:443/
EMBEDDING_MODEL_DEPLOYMENT_NAME: jq-text-embedding-3-large


# Instantiate the Azure OpenAI client

In [6]:
AOAI_client = AzureOpenAI(api_key=OPENAI_API_KEY, azure_endpoint=OPENAI_API_ENDPOINT, api_version=OPENAI_API_VERSION,)
Embedding_client = AzureOpenAI(api_key=EMBEDDING_MODEL_KEY, azure_endpoint=EMBEDDING_MODEL_ENDPOINT, api_version=EMBEDDING_MODEL_API_VERSION,)


# Generating embeddings
We'll use the deployed embeddings model to generate the embeddings.

In [7]:
def generate_embeddings(text):
    '''
    Generate embeddings from string of text.
    This will be used to vectorize data and user input for interactions with Azure OpenAI.
    '''
    # The deployment name is passed as the `model` argument; str() guards against a non-string config value.
    embeddingString = str(EMBEDDING_MODEL_DEPLOYMENT_NAME)
    response = Embedding_client.embeddings.create(input=text, model=embeddingString)
    embeddings = response.model_dump()
    time.sleep(0.5)  # short pause between consecutive calls
    return embeddings['data'][0]['embedding']

In [9]:
print("Embeddings generated: ", generate_embeddings("Joaquin Saldana"))

Embeddings generated:  [0.0077652097679674625, 0.008775642141699791, -0.02205446921288967, 0.0031585944816470146, -0.025921162217855453, 0.06543143838644028, -0.013517512008547783, 0.01561793778091669, -0.013708459213376045, 0.0037592845037579536, -0.06107146665453911, -0.0008468335145153105, -0.014631373807787895, 0.03908064588904381, 0.009205274283885956, 0.03580271080136299, -0.03653467446565628, 0.006551895756274462, 0.00430825911462307, -0.027066847309470177, 0.0266053918749094, -0.027942026033997536, 0.0064405095763504505, 0.04210398718714714, 0.0625990480184555, -0.0136448098346591, 0.036184605211019516, 0.022420452907681465, -0.00366977765224874, 0.012467298656702042, 0.0016290232306346297, 0.007948201149702072, 0.009992933832108974, -0.01467115432024002, 0.003387334058061242, 0.04019450768828392, 0.016914790496230125, 0.0021859542466700077, -0.0172966867685318, -0.01047825999557972, -0.012411605566740036, -0.009786074049770832, -0.002900019520893693, 0.010653294622898102, 0.03

# Load the data with embeddings or generate embeddings
We have a sample data file with embeddings, but you can also generate the embeddings afresh before uploading the data.

In [10]:
root_dir = os.path.dirname(os.getcwd())
data_file = os.path.join(root_dir, "Datasets/netflix_titles_sml.json")

with open(data_file, "r") as f:
    data = json.load(f)

print(f"This is the file name: {data_file}")

This is the file name: /workspaces/ai_accelerate_ai_database/Datasets/netflix_titles_sml.json


In [11]:
# Take a peek at one data item
print(json.dumps(data[0], indent=2))

{
  "id": "s1",
  "type": "Movie",
  "title": "Dick Johnson Is Dead",
  "director": "Kirsten Johnson",
  "cast": "",
  "country": "United States",
  "date_added": "September 25, 2021",
  "release_year": "2020",
  "rating": "PG-13",
  "duration": "90 min",
  "description": "As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.",
  "listed_in": "Documentaries"
}


In [12]:
# Generate embeddings for the title and description fields
for n, item in enumerate(data, start=1):
    #print(item['title'] + ' | ' + item['description'])
    doc_embeddings = generate_embeddings(item['title'] + ' | ' + item['description'])
    item['docVector'] = doc_embeddings
    print("Creating embeddings for item:", n, "/", len(data), end='\r')

Creating embeddings for item: 19 / 19

In [None]:
# If you have the embeddings pre-computed, you can load them from the file
# data_file = open(file="./Datasets/netflix_titles_withembeddings.json", mode="r")

# data = json.load(data_file)
# data_file.close()
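
If you want to reuse embeddings across runs instead of regenerating them, one option is to write the enriched items to a JSON file and reload them later. This is a minimal sketch: `save_items`, `load_items`, the temporary file name, and the three-element stand-in vector are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_items(items, path):
    """Write the list of items (including their vectors) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(items, f)


def load_items(path):
    """Read a list of items back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in record with a short fake vector; real vectors are much longer.
sample = [{"id": "s1", "title": "Dick Johnson Is Dead", "docVector": [0.1, 0.2, 0.3]}]

# Hypothetical cache location used only for this example.
cache_path = os.path.join(tempfile.gettempdir(), "titles_with_embeddings_example.json")
save_items(sample, cache_path)
reloaded = load_items(cache_path)
print(reloaded[0]["id"], len(reloaded[0]["docVector"]))
```

Using a `with` block closes the file even if loading fails, which the commented-out `open`/`close` pair above does not guarantee.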

In [14]:
print(len(data))
print(json.dumps(data[1], indent=2))

19
{
  "id": "s2",
  "type": "TV Show",
  "title": "Blood & Water",
  "director": "",
  "cast": "Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng",
  "country": "South Africa",
  "date_added": "September 24, 2021",
  "release_year": "2021",
  "rating": "TV-MA",
  "duration": "2 Seasons",
  "description": "After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.",
  "listed_in": "International TV Shows, TV Dramas, TV Mysteries",
  "docVector": [
    0.007296870928257704,
    -0.0019422112964093685,
    -0.00897500105202198,
    -0.0057386066764593124,
    -0.04827621579170227,
    -0.0018504385370761156,
    0.015612606890499592,
    -0.0033281671

# Connect and setup Cosmos DB for NoSQL
Now that we have the data with embeddings ready, we need to upload it to an Azure Cosmos DB container with vector search capability. For this, we need to create a new container (vector search is currently supported in new containers only) with a vector embedding policy and an indexing policy.

## Set up the connection

In [15]:
cosmos_client = CosmosClient(url=COSMOSDB_NOSQL_ACCOUNT_ENDPOINT, credential=COSMOSDB_NOSQL_ACCOUNT_KEY)

## Create a new database or use existing one

In [16]:
#create database
DATABASE_NAME = "vector-nosql-db"
db= cosmos_client.create_database_if_not_exists(
    id=DATABASE_NAME
)
properties = db.read()
print(json.dumps(properties))

{"id": "vector-nosql-db", "_rid": "DdkOAA==", "_self": "dbs/DdkOAA==/", "_etag": "\"00006615-0000-0700-0000-682aa7980000\"", "_colls": "colls/", "_users": "users/", "_ts": 1747625880}


## Author the vector embedding policy
The vector embedding policy defines the information needed for vector search queries, as detailed below:
* “path”: which property contains the vectors
* “dataType”: the type of the vector’s elements (default Float32)
* “dimensions”: the length of each vector in the path (default 1536)
* “distanceFunction”: the metric used to compute distance/similarity (default Cosine)
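
Because the policy declares a fixed vector length, it is worth checking that the vectors you generated actually have that length before uploading them. The sketch below does this with plain Python. `check_dimensions` is a hypothetical helper, and the four-dimensional policy and items are toy values chosen to keep the example short.

```python
def check_dimensions(items, vector_field, expected_dimensions):
    """Return the ids of items whose vector is missing or has the wrong length."""
    bad_ids = []
    for item in items:
        vector = item.get(vector_field)
        if not isinstance(vector, list) or len(vector) != expected_dimensions:
            bad_ids.append(item.get("id"))
    return bad_ids


# Toy policy with the same shape as the one in this notebook, but only 4 dimensions.
toy_policy = {
    "vectorEmbeddings": [
        {"path": "/docVector", "dataType": "float32",
         "distanceFunction": "cosine", "dimensions": 4}
    ]
}
entry = toy_policy["vectorEmbeddings"][0]
field_name = entry["path"].lstrip("/")  # "/docVector" -> "docVector"

toy_items = [
    {"id": "ok", "docVector": [0.1, 0.2, 0.3, 0.4]},
    {"id": "too-short", "docVector": [0.1, 0.2]},
    {"id": "no-vector"},
]
print("Items with unexpected vectors:", check_dimensions(toy_items, field_name, entry["dimensions"]))
```

In the notebook, the same check could be run as `check_dimensions(data, "docVector", 1536)` once the embeddings have been generated. If it reports mismatches, either the policy or the embedding deployment settings likely need adjusting.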

In [17]:
vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path":"/docVector",
            "dataType":"float32",
            "distanceFunction":"cosine",
            "dimensions":1536
        }
    ]
}

## Add vector indexes to indexing policy

In [18]:
indexing_policy = {
    "includedPaths": [
        {
            "path": "/*"
        }
    ],
    "excludedPaths": [
        {
            "path": "/docVector/?"
        },
        {
            "path": "/\"_etag\"/?"
        }
    ],
    "vectorIndexes": [
        {"path": "/docVector",
         "type": "diskANN"
        }
    ]
}


## Create container with the embedding and indexing policy

In [19]:
CONTAINER_NAME = "vector-diskann"
try:    
    container = db.create_container_if_not_exists(
                    id=CONTAINER_NAME,
                    partition_key=PartitionKey(path='/partKey', kind='Hash'),
                    indexing_policy=indexing_policy,
                    vector_embedding_policy=vector_embedding_policy)

    print('Container with id \'{0}\' created'.format(CONTAINER_NAME))

except exceptions.CosmosResourceExistsError:
    print('A container with id \'{0}\' already exists'.format(CONTAINER_NAME))

Container with id 'vector-diskann' created


In [20]:
# Skip creating the container if it already exists
DATABASE_NAME = "vector-nosql-db"
CONTAINER_NAME = "vector-diskann"
db= cosmos_client.get_database_client(DATABASE_NAME)
container = db.get_container_client(CONTAINER_NAME)

## Upload data to the container
The Azure Cosmos DB Python SDK does not currently support bulk inserts, so we'll have to insert the items sequentially.

In [21]:
# with open('./DataSet/AzureServices/text-sample_w_embeddings.json') as f:
#    data = json.load(f)

container_client = db.get_container_client(CONTAINER_NAME)

for item in data:
    item['partKey'] = item['release_year']
    container_client.upsert_item(item)
    print("writing item {} - {}".format(item['id'], container_client.client_connection.last_response_headers['x-ms-request-charge']))

writing item s1 - 113.39
writing item s2 - 113.39
writing item s3 - 113.39
writing item s4 - 113.39
writing item s5 - 113.39
writing item s6 - 113.39
writing item s7 - 113.39
writing item s8 - 113.39
writing item s9 - 113.39
writing item s10 - 113.39
writing item s11 - 113.39
writing item s12 - 113.39
writing item s13 - 113.39
writing item s14 - 113.39
writing item s15 - 113.39
writing item s16 - 113.39
writing item s17 - 113.39
writing item s18 - 113.39
writing item s19 - 113.39


## Vector search in Azure Cosmos DB for NoSQL
Let's write a function that takes the user's query, generates an embedding for the query text, and then uses that embedding to run a vector search for similar items. The most similar items can then serve as an additional knowledge base for the completions model when it answers the user's query.
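
To make the idea concrete, here is a brute-force version of the same ranking done locally in Python: score every item against the query vector with cosine similarity and keep the best few. This is only an illustration of the concept with toy two-dimensional vectors. It is not how Cosmos DB executes the search, and `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, items, k=3, vector_field="docVector"):
    """Return (score, id) pairs for the k items most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, item[vector_field]), item["id"])
              for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return scored[:k]


# Toy items with 2-dimensional vectors, for illustration only.
toy_items = [
    {"id": "a", "docVector": [1.0, 0.0]},
    {"id": "b", "docVector": [0.0, 1.0]},
    {"id": "c", "docVector": [1.0, 1.0]},
]
best = top_k([1.0, 0.1], toy_items, k=2)
print([item_id for _, item_id in best])
```

A full scan like this grows linearly with the number of items, which is the cost a vector index in the database is meant to avoid.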

In [22]:
print("Embeddings generated: ", generate_embeddings("Best romantic comedy movies to watch in family with kids coming to tenage years"))

Embeddings generated:  [-0.047173164784908295, 0.030830349773168564, -0.0061558387242257595, 0.008764822967350483, 0.0017401754157617688, 0.02807471714913845, -0.03380424901843071, -0.005821615923196077, 0.007475677877664566, 0.016083620488643646, 0.03669629991054535, -0.029275190085172653, -0.009685641154646873, 0.0006978606688790023, -0.014787654392421246, -0.040870677679777145, -0.01051778718829155, -0.017024900764226913, -0.029275190085172653, -0.019330356270074844, 0.00990390870720148, 0.014787654392421246, -0.016656573861837387, 0.03612334653735161, 0.04332619160413742, 0.01059963833540678, -0.03208538889884949, 0.0003410436911508441, -0.039779335260391235, 0.023818491026759148, 0.021567603573203087, -0.021349335089325905, 0.030311962589621544, -0.024978039786219597, -0.03265834227204323, 0.04166189581155777, -0.01705218479037285, 0.018825611099600792, -0.008212331682443619, -0.016492873430252075, -0.01780248060822487, 0.007182380184531212, 0.010306340642273426, 0.012604975141584

In [23]:
# Simple function to assist with vector search
# Note: there is no ORDER BY here, so TOP limits the result count but does not rank by similarity
def vector_search(query, num_results=5, printQuery=False):
    query_embedding = generate_embeddings(query)

    querystring = "SELECT TOP {} c.id, c.type, c.title, c.rating, c.release_year, c.description, VectorDistance(c.docVector,{}) AS similarityScore FROM c".format(num_results, query_embedding)

    if printQuery:
        print(querystring)

    results = container_client.query_items(
            query=querystring,
            enable_cross_partition_query=True)
    
    return results

In [24]:
# Vector search that orders the results by similarity to the query
def vector_search_ordered(query, num_results=3, printQuery=False):
    query_embedding = generate_embeddings(query)

    querystring = "SELECT TOP {} c.id, c.type, c.title, c.rating, c.release_year, c.description, VectorDistance(c.docVector,{}) AS similarityScore FROM c ORDER BY VectorDistance(c.docVector,{})".format(num_results, query_embedding, query_embedding)
    
    if printQuery:
        print(querystring)

    results = container_client.query_items(
            query=querystring,
            enable_cross_partition_query=True)
    
    return results
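
The search helpers in this section build the query text with `str.format`, which places the whole embedding inside the SQL string. An alternative worth considering is a parameterized query, where the text contains placeholders and the values travel separately. The sketch below only builds the query text and the parameter list, so it runs without a database. `build_vector_query` is a hypothetical helper, and the optional `release_year` filter assumes the `partKey` property set during upload.

```python
def build_vector_query(query_embedding, num_results=3, release_year=None):
    """Return (query_text, parameters) for a parameterized vector search."""
    where_clause = "WHERE c.partKey = @releaseYear " if release_year is not None else ""
    query_text = (
        "SELECT TOP @num_results c.id, c.type, c.title, c.rating, c.release_year, "
        "c.description, VectorDistance(c.docVector, @embedding) AS similarityScore "
        "FROM c " + where_clause +
        "ORDER BY VectorDistance(c.docVector, @embedding)"
    )
    parameters = [
        {"name": "@num_results", "value": num_results},
        {"name": "@embedding", "value": query_embedding},
    ]
    if release_year is not None:
        # partKey was stored as a string (copied from release_year), so compare as a string.
        parameters.append({"name": "@releaseYear", "value": str(release_year)})
    return query_text, parameters


# Toy embedding for illustration; in the notebook this would come from generate_embeddings(query).
query_text, parameters = build_vector_query([0.1, 0.2, 0.3], num_results=5, release_year=2016)
print(query_text)

# Assumed usage with the container client from this notebook:
# results = container_client.query_items(
#     query=query_text, parameters=parameters, enable_cross_partition_query=True)
```

Keeping values out of the query text avoids quoting problems with user-supplied filters and keeps printed queries short, since the embedding no longer appears inline.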

Let's run a test below

In [23]:
DATABASE_NAME = "vector-nosql-db"
CONTAINER_NAME = "vector-diskann"
db= cosmos_client.get_database_client(DATABASE_NAME)
container_client = db.get_container_client(CONTAINER_NAME)

In [25]:
query = "Best romantic comedy movies to watch in family with kids coming to tenage years"

results = vector_search(query)

# print
for result in results: 
    # print(result)
    print(f"Similarity Score: {result['similarityScore']}")
    print(f"Id: {result['id']}")  
    print(f"Type: {result['type']}")        
    print(f"Title: {result['title']}")  
    print(f"Rating: {result['rating']}")  
    print(f"Description: {result['description']}") 
    print(f"Release_year: {result['release_year']}\n") 

# print("Consumed RUs: {} ".format(container_client.client_connection.last_response_headers['x-ms-request-charge']))

Similarity Score: 0.25277268870052566
Id: s1
Type: Movie
Title: Dick Johnson Is Dead
Rating: PG-13
Description: As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
Release_year: 2020

Similarity Score: 0.2330229443968936
Id: s2
Type: TV Show
Title: Blood & Water
Rating: TV-MA
Description: After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
Release_year: 2021

Similarity Score: 0.22376364112414404
Id: s3
Type: TV Show
Title: Ganglands
Rating: TV-MA
Description: To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
Release_year: 2021

Similarity Score: 0.08415469000328653
Id: s4
Type: TV Show
Title: Jailbirds New Orleans
Rating: TV-MA
Description: Feuds, flirtations and toilet talk go down among the inc

In [28]:
# query = "What are some NoSQL databases in Azure?"
# query = "What are the services for event messaging patterns?"

#query = "Best romantic comedy movies to watch in family with kids coming to tenage years"
#query = "Action movie with zombies and a lot of blood"
query = "Sports movie that shows the struggle of a team to win the championship"

results = vector_search_ordered(query)

for result in results: 
    # print(result)
    print(f"Similarity Score: {result['similarityScore']}")
    print(f"Id: {result['id']}")  
    print(f"Type: {result['type']}")        
    print(f"Title: {result['title']}")  
    print(f"Rating: {result['rating']}")  
    print(f"Description: {result['description']}") 
    print(f"Release_year: {result['release_year']}\n") 

# print("Consumed RUs: {} ".format(container_client.client_connection.last_response_headers['x-ms-request-charge']))

Similarity Score: 0.862960131157242
Id: s5323
Type: Movie
Title: Undefeated
Rating: PG-13
Description: An inspirational profile of an inner-city high school football team's valiant effort to reach the school's first-ever playoff game.
Release_year: 2011

Similarity Score: 0.8627748584678387
Id: s8479
Type: Movie
Title: The Rebound
Rating: TV-MA
Description: This documentary follows three players on the Miami Heat Wheels, who face long odds on the way to the national wheelchair basketball championship.
Release_year: 2016

Similarity Score: 0.8565862992174109
Id: s5747
Type: Movie
Title: A Mighty Team
Rating: TV-MA
Description: When a fit of anger leads to a serious injury, a sidelined soccer star returns to his hometown and reluctantly agrees to train the local youth.
Release_year: 2016



In [25]:
# Simple predicate based on the partition key (unfortunately, I only have one partition as of now)
def vector_search_filterordered(query, releaseYear, num_results=3, printQuery=False):
    query_embedding = generate_embeddings(query)

    querystring = "SELECT TOP {} c.id, c.type, c.title, c.rating, c.release_year, c.description, VectorDistance(c.docVector,{}) AS similarityScore FROM c WHERE c.partKey = '{}' ORDER BY VectorDistance(c.docVector,{})".format(num_results, query_embedding, releaseYear, query_embedding)
    
    if printQuery:
        print(querystring)

    results = container_client.query_items(
            query=querystring,
            enable_cross_partition_query=True)
    
    return results

In [30]:
# query = "What are some NoSQL databases in Azure?"
# query = "What are the services for event messaging patterns?"

#query = "Best romantic comedy movies to watch in family with kids coming to tenage years"
#query = "Action movie with zombies and a lot of blood"
query = "Sports movie that shows the struggle of a team to win the championship"

results = vector_search_filterordered(query, 2016, printQuery=True)
#results = vector_search_filterordered(query, 2011, printQuery=True)

for result in results: 
    # print(result)
    print(f"Similarity Score: {result['similarityScore']}")
    print(f"Id: {result['id']}")  
    print(f"Type: {result['type']}")        
    print(f"Title: {result['title']}")  
    print(f"Rating: {result['rating']}")  
    print(f"Description: {result['description']}") 
    print(f"Release_year: {result['release_year']}\n") 

# print("Consumed RUs: {} ".format(container_client.client_connection.last_response_headers['x-ms-request-charge']))

SELECT TOP 3 c.id, c.type, c.title, c.rating, c.release_year, c.description, VectorDistance(c.docVector,[-0.01079730037599802, -0.03456145152449608, -0.005688764620572329, -0.008892635814845562, -0.01865561492741108, 0.016019359230995178, 0.003147110342979431, -0.01974039152264595, -0.02428131178021431, -0.02835552580654621, 0.00843854434788227, 0.009964797645807266, 0.01687708869576454, 0.00899985246360302, 0.013912876136600971, -0.02709415927529335, 0.0313323512673378, -0.019765619188547134, -0.008350248448550701, -0.028204161673784256, -0.006294220685958862, -0.021430622786283493, 0.00019462495401967317, -0.005222058854997158, 0.01175593864172697, -0.011648722924292088, 0.01811322756111622, -0.025883248075842857, 0.02714461460709572, -0.02913757413625717, 0.03438486158847809, 0.0019519651541486382, -0.006007259711623192, -0.02272983081638813, -0.02565620094537735, -0.03509122505784035, -0.009125988930463791, -0.01420299056917429, 0.025441769510507584, -0.010721618309617043, 0.018365