# Similarity Search with Azure OpenAI

This recipe shows how to run semantic (similarity) search in Weaviate using embeddings from Azure OpenAI.

## Requirements

1. Weaviate cluster
    1. You can create a 14-day free sandbox on [WCD](https://console.weaviate.cloud/)
    2. [Embedded Weaviate](https://weaviate.io/developers/weaviate/installation/embedded)
    3. [Local deployment](https://weaviate.io/developers/weaviate/installation/docker-compose#starter-docker-compose-file)
    4. [Other options](https://weaviate.io/developers/weaviate/installation)

2. Azure OpenAI API key. You can create one in the [Azure portal](https://portal.azure.com/).

## Import Dependencies, Libraries, and Keys

In [None]:
!pip install -q weaviate-client

In [1]:
import weaviate, os
from weaviate.embedded import EmbeddedOptions
import weaviate.classes as wvc
import weaviate.classes.config as wc
import requests, json
import weaviate.classes.query as wq

## Connect to Weaviate

Choose only one of the options below.

**Weaviate Cloud Deployment**

In [2]:
WCD_URL = os.environ["WEAVIATE_URL"] # Replace with your Weaviate cluster URL
WCD_AUTH_KEY = os.environ["WEAVIATE_AUTH"] # Replace with your cluster auth key
AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_API_KEY"] # Replace with your Azure key

# Weaviate Cloud Deployment
client = weaviate.connect_to_wcs(
    cluster_url=WCD_URL,
    auth_credentials=weaviate.auth.AuthApiKey(WCD_AUTH_KEY),
    headers={"X-Azure-Api-Key": AZURE_OPENAI_KEY}
)

print(client.is_ready())

True


**Embedded Weaviate**

In [None]:
# AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_API_KEY"] # Replace with your Azure OpenAI key

# client = weaviate.WeaviateClient(
#     embedded_options=EmbeddedOptions(
#         version="1.29.0",
#         additional_env_vars={
#             "ENABLE_MODULES": "text2vec-openai"
#         }
#     ),
#     additional_headers={
#         "X-Azure-Api-Key": AZURE_OPENAI_KEY
#     }
# )

# client.connect()

**Local Deployment**

In [None]:
# AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_API_KEY"] # Replace with your Azure OpenAI key

# client = weaviate.connect_to_local(
#   headers={
#     "X-Azure-Api-Key": AZURE_OPENAI_KEY
#   }
# )
# print(client.is_ready())
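
All three connection options read credentials with `os.environ[...]`, which raises a bare `KeyError` on the first missing name. As a minimal sketch, a small helper can check every required variable up front and report all missing names at once. The helper name `read_required_env` and the stand-in `example_env` mapping are hypothetical and not part of the Weaviate client.

```python
import os
from typing import Dict, Iterable, Mapping


def read_required_env(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> Dict[str, str]:
    """Return the requested variables, or raise listing every missing one."""
    names = list(names)
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Stand-in environment for illustration only (placeholder values, not real credentials).
example_env = {"WEAVIATE_URL": "https://example.invalid", "WEAVIATE_AUTH": "placeholder"}

try:
    read_required_env(
        ["WEAVIATE_URL", "WEAVIATE_AUTH", "AZURE_OPENAI_API_KEY"], env=example_env
    )
except RuntimeError as err:
    message = str(err)
    print(message)  # Missing environment variables: AZURE_OPENAI_API_KEY
```

Called without the `env` argument, the helper reads from the real process environment.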

## Create a collection
> A collection stores your data and vector embeddings.

In [3]:
# Note: in practice, avoid rerunning this cell. It deletes the data
# in "JeopardyQuestion", so you would have to import it again.

# Delete the collection if it already exists
if (client.collections.exists("JeopardyQuestion")):
    client.collections.delete("JeopardyQuestion")

client.collections.create(
    name="JeopardyQuestion",

    vectorizer_config=wc.Configure.Vectorizer.text2vec_azure_openai(
        resource_name="xyz", # name of your resource
        deployment_id="text-embedding-3-small", # model deployed
    ),

    properties=[ # defining properties (data schema) is optional
        wc.Property(name="Question", data_type=wc.DataType.TEXT), 
        wc.Property(name="Answer", data_type=wc.DataType.TEXT),
        wc.Property(name="Category", data_type=wc.DataType.TEXT, skip_vectorization=True), 
    ]
)

print("Successfully created collection: JeopardyQuestion.")

Successfully created collection: JeopardyQuestion.


## Import the Data

In [4]:
url = 'https://raw.githubusercontent.com/weaviate/weaviate-examples/main/jeopardy_small_dataset/jeopardy_tiny.json'
resp = requests.get(url)
data = json.loads(resp.text)

# Get a collection object for "JeopardyQuestion"
jeopardy = client.collections.get("JeopardyQuestion")

# Insert data objects
response = jeopardy.data.insert_many(data)

# Note: the `data` array contains only 10 objects, so a single insert_many call is fine.
# If you have a million objects to insert, split them into smaller batches (e.g. 100-1000 per insert).

if (response.has_errors):
    print(response.errors)
else:
    print("Insert complete.")

Insert complete.
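
The comment in the cell above recommends splitting large imports into batches of roughly 100 to 1000 objects. A minimal sketch of that splitting step, using only the standard library, could look like this. The helper name `chunked`, the stand-in `records` list, and the batch size of 100 are illustrative assumptions.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in records; in the recipe this would be the parsed `data` list.
records = [{"Question": f"question {i}"} for i in range(250)]

batch_sizes = [len(batch) for batch in chunked(records, 100)]
print(batch_sizes)  # [100, 100, 50]
```

Each batch could then be passed to `jeopardy.data.insert_many(batch)`, checking `has_errors` after every call as in the cell above.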


## Query Weaviate: Similarity Search (Text objects)

Similarity search options for text objects in **Weaviate**:

1. [near_text](https://weaviate.io/developers/weaviate/search/similarity#an-input-medium)

2. [near_object](https://weaviate.io/developers/weaviate/search/similarity#an-object)

3. [near_vector](https://weaviate.io/developers/weaviate/search/similarity#a-vector)

### `nearText` Example
Find `JeopardyQuestion` objects related to "african beasts". Limit the results to 4.

In [5]:
# note, you can reuse the collection object from the previous cell.
# Get a collection object for "JeopardyQuestion"
jeopardy = client.collections.get("JeopardyQuestion")

response = jeopardy.query.near_text(
    query="african beasts",
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", json.dumps(item.properties, indent=2), "\n")

ID: 80834d59-c3f6-4ca1-886b-ea790b8a60f1
Data: {
  "answer": "Antelope",
  "question": "Weighing around a ton, the eland is the largest species of this animal in Africa",
  "category": "ANIMALS"
} 

ID: c8720c6c-1dd7-4572-9625-a7cc71a4b96d
Data: {
  "answer": "Elephant",
  "question": "It's the only living mammal in the order Proboseidea",
  "category": "ANIMALS"
} 

ID: dd160788-07a4-48ea-96d9-a328a5660827
Data: {
  "answer": "the nose or snout",
  "question": "The gavial looks very much like a crocodile except for this bodily feature",
  "category": "ANIMALS"
} 

ID: fd15dd89-5da9-4017-902f-548349ff87d9
Data: {
  "answer": "the diamondback rattler",
  "question": "Heaviest of all poisonous snakes is this North American rattlesnake",
  "category": "ANIMALS"
} 



**Return vector embeddings**

In [6]:
response = jeopardy.query.near_text(
    query="african beasts",
    include_vector=True,
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", json.dumps(item.properties, indent=2))
    print("Vector:", item.vector, "\n")

ID: 80834d59-c3f6-4ca1-886b-ea790b8a60f1
Data: {
  "answer": "Antelope",
  "question": "Weighing around a ton, the eland is the largest species of this animal in Africa",
  "category": "ANIMALS"
}
Vector: {'default': [0.06673423945903778, -0.0031030948739498854, 0.026736171916127205, 0.0258866548538208, -0.078155517578125, 0.10316906124353409, -0.0032004353124648333, 0.03275357931852341, 0.0032004353124648333, 0.030676985159516335, 0.017580272629857063, -0.006790232378989458, -0.05163172259926796, -0.01898433454334736, -0.03232882171869278, -0.031054547056555748, -0.0004855960141867399, -0.0054687624797225, -0.04162630811333656, 0.031054547056555748, 0.007893424481153488, 0.021214313805103302, -0.011710348539054394, -0.002154763089492917, 0.023857254534959793, -0.007698743138462305, -0.051348552107810974, 0.005698839668184519, 0.07504062354564667, -0.02091934345662594, 0.05403868854045868, -0.03334352374076843, -0.00018638480105437338, 0.0028331964276731014, -0.04509516805410385, -0.03

**Now, also request the `distance` for each returned item.**

In [7]:
response = jeopardy.query.near_text(
    query="african beasts",
    return_metadata=wq.MetadataQuery(distance=True),
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Distance:", item.metadata.distance)
    print("Data:", item.properties, "\n")

ID: 80834d59-c3f6-4ca1-886b-ea790b8a60f1
Distance: 0.4886343479156494
Data: {'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'} 

ID: c8720c6c-1dd7-4572-9625-a7cc71a4b96d
Distance: 0.6216651797294617
Data: {'answer': 'Elephant', 'question': "It's the only living mammal in the order Proboseidea", 'category': 'ANIMALS'} 

ID: dd160788-07a4-48ea-96d9-a328a5660827
Distance: 0.6688282489776611
Data: {'answer': 'the nose or snout', 'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'category': 'ANIMALS'} 

ID: fd15dd89-5da9-4017-902f-548349ff87d9
Distance: 0.7234818935394287
Data: {'answer': 'the diamondback rattler', 'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'category': 'ANIMALS'} 
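
To make the `distance` values easier to read: assuming the collection uses cosine distance (Weaviate's default metric unless the collection is configured otherwise), the distance is one minus the cosine similarity of the two vectors, so smaller values mean a closer match. The function below is a plain-Python sketch of that formula for illustration, not the code Weaviate runs internally.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # close to 1
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # close to 2
```

Under this metric, the first result above (distance about 0.49) is closer to the query than the last one (about 0.72).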



### `nearObject` Example

Search through the `JeopardyQuestion` collection to find the top 4 objects closest to the object with id `fd15dd89-5da9-4017-902f-548349ff87d9`. (The id was taken from the query above.)

In [8]:
response = jeopardy.query.near_object(
    near_object="fd15dd89-5da9-4017-902f-548349ff87d9", # replace with your id of interest
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")

ID: fd15dd89-5da9-4017-902f-548349ff87d9
Data: {'answer': 'the diamondback rattler', 'question': 'Heaviest of all poisonous snakes is this North American rattlesnake', 'category': 'ANIMALS'} 

ID: 80834d59-c3f6-4ca1-886b-ea790b8a60f1
Data: {'answer': 'Antelope', 'question': 'Weighing around a ton, the eland is the largest species of this animal in Africa', 'category': 'ANIMALS'} 

ID: dd160788-07a4-48ea-96d9-a328a5660827
Data: {'answer': 'the nose or snout', 'question': 'The gavial looks very much like a crocodile except for this bodily feature', 'category': 'ANIMALS'} 

ID: c8720c6c-1dd7-4572-9625-a7cc71a4b96d
Data: {'answer': 'Elephant', 'question': "It's the only living mammal in the order Proboseidea", 'category': 'ANIMALS'} 



### `nearVector` Example
Search through the `JeopardyQuestion` collection to find the top 4 objects closest to the query vector `[-0.0125526935, -0.021168863, ... ]`

In [None]:
response = jeopardy.query.near_vector(
    near_vector=[-0.0125526935, -0.021168863, ... ], # replace with a vector object in your cluster
    limit=4
)

for item in response.objects:
    print("ID:", item.uuid)
    print("Data:", item.properties, "\n")
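
Conceptually, `near_vector` ranks stored vectors by their distance to the query vector and returns the closest `limit` items. The sketch below does this by brute force over a hypothetical in-memory `toy_store`, assuming cosine distance. Weaviate normally answers such queries through a vector index rather than a full scan, so this illustrates the idea only. The names `nearest` and `toy_store` are made up for the example.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(
    query: Sequence[float], store: Dict[str, Sequence[float]], limit: int
) -> List[Tuple[str, float]]:
    """Return up to `limit` (key, distance) pairs, closest first."""
    scored = ((cosine_distance(query, vec), key) for key, vec in store.items())
    return [(key, dist) for dist, key in heapq.nsmallest(limit, scored)]


# Toy 2-dimensional "embeddings" keyed by a made-up id.
toy_store = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}

results = nearest([1.0, 0.1], toy_store, limit=2)
print([key for key, _ in results])  # ['a', 'c']

# Querying with a stored object's own vector returns that object first.
own = nearest(toy_store["b"], toy_store, limit=1)
print(own[0][0])  # b
```

The second query mirrors the `nearObject` output above, where the object used as the query comes back as the first result.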