# Azure Cognitive Search Vector Search

Use Azure Cognitive Search to retrieve relevant content and build an effective prompt for Azure OpenAI. The example below uses LangChain modules to perform the task.

## Setup
#### Follow [README](https://github.com/tirtho/open-ai/blob/main/README.md) and perform setup before running the notebooks

#### References
- [Azure OpenAI](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/overview)
- [Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search)
- [Azure Cognitive Search as vector store](https://github.com/hwchase17/langchain/pull/5146/files/ef78d38fd12a6edcf6b04ab06493305d0d601ac3..f9b67d653854ef08e3dc56563964bb86deba9d8e)

#### Load the API key and relevant Python libraries.

#### Install the Python libraries
- > pip install openai num2words matplotlib plotly scipy scikit-learn pandas tiktoken 

- > pip install --index-url=https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/ azure-search-documents==11.4.0a20230509004



#### Load the API keys

In [1]:
import openai
import sys

from azure_openai_setup import set_openai_config, get_openai_global_config_parameters

set_openai_config()

theOpenAIParams, modelName, modelDeploymentName = get_openai_global_config_parameters()

Got Azure OpenAI Credentials from Azure Key Vault with Azure CLI Auth


#### Get the Azure Cognitive Search keys from Azure Key Vault
Note: You need the Search Admin Key

In [2]:
from azure_cognitive_search_setup import set_cognitive_search_config, create_cognitive_search_index

azureSearchAdminKey, azureSearchEndpoint, azureSearchIndexName = set_cognitive_search_config()

Getting Azure Cognitive Search Credentials from Azure Key Vault with Azure CLI Auth


#### Create the Search Index in Azure Cognitive Search
<font color=red>Note: This will delete your existing index</font>

In [3]:
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  

cog_search_key_credential = AzureKeyCredential(azureSearchAdminKey)
cog_search_index_client = SearchIndexClient(
                                endpoint=azureSearchEndpoint,
                                credential=cog_search_key_credential  
                            )

search_index = create_cognitive_search_index(
                    index_name=azureSearchIndexName,
                    search_index_client=cog_search_index_client
               )
print(f' {search_index.name} created')    

retrievable is not a known attribute of class <class 'azure.search.documents.indexes.models._index.SearchField'> and will be ignored


 tr-demo-billsum-index created


### Load the BillSum Dataset
BillSum is a dataset of United States Congressional and California state bills. For illustration purposes, we'll look only at the US bills. The corpus consists of bills from the 103rd-115th (1993-2018) sessions of Congress. The data was split into 18,949 train bills and 3,269 test bills. The BillSum corpus focuses on mid-length legislation from 5,000 to 20,000 characters in length. More information on the project, and on the original academic paper from which this dataset is derived, can be found on the BillSum project's GitHub repository.

The notebook reads a curated copy of it from ./data/bill_sum_data_curated.csv

#### Load the data, create vector embeddings, and store them in a JSON file

In [4]:
from num2words import num2words
import os
import pandas as pd
import numpy as np
import json
import random, string, uuid

In [5]:
from azure_openai_setup import get_embeddings_from_text

# Read the curated CSV file. This assumes bill_sum_data_curated.csv sits in a
# data/ folder under the directory from which you run the Jupyter notebook.
df = pd.read_csv(os.path.join(os.getcwd(), './data/bill_sum_data_curated.csv'))
# Add the vector embeddings for each row using AzureOpenAI Embeddings
df['content_vector'] = df['summary'].apply(get_embeddings_from_text)

# Add an Id column to the dataframe with random string. This is to match the id column in
# the index searchFields in the helper functions for Azure Cognitive Search
df['id'] = [f'{uuid.uuid4()}' for _ in range(len(df.index))]

def form_metadata(sum_len):
    # Build a small JSON string such as {"lengthOfSummary":321}
    return '{"lengthOfSummary":' + str(sum_len) + '}'
# Create metadata from sum_len
# Just to create some data for the metadata column
df['metadata'] = df['sum_len'].apply(form_metadata)


# Selecting the relevant columns
df_bills = df[['id', 'bill_id', 'title', 'summary', 'metadata', 'content_vector']]
# Renaming the 'summary' column to 'content', to match what is in the 
# Index schema in Azure Cognitive Search as defined in the search fields
# in the helper function
df_bills = df_bills.rename(columns={'summary': 'content'})

print(df_bills.loc[0:2, ['id', 'bill_id', 'title', 'content', 'metadata', 'content_vector']])
# Save the data with the vectors in a file
# You can take these and store in any other persistent store
df_bills.to_json(
    path_or_buf='./data/bill_sum_data_curated_with_vectors.json', 
    orient='records'
)

                                     id     bill_id  \
0  2551fd47-a404-47c0-bba5-ee70d31398fd    110_hr37   
1  c68518d1-f397-4d21-aca2-36999b015979  112_hr2873   
2  5925d206-bb52-43c3-bab4-a84446eebe2e   109_s2408   

                                               title  \
0  To amend the Internal Revenue Code of 1986 to ...   
1  To amend the Internal Revenue Code of 1986 to ...   
2  A bill to require the Director of National Int...   

                                             content  \
0  National Science Education Tax Incentive for B...   
1  Small Business Expansion and Hiring Act of 201...   
2  Requires the Director of National Intelligence...   

                   metadata                                     content_vector  
0   {"lengthOfSummary":321}  [-0.023282863199710846, -0.009497825056314468,...  
1  {"lengthOfSummary":1424}  [-0.02702074497938156, -0.01974799670279026, -...  
2   {"lengthOfSummary":463}  [-0.036936987191438675, -0.00575405266135931, ...  
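
The metadata column above is assembled by string concatenation. A minimal alternative sketch, assuming the goal is simply one valid JSON string per row, is to let `json.dumps` do the serialization. `form_metadata_json` is a hypothetical name used here for illustration; it is not part of the notebook's helper modules.

```python
import json

def form_metadata_json(sum_len):
    """Serialize the summary length as a compact JSON string."""
    # int() also accepts NumPy integer values, which json.dumps cannot
    # serialize directly.
    return json.dumps({"lengthOfSummary": int(sum_len)}, separators=(",", ":"))

metadata = form_metadata_json(321)
# Parsing the string back confirms that it is valid JSON.
parsed = json.loads(metadata)
```

It could be applied the same way as the original function, for example with `df['sum_len'].apply(form_metadata_json)`.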


#### Load the data with vector embeddings from the JSON file into the Azure Cognitive Search index

In [6]:
with open('./data/bill_sum_data_curated_with_vectors.json', 'r') as file:
    documents = json.load(file)
search_client = SearchClient(
                    endpoint=azureSearchEndpoint,
                    index_name=azureSearchIndexName,
                    credential=cog_search_key_credential
                )

results = search_client.upload_documents(documents)
print("Stored %s documents with embeddings in Azure Cognitive Search" %(len(documents)))

Stored 20 documents with embeddings in Azure Cognitive Search
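
The cell above uploads all 20 documents in a single call. For larger datasets the upload is usually split into batches, because the service limits how much one indexing request may carry. The sketch below is only an illustration: `batched` is a hypothetical helper, and the default of 1,000 documents per batch reflects the documented per-request limit as we understand it, so treat that number as an assumption.

```python
from typing import Dict, Iterator, List

def batched(documents: List[Dict], batch_size: int = 1000) -> Iterator[List[Dict]]:
    """Yield consecutive slices of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]

# Stand-in records, not rows from the BillSum data.
sample_documents = [{"id": str(i), "content": f"doc {i}"} for i in range(5)]
batches = list(batched(sample_documents, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
```

In this notebook each batch would be passed to `search_client.upload_documents(documents=batch)`. Checking the `succeeded` flag on each returned result is a reasonable way to spot documents that were rejected.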


## Different Search functions

Reference:\
[Azure Cognitive Search Vector Search with Azure OpenAI](https://github.com/Azure/cognitive-search-vector-pr/blob/main/demo-python/code/azure-search-vector-python-sample.ipynb)

In [7]:
testQuery = "federal agency green energy bill"

#### Vector Similarity Search

In [8]:
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential  
from azure_openai_setup import get_embeddings_from_text

cog_search_key_credential = AzureKeyCredential(azureSearchAdminKey) # TODO: use a query key instead of the admin key
search_client = SearchClient(
                    endpoint=azureSearchEndpoint, 
                    index_name=azureSearchIndexName, 
                    credential=cog_search_key_credential
                )
results = search_client.search(
                            search_text=None,
                            vector=get_embeddings_from_text(testQuery),
                            top_k=1,
                            vector_fields="content_vector",
                            select=["title", "content", "bill_id"]
                        )

for result in results:
    #print(f"Bill Id: {result['bill_id']}")
    print(f"Score: {result['@search.score']}")
    #print(f"Title: {result['title']}")
    #print(f"Summary Length: {result['sum_len']}")
    #print(f"Content: {result['content']}")
    #result

SerializationError: (', DeserializationError: (", AttributeError: \'float\' object has no attribute \'lower\'", \'Unable to deserialize to object: type\', AttributeError("\'float\' object has no attribute \'lower\'"))', 'Unable to build a model: (", AttributeError: \'float\' object has no attribute \'lower\'", \'Unable to deserialize to object: type\', AttributeError("\'float\' object has no attribute \'lower\'"))', DeserializationError(", AttributeError: 'float' object has no attribute 'lower'", 'Unable to deserialize to object: type', AttributeError("'float' object has no attribute 'lower'")))
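
As background for the query above: vector similarity search embeds the query text and returns the documents whose stored vectors are closest to it. The sketch below shows the idea with a brute-force cosine similarity over toy two-dimensional vectors. It is an illustration only. The service uses an approximate nearest-neighbor index rather than a full scan, and the metric it applies depends on the index definition in the helper module, which is not shown here. `cosine_similarity` and `nearest_documents` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)

def nearest_documents(query_vector, documents, k=1):
    """Return the k (score, id) pairs with the highest cosine similarity."""
    scored = [
        (cosine_similarity(query_vector, doc["content_vector"]), doc["id"])
        for doc in documents
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors; real embeddings have many more dimensions.
toy_documents = [
    {"id": "a", "content_vector": [1.0, 0.0]},
    {"id": "b", "content_vector": [0.0, 1.0]},
    {"id": "c", "content_vector": [1.0, 1.0]},
]
nearest = nearest_documents([1.0, 0.1], toy_documents, k=1)
```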

In [17]:
# Pure Vector Search, multi-lingual (e.g., 'tools for software development' in Dutch)
testQuery = "tools voor softwareontwikkeling"  
 
results = search_client.search(  
            search_text=None,  
            vector=get_embeddings_from_text(testQuery),
            top_k=1,
            vector_fields="content_vector",
            select=["title", "content", "bill_id"]
        )

for result in results:
    #print(f"Bill Id: {result['bill_id']}")
    print(f"Score: {result['@search.score']}")
    #print(f"Title: {result['title']}")
    #print(f"Summary Length: {result['sum_len']}")
    #print(f"Content: {result['content']}")
    print(f"Bill Id: {result['bill_id']}\n")
    

SerializationError: (', DeserializationError: (", AttributeError: \'float\' object has no attribute \'lower\'", \'Unable to deserialize to object: type\', AttributeError("\'float\' object has no attribute \'lower\'"))', 'Unable to build a model: (", AttributeError: \'float\' object has no attribute \'lower\'", \'Unable to deserialize to object: type\', AttributeError("\'float\' object has no attribute \'lower\'"))', DeserializationError(", AttributeError: 'float' object has no attribute 'lower'", 'Unable to deserialize to object: type', AttributeError("'float' object has no attribute 'lower'")))

## Perform a Cross-Field Vector Search

In [None]:
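# Cross-Field Vector Search
# Note: this assumes the index also defines and populates a 'title_vector' field;
# the documents uploaded above only carry 'content_vector'.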
testQuery = "tools for software development"  

results = search_client.search(
                            search_text=None,
                            vector=get_embeddings_from_text(testQuery),
                            top_k=1,
                            vector_fields="title_vector, content_vector",
                            select=["title", "content", "bill_id"]
            )  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Bill Id: {result['bill_id']}\n")


## Perform a Pure Vector Search with a filter

In [None]:
# Pure Vector Search with Filter
testQuery = "tools for software development"  
  
results = search_client.search(
                            search_text=None,
                            vector=get_embeddings_from_text(testQuery),
                            top_k=1,
                            vector_fields="content_vector",
                            filter="bill_id eq '110_hr37'",
                            select=["title", "content", "bill_id"]
            )  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Bill Id: {result['bill_id']}\n")
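
The `filter` argument above is an OData expression written by hand. When the value comes from user input or from data, single quotes inside it need to be doubled so the expression stays valid. A minimal sketch, assuming simple string equality filters only; `odata_string_literal` and `eq_filter` are hypothetical helpers, not part of the SDK, and the `title` example assumes that field is filterable in the index.

```python
def odata_string_literal(value: str) -> str:
    """Quote a string for OData, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"

def eq_filter(field: str, value: str) -> str:
    """Build a '<field> eq <literal>' filter expression."""
    return f"{field} eq {odata_string_literal(value)}"

bill_filter = eq_filter("bill_id", "110_hr37")
# A made-up title, used only to show the quote handling.
tricky_filter = eq_filter("title", "Veterans' Benefits Act")
```

The first expression matches the hand-written filter in the cell above and could be passed as `filter=bill_filter`.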


## Perform a Hybrid Search

In [None]:
# Hybrid Search
testQuery = "scalable storage solution"
  
results = search_client.search(
                            search_text=testQuery,
                            vector=get_embeddings_from_text(testQuery),
                            top_k=1,
                            vector_fields="content_vector",
                            filter="bill_id eq '110_hr37'",
                            select=["title", "content", "bill_id"],
                            top=3
            )  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Bill Id: {result['bill_id']}\n")
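
A hybrid query runs the keyword search and the vector search side by side and merges the two ranked lists. To our knowledge the service documents Reciprocal Rank Fusion (RRF) for that merge step. The sketch below shows how RRF works in general. It is not the service's implementation, the constant `k=60` is a commonly used default rather than a confirmed setting, and the rankings are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Illustrative rankings, not real search output.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_c", "doc_a"]
fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

A document that ranks well in both lists ends up ahead of one that ranks first in only a single list, which is the behavior a hybrid query aims for.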


## Perform a Semantic Hybrid Search

In [None]:
testQuery = "scalable storage solution"

from azure_cognitive_search_setup import semantic_search_config_name_from_index_name

semantic_config_name = semantic_search_config_name_from_index_name(azureSearchIndexName)

results = search_client.search(
                            search_text=testQuery,
                            vector=get_embeddings_from_text(testQuery),
                            top_k=1,
                            vector_fields="content_vector",
                            filter="bill_id eq '110_hr37'",
                            select=["title", "content", "bill_id"],
                            query_type="semantic", 
                            query_language="en-us", 
                            semantic_configuration_name=semantic_config_name, 
                            query_caption="extractive", 
                            query_answer="extractive",
                            top=3
            )  
  
semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Title: {result['title']}")
    print(f"Content: {result['content']}")
    print(f"Bill Id: {result['bill_id']}")

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")


## TODO
Get the retrieved text from Azure Cognitive Search and then use it in the prompt for Azure OpenAI.
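
One possible shape for this step, as a sketch only: take the fields selected in the queries above (`title`, `content`, `bill_id`) and fold them into a grounded prompt. `build_prompt`, the prompt wording, and the per-document character cap are all assumptions made for illustration; the sample records are placeholders, not real search output.

```python
def build_prompt(question, search_results, max_chars_per_doc=500):
    """Combine retrieved documents and a question into one prompt string."""
    context_blocks = []
    for number, doc in enumerate(search_results, start=1):
        # Truncate long summaries to keep the prompt within a size budget.
        snippet = doc["content"][:max_chars_per_doc]
        context_blocks.append(f"[{number}] {doc['title']} ({doc['bill_id']})\n{snippet}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder records shaped like the selected search fields.
sample_results = [
    {"title": "Example bill title", "content": "Example summary text.", "bill_id": "example_id"},
]
prompt = build_prompt("What does the bill change?", sample_results)
```

The resulting string would then be sent to the Azure OpenAI deployment loaded at the top of the notebook.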

In [None]:
from azure.search.documents.indexes.models import GetIndexStatisticsResult

