# Azure Cognitive Search Vector Search Using LangChain

Use Azure Cognitive Search to retrieve relevant content and build an effective prompt for Azure OpenAI. The example below uses LangChain modules to perform the task.

## Setup
#### Follow the [README](https://github.com/tirtho/open-ai/blob/main/README.md) and complete the setup before running the notebooks

#### References:
- [Azure OpenAI](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/overview)
- [LangChain home page](https://python.langchain.com/docs/get_started/introduction.html)
- [Azure Cognitive Search](https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search)
- [Azure Cognitive Search as vector store](https://github.com/hwchase17/langchain/pull/5146/files/ef78d38fd12a6edcf6b04ab06493305d0d601ac3..f9b67d653854ef08e3dc56563964bb86deba9d8e)
- [LangChain Data connection Vector store integration with Azure Cognitive Search](https://python.langchain.com/docs/modules/data_connection/vectorstores/integrations/azuresearch)

#### Load the API key and relevant Python libraries

#### Install the python libraries
- > pip install openai num2words matplotlib plotly scipy scikit-learn pandas tiktoken 

- > pip install --index-url=https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/ azure-search-documents==11.4.0a20230509004



#### Load the API keys

In [1]:
import openai
import sys

from azure_openai_setup import set_openai_config, get_openai_global_config_parameters

set_openai_config()

theOpenAIParams, modelName, modelDeploymentName = get_openai_global_config_parameters()

Got Azure OpenAI Credentials from Azure Key Vault with Azure CLI Auth


#### Get the Azure Cognitive Search keys from Azure Key Vault
Note: You need the Search admin key, because creating an index and uploading documents require write access.

In [2]:
from azure_cognitive_search_setup import set_cognitive_search_config, create_cognitive_search_index

azureSearchAdminKey, azureSearchEndpoint, azureSearchIndexName = set_cognitive_search_config()

Getting Azure Cognitive Search Credentials from Azure Key Vault with Azure CLI Auth


#### Create the Search Index in Azure Cognitive Search
<font color=red>Note: This will delete your existing index</font>

In [3]:
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  

cog_search_key_credential = AzureKeyCredential(azureSearchAdminKey)
cog_search_index_client = SearchIndexClient(
                                endpoint=azureSearchEndpoint,
                                credential=cog_search_key_credential  
                            )

search_index = create_cognitive_search_index(
                    index_name=azureSearchIndexName,
                    search_index_client=cog_search_index_client
               )
print(f' {search_index.name} created')    

 tr-demo-billsum-index created


#### Other modules needed

In [4]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import BaseRetriever
from langchain.vectorstores.azuresearch import AzureSearch

#### Create the Azure OpenAI Embeddings and AzureSearch objects:

In [5]:
from azure_openai_setup import get_azure_openai_embeddings 

embeddings = get_azure_openai_embeddings()

In [6]:
vector_store: AzureSearch = AzureSearch(
                                azure_search_endpoint=azureSearchEndpoint,
                                azure_search_key=azureSearchAdminKey,
                                index_name=azureSearchIndexName,
                                embedding_function=embeddings.embed_query,
                            )

### Load the BillSum Dataset
BillSum is a dataset of United States Congressional and California state bills. For illustration purposes, we'll look only at the US bills. The corpus consists of bills from the 103rd-115th (1993-2018) sessions of Congress. The data was split into 18,949 train bills and 3,269 test bills. The BillSum corpus focuses on mid-length legislation, from 5,000 to 20,000 characters long. More information about the project, and the academic paper from which this dataset is derived, can be found in the BillSum project's GitHub repository.

We saved it in ../data/bill_sum_data.csv

#### Load, clean up, select the text, summary and title columns, and select rows with fewer than 8192 tokens

In [7]:
from num2words import num2words
import os
import pandas as pd
import numpy as np

In [8]:
from langchain.document_loaders import DataFrameLoader

# Assumes bill_sum_data_curated.csv is in a data/ folder under the directory Jupyter is running from
df = pd.read_csv(os.path.join(os.getcwd(), 'data', 'bill_sum_data_curated.csv'))
# Keep only the columns needed for indexing
df_bills = df[['bill_id', 'title', 'summary', 'sum_len']]

loader = DataFrameLoader(df_bills, page_content_column="summary")
docs = loader.load()
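
The heading above mentions keeping only rows with fewer than 8192 tokens, a step the cell does not show. A minimal sketch of such a filter follows. It is written against plain dictionaries with a pluggable token counter so it runs on its own; the function names `filter_by_token_count` and `whitespace_token_count` are hypothetical, and the whitespace counter is only a stand-in for a real tokenizer.

```python
from typing import Callable, Dict, Iterable, List


def filter_by_token_count(
    records: Iterable[Dict[str, str]],
    text_key: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 8192,
) -> List[Dict[str, str]]:
    """Keep only the records whose text has fewer than max_tokens tokens."""
    return [r for r in records if count_tokens(r[text_key]) < max_tokens]


def whitespace_token_count(text: str) -> int:
    # Stand-in tokenizer (an assumption for this sketch): one token per
    # whitespace-separated word. A model tokenizer will give different counts.
    return len(text.split())


rows = [
    {"bill_id": "a", "summary": "short summary text"},
    {"bill_id": "b", "summary": "word " * 10},
]

# A tiny limit is used here so the toy data shows the filter dropping a row.
kept = filter_by_token_count(rows, "summary", whitespace_token_count, max_tokens=5)
print([r["bill_id"] for r in kept])
```

With `tiktoken` installed, the counter could instead be `lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s))`, assuming that encoding matches the embedding model in use. On the DataFrame itself, the same idea could be expressed as `df[df["summary"].map(count_tokens) < 8192]`.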

In [9]:
results = vector_store.add_documents(documents = docs)
print("Stored %s documents with embeddings in Azure Cognitive Search" %(len(results)))

Stored 20 documents with embeddings in Azure Cognitive Search


## Different Search functions

[LangChain API Reference Docs](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.azuresearch.AzureSearch.html#langchain.vectorstores.azuresearch.AzureSearch.semantic_hybrid_search)

In [10]:
testQuery = "federal agency green energy bill"

In [11]:
# Return docs most similar to query using the LangChain API
search_result_docs = vector_store.similarity_search(
                        query=testQuery,
                        k=2, # return the two nearest neighbors
                        search_type="similarity" # omit this argument to try a hybrid search
                     )

for doc in search_result_docs:
    print("Doc: %s\n" %doc)

Doc: page_content="Directs the President, in coordination with designated Secretaries, to establish: (1) a demonstration program for fuel cell proton exchange membrane technology for commercial, residential, and transportation applications within the Secretaries' respective areas. And (2) a comprehensive proton exchange membrane fuel cell bus demonstration program to address hydrogen production, storage, and use in transit bus applications. Mandates that each Federal agency that maintains a motor vehicle fleet develop a plan for fleet transition to vehicles powered by fuel cell technology. Directs the Secretary of Energy to establish a fuel cell technology grant program for State or local government to meet their energy requirements, including such technology as a motor vehicle power source. Authorizes appropriations." metadata={'bill_id': '106_hr5585', 'title': 'Energy Independence Act of 2000', 'sum_len': 810}

Doc: page_content='Full-Service Schools Act - Establishes the Federal Int
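
To make the idea behind `k=2` similarity search concrete, here is a small self-contained sketch that ranks toy vectors by cosine similarity and keeps the top k. It is an illustration only: the vectors are made up, the helper names are hypothetical, and Azure Cognitive Search relies on its own vector index and configured metric rather than this brute-force loop.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (made up for illustration)
toy_doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vector = [1.0, 0.1, 0.0]

nearest = top_k(toy_query_vector, toy_doc_vectors, k=2)
print([index for index, _ in nearest])
```

In the notebook, the embedding function plays the role of producing `toy_query_vector`, and the search service plays the role of `top_k` over the stored document vectors.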

## TODO
Retrieve the matching text from Azure Cognitive Search and use it in the prompt for Azure OpenAI.
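
One possible shape for this step is sketched below: the retrieved documents are concatenated into a context block and placed ahead of the question. This is a sketch under assumptions, not the notebook's implementation. `RetrievedDoc` is a hypothetical stand-in exposing the same `page_content` and `metadata` attributes seen in the search output, and `build_prompt`, its prompt wording, and the `max_chars` budget are illustrative choices.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrievedDoc:
    # Hypothetical stand-in for a search result; only the two attributes
    # used below (page_content and metadata) are modelled.
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def build_prompt(question: str, docs: List[RetrievedDoc], max_chars: int = 4000) -> str:
    """Assemble a grounded prompt from retrieved documents and a question."""
    context_parts = []
    used = 0
    for doc in docs:
        title = doc.metadata.get("title", "untitled")
        part = f"[{title}]\n{doc.page_content}"
        # Stop adding documents once the character budget would be exceeded
        if used + len(part) > max_chars:
            break
        context_parts.append(part)
        used += len(part)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Sample text excerpted from the search output shown earlier
sample_docs = [
    RetrievedDoc(
        page_content=(
            "Mandates that each Federal agency that maintains a motor vehicle "
            "fleet develop a plan for fleet transition to vehicles powered by "
            "fuel cell technology."
        ),
        metadata={"title": "Energy Independence Act of 2000"},
    )
]

prompt = build_prompt("federal agency green energy bill", sample_docs)
print(prompt)
```

Because only `page_content` and `metadata` are read, the `search_result_docs` list returned by the similarity search could be passed in place of `sample_docs`, and the resulting string sent to the Azure OpenAI deployment.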

In [None]:
from azure.search.documents.indexes.models import GetIndexStatisticsResult

