# Vector search in Python (Azure AI Search)

This notebook demonstrates how to use the Azure AI Search push API to load vectors into a search index:

+ Create an index schema
+ Load the sample data from a local folder
+ Embed the documents in-memory using Azure OpenAI's text-embedding-ada-002 model
+ Index the vector and nonvector fields on Azure AI Search
+ Run a series of vector and hybrid (text + vector) queries, including metadata filtering

The code uses Azure OpenAI to generate embeddings for title and content fields. You'll need access to Azure OpenAI to run this demo.

The code reads the PDF documents in the data directory, which contains the input files for which embeddings are generated.

The output is a combination of human-readable text and embeddings that can be pushed into a search index.

## Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access). You must have the Azure OpenAI service name and an API key.

+ A deployment of the text-embedding-ada-002 embedding model.

+ Azure AI Search, any tier, but choose a service that has sufficient capacity for your vector index. We recommend Basic or higher. [Enable semantic ranking](https://learn.microsoft.com/azure/search/semantic-how-to-enable-disable) if you want to run the hybrid query with semantic ranking.

We used Python 3.11, [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/python/python-tutorial), and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) to test this example.
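
To get a rough sense of what "sufficient capacity" means, the raw size of the vector data can be estimated from the number of chunks and the embedding width. The sketch below assumes 1536-dimensional vectors stored as 4-byte single-precision floats, matching the index defined later in this notebook. The function name and the chunk count are hypothetical, and the estimate ignores HNSW graph overhead, so the actual vector index size will be larger.

```python
def estimate_raw_vector_bytes(num_chunks: int, dimensions: int = 1536,
                              bytes_per_value: int = 4) -> int:
    """Lower-bound estimate of vector storage: one float per dimension per chunk."""
    if num_chunks < 0 or dimensions <= 0 or bytes_per_value <= 0:
        raise ValueError("num_chunks must be >= 0; dimensions and bytes_per_value must be > 0")
    return num_chunks * dimensions * bytes_per_value


# Hypothetical corpus of 10,000 chunks (an assumption for illustration only).
size_bytes = estimate_raw_vector_bytes(10_000)
print(f"{size_bytes / 1024**2:.1f} MiB")  # prints "58.6 MiB"
```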

### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [1]:
! pip install -r requirements.txt --quiet

## Import required libraries and environment variables

In [4]:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os

load_dotenv(override=True) # take environment variables from .env.

# The following variables from your .env file are used in this notebook
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
# The admin key is optional: if it is missing or empty, fall back to DefaultAzureCredential
search_admin_key = os.getenv("AZURE_SEARCH_ADMIN_KEY", "")
credential = AzureKeyCredential(search_admin_key) if search_admin_key else DefaultAzureCredential()
index_name = os.environ["AZURE_SEARCH_INDEX"]
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
# The Azure OpenAI key is optional too: None selects token-based authentication in the next cell
azure_openai_key = os.getenv("AZURE_OPENAI_KEY") or None
azure_openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]
embedding_model_name = os.environ["AZURE_OPENAI_EMBEDDING_MODEL_NAME"]
azure_openai_api_version = os.environ["AZURE_OPENAI_API_VERSION"]

blob_container = os.environ.get("AZURE_BLOB_CONTAINER")
blob_connection_string = os.environ.get("AZURE_BLOB_CONNECTION_STRING")
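
The cell above reads its required settings with `os.environ[...]`, so a missing entry in the .env file surfaces as a `KeyError` for one variable at a time. As a minimal sketch, assuming the variable names used above, a check like the following could report every missing setting in one pass. The helper name `missing_settings` is hypothetical and not part of the sample.

```python
import os

# Required settings, taken from the cell above. The two keys are left out
# because the notebook treats them as optional.
REQUIRED_VARS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_INDEX",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
    "AZURE_OPENAI_EMBEDDING_MODEL_NAME",
    "AZURE_OPENAI_API_VERSION",
]


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
```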

In [5]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
import json

openai_credential = DefaultAzureCredential()
token_provider = get_bearer_token_provider(openai_credential, "https://cognitiveservices.azure.com/.default")

openai_client = AzureOpenAI(
    azure_deployment=azure_openai_embedding_deployment,
    api_version=azure_openai_api_version,
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_key,
    azure_ad_token_provider=token_provider if not azure_openai_key else None
)

## Create your search index

Create your search index schema and vector search configuration. If you get an error, check the search service for available quota and check the .env file to make sure you're using a unique search index name.

In [7]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex,
    AzureOpenAIVectorizer,
    AzureOpenAIParameters
)


# Create a search index
index_client = SearchIndexClient(
    endpoint=endpoint, credential=credential)
fields=[
    SearchField(name="chunk_id",type=SearchFieldDataType.String,key=True,filterable=True,sortable=True,searchable=True,analyzer_name="keyword"),
    SearchField(name="parent_id",type=SearchFieldDataType.String,filterable=True,sortable=True,searchable=True),
    SearchField(name="chunk",type=SearchFieldDataType.String,searchable=True),
    SearchField(name="title",type=SearchFieldDataType.String,searchable=True),
    SearchField(name="vector",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,vector_search_dimensions=1536,vector_search_profile_name="profile"
                )
]

vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="hnsw-algorithm"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="profile",
            algorithm_configuration_name="hnsw-algorithm",
            # vectorizer="azure-openai-vectorizer"
        )
    ],
    # vectorizers=[
    #     AzureOpenAIVectorizer(
    #             name="azure-openai-vectorizer",
    #             azure_open_ai_parameters=AzureOpenAIParameters(
    #                 resource_uri=azure_openai_endpoint,
    #                 deployment_id=azure_openai_embedding_deployment,
    #                 api_key=azure_openai_key # Optional if using RBAC authentication
    #             )
    #         )
    # ]
)


semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="chunk")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search,
                    semantic_search=semantic_search
                    )
result = index_client.create_or_update_index(index)
print(f' {result.name} index created')


 sample-docs index created
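
The `vector` field is declared with `vector_search_dimensions=1536`, so each uploaded document is expected to carry a vector of exactly that length, and the service is expected to reject a mismatch. A minimal sketch of a pre-upload check, assuming the field names from the schema above, might look like this. The helper name `find_invalid_documents` is hypothetical.

```python
EXPECTED_FIELDS = ("chunk_id", "parent_id", "chunk", "title", "vector")
VECTOR_DIMENSIONS = 1536  # must match vector_search_dimensions in the index schema


def find_invalid_documents(docs, dimensions=VECTOR_DIMENSIONS):
    """Return (position, reason) pairs for documents that do not fit the schema."""
    problems = []
    for position, doc in enumerate(docs):
        missing = [field for field in EXPECTED_FIELDS if field not in doc]
        if missing:
            problems.append((position, f"missing fields: {missing}"))
        elif len(doc["vector"]) != dimensions:
            problems.append(
                (position, f"vector has {len(doc['vector'])} dimensions, expected {dimensions}")
            )
    return problems


# Illustrative documents only: the second one has a vector that is too short.
sample_docs = [
    {"chunk_id": "a_0_0", "parent_id": "a", "chunk": "text", "title": "a.pdf", "vector": [0.0] * 1536},
    {"chunk_id": "a_0_1", "parent_id": "a", "chunk": "text", "title": "a.pdf", "vector": [0.0] * 3},
]
print(find_invalid_documents(sample_docs))
```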


## Create embeddings
Read your data, generate Azure OpenAI embeddings, and shape the results into documents that can be inserted into your Azure AI Search index:

In [8]:

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from azure.search.documents import SearchIndexingBufferedSender
import os
import tiktoken
import hashlib

docs_folder = "../data/llms/"
formatted_chunks = []

def hash_string(input_string):
    input_bytes = input_string.encode("utf-8")
    return hashlib.sha256(input_bytes).hexdigest()

# from_tiktoken_encoder enables us to split on tokens rather than characters.
# The splitter does not depend on the file, so it is created once, outside the loop.
recursive_text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name=tiktoken.encoding_for_model("gpt-3.5-turbo").name,
    chunk_size=600,
    chunk_overlap=125
)

for file in os.listdir(docs_folder):
    print(f"Generating embeddings for {file}")

    loader = PyPDFLoader(os.path.join(docs_folder, file))
    pages = loader.load()

    recursive_text_splitter_chunks = recursive_text_splitter.split_documents(pages)
    # Removing full path from filenames
    for chunk in recursive_text_splitter_chunks:
        chunk.metadata["file_name"] = os.path.basename(chunk.metadata['source'])

    chunk_content = [chunk.page_content for chunk in recursive_text_splitter_chunks]

    recursive_text_splitter_embeddings = openai_client.embeddings.create(input=chunk_content, model=embedding_model_name)
    recursive_text_splitter_embeddings = [result.embedding for result in recursive_text_splitter_embeddings.data]

    
    formatted_chunk = [
        {
            "chunk_id": f"{hash_string(chunk.metadata['file_name'])}_{chunk.metadata['page']}_{i}",
            "parent_id": hash_string(chunk.metadata['file_name']),
            "chunk": chunk.page_content,
            "title": chunk.metadata["file_name"],
            "vector": recursive_text_splitter_embeddings[i]
        }
        for i, chunk in enumerate(recursive_text_splitter_chunks)
    ]

    formatted_chunks.append(formatted_chunk)

Generating embeddings for Precise_Zero-Shot_Dense_Retrieval_without_Relevance_Labels.pdf
Generating embeddings for Self-Consistency_Improves_Chain-of-Thought_Reasonsing_in_LLMs.pdf
Generating embeddings for LLMs_are_Human-Level_Prompt_Engineers.pdf
Generating embeddings for Chain-of-Thought_Prompting_Elicits_Reasoning_in_LLMs.pdf
Generating embeddings for Prefix-Tuning_Optimizing_Continuous_Prompts_for_Generation.pdf
Generating embeddings for AutoPrompt_Eliciting_Knowledge_From_LanguageModels.pdf
Generating embeddings for Generated_Knowledge_Prompting_for_Commonsense_Reasoning.pdf
Generating embeddings for Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.pdf
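
The splitter above works in tokens, with `chunk_size=600` and `chunk_overlap=125`. To show what those two numbers mean, here is a simplified fixed-window sketch over a list of tokens. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and sentences), and the integer list is only a stand-in for real token IDs.

```python
def sliding_window_chunks(tokens, chunk_size=600, chunk_overlap=125):
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


tokens = list(range(1500))  # stand-in for 1,500 token IDs
chunks = sliding_window_chunks(tokens)
print([len(chunk) for chunk in chunks])  # prints [600, 600, 550]
```

Each window starts 475 tokens after the previous one, so the last 125 tokens of one chunk reappear at the start of the next. The overlap keeps sentences that straddle a boundary retrievable from either chunk.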


## Insert text and embeddings into vector store
Upload the chunk text, metadata, and embeddings prepared in the previous step to the search index:

In [9]:
from azure.search.documents import SearchClient
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)

for docs in formatted_chunks:    
    result = search_client.upload_documents(docs)
    print(f"Uploaded chunks for file {docs[0]['title']}")

Uploaded chunks for file Precise_Zero-Shot_Dense_Retrieval_without_Relevance_Labels.pdf
Uploaded chunks for file Self-Consistency_Improves_Chain-of-Thought_Reasonsing_in_LLMs.pdf
Uploaded chunks for file LLMs_are_Human-Level_Prompt_Engineers.pdf
Uploaded chunks for file Chain-of-Thought_Prompting_Elicits_Reasoning_in_LLMs.pdf
Uploaded chunks for file Prefix-Tuning_Optimizing_Continuous_Prompts_for_Generation.pdf
Uploaded chunks for file AutoPrompt_Eliciting_Knowledge_From_LanguageModels.pdf
Uploaded chunks for file Generated_Knowledge_Prompting_for_Commonsense_Reasoning.pdf
Uploaded chunks for file Power_of_Scale_for_Parameter-Efficient_Prompt_Tuning.pdf
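
The loop above sends one upload request per file, which works for this small sample. For larger corpora, Azure AI Search documents a limit of 1,000 documents (or roughly 16 MB) per indexing request, so it is worth checking the current service limits. The sketch below shows one way to split documents into batches. The helper name `iter_batches` is hypothetical, and the upload calls are left as comments because they need a live service.

```python
from itertools import islice


def iter_batches(documents, batch_size=1000):
    """Yield consecutive lists of at most batch_size documents."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(documents)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Assumed usage with the objects created earlier in this notebook:
# all_docs = [doc for docs in formatted_chunks for doc in docs]
# for batch in iter_batches(all_docs):
#     search_client.upload_documents(batch)

print([len(batch) for batch in iter_batches(range(2500))])  # prints [1000, 1000, 500]
```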


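### Upload documents using Azure AI Search indexers (alternative approach)
Documents are expected to be available in Azure Storage. The data source below uses the `azureblob` type with the container and connection string from your .env file.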

In [10]:
from azure.search.documents.indexes.models import (
    SearchIndexerSkillset,
    SearchIndexer,
    SearchIndexerIndexProjections,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    SearchIndexerDataSourceConnection,
    SearchIndexerDataContainer
)

from azure.search.documents.indexes import SearchIndexerClient

from azure.search.documents.indexes.models import (
    AzureOpenAIEmbeddingSkill,
    SplitSkill
)

def create_search_skillset(
        skillset_name,
        index_name,
        azure_openai_endpoint,
        azure_openai_embedding_deployment_id,
        azure_openai_key=None,
        text_split_mode='pages',
        maximum_page_length=2000,
        page_overlap_length=500):
    return SearchIndexerSkillset(
        name=skillset_name,
        skills=[
            SplitSkill(
                name="Text Splitter",
                default_language_code="en",
                text_split_mode=text_split_mode,
                maximum_page_length=maximum_page_length,
                page_overlap_length=page_overlap_length,
                context="/document",
                inputs=[
                    InputFieldMappingEntry(
                        name="text",
                        source="/document/content"
                    )
                ],
                outputs=[
                    OutputFieldMappingEntry(
                        name="textItems",
                        target_name="pages"
                    )
                ]
            ),
            AzureOpenAIEmbeddingSkill(
                name="Embeddings",
                resource_uri=azure_openai_endpoint,
                deployment_id=azure_openai_embedding_deployment_id,
                api_key=azure_openai_key, # Optional if using RBAC authentication
                context="/document/pages/*",
                inputs=[
                    InputFieldMappingEntry(
                        name="text",
                        source="/document/pages/*"
                    )
                ],
                outputs=[
                    OutputFieldMappingEntry(
                        name="embedding",
                        target_name="vector"
                    )
                ]
            )
        ],
        index_projections=SearchIndexerIndexProjections(
            selectors=[
                SearchIndexerIndexProjectionSelector(
                    target_index_name=index_name,
                    parent_key_field_name="parent_id",
                    source_context="/document/pages/*",
                    mappings=[
                        InputFieldMappingEntry(
                            name="chunk",
                            source="/document/pages/*"
                        ),
                        InputFieldMappingEntry(
                            name="vector",
                            source="/document/pages/*/vector"
                        ),
                        InputFieldMappingEntry(
                            name="title",
                            source="/document/metadata_storage_name"
                        )
                    ]
                )
            ],
            parameters=SearchIndexerIndexProjectionsParameters(projection_mode="skipIndexingParentDocuments")
        )
    )

search_indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)


data_source = SearchIndexerDataSourceConnection(
        name="blob-source",
        type="azureblob",
        connection_string=blob_connection_string,
        container=SearchIndexerDataContainer(
            name=blob_container
        )
    )
search_indexer_client.create_or_update_data_source_connection(data_source)

skillset = create_search_skillset(
    "document-processor",
    index_name,
    azure_openai_endpoint,
    azure_openai_embedding_deployment,
    azure_openai_key,
    text_split_mode='pages',
    maximum_page_length=2000,
    page_overlap_length=500
)

search_indexer_client.create_or_update_skillset(skillset)

indexer = SearchIndexer(
        name="document-indexer",
        data_source_name=data_source.name,
        target_index_name=index_name,
        skillset_name=skillset.name
    )

search_indexer_client.create_or_update_indexer(indexer)
search_indexer_client.run_indexer(indexer.name)
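
`run_indexer` only starts the job, so the documents are not searchable until the run finishes. As a hedged sketch, the status can be polled through `get_indexer_status`. The helper name, polling interval, and timeout below are hypothetical, and the `"inProgress"` status string is an assumption based on the service's indexer execution status values. The function accepts any object that exposes `get_indexer_status`, so it can be exercised without a live service.

```python
import time


def wait_for_indexer(client, indexer_name, poll_seconds=10.0,
                     timeout_seconds=600.0, sleep=time.sleep):
    """Poll until the indexer's last run leaves the in-progress state; return that state."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = client.get_indexer_status(indexer_name)
        # last_result can be None before the first run has been recorded.
        state = getattr(status.last_result, "status", None)
        if state is not None and state != "inProgress":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Indexer {indexer_name!r} did not finish in {timeout_seconds}s")
        sleep(poll_seconds)


# Assumed usage with the client and indexer defined above (needs a live service):
# final_state = wait_for_indexer(search_indexer_client, indexer.name)
# print(f"Indexer finished with status: {final_state}")
```

Right after `run_indexer` returns, `last_result` may still describe an earlier run, so a production version would likely also compare start times.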

## Perform a vector similarity search

This example shows a pure vector search. Because the vectorizer in the index definition is commented out, the query text is embedded client-side with Azure OpenAI and the resulting vector is passed to the search client as a `VectorizedQuery`.

In [11]:
from azure.search.documents.models import VectorizedQuery

# Pure Vector Search
query = "prompt engineering approaches"  
  
embedding = openai_client.embeddings.create(input=query, model=embedding_model_name).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="vector")
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    # filter="category eq 'Developer Tools'",
    select=["title", "chunk"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Chunk: {result['chunk']}")  


Title: LLMs_are_Human-Level_Prompt_Engineers.pdf
Score: 0.8611069
Chunk: A P ROMPT ENGINEERING IN THE WILD
Large models with natural language interfaces, including models for text generation and image
synthesis, have seen an increasing amount of public usage in recent years. As ﬁnding the right prompt
can be difﬁcult for humans, a number of guides on prompt engineering as well as tools to aid in
prompt discovery have been developed. Among others, see, for example:
•https://blog.andrewcantino.com/blog/2021/04/21/prompt-engineering-tips-and-tricks/
•https://techcrunch.com/2022/07/29/a-startup-is-charging-1-99-for-strings-of-text-to-feed-to-dall-e-2/
•https://news.ycombinator.com/item?id=32943224
•https://promptomania.com/stable-diffusion-prompt-builder/
•https://huggingface.co/spaces/Gustavosta/MagicPrompt-Stable-Diffusion
In this paper we apply APE to generate effective instructions for steering LLMs, but the general
framework Algorithm 1 could be applied to steer other models with natu
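
The `Score` value above is not the raw cosine similarity. The HNSW configuration in this notebook does not set a metric, so the sketch below assumes the service default of cosine. For that metric, Azure AI Search documents `@search.score` as `1 / (1 + distance)`, where the distance is `1 - cosine similarity`. The function names are hypothetical, and the conversion only holds under the cosine assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def score_from_cosine(similarity):
    """Map a cosine similarity in [-1, 1] to a score in [1/3, 1]."""
    return 1.0 / (1.0 + (1.0 - similarity))


def cosine_from_score(score):
    """Invert score_from_cosine: score = 1 / (2 - cos), so cos = 2 - 1 / score."""
    return 2.0 - 1.0 / score


# The top result above reported a score of 0.8611069.
print(round(cosine_from_score(0.8611069), 4))  # prints 0.8387
```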

This example runs a pure vector search with a Dutch query to demonstrate the multilingual capabilities of the text-embedding-ada-002 model.

In [12]:
# Pure vector search, multilingual (for example, 'tools for software development' in Dutch)
query = "tools voor softwareontwikkeling"  
  
embedding = openai_client.embeddings.create(input=query, model=embedding_model_name).data[0].embedding
vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="vector")

results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["title", "chunk"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Chunk: {result['chunk']}")  


Title: LLMs_are_Human-Level_Prompt_Engineers.pdf
Score: 0.79798967
Chunk: nett (eds.), Advances in Neural Information Processing Systems , volume 31. Curran Asso-
ciates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/
7aa685b3b1dc1d6780bf36f7340078c9-Paper.pdf .
Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc
Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: Bootstrapping inductive
program synthesis with wake-sleep library learning. In Proceedings of the 42nd acm sigplan
international conference on programming language design and implementation , pp. 835–850,
2021.
Tianyu Gao. Prompting: Better ways of using language models for nlp tasks. The Gradient , 2021.
11
Title: Prefix-Tuning_Optimizing_Continuous_Prompts_for_Generation.pdf
Score: 0.7968784
Chunk: Quentin Lhoest, and Alexander M. Rush. 2020.
Transformers: State-of-the-art natural language pro-
cessing. In Proceedings of the 2020 Conference on
Empirical