Azure AI Search Integrated Vectorization 

Integrated vectorization depends on indexers and skillsets: the Text Split skill chunks the data, and the AzureOpenAIEmbedding skill calls your Azure OpenAI resource to generate embeddings.

This example uses PDFs from the data/documents folder for chunking, embedding, indexing and queries.

Prerequisites
    1. An Azure subscription with access to Azure OpenAI.
    2. Azure AI Search, any tier, although Basic or higher is recommended for this workload. Enable semantic ranker if you want to run a hybrid query with semantic ranking.
    3. A deployment of the text-embedding-ada-002 model on Azure OpenAI.
    4. Azure Blob Storage.
    5. Python 3.10 or later.

Install packages

In [2]:
! pip install -r requirements.txt --quiet


Load .env file

In [30]:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
import os

load_dotenv(override=True) # take environment variables from .env file

# Variables not used here do not need to be updated in your .env file
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
# Fall back to DefaultAzureCredential when the admin key is unset or empty
credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"]) if os.getenv("AZURE_SEARCH_ADMIN_KEY") else DefaultAzureCredential()
index_name = os.environ["AZURE_SEARCH_INDEX"]
blob_connection_string = os.environ["BLOB_CONNECTION_STRING"]
blob_container_name = os.environ["BLOB_CONTAINER_NAME"]
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.getenv("AZURE_OPENAI_KEY") or None  # None when the key is unset or empty
azure_openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]
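
Before running the remaining cells, it can help to confirm that the settings read above are actually present. The following is a minimal sketch, not part of the original notebook: the helper name missing_settings and the REQUIRED_SETTINGS list are hypothetical, and the list only covers the variables that the cell above reads without a fallback.

```python
import os

# Names taken from the cell above; the two key variables are left out
# because that cell treats an empty key as "use another credential".
REQUIRED_SETTINGS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_INDEX",
    "BLOB_CONNECTION_STRING",
    "BLOB_CONTAINER_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT",
]


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
```

An empty result means every listed variable has a non-blank value; it does not verify that the endpoints or connection string are valid.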

Connect to the Blob Storage and load documents

Upload the sample documents from the data/documents folder to Blob Storage. The container is created if it does not exist, and blobs that are already present are skipped.

In [31]:
from azure.storage.blob import BlobServiceClient  
import os

# Connect to Blob Storage
blob_service_client = BlobServiceClient.from_connection_string(blob_connection_string)
container_client = blob_service_client.get_container_client(blob_container_name)
if not container_client.exists():
    container_client.create_container()

documents_directory = os.path.join("data", "documents")
for file in os.listdir(documents_directory):
    with open(os.path.join(documents_directory, file), "rb") as data:
        name = os.path.basename(file)
        if not container_client.get_blob_client(name).exists():
            container_client.upload_blob(name=name, data=data)

print(f"Setup sample data in {blob_container_name}")

Setup sample data in hdaoai


Create a blob data source connector on Azure AI Search

In [32]:
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection
)
from azure.search.documents.indexes._generated.models import NativeBlobSoftDeleteDeletionDetectionPolicy

# Create a data source 
indexer_client = SearchIndexerClient(endpoint, credential)
container = SearchIndexerDataContainer(name=blob_container_name)
data_source_connection = SearchIndexerDataSourceConnection(
    name=f"{index_name}-blob",
    type="azureblob",
    connection_string=blob_connection_string,
    container=container,
    data_deletion_detection_policy=NativeBlobSoftDeleteDeletionDetectionPolicy()
)
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

Data source 'hdaoai-blob' created or updated


Create a search index
Vector and nonvector content is stored in a search index.

In [33]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchAlgorithmMetric,
    ExhaustiveKnnAlgorithmConfiguration,
    ExhaustiveKnnParameters,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIParameters,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
    SearchIndex
)

# Create a search index  
index_client = SearchIndexClient(endpoint=endpoint, credential=credential)  
fields = [  
    SearchField(name="parent_id", type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=True),  
    SearchField(name="title", type=SearchFieldDataType.String),  
    SearchField(name="chunk_id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),  
    SearchField(name="chunk", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False),  
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single), vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile"),  
]  
  
# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(  
            name="myHnsw",  
            parameters=HnswParameters(  
                m=4,  
                ef_construction=400,  
                ef_search=500,  
                metric=VectorSearchAlgorithmMetric.COSINE,  
            ),  
        ),  
        ExhaustiveKnnAlgorithmConfiguration(  
            name="myExhaustiveKnn",  
            parameters=ExhaustiveKnnParameters(  
                metric=VectorSearchAlgorithmMetric.COSINE,  
            ),  
        ),  
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",  
            vectorizer="myOpenAI",  
        ),  
        VectorSearchProfile(  
            name="myExhaustiveKnnProfile",  
            algorithm_configuration_name="myExhaustiveKnn",  
            vectorizer="myOpenAI",  
        ),  
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            name="myOpenAI",  
            kind="azureOpenAI",  
            azure_open_ai_parameters=AzureOpenAIParameters(  
                resource_uri=azure_openai_endpoint,  
                deployment_id=azure_openai_embedding_deployment,  
                api_key=azure_openai_key,  
            ),  
        ),  
    ],  
)  
  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(  
        content_fields=[SemanticField(field_name="chunk")]  
    ),  
)  
  
# Create the semantic search with the configuration  
semantic_search = SemanticSearch(configurations=[semantic_config])  
  
# Create the search index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created") 

hdaoai created
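
The index above defines two algorithms: HNSW, which searches approximately, and exhaustive KNN, which compares the query against every stored vector. As a rough illustration of the exhaustive case only, here is a brute-force sketch with the cosine metric. The function names and the two-dimensional toy vectors are made up for this example; the real index stores 1536-dimensional embeddings and the service does this work for you.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def exhaustive_knn(query, vectors, k):
    """Score every stored vector against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(doc_id, score) for score, doc_id in scored[:k]]


# Toy data: hypothetical ids and vectors, for illustration only
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
print(exhaustive_knn([1.0, 0.1], toy_vectors, k=2))
```

The cost grows linearly with the number of vectors, which is why an approximate structure such as HNSW is usually preferred for large indexes.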


Create a skillset

Skills drive integrated vectorization. The Text Split skill chunks the data. The AzureOpenAIEmbedding skill calls Azure OpenAI, using the connection settings from the .env file. Index projections specify the secondary index that stores the chunked data.

In [34]:
from azure.search.documents.indexes.models import (
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
    AzureOpenAIEmbeddingSkill,
    SearchIndexerIndexProjections,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    SearchIndexerSkillset
)

# Create a skillset  
skillset_name = f"{index_name}-skillset"  
  
split_skill = SplitSkill(  
    description="Split skill to chunk documents",  
    text_split_mode="pages",  
    context="/document",  
    maximum_page_length=2000,  
    page_overlap_length=500,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/content"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="textItems", target_name="pages")  
    ],  
)  
  
embedding_skill = AzureOpenAIEmbeddingSkill(  
    description="Skill to generate embeddings via Azure OpenAI",  
    context="/document/pages/*",  
    resource_uri=azure_openai_endpoint,  
    deployment_id=azure_openai_embedding_deployment,  
    api_key=azure_openai_key,  
    inputs=[  
        InputFieldMappingEntry(name="text", source="/document/pages/*"),  
    ],  
    outputs=[  
        OutputFieldMappingEntry(name="embedding", target_name="vector")  
    ],  
)  
  
index_projections = SearchIndexerIndexProjections(  
    selectors=[  
        SearchIndexerIndexProjectionSelector(  
            target_index_name=index_name,  
            parent_key_field_name="parent_id",  
            source_context="/document/pages/*",  
            mappings=[  
                InputFieldMappingEntry(name="chunk", source="/document/pages/*"),  
                InputFieldMappingEntry(name="vector", source="/document/pages/*/vector"),  
                InputFieldMappingEntry(name="title", source="/document/metadata_storage_name"),  
            ],  
        ),  
    ],  
    parameters=SearchIndexerIndexProjectionsParameters(  
        projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS  
    ),  
)  
  
skillset = SearchIndexerSkillset(  
    name=skillset_name,  
    description="Skillset to chunk documents and generate embeddings",  
    skills=[split_skill, embedding_skill],  
    index_projections=index_projections,  
)  
  
client = SearchIndexerClient(endpoint, credential)  
client.create_or_update_skillset(skillset)  
print(f"{skillset.name} created")  

hdaoai-skillset created
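
The split skill above is configured with maximum_page_length=2000 and page_overlap_length=500. To get a feel for what those two numbers mean, here is a simplified sliding-window sketch. It is an approximation written for this walkthrough, not the skill's implementation: it assumes lengths are counted in characters and cuts at fixed offsets, whereas the actual skill may adjust boundaries, for example to avoid breaking sentences. The function name split_with_overlap is hypothetical.

```python
def split_with_overlap(text, max_length, overlap):
    """Cut text into windows of at most max_length characters, where each
    window starts `overlap` characters before the previous window ended."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    step = max_length - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "x" * 5000  # stand-in for the extracted document content
chunks = split_with_overlap(sample, max_length=2000, overlap=500)
print([len(chunk) for chunk in chunks])  # [2000, 2000, 2000]
```

With these settings each new chunk repeats the last 500 characters of the previous one, so a 5000-character document yields three chunks rather than the two and a half that a split without overlap would suggest.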


Create an indexer

In [35]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping
)

# Create an indexer  
indexer_name = f"{index_name}-indexer"  
  
indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents and generate embeddings",  
    skillset_name=skillset_name,  
    target_index_name=index_name,  
    data_source_name=data_source.name,  
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results  
    field_mappings=[FieldMapping(source_field_name="metadata_storage_name", target_field_name="title")]  
)  
  
indexer_client = SearchIndexerClient(endpoint, credential)  
indexer_result = indexer_client.create_or_update_indexer(indexer)  
  
# Run the indexer  
indexer_client.run_indexer(indexer_name)  
print(f' {indexer_name} is created and running. If queries return no results, please wait a bit and try again.') 

 hdaoai-indexer is created and running. If queries return no results, please wait a bit and try again.
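
Because the indexer runs asynchronously, queries issued right away may return nothing. Instead of waiting a fixed time, one option is to poll until the run finishes. The helper below is a sketch under assumptions: wait_for_indexer is a hypothetical name, and it takes any zero-argument callable that returns a status string so that it does not depend on a particular SDK version.

```python
import time


def wait_for_indexer(get_status, timeout_seconds=300, poll_seconds=5, sleep=time.sleep):
    """Poll get_status() until it stops returning "inProgress" or time runs out.

    get_status is any zero-argument callable that returns a status string,
    or None when no run has been reported yet. Returns the last status seen.
    """
    waited = 0
    status = get_status()
    while status in (None, "inProgress") and waited < timeout_seconds:
        sleep(poll_seconds)
        waited += poll_seconds
        status = get_status()
    return status


# Hypothetical usage with the client from the cell above. This assumes that
# get_indexer_status returns an object whose last_result may be None and
# whose last_result.status compares equal to strings such as "inProgress".
#
# def current_status():
#     last = indexer_client.get_indexer_status(indexer_name).last_result
#     return last.status if last else None
#
# print(wait_for_indexer(current_status))
```

Passing the sleep function as a parameter keeps the helper easy to exercise without real delays.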


Perform a vector similarity search

In [41]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

# Pure Vector Search
query = "지능화된 연계기술"  # Korean for "intelligent integration technology"
  
search_client = SearchClient(endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
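# To pass a precomputed embedding instead of letting the service vectorize the query text,
# use a raw vector query. RawVectorQuery must be imported and generate_embeddings is a helper
# you would need to define yourself; neither is set up in this notebook.
# vector_query = RawVectorQuery(vector=generate_embeddings(query), k_nearest_neighbors=3, fields="vector")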
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")   

parent_id: aHR0cHM6Ly9hb2FpdGVhbWJsb2JzdG9yYWdlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9oZGFvYWkvUFBUJUVDJTgzJTk4JUVEJTk0JThDKCVFRCU5NSU5QyVFQSVCOCU4MClfJUVDJTg0JUI4JUVCJUFGJUI4JUVCJTgyJTk4JUVCJUIwJTlDJUVEJTkxJTlDLnBwdHg1
chunk_id: 742299af0127_aHR0cHM6Ly9hb2FpdGVhbWJsb2JzdG9yYWdlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9oZGFvYWkvUFBUJUVDJTgzJTk4JUVEJTk0JThDKCVFRCU5NSU5QyVFQSVCOCU4MClfJUVDJTg0JUI4JUVCJUFGJUI4JUVCJTgyJTk4JUVCJUIwJTlDJUVEJTkxJTlDLnBwdHg1_pages_0
Score: 0.8358228
Content: 01. Demo Session 1 – AOAI 기반 RDB연동 Chatbot (회의실 예약 시스템)
TO-BE) AOAI의 지능화된 연계기술(Function calling)을 통해 Legacy 시스템과 연계하고,
대화를 통한 추천 방식으로 전환하여 단계별로 진행되던 업무과정을 혁신적으로 간소화 할 수 있습니다.

회의실 예약업무 문제 해결 과정
RDB 연동 LLM Chatbot 구성현황

예약신청
예약취소
예약변경
회의실 전체현황
조회
회의실 조건 검색
대안 고민
(회의실 없는 경우)
반복적 고민
(인원, 빈회의실,회의환경)


기존
문제
개선
시간 소요
(매번 동일한
조건으로 입력)
단계적인 수행없이 One-Stop 업무처리 가능한 방식으로 개선 됨
변화) 검색방식  대화, 추천 방식
효과) 2분  30초 (75% 단축)

IT(DB)와 연동된 Chatbot 동작 구조

“내일 회의실 좀 예약해줘”
지난주에 하신 “TF 주간회의“
시면 6번 Room으로 오전 9시반 예약해 드릴까요?
“오늘 회의 취소 됬어”
“회의 어디서 하지?”
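
The score of 0.8358228 printed above is not the raw cosine similarity. For the cosine metric, the score is understood to be 1 / (1 + cosine distance), where cosine distance is 1 minus cosine similarity. The sketch below applies that relationship in both directions; the function names are hypothetical and the formula should be checked against the documentation for your service version.

```python
def score_from_cosine_similarity(similarity):
    """Map cosine similarity to a score of 1 / (1 + cosine distance)."""
    cosine_distance = 1.0 - similarity
    return 1.0 / (1.0 + cosine_distance)


def cosine_similarity_from_score(score):
    """Invert the mapping above: similarity = 2 - 1 / score."""
    return 2.0 - 1.0 / score


# Recover the approximate cosine similarity behind the score shown above
print(round(cosine_similarity_from_score(0.8358228), 4))
```

Under this mapping, identical vectors score 1.0 and orthogonal vectors score 0.5, so scores from the cosine metric are not directly comparable with the hybrid scores in the next section.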

Perform a hybrid search

In [42]:
# Hybrid Search
query = "지능화된 연계기술"  # Korean for "intelligent integration technology"
  
search_client = SearchClient(endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)
  
results = search_client.search(  
    search_text=query,  
    vector_queries= [vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    top=1
)  
  
for result in results:  
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['chunk']}")  

parent_id: aHR0cHM6Ly9hb2FpdGVhbWJsb2JzdG9yYWdlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9oZGFvYWkvUFBUJUVDJTgzJTk4JUVEJTk0JThDKCVFRCU5NSU5QyVFQSVCOCU4MClfJUVDJTg0JUI4JUVCJUFGJUI4JUVCJTgyJTk4JUVCJUIwJTlDJUVEJTkxJTlDLnBwdHg1
chunk_id: 742299af0127_aHR0cHM6Ly9hb2FpdGVhbWJsb2JzdG9yYWdlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9oZGFvYWkvUFBUJUVDJTgzJTk4JUVEJTk0JThDKCVFRCU5NSU5QyVFQSVCOCU4MClfJUVDJTg0JUI4JUVCJUFGJUI4JUVCJTgyJTk4JUVCJUIwJTlDJUVEJTkxJTlDLnBwdHg1_pages_0
Score: 0.03333333507180214
Content: 01. Demo Session 1 – AOAI 기반 RDB연동 Chatbot (회의실 예약 시스템)
TO-BE) AOAI의 지능화된 연계기술(Function calling)을 통해 Legacy 시스템과 연계하고,
대화를 통한 추천 방식으로 전환하여 단계별로 진행되던 업무과정을 혁신적으로 간소화 할 수 있습니다.

회의실 예약업무 문제 해결 과정
RDB 연동 LLM Chatbot 구성현황

예약신청
예약취소
예약변경
회의실 전체현황
조회
회의실 조건 검색
대안 고민
(회의실 없는 경우)
반복적 고민
(인원, 빈회의실,회의환경)


기존
문제
개선
시간 소요
(매번 동일한
조건으로 입력)
단계적인 수행없이 One-Stop 업무처리 가능한 방식으로 개선 됨
변화) 검색방식  대화, 추천 방식
효과) 2분  30초 (75% 단축)

IT(DB)와 연동된 Chatbot 동작 구조

“내일 회의실 좀 예약해줘”
지난주에 하신 “TF 주간회의“
시면 6번 Room으로 오전 9시반 예약해 드릴까요?
“오늘 회의 취소 됬어”
“회
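
The hybrid score of about 0.0333 is much smaller than the vector score because hybrid queries merge the keyword ranking and the vector ranking with Reciprocal Rank Fusion (RRF) rather than averaging the underlying scores. The sketch below shows the idea. It is an illustration, not the service's implementation: the constant k=60 and the choice to count the first position as 0 are assumptions picked because they reproduce 2/60, which matches the score above for a chunk ranked first in both lists. The chunk ids are made up.

```python
def rrf_scores(rankings, k=60):
    """Fuse several ranked lists of document ids.

    Each list contributes 1 / (k + position) per document, where the first
    position is 0. That offset is an assumption chosen to match the score
    in the output above; other descriptions of RRF start counting at 1.
    """
    fused = {}
    for ranking in rankings:
        for position, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + position)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the keyword query and the vector query
keyword_ranking = ["chunk_a", "chunk_b", "chunk_c"]
vector_ranking = ["chunk_a", "chunk_c"]
for doc_id, score in rrf_scores([keyword_ranking, vector_ranking]):
    print(doc_id, round(score, 5))
```

A document that appears in both lists accumulates two contributions, which is why chunk_c outranks chunk_b here even though chunk_b sits higher in the keyword ranking.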

Perform a hybrid search + semantic reranking

In [43]:
from azure.search.documents.models import (
    QueryType,
    QueryCaptionType,
    QueryAnswerType
)
# Semantic Hybrid Search
query = "지능화된 연계기술?"  # Korean for "intelligent integration technology?"

search_client = SearchClient(endpoint, index_name, credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=1, fields="vector", exhaustive=True)

results = search_client.search(  
    search_text=query,
    vector_queries=[vector_query],
    select=["parent_id", "chunk_id", "chunk"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=1
)

semantic_answers = results.get_answers()
if semantic_answers:
    for answer in semantic_answers:
        if answer.highlights:
            print(f"Semantic Answer: {answer.highlights}")
        else:
            print(f"Semantic Answer: {answer.text}")
        print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"parent_id: {result['parent_id']}")  
    print(f"chunk_id: {result['chunk_id']}")  
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['chunk']}")  

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")

parent_id: aHR0cHM6Ly9hb2FpdGVhbWJsb2JzdG9yYWdlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9oZGFvYWkvUFBUJUVDJTgzJTk4JUVEJTk0JThDKCVFRCU5NSU5QyVFQSVCOCU4MClfJUVDJTg0JUI4JUVCJUFGJUI4JUVCJTgyJTk4JUVCJUIwJTlDJUVEJTkxJTlDLnBwdHg1
chunk_id: 742299af0127_aHR0cHM6Ly9hb2FpdGVhbWJsb2JzdG9yYWdlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9oZGFvYWkvUFBUJUVDJTgzJTk4JUVEJTk0JThDKCVFRCU5NSU5QyVFQSVCOCU4MClfJUVDJTg0JUI4JUVCJUFGJUI4JUVCJTgyJTk4JUVCJUIwJTlDJUVEJTkxJTlDLnBwdHg1_pages_0
Reranker Score: 3.3512561321258545
Content: 01. Demo Session 1 – AOAI 기반 RDB연동 Chatbot (회의실 예약 시스템)
TO-BE) AOAI의 지능화된 연계기술(Function calling)을 통해 Legacy 시스템과 연계하고,
대화를 통한 추천 방식으로 전환하여 단계별로 진행되던 업무과정을 혁신적으로 간소화 할 수 있습니다.

회의실 예약업무 문제 해결 과정
RDB 연동 LLM Chatbot 구성현황

예약신청
예약취소
예약변경
회의실 전체현황
조회
회의실 조건 검색
대안 고민
(회의실 없는 경우)
반복적 고민
(인원, 빈회의실,회의환경)


기존
문제
개선
시간 소요
(매번 동일한
조건으로 입력)
단계적인 수행없이 One-Stop 업무처리 가능한 방식으로 개선 됨
변화) 검색방식  대화, 추천 방식
효과) 2분  30초 (75% 단축)

IT(DB)와 연동된 Chatbot 동작 구조

“내일 회의실 좀 예약해줘”
지난주에 하신 “TF 주간회의“
시면 6번 Room으로 오전 9시반 예약해 드릴까요?
“오늘 회의 취