# AI Search - Index (Push) documents for RAG

### Docs

- https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search

Python SDK
- https://learn.microsoft.com/en-us/python/api/overview/azure/search-documents-readme?view=azure-python
- Key concepts: https://learn.microsoft.com/en-us/python/api/overview/azure/search-documents-readme?view=azure-python#key-concepts

Basic approaches: push and pull
- https://learn.microsoft.com/en-us/azure/search/search-what-is-data-import
- Note: If AI enrichment (https://learn.microsoft.com/en-us/azure/search/cognitive-search-concept-intro) is a solution requirement, you must use the pull model (indexers) to load an index. Skillsets are attached to an indexer and don't run independently.

### Inspirational sources
- https://github.com/Azure/azure-search-vector-samples/blob/main/demo-python/code/integrated-vectorization/azure-search-integrated-vectorization-sample.ipynb

### Dependencies
- https://learn.microsoft.com/en-us/azure/search/search-api-versions

In [1]:
#! pip install -r requirements.txt

### Global flags (e.g. for debug and development)

### Load .env file (Copy .env-sample to .env and update accordingly)

In [2]:
import os
from dotenv import load_dotenv

load_dotenv(override=True) # take environment variables from .env.

from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential

endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
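# Use the admin key when one is provided; otherwise fall back to DefaultAzureCredential.
# os.getenv avoids a KeyError when AZURE_SEARCH_ADMIN_KEY is not set at all.
admin_key = os.getenv("AZURE_SEARCH_ADMIN_KEY", "")
credential = AzureKeyCredential(admin_key) if admin_key else DefaultAzureCredential()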
index_name = os.environ["AZURE_SEARCH_INDEX"]

blob_connection_string = os.environ["BLOB_CONNECTION_STRING"]
blob_container_name = os.environ["BLOB_CONTAINER_NAME"]

azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.environ["AZURE_OPENAI_KEY"]
azure_openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"]

In [3]:
from azure.search.documents import SearchClient
search_client = SearchClient(endpoint, index_name, credential)

### Add document actions to batch

In [4]:
unique_term_in_time_and_space = "ZYX1"
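The hard-coded marker above matches documents from earlier runs as well, so the wait step later in the notebook can return before the new document is searchable. A minimal sketch of generating a fresh marker per run follows; `make_unique_term` and `example_term` are hypothetical names, and the sketch assumes the index's analyzer keeps an unbroken run of letters and digits as a single token.

```python
import uuid


def make_unique_term(prefix: str = "ZYX") -> str:
    """Return a marker that is very unlikely to collide with earlier runs."""
    # uuid4().hex is 32 hexadecimal characters without hyphens, so the marker
    # stays one alphanumeric word (assumption: the analyzer does not split it).
    return f"{prefix}{uuid.uuid4().hex}"


# Hypothetical usage: assign this to unique_term_in_time_and_space instead of "ZYX1".
example_term = make_unique_term()
print(example_term)
```

With a marker like this the recorded outputs further down would show the generated value rather than `ZYX1`.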

In [5]:
# https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexdocumentsbatch?view=azure-python
from azure.search.documents import IndexDocumentsBatch

batch = IndexDocumentsBatch()
batch.add_upload_actions([
    {
        "title": "push.txt",
        "url": "push.txt",
        "id": "1",
        "chunk_id": "1",
        "content": f"This is a push test document: {unique_term_in_time_and_space}",
        "contentVector": [],
    }
])

[<azure.search.documents._generated.models._models_py3.IndexAction at 0x25548c0ffa0>]
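The batch above holds a single document. When pushing many documents, the service limits how many actions one request may carry, so the list usually has to be split first. The sketch below is plain Python and does not call the SDK; `chunk_documents` and `MAX_ACTIONS_PER_BATCH` are hypothetical names, and the value 1000 is an assumption based on the documented per-request limit, which should be checked against the current service limits.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List

# Assumption: at most 1000 documents per indexing request (verify in the service limits).
MAX_ACTIONS_PER_BATCH = 1000


def chunk_documents(
    documents: Iterable[Dict], size: int = MAX_ACTIONS_PER_BATCH
) -> Iterator[List[Dict]]:
    """Yield consecutive lists of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(documents)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Made-up documents, only to show the resulting chunk sizes.
docs = [{"id": str(i), "content": f"doc {i}"} for i in range(2500)]
chunk_sizes = [len(chunk) for chunk in chunk_documents(docs)]
print(chunk_sizes)
```

Each chunk could then go into its own `IndexDocumentsBatch` via `add_upload_actions(chunk)` and be sent with `search_client.index_documents(batch)`, as in the cells of this notebook.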

### Execute batch

In [7]:
search_client.index_documents(batch)

[<azure.search.documents._generated.models._models_py3.IndexingResult at 0x25548c0f2e0>]
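The output above is only the default repr of the returned list. Each `IndexingResult` exposes `key`, `succeeded`, `status_code` and `error_message`, which can be inspected to confirm that every document was accepted. A minimal sketch follows; `split_indexing_results` is a hypothetical helper, and the sample objects are stand-ins with made-up values so the snippet runs without a search service.

```python
from types import SimpleNamespace


def split_indexing_results(results):
    """Return (keys that succeeded, {failed key: (status_code, error_message)})."""
    succeeded = [r.key for r in results if r.succeeded]
    failed = {
        r.key: (r.status_code, r.error_message)
        for r in results
        if not r.succeeded
    }
    return succeeded, failed


# Stand-ins for IndexingResult objects; the values are invented for illustration.
sample_results = [
    SimpleNamespace(key="1", succeeded=True, status_code=201, error_message=None),
    SimpleNamespace(key="2", succeeded=False, status_code=422, error_message="hypothetical error"),
]

ok_keys, failures = split_indexing_results(sample_results)
print(ok_keys, failures)
```

In the notebook, the list returned by `search_client.index_documents(batch)` could be passed to the helper in place of `sample_results`.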

### Helper function - Wait for index to contain pushed entries

In [8]:
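import time

def wait_search_for_unique_term(_unique_term_in_time_and_space, timeout_seconds=60):
    """Poll the index once per second until the term is searchable or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        results = search_client.search(
            search_text=_unique_term_in_time_and_space,
            select=["content"],
            top=1,
        )

        # Count the returned hits (at most one, because top=1).
        _rnr = sum(1 for _ in results)

        if _rnr > 0:
            print("Found total of results: {}".format(_rnr))
            break
        # Give up instead of looping forever if the document never shows up.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Term not found within {timeout_seconds} seconds")
        time.sleep(1)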

In [9]:
# Wait until the index has been updated and measure how long that takes (in seconds)
import timeit
timeit.timeit(lambda: wait_search_for_unique_term(unique_term_in_time_and_space), number=1)

Found total of results: 1


1.1305372000000005

In [13]:
results = search_client.search(
    search_text=unique_term_in_time_and_space,
    select=["content"],
    top=1,
)

for _r in results:
    print(_r)

{'content': 'This is a push test document: ZYX1', '@search.score': 1.1403778, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}
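Each result behaves like a dictionary that mixes the selected index fields with `@search.*` metadata. For RAG prompts usually only the index fields are wanted, so a small filter can separate the two. A minimal sketch, where `document_fields` is a hypothetical helper and `sample_result` is a copy of the output printed above:

```python
def document_fields(result):
    """Return only the index fields of a search result, dropping @search.* metadata."""
    return {key: value for key, value in result.items() if not key.startswith("@search.")}


# Copy of the result printed above, used as sample input.
sample_result = {
    "content": "This is a push test document: ZYX1",
    "@search.score": 1.1403778,
    "@search.reranker_score": None,
    "@search.highlights": None,
    "@search.captions": None,
}

fields = document_fields(sample_result)
print(fields)
```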
