# Part 5: Blob Knowledge Source

In Parts 1-4, you worked with pre-indexed data, SharePoint, and web sources. In Part 5, you'll upload a document to Azure Blob Storage and create knowledge sources that index it automatically. You'll also compare two indexing modes: **minimal** (basic content extraction) and **standard** (advanced content understanding with Azure AI Services).

## Step 1: Load Environment Variables

Run the cell below to load the configuration for your Azure resources. Choose the **.venv(3.11.9)** environment that was created for you.

Notice the additional variables for blob storage, AI services, and embedding models, which are needed for document ingestion and vectorization. All these Azure resources are pre-configured in `.env` for you.

> **⚠️ Troubleshooting**
>
> If code cells get stuck and keep spinning, select **Restart** from the notebook toolbar at the top. If the issue persists after a couple of tries, close VS Code completely and reopen it.

In [73]:
import os

from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv

load_dotenv(override=True) # take environment variables from .env.

# Azure AI Search configuration
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"])

# Knowledge base names
knowledge_base_name = "upload-blob-knowledge-base-minimal"
standard_knowledge_base_name = "upload-blob-knowledge-base-standard"

# Azure OpenAI configuration
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.environ["AZURE_OPENAI_KEY"]
azure_openai_chatgpt_deployment = os.getenv("AZURE_OPENAI_CHATGPT_DEPLOYMENT", "gpt-4.1")
azure_openai_chatgpt_model_name = os.getenv("AZURE_OPENAI_CHATGPT_MODEL_NAME", "gpt-4.1")
azure_openai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
azure_openai_embedding_model_name = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL_NAME", "text-embedding-3-large")

# Blob configuration
blob_connection_string = os.environ.get("BLOB_CONNECTION_STRING")
blob_resource_id = os.environ.get("BLOB_RESOURCE_ID")
blob_container_name = os.environ["BLOB_CONTAINER_NAME"]
ai_services_endpoint = os.environ["AI_SERVICES_ENDPOINT"]
ai_services_key = os.environ["AI_SERVICES_KEY"]

blob_path = "../data/ai-search-data/blobdata/MSFT_cloud_architecture_zava.pdf"

print("Environment variables loaded")

Environment variables loaded
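
A missing setting in `.env` surfaces as a `KeyError` in the cell above, or as a confusing failure several cells later for the values read with `os.environ.get`. The sketch below shows one way to check the required names up front. It is illustrative only: `find_missing_settings` and `REQUIRED_SETTINGS` are hypothetical names that are not part of the lab, and the list simply mirrors the variables that Step 1 reads with `os.environ[...]`.

```python
import os
from typing import Iterable, List, Mapping

# Names that Step 1 reads with os.environ[...]; a missing one raises KeyError there.
REQUIRED_SETTINGS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "BLOB_CONTAINER_NAME",
    "AI_SERVICES_ENDPOINT",
    "AI_SERVICES_KEY",
]


def find_missing_settings(names: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the names that are absent or empty in `env`, in input order."""
    return [name for name in names if not env.get(name)]


missing = find_missing_settings(REQUIRED_SETTINGS, os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present")
```

Passing the environment in as a mapping keeps the helper easy to try with a plain dictionary before pointing it at `os.environ`.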


In [79]:
# -------------------------------------------------------------------------
# OPTIONAL: CLEANUP CELL
# Run this cell if you want to restart Part 5 from scratch.
# It deletes the Knowledge Bases and Knowledge Sources created in this lab.
# -------------------------------------------------------------------------
from azure.core.exceptions import ResourceNotFoundError
from azure.search.documents.indexes import SearchIndexClient

# Ensure client is ready (uses variables from Step 1)
if "endpoint" in globals() and "credential" in globals():
    cleanup_client = SearchIndexClient(endpoint=endpoint, credential=credential)
    
    items_to_delete = [
        ("Knowledge Base", cleanup_client.delete_knowledge_base, "upload-blob-knowledge-base-minimal"),
        ("Knowledge Base", cleanup_client.delete_knowledge_base, "upload-blob-knowledge-base-standard"),
        ("Knowledge Source", cleanup_client.delete_knowledge_source, "upload-blob-knowledge-source-minimal"),
        ("Knowledge Source", cleanup_client.delete_knowledge_source, "upload-blob-knowledge-source-standard"),
    ]

    print("🧹 Starting cleanup of Part 5 resources...")
    for label, delete_func, name in items_to_delete:
        try:
            delete_func(name)
            print(f"   ✅ Deleted {label}: {name}")
        except ResourceNotFoundError:
            print(f"   ⚠️ {label} already deleted: {name}")
        except Exception as e:
            print(f"   ❌ Error deleting {name}: {e}")

    print("✨ Cleanup complete. You can now continue to Step 2.")
else:
    print("❌ Error: Please run Step 1 first to load environment variables.")

🧹 Starting cleanup of Part 5 resources...
   ✅ Deleted Knowledge Base: upload-blob-knowledge-base-minimal
   ✅ Deleted Knowledge Base: upload-blob-knowledge-base-standard
   ✅ Deleted Knowledge Source: upload-blob-knowledge-source-minimal
   ✅ Deleted Knowledge Source: upload-blob-knowledge-source-standard
✨ Cleanup complete. You can now continue to Step 2.


## Step 2: Upload Document to Blob Storage

Before creating a knowledge source, you need to upload a document to your blob storage. The code below uploads a PDF called `MSFT_cloud_architecture_zava.pdf` which contains information about Zava's cloud architecture and how they classify data by sensitivity level.

Once you create the blob knowledge source in the next step, it will automatically find this PDF in the container and index it for querying.

In [2]:
import os
from azure.core.exceptions import ClientAuthenticationError, HttpResponseError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Require an account URL for Azure AD auth (no keys). Prefer BLOB_ACCOUNT_URL in .env.
account_url = os.environ.get("BLOB_ACCOUNT_URL")

# Fallback: derive account URL from the blob connection string without using the key
if not account_url:
    conn = os.environ.get("BLOB_CONNECTION_STRING") or globals().get("blob_connection_string")
    if conn:
        account_name = None
        for part in conn.split(";"):
            if part.lower().startswith("accountname="):
                account_name = part.split("=", 1)[1]
                break
        if account_name:
            account_url = f"https://{account_name}.blob.core.windows.net"

if not account_url:
    raise ValueError("Missing BLOB_ACCOUNT_URL. Set it in .env or rerun setup-environment.sh to populate it.")

# Use Azure AD (managed identity/VS Code signed-in user/service principal) instead of account keys.
# A separate name keeps the Step 1 `credential` (the Azure AI Search admin key) intact for later cells.
blob_credential = DefaultAzureCredential(exclude_shared_token_cache_credential=True)
blob_service_client = BlobServiceClient(account_url=account_url, credential=blob_credential)
container_client = blob_service_client.get_container_client(blob_container_name)

# Ensure container exists (idempotent)
try:
    container_client.create_container()
except ClientAuthenticationError as e:
    # Must come first: ClientAuthenticationError is a subclass of HttpResponseError
    raise RuntimeError("Authentication failed. Ensure your identity has 'Storage Blob Data Contributor' on the storage account.") from e
except HttpResponseError as e:
    # 409 means the container already exists, which is fine
    if e.status_code != 409:
        raise

blob_name = os.path.basename(blob_path)
blob_client = container_client.get_blob_client(blob_name)

# Upload directly; avoid exists() to reduce permission needs
try:
    with open(blob_path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)
except ClientAuthenticationError as e:
    raise RuntimeError("Upload failed. Confirm your identity has 'Storage Blob Data Contributor' on the storage account.") from e
except HttpResponseError as e:
    # error_code can be missing or None, so normalize it before comparing
    if (getattr(e, "error_code", None) or "").lower() == "authorizationpermissionmismatch":
        raise RuntimeError("Authorization failed (AuthorizationPermissionMismatch). Ensure your identity has 'Storage Blob Data Contributor' on the storage account.") from e
    raise

print(f"Setup sample data in {blob_container_name} using Azure AD auth")

Setup sample data in documents using Azure AD auth
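
The fallback in the cell above derives the account URL from the connection string without reading the account key. The sketch below isolates that parsing step so you can try it on its own. The helper names and the sample connection string are made up for illustration, and honoring `EndpointSuffix` is an addition that the lab code does not make (it always uses `blob.core.windows.net`).

```python
from typing import Dict, Optional


def parse_connection_string(conn: str) -> Dict[str, str]:
    """Split 'Key=Value;Key=Value' pairs into a dict with lowercase keys."""
    pairs: Dict[str, str] = {}
    for part in conn.split(";"):
        if "=" in part:
            # Split once only: values such as account keys may contain '='
            key, value = part.split("=", 1)
            pairs[key.strip().lower()] = value.strip()
    return pairs


def account_url_from_connection_string(conn: str) -> Optional[str]:
    """Build the blob endpoint URL; the account key is parsed but never used."""
    pairs = parse_connection_string(conn)
    name = pairs.get("accountname")
    if not name:
        return None  # for example, a 'ResourceId=...' connection string
    suffix = pairs.get("endpointsuffix", "core.windows.net")
    return f"https://{name}.blob.{suffix}"


# Made-up connection string, for illustration only.
sample = (
    "DefaultEndpointsProtocol=https;AccountName=examplestore;"
    "AccountKey=abc==;EndpointSuffix=core.windows.net"
)
print(account_url_from_connection_string(sample))
```

A `ResourceId=...` connection string has no `AccountName`, so the helper returns `None` for it; in that case `BLOB_ACCOUNT_URL` has to be set explicitly.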


## Step 3: Create Blob Knowledge Source with Minimal Extraction

An **AzureBlobKnowledgeSource** automatically indexes documents from blob storage. Unlike the sources you've used before, this one ingests and processes the documents for you.

The code below creates a knowledge source with a `content_extraction_mode` of **minimal**. This mode chunks documents quickly without deep semantic understanding. An embedding model (`text-embedding-3-large`) is used to vectorize the chunks for vector search, but the chunking strategy itself is basic and fast.

>Minimal indexing is ideal when you need speed and have straightforward documents.
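
To make "basic chunking" concrete, here is a sketch of fixed-size chunking with overlap. It illustrates the general idea only and is not the algorithm the service runs in minimal mode; `chunk_text` and its default sizes are assumptions chosen for readability.

```python
from typing import List


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", size=4, overlap=1)
print(chunks)
```

Each chunk would then be embedded separately. The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text.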

In [3]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    AzureBlobKnowledgeSource,
    AzureBlobKnowledgeSourceParameters,
    AzureOpenAIVectorizerParameters,
    KnowledgeSourceAzureOpenAIVectorizer,
    KnowledgeSourceContentExtractionMode,
    KnowledgeSourceIngestionParameters,
    SearchIndexerDataNoneIdentity
)

index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

embedding_model = KnowledgeSourceAzureOpenAIVectorizer(
    azure_open_ai_parameters=AzureOpenAIVectorizerParameters(
        resource_url=azure_openai_endpoint,
        api_key=azure_openai_key,
        deployment_name=azure_openai_embedding_deployment,
        model_name=azure_openai_embedding_model_name
    )
)

if blob_resource_id:
    blob_connection = f"ResourceId={blob_resource_id}"
else:
    blob_connection = blob_connection_string

if not blob_connection:
    raise ValueError("Missing blob connection info. Set BLOB_RESOURCE_ID or BLOB_CONNECTION_STRING via setup-environment.sh.")

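# No explicit identity is set for ingestion. Assumption: with a ResourceId connection string,
# the search service then uses its system-assigned managed identity.
ingestion_identity = SearchIndexerDataNoneIdentity()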

knowledge_source = AzureBlobKnowledgeSource(
    name="upload-blob-knowledge-source-minimal",
    azure_blob_parameters=AzureBlobKnowledgeSourceParameters(
        connection_string=blob_connection,
        container_name=blob_container_name,
        ingestion_parameters=KnowledgeSourceIngestionParameters(
            identity=ingestion_identity,
            embedding_model=embedding_model,
            content_extraction_mode=KnowledgeSourceContentExtractionMode.MINIMAL
        )
    )
)

index_client.create_or_update_knowledge_source(knowledge_source=knowledge_source)
print(f"Knowledge source '{knowledge_source.name}' created or updated successfully.")

Knowledge source 'upload-blob-knowledge-source-minimal' created or updated successfully.


## Step 4: Check Knowledge Source Status

After creating a blob knowledge source, it needs time to process the documents. The code below checks whether indexing is complete, in progress, or failed.

When `itemsUpdatesProcessed` is 1, the single document has been indexed successfully and you can move to the next step. If it is not 1 yet, wait a few seconds and rerun the cell.

In [4]:
import json

status = index_client.get_knowledge_source_status(knowledge_source.name)

print(json.dumps(status.serialize(), indent=2))

{
  "synchronizationStatus": "active",
  "synchronizationInterval": "1d",
  "lastSynchronizationState": {
    "startTime": "2025-12-07T15:43:23.549Z",
    "endTime": "2025-12-07T15:43:29.335Z",
    "itemsUpdatesProcessed": 1,
    "itemsUpdatesFailed": 0,
    "itemsSkipped": 0
  },
  "statistics": {
    "totalSynchronization": 1,
    "averageSynchronizationDuration": "PT5.7864929S",
    "averageItemsProcessedPerSynchronization": 1
  }
}
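
Instead of rerunning the status cell by hand, you can poll until the document is processed. The sketch below is a hypothetical helper, not part of the lab. It assumes the serialized status has the shape shown in the output above (`lastSynchronizationState` with `itemsUpdatesProcessed` and `itemsUpdatesFailed`) and that those keys may be missing or empty while the first synchronization is still running.

```python
import time
from typing import Any, Callable, Dict


def wait_for_sync(
    get_status: Callable[[], Dict[str, Any]],
    timeout_s: float = 900.0,
    poll_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll `get_status` until an item is processed, an item fails, or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        last = status.get("lastSynchronizationState") or {}
        if (last.get("itemsUpdatesFailed") or 0) > 0:
            raise RuntimeError(f"Ingestion reported failures: {last}")
        if (last.get("itemsUpdatesProcessed") or 0) >= 1:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Knowledge source did not finish indexing in time")
        sleep(poll_s)


# In the notebook, the status callable could wrap the call used in Step 4:
# wait_for_sync(lambda: index_client.get_knowledge_source_status(knowledge_source.name).serialize())
```

The `sleep` parameter exists only so the loop can be exercised without real waiting.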


## Step 5: Create Knowledge Base

Now that the blob knowledge source has indexed the document, you can create a knowledge base to query it. The code below creates a knowledge base that uses the blob knowledge source you created earlier.

Notice that this knowledge base also sets `retrieval_reasoning_effort` to "low". Currently, the lowest possible effort is "minimal" and the highest is "medium". The "low" effort still performs query decomposition, but it does not do iterative retrieval.

In [5]:
from azure.search.documents.indexes.models import AzureOpenAIVectorizerParameters, KnowledgeBase, KnowledgeBaseAzureOpenAIModel, KnowledgeRetrievalLowReasoningEffort, KnowledgeRetrievalOutputMode, KnowledgeSourceReference

aoai_params = AzureOpenAIVectorizerParameters(
    resource_url=azure_openai_endpoint,
    api_key=azure_openai_key,
    deployment_name=azure_openai_chatgpt_deployment,
    model_name=azure_openai_chatgpt_model_name,
)

knowledge_base = KnowledgeBase(
    name=knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=knowledge_source.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS,
    retrieval_reasoning_effort=KnowledgeRetrievalLowReasoningEffort()  # pass an instance, not the class
)

index_client.create_or_update_knowledge_base(knowledge_base)
print(f"Knowledge base '{knowledge_base_name}' created or updated successfully.")

Knowledge base 'upload-blob-knowledge-base-minimal' created or updated successfully.


## Step 6: Use Agentic Retrieval to Fetch Results from the Blob Knowledge Source

The code below queries the PDF document about Zava's data sensitivity classification levels. This demonstrates how agentic retrieval works with blob knowledge sources.

When you run this query, the knowledge base analyzes your question, decomposes it into focused subqueries, searches the blob-indexed content concurrently, uses semantic ranking to filter results, and synthesizes a grounded answer with citations pointing back to the PDF document.

In [6]:
import os
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import AzureBlobKnowledgeSourceParams, KnowledgeBaseMessage, KnowledgeBaseMessageTextContent, KnowledgeBaseRetrievalRequest
from IPython.display import display, Markdown

if "endpoint" not in globals() or "knowledge_base_name" not in globals():
    raise RuntimeError("Missing notebook state. Rerun Steps 1-5 to reload endpoint, credential, and knowledge_base_name.")

# Prefer admin key if present; otherwise fall back to AAD (managed identity/service principal) for retrieval
admin_key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
search_credential = AzureKeyCredential(admin_key) if admin_key else DefaultAzureCredential(exclude_shared_token_cache_credential=True)

# If the knowledge source object is not in scope (e.g., after a kernel restart), refetch it by name
if "knowledge_source" not in globals():
    knowledge_source = index_client.get_knowledge_source("upload-blob-knowledge-source-minimal")

knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=knowledge_base_name, credential=search_credential)

blob_ks_params = AzureBlobKnowledgeSourceParams(
    knowledge_source_name=knowledge_source.name,
    include_references=True,
    include_reference_source_data=True
)
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="What are the levels of Zava data sensitivity classification?")])
    ],
    knowledge_source_params=[
        blob_ks_params
    ],
    include_activity=True
)

result = knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))

Zava's data sensitivity classification consists of three levels:

- Level 1: Low business value. Examples include normal business communications (such as email) and files for administrative, sales, and support workers.
- Level 2: Medium business value. Examples include financial and legal information, as well as research and development data for new products.
- Level 3: High business value. Examples include customer and partner personally identifiable information, product engineering specifications, and proprietary manufacturing techniques [ref_id:0][ref_id:2].
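
The `[ref_id:N]` markers in the answer point at entries in `result.references`. The sketch below pulls the cited ids out of an answer string so you can look up only the references that were actually used. `cited_ref_ids` is a hypothetical helper, and the marker pattern is inferred from the sample answer above rather than from a documented format.

```python
import re
from typing import List

# Marker format inferred from the sample answer, e.g. "[ref_id:0]"
REF_PATTERN = re.compile(r"\[ref_id:(\d+)\]")


def cited_ref_ids(answer_text: str) -> List[int]:
    """Return the distinct reference ids cited in the answer, in first-seen order."""
    seen: List[int] = []
    for match in REF_PATTERN.finditer(answer_text):
        ref_id = int(match.group(1))
        if ref_id not in seen:
            seen.append(ref_id)
    return seen


sample = "Level 3: High business value [ref_id:0][ref_id:2]. See also [ref_id:0]."
print(cited_ref_ids(sample))  # prints [0, 2]
```

In the references output the `id` field is a string (for example `"0"`), so compare with `str(ref_id)` when matching ids against reference entries.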

## Step 7: Review Response, References, and Activity

The cells below show the citations and activity log from the blob knowledge source query.

The references reveal which chunks from the PDF were used to answer your question. 

The activity log shows how the knowledge base processed your query and retrieved information from the blob-indexed content.

In [88]:
import json

references = json.dumps([ref.as_dict() for ref in result.references], indent=2)
print(references)

[
  {
    "type": "azureBlob",
    "id": "0",
    "activity_source": 1,
    "source_data": {
      "uid": "fd6a3be7e8bd_aHR0cHM6Ly9sYWI1MTFzdGxna3h4Z2k0dGtnY20uYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50cy9pZWN0ZXN0LmRvY3g1_text_sections_569",
      "blob_url": "https://lab511stlgkxxgi4tkgcm.blob.core.windows.net/documents/iectest.docx",
      "snippet": "\u05e8\u05db\u05df \u05d1\u05e2\u05d3 \u05d4\u05d0\u05e0\u05e8\u05d2\u05d9\u05d4 \u05d4\u05de\u05d5\u05d6\u05e8\u05de\u05ea \u05dc\u05e8\u05e9\u05ea \u05d0\u05ea \u05d4\u05ea\u05e2\u05e8\u05d9\u05e3 \u05d4\u05e7\u05d1\u05d5\u05e2 \u05d1\u05dc\u05d5\u05d7 \u05ea\u05e2\u05e8\u05d9\u05e4\u05d9\u05dd 15-6.7. |\n|  |  |  |  | \u05de\u05e9\u05da \u05d4\u05d4\u05ea\u05d7\u05e9\u05d1\u05e0\u05d5\u05ea \u05d9\u05d4\u05d9\u05d4 \u05dc- 23 \u05e9\u05e0\u05d9\u05dd \u05de\u05d9\u05d5\u05dd \u05e9\u05d9\u05dc\u05d5\u05d1 \u05d4\u05de\u05d9\u05ea\u05e7\u05df \u05d1\u05d9\u05d3\u05d9 \u05d4\u05de\u05d7\u05dc\u05e7. |\n|  |  |  | \u05d4\u05ea\u05d7\u05e

In [89]:
import pandas as pd

activity_types = [{"type": a.type} for a in result.activity]

df = pd.DataFrame(activity_types)

print("Activity Log Steps")
df

Activity Log Steps


|   | type |
|---|------|
| 0 | modelQueryPlanning |
| 1 | azureBlob |
| 2 | azureBlob |
| 3 | azureBlob |
| 4 | agenticReasoning |
| 5 | modelAnswerSynthesis |


In [90]:
activity_content = json.dumps([a.as_dict() for a in result.activity], indent=2)
print("Activity Details")
print(activity_content)

Activity Details
[
  {
    "id": 0,
    "type": "modelQueryPlanning",
    "elapsed_ms": 1370,
    "input_tokens": 1461,
    "output_tokens": 106
  },
  {
    "id": 1,
    "type": "azureBlob",
    "elapsed_ms": 704,
    "knowledge_source_name": "upload-blob-knowledge-source-standard",
    "query_time": "2025-12-08T07:05:09.062Z",
    "count": 48,
    "azure_blob_arguments": {
      "search": "\u05ea\u05d5\u05da \u05db\u05de\u05d4 \u05d9\u05de\u05d9\u05dd \u05e0\u05d9\u05ea\u05df \u05dc\u05e0\u05ea\u05e7 \u05e6\u05e8\u05db\u05df \u05de\u05d0\u05d9 \u05ea\u05e9\u05dc\u05d5\u05dd \u05d4\u05d7\u05e9\u05d1\u05d5\u05df"
    }
  },
  {
    "id": 2,
    "type": "azureBlob",
    "elapsed_ms": 391,
    "knowledge_source_name": "upload-blob-knowledge-source-standard",
    "query_time": "2025-12-08T07:05:09.454Z",
    "count": 36,
    "azure_blob_arguments": {
      "search": "\u05d7\u05d5\u05e7\u05d9 \u05e0\u05d9\u05ea\u05d5\u05e7 \u05e6\u05e8\u05db\u05df \u05d1\u05e9\u05dc \u05d0\u05d9 \u05ea\u05
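
The activity details above are easier to compare across runs when they are reduced to a few totals. The sketch below aggregates a list of activity dictionaries such as the one produced by `[a.as_dict() for a in result.activity]`. `summarize_activity` is a hypothetical helper, and it assumes only the field names visible in the output above (`type`, `elapsed_ms`, `input_tokens`, `output_tokens`), treating any that are absent as zero.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_activity(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate step counts, elapsed time, and token usage from activity dicts."""
    return {
        "steps_by_type": dict(Counter(r.get("type", "unknown") for r in records)),
        "total_elapsed_ms": sum(r.get("elapsed_ms") or 0 for r in records),
        "input_tokens": sum(r.get("input_tokens") or 0 for r in records),
        "output_tokens": sum(r.get("output_tokens") or 0 for r in records),
    }


# Sample records that mirror the first three entries of the activity output.
sample = [
    {"id": 0, "type": "modelQueryPlanning", "elapsed_ms": 1370, "input_tokens": 1461, "output_tokens": 106},
    {"id": 1, "type": "azureBlob", "elapsed_ms": 704, "count": 48},
    {"id": 2, "type": "azureBlob", "elapsed_ms": 391, "count": 36},
]
print(summarize_activity(sample))
```

Search steps carry no token counts in the sample output, so the token totals here reflect only the model steps.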

## Step 8: Use Standard Extraction Mode with Content Understanding

In the previous steps, you created a blob knowledge source with minimal extraction mode. Now, you'll create another blob knowledge source using the **standard** extraction mode, which leverages Azure AI Services for deeper content understanding. This mode provides advanced chunking strategies, semantic extraction, and better handling of complex documents.

The code below adds `content_extraction_mode=STANDARD` and connects Azure AI Services for enhanced processing. 

>Standard extraction takes longer but produces higher-quality chunks that preserve document structure and relationships.

In [80]:
from azure.search.documents.indexes.models import AIServices, KnowledgeSourceContentExtractionMode
from azure.core.exceptions import ResourceNotFoundError, HttpResponseError
import time

# CRITICAL: Azure Knowledge Sources store credentials PERMANENTLY
# We MUST delete the old one completely before creating a new one
ks_name = "upload-blob-knowledge-source-standard"
kb_name = "upload-blob-knowledge-base-standard"

print("🔄 FORCE DELETE - Removing resources with wrong credentials...")

# Step 1: Delete Knowledge Base first (dependency)
try:
    index_client.delete_knowledge_base(kb_name)
    print(f"   ✅ Deleted Knowledge Base: {kb_name}")
except ResourceNotFoundError:
    print(f"   ℹ️  Knowledge Base doesn't exist: {kb_name}")
except Exception as e:
    print(f"   ⚠️  Error deleting KB: {e}")

time.sleep(3)  # Wait for KB deletion to propagate

# Step 2: Force delete Knowledge Source
try:
    # First try normal delete
    index_client.delete_knowledge_source(ks_name)
    print(f"   ✅ Deleted Knowledge Source: {ks_name}")
    print("   ⏳ Waiting 10 seconds for Azure to fully remove it...")
    time.sleep(10)  # Longer wait to ensure complete deletion
except ResourceNotFoundError:
    print(f"   ℹ️  Knowledge Source doesn't exist: {ks_name}")
except Exception as e:
    print(f"   ⚠️  Error deleting KS: {e}")

# Step 3: VERIFY it's really gone (retry up to 5 times)
print("\n🔍 Verifying deletion completed...")
ks_exists = True
for attempt in range(5):
    try:
        index_client.get_knowledge_source(ks_name)
        print(f"   ⚠️  Still exists (attempt {attempt+1}/5)... waiting 5 more seconds")
        time.sleep(5)
        ks_exists = True
    except ResourceNotFoundError:
        print("   ✅ CONFIRMED: Knowledge Source is fully deleted!")
        ks_exists = False
        break

if ks_exists:
    print("\n❌ ERROR: Knowledge Source still exists after 30 seconds!")
    print("   Please wait 1 minute and run this cell again.")
    raise RuntimeError("Knowledge Source deletion did not complete")

# Step 4: Create completely NEW Knowledge Source
print("\n▶️  Creating BRAND NEW Knowledge Source with correct credentials...")
print(f"   Endpoint: {ai_services_endpoint}")
print(f"   Key: {ai_services_key[:5]}...{ai_services_key[-5:]} (expected: 1f37e...9ab3e)")

standard_knowledge_source = AzureBlobKnowledgeSource(
    name=ks_name,
    azure_blob_parameters=AzureBlobKnowledgeSourceParameters(
        connection_string=blob_connection,
        container_name=blob_container_name,
        ingestion_parameters=KnowledgeSourceIngestionParameters(
            identity=ingestion_identity,
            embedding_model=embedding_model,
            ai_services=AIServices(uri=ai_services_endpoint, api_key=ai_services_key),
            content_extraction_mode=KnowledgeSourceContentExtractionMode.STANDARD
        )
    )
)

# create_or_update acts as a create here because the old source was deleted above
try:
    index_client.create_or_update_knowledge_source(knowledge_source=standard_knowledge_source)
    print("\n✅ NEW Knowledge Source created!")
    print(f"   🔑 Credential check: {ai_services_key[:5]}...{ai_services_key[-5:]}")
    print("\n⏳ Waiting 10 seconds for indexing to start...")
    time.sleep(10)
except HttpResponseError as e:
    if "already exists" in str(e).lower():
        print("\n❌ ERROR: Resource still exists! Wait 60 seconds and try again.")
    raise

🔄 FORCE DELETE - Removing resources with wrong credentials...
   ✅ Deleted Knowledge Base: upload-blob-knowledge-base-standard
   ✅ Deleted Knowledge Source: upload-blob-knowledge-source-standard
   ⏳ Waiting 10 seconds for Azure to fully remove it...

🔍 Verifying deletion completed...
   ✅ CONFIRMED: Knowledge Source is fully deleted!

▶️  Creating BRAND NEW Knowledge Source with correct credentials...
   Endpoint: https://lab511-ai-services-lgkxxgi4tkgcm.cognitiveservices.azure.com/
   Key: 47bbe...1cf22 (expected: 1f37e...9ab3e)

## Step 9: Check Standard Extraction Status

Run the cell below to monitor the standard extraction progress. This mode uses Azure AI Services to analyze document structure, recognize tables, and perform intelligent chunking, which takes more time than the minimal extraction mode you used earlier.

When `itemsUpdatesProcessed` is 1, the single document has been indexed successfully and you can move to the next step.

In [82]:
import json

status = index_client.get_knowledge_source_status(standard_knowledge_source.name)

print(json.dumps(status.serialize(), indent=2))

{
  "synchronizationStatus": "active",
  "synchronizationInterval": "1d",
  "lastSynchronizationState": {
    "startTime": "2025-12-08T06:28:25.149Z",
    "endTime": "2025-12-08T06:38:09.552Z",
    "itemsUpdatesProcessed": 1,
    "itemsUpdatesFailed": 0,
    "itemsSkipped": 0
  },
  "statistics": {
    "totalSynchronization": 1,
    "averageSynchronizationDuration": "PT9M44.4030721S",
    "averageItemsProcessedPerSynchronization": 1
  }
}
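
The `averageSynchronizationDuration` values are ISO 8601 durations, which makes the two modes awkward to compare by eye. The sketch below converts such a string to seconds using the two sample values from the status outputs in Steps 4 and 9. `duration_seconds` is a hypothetical helper that handles only the day, hour, minute, and second designators. It is an assumption that the service never reports months or years here.

```python
import re

# Supports forms such as "PT5.7864929S", "PT9M44.4030721S", and "P1DT2H"
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)


def duration_seconds(value: str) -> float:
    """Convert a simple ISO 8601 duration string to a number of seconds."""
    match = _DURATION.match(value)
    if not match or value in ("P", "PT"):
        raise ValueError(f"Unsupported duration: {value!r}")
    parts = {key: float(text) if text else 0.0 for key, text in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


minimal = duration_seconds("PT5.7864929S")      # sample value from Step 4
standard = duration_seconds("PT9M44.4030721S")  # sample value from Step 9
print(f"minimal: {minimal:.1f}s, standard: {standard:.1f}s")
```

In these sample runs, minimal extraction finished in under six seconds while standard extraction took close to ten minutes. Your timings will differ with document size and service load.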


## Step 10: Create Knowledge Base for Standard Extraction

You'll now create a knowledge base that uses the standard extraction blob knowledge source. This knowledge base will benefit from the enhanced document processing and improved chunk quality.

Run the cell below to create the knowledge base with the standard extraction source.

In [83]:
from azure.search.documents.indexes.models import KnowledgeBase, KnowledgeBaseAzureOpenAIModel, KnowledgeRetrievalOutputMode, KnowledgeSourceReference

standard_knowledge_base = KnowledgeBase(
    name=standard_knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=standard_knowledge_source.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS
)

index_client.create_or_update_knowledge_base(standard_knowledge_base)
print(f"Knowledge base '{standard_knowledge_base_name}' created or updated successfully.")

Knowledge base 'upload-blob-knowledge-base-standard' created or updated successfully.


## Step 11: Query Hebrew Document with Medium Reasoning

This query demonstrates **medium reasoning effort** on a complex Hebrew document (800 pages about Israel Electric Company regulations). The query asks three interconnected questions:

1. **Who**: Which consumers are protected from disconnection?
2. **When**: Under what circumstances is disconnection prohibited?
3. **List**: Provide a comprehensive list of all such consumers

Medium reasoning effort is ideal for this because:
- **Multi-step decomposition**: Breaks down into focused sub-queries for each question
- **Iterative retrieval**: Searches across 800 pages to find all relevant sections
- **Cross-page aggregation**: Combines information scattered throughout the document
- **Hebrew support**: Native RTL language handling in both BM25 and vector search

The activity log analysis after this step will show multiple search iterations as the system comprehensively answers all three parts.

In [148]:
import os
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import AzureBlobKnowledgeSourceParams, KnowledgeBaseMessage, KnowledgeBaseMessageTextContent, KnowledgeBaseRetrievalRequest, KnowledgeRetrievalMediumReasoningEffort
from IPython.display import display, Markdown

# FIX: Ensure we use the correct Search Endpoint (it might have been overwritten by Step 7)
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]

if "standard_knowledge_base_name" not in globals():
    raise RuntimeError("Missing notebook state. Rerun Steps 1-10 to reload standard knowledge base name.")

# Prefer admin key if present; otherwise fall back to AAD (managed identity/service principal) for retrieval
admin_key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
search_credential = AzureKeyCredential(admin_key) if admin_key else DefaultAzureCredential(exclude_shared_token_cache_credential=True)

# If the optimized knowledge source object is not in scope (e.g., after a kernel restart), refetch it by name
if "optimized_knowledge_source" not in globals():
    optimized_knowledge_source = index_client.get_knowledge_source("upload-blob-knowledge-source-standard-optimized")

# Use the optimized knowledge base (with 1500-char chunks)
optimized_knowledge_base_name = "upload-blob-knowledge-base-standard-optimized"
standard_knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=optimized_knowledge_base_name, credential=search_credential)

blob_ks_params = AzureBlobKnowledgeSourceParams(
    knowledge_source_name=optimized_knowledge_source.name,
    include_references=True,
    include_reference_source_data=True
)

# Original query - keeping Medium reasoning effort (already optimal for this query)
# Note: Azure AI Search supports only 3 levels: Minimal (1), Low (~2-3), Medium (~5)
# Medium already used 3/5 iterations and found the answer, so it's the right choice

# The query below is Hebrew for: "What are the opening hours of the Electric Company's service centers?"
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="""
"מהן שעות הפתיחה של מוקדי השירות של חברת החשמל"
        """)])
    ],
    knowledge_source_params=[
        blob_ks_params
    ],
    include_activity=True,
    retrieval_reasoning_effort=KnowledgeRetrievalMediumReasoningEffort()  # pass an instance; optimal: up to 5 iterations
)

result = standard_knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))

שעות הפתיחה של מוקדי השירות של חברת החשמל הן:

- בנושאי צרכנות (כולל בירור חשבון וסיום התקשרות): מענה אנושי מקצועי ניתן בימים א' עד ה', מהשעה 8:00 עד 19:00, למעט בחגים [ref_id:0].
- בנושאי אספקת חשמל, תקלות ומפגעים: מענה אנושי מקצועי ניתן בכל שעות היממה ובכל ימות השנה [ref_id:0].
- בנושאי עבודות הקשורות במדור התכנון, מדור בודקים ומדור מתכננים: מענה אנושי ומקצועי ניתן בימים א' עד ה' מהשעה 8:00 עד 15:00, למעט בחגים [ref_id:0].
- מוקד התקשורת הכתובה (אם מופעל): מענה בימים א' עד ה', משעה 8:00 עד 19:00, למעט בחגים [ref_id:0].
- קבלת קהל במשרדים: לא פחות מ-20 שעות שבועיות בממוצע ארצי, והודעה על שעות הפתיחה מתפרסמת לציבור [ref_id:1].

In [141]:
# Debug: Check available reasoning effort classes
from azure.search.documents.knowledgebases import models
import inspect

print("Available reasoning effort classes:\n")
for name, obj in inspect.getmembers(models):
    if 'reasoning' in name.lower() or 'effort' in name.lower():
        print(f"  • {name}: {type(obj)}")

print("\nAll KnowledgeRetrieval classes:")
for name, obj in inspect.getmembers(models):
    if name.startswith('KnowledgeRetrieval'):
        print(f"  • {name}")

Available reasoning effort classes:

  • KnowledgeBaseAgenticReasoningActivityRecord: <class 'type'>
  • KnowledgeRetrievalLowReasoningEffort: <class 'type'>
  • KnowledgeRetrievalMediumReasoningEffort: <class 'type'>
  • KnowledgeRetrievalMinimalReasoningEffort: <class 'type'>
  • KnowledgeRetrievalReasoningEffort: <class 'type'>
  • KnowledgeRetrievalReasoningEffortKind: <class 'azure.core._enum_meta.CaseInsensitiveEnumMeta'>

All KnowledgeRetrieval classes:
  • KnowledgeRetrievalIntent
  • KnowledgeRetrievalIntentType
  • KnowledgeRetrievalLowReasoningEffort
  • KnowledgeRetrievalMediumReasoningEffort
  • KnowledgeRetrievalMinimalReasoningEffort
  • KnowledgeRetrievalOutputMode
  • KnowledgeRetrievalReasoningEffort
  • KnowledgeRetrievalSemanticIntent


## Understanding Reasoning Effort Levels

Azure AI Search Knowledge Bases support **three reasoning effort levels**. Each level sets the **maximum number of retrieval iterations**; within that limit, the system decides when to stop based on its confidence in the results:

| Level | Class Name | Max Iterations | Use Case |
|-------|------------|---------------|----------|
| **Minimal** | `KnowledgeRetrievalMinimalReasoningEffort` | 1 | Simple factual queries, single-document answers |
| **Low** | `KnowledgeRetrievalLowReasoningEffort` | ~2-3 | Basic multi-aspect questions |
| **Medium** | `KnowledgeRetrievalMediumReasoningEffort` | ~5 | Complex questions requiring decomposition |

**Important**: The API only supports these three levels (minimal, low, medium). There is no "high" or "maximum" level.

**Key Point**: You **cannot force** the exact number of iterations. The agentic reasoning loop stops when the L3 classifier determines it has sufficient information to answer confidently.

Your query used **3/5 iterations with Medium effort** because the system found enough information about protected consumers after 3 searches.
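
To make the stopping behavior concrete, here is a toy sketch of an iteration-capped loop. It is not the service's implementation: the `MAX_ITERATIONS` caps are copied from the approximate values in the table above, and `search` and `is_sufficient` are hypothetical stand-ins for the real retrieval call and the sufficiency check.

```python
from typing import Callable, List

# Approximate per-level caps taken from the table above (assumed, not an official API value).
MAX_ITERATIONS = {"minimal": 1, "low": 3, "medium": 5}


def run_retrieval_loop(
    effort: str,
    search: Callable[[int], List[str]],
    is_sufficient: Callable[[List[str]], bool],
) -> int:
    """Search until the evidence is judged sufficient or the cap is reached.

    Returns the number of iterations actually used.
    """
    cap = MAX_ITERATIONS[effort]
    collected: List[str] = []
    for iteration in range(1, cap + 1):
        collected.extend(search(iteration))
        if is_sufficient(collected):
            return iteration  # early stop: the cap is a ceiling, not a target
    return cap


# Toy stand-ins: each search returns two chunks; six chunks count as "enough".
iterations_used = run_retrieval_loop(
    "medium",
    search=lambda i: [f"chunk-{i}-a", f"chunk-{i}-b"],
    is_sufficient=lambda chunks: len(chunks) >= 6,
)
print(f"Stopped after {iterations_used} of {MAX_ITERATIONS['medium']} iterations")
```

With these stand-ins the loop stops at iteration 3 of 5, which mirrors the situation described above: the effort level only bounds the loop, and the sufficiency check ends it.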

## Analyze Retrieval Activity - Optimized Chunks

Let's inspect the activity log to see:
1. **How many search iterations** were performed (medium reasoning effort allows up to 5)
2. **Which chunks were retrieved** from the 849 optimized chunks
3. **Search queries generated** by the LLM's query decomposition
4. **Token usage** and timing metrics

In [149]:
import json

print("="*80)
print("🔍 RETRIEVAL ACTIVITY ANALYSIS - Optimized 1500-char Chunks")
print("="*80)
print()

# Extract activity records
if hasattr(result, 'activity') and result.activity:
    activity = result.activity
    
    # 1. Count search iterations - filter by type name
    searches = [a for a in activity if type(a).__name__ == 'KnowledgeBaseAzureBlobActivityRecord']
    print(f"📊 Search Statistics:")
    print(f"   • Total search iterations: {len(searches)}")
    print(f"   • Medium reasoning effort allows up to 5 iterations")
    print()
    
    # 2. Analyze each search iteration
    total_chunks = 0
    all_queries = []
    
    for i, search in enumerate(searches, 1):
        print(f"🔎 Search Iteration {i}:")
        
        # Convert to dict to inspect structure
        search_dict = search.as_dict() if hasattr(search, 'as_dict') else {}
        
        # Get search query from azure_blob_arguments
        if 'azure_blob_arguments' in search_dict and 'search' in search_dict['azure_blob_arguments']:
            query = search_dict['azure_blob_arguments']['search']
            all_queries.append(query)
            print(f"   Query: \"{query}\"")
        
        # Get count of results
        if 'count' in search_dict:
            chunk_count = search_dict['count']
            total_chunks += chunk_count
            print(f"   • Retrieved {chunk_count} chunks")
            print(f"   • Elapsed: {search_dict.get('elapsed_ms', 0)} ms")
        print()
    
    print(f"📈 Total Chunks Retrieved Across All Iterations: {total_chunks}")
    print()
    
    # 3. Answer synthesis info
    answer_synthesis = [a for a in activity if type(a).__name__ == 'KnowledgeBaseModelAnswerSynthesisActivityRecord']
    if answer_synthesis:
        a = answer_synthesis[0]
        a_dict = a.as_dict() if hasattr(a, 'as_dict') else {}
        
        print(f"🤖 Answer Synthesis:")
        
        # Get chunks sent to LLM from references
        if hasattr(result, 'references') and result.references:
            print(f"   • Chunks sent to LLM: {len(result.references)}")
        
        # Get timing and token usage
        if 'elapsed_ms' in a_dict:
            print(f"   • Synthesis time: {a_dict['elapsed_ms']} ms")
        
        if 'model_output' in a_dict:
            model_output = a_dict['model_output']
            if 'usage' in model_output:
                usage = model_output['usage']
                print(f"   • Input tokens: {usage.get('prompt_tokens', 0):,}")
                print(f"   • Output tokens: {usage.get('completion_tokens', 0):,}")
                print(f"   • Total tokens: {usage.get('total_tokens', 0):,}")
        print()
    
    # 4. List all search queries generated by the LLM
    if all_queries:
        print(f"📝 LLM-Generated Search Queries (Query Decomposition):")
        print("="*80)
        for i, q in enumerate(all_queries, 1):
            print(f"{i}. {q}")
        print()
    
    print("="*80)
    print("💡 Key Insights:")
    print("="*80)
    print("• Smaller chunks (1500 chars) allow more focused retrieval")
    print("• Each chunk contains ~5-10 table rows with complete context")
    print("• 200-char overlap ensures table headers repeat across chunks")
    print("• Medium reasoning effort performs iterative retrieval until satisfied")
    print("• Answer synthesis combines multiple chunks into coherent response")
    print("="*80)
    
else:
    print("⚠️  No activity data available in result")
    print("Make sure include_activity=True was set in the request")

🔍 RETRIEVAL ACTIVITY ANALYSIS - Optimized 1500-char Chunks

📊 Search Statistics:
   • Total search iterations: 2
   • Medium reasoning effort allows up to 5 iterations

🔎 Search Iteration 1:
   Query: "שעות פתיחה מוקדי שירות חברת החשמל"
   • Retrieved 7 chunks
   • Elapsed: 549 ms

🔎 Search Iteration 2:
   Query: "מוקדי שירות חברת החשמל טלפון שעות פעילות"
   • Retrieved 10 chunks
   • Elapsed: 398 ms

📈 Total Chunks Retrieved Across All Iterations: 17

🤖 Answer Synthesis:
   • Chunks sent to LLM: 7
   • Synthesis time: 3367 ms

📝 LLM-Generated Search Queries (Query Decomposition):
1. שעות פתיחה מוקדי שירות חברת החשמל
2. מוקדי שירות חברת החשמל טלפון שעות פעילות

💡 Key Insights:
• Smaller chunks (1500 chars) allow more focused retrieval
• Each chunk contains ~5-10 table rows with complete context
• 200-char overlap ensur

## Inspect Retrieved Chunk Content

Let's look at the actual content of the retrieved chunks to see whether they contain the protected consumer table (Standard 7ג) we were looking for.

In [150]:
print("="*80)
print("📄 CHUNK CONTENT INSPECTION")
print("="*80)
print()

if hasattr(result, 'references') and result.references:
    # The actual chunks are in result.references, not in activity
    # Activity only shows metadata (count, query, timing)
    
    all_chunks = {}
    for ref in result.references:
        ref_dict = ref.as_dict() if hasattr(ref, 'as_dict') else {}
        
        # Extract chunk_id and content from reference
        chunk_id = ref_dict.get('chunk_id', 'unknown')
        content = ref_dict.get('content', '')
        
        if content:
            all_chunks[chunk_id] = content
    
    print(f"Found {len(all_chunks)} unique chunks in result.references\n")
    
    # Look for chunks containing key terms
    # Hebrew terms: "Section 7ג", "standard (criterion)", "protected consumers",
    # "persons disabled by persecution", "household tariff"
    target_terms = ['סעיף 7ג', 'אמת מידה', 'צרכנים מוגנים', 'נכי הרדיפות', 'תעריף ביתי']
    
    for i, (chunk_id, content) in enumerate(list(all_chunks.items())[:5], 1):  # Show first 5 chunks
        print("="*80)
        print(f"Chunk {i}: {chunk_id}")
        print("="*80)
        
        # Check if chunk contains target terms
        found_terms = [term for term in target_terms if term in content]
        if found_terms:
            print(f"✅ Contains key terms: {', '.join(found_terms)}")
        else:
            print("ℹ️  General context chunk")
        
        # Show preview (first 500 chars)
        preview = content[:500]
        print(f"\n{preview}...")
        print(f"\n📏 Chunk size: {len(content)} characters")
        print()
    
    if len(all_chunks) > 5:
        print(f"\n... and {len(all_chunks) - 5} more chunks (not shown)")
    
    print("\n" + "="*80)
    print("💡 Chunk Size Analysis:")
    print("="*80)
    print(f"• Configured: 1500 characters maximum")
    print(f"• Overlap: 200 characters between consecutive chunks")
    if all_chunks:
        print(f"• Actual sizes: {min(len(c) for c in all_chunks.values())} - {max(len(c) for c in all_chunks.values())} chars")
    print(f"• Total chunks in index: 849")
    print(f"• Chunks retrieved: {len(all_chunks)}")
    print("="*80)
    
else:
    # This branch is reached when result.references is missing or empty
    print("⚠️  No references available in result")

📄 CHUNK CONTENT INSPECTION

Found 0 unique chunks in result.references


💡 Chunk Size Analysis:
• Configured: 1500 characters maximum
• Overlap: 200 characters between consecutive chunks
• Total chunks in index: 849
• Chunks retrieved: 0


## Step 13: Compare Extraction Results

The cell below shows citations from the standard extraction query.

Compare these references with those from Step 7 to see how different extraction modes affect chunk creation and information retrieval from the same PDF document.

In [103]:
import json

# Get the actual index that was created by the knowledge source
# Knowledge sources create indexes with a specific naming pattern
indexes = index_client.list_indexes()

print("="*60)
print("Search Indexes Created by Knowledge Sources:")
print("="*60)
print()

for idx in indexes:
    if "upload-blob-knowledge-source" in idx.name:
        print(f"📊 Index Name: {idx.name}")
        print(f"   Created: {idx}")
        print()
        
        # Get full index details
        full_index = index_client.get_index(idx.name)
        
        # Check for vector search configuration
        has_vector_search = hasattr(full_index, 'vector_search') and full_index.vector_search is not None
        
        # Check for both text and vector fields
        text_fields = [f for f in full_index.fields if f.type == "Edm.String" and f.searchable]
        vector_fields = [f for f in full_index.fields if f.type == "Collection(Edm.Single)"]
        
        print(f"   🔍 Hybrid Search Configuration:")
        print(f"      • Vector Search Enabled: {has_vector_search}")
        print(f"      • Text/Keyword Fields: {len(text_fields)} (for BM25)")
        print(f"      • Vector Fields: {len(vector_fields)} (for semantic search)")
        print()
        
        if has_vector_search:
            print(f"   ✅ HYBRID SEARCH CONFIRMED!")
            print(f"      This index supports both:")
            print(f"      • BM25 keyword search (traditional full-text)")
            print(f"      • Vector semantic search (embedding-based)")
            print()
            
            # Show sample field names
            if text_fields:
                print(f"   📝 Sample Keyword Search Fields:")
                for f in text_fields[:3]:
                    print(f"      • {f.name} (searchable, filterable)")
            
            if vector_fields:
                print(f"   🧮 Sample Vector Fields:")
                for f in vector_fields[:3]:
                    dims = f.vector_search_dimensions if hasattr(f, 'vector_search_dimensions') else 'N/A'
                    print(f"      • {f.name} (dimensions: {dims})")
        else:
            print(f"   ⚠️  Keyword-only search (no vector fields found)")
        
        print()
        print("="*60)
        print()

print("""
💡 What This Means:

When you query with medium reasoning effort:
1. LLM decomposes your question into sub-queries
2. Each sub-query runs HYBRID search:
   • BM25 scores keyword matches (exact terms)
   • Vector search finds semantic matches (meaning)
   • Results are merged using Reciprocal Rank Fusion (RRF)
3. Semantic ranker (L3) re-ranks merged results
4. Top results go to answer synthesis

Your Hebrew query benefits from BOTH:
• BM25 catches exact Hebrew word matches
• Vectors capture meaning/intent even with different phrasing
""")

Search Indexes Created by Knowledge Sources:



📊 Index Name: upload-blob-knowledge-source-standard-index
   Created: {'additional_properties': {}, 'name': 'upload-blob-knowledge-source-standard-index', 'fields': [<azure.search.documents.indexes.models._index.SearchField object at 0x168123850>, <azure.search.documents.indexes.models._index.SearchField object at 0x168123950>, <azure.search.documents.indexes.models._index.SearchField object at 0x168a38150>, <azure.search.documents.indexes.models._index.SearchField object at 0x168a38050>, <azure.search.documents.indexes.models._index.SearchField object at 0x168a38750>], 'description': "Search index for knowledge source 'upload-blob-knowledge-source-standard'", 'scoring_profiles': [], 'default_scoring_profile': None, 'cors_options': None, 'suggesters': [], 'analyzers': None, 'tokenizers': None, 'token_filters': [], 'char_filters': [], 'normalizers': [], 'encryption_key': None, 'similarity': <azure.search.documents.indexes._generated.models._models_py3.BM25SimilarityAlgorithm object a
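
The note printed by the cell above says that BM25 and vector results are merged using Reciprocal Rank Fusion (RRF). As a rough illustration of the idea, the sketch below fuses two ranked lists by summing `1 / (k + rank)` for each document. The function name, the sample chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions for this example; the service's internal scoring may differ in detail.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for one sub-query.
bm25_ranking = ["chunk_a", "chunk_b", "chunk_c"]
vector_ranking = ["chunk_b", "chunk_d", "chunk_a"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Chunks that appear in both lists (`chunk_b`, `chunk_a`) rise above chunks found by only one retriever, which is why hybrid search tends to favor results that match on both keywords and meaning.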

## Step 12: Verify Hybrid Search Configuration

Run the cell below to inspect the underlying search index and confirm that hybrid search (BM25 + vector) is configured and being used.

## Summary

You've now experienced blob knowledge sources and compared different content extraction modes for document processing.

**Key concepts to remember:**
- `AzureBlobKnowledgeSource` automatically indexes documents from Azure Blob Storage
- **Minimal extraction** (Steps 3-7): Fast, basic text extraction suitable for simple documents
- **Standard extraction** (Steps 8-13): Uses Azure AI Services for advanced document understanding and better chunk quality
- **Reasoning effort levels** (Steps 11-12): Medium effort enables iterative retrieval with more comprehensive results
- Standard extraction is beneficial for complex documents with tables, images, or intricate layouts
- Both modes create searchable, vectorized chunks from your blob documents

### What's Next?

➡️ Continue to [Part 6: Combined Knowledge Sources](part6-combined-knowledge-source.ipynb) to learn how to query search indexes, web URLs, SharePoint, and blob storage simultaneously in a single knowledge base.

## 🔧 Advanced: Manually Configure Chunk Size and Overlap

**Important Discovery:** The Knowledge Source API doesn't expose `chunkingProperties`, but you can **directly modify the skillset** using the Azure AI Search REST API!

According to the [Microsoft documentation](https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-content-understanding#skill-parameters), you can configure:

- `maximumLength`: 300-50,000 characters (default varies)
- `overlapLength`: Must be less than half of maximumLength
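
Before sending values to the service, it can help to check them against the two constraints listed above. The helper below is a small sketch of such a check; `validate_chunking` is a hypothetical name, and the limits are simply the ones quoted in this section rather than values read from the API.

```python
def validate_chunking(maximum_length: int, overlap_length: int) -> dict:
    """Return a chunkingProperties dict, or raise ValueError if a limit is violated."""
    # Limit quoted above: maximumLength must be 300-50,000 characters.
    if not 300 <= maximum_length <= 50_000:
        raise ValueError(
            f"maximumLength must be between 300 and 50,000, got {maximum_length}"
        )
    # Limit quoted above: overlapLength must be less than half of maximumLength.
    if overlap_length < 0 or overlap_length >= maximum_length / 2:
        raise ValueError(
            f"overlapLength must be >= 0 and < {maximum_length / 2:g}, got {overlap_length}"
        )
    return {
        "unit": "characters",
        "maximumLength": maximum_length,
        "overlapLength": overlap_length,
    }


chunking = validate_chunking(1500, 200)
print(chunking)
```

Failing fast on the client side gives a clearer error message than waiting for the REST call to reject the skillset.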

Let's modify the existing skillset to add these parameters:

In [117]:
# Solution: Modify skillset to add chunkingProperties
# This directly patches the ContentUnderstandingSkill with custom chunk size/overlap

import requests
import json

# Get the skillset name created by the knowledge source
skillset_name = "upload-blob-knowledge-source-standard-optimized-skillset"

# Construct REST API URL
# IMPORTANT: ContentUnderstandingSkill requires 2025-11-01-Preview or later
api_version = "2025-11-01-Preview"
skillset_url = f"{endpoint}/skillsets/{skillset_name}?api-version={api_version}"

print(f"🔍 Fetching current skillset configuration...\n")

# Get current skillset
headers = {
    "api-key": credential.key,  # admin key held by the AzureKeyCredential created in Step 1
    "Content-Type": "application/json"
}

response = requests.get(skillset_url, headers=headers)
if response.status_code != 200:
    print(f"❌ Error fetching skillset: {response.status_code}")
    print(response.text)
else:
    skillset = response.json()
    
    print(f"✅ Retrieved skillset: {skillset['name']}\n")
    
    # Find the ContentUnderstandingSkill
    content_skill = None
    skill_index = None
    for idx, skill in enumerate(skillset['skills']):
        if skill['@odata.type'] == '#Microsoft.Skills.Util.ContentUnderstandingSkill':
            content_skill = skill
            skill_index = idx
            break
    
    if not content_skill:
        print("❌ ContentUnderstandingSkill not found in skillset!")
    else:
        print("✅ Found ContentUnderstandingSkill\n")
        print("Current configuration:")
        print(f"   • Has chunkingProperties: {'chunkingProperties' in content_skill}")
        
        if 'chunkingProperties' in content_skill:
            print(f"   • Current maximumLength: {content_skill['chunkingProperties'].get('maximumLength', 'Not set')}")
            print(f"   • Current overlapLength: {content_skill['chunkingProperties'].get('overlapLength', 'Not set')}")
        
        print("\n" + "="*80)
        print("📝 MODIFYING SKILLSET WITH CUSTOM CHUNKING")
        print("="*80)
        
        # Add/modify chunkingProperties for better table handling
        # Smaller chunks = more granular, better for tables
        # Overlap = preserves context across chunk boundaries
        skillset['skills'][skill_index]['chunkingProperties'] = {
            "unit": "characters",
            "maximumLength": 1500,  # Smaller than default to keep table rows together
            "overlapLength": 200     # 13% overlap to preserve table structure
        }
        
        print(f"\n✏️  New configuration:")
        print(f"   • maximumLength: 1500 characters")
        print(f"   • overlapLength:  200 characters")
        print(f"   • unit:           characters\n")
        
        print("💡 Why these values?")
        print("   • 1500 chars: Small enough to keep 5-10 table rows together")
        print("   • 200 overlap: Ensures table headers repeat in next chunk")
        print("   • This helps preserve table structure for RAG retrieval\n")
        
        # Update the skillset
        print("📤 Sending updated skillset to Azure...\n")
        
        update_response = requests.put(
            skillset_url,
            headers=headers,
            json=skillset
        )
        
        if update_response.status_code in [200, 201, 204]:
            print("✅ Skillset updated successfully!")
            print("\n⚠️  IMPORTANT: You must now RE-RUN the indexer to apply these changes:")
            print("   1. The indexer will re-process all documents")
            print("   2. New chunks will be created with the updated settings")
            print("   3. This may take several minutes depending on document size")
        else:
            print(f"❌ Error updating skillset: {update_response.status_code}")
            print(update_response.text)

🔍 Fetching current skillset configuration...

✅ Retrieved skillset: upload-blob-knowledge-source-standard-optimized-skillset

✅ Found ContentUnderstandingSkill

Current configuration:
   • Has chunkingProperties: True
   • Current maximumLength: 2000
   • Current overlapLength: 200

📝 MODIFYING SKILLSET WITH CUSTOM CHUNKING

✏️  New configuration:
   • maximumLength: 1500 characters
   • overlapLength:  200 characters
   • unit:           characters

💡 Why these values?
   • 1500 chars: Small enough to keep 5-10 table rows together
   • 200 overlap: Ensures table headers repeat in next chunk
   • This helps preserve table structure for RAG retrieval

📤 Sending updated skillset to Azure...

### Step 2: Re-run the Indexer to Apply Changes

After modifying the skillset, you need to re-run the indexer to re-process documents with the new chunking settings:

In [None]:
# Step 2: Re-run the indexer to apply new chunking settings
from azure.search.documents.indexes import SearchIndexerClient

indexer_name = "upload-blob-knowledge-source-standard-optimized-indexer"

indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)

print(f"🔄 Resetting and re-running indexer: {indexer_name}\n")

# Reset the indexer to re-process all documents
try:
    indexer_client.reset_indexer(indexer_name)
    print("✅ Indexer reset successful")
except Exception as e:
    print(f"⚠️  Reset warning: {e}")

# Run the indexer
print("▶️  Starting indexer run...\n")
try:
    indexer_client.run_indexer(indexer_name)
    print("✅ Indexer started!")
    print("\n⏳ The indexer is now re-processing documents with new chunk settings.")
    print("   This may take 5-10 minutes depending on document size.")
    print("\n📊 Check status with:")
    print(f"   indexer_client.get_indexer_status('{indexer_name}')")
except Exception as e:
    print(f"❌ Error running indexer: {e}")

### Step 3: Monitor Indexer Progress

In [None]:
# Step 3: Check indexer status
print("📊 Indexer Status\n")
print("="*80)

status = indexer_client.get_indexer_status(indexer_name)

# Get the last execution result
if status.last_result:
    last_run = status.last_result
    print(f"Status:                {last_run.status}")
    print(f"Items processed:       {last_run.items_processed}")
    print(f"Items failed:          {last_run.items_failed}")
    print(f"Start time:            {last_run.start_time}")
    print(f"End time:              {last_run.end_time}")
    
    if last_run.status == "inProgress":
        print("\n⏳ Indexing is still in progress...")
        print("   Re-run this cell in 1-2 minutes to check again")
    elif last_run.status == "success":
        print("\n✅ Indexing completed successfully!")
        print("   You can now query with the new chunk sizes")
    else:
        print(f"\n⚠️  Status: {last_run.status}")
        if last_run.errors:
            print(f"\nErrors:")
            for error in last_run.errors[:3]:  # Show first 3 errors
                print(f"   • {error}")
else:
    print("No execution results available yet.")
    
print("\n" + "="*80)
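
Instead of re-running the status cell by hand, the check could be wrapped in a small polling loop. The sketch below is one way to do that under stated assumptions: `wait_for_indexer` is a hypothetical helper, it takes any zero-argument callable that returns the latest status string (or `None` when no run has been recorded yet), and the demo uses a fake status sequence so it runs without an Azure connection.

```python
import time
from typing import Callable, Optional


def wait_for_indexer(
    get_last_status: Callable[[], Optional[str]],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the status is no longer 'inProgress' (or None), or until the timeout."""
    waited = 0.0
    while True:
        status = get_last_status()
        if status is not None and status != "inProgress":
            return status  # finished, e.g. "success" or a failure status
        if waited >= timeout_seconds:
            return status  # gave up waiting; caller sees the last observed status
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with a fake status sequence and a no-op sleep.
fake_statuses = iter([None, "inProgress", "inProgress", "success"])
final_status = wait_for_indexer(
    lambda: next(fake_statuses), poll_seconds=1, sleep=lambda seconds: None
)
print(final_status)

# In this notebook, the callable could plausibly be built from the cell above:
#   lambda: getattr(indexer_client.get_indexer_status(indexer_name).last_result, "status", None)
```

Passing the status getter and the sleep function in as arguments keeps the loop easy to test and avoids hard-coding the SDK call.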

### Step 4: Verify New Chunk Sizes

After indexing completes, let's verify that chunks are now the correct size:

In [None]:
# Step 4: Verify chunk sizes after re-indexing
from azure.search.documents import SearchClient

print("🔍 Analyzing chunk sizes in re-indexed data\n")
print("="*80)

# Get the index name
index_name = "upload-blob-knowledge-source-standard-optimized-index"

# Create search client (reuses the AzureKeyCredential created in Step 1)
search_client = SearchClient(
    endpoint=endpoint,
    index_name=index_name,
    credential=credential
)

# Sample some chunks to check sizes
results = search_client.search(
    search_text="*",  # Get any chunks
    top=20,
    include_total_count=True
)

# Use a distinct loop variable so the retrieval `result` from earlier cells is not overwritten
chunk_sizes = []
for doc in results:
    content = doc.get('snippet') or ''
    chunk_sizes.append(len(content))

if chunk_sizes:
    print(f"Total chunks in index: {results.get_count()}")
    print(f"\n📊 Chunk Size Analysis (sample of 20 chunks):")
    print(f"   • Minimum:  {min(chunk_sizes):,} characters")
    print(f"   • Maximum:  {max(chunk_sizes):,} characters")
    print(f"   • Average:  {sum(chunk_sizes) // len(chunk_sizes):,} characters")
    print(f"   • Target:   1,500 characters (with 200 overlap)")
    
    # Show distribution
    small = sum(1 for s in chunk_sizes if s < 1000)
    medium = sum(1 for s in chunk_sizes if 1000 <= s <= 1500)
    large = sum(1 for s in chunk_sizes if s > 1500)
    
    print(f"\n📈 Distribution:")
    print(f"   • < 1000 chars:     {small} chunks ({small*100//len(chunk_sizes)}%)")
    print(f"   • 1000-1500 chars:  {medium} chunks ({medium*100//len(chunk_sizes)}%)")
    print(f"   • > 1500 chars:     {large} chunks ({large*100//len(chunk_sizes)}%)")
    
    if max(chunk_sizes) <= 1500:
        print(f"\n✅ SUCCESS! All chunks are within the 1,500 character limit.")
        print(f"   This should improve table preservation during retrieval.")
    else:
        print(f"\n⚠️  Some chunks exceed 1,500 characters.")
        print(f"   This is normal for complex tables or formatting.")
else:
    print("❌ No chunks found. Wait for indexing to complete.")

print("\n" + "="*80)

---

## 📝 Summary: How to Configure Chunk Size & Overlap

**The Problem:** The Knowledge Source Python API doesn't expose `chunkingProperties` parameters.

**The Solution:** Use the Azure AI Search REST API to directly modify the skillset.

### Key Parameters:

| Parameter | Range | Recommended for Tables | Why |
|-----------|-------|----------------------|-----|
| `maximumLength` | 300-50,000 chars | 1,000-2,000 | Small enough to keep 5-10 table rows together |
| `overlapLength` | < maximumLength/2 | 10-20% of max | Ensures table headers repeat across chunks |
| `unit` | characters | characters | Only supported unit |
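
To see how `maximumLength` and `overlapLength` interact, the sketch below splits a string with a plain sliding window. This is only an illustration: the real skill takes document structure into account, so its chunk boundaries will differ, and `chunk_text` is a hypothetical helper rather than part of any SDK.

```python
from typing import List


def chunk_text(text: str, maximum_length: int, overlap_length: int) -> List[str]:
    """Split text into windows of maximum_length that share overlap_length characters."""
    if not 0 <= overlap_length < maximum_length:
        raise ValueError("overlap_length must be >= 0 and smaller than maximum_length")
    step = maximum_length - overlap_length  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + maximum_length])
        if start + maximum_length >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# A 4,000-character stand-in document, chunked with the settings used in this notebook.
sample = "".join(str(i % 10) for i in range(4000))
chunks = chunk_text(sample, maximum_length=1500, overlap_length=200)

print(len(chunks), [len(c) for c in chunks])
print(chunks[0][-200:] == chunks[1][:200])  # the tail of one chunk opens the next
```

Each window advances by `maximumLength - overlapLength` characters, so a larger overlap produces more chunks for the same document, which increases index size and indexing time.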

### When to Use Custom Chunking:

✅ **Use smaller chunks (1,000-1,500) when:**
- Documents have many tables
- Tables are wide (many columns)
- You need precise retrieval of specific table rows

✅ **Use larger chunks (3,000-5,000) when:**
- Documents are mostly narrative text
- You want broader context per chunk
- Tables are simple (2-3 columns)

✅ **Use overlap (200-500 chars) when:**
- Tables span multiple pages
- You need context across chunk boundaries
- Table headers should repeat

### Workflow:

1. **Modify skillset** → Add `chunkingProperties` via REST API
2. **Reset indexer** → Clear the indexer's change-tracking state so every document is re-processed
3. **Run indexer** → Re-process documents with new settings
4. **Verify** → Check chunk sizes are within expected range
5. **Query** → Test if retrieval improved for table-heavy queries

In [121]:
# VERIFICATION: Fetch the skillset again to confirm chunkingProperties were saved
print("🔍 VERIFICATION: Re-fetching skillset to confirm changes...\n")
print("="*80)

verify_response = requests.get(skillset_url, headers=headers)
if verify_response.status_code == 200:
    verified_skillset = verify_response.json()
    
    # Find ContentUnderstandingSkill again
    for skill in verified_skillset['skills']:
        if skill['@odata.type'] == '#Microsoft.Skills.Util.ContentUnderstandingSkill':
            print("✅ ContentUnderstandingSkill found\n")
            
            if 'chunkingProperties' in skill:
                props = skill['chunkingProperties']
                print("✅ chunkingProperties ARE saved in Azure!")
                print(f"\n📋 Confirmed settings:")
                print(f"   • unit:          {props.get('unit', 'Not set')}")
                print(f"   • maximumLength: {props.get('maximumLength', 'Not set')}")
                print(f"   • overlapLength: {props.get('overlapLength', 'Not set')}")
                
                print(f"\n💡 Explanation:")
                print(f"   The Azure Portal UI may not display chunkingProperties")
                print(f"   in the JSON viewer, but they ARE stored and WILL be used")
                print(f"   by the indexer when you re-run it.")
                
                print(f"\n✅ CONFIRMED: Your chunk settings are active!")
                
            else:
                print("❌ chunkingProperties NOT found after update!")
                print("   This shouldn't happen. The API update may have failed silently.")
            
            # Show the full ContentUnderstandingSkill JSON
            print("\n" + "="*80)
            print("📄 Full ContentUnderstandingSkill JSON (as stored in Azure):")
            print("="*80)
            print(json.dumps(skill, indent=2))
            
            break
else:
    print(f"❌ Error verifying: {verify_response.status_code}")
    print(verify_response.text)

print("\n" + "="*80)

🔍 VERIFICATION: Re-fetching skillset to confirm changes...

✅ ContentUnderstandingSkill found

✅ chunkingProperties ARE saved in Azure!

📋 Confirmed settings:
   • unit:          characters
   • maximumLength: 1500
   • overlapLength: 200

💡 Explanation:
   The Azure Portal UI may not display chunkingProperties
   in the JSON viewer, but they ARE stored and WILL be used
   by the indexer when you re-run it.

✅ CONFIRMED: Your chunk settings are active!

📄 Full ContentUnderstandingSkill JSON (as stored in Azure):
{
  "@odata.type": "#Microsoft.Skills.Util.ContentUnderstandingSkill",
  "name": "contentUnderstandingSkill",
  "description": null,
  "context": "/document",
  "extractionOptions": [
    "images",
    "locationMetadata"
  ],
  "inputs": [
    {
      "name": "file_data",
      "source": "/document/file_data",
      "sourceContext": null,
      "inputs": []
    }
  ],
  "outputs": [
    {
      "name": "text_sections",
      "targetName": "text_sections"

## Create Optimized Knowledge Base

Now that we have the optimized knowledge source with custom chunking (1500 chars, 200 overlap), we need to create a **knowledge base** that references it. The knowledge base is what you query - it's the API layer over the knowledge source and its index.

In [124]:
from azure.search.documents.indexes.models import KnowledgeBase, KnowledgeBaseAzureOpenAIModel, KnowledgeRetrievalOutputMode, KnowledgeSourceReference

# Create knowledge base that references the optimized knowledge source
optimized_knowledge_base = KnowledgeBase(
    name=optimized_knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=optimized_knowledge_source.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS
)

print(f"Creating knowledge base '{optimized_knowledge_base_name}'...")
print(f"   References knowledge source: '{optimized_knowledge_source.name}'")
print(f"   Uses index: '{optimized_knowledge_source.name}-index'")

index_client.create_or_update_knowledge_base(optimized_knowledge_base)
print(f"\n✅ Knowledge base '{optimized_knowledge_base_name}' created successfully!")
print(f"\n💡 This knowledge base queries the 849 chunks created with 1500-char chunking.")

Creating knowledge base 'upload-blob-knowledge-base-standard-optimized'...
   References knowledge source: 'upload-blob-knowledge-source-standard-optimized'
   Uses index: 'upload-blob-knowledge-source-standard-optimized-index'

✅ Knowledge base 'upload-blob-knowledge-base-standard-optimized' created successfully!

💡 This knowledge base queries the 849 chunks created with 1500-char chunking.
