# Part 5: Blob Knowledge Source

In Parts 1-4, you worked with pre-indexed data, SharePoint, and web sources. In Part 5, you'll upload a document to Azure Blob Storage and create knowledge sources that index it automatically. You'll also compare two content extraction modes: **minimal** (basic content extraction) and **standard** (advanced content understanding with Azure AI Services).

## Step 1: Load Environment Variables

Run the cell below to load the configuration for your Azure resources. When asked to select a kernel, choose the **.venv(3.11.9)** environment that was created for you.

Notice the additional variables for blob storage, AI services, and embedding models, which are needed for document ingestion and vectorization. All these Azure resources are pre-configured in `.env` for you.

> **⚠️ Troubleshooting**
>
> If code cells get stuck and keep spinning, select **Restart** from the notebook toolbar at the top. If the issue persists after a couple of tries, close VS Code completely and reopen it.

In [73]:
import os

from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv

load_dotenv(override=True) # take environment variables from .env.

# Azure AI Search configuration
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"])

# Knowledge base names
knowledge_base_name = "upload-blob-knowledge-base-minimal"
standard_knowledge_base_name = "upload-blob-knowledge-base-standard"

# Azure OpenAI configuration
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.environ["AZURE_OPENAI_KEY"]
azure_openai_chatgpt_deployment = os.getenv("AZURE_OPENAI_CHATGPT_DEPLOYMENT", "gpt-4.1")
azure_openai_chatgpt_model_name = os.getenv("AZURE_OPENAI_CHATGPT_MODEL_NAME", "gpt-4.1")
azure_openai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
azure_openai_embedding_model_name = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL_NAME", "text-embedding-3-large")

# Blob configuration
blob_connection_string = os.environ.get("BLOB_CONNECTION_STRING")
blob_resource_id = os.environ.get("BLOB_RESOURCE_ID")
blob_container_name = os.environ["BLOB_CONTAINER_NAME"]
ai_services_endpoint = os.environ["AI_SERVICES_ENDPOINT"]
ai_services_key = os.environ["AI_SERVICES_KEY"]

blob_path = "../data/ai-search-data/blobdata/MSFT_cloud_architecture_zava.pdf"

print("Environment variables loaded")

Environment variables loaded
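
Step 1 reads some settings with `os.environ[...]`, which raises `KeyError` when a value is missing, and others with `os.environ.get` or `os.getenv`, which fall back silently. A minimal sketch of a pre-flight check follows; the helper name `find_missing_settings` is hypothetical, and the list simply mirrors the variables that the cell above treats as required.

```python
import os

# Settings that Step 1 reads with os.environ[...], so a missing one raises KeyError.
REQUIRED_SETTINGS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_KEY",
    "BLOB_CONTAINER_NAME",
    "AI_SERVICES_ENDPOINT",
    "AI_SERVICES_KEY",
]


def find_missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are absent or empty in `env` (hypothetical helper)."""
    return [name for name in required if not env.get(name)]


missing = find_missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present")
```

Running a check like this before Step 1 reports every missing name at once instead of stopping at the first `KeyError`.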


In [52]:
# -------------------------------------------------------------------------
# OPTIONAL: CLEANUP CELL
# Run this cell if you want to restart Part 5 from scratch.
# It deletes the Knowledge Bases and Knowledge Sources created in this lab.
# -------------------------------------------------------------------------
from azure.core.exceptions import ResourceNotFoundError
from azure.search.documents.indexes import SearchIndexClient

# Ensure client is ready (uses variables from Step 1)
if "endpoint" in globals() and "credential" in globals():
    cleanup_client = SearchIndexClient(endpoint=endpoint, credential=credential)
    
    items_to_delete = [
        ("Knowledge Base", cleanup_client.delete_knowledge_base, "upload-blob-knowledge-base-minimal"),
        ("Knowledge Base", cleanup_client.delete_knowledge_base, "upload-blob-knowledge-base-standard"),
        ("Knowledge Source", cleanup_client.delete_knowledge_source, "upload-blob-knowledge-source-minimal"),
        ("Knowledge Source", cleanup_client.delete_knowledge_source, "upload-blob-knowledge-source-standard"),
    ]

    print("🧹 Starting cleanup of Part 5 resources...")
    for label, delete_func, name in items_to_delete:
        try:
            delete_func(name)
            print(f"   ✅ Deleted {label}: {name}")
        except ResourceNotFoundError:
            print(f"   ⚠️ {label} already deleted: {name}")
        except Exception as e:
            print(f"   ❌ Error deleting {name}: {e}")

    print("✨ Cleanup complete. You can now continue to Step 2.")
else:
    print("❌ Error: Please run Step 1 first to load environment variables.")

🧹 Starting cleanup of Part 5 resources...
   ✅ Deleted Knowledge Base: upload-blob-knowledge-base-minimal
   ✅ Deleted Knowledge Base: upload-blob-knowledge-base-standard
   ✅ Deleted Knowledge Source: upload-blob-knowledge-source-minimal
   ✅ Deleted Knowledge Source: upload-blob-knowledge-source-standard
✨ Cleanup complete. You can now continue to Step 2.


## Step 2: Upload Document to Blob Storage

Before creating a knowledge source, you need to upload a document to your blob storage. The code below uploads a PDF called `MSFT_cloud_architecture_zava.pdf` which contains information about Zava's cloud architecture and how they classify data by sensitivity level.

Once you create the blob knowledge source in the next step, it will automatically find this PDF in the storage and index it for querying.

In [2]:
import os
from azure.core.exceptions import ClientAuthenticationError, HttpResponseError
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# Require an account URL for Azure AD auth (no keys). Prefer BLOB_ACCOUNT_URL in .env.
account_url = os.environ.get("BLOB_ACCOUNT_URL")

# Fallback: derive account URL from the blob connection string without using the key
if not account_url:
    conn = os.environ.get("BLOB_CONNECTION_STRING") or globals().get("blob_connection_string")
    if conn:
        account_name = None
        for part in conn.split(";"):
            if part.lower().startswith("accountname="):
                account_name = part.split("=", 1)[1]
                break
        if account_name:
            account_url = f"https://{account_name}.blob.core.windows.net"

if not account_url:
    raise ValueError("Missing BLOB_ACCOUNT_URL. Set it in .env or rerun setup-environment.sh to populate it.")

# Use Azure AD (managed identity/VS Code signed-in user/service principal) instead of account keys
# Stored under its own name so the search `credential` from Step 1 is not overwritten
blob_credential = DefaultAzureCredential(exclude_shared_token_cache_credential=True)
blob_service_client = BlobServiceClient(account_url=account_url, credential=blob_credential)
container_client = blob_service_client.get_container_client(blob_container_name)

# Ensure container exists (idempotent)
try:
    container_client.create_container()
except HttpResponseError as e:
    if e.status_code != 409:
        raise
except ClientAuthenticationError as e:
    raise RuntimeError("Authentication failed. Ensure your identity has 'Storage Blob Data Contributor' on the storage account.") from e

blob_name = os.path.basename(blob_path)
blob_client = container_client.get_blob_client(blob_name)

# Upload directly; avoid exists() to reduce permission needs
try:
    with open(blob_path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)
except ClientAuthenticationError as e:
    raise RuntimeError("Upload failed. Confirm your identity has 'Storage Blob Data Contributor' on the storage account.") from e
except HttpResponseError as e:
    # error_code can be None, so normalize it before comparing
    if str(getattr(e, "error_code", None) or "").lower() == "authorizationpermissionmismatch":
        raise RuntimeError("Authorization failed (AuthorizationPermissionMismatch). Ensure your identity has 'Storage Blob Data Contributor' on the storage account.") from e
    raise

print(f"Setup sample data in {blob_container_name} using Azure AD auth")

Setup sample data in documents using Azure AD auth
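
The fallback in the cell above derives the account URL from the `AccountName` field of the connection string without touching the key. The same idea can be sketched as a small standalone function. The helper names and the sample connection string are made up for illustration, and honoring `EndpointSuffix` is an addition that the notebook cell does not make (it always uses `core.windows.net`).

```python
def parse_connection_string(conn):
    """Split 'Key=Value;Key=Value' into a dict with lower-cased keys."""
    settings = {}
    for part in conn.split(";"):
        if "=" in part:
            # Split on the first '=' only: values such as account keys may contain '='
            key, value = part.split("=", 1)
            settings[key.strip().lower()] = value.strip()
    return settings


def account_url_from_connection_string(conn):
    """Build the blob endpoint URL from AccountName; return None if it is absent."""
    settings = parse_connection_string(conn)
    name = settings.get("accountname")
    if not name:
        return None
    # Assumption: fall back to the public-cloud suffix when none is given
    suffix = settings.get("endpointsuffix", "core.windows.net")
    return f"https://{name}.blob.{suffix}"


# Made-up connection string, for illustration only
example = (
    "DefaultEndpointsProtocol=https;AccountName=examplestore;"
    "AccountKey=abc==;EndpointSuffix=core.windows.net"
)
print(account_url_from_connection_string(example))
```

The account key is parsed but never used, which matches the notebook's intent of authenticating with Azure AD rather than shared keys.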


## Step 3: Create Blob Knowledge Source with Minimal Extraction

An **AzureBlobKnowledgeSource** automatically indexes documents from blob storage. Unlike the sources you've used before, this one ingests and processes the documents for you.

The code below creates a knowledge source with a `content_extraction_mode` of **minimal**. This mode chunks documents quickly without deep semantic understanding. An embedding model (`text-embedding-3-large`) is used to vectorize the chunks for vector search, but the chunking strategy itself is basic and fast.

>Minimal indexing is ideal when you need speed and have straightforward documents.

In [3]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    AzureBlobKnowledgeSource,
    AzureBlobKnowledgeSourceParameters,
    AzureOpenAIVectorizerParameters,
    KnowledgeSourceAzureOpenAIVectorizer,
    KnowledgeSourceContentExtractionMode,
    KnowledgeSourceIngestionParameters,
    SearchIndexerDataNoneIdentity
)

index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

embedding_model = KnowledgeSourceAzureOpenAIVectorizer(
    azure_open_ai_parameters=AzureOpenAIVectorizerParameters(
        resource_url=azure_openai_endpoint,
        api_key=azure_openai_key,
        deployment_name=azure_openai_embedding_deployment,
        model_name=azure_openai_embedding_model_name
    )
)

if blob_resource_id:
    blob_connection = f"ResourceId={blob_resource_id}"
else:
    blob_connection = blob_connection_string

if not blob_connection:
    raise ValueError("Missing blob connection info. Set BLOB_RESOURCE_ID or BLOB_CONNECTION_STRING via setup-environment.sh.")

# No user-assigned identity is set; a ResourceId connection string then relies on the search service's system-assigned identity
ingestion_identity = SearchIndexerDataNoneIdentity()

knowledge_source = AzureBlobKnowledgeSource(
    name="upload-blob-knowledge-source-minimal",
    azure_blob_parameters=AzureBlobKnowledgeSourceParameters(
        connection_string=blob_connection,
        container_name=blob_container_name,
        ingestion_parameters=KnowledgeSourceIngestionParameters(
            identity=ingestion_identity,
            embedding_model=embedding_model,
            content_extraction_mode=KnowledgeSourceContentExtractionMode.MINIMAL
        )
    )
)

index_client.create_or_update_knowledge_source(knowledge_source=knowledge_source)
print(f"Knowledge source '{knowledge_source.name}' created or updated successfully.")

Knowledge source 'upload-blob-knowledge-source-minimal' created or updated successfully.
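
To make "basic and fast" chunking more concrete, here is a minimal sketch of fixed-size chunking with overlap. It is only an illustration of the general technique: this tutorial does not describe the algorithm the service uses in minimal mode, and the function name, chunk size, and overlap are assumptions chosen for the example.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
    print(chunk)
```

A splitter like this ignores headings, tables, and sentence boundaries, which is the kind of structure that standard mode is meant to preserve.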


## Step 4: Check Knowledge Source Status

After creating a blob knowledge source, it needs time to process the documents. The code below checks whether indexing is complete, in progress, or failed.

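When `itemsUpdatesProcessed` is 1 and `itemsUpdatesFailed` is 0, the single document has been indexed successfully. If the counters are still 0 or the last synchronization state is missing, wait a few seconds and rerun the cell. Once indexing is complete, you can move to the next step.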

In [4]:
import json

status = index_client.get_knowledge_source_status(knowledge_source.name)

print(json.dumps(status.serialize(), indent=2))

{
  "synchronizationStatus": "active",
  "synchronizationInterval": "1d",
  "lastSynchronizationState": {
    "startTime": "2025-12-07T15:43:23.549Z",
    "endTime": "2025-12-07T15:43:29.335Z",
    "itemsUpdatesProcessed": 1,
    "itemsUpdatesFailed": 0,
    "itemsSkipped": 0
  },
  "statistics": {
    "totalSynchronization": 1,
    "averageSynchronizationDuration": "PT5.7864929S",
    "averageItemsProcessedPerSynchronization": 1
  }
}
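
Reading the status by eye works for one document, but the same check can be sketched in code. The function below interprets a status dictionary shaped like the output above. The function name and the returned labels are hypothetical, and treating a missing `lastSynchronizationState` or `endTime` as "in progress" is an assumption about how an unfinished first run is reported.

```python
def summarize_sync_status(status):
    """Classify a knowledge source status dict as complete, in progress, or failed."""
    last = status.get("lastSynchronizationState") or {}
    if "endTime" not in last:
        # Assumption: no finished synchronization has been recorded yet
        return "in progress"
    if last.get("itemsUpdatesFailed", 0) > 0:
        return "failed"
    if last.get("itemsUpdatesProcessed", 0) > 0:
        return "complete"
    return "no items processed"


sample_status = {
    "synchronizationStatus": "active",
    "lastSynchronizationState": {
        "startTime": "2025-12-07T15:43:23.549Z",
        "endTime": "2025-12-07T15:43:29.335Z",
        "itemsUpdatesProcessed": 1,
        "itemsUpdatesFailed": 0,
        "itemsSkipped": 0,
    },
}
print(summarize_sync_status(sample_status))
```

In the notebook, the input would be the dictionary returned by `status.serialize()`, as printed in the cell above.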


## Step 5: Create Knowledge Base

Now that the blob knowledge source has indexed the document, you can create a knowledge base to query it. The code below creates a knowledge base that uses the blob knowledge source you created earlier.

Notice that this knowledge base also sets `retrieval_reasoning_effort` to "low". Currently, the lowest possible effort is "minimal" and the highest is "medium". The "low" effort still performs query decomposition, but it does not do iterative retrieval.

In [5]:
from azure.search.documents.indexes.models import (
    AzureOpenAIVectorizerParameters,
    KnowledgeBase,
    KnowledgeBaseAzureOpenAIModel,
    KnowledgeRetrievalLowReasoningEffort,
    KnowledgeRetrievalOutputMode,
    KnowledgeSourceReference,
)

aoai_params = AzureOpenAIVectorizerParameters(
    resource_url=azure_openai_endpoint,
    api_key=azure_openai_key,
    deployment_name=azure_openai_chatgpt_deployment,
    model_name=azure_openai_chatgpt_model_name,
)

knowledge_base = KnowledgeBase(
    name=knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=knowledge_source.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS,
    retrieval_reasoning_effort=KnowledgeRetrievalLowReasoningEffort()  # pass an instance, not the class
)

index_client.create_or_update_knowledge_base(knowledge_base)
print(f"Knowledge base '{knowledge_base_name}' created or updated successfully.")

Knowledge base 'upload-blob-knowledge-base-minimal' created or updated successfully.


## Step 6: Use Agentic Retrieval to Fetch Results from Blob Knowledge Source

The code below queries the PDF document about Zava's data sensitivity classification levels. This demonstrates how agentic retrieval works with blob knowledge sources.

When you run this query, the knowledge base analyzes your question, decomposes it into focused subqueries, searches the blob-indexed content concurrently, uses semantic ranking to filter results, and synthesizes a grounded answer with citations pointing back to the PDF document.

In [6]:
import os
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import AzureBlobKnowledgeSourceParams, KnowledgeBaseMessage, KnowledgeBaseMessageTextContent, KnowledgeBaseRetrievalRequest
from IPython.display import display, Markdown

if "endpoint" not in globals() or "knowledge_base_name" not in globals():
    raise RuntimeError("Missing notebook state. Rerun Steps 1-5 to reload endpoint, credential, and knowledge_base_name.")

# Prefer admin key if present; otherwise fall back to AAD (managed identity/service principal) for retrieval
admin_key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
search_credential = AzureKeyCredential(admin_key) if admin_key else DefaultAzureCredential(exclude_shared_token_cache_credential=True)

# If the knowledge source object is not in scope (e.g., after a kernel restart), refetch it by name
if "knowledge_source" not in globals():
    knowledge_source = index_client.get_knowledge_source("upload-blob-knowledge-source-minimal")

knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=knowledge_base_name, credential=search_credential)

blob_ks_params = AzureBlobKnowledgeSourceParams(
    knowledge_source_name=knowledge_source.name,
    include_references=True,
    include_reference_source_data=True
)
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="What are the levels of Zava data sensitivity classification?")])
    ],
    knowledge_source_params=[
        blob_ks_params
    ],
    include_activity=True
)

result = knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))

Zava's data sensitivity classification consists of three levels:

- Level 1: Low business value. Examples include normal business communications (such as email) and files for administrative, sales, and support workers.
- Level 2: Medium business value. Examples include financial and legal information, as well as research and development data for new products.
- Level 3: High business value. Examples include customer and partner personally identifiable information, product engineering specifications, and proprietary manufacturing techniques [ref_id:0][ref_id:2].
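
The synthesized answer marks its citations inline as `[ref_id:N]`. A small sketch of pulling those ids out of the answer text, so they can be matched against the `id` values in the references, is shown here. The helper name is hypothetical and the sample string is shortened from the answer above.

```python
import re

REF_PATTERN = re.compile(r"\[ref_id:(\d+)\]")


def extract_ref_ids(answer):
    """Return the cited reference ids in order of first appearance, without duplicates."""
    seen = []
    for match in REF_PATTERN.finditer(answer):
        ref_id = int(match.group(1))
        if ref_id not in seen:
            seen.append(ref_id)
    return seen


answer = "Level 3: High business value. Examples include customer data [ref_id:0][ref_id:2]."
print(extract_ref_ids(answer))  # [0, 2]
```

Reference ids in the retrieval result are strings (for example `"0"`), so convert with `str(ref_id)` before comparing.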

## Step 7: Review Response, References, and Activity

The cells below show the citations and activity log from the blob knowledge source query.

The references reveal which chunks from the PDF were used to answer your question. 

The activity log shows how the knowledge base processed your query and retrieved information from the blob-indexed content.

In [7]:
import json

references = json.dumps([ref.as_dict() for ref in result.references], indent=2)
print(references)

[
  {
    "type": "azureBlob",
    "id": "0",
    "activity_source": 1,
    "source_data": {
      "uid": "f397da4b4eb3_aHR0cHM6Ly9sYWI1MTFzdGxna3h4Z2k0dGtnY20uYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50cy9NU0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV96YXZhLnBkZg2_pages_17",
      "blob_url": "https://lab511stlgkxxgi4tkgcm.blob.core.windows.net/documents/MSFT_cloud_architecture_zava.pdf",
      "snippet": "No data sent across the Internet is in plain text form. Always use HTTPS connections, IPsec, or other end -to-end data \n\nencryption methods. \n\nEncryption for data at rest in \n\nthe cloud \n\n \n\nAll data stored on disks or elsewhere in the cloud must be in an encrypted form. \n\nACLs for least privilege \n\naccess \n\n \n\nAccount permissions to access resources in the cloud and what they are allowed to do must follow least-privilege guidelines. \n\n \n\nZava s data sensitivity classification \nUsing the information in Microsoft s Data Classification Toolkit, Zava performed an analysis of their

In [8]:
import pandas as pd

activity_types = [{"type": a.type} for a in result.activity]

df = pd.DataFrame(activity_types)

print("Activity Log Steps")
df

Activity Log Steps

|   | type |
|---|------|
| 0 | modelQueryPlanning |
| 1 | azureBlob |
| 2 | agenticReasoning |
| 3 | modelAnswerSynthesis |


In [9]:
activity_content = json.dumps([a.as_dict() for a in result.activity], indent=2)
print("Activity Details")
print(activity_content)

Activity Details
[
  {
    "id": 0,
    "type": "modelQueryPlanning",
    "elapsed_ms": 1208,
    "input_tokens": 1456,
    "output_tokens": 51
  },
  {
    "id": 1,
    "type": "azureBlob",
    "elapsed_ms": 601,
    "knowledge_source_name": "upload-blob-knowledge-source-minimal",
    "query_time": "2025-12-07T18:23:00.459Z",
    "count": 5,
    "azure_blob_arguments": {
      "search": "Zava data sensitivity classification levels"
    }
  },
  {
    "id": 2,
    "type": "agenticReasoning",
    "reasoning_tokens": 11871,
    "retrieval_reasoning_effort": {
      "kind": "low"
    }
  },
  {
    "id": 3,
    "type": "modelAnswerSynthesis",
    "elapsed_ms": 1409,
    "input_tokens": 5398,
    "output_tokens": 120
  }
]
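
The activity records carry timing and token counts per step, so a query's cost can be totaled with a few lines of code. The sketch below sums the numeric fields that appear in the output above. The helper name is hypothetical, and the sample list copies only the relevant fields from that output.

```python
def summarize_activity(activity):
    """Sum elapsed time and token counts across activity records (missing fields count as 0)."""
    totals = {"elapsed_ms": 0, "input_tokens": 0, "output_tokens": 0, "reasoning_tokens": 0}
    for step in activity:
        for key in totals:
            totals[key] += step.get(key, 0)
    return totals


# Values copied from the activity details printed above
activity = [
    {"id": 0, "type": "modelQueryPlanning", "elapsed_ms": 1208, "input_tokens": 1456, "output_tokens": 51},
    {"id": 1, "type": "azureBlob", "elapsed_ms": 601, "count": 5},
    {"id": 2, "type": "agenticReasoning", "reasoning_tokens": 11871},
    {"id": 3, "type": "modelAnswerSynthesis", "elapsed_ms": 1409, "input_tokens": 5398, "output_tokens": 120},
]
print(summarize_activity(activity))
```

In the notebook, the same function could be applied to `[a.as_dict() for a in result.activity]`.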


## Step 8: Use Standard Extraction Mode with Content Understanding

In the previous steps, you created a blob knowledge source with minimal extraction mode. Now, you'll create another blob knowledge source using the **standard** extraction mode, which leverages Azure AI Services for deeper content understanding. This mode provides advanced chunking strategies, semantic extraction, and better handling of complex documents.

The code below adds `content_extraction_mode=STANDARD` and connects Azure AI Services for enhanced processing. 

>Standard extraction takes longer but produces higher-quality chunks that preserve document structure and relationships.

In [74]:
from azure.search.documents.indexes.models import AIServices, KnowledgeSourceContentExtractionMode
from azure.core.exceptions import ResourceNotFoundError, HttpResponseError
import time

# A knowledge source keeps the credentials it was created with, so delete any
# existing standard-mode resources first and recreate them with the current
# AI Services values from .env.
ks_name = "upload-blob-knowledge-source-standard"
kb_name = "upload-blob-knowledge-base-standard"

print("🔄 FORCE DELETE - Removing resources with wrong credentials...")

# 1. Delete the knowledge base first, because it depends on the knowledge source
try:
    index_client.delete_knowledge_base(kb_name)
    print(f"   ✅ Deleted Knowledge Base: {kb_name}")
except ResourceNotFoundError:
    print(f"   ℹ️  Knowledge Base doesn't exist: {kb_name}")
except Exception as e:
    print(f"   ⚠️  Error deleting KB: {e}")

time.sleep(3)  # Wait for KB deletion to propagate

# 2. Delete the knowledge source
try:
    index_client.delete_knowledge_source(ks_name)
    print(f"   ✅ Deleted Knowledge Source: {ks_name}")
    print("   ⏳ Waiting 10 seconds for Azure to fully remove it...")
    time.sleep(10)  # Longer wait to ensure complete deletion
except ResourceNotFoundError:
    print(f"   ℹ️  Knowledge Source doesn't exist: {ks_name}")
except Exception as e:
    print(f"   ⚠️  Error deleting KS: {e}")

# 3. Verify the knowledge source is gone (up to 5 checks, 5 seconds apart)
print("\n🔍 Verifying deletion completed...")
ks_exists = True
for attempt in range(5):
    try:
        index_client.get_knowledge_source(ks_name)
        print(f"   ⚠️  Still exists (attempt {attempt+1}/5)... waiting 5 more seconds")
        time.sleep(5)
    except ResourceNotFoundError:
        print("   ✅ CONFIRMED: Knowledge Source is fully deleted!")
        ks_exists = False
        break

if ks_exists:
    print("\n❌ ERROR: Knowledge Source still exists after 30 seconds!")
    print("   Please wait 1 minute and run this cell again.")
    raise RuntimeError("Knowledge Source deletion did not complete")

# 4. Create the knowledge source again. The API key is deliberately not printed.
print("\n▶️  Creating BRAND NEW Knowledge Source with correct credentials...")
print(f"   Endpoint: {ai_services_endpoint}")

standard_knowledge_source = AzureBlobKnowledgeSource(
    name=ks_name,
    azure_blob_parameters=AzureBlobKnowledgeSourceParameters(
        connection_string=blob_connection,
        container_name=blob_container_name,
        ingestion_parameters=KnowledgeSourceIngestionParameters(
            identity=ingestion_identity,
            embedding_model=embedding_model,
            ai_services=AIServices(uri=ai_services_endpoint, api_key=ai_services_key),
            content_extraction_mode=KnowledgeSourceContentExtractionMode.STANDARD
        )
    )
)

# create_or_update creates the resource here because the old one was deleted above
try:
    index_client.create_or_update_knowledge_source(knowledge_source=standard_knowledge_source)
    print("\n✅ NEW Knowledge Source created!")
    print("\n⏳ Waiting 10 seconds for indexing to start...")
    time.sleep(10)
except HttpResponseError as e:
    if "already exists" in str(e).lower():
        print("\n❌ ERROR: Resource still exists! Wait 60 seconds and try again.")
    raise

🔄 FORCE DELETE - Removing resources with wrong credentials...
   ✅ Deleted Knowledge Base: upload-blob-knowledge-base-standard
   ✅ Deleted Knowledge Source: upload-blob-knowledge-source-standard
   ⏳ Waiting 10 seconds for Azure to fully remove it...

🔍 Verifying deletion completed...
   ✅ CONFIRMED: Knowledge Source is fully deleted!

▶️  Creating BRAND NEW Knowledge Source with correct credentials...
   Endpoint: https://lab511-ai-services-lgkxxgi4tkgcm.cognitiveservices.azure.com/

## Step 9: Check Standard Extraction Status

Run the cell below to monitor the standard extraction progress. This mode uses Azure AI Services to analyze document structure, recognize tables, and perform intelligent chunking, which takes more time than the minimal extraction mode you used earlier.

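When `itemsUpdatesProcessed` is 1 and `itemsUpdatesFailed` is 0, the single document has been indexed successfully. If the counters are still 0 or the last synchronization state is missing, wait a few seconds and rerun the cell. Once indexing is complete, you can move to the next step.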

In [75]:
import json

status = index_client.get_knowledge_source_status(standard_knowledge_source.name)

print(json.dumps(status.serialize(), indent=2))

{
  "synchronizationStatus": "active",
  "synchronizationInterval": "1d",
  "lastSynchronizationState": {
    "startTime": "2025-12-07T20:06:02.128Z",
    "endTime": "2025-12-07T20:06:57.715Z",
    "itemsUpdatesProcessed": 1,
    "itemsUpdatesFailed": 0,
    "itemsSkipped": 0
  },
  "statistics": {
    "totalSynchronization": 1,
    "averageSynchronizationDuration": "PT55.5871275S",
    "averageItemsProcessedPerSynchronization": 1
  }
}
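
The `averageSynchronizationDuration` values make the speed difference between the two modes measurable: `PT5.7864929S` in Step 4 against `PT55.5871275S` here. A minimal sketch for comparing them follows. The parser is a hypothetical helper that handles only the seconds-only ISO 8601 form seen in these outputs, not durations with minutes or hours.

```python
import re

# Matches only the "PT<seconds>S" form that appears in the status output
_SECONDS_ONLY = re.compile(r"^PT(\d+(?:\.\d+)?)S$")


def parse_seconds(duration):
    """Convert a seconds-only ISO 8601 duration such as 'PT5.78S' to a float."""
    match = _SECONDS_ONLY.match(duration)
    if not match:
        raise ValueError(f"Unsupported duration format: {duration!r}")
    return float(match.group(1))


minimal = parse_seconds("PT5.7864929S")     # from the Step 4 output
standard = parse_seconds("PT55.5871275S")   # from the output above
print(f"minimal: {minimal:.1f}s, standard: {standard:.1f}s, ratio: {standard / minimal:.1f}x")
```

For this single PDF, standard extraction took roughly ten times as long as minimal extraction. One run on one document is not a benchmark, so treat the ratio as indicative only.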


## Step 10: Create Knowledge Base for Standard Extraction

You'll now create a knowledge base that uses the standard extraction blob knowledge source. This knowledge base will benefit from the enhanced document processing and improved chunk quality.

Run the cell below to create the knowledge base with the standard extraction source.

In [76]:
from azure.search.documents.indexes.models import KnowledgeBase, KnowledgeBaseAzureOpenAIModel, KnowledgeRetrievalOutputMode, KnowledgeSourceReference

standard_knowledge_base = KnowledgeBase(
    name=standard_knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=standard_knowledge_source.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS
)

index_client.create_or_update_knowledge_base(standard_knowledge_base)
print(f"Knowledge base '{standard_knowledge_base_name}' created or updated successfully.")

Knowledge base 'upload-blob-knowledge-base-standard' created or updated successfully.


## Step 11: Query Standard Extraction Knowledge Base

Run the same query about Zava's data sensitivity classification levels, but this time against the standard extraction knowledge base. 

Compare this response with the one from Step 6. You may notice differences in answer quality, completeness, or organization due to the improved document processing.

In [77]:
import os
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import AzureBlobKnowledgeSourceParams, KnowledgeBaseMessage, KnowledgeBaseMessageTextContent, KnowledgeBaseRetrievalRequest
from IPython.display import display, Markdown

# Reload the search endpoint from the environment in case an earlier cell overwrote it
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]

if "standard_knowledge_base_name" not in globals():
    raise RuntimeError("Missing notebook state. Rerun Steps 1-10 to reload standard knowledge base name.")

# Prefer admin key if present; otherwise fall back to AAD (managed identity/service principal) for retrieval
admin_key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
search_credential = AzureKeyCredential(admin_key) if admin_key else DefaultAzureCredential(exclude_shared_token_cache_credential=True)

# If the standard knowledge source object is not in scope (e.g., after a kernel restart), refetch it by name
if "standard_knowledge_source" not in globals():
    standard_knowledge_source = index_client.get_knowledge_source("upload-blob-knowledge-source-standard")

standard_knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=standard_knowledge_base_name, credential=search_credential)

blob_ks_params = AzureBlobKnowledgeSourceParams(
    knowledge_source_name=standard_knowledge_source.name,
    include_references=True,
    include_reference_source_data=True
)
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="What are the levels of Zava data sensitivity classification?")])
    ],
    knowledge_source_params=[
        blob_ks_params
    ],
    include_activity=True
)

result = standard_knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))

Zava's data sensitivity classification consists of three levels:

- Level 1: Low business value. This includes data that is encrypted and available only to authenticated users, such as normal business communications (email) and files for administrative, sales, and support workers. Data is encrypted both at rest and in transit, and is provided for all data stored on premises and in cloud-based storage and workloads [ref_id:0].

- Level 2: Medium business value. This level adds strong authentication (such as multi-factor authentication with SMS validation) and data loss prevention to Level 1 protections. It covers financial and legal information and research and development data for new products. Data loss prevention ensures sensitive information does not leave the on-premises network [ref_id:0].

- Level 3: High business value. This level includes the highest levels of encryption, authentication (multi-factor authentication with smart cards), and auditing, compliant with regional regulations. It applies to customer and partner personally identifiable information, product engineering specifications, and proprietary manufacturing techniques [ref_id:0].

## Step 12: Compare Extraction Results

The cell below shows citations from the standard extraction query.

Compare these references with those from Step 7 to see how different extraction modes affect chunk creation and information retrieval from the same PDF document.

In [78]:
import json

references = json.dumps([ref.as_dict() for ref in result.references], indent=2)
print(references)

[
  {
    "type": "azureBlob",
    "id": "0",
    "activity_source": 1,
    "source_data": {
      "uid": "9cf272ed2627_aHR0cHM6Ly9sYWI1MTFzdGxna3h4Z2k0dGtnY20uYmxvYi5jb3JlLndpbmRvd3MubmV0L2RvY3VtZW50cy9NU0ZUX2Nsb3VkX2FyY2hpdGVjdHVyZV96YXZhLnBkZg2_text_sections_15",
      "blob_url": "https://lab511stlgkxxgi4tkgcm.blob.core.windows.net/documents/MSFT_cloud_architecture_zava.pdf",
      "snippet": "<table>\n<tr>\n<th>Level 1: Low business value</th>\n<th>Level 2: Medium business value</th>\n<th>Level 3: High business value</th>\n</tr>\n<tr>\n<td>Data is encrypted and available only to authenticated users</td>\n<td>Level 1 plus strong authentication and data loss protection</td>\n<td>Level 2 plus the highest levels of encryption, authentication, and auditing</td>\n</tr>\n<tr>\n<td>Provided for all data stored on premises and in cloud- based storage and workloads, such as Office 365. Data is encrypted while it resides in the service and in transit between the service and client devices.</

## Summary

You've now worked with blob knowledge sources and compared the two content extraction modes for document processing.

**Key concepts to remember:**
- `AzureBlobKnowledgeSource` automatically indexes documents from Azure Blob Storage
- **Minimal extraction** (Steps 3-7): Fast, basic text extraction suitable for simple documents
- **Standard extraction** (Steps 8-12): Uses Azure AI Services for advanced document understanding and better chunk quality
- Standard extraction is beneficial for complex documents with tables, images, or intricate layouts
- Both modes create searchable, vectorized chunks from your blob documents
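
The standard-mode reference in Step 12 keeps the classification table as HTML markup in its `snippet`, whereas the minimal-mode snippet in Step 7 is plain text. As a sketch of how that structure could be used downstream, the code below turns such a snippet into rows with the standard library's `html.parser`. The class and function names are hypothetical, and the sample is the header row from the Step 12 snippet with a closing `</table>` tag added so that it is complete.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <th>/<td> cells, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(snippet):
    parser = TableRows()
    parser.feed(snippet)
    parser.close()
    return parser.rows


snippet = (
    "<table>\n<tr>\n<th>Level 1: Low business value</th>\n"
    "<th>Level 2: Medium business value</th>\n"
    "<th>Level 3: High business value</th>\n</tr>\n</table>"
)
for row in table_rows(snippet):
    print(row)
```

Snippets returned by the service may be truncated, so real input can end mid-row. This sketch keeps only rows whose closing `</tr>` was seen.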

### What's Next?

➡️ Continue to [Part 6: Combined Knowledge Sources](part6-combined-knowledge-source.ipynb) to learn how to query search indexes, web URLs, SharePoint, and blob storage simultaneously in a single knowledge base.