# Part 5: Blob Knowledge Source

In Parts 1-4, you worked with pre-indexed data, SharePoint, and web sources. In Part 5, you'll upload a document to Azure Blob Storage and create knowledge sources that index it automatically. You'll also compare two indexing modes: **minimal** (basic content extraction) and **standard** (advanced content understanding with Azure AI Services).

## Step 1: Load Environment Variables

Run the cell below to load the configuration for your Azure resources. When prompted for a kernel, choose the **.venv(3.11.9)** environment that was created for you.

Notice the additional variables for blob storage, AI services, and embedding models, which are needed for document ingestion and vectorization. All these Azure resources are pre-configured in `.env` for you.

> **⚠️ Troubleshooting**
>
> If code cells get stuck and keep spinning, select **Restart** from the notebook toolbar at the top. If the issue persists after a couple of tries, close VS Code completely and reopen it.

In [11]:
import os

from azure.core.credentials import AzureKeyCredential
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

load_dotenv(override=True) # take environment variables from .env.

# Azure AI Search configuration
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"])

# Knowledge base name
knowledge_base_name = "upload-blob-knowledge-base-minimal"
standard_knowledge_base_name = "upload-blob-knowledge-base-standard"

# Azure OpenAI configuration (identity-based auth, no API key needed)
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_chatgpt_deployment = os.getenv("AZURE_OPENAI_CHATGPT_DEPLOYMENT", "gpt-4.1")
azure_openai_chatgpt_model_name = os.getenv("AZURE_OPENAI_CHATGPT_MODEL_NAME", "gpt-4.1")
azure_openai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
azure_openai_embedding_model_name = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL_NAME", "text-embedding-3-large")

# Blob configuration (identity-based auth - no connection string needed)
blob_storage_account_url = os.environ["BLOB_STORAGE_ACCOUNT_URL"]
blob_storage_resource_id = os.environ["BLOB_STORAGE_RESOURCE_ID"]
blob_container_name = os.environ["BLOB_CONTAINER_NAME"]
ai_services_endpoint = os.environ["AI_SERVICES_ENDPOINT"]

# Azure credential for blob operations
azure_credential = DefaultAzureCredential()

blob_path = "../data/ai-search-data/blobdata/MSFT_cloud_architecture_zava.pdf"

print("Environment variables loaded")

Environment variables loaded
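
The cell above reads several settings with `os.environ[...]`, which raises `KeyError` on the first missing name. As an optional pre-flight check, the sketch below lists every missing or empty setting at once. The helper name `find_missing_vars` is hypothetical; the variable names are the ones the cell above reads without a default.

```python
import os
from typing import Iterable, List, Mapping

# Names the cell above reads with os.environ[...] (no default value).
REQUIRED_VARS = (
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "BLOB_STORAGE_ACCOUNT_URL",
    "BLOB_STORAGE_RESOURCE_ID",
    "BLOB_CONTAINER_NAME",
    "AI_SERVICES_ENDPOINT",
)


def find_missing_vars(required: Iterable[str], env: Mapping[str, str]) -> List[str]:
    """Return the required names that are absent or blank in env."""
    return [name for name in required if not env.get(name, "").strip()]


missing = find_missing_vars(REQUIRED_VARS, os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present")
```

Run it after `load_dotenv(override=True)` so that values from `.env` are visible in `os.environ`.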


## Step 2: Upload Document to Blob Storage

Before creating a knowledge source, you need to upload a document to your blob storage. The code below uploads a PDF called `MSFT_cloud_architecture_zava.pdf`, which contains information about Zava's cloud architecture and how they classify data by sensitivity level. The upload is skipped if a blob with that name already exists in the container.

Once you create the blob knowledge source in the next step, it will automatically find this PDF in the storage and index it for querying.

In [2]:
from azure.storage.blob import BlobServiceClient

# Using DefaultAzureCredential for identity-based blob access (no connection string needed)
blob_service_client = BlobServiceClient(account_url=blob_storage_account_url, credential=azure_credential)
container_client = blob_service_client.get_container_client(blob_container_name)
blob_name = os.path.basename(blob_path)
blob_client = container_client.get_blob_client(blob_name)
if not blob_client.exists():
    with open(blob_path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)

print(f"Setup sample data in {blob_container_name}")

Setup sample data in documents
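
Because `blob_path` is a relative path, the upload fails with `FileNotFoundError` if the notebook runs from a different working directory. The sketch below is an optional local check before uploading; `describe_upload_candidate` is a hypothetical helper, not part of the Azure SDK, and it only inspects the local file.

```python
from pathlib import Path


def describe_upload_candidate(path: str) -> dict:
    """Return the blob name and size for a local file, or raise if it is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Local file not found: {file_path.resolve()}")
    # Path.name matches os.path.basename, which the cell above uses as the blob name.
    return {"blob_name": file_path.name, "size_bytes": file_path.stat().st_size}


try:
    info = describe_upload_candidate(
        "../data/ai-search-data/blobdata/MSFT_cloud_architecture_zava.pdf"
    )
    print(f"Ready to upload {info['blob_name']} ({info['size_bytes']} bytes)")
except FileNotFoundError as error:
    print(error)
```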


## Step 3: Create Blob Knowledge Source with Minimal Extraction

An **AzureBlobKnowledgeSource** automatically indexes documents from blob storage. Unlike the sources you've used before, this one ingests and processes the documents for you.

The code below creates a knowledge source with a `content_extraction_mode` of **minimal**. This mode chunks documents quickly without deep semantic understanding. An embedding model (`text-embedding-3-large`) is used to vectorize the chunks for vector search, but the chunking strategy itself is basic and fast.

> Minimal indexing is ideal when you need speed and have straightforward documents.

In [3]:
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import AzureBlobKnowledgeSource, AzureBlobKnowledgeSourceParameters, AzureOpenAIVectorizerParameters, KnowledgeSourceAzureOpenAIVectorizer, KnowledgeSourceContentExtractionMode, KnowledgeSourceIngestionParameters

index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

# Using identity-based auth for embedding model (no api_key needed)
embedding_model = KnowledgeSourceAzureOpenAIVectorizer(
    azure_open_ai_parameters=AzureOpenAIVectorizerParameters(
        resource_url=azure_openai_endpoint,
        deployment_name=azure_openai_embedding_deployment,
        model_name=azure_openai_embedding_model_name
    )
)

# Using resource ID so the search service uses its managed identity to access blob storage
knowledge_source = AzureBlobKnowledgeSource(
    name="upload-blob-knowledge-source-minimal",
    azure_blob_parameters=AzureBlobKnowledgeSourceParameters(
        connection_string=blob_storage_resource_id,
        container_name=blob_container_name,
        ingestion_parameters=KnowledgeSourceIngestionParameters(
            embedding_model=embedding_model,
            content_extraction_mode=KnowledgeSourceContentExtractionMode.MINIMAL
        )
    )
)

index_client.create_or_update_knowledge_source(knowledge_source=knowledge_source)
print(f"Knowledge source '{knowledge_source.name}' created or updated successfully.")

Knowledge source 'upload-blob-knowledge-source-minimal' created or updated successfully.


## Step 4: Check Knowledge Source Status

After creating a blob knowledge source, it needs time to process the documents. The code below checks whether indexing is complete, in progress, or failed.

When `itemsUpdatesProcessed` is 1, the single document has been indexed successfully and you can move to the next step. If it is not there yet, wait a moment and rerun the cell.

In [4]:
import json

status = index_client.get_knowledge_source_status(knowledge_source.name)

print(json.dumps(status.serialize(), indent=2))

{
  "synchronizationStatus": "active",
  "synchronizationInterval": "1d",
  "lastSynchronizationState": {
    "startTime": "2026-02-20T04:16:30.177Z",
    "endTime": "2026-02-20T04:16:35.569Z",
    "itemsUpdatesProcessed": 1,
    "itemsUpdatesFailed": 0,
    "itemsSkipped": 0
  },
  "statistics": {
    "totalSynchronization": 1,
    "averageSynchronizationDuration": "PT5.3922605S",
    "averageItemsProcessedPerSynchronization": 1
  }
}
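
Instead of rerunning the status cell by hand, you could poll until indexing finishes. The sketch below is one way to do that, assuming the status dictionary has the shape shown in this notebook's output (`lastSynchronizationState` with `itemsUpdatesProcessed` and `itemsUpdatesFailed`, and no `lastSynchronizationState` while the source is still being created). `wait_for_first_sync` and its parameters are hypothetical names, and the demo uses simulated responses rather than a live service.

```python
import time
from typing import Callable


def wait_for_first_sync(
    get_status: Callable[[], dict],
    expected_items: int = 1,
    timeout_s: float = 600.0,
    poll_interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll get_status until a synchronization has processed the expected items."""
    waited = 0.0
    while True:
        status = get_status()
        last_sync = status.get("lastSynchronizationState") or {}
        if last_sync.get("itemsUpdatesFailed", 0) > 0:
            raise RuntimeError(f"Indexing reported failures: {last_sync}")
        if last_sync.get("itemsUpdatesProcessed", 0) >= expected_items:
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"Indexing did not finish within {timeout_s} seconds")
        sleep(poll_interval_s)
        waited += poll_interval_s


# Simulated responses: still creating on the first call, finished on the second.
fake_responses = iter([
    {"synchronizationStatus": "creating"},
    {
        "synchronizationStatus": "active",
        "lastSynchronizationState": {"itemsUpdatesProcessed": 1, "itemsUpdatesFailed": 0},
    },
])
final = wait_for_first_sync(lambda: next(fake_responses), sleep=lambda seconds: None)
print(final["synchronizationStatus"])
```

In this notebook, `get_status` could be `lambda: index_client.get_knowledge_source_status(knowledge_source.name).serialize()`, which mirrors the call in the cell above.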


## Step 5: Create Knowledge Base

Now that the blob knowledge source has indexed the document, you can create a knowledge base to query it. The code below creates a knowledge base that uses the blob knowledge source you created earlier.

Notice that this knowledge base also sets `retrieval_reasoning_effort` to "low". Currently, the lowest available effort is "minimal" and the highest is "medium". The "low" effort still performs query decomposition, but it does not do iterative retrieval.

In [5]:
from azure.search.documents.indexes.models import AzureOpenAIVectorizerParameters, KnowledgeBase, KnowledgeBaseAzureOpenAIModel, KnowledgeRetrievalLowReasoningEffort, KnowledgeRetrievalOutputMode, KnowledgeSourceReference

# Using identity-based auth: the search service's system-assigned managed identity
# authenticates to Azure OpenAI (no api_key needed)
aoai_params = AzureOpenAIVectorizerParameters(
    resource_url=azure_openai_endpoint,
    deployment_name=azure_openai_chatgpt_deployment,
    model_name=azure_openai_chatgpt_model_name,
)

knowledge_base = KnowledgeBase(
    name=knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=knowledge_source.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS,
    retrieval_reasoning_effort=KnowledgeRetrievalLowReasoningEffort()
)

index_client.create_or_update_knowledge_base(knowledge_base)
print(f"Knowledge base '{knowledge_base_name}' created or updated successfully.")

Knowledge base 'upload-blob-knowledge-base-minimal' created or updated successfully.


## Step 6: Use Agentic Retrieval to Fetch Results from the Blob Knowledge Source

The code below queries the PDF document about Zava's data sensitivity classification levels. This demonstrates how agentic retrieval works with blob knowledge sources.

When you run this query, the knowledge base analyzes your question, decomposes it into focused subqueries, searches the blob-indexed content concurrently, uses semantic ranking to filter results, and synthesizes a grounded answer with citations pointing back to the PDF document.

In [6]:
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import AzureBlobKnowledgeSourceParams, KnowledgeBaseMessage, KnowledgeBaseMessageTextContent, KnowledgeBaseRetrievalRequest
from IPython.display import display, Markdown

knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=knowledge_base_name, credential=credential)

blob_ks_params = AzureBlobKnowledgeSourceParams(
    knowledge_source_name=knowledge_source.name,
    include_references=True,
    include_reference_source_data=True
)
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="What are the levels of Zava data sensitivity classification?")])
    ],
    knowledge_source_params=[
        blob_ks_params
    ],
    include_activity=True
)


result = knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))

Zava's data sensitivity classification consists of three levels:

- Level 1: Low business value. Examples include normal business communications (email) and files for administrative, sales, and support workers.
- Level 2: Medium business value. Examples include financial and legal information, as well as research and development data for new products.
- Level 3: High business value. Examples include customer and partner personally identifiable information, product engineering specifications, and proprietary manufacturing techniques [ref_id:0][ref_id:1].
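
The synthesized answer marks its sources with inline tags such as `[ref_id:0]`. A minimal sketch for pulling those ids out of the answer text follows, so you can match them against the `id` field of each entry in `result.references`. The helper name `extract_ref_ids` is hypothetical, and the pattern assumes the tag format shown in the answer above.

```python
import re
from typing import Dict, List

# Matches citation tags of the form [ref_id:<digits>].
CITATION_PATTERN = re.compile(r"\[ref_id:(\d+)\]")


def extract_ref_ids(answer_text: str) -> List[str]:
    """Return the distinct ref_id values cited in the answer, in first-seen order."""
    seen: Dict[str, None] = {}
    for match in CITATION_PATTERN.finditer(answer_text):
        seen.setdefault(match.group(1), None)
    return list(seen)


sample_answer = "Level 3: High business value [ref_id:0][ref_id:1]."
print(extract_ref_ids(sample_answer))  # ['0', '1']
```

The ids are kept as strings because the reference entries in this notebook report `"id"` as a string.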

## Step 7: Review Response, References, and Activity

The two cells below show the citations and activity log from the blob knowledge source query.

The references reveal which chunks from the PDF were used to answer your question. 

The activity log shows how the knowledge base processed your query and retrieved information from the blob-indexed content.

In [7]:
import json

references = json.dumps([ref.as_dict() for ref in result.references], indent=2)
print(references)

[
  {
    "type": "azureBlob",
    "id": "0",
    "activity_source": 1,
    "source_data": {
      "uid": "9ae203ceaecc_aHR0cHM6Ly9zdXJlcHN0b3JlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9kb2N1bWVudHMvTVNGVF9jbG91ZF9hcmNoaXRlY3R1cmVfemF2YS5wZGY1_pages_17",
      "blob_url": "https://surepstore.blob.core.windows.net/documents/MSFT_cloud_architecture_zava.pdf",
      "snippet": "No data sent across the Internet is in plain text form. Always use HTTPS connections, IPsec, or other end -to-end data \n\nencryption methods. \n\nEncryption for data at rest in \n\nthe cloud \n\n \n\nAll data stored on disks or elsewhere in the cloud must be in an encrypted form. \n\nACLs for least privilege \n\naccess \n\n \n\nAccount permissions to access resources in the cloud and what they are allowed to do must follow least-privilege guidelines. \n\n \n\nZava s data sensitivity classification \nUsing the information in Microsoft s Data Classification Toolkit, Zava performed an analysis of their data and determined the f

In [8]:
import pandas as pd

activity_types = [{"type": a.type} for a in result.activity]

df = pd.DataFrame(activity_types)

print("Activity Log Steps")
df

Activity Log Steps

| | type |
|---|---|
| 0 | modelQueryPlanning |
| 1 | azureBlob |
| 2 | azureBlob |
| 3 | azureBlob |
| 4 | agenticReasoning |
| 5 | modelAnswerSynthesis |
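
Each activity record also reports how long its step took, so you can total the time per step type. The sketch below assumes records shaped like the dictionaries returned by `a.as_dict()`, with a `type` key and an optional `elapsed_ms` key; `summarize_activity` is a hypothetical helper. The sample values are copied from the first four records of the Activity Details output in this step.

```python
from collections import defaultdict
from typing import Dict, List


def summarize_activity(activity: List[dict]) -> Dict[str, Dict[str, int]]:
    """Group activity records by type and total their step count and elapsed_ms."""
    summary: Dict[str, Dict[str, int]] = defaultdict(lambda: {"steps": 0, "elapsed_ms": 0})
    for record in activity:
        entry = summary[record["type"]]
        entry["steps"] += 1
        # Records without an elapsed_ms value contribute zero time.
        entry["elapsed_ms"] += record.get("elapsed_ms", 0)
    return dict(summary)


sample_activity = [
    {"id": 0, "type": "modelQueryPlanning", "elapsed_ms": 2586},
    {"id": 1, "type": "azureBlob", "elapsed_ms": 310},
    {"id": 2, "type": "azureBlob", "elapsed_ms": 165},
    {"id": 3, "type": "azureBlob", "elapsed_ms": 156},
]
for activity_type, totals in summarize_activity(sample_activity).items():
    print(activity_type, totals)
```

In the notebook, you could pass `[a.as_dict() for a in result.activity]` instead of the sample list.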


In [9]:
activity_content = json.dumps([a.as_dict() for a in result.activity], indent=2)
print("Activity Details")
print(activity_content)

Activity Details
[
  {
    "id": 0,
    "type": "modelQueryPlanning",
    "elapsed_ms": 2586,
    "input_tokens": 1456,
    "output_tokens": 63
  },
  {
    "id": 1,
    "type": "azureBlob",
    "elapsed_ms": 310,
    "knowledge_source_name": "upload-blob-knowledge-source-minimal",
    "query_time": "2026-02-20T04:16:48.974Z",
    "count": 5,
    "azure_blob_arguments": {
      "search": "Zava data sensitivity classification levels"
    }
  },
  {
    "id": 2,
    "type": "azureBlob",
    "elapsed_ms": 165,
    "knowledge_source_name": "upload-blob-knowledge-source-minimal",
    "query_time": "2026-02-20T04:16:49.140Z",
    "count": 7,
    "azure_blob_arguments": {
      "search": "Zava data classification scheme"
    }
  },
  {
    "id": 3,
    "type": "azureBlob",
    "elapsed_ms": 156,
    "knowledge_source_name": "upload-blob-knowledge-source-minimal",
    "query_time": "2026-02-20T04:16:49.297Z",
    "count": 6,
    "azure_blob_arguments": {
      "search": "Zava information sensi

## Step 8: Use Standard Extraction Mode with Content Understanding

In the previous steps, you created a blob knowledge source with minimal extraction mode. Now, you'll create another blob knowledge source using the **standard** extraction mode, which leverages Azure AI Services for deeper content understanding. This mode provides advanced chunking strategies, semantic extraction, and better handling of complex documents.

The code below sets `content_extraction_mode` to `KnowledgeSourceContentExtractionMode.STANDARD` and passes an `ai_services` parameter that connects Azure AI Services for enhanced processing.

> Standard extraction takes longer but produces higher-quality chunks that preserve document structure and relationships.

In [12]:

from azure.search.documents.indexes.models import AIServices, KnowledgeSourceContentExtractionMode

# Using resource ID for identity-based blob access, same as minimal extraction
# ai_services uses identity-based auth (api_key=None) - search service MI has "Cognitive Services User" role
standard_knowledge_source = AzureBlobKnowledgeSource(
    name="upload-blob-knowledge-source-standard",
    azure_blob_parameters=AzureBlobKnowledgeSourceParameters(
        connection_string=blob_storage_resource_id,
        container_name=blob_container_name,
        ingestion_parameters=KnowledgeSourceIngestionParameters(
            embedding_model=embedding_model,
            ai_services=AIServices(uri=ai_services_endpoint),
            content_extraction_mode=KnowledgeSourceContentExtractionMode.STANDARD
        )
    )
)

index_client.create_or_update_knowledge_source(knowledge_source=standard_knowledge_source)
print(f"Knowledge source '{standard_knowledge_source.name}' created or updated successfully.")

Knowledge source 'upload-blob-knowledge-source-standard' created or updated successfully.


## Step 9: Check Standard Extraction Status

Run the cell below to monitor the standard extraction progress. This mode uses Azure AI Services to analyze document structure, recognize tables, and perform intelligent chunking, which takes more time than the minimal extraction mode you used earlier.

When `itemsUpdatesProcessed` is 1, the single document has been indexed successfully and you can move to the next step. If the output shows only a `synchronizationStatus` of `creating`, indexing is still in progress; wait a moment and rerun the cell.

In [13]:
import json

status = index_client.get_knowledge_source_status(standard_knowledge_source.name)

print(json.dumps(status.serialize(), indent=2))

{
  "synchronizationStatus": "creating"
}


## Step 10: Create Knowledge Base for Standard Extraction

You'll now create a knowledge base that uses the standard extraction blob knowledge source. This knowledge base will benefit from the enhanced document processing and improved chunk quality.

Run the cell below to create the knowledge base with the standard extraction source.

In [14]:
from azure.search.documents.indexes.models import KnowledgeBase, KnowledgeBaseAzureOpenAIModel, KnowledgeRetrievalOutputMode, KnowledgeSourceReference

standard_knowledge_base = KnowledgeBase(
    name=standard_knowledge_base_name,
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=standard_knowledge_source.name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS
)

index_client.create_or_update_knowledge_base(standard_knowledge_base)
print(f"Knowledge base '{standard_knowledge_base_name}' created or updated successfully.")

Knowledge base 'upload-blob-knowledge-base-standard' created or updated successfully.


## Step 11: Query Standard Extraction Knowledge Base

Ask the same question about Zava's data sensitivity classification levels, worded slightly differently, but this time against the standard extraction knowledge base.

Compare this response with the one from Step 6. You may notice differences in answer quality, completeness, or organization due to the improved document processing.

In [16]:
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import AzureBlobKnowledgeSourceParams, KnowledgeBaseMessage, KnowledgeBaseMessageTextContent, KnowledgeBaseRetrievalRequest

standard_knowledge_base_client = KnowledgeBaseRetrievalClient(endpoint=endpoint, knowledge_base_name=standard_knowledge_base_name, credential=credential)

blob_ks_params = AzureBlobKnowledgeSourceParams(
    knowledge_source_name=standard_knowledge_source.name,
    include_references=True,
    include_reference_source_data=True
)
req = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(role="user", content=[KnowledgeBaseMessageTextContent(text="What are the levels of data sensitivity classification for Zava?")])
    ],
    knowledge_source_params=[
        blob_ks_params
    ],
    include_activity=True
)


result = standard_knowledge_base_client.retrieve(retrieval_request=req)
display(Markdown(result.response[0].content[0].text))

Zava classifies data sensitivity into three levels:

- Level 1: Low business value. This includes data that is encrypted and available only to authenticated users, such as normal business communications (email) and files for administrative, sales, and support workers.
- Level 2: Medium business value. This level adds strong authentication (such as multi-factor authentication with SMS) and data loss protection. Examples include financial and legal information, as well as research and development data for new products.
- Level 3: High business value. This level includes the highest levels of encryption, authentication (multi-factor with smart cards), and auditing. Examples are customer and partner personally identifiable information, product engineering specifications, and proprietary manufacturing techniques [ref_id:0][ref_id:3].

## Step 12: Compare Extraction Results

The cell below shows citations from the standard extraction query.

Compare these references with those from Step 7 to see how different extraction modes affect chunk creation and information retrieval from the same PDF document.
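
In standard mode, a reference snippet can keep a table as HTML markup (`<table>`, `<tr>`, `<th>`, `<td>`), as the reference output in this step shows. If you want the cell text as plain rows, one option is Python's built-in `html.parser`. The sketch below is an illustration under that assumption; `TableCellParser` and `parse_table_rows` are hypothetical names, and the sample string is a shortened version of the snippet that contains only its header row.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableCellParser(HTMLParser):
    """Collect the text of each <th>/<td> cell, one list per <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell_parts: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell_parts = []
        elif tag == "br" and self._cell_parts is not None:
            # Treat a line break inside a cell as a word separator.
            self._cell_parts.append(" ")

    def handle_data(self, data):
        if self._cell_parts is not None:
            self._cell_parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell_parts is not None:
            # Collapse runs of whitespace so each cell becomes one clean line.
            text = " ".join("".join(self._cell_parts).split())
            if self.rows:
                self.rows[-1].append(text)
            self._cell_parts = None


def parse_table_rows(html_text: str) -> List[List[str]]:
    """Return the table in html_text as a list of rows of cell text."""
    parser = TableCellParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


sample_snippet = (
    "<table>\n<tr>\n"
    "<th>Level 1: Low business value</th>\n"
    "<th>Level 2: Medium business value</th>\n"
    "<th>Level 3: High business value</th>\n"
    "</tr>\n</table>"
)
print(parse_table_rows(sample_snippet)[0])
```

Minimal-mode snippets in this notebook are plain text, so this step only applies to snippets that actually contain table markup.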

In [17]:
import json

references = json.dumps([ref.as_dict() for ref in result.references], indent=2)
print(references)

[
  {
    "type": "azureBlob",
    "id": "0",
    "activity_source": 1,
    "source_data": {
      "uid": "c72f9f27d47d_aHR0cHM6Ly9zdXJlcHN0b3JlLmJsb2IuY29yZS53aW5kb3dzLm5ldC9kb2N1bWVudHMvTVNGVF9jbG91ZF9hcmNoaXRlY3R1cmVfemF2YS5wZGY1_text_sections_15",
      "blob_url": "https://surepstore.blob.core.windows.net/documents/MSFT_cloud_architecture_zava.pdf",
      "snippet": "<table>\n<tr>\n<th>Level 1: Low business value</th>\n<th>Level 2: Medium business value</th>\n<th>Level 3: High business value</th>\n</tr>\n<tr>\n<td>Data is encrypted and available only to authenticated users<br>Provided for all data stored on premises and in cloud- based storage and workloads, such as Office 365. Data is encrypted while it resides in the service and in transit between the service and client devices.<br>Examples of Level 1 data are normal business communications (email) and files for administrative, sales, and support workers.</td>\n<td>Level 1 plus strong authentication and data loss protection<br>S

## Summary

You've now worked with blob knowledge sources and compared the two content extraction modes for document processing.

**Key concepts to remember:**
- `AzureBlobKnowledgeSource` automatically indexes documents from Azure Blob Storage
- **Minimal extraction**: Fast, basic text extraction suitable for simple documents
- **Standard extraction**: Uses Azure AI Services for advanced document understanding and better chunk quality
- Standard extraction is beneficial for complex documents with tables, images, or intricate layouts
- Both modes create searchable, vectorized chunks from your blob documents

### What's Next?

➡️ Continue to [Part 6: Combined Knowledge Sources](part6-combined-knowledge-source.ipynb) to learn how to query search indexes, web URLs, SharePoint, and blob storage simultaneously in a single knowledge base.