# Indexed SharePoint Knowledge Source

Use **Indexed SharePoint Knowledge Source** to automatically create a complete indexer pipeline.

## 📋 Table of Contents

| Step | Description | Jump |
|------|-------------|------|
| 0️⃣ Environment Config | Configure Azure AI Search, SharePoint, Azure OpenAI | [View](#env-config) |
| 1️⃣ Create Knowledge Source | Auto-create Data Source + Skillset + Index + Indexer | [View](#create-ks) |
| 2️⃣ View KS Details | View created resources | [View](#ks-details) |
| 3️⃣ Check Indexer Status | Monitor indexing progress | [View](#indexer-status) |
| 4️⃣ View Index Content | Check indexed Chunks | [View](#index-content) |
| 5️⃣ Create Knowledge Base | Create a Knowledge Base that references the Knowledge Source | [View](#create-kb) |
| 6️⃣ Query Knowledge Base | Agentic Retrieval query | [View](#query-kb) |
| 🧹 Delete Resources | Clean up resources (optional) | [View](#cleanup) |

---

## Difference from Manual Indexer

| Method | Notebook | Description |
|--------|----------|-------------|
| **Manual Indexer** | `03e_sharepoint_indexer.ipynb` | Create the Data Source, Index, and Indexer by hand for full control |
| **Indexed SP KS** | `03f_indexed_sharepoint_ks.ipynb` (this file) | Auto-create the entire pipeline from a single Knowledge Source definition |

## Feature Support

| Feature | Configuration |
|---------|---------------|
| 📄 Text Extraction | ✅ Enabled by default |
| 🔢 Embedding | ✅ Configure `embedding_model` |
| 🖼️ Image Verbalization | ✅ Configure `chat_completion_model` + `disable_image_verbalization=False` |

## Permission Requirements

- An App Registration with the `Sites.Read.All` permission (Application type)
- A Global Admin must grant Admin Consent for that permission

---

<a id="env-config"></a>
## 0️⃣ Environment Configuration

In [None]:
%load_ext dotenv
%dotenv

import os
import requests
import json

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    IndexedSharePointKnowledgeSource,
    IndexedSharePointKnowledgeSourceParameters,
    KnowledgeSourceIngestionParameters,
    KnowledgeSourceContentExtractionMode,
    KnowledgeSourceAzureOpenAIVectorizer,
    KnowledgeBaseAzureOpenAIModel,
    AzureOpenAIVectorizerParameters
)

# Azure AI Search Configuration
search_endpoint = os.environ.get("AZURE_SEARCH_ENDPOINT")
search_api_key = os.environ.get("AZURE_SEARCH_API_KEY")

# SharePoint App Registration
sp_app_id = os.environ.get("SP_APP_ID")
sp_app_secret = os.environ.get("SP_APP_SECRET")
sp_tenant_id = os.environ.get("SP_TENANT_ID")

# Azure OpenAI Configuration
aoai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT", "https://your-openai-resource.openai.azure.com/")
aoai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
embedding_model = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL", "text-embedding-ada-002")
gpt_model = os.getenv("AZURE_OPENAI_DEPLOYMENT", "gpt-4o")

# Create SearchIndexClient
index_client = SearchIndexClient(
    endpoint=search_endpoint,
    credential=AzureKeyCredential(search_api_key)
)

# REST API Headers
headers = {
    "Content-Type": "application/json",
    "api-key": search_api_key
}

print(f"✅ Azure AI Search: {search_endpoint}")
print(f"✅ App ID: {sp_app_id}")
print("\n🔧 Azure OpenAI:")
print(f"   Endpoint: {aoai_endpoint}")
print(f"   Embedding: {embedding_model}")
print(f"   GPT Model: {gpt_model}")
print(f"   API Key: {'✅ Configured' if aoai_api_key else '❌ Not configured'}")
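
Every value in the cell above is read with `os.environ.get` or `os.getenv`, so a missing variable quietly becomes `None` and only surfaces later when a client or connection string is built. The following is a minimal sketch of an early check, assuming the variable names used above; the helper name `missing_env_vars` is hypothetical and not part of any SDK.

```python
import os

# Variable names taken from the configuration cell above.
REQUIRED_VARS = [
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_API_KEY",
    "SP_APP_ID",
    "SP_APP_SECRET",
    "SP_TENANT_ID",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Running this before the client is created makes a configuration gap visible immediately instead of as an authentication error several cells later.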

<a id="create-ks"></a>
## 1️⃣ Create Indexed SharePoint Knowledge Source

This step automatically creates the Data Source, Skillset, Index, and Indexer.

In [None]:
# ⚠️ Delete old Knowledge Source (if it exists)
ks_name = "sharepoint-indexed-ks-demo"  # Same name as the Knowledge Source created in the next cell

try:
    index_client.delete_knowledge_source(ks_name)
    print(f"✅ Deleted old Knowledge Source '{ks_name}'")
except Exception as e:
    print(f"ℹ️ Knowledge Source does not exist or already deleted: {e}")

In [None]:
# Knowledge Source name
ks_name = "sharepoint-indexed-ks-demo"

# SharePoint site
sharepoint_site = "https://your-tenant.sharepoint.com/sites/your-site"

# SharePoint Connection String
sharepoint_connection_string = (
    f"SharePointOnlineEndpoint={sharepoint_site};"
    f"ApplicationId={sp_app_id};"
    f"ApplicationSecret={sp_app_secret};"
    f"TenantId={sp_tenant_id}"
)

# Create Indexed SharePoint Knowledge Source
indexed_sp_ks = IndexedSharePointKnowledgeSource(
    name=ks_name,
    description="SharePoint - Indexed mode (with Embedding)",
    indexed_share_point_parameters=IndexedSharePointKnowledgeSourceParameters(
        # SharePoint connection string
        connection_string=sharepoint_connection_string,
        
        # Library to index: defaultSiteLibrary or allSiteLibraries
        container_name="defaultSiteLibrary",
        
        # Data ingestion parameters
        ingestion_parameters=KnowledgeSourceIngestionParameters(
            # Content extraction mode
            content_extraction_mode=KnowledgeSourceContentExtractionMode.MINIMAL,
            
            # ❌ Disable image verbalization (avoid GPT-4o rate limit)
            disable_image_verbalization=True,
            
            # No GPT-4o needed (image processing disabled)
            chat_completion_model=None,
            
            # ✅ Embedding model for vector search
            embedding_model=KnowledgeSourceAzureOpenAIVectorizer(
                azure_open_ai_parameters=AzureOpenAIVectorizerParameters(
                    resource_url=aoai_endpoint,
                    deployment_name=embedding_model,
                    api_key=aoai_api_key,
                    model_name="text-embedding-ada-002"
                )
            ) if aoai_api_key else None,
        )
    )
)

try:
    result = index_client.create_or_update_knowledge_source(knowledge_source=indexed_sp_ks)
    print(f"✅ Knowledge Source '{ks_name}' created successfully!")
    print("   Type: indexedSharePoint")
    
    print("\n🔧 Enabled features:")
    print("   - Image verbalization: ❌ Disabled")
    print(f"   - Embedding: {'✅' if aoai_api_key else '❌'}")
    
    print("\n📋 Auto-created resources:")
    if hasattr(result, 'indexed_share_point_parameters') and result.indexed_share_point_parameters:
        params = result.indexed_share_point_parameters
        if hasattr(params, 'created_resources') and params.created_resources:
            cr = params.created_resources
            print(f"   - Data Source: {cr.datasource}")
            print(f"   - Index: {cr.index}")
            print(f"   - Skillset: {cr.skillset}")
            print(f"   - Indexer: {cr.indexer}")
except Exception as e:
    print(f"❌ Creation failed: {e}")
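
The connection string in the cell above is a semicolon-separated list of `key=value` pairs, and it embeds the app secret in plain text. The sketch below shows one way to assemble it with a check for empty parts, plus a masked form that is safer to print while debugging. Both helper names are hypothetical, and the placeholder values are illustrative only.

```python
def build_sharepoint_connection_string(site_url, app_id, app_secret, tenant_id):
    """Assemble the connection string from its four parts, in the order used above."""
    parts = {
        "SharePointOnlineEndpoint": site_url,
        "ApplicationId": app_id,
        "ApplicationSecret": app_secret,
        "TenantId": tenant_id,
    }
    empty = [key for key, value in parts.items() if not value]
    if empty:
        raise ValueError(f"Missing connection string values: {', '.join(empty)}")
    return ";".join(f"{key}={value}" for key, value in parts.items())


def mask_secret(connection_string):
    """Return a printable copy with the ApplicationSecret value hidden."""
    masked = []
    for pair in connection_string.split(";"):
        key, _, _ = pair.partition("=")
        masked.append(f"{key}=***" if key == "ApplicationSecret" else pair)
    return ";".join(masked)


conn = build_sharepoint_connection_string(
    "https://your-tenant.sharepoint.com/sites/your-site",
    "app-id-placeholder",
    "secret-placeholder",
    "tenant-id-placeholder",
)
print(mask_secret(conn))
```

The masking assumes the secret itself contains no semicolon; a secret with one would need a different parsing approach.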

<a id="ks-details"></a>
## 2️⃣ View Knowledge Source Details

In [None]:
# View Knowledge Source details
ks_url = f"{search_endpoint}/knowledgesources/{ks_name}?api-version=2025-11-01-preview"

response = requests.get(ks_url, headers=headers)
if response.status_code == 200:
    ks_details = response.json()
    print("📋 Knowledge Source details:")
    print(json.dumps(ks_details, indent=2, ensure_ascii=False))
else:
    print(f"❌ Failed to get: {response.status_code}")
    print(response.text)

<a id="indexer-status"></a>
## 3️⃣ Check Indexer Status

In [None]:
# Indexer name = Knowledge Source name + "-indexer"
indexer_name = f"{ks_name}-indexer"
indexer_url = f"{search_endpoint}/indexers/{indexer_name}/status?api-version=2024-07-01"

response = requests.get(indexer_url, headers=headers)
if response.status_code == 200:
    status = response.json()
    print(f"🔄 Indexer '{indexer_name}' status:")
    print(f"   - Overall status: {status.get('status', 'N/A')}")
    
    if status.get('lastResult'):
        last = status['lastResult']
        print(f"   - Last run: {last.get('status', 'N/A')}")
        print(f"   - Start time: {last.get('startTime', 'N/A')}")
        print(f"   - End time: {last.get('endTime', 'N/A')}")
        print(f"   - Documents processed: {last.get('itemsProcessed', 0)}")
        print(f"   - Documents failed: {last.get('itemsFailed', 0)}")
        
        if last.get('errors'):
            print("\n⚠️ Errors:")
            for err in last['errors'][:3]:
                print(f"   - {err.get('message', 'Unknown')[:100]}")
    else:
        print("   - Not yet run")
else:
    print(f"❌ Failed to get: {response.status_code}")
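
Indexing runs asynchronously, so the status check above can report "Not yet run" or `inProgress` right after the Knowledge Source is created. The sketch below polls until the last run has finished. It takes a `fetch_status` callable rather than calling HTTP itself, so it can be exercised without a live service; the function name `wait_for_indexer` and its parameters are hypothetical.

```python
import time


def wait_for_indexer(fetch_status, timeout_s=600, poll_s=10, sleep=time.sleep):
    """Poll until the last run is no longer in progress, or the timeout expires.

    fetch_status: zero-argument callable returning the parsed status JSON (dict).
    Returns the final lastResult dict, or None if the timeout was reached.
    """
    waited = 0
    while waited <= timeout_s:
        last = fetch_status().get("lastResult")
        # No lastResult means the indexer has not started its first run yet.
        if last and last.get("status") != "inProgress":
            return last
        sleep(poll_s)
        waited += poll_s
    return None


# Possible wiring with the variables from the cell above:
# last = wait_for_indexer(lambda: requests.get(indexer_url, headers=headers).json())
# print(last["status"] if last else "Timed out")
```

Passing `sleep` as a parameter is a small design choice that keeps the loop testable with a no-op sleep.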

### Configure Indexer Schedule (Optional)

Configure the Indexer to run on a schedule so that it picks up changes in SharePoint automatically.

In [None]:
# Configure Indexer scheduled run
indexer_name = f"{ks_name}-indexer"
indexer_url = f"{search_endpoint}/indexers/{indexer_name}?api-version=2024-07-01"

# Get current Indexer configuration
response = requests.get(indexer_url, headers=headers)
if response.status_code != 200:
    print(f"❌ Failed to get Indexer: {response.status_code}")
else:
    indexer_config = response.json()
    
    # Remove @odata fields (not needed for update)
    indexer_config.pop("@odata.context", None)
    indexer_config.pop("@odata.etag", None)
    
    # Add schedule configuration
    # PT5M=5 minutes, PT30M=30 minutes, PT1H=1 hour, P1D=1 day
    indexer_config["schedule"] = {
        "interval": "PT1H",  # Run every 60 minutes
        "startTime": None     # Start immediately
    }
    
    # Update Indexer
    update_response = requests.put(indexer_url, headers=headers, json=indexer_config)
    
    if update_response.status_code in [200, 201, 204]:
        print("✅ Indexer schedule configured!")
        print("   - Run interval: every 1 hour")
        print(f"   - Indexer: {indexer_name}")
        print("\n💡 Common interval settings:")
        print("   - PT5M   = 5 minutes (minimum)")
        print("   - PT30M  = 30 minutes")
        print("   - PT1H   = 1 hour")
        print("   - PT6H   = 6 hours")
        print("   - P1D    = 1 day")
    else:
        print(f"❌ Configuration failed: {update_response.status_code}")
        print(update_response.text)
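
The `interval` value in the schedule above is an ISO 8601 duration string. As a small sketch, the helper below converts a number of minutes into that form and rejects values outside the range of 5 minutes (the minimum listed above) to 1 day (assumed here to be the upper limit). The name `schedule_interval` is hypothetical.

```python
def schedule_interval(minutes):
    """Convert a whole number of minutes into an ISO 8601 duration such as 'PT1H'."""
    if not 5 <= minutes <= 1440:
        raise ValueError("interval must be between 5 minutes and 1 day")
    days, rem = divmod(minutes, 1440)
    hours, mins = divmod(rem, 60)
    if days:
        return "P1D"  # Only reachable for exactly 1440 minutes
    text = "PT"
    if hours:
        text += f"{hours}H"
    if mins:
        text += f"{mins}M"
    return text


for m in (5, 30, 60, 360, 1440):
    print(m, schedule_interval(m))
```

The result can be assigned to `indexer_config["schedule"]["interval"]` in place of the hard-coded `"PT1H"`.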

<a id="index-content"></a>
## 4️⃣ View Index Content

In [None]:
# Index name = Knowledge Source name + "-index"
index_name = f"{ks_name}-index"

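search_query = {
    "search": "*",
    "top": 20,
    "count": True,  # Ask the service for the total number of matching chunks
    "select": "uid,snippet_parent_id,doc_url,snippet"
}

response = requests.post(
    f"{search_endpoint}/indexes/{index_name}/docs/search?api-version=2024-11-01-preview",
    headers=headers,
    json=search_query
)

if response.status_code == 200:
    results = response.json()
    docs = results.get("value", [])
    # "top" caps this page at 20 results, so report the service-side total, not len(docs)
    total = results.get("@odata.count", len(docs))
    print(f"📄 Chunks in index: {total} (showing up to 5)")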
    print("-" * 60)
    for i, doc in enumerate(docs[:5], 1):
        print(f"\n[{i}] {doc.get('doc_url', 'N/A')}")
        snippet = doc.get('snippet', '')[:200]
        print(f"    {snippet}...")
else:
    print(f"❌ Search failed: {response.status_code}")
    print(response.text)
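
Because each SharePoint file is split into several chunks, it can help to see how the returned chunks are distributed across source documents. The sketch below counts chunks per `doc_url`, assuming the field names from the `select` clause above; `chunks_per_document` is a hypothetical helper and the sample list stands in for the `docs` value returned by the search.

```python
from collections import Counter


def chunks_per_document(docs, key="doc_url"):
    """Count how many chunks in a result page belong to each source document."""
    return Counter(doc.get(key, "N/A") for doc in docs)


# Illustrative stand-in for the `docs` list from the search response.
sample_docs = [
    {"doc_url": "https://example.invalid/a.docx", "snippet": "first chunk"},
    {"doc_url": "https://example.invalid/a.docx", "snippet": "second chunk"},
    {"doc_url": "https://example.invalid/b.pdf", "snippet": "only chunk"},
]

for url, count in chunks_per_document(sample_docs).most_common():
    print(f"{count:3d}  {url}")
```

Note that the counts cover only the page of results passed in, not the whole index.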

### View Index Schema (All Available Fields)

In [None]:
# View all fields in the index (Schema)
index_name = f"{ks_name}-index"
index_url = f"{search_endpoint}/indexes/{index_name}?api-version=2024-11-01-preview"

response = requests.get(index_url, headers=headers)
if response.status_code == 200:
    index_schema = response.json()
    print(f"📋 Fields in index '{index_name}':")
    print("-" * 60)
    for field in index_schema.get("fields", []):
        field_type = field.get("type", "N/A")
        searchable = "🔍" if field.get("searchable") else ""
        filterable = "🔧" if field.get("filterable") else ""
        print(f"  {field['name']:30} {field_type:20} {searchable}{filterable}")
else:
    print(f"❌ Failed to get: {response.status_code}")

### View Full Document Metadata (From Index)

In [None]:
# View full document metadata (including all fields)
index_name = f"{ks_name}-index"

search_query = {
    "search": "*",
    "top": 5,
    "select": "*"  # Select all fields
}

response = requests.post(
    f"{search_endpoint}/indexes/{index_name}/docs/search?api-version=2024-11-01-preview",
    headers=headers,
    json=search_query
)

if response.status_code == 200:
    results = response.json()
    docs = results.get("value", [])
    print(f"📄 Full document metadata (first {len(docs)} docs):")
    print("=" * 80)
    
    for i, doc in enumerate(docs, 1):
        print(f"\n[{i}] Document details:")
        print("-" * 40)
        for key, value in doc.items():
            if key.startswith("@"):
                continue
            # Truncate long fields
            if isinstance(value, str) and len(value) > 100:
                value = value[:100] + "..."
            elif isinstance(value, list) and len(value) > 3:
                value = str(value[:3]) + f"... ({len(value)} items)"
            print(f"  {key}: {value}")
else:
    print(f"❌ Search failed: {response.status_code}")
    print(response.text)

### 📊 Document Processing Status Monitoring

Review how the AI Search Indexer processed each document, including any warnings and errors.

In [None]:
# View Indexer execution history (includes processing status of each document)
indexer_name = f"{ks_name}-indexer"
status_url = f"{search_endpoint}/indexers/{indexer_name}/status?api-version=2024-07-01"

response = requests.get(status_url, headers=headers)
if response.status_code == 200:
    status = response.json()
    
    print("📊 Indexer execution history:")
    print("=" * 80)
    
    # Current status
    print(f"\n🔄 Current status: {status.get('status', 'N/A')}")
    
    # Last execution
    if status.get('lastResult'):
        last = status['lastResult']
        print("\n📋 Last execution:")
        print(f"   Status: {last.get('status', 'N/A')}")
        print(f"   Start: {last.get('startTime', 'N/A')}")
        print(f"   End: {last.get('endTime', 'N/A')}")
        print(f"   ✅ Successfully processed: {last.get('itemsProcessed', 0)} documents")
        print(f"   ❌ Failed: {last.get('itemsFailed', 0)} documents")
        
        # Show warnings
        if last.get('warnings'):
            print(f"\n⚠️ Warnings ({len(last['warnings'])} total):")
            for warn in last['warnings'][:5]:
                doc_key = warn.get('key', 'Unknown')
                message = warn.get('message', '')[:150]
                print(f"   [{doc_key}] {message}")
        
        # Show errors
        if last.get('errors'):
            print(f"\n❌ Errors ({len(last['errors'])} total):")
            for err in last['errors'][:5]:
                doc_key = err.get('key', 'Unknown')
                message = err.get('message', '')[:150]
                print(f"   [{doc_key}] {message}")
    
    # Execution history
    if status.get('executionHistory'):
        print(f"\n📜 Execution history (last {len(status['executionHistory'])} runs):")
        print("-" * 60)
        for i, exec_info in enumerate(status['executionHistory'][:5], 1):
            exec_status = exec_info.get('status', 'N/A')
            start_time = exec_info.get('startTime', 'N/A')[:19] if exec_info.get('startTime') else 'N/A'
            items_processed = exec_info.get('itemsProcessed', 0)
            items_failed = exec_info.get('itemsFailed', 0)
            
            status_icon = "✅" if exec_status == "success" else "❌" if exec_status == "transientFailure" else "🔄"
            print(f"   {i}. [{status_icon} {exec_status}] {start_time} - Processed:{items_processed}, Failed:{items_failed}")
else:
    print(f"❌ Failed to get: {response.status_code}")
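
The per-run listing above can be condensed into a few totals when the history grows long. The sketch below aggregates an indexer status payload into run and item counts, using only the keys already read in the cell above (`executionHistory`, `status`, `itemsProcessed`, `itemsFailed`). The helper name `summarize_history` and the sample payload are illustrative.

```python
def summarize_history(status):
    """Aggregate run counts and item totals from an indexer status payload."""
    runs = status.get("executionHistory") or []
    summary = {"runs": len(runs), "succeeded": 0, "processed": 0, "failed_items": 0}
    for run in runs:
        if run.get("status") == "success":
            summary["succeeded"] += 1
        summary["processed"] += run.get("itemsProcessed", 0)
        summary["failed_items"] += run.get("itemsFailed", 0)
    return summary


# Illustrative payload with the same shape as the status response.
sample_status = {
    "executionHistory": [
        {"status": "success", "itemsProcessed": 12, "itemsFailed": 0},
        {"status": "transientFailure", "itemsProcessed": 4, "itemsFailed": 2},
    ]
}
print(summarize_history(sample_status))
```

With a live service, the `status` dict from the cell above can be passed in directly.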

### 🔄 Manually Trigger Indexer Run

If there are unindexed documents, you can manually trigger the Indexer to run again.

In [None]:
# Manually trigger Indexer run
indexer_name = f"{ks_name}-indexer"
run_url = f"{search_endpoint}/indexers/{indexer_name}/run?api-version=2024-07-01"

# ⚠️ Uncomment to trigger run
# response = requests.post(run_url, headers=headers)
# if response.status_code == 202:
#     print(f"✅ Indexer '{indexer_name}' triggered!")
#     print("   Running...please check status later")
# else:
#     print(f"❌ Trigger failed: {response.status_code}")
#     print(response.text)

print("💡 Uncomment the code above to manually trigger Indexer run")
print(f"   Indexer: {indexer_name}")

<a id="create-kb"></a>
## 5️⃣ Create Knowledge Base

In [None]:
from azure.search.documents.indexes.models import (
    KnowledgeBase,
    KnowledgeSourceReference,
    KnowledgeBaseAzureOpenAIModel,
    AzureOpenAIVectorizerParameters,
    KnowledgeRetrievalOutputMode,
    KnowledgeRetrievalLowReasoningEffort
)

kb_name = "sharepoint-index-demoed-kb"

# Azure OpenAI parameters
aoai_params = AzureOpenAIVectorizerParameters(
    resource_url=aoai_endpoint,
    deployment_name=gpt_model,
    api_key=aoai_api_key,
    model_name="gpt-4o"
)

# Create Knowledge Base
kb = KnowledgeBase(
    name=kb_name,
    description="SharePoint Knowledge Base - Indexed SharePoint Knowledge Source",
    
    knowledge_sources=[
        KnowledgeSourceReference(name=ks_name)
    ],
    
    retrieval_instructions="Use this knowledge source to answer questions about SharePoint documents.",
    answer_instructions="Provide accurate answers based on retrieved document content and cite sources.",
    
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS,
    
    models=[
        KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)
    ],
    
    retrieval_reasoning_effort=KnowledgeRetrievalLowReasoningEffort()
)

try:
    index_client.create_or_update_knowledge_base(knowledge_base=kb)
    print(f"✅ Knowledge Base '{kb_name}' created successfully!")
except Exception as e:
    print(f"❌ Creation failed: {e}")

<a id="query-kb"></a>
## 6️⃣ Query Knowledge Base

In [None]:
from azure.identity import DefaultAzureCredential
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import (
    KnowledgeBaseRetrievalRequest,
    KnowledgeBaseMessage,
    KnowledgeBaseMessageTextContent
)

# Create client
credential = DefaultAzureCredential()
kb_client = KnowledgeBaseRetrievalClient(
    endpoint=search_endpoint,
    knowledge_base_name=kb_name,
    credential=credential
)

# Query question
question = "What does this document talk about?"

request = KnowledgeBaseRetrievalRequest(
    include_activity=True,
    messages=[
        KnowledgeBaseMessage(
            role="user",
            content=[KnowledgeBaseMessageTextContent(text=question)]
        )
    ]
)

print(f"🔍 Query: {question}")
print("=" * 60)

result = kb_client.retrieve(retrieval_request=request)

print("\n📝 Answer:")
print("-" * 40)
for resp in result.response:
    for content in resp.content:
        print(content.text)

if result.references:
    print("\n📚 References:")
    for i, ref in enumerate(result.references, 1):
        ref_dict = ref.as_dict()
        print(f"  [{i}] {ref_dict.get('doc_url', 'N/A')}")

<a id="cleanup"></a>
## 🧹 Delete Resources (Optional)

In [None]:
# ⚠️ Deleting Knowledge Source will also delete all auto-created resources!

# Uncomment to execute deletion
# index_client.delete_knowledge_base(kb_name)
# print(f"✅ Knowledge Base '{kb_name}' deleted")

# index_client.delete_knowledge_source(ks_name)
# print(f"✅ Knowledge Source '{ks_name}' deleted")
# print("   Auto-created Data Source, Index, Skillset, Indexer also deleted")

print("💡 To delete resources, uncomment the code above and run")