## Step 0: Load Environment Variables

First, load your Azure resource settings from `.env`. If you ran the SharePoint setup script, your app registration, permissions, and connection string are already populated here.

> Use the `.venv(3.11.9)` kernel when prompted.

In [10]:
import os
from azure.core.credentials import AzureKeyCredential
from dotenv import load_dotenv

load_dotenv(override=True)

# Azure AI Search configuration
endpoint = os.environ["AZURE_SEARCH_SERVICE_ENDPOINT"]
credential = AzureKeyCredential(os.environ["AZURE_SEARCH_ADMIN_KEY"])

# Azure OpenAI configuration
azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.environ["AZURE_OPENAI_KEY"]
azure_openai_embedding_deployment = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT", "text-embedding-3-large")
azure_openai_embedding_model_name = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL_NAME", "text-embedding-3-large")
azure_openai_chatgpt_deployment = os.getenv("AZURE_OPENAI_CHATGPT_DEPLOYMENT", "gpt-4.1")
azure_openai_chatgpt_model_name = os.getenv("AZURE_OPENAI_CHATGPT_MODEL_NAME", "gpt-4.1")

# Optional SharePoint config pre-loaded by setup scripts
sharepoint_connection_string = os.getenv("SHAREPOINT_CONNECTION_STRING")
sharepoint_site_url = os.getenv("SHAREPOINT_SITE_URL")
application_id = None
tenant_id = os.getenv("AZURE_TENANT_ID") or os.getenv("TENANT_ID")
managed_identity_object_id = os.getenv("AZURE_SEARCH_MANAGED_IDENTITY_OBJECT_ID")
application_secret = None
use_federated_creds = False
use_managed_identity = bool(managed_identity_object_id)

def _parse_connection_string(cs: str) -> dict:
    parts = cs.split(";") if cs else []
    parsed = {}
    for part in parts:
        if "=" in part:
            key, value = part.split("=", 1)
            parsed[key] = value
    return parsed

if sharepoint_connection_string:
    parsed_cs = _parse_connection_string(sharepoint_connection_string)
    application_id = parsed_cs.get("ApplicationId")
    tenant_id = tenant_id or parsed_cs.get("TenantId")
    managed_identity_object_id = parsed_cs.get("FederatedCredentialObjectId") or managed_identity_object_id
    application_secret = parsed_cs.get("ApplicationSecret")
    sharepoint_site_url = sharepoint_site_url or parsed_cs.get("SharePointOnlineEndpoint")
    use_federated_creds = bool(parsed_cs.get("FederatedCredentialObjectId"))
    print("‚úì Loaded SharePoint connection string from environment (.env)")
    print("  You can skip manual app registration (Step 2) unless you need to change it.")

print("‚úì Environment variables loaded")
print(f"  Azure AI Search: {endpoint}")
print(f"  Azure OpenAI: {azure_openai_endpoint}")

‚úì Loaded SharePoint connection string from environment (.env)
  You can skip manual app registration (Step 2) unless you need to change it.
‚úì Environment variables loaded
  Azure AI Search: https://lab511-search-lgkxxgi4tkgcm.search.windows.net
  Azure OpenAI: https://lab511-openai-lgkxxgi4tkgcm.openai.azure.com/


## Step 1: Verify Azure AI Search Configuration

IndexedSharePointKnowledgeSource requires:
- **Semantic ranker** enabled (required for agentic retrieval)
- **System-assigned managed identity** (optional but recommended for tenant detection)

### 1.1 Check if Semantic Ranker is Enabled

Run this cell to verify semantic ranker status:

In [11]:
from azure.search.documents.indexes import SearchIndexClient

index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

# Get service statistics to check semantic ranker
# Note: Semantic ranker configuration is typically done through Azure portal
print("üìã Checking Azure AI Search service configuration...")
print()
print("Please verify in Azure Portal:")
print("1. Navigate to your Azure AI Search service")
print("2. Go to 'Settings' > 'Semantic ranker'")
print("3. Ensure it's set to 'Free' or 'Standard'")
print()
print("If not enabled, enable it now before continuing.")

üìã Checking Azure AI Search service configuration...

Please verify in Azure Portal:
1. Navigate to your Azure AI Search service
2. Go to 'Settings' > 'Semantic ranker'
3. Ensure it's set to 'Free' or 'Standard'

If not enabled, enable it now before continuing.


### 1.2 Enable System-Assigned Managed Identity (Optional)

If your SharePoint site is in the **same tenant** as Azure AI Search, enabling managed identity allows automatic tenant detection.

**To enable managed identity:**

1. Navigate to your Azure AI Search service in Azure Portal
2. Select **Settings** > **Identity**
3. Under **System assigned**, toggle **Status** to **On**
4. Click **Save**
5. Copy the **Object (principal) ID** that appears

‚ö†Ô∏è **Skip this if:**
- SharePoint is in a different tenant (you'll use `TenantId` in connection string instead)
- You prefer to explicitly specify tenant ID

Run this cell to document whether you enabled it:

In [None]:
# Document your managed identity configuration
if managed_identity_object_id:
    use_managed_identity = True
    print(f"Found managed identity object ID from environment: {managed_identity_object_id}")
else:
    use_managed_identity = input("Did you enable system-assigned managed identity? (yes/no): ").lower() == 'yes'
    if use_managed_identity:
        managed_identity_object_id = input("Enter the Object (principal) ID: ").strip()
        print(f"‚úì Using managed identity: {managed_identity_object_id}")
    else:
        managed_identity_object_id = None
        print("‚úì Will use explicit TenantId in connection string")

‚úì Using managed identity: 


## Step 2: Create Microsoft Entra App Registration

This is the most complex step. You'll create an app registration that allows Azure AI Search to access SharePoint.

**Skip this step if you already ran the SharePoint setup script** (`infra/deploy-yourself/setup-environment.sh` or `setup-sharepoint.sh`). In that case the app, permissions, federated credential, and connection string are already in `.env`.

### 2.1 Register the Application

1. Navigate to [Azure Portal](https://portal.azure.com)
2. Go to **Microsoft Entra ID** (formerly Azure Active Directory)
3. Select **App registrations** > **+ New registration**
4. Enter:
   - **Name**: `AzureAISearch-SharePoint-Indexer`
   - **Supported account types**: **Single tenant**
   - **Redirect URI**: Leave blank
5. Click **Register**
6. **Copy** the following values from the Overview page:
   - **Application (client) ID**
   - **Directory (tenant) ID**

Run this cell to store these values:

In [None]:
# Store your app registration details
if application_id and tenant_id:
    print("Found app registration in environment (.env / connection string):")
    print(f"  Application ID: {application_id}")
    print(f"  Tenant ID: {tenant_id}")
    reuse_app = input("Use these values? (yes/no): ").lower() == "yes"
else:
    reuse_app = False

if not reuse_app:
    print("Enter your Microsoft Entra App Registration details:")
    print()
    application_id = input("Application (client) ID: ").strip()
    tenant_id = input("Directory (tenant) ID: ").strip()
    print()
    print("‚úì App registration details captured")
    print(f"  Application ID: {application_id}")
    print(f"  Tenant ID: {tenant_id}")
else:
    print("‚úì Using app registration from environment")

### 2.2 Configure API Permissions

The indexer needs permissions to read SharePoint content. We recommend **Application permissions** (not delegated).

**In the Azure Portal (same app registration):**

1. Select **API permissions** > **+ Add a permission**
2. Select **Microsoft Graph**
3. Select **Application permissions** (not Delegated)
4. Add these permissions:
   - **Files.Read.All** (required)
   - **Sites.Read.All** (standard indexing)
   
   OR for ACL support:
   - **Files.Read.All** (required)
   - **Sites.FullControl.All** (for ACL sync)
   
   OR for specific sites only:
   - **Files.Read.All** (required)
   - **Sites.Selected** (then grant full control to specific sites)

5. Click **Add permissions**
6. ‚ö†Ô∏è **CRITICAL**: Click **Grant admin consent for [Your Tenant]**
   - A tenant admin must approve this
   - Without this, indexing will fail

Run this cell to document your permission choice:

In [None]:
print("Which permissions did you configure?")
print("1. Files.Read.All + Sites.Read.All (standard)")
print("2. Files.Read.All + Sites.FullControl.All (with ACL sync)")
print("3. Files.Read.All + Sites.Selected (specific sites)")

permission_choice = input("Enter 1, 2, or 3: ").strip()

use_acl_sync = permission_choice == '2'
use_specific_sites = permission_choice == '3'

admin_consent_granted = input("Did a tenant admin grant consent? (yes/no): ").lower() == 'yes'

if not admin_consent_granted:
    print("\n‚ö†Ô∏è  WARNING: Admin consent is REQUIRED!")
    print("   Indexing will fail without it. Get admin approval before continuing.")
else:
    print("\n‚úì Permissions configured and admin consent granted")

### 2.3 Choose Authentication Method

You have two options:

**Option A: Client Secret** (simpler, but secrets expire)
- In app registration, go to **Certificates & secrets**
- Click **+ New client secret**
- Add description and expiration period
- **Copy the secret VALUE** (not the ID) - you can't see it again!

**Option B: Federated Credentials** (more secure, uses managed identity)
- Requires system-assigned managed identity enabled (Step 1.2)
- In app registration, go to **Certificates & secrets**
- Under **Federated credentials**, click **+ Add credential**
- Select **Managed Identity**
- Choose your search service's managed identity
- Add name and save

Run this cell to configure authentication:

In [None]:
# Choose authentication method
if use_federated_creds or application_secret:
    # Already determined from environment/connection string
    auth_method = "Federated Credentials" if use_federated_creds else "Client Secret"
    print(f"Using authentication from environment: {auth_method}")
else:
    print("Choose authentication method:")
    print("1. Client Secret (simpler)")
    print("2. Federated Credentials (more secure, requires managed identity)")
    
    auth_choice = input("Enter 1 or 2: ").strip()
    
    if auth_choice == '1':
        application_secret = input("Paste your client secret VALUE: ").strip()
        use_federated_creds = False
        print("‚úì Using client secret authentication")
    elif auth_choice == '2':
        if not use_managed_identity:
            print("‚ö†Ô∏è  ERROR: Federated credentials require managed identity!")
            print("   Go back to Step 1.2 or choose client secret instead.")
            raise ValueError("Managed identity required for federated credentials")
        application_secret = None
        use_federated_creds = True
        print("‚úì Using federated credentials authentication")
    else:
        raise ValueError("Invalid choice. Enter 1 or 2.")

### 2.4 Configure Public Client Flows

**In the Azure Portal (same app registration):**

1. Select **Authentication**
2. Scroll to **Advanced settings** > **Allow public client flows**
3. Set to **Yes**
4. Click **Save**
5. Click **+ Add a platform**
6. Select **Mobile and desktop applications**
7. Check the box for: `https://login.microsoftonline.com/common/oauth2/nativeclient`
8. Click **Configure**

Run this cell to confirm:

In [None]:
configured_auth_settings = input("Did you configure authentication settings? (yes/no): ").lower() == 'yes'

if configured_auth_settings:
    print("‚úì Authentication settings configured")
else:
    print("‚ö†Ô∏è  Complete authentication configuration before continuing")

## Step 3: Configure SharePoint Connection String

Now let's build the connection string based on your configuration.

### 3.1 Get SharePoint Site URL

You need the full URL to your SharePoint site's document library.

**To get the URL:**
1. Navigate to your SharePoint site in a browser
2. Open the document library you want to index
3. Copy the URL from the address bar
4. Example: `https://mycompany.sharepoint.com/sites/MyTeamSite`

Run this cell to enter your SharePoint details:

In [13]:
print("Enter your SharePoint site details:")
print()
if sharepoint_site_url:
    print(f"Using SharePoint site URL from environment: {sharepoint_site_url}")
else:
    sharepoint_site_url = input("SharePoint site URL: ").strip()
    # Validate URL format
    if not sharepoint_site_url.startswith("https://") or "sharepoint.com" not in sharepoint_site_url:
        print("‚ö†Ô∏è  Warning: URL should be in format: https://[tenant].sharepoint.com/sites/[site]")

print()
print(f"‚úì SharePoint site: {sharepoint_site_url}")

Enter your SharePoint site details:

Using SharePoint site URL from environment: https://mngenvmcap338326.sharepoint.com/sites/lab511-demo

‚úì SharePoint site: https://mngenvmcap338326.sharepoint.com/sites/lab511-demo


### 3.2 Build Connection String

Based on your authentication choice, we'll build the appropriate connection string format:

In [14]:
# Build connection string based on authentication method
reuse_connection_string = False

if sharepoint_connection_string:
    print("Using SharePoint connection string from environment (.env)")
    reuse_connection_string = True
    parsed_cs = _parse_connection_string(sharepoint_connection_string)
    sharepoint_site_url = parsed_cs.get("SharePointOnlineEndpoint", sharepoint_site_url)
    application_id = parsed_cs.get("ApplicationId", application_id)
    tenant_id = parsed_cs.get("TenantId", tenant_id)
    managed_identity_object_id = parsed_cs.get("FederatedCredentialObjectId", managed_identity_object_id)
    application_secret = parsed_cs.get("ApplicationSecret", application_secret)
    use_federated_creds = bool(parsed_cs.get("FederatedCredentialObjectId"))

if not reuse_connection_string:
    if use_federated_creds:
        # Federated credentials (secretless) format
        sharepoint_connection_string = (
            f"SharePointOnlineEndpoint={sharepoint_site_url};"
            f"ApplicationId={application_id};"
            f"FederatedCredentialObjectId={managed_identity_object_id};"
            f"TenantId={tenant_id}"
        )
        auth_method = "Federated Credentials (Secretless)"
    else:
        # Client secret format
        sharepoint_connection_string = (
            f"SharePointOnlineEndpoint={sharepoint_site_url};"
            f"ApplicationId={application_id};"
            f"ApplicationSecret={application_secret};"
            f"TenantId={tenant_id}"
        )
        auth_method = "Client Secret"
else:
    auth_method = "Federated Credentials (Secretless)" if use_federated_creds else "Client Secret"

print("="*60)
print("SharePoint Connection String Configuration")
print("="*60)
print(f"Authentication Method: {auth_method}")
print(f"SharePoint Site: {sharepoint_site_url}")
print(f"Application ID: {application_id}")
print(f"Tenant ID: {tenant_id}")
print()
print("‚úì Connection string ready (contains sensitive data - not displayed)")
print("="*60)

Using SharePoint connection string from environment (.env)
SharePoint Connection String Configuration
Authentication Method: Federated Credentials (Secretless)
SharePoint Site: https://mngenvmcap338326.sharepoint.com/sites/lab511-demo
Application ID: 1b7e0a7c-aad7-4588-aecd-b395f09b6305
Tenant ID: 9dce4dc6-16c7-48c4-9f57-52897cc5a893

‚úì Connection string ready (contains sensitive data - not displayed)


### 3.3 Add to Environment (Optional)

You can optionally save this to your `.env` file for reuse:

In [15]:
save_to_env = input("Save SharePoint connection string to .env? (yes/no): ").lower() == 'yes'

if save_to_env:
    from pathlib import Path
    
    # Get repository root (2 levels up from notebooks folder)
    repo_root = Path.cwd().parent if Path.cwd().name == 'notebooks' else Path.cwd()
    env_path = repo_root / '.env'
    
    # Append to .env file
    with open(env_path, 'a') as f:
        f.write(f"\n# SharePoint Configuration (Part 9)\n")
        f.write(f"SHAREPOINT_CONNECTION_STRING={sharepoint_connection_string}\n")
        f.write(f"SHAREPOINT_SITE_URL={sharepoint_site_url}\n")
    
    print(f"‚úì Saved to {env_path}")
    print("  ‚ö†Ô∏è  SECURITY: Never commit this file to source control!")
else:
    print("‚úì Connection string stored in memory only (this session)")

‚úì Connection string stored in memory only (this session)


## Step 4: Create Indexed SharePoint Knowledge Source

Now we'll create the knowledge source that will:
1. Connect to SharePoint
2. Create a data source
3. Create an indexer
4. Create a skillset (for chunking and embedding)
5. Create an index
6. Start the indexing process

### 4.1 Configure Knowledge Source Parameters

In [16]:
from azure.search.documents.indexes.models import (
    IndexedSharePointKnowledgeSource,
    IndexedSharePointKnowledgeSourceParameters,
    KnowledgeSourceIngestionParameters,
    KnowledgeSourceAzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    KnowledgeSourceContentExtractionMode
)

# Knowledge source name
knowledge_source_name = "sharepoint-indexed-ks"

# Configure embedding model for vectorization
embedding_params = AzureOpenAIVectorizerParameters(
    resource_url=azure_openai_endpoint,
    deployment_name=azure_openai_embedding_deployment,
    model_name=azure_openai_embedding_model_name,
    api_key=azure_openai_key
)

embedding_model = KnowledgeSourceAzureOpenAIVectorizer(
    azure_open_ai_parameters=embedding_params
)

# Configure ingestion parameters
ingestion_params = KnowledgeSourceIngestionParameters(
    embedding_model=embedding_model,
    content_extraction_mode=KnowledgeSourceContentExtractionMode.MINIMAL,  # Use STANDARD for better chunking
    disable_image_verbalization=False,  # Set to True if you don't need image descriptions
    # ingestion_schedule=None,  # Add schedule for automatic refresh
    # ingestion_permission_options=[]  # Add for ACL sync (see Part 3 discussion)
)

print("‚úì Ingestion parameters configured")
print(f"  Embedding model: {azure_openai_embedding_deployment}")
print(f"  Content extraction: MINIMAL (standard text/image extraction)")
print(f"  Image verbalization: Enabled")

‚úì Ingestion parameters configured
  Embedding model: text-embedding-3-large
  Content extraction: MINIMAL (standard text/image extraction)
  Image verbalization: Enabled


### 4.2 Choose SharePoint Container

Specify which document library to index:
- `defaultSiteLibrary`: Index the site's default "Shared Documents" library
- `allSiteLibraries`: Index ALL document libraries in the site
- `useQuery`: Use specific query (advanced - leave blank for now)

In [17]:
print("Which SharePoint library do you want to index?")
print("1. Default library only (Shared Documents)")
print("2. All libraries in the site")

library_choice = input("Enter 1 or 2: ").strip()

container_name = "defaultSiteLibrary" if library_choice == '1' else "allSiteLibraries"

print(f"‚úì Will index: {container_name}")

Which SharePoint library do you want to index?
1. Default library only (Shared Documents)
2. All libraries in the site
‚úì Will index: allSiteLibraries
‚úì Will index: allSiteLibraries


### 4.3 Create the Knowledge Source

This will trigger the creation of all Azure AI Search objects (data source, indexer, skillset, index):

In [18]:
# Create IndexedSharePointKnowledgeSourceParameters
sharepoint_params = IndexedSharePointKnowledgeSourceParameters(
    connection_string=sharepoint_connection_string,
    container_name=container_name,
    query=None,  # Advanced filtering - leave as None for now
    ingestion_parameters=ingestion_params
)

# Create the knowledge source
knowledge_source = IndexedSharePointKnowledgeSource(
    name=knowledge_source_name,
    description="Indexed SharePoint knowledge source with embeddings and full search capabilities",
    indexed_share_point_parameters=sharepoint_params
)

print("Creating knowledge source...")
print("This will:")
print("  1. Create a SharePoint data source")
print("  2. Create a skillset for chunking and vectorization")
print("  3. Create an index with vector fields")
print("  4. Create an indexer to process documents")
print("  5. Start indexing your SharePoint content")
print()
print("‚è≥ This may take a minute...")
print()

try:
    result = index_client.create_or_update_knowledge_source(knowledge_source)
    print("="*60)
    print("‚úì Knowledge source created successfully!")
    print("="*60)
    print()
    
    # Display created resources
    if hasattr(result, 'indexed_share_point_parameters') and result.indexed_share_point_parameters:
        created = result.indexed_share_point_parameters.created_resources
        if created:
            print("Created Azure AI Search objects:")
            print(f"  - Data Source: {created.datasource}")
            print(f"  - Indexer: {created.indexer}")
            print(f"  - Skillset: {created.skillset}")
            print(f"  - Index: {created.index}")
            print()
            
            # Store for later use
            indexer_name = created.indexer
            index_name = created.index
    
    print("üéâ Indexing has started!")
    print("   Proceed to Step 5 to monitor progress.")
    
except Exception as e:
    print("="*60)
    print("‚ùå Error creating knowledge source")
    print("="*60)
    print(f"Error: {str(e)}")
    print()
    print("Common issues:")
    print("  1. Admin consent not granted for app permissions")
    print("  2. Invalid connection string format")
    print("  3. Client secret expired or incorrect")
    print("  4. SharePoint URL not accessible")
    print("  5. Tenant ID mismatch")
    print()
    print("Review Steps 2-3 and try again.")
    raise

Creating knowledge source...
This will:
  1. Create a SharePoint data source
  2. Create a skillset for chunking and vectorization
  3. Create an index with vector fields
  4. Create an indexer to process documents
  5. Start indexing your SharePoint content

‚è≥ This may take a minute...

‚úì Knowledge source created successfully!

üéâ Indexing has started!
   Proceed to Step 5 to monitor progress.
‚úì Knowledge source created successfully!

üéâ Indexing has started!
   Proceed to Step 5 to monitor progress.


## Step 5: Monitor Indexing Progress

The indexer is now processing your SharePoint documents. Let's check its status.

### 5.1 Check Knowledge Source Status

In [19]:
import json
import time

print("Checking knowledge source ingestion status...")
print()

try:
    status_response = index_client.get_knowledge_source_status(knowledge_source_name)
    status = status_response.as_dict() if hasattr(status_response, 'as_dict') else status_response
    
    print(json.dumps(status, indent=2, default=str))
    print()
    
    # Interpret status
    sync_status = status.get('synchronization_status', 'unknown')
    
    if sync_status == 'creating':
        print("üìä Status: Creating indexer pipeline...")
    elif sync_status == 'active':
        current_state = status.get('current_synchronization_state', {})
        items_processed = current_state.get('item_updates_processed', 0)
        items_failed = current_state.get('items_updates_failed', 0)
        
        print(f"üìä Status: Actively indexing")
        print(f"   Items processed: {items_processed}")
        print(f"   Items failed: {items_failed}")
    elif sync_status == 'deleting':
        print("üìä Status: Being deleted")
    else:
        print(f"üìä Status: {sync_status}")
    
except Exception as e:
    print(f"‚ùå Error checking status: {e}")
    print("   The indexer may still be initializing. Try again in a moment.")

Checking knowledge source ingestion status...

{
  "synchronization_status": "active",
  "synchronization_interval": "1d",
  "last_synchronization_state": {
    "start_time": "2025-12-07T14:19:01.567Z",
    "end_time": "2025-12-07T14:19:18.562Z",
    "items_updates_processed": 3,
    "items_updates_failed": 0,
    "items_skipped": 0
  },
  "statistics": {
    "total_synchronization": 1,
    "average_synchronization_duration": "PT16.9953046S",
    "average_items_processed_per_synchronization": 3
  }
}

üìä Status: Actively indexing
   Items processed: 0
   Items failed: 0
{
  "synchronization_status": "active",
  "synchronization_interval": "1d",
  "last_synchronization_state": {
    "start_time": "2025-12-07T14:19:01.567Z",
    "end_time": "2025-12-07T14:19:18.562Z",
    "items_updates_processed": 3,
    "items_updates_failed": 0,
    "items_skipped": 0
  },
  "statistics": {
    "total_synchronization": 1,
    "average_synchronization_duration": "PT16.9953046S",
    "average_items_pr

### 5.2 Check Indexer Status (Detailed)

For more detailed information, check the indexer directly:

In [20]:
from azure.search.documents.indexes import SearchIndexerClient

indexer_client = SearchIndexerClient(endpoint=endpoint, credential=credential)

try:
    # Get the indexer name from earlier (it should be stored)
    if 'indexer_name' not in locals():
        indexer_name = f"{knowledge_source_name}-indexer"
    
    print(f"Checking indexer: {indexer_name}")
    print()
    
    indexer_status = indexer_client.get_indexer_status(indexer_name)
    
    # Show execution history
    print("="*60)
    print("Indexer Execution History")
    print("="*60)
    
    if indexer_status.last_result:
        result = indexer_status.last_result
        print(f"Status: {result.status}")
        print(f"Start time: {result.start_time}")
        print(f"End time: {result.end_time}")
        print(f"Items processed: {result.items_processed}")
        print(f"Items failed: {result.items_failed}")
        
        if result.errors:
            print()
            print("Errors:")
            for error in result.errors[:5]:  # Show first 5 errors
                print(f"  - {error.error_message}")
        
        if result.warnings:
            print()
            print(f"Warnings: {len(result.warnings)}")
    
    print()
    print("="*60)
    
    # Provide guidance
    if indexer_status.last_result and indexer_status.last_result.status == 'success':
        print("‚úì Indexing completed successfully!")
        print("  Proceed to Step 6 to query your SharePoint content.")
    elif indexer_status.last_result and indexer_status.last_result.status == 'inProgress':
        print("‚è≥ Indexing is still in progress...")
        print("   Wait a few minutes and run this cell again.")
    elif indexer_status.last_result and indexer_status.last_result.status.startswith('transient'):
        print("‚ö†Ô∏è  Indexer encountered transient errors")
        print("   Review errors above and retry if needed.")
    else:
        print("‚ÑπÔ∏è  Check the status above for details")
    
except Exception as e:
    print(f"‚ùå Error checking indexer: {e}")
    print("   The indexer may not be created yet. Wait and try again.")

Checking indexer: sharepoint-indexed-ks-indexer

Indexer Execution History
Status: success
Start time: 2025-12-07 14:19:01.604000+00:00
End time: 2025-12-07 14:19:18.562000+00:00
‚ùå Error checking indexer: 'IndexerExecutionResult' object has no attribute 'items_processed'
   The indexer may not be created yet. Wait and try again.
Indexer Execution History
Status: success
Start time: 2025-12-07 14:19:01.604000+00:00
End time: 2025-12-07 14:19:18.562000+00:00
‚ùå Error checking indexer: 'IndexerExecutionResult' object has no attribute 'items_processed'
   The indexer may not be created yet. Wait and try again.


### 5.3 View Created Objects in Azure Portal

While indexing runs, you can inspect the created objects:

1. **Navigate to Azure Portal** > Your Azure AI Search service
2. **Check Indexer**:
   - Go to **Indexers**
   - Find your indexer (e.g., `sharepoint-indexed-ks-indexer`)
   - View execution history and errors
3. **Check Index**:
   - Go to **Indexes**
   - Find your index (e.g., `sharepoint-indexed-ks-index`)
   - Use **Search Explorer** to test queries
4. **Check Skillset**:
   - Go to **Skillsets**
   - View how documents are chunked and vectorized
5. **Check Data Source**:
   - Go to **Data sources**
   - Verify connection to SharePoint

‚è≥ **Indexing typically takes 2-10 minutes** depending on:
- Number of documents
- Document sizes
- Content extraction mode (minimal vs standard)
- Image verbalization (if enabled)

## Step 6: Create Knowledge Base and Query

Once indexing completes, create a knowledge base to query your SharePoint content.

### 6.1 Create Knowledge Base

In [21]:
from azure.search.documents.indexes.models import (
    KnowledgeBase,
    KnowledgeBaseAzureOpenAIModel,
    AzureOpenAIVectorizerParameters,
    KnowledgeSourceReference,
    KnowledgeRetrievalOutputMode
)

knowledge_base_name = "sharepoint-indexed-kb"

# Configure Azure OpenAI model for reasoning
aoai_params = AzureOpenAIVectorizerParameters(
    resource_url=azure_openai_endpoint,
    deployment_name=azure_openai_chatgpt_deployment,
    model_name=azure_openai_chatgpt_model_name,
    api_key=azure_openai_key
)

# Create knowledge base
knowledge_base = KnowledgeBase(
    name=knowledge_base_name,
    description="Knowledge base for indexed SharePoint content with full agentic retrieval capabilities",
    models=[KnowledgeBaseAzureOpenAIModel(azure_open_ai_parameters=aoai_params)],
    knowledge_sources=[
        KnowledgeSourceReference(name=knowledge_source_name)
    ],
    output_mode=KnowledgeRetrievalOutputMode.ANSWER_SYNTHESIS
)

print("Creating knowledge base...")
index_client.create_or_update_knowledge_base(knowledge_base)
print(f"‚úì Knowledge base '{knowledge_base_name}' created successfully")

Creating knowledge base...
‚úì Knowledge base 'sharepoint-indexed-kb' created successfully
‚úì Knowledge base 'sharepoint-indexed-kb' created successfully


### 6.2 Query Your SharePoint Content

Now you can query your SharePoint documents with agentic retrieval!

In [24]:
from azure.search.documents.knowledgebases import KnowledgeBaseRetrievalClient
from azure.search.documents.knowledgebases.models import (
    KnowledgeBaseRetrievalRequest,
    KnowledgeBaseMessage,
    KnowledgeBaseMessageTextContent,
    IndexedSharePointKnowledgeSourceParams
)
from IPython.display import display, Markdown

# Create retrieval client
knowledge_base_client = KnowledgeBaseRetrievalClient(
    endpoint=endpoint,
    knowledge_base_name=knowledge_base_name,
    credential=credential
)

# Configure knowledge source parameters
# IMPORTANT: Set include_reference_source_data=True for citations with SharePoint URLs
sharepoint_params = IndexedSharePointKnowledgeSourceParams(
    knowledge_source_name=knowledge_source_name,
    include_references=True,
    include_reference_source_data=True  # Critical for SharePoint URL citations
)

# Example query - modify based on your SharePoint content
user_question = "do you info related to monitoring"

print(f"Query: {user_question}")
print()

# Create retrieval request
request = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(
            role="user",
            content=[KnowledgeBaseMessageTextContent(text=user_question)]
        )
    ],
    knowledge_source_params=[sharepoint_params],
    include_activity=True
)

# Execute query
result = knowledge_base_client.retrieve(retrieval_request=request)

# Display answer
print("="*60)
print("Answer:")
print("="*60)
display(Markdown(result.response[0].content[0].text))

Query: do you info related to monitoring

Answer:
Answer:


Information was found about TeraSky Observability 360‚Ñ¢, a unified observability platform designed for modern cloud environments. It offers unified observability across logs, metrics, traces, and Kubernetes health in a single view, enabling faster troubleshooting and reducing mean time to resolution (MTTR) from hours to minutes. The platform provides real-time intelligence with sub-second queries across billions of events and includes AI-powered insights for proactive anomaly detection (upcoming). Key features include zero-touch deployment with automated Terraform, vendor-independent instrumentation using OpenTelemetry, scalable analytics with Azure Data Explorer, and unified dashboards. The solution aims to cut downtime, reduce operational costs, consolidate multiple monitoring tools, and improve system reliability and performance. TeraSky also provides end-to-end onboarding, managed services, continuous updates, curated dashboards, and on-demand customization [ref_id:0].

### 6.3 View Citations

See which SharePoint documents were used to generate the answer:

In [33]:
import json
from pprint import pprint
from urllib.parse import urljoin, unquote

print("="*60)
print("Citations:")
print("="*60)
print()

def _friendly_name_from_path(path: str) -> str:
    # Return the last path segment, url-decoded, for display
    return unquote(path.split("/")[-1]) if path else path

def _absolute_doc_url(doc_url: str) -> str:
    # Some refs return a site-relative path like /drives/.../root:/File.pdf
    if doc_url.startswith("http://") or doc_url.startswith("https://"):
        return doc_url
    if sharepoint_site_url:
        # Ensure trailing slash on base so urljoin works as expected
        base = sharepoint_site_url if sharepoint_site_url.endswith("/") else sharepoint_site_url + "/"
        return urljoin(base, doc_url.lstrip("/"))
    return doc_url

if result.references:
    for i, ref in enumerate(result.references, 1):
        print(f"Citation {i}:")
        ref_dict = ref.as_dict() if hasattr(ref, "as_dict") else ref if isinstance(ref, dict) else None
        raw_type = type(ref).__name__
        print(f"  Type: {raw_type}")

        if ref_dict:
            keys = ", ".join(ref_dict.keys())
            print(f"  Keys: {keys}")

            source = ref_dict.get("source") or ref_dict.get("source_uri") or ref_dict.get("uri") or ref_dict.get("url")
            if not source and isinstance(ref_dict.get("source_data"), dict):
                source = ref_dict["source_data"].get("doc_url")

            abs_url = _absolute_doc_url(source) if source else None
            if abs_url:
                print(f"  Source: {abs_url}")
                print(f"  File: {_friendly_name_from_path(source)}")

            text = ref_dict.get("text") or ref_dict.get("content")
            if not text and isinstance(ref_dict.get("source_data"), dict):
                text = ref_dict["source_data"].get("snippet")

            if isinstance(text, list):
                text = " ".join([t if isinstance(t, str) else json.dumps(t) for t in text])
            if isinstance(text, str):
                excerpt = text[:200] + "..." if len(text) > 200 else text
                print(f"  Excerpt: {excerpt}")

            if not source and not text:
                print("  No source/text fields returned; full payload:")
                print(json.dumps(ref_dict, indent=2))
        else:
            print("  Unable to decode reference payload; raw object:")
            pprint(ref)
        print()
else:
    print("No citations found.")
    print("This might mean:")
    print("  1. No relevant documents matched your query")
    print("  2. Indexing hasn't completed yet")
    print("  3. include_reference_source_data wasn't set to True")

Citations:

Citation 1:
  Type: KnowledgeBaseIndexedSharePointReference
  Keys: type, id, activity_source, source_data, reranker_score
  Source: https://mngenvmcap338326.sharepoint.com/sites/lab511-demo/drives/b!9jcYDW5AbEmk9nijXaZofLzCWG7nkD5LphVt0iwKomaJiUPs83P-TaAFjI_uSX8v/root:/TeraSky Observability 360‚Ñ¢ Brochure.pdf
  File: TeraSky Observability 360‚Ñ¢ Brochure.pdf
  Excerpt: TeraSky 
Observability 360‚Ñ¢
Unified Observability for Modern 
Cloud Environments

What You Get
‚Ä¢ Unified Observability: One platform, no silos
‚Ä¢ Faster Troubleshooting: Reduce MTTR from 

hours to min...



### 6.4 View Activity Log

See how agentic retrieval processed your query:

In [35]:
import json
import pandas as pd

print("="*60)
print("Activity Log:")
print("="*60)
print()

if hasattr(result, "activity") and result.activity:
    details = []
    for idx, a in enumerate(result.activity):
        a_dict = a.as_dict() if hasattr(a, "as_dict") else a if isinstance(a, dict) else {}
        row = {
            "id": idx,
            "type": a_dict.get("type") or getattr(a, "type", None),
            "elapsed_ms": a_dict.get("elapsed_ms") or a_dict.get("elapsedMs") or a_dict.get("durationMs"),
            "input_tokens": a_dict.get("input_tokens") or a_dict.get("inputTokens"),
            "output_tokens": a_dict.get("output_tokens") or a_dict.get("outputTokens"),
            "knowledge_source_name": a_dict.get("knowledge_source_name") or a_dict.get("knowledgeSourceName"),
            "query_time": a_dict.get("query_time") or a_dict.get("queryTime"),
            "count": a_dict.get("count"),
        }
        details.append(row)

    # Compact table view
    df = pd.DataFrame(details)
    display(df)

    print()
    print("Full activity details (JSON):")
    print(json.dumps(details, indent=2, default=str))

else:
    print("No activity log available")

Activity Log:



Unnamed: 0,id,type,elapsed_ms,input_tokens,output_tokens,knowledge_source_name,query_time,count
0,0,modelQueryPlanning,884.0,1451.0,67.0,,,
1,1,indexedSharePoint,281.0,,,sharepoint-indexed-ks,2025-12-07T14:25:03.761Z,0.0
2,2,indexedSharePoint,297.0,,,sharepoint-indexed-ks,2025-12-07T14:25:04.059Z,1.0
3,3,indexedSharePoint,241.0,,,sharepoint-indexed-ks,2025-12-07T14:25:04.300Z,1.0
4,4,agenticReasoning,,,,,,
5,5,modelAnswerSynthesis,2029.0,3240.0,179.0,,,



Full activity details (JSON):
[
  {
    "id": 0,
    "type": "modelQueryPlanning",
    "elapsed_ms": 884,
    "input_tokens": 1451,
    "output_tokens": 67,
    "knowledge_source_name": null,
    "query_time": null,
    "count": null
  },
  {
    "id": 1,
    "type": "indexedSharePoint",
    "elapsed_ms": 281,
    "input_tokens": null,
    "output_tokens": null,
    "knowledge_source_name": "sharepoint-indexed-ks",
    "query_time": "2025-12-07T14:25:03.761Z",
    "count": 0
  },
  {
    "id": 2,
    "type": "indexedSharePoint",
    "elapsed_ms": 297,
    "input_tokens": null,
    "output_tokens": null,
    "knowledge_source_name": "sharepoint-indexed-ks",
    "query_time": "2025-12-07T14:25:04.059Z",
    "count": 1
  },
  {
    "id": 3,
    "type": "indexedSharePoint",
    "elapsed_ms": 241,
    "input_tokens": null,
    "output_tokens": null,
    "knowledge_source_name": "sharepoint-indexed-ks",
    "query_time": "2025-12-07T14:25:04.300Z",
    "count": 1
  },
  {
    "id": 4,
    "

### 6.5 Try Your Own Queries

Modify the cell below to ask questions about your SharePoint content:

In [None]:
# Try different queries based on your SharePoint content
queries = [
    "What are the main topics covered in these documents?",
    "Summarize the key points from the documents",
    "Find information about [your topic here]"
]

# Pick a query or write your own
my_question = queries[0]  # Change index or replace with your own question

request = KnowledgeBaseRetrievalRequest(
    messages=[
        KnowledgeBaseMessage(
            role="user",
            content=[KnowledgeBaseMessageTextContent(text=my_question)]
        )
    ],
    knowledge_source_params=[sharepoint_params],
    include_activity=True
)

result = knowledge_base_client.retrieve(retrieval_request=request)

print(f"Question: {my_question}")
print()
display(Markdown(result.response[0].content[0].text))

## Step 7: Managing Your Indexed Knowledge Source

### 7.1 Update Permissions (Resync ACLs)

If SharePoint permissions change, you need to manually trigger a resync:

In [None]:
# This cell is for reference - only run if you enabled ACL sync
# and SharePoint permissions have changed

print("To resync permissions after SharePoint changes:")
print()
print("Option 1: Resync all ACLs (no content update)")
print("  Use Azure Portal > Indexers > [your indexer] > Run with 'resync' option")
print()
print("Option 2: Reset specific documents (content + ACLs)")
print("  Use REST API: POST /indexers/[name]/resetdocs with document keys")
print()
print("Option 3: Full reindex (everything)")
print("  Use Azure Portal > Indexers > [your indexer] > Reset and Run")
print()
print("‚ö†Ô∏è  Remember: Without resync, index will have stale ACL data!")

### 7.2 Schedule Automatic Refresh

To keep your index updated with new/changed documents, configure a schedule:

In [None]:
print("To schedule automatic indexer runs:")
print()
print("1. In Azure Portal:")
print("   - Go to Indexers > [your indexer]")
print("   - Click 'Edit'")
print("   - Add Schedule (e.g., daily, hourly)")
print("   - Save")
print()
print("2. Via SDK (update ingestion_schedule in ingestion_parameters)")
print()
print("3. Via REST API:")
print("   - PATCH /indexers/[name]")
print("   - Add 'schedule' object with 'interval'")
print()
print("Recommended: Daily schedule for most SharePoint scenarios")

### 7.3 Clean Up Resources (Optional)

When you're done testing, you can delete the knowledge source and all created objects:

In [None]:
# WARNING: This will delete the knowledge source, knowledge base,
# and all created Azure AI Search objects (indexer, index, skillset, data source)

confirm_delete = input("Delete knowledge source and all created objects? (yes/no): ").lower()

if confirm_delete == 'yes':
    try:
        # Must delete knowledge base first
        print(f"Deleting knowledge base: {knowledge_base_name}")
        index_client.delete_knowledge_base(knowledge_base_name)
        print("‚úì Knowledge base deleted")
        
        # Then delete knowledge source (this deletes indexer, index, skillset, data source)
        print(f"Deleting knowledge source: {knowledge_source_name}")
        index_client.delete_knowledge_source(knowledge_source_name)
        print("‚úì Knowledge source and all created objects deleted")
        
        print()
        print("üóëÔ∏è  Cleanup complete!")
        
    except Exception as e:
        print(f"‚ùå Error during cleanup: {e}")
else:
    print("‚úì Keeping resources")

## Summary

Congratulations! üéâ You've successfully:

‚úÖ **Configured SharePoint integration** with Microsoft Entra app registration

‚úÖ **Created an IndexedSharePointKnowledgeSource** with full indexing pipeline

‚úÖ **Indexed SharePoint documents** with embeddings and semantic understanding

‚úÖ **Queried SharePoint content** with agentic retrieval and citations

### Key Takeaways

1. **IndexedSharePoint vs RemoteSharePoint**:
   - Indexed: Full search features, immediate availability, higher cost
   - Remote: Simple, lower cost, 24-48hr+ delay, live permissions

2. **Setup Complexity**:
   - Indexed requires significant Azure/SharePoint configuration
   - Worth it for production scenarios with complex queries

3. **Permission Management**:
   - Indexed: Permissions captured at index time (manual sync required)
   - Remote: Live SharePoint permissions (automatic)

4. **Automatic Refresh**:
   - Document changes: Automatic via incremental indexing
   - Permission changes: Manual resync required
   - Use scheduled indexer runs for regular updates

### What's Next?

- **Combine with other sources**: Use Part 6's pattern to query SharePoint + indexes + web URLs
- **Optimize performance**: Try different `content_extraction_mode` settings
- **Add scheduling**: Configure automatic indexer runs
- **Production deployment**: Switch to managed identities and Azure RBAC

### Resources

- [Indexed SharePoint Documentation](https://learn.microsoft.com/azure/search/agentic-knowledge-source-how-to-sharepoint-indexed)
- [SharePoint Indexer Prerequisites](https://learn.microsoft.com/azure/search/search-how-to-index-sharepoint-online)
- [SharePoint ACL Sync](https://learn.microsoft.com/azure/search/search-indexer-sharepoint-access-control-lists)
- [Knowledge Bases Overview](https://learn.microsoft.com/azure/search/agentic-retrieval-how-to-create-knowledge-base)

Great work! üöÄ