# Tutorial 1: LlamaIndex with AgentCore Browser Tool - Basic Secure Integration

This notebook demonstrates the fundamental integration between LlamaIndex agents and Amazon Bedrock AgentCore Browser Tool for secure web data extraction. You'll learn how to:

- Set up secure browser sessions using AgentCore's containerized environment
- Implement credential protection and session isolation
- Extract web data safely with built-in security controls
- Create LlamaIndex documents with proper security metadata

## Key Security Features Demonstrated

✅ **Containerized Browser Environment**: Isolated browser sessions for security  
✅ **Credential Protection**: Secure credential injection without exposure  
✅ **Session Isolation**: Proper session lifecycle management  
✅ **Security Metadata**: Comprehensive tracking of security features  
✅ **Automatic Cleanup**: Resource cleanup and sensitive data clearing  

## Prerequisites

- AWS account with Bedrock AgentCore access
- Configured AWS credentials
- LlamaIndex and AgentCore Browser Client SDK installed
- Environment variables configured (see `.env.example`)

## 1. Environment Setup and Validation

First, let's set up our environment and validate that all required components are properly configured.

In [None]:
# Import required libraries
import os
import sys
import logging
from datetime import datetime
from typing import List, Dict, Any

# Add the examples directory to the path
sys.path.append('examples')

# Configure logging for tutorial
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('llamaindex_agentcore_tutorial')

print("🚀 Starting LlamaIndex-AgentCore Browser Tool Integration Tutorial")
print(f"📅 Tutorial started at: {datetime.now().isoformat()}")
print("="*70)

In [None]:
# Load environment variables
from dotenv import load_dotenv

# Load environment configuration
load_dotenv()

# Validate required environment variables
required_env_vars = [
    'AWS_REGION',
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY'
]

print("🔍 Validating Environment Configuration:")
for var in required_env_vars:
    value = os.getenv(var)
    if value:
        # Never print credential values, even partially; only confirm they are set
        display_value = "[set]" if var in ['AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY'] else value
        print(f"  ✅ {var}: {display_value}")
    else:
        print(f"  ❌ {var}: Not configured")

# Set default region if not specified
aws_region = os.getenv('AWS_REGION', 'us-east-1')
print(f"\n🌍 Using AWS Region: {aws_region}")
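
The validation loop above can be factored into a small helper that reports missing variables and masks credential values completely. This is a minimal sketch; `mask_value` and `find_missing_vars` are hypothetical names, not part of any SDK.

```python
import os
from typing import Iterable, List, Mapping

# Hypothetical helper names, for illustration only.
SECRET_VARS = {"AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"}

def mask_value(name: str, value: str) -> str:
    """Return a display-safe form: secrets are never shown, even partially."""
    return "[set]" if name in SECRET_VARS else value

def find_missing_vars(required: Iterable[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not env.get(name)]

# Example with an explicit mapping instead of the real environment
example_env = {"AWS_REGION": "us-east-1", "AWS_ACCESS_KEY_ID": "AKIAEXAMPLE"}
missing = find_missing_vars(
    ["AWS_REGION", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"], example_env
)
print(missing)
print(mask_value("AWS_ACCESS_KEY_ID", example_env["AWS_ACCESS_KEY_ID"]))
```

Returning the list of missing names lets a notebook stop early instead of failing later during Bedrock initialization.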

In [None]:
# Import LlamaIndex components
try:
    from llama_index.core import Document, VectorStoreIndex
    from llama_index.core.readers.base import BaseReader
    from llama_index.llms.bedrock import Bedrock
    from llama_index.embeddings.bedrock import BedrockEmbedding
    print("✅ LlamaIndex components imported successfully")
except ImportError as e:
    print(f"❌ LlamaIndex import error: {e}")
    print("Please ensure LlamaIndex and its Bedrock integrations are installed: pip install llama-index llama-index-llms-bedrock llama-index-embeddings-bedrock")

# Import AgentCore Browser Loader
try:
    from agentcore_browser_loader import (
        AgentCoreBrowserLoader,
        BrowserSessionConfig,
        CredentialConfig,
        create_authenticated_loader
    )
    print("✅ AgentCore Browser Loader imported successfully")
except ImportError as e:
    print(f"❌ AgentCore Browser Loader import error: {e}")
    print("Please ensure the examples directory is in your path")

# Import sensitive data handling components
try:
    from sensitive_data_handler import (
        DocumentSanitizer,
        SensitiveDataClassifier,
        create_secure_sanitization_config
    )
    print("✅ Sensitive data handling components imported successfully")
except ImportError as e:
    print(f"❌ Sensitive data handler import error: {e}")
    print("Please ensure all example modules are available")
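
The three `try`/`except ImportError` blocks above each report one group of imports. As an alternative, module availability can be checked up front without importing the target modules themselves. The sketch below uses the standard library's `importlib.util.find_spec`; `check_modules` is a hypothetical helper name.

```python
import importlib.util
from typing import Dict, Iterable

def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each module name to True if the import system can locate it."""
    available = {}
    for name in names:
        try:
            available[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # For dotted names, find_spec imports the parent package first;
            # a missing parent raises ModuleNotFoundError (an ImportError).
            available[name] = False
    return available

status = check_modules(["json", "llama_index.core", "agentcore_browser_loader"])
for name, ok in status.items():
    print(f"{'✅' if ok else '❌'} {name}")
```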

## 2. Initialize LlamaIndex with Bedrock Models

Let's set up LlamaIndex with Amazon Bedrock models for our intelligent document processing.

In [None]:
# Configure Bedrock LLM for LlamaIndex
print("🧠 Initializing LlamaIndex with Bedrock Models:")

try:
    # Initialize Bedrock LLM
    llm = Bedrock(
        model="anthropic.claude-3-sonnet-20240229-v1:0",
        region_name=aws_region,
        max_tokens=4096,
        temperature=0.1
    )
    print("  ✅ Bedrock LLM initialized: Claude-3 Sonnet")
    
    # Initialize Bedrock Embeddings
    embed_model = BedrockEmbedding(
        model="amazon.titan-embed-text-v1",
        region_name=aws_region
    )
    print("  ✅ Bedrock Embeddings initialized: Titan Text Embeddings")
    
    # Configure LlamaIndex global settings
    from llama_index.core import Settings
    Settings.llm = llm
    Settings.embed_model = embed_model
    
    print("  ✅ LlamaIndex global settings configured")
    
except Exception as e:
    print(f"  ❌ Bedrock initialization error: {e}")
    print("  Please ensure your AWS credentials have Bedrock access")
    
    # No mock models are created here; later cells check for None and
    # skip index creation and querying
    print("  🔄 Continuing without Bedrock models; indexing and querying will be skipped")
    llm = None
    embed_model = None

## 3. Create Secure AgentCore Browser Session

Now let's create a secure browser session using AgentCore's containerized environment. This demonstrates the core security features that protect sensitive operations.

In [None]:
# Configure secure browser session
print("🔒 Configuring Secure AgentCore Browser Session:")

# Create browser session configuration with security features
session_config = BrowserSessionConfig(
    region=aws_region,
    session_timeout=300,  # 5 minutes
    enable_observability=True,
    enable_screenshot_redaction=True,  # Redact sensitive info in screenshots
    auto_cleanup=True,  # Automatic resource cleanup
    max_retries=3,
    retry_delay=1.0
)

print(f"  ✅ Session timeout: {session_config.session_timeout} seconds")
print(f"  ✅ Observability enabled: {session_config.enable_observability}")
print(f"  ✅ Screenshot redaction: {session_config.enable_screenshot_redaction}")
print(f"  ✅ Auto cleanup: {session_config.auto_cleanup}")
print(f"  ✅ Region: {session_config.region}")

# Initialize the AgentCore Browser Loader
browser_loader = AgentCoreBrowserLoader(
    session_config=session_config,
    enable_sanitization=True,  # Enable PII detection and masking
    enable_classification=True  # Enable document sensitivity classification
)

print(f"\n🎯 AgentCore Browser Loader initialized:")
print(f"  📋 Session ID: {browser_loader.session_id}")
print(f"  🔐 Sanitization enabled: {browser_loader.enable_sanitization}")
print(f"  📊 Classification enabled: {browser_loader.enable_classification}")
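
The `max_retries` and `retry_delay` settings suggest a retry loop around session operations. How the loader actually uses them is not shown in this notebook; the sketch below is one plausible interpretation, with `run_with_retries` as a hypothetical helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def run_with_retries(operation: Callable[[], T], max_retries: int = 3, retry_delay: float = 1.0) -> T:
    """Call operation until it succeeds, retrying up to max_retries times after the first failure."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # out of retries: surface the last error
            time.sleep(retry_delay)

calls = {"count": 0}

def flaky_page_load() -> str:
    """Fails twice, then succeeds (stands in for a transient network error)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "page content"

result = run_with_retries(flaky_page_load, max_retries=3, retry_delay=0.0)
print(result, calls["count"])
```

A fixed delay keeps the sketch simple; exponential backoff is a common refinement when the failing service may be overloaded.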

## 4. Basic Web Data Extraction

Let's demonstrate basic web data extraction using the secure AgentCore browser environment. This shows how LlamaIndex can safely load web content with built-in security controls.

In [None]:
# Define URLs for demonstration
demo_urls = [
    "https://httpbin.org/html",  # Simple HTML content
    "https://httpbin.org/json",  # JSON response
    "https://example.com"        # Basic example site
]

print("🌐 Extracting Web Data using AgentCore Browser Tool:")
print(f"📋 Processing {len(demo_urls)} URLs")
print("="*50)

try:
    # Load data from URLs using secure browser sessions
    documents = browser_loader.load_data(
        urls=demo_urls,
        authenticate=False,  # No authentication needed for these demo URLs
        wait_for_selector=None,  # No specific selector to wait for
        extract_links=False,  # Don't follow links for this basic demo
        max_depth=1
    )
    
    print(f"\n✅ Successfully extracted data from {len(documents)} pages")
    
    # Display information about each document
    for i, doc in enumerate(documents, 1):
        print(f"\n📄 Document {i}:")
        print(f"  🔗 Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"  📏 Content length: {len(doc.text)} characters")
        print(f"  🕒 Timestamp: {doc.metadata.get('timestamp', 'Unknown')}")
        print(f"  🔒 Session ID: {doc.metadata.get('session_id', 'Unknown')}")
        
        # Show security features metadata
        security_features = doc.metadata.get('security_features', {})
        if security_features:
            print(f"  🛡️ Security Features:")
            for feature, enabled in security_features.items():
                status = "✅" if enabled else "❌"
                print(f"    {status} {feature.replace('_', ' ').title()}: {enabled}")
        
        # Show first 200 characters of content
        preview = doc.text[:200] + "..." if len(doc.text) > 200 else doc.text
        print(f"  📝 Content preview: {preview}")
        
except Exception as e:
    print(f"❌ Error during web data extraction: {e}")
    logger.error(f"Web data extraction failed: {e}")
    documents = []
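
The loop above reads `source`, `timestamp`, `session_id`, and `security_features` from each document's metadata. The loader's internals are not shown here, so the following is only a sketch of how such a metadata dictionary could be assembled; `build_security_metadata` and the feature names are assumptions.

```python
from datetime import datetime, timezone
from typing import Any, Dict

def build_security_metadata(source: str, session_id: str, sanitized: bool, classified: bool) -> Dict[str, Any]:
    """Assemble the metadata keys that the display loop expects (hypothetical layout)."""
    return {
        "source": source,
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "security_features": {
            # Feature names are illustrative, not the loader's actual keys
            "containerized_browser": True,
            "data_sanitization": sanitized,
            "data_classification": classified,
        },
    }

meta = build_security_metadata("https://example.com", "session-123", sanitized=True, classified=False)
for feature, enabled in meta["security_features"].items():
    status = "✅" if enabled else "❌"
    print(f"{status} {feature.replace('_', ' ').title()}: {enabled}")
```

A dictionary like this can be passed as the `metadata` argument when constructing a LlamaIndex `Document`.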

## 5. Demonstrate Credential Protection

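This section shows how to handle authentication credentials when accessing protected web resources. The loader injects credentials into the browser session without writing them to logs, and clears them from memory after use. The demo hard-codes httpbin test credentials for brevity; real credentials should come from a secrets manager or environment variables, never from notebook source.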

In [None]:
# Demonstrate secure credential handling
print("🔐 Demonstrating Secure Credential Protection:")

# Create credential configuration (for demonstration - using httpbin basic auth)
credential_config = CredentialConfig(
    username_field="username",
    password_field="password",
    login_url="https://httpbin.org/basic-auth/testuser/testpass",
    success_indicator="authenticated"
)

print(f"  ✅ Credential config created")
print(f"  🔗 Login URL: {credential_config.login_url}")
print(f"  📝 Username field: {credential_config.username_field}")
print(f"  🔒 Password field: {credential_config.password_field}")

# Create authenticated loader
authenticated_loader = AgentCoreBrowserLoader(
    session_config=session_config,
    credential_config=credential_config,
    enable_sanitization=True,
    enable_classification=True
)

# Set credentials securely (credentials are not logged)
print("\n🔑 Setting authentication credentials:")
authenticated_loader.set_credentials(
    username="testuser",
    password="testpass",
    login_url="https://httpbin.org/basic-auth/testuser/testpass"
)
print("  ✅ Credentials configured securely (not logged or persisted)")
print("  🛡️ Credentials will be injected directly into browser session")
print("  🧹 Credentials will be cleared from memory after use")

In [None]:
# Demonstrate authenticated data extraction
print("\n🔓 Extracting Data from Authenticated Endpoint:")

try:
    # Load data with authentication
    auth_documents = authenticated_loader.load_data(
        urls=["https://httpbin.org/basic-auth/testuser/testpass"],
        authenticate=True,  # Enable authentication
        wait_for_selector=None,
        extract_links=False
    )
    
    print(f"✅ Successfully authenticated and extracted {len(auth_documents)} documents")
    
    # Show authentication metrics
    metrics = authenticated_loader.get_session_metrics()
    print(f"\n📊 Authentication Metrics:")
    print(f"  🔐 Authentication attempts: {metrics.get('authentication_attempts', 0)}")
    print(f"  ✅ Successful authentications: {metrics.get('successful_authentications', 0)}")
    print(f"  🔒 Credential injections: {metrics.get('credential_injections', 0)}")
    print(f"  🛡️ Sensitive operations: {metrics.get('sensitive_operations', 0)}")
    
    # Display authenticated document details
    if auth_documents:
        doc = auth_documents[0]
        print(f"\n📄 Authenticated Document Details:")
        print(f"  🔗 Source: {doc.metadata.get('source')}")
        print(f"  📏 Content length: {len(doc.text)} characters")
        print(f"  🔒 Session ID: {doc.metadata.get('session_id')}")
        
        # Show security features
        security_features = doc.metadata.get('security_features', {})
        print(f"  🛡️ Security Features Active:")
        for feature, enabled in security_features.items():
            if enabled:
                print(f"    ✅ {feature.replace('_', ' ').title()}")
    
except Exception as e:
    print(f"❌ Authentication demo error: {e}")
    logger.error(f"Authentication demonstration failed: {e}")
    auth_documents = []

# Verify credentials are cleared
print("\n🧹 Credential Cleanup Verification:")
username, password = authenticated_loader.credential_config.get_credentials()
if username is None and password is None:
    print("  ✅ Credentials successfully cleared from memory")
else:
    print("  ⚠️ Credentials may still be in memory")
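
The cleanup check above relies on `get_credentials()` returning `(None, None)` once credentials are cleared. A holder with that behavior could look like the sketch below. `EphemeralCredentials` is a hypothetical class, not the loader's `CredentialConfig`, and the environment variable names are assumptions. Python cannot guarantee that string contents are wiped from memory; clearing the references only makes the values unreachable through this object.

```python
import os
from typing import Optional, Tuple

class EphemeralCredentials:
    """Holds a username/password pair until clear() is called (illustrative only)."""

    def __init__(self, username: Optional[str] = None, password: Optional[str] = None) -> None:
        self._username = username
        self._password = password

    @classmethod
    def from_env(cls, user_var: str = "DEMO_USERNAME", pass_var: str = "DEMO_PASSWORD") -> "EphemeralCredentials":
        """Read credentials from environment variables instead of hard-coding them."""
        return cls(os.getenv(user_var), os.getenv(pass_var))

    def get_credentials(self) -> Tuple[Optional[str], Optional[str]]:
        return self._username, self._password

    def clear(self) -> None:
        """Drop the references so the values are no longer reachable through this object."""
        self._username = None
        self._password = None

    def __repr__(self) -> str:
        # Never include the secret values in repr, so accidental logging stays safe
        state = "set" if self._password is not None else "cleared"
        return f"EphemeralCredentials({state})"

creds = EphemeralCredentials("testuser", "testpass")
print(creds)
creds.clear()
print(creds.get_credentials())
```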

## 6. Session Isolation and Security Verification

Let's verify that our browser sessions are properly isolated and that security features are working as expected.

In [None]:
# Demonstrate session isolation by creating multiple loaders
print("🔒 Demonstrating Session Isolation:")

# Create multiple browser loaders with different session IDs
loaders = []
for i in range(3):
    loader = AgentCoreBrowserLoader(
        session_config=BrowserSessionConfig(
            region=aws_region,
            session_timeout=300,
            enable_observability=True,
            auto_cleanup=True
        ),
        enable_sanitization=True,
        enable_classification=True
    )
    loaders.append(loader)
    print(f"  📋 Session {i+1}: {loader.session_id}")

print(f"\n✅ Created {len(loaders)} isolated browser sessions")
print("  🛡️ Each session operates in its own containerized environment")
print("  🔐 Sessions cannot access each other's data or credentials")
print("  🧹 Each session will be cleaned up independently")

# Verify session uniqueness
session_ids = [loader.session_id for loader in loaders]
unique_sessions = len(set(session_ids))
print(f"\n🔍 Session Uniqueness Verification:")
print(f"  📊 Total sessions: {len(session_ids)}")
print(f"  🆔 Unique sessions: {unique_sessions}")
if unique_sessions == len(session_ids):
    print("  ✅ All sessions have unique identifiers")
else:
    print("  ❌ Session ID collision detected")
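
The uniqueness check compares session identifiers. How the loader generates `session_id` is not shown in this notebook; one common approach, assumed here, is a random UUID per session. `new_session_id` and its identifier format are hypothetical.

```python
import uuid
from typing import List

def new_session_id(prefix: str = "session") -> str:
    """Generate a random identifier that is unique for all practical purposes."""
    return f"{prefix}-{uuid.uuid4().hex}"

def all_unique(ids: List[str]) -> bool:
    """True when no identifier appears more than once."""
    return len(set(ids)) == len(ids)

example_ids = [new_session_id() for _ in range(3)]
print(all_unique(example_ids))
```

Unique identifiers only show that the sessions are distinct; the isolation itself comes from each session running in its own containerized environment.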

In [None]:
# Demonstrate security feature verification
print("\n🛡️ Security Feature Verification:")

# Get sensitivity summary from our main loader
sensitivity_summary = browser_loader.get_sensitivity_summary()

print(f"📋 Session: {sensitivity_summary['session_id']}")
print(f"\n🔒 Security Features Status:")
security_features = sensitivity_summary['security_features']
for feature, enabled in security_features.items():
    status = "✅" if enabled else "❌"
    feature_name = feature.replace('_', ' ').title()
    print(f"  {status} {feature_name}: {enabled}")

print(f"\n📊 Security Operations Count:")
print(f"  🔐 Sensitive operations: {sensitivity_summary['sensitive_operations']}")

# Show sanitization configuration if available
sanitization_config = sensitivity_summary.get('sanitization_config')
if sanitization_config:
    print(f"\n🧹 Sanitization Configuration:")
    print(f"  📊 Strict mode: {sanitization_config['strict_mode']}")
    print(f"  🎯 Default strategy: {sanitization_config['default_strategy']}")
    print(f"  📝 Audit enabled: {sanitization_config['audit_enabled']}")

# Show session metrics
session_metrics = sensitivity_summary['session_metrics']
print(f"\n📈 Session Performance Metrics:")
print(f"  📄 Pages loaded: {session_metrics.get('pages_loaded', 0)}")
print(f"  📋 Documents created: {session_metrics.get('documents_created', 0)}")
print(f"  ⚠️ Error count: {session_metrics.get('error_count', 0)}")
print(f"  ⏱️ Duration: {session_metrics.get('duration', 'Unknown')}")
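
With `enable_sanitization=True`, the loader is described as detecting and masking PII. The actual `DocumentSanitizer` implementation is not shown in this notebook. As a rough illustration of the idea only, the sketch below masks email addresses and US Social Security numbers with regular expressions; real detection needs far broader coverage. `mask_pii` is a hypothetical helper.

```python
import re
from typing import Tuple

# Deliberately simple patterns for illustration; not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(text: str) -> Tuple[str, int]:
    """Replace matches with placeholders and return (masked_text, number_of_replacements)."""
    text, email_count = EMAIL_RE.subn("[EMAIL]", text)
    text, ssn_count = SSN_RE.subn("[SSN]", text)
    return text, email_count + ssn_count

masked, count = mask_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
print(masked, count)
```

Returning the replacement count alongside the text makes it easy to feed a counter such as the `sensitive_operations` figure reported above.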

## 7. Create LlamaIndex Vector Store with Secure Documents

Now let's create a LlamaIndex vector store using the securely extracted documents. This demonstrates how the security metadata is preserved throughout the LlamaIndex pipeline.

In [None]:
# Create vector store index from secure documents
print("📚 Creating LlamaIndex Vector Store with Secure Documents:")

if documents:
    try:
        # Create vector store index
        print(f"  📋 Processing {len(documents)} secure documents")
        
        # Build the index only if the Bedrock models were initialized
        # successfully in section 2
        if llm and embed_model:
            index = VectorStoreIndex.from_documents(documents)
            print("  ✅ Vector store index created with Bedrock embeddings")
        else:
            # Simulate index creation for demonstration
            print("  🔄 Simulating vector store creation (Bedrock not available)")
            index = None
        
        # Verify security metadata is preserved
        print(f"\n🔍 Security Metadata Verification:")
        for i, doc in enumerate(documents, 1):
            print(f"\n  📄 Document {i}:")
            
            # Check for security-related metadata
            security_keys = [
                'session_id', 'loader', 'extraction_method', 
                'security_features', 'timestamp'
            ]
            
            for key in security_keys:
                if key in doc.metadata:
                    value = doc.metadata[key]
                    if key == 'security_features' and isinstance(value, dict):
                        print(f"    🛡️ {key}: {sum(1 for v in value.values() if v)} features enabled")
                    else:
                        print(f"    📋 {key}: {value}")
            
            # Check for data classification if available
            if 'data_classification' in doc.metadata:
                classification = doc.metadata['data_classification']
                print(f"    🏷️ Sensitivity level: {classification.get('sensitivity_level', 'Unknown')}")
                print(f"    🔍 Sensitive data count: {classification.get('sensitive_data_count', 0)}")
        
        print(f"\n✅ Security metadata verification completed")
        
    except Exception as e:
        print(f"❌ Vector store creation error: {e}")
        logger.error(f"Vector store creation failed: {e}")
        index = None
else:
    print("  ⚠️ No documents available for vector store creation")
    index = None
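
The verification loop prints whichever security keys are present but does not flag missing ones. A stricter check, sketched here with a hypothetical `missing_security_keys` helper, reports any expected key that a document lacks.

```python
from typing import Any, Dict, Iterable, List

EXPECTED_KEYS = ("session_id", "loader", "extraction_method", "security_features", "timestamp")

def missing_security_keys(metadata: Dict[str, Any], expected: Iterable[str] = EXPECTED_KEYS) -> List[str]:
    """Return the expected metadata keys that are absent from a document's metadata."""
    return [key for key in expected if key not in metadata]

sample_metadata = {
    "session_id": "session-123",
    "timestamp": "2024-01-01T00:00:00",
    "security_features": {},
}
print(missing_security_keys(sample_metadata))
```

In the cell above, this could run per document as `missing_security_keys(doc.metadata)`, with a success message printed only when every returned list is empty.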

## 8. Query the Secure Vector Store

Let's demonstrate how to query the vector store while maintaining security controls and metadata tracking.

In [None]:
# Demonstrate secure querying
print("🔍 Demonstrating Secure Vector Store Querying:")

if index and llm:
    try:
        # Create query engine
        query_engine = index.as_query_engine(
            llm=llm,
            similarity_top_k=3,
            response_mode="compact"
        )
        
        # Example queries
        sample_queries = [
            "What information was extracted from the web pages?",
            "What security features were used during data extraction?",
            "How was the data processed securely?"
        ]
        
        print(f"  📋 Processing {len(sample_queries)} sample queries")
        
        for i, query in enumerate(sample_queries, 1):
            print(f"\n  🔍 Query {i}: {query}")
            
            try:
                response = query_engine.query(query)
                print(f"  💬 Response: {str(response)[:200]}...")
                
                # Check if response metadata includes security information
                if hasattr(response, 'metadata'):
                    print(f"  🛡️ Security metadata preserved in response")
                
            except Exception as e:
                print(f"  ❌ Query error: {e}")

        print(f"\n✅ Secure querying demonstration completed")

    except Exception as e:
        print(f"❌ Query engine creation error: {e}")
        logger.error(f"Query engine creation failed: {e}")
else:
    print("  🔄 Simulating secure query processing (vector store not available)")
    print("  📋 In a real implementation, queries would:")
    print("    🔍 Search through securely extracted documents")
    print("    🛡️ Maintain security metadata in responses")
    print("    📊 Track query operations for audit purposes")
    print("    🔒 Apply response sanitization if needed")
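
The fallback text mentions tracking query operations for audit purposes. One way to do that without storing raw query text is to log a hash of each query. The sketch below assumes this design; `make_audit_record` and its field names are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict

def make_audit_record(query: str, session_id: str, response_chars: int) -> Dict[str, Any]:
    """Build a log-safe record: the query is stored as a SHA-256 digest, not as text."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
        "response_chars": response_chars,
    }

record = make_audit_record("What security features were used?", "session-123", 200)
print(json.dumps(record, indent=2))
```

Hashing lets identical queries be correlated in the audit trail while keeping potentially sensitive query text out of the logs.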

## 9. Cleanup and Resource Management

Proper cleanup is essential for security. Let's demonstrate how to properly clean up resources and clear sensitive data.

In [None]:
# Demonstrate proper cleanup procedures
print("🧹 Demonstrating Proper Resource Cleanup:")

# Cleanup main browser loader
print(f"\n📋 Cleaning up main session: {browser_loader.session_id}")
browser_loader.cleanup_session()
print("  ✅ Main session cleaned up")

# Cleanup authenticated loader
if 'authenticated_loader' in locals():
    print(f"\n🔐 Cleaning up authenticated session: {authenticated_loader.session_id}")
    authenticated_loader.cleanup_session()
    print("  ✅ Authenticated session cleaned up")
    print("  🔒 Credentials cleared from memory")

# Cleanup multiple session loaders
if 'loaders' in locals():
    print(f"\n🔄 Cleaning up {len(loaders)} isolated sessions:")
    for i, loader in enumerate(loaders, 1):
        loader.cleanup_session()
        print(f"  ✅ Session {i} ({loader.session_id}) cleaned up")

print(f"\n🎯 Cleanup Summary:")
print(f"  🧹 All browser sessions terminated")
print(f"  🔒 All credentials cleared from memory")
print(f"  📊 Session metrics finalized")
print(f"  🛡️ Security audit trails preserved")
print(f"  ♻️ Resources properly released")
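
Explicit cleanup calls are skipped if an earlier cell raises. Wrapping a loader in a context manager makes cleanup run even on errors. The sketch below assumes only that the loader exposes `cleanup_session()`, as used above; `managed_session` and `FakeLoader` are hypothetical.

```python
from contextlib import contextmanager
from typing import Any, Iterator

@contextmanager
def managed_session(loader: Any) -> Iterator[Any]:
    """Yield the loader and always call cleanup_session() afterwards."""
    try:
        yield loader
    finally:
        loader.cleanup_session()

class FakeLoader:
    """Stand-in for a browser loader so the sketch runs without AWS access."""

    def __init__(self) -> None:
        self.cleaned_up = False

    def cleanup_session(self) -> None:
        self.cleaned_up = True

fake = FakeLoader()
try:
    with managed_session(fake) as session:
        raise RuntimeError("simulated extraction failure")
except RuntimeError:
    pass
print(fake.cleaned_up)
```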

## 10. Tutorial Summary and Next Steps

Let's summarize what we've learned and outline the next steps in the tutorial series.

In [None]:
# Generate tutorial summary
print("📋 Tutorial 1 Summary - LlamaIndex with AgentCore Browser Tool Basic Integration")
print("="*80)

print("\n✅ Key Concepts Demonstrated:")
concepts = [
    "Secure browser session creation with AgentCore's containerized environment",
    "Credential protection and secure injection without exposure",
    "Session isolation and proper lifecycle management",
    "Web data extraction with built-in security controls",
    "Security metadata preservation throughout LlamaIndex pipeline",
    "Proper resource cleanup and sensitive data clearing"
]

for i, concept in enumerate(concepts, 1):
    print(f"  {i}. {concept}")

print("\n🛡️ Security Features Highlighted:")
security_features = [
    "Containerized browser isolation",
    "Secure credential management",
    "Session-based access controls",
    "Automatic sensitive data detection",
    "Comprehensive audit logging",
    "Resource cleanup automation"
]

for feature in security_features:
    print(f"  🔒 {feature}")

print("\n📚 Next Steps in Tutorial Series:")
next_steps = [
    "Tutorial 2: RAG Pipeline with Sensitive Form Data",
    "Tutorial 3: Authenticated Web Services Integration",
    "Tutorial 4: Production Patterns and Monitoring"
]

for step in next_steps:
    print(f"  📖 {step}")

print("\n🎯 Toward Production:")
print("  ✅ The patterns shown here are a starting point for production use")
print("  🔒 Review and harden the security controls for your environment")
print("  📊 Monitoring and audit patterns are covered in Tutorial 4")
print("  🛡️ Validate the setup against your organization's security requirements")

print(f"\n📅 Tutorial completed at: {datetime.now().isoformat()}")
print("🎉 Ready to proceed to Tutorial 2: RAG Pipeline with Sensitive Form Data")

## Conclusion

This tutorial demonstrated the basic integration between LlamaIndex and the Amazon Bedrock AgentCore Browser Tool for secure web data extraction. The main points are summarized below.

### Key Takeaways

1. **Secure Session Management**: AgentCore provides containerized browser sessions that isolate sensitive operations
2. **Credential Protection**: Credentials are injected into the browser session without being written to logs, and are cleared from memory after use
3. **Security Metadata**: All security features and operations are tracked throughout the LlamaIndex pipeline
4. **Proper Cleanup**: Resources and sensitive data are properly cleaned up after operations

### Security Benefits

- **Isolation**: Each browser session runs in its own containerized environment
- **Protection**: Credentials and sensitive data are handled securely
- **Monitoring**: Comprehensive audit trails for compliance and security
- **Automation**: Built-in security controls reduce manual security management

### Next Tutorial

In **Tutorial 2**, we'll explore how to build RAG applications that process sensitive form data, including:
- PII detection and masking
- Secure document ingestion
- Context-aware querying with data protection
- Advanced sanitization techniques

Continue to `02_llamaindex_sensitive_rag_pipeline.ipynb` to learn about handling sensitive data in RAG applications.