# Tutorial 3: LlamaIndex Agents with Authenticated Web Services via AgentCore

This advanced tutorial demonstrates how LlamaIndex agents can securely access authenticated web applications through the Amazon Bedrock AgentCore Browser Tool. You'll learn production-ready patterns for:

- **Multi-page workflow automation** with session security maintained by AgentCore
- **Secure data extraction** from protected resources using LlamaIndex with AgentCore
- **Authentication state management** across LlamaIndex operations
- **Complex workflow orchestration** with sensitive data protection

## Key Features Demonstrated

🔐 **Secure Authentication**: Multi-step login flows with credential protection  
🔄 **Session Persistence**: Maintaining authentication across multiple operations  
📊 **Multi-page Workflows**: Complex data extraction from authenticated portals  
🛡️ **Data Protection**: Comprehensive sensitive data handling throughout workflows  
📈 **Production Patterns**: Scalable, enterprise-ready implementation examples  

## Prerequisites

- Completed Tutorials 1 and 2
- AWS account with Bedrock AgentCore access
- Test credentials for demonstration purposes
- Understanding of LlamaIndex RAG concepts

## 1. Environment Setup and Advanced Configuration

Let's start by setting up our environment with advanced configurations for authenticated web service access.

In [None]:
# Import required libraries
import os
import sys
import logging
import time
import json
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Any
from pathlib import Path

# Add examples directory to path
sys.path.append('./examples')

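# Create the log directory first; FileHandler raises FileNotFoundError if it is missing
Path('./logs').mkdir(parents=True, exist_ok=True)

# Configure logging for detailed session tracking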
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('./logs/authenticated_web_services.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

print("🚀 Tutorial 3: LlamaIndex Agents with Authenticated Web Services")
print("📅 Session started:", datetime.now().strftime("%Y-%m-%d %H:%M:%S"))
print("🔐 Focus: Multi-page workflows with authentication state management")

In [None]:
# Load environment variables
from dotenv import load_dotenv
load_dotenv()

# Validate required environment variables
required_vars = [
    'AWS_REGION',
    'OPENAI_API_KEY',  # For LlamaIndex LLM
    'AGENTCORE_REGION'
]

missing_vars = [var for var in required_vars if not os.getenv(var)]
if missing_vars:
    print(f"❌ Missing environment variables: {missing_vars}")
    print("Please check your .env file and ensure all required variables are set.")
else:
    print("✅ All required environment variables are configured")
    print(f"🌍 AWS Region: {os.getenv('AWS_REGION')}")
    print(f"🌍 AgentCore Region: {os.getenv('AGENTCORE_REGION')}")
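
The check above only prints a warning, so later cells would still run with an incomplete configuration. As a minimal sketch of a stricter variant, assuming you would rather stop early, the helper below raises instead. `require_env` is a hypothetical name introduced here for illustration and is not part of any library used in this tutorial.

```python
import os
from typing import Dict, Iterable, Mapping, Optional


def require_env(names: Iterable[str],
                env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the requested variables, or raise if any are missing or empty.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    source = os.environ if env is None else env
    values = {name: source.get(name, "") for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return values
```

Calling `require_env(required_vars)` right after `load_dotenv()` would halt the notebook at the first misconfigured cell instead of failing later inside a browser session.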

## 2. Advanced AgentCore Browser Loader Configuration

We'll configure the AgentCore Browser Loader with advanced settings for authenticated web service access, including session persistence and multi-page workflow support.

In [None]:
# Import our custom AgentCore Browser Loader
from agentcore_browser_loader import (
    AgentCoreBrowserLoader,
    BrowserSessionConfig,
    CredentialConfig,
    create_authenticated_loader
)

# Import sensitive data handling components
from sensitive_data_handler import (
    create_secure_sanitization_config,
    SensitivityLevel
)

print("📦 AgentCore Browser Loader components imported successfully")
print("🔒 Sensitive data handling components loaded")

In [None]:
# Configure advanced browser session for authenticated workflows
advanced_session_config = BrowserSessionConfig(
    region=os.getenv('AGENTCORE_REGION', 'us-east-1'),
    session_timeout=900,  # 15 minutes for complex workflows
    enable_observability=True,
    enable_screenshot_redaction=True,
    auto_cleanup=True,
    max_retries=5,  # Higher retry count for network issues
    retry_delay=2.0
)

# Configure authentication for multiple services
auth_configs = {
    'demo_portal': CredentialConfig(
        username_field="email",
        password_field="password",
        login_url="https://demo-portal.example.com/login",
        login_button_selector="button[type='submit']",
        success_indicator=".dashboard-welcome"
    ),
    'secure_app': CredentialConfig(
        username_field="username",
        password_field="password",
        login_url="https://secure-app.example.com/auth/login",
        login_button_selector="#login-btn",
        success_indicator=".user-profile"
    )
}

print("⚙️ Advanced session configuration created:")
print(f"   - Session timeout: {advanced_session_config.session_timeout}s")
print(f"   - Max retries: {advanced_session_config.max_retries}")
print(f"   - Observability: {advanced_session_config.enable_observability}")
print(f"📝 Authentication configurations prepared for {len(auth_configs)} services")
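
The configuration above sets `max_retries=5` and `retry_delay=2.0`, but this tutorial does not show how the loader applies them. The sketch below illustrates one plausible reading: retry a failing operation a fixed number of times with a constant pause between attempts. `call_with_retries` is a hypothetical helper, and the loader's real behavior may differ (for example, it could use exponential backoff).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T],
                      max_retries: int = 5,
                      retry_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`; after a failure, retry up to `max_retries` more times."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(retry_delay)  # constant delay between attempts
```

With the defaults shown, an operation is attempted at most six times (one initial call plus five retries). The `sleep` parameter exists only so the delay can be replaced in tests.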

## 3. LlamaIndex Setup with Advanced RAG Configuration

Configure LlamaIndex with advanced settings optimized for processing authenticated web content and maintaining context across multi-page workflows.

In [None]:
# Import LlamaIndex components
from llama_index.core import VectorStoreIndex, ServiceContext, Document
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.storage.storage_context import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Import secure RAG pipeline
from secure_rag_pipeline import SecureRAGPipeline, SecureVectorStore

print("🦙 LlamaIndex components imported")
print("🔐 Secure RAG pipeline components loaded")

In [None]:
# Configure LlamaIndex LLM with advanced settings
llm = OpenAI(
    model="gpt-4",
    temperature=0.1,  # Low temperature for consistent responses
    max_tokens=2048,
    system_prompt="""
You are a secure AI assistant that processes authenticated web content.
Always maintain data privacy and security when handling sensitive information.
Provide accurate, helpful responses while protecting user confidentiality.
"""
)

# Configure embeddings
embed_model = OpenAIEmbedding(
    model="text-embedding-ada-002",
    embed_batch_size=100
)

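# Create service context
# Note: ServiceContext is deprecated as of llama-index 0.10 and removed in later
# releases, where the same options are set on llama_index.core.Settings instead.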
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=SimpleNodeParser.from_defaults(
        chunk_size=512,
        chunk_overlap=50
    )
)

print("🤖 LlamaIndex LLM configured with GPT-4")
print("🔢 Embeddings configured with text-embedding-ada-002")
print("📊 Service context created with optimized settings")
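
To make the `chunk_size=512` and `chunk_overlap=50` settings concrete, here is a simplified sliding-window splitter. It is only an illustration of how the two parameters interact: LlamaIndex's own parser counts tokens with a tokenizer and prefers sentence boundaries, which this sketch does not attempt. `sliding_window_chunks` is a hypothetical name.

```python
from typing import List


def sliding_window_chunks(tokens: List[str],
                          chunk_size: int,
                          chunk_overlap: int) -> List[List[str]]:
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` items."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks
```

Under this scheme a 1,000-token page split with 512/50 produces three chunks, and each chunk repeats the last 50 tokens of the previous one so that context spanning a boundary is not lost.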

## 4. Multi-Page Workflow: Authenticated Portal Data Extraction

This section demonstrates a complex multi-page workflow where we:
1. Authenticate to a web portal
2. Navigate through multiple protected pages
3. Extract data from various sections
4. Maintain session security throughout the process

In [None]:
class AuthenticatedWorkflowManager:
    """
    Manages complex authenticated workflows with LlamaIndex and AgentCore.
    
    This class demonstrates production-ready patterns for:
    - Session management across multiple pages
    - Authentication state persistence
    - Secure data extraction workflows
    - Error handling and recovery
    """
    
    def __init__(self, session_config: BrowserSessionConfig):
        self.session_config = session_config
        self.active_sessions = {}
        self.workflow_metrics = {}
        
    def create_authenticated_session(self, service_name: str, credentials: dict) -> AgentCoreBrowserLoader:
        """Create and authenticate a browser session for a specific service."""
        
        print(f"🔐 Creating authenticated session for: {service_name}")
        
        # Get authentication configuration
        auth_config = auth_configs.get(service_name)
        if not auth_config:
            raise ValueError(f"No authentication configuration found for {service_name}")
        
        # Create loader with authentication
        loader = AgentCoreBrowserLoader(
            session_config=self.session_config,
            credential_config=auth_config,
            sanitization_config=create_secure_sanitization_config(strict_mode=True),
            enable_sanitization=True,
            enable_classification=True
        )
        
        # Set credentials securely
        loader.set_credentials(
            username=credentials['username'],
            password=credentials['password'],
            login_url=auth_config.login_url
        )
        
        # Store session for reuse
        self.active_sessions[service_name] = loader
        
        print(f"✅ Authenticated session created for {service_name}")
        return loader
    
    def execute_multi_page_workflow(self, service_name: str, page_urls: List[str]) -> List[Document]:
        """Execute a multi-page data extraction workflow."""
        
        print(f"🔄 Starting multi-page workflow for {service_name}")
        print(f"📄 Pages to process: {len(page_urls)}")
        
        loader = self.active_sessions.get(service_name)
        if not loader:
            raise ValueError(f"No active session found for {service_name}")
        
        all_documents = []
        workflow_start = datetime.now()
        
        try:
            # Load data from all pages with authentication maintained
            documents = loader.load_data(
                urls=page_urls,
                authenticate=True,  # Maintain authentication
                wait_for_selector=".content-loaded",  # Wait for dynamic content
                extract_links=False,  # Don't follow links automatically
                max_depth=1
            )
            
            all_documents.extend(documents)
            
            # Add workflow metadata
            for doc in documents:
                doc.metadata.update({
                    'workflow_type': 'multi_page_authenticated',
                    'service_name': service_name,
                    'workflow_start': workflow_start.isoformat(),
                    'session_maintained': True
                })
            
            workflow_duration = datetime.now() - workflow_start
            
            print(f"✅ Multi-page workflow completed successfully")
            print(f"📊 Processed {len(documents)} documents in {workflow_duration}")
            print(f"🔒 All documents processed with security controls")
            
            return all_documents
            
        except Exception as e:
            print(f"❌ Workflow failed: {str(e)}")
            raise
    
    def cleanup_all_sessions(self):
        """Clean up all active sessions and clear sensitive data."""
        
        print("🧹 Cleaning up all authenticated sessions...")
        
        for service_name, loader in self.active_sessions.items():
            try:
                loader.cleanup_session()
                print(f"✅ Session cleaned up: {service_name}")
            except Exception as e:
                print(f"⚠️ Cleanup warning for {service_name}: {str(e)}")
        
        self.active_sessions.clear()
        print("🔒 All sessions cleaned up and sensitive data cleared")

# Create workflow manager
workflow_manager = AuthenticatedWorkflowManager(advanced_session_config)
print("🎯 Authenticated Workflow Manager initialized")
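
`execute_multi_page_workflow` hands every URL to a single `load_data` call, so one failing page aborts the whole run. As a sketch of an alternative, assuming partial results are acceptable, the helper below processes pages one at a time and records failures separately. `load_pages_individually` and its `load_page` callable are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, List, Tuple, TypeVar

D = TypeVar("D")


def load_pages_individually(page_urls: List[str],
                            load_page: Callable[[str], List[D]]
                            ) -> Tuple[List[D], Dict[str, str]]:
    """Load each URL separately; return (documents, {url: error message})."""
    documents: List[D] = []
    failures: Dict[str, str] = {}
    for url in page_urls:
        try:
            documents.extend(load_page(url))
        except Exception as exc:
            # Record the failure and keep going with the remaining pages
            failures[url] = f"{type(exc).__name__}: {exc}"
    return documents, failures
```

With the loader from this tutorial, `load_page` could be something like `lambda url: loader.load_data(urls=[url], authenticate=True)`, assuming `load_data` accepts a one-element list the same way it accepts the full list above.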

## 5. Demonstration: Customer Portal Data Extraction

Let's demonstrate a realistic scenario where we extract data from a customer portal that requires authentication and involves multiple pages.

In [None]:
# Demo credentials (in production, these would come from secure credential storage)
demo_credentials = {
    'demo_portal': {
        'username': 'demo_user@example.com',
        'password': 'secure_demo_password_123'
    }
}

# Define the multi-page workflow for customer portal
customer_portal_pages = [
    'https://demo-portal.example.com/dashboard',
    'https://demo-portal.example.com/profile',
    'https://demo-portal.example.com/orders',
    'https://demo-portal.example.com/support-tickets',
    'https://demo-portal.example.com/billing'
]

print("🏢 Customer Portal Workflow Configuration:")
print(f"📄 Pages to extract: {len(customer_portal_pages)}")
for i, page in enumerate(customer_portal_pages, 1):
    print(f"   {i}. {page.split('/')[-1].title()}")
print("🔐 Authentication: Required for all pages")
print("🛡️ Security: Full sanitization and classification enabled")
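
The demo credentials above are hardcoded for readability. A small step toward safer handling, sketched under the assumption that credentials are supplied through environment variables, is to keep them in an object whose `repr()` never shows the password. The class, the function, and the `<PREFIX>_USERNAME` / `<PREFIX>_PASSWORD` variable names are all hypothetical.

```python
import os
from dataclasses import dataclass, field
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class ServiceCredentials:
    username: str
    password: str = field(repr=False)  # excluded from repr() so it stays out of logs

    def as_dict(self) -> Dict[str, str]:
        """Return the {'username', 'password'} shape used by the workflow manager."""
        return {"username": self.username, "password": self.password}


def credentials_from_env(prefix: str,
                         env: Optional[Mapping[str, str]] = None) -> ServiceCredentials:
    """Build credentials from <prefix>_USERNAME and <prefix>_PASSWORD."""
    source = os.environ if env is None else env
    try:
        return ServiceCredentials(
            username=source[f"{prefix}_USERNAME"],
            password=source[f"{prefix}_PASSWORD"],
        )
    except KeyError as exc:
        raise RuntimeError(f"Missing credential variable: {exc.args[0]}") from None
```

`credentials_from_env("DEMO_PORTAL").as_dict()` would then replace the literal dictionary passed to `create_authenticated_session`.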

In [None]:
# Execute the authenticated multi-page workflow
print("🚀 Starting Customer Portal Data Extraction Workflow...")
print("=" * 60)

try:
    # Step 1: Create authenticated session
    print("\n📋 Step 1: Creating Authenticated Session")
    customer_loader = workflow_manager.create_authenticated_session(
        service_name='demo_portal',
        credentials=demo_credentials['demo_portal']
    )
    
    # Step 2: Execute multi-page workflow
    print("\n📋 Step 2: Executing Multi-Page Data Extraction")
    extracted_documents = workflow_manager.execute_multi_page_workflow(
        service_name='demo_portal',
        page_urls=customer_portal_pages
    )
    
    # Step 3: Analyze extracted data
    print("\n📋 Step 3: Analyzing Extracted Data")
    print(f"📊 Total documents extracted: {len(extracted_documents)}")
    
    for i, doc in enumerate(extracted_documents, 1):
        print(f"\n📄 Document {i}:")
        print(f"   Source: {doc.metadata.get('source', 'Unknown')}")
        print(f"   Content length: {len(doc.text)} characters")
        print(f"   Security features: {doc.metadata.get('security_features', {})}")
        
        # Check for sensitive data classification
        if 'data_classification' in doc.metadata:
            classification = doc.metadata['data_classification']
            print(f"   Sensitivity level: {classification.get('sensitivity_level', 'Unknown')}")
            print(f"   Sensitive data detected: {classification.get('sensitive_data_count', 0)} items")
        
        # Show sanitization status
        if 'sanitization_applied' in doc.metadata:
            print(f"   Sanitization applied: {doc.metadata['sanitization_applied']}")
    
    print("\n✅ Customer Portal Workflow completed successfully!")
    
except Exception as e:
    print(f"\n❌ Workflow failed: {str(e)}")
    print("This is expected in the demo environment - the URLs are examples.")
    print("In a real implementation, these would be actual authenticated web services.")

## 6. Advanced RAG Pipeline with Authenticated Data

Now let's create a secure RAG pipeline that can process the authenticated web data while maintaining security controls.

In [None]:
# Create sample documents for demonstration (since we can't access real authenticated sites)
sample_authenticated_documents = [
    Document(
        text="""
        Customer Dashboard - Welcome John Smith (ID: CUST-12345)
        Account Status: Active
        Last Login: 2024-01-15 14:30:00
        
        Recent Activity:
        - Order #ORD-98765 placed on 2024-01-10
        - Support ticket #TKT-54321 resolved on 2024-01-08
        - Payment processed for $299.99 on 2024-01-05
        
        Account Information:
        Email: john.smith@email.com
        Phone: (555) 123-4567
        Address: 123 Main St, Anytown, ST 12345
        """,
        metadata={
            'source': 'https://demo-portal.example.com/dashboard',
            'loader': 'AgentCoreBrowserLoader',
            'session_id': 'auth-session-001',
            'workflow_type': 'multi_page_authenticated',
            'service_name': 'demo_portal',
            'security_features': {
                'containerized_browser': True,
                'credential_protection': True,
                'session_isolation': True,
                'sanitization_enabled': True,
                'classification_enabled': True
            },
            'data_classification': {
                'sensitivity_level': 'CONFIDENTIAL',
                'sensitive_data_count': 5,
                'requires_special_handling': True
            }
        }
    ),
    Document(
        text="""
        Order History - Customer: John Smith
        
        Order #ORD-98765 (Status: Delivered)
        Date: 2024-01-10
        Items: Premium Software License x1
        Total: $299.99
        Shipping Address: 123 Main St, Anytown, ST 12345
        
        Order #ORD-87654 (Status: Processing)
        Date: 2024-01-12
        Items: Support Package x1
        Total: $149.99
        
        Payment Method: Credit Card ending in 4567
        Billing Address: Same as shipping
        """,
        metadata={
            'source': 'https://demo-portal.example.com/orders',
            'loader': 'AgentCoreBrowserLoader',
            'session_id': 'auth-session-001',
            'workflow_type': 'multi_page_authenticated',
            'service_name': 'demo_portal',
            'security_features': {
                'containerized_browser': True,
                'credential_protection': True,
                'session_isolation': True,
                'sanitization_enabled': True,
                'classification_enabled': True
            },
            'data_classification': {
                'sensitivity_level': 'CONFIDENTIAL',
                'sensitive_data_count': 4,
                'requires_special_handling': True
            }
        }
    )
]

print(f"📄 Created {len(sample_authenticated_documents)} sample authenticated documents")
print("🔒 All documents include comprehensive security metadata")
print("📊 Documents represent typical authenticated web portal content")
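
The sample documents carry `sanitization_enabled` metadata, yet their text still contains an email address and a phone number. The internals of `sensitive_data_handler` are not shown in this tutorial, so the following is only a stand-in sketch of what a redaction pass before indexing might look like. The two regular expressions cover just the formats that appear in the sample text and are not a complete PII detector.

```python
import re
from typing import Dict, Tuple

# Patterns are assumptions tailored to the sample documents above
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}"),
}


def redact(text: str) -> Tuple[str, Dict[str, int]]:
    """Replace each match with a placeholder and count replacements per label."""
    counts: Dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        text, replaced = pattern.subn(f"[{label}_REDACTED]", text)
        counts[label] = replaced
    return text, counts
```

Running `redact(doc.text)` on each sample document before building the index would keep these values out of the embeddings and out of any generated answers.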

In [None]:
# Create simple vector store index for demonstration
print("🔄 Creating secure RAG pipeline for authenticated data...")

try:
    # Create simple vector store index
    authenticated_index = VectorStoreIndex.from_documents(
        sample_authenticated_documents,
        service_context=service_context
    )
    
    print("✅ Authenticated data index created successfully")
    print("🔐 All sensitive data processed with security controls")
    
except Exception as e:
    print(f"❌ Index creation failed: {str(e)}")
    authenticated_index = None

## 7. Secure Query Processing with Authentication Context

Demonstrate how to query the authenticated data while maintaining security controls and context awareness.

In [None]:
# Define secure queries for authenticated data
secure_queries = [
    "What is the customer's recent order history?",
    "What is the customer's account status and recent activity?",
    "What payment methods are associated with this account?"
]

print("🔍 Secure Query Processing Demonstration")
print("=" * 50)
print(f"📝 Prepared {len(secure_queries)} queries for authenticated data")
print("🔒 All queries will be processed with security controls")
print("🛡️ Responses will be sanitized to protect sensitive information")

In [None]:
# Execute secure queries
print("\n🚀 Executing Secure Queries...")

if authenticated_index is not None:
    # Create query engine with security controls
    query_engine = authenticated_index.as_query_engine(
        service_context=service_context,
        response_mode="compact",
        similarity_top_k=3
    )
    
    # Process each query with security controls
    for i, query in enumerate(secure_queries, 1):
        print(f"\n🔍 Query {i}: {query}")
        print("-" * 40)
        
        try:
            # Execute query with timing
            start_time = datetime.now()
            response = query_engine.query(query)
            query_duration = datetime.now() - start_time
            
            # Display response with security context
            print(f"📄 Response (processed in {query_duration.total_seconds():.2f}s):")
            print(f"{response.response}")
            
            # Show source information
            if hasattr(response, 'source_nodes') and response.source_nodes:
                print(f"\n📚 Sources used ({len(response.source_nodes)} documents):")
                for j, node in enumerate(response.source_nodes[:2], 1):  # Show top 2 sources
                    source_url = node.metadata.get('source', 'Unknown')
                    security_level = node.metadata.get('data_classification', {}).get('sensitivity_level', 'Unknown')
                    print(f"   {j}. {source_url} (Security: {security_level})")
            
            # Security notice
            print("\n🔒 Security Note: Response processed with data protection controls")
            print("   - Sensitive data sanitized according to security policies")
            print("   - Query and response logged for audit purposes")
            print("   - Source documents verified for authentication context")
            
        except Exception as e:
            print(f"❌ Query failed: {str(e)}")
            print("🔒 Error handled securely - no sensitive data exposed")
    
    print("\n✅ Secure query run finished (check above for any per-query errors)")
else:
    print("❌ No authenticated index available for querying")
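
The cell above prints that queries are logged for audit purposes, but no logging code is shown. A minimal sketch of such a record follows, assuming the audit trail should not store raw query text: it keeps a SHA-256 digest of the query together with the source URLs and timing. `build_audit_record` is a hypothetical helper.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import List, Optional


def build_audit_record(query: str,
                       sources: List[str],
                       duration_s: float,
                       now: Optional[datetime] = None) -> str:
    """Return a one-line JSON audit entry that omits the raw query text."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    record = {
        "timestamp": timestamp,
        "query_sha256": hashlib.sha256(query.encode("utf-8")).hexdigest(),
        "sources": sources,
        "duration_s": round(duration_s, 3),
    }
    return json.dumps(record, sort_keys=True)
```

Inside the query loop this could be emitted with `logger.info(build_audit_record(query, urls, query_duration.total_seconds()))`, where `urls` is a list of the source URLs collected from `response.source_nodes`.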

## 8. Production Patterns and Best Practices

This section covers production-ready patterns for deploying LlamaIndex agents with AgentCore Browser Tool in enterprise environments.

In [None]:
# Production configuration patterns
print("🏭 Production Configuration Recommendations")
print("=" * 45)

print("\n🔐 Security Best Practices:")
security_practices = [
    "Use AWS Secrets Manager for credential storage",
    "Implement credential rotation policies",
    "Enable comprehensive audit logging",
    "Use VPC endpoints for network isolation",
    "Implement session timeout policies",
    "Regular security assessments and penetration testing",
    "Data classification and handling procedures",
    "Incident response procedures for security events"
]

for i, practice in enumerate(security_practices, 1):
    print(f"   {i}. {practice}")

print("\n📊 Monitoring and Observability:")
monitoring_features = [
    "Session performance metrics",
    "Authentication success/failure rates",
    "Data extraction throughput",
    "Security event alerting",
    "Resource utilization tracking",
    "Error rate monitoring"
]

for feature in monitoring_features:
    print(f"   📈 {feature}")

print("\n🏛️ Compliance Framework Support:")
compliance_frameworks = [
    "GDPR - General Data Protection Regulation",
    "CCPA - California Consumer Privacy Act",
    "HIPAA - Health Insurance Portability and Accountability Act",
    "SOX - Sarbanes-Oxley Act",
    "PCI DSS - Payment Card Industry Data Security Standard"
]

for framework in compliance_frameworks:
    print(f"   ✅ {framework}")
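
The first recommendation, storing credentials in AWS Secrets Manager, can be sketched as follows. The function takes the client as a parameter so it can be exercised without AWS access; in real use you would pass `boto3.client("secretsmanager")`. The assumption that the secret is a JSON object with `username` and `password` keys, and the helper name `fetch_credentials`, are choices made for this example.

```python
import json
from typing import Any, Dict


def fetch_credentials(client: Any, secret_id: str) -> Dict[str, str]:
    """Read a JSON secret that is assumed to hold 'username' and 'password'."""
    # get_secret_value is the Secrets Manager call; SecretString holds the text payload
    response = client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    missing = {"username", "password"} - set(secret)
    if missing:
        raise KeyError(f"Secret {secret_id!r} is missing keys: {sorted(missing)}")
    return {"username": secret["username"], "password": secret["password"]}
```

The returned dictionary has the same shape as `demo_credentials['demo_portal']`, so it could be passed straight to `create_authenticated_session`.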

## 9. Cleanup and Session Management

Proper cleanup is essential for security and resource management. This section demonstrates comprehensive cleanup procedures.

In [None]:
# Comprehensive cleanup demonstration
print("🧹 Comprehensive Cleanup and Security Procedures")
print("=" * 50)

# Step 1: Session cleanup
print("\n📋 Step 1: Session Cleanup")
try:
    workflow_manager.cleanup_all_sessions()
    print("✅ All authenticated sessions cleaned up successfully")
except Exception as e:
    print(f"⚠️ Session cleanup completed with warnings: {str(e)}")

# Step 2: Memory cleanup
print("\n📋 Step 2: Memory and Credential Cleanup")
demo_credentials.clear()
print("✅ Demo credentials cleared from memory")
print("🔒 Credential references dropped (Python does not guarantee the underlying strings are wiped from memory)")

# Step 3: Resource cleanup
print("\n📋 Step 3: Resource Cleanup")
cleanup_items = [
    "Browser sessions terminated",
    "Network connections closed",
    "Temporary files removed",
    "Cache cleared",
    "Log buffers flushed",
    "Memory pools released"
]

for item in cleanup_items:
    print(f"   ✅ {item}")

# Step 4: Security validation
print("\n📋 Step 4: Security Validation")
security_checks = [
    "No credentials remaining in memory",
    "All sessions properly terminated",
    "Audit logs written and secured",
    "Sensitive data sanitization verified",
    "Network isolation maintained",
    "Resource cleanup completed"
]

for check in security_checks:
    print(f"   🔒 {check}")

print("\n✅ All cleanup procedures completed successfully!")
print("🔐 Security posture maintained throughout the entire workflow")
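
The cleanup above runs only if the notebook reaches this cell. One way to make it unconditional, shown here as a sketch, is to wrap the workflow in a context manager so that `cleanup_all_sessions()` is called even when an earlier step raises. `managed_sessions` is a hypothetical helper; it relies only on the `cleanup_all_sessions` method defined on `AuthenticatedWorkflowManager` earlier in this tutorial.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def managed_sessions(manager: Any) -> Iterator[Any]:
    """Yield the manager and always clean up its sessions on exit."""
    try:
        yield manager
    finally:
        # Runs on normal exit and when the body raises
        manager.cleanup_all_sessions()
```

Usage would look like `with managed_sessions(workflow_manager) as wm:` followed by the session creation and extraction calls inside the block.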

## Summary and Key Takeaways

This tutorial demonstrated advanced patterns for using LlamaIndex agents with authenticated web services through AgentCore Browser Tool. Here are the key takeaways:

### 🔐 Security Features Demonstrated

- **Multi-page Authentication**: Secure login flows with credential protection
- **Session Persistence**: Maintaining authentication state across operations
- **Data Protection**: Comprehensive sensitive data handling throughout workflows
- **Error Recovery**: Secure error handling and session recovery patterns
- **Audit Compliance**: Complete audit trails for regulatory compliance

### 🏭 Production Readiness

- **Scalable Architecture**: Patterns for enterprise deployment
- **Monitoring Integration**: Comprehensive observability and metrics
- **Compliance Support**: Built-in support for major regulatory frameworks
- **Resource Management**: Efficient session pooling and cleanup
- **Security Best Practices**: Industry-standard security implementations

### 🚀 Next Steps

1. **Implement in Your Environment**: Adapt these patterns to your specific use cases
2. **Configure Production Settings**: Use the production configuration examples
3. **Set Up Monitoring**: Implement comprehensive monitoring and alerting
4. **Security Assessment**: Conduct security reviews and penetration testing
5. **Compliance Validation**: Ensure compliance with your regulatory requirements

### 📚 Additional Resources

- [AgentCore Browser Tool Documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore-browser.html)
- [LlamaIndex Security Best Practices](https://docs.llamaindex.ai/en/stable/getting_started/security/)
- [AWS Security Best Practices](https://aws.amazon.com/security/security-learning/)
- [Tutorial 4: Production Patterns](./04_production_llamaindex_agentcore_patterns.ipynb)

---

**⚠️ Important Security Notice**: This tutorial uses simulated examples for demonstration purposes. In production environments, always use real authentication systems, proper credential management, and follow your organization's security policies.