# Real AI Agent Integration with XML Analysis Framework

This notebook demonstrates **actual** AI agent integration using LangChain and the XML Analysis Framework. The AI cells make real API calls and show actual model responses; the framework cells only structure data and involve no AI.

**Key Distinctions:**
- 🤖 **AI-Generated**: Content created by actual LLM API calls (clearly marked)
- 📊 **Framework-Prepared**: Data structured by our framework for AI consumption (not AI-generated)

**Requirements:**
- OpenAI API key (configured in .env file)
- Internet connection for API calls
- All packages installed from the setup cell
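
The requirements above assume the key is reachable through the `OPENAI_API_KEY` environment variable. As a minimal sketch, the check below confirms that a key is present without echoing any part of it. `describe_api_key` is a hypothetical helper introduced for illustration; it is not part of LangChain or the XML Analysis Framework.

```python
import os


def describe_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Report whether an API key is set without revealing any of its characters."""
    value = os.getenv(env_var, "").strip()
    if not value:
        return f"{env_var} is not set"
    return f"{env_var} is set ({len(value)} characters)"


print(describe_api_key())
```

Reporting only the length keeps the key out of saved notebook outputs, which are often committed or shared.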

## Setup and Installation

In [1]:
# Install required packages
%pip install xml-analysis-framework==1.2.12 --upgrade -q --force-reinstall --no-cache-dir
%pip install langchain --upgrade -q --force-reinstall --no-cache-dir
%pip install langchain-openai --upgrade -q --force-reinstall --no-cache-dir
%pip install langchain-community --upgrade -q --force-reinstall --no-cache-dir
%pip install python-dotenv --upgrade -q --force-reinstall --no-cache-dir

import xml_analysis_framework as xaf
import json
from pathlib import Path
from datetime import datetime
import os

print(f"XML Analysis Framework version: {xaf.__version__}")

Note: you may need to restart the kernel to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python-headless 4.12.0.88 requires numpy<2.3.0,>=2; python_version >= "3.9", but you have numpy 2.3.2 which is incompatible.
XML Analysis Framework version: 1.2.12


## Real AI Setup (LangChain + OpenAI)

This section sets up actual AI components that will make real API calls.

In [2]:
# Configure model selection and setup
# Load environment variables
from dotenv import load_dotenv
load_dotenv()  # This loads variables from .env file

from langchain.agents import initialize_agent, Tool
from langchain_openai import OpenAI, ChatOpenAI  # For OpenAI models
from langchain.memory import ConversationBufferMemory
from langchain.schema import SystemMessage

# ========================================
# MODEL SELECTION - CHANGE THIS SECTION
# ========================================

# Option 1: OpenAI Models (current)
MODEL_PROVIDER = "openai"  # Options: "openai", "anthropic", "local"

# OpenAI Configuration
OPENAI_CONFIG = {
    "model_name": "gpt-4o",  # Options: "gpt-4o", "gpt-4", "gpt-4-turbo-preview", "gpt-3.5-turbo", "gpt-3.5-turbo-16k"
    "temperature": 0.1,
    "max_tokens": 4096,
    "use_chat_model": True,  # Use ChatOpenAI for better chat-based interactions
}

# Model Context Windows (for reference)
CONTEXT_WINDOWS = {
    # OpenAI
    "gpt-4": "8k tokens",
    "gpt-4-turbo-preview": "128k tokens",
    "gpt-3.5-turbo": "4k tokens", 
    "gpt-3.5-turbo-16k": "16k tokens",
    # Anthropic
    "claude-3-opus-20240229": "200k tokens",
    "claude-3-sonnet-20240229": "200k tokens",
    "claude-3-haiku-20240307": "200k tokens",
    # Local/Open models
    "llama-2-70b": "4k tokens",
    "mixtral-8x7b": "32k tokens",
    "deepseek-coder": "16k tokens"
}

# ========================================
# Initialize based on provider
# ========================================

llm = None

if MODEL_PROVIDER == "openai":
    openai_api_key = os.getenv("OPENAI_API_KEY")
    
    if openai_api_key:
        print(f"✅ OpenAI API key loaded: {openai_api_key[:8]}...")
        
        if OPENAI_CONFIG["use_chat_model"]:
            # Use ChatOpenAI for better chat interactions
            llm = ChatOpenAI(
                model_name=OPENAI_CONFIG["model_name"],
                temperature=OPENAI_CONFIG["temperature"],
                max_tokens=OPENAI_CONFIG["max_tokens"],
                openai_api_key=openai_api_key
            )
        else:
            # Use standard OpenAI
            llm = OpenAI(
                model_name=OPENAI_CONFIG["model_name"],
                temperature=OPENAI_CONFIG["temperature"],
                max_tokens=OPENAI_CONFIG["max_tokens"],
                openai_api_key=openai_api_key
            )
        
        print("✅ OpenAI LLM initialized!")
        print(f"📊 Using model: {OPENAI_CONFIG['model_name']}")
        print(f"📏 Context window: {CONTEXT_WINDOWS.get(OPENAI_CONFIG['model_name'], 'Unknown')}")
        print(f"🌡️ Temperature: {OPENAI_CONFIG['temperature']}")
        print(f"📝 Max output tokens: {OPENAI_CONFIG['max_tokens']}")
    else:
        print("⚠️ OPENAI_API_KEY not found. Please check your .env file.")

elif MODEL_PROVIDER == "anthropic":
    # Example for Anthropic Claude (requires: pip install langchain-anthropic)
    print("ℹ️ To use Anthropic Claude:")
    print("1. Install: pip install langchain-anthropic")
    print("2. Add to .env: ANTHROPIC_API_KEY=your_key")
    print("3. Update imports and initialization:")
    print("""
from langchain_anthropic import ChatAnthropic

anthropic_api_key = os.getenv("ANTHROPIC_API_KEY")
llm = ChatAnthropic(
    model="claude-3-opus-20240229",
    temperature=0.1,
    max_tokens=4096,
    anthropic_api_key=anthropic_api_key
)
""")

elif MODEL_PROVIDER == "local":
    # Example for local models (Ollama, llama.cpp, etc.)
    print("ℹ️ To use local models:")
    print("1. Install: pip install langchain-community")
    print("2. Run local model server (e.g., Ollama)")
    print("3. Update initialization:")
    print("""
from langchain_community.llms import Ollama

llm = Ollama(
    model="llama2:70b",  # or "mixtral", "deepseek-coder", etc.
    temperature=0.1
)
""")

# Initialize memory (works with any LLM)
if llm:
    memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
    print("✅ Memory buffer initialized!")
else:
    print("❌ No LLM initialized. Check your configuration above.")

✅ OpenAI API key loaded: sk-proj-...
✅ OpenAI LLM initialized!
📊 Using model: gpt-4o
📏 Context window: Unknown
🌡️ Temperature: 0.1
📝 Max output tokens: 4096
✅ Memory buffer initialized!


## Framework-Based Analysis Tool

Create a tool that uses our XML Analysis Framework to prepare data for AI consumption. This tool itself doesn't use AI - it structures XML data for AI processing.
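
To make "structuring XML data" concrete, here is a minimal sketch that gathers a few structural facts with the standard library only. It is not the framework's implementation, and its definitions (for example, how depth is counted) are assumptions that may differ from what `xaf.analyze_schema` reports. `summarize_xml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def summarize_xml(xml_text: str) -> dict:
    """Collect simple structural facts about an XML string (illustrative only)."""
    root = ET.fromstring(xml_text)

    def depth(element) -> int:
        # Depth of the subtree rooted at `element`, counting the element itself as 1.
        return 1 + max((depth(child) for child in element), default=0)

    def namespace_of(tag: str) -> str:
        # ElementTree writes namespaced tags as "{uri}local"; return the uri or "".
        return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""

    elements = list(root.iter())
    namespaces = {namespace_of(e.tag) for e in elements} - {""}
    return {
        "root_element": root.tag.split("}", 1)[-1],
        "total_elements": len(elements),
        "max_depth": depth(root),
        "namespace_count": len(namespaces),
    }


sample = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark><name>Portland</name></Placemark>
  </Document>
</kml>"""
print(summarize_xml(sample))
```

Nothing here calls a model: the result is deterministic metadata, which is the sense in which the data is "framework-prepared" rather than AI-generated.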

In [3]:
class XMLAnalysisFrameworkTool:
    """📊 Framework-Prepared: Uses XML Analysis Framework to structure data for AI consumption"""
    
    def __init__(self):
        self.framework = xaf
        self.analysis_history = []
    
    def analyze_document_for_ai(self, file_path: str) -> dict:
        """Prepare comprehensive XML analysis for AI agent consumption"""
        try:
            # Use our framework to structure the data
            result = self.framework.analyze(file_path)
            enhanced = self.framework.analyze_enhanced(file_path) 
            schema = self.framework.analyze_schema(file_path)
            chunks = self.framework.chunk(file_path, strategy="auto")
            
            # Structure the data for AI consumption (this is NOT AI-generated)
            structured_analysis = {
                'document_summary': {
                    'file_path': file_path,
                    'document_type': result['document_type'].type_name,
                    'confidence': result['document_type'].confidence,
                    'handler_used': result['handler_used']
                },
                'technical_details': {
                    'total_elements': schema.total_elements,
                    'max_depth': schema.max_depth,
                    'root_element': schema.root_element,
                    'namespace_count': len(schema.namespaces)
                },
                'ai_ready_insights': {
                    'identified_use_cases': enhanced.ai_use_cases,
                    'key_findings': enhanced.key_findings,
                    'quality_score': enhanced.quality_metrics,
                    'extracted_data': enhanced.structured_data
                },
                'processing_results': {
                    'chunk_count': len(chunks),
                    'total_tokens': sum(c.token_estimate for c in chunks),
                    'sample_chunks': [
                        {
                            'id': c.chunk_id,
                            'path': c.element_path,
                            'content_preview': c.content[:200] + "..." if len(c.content) > 200 else c.content,
                            'token_count': c.token_estimate
                        } for c in chunks[:3]  # First 3 chunks only
                    ]
                }
            }
            
            self.analysis_history.append(structured_analysis)
            return structured_analysis
            
        except Exception as e:
            return {'error': str(e), 'file_path': file_path}
    
    def get_analysis_summary_text(self, analysis: dict) -> str:
        """Create human-readable summary of framework analysis (NOT AI-generated)"""
        if 'error' in analysis:
            return f"Error analyzing {analysis['file_path']}: {analysis['error']}"
        
        summary = f"""Document Analysis Summary (Framework-Generated):

File: {analysis['document_summary']['file_path']}
Type: {analysis['document_summary']['document_type']} 
Confidence: {analysis['document_summary']['confidence']:.1%}
Handler: {analysis['document_summary']['handler_used']}

Structure:
- Elements: {analysis['technical_details']['total_elements']}
- Max Depth: {analysis['technical_details']['max_depth']}
- Root: {analysis['technical_details']['root_element']}
- Namespaces: {analysis['technical_details']['namespace_count']}

AI Processing Ready:
- Generated {analysis['processing_results']['chunk_count']} chunks
- Total tokens: {analysis['processing_results']['total_tokens']}
- Identified use cases: {len(analysis['ai_ready_insights']['identified_use_cases'])}
"""
        return summary

# Initialize the framework tool
xml_tool = XMLAnalysisFrameworkTool()
print("✅ XML Analysis Framework Tool initialized (prepares data for AI consumption)")

✅ XML Analysis Framework Tool initialized (prepares data for AI consumption)


## 📊 Framework Analysis (Data Preparation)

Run our framework analysis to prepare structured data for AI consumption.

In [4]:
# Analyze our test KML file using the framework
test_file = "data/mapbox-example.kml"

# Check if file exists first
import os
if not os.path.exists(test_file):
    print(f"⚠️ Test file not found: {test_file}")
    print("Available files in data directory:")
    if os.path.exists("data"):
        for f in os.listdir("data"):
            if f.endswith('.xml') or f.endswith('.kml'):
                print(f"  - {f}")
        # Use the first available file
        available_files = [f for f in os.listdir("data") if f.endswith(('.xml', '.kml'))]
        if available_files:
            test_file = f"data/{available_files[0]}"
            print(f"Using: {test_file}")
    else:
        print("Data directory not found. Creating a simple test...")
        # Create a simple XML for testing
        test_file = "simple_test.xml"
        with open(test_file, 'w') as f:
            f.write('''<?xml version="1.0" encoding="UTF-8"?>
<root>
    <item id="1">
        <name>Test Item 1</name>
        <description>This is a test description</description>
    </item>
    <item id="2">
        <name>Test Item 2</name>
        <description>Another test description</description>
    </item>
</root>''')
        print(f"Created simple test file: {test_file}")

print("📊 Framework-Prepared: Running XML analysis to structure data for AI...")
analysis = xml_tool.analyze_document_for_ai(test_file)

# Display the structured analysis (prepared by framework, not AI)
if 'error' not in analysis:
    print("✅ Framework analysis completed successfully!")
    print("\n📊 Framework-Prepared Data Structure:")
    print(json.dumps(analysis['document_summary'], indent=2))
    print(f"\nGenerated {analysis['processing_results']['chunk_count']} chunks for AI processing")
else:
    print(f"❌ Framework analysis failed: {analysis['error']}")

📊 Framework-Prepared: Running XML analysis to structure data for AI...
File size: 0.0 MB
Using iterative parsing for large file: data/mapbox-example.kml
✅ Framework analysis completed successfully!

📊 Framework-Prepared Data Structure:
{
  "file_path": "data/mapbox-example.kml",
  "document_type": "KML Geographic Data",
  "confidence": 0.95,
  "handler_used": "KMLHandler"
}

Generated 23 chunks for AI processing


## 📊 Framework Summary (Non-AI)

In [5]:
# Generate a framework-based summary (NOT AI-generated)
print("📊 Framework-Prepared Summary (NOT AI-generated):")
print("=" * 50)
summary = xml_tool.get_analysis_summary_text(analysis)
print(summary)

📊 Framework-Prepared Summary (NOT AI-generated):
Document Analysis Summary (Framework-Generated):

File: data/mapbox-example.kml
Type: KML Geographic Data 
Confidence: 95.0%
Handler: KMLHandler

Structure:
- Elements: 24
- Max Depth: 6
- Root: kml
- Namespaces: 1

AI Processing Ready:
- Generated 23 chunks
- Total tokens: 2289
- Identified use cases: 9



## 🤖 Real AI Agent Integration

Now we create a real LangChain agent that can actually analyze our framework-prepared data using AI.
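
The use-case helpers defined in the following cell cap prompt content with a plain character slice (`chunks_content[:self.max_context_chars]`), which can cut the last chunk mid-element. As a hedged alternative, the sketch below packs whole chunks until a character budget is reached. `pack_chunks` is a hypothetical helper, not part of the framework or LangChain, and it works on plain strings such as each chunk's `content`.

```python
def pack_chunks(chunk_texts: list[str], max_chars: int,
                separator: str = "\n\n=== CHUNK ===\n\n") -> str:
    """Join whole chunks in order until adding another would exceed max_chars."""
    selected: list[str] = []
    used = 0
    for text in chunk_texts:
        # The separator is only needed between chunks, not before the first one.
        extra = len(text) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break  # stop at a chunk boundary instead of slicing mid-chunk
        selected.append(text)
        used += extra
    return separator.join(selected)


print(pack_chunks(["<a/>", "<b/>", "<c/>"], max_chars=9, separator="|"))
```

Stopping at a boundary means the model never sees a half-closed element, at the cost of dropping the chunks that do not fit.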

In [6]:
# Create AI tools that work WITH framework analysis, not replace it
from langchain.tools import Tool

def xml_analysis_tool_func(file_path: str) -> str:
    """📊 Framework-Prepared: Analyze XML and structure for AI consumption"""
    analysis = xml_tool.analyze_document_for_ai(file_path)
    if 'error' in analysis:
        return f"Error analyzing {file_path}: {analysis['error']}"
    
    # Return structured framework analysis for AI to work with
    return xml_tool.get_analysis_summary_text(analysis)

# Define the tool for LangChain that provides framework analysis
xml_analysis_tool = Tool(
    name="XML_Framework_Analysis",
    func=xml_analysis_tool_func,
    description="Get structured XML document analysis including document type, metadata, chunks, and suggested AI use cases. Input: file path."
)

# Helper function to extract content from LangChain responses
def extract_llm_content(response):
    """Extract text content from various LangChain response types"""
    if hasattr(response, 'content'):
        # AIMessage object
        return response.content
    elif hasattr(response, 'text'):
        # Some models return .text
        return response.text
    elif isinstance(response, str):
        # Already a string
        return response
    elif isinstance(response, dict) and 'content' in response:
        # Dictionary response
        return response['content']
    else:
        # Fallback
        return str(response)

# Create use-case-specific AI tools optimized for modern 128k+ context models
class UseCase_SpecificAI:
    """🤖 AI-Generated responses for specific use cases - optimized for large context models"""
    
    def __init__(self, llm):
        self.llm = llm
        self.max_context_chars = 100000  # ~25k tokens, well within 128k limit
        
    def geospatial_analysis(self, chunks_content: str, specific_task: str) -> str:
        """🤖 AI for geospatial use cases - uses full document context"""
        # With 128k context, we can analyze much more content at once
        prompt = f"""
You are an expert in geospatial analysis. Based on the comprehensive XML content below, provide detailed insights for: {specific_task}

XML Content (full document sections):
{chunks_content[:self.max_context_chars]}

Please provide a detailed JSON response with this structure:
{{
    "task": "{specific_task}",
    "insights": ["detailed insight 1", "detailed insight 2", "detailed insight 3", "...more as relevant"],
    "patterns_identified": ["pattern 1", "pattern 2", "..."],
    "recommendations": ["specific recommendation 1", "specific recommendation 2", "..."],
    "extracted_features": {{
        "locations": [{{
            "name": "location name",
            "coordinates": {{"lat": 0.0, "lon": 0.0}},
            "type": "feature type",
            "attributes": {{}}
        }}],
        "relationships": ["relationship 1", "relationship 2"]
    }},
    "summary": "comprehensive summary of findings",
    "confidence": 0.95
}}
"""
        try:
            response = self.llm.invoke(prompt)
            return extract_llm_content(response)
        except Exception as e:
            # json.dumps escapes quotes in the message so the result stays valid JSON
            return json.dumps({"error": str(e)})
    
    def security_compliance_analysis(self, chunks_content: str, specific_task: str) -> str:
        """🤖 AI for security and compliance - comprehensive analysis"""
        prompt = f"""
You are a security compliance expert. Based on the comprehensive XML content below, provide detailed analysis for: {specific_task}

XML Content (full document):
{chunks_content[:self.max_context_chars]}

Please provide a comprehensive JSON response with this structure:
{{
    "task": "{specific_task}",
    "security_findings": [
        {{"finding": "description", "severity": "low|medium|high|critical", "evidence": "specific XML elements"}}
    ],
    "compliance_status": {{
        "overall": "compliant|non-compliant|partially-compliant",
        "by_standard": {{"standard_name": "status"}}
    }},
    "vulnerabilities": [
        {{"type": "vulnerability type", "description": "details", "mitigation": "recommended action"}}
    ],
    "risk_assessment": {{
        "overall_risk": "low|medium|high",
        "risk_factors": ["factor 1", "factor 2"],
        "risk_score": 0.0
    }},
    "recommendations": [
        {{"priority": "high|medium|low", "action": "specific action", "rationale": "why this matters"}}
    ],
    "confidence": 0.95
}}
"""
        try:
            response = self.llm.invoke(prompt)
            return extract_llm_content(response)
        except Exception as e:
            # json.dumps escapes quotes in the message so the result stays valid JSON
            return json.dumps({"error": str(e)})
    
    def dependency_analysis(self, chunks_content: str, specific_task: str) -> str:
        """🤖 AI for dependency analysis - full graph analysis"""
        prompt = f"""
You are a software dependency expert. Based on the comprehensive XML content below, provide detailed analysis for: {specific_task}

XML Content (full dependency tree):
{chunks_content[:self.max_context_chars]}

Please provide a comprehensive JSON response with this structure:
{{
    "task": "{specific_task}",
    "dependency_graph": {{
        "total_dependencies": 0,
        "direct_dependencies": ["dep1", "dep2"],
        "transitive_dependencies": ["dep3", "dep4"],
        "dependency_tree": {{}}
    }},
    "security_analysis": {{
        "vulnerable_dependencies": [
            {{"name": "dep", "version": "1.0", "vulnerabilities": ["CVE-..."], "severity": "high"}}
        ],
        "outdated_dependencies": [
            {{"name": "dep", "current": "1.0", "latest": "2.0", "type": "major|minor|patch"}}
        ]
    }},
    "compatibility_analysis": {{
        "conflicts": ["conflict description"],
        "version_constraints": {{"dep": "constraint"}},
        "compatibility_score": 0.95
    }},
    "recommendations": [
        {{"type": "upgrade|remove|replace", "dependency": "name", "action": "specific action", "impact": "expected impact"}}
    ],
    "confidence": 0.95
}}
"""
        try:
            response = self.llm.invoke(prompt)
            return extract_llm_content(response)
        except Exception as e:
            # json.dumps escapes quotes in the message so the result stays valid JSON
            return json.dumps({"error": str(e)})
    
    def comprehensive_document_analysis(self, full_analysis: dict, all_chunks: list, specific_focus: str = None) -> str:
        """🤖 Analyze entire document with all chunks - leverages large context window"""
        # Combine all chunks for comprehensive analysis
        full_content = "\n\n=== CHUNK BOUNDARY ===\n\n".join([
            f"Chunk {i+1} (Path: {chunk.element_path}):\n{chunk.content}" 
            for i, chunk in enumerate(all_chunks)
        ])
        
        # Use full document metadata from framework
        doc_type = full_analysis['document_summary']['document_type']
        use_cases = full_analysis['ai_ready_insights']['identified_use_cases']
        
        prompt = f"""
You are analyzing a complete {doc_type} document. The framework has identified these potential use cases:
{', '.join(use_cases)}

{"Focus on: " + specific_focus if specific_focus else "Provide comprehensive analysis."}

Full Document Content ({len(all_chunks)} chunks):
{full_content[:self.max_context_chars]}

Provide a comprehensive JSON analysis covering all relevant aspects for the identified use cases.
Include specific examples from the document content to support your analysis.
"""
        try:
            response = self.llm.invoke(prompt)
            return extract_llm_content(response)
        except Exception as e:
            # json.dumps escapes quotes in the message so the result stays valid JSON
            return json.dumps({"error": str(e), "content_size": len(full_content)})

# Initialize use-case specific AI if LLM is available
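# `llm` is always defined (it starts as None), so also check that it was actually initialized
if 'llm' in locals() and llm is not None: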
    usecase_ai = UseCase_SpecificAI(llm)
    
    # Create the main agent with framework analysis tool
    from langchain.agents import initialize_agent, AgentType
    
    tools = [xml_analysis_tool]
    
    agent = initialize_agent(
        tools=tools,
        llm=llm,
        agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
        verbose=True,
        handle_parsing_errors=True
    )
    
    print("✅ AI system created (optimized for 128k+ context models):")
    print("  📊 Framework analysis tool (structures XML data)")
    print("  🤖 Use-case specific AI tools (large context aware)")
    print("  📄 Full document analysis capability")
    print(f"  💾 Max content size: {usecase_ai.max_context_chars:,} characters")
    print("  🔧 Response handler for various LangChain formats")
else:
    print("❌ LLM not initialized. Check your API key setup.")

✅ AI system created (optimized for 128k+ context models):
  📊 Framework analysis tool (structures XML data)
  🤖 Use-case specific AI tools (large context aware)
  📄 Full document analysis capability
  💾 Max content size: 100,000 characters
  🔧 Response handler for various LangChain formats


## 🤖 Framework-Guided AI Analysis

The framework identifies the document type and suggests AI use cases. Now we use AI to work with that structured data.
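
One way to connect the framework's suggested use cases to a specific analysis routine is a small keyword router, sketched below. The routine names match the methods of the `UseCase_SpecificAI` class defined earlier; the keyword lists and the `pick_analysis` helper are assumptions made for illustration.

```python
def pick_analysis(use_cases: list[str]) -> str:
    """Map framework-suggested use cases to the name of an analysis routine."""
    # (keywords, routine name) pairs; the first pair with a match wins.
    routing = [
        (("geospatial", "geographic", "location"), "geospatial_analysis"),
        (("security", "compliance", "vulnerab"), "security_compliance_analysis"),
        (("dependency", "dependencies"), "dependency_analysis"),
    ]
    lowered = [uc.lower() for uc in use_cases]
    for keywords, routine in routing:
        if any(k in uc for uc in lowered for k in keywords):
            return routine
    # Nothing matched: fall back to the general-purpose analysis.
    return "comprehensive_document_analysis"


print(pick_analysis(["Geospatial pattern recognition", "Route optimization and analysis"]))
```

The returned name could then be resolved with `getattr(usecase_ai, name)`, keeping the routing table separate from the prompts themselves.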

In [11]:
# Step 1: Framework analysis (not AI - just data structuring)
print("📊 Framework Analysis Results:")
print("=" * 50)
framework_analysis = xml_tool.analyze_document_for_ai(test_file)

# Step 2: Get framework-prepared content for AI to work with
if 'error' not in framework_analysis:
    # Read result keys only after confirming the analysis succeeded
    print(f"Document Type: {framework_analysis['document_summary']['document_type']}")
    print(f"Suggested AI Use Cases: {framework_analysis['ai_ready_insights']['identified_use_cases']}")
    print(f"Available chunks: {framework_analysis['processing_results']['chunk_count']}")

    # Create chunks using framework
    print("\n📊 Creating chunks using auto strategy...")
    all_chunks = xaf.chunk(test_file, strategy="auto")
    print(f"✅ Created {len(all_chunks)} chunks successfully")
    
    # With 128k context, we can use ALL chunks!
    total_content_size = sum(len(chunk.content) for chunk in all_chunks)
    total_tokens = sum(chunk.token_estimate for chunk in all_chunks)
    print(f"📊 Total content: {total_content_size:,} characters (~{total_tokens:,} tokens)")
    
    # Combine all chunks for comprehensive analysis
    chunks_content = "\n\n=== CHUNK ===\n\n".join([chunk.content for chunk in all_chunks])
    
    print(f"📊 Framework prepared ALL {len(all_chunks)} chunks for AI analysis")
    
    # Step 3: Use AI for comprehensive analysis with large context
    if 'usecase_ai' in locals():
        ai_use_cases = framework_analysis['ai_ready_insights']['identified_use_cases']
        
        print("\n🤖 AI Analysis of Framework-Prepared Data (Full Document):")
        print("=" * 50)
        
        # Use comprehensive document analysis
        print("🤖 Running comprehensive document analysis...")
        comprehensive_result = usecase_ai.comprehensive_document_analysis(
            framework_analysis, 
            all_chunks,
            specific_focus="geospatial patterns and features"
        )
        print("Comprehensive analysis result:")
        print(comprehensive_result[:1000] + "..." if len(str(comprehensive_result)) > 1000 else comprehensive_result)
        
        # Example: Full geospatial analysis with all content
        if any('geospatial' in uc.lower() or 'geographic' in uc.lower() for uc in ai_use_cases):
            print("\n🤖 Running full geospatial analysis (all chunks):")
            geo_analysis = usecase_ai.geospatial_analysis(chunks_content, "Comprehensive geospatial pattern recognition across entire document")
            # Truncate only when the response exceeds the 10,000-character display limit
            print(geo_analysis[:10000] + "..." if len(geo_analysis) > 10000 else geo_analysis)
            
    else:
        print("❌ Use-case specific AI not available")
else:
    print("❌ Framework analysis failed")

📊 Framework Analysis Results:
File size: 0.0 MB
Using iterative parsing for large file: data/mapbox-example.kml
Document Type: KML Geographic Data
Suggested AI Use Cases: ['Geospatial pattern recognition', 'Location clustering and classification', 'Route optimization and analysis', 'Geographic feature extraction', 'Spatial relationship discovery', 'Area and distance calculations', 'Terrain and elevation analysis', 'Geographic anomaly detection', 'Location-based recommendations']
Available chunks: 23

📊 Creating chunks using auto strategy...
✅ Created 23 chunks successfully
📊 Total content: 61,332 characters (~2,289 tokens)
📊 Framework prepared ALL 23 chunks for AI analysis

🤖 AI Analysis of Framework-Prepared Data (Full Document):
🤖 Running comprehensive document analysis...
Comprehensive analysis result:
To provide a comprehensive JSON analysis focusing on geospatial patterns and features from the KML document, we need to extract and analyze the relevant data points and structures. Th

## 📊 Vector Database Preparation (Framework-Only)

This section shows how to prepare the framework analysis for vector database ingestion. No AI is used here - just data structuring.
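
The target vector store is not specified here. If yours accepts only scalar metadata values, nested fields such as the use-case list or per-chunk metadata need flattening first. The sketch below shows one way to do that; `flatten_metadata` is a hypothetical helper, and serializing to JSON strings is an assumption, not a requirement of the framework.

```python
import json


def flatten_metadata(metadata: dict) -> dict:
    """Convert non-scalar metadata values to JSON strings; keep scalars as-is."""
    flat = {}
    for key, value in metadata.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            flat[key] = value
        else:
            # Lists, dicts and other objects become JSON text (str() as a fallback).
            flat[key] = json.dumps(value, default=str)
    return flat


example = {
    "document_type": "KML Geographic Data",
    "confidence": 0.95,
    "ai_use_cases": ["Geospatial pattern recognition", "Route optimization and analysis"],
    "chunk_metadata": {"depth": 3},
}
print(flatten_metadata(example))
```

Values serialized this way can be restored with `json.loads` when a document is read back from the store.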

In [8]:
def prepare_for_vector_db(analysis, chunks):
    """📊 Framework-Prepared: Structure data for vector database ingestion (NOT AI-generated)"""
    
    vector_docs = []
    
    # Document-level metadata (structured by framework)
    doc_metadata = {
        'document_type': analysis['document_summary']['document_type'],
        'confidence': analysis['document_summary']['confidence'],
        'handler_used': analysis['document_summary']['handler_used'],
        'total_elements': analysis['technical_details']['total_elements'],
        'max_depth': analysis['technical_details']['max_depth'],
        'ai_use_cases': analysis['ai_ready_insights']['identified_use_cases'],
        'processing_timestamp': datetime.now().isoformat()
    }
    
    # Create vector documents for each chunk (framework-structured)
    for chunk in chunks:
        vector_doc = {
            'id': f"{analysis['document_summary']['file_path']}_{chunk.chunk_id}",
            'content': chunk.content,
            'metadata': {
                **doc_metadata,
                'chunk_id': chunk.chunk_id,
                'element_path': chunk.element_path,
                'start_line': chunk.start_line,
                'end_line': chunk.end_line,
                'elements_included': chunk.elements_included,
                'token_estimate': chunk.token_estimate,
                'chunk_metadata': chunk.metadata
            }
        }
        vector_docs.append(vector_doc)
    
    return vector_docs

# Use the chunks created with the auto strategy in the earlier analysis cell
if 'all_chunks' in locals() and 'framework_analysis' in locals() and len(all_chunks) > 0:
    vector_docs = prepare_for_vector_db(framework_analysis, all_chunks)
    
    print("📊 Framework-Prepared Vector Database Documents:")
    print(f"Structured {len(vector_docs)} documents for vector database ingestion")
    
    print(f"\nSample document structure (framework-prepared):")
    print(json.dumps(vector_docs[0], indent=2, default=str)[:800] + "...")
    
    print(f"\nFirst chunk content preview:")
    print(f"ID: {vector_docs[0]['id']}")
    print(f"Content: {vector_docs[0]['content'][:200]}...")
    print(f"Token estimate: {vector_docs[0]['metadata']['token_estimate']}")
        
else:
    print("❌ Chunks or framework analysis not available. Run previous cells first.")
    if 'all_chunks' in locals():
        print(f"Available chunks: {len(all_chunks)}")
    else:
        print("all_chunks variable not found")

📊 Framework-Prepared Vector Database Documents:
Structured 23 documents for vector database ingestion

Sample document structure (framework-prepared):
{
  "id": "data/mapbox-example.kml_chunk_0_ac4e63fc",
  "content": "<ns0:kml xmlns:ns0=\"http://www.opengis.net/kml/2.2\">\n  <ns0:Document>\n    <ns0:Placemark>\n      <ns0:name>Portland</ns0:name>\n      <ns0:Point>\n        <ns0:coordinates>-122.681944,45.52,0</ns0:coordinates>\n      </ns0:Point>\n    </ns0:Placemark>\n    <ns0:Placemark>\n      <ns0:name>Rio de Janeiro</ns0:name>\n      <ns0:Point>\n        <ns0:coordinates>-43.196389,-22.908333,0</ns0:coordinates>\n      </ns0:Point>\n    </ns0:Placemark>\n    <ns0:Placemark>\n      <ns0:name>Istanbul</ns0:name>\n      <ns0:Point>\n        <ns0:coordinates>28.976018,41.01224,0</ns0:coordinates>\n      </ns0:Point>\n    </ns0:Placemark>\n    <ns0:Placemark>\n      <ns0:name>Reykjavik</ns0:name>\n      <ns0:Point>\n        <ns0:coordin...

First chunk content preview:
ID: data/mapbox

## 🤖 Structured AI Q&A System

Ask specific questions about the framework-prepared data using structured prompts.
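
Chat models often wrap a JSON answer in prose and a Markdown code fence, so a reply requested "as JSON" is not always directly parseable. The sketch below pulls the first JSON object out of such a reply. `parse_json_reply` is a hypothetical helper, and the fence-stripping heuristic is an assumption about typical replies, not behaviour guaranteed by any model or by LangChain.

```python
import json
import re

# Built programmatically so this example can itself sit inside a code fence.
FENCE = "`" * 3


def parse_json_reply(reply: str):
    """Return the first JSON object found in an LLM reply, or None if parsing fails."""
    # Prefer the contents of a fenced block (optionally tagged "json") when present.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, reply, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else reply
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return None


reply = "Here are the locations:\n" + FENCE + 'json\n{"name": "Portland", "lat": 45.52}\n' + FENCE
print(parse_json_reply(reply))
```

Returning `None` instead of raising lets the caller decide whether to retry the question or fall back to the raw text.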

In [12]:
# Create structured Q&A system that works with framework data
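# Require an initialized LLM plus the analysis and chunks produced by the earlier cells
if 'llm' in locals() and llm is not None and 'framework_analysis' in locals() and 'all_chunks' in locals():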
    
    def ask_structured_question(question: str, expected_format: str = "json") -> str:
        """🤖 AI-Generated: Ask specific questions with structured responses"""
        
        # Use framework-prepared metadata and chunks
        doc_type = framework_analysis['document_summary']['document_type']
        use_cases = framework_analysis['ai_ready_insights']['identified_use_cases']
        
        prompt = f"""
Based on the framework analysis below, answer the specific question with a structured response.

FRAMEWORK ANALYSIS:
Document Type: {doc_type}
Suggested AI Use Cases: {use_cases}
Content Chunks: {len(all_chunks)} chunks analyzed

CONTENT:
{chunks_content[:2000]}...

QUESTION: {question}

Please provide a {expected_format.upper()} response that directly answers the question based on the framework-analyzed content above.
"""
        try:
            response = llm.invoke(prompt)
            return extract_llm_content(response)
        except Exception as e:
            return f"Error: {e}"
    
    # Test structured questions based on framework insights
    questions = [
        {
            "question": "What geographic coordinates or locations can you extract from this data?",
            "format": "json"
        },
        {
            "question": "What would be the best way to index this data for spatial search?",
            "format": "structured text"
        },
        {
            "question": "How would you classify the geographic features found in this document?",
            "format": "json"
        }
    ]
    
    print("🤖 Structured AI Q&A (working with framework data):")
    print("=" * 60)
    
    for i, q in enumerate(questions, 1):
        print(f"\n**Question {i}:** {q['question']}")
        print(f"**Expected Format:** {q['format']}")
        
        # Create structured prompt for this question
        structured_prompt = f"""
You are analyzing XML data that has been processed by our framework. The framework identified this as a {framework_analysis['document_summary']['document_type']} document.

Content to analyze:
{chunks_content[:1500]}...

Question: {q['question']}

Please provide a {q['format']} response."""
        
        try:
            ai_response = llm.invoke(structured_prompt)
            ai_content = extract_llm_content(ai_response)  # Extract content from AIMessage
            print(f"**🤖 AI Response ({q['format']}):**")
            print(ai_content)
        except Exception as e:
            print(f"**Error:** {e}")
            
        print("-" * 40)
        
else:
    print("❌ LLM or framework analysis not available for Q&A.")

🤖 Structured AI Q&A (working with framework data):

**Question 1:** What geographic coordinates or locations can you extract from this data?
**Expected Format:** json
**🤖 AI Response (json):**
Based on the provided KML data, I can extract the geographic coordinates and locations as follows:

```json
{
  "locations": [
    {
      "name": "Portland",
      "coordinates": {
        "longitude": -122.681944,
        "latitude": 45.52,
        "altitude": 0
      }
    },
    {
      "name": "Rio de Janeiro",
      "coordinates": {
        "longitude": -43.196389,
        "latitude": -22.908333,
        "altitude": 0
      }
    },
    {
      "name": "Istanbul",
      "coordinates": {
        "longitude": 28.976018,
        "latitude": 41.01224,
        "altitude": 0
      }
    },
    {
      "name": "Reykjavik",
      "coordinates": {
        "longitude": -21.933333,
        "latitude": 64.133333,
        "altitude": 0
      }
    },
    {
      "name": "Simple Polygon",
      "polygon"

## 🚀 Advanced Use Cases with Large Context Models

A 128k+ token context window lets a single prompt carry most or all of the chunked document, which enables analysis patterns that span chunks.
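
Whether a document actually fits depends on the model and on how much room the instructions and the reply need. A minimal sketch of budgeting the content before a prompt is built, assuming a rough 4-characters-per-token heuristic (an approximation, not a tokenizer). The names `fit_content_to_budget`, `max_context_tokens`, and `reserve_tokens` are hypothetical and not part of the framework or LangChain.

```python
def fit_content_to_budget(content: str,
                          instructions: str,
                          max_context_tokens: int = 128_000,
                          reserve_tokens: int = 4_000,
                          chars_per_token: int = 4) -> str:
    """Trim content so that instructions + content + reply fit the context window.

    chars_per_token is a rough heuristic (assumption), not a real tokenizer.
    """
    budget_chars = (max_context_tokens - reserve_tokens) * chars_per_token
    available = budget_chars - len(instructions)
    if available <= 0:
        raise ValueError("Instructions alone exceed the context budget")
    # Only the content is trimmed; the instructions always survive intact.
    return content[:available]


# Tiny budget so the trimming is visible in a quick test
instructions = "Identify repeated patterns. Provide a JSON response."
content = "<Placemark>...</Placemark>" * 10
trimmed = fit_content_to_budget(content, instructions,
                                max_context_tokens=50, reserve_tokens=10)
prompt = f"Full Content:\n{trimmed}\n\n{instructions}"
```

Trimming the content rather than the finished prompt keeps the task instructions at the end of the prompt from being cut off.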

In [13]:
# Advanced patterns that leverage large context windows
if all(name in globals() for name in
       ('llm', 'usecase_ai', 'all_chunks', 'chunks_content', 'framework_analysis')):
    
    print("🚀 Advanced Large Context Analysis Patterns:")
    print("=" * 60)
    
    # 1. Cross-Reference Analysis
    print("\n1️⃣ Cross-Reference Analysis Across All Chunks:")
    cross_ref_prompt = f"""
Analyze the entire document and identify cross-references, relationships, and patterns across all chunks:

Document Type: {framework_analysis['document_summary']['document_type']}
Total Chunks: {len(all_chunks)}

Full Content:
{chunks_content[:usecase_ai.max_context_chars]}

Identify:
1. Repeated patterns across different sections
2. Related elements that reference each other
3. Hierarchical relationships in the data
4. Anomalies or inconsistencies across chunks

Provide JSON response with findings.
"""
    
    # 2. Multi-Stage Analysis Pipeline
    print("\n2️⃣ Multi-Stage Analysis Pipeline:")
    
    # Stage 1: Extract all entities
    entity_extraction_prompt = f"""
Extract ALL entities from this {framework_analysis['document_summary']['document_type']} document:

{chunks_content[:50000]}

Return JSON with categorized entities.
"""
    
    # Stage 2: Analyze relationships between entities
    relationship_prompt = """
Based on the entities found, analyze their relationships and create a knowledge graph structure.
"""
    
    # Stage 3: Generate insights
    insight_prompt = """
Based on the entity relationships, generate actionable insights and recommendations.
"""
    
    print("  - Stage 1: Entity extraction across all chunks")
    print("  - Stage 2: Relationship mapping")
    print("  - Stage 3: Insight generation")
    
    # 3. Comparative Analysis
    print("\n3️⃣ Comparative Analysis Pattern:")
    print("  Compare different sections of the document to find:")
    print("  - Consistency in data formats")
    print("  - Evolution of patterns from start to end")
    print("  - Outliers and anomalies")
    
    # 4. Document Summarization at Multiple Levels
    print("\n4️⃣ Multi-Level Summarization:")
    summarization_levels = {
        "executive": "1-paragraph executive summary",
        "technical": "Technical summary with key data points",
        "detailed": "Detailed summary with examples from content"
    }
    
    for level, description in summarization_levels.items():
        print(f"  - {level.capitalize()}: {description}")
    
    # 5. Use-Case-Specific Deep Dives
    print("\n5️⃣ Deep Dive Analysis for Each Use Case:")
    for i, use_case in enumerate(framework_analysis['ai_ready_insights']['identified_use_cases'][:5], 1):
        print(f"  {i}. {use_case}")
        # Each use case could have its own comprehensive analysis
    
    # Example: Run one of these analyses
    try:
        print("\n🤖 Example: Running Cross-Reference Analysis...")
        # The content was already capped when cross_ref_prompt was built, so send the
        # prompt whole; slicing it here would cut off the trailing instructions.
        cross_ref_result = llm.invoke(cross_ref_prompt)
        cross_ref_content = extract_llm_content(cross_ref_result)  # Handle AIMessage
        print("Result preview:", str(cross_ref_content)[:5000] + "...")
    except Exception as e:
        print(f"Error in cross-reference analysis: {e}")
    
else:
    print("❌ Required components not available for advanced analysis")

🚀 Advanced Large Context Analysis Patterns:

1️⃣ Cross-Reference Analysis Across All Chunks:

2️⃣ Multi-Stage Analysis Pipeline:
  - Stage 1: Entity extraction across all chunks
  - Stage 2: Relationship mapping
  - Stage 3: Insight generation

3️⃣ Comparative Analysis Pattern:
  Compare different sections of the document to find:
  - Consistency in data formats
  - Evolution of patterns from start to end
  - Outliers and anomalies

4️⃣ Multi-Level Summarization:
  - Executive: 1-paragraph executive summary
  - Technical: Technical summary with key data points
  - Detailed: Detailed summary with examples from content

5️⃣ Deep Dive Analysis for Each Use Case:
  1. Geospatial pattern recognition
  2. Location clustering and classification
  3. Route optimization and analysis
  4. Geographic feature extraction
  5. Spatial relationship discovery

🤖 Example: Running Cross-Reference Analysis...
Result preview: The document is a KML (Keyhole Markup Language) file, which is used to represent
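
The three stage prompts in the cell above are defined but never sent to the model. A minimal sketch of how they could be chained, with each stage's output fed into the next. `run_pipeline` and `call_llm` are hypothetical names; `call_llm` stands for any function that takes a prompt string and returns the model's text (for example, a wrapper around `llm.invoke` plus `extract_llm_content`). A stub is used here so the sketch runs without an API key.

```python
from typing import Callable, Dict


def run_pipeline(content: str, doc_type: str,
                 call_llm: Callable[[str], str]) -> Dict[str, str]:
    """Run entity extraction, relationship mapping, and insight generation in order."""
    # Stage 1: extract entities from the document content
    entities = call_llm(
        f"Extract ALL entities from this {doc_type} document:\n\n{content}\n\n"
        "Return JSON with categorized entities."
    )
    # Stage 2: the stage 1 output is passed along explicitly
    relationships = call_llm(
        "Based on the entities found, analyze their relationships and "
        f"create a knowledge graph structure.\n\nENTITIES:\n{entities}"
    )
    # Stage 3: insights are generated from the stage 2 output
    insights = call_llm(
        "Based on the entity relationships, generate actionable insights "
        f"and recommendations.\n\nRELATIONSHIPS:\n{relationships}"
    )
    return {"entities": entities, "relationships": relationships, "insights": insights}


# Stub standing in for a real model call (for illustration only)
def fake_llm(prompt: str) -> str:
    return f"[response to {len(prompt)} chars]"


stages = run_pipeline("<kml>...</kml>", "KML", fake_llm)
print(list(stages))
```

Passing the previous stage's text into the next prompt matters because a plain `llm.invoke` call has no memory of earlier calls.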

## Summary

This notebook demonstrates a **clear division of labor** between the framework and the AI:

### 📊 XML Analysis Framework Does:
- **Document Type Detection**: Identifies KML, Maven POM, SCAP, etc.
- **Data Structuring**: Extracts metadata, creates chunks, prepares schemas
- **Use Case Identification**: Suggests specific AI applications per document type
- **Content Preparation**: Optimizes data for AI consumption (chunking, token estimation)

### 🤖 AI Does (working with framework data):
- **Pattern Recognition**: Analyzes geospatial patterns in structured geographic data
- **Feature Extraction**: Identifies specific geographic/business features from chunks
- **Classification**: Categorizes content using framework-prepared metadata
- **Insight Generation**: Provides domain-specific analysis based on use cases
- **Structured Responses**: Returns JSON for downstream processing
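
Model replies often wrap the JSON in a fenced block with surrounding prose, as in the Question 1 output earlier in this notebook, so downstream code usually has to pull the JSON out before parsing it. A minimal sketch; `extract_json_block` is a hypothetical helper, not part of the framework or LangChain.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # three backticks, built indirectly to keep this example readable
_FENCED = re.compile(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL)


def extract_json_block(text: str) -> Optional[Any]:
    """Return parsed JSON from an LLM reply, or None if nothing parses."""
    # Prefer the first fenced block; fall back to the whole reply.
    match = _FENCED.search(text)
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # e.g. the reply was prose only or truncated mid-object


reply = (
    "Here are the locations:\n"
    f"{FENCE}json\n"
    '{"locations": [{"name": "Portland"}]}\n'
    f"{FENCE}"
)
data = extract_json_block(reply)
print(data["locations"][0]["name"])
```

Returning `None` instead of raising lets the caller decide whether to retry the request or fall back to the raw text.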

### 🔑 Key Benefits:
1. **Efficiency**: The framework handles heavy XML parsing; the AI focuses on insights
2. **Scalability**: Smaller AI models can work with pre-structured data
3. **Consistency**: The framework ensures reliable data preparation
4. **Flexibility**: Use-case-specific AI workflows based on document type
5. **Cost-Effective**: Reduces API usage by doing data preparation in the framework rather than in the model

### 🚀 Production Pattern:
```python
import xml_analysis_framework as xaf

# 1. Framework analyzes and structures (fast, deterministic)
analysis = xaf.analyze_enhanced("document.xml")
chunks = xaf.chunk("document.xml", strategy="auto")

# 2. AI works with structured data (focused, efficient)
# ai_workflow is a placeholder for your own use-case-specific LLM call
ai_results = {}
for use_case in analysis.ai_use_cases:
    ai_results[use_case] = ai_workflow(use_case, chunks)

# 3. Combine framework metadata + AI insights
final_result = {
    'document_metadata': analysis,  # Framework-prepared
    'ai_insights': ai_results       # AI-generated, keyed by use case
}
```
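
`ai_workflow` is left undefined in the pattern above. One way it might look, as a sketch only: `make_ai_workflow` and `call_llm` are hypothetical names, and chunks are assumed to be either plain strings or objects exposing a `content` attribute, which may not match the framework's actual chunk type.

```python
from typing import Any, Callable, Iterable


def make_ai_workflow(call_llm: Callable[[str], str],
                     max_chars: int = 2000) -> Callable[[str, Iterable[Any]], str]:
    """Build an ai_workflow(use_case, chunks) function around an LLM call."""

    def ai_workflow(use_case: str, chunks: Iterable[Any]) -> str:
        # Assumption: a chunk is a string or has a .content attribute
        texts = [getattr(c, "content", None) or str(c) for c in chunks]
        content = "\n\n".join(texts)[:max_chars]
        prompt = (
            f"Use case: {use_case}\n\n"
            f"CONTENT:\n{content}\n\n"
            "Provide a JSON response focused on this use case."
        )
        return call_llm(prompt)

    return ai_workflow


# Stub in place of a real model call, so the sketch runs offline:
# it simply echoes the first line of the prompt it receives.
ai_workflow = make_ai_workflow(lambda prompt: prompt.splitlines()[0])
print(ai_workflow("Geographic feature extraction", ["<Placemark/>"]))
```

In the notebook, `call_llm` could be a small wrapper that calls `llm.invoke` and passes the result through `extract_llm_content`.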

This approach allows smaller, faster models to work effectively since the framework handles the complex XML parsing and data structuring! 🎯