# Module 9.1: Query Decomposition & Planning

## Overview

This notebook demonstrates advanced retrieval techniques for handling complex multi-part queries through decomposition, dependency analysis, and parallel execution.

**Problem:** Standard RAG pipelines struggle with complex multi-part queries (15-20% of production traffic). Current simple retrieval approaches yield quality scores of 2.1/5 on complex queries versus 4.2/5 on simple ones.

**Solution:** Query decomposition raises the quality score on complex queries from 2.1/5 to 4.0/5 by:
- Breaking queries into atomic sub-queries
- Building dependency graphs to determine execution order
- Executing independent sub-queries in parallel, cutting their latency by 60%
- Synthesizing coherent answers from multiple results

**Trade-offs:**
- Adds 200-500ms overhead, so it is not worth using for simple queries
- Higher LLM costs ($0.01-0.02 per complex query)
- Harder debugging of multi-step failures

```python
# Setup and imports
import sys
import os
import json
import asyncio
from pathlib import Path

# Import from our module
from src.l3_m9_query_decomposition.pipeline import (
    QueryDecomposer,
    DependencyGraph,
    ParallelExecutionEngine,
    AnswerSynthesizer,
    QueryDecompositionPipeline,
    DecompositionError,
    DependencyError,
    SynthesisError
)
from src.l3_m9_query_decomposition.config import Config, get_openai_client

# Load example data
with open('../example_data.json', 'r') as f:
    example_data = json.load(f)

print("✓ Imports successful")
print(f"✓ Example data loaded: {len(example_data['sample_queries'])} queries")
```

## Section 1: Query Decomposition with LLM

**Goal:** Break complex queries into atomic sub-queries using GPT-4 Turbo.

**Key Features:**
- Temperature = 0.0 for deterministic outputs
- JSON response format with sub-query IDs and dependencies
- Validation: Maximum 6 sub-queries
- Fallback handling for LLM parsing failures

**When this works:**
- Queries with 2-4 distinct semantic parts
- Clear independent sub-questions

**When this breaks:**
- Too granular decomposition (>6 sub-queries)
- LLM returns invalid JSON
- Query is actually simple (1 sub-query)
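The parsing and validation step can be sketched in plain Python. This is a minimal sketch, not the module's `QueryDecomposer`: it assumes the LLM replies with JSON shaped like `{"sub_queries": [{"id": "q1", "query": "...", "dependencies": []}]}`, and the names `ParsedSubQuery` and `parse_decomposition` are hypothetical. It treats invalid JSON, missing fields, an empty list, and more than six sub-queries all as a signal to fall back to the original query as a single sub-query; the real module may raise `DecompositionError` instead.

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field

MAX_SUB_QUERIES = 6  # mirrors the limit described above


@dataclass
class ParsedSubQuery:
    id: str
    query: str
    dependencies: list[str] = field(default_factory=list)


def parse_decomposition(raw: str, original_query: str) -> list[ParsedSubQuery]:
    """Parse the LLM's JSON reply; fall back to one sub-query on any problem."""
    fallback = [ParsedSubQuery(id="q1", query=original_query)]
    try:
        items = json.loads(raw)["sub_queries"]  # assumed top-level key
        sub_queries = [
            ParsedSubQuery(
                id=str(item["id"]),
                query=str(item["query"]),
                dependencies=[str(dep) for dep in item.get("dependencies", [])],
            )
            for item in items
        ]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Invalid JSON or wrong shape: treat the query as simple
        return fallback
    if not sub_queries or len(sub_queries) > MAX_SUB_QUERIES:
        # Empty or too granular: also treat the query as simple
        return fallback
    return sub_queries


raw = ('{"sub_queries": ['
       '{"id": "q1", "query": "PostgreSQL performance?", "dependencies": []},'
       '{"id": "q2", "query": "MySQL performance?", "dependencies": []}]}')
print([sq.id for sq in parse_decomposition(raw, "Compare PostgreSQL and MySQL")])
# ['q1', 'q2']
print(parse_decomposition("not json", "Compare PostgreSQL and MySQL")[0].query)
# Compare PostgreSQL and MySQL
```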

```python
# Test Query Decomposition
async def test_decomposition():
    if not Config.OPENAI_API_KEY:
        print("⚠️ Skipping API calls (no OPENAI_API_KEY)")
        return

    decomposer = QueryDecomposer(Config.OPENAI_API_KEY)

    # Test 1: Complex query (should work)
    complex_query = example_data['sample_queries'][0]['query']
    print(f"Query: {complex_query}\n")

    try:
        result = await decomposer.decompose(complex_query)
        print(f"✓ Decomposed into {len(result.sub_queries)} sub-queries:")
        for sq in result.sub_queries[:3]:  # Show max 3
            print(f"  - {sq.id}: {sq.query[:60]}...")
            if sq.dependencies:
                print(f"    Deps: {sq.dependencies}")
    except DecompositionError as e:
        print(f"✗ Decomposition failed: {e}")

# Expected: 3 sub-queries for PostgreSQL vs MySQL comparison
# Top-level await works in Jupyter; in a plain script use asyncio.run(test_decomposition())
await test_decomposition()
```

## Section 2: Dependency Graph Construction

**Goal:** Build execution plan using NetworkX DiGraph to represent dependencies.

**Key Features:**
- Nodes = sub-query IDs
- Edges = dependency relationships
- Validates for circular dependencies (raises DependencyError)
- Generates execution levels for parallel execution

**When this works:**
- Valid DAG (Directed Acyclic Graph)
- Clear sequential or parallel patterns

**When this breaks:**
- Circular dependencies (q1 depends on q2, q2 depends on q1)
- Missing dependency references
- Invalid graph structure
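Execution levels are a topological sort grouped by generation: level 1 holds sub-queries with no dependencies, and each later level holds those whose dependencies all sit in earlier levels. The sketch below uses Kahn's algorithm in pure Python on a hypothetical `{id: [dependency ids]}` mapping rather than NetworkX, with a hypothetical `GraphError` standing in for `DependencyError`. (NetworkX offers `topological_generations` for the same grouping on a `DiGraph`.)

```python
from __future__ import annotations

from collections import defaultdict


class GraphError(ValueError):
    """Hypothetical stand-in for the module's DependencyError."""


def execution_levels(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group sub-query IDs so that every dependency sits in an earlier level."""
    for qid, parents in deps.items():
        missing = [p for p in parents if p not in deps]
        if missing:
            raise GraphError(f"{qid} depends on unknown sub-queries: {missing}")

    # In-degree = number of distinct unmet dependencies per sub-query
    indegree = {qid: len(set(parents)) for qid, parents in deps.items()}
    children = defaultdict(list)
    for qid, parents in deps.items():
        for parent in set(parents):
            children[parent].append(qid)

    levels = []
    ready = sorted(qid for qid, d in indegree.items() if d == 0)
    while ready:
        levels.append(ready)
        next_ready = []
        for qid in ready:
            for child in children[qid]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = sorted(next_ready)

    # Sub-queries that never reached in-degree 0 are on, or behind, a cycle
    if sum(len(level) for level in levels) != len(deps):
        stuck = sorted(qid for qid, d in indegree.items() if d > 0)
        raise GraphError(f"Circular dependency among: {stuck}")
    return levels


print(execution_levels({"q1": [], "q2": [], "q3": ["q1", "q2"]}))
# [['q1', 'q2'], ['q3']]
```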

```python
# Test Dependency Graph
from src.l3_m9_query_decomposition.pipeline import SubQuery

# Example 1: Parallel execution (no dependencies)
parallel_queries = [
    SubQuery(id="q1", query="PostgreSQL performance?", dependencies=[]),
    SubQuery(id="q2", query="MySQL performance?", dependencies=[]),
    SubQuery(id="q3", query="JSON support comparison?", dependencies=[])
]

graph = DependencyGraph(parallel_queries)
levels = graph.get_execution_levels()
print(f"✓ Parallel pattern: {len(levels)} level(s)")
print(f"  Level 1: {levels[0]}")

# Example 2: Sequential execution (with dependencies)
sequential_queries = [
    SubQuery(id="q1", query="AWS Lambda security?", dependencies=[]),
    SubQuery(id="q2", query="Azure Functions security?", dependencies=[]),
    SubQuery(id="q3", query="HIPAA recommendation?", dependencies=["q1", "q2"])
]

graph2 = DependencyGraph(sequential_queries)
levels2 = graph2.get_execution_levels()
print(f"\n✓ Sequential pattern: {len(levels2)} level(s)")
for i, level in enumerate(levels2, 1):
    print(f"  Level {i}: {level}")

# Expected: Parallel = 1 level with 3 queries; Sequential = 2 levels
```

## Section 3: Parallel Execution Engine

**Goal:** Execute sub-queries concurrently using async/await patterns.

**Key Features:**
- Semaphore limiting concurrent retrievals (default: 5)
- Timeout protection per query (default: 30s)
- Level-based execution respecting dependencies
- Error isolation (one failure doesn't stop others)

**When this works:**
- Independent queries in parallel (60% latency reduction)
- Proper resource limits prevent exhaustion

**When this breaks:**
- Too many concurrent retrievals (resource exhaustion)
- Timeout exceeded (>30s per query)
- Vector database rate limiting
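The three mechanisms above (a concurrency cap, a per-query timeout, and error isolation) map directly onto standard `asyncio` primitives. The sketch below is an assumption about how one level could be run, not `ParallelExecutionEngine`'s actual code; `run_level` and its parameters are hypothetical, and it omits passing results from earlier levels into later ones.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable


async def run_level(
    queries: dict[str, str],
    retrieve: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
    timeout_s: float = 30.0,
) -> dict[str, str | BaseException]:
    """Run one level of independent sub-queries concurrently."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(query: str) -> str:
        # The semaphore caps in-flight retrievals; wait_for bounds each
        # retrieval once it holds a slot (queueing time is not counted)
        async with semaphore:
            return await asyncio.wait_for(retrieve(query), timeout=timeout_s)

    ids = list(queries)
    # return_exceptions=True isolates failures: an error or timeout becomes
    # that sub-query's result instead of cancelling its siblings
    outcomes = await asyncio.gather(
        *(run_one(queries[qid]) for qid in ids), return_exceptions=True
    )
    return dict(zip(ids, outcomes))
```

Because `gather` is called with `return_exceptions=True`, a timed-out or failing sub-query shows up as an exception object in the result dict rather than aborting the whole level, so the synthesizer (or a fallback) can decide what to do with partial results. In the notebook, call it with `await run_level(...)`; in a script, wrap it in `asyncio.run(...)`.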

```python
# Test Parallel Execution with Mock Retrieval
import time

async def mock_retrieval(query: str) -> str:
    """Mock retrieval simulating 100ms latency."""
    await asyncio.sleep(0.1)
    return f"Mock result for: {query[:40]}..."

# Test parallel execution
async def test_parallel_execution():
    engine = ParallelExecutionEngine(mock_retrieval, max_concurrent=3)

    # Use parallel queries from previous section
    start = time.perf_counter()
    results = await engine.execute_level(parallel_queries, {})
    elapsed = (time.perf_counter() - start) * 1000

    print(f"✓ Executed {len(results)} queries in {elapsed:.0f}ms")
    print(f"  (Sequential would be ~{len(results) * 100}ms)")
    for qid in list(results.keys())[:2]:  # Show first 2
        print(f"  {qid}: {results[qid][:50]}...")

await test_parallel_execution()

# Expected: ~100ms for 3 parallel queries vs ~300ms sequential
```

## Section 4: Answer Synthesis

**Goal:** Combine multiple sub-query results into a coherent final answer.

**Key Features:**
- LLM-based synthesis with conflict resolution
- Context management (max 4K tokens by default)
- Temperature = 0.3 for balanced creativity/accuracy
- Handles contradictory information

**When this works:**
- Sub-results are complementary
- Total context < 4K tokens
- Clear synthesis strategy

**When this breaks:**
- Context overflow (>4K tokens from multiple retrievals)
- Contradictory sub-answers requiring manual resolution
- Synthesis cost ($0.005 per query) adds up at high query volume
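Context management for synthesis amounts to fitting all sub-results into a fixed budget before they reach the LLM. A minimal sketch, assuming an even per-result split and a whitespace word count as a stand-in for real token counting (a production version would count with the model's own tokenizer); `approx_tokens` and `build_synthesis_context` are hypothetical names.

```python
from __future__ import annotations


def approx_tokens(text: str) -> int:
    # Rough assumption: one token per whitespace-separated word.
    # A real pipeline would count with the model's own tokenizer.
    return len(text.split())


def build_synthesis_context(sub_results: dict[str, str], max_tokens: int = 4000) -> str:
    """Join sub-results, truncating each to an equal share of max_tokens."""
    if not sub_results:
        return ""
    # Reserve 2 tokens per section for the "[qid]" label and "[truncated]" marker
    per_result = max(max_tokens // len(sub_results) - 2, 0)
    sections = []
    for qid, text in sub_results.items():
        words = text.split()
        body = " ".join(words[:per_result])
        if len(words) > per_result:
            body += " [truncated]"
        sections.append(f"[{qid}] {body}")
    return "\n\n".join(sections)


ctx = build_synthesis_context({
    "q1": "PostgreSQL: Excellent ACID compliance, 10K TPS...",
    "q2": "MySQL: Fast reads, 15K TPS on simple queries...",
})
print(approx_tokens(ctx) <= 4000)  # True
```

The even split is the simplest policy that keeps one long retrieval from crowding out the others; handing unused budget from short results to long ones is a natural refinement.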

```python
# Test Answer Synthesis
async def test_synthesis():
    if not Config.OPENAI_API_KEY:
        print("⚠️ Skipping API calls (no OPENAI_API_KEY)")
        return

    synthesizer = AnswerSynthesizer(Config.OPENAI_API_KEY)

    # Mock sub-results
    original_query = "Compare PostgreSQL and MySQL performance and JSON support"
    sub_results = {
        "q1": "PostgreSQL: Excellent ACID compliance, 10K TPS...",
        "q2": "MySQL: Fast reads, 15K TPS on simple queries...",
        "q3": "PostgreSQL has native JSONB, MySQL added JSON in 5.7..."
    }

    try:
        answer = await synthesizer.synthesize(
            original_query,
            sub_results,
            parallel_queries
        )
        print(f"✓ Synthesized answer ({len(answer)} chars):")
        print(f"  {answer[:150]}...")
    except SynthesisError as e:
        print(f"✗ Synthesis failed: {e}")

await test_synthesis()

# Expected: Coherent comparison integrating all three sub-results
```

## Section 5: Full Pipeline Integration

**Goal:** End-to-end pipeline with fallback to simple retrieval.

**Key Features:**
- Automatic complexity detection
- Fallback on decomposition/execution failures
- Latency and cost tracking
- Metadata for debugging

**Decision Logic:**
- Single sub-query → Use simple retrieval
- Multiple sub-queries ‚Üí Use decomposition
- Any error + fallback enabled ‚Üí Simple retrieval

**When this works:**
- Complex queries (2-4 parts) with ≥700ms budget
- Fallback provides resilience

**When this breaks:**
- Latency budget <700ms (overhead too high)
- Query volume >100K/day on tight budget
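The decision logic above can be written as a small async router. This is a hypothetical sketch: `route_query` takes the decomposition, simple-retrieval, and decomposed-answer steps as injected async callables, and the `method` labels it returns are illustrative rather than the values `QueryDecompositionPipeline` actually reports.

```python
from __future__ import annotations

from typing import Awaitable, Callable


async def route_query(
    query: str,
    decompose: Callable[[str], Awaitable[list[str]]],
    simple_answer: Callable[[str], Awaitable[str]],
    decomposed_answer: Callable[[str, list[str]], Awaitable[str]],
    enable_fallback: bool = True,
) -> dict:
    """1 sub-query -> simple retrieval; several -> decomposition; error -> fallback."""
    try:
        sub_queries = await decompose(query)
        if len(sub_queries) > 1:
            answer = await decomposed_answer(query, sub_queries)
            return {"method": "decomposition", "answer": answer,
                    "sub_queries": len(sub_queries)}
    except Exception as exc:
        if not enable_fallback:
            raise
        # Any decomposition/execution/synthesis failure degrades to simple retrieval
        return {"method": "fallback", "answer": await simple_answer(query),
                "error": str(exc)}
    # Zero or one sub-query: the query is simple, so skip the decomposition overhead
    return {"method": "simple", "answer": await simple_answer(query)}
```

Catching `Exception` broadly is deliberate here: with fallback enabled, any failure in decomposition, execution, or synthesis degrades to simple retrieval, and the error string is kept for logging.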

```python
# Test Full Pipeline
async def test_full_pipeline():
    if not Config.OPENAI_API_KEY:
        print("⚠️ Skipping API calls (no OPENAI_API_KEY)")
        return

    pipeline = QueryDecompositionPipeline(
        Config.OPENAI_API_KEY,
        mock_retrieval,
        enable_fallback=True
    )

    # Test complex query
    complex_query = example_data['sample_queries'][0]['query']
    print(f"Query: {complex_query[:70]}...\n")

    result = await pipeline.process_query(complex_query)

    print(f"✓ Method: {result['method']}")
    print(f"  Latency: {result['latency_ms']:.0f}ms")
    print(f"  Sub-queries: {result.get('sub_queries', 'N/A')}")
    print(f"  Answer: {result['answer'][:100]}...")

await test_full_pipeline()

# Expected: Decomposition method with ~800ms latency
```

## Section 6: Common Failure Modes

**Failure scenarios from production:**

1. **Too Granular Decomposition** (10+ sub-queries)
   - Violates MAX_SUB_QUERIES=6 limit
   - Fix: Simplify query or increase limit

2. **Circular Dependencies**
   - q1 depends on q2, q2 depends on q1
   - Detection: `DependencyGraph` raises `DependencyError`; fix: improve the decomposition or fall back to simple retrieval

3. **Parallel Execution Timeouts**
   - Resource exhaustion from too many concurrent retrievals
   - Fix: Reduce max_concurrent or increase timeout

4. **Answer Synthesis Conflicts**
   - Contradictory sub-answers
   - Fix: Manual intervention or better conflict resolution

5. **Context Overflow**
   - Multiple retrievals exceeding 4K token limit
   - Fix: Reduce retrieval size or increase limit

```python
# Demonstrate Failure Modes
print("Failure Mode 1: Too Granular Decomposition")
too_complex = example_data['sample_queries'][6]  # Edge case query
print(f"  Query: {too_complex['query'][:60]}...")
print(f"  Expected: {too_complex['expected_sub_queries']} sub-queries (exceeds limit)")

print("\nFailure Mode 2: Circular Dependencies")
circular = [
    SubQuery(id="q1", query="What is X?", dependencies=["q2"]),
    SubQuery(id="q2", query="What is Y?", dependencies=["q1"])
]
try:
    graph_bad = DependencyGraph(circular)
    print("  ✗ Should have raised DependencyError!")
except DependencyError as e:
    print(f"  ✓ Caught: {str(e)[:50]}...")

print("\nFailure Mode 3: Context Overflow")
print("  Scenario: Multiple large retrievals > 4K tokens")
print("  Fix: Reduce retrieval size or increase token limit")

print("\nFailure Mode 4: Timeout")
print("  Scenario: Retrieval takes >30s")
print("  Fix: Increase timeout or optimize retrieval")

# Expected: Circular dependency error caught, others documented
```

## Section 7: Decision Card - When to Use Query Decomposition

### ✅ Use query decomposition when:

- Query has **2-4 distinct semantic parts**
- Sub-queries are **largely independent** or have clear dependencies
- **Latency budget ≥700ms** (accounts for 200-500ms overhead)
- **Accuracy improvement worth cost increase** ($0.01-0.02 per query)
- Handling **15-20% complex queries** in production traffic

### ❌ When NOT to use:

- **Simple Direct Questions** (80%+ of traffic) - Adds unnecessary latency
- **Real-Time Apps** (<500ms requirement) - Overhead too high
- **Very High Query Volume** (>100K/day) on limited budget - Costs multiply
- **Domain-Specific Queries** - May need custom fine-tuning

### 📊 Performance Impact:

| Metric | Simple Query | Complex Query (No Decomposition) | Complex Query (With Decomposition) |
|--------|--------------|----------------------------------|------------------------------------|
| Latency | 200ms | 250ms | 800ms |
| Quality | 4.2/5 | 2.1/5 | 4.0/5 |
| Cost | $0.001 | $0.001 | $0.020 |
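To see what the table means for a whole workload, weight each column by traffic share. The sketch below assumes 20% complex queries (the top of the 15-20% range above), 100K queries/day (the volume threshold used in this module), and a perfect router that sends only complex queries through decomposition; `blended_metrics` is a hypothetical helper, and the per-query numbers are copied from the table.

```python
def blended_metrics(complex_share: float, volume_per_day: int, use_decomposition: bool) -> dict:
    """Traffic-weighted latency, quality, and daily cost from the table above.

    Assumes a perfect router: simple queries always take the simple path and
    decomposition is applied only to complex queries.
    """
    simple = {"latency_ms": 200, "quality": 4.2, "cost": 0.001}
    complex_path = (
        {"latency_ms": 800, "quality": 4.0, "cost": 0.020}
        if use_decomposition
        else {"latency_ms": 250, "quality": 2.1, "cost": 0.001}
    )
    w = complex_share
    return {
        "avg_latency_ms": (1 - w) * simple["latency_ms"] + w * complex_path["latency_ms"],
        "avg_quality": (1 - w) * simple["quality"] + w * complex_path["quality"],
        "daily_cost_usd": volume_per_day * ((1 - w) * simple["cost"] + w * complex_path["cost"]),
    }


for use in (False, True):
    m = blended_metrics(complex_share=0.2, volume_per_day=100_000, use_decomposition=use)
    print(f"decomposition={use}: {m['avg_latency_ms']:.0f}ms avg, "
          f"quality {m['avg_quality']:.2f}/5, ${m['daily_cost_usd']:.0f}/day")
# decomposition=False: 210ms avg, quality 3.78/5, $100/day
# decomposition=True: 320ms avg, quality 4.16/5, $480/day
```

Under these assumptions the blended daily cost rises 4.8× rather than 20×, because only the complex slice pays the decomposition premium, while average quality rises from 3.78 to 4.16.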

### 🎯 Key Takeaway:

Deploy decomposition when **your share of complex queries (typically 15-20% of traffic) justifies the cost and latency trade-off**. For the ~80% of queries that are simple, standard retrieval remains optimal.

```python
# Decision Helper Function
def should_use_decomposition(
    query_complexity: str,
    latency_budget_ms: int,
    query_volume_per_day: int,
    cost_sensitive: bool
) -> dict:
    """
    Decision helper based on query characteristics.

    Returns recommendation with reasoning. Every check must pass for "USE":
    a complex query alone cannot outweigh a latency budget that is too tight.
    """
    reasons = []
    score = 0

    # Complexity check (weighted 2 points)
    if query_complexity in ["high", "complex"]:
        score += 2
        reasons.append("✓ Complex query benefits from decomposition")
    else:
        reasons.append("✗ Simple query - unnecessary overhead")

    # Latency check
    if latency_budget_ms >= 700:
        score += 1
        reasons.append("✓ Latency budget sufficient")
    else:
        reasons.append("✗ Latency budget too tight (<700ms)")

    # Volume check: high volume is only a blocker on a cost-sensitive budget
    if query_volume_per_day < 100000:
        score += 1
        reasons.append("✓ Volume manageable for decomposition costs")
    elif cost_sensitive:
        reasons.append("✗ High volume + cost-sensitive")
    else:
        score += 1
        reasons.append("✓ High volume, but budget is not cost-sensitive")

    # Require all checks; a partial score (e.g. complex query but tight latency) means SKIP
    recommendation = "USE" if score == 4 else "SKIP"

    return {
        "recommendation": recommendation,
        "score": f"{score}/4",
        "reasons": reasons
    }

# Test decision helper
test_cases = [
    ("high", 1000, 50000, False),
    ("low", 300, 10000, True),
    ("high", 500, 150000, True)
]

for complexity, latency, volume, cost_sens in test_cases:
    result = should_use_decomposition(complexity, latency, volume, cost_sens)
    print(f"{result['recommendation']} ({result['score']}): {' | '.join(result['reasons'][:2])}")

# Expected: USE for case 1, SKIP for cases 2 and 3
```

## Conclusion

### What We've Learned:

1. **Query Decomposition** breaks complex multi-part queries into atomic sub-queries
2. **Dependency Graphs** enable optimal parallel/sequential execution planning
3. **Parallel Execution** reduces latency by 60% for independent queries
4. **Answer Synthesis** combines results into coherent responses with conflict resolution
5. **Trade-offs are real**: +200-500ms latency, 20× cost increase, debugging complexity

### Production Checklist:

- ✓ Fallback to simple retrieval for failures
- ✓ Rate limiting on decomposition calls
- ✓ Circuit breakers for stuck async operations
- ✓ Logging of failed sub-queries for debugging
- ✓ Monitoring: success rates, latency, synthesis conflicts
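A circuit breaker complements per-query timeouts: timeouts turn stuck operations into failures, and the breaker stops sending traffic down the decomposition path once failures repeat. A minimal sketch of the pattern, with a hypothetical `CircuitBreaker` class and thresholds chosen only for illustration; the module may implement this differently.

```python
import time


class CircuitBreaker:
    """Trip after `max_failures` consecutive failures and reject calls for
    `cooldown_s` seconds. After the cooldown, calls are allowed again
    (half-open): a success closes the breaker, a failure restarts the cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # time.monotonic() value when the breaker tripped

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


# Hypothetical wiring around the decomposition path (inside an async function):
# if breaker.allow():
#     try:
#         result = await pipeline.process_query(query)
#         breaker.record_success()
#     except Exception:
#         breaker.record_failure()
#         ...  # fall back to simple retrieval
# else:
#     ...  # skip decomposition and use simple retrieval directly
breaker = CircuitBreaker(max_failures=3, cooldown_s=30.0)
```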

### Alternative Solutions to Consider:

1. **Single-Shot Retrieval with Better Prompting** - Simplest approach
2. **Query Expansion** (not decomposition) - Middle-ground using semantic variations
3. **Managed Query Understanding Service** - Zero implementation, vendor-dependent
4. **Fine-Tuned Decomposition Model** - Advanced for specialized domains

### Next Steps:

- **Module 9.2**: Query Rewriting & Expansion
- **Module 9.3**: Hybrid Search Techniques
- **Module 10**: Multi-Modal RAG

---

**Remember:** Only use decomposition when query complexity justifies the cost. For the ~80% of queries that are simple, standard retrieval is optimal.