# Multimodal Multi-Hop RAG for Pump Datasheets

This notebook extends the **multimodal RAG capabilities** from `CORTEX_SEARCH_MULTIMODAL_pumps_complete.ipynb` with **multi-hop reasoning** for complex cross-document analysis.

## 🔗 What is Multi-Hop RAG?

**Traditional Single-Hop**: Query → Search → Answer (may miss relevant information)

**Multi-Hop RAG**: Query → Initial Search → Gap Analysis → Follow-up Searches → Comprehensive Answer

## 🎯 Perfect for Complex Pump Queries:

- **Cross-vendor comparisons**: "Compare NPSH requirements between Sulzer BE and Goulds 3196 pumps"
- **Comprehensive analysis**: "Which pumps meet API 610 standards and what are their efficiency ratings?"
- **Multi-specification queries**: "Find pumps suitable for high-temperature applications with materials list"

## Prerequisites:

✅ **Assumes you've completed**: `CORTEX_SEARCH_MULTIMODAL_pumps_complete.ipynb`

- Multimodal search service: `DATASHEET_CORTEX_SEARCH_SERVICE`
- Metadata tables: `DATASHEET_DIRECTORY`, `DATASHEET_PAGE_METADATA` 
- Vector embeddings and text indexes ready

**This notebook adds**: Multi-hop reasoning layer on top of existing multimodal capabilities

In [None]:
# Import required libraries
import json
from typing import List, Dict, Any, Set
from snowflake.core import Root
from snowflake.snowpark.context import get_active_session

# Get active session and connect to existing search service
session = get_active_session()
root = Root(session)

# Connect to the existing multimodal search service
search_service = (root
    .databases["DEMODB"]
    .schemas["DATASHEET_RAG"]
    .cortex_search_services["DATASHEET_CORTEX_SEARCH_SERVICE"]
)

print("✅ Connected to existing multimodal search service")
print("🔗 Ready for multi-hop RAG implementation")

## Multi-Hop RAG Implementation

The multi-hop approach performs iterative searches to ensure comprehensive coverage:

1. **HOP 1**: Initial broad search using existing multimodal service
2. **Analysis**: Identify gaps in vendor/product coverage
3. **HOP 2+**: Targeted follow-up searches for missing information
4. **Synthesis**: Combine results for comprehensive analysis
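
The four steps above can be sketched independently of Snowflake as a loop over a pluggable search function. In this minimal sketch, `run_hops`, `search_fn`, `gap_fn`, and the toy corpus are hypothetical names used only for illustration; the `MultiHopRAG` class in the next cell is the notebook's actual implementation.

```python
from typing import Callable, Dict, List

Result = Dict[str, str]

def run_hops(
    query: str,
    search_fn: Callable[[str], List[Result]],
    gap_fn: Callable[[List[Result], str], List[str]],
    max_hops: int = 4,
) -> List[Result]:
    """Hop 1 searches the original query; later hops search gap-driven follow-ups."""
    results = list(search_fn(query))               # HOP 1: initial broad search
    follow_ups = gap_fn(results, query)            # Analysis: what is missing?
    for follow_up in follow_ups[: max_hops - 1]:   # HOP 2+: targeted searches
        seen = {r["id"] for r in results}
        results.extend(r for r in search_fn(follow_up) if r["id"] not in seen)
    return results                                 # Synthesis happens downstream

# Toy stand-ins for the search service and the gap analysis (assumptions).
CORPUS = [
    {"id": "p1", "vendor": "Sulzer", "text": "Sulzer BE NPSH curve"},
    {"id": "p2", "vendor": "Goulds", "text": "Goulds 3196 NPSH curve"},
]

def toy_search(q: str) -> List[Result]:
    # Rank documents by how often their vendor name occurs in the query; keep the top hit.
    ranked = sorted(CORPUS, key=lambda d: q.lower().count(d["vendor"].lower()), reverse=True)
    return ranked[:1]

def toy_gaps(results: List[Result], q: str) -> List[str]:
    # Prefix the query with each vendor that the results so far do not cover.
    found = {r["vendor"] for r in results}
    return [f"{v} {q}" for v in ("Sulzer", "Goulds") if v not in found]

hops = run_hops("Compare NPSH for Sulzer and Goulds pumps", toy_search, toy_gaps)
print([r["vendor"] for r in hops])
```

In the toy run, the first hop returns only the Sulzer page, the gap analysis notices that Goulds is missing, and the second hop retrieves it.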

In [None]:
class MultiHopRAG:
    """Multi-hop RAG system built on existing multimodal search service"""
    
    def __init__(self, session, search_service):
        self.session = session
        self.search_service = search_service
        
    def embed_query(self, query_text: str) -> List[float]:
        """Generate embedding using the same model as the search service"""
        # Bind the query text as a parameter so quotes in it cannot break the SQL
        sql_output = self.session.sql(
            "SELECT SNOWFLAKE.CORTEX.EMBED_TEXT_1024('voyage-multimodal-3', ?)",
            params=[f"Represent the query for retrieving supporting documents: {query_text}"]
        ).collect()
        return list(sql_output[0].asDict().values())[0]
    
    def search_multimodal(self, query_text: str, limit: int = 10) -> List[Dict[str, Any]]:
        """Use existing multimodal search service with both text and vector indexes"""
        query_vector = self.embed_query(query_text)
        
        # Use the existing multi-index search pattern
        resp = self.search_service.search(
            multi_index_query={
                "TEXT": [{"text": query_text}],
                "VECTOR_MAIN": [{"vector": query_vector}]
            },
            columns=["TEXT", "PAGE_NUMBER", "IMAGE_FILEPATH", "VENDOR", "PRODUCT_ID", 
                    "PUMP_MODEL", "DATASHEET_TYPE", "SECTION_TITLE"],
            limit=limit
        )
        
        return resp.to_dict()["results"]
    
    def analyze_coverage_gaps(self, results: List[Dict], original_query: str) -> List[str]:
        """Identify gaps in vendor/product coverage for follow-up searches"""
        # Treat missing or NULL metadata as 'Unknown' so the joins below cannot fail
        found_vendors = {str(r.get('VENDOR') or 'Unknown') for r in results}
        found_products = {str(r.get('PRODUCT_ID') or 'Unknown') for r in results}
        
        print("📊 Coverage Analysis:")
        print(f"   Found vendors: {', '.join(sorted(found_vendors))}")
        print(f"   Found products: {', '.join(sorted(found_products))}")
        
        # Generate follow-up queries based on gaps and query type
        follow_up_queries = []
        
        # For comparison queries, ensure we search all major vendors
        if any(word in original_query.lower() for word in ['compare', 'which', 'best', 'highest', 'vs']):
            major_vendors = ['Sulzer', 'Goulds', 'Fristam']
            for vendor in major_vendors:
                if vendor not in found_vendors:
                    follow_up_queries.append(f"{vendor} {original_query}")
        
        # For specification queries, search by technical terms
        if any(term in original_query.lower() for term in ['npsh', 'efficiency', 'flow', 'pressure', 'api']):
            # Extract key technical terms for targeted search
            tech_terms = []
            if 'npsh' in original_query.lower():
                tech_terms.append('NPSH required suction head')
            if 'efficiency' in original_query.lower():
                tech_terms.append('pump efficiency BEP')
            if 'api 610' in original_query.lower():
                tech_terms.append('API 610 standard compliance')
            
            for term in tech_terms[:2]:  # Limit follow-ups
                follow_up_queries.append(f"{term} specifications")
        
        return follow_up_queries[:3]  # Limit to 3 follow-up queries
    
    def multi_hop_search(self, original_query: str, max_hops: int = 4) -> Dict[str, Any]:
        """Perform multi-hop search for comprehensive coverage"""
        print(f"🔍 Multi-Hop Search: {original_query}")
        print("=" * 60)
        
        all_results = []
        search_history = []
        
        # HOP 1: Initial broad search
        print(f"\n📍 HOP 1: Initial multimodal search")
        initial_results = self.search_multimodal(original_query, limit=8)
        all_results.extend(initial_results)
        search_history.append({
            "hop": 1, 
            "query": original_query, 
            "results_count": len(initial_results)
        })
        
        self._display_results(initial_results[:3], "Initial Results")
        
        # Analyze gaps and generate follow-up queries
        follow_up_queries = self.analyze_coverage_gaps(initial_results, original_query)
        
        # HOP 2+: Follow-up searches
        for hop_num, follow_up_query in enumerate(follow_up_queries, 2):
            if hop_num > max_hops:
                break
                
            print(f"\n📍 HOP {hop_num}: Follow-up search")
            print(f"   Query: {follow_up_query}")
            
            hop_results = self.search_multimodal(follow_up_query, limit=5)
            
            # Filter out duplicates based on IMAGE_FILEPATH
            existing_paths = {r.get('IMAGE_FILEPATH') for r in all_results}
            new_results = [r for r in hop_results if r.get('IMAGE_FILEPATH') not in existing_paths]
            
            if new_results:
                all_results.extend(new_results)
                search_history.append({
                    "hop": hop_num,
                    "query": follow_up_query,
                    "results_count": len(new_results)
                })
                self._display_results(new_results[:2], f"Hop {hop_num} New Results")
            else:
                print(f"   No new results found")
        
        return {
            "original_query": original_query,
            "all_results": all_results,
            "search_history": search_history,
            "total_documents": len(all_results)
        }
    
    def _display_results(self, results: List[Dict], title: str):
        """Display search results in formatted way"""
        print(f"\n📋 {title}:")
        for i, result in enumerate(results, 1):
            vendor = result.get('VENDOR', 'N/A')
            product = result.get('PRODUCT_ID', 'N/A')
            page = result.get('PAGE_NUMBER', 'N/A')
            section = result.get('SECTION_TITLE', 'N/A')
            print(f"   {i}. {vendor} {product} - Page {page} ({section})")
    
    def generate_comprehensive_answer(self, search_data: Dict[str, Any]) -> str:
        """Generate comprehensive answer using all multi-hop results"""
        print(f"\n🤖 Generating comprehensive answer from {search_data['total_documents']} documents...")
        
        # Group results by vendor for structured analysis
        results_by_vendor = {}
        for result in search_data["all_results"]:
            vendor = result.get('VENDOR', 'Unknown')
            if vendor not in results_by_vendor:
                results_by_vendor[vendor] = []
            results_by_vendor[vendor].append(result)
        
        # Create structured context for LLM
        context_parts = []
        for vendor, results in results_by_vendor.items():
            context_parts.append(f"\n=== {vendor} Data ===")
            for result in results[:2]:  # Top 2 results per vendor
                product = result.get('PRODUCT_ID', 'Unknown')
                page = result.get('PAGE_NUMBER', 'Unknown')
                section = result.get('SECTION_TITLE', 'General')
                text_content = str(result.get('TEXT', ''))[:400]
                
                context_parts.append(
                    f"Product: {product} (Page {page}, {section})\n"
                    f"Content: {text_content}..."
                )
        
        context = "\n".join(context_parts)
        
        # Generate comprehensive answer
        prompt = f"""
        Question: {search_data['original_query']}
        
        I performed a multi-hop search across pump datasheets and gathered the following information:
        
        {context}
        
        Please provide a comprehensive answer that:
        1. Directly answers the original question
        2. Compares information across different vendors when relevant
        3. Cites specific sources (vendor, product, page) for key facts
        4. Highlights any gaps in available information
        5. Provides actionable insights for pump selection
        
        Focus on technical accuracy and practical engineering insights.
        """
        
        sql = "SELECT SNOWFLAKE.CORTEX.COMPLETE('llama3.1-70b', ?) as answer"
        result = self.session.sql(sql, params=[prompt]).collect()[0]
        
        answer = result["ANSWER"]
        
        # Display search summary
        print(f"\n📊 Multi-Hop Search Summary:")
        for search in search_data['search_history']:
            print(f"   Hop {search['hop']}: {search['results_count']} results")
        
        print(f"\n💡 Comprehensive Answer:")
        print(answer)
        
        return answer

# Initialize multi-hop RAG system using existing search service
multihop_rag = MultiHopRAG(session, search_service)
print("🚀 Multi-hop RAG system ready!")

In [None]:
# Complex comparison query that benefits from multi-hop search
comparison_query = "Compare NPSH required at 120% flow for Goulds 3196 vs Sulzer BE. Which is lower and what are the values?"

# Perform multi-hop search
search_results = multihop_rag.multi_hop_search(comparison_query, max_hops=4)

# Generate comprehensive answer
final_answer = multihop_rag.generate_comprehensive_answer(search_results)

## Example 1: Cross-Vendor NPSH Comparison

This demonstrates how multi-hop search ensures comprehensive coverage across all vendors for comparison queries.
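
One way to decide when a follow-up hop is needed is to detect comparison intent and list the vendors still missing from the first hop. The sketch below is an illustration under stated assumptions, not the notebook's `analyze_coverage_gaps` method: `COMPARISON_PATTERN`, `missing_vendors`, the whole-word matching, and the case-insensitive vendor check are hypothetical choices, while the keyword and vendor lists mirror the ones hard-coded in the class.

```python
import re
from typing import Dict, List

# Whole-word match so a keyword such as "vs" does not fire inside a longer token.
COMPARISON_PATTERN = re.compile(r"\b(compare|which|best|highest|vs)\b", re.IGNORECASE)
MAJOR_VENDORS = ["Sulzer", "Goulds", "Fristam"]

def missing_vendors(results: List[Dict], query: str) -> List[str]:
    """Return major vendors absent from `results` when the query is a comparison."""
    if not COMPARISON_PATTERN.search(query):
        return []
    # Normalize case and tolerate missing or None vendor values.
    found = {str(r.get("VENDOR") or "").lower() for r in results}
    return [v for v in MAJOR_VENDORS if v.lower() not in found]

first_hop = [{"VENDOR": "Sulzer"}, {"VENDOR": "sulzer"}, {"VENDOR": None}]
query = "Compare NPSH requirements between Sulzer BE and Goulds 3196 pumps"
print(missing_vendors(first_hop, query))  # ['Goulds', 'Fristam']
```

Each returned vendor can then be prefixed to the original query to form a targeted follow-up search.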

In [None]:
# Complex comparison query that benefits from multi-hop search
comparison_query = "Compare NPSH requirements between Sulzer BE and Goulds 3196 pumps at high flow rates"

# Perform multi-hop search
search_results = multihop_rag.multi_hop_search(comparison_query, max_hops=4)

# Generate comprehensive answer
final_answer = multihop_rag.generate_comprehensive_answer(search_results)

## Example 2: API 610 Compliance Analysis

Multi-hop search excels at finding all pumps meeting specific standards across different datasheets.

In [None]:
# Standards compliance query
standards_query = "Which pumps meet API 610 standards and what are their efficiency ratings and material specifications?"

# Perform multi-hop search
search_results = multihop_rag.multi_hop_search(standards_query, max_hops=4)

# Generate comprehensive answer
final_answer = multihop_rag.generate_comprehensive_answer(search_results)

## Example 3: High-Temperature Application Requirements

Complex application queries often require information from multiple sections and datasheets.

In [None]:
# Application-specific query
application_query = "What pumps are suitable for high-temperature corrosive applications and what materials and sealing systems are used?"

# Perform multi-hop search
search_results = multihop_rag.multi_hop_search(application_query, max_hops=3)

# Generate comprehensive answer
final_answer = multihop_rag.generate_comprehensive_answer(search_results)

## Enhanced Analysis: Single-Hop vs Multi-Hop Comparison

Let's compare the effectiveness of single-hop vs multi-hop approaches.

In [None]:
def compare_approaches(query: str):
    """Compare single-hop vs multi-hop search results"""
    print(f"📊 Comparing Single-Hop vs Multi-Hop: {query}")
    print("=" * 70)
    
    # Single-hop search (just initial search)
    print("\n🔍 Single-Hop Results:")
    single_results = multihop_rag.search_multimodal(query, limit=10)
    single_vendors = set(r.get('VENDOR', 'Unknown') for r in single_results)
    single_products = set(r.get('PRODUCT_ID', 'Unknown') for r in single_results)
    
    print(f"   Documents: {len(single_results)}")
    print(f"   Vendors: {len(single_vendors)} ({', '.join(single_vendors)})")
    print(f"   Products: {len(single_products)}")
    
    # Multi-hop search
    print("\n🔗 Multi-Hop Results:")
    multihop_data = multihop_rag.multi_hop_search(query, max_hops=3)
    multihop_results = multihop_data["all_results"]
    multihop_vendors = set(r.get('VENDOR', 'Unknown') for r in multihop_results)
    multihop_products = set(r.get('PRODUCT_ID', 'Unknown') for r in multihop_results)
    
    print(f"   Documents: {len(multihop_results)}")
    print(f"   Vendors: {len(multihop_vendors)} ({', '.join(multihop_vendors)})")
    print(f"   Products: {len(multihop_products)}")
    print(f"   Search Hops: {len(multihop_data['search_history'])}")
    
    # Analysis
    print("\n📈 Improvement Analysis:")
    # Signed format avoids output like "+-2" when multi-hop returns fewer documents
    print(f"   📄 Additional documents: {len(multihop_results) - len(single_results):+d}")
    print(f"   🏭 Additional vendors: +{len(multihop_vendors - single_vendors)}")
    print(f"   🔧 Additional products: +{len(multihop_products - single_products)}")
    
    new_vendors = multihop_vendors - single_vendors
    if new_vendors:
        print(f"   ✨ New vendors discovered: {', '.join(new_vendors)}")
    
    return {
        "single_hop": {"results": single_results, "vendors": single_vendors},
        "multi_hop": {"results": multihop_results, "vendors": multihop_vendors}
    }

# Test comparison with a complex query
test_query = "Compare efficiency and flow rate capabilities across pump models"
comparison_results = compare_approaches(test_query)

## Interactive Multi-Hop Query Interface

Test different types of complex queries with the multi-hop system.

In [None]:
# Sample complex queries for testing
sample_queries = [
    "Compare NPSH requirements and efficiency across all pump vendors",
    "Which pumps have the highest pressure ratings and what materials are used?",
    "Find pumps suitable for chemical processing with corrosion resistance details",
    "What are the dimensional requirements and installation considerations for API pumps?",
    "Compare maintenance intervals and service procedures across pump types"
]

print("🚀 Multi-Hop RAG Demo - Sample Complex Queries:")
print("=" * 55)

for i, query in enumerate(sample_queries, 1):
    print(f"{i}. {query}")

# Run a sample query (change index to test different queries)
selected_query = sample_queries[0]
print(f"\n🎯 Running: {selected_query}")
print("=" * 80)

# Execute multi-hop search
demo_results = multihop_rag.multi_hop_search(selected_query, max_hops=4)
demo_answer = multihop_rag.generate_comprehensive_answer(demo_results)

## Summary: Multi-Hop RAG Benefits

### ✅ When Multi-Hop RAG Excels:

- **Cross-vendor comparisons**: Ensures all major vendors are included
- **Standards compliance**: Finds all pumps meeting specific criteria
- **Complex specifications**: Gathers comprehensive technical data
- **Application analysis**: Combines requirements from multiple sources

### ⚡ Performance Considerations:

- **Latency**: Roughly 2-4x a single-hop query, since each hop adds one embedding call and one search call
- **Cost**: Additional embedding and search calls per hop, plus a longer completion prompt
- **Accuracy**: Better coverage but requires careful result filtering
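
To make the latency and cost trade-off concrete, the sketch below counts service calls per query. It assumes, as in the `MultiHopRAG` class above, that every hop issues one embedding call and one search call and that answer generation uses one completion call. `CallBudget` and `estimate_calls` are hypothetical helper names, and actual latency depends on the services themselves, not only on call counts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CallBudget:
    embeddings: int
    searches: int
    completions: int

    @property
    def total(self) -> int:
        return self.embeddings + self.searches + self.completions

def estimate_calls(hops: int) -> CallBudget:
    """Each hop = 1 embedding + 1 search; answer generation = 1 completion."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return CallBudget(embeddings=hops, searches=hops, completions=1)

single = estimate_calls(1)
multi = estimate_calls(4)
print(single.total, multi.total)  # 3 9
```

Under these assumptions a four-hop query makes three times as many service calls as a single-hop query.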

### 🎯 Best Practices:

1. **Use for complex queries**: Simple factual queries work fine with single-hop
2. **Limit hops**: Start with 3-4 hops, adjust based on results
3. **Filter duplicates**: Remove redundant results between hops
4. **Gap analysis**: Focus follow-ups on missing vendor/product coverage
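
The duplicate-filtering practice can be sketched as a small helper. The notebook's class deduplicates on `IMAGE_FILEPATH` alone; the fallback to a (`VENDOR`, `PRODUCT_ID`, `PAGE_NUMBER`) key for rows without a file path is an assumption added here, and `result_key`, `merge_new_results`, and the sample rows are hypothetical.

```python
from typing import Any, Dict, Hashable, List

def result_key(result: Dict[str, Any]) -> Hashable:
    """Prefer IMAGE_FILEPATH; fall back to a composite key when it is missing."""
    path = result.get("IMAGE_FILEPATH")
    if path:
        return ("path", path)
    return ("page", result.get("VENDOR"), result.get("PRODUCT_ID"), result.get("PAGE_NUMBER"))

def merge_new_results(
    existing: List[Dict[str, Any]], incoming: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Return the items of `incoming` not already in `existing`, without repeats."""
    seen = {result_key(r) for r in existing}
    fresh = []
    for r in incoming:
        key = result_key(r)
        if key not in seen:
            seen.add(key)  # also guards against duplicates inside `incoming`
            fresh.append(r)
    return fresh

# Made-up rows for illustration only.
hop1 = [{"IMAGE_FILEPATH": "a.png", "VENDOR": "Sulzer"}]
hop2 = [
    {"IMAGE_FILEPATH": "a.png", "VENDOR": "Sulzer"},
    {"IMAGE_FILEPATH": "b.png", "VENDOR": "Goulds"},
    {"IMAGE_FILEPATH": "b.png", "VENDOR": "Goulds"},
    {"VENDOR": "Fristam", "PRODUCT_ID": "X1", "PAGE_NUMBER": 3},
]
new = merge_new_results(hop1, hop2)
print([r["VENDOR"] for r in new])  # ['Goulds', 'Fristam']
```

Tracking keys as results are added keeps repeated pages within a single hop from reaching the answer prompt.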

### 🔧 Integration with Existing System:

This multi-hop layer seamlessly extends your existing multimodal RAG system:
- ✅ Uses same search service and embeddings
- ✅ Leverages existing metadata and attributes
- ✅ Compatible with current Streamlit apps
- ✅ No additional setup or data processing required