# 🌐 Generic RAG System - Interactive Demo

This notebook demonstrates the **Generic RAG System** that can work with any website. 

## ✨ Key Features:
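- **🌍 Universal**: Designed to work with any publicly crawlable website, without site-specific configuration
- **🧠 Smart**: Structure-aware content extraction that preserves page and section titles
- **⚡ Fast**: Caches scraped pages so repeated runs skip re-scraping
- **🤖 Integrated**: Works with Ollama for full answer generation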

## 📋 What You'll Learn:
1. How to scrape and process any website
2. How to perform semantic search on web content  
3. How to generate answers using retrieved context
4. How to evaluate system performance

## 🚀 Getting Started:
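Run the cells in step order (Step 1, then Step 2, then Step 3): Steps 2 and 3 rely on the `rag_system` and `success` variables that Step 1 creates. The system will: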
1. **Scrape** documentation from a website
2. **Process** it into semantic chunks
3. **Test** retrieval with sample queries
4. **Generate** complete answers (with Ollama)
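
The retrieval step can be illustrated with a small, dependency-free sketch. This is not the code in `src/rag_system.py`; it assumes a plain TF-IDF weighting with cosine similarity, and the function names (`tokenize`, `build_index`, `retrieve`) and sample chunks are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Map tokens to a sparse TF-IDF vector; terms unseen at index time are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(chunks):
    """Compute smoothed IDF weights and one TF-IDF vector per chunk."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(chunks)
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, idf, vectors, top_k=3):
    """Return up to top_k (chunk, score) pairs with a non-zero score, best first."""
    query_vec = vectorize(tokenize(query), idf)
    scored = sorted(((cosine(query_vec, vec), i) for i, vec in enumerate(vectors)), reverse=True)
    return [(chunks[i], score) for score, i in scored[:top_k] if score > 0]


# Illustrative chunks, not scraped content
chunks = [
    "List comprehensions provide a concise way to create lists.",
    "The try statement handles exceptions raised in its body.",
    "Decorators wrap a function to extend its behavior.",
]
idf, vectors = build_index(chunks)
results = retrieve("How do I handle exceptions?", chunks, idf, vectors)
for chunk, score in results:
    print(f"{score:.3f}  {chunk}")
```

Only the chunk that shares a term with the query gets a non-zero score here, which is why exact-term overlap matters so much for TF-IDF retrieval.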

---

In [6]:
# 🔧 STEP 3: Final Results and Usage Instructions
print("🎉 GENERIC RAG PIPELINE COMPLETE!")
print("=" * 80)

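# Show final statistics (requires Step 1 to have run successfully).
# globals().get() avoids a NameError when `success` was never defined.
if globals().get('rag_system') is not None and globals().get('success'):
    print("📊 Final Statistics:")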
    print(f"   Documents processed: {len(rag_system.structured_data.get('documents', []))}")
    print(f"   Semantic chunks: {len(rag_system.chunks)}")
    print(f"   TF-IDF matrix shape: {rag_system.tfidf_matrix.shape}")
    
    if rag_system.chunks:
        chunk_sizes = [meta.get('word_count', 0) for meta in rag_system.chunk_metadata]
        avg_chunk_size = sum(chunk_sizes) / len(chunk_sizes)
        complete_sections = sum(1 for meta in rag_system.chunk_metadata if meta.get('type') == 'complete_section')
        section_parts = sum(1 for meta in rag_system.chunk_metadata if meta.get('type') == 'section_part')
        
        print(f"   Average chunk size: {avg_chunk_size:.0f} words")
        print(f"   Complete sections: {complete_sections}")
        print(f"   Section parts: {section_parts}")

print(f"\n💡 SYSTEM FEATURES:")
print("🎯 Universal Compatibility:")
print("   • Works with any website automatically")
print("   • Structure-aware semantic chunking")
print("   • Preserved titles and hierarchical context")
print("   • Rich metadata (page, section, type, domain)")
print("🔍 Enhanced Retrieval:")
print("   • Dynamic TF-IDF configuration")
print("   • Boosted scoring for different content types")
print("   • Intelligent query preprocessing")
print("⚡ Performance Optimized:")
print("   • Smart caching system")
print("   • Automatic parameter tuning")
print("   • Fast semantic search")

print(f"\n🚀 USAGE INSTRUCTIONS:")
print("=" * 80)
print("1. 📄 For retrieval testing:")
print("   rag_system.demo_query('your question', top_k=3)")
print()
print("2. 🤖 For full RAG with Ollama:")
print("   rag_system.rag_query('your question', top_k=3, model='mistral')")
print()
print("3. 🌐 For different websites:")
print("   rag_system.scrape_and_process_website(['https://your-site.com/'])")
print()
print("4. 🔍 Expected performance:")
print("   • Similarity scores: 0.4+ for good matches")
print("   • Adapts to any website structure")
print("   • Respects robots.txt and rate limits")

print(f"\n📋 NEXT STEPS:")
print("• Test with different websites and domains")
print("• Use with Ollama for full answer generation")
print("• Experiment with different top_k values (3-7)")
print("• Try various technical and general questions")
print("• Evaluate answer quality across domains")

🎉 GENERIC RAG PIPELINE COMPLETE!

💡 SYSTEM FEATURES:
🎯 Universal Compatibility:
   • Works with any website automatically
   • Structure-aware semantic chunking
   • Preserved titles and hierarchical context
   • Rich metadata (page, section, type, domain)
🔍 Enhanced Retrieval:
   • Dynamic TF-IDF configuration
   • Boosted scoring for different content types
   • Intelligent query preprocessing
⚡ Performance Optimized:
   • Smart caching system
   • Automatic parameter tuning
   • Fast semantic search

🚀 USAGE INSTRUCTIONS:
1. 📄 For retrieval testing:
   rag_system.demo_query('your question', top_k=3)

2. 🤖 For full RAG with Ollama:
   rag_system.rag_query('your question', top_k=3, model='mistral')

3. 🌐 For different websites:
   rag_system.scrape_and_process_website(['https://your-site.com/'])

4. 🔍 Expected performance:
   • Similarity scores: 0.4+ for good matches
   • Adapts to any website structure
   • Respects robots.txt and rate limits

📋 NEXT STEPS:
• Test with different web
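
The `complete_section` and `section_part` chunk types counted in the statistics can be illustrated with a short sketch. It assumes that a section is kept whole when it fits a word budget and split into consecutive parts otherwise; the function name `chunk_sections` and the `max_words` limit are hypothetical and may not match the project's chunker.

```python
def chunk_sections(sections, max_words=200):
    """Turn (title, text) sections into chunks plus per-chunk metadata.

    Sections within max_words stay whole ('complete_section');
    longer ones are cut into consecutive 'section_part' chunks.
    """
    chunks, metadata = [], []
    for title, text in sections:
        words = text.split()
        if not words:
            continue  # skip empty sections
        if len(words) <= max_words:
            parts, kind = [words], "complete_section"
        else:
            parts = [words[i:i + max_words] for i in range(0, len(words), max_words)]
            kind = "section_part"
        for part in parts:
            # Prefix the title so every chunk keeps its section context
            chunks.append(title + "\n" + " ".join(part))
            metadata.append({"section_title": title, "type": kind, "word_count": len(part)})
    return chunks, metadata


# Synthetic sections: one short, one long
sections = [("Intro", "word " * 50), ("Long section", "word " * 450)]
chunks, metadata = chunk_sections(sections)
sizes = [meta["word_count"] for meta in metadata]
print(f"Chunks: {len(chunks)}, average size: {sum(sizes) / len(sizes):.0f} words")
```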

In [7]:
# 🧠 STEP 2: Test Generic RAG System
print("🧠 STEP 2: Testing Generic RAG System")
print("=" * 80)

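# Use the RAG system created in Step 1.
# globals().get() avoids a NameError when `success` was never defined.
if globals().get('rag_system') is not None and globals().get('success'):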
    print("✅ Generic RAG System ready for testing!")
    
    # Test with generic questions appropriate for Python docs
    test_questions = [
        "What are Python data types?",
        "How to handle exceptions in Python?", 
        "What are Python decorators?",
        "How to use list comprehensions?"
    ]
    
    print(f"\n🧪 Testing {len(test_questions)} questions:")
    
    # Store results for comparison
    results = []
    
    for i, question in enumerate(test_questions, 1):
        print(f"\n{'='*60}")
        print(f"🔍 Question {i}: {question}")
        print("-" * 40)
        
        # Get detailed results
        contexts, metadata = rag_system.retrieve_context(question, top_k=3)
        
        if contexts and metadata:
            max_score = max(meta['boosted_score'] for meta in metadata)
            avg_score = sum(meta['boosted_score'] for meta in metadata) / len(metadata)
            results.append((question, max_score, avg_score))
            
            print(f"✅ Retrieved {len(contexts)} relevant chunks")
            print(f"📊 Max Score: {max_score:.3f} | Avg Score: {avg_score:.3f}")
            
            # Show top result details
            top_meta = metadata[0]
            print(f"🏆 Top Result: {top_meta['page_title']} - {top_meta['section_title']}")
            print(f"   Type: {top_meta['type']} | Words: {top_meta['word_count']}")
            print(f"   Preview: {contexts[0][:150]}...")
        else:
            results.append((question, 0.0, 0.0))
            print("❌ No relevant chunks found")
    
    # Overall performance summary
    if results:
        avg_max_scores = sum(result[1] for result in results) / len(results)
        avg_avg_scores = sum(result[2] for result in results) / len(results)
        
        print(f"\n📊 SYSTEM PERFORMANCE:")
        print(f"   Average Max Score: {avg_max_scores:.3f}")
        print(f"   Average Avg Score: {avg_avg_scores:.3f}")
        
        excellent_results = sum(1 for result in results if result[1] > 0.6)
        good_results = sum(1 for result in results if result[1] > 0.4)
        
        print(f"   Excellent results (>0.6): {excellent_results}/{len(results)}")
        print(f"   Good results (>0.4): {good_results}/{len(results)}")

else:
    print("❌ RAG system not available from previous step")

🧠 STEP 2: Testing Generic RAG System
❌ RAG system not available from previous step
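
The `boosted_score` field read in Step 2 suggests that raw similarity is adjusted by content type before ranking. A minimal sketch of that idea follows, assuming a simple multiplicative boost; the `TYPE_BOOST` factors and the `boost_and_rank` helper are hypothetical, since the actual weights are not shown in this notebook.

```python
# Hypothetical boost factors; the project's real values are not shown here
TYPE_BOOST = {"complete_section": 1.2, "section_part": 1.0}


def boost_and_rank(candidates, top_k=3):
    """Rank (similarity, metadata) pairs by similarity times a per-type boost."""
    ranked = []
    for similarity, meta in candidates:
        factor = TYPE_BOOST.get(meta.get("type"), 1.0)  # unknown types are not boosted
        ranked.append({**meta, "boosted_score": similarity * factor})
    ranked.sort(key=lambda m: m["boosted_score"], reverse=True)
    return ranked[:top_k]


candidates = [
    (0.50, {"type": "section_part", "section_title": "Part of a long section"}),
    (0.45, {"type": "complete_section", "section_title": "Short, self-contained section"}),
]
ranked = boost_and_rank(candidates)
for meta in ranked:
    print(f"{meta['boosted_score']:.3f}  {meta['section_title']}")
```

With these made-up factors the complete section overtakes the higher raw similarity of the partial one, which is the kind of reordering a boost is meant to produce.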


In [8]:
# 🚀 STEP 1: Run the Complete Generic RAG Pipeline
import sys

# Make the project importable; adjust this path to your own checkout
sys.path.append('/home/rkpatel/RAG')

# Import generic RAG system
from src.rag_system import RAGSystem

print("🚀 GENERIC RAG PIPELINE")
print("=" * 80)

# Initialize the RAG system
rag_system = RAGSystem()

# Example: Scrape Python documentation
start_urls = ["https://docs.python.org/3/"]
output_file = "data/python_docs_notebook.json"

print("📄 Scraping and processing documentation (will use cache if available)...")
print("   This will scrape Python documentation for demonstration...")

success = rag_system.scrape_and_process_website(
    start_urls=start_urls,
    max_pages=15,
    output_file=output_file,
    same_domain_only=True,
    max_depth=2
)

if success:
    print("✅ System ready!")
    print(f"📊 Processed: {len(rag_system.chunks)} chunks")
    print(f"📊 Data file: {output_file}")
    print("\n✅ Step 1 Complete: Generic RAG system ready!")
else:
    print("❌ Failed to initialize system")

🚀 GENERIC RAG PIPELINE
📄 Scraping and processing documentation (will use cache if available)...
   This will scrape Python documentation for demonstration...
🚀 RAG: Scraping and processing website...
🌐 Scraping website from: https://docs.python.org/3/
🚀 Starting generic website scraping...
   Starting URLs: 1
   Max pages: 15
   Same domain only: True
   Max depth: 2
🔍 Discovering URLs from 1 starting points...
   Found 15 URLs

📄 Processing 1/15: /3/
   📄 Processing: https://docs.python.org/3/
      ✅ Extracted 1 sections

📄 Processing 2/15: /3/download.html
   📄 Processing: https://docs.python.org/3/download.html
      ✅ Extracted 3 sections

📄 Processing 3/15: /3.15/
   📄 Processing: https://docs.python.org/3.15/
      ✅ Extracted 1 sections

📄 Processing 4/15: /3.14/
   📄 Processing: https://docs.python.org/3.14/
      ✅ Extracted 1 sections

📄 Processing 5/15: /3.13/
   📄 Processing: https://docs.python.org/3.13/
      ✅ Extracted 1 sections

📄 Processing 6/15: /3.12/
   📄 Process

In [9]:
# ✅ QUICK TEST - Complete working example
from src.rag_system import RAGSystem

# Initialize the system
rag_system = RAGSystem()

# Example: Quick test with any website
print("🔄 Quick test with generic RAG system...")

# You can change this URL to any website you want to test
test_urls = ["https://httpbin.org/"]  # Simple API testing service

success = rag_system.scrape_and_process_website(
    start_urls=test_urls,
    max_pages=3,
    output_file="data/quick_test_notebook.json",
    use_cache=True
)

if success:
    print("✅ System initialized successfully!")
    
    # Test query
    result = rag_system.demo_query("What is this website about?", top_k=2)
    
    # The demo_query method prints details and returns a summary
    print("\n" + "="*60)
    print("📋 QUICK TEST COMPLETE")
    
else:
    print("❌ Failed to initialize.")
    print("💡 Try a different website or check internet connection")

🔄 Quick test with generic RAG system...
🚀 RAG: Scraping and processing website...
🌐 Scraping website from: https://httpbin.org/
🚀 Starting generic website scraping...
   Starting URLs: 1
   Max pages: 3
   Same domain only: True
   Max depth: 2
🔍 Discovering URLs from 1 starting points...
   Found 2 URLs

📄 Processing 1/2: /
   📄 Processing: https://httpbin.org/
      ✅ Extracted 1 sections


KeyboardInterrupt: 
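
The notebook states that the scraper respects robots.txt, but the check itself is not shown. One way such a check could look is sketched below using the standard library's `urllib.robotparser`. The rules are parsed from an inline string so the example runs offline; the user agent name and the `make_robots_checker` helper are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="generic-rag-demo"):
    """Build an allowed(url) predicate from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        return parser.can_fetch(user_agent, url)

    return allowed


# Illustrative rules, not fetched from any real site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

allowed = make_robots_checker(ROBOTS_TXT)
for url in ["https://example.com/docs/", "https://example.com/private/page"]:
    print(url, "->", "fetch" if allowed(url) else "skip")
```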

In [None]:
# 🚀 High-Performance Async Query with Ollama Generation
import asyncio
import sys
import os
import importlib

sys.path.insert(0, '/home/rkpatel/RAG')

print("⚡ Setting up system with HIGH-PERFORMANCE ASYNC scraper...")
print("🚀 Features: concurrent workers and smart caching (expected to be faster on multi-page sites)")

# Force reload modules to ensure we have latest version
print("🔄 Loading latest modules...")
try:
    import src.rag_system
    import src.async_web_scraper
    importlib.reload(src.rag_system)
    importlib.reload(src.async_web_scraper)
    from src.rag_system import RAGSystem
    print("✅ Modules loaded successfully")
except Exception as e:
    print(f"⚠️ Module import error: {e}")

# Check if we're in a Jupyter environment and set up async compatibility.
# asyncio.get_running_loop() raises RuntimeError when no loop is running,
# which avoids the deprecated implicit-loop behavior of get_event_loop().
try:
    asyncio.get_running_loop()
    jupyter_mode = True
except RuntimeError:
    jupyter_mode = False

if jupyter_mode:
    print("📓 Jupyter environment detected")
    try:
        import nest_asyncio
    except ImportError:
        print("⚠️ Installing nest_asyncio for Jupyter compatibility...")
        import subprocess
        subprocess.run([sys.executable, "-m", "pip", "install", "nest-asyncio"], check=True)
        import nest_asyncio
    nest_asyncio.apply()
    print("✅ Async compatibility enabled")

# Initialize RAGSystem 
rag_system = None
print("\n🔧 Initializing RAG system...")

try:
    rag_system = RAGSystem()
    print("✅ RAGSystem initialized successfully")
except Exception as e:
    print(f"❌ RAGSystem initialization failed: {e}")
    rag_system = None

if rag_system is None:
    print("❌ Could not initialize RAG system")
else:
    # Use a more reliable website for demo
    # Some sites block concurrent requests, so we'll use Python docs which is more reliable
    start_urls = ["https://docs.python.org/3/tutorial/"]

    # Use async scraping for much faster performance
    async def setup_async_system():
        try:
            print("🌐 Starting high-performance async scraping...")
            success = await rag_system.scrape_and_process_website_async(
                start_urls=start_urls,
                max_pages=20,  # Moderate size for demo
                output_file="data/python_tutorial_async.json",
                concurrent_limit=4,        # Conservative for reliability
                requests_per_second=6.0,   # Respectful rate limiting
                use_cache=True            # Smart caching
            )
            return success
        except Exception as e:
            print(f"❌ Async scraping error: {e}")
            return False

    # Try async first; fall back to sync when it does not succeed.
    # setup_async_system() catches its own exceptions and returns False,
    # so the fallback is driven by the return value rather than an except block.
    success = await setup_async_system()
    if success:
        print("✅ Async scraping completed successfully!")
    else:
        print("🔄 Falling back to synchronous scraping...")

        # Fallback to sync scraping
        rag_system_sync = RAGSystem()
        success = rag_system_sync.scrape_and_process_website(
            start_urls=start_urls,
            max_pages=20,
            output_file="data/python_tutorial_sync_fallback.json",
            use_cache=True
        )
        if success:
            rag_system = rag_system_sync  # Use sync system for queries
            print("✅ Sync fallback completed successfully!")

    if success:
        # Your query - change this to anything you want to ask about Python
        query = "How to use list comprehensions in Python?"
        print(f"\n🔍 Query: {query}")
        print("=" * 60)

        # Get full answer from Ollama
        try:
            result = rag_system.rag_query(query, top_k=5, model="mistral")
            print("🤖 Ollama Response:")
            print(result)
        except Exception as e:
            print(f"❌ Ollama not available: {e}")
            print("\n💡 Fallback - showing retrieval only:")
            try:
                result = rag_system.demo_query(query, top_k=5)
            except Exception as demo_error:
                print(f"❌ Demo query failed: {demo_error}")
                
    else:
        print("❌ Failed to set up system with both async and sync methods")
        print("💡 Troubleshooting tips:")
        print("• Check internet connection")
        print("• Try with a simpler URL like 'https://example.com'")
        print("• Verify the target website is accessible")
        print("• Some sites may block concurrent requests")
        
print("\n💡 To use Ollama:")
print("1. Start Ollama: ollama serve")  
print("2. Install model: ollama pull mistral")
print("3. Run this cell again")

print(f"\n⚡ Performance Benefits of Async Scraper:")
print("• Can be several times faster on large websites (depends on the site and rate limits)")
print("• Multiple concurrent workers processing URLs in parallel")
print("• Smart caching avoids re-scraping same content")
print("• Maintains same high-quality content extraction")
print("• Built-in retry logic for failed requests")
print("• Automatic fallback to sync if async fails")
print("• Module reloading ensures latest code is used")
print("• Conservative settings for better website compatibility")
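
The `concurrent_limit` and `requests_per_second` arguments used above suggest two separate controls: a cap on in-flight requests and a cap on how often new requests start. The sketch below shows one common way to combine them with `asyncio`. It is an assumption about the general technique, not the implementation in `src/async_web_scraper.py`; `RateLimiter`, `crawl`, and `fake_fetch` are hypothetical names, and no network access is performed.

```python
import asyncio
import time


class RateLimiter:
    """Spaces out request starts so that at most `rate` begin per second."""

    def __init__(self, rate):
        self._interval = 1.0 / rate
        self._next_slot = 0.0

    async def wait(self):
        # No await between reading and updating _next_slot, so no lock is needed
        now = time.monotonic()
        start_at = max(now, self._next_slot)
        self._next_slot = start_at + self._interval
        await asyncio.sleep(start_at - now)


async def crawl(urls, fetch, concurrent_limit=4, requests_per_second=6.0):
    """Fetch all urls with bounded concurrency and a bounded start rate."""
    semaphore = asyncio.Semaphore(concurrent_limit)
    limiter = RateLimiter(requests_per_second)

    async def worker(url):
        async with semaphore:       # at most concurrent_limit workers in flight
            await limiter.wait()    # at most requests_per_second starts per second
            return await fetch(url)

    # gather() returns results in the same order as the input urls
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url):
    """Stand-in for an HTTP request."""
    await asyncio.sleep(0.01)
    return f"content of {url}"


if __name__ == "__main__":
    pages = asyncio.run(crawl(["/a", "/b", "/c"], fake_fetch, concurrent_limit=2, requests_per_second=50.0))
    print(pages)
```

Inside a notebook cell, where an event loop is already running, call `await crawl(...)` directly instead of `asyncio.run(...)`.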

In [None]:
# 🧪 Test Async Functionality (Quick Verification)
import asyncio
import sys
import os
import importlib

sys.path.insert(0, '/home/rkpatel/RAG')  # Ensure correct path priority

print("🧪 ASYNC FUNCTIONALITY TEST")
print("=" * 50)

# Test 1: Force reload modules to get latest versions
print("🔄 Reloading modules to get latest versions...")
try:
    # Import and reload the modules
    import src.rag_system
    import src.async_web_scraper
    
    # Force reload to get latest changes
    importlib.reload(src.rag_system)
    importlib.reload(src.async_web_scraper)
    
    # Now import the classes
    from src.rag_system import RAGSystem
    from src.async_web_scraper import scrape_website_fast
    
    print("✅ Module reload and imports successful")
    
    # Check if use_async parameter exists
    import inspect
    sig = inspect.signature(RAGSystem.__init__)
    params = sig.parameters
    
    if 'use_async' in params:
        print(f"✅ use_async parameter found (default: {params['use_async'].default})")
    else:
        print(f"❌ use_async parameter missing. Available: {list(params.keys())}")
        
except Exception as e:
    print(f"❌ Import/reload error: {e}")
    print("💡 Check if you've run the setup cells above")

# Test 2: Simple async function
async def test_async():
    await asyncio.sleep(0.1)
    return "✅ Async execution working"

# Test 3: Event loop detection and setup
try:
    # get_running_loop() raises RuntimeError when no loop is running,
    # which avoids the deprecated implicit-loop behavior of get_event_loop()
    try:
        asyncio.get_running_loop()
        loop_running = True
    except RuntimeError:
        loop_running = False

    if loop_running:
        print("📓 Jupyter event loop detected")
        # Apply nest_asyncio for Jupyter compatibility
        try:
            import nest_asyncio
            nest_asyncio.apply()
            print("✅ nest_asyncio applied for Jupyter compatibility")
        except ImportError:
            print("⚠️ nest_asyncio not available - installing...")
            import subprocess
            subprocess.run([sys.executable, "-m", "pip", "install", "nest-asyncio"], check=True)
            import nest_asyncio
            nest_asyncio.apply()
            print("✅ nest_asyncio installed and applied")
    else:
        print("🐍 Standard Python environment")
        
    # Test async execution
    result = await test_async()
    print(result)
    
except Exception as e:
    print(f"❌ Async test failed: {e}")

# Test 4: RAGSystem initialization test
print("\n🔧 Testing RAGSystem initialization...")
try:
    # Try with use_async parameter
    rag_system = RAGSystem(use_async=True)
    print("✅ RAGSystem(use_async=True) successful")
    
    # Try without parameter (default)
    rag_system_default = RAGSystem()
    print("✅ RAGSystem() with defaults successful")
    
except Exception as e:
    print(f"❌ RAGSystem initialization failed: {e}")
    print("💡 Trying alternative approach...")
    try:
        # Fallback: initialize without use_async parameter
        rag_system = RAGSystem()
        rag_system.use_async = True  # Set manually
        print("✅ Manual use_async setting successful")
    except Exception as e2:
        print(f"❌ Alternative approach failed: {e2}")

# Test 5: Quick async scraper test with simple URL
print("\n🌐 Testing async scraper with simple URL...")

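async def quick_async_test():
    # Reuse the instance from Test 4 when it exists. locals() inside a function
    # only sees the function's own scope, so check globals() instead.
    global rag_system
    try:
        if globals().get('rag_system') is None:
            rag_system = RAGSystem()
            rag_system.use_async = True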
            
        success = await rag_system.scrape_and_process_website_async(
            start_urls=["https://example.com/"],
            max_pages=1,
            output_file="data/async_test_notebook.json",
            concurrent_limit=2,
            requests_per_second=5.0,
            use_cache=True
        )
        return success, "✅ Async scraper test successful"
    except Exception as e:
        return False, f"❌ Async scraper test failed: {e}"

try:
    success, message = await quick_async_test()
    print(message)
    if success:
        print("🎉 Async functionality is working correctly!")
        print("💡 You can now run the full async demo above")
    else:
        print("⚠️ Async functionality has issues")
        print("💡 The notebook will fall back to sync mode automatically")
except Exception as e:
    print(f"❌ Quick test failed: {e}")
    print("💡 Will use sync fallback in main demo")

print("\n📋 Test Complete - Ready for main demo!")
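
The event-loop check in this cell can be reduced to a small helper. The sketch below relies on `asyncio.get_running_loop()`, which raises `RuntimeError` when no loop is running; the helper name `running_in_event_loop` is made up for illustration.

```python
import asyncio


def running_in_event_loop():
    """Return True when called from code executing inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def probe():
    # Inside a coroutine a loop is always running
    return running_in_event_loop()


if __name__ == "__main__":
    print("Top level:", running_in_event_loop())
    if not running_in_event_loop():
        print("Inside coroutine:", asyncio.run(probe()))
```

In a plain script the helper returns `False` at top level, so `asyncio.run(main())` is the right entry point; in a notebook it returns `True`, and `await main()` should be used instead.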

In [None]:
# ⚡ Performance Comparison: Sync vs Async Scraping
import time
import asyncio
from src.rag_system import RAGSystem

print("🏁 PERFORMANCE COMPARISON: Sync vs Async Scraping")
print("=" * 70)

# Test URLs - use reliable sites for fair comparison
# Note: Some sites like pytorch.org may block concurrent requests
test_urls = ["https://fastapi.tiangolo.com/"]  # Very reliable test site

# Test 1: Original Synchronous Scraper
print("🐌 Testing SYNC Scraper...")
sync_rag = RAGSystem()

start_time = time.time()
sync_success = sync_rag.scrape_and_process_website(
    start_urls=test_urls,
    max_pages=10,
    output_file="data/sync_comparison_notebook.json",
    use_cache=False  # Force fresh scraping
)
sync_duration = time.time() - start_time

print(f"   ⏱️ Sync Duration: {sync_duration:.2f}s")

# Test 2: New Asynchronous Scraper
print("\n⚡ Testing ASYNC Scraper...")
async_rag = RAGSystem()

async def test_async():
    start_time = time.time()
    success = await async_rag.scrape_and_process_website_async(
        start_urls=test_urls,
        max_pages=10,  # Same page budget as the sync run so the timings are comparable
        output_file="data/async_comparison_notebook.json",
        concurrent_limit=2,        # Conservative for reliability
        requests_per_second=3.0,   # Conservative rate
        use_cache=False            # Force fresh scraping
    )
    duration = time.time() - start_time
    return success, duration

async_success, async_duration = await test_async()

print(f"   ⏱️ Async Duration: {async_duration:.2f}s")

# Performance Analysis
if sync_success and async_success:
    improvement = sync_duration / async_duration if async_duration > 0 else 1
    time_saved = sync_duration - async_duration
    percent_faster = ((sync_duration - async_duration) / sync_duration) * 100 if sync_duration > 0 else 0
    
    print(f"\n🎯 PERFORMANCE RESULTS:")
    print(f"   • Speed ratio (sync time / async time): {improvement:.1f}x")
    print(f"   • Time saved by async: {time_saved:.2f}s ({percent_faster:.1f}% of sync time)")
    print(f"   • Both completed successfully: ✅")
    
    print("\n📊 Rough expectations (not measured in this notebook):")
    print(f"   • Small sites (1-5 pages): Similar performance")
    print(f"   • Medium sites (10-50 pages): 2-5x faster with async")  
    print(f"   • Large sites (100+ pages): 5-10x faster with async")
    print(f"   • Complex sites: Major async advantages!")
    
    print(f"\n💡 Async Benefits:")
    print("   • Concurrent processing of multiple URLs")
    print("   • Better resource utilization") 
    print("   • Maintains same quality extraction")
    print("   • Respects rate limits and robots.txt")
    
else:
    print("⚠️ One or both tests failed - check network connection")
    if not sync_success:
        print("   ❌ Sync test failed")
    if not async_success:
        print("   ❌ Async test failed")

print("\n✨ The async scraper overlaps network waits by fetching pages concurrently.")
print("💡 Try with larger websites to see dramatic performance gains!")
print("💡 Note: Some sites (like pytorch.org) may block concurrent requests")

🏁 PERFORMANCE COMPARISON: Sync vs Async Scraping
🐌 Testing SYNC Scraper...
🚀 RAG: Scraping and processing website...
🌐 Scraping website from: https://fastapi.tiangolo.com/
🚀 Starting generic website scraping...
   Starting URLs: 1
   Max pages: 10
   Same domain only: True
   Max depth: 2
🔍 Discovering URLs from 1 starting points...
   Found 10 URLs

📄 Processing 1/10: /
   📄 Processing: https://fastapi.tiangolo.com/
      ✅ Extracted 21 sections

📄 Processing 2/10: /newsletter/
   📄 Processing: https://fastapi.tiangolo.com/newsletter/
      ✅ Extracted 0 sections

📄 Processing 3/10: /az/
   📄 Processing: https://fastapi.tiangolo.com/az/
      ✅ Extracted 17 sections

📄 Processing 4/10: /bn/
   📄 Processing: https://fastapi.tiangolo.com/bn/
      ✅ Extracted 17 sections

📄 Processing 5/10: /de/
   📄 Processing: https://fastapi.tiangolo.com/de/
      ✅ Extracted 21 sections



📄 Processing 6/10: /es/
   📄 Processing: https://fastapi.tiangolo.com/es/
      ✅ Extracted 20 sections

📄 Processing 7/10: /fa/
   📄 Processing: https://fastapi.tiangolo.com/fa/
