# 🌐 Generic RAG System - Interactive Demo

This notebook demonstrates the **Generic RAG System** that can work with any website. 

## ✨ Key Features:
- **🌍 Universal**: Works with any website automatically
- **🧠 Smart**: Structure-aware content extraction  
- **⚡ Fast**: Intelligent caching and processing
- **🤖 Integrated**: Works with Ollama for full text generation

## 📋 What You'll Learn:
1. How to scrape and process any website
2. How to perform semantic search on web content  
3. How to generate answers using retrieved context
4. How to evaluate system performance
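
The Step 3 cell prints a TF-IDF matrix shape, so item 2 can be pictured as TF-IDF ranking with cosine similarity. The block below is a minimal standard-library sketch of that general technique under that assumption. It is not the `RAGSystem` implementation, and `tokenize`, `vectorize`, `build_index`, and `retrieve` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens, idf):
    """Build an L2-normalised TF-IDF vector; terms missing from idf are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def build_index(chunks):
    """Return (idf, vectors) for a list of text chunks, using smoothed IDF."""
    tokenized = [tokenize(c) for c in chunks]
    n = len(tokenized)
    # Document frequency: number of chunks that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]


def retrieve(query, chunks, idf, vectors, top_k=3):
    """Rank chunks by cosine similarity to the query; drop zero-score chunks."""
    q = vectorize(tokenize(query), idf)
    scored = [
        (sum(w * vec.get(t, 0.0) for t, w in q.items()), chunk)
        for chunk, vec in zip(chunks, vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, c) for s, c in scored[:top_k] if s > 0]


# Illustrative data only; real chunks would come from the scraped pages
chunks = [
    "Functions are defined with the def keyword.",
    "Lists hold ordered items.",
    "Exceptions are handled with try and except.",
]
idf, vectors = build_index(chunks)
for score, chunk in retrieve("How to define functions?", chunks, idf, vectors):
    print(f"{score:.3f}  {chunk}")
```

Because TF-IDF matches on shared terms, a query only scores against chunks that contain at least one of its words; "define" does not match "defined" in this sketch, since no stemming is applied.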

## 🚀 Getting Started:
Run the cells in step order (STEP 1, then STEP 2, then STEP 3, as labelled in each cell's first comment). The system will:
1. **Scrape** documentation from a website
2. **Process** it into semantic chunks
3. **Test** retrieval with sample queries
4. **Generate** complete answers (with Ollama)
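
For the **Generate** step, the retrieved chunks have to be combined with the question into a single prompt before it is sent to the model. How `rag_query` does this is not shown in this notebook, so the following is only a sketch under assumptions: `build_prompt`, its `max_chars` limit, and the prompt wording are all hypothetical.

```python
def build_prompt(question, contexts, max_chars=2000):
    """Number the retrieved chunks, cap their total length, and append the question.

    Hypothetical helper: the wording and the max_chars budget are assumptions,
    not taken from the RAGSystem code.
    """
    parts = []
    used = 0
    for i, ctx in enumerate(contexts, 1):
        snippet = ctx.strip()
        # Stop adding chunks once the character budget would be exceeded
        if used + len(snippet) > max_chars:
            break
        parts.append(f"[{i}] {snippet}")
        used += len(snippet)
    context_block = "\n\n".join(parts) if parts else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Illustrative call with made-up context strings
print(build_prompt(
    "How to define functions in Python?",
    ["The keyword def introduces a function definition."],
))
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports its answer, and the length cap keeps the prompt within a model's context window.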

---

In [None]:
# 🔧 STEP 3: Final Results and Usage Instructions
print("🎉 GENERIC RAG PIPELINE COMPLETE!")
print("=" * 80)

# Show final statistics (globals().get avoids a NameError if the STEP 1 cell was skipped)
if 'rag_system' in globals() and globals().get('success'):
    print("📊 Final Statistics:")
    print(f"   Documents processed: {len(rag_system.structured_data.get('documents', []))}")
    print(f"   Semantic chunks: {len(rag_system.chunks)}")
    print(f"   TF-IDF matrix shape: {rag_system.tfidf_matrix.shape}")
    
    if rag_system.chunks:
        chunk_sizes = [meta.get('word_count', 0) for meta in rag_system.chunk_metadata]
        avg_chunk_size = sum(chunk_sizes) / len(chunk_sizes)
        complete_sections = sum(1 for meta in rag_system.chunk_metadata if meta.get('type') == 'complete_section')
        section_parts = sum(1 for meta in rag_system.chunk_metadata if meta.get('type') == 'section_part')
        
        print(f"   Average chunk size: {avg_chunk_size:.0f} words")
        print(f"   Complete sections: {complete_sections}")
        print(f"   Section parts: {section_parts}")

print("\n💡 SYSTEM FEATURES:")
print("🎯 Universal Compatibility:")
print("   • Works with any website automatically")
print("   • Structure-aware semantic chunking")
print("   • Preserved titles and hierarchical context")
print("   • Rich metadata (page, section, type, domain)")
print("🔍 Enhanced Retrieval:")
print("   • Dynamic TF-IDF configuration")
print("   • Boosted scoring for different content types")
print("   • Intelligent query preprocessing")
print("⚡ Performance Optimized:")
print("   • Smart caching system")
print("   • Automatic parameter tuning")
print("   • Fast semantic search")

print("\n🚀 USAGE INSTRUCTIONS:")
print("=" * 80)
print("1. 📄 For retrieval testing:")
print("   rag_system.demo_query('your question', top_k=3)")
print()
print("2. 🤖 For full RAG with Ollama:")
print("   rag_system.rag_query('your question', top_k=3, model='mistral')")
print()
print("3. 🌐 For different websites:")
print("   rag_system.scrape_and_process_website(['https://your-site.com/'])")
print()
print("4. 🔍 Expected performance:")
print("   • Similarity scores: 0.4+ for good matches")
print("   • Adapts to any website structure")
print("   • Respects robots.txt and rate limits")

print("\n📋 NEXT STEPS:")
print("• Test with different websites and domains")
print("• Use with Ollama for full answer generation")
print("• Experiment with different top_k values (3-7)")
print("• Try various technical and general questions")
print("• Evaluate answer quality across domains")

In [None]:
# 🧠 STEP 2: Test Generic RAG System
print("🧠 STEP 2: Testing Generic RAG System")
print("=" * 80)

# Use the RAG system from the STEP 1 cell (guarded so a skipped cell does not raise NameError)
if 'rag_system' in globals() and globals().get('success'):
    print("✅ Generic RAG System ready for testing!")
    
    # Test with generic questions appropriate for Python docs
    test_questions = [
        "What are Python data types?",
        "How to handle exceptions in Python?", 
        "What are Python decorators?",
        "How to use list comprehensions?"
    ]
    
    print(f"\n🧪 Testing {len(test_questions)} questions:")
    
    # Store results for comparison
    results = []
    
    for i, question in enumerate(test_questions, 1):
        print(f"\n{'='*60}")
        print(f"🔍 Question {i}: {question}")
        print("-" * 40)
        
        # Get detailed results
        contexts, metadata = rag_system.retrieve_context(question, top_k=3)
        
        if contexts and metadata:
            max_score = max(meta['boosted_score'] for meta in metadata)
            avg_score = sum(meta['boosted_score'] for meta in metadata) / len(metadata)
            results.append((question, max_score, avg_score))
            
            print(f"✅ Retrieved {len(contexts)} relevant chunks")
            print(f"📊 Max Score: {max_score:.3f} | Avg Score: {avg_score:.3f}")
            
            # Show top result details
            top_meta = metadata[0]
            print(f"🏆 Top Result: {top_meta['page_title']} - {top_meta['section_title']}")
            print(f"   Type: {top_meta['type']} | Words: {top_meta['word_count']}")
            print(f"   Preview: {contexts[0][:150]}...")
        else:
            results.append((question, 0.0, 0.0))
            print("❌ No relevant chunks found")
    
    # Overall performance summary
    if results:
        avg_max_scores = sum(result[1] for result in results) / len(results)
        avg_avg_scores = sum(result[2] for result in results) / len(results)
        
        print(f"\n📊 SYSTEM PERFORMANCE:")
        print(f"   Average Max Score: {avg_max_scores:.3f}")
        print(f"   Average Avg Score: {avg_avg_scores:.3f}")
        
        excellent_results = sum(1 for result in results if result[1] > 0.6)
        good_results = sum(1 for result in results if result[1] > 0.4)
        
        print(f"   Excellent results (>0.6): {excellent_results}/{len(results)}")
        print(f"   Good results (>0.4): {good_results}/{len(results)}")

else:
    print("❌ RAG system not available from previous step")

In [None]:
# 🚀 STEP 1: Run the Complete Generic RAG Pipeline
import sys
# Make the project root importable; adjust this path to your own checkout
sys.path.append('/home/rkpatel/RAG')

# Import generic RAG system
from src.rag_system import RAGSystem

print("🚀 GENERIC RAG PIPELINE")
print("=" * 80)

# Initialize the RAG system
rag_system = RAGSystem()

# Example: Scrape Python documentation
start_urls = ["https://docs.python.org/3/"]
output_file = "data/python_docs_notebook.json"

print("📄 Scraping and processing documentation (will use cache if available)...")
print("   This will scrape Python documentation for demonstration...")

success = rag_system.scrape_and_process_website(
    start_urls=start_urls,
    max_pages=15,
    output_file=output_file,
    same_domain_only=True,
    max_depth=2
)

if success:
    print("✅ System ready!")
    print(f"📊 Processed: {len(rag_system.chunks)} chunks")
    print(f"📊 Data file: {output_file}")
    print("\n✅ Step 1 Complete: Generic RAG system ready!")
else:
    print("❌ Failed to initialize system")

In [None]:
# ✅ QUICK TEST - Complete working example
from src.rag_system import RAGSystem

# Initialize the system
rag_system = RAGSystem()

# Example: Quick test with any website
print("🔄 Quick test with generic RAG system...")

# You can change this URL to any website you want to test
test_urls = ["https://httpbin.org/"]  # Simple API testing service

success = rag_system.scrape_and_process_website(
    start_urls=test_urls,
    max_pages=3,
    output_file="data/quick_test_notebook.json",
    use_cache=True
)

if success:
    print("✅ System initialized successfully!")
    
    # Test query
    result = rag_system.demo_query("What is this website about?", top_k=2)
    
    # The demo_query method prints details and returns a summary
    print("\n" + "="*60)
    print("📋 QUICK TEST COMPLETE")
    
else:
    print("❌ Failed to initialize.")
    print("💡 Try a different website or check internet connection")

In [1]:
# 🤖 Advanced Query with Ollama Generation
from src.rag_system import RAGSystem

# Initialize and load system with any website
rag_system = RAGSystem()

# Example: Use documentation website
start_urls = ["https://docs.python.org/3/tutorial/"]

print("📚 Setting up system for Ollama demo...")
success = rag_system.scrape_and_process_website(
    start_urls=start_urls,
    max_pages=5,
    output_file="data/ollama_demo_notebook.json"
)

if success:
    # Your query - change this to anything you want to ask
    query = "How to define functions in Python?"
    print(f"🔍 Query: {query}")
    print("=" * 60)

    # Get full answer from Ollama
    try:
        result = rag_system.rag_query(query, top_k=3, model="mistral")
        print("🤖 Ollama Response:")
        print(result)
    except Exception as e:
        print(f"❌ Generation failed (is Ollama running?): {e}")
        print("\n💡 Fallback - showing retrieval only:")
        result = rag_system.demo_query(query, top_k=3)
        
else:
    print("❌ Failed to set up system")
    
print("\n💡 To use Ollama:")
print("1. Start Ollama: ollama serve")  
print("2. Install model: ollama pull mistral")
print("3. Run this cell again")

📚 Setting up system for Ollama demo...
🚀 RAG: Scraping and processing website...
🌐 Scraping website from: https://docs.python.org/3/tutorial/
🚀 Starting generic website scraping...
   Starting URLs: 1
   Max pages: 5
   Same domain only: True
   Max depth: 2
🔍 Discovering URLs from 1 starting points...
   Found 5 URLs

📄 Processing 1/5: /3/tutorial/
   📄 Processing: https://docs.python.org/3/tutorial/
      ✅ Extracted 1 sections

📄 Processing 2/5: /3/whatsnew/changelog.html
   📄 Processing: https://docs.python.org/3/whatsnew/changelog.html
      ✅ Extracted 952 sections

📄 Processing 3/5: /3/tutorial/appetite.html
   📄 Processing: https://docs.python.org/3/tutorial/appetite.html
      ✅ Extracted 1 sections

📄 Processing 4/5: /3/bugs.html
   📄 Processing: https://docs.python.org/3/bugs.html
      ✅ Extracted 4 sections

📄 Processing 5/5: /3/genindex.html
   📄 Processing: https://docs.python.org/3/genindex.html
      ✅ Extracted 1 sections

🧠 Creating semantic chunks from 5 documents..