In [None]:
!pip install html2text

Collecting html2text
  Downloading html2text-2025.4.15-py3-none-any.whl.metadata (4.1 kB)
Downloading html2text-2025.4.15-py3-none-any.whl (34 kB)
Installing collected packages: html2text
Successfully installed html2text-2025.4.15


In [None]:
!pip install langchain langchain-community langchain-google-genai chromadb google-generativeai

Collecting langchain-community
  Downloading langchain_community-0.3.27-py3-none-any.whl.metadata (2.9 kB)
Collecting langchain-google-genai
  Downloading langchain_google_genai-2.1.7-py3-none-any.whl.metadata (7.0 kB)
Collecting chromadb
  Downloading chromadb-1.0.15-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
Collecting filetype<2.0.0,>=1.2.0 (from langchain-google-genai)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting google-ai-generativelanguage<0.7.0,>=0.6.18 (from langchain-google-genai)
  Downloading google_ai_genera

In [None]:
import requests
from bs4 import BeautifulSoup
import html2text
import re
import time
from urllib.parse import urljoin, urlparse
import json
from typing import Dict, List, Tuple, Optional, Set, Any
from collections import deque
import os
from langchain.text_splitter import MarkdownTextSplitter
from langchain.docstore.document import Document
from langchain_community.vectorstores import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import google.generativeai as genai
from google.colab import userdata

## **Overview**

We crawl the web and implement RAG (Retrieval-Augmented Generation) to provide helpful care tips and informational content to people living with Parkinson's, based on the self-reported severity of their symptoms.

**Care-Tips Based on Symptom Severity**

| Symptom | Type of care-tips based on rating | Rating | Care-tip example based on symptom and severity |
|---------|-----------------------------------|--------|-----------------------------------------------|
| Pain | Educational | 1-2 | I'm glad you've been able to manage your pain. Exercising regularly and making sure to get adequate nutrition can go a long way when regulating pain.<br><br>Meditating can be a good substitution if you are in too much pain to exercise. |
| | Basic Care | 3 | Warm packs may help control your pain. However, avoid electric heating pads as they can cause burns with prolonged use.<br><br>If your pain is due to acute injury, consider using a cold pack instead to reduce pain and swelling. This should typically not be done for > 20 minutes. |
| | Advanced Care | 4-5 | Consider using the journal as a 'pain log' to describe when, where and what kind of pain you are feeling as well as what has or hasn't helped relieve your pain. This can help you understand the causes of your pain.<br><br>Sharing this information with your health providers can help them classify and treat your pain accurately. |
| | Escalation | 5 (User rates 4-5 > 4 times) | I recommend you speak to your doctor or your nurse about the pain you are experiencing.<br><br>They can help you find the underlying cause of your pain. |
| Light-headedness | Educational | 1-2 | You've noticed some occasional light-headedness but it hasn't been too disruptive.<br><br>Let's keep an eye on it and track when it happens. |
| | Basic Care | 3 | You've been feeling light-headed more often, especially when standing.<br><br>Let's monitor how often it happens and make a note to talk to your care team if it continues. |
| | Advanced Care | 4-5 | You're starting to feel light-headed more regularly, and it may be getting in the way of your day.<br><br>Keep tracking when it happens and bring it up with your care team so they're aware. |
| | Escalation | 5 (User rates 4-5 > 4 times) | Light-headedness has been coming up often and might be affecting you when moving around.<br><br>Let's continue tracking when it happens and make sure this gets shared with your care team. |
| Unusual Sweating | Educational | 1-2 | Glad to hear that you've been able to manage your sweating problems.<br><br>Remember that sweating problems occur as a regulatory function of your autonomic nervous system. |
| | Basic Care | 3 | Unusual sweating is a common symptom experienced by many people with PD. Try to identify any foods that can cause excessive sweating and use an antiperspirant to control sweating and odour.<br><br>You can consult this article for more information |
| | Advanced Care | 4-5 | People with PD often experience discomfort and uneasiness due to unusual sweating, but there are ways to manage it. Avoid wearing tight-fitting clothing, especially those made from nylon or silk.<br><br>Consider buying breathable socks and armpit shields to absorb sweat and moisture. |
| | Escalation | 5 (User rates 4-5 > 4 times) | It appears that you have been experiencing constant issues with excessive sweating. It may be the right time to contact your care team or consult our care finder to support your experience better.<br><br>In the meantime, try avoiding hot or humid environments, crowded rooms, and stressful situations. |
| Skin Changes | Educational | 1-2 | You may have noticed occasional skin changes but it hasn't been a regular concern yet.<br><br>Let's keep track of this just in case. |
| | Basic Care | 3 | Skin changes like increased oiliness or irritation are happening more often.<br><br>Let's continue monitoring and sharing them with your care team. |
| | Advanced Care | 4-5 | You may be noticing changes in your skin which isn't unusual in Parkinson's.<br><br>Let's keep tracking each episode and share this with your care team. |
| | Escalation | 5 (User rates 4-5 > 4 times) | You're consistently noticing skin changes, a normal symptom of Parkinson's.<br><br>Track each episode's timing and severity to share with your care team. |
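
The tier selection in the table above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name `care_tip_tier`, the constant `ESCALATION_THRESHOLD`, and the reading of "User rates 4-5 > 4 times" as "more than four ratings of 4 or 5 in the history, including the current one" are all assumptions.

```python
from typing import List

# Assumption: "> 4 times" means more than four ratings of 4 or 5 in the history.
ESCALATION_THRESHOLD = 4


def care_tip_tier(ratings: List[int]) -> str:
    """Return the care-tip tier for the most recent rating in `ratings`.

    `ratings` is the rating history for one symptom, oldest first,
    with each value between 1 and 5.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    current = ratings[-1]
    if not 1 <= current <= 5:
        raise ValueError(f"rating must be between 1 and 5, got {current}")

    if current <= 2:
        return "Educational"
    if current == 3:
        return "Basic Care"

    # Current rating is 4 or 5: escalate once high ratings have occurred
    # more than ESCALATION_THRESHOLD times (our interpretation of the table).
    high_count = sum(1 for r in ratings if r >= 4)
    if high_count > ESCALATION_THRESHOLD:
        return "Escalation"
    return "Advanced Care"
```

For example, a history of `[4, 5, 4, 5]` would still map to "Advanced Care" under this reading, while a fifth high rating would move the user to "Escalation".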

## **Grabbing Reliable Parkinson's Content From the Web**

| Organization Name                                               | Country/Region            | Website                                          | Successfully Crawled |
|------------------------------------------------------------------|---------------------------|--------------------------------------------------|----------------------|
| Parkinson’s Foundation                                           | USA (Global reach)        | [parkinson.org](https://www.parkinson.org/)     | ✅                   |
| Michael J. Fox Foundation for Parkinson’s Research              | USA (Global reach)        | [michaeljfox.org](https://www.michaeljfox.org/) | ✅                   |
| American Parkinson Disease Association (APDA)                   | USA                       | [apdaparkinson.org](https://www.apdaparkinson.org/) | ✅               |
| Parkinson Canada                                                | Canada                    | [parkinson.ca](https://www.parkinson.ca/)       | ✅                   |
| European Parkinson’s Disease Association (EPDA)                 | Europe (Pan-European)     | [parkinsonseurope.org](https://parkinsonseurope.org/) | ❌               |
| Parkinson’s UK                                                  | United Kingdom            | [parkinsons.org.uk](https://www.parkinsons.org.uk/) | ✅               |
| Davis Phinney Foundation                                        | USA                       | [davisphinneyfoundation.org](https://davisphinneyfoundation.org/) | ❌         |
| PMD Alliance                                                    | USA                       | [pmdalliance.org](https://www.pmdalliance.org/) | ✅                   |
| ParkinsonNet                                                    | Netherlands               | [parkinsonnet.com](https://www.parkinsonnet.com/) | ✅               |


**Unable to crawl the *Davis Phinney Foundation* and *European Parkinson's Disease Association* (EPDA) sites (❌): both returned HTTP 403 Forbidden errors, most likely because they block automated scrapers.**
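
A 403 response can come from bot detection rather than from published crawl rules, but checking a site's `robots.txt` is a reasonable first step before retrying. The sketch below uses only the standard library's `urllib.robotparser`. The helper name `allowed_by_robots`, the user agent string, and the example rules are hypothetical and not taken from either site.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only
example_rules = """\
User-agent: *
Disallow: /private/
"""

blocked = allowed_by_robots(
    example_rules, "care-tips-crawler", "https://example.org/private/page"
)
permitted = allowed_by_robots(
    example_rules, "care-tips-crawler", "https://example.org/living-well"
)
```

In a real crawl the `robots.txt` text would be downloaded from the target site first; that step is omitted here.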

The crawler grabs all main and secondary pages, with a limit of 50 pages and a depth of 3 levels. It only crawls pages within the same domain, and also collects URLs for video and podcast resources.
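
The crawl limits described above can be sketched as a bounded breadth-first traversal. This is an illustration under stated assumptions, not the crawler used to produce the data: `crawl_same_domain` and the `fetch_links` callable are hypothetical names, the start page is counted as depth 0, and collection of video and podcast URLs is left out.

```python
from collections import deque
from typing import Callable, Dict, Iterable
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_same_domain(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_pages: int = 50,
    max_depth: int = 3,
) -> Dict[str, int]:
    """Breadth-first crawl that returns {url: depth} for every visited page.

    `fetch_links` is a hypothetical callable returning the href values
    found on a page.
    """
    domain = urlparse(start_url).netloc
    visited: Dict[str, int] = {}
    queue = deque([(start_url, 0)])

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited[url] = depth
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments
            link, _ = urldefrag(urljoin(url, href))
            # Stay on the start domain and skip pages already seen
            if urlparse(link).netloc == domain and link not in visited:
                queue.append((link, depth + 1))
    return visited
```

In practice `fetch_links` could download the page with `requests` and extract `href` values with `BeautifulSoup`; passing it in as a parameter keeps the traversal logic testable without network access.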

In [None]:
# Using Google Colab Secrets
try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
    genai.configure(api_key=GOOGLE_API_KEY)
    print("✅ API key loaded from Colab secrets")
except Exception as e:
    print("❌ Could not load API key from secrets. Please set up your API key.")
    print("Go to the left sidebar 🔑 Secrets tab and add 'GOOGLE_API_KEY' as a secret")

✅ API key loaded from Colab secrets
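
The cell above only works inside Colab, because `userdata` comes from `google.colab`. A minimal sketch of a lookup that also works elsewhere is shown below. The helper `resolve_api_key` and its parameters are hypothetical; it tries a secrets getter first and then falls back to an environment variable.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_api_key(
    secret_getter: Optional[Callable[[str], Optional[str]]] = None,
    env: Mapping[str, str] = os.environ,
    name: str = "GOOGLE_API_KEY",
) -> str:
    """Return the API key from a secrets getter, else from the environment."""
    if secret_getter is not None:
        try:
            key = secret_getter(name)
            if key:
                return key
        except Exception:
            pass  # fall through to the environment lookup
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} not found in secrets or environment")
    return key
```

In Colab this could be called as `resolve_api_key(userdata.get)`; on a local machine, `resolve_api_key()` would read the `GOOGLE_API_KEY` environment variable. Raising an error when no key is found makes the failure visible before any embedding call is attempted.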


In [None]:
# =============================================================================
# 1. TEXT PREPROCESSING AND CLEANING
# =============================================================================

def merge_hyphenated_words(text):
    """Merge words that are split by hyphens across lines."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def fix_newlines(text):
    """Replace single newlines with spaces, keep double newlines as paragraph breaks."""
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)

def remove_multiple_newlines(text):
    """Collapse runs of three or more newlines into a single paragraph break."""
    return re.sub(r"\n{3,}", "\n\n", text)

def clean_markdown_artifacts(text):
    """Clean up markdown artifacts that might not be useful for RAG."""
    # Replace markdown links with their link text, dropping the URL
    text = re.sub(r'\[([^\]]*)\]\([^)]*\)', r'\1', text)
    # Clean up excessive asterisks and underscores
    text = re.sub(r'\*{3,}', '***', text)
    text = re.sub(r'_{3,}', '___', text)
    # Remove excessive whitespace
    text = re.sub(r' {3,}', ' ', text)
    return text

def remove_navigation_elements(text):
    """Remove common navigation and footer elements."""
    # Remove common navigation phrases
    nav_patterns = [
        r'Skip to main content',
        r'Skip to navigation',
        r'Back to top',
        r'Contact Us',
        r'Privacy Policy',
        r'Terms of Service',
        r'Copyright ©.*',
        r'All rights reserved.*'
    ]
    for pattern in nav_patterns:
        text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    return text

def clean_text(text):
    """
    Comprehensive text cleaning pipeline for Parkinson's content.

    Args:
        text (str): Raw text to be cleaned

    Returns:
        str: Cleaned text
    """
    if not text or not isinstance(text, str):
        return ""

    cleaning_functions = [
        merge_hyphenated_words,
        fix_newlines,
        remove_multiple_newlines,
        clean_markdown_artifacts,
        remove_navigation_elements
    ]

    for cleaning_function in cleaning_functions:
        text = cleaning_function(text)

    # Final cleanup
    text = text.strip()
    return text

# =============================================================================
# 2. DOCUMENT CHUNKING AND PROCESSING
# =============================================================================

def text_to_docs(text: str, metadata: Dict[str, Any]) -> List[Document]:
    """
    Convert text to Document chunks with metadata.

    Args:
        text (str): Cleaned text content
        metadata (dict): Metadata for the document

    Returns:
        List[Document]: List of document chunks
    """
    if not text or len(text.strip()) < 50:  # Skip very short content
        return []

    doc_chunks = []
    # Moderate chunk size with overlap: chunks stay focused enough for
    # retrieval while neighbouring chunks share some context.
    text_splitter = MarkdownTextSplitter(
        chunk_size=1500,   # maximum characters per chunk
        chunk_overlap=200  # characters shared between adjacent chunks
    )

    chunks = text_splitter.split_text(text)

    for i, chunk in enumerate(chunks):
        if len(chunk.strip()) < 100:  # Skip very small chunks
            continue

        # Enhanced metadata for better retrieval
        chunk_metadata = metadata.copy()
        chunk_metadata.update({
            'chunk_id': i,
            'chunk_length': len(chunk),
            'total_chunks': len(chunks)
        })

        doc = Document(page_content=chunk, metadata=chunk_metadata)
        doc_chunks.append(doc)

    return doc_chunks

def get_doc_chunks(text: str, metadata: Dict[str, Any]) -> List[Document]:
    """
    Process text and metadata to generate document chunks.

    Args:
        text (str): Raw text content
        metadata (dict): Associated metadata

    Returns:
        List[Document]: List of processed document chunks
    """
    cleaned_text = clean_text(text)
    if not cleaned_text:
        return []

    doc_chunks = text_to_docs(cleaned_text, metadata)
    return doc_chunks

# =============================================================================
# 3. CHROMADB INITIALIZATION AND CONNECTION - MODIFIED FOR PERSISTENCE
# =============================================================================

def get_persistent_directory():
    """Get or create persistent directory for ChromaDB data."""
    # Default to Google Drive, which is available in Colab once Drive is mounted
    persist_dir = "/content/drive/MyDrive/ChromaDB_Parkinson_Data"

    # Outside Colab (or when Drive is not mounted), fall back to the local Downloads folder
    if not os.path.exists("/content/drive"):
        persist_dir = os.path.expanduser("~/Downloads/ChromaDB_Parkinson_Data")

    # Create directory if it doesn't exist
    os.makedirs(persist_dir, exist_ok=True)
    return persist_dir

def show_persistence_notification(persist_dir, collection_name, total_docs):
    """Show a notification about data persistence."""
    from IPython.display import display, HTML

    notification = f"""
    <div style="background-color: #d4edda; border: 1px solid #c3e6cb; color: #155724;
                padding: 15px; border-radius: 5px; margin: 10px 0; font-family: Arial;">
        <h3 style="margin-top: 0;">💾 Data Successfully Persisted!</h3>
        <p><strong>📍 Location:</strong> {persist_dir}</p>
        <p><strong>📚 Collection:</strong> {collection_name}</p>
        <p><strong>📊 Documents:</strong> {total_docs} chunks stored</p>
        <p><strong>🔄 Next Steps:</strong> Your data is now saved locally and ready for Google Drive upload!</p>
        <hr style="border: 1px solid #c3e6cb;">
        <p style="margin-bottom: 0; font-size: 0.9em;">
            <strong>💡 Tip:</strong> You can now upload the entire folder to Google Drive for backup and sharing.
        </p>
    </div>
    """
    display(HTML(notification))
    print(f"🎉 SUCCESS: Data persisted to {persist_dir}")

def get_chroma_client(collection_name: str = "parkinsons_knowledge_base"):
    """
    Initialize and return ChromaDB client with Google Embeddings and LOCAL PERSISTENCE.

    Args:
        collection_name (str): Name of the ChromaDB collection

    Returns:
        Chroma: Initialized ChromaDB vector store with persistence
    """
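    embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

    # Get persistent directory
    persist_dir = get_persistent_directory()

    print(f"📁 Persistent data ready at: {persist_dir}")

    # Build the persistent vector store and return it, as the docstring promises
    return Chroma(
        collection_name=collection_name,
        embedding_function=embedding_function,
        persist_directory=persist_dir,
    )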

# =============================================================================
# 13. EXAMPLE USAGE AND TESTING - UPDATED FOR PERSISTENCE
# =============================================================================

def run_complete_pipeline():
    """
    Run the complete pipeline to create and test the knowledge base with PERSISTENCE
    """
    print("🚀 STARTING COMPLETE PARKINSON'S RAG PIPELINE WITH PERSISTENCE")
    print("="*60)

    # Step 1: Create complete knowledge base with persistence
    print("\n📚 STEP 1: Creating complete knowledge base with LOCAL PERSISTENCE...")
    process_complete_knowledge_base()

    # Step 2: Test the knowledge base
    print("\n🧪 STEP 2: Testing persistent knowledge base...")
    try:
        # Test the complete knowledge base
        inspector = inspect_chromadb("parkinsons_complete_kb")

        # Search for text content
        print("\n📄 Text Content Search:")
        text_results = inspector.search_documents("Parkinson's symptoms treatment", k=3)
        for i, result in enumerate(text_results[:3], 1):
            if "error" not in result:
                print(f"   {i}. {result['organization']} - {result.get('content_type', 'text')}")
                print(f"      {result['content_preview'][:100]}...")

        # Search for media content
        print("\n🎬 Media Content Search:")
        media_results = search_media_content("exercise therapy", media_type="all", k=3)
        for i, result in enumerate(media_results[:3], 1):
            print(f"   {i}. [{result['type'].upper()}] {result['title']}")
            print(f"      Organization: {result['organization']}")
            print(f"      URL: {result['media_url']}")

        # Export sample for inspection
        print("\n💾 Exporting sample data to persistent directory...")
        inspector.export_sample_data("complete_knowledge_base_sample.json", sample_size=100)

        # Show Google Drive upload instructions
        print("\n📤 Preparing for Google Drive upload...")
        prepare_for_google_drive_upload()

    except Exception as e:
        print(f"❌ Testing failed: {str(e)}")

    print("\n🎉 Pipeline complete! Your persistent knowledge base is ready for use and Google Drive upload.")

# =============================================================================
# 14. MOUNT GOOGLE DRIVE HELPER (OPTIONAL)
# =============================================================================

def mount_google_drive_and_setup():
    """
    Mount Google Drive and set up persistent directory there (for Google Colab)
    """
    try:
        from google.colab import drive
        drive.mount('/content/drive')

        # Update the persistent directory function to use Google Drive
        global get_persistent_directory
        def get_persistent_directory():
            persist_dir = "/content/drive/MyDrive/ChromaDB_Parkinson_Data"
            os.makedirs(persist_dir, exist_ok=True)
            return persist_dir

        persist_dir = get_persistent_directory()
        print("✅ Google Drive mounted successfully!")
        print(f"📁 Persistent directory set to: {persist_dir}")

        from IPython.display import display, HTML
        notification = """
        <div style="background-color: #d1ecf1; border: 1px solid #bee5eb; color: #0c5460;
                    padding: 15px; border-radius: 5px; margin: 10px 0; font-family: Arial;">
            <h3 style="margin-top: 0;">☁️ Google Drive Integration Active!</h3>
            <p>Your ChromaDB data will now be saved directly to Google Drive and automatically synced!</p>
            <p><strong>📍 Location:</strong> Google Drive → ChromaDB_Parkinson_Data</p>
        </div>
        """
        display(HTML(notification))

        return True
    except ImportError:
        print("ℹ️  Google Drive mount not available (not in Colab environment)")
        return False
    except Exception as e:
        print(f"❌ Error mounting Google Drive: {str(e)}")
        return False

# =============================================================================
# 15. BACKUP AND RESTORE FUNCTIONS
# =============================================================================

def backup_chromadb_to_zip():
    """
    Create a ZIP backup of the ChromaDB persistent data
    """
    import shutil

    persist_dir = get_persistent_directory()
    backup_path = os.path.join(os.path.dirname(persist_dir), "ChromaDB_Backup.zip")

    try:
        # make_archive appends ".zip" itself, so pass the path without its extension
        shutil.make_archive(os.path.splitext(backup_path)[0], 'zip', persist_dir)

        from IPython.display import display, HTML
        notification = f"""
        <div style="background-color: #d4edda; border: 1px solid #c3e6cb; color: #155724;
                    padding: 15px; border-radius: 5px; margin: 10px 0; font-family: Arial;">
            <h3 style="margin-top: 0;">📦 Backup Created Successfully!</h3>
            <p><strong>📍 Backup Location:</strong> {backup_path}</p>
            <p><strong>💡 Tip:</strong> You can now download or share this ZIP file containing your complete knowledge base!</p>
        </div>
        """
        display(HTML(notification))

        print(f"✅ Backup created at: {backup_path}")
        return backup_path
    except Exception as e:
        print(f"❌ Error creating backup: {str(e)}")
        return None

def check_persistence_status():
    """
    Check the status of persistent data
    """
    persist_dir = get_persistent_directory()

    print("🔍 PERSISTENCE STATUS CHECK")
    print("="*40)
    print(f"📁 Persistent Directory: {persist_dir}")
    print(f"📂 Directory Exists: {os.path.exists(persist_dir)}")

    if os.path.exists(persist_dir):
        files = os.listdir(persist_dir)
        print(f"📊 Files in directory: {len(files)}")

        if files:
            print("📋 Contents:")
            for file in files[:10]:  # Show first 10 files
                file_path = os.path.join(persist_dir, file)
                size = os.path.getsize(file_path) if os.path.isfile(file_path) else "DIR"
                print(f"   - {file} ({size} bytes)" if size != "DIR" else f"   - {file}/ (directory)")

        # Check total size
        total_size = 0
        for dirpath, dirnames, filenames in os.walk(persist_dir):
            for filename in filenames:
                filepath = os.path.join(dirpath, filename)
                total_size += os.path.getsize(filepath)

        print(f"💾 Total Size: {total_size / (1024*1024):.2f} MB")
    else:
        print("❌ Persistent directory not found. Run the pipeline first!")


# =============================================================================
# 16. EXECUTION ENTRY POINT - FIXED
# =============================================================================

def get_chroma_client(collection_name: str = "parkinsons_knowledge_base"):
    """
    Initialize and return ChromaDB client with Google Embeddings and LOCAL PERSISTENCE.
    """
    embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

    # Get persistent directory
    persist_dir = get_persistent_directory()

    print(f"📁 Using persistent directory: {persist_dir}")

    vector_store = Chroma(
        collection_name=collection_name,
        embedding_function=embedding_function,
        persist_directory=persist_dir  # This makes it persistent!
    )

    return vector_store

# Main execution block
if __name__ == "__main__":
    print("🚀 PERSISTENT CHROMADB RAG SYSTEM")
    print("="*50)
    print("💾 This version saves your data locally for persistence!")
    print("☁️  Optional: Run mount_google_drive_and_setup() for direct Google Drive storage")
    print()

    # Option 1: Mount Google Drive first (recommended for Colab)
    print("1️⃣  Mounting Google Drive (optional but recommended)...")
    try:
        mount_success = mount_google_drive_and_setup()
    except Exception:
        mount_success = False
        print("ℹ️  Google Drive mount not available")

    # Option 2: Check current persistence status
    print("\n2️⃣  Checking persistence status...")
    try:
        check_persistence_status()
    except Exception:
        print("ℹ️  Will create persistence directory when needed")

    print("\n3️⃣  Available commands:")
    print("   - run_complete_pipeline()                 # Full pipeline with persistence")
    print("   - process_complete_knowledge_base()       # Create knowledge base only")
    print("   - inspect_chromadb('collection_name')     # Inspect existing data")
    print("   - backup_chromadb_to_zip()               # Create ZIP backup")
    print("   - prepare_for_google_drive_upload()      # Upload instructions")
    print("   - check_persistence_status()             # Check data status")

    print("\n✨ Ready to process your Parkinson's knowledge base with full persistence!")

    # run_complete_pipeline() relies on helpers that are not yet defined at this
    # point in the cell, so uncomment the line below only after the whole cell has run.
    # run_complete_pipeline()

# =============================================================================
# 4. DATA LOADING AND FILE HANDLING - UNCHANGED
# =============================================================================

def upload_and_load_json():
    """
    Upload JSON file using Google Colab file upload widget and load it.

    Returns:
        dict: Loaded data from JSON file
    """
    from google.colab import files  # Colab-only file upload widget

    print("Please upload your parkinsons_full_crawl.json file:")
    uploaded = files.upload()

    # Get the uploaded file (should be the first and only file)
    filename = list(uploaded.keys())[0]
    print(f"✅ Uploaded file: {filename}")

    # Load and return the JSON data
    with open(filename, 'r', encoding='utf-8') as f:
        data = json.load(f)

    print(f"✅ Successfully loaded data with {len(data)} organizations")
    return data

def load_crawled_data(json_file_path: Optional[str] = None) -> Dict:
    """
    Load the crawled Parkinson's data from JSON file.
    If no path provided, will prompt for file upload.

    Args:
        json_file_path (str, optional): Path to the JSON file

    Returns:
        dict: Loaded data
    """
    if json_file_path is None:
        # Use file upload widget
        return upload_and_load_json()
    else:
        # Try to load from specified path first
        try:
            with open(json_file_path, 'r', encoding='utf-8') as f:
                data = json.load(f)
            print(f"✅ Successfully loaded data from {json_file_path}")
            return data
        except FileNotFoundError:
            print(f"❌ File not found: {json_file_path}")
            print("Switching to file upload method...")
            return upload_and_load_json()

# =============================================================================
# 5. WEB PAGE CONTENT PROCESSING - UNCHANGED
# =============================================================================

def process_organization_data(org_name: str, org_data: Dict) -> List[Document]:
    """
    Process data from a single organization into document chunks.

    Args:
        org_name (str): Name of the organization
        org_data (dict): Organization's crawled data

    Returns:
        List[Document]: Processed document chunks
    """
    all_docs = []

    if 'pages_content' not in org_data:
        print(f"No pages content found for {org_name}")
        return all_docs

    pages_content = org_data['pages_content']

    for url, page_data in pages_content.items():
        # Extract text and metadata
        text = page_data.get('text', '')
        page_metadata = page_data.get('metadata', {})

        # Enhanced metadata
        metadata = {
            'organization': org_name,
            'source_url': url,
            'title': page_metadata.get('title', ''),
            'description': page_metadata.get('description', ''),
            'keywords': page_metadata.get('keywords', ''),
            'crawl_depth': page_data.get('depth', 0),
            'crawled_at': page_data.get('crawled_at', ''),
            'base_url': org_data.get('base_url', ''),
            'content_type': 'web_page'
        }

        # Process text into chunks
        doc_chunks = get_doc_chunks(text, metadata)
        all_docs.extend(doc_chunks)

        if doc_chunks:
            print(f"Processed {len(doc_chunks)} chunks from {url}")

    return all_docs

# =============================================================================
# 6. MEDIA CONTENT PROCESSING - UNCHANGED
# =============================================================================

def process_media_content(org_name: str, org_data: Dict) -> List[Document]:
    """
    Process video and podcast content from organization data into document chunks.

    Args:
        org_name (str): Name of the organization
        org_data (dict): Organization's crawled data

    Returns:
        List[Document]: Media document chunks
    """
    media_docs = []

    if 'media_content' not in org_data:
        return media_docs

    media_content = org_data['media_content']

    # Process videos
    videos = media_content.get('videos', [])
    for video in videos:
        content_parts = []

        if video.get('title'):
            content_parts.append(f"Video Title: {video['title']}")

        if video.get('description'):
            content_parts.append(f"Description: {video['description']}")

        content_parts.append(f"Video URL: {video.get('url', 'N/A')}")
        content_parts.append(f"Source Page: {video.get('source_page', 'N/A')}")

        content = "\n".join(content_parts)

        metadata = {
            'organization': org_name,
            'content_type': 'video',
            'media_type': 'video',
            'title': video.get('title', 'Untitled Video'),
            'description': video.get('description', ''),
            'media_url': video.get('url', ''),
            'source_page': video.get('source_page', ''),
            'base_url': org_data.get('base_url', ''),
            'source_url': video.get('url', ''),
        }

        if len(content.strip()) > 50:
            doc = Document(page_content=content, metadata=metadata)
            media_docs.append(doc)

    # Process podcasts
    podcasts = media_content.get('podcasts', [])
    for podcast in podcasts:
        content_parts = []

        if podcast.get('title'):
            content_parts.append(f"Podcast Title: {podcast['title']}")

        if podcast.get('description'):
            content_parts.append(f"Description: {podcast['description']}")

        content_parts.append(f"Podcast URL: {podcast.get('url', 'N/A')}")
        content_parts.append(f"Source Page: {podcast.get('source_page', 'N/A')}")

        content = "\n".join(content_parts)

        metadata = {
            'organization': org_name,
            'content_type': 'podcast',
            'media_type': 'audio',
            'title': podcast.get('title', 'Untitled Podcast'),
            'description': podcast.get('description', ''),
            'media_url': podcast.get('url', ''),
            'source_page': podcast.get('source_page', ''),
            'base_url': org_data.get('base_url', ''),
            'source_url': podcast.get('url', ''),
        }

        if len(content.strip()) > 50:
            doc = Document(page_content=content, metadata=metadata)
            media_docs.append(doc)

    return media_docs

# =============================================================================
# 7. BATCH STORAGE WITH RATE LIMITING - MODIFIED WITH PERSISTENCE NOTIFICATION
# =============================================================================

def store_documents_with_rate_limiting(docs: List[Document], vector_store: Chroma, batch_size: int = 25):
    """
    Store documents in batches with rate limiting to avoid API quota issues.
    Shows persistence notifications.

    Args:
        docs (List[Document]): Documents to store
        vector_store (Chroma): ChromaDB vector store
        batch_size (int): Number of documents per batch
    """
    total_docs = len(docs)
    print(f"Storing {total_docs} documents in batches of {batch_size} with rate limiting")

    successful_batches = 0
    failed_batches = 0

    for i in range(0, total_docs, batch_size):
        batch = docs[i:i + batch_size]
        batch_num = i//batch_size + 1
        total_batches = (total_docs + batch_size - 1)//batch_size

        try:
            print(f"Processing batch {batch_num}/{total_batches}...")
            vector_store.add_documents(batch)
            successful_batches += 1
            print(f"✅ Stored batch {batch_num}/{total_batches}")

            # Rate limiting: wait between batches
            if batch_num < total_batches:
                print("⏱️  Waiting 60 seconds to respect API rate limits...")
                time.sleep(60)

        except Exception as e:
            failed_batches += 1
            print(f"❌ Error storing batch {batch_num}: {str(e)}")

            # If it's a rate limit error, wait longer
            if "429" in str(e) or "RATE_LIMIT_EXCEEDED" in str(e):
                print("⏱️  Rate limit detected. Waiting 2 minutes before retrying...")
                time.sleep(120)

                # Retry the failed batch once
                try:
                    print(f"🔄 Retrying batch {batch_num}...")
                    vector_store.add_documents(batch)
                    successful_batches += 1
                    failed_batches -= 1
                    print(f"✅ Successfully stored batch {batch_num} on retry")
                except Exception as retry_error:
                    print(f"❌ Failed again on retry: {str(retry_error)}")
            continue

    # PERSIST THE DATA AND SHOW NOTIFICATION
    try:
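        # Chroma 0.4+ writes to persist_directory automatically; persist() is kept
        # for older versions and only emits a deprecation warning on newer ones.
        vector_store.persist()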
        persist_dir = get_persistent_directory()
        collection_name = vector_store._collection.name

        # The final batch may be smaller than batch_size, so cap the estimate
        stored_estimate = min(successful_batches * batch_size, total_docs)

        print("\n📊 STORAGE SUMMARY:")
        print(f"✅ Successful batches: {successful_batches}")
        print(f"❌ Failed batches: {failed_batches}")
        print(f"📄 Total documents attempted: {total_docs}")
        print(f"📄 Estimated documents stored: {stored_estimate}")
        print("💾 All successful documents persisted to ChromaDB")

        # Show persistence notification
        show_persistence_notification(persist_dir, collection_name, stored_estimate)

    except Exception as e:
        print(f"❌ Error persisting data: {str(e)}")
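
# -----------------------------------------------------------------------------
# Sketch (not part of the original pipeline): exponential backoff for retries.
# store_documents_with_rate_limiting() above retries a rate-limited batch once
# after a fixed 120-second wait. A possible alternative, assuming the embedding
# API recovers after a short pause, is to retry a few times with growing delays.
# The names call_with_backoff, max_retries, base_delay and sleep_fn are
# hypothetical and chosen for illustration only.
# -----------------------------------------------------------------------------
import time
from typing import Any, Callable

def call_with_backoff(operation: Callable[[], Any], max_retries: int = 3,
                      base_delay: float = 30.0,
                      sleep_fn: Callable[[float], None] = time.sleep) -> Any:
    """Call operation(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error to the caller
            sleep_fn(base_delay * (2 ** attempt))

# Possible usage inside the batch loop (assumes `vector_store` and `batch` exist):
#     call_with_backoff(lambda: vector_store.add_documents(batch))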

# =============================================================================
# 8. KNOWLEDGE BASE INSPECTION TOOLS - MODIFIED FOR PERSISTENCE
# =============================================================================

class ChromaDBInspector:
    """
    Class to inspect and interact with your ChromaDB knowledge base
    """

    def __init__(self, collection_name: str = "parkinsons_knowledge_base"):
        self.collection_name = collection_name
        self.embedding_function = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

        # Use persistent directory
        persist_dir = get_persistent_directory()

        self.vector_store = Chroma(
            collection_name=collection_name,
            embedding_function=self.embedding_function,
            persist_directory=persist_dir  # Load from persistent directory
        )

    def get_collection_stats(self) -> Dict:
        """Get basic statistics about your ChromaDB collection"""
        try:
            collection = self.vector_store._collection
            # Fetch only a few records for the sample instead of the whole collection
            sample = collection.get(limit=5)
            stats = {
                "collection_name": collection.name,
                "total_documents": collection.count(),
                "sample_ids": list(sample["ids"]),
                "persistent_directory": get_persistent_directory()
            }
            return stats
        except Exception as e:
            return {"error": str(e)}

    def search_documents(self, query: str, k: int = 5) -> List[Dict]:
        """Search for documents in your knowledge base"""
        try:
            results = self.vector_store.similarity_search_with_score(query, k=k)

            formatted_results = []
            for doc, score in results:
                text = doc.page_content
                formatted_results.append({
                    "score": score,
                    "organization": doc.metadata.get("organization", "Unknown"),
                    "title": doc.metadata.get("title", "Unknown"),
                    "url": doc.metadata.get("source_url", "Unknown"),
                    "content_preview": (text[:200] + "...") if len(text) > 200 else text,
                    "content_type": doc.metadata.get("content_type", "web_page"),
                    "full_content": text,
                    "metadata": doc.metadata
                })

            return formatted_results
        except Exception as e:
            return [{"error": str(e)}]

    def get_all_organizations(self) -> List[str]:
        """Get list of all organizations in the database"""
        try:
            all_docs = self.vector_store.get()
            organizations = set()

            for metadata in all_docs["metadatas"]:
                # Chroma may return None for documents stored without metadata
                org = (metadata or {}).get("organization")
                if org:
                    organizations.add(org)

            return sorted(organizations)
        except Exception as e:
            return [f"Error: {str(e)}"]

    def export_sample_data(self, filename: str = "chromadb_sample.json", sample_size: int = 50):
        """Export a sample of your data to JSON for inspection"""
        try:
            all_docs = self.vector_store.get()
            persist_dir = get_persistent_directory()

            sample_data = {
                "total_documents": len(all_docs["ids"]),
                "sample_size": min(sample_size, len(all_docs["ids"])),
                "organizations": self.get_all_organizations(),
                "persistent_directory": persist_dir,
                "sample_documents": []
            }

            for i in range(sample_data["sample_size"]):
                text = all_docs["documents"][i]
                sample_data["sample_documents"].append({
                    "id": all_docs["ids"][i],
                    # Append "..." only when the content was actually truncated
                    "content_preview": (text[:200] + "...") if len(text) > 200 else text,
                    "metadata": all_docs["metadatas"][i]
                })

            # Save to persistent directory
            export_path = os.path.join(persist_dir, filename)
            with open(export_path, 'w', encoding='utf-8') as f:
                json.dump(sample_data, f, indent=2, ensure_ascii=False)

            print(f"✅ Sample data exported to {export_path}")
            return sample_data
        except Exception as e:
            print(f"❌ Error exporting data: {str(e)}")
            return None
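
# -----------------------------------------------------------------------------
# Sketch (not part of the original notebook): one shared preview helper.
# search_documents() and export_sample_data() above each build a 200-character
# preview inline. A small helper such as the hypothetical make_preview() below
# could keep both consistent and add "..." only when text is actually cut.
# -----------------------------------------------------------------------------
def make_preview(text: str, limit: int = 200) -> str:
    """Return text unchanged if it fits, else its first `limit` chars plus '...'."""
    if len(text) <= limit:
        return text
    return text[:limit] + "..."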

# =============================================================================
# 9. MEDIA SEARCH UTILITIES - UPDATED FOR PERSISTENCE
# =============================================================================

def search_media_content(query: str, collection_name: str = "parkinsons_complete_kb",
                        media_type: str = "all", k: int = 5):
    """
    Search specifically for media content

    Args:
        query (str): Search query
        collection_name (str): ChromaDB collection name
        media_type (str): "all", "video", or "podcast"
        k (int): Number of results to return

    Returns:
        List[Dict]: Formatted search results
    """
    vector_store = get_chroma_client(collection_name)

    # Get results
    results = vector_store.similarity_search_with_score(query, k=k*3)

    # Filter by media type
    filtered_results = []
    for doc, score in results:
        content_type = doc.metadata.get('content_type', '')

        if media_type == "all" and content_type in ['video', 'podcast']:
            filtered_results.append((doc, score))
        elif media_type == content_type:
            filtered_results.append((doc, score))

        if len(filtered_results) >= k:
            break

    # Format results
    formatted_results = []
    for doc, score in filtered_results:
        formatted_results.append({
            "score": score,
            "type": doc.metadata.get('content_type', 'unknown'),
            "organization": doc.metadata.get('organization', 'Unknown'),
            "title": doc.metadata.get('title', 'Unknown'),
            "media_url": doc.metadata.get('media_url', 'Unknown'),
            "description": doc.metadata.get('description', ''),
            "content": doc.page_content
        })

    return formatted_results
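
# -----------------------------------------------------------------------------
# Sketch (not part of the original notebook): filtering inside the vector query.
# search_media_content() above over-fetches k*3 results and filters in Python,
# so it can return fewer than k media hits. Chroma metadata filters ("where"
# clauses, exposed by the LangChain wrapper as `filter=`) could push the
# content_type check into the query instead. build_media_filter() is a
# hypothetical helper name, and this is an untested suggestion for this dataset.
# -----------------------------------------------------------------------------
from typing import Any, Dict

def build_media_filter(media_type: str = "all") -> Dict[str, Any]:
    """Build a Chroma-style metadata filter for media documents."""
    if media_type == "all":
        # Match either media content type
        return {"content_type": {"$in": ["video", "podcast"]}}
    return {"content_type": media_type}

# Possible usage (assumes `vector_store` from get_chroma_client()):
#     vector_store.similarity_search_with_score(
#         query, k=k, filter=build_media_filter(media_type))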

# =============================================================================
# 10. MAIN PROCESSING FUNCTIONS - MODIFIED WITH PERSISTENCE NOTIFICATIONS
# =============================================================================

def process_complete_knowledge_base(json_file_path: Optional[str] = None,
                                    collection_name: str = "parkinsons_complete_kb"):
    """
    Process all content (text and media) into one comprehensive knowledge base.
    The collection is persisted locally and status notifications are printed.

    Args:
        json_file_path (str, optional): Path to JSON data file
        collection_name (str): Name for ChromaDB collection
    """
    print("🚀 Starting COMPLETE Parkinson's Knowledge Base Creation (Text + Media + PERSISTENCE)")
    print("="*70)

    # Load the data
    print("Loading crawled data...")
    try:
        data = load_crawled_data(json_file_path)
    except Exception as e:
        print(f"❌ Error loading data: {str(e)}")
        return

    # Initialize ChromaDB with persistence
    print("Initializing ChromaDB with LOCAL PERSISTENCE...")
    try:
        vector_store = get_chroma_client(collection_name)
        persist_dir = get_persistent_directory()
        print(f"✅ ChromaDB initialized successfully - Collection: {collection_name}")
        print(f"📁 Persistent directory: {persist_dir}")
    except Exception as e:
        print(f"❌ Error initializing ChromaDB: {str(e)}")
        return

    all_documents = []
    media_documents = []

    print("Processing organizations...")
    print("-" * 40)

    for org_name, org_data in data.items():
        if isinstance(org_data, dict):
            print(f"\n📂 Processing {org_name}...")

            # Process text content
            if 'pages_content' in org_data:
                text_docs = process_organization_data(org_name, org_data)
                all_documents.extend(text_docs)
                print(f"   📄 Generated {len(text_docs)} text document chunks")

            # Process media content
            media_docs = process_media_content(org_name, org_data)
            media_documents.extend(media_docs)
            print(f"   🎬 Generated {len(media_docs)} media document chunks")

        else:
            print(f"   ⚠️  Skipping {org_name} - no valid data")

    # Combine all documents
    all_content = all_documents + media_documents

    print("\n" + "="*70)
    print("📊 COMPLETE SUMMARY")
    print("="*70)
    print(f"📄 Text documents: {len(all_documents)}")
    print(f"🎬 Media documents: {len(media_documents)}")
    print(f"📚 Total documents to store: {len(all_content)}")

    if all_content:
        print("\nStoring ALL documents in ChromaDB with rate limiting and PERSISTENCE...")
        try:
            store_documents_with_rate_limiting(all_content, vector_store, batch_size=20)

            print("\n🎉 SUCCESS! Your COMPLETE knowledge base is ready!")
            print(f"   📚 Organizations processed: {len([k for k, v in data.items() if isinstance(v, dict)])}")
            print(f"   📄 Text document chunks: {len(all_documents)}")
            print(f"   🎬 Media document chunks: {len(media_documents)}")
            print(f"   📊 Total document chunks: {len(all_content)}")
            print(f"   🗄️  ChromaDB collection: {collection_name}")
            print(f"   💾 Data persisted to: {get_persistent_directory()}")

        except Exception as e:
            print(f"❌ Error storing documents: {str(e)}")
            return
    else:
        print("❌ No documents were generated. Check your data file.")
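
# -----------------------------------------------------------------------------
# Sketch (not part of the original notebook): estimating the storage run time.
# With a fixed pause between batches, total wait time grows linearly with the
# number of batches. estimate_wait_seconds() is a hypothetical helper; the
# 60-second default mirrors the sleep in store_documents_with_rate_limiting().
# -----------------------------------------------------------------------------
def estimate_wait_seconds(total_docs: int, batch_size: int, wait_seconds: int = 60) -> int:
    """Return the total seconds spent sleeping between batches (retries excluded)."""
    if total_docs <= 0:
        return 0
    total_batches = (total_docs + batch_size - 1) // batch_size  # ceiling division
    return (total_batches - 1) * wait_seconds  # no pause after the final batch

# Example: 7294 documents in batches of 20 -> 365 batches, 364 pauses,
# 364 * 60 = 21840 seconds (about 6 hours) of waiting alone.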

# =============================================================================
# 11. UTILITY AND INSPECTION FUNCTIONS - UPDATED FOR PERSISTENCE
# =============================================================================

def inspect_chromadb(collection_name: str = "parkinsons_knowledge_base"):
    """
    Inspect your ChromaDB knowledge base

    Args:
        collection_name (str): Name of the collection to inspect

    Returns:
        ChromaDBInspector: Inspector instance for further operations
    """
    print("🔍 CHROMADB INSPECTOR (PERSISTENT VERSION)")
    print("="*50)

    inspector = ChromaDBInspector(collection_name)

    # Get basic stats
    print("📊 Collection Statistics:")
    stats = inspector.get_collection_stats()
    for key, value in stats.items():
        print(f"   {key}: {value}")

    # Get organizations
    print("\n🏢 Organizations in Database:")
    orgs = inspector.get_all_organizations()
    for i, org in enumerate(orgs, 1):
        print(f"   {i}. {org}")

    # Test search
    print("\n🔍 Sample Search Results:")
    test_queries = [
        "What are the symptoms of Parkinson's disease?",
        "How to manage tremor?",
        "exercise therapy"
    ]

    for query in test_queries:
        print(f"\nQuery: '{query}'")
        results = inspector.search_documents(query, k=2)
        for i, result in enumerate(results[:2], 1):
            if "error" not in result:
                print(f"   {i}. {result['organization']} - Score: {result['score']:.3f}")
                print(f"      Type: {result.get('content_type', 'text')}")
                print(f"      {result['content_preview']}")

    return inspector

# =============================================================================
# 12. GOOGLE DRIVE UPLOAD HELPER
# =============================================================================

def prepare_for_google_drive_upload():
    """
    Show instructions for uploading persistent data to Google Drive
    """
    persist_dir = get_persistent_directory()

    from IPython.display import display, HTML

    instructions = f"""
    <div style="background-color: #e7f3ff; border: 1px solid #b8daff; color: #004085;
                padding: 20px; border-radius: 5px; margin: 10px 0; font-family: Arial;">
        <h3 style="margin-top: 0;">📤 Ready for Google Drive Upload!</h3>

        <h4>📁 Your persistent data location:</h4>
        <code style="background-color: #f8f9fa; padding: 5px; border-radius: 3px;">{persist_dir}</code>

        <h4>🚀 Upload Steps:</h4>
        <ol>
            <li><strong>Compress the folder:</strong> Right-click on the ChromaDB_Parkinson_Data folder and create a ZIP file</li>
            <li><strong>Upload to Google Drive:</strong> Upload the ZIP file to your Google Drive</li>
            <li><strong>Share with team:</strong> Share the folder with team members if needed</li>
        </ol>

        <h4>💡 Alternative - Direct Google Drive mount (in Colab):</h4>
        <p>If you're in Google Colab, you can mount Google Drive and save directly there!</p>

        <h4>📋 What's included in your persistent data:</h4>
        <ul>
            <li>🗄️ Complete ChromaDB vector database</li>
            <li>📊 All document embeddings</li>
            <li>🔍 Searchable knowledge base</li>
            <li>📁 Collection metadata</li>
        </ul>
    </div>
    """
    display(HTML(instructions))
    print(f"📁 Persistent data ready at: {persist_dir}")

✅ API key loaded from Colab secrets
🚀 PERSISTENT CHROMADB RAG SYSTEM
💾 This version saves your data locally for persistence!
☁️  Optional: Run mount_google_drive_and_setup() for direct Google Drive storage

1️⃣  Mounting Google Drive (optional but recommended)...
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted successfully!
📁 Persistent directory set to: /content/drive/MyDrive/ChromaDB_Parkinson_Data

2️⃣  Checking persistence status...
🔍 PERSISTENCE STATUS CHECK
📁 Persistent Directory: /content/drive/MyDrive/ChromaDB_Parkinson_Data
📂 Directory Exists: True
📊 Files in directory: 2
📋 Contents:
   - chroma.sqlite3 (163840 bytes)
   - complete_knowledge_base_sample.json (171 bytes)
💾 Total Size: 0.16 MB

3️⃣  Available commands:
   - run_complete_pipeline()                 # Full pipeline with persistence
   - process_complete_knowledge_base()       # Create knowledge base only
   - inspect_chromadb('collection_name')     # Inspect existing data
   - backup_chromadb_to_zip()               # Create ZIP backup
   - prepare_for_google_drive_upload()      # Upload instructions
   - check_persistence_status()             # Check data status

✨ Ready to process your Parkinson's knowledge base with full persistence!
🚀 STARTING COMPLETE PARKINSON'S RAG PIPELINE WITH PERSISTENCE

📚 STEP 1: Creating complete knowledge base with LOCAL PERSISTENCE...
🚀 Starting COMPLETE Parkinson's Knowledge Base

Saving parkinsons_full_crawl.json to parkinsons_full_crawl.json
✅ Uploaded file: parkinsons_full_crawl.json
✅ Successfully loaded data with 9 organizations
Initializing ChromaDB with LOCAL PERSISTENCE...
📁 Using persistent directory: /content/drive/MyDrive/ChromaDB_Parkinson_Data
✅ ChromaDB initialized successfully - Collection: parkinsons_complete_kb
📁 Persistent directory: /content/drive/MyDrive/ChromaDB_Parkinson_Data
Processing organizations...
----------------------------------------

📂 Processing Parkinson's Foundation...
Processed 12 chunks from https://www.parkinson.org/
Processed 15 chunks from https://www.parkinson.org/understanding-parkinsons/movement-symptoms
Processed 27 chunks from https://www.parkinson.org/advancing-research/advocate-research
Processed 10 chunks from https://www.parkinson.org/living-with-parkinsons/stories
Processed 15 chunks from https://www.parkinson.org/understanding-parkinsons/10-early-signs
Processed 13 chunks from https://www.parkinson.org/resource

🎉 SUCCESS: Data persisted to /content/drive/MyDrive/ChromaDB_Parkinson_Data

🎉 SUCCESS! Your COMPLETE knowledge base is ready!
   📚 Organizations processed: 9
   📄 Text document chunks: 6655
   🎬 Media document chunks: 639
   📊 Total document chunks: 7294
   🗄️  ChromaDB collection: parkinsons_complete_kb
   💾 Data persisted to: /content/drive/MyDrive/ChromaDB_Parkinson_Data

🧪 STEP 2: Testing persistent knowledge base...
🔍 CHROMADB INSPECTOR (PERSISTENT VERSION)
📊 Collection Statistics:
   collection_name: parkinsons_complete_kb
   total_documents: 7294
   sample_ids: ['087cf170-cc50-4b41-8b43-99806a9a0bc4', '93880001-2c94-4e9f-9b11-40fd57fc975b', '56121b29-7430-4bff-8bc6-07ef17ef6a0e', 'f4059cc8-e424-4371-afa4-2319300b9366', 'f7ccf8d4-e47e-40a6-9e64-7c65de56638c']
   persistent_directory: /content/drive/MyDrive/ChromaDB_Parkinson_Data

🏢 Organizations in Database:
   1. American Parkinson Disease Association
   2. Michael J. Fox Foundation
   3. PMD Alliance
   4. Parkinson Canada
  

📁 Persistent data ready at: /content/drive/MyDrive/ChromaDB_Parkinson_Data

🎉 Pipeline complete! Your persistent knowledge base is ready for use and Google Drive upload.
