# MFEGSN - PDF RAG System on Google Colab

This notebook allows you to run the MFEGSN PDF RAG System on Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yaniber/MFEGSN/blob/main/MFEGSN_Colab.ipynb)

## Features
- 📤 Upload PDFs or import from Google Drive
- 🔍 Extract content and convert to Markdown
- 🧠 Index documents with RAG (Retrieval-Augmented Generation)
- 🔎 Semantic search across your documents
- 💾 Save outputs to GitHub or Google Drive
- 🌐 **NEW:** Access via public URL with Ngrok
- ✅ **NEW:** Select and process specific PDFs
- 📊 **NEW:** Real-time processing progress
- 🚀 **NEW:** One-click GitHub export

## 1. Setup and Installation

Install dependencies and clone the repository.
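
After the install cell has run, it can help to confirm that the key packages actually import before moving on. The following is a minimal sketch: the helper name `missing_modules` and the example module names are illustrative assumptions, not part of the repository, so substitute the top-level packages listed in `requirements.txt`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list: replace with the top-level packages from requirements.txt.
required = ["json", "pathlib"]
missing = missing_modules(required)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, re-running the install cell is usually the first thing to try.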

In [None]:
# Clone the repository
!git clone https://github.com/yaniber/MFEGSN.git
%cd MFEGSN

# Install dependencies with compatibility handling
# Note: Some dependency warnings are expected in Colab and can be safely ignored
!pip install -q -r requirements.txt 2>&1 | grep -v "dependency conflicts" || true

print("✅ Installation complete!")
print("\n⚠️  Note: Dependency warnings with google-colab packages are normal and can be ignored.")

## 2. Mount Google Drive (Optional)

Mount your Google Drive to import PDFs or save outputs.
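
Later cells assume the mount succeeded. As a small sketch, a check like the one below can guard Drive-dependent code; the function name `drive_ready` is an assumption introduced here, and it only tests that a `MyDrive` folder exists under the mount point used in this notebook.

```python
from pathlib import Path


def drive_ready(mount_point="/content/drive"):
    """Return True if the mount point contains a MyDrive folder."""
    return (Path(mount_point) / "MyDrive").is_dir()


if drive_ready():
    print("Google Drive is mounted.")
else:
    print("Google Drive is not mounted; run the mount cell first.")
```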

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

print("\n✅ Google Drive mounted at /content/drive")
print("\nYou can now access files from:")
print("  - My Drive: /content/drive/MyDrive/")
print("  - Shared drives: /content/drive/Shareddrives/")

## 3. Import PDFs from Google Drive

Copy PDFs from your Google Drive to the working directory.

### Option A: Import from Multiple Folders

Import all PDFs from multiple Google Drive folders at once.
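
The import cell matches files with `glob("*.pdf")`, which on Colab's Linux filesystem is case-sensitive and does not descend into subfolders. If your folders contain files such as `REPORT.PDF` or nested directories, a helper along these lines may be a useful substitute. This is a sketch under that assumption; `find_pdfs` is a hypothetical name, not a function from the repository.

```python
from pathlib import Path


def find_pdfs(folder, recursive=False):
    """List PDF files in `folder`, matching the extension case-insensitively."""
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```

For example, `find_pdfs("/content/drive/MyDrive/PDFs", recursive=True)` could replace the `glob` call inside the loop.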

In [None]:
import shutil
from pathlib import Path
import os

# Option 1: Import from multiple folders
# Add all your Google Drive folders containing PDFs
GDRIVE_PDF_FOLDERS = [
    "/content/drive/MyDrive/PDFs",
    "/content/drive/MyDrive/Documents",
    "/content/drive/MyDrive/Research",
    # Add more folders as needed
]

# Create pdfs directory if it doesn't exist
Path("pdfs").mkdir(exist_ok=True)

total_imported = 0
print("📁 Searching for PDFs in Google Drive folders...\n")

for folder_path in GDRIVE_PDF_FOLDERS:
    if os.path.exists(folder_path):
        pdf_files = list(Path(folder_path).glob("*.pdf"))

        if pdf_files:
            print(f"📂 Folder: {folder_path}")
            print(f"   Found {len(pdf_files)} PDF(s)")

            for pdf_file in pdf_files:
                dest = Path("pdfs") / pdf_file.name

                # Handle duplicate filenames by adding folder name
                if dest.exists():
                    folder_name = Path(folder_path).name
                    dest = Path("pdfs") / f"{folder_name}_{pdf_file.name}"

                shutil.copy2(pdf_file, dest)
                print(f"   ✓ Copied: {pdf_file.name}")
                total_imported += 1
            print()
        else:
            print(f"⊘ Folder: {folder_path} (no PDFs found)\n")
    else:
        print(f"⚠️  Folder not found: {folder_path}\n")

if total_imported > 0:
    print(f"\n✅ Successfully imported {total_imported} PDF(s) from {len([f for f in GDRIVE_PDF_FOLDERS if os.path.exists(f)])} folder(s)")
else:
    print("\n⚠️  No PDF files found in any of the specified folders")
    print("\nTip: Update GDRIVE_PDF_FOLDERS list with the correct paths to your PDFs")
    print("Example paths:")
    print("  - /content/drive/MyDrive/FolderName")
    print("  - /content/drive/Shareddrives/SharedFolderName")

### Option B: Interactive File Selection

Browse and select specific PDF files from any Google Drive location.
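
The selection prompt in the cell accepts comma-separated numbers or `all`. If you would also like ranges such as `2-4`, the parsing step could be sketched as follows. `parse_selection` is a hypothetical helper introduced for illustration; unlike the cell, it raises `ValueError` on out-of-range numbers instead of silently dropping them.

```python
def parse_selection(text, count):
    """Convert '1,3-5' or 'all' into sorted zero-based indices below `count`."""
    text = text.strip().lower()
    if text == "all":
        return list(range(count))
    indices = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
        else:
            start = end = int(part)
        if start < 1 or end > count or start > end:
            raise ValueError(f"selection {part!r} is outside 1..{count}")
        # Convert the 1-based inclusive range to 0-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)
```

With six files listed, `parse_selection("1,3-5", 6)` yields the indices for files 1, 3, 4 and 5.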

In [None]:
import shutil
from pathlib import Path
import os

# Alternative: Browse and select files interactively
print("📂 Interactive PDF Selection from Google Drive\n")
print("Instructions:")
print("1. Enter the full path to a Google Drive folder")
print("2. Select the PDFs you want to import")
print("3. Repeat for additional folders")
print("4. Type 'done' when finished\n")

# Create pdfs directory if it doesn't exist
Path("pdfs").mkdir(exist_ok=True)

total_imported = 0

while True:
    folder_path = input("\nEnter folder path (or 'done' to finish): ").strip()
    
    if folder_path.lower() == 'done':
        break
    
    if not os.path.exists(folder_path):
        print(f"❌ Folder not found: {folder_path}")
        continue

    # Find all PDFs in the folder
    pdf_files = list(Path(folder_path).glob("*.pdf"))

    if not pdf_files:
        print(f"⚠️  No PDF files found in {folder_path}")
        continue

    print(f"\n📂 Found {len(pdf_files)} PDF(s) in folder:")
    print(f"   {folder_path}\n")
    
    # Display files with numbers
    for idx, pdf_file in enumerate(pdf_files, 1):
        print(f"   {idx}. {pdf_file.name}")
    
    # Ask user which files to import
    selection = input("\nEnter file numbers to import (e.g., '1,3,5' or 'all'): ").strip()
    
    if selection.lower() == 'all':
        selected_files = pdf_files
    else:
        try:
            indices = [int(x.strip()) - 1 for x in selection.split(',')]
            selected_files = [pdf_files[i] for i in indices if 0 <= i < len(pdf_files)]
        except (ValueError, IndexError):
            print("❌ Invalid selection. Skipping this folder.")
            continue

    # Copy selected files
    print(f"\n📥 Importing {len(selected_files)} file(s)...")
    for pdf_file in selected_files:
        dest = Path("pdfs") / pdf_file.name

        # Handle duplicate filenames
        if dest.exists():
            folder_name = Path(folder_path).name
            dest = Path("pdfs") / f"{folder_name}_{pdf_file.name}"

        shutil.copy2(pdf_file, dest)
        print(f"   ✓ Copied: {pdf_file.name}")
        total_imported += 1

if total_imported > 0:
    print(f"\n\n✅ Successfully imported {total_imported} PDF(s) from Google Drive")
    print(f"\n📊 Total PDFs in working directory: {len(list(Path('pdfs').glob('*.pdf')))}")
else:
    print("\n⚠️  No files were imported")

## 4. Or Upload PDFs Directly

Upload PDFs from your local computer.
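
The upload cell decides what counts as a PDF from the file extension alone. Because `files.upload()` returns a dictionary mapping each filename to its raw bytes, a content check is also possible. The sketch below looks for the `%PDF-` header near the start of the data; `looks_like_pdf` is a hypothetical helper, and the 1024-byte window is a lenient assumption rather than a strict validation.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the PDF header marker appears in the first 1024 bytes."""
    return b"%PDF-" in data[:1024]


# Sample data stands in for the `uploaded` dictionary from the upload cell.
samples = {"report.pdf": b"%PDF-1.7\n...", "notes.pdf": b"plain text"}
for name, data in samples.items():
    status = "PDF" if looks_like_pdf(data) else "not a PDF"
    print(f"{name}: {status}")
```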

In [None]:
import shutil
from google.colab import files
from pathlib import Path

# Create pdfs directory if it doesn't exist
Path("pdfs").mkdir(exist_ok=True)

# Upload files
print("📤 Select PDF files to upload...")
uploaded = files.upload()

# Move uploaded files to pdfs directory (extension check is case-insensitive)
for filename in uploaded.keys():
    if filename.lower().endswith('.pdf'):
        dest = Path("pdfs") / filename
        shutil.move(filename, dest)
        print(f"✓ Uploaded: {filename}")
    else:
        print(f"⚠️  Skipped non-PDF file: {filename}")

print("\n✅ Upload complete!")

## 5. Process PDFs

Extract content from PDFs and index them.
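
Indexing for retrieval generally means splitting each document into chunks before embedding them, which is why the statistics at the end of the cell report both documents and chunks. How `RAGIndexer` chunks text is not shown in this notebook, so the following is only an illustration of the general idea using fixed-size character windows with overlap; the function name and default sizes are assumptions.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split `text` into overlapping character windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of some duplicated text.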

In [None]:
from src.pdf_extractor.extractor import PDFExtractor
from src.rag_indexer.indexer import RAGIndexer
from pathlib import Path
import os

# Ensure directories exist
Path("pdfs").mkdir(exist_ok=True)
Path("markdown_outputs").mkdir(exist_ok=True)
Path("chroma_db").mkdir(exist_ok=True)

# Initialize components
print("Initializing PDF extractor and RAG indexer...")
try:
    pdf_extractor = PDFExtractor()
    rag_indexer = RAGIndexer()
    print("✓ Components initialized successfully\n")
except Exception as e:
    print(f"❌ Error initializing components: {e}")
    print("\nTroubleshooting:")
    print("1. Make sure you're in the MFEGSN directory")
    print("2. Try running the installation cell again")
    raise

# Get all PDFs
pdf_files = list(Path("pdfs").glob("*.pdf"))

if not pdf_files:
    print("⚠️  No PDF files found. Please upload or import PDFs first.")
else:
    print(f"\nProcessing {len(pdf_files)} PDF(s)...\n")

    for pdf_path in pdf_files:
        try:
            print(f"📄 Processing: {pdf_path.name}")

            # Extract PDF content
            result = pdf_extractor.extract_pdf(str(pdf_path))
            print(f"   ✓ Extracted to: {result['markdown_path']}")
            
            # Index in RAG database
            doc_id = pdf_path.stem
            rag_indexer.index_document(
                doc_id=doc_id,
                content=result["markdown"],
                metadata={
                    "source": str(pdf_path),
                    "markdown_path": result["markdown_path"]
                }
            )
            print(f"   ✓ Indexed as: {doc_id}\n")

        except Exception as e:
            print(f"   ❌ Error: {str(e)}\n")

    print("\n✅ Processing complete!")

    # Show statistics
    try:
        stats = rag_indexer.get_collection_stats()
        print(f"\n📊 Statistics:")
        print(f"   Total documents: {stats['total_documents']}")
        print(f"   Total chunks: {stats['total_chunks']}")
    except Exception as e:
        print(f"   ⚠️  Could not retrieve stats: {e}")

## 6. Query Your Documents

Perform semantic search across your indexed documents.
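
Semantic search ranks chunks by how close their embedding vectors are to the embedding of the query. The toy sketch below shows that ranking step with cosine similarity on hand-written two-dimensional vectors. It is not how `RAGIndexer` is implemented: real embeddings come from a model and have many more dimensions, and the names `cosine_similarity` and `top_k` are introduced here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_id, similarity) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up vectors standing in for embedded chunks.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
for doc_id, score in top_k([1.0, 0.1], docs, k=2):
    print(f"{doc_id}: {score:.3f}")
```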

In [None]:
# Query the indexed documents
query = "What is the main topic?"  # Change this to your query
n_results = 3  # Number of results to return

print(f"🔍 Query: {query}\n")

results = rag_indexer.query(query, n_results)

if results['results']:
    print(f"Found {len(results['results'])} result(s):\n")
    
    for i, (doc, metadata, distance) in enumerate(zip(
        results['results'],
        results['metadatas'],
        results['distances']
    )):
        relevance = 1 - distance
        print(f"Result {i+1} (Relevance: {relevance:.3f})")
        print(f"Document: {metadata.get('doc_id', 'unknown')}")
        print(f"Chunk: {metadata.get('chunk_id', 'unknown')}")
        print(f"Content: {doc[:300]}...")
        print("-" * 80 + "\n")
else:
    print("No results found.")

## 7. Save Outputs

Save your processed outputs to Google Drive or GitHub.

### Option 1: Save to Google Drive

In [None]:
import shutil
from pathlib import Path
from datetime import datetime

# Configure output path in Google Drive
GDRIVE_OUTPUT_FOLDER = "/content/drive/MyDrive/MFEGSN_Outputs"  # Change this path!

# Create output folder with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_path = Path(GDRIVE_OUTPUT_FOLDER) / f"output_{timestamp}"
output_path.mkdir(parents=True, exist_ok=True)

# Copy markdown outputs
if Path("markdown_outputs").exists():
    markdown_dest = output_path / "markdown_outputs"
    shutil.copytree("markdown_outputs", markdown_dest, dirs_exist_ok=True)
    print(f"✓ Saved markdown files to: {markdown_dest}")

# Copy database
if Path("chroma_db").exists():
    db_dest = output_path / "chroma_db"
    shutil.copytree("chroma_db", db_dest, dirs_exist_ok=True)
    print(f"✓ Saved database to: {db_dest}")

print(f"\n✅ All outputs saved to: {output_path}")

### Option 2: Save to GitHub (New Branch)

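Push your outputs to a new branch in your GitHub repository. The Git configuration and push cells for this option follow section 9 (Stop Web Server); the push cell reuses the GitHub PAT entered in section 8 if one was provided.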

## 8. Launch Web Interface with Ngrok (One-Step Setup)

Start a web server with optional public URL access via Ngrok. This allows you to:
- Select and process specific PDFs
- View processing progress in real-time
- Export results to GitHub
- Query your documents

**You will be prompted for:**
- Ngrok authtoken (optional, for public URL access)
- GitHub PAT (optional, for GitHub export)
- Google Drive API key (optional, for Drive integration)

Press Enter to skip any optional configuration.
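
The launch cell waits a fixed five seconds for the server to come up. On a slow runtime that may not be enough, so polling the port is one alternative. The sketch below assumes the server listens on port 8000, as the local URL printed by the cell suggests; `wait_for_port` is a hypothetical helper, not part of the repository.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=8000, timeout=30.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False


# Possible use after starting the server thread:
#     if not wait_for_port(port=8000, timeout=30):
#         print("Server did not start in time; check /tmp/web_server.log")
```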

In [None]:
import os
import subprocess
import time
from threading import Thread
from getpass import getpass

# Configure API Keys (inline, one-step setup)
print("\n" + "="*60)
print("🚀 MFEGSN WEB INTERFACE - ONE-STEP LAUNCH")
print("="*60)
print("\n⚙️  Quick Configuration (Press Enter to skip any option)\n")

# Check if already configured
ngrok_token = os.environ.get('NGROK_AUTHTOKEN', '')
github_pat = os.environ.get('GITHUB_PAT', '')
gdrive_key = os.environ.get('GOOGLE_DRIVE_API_KEY', '')

# Ngrok Authtoken (most important for public access)
if not ngrok_token:
    print("🌐 Ngrok Authtoken (for public URL access)")
    print("   Get your free token from: https://dashboard.ngrok.com/get-started/your-authtoken")
    ngrok_token = getpass("   Enter token (or press Enter for local-only mode): ")
    if ngrok_token:
        os.environ['NGROK_AUTHTOKEN'] = ngrok_token
        print("   ✓ Ngrok configured - Public URL will be enabled\n")
    else:
        print("   ⊖ Skipped - Will run in local-only mode\n")
else:
    print("🌐 Ngrok: ✓ Already configured\n")

# GitHub Personal Access Token (for export feature)
if not github_pat:
    print("🔑 GitHub Personal Access Token (for GitHub export)")
    print("   Get from: https://github.com/settings/tokens (scope: repo)")
    github_pat = getpass("   Enter token (or press Enter to skip): ")
    if github_pat:
        os.environ['GITHUB_PAT'] = github_pat
        print("   ✓ GitHub PAT configured\n")
    else:
        print("   ⊖ Skipped - Manual export will be required\n")
else:
    print("🔑 GitHub PAT: ✓ Already configured\n")

# Google Drive API Key (less commonly needed)
if not gdrive_key:
    print("📁 Google Drive API Key (for advanced Drive integration)")
    print("   Get from: https://developers.google.com/drive/api/v3/quickstart/python")
    gdrive_key = getpass("   Enter key (or press Enter to skip): ")
    if gdrive_key:
        os.environ['GOOGLE_DRIVE_API_KEY'] = gdrive_key
        print("   ✓ Google Drive API key configured\n")
    else:
        print("   ⊖ Skipped\n")
else:
    print("📁 Google Drive API: ✓ Already configured\n")

print("\n" + "="*60)
print("🚀 LAUNCHING WEB SERVER")
print("="*60 + "\n")

# Stop any existing server (get PIDs and kill them)
result = subprocess.run(['pgrep', '-f', 'uvicorn'], capture_output=True, text=True)
if result.stdout.strip():
    print("🛠️  Stopping existing server...")
    for pid in result.stdout.strip().split('\n'):
        try:
            os.kill(int(pid), 9)
        except (OSError, ProcessLookupError):
            pass
    time.sleep(2)

# Set environment variable for web interface
ngrok_enabled = bool(os.environ.get('NGROK_AUTHTOKEN', ''))
os.environ['USE_NGROK'] = 'true' if ngrok_enabled else 'false'

print("💻 Starting web server...")

# Start the web server in background
def run_server():
    os.system('python web_interface.py > /tmp/web_server.log 2>&1')

server_thread = Thread(target=run_server, daemon=True)
server_thread.start()

# Wait for server to start
print("⏳ Waiting for server to initialize...")
time.sleep(5)

# Check if server is running and get the public URL
if ngrok_enabled:
    print("🌐 Setting up Ngrok tunnel...")
    time.sleep(3)  # Give ngrok more time to connect
    try:
        from pyngrok import ngrok
        tunnels = ngrok.get_tunnels()
        
        if tunnels:
            public_url = tunnels[0].public_url
            print("\n" + "="*60)
            print("✅ WEB INTERFACE READY WITH PUBLIC ACCESS")
            print("="*60)
            print(f"🌐 Public URL:  {public_url}")
            print(f"🏠 Local URL:   http://localhost:8000")
            print("="*60)
            print("\n📌 NEXT STEPS:")
            print("   1. Copy the Public URL above")
            print("   2. Open it in your browser")
            print("   3. Start using the web interface!")
            print("\n✨ FEATURES AVAILABLE:")
            print("   • Select and process specific PDFs")
            print("   • View real-time processing progress")
            print("   • Export results to GitHub")
            print("   • Query your documents")
            print("\n🔒 SECURITY NOTE:")
            print("   - Don't share your public URL with untrusted parties")
            print("   - The tunnel will close when you stop the server\n")
        else:
            print("\n⚠️  Ngrok tunnel not established. Check logs below:")
            print("\n--- Server Logs ---")
            try:
                with open('/tmp/web_server.log', 'r') as f:
                    print(f.read()[-1000:])
            except (FileNotFoundError, IOError):
                print("No logs available")
    except Exception as e:
        print(f"\n❌ Error setting up Ngrok tunnel: {e}")
        print("\nFalling back to local mode...")
        print("\n" + "="*60)
        print("✅ WEB INTERFACE READY (Local Mode)")
        print("="*60)
        print(f"🏠 Local URL: http://localhost:8000")
        print("="*60)
        print("\n💻 Access from Colab environment only\n")
else:
    print("\n" + "="*60)
    print("✅ WEB INTERFACE READY (Local Mode)")
    print("="*60)
    print(f"🏠 Local URL: http://localhost:8000")
    print("="*60)
    print("\n💻 Running in local-only mode")
    print("\n💡 TIP: To enable public access next time:")
    print("   1. Get a free Ngrok token from: https://ngrok.com")
    print("   2. Re-run this cell and enter the token when prompted\n")

### View Server Logs (Troubleshooting)

If the server is not working as expected, check the logs below.
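
For long-running sessions the log file can grow large. As a sketch, the last lines can be read without holding the whole file in a list by streaming it through a bounded `deque`; `tail` is a hypothetical helper name, and the log path is the one used by the launch cell.

```python
from collections import deque


def tail(path, n=50):
    """Return the last `n` lines of a text file, keeping at most `n` lines in memory."""
    with open(path, "r", errors="replace") as f:
        return list(deque(f, maxlen=n))


# Possible use:
#     print("".join(tail("/tmp/web_server.log", 50)))
```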

In [None]:
# View last 50 lines of server logs
try:
    with open('/tmp/web_server.log', 'r') as f:
        lines = f.readlines()
        print(''.join(lines[-50:]))
except FileNotFoundError:
    print("No log file found. The server may not have started yet.")

## 9. Stop Web Server

Run this cell when you're done to stop the web server and close the Ngrok tunnel.

In [None]:
import os
import subprocess

print("🛑 Stopping web server...")

# Kill uvicorn processes
result = subprocess.run(['pgrep', '-f', 'uvicorn'], capture_output=True, text=True)
if result.stdout.strip():
    for pid in result.stdout.strip().split('\n'):
        try:
            os.kill(int(pid), 9)
            print(f"✓ Stopped process {pid}")
        except Exception as e:
            print(f"⚠️  Could not stop process {pid}: {e}")
else:
    print("No running server processes found")

# Close ngrok tunnels
try:
    from pyngrok import ngrok
    ngrok.kill()
    print("✓ Ngrok tunnels closed")
except Exception as e:
    print(f"⚠️  Could not close Ngrok: {e}")

print("\n✅ Server stopped")

In [None]:
# Configure Git
!git config --global user.email "your-email@example.com"  # Change this!
!git config --global user.name "Your Name"  # Change this!

print("✅ Git configured")

In [None]:
from datetime import datetime
import os
import subprocess

# Create a new branch with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
branch_name = f"colab-outputs-{timestamp}"

print(f"Creating new branch: {branch_name}")
!git checkout -b {branch_name}

# Add outputs
print("\nAdding outputs to git...")
!git add markdown_outputs/ chroma_db/ pdfs/

# Commit changes
commit_message = f"Add Colab outputs from {timestamp}"
!git commit -m "{commit_message}"

print(f"\n✅ Changes committed to branch: {branch_name}")

# Check if GITHUB_PAT is configured
github_pat = os.environ.get('GITHUB_PAT', '')

if github_pat:
    print("\n📤 Pushing to GitHub using configured PAT...")
    # Use the PAT from environment; run via subprocess so the result can be checked
    push_url = f"https://{github_pat}@github.com/yaniber/MFEGSN.git"
    push = subprocess.run(["git", "push", push_url, branch_name], capture_output=True, text=True)
    if push.returncode == 0:
        print("\n✅ Successfully pushed to GitHub!")
        print(f"\n📝 Create a Pull Request at:")
        print(f"   https://github.com/yaniber/MFEGSN/compare/{branch_name}")
    else:
        # Redact the token before showing git's error output
        print("\n❌ Push failed:")
        print(push.stderr.replace(github_pat, "***"))
else:
    print("\n⚠️  GitHub PAT not configured. To push, you have two options:")
    print("\n📝 Option 1: Configure PAT in the API keys cell above, then re-run this cell")
    print("\n📝 Option 2: Manual push with token:")
    print("\n1. Generate a token at: https://github.com/settings/tokens")
    print("2. Run: !git push https://YOUR_TOKEN@github.com/yaniber/MFEGSN.git", branch_name)
    print("3. Create a Pull Request on GitHub")


### Option 3: Download Outputs Locally

In [None]:
from google.colab import files
import shutil
from pathlib import Path
from datetime import datetime

# Create a zip file with the markdown outputs
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
archive_name = f"mfegsn_outputs_{timestamp}"

print("Creating archive...")
shutil.make_archive(archive_name, 'zip', '.', 'markdown_outputs')

print(f"Downloading {archive_name}.zip...")
files.download(f"{archive_name}.zip")

print("\n✅ Download complete!")

## 📚 Additional Resources

- [GitHub Repository](https://github.com/yaniber/MFEGSN)
- [Full Documentation](https://github.com/yaniber/MFEGSN/blob/main/README.md)
- [Docker Setup Guide](https://github.com/yaniber/MFEGSN/blob/main/DOCKER.md)

## 🆘 Troubleshooting

### Common Issues

1. **Out of Memory**: Try processing fewer PDFs at once
2. **Google Drive Access**: Make sure to run the "Mount Google Drive" cell first
3. **PDF Processing Errors**: Some PDFs may have complex formatting that's difficult to extract
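
For the out-of-memory case, one way to process fewer PDFs at once is to split the file list into small batches and run the processing loop on one batch at a time. A minimal sketch, assuming a hypothetical helper named `batched` and a batch size you choose yourself:

```python
def batched(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Example with placeholder filenames; in the notebook this would be `pdf_files`.
for batch in batched(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], 2):
    print(batch)
```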

### Need Help?

Open an issue on GitHub: https://github.com/yaniber/MFEGSN/issues