# MFEGSN - PDF RAG System on Google Colab

This notebook allows you to run the MFEGSN PDF RAG System on Google Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/yaniber/MFEGSN/blob/main/MFEGSN_Colab.ipynb)

## Features
- 📤 Upload PDFs or import from Google Drive
- 🔍 Extract content and convert to Markdown
- 🧠 Index documents with RAG (Retrieval-Augmented Generation)
- 🔎 Semantic search across your documents
- 💾 Save outputs to GitHub or Google Drive

## 1. Setup and Installation

Clone the repository and install its dependencies.

In [None]:
# Clone the repository
!git clone https://github.com/yaniber/MFEGSN.git
%cd MFEGSN

# Install dependencies
!pip install -q -r requirements.txt

print("✅ Installation complete!")

## 2. Mount Google Drive (Optional)

Mount your Google Drive to import PDFs or save outputs.

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

print("\n✅ Google Drive mounted at /content/drive")
print("\nYou can now access files from:")
print("  - My Drive: /content/drive/MyDrive/")
print("  - Shared drives: /content/drive/Shareddrives/")

## 3. Import PDFs from Google Drive

Copy PDFs from your Google Drive to the working directory.

### Option A: Import from Multiple Folders

Import all PDFs from multiple Google Drive folders at once.

In [None]:
import shutil
from pathlib import Path
import os

# Option 1: Import from multiple folders
# Add all your Google Drive folders containing PDFs
GDRIVE_PDF_FOLDERS = [
    "/content/drive/MyDrive/PDFs",
    "/content/drive/MyDrive/Documents",
    "/content/drive/MyDrive/Research",
    # Add more folders as needed
]

# Create pdfs directory if it doesn't exist
Path("pdfs").mkdir(exist_ok=True)

total_imported = 0
print("📁 Searching for PDFs in Google Drive folders...\n")

for folder_path in GDRIVE_PDF_FOLDERS:
    if os.path.exists(folder_path):
        pdf_files = list(Path(folder_path).glob("*.pdf"))
        
        if pdf_files:
            print(f"📂 Folder: {folder_path}")
            print(f"   Found {len(pdf_files)} PDF(s)")
            
            for pdf_file in pdf_files:
                dest = Path("pdfs") / pdf_file.name
                
                # Handle duplicate filenames by adding folder name
                if dest.exists():
                    folder_name = Path(folder_path).name
                    dest = Path("pdfs") / f"{folder_name}_{pdf_file.name}"
                
                shutil.copy2(pdf_file, dest)
                print(f"   ✓ Copied: {pdf_file.name}")
                total_imported += 1
            print()
        else:
            print(f"⊘ Folder: {folder_path} (no PDFs found)\n")
    else:
        print(f"⚠️  Folder not found: {folder_path}\n")

if total_imported > 0:
    print(f"\n✅ Successfully imported {total_imported} PDF(s) from {len([f for f in GDRIVE_PDF_FOLDERS if os.path.exists(f)])} folder(s)")
else:
    print("\n⚠️  No PDF files found in any of the specified folders")
    print("\nTip: Update GDRIVE_PDF_FOLDERS list with the correct paths to your PDFs")
    print("Example paths:")
    print("  - /content/drive/MyDrive/FolderName")
    print("  - /content/drive/Shareddrives/SharedFolderName")
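
The folder-name prefix used above resolves a single name collision, but `shutil.copy2` silently overwrites if the prefixed name is also taken (for example, when the cell is re-run or two source folders share a name). A minimal sketch of a collision-free alternative follows; `unique_destination` is a hypothetical helper, not part of the MFEGSN repository.

```python
from pathlib import Path


def unique_destination(dest_dir: Path, filename: str) -> Path:
    """Return a path inside dest_dir that does not collide with an existing file."""
    candidate = dest_dir / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    # Append _1, _2, ... to the original stem until the name is free.
    while candidate.exists():
        candidate = dest_dir / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate
```

In the import loop, `dest = unique_destination(Path("pdfs"), pdf_file.name)` could replace the `dest.exists()` check.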

### Option B: Interactive File Selection

Browse and select specific PDF files from any Google Drive location.

In [None]:
import os
import shutil
from pathlib import Path

# Alternative: Browse and select files interactively
print("📂 Interactive PDF Selection from Google Drive\n")
print("Instructions:")
print("1. Enter the full path to a Google Drive folder")
print("2. Select the PDFs you want to import")
print("3. Repeat for additional folders")
print("4. Type 'done' when finished\n")

# Create pdfs directory if it doesn't exist
Path("pdfs").mkdir(exist_ok=True)

total_imported = 0

while True:
    folder_path = input("\nEnter folder path (or 'done' to finish): ").strip()
    
    if folder_path.lower() == 'done':
        break
    
    if not os.path.exists(folder_path):
        print(f"❌ Folder not found: {folder_path}")
        continue
    
    # Find all PDFs in the folder
    pdf_files = list(Path(folder_path).glob("*.pdf"))
    
    if not pdf_files:
        print(f"⚠️  No PDF files found in {folder_path}")
        continue
    
    print(f"\n📂 Found {len(pdf_files)} PDF(s) in folder:")
    print(f"   {folder_path}\n")
    
    # Display files with numbers
    for idx, pdf_file in enumerate(pdf_files, 1):
        print(f"   {idx}. {pdf_file.name}")
    
    # Ask user which files to import
    selection = input("\nEnter file numbers to import (e.g., '1,3,5' or 'all'): ").strip()
    
    if selection.lower() == 'all':
        selected_files = pdf_files
    else:
        try:
            indices = [int(x.strip()) - 1 for x in selection.split(',')]
            selected_files = [pdf_files[i] for i in indices if 0 <= i < len(pdf_files)]
        except ValueError:
            print("❌ Invalid selection. Skipping this folder.")
            continue
    
    # Copy selected files
    print(f"\n📥 Importing {len(selected_files)} file(s)...")
    for pdf_file in selected_files:
        dest = Path("pdfs") / pdf_file.name
        
        # Handle duplicate filenames
        if dest.exists():
            folder_name = Path(folder_path).name
            dest = Path("pdfs") / f"{folder_name}_{pdf_file.name}"
        
        shutil.copy2(pdf_file, dest)
        print(f"   ✓ Copied: {pdf_file.name}")
        total_imported += 1

if total_imported > 0:
    print(f"\n\n✅ Successfully imported {total_imported} PDF(s) from Google Drive")
    print(f"\n📊 Total PDFs in working directory: {len(list(Path('pdfs').glob('*.pdf')))}")
else:
    print("\n⚠️  No files were imported")
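
The selection parsing above can be pulled into a small function, which makes it easier to test outside the input loop. The sketch below is an illustration only: `parse_selection` is a hypothetical name, and the range syntax (`2-4`) is an extension that the cell above does not support.

```python
from typing import List


def parse_selection(selection: str, n_items: int) -> List[int]:
    """Convert '1,3,5', '2-4', or 'all' into sorted zero-based indices.

    Numbers outside 1..n_items are ignored; non-numeric input raises ValueError.
    """
    selection = selection.strip().lower()
    if selection == "all":
        return list(range(n_items))
    indices = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        for number in range(start, end + 1):
            if 1 <= number <= n_items:
                indices.add(number - 1)
    return sorted(indices)
```

For example, `parse_selection("1,3,5", 5)` returns `[0, 2, 4]`.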

## 4. Or Upload PDFs Directly

Upload PDFs from your local computer.

In [None]:
from google.colab import files
import shutil
from pathlib import Path

# Create pdfs directory if it doesn't exist
Path("pdfs").mkdir(exist_ok=True)

# Upload files
print("📤 Select PDF files to upload...")
uploaded = files.upload()

# Move uploaded files to pdfs directory
for filename in uploaded.keys():
    if filename.lower().endswith('.pdf'):
        dest = Path("pdfs") / filename
        shutil.move(filename, dest)
        print(f"✓ Uploaded: {filename}")
    else:
        print(f"⚠️  Skipped non-PDF file: {filename}")

print("\n✅ Upload complete!")
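
The upload cell decides whether a file is a PDF from its extension alone. A stricter check, sketched below under the assumption that a valid PDF begins with the `%PDF-` signature, reads the first bytes of the file; `looks_like_pdf` is a hypothetical helper.

```python
def looks_like_pdf(path) -> bool:
    """Return True if the file starts with the PDF signature '%PDF-'."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

It could be called on each uploaded file before moving it into `pdfs/`, so that renamed non-PDF files are skipped early instead of failing during extraction.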

## 5. Process PDFs

Extract content from PDFs and index them.

In [None]:
from src.pdf_extractor.extractor import PDFExtractor
from src.rag_indexer.indexer import RAGIndexer
from pathlib import Path

# Initialize components
print("Initializing PDF extractor and RAG indexer...")
pdf_extractor = PDFExtractor()
rag_indexer = RAGIndexer()

# Get all PDFs
pdf_files = list(Path("pdfs").glob("*.pdf"))

if not pdf_files:
    print("⚠️  No PDF files found. Please upload or import PDFs first.")
else:
    print(f"\nProcessing {len(pdf_files)} PDF(s)...\n")
    
    for pdf_path in pdf_files:
        try:
            print(f"📄 Processing: {pdf_path.name}")
            
            # Extract PDF content
            result = pdf_extractor.extract_pdf(str(pdf_path))
            print(f"   ✓ Extracted to: {result['markdown_path']}")
            
            # Index in RAG database
            doc_id = pdf_path.stem
            rag_indexer.index_document(
                doc_id=doc_id,
                content=result["markdown"],
                metadata={
                    "source": str(pdf_path),
                    "markdown_path": result["markdown_path"]
                }
            )
            print(f"   ✓ Indexed as: {doc_id}\n")
            
        except Exception as e:
            print(f"   ❌ Error: {str(e)}\n")
    
    print("\n✅ Processing complete!")
    
    # Show statistics
    stats = rag_indexer.get_collection_stats()
    print(f"\n📊 Statistics:")
    print(f"   Total documents: {stats['total_documents']}")
    print(f"   Total chunks: {stats['total_chunks']}")
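
The loop above prints each error and moves on, so failed files are easy to lose track of in a long run. One possible pattern, shown here as a sketch, collects successes and failures so that failed files can be retried later. `process_all` and `handler` are hypothetical names; `handler` stands in for the extract-and-index steps of the cell above.

```python
def process_all(paths, handler):
    """Call handler(path) for each path; return (succeeded, failed).

    succeeded is a list of paths; failed maps each failing path to its error message.
    """
    succeeded = []
    failed = {}
    for path in paths:
        try:
            handler(path)
            succeeded.append(path)
        except Exception as exc:  # keep going so one bad file does not stop the batch
            failed[path] = str(exc)
    return succeeded, failed
```

After a run, `failed` lists exactly which PDFs need attention and why.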

## 6. Query Your Documents

Perform semantic search across your indexed documents.

In [None]:
# Query the indexed documents
query = "What is the main topic?"  # Change this to your query
n_results = 3  # Number of results to return

print(f"🔍 Query: {query}\n")

results = rag_indexer.query(query, n_results)

if results['results']:
    print(f"Found {len(results['results'])} result(s):\n")
    
    for i, (doc, metadata, distance) in enumerate(zip(
        results['results'],
        results['metadatas'],
        results['distances']
    )):
        # Assumes distances fall in [0, 1]; larger distances give a negative score
        relevance = 1 - distance
        print(f"Result {i+1} (Relevance: {relevance:.3f})")
        print(f"Document: {metadata.get('doc_id', 'unknown')}")
        print(f"Chunk: {metadata.get('chunk_id', 'unknown')}")
        print(f"Content: {doc[:300]}...")
        print("-" * 80 + "\n")
else:
    print("No results found.")
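
The relevance score above is `1 - distance`, which stays within [0, 1] only if the distances do. This notebook does not show which distance metric `RAGIndexer` uses, so the following is a sketch of one alternative mapping that stays in (0, 1] for any non-negative distance. `distance_to_relevance` is a hypothetical helper, and the resulting scores are only useful for ranking, not as probabilities.

```python
def distance_to_relevance(distance: float) -> float:
    """Map a non-negative distance to a score in (0, 1]; smaller distances score higher."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + distance)
```

A distance of 0 maps to 1.0, and the score approaches 0 as the distance grows.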

## 7. Save Outputs

Save your processed outputs to Google Drive or GitHub.

### Option 1: Save to Google Drive

In [None]:
import shutil
from pathlib import Path
from datetime import datetime

# Configure output path in Google Drive
GDRIVE_OUTPUT_FOLDER = "/content/drive/MyDrive/MFEGSN_Outputs"  # Change this path!

# Create output folder with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_path = Path(GDRIVE_OUTPUT_FOLDER) / f"output_{timestamp}"
output_path.mkdir(parents=True, exist_ok=True)

# Copy markdown outputs
if Path("markdown_outputs").exists():
    markdown_dest = output_path / "markdown_outputs"
    shutil.copytree("markdown_outputs", markdown_dest, dirs_exist_ok=True)
    print(f"✓ Saved markdown files to: {markdown_dest}")

# Copy database
if Path("chroma_db").exists():
    db_dest = output_path / "chroma_db"
    shutil.copytree("chroma_db", db_dest, dirs_exist_ok=True)
    print(f"✓ Saved database to: {db_dest}")

print(f"\n✅ All outputs saved to: {output_path}")
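
After copying to Drive, it can be worth confirming that every file arrived. The sketch below compares file counts between the source and destination trees; `count_files` and `copy_is_complete` are hypothetical helpers, and a matching count does not prove the contents are identical.

```python
from pathlib import Path


def count_files(root) -> int:
    """Count regular files under root, including subdirectories."""
    return sum(1 for path in Path(root).rglob("*") if path.is_file())


def copy_is_complete(source, destination) -> bool:
    """Return True if both trees contain the same number of files."""
    return count_files(source) == count_files(destination)
```

For example, `copy_is_complete("markdown_outputs", markdown_dest)` could be checked before printing the success message. Calling `drive.flush_and_unmount()` at the end of a session also helps ensure pending writes reach Drive.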

### Option 2: Save to GitHub (New Branch)

Push your outputs to a new branch in your GitHub repository.

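#### Configure API Keys (Optional)

Enter any keys you want available as environment variables for this session. Within this notebook, only the GitHub PAT is read by a later cell (the push step); the other two are optional:
- **Google Drive API Key**: For programmatic access to Drive
- **GitHub PAT**: For pushing to GitHub repositories
- **Ngrok Authtoken**: For public URL access (if running web interface)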

In [None]:
import os
from getpass import getpass

print("⚙️  Configure API Keys (Optional - Press Enter to skip)\n")

# Google Drive API Key
print("📁 Google Drive API Key")
print("   Get from: https://developers.google.com/drive/api/v3/quickstart/python")
gdrive_key = getpass("   Enter key (or press Enter to skip): ")
if gdrive_key:
    os.environ['GOOGLE_DRIVE_API_KEY'] = gdrive_key
    print("   ✓ Google Drive API key configured\n")
else:
    print("   ⊘ Skipped\n")

# GitHub Personal Access Token
print("🔑 GitHub Personal Access Token (PAT)")
print("   Get from: https://github.com/settings/tokens")
print("   Required scopes: repo")
github_pat = getpass("   Enter token (or press Enter to skip): ")
if github_pat:
    os.environ['GITHUB_PAT'] = github_pat
    print("   ✓ GitHub PAT configured\n")
else:
    print("   ⊘ Skipped\n")

# Ngrok Authtoken
print("🌐 Ngrok Authtoken (for public URL)")
print("   Get from: https://dashboard.ngrok.com/get-started/your-authtoken")
ngrok_token = getpass("   Enter token (or press Enter to skip): ")
if ngrok_token:
    os.environ['NGROK_AUTHTOKEN'] = ngrok_token
    print("   ✓ Ngrok authtoken configured\n")
else:
    print("   ⊘ Skipped\n")

print("✅ API key configuration complete!")
print("\nNote: These are set as environment variables and will be available for this session.")
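
To confirm later in the session which keys were set, without printing the secrets themselves, a small status check can help. This is a sketch; `key_status` is a hypothetical helper that only looks at the three environment variable names used in the cell above.

```python
import os

KEY_NAMES = ("GOOGLE_DRIVE_API_KEY", "GITHUB_PAT", "NGROK_AUTHTOKEN")


def key_status(environ=os.environ):
    """Report which keys are set (True/False) without exposing their values."""
    return {name: bool(environ.get(name)) for name in KEY_NAMES}
```

Calling `key_status()` returns a dictionary of booleans, which is safe to print or log.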

In [None]:
# Configure Git
!git config --global user.email "your-email@example.com"  # Change this!
!git config --global user.name "Your Name"  # Change this!

print("✅ Git configured")

In [None]:
from datetime import datetime
import os

# Create a new branch with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
branch_name = f"colab-outputs-{timestamp}"

print(f"Creating new branch: {branch_name}")
!git checkout -b {branch_name}

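# Add outputs (only directories that exist; git add aborts on a missing path)
print("\nAdding outputs to git...")
existing_outputs = " ".join(
    d for d in ("markdown_outputs", "chroma_db", "pdfs") if os.path.isdir(d)
)
!git add {existing_outputs}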

# Commit changes
commit_message = f"Add Colab outputs from {timestamp}"
!git commit -m "{commit_message}"

print(f"\n✅ Changes committed to branch: {branch_name}")

# Check if GITHUB_PAT is configured
github_pat = os.environ.get('GITHUB_PAT', '')

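if github_pat:
    import subprocess  # local import keeps this branch self-contained

    print("\n📤 Pushing to GitHub using configured PAT...")
    # Use the PAT from environment; check the exit code before reporting success
    push_url = f"https://{github_pat}@github.com/yaniber/MFEGSN.git"
    push = subprocess.run(
        ["git", "push", push_url, branch_name], capture_output=True, text=True
    )
    # Hide the token in case git echoes the URL in its output
    print((push.stdout + push.stderr).replace(github_pat, "***"))
    if push.returncode == 0:
        print("\n✅ Successfully pushed to GitHub!")
        print("\n📝 Create a Pull Request at:")
        print(f"   https://github.com/yaniber/MFEGSN/compare/{branch_name}")
    else:
        print("\n❌ Push failed. Check the token's scopes and your access to the repository.")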
else:
    print("\n⚠️  GitHub PAT not configured. To push, you have two options:")
    print("\n📝 Option 1: Configure PAT in the API keys cell above, then re-run this cell")
    print("\n📝 Option 2: Manual push with token:")
    print("\n1. Generate a token at: https://github.com/settings/tokens")
    print(f"2. Run: !git push https://YOUR_TOKEN@github.com/yaniber/MFEGSN.git {branch_name}")
    print("3. Create a Pull Request on GitHub")


### Option 3: Download Outputs Locally

In [None]:
from google.colab import files
import shutil
from datetime import datetime
from pathlib import Path

# Create a zip file of the markdown outputs
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
archive_name = f"mfegsn_outputs_{timestamp}"

print("Creating archive...")
shutil.make_archive(archive_name, 'zip', '.', 'markdown_outputs')

print(f"Downloading {archive_name}.zip...")
files.download(f"{archive_name}.zip")

print("\n✅ Download complete!")
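
The cell above archives only `markdown_outputs`. To bundle several output directories into one download, a sketch using the standard `zipfile` module follows. `zip_directories` is a hypothetical helper; directories that do not exist are skipped.

```python
import zipfile
from pathlib import Path


def zip_directories(archive_path, directories) -> int:
    """Write every file under the given directories into one zip; return the file count."""
    count = 0
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for directory in directories:
            root = Path(directory)
            if not root.is_dir():
                continue  # skip outputs that were never produced
            for file_path in sorted(root.rglob("*")):
                if file_path.is_file():
                    # Store paths relative to the directory's parent,
                    # e.g. "markdown_outputs/doc.md"
                    arcname = file_path.relative_to(root.parent).as_posix()
                    archive.write(file_path, arcname)
                    count += 1
    return count
```

In the notebook, `zip_directories(f"{archive_name}.zip", ["markdown_outputs", "chroma_db"])` could replace the `shutil.make_archive` call before `files.download`.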

## 📚 Additional Resources

- [GitHub Repository](https://github.com/yaniber/MFEGSN)
- [Full Documentation](https://github.com/yaniber/MFEGSN/blob/main/README.md)
- [Docker Setup Guide](https://github.com/yaniber/MFEGSN/blob/main/DOCKER.md)

## 🆘 Troubleshooting

### Common Issues

1. **Out of Memory**: Try processing fewer PDFs at once
2. **Google Drive Access**: Make sure to run the "Mount Google Drive" cell first
3. **PDF Processing Errors**: Some PDFs may have complex formatting that's difficult to extract
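
For the out-of-memory case, processing fewer PDFs at once can be done by splitting the file list into batches. The following is a minimal sketch; `batched` is a hypothetical helper, and a batch size of 5 is an arbitrary starting point.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

In the processing cell, `for batch in batched(pdf_files, 5):` could wrap the existing loop, with `gc.collect()` called between batches to release memory sooner.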

### Need Help?

Open an issue on GitHub: https://github.com/yaniber/MFEGSN/issues