# 📚 GraphRecall Book Ingestion Pipeline

**All-in-one notebook**: PDF → Marker OCR → Chunking → API Ingestion

This notebook is a **backdoor** into GraphRecall. It:
1. 📄 Converts the PDF to Markdown using Marker (GPU-accelerated)
2. ✂️ Chunks the Markdown with an image-aware, heading-preserving chunker
3. 🧠 Sends the content to your GraphRecall backend API for full processing
4. 📊 Lets the backend extract concepts, build the knowledge graph, and generate flashcards & quizzes

**Requirements:**
- Colab GPU runtime (T4 or better)
- Your GraphRecall backend URL and auth token

## Step 1: Install Dependencies

In [None]:
!pip install marker-pdf requests -q
print("✅ Dependencies installed!")

## Step 2: Configuration

Set your GraphRecall backend URL and authentication details.
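
The cell below builds request URLs as `f"{BACKEND_URL}/api/..."` and sends the token as a Bearer header, so a trailing slash or an empty token tends to fail in confusing ways. A minimal sketch of a pre-flight check follows; `normalize_backend_url` and `build_headers` are hypothetical helper names for illustration, not part of GraphRecall.

```python
def normalize_backend_url(url: str) -> str:
    """Strip whitespace and trailing slashes so f"{url}/api/..." has one slash."""
    cleaned = url.strip().rstrip("/")
    if not cleaned.startswith(("http://", "https://")):
        raise ValueError(f"Backend URL must start with http:// or https://, got {url!r}")
    return cleaned


def build_headers(token: str) -> dict:
    """Build the same two headers the notebook uses, refusing an empty token."""
    token = token.strip()
    if not token:
        raise ValueError("AUTH_TOKEN is empty; paste your access token first")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


# Hypothetical values, for illustration only
print(normalize_backend_url("https://your-graphrecall-backend.com/"))
print(build_headers("example-token")["Authorization"])
```

Failing early here is a design choice: an empty token would otherwise only show up as a 401 from the health check.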

In [None]:
# ============================================================
# ⚙️ CONFIGURATION - Edit these values
# ============================================================

# Your GraphRecall backend URL (no trailing slash)
BACKEND_URL = "https://your-graphrecall-backend.com"  # ⬅️ EDIT THIS

# Auth: paste your Google OAuth access token here.
# Get it from: browser DevTools > Application > Cookies > access_token
AUTH_TOKEN = ""  # ⬅️ PASTE YOUR TOKEN

# Book metadata
BOOK_TITLE = ""  # ⬅️ e.g., "Introduction to Machine Learning"

# Chunking config
CHUNK_SIZE = 1400  # Max chars per chunk
OVERLAP_RATIO = 0.15  # 15% overlap between chunks

# Processing options
SKIP_REVIEW = True  # Auto-approve concepts (recommended for books)

# ============================================================

import requests

HEADERS = {
    "Authorization": f"Bearer {AUTH_TOKEN}",
    "Content-Type": "application/json",
}

# Verify connection
try:
    resp = requests.get(f"{BACKEND_URL}/api/v2/health", headers=HEADERS, timeout=10)
    if resp.status_code == 200:
        print(f"✅ Connected to GraphRecall at {BACKEND_URL}")
    else:
        print(f"⚠️ Backend returned status {resp.status_code}")
        print(f"   Response: {resp.text[:200]}")
except Exception as e:
    print(f"❌ Cannot reach backend: {e}")
    print("   Make sure your backend is running and the URL is correct.")

## Step 3: Upload PDF & Run Marker OCR

In [None]:
import torch

if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ No GPU. Go to Runtime > Change runtime type > GPU")

In [None]:
# Upload PDF
UPLOAD_METHOD = "direct"  # Change to "drive" for Google Drive

if UPLOAD_METHOD == "direct":
    from google.colab import files
    print("📁 Select your PDF file...")
    uploaded = files.upload()
    # Guard against a cancelled upload, which would otherwise raise IndexError
    if not uploaded:
        raise RuntimeError("No file uploaded; re-run this cell and select a PDF.")
    PDF_PATH = next(iter(uploaded))
    if not BOOK_TITLE:
        BOOK_TITLE = PDF_PATH.rsplit('.', 1)[0]
    print(f"✅ Uploaded: {PDF_PATH}")
else:
    from google.colab import drive
    drive.mount('/content/drive')
    PDF_PATH = "/content/drive/MyDrive/Your_Book.pdf"  # ⬅️ EDIT THIS
    if not BOOK_TITLE:
        BOOK_TITLE = PDF_PATH.rsplit('/', 1)[-1].rsplit('.', 1)[0]

In [None]:
import time
from pathlib import Path

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

pdf_path = Path(PDF_PATH)
print(f"📖 Processing: {pdf_path.name} ({pdf_path.stat().st_size / 1024 / 1024:.1f} MB)")
print("-" * 50)

start_time = time.time()

print("🔧 Loading OCR models...")
model_dict = create_model_dict()
converter = PdfConverter(artifact_dict=model_dict)

print("📝 Extracting text and images...")
rendered = converter(str(pdf_path))
markdown_text, _, images = text_from_rendered(rendered)

elapsed = time.time() - start_time
print(f"✅ OCR complete in {elapsed:.1f}s")
print(f"   Text: {len(markdown_text):,} chars | Images: {len(images) if images else 0}")

# Save images locally
output_dir = Path(f"/content/{pdf_path.stem}_output")
output_dir.mkdir(exist_ok=True)
images_dir = output_dir / "images"
images_dir.mkdir(exist_ok=True)

if images:
    for img_name, img_data in images.items():
        img_path = images_dir / img_name
        if hasattr(img_data, 'save'):
            img_data.save(str(img_path))
        elif isinstance(img_data, bytes):
            img_path.write_bytes(img_data)
    print(f"   Saved {len(images)} images to {images_dir}")

# Save markdown
md_path = output_dir / f"{pdf_path.stem}.md"
md_path.write_text(markdown_text, encoding="utf-8")
print(f"   Saved markdown to {md_path}")

## Step 4: Chunk the Book

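Uses a copy of GraphRecall's image-aware `BookChunker`, which detects figures and their captions, records the heading hierarchy for each chunk, and carries trailing paragraphs into the next chunk as overlap.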
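
The overlap works on whole text units rather than raw characters: when a chunk is flushed, trailing units are carried into the next chunk as long as they fit within `int(max_chars * overlap_ratio)` characters, and the carry stops at the first figure it meets. A standalone sketch of that rule follows; `carry_over` is a hypothetical name that mirrors the logic of `flush_chunk` in the cell below, and the sample units are made up.

```python
def carry_over(units, max_chars=1400, overlap_ratio=0.15):
    """Return the trailing units that would be repeated at the start of the next chunk."""
    budget = int(max_chars * overlap_ratio)  # overlap budget in characters
    carried, used = [], 0
    for unit in reversed(units):
        size = len(unit["text"])
        if used + size > budget:
            break  # this unit would exceed the overlap budget
        if unit["type"] == "figure":
            break  # figures are never repeated across chunks
        carried.insert(0, unit)
        used += size + 2  # +2 accounts for the "\n\n" joiner between units
    return carried


# Invented units, for illustration only
units = [
    {"type": "text", "text": "A" * 500},
    {"type": "figure", "text": "[Figure] Loss curve"},
    {"type": "text", "text": "B" * 120},
    {"type": "text", "text": "C" * 80},
]
print([u["text"][0] for u in carry_over(units)])
```

With the default settings the two short trailing paragraphs fit in the budget, so both are repeated in the next chunk, while the figure and the long paragraph before it are not.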

In [None]:
import re
from dataclasses import dataclass, field
from typing import List, Optional

# ============================================================
# BookChunker (copied from backend/services/book_chunker.py)
# ============================================================

IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\((?P<path>[^)]+)\)")
CAPTION_PATTERN = re.compile(r"^(Figure|Fig\.?|FIGURE)\s+[\w\.\-]+[:\.']?\s*(?P<caption>.+)$")

@dataclass
class ImageInfo:
    filename: str
    caption: Optional[str]
    page: Optional[int]
    url: Optional[str] = None

@dataclass
class Chunk:
    index: int
    text: str
    images: List[ImageInfo] = field(default_factory=list)
    headings: List[str] = field(default_factory=list)

class BookChunker:
    def __init__(self, max_chars=1400, overlap_ratio=0.15):
        self.max_chars = max_chars
        self.overlap_ratio = overlap_ratio

    def chunk_markdown(self, md_path, images_dir):
        lines = md_path.read_text(encoding="utf-8").splitlines()
        return self._chunk_lines(lines, images_dir)

    def _chunk_lines(self, lines, images_dir):
        units = []
        current_para = []
        heading_stack = []

        def flush_para():
            if current_para:
                text = "\n".join(current_para).strip()
                if text:
                    units.append({"type": "text", "text": text, "headings": heading_stack.copy()})
                current_para.clear()

        for idx, line in enumerate(lines):
            stripped = line.strip()
            if stripped.startswith("#"):
                flush_para()
                level = len(stripped) - len(stripped.lstrip("#"))
                heading_text = stripped.lstrip("#").strip()
                if level <= len(heading_stack):
                    heading_stack[:] = heading_stack[:level - 1]
                heading_stack.append(heading_text)
                units.append({"type": "text", "text": heading_text, "headings": heading_stack.copy()})
                continue

            image_match = IMAGE_PATTERN.search(stripped)
            if image_match:
                flush_para()
                raw_path = image_match.group("path").strip()
                normalized_name, page_num = self._normalize_filename(raw_path, images_dir)
                caption = self._find_caption(lines, idx)
                image_info = ImageInfo(filename=normalized_name, caption=caption, page=page_num)
                placeholder = caption or f"Image {normalized_name}"
                units.append({"type": "figure", "text": f"[Figure] {placeholder}", "images": [image_info], "headings": heading_stack.copy()})
                continue

            if stripped == "":
                flush_para()
            else:
                current_para.append(stripped)
        flush_para()

        chunks = []
        buf_units, buf_images, buf_headings = [], [], []
        current_len = 0

        def flush_chunk():
            nonlocal buf_units, buf_images, buf_headings, current_len
            if not buf_units: return
            text = "\n\n".join(u["text"] for u in buf_units).strip()
            chunks.append(Chunk(index=len(chunks), text=text, images=list(buf_images), headings=list(buf_headings)))
            overlap_chars = int(self.max_chars * self.overlap_ratio)
            carry_units, carry_len = [], 0
            for u in reversed(buf_units):
                u_len = len(u.get("text", ""))
                if carry_len + u_len > overlap_chars: break
                if u.get("type") == "figure": break
                carry_units.insert(0, u)
                carry_len += u_len + 2
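            buf_units[:] = carry_units  # overlap units seed the next chunk
            buf_images[:] = []  # images are not repeated across chunks
            # buf_headings is left unchanged so carried text keeps its heading context
            current_len = carry_len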

        for unit in units:
            unit_text = unit.get("text", "").strip()
            if not unit_text: continue
            unit_len = len(unit_text)
            if current_len + unit_len > self.max_chars and buf_units:
                flush_chunk()
            buf_units.append(unit)
            buf_images.extend(unit.get("images", []))
            if unit.get("headings"): buf_headings[:] = unit["headings"]
            current_len += unit_len + 2
        flush_chunk()
        return chunks

    def _normalize_filename(self, raw_path, images_dir):
        raw_name = Path(raw_path).name
        stem = Path(raw_name).stem
        for ext in (".png", ".jpeg", ".jpg"):
            candidate = images_dir / f"{stem}{ext}"
            if candidate.exists():
                return candidate.name, self._extract_page(stem)
        return raw_name, self._extract_page(stem)

    def _extract_page(self, stem):
        match = re.search(r"_page_(\d+)_", stem)
        return int(match.group(1)) if match else None

    def _find_caption(self, lines, idx):
        for offset in [-2, -1, 1, 2]:
            pos = idx + offset
            if 0 <= pos < len(lines):
                match = CAPTION_PATTERN.match(lines[pos].strip())
                if match: return match.group("caption").strip()
        return None

# Run chunker
chunker = BookChunker(max_chars=CHUNK_SIZE, overlap_ratio=OVERLAP_RATIO)
chunks = chunker.chunk_markdown(md_path, images_dir)

total_images = sum(len(c.images) for c in chunks)
print(f"✅ Chunked into {len(chunks)} chunks")
print(f"   Avg chunk size: {sum(len(c.text) for c in chunks) // max(len(chunks), 1)} chars")
print(f"   Chunks with images: {sum(1 for c in chunks if c.images)}")
print(f"   Total image references: {total_images}")

# Preview first chunk
if chunks:
    print(f"\n--- Chunk 0 Preview ---")
    print(chunks[0].text[:300])
    if chunks[0].images:
        print(f"  Images: {[i.filename for i in chunks[0].images]}")
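
To get a feel for what the chunker's regular expressions match, here is a small self-contained check. The image and caption patterns are copied from the chunker cell above; the sample Markdown lines and the `describe_image` helper are invented for illustration and only approximate what Marker emits.

```python
import re

IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\((?P<path>[^)]+)\)")
CAPTION_PATTERN = re.compile(r"^(Figure|Fig\.?|FIGURE)\s+[\w\.\-]+[:\.']?\s*(?P<caption>.+)$")
PAGE_PATTERN = re.compile(r"_page_(\d+)_")


def describe_image(lines, idx):
    """Return (path, page, caption) for the image on lines[idx], or None."""
    image = IMAGE_PATTERN.search(lines[idx])
    if image is None:
        return None
    path = image.group("path").strip()
    page = PAGE_PATTERN.search(path)
    caption = None
    # Same search window as _find_caption: two lines before, two lines after
    for offset in (-2, -1, 1, 2):
        pos = idx + offset
        if 0 <= pos < len(lines):
            match = CAPTION_PATTERN.match(lines[pos].strip())
            if match:
                caption = match.group("caption").strip()
                break
    return path, (int(page.group(1)) if page else None), caption


# Invented sample, for illustration only
sample = [
    "Gradient descent follows the slope downhill.",
    "",
    "![](_page_12_Figure_3.jpeg)",
    "Figure 2.1: Loss surface with a single minimum",
]
print(describe_image(sample, 2))
```

Captions that do not start with "Figure", "Fig." or "FIGURE" are not picked up, in which case the chunker falls back to an `Image <filename>` placeholder.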

## Step 5: Send to GraphRecall Backend

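Sends the book through the `/api/v2/ingest` endpoint: as a single request when the Markdown is at most 50,000 characters, otherwise in batches of 15 chunks.
The backend then embeds the text, extracts concepts, builds the graph, and generates flashcards & quizzes.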
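
For larger books, the cell below groups chunks into fixed-size batches and posts each batch as its own part. The grouping and the ceiling-division batch count can be sketched without any network calls; `build_batch_payloads` is a hypothetical helper, while the payload keys are the ones the cell below sends.

```python
def build_batch_payloads(chunk_texts, title, batch_size=15, skip_review=True):
    """Group chunk texts into batches and build one ingest payload per batch."""
    payloads = []
    for start in range(0, len(chunk_texts), batch_size):
        batch = chunk_texts[start:start + batch_size]
        part = start // batch_size + 1  # 1-based part number
        payloads.append({
            "content": "\n\n".join(batch),
            "title": f"{title} (Part {part})",
            "skip_review": skip_review,
            "resource_type": "book",
        })
    return payloads


# Invented chunk texts, for illustration only
texts = [f"chunk {i}" for i in range(32)]
payloads = build_batch_payloads(texts, "Sample Book")
total_batches = (len(texts) + 15 - 1) // 15  # ceiling division
print(total_batches, [p["title"] for p in payloads])
```

Adjacent chunks share their overlap paragraphs, so the joined `content` of a batch repeats that text. Whether that matters depends on how the backend re-chunks, which this notebook does not show.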

In [None]:
import time

# Small books are sent as the full Marker markdown in one request; the backend's
# BookChunker re-chunks it. Larger books fall back to the local chunks below.
full_markdown = markdown_text

print(f"🚀 Sending '{BOOK_TITLE}' to GraphRecall...")
print(f"   Content length: {len(full_markdown):,} chars")
print(f"   Backend: {BACKEND_URL}")
print("-" * 50)

# Split into manageable batches if very large (>50k chars)
MAX_BATCH_SIZE = 50000

if len(full_markdown) <= MAX_BATCH_SIZE:
    # Single ingestion call
    payload = {
        "content": full_markdown,
        "title": BOOK_TITLE,
        "skip_review": SKIP_REVIEW,
        "resource_type": "book",
    }
    try:
        resp = requests.post(
            f"{BACKEND_URL}/api/v2/ingest",
            headers=HEADERS,
            json=payload,
            timeout=300,
        )
    except requests.RequestException as e:
        # Report network errors and timeouts the same way the batch path does
        print(f"❌ Error: {e}")
    else:
        if resp.status_code == 200:
            result = resp.json()
            print("✅ Ingestion complete!")
            print(f"   Note ID: {result.get('note_id')}")
            print(f"   Concepts: {len(result.get('concepts', []))}")
            print(f"   Flashcards: {len(result.get('flashcard_ids', []))}")
            print(f"   Status: {result.get('status')}")
        else:
            print(f"❌ Failed: {resp.status_code}")
            print(resp.text[:500])
else:
    # Batch ingestion for large books
    # Split by chunks and send in batches
    batch_size = 15  # chunks per batch
    total_concepts = 0
    total_flashcards = 0
    note_ids = []

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        batch_text = "\n\n".join(c.text for c in batch)
        batch_num = i // batch_size + 1
        total_batches = (len(chunks) + batch_size - 1) // batch_size

        print(f"  Batch {batch_num}/{total_batches} ({len(batch)} chunks, {len(batch_text):,} chars)...")

        payload = {
            "content": batch_text,
            "title": f"{BOOK_TITLE} (Part {batch_num})",
            "skip_review": SKIP_REVIEW,
            "resource_type": "book",
        }
        try:
            resp = requests.post(
                f"{BACKEND_URL}/api/v2/ingest",
                headers=HEADERS,
                json=payload,
                timeout=300,
            )
            if resp.status_code == 200:
                result = resp.json()
                nc = len(result.get('concepts', []))
                nf = len(result.get('flashcard_ids', []))
                total_concepts += nc
                total_flashcards += nf
                note_ids.append(result.get('note_id'))
                print(f"    ✅ +{nc} concepts, +{nf} flashcards")
            else:
                print(f"    ❌ Failed: {resp.status_code} - {resp.text[:200]}")
        except Exception as e:
            print(f"    ❌ Error: {e}")

        time.sleep(2)  # Rate limiting

    print(f"\n✅ Book ingestion complete!")
    print(f"   Total concepts: {total_concepts}")
    print(f"   Total flashcards: {total_flashcards}")
    print(f"   Note parts: {len(note_ids)}")

## Step 6: Trigger Community Recomputation

After ingesting a book, recompute communities to include new concepts in the knowledge graph hierarchy.

In [None]:
print("🔄 Recomputing communities...")
try:
    resp = requests.post(
        f"{BACKEND_URL}/api/graph3d/communities/recompute",
        headers=HEADERS,
        timeout=120,
    )
    if resp.status_code == 200:
        result = resp.json()
        print(f"✅ Communities recomputed: {result.get('num_communities', '?')} communities")
    else:
        print(f"⚠️ Community recompute returned {resp.status_code}")
except Exception as e:
    print(f"⚠️ Failed: {e} (not critical, communities will be computed on next app load)")

## 🎉 Done!

Your book is now in GraphRecall. Open the app to:
- 🌐 See new concepts in the Knowledge Graph
- 📚 Browse the book in Library
- 🧠 Study with auto-generated flashcards & quizzes
- 💬 Ask the AI assistant about the book's content

In [None]:
# Optional: Download the markdown + images locally too
import shutil
from google.colab import files

zip_path = f"/content/{pdf_path.stem}_extracted"
shutil.make_archive(zip_path, 'zip', output_dir)
print(f"📦 Download backup: {zip_path}.zip")
files.download(f"{zip_path}.zip")