# Getting Started with DocRag

Welcome to DocRag! This tutorial walks through the basics of the DocRag package: extracting and analyzing content from PDF documents with AI-powered document layout analysis.

## What is DocRag?

DocRag is a Python package that uses computer vision and large language models to parse PDF documents. It can:

- **Extract document structure**: Automatically identify figures, tables, formulas, text blocks, and titles
- **Generate markdown**: Convert PDF content to clean, structured markdown format
- **Parse content**: Use AI models to understand and describe visual elements like figures and tables
- **Maintain layout**: Preserve the logical reading order of document elements

## How it Works

The DocRag workflow consists of three main steps:

1. **PDF to Images**: Convert PDF pages to high-resolution images
2. **Layout Analysis**: Use YOLO-based computer vision models to detect and classify document elements (figures, tables, text, etc.)
3. **Content Parsing**: Use large language models (like Google Gemini) to extract and describe the content of each element

## The Document Class Architecture

The core of DocRag is built around two main classes:

- **`Document`**: Represents an entire PDF document containing multiple pages
- **`Page`**: Represents a single page with its detected elements

Each page contains multiple **`Element`** objects that represent different types of content:
- **Figures**: Images, charts, diagrams
- **Tables**: Structured data tables  
- **Formulas**: Mathematical equations
- **Text**: Regular paragraph text
- **Titles**: Section headers and titles
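
The relationship between these classes can be pictured with a simplified sketch. The dataclasses below are illustrative stand-ins written for this tutorial, not DocRag's actual implementation. The attribute names `element_type`, `confidence`, `bbox`, and `markdown` mirror those used later in this tutorial; the class names, the string values for `element_type`, the bounding box layout, and the `of_type` helper are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchElement:
    """Hypothetical stand-in for one detected element on a page."""
    element_type: str  # assumed string labels such as "figure" or "table"
    confidence: float  # detection score from the layout model
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) layout
    markdown: Optional[str] = None  # parsed content, if available


@dataclass
class SketchPage:
    """Hypothetical stand-in for a page: an ordered list of elements."""
    elements: List[SketchElement] = field(default_factory=list)

    def of_type(self, element_type: str) -> List[SketchElement]:
        # Select the elements on this page with a matching type label
        return [e for e in self.elements if e.element_type == element_type]


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document: an ordered list of pages."""
    pages: List[SketchPage] = field(default_factory=list)

    def __len__(self) -> int:
        return len(self.pages)

    def of_type(self, element_type: str) -> List[SketchElement]:
        # Flatten matching elements across pages, keeping page order
        return [e for p in self.pages for e in p.of_type(element_type)]
```

In this sketch a document-level query is a flattened view over the per-page lists, which is one plausible way to picture how document-wide and per-page collections relate.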

## Setting Up Google Gemini API Key

DocRag uses Google's Gemini AI model for content parsing. To use this feature, you'll need a free API key:

### Step 1: Get Your API Key

1. Go to [Google AI Studio](https://aistudio.google.com/app/prompts/new_chat)
2. Click the **Get API key** button
3. Sign in with your Google account if prompted
4. Copy your API key

### Step 2: Set Environment Variable

Once you have your API key, you need to set it as an environment variable. You can do this in several ways:

**Option A: Set in Jupyter Notebook (temporary)**
```python
import os
os.environ['GEMINI_API_KEY'] = 'your-api-key-here'
```

**Option B: Set in your system permanently**
- **Windows**: Add `GEMINI_API_KEY=your-api-key-here` to your environment variables
- **Mac/Linux**: Add `export GEMINI_API_KEY=your-api-key-here` to your `.bashrc` or `.zshrc`

**Option C: Use a .env file**
Create a `.env` file in your project directory:
```
GEMINI_API_KEY=your-api-key-here
```

⚠️ **Important**: Never commit your API key to version control. Keep it secure and private!
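
Whichever option you choose, it can help to confirm that the key is visible to Python before starting a long processing run. The helper below is a minimal sketch using only the standard library; `require_api_key` is a hypothetical name introduced here and is not part of DocRag.

```python
import os


def require_api_key(name: str = "GEMINI_API_KEY") -> str:
    """Return the key stored in the environment variable, or fail early."""
    key = os.environ.get(name, "").strip()
    # Treat the tutorial's placeholder value as "not set"
    if not key or key == "your-api-key-here":
        raise RuntimeError(
            f"{name} is not set. Export it or set os.environ[{name!r}] first."
        )
    return key
```

Calling `require_api_key()` at the top of a notebook surfaces a missing key immediately instead of partway through document processing.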


## Installation and Setup

First, let's import the necessary libraries and set up our environment:


In [None]:
import os
from pathlib import Path
from IPython.display import display, Markdown
from rich.console import Console
from rich.markdown import Markdown as RichMarkdown

# Import the main DocRag classes
from docrag import Document
from docrag.core.page import Page

# Set up your Google Gemini API key here (replace with your actual key)
# os.environ['GEMINI_API_KEY'] = 'your-api-key-here'

console = Console()
print("✅ DocRag imported successfully!")

In [None]:
# Set up directories
ROOT_DIR = Path(os.path.abspath("."))
DATA_DIR = ROOT_DIR / "data"

# Model weights will be automatically downloaded if not present
MODEL_WEIGHTS = "doclayout_yolo_docstructbench_imgsz1024.pt"

# For this tutorial, we'll use a sample PDF path
# Replace this with the path to your PDF file
PDF_PATH = DATA_DIR / "sample_document.pdf"  # Update this to your PDF file

print(f"Root directory: {ROOT_DIR}")
print(f"Data directory: {DATA_DIR}")
print(f"PDF path: {PDF_PATH}")

# Check if data directory exists, create if not
DATA_DIR.mkdir(exist_ok=True)
print(f"✅ Setup complete!")

## Creating a Document from PDF

The main entry point for DocRag is the `Document.from_pdf()` method. This method:

1. **Converts** the PDF to high-resolution images (one per page)
2. **Analyzes** each page using computer vision to detect elements 
3. **Parses** the content using AI models to extract text and descriptions
4. **Returns** a Document object containing all the processed information

### Key Parameters:

- **`pdf_path`**: Path to your PDF file
- **`dpi`**: Resolution in dots per inch for PDF-to-image conversion (default: 300; higher values give sharper images but slower processing)
- **`model_weights`**: YOLO model for layout detection (auto-downloaded if not present)
- **`verbose`**: Whether to show progress information
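
To get a feel for the `dpi` trade-off, the sketch below computes the pixel dimensions of a rendered page. It assumes the renderer maps inches to pixels directly (pixels = inches × DPI), which is the usual meaning of DPI; `rendered_size` is a hypothetical helper, not a DocRag function.

```python
from typing import Tuple


def rendered_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter is 8.5 x 11 inches
for dpi in (100, 200, 300):
    w, h = rendered_size(8.5, 11, dpi)
    print(f"{dpi} dpi -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
# 100 dpi -> 850 x 1100 px (0.9 megapixels)
# 200 dpi -> 1700 x 2200 px (3.7 megapixels)
# 300 dpi -> 2550 x 3300 px (8.4 megapixels)
```

Pixel count grows with the square of the DPI, so a 300 DPI page has 2.25 times as many pixels as the same page at 200 DPI.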

Let's create a Document from a PDF file:

In [None]:
# Create a Document from PDF
# Note: Make sure you have set GEMINI_API_KEY environment variable
# and replace PDF_PATH with the path to your actual PDF file
if not PDF_PATH.exists():
    # Stop early instead of passing a missing file to Document.from_pdf
    raise FileNotFoundError(f"❌ PDF file does not exist: {PDF_PATH}")
print(f"📄 PDF file exists: {PDF_PATH}")

print(f"📄 Processing PDF: {PDF_PATH}")
print("This may take a few minutes depending on the document size...")

# Create the document
doc = Document.from_pdf(
    pdf_path=PDF_PATH,
    model_weights=MODEL_WEIGHTS,
    dpi=200,  # Lower DPI for faster processing in tutorial
    verbose=True
)

print(f"✅ Document created successfully!")
print(f"📊 Document contains {len(doc)} pages")


## Exploring Document Content

Once you have created a Document, you can access various types of information. The Document class provides several convenient properties and methods to explore the content:

### Document-Level Properties

- **`doc.pages`**: List of Page objects
- **`doc.figures`**: All figures from all pages
- **`doc.tables`**: All tables from all pages  
- **`doc.formulas`**: All mathematical formulas
- **`doc.text`**: All text blocks
- **`doc.titles`**: All title/header elements
- **`doc.elements`**: All detected elements
- **`doc.markdown`**: Complete document as markdown

Let's explore the content of our document:


In [None]:

print("📊 Document Overview:")
print(f"   Number of pages: {len(doc)}")
print(f"   Total figures: {len(doc.figures)}")
print(f"   Total tables: {len(doc.tables)}")
print(f"   Total formulas: {len(doc.formulas)}")
print(f"   Total text blocks: {len(doc.text)}")
print(f"   Total titles: {len(doc.titles)}")
print(f"   Total elements: {len(doc.elements)}")

print("\n" + "="*50)
print("📄 Page-by-Page Breakdown:")

for i, page in enumerate(doc.pages):
    print(f"\nPage {i+1}:")
    print(f"   Figures: {len(page.figures)}")
    print(f"   Tables: {len(page.tables)}")
    print(f"   Formulas: {len(page.formulas)}")
    print(f"   Text blocks: {len(page.text)}")
    print(f"   Titles: {len(page.titles)}")
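
The per-type counts above can also be computed in one pass by tallying on each element's type. The sketch below assumes every element exposes an `element_type` attribute, as the figure and table examples in this tutorial do; `count_by_type` is a hypothetical helper, and the stand-in objects only exist so the example runs without DocRag.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Any, Iterable


def count_by_type(elements: Iterable[Any]) -> Counter:
    """Tally elements by their element_type attribute."""
    return Counter(str(e.element_type) for e in elements)


# Stand-in objects; with a real document, pass doc.elements instead.
elements = [
    SimpleNamespace(element_type=t)
    for t in ("text", "figure", "text", "title", "text")
]
counts = count_by_type(elements)
print(counts.most_common())  # → [('text', 3), ('figure', 1), ('title', 1)]
```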


## Accessing Specific Content Types

### Working with Figures

Figures include images, charts, diagrams, and other visual elements. Each figure has:
- **Bounding box coordinates**
- **Confidence score** from the detection model  
- **Extracted image** as a PIL Image object
- **AI-generated description** and markdown representation

In [None]:
# Examine figures in the document

print(f"🖼️ Found {len(doc.figures)} figures in the document:")

for i, figure in enumerate(doc.figures[:3]):  # Show first 3 figures
    print(f"\nFigure {i+1}:")
    print(f"   Element type: {figure.element_type}")
    print(f"   Confidence: {figure.confidence:.2f}")
    print(f"   Bounding box: {figure.bbox}")
    print(f"   Image size: {figure.image.size}")
    
    # Display AI-generated description
    if figure.markdown:
        print(f"   AI Description: {figure.markdown[:100]}...")
    
    # Display the figure image
    print(f"   Displaying figure {i+1}:")
    display(figure.image)
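
Because each detection carries a confidence score, you may want to drop low-confidence detections before further processing. The sketch below relies only on the `confidence` attribute shown above; `filter_by_confidence` is a hypothetical helper, and the 0.5 default is an arbitrary illustration rather than a value recommended by DocRag.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def filter_by_confidence(elements: Iterable[Any], min_confidence: float = 0.5) -> List[Any]:
    """Keep elements whose detection confidence meets the threshold."""
    return [e for e in elements if e.confidence >= min_confidence]


# Stand-in objects so the sketch runs without DocRag; with a real document,
# pass doc.figures instead.
figures = [SimpleNamespace(confidence=c) for c in (0.95, 0.42, 0.71)]
confident = filter_by_confidence(figures, min_confidence=0.6)
print([f.confidence for f in confident])  # → [0.95, 0.71]
```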
        

### Working with Tables

Tables are automatically detected and their structure is analyzed. DocRag can both extract the table as an image and convert it to markdown format:


In [None]:
# Examine tables in the document

print(f"📊 Found {len(doc.tables)} tables in the document:")

for i, table in enumerate(doc.tables[:2]):  # Show first 2 tables
    print(f"\nTable {i+1}:")
    print(f"   Element type: {table.element_type}")
    print(f"   Confidence: {table.confidence:.2f}")
    print(f"   Bounding box: {table.bbox}")
    
    # Display table image
    print(f"   Displaying table {i+1} image:")
    display(table.image)
    
    # Display AI-extracted table markdown
    if table.markdown:
        print(f"   AI-extracted table structure:")
        display(Markdown(table.markdown))
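
If the extracted markdown is a standard pipe-delimited table, it can be turned into Python records for further analysis. The parser below is a minimal sketch under that assumption: it expects a header row followed by a separator row, does not handle escaped pipes or merged cells, and `parse_markdown_table` is a hypothetical helper, not part of DocRag. The sample table is made up for illustration.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Parse a simple pipe-delimited markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []

    def cells(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is assumed to be the |---|---| separator row and is skipped
    return [dict(zip(header, cells(line))) for line in lines[2:]]


sample = """
| Metric | Value |
|--------|-------|
| Pages  | 12    |
| Tables | 3     |
"""
rows = parse_markdown_table(sample)
print(rows)
# → [{'Metric': 'Pages', 'Value': '12'}, {'Metric': 'Tables', 'Value': '3'}]
```

With a real document, `parse_markdown_table(table.markdown)` would be the equivalent call, provided the model returned a pipe table.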
        


### Working with Text and Titles

Text blocks and titles are extracted and processed to provide clean, readable content:

In [None]:
# Examine text and titles

print(f"📝 Text Analysis:")
print(f"   Found {len(doc.titles)} titles")
print(f"   Found {len(doc.text)} text blocks")

# Show titles
if len(doc.titles) > 0:
    print(f"\n🏷️ Document Titles:")
    for i, title in enumerate(doc.titles[:5]):  # Show first 5 titles
        print(f"   Title {i+1}: {title.markdown[:100]}...")

# Show text blocks  
if len(doc.text) > 0:
    print(f"\n📄 Text Blocks (first 3):")
    for i, text in enumerate(doc.text[:3]):  # Show first 3 text blocks
        print(f"   Text {i+1}: {text.markdown[:150]}...")
        
# Show formulas if any
if len(doc.formulas) > 0:
    print(f"\n🧮 Mathematical Formulas:")
    for i, formula in enumerate(doc.formulas[:3]):
        print(f"   Formula {i+1}: {formula.markdown}")
else:
    print(f"\n🧮 No mathematical formulas found.")


## Converting to Markdown

A key feature of DocRag is converting the entire document to clean, structured markdown. The output preserves the document's logical structure and content, which makes it easy to work with programmatically.


In [None]:
# Generate markdown for the entire document
print("🔄 Converting document to markdown...")

# Get the markdown representation
markdown_content = doc.to_markdown()

print(f"✅ Markdown generated! ({len(markdown_content)} characters)")

# Show first 1000 characters as preview
print("\n📄 Markdown Preview (first 1000 characters):")
print("="*60)
print(markdown_content[:1000])
print("="*60)

# Save to file
output_path = DATA_DIR / "document_output.md"
doc.to_markdown(filepath=output_path)
print(f"\n💾 Full markdown saved to: {output_path}")

# Display rendered markdown preview
print("\n🎨 Rendered Markdown Preview:")
display(Markdown(markdown_content[:2000]))  # Show first 2000 characters rendered
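
Once the document is a markdown string, ordinary string processing applies. As one example, the sketch below splits markdown into sections at `#`-style headings. It assumes the generated markdown uses such headings, which this tutorial does not confirm; `split_by_headings` is a hypothetical helper, and it does not account for `#` lines inside fenced code blocks.

```python
import re
from typing import List, Tuple


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) pairs at ATX headings (#, ##, ...)."""
    sections: List[Tuple[str, str]] = []
    heading, body = "", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # A new heading closes the section collected so far
            if heading or body:
                sections.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if heading or body:
        sections.append((heading, "\n".join(body).strip()))
    return sections


sample = "# Intro\nFirst paragraph.\n\n## Methods\nSecond paragraph."
for title, text in split_by_headings(sample):
    print(f"{title!r}: {text!r}")
# 'Intro': 'First paragraph.'
# 'Methods': 'Second paragraph.'
```

Text that appears before the first heading is returned with an empty heading string.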
    

## Working with Individual Pages

You can also work with individual pages to get more granular control over the content:


In [None]:
# Get the first page
first_page = doc[0]  # or doc.get_page(1) for 1-indexed access

print(f"📄 Page 1 Analysis:")
print(f"   Elements on page: {len(first_page.elements)}")
print(f"   Figures: {len(first_page.figures)}")
print(f"   Tables: {len(first_page.tables)}")
print(f"   Text blocks: {len(first_page.text)}")
print(f"   Titles: {len(first_page.titles)}")

# Display the original page image
print(f"\n🖼️ Original Page Image:")
display(first_page.image)

# Display the annotated page (with bounding boxes)
if first_page.annotated_image is not None:
    print(f"\n🎯 Annotated Page (with detected elements):")
    display(first_page.annotated_image)

# Get page-specific content
page_markdown = first_page.to_markdown()
print(f"\n📝 Page Markdown Length: {len(page_markdown)} characters")

# Show specific content types from this page
if len(first_page.figures) > 0:
    print(f"\n🖼️ First figure from page 1:")
    display(first_page.figures[0].image)
        

## Saving and Exporting

DocRag provides multiple ways to save and export your processed document data:

In [None]:
# Save and export document data
output_dir = DATA_DIR / "document_export"
output_dir.mkdir(exist_ok=True)

print("💾 Saving document data in multiple formats...")

# 1. Save as JSON
json_path = output_dir / "document.json"
doc.to_json(json_path, include_images=False)
print(f"✅ JSON saved: {json_path}")

# 2. Save as Markdown
md_path = output_dir / "document.md"
doc.to_markdown(md_path)
print(f"✅ Markdown saved: {md_path}")

# 3. Save as PyArrow/Parquet (for data science workflows)
parquet_path = output_dir / "document.parquet"
doc.to_pyarrow(parquet_path)
print(f"✅ Parquet saved: {parquet_path}")

# 4. Save complete document data (includes images)
full_export_dir = output_dir / "full_export"
doc.save(full_export_dir, save_json=True, include_images=True)
print(f"✅ Full export saved: {full_export_dir}")

# 5. Get document as dictionary for programmatic use
doc_dict = doc.to_dict(include_images=False)
print(f"✅ Document dictionary created with {len(doc_dict['pages'])} pages")

print(f"\n📁 All exports saved to: {output_dir}")
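
An exported JSON file can be read back with the standard library for a quick sanity check. The sketch below assumes the JSON written by `doc.to_json` has the same top-level `pages` key as the dictionary returned by `doc.to_dict`; that layout is an assumption, and `count_pages_in_export` is a hypothetical helper. A stand-in file is used so the example runs without DocRag.

```python
import json
import tempfile
from pathlib import Path


def count_pages_in_export(json_path: Path) -> int:
    """Load an exported JSON file and count entries under the 'pages' key."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return len(data.get("pages", []))


# Stand-in file; with a real export, pass the json_path written by
# doc.to_json instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "document.json"
    demo_path.write_text(json.dumps({"pages": [{}, {}, {}]}), encoding="utf-8")
    n_pages = count_pages_in_export(demo_path)

print(n_pages)  # → 3
```

Comparing this count against `len(doc)` is a simple way to confirm that the export captured every page.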