# 🐣 Document Intelligence with Docling: Unlocking Complex Academic Content

This notebook demonstrates **Document Intelligence** - the advanced capability to understand and process complex documents like research papers, academic materials, and structured content that traditional RAG systems struggle with.

**The Challenge:**
Imagine trying to build an educational AI assistant using only basic text extraction from research papers. You'd lose:
- **📊 Table data** with crucial research findings
- **🧮 Mathematical formulas** and scientific notation  
- **📈 Charts and figures** that provide key insights
- **🏛️ Document structure** like sections, references, and metadata
- **📝 Multi-column layouts** common in academic papers

**The Solution: Docling**
Docling is an advanced document processing toolkit that acts like a brilliant research assistant, understanding the **meaning and structure** of complex academic documents.

**What You'll Build:**
- **🔬 Intelligent Document Processor**: Extract rich content from complex PDFs
- **📚 Enhanced RAG System**: Query tables, formulas, and structured content  
- **🎯 Academic AI Assistant**: Answer questions using complete document understanding
- **⚡ Production Pipeline**: Handle real-world educational materials at scale

**Why This Matters:**
Traditional RAG systems often fail with academic content, missing critical information trapped in tables or losing context from complex layouts. Docling transforms these challenging documents into fully searchable, queryable knowledge.

Let's build document intelligence that truly understands academic content! 🚀

## 📦 Install Required Packages

Import the packages used for LlamaStack RAG and advanced document processing. The cell below only imports them, so they must already be installed in your environment.

**Note:** Docling processing can take 1-2 minutes for complex academic papers as it performs comprehensive analysis including layout detection, table extraction, and formula recognition.

In [2]:
# Core libraries for document intelligence and RAG
import uuid      # For generating unique vector database identifiers
import requests  # For HTTP communication with Docling service and document fetching
import base64    # For encoding binary data if needed (images, complex formats)
import json      # For handling Docling API responses and metadata
import os        # System utilities
import sys       # System path management
sys.path.append('..')  # Add parent directory for custom utilities

# LlamaStack client and RAG-specific classes
from llama_stack_client import LlamaStackClient  # Main interface for RAG operations
from llama_stack_client import RAGDocument  # Represents documents for RAG ingestion
from llama_stack_client.types.shared.content_delta import TextDelta, ToolCallDelta  # For streaming responses

# Display and utility imports
from src.utils import step_printer  # For progress tracking
from termcolor import cprint        # For colorized output

In [3]:
# === LlamaStack Connection Setup ===
# Connect to the LlamaStack server that coordinates document intelligence
base_url = "http://llama-stack-service:8321"

# Configure provider data (none needed for this demo)
provider_data = None

# Create the LlamaStack client for document intelligence and RAG
client = LlamaStackClient(
    base_url=base_url,
    provider_data=provider_data
)

print(f"Connected to LlamaStack server")

# === Model Configuration for Document Intelligence ===
# Configure the LLM that will reason about processed documents
model_id = "llama32"       # Llama 3.2 model for text generation
temperature = 0.0         # Deterministic responses for factual document queries  
max_tokens = 4096         # Upper bound on generated tokens (room for long, detailed answers)
stream = True             # Stream responses for better user experience

# Configure sampling strategy for consistent, factual responses
if temperature > 0.0:
    top_p = 0.95
    strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
else:
    strategy = {"type": "greedy"}  # Deterministic for factual document analysis

# Package parameters for LlamaStack inference API
sampling_params = {
    "strategy": strategy,
    "max_tokens": max_tokens,
}

# Display configuration
print(f"Model Configuration:")
print(f"  • Model: {model_id}")
print(f"  • Strategy: {strategy['type']}")  
print(f"  • Max Tokens: {max_tokens} (enhanced for complex documents)")
print(f"  • Stream: {stream}")

Connected to LlamaStack server
Model Configuration:
  • Model: llama32
  • Strategy: greedy
  • Max Tokens: 4096 (enhanced for complex documents)
  • Stream: True
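
The strategy selection above can be wrapped in a small helper so that other cells can switch between deterministic and sampled generation. This is a minimal sketch, not part of the LlamaStack client: the function name `build_sampling_params` is hypothetical, and the dictionary layout simply mirrors the one built in the cell above.

```python
def build_sampling_params(temperature: float, max_tokens: int, top_p: float = 0.95) -> dict:
    """Return a sampling-parameters dict shaped like the one built in the cell above."""
    if temperature > 0.0:
        # Nucleus sampling: some randomness, bounded by top_p
        strategy = {"type": "top_p", "temperature": temperature, "top_p": top_p}
    else:
        # Greedy decoding: always pick the most likely token
        strategy = {"type": "greedy"}
    return {"strategy": strategy, "max_tokens": max_tokens}


greedy_params = build_sampling_params(temperature=0.0, max_tokens=4096)
sampled_params = build_sampling_params(temperature=0.7, max_tokens=4096)
print(greedy_params["strategy"]["type"])
print(sampled_params["strategy"]["type"])
```

For factual document queries the greedy variant is usually the safer default, since repeated runs then give the same answer for the same retrieved context.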


## 🏗️ The Docling-LlamaStack Architecture

Docling integrates with LlamaStack to create an intelligent document processing pipeline that transforms complex academic materials into searchable, queryable knowledge.

### 🔍 What is Docling?

**Docling** is an advanced document processing toolkit that simplifies the handling of diverse document formats with a focus on intelligent PDF understanding. Think of it as a universal translator for documents - it can read, understand, and convert complex academic materials into formats that AI systems can work with effectively.

**Key Features & Capabilities:**
- **📄 Multi-Format Support**: PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, PNG, TIFF, JPEG
- **🧠 Intelligent PDF Understanding**: Layout analysis, table structure, formula recognition, image classification
- **🔒 Enterprise-Ready**: Local execution, air-gapped support, extensive OCR, visual language models

### 🔧 The Three-Phase Docling Pipeline

Docling processes documents through three intelligent phases:

#### Phase 1: Intelligent Document Analysis
```
📄 PDF Input → 🔍 Layout Detection → 📋 Structure Analysis → 🧠 Content Extraction
```
- **Layout Detection**: Understands page structure, reading order, and multi-column layouts
- **Structure Analysis**: Identifies headers, paragraphs, lists, tables, and figures
- **Content Extraction**: Extracts text while preserving semantic meaning and relationships

#### Phase 2: Content Enhancement  
```
📝 Raw Text → 🏷️ Semantic Tagging → 📊 Table Extraction → 🖼️ Figure Processing
```
- **Semantic Tagging**: Identifies document sections, references, and metadata
- **Table Extraction**: Preserves complex table relationships and formatting
- **Figure Processing**: Handles mathematical equations, charts, and diagrams

#### Phase 3: RAG Integration
```
🔧 Intelligent Chunking → 🎯 Embedding Generation → 🗄️ Vector Storage → 🔍 LlamaStack RAG
```
- **Intelligent Chunking**: Splits documents into optimal pieces for retrieval
- **Embedding Generation**: Creates vector representations using sentence transformers
- **Vector Storage**: Stores embeddings in your Milvus database
- **LlamaStack RAG**: Enables semantic search and intelligent question answering
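
To make the chunking step in Phase 3 concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only: it counts whitespace-separated words rather than model tokens, the `overlap` parameter is a hypothetical addition, and the chunker LlamaStack actually applies during ingestion may behave differently.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> list:
    """Split text into chunks of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo_text = " ".join(str(i) for i in range(10))
demo_chunks = chunk_words(demo_text, chunk_size=4, overlap=1)
print(demo_chunks)
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from either side, at the cost of storing some text twice.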

### 🧠 Why Docling Matters for Educational RAG

Traditional document processing often fails with academic content. Consider a typical computer science research paper:
- **Complex layouts** with multiple columns and sections
- **Mathematical equations** that need special handling
- **Figures and tables** that provide crucial context
- **Reference lists** that need to be preserved and linked
- **Metadata** like authors, institutions, and publication dates

Without intelligent processing, a RAG system might miss important information trapped in tables, lose context from figures, or struggle with multi-column layouts. Docling solves these problems by understanding document structure and extracting content intelligently.

In [4]:
def docling_processing(url):
    """
    Process a document URL using the Docling service for intelligent content extraction.
    
    This function performs advanced document analysis including:
    - Layout detection and structure analysis
    - Table extraction with preserved formatting  
    - Mathematical formula recognition
    - Figure and chart processing
    - Multi-column layout understanding
    - Semantic content structuring
    
    Args:
        url (str): URL of the document to process (PDF, DOCX, etc.)
        
    Returns:
        str: Structured Markdown content with preserved document intelligence
        
    Note: Processing can take 1-2 minutes for complex academic documents
    """
    # === Docling Service Configuration ===
    # Connect to the deployed Docling service in the cluster
    api_address = "http://docling-v0-7-0-predictor.ai501.svc.cluster.local:5001"
    
    # Configure headers (no authentication needed for cluster-internal service)
    headers = {"Content-Type": "application/json"}
    
    print(f"🔗 Docling Service: {api_address}/v1alpha/convert/source")
    print(f"📄 Processing document: {url}")
    print(f"⏰ This may take 1-2 minutes for complex documents...")
    
    # === Document Processing Request ===
    # Configure Docling to extract maximum intelligence from the document
    payload = {
        "http_sources": [{"url": url}],              # Document source
        "options": {
            "to_formats": ["md"],                    # Output as structured Markdown
            "image_export_mode": "placeholder"      # Handle images appropriately
        },
    }
    
    try:
        # === Submit Processing Request ===
        # Send document to Docling for intelligent analysis
        response = requests.post(
            f"{api_address}/v1alpha/convert/source",
            json=payload,
            headers=headers,
            timeout=180  # 3-minute timeout for complex documents
        )
        
        # Check for successful processing
        response.raise_for_status()
        
        # === Extract Processed Content ===
        # Docling returns structured Markdown with preserved document intelligence
        result_data = response.json()
        md_content = result_data["document"]["md_content"]
        
        print(f"✅ Document processing complete!")
        print(f"📊 Processed content length: {len(md_content)} characters")
        
        return md_content
        
    except requests.exceptions.Timeout:
        print(f"⏰ Processing timeout - complex documents may need more time")
        raise
    except requests.exceptions.RequestException as e:
        print(f"❌ Docling processing failed: {e}")
        raise
    except KeyError as e:
        print(f"❌ Unexpected response format: {e}")
        raise
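
Because conversion is a long-running network call, transient failures are possible. The sketch below shows one way to retry any callable with exponential backoff. The helper `call_with_retries` and its parameters are hypothetical and not part of Docling or `requests`; the demo uses a stand-in function instead of the real service so that it runs anywhere.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=2.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in function that fails twice before succeeding.
_state = {"calls": 0}


def _flaky_convert():
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise ConnectionError("service not ready")
    return "# Converted Markdown"


recorded_delays = []  # capture the delays instead of actually sleeping
result = call_with_retries(_flaky_convert, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In this notebook it could be used as `call_with_retries(docling_processing, url, retry_on=(requests.exceptions.RequestException,))`, which narrows the retry to network-level errors.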

In [5]:
# === Select Complex Academic Document ===
# Choose a research paper with tables, formulas, and complex structure
# This arXiv paper contains the kind of complex content that showcases Docling's capabilities

# Option 1: Remote-sensing research paper (deep-learning canopy height mapping) with tables and technical content
url = "https://arxiv.org/pdf/2404.14661"

# Alternative papers for testing (comment/uncomment as needed):
# url = "https://arxiv.org/pdf/2006.07156"  # Machine Learning paper with mathematical content
# url = "https://raw.githubusercontent.com/rhoai-genaiops/deploy-lab/main/university-data/canopy-in-botany.pdf"  # Simpler PDF for comparison

print(f"🎯 Selected document: {url}")
print(f"📋 This paper likely contains tables, formulas, figures, and structured sections")
print(f"⚡ Starting intelligent document processing...")

# === Process Document with Docling Intelligence ===
# This will take 1-2 minutes as Docling performs comprehensive analysis
md_content = docling_processing(url)

print(f"\n🎉 Document intelligence processing complete!")
print(f"📊 Content preview (first 500 characters):")
print(f"{'='*60}")
print((md_content[:500] + "...") if len(md_content) > 500 else md_content)
print(f"{'='*60}")
print(f"📈 Total processed content: {len(md_content)} characters")
print(f"📝 Docling has extracted and structured the complete document content!")

🎯 Selected document: https://arxiv.org/pdf/2404.14661
📋 This paper likely contains tables, formulas, figures, and structured sections
⚡ Starting intelligent document processing...
🔗 Docling Service: http://docling-v0-7-0-predictor.ai501.svc.cluster.local:5001/v1alpha/convert/source
📄 Processing document: https://arxiv.org/pdf/2404.14661
⏰ This may take 1-2 minutes for complex documents...
✅ Document processing complete!
📊 Processed content length: 159318 characters

🎉 Document intelligence processing complete!
📊 Content preview (first 500 characters):
## Highlights

## First Mapping the Canopy Height of Primeval Forests in the Tallest Tree Area of Asia

Guangpeng Fan,Fei Yan,Xiangquan Zeng,Qingtao Xu,Ruoyoulan Wang,Binghong Zhang,Jialing Zhou,Liangliang Nan,Jinhu Wang,Zhiwei Zhang,Jia Wang

- • First mapping the primeval forest canopy height of the tallest tree growing in Asia
- • Deep learning driven by multisource Earth observation to monitor the giant trees area
- • Customized pyram
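
Since the converted output is Markdown, the document's section structure can be inspected directly from its headings. The sketch below is a simple heuristic rather than a Docling feature: the function name `markdown_outline` is hypothetical, and it only recognises `#`-style headings outside fenced code blocks.

```python
import re


def markdown_outline(md_text: str) -> list:
    """Return (level, title) pairs for each '#'-style heading in a Markdown string."""
    outline = []
    in_fence = False
    for line in md_text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # skip anything inside fenced code blocks
            continue
        if in_fence:
            continue
        match = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if match:
            outline.append((len(match.group(1)), match.group(2)))
    return outline


# Sample text taken from the preview printed above
sample_md = (
    "## Highlights\n\n"
    "## First Mapping the Canopy Height of Primeval Forests in the Tallest Tree Area of Asia\n\n"
    "Guangpeng Fan, Fei Yan\n"
)
sample_outline = markdown_outline(sample_md)
for level, title in sample_outline:
    print("  " * (level - 1) + title)
```

Running `markdown_outline(md_content)` on the full output gives a quick check that sections such as methods and evaluation survived the conversion.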

## 📊 Document Processing Demonstration

The cell above ran Docling's document intelligence on a real research paper. Papers like this one typically contain:
- **📊 Tables** with numerical data and results
- **🧮 Mathematical formulas** and equations  
- **📈 Figures** and charts with captions
- **📝 Multi-column layout** typical of academic papers
- **🏛️ Structured sections** like Abstract, Methods, Results, References

**Example Document:** The arXiv paper processed above shows the kind of complex academic content that plain text extraction struggles with.

### 🔬 Intelligent Processing in Action

In [6]:
# === STEP 1: Create Unique Vector Database ===
# Generate a unique identifier for this vector database instance
# Using UUID ensures no conflicts when multiple users run this notebook
vector_db_id = f"test_vector_db_{uuid.uuid4()}"
print(f"📊 Created vector database ID: {vector_db_id}")

# === STEP 2: Register Vector Database for Document Intelligence ===
# Configure the vector database to handle intelligently-processed documents
client.vector_dbs.register(
    vector_db_id=vector_db_id,                      # Unique identifier for this database
    embedding_model="all-MiniLM-L6-v2",            # Sentence transformer for embeddings
    embedding_dimension=384,                        # Vector dimensions (must match model)
    provider_id="milvus",                           # Use Milvus as the vector store backend
)

print(f"✅ Registered vector database for document intelligence:")
print(f"  • Database ID: {vector_db_id}")
print(f"  • Embedding Model: all-MiniLM-L6-v2 (384 dimensions)")
print(f"  • Provider: Milvus vector database")
print(f"  • Ready for Docling-processed content ingestion!")

INFO:httpx:HTTP Request: POST http://llama-stack-service:8321/v1/vector-dbs "HTTP/1.1 200 OK"


📊 Created vector database ID: test_vector_db_6e276e9b-94d1-42a0-9b7e-bf18273d2b21
✅ Registered vector database for document intelligence:
  • Database ID: test_vector_db_6e276e9b-94d1-42a0-9b7e-bf18273d2b21
  • Embedding Model: all-MiniLM-L6-v2 (384 dimensions)
  • Provider: Milvus vector database
  • Ready for Docling-processed content ingestion!
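
The registration above fixes the embedding dimension because similarity search compares vectors component by component. The sketch below illustrates the idea with cosine similarity over toy 3-dimensional vectors. It is pure Python for illustration only: the vectors and chunk names are made up, and Milvus uses its own indexes and distance metrics internally.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed: dict, k: int = 2) -> list:
    """Return the ids of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in indexed.items()]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy "embeddings" (a real all-MiniLM-L6-v2 vector has 384 components)
toy_index = {
    "chunk-tables": [1.0, 0.0, 0.0],
    "chunk-methods": [0.0, 1.0, 0.0],
    "chunk-results": [0.9, 0.1, 0.0],
}
best = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(best)
```

A mismatch between `embedding_dimension` and the model's real output size would fail at exactly the length check shown in `cosine_similarity`.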


In [7]:
# === STEP 3: Ingest Docling-Processed Content into RAG System ===
# Create a RAGDocument object with the intelligently-processed content
documents = [
    RAGDocument(
        document_id="docling-processed-doc",         # Unique identifier for this document
        content=md_content,                          # The Docling-processed Markdown content
        metadata={                                   # Enhanced metadata for complex documents
            "source_url": url,                       # Original document URL
            "processing_method": "docling",          # Processing pipeline used
            "document_type": "academic_paper",       # Content classification
            "has_tables": True,                      # Contains structured tabular data
            "has_formulas": True,                    # Contains mathematical content
            "has_figures": True,                     # Contains visual elements
        },
    )
]

print(f"📚 Preparing to ingest intelligently-processed document:")
print(f"  • Document ID: docling-processed-doc")
print(f"  • Content length: {len(md_content)} characters")
print(f"  • Processing method: Docling document intelligence")
print(f"  • Content includes: tables, formulas, figures, and structured text")

# === STEP 4: Use LlamaStack RAG Tool for Intelligent Chunking ===
# The RAG tool will automatically chunk the content optimally for retrieval
try:
    client.tool_runtime.rag_tool.insert(
        documents=documents,                         # List of RAGDocument objects to process
        vector_db_id=vector_db_id,                  # Target vector database
        chunk_size_in_tokens=512,                   # Chunk size in tokens (a reasonable starting point; tune for your content)
    )
    
    print(f"\n✅ Document ingestion complete!")
    print(f"🎯 Docling-processed content is now searchable via semantic similarity!")
    print(f"📊 Complex academic content (tables, formulas, figures) is now queryable!")
    
except Exception as e:
    print(f"\n❌ Document ingestion failed: {e}")
    print(f"💡 Check Docling processing and vector database configuration")

📚 Preparing to ingest intelligently-processed document:
  • Document ID: docling-processed-doc
  • Content length: 159318 characters
  • Processing method: Docling document intelligence
  • Content includes: tables, formulas, figures, and structured text


INFO:httpx:HTTP Request: POST http://llama-stack-service:8321/v1/tool-runtime/rag-tool/insert "HTTP/1.1 200 OK"



✅ Document ingestion complete!
🎯 Docling-processed content is now searchable via semantic similarity!
📊 Complex academic content (tables, formulas, figures) is now queryable!
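
The metadata flags in the ingestion cell (`has_tables`, `has_formulas`, `has_figures`) are hardcoded to `True`. One option is to derive them from the Markdown itself. The sketch below is a heuristic under stated assumptions: that tables appear as pipe tables with a `|---|` separator row, that formulas appear between `$$` delimiters, and that image placeholders appear as `<!-- image -->`. Check a sample of your own Docling output before relying on these markers. The function name `detect_content_flags` is hypothetical.

```python
import re

# Separator row of a Markdown pipe table, e.g. |---|:---:|
TABLE_SEPARATOR = re.compile(r"^\s*\|(\s*:?-+:?\s*\|)+\s*$")


def detect_content_flags(md_text: str) -> dict:
    """Guess which structured elements a Markdown document contains."""
    lines = md_text.splitlines()
    return {
        "has_tables": any(TABLE_SEPARATOR.match(line) for line in lines),
        "has_formulas": "$$" in md_text,           # assumed formula delimiter
        "has_figures": "<!-- image -->" in md_text,  # assumed image placeholder
    }


# Made-up sample, not taken from the processed paper
sample = "\n".join([
    "## Results",
    "",
    "| Column A | Column B |",
    "|----------|----------|",
    "| x        | y        |",
    "",
    "<!-- image -->",
])
flags = detect_content_flags(sample)
print(flags)
```

The resulting dictionary could be merged into the `metadata` argument of `RAGDocument`, for example `metadata={"source_url": url, **detect_content_flags(md_content)}`.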


## 🔍 Testing Document Intelligence RAG System

Now let's test our enhanced RAG system with queries that showcase Docling's document intelligence capabilities. We'll ask questions that would require understanding of:
- **📊 Tabular data** and structured information
- **🧮 Mathematical content** and technical details
- **📈 Research findings** and experimental results
- **🏛️ Document structure** and relationships

**The Power of Document Intelligence:**
Traditional text extraction would miss most of this information, but Docling's intelligent processing preserves the meaning and structure that enables accurate, comprehensive answers.

In [8]:
# Test queries for the processed document
queries = [
    "What is the PRFXception?",
    "The accuracy values of overall model prediction and residual cross-validation for five regions in southeast Tibet and four regions in northwest Yunnan"
]

for prompt in queries:
    cprint(f"\nUser> {prompt}", "blue")
    
    # RAG retrieval call - find relevant chunks from the vector database
    rag_response = client.tool_runtime.rag_tool.query(
        content=prompt, 
        vector_db_ids=[vector_db_id],
        query_config={
            "chunk_template": "Result {index}\nContent: {chunk.content}\nMetadata: {metadata}\n",
        },
        )

    cprint(rag_response)

    cprint(f"\n--- RAG Metadata ---", "yellow")
    cprint(rag_response.metadata, "cyan")

    # Create messages for the LLM with system prompt
    messages = [{"role": "system", "content": "You are a helpful assistant."}]

    # Combine the user query with retrieved context from RAG
    prompt_context = rag_response.content
    extended_prompt = f"Please answer the given query using the context below.\n\nCONTEXT:\n{prompt_context}\n\nQUERY:\n{prompt}"
    messages.append({"role": "user", "content": extended_prompt})

    # Get response from the LLM using the enhanced prompt
    response = client.inference.chat_completion(
        messages=messages,
        model_id=model_id,
        sampling_params=sampling_params,
        stream=stream,
    )
    
    # Print the streaming response
    cprint("inference> ", color="magenta", end='')
    if stream:
        for chunk in response:
            response_delta = chunk.event.delta
            if isinstance(response_delta, TextDelta):
                cprint(response_delta.text, color="magenta", end='')
            elif isinstance(response_delta, ToolCallDelta):
                cprint(response_delta.tool_call, color="magenta", end='')
    else:
        cprint(response.completion_message.content, color="magenta")

    cprint(f"\n--- End of RAG Answer ---", "blue")

User> What is the PRFXception?


INFO:httpx:HTTP Request: POST http://llama-stack-service:8321/v1/tool-runtime/rag-tool/query "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://llama-stack-service:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"


QueryResult(metadata={'document_ids': ['docling-processed-doc', 'docling-processed-doc', 'docling-processed-doc', 'docling-processed-doc', 'docling-processed-doc'], 'chunks': [" 256, num\\_sepconv\\_filters] as the feature extractor. This part is mainly responsible for the initial feature extraction and dimension transformation of the input data. Second, by building a series of separated convolution blocks sepconv\\_blocks. These blocks further process and extract features to better capture complex patterns and relationships in the input data, and separating convolutional structures helps improve the model's perception of local features. Finally, output is generated from three 1×1 convolutional layers predictions, variances and second\\_moments. These convolution layers are used to generate the model's predictions, variances, and second moments, respectively. PRFXception preserves the residual connection of Xception to mitigate the vanishing gradient problem(Chollet, 2017). These conne

INFO:httpx:HTTP Request: POST http://llama-stack-service:8321/v1/tool-runtime/rag-tool/query "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: POST http://llama-stack-service:8321/v1/inference/chat-completion "HTTP/1.1 200 OK"


QueryResult(metadata={'document_ids': ['docling-processed-doc', 'docling-processed-doc', 'docling-processed-doc', 'docling-processed-doc', 'docling-processed-doc'], 'chunks': [' whether the model overfits the training location or whether it supports common features such as making predictions at unseen locations. We conducted geographical cross-validation for 5 regions in southeast Tibet and 4 regions in northwest Yunnan respectively (Fig. 5), which will produce 5 times and 4 times cross-validation. We conducted two types of geographic cross-validation. In the first, we trained two separate regions, southeast Tibet and northwest Yunnan, and then cross-predicted their respective regions. The second method is to cross-verify the five selected regions in southeast Tibet and four regions in northwest Yunnan. All but one area is trained, and the remaining test area is predicted so that the training data near the test area is not visible during the training period.\n\n## 3.6. Evaluation index
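
The prompt assembly inside the query loop can be factored into a helper, which makes it easier to test and to cap the amount of retrieved context. This is a sketch that mirrors the prompt wording used in the loop above. The function name `build_rag_messages` and the `max_context_chars` parameter are hypothetical, and the helper assumes the retrieved context is already a plain string.

```python
from typing import Optional


def build_rag_messages(query: str, context: str,
                       system_prompt: str = "You are a helpful assistant.",
                       max_context_chars: Optional[int] = None) -> list:
    """Build the system and user messages for a retrieval-augmented chat completion."""
    if max_context_chars is not None:
        # Crude cap so very long retrieved context cannot crowd out the query
        context = context[:max_context_chars]
    extended_prompt = (
        "Please answer the given query using the context below.\n\n"
        f"CONTEXT:\n{context}\n\nQUERY:\n{query}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": extended_prompt},
    ]


demo_messages = build_rag_messages(
    query="What is the PRFXception?",
    context="PRFXception preserves the residual connection of Xception.",
)
print(demo_messages[1]["content"])
```

Inside the loop, `messages = build_rag_messages(prompt, str(rag_response.content))` would replace the three lines that build `messages` by hand.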

## 🎉 You've Built a Document Intelligence RAG System!

**What you accomplished:**
- **🔬 Document Intelligence**: Processed complex academic papers with Docling's advanced capabilities
- **📊 Structured Content Extraction**: Preserved tables, formulas, figures, and document hierarchy
- **🗄️ Enhanced Vector Storage**: Stored intelligently-processed content in Milvus for semantic search
- **🤖 Intelligent Querying**: Built a RAG system that understands complex academic content
- **⚡ Production Pipeline**: Created a scalable workflow for real-world educational materials

**Key Technical Insights:**
- **Document Intelligence vs Basic Extraction**: Docling preserves meaning and structure that simple text extraction would lose
- **Three-Phase Processing**: Analysis → Enhancement → RAG Integration creates comprehensive understanding
- **Semantic Understanding**: Complex documents become queryable by meaning, not just keywords
- **Metadata Enrichment**: Enhanced document metadata enables better retrieval and filtering

**Document Intelligence vs Traditional RAG:**
| Traditional RAG | Document Intelligence RAG |
|-----------------|---------------------------|
| ❌ Loses table structure | ✅ Preserves tabular relationships |
| ❌ Misses mathematical content | ✅ Handles formulas and equations |
| ❌ Ignores document layout | ✅ Understands multi-column layouts |
| ❌ Basic text chunks | ✅ Intelligent content structuring |
| ❌ Limited metadata | ✅ Rich semantic metadata |

**Real-World Applications:**
- **📚 Academic Research Assistants**: Query research papers for specific findings and data
- **🏫 Educational Content Search**: Find relevant course materials across complex documents  
- **📊 Data Extraction**: Automatically extract and query tabular information
- **🔬 Scientific Literature Review**: Analyze and compare findings across multiple papers
- **📖 Intelligent Document Libraries**: Build searchable repositories of complex materials

**Advanced Patterns to Explore:**
- **Multi-Document Intelligence**: Process and compare findings across multiple research papers
- **Domain-Specific Processing**: Optimize Docling for specific academic fields or document types
- **Visual Content Integration**: Enhance with image and chart understanding capabilities
- **Collaborative Intelligence**: Enable teams to build and share intelligent document repositories

Your document intelligence system can now understand and query the most complex academic content - transforming how educational institutions handle knowledge discovery and research! 🚀