# Document Ingestion and Chunking for RAG with Docling

This notebook demonstrates how to use **Docling** to ingest, parse, and chunk documents (PDF, DOCX, PPTX) for Retrieval-Augmented Generation (RAG) systems.

## Overview

**Docling** is a document processing library that:
- Extracts structured content from various document formats
- Preserves document layout and hierarchy (headings, sections, tables)
- Provides rich metadata (page numbers, document structure)
- Prepares content for embedding and vector database indexing

## Workflow Steps

1. **Install Dependencies** - Set up Docling and required libraries
2. **Document Ingestion** - Load and convert documents using DocumentConverter
3. **Structure Extraction** - Parse and access document elements with metadata
4. **Chunking Strategy** - Split content intelligently while preserving context
5. **Output Preparation** - Export chunks with metadata for embedding

## Prerequisites

- PDF, DOCX, or PPTX documents you want to process
- Python 3.9+ (recent Docling releases may require a newer version; check the package requirements)
- Jupyter Notebook environment


## 1. Setup and Installation

Start by importing the required libraries. If Docling is not installed yet, run the installation cell at the end of this notebook first.


In [1]:
# Import required libraries
import json
import pandas as pd
from pathlib import Path
from typing import List, Dict, Any

# Docling imports
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.chunking import HybridChunker

print("✓ All imports successful!")


✓ All imports successful!


## 2. Document Ingestion

Now we'll set up the DocumentConverter and configure it to process various document types.

### Configuration Options

- **OCR**: Enable for scanned documents or images
- **Table Extraction**: Use TableFormer for accurate table parsing
- **Pipeline Options**: Configure PDF processing behavior


In [2]:
# Configure the document converter with pipeline options
# This setup optimizes for accurate extraction with table support

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False  # Set to True for scanned documents
pipeline_options.do_table_structure = True  # Enable table extraction
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE  # or FAST for speed

# Initialize the DocumentConverter
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

print("✓ DocumentConverter initialized successfully!")
print(f"  - OCR enabled: {pipeline_options.do_ocr}")
print(f"  - Table extraction: {pipeline_options.do_table_structure}")
print(f"  - Table mode: {pipeline_options.table_structure_options.mode}")


✓ DocumentConverter initialized successfully!
  - OCR enabled: False
  - Table extraction: True
  - Table mode: TableFormerMode.ACCURATE


### Processing a Single Document

Provide the path to your document (PDF, DOCX, or PPTX).


In [3]:
# Specify the path to your document
# Replace with your actual document path
document_path = "/Users/yashpatil/Developer/AI/SunnySavita/Persona.pptx"  # Can be .pdf, .docx, .pptx

# Convert the document
print(f"Processing document: {document_path}")
print("This may take a moment depending on document size...")

# Run the conversion; this raises FileNotFoundError if the path does not exist
result = doc_converter.convert(document_path)
doc = result.document
print(doc)
# Print document content
print("\n" + "="*80)
print("DOCUMENT CONTENT")
print("="*80 + "\n")

# Export as markdown for readable output
markdown_content = doc.export_to_markdown()
print(markdown_content)

print("\n" + "="*80)
print("Document processed successfully!")
print(f"Total pages: {len(doc.pages) if hasattr(doc, 'pages') else 'N/A'}")
print("="*80)


Processing document: /Users/yashpatil/Developer/AI/SunnySavita/Persona.pptx
This may take a moment depending on document size...


FileNotFoundError: [Errno 2] No such file or directory: '/Users/yashpatil/Developer/AI/SunnySavita/Persona.pptx'
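
The run above stopped with a `FileNotFoundError` because the example path does not exist on the machine executing the notebook. A small pre-flight check can surface this earlier and with a clearer message. The sketch below uses only the standard library; `validate_document_path` and `SUPPORTED_SUFFIXES` are hypothetical names introduced for illustration, and the suffix set covers only the three formats used in this notebook.

```python
from pathlib import Path

# Assumption: only the formats used in this notebook; Docling supports more.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".pptx"}


def validate_document_path(path_str: str) -> Path:
    """Return a resolved Path, or raise a descriptive error before conversion."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"Document not found: {path}")
    # Compare suffixes case-insensitively so "REPORT.PDF" is accepted too.
    if path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file type '{path.suffix}' for {path.name}")
    return path.resolve()
```

With this in place, `doc_converter.convert(str(validate_document_path(document_path)))` fails fast on a bad path instead of partway through the cell.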

### Batch Processing Multiple Documents

Process multiple documents at once for efficient pipeline execution.


In [None]:
# Batch processing example
# Specify a directory containing multiple documents
documents_directory = "/Users/yashpatil/Developer/AI/SunnySavita/sample"  # Directory with PDFs, DOCX, PPTX files

# Collect all supported documents in the directory
document_paths = []
for ext in ['*.pdf', '*.docx', '*.pptx']:
    document_paths.extend(Path(documents_directory).glob(ext))

print(f"Found {len(document_paths)} documents to process")

# Process all documents
results = []
for doc_path in document_paths:
    print(f"Processing: {doc_path.name}")
    result = doc_converter.convert(str(doc_path))
    results.append({
        'filename': doc_path.name,
        'document': result.document
    })

print(f"\n✓ Batch processing complete! Processed {len(results)} documents")


2025-11-10 01:03:33,124 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-10 01:03:33,153 - INFO - Going to convert document batch...
2025-11-10 01:03:33,154 - INFO - Initializing pipeline for StandardPdfPipeline with options hash 4a89bda5a26a77cb2b7befc50a30d131
2025-11-10 01:03:33,160 - INFO - Loading plugin 'docling_defaults'
2025-11-10 01:03:33,163 - INFO - Registered ocr engines: ['auto', 'easyocr', 'ocrmac', 'rapidocr', 'tesserocr', 'tesseract']


Found 3 documents to process
Processing: projectOverview.pdf


2025-11-10 01:03:33,890 - INFO - Accelerator device: 'mps'
2025-11-10 01:05:32,353 - INFO - Accelerator device: 'mps'
2025-11-10 01:05:32,784 - INFO - Processing document projectOverview.pdf
2025-11-10 01:05:44,769 - INFO - Finished converting document projectOverview.pdf in 131.65 sec.
2025-11-10 01:05:44,771 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-11-10 01:05:44,772 - INFO - Going to convert document batch...
2025-11-10 01:05:44,773 - INFO - Processing document Attention Is All You Need for KV Cache in Diffusion LLMs.pdf


Processing: Attention Is All You Need for KV Cache in Diffusion LLMs.pdf


2025-11-10 01:05:46,361 - INFO - Finished converting document Attention Is All You Need for KV Cache in Diffusion LLMs.pdf in 1.59 sec.
2025-11-10 01:05:46,364 - INFO - detected formats: [<InputFormat.PPTX: 'pptx'>]
2025-11-10 01:05:46,372 - INFO - Going to convert document batch...
2025-11-10 01:05:46,372 - INFO - Processing document Persona.pptx
2025-11-10 01:05:46,425 - INFO - Finished converting document Persona.pptx in 0.06 sec.


Processing: Persona.pptx

✓ Batch processing complete! Processed 3 documents
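
The loop above stops at the first document that fails to convert, and patterns such as `*.pdf` can miss files like `REPORT.PDF` on case-sensitive file systems. One possible variation is sketched below. `find_documents` and `convert_all` are hypothetical helpers, not part of Docling, and the conversion step is passed in as a plain callable so the batching logic stays independent of the converter.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Assumption: only the formats used in this notebook.
SUFFIXES = {".pdf", ".docx", ".pptx"}


def find_documents(directory: str) -> List[Path]:
    """List supported files in a directory, matching suffixes case-insensitively."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUFFIXES
    )


def convert_all(
    paths: Iterable[Path], convert: Callable[[Path], Any]
) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Convert each path, collecting failures instead of stopping the batch."""
    results: List[Dict[str, Any]] = []
    failures: List[Dict[str, str]] = []
    for path in paths:
        try:
            results.append({"filename": path.name, "document": convert(path)})
        except Exception as exc:  # keep going; report the problem afterwards
            failures.append({"filename": path.name, "error": repr(exc)})
    return results, failures
```

In this notebook the callable could be `lambda p: doc_converter.convert(str(p)).document`, which produces the same `results` structure as the cell above plus a separate list of failures to inspect.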


In [None]:
# Extract and display content from all processed documents
for result in results:
    print(f"\n{'='*80}")
    print(f"Document: {result['filename']}")
    print('='*80)
    
    doc = result['document']
    
    # Export as markdown for readable content
    markdown_content = doc.export_to_markdown()
    
    # Display the first 3000 characters of content
    print("\nContent Preview:")
    print(markdown_content[:3000])
    print("..." if len(markdown_content) > 3000 else "")
    
    # Show document statistics
    print("\nDocument Stats:")
    print(f"  - Total pages: {len(doc.pages) if hasattr(doc, 'pages') else 'N/A'}")
    print(f"  - Content length: {len(markdown_content)} characters")
    print(f"  - Tables: {len(doc.tables) if hasattr(doc, 'tables') else 0}")
    print(f"  - Pictures: {len(doc.pictures) if hasattr(doc, 'pictures') else 0}")


Document: projectOverview.pdf

Content Preview:
DeepWiki sunnysavita10/document\_portal

## Menu

## Document Portal Overview

Relevant source files

## Purpose and Scope

This document provides a high-level introduction to the Document Portal system, a comprehensive document processing platform that combines AI-powered analysis, comparison, and conversational capabilities. The system enables users to upload documents and perform various operations including metadata extraction, document comparison, and interactive question-answering through retrieval-augmented generation (RAG).

This overview covers the system's core capabilities, architecture, and technology stack. For detailed setup instructions, see Getting Started . For in-depth architectural details, see System Architecture . For deployment information, see Deployment and Infrastructure .

## System Capabilities

The Document Portal provides four primary document processing capabilities:

| Capability           | Description    

## 3. Document Parsing and Structure Extraction

Once a document is converted, Docling provides structured access to all elements with rich metadata.


### Export as Markdown

Markdown format preserves structure while being human-readable.


In [None]:
# Export document to Markdown format
# Uncomment when you have a processed document

# markdown_content = doc.export_to_markdown()
# print("Markdown Export (first 1000 characters):")
# print("=" * 60)
# print(markdown_content[:1000])
# print("=" * 60)
# print(f"\nTotal length: {len(markdown_content)} characters")


### Export as JSON (Structured)

JSON export includes detailed metadata about each element.


In [4]:
# Export document to JSON format with structure
# Uncomment when you have a processed document

# json_content = doc.export_to_dict()
# 
# # Pretty print a sample of the JSON structure
# print("JSON Structure (sample):")
# print(json.dumps(json_content, indent=2, default=str)[:2000])
# 
# # Save to file if needed
# # with open('document_structure.json', 'w') as f:
# #     json.dump(json_content, f, indent=2, default=str)
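
Printing only the first 2000 characters of a large JSON export tends to cut off partway through the structure. A complementary view, sketched below, lists each top-level key with its type and size. `summarize_structure` is a hypothetical helper that works on any dictionary, so it makes no assumption about which keys Docling emits.

```python
from typing import Any, Dict


def summarize_structure(data: Dict[str, Any]) -> Dict[str, str]:
    """Describe each top-level key: its type and, for containers, its size."""
    summary: Dict[str, str] = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            summary[key] = f"{type(value).__name__} with {len(value)} item(s)"
        else:
            summary[key] = type(value).__name__
    return summary
```

Once a document has been processed, `summarize_structure(doc.export_to_dict())` gives a compact overview before you look at individual elements.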


### Accessing Document Elements

Iterate through document elements with metadata like page numbers, element types, and hierarchy.


In [5]:
# Access document elements with metadata
# Uncomment when you have a processed document

# print("Document Elements Analysis:")
# print("=" * 60)
# 
# # Iterate through document items
# for i, (element, level) in enumerate(doc.iterate_items()):
#     element_type = type(element).__name__
#     text_preview = str(element.text)[:100] if hasattr(element, 'text') else ""
#     
#     # Get page information if available
#     page_info = ""
#     if hasattr(element, 'prov') and element.prov:
#         for prov in element.prov:
#             if hasattr(prov, 'page_no'):
#                 page_info = f"Page {prov.page_no}"
#                 break
#     
#     print(f"[{i}] Type: {element_type:20} Level: {level} {page_info}")
#     if text_preview:
#         print(f"    Text: {text_preview}...")
#     
#     if i >= 9:  # Stop after the first 10 elements (indices 0-9)
#         print(f"\n... and {len(list(doc.iterate_items())) - 10} more elements")
#         break
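
For longer documents, an aggregate view is often more useful than the first few elements. The sketch below counts elements by type and page in a single pass. It relies on the same `prov` and `page_no` attributes the cell above reads, accessed defensively with `getattr`. `first_page` and `count_elements` are hypothetical helper names.

```python
from collections import Counter
from typing import Any, Iterable, Optional, Tuple


def first_page(element: Any) -> Optional[int]:
    """Return the first page number found in the element's provenance, or None."""
    for prov in getattr(element, "prov", None) or []:
        page_no = getattr(prov, "page_no", None)
        if page_no is not None:
            return page_no
    return None


def count_elements(items: Iterable[Tuple[Any, int]]) -> Counter:
    """Count (element type name, page number) pairs in a single pass."""
    counts: Counter = Counter()
    for element, _level in items:
        counts[(type(element).__name__, first_page(element))] += 1
    return counts
```

Calling `count_elements(doc.iterate_items())` on a processed document returns a `Counter` whose `most_common()` method shows which element types dominate which pages.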


### Extracting Tables

Tables are extracted as structured data, preserving their content and layout.


In [6]:
# Extract and display tables
# Uncomment when you have a processed document

# from docling_core.types.doc import TableItem
# 
# tables = [item for item, _ in doc.iterate_items() if isinstance(item, TableItem)]
# 
# print(f"Found {len(tables)} table(s) in the document\n")
# 
# for i, table in enumerate(tables[:3]):  # Show first 3 tables
#     print(f"Table {i+1}:")
#     print("-" * 60)
#     
#     # Export table as markdown
#     if hasattr(table, 'export_to_markdown'):
#         print(table.export_to_markdown())
#     
#     # Convert the table to a pandas DataFrame
#     if hasattr(table, 'export_to_dataframe'):
#         df = table.export_to_dataframe()
#         print(df.head())
#     
#     print("\n")


### Using Docling's HybridChunker

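The HybridChunker splits a document along its structure (sections, paragraphs, tables) and then enforces a token limit: oversized chunks are split further, and undersized neighboring chunks under the same headings are merged.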


In [9]:
# Initialize the HybridChunker
# Configure chunk size based on your needs
# HybridChunker uses a default tokenizer automatically

chunker = HybridChunker(
    max_tokens=512,  # Maximum tokens per chunk (adjust based on your embedding model)
)

print("✓ HybridChunker initialized")
print(f"  - Max tokens per chunk: {chunker.max_tokens}")


✓ HybridChunker initialized
  - Max tokens per chunk: 512
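
The last workflow step, exporting chunks with metadata for embedding, can be sketched as follows. This is a minimal illustration, not Docling's own export API. It assumes each chunk exposes `text` and a `meta` object with `headings` and `doc_items` (whose `prov` entries carry `page_no`), and it reads all of these defensively with `getattr`. The record fields and the `source#index` id format are hypothetical choices.

```python
import json
from typing import Any, Dict, Iterable


def chunk_to_record(chunk: Any, source: str, index: int) -> Dict[str, Any]:
    """Flatten one chunk into a JSON-serializable record for embedding."""
    meta = getattr(chunk, "meta", None)
    headings = list(getattr(meta, "headings", None) or [])
    # Collect the distinct pages the chunk's source items came from.
    pages = sorted({
        prov.page_no
        for item in getattr(meta, "doc_items", None) or []
        for prov in getattr(item, "prov", None) or []
        if getattr(prov, "page_no", None) is not None
    })
    return {
        "id": f"{source}#{index}",  # hypothetical id scheme
        "text": chunk.text,
        "source": source,
        "headings": headings,
        "pages": pages,
    }


def to_jsonl(chunks: Iterable[Any], source: str) -> str:
    """Serialize chunks as one JSON object per line."""
    records = [chunk_to_record(c, source, i) for i, c in enumerate(chunks)]
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

For a processed document, `to_jsonl(chunker.chunk(doc), "projectOverview.pdf")` yields text that can be written to a `.jsonl` file and loaded by an embedding or indexing job.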


<!-- ### Common Issues and Solutions

**Issue: OCR is slow**
- Solution: Only enable OCR for scanned documents
- Use pipeline options: `pipeline_options.do_ocr = False`

**Issue: Tables not extracted properly**
- Solution: Use ACCURATE mode and ensure high-quality PDFs
- Check table detection: `pipeline_options.do_table_structure = True`

**Issue: Chunks too large for embedding model**
- Solution: Reduce `max_tokens` in HybridChunker
- Verify token count matches your embedding model's limit

**Issue: Memory errors with large documents**
- Solution: Process documents in smaller batches
- Consider processing one document at a time for very large files

**Issue: Poor retrieval quality**
- Solution: Adjust chunk size and overlap
- Ensure metadata is preserved and useful
- Consider semantic chunking instead of fixed-size chunks -->


## 8. Quick Reference

### Minimal Working Example

```python
# 1. Setup
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

# 2. Convert
converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document

# 3. Chunk
chunker = HybridChunker(max_tokens=512)
chunks = list(chunker.chunk(doc))

# 4. Use chunks
for chunk in chunks:
    print(chunk.text)
```

### Key Parameters Reference

**PdfPipelineOptions (passed to DocumentConverter):**
- `do_ocr`: Enable/disable OCR (Docling enables it by default; this notebook sets it to False)
- `do_table_structure`: Enable table extraction (default: True)
- `table_structure_options.mode`: FAST or ACCURATE

**HybridChunker:**
- `max_tokens`: Maximum tokens per chunk (this notebook uses 512; if omitted, the limit is taken from the tokenizer)
- Uses a default tokenizer automatically (no need to specify)

### Supported Formats

- PDF (.pdf)
- Microsoft Word (.docx)
- PowerPoint (.pptx)
- HTML (.html)
- Markdown (.md)

Check the Docling documentation for the complete list.


## Summary

This notebook demonstrated a complete workflow for document ingestion and chunking for RAG:

1. **Installation & Setup**: Installed Docling and configured the environment
2. **Document Ingestion**: Used DocumentConverter to process PDF/DOCX/PPTX files
3. **Structure Extraction**: Parsed documents with metadata (pages, sections, tables)
4. **Intelligent Chunking**: Split content using HybridChunker with metadata preservation
5. **Best Practices**: Covered optimization, troubleshooting, and integration


### Resources

- **Docling Documentation**: https://github.com/DS4SD/docling


In [None]:
# Install Docling and dependencies
# Uncomment and run this cell if you haven't installed these packages yet

# !pip install docling
# !pip install docling-core
# !pip install pandas
