<a href="https://colab.research.google.com/github/sunnygupta3535/Notebook/blob/main/pdf_processor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Processor (RAG Pipeline)

This notebook handles the PDF-processing stage of a Retrieval-Augmented Generation (RAG) pipeline using the Docling library. It scans a directory for PDF files, converts each one to Markdown, and splits the result by page, so that the extracted text can later be chunked and indexed for retrieval.

## Features

- Scans a specified directory for PDF files.
- Uses `docling` to convert PDFs to Markdown.
- Supports Optical Character Recognition (OCR) for scanned PDFs.
- Splits the generated Markdown content into individual pages.
- Saves both the full Markdown document and individual page Markdowns to an output directory.


## Setup

1.  **Install Dependencies**: The notebook begins by installing the `docling` library.
    ```bash
    !pip install docling
    ```

2.  **Prepare Data**: Create a `data` directory in the same location as your notebook and place all the PDF files you wish to process inside it.
    ```bash
    # Example: Create a 'data' directory and upload your PDFs
    !mkdir -p data
    # Upload your PDF files to the 'data' directory in Colab's file browser
    ```

3.  **Configure Output**: An `output` directory will be created automatically to store the processed Markdown files.
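
Before running the notebook, it can help to confirm that the `data` directory actually contains PDFs. The following is a minimal sketch using only `pathlib`. The helper name `list_pdfs` is hypothetical and not part of the notebook; it mirrors the recursive `**/*.pdf` glob the notebook uses, so PDFs in subfolders are included.

```python
from pathlib import Path


def list_pdfs(data_dir: str = "data") -> list[Path]:
    """Return all PDF files under data_dir, sorted for a stable order.

    Hypothetical helper, not part of the notebook. It uses the same
    recursive "**/*.pdf" pattern as the notebook's process_all_pdfs.
    """
    root = Path(data_dir)
    if not root.is_dir():
        # A missing directory is reported as "no PDFs" rather than an error
        return []
    return sorted(root.glob("**/*.pdf"))


if __name__ == "__main__":
    pdfs = list_pdfs("data")
    print(f"Found {len(pdfs)} PDF file(s)")
    for pdf in pdfs:
        print(f" - {pdf}")
```

Note that the pattern is matched case-sensitively on Linux (and therefore in Colab), so a file named `REPORT.PDF` would not be picked up.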



## Usage

Simply run all cells in the notebook. The script will:

1.  Initialize the `PDFProcessor` with OCR enabled.
2.  Scan the `data` directory for all `.pdf` files.
3.  Process each PDF file: convert it to Markdown, and then split it into page-wise Markdown files.
4.  Save the generated Markdown files (full document and individual pages) into subdirectories within the `output` folder, named after the original PDF file.



## Output

For each PDF file, the following output will be generated in a dedicated subdirectory within the `output` folder:

-   `<pdf_name>_full.md`: The complete Markdown content of the PDF.
-   `<pdf_name>_page_001.md`, `<pdf_name>_page_002.md`, etc.: Individual Markdown files for each page of the PDF.

Example output structure:

```
output/
├── CIGU-2025-0002/
│   ├── CIGU-2025-0002_full.md
│   ├── CIGU-2025-0002_page_001.md
│   ├── CIGU-2025-0002_page_002.md
│   └── ...
├── CIGU-2025-0003/
│   ├── CIGU-2025-0003_full.md
│   ├── CIGU-2025-0003_page_001.md
│   └── ...
└── ...
```
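
The naming scheme above can be reproduced with a few lines of `pathlib`. This is a sketch only: `expected_outputs` is a hypothetical helper, not a function from the notebook, and it assumes the page count is already known.

```python
from pathlib import Path


def expected_outputs(pdf_path: str, n_pages: int, output_dir: str = "output") -> list[Path]:
    """List the files the naming scheme would produce for one PDF.

    Hypothetical helper for illustration; it does not read or write files.
    """
    stem = Path(pdf_path).stem              # file name without ".pdf"
    pdf_dir = Path(output_dir) / stem       # one subdirectory per PDF
    paths = [pdf_dir / f"{stem}_full.md"]
    # Page files are numbered from 1 and zero-padded to three digits
    paths += [pdf_dir / f"{stem}_page_{i:03d}.md" for i in range(1, n_pages + 1)]
    return paths


for path in expected_outputs("data/CIGU-2025-0002.pdf", n_pages=2):
    print(path.as_posix())
# output/CIGU-2025-0002/CIGU-2025-0002_full.md
# output/CIGU-2025-0002/CIGU-2025-0002_page_001.md
# output/CIGU-2025-0002/CIGU-2025-0002_page_002.md
```

Because the subdirectory is derived from the file name alone, two PDFs with the same name in different subfolders of `data` would write to the same output directory.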

The `results` variable in the kernel contains a list of dictionaries, where each dictionary represents a processed page and includes its page number, file path, content, and source PDF.
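
As an illustration of how this list might be consumed downstream, the sketch below groups page records by their source PDF. The sample data is made up and only imitates the shape of `results` (keys `page_num`, `file_path`, `content`, `source_pdf`); the function name `pages_by_source` is hypothetical.

```python
from collections import defaultdict

# Made-up sample with the same keys as the notebook's `results` entries
results = [
    {"page_num": 1, "file_path": "output/a/a_page_001.md",
     "content": "First page of A", "source_pdf": "data/a.pdf"},
    {"page_num": 2, "file_path": "output/a/a_page_002.md",
     "content": "Second page of A", "source_pdf": "data/a.pdf"},
    {"page_num": 1, "file_path": "output/b/b_page_001.md",
     "content": "First page of B", "source_pdf": "data/b.pdf"},
]


def pages_by_source(pages: list[dict]) -> dict[str, list[dict]]:
    """Group page records by the PDF they came from."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page["source_pdf"]].append(page)
    return dict(grouped)


for source, pages in pages_by_source(results).items():
    print(f"{source}: {len(pages)} page(s)")
```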

## Important Notes

-   The run cell calls `userdata.get('HF_TOKEN')` from `google.colab.userdata`, but the returned token is not stored or passed to any code in this notebook. It may be intended for Hugging Face model downloads, though nothing here uses it directly.
-   If processing fails for an individual PDF, the error is printed and processing continues with the remaining files.
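
The continue-on-error behaviour can be sketched in isolation. Both `process_many` and the stand-in `fake_process` are hypothetical names used for illustration. Unlike the notebook, which only prints each error, this variant also collects the failures so they can be inspected afterwards.

```python
from typing import Callable


def process_many(
    paths: list[str],
    process_one: Callable[[str], list[dict]],
) -> tuple[list[dict], list[tuple[str, str]]]:
    """Run process_one on every path, collecting pages and failures."""
    pages: list[dict] = []
    failures: list[tuple[str, str]] = []
    for path in paths:
        try:
            pages.extend(process_one(path))
        except Exception as exc:
            # Record the failure and keep going with the remaining files
            failures.append((path, str(exc)))
    return pages, failures


def fake_process(path: str) -> list[dict]:
    """Stand-in for a real converter: fails on one specific file name."""
    if path.endswith("broken.pdf"):
        raise ValueError("could not parse")
    return [{"page_num": 1, "source_pdf": path}]


pages, failures = process_many(["a.pdf", "broken.pdf", "b.pdf"], fake_process)
print(f"{len(pages)} page(s), {len(failures)} failure(s)")
```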


In [1]:
!pip install docling

Collecting docling
  Downloading docling-2.64.1-py3-none-any.whl.metadata (11 kB)
Collecting docling-core<3.0.0,>=2.50.1 (from docling-core[chunking]<3.0.0,>=2.50.1->docling)
  Downloading docling_core-2.54.1-py3-none-any.whl.metadata (7.6 kB)
Collecting docling-parse<5.0.0,>=4.7.0 (from docling)
  Downloading docling_parse-4.7.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (10 kB)
Collecting docling-ibm-models<4,>=3.9.1 (from docling)
  Downloading docling_ibm_models-3.10.3-py3-none-any.whl.metadata (7.3 kB)
Collecting filetype<2.0.0,>=1.2.0 (from docling)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting pypdfium2!=4.30.1,<5.0.0,>=4.30.0 (from docling)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
Collecting rapidocr<4.0.0,>=3.3 (from docling)
  Downloading

In [2]:
import os
import logging
from pathlib import Path
from typing import Optional

from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import PdfFormatOption

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

In [3]:
# Configuration
DATA_DIR = Path("data")
OUTPUT_DIR = Path("output")

print(f"Data Directory: {DATA_DIR.resolve()}")
print(f"Output Directory: {OUTPUT_DIR.resolve()}")

Data Directory: /content/data
Output Directory: /content/output


In [4]:
class PDFProcessor:
    """Process PDFs using Docling and export to markdown."""

    def __init__(self, output_dir: Optional[str] = None):
        self.output_dir = Path(output_dir or OUTPUT_DIR)
        self.output_dir.mkdir(parents=True, exist_ok=True)

        # Configure PDF pipeline options
        pipeline_options = PdfPipelineOptions()
        pipeline_options.do_ocr = True  # Enable OCR for scanned PDFs

        self.converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
            }
        )

    def process_pdf(self, pdf_path: str) -> list[dict]:
        """
        Process a single PDF and generate page-wise markdown files.

        Args:
            pdf_path: Path to the PDF file

        Returns:
            List of dicts with page info and markdown content
        """
        pdf_path = Path(pdf_path)
        if not pdf_path.exists():
            raise FileNotFoundError(f"PDF not found: {pdf_path}")

        # Convert PDF
        logger.info(f"Converting {pdf_path.name}...")
        result = self.converter.convert(pdf_path)

        # Get the document
        doc = result.document

        # Export full document to markdown
        full_md = doc.export_to_markdown()

        # Create output directory for this PDF
        pdf_output_dir = self.output_dir / pdf_path.stem
        pdf_output_dir.mkdir(parents=True, exist_ok=True)

        # Save full markdown
        full_md_path = pdf_output_dir / f"{pdf_path.stem}_full.md"
        with open(full_md_path, 'w', encoding='utf-8') as f:
            f.write(full_md)

        # Split by pages and save individual page files
        pages = self._split_by_pages(doc, full_md)

        page_files = []
        for i, page_content in enumerate(pages, 1):
            page_path = pdf_output_dir / f"{pdf_path.stem}_page_{i:03d}.md"
            with open(page_path, 'w', encoding='utf-8') as f:
                f.write(page_content)

            page_files.append({
                'page_num': i,
                'file_path': str(page_path),
                'content': page_content,
                'source_pdf': str(pdf_path),
            })

        return page_files

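    def _split_by_pages(self, doc, full_md: str) -> list[str]:
        """
        Split document content by pages.

        `doc.pages` is a dict keyed by page number, so iterating over it
        yields integers rather than page objects. Each page is therefore
        exported by passing its number to `doc.export_to_markdown`.
        """
        pages = []

        try:
            if getattr(doc, 'pages', None):
                for page_no in sorted(doc.pages):
                    # Assumes the installed docling-core accepts `page_no`;
                    # if it does not, the fallback below is used instead.
                    pages.append(doc.export_to_markdown(page_no=page_no))
        except Exception as e:
            logger.warning(f"Page-wise export failed, using fallback split: {e}")
            pages = []

        if not pages:
            # Fallback heuristic: split on Markdown horizontal rules, which
            # are not guaranteed to coincide with page breaks.
            if '\n---\n' in full_md:
                pages = full_md.split('\n---\n')
            else:
                pages = [full_md]

        return pages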

    def process_all_pdfs(self, data_dir: Optional[str] = None) -> list[dict]:
        """
        Process all PDFs in the data directory.
        """
        data_dir = Path(data_dir or DATA_DIR)

        all_pages = []
        pdf_files = list(data_dir.glob("**/*.pdf"))

        print(f"Found {len(pdf_files)} PDF files to process in {data_dir}")

        for pdf_path in pdf_files:
            print(f"Processing: {pdf_path.name}")
            try:
                pages = self.process_pdf(pdf_path)
                all_pages.extend(pages)
                print(f"  → Generated {len(pages)} page(s)")
            except Exception as e:
                print(f"  ✗ Error processing {pdf_path.name}: {e}")

        return all_pages

In [8]:
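from google.colab import userdata
# Read the Hugging Face token from Colab secrets. The return value is not
# stored or used below, so this call only confirms that the secret exists
# and that this notebook has been granted access to it.
userdata.get('HF_TOKEN')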

# Run Processing
if __name__ == "__main__":
    processor = PDFProcessor(output_dir=OUTPUT_DIR)

    # Ensure data directory exists
    if not DATA_DIR.exists():
        print(f"Warning: Data directory {DATA_DIR} does not exist!")
        print("Please create it and place PDF files there.")
    else:
        results = processor.process_all_pdfs(data_dir=DATA_DIR)
        print(f"\nTotal pages generated: {len(results)}")

[INFO] 2025-12-10 00:50:16,453 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-12-10 00:50:16,455 [RapidOCR] device_config.py:50: Using CPU device
[INFO] 2025-12-10 00:50:16,508 [RapidOCR] download_file.py:60: File exists and is valid: /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth
[INFO] 2025-12-10 00:50:16,510 [RapidOCR] main.py:50: Using /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_PP-OCRv4_det_infer.pth


Found 4 PDF files to process in data
Processing: CIGU-2025-0002.pdf


[INFO] 2025-12-10 00:50:16,791 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-12-10 00:50:16,792 [RapidOCR] device_config.py:50: Using CPU device
[INFO] 2025-12-10 00:50:16,797 [RapidOCR] download_file.py:60: File exists and is valid: /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2025-12-10 00:50:16,798 [RapidOCR] main.py:50: Using /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_ptocr_mobile_v2.0_cls_infer.pth
[INFO] 2025-12-10 00:50:16,912 [RapidOCR] base.py:22: Using engine_name: torch
[INFO] 2025-12-10 00:50:16,914 [RapidOCR] device_config.py:50: Using CPU device
[INFO] 2025-12-10 00:50:17,007 [RapidOCR] download_file.py:60: File exists and is valid: /usr/local/lib/python3.12/dist-packages/rapidocr/models/ch_PP-OCRv4_rec_infer.pth
[INFO] 2025-12-10 00:50:17,008 [RapidOCR] main.py:50: Using /usr/local/lib/python3.12/dist-packages/rapidocr/mod

  → Generated 69 page(s)
Processing: CIGU-2025-0003.pdf




  → Generated 11 page(s)
Processing: Cyber-security-tips-by-cyber-dost.pdf
  → Generated 45 page(s)
Processing: ANUAL-2025-0001.pdf
  → Generated 25 page(s)

Total pages generated: 150


In [9]:
# Verify Output
output_files = list(OUTPUT_DIR.glob("**/*.md"))
print(f"Found {len(output_files)} markdown files in output directory:")
for f in output_files[:5]:
    print(f" - {f.name}")
if len(output_files) > 5:
    print("...")

Found 154 markdown files in output directory:
 - CIGU-2025-0003_page_002.md
 - CIGU-2025-0003_page_011.md
 - CIGU-2025-0003_page_004.md
 - CIGU-2025-0003_page_008.md
 - CIGU-2025-0003_full.md
...
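
The flat listing above does not show how the files are distributed across PDFs. A per-PDF count can be sketched as follows; `count_markdown_per_pdf` is a hypothetical helper, not part of the notebook, and it assumes the layout described in the Output section (one subdirectory per PDF directly under the output directory).

```python
from collections import Counter
from pathlib import Path


def count_markdown_per_pdf(output_dir: str = "output") -> dict[str, int]:
    """Count .md files in each immediate subdirectory of output_dir."""
    root = Path(output_dir)
    counts = Counter(
        path.relative_to(root).parts[0]   # name of the per-PDF subdirectory
        for path in root.glob("*/*.md")   # only files exactly one level down
    )
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for name, n_files in count_markdown_per_pdf("output").items():
        # Each subdirectory holds one *_full.md plus one file per page
        print(f"{name}: {n_files} file(s), {n_files - 1} page(s)")
```

For the run shown above, the 150 page files plus one `_full.md` file for each of the 4 PDFs account for the 154 Markdown files found.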
