# 📄 PDF to Markdown Converter with DocLing
This notebook demonstrates how to convert PDF documents into clean, structured Markdown files using the DocLing library.
It uses the DocumentConverter class with PdfPipelineOptions to process PDFs and extract both text and embedded images.

✅ Features:

- Supports PDF input from a local path or URL
- Automatically generates page images
- Outputs Markdown with embedded content
- Stores results in a designated output directory
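Because the input can be either a local path or a URL, it can help to check which kind of source you have before converting. The sketch below is not part of DocLing; `describe_source` is a hypothetical helper that uses only the standard library and treats anything with an `http` or `https` scheme as a URL.

```python
from urllib.parse import urlparse


def describe_source(source: str) -> str:
    """Classify a source string as 'url' or 'local' (hypothetical helper)."""
    scheme = urlparse(source).scheme.lower()
    # Only http(s) counts as a URL here; Windows drive letters such as
    # "C:" parse as a one-letter scheme and therefore stay 'local'.
    if scheme in ("http", "https"):
        return "url"
    return "local"


print(describe_source("sample-paper1.pdf"))
print(describe_source("https://example.org/paper.pdf"))
```

A check like this makes it easy to verify that a local file exists before the (comparatively slow) conversion starts.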

In [1]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from pathlib import Path


## 🔧 Setting Up the PDF Conversion Pipeline
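This block configures the PDF-to-Markdown conversion pipeline. `PdfPipelineOptions` points DocLing at a local model directory (`artifacts_path`), enables page-image generation (`generate_page_images=True`), and sets the image scale factor to 2.0 (`images_scale=2.0`). The options are then attached to a `DocumentConverter` for the PDF input format.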

In [2]:
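# Local directory holding the DocLing models; adjust this path for your machine
artifacts_path = "/home/db2inst1/.cache/docling/models"

pipeline_options = PdfPipelineOptions(
    artifacts_path=artifacts_path,
    generate_page_images=True,  # keep a rendered image of every page
    images_scale=2.0,           # scale factor applied to generated images
)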
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
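The `artifacts_path` above is hardcoded to one user's home directory, so the cell will not run unchanged on another machine. A minimal sketch of a more portable lookup follows. `resolve_artifacts_path` is a hypothetical helper, and the environment variable name `DOCLING_ARTIFACTS_PATH` is an assumption that should be checked against the DocLing version you have installed.

```python
import os
from pathlib import Path


def resolve_artifacts_path(env_var: str = "DOCLING_ARTIFACTS_PATH") -> Path:
    """Return the model directory from the environment, else a per-user default."""
    override = os.environ.get(env_var)
    if override:
        return Path(override).expanduser()
    # Assumed default: same layout as the hardcoded path above,
    # but under the current user's home directory.
    return Path.home() / ".cache" / "docling" / "models"


artifacts_path = resolve_artifacts_path()
print(artifacts_path, "exists" if artifacts_path.is_dir() else "is missing")
```

The result can be passed on as `artifacts_path=str(resolve_artifacts_path())`, which keeps the argument a plain string as in the cell above.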

## 📥 Loading and Converting the PDF
This section loads the input PDF and runs the conversion pipeline:

In [3]:
source = "sample-paper1.pdf"  # PDF path or URL
result = doc_converter.convert(source)
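When several PDFs need converting, one failing file should not stop the rest. The sketch below wraps the same `convert(source)` call in a loop and records failures separately. `convert_all` is a hypothetical helper, and it assumes that `convert` raises an exception when a conversion fails, which may depend on the DocLing version and the arguments used.

```python
from typing import Any, Dict, Iterable, Tuple


def convert_all(
    converter: Any, sources: Iterable[str]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Convert each source, collecting results and error messages separately."""
    results: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for source in sources:
        try:
            results[source] = converter.convert(source)
        except Exception as exc:
            # Broad on purpose: no specific exception types are assumed.
            errors[source] = f"{type(exc).__name__}: {exc}"
    return results, errors
```

With the converter from the previous cell this could be called as `results, errors = convert_all(doc_converter, ["sample-paper1.pdf"])`. Any object with a `convert` method works, which also makes the loop easy to try out with a stub.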

## 📤 Exporting the Results to Markdown
This section saves the converted document as a Markdown file with embedded images:

- Creates an output directory if it doesn’t exist.
- Extracts the filename (without extension) from the input PDF.
- Writes the converted content to a `.md` file using `export_to_markdown(image_mode="embedded")`, which includes both text and inline images.
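As a standalone illustration of the first two steps, the sketch below derives the target `.md` path from a source string using only the standard library. `markdown_path_for` is a hypothetical helper. The notebook itself reads the stem from `result.input.file`, which reflects the file DocLing actually processed, so for URL sources the name predicted here may differ.

```python
from pathlib import Path
from urllib.parse import urlparse


def markdown_path_for(source: str, output_dir: Path) -> Path:
    """Build '<output_dir>/<stem>.md' for a local path or URL (hypothetical helper)."""
    parsed = urlparse(source)
    # For URLs, use only the path component so query strings stay out of the name.
    name = parsed.path if parsed.scheme in ("http", "https") else source
    output_dir.mkdir(parents=True, exist_ok=True)  # safe to call repeatedly
    return output_dir / f"{Path(name).stem}.md"


print(markdown_path_for("sample-paper1.pdf", Path("output")))
```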

In [4]:
# Prepare the output directory and the base filename
output_dir = Path("output")
output_dir.mkdir(parents=True, exist_ok=True)
doc_filename = result.input.file.stem

# Export Markdown format:
with (output_dir / f"{doc_filename}.md").open("w", encoding="utf-8") as fp:
    fp.write(result.document.export_to_markdown(image_mode="embedded"))
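A quick sanity check on the exported file is to count how many images were actually embedded. The sketch below assumes that embedded mode writes images as Markdown image links whose target is a base64 `data:` URI. `count_embedded_images` is a hypothetical helper, not a DocLing function.

```python
import re

# Markdown image whose target is a base64 data URI,
# e.g. ![Image](data:image/png;base64,AAAA...)
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:image/[a-zA-Z0-9.+-]+;base64,[A-Za-z0-9+/=]+\)"
)


def count_embedded_images(markdown_text: str) -> int:
    """Return the number of inline base64 images found in the Markdown text."""
    return len(DATA_URI_IMAGE.findall(markdown_text))
```

It can be applied to the file written above with `count_embedded_images((output_dir / f"{doc_filename}.md").read_text(encoding="utf-8"))`. If the count is zero for a PDF that clearly contains figures, one thing worth checking, depending on the DocLing version, is whether the pipeline option `generate_picture_images` also needs to be enabled.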