# Docling

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/docling/docling_2_ocr.ipynb)

[Docling](https://github.com/DS4SD/docling) is an advanced document processor. It can handle a wide variety of formats, such as PDF, DOCX, HTML, and PPTX.

In this notebook, we will parse scanned PDFs and extract their content using OCR.


## Step-1: Figure out Runtime Environment

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [2]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab
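The same check can be wrapped in a small function so it is easy to try outside a notebook. This is a minimal sketch: the function name `running_in_colab` and its optional `environ` parameter are illustrative and not part of the original notebook.

```python
import os


def running_in_colab(environ=None):
    """Return True when COLAB_RELEASE_TAG is set to a non-empty value."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if environ is None else environ
    return bool(env.get("COLAB_RELEASE_TAG"))


# Explicit mappings make both branches visible without touching os.environ.
print(running_in_colab({"COLAB_RELEASE_TAG": "some-release-tag"}))
print(running_in_colab({}))
```

Passing the environment as an argument keeps the check free of side effects, which is convenient when the notebook logic is later moved into a script.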


### 1.2 - Install dependencies if running on Google Colab

In [3]:
# %%capture

if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  \
        docling

## Step-2: Settings / Config

In [4]:
# If the connection to https://huggingface.co/ fails, use the mirror below; otherwise comment out these lines
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [5]:
## Set up input / output dirs
import os
import shutil

os.makedirs('input', exist_ok=True)

# Start from a clean output directory on every run
shutil.rmtree('output', ignore_errors=True)
os.makedirs('output', exist_ok=True)

## Step-3: Data files

We will use scanned PDFs.  The files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/scanned-pdfs).

These PDFs are scanned images, so they contain no embedded 'digital text' layer; the text has to be recovered with OCR.

In [6]:
if RUNNING_IN_COLAB:
  !wget -O  'input/letter-1.pdf'    'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/scanned-pdfs/letter-1.pdf'
  !wget -O  'input/memo-1.pdf'    'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/scanned-pdfs/memo-1.pdf'
  !wget -O  'input/product-brochure.pdf'    'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/scanned-pdfs/product-brochure.pdf'
  !wget -O  'input/public-water-notice.pdf'    'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/scanned-pdfs/public-water-notice.pdf'
  !wget -O  'input/scanned-1.pdf'    'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/scanned-pdfs/scanned-1.pdf'
  !wget -O  'input/type-writter-scanned-1.pdf'    'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/scanned-pdfs/type-writter-scanned-1.pdf'
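Outside Colab, where `wget` may not be available, the same six files could be fetched from Python. The sketch below is one possible approach using only the standard library; the helper names `build_download_plan` and `download_all` are hypothetical, and the URLs are the ones from the cell above.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = (
    "https://raw.githubusercontent.com/sujee/data-prep-kit-examples/"
    "main/data/scanned-pdfs"
)
FILE_NAMES = [
    "letter-1.pdf",
    "memo-1.pdf",
    "product-brochure.pdf",
    "public-water-notice.pdf",
    "scanned-1.pdf",
    "type-writter-scanned-1.pdf",
]


def build_download_plan(target_dir="input"):
    """Map each local destination path to the URL it should be fetched from."""
    return {Path(target_dir) / name: f"{BASE_URL}/{name}" for name in FILE_NAMES}


def download_all(target_dir="input"):
    """Download every file in the plan, skipping files that already exist."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    for dest, url in build_download_plan(target_dir).items():
        if not dest.exists():
            urlretrieve(url, str(dest))
```

Calling `download_all()` performs the network requests; `build_download_plan()` alone only computes the paths, which makes it easy to inspect before downloading anything.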

## Step-4: Extract Text from PDFs



In [7]:
if RUNNING_IN_COLAB:
  input_file = 'input/scanned-1.pdf'
else:
  input_file = '../data/scanned-pdfs/scanned-1.pdf'

### 4.1 - Command line

Usage

`docling   --output output --to md  input/file.pdf`
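The same command line can also be driven from plain Python, which is handy outside a notebook where the `!` shell escape is unavailable. This is a sketch that uses only the `--output` and `--to` flags shown in the usage line; the helper names `build_docling_command` and `run_docling` are illustrative.

```python
import shlex
import subprocess


def build_docling_command(input_path, output_dir="output", to_format="md"):
    """Build the argument list for the docling CLI call shown in the usage line."""
    return ["docling", "--output", output_dir, "--to", to_format, str(input_path)]


def run_docling(input_path, output_dir="output", to_format="md"):
    """Run the docling CLI; raises CalledProcessError on a non-zero exit code."""
    cmd = build_docling_command(input_path, output_dir, to_format)
    return subprocess.run(cmd, check=True)


# Show the command that would be executed, without running it.
print(shlex.join(build_docling_command("input/file.pdf")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.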


In [8]:
## PDF --> markdown
!docling   --output output --to md  {input_file}

## PDF --> json
!docling   --output output --to json  {input_file}

## PDF --> html
# !docling   --output output --to html  {input_file}

INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 27216.10it/s]
INFO:docling.pipeline.base_pipeline:Processing document scanned-1.pdf
INFO:docling.document_converter:Finished converting document scanned-1.pdf in 7.15 sec.
INFO:docling.cli.main:writing Markdown output to output/scanned-1.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 7.15 seconds.
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 67529.04it/s]
INFO:docling.pipeline.base_pipeline:Processing document scanned-1.pdf
INFO:docling.document_converter:Finished converting document scanned-1.pdf in 6.20 sec.
INFO:docling.cli.main:writing JSON output to output/scanned-1.json
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 6.20 seconds.


### 4.2 - Using Python API

In [9]:
from docling.document_converter import DocumentConverter

print("Processing:", input_file)

# Convert the PDF, then export the parsed document in two formats
converter = DocumentConverter()
result = converter.convert(input_file)
md = result.document.export_to_markdown()  # markdown string
json = result.document.export_to_dict()    # plain Python dict
print(md)


Processing: ../data/scanned-pdfs/scanned-1.pdf


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

## IDRH

Non-text-searchable PDF

This is an example of a non-text-searchable PDF . Because it was created from an image rather than a text document, it cannot be rendered as plain text by the PDF reader. Thus, attempting to select the text on the page as though it were a text document or website will not work, regardless of how neatly it is organized.
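The conversion above relies on the converter's default settings. If OCR needs to be requested explicitly, a sketch along the following lines may work; it assumes the docling 2.x module layout (`PdfPipelineOptions`, `PdfFormatOption`, `InputFormat`) and should be checked against the installed version. The function name `build_ocr_converter` is illustrative.

```python
def build_ocr_converter():
    """Return a DocumentConverter with OCR explicitly enabled for PDF input."""
    # Imports live inside the function so the helper can be defined
    # even in an environment where docling is not installed yet.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True  # assumption: attribute name as in docling 2.x

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

The returned object would be used like the default converter, for example `build_ocr_converter().convert(input_file)`.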


In [10]:
import pprint
pprint.pprint(json, indent=4)

{   'body': {   'children': [   {'$ref': '#/texts/0'},
                                {'$ref': '#/texts/1'},
                                {'$ref': '#/texts/2'}],
                'label': 'unspecified',
                'name': '_root_',
                'self_ref': '#/body'},
    'furniture': {   'children': [],
                     'label': 'unspecified',
                     'name': '_root_',
                     'self_ref': '#/furniture'},
    'groups': [],
    'key_value_items': [],
    'name': 'scanned-1',
    'origin': {   'binary_hash': 5490598024236166341,
                  'filename': 'scanned-1.pdf',
                  'mimetype': 'application/pdf'},
    'pages': {'1': {'page_no': 1, 'size': {'height': 792.0, 'width': 612.0}}},
    'pictures': [],
    'schema_name': 'DoclingDocument',
    'tables': [],
    'texts': [   {   'children': [],
                     'label': 'section_header',
                     'level': 1,
                     'orig': 'IDRH',
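The exported dictionary can be walked with ordinary Python. The sketch below pulls the `label` and `orig` fields out of each entry under `texts`, using only keys that appear in the output above; the `sample` dictionary is a hand-made stand-in for the real export, and `summarize_texts` is a hypothetical helper name.

```python
def summarize_texts(doc_dict):
    """Return (label, orig) pairs for every entry under the 'texts' key."""
    return [
        (item.get("label"), item.get("orig"))
        for item in doc_dict.get("texts", [])
    ]


# Stand-in for the dictionary returned by export_to_dict(), trimmed to the
# fields visible in the printed output.
sample = {
    "name": "scanned-1",
    "texts": [
        {"children": [], "label": "section_header", "level": 1, "orig": "IDRH"},
    ],
}

for label, orig in summarize_texts(sample):
    print(f"{label}: {orig}")
```

Using `.get()` keeps the helper tolerant of entries that lack one of the fields.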
                    

## Step-5: Batch Conversion

In [11]:
if RUNNING_IN_COLAB:
  input_dir = 'input'
else:
  input_dir = '../data/scanned-pdfs'

In [12]:
%%time

## Command line - uncomment this to execute

# !docling   --output output --to md  {input_dir}

CPU times: user 2 μs, sys: 1e+03 ns, total: 3 μs
Wall time: 5.72 μs


In [14]:
%%time

## Python API

import os
from pathlib import Path
from docling.document_converter import DocumentConverter

# Create the converter once and reuse it for every file
converter = DocumentConverter()

input_path = Path(input_dir)
pdf_files = list(input_path.glob('*.pdf'))
print(f"Found {len(pdf_files)} PDF files to convert")

for pdf_file in pdf_files:
    result = converter.convert(pdf_file)
    markdown_content = result.document.export_to_markdown()

    # Write <name>.md into the output directory
    md_file_name = os.path.join('output', f"{pdf_file.stem}.md")
    with open(md_file_name, "w", encoding="utf-8") as md_file:
        md_file.write(markdown_content)

    print(f"Converted PDF '{pdf_file}' to markdown '{md_file_name}'")

Found 6 PDF files to convert


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

2025-03-12 22:41:38.159 (  92.914s) [        AD99A740]    doc_normalisation.h:448   WARN| found new `other` type: checkbox-unselected
2025-03-12 22:41:38.159 (  92.914s) [        AD99A740]    doc_normalisation.h:448   WARN| found new `other` type: checkbox-unselected
2025-03-12 22:41:38.159 (  92.914s) [        AD99A740]    doc_normalisation.h:448   WARN| found new `other` type: checkbox-unselected
2025-03-12 22:41:38.159 (  92.914s) [        AD99A740]    doc_normalisation.h:448   WARN| found new `other` type: checkbox-unselected


Converted PDF '../data/scanned-pdfs/public-water-notice.pdf' to markdown 'output/public-water-notice.md'
Converted PDF '../data/scanned-pdfs/scanned-1.pdf' to markdown 'output/scanned-1.md'
Converted PDF '../data/scanned-pdfs/product-brochure.pdf' to markdown 'output/product-brochure.md'
Converted PDF '../data/scanned-pdfs/type-writter-scanned-1.pdf' to markdown 'output/type-writter-scanned-1.md'
Converted PDF '../data/scanned-pdfs/letter-1.pdf' to markdown 'output/letter-1.md'




Converted PDF '../data/scanned-pdfs/memo-1.pdf' to markdown 'output/memo-1.md'
CPU times: user 2min 23s, sys: 5.84 s, total: 2min 29s
Wall time: 1min 29s
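In the loop above, one failing PDF would stop the whole batch. A possible variation, sketched below, records failures and continues. The function `convert_all` and its `to_markdown` callback are hypothetical; the callback stands in for whatever turns a PDF path into a markdown string.

```python
from pathlib import Path


def convert_all(pdf_files, to_markdown, output_dir="output"):
    """Convert each PDF with `to_markdown`, collecting failures instead of stopping.

    Returns (written, failed): a list of markdown paths that were written and
    a list of (pdf_path, exception) pairs for files that could not be converted.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written, failed = [], []
    for pdf_file in map(Path, pdf_files):
        try:
            markdown = to_markdown(pdf_file)
        except Exception as exc:  # keep going; report the problem afterwards
            failed.append((pdf_file, exc))
            continue
        md_path = out_dir / f"{pdf_file.stem}.md"
        md_path.write_text(markdown, encoding="utf-8")
        written.append(md_path)
    return written, failed
```

With the converter from the cell above, the callback could be `lambda p: converter.convert(p).document.export_to_markdown()`, and the call would look like `written, failed = convert_all(pdf_files, callback)`.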
