# Docling

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/docling/docling_1_intro.ipynb)

[Docling](https://github.com/DS4SD/docling) is a document processor that converts a wide variety of formats, such as PDF, DOCX, HTML, and PPTX.


## Step-1: Figure out Runtime Environment

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [27]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 1.2 - Install dependencies if running on Google Colab

In [28]:
%%capture

if RUNNING_IN_COLAB:
    ! pip install  --default-timeout=100  \
        docling

## Step-2: Settings / Config

In [29]:
# Route Hugging Face downloads through a mirror.
# Comment out the HF_ENDPOINT line below if https://huggingface.co/ is reachable directly.
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [30]:
## Set up input / output dirs
import os
import shutil

os.makedirs('input', exist_ok=True)

# Start from an empty output dir on every run
shutil.rmtree('output', ignore_errors=True)
os.makedirs('output', exist_ok=True)

## Step-3: Check Out Data Files

We will use a few simple PDFs and one HTML page. The files are available [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data).





## Step-4: Extract Text from PDF

input file: [earth.pdf](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf)

In [31]:
if RUNNING_IN_COLAB:
  input_file = 'input/earth.pdf'
  !wget -O  {input_file} 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/earth.pdf'
else:
  input_file = '../data/solar-system/earth.pdf'
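
The Colab-versus-local branch above recurs for every input file in this notebook. A minimal sketch of a helper that captures it follows. `BASE_URL`, `resolve_input`, and `fetch_if_missing` are hypothetical names introduced here, and the download uses the standard library's `urllib` instead of `wget`.

```python
import os
import urllib.request

# Assumption: raw-file base URL, taken from the links used in this notebook.
BASE_URL = "https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data"


def resolve_input(rel_path, in_colab):
    """Return (local_path, url) for a data file such as 'solar-system/earth.pdf'."""
    url = f"{BASE_URL}/{rel_path}"
    if in_colab:
        # In Colab the file is downloaded into the local 'input' directory.
        local_path = os.path.join("input", os.path.basename(rel_path))
    else:
        # Locally the repository checkout already contains the file.
        local_path = os.path.join("..", "data", *rel_path.split("/"))
    return local_path, url


def fetch_if_missing(local_path, url):
    """Download url to local_path unless the file already exists."""
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

With this in place, a cell could call `input_file, url = resolve_input('solar-system/earth.pdf', RUNNING_IN_COLAB)` and only invoke `fetch_if_missing(input_file, url)` when running in Colab.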

### 4.1 - Command line

Usage

`docling   --output output --to md  input/earth.pdf`


In [32]:
## PDF --> markdown
!docling   --output output --to md  {input_file}

## PDF --> json
!docling   --output output --to json  {input_file}

## PDF --> html
# !docling   --output output --to html  {input_file}

INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|███████████████████████| 9/9 [00:00<00:00, 141381.03it/s]
INFO:docling.pipeline.base_pipeline:Processing document earth.pdf
INFO:docling.document_converter:Finished converting document earth.pdf in 5.46 sec.
INFO:docling.cli.main:writing Markdown output to output/earth.md
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 5.46 seconds.
INFO:docling.document_converter:Going to convert document batch...
Fetching 9 files: 100%|████████████████████████| 9/9 [00:00<00:00, 28815.83it/s]
INFO:docling.pipeline.base_pipeline:Processing document earth.pdf
INFO:docling.document_converter:Finished converting document earth.pdf in 4.90 sec.
INFO:docling.cli.main:writing JSON output to output/earth.json
INFO:docling.cli.main:Processed 1 docs, of which 0 failed
INFO:docling.cli.main:All documents were converted in 4.90 seconds.
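
Outside a notebook, the `!docling ...` shell escapes are not available. A minimal sketch of the same calls from plain Python, assuming the `docling` executable is on the `PATH` and using only the `--output` and `--to` flags shown above; `build_docling_command` and `convert_with_cli` are hypothetical helper names.

```python
import subprocess


def build_docling_command(input_file, output_dir="output", to_format="md"):
    """Build the argument list for the docling CLI call shown above."""
    return ["docling", "--output", output_dir, "--to", to_format, input_file]


def convert_with_cli(input_file, formats=("md", "json"), output_dir="output"):
    """Run the docling CLI once per format; raises CalledProcessError on failure."""
    for fmt in formats:
        subprocess.run(build_docling_command(input_file, output_dir, fmt), check=True)
```

Passing the arguments as a list avoids shell quoting problems when a file name contains spaces.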


### 4.2 - Using Python API

In [33]:
from docling.document_converter import DocumentConverter

print ("Processing:", input_file)

converter = DocumentConverter()
result = converter.convert(input_file)
md = result.document.export_to_markdown()
json = result.document.export_to_dict()
print (md)


Processing: ../data/solar-system/earth.pdf


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

## Earth

## Solar System

Our solar system is a vast and fascinating expanse, comprising eight planets, five dwarf planets, numerous moons, asteroids, comets, and other celestial bodies. At its center lies the star we call the Sun.

For more details about our Solar system see Chapter 1.

## Earth

Earth is the third planet from the Sun. It's our home planet. Earth is the only place we know of with life.

Basic facts about Earth:

- · Distance from the Sun: Average of 149.6 million kilometers (93 million miles)
- · Rotation Period: 24 hours (one day)
- · Moons: One moon, called Luna or simply "the Moon".


In [34]:
import pprint
pprint.pprint (json, indent=4)

{   'body': {   'children': [   {'$ref': '#/texts/0'},
                                {'$ref': '#/texts/1'},
                                {'$ref': '#/texts/2'},
                                {'$ref': '#/texts/3'},
                                {'$ref': '#/texts/4'},
                                {'$ref': '#/texts/5'},
                                {'$ref': '#/texts/6'},
                                {'$ref': '#/groups/0'},
                                {'$ref': '#/texts/10'}],
                'label': 'unspecified',
                'name': '_root_',
                'self_ref': '#/body'},
    'furniture': {   'children': [],
                     'label': 'unspecified',
                     'name': '_root_',
                     'self_ref': '#/furniture'},
    'groups': [   {   'children': [   {'$ref': '#/texts/7'},
                                      {'$ref': '#/texts/8'},
                                      {'$ref': '#/texts/9'}],
                      'label': 'lis
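
The exported dict links items through `$ref` strings such as `#/texts/0` and `#/groups/0`. The sketch below shows one way to follow those references and walk the body in reading order. It runs on a small hand-made sample, not on real Docling output, and it assumes that text items store their content under a `text` key; `resolve_ref` and `iter_body_items` are hypothetical helper names.

```python
def resolve_ref(doc, ref):
    """Follow a reference such as '#/texts/0' into the exported dict."""
    node = doc
    for part in ref.lstrip("#/").split("/"):
        # Inside a list the path part is an index; otherwise it is a dict key.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def iter_body_items(doc, node=None):
    """Yield text items under 'body' in reading order, descending into groups."""
    node = doc["body"] if node is None else node
    for child in node.get("children", []):
        item = resolve_ref(doc, child["$ref"])
        if "text" in item:
            yield item
        if item.get("children"):
            yield from iter_body_items(doc, item)


# Hand-made sample mirroring the layout printed above (not real Docling output).
# Assumption: text items carry their content under a 'text' key.
sample = {
    "body": {"children": [{"$ref": "#/texts/0"}, {"$ref": "#/groups/0"}]},
    "groups": [{"children": [{"$ref": "#/texts/1"}]}],
    "texts": [{"text": "Earth"}, {"text": "Moons: One moon"}],
}
print([item["text"] for item in iter_body_items(sample)])
```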

## Step-5: Handle Table Data

input file: [solar-system-overview.pdf](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/solar-system-overview.pdf)

In [35]:
if RUNNING_IN_COLAB:
  input_file = 'input/solar-system-overview.pdf'
  !wget -O  {input_file} 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/solar-system/solar-system-overview.pdf'
else:
  input_file = '../data/solar-system/solar-system-overview.pdf'

In [36]:
from docling.document_converter import DocumentConverter

print ("Processing:", input_file)

converter = DocumentConverter()
result = converter.convert(input_file)
md = result.document.export_to_markdown()
json = result.document.export_to_dict()
print (md)


Processing: ../data/solar-system/solar-system-overview.pdf


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

## Solar System

| Planet   | Distance from Sun (average)   | Number of Moons                                                        | Average Orbit Time   |
|----------|-------------------------------|------------------------------------------------------------------------|----------------------|
| Mercury  | 58 million km (0.39 AU)       | 0                                                                      | 87.97 days           |
| Venus    | 108 million km (0.72 AU)      | 0                                                                      | 224.7 days           |
| Earth    | 149.6 million km (1 AU)       | 1 (The Moon)                                                           | 365.25 days          |
| Mars     | 227.9 million km (1.38 AU)    | 2 (Phobos, Deimos)                                                     | 687.01 days          |
| Jupiter  | 778.3 million km (5.2 AU)     | 79 ( Io, Europa, Ganymede, Callisto, and 75 others)                    | 11.86 years        
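
Once the table is available as Markdown, it can be turned into records for further processing. The sketch below is a plain-Python parser for a simple pipe table like the one above; `parse_markdown_table` is a hypothetical helper, and it assumes cells do not contain `|` characters. Docling's own examples also read tables from `result.document.tables`, so check the documentation of the installed version for that route.

```python
def parse_markdown_table(markdown):
    """Parse the first pipe table in a Markdown string into a list of dicts."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # the table has ended
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row made of dashes (and optional colons).
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Shortened copy of two columns from the table printed above.
sample = """
| Planet   | Number of Moons   |
|----------|-------------------|
| Mercury  | 0                 |
| Earth    | 1 (The Moon)      |
"""
records = parse_markdown_table(sample)
print(records[1]["Number of Moons"])
```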

## Step-6: Invoice

input file: [invoice-2.pdf](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/invoices/invoice-2.pdf)

In [37]:
if RUNNING_IN_COLAB:
  input_file = 'input/invoice-2.pdf'
  !wget -O  {input_file} 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/invoices/invoice-2.pdf'
else:
  input_file = '../data/invoices/invoice-2.pdf'

In [38]:
from docling.document_converter import DocumentConverter

print ("Processing:", input_file)

converter = DocumentConverter()
result = converter.convert(input_file)
md = result.document.export_to_markdown()
json = result.document.export_to_dict()
print (md)

Processing: ../data/invoices/invoice-2.pdf


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

<!-- image -->

BILL TO:

John Doe

415 Broadway

San Jose, CA

## Perfect Handyman

123 Main St

San Jose, CA

INVOICE #

00000001

DATE

12/31/2024

INVOICE DUE DATE

12/31/2024

| ITEMS   | DESCRIPTION   |   QUANTITY | PRICE   | AMOUNT   |
|---------|---------------|------------|---------|----------|
| ITEM 1  | Cleanup       |          1 | $100    | 100      |
| ITEM 2  | Installation  |          1 | $250    | $250     |

NOTES:

TOTAL

$350
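
Note that the AMOUNT column comes back inconsistently formatted (`100` versus `$250`). A minimal sketch of normalizing such strings and checking the line items against the total, using values copied by hand from the output above; `parse_amount` is a hypothetical helper.

```python
from decimal import Decimal


def parse_amount(text):
    """Convert strings such as '$250' or '100' to Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return Decimal(cleaned)


# Values copied from the AMOUNT column and the TOTAL line printed above.
amounts = ["100", "$250"]
total = "$350"

line_sum = sum(parse_amount(a) for a in amounts)
print(line_sum == parse_amount(total))
```

`Decimal` is used instead of `float` so that currency values are compared exactly.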


## Step-7: Process HTML

Convert HTML to Markdown and strip out excess markup and trackers.

Input file: [AI-Alliance-Home.html](https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/html/AI-Alliance-Home.html)

In [39]:
if RUNNING_IN_COLAB:
  input_file = 'input/AI-Alliance-Home.html'
  !wget -O  {input_file} 'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/html/AI-Alliance-Home.html'
else:
  input_file = '../data/html/AI-Alliance-Home.html'

In [40]:
from docling.document_converter import DocumentConverter

print ("Processing:", input_file)

converter = DocumentConverter()
result = converter.convert(input_file)
md = result.document.export_to_markdown()
json = result.document.export_to_dict()
print (md)

Processing: ../data/html/AI-Alliance-Home.html
Who We Are

Our Work

Our Members

Governance

Skills & Education

Trust & Safety

Applications & Tools

Hardware Enablement

Foundation Models

Advocacy

How to Join

Contribute Code

Become a Collaborator

Who We Are

Our Work

Our Members

Governance

Skills & Education

Trust & Safety

Applications & Tools

Hardware Enablement

Foundation Models

Advocacy

How to Join

Contribute Code

Become a Collaborator

# Building the open future of AI

We are technology developers, researchers, industry leaders and advocates who collaborate to advance safe, responsible AI rooted in open innovation.

<!-- image -->

### Skills & Education

<!-- image -->

Supporting global AI skill-building, education, and exploratory research.

### Trust & Safety

<!-- image -->

Creating benchmarks, tools, and methodologies to evaluate and ensure safe, trusted generative AI.

### Applications and Tools

<!-- image -->

Building the most capable tools for AI mode
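
In the output above, the site's navigation menu appears twice before the first heading. If that menu is not wanted, one simple post-processing step is to keep only the text from the first Markdown heading onward. This is a sketch under the assumption that the page's real content starts at its first heading, which holds for this page but not in general; `drop_leading_navigation` is a hypothetical helper.

```python
def drop_leading_navigation(markdown):
    """Return the Markdown starting at its first heading line, if any."""
    lines = markdown.splitlines()
    for index, line in enumerate(lines):
        if line.lstrip().startswith("#"):
            return "\n".join(lines[index:])
    return markdown  # no heading found; leave the text unchanged
```

It could be applied as `md = drop_leading_navigation(md)` before the text is saved.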

In [42]:
## Save it to file

with open('output/html.out.md', 'w', encoding='utf-8') as f:
    f.write(md)
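
The dict returned by `export_to_dict()` can be saved in a similar way. The sketch below assumes that dict is JSON-serializable. It imports the standard `json` module under the alias `json_lib` because the cells above bind the name `json` to the exported dict; `save_dict_as_json` is a hypothetical helper.

```python
import json as json_lib  # aliased so a variable named `json` cannot shadow it
import os


def save_dict_as_json(data, path):
    """Write a dict to path as UTF-8 JSON and return the path."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json_lib.dump(data, f, indent=2, ensure_ascii=False)
    return path
```

In this notebook the call would look like `save_dict_as_json(json, 'output/html.out.json')`, where `json` is the dict from the previous cell.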