PDF doc structure as markdown #391

joe-pierce · 2026-02-16T10:35:18Z

joe-pierce
Feb 16, 2026

Hi, I've had a quick play around and like the performance I'm seeing for many file formats, however the PDF outputs I am getting are not great. Are there any plans to enable document and table structure in PDFs and output as markdown? E.g. with a word doc I get good structured markdown output, however with PDFs there is no identified structure, and tables are poorly output.

Goldziher · 2026-02-16T14:36:20Z

Goldziher
Feb 16, 2026
Maintainer

Have you used markdown output for PDFs? Can you give concrete examples?

0 replies

joe-pierce · 2026-02-16T14:46:12Z

joe-pierce
Feb 16, 2026
Author

With this PDF as an example:
Innovation-Magazine-_11summer_FC.pdf

The markdown output doesn't contain any headers, and there are line breaks wherever the text wraps in the PDF. I was hoping to get a clear structure of headers and subheaders, with proper paragraphs rather than lots of line breaks, more closely following the implied structure of the document. The docling library does a lot of this layout processing and converts to markdown fairly well, but its speed and memory usage are far behind kreuzberg's.

I used this code:

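import asyncio

from kreuzberg import ExtractionConfig, ExtractionResult, extract_file


async def extract_file_with_profiling(file_path: str) -> ExtractionResult:
    # Request markdown output, with caching and quality post-processing enabled.
    config = ExtractionConfig(
        use_cache=True,
        enable_quality_processing=True,
        output_format="markdown",
    )
    return await extract_file(file_path, config=config)


async def main() -> None:
    result = await extract_file_with_profiling(
        r"Innovation-Magazine-_11summer_FC.pdf"
    )
    # Print the extracted text so it can be compared with the expected markdown.
    print(result.content)


if __name__ == "__main__":
    asyncio.run(main())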

to produce the following output:

innovation
Current Magazine Supplement Issue 10 June 2025
Featured on page 6: Using the innovative 
spiral flush piling technique in Kendal.

A Jackson Civil Engineering team, working 
on behalf of the Environment Agency, is 
using an innovative silt capture channel 
during the building of a major Flood Storage 
Reservoir (FSR). The £30 million FSR will 
protect 191 properties and businesses from 
flooding in a 1 in 100 year flood event in 
the village of Lowdham in Nottinghamshire. 
The construction of the reservoir will 
simultaneously be providing a biodiversity 
increase of 20% Biodiversity Net Gain (BNG). 
We are using the channel to keep as much
silt as possible out of a nearby watercourse.
Background
Our scheme involves a high level of earthworks
and much of the water running off the site
goes into a stream called the Cocker Beck. The
use of the silt capture channel as a natural
water treatment system, helps clean the water.
This reduces the risk of polluting nearby
watercourses and causing harm to aquatic life.
We installed the channel, working with Frog
Environmental, at a cost of £8,000.
What did you do differently?
The system over-pumps water into a holding
pond, which is then gravity fed into an
outlet pipe containing gel flocculant bricks.
This is an environmentally safe way of
introducing flocculant into ditches, drains,
pumped discharges or effluent streams to
cleanse muddy water, and trap silt and other
contaminants like nutrients, metals and
hydrocarbons.
A flocculant is an agent that helps
other agents, like fine suspended particles,
clump together into larger particles, which can
be captured more easily. Unmade ground on
construction sites is particularly vulnerable and,
without adequate protection, rain will lead to
fine particles of clayey soil and silt being carried
long distances off-site into streams and rivers.
The water from the pipe then runs
into the silt capture channel made up of a
series of mainly coir products. Formed on an
impermeable membrane and made up of a
combination of Floc Mat™, Silt Mats™ and Silt
Wattles, silt capture channels are used to slow
the flow of water, and to filter and trap silt and
fine sediment.
These mats and wattles hold up to 40
– 50kg of silt sediment and on Lowdham have
a lifespan of typically six months, depending on
the rain and the sediment that is washed down
as a result. Being 100% biodegradable, the
silt-filled mats can be utilised within an area of
landscape fill.
PREVENTING POLLUTION
Using innovative silt capture at Lowdham Flood Storage
Key stats: £30million flood
storage reservoir protecting
191 properties. Construction
will provide a BNG of 20%
increase.
Challenge: To prevent the
risk of polluting nearby
watercourses and causing
harm to aquatic life.
Solution: Our team installed
a simple but innovative silt
capture channel.
Team: Environment Agency,
Jackson Civil Engineering,
Frog Environmental
Contact:
Richard Barnes, Strategic
Client Director at Jackson Civil
Engineering.
rbarnes@jackson-civils.co.uk
Further information:
Frog Environmental silt
control
Benjamin Doughty, Flood
Risk Officer, Partnership &
Strategic Overview Team,
Environment Agency: "The
Environment Agency is
committed to ensuring
that the watercourse is
not polluted as a result of
the scheme earthworks –
and the silt trap channel
helps achieve just that.
The silt trap channel has
successfully reduced the
risk of siltation to the river
and pollutants to the Cocker
Beck which may have
otherwise affected fish and
other ecology within the
river habitat."
Benefits
• Captures and binds silt, cleaning muddy
construction site water.
• It can be made in any shape and size,
and a second channel can be built to
handle higher volumes of silty water.
• Easily monitored with consumable
products replaced quickly and simply.
• No ground disturbance or excavations
required.
• Quick and easy to install.
• Captures material and holds up to
40–50kg of silt sediment and part can
be replaced.
Lessons learnt and future uses
Depending on the available space,
topography and pumping rate, the silt
capture channel can be formed in different
configurations to suit the constraints of the
construction site.
Our site team and supply chain have
collaborated over this passive approach to
create this site-based solution, which could
be replicated by other projects and teams.
It will help reduce carbon emissions as the
project is carried out and ensure that silt
runoff from our project activities does not
affect existing watercourses.
Lowdham silt trap
18

The new issue of Current Magazine has been
published.
This new edition focuses on the application of
Artificial Intelligence into flood & coastal projects to
save time, cost and carbon. This edition also delves
into projects improving climate resilience and social
value. Click here to read.
Subscribe. Share. Feedback.
Email
sarah@sustain-communications.com
Join the LinkedIn Current &
Innovation Magazine group to
receive the magazine online.
Share
on Twitter, Facebook, Whatsapp or
LinkedIn.
Issue 1 Issue 2
Magazine library accessible to all
Don't forget to read our other editions of the Innovation Supplement and Current Magazine. You can
access the full library here: You can also click on any of the icons below to open previous issues of
Innovation.
Issue 4
Issue 3
Issue 5 Issue 6
Issue 7 Issue 8
Issue 9
20

2 replies

Goldziher Feb 16, 2026
Maintainer

I'll check.

Goldziher Feb 17, 2026
Maintainer

We indeed have a bug. Thanks for reporting. I am making this into an issue.

Goldziher · 2026-02-19T12:50:40Z

Goldziher
Feb 19, 2026
Maintainer

Please test v4.3.6.

2 replies

joe-pierce Feb 19, 2026
Author

Thanks. I just tested with 4.3.6 and I'm still not seeing any markdown in the output; instead I get just the extracted text, with double newlines between each line:

innovation

Issue 10 June 2025

Current Magazine Supplement

Featured on page 6: Using the innovative

spiral flush piling technique in Kendal.

PREVENTING POLLUTION

Using innovative silt capture at Lowdham Flood Storage

A Jackson Civil Engineering team, working

on behalf of the Environment Agency, is

using an innovative silt capture channel

during the building of a major Flood Storage

Reservoir (FSR). The £30 million FSR will

protect 191 properties and businesses from

flooding in a 1 in 100 year flood event in

the village of Lowdham in Nottinghamshire.

The construction of the reservoir will

simultaneously be providing a biodiversity

increase of 20% Biodiversity Net Gain (BNG).

We are using the channel to keep as much

silt as possible out of a nearby watercourse.

Background

Our scheme involves a high level of earthworks

and much of the water running off the site

goes into a stream called the Cocker Beck. The

use of the silt capture channel as a natural

water treatment system, helps clean the water.

This reduces the risk of polluting nearby

watercourses and causing harm to aquatic life.

We installed the channel, working with Frog

Environmental, at a cost of £8,000.

What did you do differently?

The system over-pumps water into a holding

pond, which is then gravity fed into an

outlet pipe containing gel flocculant bricks.

This is an environmentally safe way of

introducing flocculant into ditches, drains,

pumped discharges or effluent streams to

cleanse muddy water, and trap silt and other

contaminants like nutrients, metals and

hydrocarbons.

A flocculant is an agent that helps

other agents, like fine suspended particles,

clump together into larger particles, which can

be captured more easily. Unmade ground on

construction sites is particularly vulnerable and,

without adequate protection, rain will lead to

fine particles of clayey soil and silt being carried

long distances off-site into streams and rivers.

The water from the pipe then runs

into the silt capture channel made up of a

series of mainly coir products. Formed on an

impermeable membrane and made up of a

combination of Floc Mat™, Silt Mats™ and Silt

Wattles, silt capture channels are used to slow

the flow of water, and to filter and trap silt and

fine sediment.

These mats and wattles hold up to 40

– 50kg of silt sediment and on Lowdham have

a lifespan of typically six months, depending on

the rain and the sediment that is washed down

as a result. Being 100% biodegradable, the

silt-filled mats can be utilised within an area of

landscape fill.

Lessons learnt and future uses

Depending on the available space,

topography and pumping rate, the silt

capture channel can be formed in different

configurations to suit the constraints of the

construction site.

Our site team and supply chain have

collaborated over this passive approach to

create this site-based solution, which could

be replicated by other projects and teams.

It will help reduce carbon emissions as the

project is carried out and ensure that silt

runoff from our project activities does not

affect existing watercourses.

Key stats: £30million flood storage reservoir protecting 191 properties. Construction will provide a BNG of 20% increase.

Challenge: To prevent the risk of polluting nearby watercourses and causing harm to aquatic life.

Solution: Our team installed a simple but innovative silt capture channel.

Team: Environment Agency, Jackson Civil Engineering, Frog Environmental

Contact:

Richard Barnes, Strategic Client Director at Jackson Civil Engineering.

rbarnes@jackson-civils.co.uk

Further information:

Frog Environmental silt control

Benjamin Doughty, Flood Risk Officer, Partnership & Strategic Overview Team, Environment Agency: "The Environment Agency is committed to ensuring that the watercourse is not polluted as a result of the scheme earthworks – and the silt trap channel helps achieve just that. The silt trap channel has successfully reduced the risk of siltation to the river and pollutants to the Cocker Beck which may have otherwise affected fish and other ecology within the river habitat."

Benefits

•

Captures and binds silt, cleaning muddy construction site water.

•

It can be made in any shape and size, and a second channel can be built to handle higher volumes of silty water.

•

Easily monitored with consumable products replaced quickly and simply.

•

No ground disturbance or excavations required.

•

Quick and easy to install.

•

Captures material and holds up to 40–50kg of silt sediment and part can be replaced.

Lowdham silt trap

•

Scaling up decarbonisation

•

Integrating sustainability with BIM

•

Reducing carbon at Lydd

•

Low carbon concrete trial at Hexham FAS

•

Trialling cement binders at Lowdham reservoir

•

Using hydrogen fuel cells for ground engineering

•

Reducing carbon at Crossen's pumping station

•

First-time use of Geofoam for wetlands

•

Reusing asphalt planings at the Day's Lock

•

3D spatial modelling for short-term flood resilience

The new issue of Current Magazine has been

published.

This new edition focuses on the application of

Artificial Intelligence into flood & coastal projects to

save time, cost and carbon. This edition also delves

into projects improving climate resilience and social

value.

Click here to read.

Subscribe. Share. Feedback.

Email

sarah@sustain-communications.com

Join us

Join the

LinkedIn Current &

Innovation Magazine group

to

receive the magazine online.

Share

on

Twitter

,

Facebook

,

Whatsapp

or

LinkedIn

.

Magazine library accessible to all

Don't forget to read our other editions of the Innovation Supplement and Current Magazine. You can

access the full library

here

: You can also click on any of the icons below to open previous issues of

Innovation.

Issue 1

Issue 2

Issue 3

Issue 4

Issue 5

Issue 6

Issue 8

Issue 7

Issue 9

I have also found that this new version takes significantly longer to process than before.

I have also just tried it on a few larger PDFs and found that a large portion of the content is missing from the output each time. For example, when I extract the text from the attached Logic-Combi-ES-Manual.pdf (same code snippet as before), the returned content is missing the first 24 pages. It also takes around 24 seconds to process, whereas before (v4.3.3) it took <1s.
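
A minimal sketch of how a timing comparison like the one above (around 24 seconds versus under 1 second) could be measured reproducibly across versions. The `timed` helper and the `fake_extract` stand-in are hypothetical names introduced here for illustration; in practice the real extraction coroutine would be passed in place of `fake_extract`.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


async def timed(
    func: Callable[..., Awaitable[Any]], *args: Any, **kwargs: Any
) -> tuple[Any, float]:
    """Await func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = await func(*args, **kwargs)
    return result, time.perf_counter() - start


async def fake_extract(path: str) -> str:
    # Hypothetical stand-in for the real extraction call, so the sketch runs on its own.
    await asyncio.sleep(0.01)
    return f"content of {path}"


if __name__ == "__main__":
    result, seconds = asyncio.run(timed(fake_extract, "Logic-Combi-ES-Manual.pdf"))
    print(f"extracted {len(result)} characters in {seconds:.2f}s")
```

Running the same script against each installed version, with caching disabled or cleared between runs, should make the slowdown easier to confirm.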

joe-pierce Feb 19, 2026
Author

For reference, the output from the pymupdf4llm library looks like this, which is closer to what I would expect from the markdown output:

## PREVENTING POLLUTION
#### Using innovative silt capture at Lowdham Flood Storage

**Key stats:** £30million flood
storage reservoir protecting
191 properties. Construction
will provide a BNG of 20%
increase.

**Challenge** : To prevent the
risk of polluting nearby
watercourses and causing
harm to aquatic life.

**Solution** : Our team installed
a simple but innovative silt
capture channel.

**Team:** Environment Agency,
Jackson Civil Engineering,
Frog Environmental

**Contact:**
Richard Barnes, Strategic
Client Director at Jackson Civil
Engineering.
rbarnes@jackson-civils.co.uk

**Further information:**
Frog Environmental silt
control

_Benjamin Doughty, Flood_
_Risk Officer, Partnership &_
_Strategic Overview Team,_
_Environment Agency: "The_
_Environment Agency is_
_committed to ensuring_
_that the watercourse is_
_not polluted as a result of_
_the scheme earthworks –_
_and the silt trap channel_
_helps achieve just that._
_The silt trap channel has_
_successfully reduced the_
_risk of siltation to the river_
_and pollutants to the Cocker_
_Beck which may have_
_otherwise affected fish and_
_other ecology within the_
_river habitat."_

A Jackson Civil Engineering team, working
on behalf of the Environment Agency, is
using an innovative silt capture channel
during the building of a major Flood Storage
Reservoir (FSR). The £30 million FSR will
protect 191 properties and businesses from
flooding in a 1 in 100 year flood event in
the village of Lowdham in Nottinghamshire.
The construction of the reservoir will
simultaneously be providing a biodiversity
increase of 20% Biodiversity Net Gain (BNG).
We are using the channel to keep as much
silt as possible out of a nearby watercourse.

**Background**
Our scheme involves a high level of earthworks
and much of the water running off the site
goes into a stream called the Cocker Beck. The
use of the silt capture channel as a natural
water treatment system, helps clean the water.
This reduces the risk of polluting nearby
watercourses and causing harm to aquatic life.
We installed the channel, working with Frog
Environmental, at a cost of £8,000.

What did you do differently?
The system over-pumps water into a holding
pond, which is then gravity fed into an
outlet pipe containing gel flocculant bricks.
This is an environmentally safe way of
introducing flocculant into ditches, drains,
pumped discharges or effluent streams to
cleanse muddy water, and trap silt and other
contaminants like nutrients, metals and
hydrocarbons.
A flocculant is an agent that helps
other agents, like fine suspended particles,
clump together into larger particles, which can
be captured more easily. Unmade ground on
construction sites is particularly vulnerable and,
without adequate protection, rain will lead to
fine particles of clayey soil and silt being carried
long distances off-site into streams and rivers.
The water from the pipe then runs
into the silt capture channel made up of a
series of mainly coir products. Formed on an
impermeable membrane and made up of a
combination of Floc Mat™, Silt Mats™ and Silt
Wattles, silt capture channels are used to slow
the flow of water, and to filter and trap silt and
fine sediment.
These mats and wattles hold up to 40

- 50kg of silt sediment and on Lowdham have
a lifespan of typically six months, depending on
the rain and the sediment that is washed down
as a result. Being 100% biodegradable, the
silt-filled mats can be utilised within an area of
landscape fill.

_Benefits_

_• Captures and binds silt, cleaning muddy_
_construction site water._

_• It can be made in any shape and size,_
_and a second channel can be built to_
_handle higher volumes of silty water._

_• Easily monitored with consumable_
_products replaced quickly and simply._

_• No ground disturbance or excavations_
_required._

_• Quick and easy to install._

_• Captures material and holds up to_
_40–50kg of silt sediment and part can_
_be replaced._

Lessons learnt and future uses
Depending on the available space,
topography and pumping rate, the silt
capture channel can be formed in different
configurations to suit the constraints of the
construction site.
Our site team and supply chain have
collaborated over this passive approach to
create this site-based solution, which could
be replicated by other projects and teams.
It will help reduce carbon emissions as the
project is carried out and ensure that silt
runoff from our project activities does not
affect existing watercourses.

Lowdham silt trap

20

**The new issue of Current Magazine has been**
**published.**
This new edition focuses on the application of
Artificial Intelligence into flood & coastal projects to
save time, cost and carbon. This edition also delves
into projects improving climate resilience and social
value. Click here to read.

**Subscribe. Share. Feedback.**

Email
sarah@sustain-communications.com

**Join us**
Join the LinkedIn Current &
Innovation Magazine group to

receive the magazine online.

**Share**
on Twitter, Facebook, Whatsapp or

LinkedIn.

**Magazine library accessible to all**
Don't forget to read our other editions of the Innovation Supplement and Current Magazine. You can
access the full library here: You can also click on any of the icons below to open previous issues of
Innovation.

##### **Issue 2**

##### **Issue 3**

##### **Issue 4**

##### **Issue 7 Issue 8 Issue 9**

Goldziher · 2026-02-20T18:04:26Z

Goldziher
Feb 20, 2026
Maintainer

v4.3.7 is being released now with a fix. Let me know if any issue persists once it's out.

1 reply

joe-pierce Feb 24, 2026
Author

Hi, I just tested with v4.3.7 and v4.3.8 and there is no improvement: I get the same output as with v4.3.6 (posted above).
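
One way to check each release quickly is a small heuristic that reports whether the extracted text contains any markdown structure at all. This is only a sketch: `has_markdown_structure` is a hypothetical helper, and the patterns cover just a few common constructs (ATX headings, bullet items, bold text, table rows), so it can miss valid markdown or flag plain text that happens to start a line with a hyphen.

```python
import re

# Patterns for a few common markdown constructs; deliberately not exhaustive.
_MARKDOWN_PATTERNS = [
    re.compile(r"^#{1,6} \S", re.MULTILINE),    # ATX headings such as "## Title"
    re.compile(r"^\s*[-*+] \S", re.MULTILINE),  # bullet list items
    re.compile(r"\*\*[^*\n]+\*\*"),             # bold spans
    re.compile(r"^\|.+\|\s*$", re.MULTILINE),   # table rows
]


def has_markdown_structure(text: str) -> bool:
    """Return True if any of the markdown patterns occurs in the text."""
    return any(pattern.search(text) for pattern in _MARKDOWN_PATTERNS)


if __name__ == "__main__":
    plain = "innovation\n\nIssue 10 June 2025\n\nPREVENTING POLLUTION"
    structured = "## PREVENTING POLLUTION\n\n**Key stats:** £30million flood"
    print(has_markdown_structure(plain))
    print(has_markdown_structure(structured))
```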

toresenneseth · 2026-04-05T15:10:03Z

toresenneseth
Apr 5, 2026

Hi,
I'm running v4.7.2 and seeing the same behavior. Basically, it looks like PDF (I haven't tried anything else) only works with output_format="plain".
If I set output_format to markdown or html, I only get partial data back, with no markdown formatting.
My input is a multi-page PDF file.

7 replies

Goldziher Apr 5, 2026
Maintainer

Cool. Will report back. I see broken URLs (I think it's citations, checking).

toresenneseth Apr 5, 2026

Thanks!

Thinking about it, there's probably something off with that file. I tried with docling and it returns the expected result, but it is painfully slow.
MarkItDown returns almost exactly the same as Kreuzberg when output_format is set to "plain", so it's probably something wrong with the file. My fix for now has been to enable page extraction, loop over every page, and stitch things together manually (no markdown formatting, but I get the text contents).

I tried another file (it's an invoice so I cannot post it here), and it extracts data with Markdown formatting as expected (with output_format="markdown").
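
The page-stitching workaround described above could look roughly like the sketch below. It assumes the per-page text has already been collected into a list of strings; how those pages are obtained from the library is not shown in this thread, so that step is left out. `stitch_pages` is a hypothetical helper name.

```python
def stitch_pages(pages: list[str], separator: str = "\n\n") -> str:
    """Join per-page text into one document, skipping empty pages."""
    # Trim whitespace around each page so the separator is applied consistently.
    cleaned = (page.strip() for page in pages)
    return separator.join(page for page in cleaned if page)


if __name__ == "__main__":
    pages = ["Page one.\n", "", "  Page two."]
    print(stitch_pages(pages))
```

Skipping empty pages avoids runs of blank lines in the stitched result, at the cost of losing the original page boundaries.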

Goldziher Apr 5, 2026
Maintainer

No, there is a bug. Fixing.

toresenneseth Apr 5, 2026

Awesome!
Amazing work btw!

Goldziher Apr 6, 2026
Maintainer

Release in progress. I'm closing this discussion. If you have further issues, please open a GH issue.
