PDF doc structure as markdown #391
Replies: 5 comments 12 replies
-
Have you used markdown output for PDFs? Can you give concrete examples?
-
With this PDF as an example: the markdown output doesn't contain any headers, and there are line breaks at the same places where the text wraps in the PDF. I was hoping to get a clear structure of headers and subheaders, proper paragraphs rather than lots of line breaks, and output that more closely follows the document's implied structure. The docling library does a lot of this layout processing and converts to markdown fairly well, but it is far behind kreuzberg in speed and memory use.

I used this code:

    from kreuzberg import ExtractionConfig, ExtractionResult, extract_file

    async def extract_file_with_profiling(file_path: str) -> ExtractionResult:
        # Request markdown output with caching and quality processing enabled.
        config = ExtractionConfig(
            use_cache=True,
            enable_quality_processing=True,
            output_format="markdown",
        )
        return await extract_file(file_path, config=config)

    async def main() -> ExtractionResult:
        return await extract_file_with_profiling(
            r"Innovation-Magazine-_11summer_FC.pdf"
        )

to produce the following output:
-
Please test v4.3.6.
-
v4.3.7 is being released now with a fix. Let me know if any issue persists once it is out.
-
Hi, I've had a quick play around and like the performance I'm seeing for many file formats; however, the PDF outputs I am getting are not great. Are there any plans to detect document and table structure in PDFs and output it as markdown? E.g. with a Word document I get well-structured markdown output, whereas with PDFs no structure is identified and tables come out poorly.
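
For anyone hitting the hard line breaks described in this thread, here is a minimal post-processing sketch that may help as a stopgap. It is not part of kreuzberg: `reflow_paragraphs` is a hypothetical helper, and it assumes paragraphs in the extracted text are separated by blank lines, which may not hold for every PDF. It joins wrapped lines within a paragraph and uses a rough heuristic to rejoin words hyphenated across a wrap. It does not recover headers or tables.

```python
import re


def reflow_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    Assumption: blank lines separate paragraphs in the extracted text.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    reflowed = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.splitlines() if line.strip()]
        joined = ""
        for line in lines:
            if joined.endswith("-") and line[:1].islower():
                # Heuristic: a trailing hyphen followed by a lowercase letter
                # is treated as a word split across a wrap, so drop the hyphen.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        reflowed.append(joined)
    return "\n\n".join(reflowed)
```

The hyphen heuristic will also merge genuine compounds that happen to break at the hyphen (for example "well-" followed by "known"), so treat the result as an approximation rather than a faithful reconstruction.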