Extending LLM usage for PDFs when the extracted text is empty after pdfminer #1285

gjmveloso · 2025-06-06T23:16:46Z

Initial work on using an LLM to perform OCR on a PDF when pdfminer returns empty text.


gjmveloso · 2025-06-06T23:19:17Z

@microsoft-github-policy-service agree


dillonstreator · 2025-06-10T21:50:05Z

packages/markitdown/src/markitdown/converters/_pdf_converter.py

+                    prompt=llm_prompt,
+                )
+
+        return DocumentConverterResult(markdown=str(markdown))


This does not cover PDFs that contain both mineable text and images that contain text. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.

gjmveloso · 2025-06-11

Are you thinking of something like replacing the use of extract_text with extract_pages and iterating over its non-text elements, such as LTImage and LTFigure?

Layout system reference:
https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#topic-pdf-to-text-layout

dillonstreator · 2025-06-11

Yes, that would allow much more reliable, predictable, and comprehensive text extraction.
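
For illustration only, the idea discussed above could be sketched roughly as follows. This is not code from the PR. `summarize_elements`, `should_ocr`, `pages_with_images`, and the `mode` values are hypothetical names; only `extract_pages`, `LTTextContainer`, `LTImage`, and `LTFigure` come from pdfminer.six. The layout types are passed in as parameters so the traversal can be exercised without pdfminer installed.

```python
from typing import Iterable, Iterator, List, Tuple, Type


def summarize_elements(
    elements: Iterable[object],
    text_type: Type,
    image_type: Type,
    container_type: Type,
) -> Tuple[str, List[object]]:
    """Return (joined text, image elements) for one page's layout tree."""
    text_parts: List[str] = []
    images: List[object] = []
    for element in elements:
        if isinstance(element, text_type):
            text_parts.append(element.get_text())
        elif isinstance(element, image_type):
            images.append(element)
        elif isinstance(element, container_type):
            # Figures can nest images (and further figures), so recurse.
            nested_text, nested_images = summarize_elements(
                element, text_type, image_type, container_type
            )
            text_parts.append(nested_text)
            images.extend(nested_images)
    return "".join(text_parts), images


def should_ocr(text: str, images: List[object], mode: str = "auto") -> bool:
    """Hypothetical caller override: 'auto', 'always', or 'never'."""
    if mode == "always":
        return True
    if mode == "never":
        return False
    if mode == "auto":
        # Assumption: OCR a page when it has images or no mineable text.
        return bool(images) or not text.strip()
    raise ValueError(f"unknown OCR mode: {mode!r}")


def pages_with_images(pdf_path: str) -> Iterator[Tuple[int, str, List[object]]]:
    """Yield (page_number, text, images) using pdfminer.six's layout API."""
    # Imported lazily so the helpers above do not require pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTFigure, LTImage, LTTextContainer

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        text, images = summarize_elements(
            page, LTTextContainer, LTImage, LTFigure
        )
        yield page_number, text, images
```

With something like this, a converter could decide page by page whether to send content to the LLM instead of treating the whole document as either text or scan.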

Extending LLM usage for PDFs where the extracted text was empty with …

117ffa2

…pdfminer

gjmveloso changed the title ~~Extending LLM usage for PDFs where the extracted text was empty with pdfminer~~ Extending LLM usage for PDFs when the extracted text is empty after pdfminer Jun 6, 2025

gjmveloso mentioned this pull request Jun 6, 2025

Add OCR fallback for scanned/non-searchable PDFs (#1156) #1268

Open

gjmveloso added 2 commits June 6, 2025 19:41

Improving prompt to avoid translation/generation of new content

f104813

Improving prompt to avoid translation/generation of new content

54ebd32

gjmveloso marked this pull request as draft June 9, 2025 19:38

- Prompt improvements for non-Gemini models

c83bacc

- Proper handling of file_stream positioning after an empty result from pdfminer
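
As a rough sketch of why stream positioning matters here (not the PR's actual code): the first extractor consumes the stream, so the fallback would see an exhausted stream unless it is rewound first. `extract_with_fallback` and its two callables are hypothetical stand-ins for the pdfminer call and the LLM call.

```python
import io
from typing import BinaryIO, Callable


def extract_with_fallback(
    file_stream: BinaryIO,
    extract: Callable[[BinaryIO], str],
    fallback: Callable[[BinaryIO], str],
) -> str:
    """Try `extract`; if it yields only whitespace, rewind and try `fallback`."""
    start = file_stream.tell()  # remember where the caller left the stream
    text = extract(file_stream)
    if text.strip():
        return text
    # The first extractor has advanced the stream; restore the position
    # so the fallback reads the same bytes from the beginning.
    file_stream.seek(start)
    return fallback(file_stream)
```

Saving `tell()` rather than calling `seek(0)` is a design choice in this sketch: it respects a caller that hands over a stream already positioned past some header.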

gjmveloso marked this pull request as ready for review June 9, 2025 22:08

dillonstreator reviewed Jun 10, 2025
