Extending LLM usage for PDFs when the extracted text is empty after pdfminer #1285

Open
wants to merge 4 commits into
base: main
from

Conversation

gjmveloso

Initial work on using an LLM to perform OCR on a PDF when pdfminer returns empty text.

@gjmveloso changed the title from "Extending LLM usage for PDFs where the extracted text was empty with pdfminer" to "Extending LLM usage for PDFs when the extracted text is empty after pdfminer" on Jun 6, 2025
@gjmveloso
Author

@microsoft-github-policy-service agree

@gjmveloso marked this pull request as draft on June 9, 2025 at 19:38
- Proper handling of file_stream positioning after an empty result from pdfminer
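
The stream-positioning fix in this commit can be illustrated with a minimal sketch. It assumes the first extractor consumes the stream, so the position must be restored before a second extractor reads it. The names `extract_with_fallback`, `primary`, and `fallback` are hypothetical and are not taken from the PR diff; in the PR, the first step would presumably be pdfminer text extraction and the second the LLM path.

```python
import io
from typing import BinaryIO, Callable


def extract_with_fallback(
    file_stream: BinaryIO,
    primary: Callable[[BinaryIO], str],
    fallback: Callable[[BinaryIO], str],
) -> str:
    """Try `primary`; if it yields only whitespace, rewind and try `fallback`.

    Both callables are hypothetical stand-ins for the real extractors.
    """
    start = file_stream.tell()  # remember where the caller left the stream
    text = primary(file_stream)
    if text.strip():
        return text
    # The primary extractor has read from the stream, so restore the
    # original position before handing it to the fallback.
    file_stream.seek(start)
    return fallback(file_stream)


if __name__ == "__main__":
    stream = io.BytesIO(b"image-only content")
    result = extract_with_fallback(
        stream,
        primary=lambda s: (s.read(), "")[1],       # reads everything, finds no text
        fallback=lambda s: s.read().decode("ascii"),
    )
    print(result)
```

Without the `seek`, the fallback would start reading at the end of the stream and see no data.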
@gjmveloso marked this pull request as ready for review on June 9, 2025 at 22:08
prompt=llm_prompt,
)

return DocumentConverterResult(markdown=str(markdown))

There is an issue with PDFs that contain both mineable text and images that contain text. It would be nice to have a more sophisticated branching mechanism that accounts for this, and/or an API that lets the markitdown caller override the behavior.
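
One possible shape for such a branching rule with a caller override, as a rough sketch only. `OcrMode` and `should_use_llm_ocr` are hypothetical names and do not exist in markitdown; the policy for the automatic mode is an assumption.

```python
from enum import Enum


class OcrMode(Enum):
    """Hypothetical caller-facing override for the OCR fallback."""
    AUTO = "auto"      # decide from what the PDF contains
    ALWAYS = "always"  # always run the LLM pass
    NEVER = "never"    # never run the LLM pass


def should_use_llm_ocr(
    extracted_text: str,
    has_images: bool,
    mode: OcrMode = OcrMode.AUTO,
    llm_available: bool = True,
) -> bool:
    """Decide whether to send the PDF (or its images) to the LLM."""
    if mode is OcrMode.NEVER or not llm_available:
        return False
    if mode is OcrMode.ALWAYS:
        return True
    # AUTO (assumed policy): use the LLM when nothing was mined, or when
    # images are present and may carry text that the text layer lacks.
    return not extracted_text.strip() or has_images
```

With this rule, a PDF with mineable text and no images skips the LLM in automatic mode, while a mixed PDF does not.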

Author

Are you thinking of something like replacing extract_text with extract_pages and iterating over the non-text layout elements, such as LTImage and LTFigure?

Layout system reference:
https://pdfminersix.readthedocs.io/en/latest/topic/converting_pdf_to_text.html#topic-pdf-to-text-layout
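
A rough sketch of that idea, assuming pdfminer.six is installed. `extract_pages`, `LTTextContainer`, and `LTImage` are real pdfminer.six names; the helpers `walk`, `page_has_text_and_images`, and `scan_pdf` are hypothetical and are not part of this PR.

```python
from typing import Any, Iterator, Tuple, Type


def walk(element: Any) -> Iterator[Any]:
    """Yield a layout element and, depth-first, everything nested inside it."""
    yield element
    try:
        children = iter(element)  # containers such as LTPage and LTFigure are iterable
    except TypeError:
        return  # leaf objects such as LTImage are not
    for child in children:
        yield from walk(child)


def page_has_text_and_images(
    page: Any, text_cls: Type, image_cls: Type
) -> Tuple[bool, bool]:
    """Report whether a page holds non-empty text and/or embedded images."""
    has_text = has_image = False
    for element in walk(page):
        if isinstance(element, text_cls) and element.get_text().strip():
            has_text = True
        elif isinstance(element, image_cls):
            has_image = True
    return has_text, has_image


def scan_pdf(path: str) -> Iterator[Tuple[int, bool, bool]]:
    """Yield (page_number, has_text, has_image) for each page of a PDF."""
    # Imported here so the helpers above do not require pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTImage, LTTextContainer

    for number, page in enumerate(extract_pages(path), start=1):
        has_text, has_image = page_has_text_and_images(page, LTTextContainer, LTImage)
        yield number, has_text, has_image
```

Pages that report an image could then be routed to the LLM individually, instead of deciding once for the whole document.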

Yes - that would allow a much more reliable, predictable, and comprehensive text extraction.


2 participants