pdfrenderer.cpp: Ignore non-text blocks #3959

Merged
merged 1 commit into from
Nov 10, 2022

Conversation

amitdo
Collaborator

@amitdo amitdo commented Nov 7, 2022

Fix #3957.

Collaborator Author

amitdo commented Nov 7, 2022

@egorpugin,

Can you fix sw?

@amitdo amitdo closed this Nov 7, 2022
@egorpugin
Contributor

I'll check sw issues.

@amitdo amitdo reopened this Nov 7, 2022
@amitdo amitdo force-pushed the amitdo-pdf-Ignore-non-text-blocks branch from fb90979 to f4c1946 Compare November 7, 2022 13:16
@amitdo amitdo force-pushed the amitdo-pdf-Ignore-non-text-blocks branch from f4c1946 to c196456 Compare November 8, 2022 06:05
Collaborator Author

amitdo commented Nov 8, 2022

Here is the pdf file that Tesseract produces after applying the patch from this PR.

3957.pdf

Collaborator Author

amitdo commented Nov 9, 2022

Here is the pdf file that Tesseract produces before applying the patch from this PR.

3957-0.pdf

Collaborator Author

amitdo commented Nov 9, 2022

I tested by doing 'select all', 'copy', and then 'paste' into a text file, and by running pdftotext as a second method. There are no spaces in the output for either the 'before' or the 'after' PDF.

Still, there is a difference internally between the files. The 'after' pdf is slightly smaller (in bytes) than the 'before' pdf file.
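
The pdftotext check described above could be scripted roughly as follows. This is only a sketch, not part of the PR: it assumes Poppler's `pdftotext` is on the PATH, and the helper names `extract_text` and `whitespace_only_lines` are made up for illustration.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run Poppler's pdftotext and return the text; '-' sends output to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def whitespace_only_lines(text: str) -> int:
    """Count lines that are non-empty but contain only whitespace."""
    return sum(1 for line in text.splitlines() if line and not line.strip())
```

Comparing `whitespace_only_lines(extract_text("3957-0.pdf"))` with `whitespace_only_lines(extract_text("3957.pdf"))` would show whether the two files differ in stray whitespace-only lines as seen by this one extractor; other PDF readers may behave differently.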

Collaborator Author

amitdo commented Nov 9, 2022

Waiting for feedback from @bleze.

Collaborator Author

amitdo commented Nov 10, 2022

https://www.pdf-online.com/osa/validate.aspx

File: 3957-0.pdf
Compliance: pdf1.5
Result: Document validated successfully.
Details

Validating file "3957-0.pdf" for conformance level pdf1.5

The document does conform to the PDF 1.5 standard.

File: 3957.pdf
Compliance: pdf1.5
Result: Document validated successfully.
Details

Validating file "3957.pdf" for conformance level pdf1.5

The document does conform to the PDF 1.5 standard.

Collaborator Author

amitdo commented Nov 10, 2022

@stweil, can I merge this PR?

It removes unneeded content from the PDF output for documents with non-text blocks and makes the resulting file slightly smaller in bytes.

Contributor

@stweil stweil left a comment


Looks good, thank you!

@stweil stweil merged commit fd83f3d into tesseract-ocr:main Nov 10, 2022
@amitdo amitdo deleted the amitdo-pdf-Ignore-non-text-blocks branch November 10, 2022 08:23

bleze commented Nov 10, 2022

> Waiting for a feedback from @bleze.

I already took the code from the patch and applied it to my copy. I can confirm that the spaces are no longer included in the output. Thank you for fixing this!


Successfully merging this pull request may close these issues.

PDF renderer: Tesseract inserts spaces for non-text blocks it finds
4 participants