Search strings are not always found correctly #184

sschuberth · 2015-05-03T19:34:08Z

I'm using Sumatra PDF 3.0 (32-bit) on Windows 7 (64-bit). For some PDFs, not all occurrences of a string that is obviously present are found. For example, if you search for "Kund" in this PDF, you'll find the occurrences in "Privatkundengeschäft" and "Geschäftskunden", but not the ones in "Kundin", "Kunde", or "Kunden".

If I save the PDF as a text file from Sumatra, it becomes more or less obvious why. For example, the line that should say

Sehr geehrte Postbank Kundin, sehr geehrter Postbank Kunde,

instead says

peÜr geeÜrte mostÄank hundinI seÜr geeÜrter mostÄank hundeI

To me this looks like some OCR gone mad. To double-check that the text is not stored as an image, I installed Adobe Reader 11.0.10, which is able to search the PDF just fine.
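The two lines quoted above are the same length, so they can be compared character by character to see which characters the extraction gets wrong. The following is a minimal sketch of that comparison in Python. The helper names `diff_mapping` and `garble` are hypothetical, and the resulting table only covers characters that occur in this one line. It illustrates the symptom, not how Sumatra extracts text.

```python
# The line as it should read, and as it appears in the saved text file.
expected = "Sehr geehrte Postbank Kundin, sehr geehrter Postbank Kunde,"
extracted = "peÜr geeÜrte mostÄank hundinI seÜr geeÜrter mostÄank hundeI"


def diff_mapping(expected: str, extracted: str) -> dict[str, str]:
    """Collect every character that differs between the two aligned lines."""
    if len(expected) != len(extracted):
        raise ValueError("lines must have the same length to be aligned")
    mapping: dict[str, str] = {}
    for want, got in zip(expected, extracted):
        if want != got:
            # Each source character must always turn into the same wrong one.
            if mapping.setdefault(want, got) != got:
                raise ValueError(f"inconsistent substitution for {want!r}")
    return mapping


def garble(text: str, mapping: dict[str, str]) -> str:
    """Apply the observed substitutions to arbitrary text (hypothetical helper)."""
    return text.translate(str.maketrans(mapping))


mapping = diff_mapping(expected, extracted)
print(mapping)

# A case-insensitive search for "Kund" misses the capitalized words,
# because "K" was extracted as "h" ...
print("kund" in garble("Kundin", mapping).lower())
# ... but still hits words where the match starts with a lowercase "k".
print("kund" in garble("Privatkundengeschäft", mapping).lower())
```

Only a handful of characters are substituted in this line (`S`, `P`, `K`, `h`, `b` and the comma), while most lowercase letters come through unchanged. That matches the search behaviour described above: words containing lowercase "kund" are found, while words starting with a capital "K" are not.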

sschuberth · 2015-05-03T20:23:34Z

PS: Xpdf's pdftotext seems to have the same issue; it generates (almost) the same extracted text.

sschuberth · 2015-05-04T19:53:14Z

Thanks for fixing this so quickly! For anyone interested, a snapshot build is available here.

zeniko added a commit that referenced this issue May 4, 2015

PDF: ignore /ToUnicode for non-embedded fonts (fixes issue #184)

eaf8243

kjk closed this as completed May 4, 2015

sschuberth mentioned this issue Nov 8, 2016

/ToUnicode mapping issue #471

Closed
