I used the sample text.rb program, fed it a small PDF (RubyMine_ReferenceCard.pdf) and got mostly good text output. However, there were many places in the output where an extra character (the number '2') was inserted. For example:
Alt + Shift + N Navigate to Rails 2model/view/controlle2r etc.Ctrl + F FindCtrl + Space Basic code completion2 (the name of any cl2ass, method
Alt + F2 Preview Rails View2 in browserF3 Find next or variable)
The only valid '2' is the one in "Alt + F2". The missing carriage returns ("etc.Ctrl", "FindCtrl", "browserF3") are not an issue; this is a three-column document.
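To make the pattern easier to see, here is a rough sketch that counts and strips the suspect characters in the two sample lines above. The heuristic is only an assumption that happens to fit this sample: a '2' touching a lowercase letter is treated as stray, which leaves the "F2" key name alone. The names `STRAY_TWO`, `stray_count` and `strip_stray` are made up for illustration and are not part of pdf-reader.

```ruby
# Sample output lines copied from the extraction above.
SAMPLE = [
  "Alt + Shift + N Navigate to Rails 2model/view/controlle2r etc.Ctrl + F FindCtrl + Space Basic code completion2 (the name of any cl2ass, method",
  "Alt + F2 Preview Rails View2 in browserF3 Find next or variable)"
].freeze

# Heuristic (an assumption, not pdf-reader behaviour): a '2' directly after
# or directly before a lowercase letter is considered stray. "F2" does not
# match because 'F' is uppercase and the next character is a space.
STRAY_TWO = /(?<=[a-z])2|2(?=[a-z])/

# Number of suspect '2' characters in a line.
def stray_count(line)
  line.scan(STRAY_TWO).size
end

# The line with the suspect '2' characters removed.
def strip_stray(line)
  line.gsub(STRAY_TWO, "")
end

SAMPLE.each do |line|
  puts "#{stray_count(line)} stray: #{strip_stray(line)}"
end
```

This is only a way to inspect the output, not a fix: it would also strip a legitimate '2' that sits next to a lowercase letter.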
Is this just the way of PDFs or is there some problem here?
Thanks for a great gem!
I would need to see the PDF to understand what's going on. Can you email it to me or provide a link?
I think GitHub strips attachments. Can you email it to james - at - yob - dot - id- au ?
OK, I've identified an issue in the way I convert from glyph codes to Unicode code points. I'll try to add a fix to master this afternoon.
Bad news I'm afraid.
My research turned up a scenario where I wasn't correctly following the spec for extracting text to Unicode. Unfortunately, although I fixed that in 1f39fa8, the problem you reported is still there.
The '2' characters are definitely in the content stream, but I've confirmed that Adobe and poppler neither extract nor render them. My best guess is that there is extra rendering detail somewhere (in the font?) that pdf-reader is ignoring, but I haven't been able to find it.
In the short term I'm out of time to keep researching this, but hopefully something will come to me.