Many extra '2' characters in text output #43

Open
JESii opened this Issue Jan 31, 2012 · 12 comments

2 participants

@JESii

I used the sample text.rb program, fed it a small PDF (RubyMine_ReferenceCard.pdf) and got mostly good text output. However, there were many places in the output where an extra character (the number '2') was inserted. For example:

Alt + Shift + N Navigate to Rails 2model/view/controlle2r etc.Ctrl + F FindCtrl + Space Basic code completion2 (the name of any cl2ass, method  
Alt + F2 Preview Rails View2 in browserF3 Find next or variable)

The only valid '2' is the one in "Alt + F2". The missing carriage returns ("etc.Ctrl", "FindCtrl", "browserF3") are not an issue; this is a three-column document.

Is this just the way of PDFs or is there some problem here?

Thanks for a great gem!

@yob
Owner

I would need to see the PDF to understand what's going on. Can you email it to me or provide a link?

@JESii
@yob
Owner

I think GitHub strips attachments. Can you email it to james - at - yob - dot - id- au ?

@JESii
@yob
Owner
yob commented Feb 4, 2012

OK, I've identified an issue in the way I convert from glyph codes to Unicode code points. I'll try to add a fix to master this afternoon.
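For readers unfamiliar with this step, here is a minimal sketch of what mapping glyph codes to Unicode code points can look like when a font supplies a ToUnicode-style table. It is not pdf-reader's implementation and not the fix referred to here: the method names `parse_bfchar` and `decode_glyphs` and the sample table are all hypothetical, and only the simple `beginbfchar` form is handled.

```ruby
# Hypothetical sketch, not pdf-reader code: build a glyph-code => string
# table from the "beginbfchar ... endbfchar" sections of a ToUnicode-style CMap.
def parse_bfchar(cmap_text)
  mapping = {}
  cmap_text.scan(/beginbfchar(.*?)endbfchar/m).flatten.each do |body|
    body.scan(/<(\h+)>\s*<(\h+)>/) do |src, dst|
      # The destination is UTF-16BE hex and may hold several code units
      # (for example a ligature that expands to two letters).
      utf16 = [dst].pack("H*").force_encoding("UTF-16BE")
      mapping[src.to_i(16)] = utf16.encode("UTF-8")
    end
  end
  mapping
end

# Translate a list of glyph codes, marking unmapped codes with U+FFFD
# instead of guessing at a character.
def decode_glyphs(codes, mapping)
  codes.map { |code| mapping.fetch(code, "\uFFFD") }.join
end

# Invented sample table, for illustration only.
sample_cmap = <<~CMAP
  3 beginbfchar
  <01> <0046>
  <02> <0032>
  <03> <00660069>
  endbfchar
CMAP

mapping = parse_bfchar(sample_cmap)
puts decode_glyphs([0x01, 0x02], mapping)  # => F2
```

Falling back to U+FFFD for unmapped codes is a design choice of this sketch; it makes gaps in the table visible in the output rather than silently emitting a wrong character.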

@JESii
@yob
Owner
yob commented Feb 5, 2012

Bad news I'm afraid.

My research turned up a scenario where I wasn't correctly following the spec for extracting text to Unicode. Unfortunately, despite fixing that in 1f39fa8, your issue is still around.

The '2' characters are definitely in the content stream, but I've confirmed that Adobe and Poppler don't extract or render them. My best guess is that there is extra rendering detail somewhere (in the font?) that pdf-reader is ignoring, but I haven't been able to find it.
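As a rough illustration of how one might check that a character really is in the content stream, the sketch below lists the literal strings passed to the `Tj` (show text) operator in an already-decoded stream. The helper names and the sample stream are invented for this example and are not taken from the RubyMine PDF or from pdf-reader. It only shows that the strings are present; it says nothing about why other viewers skip them.

```ruby
# Hypothetical helper: collect the literal strings shown with the Tj operator.
# Handles backslash escapes inside the string, but not nested unescaped
# parentheses, hex strings, or the TJ array form.
def shown_strings(content)
  content.scan(/\(((?:\\.|[^\\()])*)\)\s*Tj/m).flatten
end

# Hypothetical helper: keep only the shown strings that contain the digit.
def strings_with_digit(content, digit = "2")
  shown_strings(content).select { |s| s.include?(digit) }
end

# Invented content stream fragment, for illustration only.
sample_stream = <<~STREAM
  BT
  /F1 9 Tf
  (Navigate to Rails ) Tj
  (2) Tj
  (model/view) Tj
  ET
STREAM

p shown_strings(sample_stream)
p strings_with_digit(sample_stream)
```

To run this against a real document, the decoded content stream for a page would have to be obtained first. How to get it depends on the library version in use, so that step is left out here.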

In the short term I'm out of time to keep researching this, but hopefully something will come to me.

@JESii