page.text failing with: Undefined method `[]' #91

Closed
duncan-bayne opened this Issue Mar 14, 2013 · 5 comments

@duncan-bayne

I have loaded a PDF (produced by our invoice system) into pdf-reader, but any attempt to extract page text is met with the following type of error:

1.9.3-p392 :006 > reader = PDF::Reader.new('test.pdf')
 => #<PDF::Reader:0x838ccac @cache=<PDF::Reader::ObjectCache size: 0>, @objects=<PDF::Reader::ObjectHash size: 67>>
1.9.3-p392 :007 > reader.pages.first.text
NoMethodError: undefined method `[]' for #<PDF::Reader::Reference:0x83cd194 @id=3, @gen=0>
from /home/duncan/.rvm/gems/ruby-1.9.3-p392/gems/pdf-reader-1.3.2/lib/pdf/reader/page_layout.rb:17:in `initialize'
from /home/duncan/.rvm/gems/ruby-1.9.3-p392/gems/pdf-reader-1.3.2/lib/pdf/reader/page_text_receiver.rb:49:in `new'
from /home/duncan/.rvm/gems/ruby-1.9.3-p392/gems/pdf-reader-1.3.2/lib/pdf/reader/page_text_receiver.rb:49:in `content'
from /home/duncan/.rvm/gems/ruby-1.9.3-p392/gems/pdf-reader-1.3.2/lib/pdf/reader/page.rb:76:in `text'
from (irb):7
from /home/duncan/.rvm/rubies/ruby-1.9.3-p392/bin/irb:16:in `<main>'

System details:

  • ruby-1.9.3-p392 [ i686 ] via RVM
  • pdf-reader 1.3.2

The problem doesn't seem to occur with other PDFs.

@yob
Owner
yob commented Mar 14, 2013

Thanks for the report. Are you able to email me a sample PDF that triggers this exception? My address is james@yob.id.au

@duncan-bayne

I'm already working on that ... catch is the PDF that's going bang is an actual customer invoice with commercially sensitive data on it.

I've asked folks for a de-identified PDF, and if that reproduces the error, I'll email it to you immediately.

@yob added a commit that closed this issue May 12, 2013
the MediaBox might be an indirect object
* fixes #91
7871128
@yob closed this in 7871128 May 12, 2013
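
The commit message points at the cause: in these PDFs the page's MediaBox is stored as a reference to another object rather than as an inline array, so indexing it with `[]` fails until the reference is resolved. The contents of commit 7871128 are not shown in this thread, so the following is only a minimal sketch of the idea. `Reference` and `ObjectStore` here are hypothetical stand-ins written for this example, not pdf-reader's own classes.

```ruby
# Hypothetical stand-in for an indirect reference (object id + generation).
# Deliberately a plain class: like the object in the backtrace, it has no `[]`.
class Reference
  attr_reader :id, :gen

  def initialize(id, gen)
    @id = id
    @gen = gen
  end
end

# Hypothetical stand-in for the document's object table.
class ObjectStore
  def initialize(objects)
    @objects = objects
  end

  # Follow references until a direct (non-reference) object is reached.
  def deref(obj)
    obj = @objects.fetch([obj.id, obj.gen]) while obj.is_a?(Reference)
    obj
  end
end

# Object 3 0 holds the actual MediaBox array (values made up for the example).
store = ObjectStore.new({ [3, 0] => [0, 0, 612, 792] })

# The page dictionary only points at it.
page_attributes = { MediaBox: Reference.new(3, 0) }

# Indexing page_attributes[:MediaBox] directly would raise NoMethodError,
# because a Reference does not respond to `[]`. Resolve it first.
box = store.deref(page_attributes[:MediaBox])

width  = box[2] - box[0]
height = box[3] - box[1]
puts "page is #{width} x #{height}"
```

Resolving the value before use works for both cases, since `deref` hands back a direct object unchanged.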
@duncan-bayne

I never managed to get my hands on a de-identified PDF that reproduced the issue. Sorry :-/

@endymion added a commit to endymion/pdf-reader-issue-91-demonstration that referenced this issue Dec 6, 2013
Demonstrates the problem. 0ee2f2f
@endymion
endymion commented Dec 6, 2013

Hi, I have a PDF that isn't too sensitive that demonstrates the problem. I was tinkering with the gem and ran into the problem, so I packed up what I was working on and pushed it to GitHub:

demo project: https://github.com/endymion/pdf-reader-issue-91-demonstration

output: https://gist.github.com/endymion/df6af4daa0abdc1c8c8a

I'm moving on to some other solution so this is not a problem for me. Just hoping to contribute...

@koenhandekyn

hey, i'm having a similar issue with some PDFs. see below for the error. it's not that it can't read the PDF, because i can walk the tree and get the text out of it correctly by walking. i also see a seemingly related error with other PDFs that 'walk' fine but where the built-in text extraction doesn't work either.

error 1
/Users/upnxt/.rvm/gems/ruby-2.0.0-p247/gems/pdf-reader-1.3.3/lib/pdf/reader/page_layout.rb:17:in `initialize': undefined method `[]' for #<PDF::Reader::Reference:0x007faeba36a180 @id=75, @gen=0> (NoMethodError)

error 2
/Users/upnxt/.rvm/gems/ruby-2.0.0-p247/gems/pdf-reader-1.3.3/lib/pdf/reader/width_calculator/built_in.rb:93:in `glyph_width': Unknown glyph width for 160 Helvetica-Bold (ArgumentError)

i made a very simple walker for extracting the text (but it doesn't handle whitespace well in many cases)

    class TextWalker
      ...
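      # collect every piece of text (@gold is presumably set up in the elided part)
      def mine(text)
        @gold << text
      end
      # plain text-showing operator: the operand is a single string
      def show_text(text)
        mine text
      end
      # positioned text: the operand is an array; keep the even-indexed entries
      def show_text_with_positioning(text)
        extracted = text.select.with_index { |v, i| i.even? }.join
        mine extracted
      end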
    end
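
A possible refinement of the walker, offered as a sketch rather than a tested fix: the array passed to `show_text_with_positioning` mixes strings with numeric position adjustments, and the two are not guaranteed to alternate, so picking even indices can pull numbers into the output. Filtering by type avoids that. The class name `StringOnlyTextWalker` and the sample operands are made up for this example, and the callbacks are invoked by hand here so the snippet runs without a PDF.

```ruby
# Illustrative receiver using the same callback names as the walker above.
class StringOnlyTextWalker
  attr_reader :gold

  def initialize
    @gold = []
  end

  # Plain text-showing operator: the operand is a single string.
  def show_text(text)
    @gold << text
  end

  # Positioned text: the operand is an array of strings and numbers in any
  # order, so keep the strings by type instead of by index.
  def show_text_with_positioning(items)
    @gold << items.select { |item| item.is_a?(String) }.join
  end
end

walker = StringOnlyTextWalker.new
walker.show_text("Invoice")
# Two consecutive numbers break the "every other element" assumption:
# even-index selection would yield "To-15120" for this operand.
walker.show_text_with_positioning(["To", -20, -15, "tal", 120, ": 42"])

puts walker.gold.inspect
# prints ["Invoice", "Total: 42"]
```

With pdf-reader itself, a receiver like this would be handed to `page.walk(walker)` instead of being called directly. This does nothing for the whitespace problem mentioned above, since the spacing information lives in the numbers being discarded.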