lines sometimes read out of order #50

Closed
strobejb opened this Issue Mar 30, 2012 · 2 comments

@strobejb
Contributor

Hi,

I encountered an issue parsing PDF files, and found a work-around that others might find useful.

The PDF I am parsing is basically tabular data (converted from a spreadsheet), with multiple rows and columns. pdf-reader should return the text line by line. What I found was that, on occasion, some cells in my 'table' were read out of order. For example:

--------------------------------------------
 aaaaa   |  bbbbbb  |  ccccc |  dddddd
 eeeee   |  ffffff  |  ggggg |  hhhhhh
--------------------------------------------
 xxxxxx  |  yyyyyy  |  zzzzz |  zzzzzzz


Hopefully you get the idea. What should be returned is a text stream like:

aaaaa  bbbbb ccccc ddddd
eeeee  fffffffff  ggggg .... 

and so on

What I was getting (occasionally) was:

aaaaa  bbbbb
ddddd
ccccc
eeeee  fffffff ggggg

So you can see the data was returned out-of-order for part of the table. I tracked this down to the following location:

pdf/reader/page_text_receiver.rb (method show_text)

    def show_text(string) # Tj
      raise PDF::Reader::MalformedPDFError, "current font is invalid" if @state.current_font.nil?
      newx, newy = @state.trm_transform(0, 0)
      @content[newy] ||= ""
      @content[newy] << @state.current_font.to_utf8(string)
    end

The problem is the way the newy variable is used as a key. pdf-reader appears to store all the text it parses in a Hash, keyed by the y-coordinate at which the text occurred. In my case, the y-coordinate of each block of text had a tiny amount of variation, enough to result in:

YCOORD  |     TEXT
---------------------------
303.91       aaaaaa
303.91       bbbbbb
303.92       cccccccc
303.91       ddddddd
350.001      eeeeee
350.001      fffffffff
350.001      ggggg

Look at the y-coord for 'cccccccc'. Even though the difference is tiny (.92 vs .91), it is enough for the 'cccccccc' text to be inserted into its own 'row' in the @content Hash, and subsequently be returned out of order when we read the text with pdf.page(x).text. I don't know why this 'error' in the y-coord is there, but it occurs in the PDFs I am parsing.
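
The effect can be reproduced outside pdf-reader with a plain Hash. The sketch below is an illustration only: FRAGMENTS and group_rows are hypothetical names, not part of pdf-reader, and the y values are the ones from the table above. It groups the same fragments twice, once keyed by the exact y-coordinate and once keyed by the rounded one.

```ruby
# Hypothetical [y-coordinate, text] pairs, taken from the table above.
FRAGMENTS = [
  [303.91,  "aaaaaa"],
  [303.91,  "bbbbbb"],
  [303.92,  "cccccccc"],
  [303.91,  "ddddddd"],
  [350.001, "eeeeee"],
  [350.001, "fffffffff"],
  [350.001, "ggggg"],
]

# Collect text into a Hash, keyed by whatever the block returns for each y.
# This mimics the "Hash keyed by y" idea; it is not pdf-reader's code.
def group_rows(fragments)
  rows = {}
  fragments.each do |y, text|
    key = yield(y)
    rows[key] ||= String.new
    rows[key] << text
  end
  rows
end

exact   = group_rows(FRAGMENTS) { |y| y }        # key on the raw float
rounded = group_rows(FRAGMENTS) { |y| y.round }  # key on the nearest integer

puts exact.inspect
puts rounded.inspect
```

With exact keys, 'cccccccc' ends up alone under 303.92 while the rest of its row sits under 303.91. With rounded keys, both values collapse to 304 and the row stays together.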

The solution was to round the y-coordinate to a whole number before inserting it into the @content hash. Just copy and paste the code below into your Ruby program (no need to patch the original code):

    # Define this after pdf-reader has been required, so it reopens the existing class.
    module PDF
      class Reader
        class PageTextReceiver
          def show_text(string) # Tj
            raise PDF::Reader::MalformedPDFError, "current font is invalid" if @state.current_font.nil?
            newx, newy = @state.trm_transform(0, 0)

            # Round so that fragments whose y differs by a fraction share one row.
            newy = newy.round
            @content[newy] ||= ""
            @content[newy] << @state.current_font.to_utf8(string)
          end
        end
      end
    end

Hopefully this is useful for somebody else! Rounding to the nearest whole number worked for me. I don't know enough about the PDF coordinate system to know whether it is a general solution, though...
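
One case where plain rounding is known to fall short: two y values that straddle a .5 boundary (say 303.49 and 303.51) round to different integers even though they are only 0.02 apart. A possible alternative, sketched below under assumptions, is to snap each y to an existing row key within a tolerance. The snap_to_row helper and the 0.5 tolerance are hypothetical and not part of pdf-reader.

```ruby
# Return an existing row key within `tolerance` of y, or y itself if none is close.
# Hypothetical helper for illustration; the default tolerance is an arbitrary choice.
def snap_to_row(row_keys, y, tolerance = 0.5)
  row_keys.find { |key| (key - y).abs <= tolerance } || y
end

rows = {}
[[303.49, "aaa"], [303.51, "bbb"], [350.0, "ccc"]].each do |y, text|
  key = snap_to_row(rows.keys, y)
  rows[key] ||= String.new
  rows[key] << text
end

puts rows.inspect
```

Here 303.51 is snapped to the existing 303.49 row, whereas rounding would have put the two fragments in rows 303 and 304. The trade-off is a linear scan of the keys for every fragment, and the right tolerance depends on the font size and line spacing of the document.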

@yob
Owner
yob commented Mar 30, 2012

Thanks for this feedback, I think your solution is a reasonable one for now.

Ideally I'd like to change the show_text method to intelligently detect words and baselines instead of just naively aligning on the Y co-ordinate. Until then, rounding to an int is probably sensible.

Can you submit a pull request for me to merge?

@yob
Owner
yob commented Apr 7, 2013

pdf-reader 1.3.0+ has improved the way it calculates the Y position of text. I'll close this issue for now, but if you have any specific improvements for pdf-reader 1.3.0 or above please open a new ticket.

@yob yob closed this Apr 7, 2013