Page#text does not return extra whitespaces between words #72

Open
rubemz opened this Issue · 2 comments

@rubemz

There is another change in 1.3.0 that affected our test suite.

It looks like strings that were intentionally created with two or more whitespace characters between words are returned by Page#text with only a single space between those words.

For example, some date strings contain double spaces because of the format mask (%l: hour of the day, 12-hour clock, blank-padded, 1..12). Since Page#text does not return more than a single space between words, the test breaks.
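To make the padding concrete, here is a minimal sketch of how %l produces a double space for single-digit hours. The timestamp and format string are made up for illustration; they are not taken from the test suite in question.

```ruby
# %l is the hour on a 12-hour clock, blank-padded to two characters.
# The timestamp below is a hypothetical example value.
time = Time.new(2013, 1, 5, 9, 30)

# Single-digit hour: "%l" yields " 9", so two spaces follow "at".
padded = time.strftime("at %l:%M %p")

# The "-" flag suppresses the padding, which avoids the double space.
unpadded = time.strftime("at %-l:%M %p")

puts padded.inspect
puts unpadded.inspect
```

With a two-digit hour (10, 11, or 12) the padding disappears, so only some timestamps would trigger the failing comparison.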

Is this the desired behavior: limiting the Page#text output to a single space between words, even though the original string (and the rendered one) has more than one?

@yob
Owner
yob commented

pdf-reader isn't intentionally limiting whitespace between words to a single space.

It's attempting to lay out text of varying sizes and styles onto a canvas that only supports fixed-width text, and it's likely to get things a bit wrong sometimes. There's almost certainly room for improvement. In this case, it thinks the gap between your words is "about" equal to a single space in the current font, so it only leaves a single space.
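As a rough illustration of why a gap can collapse to one space, here is a hypothetical sketch of mapping a horizontal gap onto fixed-width space characters. This is not pdf-reader's actual layout code; the helper name `spaces_for_gap` and the widths are assumptions chosen only to show the rounding effect.

```ruby
# Hypothetical helper, NOT part of pdf-reader: estimate how many fixed-width
# space characters best represent a horizontal gap between two text runs.
def spaces_for_gap(gap_width, space_width)
  return 0 if gap_width <= 0 || space_width <= 0
  (gap_width / space_width.to_f).round
end

space_width = 2.78 # assumed width of one space glyph, in points

puts spaces_for_gap(3.1, space_width) # gap slightly wider than one space
puts spaces_for_gap(4.0, space_width) # gap of roughly 1.4 spaces
puts spaces_for_gap(5.6, space_width) # gap of roughly two spaces
```

Under these assumed numbers, any gap narrower than about 1.5 space widths rounds down to a single space, which is consistent with the behaviour described above.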

There are two areas you could look into to see if they help in your case:

1) remove the "unless utf8_chars == SPACE" check from line 111 of lib/pdf/reader/page_text_receiver.rb
2) See if you can improve the layout logic in PDF::Reader::PageLayout to help

If you find a change that improves the layout code for most people, I'll be happy to merge it.

@Msms-NJ

> 1) remove the "unless utf8_chars == SPACE" check from line 111 of lib/pdf/reader/page_text_receiver.rb

It works.
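If patching the gem is not an option, a test suite could instead compare text with runs of spaces collapsed. The following is only a sketch: `normalize_spaces` is a hypothetical helper, not part of pdf-reader, and the strings stand in for the expected value and for whatever Page#text returned.

```ruby
# Hypothetical test helper, NOT part of pdf-reader: collapse runs of spaces
# and tabs to a single space so padding differences do not fail a comparison.
def normalize_spaces(str)
  str.gsub(/[ \t]+/, " ").strip
end

expected  = "Due at  9:30 AM" # built with a blank-padded %l
extracted = "Due at 9:30 AM"  # stand-in for the text returned by Page#text

puts normalize_spaces(expected) == normalize_spaces(extracted)
```

The trade-off is that such a comparison can no longer detect genuine spacing regressions, so it only suits tests that care about the words rather than the layout.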
