Parse *ALL* content instructions when they're split over multiple content streams

* thanks to Jack Rusher for reporting
yob committed May 31, 2010
1 parent b5ff80c commit 471ad81
Showing 5 changed files with 35 additions and 7 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG
@@ -1,6 +1,8 @@
 v0.8.6 (XXX)
 - initial support for pdf 1.5+ files that use object and xref streams
 - support streams that use a flate filter with the predictor option
+- ensure all content instructions are parsed when split over multiple streams
+  - thanks to Jack Rusher for reporting
 
 v0.8.5 (11th April 2010)
 - fix a regression introduced in 0.8.4.
11 changes: 6 additions & 5 deletions lib/pdf/reader/content.rb
@@ -311,10 +311,8 @@ def walk_pages (page)
 fonts = font_hash_from_resources(current_resources)
 
 if page.has_key?(:Contents) and page[:Contents]
-  contents.each do |content|
-    obj = @xref.object(content)
-    content_stream(obj, fonts)
-  end
+  direct_contents = contents.map { |content| @xref.object(content) }
+  content_stream(direct_contents, fonts)
 end
 
 resources.pop if res
@@ -353,7 +351,10 @@ def current_resources
 # Reads a PDF content stream and calls all the appropriate callback methods for the operators
 # it contains
 def content_stream (instructions, fonts = {})
-  instructions = instructions.unfiltered_data if instructions.kind_of?(PDF::Reader::Stream)
+  instructions = [instructions] unless instructions.kind_of?(Array)
+  instructions = instructions.map { |ins|
+    ins.is_a?(PDF::Reader::Stream) ? ins.unfiltered_data : ins.to_s
+  }.join
   buffer = Buffer.new(StringIO.new(instructions))
   parser = Parser.new(buffer, @xref)
   current_font = nil
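
The change to content_stream matters because a parser that is restarted for every stream loses any parameters still waiting for their operator when a stream ends. The sketch below is a toy model of that behaviour, not PDF::Reader's actual Buffer or Parser: `toy_parse`, `OPERATORS` and the sample streams are all hypothetical names invented for illustration, and the tokenizer only understands bare operators and parenthesised strings.

```ruby
# Hypothetical toy model, not PDF::Reader's parser: only these three
# operators are recognised, every other token is treated as a parameter.
OPERATORS = %w[BT ET Tj].freeze

# Walks one chunk of instructions and returns [operator, params] pairs.
# Parameters still pending when the chunk ends are discarded, which mirrors
# the silent drop described in the bug report.
def toy_parse(instructions)
  calls = []
  params = []
  # A token is either a parenthesised string or a run of non-whitespace.
  instructions.scan(/\([^)]*\)|\S+/) do |token|
    if OPERATORS.include?(token)
      calls << [token, params]
      params = []
    else
      params << token
    end
  end
  calls
end

# The second string's parameter sits at the end of the first stream,
# while its Tj operator opens the second stream.
streams = ["BT (My name is) Tj (James Healy)", " Tj ET"]

# Old approach: each stream is parsed in isolation.
separate = streams.flat_map { |stream| toy_parse(stream) }

# New approach: the streams are concatenated first, then parsed once.
joined = toy_parse(streams.join)
```

In this toy model, `separate` contains a second `Tj` call with an empty parameter list, while `joined` pairs `(James Healy)` with its operator. That is the same symptom the new spec checks for with split_params_and_operator.pdf.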
17 changes: 16 additions & 1 deletion specs/content_spec.rb
@@ -8,7 +8,7 @@ class PDF::Reader::XRef
   attr_accessor :xref
 end
 
-context "The PDF::Reader::Content class" do
+context PDF::Reader::Content do
 
   specify "should send the correct callbacks when processing instructions containing a single text block" do
 
@@ -71,6 +71,21 @@ class PDF::Reader::XRef
     content.content_stream(obj)
   end
 
+  # test for a bug reported by Jack Rusher where params at the end of a stream would be
+  # silently dropped if their matching operator was in the next content stream in a series
+  specify "should send the correct callbacks when processing a PDF with content over multiple streams" do
+
+    receiver = PDF::Reader::RegisterReceiver.new
+
+    filename = File.dirname(__FILE__) + "/data/split_params_and_operator.pdf"
+    PDF::Reader.file(filename, receiver)
+
+    text_callbacks = receiver.all(:show_text_with_positioning)
+    text_callbacks.size.should eql(2)
+    text_callbacks[0][:args].should eql([["My name is"]])
+    text_callbacks[1][:args].should eql([["James Healy"]])
+  end
+
   specify "should send the correct metadata callbacks when processing an PrinceXML PDF" do
 
     receiver = PDF::Reader::RegisterReceiver.new
Binary file added specs/data/split_params_and_operator.pdf
Binary file not shown.
12 changes: 11 additions & 1 deletion specs/meta_spec.rb
@@ -206,7 +206,17 @@ def show_text_with_positioning(*params)
     receiver = PageTextReceiver.new
     PDF::Reader.file(File.dirname(__FILE__) + "/data/indirect_xobject.pdf", receiver)
 
-    # confirm there was a single page of tet
+    # confirm there was a single page of text
     receiver.content.size.should eql(1)
   end
+
+  specify "should correctly process a PDF that uses multiple content streams for a single page" do
+    receiver = PageTextReceiver.new
+    PDF::Reader.file(File.dirname(__FILE__) + "/data/split_params_and_operator.pdf", receiver)
+
+    # confirm there was a single page of text
+    receiver.content.size.should eql(1)
+    receiver.content[0].include?("My name is").should be_true
+    receiver.content[0].include?("James Healy").should be_true
+  end
 end
