The PDF::Reader library implements a PDF parser conforming as much as possible
to the PDF specification from Adobe.

It provides programmatic access to the contents of a PDF file with a high
degree of flexibility.

The PDF 1.7 specification is a weighty document and not all aspects are
currently supported. We welcome submission of PDF files that exhibit
unsupported aspects of the spec to assist with improving our support.

= Installation

The recommended installation method is via RubyGems.

  gem install pdf-reader

= Usage

PDF::Reader is designed with a callback-style architecture. The basic concept
is to build a receiver class and pass an instance of it into PDF::Reader along
with the PDF to process.

As PDF::Reader walks the file and encounters various objects (pages, text,
images, shapes, etc.) it will call methods on the receiver object. What those
methods do is entirely up to you: save the text, extract images, count pages,
read metadata, whatever.

For a full list of the supported callback methods and a description of when they
will be called, refer to PDF::Reader::Content. See the code examples below for a
way to print a list of all the callbacks generated by a file to STDOUT.
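
To make the callback idea concrete, here is a minimal sketch in plain Ruby
that does not use PDF::Reader at all. ToyWalker, TextCollector and the event
list are hypothetical stand-ins invented for this illustration. The sketch
assumes that callbacks the receiver does not define are simply skipped, which
is how the receivers in the examples further down appear to be treated.

```ruby
# Hypothetical stand-in for a parser: it "walks" a fixed list of events
# and forwards each one to the receiver, skipping any callback the
# receiver does not define.
class ToyWalker
  def initialize(events)
    @events = events
  end

  def walk(receiver)
    @events.each do |name, *args|
      receiver.send(name, *args) if receiver.respond_to?(name)
    end
  end
end

# A receiver that only cares about page starts and text.
class TextCollector
  attr_reader :pages

  def initialize
    @pages = []
  end

  def begin_page
    @pages << String.new
  end

  def show_text(string)
    @pages.last << string
  end
end

events = [
  [:begin_page],
  [:show_text, "Chunky"],
  [:end_page],            # skipped: TextCollector has no end_page method
  [:begin_page],
  [:show_text, "Bacon"],
  [:end_page]
]

receiver = TextCollector.new
ToyWalker.new(events).walk(receiver)
puts receiver.pages.inspect
```

The receiver only implements the callbacks it is interested in, so adding a
new kind of event to the walker does not require changing existing receivers.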

= Text Encoding

Internally, text can be stored inside a PDF in various encodings, including
ZapfDingbats, win-1252, mac roman and a form of Unicode. To avoid confusion,
all text will be converted to UTF-8 before it is passed back from PDF::Reader.
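
As an illustration of what converting to UTF-8 involves, the following sketch
uses Ruby's built-in String#encode (available from Ruby 1.9 onwards) on a few
win-1252 bytes. It is not PDF::Reader's internal conversion code, and the
sample bytes are made up for the example.

```ruby
# Six bytes of win-1252 text: a left curly quote (0x93), "caf",
# an e-acute (0xE9) and a right curly quote (0x94).
raw = [0x93, 0x63, 0x61, 0x66, 0xE9, 0x94].pack("C*")

# Tell Ruby how to interpret the bytes, then transcode them to UTF-8.
raw.force_encoding("Windows-1252")
utf8 = raw.encode("UTF-8")

puts utf8.encoding   # UTF-8
puts raw.bytesize    # 6
puts utf8.bytesize   # 11, the non-ASCII characters now take 2 or 3 bytes each
```

A receiver that only ever sees UTF-8 strings does not need to know which of
the source encodings a particular PDF used.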

= Exceptions

There are two key exceptions that you will need to watch out for when processing a 
PDF file:

MalformedPDFError - The PDF appears to be corrupt in some way. If you believe the 
file should be valid, or that a corrupt file didn't raise an exception, please 
forward a copy of the file to the maintainers and we can attempt to improve the code.

UnsupportedFeatureError - The PDF uses a feature that PDF::Reader doesn't currently 
support. Again, we welcome submissions of PDF files that exhibit these features to help 
us with future code improvements.

= Maintainers

- Peter Jones <mailto:pjones@pmade.com>
- James Healy <mailto:jimmy@deefa.com>

= Mailing List

Any questions or feedback should be sent to the PDF::Reader Google group.

http://groups.google.com/group/pdf-reader

= Examples

The easiest way to explain how this works in practice is to show some examples.

== Page Counter

A simple app to count the number of pages in a PDF file.

  require 'rubygems'
  require 'pdf/reader'

  class PageReceiver
    attr_accessor :page_count

    def initialize
      @page_count = 0 
    end
    
    # Called when page parsing ends
    def end_page
      @page_count += 1
    end
  end

  receiver = PageReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  puts "#{receiver.page_count} pages"

== List all callbacks generated by a single PDF

WARNING: this will generate a *lot* of output, so you probably want to pipe
it through less or redirect it to a text file.
  
  require 'rubygems'
  require 'pdf/reader'

  receiver = PDF::Reader::RegisterReceiver.new
  pdf = PDF::Reader.file("somefile.pdf", receiver)
  receiver.callbacks.each do |cb|
    puts cb
  end

== Extract metadata only 

  require 'rubygems'
  require 'pdf/reader'

  class MetaDataReceiver
    attr_accessor :regular
    attr_accessor :xml

    def metadata(data)
      @regular = data
    end

    def metadata_xml(data)
      @xml = data
    end
  end

  receiver = MetaDataReceiver.new
  pdf = PDF::Reader.file(ARGV.shift, receiver, :pages => false, :metadata => true)
  puts receiver.regular.inspect
  puts receiver.xml.inspect

== Basic RSpec of a generated PDF 

  require 'rubygems'
  require 'pdf/reader'
  require 'pdf/writer'
  require 'spec'

  class PageTextReceiver
    attr_accessor :content

    def initialize
      @content = []
    end

    # Called when page parsing starts
    def begin_page(arg = nil)
      @content << ""
    end

    def show_text(string, *params)
      @content.last << string.strip
    end

    # there are a few text callbacks, so make sure we process them all
    alias :super_show_text :show_text
    alias :move_to_next_line_and_show_text :show_text
    alias :set_spacing_next_line_show_text :show_text

    def show_text_with_positioning(*params)
      params = params.first
      params.each { |str| show_text(str) if str.kind_of?(String)}
    end
  end

  context "My generated PDF" do
    specify "should have the correct text on 2 pages" do

      # generate our PDF
      pdf = PDF::Writer.new
      pdf.text "Chunky", :font_size => 32, :justification => :center
      pdf.start_new_page
      pdf.text "Bacon", :font_size => 32, :justification => :center
      pdf.save_as("chunkybacon.pdf")

      # process the PDF
      receiver = PageTextReceiver.new
      PDF::Reader.file("chunkybacon.pdf", receiver)

      # confirm the text appears on the correct pages
      receiver.content.size.should eql(2)
      receiver.content[0].should eql("Chunky")
      receiver.content[1].should eql("Bacon")
    end
  end

== Extract ISBNs

Parse all text in the requested PDF file and print out any valid book ISBNs.
Requires the rbook-isbn gem.

  require 'rubygems'
  require 'pdf/reader'
  require 'rbook/isbn'

  class ISBNReceiver

    # there are a few text callbacks, so make sure we process them all
    def show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def super_show_text(string, *params)
      process_words(string.split(/\W+/))
    end

    def move_to_next_line_and_show_text(string)
      process_words(string.split(/\W+/))
    end

    def set_spacing_next_line_show_text(aw, ac, string)
      process_words(string.split(/\W+/))
    end

    private

    # check each item in the supplied array and print any valid ISBN to the
    # console, converted to ISBN-13
    def process_words(words)
      words.each do |word|
        word.strip!
        puts RBook::ISBN.convert_to_isbn13(word) if RBook::ISBN.valid_isbn?(word)
      end
    end
  end

  receiver = ISBNReceiver.new
  PDF::Reader.file("somefile.pdf", receiver)

= Known Limitations

The order of the callbacks is unpredictable, and is dependent on the internal
layout of the file, not on the order in which objects are displayed to the
user. As a consequence, it is highly unlikely that text will be completely in
order.

Occasionally some text cannot be extracted properly due to the way it has been stored, or the use
of invalid bytes. In these cases PDF::Reader will output a little UTF-8 friendly box to indicate
an unrecognisable character.
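
The general idea of swapping invalid bytes for a visible placeholder can be
sketched with Ruby's built-in String#scrub (Ruby 2.1 and later). This is only
an illustration: the placeholder character chosen here (U+25AF) is an
assumption, and PDF::Reader's own mechanism and placeholder may differ.

```ruby
# "A", one byte that is never valid in UTF-8 (0xFF), then "B".
damaged = [0x41, 0xFF, 0x42].pack("C*").force_encoding("UTF-8")
puts damaged.valid_encoding?   # false

# Replace every invalid byte sequence with a box character.
# The choice of U+25AF is an assumption made for this example.
placeholder = "\u25AF"
cleaned = damaged.scrub(placeholder)

puts cleaned.valid_encoding?   # true
puts cleaned.length            # 3
```

Code that post-processes extracted text can then search for the placeholder
to find the spots where characters could not be recovered.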

= Resources

- PDF::Reader Homepage: http://software.pmade.com/pdfreader
- PDF::Reader Rubyforge Page: http://rubyforge.org/projects/pdf-reader/
- PDF Specification: http://www.adobe.com/devnet/pdf/pdf_reference.html
- PDF Tutorial Slide Presentations: http://home.comcast.net/~jk05/presentations/PDFTutorials.html