Skip to content
This repository
Fetching contributors…

Cannot retrieve contributors at this time

file 145 lines (125 sloc) 7.194 kb
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145
v0.9.0 (19th November 2010)
- support for pdf 1.5+ files that use object and xref streams
- support streams that use a flate filter with the predictor option
- ensure all content instructions are parsed when split over multiple stream
  - thanks to Jack Rusher for reporting
- Various string parsing bug
  - some character conversions to utf-8 were failing (thanks Andrea Barisani)
  - hashes with nested hex strings were tokenising wronly (thanks Evan Arnold)
  - escaping bug in tokenising of literal strings (thanks David Westerink)
- Fix a bug that prevented PDFs with white space after the EOF marker from loading
  - thanks to Solomon White for reporting the issue
- Add support for de-filtering some LZW compressed streams
  - thanks to Jose Ignacio Rubio Iradi for the patch
- some small speed improvements
- API CHANGE: PDF::Hash renamed to PDF::Reader::ObjectHash
  - having a class named Hash was confusing for users

v0.8.6 (27th August 2010)
- new method: hash#page_references
  - returns references to all page objects, gives rapid access to objects
    for a given page

v0.8.5 (11th April 2010)
- fix a regression introduced in 0.8.4.
  - Parameters passed to resource_font callback were inadvertently changed

v0.8.4 (30th March 2010)
- fix parsing of files that use Form XObjects
  - thanks to Andrea Barisani for reporting the issue
- fix two issues that caused a small number of characters to convert to Unicode
  incorrectly
  - thanks to Andrea Barisani for reporting the issue
- require 'pdf-reader' now works a well as 'pdf/reader'
  - good practice to have the require file match the gem name
  - thanks to Chris O'Meara for highlighting this

v0.8.3 (14th February 2010)
- Fix a bug in tokenising of hex strings inside dictionaries
  - Thanks to Brad Ediger for detecting the issue and proposing a solution

v0.8.2 (1st January 2010)
- Fix parsing of files that use Form XObjects behind an indirect reference
  (thanks Cornelius Illi and Patrick Crosby)
- Rewrote Buffer class to fix various speed issues reported over the years
  - On my sample file extracting full text reduced from 220 seconds to 9 seconds.

v0.8.1 (27th November 2009)
- Added PDF::Hash#version. Provides access to the source file PDF version

v0.8.0 (20th November 2009)
- Added PDF::Hash. It provides direct access to objects from a PDF file
  with an API that emulates the standard Ruby hash

v0.7.7 (11th September 2009)
- Trigger callbacks contained in Form XObjects when we encounter them in a
  content stream
- Fix inheritance of page resources to comply with section 3.6.2

v0.7.6 (28th August 2009)
- Various bug fixes that increase the files we can successfully parse
  - Treat float and integer tokens differently (thanks Neil)
  - Correctly handle PDFs where the Kids element of a Pages dict is an indirect
    reference (thanks Rob Holland)
  - Fix conversion of PDF strings to Ruby strings on 1.8.6 (thanks Andrès Koetsier)
  - Fix decoding with ASCII85 and ASCIIHex filters (thanks Andrès Koetsier)
  - Fix extracting inline images from content streams (thanks Andrès Koetsier)
  - Fix extracting [ ] from content streams (thanks Christian Rishøj)
  - Fix conversion of text to UTF8 when the cmap uses bfrange (thanks Federico Gonzalez Lutteroth)

v0.7.5 (27th August 2008)
- Fix a 1.8.7ism

v0.7.4 (7th August 2008)
- Raise a MalformedPDFError if a content stream contains an unterminated string
- Fix an bug that was causing an endless loop on some OSX systems
  - valid strings were incorrectly thought to be unterminated
  - thanks to Jeff Webb for playing email ping pong with me as I tracked this
    issue down

v0.7.3 (11th June 2008)
- Add a high level way to get direct access to a PDF object, including a new executable: pdf_object
- Fix a hard loop bug caused by a content stream that is missing a final operator
- Significantly simplified the internal code for encoding conversions
  - Fixes YACC parsing bug that occurs on Fedora 8's ruby VM
- New callbacks
  - page_count
  - pdf_version
- Fix a bug that prevented a font's BaseFont from being recorded correctly

v0.7.2 (20th May 2008)
- Throw an UnsupportedFeatureError if we try to open an encrypted/secure PDF
- Correctly handle page content instruction sets with trailing whitespace
- Represent PDF Streams with a new object, PDF::Reader::Stream
  - their really wasn't any point in separating the stream content from it's associated dict. You need both
    parts to correctly interpret the content

v0.7.1 (6th May 2008)
- Non-page strings (ie. metadata, etc) are now converted to UTF-8 more accurately
- Fixed a regression between 0.6.2 and 0.7 that prevented difference tables from being applied
  correctly when translating text into UTF-8

v0.7 (6th May 2008)
- API INCOMPATIBLE CHANGE: any hashes that are passed to callbacks use symbols as keys instead of PDF::Reader::Name instances.
- Improved support for converting text in some PDF files to unicode
- Behave as expected if the Contents key in a Page Dict is a reference
- Include some basic metadata callbacks
- Don't interpret a comment token (%) inside a string as a comment
- Small fixes to improve 1.9 compatibility
- Improved our Zlib deflating to make it slightly more robust - still some more issues to work out though
- Throw an UnsupportedFeatureError if a pdf that uses XRef streams is opened
- Added an option to PDF::Reader#file and PDF::Reader#string to enable parsing of only parts of a PDF file(ie. only metadata, etc)

v0.6.2 (22nd March 2008)
- Catch low level errors when applying filters to a content stream and raise a MalformedPDFError instead.
- Added support for processing inline images
- Support for parsing XRef tables that have multiple subsections
- Added a few callbacks to improve the way we supply information on page resources
- Ignore whitespace in hex strings, as required by the spec (section 3.2.3)
- Use our "unknown character box" when a single character in an Identity-H string fails to decode
- Support ToUnicode CMaps that use the bfrange operator
- Tweaked tokenising code to ensure whitespace doesn't get in the way

v0.6.1 (12th March 2008)
- Tweaked behaviour when we encounter Identity-H encoded text that doesn't have a ToUnicode mapping. We
  just replace each character with a little box.
- Use the same little box when invalid characters are found in other encodings instead of throwing an ugly
  NoMethodError.
- Added a method to RegisterReceiver that returns all occurrences of a callback

v0.6.0 (27th February 2008)
- all text is now transparently converted to UTF-8 before being passed to the callbacks.
  before this version, text was just passed as a byte level copy of what was in the PDF file, which
  was mildly annoying with some encodings, and resulted in garbled text for Unicode encoded text.
- Fonts that use a difference table are now handled correctly
- fixed some 1.9 incompatible syntax
- expanded RegisterReceiver class to record extra info
- expanded rspec coverage
- tweaked a README example

v0.5.1 (1st January 2008)
- Several documentation tweaks
- Improve support for parsing PDFs under windows (thanks to Jari Williamsson)

v0.5 (14th December 2007)
- Initial Release
Something went wrong with that request. Please try again.