Quick decrypt lib for pdf-reader #18

3jb · 2011-08-13T21:01:05Z

I built up a quick decrypt lib for pdf-reader. I had this buried in my working tree, and when you indicated movement to a new API I panicked and dug it out to post it so it wouldn't just go the way of the dodo. Right now it only works with PDFs encrypted with a blank user password, it doesn't raise any exceptions otherwise (so the user won't know it failed), and it has no implementation of document permissions.

I hadn't [ cleaned it up / really finished it ] because I wasn't entirely sure it was working the way it was supposed to. I seemed to be interpreting the PDF operators correctly, but there was no text. I spent some free time this week looking into it, and what I figured out is that I seem to be having trouble with string literals in text-related instructions:

Hex Encoded Strings

(extracted from a doc encrypted with LibreOffice userPass::blank, ownerPass::)

[<19>1<0D0F19>-6<06>1<130F0A>2<0F1A0204>2<0906>-6<021806>1<05>1<04>2<09>-7<020D16020E06>1<0D0E0B>2<06>1<03>-2<1B>]TJ

When the text instructions are hex encoded, pdf-reader has no problem: decrypt -> FlateDecode -> map int to character. The whole process goes swimmingly and the document renders successfully.
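
To make the "map int to character" step concrete, here is a minimal sketch of pulling the integer codes out of a TJ array like the one above. The helper name `parse_tj` is hypothetical and is not part of pdf-reader; the sketch assumes the stream has already been decrypted and inflated.

```ruby
# Hypothetical helper (not pdf-reader's API): split a TJ array into its
# hex strings (as arrays of byte codes) and its kerning adjustments.
def parse_tj(array_src)
  array_src.scan(/<([0-9A-Fa-f]*)>|(-?\d+(?:\.\d+)?)/).map do |hex, num|
    # pack("H*") turns hex digits into raw bytes; an odd trailing digit
    # is padded with 0, which matches how PDF treats hex strings.
    hex ? [hex].pack("H*").bytes : num.to_f
  end
end

tj = "[<19>1<0D0F19>-6<06>]"
p parse_tj(tj)
```

Each inner array holds the character codes that the font's encoding then has to translate into actual characters.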

Literal Strings

(extracted from CA DMV doc )

[(^A^B^C^D)87.4(^E)7.7(^F)7.7(^G)27.4(^H)-26.1(^F^E^C )]TJ

When the string is encoded 'literally' I get instructions such as the above, which are obviously nonsense and ultimately result in strings of \342\226\257 -> wonderful little boxes. If you look at the example I included above, you'll notice it is suspiciously counting 1, 2, 3, 4, etc. through the instruction.

The literal strings are what I encountered first, and thus thought the decryption couldn't be working right. I've been implementing the feature based on a combination of the PDF spec and what poppler has done (thus all the shuffled auth_data.rb and other such immediately unused [ classes / methods ]). Poppler has been a great tool because I can root around in there and pull out the keys to individual objects for direct comparison to the keys I generate. Unsurprisingly they all match up.

I spent a lot of time trying to figure out what was going wrong. I eventually decided my effort was misplaced: that time would be better spent implementing proper error handling, password support, permission support, etc., thereby rounding off a useful tool that I've [ made / understand comprehensively ].

I figure maybe you would at least be able to pinpoint the literal string issue more quickly, and maybe just fix it. At minimum you'd be able to give me something more specific to [ look at / think about ] so I wouldn't just fiddle away my time figuring out how pdf-reader interprets and converts PDF files to native UTF-8 through a tree of maps, glyphs, and fonts.

I set this up as a pull request because that seems to be the way all the kids are doing it these days. This is still all pre-alpha (practically useless), and not formally ready for release. If you want I can obviously also just forward you a patch to origin. I just wanted to get some attention to see if anyone else wanted to throw time at it.

yob · 2011-08-13T23:48:33Z

Wow, this is a great start thanks!

A quick review suggests that it mostly hooks in via the ObjectHash class. The API for that isn't changing, so there shouldn't be too many issues merging it in.

I'll make a few further comments on the commits

yob · 2011-08-14T13:08:49Z

I've looked into your literal string issue. Turns out the strings are being decrypted correctly, but 2 other changes need to happen before they can be converted from their subset codes to UTF-8:

* we need to be decrypting metadata
* the Font and Encoding classes need to recognise and extract data from the font CharSet entry. I'm still looking for a simple file that uses CharSet that I can add to the test suite.
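
As a rough illustration of why correctly decrypted bytes can still print as boxes: a subset font uses small codes (1, 2, 3, ...) that only mean something once they are looked up in the font's own mapping. The table below is invented for illustration; it is not taken from the DMV file, and it is not how pdf-reader's Font and Encoding classes are structured.

```ruby
# Hypothetical code-to-character table for a subset font. The values are
# made up; a real table would be built from the font data in the PDF.
SUBSET_MAP = { 1 => "D", 2 => "M", 3 => "V", 4 => " " }.freeze

# Translate raw subset codes to UTF-8, falling back to U+25AF (the
# "little box", bytes \342\226\257) for any code the table doesn't know.
def subset_to_utf8(raw, map)
  raw.bytes.map { |code| map.fetch(code, "\u25AF") }.join
end

decrypted = "\x01\x02\x03\x04".b
puts subset_to_utf8(decrypted, SUBSET_MAP)   # mapped through the table
puts subset_to_utf8(decrypted, {})           # no table: boxes only
```

With an empty table every code falls through to the box character, which is the symptom described earlier in the thread.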

I've added a decrypt branch to the main project on github. Feel free to cherry-pick commits from it if you want to move things forward.

3jb · 2011-08-14T14:33:04Z

It was my full intent to integrate the whole thing into ObjectHash, so yes, I completely agree and I'll sort that out.
I couldn't remember if the rc4 and the md5 requires were gems, but now that you mention it I'm 100% sure the rc4 thing is a gem. Object keys are constructed from a set of items including passwords, a padding block, and various elements of the individual file and object ID all stuffed into an md5 digest. That key is then used in an rc4 algorithm (decrypt.rb::52) to decrypt the object streams. I can add the dependency to the gemspec if you haven't already; easy enough.
I know about the metadata thing. From what I remember (I haven't worked on this in a few months), there was some sort of stumbling block with the [ trailer / file ID ] not being available before the metadata was accessed. Browsing over it again and considering slapping it into the constructor (as you recommend), I don't know why I didn't do that earlier. I think that was my problem. Kind of embarrassing. I'll fix that.
I can't immediately help you on the CharSet thing, I'll have to look that up.
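
For readers following along, the MD5-then-RC4 step described above can be sketched roughly as follows. This follows my reading of the PDF spec's per-object key derivation for RC4 encryption, not the code in decrypt.rb; the names `rc4` and `object_key` are hypothetical, and RC4 is written out in plain Ruby here so the sketch has no gem dependency.

```ruby
require "digest/md5"

# Minimal pure-Ruby RC4. It is symmetric: the same call encrypts and decrypts.
def rc4(key, data)
  s = (0..255).to_a
  k = key.bytes
  j = 0
  256.times do |n|                       # key scheduling
    j = (j + s[n] + k[n % k.size]) & 0xFF
    s[n], s[j] = s[j], s[n]
  end
  i = j = 0
  out = data.bytes.map do |byte|         # keystream generation + XOR
    i = (i + 1) & 0xFF
    j = (j + s[i]) & 0xFF
    s[i], s[j] = s[j], s[i]
    byte ^ s[(s[i] + s[j]) & 0xFF]
  end
  out.pack("C*")
end

# Per-object key: MD5 over the file key plus the low 3 bytes of the object
# number and the low 2 bytes of the generation number, truncated to
# (file key length + 5) bytes, capped at 16.
def object_key(file_key, obj_num, gen_num)
  suffix = [obj_num].pack("V")[0, 3] + [gen_num].pack("v")
  digest = Digest::MD5.digest(file_key.b + suffix)
  digest[0, [file_key.bytesize + 5, 16].min]
end

file_key = "\x01\x02\x03\x04\x05".b       # made-up 40-bit file key
key      = object_key(file_key, 12, 0)
cipher   = rc4(key, "stream bytes")
puts rc4(key, cipher)                     # round-trips to the plaintext
```

The file key itself comes from the password, padding block, and file ID mentioned above; only the per-object step is shown here.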

I can get on this in the next few days, shouldn't take too much time.

yob · 2011-08-14T23:32:17Z

I've had a quick stab at metadata decryption in f34807 and 7dbfd2b. It's not fully tested yet, but it works in the basic case.

yob · 2011-08-15T03:13:18Z

I've also added some specs that show how I'd like users to provide the required user password for decryption (via options to PDF::Reader and PDF::Reader::ObjectHash).

I'm not up on how the decryption works though, so the specs are currently failing. Reckon you can fill in the blanks?

yob · 2011-08-15T13:22:22Z

I've fixed the text/charset issue in 32f6a4c and 6fef581 on master. If you merge them in with the latest decrypt work then the text in your DMV sample file is correctly extracted.

3jb · 2011-08-21T21:26:04Z

I pushed a new set of commits that:

* allow the use of either owner/user passwords
* allow password verification -> thereby exceptions on invalid passwords
* update meta_spec.rb to test owner passwords and correct the user pass test text
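
A rough sketch of what user-password verification looks like for the oldest (revision 2, 40-bit RC4) standard security handler, based on my reading of the PDF spec rather than on the commits in this PR. All names here (`PAD`, `file_key`, `user_entry`, `verify_user_password!`, `InvalidPasswordError`) are hypothetical, owner passwords and later revisions are not covered, and the sample values are made up.

```ruby
require "digest/md5"

# The 32-byte password padding string defined by the PDF spec.
PAD = ["28BF4E5E4E758A4164004E56FFFA01082E2E00B6D0683E802F0CA9FE6453697A"].pack("H*")

class InvalidPasswordError < StandardError; end   # hypothetical error class

def rc4(key, data)                                # minimal pure-Ruby RC4
  s = (0..255).to_a
  k = key.bytes
  j = 0
  256.times do |n|
    j = (j + s[n] + k[n % k.size]) & 0xFF
    s[n], s[j] = s[j], s[n]
  end
  i = j = 0
  data.bytes.map { |byte|
    i = (i + 1) & 0xFF
    j = (j + s[i]) & 0xFF
    s[i], s[j] = s[j], s[i]
    byte ^ s[(s[i] + s[j]) & 0xFF]
  }.pack("C*")
end

# File key (revision 2): MD5 over padded password, :O, :P as 4 bytes
# little-endian, and the first file ID; keep the first 5 bytes.
def file_key(password, o_entry, permissions, file_id)
  padded = (password.b + PAD)[0, 32]
  input  = padded + o_entry.b + [permissions & 0xFFFFFFFF].pack("V") + file_id.b
  Digest::MD5.digest(input)[0, 5]
end

# Expected :U value (revision 2): the padding string encrypted with the file key.
def user_entry(password, o_entry, permissions, file_id)
  rc4(file_key(password, o_entry, permissions, file_id), PAD)
end

def verify_user_password!(password, o_entry, u_entry, permissions, file_id)
  return true if user_entry(password, o_entry, permissions, file_id) == u_entry.b
  raise InvalidPasswordError, "password does not match the :U entry"
end

o  = ("o" * 32).b                                  # made-up :O entry
id = "made-up-file-id".b
u  = user_entry("", o, -4, id)                     # what a blank password would produce
verify_user_password!("", o, u, -4, id)            # passes silently
```

Raising from the verification step is what lets a caller tell "wrong password" apart from "document decrypted fine but has no text".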

As far as dropping the StandardSecurityHandler < SecurityHandler: that may be prudent at this point. The way it is right now really doesn't reflect the inherent structure in the spec, which was what was intended. StandardSecurityHandler actually implies that there is another set of values in the encryption dictionary (:R, :O, :U, :P, :EncryptMetadata; p. 60, Table 21), which are all loaded in SecurityHandler right now anyway.

My next effort will be focused on tidying that up unless you get to it first.

yob · 2011-08-22T00:19:32Z

Cheers, I'll review this soon

3jb · 2011-08-23T00:58:54Z

I made standard_security_handler into SecurityHandler::Standard. I don't know if you're going to like it, but it seemed like something fun to do. If you want to drop it altogether - I'm fine with that. The spec doesn't elaborate on other possible :Filter values.

yob · 2011-08-23T14:10:06Z

brilliant, thanks for all your work.

I've pushed a handful of style changes to my decrypt branch and have another small change I plan to make tomorrow before merging it into master.

yob · 2011-08-23T14:14:56Z

If you're curious, the final change will be to merge the Decrypt class into StandardSecurityHandler.

Decrypt only has class methods, so it seems a little unnecessary. I suspect most of the methods can be private methods on StandardSecurityHandler.

yob · 2011-08-24T12:07:24Z

Thanks for your work on this, it's been merged into master with a few style changes.

I don't suppose you have access to a PDF with encrypted streams but plain text metadata? I suspect such files will be handled incorrectly and I'd like to add a spec

3jb · 2011-08-24T23:57:20Z

I tried sorting out the encrypted/unencrypted metadata thing today. I couldn't find an example, so I tried to make one by writing a file unencrypted and then another copy encrypted, moving the 'info' dictionary between the two, and switching 'EncryptMetadata' in the encryption dictionary. Poppler wasn't even happy with it. I only had 10 minutes to look at it today; I'll look more into it when I get another chance. For now though: no, I can't come up with a good example. I actually run into that issue a lot. Lately I've been trying to find PDFs with certain x_object structures.

yob · 2011-08-25T00:01:48Z

I'll open a new "wishlist" issue to track finding a sample PDF

bernerdschaefer and others added 8 commits August 19, 2011 16:00

Add RunLengthDecode filter support

eb06ba8

Merge remote-tracking branch 'upstream/master'

a246692

new executable: pdf_callbacks

e7ddfc1

* designed to replace pdf_list_callbacks

Merge pull request yob#19 from bernerdschaefer/runlengthdecode

aa23a42

Add RunLengthDecode filter support

back out changes that make adding filters easier for users

f877b24

* There's a fairly limited set of filters defined in the PDF spec so there should be little need for new filters to be added by users. Given that's the case, I'd prefer to keep the Filter class as simple as possible

update CHANGELOG

e2caec1

Merge remote-tracking branch 'upstream/master'

0b51124

Merge branch 'decrypt'

8d4c729

Merge branch 'decrypt'

7bede23

3jb closed this Aug 23, 2011

3jb reopened this Aug 23, 2011

yob closed this Aug 24, 2011
