Quick decrypt lib for pdf-reader #18

Closed
wants to merge 9 commits into from

3jb
Contributor

@3jb 3jb commented Aug 13, 2011

I built up a quick decrypt lib for pdf-reader. I had this buried in my working tree, and when you indicated movement to a new API I panicked and dug it out to post it so it wouldn't just go the way of the dodo. Right now it only works with PDFs encrypted with a blank user password, it doesn't throw any exceptions otherwise (so the user won't know it failed), and it has no implementation of document permissions.

I hadn't [ cleaned it up / really finished it ] because I wasn't entirely sure it was working the way it was supposed to. I seemed to be interpreting the PDF operators correctly, but there was no text. I spent some free time this week looking into it, and what I figured out is that I seem to be having trouble with string literals in text-related instructions:

Hex Encoded Strings

(extracted from a doc encrypted with LibreOffice userPass::blank, ownerPass::)

[<19>1<0D0F19>-6<06>1<130F0A>2<0F1A0204>2<0906>-6<021806>1<05>1<04>2<09>-7<020D16020E06>1<0D0E0B>2<06>1<03>-2<1B>]TJ


When the text instructions are hex encoded, pdf-reader has no problem: decrypt -> FlateDecode -> map int to character. The whole process runs smoothly and the document renders successfully.
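As a rough illustration of the last step (not pdf-reader's actual code), the hex strings in the TJ operand quoted above can be unpacked into integer codes like this. The helper name `hex_string_codes` is made up for this sketch, and it assumes one byte per character code, which is what the sample appears to use.

```ruby
# The TJ operand quoted above, copied verbatim from the comment.
tj = "[<19>1<0D0F19>-6<06>1<130F0A>2<0F1A0204>2<0906>-6<021806>1<05>1<04>2<09>-7<020D16020E06>1<0D0E0B>2<06>1<03>-2<1B>]TJ"

# Hypothetical helper: pull out every <...> hex string and unpack it into
# byte values, skipping the kerning numbers that sit between the strings.
def hex_string_codes(tj_operand)
  tj_operand.scan(/<([0-9A-Fa-f]+)>/).flat_map do |(hex)|
    hex += "0" if hex.length.odd? # odd-length hex strings get a trailing 0
    [hex].pack("H*").bytes
  end
end

codes = hex_string_codes(tj)
puts codes.first(4).inspect # prints [25, 13, 15, 25]
```

The resulting integers are still font-specific codes; turning them into characters is the job of the font's encoding data.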

Literal Strings

(extracted from CA DMV doc )

[(^A^B^C^D)87.4(^E)7.7(^F)7.7(^G)27.4(^H)-26.1(^F^E^C )]TJ


When the string is encoded 'literally' I get instructions such as the above, which are obviously nonsense and ultimately result in strings of \342\226\257 -> wonderful little boxes. If you look at the example above, you'll notice it is suspiciously counting 1, 2, 3, 4, etc. through the instruction.

The literal strings are what I encountered first, and thus thought the decryption couldn't be working right. I've been implementing the feature based on a combination of the PDF spec and what poppler has done (thus all the shuffled auth_data.rb and other such immediately unused [ classes / methods ]). Poppler has been a great tool because I can root around in there and pull out the keys to individual objects for direct comparison to the keys I generate. Unsurprisingly they all match up.

I spent a lot of time trying to figure out what was going wrong. I ultimately decided my effort was misplaced: I could be using that time to implement proper error handling, password support, permission support, etc., thereby rounding off a useful tool that I've [ made / understand comprehensively ].

I figure maybe you would at least be able to finger the literal string issue more quickly, and maybe just fix it. At minimum you'd be able to give me something more specific to [ look at / think about ] so I wouldn't just fiddle away my time figuring out how pdf-reader interprets and converts PDF files to native UTF-8 through a tree of maps, glyphs, and fonts.

I set this up as a pull request because that seems to be the way all the kids are doing it these days. This is still all pre-alpha (practically useless), and not formally ready for release. If you want I can obviously also just forward you a patch to origin. I just wanted to get some attention to see if anyone else wanted to throw time at it.

@yob
Owner

yob commented Aug 13, 2011

Wow, this is a great start thanks!

A quick review suggests that it mostly hooks in via the ObjectHash class. The API for that isn't changing, so there shouldn't be too many issues merging it in.

I'll make a few further comments on the commits

@yob
Owner

yob commented Aug 14, 2011

I've looked into your literal string issue. Turns out the strings are being decrypted correctly, but 2 other changes need to happen before they can be converted from their subset codes to UTF-8:

  • we need to be decrypting metadata
  • the Font and Encoding classes need to recognise and extract data from the font CharSet entry. I'm still looking for a simple file that uses CharSet that I can add to the test suite.
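To make the "subset codes" point concrete, here is a small sketch of why the DMV strings come out as boxes. The bytes \342\226\257 are the UTF-8 encoding of U+25AF, so a code with no usable mapping ends up as that placeholder. Both lookup tables below are invented for illustration; the real ones would come from the font's own data, and the method name `decode_codes` is hypothetical, not pdf-reader's API.

```ruby
# First literal string from the DMV sample: bytes 1..4 (shown as ^A^B^C^D).
raw = "\x01\x02\x03\x04".b

# Hypothetical code -> glyph name table, standing in for the font's own data.
CODE_TO_GLYPH = { 1 => "D", 2 => "M", 3 => "V", 4 => "space" }.freeze
# Hypothetical glyph name -> Unicode codepoint table.
GLYPH_TO_UNICODE = { "D" => 0x44, "M" => 0x4D, "V" => 0x56, "space" => 0x20 }.freeze

# Map each byte through both tables; anything unmapped becomes U+25AF,
# the "little box" placeholder.
def decode_codes(raw, code_to_glyph, glyph_to_unicode)
  codepoints = raw.bytes.map do |code|
    glyph = code_to_glyph[code]
    (glyph && glyph_to_unicode[glyph]) || 0x25AF
  end
  codepoints.pack("U*")
end

puts decode_codes(raw, {}, {})                          # four boxes
puts decode_codes(raw, CODE_TO_GLYPH, GLYPH_TO_UNICODE) # readable text
```

Without the font tables the decrypted bytes are correct but unreadable, which matches the symptom described earlier in the thread.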

I've added a decrypt branch to the main project on github. Feel free to cherry-pick commits from it if you want to move things forward.

@3jb
Contributor Author

3jb commented Aug 14, 2011

  1. It was my full intent to integrate the whole thing into ObjectHash, so yes, I completely agree and I'll sort that out.
  2. I couldn't remember if the rc4 and the md5 requires were gems, but now that you mention it I'm 100% sure the rc4 thing is a gem. Object keys are constructed from a set of items including passwords, a padding block, and various elements of the individual file and object ID all stuffed into an md5 digest. That key is then used in an rc4 algorithm (decrypt.rb::52) to decrypt the object streams. I can add the dependency to the gemspec if you haven't already; easy enough.
  3. I know about the metadata thing. From what I remember (I haven't worked on this in a few months) I vaguely recall some sort of stumbling block with the [ trailer / file ID ] not being available before the metadata was accessed. Browsing over it again and considering slapping it into the constructor ( as you recommend ): I don't know why I didn't do that earlier. I think that was my problem. Kind of embarrassing. I'll fix that.
  4. I can't immediately help you on the CharSet thing, I'll have to look that up.

I can get on this in the next few days, shouldn't take too much time.
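The key construction described in item 2 can be sketched roughly as follows. This is a minimal illustration based on the PDF spec's standard security handler at revision 2 (40-bit RC4) only, not the code in decrypt.rb: later revisions add extra MD5 rounds and longer keys, which are left out here. RC4 is written inline so the sketch needs no gem, and all input values in the usage lines are placeholders.

```ruby
require "digest/md5"

# 32-byte padding string defined by the PDF spec for password padding.
PADDING = [
  0x28, 0xBF, 0x4E, 0x5E, 0x4E, 0x75, 0x8A, 0x41,
  0x64, 0x00, 0x4E, 0x56, 0xFF, 0xFA, 0x01, 0x08,
  0x2E, 0x2E, 0x00, 0xB6, 0xD0, 0x68, 0x3E, 0x80,
  0x2F, 0x0C, 0xA9, 0xFE, 0x64, 0x53, 0x69, 0x7A
].pack("C*").freeze

# Plain RC4; encrypting and decrypting are the same operation.
def rc4(key, data)
  s = (0..255).to_a
  k = key.bytes
  j = 0
  256.times do |n|
    j = (j + s[n] + k[n % k.length]) % 256
    s[n], s[j] = s[j], s[n]
  end
  i = j = 0
  out = data.bytes.map do |byte|
    i = (i + 1) % 256
    j = (j + s[i]) % 256
    s[i], s[j] = s[j], s[i]
    byte ^ s[(s[i] + s[j]) % 256]
  end
  out.pack("C*")
end

# File-level key (revision 2): MD5 over the padded password, the /O entry,
# /P as a 4-byte little-endian integer and the first file ID string.
def file_key(password, owner_entry, permissions, file_id)
  padded = (password.b + PADDING)[0, 32]
  digest = Digest::MD5.digest(padded + owner_entry.b + [permissions].pack("l<") + file_id.b)
  digest[0, 5]
end

# Per-object key: MD5 over the file key plus the low 3 bytes of the object
# number and the low 2 bytes of the generation number, truncated to n + 5.
def object_key(file_key, obj_num, gen_num)
  suffix = [obj_num].pack("V")[0, 3] + [gen_num].pack("v")
  Digest::MD5.digest(file_key + suffix)[0, [file_key.bytesize + 5, 16].min]
end

# Placeholder inputs, purely for illustration.
key = file_key("", "o" * 32, -44, "file-id-bytes")
obj = object_key(key, 7, 0)
encrypted = rc4(obj, "BT /F1 12 Tf ET")
puts rc4(obj, encrypted)
```

Comparing the output of `object_key` against the keys another implementation produces for the same object is the same kind of check described earlier in the thread with poppler.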

@yob
Owner

yob commented Aug 14, 2011

I've had a quick stab at metadata decryption in f34807 and 7dbfd2b. It's not fully tested yet, but it works in the basic case.

@yob
Owner

yob commented Aug 15, 2011

I've also added some specs that show how I'd like users to provide the required user password for decryption (via options to PDF::Reader and PDF::Reader::ObjectHash).

I'm not up on how the decryption works though, so the specs are currently failing. Reckon you can fill in the blanks?

@yob
Owner

yob commented Aug 15, 2011

I've fixed the text/charset issue in 32f6a4c and 6fef581 on master. If you merge them in with the latest decrypt work then the text in your DMV sample file is correctly extracted.

bernerdschaefer and others added 8 commits August 19, 2011 16:00
* designed to replace pdf_list_callbacks
* There's a fairly limited set of filters defined in the PDF spec so
  there should be little need for new filters to be added by users.
  Given that's the case, I'd prefer to keep the Filter class as simple
  as possible
@3jb
Contributor Author

3jb commented Aug 21, 2011

I pushed a new set of commits that:

  • allow the use of either owner/user passwords
  • allow password verification -> thereby exceptions on invalid passwords
  • updated meta_spec.rb to test owner passwords and corrected the user pass test text

As far as dropping the StandardSecurityHandler < SecurityHandler: that may be prudent at this point. The way it is right now doesn't really reflect the inherent structure in the spec, which was what was intended. StandardSecurityHandler actually implies that there is another set of values in the encryption dictionary ( :R, :O, :U, :P, :EncryptMetadata pp 60::Table 21) which are all loaded in SecurityHandler right now anyway.

My next effort will be focused on tidying that up unless you get to it first.

@yob
Owner

yob commented Aug 22, 2011

Cheers, I'll review this soon

@3jb
Contributor Author

3jb commented Aug 23, 2011

I made standard_security_handler into SecurityHandler::Standard. I don't know if you're going to like it, but it seemed like something fun to do. If you want to drop it altogether - I'm fine with that. The spec doesn't elaborate on other possible :Filter values.

@3jb 3jb closed this Aug 23, 2011
@3jb 3jb reopened this Aug 23, 2011
@yob
Owner

yob commented Aug 23, 2011

brilliant, thanks for all your work.

I've pushed a handful of style changes to my decrypt branch and have another small change I plan to make tomorrow before merging it into master.

@yob
Owner

yob commented Aug 23, 2011

If you're curious, the final change will be to merge the Decrypt class into StandardSecurityHandler.

Decrypt only has class methods, so it seems a little unnecessary. I suspect most of the methods can be private methods on StandardSecurityHandler.

@yob
Owner

yob commented Aug 24, 2011

Thanks for your work on this, it's been merged into master with a few style changes.

I don't suppose you have access to a PDF with encrypted streams but plain text metadata? I suspect such files will be handled incorrectly and I'd like to add a spec.

@yob yob closed this Aug 24, 2011
@3jb
Contributor Author

3jb commented Aug 24, 2011

I tried sorting out the encrypted/un-encrypted metadata thing today. I couldn't find an example, so I tried to make one by writing a file un-encrypted and then another copy encrypted, moving the 'info' dictionary between the two and switching the 'EncryptMetadata' in the encryption dictionary. poppler wasn't even happy with it. I only had 10 minutes to look at it today; I'll look into it more when I get another chance. Right now, though, no, I can't come up with a good example. I actually run into that issue a lot: lately I've been trying to find pdfs with certain x_object structures.

@yob
Owner

yob commented Aug 25, 2011

I'll open a new "wishlist" issue to track finding a sample PDF

3 participants