Quick decrypt lib for pdf-reader #18
Wow, this is a great start, thanks! A quick review suggests that it mostly hooks in via the ObjectHash class. The API for that isn't changing, so there shouldn't be too many issues merging it in. I'll make a few further comments on the commits.
I've looked into your literal string issue. Turns out the strings are being decrypted correctly but 2 other changes need to happen before they can be converted from their subset codes to UTF8:
I've added a decrypt branch to the main project on github. Feel free to cherry-pick commits from it if you want to move things forward.
I can get on this in the next few days, shouldn't take too much time.
I've had a quick stab at metadata decryption in f34807 and 7dbfd2b. It's not fully tested yet, but it works in the basic case.
I've also added some specs that show how I'd like users to provide the required user password for decryption (via options to PDF::Reader and PDF::Reader::ObjectHash). I'm not up on how the decryption works though, so the specs are currently failing. Reckon you can fill in the blanks?
* designed to replace pdf_list_callbacks
Add RunLengthDecode filter support
* There's a fairly limited set of filters defined in the PDF spec, so there should be little need for users to add new filters. Given that's the case, I'd prefer to keep the Filter class as simple as possible.
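For anyone following along, here is a minimal sketch of what RunLengthDecode involves, going by the filter's description in the PDF spec: a length byte of 0 to 127 means "copy the next length + 1 bytes", 129 to 255 means "repeat the next byte 257 - length times", and 128 marks end of data. `run_length_decode` is a hypothetical standalone helper written for illustration; it is not the Filter API in pdf-reader.

```ruby
# Hypothetical standalone helper; not pdf-reader's actual Filter API.
def run_length_decode(data)
  bytes = data.bytes
  out = []
  pos = 0
  while pos < bytes.length
    length = bytes[pos]
    pos += 1
    break if length == 128              # 128 is the end-of-data marker
    if length < 128
      count = length + 1                # copy the next length + 1 bytes as-is
      out.concat(bytes[pos, count] || [])
      pos += count
    else
      count = 257 - length              # repeat the next byte 257 - length times
      out.concat([bytes[pos]] * count) unless bytes[pos].nil?
      pos += 1
    end
  end
  out.pack("C*")                        # binary string of decoded bytes
end

encoded = [2, 0x61, 0x62, 0x63, 254, 0x7A, 128].pack("C*")
decoded = run_length_decode(encoded)    # "abc" copied literally, then "z" three times
```

The helper works on an array of byte values rather than on the string directly, which keeps the indexing free of encoding concerns.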
I pushed a new set of commits that:
As far as dropping the StandardSecurityHandler < SecurityHandler split goes: that may be prudent at this point. The way it is right now doesn't really reflect the inherent structure in the spec, which was the intent. StandardSecurityHandler implies that there is another set of values in the encryption dictionary (:R, :O, :U, :P, :EncryptMetadata; p. 60, Table 21), which are all loaded in SecurityHandler right now anyway. My next effort will be focused on tidying that up unless you get to it first.
Cheers, I'll review this soon.
I made standard_security_handler into SecurityHandler::Standard. I don't know if you're going to like it, but it seemed like something fun to do. If you want to drop it altogether - I'm fine with that. The spec doesn't elaborate on other possible :Filter values.
Brilliant, thanks for all your work. I've pushed a handful of style changes to my decrypt branch and have another small change I plan to make tomorrow before merging it into master.
If you're curious, the final change will be to merge the Decrypt class into StandardSecurityHandler. Decrypt only has class methods, so it seems a little unnecessary. I suspect most of the methods can be private methods on StandardSecurityHandler.
Thanks for your work on this, it's been merged into master with a few style changes. I don't suppose you have access to a PDF with encrypted streams but plain text metadata? I suspect such files will be handled incorrectly and I'd like to add a spec.
I tried sorting out the encrypted/unencrypted metadata thing today. I couldn't find an example, so I tried to make one by writing a file unencrypted and then another copy encrypted, moving the 'info' dictionary between the two, and switching 'EncryptMetadata' in the encryption dictionary. Even poppler wasn't happy with it. I only had 10 minutes to look at it today, so I'll look into it more when I get another chance. For now, though, no, I can't come up with a good example. I actually run into that issue a lot: lately I've been trying to find PDFs with certain x_object structures.
I'll open a new "wishlist" issue to track finding a sample PDF |
I built up a quick decrypt lib for pdf-reader. I had this buried in my working tree, and when you indicated movement to a new API I panicked and dug it out to post it so it wouldn't just go the way of the dodo. Right now it only works with PDFs encrypted with a blank user password, it doesn't throw any exceptions otherwise (so the user won't know it failed), and it has no implementation of document permissions.
I hadn't [ cleaned it up / really finished it ] because I wasn't entirely sure it was working the way it was supposed to. I seemed to be interpreting the PDF operators correctly, but there was no text. I spent some free time this week looking into it, and what I figured out is that I seem to be having trouble with string literals in text-related instructions:
Hex Encoded Strings
(extracted from a doc encrypted with LibreOffice userPass::blank, ownerPass::)
[<19>1<0D0F19>-6<06>1<130F0A>2<0F1A0204>2<0906>-6<021806>1<05>1<04>2<09>-7<020D16020E06>1<0D0E0B>2<06>1<03>-2<1B>]TJ
When the text instructions are hex encoded, pdf-reader has no problem with decrypt -> FlateDecode -> map int to character. The whole process runs swimmingly and the document renders successfully.
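To make the last step concrete, here is a rough sketch of getting from the hex strings in a TJ operand to integer character codes. It assumes one byte per code (that depends on the font), uses a shortened copy of the operand above, and is not how pdf-reader itself parses content streams.

```ruby
# Shortened copy of the TJ operand from the example above.
tj_operand = "[<19>1<0D0F19>-6<06>1<130F0A>2]"

# Pull out each <...> hex string and unpack it into raw character codes.
# The bare numbers between strings are kerning adjustments and are skipped.
# Assumption: one byte per code; a multi-byte font would need wider slices.
codes = tj_operand.scan(/<([0-9A-Fa-f]+)>/).flatten.flat_map do |hex|
  [hex].pack("H*").bytes
end

# codes now holds small integers (0x19, 0x0D, 0x0F, ...) that still have to
# go through the font's own code-to-character mapping to become text.
```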
Literal Strings
(extracted from CA DMV doc )
[(^A^B^C^D)87.4(^E)7.7(^F)7.7(^G)27.4(^H)-26.1(^F^E^C )]TJ
When the string is encoded 'literally' I get instructions such as the above, which are obviously nonsense and ultimately result in strings of \342\226\257 -> wonderful little boxes. If you look at the example above, you'll notice it is suspiciously counting 1, 2, 3, 4, etc. through the instruction.
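A small sketch of what those two observations amount to at the byte level. The octal bytes \342\226\257 are the UTF-8 encoding of U+25AF, and ^A^B^C^D are simply the bytes 1 through 4. The `to_unicode` table is made up, and the fallback-to-box behaviour is my assumption about where the boxes come from, not something confirmed here.

```ruby
# Bytes inside the first literal string of the example: ^A ^B ^C ^D.
codes = "\x01\x02\x03\x04".bytes          # sequential codes 1, 2, 3, 4

# The "little box" seen in the output, rebuilt from its octal escapes.
box = "\342\226\257".dup.force_encoding("UTF-8")
box_codepoint = box.unpack("U*").first    # 0x25AF

# Hypothetical, deliberately incomplete code-to-Unicode table for a subset
# font. Codes with no entry fall back to the box (assumed behaviour).
to_unicode = { 1 => "D", 2 => "M" }
rendered = codes.map { |code| to_unicode.fetch(code, box) }.join
```

If the codes really are indexes into a subset font, sequential values like these are plausible as codes and only look like nonsense when read as text.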
The literal strings are what I encountered first, and thus thought the decryption couldn't be working right. I've been implementing the feature based on a combination of the PDF spec and what poppler has done (thus all the shuffled auth_data.rb and other such immediately unused [ classes / methods ]). Poppler has been a great tool because I can root around in there and pull out the keys to individual objects for direct comparison to the keys I generate. Unsurprisingly they all match up.
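For reference, the per-object key comparison is based on the derivation the PDF spec gives for RC4 encryption: append the low three bytes of the object number and the low two bytes of the generation number to the file key, MD5 the result, and keep the first n + 5 bytes (capped at 16). The sketch below assumes RC4 only (AES adds a salt) and is neither pdf-reader's nor poppler's code; `object_key` is a name I made up.

```ruby
require "digest/md5"

# Hypothetical helper: derive the key for one indirect object from the
# document-wide file key (RC4 case only; AES is not covered here).
def object_key(file_key, obj_num, gen_num)
  seed = file_key.b +
         [obj_num].pack("V")[0, 3] +    # low 3 bytes of object number, low byte first
         [gen_num].pack("V")[0, 2]      # low 2 bytes of generation number
  key_length = [file_key.bytesize + 5, 16].min
  Digest::MD5.digest(seed)[0, key_length]
end

# A 40-bit (5-byte) file key gives a 10-byte object key.
key = object_key("\x01\x02\x03\x04\x05", 12, 0)
hex = key.unpack("H*").first            # handy for comparing against another tool's keys
```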
I spent a lot of time trying to figure out what was going wrong. I ultimately decided my effort was misplaced: I could be using that time to implement proper error handling, password support, permission support, etc., thereby rounding off a useful tool that I've [ made / understand comprehensively ].
I figure you would at least be able to pinpoint the literal string issue more quickly, and maybe just fix it. At minimum you'd be able to give me something more specific to [ look at / think about ] so I wouldn't just fiddle away my time figuring out how pdf-reader interprets and converts PDF files to native UTF-8 through a tree of maps, glyphs, and fonts.
I set this up as a pull request because that seems to be the way all the kids are doing it these days. This is still all pre-alpha (practically useless), and not formally ready for release. If you want I can obviously also just forward you a patch to origin. I just wanted to get some attention to see if anyone else wanted to throw time at it.