Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for CRC23 checksum gzuncompress() #622

Merged
merged 3 commits into from
Aug 5, 2023
Merged

Conversation

GreyWyvern
Copy link
Contributor

A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042

Partially resolves #592 by correctly decoding the document's default font stream. Includes inv.pdf from that issue by permission from the issue author.

A zlib compressed stream may have a CRC32 checksum instead of Adler-32 which the PHP gzuncompress() function expects. Add a second zlib decompression attempt if the first one fails. See: https://www.php.net/manual/en/function.gzuncompress.php#79042
Partially resolves smalot#592.
Copy link
Collaborator

@k00ni k00ni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good extension!

Your code looks a bit clunky, because of a try-catch inside a catch. Also, an @ in front of a function may absorb errors which otherwise would have came to the developers attention. Some changes are just opinionated to improve readability. Here is my suggestion for the whole try-catch part (untested):

// initialize string to return
try {
    $decoded = gzuncompress($data, $decodeMemoryLimit);
    if (false === $decoded) {
        // If gzuncompress() failed try again allowing for a CRC32 
        // checksum instead of Adler-32.
        // See: https://www.php.net/manual/en/function.gzuncompress.php#79042
        // Issue: https://github.com/smalot/pdfparser/issues/592
        $crc32 = tempnam('/tmp', 'gz_fix');
        if (false != $crc32) {
            file_put_contents($crc32, "\x1f\x8b\x08\x00\x00\x00\x00\x00".$data);
            $decoded = file_get_contents('compress.zlib://'.$crc32);
            unlink($crc32);
        }

        if (is_string($decoded) && 0 < strlen($decoded)) {
            // all good
        } else {
           // If the decoded string is empty, that means decoding failed.
           throw new \Exception('TODO: meaningful error message');
        }
} catch (\Exception $e) {
    throw $e;
} finally {
    // Restore old handler just in case it was customized outside of PDFParser.
    restore_error_handler();
}

What do you think?

@GreyWyvern
Copy link
Contributor Author

GreyWyvern commented Aug 1, 2023

Good extension!

Your code looks a bit clunky, because of a try-catch inside a catch. Also, an @ in front of a function may absorb errors which otherwise would have came to the developers attention. Some changes are just opinionated to improve readability. Here is my suggestion for the whole try-catch part (untested):

[snip]

What do you think?

So I tried this and it turns out that the original gzuncompress() must fail and emit an E_WARNING in the case that the data is alternately coded. This means that because of the set_error_handler() above, that E_WARNING is upgraded to an E_ERROR and execution is knocked out of the single try { ... } catch () { ... } loop. The if (false === $decoded) { line is never even reached.

So either there must be a try/catch within a try/catch, or else the set_error_handler() should be removed, and the gzuncompress() should have it's E_WARNING suppressed with @. We can add comments in the code to explain why this is necessary. We don't even need the exterior try/catch if we're not making any special adjustments for gzuncompress().

    protected function decodeFilterFlateDecode(string $data, int $decodeMemoryLimit): ?string
    {
        /* REMOVE THIS
         * gzuncompress may throw a not catchable E_WARNING in case of an error (like $data is empty)
         * the following set_error_handler changes an E_WARNING to an E_ERROR, which is catchable.
         */
        /* REMOVE THIS
        set_error_handler(function ($errNo, $errStr) {
            if (\E_WARNING === $errNo) {
                throw new \Exception($errStr);
            } else {
                // fallback to default php error handler
                return false;
            }
        }); */

        $decoded = null;

        // initialize string to return
        /* REMOVE THIS try { */

            // Uncatchable E_WARNING for "data error" is @ suppressed
            // so execution may proceed with an alternate decompression
            // method.
            $decoded = @gzuncompress($data, $decodeMemoryLimit);
            if (false === $decoded) {
                // If gzuncompress() failed, try again using the compress.zlib://
                // wrapper to decode it in a file-based context.
                // See: https://www.php.net/manual/en/function.gzuncompress.php#79042
                // Issue: https://github.com/smalot/pdfparser/issues/592
                $ztmp = tempnam('/tmp', 'gz_fix');
                if (false != $ztmp) {
                    file_put_contents($ztmp, "\x1f\x8b\x08\x00\x00\x00\x00\x00".$data);
                    $decoded = file_get_contents('compress.zlib://'.$ztmp);
                    unlink($ztmp);
                }
            }

            if (!is_string($decoded) || 0 === strlen($decoded)) {
               // If the decoded string is empty, that means decoding failed.
               throw new \Exception('decodeFilterFlateDecode: invalid data');
            }

        /* REMOVE THIS
        } catch (\Exception $e) {
            throw $e;
        } finally {
            // Restore old handler just in case it was customized outside of PDFParser.
            restore_error_handler();
        } */

        return $decoded;
    }

This change will require changing some tests in FilterHelperTest.php as it will no longer be returning several different exceptions. Just the one 'invalid data' one.

@k00ni
Copy link
Collaborator

k00ni commented Aug 2, 2023

Good point. So the final code would look like the following (including some minor adaptions)?

protected function decodeFilterFlateDecode(string $data, int $decodeMemoryLimit): ?string
{
    // Uncatchable E_WARNING for "data error" is @ suppressed
    // so execution may proceed with an alternate decompression
    // method.
    $decoded = @gzuncompress($data, $decodeMemoryLimit);

    if (false === $decoded) {
        // If gzuncompress() failed, try again using the compress.zlib://
        // wrapper to decode it in a file-based context.
        // See: https://www.php.net/manual/en/function.gzuncompress.php#79042
        // Issue: https://github.com/smalot/pdfparser/issues/592
        $ztmp = tempnam('/tmp', 'gz_fix');
        if (false != $ztmp) {
            file_put_contents($ztmp, "\x1f\x8b\x08\x00\x00\x00\x00\x00".$data);
            $decoded = file_get_contents('compress.zlib://'.$ztmp);
            unlink($ztmp);
        }
    }

    if (false === is_string($decoded) || 0 === strlen($decoded)) {
       // If the decoded string is empty, that means decoding failed.
       throw new \Exception('decodeFilterFlateDecode: invalid data');
    }

    return $decoded;
}

Do we have to adapt $availableFilters property?

protected $availableFilters = ['ASCIIHexDecode', 'ASCII85Decode', 'LZWDecode', 'FlateDecode', 'RunLengthDecode'];

@GreyWyvern
Copy link
Contributor Author

Good point. So the final code would look like the following (including some minor adaptions)?

[snip]

Yes. We should even account for the $decodeMemoryLimit by limiting the size of the "file" file_get_contents() will load:

file_get_contents('compress.zlib://'.$ztmp, false, null, 0, $decodeMemoryLimit);

Do we have to adapt $availableFilters property?

No, the sections that are coded differently than gzuncompress() expects are still labeled as "FlateDecode". They are still zlib compressed, just with a different checksum.

@k00ni
Copy link
Collaborator

k00ni commented Aug 2, 2023

Yes. We should even account for the $decodeMemoryLimit by limiting the size of the "file" file_get_contents() will load:

Sounds reasonable, please do.

Instead of setting an error handler to catch the E_WARNING's that gzuncompress() emits, suppress it with an @ so we can do away with the try/catch. Make a note of this in the comments.
Switch from using tempnam() to tmpfile() because tempnam() can emit E_NOTICE's and would have to be suppressed as well. tmpfile() just returns a handle or false.
Limit file_get_contents() by the $decodeMemoryLimit. Unlike gzuncompress() for which a limit value of zero (0) means "no limit", file_get_contents() takes null to mean "no limit".
Fix for PHP < 8.0 that doesn't like a length limit of null for file_get_contents().
@k00ni
Copy link
Collaborator

k00ni commented Aug 3, 2023

Ready to merge?

@k00ni k00ni merged commit 2608ac3 into smalot:master Aug 5, 2023
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parsed incorectly, wrong symbols
2 participants