Font::uchr() method argument #1 expects int, but float given #621

Closed
jesse-greathouse opened this issue Jul 30, 2023 · 5 comments · Fixed by #623
@jesse-greathouse

  • PHP Version: 8.2.8
  • PDFParser Version: v2.5.0

Description:

 Smalot\PdfParser\Font::uchr(): Argument #1 ($code) must be of type int, float given, called in /home/jessegreathouse/dcol/src/vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php on line 223

  at vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:138
    134▕
    135▕     /**
    136▕      * Convert unicode character code to "utf-8" encoded string.
    137▕      */
  ➜ 138▕     public static function uchr(int $code): string
    139▕     {
    140▕         if (!isset(self::$uchrCache[$code])) {
    141▕             // html_entity_decode() will not work with UTF-16 or UTF-32 char entities,
    142▕             // therefore, we use mb_convert_encoding() instead

I have also left a comment on #440 where this strong typing was added.

PDF input

15-1039-1.pdf

Expected output

There is no output; only a TypeError is thrown when the parseFile() method is called.

Code

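<?php
require 'vendor/autoload.php'; // Composer autoloader; path assumed

use Smalot\PdfParser\Parser;

$parser = new Parser();
// With v2.5.0 on PHP 8.2.8 this call throws the TypeError shown above
$pdf = $parser->parseFile('15-1039-1.pdf');
?>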
@k00ni k00ni added the bug label Jul 31, 2023
@k00ni
Collaborator

k00ni commented Jul 31, 2023

Thank you for reporting. Can we use your PDF file in our test environment? I am wondering why $code can sometimes be a float. Also, it is cast to int anyway. At first glance, it should be sufficient to change the function parameter from int $code to int|float $code, shouldn't it?

@GreyWyvern
Contributor

This is the result of a PHP integer overflow on an extremely large CIDMap offset value.

...
<2082> <2082> <8ca156e36cd54eba>
...

When converted to integers you get the following values:
8322 (int)
8322 (int)
1.0133476171344E+19 (float)
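The conversion above can be reproduced without the library. This is a minimal sketch, assuming a 64-bit PHP build: `hexdec()` returns an int while the value fits and a float once it exceeds `PHP_INT_MAX`. The `takesInt()` function is a hypothetical stand-in that mirrors only the int-typed parameter of `Font::uchr()`; it is not the library's code.

```php
<?php
// Hex strings copied from the CIDMap excerpt above.
$hexValues = ['2082', '2082', '8ca156e36cd54eba'];

// hexdec() returns int while the value fits, and float once it exceeds PHP_INT_MAX.
$codes = array_map('hexdec', $hexValues);

foreach ($codes as $i => $code) {
    // gettype() reports floats as "double".
    printf("%s => %s\n", $hexValues[$i], gettype($code));
}

// Hypothetical stand-in: only the int-typed parameter matches Font::uchr().
function takesInt(int $code): int
{
    return $code;
}

try {
    takesInt($codes[2]);
    $outcome = 'accepted';
} catch (TypeError $e) {
    // An out-of-range float cannot be coerced to int, so PHP throws.
    $outcome = 'TypeError';
}

echo $outcome, PHP_EOL;
```

The first two values pass through an int-typed parameter unchanged; only the third, which no longer fits in a native integer, is rejected.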

I think int|float $code is probably the easiest fix, but will the float value properly match the font array key in the uchr() function? We would have to see if the output is what @jesse-greathouse expects.

To me, with the int|float change, the only characters missing from the output appear to be the bullet points, and they also appear to be missing from the actual document when opened in Adobe Acrobat.

@jesse-greathouse
Author

I don't have any expectations about what the output for this PDF should be. I am converting large numbers of publicly available PDF documents to text, and the outcome is meaningless to me. I simply wanted to report the TypeError for the benefit of this project, because it seemed like it might be a bug.

@k00ni This is a publicly published PDF that was found on supremecourt.gov, so given the public nature of this document, I don't see any problem with using it in the test environment.

@k00ni
Collaborator

k00ni commented Aug 1, 2023

Thank you for the feedback.

Some observations: uchr is often used together with hexdec, which may return a float in some cases (https://www.php.net/manual/en/function.hexdec.php). It is also used in combination with character arithmetic, like self::uchr($char - $char_from + $offset);, so it's hard to "detect" when such an overflow happens.
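One way to tolerate such values is sketched below. It is not the patch merged in #623: `uchrSketch()` is a hypothetical helper, and returning an empty string for out-of-range codes is an assumption made for illustration. The sketch drops the native type declaration in favour of a docblock so that it also parses on PHP 7.x, and it relies on `mb_chr()` from the mbstring extension (PHP 7.2+).

```php
<?php
/**
 * Hypothetical tolerant variant of a uchr()-style helper (not the library's code).
 *
 * @param int|float $code docblock union keeps the signature valid on PHP 7.x
 */
function uchrSketch($code): string
{
    // Assumption: anything outside the Unicode range maps to an empty string.
    if (!is_finite((float) $code) || $code < 0 || $code > 0x10FFFF) {
        return '';
    }

    // mb_chr() returns false for invalid code points such as surrogates.
    $char = mb_chr((int) $code, 'UTF-8');

    return false === $char ? '' : $char;
}

// Character arithmetic in the style quoted above; the variable values are made up.
$char = 5;
$charFrom = 5;
$small = uchrSketch($char - $charFrom + hexdec('2082'));            // operand stays an int
$huge = uchrSketch($char - $charFrom + hexdec('8ca156e36cd54eba')); // operand becomes a float
```

Because the range check runs before the cast, an overflowed offset never reaches `(int)`, where the result of converting an out-of-range float is undefined.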

BTW, I can't find the cast to int anymore. I must have been blind when I mentioned that.

@k00ni k00ni closed this as completed in #623 Aug 3, 2023
k00ni added a commit that referenced this issue Aug 3, 2023
* Font::uchr: extends parameter $code to int|float

* Add files via upload

* fixed int|float so code runs on PHP 7.x

* add test case

* fixed coding style issue in Font.php

* added note

* Update Font.php
@k00ni
Collaborator

k00ni commented Aug 3, 2023

@jesse-greathouse A release will follow in the next few days.
