workaround for the Issue #450 #453

izabala · 2021-08-19T20:35:54Z

The sample file causes two of the Page methods to fail.

Page->extractDecodedRawData was not returning the correct string. This has been corrected.

Page->getTextArray breaks when Page->get('Contents') returns a PDFObject for which PDFObject->getTextArray($this) throws an Error. If you detect this and call PDFObject->getTextArray() instead (without the page argument), it returns the correct data. This is a workaround: what exactly differs in the format of this PDF, and why the call fails, needs deeper investigation. I ran all the PageTests and they pass.
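The fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions: StubContents and StubPage are hypothetical stand-ins written for this example, not the real Smalot\PdfParser classes, and the actual change in Page.php may look different. The stub throws an \Error whenever a page is passed, to imitate the behaviour seen with the Issue450 sample.

```php
<?php

// Hypothetical stand-in for the object returned by Page->get('Contents').
// It is NOT the real PDFObject; it only imitates the failure described above.
class StubContents
{
    public function getTextArray($page = null): array
    {
        if (null !== $page) {
            // Imitate the error reported in issue #450 when a page is passed in.
            throw new \Error('Call to a member function decodeText() on null');
        }

        // Without a page argument the call succeeds.
        return ['{signature:signer505906:Please+Sign+Here}'];
    }
}

// Hypothetical stand-in for Page, showing only the fallback logic.
class StubPage
{
    private $contents;

    public function __construct(StubContents $contents)
    {
        $this->contents = $contents;
    }

    public function getTextArray(): array
    {
        try {
            // Normal path: pass the page along.
            return $this->contents->getTextArray($this);
        } catch (\Throwable $e) {
            // Workaround: retry without the page argument.
            return $this->contents->getTextArray();
        }
    }
}

$page = new StubPage(new StubContents());
$texts = $page->getTextArray();
```

Catching \Throwable rather than only \Error means the fallback also runs for ordinary exceptions, which is a broader net than strictly needed for this one sample file.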

This happens because the sample PDF file is not formatted the way we usually see in other files. I have a similar (not exactly the same) case with a file created with FPDI that also breaks the getTextArray and getDataTm methods, but I am still researching what actually happens before I open an issue for it. As soon as I know what is happening in that case, I will open the issue, hopefully with the workaround or fix already done.

izabala · 2021-08-19T22:10:09Z

Guys, I already ran CS Fixer, but I still get the errors. Any advice, please?

k00ni · 2021-08-20T06:14:55Z

Thank you for your work @izabala!

> Guys, I already ran CS Fixer, but I still get the errors. Any advice, please?

I am busy this week, but I will try to assist you next week.

eddturtle · 2021-08-20T09:35:55Z

I grabbed a copy of your repo, @izabala, and phpcs was failing (as above). I opened the file in my editor and saved it, which automatically changed quite a few things; it now passes phpcs. Here's the updated code:

    public function testExtractDecodedRawDataIssue450()
    {
        $filename = $this->rootDir.'/samples/bugs/Issue450.pdf';
        $parser = $this->getParserInstance();
        $document = $parser->parseFile($filename);
        $pages = $document->getPages();
        $page = $pages[0];
        $extractedDecodedRawData = $page->extractDecodedRawData();
        $this->assertIsArray($extractedDecodedRawData);
        $this->assertGreaterThan(3, \count($extractedDecodedRawData));
        $this->assertIsArray($extractedDecodedRawData[3]);
        $this->assertEquals('TJ', $extractedDecodedRawData[3]['o']);
        $this->assertIsArray($extractedDecodedRawData[3]['c']);
        $this->assertIsArray($extractedDecodedRawData[3]['c'][0]);
        $this->assertEquals(3, \count($extractedDecodedRawData[3]['c'][0]));
        $this->assertEquals('{signature:signer505906:Please+Sign+Here}', $extractedDecodedRawData[3]['c'][0]['c']);
    }

    public function testGetDataTmIssue450()
    {
        $filename = $this->rootDir.'/samples/bugs/Issue450.pdf';
        $parser = $this->getParserInstance();
        $document = $parser->parseFile($filename);
        $pages = $document->getPages();
        $page = $pages[0];
        $dataTm = $page->getDataTm();
        $this->assertIsArray($dataTm);
        $this->assertEquals(1, \count($dataTm));
        $this->assertIsArray($dataTm[0]);
        $this->assertEquals(2, \count($dataTm[0]));
        $this->assertIsArray($dataTm[0][0]);
        $this->assertEquals(6, \count($dataTm[0][0]));
        $this->assertEquals(1, $dataTm[0][0][0]);
        $this->assertEquals(0, $dataTm[0][0][1]);
        $this->assertEquals(0, $dataTm[0][0][2]);
        $this->assertEquals(1, $dataTm[0][0][3]);
        $this->assertEquals(67.5, $dataTm[0][0][4]);
        $this->assertEquals(756.25, $dataTm[0][0][5]);
        $this->assertEquals('{signature:signer505906:Please+Sign+Here}', $dataTm[0][1]);
    }

Hope this helps!

As mentioned in the issue, the actual code change looks good to me.

izabala · 2021-08-20T14:18:00Z

Thanks @k00ni. I will wait; I just want to be sure what I am doing wrong!

This test is a bit wonky because it relies on memory values which may differ from system to system and run to run. Adjusted values to fix it. Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22

k00ni

@izabala thank you for the PR.

I took the liberty to directly adapt a few files in your repository (see latest commits). All tests are good now.

In ParserTest I had to refine a few lines to fix a failing test. It was not related to your changes.

I have one question (see below). Once we have resolved it, we are good to go.

src/Smalot/PdfParser/Page.php

k00ni

👍🏿 Thanks.

As always, I will keep this open for a few days so others can comment.

k00ni added the fix label Aug 20, 2021

k00ni mentioned this pull request Aug 20, 2021

Error thrown on getDataTm() - Call to a member function decodeText() on null #450

Closed

k00ni self-assigned this Aug 22, 2021

k00ni added 4 commits August 23, 2021 09:12

PageTest: attempt to fix cs issues

4d1b126

Page.php: fixed cs issues

5857e85

refined memory threshold in ParserTest::testRetainImageContentImpact

248241d

k00ni requested changes Aug 23, 2021

src/Smalot/PdfParser/Page.php (outdated)

j0k3r reviewed Aug 23, 2021

src/Smalot/PdfParser/Page.php (outdated)

Update Page.php

4f6f5a1

j0k3r reviewed Aug 23, 2021

src/Smalot/PdfParser/Page.php

izabala added 2 commits August 23, 2021 15:57

Taking out line

b044eea

Taking out the line: $decodedText = ''; This was not needed. Thanks @j0k3r

Changing the catch of the Error

8799c3d

To catching Throwable.

k00ni approved these changes Aug 24, 2021

izabala mentioned this pull request Aug 24, 2021

extractRawData, extractDecodedRawData, getDataTm and getDataXY do not work with a Pdf file produced by FPDI/FPDF #454

Closed

k00ni merged commit 5dd2329 into smalot:master Aug 27, 2021
