Fix for Issue #454 #455

izabala · 2021-08-28T01:33:58Z

This is the fix/workaround I have been working this to fix that Issue.

The file makes that 2 of the Page methods fails. The Page->extractDecodedRawData was not returning the correct string. This was corrected. The Page->getTextArray breaks when the Page->get(´Contents´) returns a PDFObject, but this object makes that the PDFObject->getTextArray($this) throw an Error. But if you detected it and instead call PDFObject->getTextArray() , it returns the correct data. This is a workaround, because, what is exactly the difference in the format of this PDF and why it fails, needs to have a more deep investigation. I run all the PageTests and they work. This happends because the sample Pdf file is not format as we usually see in other files. Actually, I have a similar (not exactly the same) case for a file created with FPDI, that also broke the getTextArray and getDataTm methods, but I am doing a research to see what is actually happends before I open an Issue for that. As soon as I know what is happening in that case, I will opened the Issue, hopefully with the workaround or fix already done.

This test is a bit wonky because it relies on memory values which may differ from system to system and run to run. Adjusted values to fix it. Ref: https://github.com/smalot/pdfparser/pull/453/checks?check_run_id=3397695916#step:6:22

@j0k3r

Taking out the line: $decodedText = ''; This was not needed. Thanks @j0k3r

To catching Throwable.

When the pdf files is produced by setasign/fpdi/fpdi or FPDF, this correct that nothing is returning by the methods. But for doing that things like to know that the producer is FPDF and the page number are required and used in conjunction with getXObjects.

izabala · 2021-08-28T01:38:01Z

Hi, @k00ni it seems that I still have the problem (when I create a new branch) with the CS Fixer. May you help me with that?

k00ni

I don't have much time these days, but I hope it helps you a bit for now.

Also, did you run php-cs-fixer in dev-tools folder? You can use the following shortcut, run it in root folder:

make run-php-cs-fixer

src/Smalot/PdfParser/Page.php

tests/Integration/PageTest.php

izabala · 2021-08-31T18:11:59Z

Ok... I am just travelling for 2 weeks. As soon as I get back, I will make all changes and tests.

izabala · 2021-09-22T18:02:38Z

I am getting back to this. I will start looking all the recommendations. Just one thing:

I have to look how to use the:
make run-php-cs-fixer

This is because I develop in a windows machine where I cann't run make, I am looking if I can install a windows tools that let me use make.

By the way... i run: dev-tools\vendor\bin\php-cs-fixer fix --verbose --dry-run
And even though it has some warnings like:

You are running PHP CS Fixer v2, which is not maintained anymore. Please update to v3. If you need help while solving warnings, ask at https://gitter.im/PHP-CS-Fixer, we will help you!

and:

- Configuration file `.php_cs` is deprecated, rename to `.php-cs-fixer.php`.
- PhpCsFixer\Config::create is deprecated since 2.17 and will be removed in 3.0, use the constructor instead.

It finished without errors.

Some of the changes asked in Github by kOOni

Other changes asked by k00ny

@k00ni

Some other @k00ni recommendations

izabala · 2021-09-24T16:45:30Z

I already installed a make tool for windows (the GNU one). But when you run the:
make run-php-cs-fixer
but it tries to run:
dev-tools/vendor/bin/php-cs-fixer fix
that in windows is not a command (the / instead of \).
So I have to run it manually ... it makes 64 changes in files that I havent touched. I will push that changes. Let me know if I have to re-do everything again.

I manually run dev-tools\vendor\bin\php-cs-fixer fix

just to make a code enhacement

izabala · 2021-09-24T17:50:39Z

Ok. Let me know if with the changes, this is Ok.

src/Smalot/PdfParser/Page.php

.vscode/launch.json

k00ni · 2021-09-27T06:49:51Z

I try to get back to you as soon as possible. Looks good so far.

@k00ni

Some corrections metions by @k00ni.

izabala

I took it out, that was a file for debugging that I forget to take it out.

k00ni

First of all, thank you for another important pull request.

Your implementation looks fine, but I would like to discuss placements of these new if-clauses which check for isFpdf. Using them this way bloats the code to an extent IMHO and makes code also a bit harder to read.

In your opinion, does it makes sense to use special functions, like getTextAreaForFPdf etc. to handle it? Or add further functions, because some code seems like (almost) the same multiple times. I would prefer the latter.

What do you think?

src/Smalot/PdfParser/Page.php

izabala · 2021-10-04T14:47:17Z

@k00ni About your comment:

In your opinion, does it makes sense to use special functions, like getTextAreaForFPdf etc. to handle it? Or add further functions, because some code seems like (almost) the same multiple times. I would prefer the latter.

What do you think?

Actually I just start doing a workaround/fix over the problem as soon as possible. Let me re-think the fix, and see if there is a better solution, now that we Know what the problem is, and how to solve it.

Give me a couple of days to work on this.

@k00ni

Follow the recomendation of @k00ni on using extra function to have the code clearer.

izabala · 2021-10-04T17:55:43Z

@k00ni , May you have a look on this approach? Let me know if this is ok or not.

Thanks for all your recommendations

src/Smalot/PdfParser/Page.php

tests/Integration/PageTest.php

k00ni · 2021-10-05T07:52:51Z

It looks better, thank you. I try to find the time to give you a more in depth feedback soon, hopefully next week, maybe soon.

@k00ni

Many changes following @k00ni recommendations.

Better explanation for the function

Changes in comments, functions names and variable names.

izabala · 2021-10-14T15:40:38Z

@k00ni Let me know if there is something more needed on this pull request. I believe I answered everything, and made all the changes that were requested.

k00ni · 2021-10-15T08:40:42Z

Let me know if there is something more needed on this pull request. I believe I answered everything, and made all the changes that were requested.

Sorry, I am currently super busy. Thank you for bearing with me. I made a few code nice ups and think that this PR is in a good state right now. If you have no objections I will merge it soon.

izabala · 2021-10-15T15:12:44Z

I am OK with this!

izabala and others added 10 commits August 19, 2021 16:34

PageTest: attempt to fix cs issues

4d1b126

Page.php: fixed cs issues

5857e85

refined memory threshold in ParserTest::testRetainImageContentImpact

248241d

Update Page.php

4f6f5a1

Taking out line

b044eea

Taking out the line: $decodedText = ''; This was not needed. Thanks @j0k3r

Changing the catch of the Error

8799c3d

To catching Throwable.

Fix/workaround for Issue #454

140b67a

When the pdf files is produced by setasign/fpdi/fpdi or FPDF, this correct that nothing is returning by the methods. But for doing that things like to know that the producer is FPDF and the page number are required and used in conjunction with getXObjects.

Merge branch 'master' of https://github.com/izabala/pdfparser

125e3bd

k00ni requested changes Aug 31, 2021

View reviewed changes

izabala added 3 commits September 22, 2021 18:26

Update Page.php

3b8bb4d

Some of the changes asked in Github by kOOni

Update Page.php

716a1f0

Other changes asked by k00ny

Some other recomendations

bb855f1

Some other @k00ni recommendations

izabala added 3 commits September 24, 2021 12:46

After manually doing php-cs-fixer

5998c82

I manually run dev-tools\vendor\bin\php-cs-fixer fix

Correcting the phpstan error

19cc481

Update Page.php

c75a1f5

just to make a code enhacement

k00ni requested changes Sep 27, 2021

View reviewed changes

src/Smalot/PdfParser/Page.php Outdated Show resolved Hide resolved

.vscode/launch.json Outdated Show resolved Hide resolved

izabala requested a review from k00ni September 27, 2021 14:48

Removing vscode\lauch.json and some corrections

c0d6235

Some corrections metions by @k00ni.

k00ni added the needs work label Sep 28, 2021

izabala commented Sep 29, 2021

View reviewed changes

k00ni reviewed Oct 1, 2021

View reviewed changes

k00ni added the needs more info label Oct 1, 2021

k00ni reviewed Oct 1, 2021

View reviewed changes

src/Smalot/PdfParser/Page.php Show resolved Hide resolved

creating some function to get this clearer

247e55c

Follow the recomendation of @k00ni on using extra function to have the code clearer.