performance degradation #401

Closed
ccdric opened this issue Feb 19, 2020 · 9 comments

@ccdric

ccdric commented Feb 19, 2020

Hello,
I have some performance issues with qpdf: some processes take several hours.
I use qpdf (in fact libqpdf directly) to split big PDFs into single-page documents (sometimes also adding an underlay).
I did a little test with the qpdf tool at different versions.
For simplicity, I chose to split into pages with qpdf --split-pages under Linux.
For that, I compiled the different versions and installed the shared lib, qpdf, and a wrapper separately for each, so that I can use those versions side by side.
Here are the results:

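| Version | V9.1.1 | V8.4.1 | V8.3.0 | V8.2.1 | V8.1.0 | V8.0.2 | V7.1.1 |
|---|---|---|---|---|---|---|---|
| Average | 42,66 | 38,83 | 38,75 | 33,48 | 33,45 | 25,23 | 13,92 |
| % added time vs previous version | 10 | 0 | 16 | 0 | 33 | 81 | 0 |
| % added time vs 7.1.1 | 207 | 179 | 178 | 141 | 140 | 81 | 0 |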
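
For readers who want to check the percentage rows, here is a minimal sketch of the arithmetic, assuming the percentages are simple ratios of the averages. The function and variable names are illustrative only and are not taken from the original scripts.

```python
# Average user times from the table above, newest release first.
averages = {
    "9.1.1": 42.66, "8.4.1": 38.83, "8.3.0": 38.75, "8.2.1": 33.48,
    "8.1.0": 33.45, "8.0.2": 25.23, "7.1.1": 13.92,
}


def percent_increase(new: float, old: float) -> float:
    """Extra time of `new` relative to `old`, in percent."""
    return (new / old - 1.0) * 100.0


versions = list(averages)  # dicts keep insertion order: newest first

# Each release compared with the next older release in the table.
vs_previous = {
    newer: percent_increase(averages[newer], averages[older])
    for newer, older in zip(versions, versions[1:])
}

# Each release compared with the 7.1.1 baseline.
vs_baseline = {
    v: percent_increase(averages[v], averages["7.1.1"]) for v in versions
}

for v in versions:
    print(v, round(vs_previous.get(v, 0.0)), round(vs_baseline[v]))
```

Recomputing from the rounded averages can differ from the table by a point in places, presumably because the original figures were derived from unrounded data.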

The methodology was simply to run a script that does the measurement (Linux time command, user time only). It performs 20 measurements for each version, alternating between versions, and writes the results to a CSV file. I then calculated the average time for each version.
(I left my computer alone and disconnected the network during the test.)
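
The methodology described above could be sketched roughly as follows. This is not the original script (its contents are not shown in the thread); `user_time`, `benchmark`, and `write_csv` are hypothetical helpers, and the sketch assumes a Unix-like system where `os.times()` reports child CPU time.

```python
import csv
import os
import subprocess


def user_time(cmd):
    """Run cmd and return the user CPU seconds consumed by the child process."""
    before = os.times().children_user
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    return os.times().children_user - before


def benchmark(commands, rounds=20):
    """commands maps a label to an argv list.

    Every round runs each command once, so the versions alternate and
    background noise is spread evenly across them.
    """
    rows = []
    for _ in range(rounds):
        rows.append({label: user_time(cmd) for label, cmd in commands.items()})
    return rows


def write_csv(rows, path):
    """Write one column per label and one row per round, tab-separated."""
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]), delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


def averages(rows):
    """Mean user time per label."""
    return {label: sum(r[label] for r in rows) / len(rows) for label in rows[0]}
```

A call might look like `benchmark({"V9.1.1": ["qpdf_9.1.1", "--split-pages", "in.pdf", "out-%d.pdf"], ...})`, where the wrapper names and file names are placeholders for whatever is installed locally.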

We can see performance decrease as the qpdf version increases.

Is there a way to improve performance in the next version? Or a way, with the current version, to code a more efficient page split and underlay/overlay?
(Some of my jobs can run for several hours, which is far too long for us.)

@jberkenbilt
Contributor

Please see --preserve-unreferenced-resources in the documentation. If you add this when splitting pages, you should see something closer to the pre 8.1 performance. Take a look at the performance with this option. That said, I'm not sure why you are seeing performance degradation other than at the jump from 8.0.2 to 8.1.0.
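
As a small illustration of the suggestion above, this sketch builds the command line with and without the option so the two can be timed against each other. The helper name and the file names are assumptions made for the example; only the two qpdf options come from the thread.

```python
def split_pages_command(qpdf, infile, out_pattern, preserve_unreferenced=True):
    """Build an argv list for splitting a PDF into pages with the qpdf CLI.

    qpdf:        path or name of the qpdf executable (placeholder)
    infile:      input PDF (placeholder)
    out_pattern: output name; qpdf substitutes page numbers for %d
    """
    cmd = [qpdf, "--split-pages"]
    if preserve_unreferenced:
        # Skip the search for unreferenced resources while splitting.
        cmd.append("--preserve-unreferenced-resources")
    cmd += [infile, out_pattern]
    return cmd


# Print both variants so they can be copied into a timing run.
print(" ".join(split_pages_command("qpdf", "input.pdf", "out-%d.pdf")))
print(" ".join(split_pages_command("qpdf", "input.pdf", "out-%d.pdf", False)))
```

The resulting lists can be passed directly to `subprocess.run` or to any timing harness.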

I need to add performance testing to my qpdf release and test process. Would you be willing to share any of what you've built already? I would like to see if I can reproduce your results.

@ccdric
Author

ccdric commented Feb 20, 2020

Hi,
Thank you for your answer.
Here are my method and scripts
(this runs on Ubuntu 19.04).
First, build the different qpdf versions and install them in a local directory:
git source directory: ~/installManuel/qpdf
my local install directory: ~/installManuel/bin
my build script: installBuildesVers
installBuildesVers.zip

Use it, for example, like this:
installBuildesVers 8.4.0
to build and install release-qpdf-8.4.0 in ~/installManuel/bin.

Once the different versions are available in the ~/installManuel/bin directory (with their respective libs):

ls -rtl ~/installManuel/bin/
total 152
-rwxr-xr-x  1 cedric cedric 6333 janv. 30 15:20 qpdf_8.4.1*
-rw-r--r--  1 cedric cedric 1321 janv. 30 15:29 qpdf_todo.txt
-rwxr-xr-x  1 cedric cedric 6333 janv. 30 15:48 qpdf_7.1.1*
drwxr-xr-x  2 cedric cedric 4096 janv. 30 18:53 libqpdf_8.4.1/
drwxr-xr-x  2 cedric cedric 4096 janv. 30 18:53 libqpdf_7.1.1/
drwxr-xr-x  2 cedric cedric 4096 janv. 30 18:54 libqpdf_9.1.1/
drwxr-xr-x  2 cedric cedric 4096 janv. 30 20:44 libqpdf_8.3.0/
-rwxr-xr-x  1 cedric cedric 6333 janv. 30 20:44 qpdf_8.3.0*
drwxr-xr-x  2 cedric cedric 4096 janv. 30 21:20 libqpdf_8.0.2/
-rwxr-xr-x  1 cedric cedric 6333 janv. 30 21:20 qpdf_8.0.2*
drwxr-xr-x  2 cedric cedric 4096 janv. 30 23:33 libqpdf_8.2.1/
-rwxr-xr-x  1 cedric cedric 6333 janv. 30 23:33 qpdf_8.2.1*
drwxr-xr-x  2 cedric cedric 4096 janv. 30 23:38 libqpdf_8.1.0/
-rwxr-xr-x  1 cedric cedric 6333 janv. 30 23:38 qpdf_8.1.0*
drwxr-xr-x  2 cedric cedric 4096 janv. 31 07:35 libqpdf_5.2.0/
-rwxr-xr-x  1 cedric cedric 6342 janv. 31 07:35 qpdf_5.2.0*
drwxr-xr-x  2 cedric cedric 4096 janv. 31 07:45 libqpdf_6.0.0/
-rwxr-xr-x  1 cedric cedric 6342 janv. 31 07:45 qpdf_6.0.0*
drwxr-xr-x  2 cedric cedric 4096 janv. 31 08:00 libqpdf_7.0.0/
-rwxr-xr-x  1 cedric cedric 6333 janv. 31 08:00 qpdf_7.0.0*
drwxr-xr-x  2 cedric cedric 4096 janv. 31 09:00 bindir/
drwxr-xr-x  2 cedric cedric 4096 janv. 31 09:00 libqpdf_8.4.2/
-rwxr-xr-x  1 cedric cedric 6333 janv. 31 09:00 qpdf_8.4.2*
-rwxr-xr-x  1 cedric cedric 6335 janv. 31 17:19 qpdf_9.1.1*
drwxr-xr-x 20 cedric cedric 4096 févr. 15 12:17 ../
-rwxr-xr-x  1 cedric cedric 2406 févr. 20 16:02 installBuildesVers*
drwxr-xr-x 14 cedric cedric 4096 févr. 20 16:02 ./

I launch the test with the second script:
testPerfQpdf.zip

Use it simply like this:
testPerfQpdf AIBUS_564_lin.pdf
and here is an example result file:
AIBUS_564_lin.benchResult.zip

cat AIBUS_564_lin.benchResult.csv 
	V9.1.1	V8.4.2	V8.2.1	V8.0.2	V7.1.1
	33,08	34,65	30,60	23,16	12,79
	46,23	39,16	33,48	28,05	15,97
	54,02	43,78	37,09	28,94	16,58
	47,88	43,84	39,75	30,53	16,85
	51,78	47,46	41,06	31,10	17,79
	54,50	56,18	53,19	35,64	20,55
	59,69	53,71	39,41	29,73	18,41
	50,11	45,48	39,27	30,81	17,98
	52,29	47,88	39,98	30,53	16,59
	48,22	49,00	54,52	37,28	20,60
	49,92	43,47	36,94	28,03	16,59
	47,49	46,49	40,59	28,78	16,09
	47,95	44,94	37,50	32,22	16,27
	48,02	45,58	39,02	29,54	15,39
	47,24	43,73	36,44	27,63	15,36
	46,70	44,02	37,42	30,25	16,48
	50,42	42,78	36,92	28,81	16,13
	46,86	43,37	37,24	27,62	15,29
	47,79	43,55	38,21	30,09	15,41
	47,59	45,68	36,85	28,19	16,54
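
A result file like the one above can be summarized with a few lines of Python. This is a sketch under assumptions: it treats the file as tab-separated with decimal commas and an empty leading field, as the listing suggests, and `load_bench_csv` and `column_means` are hypothetical helper names.

```python
import csv
import io
import statistics


def load_bench_csv(text):
    """Parse tab-separated benchmark output into {version: [times]}.

    Empty cells (such as the leading tab on each row) are ignored and
    decimal commas are converted to decimal points.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = [h.strip() for h in next(reader) if h.strip()]
    columns = {h: [] for h in header}
    for row in reader:
        cells = [c.strip() for c in row if c.strip()]
        if not cells:
            continue  # skip blank lines
        for name, cell in zip(header, cells):
            columns[name].append(float(cell.replace(",", ".")))
    return columns


def column_means(columns):
    """Average time per version."""
    return {name: statistics.mean(values) for name, values in columns.items()}


# First two measurement rows of the listing above, as sample input.
sample = (
    "\tV9.1.1\tV8.4.2\tV8.2.1\tV8.0.2\tV7.1.1\n"
    "\t33,08\t34,65\t30,60\t23,16\t12,79\n"
    "\t46,23\t39,16\t33,48\t28,05\t15,97\n"
)
means = column_means(load_bench_csv(sample))
for version, mean in means.items():
    print(f"{version}: {mean:.2f}")
```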

(On my machine it writes to a RAM disk to avoid disk lag.)

My PDF is a customer's private PDF, so I can't provide it, but I ran different tests with larger and smaller PDFs, and the result always shows the same behaviour.

With some PDFs, the relative difference between qpdf versions is worse than in the results above (about 350% longer for 9.1.1 vs 7.1.1).

For example: this pdf

I hope this is not too much of a mess and can help you :-).

I will check --preserve-unreferenced-resources and tell you afterwards.

@jberkenbilt
Contributor

@ccdric Any news on whether --preserve-unreferenced-resources helped? I'm hoping to get a qpdf release out very soon, hopefully by the end of the day tomorrow, and I plan on doing a deeper investigation on this. I'm strongly considering adding some heuristics to allow qpdf to quickly determine whether unreferenced resources are likely to occur. I am also going to attempt bisection to see if I can find the places where performance degradation was introduced.
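
The bisection idea mentioned above can be sketched generically. This is not the procedure actually used for qpdf; `first_slow` and the `is_slow` predicate are hypothetical, and the sketch assumes a single regression in the range (the first commit is fast, the last is slow, and once a commit is slow all later ones are too).

```python
def first_slow(commits, is_slow):
    """Return the index of the first slow commit via binary search.

    commits: ordered list, oldest first; commits[0] is assumed fast and
             commits[-1] is assumed slow.
    is_slow: callable that builds/benchmarks a commit and returns a bool.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] fast, commits[hi] slow
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_slow(commits[mid]):
            hi = mid
        else:
            lo = mid
    return hi


# Toy example: commits are numbered 0..9 and the regression lands at commit 6.
commits = list(range(10))
print(first_slow(commits, lambda c: c >= 6))
```

When several regressions accumulate across releases, as the measurements in this thread suggest, each interval between a fast and a slow release would need its own bisection with its own threshold.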

@jberkenbilt
Contributor

I am making good progress on tracking down the root causes of the degradation.

@jberkenbilt
Contributor

@ccdric My work branch contains fixes that have improved the performance of page splitting to a level that's about 30% worse than 7.1.1 but beats 8.2.1 by a significant margin and is about a 70% improvement over 9.1.1 (meaning 9.1.1 is 70% worse than my current code, which is 30% worse than 7.1.1).
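
Since the percentages above are chained (each is relative to a different reference), a quick worked calculation may help. It assumes "X% worse" means the run time is multiplied by (1 + X/100); the variable names are illustrative.

```python
# Normalize the 7.1.1 run time to 1.0.
v7_1_1 = 1.0

# Work branch: about 30% worse than 7.1.1.
work_branch = v7_1_1 * 1.30

# 9.1.1: about 70% worse than the work branch.
v9_1_1 = work_branch * 1.70

# Relative to 7.1.1, the two factors multiply rather than add.
ratio_9_vs_7 = v9_1_1 / v7_1_1

# Share of the 9.1.1 run time saved by moving to the work branch.
saved_fraction = 1.0 - work_branch / v9_1_1

print(f"9.1.1 vs 7.1.1: {ratio_9_vs_7:.2f}x")
print(f"time saved vs 9.1.1: {saved_fraction:.1%}")
```

Under that reading, 9.1.1 takes about 2.21 times as long as 7.1.1 for this workload, and moving from 9.1.1 to the work branch saves roughly 41% of the run time.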

I'm not sure I will be able to get it much below this. Two of the commits that slowed the performance relative to 7.1.1 are important bug fixes.

In one case, qpdf was generating invalid output in the case of a file that contained an indirect reference to a non-existent object. The PDF spec explicitly allows this, but qpdf could in some cases overwrite such an object. These files are rare, but unfortunately I can't ignore this case, and detecting this case incurs about an 8% overhead.

In the second case, qpdf was allocating too much memory for arrays that are "sparse", which is a pattern that sometimes shows up. I made a new implementation for the qpdf array that handles sparse arrays better, but unfortunately it's a little worse than using a vector. However, I tweaked it a little to get it down a bit from 9.1.1, so that's still only adding about a 5% overhead.
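
To illustrate the memory/speed trade-off described above, here is a toy sparse array in Python. It is only a sketch of the general idea (store the non-null entries in a map and keep the logical length separately) and is not qpdf's actual implementation; the class and method names are invented for the example.

```python
class SparseArray:
    """Array that stores only non-null items, keyed by index.

    Memory grows with the number of stored items rather than the logical
    length, at the cost of a hash lookup on every access instead of a
    direct vector index.
    """

    def __init__(self, size=0):
        self._size = size
        self._items = {}

    def __len__(self):
        return self._size

    def _check(self, index):
        if not 0 <= index < self._size:
            raise IndexError(index)

    def __getitem__(self, index):
        self._check(index)
        return self._items.get(index)  # None stands in for a null object

    def __setitem__(self, index, value):
        self._check(index)
        if value is None:
            self._items.pop(index, None)  # nulls are never stored
        else:
            self._items[index] = value

    def append(self, value):
        self._size += 1
        if value is not None:
            self._items[self._size - 1] = value

    def stored(self):
        """Number of entries that actually occupy memory."""
        return len(self._items)


arr = SparseArray(1_000_000)
arr[10] = "x"
print(len(arr), arr.stored())
```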

The other thing I haven't reverted is a change I made to significantly improve the diagnostics of invalid objects. This change adds about a 3% overhead. I tried adding some code that would allow users to turn this off, but once the code is refactored to make this selectable at runtime, turning it off only saves about 1.5%, and the value of the diagnostic messages is very great. My ability to help people with problems in their PDFs would be greatly reduced if I removed this.

So, when I release qpdf 10.0.0, you can count on its performing better than any release since 7.1.1. It will not quite reach the 7.1.1 level of performance, but the additional slowness comes along with more robustness and much better diagnostics. Hopefully it's a tolerable trade-off, especially since this is going to be vastly superior to 9.1.1.

I am also going to bake some performance benchmarks into my release process so that I will not accidentally break performance as I have done in the past few years. Hopefully, over time, there will be opportunities for further optimization.

@jberkenbilt
Contributor

Note that this performance is obtained with --preserve-unreferenced-resources. I am also going to try to add some code to help qpdf do that more efficiently, perhaps having it do a quick assessment of the likelihood of duplicated objects. Right now, qpdf detects and removes unreferenced resources when splitting unless you tell it not to, and the process of searching for them is expensive but extremely valuable for certain types of files. However, files that have unreferenced resources almost always have shared resource dictionaries, and qpdf can detect that much more quickly than it can remove unreferenced resources. I will leave in the option to explicitly select the correct behavior, but I will also have the default be to NOT do the expensive process in the case in which no duplicated resources are found. This analysis will be a little slower than --preserve-unreferenced-resources but, for most files, the default performance will far exceed 9.1.1.
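
The decision logic described above can be modeled with a toy example. This is a simplified sketch, not qpdf's code: each page is represented only by the object ID of its resource dictionary, and the function names and the `mode` labels are assumptions made for illustration.

```python
from collections import Counter


def has_shared_resources(page_resource_ids):
    """Cheap check: is any resource dictionary used by more than one page?"""
    counts = Counter(page_resource_ids)
    return any(n > 1 for n in counts.values())


def should_remove_unreferenced(page_resource_ids, mode="auto"):
    """Decide whether to run the expensive unreferenced-resource removal.

    mode: "yes" forces it, "no" skips it, "auto" runs it only when the
          cheap shared-resource check suggests it could matter.
    """
    if mode == "yes":
        return True
    if mode == "no":
        return False
    return has_shared_resources(page_resource_ids)


# Every page has its own resource dictionary: skip the expensive pass.
print(should_remove_unreferenced([11, 12, 13]))
# Pages 1 and 3 share dictionary 11: run the expensive pass.
print(should_remove_unreferenced([11, 12, 11]))
```

The point of the design is that the shared-dictionary check is a single pass over the pages, while the removal step has to inspect what each page actually uses.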

@ccdric
Author

ccdric commented Apr 3, 2020

Thanks, Jay, for your work!
Sorry for not answering earlier.
I did the test you asked for but never took the time to put the results into a human-readable report. Happy to test your changes in a few days.

Do you know roughly when you plan to release 10.0.0?

@jberkenbilt
Contributor

I am hoping to get 10.0.0 out today, but it should be no later than early next week. I have a handful of other issues to look at before I get it out the door.

I'm going to go ahead and close this since I have probably squeezed about as much out as I can without major, high-risk work. Thanks again for sharing your findings and technique and opening my eyes to the severity of this issue. Before I release 10, I will have a simple performance benchmarking procedure in my release process. I don't have a way to run it in CI at this time, but I will do it manually just like I do for binary compatibility testing. Also I will have a very easy way to test whether a specific commit had an impact on performance. So this should be the end of surprise performance degradation.

Feel free to comment and/or reopen if necessary.

@jberkenbilt
Contributor

@ccdric In qpdf 10, qpdf will analyze files when run with --split-pages to determine whether there are any shared resources, and only do the expensive detection and removal of unreferenced resources if shared resources are found. This heuristic should be right most of the time and will generally make it so people don't have to care about --preserve-unreferenced-resources. It is still possible to force explicit behavior. For details, see the release notes for qpdf 10 (to be released soon).
