NewPdfReaderLazy supports reading PDF files in lazy-load mode #409

gunnsth · 2019-04-09T23:13:16Z

The NewPdfReader still implements the traditional approach of loading the entire structure at opening time. Whereas, NewPdfReaderLazy only accesses objects on an as-needed basis.

The non-lazy reader may be more efficient whereas the lazy reader should reduce memory usage when only using parts of the PDF.

Changed ColorSpace Resources to a core.PdfObject and added a getter which loads the PdfObject to a colorspace resource model (*PdfPageResourcesColorspaces). Reason is that colorspaces can be quite heavy and in large files this slows down loading significantly.

Similarly changed page annotations Annots to a core.PdfObject and a getter GetAnnotations which loads the PDFObject to a slice of PdfAnnotations.

Added a function to Logger interface to check loglevel, used to avoid calling resource intensive functions for outputing Trace logs.

Addresses #128 .

codecov · 2019-04-09T23:13:29Z

Codecov Report

Merging #409 into v3 will increase coverage by 4.19%.
The diff coverage is 77.66%.

@@            Coverage Diff             @@
##               v3     #409      +/-   ##
==========================================
+ Coverage   55.63%   59.83%   +4.19%     
==========================================
  Files         153      153              
  Lines       27531    27596      +65     
==========================================
+ Hits        15318    16513    +1195     
- Misses      10345    10697     +352     
+ Partials     1868      386    -1482

Impacted Files	Coverage Δ
pdf/core/symbols.go	`77.77% <ø> (-3.48%)`	⬇️
pdf/model/pattern.go	`0% <ø> (ø)`	⬆️
pdf/core/crypt.go	`75.51% <ø> (+20.56%)`	⬆️
pdf/extractor/image.go	`78.44% <0%> (+12.36%)`	⬆️
pdf/model/flatten.go	`0% <0%> (ø)`	⬆️
pdf/contentstream/inline-image.go	`43.29% <0%> (+9.19%)`	⬆️
pdf/model/utils.go	`38.88% <100%> (ø)`	⬆️
pdf/contentstream/processor.go	`59.75% <100%> (+9.45%)`	⬆️
pdf/creator/utils.go	`87.5% <100%> (+16.07%)`	⬆️
common/logging.go	`51.02% <100%> (+11.02%)`	⬆️
... and 119 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 852b515...e8fe202. Read the comment docs.

peterwilliams97

The code looks really good.

Provided via new initialier NewPdfReaderLazy. NewPdfReader still loads entire PDF structure upon loading.

Reduced from 17s -> 10.6s

Colorspaces can be very complex so deal with them as PdfObjects unless need to work with the content.

…ion.

… reading

Avoid going through annotations on page loading, handle as generic PdfObject and load when GetAnnotations is called on the page.

Provided via new initialier NewPdfReaderLazy. NewPdfReader still loads entire PDF structure upon loading.

gunnsth · 2019-04-14T13:00:02Z

The e2e test TestSplitting is favorable for lazy-loading as it works with only a small part of a PDF (for multi page files). Thus it is expected to showcase significant improvements.

In v3 branch (not lazy):

--- PASS: TestSplitting (133.86s)
    split_test.go:70: 004feecd47e2da4f2ed5cdbbf4791a77dd59ce20.pdf
    split_test.go:147: Duration: 0.45 seconds. Alloc: 30.91 MB.
    split_test.go:70: 30c0a5cff80870cd58c2738d622f5d63e37dc90c.pdf
    split_test.go:147: Duration: 23.60 seconds. Alloc: 146.02 MB.
    split_test.go:70: 371dce2c2720581a3eef3f123e5741dd3566ef87.pdf
    split_test.go:147: Duration: 2.13 seconds. Alloc: 94.75 MB.
    split_test.go:70: 3bf64014e0c9e4a56f1a9363f1b34fd707bd9fa0.pdf
    split_test.go:147: Duration: 0.02 seconds. Alloc: 1.41 MB 
    split_test.go:70: 8f8ce400b9d66656cd09260035aa0cc3f7e46c82.pdf
    split_test.go:147: Duration: 0.00 seconds. Alloc: 0.86 MB 
    split_test.go:70: a35d386af4828b7221591343761191e8f9a28bc0.pdf
    split_test.go:147: Duration: 0.01 seconds.  Alloc: 0.87 MB
    split_test.go:70: bf7c9d5dabc7e7ec2fc0cf9db2d9c8e7aa456fca.pdf
    split_test.go:147: Duration: 4.12 seconds. Alloc: 204.16 MB
    split_test.go:70: e815311526b50036db6e89c54af2b9626edecf30.pdf
    split_test.go:147: Duration: 0.14 seconds.  Alloc: 10.02 MB
    split_test.go:70: e815699a5234540fda89ea3a2ece055349a0d535.pdf
    split_test.go:147: Duration: 0.01 seconds.   Alloc: 0.91 MB 
    split_test.go:94: Split benchmark complete for 9 files
PASS
ok  	github.com/unidoc/unidoc/pdf/internal/e2etest	143.063s

v3-lazyloading (with lazy reader):

--- PASS: TestSplitting (81.12s)
    split_test.go:70: 004feecd47e2da4f2ed5cdbbf4791a77dd59ce20.pdf
    split_test.go:147: Duration: 0.03 seconds (was 0.45 seconds). Alloc: 7.56 MB (was 30.91 MB)
    split_test.go:70: 30c0a5cff80870cd58c2738d622f5d63e37dc90c.pdf
    split_test.go:147: Duration: 1.20 seconds (was 23.60 seconds). Alloc: 75.80 MB (was 146.02 MB)
    split_test.go:70: 371dce2c2720581a3eef3f123e5741dd3566ef87.pdf
    split_test.go:147: Duration: 0.11 seconds (was 2.13 seconds). Alloc: 10.70 MB (was 94.75 MB)
    split_test.go:70: 3bf64014e0c9e4a56f1a9363f1b34fd707bd9fa0.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.02 seconds). Alloc: 1.36 MB (was 1.41MB)
    split_test.go:70: 8f8ce400b9d66656cd09260035aa0cc3f7e46c82.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.00 seconds). Alloc: 0.86 MB (was 0.86MB)
    split_test.go:70: a35d386af4828b7221591343761191e8f9a28bc0.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.01 seconds). Alloc: 0.87 MB (was 0.87MB)
    split_test.go:70: bf7c9d5dabc7e7ec2fc0cf9db2d9c8e7aa456fca.pdf
    split_test.go:147: Duration: 0.07 seconds (was 4.12 seconds). Alloc: 5.53 MB (was 204.16MB).
    split_test.go:70: e815311526b50036db6e89c54af2b9626edecf30.pdf
    split_test.go:147: Duration: 0.01 seconds (was 0.14 seconds). Alloc: 1.30 MB (was 10.02 MB)
    split_test.go:70: e815699a5234540fda89ea3a2ece055349a0d535.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.01 seconds). Alloc: 0.91 MB (was 0.91 MB)
    split_test.go:94: Split benchmark complete for 9 files
PASS
ok  	github.com/unidoc/unidoc/pdf/internal/e2etest	86.152s

Significant improvements in memory use and speed. Note that the reason for the long time the test is taking is that the input and output files are validated with ghostscript. In addition, debug.FreeOSMemory() is called prior to each test file to ensure consistent memory measurements.

… Trace mode - Add IsLogLevel function to logger. Can be used to avoid calling resource intensive functions except when running trace only. - Remove large testdata file and generate data for test case dynamically instead for TestBigDictParse - Small improvement in String() for dictionary

adrg

The code looks great. Seems like a lot of unnecessary code has been removed.

adrg · 2019-04-14T18:01:09Z

pdf/core/primitives.go

+		return MakeNull()
+	}
+	if obj == nil {
+		common.Log.Debug("ERROR resolving reference: nil object - returning a null object", err)


Extra err argument.

common.Log.Debug("ERROR resolving reference: nil object - returning a null object")

gunnsth requested a review from peterwilliams97 April 10, 2019 17:52

peterwilliams97 approved these changes Apr 12, 2019

View reviewed changes

gunnsth requested a review from adrg April 12, 2019 14:38

gunnsth added 12 commits April 14, 2019 12:13

Initial lazy loading support

54cc031

Provided via new initialier NewPdfReaderLazy. NewPdfReader still loads entire PDF structure upon loading.

Clean up

e3b7755

Clean up

c7bd4c9

Avoid traversing same trees multiple times.

838c2ad

Enhanced dictionary parsing and testcase for parsing big dict.

93fe4a0

Reduced from 17s -> 10.6s

Change colorspace loading to lazy. Clean up page object loading.

ef61f2f

Colorspaces can be very complex so deal with them as PdfObjects unless need to work with the content.

Clear up reference resolving and remove redundant traceToObject funct…

b7daa84

…ion.

Update hashes for split test due to changes in colorspace handling on…

94ac130

… reading

Update split test to use lazy reader

f9218ee

Make annotations lazy

b718293

Avoid going through annotations on page loading, handle as generic PdfObject and load when GetAnnotations is called on the page.

Initial lazy loading support

a936000

Provided via new initialier NewPdfReaderLazy. NewPdfReader still loads entire PDF structure upon loading.

Clean up

9fe33de

gunnsth force-pushed the v3-lazyloading branch from 30acc46 to 9fe33de Compare April 14, 2019 12:23

gunnsth added 2 commits April 14, 2019 14:29

Clean up

e8fe202

adrg approved these changes Apr 14, 2019

View reviewed changes

gunnsth added this to the v3.0.0-rc.1 milestone Apr 14, 2019

gunnsth added 3 commits April 14, 2019 21:12

Consistency in function naming

d45aaaf

Fix PR comment

b40c8c3

Merge branch 'v3' into v3-lazyloading

029245b

gunnsth merged commit f8ac397 into v3 Apr 14, 2019

gunnsth deleted the v3-lazyloading branch April 14, 2019 22:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NewPdfReaderLazy supports reading PDF files in lazy-load mode #409

NewPdfReaderLazy supports reading PDF files in lazy-load mode #409

gunnsth commented Apr 9, 2019 •

edited

codecov bot commented Apr 9, 2019 •

edited

peterwilliams97 left a comment

gunnsth commented Apr 14, 2019 •

edited

adrg left a comment

adrg Apr 14, 2019

NewPdfReaderLazy supports reading PDF files in lazy-load mode #409

NewPdfReaderLazy supports reading PDF files in lazy-load mode #409

Conversation

gunnsth commented Apr 9, 2019 • edited

codecov bot commented Apr 9, 2019 • edited

Codecov Report

peterwilliams97 left a comment

Choose a reason for hiding this comment

gunnsth commented Apr 14, 2019 • edited

adrg left a comment

Choose a reason for hiding this comment

adrg Apr 14, 2019

Choose a reason for hiding this comment

gunnsth commented Apr 9, 2019 •

edited

codecov bot commented Apr 9, 2019 •

edited

gunnsth commented Apr 14, 2019 •

edited