Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NewPdfReaderLazy supports reading PDF files in lazy-load mode #409

Merged
merged 17 commits into from
Apr 14, 2019

Conversation

gunnsth
Copy link
Contributor

@gunnsth gunnsth commented Apr 9, 2019

The NewPdfReader still implements the traditional approach of loading the entire structure at opening time. Whereas, NewPdfReaderLazy only accesses objects on an as-needed basis.

The non-lazy reader may be more efficient whereas the lazy reader should reduce memory usage when only using parts of the PDF.

Changed ColorSpace Resources to a core.PdfObject and added a getter which loads the PdfObject to a colorspace resource model (*PdfPageResourcesColorspaces). Reason is that colorspaces can be quite heavy and in large files this slows down loading significantly.

Similarly changed page annotations Annots to a core.PdfObject and a getter GetAnnotations which loads the PDFObject to a slice of PdfAnnotations.

Added a function to Logger interface to check loglevel, used to avoid calling resource intensive functions for outputing Trace logs.

Addresses #128 .

@codecov
Copy link

codecov bot commented Apr 9, 2019

Codecov Report

Merging #409 into v3 will increase coverage by 4.19%.
The diff coverage is 77.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##               v3     #409      +/-   ##
==========================================
+ Coverage   55.63%   59.83%   +4.19%     
==========================================
  Files         153      153              
  Lines       27531    27596      +65     
==========================================
+ Hits        15318    16513    +1195     
- Misses      10345    10697     +352     
+ Partials     1868      386    -1482
Impacted Files Coverage Δ
pdf/core/symbols.go 77.77% <ø> (-3.48%) ⬇️
pdf/model/pattern.go 0% <ø> (ø) ⬆️
pdf/core/crypt.go 75.51% <ø> (+20.56%) ⬆️
pdf/extractor/image.go 78.44% <0%> (+12.36%) ⬆️
pdf/model/flatten.go 0% <0%> (ø) ⬆️
pdf/contentstream/inline-image.go 43.29% <0%> (+9.19%) ⬆️
pdf/model/utils.go 38.88% <100%> (ø) ⬆️
pdf/contentstream/processor.go 59.75% <100%> (+9.45%) ⬆️
pdf/creator/utils.go 87.5% <100%> (+16.07%) ⬆️
common/logging.go 51.02% <100%> (+11.02%) ⬆️
... and 119 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 852b515...e8fe202. Read the comment docs.

Copy link
Collaborator

@peterwilliams97 peterwilliams97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks really good.

@gunnsth gunnsth requested a review from adrg April 12, 2019 14:38
Provided via new initialier NewPdfReaderLazy.  NewPdfReader still loads entire PDF structure upon  loading.
Colorspaces can be very complex so deal with them as PdfObjects unless need to work with the content.
Avoid going through annotations on page loading, handle as generic PdfObject and load when GetAnnotations is called on the page.
Provided via new initialier NewPdfReaderLazy.  NewPdfReader still loads entire PDF structure upon  loading.
@gunnsth
Copy link
Contributor Author

gunnsth commented Apr 14, 2019

The e2e test TestSplitting is favorable for lazy-loading as it works with only a small part of a PDF (for multi page files). Thus it is expected to showcase significant improvements.

In v3 branch (not lazy):

--- PASS: TestSplitting (133.86s)
    split_test.go:70: 004feecd47e2da4f2ed5cdbbf4791a77dd59ce20.pdf
    split_test.go:147: Duration: 0.45 seconds. Alloc: 30.91 MB.
    split_test.go:70: 30c0a5cff80870cd58c2738d622f5d63e37dc90c.pdf
    split_test.go:147: Duration: 23.60 seconds. Alloc: 146.02 MB.
    split_test.go:70: 371dce2c2720581a3eef3f123e5741dd3566ef87.pdf
    split_test.go:147: Duration: 2.13 seconds. Alloc: 94.75 MB.
    split_test.go:70: 3bf64014e0c9e4a56f1a9363f1b34fd707bd9fa0.pdf
    split_test.go:147: Duration: 0.02 seconds. Alloc: 1.41 MB 
    split_test.go:70: 8f8ce400b9d66656cd09260035aa0cc3f7e46c82.pdf
    split_test.go:147: Duration: 0.00 seconds. Alloc: 0.86 MB 
    split_test.go:70: a35d386af4828b7221591343761191e8f9a28bc0.pdf
    split_test.go:147: Duration: 0.01 seconds.  Alloc: 0.87 MB
    split_test.go:70: bf7c9d5dabc7e7ec2fc0cf9db2d9c8e7aa456fca.pdf
    split_test.go:147: Duration: 4.12 seconds. Alloc: 204.16 MB
    split_test.go:70: e815311526b50036db6e89c54af2b9626edecf30.pdf
    split_test.go:147: Duration: 0.14 seconds.  Alloc: 10.02 MB
    split_test.go:70: e815699a5234540fda89ea3a2ece055349a0d535.pdf
    split_test.go:147: Duration: 0.01 seconds.   Alloc: 0.91 MB 
    split_test.go:94: Split benchmark complete for 9 files
PASS
ok  	github.com/unidoc/unidoc/pdf/internal/e2etest	143.063s

v3-lazyloading (with lazy reader):

--- PASS: TestSplitting (81.12s)
    split_test.go:70: 004feecd47e2da4f2ed5cdbbf4791a77dd59ce20.pdf
    split_test.go:147: Duration: 0.03 seconds (was 0.45 seconds). Alloc: 7.56 MB (was 30.91 MB)
    split_test.go:70: 30c0a5cff80870cd58c2738d622f5d63e37dc90c.pdf
    split_test.go:147: Duration: 1.20 seconds (was 23.60 seconds). Alloc: 75.80 MB (was 146.02 MB)
    split_test.go:70: 371dce2c2720581a3eef3f123e5741dd3566ef87.pdf
    split_test.go:147: Duration: 0.11 seconds (was 2.13 seconds). Alloc: 10.70 MB (was 94.75 MB)
    split_test.go:70: 3bf64014e0c9e4a56f1a9363f1b34fd707bd9fa0.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.02 seconds). Alloc: 1.36 MB (was 1.41MB)
    split_test.go:70: 8f8ce400b9d66656cd09260035aa0cc3f7e46c82.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.00 seconds). Alloc: 0.86 MB (was 0.86MB)
    split_test.go:70: a35d386af4828b7221591343761191e8f9a28bc0.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.01 seconds). Alloc: 0.87 MB (was 0.87MB)
    split_test.go:70: bf7c9d5dabc7e7ec2fc0cf9db2d9c8e7aa456fca.pdf
    split_test.go:147: Duration: 0.07 seconds (was 4.12 seconds). Alloc: 5.53 MB (was 204.16MB).
    split_test.go:70: e815311526b50036db6e89c54af2b9626edecf30.pdf
    split_test.go:147: Duration: 0.01 seconds (was 0.14 seconds). Alloc: 1.30 MB (was 10.02 MB)
    split_test.go:70: e815699a5234540fda89ea3a2ece055349a0d535.pdf
    split_test.go:147: Duration: 0.00 seconds (was 0.01 seconds). Alloc: 0.91 MB (was 0.91 MB)
    split_test.go:94: Split benchmark complete for 9 files
PASS
ok  	github.com/unidoc/unidoc/pdf/internal/e2etest	86.152s

Significant improvements in memory use and speed. Note that the reason for the long time the test is taking is that the input and output files are validated with ghostscript. In addition, debug.FreeOSMemory() is called prior to each test file to ensure consistent memory measurements.

… Trace mode

- Add IsLogLevel function to logger.  Can be used to avoid calling resource intensive functions except when running trace only.
- Remove large testdata file and generate data for test case dynamically instead for TestBigDictParse
- Small improvement in String() for dictionary
Copy link
Collaborator

@adrg adrg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great. Seems like a lot of unnecessary code has been removed.

return MakeNull()
}
if obj == nil {
common.Log.Debug("ERROR resolving reference: nil object - returning a null object", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra err argument.

common.Log.Debug("ERROR resolving reference: nil object - returning a null object")

@gunnsth gunnsth added this to the v3.0.0-rc.1 milestone Apr 14, 2019
@gunnsth gunnsth merged commit f8ac397 into v3 Apr 14, 2019
@gunnsth gunnsth deleted the v3-lazyloading branch April 14, 2019 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants