Extremely slow opening of CBZ with JP2 images #1922

Open
vrubleg opened this issue Apr 9, 2021 · 7 comments

Comments

@vrubleg

vrubleg commented Apr 9, 2021

There are a lot of scanned magazines on archive.org that are available as zip archives of JP2 (JPEG 2000) images. SumatraPDF is able to open them as CBZ files, but it takes a couple of minutes just to open a file, and switching between pages is also very slow. It seems to decode every image in the archive on open, and the JP2 decoder is extremely slow.

How to reproduce:

  1. Download this archive: https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997_jp2.zip
  2. Change file extension to CBZ.
  3. Try to open it using SumatraPDF.
@vrubleg
Author

vrubleg commented Apr 9, 2021

Another issue, probably related: there are _text.pdf files on archive.org for magazines, and they also render very slowly. Example: https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997_text.pdf

The same magazine in the regular PDF and DjVu versions renders quickly. Examples:
https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997.djvu
https://archive.org/download/russian-Katalog_Lego-1997/russian-Katalog_Lego-1997.pdf

It seems the _text.pdf version uses JPEG 2000 as its image codec, which is why it is so slow. All these file types are standard for archive.org, and there are hundreds of scanned magazines that use the JPEG 2000 format. It would be worth considering a faster decoder.

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Apr 9, 2021

I avoid overly compressed files when possible. They were necessary 40 years ago, when 9600 baud modems made it essential to minimise transmission times, but they are literally a waste of time in this day and age.
I understand the Internet Archive would tend to use near-lossless storage, but 1 hour of compressing against millions of user-hours of decompressing makes no sense at all.

From Wikimill article on jp2 "Image compression is a type of data compression applied to digital images, to reduce their cost for storage or transmission." (Thus no consideration of end users needs, after all they have paid for the privilege to bog themselves down.)

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Apr 9, 2021

Interesting as to where those docs came from (clearly an amateur, as the 56-page spread was hastily scanned as 58 images).
CbZip smallest Page.4 (5) is 5000 x 4600 pixels, which translates into a poster-size page of 1322.9 mm x 1217.1 mm. My dining table is roughly that size! So you would need a 16K monitor for it to be of any value.

In _text.pdf, cover page 1 and Page.4 report they are 423.3 mm x 389.5 mm, so more like my coffee table. If saved as lossless PNG (even with all the JPEG garbage) at that giant size, as a 190 MB CBZ, they display instantly, the same as the crappy JPEG version.
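
For reference, both sets of millimetre figures above are consistent with the same 5000 x 4600 pixel image at two different resolutions. The DPI values in this sketch (96 and 300) are my inference from the numbers, not something stated in the files or the thread:

```python
MM_PER_INCH = 25.4


def pixels_to_mm(pixels: int, dpi: float) -> float:
    """Convert a pixel count to millimetres at the given resolution."""
    return pixels / dpi * MM_PER_INCH


# 96 and 300 dpi are assumed values that happen to reproduce the quoted sizes.
for dpi in (96, 300):
    width_mm = pixels_to_mm(5000, dpi)
    height_mm = pixels_to_mm(4600, dpi)
    print(f"{dpi} dpi: {width_mm:.1f} mm x {height_mm:.1f} mm")
# prints:
# 96 dpi: 1322.9 mm x 1217.1 mm
# 300 dpi: 423.3 mm x 389.5 mm
```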

JPEG is best for photos, NOT doc pages. PNG is best for most colour documents, especially printed ones. LuraDoc Recoded into higher density compression which was primarily designed to handle tiled mapping / aero photos to be radio downlinked and viewed a few at a time, so my view is it should not be used except for single satellite images where you need to pick out the buildings in detail and can wait a while. I guess the Inter Galactic Archive will need to keep them in that format for interplanetary reading.

@GitHubRulesOK
Collaborator

I ran the 56 pages through IrfanView to convert those last-century J2K wavelets into modern WebP, so this CBZ will work in SumatraPDF but will not work in older CB readers.

SumatraPDF-56xWebp.zip

Also, there are much better compressors now, so try this format: it's much smaller but not extreme, avoiding the decompression chamber delay.
56images_compressed.pdf

@vrubleg
Author

vrubleg commented Apr 10, 2021

I don't create these files, so I can't choose some other format for the images. I just need to view these magazines from archive.org as is. In most cases, archive.org stores original magazine scans as JP2 files, and it would be really nice if SumatraPDF didn't slow down that much in this case.

@GitHubRulesOK
Collaborator

GitHubRulesOK commented Apr 10, 2021

That LuraDoc format is proprietary, so if badly applied it leaves many non-commercial libraries having to find workarounds. MuPDF have withdrawn their notice of Luratech code inclusion, so I suspect the situation may be a "won't fix", as the code is unlikely to become FOSS.
I noted that Internet Archive throttle their downloads, so they are slow even on a high-speed downlink. Thus I found the quickest method was to send the URL to a cloud de/compressor, wait a few minutes for them to suffer the throttling, and then quickly download their bigger / faster decompressed file at high speed.

@kjk
Member

kjk commented May 6, 2021

It's true that the way we open archives with images is sub-optimal.

To do the layout, we need to know the size of all pages. Currently we decompress the full JP2 image to get its size. It's probably possible to get the size by reading only the header of the image.
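
As a rough illustration of the header-only idea (not SumatraPDF code, which is C++), here is a minimal Python sketch that walks the JP2 box structure and reads width and height from the `ihdr` box without touching the codestream. The function names `jp2_dimensions` and `cbz_page_sizes` are made up for this example, and it only handles the boxed `.jp2` container, not a raw `.j2k` codestream:

```python
import struct
import zipfile
from typing import BinaryIO, Dict, Optional, Tuple


def jp2_dimensions(stream: BinaryIO) -> Optional[Tuple[int, int]]:
    """Return (width, height) from the JP2 'ihdr' box, or None if not found."""
    while True:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            return None  # ran out of data before finding 'ihdr'
        length, box_type = struct.unpack(">I4s", header)
        header_size = 8
        if length == 1:  # box uses a 64-bit extended length field
            ext = stream.read(8)
            if len(ext) < 8:
                return None
            (length,) = struct.unpack(">Q", ext)
            header_size = 16
        if box_type == b"ihdr":
            payload = stream.read(8)
            if len(payload) < 8:
                return None
            # The image header box stores height first, then width.
            height, width = struct.unpack(">II", payload)
            return width, height
        if box_type == b"jp2h":
            continue  # superbox: its child boxes follow immediately
        if box_type == b"jp2c" or length == 0 or length < header_size:
            return None  # reached the codestream or a malformed box
        stream.seek(start + length)  # skip this box's payload


def cbz_page_sizes(path) -> Dict[str, Optional[Tuple[int, int]]]:
    """Map each .jp2 entry in a zip/CBZ to its (width, height)."""
    sizes = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".jp2"):
                with archive.open(name) as entry:
                    sizes[name] = jp2_dimensions(entry)
    return sizes
```

Only the first few boxes of each entry are read, so the cost per page should be small compared with a full wavelet decode. Note that seeking inside a zip entry needs Python 3.7 or later.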

We should also remember mediaboxes for all pages in settings so that the second open doesn't even need to decompress.
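
One possible shape for that cache, sketched in Python. The JSON file, the key format (path, size, modification time) and the names `cache_key` and `page_sizes` are all assumptions for illustration; SumatraPDF's actual settings format is different:

```python
import json
import os
from typing import Callable, List, Tuple

Size = Tuple[int, int]


def cache_key(path: str) -> str:
    """Identify a file by path, size and mtime so edits invalidate the entry."""
    st = os.stat(path)
    return f"{os.path.abspath(path)}|{st.st_size}|{st.st_mtime_ns}"


def page_sizes(path: str, cache_file: str,
               measure: Callable[[str], List[Size]]) -> List[Size]:
    """Return page sizes, calling the (slow) measure function only on a miss."""
    try:
        with open(cache_file, "r", encoding="utf-8") as f:
            cache = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        cache = {}
    key = cache_key(path)
    if key in cache:
        return [tuple(size) for size in cache[key]]
    sizes = measure(path)  # hypothetical callback that inspects every page
    cache[key] = [list(size) for size in sizes]
    with open(cache_file, "w", encoding="utf-8") as f:
        json.dump(cache, f)
    return sizes
```

Keying on size and modification time is a cheap way to notice that a file was replaced; a content hash would be more robust but means reading the whole archive.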

Also, we should load / decompress on a background thread and decompress images to memory so that we don't need to keep the archive open / available anymore.
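
A minimal Python sketch of the background-decode idea, assuming a hypothetical `decode(page_index)` callback that returns the decoded image bytes. `PageLoader` is an illustrative name, and `prefetch` is assumed to be called from a single (UI) thread:

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class PageLoader:
    """Decode pages on worker threads and keep the results in memory."""

    def __init__(self, decode: Callable[[int], bytes], workers: int = 2):
        self._decode = decode  # hypothetical slow decoder, e.g. JP2 -> bitmap
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending: Dict[int, Future] = {}

    def prefetch(self, page: int) -> Future:
        """Start decoding a page if that has not been requested yet."""
        if page not in self._pending:
            self._pending[page] = self._pool.submit(self._decode, page)
        return self._pending[page]

    def get(self, page: int) -> bytes:
        """Block until the page is decoded, then return it."""
        return self.prefetch(page).result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

A viewer could call `prefetch` for the neighbours of the current page and `get` only for the page being shown. This sketch never evicts decoded pages, so memory use grows with the page count; a real implementation would need a limit.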

Not trivial.
