Order of files in zip are important for streaming use cases #1309

Open
codedread opened this issue Jan 10, 2020 · 13 comments

@codedread codedread commented Jan 10, 2020

It seems to me that the order of files within the EPUB Open Container Format [1] can be important in the context of streaming the book to the reader.

The spec seems to only suggest that the "mimetype" file must be the first file [2].

However, imagine a 90MB zip file whose byte order is such that the files are in this order:

  • mimetype
  • ... all content files in a random order...
  • OEBPS/content.opf
  • META-INF/container.xml

This means that the reader app has to slurp in the entire zip file byte stream before it can even start rendering the very first page.
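As a rough way to see this effect on an existing file, the sketch below reports how many leading bytes of an archive must arrive before a given entry is complete. It is an offline diagnostic only (it reads the whole archive and uses the central directory), and the function name `bytes_needed_for` is made up for this illustration. ZIP64 archives are not handled.

```python
import io
import struct
import zipfile

# Fixed part of a ZIP local file header: signature, five 16-bit fields,
# three 32-bit fields, then the file-name and extra-field lengths (30 bytes).
LOCAL_HEADER = struct.Struct("<4s5H3I2H")


def bytes_needed_for(raw: bytes, name: str) -> int:
    """Return how many leading bytes of `raw` must arrive before entry `name` is complete."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        info = zf.getinfo(name)  # raises KeyError if the entry is missing
    fields = LOCAL_HEADER.unpack_from(raw, info.header_offset)
    name_len, extra_len = fields[9], fields[10]
    data_start = info.header_offset + LOCAL_HEADER.size + name_len + extra_len
    # The entry's compressed data ends here; a trailing data descriptor,
    # if present, is not needed to decode the entry itself.
    return data_start + info.compress_size
```

For an archive laid out as in the list above, the value for META-INF/container.xml is close to the full file size, which is the problem being described.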

Instead, the spec could mandate for version 3.X [EDIT 2020-01-10: To avoid worry about backward compatibility, I guess this suggestion should be that only EPUB version 4+ files would mandate the file order within the zip?] that the files must be in the following order:

  1. mimetype
  2. META-INF/container.xml
  3. ... the opf file pointed to by container.xml...
    3b) encryption.xml, etc...
  4. ... all remaining content files, in the order they are needed for rendering...

For example, if the spine's first reading idref is "section1.html" and the first file referenced inside "section1.html" is "001.css" and "cover.jpg" then the order of files you would expect for 4) should be:

  • section1.html
  • 001.css
  • cover.jpg
  • section2.html ...

This means that by the time cover.jpg has been streamed, section1.html is fully ready to render.
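A minimal sketch of how such an order could be derived from an existing EPUB, assuming a single rootfile in container.xml and unencoded manifest hrefs. The function name `rendering_order` is hypothetical. As a simplification, it lists all spine documents first and the remaining manifest resources afterwards, rather than interleaving each document with the CSS and images it references as in the example above (that would require parsing the HTML).

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def rendering_order(zf: zipfile.ZipFile) -> list[str]:
    """Return archive entry names in an approximate rendering order."""
    container = ET.fromstring(zf.read("META-INF/container.xml"))
    opf_path = container.find("c:rootfiles/c:rootfile", CONTAINER_NS).get("full-path")
    package = ET.fromstring(zf.read(opf_path))
    base = posixpath.dirname(opf_path)

    # Map manifest ids to archive paths (hrefs are relative to the OPF file).
    path_by_id = {
        item.get("id"): posixpath.normpath(posixpath.join(base, item.get("href")))
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    spine = [
        path_by_id[ref.get("idref")]
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]
    names = zf.namelist()
    other_meta = [n for n in names if n.startswith("META-INF/")]

    candidates = [
        "mimetype", "META-INF/container.xml", opf_path,
        *other_meta, *spine, *path_by_id.values(), *names,
    ]
    present = set(names)
    # dict.fromkeys drops duplicates while keeping the first occurrence.
    return [
        p for p in dict.fromkeys(candidates)
        if p in present and not p.endswith("/")
    ]
```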

The reason this should be mandated by the spec is that then book reader apps can trust, based on the version of the EPUB file, that files are in rendering order within the zip and need not wait to stream the entire file over to the client before rendering page 1.

[1] https://w3c.github.io/publ-epub-revision/epub32/spec/epub-ocf.html
[2] https://w3c.github.io/publ-epub-revision/epub32/spec/epub-ocf.html#sec-zip-container-mime

@dauwhe dauwhe added the Spec-OCF label Jan 10, 2020

Contributor

@dauwhe dauwhe commented Jan 10, 2020

I would be very concerned about making every existing EPUB invalid. At least for EPUB 3.X, we are committed to backward compatibility.

If the ebooks are available to you before the user reads them, you could repackage them to make streaming easier...
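One way such a repackaging step might look, as a sketch using Python's standard zipfile module. The function name `repackage` and its `order` parameter are assumptions for this illustration. Entry timestamps and other per-entry metadata are not preserved here.

```python
import io
import zipfile


def repackage(epub_bytes: bytes, order: list[str]) -> bytes:
    """Rewrite an EPUB so that its entries appear in the requested order."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as src, zipfile.ZipFile(out, "w") as dst:
        names = [n for n in src.namelist() if not n.endswith("/")]
        # mimetype always goes first, then the requested order, then
        # every remaining entry in its original relative order.
        wanted = dict.fromkeys(["mimetype", *order, *names])
        for name in wanted:
            if name not in names:
                continue  # ignore requested names that are not in the archive
            # OCF requires the mimetype entry to be stored uncompressed.
            method = zipfile.ZIP_STORED if name == "mimetype" else zipfile.ZIP_DEFLATED
            dst.writestr(name, src.read(name), compress_type=method)
    return out.getvalue()
```

Because the function takes the order as a plain list, the ordering policy (spine order, cover first, and so on) stays separate from the rewriting step.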

Author

@codedread codedread commented Jan 10, 2020

I was not implying every existing EPUB would be invalid - and backward compatibility is important. I would just like to see the spec (or a future version of the spec) address file order within the zip file, since it affects streaming performance.

Another idea is to mandate that the OPF file is in the zip file before any manifest files, and then you could add an attribute to the package or manifest element indicating whether manifest files in the archive are in referential order or not. This would help reader apps optimize the rendering.

@llemeurfr llemeurfr commented Jan 10, 2020

Author

@codedread codedread commented Jan 10, 2020

I'm not sure I agree with "We must live with the fact that EPUB is not a streaming format", since what I'm now suggesting is providing what amounts to a "hint" to the reader apps about the file order. If that hint does not exist, then reader apps can still handle random-ordered archive files.

For now, I'm going to invent my own XML namespace and attribute for this and if my reader app encounters that attribute, it can start rendering asap, otherwise it has to stream the entire archive first.
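A sketch of what checking for such a hint could look like in a reader app. The namespace URI, the attribute name `entry-order`, and the value `rendering` are all invented for this illustration; nothing like them exists in the EPUB specifications, and the thread does not say which names the author chose.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and attribute, made up for this sketch.
HINT_NS = "https://example.org/ns/epub-streaming"
HINT_ATTR = f"{{{HINT_NS}}}entry-order"


def archive_is_in_rendering_order(opf_xml: str) -> bool:
    """Return True if the package element carries the (hypothetical) ordering hint."""
    package = ET.fromstring(opf_xml)
    return package.get(HINT_ATTR) == "rendering"
```

A reader that finds the hint could start rendering as soon as the first spine document and its resources have arrived; without it, the reader falls back to waiting for the whole archive.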

Author

@codedread codedread commented Jan 10, 2020

NOTE: I updated my original request to make it clear I was not suggesting that earlier versions of EPUB files would become invalid if they did not mandate the file order.

Contributor

@dauwhe dauwhe commented Jan 10, 2020

"NOTE: I updated my original request to make it clear I was not suggesting that earlier versions of EPUB files would become invalid if they did not mandate the file order."

This is tricky, because one of the consequences of our commitment to backward compatibility is that we will continue to use <package version="3.0"> for any EPUB 3.X. We tried using version="3.1" and it was perhaps the largest single factor in the failure of EPUB 3.1. So EPUB processors won't be able to distinguish between older EPUB 3s and newer EPUB 3s, by design.

@Doktorchen Doktorchen commented Jan 11, 2020

With zipinfo one can determine what is in the archive, and with unzip and the -p option one can extract single files. ZIP therefore already allows the content to be extracted in any preferred order, if required.
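For comparison, roughly the same two operations with Python's standard zipfile module (a sketch; the function names are made up here). Note that both rely on the central directory at the end of the archive, so they need random access to the file rather than a forward-only stream.

```python
import zipfile


def list_entries(source) -> list[tuple[str, int]]:
    """List (name, uncompressed size) for every entry, similar to zipinfo."""
    with zipfile.ZipFile(source) as zf:
        return [(info.filename, info.file_size) for info in zf.infolist()]


def extract_entry(source, name: str) -> bytes:
    """Return the bytes of a single entry, similar to `unzip -p archive name`."""
    with zipfile.ZipFile(source) as zf:
        return zf.read(name)
```

Here `source` may be a file path or any seekable binary file object.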

For something like streaming complex content or similar applications it might be interesting to look at the SMIL format, maybe to improve this for streaming purposes of any content.

Member

@mattgarrish mattgarrish commented Jan 11, 2020

Is file ordering going to help all that much with streaming zips? Maybe for purely linear reading, or if the only goal is to allow the first document to load as quickly as possible, but if the goal is to only grab the content needed as needed it's not going to be terribly useful. As soon as the user jumps around inside the publication how do you know where any files might be (content or supporting)?

If we wanted to make epub 3s easier to stream, a possible solution might be to look at serializing the necessary information from the spine/central directory record in an external document so that that document would be the starting reference for accessing the epub. (An epub feed?)

Author

@codedread codedread commented Jan 11, 2020

Yes, to allow the first page to render as quickly as possible.

Author

@codedread codedread commented Jan 11, 2020

"If we wanted to make epub 3s easier to stream, a possible solution might be to look at serializing the necessary information from the spine/central directory record in an external document so that that document would be the starting reference for accessing the epub. (An epub feed?)"

Making a separate file that has to be kept "together" with the EPUB file seems not great. I thought the goal of an EPUB is to have one file with everything needed, which is why I thought this information should be contained as metadata within the zip file and then mandated to appear first.

For now, I'm going to experiment with my own XML namespaced attribute.

Member

@mattgarrish mattgarrish commented Jan 11, 2020

"Making a separate file that has to be kept "together" with the EPUB file seems not great."

That's the potential limitation of retrofitting any optimization for streaming into epub 3. The ship has already sailed, so unless you're providing your own content to your own reading system -- in which case the whole issue is moot -- the reality is that more ordering within the zip file is not likely to gain traction. It'd prove a rather dramatic change to the existing ecosystem.

Having to insert the mimetype first is one of epub's more unloved features, after all. It was unceremoniously dumped from the packaging for audiobooks done in W3C.

I don't suspect a feed-like approach to finding and accessing zipped epubs on the web is any more realistic to gain traction, to be completely honest, but it's a possibility that would at least benefit from not having to repackage existing content.

As @llemeurfr has already mentioned, a future epub 4 probably would take a different approach from 3 altogether. The idea on the table was to be able to unpackage a single-file epub for deploying from the web as a "web publication" without modification, and with a web-friendly representation of the package document, but that work has been put on hiatus for the time being.

Contributor

@danielweck danielweck commented Jan 13, 2020

Here are some technical notes (you can skip this if you are interested primarily in specification-related discussions, as this is primarily aimed at reading system developers):

EPUB publications - or any other zip-based format, for that matter - hosted remotely as "whole files" (i.e. not expanded / exploded / unzipped), can be accessed via HTTP/1.1+ partial requests from remote clients, even from JavaScript code running in vanilla web browsers (assuming adequately permissive CORS headers, due to the web security model).

Such clients can request specific byte ranges from a remote ZIP archive, much like playing audio/video over a network connection (thus the "streaming" analogy). Note that the ZIP directory is stored at the end of EPUB / ZIP archives, so an initial HTTP partial / byte range request must typically target this data suffix, first and foremost (which is why this model is only a "streaming" analogy, not a true realization).
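A sketch of that first step: fetch only the tail of the archive and parse the end-of-central-directory record to learn where the ZIP directory lives. The `fetch(start, end)` callback is a placeholder for whatever performs the HTTP range request (for example a `Range: bytes=start-(end-1)` header); it and the function name are assumptions of this sketch. ZIP64 archives and comments that happen to contain the record signature are not handled.

```python
import struct
from typing import Callable

# End-of-central-directory record: signature, four 16-bit fields,
# two 32-bit fields, and the comment length (22 bytes in total).
EOCD = struct.Struct("<4s4H2IH")
EOCD_SIGNATURE = b"PK\x05\x06"
# The record may be followed by a comment of at most 65535 bytes.
MAX_TAIL = EOCD.size + 0xFFFF


def locate_central_directory(
    size: int, fetch: Callable[[int, int], bytes]
) -> tuple[int, int, int]:
    """Return (offset, length, entry_count) of the central directory.

    `fetch(start, end)` is a hypothetical callback returning bytes
    [start, end) of the remote archive.
    """
    tail_start = max(0, size - MAX_TAIL)
    tail = fetch(tail_start, size)
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = EOCD.unpack_from(tail, pos)
    total_entries, cd_size, cd_offset = fields[4], fields[5], fields[6]
    return cd_offset, cd_size, total_entries
```

A second range request for `[offset, offset + length)` then yields the directory itself, which lists the position of every entry.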

There are web-browser-based EPUB reading systems that use this mechanism to access remote publications without having to download the entire EPUB file as an in-memory binary "blob". The Readium cloud/web reader has supported this pseudo-streaming of EPUB files pretty much since day one. There are notable caveats due to web platform limitations at the time of implementation: requested ZIP entries are loaded and stored internally as memory blobs (e.g. entire HTML, CSS, image files), so an obvious drawback is that audio/video resources cannot be fetched gradually / with random access, and furthermore complex pre-processing is necessary in order to replace content URLs with their Blob counterparts, everywhere.

However, nowadays Service Workers can alleviate some of the problems by acting as URL request proxy between a remote zip-packaged EPUB file, and a "naive" webview client - for example, URL https://domain.com/content-server/book.epub/chapters/01.html#id can be transparently resolved to a single 01.html resource (as far as the webview is concerned) even though the remote host really only serves the https://domain.com/content-server/book.epub file (in other words, a Service Worker can eliminate the need to populate BlobURIs everywhere in EPUB HTML / CSS resources).
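The URL mapping in that example can be sketched as a small function that splits a request URL into the archive URL and the entry path inside it. A real Service Worker would do this in JavaScript inside its fetch handler; Python is used here only to illustrate the logic, and the function name and the ".epub" suffix convention are assumptions of this sketch.

```python
from urllib.parse import urlsplit


def split_epub_url(url: str, suffix: str = ".epub") -> tuple[str, str]:
    """Split a URL into (archive URL, entry path inside the archive)."""
    parts = urlsplit(url)
    # Look for "<archive>.epub/" in the path; what follows is the ZIP entry name.
    head, sep, entry = parts.path.partition(suffix + "/")
    if not sep:
        raise ValueError("URL does not point inside an EPUB archive")
    archive_url = f"{parts.scheme}://{parts.netloc}{head}{suffix}"
    # The fragment (#id) is dropped; it is resolved by the client, not the server.
    return archive_url, entry
```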

To conclude this rather long blurb: I think that ordering resources / ZIP entries inside EPUB archives provides limited benefits and yields debatable performance improvements, only in specific cases (e.g. to consistently initially fetch the cover image, immediately followed by the first HTML chapter and its associated CSS, etc.). I can see this being useful in a particular proprietary closed-loop client/server implementation, but I do not see great benefits in the general case.

Author

@codedread codedread commented Jan 13, 2020

Hi Daniel - thanks for the suggestion on partial HTTP requests / byte ranges! I agree, ServiceWorkers can also push the processing out of the "naive" client too. So yes, sniffing the last ~65k of the zip file for the central directory header and then making N random-access byte range requests can get around this issue, assuming partial requests are supported on all types of servers these days (I don't know, myself), even though we are now making N requests instead of 1 longer one.
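To make the trade-off concrete, here is a sketch of a seekable file-like object whose reads are served by a range-fetching callback, so that Python's zipfile module only pulls the bytes it needs (the directory tail plus the requested entry). The class name `RangeReader` and the `fetch(start, end)` callback are assumptions of this sketch; in practice each callback invocation would be one HTTP range request, and a real client would want caching on top.

```python
import io
from typing import Callable


class RangeReader(io.RawIOBase):
    """Read-only, seekable file backed by a range-fetching callback."""

    def __init__(self, size: int, fetch: Callable[[int, int], bytes]) -> None:
        super().__init__()
        self._size = size
        self._fetch = fetch  # hypothetical: returns bytes [start, end)
        self._pos = 0
        self.requests = 0       # number of range fetches issued
        self.bytes_fetched = 0  # total bytes transferred

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos, io.SEEK_END: self._size}[whence]
        self._pos = max(0, base + offset)
        return self._pos

    def readinto(self, buffer) -> int:
        end = min(self._size, self._pos + len(buffer))
        if end <= self._pos:
            return 0  # at or past end of file
        data = self._fetch(self._pos, end)
        self.requests += 1
        self.bytes_fetched += len(data)
        buffer[: len(data)] = data
        self._pos += len(data)
        return len(data)
```

Passing an instance to `zipfile.ZipFile(...)` and reading one entry shows, via `requests` and `bytes_fetched`, how many round trips are traded for how many saved bytes.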

On the other hand, asking EPUB archive creators to put the files in the right order in the zip and then drop a hint to reader apps via an XML attribute seems like a pretty low burden.

Note that there is zero effect on reader apps that don't want to use this optimization.

Also note that EPUB files that do not have this XML attribute also stay valid.
