
consider exposing size for cache entries #587

Open
wanderview opened this issue Dec 10, 2014 · 15 comments

@wanderview
Member

(I think this is a known feature request, but didn't see an issue for it.)

Consider how you would build a media application like a music player or photo gallery. While you probably couldn't store all media offline, it would be nice to save some amount of the most frequently used files.

Currently the Cache API lets us build an LRU cache based on count:

var cache;
caches.open('foo').then(function(foo) {
  cache = foo;
  return cache.match(request);
}).then(function(response) {
  if (response) {
    // update order of entries in keys()
    cache.put(request, response.clone());
    return response;
  }

  var maxItems = 100;
  return addToLRU(cache, maxItems, request);
});

function addToLRU(cache, maxItems, request) {
  return cache.keys().then(function(keys) {
    if (keys.length < maxItems) {
      return cache.add(request);
    }

    return cache.delete(keys[0]).then(function() {
      return cache.add(request);
    });
  });
}

It would be nice, however, to be able to store items up until a certain size limit is reached. I was thinking an API like:

  Promise<unsigned long> sizeOf((Request or USVString) request, optional QueryParams params);
  Promise<unsigned long> sizeOfAll(optional (Request or USVString) request, optional QueryParams params);

These would function like match() and matchAll(), but return a size value instead of the responses.
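Since sizeOf() is only a proposal, here is a sketch of how byte-based eviction could use it, with the decision logic factored into a pure function. The { key, size } entry shape is illustrative, not part of any API; the sizes would come from the proposed sizeOf() calls.

```javascript
// Given cache entries in LRU order (oldest first) and a byte budget,
// return the keys to evict so the remaining entries fit under the budget.
// The { key, size } shape is illustrative only.
function keysToEvict(entries, maxBytes) {
  var total = entries.reduce(function(sum, e) { return sum + e.size; }, 0);
  var evict = [];
  for (var i = 0; i < entries.length && total > maxBytes; i++) {
    evict.push(entries[i].key);
    total -= entries[i].size;
  }
  return evict;
}
```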

It would then be necessary to define what "size" means. This I am less sure of:

  • Size of on-disk or in-memory? I assume on-disk would be preferable.
  • Size of just the body or including all fields? The body will typically dominate for most Responses.
  • Require approximate or exact measurements from the browser? Given compression, databases, and de-duplication, I think approximate would be better.

Thoughts?

@gauntface

The only criticism I have is that this needs to go through the cache to work. My instinct is that I'd get a response, figure out its size, check how full the cache currently is in terms of file size, and then decide whether or not to include it.

  • This allows decisions per file basis (i.e. this file is huge, don't ever cache it)
  • This file is good to cache, do I have enough space?
  • This file is fresh, is there a big file I should drop from the cache for this response and possibly others?

But this all sounds like it would live on the request rather than the cache API and would shift the onus onto the developer to track memory usage (which is good and bad).

@wanderview
Member Author

I think the main problem with getting the size on the Request or Response is that you don't really know until the body is drained. Fetch resolves when just the headers are available and the body may still be coming in off the network. Unfortunately content-length is often a lie. Either you need to read the whole stream into memory (bad for mobile) or write it to disk and check.

I think you would need to add to the cache, then delete any excess over the threshold:

fetch(request.clone()).then(function(response) {
  cache.put(request, response.clone()).then(function() {
    deleteExcess(cache, 1024*1024);
  });
  return response;
});

function deleteExcess(cache, maxSize) {
  return cache.sizeOfAll('./avatars/', { prefixMatch: true }).then(function(size) {
    if (size < maxSize) {
      return;
    }
    return cache.keys().then(function(keys) {
      // wait for the delete before recursing, so each pass sees the new size
      return cache.delete(keys[0]).then(function() {
        return deleteExcess(cache, maxSize);
      });
    });
  });
}

@wanderview
Member Author

Alternatively, if the origin can ensure its server provides accurate content-length headers, then it can simply use those.

var length = parseInt(response.headers.get('content-length'), 10);

You could then store the total length in IDB.
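For illustration, the bookkeeping could be as simple as summing the declared lengths. This is a sketch: header values are strings returned by headers.get() and may be absent, so the total is only as trustworthy as the server.

```javascript
// Sum declared content-length values (strings as returned by headers.get()).
// Returns null if any value is missing or unparseable, since one bad header
// makes the total meaningless.
function totalDeclaredSize(contentLengths) {
  var total = 0;
  for (var i = 0; i < contentLengths.length; i++) {
    var n = parseInt(contentLengths[i], 10);
    if (isNaN(n) || n < 0) {
      return null;
    }
    total += n;
  }
  return total;
}
```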

This would not be exactly the size stored on disk in the cache, though. The actual size would depend on how the Cache implementation dealt with content encoding, additional compression, de-duplication, etc. For example, the Gecko cache will (unfortunately) remove the content-encoding and then recompress with snappy.

Anyway, maybe the content-length headers are enough and we don't need a new API here. What do people think?

@jakearchibald
Contributor

Twitter asked for this too. We should have some kind of method on the cache object that provides an object of meta information, which includes size on disk.

@kinu
Contributor

kinu commented Jan 21, 2015

(Since we've heard similar requests for the Quota API in the past, I filed a related issue; we probably have to agree on where these APIs should live: kinu/quota-api#10)

@KenjiBaheux
Collaborator

One more datapoint: this came up as a question on StackOverflow.

@robrbecker

tl;dr I'd like to see the ability to query not only the size of individual items in the cache, but also the total size of the cache itself. Hopefully this can be done in a performant way, without having to spin through all the items adding up the size. Each cache could keep a tally of the total size of the items it contains and update it as items are added and removed.
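The per-cache tally could be sketched like this. It is illustrative bookkeeping, not part of the Cache API; a UA would track the on-disk sizes internally.

```javascript
// Illustrative bookkeeping, not part of the Cache API: track the total size
// of a cache as entries are added and removed, so size queries are O(1).
function SizeTally() {
  this.sizes = {};   // key -> size in bytes
  this.total = 0;
}
SizeTally.prototype.put = function(key, size) {
  this.remove(key);  // replacing an entry first drops its old size
  this.sizes[key] = size;
  this.total += size;
};
SizeTally.prototype.remove = function(key) {
  if (Object.prototype.hasOwnProperty.call(this.sizes, key)) {
    this.total -= this.sizes[key];
    delete this.sizes[key];
  }
};
```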

Imagine a use case where you might want to take an app offline that deals with displaying large documents. You also want to enable the user to take an entire document offline. The document is comprised of multiple requests, say 1 upfront and 1 per page. When the user clicks to take a document offline, let's say we create a service worker cache for that document, request all the assets and add them to that cache.

Now the user has taken a few documents offline. There is a cache per doc, and one for the app assets. The user may want to see what documents are available offline, how much space each uses on their device, and remove a document from the offline cache. If each cache knew the size of the items it contained, then segmenting data into caches in this fashion makes it easy to query the size of a single document.

In a use case like this, the user would want to see the space actually consumed on disk, including request overhead and taking into account compression, etc.

@jakearchibald
Contributor

@robrbecker your tl;dr is longer than the OP! But the extra detail is really valuable, thanks for the use-case and clarification.

the user would want to see the space actually consumed on disk, including request overhead and taking into account compression, etc

The UA may dedupe across caches to save disk space. I don't think this is a huge issue, but we should nod to it in the spec.

not only the size of individual items in the cache

In fact, the cache size is going to be more accurate than the content-length header, which may be absent or completely false. An accurate way would be response.arrayBuffer().then(a => a.byteLength), but that would be pretty horrific for performance.

@robrbecker what's more important, the size of individual items or the size of the whole cache?

@wanderview would it be simpler to have a single cache.size() promise-returning method? If we need a way to query the size of multiple cache entries, maybe we should change the resolve type of .matchAll to be something that extends Array, so it would be cache.matchAll(request).then(rs => rs.totalSize()) or something.
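That matchAll() idea could look something like the sketch below. ResponseList and the per-element size property are hypothetical; a real implementation would attach the UA's stored size rather than a plain number.

```javascript
// Hypothetical: an Array subclass that matchAll() could resolve with,
// where each element carries a size (a plain number here for illustration).
class ResponseList extends Array {
  totalSize() {
    return this.reduce((sum, r) => sum + r.size, 0);
  }
}
```

cache.matchAll(request) would then resolve with a ResponseList, and rs.totalSize() could sum the stored sizes without draining any bodies.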

@matt-cook

While you probably couldn't store all media offline, it would be nice to save some amount of the most frequently used files.

Why not store all media offline?

Similar to @robrbecker, specifics on the amount of space used, and more importantly the amount available, are very important to our use case: caching large binary assets (video, images, audio) for physically installed media (digital signage, kiosks, interactive video walls, etc.) that load via the web but may have an intermittent connection.

Specs regarding clearing of data from the cache are also critical. We need to understand exactly how much, and for how long, data will be cached before we can fully rely on it (a 100% offline-enabled app).

@wanderview
Member Author

Why not store all media offline?

My 2 TB Dropbox account will probably not fit on my mobile device for some years... That was the case I was referring to.

Similar to @robrbecker, specifics on the amount of space used, and more importantly the amount available, are very important to our use case: caching large binary assets (video, images, audio) for physically installed media (digital signage, kiosks, interactive video walls, etc.) that load via the web but may have an intermittent connection.

Specs regarding clearing of data from the cache are also critical. We need to understand exactly how much, and for how long, data will be cached before we can fully rely on it (a 100% offline-enabled app).

Yes, these cases are important too. I believe the current leading proposal for handling guaranteed persistent storage is in here:

https://wiki.whatwg.org/wiki/Storage

It sounds like the v1 bits there would help with your use case.

@jakearchibald added this to the Version 2 milestone Oct 28, 2015
@jakearchibald
Contributor

I'm keen on looking at this stuff for v2. We need to make sure that by exposing size we don't hint at the contents of opaque responses.

@robrbecker

@jakearchibald Finally circling back to this issue... I think the total size of a cache is more important than individual sizes. That may be a way to get around the security concerns of giving out the exact size of each response. (Except in the degenerate case of a 1-to-1 cache <-> response.)

@jakearchibald
Contributor

That's what I'm worried about. Actively thinking about this.

@wanderview
Member Author

This same problem exists for exposing size estimates at the origin level in the Storage spec. I guess we could just exclude opaque bodies from the script-exposed size estimates, but that is kind of annoying to implement and reduces the utility here.

@petkaantonov

The total cache size of opaque items could be rounded up to the nearest, say, 100 KB. That would keep it useful for the total-size use case without hinting too much at the actual size of an opaque response in the 1-to-1 cache <-> response case.
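Something like this sketch, where the 100 KB granularity is just the example number above:

```javascript
// Round a byte count up to the nearest multiple of `granularity` so the
// reported total does not reveal exact opaque response sizes.
function quantizeSize(bytes, granularity) {
  granularity = granularity || 100 * 1024;
  return Math.ceil(bytes / granularity) * granularity;
}
```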
