
Can caches store and retrieve results that are still being written to the cache? #573

Closed
jakearchibald opened this issue Dec 4, 2014 · 28 comments

Comments

@jakearchibald
Contributor

At the moment, cache-adding methods resolve when the full body of the response is written to the cache. That means we can reject if the body stream looks like a failure, which can be suggested by content-length #362 (comment). This is a must due to installEvent.waitUntil(addStuffToCaches) - we can't have that resolve but ultimately fail.

However, if you have code that tries the cache, falls back to the network and adds to the cache, you can end up with multiple pages doing this with the same resource, as cache.match won't return items until it's fully cached.

Shall we change this so items are added to the underlying cache as soon as we have headers? So for cache.add(request):

  1. Fetch request
  2. If fetch fails, reject & abort
  3. Add (request, response) to RequestToResponseMap
  4. Wait for body to stream into cache
  5. If body writing fails, remove (request, response) from RequestToResponseMap, reject the promise & abort these steps
  6. Remove other entries in RequestToResponseMap that match request (.match can no longer reach them)
  7. Resolve promise
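These steps can be sketched with a plain JavaScript model, using a Map of entry lists in place of the spec's RequestToResponseMap and a stubbed fetch. All names here (`entries`, `bodyDone`, `cacheAdd`, `cacheMatch`) are illustrative, not spec text:

```javascript
// Illustrative model only: entries maps a URL to an ordered list of
// records; a record gains complete: true once its body has streamed in.
var entries = new Map();

function cacheAdd(url, fetchStub) {
  // Steps 1-2: fetch the request; a rejected fetch rejects cache.add().
  return fetchStub(url).then(function (response) {
    // Step 3: the record is visible to match() as soon as headers arrive.
    var record = { response: response, complete: false };
    var list = entries.get(url) || [];
    list.push(record);
    entries.set(url, list);
    // Step 4: wait for the (simulated) body to finish streaming.
    return response.bodyDone.then(function () {
      // Steps 6-7: drop every other record for this request, then resolve.
      record.complete = true;
      entries.set(url, [record]);
    }, function (err) {
      // Step 5: body failed; removing the record re-exposes old matches.
      entries.set(url, list.filter(function (r) { return r !== record; }));
      throw err;
    });
  });
}

function cacheMatch(url) {
  // Favour a fully-written record over an in-progress one.
  var list = entries.get(url) || [];
  return list.filter(function (r) { return r.complete; })[0] || list[0] || null;
}
```

Here `bodyDone` stands in for the body stream settling; in the real API, match() would hand back a Response whose body may still be arriving.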

If you call .match(request) and there's already a match fully-written in the cache, you'll get that rather than the in-progress one. This seems sensible to me, but we could favour the in-progress one.

If you get an in-progress response back from .match(request) and that goes back to the page, it may arrive slowly and it may ultimately fail. But hey, that's the same as the network, which it ultimately is.

Questions:

  • Does this sound like a good idea?
  • Is it universally a good idea, or should it be an option to .match()? I don't think it needs to be optional.

Related: #361 #362

/cc Cache-warriors @wanderview @gavinp @sicking

@wanderview
Member

To clarify: with this approach, MatchAll() would retrieve both the previously existing entries and the new in-progress entry? And Match() gets the previous entry due to ordering constraints?

One thing I'm trying to understand: why is it that ".match can no longer reach them" in step 6, yet they can be reached after step 3? What happens in between those two steps to make them unreachable?

@jakearchibald
Contributor Author

To clarify: with this approach, MatchAll() would retrieve both the previously existing entries and the new in-progress entry? And Match() gets the previous entry due to ordering constraints?

Yep!

One thing I'm trying to understand: why is it that ".match can no longer reach them" in step 6, yet they can be reached after step 3? What happens in between those two steps to make them unreachable?

The in-progress request can still fail in step 5, removing it from RequestToResponseMap, which re-exposes the other matches.

@wanderview
Member

And if you have two cache.add() calls with overlapping requests at the same time, then it's a bit non-deterministic which ends up in the cache, right? Whichever one finishes first will delete the other.

Not sure that's a problem, just trying to think through the races.

@wanderview
Member

Also, will the spec say anything about how the match()'d response should treat partially written entries? For example, I could see the easy implementation waiting to provide the first byte of the body until the body is fully written. The trickier, better, solution of course would be to stream the data out as it slowly fills in the cache file.
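The "trickier, better" option can be sketched as a tiny writer/reader pair, where the reader waits whenever it catches up with a writer that hasn't finished. This is a single-reader toy model (nothing like Gecko's actual stream infrastructure), and `makePartialEntry` is an invented name:

```javascript
// Toy model of streaming a partially written cache entry: one writer
// appends chunks while a single reader drains them, waiting when it
// catches up. Multiple concurrent readers are not supported here.
function makePartialEntry() {
  var chunks = [];
  var done = false;
  var wake = null;
  return {
    write: function (chunk) {
      chunks.push(chunk);
      if (wake) wake();
    },
    finish: function () {
      done = true;
      if (wake) wake();
    },
    // Drain every chunk, pausing while the writer is still behind.
    read: async function () {
      var out = [];
      var i = 0;
      for (;;) {
        while (i < chunks.length) out.push(chunks[i++]);
        if (done) return out;
        await new Promise(function (resolve) { wake = resolve; });
      }
    }
  };
}
```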

@jakearchibald
Contributor Author

Yeah, not sure if it's a problem either. Maybe in-progress matches should be removed between steps 3 & 4.

@jakearchibald
Contributor Author

Streaming is obviously way better; 'SHOULD' is probably good enough.

@wanderview
Member

I agree, streaming is way better. It will just be tricky to implement in gecko, though, due to our stream infrastructure. I'll have to build an async wrapper around our file streams that lets me kick the reader across the process boundary whenever new data is written to the stream. Not a spec issue, of course.

@jakearchibald
Contributor Author

There are parts of Chrome's fetch & cache impl that aren't as streamy as they should be. SHOULD is a pretty good get-out, allowing an initial implementation that works but isn't quite optimal.

@jungkees
Collaborator

jungkees commented Dec 9, 2014

Actually, some clarification is needed on the gap between the current algorithms as-is and the expected behaviors being discussed here.

At the moment cache-adding methods resolve when the full body of the response if written to the cache.

This was expected, but the current addAll() works in a way that the result promise resolves without waiting for the full body to be written to the cache.

Shall we change this so items are added to the underlying cache as soon as we have headers?

.match also was not waiting for the full body to be written to the cache before resolving with the response.

Accordingly, in order to satisfy the suggested behavior, steps 4 to 6 should be added to the current addAll(). Plus, some further review and modification would be required in the Batch Cache Operations algorithm, where the existing matched entries are removed before new ones are written. I need to examine this further to sort it out.

@slightlyoff
Contributor

Sorry for the late comments.

I agree that the streaming behavior is better and that there's essential non-determinism in the Cache API today (it's known, and we didn't add transactions because of interactions with IDB transactions that we didn't want to think through in an ad-hoc way).

I think it's the right thing to "commit" to the Cache when you have headers (not the full body). This is compatible with body streaming.

Jungkee: I agree with those changes. As long as the individual promises resolve when headers are available, I'm not sure that "Batch Cache Operations" needs to change.

@jakearchibald
Contributor Author

I think the changes are right, but it doesn't solve the problems I thought it might:

self.addEventListener('fetch', function(event) {
  var requestURL = new URL(event.request.url);

  if (/\.jpg$/.test(requestURL.pathname)) {
    event.respondWith(
      caches.match('/img.jpg').then(function(response) {
        return response || fetch('/img.jpg').then(function(response) {
          return caches.open('img-cache').then(function(cache) {
            cache.put('/img.jpg', response.clone());
            return response;
          });
        });
      })
    );
  }
});

Imagine a page with a.jpg and b.jpg:

  1. a.jpg: look for /img.jpg in caches
  2. b.jpg: look for /img.jpg in caches
  3. a.jpg: no match found
  4. a.jpg: fetch /img.jpg
  5. b.jpg: fetch /img.jpg
  6. etc etc

Should we try to solve the above or leave it to devs? Devs could do it with something like:

var imgRequest;

self.addEventListener('fetch', function(event) {
  var requestURL = new URL(event.request.url);

  if (/\.jpg$/.test(requestURL.pathname)) {
    event.respondWith(
      caches.match('/img.jpg').then(function(response) {
        if (response) return response;
        if (!imgRequest) {
          imgRequest = fetch('/img.jpg').then(function(response) {
            return caches.open('img-cache').then(function(cache) {
              cache.put('/img.jpg', response.clone());
              return response;
            });
          });
        }

        return imgRequest.then(function(response) {
          return response.clone();
        });
      })
    );
  }
});

@wanderview
Member

var imgRequest;

If we're advising people not to use global script vars in SW normally, it seems like this might be confusing for devs.

Is this an argument to add a cache method to "match() and add() only if not present"?

@jakearchibald
Contributor Author

…yeah, and if we ever need to speed up ServiceWorkers by running multiple instances at once, we break this.

The low level pattern here is:

  1. Is it in the cache? If so, serve it & abort
  2. Is it currently being requested? If so:
    1. Wait for response (headers only)
    2. If the response is not a network failure, serve it & abort
  3. Fetch it via the network
  4. Wait for response (headers only)
  5. Put response clone in the cache
  6. Serve response

The bit we're missing is 2. It's not totally simple, as we don't know if the in-flight request is really a match for our request until we have headers, since matching can be affected by Vary headers on the response.

@slightlyoff suggested that the networking layer of the browser should just take care of 2. I'm not so sure it's safe or desirable to do that. Any idea what Gecko does here?
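Step 2 of the pattern above can be approximated inside a single service worker instance by keying in-flight promises on URL. As noted in the thread, this ignores Vary matching and breaks if multiple SW instances ever run; `dedupedFetch` and `inFlight` are illustrative names, not proposed API:

```javascript
// Illustrative only: dedupe concurrent fetches for the same URL within
// one service worker instance. Vary matching is ignored, and this state
// is lost if the worker is killed or run as multiple instances.
var inFlight = new Map();

function dedupedFetch(url, fetchImpl) {
  var promise = inFlight.get(url);
  if (!promise) {
    // Nothing in flight: hit the network (steps 3-4 of the pattern).
    promise = fetchImpl(url);
    inFlight.set(url, promise);
    // Once headers have settled either way, stop deduping this URL.
    var cleanup = function () { inFlight.delete(url); };
    promise.then(cleanup, cleanup);
  }
  // Real callers would each need their own response.clone() here.
  return promise;
}
```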

@wanderview
Member

I don't think the network layer can do this. It doesn't know which network requests will have side effects; e.g. hitting a REST API endpoint. Content has to indicate this through cache control headers. So the second fetch() would go to the http cache if configured to do so. I think we have to live with the race where a second fetch() occurs before the headers are known for the first fetch().

@jakearchibald
Contributor Author

Agree with @wanderview that we can't rely on magic here.

@annevk do you think this is a sensible thing to live in Fetch? An option to only get the response if it's currently in-flight. So:

fetch('/').then(function(response) {
   // ...
});

fetch('/', {
  onlyInFlightStuffPlease: true
}).then(function(response) {
   // ...
});

…would only result in one network request, and the 2nd response is a clone of the first? There are lots of details of course, just trying to work out if it's something we should enable. The "in-flight" request list would be origin specific of course.

@jungkees
Collaborator

Clarified that responses can commit to the cache as soon as their headers are available, and made add/addAll/put resolve when the full bodies of the responses are written to the cache: 9eb76f4.

If you call .match(request) and there's already a match fully-written in the cache, you'll get that rather than the in-progress one. This seems sensible to me, but we could favour the in-progress one.

This requirement has not been addressed yet. I'm thinking of defining an associated fetching record (the in-progress one) for each incumbent record (a fully-written match) in the request to response map.

@wanderview
Member

Note: The cache commit is allowed as long as the response's headers are available.

This suggests to me that committing at this point is optional for the implementation. Is this correct?

If it's required, I suggest stronger language like "The cache must commit the response once the headers are available."

@wanderview
Member

I don't anticipate supporting this any time soon in the gecko Cache. Here is a bug to track the issue:

https://bugzilla.mozilla.org/show_bug.cgi?id=1110137

@jakearchibald
Contributor Author

"should" feels like the right wording, but I don't want it to block any of us shipping commit-on-full-write implementations.

@wanderview
Member

@jakearchibald, I was thinking more about this:

do you think this is a sensible thing to live in Fetch? An option to only get the response if it's
currently in-flight.

fetch('/').then(function(response) {
   // ...
});
fetch('/', {
  onlyInFlightStuffPlease: true
}).then(function(response) {
   // ...
});

It seems the browser would have to cache the in-flight responses somewhere. With multi-process architectures, this may not be super straightforward.

I still think this would be better to add to Cache. Instead of implicitly creating a cache, we would explicitly expose the capability on the API available to content. Then libraries could implement an "in-flight only fetch" feature using this primitive.

For example, let's say we added { fetchIfMissing: true } and { bodyComplete: true } to QueryParams; then a library could do something like this:

function fetchInFlightOnly(request) {
  var cache;
  return caches.open('FetchInFlightOnly').then(function(c) {
    cache = c;
    return cache.match(request, { fetchIfMissing: true });
  }).then(function(response) {
    cache.delete(request, { bodyComplete: true });
    return response;
  });
}

@wanderview
Member

It seems the browser would have to cache the in-flight responses somewhere. With multi-process architectures, this may not be super straightforward.

Well, this probably isn't as hard as I make it out to be. But I still find the flag on fetch() harder to reason about and I don't know that devs will understand when to use or not use it.

@jungkees
Collaborator

Besides the redundant-fetch-for-the-same-resource issue, here are some points about the cache behavior under the committing-on-header decision:

However, if you have code that tries the cache, falls back to the network and adds to the cache, you can end up with multiple pages doing this with the same resource, as cache.match won't return items until it's fully cached.

This no longer seems true, as .match should now return in-progress items.

If you call .match(request) and there's already a match fully-written in the cache, you'll get that rather than the in-progress one. This seems sensible to me, but we could favour the in-progress one.

Frequently matching in-progress responses could hurt the offline user experience, I guess. So serving the fully-written response with higher priority seems a must. I'd like to know whether we're on the same page.

@jungkees
Collaborator

Defined fetching record (an in-progress match) and incumbent record (a fully-written match) for cache entry: ba025c9.

The behaviors can be summarized as:

  1. We don't wait for the full body stream. A fetching record is committed to the cache as soon as the response's headers are available.
  2. Incumbent records always have higher priority.
    • Fetching records basically serve the same thing as the network request, so they're served as in-progress fetches, and they're want-to-be incumbent records for the next fetch events.
    • When no incumbent record exists, we do serve the fetching record.
  3. Fulfillment of the cache-adding methods (add/addAll/put) guarantees we have incumbent records for those requests.
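A minimal model of that priority rule, with each cache entry holding an optional incumbent record and an optional fetching record (`matchEntry` and `commitFetchingRecord` are illustrative names, not spec algorithms):

```javascript
// Illustrative: a cache entry may have a fully-written incumbent record
// and/or an in-progress fetching record.
function matchEntry(entry) {
  // Rule 2: the incumbent record always wins when present.
  if (entry.incumbent) return entry.incumbent;
  // Rule 2, second bullet: with no incumbent, serve the in-progress fetch.
  return entry.fetching || null;
}

function commitFetchingRecord(entry) {
  // Rule 3: once its body is fully written, the fetching record
  // becomes the new incumbent.
  entry.incumbent = entry.fetching;
  entry.fetching = null;
}
```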

@jakearchibald
Contributor Author

@jungkees

I might be reading wrong, but I think there's been a couple of regressions in the spec. Imagine a cache with 2 entries that look like this header-wise (assume each has the same URL):

  1. Request: { "Accept": "text/html" } Response: {"Vary": "Accept"}
  2. Request: { "Accept": "foo/bar" } Response: {}

Then if we add a new entry to the cache, Request: { "Accept": "text/html" } Response: {} (no Vary), the cache should now look like this:

  1. Request: { "Accept": "text/html" } Response: {}

The original algorithm would delete entries from the cache that match the new entry's request. As far as I can tell, the current spec would modify the first entry in the cache, leaving the second. That means the second entry cannot be matched, since the first will always match first. Am I reading it correctly?

In a system where we allow in-flight requests into the cache, it could work like this:

  1. Request: { "Accept": "text/html" } Response: {"Vary": "Accept"}
  2. Request: { "Accept": "foo/bar" } Response: {}

Then if we add a new entry to the cache, Request: { "Accept": "text/html" } Response: {}:

  1. Request: { "Accept": "text/html" } Response: {"Vary": "Accept"}
  2. Request: { "Accept": "foo/bar" } Response: {}
  3. Request: { "Accept": "text/html" } Response: {} (in-flight)

So nothing's deleted from the cache at this stage. .match will favour entry 1 before 3, because it goes with first match. But once the request from entry 3 completes, it removes entries from the cache that match its request, leaving:

  1. Request: { "Accept": "text/html" } Response: {}

This means there are no entries in the cache that cannot be matched.

I also think it's important to keep the cache in an order where the most recently added entries are at the end. This is useful when pruning caches. When I talked to Twitter last week, they were interested in adding cached items back into the cache once they'd sent them to the browser. If added entries appear at the end of the list, this helps them track which items were "last accessed", and therefore which entries they should keep.
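The completion behaviour described above, including the ordering property, can be sketched over an ordered array of entries. Request matching is reduced to plain URL equality here (ignoring Vary), and `completeEntry` is an invented name:

```javascript
// Illustrative: the cache is an ordered array. When an in-flight entry's
// body completes, older entries matching its request are removed, and the
// completed entry keeps its (most recently added) position at the end.
// Vary-based request matching is reduced to URL equality for the sketch.
function completeEntry(cache, entry) {
  entry.inFlight = false;
  return cache.filter(function (other) {
    return other === entry || other.url !== entry.url;
  });
}
```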

@jakearchibald
Contributor Author

Also, sorry I was so late in getting to this :(

@jakearchibald
Contributor Author

@wanderview

But I still find the flag on fetch() harder to reason about

Yeah, my suggestion was silly. I think you're right, this needs to go on the cache. Will open a new ticket for that once we have the other in-flight stuff defined.

@jungkees
Collaborator

So nothing's deleted from the cache at this stage. .match will favour entry 1 before 3, because it goes with first match. But once the request from entry 3 completes, it removes entries from the cache that match its request

Addressed the point: 8e76892. Thanks for spotting it and explaining the details @jakearchibald.

@jungkees
Collaborator

I believe we're done defining the behavior for the fetching / incumbent records of caches. Closing.
