
Can caches store and retrieve results that are still being written to the cache? #573

Closed
jakearchibald opened this issue Dec 4, 2014 · 28 comments

Comments

@jakearchibald
Contributor

At the moment, cache-adding methods resolve when the full body of the response is written to the cache. That means we can reject if the body stream looks like a failure, which can be suggested by content-length #362 (comment). This is a must due to installEvent.waitUntil(addStuffToCaches) - we can't have that resolve but ultimately fail.

However, if you have code that tries the cache, falls back to the network and adds to the cache, you can end up with multiple pages doing this with the same resource, as cache.match won't return items until it's fully cached.

Shall we change this so items are added to the underlying cache as soon as we have headers? So for cache.add(request):

  1. Fetch request
  2. If fetch fails, reject & abort
  3. Add (request, response) to RequestToResponseMap
  4. Wait for body to stream into cache
  5. If body writing fails, remove (request, response) from RequestToResponseMap, reject the promise & abort these steps
  6. Remove other entries in RequestToResponseMap that match request (.match can no longer reach them)
  7. Resolve promise
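These steps can be sketched with a plain JavaScript model, using a Map of entry lists in place of the spec's RequestToResponseMap and a stubbed fetch. All names here (`entries`, `bodyDone`, `cacheAdd`, `cacheMatch`) are illustrative, not spec text:

```javascript
// Illustrative model only: entries maps a URL to an ordered list of
// records; a record gains complete: true once its body has streamed in.
var entries = new Map();

function cacheAdd(url, fetchStub) {
  // Steps 1-2: fetch the request; a rejected fetch rejects cache.add().
  return fetchStub(url).then(function (response) {
    // Step 3: the record is visible to match() as soon as headers arrive.
    var record = { response: response, complete: false };
    var list = entries.get(url) || [];
    list.push(record);
    entries.set(url, list);
    // Step 4: wait for the (simulated) body to finish streaming.
    return response.bodyDone.then(function () {
      // Steps 6-7: drop every other record for this request, then resolve.
      record.complete = true;
      entries.set(url, [record]);
    }, function (err) {
      // Step 5: body failed; removing the record re-exposes old matches.
      entries.set(url, list.filter(function (r) { return r !== record; }));
      throw err;
    });
  });
}

function cacheMatch(url) {
  // Favour a fully-written record over an in-progress one.
  var list = entries.get(url) || [];
  return list.filter(function (r) { return r.complete; })[0] || list[0] || null;
}
```

Here `bodyDone` stands in for the body stream settling; in the real API, match() would hand back a Response whose body may still be arriving.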

If you call .match(request) and there's already a match fully-written in the cache, you'll get that rather than the in-progress one. This seems sensible to me, but we could favour the in-progress one.

If you get an in-progress response back from .match(request) and that goes back to the page, it may arrive slowly and it may ultimately fail. But hey, that's the same as the network, which it ultimately is.

Questions:

  • Does this sound like a good idea?
  • Is it universally a good idea, or should it be an option to .match()? I don't think it needs to be optional.

Related: #361 #362

/cc Cache-warriors @wanderview @gavinp @sicking

@wanderview
Member

To clarify: with this approach, MatchAll() would retrieve both the previously existing entries and the new in-progress entry? And Match() gets the previous entry due to ordering constraints?

One thing I'm trying to understand: why is it that ".match can no longer reach them" in step 6, yet they can be reached after step 3? What happens in between those two steps to make them unreachable?

@jakearchibald
Contributor Author

To clarify: with this approach, MatchAll() would retrieve both the previously existing entries and the new in-progress entry? And Match() gets the previous entry due to ordering constraints?

Yep!

One thing I'm trying to understand: why is it that ".match can no longer reach them" in step 6, yet they can be reached after step 3? What happens in between those two steps to make them unreachable?

The in-progress request can still fail in step 5, removing it from RequestToResponseMap, which re-exposes the other matches.

@wanderview
Member

And if you have two cache.add() calls with overlapping requests at the same time, then it's a bit non-deterministic which ends up in the cache, right? Whichever one finishes first will delete the other.

Not sure that's a problem, just trying to think through the races.

@wanderview
Member

Also, will the spec say anything about how the match()'d response should treat partially written entries? For example, I could see the easy implementation waiting to provide the first byte of the body until the body is fully written. The trickier, better, solution of course would be to stream the data out as it slowly fills in the cache file.
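The "trickier, better" option can be sketched as a tiny writer/reader pair, where the reader waits whenever it catches up with a writer that hasn't finished. This is a single-reader toy model (nothing like Gecko's actual stream infrastructure), and `makePartialEntry` is an invented name:

```javascript
// Toy model of streaming a partially written cache entry: one writer
// appends chunks while a single reader drains them, waiting when it
// catches up. Multiple concurrent readers are not supported here.
function makePartialEntry() {
  var chunks = [];
  var done = false;
  var wake = null;
  return {
    write: function (chunk) {
      chunks.push(chunk);
      if (wake) wake();
    },
    finish: function () {
      done = true;
      if (wake) wake();
    },
    // Drain every chunk, pausing while the writer is still behind.
    read: async function () {
      var out = [];
      var i = 0;
      for (;;) {
        while (i < chunks.length) out.push(chunks[i++]);
        if (done) return out;
        await new Promise(function (resolve) { wake = resolve; });
      }
    }
  };
}
```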

@jakearchibald
Contributor Author

Yeah, not sure if it's a problem either. Maybe in-progress matches should be removed between steps 3 & 4.

@jakearchibald
Contributor Author

Streaming is obviously way better; 'SHOULD' is probably good enough.

@wanderview
Member

I agree, streaming is way better. It will just be tricky to implement in gecko, though, due to our stream infrastructure. I'll have to build an async wrapper around our file streams that lets me kick the reader across the process boundary whenever new data is written to the stream. Not a spec issue, of course.

@jakearchibald
Contributor Author

There are parts of Chrome's fetch & cache impl that aren't as streamy as they should be. SHOULD is a pretty good get-out, allowing an initial implementation that works but isn't quite optimal.

@jungkees
Collaborator

jungkees commented Dec 9, 2014

Actually, some clarification is needed on the gap between the current algorithms as-is and the expected behaviors being discussed here.

At the moment cache-adding methods resolve when the full body of the response if written to the cache.

This was expected, but the current addAll() works in a way that the result promise resolves without waiting for the full body to be written to the cache.

Shall we change this so items are added to the underlying cache as soon as we have headers?

.match also was not waiting for the full body to be written to the cache before resolving with the response.

Accordingly, in order to satisfy the suggested behavior, steps 4 to 6 should be added to the current addAll(). Plus, some further review and modification would be required in the Batch Cache Operations algorithm, where the existing matched entries are removed before new ones are written. I need to examine this further to sort it out.

@slightlyoff
Contributor

Sorry for the late comments.

I agree that the streaming behavior is better and that there's essential non-determinism in the Cache API today (it's known, and we didn't add transactions because of interactions with IDB transactions that we didn't want to think through in an ad-hoc way).

I think it's the right thing to "commit" to the Cache when you have headers (not the full body). This is compatible with body streaming.

Jungkee: I agree with those changes. As long as the individual promises resolve when headers are available, I'm not sure that "Batch Cache Operations" needs to change.

@jakearchibald
Contributor Author

I think the changes are right, but it doesn't solve the problems I thought it might:

self.addEventListener('fetch', function(event) {
  var requestURL = new URL(event.request.url);

  if (/\.jpg$/.test(requestURL.pathname)) {
    event.respondWith(
      caches.match('/img.jpg').then(function(response) {
        return response || fetch('/img.jpg').then(function(response) {
          return caches.open('img-cache').then(function(cache) {
            cache.put('/img.jpg', response.clone());
            return response;
          });
        });
      })
    );
  }
});

Imagine a page with a.jpg and b.jpg:

  1. a.jpg: look for /img.jpg in caches
  2. b.jpg: look for /img.jpg in caches
  3. a.jpg: no match found
  4. a.jpg: fetch /img.jpg
  5. b.jpg: fetch /img.jpg
  6. etc etc

Should we try to solve the above or leave it to devs? Devs could do it with something like:

var imgRequest;

self.addEventListener('fetch', function(event) {
  var requestURL = new URL(event.request.url);

  if (/\.jpg$/.test(requestURL.pathname)) {
    event.respondWith(
      caches.match('/img.jpg').then(function(response) {
        if (response) return response;
        if (!imgRequest) {
          imgRequest = fetch('/img.jpg').then(function(response) {
            return caches.open('img-cache').then(function(cache) {
              cache.put('/img.jpg', response.clone());
              return response;
            });
          });
        }

        return imgRequest.then(function(response) {
          return response.clone();
        });
      })
    );
  }
});

@wanderview
Member

var imgRequest;

If we're advising people not to use global script vars in SW normally, it seems like this might be confusing for devs.

Is this an argument to add a cache method to "match() and add() only if not present"?

@jakearchibald
Contributor Author

…yeah, and if we ever need to speed up ServiceWorkers by running multiple instances at once, we break this.

The low level pattern here is:

  1. Is it in the cache? If so, serve it & abort
  2. Is it currently being requested? If so:
    1. Wait for response (headers only)
    2. If the response is not a network failure, serve it & abort
  3. Fetch it via the network
  4. Wait for response (headers only)
  5. Put response clone in the cache
  6. Serve response

The bit we're missing is 2. It's not totally simple, as we don't know if the in-flight request is really a match for our request until we have headers, since matching can be affected by Vary headers on the response.

@slightlyoff suggested that the networking layer of the browser should just take care of 2. I'm not so sure it's safe or desirable to do that. Any idea what Gecko does here?
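Step 2 of the pattern above can be approximated inside a single service worker instance by keying in-flight promises on URL. As noted in the thread, this ignores Vary matching and breaks if multiple SW instances ever run; `dedupedFetch` and `inFlight` are illustrative names, not proposed API:

```javascript
// Illustrative only: dedupe concurrent fetches for the same URL within
// one service worker instance. Vary matching is ignored, and this state
// is lost if the worker is killed or run as multiple instances.
var inFlight = new Map();

function dedupedFetch(url, fetchImpl) {
  var promise = inFlight.get(url);
  if (!promise) {
    // Nothing in flight: hit the network (steps 3-4 of the pattern).
    promise = fetchImpl(url);
    inFlight.set(url, promise);
    // Once headers have settled either way, stop deduping this URL.
    var cleanup = function () { inFlight.delete(url); };
    promise.then(cleanup, cleanup);
  }
  // Real callers would each need their own response.clone() here.
  return promise;
}
```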

@wanderview
Member

I don't think the network layer can do this. It doesn't know which network requests will have side effects; e.g. hitting a REST API endpoint. Content has to indicate this through cache control headers. So the second fetch() would go to the http cache if configured to do so. I think we have to live with the race where a second fetch() occurs before the headers are known for the first fetch().

@jakearchibald
Contributor Author

Agree with @wanderview that we can't rely on magic here.

@annevk do you think this is a sensible thing to live in Fetch? An option to only get the response if it's currently in-flight. So:

fetch('/').then(function(response) {
   // ...
});

fetch('/', {
  onlyInFlightStuffPlease: true
}).then(function(response) {
   // ...
});

…would only result in one network request, and the 2nd response is a clone of the first? There are lots of details of course, just trying to work out if it's something we should enable. The "in-flight" request list would be origin specific of course.

@jungkees
Collaborator

Clarified that responses can commit to the cache as soon as their headers are available, and made add/addAll/put resolve when the full bodies of the responses are written to the cache: 9eb76f4.

If you call .match(request) and there's already a match fully-written in the cache, you'll get that rather than the in-progress one. This seems sensible to me, but we could favour the in-progress one.

This requirement has not been addressed yet. I'm thinking of defining an associated fetching record (the in-progress one) for each incumbent record (a fully-written match) in the request to response map.

@wanderview
Member

Note: The cache commit is allowed as long as the response's headers are available.

This suggests to me that committing at this point is optional for the implementation. Is this correct?

If it's required, I suggest stronger language like "The cache must commit the response once the headers are available."

@wanderview
Member

I don't anticipate supporting this any time soon in the gecko Cache. Here is a bug to track the issue:

https://bugzilla.mozilla.org/show_bug.cgi?id=1110137

@jakearchibald
Contributor Author

"should" feels like the right wording, but I don't want it to block any of us shipping commit-on-full-write implementations.

@wanderview
Member

@jakearchibald, I was thinking more about this:

do you think this is a sensible thing to live in Fetch? An option to only get the response if it's
currently in-flight.

fetch('/').then(function(response) {
   // ...
});
fetch('/', {
  onlyInFlightStuffPlease: true
}).then(function(response) {
   // ...
});

It seems the browser would have to cache the in-flight responses somewhere. With multi-process architectures, this may not be super straightforward.

I still think this would be better to add to Cache. Instead of implicitly creating a cache, we would explicitly expose the capability on the API available to content. Then libraries could implement an "in-flight only fetch" feature using this primitive.

For example, let's say we added { fetchIfMissing: true } and { bodyComplete: true } to QueryParams; then a library could do something like this:

function fetchInFlightOnly(request) {
  var cache;
  return caches.open('FetchInFlightOnly').then(function(c) {
    cache = c;
    return cache.match(request, { fetchIfMissing: true });
  }).then(function(response) {
    cache.delete(request, { bodyComplete: true });
    return response;
  });
}

@wanderview
Member

It seems the browser would have to cache the in-flight responses somewhere. With multi-process architectures, this may not be super straightforward.

Well, this probably isn't as hard as I make it out to be. But I still find the flag on fetch() harder to reason about and I don't know that devs will understand when to use or not use it.

@jungkees
Collaborator

Besides the redundant-fetch-for-the-same-resource issue, here are some points about the cache behavior under the committing-on-header decision:

However, if you have code that tries the cache, falls back to the network and adds to the cache, you can end up with multiple pages doing this with the same resource, as cache.match won't return items until it's fully cached.

This no longer seems true, as .match should now return in-progress items.

If you call .match(request) and there's already a match fully-written in the cache, you'll get that rather than the in-progress one. This seems sensible to me, but we could favour the in-progress one.

Frequently matching in-progress responses could hurt the offline user experience, I guess. So serving the fully-written response with higher priority seems a must. I'd like to know whether we're on the same page.

@jungkees
Collaborator

Defined fetching record (an in-progress match) and incumbent record (a fully-written match) for cache entry: ba025c9.

The behaviors can be summarized as:

  1. We don't wait for the full body stream. A fetching record is committed to the cache as soon as the response's headers are available.
  2. Incumbent records always have higher priority.
    • Fetching records basically serve the same thing as the network request, so they're served as in-progress fetches, and they're want-to-be incumbent records for the next fetch events.
    • When no incumbent record exists, we do serve the fetching record.
  3. Fulfillment of the cache-adding methods (add/addAll/put) guarantees we have incumbent records for those requests.
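A minimal model of that priority rule, with each cache entry holding an optional incumbent record and an optional fetching record (`matchEntry` and `commitFetchingRecord` are illustrative names, not spec algorithms):

```javascript
// Illustrative: a cache entry may have a fully-written incumbent record
// and/or an in-progress fetching record.
function matchEntry(entry) {
  // Rule 2: the incumbent record always wins when present.
  if (entry.incumbent) return entry.incumbent;
  // Rule 2, second bullet: with no incumbent, serve the in-progress fetch.
  return entry.fetching || null;
}

function commitFetchingRecord(entry) {
  // Rule 3: once its body is fully written, the fetching record
  // becomes the new incumbent.
  entry.incumbent = entry.fetching;
  entry.fetching = null;
}
```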

@jakearchibald
Contributor Author

@jungkees

I might be reading wrong, but I think there's been a couple of regressions in the spec. Imagine a cache with 2 entries that look like this header-wise (assume each has the same URL):

  1. Request: { "Accept": "text/html" } Response: {"Vary": "Accept"}
  2. Request: { "Accept": "foo/bar" } Response: {}

Then if we add a new entry to the cache, Request: { "Accept": "text/html" } Response: {} (no Vary), the cache should now look like this:

  1. Request: { "Accept": "text/html" } Response: {}

The original algorithm would delete entries from the cache that match the new entry's request. As far as I can tell, the current spec would modify the first entry in the cache, leaving the second. That means the second entry cannot be matched, since the first will always match first. Am I reading it correctly?

In a system where we allow in-flight requests into the cache, it could work like this:

  1. Request: { "Accept": "text/html" } Response: {"Vary": "Accept"}
  2. Request: { "Accept": "foo/bar" } Response: {}

Then if we add a new entry to the cache, Request: { "Accept": "text/html" } Response: {}:

  1. Request: { "Accept": "text/html" } Response: {"Vary": "Accept"}
  2. Request: { "Accept": "foo/bar" } Response: {}
  3. Request: { "Accept": "text/html" } Response: {} (in-flight)

So nothing's deleted from the cache at this stage. .match will favour entry 1 before 3, because it goes with first match. But once the request from entry 3 completes, it removes entries from the cache that match its request, leaving:

  1. Request: { "Accept": "text/html" } Response: {}

This means there are no entries in the cache that cannot be matched.

I also think it's important to keep the cache in an order where the most recently added entries are at the end. This is useful when pruning caches. When I talked to Twitter last week, they were interested in adding cached items back into the cache once they'd sent them to the browser. If added entries appear at the end of the list, this helps them track which items were "last accessed", and therefore which entries they should keep.
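The completion behaviour described above, including the ordering property, can be sketched over an ordered array of entries. Request matching is reduced to plain URL equality here (ignoring Vary), and `completeEntry` is an invented name:

```javascript
// Illustrative: the cache is an ordered array. When an in-flight entry's
// body completes, older entries matching its request are removed, and the
// completed entry keeps its (most recently added) position at the end.
// Vary-based request matching is reduced to URL equality for the sketch.
function completeEntry(cache, entry) {
  entry.inFlight = false;
  return cache.filter(function (other) {
    return other === entry || other.url !== entry.url;
  });
}
```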

@jakearchibald
Contributor Author

Also, sorry I was so late in getting to this :(

@jakearchibald
Contributor Author

@wanderview

But I still find the flag on fetch() harder to reason about

Yeah, my suggestion was silly. I think you're right, this needs to go on the cache. Will open a new ticket for that once we have the other in-flight stuff defined.

@jungkees
Collaborator

So nothing's deleted from the cache at this stage. .match will favour entry 1 before 3, because it goes with first match. But once the request from entry 3 completes, it removes entries from the cache that match its request

Addressed the point: 8e76892. Thanks for spotting it and explaining the details @jakearchibald.

@jungkees
Collaborator

I believe we're done defining the behavior for the fetching / incumbent records of caches. Closing.
