Can caches store and retrieve results that are still being written to the cache? #573
To clarify, with this approach matchAll() would retrieve both the previously existing entries and the new in-progress entry? And match() gets the previous entry due to ordering constraints? One thing I'm trying to understand: why is it that ".match() can no longer reach them" in step 6, when they can still be reached after step 3? What happens between those two steps to make them unreachable? |
Yep!
The in-progress request can still fail in step 5, removing it from RequestToResponseMap, which re-exposes the other matches. |
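A sketch of what that would look like to script (illustrative only: the URL and cache name are invented, and it assumes the new response's headers have already committed by the time match()/matchAll() run):

```js
async function demo() {
  var cache = await caches.open('v1');
  // Assume '/a.jpg' already has a fully-written entry in this cache.
  var addDone = cache.add('/a.jpg'); // starts writing a new in-progress entry

  var one = await cache.match('/a.jpg');    // the previous, fully-written entry
  var all = await cache.matchAll('/a.jpg'); // previous entry + in-progress entry

  await addDone; // body fully written: the old entry is removed (step 6),
                 // so .match() now reaches only the new one
}
```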
And if you have two cache.add() calls with overlapping requests at the same time, then it's a bit non-deterministic which ends up in the cache, right? Whichever one finishes first will delete the other. Not sure that's a problem, just trying to think through the races. |
Also, will the spec say anything about how the match()'d response should treat partially written entries? For example, I could see the easy implementation waiting to provide the first byte of the body until the body is fully written. The trickier, better, solution of course would be to stream the data out as it slowly fills in the cache file. |
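For concreteness, the two options sketched as code; partialEntry and its methods are invented stand-ins for an implementation's internal state, just to pin down the difference:

```js
// (a) Easy: hold back the first byte until the cache write completes.
async function serveBuffered(partialEntry) {
  await partialEntry.fullyWritten; // assumed promise: resolves when body is complete
  return new Response(await partialEntry.readAll(), { headers: partialEntry.headers });
}

// (b) Trickier, better: stream bytes out as they land in the cache file.
function serveStreaming(partialEntry) {
  return new Response(new ReadableStream({
    async pull(controller) {
      var chunk = await partialEntry.readNextChunk(); // assumed: waits on the writer
      if (chunk) controller.enqueue(chunk);
      else controller.close();
    }
  }), { headers: partialEntry.headers });
}
```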
Yeah, not sure if it's a problem either. Maybe in-progress matches should be removed between steps 3 & 4. |
Streaming is obviously way better, 'SHOULD' is probably good enough. |
I agree, streaming is way better. It will just be tricky to implement in gecko, though, due to our stream infrastructure. I'll have to build an async wrapper around our file streams that lets me kick the reader across the process boundary whenever new data is written to the stream. Not a spec issue, of course. |
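For what it's worth, the shape of such a wrapper can be sketched in plain JS (names invented; the real thing would be C++ around Gecko's file streams): a reader that parks until the writer kicks it by appending more data.

```js
function makeGrowingStream() {
  var chunks = [];
  var done = false;
  var notify = null;
  function kick() { if (notify) { notify(); notify = null; } }

  var source = {
    pull: function(controller) {
      if (chunks.length) { controller.enqueue(chunks.shift()); return; }
      if (done) { controller.close(); return; }
      // Park until the writer appends more data, then try again.
      return new Promise(function(resolve) { notify = resolve; }).then(function() {
        return source.pull(controller);
      });
    }
  };

  return {
    write: function(chunk) { chunks.push(chunk); kick(); }, // "kicks" any parked reader
    close: function() { done = true; kick(); },
    readable: new ReadableStream(source)
  };
}
```

A .match() on a partially written entry could then hand readable to a Response while the write side keeps calling write().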
There are parts of Chrome's fetch & cache impl that aren't as streamy as they should be. SHOULD is a pretty good get-out, allowing an initial implementation that works but isn't quite optimal. |
Actually, some clarification is needed to reconcile the current algorithms with the expected behaviors being discussed here.
This was the expectation, but the current addAll() resolves its result promise without waiting for the full body to be written to the cache.
Accordingly, to satisfy the suggested behavior, steps 4 to 6 should be added to the current addAll(). Some further review and modification would also be required in the Batch Cache Operations algorithm, where existing matched entries are removed before new ones are written. I need to examine this further to sort it out. |
Sorry for the late comments. I agree that the streaming behavior is better and that there's essential non-determinism in the Cache API today (it's known, and we didn't add transactions because of interactions with IDB transactions that we didn't want to think through in an ad-hoc way). I think it's the right thing to "commit" to the Cache when you have headers (not the full body). This is compatible with body streaming. Junkee: I agree with those changes. As long as the individual promises resolve when headers are available, I'm not sure that "Batch Cache Operations" needs to change. |
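A non-normative sketch of the observable split being proposed here (URL and cache name invented): the put/add promise still tracks the full body, while the entry becomes matchable as soon as the headers commit.

```js
async function demo() {
  var cache = await caches.open('v1');
  var response = await fetch('/big.bin');

  // Resolves only once the full body has been written, so
  // installEvent.waitUntil() can still reject on a truncated body.
  var putDone = cache.put('/big.bin', response);

  // Meanwhile, once the headers have committed, match() can already
  // return the in-progress entry; its body streams in as it's written.
  var inProgress = await cache.match('/big.bin');

  await putDone;
}
```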
I think the changes are right, but it doesn't solve the problems I thought it might:

self.addEventListener('fetch', function(event) {
  var requestURL = new URL(event.request.url);
  if (/\.jpg$/.test(requestURL.pathname)) {
    event.respondWith(
      caches.match('/img.jpg').then(function(response) {
        return response || fetch('/img.jpg').then(function(response) {
          return caches.open('img-cache').then(function(cache) {
            cache.put('/img.jpg', response.clone());
            return response;
          });
        });
      })
    );
  }
});

Imagine a page with a bunch of JPGs: every fetch event checks the cache before the first response's headers have landed, so they all miss and the same resource is fetched from the network multiple times.
Should we try to solve the above or leave it to devs? Devs could do it with something like:

var imgRequest;

self.addEventListener('fetch', function(event) {
  var requestURL = new URL(event.request.url);
  if (/\.jpg$/.test(requestURL.pathname)) {
    event.respondWith(
      caches.match('/img.jpg').then(function(response) {
        if (response) return response;
        if (!imgRequest) {
          imgRequest = fetch('/img.jpg').then(function(response) {
            return caches.open('img-cache').then(function(cache) {
              cache.put('/img.jpg', response.clone());
              return response;
            });
          });
        }
        return imgRequest.then(function(response) {
          return response.clone();
        });
      })
    );
  }
}); |
If we're advising people not to use global script vars in SW normally, it seems like this might be confusing for devs. Is this an argument for adding a cache method that does "match(), and add() only if not present"? |
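For what such a method might buy us, here's a rough userland approximation (matchOrAdd is my name, not a proposal); note it still needs a global in-flight map, i.e. exactly the pattern we'd be advising against:

```js
var pending = {}; // in-flight adds keyed by URL; assumes a single SW instance

function matchOrAdd(cache, url) {
  return cache.match(url).then(function(response) {
    if (response) return response;
    if (!pending[url]) {
      pending[url] = cache.add(url).then(function() {
        return cache.match(url);
      }).then(function(response) {
        delete pending[url];
        return response;
      }, function(err) {
        delete pending[url];
        throw err;
      });
    }
    // Every waiter gets its own clone of the shared response.
    return pending[url].then(function(response) {
      return response.clone();
    });
  });
}
```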
…yeah, and if we ever need to speed up ServiceWorkers by running multiple instances at once, we break this. The low level pattern here is:

1. Check the cache for a match; if there's one, serve it.
2. Check in-flight requests for a match; if there's one, serve its eventual response.
3. Otherwise go to the network, and add the result to the cache.

The bit we're missing is 2. It's not totally simple, as we don't know if the in-flight request is really a match for our request until we have headers, since matching can be affected by Vary headers on the response. @slightlyoff suggested that the networking layer of the browser should just take care of 2. I'm not so sure it's safe or desirable to do that. Any idea what Gecko does here? |
I don't think the network layer can do this. It doesn't know which network requests will have side effects; e.g. hitting a REST API endpoint. Content has to indicate this through cache control headers. So the second fetch() would go to the http cache if configured to do so. I think we have to live with the race where a second fetch() occurs before the headers are known for the first fetch(). |
Agree with @wanderview that we can't rely on magic here. @annevk do you think this is a sensible thing to live in Fetch? An option to only get the response if it's currently in-flight. So:

fetch('/').then(function(response) {
  // ...
});

fetch('/', {
  onlyInFlightStuffPlease: true
}).then(function(response) {
  // ...
});

…would only result in one network request, and the 2nd response is a clone of the first? There are lots of details of course, just trying to work out if it's something we should enable. The "in-flight" request list would be origin specific of course. |
Clarified that responses can commit to the cache as soon as their headers are available, and made add/addAll/put resolve when the full bodies of the responses are written to the cache: 9eb76f4.
This requirement has not been addressed yet. I'm thinking of defining an associated fetching record (an in-progress one) for each incumbent record (a fully-written match) in the request to response map. |
This suggests to me that committing at this point is optional for the implementation. Is this correct? If it's required, I suggest stronger language like "The cache must commit the response once the headers are available." |
I don't anticipate supporting this any time soon in the gecko Cache. Here is a bug to track the issue: |
"should" feels like the right wording, but I don't want it to block any of us shipping commit-on-full-write implementations. |
@jakearchibald, I was thinking more about this:
It seems the browser would have to cache the in-flight responses somewhere. With multi-process architectures, this may not be super straightforward. I still think this would be better to add to Cache. Instead of implicitly creating a cache, we would explicitly expose the capability on the API available to content. Then libraries could implement an "in-flight only fetch" feature using this primitive. For example, let's say we added
|
Well, this probably isn't as hard as I make it out to be. But I still find the flag on fetch() harder to reason about and I don't know that devs will understand when to use or not use it. |
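To make the "libraries could implement it" point concrete, an in-flight-only fetch is expressible in script today with an explicit map (everything below is invented for illustration, and it has the same multi-instance caveat as before):

```js
var inFlight = new Map(); // URL -> promise for the in-flight response

function dedupedFetch(url) {
  if (inFlight.has(url)) {
    // "Only in-flight stuff": reuse the pending response, cloned per caller.
    return inFlight.get(url).then(function(response) {
      return response.clone();
    });
  }
  var p = fetch(url).then(function(response) {
    inFlight.delete(url); // settled: later calls hit the network again
    return response;
  }, function(err) {
    inFlight.delete(url);
    throw err;
  });
  inFlight.set(url, p);
  return p.then(function(response) {
    return response.clone();
  });
}
```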
Besides the redundant-fetch-for-the-same-resource issue, here are some points about the cache behavior under the commit-on-headers decision:
It seems not necessarily true now as
Frequently matching in-progress responses could hurt the offline user experience, I guess. So serving the fully-written response with higher priority seems a must. I'd like to know whether we're on the same page. |
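A sketch of the lookup priority being proposed (the record shape and the requestsMatch helper are assumptions, not spec text): prefer a fully-written record and only fall back to an in-progress one.

```js
// entries: [{ request, response, fullyWritten }, ...] in insertion order
function matchWithPriority(entries, request) {
  var candidates = entries.filter(function(entry) {
    return requestsMatch(entry.request, request); // assumed Vary-aware matching
  });
  var incumbent = candidates.filter(function(entry) { return entry.fullyWritten; });
  // Serve a fully-written match first; an in-progress one only as a fallback.
  return incumbent[0] || candidates[0] || null;
}
```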
Defined fetching record (an in-progress match) and incumbent record (a fully-written match) for cache entry: ba025c9. The behaviors can be summarized as:
|
I might be reading it wrong, but I think there have been a couple of regressions in the spec. Imagine a cache with 2 entries that look like this header-wise (assume each has the same URL):
Then if we add a new entry to the cache, Request:
The original algorithm would delete entries from the cache that match the new entry's request. As far as I can tell, the current spec would modify the first entry in the cache, leaving the second. That means the second entry cannot be matched, since the first will always match first. Am I reading it correctly? In a system where we allow in-flight requests into the cache, it could work like this:
Then if we add a new entry to the cache, Request:
So nothing's deleted from the cache at this stage.
This means there aren't entries in the cache that cannot be matched. I also think it's important to keep the cache in an order where the most recently added entries are at the end. This is useful when pruning caches. When I talked to Twitter last week, they were interested in adding cached items back into the cache once they'd sent them to the browser. If added entries appear at the end of the list, that helps them track which items were "last accessed", and therefore which entries they should keep. |
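A sketch of the invariant being described (data shapes assumed): completing a write removes the entries it shadows and appends the new entry at the end, keeping the list in most-recently-added order.

```js
// entries: array of { request, response, fullyWritten } in most-recently-added order
function commitEntry(entries, newEntry) {
  // Remove fully-written entries the new one now shadows.
  for (var i = entries.length - 1; i >= 0; i--) {
    var entry = entries[i];
    if (entry !== newEntry && entry.fullyWritten &&
        requestsMatch(entry.request, newEntry.request)) { // assumed Vary-aware
      entries.splice(i, 1);
    }
  }
  // Move the new entry to the end: most recently added sorts last,
  // which lets content prune least-recently-added items.
  entries.splice(entries.indexOf(newEntry), 1);
  entries.push(newEntry);
  newEntry.fullyWritten = true;
}
```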
Also, sorry I was so late in getting to this :( |
Yeah, my suggestion was silly. I think you're right, this needs to go on the cache. Will open a new ticket for that once we have the other in-flight stuff defined. |
Addressed the point: 8e76892. Thanks for spotting it and explaining the details @jakearchibald. |
I believe we're done with defining the behavior for the fetching / incumbent records of caches. Closing. |
At the moment cache-adding methods resolve when the full body of the response is written to the cache. That means we can reject if the body stream looks like a failure, which can be suggested by content-length #362 (comment). This is a must due to installEvent.waitUntil(addStuffToCaches) - we can't have that resolve but ultimately fail.

However, if you have code that tries the cache, falls back to the network and adds to the cache, you can end up with multiple pages doing this with the same resource, as cache.match won't return items until it's fully cached.

Shall we change this so items are added to the underlying cache as soon as we have headers? So for cache.add(request):
1. Make the request.
2. If the response is a failure, reject.
3. Add the request/response pair to the request to response map; the body may still be in progress, but the entry is reachable via .match().
4. Wait for the full body to be written to the cache.
5. If writing the body fails, reject and remove the entry added in step 3.
6. Remove other entries that match the request (responses already handed out keep working, but .match can no longer reach them)

If you call .match(request) and there's already a match fully-written in the cache, you'll get that rather than the in-progress one. This seems sensible to me, but we could favour the in-progress one.

If you get an in-progress response back from .match(request), that goes back to the page, it may arrive slowly, it may ultimately fail. But hey, that's the same as the network, which it ultimately is.

Questions: are we happy exposing in-progress entries via .match()? I don't think it needs to be optional.

Related: #361 #362
/cc Cache-warriors @wanderview @gavinp @sicking
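To make the proposed add() lifecycle concrete, here is a non-normative sketch of the six steps above as script-flavoured pseudocode (map and its operations are invented stand-ins for operations on the spec's request to response map):

```js
async function add(request) {
  const response = await fetch(request);                  // step 1
  if (!response.ok) throw new TypeError('bad response');  // step 2
  map.append(request, response);     // step 3: entry now reachable via .match()
  try {
    await writeBodyToCache(response);                     // step 4
  } catch (err) {
    map.remove(request, response);   // step 5: failure re-exposes older matches
    throw err;
  }
  // step 6: older matching entries become unreachable via .match()
  map.removeOthersMatching(request, response);
}
```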