Improve HTTP Cache #23495

Open
gterzian opened this issue Jun 1, 2019 · 3 comments
gterzian (Member) commented Jun 1, 2019

I think there is a bunch of stuff that can be improved in the HTTP cache, although it requires some looking into.

A good template to see what we can improve can be found at https://developer.mozilla.org/en-US/docs/Mozilla/HTTP_cache

Our current implementation is actually pretty decent, since we already have a form of concurrent read/write (unlike what is referred to as "version 1" of the cache in Gecko).

However, having looked more at the code, I'm actually surprised it works. While it can pass quite a lot of the tests in the suite corresponding to the protocol specified at http://tools.ietf.org/html/rfc7234, I'm not sure it could survive more realistic usage from actual users, due to some issues around its use across parallel instances of the fetch algorithm.

A couple of issues I noted (not exhaustive):

  1. In update_awaiting_consumers we update any concurrent "readers" that are awaiting a network response. However, we don't seem to check whether the response waking up all those readers is actually what those readers are looking for; in other words, we do nothing to prevent spurious wake-ups. The main problem would be 200 vs 206 responses, since those would still have the same cache key. I think there is actually a bug there, in that for example a 206 response could wake up readers of a 200 cached response that hasn't finished yet, or vice versa.
  2. In the same place, we also "wake up" all readers at once, which is a good idea in the case of a finished response, but seems like a bad idea in the case of a response that has been aborted, because all those readers will then each perform a new network request. It would seem better to have just one of them do so and have the others continue waiting.
    That is somewhat hard to do because we currently tie a "cached resource" to a given response, which might or might not have finished, could result in a network error, or could be aborted by script. So to have the readers "wait" on a new response would require "transplanting" the cached resource they are reading/waiting on onto that new response.
  3. In the case of a network error, it seems that we don't wake up any "waiting" concurrent readers. That seems like a pretty bad idea, since they'll essentially be waiting for another fetch of that same resource to start, which might never happen.
  4. When a fetch is aborted in script, we abort the related network request, which means that any concurrent readers of the "potentially already cached but now aborted" entry will then all wake up and basically do the network request again (one request per reader, unless their own fetches have been aborted as well). It might be a better idea not to tie "script aborting a fetch" to "cancelling the network request", especially if other fetches, not necessarily aborted themselves, might still want the network response. Perhaps we should count the readers, and only abort the network request when all the associated fetches have been aborted? This would require taking one reader and promoting it to writer, while keeping the other readers waiting.
  5. In the cache's handling of range requests, we only construct cached responses when we can build one from available Done cached resources. We should instead also allow satisfying range requests with the data of cached resources still in Receiving mode. The reason we don't is, I think, mainly because of what is described at 1. So this would, I think, simply require proper handling of 200 versus 206 responses at 1, and then, instead of just "waking up" everyone, inspecting the range present in the response that triggered the "wake-up" and potentially also using its body to "fill up" some outstanding range requests (although if you were to use a 200 to complete a bunch of 206s, you'd have to cancel their network requests, but not their fetches, so some coordination would be necessary).
  6. We're not doing anything with the partial data received on an aborted entry. Currently we just always filter out aborted entries and don't use them at all. It would seem that we could instead re-use the data where possible, which would require having the fetch initiate a network request to complete the stored response.
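A minimal sketch of what points 1 and 2 could look like. ResponseKind, Waiter, Outcome, and the shape of update_awaiting_consumers here are illustrative, not Servo's actual types: the idea is to wake only the readers whose expected response kind matches, and on abort to promote a single reader instead of stampeding everyone onto the network.

```rust
use std::sync::mpsc::{channel, Sender};

// Illustrative stand-ins; Servo's real code keys wake-ups off DoneChannel.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ResponseKind {
    Full,    // a 200 response
    Partial, // a 206 response
}

struct Waiter {
    expects: ResponseKind,
    wake: Sender<()>,
}

enum Outcome {
    Finished(ResponseKind),
    Aborted,
}

fn update_awaiting_consumers(waiters: &mut Vec<Waiter>, outcome: Outcome) {
    match outcome {
        // On a finished response, wake only waiters whose expectation
        // matches, preventing 200-vs-206 spurious wake-ups.
        Outcome::Finished(kind) => {
            waiters.retain(|w| {
                if w.expects == kind {
                    let _ = w.wake.send(());
                    false // woken waiters leave the queue
                } else {
                    true // keep waiting for a matching response
                }
            });
        }
        // On abort, wake exactly one waiter (who will retry over the
        // network) and keep the others queued behind it.
        Outcome::Aborted => {
            if let Some(one) = waiters.pop() {
                let _ = one.wake.send(());
            }
        }
    }
}
```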

As a quick intro to the cache: the "data" of the cache is currently a HashMap<CacheKey, Vec<CachedResource>>.

So it looks like a HashMap of entries with a lock around it, but it's actually more complicated than that, because the contents of the map are Arcs that are read from, and written to, potentially from parallel fetch threads, and there is a DoneChannel used for "waking up" readers when data is written to the contents of the Arc by the single writer.

Where a CachedResource is a

```rust
struct CachedResource {
    request_headers: Arc<Mutex<HeaderMap>>,
    body: Arc<Mutex<ResponseBody>>,
    aborted: Arc<AtomicBool>,
    awaiting_body: Arc<Mutex<Vec<Sender<Data>>>>,
    data: Measurable<MeasurableCachedResource>,
}
```

All those Arcs are actually cloned from a Response that is either finished or still ongoing somewhere in a fetch thread. The awaiting_body is a list of the Sender halves of DoneChannels, corresponding to fetch workers that are "readers" of the current resource and are waiting for the body to complete.

Besides the problems I noted above, I think the high-level design problem is:

We tie a CachedResource too closely with a given network response. This makes it hard to do anything that would involve doing additional networking to affect an existing cached resource.

Also, it seems we tie a "network request/response" closely to a fetch, whereas in reality other fetches might also depend on that network request/response via the cache.

So while we do support multiple concurrent readers awaiting a given cached resource to be finished by a single writer, if the writer somehow aborts we just throw the entire workflow away and pretend each concurrent reader is just another fetch thread that missed the cache.
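One way to express the "count the readers" idea from point 4 is a refcount on the shared network response, so the network request itself is only cancelled when the last interested fetch goes away. A hedged sketch with made-up names (SharedNetworkResponse is not a Servo type):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical: the underlying network response, shared by every fetch
// reading the same cache entry, tracks how many readers remain.
struct SharedNetworkResponse {
    readers: AtomicUsize,
}

impl SharedNetworkResponse {
    fn new() -> Arc<Self> {
        Arc::new(SharedNetworkResponse {
            readers: AtomicUsize::new(0),
        })
    }

    fn add_reader(&self) {
        self.readers.fetch_add(1, Ordering::SeqCst);
    }

    // Called when one fetch is aborted; returns true only when the caller
    // was the last reader, meaning the network request can now be cancelled.
    fn drop_reader(&self) -> bool {
        self.readers.fetch_sub(1, Ordering::SeqCst) == 1
    }
}
```

Promoting one remaining reader to "writer" on abort would then be a separate step layered on top of this count.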

I've made a start at #23494

gterzian (Member, Author) commented Jun 3, 2019

OK, so I made a start at #23494 to fix the easier and most obvious problems, and also added some comments in the code about what is left to do.

gterzian (Member, Author) commented Jun 3, 2019

Finally, on a broader note: the current Fetch implementation, or at least the large parts of it found in http_loader.rs, basically reads like one massive imperative algorithm. That reads nicely alongside the spec, but in order to support things like the cache doing networking outside of the spec to complete partial responses, or to support more complicated "concurrent read/write" workflows, we might have to restructure parts of the current implementation of fetch.

For example, the actual networking in Fetch is done inside http-network-fetch, with the actual network read happening at

```rust
res.into_body()
```

So obviously, if the cache is doing additional networking, it won't be able to go through the entire http-network-fetch, so we should look at extracting the networking into something reusable from inside the cache, in some way.
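As a sketch of what such an extracted, reusable networking step could look like: the "read the body into the cache" loop, with the chunk source abstracted as an iterator instead of hyper's body stream so that either http-network-fetch or the cache could drive it. All names here are hypothetical.

```rust
use std::sync::{Arc, Mutex};

// Drain a source of body chunks into a shared cached body. In Servo this
// would also be the point where awaiting readers get chunks over their
// DoneChannel; that part is elided here.
fn stream_into_cache<I>(chunks: I, body: &Arc<Mutex<Vec<u8>>>)
where
    I: IntoIterator<Item = Vec<u8>>,
{
    for chunk in chunks {
        body.lock().unwrap().extend_from_slice(&chunk);
    }
}
```

Because the function only depends on a chunk source and a shared buffer, the cache could call it to complete a partial response without re-entering the full fetch algorithm.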

Same thing with the "readers awaiting a cached response" at

```rust
fn wait_for_cached_response(done_chan: &mut DoneChannel, response: &mut Option<Response>) {
```

This is now done right inside the implementation of http-network-or-cache-fetch; we might want to somehow extract it so that it becomes less tied to the Fetch flow itself, yet can still be used inside of it, but also by the cache separately, to coordinate various concurrent readers and writers.
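A standalone version of that waiting step might depend only on the receiving half of a DoneChannel, along these lines (Data here is a stand-in for Servo's actual body-chunk type, and the signature is illustrative):

```rust
use std::sync::mpsc::Receiver;

// Stand-in for the message type sent over a DoneChannel.
enum Data {
    Payload(Vec<u8>),
    Done,
}

// Block until the writer finishes the body, accumulating chunks as they
// arrive. Nothing here is tied to the Fetch flow, so the cache could call
// it on its own to coordinate readers and writers.
fn wait_for_cached_response(done_port: &Receiver<Data>) -> Vec<u8> {
    let mut body = Vec::new();
    while let Ok(msg) = done_port.recv() {
        match msg {
            Data::Payload(chunk) => body.extend_from_slice(&chunk),
            Data::Done => break,
        }
    }
    body
}
```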

So, in other words, the current implementation of fetch, and the use of the cache inside of it, follows the spec very closely, perhaps too closely for its own good. In order to support complicated workflows that aren't necessarily part of the Fetch spec, we will probably have to restructure fetch itself so it doesn't just read like a long string of imperative statements.

Finally, someone looking to do additional work could even try to see whether those cache operations, like "completing a 206 response", couldn't be integrated into Fetch itself, as the fetch spec seems to imply a very primitive HTTP cache and only defines the most basic "store" and "retrieve" operations.

Although the spec includes a blanket "Follow the relevant requirements from HTTP. [HTTP] [HTTP-SEMANTICS] [HTTP-COND] [HTTP-CACHING] [HTTP-AUTH]", the various Fetch algorithms don't seem to take into account, for example, having the cache complete responses in parallel with fetch, or having the cache coordinate various readers and writers.

An example is how fetch specifies how to abort-fetch or terminate fetch. It seems to me that those concepts, involving things such as "cancelling the response body", are mostly about how aborting a fetch should influence the "request/response" objects found in script. So while those indeed need to be fully aborted, it would seem that the actual underlying networking could still be reused by, for example, other concurrent readers of a cached entry who don't belong to the same web page and hence are unlikely to have been "fetch aborted" themselves.

Our current implementation again follows the spec very closely; for example, when a fetch is aborted, the response body is "emptied" at

```rust
*res_body.lock().unwrap() = ResponseBody::Done(vec![]);
```

If we want to do more complicated things with concurrent reads and writes by various cache clients, we'd have to separate the "fetch response" from the underlying "networking response", and perhaps allow the networking to continue despite the fetch having been aborted (without allowing that to influence the "response" in script).
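A sketch of that separation, assuming the script-visible response holds a handle onto a shared network body rather than owning the bytes; aborting a fetch then drops only that fetch's handle, while the cached bytes stay usable by other readers. Types are illustrative, not Servo's.

```rust
use std::sync::{Arc, Mutex};

// The underlying networking response: bytes shared by the cache and by
// every fetch reading this entry.
struct NetworkBody {
    bytes: Arc<Mutex<Vec<u8>>>,
}

// The script-visible side of one fetch. `None` after the fetch is aborted.
struct FetchResponse {
    body: Option<Arc<Mutex<Vec<u8>>>>,
}

impl FetchResponse {
    fn abort(&mut self) {
        // Script must observe an empty/absent body per the Fetch spec,
        // but we drop only our own handle instead of emptying the shared
        // buffer out from under other readers.
        self.body = None;
    }
}
```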

gterzian (Member, Author) commented Jun 3, 2019

One more thing: I just stumbled upon the discussion at whatwg/fetch#904 about keying the cache entries not just by URL, but also by the origin of the request.

In some ways this could also be an easy way to get a kind of better concurrent read/write flow in the case of a fetch being aborted. The reasoning would be: if cached resources are kept separate per page origin, you're unlikely to end up with lots of concurrent readers for a given resource, unless somehow a single page makes lots of requests for the same resource, which seems like something for the author of the page to optimize. And within a single origin, if one fetch related to a given resource is aborted, it's likely that all of them are.

So it could be interesting to prioritize this concept before doing anything more complicated with concurrent reader/writer orchestration.
