Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add refresh_pattern option 'ignore-must-revalidate' back in #1196

Closed
wants to merge 1 commit into from

Conversation

compholio
Copy link

This is a partial revert of commit 064679e, which removed the "ignore-must-revalidate" option. The option was not actually broken, but it is not about the storage of objects - it is about whether they need to be retrieved again for new requests.

For example, I encountered a university download service that has set both "Expires: 0" and "Cache-Control: must-revalidate" for unchanging content. The refresh_pattern "override-expires" option will allow you to store a cache of this content, but due to the "Cache-Control: must-revalidate" header the proxy server will always re-fetch the content because the server has effectively indicated (incorrectly) that it is a live data feed. By re-enabling the "ignore-must-revalidate" refresh_pattern option it is possible to override the behavior for the download service's specific URL, such that the proxy can properly cache the content and not have to constantly re-fetch a bunch of multi-gigabyte files.

@squid-prbot

This comment was marked as resolved.

@squid-anubis squid-anubis added the M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels label Dec 2, 2022
@kinkie

This comment was marked as resolved.

this feature was not broken, it is helpful when a website incorrectly
sets Expires to zero and 'Control-Cache: must-revalidate' for
resources that do not change
@compholio

This comment was marked as resolved.

Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for offering this enhancement!

FYI: PR description (i.e. the future official commit message) will need to be updated, but I think we should come back to that after code changes are settled. I interpret the overall scope of this PR as "adding ignore-must-revalidate or a similar option" and ignoring secondary PR description problems (for now).

headers received from a server. Doing this VIOLATES
the HTTP standard. Enabling this feature could make you
liable for problems which it causes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add text to warn the admin that enabling this feature prevents Squid from caching authenticated responses containing must-revalidate and no-cache directives.

Roughly speaking, folks enable ignore-must-revalidate to "cache more" rather than to "cache less". The opposite side effect (for authenticated responses) is likely to be unexpected by most admins. We should document that side effect.

@@ -336,6 +336,9 @@ refreshCheck(const StoreEntry * entry, HttpRequest * request, time_t delta)
*/
const bool revalidateAlways = EBIT_TEST(entry->flags, ENTRY_REVALIDATE_ALWAYS);
if (revalidateAlways || (staleness > -1 &&
#if USE_HTTP_VIOLATIONS
!R->flags.ignore_must_revalidate &&
#endif
EBIT_TEST(entry->flags, ENTRY_REVALIDATE_STALE))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change has two problems AFAICT:

  • It applies ignore-must-revalidate configuration option to responses with Cache-Control: max-age. That is wrong.
  • It applies ignore-must-revalidate configuration option to responses cached before the option was enabled. The proposed documentation does not make it clear that the option affects already cached responses. We should be explicit about this unexpected (for some) "backdating" effect or unexpected (for some) lack of thereof.

The fix(es) would depend on whether we want ignore-must-revalidate to apply to previously cached responses. AFAICT, if we want the simplest solution, then we should answer "no" and

  • "move" this check, in its REFRESH_OVERRIDE() form, to where hasMustRevalidate() is called by HttpStateData::haveParsedReplyHeaders(), eventually resulting in setting ENTRY_REVALIDATE_STALE;
  • adjust proposed documentation to explicitly exclude already cached responses from the configuration option scope/control.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Maybe I'm missing something, but doesn't Cache-Control: max-age get processed after this (checked independently)?

It seems to me like the easiest thing would be to:

  1. Document that this option does not affect Cache-Control: proxy-revalidate, presumably someone who has turned on that option knows what they are doing (and if they don't, a separate option probably makes sense)
  2. Allow the option to apply to cached information from before the option was enabled, to match the original behavior (to minimize unexpected changes, as mentioned in your other comment)
  3. Document this behavior so that it is clear that the option is about revalidation of cached objects, not really the original caching

Proposed documentation:

ignore-must-revalidate ignores any Cache-Control: must-revalidate headers received from a server. This means that previously cached content (potentially cached before the option was enabled) will be reused instead of checking whether the server has updated content. Doing this VIOLATES the HTTP standard. Enabling this feature could make you liable for problems which it causes. Please note this option does not change the behavior of Cache-Control: proxy-revalidate, but does take affect for authenticated transactions where Cache-Control: must-revalidate or Cache-Control: no-cache is set (without parameters) for backwards-compatibility with the pre-v4 behavior.

Note: The authenticated transaction part is not relevant in my use case, but I figured in re-introducing this option it would be best to have it behave the same way it did before.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't Cache-Control: max-age get processed after this (checked independently)?

Sorry, I should have said s-maxage not max-age.

const bool ccMustRevalidate = (rep->cache_control->hasProxyRevalidate() || rep->cache_control->hasMustRevalidate());
...
const bool ccSMaxAge = rep->cache_control->hasSMaxAge();
...
else if (ccMustRevalidate || ccSMaxAge)
    EBIT_SET(entry->flags, ENTRY_REVALIDATE_STALE);

From http.cc code quoted above, we know that at least three directives set ENTRY_REVALIDATE_STALE: proxy-revalidate, must-revalidate, and s-maxage. We know that this code change ties ENTRY_REVALIDATE_STALE to ignore-must-revalidate which is specific to only one or two (out of those three) directives. We should not do that, of course.

  1. Document that this option does not affect Cache-Control: proxy-revalidate, presumably someone who has turned on that option knows what they are doing (and if they don't, a separate option probably makes sense)

Most likely, code changes are also needed to make that proposed documentation accurate.

  1. Allow the option to apply to cached information from before the option was enabled, to match the original behavior (to minimize unexpected changes, as mentioned in your other comment)
  2. Document this behavior

OK, assuming you can actually make that work correctly despite the complications identified in this change request.
Needless to say, applying the old option logic to the current PR is allowed only when there are no known bugs in that logic and in its proposed implementation. One cannot justify adding bugs because the old implementation was similarly buggy...

so that it is clear that the option is about revalidation of cached objects, not really the original caching

I do not recommend thinking about this option this way when the code changes affect "original caching" as well (due to authenticated responses changes).

Overall, please be very careful when you add a specific claim to the documentation -- folks will apply that claim to their use case, not yours. Focus on documenting the exact effects (rather than the general intent).

Proposed documentation: This means that previously cached content will be reused instead of checking whether the server has updated content

For example, the above claim is overreaching: Cached content may still not be reused (for many reasons), and Squid may still check with the server (for many reasons), even when an ignore-must-revalidate rule matches the response.

Note: The authenticated transaction part is not relevant in my use case, but I figured in re-introducing this option it would be best to have it behave the same way it did before.

I think you did the right thing there (in principle): Your specific use case is important, but the code/documentation changes should be correct in all use cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: this is a generic check that the object is stale and therefor needs revalidating. If it gets removed the objects will likely go through a full re-fetch which is counter to intentions.
Its relationship to must-revalidate is through how the revalidate requirement implies immediate staleness. If you can identify the code converting response Cache-Control:must-revalidate into the staleness=0 value that would likely be a better place to put the override check.

I would prefer splitting this complicated if-condition up so the effects of the relevant cache-control values can be clearly seen and this ignore/override can be done to them individually. Though that may require expanding out the ENTRY_REVALIDATE_STALE flag meanings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alex: Sorry, I should have said s-maxage not max-age...

Hm... I wrote that comment a few days ago, but only received an email notification about it now, after adding a couple of fresh comments to other change requests. It looks like my comment was stuck for all those days (in a pending state?). GitHub still shows the correct (from my point of view) comment order in this change request, but my comment has a much later timestamp than the subsequent Amos' comment. Anyway, I think that stuck/resurfaced comment is still valid. Sorry about the mess!

headers received from a server. Doing this VIOLATES
the HTTP standard. Enabling this feature could make you
liable for problems which it causes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After addressing other change requests, please explicitly document whether the option applies to Cache-Control: proxy-revalidate responses.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, please restrict the options behaviour to only ignoring must-revalidate. Leaving other controls being obeyed correctly until we have an explicit case appear needing them to be violated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support Amos suggestion about restricting this option to must-revalidate (if that can be correctly implemented without heroic efforts!), but I think this particular change request does not depend on that decision -- we should explicitly document whether this option applies to proxy-revalidate regardless of whether it applies, especially because it did apply before it was removed in 064679e.

@@ -6445,6 +6445,7 @@ DOC_START
reload-into-ims
ignore-reload
ignore-no-store
ignore-must-revalidate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens to folks that had ignore-must-revalidate in their old Squid configurations when they update to this new Squid code without going through any Squid version that does not have ignore-must-revalidate? Will there be any changes in relevant Squid behavior (while their configuration remains unchanged)? If yes, we should disclose those changes, either in the PR description or Squid release notes. Personally, I prefer the latter.

@rousskov
Copy link
Contributor

rousskov commented Dec 2, 2022

I'm not sure what the M-failed-description tag means, presumably some sort of formatting issue?

Yes, probably. Please see Anubis docs about pull request labels (linked from each Anubis label but GitHub makes it difficult to see/use those links). I would not worry about that formatting for now, but that is your call. FWIW, the PR title/description will become the official commit message when your PR merges; any decent editor will let you quickly reformat PR description text to follow typical Git commit message guidelines (documented and enforced by Anubis).

Copy link
Contributor

@yadij yadij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alex has pointed out most of the major issues already. I've just added more to the pile.

The "ignore-must-revalidate" option. The option was not actually broken, but it is not about
the storage of objects - it is about whether they need to be retrieved again for new requests.

I think you have misunderstood the original removal. Unless the Cache-Control:no-store is present an object can always be stored. The option was removed because the useful part of its previous implementation are now the default behaviour of HTTP/1.1 Squid and all that remained was the security bugs and broken servers.

For example, I encountered a university download service that has set both "Expires: 0" and "Cache-Control: must-revalidate" for unchanging content.

[FWIW; have you attempted to contact that one broken service and inform them of the problem? it is often more useful to get the few broken servers fixed than to open the whole Internet up to security vulnerabilities just to let them stay broken.]

The refresh_pattern "override-expires" option will allow you to store a cache of this content,

No. The mere existence of HTTP/1.1 Cache-Control header makes the HTTP/1.0 Expires header be ignored by default.

So the Cache-Control:must-revalidate alone is determining what happens. Not your refresh_pattern.

but due to the "Cache-Control: must-revalidate" header the proxy server will always re-fetch

What must-revalidate does is tell Squid the object should be considered stale when it arrived and that a short/fast revalidate is all that is needed to re-use it. Proxy will always fetch something for stale content, period.

The real problem is that example server responding to revalidate requests by delivering the whole object in full even when it has not changed.

Comment on lines +407 to 408
} else if (rep->cache_control->hasNoCacheWithoutParameters() && !REFRESH_OVERRIDE(ignore_must_revalidate)) {
debugs(22, 3, "Authenticated but server reply Cache-Control:no-cache (equivalent to must-revalidate)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is exclusively about the behaviour of no-cache response control. Your setting should not have any impact on this no-cache side of the behaviour.

Suggested change
} else if (rep->cache_control->hasNoCacheWithoutParameters() && !REFRESH_OVERRIDE(ignore_must_revalidate)) {
debugs(22, 3, "Authenticated but server reply Cache-Control:no-cache (equivalent to must-revalidate)");
} else if (rep->cache_control->hasNoCacheWithoutParameters())) {
debugs(22, 3, "Authenticated but server reply Cache-Control:no-cache");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Amos. Please note that, unlike old 064679e code, in modern code, Cache-Control:no-cache sets ENTRY_REVALIDATE_ALWAYS (rather than ENTRY_REVALIDATE_STALE that this PR is working on).

Before ignore-must-revalidate was removed in 064679e, the two flags were the same -- ENTRY_REVALIDATE_ALWAYS did not exist back then -- so this if statement had no choice but treat ignore-must-revalidate specially. We split ENTRY_REVALIDATE into two flags in fa83b76. Now, this if statement does not need to worry about ignore-must-revalidate.

if (ccNoCacheNoParams || ccPrivate)
    EBIT_SET(entry->flags, ENTRY_REVALIDATE_ALWAYS);
else if (ccMustRevalidate || ccSMaxAge)
    EBIT_SET(entry->flags, ENTRY_REVALIDATE_STALE);

headers received from a server. Doing this VIOLATES
the HTTP standard. Enabling this feature could make you
liable for problems which it causes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better yet, please restrict the options behaviour to only ignoring must-revalidate. Leaving other controls being obeyed correctly until we have an explicit case appear needing them to be violated.

@@ -336,6 +336,9 @@ refreshCheck(const StoreEntry * entry, HttpRequest * request, time_t delta)
*/
const bool revalidateAlways = EBIT_TEST(entry->flags, ENTRY_REVALIDATE_ALWAYS);
if (revalidateAlways || (staleness > -1 &&
#if USE_HTTP_VIOLATIONS
!R->flags.ignore_must_revalidate &&
#endif
EBIT_TEST(entry->flags, ENTRY_REVALIDATE_STALE))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: this is a generic check that the object is stale and therefor needs revalidating. If it gets removed the objects will likely go through a full re-fetch which is counter to intentions.
Its relationship to must-revalidate is through how the revalidate requirement implies immediate staleness. If you can identify the code converting response Cache-Control:must-revalidate into the staleness=0 value that would likely be a better place to put the override check.

I would prefer splitting this complicated if-condition up so the effects of the relevant cache-control values can be clearly seen and this ignore/override can be done to them individually. Though that may require expanding out the ENTRY_REVALIDATE_STALE flag meanings.

@rousskov rousskov added the S-waiting-for-author author action is expected (and usually required) label Dec 4, 2022
@compholio
Copy link
Author

Alex has pointed out most of the major issues already. I've just added more to the pile.

The "ignore-must-revalidate" option. The option was not actually broken, but it is not about
the storage of objects - it is about whether they need to be retrieved again for new requests.

I think you have misunderstood the original removal. Unless the Cache-Control:no-store is present an object can always be stored. The option was removed because the useful part of its previous implementation are now the default behaviour of HTTP/1.1 Squid and all that remained was the security bugs and broken servers.

I was attempting to have the feature behave the same way it did before, but it sounds like it would do so even without the !REFRESH_OVERRIDE(ignore_must_revalidate) bit.

For example, I encountered a university download service that has set both "Expires: 0" and "Cache-Control: must-revalidate" for unchanging content.

[FWIW; have you attempted to contact that one broken service and inform them of the problem? it is often more useful to get the few broken servers fixed than to open the whole Internet up to security vulnerabilities just to let them stay broken.]

The "contact webmaster" link is broken, so we'll have to do a little digging to figure out who to talk to. We meet with the science group associated with the site on occasion, so they may be able to point us in the right direction. Is there a way that broken sites can be addressed that doesn't cause you concern about security vulnerabilities? Would moving the location of the fix to wherever Cache-Control:must-revalidate sets staleness=0 address your concern (your other comment)?

The refresh_pattern "override-expires" option will allow you to store a cache of this content,

No. The mere existence of HTTP/1.1 Cache-Control header makes the HTTP/1.0 Expires header be ignored by default.

So the Cache-Control:must-revalidate alone is determining what happens. Not your refresh_pattern.

I apologize, the store-stale option was also set and I misunderstood which option was responsible for caching of the file. With just the store-stale option set the server caches the file, but the proxy server always returns TCP_REFRESH_MODIFIED for additional requests without changing the behavior of the revalidation.

but due to the "Cache-Control: must-revalidate" header the proxy server will always re-fetch

What must-revalidate does is tell Squid the object should be considered stale when it arrived and that a short/fast revalidate is all that is needed to re-use it. Proxy will always fetch something for stale content, period.

The real problem is that example server responding to revalidate requests by delivering the whole object in full even when it has not changed.

Yes, it would be great if the server did "the right thing" such that this was not necessary.

@squid-anubis squid-anubis added M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels labels Jan 9, 2024
@kinkie kinkie added the M-ignored-by-merge-bots https://github.com/measurement-factory/anubis/blob/master/README.md#pull-request-labels label Feb 14, 2024
Copy link
Contributor

@yadij yadij left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is predicated on a misunderstanding about how Expires operates.

The HTTP Specification requires that when a Cache-Control header is provide the Expires header is to be ignored. It does not matter whether your configuration says that value X is to be used instead of Expires: Y when that header is being ignored.

Please be aware that what refresh_pattern ... override-expires ignore-must-revaliate means when applied to the semantics of

 Expires: 0
 Cache-Control: must-revalidate

is that Squid should operate as if it had received an empty Cache-Control header. It does not mean use the Expires value.

@yadij
Copy link
Contributor

yadij commented May 25, 2024

@compholio are you intending to continue with this PR? if not I think we should close the PR to help clear up the review queue.

@compholio
Copy link
Author

It sounds like there's not a way to make folks happy, so I'll say it's probably best to drop the request.

@compholio compholio closed this May 28, 2024
@rousskov rousskov removed S-waiting-for-author author action is expected (and usually required) M-ignored-by-merge-bots https://github.com/measurement-factory/anubis/blob/master/README.md#pull-request-labels M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels labels May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants