Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to cache GET (Query String) requests, but ignore the parameter value possible? #639

Closed
lkraav opened this issue Dec 23, 2015 · 31 comments

Comments

@lkraav
Copy link

lkraav commented Dec 23, 2015

I would like to cache AdWords campaign responses which will receive tons of people with ?gclid=<id> requests where <id> is an endless amount of random hashes. It makes no sense to cache these, because essentially they're one time use. So I would need to return a response that's general for just gclid= parameter, regardless of the exact value.

http://zencache.com/kb/ doesn't seem to indicate it's possible at the moment. Am I missing anything?

@raamdev
Copy link
Contributor

raamdev commented Dec 23, 2015

@lkraav Requests that include a query variable (e.g., ?gclid=) are almost always meant to be dynamic in some way, which is why ZenCache does not cache GET requests by default.

You could enable GET Request caching (ZenCache → Plugin Options → GET Requests) and tell ZenCache to cache those pages anyway, however if you have a lot of variations of the same URL (e.g., ?gclid=, ?gclid=2, ?gclid=3, etc.), that could result in many cache files, and if each request is only meant for one person it's probably not necessary to cache each page.

ZenCache has no way of knowing what a "default value" is for a page with a query string... i.e., there's no way to tell ZenCache "if the URL contains ?gclid=XXX, just cache the page as if it didn't contain the query string at all". Doing so would require that ZenCache make a second request to the page (a request without ?gclid=) and then cache that result, which is just outside the scope of what ZenCache is designed to do.

@lkraav
Copy link
Author

lkraav commented Dec 23, 2015

..., however if you have a lot of variations of the same URL (e.g., ?gclid=, ?gclid=2, ?gclid=3, etc.), that could result in many cache files, and if each request is only meant for one person it's probably not necessary to cache each page.

Correct, this is exactly the problem. Caching even hurts performance here (probably near-negligible but still), because the page will always be built from scratch, and additionally the useless cache output will be built.

ZenCache has no way of knowing what a "default value" is for a page with a query string... i.e., there's no way to tell ZenCache "if the URL contains ?gclid=XXX, just cache the page as if it didn't contain the query string at all". Doing so would require that ZenCache make a second request to the page (a request without ?gclid=) and then cache that result, which is just outside the scope of what ZenCache is designed to do.

I'm thinking you're overly pessimistic here. Certainly a configurable query parameter parsing algorithm can be devised that will arrive at the correct conclusion about which cache file to build or serve. CloudFlare for example has the "ignore query string" caching feature built in https://support.cloudflare.com/hc/en-us/articles/200168256 even though, yes, by their own omission they get their cache result from origin by stripping the query parameter. But there's no reason why this can't be replicated locally.

If the decision machine would have a configurable list of query parameters to ignore, it could actually strip those from the incoming request when making the final routing decision about the cache file. This actually enables other query parameters to still stay around and cause a non-cached reply, if needed. But if let's say gclid is the only one, which it often is, then query would become parameter free and normal page cache file would be served.

AdWords and other similar systems are in massive use. You can surely imagine the wasted global resource here. Any arguments against the algorithm proposed above?

EDIT I have one argument: another textarea in the seemingly endless river of Configuration textareas :) But perhaps this is a small price to pay for gained global savings?

@raamdev
Copy link
Contributor

raamdev commented Dec 23, 2015

But if let's say gclid is the only one, which it often is, then query would become parameter free and normal page cache file would be served.

That sounds doable, yes. A new textarea inside the GET Requests panel that allows you to list query parameters that ZenCache should consider "safe to cache", and if ZenCache receives a request from a URL that contains only that query parameter, it could silently drop it and cache the page as if the parameter wasn't present.

It does seem like a rather odd use-case, and prone to accidentally creating a static cache of a page that should have been dynamic (if the site owner does not properly configure the exclusion), but from a ZenCache perspective it doesn't seem that difficult to do. Again, my feeling is if you have a query string present, you probably want to use it. If you don't want to use it, then why not simply exclude the query string when linking to the page?

@jaswsinc any feedback here?

@lkraav
Copy link
Author

lkraav commented Dec 23, 2015

...if ZenCache receives a request from a URL that contains only that query parameter, it could silently drop it and cache the page as if the parameter wasn't present.

I would propose that the algorithm should not consider the subject caching strategy when it's the only query parameter. There seems to be no reason why there couldn't be an arbitrary number of query parameters in an incoming request working out with this enhanced decision strategy. ZenCache would

  1. examine the full URI
  2. remove all matching parameter names from plugin configuration textarea (one per line)
  3. what is left over (including a parameter-free scenario) is sent to caching engine to work exactly as it is working now

If you don't want to use it, then why not simply exclude the query string when linking to the page?

For this example, gclid is used by Google Analytics implementations (usually JS-based) to link paid traffic to general analytics hits. So this value needs to be visible for the tools running on the page, but page contents actually doesn't need to change at all. It's just critical that this data piece gets correctly forwarded to external tools.

@jaswrks
Copy link

jaswrks commented Dec 29, 2015

@lkraav

gclid is used by Google Analytics implementations (usually JS-based)

I agree that it would be nice if ZC could be configured to strip specific query string variables from it's caching algorithm. Your example, Google Analytics, is a use case where gclid and these utm_ variables are probably safe to ignore in almost every case. These are used to assist in tracking campaign performance and referrals, so they need to be there, but should not (in most cases) have an impact on the cache engine (ideally).

@raamdev writes...

any feedback here?

I give this a 👍. I think it's a good idea to allow for a configuration where you could define query string variables to exclude. I think that even warrants a new UI config panel in ZenCache myself.


Aside:

In addition to this, I recently observed that CloudFlare has an algorithm that can sort the query string variables by key, ordering them in the same way each time, before they calculate the cache location for that page. This is something we could consider doing also.

Example query string:

?a=123&c=456&b=993&_=hello+world

... is always sorted, and then represented internally as:

?_=hello+world&a=123&b=993&c=456

So that if you happen to run ZenCache w/ GET requests enabled (not recommended), but if you do, ZC could be smarter about how it determines the cache location. Sorting the variables removes most of the permutations that would need to be considered otherwise.

@jaswrks
Copy link

jaswrks commented Dec 29, 2015

Short-Term Solution

A decent short-term solution is to add the &zcAC=1 variable to the URL.

So if you have a URL which contains query string variables, and you still want it cached, add that onto the end of the URL and it tells ZenCache that it's OK to do so.

http://example.com/?gclid=xxxxx&zcAC=1

@bridgeport
Copy link

I was thinking over this suggestion, and believe there's a flaw to the approach of designating certain query parameters to be cached as well as @jaswsinc zcAC=1 suggestion. And overcomplicates it.

I don't think the optimum solution is allowing individual query string url variations to be cached as separate files. Instead, I think you should serve the non-GET/non-query string version of the cached file that's on disk.

As it stands with the current set of suggestions, if a person visited:

http://example.com

ZenCache will create static cache file.

/wp-content/cache/zencache/cache/http/exampe-com/index.html

But if a person visited the following:

http://example.com/?zcAC=1
http://example.com/?zcAC=2
http://example.com/?zcAC=3

ZenCache would create three more cache files here:

/wp-content/cache/zencache/cache/http/exampe-com/index.q/

But that's only workable if those query string URL's are common to many distinct visitors. If many people are hitting those same URLs. For instance, it would be fine for something like:

http://example.com/?product_id=100
http://example.com/?product_id=200
http://example.com/?product_id=300

You'd want those to be cached, as many people will be hitting those exact same URL's.

But with gclid and the other utm_* parameters, virtually every URL will be unique. Every visitor would get their own cache page, which will have to be generated from scratch. That cache file will never be revisited by anyone else, as it's unique to that visitor. So, no one actually experiences the benefit of the cache delivery. It's may even be suboptimal.

The difference between the two scenarios is the "product_id" approach is controlled by the website owner/admin, and is predictable. The gclid and utm_* values are generated by the ad serving networks, and therefore non-predictable and outside the control of the site administrator.

For instance, here's how google adwords describes "utm_term":

Used for paid search. Use utm_term to note the keywords for this ad.
Example: utm_term=running+shoes

Allowing such pages to be cached could result in an explosion of one-off cache files all containing the exact same HTML content.

So, I propose pushing it further. For certain designated query string parameters, do not create a index-q version of the file, but just simply serve the main static file version.

So, a visit to any of these pages:

http://example.com
http://example.com/?zcAC=1
http://example.com/?zcAC=2
http://example.com/?zcAC=3
http://example.com/?gclid=1abc
http://example.com/?gclid=2def
http://example.com/?utm_term=red+shoes
http://example.com/?utm_term=red+shoes+cheap
http://example.com/?utm_term=red+shoes+cheap&abc=123&random=123&nonsense=q#ooo

Would only serve this file:

/wp-content/cache/zencache/cache/http/exampe-com/index.html

On top of that, it would be great if the "filter" could be an "any" filter. So, if I specify to serve the main cache file version if a query string contains "any" of the following:

utm_term, gclid, utm_*

The single, primary cache file will be served, even if the url contains a bunch of other nonsense parameters. Because, who knows what extra stuff can end being tacked on.

Or think of it this way: If "any" of the specified query strings exist in the URL, strip the entire query string (including hashes/anchors) and just serve the non-query string version.

Because, if I'm running an ad campaign, the tracking of those query strings are being done asynchronously, via JavaScript. And the single, main cached file version would suffice for any and all URLs, query string, or no query string.

I think this would be the majority use case, but if there are those that actually need separate index.q versions of the cache files, then the zcAC=1 would work, or a separate setting within the admin area to essentially accompany zcAC=1 as part of a list of parameters to separately cache for.

Thanks.

@lkraav
Copy link
Author

lkraav commented Dec 29, 2015

You're correct @jaswsinc, utm-parameters are the perfect second example which I forgot. I knew there was something I left out :)

Either way, everything written above (although some concepts are repeated by now) is correct. I like the alphabetic ordering idea - that's indeed the optimal way to go about it.

@raamdev
Copy link
Contributor

raamdev commented Dec 29, 2015

@bridgeport writes...

I propose pushing it further. For certain designated query string parameters, do not create a index-q version of the file, but just simply serve the main static file version.

Right. That was my understanding about how this would work: You would have "Query Vars To Ignore" section inside the GET Requests panel that would allow you to define a series of query variables that should be ignored (stripped) from the request prior to caching or loading an existing cache file.

@bridgeport writes...

On top of that, it would be great if the "filter" could be an "any" filter. So, if I specify to serve the main cache file version if a query string contains "any" of the following:

utm_term, gclid, utm_*

The single, primary cache file will be served, even if the url contains a bunch of other nonsense parameters. Because, who knows what extra stuff can end being tacked on.

Or think of it this way: If "any" of the specified query strings exist in the URL, strip the entire query string (including hashes/anchors) and just serve the non-query string version.

That's an interesting idea and I agree that should be possible. We could probably achieve this by using the Watered-Down Regex Syntax (recently implemented in other areas of the plugin) inside the new "Query Vars To Ignore" panel:

For example, to remove only the gclid query var, you might enter this:

gclid*

Whereas if you want ZenCache to strip all query vars from the request whenever it finds utm_, you could add this (notice the double *):

utm_**

Putting those two together, the "Query Vars To Ignore" box might contain this:

gclid*
utm_**

@jaswsinc writes...

In addition to this, I recently observed that CloudFlare has an algorithm that can sort the query string variables by key, ordering them in the same way each time, before they calculate the cache location for that page. This is something we could consider doing also.

That sounds like a great idea!

@lkraav
Copy link
Author

lkraav commented Jun 24, 2016

How's this doing? My clients are still struggling with caching campaign pages properly.

It would be a useful incremental win if simply a query-string free cached version could be opted into serving, until the more complex dynamics described in above messages can be fully implemented.

@raamdev
Copy link
Contributor

raamdev commented Jun 24, 2016

@lkraav Thank you for the ping here. I'll see if we can work this into the next development cycle (the current cycle ends today and enters a 1 week testing phase).


@jaswsinc writes...

I give this a 👍. I think it's a good idea to allow for a configuration where you could define query string variables to exclude. I think that even warrants a new UI config panel in ZenCache myself.

I'd rather put this inside the existing GET Requests panel for now. If you could outline this one when you get a chance, that would be great! 😁

@lkraav
Copy link
Author

lkraav commented Aug 17, 2016

How are we doing on this for fall release? The number of sites where I could use this keeps increasing.

@jaswrks
Copy link

jaswrks commented Aug 19, 2016

Outline

New Option

After this line add:

'exclude_get_vars' => '', // Comma-delimited list of query string variables.

New Constant

After this line add:

if (!defined('COMET_CACHE_EXCLUDE_GET_VARS')) {
    /**
     * HTTP GET variable exclusions.
     *
     * @since 16xxxx GET variable exclusions.
     *
     * @var string A comma-delimited list of query string variable names.
     */
    define('COMET_CACHE_EXCLUDE_GET_VARS', '%%COMET_CACHE_EXCLUDE_GET_VARS%%');
}

Refactor Conditional Utility

Refactor this method so that it considers COMET_CACHE_EXCLUDE_GET_VARS

Refactor Cache Path Generation

Replace this line with code that can parse the query string variables (see: parse_str()), then sort the array, and implode before generating the MD5 hash. Or, instead of using implode(), see: http://php.net/manual/en/function.http-build-query.php

New UI Option

A new <input type="text" field should be made available in the UI so the option exclude_get_vars can be customized. Note that this option will only apply when the get_requests option is set to yes.

@raamdev
Copy link
Contributor

raamdev commented Aug 20, 2016

@jaswsinc What do you think about supporting WREGX in the query var exclusions?

@raamdev
Copy link
Contributor

raamdev commented Aug 20, 2016

@lkraav writes...

How are we doing on this for fall release? The number of sites where I could use this keeps increasing.

We'll work on getting this into the next release. :-)

@jaswrks
Copy link

jaswrks commented Aug 22, 2016

@raamdev writes...

@jaswsinc What do you think about supporting WREGX in the query var exclusions?

Sounds like a good idea to me also.

@lkraav
Copy link
Author

lkraav commented Aug 23, 2016

Thanks a ton guys. Maybe to ease the workload for you, this could be a developer-only feature to start, without a UI. This allows some testing, calibrating etc. before spending time on bolting on a full user-friendly interface.

@lkraav
Copy link
Author

lkraav commented Aug 23, 2016

PS if there would be a shared branch patch queue to work with and test, that wouldn't have any other modifications (avoid breakage), I could test this feature in a few real world environments very quickly, and possibly even provide patches based on what I see.

Not sure if the risk is appropriate piling too many "in-development" changes on at one time in live environments, hence would be more comfortable with a well contained feature branch.

@raamdev
Copy link
Contributor

raamdev commented Aug 23, 2016

@lkraav Thanks for your willingness to contribute! Once work on this issue is underway, you'll get a notification here about the new feature branch and you can follow along with the Pull Request.

@raamdev raamdev self-assigned this Aug 24, 2016
@raamdev raamdev modified the milestones: Next Release, Future Release Sep 6, 2016
@raamdev raamdev modified the milestones: Future Release, Next Release Sep 6, 2016
@lkraav
Copy link
Author

lkraav commented Sep 12, 2016

PS looks like there's more people thinking about sorting GET parameters for caching https://github.com/wandenberg/nginx-sorted-querystring-module

@raamdev
Copy link
Contributor

raamdev commented Sep 12, 2016

@lkraav Thanks for the heads up! This issue didn't quite make it into the current development cycle (which closed last week in preparation for an official release this week), but this issue will be on docket for the next release. :-)

@lkraav
Copy link
Author

lkraav commented Sep 12, 2016

This issue didn't quite make it into the current development cycle (which closed last week in preparation for an official release this week), but this issue will be on docket for the next release. :-)

Oh shoot. I just launched a site where this is really needed.

I think I'm going to poke around in the codebase and see if I can come up with a hold me over patch.

@raamdev
Copy link
Contributor

raamdev commented Sep 12, 2016

@lkraav Sorry we couldn't work this in sooner. Jason's outline above is a great place to start if you want to give this a stab on your own. If you'd like to submit a working PR to the Comet Cache Pro repo, that would help us get this worked into the next version more quickly. :-)

@lkraav
Copy link
Author

lkraav commented Sep 12, 2016

If you'd like to submit a working PR to the Comet Cache Pro repo, that would help us get this worked into the next version more quickly. :-)

Perfect

  1. can I run the development version of the plugin out of plugins/comet-cache-pro.git? Some big plugins have hardcoded directory names they're looking for assets in (bad bad bad).
  2. any specific build process to follow after clone?

@raamdev
Copy link
Contributor

raamdev commented Sep 12, 2016

@lkraav The build environment takes some time setting up. It would be easier to just work from the latest public release of Comet Cache Pro (or the latest RC release), modifying files and testing as you go.

@raamdev
Copy link
Contributor

raamdev commented Oct 6, 2016

@raamdev raamdev removed their assignment Oct 11, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 8, 2016
@jaswrks
Copy link

jaswrks commented Nov 8, 2016

Work on this issue has begun.

jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 10, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 10, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 10, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 10, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 10, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 10, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Nov 10, 2016
@jaswrks
Copy link

jaswrks commented Nov 10, 2016

Next Release Changelog:

  • New Pro Feature: In the pro version it is now possible to define a list of GET request variable names that should be ignored entirely by Comet Cache. See: Dashboard → Comet Cache → Plugin Options → GET Requests → List of GET Variable Names to Ignore. As an example, this new feature makes it possible for site owners to pass query string variables associated with Google Analytics (i.e., utm_* variable names) without incurring a cache performance hit. See also: issue #639 if you'd like additional details.
  • Performance Enhancement: For sites configured to allow query string variables into the cache, those variables are now sorted by key name internally to avoid duplicate cache files; i.e., whenever the order of query string variables changes from one request to another, but with the same exact values. In short, Comet Cache now knows how to serve the same underlying cache file; i.e., from a previous request that may have had the same query string, just in a slightly different order. See also: issue #639 if you'd like additional details.

@lkraav
Copy link
Author

lkraav commented Nov 16, 2016

Fantastic, guys 👍 Going to put this through some good testing cycles shortly.

@renzms
Copy link

renzms commented Nov 17, 2016

@raamdev

Confirmed Working

  • added a query string variable to URL, ?utm_source=test-*
  • Comet Cache Notes cache path location does not include a hash of query string.
  • Comet Cache Notes show that the cache file was built and served without query string variable -- - Visiting the same page with a different value also serves the same cache file

Set GET Variable to ignore

screen shot 2016-11-17 at 11 32 34 pm

Base Page :https://domain.net/test/

Visiting https://domain.net/test/?utm_source=test-a

screen shot 2016-11-17 at 11 51 56 pm

Visiting https://domain.net/test/?utm_source=test-b

screen shot 2016-11-17 at 11 53 13 pm

@raamdev
Copy link
Contributor

raamdev commented Nov 19, 2016

Comet Cache v161119 has been released and includes changes from this GitHub Issue. See the v161119 announcement for further details.


This issue will now be locked to further updates. If you have something to add related to this GitHub Issue, please open a new GitHub Issue and reference this one (#639).

@raamdev raamdev closed this as completed Nov 19, 2016
@wpsharks wpsharks locked and limited conversation to collaborators Nov 19, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants