ACE URL List Processing Issues #480

Closed
ronnieg303 opened this issue May 11, 2015 · 6 comments

@ronnieg303

Using ZC Pro: Version 150409.

On my very large site, I have cleared/blanked the ZenCache Pro ACE parameter for an XML Sitemap URL because I don't want ACE to automatically process all of the urls in the sitemap. I only want it to auto-cache the specific pages in the urls list, which contains about 1500 urls, formatted like:
/
/user/
/user/copyright-notice/
/user/privacy-policy/
/user/sitemap/
... etc.
Please confirm that this is the proper format, and that http://www.domainname.com is not also required for each url. If it is required, then that should be in the parameter instructions.
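
The thread does not confirm whether the Other URLs box accepts relative paths. As a rough illustration only (in Python, not the plugin's PHP), this is how a one-path-per-line list could be resolved against a site base URL. The names `SITE_BASE` and `resolve_paths` are hypothetical and are not part of ZenCache.

```python
from urllib.parse import urljoin

# Placeholder host taken from the post; an assumption, not a real setting.
SITE_BASE = "http://www.domainname.com"


def resolve_paths(raw_list, base=SITE_BASE):
    """Turn one-path-per-line text into absolute URLs, skipping blank lines."""
    urls = []
    for line in raw_list.splitlines():
        path = line.strip()
        if not path:
            continue
        # urljoin keeps the scheme and host of `base` and swaps in the path.
        urls.append(urljoin(base + "/", path))
    return urls


other_urls = """/
/user/
/user/copyright-notice/
"""
print(resolve_paths(other_urls))
```

If the plugin instead expects full URLs, the same list would need the scheme and host written out on each line.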

My assumption with this approach is that ACE should auto-cache only the files I specified, and in the order I specified them, which implements a priority scheme by which urls I want auto-cached right away are at the top of the list, then other lower priority urls following. And if a user accesses a url not on the ACE urls list, it will be cached as usual, but not by ACE.

I also have a 5000ms http request delay configured, which I have always assumed is one request per page url, and not a delay between requests for each resource (js, css, etc) that might be used by a page url. Some of the pages in my urls list can take 3-4 seconds for initial render and cache. This delay should equate to about 20 urls per minute, and 300 urls in a 15 minute ACE cycle.

However, what I am seeing is this:

  • ACE is not processing the urls list in the order I specified. The order appears to be random. This may be a result of user activity and/or SE crawlers, but impossible to tell.
  • Pages not on the urls list appear to be auto-cached. Not many, but a few. These few could be a result of user access or SE crawler access.
  • In the cache file html notes generated by ZC, I do not see any note that indicates caching was done as a result of ACE initiated processing. That would be nice to have.
  • In 40 minutes, ACE has only cached about 100 urls vs the 600+ it should have been able to process with the delay I specified.
  • For some urls, the plugin php that generates the page has: define('DONOTCACHEPAGE', TRUE). However, some of these urls are still being cached when they are in the ACE urls list. If the ACE urls list overrides this parameter, that's not a bad thing, and is probably what I really want anyway. But still curious as to why.
  • One concern is that since I blanked out the XML url parameter, is that causing ACE to ignore the URLs list? If so, then that would be a bug. Do I need to put a dummy url in the XML sitemaps parm to get around that? I created an empty nofile.xml file and put its url in the XML sitemap file parm, and that didn't seem to change anything.
@raamdev
Contributor

raamdev commented May 12, 2015

> I also have a 5000ms http request delay configured, which I have always assumed is one request per page url, and not a delay between requests for each resource (js, css, etc) that might be used by a page url. Some of the pages in my urls list can take 3-4 seconds for initial render and cache. This delay should equate to about 20 urls per minute, and 300 urls in a 15 minute ACE cycle.

The delay is between requests to cache a URL. A 5 second (5000ms) delay means the Auto-Cache Engine will cache a URL, then wait 5 seconds, then cache another URL, etc. That means you could cache (at most) 12 URLs per minute with a 5000ms delay, and likely far fewer once you add the time it takes to cache each request (if some pages are taking 3-4 seconds to cache, you might be getting 4 or 5 URLs cached each minute with a 5000ms delay).
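
The arithmetic above can be sketched as a small model (in Python, not the plugin's PHP). It assumes each URL costs its load time plus the configured delay and ignores any other overhead, so real throughput may be lower. `urls_per_minute` is a hypothetical helper, not a ZenCache function.

```python
def urls_per_minute(delay_ms, avg_load_s=0.0):
    """URLs processed per minute when each URL costs load time plus delay."""
    seconds_per_url = delay_ms / 1000.0 + avg_load_s
    return 60.0 / seconds_per_url


# Upper bound: pages load instantly, only the 5000ms delay counts.
print(urls_per_minute(5000))

# Same delay, with an assumed 3.5 second render per page.
print(urls_per_minute(5000, 3.5))
```

The first call gives the 12 URLs per minute ceiling. The second shows how page load time pushes the rate well below it.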

> ACE is not processing the urls list in the order I specified. The order appears to be random. This may be a result of user activity and/or SE crawlers, but impossible to tell.

Looking at the code, I see a line that shuffles the list: shuffle($_blog_urls); // Randomize the order. So it would appear that the order listed is not taken into account but is randomized. I'm not seeing any good reason for this. It makes sense to me that the order of the URLs listed in the Other URLs box should be taken into account. @jaswsinc can you shed any light here?

> Pages not on the urls list appear to be auto-cached. Not many, but a few. These few could be a result of user access or SE crawler access.

Those are likely a result of other users accessing the page or SE crawlers.

> In the cache file html notes generated by ZC, I do not see any note that indicates caching was done as a result of ACE initiated processing. That would be nice to have.

Yes, we have a GitHub Issue open for this here: #292

> In 40 minutes, ACE has only cached about 100 urls vs the 600+ it should have been able to process with the delay I specified.

No, at most the Auto-Cache Engine would be able to cache 480 URLs in 40 minutes (40*60/5), but that's not even taking into account the amount of time it takes to load each URL. Around 100 URLs in 40 minutes sounds right to me if you have a 5000ms delay. If you want to cache URLs faster, you need to lower the delay, which will require more system resources.
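
To see how far the delay would have to drop for a given target, the same reasoning can be inverted. This is a sketch in Python (not the plugin's PHP) that assumes the engine runs back to back for the whole window with a constant average load time. `max_delay_ms` is a hypothetical helper.

```python
def max_delay_ms(url_count, window_minutes, avg_load_s):
    """Largest per-request delay (ms) that still fits url_count URLs in the window.

    Returns 0.0 when the load time alone already uses up the time budget.
    """
    budget_per_url_s = window_minutes * 60.0 / url_count
    return max(0.0, (budget_per_url_s - avg_load_s) * 1000.0)


# 480 URLs in 40 minutes is only reachable with instant page loads.
print(max_delay_ms(480, 40, 0.0))

# 600 URLs in 40 minutes with an assumed 3.5 second load per page.
print(max_delay_ms(600, 40, 3.5))
```

Under these assumptions, 600 URLs in 40 minutes leaves 4 seconds per URL, so with 3.5 second page loads the delay could be at most 500ms.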

> For some urls, the plugin php that generates the page has: define('DONOTCACHEPAGE', TRUE). However, some of these urls are still being cached when they are in the ACE urls list. If the ACE urls list overrides this parameter, that's not a bad thing, and is probably what I really want anyway. But still curious as to why.

Hmm, that shouldn't be happening. The Auto-Cache Engine essentially just "visits" each URL, as if it were a normal visitor visiting the URL, which results in ZenCache caching the page (if necessary) as it would if a normal visitor visited an uncached page. So if DONOTCACHEPAGE is working when a regular user visits the page, there's no reason it shouldn't work when the Auto-Cache Engine visits the page.
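
The behavior described here can be pictured with a toy model (in Python, not the plugin's PHP). It is not ZenCache's implementation: `PageCache` and its methods are hypothetical, and the `no_cache_paths` set merely stands in for pages whose PHP defines DONOTCACHEPAGE.

```python
class PageCache:
    """Toy model: a visit to an uncached page stores it, unless the page opts out."""

    def __init__(self, no_cache_paths):
        self.no_cache_paths = set(no_cache_paths)
        self.stored = {}

    def visit(self, path, visitor):
        # `visitor` is deliberately unused: the outcome does not depend on
        # whether a person or an automated crawler made the request.
        if path in self.stored:
            return "hit"
        if path in self.no_cache_paths:
            return "skipped"
        self.stored[path] = "<html>...</html>"
        return "stored"


cache = PageCache(no_cache_paths=["/user/private/"])
print(cache.visit("/user/", "auto-cache"))          # stored
print(cache.visit("/user/", "human"))               # hit
print(cache.visit("/user/private/", "auto-cache"))  # skipped
print(cache.visit("/user/private/", "human"))       # skipped
```

In this model the opt-out flag wins regardless of who visits, which is why a page cached despite DONOTCACHEPAGE points at how the flag is set, not at the Auto-Cache Engine.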

How are you implementing DONOTCACHEPAGE?

> One concern is that since I blanked out the XML url parameter, is that causing ACE to ignore the URLs list? If so, then that would be a bug. Do I need to put a dummy url in the XML sitemaps parm to get around that? I created an empty nofile.xml file and put its url in the XML sitemap file parm, and that didn't seem to change anything.

The Auto-Cache Engine should run and cache the URLs listed in the Other URLs box, regardless of whether or not there's a Sitemap URL.

@jaswrks

jaswrks commented May 12, 2015

> so it would appear that the order listed is not taken into account but is randomized. I'm not seeing any good reason for this. It makes sense to me that the order of the URLs listed in the Other URLs box should be taken into account. @jaswsinc can you shed any light here?

Correct, it is randomized. I believe there is an issue open about using Sitemap priority (#443), but for now, URLs are randomized in order to prevent a top-to-bottom approach in the stateless crawl process.

Consider a site with 5K, 10K, or 100K pages. The crawler runs for only a few minutes at a time (based on configuration). Since there is no state-tracking in the current release, going from top-to-bottom (or in any specific order) could result in some pages never being cached, as the crawler would always start from the top. Even if those pages at the top have already been cached (which we detect), it still takes time to check each of them in a specific order. Thus, we randomize the order to avoid this.
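
The trade-off can be illustrated with a small simulation (in Python, not the plugin's PHP). It assumes each run can check a fixed number of URLs and that checking an already-cached URL still uses up a slot. The names, the 1000-page site, and the 100-URL budget are all made up for the example.

```python
import random


def crawl_run(urls, cached, budget, rng=None):
    """One stateless run: check up to `budget` URLs, caching any not yet cached."""
    order = list(urls)
    if rng is not None:
        rng.shuffle(order)  # randomized order, as the current release does
    for url in order[:budget]:
        cached.add(url)  # an already-cached URL still costs one slot


def coverage_after(runs, shuffle, total=1000, budget=100, seed=1):
    """Number of distinct URLs cached after `runs` stateless runs."""
    rng = random.Random(seed) if shuffle else None
    urls = ["/page-%d/" % i for i in range(total)]
    cached = set()
    for _ in range(runs):
        crawl_run(urls, cached, budget, rng)
    return len(cached)


print(coverage_after(20, shuffle=False))
print(coverage_after(20, shuffle=True))
```

Without shuffling, every run rechecks the same first 100 URLs, so coverage never grows past 100 of 1000. With shuffling, coverage keeps growing from run to run, and in this model roughly 88% of pages are expected to be covered after 20 runs.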

A solution (it could be a part of the work in #443) would be to add state-tracking; i.e., for the ACE to record where it left off. Ideally, this could be coupled with priorities being read from the Sitemap also.

@ronnieg303
Author

I understand randomizing for the xml sitemap urls. That could make sense.

However, in the case of the other urls list, I think most site owners would have some method and reason for creating that list in the first place, and would, like I did, assume that list would a) take priority over xml sitemap files, and b) be processed in the order listed. I can see where some urls might be in both lists, in which case the site owner wants those urls indexed first, possibly due to very high access rates for those pages. In my case, my site is currently over 7,000 pages, and I keep a separate list for several reasons, including: 1) some pages have very low access rates and already acceptable/quick load times, so they don't need to be pre-cached, and 2) others have very high access rates (home page and some others) and/or get SE crawled daily, so I need to get them indexed quickly after a purge so that Google and other SE crawlers always see fast page load times, which is good for SEO.

My initial thought on sequential processing of the other urls list would be to load them into a dedicated MySQL table that is dropped and re-created whenever ZC options are saved. That table would then be accessed in index (row creation) order by ACE as it starts each pre-cache cycle, picking up wherever it left off last time. Unlike an XML sitemap, which is or could be very dynamic as posts are created, the other urls list should be relatively stable, and only updated manually by the site owner when critical pages are created that need to be added to that list.
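
A minimal sketch of that idea follows, written in Python with SQLite standing in for the MySQL table proposed above. The table names `ace_other_urls` and `ace_cursor` and both functions are hypothetical. Nothing here reflects how ZenCache stores its options.

```python
import sqlite3


def save_options(db, url_list):
    """Drop and re-create the queue whenever options are saved; reset the cursor."""
    db.execute("DROP TABLE IF EXISTS ace_other_urls")
    db.execute("DROP TABLE IF EXISTS ace_cursor")
    db.execute(
        "CREATE TABLE ace_other_urls "
        "(id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT NOT NULL)"
    )
    db.execute("CREATE TABLE ace_cursor (last_id INTEGER NOT NULL)")
    db.executemany(
        "INSERT INTO ace_other_urls (url) VALUES (?)", [(u,) for u in url_list]
    )
    db.execute("INSERT INTO ace_cursor (last_id) VALUES (0)")
    db.commit()


def next_batch(db, batch_size):
    """Return the next URLs in row-creation order, resuming after the last run."""
    (last_id,) = db.execute("SELECT last_id FROM ace_cursor").fetchone()
    rows = db.execute(
        "SELECT id, url FROM ace_other_urls WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    last_seen = rows[-1][0] if rows else 0
    remaining = db.execute(
        "SELECT 1 FROM ace_other_urls WHERE id > ? LIMIT 1", (last_seen,)
    ).fetchone()
    # Wrap back to the top once the end of the list has been reached.
    db.execute("UPDATE ace_cursor SET last_id = ?", (last_seen if remaining else 0,))
    db.commit()
    return [url for _, url in rows]


db = sqlite3.connect(":memory:")
save_options(db, ["/", "/user/", "/user/sitemap/"])
print(next_batch(db, 2))  # first cycle
print(next_batch(db, 2))  # second cycle picks up where the first stopped
```

Storing the cursor next to the list keeps each pre-cache cycle short while still honoring the listed order, at the cost of one extra write per cycle.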

@ronnieg303
Author

Re: How are you implementing DONOTCACHEPAGE?

Is that a question for me or ZC developer? I just added a line:
define('DONOTCACHEPAGE', TRUE);
to the one php script that creates those particular pages. I assumed that was a more or less global process, not specific to ZC, already in place, and/or that ZC would recognize it and not cache the page. That seems to be what the instructions in the ZC options say would happen.

After 24 hours, the 1500 urls in my other urls list were still not all pre-cached. I just reduced the http delay to 2500ms, which I hope should get all of them indexed within 24 hrs of a full cache purge. I only do that when I make significant plugin programming changes and/or theme changes, which I have been doing on quite a few sites lately to get them to be fully Mobile friendly.

@raamdev
Contributor

raamdev commented May 12, 2015

> However, in the case of the other urls list, I think most site owners would have some method and reason for creating that list in the first place, and would, like I did, assume that list would a) take priority over xml sitemap files, and b) be processed in the order listed

I agree. You already opened a feature request for this (#481) so we can move further discussion on that topic to that GitHub Issue.

> Re: How are you implementing DONOTCACHEPAGE?
>
> Is that a question for me or ZC developer? I just added a line:
> define('DONOTCACHEPAGE', TRUE);
> to the one php script that creates those particular pages. I assumed that was a more or less global process, not specific to ZC, already in place, and/or that ZC would recognize it and not cache the page. That seems to be what the instructions in the ZC options say would happen.

If you are adding that line inside a theme file that generates the pages, you should be fine. Some non-developers can use DONOTCACHEPAGE incorrectly, hence my asking how you were using it. It sounds like you're using it correctly.

I'm not aware of any issue related to the Auto-Cache Engine and DONOTCACHEPAGE.

@raamdev
Contributor

raamdev commented Jul 27, 2015

I'm closing this issue, as the outstanding issues discussed here have been filed in #443 and #481.

@raamdev raamdev closed this as completed Jul 27, 2015