Suggest .htaccess rules to prevent some erroneous cache directories #101

Closed
raamdev opened this issue Apr 18, 2014 · 18 comments

Comments

raamdev commented Apr 18, 2014

During my testing of the new branched cache structure on a live site, I found that after several days my cache had the following directories:

69-16-219-214/
ftp-raamdev-com/
raamdev-com/
RAAMDEV-COM/
raamdev-comhttp/

Some of these should not exist, notably the uppercase RAAMDEV-COM and raamdev-comhttp.

@raamdev raamdev added this to the Next Release milestone Apr 18, 2014

jaswrks commented Apr 21, 2014

@raamdev I think this behavior is correct on the part of QC. If the host changes, a separate cache should be kept for it since it's always possible that the host name would impact the content generated server-side. Even though a default WP install might do fine against its configured host name, you never know what else might be running on that server and/or via custom themes/plugins; which might alter the final output based on the host name in the request.

You mentioned to me before that you thought a possible solution might be to offer site owners an .htaccess snippet that would help them avoid some common issues associated with different host names. The code snippet below could be generated dynamically via PHP and presented to a site owner in the Dashboard with some recommendations. I think it should be optional.

This should resolve...

  • 69-16-219-214 where the site was accessed over the IP and this was allowed. The code snippet below will enforce a proper host name matching the one you intend to use.
  • ftp-raamdev-com and any other wildcarded sub-domains would be avoided too.
  • RAAMDEV-COM should be fixed with this too, because I removed the [NC] flag from the line RewriteCond %{HTTP_HOST} !^example\.com$, which makes the host comparison case-sensitive.
  • raamdev-comhttp This is IMO the most troublesome issue. This folder was nested inside of your raamdev-com directory, right? I've seen this occur whenever crawlers/bots get the relative paths wrong; i.e. they parse the content of a post that contains relative links, and instead of getting it right (e.g. raamdev.com/another-post) they incorrectly try raamdev.com/original-post/raamdev.com/another-post. The code below will not resolve this. However, since the latest dev branch of QC deals with 404 errors now, enabling 404 caching would work to prevent this from eating up disk space when/if it occurs. If 404 caching is not enabled, no cache file is ever created for these requests to begin with. In short, I think we've already resolved this problem, even though the .htaccess snippet below doesn't deal with it at all.

# BEGIN Host Enforcer
<IfModule rewrite_module>
    RewriteEngine on
    RewriteBase /

    RewriteCond %{HTTP_HOST} !^example\.com$
    RewriteCond %{HTTPS} !^on$ [NC]
    RewriteCond %{HTTP:X-Forwarded-Proto} !^https$ [NC]
    RewriteRule .* http://example.com%{REQUEST_URI} [R=301,L]

    RewriteCond %{HTTP_HOST} !^example\.com$
    RewriteCond %{HTTPS} ^on$ [NC,OR]
    RewriteCond %{HTTP:X-Forwarded-Proto} ^https$ [NC]
    RewriteRule .* https://example.com%{REQUEST_URI} [R=301,L]
</IfModule>
# END Host Enforcer
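
As a rough illustration of the "generated dynamically via PHP" idea mentioned above, the rules could be built from a configured host name along these lines. This is only a sketch under assumptions: `host_enforcer_snippet()` is a hypothetical helper name, not an existing Quick Cache function, and the caller is assumed to pass the canonical host. The host goes through `preg_quote()` so that dots match literally inside the RewriteCond pattern.

```php
<?php
/**
 * Hypothetical helper (not part of Quick Cache): returns Host Enforcer
 * rules for the given canonical host name.
 */
function host_enforcer_snippet($host)
{
    $host = strtolower(trim($host));

    // preg_quote() escapes regex metacharacters, so `example.com`
    // becomes `example\.com` in the RewriteCond pattern.
    $pattern = preg_quote($host);

    $lines = array(
        '# BEGIN Host Enforcer',
        '<IfModule rewrite_module>',
        '    RewriteEngine on',
        '    RewriteBase /',
        '',
        // Wrong host over plain HTTP: redirect to http://<host>.
        '    RewriteCond %{HTTP_HOST} !^'.$pattern.'$',
        '    RewriteCond %{HTTPS} !^on$ [NC]',
        '    RewriteCond %{HTTP:X-Forwarded-Proto} !^https$ [NC]',
        '    RewriteRule .* http://'.$host.'%{REQUEST_URI} [R=301,L]',
        '',
        // Wrong host over HTTPS: redirect to https://<host>.
        '    RewriteCond %{HTTP_HOST} !^'.$pattern.'$',
        '    RewriteCond %{HTTPS} ^on$ [NC,OR]',
        '    RewriteCond %{HTTP:X-Forwarded-Proto} ^https$ [NC]',
        '    RewriteRule .* https://'.$host.'%{REQUEST_URI} [R=301,L]',
        '</IfModule>',
        '# END Host Enforcer',
    );

    return implode("\n", $lines)."\n";
}

echo host_enforcer_snippet('example.com');
```

Called with 'example.com', this should print the same rules as the snippet above, which could then be shown to the site owner as an optional recommendation.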

raamdev commented Apr 21, 2014

You mentioned to me before that you thought a possible solution might be to offer site owners an .htaccess snippet

Thanks! I had actually forgotten that we discussed that. Yes, I agree that's the best approach.

raamdev-comhttp This is IMO the most troublesome issue. This folder was nested inside of your raamdev-com directory right?

No, it wasn't nested. It was at the same level as raamdev-com. Here's what that directory structure looks like right now:

wp-content/cache/http/raamdev-comhttp/raamdev-com/

Inside which I have:

2006/
2007/
2010/
2011/
raamdev-com/

And every cache file in those sub-directories is, as would be expected, a 404 (symlinked back to the default 404 file).

I tried searching my apache logs for any requests matching some of the 404s, but came up empty. I'll leave this open for now and do some more testing on my live site. I'll also defer this issue for a future release, as I don't feel it's important enough to get out right away.

@raamdev raamdev modified the milestones: Future Release, Next Release Apr 21, 2014
@raamdev raamdev added enhancement and removed bug labels Apr 21, 2014
@raamdev raamdev changed the title Branched Cache Structure and Erroneous Directories Offer Site-Owners an .htaccess to Prevent some Erroneous Directories Apr 21, 2014
@raamdev raamdev changed the title Offer Site-Owners an .htaccess to Prevent some Erroneous Directories Suggest .htaccess rules to prevent some erroneous cache directories Apr 21, 2014
@raamdev raamdev added todo and removed enhancement labels Apr 30, 2014

raamdev commented May 2, 2014

Just a quick update on this: I've been running Quick Cache Pro (from April 16th) for the past two weeks, along with the .htaccess code you recommended above for the Host Enforcer, on my raamdev.com site. As of today, my wp-content/cache/http/ directory has these subdirectories:

raamdev-com/
RAAMDEV-COM/
raamdev-comhttp/
raamdev-comHTTP/
raamdev-comhttps/

I'm installing the latest Quick Cache Pro as of today and will continue testing.

I realized that this issue with erroneous directories might also have something to do with 404 Caching: if an invalid URL is requested, Quick Cache will create the necessary subdirectories in order to make the symlink to the 404 cache file. With 404 Caching disabled (the default), I bet these erroneous directories would go away.

I'll let it run for a few days and then test again with 404 Caching disabled.

raamdev commented May 6, 2014

I've had the latest dev version running for the past few days (404 Caching enabled). I have the following subdirectories in wp-content/cache/quick-cache/cache/http/

raamdev-com/
raamdev-comhttp/raamdev-com/

Inside raamdev-comhttp/raamdev-com/ I have lots of subdirectories and cache files, all of which are 404 Cache files that point back to the default 404 cache file. For example, I have the following:

raamdev-comhttp/raamdev-com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app.html (symlink)
raamdev-comhttp/raamdev-com/2013/try-it-a-different-way.html (symlink)

The actual working URLs are here:
http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/
http://raamdev.com/2013/try-it-a-different-way/

I dug through my Apache access logs looking for the 404 errors to see if there was something funky about the GET request, but here they both are and the GET requests look correct (notice both of these are from the same IP address).

122.96.59.106 - - [04/May/2014:16:19:21 -0400] "GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ HTTP/1.0" 404 35383
122.96.59.106 - - [04/May/2014:16:21:35 -0400] "GET http://raamdev.com/2013/try-it-a-different-way/ HTTP/1.0" 404 35383

(I'm assuming these are the corresponding requests based on the fact the date and timestamps match up to when the 404 symlinks were created.)

What's odd to me is that Apache returned a 404 when the request looks like it should go through. I mean, if you copy and paste those two URLs into your browser, they won't return a 404 but rather the post they're supposed to return.

@JasWSInc Any idea what might be going on here? Or any thoughts about how else I can attempt to figure out what's going on here?


I'm going to disable 404 Caching now and let it run for a few more days just to verify that this issue goes away with 404 Caching disabled.

jaswrks commented May 7, 2014

Regarding these two log entries in your Apache log...

122.96.59.106 - - [04/May/2014:16:19:21 -0400] "GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ HTTP/1.0" 404 35383
122.96.59.106 - - [04/May/2014:16:21:35 -0400] "GET http://raamdev.com/2013/try-it-a-different-way/ HTTP/1.0" 404 35383

These actually look wrong to me, but it might just be the Apache log format you're using. Could you check on this? Ordinarily, an HTTP request includes a Host: header and of course the request itself is aimed at a particular IP address that is resolved during the request.

The GET request itself should not include a host name, only the path to a file that is expected to live on that host. So what I would expect to see in the log file is....

GET /2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/
GET /2013/try-it-a-different-way/

jaswrks commented May 7, 2014

In short, when I see GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ in the log file, it looks to me like the URL that was requested was actually...

http://raamdev.comhttp://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/

raamdev commented May 7, 2014

The GET request itself should not include a host name,

Ah, yes, you're absolutely right. I was looking at way too many log entries and didn't catch that. The GET request should not contain the hostname.

So, this looks like it's just an invalid request and there's not a whole lot that we can do about that, correct?

I just tried reproducing this, both in a browser and via the command line using curl, but in both cases visiting http://raamdev.comhttp://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ doesn't work... because http://raamdev.comhttp is an invalid domain.

I'm curious how such a request ever made it through to WordPress where Quick Cache picked it up.

jaswrks commented May 8, 2014

So, this looks like it's just an invalid request and there's not a whole lot that we can do about that, correct?

Right. I'm not aware of a way to stop this. It's just a 404 error really.

I'm curious how such a request ever made it through to WordPress where Quick Cache picked it up.

Here's how you can reproduce it. These requests are most likely coming from a bot; it would be very difficult to reproduce this in a browser. Instead of building a URL, think about the underlying HTTP communication that would occur if you made this request without using a URL, and instead simply opened a socket that sends an invalid GET request with the correct Host: header. That's really what a browser does anyway, but a browser parses the URL that you give it, and in that case the Host: header would be wrong. However, if you remove the URL from the equation, and instead connect to the host IP and issue a GET request through a script, you can reproduce this.

<?php
error_reporting(-1);
ini_set('display_errors', TRUE);

$raamdev_ip = gethostbyname('raamdev.com'); // Resolve to an IP address.
$connection = fsockopen($raamdev_ip, 80, $errno, $errstr, 30); // Open connection.

if(!$connection) echo $errstr.' ('.$errno.')<br />'."\n";

else // We have a connection to `$raamdev_ip:80`. We're good so far.
    {
        /*
         * Builds a GET request that is intentionally invalid in this case.
         */
        $request = 'GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ HTTP/1.1'."\r\n";
        // ↑ this is intentionally invalid; it should be `/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/`.
        $request .= 'Host: raamdev.com'."\r\n"; // Apache virtual host @ `$raamdev_ip:80`.
        $request .= 'Connection: Close'."\r\n\r\n";

        /*
         * Talk to the IP handling `raamdev.com`.
         */
        fwrite($connection, $request);

        /*
         * Get the response; a 404 in this case.
         */
        while(!feof($connection))
            echo fgets($connection, 128);

        /*
         * Close the connection.
         */
        fclose($connection);
    }

jaswrks commented May 8, 2014

So, this looks like it's just an invalid request and there's not a whole lot that we can do about that, correct?

One thing you could do is investigate any reports from Google Webmaster Tools for raamdev.com that may indicate you have some invalid links on your site; i.e. invalid relative locations within a document that a spider might pick up by mistake.

For example, if you have an <a href=""> tag that might confuse a spider, you could see more than your fair share of invalid requests like this. The bot is simply following what you give it, and if that's wrong you get hit with lots of 404 errors when it attempts to spider your site.

That said, this can happen even if you don't have any invalid links on the site. Some spiders just don't function properly. They get things wrong when they spider your site. You could scan your log files and try to find a bot that is consistently doing this to you; then ban it using a robots.txt file or other means.
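
One way to do that log scan is sketched below. It assumes the access log uses Apache's common or combined log format, and it simply counts, per client IP, the requests whose target starts with http:// or https:// instead of a path. `count_absolute_form_requests()` is a hypothetical name; the first two sample lines are the log entries quoted earlier in this thread, and the third (from 203.0.113.7) is invented purely for contrast.

```php
<?php
/**
 * Hypothetical helper: counts requests per client IP whose request line
 * carries a full URL (e.g. `GET http://host/path`) rather than a path.
 * Assumes Apache common/combined log format.
 */
function count_absolute_form_requests(array $log_lines)
{
    $counts = array();

    foreach ($log_lines as $line) {
        // IP ident user [date] "METHOD target PROTOCOL" status bytes
        if (preg_match('/^(\S+) \S+ \S+ \[[^\]]+\] "[A-Z]+ (\S+)/', $line, $m)
            && preg_match('#^https?://#i', $m[2])) {
            $ip          = $m[1];
            $counts[$ip] = isset($counts[$ip]) ? $counts[$ip] + 1 : 1;
        }
    }
    arsort($counts); // Most frequent offenders first.

    return $counts;
}

$sample = array(
    '122.96.59.106 - - [04/May/2014:16:19:21 -0400] "GET http://raamdev.com/2014/linkedin-spam-black-girl-birthday-os-x-contacts-app/ HTTP/1.0" 404 35383',
    '122.96.59.106 - - [04/May/2014:16:21:35 -0400] "GET http://raamdev.com/2013/try-it-a-different-way/ HTTP/1.0" 404 35383',
    // Invented example of a well-formed request, for contrast.
    '203.0.113.7 - - [04/May/2014:16:22:00 -0400] "GET /2013/try-it-a-different-way/ HTTP/1.1" 200 35383',
);

print_r(count_absolute_form_requests($sample));
```

Against a real log, the lines could be loaded with `file($path, FILE_IGNORE_NEW_LINES)`, where `$path` points at wherever the server writes its access log. IPs that show up repeatedly are candidates for a robots.txt rule or a block at the server level.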

raamdev commented May 9, 2014

Here's how you can reproduce it.

Thanks for explaining that and for the sample code. That really helped clarify a few things for me. :) I tested that script and it does exactly as you said; it recreates the raamdev-comhttp/ directory that I was seeing along with the 404 symlink.

I'm not aware of a way to stop this. It's just a 404 error really.

Got it. We'll just offer an .htaccess file that helps clean things up a bit, if the site owner wants to implement something like that.

I think it will also be a good idea to explain that with 404 caching enabled, any invalid request will result in the cache file symlink being created, just so that there's no confusion about why there are cache directories for seemingly invalid hosts. In fact, I can probably turn a lot of what we've talked about here into a wiki article and reference that right from the inline docs.

raamdev commented Jun 18, 2014

Punting this to the Future Release milestone.

@raamdev raamdev modified the milestones: Future Release, Next Release Jun 18, 2014

raamdev commented Dec 16, 2014

@mchlbrry writes in #288:

As a side note, for some reason I'm getting multiple domain cache directories being generated; potentially related? E.g.:

www-domain-com
www-domain-com-
www-domain-comhttp
These directories hang around too after using the 'clear cache' option in WP

Those are the result of a slightly misconfigured web server. Quick Cache uses the PHP $_SERVER['REQUEST_URI'] variable to determine the cache directory path it should build when generating and saving the cache file. If a request has a malformed Request URI, then Quick Cache will end up creating odd directories like that.
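
To make that concrete, here is a small sketch of how a host name plus a malformed Request URI can turn into directories like the ones listed above. It is an illustration only: `sketch_cache_path()` is a made-up function, and its rules (join host and Request URI, split on slashes, replace runs of non-alphanumeric characters with a dash) are assumptions inferred from the directory names reported in this issue, not Quick Cache's actual code.

```php
<?php
/**
 * Hypothetical illustration; this is NOT Quick Cache's real implementation.
 * Builds a cache-style path from the scheme, Host header, and Request URI.
 */
function sketch_cache_path($scheme, $host, $request_uri)
{
    $segments = array();

    foreach (explode('/', $host.$request_uri) as $segment) {
        // Replace runs of anything other than letters/digits with a dash.
        $segment = trim(preg_replace('/[^a-zA-Z0-9]+/', '-', $segment), '-');

        if ($segment !== '') {
            $segments[] = $segment;
        }
    }

    return $scheme.'/'.implode('/', $segments);
}

// Well-formed request line: GET /2013/try-it-a-different-way/
echo sketch_cache_path('http', 'raamdev.com', '/2013/try-it-a-different-way/'), "\n";
// http/raamdev-com/2013/try-it-a-different-way

// Malformed request line: GET http://raamdev.com/2013/try-it-a-different-way/
echo sketch_cache_path('http', 'raamdev.com', 'http://raamdev.com/2013/try-it-a-different-way/'), "\n";
// http/raamdev-comhttp/raamdev-com/2013/try-it-a-different-way
```

Under the same assumptions, a request whose Host: header is uppercase would end up under a separate RAAMDEV-COM directory, since nothing in the sketch lowercases the host.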

Where do these requests come from? Well, a search engine bot that scans large amounts of sites could itself be misconfigured and make bad requests, which Quick Cache picks up and attempts to cache.

The best way around this issue is to create an .htaccess rule that tells the web server to always redirect any bad requests to the proper Request URI. Please see Jason's first reply at the top of this issue for an example .htaccess rule.

@sallyfarmer

Ever since I installed ZenCache at http://alcohol-abuse-and-addictions-agency.co.uk, the whole site is down whenever an .htaccess file is present in the file manager. Even with the .htaccess deleted and the site up, none of the page or post links work. Permalinks are set to %postnames%, at the bottom of the list. The .htaccess file regularly reappears, and when it does the site goes down. I can't just delete ZenCache, as I don't want to mess things up any further. I have an AWS account with a distribution set up correctly as per your excellent video, with a CNAME for the CDN in cPanel, etc. I used the

# BEGIN Host Enforcer
<IfModule rewrite_module>
    RewriteEngine on
    RewriteBase /

    RewriteCond %{HTTP_HOST} !^example\.com$
    RewriteCond %{HTTPS} !^on$ [NC]
    RewriteCond %{HTTP:X-Forwarded-Proto} !^https$ [NC]
    RewriteRule .* http://example.com%{REQUEST_URI} [R=301,L]

    RewriteCond %{HTTP_HOST} !^example\.com$
    RewriteCond %{HTTPS} ^on$ [NC,OR]
    RewriteCond %{HTTP:X-Forwarded-Proto} ^https$ [NC]
    RewriteRule .* https://example.com%{REQUEST_URI} [R=301,L]
</IfModule>
# END Host Enforcer

by replacing example with alcohol-abuse-and-addictions-agency.co.uk within it, being careful to keep everything else exactly the same, but it didn't work. The only way I could get the site to show again was by deleting the .htaccess file completely again. Still no links work, although AWS CloudFront is very fast at rendering links that don't work.

raamdev commented Mar 28, 2015

@sallyfarmer It sounds like you may have an error in your .htaccess file, or a misconfiguration on your web server. I recommend contacting your web hosting company and asking them why the .htaccess file isn't working, as they have access to the server logs and they can diagnose this issue.

jaswrks commented Nov 12, 2015

@raamdev I'm just noting that this is another candidate for our new .htaccess tweaks system.

@raamdev raamdev added this to the Future Release milestone Nov 15, 2016
@raamdev raamdev modified the milestones: Next Release, Future Release Nov 22, 2016
jaswrks pushed a commit to wpsharks/comet-cache-pro that referenced this issue Jan 27, 2017
…at allows site owners to enforce an exact host name for all requests. See: **Dashboard → Comet Cache Pro → Plugin Options → Apache Optimizations → Enforce an Exact Host Name?**. See also: [Issue #101](wpsharks/comet-cache#101).
raamdev added a commit that referenced this issue Feb 1, 2017
- **New Feature:** Comet Cache can now be configured to automatically clear the cache for date-based archive views whenever any single post is cleared due to changes in content, title, etc. See: **Dashboard → Comet Cache → Plugin Options → Automatic Cache Clearing → Auto-Clear "Date-Based Archives" Too?**. See also: [Issue #724](#724).
- **New Pro Feature:** Apache Optimizations now include a new option that allows site owners to enforce an exact host name for all requests. See: **Dashboard → Comet Cache Pro → Plugin Options → Apache Optimizations → Enforce an Exact Host Name?**. See also: [Issue #101](#101).
- **Bug Fix:** Apache detection was sometimes inaccurate. So instead of using default WP core globals for server detection, Comet Cache now uses its own set of Apache/Nginx/IIS detection functions. And, this release enhances our Apache and Nginx detection routines; making them smart enough to catch additional edge cases; i.e., to further reduce the likelihood of there being a false-positive. See [Issue #748](#748).
- **Bug Fix:** Some XML-RPC and REST API requests were being cached inadvertently. See [Issue #855](#855).
- **Bug Fix:** Broken textarea field due to `white-space:nowrap` in Firefox. See [Issue #866](#866).
- **Bug Fix:** This release resolves empty directories being left in the cache folder, in some scenarios. See [Thread #866](https://forums.wpsharks.com/t/cache-folders-not-removed-during-clean-up-process/866).
- **Bug Fix** (Pro): Some REST requests were being redirected incorrectly whenever Apache Optimizations were enabled. See [Issue #855](#855).
- **Compatibility Bug Fix:** Some Jetpack API calls were being cached inadvertently. See [Issue #855](#855).
- **Enhancement:** Notes in HTML source now indicate fully functional on first load for improved clarity. See [Issue #860](#860).
- **Code Cleanup:** Enhancing security by removing `basename(__FILE__)` from direct access notices.

renzms commented Feb 3, 2017

Tested for Site using NGINX

screen shot 2017-02-03 at 8 05 56 pm

Also tried adding the following manually for sites that use NGINX:

# Enforce exact host name.
<IfModule rewrite_module>
    RewriteEngine on
    RewriteBase /

    RewriteCond %{HTTP_HOST} !^domain.net$
    RewriteCond %{HTTPS} !^on$ [NC]
    RewriteCond %{HTTP:X-Forwarded-Proto} !^https$ [NC]
    RewriteRule .* http://domain.net%{REQUEST_URI} [R=301,L]

    RewriteCond %{HTTP_HOST} !^domain.net$
    RewriteCond %{HTTPS} ^on$ [NC,OR]
    RewriteCond %{HTTP:X-Forwarded-Proto} ^https$ [NC]
    RewriteRule .* https://domain.net%{REQUEST_URI} [R=301,L]
</IfModule>

I was unable to continue testing because of problems with server detection; please see the comment here.

renzms commented Feb 13, 2017

@raamdev

Confirmed Working

Tested Using:

WordPress Version: 4.7.2
Current WordPress Theme: Twenty Seventeen version 1.1
Theme Author: the WordPress team - https://wordpress.org/
Theme URI: https://wordpress.org/themes/twentyseventeen/
Active Plugins: Comet Cache Pro Version 170209-RC
PHP Version: 7.0.10
MySQL Version: 10.0.29-MariaDB-0ubuntu0.16.04.1
Apache Version: Apache/2.4.10 (Debian)

Tested using various incorrect web addresses and made-up subdomains, such as http://foo.bar.php70-renz.wpsharks.net/path, the IP itself, and ftp.php70-renz.wpsharks.net.

screen shot 2017-02-13 at 5 25 41 pm

No erroneous cache directories:

screen shot 2017-02-13 at 5 27 44 pm

raamdev added a commit that referenced this issue Feb 20, 2017
- **New Feature:** Comet Cache can now be configured to automatically clear the cache for date-based archive views whenever any single post is cleared due to changes in content, title, etc. See: **Dashboard → Comet Cache → Plugin Options → Automatic Cache Clearing → Auto-Clear "Date-Based Archives" Too?**. See also: [Issue #724](#724).
- **New Pro Feature:** Apache Optimizations now include a new option that allows site owners to enforce an exact host name for all requests. See: **Dashboard → Comet Cache Pro → Plugin Options → Apache Optimizations → Enforce an Exact Host Name?**. See also: [Issue #101](#101).
- **Bug Fix:** Apache detection was sometimes inaccurate. So instead of using default WP core globals for server detection, Comet Cache now uses its own set of Apache/Nginx/IIS detection functions. And, this release enhances our Apache and Nginx detection routines; making them smart enough to catch additional edge cases; i.e., to further reduce the likelihood of there being a false-positive. See [Issue #748](#748).
- **Bug Fix:** Some XML-RPC and REST API requests were being cached inadvertently. See [Issue #855](#855).
- **Bug Fix:** Broken textarea field due to `white-space:nowrap` in Firefox. See [Issue #866](#866).
- **Bug Fix:** This release resolves empty directories being left in the cache folder, in some scenarios. See [Thread #866](https://forums.wpsharks.com/t/cache-folders-not-removed-during-clean-up-process/866).
- **Bug Fix** (Pro): Some REST requests were being redirected incorrectly whenever Apache Optimizations were enabled. See [Issue #855](#855).
- **Compatibility Bug Fix:** Some Jetpack API calls were being cached inadvertently. See [Issue #855](#855).
- **Enhancement:** Notes in HTML source now indicate fully functional on first load for improved clarity. See [Issue #860](#860).
- **Enhancement:** Enhancing security by removing `basename(__FILE__)` from direct access notices.

raamdev commented Feb 20, 2017

Comet Cache v170220 has been released and includes changes from this GitHub Issue. See the v170220 announcement for further details.


This issue will now be locked to further updates. If you have something to add related to this GitHub Issue, please open a new GitHub Issue and reference this one (#101).

@raamdev raamdev closed this as completed Feb 20, 2017
@wpsharks wpsharks locked and limited conversation to collaborators Feb 20, 2017