Web: Almost all searches are automated SEO searches, impacting running costs, so trying to block these #55
This has now been implemented. I'm now seeing almost all traffic (99.999%?) to searchmysite.net respond with "Please visit https://searchmysite.net/ to search searchmysite.net". Because the search isn't actually performed when that response is returned, this takes some load off the server, although the extremely high volume of requests will still be putting load on it. So now I need to see whether whatever is doing this (i) stops, (ii) contacts me so it can be handled properly (assuming it is legitimate), or (iii) implements a workaround and continues (in which case I'll reopen this issue and implement a CSRF-style solution).
More than 24 hours after this change and the requests are still coming in thick and fast. Unfortunately, after moving to the cheaper hosting provider I don't get the same sort of monitoring, and the analytics solution (understandably) only reports on real users, so it isn't easy to give nice illustrations of what is going on. But as a rough idea, there were 5 real searches yesterday and 77,167 of the problem searches, so over 99.99% of the search traffic is problematic. Looking at the search terms, most if not all look like random snippets of text scraped off web pages, largely from adult and gambling sites, e.g. "Powered By Tube Ace Tube Script" and "powered by pMachine bk8 Notify me when someone replies to this post", rather than queries a real person would have entered into a search box, so I'm starting to wonder if this is some kind of DDoS attack. Anyway, I'm now blocking at the nginx reverse proxy level to take some more of the pressure off Flask and docker:
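The config block itself didn't survive the page extraction. As a rough illustration only, a minimal nginx rule doing what's described (refusing search requests that arrive without a Referer header, before they reach the app) might look something like the sketch below; the `/search/` path and the upstream address are assumptions, not the actual deployed config:

```nginx
# Sketch only: refuse search requests with no Referer header at the
# reverse proxy, so they never reach Flask. Path and upstream are
# assumptions based on the surrounding description.
location /search/ {
    default_type text/plain;
    if ($http_referer = "") {
        return 403 "Please visit https://searchmysite.net/ to search searchmysite.net";
    }
    proxy_pass http://localhost:8080;  # hypothetical Flask upstream
}
```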
To me those searches look like spammers, probably trying to influence autocompletion, which I don't believe you have. I occasionally get spammers on my own personal sites submitting links through the tastiest form they can find. It doesn't take much to scare mine away, but it seems you're having a harder time of it.
Thanks for your info. Why would a spammer try to influence autocompletion? Looking at the logs, it seems to me to be searching for strings that have been scraped off other websites. Some random IP lookups suggest very geographically diverse sources, e.g. Russia, India, the US, the UK and The Netherlands, and slightly diverse browsers (mostly Safari, Opera, QQBrowser and Vivaldi). So I'm wondering if it is some kind of bot farm which is harvesting original content to place on SEO link farms, or something like that. Anyway, the nginx config seems to be stopping them from getting anything useful for the time being, and keeping the traffic away from docker + Flask, so the CPU utilisation is a little smoother now. But I am still very curious as to what is actually going on.
Okay, I think I have found out what is going on. I got the first clue after searching the internet for one of the search terms I'm seeing: "Designed by Mitre Design and SWOOP".

There are a number of paid-for black hat SEO tools like ScrapeBox, GSA SEO and SEnuke. SEO spammers enter "scraping footprints" into these, combined with their search terms, to search for URLs to target. A simple scraping footprint is "Powered by Wordpress", which finds pages that are probably blog pages generated by Wordpress; an example search term would be "best turntables", so the tool fires off search requests like ""Powered by Wordpress" best turntables" to the search engines. The tool can then use the list of results for all sorts of further activities, e.g. targeting with automated backlink generators, copying content to link farms, scraping for email addresses to spam, etc. Presumably searchmysite.net has been added as a search engine to one or more of these tools (although I don't know which at the moment), which is why I'm seeing vast numbers of these searches.

Now I'm not sure how many, or even if any, of the results on searchmysite.net will be vulnerable to things like automated backlink generators. But with spammers it is a numbers game: if they feed in millions of links, they only need 0.01% to work and they've still got themselves 100s of URLs that will serve their needs. Given how lucrative SEO spam appears to be and how well funded these spam-enabling operations are, I think it is only a matter of time before they break through the simple defences I've put up, so I'm probably not going to be able to win this one long term on my own, in my spare time, with no funding, hence my reopening this issue. One possible solution is to use Cloudflare Scrape Shield to protect searchmysite.net, which is kind of ironic given that Cloudflare is currently blocking the searchmysite.net spider as per #46.
Another solution is simply to switch off the public search and focus on the search as a service as per #57.
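As an illustration of the pattern described above (not code from the project), a crude heuristic for spotting footprint-style queries could be written as follows; the patterns are taken from the handful of examples quoted in this thread and real footprints vary far more:

```python
import re

# Illustrative only: a crude heuristic for footprint-style queries, based on
# the examples quoted in this thread. Real scraping footprints vary far more.
FOOTPRINT_PATTERNS = [
    r'powered by\s+\S+',    # e.g. "Powered by Wordpress" best turntables
    r'designed by\s+\S+',   # e.g. "Designed by Mitre Design and SWOOP"
]
FOOTPRINT_RE = re.compile("|".join(FOOTPRINT_PATTERNS), re.IGNORECASE)

def looks_like_footprint_query(query: str) -> bool:
    """Return True if the query contains a known scraping-footprint phrase."""
    return FOOTPRINT_RE.search(query) is not None
```

For example, `looks_like_footprint_query('"Powered by Wordpress" best turntables')` returns `True`, while a plain query like `'best turntables'` returns `False`. A list like this would need constant curation, which is part of why blocking on request properties (referrer, query length) rather than content was tried first.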
I'm now seeing over 160,000 spam bot searches per day. The nginx config is still holding up reasonably well, but I don't know for how long. I've asked for ideas at https://www.indiehackers.com/post/how-do-you-block-the-seo-spam-bots-208f6eb503 . So far, if the nginx config fails, the alternatives include:
An idea that just occurred to me for you: switch from free-form text entry to a tag cloud. That could reduce the usefulness of your service for SEO spammers whilst preserving most of it for (the few) actual users.
@alcinnz Thanks for your suggestion, and I like the idea. The challenge is that it would be difficult to get a tag cloud sufficiently comprehensive to replace the free text box. You can see from the Browse page that not many sites have tags, and the Filter shows that the tags that do exist aren't that useful (the top tags are blog, developer, software and programming). I know some blog search engines have manually tagged blog home pages, but (i) that could be a lot of work, and (ii) many of the interesting blogs don't just cover one narrow set of topics but have posts about all sorts of things. I also know there are auto-tagging algorithms, but the results can be a bit hit and miss. It might be something worth exploring as a better interface to the Browse section though. P.S. There's a bit of a discussion on Hacker News about potential solutions at https://news.ycombinator.com/item?id=31395231 . So far it seems that trying to block them as early as possible via Cloudflare is the best option.
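For what it's worth, the tag counts a tag cloud needs are trivial to derive once sites have tags; a sketch with made-up per-site tag lists (none of this reflects the real index, where the data would come from):

```python
from collections import Counter

# Hypothetical per-site tag lists; the real ones would come from the index.
site_tags = {
    "example-blog.net": ["blog", "programming"],
    "dev-site.org": ["developer", "software", "programming"],
    "another.example": ["blog"],
}

def tag_cloud(sites: dict) -> list:
    """Return (tag, count) pairs, most common first, for sizing a tag cloud."""
    counts = Counter(tag for tags in sites.values() for tag in tags)
    return counts.most_common()
```

The hard part, as noted above, isn't the counting but getting tags onto enough sites for the cloud to be comprehensive.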
I've set up on Cloudflare, and tried a few settings like a firewall rule to block known bots and enabling Bot Fight Mode, but none have had the desired effect. I've asked for suggestions on the specific config required (assuming there is one) on the Cloudflare community at https://community.cloudflare.com/t/how-do-i-block-automated-seo-searches-for-scraping-footprints/384414 . See also #59 for the Firefox search integration and direct-link-to-search-results issues that the reverse proxy config and code change have introduced.
Well this explains why Search My Site searches made via my personal Searx instance started getting blocked earlier this month :) I'm wondering if there's a solution that will block the automated SEO searches without blocking small-time users using e.g. Searx?
Here are the latest stats:
It's a bit trickier to work out which are the automated SEO search requests now there are more real searches mixed in with them (a good problem to have :-) ), and the stats from 18 May are missing around 10 hours because the log files filled up all the disk space (fortunately the only service impacted was the analytics). But looking at the logs, I reckon quite a few real users with real searches have been blocked by the "block requests with no referrer" rule put in to try to block the automated SEO searches, which is really unfortunate. I also really don't like that direct links to search results don't work as a result of this rule, plus there is the Firefox search bar issue too, so I've had a look at alternative solutions. The best I've come up with for now is to "block requests with no referrer where the query string is longer than X characters". The idea is that this can block the longer ""Powered by <system>" <search term>" type of requests while still allowing shorter direct links without referrers, e.g. /search/?q=domain:michael-lewis.com. The nginx config for this is (noting that an if statement can't contain an "and", so there's a very strange looking workaround):
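The config itself was stripped from this page, but the standard workaround for nginx's lack of compound `if` conditions is to build up a flag variable in stages and test the combined value. A sketch of that approach, with the path, variable name and regex as assumptions based on the description above:

```nginx
# Sketch only: block requests that have no Referer AND a long query string.
# nginx "if" can't express a logical AND, hence the two-stage flag variable.
# The /search/ path, variable name and the 36-character threshold are
# assumptions based on the surrounding description, not the actual config.
location /search/ {
    set $block_seo 0;
    if ($http_referer = "") {
        set $block_seo 1;
    }
    if ($args ~ "^q=.{36,}") {
        set $block_seo "${block_seo}1";
    }
    if ($block_seo = "11") {
        return 403;
    }
    proxy_pass http://localhost:8080;  # hypothetical Flask upstream
}
```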
The number 36 is a bit arbitrary. Unfortunately this is now letting through some shorter automated queries like 'Powered by Qhub.com ', but it does look like it is letting through some real searches that would otherwise have been blocked, so with a bit of tweaking of the number it might be possible to get a reasonable balance. It is still a long way off what I'd call a good solution though, so I'd still think of it as temporary. I've also set up Cloudflare, so the site is now going through that as well. I've spent a bit of time trying out various options, including two which I'd have thought would have worked, but didn't. I still want to try a few more things out with Cloudflare before coming to any conclusions.
@eagle-dogtooth - I think the searx searches come without a referrer, so yes, they'd have been blocked from 9 May. The change I made today should have unblocked them, but I don't know if adding Cloudflare will have caused an issue, nor if any future changes will affect it. At some point I'd like to take a deeper look at searx though, because it would be nice to make sure it works.
This fixes both Firefox integration and searx, at least for search terms within the length limit. Thanks! |
@eagle-dogtooth Yes, best start a new Discussion. |
Latest stats are:
The configuration on Cloudflare which has been running (unchanged) since Wed 18 May is:
FWIW, previous configurations I tried were to:
As an aside, there have been a couple of issues with Cloudflare:
So I think I'll leave Cloudflare on for now, but leave this issue open for the time being.
Latest stats:
So the number of automated SEO searches has been at a manageable level for the past couple of weeks. This doesn't appear to have happened as a result of any specific action I've taken, unless it was the reverse proxy configuration, which has returned a 403 for most of these for the past month. For the record, the current changes are:
Another thought: although I am blocking most of these "automated SEO searches" at the reverse proxy, even if they were to get through, I don't know how useful the results would be to SEO practitioners, given the highly specific terms they are trying to "search engine optimise". To take three random examples I saw in the logs, "liquid silicone molding", "White Cherry Gelato" and "pest control sachse" (apparently Sachse is an area in Dallas) don't return especially useful results from searchmysite.net, especially without double quotes to make them phrase searches. Anyway, I've published a blog post at https://blog.searchmysite.net/posts/an-update-on-the-automated-seo-searches-issue/ with a full update on this issue, and am going to close it for now.
The analytics solution shows there are currently around 10 visitors a day performing around 4 searches a day.
However, the logs show the server is getting several search requests every second, totalling hundreds of thousands of searches every day. Almost all of these have no referrer set, so it isn't clear where they're coming from. They're generally from different IP addresses with different user agents, suggesting that they may be real users rather than bots.
My guess is that one or more other web sites are presenting a search box, getting the results from searchmysite.net, and displaying them on their own site. This isn't necessarily a problem if the site(s) are legitimate, give appropriate credit, and the users are actually making use of the results, but right now I don't know if that is the case, so I'm going to try to block this in order to find out more.
A simple solution is to have the search page check for a referrer and display a message if one isn't present. If that is circumvented, e.g. via a dummy referrer, I'll need to investigate some Cross-Site Request Forgery (CSRF) style protection.
Suggested message is "Please visit https://searchmysite.net/ to search searchmysite.net".
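The check described above can be sketched framework-agnostically; in the real site it would sit in the Flask view (or at the nginx layer), and `run_search` below is a hypothetical stand-in for the actual search code:

```python
BLOCK_MESSAGE = "Please visit https://searchmysite.net/ to search searchmysite.net"

def run_search(query: str) -> str:
    # Placeholder for the real search code; not part of the actual project.
    return f"results for {query!r}"

def handle_search(referrer: str, query: str):
    """Return a (status, body) pair for a search request, applying the
    referrer rule described above. Sketch only: in the real site this
    check would live in the Flask view or in the nginx config."""
    if not referrer:
        # No search is executed for blocked requests, so they cost almost
        # nothing beyond serving the short message.
        return 403, BLOCK_MESSAGE
    return 200, run_search(query)
```

Note that because the Referer header is entirely client-supplied, this only deters tools that don't bother to set one, which is why circumvention would push the design towards CSRF-style tokens.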