False negatives in robots.txt processing? #3637
Comments
The behavior you see is correct and not a bug in Scrapy. In that robots.txt file you have rules about /en/ that block your URL, not just the Allow: /en/geo/ line you quoted.
The (usual) standard is to take the first matching directive: https://en.wikipedia.org/wiki/Robots_exclusion_standard#Allow_directive. Internally Scrapy uses Python's stdlib robotparser, which behaves that way. Additionally, if you simply reorder the file so the Allow directive comes first, it would still fail, because the Allow rule as it actually appears in the file still wouldn't match (see the note about wildcards below).
For it to actually allow you access, a plain Allow: /en/geo/ rule would have to come ahead of the Disallow: /en/ one.
The rule with the wildcard in it won't ever match anyway, but that plain first rule will. So to answer (4): unfortunately, those are your only options, because this website doesn't follow the (usual) standard.
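For illustration, here is a quick way to see both behaviours with the stdlib parser that Scrapy 1.6 uses. This is a simplified file for demonstration, not the actual idealista.it robots.txt:

from urllib import robotparser

URL = "https://www.idealista.it/en/geo/vendita-case/molise/"

def allowed(robots_txt):
    # Parse a robots.txt string and check the URL for the '*' user agent.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch("*", URL)

# Disallow comes first, so the first matching directive blocks the URL.
print(allowed("User-agent: *\nDisallow: /en/\nAllow: /en/geo/"))   # False

# With the plain Allow rule first, the same URL is permitted.
print(allowed("User-agent: *\nAllow: /en/geo/\nDisallow: /en/"))   # True

# A wildcard in the path is treated literally by robotparser, so this
# Allow rule never matches and the Disallow still wins.
print(allowed("User-agent: *\nAllow: /en/geo*\nDisallow: /en/"))   # False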
OK, so I don't have to write a full-blown custom robots.txt parser. Would a minimal subclass that removes just the offending rule be fine? For example, in settings.py:

ROBOTSTXT_OBEY = True
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.NewRobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None,
}

and in middlewares.py:
from urllib import robotparser
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.utils.python import to_native_str


class NewRobotsTxtMiddleware(RobotsTxtMiddleware):
    def _parse_robots(self, response, netloc):
        self.crawler.stats.inc_value('robotstxt/response_count')
        self.crawler.stats.inc_value(
            'robotstxt/response_status_count/{}'.format(response.status))
        rp = robotparser.RobotFileParser(response.url)
        body = ''
        if hasattr(response, 'text'):
            body = response.text
        else:  # last-effort try
            try:
                body = response.body.decode('utf-8')
            except UnicodeDecodeError:
                # If we found garbage, disregard it,
                # but keep the lookup cached (in self._parsers).
                # Running rp.parse() will set rp state from
                # 'disallow all' to 'allow any'.
                self.crawler.stats.inc_value('robotstxt/unicode_error_count')
        # stdlib's robotparser expects native 'str';
        # with unicode input, decoding non-ASCII bytes fails in Python 2.
        # Start of change: drop the offending rule before parsing.
        lines = to_native_str(body).splitlines()
        if 'Disallow: /en/' in lines:
            # Only remove it when present, so robots.txt files from other
            # domains (which won't contain this line) don't raise ValueError.
            lines.remove('Disallow: /en/')
        rp.parse(lines)
        # End of change.
        rp_dfd = self._parsers[netloc]
        self._parsers[netloc] = rp
        rp_dfd.callback(rp)
Thank you for the rapid and comprehensive explanation. I've adopted your adjustment suggestion.
So I'm still uncomfortable re-ordering the file, but I am comforted by the Wikipedia article, which clearly indicates there's some "wiggle-room" in the standard (Google's implementation differs in that Allow patterns with equal or more characters in the directive path win over a matching Disallow pattern). The file just contains so much that is silly according to the (usual) standard. Does this, according to the standard, make any sense at all...?
And this...
I suppose, to be really "safe", I should contact the website author for clarification on the intention of such a rule. Anyway, thanks for your help. It has shed some light on this dark subject.
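To make that difference concrete for myself, here is a rough sketch of first-match versus Google-style longest-match precedence. It only illustrates the precedence rules (plain prefix matching, no wildcards, no user-agent grouping), not how any real parser is implemented:

# Toy illustration of the two precedence interpretations.
RULES = [("Disallow", "/en/"), ("Allow", "/en/geo/")]  # in file order

def first_match(rules, path):
    # Original standard: the first directive whose path matches decides.
    for verb, rule_path in rules:
        if path.startswith(rule_path):
            return verb == "Allow"
    return True  # nothing matched: allowed

def longest_match(rules, path):
    # Google-style: the longest matching path decides; Allow wins a tie.
    matches = [(len(rule_path), verb == "Allow")
               for verb, rule_path in rules if path.startswith(rule_path)]
    return max(matches)[1] if matches else True

path = "/en/geo/vendita-case/molise/"
print(first_match(RULES, path))    # False: Disallow: /en/ is hit first
print(longest_match(RULES, path))  # True: Allow: /en/geo/ is more specific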
I forgot to mention that partial wildcards (a * inside a directive path) aren't supported by the stdlib parser Scrapy uses; it treats them as literal characters, so a rule containing one never matches.
By itself, no. The latter would include the former. For all we know there might've been some historical reasons and they never bothered to remove unnecessary entries since the big search engines don't have an issue.
This is fine with a smarter parser. So by default disallow all English pages, except for those under /en/geo/.
That's a good idea - sometimes a robots.txt might technically allow you, but the site's T&Cs don't.
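For comparison, a parser that implements the Google-style rules (wildcards plus longest-match precedence), such as Protego, resolves this kind of file the way the site presumably intends. A quick sketch, assuming Protego's Protego.parse / can_fetch API and a simplified rule set rather than the real file:

from protego import Protego

# Simplified stand-in for the real file; note the trailing wildcard,
# which Protego understands but the stdlib parser does not.
robots = """
User-agent: *
Disallow: /en/
Allow: /en/geo*
"""

rp = Protego.parse(robots)
# True: the Allow pattern is longer (more specific) than Disallow: /en/
print(rp.can_fetch("https://www.idealista.it/en/geo/vendita-case/molise/", "mybot"))
# False: only Disallow: /en/ matches this (hypothetical) path
print(rp.can_fetch("https://www.idealista.it/en/about/", "mybot"))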
Hey @malberts, in your first comment, for that same URL, can the following also allow access?
@maramsumanth Yes, they are practically the same. In Scrapy it will never match the rule with the wildcard in it.
https://www.idealista.it has a robots.txt which appears complex but essentially has the following:

User-agent: *
Allow: /en/geo/

Scrapy (1.6.0) keeps telling me that wherever I go on this site I'm forbidden by robots.txt:

2019-02-23T11:06:44.226Z scrapy.downloadermiddlewares.robotstxt DEBUG # Forbidden by robots.txt: <GET https://www.idealista.it/en/geo/vendita-case/molise/>

I'm confused. I don't think I should be blocked, and I suspect that Scrapy may be thrown by other instructions in the robots.txt file. I'm no expert by any means, but when I validate an apparently legitimate URL (https://www.idealista.it/en/geo/vendita-case/molise/) using an independent tool like http://tools.seobook.com/robots-txt/analyzer/ (and I've tried more than one to gain confidence) I'm told the URL is allowed. So, is the robots.txt analysis in Scrapy broken?

Scrapy tells me that everywhere on this site is blocked by the robots.txt. Just looking at the file myself, and not fully understanding the order of precedence, that just doesn't seem right. It seems my only alternatives are to circumvent robots.txt by setting ROBOTSTXT_OBEY = False, or to write my own robots.txt analyser.