False negatives in robots.txt processing? #3637

Closed
alanbchristie opened this issue Feb 23, 2019 · 7 comments

@alanbchristie

https://www.idealista.it has a robots.txt which appears complex but essentially contains the following:

  • User-agent: *
  • Allow: /en/geo/

Scrapy (1.6.0) keeps telling me that wherever I go on this site I'm forbidden by robots.txt:

2019-02-23T11:06:44.226Z scrapy.downloadermiddlewares.robotstxt DEBUG # Forbidden by robots.txt: <GET https://www.idealista.it/en/geo/vendita-case/molise/>

I'm confused. I don't think I should be blocked, and I suspect that Scrapy may be thrown by other instructions in the robots.txt file.

I'm no expert by any means, but when I validate an apparently legitimate URL (https://www.idealista.it/en/geo/vendita-case/molise/) with an independent tool like http://tools.seobook.com/robots-txt/analyzer/ (and I've tried more than one to gain confidence), I'm told...

Url: https://www.idealista.it/en/geo/vendita-case/molise/
Multiple robot rules found 
Robots allowed: All robots

So, is the robots.txt analysis in Scrapy broken?

Scrapy tells me that everything on this site is blocked by the robots.txt. Just looking at the file myself, and not fully understanding the order of precedence, that just doesn't seem right.

  1. If the answer is "Scrapy is correct", then why does it conflict with other analysers?
  2. Is there more I need to configure in Scrapy?
  3. Is there some middleware I'm missing?
  4. And, most importantly, how do I continue to use Scrapy and analyse sites like this? Suggestions I don't want are: circumventing robots.txt by setting ROBOTSTXT_OBEY = False, or writing my own robots.txt analyser.
@malberts

malberts commented Feb 25, 2019

The behavior you see is correct and not a bug in Scrapy. In that robots.txt file you have rules about /en/ in the following order:

Disallow: /en/
Allow: /en/$
Allow: /en/geo/

The (usual) standard is to take the first matching directive: https://en.wikipedia.org/wiki/Robots_exclusion_standard#Allow_directive

Internally Scrapy uses urllib.robotparser, which follows that ordering standard. It adds the rules top to bottom, and that is also the order in which they are matched. Not all robots.txt checkers follow the same rules, which might be why some of the third-party checkers did not complain.
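
For illustration, here is a minimal check with the stdlib parser (a sketch; the rule lines are abbreviated from the site's robots.txt and the URL is the one from your log):

from urllib import robotparser

url = "https://www.idealista.it/en/geo/vendita-case/molise/"

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /en/",      # appears before the Allow rules in the real file
    "Allow: /en/$",
    "Allow: /en/geo/",
])
# The first matching rule wins, and that is the Disallow:
print(rp.can_fetch("*", url))  # False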

Additionally, if you reorder it like this:

Allow: /en/$
Disallow: /en/
Allow: /en/geo/

it would still fail, because the $ is not part of the standard. Google uses it to check for "end of line" (like regex), but robotparser does a simple startswith, which will fail unless you have a literal $ in the URL.

For it to actually allow you access it would have to be:

Allow: /en/geo/
Allow: /en/$
Disallow: /en/

The middle rule won't ever match anyway, but the first one will.
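
As a sketch with the stdlib parser again, that last order does pass, while the $-first order above still would not:

from urllib import robotparser

url = "https://www.idealista.it/en/geo/vendita-case/molise/"

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /en/geo/",   # first match for the URL above
    "Allow: /en/$",      # never matches: the '$' is compared literally
    "Disallow: /en/",
])
print(rp.can_fetch("*", url))  # True

# With 'Allow: /en/$' first and 'Allow: /en/geo/' last (the previous
# ordering), 'Disallow: /en/' would be hit first and this would print False.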

So to answer (4): unfortunately, those are your only options because this website doesn't follow the (usual) standard.

@malberts

malberts commented Feb 25, 2019

OK, so you don't have to go for a full-blown custom robots.txt parser. Would a minimal subclass removing just the offending rule be fine?

For example:
settings.py

ROBOTSTXT_OBEY = True

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.NewRobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': None
}

middlewares.py

from urllib import robotparser
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.utils.python import to_native_str

class NewRobotsTxtMiddleware(RobotsTxtMiddleware):
    def _parse_robots(self, response, netloc):
        self.crawler.stats.inc_value('robotstxt/response_count')
        self.crawler.stats.inc_value(
            'robotstxt/response_status_count/{}'.format(response.status))
        rp = robotparser.RobotFileParser(response.url)
        body = ''
        if hasattr(response, 'text'):
            body = response.text
        else:  # last effort try
            try:
                body = response.body.decode('utf-8')
            except UnicodeDecodeError:
                # If we found garbage, disregard it,
                # but keep the lookup cached (in self._parsers)
                # Running rp.parse() will set rp state from
                # 'disallow all' to 'allow any'.
                self.crawler.stats.inc_value('robotstxt/unicode_error_count')
        # stdlib's robotparser expects native 'str';
        # with unicode input, decoding non-ASCII bytes fails in Python 2

        # Start change: drop the offending rule before parsing.
        # (list.remove() would raise ValueError if the exact line were
        # missing, so filter by stripped comparison instead.)
        lines = to_native_str(body).splitlines()
        lines = [line for line in lines if line.strip() != 'Disallow: /en/']
        rp.parse(lines)
        # End of change.

        rp_dfd = self._parsers[netloc]
        self._parsers[netloc] = rp
        rp_dfd.callback(rp)

@alanbchristie
Author

Thank you for the rapid and comprehensive explanation.

I've adopted your adjustment suggestion in my NewRobotsTxtMiddleware, but rather than removing lines I've decided to re-order them (so as not to break too many rules). So, I place...

  1. All the Disallow lines that name specific files or wildcards first
  2. Followed by all Allow lines
  3. Followed by the remaining Disallow lines

So Disallow:/*?ordine=stato-asc comes before Allow: /en/geo/, which then comes before Disallow: /en/.
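
For reference, a rough sketch of that re-ordering (illustration only; the heuristic for what counts as a "specific" Disallow, i.e. a wildcard, a $ anchor, a query string or a file name in the path, is my own assumption):

def reorder_rules(rule_lines):
    """Re-order the Allow/Disallow lines of a single User-agent group so
    that a first-match parser sees specific Disallows first, then Allows,
    then the broad Disallows. Non-rule lines (User-agent, Sitemap,
    comments, blanks) are kept at the top, ahead of all rules."""
    specific_disallow, allow, broad_disallow, other = [], [], [], []
    for line in rule_lines:
        rule = line.strip().lower()
        if rule.startswith('disallow:'):
            path = line.split(':', 1)[1].strip()
            last_segment = path.rsplit('/', 1)[-1]
            # Heuristic: wildcards, anchors, query strings and file names
            # count as "specific"; bare directory prefixes do not.
            if any(c in path for c in '*$?') or '.' in last_segment:
                specific_disallow.append(line)
            else:
                broad_disallow.append(line)
        elif rule.startswith('allow:'):
            allow.append(line)
        else:
            other.append(line)
    return other + specific_disallow + allow + broad_disallow

In the subclassed middleware above, this replaces the line-removal step, i.e. rp.parse(reorder_rules(to_native_str(body).splitlines())), so the file's rules stay intact while robotparser's first-match logic reaches Allow: /en/geo/ before Disallow: /en/.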

I'm still uncomfortable re-ordering the file, but I am comforted by the Wikipedia article, which clearly indicates there's some "wiggle-room" in the standard (Google's implementation differs in that Allow patterns with an equal or greater number of characters in the directive path win over a matching Disallow pattern).

The file just contains so much that is silly according to the (usual) standard. Does this, according to the standard, make any sense at all...?

Disallow: /en/node/
Disallow: /en/

And this...

Disallow: /en/
Allow: /en/geo/

I suppose, to be really "safe", I should contact the website author for clarification on the intention of such a rule.

Anyway, thanks for your help. It has shed some light on this dark subject.

@malberts

I forgot to mention that partial wildcards (Disallow: /*?ordine=stato-asc) will also not trigger, for the same reason as $ (because of startswith()). Complete wildcards (Allow: *) will be fine because the parser checks for that explicitly, but that's usually only used to override the default rules for a specific bot.
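
A quick sketch of that, again with the stdlib parser (the query-string URL here is made up for illustration):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /*?ordine=stato-asc",
])
# The '*' is not expanded, so the rule only matches paths that literally
# begin with that pattern; this URL therefore stays allowed.
print(rp.can_fetch("*", "https://www.idealista.it/case?ordine=stato-asc"))  # True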

Does this, according to the standard, make any sense at all...?

Disallow: /en/node/
Disallow: /en/

By itself, no. The latter already covers the former. For all we know there might've been historical reasons, and they never bothered to remove the unnecessary entries since the big search engines don't have an issue with them.

And this...

Disallow: /en/
Allow: /en/geo/

This is fine with a smarter parser: by default, disallow all English pages, except for those on the geo sub-path. It's shorthand for explicitly Disallowing every non-/en/geo path.

I suppose, to be really "safe", I should contact the website author for clarification on the intension of such a rule.

That's a good idea - sometimes a robots.txt might technically allow you, but their T&Cs don't.
However, if you can get them to officially support your bot with an explicit User-Agent: ABCspider and simpler rules, then that'll get you around the messy rules that apply to everyone else. Or they could just rearrange the file so that a stricter parser understands it correctly.

@maramsumanth
Contributor

maramsumanth commented Mar 21, 2019

Hey @malberts, in your first comment for the URL https://www.idealista.it/en/geo/vendita-case/molise/,
apart from this:

Allow: /en/geo/
Allow: /en/$
Disallow: /en/

Would the following also allow access?

Allow: /en/$
Allow: /en/geo/
Disallow: /en/

@malberts

@maramsumanth Yes, they are practically the same. In Scrapy the rule with $ will never match anyway. With a smarter parser it also doesn't make a difference, because both rules are there.

@Gallaecio
Member

#3796 makes it possible to select a different robots.txt parser. Please check whether any of the parsers with built-in support (Reppy, Robotexclusionrulesparser) works for this use case. Otherwise, #3935 may do the trick.
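
For anyone finding this later, switching parsers should just be a settings change along these lines (a sketch, assuming the ROBOTSTXT_PARSER setting introduced by #3796 and the parser class paths it adds; the reppy or robotexclusionrulesparser package has to be installed separately):

settings.py

ROBOTSTXT_OBEY = True

# Reppy-backed parser (assumed class path from #3796):
ROBOTSTXT_PARSER = 'scrapy.robotstxt.ReppyRobotParser'

# ...or the Robotexclusionrulesparser-backed one:
# ROBOTSTXT_PARSER = 'scrapy.robotstxt.RerpRobotParser'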
