Develop a "how far should we obey robots.txt?" heuristic function to complement this #11

Open
ssokolow opened this issue Aug 31, 2016 · 0 comments

ssokolow commented Aug 31, 2016

Some sites, like Fanfiction.net, have evolved beyond their former "ban all bots" approach to robots.txt and now publish rules that could actually be useful as a second layer of protection against making a mess.

(In Fanfiction.net's case, they ban the Internet Archive crawler entirely but, for everyone else, ban only URLs like the mobile page variants, chapter comments, and "must be logged in" pages.)
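(To make that concrete, a robots.txt structured along those lines would look something like the following. This is illustrative only, not Fanfiction.net's actual file; the paths are hypothetical stand-ins for the kinds of rules described above.)

```
# Ban the Internet Archive crawler outright
User-agent: ia_archiver
Disallow: /

# For everyone else, exclude only the problematic URL patterns
User-agent: *
Disallow: /m/       # mobile page variants
Disallow: /r/       # chapter comments/reviews
Disallow: /login    # "must be logged in" pages
```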

I can see it being helpful to have a simplified robots.txt wrapper where you initialize it with a user-agent string and a starting URL, and it then makes pass/fail judgements based on a heuristic like this:

  1. If the starting URL is allowed by robots.txt, simply act as a plain robots.txt evaluator.
  2. If the starting URL is not covered by any Allow rule, apply step 1 as if an Allow rule existed for the starting URL's parent folder.
  3. If the starting URL falls within a Disallow rule, ignore that rule (and, depending on how the parser was initialized, possibly all other Disallow rules too).

I think that should strike a reasonably intuitive balance between obeying the user's explicit request and respecting robots.txt. (I know it would cover every case I can remember where I instructed HTTrack to ignore robots.txt in favour of my hand-crafted include/exclude rules.)
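Roughly, the wrapper might look like the sketch below, built on Python 3's standard urllib.robotparser. The class name, method names, and the `ignore_all_disallows` flag are all hypothetical; this is an approximation of the heuristic, not a finished design.

```python
import posixpath
from urllib import robotparser
from urllib.parse import urljoin, urlparse


class LenientRobotsChecker:
    """Judge URLs against robots.txt, bending the rules just enough to
    honour the user's explicitly requested starting URL."""

    def __init__(self, user_agent, start_url, ignore_all_disallows=False):
        self.user_agent = user_agent
        self.ignore_all_disallows = ignore_all_disallows

        self.parser = robotparser.RobotFileParser()
        self.parser.set_url(urljoin(start_url, '/robots.txt'))
        self.parser.read()

        # Step 1: if robots.txt already permits the starting URL, behave
        # as a plain robots.txt evaluator and never bend the rules.
        self.start_allowed = self.parser.can_fetch(user_agent, start_url)

        # Steps 2/3: otherwise, treat the starting URL's parent folder
        # as if it carried an implied Allow rule.
        self.start_prefix = posixpath.dirname(urlparse(start_url).path) + '/'

    def can_fetch(self, url):
        """Return True if `url` should be retrieved under the heuristic."""
        if self.parser.can_fetch(self.user_agent, url):
            return True
        if self.start_allowed:
            # Plain evaluator mode: the Disallow stands.
            return False
        if self.ignore_all_disallows:
            # Step 3, strict variant: disregard every Disallow rule.
            return True
        # Default variant: honour the implied Allow for the starting
        # URL's parent folder; obey robots.txt everywhere else.
        return urlparse(url).path.startswith(self.start_prefix)
```

With `ignore_all_disallows=False`, a start page that robots.txt blocks would still pull in its siblings under the same folder, while everything else on the site stays governed by the file's ordinary rules.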

(And I'll want to make it into its own module to avoid feature creep here, but I'll track it on this issue tracker until the project gets started.)
