Develop a "how far should we obey robots.txt?" heuristic function to complement this #11

Open
ssokolow opened this issue Aug 31, 2016 · 0 comments

ssokolow commented Aug 31, 2016

Some sites, like Fanfiction.net, have evolved beyond their former "ban all bots" approach to robots.txt and now publish rules that could actually be useful as a second layer of protection against making a mess.

(In Fanfiction.net's case, they ban the Internet Archive crawler entirely but, for everyone else, ban only URLs like the mobile page variants, chapter comments, and "must be logged in" pages.)
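(To make that concrete, a robots.txt structured along those lines would look something like the following. This is illustrative only, not Fanfiction.net's actual file; the paths are hypothetical stand-ins for the kinds of rules described above.)

```
# Ban the Internet Archive crawler outright
User-agent: ia_archiver
Disallow: /

# For everyone else, exclude only the problematic URL patterns
User-agent: *
Disallow: /m/       # mobile page variants
Disallow: /r/       # chapter comments/reviews
Disallow: /login    # "must be logged in" pages
```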

I can see it being helpful to have a simplified robots.txt wrapper where you initialize it with a user-agent string and a starting URL, and it then makes pass/fail judgements based on a heuristic like this:

  1. If the starting URL is allowed by robots.txt, simply act as a plain robots.txt evaluator.
  2. If the starting URL is not covered by any Allow rule, apply step 1 as if an Allow rule existed for the starting URL's parent folder.
  3. If the starting URL falls within a Disallow rule, ignore that rule (and, depending on how the parser was initialized, possibly all other Disallow rules too).

I think that should strike a reasonably intuitive balance between obeying the user's explicit request and respecting robots.txt. (I know it would cover every case I can remember where I instructed HTTrack to ignore robots.txt in favour of my hand-crafted include/exclude rules.)
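Roughly, the wrapper might look like the sketch below, built on Python 3's standard urllib.robotparser. The class name, method names, and the `ignore_all_disallows` flag are all hypothetical; this is an approximation of the heuristic, not a finished design.

```python
import posixpath
from urllib import robotparser
from urllib.parse import urljoin, urlparse


class LenientRobotsChecker:
    """Judge URLs against robots.txt, bending the rules just enough to
    honour the user's explicitly requested starting URL."""

    def __init__(self, user_agent, start_url, ignore_all_disallows=False):
        self.user_agent = user_agent
        self.ignore_all_disallows = ignore_all_disallows

        self.parser = robotparser.RobotFileParser()
        self.parser.set_url(urljoin(start_url, '/robots.txt'))
        self.parser.read()

        # Step 1: if robots.txt already permits the starting URL, behave
        # as a plain robots.txt evaluator and never bend the rules.
        self.start_allowed = self.parser.can_fetch(user_agent, start_url)

        # Steps 2/3: otherwise, treat the starting URL's parent folder
        # as if it carried an implied Allow rule.
        self.start_prefix = posixpath.dirname(urlparse(start_url).path) + '/'

    def can_fetch(self, url):
        """Return True if `url` should be retrieved under the heuristic."""
        if self.parser.can_fetch(self.user_agent, url):
            return True
        if self.start_allowed:
            # Plain evaluator mode: the Disallow stands.
            return False
        if self.ignore_all_disallows:
            # Step 3, strict variant: disregard every Disallow rule.
            return True
        # Default variant: honour the implied Allow for the starting
        # URL's parent folder; obey robots.txt everywhere else.
        return urlparse(url).path.startswith(self.start_prefix)
```

With `ignore_all_disallows=False`, a start page that robots.txt blocks would still pull in its siblings under the same folder, while everything else on the site stays governed by the file's ordinary rules.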

(And I'll want to make it into its own module to avoid feature creep here, but I'll track it on this issue tracker until the project gets started.)
