Some sites, like Fanfiction.net, have evolved from their former "ban all bots" approach to robots.txt and now offer rules that could actually be useful as a second layer of protection against making a mess.
(In Fanfiction.net's case, they ban the Internet Archive crawler entirely but, for every other user agent, only ban URLs like the mobile page variants, chapter comments, and "must be logged in" pages.)
I can see it being helpful to have a simplified robots.txt wrapper where you initialize it with a user-agent string and a starting URL and then it will make pass/fail judgements based on a heuristic like this (a rough code sketch follows the list):
1. If the starting URL is allowed by robots.txt, simply act as a robots.txt evaluator.
2. If the starting URL is outside the range defined by all Allow rules, apply step 1 as if an Allow rule existed for the starting URL's parent folder.
3. If the starting URL is within a Disallow rule, ignore that rule (and, depending on how the parser was initialized, possibly all other Disallow rules too).
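Something like the following is roughly what I have in mind. This is only a sketch, assuming a tiny hand-rolled rule parser rather than any particular library; the class name, the ignore_all_disallows flag, and the simplified prefix matching (no wildcards, no crawl-delay, no multi-user-agent groups) are all placeholder choices, not a settled design.

```python
from urllib.parse import urlparse


class PermissiveRobotsFilter:
    """Pass/fail filter that relaxes robots.txt rules around a starting URL."""

    def __init__(self, robots_txt, user_agent, start_url, ignore_all_disallows=False):
        self.start_path = urlparse(start_url).path or "/"
        self.allows, self.disallows = self._rules_for(robots_txt, user_agent)

        # Step 2: if no Allow rule covers the starting URL, behave as though
        # one existed for its parent folder.
        if not any(self.start_path.startswith(p) for p in self.allows):
            parent = self.start_path.rsplit("/", 1)[0] + "/"
            self.allows.append(parent)

        # Step 3: drop any Disallow rule that would block the starting URL,
        # or every Disallow rule if the caller asked for that at init time.
        if ignore_all_disallows:
            self.disallows = []
        else:
            self.disallows = [p for p in self.disallows
                              if not self.start_path.startswith(p)]

    @staticmethod
    def _rules_for(robots_txt, user_agent):
        """Collect Allow/Disallow path prefixes from the matching user-agent group."""
        allows, disallows, applies = [], [], False
        for raw in robots_txt.splitlines():
            line = raw.split("#", 1)[0].strip()
            if ":" not in line:
                continue
            field, value = (part.strip() for part in line.split(":", 1))
            field = field.lower()
            if field == "user-agent":
                applies = value == "*" or value.lower() in user_agent.lower()
            elif applies and field == "allow" and value:
                allows.append(value)
            elif applies and field == "disallow" and value:
                disallows.append(value)
        return allows, disallows

    def can_fetch(self, url):
        """Step 1: ordinary longest-match evaluation against the adjusted rule set."""
        path = urlparse(url).path or "/"
        allow = max((p for p in self.allows if path.startswith(p)), key=len, default="")
        deny = max((p for p in self.disallows if path.startswith(p)), key=len, default="")
        return len(allow) >= len(deny)
```

Usage would presumably be: construct the filter once with the fetched robots.txt body, the crawler's user-agent string, and the starting URL, then call can_fetch() on each candidate URL before requesting it.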
I think that should provide a reasonably intuitive balance between obeying the user's instructions and respecting robots.txt rules. (I know it would cover every case I can remember where I instructed HTTrack to ignore robots.txt in favour of my hand-crafted include/exclude rules.)
(And I'll want to make it into its own module to avoid feature creep here, but I'll track it on this issue tracker until the project gets started.)