Path allowed despite Disallow for * #33
Thanks for writing in! That's far from decisive, but my interpretation is that the current behavior is correct - it follows the most specific stanza for the particular bot. While Google is not exactly a standard, it has this to say on the matter:
That same document describes the precedence order they use, and it seems they share the interpretation of the current implementation. All that said, one of the weaknesses of REP is that there isn't one clear answer: the behavior is set forth by the original RFC and then by mass adoption and convention (mostly driven by Google). It's also entirely possible that I've made a mistake :-)
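That "most specific stanza wins" interpretation can be sketched in a few lines of Python. The function names and data layout here are mine, purely for illustration; this is not rep-cpp's API:

```python
def _token(ua):
    """Reduce a user-agent to its product token: version stripped,
    lowercased (e.g. 'mein-Robot/2.1' -> 'mein-robot')."""
    return ua.split("/")[0].lower()

def select_group(groups, agent):
    """Return the one rule group a crawler obeys: a group naming the
    bot's product token if one exists, otherwise the '*' group, else
    nothing. Rules are never merged across groups."""
    token = _token(agent)
    for agents, rules in groups:
        if token in (_token(a) for a in agents if a != "*"):
            return rules
    for agents, rules in groups:
        if "*" in agents:
            return rules
    return []

# The robots.txt discussed in this issue, as (agents, disallow-rules) pairs:
groups = [
    (["UniversalRobot/1.0", "mein-Robot"], ["/quellen/dtd/"]),
    (["*"], ["/unsinn/", "/temp/", "/newsticker.shtml"]),
]

print(select_group(groups, "mein-Robot"))    # only ['/quellen/dtd/'] applies
print(select_group(groups, "SomeOtherBot"))  # the '*' rules apply
```

The key point is the early return: once a group matches the bot's token, the `*` group is never consulted, so `/temp/` is not among the rules that apply to mein-Robot.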
That's not how it works. Specifically, look at the example from the RFC under Section 4:
Note in particular which paths webcrawler & excite are allowed to crawl.
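The principle behind that example can be shown with a simplified file in the same spirit (the bot names and paths below are made up for illustration, not the RFC's actual text): a stanza that names a bot and contains an empty Disallow grants that bot everything, regardless of what the `*` stanza forbids.

```
User-agent: goodbot
Disallow:

User-agent: *
Disallow: /private/
```

Under the single-matching-group reading, goodbot may fetch /private/, while any bot not named in the file may not.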
Hey, thank you all for taking the time to correct, explain, and point me to the relevant sources - this is very much appreciated and helps me a lot. 👍
Hey,
I have written an R-based robots.txt parser (https://github.com/ropenscilabs/robotstxt). @hrbrmstr wrapped this library (rep-cpp) and suggested using it for a big speedup (https://github.com/hrbrmstr/spiderbar).
Related issue: hrbrmstr/spiderbar#2
Now I have run my test cases against my implementation and against those wrapping rep-cpp, and found a divergence which I think is a bug on your side. Consider the following robots.txt file:

```
User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml
```
In the example some directories are forbidden for all robots, e.g. /temp/, but when using rep-cpp for permission checking that path is reported as allowed for the bot mein-Robot, which I am quite sure should not be the case. (rep-cpp is used for those function calls where check_method = "spiderbar".)
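For what it's worth, Python's standard-library robots.txt parser happens to implement the same single-matching-group semantics, so it can serve as an independent cross-check of the behavior discussed here (a sketch of one more implementation's reading, not a ruling on which reading is correct):

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from this issue report.
ROBOTS = """\
User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

# mein-Robot matches its own group, which says nothing about /temp/,
# so the path is allowed -- the '*' group is never consulted.
print(rp.can_fetch("mein-Robot", "/temp/"))      # True
# A bot not named in the file falls back to the '*' group and is blocked.
print(rp.can_fetch("SomeOtherBot", "/temp/"))    # False
```

So the stdlib parser agrees with rep-cpp on this file: `/temp/` is allowed for mein-Robot, and only paths under `/quellen/dtd/` are blocked for it.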