Path allowed despite Disallow for * #33

Closed
petermeissner opened this issue Jan 25, 2018 · 3 comments

petermeissner commented Jan 25, 2018

Hey,

I have written an R-based robots.txt parser (https://github.com/ropenscilabs/robotstxt). @hrbrmstr wrapped this library (rep-cpp) in spiderbar (https://github.com/hrbrmstr/spiderbar) and suggested using it for a large speedup.

Related issue: hrbrmstr/spiderbar#2

Now I have run my test cases against my implementation and against the one wrapping rep-cpp, and found a divergence which I think is a bug on your side.

Consider the following robots.txt file:

User-agent: UniversalRobot/1.0
User-agent: mein-Robot
Disallow: /quellen/dtd/

User-agent: *
Disallow: /unsinn/
Disallow: /temp/
Disallow: /newsticker.shtml

In this example some directories, e.g. /temp/, are forbidden for all robots. But when rep-cpp is used for permission checking, the path is reported as allowed for the bot mein-Robot, which I am quite sure should not be the case. (rep-cpp is used for the function calls where check_method = "spiderbar".)

library(robotstxt)

rtxt <- "# robots.txt zu http://www.example.org/\n\nUser-agent: UniversalRobot/1.0\nUser-agent: mein-Robot\nDisallow: /quellen/dtd/\n\nUser-agent: *\nDisallow: /unsinn/\nDisallow: /temp/\nDisallow: /newsticker.shtml"

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "*"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "robotstxt",
  bot            = "mein-Robot"
)
#> [1] FALSE

paths_allowed(
  paths          = "/temp/some_file.txt", 
  robotstxt_list = list(rtxt), 
  check_method   = "spiderbar",
  bot            = "mein-Robot"
)
#> [1] TRUE

dlecocq (Contributor) commented Jan 25, 2018

Thanks for writing in!

The section about User-Agent in the original RFC says:

These name tokens are used in User-agent lines in /robots.txt to identify to which specific robots the record applies. The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive. If no such record exists, it should obey the first record with a User-agent line with a "*" value, if present. If no record satisfied either condition, or no records are present at all, access is unlimited.

That's far from decisive, but my interpretation of that is that the current behavior is correct - it follows the most specific stanza for the particular bot. It's pretty common for a robots.txt to group bots together with a set of rules, or to repeat that set of rules explicitly for each of a number of agents.
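
Concretely, that reading could be sketched in R (a hypothetical helper for illustration, not rep-cpp's API; the record layout is assumed):

# Sketch of the record-selection rule quoted above: obey the first record whose
# User-agent value contains the bot's name token as a case-insensitive substring;
# fall back to the "*" record only if no named record matches.
select_record <- function(records, bot) {
  # `records`: list of records, each with an `agents` character vector plus its rules,
  # e.g. list(agents = c("UniversalRobot/1.0", "mein-Robot"), disallow = "/quellen/dtd/")
  for (rec in records) {
    if (any(grepl(tolower(bot), tolower(rec$agents), fixed = TRUE))) {
      return(rec)
    }
  }
  for (rec in records) {
    if ("*" %in% rec$agents) {
      return(rec)
    }
  }
  NULL  # no record matched at all: access is unlimited
}

Under that selection, mein-Robot matches the first record of the example file and never sees the Disallow: /temp/ line in the "*" record.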

While Google is not exactly a standard, it has this to say on the matter:

The start-of-group element user-agent is used to specify for which crawler the group is valid. Only one group of records is valid for a particular crawler. We will cover order of precedence later in this document.

That same document talks about the precedence order that they use, but it seems that they would have the same interpretation as the current implementation.

All that said, one of the weaknesses of REP is that there isn't one clear answer, and it is mostly set forth by the original RFC and then by mass adoption / convention (mostly driven by Google).

It's also entirely possible that I've made a mistake :-)

b4hand (Contributor) commented Jan 25, 2018

That's not how robots.txt files work. It's a common misunderstanding that * applies to all bots. It does not. It only applies to bots that are not matched by other sections. Yes, this means you must repeat rules if you declare specific sections for different bots. I didn't write the robots.txt specification, but this is the letter of the specification. The specification for robots.txt makes it clear that robots only have to look at one section of rules: specifically the first section that they match.

Specifically, look at the example from the RFC under Section 4:

      # /robots.txt for http://www.fict.org/
      # comments to webmaster@fict.org

      User-agent: unhipbot
      Disallow: /

      User-agent: webcrawler
      User-agent: excite
      Disallow: 

      User-agent: *
      Disallow: /org/plans.html
      Allow: /org/
      Allow: /serv
      Allow: /~mak
      Disallow: /

The following matrix shows which robots are allowed to access URLs:

                                               unhipbot webcrawler other
                                                        & excite
     http://www.fict.org/                         No       Yes       No
     http://www.fict.org/index.html               No       Yes       No
     http://www.fict.org/robots.txt               Yes      Yes       Yes
     http://www.fict.org/server.html              No       Yes       Yes
     http://www.fict.org/services/fast.html       No       Yes       Yes
     http://www.fict.org/services/slow.html       No       Yes       Yes
     http://www.fict.org/orgo.gif                 No       Yes       No
     http://www.fict.org/org/about.html           No       Yes       Yes
     http://www.fict.org/org/plans.html           No       Yes       No
     http://www.fict.org/%7Ejim/jim.html          No       Yes       No
     http://www.fict.org/%7Emak/mak.html          No       Yes       Yes

Specifically notice that webcrawler & excite are allowed to crawl http://www.fict.org/org/about.html.
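
To connect this back to the R snippet above, the contrast could be checked like so (assuming the same paths_allowed() interface; "somebot" stands in for any bot not named in the file, and the expected results come from the RFC matrix rather than from any particular implementation):

library(robotstxt)

rfc_txt <- "User-agent: unhipbot\nDisallow: /\n\nUser-agent: webcrawler\nUser-agent: excite\nDisallow:\n\nUser-agent: *\nDisallow: /org/plans.html\nAllow: /org/\nAllow: /serv\nAllow: /~mak\nDisallow: /"

paths_allowed(
  paths          = "/org/about.html",
  robotstxt_list = list(rfc_txt),
  check_method   = "spiderbar",
  bot            = "webcrawler"
)
# expected TRUE: webcrawler's own record has an empty Disallow

paths_allowed(
  paths          = "/org/plans.html",
  robotstxt_list = list(rfc_txt),
  check_method   = "spiderbar",
  bot            = "somebot"
)
# expected FALSE: an unlisted bot falls through to the "*" record, which disallows /org/plans.html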

b4hand closed this as completed Jan 25, 2018

petermeissner (Author) commented

Hey,

Thank you all for taking the time to correct, explain, and point me to the relevant sources - this is very much appreciated and helps me a lot.

👍
