Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
assignee: 'https://github.com/smontanaro'
closed_at: 2003-03-06 08:27:12
created_at: 2003-02-20 18:55:13
labels: ['invalid', 'library']
title: 'robotparser only applies first applicable rule'
updated_at: 2003-03-06 08:27:12
user: 'https://bugs.python.org/f8dy'
robotparser.py::RobotFileParser::can_fetch
currently returns the result of the first applicable rule. It
should loop through all rules looking for anything that
disallows access. For example, if your first rule applies
to 'wget' and 'python' and disallows access to /dir1/, and
your second rule is a 'python' rule that disallows access
to /dir2/, robotparser will falsely claim that python is
allowed to access /dir2/.
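For illustration, the reported scenario can be sketched with the modern module (urllib.robotparser in Python 3; the original report predates that name). The robots.txt content and the paths here are made up for the example:

```python
import urllib.robotparser

# Hypothetical robots.txt matching the report: the first record covers
# both 'wget' and 'python' and disallows /dir1/; a second record for
# 'python' alone disallows /dir2/.
ROBOTS_TXT = """\
User-agent: wget
User-agent: python
Disallow: /dir1/

User-agent: python
Disallow: /dir2/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Only the first record whose User-agent line matches 'python' is
# consulted, so 'python' is bound by the /dir1/ rule alone.
print(rp.can_fetch("python", "/dir1/index.html"))  # False (disallowed)
print(rp.can_fetch("python", "/dir2/index.html"))  # True (second record never reached)
```

With this input, can_fetch consults only the first record that matches 'python', so /dir2/ comes back as allowed, which is the behavior the report objects to (and which the reply below defends as RFC-conformant).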
Mark, if you dive into http://www.robotstxt.org/wc/norobots-rfc.txt you'll note
that the first matching user-agent line as well as the first
matching allow or disallow line must be obeyed by the robot
(see 3.2.1 and 3.2.2).
Now, I am not opposed to disobeying the above RFC, but
there are other arguments against your patch:
a) it breaks current implementations of robots.txt
(potentially disallowing access to sites)
b) your problem is easily solved by moving Disallow and/or
User-Agent lines to the top
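Workaround (b) can be sketched with the same hypothetical robots.txt and paths: once every Disallow line intended for 'python' lives in the first record that matches that agent, the single record consulted for it blocks both directories:

```python
import urllib.robotparser

# Workaround sketch: move the 'python'-specific rules so that the first
# record matching 'python' carries every Disallow line meant for it.
# The paths are illustrative.
ROBOTS_TXT = """\
User-agent: python
Disallow: /dir1/
Disallow: /dir2/

User-agent: wget
Disallow: /dir1/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("python", "/dir1/index.html"))  # False
print(rp.can_fetch("python", "/dir2/index.html"))  # False: now blocked too
print(rp.can_fetch("wget", "/dir2/index.html"))    # True: wget's record has no /dir2/ rule
```

No change to robotparser is needed; reordering the records gives the access pattern the report wanted while staying within the one-record-per-robot model of the RFC.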