Make robotparser.RobotFileParser ignore blank lines #57490

bernie9998 · 2011-10-27T20:30:44Z

BPO	13281
Nosy	@terryjreedy, @orsenthil, @ezio-melotti, @merwok, @akheron
Files	robotparser.py.patch: patch for RobotFileParser which ignores all blank lines

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2011-10-27.20:30:43.604>
labels = ['type-feature', 'library', '3.9', '3.10']
title = 'Make robotparser.RobotFileParser ignore blank lines'
updated_at = <Date 2020-11-17.20:21:04.497>
user = 'https://bugs.python.org/bernie9998'

bugs.python.org fields:

activity = <Date 2020-11-17.20:21:04.497>
actor = 'iritkatriel'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2011-10-27.20:30:43.604>
creator = 'bernie9998'
dependencies = []
files = ['23538']
hgrepos = []
issue_num = 13281
keywords = ['patch', 'needs review']
message_count = 12.0
messages = ['146518', '146536', '146586', '146601', '146619', '146668', '146675', '146679', '146731', '146871', '147478', '147541']
nosy_count = 6.0
nosy_names = ['terry.reedy', 'orsenthil', 'ezio.melotti', 'eric.araujo', 'bernie9998', 'petri.lehtinen']
pr_nums = []
priority = 'normal'
resolution = None
stage = 'test needed'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue13281'
versions = ['Python 3.9', 'Python 3.10']

bernie9998 · 2011-10-27T20:30:43Z

When attempting to parse a robots.txt file which has a blank line between allow/disallow rules, all rules after the blank line are ignored.

If a blank line occurs between the user-agent and its rules, all of the rules for that user-agent are ignored.

I am not sure if having a blank line between rules is allowed in the spec, but I am seeing this behavior in a number of sites, for instance:

http://www.whitehouse.gov/robots.txt has a blank line between the disallow rules all other lines, including the associated user-agent line, resulting in the python RobotFileParser to ignore all rules.

http://www.last.fm/robots.txt appears to separate their rules with arbitrary blank lines between them. The python RobotFileParser only sees the first two rule between the user-agent and the next newline.

If the parser is changed to simply ignore all blank lines, would it have any adverse affect on parsing robots.txt files?

I am including a simple patch which ignores all blank lines and appears to find all rules from these robots.txt files.

akheron · 2011-10-28T05:43:23Z

Blank lines are allowed according to the specification at http://www.robotstxt.org/norobots-rfc.txt, section 3.3 Formal Syntax.

The issue also seems to exist on 3.2 and 3.3.

terryjreedy · 2011-10-29T01:06:24Z

Because of the line break, clicking that link gives "Server error 404".
http://www.robotstxt.org/norobots-rfc.txt
works (so please pay attention to formatting). The main page is
http://www.robotstxt.org/robotstxt.html

The way I read the grammar, 'records' (which start with an agent line) cannot have blank lines and must be separated by blank lines. Other than than, the suggestion seems reasonable, but it also seems like a feature request. Does test/test_robotparser pass with the patch?

I also do not see "Crawl-delay" and "Sitemap" (from whitehouse.gov) in the grammar referenced above. So I wonder if de facto practice has evolved.

Philip S.: do you have any opinions?
(I am asking you because of your comments on bpo-1437699.)

akheron · 2011-10-29T10:11:21Z

Because of the line break, clicking that link gives "Server error 404".

I don't see a line break, but the comma after the link seems to breaks it. Sorry.

The way I read the grammar, 'records' (which start with an agent
line) cannot have blank lines and must be separated by blank lines.

Ah, true. But it seems to me that having blank lines elsewhere doesn't break the parsing. If other robots.txt parser implementations allow arbitrary blank lines, we could add a strict=False parameter to make the parser non-strict. This would be a new feature of course.

Does the parser currently handle blank lines between full records (agentline(s) + ruleline(s)) correctly?

I also do not see "Crawl-delay" and "Sitemap" (from whitehouse.gov) in the grammar referenced above. So I wonder if de facto practice has evolved.

The spec says:

Lines with Fields not explicitly specified by this specification
may occur in the /robots.txt, allowing for future extension of the
format.

So these seem to be nonstandard extensions.

terryjreedy · 2011-10-29T20:02:24Z

Sorry, the visual linebreak depends on font size. It *is* the comma that caused the problem.

You missed my question about the current test suite.

Senthil, you are the listed expert for urllib, which includes robotparser. Any opinions on what to do?

orsenthil · 2011-10-31T00:01:49Z

I agree with your interpretation of the RFC. The parsing rules do not specify any provision for inclusion of blank lines "within" the records.

However, I find that inclusion is no harm either. I checked that with a robots.txt parser (Google webmaster tools) and presented the last.fm's robots.txt file which had blank line within records. As expected, it did not crib.

I would say that we can be lenient on this front and the question would if we allow, would it break any parsing rules? I think, no.

The patch does not break any tests, but a new test should be added to reflect this situation.

I don't have a strong opinion on having a strict=(True|False) for the blank line accommodation within records(only). I think, it is better we don't add a new parameter and just be lenient.

terryjreedy · 2011-10-31T03:22:28Z

Since following the spec is not a bug, this issue is a feature request for 3.3. I agree with just being lenient with no added parameter. Perhaps that should be mentioned in the doc with (or in) a version-added note. Senthil: does doc standard allow something like

Version added 3.3: Ignore blanks lines within record groups.
?

ezio-melotti · 2011-10-31T09:26:02Z

If it's added it should be a versionchanged, not a versionadded.
I'm also not entirely sure this should be considered a new feature and don't see the point of having a strict mode. IMHO robotparser should honor what the robots.txt files say, and not doing so because there's an extra blank line doesn't strike me as a useful behavior. Of course it shouldn't parse every kind of broken syntax, but the OP pointed to two fairly popular websites that use blank lines and that seems to indicate that blank lines are generally allowed.

terryjreedy · 2011-10-31T19:00:39Z

The robotparser is currently doing exactly what it is documented as doing. 20.9. urllib.robotparser — Parser for robots.txt
says "For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html." (Since there are no previous details, 'more' should be deleted.) That page, in turn, says

'''The file consists of one or more records separated by one or more blank lines (terminated by CR,CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>".'''

The formal grammar says the same thing. The page goes on with

'''Comments ... are discarded completely, and therefore do not indicate a record boundary.'''

followed by

'''The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.'''

Not allowing blank lines within records is obviously, to me, intentional and not an accidental oversight. It aids error detection. Consider:

User-agent: A ...
Disallow: ...

User-aget: B ...
Disallow: ...

Currently, the blank line signals a new record, the misspelled 'User-aget' line is ignored, and the new record, starting with 'Disallow' instead of 'User-agent' is correctly seen as an error and ignored. The same would be true if the User-agent line were accidentally omitted. When humans edit files, perhaps from someone else's notes, such things happen.

With this change, the second disallow line will be incorrectly attributed to A. We can justify that on the hypothesis that intentional blank lines within record, in violation of the standard, are now more common than missing or misspelled User-Agent lines. Or we can decide that mis-attributing Disallow lines is a lesser sin than ignoring them. But the change is pretty plainly a feature change and not a bug fix.

My current suggested doc change is to replace the sentence quoted at the top with
"Such files are parsed according to the rules given at http://www.robotstxt.org/orig.html , with the exception that blank lines are allowed within records.
Versionchanged 3.3: allow blank lines within records"

Side note: The example in the doc uses musi-cal.com. We need a replacement as it was closed last June, as noted in
http://www.wolfgangsvault.com/blog/index.php/2011/06/closing-mojam-com-and-musi-cal-com/

akheron · 2011-11-02T19:20:11Z

My current suggested doc change is to replace the sentence quoted at the top with

Sounds good to me.

merwok · 2011-11-12T11:51:37Z

First, I’d like to remind that the robots spec is not an official Internet spec backed up by an official body. It’s also not as important as (say) HTTP parsing.

For this bug, IMO the guiding principle should be Postel’s Law. What harm is there in being more lenient than the spec? People apparently want to parse the robots.txt with blank lines from last.fm and whitehouse.gov, and I don’t think there are people that depend on the fact that blank lines cause the rest of the file to be ignored. Hence, I think too that we should be pragmatic and allow blank lines, to follow the precedent established by other tools and be pragmatic.

If you feel strongly about this, I can contact the robotstxt.org people.

terryjreedy · 2011-11-13T02:41:13Z

My suggested doc change is how to change the doc along with the patch.

rtb-zla-karma · 2024-01-26T15:00:41Z

Does the whole discussion died down? Was there any followup which isn't linked?

I've just experienced substantial surprised Pikachu face that our project scraped web pages disallowed in robots.txt just because someone added blank lines inside a record (which is very common).

Yes, spec (which we didn't read) is linked in docs but I seriously don't think that most people working in web-related stuff know HOW SUBSTANTIAL single blank line is inside robots.txt record. Simple visual aid makes the rule not do its job. To prove that people have no idea is a fact that https://www.last.fm/robots.txt is still going strong with these visual aids AFTER 12 YEARS.

The case with invalid record

User-agent: A ...
Disallow: ...

User-aget: B ...
Disallow: ...

doesn't justify how blank lines are treated in spec. Most people will consider the above broken anyway because of the typo and will consider example below as not broken but hard to read:

User-agent: A ...
Disallow: ...

Disallow: ...
User-agent: B ...
Disallow: ...

If you know it you know, but if not, in worst case website owner will start legal action against someones crawler which they can defend with "your file doesn't follow the spec that's why we crawled these pages" hoping that owner will get it the first time and won't have bunch of bored lawyers who wait to get their hands dirty.
I know it can happen for both of examples above but "User-aget" is easier to reason and defend than a blank line.

Moreover I think test cases in test_robotparser.py are not exhaustive enough given that these edge case scenarios aren't covered.

EDIT: What is even more stupid (from human perspective not programmers) is that if I put e.g. one space/tab in every blank line then file works as expected. COME ON!

bernie9998 mannequin added the type-bug An unexpected behavior, bug, or error label Oct 27, 2011

akheron added the stdlib Python modules in the Lib dir label Oct 28, 2011

terryjreedy changed the title ~~robotparser.RobotFileParser ignores rules preceeded by a blank line~~ Make robotparser.RobotFileParser ignore blank lines Oct 31, 2011

terryjreedy added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Oct 31, 2011

iritkatriel added 3.9 only security fixes 3.10 only security fixes labels Nov 17, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make robotparser.RobotFileParser ignore blank lines #57490

Make robotparser.RobotFileParser ignore blank lines #57490

bernie9998 mannequin commented Oct 27, 2011

bernie9998 mannequin commented Oct 27, 2011

akheron commented Oct 28, 2011

terryjreedy commented Oct 29, 2011

akheron commented Oct 29, 2011

terryjreedy commented Oct 29, 2011

orsenthil commented Oct 31, 2011

terryjreedy commented Oct 31, 2011

ezio-melotti commented Oct 31, 2011

terryjreedy commented Oct 31, 2011

akheron commented Nov 2, 2011

merwok commented Nov 12, 2011

terryjreedy commented Nov 13, 2011

rtb-zla-karma commented Jan 26, 2024 •

edited

Make robotparser.RobotFileParser ignore blank lines #57490

Make robotparser.RobotFileParser ignore blank lines #57490

Comments

bernie9998 mannequin commented Oct 27, 2011

bernie9998 mannequin commented Oct 27, 2011

akheron commented Oct 28, 2011

terryjreedy commented Oct 29, 2011

akheron commented Oct 29, 2011

terryjreedy commented Oct 29, 2011

orsenthil commented Oct 31, 2011

terryjreedy commented Oct 31, 2011

ezio-melotti commented Oct 31, 2011

terryjreedy commented Oct 31, 2011

akheron commented Nov 2, 2011

merwok commented Nov 12, 2011

terryjreedy commented Nov 13, 2011

rtb-zla-karma commented Jan 26, 2024 • edited

rtb-zla-karma commented Jan 26, 2024 •

edited