urllib.robotparser: incomplete __str__ methods #77042
Hello, I have stumbled upon a couple of inconsistencies in urllib.robotparser's __str__ methods. These appear to be unintentional omissions; basically the code was modified but the string methods were never updated.
>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: *
... Allow: /some/path
... Disallow: /another/path
...
... User-agent: Googlebot
... Allow: /folder1/myfile.html
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
User-agent: Googlebot
Allow: /folder1/myfile.html
>>>
This is *especially* awkward when parsing a valid robots.txt that only contains a wildcard User-agent.
>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: *
... Allow: /some/path
... Disallow: /another/path
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
>>>
Crawl-delay and Request-rate directives are likewise missing from the output:
>>> from urllib.robotparser import RobotFileParser
>>> parser = RobotFileParser()
>>> text = """
... User-agent: figtree
... Crawl-delay: 3
... Request-rate: 9/30
... Disallow: /tmp
... """
>>> parser.parse(text.splitlines())
>>> print(parser)
User-agent: figtree
Disallow: /tmp
Taken on their own these are all minor issues, but they do make things quite confusing when using robotparser from the REPL!
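For anyone reproducing this, the following quick check suggests that the directives are parsed and applied, and that only the string output leaves them out. It is a sketch that uses only the public can_fetch, crawl_delay, and request_rate methods; the agent name "somebot" is a placeholder and not part of the report.

```python
from urllib.robotparser import RobotFileParser

# Wildcard-only file: print(parser) shows nothing, yet the rules are applied.
wildcard = RobotFileParser()
wildcard.parse("""
User-agent: *
Allow: /some/path
Disallow: /another/path
""".splitlines())

blocked = wildcard.can_fetch("somebot", "/another/path")
allowed = wildcard.can_fetch("somebot", "/some/path")

# Crawl-delay and Request-rate are stored even though str() omits them.
figtree = RobotFileParser()
figtree.parse("""
User-agent: figtree
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
""".splitlines())

delay = figtree.crawl_delay("figtree")
rate = figtree.request_rate("figtree")

print(blocked, allowed, delay, rate.requests, rate.seconds)
```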
The default entry was moved out of entries in bpo-523041, but RobotFileParser.__str__ was not updated. Support for "Crawl-delay" and "Request-Rate" was added in bpo-16099, but Entry.__str__ was not updated. These look like bugs to me, and I think the fix should be backported. However, the two unnecessary trailing newlines should be kept for compatibility in maintained versions. I think we can get rid of them in 3.8 (unless Senthil has another opinion).
Yup, that sounds good to me. The trailing newlines don't seem to be an RFC requirement. They are only kept for compatibility, and we can do away with them in 3.8.
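Until a fix is available, the two omissions discussed in this thread could be worked around with a small helper. This is only a sketch, not the patch itself: dump_rules is a hypothetical name, and it assumes the CPython attributes entries, default_entry, useragents, rulelines, delay, and req_rate, which are implementation details rather than documented API. It also does not reproduce the trailing newlines mentioned above.

```python
from urllib.robotparser import RobotFileParser


def dump_rules(parser):
    """Hypothetical helper: render every parsed entry, default entry last."""
    entries = list(parser.entries)
    # Assumption: the wildcard entry is stored separately in default_entry.
    if parser.default_entry is not None:
        entries.append(parser.default_entry)

    blocks = []
    for entry in entries:
        lines = [f"User-agent: {agent}" for agent in entry.useragents]
        # Assumption: delay and req_rate are None when the directive is absent.
        if entry.delay is not None:
            lines.append(f"Crawl-delay: {entry.delay}")
        if entry.req_rate is not None:
            rate = entry.req_rate
            lines.append(f"Request-rate: {rate.requests}/{rate.seconds}")
        lines.extend(str(rule) for rule in entry.rulelines)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


parser = RobotFileParser()
parser.parse("""
User-agent: *
Allow: /some/path

User-agent: figtree
Crawl-delay: 3
Request-rate: 9/30
Disallow: /tmp
""".splitlines())

print(dump_rules(parser))
```

Listing the default entry last mirrors the order in which the parser consults it: named user agents are matched first and the wildcard entry is the fallback.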