robotparser doesn't support request rate and crawl delay parameters #60303
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
assignee = 'https://github.com/berkerpeksag'
closed_at = <Date 2015-10-08.09:34:21.553>
created_at = <Date 2012-10-01.12:58:25.141>
labels = ['easy', 'type-feature', 'library']
title = "robotparser doesn't support request rate and crawl delay parameters"
updated_at = <Date 2015-10-08.09:34:21.551>
user = 'https://bugs.python.org/XapaJIaMnu'

activity = <Date 2015-10-08.09:34:21.551>
actor = 'berker.peksag'
assignee = 'berker.peksag'
closed = True
closed_date = <Date 2015-10-08.09:34:21.553>
closer = 'berker.peksag'
components = ['Library (Lib)']
creation = <Date 2012-10-01.12:58:25.141>
creator = 'XapaJIaMnu'
dependencies =
files = ['27373', '27374', '27476', '27477', '33071', '35377']
hgrepos =
issue_num = 16099
keywords = ['patch', 'easy', 'needs review']
message_count = 17.0
messages = ['171711', '171712', '171715', '171719', '172327', '172338', '205567', '205641', '205755', '205761', '208721', '219212', '223099', '225916', '252483', '252521', '252525']
nosy_count = 7.0
nosy_names = ['rhettinger', 'orsenthil', 'christian.heimes', 'python-dev', 'berker.peksag', 'hynek', 'XapaJIaMnu']
pr_nums =
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue16099'
versions = ['Python 3.6']
Robotparser doesn't support two quite important optional parameters from the robots.txt file. I have implemented those in the following way:
crawl_delay(useragent) - Returns the time in seconds that the given user agent needs to wait between requests while crawling, as specified in robots.txt
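
For reference, the feature as it eventually landed in Python 3.6's urllib.robotparser exposes this as RobotFileParser.crawl_delay(useragent) together with a companion RobotFileParser.request_rate(useragent), which returns a RequestRate named tuple (requests, seconds) or None. Below is a minimal usage sketch; the robots.txt contents and user-agent names are invented for illustration:

```python
# Minimal usage sketch of the accessors as they shipped in Python 3.6's
# urllib.robotparser. The robots.txt contents and user-agent names below
# are invented purely for illustration.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: CustomCrawler
Crawl-delay: 5
Request-rate: 10/60

User-agent: *
Crawl-delay: 1
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.crawl_delay("CustomCrawler"))   # 5
rate = rp.request_rate("CustomCrawler")  # RequestRate(requests=10, seconds=60)
if rate is not None:
    print(rate.requests, rate.seconds)   # 10 60
print(rp.crawl_delay("SomeOtherBot"))    # 1 -- falls back to the '*' entry
print(rp.request_rate("SomeOtherBot"))   # None -- no Request-rate for '*'
```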
Thank you for the review!
Crawl delay and request rate parameters are targeted at custom crawlers that many people and companies write for specific tasks. Google Webmaster Tools is targeted only at Google's crawler, and web admins typically set different rates for Google/Yahoo/Bing than for all other user agents.
I have followed the model of what was written beforehand. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, as all previously implemented functions employ the same linear scan over the parsed entries. I don't think this matters much here, since these two functions only need to be called once per domain and a robots.txt file seldom contains more than three entries. This is why I have just followed the design laid out by the original developer.
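
For reference, the linear-scan pattern being described looks roughly like the sketch below. It reuses RobotFileParser's internal names (entries, default_entry, Entry.applies_to) for illustration and is not the exact code from the attached patches:

```python
# Illustrative sketch of the O(n) lookup style used throughout robotparser:
# scan every parsed Entry until one applies to the given user agent.
def crawl_delay(parser, useragent):
    for entry in parser.entries:           # every call walks the whole list
        if entry.applies_to(useragent):    # first matching User-agent wins
            return entry.delay
    if parser.default_entry:               # fall back to the catch-all '*' entry
        return parser.default_entry.delay
    return None                            # no Crawl-delay given anywhere
```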