robotparser doesn't support request rate and crawl delay parameters #60303
Comments
Robotparser doesn't support two quite important optional parameters from the robots.txt file: Crawl-delay and Request-rate. I have implemented those in the following way: crawl_delay(useragent) returns the time in seconds that you need to wait between requests while crawling, and request_rate(useragent) returns the request rate for that user agent; both return None when the parameter is not present in robots.txt.
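For illustration, here is how the two methods ended up being used in urllib.robotparser (they are available since Python 3.6, where the patch landed); the robots.txt content below is just a made-up example, parsed in memory so no network access is needed:

```python
import urllib.robotparser

# Made-up robots.txt content for demonstration purposes.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Request-rate: 3/15
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.crawl_delay("*"))           # 10 -- seconds to wait between requests
rate = rp.request_rate("*")          # named tuple with .requests and .seconds
print(rate.requests, rate.seconds)   # 3 15, i.e. 3 requests per 15 seconds
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
```

Both methods return None when the corresponding parameter does not appear in robots.txt.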
Thanks for the patch. New features must be implemented in Python 3.4. Python 2.7 is in feature freeze mode and therefore doesn't get new features.
Okay, sorry, I didn't know that (: Feedback is welcome, as always (:
We have a team that mentors new contributors. If you are interested in getting your patch into Python 3.4, please read http://pythonmentors.com/ . The people are really friendly and will help you with every step of the process.
Okay, here's a proper patch with a documentation entry and test cases.
Reformatted patch.
Hey Nick, I left a few comments on Rietveld.
Thank you for the review! http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser.rst http://bugs.python.org/review/16099/diff/6206/Doc/library/urllib.robotparser....
The Crawl-delay and Request-rate parameters are targeted at custom crawlers that many people and companies write for specific tasks. Google Webmaster Tools is targeted only at Google's crawler, and web admins typically set different rates for Google/Yahoo/Bing than for all other user agents. http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py http://bugs.python.org/review/16099/diff/6206/Lib/urllib/robotparser.py#newco...
I have followed the model of what was written beforehand. An O(1) implementation (probably based on dictionaries) would require a complete rewrite of this library, as all previously implemented functions rely on the same linear scan over the parsed entries. I don't think this matters much here, as those two functions only need to be called once per domain and a robots.txt file seldom contains more than three entries. This is why I have just followed the design laid out by the original developer. Thanks, Nick
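The lookup pattern described in the comment above, a linear scan over the parsed entries rather than an O(1) dictionary keyed by user agent, looks roughly like the following trimmed-down, hypothetical skeleton (the attribute names mirror urllib.robotparser, but this is not the real class). It also illustrates the per-agent point from the previous paragraph: a named crawler can get a different rate than the catch-all "*" entry.

```python
from collections import namedtuple

RequestRate = namedtuple("RequestRate", "requests seconds")

class Entry:
    """One User-agent block from robots.txt (simplified)."""
    def __init__(self, useragents, delay=None, req_rate=None):
        self.useragents = useragents      # e.g. ["*"] or ["FriendlyBot"]
        self.delay = delay                # Crawl-delay in seconds, or None
        self.req_rate = req_rate          # RequestRate or None

    def applies_to(self, useragent):
        # Wildcard/substring matching is what makes a plain dict lookup awkward.
        useragent = useragent.split("/")[0].lower()
        return any(a == "*" or a.lower() in useragent for a in self.useragents)


class TinyRobotParser:
    def __init__(self, entries):
        self.entries = entries

    def crawl_delay(self, useragent):
        # Same O(n) scan that can_fetch() already performs.
        for entry in self.entries:
            if entry.applies_to(useragent):
                return entry.delay
        return None

    def request_rate(self, useragent):
        for entry in self.entries:
            if entry.applies_to(useragent):
                return entry.req_rate
        return None


# Hypothetical per-agent rates: a named crawler gets a faster rate than "*".
parser = TinyRobotParser([
    Entry(["FriendlyBot"], delay=1, req_rate=RequestRate(10, 5)),
    Entry(["*"], delay=30, req_rate=RequestRate(1, 60)),
])
print(parser.crawl_delay("FriendlyBot"))    # 1
print(parser.crawl_delay("SomeOtherBot"))   # 30, falls through to "*"
```

With only a handful of entries per robots.txt file and one lookup per domain, the linear scan is indeed negligible, which is the point the comment makes.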
Oh... sorry for the spam. Could you please verify my documentation link syntax? I'm not entirely sure I got it right.
Hey, just a friendly reminder that there hasn't been any activity for a month and I have released a v2, pending review (:
Updated patch with all comments addressed; sorry for the six-month delay. Please review.
Hey, just a friendly reminder that there has been no activity for a month and a half and v3 is pending review (:
Hey, just a friendly reminder that the patch is pending review and there has been no activity for three months (:
Hey, friendly reminder that there has been no activity on this issue for more than a year. Cheers, Nick
New changeset dbed7cacfb7e by Berker Peksag in branch 'default':
I've finally committed your patch to default. Thank you for not giving up, Nikolay :) Note that currently the link in the example section doesn't work. I will open a new issue for that.