Make Protego the default robots.txt parser #3969

Closed
anubhavp28 opened this issue Aug 19, 2019 · 4 comments
anubhavp28 (Contributor) commented Aug 19, 2019

Protego is a more compliant robots.txt parser than the current default, RobotFileParser. It supports wildcard matching, length-based directive ordering, sitemaps, and the crawl-delay and request-rate directives. It is also compatible with Google's robots.txt parser. AFAIK, it has a bigger test suite than any other open source robots.txt parser. The latest version of Protego is on PyPI and can be installed with pip install protego.
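
To make two of the features above concrete, here is a minimal sketch of wildcard matching combined with length-based directive ordering. It illustrates the technique only and is not Protego's implementation: the helper names `pattern_to_regex` and `is_allowed` are hypothetical, the sample rules are made up, and the tie-breaking rule (Allow wins when two matching patterns have the same length) is an assumption modelled on Google's documented behaviour.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    (Hypothetical helper for illustration.)
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Decide whether `path` may be fetched.

    `rules` is a list of (directive, pattern) pairs, where directive is
    'allow' or 'disallow'. The longest matching pattern wins; on a tie,
    'allow' wins (assumption). A path matched by no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing in this sketch
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# Made-up rules for a single user agent.
rules = [
    ("disallow", "/private/"),
    ("allow", "/private/public-page.html"),
    ("disallow", "/*.pdf$"),
]

for path in (
    "/private/notes.html",
    "/private/public-page.html",
    "/docs/report.pdf",
    "/docs/report.pdf?download=1",
):
    print(path, is_allowed(rules, path))
```

A parser that applies rules in file order and matches by plain prefix would treat `/private/public-page.html` as disallowed here, because the shorter `/private/` rule comes first; ordering by pattern length lets the more specific Allow rule take precedence.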

It is also faster than other Python-based candidates. I ran a benchmark using popular robots.txt parsers:

I crawled multiple domains (~1,100 domains) and downloaded links to pages as the crawler encountered them. I downloaded 111,824 links in total. Next, I made each robots.txt parser parse the files and answer queries (each parser answered each query 5 times) in an order similar to how it would need to in a broad crawl. Here are the results:

  • Protego (Python based) :
    25th percentile : 0.002419 seconds
    50th percentile : 0.006798 seconds
    75th percentile : 0.014307 seconds
    100th percentile : 2.546978 seconds
    Total Time : 19.545984 seconds

  • RobotFileParser (default in Scrapy) :
    25th percentile : 0.002188 seconds
    50th percentile : 0.005350 seconds
    75th percentile : 0.010492 seconds
    100th percentile : 1.805923 seconds
    Total Time : 13.799954 seconds

  • Robotexclusionrulesparser (Python based) :
    25th percentile : 0.001288 seconds
    50th percentile : 0.005222 seconds
    75th percentile : 0.014640 seconds
    100th percentile : 52.706880 seconds
    Total Time : 76.460496 seconds

  • Reppy Parser (C++ based with Python interface) :
    25th percentile : 0.000210 seconds
    50th percentile : 0.000492 seconds
    75th percentile : 0.000997 seconds
    100th percentile : 0.129440 seconds
    Total Time: 1.405558 seconds

The code and data used for benchmarking are here.
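
The linked code is the authoritative version of the benchmark. As a rough sketch of the methodology described above, the snippet below times one parser on one domain's robots.txt (parse once, then answer every query 5 times) and summarizes the per-domain timings as percentiles plus a total. It rests on several assumptions: each reported measurement is taken per domain, the percentiles use linear interpolation, and the function names `time_domain` and `summarize` as well as the sample data are hypothetical. The standard library's `RobotFileParser` stands in for the parser under test.

```python
import statistics
import time
from urllib.robotparser import RobotFileParser


def time_domain(robots_txt, urls, user_agent="*", repeats=5):
    """Seconds spent parsing one robots.txt and answering each query `repeats` times."""
    start = time.perf_counter()
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url in urls:
        for _ in range(repeats):
            parser.can_fetch(user_agent, url)
    return time.perf_counter() - start


def summarize(timings):
    """Quartiles, maximum (100th percentile), and total of per-domain timings.

    Needs at least two timings; uses linear interpolation (assumption).
    """
    q1, median, q3 = statistics.quantiles(timings, n=4, method="inclusive")
    return {
        "p25": q1,
        "p50": median,
        "p75": q3,
        "p100": max(timings),
        "total": sum(timings),
    }


# Made-up sample: domain -> (robots.txt body, URLs seen during the crawl).
sample = {
    "example.com": (
        "User-agent: *\nDisallow: /private/\n",
        ["https://example.com/", "https://example.com/private/a"],
    ),
    "example.org": (
        "User-agent: *\nDisallow: /tmp/\nCrawl-delay: 2\n",
        ["https://example.org/tmp/x", "https://example.org/about"],
    ),
}

timings = [time_domain(body, urls) for body, urls in sample.values()]
for name, value in summarize(timings).items():
    print(f"{name}: {value:.6f} seconds")
```

To compare parsers this way, the same `sample` would be run through each parser's own parse and query calls, keeping the query order identical across parsers.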

anubhavp28 changed the title from "[Discussion] Making Protego as the default robots.txt parser." to "[Discussion] Make Protego as the default robots.txt parser." on Aug 19, 2019
Gallaecio (Member) commented Aug 19, 2019

cc @kmike @dangra @whalebot-helmsman

Gallaecio changed the title from "[Discussion] Make Protego as the default robots.txt parser." to "Make Protego as the default robots.txt parser" on Aug 19, 2019
Gallaecio changed the title from "Make Protego as the default robots.txt parser" to "Make Protego the default robots.txt parser" on Aug 19, 2019
kmike (Member) commented Oct 7, 2019

Fixed by #4006.

kmike closed this as completed on Oct 7, 2019
bfelds commented Nov 2, 2019

@whalebot-helmsman an FYI: I just installed with conda, and it seems Protego isn't in conda-forge and isn't listed as a dependency. I had to install it manually with pip into the conda environment.

Gallaecio (Member) commented Nov 2, 2019

> @whalebot-helmsman an FYI: I just installed with conda, and it seems Protego isn't in conda-forge and isn't listed as a dependency. I had to install it manually with pip into the conda environment.

Thanks, I’ll try to get it fixed next week.
