Protego is a more compliant robots.txt parser than the current default, RobotFileParser. It supports wildcard matching, length-based directive ordering, sitemaps, and the crawl-delay and request-rate directives. It is also compatible with Google's robots.txt parser. As far as I know, it has a bigger test suite than any other open-source robots.txt parser. The latest version of Protego is on PyPI and can be installed with pip install protego.
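To make "wildcard matching" and "length-based directive ordering" concrete, here is a minimal sketch of the idea in plain Python. It is only an illustration of the rule (the longest matching pattern decides, and allow wins a tie), not Protego's actual implementation; the helper names `pattern_to_regex` and `is_allowed` and the sample rules are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Decide access with longest-pattern-wins ordering.

    `rules` is a list of (directive, pattern) tuples (hypothetical format).
    The longest matching pattern decides; on a tie, allow wins.
    A path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern restricts nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive == "allow"
            length = len(pattern)
            if length > best_len or (length == best_len and is_allow):
                best_len = length
                allowed = is_allow
    return allowed


# Hypothetical rules for a single user agent.
rules = [
    ("disallow", "/shop/"),
    ("allow", "/shop/public/"),
    ("disallow", "/*.pdf$"),
]

print(is_allowed(rules, "/shop/cart"))         # blocked by /shop/
print(is_allowed(rules, "/shop/public/item"))  # longer allow rule wins
print(is_allowed(rules, "/docs/file.pdf"))     # blocked by the wildcard rule
```

A parser that applies rules in file order instead would treat `/shop/public/item` as blocked here, because the shorter `/shop/` rule comes first.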
Its speed is also comparable to the other Python-based candidates: in the benchmark below it is slower than RobotFileParser but considerably faster than Robotexclusionrulesparser in total time. I ran the benchmark using popular robots.txt parsers:
I crawled about 1,100 domains and collected links to pages as the crawler encountered them, 111,824 links in total. I then made each robots.txt parser parse the robots.txt files and answer queries (each query 5 times) in an order similar to what a broad crawl would require. Here are the results:
-
Protego (Python based) :
25th percentile : 0.002419 seconds
50th percentile : 0.006798 seconds
75th percentile : 0.014307 seconds
100th percentile : 2.546978 seconds
Total Time : 19.545984 seconds
-
RobotFileParser (default in Scrapy) :
25th percentile : 0.002188 seconds
50th percentile : 0.005350 seconds
75th percentile : 0.010492 seconds
100th percentile : 1.805923 seconds
Total Time : 13.799954 seconds
-
Robotexclusionrulesparser (Python based) :
25th percentile : 0.001288 seconds
50th percentile : 0.005222 seconds
75th percentile : 0.014640 seconds
100th percentile : 52.706880 seconds
Total Time : 76.460496 seconds
-
Reppy Parser (C++ based with Python interface) :
25th percentile : 0.000210 seconds
50th percentile : 0.000492 seconds
75th percentile : 0.000997 seconds
100th percentile : 0.129440 seconds
Total Time : 1.405558 seconds
The code and data used for benchmarking are here.
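For readers who want to reproduce this kind of measurement, a rough sketch of the timing loop follows. It is not the benchmark code referenced above. It uses the standard library's `urllib.robotparser.RobotFileParser`, and it assumes the percentiles are taken over per-domain timings (parse once, then answer every query 5 times), which is my reading of the description. The `sample` data and the helper names `time_parser` and `summarize` are hypothetical.

```python
import time
from statistics import quantiles
from urllib.robotparser import RobotFileParser


def time_parser(robots_body, urls, user_agent="*", repeats=5):
    """Return the seconds spent parsing one robots.txt and answering all queries."""
    start = time.perf_counter()
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    for url in urls:
        for _ in range(repeats):
            parser.can_fetch(user_agent, url)
    return time.perf_counter() - start


def summarize(timings):
    """Report quartiles, maximum (100th percentile) and total time."""
    q25, q50, q75 = quantiles(timings, n=4, method="inclusive")
    return {
        "p25": q25,
        "p50": q50,
        "p75": q75,
        "p100": max(timings),
        "total": sum(timings),
    }


# Hypothetical stand-in for the crawled data: domain -> (robots.txt body, URLs).
sample = {
    "a.example": ("User-agent: *\nDisallow: /private/\n", ["https://a.example/private/x"]),
    "b.example": ("User-agent: *\nAllow: /\n", ["https://b.example/", "https://b.example/p"]),
    "c.example": ("", ["https://c.example/index.html"]),
}

timings = [time_parser(body, urls) for body, urls in sample.values()]
report = summarize(timings)
for name, value in report.items():
    print(f"{name}: {value:.6f} seconds")
```

To compare another parser, only the three lines that build the parser and call `can_fetch` need to change. Note that argument order differs between libraries, so check each parser's own documentation.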