
Don't check robots.txt for local files #5807

Merged: 2 commits into scrapy:master on Jan 27, 2023

Conversation

Cj-Malone (Contributor)

Currently, Scrapy attempts to check robots.txt even for local files (and data URLs). You can reproduce this with:

# test.py
from scrapy import Spider


class TestSpider(Spider):
    name = "test"
    start_urls = [
        "data:text/plain,Hello World data",
        "file:///home/cj/hello world.txt",
    ]

    def parse(self, response, **kwargs):
        print(response.text)

$ pipenv run scrapy runspider test.py -s ROBOTSTXT_OBEY=True

2023-01-26 15:50:47 [scrapy.core.engine] INFO: Spider opened
2023-01-26 15:50:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-01-26 15:50:47 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET data:/robots.txt>: invalid data URI
Traceback (most recent call last):
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/w3lib/url.py", line 504, in parse_data_uri
    is_base64, data = uri.split(b",", 1)
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/twisted/internet/defer.py", line 206, in maybeDeferred
    result = f(*args, **kwargs)
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/scrapy/core/downloader/handlers/datauri.py", line 13, in download_request
    uri = parse_data_uri(request.url)
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/w3lib/url.py", line 506, in parse_data_uri
    raise ValueError("invalid data URI")
ValueError: invalid data URI
2023-01-26 15:50:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET data:text/plain,Hello%20World%20data> (referer: None)
Hello World data
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET file:///robots.txt> (failed 3 times): [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET file:///robots.txt>: [Errno 2] No such file or directory: '/robots.txt'
Traceback (most recent call last):
  File "/home/cj/Documents/Projects/scrapy/scrapy/core/downloader/middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/cj/.local/share/virtualenvs/scrapy-kVY4T-qA/lib/python3.10/site-packages/twisted/internet/defer.py", line 206, in maybeDeferred
    result = f(*args, **kwargs)
  File "/home/cj/Documents/Projects/scrapy/scrapy/core/downloader/handlers/file.py", line 15, in download_request
    body = Path(filepath).read_bytes()
  File "/usr/lib/python3.10/pathlib.py", line 1126, in read_bytes
    with self.open(mode='rb') as f:
  File "/usr/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///home/cj/hello%20world.txt> (referer: None)
Hello World file

2023-01-26 15:50:50 [scrapy.core.engine] INFO: Closing spider (finished)

$ pipenv run scrapy runspider test.py -s ROBOTSTXT_OBEY=False

2023-01-26 15:47:56 [scrapy.core.engine] INFO: Spider opened
2023-01-26 15:47:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-01-26 15:47:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET data:text/plain,Hello%20World%20data> (referer: None)
Hello World data
2023-01-26 15:47:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///home/cj/hello%20world.txt> (referer: None)
Hello World file

2023-01-26 15:47:57 [scrapy.core.engine] INFO: Closing spider (finished)

This patch makes the robots.txt middleware silently ignore those URLs, so the rest of the project can continue to honour robots.txt.
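
For illustration, here is a minimal sketch of the kind of guard this implies in RobotsTxtMiddleware.process_request. It is an approximation under the assumption that the check is a simple URL-scheme test; see the merged commit (30c0dc7) for the actual diff:

from twisted.internet.defer import maybeDeferred


class RobotsTxtMiddleware:
    # ... __init__, robot_parser and process_request_2 elided; they are
    # unchanged from scrapy/downloadermiddlewares/robotstxt.py ...

    def process_request(self, request, spider):
        if request.meta.get("dont_obey_robotstxt"):
            return
        # Sketch of the new guard: data: and file: URLs have no host
        # that could serve a robots.txt, so let them through instead of
        # attempting to download <scheme>:/robots.txt and failing.
        if request.url.startswith("data:") or request.url.startswith("file:"):
            return
        d = maybeDeferred(self.robot_parser, request, spider)
        d.addCallback(self.process_request_2, request, spider)
        return d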

@Gallaecio (Member)

Could you add a test for it?

@Cj-Malone (Contributor, Author)

I've no experience with tox, so it may be easier for you to do it.

@Gallaecio (Member)

Tox is trivial to use (tox -e py runs all tests with your system version of Python), but I've had a look at the existing tests in https://github.com/scrapy/scrapy/blob/master/tests/test_downloadermiddleware_robotstxt.py, and adding one for this may not be trivial.

I will look into it when I can find some time, but for the record, it might take weeks.
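
(For the record, a hypothetical sketch of what such a test might look like, loosely following the style of that file; the test name, the mocked crawler and the URLs are illustrative assumptions, not the test that ended up in this PR:)

from unittest import mock

from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.http import Request
from scrapy.settings import Settings


def test_robotstxt_skipped_for_local_urls():
    # A crawler stub with just enough settings to enable the middleware.
    crawler = mock.MagicMock()
    crawler.settings = Settings({"ROBOTSTXT_OBEY": True})
    middleware = RobotsTxtMiddleware(crawler)
    for url in ("data:text/plain,Hello World data", "file:///tmp/hello.txt"):
        # With the patch, the middleware should let these requests
        # through (return None) without scheduling a robots.txt fetch.
        assert middleware.process_request(Request(url), spider=None) is None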

@Cj-Malone (Contributor, Author)

If that test is good enough, 🚀! But if you want something more advanced, we'll have to wait until you're available. I guess if we're missing 2.8 anyway, it doesn't matter whether it's merged tomorrow or in a few weeks.

codecov bot commented Jan 27, 2023

Codecov Report

Merging #5807 (33b85a9) into master (e71eab6) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5807      +/-   ##
==========================================
+ Coverage   88.91%   88.93%   +0.02%     
==========================================
  Files         162      162              
  Lines       10988    10990       +2     
  Branches     1797     1798       +1     
==========================================
+ Hits         9770     9774       +4     
+ Misses        938      937       -1     
+ Partials      280      279       -1     
Impacted Files                             Coverage Δ
scrapy/downloadermiddlewares/robotstxt.py  100.00% <100.00%> (ø)
scrapy/core/downloader/__init__.py         92.48% <0.00%> (+1.50%) ⬆️

@Gallaecio (Member) left a comment


Awesome!!!

@wRAR (Member) commented Jan 27, 2023

Thanks!

@wRAR merged commit 30c0dc7 into scrapy:master on Jan 27, 2023
@Cj-Malone deleted the patch-1 branch on January 27, 2023, 20:06