
Don't check robots.txt for local files #5807

Merged: 2 commits into scrapy:master on Jan 27, 2023

Conversation

Cj-Malone (Contributor)

Currently, Scrapy attempts to check robots.txt even for local files (and data URLs). You can reproduce this with:

# test.py
from scrapy import Spider


class TestSpider(Spider):
    name = "test"
    start_urls = [
        "data:text/plain,Hello World data",
        "file:///home/cj/hello world.txt",
    ]

    def parse(self, response, **kwargs):
        print(response.text)

$ pipenv run scrapy runspider test.py -s ROBOTSTXT_OBEY=True

2023-01-26 15:50:47 [scrapy.core.engine] INFO: Spider opened
2023-01-26 15:50:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-01-26 15:50:47 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET data:/robots.txt>: invalid data URI
Traceback (most recent call last):
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/w3lib/url.py", line 504, in parse_data_uri
    is_base64, data = uri.split(b",", 1)
ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/twisted/internet/defer.py", line 206, in maybeDeferred
    result = f(*args, **kwargs)
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/scrapy/core/downloader/handlers/datauri.py", line 13, in download_request
    uri = parse_data_uri(request.url)
  File "/home/cj/.local/share/virtualenvs/alltheplaces-E_LDghuV/lib/python3.10/site-packages/w3lib/url.py", line 506, in parse_data_uri
    raise ValueError("invalid data URI")
ValueError: invalid data URI
2023-01-26 15:50:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET data:text/plain,Hello%20World%20data> (referer: None)
Hello World data
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 1 times): [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET file:///robots.txt> (failed 2 times): [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET file:///robots.txt> (failed 3 times): [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET file:///robots.txt>: [Errno 2] No such file or directory: '/robots.txt'
Traceback (most recent call last):
  File "/home/cj/Documents/Projects/scrapy/scrapy/core/downloader/middleware.py", line 52, in process_request
    return (yield download_func(request=request, spider=spider))
  File "/home/cj/.local/share/virtualenvs/scrapy-kVY4T-qA/lib/python3.10/site-packages/twisted/internet/defer.py", line 206, in maybeDeferred
    result = f(*args, **kwargs)
  File "/home/cj/Documents/Projects/scrapy/scrapy/core/downloader/handlers/file.py", line 15, in download_request
    body = Path(filepath).read_bytes()
  File "/usr/lib/python3.10/pathlib.py", line 1126, in read_bytes
    with self.open(mode='rb') as f:
  File "/usr/lib/python3.10/pathlib.py", line 1119, in open
    return self._accessor.open(self, mode, buffering, encoding, errors,
FileNotFoundError: [Errno 2] No such file or directory: '/robots.txt'
2023-01-26 15:50:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///home/cj/hello%20world.txt> (referer: None)
Hello World file

2023-01-26 15:50:50 [scrapy.core.engine] INFO: Closing spider (finished)

$ pipenv run scrapy runspider test.py -s ROBOTSTXT_OBEY=False

2023-01-26 15:47:56 [scrapy.core.engine] INFO: Spider opened
2023-01-26 15:47:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-01-26 15:47:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET data:text/plain,Hello%20World%20data> (referer: None)
Hello World data
2023-01-26 15:47:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///home/cj/hello%20world.txt> (referer: None)
Hello World file

2023-01-26 15:47:57 [scrapy.core.engine] INFO: Closing spider (finished)

This patch makes the robots.txt middleware silently ignore those URLs, so the rest of the project can continue to honour robots.txt.
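
For illustration, here is a minimal sketch of the kind of guard this implies in RobotsTxtMiddleware.process_request. It is an approximation under the assumption that the check is a simple URL-scheme test; see the merged commit (30c0dc7) for the actual diff:

from twisted.internet.defer import maybeDeferred


class RobotsTxtMiddleware:
    # ... __init__, robot_parser and process_request_2 elided; they are
    # unchanged from scrapy/downloadermiddlewares/robotstxt.py ...

    def process_request(self, request, spider):
        if request.meta.get("dont_obey_robotstxt"):
            return
        # Sketch of the new guard: data: and file: URLs have no host
        # that could serve a robots.txt, so let them through instead of
        # attempting to download <scheme>:/robots.txt and failing.
        if request.url.startswith("data:") or request.url.startswith("file:"):
            return
        d = maybeDeferred(self.robot_parser, request, spider)
        d.addCallback(self.process_request_2, request, spider)
        return d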

@Gallaecio (Member)

Could you add a test for it?

@Cj-Malone (Contributor, Author)

I've no experience with tox, so it may be easier for you to do it.

@Gallaecio (Member)

Tox is trivial to use (tox -e py runs all tests with your system version of Python), but I've had a look at the existing tests in https://github.com/scrapy/scrapy/blob/master/tests/test_downloadermiddleware_robotstxt.py, and adding one for this may not be trivial.

I will look into it when I can find some time, but for the record, it might take weeks.
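
(For the record, a hypothetical sketch of what such a test might look like, loosely following the style of that file; the test name, the mocked crawler and the URLs are illustrative assumptions, not the test that ended up in this PR:)

from unittest import mock

from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.http import Request
from scrapy.settings import Settings


def test_robotstxt_skipped_for_local_urls():
    # A crawler stub with just enough settings to enable the middleware.
    crawler = mock.MagicMock()
    crawler.settings = Settings({"ROBOTSTXT_OBEY": True})
    middleware = RobotsTxtMiddleware(crawler)
    for url in ("data:text/plain,Hello World data", "file:///tmp/hello.txt"):
        # With the patch, the middleware should let these requests
        # through (return None) without scheduling a robots.txt fetch.
        assert middleware.process_request(Request(url), spider=None) is None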

@Cj-Malone (Contributor, Author)

If that test is good enough, 🚀! But if you want something more advanced, we'll have to wait until you're available. I guess if we're missing 2.8 anyway, it doesn't matter whether it's merged tomorrow or in a few weeks.

codecov bot commented Jan 27, 2023

Codecov Report

Merging #5807 (33b85a9) into master (e71eab6) will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5807      +/-   ##
==========================================
+ Coverage   88.91%   88.93%   +0.02%     
==========================================
  Files         162      162              
  Lines       10988    10990       +2     
  Branches     1797     1798       +1     
==========================================
+ Hits         9770     9774       +4     
+ Misses        938      937       -1     
+ Partials      280      279       -1     
Impacted Files                             Coverage Δ
scrapy/downloadermiddlewares/robotstxt.py  100.00% <100.00%> (ø)
scrapy/core/downloader/__init__.py         92.48% <0.00%> (+1.50%) ⬆️

@Gallaecio (Member) left a comment


Awesome!!!

@wRAR (Member) commented Jan 27, 2023

Thanks!

@wRAR merged commit 30c0dc7 into scrapy:master on Jan 27, 2023
@Cj-Malone deleted the patch-1 branch on January 27, 2023, 20:06