
Documentation example fails with proxy URL with no authority #3331

Closed
a-palchikov opened this issue Jul 11, 2018 · 10 comments · Fixed by #4778

Comments

a-palchikov commented Jul 11, 2018

Running the example from the documentation yields this:

10:11 $ scrapy runspider quotes.py 
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Versions: lxml 3.5.0.0, libxml2 2.9.3, cssselect 0.9.1, parsel 1.5.0, w3lib 1.19.0, Twisted 16.0.0, Python 2.7.12 (default, Dec  4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 0.15.1 (OpenSSL 1.0.2g  1 Mar 2016), cryptography 1.2.3, Platform Linux-4.4.0-130-generic-x86_64-with-Ubuntu-16.04-xenial
2018-07-11 10:12:04 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-07-11 10:12:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2018-07-11 10:12:04 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/runspider.py", line 88, in run
    self.crawler_process.crawl(spidercls, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 171, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 175, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 29, in from_crawler
    return cls(auth_encoding)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 22, in __init__
    self.proxies[type] = self._get_proxy(url, type)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 39, in _get_proxy
    proxy_type, user, password, hostport = _parse_proxy(url)
  File "/usr/lib/python2.7/urllib2.py", line 721, in _parse_proxy
    raise ValueError("proxy URL with no authority: %r" % proxy)
exceptions.ValueError: proxy URL with no authority: '/var/run/docker.sock'
2018-07-11 10:12:04 [twisted] CRITICAL:

It looks like the proxy code does not handle no_proxy correctly.
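For reference, the failure can be reproduced outside Scrapy with the standard library alone. A minimal sketch (Python 3, where urllib2's helpers live in urllib.request; _parse_proxy is a private helper, so this only shows where the exception originates, not a supported API):

import os
from urllib.request import getproxies, _parse_proxy  # _parse_proxy is a private stdlib helper

# Simulate the environment the spider was started in.
os.environ["no_proxy"] = "/var/run/docker.sock"

proxies = getproxies()
print(proxies["no"])  # '/var/run/docker.sock' is picked up as if it were a proxy

# HttpProxyMiddleware runs every entry returned by getproxies() through this
# parser; a socket path has no authority part, hence the ValueError above.
_parse_proxy(proxies["no"])  # ValueError: proxy URL with no authority: '/var/run/docker.sock'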

@grammy-jiang (Contributor)

Hi, @a-palchikov

This exception is reported by Python standard lib, not Scrapy.

Would you mind posting your start_urls and your proxy-related environment variables here?

@Gallaecio (Member)

Closing due to lack of feedback from the author.

@otakutyrant

I have encountered this issue too. My relevant proxy environment variable is no_proxy=/var/run/docker.sock, and after unsetting it the issue goes away. So, as the poster said, it looks like the proxy code does not handle no_proxy correctly.

@Gallaecio (Member)

The code probably does not expect a file path in that variable. I guess we should silently ignore those.
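
A minimal sketch of what "silently ignore those" could look like (the helper name load_env_proxies is hypothetical, and this is not the fix that eventually landed in #4778): drop any environment proxy entry that the standard-library parser rejects, instead of letting the ValueError abort the crawl.

from urllib.request import getproxies, _parse_proxy  # _parse_proxy is a private stdlib helper

def load_env_proxies():
    """Collect environment proxies, silently dropping entries that are not
    parseable proxy URLs (e.g. a socket path in no_proxy)."""
    proxies = {}
    for proxy_type, url in getproxies().items():
        try:
            _parse_proxy(url)  # the same check that currently fails in __init__
        except ValueError:
            continue           # not a proxy URL: ignore it instead of crashing
        proxies[proxy_type] = url
    return proxies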

@kartecianos

Hi, I am a newcomer and I would like to take this issue.

@Gallaecio (Member)

No need to ask for permission 🙂

drs-11 (Contributor) commented Aug 22, 2020

It also looks like _get_proxy doesn't handle multiple addresses in the no_proxy env variable well.
For example, if the no_proxy env var is set to no_proxy="127.0.0.1,localhost,localdomain.com",
then self.proxies in HttpProxyMiddleware will be set to:
{'no': (None, 'no://127.0.0.1,localhost,localdomain.com')}

That's not how it should be, right?
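
For context, the standard library itself treats no_proxy as a comma-separated host list used for bypass checks, not as a proxy URL, so the 'no://127.0.0.1,...' value above is indeed the wrong shape. A quick sketch (proxy_bypass_environment is an undocumented urllib helper):

import os
from urllib.request import proxy_bypass_environment  # undocumented stdlib helper

os.environ["no_proxy"] = "127.0.0.1,localhost,localdomain.com"

# Truthy result: requests to this host should bypass the proxy entirely.
print(proxy_bypass_environment("localhost"))
# Falsy result: the configured proxy would still be used for this host.
print(proxy_bypass_environment("example.com"))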

@Gallaecio (Member)

Probably not, indeed.

drs-11 (Contributor) commented Sep 4, 2020

I'm not sure what the solution to this issue could be.
/var/run/docker.sock seems to be the only case of a socket file appearing in a no_proxy env variable. So either the socket file should be ignored and not added to the list of proxies, or it could be added without passing the socket file path to the _get_proxy method, which is what raises the error.

But the second option will cause further errors when the proxy is parsed in other modules, so I think ignoring the socket file is the best option. I also can't find any other case where a socket file is used in no_proxy.
Thoughts?

@a-palchikov (Author)

I guess NO_PROXY handling is very open to interpretation and is not standardized. The Docker client describes its own use of NO_PROXY, while Scrapy can simply ignore any proxy entry that urllib2's _parse_proxy fails to parse.
