
Documentation example fails with proxy URL with no authority #3331

Closed
a-palchikov opened this issue Jul 11, 2018 · 10 comments · Fixed by #4778

Comments

a-palchikov commented Jul 11, 2018

Running the example from the documentation yields this:

10:11 $ scrapy runspider quotes.py 
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapybot)
2018-07-11 10:12:04 [scrapy.utils.log] INFO: Versions: lxml 3.5.0.0, libxml2 2.9.3, cssselect 0.9.1, parsel 1.5.0, w3lib 1.19.0, Twisted 16.0.0, Python 2.7.12 (default, Dec  4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 0.15.1 (OpenSSL 1.0.2g  1 Mar 2016), cryptography 1.2.3, Platform Linux-4.4.0-130-generic-x86_64-with-Ubuntu-16.04-xenial
2018-07-11 10:12:04 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-07-11 10:12:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
Unhandled error in Deferred:
2018-07-11 10:12:04 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/runspider.py", line 88, in run
    self.crawler_process.crawl(spidercls, **opts.spargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 171, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 175, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/middleware.py", line 36, in from_settings
    mw = mwcls.from_crawler(crawler)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 29, in from_crawler
    return cls(auth_encoding)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 22, in __init__
    self.proxies[type] = self._get_proxy(url, type)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/downloadermiddlewares/httpproxy.py", line 39, in _get_proxy
    proxy_type, user, password, hostport = _parse_proxy(url)
  File "/usr/lib/python2.7/urllib2.py", line 721, in _parse_proxy
    raise ValueError("proxy URL with no authority: %r" % proxy)
exceptions.ValueError: proxy URL with no authority: '/var/run/docker.sock'
2018-07-11 10:12:04 [twisted] CRITICAL:

It looks like the proxy code does not handle no_proxy correctly.
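For reference, the failure can be reproduced outside Scrapy with the standard library alone. A minimal sketch (Python 3, where urllib2's helpers live in urllib.request; _parse_proxy is a private helper, so this only shows where the exception originates, not a supported API):

import os
from urllib.request import getproxies, _parse_proxy  # _parse_proxy is a private stdlib helper

# Simulate the environment the spider was started in.
os.environ["no_proxy"] = "/var/run/docker.sock"

proxies = getproxies()
print(proxies["no"])  # '/var/run/docker.sock' is picked up as if it were a proxy

# HttpProxyMiddleware runs every entry returned by getproxies() through this
# parser; a socket path has no authority part, hence the ValueError above.
_parse_proxy(proxies["no"])  # ValueError: proxy URL with no authority: '/var/run/docker.sock'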

@grammy-jiang (Contributor)

Hi, @a-palchikov

This exception is reported by Python standard lib, not Scrapy.

Would you mind posting your start_urls and your proxy-related environment variables here?

@Gallaecio (Member)

Closing due to lack of feedback from the author.

@otakutyrant

I have encountered this issue too. My relevant proxy environment variable is no_proxy=/var/run/docker.sock, and after unsetting it the issue goes away. So, as the poster said, it looks like the proxy code does not handle no_proxy correctly.

@Gallaecio (Member)

The code probably does not expect a file path in that variable. I guess we should silently ignore those.
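
A minimal sketch of what "silently ignore those" could look like (the helper name load_env_proxies is hypothetical, and this is not the fix that eventually landed in #4778): drop any environment proxy entry that the standard-library parser rejects, instead of letting the ValueError abort the crawl.

from urllib.request import getproxies, _parse_proxy  # _parse_proxy is a private stdlib helper

def load_env_proxies():
    """Collect environment proxies, silently dropping entries that are not
    parseable proxy URLs (e.g. a socket path in no_proxy)."""
    proxies = {}
    for proxy_type, url in getproxies().items():
        try:
            _parse_proxy(url)  # the same check that currently fails in __init__
        except ValueError:
            continue           # not a proxy URL: ignore it instead of crashing
        proxies[proxy_type] = url
    return proxies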

@kartecianos

Hi, I am a newcomer and I would like to take this issue.

@Gallaecio (Member)

No need to ask for permission 🙂

drs-11 (Contributor) commented Aug 22, 2020

It also looks like _get_proxy doesn't handle multiple addresses in the no_proxy env variable well.
For example, if the no_proxy env var is set to no_proxy="127.0.0.1,localhost,localdomain.com",
then self.proxies in HttpProxyMiddleware will be set to:
{'no': (None, 'no://127.0.0.1,localhost,localdomain.com')}

That's not how it should be, right?
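
For context, the standard library itself treats no_proxy as a comma-separated host list used for bypass checks, not as a proxy URL, so the 'no://127.0.0.1,...' value above is indeed the wrong shape. A quick sketch (proxy_bypass_environment is an undocumented urllib helper):

import os
from urllib.request import proxy_bypass_environment  # undocumented stdlib helper

os.environ["no_proxy"] = "127.0.0.1,localhost,localdomain.com"

# Truthy result: requests to this host should bypass the proxy entirely.
print(proxy_bypass_environment("localhost"))
# Falsy result: the configured proxy would still be used for this host.
print(proxy_bypass_environment("example.com"))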

@Gallaecio (Member)

Probably not, indeed.

drs-11 (Contributor) commented Sep 4, 2020

I'm not sure what the solution to this issue could be.
/var/run/docker.sock seems to be the only case of a socket file appearing in a no_proxy env variable. So either the socket file should be ignored and not added to the list of proxies, or it could be added without passing the socket file path to the _get_proxy method, which is what raises the error.

But the second option will cause further errors when the proxy is parsed in other modules, so I think ignoring the socket file is the best option. I also can't find any other case where a socket file is used in no_proxy.
Thoughts?

@a-palchikov (Author)

I guess NO_PROXY handling is very open to interpretation and is not standardized. The Docker client describes its own use of NO_PROXY, while Scrapy can simply ignore any proxy entry that urllib2's _parse_proxy fails to parse.
