An error occurred while connecting: [Failure instance: Traceback: <class 'ValueError'>: filedescriptor out of range in select() #2905
Comments
Same issue here
Same here!
same issue
Same issue
Same issue
I'm experiencing this as well on macOS.
Disclaimer: this is just a theoretical workaround, I haven't actually tried it in the context of a broad crawl. By default Twisted uses a select()-based reactor on macOS, which can only handle file descriptors below FD_SETSIZE (typically 1024); installing the poll-based reactor before the default one gets imported avoids that limit:

```python
from twisted.internet import pollreactor
pollreactor.install()
from twisted.internet import reactor

import scrapy
from scrapy.crawler import CrawlerProcess


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.org']

    def parse(self, response):
        return {'reactor': str(reactor)}


process = CrawlerProcess()
process.crawl(ExampleSpider)
process.start()
```

The logs will show the extracted item with the reactor in use; commenting out the first two lines will show the difference between the default reactor and the one installed by the script. Hope this helps!
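Editor's note, not part of the original thread: newer Scrapy releases (2.0+) also expose a `TWISTED_REACTOR` setting, so the reactor can be selected declaratively in `settings.py` instead of installing it by hand before the `reactor` import. A minimal sketch, assuming the standard Twisted poll reactor import path:

```python
# settings.py — Scrapy 2.0+ alternative to calling pollreactor.install() manually
TWISTED_REACTOR = 'twisted.internet.pollreactor.PollReactor'
```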
@elacuesta seems to work. |
Let’s cover this in the FAQ or somewhere else in the documentation. |
Sounds good, I can draft a PR later today if you're not already on it @Gallaecio |
I’m not |
I'm trying to crawl ~200k sites, only the home pages. In the beginning the crawl works fine, but the logs quickly fill up with the following errors:
```
2017-08-29 11:18:55,131 - scrapy.core.scraper - ERROR - Error downloading <GET http://axo-suit.eu>
Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "venv/lib/python3.6/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "venv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.ConnectError: An error occurred while connecting: [Failure instance: Traceback: <class 'ValueError'>: filedescriptor out of range in select()
venv/lib/python3.6/site-packages/twisted/internet/base.py:1243:run
venv/lib/python3.6/site-packages/twisted/internet/base.py:1255:mainLoop
venv/lib/python3.6/site-packages/twisted/internet/selectreactor.py:106:doSelect
venv/lib/python3.6/site-packages/twisted/internet/selectreactor.py:88:_preenDescriptors
--- <exception caught here> ---
venv/lib/python3.6/site-packages/twisted/internet/selectreactor.py:85:_preenDescriptors
].
```
lsof shows that the process indeed has >1024 open network connections, which I believe is the limit for select().
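As a quick illustration of that limit (my own sketch, not from the thread): CPython's `select.select()` rejects any descriptor numbered at or above FD_SETSIZE (1024 on most Unix platforms) before even issuing the system call, which is exactly the ValueError in the traceback above:

```python
import select

# 2000 is just a placeholder number above FD_SETSIZE (typically 1024).
# select.select() validates the number itself before making the syscall,
# so the fd doesn't even need to refer to an open file.
try:
    select.select([2000], [], [], 0)
except ValueError as exc:
    print(exc)  # filedescriptor out of range in select()
```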
I set CONCURRENT_REQUESTS = 100 and REACTOR_THREADPOOL_MAXSIZE = 20 based on https://doc.scrapy.org/en/latest/topics/broad-crawls.html.
Not sure how the crawl ends up with so many open connections. Maybe it's leaking file descriptors somewhere?
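Editorial sketch, not from the original report: it's worth checking the process's open-file limit, but note that raising it with `ulimit -n` does not help a select()-based reactor, since FD_SETSIZE is a compile-time constant; switching to a poll- or kqueue-based reactor is the actual fix.

```python
import resource

# Soft/hard limits on open file descriptors for this process.
# Raising the soft limit lets more sockets be opened, but select()
# still cannot watch descriptors numbered >= FD_SETSIZE (usually 1024).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)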
I'm using:
Python 3.6.2
Scrapy 1.4.0
Twisted 17.5.0
macOS Sierra 10.12.6
Happy to provide more info as needed.