
An error occurred while connecting: [Failure instance: Traceback: <class 'ValueError'>: filedescriptor out of range in select() #2905

Closed
sseyboth opened this issue Aug 29, 2017 · 11 comments · Fixed by #4294

@sseyboth

I'm trying to crawl ~200k sites, home pages only. The crawl works fine at first, but the logs quickly fill up with the following errors:

2017-08-29 11:18:55,131 - scrapy.core.scraper - ERROR - Error downloading <GET http://axo-suit.eu>
Traceback (most recent call last):
File "venv/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "venv/lib/python3.6/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "venv/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.ConnectError: An error occurred while connecting: [Failure instance: Traceback: <class 'ValueError'>: filedescriptor out of range in select()
venv/lib/python3.6/site-packages/twisted/internet/base.py:1243:run
venv/lib/python3.6/site-packages/twisted/internet/base.py:1255:mainLoop
venv/lib/python3.6/site-packages/twisted/internet/selectreactor.py:106:doSelect
venv/lib/python3.6/site-packages/twisted/internet/selectreactor.py:88:_preenDescriptors
--- ---
venv/lib/python3.6/site-packages/twisted/internet/selectreactor.py:85:_preenDescriptors
].

lsof shows that the process indeed has >1024 open network connections, and 1024 is, I believe, the file descriptor limit for select().
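
For what it's worth, the 1024 ceiling comes from FD_SETSIZE, and Python's select module rejects any descriptor at or above that value with exactly the ValueError from the traceback above. A quick illustrative sketch (not Scrapy code, just a demonstration of the limit):

import select

# select() only accepts descriptors below FD_SETSIZE (1024 on most platforms);
# a descriptor numbered 1024 or higher is rejected with the same error that
# appears in the traceback above.
try:
    select.select([1024], [], [], 0)
except ValueError as exc:
    print(exc)  # filedescriptor out of range in select()

Once a process holds more than ~1024 descriptors, any newly opened socket necessarily gets a number in that rejected range, so every select() call involving it fails the same way.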

I set CONCURRENT_REQUESTS = 100 and REACTOR_THREADPOOL_MAXSIZE = 20 based on https://doc.scrapy.org/en/latest/topics/broad-crawls.html.
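
For completeness, that is just the following in the project's settings.py:

# Broad-crawl tuning from the Scrapy docs linked above
CONCURRENT_REQUESTS = 100
REACTOR_THREADPOOL_MAXSIZE = 20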

I'm not sure how the crawl ends up with so many open connections. Maybe it's leaking file descriptors somewhere?

I'm using:
Python 3.6.2
Scrapy 1.4.0
Twisted 17.5.0
macOS Sierra 10.12.6

Happy to provide more info as needed.

@pancodia

Same issue here

@james-turner

Same here!

@Weltklasse-Gefahr

same issue

@lalit-lintel

Same issue

@lmingzhi

Same issue

@ezarowny

ezarowny commented Nov 29, 2018

I'm experiencing this as well on macOS.

@elacuesta
Member

elacuesta commented Dec 5, 2018

Disclaimer: this is just a theoretical workaround, I haven't actually tried it in the context of a broad crawl.

By default, Twisted uses twisted.internet.selectreactor.SelectReactor on macOS (see https://twistedmatrix.com/documents/current/core/howto/choosing-reactor.html), which is bound by the select() descriptor limit. You could try replacing the reactor with twisted.internet.pollreactor.PollReactor and see if that helps. I think there is no built-in way to override the reactor in Scrapy, and monkey-patching your installed version is probably not a good idea, so I'd recommend running your spider from a script as explained in https://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script:

# Install the poll-based reactor before twisted.internet.reactor is first
# imported; otherwise the default (select-based) reactor is already in place.
from twisted.internet import pollreactor
pollreactor.install()
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerProcess


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.org']

    def parse(self, response):
        # Return the reactor's repr so the installed reactor shows up in the
        # crawl log as a scraped item.
        return {'reactor': str(reactor)}


process = CrawlerProcess()
process.crawl(ExampleSpider)
process.start()

The logs will show the extracted item with the reactor in use; commenting out the first two lines and running again shows the difference between the default reactor and the one installed by the script.

Hope this helps!

@stygmate

stygmate commented Aug 6, 2019

@elacuesta seems to work.

@Gallaecio
Member

Gallaecio commented Aug 8, 2019

Let’s cover this in the FAQ or somewhere else in the documentation.

@elacuesta
Member

Sounds good, I can draft a PR later today if you're not already on it @Gallaecio

@Gallaecio
Member

I’m not 🙂
