
ClosedPoolError on crawling #951

Open

eladbitton opened this issue Aug 18, 2016 · 22 comments

@eladbitton

eladbitton commented Aug 18, 2016

I am building a crawler with Python 3 and urllib3. I am using a single PoolManager instance shared by 15 different threads. While crawling thousands of websites I get a lot of ClosedPoolError exceptions from different websites.

From the documentation on ClosedPoolError:

Raised when a request enters a pool after the pool has been closed.

It appears that the PoolManager instance is trying to use a closed connection.

My code:

from urllib3 import PoolManager, Retry, util
from urllib3.exceptions import MaxRetryError

# Instance of PoolManager is created once on init
manager = PoolManager(num_pools=15,
                      maxsize=6,
                      timeout=40.0,
                      retries=Retry(connect=2, read=2, redirect=10))

# Every thread downloads through the shared pool manager instance
url_to_download = "**"
headers = util.make_headers(accept_encoding='gzip, deflate',
                            keep_alive=True,
                            user_agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0")
headers['Accept-Language'] = "en-US,en;q=0.5"
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
try:
    response = manager.request('GET',
                               url_to_download,
                               preload_content=False,
                               headers=headers)
except MaxRetryError:
    raise FailedToDownload()  # FailedToDownload is a custom application exception

How can I make the PoolManager renew the connection and try again?
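A minimal retry sketch, assuming the PoolManager lazily creates a fresh pool for the host on the next call (FailedToDownload and the attempt count are illustrative, reusing the names from the snippet above):

from urllib3.exceptions import ClosedPoolError, MaxRetryError

def download_with_retry(manager, url, headers, attempts=3):
    for _ in range(attempts):
        try:
            return manager.request('GET', url,
                                   preload_content=False,
                                   headers=headers)
        except ClosedPoolError:
            # The manager evicted and closed this host's pool; a new pool
            # is created lazily on the next request, so simply retry.
            continue
        except MaxRetryError:
            break
    raise FailedToDownload()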

@shazow
Member

shazow commented Aug 18, 2016

@Lukasa
Contributor

Lukasa commented Aug 18, 2016

Do you have a complete traceback, please? I'd like to see where this failure is coming from.

@eladbitton
Author

This is what I have from my log:
<class 'urllib3.exceptions.ClosedPoolError'>, ClosedPoolError("HTTPConnectionPool(host='homepage.ntlworld.com', port=80): Pool is closed.",), <traceback object at 0x7f01db7462c8>

@Lukasa
Contributor

Lukasa commented Aug 19, 2016

Hrm, we're really going to need a better traceback. If you're not catching this, can you add a block that catches it and fires log.exception to provide the traceback in the logs?
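Something along these lines would get the full traceback into the logs (logger name is illustrative; manager and url_to_download are from the snippet above):

import logging

log = logging.getLogger("crawler")

try:
    response = manager.request('GET', url_to_download,
                               preload_content=False, headers=headers)
except Exception:
    # log.exception records the message plus the active traceback
    log.exception("request to %s failed", url_to_download)
    raise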

@eladbitton
Author

Full traceback :)

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 235, in _get_conn
    conn = self.pool.get(block=self.block, timeout=timeout)
AttributeError: 'NoneType' object has no attribute 'get'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elad/workspace/Crawler/crawler/Downloader.py", line 62, in download
    headers=headers)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 69, in request
    **urlopen_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 90, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/poolmanager.py", line 248, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 668, in urlopen
    release_conn=release_conn, **response_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 583, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 238, in _get_conn
    raise ClosedPoolError(self, "Pool is closed.")
urllib3.exceptions.ClosedPoolError: HTTPConnectionPool(host='reader.co.il', port=80): Pool is closed.

@Lukasa
Contributor

Lukasa commented Aug 19, 2016

So, that's convenient. It seems like we've followed a redirect and are now attempting to grab a connection out of the connection pool. Somehow, in between, the pool is getting closed.

It looks like the PoolManager allows this behaviour: essentially, if requests start going to a new host after num_pools other hosts have already been seen, the least-recently-used connection pool is forcefully closed. This won't affect requests that are already in flight, but it will affect redirects, as the sketch below shows.
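A minimal sketch of that eviction behaviour (hypothetical hosts; num_pools=1 forces the first pool out as soon as a second host is seen):

from urllib3 import PoolManager
from urllib3.exceptions import ClosedPoolError

manager = PoolManager(num_pools=1)

pool_a = manager.connection_from_host("a.example.com")
# Asking for a second host evicts the least-recently-used entry, and
# the eviction callback closes the evicted pool.
pool_b = manager.connection_from_host("b.example.com")

try:
    pool_a.urlopen("GET", "/")  # the pool was closed underneath us
except ClosedPoolError as exc:
    print(exc)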

What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it? @haikuginger, @shazow, got theories on that?

@haikuginger
Contributor

If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?

Because ConnectionPools were originally, and still are, a first-class object that can be used on its own for a single host. This was a cause of confusion for me as well, and is one of a number of reasons that redirects are being refactored into the RequestMethods base class as part of the SessionManager branch.

@shazow
Member

shazow commented Aug 19, 2016

What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?

Historically, ConnectionPools came first, and the first attempt at redirection code lived there. Of course it did not work for cross-host requests.

Then PoolManagers came along, and the redirection code was ~copied, then later improved by the Retry object work.

I'd be +1 on removing redirect support from ConnectionPools, as it's very edge-case-y anyway (since it breaks cross-host). I feel if you're going to use such a low-level primitive, you should be ready to do your own redirecting.

Bonus points if our redirecting code can be used stand-alone by those who want to. This would be a great recipe to have in the docs, if it's not there already.
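A rough sketch of what such a stand-alone recipe might look like (not the library's actual code; urljoin resolves relative Location headers):

from urllib.parse import urljoin
from urllib3 import PoolManager

def get_following_redirects(manager, url, max_redirects=10):
    for _ in range(max_redirects):
        # redirect=False hands 3xx responses back to us instead of
        # following them inside the connection pool
        response = manager.urlopen("GET", url, redirect=False)
        location = response.get_redirect_location()
        if not location:
            return response
        response.release_conn()
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")

manager = PoolManager()
final_response = get_following_redirects(manager, "http://example.com/")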

@Lukasa
Contributor

Lukasa commented Aug 19, 2016

I'd be +1 on pulling it out of the ConnectionPool too.

@haikuginger
Contributor

I have no objection to that.

@sigmavirus24
Contributor

I'm working on this since it's also necessary for #952.

sigmavirus24 added a commit to sigmavirus24/urllib3 that referenced this issue Aug 21, 2016
Slightly refactor our PoolManager#urlopen method to turn integers into
Retry objects sooner and move our redirect handling tests for connection
pools to our pool manager so we don't lose test coverage.

Some tests were duplicated and thus not moved over. The rest were not
and so I moved them over and modified their logic to exercise the
PoolManager instead of HTTPConnectionPool.

Closes urllib3#951
@eladbitton
Author

When will this fix get into the main repository?

@Lukasa
Contributor

Lukasa commented Nov 28, 2016

When the fix is finished. =)

@eladbitton
Author

Any projection for that? :)

@Lukasa
Contributor

Lukasa commented Nov 28, 2016

Nope. The nature of these things is that, without any form of contracted work, the work gets done when someone is sufficiently motivated to complete the contribution.

@sigmavirus24
Contributor

@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.

@Yomguithereal

Hitting the same issue while using a bunch of threads to mass-download HTML content. Is there anything I could do to help you fix this, or do you reckon I should handle redirects on my own, as @shazow says:

I feel if you're going to use such a low-level primitive, you should be ready to do your own redirecting.

@willianantunes

willianantunes commented Mar 30, 2019

Hitting the same issue here as well, but in my case I'm using requests, which wraps urllib3.

@Lukasa, @haikuginger and @shazow I'm really aiming to get this solved. I'm studying how I can submit a PR with tested code and so on to get this fixed and help the community.

If you guys have any tips, just leave them on the table 😄
I'll see what I can do 👍

@willianantunes

One possible way to circumvent this is to write code like the following:

import logging
from typing import Callable

logger = logging.getLogger(__name__)


class CouldNotAccomplishExecutionException(Exception):
    pass


def _retry_if_exception_is_caught(execution_to_be_callable: Callable):
    max_tries = 5
    tries = 0
    while True:
        try:
            return execution_to_be_callable()
        except Exception as e:
            logger.warning(f"Exception of type {type(e)} was caught. Trying again...")
        if tries >= max_tries:
            raise CouldNotAccomplishExecutionException()
        tries += 1

Then you can use it like this:

import requests


def do_the_call():
    with requests.Session() as session:
        return session.get("http://your-service-url")


foo = _retry_if_exception_is_caught(do_the_call)

For those who are facing the same problem.

@marcobazzani

marcobazzani commented Sep 1, 2020

So, digging into this, I found that even if block is False it fails with this error:


Traceback (most recent call last):
  File "/home/mbazzan/project/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 660, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 259, in _get_conn
    raise ClosedPoolError(self, f"Pool is closed. {self.block} {timeout}")
urllib3.exceptions.ClosedPoolError: HTTPSConnectionPool(host='zzz.zzz-api.us-east-1.amazonaws.com', port=443): Pool is closed.

while the doc says:

    If no connections are available and :prop:`.block` is ``False``, then a
    fresh connection is returned.

@marcobazzani

I ended up patching _get_conn like this:


import logging
import queue

from urllib3 import connectionpool
from urllib3.exceptions import EmptyPoolError
from urllib3.util.connection import is_connection_dropped

log = logging.getLogger(__name__)


def _get_conn(self, timeout=None):
    conn = None
    try:
        conn = self.pool.get(block=self.block, timeout=timeout)

    except AttributeError:
        # self.pool is None: the pool was closed, so hand out a fresh
        # connection instead of raising ClosedPoolError
        return self._new_conn()

    except queue.Empty:
        if self.block:
            raise EmptyPoolError(
                self,
                "Pool reached maximum size and no more connections are allowed.",
            )
        # block is False: fall through and create a fresh connection below

    if conn and is_connection_dropped(conn):
        log.debug("Resetting dropped connection: %s", self.host)
        conn.close()
        if getattr(conn, "auto_open", 1) == 0:
            # The connection was closed and will not reopen on its own
            conn = None

    return conn or self._new_conn()


connectionpool.HTTPConnectionPool._get_conn = _get_conn

It works, but I don't know the drawbacks. Any help here?

@xloem

xloem commented Nov 8, 2022

@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.

How large do bounties usually need to get before somebody bites?
