
ClosedPoolError on crawling #951

Open

eladbitton opened this issue Aug 18, 2016 · 22 comments

@eladbitton

eladbitton commented Aug 18, 2016

I am building a crawler with Python 3 and urllib3. I am using a single PoolManager instance shared by 15 different threads. While crawling thousands of websites I get a lot of ClosedPoolError exceptions from different websites.

From the documentation on ClosedPoolError:

Raised when a request enters a pool after the pool has been closed.

It appears that the PoolManager instance is trying to use a closed connection.

My code:

from urllib3 import PoolManager, Retry, util
from urllib3.exceptions import MaxRetryError

# Instance of PoolManager is created once on init
manager = PoolManager(num_pools=15,
                      maxsize=6,
                      timeout=40.0,
                      retries=Retry(connect=2, read=2, redirect=10))

# Every thread downloads through the shared pool manager instance
url_to_download = "**"
headers = util.make_headers(accept_encoding='gzip, deflate',
                            keep_alive=True,
                            user_agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0")
headers['Accept-Language'] = "en-US,en;q=0.5"
headers['Accept'] = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
try:
    response = manager.request('GET',
                               url_to_download,
                               preload_content=False,
                               headers=headers)
except MaxRetryError:
    raise FailedToDownload()  # FailedToDownload is a custom application exception

How can I make the PoolManager renew the connection and try again?
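A minimal retry sketch, assuming the PoolManager lazily creates a fresh pool for the host on the next call (FailedToDownload and the attempt count are illustrative, reusing the names from the snippet above):

from urllib3.exceptions import ClosedPoolError, MaxRetryError

def download_with_retry(manager, url, headers, attempts=3):
    for _ in range(attempts):
        try:
            return manager.request('GET', url,
                                   preload_content=False,
                                   headers=headers)
        except ClosedPoolError:
            # The manager evicted and closed this host's pool; a new pool
            # is created lazily on the next request, so simply retry.
            continue
        except MaxRetryError:
            break
    raise FailedToDownload()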

@shazow
Member

shazow commented Aug 18, 2016

@Lukasa
Contributor

Lukasa commented Aug 18, 2016

Do you have a complete traceback, please? I'd like to see where this failure is coming from.

@eladbitton
Author

This is what I have from my log:
<class 'urllib3.exceptions.ClosedPoolError'>, ClosedPoolError("HTTPConnectionPool(host='homepage.ntlworld.com', port=80): Pool is closed.",), <traceback object at 0x7f01db7462c8>

@Lukasa
Contributor

Lukasa commented Aug 19, 2016

Hrm, we're really going to need a better traceback. If you're not catching this, can you add a block that catches it and fires log.exception to provide the traceback in the logs?
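Something along these lines would get the full traceback into the logs (logger name is illustrative; manager and url_to_download are from the snippet above):

import logging

log = logging.getLogger("crawler")

try:
    response = manager.request('GET', url_to_download,
                               preload_content=False, headers=headers)
except Exception:
    # log.exception records the message plus the active traceback
    log.exception("request to %s failed", url_to_download)
    raise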

@eladbitton
Author

Full traceback :)

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 235, in _get_conn
    conn = self.pool.get(block=self.block, timeout=timeout)
AttributeError: 'NoneType' object has no attribute 'get'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/elad/workspace/Crawler/crawler/Downloader.py", line 62, in download
    headers=headers)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 69, in request
    **urlopen_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/request.py", line 90, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/poolmanager.py", line 248, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 668, in urlopen
    release_conn=release_conn, **response_kw)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 583, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 238, in _get_conn
    raise ClosedPoolError(self, "Pool is closed.")
urllib3.exceptions.ClosedPoolError: HTTPConnectionPool(host='reader.co.il', port=80): Pool is closed.

@Lukasa
Contributor

Lukasa commented Aug 19, 2016

So, that's convenient. It seems like we've followed a redirect and are now attempting to grab a connection out of the connection pool. Somehow, in between, the pool is getting closed.

It looks like the PoolManager allows this behaviour: essentially, if requests start going to a new host after num_pools other hosts have already been seen, the least-recently-used connection pool is forcefully closed. This won't affect requests that are already in flight, but it will affect redirects, as the sketch below shows.
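A minimal sketch of that eviction behaviour (hypothetical hosts; num_pools=1 forces the first pool out as soon as a second host is seen):

from urllib3 import PoolManager
from urllib3.exceptions import ClosedPoolError

manager = PoolManager(num_pools=1)

pool_a = manager.connection_from_host("a.example.com")
# Asking for a second host evicts the least-recently-used entry, and
# the eviction callback closes the evicted pool.
pool_b = manager.connection_from_host("b.example.com")

try:
    pool_a.urlopen("GET", "/")  # the pool was closed underneath us
except ClosedPoolError as exc:
    print(exc)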

What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it? @haikuginger, @shazow, got theories on that?

@haikuginger
Contributor

If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?

Because ConnectionPools were originally, and still are, a first-class object that can be used on its own for a single host. This was a cause of confusion for me as well, and is one of a number of reasons that redirects are being refactored into the RequestMethods base class as part of the SessionManager branch.

@shazow
Member

shazow commented Aug 19, 2016

What's not clear to me is why there is redirect code in multiple places. If we're capable of doing redirects at the PoolManager level, why do the ConnectionPools also support it?

Historically, ConnectionPools came first, and the first attempt at redirection code lived there. Of course it did not work for cross-host requests.

Then PoolManagers came along, and the redirection code was ~copied, then later improved by the Retry object work.

I'd be +1 on removing redirect support from ConnectionPools, as it's very edge-case-y anyway (since it breaks cross-host). I feel if you're going to use such a low-level primitive, you should be ready to do your own redirecting.

Bonus points if our redirecting code can be used stand-alone by those who want to. This would be a great recipe to have in the docs, if it's not there already.
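A rough sketch of what such a stand-alone recipe might look like (not the library's actual code; urljoin resolves relative Location headers):

from urllib.parse import urljoin
from urllib3 import PoolManager

def get_following_redirects(manager, url, max_redirects=10):
    for _ in range(max_redirects):
        # redirect=False hands 3xx responses back to us instead of
        # following them inside the connection pool
        response = manager.urlopen("GET", url, redirect=False)
        location = response.get_redirect_location()
        if not location:
            return response
        response.release_conn()
        url = urljoin(url, location)
    raise RuntimeError("too many redirects")

manager = PoolManager()
final_response = get_following_redirects(manager, "http://example.com/")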

@Lukasa
Contributor

Lukasa commented Aug 19, 2016

I'd be +1 on pulling it out of the ConnectionPool too.

@haikuginger
Contributor

I have no objection to that.

@sigmavirus24
Contributor

I'm working on this since it's also necessary for #952.

sigmavirus24 added a commit to sigmavirus24/urllib3 that referenced this issue Aug 21, 2016
Slightly refactor our PoolManager#urlopen method to turn integers into
Retry objects sooner and move our redirect handling tests for connection
pools to our pool manager so we don't lose test coverage.

Some tests were duplicated and thus not moved over. The rest were not
and so I moved them over and modified their logic to exercise the
PoolManager instead of HTTPConnectionPool.

Closes urllib3#951
@eladbitton
Author

When will this fix get into the main repository?

@Lukasa
Contributor

Lukasa commented Nov 28, 2016

When the fix is finished. =)

@eladbitton
Author

Any projection for that? :)

@Lukasa
Contributor

Lukasa commented Nov 28, 2016

Nope. The nature of these things is that, without any form of contracted work, the work gets done when someone is sufficiently motivated to complete the contribution.

@sigmavirus24
Contributor

@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.

@Yomguithereal

Hitting the same issue while using a bunch of threads to mass-download HTML content. Is there anything I could do to help you fix this, or do you reckon I should handle redirects on my own, as @shazow says:

I feel if you're going to use such a low-level primitive, you should be ready to do your own redirecting.

@willianantunes

willianantunes commented Mar 30, 2019

Hitting the same issue here as well, but in my case I'm using requests, which wraps urllib3.

@Lukasa, @haikuginger and @shazow I'm really aiming to get this solved. I'm studying how I can submit a PR with tested code and so on to get this fixed and help the community.

If you guys have any tips, just leave them on the table 😄
I'll see what I can do 👍

@willianantunes

One possible way to circumvent this is to write code like the following:

import logging
from typing import Callable

logger = logging.getLogger(__name__)


class CouldNotAccomplishExecutionException(Exception):
    pass


def _retry_if_exception_is_caught(execution_to_be_callable: Callable):
    max_tries = 5
    tries = 0
    while True:
        try:
            return execution_to_be_callable()
        except Exception as e:
            logger.warning(f"Exception of type {type(e)} was caught. Trying again...")
        if tries >= max_tries:
            raise CouldNotAccomplishExecutionException()
        tries += 1

Then you can use it like this:

import requests


def do_the_call():
    with requests.Session() as session:
        return session.get("http://your-service-url")


foo = _retry_if_exception_is_caught(do_the_call)

For those who are facing the same problem.

@marcobazzani

marcobazzani commented Sep 1, 2020

So, digging into this, I found that even if block is False it fails with this error:


Traceback (most recent call last):
  File "/home/mbazzan/project/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 660, in urlopen
    conn = self._get_conn(timeout=pool_timeout)
  File "/home/mbazzan/project/site-packages/urllib3/connectionpool.py", line 259, in _get_conn
    raise ClosedPoolError(self, f"Pool is closed. {self.block} {timeout}")
urllib3.exceptions.ClosedPoolError: HTTPSConnectionPool(host='zzz.zzz-api.us-east-1.amazonaws.com', port=443): Pool is closed.

while the doc says:

    If no connections are available and :prop:`.block` is ``False``, then a
    fresh connection is returned.

@marcobazzani

I ended up patching _get_conn like this:


import logging
import queue

from urllib3 import connectionpool
from urllib3.exceptions import EmptyPoolError
from urllib3.util.connection import is_connection_dropped

log = logging.getLogger(__name__)


def _get_conn(self, timeout=None):
    conn = None
    try:
        conn = self.pool.get(block=self.block, timeout=timeout)

    except AttributeError:
        # self.pool is None: the pool was closed, so hand out a fresh
        # connection instead of raising ClosedPoolError
        return self._new_conn()

    except queue.Empty:
        if self.block:
            raise EmptyPoolError(
                self,
                "Pool reached maximum size and no more connections are allowed.",
            )
        # block is False: fall through and create a fresh connection below

    if conn and is_connection_dropped(conn):
        log.debug("Resetting dropped connection: %s", self.host)
        conn.close()
        if getattr(conn, "auto_open", 1) == 0:
            # The connection was closed and will not reopen on its own
            conn = None

    return conn or self._new_conn()


connectionpool.HTTPConnectionPool._get_conn = _get_conn

It works, but I don't know the drawbacks. Any help here?

@xloem

xloem commented Nov 8, 2022

@eladbitton posting a bounty on this bug is a good way to motivate people to fix the bug for you.

How large do bounties usually need to get before somebody bites?
