Keep alive? #5

Open · hunt3r opened this issue Aug 30, 2016 · 4 comments

hunt3r commented Aug 30, 2016

For our application, we are trying to set up asyncmc as a singleton, instantiated at application startup, so we can reuse the connection many times over without opening a new socket for each request. We are running into issues in some lesser-used environments where the memcached connections seem to go "stale" and we receive a timeout and a 504. The connection is then re-established and things go on working fine, but this is fairly disruptive.
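
For reference, the client is created once at application startup, roughly like this (simplified; the constructor call follows the asyncmc README example, and the settings keys and values here are illustrative):

    # Simplified sketch of our startup wiring; keys and values are illustrative.
    import tornado.ioloop
    import asyncmc

    io_loop = tornado.ioloop.IOLoop.current()

    settings = {
        'cache': {
            # One shared client for the whole process, reused by every handler.
            'client': asyncmc.Client(servers=['127.0.0.1:11211'], loop=io_loop),
            'keep_alive_ping_time': 60 * 1000,  # PeriodicCallback interval, in ms
        },
    }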

I thought that using a periodic async callback to call stats() might keep things alive, but it doesn't seem to:

            tornado.ioloop.PeriodicCallback(
                self.settings.get('cache', {}).get('client').stats,
                self.settings.get('cache', {}).get('keep_alive_ping_time'),
                io_loop=tornado.ioloop.IOLoop.current()
            ).start()

Is there a desired approach for working with asyncmc over longer periods?

ErDmKo (Owner) commented Aug 30, 2016

@hunt3r thanks for your feedback. I will try to write a test to reproduce it. I think we should support this feature, something like pinging the connection over long periods of inactivity.
But what do you mean by "longer periods": years, months, days, or hours?

hunt3r (Author) commented Aug 30, 2016

It's more about inactivity. This is a web application that will run for days to weeks without a deploy, so we need a reliable way to recycle the connections. The issue seems to be that when we receive a ClientException, our Tornado process goes zombie; once the process is cleaned up by gunicorn we get a new connection and all is well, but the request that hit the issue gets a 504 that first time. This happens for each of the load-balanced processes (6 servers × 17 processes), so as you can see, that is a lot of 504 timeouts, and it is especially problematic in our lower environments where testing activity is limited. Employees come in in the morning and the servers are all unresponsive. If we had this issue in production we would have much bigger problems 😆.

It seems like on a busy server the regular Host.mark_dead handles broken connections fine and we don't really have issues. I'm not certain exactly what is causing our failures on stale connections, but it may be a socket dropped by memcached due to inactivity.

A sample of what we sometimes see, e.g. from our health check, is the following:

  File "/opt/python3.4/lib/python3.4/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/opt/python3.4/lib/python3.4/site-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/opt/python3.4/lib/python3.4/site-packages/tornado/gen.py", line 876, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/python3.4/lib/python3.4/site-packages/asyncmc/client.py", line 63, in wrapper
    res = yield func(self, conn, *args, **kwargs)
  File "/opt/python3.4/lib/python3.4/site-packages/tornado/gen.py", line 870, in run
    value = future.result()
  File "/opt/python3.4/lib/python3.4/site-packages/tornado/concurrent.py", line 215, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 3, in raise_exc_info
  File "/opt/python3.4/lib/python3.4/site-packages/tornado/gen.py", line 879, in run
    yielded = self.gen.send(value)
  File "/opt/python3.4/lib/python3.4/site-packages/asyncmc/client.py", line 135, in stats
    raise ClientException('stats failed', resp)

I've added a ping that we can call from the periodic callback; it does a time comparison on each run and, after say 10 minutes, clear()s the pool and acquires a new connection. I'm not sure whether this is the right direction to go or not.

I was going to test it in one of our lower environments to see if it helps. My concern with this approach is whether it is possible to clear the pool while connections are still checked out under heavy load.

I've subclassed Client and added the following:


    @gen.coroutine
    def purge(self):
        self.logger.debug('Clearing Memcached client pool and reconnecting')
        self.pool.clear()
        self.pool.acquire()
        self.client_created_time = time()

    @gen.coroutine
    def ping(self):
        """
        Ping the connection and clear the pool after a certain time frame
        :return:
        """
        stats = None
        if time() - self.client_created_time > MEMCACHED_MAX_CLIENT_POOL_AGE:
            yield self.purge()

        try:
            stats = yield self.stats()
        except ClientException as ex:
            self.logger.debug(ex, exc_info=traceback.format_exc())
        return stats

I added a client_created_time class attribute to track the time of the last clear().
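
For completeness, the attribute and the max-age constant look roughly like this (the class name is illustrative, and the constant is our own setting, not part of asyncmc):

    from time import time

    import asyncmc

    # Our own tunable, not part of asyncmc: recycle the pool after ~10 minutes.
    MEMCACHED_MAX_CLIENT_POOL_AGE = 600  # seconds

    class KeepAliveMemcachedClient(asyncmc.Client):
        # Timestamp of the last pool clear(); purge() above resets it.
        client_created_time = time()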

I'm using the ping method in my Application singleton:

        if settings.get('cache', {}).get('provider') == CacheProviderTypes.MEMCACHED:
            tornado.ioloop.PeriodicCallback(
                self.settings.get('cache', {}).get('client').ping,
                self.settings.get('cache', {}).get('keep_alive_ping_time'),
                io_loop=tornado.ioloop.IOLoop.current()
            ).start()

We wrap the asyncmc.Client in a factory because we support other types of caching. The factory's get method does the following, which lets us trap client errors and clear the pool if one happens in the normal flow:

    @coroutine
    def get(self, cache_key, hash_key=True):
        """
        Get a cached item from memcached
        :param cache_key:
        :param hash_key:
        :return object:
        """
        try:
            if hash_key:
                cache_key = get_cache_key_hash(cache_key)

            data = yield self.client.get(cache_key)
            return data
        except ClientException as ex:
            self.logger.error(ex, exc_info=traceback.format_exc())
            yield self.client.purge()
            raise CacheFactoryClientError(ex)
        except InvalidCacheKeyException as ex:
            self.logger.debug(ex)
        except Exception as ex:
            self.logger.error(ex)
            raise CacheFactoryClientError(ex)
        return None
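
In a handler, usage then looks something like this (the names here are made up: `cache` is an instance of the factory above and `load_profile_from_db` is a hypothetical fallback):

    @coroutine
    def fetch_profile(cache, user_id):
        # Illustrative caller; falls back to the database on a miss or client error.
        try:
            profile = yield cache.get('profile:{}'.format(user_id))
        except CacheFactoryClientError:
            profile = None  # pool was purged; treat it as a cache miss
        if profile is None:
            profile = yield load_profile_from_db(user_id)  # hypothetical fallback
        return profile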

It would be nice to have an option on the Pool to remove connections after a certain time frame in the normal flow, so that after a connection finishes and is returned to the pool, it would be marked for removal and disposed of gracefully. The approach I'm using here is probably a bit extreme, but I'm not that familiar with the code yet and needed to get something working.

Glad to help in any way I can. Thanks for your work on this, by the way; it's definitely the most viable non-blocking memcached client for Tornado that I've found! 🍻

hunt3r (Author) commented Sep 6, 2016

Just an FYI: this approach is working for us for now, but it feels a bit like a sledgehammer.

I think the better way to handle this would be on a per-connection basis: we could track each connection's age in the pool and mark connections for removal after some amount of time; on each request, if a connection was fetched that was marked for removal, the pool would dispose of the socket and open a new one. A rough sketch of the idea is below.
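
Something like this, purely as a sketch of the idea (this is not asyncmc's actual Pool API; the names and internals here are invented for illustration):

    from time import time

    CONN_MAX_AGE = 600  # seconds a pooled connection may live before being recycled

    class AgingPoolSketch(object):
        """Hypothetical pool that recycles connections past a maximum age."""

        def __init__(self, connect):
            self._connect = connect  # callable that opens a new memcached socket
            self._free = []          # list of (created_at, connection) tuples

        def acquire(self):
            while self._free:
                created_at, conn = self._free.pop()
                if time() - created_at > CONN_MAX_AGE:
                    conn.close()  # dispose of the stale socket...
                    continue      # ...and look for (or create) a fresher one
                return created_at, conn
            return time(), self._connect()

        def release(self, created_at, conn):
            # Connections keep their original timestamp so age accumulates.
            self._free.append((created_at, conn))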

hunt3r closed this as completed Sep 6, 2016
hunt3r (Author) commented Sep 6, 2016

Didn't mean to close.

hunt3r reopened this Sep 6, 2016