Keel instances stop checking resources #951
Bump the version of retrofit2, in the hope that it resolves keel issue: spinnaker/keel#951
Note: The retrofit2 version bump does not seem to have resolved this issue.
To attempt to debug, we've added the
Alas, adding the
Create an ad-hoc coroutine id to help gather related log messages. This is a temporary change to make it easier to go through the log messages for debugging spinnaker#951.
Add additional debugging logging. Change the threading model of an async call to see if it works around issue spinnaker#951.
Bump the OkHttp library to version 4.5.0. Hopefully fixes spinnaker#951
Back out code that was added to debug issue spinnaker#951: * log.debug statements * running async in single thread context (originally done to see if it would resolve the issue. It didn't).
Overview
Keel instances are regularly getting into a state where they stop checking resources. This is apparent when the keel.resource.check.drift metric starts to increase without bound until it reaches the timeout. Even once it reaches the timeout, the instance still doesn't make progress.
Example graph of an instance getting into a bad state
You can see that all successful checks stop, with a timeout error happening every two minutes.
Atlas query
Current theory: bug in OkHttp library 4.4.1
My current hypothesis is that the issue is a bug in OkHttp version 4.4.1. Here's my reasoning:
Thread dump: spinning in findHealthyConnection
A thread dump on a stuck instance shows that it's in the OkHttp library's findHealthyConnection method, which contains an infinite loop.
stack trace
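To make the hypothesis concrete, here is a toy model of a "keep trying until a candidate is healthy" loop. It is only a sketch under stated assumptions: the class `HealthyConnectionLoopSketch`, the `Conn` type, and the `maxAttempts` cap are all hypothetical and are not OkHttp's code. It illustrates how such a loop could spin if the pool keeps handing back a connection that never passes the health check, and how a cap would at least surface the condition.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a retry-until-healthy loop. This is NOT OkHttp's actual code. */
final class HealthyConnectionLoopSketch {

    /** Hypothetical stand-in for a pooled connection. */
    static final class Conn {
        final String name;
        final boolean healthy;

        Conn(String name, boolean healthy) {
            this.name = name;
            this.healthy = healthy;
        }
    }

    /**
     * Polls candidates until one is healthy. The attempt cap is the guard an
     * unbounded loop lacks: without it, a pool that only ever yields unhealthy
     * candidates would keep this thread busy forever.
     * Returns null when the pool is empty or the cap is reached.
     */
    static Conn findHealthy(Deque<Conn> pool, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Conn candidate = pool.pollFirst();
            if (candidate == null) {
                return null;
            }
            if (candidate.healthy) {
                return candidate;
            }
            // Assumption for this model: the broken connection goes back into the pool.
            pool.addLast(candidate);
        }
        return null;
    }

    public static void main(String[] args) {
        Deque<Conn> pool = new ArrayDeque<>();
        pool.add(new Conn("stale", false));
        Conn result = findHealthy(pool, 1_000);
        System.out.println(result == null ? "gave up" : result.name);
    }
}
```

In this model a thread dump taken at any point during the 1,000 attempts would show the thread inside `findHealthy`, which is the same shape as the symptom described above.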
Bugfix in OkHttp version 4.5.0
We are running OkHttp version 4.4.1. The findHealthyConnection code was changed in version 4.5.0.
The OkHttp 4.5.0 changelog says this:
General symptoms
current.await() timing out
The code is timing out in ResourceActuator when calling current.await().
Timeout stack trace (with kotlin coroutine debug option enabled to show coroutine trace)
Note that the CannotResolveCurrentState exception never appears in the log, so the current method itself does not appear to be throwing an exception.
GET call to clouddriver that doesn't return
Whenever an instance gets into a bad state and times out, the logs show that one of the GET requests made from keel to clouddriver doesn't have a matching response.
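The two symptoms above (an await that times out, and a request with no matching response) fit together if the underlying call simply never completes. The following is a minimal JVM sketch of that situation, using `CompletableFuture` as a stand-in for the Kotlin `Deferred` that keel actually awaits; the class and method names are hypothetical and this is not keel's code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: waiting with a deadline on a call that never produces a response. */
final class AwaitTimeoutSketch {

    /** Waits up to timeoutMillis; returns "TIMEOUT" if the call has not completed by then. */
    static String awaitOrTimeout(CompletableFuture<String> call, long timeoutMillis) {
        try {
            return call.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The waiter gives up; the call itself never threw anything.
            return "TIMEOUT";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        } catch (ExecutionException e) {
            return "FAILED: " + e.getCause();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a GET whose response never arrives.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        System.out.println(awaitOrTimeout(neverAnswered, 100)); // TIMEOUT
        System.out.println(awaitOrTimeout(neverAnswered, 100)); // TIMEOUT again, no progress
        System.out.println(awaitOrTimeout(CompletableFuture.completedFuture("200 OK"), 100)); // 200 OK
    }
}
```

Note that in this sketch the timeout is raised by the waiter, not by the call, which would be consistent with no exception from the call itself ever showing up in the log.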
Older stuff
(This is from earlier diagnostic work)
fillInStackTrace
One of the symptoms of these stuck instances is that a thread eventually spins in the java.lang.Throwable.fillInStackTrace method, triggered by an OkHttp call (thread dump).
However, while we always eventually see this behavior, I don't think this is what's causing the problem. In at least one case, I took a thread dump earlier on from an instance that had stopped checking resources, and the thread was not in the fillInStackTrace method. Eventually, it ended up there.
Detailed thread dumps (old)
(This is before we had kotlin coroutine debugging on, so these are not so insightful)
This gist shows some complete thread dumps:
fillInStackTrace
Socket closed
We're seeing evidence that keel is trying to do I/O operations against a closed socket:
The following INFO-level messages appear in the log, for the logger okhttp3.OkHttpClient in the thread OkHttp https://clouddriver/...
Looking at a flame graph, we're seeing a SocketException thrown on setSoTimeout. One of the ways this could happen is if that call is made on a closed socket.
Flame graph
Thread dumps
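The closed-socket explanation for the SocketException on setSoTimeout is easy to check in isolation with the JDK alone. This is a minimal sketch, not keel or OkHttp code; the class name `ClosedSocketSketch` is hypothetical, and it uses an unconnected socket so no network access is needed.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketException;

/** Sketch: setSoTimeout on an already-closed java.net.Socket. */
final class ClosedSocketSketch {

    /** Returns true if setSoTimeout on a closed socket throws SocketException. */
    static boolean setSoTimeoutFailsAfterClose() {
        Socket socket = new Socket(); // unconnected, so nothing touches the network
        try {
            socket.close();
            socket.setSoTimeout(1_000); // expected to throw because the socket is closed
            return false;
        } catch (SocketException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException("unexpected failure while closing the socket", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(setSoTimeoutFailsAfterClose());
    }
}
```

This only confirms that a closed socket is one way to get that exception; it does not show which code path in keel or OkHttp closed the socket.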