Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keel instances stop checking resources #951

Closed
lorin opened this issue Mar 31, 2020 · 3 comments · Fixed by #1006 or #1028
Closed

Keel instances stop checking resources #951

lorin opened this issue Mar 31, 2020 · 3 comments · Fixed by #1006 or #1028
Assignees
Milestone

Comments

@lorin
Copy link
Contributor

lorin commented Mar 31, 2020

Overview

Keel instances are regularly getting into a state where they stop checking resources. This is apparent when the keel.resource.check.drift metric starts to increase unbounded, until it reaches the timeout. Even once it reaches the timeout, it still doesn't make progress.

Example graph of an instance getting into a bad state

image

You can see that all successful checks stop, with a timeout error happening every two minutes.

Atlas query
name,keel.resource.check.drift,:eq,
1000,:div,
2,:lw,
$name,:legend,

name,keel.resource.checked,:eq,
(,resourceState,),:by,
:stack,
:per-step,
44,:alpha,
$resourceState,:legend,
1,:axis,

:list,
(,nf.node,<INSTANCE ID GOES HERE>,:eq,:cq,),:each

Current theory: bug in OkHttp library 4.4.1

My current hypothesis is that the issue is a bug in OkHttp version 4.4.1. Here's my reasoning:

Thread dump: spinning in findHealthyConnection

Thread dump on a stuck instance shows that it's in the OkHttp library's findHealthyConnection method, which is an infinite loop.

stack trace
"OkHttp https://clouddriver/..." #2130 prio=5 os_prio=0 tid=0x00007f596474c800 nid=0x23fc runnable [0x00007f5914dbe000]
   java.lang.Thread.State: RUNNABLE
	at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:108)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:76)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.okhttp.MetricsInterceptor.doIntercept(MetricsInterceptor.java:84)
	at com.netflix.spinnaker.okhttp.OkHttp3MetricsInterceptor.intercept(OkHttp3MetricsInterceptor.java:33)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.keel.retrofit.AuthorizationHeadersInterceptor.intercept(AuthorizationHeadersInterceptor.kt:30)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:219)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- <0x00000006d174a8c0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Bugfix in OkHttp version 4.5.0

We are running OkHttp version 4.4.1. The findHealthyConnection code was changed in version 4.5.0.

The OkHttp 4.5.0 changelog says this:

This release fixes a severe bug where OkHttp incorrectly detected and recovered from unhealthy connections. Stale or canceled connections were incorrectly attempted when they shouldn’t have been, leading to rare cases of infinite retries. Please upgrade to this release!

General symptoms

current.await() timing out

The code is timing out in ResourceActuator when calling current.await().

Timeout stack trace (with kotlin coroutine debug option enabled to show coroutine trace)
kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 120000 ms
	at (Coroutine boundary.()
	at kotlinx.coroutines.DeferredCoroutine.await$suspendImpl(Builders.common.kt:99)
	at com.netflix.spinnaker.keel.actuation.ResourceActuator$resolve$2.invokeSuspend(ResourceActuator.kt:208)
	at com.netflix.spinnaker.keel.actuation.ResourceActuator$checkResource$2.invokeSuspend(ResourceActuator.kt:77)
	at com.netflix.spinnaker.keel.logging.TracingSupport$Companion.withTracingContext(TracingSupport.kt:39)
Caused by: kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 120000 ms
	at kotlinx.coroutines.TimeoutKt.TimeoutCancellationException(Timeout.kt:149)
	at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:119)
	at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:493)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:272)
	at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:68)
	at java.lang.Thread.run(Thread.java:748)

Note that the CannotResolveCurrentState exception never appears in the log, so it does not appear that the current method itself is throwing an exception.

GET call to clouddriver that doesn't return

Whenever an instance gets into a bad state and times out, the logs show that one of the GET requests made from keel to clouddriver doesn't have a matching response.

Older stuff

(This is from earlier diagnostic work)

fillInStackTrace

One of the symptoms of these stuck instances is that a thread eventually spins in the java.lang.Throwable.fillInStackTrace method, triggered by an OkHttp call (thread dump).

However, while we always eventually see this behavior, I don't think this is what's causing the problem. In at least one case, I took a thread dump earlier on in of an instance that stopped checking resource, and the thread was not in the fillInStackTrace method. Eventually, it ended up there.

Detailed thread dumps (old)

(This is before we had kotlin coroutine debugging on, so these are not so insightful)

This gist shows some complete thread dumps:

Socket closed

We're seeing evidence that keel is trying to do I/O operations against a closed socket:

The following INFO-level messages appear in the log, for the logger okhttp3.OkHttpClient in the thread OkHttp https://clouddriver/...

<-- HTTP FAILED: java.net.SocketException: Socket closed

Looking at a flame graph, we're seeing a SocketException thrown on setSoTimeout. One of the ways this could happen is if that call is made on a closed socket.

Flame graph

flame-graph

Thread dumps

"OkHttp https://clouddriver/..." #1312 prio=5 os_prio=0 tid=0x00007f0594039000 nid=0x2d56 runnable [0x00007f0578dc1000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.Throwable.fillInStackTrace(Native Method)
	at java.lang.Throwable.fillInStackTrace(Throwable.java:784)
	- locked <0x00000006bc328510> (a okhttp3.internal.connection.RouteException)
	at java.lang.Throwable.<init>(Throwable.java:311)
	at java.lang.Exception.<init>(Exception.java:102)
	at java.lang.RuntimeException.<init>(RuntimeException.java:96)
	at okhttp3.internal.connection.RouteException.<init>(RouteException.kt:25)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:90)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.okhttp.MetricsInterceptor.doIntercept(MetricsInterceptor.java:84)
	at com.netflix.spinnaker.okhttp.OkHttp3MetricsInterceptor.intercept(OkHttp3MetricsInterceptor.java:33)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- <0x000000070984cf30> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"OkHttp https://clouddriver/..." #1312 prio=5 os_prio=0 tid=0x00007f0594039000 nid=0x2d56 runnable [0x00007f0578dc1000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.Throwable.fillInStackTrace(Native Method)
	at java.lang.Throwable.fillInStackTrace(Throwable.java:784)
	- locked <0x0000000680cdefd8> (a java.net.SocketException)
	at java.lang.Throwable.<init>(Throwable.java:266)
	at java.lang.Exception.<init>(Exception.java:66)
	at java.io.IOException.<init>(IOException.java:58)
	at java.net.SocketException.<init>(SocketException.java:47)
	at java.net.Socket.setSoTimeout(Socket.java:1155)
	- locked <0x0000000709a6ba08> (a java.net.Socket)
	at sun.security.ssl.BaseSSLSocketImpl.setSoTimeout(BaseSSLSocketImpl.java:633)
	- locked <0x0000000709b0b5d8> (a sun.security.ssl.SSLSocketImpl)
	at sun.security.ssl.SSLSocketImpl.setSoTimeout(SSLSocketImpl.java:2556)
	at okhttp3.internal.connection.RealConnection.newCodec$okhttp(RealConnection.kt:599)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:84)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.okhttp.MetricsInterceptor.doIntercept(MetricsInterceptor.java:84)
	at com.netflix.spinnaker.okhttp.OkHttp3MetricsInterceptor.intercept(OkHttp3MetricsInterceptor.java:33)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
@lorin lorin changed the title Blocked in OkHttp call due to lock Thread stuck in fillInStackTrace method from OkHttp call Mar 31, 2020
lorin pushed a commit to lorin/kork that referenced this issue Mar 31, 2020
Bump the version of retrofit2, in the hope that it resolves
keel issue: spinnaker/keel#951
lorin pushed a commit to lorin/kork that referenced this issue Mar 31, 2020
Bump the version of retrofit2, in the hope that it resolves
keel issue: spinnaker/keel#951
@lorin lorin self-assigned this Mar 31, 2020
@lorin lorin added the bug label Mar 31, 2020
mergify bot pushed a commit to spinnaker/kork that referenced this issue Mar 31, 2020
Bump the version of retrofit2, in the hope that it resolves
keel issue: spinnaker/keel#951
@lorin
Copy link
Contributor Author

lorin commented Apr 6, 2020

Note: The retrofit2 version bump does not seem to have resolved this issue.

@lorin lorin changed the title Thread stuck in fillInStackTrace method from OkHttp call Instances stop making progress, thread spinning in fillInStackTrace method from OkHttp call Apr 6, 2020
@luispollo luispollo added the Core label Apr 6, 2020
@luispollo luispollo added this to To do in Declarative delivery 1.0 via automation Apr 6, 2020
@luispollo luispollo added this to the Beta milestone Apr 6, 2020
@lorin
Copy link
Contributor Author

lorin commented Apr 6, 2020

To attempt to debug, we've added the -XX:+OmitStackTraceInFastThrow option in the hopes that this will bypass the code where it's getting stuck so that the exception will throw all of the way up so we can see what it is.

@lorin
Copy link
Contributor Author

lorin commented Apr 7, 2020

Alas, adding the -XX:+OmitStackTraceInFastThrow option did not prevent the code from spinning in fillInSTackTrace.

@lorin lorin changed the title Instances stop making progress, thread spinning in fillInStackTrace method from OkHttp call Keel instances stop checking resources Apr 7, 2020
lorin pushed a commit to lorin/keel that referenced this issue Apr 8, 2020
Create an ad-hoc coroutine id to help make gather related log messages.
This is a temporary change to make it easier to go through the log
messages for debugging spinnaker#951.
lorin pushed a commit to lorin/keel that referenced this issue Apr 8, 2020
Create an ad-hoc coroutine id to help make gather related log messages.
This is a temporary change to make it easier to go through the log
messages for debugging spinnaker#951.
lorin pushed a commit to lorin/keel that referenced this issue Apr 8, 2020
Create an ad-hoc coroutine id to help make gather related log messages.
This is a temporary change to make it easier to go through the log
messages for debugging spinnaker#951.
lorin pushed a commit to lorin/keel that referenced this issue Apr 8, 2020
Create an ad-hoc coroutine id to help make gather related log messages.
This is a temporary change to make it easier to go through the log
messages for debugging spinnaker#951.
lorin pushed a commit to lorin/keel that referenced this issue Apr 8, 2020
Create an ad-hoc coroutine id to help make gather related log messages.
This is a temporary change to make it easier to go through the log
messages for debugging spinnaker#951.
emjburns pushed a commit that referenced this issue Apr 8, 2020
Create an ad-hoc coroutine id to help make gather related log messages.
This is a temporary change to make it easier to go through the log
messages for debugging #951.
lorin pushed a commit to lorin/keel that referenced this issue Apr 10, 2020
Add additional debugging logging.

Change the threading model of an async call to see if it works around
issue spinnaker#951.
emjburns pushed a commit that referenced this issue Apr 10, 2020
Add additional debugging logging.

Change the threading model of an async call to see if it works around
issue #951.
lorin pushed a commit to lorin/keel that referenced this issue Apr 12, 2020
Bump the OkHttp library to version 4.5.0.

Hopefully fixes spinnaker#951
lorin pushed a commit to lorin/keel that referenced this issue Apr 12, 2020
Bump the OkHttp library to version 4.5.0.

Hopefully fixes spinnaker#951
Declarative delivery 1.0 automation moved this from To do to Done Apr 12, 2020
emjburns pushed a commit that referenced this issue Apr 12, 2020
Bump the OkHttp library to version 4.5.0.

Hopefully fixes #951
lorin pushed a commit to lorin/keel that referenced this issue Apr 15, 2020
mergify bot pushed a commit that referenced this issue Apr 15, 2020
lorin pushed a commit to lorin/keel that referenced this issue Apr 16, 2020
Back out code that was added to debug issue spinnaker#951:

* log.debug statements
* running async in single thread context (originally done to see if it would resolve the issue. It didn't).
lorin pushed a commit to lorin/keel that referenced this issue Apr 16, 2020
Back out code that was added to debug issue spinnaker#951:

* log.debug statements
* running async in single thread context (originally done to see if it would resolve the issue. It didn't).
lorin pushed a commit to lorin/keel that referenced this issue Apr 17, 2020
Back out code that was added to debug issue spinnaker#951:

* log.debug statements
* running async in single thread context (originally done to see if it would resolve the issue. It didn't).
lorin pushed a commit to lorin/keel that referenced this issue Apr 17, 2020
Back out code that was added to debug issue spinnaker#951:

* log.debug statements
* running async in single thread context (originally done to see if it would resolve the issue. It didn't).
mergify bot pushed a commit that referenced this issue Apr 17, 2020
Back out code that was added to debug issue #951:

* log.debug statements
* running async in single thread context (originally done to see if it would resolve the issue. It didn't).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging a pull request may close this issue.

2 participants