Keel instances stop checking resources #951

lorin · 2020-03-31T18:48:25Z

Overview

Keel instances are regularly getting into a state where they stop checking resources. This is apparent when the keel.resource.check.drift metric starts to increase unbounded, until it reaches the timeout. Even once it reaches the timeout, it still doesn't make progress.

Example graph of an instance getting into a bad state

You can see that all successful checks stop, with a timeout error happening every two minutes.

Atlas query

name,keel.resource.check.drift,:eq,
1000,:div,
2,:lw,
$name,:legend,

name,keel.resource.checked,:eq,
(,resourceState,),:by,
:stack,
:per-step,
44,:alpha,
$resourceState,:legend,
1,:axis,

:list,
(,nf.node,<INSTANCE ID GOES HERE>,:eq,:cq,),:each

Current theory: bug in OkHttp library 4.4.1

My current hypothesis is that the issue is a bug in OkHttp version 4.4.1. Here's my reasoning:

Thread dump: spinning in findHealthyConnection

Thread dump on a stuck instance shows that it's in the OkHttp library's findHealthyConnection method, which is an infinite loop.

stack trace

"OkHttp https://clouddriver/..." #2130 prio=5 os_prio=0 tid=0x00007f596474c800 nid=0x23fc runnable [0x00007f5914dbe000]
   java.lang.Thread.State: RUNNABLE
	at okhttp3.internal.connection.ExchangeFinder.findHealthyConnection(ExchangeFinder.kt:108)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:76)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.okhttp.MetricsInterceptor.doIntercept(MetricsInterceptor.java:84)
	at com.netflix.spinnaker.okhttp.OkHttp3MetricsInterceptor.intercept(OkHttp3MetricsInterceptor.java:33)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.keel.retrofit.AuthorizationHeadersInterceptor.intercept(AuthorizationHeadersInterceptor.kt:30)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.logging.HttpLoggingInterceptor.intercept(HttpLoggingInterceptor.kt:219)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- <0x00000006d174a8c0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

Bugfix in OkHttp version 4.5.0

We are running OkHttp version 4.4.1. The findHealthyConnection code was changed in version 4.5.0.

The OkHttp 4.5.0 changelog says this:

This release fixes a severe bug where OkHttp incorrectly detected and recovered from unhealthy connections. Stale or canceled connections were incorrectly attempted when they shouldn’t have been, leading to rare cases of infinite retries. Please upgrade to this release!

General symptoms

current.await() timing out

The code is timing out in ResourceActuator when calling current.await().

Timeout stack trace (with kotlin coroutine debug option enabled to show coroutine trace)

kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 120000 ms
	at (Coroutine boundary.()
	at kotlinx.coroutines.DeferredCoroutine.await$suspendImpl(Builders.common.kt:99)
	at com.netflix.spinnaker.keel.actuation.ResourceActuator$resolve$2.invokeSuspend(ResourceActuator.kt:208)
	at com.netflix.spinnaker.keel.actuation.ResourceActuator$checkResource$2.invokeSuspend(ResourceActuator.kt:77)
	at com.netflix.spinnaker.keel.logging.TracingSupport$Companion.withTracingContext(TracingSupport.kt:39)
Caused by: kotlinx.coroutines.TimeoutCancellationException: Timed out waiting for 120000 ms
	at kotlinx.coroutines.TimeoutKt.TimeoutCancellationException(Timeout.kt:149)
	at kotlinx.coroutines.TimeoutCoroutine.run(Timeout.kt:119)
	at kotlinx.coroutines.EventLoopImplBase$DelayedRunnableTask.run(EventLoop.common.kt:493)
	at kotlinx.coroutines.EventLoopImplBase.processNextEvent(EventLoop.common.kt:272)
	at kotlinx.coroutines.DefaultExecutor.run(DefaultExecutor.kt:68)
	at java.lang.Thread.run(Thread.java:748)

Note that the CannotResolveCurrentState exception never appears in the log, so it does not appear that the current method itself is throwing an exception.

GET call to clouddriver that doesn't return

Whenever an instance gets into a bad state and times out, the logs show that one of the GET requests made from keel to clouddriver doesn't have a matching response.

Older stuff

(This is from earlier diagnostic work)

fillInStackTrace

One of the symptoms of these stuck instances is that a thread eventually spins in the java.lang.Throwable.fillInStackTrace method, triggered by an OkHttp call (thread dump).

However, while we always eventually see this behavior, I don't think this is what's causing the problem. In at least one case, I took a thread dump earlier on in of an instance that stopped checking resource, and the thread was not in the fillInStackTrace method. Eventually, it ended up there.

Detailed thread dumps (old)

(This is before we had kotlin coroutine debugging on, so these are not so insightful)

This gist shows some complete thread dumps:

bad-thread-dump-1.txt thread dump from an instance that recently stopped checking resources. Note that the thread is not in fillInStackTrace.
bad-thread-dump-2.txt thread dump from the same bad instance, later on.
good-thread-dump.txt thread dump from a healthy instance

Socket closed

We're seeing evidence that keel is trying to do I/O operations against a closed socket:

The following INFO-level messages appear in the log, for the logger okhttp3.OkHttpClient in the thread OkHttp https://clouddriver/...

<-- HTTP FAILED: java.net.SocketException: Socket closed

Looking at a flame graph, we're seeing a SocketException thrown on setSoTimeout. One of the ways this could happen is if that call is made on a closed socket.

Flame graph

Thread dumps

"OkHttp https://clouddriver/..." #1312 prio=5 os_prio=0 tid=0x00007f0594039000 nid=0x2d56 runnable [0x00007f0578dc1000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.Throwable.fillInStackTrace(Native Method)
	at java.lang.Throwable.fillInStackTrace(Throwable.java:784)
	- locked <0x00000006bc328510> (a okhttp3.internal.connection.RouteException)
	at java.lang.Throwable.<init>(Throwable.java:311)
	at java.lang.Exception.<init>(Exception.java:102)
	at java.lang.RuntimeException.<init>(RuntimeException.java:96)
	at okhttp3.internal.connection.RouteException.<init>(RouteException.kt:25)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:90)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.okhttp.MetricsInterceptor.doIntercept(MetricsInterceptor.java:84)
	at com.netflix.spinnaker.okhttp.OkHttp3MetricsInterceptor.intercept(OkHttp3MetricsInterceptor.java:33)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

   Locked ownable synchronizers:
	- <0x000000070984cf30> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"OkHttp https://clouddriver/..." #1312 prio=5 os_prio=0 tid=0x00007f0594039000 nid=0x2d56 runnable [0x00007f0578dc1000]
   java.lang.Thread.State: RUNNABLE
	at java.lang.Throwable.fillInStackTrace(Native Method)
	at java.lang.Throwable.fillInStackTrace(Throwable.java:784)
	- locked <0x0000000680cdefd8> (a java.net.SocketException)
	at java.lang.Throwable.<init>(Throwable.java:266)
	at java.lang.Exception.<init>(Exception.java:66)
	at java.io.IOException.<init>(IOException.java:58)
	at java.net.SocketException.<init>(SocketException.java:47)
	at java.net.Socket.setSoTimeout(Socket.java:1155)
	- locked <0x0000000709a6ba08> (a java.net.Socket)
	at sun.security.ssl.BaseSSLSocketImpl.setSoTimeout(BaseSSLSocketImpl.java:633)
	- locked <0x0000000709b0b5d8> (a sun.security.ssl.SSLSocketImpl)
	at sun.security.ssl.SSLSocketImpl.setSoTimeout(SSLSocketImpl.java:2556)
	at okhttp3.internal.connection.RealConnection.newCodec$okhttp(RealConnection.kt:599)
	at okhttp3.internal.connection.ExchangeFinder.find(ExchangeFinder.kt:84)
	at okhttp3.internal.connection.RealCall.initExchange$okhttp(RealCall.kt:245)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:32)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:82)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:83)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:74)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at com.netflix.spinnaker.okhttp.MetricsInterceptor.doIntercept(MetricsInterceptor.java:84)
	at com.netflix.spinnaker.okhttp.OkHttp3MetricsInterceptor.intercept(OkHttp3MetricsInterceptor.java:33)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:100)
	at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:197)
	at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:502)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The text was updated successfully, but these errors were encountered:

Bump the version of retrofit2, in the hope that it resolves keel issue: spinnaker/keel#951

lorin · 2020-04-06T16:01:03Z

Note: The retrofit2 version bump does not seem to have resolved this issue.

lorin · 2020-04-06T20:59:59Z

To attempt to debug, we've added the -XX:+OmitStackTraceInFastThrow option in the hopes that this will bypass the code where it's getting stuck so that the exception will throw all of the way up so we can see what it is.

lorin · 2020-04-07T16:20:27Z

Alas, adding the -XX:+OmitStackTraceInFastThrow option did not prevent the code from spinning in fillInSTackTrace.

Create an ad-hoc coroutine id to help make gather related log messages. This is a temporary change to make it easier to go through the log messages for debugging spinnaker#951.

Create an ad-hoc coroutine id to help make gather related log messages. This is a temporary change to make it easier to go through the log messages for debugging #951.

Add additional debugging logging. Change the threading model of an async call to see if it works around issue spinnaker#951.

Add additional debugging logging. Change the threading model of an async call to see if it works around issue #951.

Bump the OkHttp library to version 4.5.0. Hopefully fixes spinnaker#951

Bump the OkHttp library to version 4.5.0. Hopefully fixes #951

hopefully fixes spinnaker#951

hopefully fixes #951

Back out code that was added to debug issue spinnaker#951: * log.debug statements * running async in single thread context (originally done to see if it would resolve the issue. It didn't).

Back out code that was added to debug issue #951: * log.debug statements * running async in single thread context (originally done to see if it would resolve the issue. It didn't).

lorin changed the title ~~Blocked in OkHttp call due to lock~~ Thread stuck in fillInStackTrace method from OkHttp call Mar 31, 2020

lorin pushed a commit to lorin/kork that referenced this issue Mar 31, 2020

deps: Bump retrofit2 version

da487cb

Bump the version of retrofit2, in the hope that it resolves keel issue: spinnaker/keel#951

lorin mentioned this issue Mar 31, 2020

deps: Bump retrofit2 version spinnaker/kork#585

Merged

lorin pushed a commit to lorin/kork that referenced this issue Mar 31, 2020

fix(deps): bump retrofit2 version

e476c6e

Bump the version of retrofit2, in the hope that it resolves keel issue: spinnaker/keel#951

lorin self-assigned this Mar 31, 2020

lorin added the bug label Mar 31, 2020

mergify bot pushed a commit to spinnaker/kork that referenced this issue Mar 31, 2020

fix(deps): bump retrofit2 version (#585)

8700222

Bump the version of retrofit2, in the hope that it resolves keel issue: spinnaker/keel#951

lorin changed the title ~~Thread stuck in fillInStackTrace method from OkHttp call~~ Instances stop making progress, thread spinning in fillInStackTrace method from OkHttp call Apr 6, 2020

luispollo added the Core label Apr 6, 2020

luispollo added this to To do in Declarative delivery 1.0 via automation Apr 6, 2020

luispollo added this to the Beta milestone Apr 6, 2020

lorin changed the title ~~Instances stop making progress, thread spinning in fillInStackTrace method from OkHttp call~~ Keel instances stop checking resources Apr 7, 2020

lorin mentioned this issue Apr 8, 2020

fix(actuator): add coroutine id to help debug #985

Merged

emjburns pushed a commit that referenced this issue Apr 8, 2020

fix(actuator): add coroutine id to help debug (#985)

87028e7

Create an ad-hoc coroutine id to help make gather related log messages. This is a temporary change to make it easier to go through the log messages for debugging #951.

lorin pushed a commit to lorin/keel that referenced this issue Apr 10, 2020

fix(resource): tweak thread model, add more debug

35e6722

Add additional debugging logging. Change the threading model of an async call to see if it works around issue spinnaker#951.

lorin mentioned this issue Apr 10, 2020

fix(resource): tweak thread model, add more debug logging #1001

Merged

emjburns pushed a commit that referenced this issue Apr 10, 2020

fix(resource): tweak thread model, add more debug (#1001)

f3997f5

Add additional debugging logging. Change the threading model of an async call to see if it works around issue #951.

lorin pushed a commit to lorin/keel that referenced this issue Apr 12, 2020

chore(dependencies): bump okhttp to 4.5.0

8c55443

Bump the OkHttp library to version 4.5.0. Hopefully fixes spinnaker#951

lorin mentioned this issue Apr 12, 2020

chore(dependencies): bump okhttp to 4.5.0 #1006

Merged

lorin pushed a commit to lorin/keel that referenced this issue Apr 12, 2020

chore(dependencies): bump okhttp to 4.5.0

ea3884a

Bump the OkHttp library to version 4.5.0. Hopefully fixes spinnaker#951

emjburns closed this as completed in #1006 Apr 12, 2020

Declarative delivery 1.0 automation moved this from To do to Done Apr 12, 2020

emjburns pushed a commit that referenced this issue Apr 12, 2020

chore(dependencies): bump okhttp to 4.5.0 (#1006)

1447959

Bump the OkHttp library to version 4.5.0. Hopefully fixes #951

lorin pushed a commit to lorin/keel that referenced this issue Apr 15, 2020

fix(deps): bump okhttp

f04b7e9

hopefully fixes spinnaker#951

lorin mentioned this issue Apr 15, 2020

fix(deps): bump okhttp #1028

Merged

mergify bot pushed a commit that referenced this issue Apr 15, 2020

fix(deps): bump okhttp (#1028)

b462e6b

hopefully fixes #951

lorin mentioned this issue Apr 16, 2020

chore(core): back out debug logging and single thread context #1032

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keel instances stop checking resources #951

Keel instances stop checking resources #951

lorin commented Mar 31, 2020 •

edited

Loading

lorin commented Apr 6, 2020

lorin commented Apr 6, 2020 •

edited

Loading

lorin commented Apr 7, 2020

Keel instances stop checking resources #951

Keel instances stop checking resources #951

Comments

lorin commented Mar 31, 2020 • edited Loading

Overview

Current theory: bug in OkHttp library 4.4.1

Thread dump: spinning in findHealthyConnection

Bugfix in OkHttp version 4.5.0

General symptoms

current.await() timing out

GET call to clouddriver that doesn't return

Older stuff

fillInStackTrace

Detailed thread dumps (old)

Socket closed

Flame graph

Thread dumps

lorin commented Apr 6, 2020

lorin commented Apr 6, 2020 • edited Loading

lorin commented Apr 7, 2020

lorin commented Mar 31, 2020 •

edited

Loading

lorin commented Apr 6, 2020 •

edited

Loading