
fix(kubernetes): Improve failure mode for unreachable cluster #3770

Merged
merged 3 commits into spinnaker:master from fix-namespace-caching on Jun 10, 2019
Conversation

ezimanyi
Contributor

  • fix(kubernetes): Improve failure mode for unreachable cluster

    We currently cache any call to get the namespaces for an account with an expiry time of 30s using a memoized supplier.

    When a cluster is unreachable, the call to get the cluster's namespaces hangs and eventually times out; we then log a warning and return an empty list of namespaces.

    If the call to kubectl returns an error, we don't cache the empty list we return, so every call to get namespaces invokes kubectl. This is a bad failure mode: a slow or unresponsive cluster ends up receiving more calls than a fast, responsive one.

    To address this, when a call to get namespaces returns an error, cache the empty list we're returning for the same amount of time as a successful call.

  • fix(kubernetes): Use custom memoizer for kubectl calls

    We're currently using a Guava memoized supplier for calls to get the namespaces and CRDs in a cluster, with an expiration time of 30s.

    The Guava memoizer records the timestamp when it starts executing the supplier function, not when the function completes. This means that if the call to get namespaces takes more than 30s, we never get a cache hit at all: the entry has already expired by the time it is added to the cache. The cache is therefore least effective exactly when it is most needed.

    Instead, write a small Memoizer class that wraps a Caffeine cache, since Caffeine marks entries at the time of insertion (after the work is finished) rather than when the work starts, and use it for caching kubectl calls in place of the Guava supplier.
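The first fix can be sketched as a supplier that caches the empty fallback on failure for the same TTL as a success. This is an illustrative pure-Java sketch, not clouddriver's actual code; the class and field names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the first fix: on kubectl failure, cache the empty
// result for the normal TTL instead of re-invoking kubectl on every call.
class NamespaceSupplier implements Supplier<List<String>> {
  private final Supplier<List<String>> kubectlCall;
  private final long ttlMillis;
  private List<String> cached;
  private long expiresAt;

  NamespaceSupplier(Supplier<List<String>> kubectlCall, long ttlMillis) {
    this.kubectlCall = kubectlCall;
    this.ttlMillis = ttlMillis;
  }

  public synchronized List<String> get() {
    if (cached != null && System.currentTimeMillis() < expiresAt) {
      return cached;
    }
    List<String> result;
    try {
      result = kubectlCall.get();
    } catch (RuntimeException e) {
      // Previously the empty list was returned but never cached, so every
      // caller re-invoked kubectl against the unreachable cluster.
      result = Collections.emptyList();
    }
    // Cache successes and failures alike, stamping expiry after the call completes.
    cached = result;
    expiresAt = System.currentTimeMillis() + ttlMillis;
    return result;
  }
}
```

With this in place, a hung-then-failed kubectl call produces one expensive attempt per TTL window rather than one per caller.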

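The key point of the second fix is where the expiry clock starts. A minimal sketch of a completion-time memoizer (the actual clouddriver Memoizer wraps a Caffeine cache; this stand-in uses only the JDK, and the names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative completion-time memoizer. The TTL clock starts when the value
// is inserted (after the work finishes), so a loader that takes longer than
// the TTL still yields a usable cache entry -- unlike a Guava memoized
// supplier, which stamps the entry when the loader *starts*.
class Memoizer<T> {
  private static final class Entry<T> {
    final T value;
    final long insertedAt;
    Entry(T value, long insertedAt) {
      this.value = value;
      this.insertedAt = insertedAt;
    }
  }

  private final Map<String, Entry<T>> cache = new ConcurrentHashMap<>();
  private final long ttlMillis;

  Memoizer(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  T memoize(String key, Supplier<T> loader) {
    Entry<T> e = cache.get(key);
    if (e != null && System.currentTimeMillis() - e.insertedAt < ttlMillis) {
      return e.value;
    }
    T value = loader.get(); // may take longer than the TTL
    cache.put(key, new Entry<>(value, System.currentTimeMillis()));
    return value;
  }
}
```

Caffeine's `expireAfterWrite` gives the same insertion-time semantics out of the box, which is why wrapping it is the natural fix.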
@ezimanyi
Contributor Author

This is intended to be a small incremental improvement to the failure mode for unreachable clusters. Some possible further improvements:

  • Rely on the actual clouddriver cache, rather than the custom cache here, for getting namespaces/CRDs
  • Defer enumerating namespaces/CRDs to the caching agents so it doesn't happen during startup

Contributor

@maggieneterval left a comment


Nice!!!!! 🕺

@ezimanyi ezimanyi merged commit 637fc13 into spinnaker:master Jun 10, 2019
@ezimanyi ezimanyi deleted the fix-namespace-caching branch June 10, 2019 18:16
justinrlee pushed a commit to justinrlee/clouddriver that referenced this pull request Jun 12, 2019
…aker#3770)

@ezimanyi
Contributor Author

@spinnakerbot cherry-pick 1.14

spinnakerbot pushed a commit that referenced this pull request Jun 27, 2019
@spinnakerbot
Contributor

Cherry pick successful: #3822

ezimanyi added a commit that referenced this pull request Jun 27, 2019
…#3823)
