
fix(kubernetes): Improve failure mode for unreachable cluster #3770

Merged
merged 3 commits into spinnaker:master from fix-namespace-caching on Jun 10, 2019
Conversation

ezimanyi
Contributor

  • fix(kubernetes): Improve failure mode for unreachable cluster

    We currently cache any call to get the namespaces for an account with an expiry time of 30s using a memoized supplier.

    When a cluster is unreachable, the call to get the cluster's namespaces hangs and eventually times out; we then log a warning and return an empty list of namespaces.

    If the call to kubectl returns an error, we don't cache the empty list we return, so every call to get namespaces invokes kubectl. This is a bad failure mode: a slow or unresponsive cluster ends up receiving more calls than a fast, responsive one.

    To address this, when a call to get namespaces returns an error, cache the empty list we're returning for the same amount of time as a successful call.

  • fix(kubernetes): Use custom memoizer for kubectl calls

    We're currently using a Guava memoized supplier for calls to get the namespaces and CRDs in a cluster, with an expiration time of 30s.

    The Guava memoizer records the timestamp when it starts executing the supplier function, not when the function completes. This means that if the call to get namespaces takes more than 30s, we never get a cache hit at all: the entry has already expired by the time it is added to the cache. The cache is therefore least effective exactly when it is most needed.

    Instead, write a small Memoizer class that wraps a Caffeine cache, since Caffeine marks entries at the time of insertion (after the work is finished) rather than when the work starts, and use it for caching kubectl calls in place of the Guava supplier.
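The first fix can be sketched as a supplier that caches the empty fallback on failure for the same TTL as a success. This is an illustrative pure-Java sketch, not clouddriver's actual code; the class and field names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the first fix: on kubectl failure, cache the empty
// result for the normal TTL instead of re-invoking kubectl on every call.
class NamespaceSupplier implements Supplier<List<String>> {
  private final Supplier<List<String>> kubectlCall;
  private final long ttlMillis;
  private List<String> cached;
  private long expiresAt;

  NamespaceSupplier(Supplier<List<String>> kubectlCall, long ttlMillis) {
    this.kubectlCall = kubectlCall;
    this.ttlMillis = ttlMillis;
  }

  public synchronized List<String> get() {
    if (cached != null && System.currentTimeMillis() < expiresAt) {
      return cached;
    }
    List<String> result;
    try {
      result = kubectlCall.get();
    } catch (RuntimeException e) {
      // Previously the empty list was returned but never cached, so every
      // caller re-invoked kubectl against the unreachable cluster.
      result = Collections.emptyList();
    }
    // Cache successes and failures alike, stamping expiry after the call completes.
    cached = result;
    expiresAt = System.currentTimeMillis() + ttlMillis;
    return result;
  }
}
```

With this in place, a hung-then-failed kubectl call produces one expensive attempt per TTL window rather than one per caller.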

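The key point of the second fix is where the expiry clock starts. A minimal sketch of a completion-time memoizer (the actual clouddriver Memoizer wraps a Caffeine cache; this stand-in uses only the JDK, and the names are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative completion-time memoizer. The TTL clock starts when the value
// is inserted (after the work finishes), so a loader that takes longer than
// the TTL still yields a usable cache entry -- unlike a Guava memoized
// supplier, which stamps the entry when the loader *starts*.
class Memoizer<T> {
  private static final class Entry<T> {
    final T value;
    final long insertedAt;
    Entry(T value, long insertedAt) {
      this.value = value;
      this.insertedAt = insertedAt;
    }
  }

  private final Map<String, Entry<T>> cache = new ConcurrentHashMap<>();
  private final long ttlMillis;

  Memoizer(long ttlMillis) {
    this.ttlMillis = ttlMillis;
  }

  T memoize(String key, Supplier<T> loader) {
    Entry<T> e = cache.get(key);
    if (e != null && System.currentTimeMillis() - e.insertedAt < ttlMillis) {
      return e.value;
    }
    T value = loader.get(); // may take longer than the TTL
    cache.put(key, new Entry<>(value, System.currentTimeMillis()));
    return value;
  }
}
```

Caffeine's `expireAfterWrite` gives the same insertion-time semantics out of the box, which is why wrapping it is the natural fix.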
@ezimanyi
Contributor Author

This is intended to be a small incremental improvement to the failure mode for unreachable clusters. Some possible further improvements:

  • Rely on the actual clouddriver cache, rather than the custom cache here, for getting namespaces/CRDs
  • Defer enumerating namespaces/CRDs to the caching agents so it doesn't happen during startup

Contributor

@maggieneterval left a comment


Nice!!!!! 🕺

@ezimanyi ezimanyi merged commit 637fc13 into spinnaker:master Jun 10, 2019
@ezimanyi ezimanyi deleted the fix-namespace-caching branch June 10, 2019 18:16
justinrlee pushed a commit to justinrlee/clouddriver that referenced this pull request Jun 12, 2019
…aker#3770)

@ezimanyi
Contributor Author

@spinnakerbot cherry-pick 1.14

spinnakerbot pushed a commit that referenced this pull request Jun 27, 2019
@spinnakerbot
Contributor

Cherry pick successful: #3822

ezimanyi added a commit that referenced this pull request Jun 27, 2019
…#3823)
