
prevent unintentionally extending cache TTL #1666

Merged
3 commits merged into spinnaker:master from avoid-refreshing-cache-ttl on Nov 21, 2020

Conversation

robfletcher (Contributor)

An issue arose recently where a new base AMI caused a new version of an image to get baked. Because the older image was cached on some instances, Keel could arbitrarily pick either of the two available images (the one with the older base AMI and the one with the newer one). This caused the desired state to flip-flop depending on which Keel instance checked it.

Exacerbating this is the fact that the cache is aggressively primed with any other regions found in the image. This meant that although the cache has a TTL of 1 hour, it kept getting refreshed with the same value, repeatedly extending the TTL.

This PR attempts to ensure we only prime the cache with additional regions if it currently doesn't contain a mapping for the app version / account / region combination, or the cached image is older than the one just retrieved from CloudDriver.

Comment on lines -138 to +137
- namedImageCache.put(NamedImageCacheKey(appVersion, account, otherRegion), completedFuture(image))
+ namedImageCache.putIfMissingOrOlderThan(key.copy(region = otherRegion), image)
@robfletcher (Contributor, Author) commented on Nov 19, 2020:

Instead of always adding the image to the cache for any other regions, only do so if it's not cached currently, or the current cached image is older than this one.

This should prevent us from unnecessarily extending the TTL of the cache entry as we will only add it once.

Comment on lines +142 to +154
private fun AsyncCache<NamedImageCacheKey, NamedImage>.putIfMissingOrOlderThan(
key: NamedImageCacheKey,
image: NamedImage
) {
synchronous().apply {
val updateCache = getIfPresent(key)
?.let { it.creationDate < image.creationDate }
?: true
if (updateCache) {
put(key, image)
}
}
}
@robfletcher (Contributor, Author):

This is the implementation of that logic. It operates on a synchronous view of the cache: since it uses getIfPresent there's no call to CloudDriver, so there's no need to handle the result asynchronously.
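For anyone unfamiliar with Caffeine's two cache views, here is a minimal sketch (the key and value are simple stand-ins, not Keel's real NamedImageCacheKey / NamedImage types) of how an AsyncCache's synchronous() view allows a plain getIfPresent check without dealing with futures:

import com.github.benmanes.caffeine.cache.AsyncCache
import com.github.benmanes.caffeine.cache.Caffeine
import java.time.Duration
import java.util.concurrent.CompletableFuture

fun main() {
  // hypothetical cache with the same 1-hour TTL described in this PR
  val cache: AsyncCache<String, String> = Caffeine.newBuilder()
    .expireAfterWrite(Duration.ofMinutes(60))
    .buildAsync()

  // the async view deals in futures...
  cache.put("key", CompletableFuture.completedFuture("ami-123"))

  // ...while the synchronous view exposes the same entries directly;
  // getIfPresent only inspects the cache, so nothing remote (e.g. CloudDriver)
  // is called here
  val cached: String? = cache.synchronous().getIfPresent("key")
  println(cached) // ami-123

  // writes through the synchronous view update the same underlying cache
  cache.synchronous().put("key", "ami-456")
}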

Comment on lines 71 to 75
data class ActiveServerGroupImage(
@get:ExcludedFromDiff
val imageId: String,
val appVersion: String?,
val baseImageVersion: String?,
@robfletcher (Contributor, Author):

There seems to be no compelling reason to include imageId in the diff we generate. It's meaningless to users and the information we need to decide whether to redeploy is contained in appVersion and baseImageVersion. If either of those changes, we'll re-deploy. The imageId field is necessary when we launch the Orca task but not in the diff.
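As a rough illustration of the idea (a hypothetical sketch with a made-up IgnoreInDiff annotation and a reflection-based diff, not Keel's actual ExcludedFromDiff machinery), an annotation can keep a property like imageId out of the computed diff while the value stays available for other uses such as launching the Orca task:

import kotlin.reflect.full.findAnnotation
import kotlin.reflect.full.memberProperties

@Target(AnnotationTarget.PROPERTY)
annotation class IgnoreInDiff

data class ServerGroupImage(
  @property:IgnoreInDiff val imageId: String,
  val appVersion: String?,
  val baseImageVersion: String?
)

// compares two instances property by property, skipping anything annotated
fun diff(current: ServerGroupImage, desired: ServerGroupImage): Map<String, Pair<Any?, Any?>> =
  ServerGroupImage::class.memberProperties
    .filter { it.findAnnotation<IgnoreInDiff>() == null }
    .mapNotNull { prop ->
      val before = prop.get(current)
      val after = prop.get(desired)
      if (before != after) prop.name to (before to after) else null
    }
    .toMap()

fun main() {
  val current = ServerGroupImage("ami-111", "myapp-1.0", "bionic-1.0")
  val desired = ServerGroupImage("ami-222", "myapp-1.0", "bionic-1.1")
  // imageId differs but is excluded, so only baseImageVersion appears in the diff
  println(diff(current, desired)) // {baseImageVersion=(bionic-1.0, bionic-1.1)}
}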

@robfletcher robfletcher marked this pull request as ready for review November 19, 2020 01:18
Comment on lines -155 to -182
private fun getAllTags(image: NamedImage): Map<String, String> {
val allTags = HashMap<String, String>()
image.tagsByImageId.forEach { (_, tags) ->
tags?.forEach { k, v ->
if (v != null) {
allTags[k] = v
}
}
}
return allTags
}

private fun amiMatches(tags: Map<String, String>, buildHost: String, buildName: String, buildNumber: String): Boolean {
if (!tags.containsKey("build_host") || !tags.containsKey("appversion") || tags["build_host"] != buildHost) {
return false
}

val appversion = tags["appversion"]?.split("/") ?: error("appversion tag is missing")

if (appversion.size != 3) {
return false
}

if (appversion[1] != buildName || appversion[2] != buildNumber) {
return false
}
return true
}
@robfletcher (Contributor, Author):

Both these methods were unused

@robfletcher robfletcher changed the title from "prevent overly-eagerly extending cache TTL" to "prevent unintentionally extending cache TTL" on Nov 19, 2020
@robfletcher robfletcher force-pushed the avoid-refreshing-cache-ttl branch 2 times, most recently from afbeba6 to ef8e164, on November 19, 2020 21:50
@emjburns (Contributor) left a comment:

Seems reasonable to me

@lorin (Contributor) left a comment:

My understanding

The ImageResolver uses an in-memory cache to cache the AMIs (image IDs) associated with the most recent version of a baked image for an artifact version. Because of AWS's architecture, the same logical version of an (artifact version, base os) will have a different AMI in each region, so we need a separate cache entry for each region that an artifact will be deployed to, as specified in the delivery config. (Presumably, the same happens for each account.)

This means that an AMI is uniquely identified by a (version, account, region) triple. However, the value stored in the cache is a NamedImage object which contains the AMIs for all of the accounts/regions associated with the image name.

Because the cache is in-memory, the state of the cache can vary by instance. For example, after a baseos update, one keel instance may have a stale version of the latest baseos AMI cached, while another instance has the newer one.

If two different keel instances have different views of the latest version of baseos, then we can get flapping behavior where the ImageResolvers on the different instances keep computing different values for the desired state of the image id of the cluster.

The namedImageCache of the ImageService has a TTL of 60 minutes, so in theory it should settle down after at most an hour. However, there was logic in the ImageService to handle the case where the image was baked in multiple regions. In that case, the retrieved value was re-added to the cache for the other (version, account, region) triples. By re-adding them to the cache, this had the effect of resetting the TTLs for those other regions. If the image resolver queried the cache for two different regions often enough, the cache would never expire.
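To illustrate the TTL-reset behavior concretely, here is a small sketch assuming Caffeine's expireAfterWrite semantics (not code from this PR): every put counts as a write and restarts the expiration clock, so an entry that keeps getting re-primed never expires.

import com.github.benmanes.caffeine.cache.Caffeine
import java.time.Duration

fun main() {
  val cache = Caffeine.newBuilder()
    .expireAfterWrite(Duration.ofMinutes(60))
    .build<String, String>()

  // first write: the entry is due to expire 60 minutes from now
  cache.put("appVersion/account/us-west-2", "ami-older")

  // if another lookup primes the same key again before expiry (even with the
  // same value), the 60-minute clock restarts from this write, so a stale image
  // can outlive its intended TTL for as long as priming keeps happening
  cache.put("appVersion/account/us-west-2", "ami-older")
}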

The proposed fix is to only re-add data to the cache in cases where:

  • it's not there yet
  • the cached data is stale

Here's the decision table. Note that we only care about cases where there's multi-region data, since the cache is only updated for regions other than the one that was just retrieved.

Multi-region data | (version, account, region) entry exists in cache | cached image is older than retrieved data | prime cache?
T | T | T | T
T | T | F | F
T | F | * | T
F | * | * | F

Feedback

Looks good to me!

While this fix doesn't prevent the problem of inconsistent cached image data across keel nodes, it's definitely an improvement over the current situation.

@robfletcher robfletcher added the "ready to merge" label (Approved and ready for merge) on Nov 21, 2020
@mergify mergify bot merged commit 81c1b84 into spinnaker:master Nov 21, 2020
@mergify mergify bot added the auto merged label Nov 21, 2020
@robfletcher robfletcher deleted the avoid-refreshing-cache-ttl branch November 21, 2020 01:56