
prevent unintentionally extending cache TTL #1666

Merged
3 commits merged into spinnaker:master from avoid-refreshing-cache-ttl on Nov 21, 2020

Conversation

robfletcher (Contributor)

An issue arose recently where a new base AMI caused a new version of an image to get baked. Because the older image was cached on some instances, Keel could arbitrarily pick either of the two available images (the one with the older base AMI and the one with the newer one). This caused the desired state to flip-flop depending on which Keel instance checked it.

Exacerbating this is the fact that the cache is aggressively primed with any other regions found in the image. This meant that although the cache has a TTL of 1 hour, it kept getting refreshed with the same value, repeatedly extending the TTL.

This PR attempts to ensure we only prime the cache with additional regions if it currently doesn't contain a mapping for the app version / account / region combination, or the cached image is older than the one just retrieved from CloudDriver.

Comment on lines -138 to +137
- namedImageCache.put(NamedImageCacheKey(appVersion, account, otherRegion), completedFuture(image))
+ namedImageCache.putIfMissingOrOlderThan(key.copy(region = otherRegion), image)
@robfletcher (Contributor, Author) commented on Nov 19, 2020:

Instead of always adding the image to the cache for any other regions, only do so if it's not cached currently, or the current cached image is older than this one.

This should prevent us from unnecessarily extending the TTL of the cache entry as we will only add it once.

Comment on lines +142 to +154
private fun AsyncCache<NamedImageCacheKey, NamedImage>.putIfMissingOrOlderThan(
key: NamedImageCacheKey,
image: NamedImage
) {
synchronous().apply {
val updateCache = getIfPresent(key)
?.let { it.creationDate < image.creationDate }
?: true
if (updateCache) {
put(key, image)
}
}
}
@robfletcher (Contributor, Author):

This is the implementation of that logic. It operates on a synchronous view of the cache: since it uses getIfPresent there's no call to CloudDriver, so there's no need to handle the result asynchronously.
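For anyone unfamiliar with Caffeine's two cache views, here is a minimal sketch (the key and value are simple stand-ins, not Keel's real NamedImageCacheKey / NamedImage types) of how an AsyncCache's synchronous() view allows a plain getIfPresent check without dealing with futures:

import com.github.benmanes.caffeine.cache.AsyncCache
import com.github.benmanes.caffeine.cache.Caffeine
import java.time.Duration
import java.util.concurrent.CompletableFuture

fun main() {
  // hypothetical cache with the same 1-hour TTL described in this PR
  val cache: AsyncCache<String, String> = Caffeine.newBuilder()
    .expireAfterWrite(Duration.ofMinutes(60))
    .buildAsync()

  // the async view deals in futures...
  cache.put("key", CompletableFuture.completedFuture("ami-123"))

  // ...while the synchronous view exposes the same entries directly;
  // getIfPresent only inspects the cache, so nothing remote (e.g. CloudDriver)
  // is called here
  val cached: String? = cache.synchronous().getIfPresent("key")
  println(cached) // ami-123

  // writes through the synchronous view update the same underlying cache
  cache.synchronous().put("key", "ami-456")
}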

Comment on lines 71 to 75
data class ActiveServerGroupImage(
@get:ExcludedFromDiff
val imageId: String,
val appVersion: String?,
val baseImageVersion: String?,
@robfletcher (Contributor, Author):

There seems to be no compelling reason to include imageId in the diff we generate. It's meaningless to users and the information we need to decide whether to redeploy is contained in appVersion and baseImageVersion. If either of those changes, we'll re-deploy. The imageId field is necessary when we launch the Orca task but not in the diff.
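As a rough illustration of the idea (a hypothetical sketch with a made-up IgnoreInDiff annotation and a reflection-based diff, not Keel's actual ExcludedFromDiff machinery), an annotation can keep a property like imageId out of the computed diff while the value stays available for other uses such as launching the Orca task:

import kotlin.reflect.full.findAnnotation
import kotlin.reflect.full.memberProperties

@Target(AnnotationTarget.PROPERTY)
annotation class IgnoreInDiff

data class ServerGroupImage(
  @property:IgnoreInDiff val imageId: String,
  val appVersion: String?,
  val baseImageVersion: String?
)

// compares two instances property by property, skipping anything annotated
fun diff(current: ServerGroupImage, desired: ServerGroupImage): Map<String, Pair<Any?, Any?>> =
  ServerGroupImage::class.memberProperties
    .filter { it.findAnnotation<IgnoreInDiff>() == null }
    .mapNotNull { prop ->
      val before = prop.get(current)
      val after = prop.get(desired)
      if (before != after) prop.name to (before to after) else null
    }
    .toMap()

fun main() {
  val current = ServerGroupImage("ami-111", "myapp-1.0", "bionic-1.0")
  val desired = ServerGroupImage("ami-222", "myapp-1.0", "bionic-1.1")
  // imageId differs but is excluded, so only baseImageVersion appears in the diff
  println(diff(current, desired)) // {baseImageVersion=(bionic-1.0, bionic-1.1)}
}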

@robfletcher robfletcher marked this pull request as ready for review November 19, 2020 01:18
Comment on lines -155 to -182
private fun getAllTags(image: NamedImage): Map<String, String> {
val allTags = HashMap<String, String>()
image.tagsByImageId.forEach { (_, tags) ->
tags?.forEach { k, v ->
if (v != null) {
allTags[k] = v
}
}
}
return allTags
}

private fun amiMatches(tags: Map<String, String>, buildHost: String, buildName: String, buildNumber: String): Boolean {
if (!tags.containsKey("build_host") || !tags.containsKey("appversion") || tags["build_host"] != buildHost) {
return false
}

val appversion = tags["appversion"]?.split("/") ?: error("appversion tag is missing")

if (appversion.size != 3) {
return false
}

if (appversion[1] != buildName || appversion[2] != buildNumber) {
return false
}
return true
}
@robfletcher (Contributor, Author):

Both these methods were unused

@robfletcher robfletcher changed the title from "prevent overly-eagerly extending cache TTL" to "prevent unintentionally extending cache TTL" on Nov 19, 2020
@robfletcher robfletcher force-pushed the avoid-refreshing-cache-ttl branch 2 times, most recently from afbeba6 to ef8e164, on November 19, 2020 21:50
@emjburns (Contributor) left a comment:

Seems reasonable to me

@lorin (Contributor) left a comment:

My understanding

The ImageResolver uses an in-memory cache to cache the AMIs (image IDs) associated with the most recent version of a baked image for an artifact version. Because of AWS's architecture, the same logical version of an (artifact version, base os) will have a different AMI in each region, so we need a separate cache entry for each region that an artifact will be deployed to, as specified in the delivery config. (Presumably, the same happens for each account.)

This means that an AMI is uniquely identified by a (version, account, region) triple. However, the value stored in the cache is a NamedImage object which contains the AMIs for all of the accounts/regions associated with the image name.

Because the cache is in-memory, the state of the cache can vary by instance. For example, after a baseos update, one keel instance may have a stale version of the latest baseos AMI cached, while another instance has the newer one.

If two different keel instances have different views of the latest version of baseos, then we can get flapping behavior where the ImageResolvers on the different instances keep computing different values for the desired state of the image id of the cluster.

The namedImageCache of the ImageService has a TTL of 60 minutes, so in theory it should settle down after at most an hour. However, there was logic in the ImageService to handle the case where the image was baked in multiple regions. In that case, the retrieved value was re-added to the cache for the other (version, account, region) triples. By re-adding them to the cache, this had the effect of resetting the TTLs for those other regions. If the image resolver queried the cache for two different regions often enough, the cache would never expire.
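To illustrate the TTL-reset behavior concretely, here is a small sketch assuming Caffeine's expireAfterWrite semantics (not code from this PR): every put counts as a write and restarts the expiration clock, so an entry that keeps getting re-primed never expires.

import com.github.benmanes.caffeine.cache.Caffeine
import java.time.Duration

fun main() {
  val cache = Caffeine.newBuilder()
    .expireAfterWrite(Duration.ofMinutes(60))
    .build<String, String>()

  // first write: the entry is due to expire 60 minutes from now
  cache.put("appVersion/account/us-west-2", "ami-older")

  // if another lookup primes the same key again before expiry (even with the
  // same value), the 60-minute clock restarts from this write, so a stale image
  // can outlive its intended TTL for as long as priming keeps happening
  cache.put("appVersion/account/us-west-2", "ami-older")
}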

The proposed fix is to only re-add data to the cache in cases where:

  • it's not there yet
  • the cached data is stale

Here's the decision table. Note that we only care about cases where there's multi-region data, since the cache is only updated for regions other than the one that was just retrieved.

Multi-region data | (version, account, region) entry exists in cache | cached image is older than retrieved data | prime cache?
T | T | T | T
T | T | F | F
T | F | * | T
F | * | * | F

Feedback

Looks good to me!

While this fix doesn't prevent the problem of inconsistent cached image data across keel nodes, it's definitely an improvement over the current situation.

@robfletcher robfletcher added the "ready to merge" label (Approved and ready for merge) on Nov 21, 2020
@mergify mergify bot merged commit 81c1b84 into spinnaker:master Nov 21, 2020
@mergify mergify bot added the auto merged label Nov 21, 2020
@robfletcher robfletcher deleted the avoid-refreshing-cache-ttl branch November 21, 2020 01:56