prevent unintentionally extending cache TTL #1666
Conversation
Force-pushed from 505cf70 to 9c3f985
```diff
- namedImageCache.put(NamedImageCacheKey(appVersion, account, otherRegion), completedFuture(image))
+ namedImageCache.putIfMissingOrOlderThan(key.copy(region = otherRegion), image)
```
Instead of always adding the image to the cache for the other regions, only do so if it's not currently cached or the cached image is older than this one. This should prevent us from unnecessarily extending the TTL of the cache entry, as we will only add it once.
```kotlin
private fun AsyncCache<NamedImageCacheKey, NamedImage>.putIfMissingOrOlderThan(
  key: NamedImageCacheKey,
  image: NamedImage
) {
  synchronous().apply {
    val updateCache = getIfPresent(key)
      ?.let { it.creationDate < image.creationDate }
      ?: true
    if (updateCache) {
      put(key, image)
    }
  }
}
```
This is the implementation of that logic. It operates on a synchronous view of the cache: since it uses `getIfPresent`, it never calls out to CloudDriver, so there's no need to handle the result asynchronously.
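The compare-and-put decision can be sketched in isolation against a plain map. `FakeImage` below is a hypothetical stand-in for the real `NamedImage` (only its `creationDate` matters here), and the real code operates on a Caffeine `AsyncCache` rather than a `MutableMap`:

```kotlin
import java.time.Instant

// Hypothetical stand-in for NamedImage; only creationDate matters for this logic.
data class FakeImage(val id: String, val creationDate: Instant)

// Same decision as putIfMissingOrOlderThan, against a plain MutableMap:
// write only when the key is absent or the cached image is older.
fun MutableMap<String, FakeImage>.putIfMissingOrOlderThan(key: String, image: FakeImage) {
  val updateCache = get(key)
    ?.let { it.creationDate < image.creationDate }
    ?: true
  if (updateCache) {
    put(key, image)
  }
}

fun main() {
  val cache = mutableMapOf<String, FakeImage>()
  val older = FakeImage("ami-1", Instant.parse("2020-01-01T00:00:00Z"))
  val newer = FakeImage("ami-2", Instant.parse("2020-06-01T00:00:00Z"))

  cache.putIfMissingOrOlderThan("us-east-1", older) // missing -> written
  cache.putIfMissingOrOlderThan("us-east-1", newer) // cached is older -> overwritten
  cache.putIfMissingOrOlderThan("us-east-1", older) // cached is newer -> no-op

  println(cache["us-east-1"]?.id) // ami-2
}
```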
```kotlin
data class ActiveServerGroupImage(
  @get:ExcludedFromDiff
  val imageId: String,
  val appVersion: String?,
  val baseImageVersion: String?,
```
There seems to be no compelling reason to include `imageId` in the diff we generate. It's meaningless to users, and the information we need to decide whether to redeploy is contained in `appVersion` and `baseImageVersion`. If either of those changes, we'll re-deploy. The `imageId` field is necessary when we launch the Orca task, but not in the diff.
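That redeploy decision can be illustrated with a small sketch. `ImageFields` and `needsRedeploy` below are hypothetical names, not part of the PR; they just show that a changed `imageId` alone never triggers a redeploy:

```kotlin
// Hypothetical sketch of the redeploy decision: imageId is deliberately
// ignored, mirroring @get:ExcludedFromDiff on ActiveServerGroupImage.
data class ImageFields(
  val imageId: String, // excluded from the diff
  val appVersion: String?,
  val baseImageVersion: String?
)

fun needsRedeploy(current: ImageFields, desired: ImageFields): Boolean =
  current.appVersion != desired.appVersion ||
    current.baseImageVersion != desired.baseImageVersion

fun main() {
  val current = ImageFields("ami-111", "app-1.0", "bionic-2020.01")
  val sameButNewAmi = ImageFields("ami-222", "app-1.0", "bionic-2020.01")
  val newBase = ImageFields("ami-333", "app-1.0", "bionic-2020.06")

  println(needsRedeploy(current, sameButNewAmi)) // false: only imageId differs
  println(needsRedeploy(current, newBase))       // true: base image changed
}
```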
Force-pushed from ef97bd4 to d188688
```kotlin
private fun getAllTags(image: NamedImage): Map<String, String> {
  val allTags = HashMap<String, String>()
  image.tagsByImageId.forEach { (_, tags) ->
    tags?.forEach { k, v ->
      if (v != null) {
        allTags[k] = v
      }
    }
  }
  return allTags
}

private fun amiMatches(tags: Map<String, String>, buildHost: String, buildName: String, buildNumber: String): Boolean {
  if (!tags.containsKey("build_host") || !tags.containsKey("appversion") || tags["build_host"] != buildHost) {
    return false
  }

  val appversion = tags["appversion"]?.split("/") ?: error("appversion tag is missing")

  if (appversion.size != 3) {
    return false
  }

  if (appversion[1] != buildName || appversion[2] != buildNumber) {
    return false
  }
  return true
}
```
Both of these methods were unused.
Force-pushed from afbeba6 to ef8e164
Seems reasonable to me
Force-pushed from ef8e164 to f2dccf6
My understanding
The ImageResolver uses an in-memory cache to store the AMIs (image ids) associated with the most recent version of a baked image for an artifact version. Because of AWS's architecture, the same logical (artifact version, base OS) pair will have a different AMI in each region, so we need a separate cache entry for each region an artifact will be deployed to, as specified in the delivery config. (Presumably, the same applies per account.)
This means that an AMI is uniquely identified by a (version, account, region) triple. However, the value stored in the cache is a `NamedImage` object, which contains the AMIs for all of the accounts/regions associated with the image name.
Because the cache is in-memory, its state can vary by instance. For example, after a base-OS update, one keel instance may have a stale version of the latest base-OS AMI while another instance has the newer one.
If two different keel instances have different views of the latest base-OS version, we can get flapping behavior where the ImageResolvers on the different instances keep computing different values for the desired state of the cluster's image id.
The namedImageCache of the ImageService has a TTL of 60 minutes, so in theory things should settle down after at most an hour. However, there was logic in the ImageService to handle the case where the image was baked in multiple regions: the retrieved value was re-added to the cache for the other (version, account, region) triples. Re-adding those entries had the effect of resetting their TTLs, so if the image resolver queried the cache for two different regions often enough, the cache entries would never expire.
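The TTL-reset effect can be shown with a toy expire-after-write map (not Caffeine, and not the real cache) driven by a fake clock. Every `put` restamps the write time, so an unconditional re-put keeps a stale entry alive past its original expiry:

```kotlin
// Toy illustration of why unconditional re-puts extend the TTL:
// an expire-after-write map where every put() restamps the write time.
class ExpiringMap<K, V>(private val ttlMillis: Long) {
  private data class Entry<V>(val value: V, val writtenAt: Long)
  private val entries = mutableMapOf<K, Entry<V>>()
  var now: Long = 0 // fake clock, advanced by the caller

  fun put(key: K, value: V) {
    entries[key] = Entry(value, now) // put() always resets the write timestamp
  }

  fun get(key: K): V? {
    val e = entries[key] ?: return null
    if (now - e.writtenAt >= ttlMillis) {
      entries.remove(key)
      return null
    }
    return e.value
  }
}

fun main() {
  val cache = ExpiringMap<String, String>(ttlMillis = 60)

  cache.put("ami", "v1")
  cache.now = 40
  cache.put("ami", "v1")    // unconditional re-put: TTL restarts at t=40
  cache.now = 80            // 80ms after first write, 40ms after re-put
  println(cache.get("ami")) // v1 -- the stale entry is still alive
}
```

Without the re-put at t=40, the entry would have expired at t=60 and the lookup at t=80 would have gone back to CloudDriver.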
The proposed fix is to only re-add data to the cache in cases where:
- it's not there yet
- the cached data is stale
Here's the decision table. Note that we only care about cases where there's multi-region data, since the cache is only updated for regions other than the one that was just retrieved.

| Multi-region data | (version, account, region) entry exists in cache | Cached image is older than retrieved data | Prime cache? |
|---|---|---|---|
| T | T | T | T |
| T | T | F | F |
| T | F | * | T |
| F | * | * | F |
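The table above can be written as a small function. `shouldPrimeCache` is a hypothetical name for illustration; `cachedIsOlder` is `null` when no entry exists for the (version, account, region) key (the `*` rows):

```kotlin
// The decision table as code: prime the cache for another region only when
// there is multi-region data and the entry is missing or stale.
fun shouldPrimeCache(
  multiRegion: Boolean,
  entryExists: Boolean,
  cachedIsOlder: Boolean? // null when there is no cached entry
): Boolean {
  if (!multiRegion) return false // only other regions are ever primed
  if (!entryExists) return true  // missing -> prime
  return cachedIsOlder == true   // present -> prime only if stale
}

fun main() {
  println(shouldPrimeCache(true, true, true))    // T T T -> true
  println(shouldPrimeCache(true, true, false))   // T T F -> false
  println(shouldPrimeCache(true, false, null))   // T F * -> true
  println(shouldPrimeCache(false, false, null))  // F * * -> false
}
```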
Feedback
Looks good to me!
While this fix doesn't prevent the problem of inconsistent cached image data across keel nodes, it's definitely an improvement over the current situation.
An issue arose recently where a new base AMI caused a new version of an image to get baked. Because the older image was still cached on some instances, Keel could arbitrarily pick either of the two available images (the one with the older base AMI or the one with the newer base AMI). This caused the desired state to flip-flop depending on which Keel instance checked it.
Exacerbating this is the fact that the cache is aggressively primed with any other regions found in the image. This meant that although the cache has a TTL of 1 hour, it kept getting refreshed with the same value, extending the TTL.
This PR attempts to ensure we only prime the cache with additional regions if it currently doesn't contain a mapping for the app version / account / region combination, or the cached image is older than the one just retrieved from CloudDriver.