-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(export): Export deployment strategy #1055
feat(export): Export deployment strategy #1055
Conversation
val rollbackOnFailure: Boolean? = true, | ||
val resizePreviousToZero: Boolean? = false, | ||
val maxServerGroups: Int? = 2, | ||
val delayBeforeDisable: Duration? = ZERO, | ||
val delayBeforeScaleDown: Duration? = ZERO, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to make these nullable in order to fix #625
keel-core/src/main/kotlin/com/netflix/spinnaker/keel/core/api/ClusterDeployStrategy.kt
Show resolved
Hide resolved
keel-core/src/main/kotlin/com/netflix/spinnaker/keel/plugin/ResourceSpecExportHelper.kt
Outdated
Show resolved
Hide resolved
@@ -310,7 +316,7 @@ class ClusterHandler( | |||
} | |||
|
|||
override suspend fun export(exportable: Exportable): ClusterSpec { | |||
// Get existing infrastructure | |||
// Get existing infrastructure -- this is a very costly call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this note because it routinely takes more than 30 seconds to retrieve the details for a single server group from clouddriver, which is insane... I'm not sure we could do much from our end, since it's not really information we would want to cache... Or maybe we do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth putting a metric and alert to make sure we don't end up calling this too often?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question... One of my goals for the week is to take some time to teach myself metrics and Atlas (for real this time, unlike during onboarding 😝), so I could definitely add one here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it routinely takes more than 30 seconds to retrieve the details for a single server group from clouddriver
wut? unless we're talking like p99 latency, that seems unbelievably wrong and bad. do you have a sense of what percentile 'routinely' maps to, and if it's not something crazy high like p99 have you considered reaching out to our platform team folks to figure out why we're seeing such unreasonable latency on a regular basis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Routinely as in > 90% of the calls. I believe I did reach out in chat about this at one point, and meant to do it again when I was testing this, but ended up forgetting. Will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: collected some data from the logs and it turns out it's not a single call, but a combination of dozens of calls with near-second response times. But it looks like we can improve on our end, so digging into that now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@luispollo would you mind creating a separate issue to look into why this is taking 30+ seconds? You can assign it to me - I'm happy to look into it. I also feel it shouldn't take that long, maybe I can make it better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is: #1060
I'm already on it: spinnaker/clouddriver#4540
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth removing this comment if you've fixed the problem? or do we still need to see if this fixes the problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still testing. Will remove the comment anyway as it's not actionable.
* server group. | ||
*/ | ||
private fun ServerGroup.discoverDeploymentStrategy(): ClusterDeployStrategy? { | ||
val entityTags = runBlocking { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First step is to look for the spinnaker:metadata
entity tag.
|
||
val executionType = spinnakerMetadata["executionType"]!!.toString() | ||
val executionId = spinnakerMetadata["executionId"]!!.toString() | ||
val execution = runBlocking { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once found, we look up the corresponding execution in orca.
} | ||
|
||
// get context from the appropriate execution stage for the execution type, drilling down into the data as needed | ||
val context = if (executionType == "pipeline") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This set of statements finds the right spot in the data structure (which is different between a task execution and a pipeline execution) to extract the context data we need.
keel-core/src/main/kotlin/com/netflix/spinnaker/keel/core/api/ClusterDeployStrategy.kt
Show resolved
Hide resolved
keel-ec2-plugin/src/main/kotlin/com/netflix/spinnaker/keel/ec2/resource/ClusterHandler.kt
Outdated
Show resolved
Hide resolved
} | ||
|
||
return when (val strategy = context?.get("strategy")) { | ||
"redblack" -> RedBlack.fromOrcaStageContext(context) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally, once we've found the strategy + other context bits, convert those into an instance of the corresponding ClusterDeployStrategy
sub-type.
@@ -400,18 +502,6 @@ class ClusterHandler( | |||
} | |||
} | |||
|
|||
override suspend fun actuationInProgress(resource: Resource<ClusterSpec>) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just moved up above to come before the private extension functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some questions, but nothing that would block merge.
keel-core/src/main/kotlin/com/netflix/spinnaker/keel/core/api/ClusterDeployStrategy.kt
Outdated
Show resolved
Hide resolved
keel-core/src/main/kotlin/com/netflix/spinnaker/keel/core/api/ClusterDeployStrategy.kt
Show resolved
Hide resolved
keel-core/src/main/kotlin/com/netflix/spinnaker/keel/core/api/ClusterDeployStrategy.kt
Show resolved
Hide resolved
@@ -310,7 +316,7 @@ class ClusterHandler( | |||
} | |||
|
|||
override suspend fun export(exportable: Exportable): ClusterSpec { | |||
// Get existing infrastructure | |||
// Get existing infrastructure -- this is a very costly call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's worth putting a metric and alert to make sure we don't end up calling this too often?
keel-ec2-plugin/src/main/kotlin/com/netflix/spinnaker/keel/ec2/resource/ClusterHandler.kt
Outdated
Show resolved
Hide resolved
keel-ec2-plugin/src/main/kotlin/com/netflix/spinnaker/keel/ec2/resource/ClusterHandler.kt
Outdated
Show resolved
Hide resolved
keel-ec2-plugin/src/main/kotlin/com/netflix/spinnaker/keel/ec2/resource/ClusterHandler.kt
Outdated
Show resolved
Hide resolved
keel-ec2-plugin/src/main/kotlin/com/netflix/spinnaker/keel/ec2/resource/ClusterHandler.kt
Outdated
Show resolved
Hide resolved
keel-ec2-plugin/src/main/kotlin/com/netflix/spinnaker/keel/ec2/resource/ClusterHandler.kt
Outdated
Show resolved
Hide resolved
c914248
to
c412dd0
Compare
It looks like you're only doing this for ec2 - do you have any plans to implement this in a more generic way that would work for both ec2 and titus? |
keel-ec2-plugin/src/main/kotlin/com/netflix/spinnaker/keel/ec2/resource/ClusterHandler.kt
Outdated
Show resolved
Hide resolved
You know, I completely forgot that Titus clusters had a deployment strategy too. 🤦♂️ |
* Provides common logic for multiple cloud plugins to export aspects of compute clusters. | ||
*/ | ||
@Component | ||
class ClusterExportHelper( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure in which module to put this class since it has a dependency on keel-orca
and keel-clouddriver
. It turned out that keel-orca
already had a dependency on keel-clouddriver
, so I decided to put it here. 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me!
application: String, | ||
serverGroupName: String | ||
): ClusterDeployStrategy? { | ||
return kotlinx.coroutines.coroutineScope { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: does this import need to to be so fully qualified?
keel-orca/src/main/kotlin/com/netflix/spinnaker/keel/orca/ClusterExportHelper.kt
Outdated
Show resolved
Hide resolved
any { | ||
!it.isCapacityOrAutoScalingOnly() | ||
} | ||
override suspend fun actuationInProgress(resource: Resource<ClusterSpec>) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just moved up above to come before the private extension functions.
eab0954
to
ec8bea1
Compare
...s-plugin/src/main/kotlin/com/netflix/spinnaker/keel/api/titus/cluster/TitusClusterHandler.kt
Show resolved
Hide resolved
keel-orca/src/main/kotlin/com/netflix/spinnaker/keel/orca/ClusterExportHelper.kt
Outdated
Show resolved
Hide resolved
keel-orca/src/main/kotlin/com/netflix/spinnaker/keel/orca/ClusterExportHelper.kt
Outdated
Show resolved
Hide resolved
keel-orca/src/main/kotlin/com/netflix/spinnaker/keel/orca/ClusterExportHelper.kt
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just two small nits around making an import less fully qualified, and removing the comment around "this call is expensive" if you're actively addressing it
66b15b6
to
5ed5870
Compare
Adds support for exporting the deployment strategy associated with a cluster based on information retrieved from pipelines (for not-yet-migrated apps) and tasks (for already-managed apps or clusters created manually in the UI).
Limitations: currently, the export supports
createServerGroup
tasks anddeploy
pipeline stages only. Support for clone operations, if necessary, will be addressed in a separate PR.Closes #753. As I was working in this area of the code anyway, I've taken the opportunity to also fix #625.