Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(export): Export deployment strategy #1055

Merged
merged 12 commits into from
Apr 24, 2020

Conversation

luispollo
Copy link
Contributor

Adds support for exporting the deployment strategy associated with a cluster based on information retrieved from pipelines (for not-yet-migrated apps) and tasks (for already-managed apps or clusters created manually in the UI).

Limitations: currently, the export supports createServerGroup tasks and deploy pipeline stages only. Support for clone operations, if necessary, will be addressed in a separate PR.

Closes #753. As I was working in this area of the code anyway, I've taken the opportunity to also fix #625.

Comment on lines +36 to +40
val rollbackOnFailure: Boolean? = true,
val resizePreviousToZero: Boolean? = false,
val maxServerGroups: Int? = 2,
val delayBeforeDisable: Duration? = ZERO,
val delayBeforeScaleDown: Duration? = ZERO,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make these nullable in order to fix #625

@@ -310,7 +316,7 @@ class ClusterHandler(
}

override suspend fun export(exportable: Exportable): ClusterSpec {
// Get existing infrastructure
// Get existing infrastructure -- this is a very costly call
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this note because it routinely takes more than 30 seconds to retrieve the details for a single server group from clouddriver, which is insane... I'm not sure we could do much from our end, since it's not really information we would want to cache... Or maybe we do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth putting a metric and alert to make sure we don't end up calling this too often?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question... One of my goals for the week is to take some time to teach myself metrics and Atlas (for real this time, unlike during onboarding 😝), so I could definitely add one here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it routinely takes more than 30 seconds to retrieve the details for a single server group from clouddriver

wut? unless we're talking like p99 latency, that seems unbelievably wrong and bad. do you have a sense of what percentile 'routinely' maps to, and if it's not something crazy high like p99 have you considered reaching out to our platform team folks to figure out why we're seeing such unreasonable latency on a regular basis?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Routinely as in > 90% of the calls. I believe I did reach out in chat about this at one point, and meant to do it again when I was testing this, but ended up forgetting. Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: collected some data from the logs and it turns out it's not a single call, but a combination of dozens of calls with near-second response times. But it looks like we can improve on our end, so digging into that now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luispollo would you mind creating a separate issue to look into why this is taking 30+ seconds? You can assign it to me - I'm happy to look into it. I also feel it shouldn't take that long, maybe I can make it better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it is: #1060
I'm already on it: spinnaker/clouddriver#4540

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth removing this comment if you've fixed the problem? or do we still need to see if this fixes the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still testing. Will remove the comment anyway as it's not actionable.

* server group.
*/
private fun ServerGroup.discoverDeploymentStrategy(): ClusterDeployStrategy? {
val entityTags = runBlocking {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First step is to look for the spinnaker:metadata entity tag.


val executionType = spinnakerMetadata["executionType"]!!.toString()
val executionId = spinnakerMetadata["executionId"]!!.toString()
val execution = runBlocking {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once found, we look up the corresponding execution in orca.

}

// get context from the appropriate execution stage for the execution type, drilling down into the data as needed
val context = if (executionType == "pipeline") {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This set of statements finds the right spot in the data structure (which is different between a task execution and a pipeline execution) to extract the context data we need.

}

return when (val strategy = context?.get("strategy")) {
"redblack" -> RedBlack.fromOrcaStageContext(context)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, once we've found the strategy + other context bits, convert those into an instance of the corresponding ClusterDeployStrategy sub-type.

@@ -400,18 +502,6 @@ class ClusterHandler(
}
}

override suspend fun actuationInProgress(resource: Resource<ClusterSpec>) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved up above to come before the private extension functions.

Copy link
Contributor

@lorin lorin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some questions, but nothing that would block merge.

@@ -310,7 +316,7 @@ class ClusterHandler(
}

override suspend fun export(exportable: Exportable): ClusterSpec {
// Get existing infrastructure
// Get existing infrastructure -- this is a very costly call
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's worth putting a metric and alert to make sure we don't end up calling this too often?

@emjburns
Copy link
Contributor

It looks like you're only doing this for ec2 - do you have any plans to implement this in a more generic way that would work for both ec2 and titus?

@luispollo
Copy link
Contributor Author

It looks like you're only doing this for ec2 - do you have any plans to implement this in a more generic way that would work for both ec2 and titus?

You know, I completely forgot that Titus clusters had a deployment strategy too. 🤦‍♂️

* Provides common logic for multiple cloud plugins to export aspects of compute clusters.
*/
@Component
class ClusterExportHelper(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure in which module to put this class since it has a dependency on keel-orca and keel-clouddriver. It turned out that keel-orca already had a dependency on keel-clouddriver, so I decided to put it here. 🤷‍♂️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable to me!

@luispollo luispollo self-assigned this Apr 23, 2020
application: String,
serverGroupName: String
): ClusterDeployStrategy? {
return kotlinx.coroutines.coroutineScope {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does this import need to to be so fully qualified?

any {
!it.isCapacityOrAutoScalingOnly()
}
override suspend fun actuationInProgress(resource: Resource<ClusterSpec>) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved up above to come before the private extension functions.

Copy link
Contributor

@emjburns emjburns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just two small nits around making an import less fully qualified, and removing the comment around "this call is expensive" if you're actively addressing it

@luispollo luispollo merged commit a4cdc68 into spinnaker:master Apr 24, 2020
@luispollo luispollo deleted the export-deployment-strategy branch April 24, 2020 00:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants