
Large deployment pipelines with multiple children results in huge pipeline body #6159

Open
titirigaiulian opened this issue Nov 5, 2020 · 23 comments
Labels
no-lifecycle · pipelines/executions · sig/none (Issues that do not fall under scope of any SIG)

Comments

@titirigaiulian

Issue Summary:

For large k8s deployment pipelines with multiple parent levels, the pipeline body grows exponentially, causing high load on the DB.

Description:

Use case

Some of our teams have large deployments of tightly coupled components across multiple services, clusters, regions, and environments. The deployment is structured as follows:
Level 1: Generic deployment pipeline for one service
Level 2: Deployment pipeline for multiple coupled services
Level 3: Deployment for one region
Level 4: Deploy full env
Level 5: Full deploy: dev->stage->prod

Each Level 1 child pipeline has its own bake + deploy stages so it can be executed independently.
This flow helps us create wave deployments.

The full deployment evolves to a tree like:

[image: tree diagram of the nested deployment pipeline levels]
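
Concretely, every non-leaf level is just a pipeline whose stages are Pipeline stages that trigger the next level down. A minimal, illustrative sketch of a Level 3 "region" pipeline fanning out to two Level 2 pipelines (not an exact pipeline definition; names and config IDs are placeholders and unrelated fields are omitted):

```json
{
  "name": "deploy-region-us-east-1",
  "application": "myapp",
  "stages": [
    {
      "refId": "1",
      "requisiteStageRefIds": [],
      "type": "pipeline",
      "name": "Deploy service group A (Level 2)",
      "application": "myapp",
      "pipeline": "<level-2-pipeline-config-id>",
      "waitForCompletion": true
    },
    {
      "refId": "2",
      "requisiteStageRefIds": ["1"],
      "type": "pipeline",
      "name": "Deploy service group B (Level 2)",
      "application": "myapp",
      "pipeline": "<another-level-2-pipeline-config-id>",
      "waitForCompletion": true
    }
  ]
}
```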

Steps to Reproduce:

Problem 1:

Each leaf of the tree (a Level 1 pipeline) produces k8s manifests after deployment and appends them to its outputs. In the pipeline outputs I can see the manifests multiplied under different keys: outputs.createdArtifacts, optionalArtifacts, etc.
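
Schematically, this is the shape we see in a leaf execution (key names as above, values trimmed; exact artifact types vary), with the same manifest content repeated under several output keys:

```json
{
  "outputs": {
    "manifests": ["<full rendered manifest>"],
    "createdArtifacts": [
      { "name": "myservice", "reference": "<the same manifest content>" }
    ],
    "optionalArtifacts": [
      { "name": "myservice", "reference": "<the same manifest content>" }
    ]
  }
}
```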

Problem 2:

Each child pipeline receives the entire parent pipeline payload as its trigger. For the leaf pipeline this results in 5 levels of nested trigger -> parentPipeline.
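
Schematically, the trigger of a leaf (Level 1) execution ends up embedding every ancestor execution, each of which also carries its own stages and outputs. A heavily trimmed sketch (key names follow the trigger -> parentPipeline chain described above; names are placeholders):

```json
{
  "name": "deploy-one-service (Level 1 leaf)",
  "trigger": {
    "type": "pipeline",
    "parentPipeline": {
      "name": "deploy-coupled-services (Level 2)",
      "trigger": {
        "type": "pipeline",
        "parentPipeline": {
          "name": "deploy-region (Level 3)",
          "trigger": {
            "type": "pipeline",
            "parentPipeline": {
              "name": "deploy-env (Level 4)",
              "trigger": {
                "type": "pipeline",
                "parentPipeline": { "name": "full-deploy dev->stage->prod (Level 5)" }
              }
            }
          }
        }
      }
    }
  }
}
```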

Neither problem causes much trouble on its own, but when both are hit, the pipelines become unmanageable.

Additional Details:

Both issues are currently in progress:

#5909

spinnaker/orca#3986

Alternative

Introduce a flag in the PipelineStage context, e.g. skipDownstreamOutput, which can be set at the desired parent level and clears the outputs propagated from the child pipeline. This ends up splitting the tree into smaller sub-trees and manageable pipelines without losing the ability to have large orchestrations and modular pipelines.
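
As a sketch of what that could look like on the Pipeline stage that should act as the output boundary (skipDownstreamOutput is the name proposed here; the exact flag name and placement may differ in a final implementation, and other fields are omitted):

```json
{
  "refId": "1",
  "type": "pipeline",
  "name": "Deploy region us-east-1",
  "application": "myapp",
  "pipeline": "<child-pipeline-config-id>",
  "waitForCompletion": true,
  "skipDownstreamOutput": true
}
```

The stage would still run and monitor the child pipeline, but with the flag set it would not copy the child's outputs back into the parent execution, so each sub-tree keeps its outputs to itself.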

@Hertog-PJ

I can confirm that we run into this exact same issue. The proposed fix would be really helpful.

@emagana-zz

Salesforce use cases are also affected by this issue. We have a short/medium-term solution that compresses the body payload; PR on its way. Thank you!

@Nirmalyasen

Somewhere around the 1.20 release, the number of artifacts produced by a Deploy Manifest stage increased significantly. Combined with the output context carried from one child pipeline to the next, this is bloating the execution context. This has two effects:

  1. The size of the payload stored in the database
  2. The size of the payload being downloaded by UI

This is becoming a huge performance issue that needs to be addressed!!!

@titirigaiulian
Author

@Nirmalyasen the issue started for us after upgrading from 1.20.6, so I think this performance regression was introduced somewhere in 1.21 or 1.22, but I didn't manage to find a specific commit.
@emagana compressing sounds like a good idea. Just curious, do you have any numbers for the JSON size before and after compression?
We did a quick workaround by cleaning the outputs after the parent execution, behind a feature flag, but this is only useful if you do not use outputs from previous executions.

@titirigaiulian
Author

titirigaiulian commented Dec 17, 2020

Further investigations so far:
It can also affect the performance of single pipelines with large manifests, e.g. baking a 1 MB manifest + deploying it results in >15 MB of data (a rough tally follows below).
There is an open issue #6159 for replacing full manifests with coordinates, but even with the full manifest in the context there are still a lot of duplicated manifests.

I can see the baked manifest exists 6 times in the execution for a simple bake-and-deploy pipeline, under different keys:

  • Bake stage:
    • context.artifacts.reference
    • context.resolvedExpectedArtifacts.boundArtifact.reference
    • outputs.artifacts.reference
    • outputs.resolvedExpectedArtifacts.boundArtifact.reference
  • Deploy manifest stage:
    • context.optionalArtifacts.reference
    • outputs.optionalArtifacts.reference

The full manifest also exists several times:

  • context.manifests
  • context.kato.tasks.resultObjects.manifests
  • context.outputs.manifests
  • outputs.manifests
  • outputs.outputs.manifests
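
As a rough sanity check against the >15 MB figure above (assuming a 1 MB rendered manifest): base64 encoding adds roughly a third, so the six base64 artifact copies alone are ~8 MB, and the additional four to five plain copies of the full manifest add ~4-5 MB, i.e. ~12-13 MB of duplicated manifest data before the rest of the stage context is even counted.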

Some conclusions:

  • Some keys are duplicated under context and outputs for the base64 artifact produced after bake
  • For the full manifests, outputs.manifests exists at the top level, but it is also duplicated in context and once again nested under outputs (???). I don't think this is the desired behavior

Could the actual value of a large baked manifest be replaced with a reference?
Even with the full manifests in the pipeline, it could be manageable with some cleanup of this duplication.
cc @maggieneterval

P.S. I see there is a CleanupArtifactsTask in the deploy stage; it seems it's not doing its job :)

@Nirmalyasen

The other thing I noticed is that outputs.manifests (in a Deploy Manifest stage) has an annotation, kubectl.kubernetes.io/last-applied-configuration, that is a copy of the applied manifest. So the outputs.manifests attribute almost doubles in size by itself! Is that the result of a kubectl version upgrade?
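
For context, this is standard kubectl apply behaviour rather than something Spinnaker-specific: the applied object gets an annotation containing a JSON-serialized copy of the full manifest that was applied, so any output that captures the live object carries the manifest twice. Illustrative shape (trimmed):

```json
{
  "apiVersion": "apps/v1",
  "kind": "Deployment",
  "metadata": {
    "name": "myservice",
    "annotations": {
      "kubectl.kubernetes.io/last-applied-configuration": "{\"apiVersion\":\"apps/v1\",\"kind\":\"Deployment\",\"metadata\":{\"name\":\"myservice\"},\"spec\":{\"replicas\":3}}"
    }
  },
  "spec": { "replicas": 3 }
}
```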

@dbrougham

This looks related to something we are seeing. It results in large manifests being stored, which makes the whole system sluggish, especially with large pipeline hierarchies. Can't wait to see this in a release!

@amuraru

amuraru commented Feb 17, 2021

+1 impacted by this as well! thanks @titirigaiulian

@dancb10

dancb10 commented Feb 25, 2021

+1

@titirigaiulian
Author

Added another PR to control the default behaviour and override it at the pipeline level for exceptions. Will also add some docs once we have this merged.

@spinnakerbot

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@spinnakerbot

This issue is tagged as 'stale' and hasn't been updated in 45 days, so we are tagging it as 'to-be-closed'. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:

@spinnakerbot remove-label to-be-closed

@hugohenley

+1

@dbyron-sf
Contributor

@spinnakerbot remove-label to-be-closed

@spinnakerbot

This issue is tagged as 'stale' and hasn't been updated in 45 days, so we are tagging it as 'to-be-closed'. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:

@spinnakerbot remove-label to-be-closed

@dbyron-sf
Contributor

@spinnakerbot remove-label to-be-closed

@dbyron-sf
Contributor

@spinnakerbot remove-label stale

@spinnakerbot

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@spinnakerbot

This issue is tagged as 'stale' and hasn't been updated in 45 days, so we are tagging it as 'to-be-closed'. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:

@spinnakerbot remove-label to-be-closed

@dbrougham

@spinnakerbot remove-label stale

@spinnakerbot

This issue hasn't been updated in 45 days, so we are tagging it as 'stale'. If you want to remove this label, comment:

@spinnakerbot remove-label stale

@spinnakerbot

This issue is tagged as 'stale' and hasn't been updated in 45 days, so we are tagging it as 'to-be-closed'. It will be closed in 45 days unless updates are made. If you want to remove this label, comment:

@spinnakerbot remove-label to-be-closed

@spinnakerbot

This issue is tagged as 'to-be-closed' and hasn't been updated in 45 days, so we are closing it. You can always reopen this issue if needed.

CharlieTLe added a commit to CharlieTLe/orca that referenced this issue May 25, 2022
The way that the monitor pipeline task worked before was that it only
stored the _status_ of the child pipeline when the child was running
(see file before the commit
[here](https://github.com/spinnaker/orca/pull/3902/files#diff-4c694b67271c7d9436d5af695f160f0b60c7ad7753a0ba1eb9aea76d7ba4e851R128-R130)).
After the commit was merged, the MonitorPipelineTask was storing the
context of the child pipeline on every update regardless of the child
pipeline’s status ([see line in
commit](https://github.com/spinnaker/orca/blob/20e2b9ea61037e7fc8eb59c817dc5d33f4718af3/orca-front50/src/main/groovy/com/netflix/spinnaker/orca/front50/tasks/MonitorPipelineTask.groovy#L96)).

This should address the issue where large nested pipelines, especially those
whose child pipelines generate a lot of outputs (Deploy Manifest), negatively
impact the performance of Spinnaker.

See: spinnaker/spinnaker#6159
mergify bot added a commit to spinnaker/orca that referenced this issue May 31, 2022
CharlieTLe added a commit to CharlieTLe/orca that referenced this issue Jun 1, 2022
@jasonmcintosh jasonmcintosh reopened this Apr 6, 2023