
Controlling max parallel jobs per pipeline #2591

Open
ysaakpr opened this issue May 9, 2020 · 38 comments
Labels
area/api Indicates an issue or PR that deals with the API. area/roadmap Issues that are part of the project (or organization) roadmap (usually an epic) kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ysaakpr

ysaakpr commented May 9, 2020

What is the way to control concurrency? The pipeline has 100 independent steps, but I don't want all 100 to run at once. I would also like to adjust the concurrency for different pipeline runs.

@imjasonh
Member

imjasonh commented May 9, 2020

There isn't a configuration for this today, but it should be possible if there's demand and the use cases make sense.

In the meantime you can run pipelines in a namespace with a resource limit such that no more than X CPUs are available to tasks, and those over the cap will queue until others finish. If you're just trying to limit the resource footprint this is likely the best way to express the limitation.

Can you give more details about why you want to limit concurrency of tasks?

https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/
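
As a rough illustration of the namespace approach, a quota along these lines caps the total CPU requests in a namespace (the quota name, namespace and values are placeholders, and it assumes the task pods get CPU requests set, e.g. via a LimitRange default as in the doc above):

# Sketch: cap the total CPU/memory that pods in this namespace may request.
# The quota name, namespace and values are placeholders.
kubectl create quota taskrun-cap \
  --namespace=my-pipelines \
  --hard=requests.cpu=4,requests.memory=8Gi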

@ysaakpr
Author

ysaakpr commented May 10, 2020

Using Kubernetes resource limits is one way, but it is hard to achieve in practice because of the kind of limits I would have to set dynamically, based on different aspects of the run.
In my pipeline, we are loading around 100 DB dumps into a new database to create a new environment from a seed database instance. All of the jobs can run in parallel, but if we run them all at once, the DB instance we are pulling data from gets choked and hits connection limit errors.

Controlling concurrency is much needed in a CI/CD system: even though everything could run in parallel, the user should have an option to limit it due to resource limitations/availability.

Note: Currently I have achieved this using our Consul server, using the consul CLI to wrap the run script in a lock on a shared key with a concurrency limit, e.g. the snippet below:

consul lock -n ${concurrency} -child-exit-code ${jobkey} "bash $script $@"

There are a few problems with this approach:

  1. The pods have already started running, so total time to completion becomes wait time + runtime of the script
  2. It is not easy to configure via pipeline run args; I need to modify my script to achieve concurrency control

But the same could be added directly as a feature to Tekton: use a semaphore to cap the max concurrent jobs on a pipelinerun and create pods only after the locks are acquired.

@ysaakpr
Author

ysaakpr commented May 10, 2020

@imjasonh I could contribute to this if someone can give me some hints on the code structure and the standards to follow, plus any existing technical concerns that would conflict with this behaviour.

@imjasonh
Member

Thanks, I think this seems like a reasonable addition, and would be happy to help you with it.

First, what kind of API addition are you envisioning? What's the minimum addition we can add that we could extend later if we need to? Is there any precedent in other workflow tools we could borrow/steal from?

Depending on the size of the change, we'd probably ask for a design doc answering those questions, and describing the use case (which you've done above, thanks for that!)

@ysaakpr
Author

ysaakpr commented May 10, 2020

At first glance, I am thinking of a property named concurrency on the pipelinerun and/or pipeline resource.
If defined, concurrency should be a value >= 0. Zero would pause the pipeline between runs, any positive value would set the maximum number of tasks that can run in parallel, and if no value is set, everything that can run in parallel will.

An initial version of this feature could be just a concurrency field on the pipeline run. I am not sure of an exactly similar feature we could borrow from, but GitLab has a runner concurrency setting that a user can set per runner.
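
Roughly, what I have in mind would look something like the sketch below. This is purely an illustration of the proposed field, nothing that exists in Tekton today, and the names are made up:

# Sketch of the proposed API only - the concurrency field does not exist yet.
cat <<'EOF' > pipelinerun-concurrency-sketch.yaml
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: seed-db-run
spec:
  pipelineRef:
    name: seed-db
  concurrency: 10   # proposed: at most 10 TaskRuns of this run in flight at once
EOF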

@jlpettersson
Member

jlpettersson commented May 10, 2020

One way of handling this is by using a Pod quota for the Namespace.
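
For example, something along these lines (just a sketch; the quota name, namespace and pod count are placeholders):

# Sketch: limit the namespace to at most 10 pods at any one time.
kubectl create quota pod-cap --namespace=my-pipelines --hard=pods=10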

@ysaakpr
Author

ysaakpr commented May 11, 2020

One way of handling this is by using a Pod quota for the Namespace.

Of course, Kubernetes resource limits are an option as you mentioned, but they are not always practical. For example, in my namespace I am not running only Tekton. And from a configuration and usability point of view, using a pod quota is much more complex than setting a max-parallel value on the pipelinerun.

@imjasonh
Member

@ysaakpr Configuring pipeline-wide concurrency limits definitely seems easiest, but I wonder if that's expressive/powerful enough to satisfy more complex use cases. We should explore other options, even if only to be able to dismiss them with well thought out reasons.

Consider a pipeline that does two highly parallelizable things (e.g., initializing a database, then later deploying code to multiple AZs), but each of those parallelizable things has a different concurrency cap -- it might make sense to kick off 1000 init-db tasks at once, max 100 at a time, then later in that same pipeline kick off 10 deploy tasks, max 3 at a time. Configuring concurrency: 100 at the pipeline level wouldn't help limit the second group of tasks. A user could manually configure their pipeline to perform 3 deploy tasks in parallel, then the next 3, etc., but that's exactly the kind of manual configuration we're trying to avoid -- they also could have manually configured the pipeline to do 100 init-db tasks in parallel, then the next 100, etc., today, but that's toilsome.

(To be clear, this example isn't reason enough by itself to discount the pipeline-wide concurrency config, but it's worth considering and at least explicitly acknowledging this shortcoming.)

One way to express the different concurrency levels would be to group tasks together, then express concurrency limits per-group. Is that worth the additional config required to express/understand/visualize this grouping? I'm not sure. Would it be possible to support group-wise limits and pipeline-wide limits side-by-side? I truly have nothing to offer but open-ended questions. :)
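
Just to make that question concrete, a group-wise version might look vaguely like the sketch below. This is an entirely hypothetical shape, written out only to illustrate the trade-off, not a proposal:

# Entirely hypothetical sketch of per-group limits, for discussion only.
cat <<'EOF' > pipeline-group-concurrency-sketch.yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: init-and-deploy
spec:
  taskGroups:           # hypothetical grouping construct
  - name: init-db
    concurrency: 100    # hypothetical per-group cap
  - name: deploy
    concurrency: 3
EOF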

@imjasonh imjasonh changed the title Controlling max parrallel jobs per pipeline Controlling max parallel jobs per pipeline May 11, 2020
@ysaakpr
Author

ysaakpr commented May 11, 2020

@imjasonh that's a good thought. There are already two other tickets discussing task grouping in a pipeline:
#2592 and #2586 (comment).

As you mentioned, the idea of concurrency should not be limited to just the pipeline level. I agree that for a more complex pipeline, being able to configure this at the task-group level would be a great feature.

Pipeline-level concurrency would be the maximum/default concurrency, and task-group-level concurrency could be used to fine-tune it further.

@dibyom
Member

dibyom commented May 11, 2020

/kind feature
/area api

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. area/api Indicates an issue or PR that deals with the API. labels May 11, 2020
@dibyom dibyom added the kind/design Categorizes issue or PR as related to design. label May 11, 2020
@vdemeester
Member

/priority important-longterm

@tekton-robot tekton-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label May 18, 2020
@ysaakpr
Author

ysaakpr commented May 20, 2020

How can I contribute to this? Is there a discussion forum where I can also be part of the design/implementation discussions?

@takirala
Contributor

takirala commented Jun 9, 2020

+1 for this feature.
I am also looking for something similar and open to contributing to any discussions/design/code.

@afrittoli afrittoli added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jun 15, 2020
@holly-cummins

See also #2828.

@jlpettersson
Member

I think it would not be so difficult to add logic for this.

E.g. right before we create a TaskRun, we could check whether there are fewer than X uncompleted TaskRuns, and otherwise skip creating a new one.

Later, when a TaskRun completes, the PipelineRun will be reconciled again and the creation of a TaskRun will be re-evaluated.
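
As a rough client-side analogue of that check (the real version would live in the PipelineRun reconciler; tekton.dev/pipelineRun is the label Tekton puts on TaskRuns created by a PipelineRun, while my-pipelinerun and MAX_PARALLEL are placeholders):

# Count TaskRuns of this PipelineRun whose Succeeded condition is still
# Unknown, i.e. not yet completed.
incomplete=$(kubectl get taskruns -l tekton.dev/pipelineRun=my-pipelinerun \
  -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Succeeded")].status}{"\n"}{end}' \
  | grep -c Unknown)

if [ "$incomplete" -lt "$MAX_PARALLEL" ]; then
  echo "below the cap - the reconciler would create the next TaskRun here"
fi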

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 14, 2020
@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vdemeester
Member

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

@tekton-robot tekton-robot reopened this Aug 17, 2020
@tekton-robot
Collaborator

@vdemeester: Reopened this issue.

In response to this:

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 17, 2020
@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2021
@dibyom dibyom removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 9, 2021
@afrittoli
Member

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 9, 2021
@afrittoli
Member

Controlling max taskruns: #3796

@afrittoli
Member

Related issue in experimental: tektoncd/experimental#699

@afrittoli
Member

Related approval task issue: tektoncd/experimental#728

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 5, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 15, 2021
@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen with a justification.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@afrittoli
Member

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 2, 2022
@afrittoli
Member

/lifecycle frozen

@tekton-robot tekton-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Feb 2, 2022
@jerop jerop added the area/roadmap Issues that are part of the project (or organization) roadmap (usually an epic) label Feb 17, 2022
@amitjha780

Hi all, is there any stable solution available in Tekton for controlling concurrent builds for pipelines?
I have gone through many related issues/ideas raised with the community but have not found an exact solution:
#6112
#1305
#2134
If anyone has any views or a recent update on this open issue on the Tekton Community Roadmap (#2591), please update here.

@eudescosta

A Google search for controlling the parallel jobs per pipeline brought me here :)
+1, this would be extremely handy; looking forward to seeing this on the roadmap!

@sibelius

What is the best approach for now?

@eudescosta

What is the best approach for now?

I am looking at a shell script ... something along these lines:

check_job() {
  # Name of the most recently started PipelineRun matching <pipeline_name>.
  pipeline_run_name=$(kubectl get pipelinerun --sort-by=.status.startTime | grep -E '<pipeline_name>' | awk 'END{ print $1 }')
  # Prints "Running" if that PipelineRun is still in progress, nothing otherwise.
  kubectl get pipelinerun "$pipeline_run_name" -o jsonpath='{.status.conditions[?(@.reason == "Running")].reason}'
}
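
and then gate the next run on it, something like this (again just a sketch; the sleep interval and pipeline name are whatever fits your setup):

# Wait until no matching PipelineRun reports Running, then kick off the next one.
until [ -z "$(check_job)" ]; do
  sleep 30
done
tkn pipeline start <pipeline_name>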

@caiocampoos

+1 for this, it would be super useful.

Another similar feature would be cancelling pipelines based on other types of pipelines running.

Projects
Status: Todo
Development

No branches or pull requests