TEP-0135: Per-PipelineRun (instead of per-workspace) affinity assistant #6543
Comments
Chatted about this in Monday's API WG -- seemed like there was interest in this feature, which was great! @pritidesai expressed some concern about how this feature would handle a node going down. I think a node going down would kill the pods running on it and cause TaskRuns to fail regardless of whether the existing affinity assistant is used, so I think this could be addressed separately. (Looks like a contributor created a separate issue for this: #6558.) @pritidesai would you be OK with descoping this requirement?

@QuanZhang-William and I brainstormed a few different ideas for how to build this feature. The clearest way is to create only one placeholder pod per PipelineRun and mount all workspace persistent volumes onto it, which Quan prototyped here: https://github.com/QuanZhang-William/pod-affinity. This requires a cluster operator to set the default storage class to one that uses the WaitForFirstConsumer volume binding mode, so that the PVs are not bound to nodes until the placeholder pod is scheduled, preventing the PVs from being bound to different nodes (a minimal example StorageClass is sketched at the end of this comment). We would have to document that cluster operators need to make this change in order to use the per-PipelineRun affinity assistant, which isn't ideal, but I think it could be workable.

I want to go into a bit more detail on the problem @skaegi brought up: a situation where an operator cordons a node in order to do maintenance on it.
@QuanZhang-William and I tried to come up with a few ways to get around this problem with a per-PipelineRun affinity assistant, but haven't come up with any great solutions yet. Here are some ideas we thought of:
@skaegi is this requirement table stakes for you, or do you think we could descope it?
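Returning to the storage-class requirement mentioned above: here is a minimal sketch of a default StorageClass that uses WaitForFirstConsumer binding. The class name and provisioner below are placeholders, not anything prescribed by Tekton; an operator would substitute whatever provisioner the cluster actually uses.

```yaml
# Minimal sketch of a default StorageClass with delayed volume binding.
# The name and provisioner are placeholders for illustration only.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: wait-for-consumer
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/no-provisioner   # replace with the cluster's provisioner
volumeBindingMode: WaitForFirstConsumer     # PVs bind only once a consuming pod is scheduled
```

With WaitForFirstConsumer, the PVs for all of a PipelineRun's workspaces stay unbound until the single placeholder pod is scheduled, so they all end up on the same node.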
I chatted with @pritidesai last week about concerns related to nodes going down. To summarize, Priti has general concerns about long CI/CD PipelineRuns failing due to transient issues, and about users having only a limited ability to recover work that has already been done. We explored whether it would make sense for a per-PipelineRun affinity assistant (PPAA) to handle node failures by moving a PipelineRun to a new node if the one it was running on went down. However, we don't currently handle node failures for PipelineRuns in general, so we agreed that it wouldn't make sense for the PPAA to provide stronger guarantees than what we have today. We're also not really sure how we could implement this guarantee for the PPAA. Therefore, we decided to descope this concern and address it separately, making sure to document the limitation for cluster operators.
@lbernick I think that if we can't handle a cordon correctly (e.g. a controlled NoSchedule), it says we're in trouble in general for robustness. Maybe #6558 is the higher priority, as it's the simple case. I recognize that the "try to delete and recreate the placeholder pod" game might get complex, but I'd start there, since we can do that in a backward-compatible way and at least fix some cases a bit better. Something like: for a PipelineRun, if its TaskRuns are both not running and cannot be scheduled on the affinity assistant's node, then delete the affinity assistant pod, select a candidate TaskRun, and re-run the affinity assistant placement algorithm. Better than nothing, and it might really help simple cases if we can begin to figure them out.
We should definitely fix #6558 ASAP, but I think it's a slightly separate problem.

The tricky thing here is that a pod could be unschedulable for all sorts of reasons, and not all of them are deadlock situations where we'd want to recreate the affinity assistant. For example, what if a pod is just pending because it's waiting for the cluster autoscaler to scale up the number of nodes?

One way we could handle the node maintenance case specifically would be to watch nodes and see if any of them become "unschedulable". As far as I can tell, a node only has "unschedulable=true" when it's cordoned or being drained (or not ready yet, or about to be torn down by the autoscaler). If an unschedulable node has any affinity assistant pods, we could create new copies of them and then delete the old ones. (Create would happen first so that TaskRun pods can't schedule in between when a placeholder pod gets deleted and the new one gets scheduled.) It feels a bit hacky, to be honest; we'll probably need to design this in more detail.

@skaegi can you elaborate on what you mean by "affinity assistant placement algorithm"? The only affinity terms we apply to the affinity assistant pod itself are any affinity terms from the PipelineRun pod template, plus the affinity term to repel other affinity assistants; k8s handles scheduling the placeholder pod from there.
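For context, the "repel" term mentioned above is an inter-pod anti-affinity rule on the placeholder pod. The exact labels, and whether the rule is required or preferred, are Tekton implementation details, so treat the following pod-spec fragment as an illustrative sketch only:

```yaml
# Illustrative fragment of a placeholder pod spec: keep this affinity assistant
# away from nodes that already host another affinity assistant pod.
# The label key/value is an assumption made for the sake of the example.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/component: affinity-assistant   # assumed label
          topologyKey: kubernetes.io/hostname
```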
Yes exactly -- we might look for nodes becoming unschedulable. Re: create/delete -- off the top of my head, the order shouldn't matter, since the node is already unschedulable and the TaskRun pods should already have pod affinity that prevents them from being created on an arbitrary node. Either way, that's part of the design.

It's a more advanced case, but ... the new, smarter placement algorithm still has yet to be written 😃. I'm hoping it could examine a TaskRun and match affinity requirements and possibly tolerations. I'm really not sure whether the current repelling aspect is correct or what it hopes to accomplish.
/assign @QuanZhang-William
@lbernick: GitHub didn't allow me to assign the following users: QuanZhang-William. Note that only tektoncd members, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
/assign
I am really looking forward to this feature. I have a question: are we scheduling every Pod in the PipelineRun onto the same node so that tasks can be built in parallel as much as possible? If so, what happens to Pods that don't mount any PVC-backed workspace, and do PVC access modes factor into the scheduling?
@l-qing great questions! I forgot to comment on this issue earlier, but we've merged TEP-0135: Coscheduling PipelineRun pods as our design proposal for this feature. Please feel free to leave any feedback on the design proposal on this issue!
I think with the initial implementation, we're going to ensure all pods are scheduled on the same node. We can definitely explore making the per-pipelinerun affinity assistant a bit smarter, especially if more users indicate they'd prefer to allow free scheduling of pods that don't mount PVCs. This will be implemented as an alpha feature initially, so we have some freedom to make changes if needed.
We currently don't take PVC access modes into account; so far we've assumed that cluster-level constraints determine whether users use ReadWriteMany or ReadWriteOnce PVCs, and that the decision about whether to enable the affinity assistant depends on which type of PVC they use in general. Does your use case involve PipelineRuns that mix different access modes for PVCs? If so, can you please go into more detail?
@lbernick Thank you for your reply. I discovered this issue when I ran into performance-related problems. In my environment there is only one shared storage class. The nodejs frontend pipelines in particular need to download dependencies into the cache on the PVC and copy them between multiple PVC directories, so the storage pressure from a single pipeline is very high.

So I was wondering whether it's possible to mix different types of PVC. For example, for a nodejs-style pipeline, the code PVC could use local storage while the cache PVC stays on the shared storage class. Kubernetes also optimizes for local storage, ensuring that Pods using a local-storage PVC are scheduled onto the node where that volume lives.
No action (such as setting pod affinity) is required for the pipeline to run successfully when only one local-storage PVC is used. Only when multiple local-storage PVCs are used in a specific combination can the pipeline fail, because the PVCs may end up bound to different nodes. I'm considering the per-PipelineRun affinity assistant for this case.
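To make the mixed-PVC scenario above concrete, a PipelineRun might look roughly like this; the storage class names, workspace names, and sizes are placeholders rather than anything from the actual design:

```yaml
# Hypothetical PipelineRun mixing a per-run local-storage code PVC with a
# long-lived shared cache PVC. "local-path" and "nodejs-dependency-cache"
# are placeholder names.
apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: frontend-build-
spec:
  pipelineRef:
    name: frontend-build
  workspaces:
    - name: code                     # per-run workspace on fast local storage
      volumeClaimTemplate:
        spec:
          storageClassName: local-path
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 1Gi
    - name: cache                    # dependency cache shared between runs
      persistentVolumeClaim:
        claimName: nodejs-dependency-cache
```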
@l-qing thanks for the detailed response, super helpful! To make sure I understand you correctly, it sounds like you're trying to do something similar to the user who reported #5275: you have a PVC you're using to cache dependencies between PipelineRuns. Each PipelineRun also needs somewhere to store the repo contents (not shared between PipelineRuns); you could use the same PVC you're using for caching, but that's slow. You'd like to use the per-PipelineRun affinity assistant feature, when it's available, to be able to use a separate local-storage PVC in each PipelineRun in addition to the cache PVC, and you can't currently do this because you can't use two PVCs in one TaskRun (i.e. the problem reported in #5275). It sounds like you're concerned, though, that once you enable the per-PipelineRun affinity assistant, any pods that don't rely on one of these two PVCs will still be forced onto the same node as the other PipelineRun pods, causing inefficient cluster usage. Does that sound right?

It sounds like we could make two improvements to the proposed design: allow pods that don't use any PVC-backed workspaces to be scheduled freely, and take PVC access modes into account when deciding which pods need to be coscheduled.
I'm thinking for right now, we should take the simple approach proposed in the design and just run all the PipelineRun pods on the same node, see how well that works for people, and make these adjustments if necessary. However, I'm also open to updating the design proposal now. (I also got another comment on the Tekton Slack from a user who says: "After going through everything I think we're leaning more towards switching out all our PVCs used for workspaces to use a ReadWriteMany and not have the option to mix with ReadWriteOnce which would then mean we can disable the affinity-assistant completely." So it sounds like others may have similar concerns.)
Not all of it.
It will work well, because after k8s binds the Code PVC for the first time, any subsequent use of that PVC will only be scheduled on the node it was bound to (see here, and the local PV sketch further down).
Yes.
Currently we do it this way as well, but we have indeed encountered performance issues caused by the single storage class. For the front-end pipeline, our current workaround is that the Code PVC is only used for sharing data, and each task copies its contents to local disk before working on it.
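For reference on the point above about pods following a locally bound PVC: a local PersistentVolume carries a required node-affinity constraint, roughly like the sketch below, which is what pins any pod using the volume to that node. Node, path, and class names are placeholders.

```yaml
# Sketch of a local PersistentVolume: the nodeAffinity block is what pins any
# pod using this volume to "node-a". Names and paths are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-code-pv
spec:
  capacity:
    storage: 1Gi
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-a"]
```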
@l-qing the per-PipelineRun affinity assistant has now been implemented as an alpha feature. Please feel free to try it out and share feedback! You can enable this by setting "disable-affinity-assistant" to "true" and "coschedule" to "pipelineruns" (docs here). We incorporated your feedback that pods that don't bind PVCs should be able to be scheduled freely. I also created a follow-up issue to track your feedback about PVC access modes: #6989. I'm going to close this issue since the implementation is complete in alpha; if you'd like to provide feedback, please comment on #6990.
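For anyone trying this out, the two flags quoted above are set in Tekton's feature-flags ConfigMap; a minimal sketch, assuming a default installation in the standard "tekton-pipelines" namespace:

```yaml
# Feature-flag settings for the per-PipelineRun affinity assistant, assuming a
# default Tekton Pipelines installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  disable-affinity-assistant: "true"   # turn off the per-workspace affinity assistant
  coschedule: "pipelineruns"           # coschedule all of a PipelineRun's pods together
```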
Feature request
A cluster-level, opt-in configuration option to allow all of a PipelineRun's pods to be scheduled to the same node.
Use case
PipelineRuns with parallel TaskRuns using the same PVC-backed workspace. (This use case is addressed by the current affinity assistant.) Discussed in "Difficult to use parallel Tasks that share files using workspace" (#2586).
PipelineRuns with TaskRuns using multiple PVC-backed workspaces. (This use case is not addressed by the current affinity assistant; a sketch of such a Pipeline follows below.)
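A rough sketch of the second use case, using hypothetical task and workspace names: a Pipeline in which a single task mounts two PVC-backed workspaces.

```yaml
# Hypothetical Pipeline for the multi-workspace use case: the "build" task
# mounts both a source workspace and a dependency cache, each typically backed
# by its own PVC at runtime. All names are placeholders.
apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-with-cache
spec:
  workspaces:
    - name: source
    - name: cache
  tasks:
    - name: build
      taskRef:
        name: nodejs-build           # placeholder task
      workspaces:
        - name: source
          workspace: source
        - name: cache
          workspace: cache
```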
Existing Workarounds
Potential drawbacks
The following drawbacks associated with the existing affinity assistant won't be addressed by this feature:
This also introduces a new drawback compared to the current affinity assistant: some PipelineRuns need their TaskRuns to run on different nodes, for example multi-arch builds. Discussed in more detail here.
More details
Looking to gauge interest in this feature, as well as whether there is any interest in a configuration option that would prevent multiple PipelineRuns from running concurrently on the same node.
@skaegi has expressed interest in this feature as long as PipelineRuns can be moved to a new node in case a cluster operator needs to do node maintenance. (He is not interested in preventing multiple PipelineRuns from running on the same node, as this results in worse bin-packing.)