
Difficult to use parallel Tasks that share files using workspace #2586

Closed
jlpettersson opened this issue May 8, 2020 · 11 comments · Fixed by #2630
Assignees
Labels
kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature.

Comments

@jlpettersson
Member

jlpettersson commented May 8, 2020

Expected Behavior

A key feature for a CI-system is to provide fast feedback. Parallelization is one way to provide faster feedback.

I would be interested in an easy way to run parallel "commands" (e.g. steps or tasks), sharing the same workspace volume.

Actual Behavior

Tasks can be run in parallel today. But there is a lot of complexity when using a PVC-based workspace (and using a PVC workspace should be the common case).

Clusters can be regional, spanning multiple zones, while PVCs are often zonal. This is not a problem for most applications, but with Tekton we want to mount the same PVC in multiple Tasks, sequentially or in parallel.

The PVC access mode ReadWriteOnce is the most common, and with it pods can only execute in parallel if they are scheduled to the same node. The access mode ReadWriteMany would be best for Tekton workloads, but it is the least common, and those solutions are usually not "cloud native", e.g. based on NFS servers or custom file-system views of buckets.
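As an illustration, a PVC-backed workspace in a PipelineRun is typically requested with ReadWriteOnce. A minimal sketch (the pipeline name, workspace name, and storage size are illustrative):

```yaml
# Sketch of a PipelineRun binding a workspace to a PVC via volumeClaimTemplate.
# With ReadWriteOnce, every pod mounting this volume must run on one Node at a time.
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: ci-run
spec:
  pipelineRef:
    name: golang-ci-pipeline
  workspaces:
  - name: source
    volumeClaimTemplate:
      spec:
        accessModes:
        - ReadWriteOnce   # the widely available mode; the source of the scheduling constraint
        resources:
          requests:
            storage: 1Gi
```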

Proposal

SOLUTION ALTERNATIVES

  • Investigate if we could support parallel steps within a Task (a DAG of steps per Task). That way, all containers in the Task are scheduled to the same Node and can share the volume when executed in parallel.

    This can still be compatible with Tasks and Pipelines as they are today; it is an additional feature that also allows parallel steps (or a DAG) within a Task, instead of the strictly sequential execution of today.

    This was the initially proposed solution, but it has a major drawback in that it makes Tasks less reusable. In addition, it is a bit tricky to implement. Most comments below are in the context of this alternative.

  • Introduce a concept of TaskGroup and execute all Tasks that are part of the same group (a DAG) within a single Pod. This keeps the reusability of Tasks. It is a major API change and is not easy to implement. A special case, scheduling all Task pods in a PipelineRun within a single Pod, was also considered.

  • Using Kubernetes Pod affinity was considered difficult to use, since Task pods only live during task execution and then terminate.

/kind design
/kind feature

@tekton-robot tekton-robot added kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. labels May 8, 2020
@afrittoli
Member

Results is a feature that has been introduced to allow PVC-less data sharing across tasks, at least for parameter/text types of data.

One concern for parallel steps is scheduling, since a Task with multiple parallel steps might be tougher to schedule, even though we could have a maximum parallel steps setting to ease that issue.

Another small concern I have is re-usability of tasks. Since the smallest reusable unit is the Task, a Task with parallel steps may lead to building tasks that accomplish many different tasks and are thus less re-usable.

That said, I think it would be really useful to explore this: define concrete use cases and assess benefits and issues.

@GregDritschler
Contributor

I'm confused. Steps in a Task map to containers in a Pod. I assume that "parallel steps within a Task" then would map to parallel containers in a Pod. There wouldn't be any scheduling involved for parallel steps. Right?

@jlpettersson jlpettersson changed the title Is it feasible to introduce parallel steps in tasks Proposal: Introduce parallel steps in tasks May 8, 2020
@bobcatfish
Collaborator

@jlpettersson could you give some examples of steps you'd want to run in parallel?

@jlpettersson
Member Author

Since the smallest reusable unit is the Task

I also see the step container image as a reusable unit. A full step including parameters is also reusable, but only via copy-and-paste.

define concrete use cases and assess benefits and issues.

The typical use-case is a CI-pipeline that starts with a git-clone, where the following steps analyze or generate outputs depending on all files that were cloned from git - hence the usefulness of a shared workspace volume.

e.g. I have a pipeline similar to this in a project - highly parallel for fast feedback:

git-clone -> install dependencies -> build app       -> build artifact -> upload artifact
                                  -> test js
                                  -> lint js
                                  -> validate swagger file

It is beneficial to do all work that uses the files from git-clone on the same node (most easily done within the same Pod), and hopefully in parallel to some extent.

I assume that "parallel steps within a Task" then would map to parallel containers in a Pod.

Yes. More exactly, I would love to be able to execute the steps in a DAG within the Pod.

There wouldn't be any scheduling involved for parallel steps.

Only the Pod is scheduled to a Node - once. But something, e.g. the entrypoint binary, must coordinate the order in which the containers within the pod execute.

You probably would like to dedicate more CPU resources to this kind of fat Task, but I'll leave that as a separate question.

Tekton is a distributed system, compared to older single-node CI-systems. Tekton would still be a distributed system with fat Tasks; it is just that bigger "container groups" are co-located - and this is beneficial when sharing resources (e.g. workspace files).

@bobcatfish
Collaborator

Interesting! Thanks for the details. I haven't thought this through very far, but a couple of initial reactions:

  1. git-clone in particular being a separate task is a shortcoming we've run into as a result of seeing how things go without pipeline resources, but I think we'll have other options here eventually (+ @sbwsg )
  2. it seems from your description like what you really want is the graph expressiveness of Pipelines, but you want the "executing with access to the same disk" features of Tasks. I wonder if "fat tasks" are the best answer here, or if instead we could look into an execution mode for pipelines where all Tasks can more easily share the same disk (e.g. be on the same node) - that way you get the reusability of Tasks as they currently exist, the expressiveness of Pipelines, and avoid PVCs

@bobcatfish
Collaborator

Re. (2) -> #617 might be relevant to your interests

@jlpettersson
Member Author

jlpettersson commented May 8, 2020

Linking task output (result) to task input (params) is a great feature.

But this is more about sharing all files in the workspace, typically a larger number of files from a git-clone operation.

The closest alternative to this is to use pod-affinity as in my comment

I am curious if we can use some kind of pod affinity to get tasks co-located on the same node.

Possibly co-locate all pods belonging to a single PipelineRun, so that they can use the same PVC as a workspace and execute in parallel without issues (this is essentially what any single-node CI/CD system does).

We would still be a distributed system, with different PipelineRuns possibly scheduled to different nodes. Using different PVCs is "easier" for fan-out, but not for fan-in (e.g. git-clone and then parallel tasks using the same files).

My idea was to co-locate later pods with the first pods scheduled to a node. But I have come to the conclusion that this is very hard to do, especially in a DAG and when some pods terminate.

Therefore I am proposing to investigate whether we could execute a Pipeline DAG within a pod instead. That would make it trivial to share a volume/files (e.g. emptyDir or PVC) and also trivial to schedule to the same node, as the whole Pod is located on one node.

it seems from your description like what you really want is the graph expressive-ness of Pipelines, but you want the "executing with access to the same disk" features of Tasks.

Yes, exactly. Essentially it is a way to co-locate a group of processes that operate on the same set of files (from git-clone), but potentially executed in order or in parallel as specified in a DAG (pipeline).

@jlpettersson
Member Author

A use-case from the Tekton Catalog is the three golang tasks: build, lint and test. For the fastest feedback, it would be good to execute all three in parallel. All three operate on the same set of files, typically originating from a git-clone operation in a CI-pipeline.

However, there are alternative ways to implement this. One could be to introduce a TaskGroup, so that when I put these four tasks (including git-clone) in a TaskGroup in my pipeline, they are all executed within the same Pod - but as a DAG (managed by e.g. the entrypoint binary or another solution).

@jlpettersson jlpettersson changed the title Proposal: Introduce parallel steps in tasks Proposal: execute containers in a DAG within a Pod May 8, 2020
@jlpettersson
Member Author

jlpettersson commented May 8, 2020

A full example of a Golang CI-pipeline that uses a proposed "TaskGroup":

git-clone -> go-lint -> kaniko-build
          -> go-test
          -> go-build

Example Pipeline definition. I have omitted params for clarity. The taskGroup syntax could be designed differently.

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: golang-ci-pipeline
spec:
  tasks:
  - name: git-clone
    taskRef:
      name: git-clone
    taskGroup: ci
  - name: go-build
    runAfter: [git-clone]
    taskRef:
      name: golang-build
    taskGroup: ci
  - name: go-lint
    runAfter: [git-clone]
    taskRef:
      name: golang-lint
    taskGroup: ci
  - name: go-test
    runAfter: [git-clone]
    taskRef:
      name: golang-test
    taskGroup: ci
  - name: kaniko
    runAfter:
    # possibly use runAfter: ['ci'] here instead of individual task names.
    - go-build
    - go-lint
    - go-test
    taskRef:
      name: kaniko

Since all tasks above belong to the same taskGroup: ci, they will all be executed within the same Pod.

@jlpettersson jlpettersson changed the title Proposal: execute containers in a DAG within a Pod Proposal: execute containers in a DAG within a Pod/Node May 11, 2020
@jlpettersson jlpettersson changed the title Proposal: execute containers in a DAG within a Pod/Node Proposal: execute steps/tasks in a DAG within a Pod/Node May 11, 2020
@jlpettersson
Member Author

/assign

@jlpettersson jlpettersson changed the title Proposal: execute steps/tasks in a DAG within a Pod/Node Difficult to use parallel Tasks that share files with workspace May 15, 2020
@jlpettersson jlpettersson changed the title Difficult to use parallel Tasks that share files with workspace Difficult to use parallel Tasks that share files using workspace May 15, 2020
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 16, 2020
TaskRuns within a PipelineRun may share files using a workspace volume.
The typical case is files from a git-clone operation. Tasks in a CI-pipeline often
perform operations on the filesystem, e.g. generate files or analyze files,
so the workspace abstraction is very useful.

The Kubernetes way of using file volumes is by using [PersistentVolumeClaims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims).
PersistentVolumeClaims use PersistentVolumes with different [access modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes).
The most commonly available PV access mode is ReadWriteOnce; volumes with this
access mode can only be mounted on one Node at a time.
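For reference, a minimal PVC manifest requesting this access mode might look like the following (the claim name and storage size are illustrative):

```yaml
# Sketch of a PVC with the widely available ReadWriteOnce access mode;
# such a volume can be mounted by pods on only one Node at a time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-workspace
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```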

When using parallel Tasks in a Pipeline, the pods for the TaskRuns are
scheduled to any Node, most likely not to the same Node in the cluster.
Since volumes with the commonly available ReadWriteOnce access mode cannot
be used by multiple nodes at a time, these "parallel" pods are forced to
execute sequentially, because the volume is only available on one node at a time.
This may cause your TaskRuns to time out.

Clusters are often _regional_, e.g. they are deployed across 3 Availability
Zones, but Persistent Volumes are often _zonal_, e.g. they are only available
to the Nodes within a single zone. Some cloud providers offer regional PVs,
but sometimes regional PVs are only replicated to one additional zone, not to
all 3 zones within a region. This works fine for most typical stateful applications,
but Tekton uses storage in a different way - it is designed so that multiple pods
access the same volume, in sequence or in parallel.

This makes it difficult to design a Pipeline that starts with parallel tasks, each
using its own PVC, and then has a common task that mounts the volumes from the
earlier tasks - because if those tasks were scheduled to different zones, the
common task cannot mount the PVCs that are now located in different zones, and
the PipelineRun is deadlocked.

There are a few technical solutions that offer parallel executions of Tasks
even when sharing PVC workspace:

- Using PVC access mode ReadWriteMany. But this access mode is not widely available,
  and is typically an NFS server or another not-so-"cloud-native" solution.

- An alternative is to use storage that is tied to a specific node, e.g. a local
  volume, and then configure pods so they are scheduled to that node. But this is
  not commonly available, and it has drawbacks, e.g. the pod may need to consume
  and mount a whole disk of several hundred GB.

Consequently, it would be good to find a way for TaskRun pods that share a
workspace to be scheduled to the same Node - and thereby make it easy to use parallel
tasks with workspaces - while executing concurrently - on widely available Kubernetes
cluster and storage configurations.

A few alternative solutions have been considered, as documented in tektoncd#2586.
However, they all have major drawbacks, e.g. major API and contract changes.

This commit introduces an "Affinity Assistant" - a minimal placeholder-pod,
so that it is possible to use [Kubernetes inter-pod affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) for TaskRun pods that need to be scheduled to the same Node.

This solution has several benefits: it does not introduce any API changes,
it does not break or change any existing Tekton concepts and it is
implemented with very few changes. Additionally it can be disabled with a feature-flag.

**How it works:** When a PipelineRun is initiated, an "Affinity Assistant" is
created for each PVC workspace volume. TaskRun pods that share a workspace
volume are configured with podAffinity to the "Affinity Assistant" pod that
was created for the volume. The "Affinity Assistant" lives until the
PipelineRun is completed or deleted. "Affinity Assistant" pods are
configured with podAntiAffinity to repel other "Affinity Assistants" -
in a best-effort fashion.
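A sketch of the podAffinity that could be injected into a TaskRun pod to pin it to the Node of the Affinity Assistant (the pod name, label key, and label value are illustrative assumptions, not necessarily what the implementation uses):

```yaml
# Hypothetical TaskRun pod with inter-pod affinity to an "Affinity Assistant"
# placeholder pod; topologyKey kubernetes.io/hostname forces same-Node scheduling.
apiVersion: v1
kind: Pod
metadata:
  name: taskrun-pod
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/component: affinity-assistant  # illustrative label
        topologyKey: kubernetes.io/hostname  # co-locate on the same Node
  containers:
  - name: step
    image: alpine
```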

The Affinity Assistant is a _singleton_ workload, since it acts as a
placeholder pod and TaskRun pods with affinity must be scheduled to the
same Node. It is implemented with [QoS class Guaranteed](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed) but with minimal resource requests -
since it does no work other than being a placeholder.

Singleton workloads can be implemented in multiple ways, and they differ
in behavior when the Node becomes unreachable:

- as a Pod - the Pod is not managed, so it will not be recreated.
- as a Deployment - the Pod will be recreated, putting Availability before
  the singleton property.
- as a StatefulSet - the Pod will be recreated, putting the singleton
  property before Availability.

Therefore the Affinity Assistant is implemented as a StatefulSet.
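A singleton placeholder StatefulSet with Guaranteed QoS (requests equal to limits) could look roughly like this; all names, the image, and the resource values are illustrative assumptions:

```yaml
# Sketch of a singleton placeholder StatefulSet: one replica, Guaranteed QoS.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: affinity-assistant
spec:
  replicas: 1                      # singleton: exactly one placeholder pod
  serviceName: affinity-assistant
  selector:
    matchLabels:
      app.kubernetes.io/component: affinity-assistant
  template:
    metadata:
      labels:
        app.kubernetes.io/component: affinity-assistant
    spec:
      containers:
      - name: placeholder
        image: nginx               # any minimal image; it only needs to keep running
        resources:                 # requests == limits gives QoS class Guaranteed
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            cpu: 50m
            memory: 64Mi
```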

Essentially, this commit provides an effortless way to get functional
task parallelism with any Kubernetes cluster that has any PVC-based
storage.

Solves tektoncd#2586
/kind feature
@jlpettersson
Member Author

I wonder if "fat tasks" are the best answer here, or if instead we could look into an execution mode for pipelines where all Tasks can more easily share the same disk (e.g. be on the same node) - that way you can get the reusability of Tasks as they currently exist, the expressiveness of Pipelines

@bobcatfish you are right.

With a small trick - introducing a new component (not user-facing) - I found a working solution for this problem using inter-pod affinity, implemented in #2630

jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 16, 2020
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 18, 2020
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 18, 2020
TaskRuns within a PipelineRun may share files using a workspace volume.
The typical case is files from a git-clone operation. Tasks in a CI-pipeline often
perform operations on the filesystem, e.g. generate files or analyze files,
so the workspace abstraction is very useful.

The Kubernetes way of using file volumes is by using [PersistentVolumeClaims](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims).
PersistentVolumeClaims use PersistentVolumes with different [access modes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes).
The most commonly available PV access mode is ReadWriteOnce, volumes with this
access mode can only be mounted on one Node at a time.

When using parallel Tasks in a Pipeline, the pods for the TaskRuns is
scheduled to any Node, most likely not to the same Node in a cluster.
Since volumes with the commonly available ReadWriteOnce access mode cannot
be use by multiple nodes at a time, these "parallel" pods is forced to
execute sequentially, since the volume only is available on one node at a time.
This may make that your TaskRuns time out.

Clusters are often _regional_, e.g. they are deployed across 3 Availability
Zones, but Persistent Volumes are often _zonal_, e.g. they are only available
for the Nodes within a single zone. Some cloud providers offer regional PVs,
but sometimes regional PVs is only replicated to one additional zone, e.g. not
all 3 zones within a region. This works fine for most typical stateful application,
but Tekton uses storage in a different way - it is designed so that multiple pods
access the same volume, in a sequece or parallel.

This makes it difficult to design a Pipeline that starts with parallel tasks using
its own PVC and then have a common tasks that mount the volume from the earlier
tasks - since - what happens if those tasks were scheduled to different zones -
the common task can not mount the PVCs that now is located in different zones, so
the PipelineRun is deadlocked.

There are a few technical solutions that offer parallel executions of Tasks
even when sharing PVC workspace:

- Using PVC access mode ReadWriteMany. But this access mode is not widely available,
  and is typically a NFS server or another not so "cloud native" solution.

- An alternative is to use a storage that is tied to a specific node, e.g. local volume
  and then configure so pods are scheduled to this node, but this is not commonly
  available and it has drawbacks, e.g. the pod may need to consume and mount a whole
  disk e.g. several hundreds GB.

Consequently, it would be good to find a way so that TaskRun pods that share
workspace are scheduled to the same Node - and thereby make it easy to use parallel
tasks with workspace - while executing concurrently - on widely available Kubernetes
cluster and storage configurations.

A few alternative solutions have been considered, as documented in tektoncd#2586.
However, they all have major drawbacks, e.g. major API and contract changes.

This commit introduces an "Affinity Assistant" - a minimal placeholder-pod,
so that it is possible to use [Kubernetes inter-pod affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#inter-pod-affinity-and-anti-affinity) for TaskRun pods that need to be scheduled to the same Node.

This solution has several benefits: it does not introduce any API changes,
it does not break or change any existing Tekton concepts and it is
implemented with very few changes. Additionally it can be disabled with a feature-flag.

**How it works:** When a PipelineRun is initiated, an "Affinity Assistant" is
created for each PVC workspace volume. TaskRun pods that share workspace
volume is configured with podAffinity to the "Affinity Assisant" pod that
was created for the volume. The "Affinity Assistant" lives until the
PipelineRun is completed, or deleted. "Affinity Assistant" pods are
configured with podAntiAffinity to repel other "Affinity Assistants" -
in a Best Effort fashion.

The Affinity Assistant is a _singleton_ workload, since it acts as a
placeholder pod and TaskRun pods with affinity to it must be scheduled to the
same Node. It is implemented with [QoS class Guaranteed](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/#create-a-pod-that-gets-assigned-a-qos-class-of-guaranteed) but with minimal resource requests -
since it does no work other than being a placeholder.

Singleton workloads can be implemented in multiple ways, and they differ
in behavior when the Node becomes unreachable:

- as a Pod - the Pod is not managed, so it will not be recreated.
- as a Deployment - the Pod will be recreated, putting Availability before
  the singleton property.
- as a StatefulSet - the Pod will be recreated, putting the singleton
  property before Availability.

Therefore, the Affinity Assistant is implemented as a StatefulSet.
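A minimal sketch of such an assistant: a single-replica StatefulSet whose pod gets QoS class Guaranteed because requests equal limits. The name, image, label, and resource values are illustrative, not the exact ones Tekton generates:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: affinity-assistant-example
spec:
  replicas: 1          # singleton: at most one placeholder pod per workspace
  serviceName: affinity-assistant
  selector:
    matchLabels:
      app.kubernetes.io/component: affinity-assistant
  template:
    metadata:
      labels:
        app.kubernetes.io/component: affinity-assistant
    spec:
      containers:
      - name: affinity-assistant
        image: nginx   # any small long-running image works; the pod only holds the Node
        resources:
          requests:    # requests == limits => QoS class Guaranteed
            cpu: 50m
            memory: 100Mi
          limits:
            cpu: 50m
            memory: 100Mi
```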

Essentially, this commit provides an effortless way to get functional
task parallelism with any Kubernetes cluster that has any PVC-based
storage.

Solves tektoncd#2586
/kind feature
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 19, 2020
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 20, 2020
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 20, 2020
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 22, 2020
jlpettersson added a commit to jlpettersson/pipeline that referenced this issue May 22, 2020
tekton-robot pushed a commit that referenced this issue Jun 1, 2020