New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add cancel buildrun support #809
add cancel buildrun support #809
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @gabemontero, long-wished feature. :-)
Similar to us translating the TaskRun status TaskRunTimeout into BuildRunTimeout, we should do the same with TaskRunCancelled into BuildRunCanceled (and make it American ;-) ).
BuildRun documentation will need an update.
Your YAML change looks huge. Are you running the correct version of controller-gen?
Usage looks straightforward to me, thank you! Not too firm on the need for an enhancement proposal as long as everyone votes up on the API modification. |
:-) sounds good @SaschaSchwarze0 ... I'll take that as a +1
Agreed ... will add that along with tests
$ controller-gen --version which lines up with https://github.com/shipwright-io/build/blob/main/HACK.md per the recent updates from you and @imjasonh not sure what is up ... I see about re-running |
cool thanks @sbose78 and fyi everyone I know I need to add godoc to get the verify job to pass |
the supporting cli feature to wrapper the patch needed to cancel buildruns is ^^ shipwright-io/cli#22 |
ok I've pushed commits to
I'll move onto tests next. Then circle back to seen if there was any input on my conditions note in the description, and go from there. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some small suggestions.
docs/buildrun.md
Outdated
|
||
To cancel a `BuildRun` that's currently executing, update its status to mark it as canceled. | ||
|
||
Whe you cancel a `BuildRun`, the underlying `TaskRun` is marked as canceled per the [Tekton cancel `TaskRun` feature](https://github.com/tektoncd/pipeline/blob/main/docs/taskruns.md). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whe you cancel a `BuildRun`, the underlying `TaskRun` is marked as canceled per the [Tekton cancel `TaskRun` feature](https://github.com/tektoncd/pipeline/blob/main/docs/taskruns.md). | |
When you cancel a `BuildRun`, the underlying `TaskRun` is marked as canceled per the [Tekton cancel `TaskRun` feature](https://github.com/tektoncd/pipeline/blob/main/docs/taskruns.md#cancelling-a-taskrun). |
docs/buildrun.md
Outdated
| Unknown | Running | No | The BuildRun has been validate and started to perform its work. | | ||
| True | Succeeded | Yes | The BuildRun Pod is done. | | ||
| Unknown | TaskRunCancelled | No | The user requested the BuildRun to be canceled. This results in the BuildRun controller requesting the TaskRun be canceled. Cancellation has not been done yet. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our status reason should also be "BuildRunCanceled"
. The code change might simply be extending
build/pkg/reconciler/buildrun/resources/conditions.go
Lines 43 to 48 in ea7a886
case v1beta1.TaskRunReasonTimedOut: | |
reason = "BuildRunTimeout" | |
message = fmt.Sprintf("BuildRun %s failed to finish within %s", | |
buildRun.Name, | |
taskRun.Spec.Timeout.Duration, | |
) |
A nuance here would be to set "BuildRunCanceling"
once we notice that the user sent us a request to cancel until the actual TaskRun got canceled and then set "BuildRunCanceled"
. But that's an edge case differentiation that I do not insist on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so you concur with my speculation in this PR's description, and that the current form of the prototype is incomplete wrt what I showed there wrt TaskRunCancelled
being shown.
Yes as I evolve this PR beyond WIP, assuming there are not dissenters to the API, I'll do this.
docs/buildrun.md
Outdated
@@ -140,6 +160,7 @@ The following table illustrates the different states a BuildRun can have under i | |||
| False | ServiceAccountNotFound | Yes | The referenced service account was not found in the cluster. | | |||
| False | BuildRegistrationFailed | Yes | The related Build in the BuildRun is on a Failed state. | | |||
| False | BuildNotFound | Yes | The related Build in the BuildRun was not found. | | |||
| False | TaskRunCancelled | Yes | The BuildRun and underlying TaskRun were canceled successfully. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, should be "BuildRunCanceled"
.
|
||
const ( | ||
// BuildRunSpecStatusCanceled indicates that the user wants to cancel the BuildRun, | ||
// if not already cancelled or terminated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// if not already cancelled or terminated | |
// if not already canceled or terminated |
;-)
return c != nil && c.GetStatus() == corev1.ConditionTrue | ||
} | ||
|
||
// IsCanceled returns true if the BuildRun's spec status is set to Cancelled state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// IsCanceled returns true if the BuildRun's spec status is set to Cancelled state. | |
// IsCanceled returns true if the BuildRun's spec status is set to BuildRunCanceled state. |
pkg/reconciler/buildrun/buildrun.go
Outdated
Path: path, | ||
Value: value, | ||
}} | ||
data, _ := json.Marshal(payload) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data, _ := json.Marshal(payload) | |
data, err := json.Marshal(payload) | |
if err != nil { | |
return err | |
} |
Just in case we eventually have some static code analyzer that complains about ignored errors.
82a060b
to
cf2e943
Compare
fyi I pushed some unit tests and code changes along the move from citing TaskRunCancelled to citing BuildRunCanceled but I would consider all that still "in progress" for reviewers but I wanted to confirm what I have in my branch passed in CI as well. I'm also looking at updating the integration tests today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding some feedback I got from our UX folks on the API design.
cf2e943
to
72b516a
Compare
72b516a
to
03fbac5
Compare
hmmm the verify job is still complaining saying to run |
OK, I've progressed things along enough to remove the WIP Remaining items:
All that said, and with minimally some subsequent commit squashing, PTAL :-) |
the integration tests have passed
|
pkg/reconciler/buildrun/buildrun.go
Outdated
succeededCondition := buildRun.Status.GetCondition(buildv1alpha1.Succeeded) | ||
if buildRun.IsCanceled() && lastTaskRun.IsCancelled() && (succeededCondition == nil || succeededCondition.Reason != buildv1alpha1.BuildRunSpecStatusCanceled) { | ||
if updateErr := resources.UpdateConditionWithFalseStatus(ctx, r.client, buildRun, "the BuildRun is marked canceled.", buildv1alpha1.BuildRunSpecStatusCanceled); updateErr != nil { | ||
return reconcile.Result{}, updateErr | ||
} | ||
return reconcile.Result{}, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut feeling is that this code should not be here, but rather the later code that calls UpdateBuildRunUsingTaskRunCondition
should handle this case. This will also make sure that a generated service account gets deleted when a BuildRun is canceled (that scenario is maybe worth an integration test). Though, I am not sure what it means about the completion date handling (has a canceled TaskRun has a completionDate ?).
Beside that, I searched for the code that sets the BuildRun status to status=Unknown and reason=BuildRunCanceled? Basically when we are reacting to an update for BuildRunCanceled in the spec and patch the TaskRun, but this is still running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK circling back to this @SaschaSchwarze0
First, to answer your question, yes, task run cancelling sets a completion date. See
- https://github.com/tektoncd/pipeline/blob/7ceaffb1257b5af072f24a0e7b5d837c3ef0a96d/pkg/reconciler/taskrun/taskrun.go#L157 followed by
- https://github.com/tektoncd/pipeline/blob/94bb776234cbc45130a49e27a6f550247553fa20/pkg/reconciler/taskrun/taskrun.go#L568
Now, as it turns out, I originally had this call later in the path here, but ran into headaches when I started adding unit tests, and then rationalized that short circuiting processing sooner made sense.
That said, I'll take another pass and look at moving to where the calls to UpdateBuildRunUsingTaskRunCondition
are and see how things unravel and then how I can subsequently sort them out.
For your "Beside that.." point, I'll circle back to that, after I look into ^^ and I am ready to provide my update to my #809 (comment) .... I have an understanding of what is going on with that, but before I dive into it, I want to see where I land with ^^ and then level set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the interim, the full taskrun yaml with what I currently have implementation wise:
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
annotations:
pipeline.tekton.dev/release: v0.22.0
creationTimestamp: "2021-06-30T16:07:55Z"
generateName: buildpack-nodejs-build-htk8q-
generation: 2
labels:
app.kubernetes.io/managed-by: tekton-pipelines
build.shipwright.io/generation: "1"
build.shipwright.io/name: buildpack-nodejs-build
buildrun.shipwright.io/generation: "1"
buildrun.shipwright.io/name: buildpack-nodejs-build-htk8q
clusterbuildstrategy.shipwright.io/generation: "1"
clusterbuildstrategy.shipwright.io/name: buildpacks-v3
name: buildpack-nodejs-build-htk8q-88qsq
namespace: ggmtest
ownerReferences:
- apiVersion: shipwright.io/v1alpha1
blockOwnerDeletion: true
controller: true
kind: BuildRun
name: buildpack-nodejs-build-htk8q
uid: b8993808-520c-4a67-8e6e-1ef154239ee0
resourceVersion: "172515"
uid: 29c563ad-b95d-4577-af85-7cf666952df0
spec:
params:
- name: shp-output-image
value: docker.io/gmontero/sample-nodejs:latest
- name: shp-source-root
value: /workspace/source
- name: CONTEXT_DIR
value: source-build
- name: shp-source-context
value: /workspace/source/source-build
serviceAccountName: pipeline
status: TaskRunCancelled
taskSpec:
params:
- default: Dockerfile
description: Path to the Dockerfile
name: DOCKERFILE
type: string
- default: .
description: The root of the code
name: CONTEXT_DIR
type: string
- description: The URL of the image that the build produces
name: shp-output-image
type: string
- description: The context directory inside the source directory
name: shp-source-context
type: string
- description: The source directory
name: shp-source-root
type: string
results:
- description: The digest of the image
name: shp-image-digest
- description: The compressed size of the image
name: shp-image-size
- description: The commit SHA of the cloned source.
name: shp-source-default-commit-sha
steps:
- args:
- --url
- https://github.com/shipwright-io/sample-nodejs
- --target
- $(params.shp-source-root)
- --result-file-commit-sha
- $(results.shp-source-default-commit-sha.path)
command:
- /ko-app/git
image: docker.io/gmontero/git-77f36d96a091f12d11365ad77da28c66@sha256:9e8bb16e0432a354a10f3caf351093b23028c34013ebcdba41261be7f154216e
name: source-default
resources: {}
securityContext:
runAsGroup: 1000
runAsUser: 1000
- args:
- -c
- |
chown -R "1000:1000" /workspace/source && chown -R "1000:1000" /tekton/home && chown -R "1000:1000" /cache && chown -R "1000:1000" /layers
command:
- /bin/bash
image: docker.io/paketobuildpacks/builder:full
name: prepare
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 65Mi
securityContext:
capabilities:
add:
- CHOWN
runAsUser: 0
volumeMounts:
- mountPath: /cache
name: cache-dir
- mountPath: /layers
name: layers-dir
- args:
- -app=/workspace/source/$(inputs.params.CONTEXT_DIR)
- -cache-dir=/cache
- -layers=/layers
- $(params.shp-output-image)
command:
- /cnb/lifecycle/creator
image: docker.io/paketobuildpacks/builder:full
name: build-and-push
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 65Mi
securityContext:
runAsGroup: 1000
runAsUser: 1000
volumeMounts:
- mountPath: /cache
name: cache-dir
- mountPath: /layers
name: layers-dir
volumes:
- name: cache-dir
- name: layers-dir
workspaces:
- name: source
timeout: 0s
workspaces:
- emptyDir: {}
name: source
status:
completionTime: "2021-06-30T16:08:00Z"
conditions:
- lastTransitionTime: "2021-06-30T16:08:00Z"
message: TaskRun "buildpack-nodejs-build-htk8q-88qsq" was cancelled
reason: TaskRunCancelled
status: "False"
type: Succeeded
podName: buildpack-nodejs-build-htk8q-88qsq-pod-ff8fk
startTime: "2021-06-30T16:07:55Z"
steps:
- container: step-source-default
imageID: docker.io/gmontero/git-77f36d96a091f12d11365ad77da28c66@sha256:9e8bb16e0432a354a10f3caf351093b23028c34013ebcdba41261be7f154216e
name: source-default
terminated:
exitCode: 1
finishedAt: "2021-06-30T16:08:00Z"
reason: TaskRunCancelled
startedAt: "2021-06-30T16:07:58Z"
- container: step-prepare
imageID: docker.io/paketobuildpacks/builder@sha256:e3462130656ff77b7d64f0c1660e3116d70074a0568918f1a0898dc687bf9087
name: prepare
terminated:
exitCode: 1
finishedAt: "2021-06-30T16:08:00Z"
reason: TaskRunCancelled
startedAt: "2021-06-30T16:07:59Z"
- container: step-build-and-push
imageID: docker.io/paketobuildpacks/builder@sha256:e3462130656ff77b7d64f0c1660e3116d70074a0568918f1a0898dc687bf9087
name: build-and-push
terminated:
exitCode: 1
finishedAt: "2021-06-30T16:08:00Z"
reason: TaskRunCancelled
startedAt: "2021-06-30T16:07:59Z"
taskSpec:
params:
- default: Dockerfile
description: Path to the Dockerfile
name: DOCKERFILE
type: string
- default: .
description: The root of the code
name: CONTEXT_DIR
type: string
- description: The URL of the image that the build produces
name: shp-output-image
type: string
- description: The context directory inside the source directory
name: shp-source-context
type: string
- description: The source directory
name: shp-source-root
type: string
results:
- description: The digest of the image
name: shp-image-digest
- description: The compressed size of the image
name: shp-image-size
- description: The commit SHA of the cloned source.
name: shp-source-default-commit-sha
steps:
- args:
- --url
- https://github.com/shipwright-io/sample-nodejs
- --target
- $(params.shp-source-root)
- --result-file-commit-sha
- $(results.shp-source-default-commit-sha.path)
command:
- /ko-app/git
image: docker.io/gmontero/git-77f36d96a091f12d11365ad77da28c66@sha256:9e8bb16e0432a354a10f3caf351093b23028c34013ebcdba41261be7f154216e
name: source-default
resources: {}
securityContext:
runAsGroup: 1000
runAsUser: 1000
- args:
- -c
- |
chown -R "1000:1000" /workspace/source && chown -R "1000:1000" /tekton/home && chown -R "1000:1000" /cache && chown -R "1000:1000" /layers
command:
- /bin/bash
image: docker.io/paketobuildpacks/builder:full
name: prepare
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 65Mi
securityContext:
capabilities:
add:
- CHOWN
runAsUser: 0
volumeMounts:
- mountPath: /cache
name: cache-dir
- mountPath: /layers
name: layers-dir
- args:
- -app=/workspace/source/$(inputs.params.CONTEXT_DIR)
- -cache-dir=/cache
- -layers=/layers
- $(params.shp-output-image)
command:
- /cnb/lifecycle/creator
image: docker.io/paketobuildpacks/builder:full
name: build-and-push
resources:
limits:
cpu: 500m
memory: 1Gi
requests:
cpu: 250m
memory: 65Mi
securityContext:
runAsGroup: 1000
runAsUser: 1000
volumeMounts:
- mountPath: /cache
name: cache-dir
- mountPath: /layers
name: layers-dir
volumes:
- name: cache-dir
- name: layers-dir
workspaces:
- name: source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and the buildrun yaml:
apiVersion: shipwright.io/v1alpha1
kind: BuildRun
metadata:
creationTimestamp: "2021-06-30T16:07:55Z"
generateName: buildpack-nodejs-build-
generation: 2
labels:
build.shipwright.io/generation: "1"
build.shipwright.io/name: buildpack-nodejs-build
name: buildpack-nodejs-build-htk8q
namespace: ggmtest
resourceVersion: "172510"
uid: b8993808-520c-4a67-8e6e-1ef154239ee0
spec:
buildRef:
name: buildpack-nodejs-build
status: BuildRunCanceled
timeout: 0s
status:
buildSpec:
output:
credentials:
name: push-secret
image: docker.io/gmontero/sample-nodejs:latest
source:
contextDir: source-build
url: https://github.com/shipwright-io/sample-nodejs
strategy:
kind: ClusterBuildStrategy
name: buildpacks-v3
completionTime: "2021-06-30T16:08:00Z"
conditions:
- lastTransitionTime: "2021-06-30T16:08:00Z"
message: the BuildRun is marked canceled.
reason: BuildRunCanceled
status: "False"
type: Succeeded
latestTaskRunRef: buildpack-nodejs-build-htk8q-88qsq
startTime: "2021-06-30T16:07:55Z"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beside that, I searched for the code that sets the BuildRun status to status=Unknown and reason=BuildRunCanceled? Basically when we are reacting to an update for BuildRunCanceled in the spec and patch the TaskRun, but this is still running.
OK I have the move to UpdateBuildRunUsingTaskRunCondition
working locally and will push that up in a bit.
On this portion of the comment, I may need some elaboration, though I have a notion of what you may want.
When I run kubectl get br -o yaml -w
when running a buildrun, the Succeeded condition remains with status Unknown until the taskrun fails / succeeds / cancels ... i.e. the taskrun reaches a terminal state.
The current code hence does not bother with setting the status of the Succeeded condition when it first sees that the user patched the buildrun with cancel. Hence it "stays" unknown. I believe that is because we just use the taskRun condition's status, and it stays unknown. i.e. https://github.com/shipwright-io/build/blob/main/pkg/reconciler/buildrun/resources/conditions.go#L106
A quick yaml snippet:
conditions:
- lastTransitionTime: "2021-06-30T16:55:15Z"
message: Not all Steps in the Task have finished executing
reason: Running
status: Unknown
type: Succeeded
Then when it observes the taskrun has been processed, it sets the suceeded condition to false.
Now, I can be explicit around making sure this holds. I'll be including that in my change, with a comment that we can discuss in this PR about whether to keep it or not. Among other things, we can provide a message that is more precise doing it this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^^ change, plus the cancelled build run autogen sa integration test, pushed
7112af9
to
13b49b9
Compare
OK I've crafted an ubuntu controller-gen container that allowed me to run make generate / make generate-crds and the commit looks good wrt line breaks the verify job just passed @imjasonh bumping controller-gen from 0.5.0 to 0.6.0 not surprisingly lead to the need for a pretty massive k8s related go mod bump that I didn't want to couple with this PR, so I punted on that for now. I'll see about revisiting that once I get over the hump here I also got the update to the controller's role so we can patch PRs back up per prior discussion with @SaschaSchwarze0 so that is sorted out but overall, I'm putting a /hold while I next sort through some oddities wrt my manual testing (I'll be curious to see if they arise in the CI testing) ... maybe related to @SaschaSchwarze0 comment about ordering of my changes in the reconciler, but TBH I have not had the cycles to sufficiently focus on it. When I can next circle back from openshift to shipwright, that is my next step. |
34010aa
to
bd79905
Compare
just saw #829 from @SaschaSchwarze0 .... looks very related to the functional test pain I've recently noted here. FWIW, the tekton namespace events from my latest debug:
hence the pod did not get scheduled. Going to revert my debug, see how the next test run goes, and then go back on vacation for today. |
bd79905
to
cf56b96
Compare
OK just 1 of my new tests failed with an intermittent update conflict:
I'll add some redundancy to the update in question when I'm officially back to work tomorrow. thanks @SaschaSchwarze0 for otherwise fixing CI |
9227e9f
to
9860e1c
Compare
9860e1c
to
be61bba
Compare
all green CI @coreydaley @SaschaSchwarze0 commits squashed I believe all comments have either received code updates from me or declining to change comment responses from me PTAL |
make generate-crds via ubuntu add is cancelled check, task run patch, to buildrun reconciler unit and integration tests doc changes
be61bba
to
a84a08d
Compare
It looks good to me, but I would like @SaschaSchwarze0 to take one last look also. |
thanks and agreed @coreydaley |
still all green on ci e2e's @SaschaSchwarze0 @coreydaley after fixing the remaining oversight on the field rename that @coreydaley caught |
I'll make sure I take a final look this week. Will need to evaluate various scenarios against your code. |
/lgtm |
bump @SaschaSchwarze0 - any idea if you'll get back to cancel builds PR review this week? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, went through some edge cases and everything was working fine. :-)
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: SaschaSchwarze0 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Awesome - thanks ! |
Changes
Fixes #54
/kind feature
This PR still needs automated testing, so I have marked it WIP, but the approach I've taken is far enough along that feedback from the community makes sense.
Most notable, for the minimal, required API change, I chose to duplicate the pattern established with Tekton TaskRuns.
Namely, buildRun.spec.status == BuildRunCancelled, to mirror taskRun.spec.status == TaskRunCancelled
There is possibly some more condition related work we want to consider My prototype current produces these results:
Admittedly, an EP could be warranted. If need be, I can retroactively do that. But this seemed simple enough that perhaps we can avoid those additional cycles. But we'll see how the community responds to this, and we'll go from there.
Lastly, I have an shipwright-io/cli branch I'll push for a PR soon as well (though until this merges, I have to manually vendor in the changes here).
But you can test this as well via
kubectl patch <buildruname> --patch '{"spec": {"status":"BuildRunCancelled"}}' --type=merge
NOTE: commit dc99af5 from #808 was needed in my local testing on top of openshift. I've included it here for now, but once that PR merges, it should go away here.
https://github.com//pull/808Submitter Checklist
See the contributor guidedc99af5
for details on coding conventions, github and prow interactions, and the code review process.
Release Notes