Emit events for all TaskRun lifecycle events #2329

afrittoli · 2020-04-05T10:26:51Z

Changes

Emit events for additional TaskRun lifecyle events:

taskrun started
taskrun running

Fix the logic in events.go to compare semantic equality as
opposed to raw pointer equality.
Fix broken EmitEvents unit tests and extend them to cover new
functionality.

Extend reconcile test to verify new events are sent. To do so,
get the event recorder from the context when creating the
controller - if avaialble. This allows using the fake recorder
for testing instead of having to look for event related actions
in the fake client go action list.

Add documentation on events.

Fixes #2328
Work towards #2082

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

Includes tests (if functionality changed/added)
Includes docs (if user facing)
Commit messages follow commit message best practices

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

If you are adding a new binary/image to the cmd dir, please update
the release Task to build and release this image.

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

We now emit kubernetes events for additional TaskRun lifecyle events:
- taskrun started
- taskrun running

afrittoli · 2020-04-05T12:19:43Z

One of the reason why certain tests fail here, is that we test on the number of actions executed in certain error cases, and this patch adds one "emit event" for the start case.
This got me to think about when do we want to consider the TaskRun as started, and thus emit an event for it.

This is roughly the ordering of events in the controller today.
This is not 100% complete / correct, but it should be enough to highlight the issues.

main method	action	re-queue	set condition	events	notes
`Reconcile`	`hasStarted`	no	if `true`, set start time	start event	start timeout
`Reconcile`	`isDone`	maybe	no		metrics, cloud events, stop timeout
`reconcile`	set defaults	no	no
`reconcile`	convert	no	fail task on failure
`reconcile`	`isCancelled`	no	fail task if `true`	end event
`reconcile`	get task	no	fail task on failure		also convert when fetch from API
`reconcile`	labels and annotations	no	no
`reconcile`	check timeout	no	fail task if expired	end event
`reconcile`	bindings	no	fail task on failure		pipeline resources / params
`reconcile`	create pod if needed	maybe			some errors are handled, some re-queue
`reconcile`	update from pod	maybe	end if pod end	end event	may be success of failure

Issues that I see:

Currently the start event is before we validate the conversion, task and bindings. In case of validation failures we have a start event but no end event.

We could move the start event to later on, but then it would not match with the start-time set in the taskrun anymore. Also later on it is a bit less obvious when a a TaskRun has to be considered as new, since start time has been set already. We could use the fact that the pod does not exists yet as an indication. Alternatively we could send event on other error conditions, like when TaskRun fails validation of fetching the task or so, but that feels a bit artificial. I think it would be consistent with metrics thought, not 100% sure.
I would like to get more opinion on this, I might bring it up at the WG.
@vdemeester @bobcatfish @imjasonh thoghts?

The timeout check is a bit in the middle of validations

This is maybe a cosmetic thing, but I think we should move the timeout check somewhere else, maybe at the beginning of reconcile. I can file a separate issue for this.

We re-run fetching the task, doing conversion and validation on every reconcile cycle that hits reconcile. This is a bit inefficient and also may cause issues if the Task definition was to change in the middle of the TaskRun.

This is not new, and not something that we should change in this PR, but the fact that we might associate events to failures on getting the task / validations adds more to the problem.

The table in diagram format (perhaps less readable, but roughly same content):

afrittoli · 2020-04-08T17:09:45Z

The more I think about this, the more I tend to think that we should emit events at the very start of the TaskRun i.e. when the controller first looks at it, and emit failures in case of validation failure - since that is something that task/pipeline editors could look for when testing their work.

If there is no opposing opinion this week I'll move forward under this assumption, as I'd really like to move on with this series of PR swiftly now.

tekton-robot · 2020-04-09T21:20:56Z

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go	90.5%	61.3%	-29.2

bobcatfish

Overall this looks good to me! Just some minor comments on code organization

Am I right in guessing that your overall goal is to try to align the k8s events we emit with potentially CloudEvents as well?

docs/events.md

pkg/reconciler/taskrun/taskrun.go

bobcatfish · 2020-04-20T14:55:43Z

pkg/reconciler/taskrun/taskrun.go

+	// Emit event
+	afterCondition := tr.Status.GetCondition(apis.ConditionSucceeded)
+	reconciler.EmitEvent(c.Recorder, beforeCondition, afterCondition, tr)
+}


do these need to be functions on the reconciler? maybe we could make a new struct and initialize it with the function to emit events?

I put them on the reconciler to give them access to anything that they may need from it, but indeed I was thinking of changing the implementation there.

I decided to remove these functions - at least until I actually need to use them.
Emitting events is already non-blocking - down in the implementation of generateEvent.
So there was little value in adding functions here.
Besides, fetching the afterCondition in a goroutine triggered a data race in the unit tests, which is a good sign that I should not do that.

bobcatfish · 2020-04-20T15:00:40Z

I like the table a lot but im not quite sure how to interpret the "requeue" and "notes" columns, maybe a bit more detail on those?

i wonder if there is some way to isolate the logic that is used to emit events to make it a bit simpler, e.g. treat reconciling as having phases (i havent looked at your other PR @afrittoli which reorganizes reconciling so maybe you're doing this already!) but something like:

validation / fetching everything
making changes we need to make (e.g. creating TaskRuns, pods)
updating status + emitting events

Maybe we could look at the status the Runs and use that to decide what events to emit, i.e. isolate the logic for this all in one place instead of throughout the reconcile

Sounds like the logic around the start time is a bit more subtle than what above is capturing tho 🤔

afrittoli · 2020-04-22T09:27:59Z

I like the table a lot but im not quite sure how to interpret the "requeue" and "notes" columns, maybe a bit more detail on those?

i wonder if there is some way to isolate the logic that is used to emit events to make it a bit simpler, e.g. treat reconciling as having phases (i havent looked at your other PR @afrittoli which reorganizes reconciling so maybe you're doing this already!) but something like:
1. validation / fetching everything

2. making changes we need to make (e.g. creating TaskRuns, pods)

3. updating status + emitting events
Maybe we could look at the status the Runs and use that to decide what events to emit, i.e. isolate the logic for this all in one place instead of throughout the reconcile

Sounds like the logic around the start time is a bit more subtle than what above is capturing tho 🤔

Yeah, I've been trying to organize things a bit more in the other PR - I would welcome feedback there. Looking at how other resources behave I think I need some changes in the PR related to the start event:

only use Succeeded reason for the end event
emit an event for the Created reason when the reconciler first meets the resource
emit an event for the Started reason when validation is passed and we are about to attempt to create the pod
emit an event for the Running reason once the pod is created successfully

For pipeline we will similarly have:

only use Succeeded reason for the end event
emit an event for the Created reason when the reconciler first meets the pipeline
emit an event for the Started reason when validation is passed and we have a DAG
emit an event for the Running reason once the first TaskRun is applied successfully

afrittoli · 2020-04-25T19:37:01Z

I like the table a lot but im not quite sure how to interpret the "requeue" and "notes" columns, maybe a bit more detail on those?
i wonder if there is some way to isolate the logic that is used to emit events to make it a bit simpler, e.g. treat reconciling as having phases (i havent looked at your other PR @afrittoli which reorganizes reconciling so maybe you're doing this already!) but something like:
1. validation / fetching everything

2. making changes we need to make (e.g. creating TaskRuns, pods)

3. updating status + emitting events
Maybe we could look at the status the Runs and use that to decide what events to emit, i.e. isolate the logic for this all in one place instead of throughout the reconcile
Sounds like the logic around the start time is a bit more subtle than what above is capturing tho 🤔
Yeah, I've been trying to organize things a bit more in the other PR - I would welcome feedback there. Looking at how other resources behave I think I need some changes in the PR related to the start event:
* only use `Succeeded` reason for the end event

* emit an event for the `Created` reason when the reconciler first meets the resource

* emit an event for the `Started` reason when validation is passed and we are about to attempt to create the pod

* emit an event for the `Running` reason once the pod is created successfully
For pipeline we will similarly have:
* only use `Succeeded` reason for the end event

* emit an event for the `Created` reason when the reconciler first meets the pipeline

* emit an event for the `Started` reason when validation is passed and we have a DAG

* emit an event for the `Running` reason once the first TaskRun is applied successfully

@bobcatfish going back on your point of organizing the reconcile in phases, I think it might be nice to have use different conditions during reconcile to identify them, and send events on transitions. The only downside might be more updates to the status. This is something I would prefer to do after this patch though.

We should also add a doc page about different conditions we have today and their meaning.

tekton-robot · 2020-04-28T13:55:23Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/event.go	80.0%	100.0%	20.0
pkg/reconciler/taskrun/taskrun.go	74.5%	75.1%	0.6

afrittoli · 2020-04-28T14:12:12Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
File Old Coverage New Coverage Delta
pkg/reconciler/event.go 80.0% 100.0% 20.0
pkg/reconciler/taskrun/taskrun.go 74.5% 75.1% 0.6

Nice increase in coverage, I only need to fix the data race now...

tekton-robot · 2020-04-28T15:13:10Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/event.go	80.0%	100.0%	20.0
pkg/reconciler/taskrun/taskrun.go	74.5%	74.7%	0.2

afrittoli · 2020-04-28T15:31:00Z

/retest

tekton-robot · 2020-04-28T16:59:37Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/event.go	80.0%	100.0%	20.0
pkg/reconciler/taskrun/taskrun.go	74.5%	74.7%	0.2

afrittoli · 2020-04-28T17:45:30Z

/test pull-tekton-pipeline-integration-tests

afrittoli · 2020-04-28T20:00:14Z

/test pull-tekton-pipeline-build-tests

vdemeester

/meow

vdemeester · 2020-04-29T15:01:54Z

pkg/reconciler/event.go

+//
+// Status "ConditionUnknown":
+//   beforeCondition == nil, emit EventReasonStarted
+//   beforeCondition != mil, emit afterCondition.Reason


s/mil/nil 😝

tekton-robot · 2020-04-29T15:02:42Z

@vdemeester:

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tekton-robot · 2020-04-29T15:02:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [vdemeester]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bobcatfish

All of my feedback is super minor! Mostly comments and a couple typos.

The only piece of feedback that's at all substantial is that I'd really like to find away to avoid the sleep in the reconciler test. I don't mind merging this as is but if we could find an alternative to that I think it's worth the effort (or at least a detailed comment explaining why it's unavoidable - imo sleeps, especially in unit tests, should not be the norm)

docs/events.md

bobcatfish · 2020-05-01T15:23:23Z

docs/events.md

+  successfully, including post-steps injected by Tekton.
+- `Failed`: this is triggered if the `TaskRun` is completed, but not successfully.
+  Causes of failure may be: one the steps failed, the `TaskRun` was cancelled or
+  the `TaskRun` timed out.


what if it was skipped? (maybe to be added?)

A TaskRun is something that may only happen when a Pipeline is involved, so I did not look into that on this patch, where I focuses on events generated by the TaskRun controller.
There are two kind of skips:

skip because of a condition not met - which is easy, and we definitely can emit an event for that. I opened an issue which is related [feature] Conditions emit no events like other Tekton resources #2461

skip because it was not reached in the DAG: we don't do anything today we tasks that were not reached in the DAG because of previous failures, but it should be possible to process them as a post-run step in the pipelinerun reconciler

ah right, there's no taskrun at all, makes sense. those 2 skips are totally valid also. like you said, conditions aren't emitting events anyway, maybe we can revisit this then.

thanks!!

docs/events.md

pkg/reconciler/event.go

bobcatfish · 2020-05-01T15:29:48Z

pkg/reconciler/event_test.go

 			}
-		case <-timer.C:
 			if !ts.expectEvent {


i'm a bit confused by the cases that don't require an event, do you think it might make sense to have an event every time the condition changes?

The only cases where we do not emit an event are:

there was no change in condition (except for LastTransitionTime)

afterCondition == nil

For the former I think we do not want an event, it may be we wrote the condition again and updated LastTransitionTime, but nothing really happened to the Condition, so no event.
For the latter, I could not think of a use case where afterCondition is nil, so I saw no point in adding that case now.

bobcatfish · 2020-05-01T15:36:17Z

pkg/reconciler/taskrun/taskrun_test.go

@@ -271,6 +272,29 @@ func getTaskRunController(t *testing.T, d test.Data) (test.Assets, func()) {
 	}, cancel
 }

+func checkEvents(fr *record.FakeRecorder, testName string, wantEvents []string) error {
+	timer := time.NewTimer(1 * time.Second)


whats the timer for and why 1 second? maybe a comment to explain? (im guessing that since this is a "unit"-ish test that this timeout is extremely unlikely to occur)

the fake recorder runs in a go routine, so the timeout is just there to avoid waiting on the channel forever if fewer than expected events are received. I guess it could be reduced, but it will only be reached in case of test failure, so I don't think it matters really.

sounds reasonable! i do think a comment might make sense tho?

pkg/reconciler/taskrun/taskrun_test.go

afrittoli · 2020-05-01T16:22:18Z

All of my feedback is super minor! Mostly comments and a couple typos.

The only piece of feedback that's at all substantial is that I'd really like to find away to avoid the sleep in the reconciler test. I don't mind merging this as is but if we could find an alternative to that I think it's worth the effort (or at least a detailed comment explaining why it's unavoidable - imo sleeps, especially in unit tests, should not be the norm)

Yeah, good catch - this is actually one of the reasons I changed how we run event related unit tests - and finally I forgot to drop the sleep. My bet is that this is not needed anymore, so I will re-spin this without it and with fixes to typos.

tekton-robot · 2020-05-01T17:18:11Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/event.go	80.0%	100.0%	20.0
pkg/reconciler/taskrun/taskrun.go	74.5%	74.7%	0.2

afrittoli · 2020-05-01T17:30:11Z

/test pull-tekton-pipeline-build-tests

Emit events for additional TaskRun lifecyle events: - taskrun started - taskrun running - taskrun timeout Fix the logic in events.go to compare semantic equality as opposed to raw pointer equality. Fix broken EmitEvents unit tests and extend them to cover new functionality. Extend reconcile test to verify new events are sent. To do so, get the event recorder from the context when creating the controller - if avaialble. This allows using the fake recorder for testing instead of having to look for event related actions in the fake client go action list. Add documentation on events. Fixes tektoncd#2328 Work towards tektoncd#2082

tekton-robot · 2020-05-01T17:45:41Z

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File	Old Coverage	New Coverage	Delta
pkg/reconciler/event.go	80.0%	100.0%	20.0
pkg/reconciler/taskrun/taskrun.go	74.5%	74.7%	0.2

bobcatfish · 2020-05-01T18:18:10Z

I think there were 1 or 2 comments that might make sense to add but nothing worth blocking over!

Thanks for this @afrittoli !! :D 🎉

/lgtm

bobcatfish · 2020-05-01T18:18:36Z

ps

nice!!! 👍 👍 👍

tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2020

googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Apr 5, 2020

tekton-robot requested review from dlorenc and vdemeester April 5, 2020 10:26

tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 5, 2020

afrittoli force-pushed the emit_events_2082 branch from 4f12861 to 9972bff Compare April 5, 2020 11:33

afrittoli mentioned this pull request Apr 8, 2020

No event for PipelineValidationFailed #1201

Closed

afrittoli force-pushed the emit_events_2082 branch from 9972bff to 410252f Compare April 9, 2020 21:18

afrittoli force-pushed the emit_events_2082 branch 2 times, most recently from cb7ae3b to 6859fb8 Compare April 16, 2020 15:34

bobcatfish reviewed Apr 20, 2020

View reviewed changes

afrittoli force-pushed the emit_events_2082 branch 4 times, most recently from 4aed1a8 to fe5538d Compare April 28, 2020 13:50

afrittoli force-pushed the emit_events_2082 branch from fe5538d to 8d24f0f Compare April 28, 2020 15:09

afrittoli force-pushed the emit_events_2082 branch from 8d24f0f to a851d0c Compare April 28, 2020 16:54

afrittoli changed the title ~~[WIP] Emit events for all TaskRun lifecycle events~~ Emit events for all TaskRun lifecycle events Apr 28, 2020

tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2020

afrittoli requested a review from bobcatfish April 28, 2020 17:02

afrittoli mentioned this pull request Apr 29, 2020

TaskRun emits no event when it fails on timeout #2328

Closed

vdemeester approved these changes Apr 29, 2020

View reviewed changes

tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2020

bobcatfish reviewed May 1, 2020

View reviewed changes

afrittoli force-pushed the emit_events_2082 branch from a851d0c to f9b4a17 Compare May 1, 2020 17:14

afrittoli force-pushed the emit_events_2082 branch from f9b4a17 to 450cbc2 Compare May 1, 2020 17:42

tekton-robot assigned bobcatfish May 1, 2020

tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2020

tekton-robot merged commit c76a6ad into tektoncd:master May 1, 2020

qu1queee mentioned this pull request Jul 24, 2020

Upgrade the Tekton version to the latest v0.12 version shipwright-io/build#221

Closed

Emit events for all TaskRun lifecycle events #2329

Emit events for all TaskRun lifecycle events #2329

Conversation

afrittoli commented Apr 5, 2020 • edited Loading

Changes

Submitter Checklist

Reviewer Notes

Release Notes

afrittoli commented Apr 5, 2020 • edited Loading

afrittoli commented Apr 8, 2020

tekton-robot commented Apr 9, 2020

bobcatfish left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bobcatfish commented Apr 20, 2020

afrittoli commented Apr 22, 2020 • edited Loading

afrittoli commented Apr 25, 2020

tekton-robot commented Apr 28, 2020

afrittoli commented Apr 28, 2020

tekton-robot commented Apr 28, 2020

afrittoli commented Apr 28, 2020

tekton-robot commented Apr 28, 2020

afrittoli commented Apr 28, 2020

afrittoli commented Apr 28, 2020

vdemeester left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tekton-robot commented Apr 29, 2020

tekton-robot commented Apr 29, 2020

bobcatfish left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afrittoli commented May 1, 2020

tekton-robot commented May 1, 2020

afrittoli commented May 1, 2020

tekton-robot commented May 1, 2020

bobcatfish commented May 1, 2020

bobcatfish commented May 1, 2020

afrittoli commented Apr 5, 2020 •

edited

Loading

afrittoli commented Apr 5, 2020 •

edited

Loading

afrittoli commented Apr 22, 2020 •

edited

Loading