Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit events for all TaskRun lifecycle events #2329

Merged
merged 1 commit into from
May 1, 2020

Conversation

afrittoli
Copy link
Member

@afrittoli afrittoli commented Apr 5, 2020

Changes

Emit events for additional TaskRun lifecyle events:

  • taskrun started
  • taskrun running

Fix the logic in events.go to compare semantic equality as
opposed to raw pointer equality.
Fix broken EmitEvents unit tests and extend them to cover new
functionality.

Extend reconcile test to verify new events are sent. To do so,
get the event recorder from the context when creating the
controller - if avaialble. This allows using the fake recorder
for testing instead of having to look for event related actions
in the fake client go action list.

Add documentation on events.

Fixes #2328
Work towards #2082

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

We now emit kubernetes events for additional TaskRun lifecyle events:
- taskrun started
- taskrun running

@tekton-robot tekton-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2020
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Apr 5, 2020
@tekton-robot tekton-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 5, 2020
@afrittoli
Copy link
Member Author

afrittoli commented Apr 5, 2020

One of the reason why certain tests fail here, is that we test on the number of actions executed in certain error cases, and this patch adds one "emit event" for the start case.
This got me to think about when do we want to consider the TaskRun as started, and thus emit an event for it.

This is roughly the ordering of events in the controller today.
This is not 100% complete / correct, but it should be enough to highlight the issues.

main method action re-queue set condition events notes
Reconcile hasStarted no if true, set start time start event start timeout
Reconcile isDone maybe no metrics, cloud events, stop timeout
reconcile set defaults no no
reconcile convert no fail task on failure
reconcile isCancelled no fail task if true end event
reconcile get task no fail task on failure also convert when fetch from API
reconcile labels and annotations no no
reconcile check timeout no fail task if expired end event
reconcile bindings no fail task on failure pipeline resources / params
reconcile create pod if needed maybe some errors are handled, some re-queue
reconcile update from pod maybe end if pod end end event may be success of failure

Issues that I see:

  1. Currently the start event is before we validate the conversion, task and bindings. In case of validation failures we have a start event but no end event.

We could move the start event to later on, but then it would not match with the start-time set in the taskrun anymore. Also later on it is a bit less obvious when a a TaskRun has to be considered as new, since start time has been set already. We could use the fact that the pod does not exists yet as an indication. Alternatively we could send event on other error conditions, like when TaskRun fails validation of fetching the task or so, but that feels a bit artificial. I think it would be consistent with metrics thought, not 100% sure.
I would like to get more opinion on this, I might bring it up at the WG.
@vdemeester @bobcatfish @imjasonh thoghts?

  1. The timeout check is a bit in the middle of validations

This is maybe a cosmetic thing, but I think we should move the timeout check somewhere else, maybe at the beginning of reconcile. I can file a separate issue for this.

  1. We re-run fetching the task, doing conversion and validation on every reconcile cycle that hits reconcile. This is a bit inefficient and also may cause issues if the Task definition was to change in the middle of the TaskRun.

This is not new, and not something that we should change in this PR, but the fact that we might associate events to failures on getting the task / validations adds more to the problem.

The table in diagram format (perhaps less readable, but roughly same content):
IMG_20200405_132015

@afrittoli
Copy link
Member Author

The more I think about this, the more I tend to think that we should emit events at the very start of the TaskRun i.e. when the controller first looks at it, and emit failures in case of validation failure - since that is something that task/pipeline editors could look for when testing their work.

If there is no opposing opinion this week I'll move forward under this assumption, as I'd really like to move on with this series of PR swiftly now.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 90.5% 61.3% -29.2

@afrittoli afrittoli force-pushed the emit_events_2082 branch 2 times, most recently from cb7ae3b to 6859fb8 Compare April 16, 2020 15:34
Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me! Just some minor comments on code organization

Am I right in guessing that your overall goal is to try to align the k8s events we emit with potentially CloudEvents as well?

docs/events.md Outdated Show resolved Hide resolved
docs/events.md Show resolved Hide resolved
pkg/reconciler/taskrun/taskrun.go Show resolved Hide resolved
// Emit event
afterCondition := tr.Status.GetCondition(apis.ConditionSucceeded)
reconciler.EmitEvent(c.Recorder, beforeCondition, afterCondition, tr)
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do these need to be functions on the reconciler? maybe we could make a new struct and initialize it with the function to emit events?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put them on the reconciler to give them access to anything that they may need from it, but indeed I was thinking of changing the implementation there.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to remove these functions - at least until I actually need to use them.
Emitting events is already non-blocking - down in the implementation of generateEvent.
So there was little value in adding functions here.
Besides, fetching the afterCondition in a goroutine triggered a data race in the unit tests, which is a good sign that I should not do that.

@bobcatfish
Copy link
Collaborator

I like the table a lot but im not quite sure how to interpret the "requeue" and "notes" columns, maybe a bit more detail on those?

i wonder if there is some way to isolate the logic that is used to emit events to make it a bit simpler, e.g. treat reconciling as having phases (i havent looked at your other PR @afrittoli which reorganizes reconciling so maybe you're doing this already!) but something like:

  1. validation / fetching everything
  2. making changes we need to make (e.g. creating TaskRuns, pods)
  3. updating status + emitting events

Maybe we could look at the status the Runs and use that to decide what events to emit, i.e. isolate the logic for this all in one place instead of throughout the reconcile

Sounds like the logic around the start time is a bit more subtle than what above is capturing tho 🤔

@afrittoli
Copy link
Member Author

afrittoli commented Apr 22, 2020

I like the table a lot but im not quite sure how to interpret the "requeue" and "notes" columns, maybe a bit more detail on those?

i wonder if there is some way to isolate the logic that is used to emit events to make it a bit simpler, e.g. treat reconciling as having phases (i havent looked at your other PR @afrittoli which reorganizes reconciling so maybe you're doing this already!) but something like:

1. validation / fetching everything

2. making changes we need to make (e.g. creating TaskRuns, pods)

3. updating status + emitting events

Maybe we could look at the status the Runs and use that to decide what events to emit, i.e. isolate the logic for this all in one place instead of throughout the reconcile

Sounds like the logic around the start time is a bit more subtle than what above is capturing tho 🤔

Yeah, I've been trying to organize things a bit more in the other PR - I would welcome feedback there. Looking at how other resources behave I think I need some changes in the PR related to the start event:

  • only use Succeeded reason for the end event
  • emit an event for the Created reason when the reconciler first meets the resource
  • emit an event for the Started reason when validation is passed and we are about to attempt to create the pod
  • emit an event for the Running reason once the pod is created successfully

For pipeline we will similarly have:

  • only use Succeeded reason for the end event
  • emit an event for the Created reason when the reconciler first meets the pipeline
  • emit an event for the Started reason when validation is passed and we have a DAG
  • emit an event for the Running reason once the first TaskRun is applied successfully

@afrittoli
Copy link
Member Author

I like the table a lot but im not quite sure how to interpret the "requeue" and "notes" columns, maybe a bit more detail on those?
i wonder if there is some way to isolate the logic that is used to emit events to make it a bit simpler, e.g. treat reconciling as having phases (i havent looked at your other PR @afrittoli which reorganizes reconciling so maybe you're doing this already!) but something like:

1. validation / fetching everything

2. making changes we need to make (e.g. creating TaskRuns, pods)

3. updating status + emitting events

Maybe we could look at the status the Runs and use that to decide what events to emit, i.e. isolate the logic for this all in one place instead of throughout the reconcile
Sounds like the logic around the start time is a bit more subtle than what above is capturing tho 🤔

Yeah, I've been trying to organize things a bit more in the other PR - I would welcome feedback there. Looking at how other resources behave I think I need some changes in the PR related to the start event:

* only use `Succeeded` reason for the end event

* emit an event for the `Created` reason when the reconciler first meets the resource

* emit an event for the `Started` reason when validation is passed and we are about to attempt to create the pod

* emit an event for the `Running` reason once the pod is created successfully

For pipeline we will similarly have:

* only use `Succeeded` reason for the end event

* emit an event for the `Created` reason when the reconciler first meets the pipeline

* emit an event for the `Started` reason when validation is passed and we have a DAG

* emit an event for the `Running` reason once the first TaskRun is applied successfully

@bobcatfish going back on your point of organizing the reconcile in phases, I think it might be nice to have use different conditions during reconcile to identify them, and send events on transitions. The only downside might be more updates to the status. This is something I would prefer to do after this patch though.

We should also add a doc page about different conditions we have today and their meaning.

@afrittoli afrittoli force-pushed the emit_events_2082 branch 4 times, most recently from 4aed1a8 to fe5538d Compare April 28, 2020 13:50
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/event.go 80.0% 100.0% 20.0
pkg/reconciler/taskrun/taskrun.go 74.5% 75.1% 0.6

@afrittoli
Copy link
Member Author

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report
File Old Coverage New Coverage Delta
pkg/reconciler/event.go 80.0% 100.0% 20.0
pkg/reconciler/taskrun/taskrun.go 74.5% 75.1% 0.6

Nice increase in coverage, I only need to fix the data race now...

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/event.go 80.0% 100.0% 20.0
pkg/reconciler/taskrun/taskrun.go 74.5% 74.7% 0.2

@afrittoli
Copy link
Member Author

/retest

@afrittoli afrittoli changed the title [WIP] Emit events for all TaskRun lifecycle events Emit events for all TaskRun lifecycle events Apr 28, 2020
@tekton-robot tekton-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2020
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/event.go 80.0% 100.0% 20.0
pkg/reconciler/taskrun/taskrun.go 74.5% 74.7% 0.2

@afrittoli
Copy link
Member Author

/test pull-tekton-pipeline-integration-tests

@afrittoli
Copy link
Member Author

/test pull-tekton-pipeline-build-tests

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/meow

//
// Status "ConditionUnknown":
// beforeCondition == nil, emit EventReasonStarted
// beforeCondition != mil, emit afterCondition.Reason
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/mil/nil 😝

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@tekton-robot
Copy link
Collaborator

@vdemeester: cat image

In response to this:

/meow

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2020
Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of my feedback is super minor! Mostly comments and a couple typos.

The only piece of feedback that's at all substantial is that I'd really like to find away to avoid the sleep in the reconciler test. I don't mind merging this as is but if we could find an alternative to that I think it's worth the effort (or at least a detailed comment explaining why it's unavoidable - imo sleeps, especially in unit tests, should not be the norm)

docs/events.md Show resolved Hide resolved
successfully, including post-steps injected by Tekton.
- `Failed`: this is triggered if the `TaskRun` is completed, but not successfully.
Causes of failure may be: one the steps failed, the `TaskRun` was cancelled or
the `TaskRun` timed out.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if it was skipped? (maybe to be added?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A TaskRun is something that may only happen when a Pipeline is involved, so I did not look into that on this patch, where I focuses on events generated by the TaskRun controller.
There are two kind of skips:

  • skip because of a condition not met - which is easy, and we definitely can emit an event for that. I opened an issue which is related [feature] Conditions emit no events like other Tekton resources #2461
  • skip because it was not reached in the DAG: we don't do anything today we tasks that were not reached in the DAG because of previous failures, but it should be possible to process them as a post-run step in the pipelinerun reconciler

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah right, there's no taskrun at all, makes sense. those 2 skips are totally valid also. like you said, conditions aren't emitting events anyway, maybe we can revisit this then.

thanks!!

docs/events.md Show resolved Hide resolved
pkg/reconciler/event.go Outdated Show resolved Hide resolved
}
case <-timer.C:
if !ts.expectEvent {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a bit confused by the cases that don't require an event, do you think it might make sense to have an event every time the condition changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only cases where we do not emit an event are:

  • there was no change in condition (except for LastTransitionTime)
  • afterCondition == nil

For the former I think we do not want an event, it may be we wrote the condition again and updated LastTransitionTime, but nothing really happened to the Condition, so no event.
For the latter, I could not think of a use case where afterCondition is nil, so I saw no point in adding that case now.

@@ -271,6 +272,29 @@ func getTaskRunController(t *testing.T, d test.Data) (test.Assets, func()) {
}, cancel
}

func checkEvents(fr *record.FakeRecorder, testName string, wantEvents []string) error {
timer := time.NewTimer(1 * time.Second)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats the timer for and why 1 second? maybe a comment to explain? (im guessing that since this is a "unit"-ish test that this timeout is extremely unlikely to occur)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the fake recorder runs in a go routine, so the timeout is just there to avoid waiting on the channel forever if fewer than expected events are received. I guess it could be reduced, but it will only be reached in case of test failure, so I don't think it matters really.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds reasonable! i do think a comment might make sense tho?

pkg/reconciler/taskrun/taskrun_test.go Show resolved Hide resolved
pkg/reconciler/taskrun/taskrun_test.go Show resolved Hide resolved
pkg/reconciler/taskrun/taskrun_test.go Outdated Show resolved Hide resolved
pkg/reconciler/taskrun/taskrun_test.go Outdated Show resolved Hide resolved
@afrittoli
Copy link
Member Author

All of my feedback is super minor! Mostly comments and a couple typos.

The only piece of feedback that's at all substantial is that I'd really like to find away to avoid the sleep in the reconciler test. I don't mind merging this as is but if we could find an alternative to that I think it's worth the effort (or at least a detailed comment explaining why it's unavoidable - imo sleeps, especially in unit tests, should not be the norm)

Yeah, good catch - this is actually one of the reasons I changed how we run event related unit tests - and finally I forgot to drop the sleep. My bet is that this is not needed anymore, so I will re-spin this without it and with fixes to typos.

@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/event.go 80.0% 100.0% 20.0
pkg/reconciler/taskrun/taskrun.go 74.5% 74.7% 0.2

@afrittoli
Copy link
Member Author

/test pull-tekton-pipeline-build-tests

Emit events for additional TaskRun lifecyle events:
- taskrun started
- taskrun running
- taskrun timeout

Fix the logic in events.go to compare semantic equality as
opposed to raw pointer equality.
Fix broken EmitEvents unit tests and extend them to cover new
functionality.

Extend reconcile test to verify new events are sent. To do so,
get the event recorder from the context when creating the
controller - if avaialble. This allows using the fake recorder
for testing instead of having to look for event related actions
in the fake client go action list.

Add documentation on events.

Fixes tektoncd#2328
Work towards tektoncd#2082
@tekton-robot
Copy link
Collaborator

The following is the coverage report on the affected files.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/reconciler/event.go 80.0% 100.0% 20.0
pkg/reconciler/taskrun/taskrun.go 74.5% 74.7% 0.2

@bobcatfish
Copy link
Collaborator

I think there were 1 or 2 comments that might make sense to add but nothing worth blocking over!

Thanks for this @afrittoli !! :D 🎉

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2020
@bobcatfish
Copy link
Collaborator

ps

image

nice!!! 👍 👍 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TaskRun emits no event when it fails on timeout
5 participants