
Fix and work around timeout handler data races #1308

Merged: 3 commits into tektoncd:master on Sep 15, 2019

Conversation

bobcatfish (Collaborator):

Changes

This mostly fixes #1124; we will still need to update gcr.io/tekton-releases/tests/test-runner to use a newer version of Go that can catch this race.

Will re-add logging to the timeout handler in #1307
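
For context, here is a minimal, hypothetical sketch (none of these names come from the Tekton codebase) of the access pattern the race detector flags: one goroutine reads an object's fields while another writes them. Running it with `go run -race` prints a data race report:

```go
package main

import (
	"fmt"
	"sync"
)

// run is a stand-in struct; it is not the real PipelineRun type.
type run struct {
	Namespace string
	Name      string
}

func main() {
	r := &run{Namespace: "default", Name: "demo"}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // reader: formats a key from mutable fields
		defer wg.Done()
		_ = fmt.Sprintf("%s/%s", r.Namespace, r.Name)
	}()
	go func() { // writer: mutates a field concurrently
		defer wg.Done()
		r.Name = "demo-updated"
	}()
	wg.Wait()
	fmt.Println("done; run with `go run -race` to see the race report")
}
```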

Remove logging from timeout handler ✏️

Logging in the timeout handler was added as part of #731 because it helped
us debug when the timeout handler didn't work as expected. Unfortunately,
it looks like the logger we're using can't safely be used from multiple
goroutines (uber-go/zap#99 may be related). Removing this logging fixes
#1124; hopefully we can find a safe way to add logging back in #1307.

Make GetRunKey threadsafe 🔒

GetRunKey is called from goroutines spawned by the timeout handler, and
reading an object's attributes from multiple goroutines without
synchronization is not threadsafe. Since the result is only used as a key
in a map, for now I'm replacing it with a value that should still be
unique but is also threadsafe, to fix #1124.
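
To illustrate the idea (a sketch with hypothetical names, not the actual timeout handler code), a key built from the pointer's address never reads mutable fields, so it can safely be computed from any goroutine and used to index a map guarded by a mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// run is a stand-in struct, not the real PipelineRun type.
type run struct {
	Namespace string
	Name      string
}

// getRunKey formats the pointer itself (%p), so no fields of the
// object are read and the key is unique per instance.
func getRunKey(r *run) string {
	return fmt.Sprintf("controller/%p", r)
}

type timeouts struct {
	mu   sync.Mutex
	done map[string]chan struct{}
}

// register records a per-run channel keyed by getRunKey; the mutex
// guards the map itself against concurrent access.
func (t *timeouts) register(r *run) chan struct{} {
	t.mu.Lock()
	defer t.mu.Unlock()
	ch := make(chan struct{})
	t.done[getRunKey(r)] = ch
	return ch
}

func main() {
	t := &timeouts{done: map[string]chan struct{}{}}
	r := &run{Namespace: "default", Name: "demo"}
	ch := t.register(r)
	close(ch)
	fmt.Println("registered", getRunKey(r))
}
```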

Remove unused function 👻

The function GetLogMessages isn't used anywhere. I had tried removing it
to see whether it was causing the data race in #1124. It isn't, but since
it's not used anywhere anyway, why not remove it :)

Reviewer Notes

If API changes are included, additive changes must be approved by at least
two OWNERS, and backwards-incompatible changes must be approved by more
than 50% of the OWNERS and must first be added in a backwards-compatible
way.

Release Notes

Removes potential race conditions in timeout handler

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 13, 2019
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Sep 13, 2019
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 13, 2019
@tekton-robot (Collaborator):

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File                               Old Coverage   New Coverage   Delta
pkg/reconciler/timeout_handler.go  75.2%          79.1%          3.9
test/controller.go                 34.6%          37.5%          2.9

@afrittoli (Member):

I cannot reproduce the races with this code, nice!

@@ -246,7 +246,8 @@ func (pr *PipelineRun) IsCancelled() bool {

 // GetRunKey return the pipelinerun key for timeout handler map
 func (pr *PipelineRun) GetRunKey() string {
-	return fmt.Sprintf("%s/%s/%s", pipelineRunControllerName, pr.Namespace, pr.Name)
+	// The address of the pointer is a threadsafe unique identifier for the pipelinerun
+	return fmt.Sprintf("%s/%p", pipelineRunControllerName, pr)
Inline comment (Member):

nice!

@@ -86,7 +76,6 @@ type Informers struct {
 // TestAssets holds references to the controller, logs, clients, and informers.
 type TestAssets struct {
 	Controller *controller.Impl
-	Logs       *observer.ObservedLogs
Inline comment (Member):

This is strange - I remember using Logs to debug a test in the past, before the controller refactor.
It might be nice to have GetLogMessages populate the Logs field here, which would give tests access to logs and would enable testing that a certain log message was emitted.

But I agree: since it's not used now, we can remove it and add it back if needed in the future.
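
If we do bring this back, zap's observer package is one way to expose logs to tests. A sketch of that public API (the test name and messages here are hypothetical, not the removed Tekton code):

```go
package example

import (
	"testing"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
	"go.uber.org/zap/zaptest/observer"
)

func TestCanAssertOnLogMessages(t *testing.T) {
	// observer.New returns a core that records entries and a handle
	// (*observer.ObservedLogs) for reading them back in the test.
	core, logs := observer.New(zapcore.InfoLevel)
	logger := zap.New(core).Sugar()

	logger.Infof("timeout for %s", "my-run")

	if logs.FilterMessage("timeout for my-run").Len() != 1 {
		t.Fatalf("expected 1 matching log entry, got %d total", logs.Len())
	}
}
```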

@afrittoli (Member) left a comment:

Thank you for this work!
/lgtm

@@ -170,7 +170,6 @@ func backoffDuration(count uint, jf jitterFunc) time.Duration {
 func (t *TimeoutSet) checkPipelineRunTimeouts(namespace string, pipelineclientset clientset.Interface) {
 	pipelineRuns, err := pipelineclientset.TektonV1alpha1().PipelineRuns(namespace).List(metav1.ListOptions{})
 	if err != nil {
-		t.logger.Errorf("Can't get pipelinerun list in namespace %s: %s", namespace, err)
Inline comment (Member):

I'm not super-happy to remove logging just ahead of a release - it may make troubleshooting harder for folks. That said, removing non-thread safe behavior is probably more important, so I'm ok with this - and we still have a chance to bring logging back before the release anyways.
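
One conservative way to reintroduce logging later (purely an illustrative sketch, not necessarily what #1307 will do) is to serialize calls to the shared logger behind a small wrapper:

```go
package example

import (
	"sync"

	"go.uber.org/zap"
)

// safeLogger serializes all log calls with a mutex. This is a blunt
// workaround: every call becomes sequential no matter how many
// goroutines share the logger, trading throughput for safety.
type safeLogger struct {
	mu     sync.Mutex
	logger *zap.SugaredLogger
}

func (s *safeLogger) Errorf(template string, args ...interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.logger.Errorf(template, args...)
}

func newSafeLogger() (*safeLogger, error) {
	l, err := zap.NewProduction()
	if err != nil {
		return nil, err
	}
	return &safeLogger{logger: l.Sugar()}, nil
}
```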

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 15, 2019
@tekton-robot (Collaborator):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: afrittoli, bobcatfish, dibyom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [afrittoli,bobcatfish,dibyom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot merged commit 10b6427 into tektoncd:master Sep 15, 2019
Development

Successfully merging this pull request may close these issues:
  • Data Race in Pipelinerun reconciler tests (#1124)