Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds pipeline metrics 🔭 #1387

Merged
merged 1 commit into from
Oct 17, 2019
Merged

Conversation

hrishin
Copy link
Member

@hrishin hrishin commented Oct 7, 2019

Changes

Often, as a developer or administrator(ops) I want some insights
about pipeline behavior in terms of time taken to execute pipleinerun/taskrun,
its success or failure ratio, pod latencies etc.
At present tekton pipelines has very limited ways to surface such information
or it's hard to get those details looking at resources yamls.

This patch exposes above mentioned pipelines metrics on '/metrics'
endpoint using knative pkg/metrics package. User can collect such
metrics using prometheus, stackdriver or other supported metrics system.

To some extent its solves

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

See the contribution guide for more details.

Release Notes

Following Pipelines metrics are available at `tekton-pipelines-controller:9090/metrics` endpoint
1. Taskrun/Pipeline execution duration time
2. Pipelinerun/Taskrun `success` and `failure` duration rates
3. Number of Taskruns and Pipelineruns are executing currently
4. Pod scheduling latency for Taskruns 

@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Oct 7, 2019
@tekton-robot tekton-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 7, 2019
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 72.7% 53.3% -19.4
pkg/reconciler/pipelinerun/metrics.go Do not exist 87.5%
pkg/reconciler/pipelinerun/pipelinerun.go 83.1% 83.5% 0.4
pkg/reconciler/taskrun/metrics.go Do not exist 87.5%
pkg/reconciler/taskrun/taskrun.go 71.6% 71.6% -0.0
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 85.6% 84.7% -0.9

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 72.7% 53.3% -19.4
pkg/reconciler/pipelinerun/metrics.go Do not exist 87.5%
pkg/reconciler/pipelinerun/pipelinerun.go 83.1% 83.5% 0.4
pkg/reconciler/taskrun/metrics.go Do not exist 87.5%
pkg/reconciler/taskrun/taskrun.go 71.6% 71.6% -0.0
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 85.6% 84.7% -0.9

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 72.7% 53.3% -19.4
pkg/reconciler/pipelinerun/metrics.go Do not exist 87.5%
pkg/reconciler/pipelinerun/pipelinerun.go 83.1% 83.5% 0.4
pkg/reconciler/taskrun/metrics.go Do not exist 87.5%
pkg/reconciler/taskrun/taskrun.go 71.6% 71.6% -0.0
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 85.6% 84.7% -0.9

@hrishin hrishin changed the title Adds pipeline metrics Adds pipeline metrics 🔭 Oct 7, 2019
Copy link
Member

@imjasonh imjasonh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really excited to see this coming together, thanks Hrishikesh!

After this lands let me know if you'd like me to do some of the other metrics (taskruns, etc.)

pkg/reconciler/pipelinerun/metrics_test.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/pipelinerun.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/pipelinerun.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/pipelinerun.go Show resolved Hide resolved
pkg/reconciler/pipelinerun/pipelinerun.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/metrics_test.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/metrics_test.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/metrics_test.go Show resolved Hide resolved
pkg/reconciler/pipelinerun/metrics_test.go Outdated Show resolved Hide resolved
@@ -156,6 +156,7 @@ func (c *Reconciler) Reconcile(ctx context.Context, key string) error {
c.Logger.Errorf("Failed to update TaskRun status for PipelineRun %s: %v", pr.Name, err)
return err
}
_ = c.metrics.DurationAndCount(pr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since DurationAndCount might take a long time (see comment below), have you considered running this in a goroutine so that Reconcile is not blocked?

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exciting 🕺

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 79.4% 2.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.7% 83.4% 0.7
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.7% -0.9
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 79.4% 2.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.7% 83.4% 0.7
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.7% -0.9
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 79.4% 2.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.7% 83.4% 0.7
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.7% -0.9
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 79.4% 2.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.7% 83.4% 0.7
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.7% -0.9
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

@afrittoli
Copy link
Member

https://thenewstack.io/opentracing-opencensus-merge-into-a-single-new-project-opentelemetry/

I wonder we should start right away with opentelemetry, or if the new API is too young (it's still alpha)?

@@ -28,7 +28,7 @@ import (

const (
// ControllerLogKey is the name of the logger for the controller cmd
ControllerLogKey = "controller"
ControllerLogKey = "tekton"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary in this change? It seems useful to key messages from this binary as coming from the controller, as opposed to "Tekton" which could mean lots of things.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A valid argument,

issue: this name used in a metric name as a prefix which I didn't like controller_***.
Ideally, we should fix the limitation in knative pkg or I'm open for any alternative suggestions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, so this "log key" is used both for logging, and for metrics, is that the case?

Maybe just file an issue with knative/pkg to make these separate, and link that here in a TODO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware it's is used as log key as well

}

// IsOfAPipeline return true if taskrun is a part of pipeline
func (tr *TaskRun) IsOfAPipeline() (bool, map[string]string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a description of the returned map to the doc comment. And since it returns something more than just a simple bool it might make sense to rephrase the method name?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it should return the Pipeline name and PipelineRun name, instead of a map containing both of those strings?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better now?

pkg/reconciler/pipelinerun/metrics.go Show resolved Hide resolved

if err != nil {
r.initialized = false
return r, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this return a nil Recorder, instead of one that will error when asked to record anything?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wanted to return the nil, however everytime in before calling the metrics methods in controller flow we need to check for nil, right? instead of doing that way we thought to add that check while calling metrics methods. Now trade here is the method has to check if the instance is actually initialized which is either not that great but it avoid many if check scatter accross caller places.

release.yaml Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/metrics_test.go Outdated Show resolved Hide resolved
pkg/reconciler/stats.go Outdated Show resolved Hide resolved
pkg/reconciler/pipelinerun/pipelinerun.go Outdated Show resolved Hide resolved
pkg/reconciler/taskrun/metrics.go Outdated Show resolved Hide resolved
@hrishin
Copy link
Member Author

hrishin commented Oct 15, 2019

After this lands let me know if you'd like me to do some of the other metrics (taskruns, etc.)

Yes @imjasonh why not. I will address all review comments today and get this PR in shape (sorry I was for some time)

@hrishin
Copy link
Member Author

hrishin commented Oct 15, 2019

https://thenewstack.io/opentracing-opencensus-merge-into-a-single-new-project-opentelemetry/

I wonder we should start right away with opentelemetry, or if the new API is too young (it's still alpha)?

@afrittoli thats the valid suggestion. We discussed it in the upstream call. It turned to be very young project but progressively we will make the transition over open telemetry as it becomes consumable

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 77.4% 0.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.6% 83.3% 0.7
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.7% -0.9
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 77.4% 0.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.6% 83.3% 0.7
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.7% -0.9
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM for me but commits will need to be squashed 😝

Copy link
Member

@imjasonh imjasonh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good, just a couple more fixes/questions.

return errors.New("ignoring the metrics recording, failed to initialize the metrics recorder")
}

prs, err := lister.List(labels.Everything())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's run this in another goroutine, so that this doesn't contribute to reconciliation slowness. I'm glad it's coming from an in-memory cache, but this might still take long enough to matter, especially if some things aren't guaranteed to be in the cache.

pkg/reconciler/pipelinerun/metrics_test.go Outdated Show resolved Hide resolved
pkg/reconciler/taskrun/metrics.go Outdated Show resolved Hide resolved
duration = tr.Status.CompletionTime.Sub(tr.Status.StartTime.Time)
}

taskName := "annonymous"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default value should be "". It's possible to name a Task anonymous which would make this misleading.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally yes, exporter drops labels with empty values when I tried initially(will confirm it again).
We need it for when TaskRun embeds Task specification.

kind: TaskRun
metadata:
  generateName: home-volume-
spec:
  taskSpec:
    steps:
    - name: write
      image: ubuntu
      command: ['bash']
...


status := "success"
if tr.Status.Conditions[0].Status == corev1.ConditionFalse {
status = "failed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we might want to break this out to signal timeout, cancelled, and possibly future states?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like yes, we may have to 🤔

pkg/reconciler/taskrun/metrics_test.go Outdated Show resolved Hide resolved
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 77.4% 0.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.6% 83.3% 0.7
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.7% -0.9
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 77.4% 0.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.6% 83.5% 0.8
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.5% -1.1
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 77.4% 0.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.6% 83.5% 0.8
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.5% -1.1
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

Often, as a developer or administrator(ops) I want some insights
about pipeline behavior in terms of time taken to execute pipleinerun/taskrun,
its success or failure ratio, pod latencies etc.
At present tekton pipelines has very limited ways to surface such information
or it's hard to get those details looking at resources yamls.

This patch exposes above mentioned pipelines metrics on '/metrics'
endpoint using knative `pkg/metrics` package. User can collect such
metrics using prometheus, stackdriver or other supported metrics system.

To some extent its solves
 - tektoncd#540
 - tektoncd#164
@tekton-robot
Copy link
Collaborator

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/pipeline/v1alpha1/taskrun_types.go 76.9% 77.4% 0.5
pkg/reconciler/pipelinerun/metrics.go Do not exist 86.0%
pkg/reconciler/pipelinerun/pipelinerun.go 82.6% 83.5% 0.8
pkg/reconciler/taskrun/metrics.go Do not exist 86.7%
pkg/reconciler/taskrun/taskrun.go 71.6% 70.5% -1.1
test/builder/pod.go 97.4% 79.2% -18.3
test/builder/task.go 84.3% 83.4% -0.8

Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2019
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2019
@vdemeester
Copy link
Member

/test pull-tekton-pipeline-integration-tests

@tekton-robot tekton-robot merged commit 4bbd797 into tektoncd:master Oct 17, 2019
Copy link
Member

@imjasonh imjasonh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this got lgtm'ed and merged before addressing some of the comments. We can follow up on those later, but it would be great to have issues to track/assign that work.


status := "success"
if pr.Status.Conditions[0].Status == corev1.ConditionFalse {
status = "failed"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you file an issue to track adding metrics for timeout/cancelled/etc runs?

duration = tr.Status.CompletionTime.Sub(tr.Status.StartTime.Time)
}

taskName := "anonymous"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you file an issue to track a better solution for this?

In your previous comment you said the exporter drops labels with empty-string values, does that mean no data will be reported at all? Or just that it will be reported without a taskName? Because the latter is probably what we want, assuming we can query for data points that don't report a taskName.

@hrishin
Copy link
Member Author

hrishin commented Oct 19, 2019

Yes @imjasonh I'm creating the follow-up issue for it. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants