add namespace label/tag to non-deprecated throttle metrics
Back when implementing #6744 for #6631, we failed to realize that, because k8s quota policies are namespace scoped, knowing which namespace the throttled items were in would have diagnostic value.

Now that the metric has been in use for a while, that value has become very apparent.

This change introduces the namespace tag. Also, since this space was last touched, the original metric was deprecated and a new one with a shorter name was added; this change only adds the new label to the non-deprecated metric.

Lastly, the default behavior is preserved: the new label is only emitted when it is explicitly enabled in the observability config map.
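In rough terms, the mechanism is this: the throttle gauge views optionally gain a `namespace` tag key, and the recorder stamps each per-namespace value with that tag before recording it. Below is a minimal, self-contained sketch of that pattern. It is an illustration only: it uses plain OpenCensus calls rather than the `metrics.Record` wrapper and view plumbing in the diff below, and the names and data are placeholders.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

func main() {
	// Gauge measure analogous to running_taskruns_throttled_by_quota.
	throttled := stats.Float64("running_taskruns_throttled_by_quota",
		"number of taskruns throttled by quota", stats.UnitDimensionless)
	namespaceKey := tag.MustNewKey("namespace")

	// Only attach the namespace tag key when the option is enabled,
	// mirroring the conditional TagKeys added in viewRegister below.
	enableNamespace := true
	tagKeys := []tag.Key{}
	if enableNamespace {
		tagKeys = append(tagKeys, namespaceKey)
	}
	if err := view.Register(&view.View{
		Name:        "running_taskruns_throttled_by_quota",
		Description: throttled.Description(),
		Measure:     throttled,
		Aggregation: view.LastValue(),
		TagKeys:     tagKeys,
	}); err != nil {
		log.Fatal(err)
	}

	// Record one gauge value per namespace, stamped with the namespace tag.
	perNamespace := map[string]float64{"team-a": 2, "team-b": 0}
	for ns, count := range perNamespace {
		ctx, err := tag.New(context.Background(), tag.Insert(namespaceKey, ns))
		if err != nil {
			log.Fatal(err)
		}
		stats.Record(ctx, throttled.M(count))
		fmt.Printf("recorded %v for namespace %q\n", count, ns)
	}
}
```

When the tag key is left off the view (the default), the same recording calls still work; the namespace tag is simply not used as an aggregation dimension, which is how the existing behavior is preserved.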
gabemontero committed Apr 30, 2024
1 parent 2b4e2b1 commit db76fe0
Showing 6 changed files with 151 additions and 32 deletions.
35 changes: 18 additions & 17 deletions docs/metrics.md
@@ -11,24 +11,24 @@ The following pipeline metrics are available at `controller-service` on port `90

We expose several kinds of exporters, including Prometheus, Google Stackdriver, and many others. You can set them up using [observability configuration](../config/config-observability.yaml).

| Name | Type | Labels/Tags | Status |
|-----------------------------------------------------------------------------------------| ----------- | ----------- | ----------- |
| Name | Type | Labels/Tags | Status |
|-----------------------------------------------------------------------------------------| ----------- |-------------------------------------------------| ----------- |
| `tekton_pipelines_controller_pipelinerun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `*pipeline`=&lt;pipeline_name&gt; <br> `*pipelinerun`=&lt;pipelinerun_name&gt; <br> `status`=&lt;status&gt; <br> `namespace`=&lt;pipelinerun-namespace&gt; | experimental |
| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `*pipeline`=&lt;pipeline_name&gt; <br> `*pipelinerun`=&lt;pipelinerun_name&gt; <br> `status`=&lt;status&gt; <br> `*task`=&lt;task_name&gt; <br> `*taskrun`=&lt;taskrun_name&gt;<br> `namespace`=&lt;pipelineruns-taskruns-namespace&gt;| experimental |
| `tekton_pipelines_controller_pipelinerun_count` | Counter | `status`=&lt;status&gt; | deprecate |
| `tekton_pipelines_controller_pipelinerun_total` | Counter | `status`=&lt;status&gt; | experimental |
| `tekton_pipelines_controller_running_pipelineruns_count` | Gauge | | deprecate |
| `tekton_pipelines_controller_running_pipelineruns` | Gauge | | experimental |
| `tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `*pipeline`=&lt;pipeline_name&gt; <br> `*pipelinerun`=&lt;pipelinerun_name&gt; <br> `status`=&lt;status&gt; <br> `*task`=&lt;task_name&gt; <br> `*taskrun`=&lt;taskrun_name&gt;<br> `namespace`=&lt;pipelineruns-taskruns-namespace&gt; | experimental |
| `tekton_pipelines_controller_pipelinerun_count` | Counter | `status`=&lt;status&gt; | deprecate |
| `tekton_pipelines_controller_pipelinerun_total` | Counter | `status`=&lt;status&gt; | experimental |
| `tekton_pipelines_controller_running_pipelineruns_count` | Gauge | | deprecate |
| `tekton_pipelines_controller_running_pipelineruns` | Gauge | | experimental |
| `tekton_pipelines_controller_taskrun_duration_seconds_[bucket, sum, count]` | Histogram/LastValue(Gauge) | `status`=&lt;status&gt; <br> `*task`=&lt;task_name&gt; <br> `*taskrun`=&lt;taskrun_name&gt;<br> `namespace`=&lt;pipelineruns-taskruns-namespace&gt; | experimental |
| `tekton_pipelines_controller_taskrun_count` | Counter | `status`=&lt;status&gt; | deprecate |
| `tekton_pipelines_controller_taskrun_total` | Counter | `status`=&lt;status&gt; | experimental |
| `tekton_pipelines_controller_running_taskruns_count` | Gauge | | deprecate |
| `tekton_pipelines_controller_running_taskruns` | Gauge | | experimental |
| `tekton_pipelines_controller_running_taskruns_throttled_by_quota_count` | Gauge | | deprecate |
| `tekton_pipelines_controller_running_taskruns_throttled_by_node_count` | Gauge | | deprecate |
| `tekton_pipelines_controller_running_taskruns_throttled_by_quota` | Gauge | | experimental |
| `tekton_pipelines_controller_running_taskruns_throttled_by_node` | Gauge | | experimental |
| `tekton_pipelines_controller_client_latency_[bucket, sum, count]` | Histogram | | experimental |
| `tekton_pipelines_controller_taskrun_count` | Counter | `status`=&lt;status&gt; | deprecate |
| `tekton_pipelines_controller_taskrun_total` | Counter | `status`=&lt;status&gt; | experimental |
| `tekton_pipelines_controller_running_taskruns_count` | Gauge | | deprecate |
| `tekton_pipelines_controller_running_taskruns` | Gauge | | experimental |
| `tekton_pipelines_controller_running_taskruns_throttled_by_quota_count` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | deprecate |
| `tekton_pipelines_controller_running_taskruns_throttled_by_node_count` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | deprecate |
| `tekton_pipelines_controller_running_taskruns_throttled_by_quota` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | experimental |
| `tekton_pipelines_controller_running_taskruns_throttled_by_node` | Gauge | <br> `namespace`=&lt;pipelinerun-namespace&gt; | experimental |
| `tekton_pipelines_controller_client_latency_[bucket, sum, count]` | Histogram | | experimental |

The Labels/Tags marked with "*" are optional. There is also a choice between Histogram and LastValue (Gauge) for the pipelinerun and taskrun duration metrics.

@@ -48,7 +48,7 @@ A sample config-map has been provided as [config-observability](./../config/conf
Following values are available in the configmap:

| configmap data | value | description |
| ---------- | ----------- | ----------- |
| -- | ----------- | ----------- |
| metrics.taskrun.level | `taskrun` | Level of metrics is taskrun |
| metrics.taskrun.level | `task` | Level of metrics is task and taskrun label isn't present in the metrics |
| metrics.taskrun.level | `namespace` | Level of metrics is namespace, and task and taskrun label isn't present in the metrics
@@ -60,6 +60,7 @@ Following values are available in the configmap:
| metrics.pipelinerun.duration-type | `histogram` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type histogram |
| metrics.pipelinerun.duration-type | `lastvalue` | `tekton_pipelines_controller_pipelinerun_duration_seconds` is of type gauge or lastvalue |
| metrics.count.enable-reason | `false` | Sets if the `reason` label should be included on count metrics |
| metrics.taskrun.throttle.enable-namespace | `false` | Sets if the `namespace` label should be included on the `tekton_pipelines_controller_running_taskruns_throttled_by_quota` metric |

Histogram value isn't available when pipelinerun or taskrun labels are selected. The Lastvalue or Gauge will be provided. Histogram would serve no purpose because it would generate a single bar. TaskRun and PipelineRun level metrics aren't recommended because they lead to an unbounded cardinality which degrades the observability database.

9 changes: 9 additions & 0 deletions pkg/apis/config/metrics.go
@@ -39,6 +39,9 @@ const (
// countWithReasonKey sets if the reason label should be included on count metrics
countWithReasonKey = "metrics.count.enable-reason"

// throttledWithNamespaceKey sets if the namespace label should be included on the taskrun throttled metrics
throttledWithNamespaceKey = "metrics.taskrun.throttle.enable-namespace"

// DefaultTaskrunLevel determines to what level to aggregate metrics
// when it isn't specified in configmap
DefaultTaskrunLevel = TaskrunLevelAtTask
@@ -96,6 +99,7 @@ type Metrics struct {
DurationTaskrunType string
DurationPipelinerunType string
CountWithReason bool
ThrottleWithNamespace bool
}

// GetMetricsConfigName returns the name of the configmap containing all
@@ -129,6 +133,7 @@ func newMetricsFromMap(cfgMap map[string]string) (*Metrics, error) {
DurationTaskrunType: DefaultDurationTaskrunType,
DurationPipelinerunType: DefaultDurationPipelinerunType,
CountWithReason: false,
ThrottleWithNamespace: false,
}

if taskrunLevel, ok := cfgMap[metricsTaskrunLevelKey]; ok {
@@ -149,6 +154,10 @@ func newMetricsFromMap(cfgMap map[string]string) (*Metrics, error) {
tc.CountWithReason = true
}

if throttleWithNamespace, ok := cfgMap[throttledWithNamespaceKey]; ok && throttleWithNamespace != "false" {
tc.ThrottleWithNamespace = true
}

return &tc, nil
}
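As a quick illustration of the parsing behavior added above (any value other than `"false"` for the new key enables the label, and it defaults to off), here is a sketch of a unit test, under the assumption that it lives in the same `config` package so it can reach the unexported `newMetricsFromMap`:

```go
package config

import "testing"

// Sketch: "true" (or anything other than "false") turns the namespace label on;
// omitting the key leaves it at the default of false.
func TestThrottleWithNamespaceParsing(t *testing.T) {
	m, err := newMetricsFromMap(map[string]string{
		"metrics.taskrun.throttle.enable-namespace": "true",
	})
	if err != nil {
		t.Fatalf("newMetricsFromMap: %v", err)
	}
	if !m.ThrottleWithNamespace {
		t.Error("expected ThrottleWithNamespace to be true")
	}

	m, err = newMetricsFromMap(map[string]string{})
	if err != nil {
		t.Fatalf("newMetricsFromMap: %v", err)
	}
	if m.ThrottleWithNamespace {
		t.Error("expected ThrottleWithNamespace to default to false")
	}
}
```

The testdata configmap added later in this commit (`config-observability-throttle.yaml`) exercises the same key end to end through the existing table-driven tests.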

15 changes: 15 additions & 0 deletions pkg/apis/config/metrics_test.go
@@ -39,6 +39,7 @@ func TestNewMetricsFromConfigMap(t *testing.T) {
DurationTaskrunType: config.DurationPipelinerunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeHistogram,
CountWithReason: false,
ThrottleWithNamespace: false,
},
fileName: config.GetMetricsConfigName(),
},
@@ -49,6 +50,7 @@ func TestNewMetricsFromConfigMap(t *testing.T) {
DurationTaskrunType: config.DurationTaskrunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeLastValue,
CountWithReason: false,
ThrottleWithNamespace: false,
},
fileName: "config-observability-namespacelevel",
},
@@ -59,9 +61,21 @@ func TestNewMetricsFromConfigMap(t *testing.T) {
DurationTaskrunType: config.DurationTaskrunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeLastValue,
CountWithReason: true,
ThrottleWithNamespace: false,
},
fileName: "config-observability-reason",
},
{
expectedConfig: &config.Metrics{
TaskrunLevel: config.TaskrunLevelAtNS,
PipelinerunLevel: config.PipelinerunLevelAtNS,
DurationTaskrunType: config.DurationTaskrunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeLastValue,
CountWithReason: true,
ThrottleWithNamespace: true,
},
fileName: "config-observability-throttle",
},
}

for _, tc := range testCases {
@@ -77,6 +91,7 @@ func TestNewMetricsFromEmptyConfigMap(t *testing.T) {
DurationTaskrunType: config.DurationPipelinerunTypeHistogram,
DurationPipelinerunType: config.DurationPipelinerunTypeHistogram,
CountWithReason: false,
ThrottleWithNamespace: false,
}
verifyConfigFileWithExpectedMetricsConfig(t, MetricsConfigEmptyName, expectedConfig)
}
32 changes: 32 additions & 0 deletions pkg/apis/config/testdata/config-observability-throttle.yaml
@@ -0,0 +1,32 @@
# Copyright 2019 The Tekton Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
kind: ConfigMap
metadata:
name: config-observability
namespace: tekton-pipelines
labels:
app.kubernetes.io/instance: default
app.kubernetes.io/part-of: tekton-pipelines
data:
metrics.backend-destination: prometheus
metrics.stackdriver-project-id: "<your stackdriver project id>"
metrics.allow-stackdriver-custom-metrics: "false"
metrics.taskrun.level: "namespace"
metrics.taskrun.duration-type: "histogram"
metrics.pipelinerun.level: "namespace"
metrics.pipelinerun.duration-type: "lastvalue"
metrics.count.enable-reason: "true"
metrics.taskrun.throttle.enable-namespace: "true"
60 changes: 52 additions & 8 deletions pkg/taskrunmetrics/metrics.go
@@ -268,15 +268,21 @@ func viewRegister(cfg *config.Metrics) error {
Aggregation: view.LastValue(),
}

throttleViewTags := []tag.Key{}
if cfg.ThrottleWithNamespace {
throttleViewTags = append(throttleViewTags, namespaceTag)
}
runningTRsThrottledByQuotaView = &view.View{
Description: runningTRsThrottledByQuota.Description(),
Measure: runningTRsThrottledByQuota,
Aggregation: view.LastValue(),
TagKeys: throttleViewTags,
}
runningTRsThrottledByNodeView = &view.View{
Description: runningTRsThrottledByNode.Description(),
Measure: runningTRsThrottledByNode,
Aggregation: view.LastValue(),
TagKeys: throttleViewTags,
}
podLatencyView = &view.View{
Description: podLatency.Description(),
@@ -428,21 +434,40 @@ func (r *Recorder) RunningTaskRuns(ctx context.Context, lister listers.TaskRunLi
}

var runningTrs int
var trsThrottledByQuota int
var trsThrottledByNode int
trsThrottledByQuota := map[string]int{}
trsThrottledByQuotaCount := 0
trsThrottledByNode := map[string]int{}
trsThrottledByNodeCount := 0
var trsWaitResolvingTaskRef int
for _, pr := range trs {
// initialize metrics with namespace tag to zero if unset; will then update as needed below
_, ok := trsThrottledByQuota[pr.Namespace]
if !ok {
trsThrottledByQuota[pr.Namespace] = 0
}
_, ok = trsThrottledByNode[pr.Namespace]
if !ok {
trsThrottledByNode[pr.Namespace] = 0
}

if pr.IsDone() {
continue
}
runningTrs++

succeedCondition := pr.Status.GetCondition(apis.ConditionSucceeded)
if succeedCondition != nil && succeedCondition.Status == corev1.ConditionUnknown {
switch succeedCondition.Reason {
case pod.ReasonExceededResourceQuota:
trsThrottledByQuota++
trsThrottledByQuotaCount++
cnt := trsThrottledByQuota[pr.Namespace]
cnt++
trsThrottledByQuota[pr.Namespace] = cnt
case pod.ReasonExceededNodeResources:
trsThrottledByNode++
trsThrottledByNodeCount++
cnt := trsThrottledByNode[pr.Namespace]
cnt++
trsThrottledByNode[pr.Namespace] = cnt
case v1.TaskRunReasonResolvingTaskRef:
trsWaitResolvingTaskRef++
}
@@ -455,12 +480,31 @@
}
metrics.Record(ctx, runningTRsCount.M(float64(runningTrs)))
metrics.Record(ctx, runningTRs.M(float64(runningTrs)))
metrics.Record(ctx, runningTRsThrottledByNodeCount.M(float64(trsThrottledByNode)))
metrics.Record(ctx, runningTRsThrottledByQuotaCount.M(float64(trsThrottledByQuota)))
metrics.Record(ctx, runningTRsWaitingOnTaskResolutionCount.M(float64(trsWaitResolvingTaskRef)))
metrics.Record(ctx, runningTRsThrottledByNode.M(float64(trsThrottledByNode)))
metrics.Record(ctx, runningTRsThrottledByQuota.M(float64(trsThrottledByQuota)))
metrics.Record(ctx, runningTRsThrottledByQuotaCount.M(float64(trsThrottledByQuotaCount)))
metrics.Record(ctx, runningTRsThrottledByNodeCount.M(float64(trsThrottledByNodeCount)))

cfg := config.FromContextOrDefaults(ctx)
addNamespaceLabelToQuotaThrottleMetric := cfg.Metrics != nil && cfg.Metrics.ThrottleWithNamespace

for ns, cnt := range trsThrottledByQuota {
var mutators []tag.Mutator
if addNamespaceLabelToQuotaThrottleMetric {
mutators = []tag.Mutator{tag.Insert(namespaceTag, ns)}
}
ctx, err = tag.New(ctx, mutators...)
if err != nil {
return err
}
metrics.Record(ctx, runningTRsThrottledByQuota.M(float64(cnt)))
}
for ns, cnt := range trsThrottledByNode {
ctx, err = tag.New(ctx, []tag.Mutator{tag.Insert(namespaceTag, ns)}...)
if err != nil {
return err
}
metrics.Record(ctx, runningTRsThrottledByNode.M(float64(cnt)))
}
return nil
}
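Distilled from the loop above: the recorder now tallies throttled-but-not-done TaskRuns into maps keyed by namespace and then emits one gauge value per namespace. The zero-initialization for every namespace it sees appears to be there so that, with `LastValue` aggregation, a namespace that stops being throttled reports 0 rather than keeping its previous value. A dependency-free sketch of just the counting step follows, with a hypothetical `taskRunInfo` type and sample data; the reason strings are illustrative stand-ins for the `pod.Reason...` constants.

```go
package main

import "fmt"

// Minimal stand-in for the fields the loop above actually inspects.
type taskRunInfo struct {
	Namespace string
	Done      bool
	Reason    string
}

// countThrottled mirrors the per-namespace tallying in RunningTaskRuns.
func countThrottled(trs []taskRunInfo) (byQuota, byNode map[string]int) {
	byQuota, byNode = map[string]int{}, map[string]int{}
	for _, tr := range trs {
		// Seed a zero entry for every namespace so its gauge is reset
		// even when nothing in it is currently throttled.
		if _, ok := byQuota[tr.Namespace]; !ok {
			byQuota[tr.Namespace] = 0
		}
		if _, ok := byNode[tr.Namespace]; !ok {
			byNode[tr.Namespace] = 0
		}
		if tr.Done {
			continue
		}
		switch tr.Reason {
		case "ExceededResourceQuota":
			byQuota[tr.Namespace]++
		case "ExceededNodeResources":
			byNode[tr.Namespace]++
		}
	}
	return byQuota, byNode
}

func main() {
	trs := []taskRunInfo{
		{Namespace: "team-a", Reason: "ExceededResourceQuota"},
		{Namespace: "team-a", Reason: "ExceededResourceQuota"},
		{Namespace: "team-b", Reason: "ExceededNodeResources"},
		{Namespace: "team-c", Done: true},
	}
	byQuota, byNode := countThrottled(trs)
	fmt.Println(byQuota) // map[team-a:2 team-b:0 team-c:0]
	fmt.Println(byNode)  // map[team-a:0 team-b:1 team-c:0]
}
```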

32 changes: 25 additions & 7 deletions pkg/taskrunmetrics/metrics_test.go
@@ -44,7 +44,7 @@ var (
completionTime = metav1.NewTime(startTime.Time.Add(time.Minute))
)

func getConfigContext(countWithReason bool) context.Context {
func getConfigContext(countWithReason, throttleWithNamespace bool) context.Context {
ctx := context.Background()
cfg := &config.Config{
Metrics: &config.Metrics{
Expand All @@ -53,6 +53,7 @@ func getConfigContext(countWithReason bool) context.Context {
DurationTaskrunType: config.DefaultDurationTaskrunType,
DurationPipelinerunType: config.DefaultDurationPipelinerunType,
CountWithReason: countWithReason,
ThrottleWithNamespace: throttleWithNamespace,
},
}
return config.ToContext(ctx, cfg)
@@ -84,7 +85,7 @@ func TestMetricsOnStore(t *testing.T) {
defer log.Sync()
logger := log.Sugar()

ctx := getConfigContext(false)
ctx := getConfigContext(false, false)
metrics, err := NewRecorder(ctx)
if err != nil {
t.Fatalf("NewRecorder: %v", err)
@@ -404,7 +405,7 @@ func TestRecordTaskRunDurationCount(t *testing.T) {
t.Run(c.name, func(t *testing.T) {
unregisterMetrics()

ctx := getConfigContext(c.countWithReason)
ctx := getConfigContext(c.countWithReason, false)
metrics, err := NewRecorder(ctx)
if err != nil {
t.Fatalf("NewRecorder: %v", err)
@@ -459,7 +460,7 @@ func TestRecordRunningTaskRunsCount(t *testing.T) {
}
}

ctx = getConfigContext(false)
ctx = getConfigContext(false, false)
metrics, err := NewRecorder(ctx)
if err != nil {
t.Fatalf("NewRecorder: %v", err)
@@ -479,6 +480,7 @@ func TestRecordRunningTaskRunsThrottledCounts(t *testing.T) {
nodeCount float64
quotaCount float64
waitCount float64
addNS bool
}{
{
status: corev1.ConditionTrue,
Expand All @@ -488,10 +490,20 @@ func TestRecordRunningTaskRunsThrottledCounts(t *testing.T) {
status: corev1.ConditionTrue,
reason: pod.ReasonExceededResourceQuota,
},
{
status: corev1.ConditionTrue,
reason: pod.ReasonExceededResourceQuota,
addNS: true,
},
{
status: corev1.ConditionTrue,
reason: pod.ReasonExceededNodeResources,
},
{
status: corev1.ConditionTrue,
reason: pod.ReasonExceededNodeResources,
addNS: true,
},
{
status: corev1.ConditionTrue,
reason: v1.TaskRunReasonResolvingTaskRef,
@@ -537,7 +549,7 @@ func TestRecordRunningTaskRunsThrottledCounts(t *testing.T) {
informer := faketaskruninformer.Get(ctx)
for i := 0; i < multiplier; i++ {
tr := &v1.TaskRun{
ObjectMeta: metav1.ObjectMeta{Name: names.SimpleNameGenerator.RestrictLengthWithRandomSuffix("taskrun-")},
ObjectMeta: metav1.ObjectMeta{Name: names.SimpleNameGenerator.RestrictLengthWithRandomSuffix("taskrun-"), Namespace: "test"},
Status: v1.TaskRunStatus{
Status: duckv1.Status{
Conditions: duckv1.Conditions{{
@@ -553,7 +565,7 @@ func TestRecordRunningTaskRunsThrottledCounts(t *testing.T) {
}
}

ctx = getConfigContext(false)
ctx = getConfigContext(false, tc.addNS)
metrics, err := NewRecorder(ctx)
if err != nil {
t.Fatalf("NewRecorder: %v", err)
@@ -563,7 +575,13 @@ func TestRecordRunningTaskRunsThrottledCounts(t *testing.T) {
t.Errorf("RunningTaskRuns: %v", err)
}
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_quota_count", map[string]string{}, tc.quotaCount)
nsMap := map[string]string{}
if tc.addNS {
nsMap = map[string]string{namespaceTag.Name(): "test"}
}
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_quota", nsMap, tc.quotaCount)
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_node_count", map[string]string{}, tc.nodeCount)
metricstest.CheckLastValueData(t, "running_taskruns_throttled_by_node", nsMap, tc.nodeCount)
metricstest.CheckLastValueData(t, "running_taskruns_waiting_on_task_resolution_count", map[string]string{}, tc.waitCount)
}
}
@@ -620,7 +638,7 @@ func TestRecordPodLatency(t *testing.T) {
t.Run(td.name, func(t *testing.T) {
unregisterMetrics()

ctx := getConfigContext(false)
ctx := getConfigContext(false, false)
metrics, err := NewRecorder(ctx)
if err != nil {
t.Fatalf("NewRecorder: %v", err)
