
enhancement(tag_cardinality_limit transform): A setting for per-metric vs global tag cardinality tracking#25372

Open
ArunPiduguDD wants to merge 1 commit into master from
arun.pidugu/tag-cardinality-tracking-scope

Conversation

@ArunPiduguDD
Contributor

@ArunPiduguDD ArunPiduguDD commented May 5, 2026

Summary

When metrics do not have an explicit per_metric_limits entry, their tag values were always pooled into a single shared bucket. This can lead to scenarios such as:

  • If metric1 and metric2 both carry the host tag, and metric1 pushes the host tag above the cardinality limit, the host tag is dropped on metric2 as well, even if metric2 only has 1-2 unique values for it.
  • If ~100 metrics carry the host tag with 1-2 unique values each, a cardinality limit of 50 drops the tag across all of those metrics.

The new tracking_scope setting lets users opt into per-metric tracking buckets instead, providing isolation at the cost of higher memory.

Default is global (current behavior); per_metric gives every distinct (namespace, name) its own bucket regardless of per_metric_limits membership.
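The keying difference can be sketched in a few lines of Python. This is an illustrative model only; names like bucket_key and accept are hypothetical and do not reflect Vector's actual internals:

```python
# Sketch of how tag values are bucketed under each tracking scope.
# global: all metrics share one bucket per tag key.
# per_metric: the bucket is keyed by (namespace, name) as well.
from collections import defaultdict

VALUE_LIMIT = 5

def bucket_key(scope, namespace, name):
    # Hypothetical helper: per_metric keys on the metric identity,
    # global collapses every metric into a single shared key.
    return (namespace, name) if scope == "per_metric" else None

def accept(seen, scope, namespace, name, tag, value):
    values = seen[(bucket_key(scope, namespace, name), tag)]
    if value in values or len(values) < VALUE_LIMIT:
        values.add(value)
        return True
    return False  # over the limit: the tag (or event) would be dropped

seen = defaultdict(set)
# metric1 blows past the limit on the "host" tag...
for i in range(10):
    accept(seen, "global", None, "metric1", "host", f"h{i}")
# ...so under global scope metric2's single host value is now rejected,
assert accept(seen, "global", None, "metric2", "host", "h-only") is False
# while per_metric scope keeps metric2 isolated in its own bucket.
assert accept(seen, "per_metric", None, "metric2", "host", "h-only") is True
```

This mirrors the first scenario above: under global scope metric1's cardinality exhausts the shared bucket, while per_metric scope isolates metric2 at the cost of one value set per (metric, tag) pair.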

Vector configuration

sources:
  otel:
    type: opentelemetry
    grpc:
      address: "0.0.0.0:4317"
    http:
      address: "0.0.0.0:4318"

  cardinality:
    type: tag_cardinality_limit
    inputs: ["otel.metrics"]
    value_limit: 5
    mode: exact
    limit_exceeded_action: drop_event

    # The new setting under test. Try toggling between `global` (current behavior:
    # all metrics without a `per_metric_limits` entry share one bucket) and
    # `per_metric` (every metric name gets its own bucket).
    tracking_scope: per_metric

    per_metric_limits:
      # Tighter override on this specific metric — applies regardless of `tracking_scope`.
      demo_value_gauge:
        value_limit: 2
        mode: exact
        limit_exceeded_action: drop_event

      demo_value_counter:
        value_limit: 6
        mode: exact
        limit_exceeded_action: drop_tag

sinks:
  console:
    type: console
    inputs: ["cardinality"]
    encoding:
      codec: json

How did you test this PR?

Tested with the above configuration. Simulated an OTel Collector with the following Python script:

import random
import string
from uuid import uuid4

from opentelemetry import metrics
from opentelemetry.metrics import Observation
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import InMemoryMetricReader, MetricExportResult
from opentelemetry.sdk.metrics.view import View, DropAggregation
from opentelemetry.sdk.resources import Resource


VECTOR_METRICS_ENDPOINT = "http://localhost:4318/v1/metrics"


def rand_token(prefix: str, n: int = 8) -> str:
    return f"{prefix}-{''.join(random.choices(string.ascii_lowercase + string.digits, k=n))}"


def random_trace_id() -> str:
    return uuid4().hex + uuid4().hex


def random_environment() -> str:
    return rand_token(random.choice(["dev", "staging", "prod", "local", "qa"]))


def build_common_tags() -> dict[str, str]:
    return {
        "trace_id": random_trace_id(),
        "environment": random_environment(),
    }


def build_system_process_tags() -> dict[str, str]:
    tags = build_common_tags()
    tags.update(
        {
            "host_id": rand_token("host"),
            "process_group": rand_token("pg"),
            "shard": rand_token("shard"),
            "worker": rand_token("worker"),
        }
    )
    return tags


def main() -> None:
    resource = Resource.create(
        {
            "service.name": "vector-http-metrics-demo",
            "service.version": "1.0.0",
        }
    )

    reader = InMemoryMetricReader()

    provider = MeterProvider(
        resource=resource,
        metric_readers=[reader],
        views=[
            View(instrument_name="*", aggregation=DropAggregation()),
            View(instrument_name="demo_value_gauge"),
            View(instrument_name="system.process.count"),
            View(instrument_name="demo_value_counter"),
            View(instrument_name="demo_value_secondary_gauge"),
            View(instrument_name="demo_value_secondary_counter"),
        ],
    )
    metrics.set_meter_provider(provider)

    exporter = OTLPMetricExporter(
        endpoint=VECTOR_METRICS_ENDPOINT,
        timeout=3000,
    )

    meter = metrics.get_meter("demo-meter")

    state = {
        "demo_value_gauge": {
            "value": 0.0,
            "tags": build_common_tags(),
        },
        "system.process.count": {
            "value": 0.0,
            "tags": build_system_process_tags(),
        },
        "demo_value_counter": {
            "value": 0.0,
            "tags": build_common_tags(),
        },
        "demo_value_secondary_gauge": {
            "value": 0.0,
            "tags": build_common_tags(),
        },
        "demo_value_secondary_counter": {
            "value": 0.0,
            "tags": build_common_tags(),
        },
    }

    def demo_value_gauge_callback(_options):
        s = state["demo_value_gauge"]
        return [Observation(s["value"], s["tags"])]

    def system_process_count_callback(_options):
        s = state["system.process.count"]
        return [Observation(s["value"], s["tags"])]

    def demo_value_counter_callback(_options):
        s = state["demo_value_counter"]
        return [Observation(s["value"], s["tags"])]

    def demo_value_secondary_gauge_callback(_options):
        s = state["demo_value_secondary_gauge"]
        return [Observation(s["value"], s["tags"])]

    def demo_value_secondary_counter_callback(_options):
        s = state["demo_value_secondary_counter"]
        return [Observation(s["value"], s["tags"])]

    meter.create_observable_gauge(
        name="demo_value_gauge",
        callbacks=[demo_value_gauge_callback],
        description="Gauge metric exported to Vector over OTLP/HTTP",
        unit="1",
    )

    meter.create_observable_gauge(
        name="system.process.count",
        callbacks=[system_process_count_callback],
        description="Process count metric exported to Vector over OTLP/HTTP",
        unit="1",
    )

    meter.create_observable_counter(
        name="demo_value_counter",
        callbacks=[demo_value_counter_callback],
        description="Counter metric exported to Vector over OTLP/HTTP",
        unit="1",
    )

    meter.create_observable_gauge(
        name="demo_value_secondary_gauge",
        callbacks=[demo_value_secondary_gauge_callback],
        description="Second demo gauge metric exported to Vector over OTLP/HTTP",
        unit="1",
    )

    meter.create_observable_counter(
        name="demo_value_secondary_counter",
        callbacks=[demo_value_secondary_counter_callback],
        description="Second demo counter metric exported to Vector over OTLP/HTTP",
        unit="1",
    )

    print(f"Configured OTLP/HTTP metrics endpoint: {VECTOR_METRICS_ENDPOINT}")
    print("Press Enter to send all five metrics with random values and random tags.")
    print("Type q and press Enter to quit.")

    try:
        while True:
            user_input = input("> ").strip().lower()
            if user_input in {"q", "quit", "exit"}:
                break

            state["demo_value_gauge"]["value"] = round(random.uniform(0, 100), 2)
            state["system.process.count"]["value"] = float(random.randint(1, 500))
            state["demo_value_counter"]["value"] += float(random.randint(1, 20))
            state["demo_value_secondary_gauge"]["value"] = round(random.uniform(-50, 50), 2)
            state["demo_value_secondary_counter"]["value"] += float(random.randint(1, 10))

            state["demo_value_gauge"]["tags"] = build_common_tags()
            state["system.process.count"]["tags"] = build_system_process_tags()
            state["demo_value_counter"]["tags"] = build_common_tags()
            state["demo_value_secondary_gauge"]["tags"] = build_common_tags()
            state["demo_value_secondary_counter"]["tags"] = build_common_tags()

            metrics_data = reader.get_metrics_data()
            if metrics_data is None:
                print("send failed: no metrics data collected")
                continue

            result = exporter.export(metrics_data)

            if result is MetricExportResult.SUCCESS:
                print("sent all metrics")
                print(
                    f"  demo_value_gauge value={state['demo_value_gauge']['value']} "
                    f"tags={state['demo_value_gauge']['tags']}"
                )
                print(
                    f"  system.process.count value={state['system.process.count']['value']} "
                    f"tags={state['system.process.count']['tags']}"
                )
                print(
                    f"  demo_value_counter value={state['demo_value_counter']['value']} "
                    f"tags={state['demo_value_counter']['tags']}"
                )
                print(
                    f"  demo_value_secondary_gauge value={state['demo_value_secondary_gauge']['value']} "
                    f"tags={state['demo_value_secondary_gauge']['tags']}"
                )
                print(
                    f"  demo_value_secondary_counter value={state['demo_value_secondary_counter']['value']} "
                    f"tags={state['demo_value_secondary_counter']['tags']}"
                )
            else:
                print("send failed for this export batch")

    finally:
        exporter.shutdown()
        provider.shutdown()


if __name__ == "__main__":
    main()

Change Type

  • Bug fix
  • New feature
  • Dependencies
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

Notes

  • Please read our Vector contributor resources.
  • Do not hesitate to use @vectordotdev/vector to reach out to us regarding this PR.
  • Some CI checks run only after we manually approve them.
    • We recommend adding a pre-push hook, please see this template.
    • Alternatively, we recommend running the following locally before pushing to the remote branch:
      • make fmt
      • make check-clippy (if there are failures it's possible some of them can be fixed with make clippy-fix)
      • make test
  • After a review is requested, please avoid force pushes to help us review incrementally.
    • Feel free to push as many commits as you want. They will be squashed into one before merging.
    • For example, you can run git merge origin master and git push.
  • If this PR introduces changes to Vector dependencies (modifies Cargo.lock), please
    run make build-licenses to regenerate the license inventory and commit the changes (if any). More details on the dd-rust-license-tool.

@ArunPiduguDD ArunPiduguDD requested review from a team as code owners May 5, 2026 17:24
@github-actions github-actions bot added the docs review on hold, domain: transforms, and domain: external docs labels May 5, 2026
@ArunPiduguDD ArunPiduguDD marked this pull request as draft May 5, 2026 17:25
@ArunPiduguDD ArunPiduguDD changed the title feat(tag_cardinality_limit transform): add tracking_scope setting f… feat(tag_cardinality_limit transform): A setting for per-metric vs global tag cardinality tracking May 5, 2026
@ArunPiduguDD ArunPiduguDD changed the title feat(tag_cardinality_limit transform): A setting for per-metric vs global tag cardinality tracking enhancement(tag_cardinality_limit transform): A setting for per-metric vs global tag cardinality tracking May 5, 2026
@ArunPiduguDD ArunPiduguDD force-pushed the arun.pidugu/tag-cardinality-tracking-scope branch from 8044081 to 9a136f2 May 5, 2026 19:03
@ArunPiduguDD ArunPiduguDD marked this pull request as ready for review May 5, 2026 19:22
Diff excerpt the comment is anchored to:

-             Some((metric_namespace, metric_name.clone()))
-         } else {
-             None
+         let metric_key = match self.config.tracking_scope {
Member

One concern about tracking_scope: per_metric: the accepted_tags map can only grow (no cap, TTL, or eviction). In per_metric mode every distinct (namespace, name) seen on the wire becomes a permanent bucket, and within each bucket every tag key allocates its own AcceptedTagValueSet.

The pre-existing code had this growth pattern too, but it was bounded by the user's per_metric_limits config. With per_metric scope the bound becomes dynamic and controlled by upstream metric names, so if a source emits high cardinality metric names (an anti-pattern but one we see in the wild), the transform's memory grows monotonically for the lifetime of the process.

There are a few options here but adding a max_tracked_metrics (or similar) knob seems reasonable. When we hit this limit, we can reject new metric IDs. I am open to discussing an LRU strategy too.

Contributor Author


@pront I'd say technically this problem also existed before, because even with a global tag counter the number of tags being tracked is still unbounded (though I definitely agree it's more likely to be an issue with the new per-metric tracking scope).

Will go with a "max tracked tags" approach that works for either tracking scope: it caps the total number of items tracked, whether those are (metric, tag) pairs under per-metric scope or (None, tag) entries under global scope.

Will leave any strategies like LRU cache out for now
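The proposed cap could look roughly like this. A sketch only, under the assumptions in the comment above; MAX_TRACKED, trackers, and get_tracker are placeholder names, not the merged API:

```python
# Hypothetical sketch of a "max tracked tags" cap applied to the tracking
# map itself: once the total number of (metric_key, tag) entries reaches
# the cap, new entries are rejected instead of allocating another value set.
MAX_TRACKED = 3

trackers = {}  # (metric_key, tag) -> set of accepted tag values

def get_tracker(metric_key, tag):
    entry = (metric_key, tag)
    if entry not in trackers:
        if len(trackers) >= MAX_TRACKED:
            return None  # cap hit: refuse to track a new entry
        trackers[entry] = set()
    return trackers[entry]

# Three entries fit under the cap; the fourth is rejected.
assert get_tracker(("ns", "m1"), "host") is not None
assert get_tracker(("ns", "m2"), "host") is not None
assert get_tracker(("ns", "m3"), "host") is not None
assert get_tracker(("ns", "m4"), "host") is None
```

Existing entries keep working after the cap is hit; only net-new (metric, tag) combinations are refused, which bounds memory without evicting anything (no LRU, per the comment above).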

Contributor Author


Also will leave this field as optional for those that do not want to set it

@ArunPiduguDD ArunPiduguDD force-pushed the arun.pidugu/tag-cardinality-tracking-scope branch from 9a136f2 to b402066 May 7, 2026 19:57
…or per-metric vs global tag tracking

When metrics do not have an explicit `per_metric_limits` entry, their tag values
were always pooled into a single shared bucket. The new `tracking_scope` setting
lets users opt into per-metric tracking buckets instead, providing isolation at
the cost of higher memory.

Default is `global` (current behavior); `per_metric` gives every distinct
(namespace, name) its own bucket regardless of `per_metric_limits` membership.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@ArunPiduguDD ArunPiduguDD force-pushed the arun.pidugu/tag-cardinality-tracking-scope branch from b402066 to 445abd6 May 8, 2026 02:33