Description
Component(s)
collector
Is your feature request related to a problem? Please describe.
When the OpenTelemetry Operator deploys the OpenTelemetry Collector as a StatefulSet in a GKE cluster, the OpenTelemetryCollector CRD offers no way to set the StatefulSet's serviceName, so statefulset.serviceName does not match the headless Service name. Kubernetes registers per-pod DNS hostnames (e.g., ...svc.cluster.local) only when statefulset.serviceName matches the name of a headless Service. Without those records, the loadbalancing exporter’s Kubernetes (k8s) resolver with return_hostnames: true cannot resolve pod hostnames, resulting in:
error: couldn't find the exporter for the endpoint ""
when configured as:
exporters:
  loadbalancing:
    routing_key: "traceID"
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: true
    resolver:
      k8s:
        service: opentelemetry-backend-collector-headless.plat-observe-dev
        timeout: 3s
        return_hostnames: true
For example, my headless Service (opentelemetry-backend-collector-headless.plat-observe-dev) doesn’t match the StatefulSet’s default serviceName (opentelemetry-backend-collector), preventing hostname resolution:
kubectl get statefulset opentelemetry-backend-collector -n plat-observe-dev -o jsonpath='{.spec.serviceName}'
opentelemetry-backend-collector
nslookup opentelemetry-backend-collector-0.plat-observe-dev.svc.cluster.local
** server can't find ...: NXDOMAIN
This breaks traceID routing, critical for tail-based sampling in my setup.
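The naming rule behind this failure can be sketched in a few lines of Python. This is an illustration only, assuming the standard Kubernetes convention that a StatefulSet pod is published as pod.serviceName.namespace.svc.cluster-domain and that records exist only when serviceName names a headless Service. The helper names pod_dns_name and has_pod_dns_records are hypothetical, and cluster.local is assumed as the cluster domain.

```python
def pod_dns_name(pod_name, service_name, namespace, cluster_domain="cluster.local"):
    """Build the per-pod DNS name Kubernetes uses for a StatefulSet pod."""
    return f"{pod_name}.{service_name}.{namespace}.svc.{cluster_domain}"


def has_pod_dns_records(statefulset_service_name, headless_services):
    """Per-pod records exist only if serviceName names a headless Service.

    Simplification: assumes the Service is in the same namespace and its
    selector matches the StatefulSet's pods.
    """
    return statefulset_service_name in headless_services


# Values taken from the issue; the set of headless Services is an assumption.
NAMESPACE = "plat-observe-dev"
HEADLESS = "opentelemetry-backend-collector-headless"
CURRENT_SERVICE_NAME = "opentelemetry-backend-collector"
POD = "opentelemetry-backend-collector-0"

headless_services = {HEADLESS}

# The hostname the resolver would need to be able to look up.
wanted = pod_dns_name(POD, HEADLESS, NAMESPACE)
print(wanted)

# Current state: serviceName does not name a headless Service.
print(has_pod_dns_records(CURRENT_SERVICE_NAME, headless_services))  # False

# Requested state: serviceName matches the headless Service.
print(has_pod_dns_records(HEADLESS, headless_services))  # True
```

With serviceName set to the headless Service name, the pod hostname the resolver needs falls under a name that Kubernetes actually publishes.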
Describe the solution you'd like
Add a serviceName field to the OpenTelemetryCollector CRD to specify the headless Service name when mode: statefulset. The Operator should set statefulset.serviceName to this value, ensuring DNS hostname registration.
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-backend
  namespace: plat-observe-dev
spec:
  name: "opentelemetry-backend-collector"
  serviceName: "opentelemetry-backend-collector-headless"
  mode: statefulset
  replicas: 9
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        send_batch_size: 20000
        timeout: 5s
        send_batch_max_size: 25000
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
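The requested behavior could look roughly like the following defaulting logic. This is a hypothetical Python sketch, not the Operator's actual code (the Operator is written in Go). The function name statefulset_service_name and the default_name parameter are assumptions made for illustration; serviceName is the field proposed in this issue.

```python
def statefulset_service_name(spec, default_name):
    """Pick the value for statefulset.serviceName.

    Uses the proposed spec.serviceName when mode is statefulset and the
    field is non-empty; otherwise falls back to the existing default so
    current deployments keep their behavior.
    """
    if spec.get("mode") != "statefulset":
        return default_name
    requested = (spec.get("serviceName") or "").strip()
    return requested or default_name


# Default taken from the kubectl output in this issue.
DEFAULT = "opentelemetry-backend-collector"

with_field = {
    "mode": "statefulset",
    "serviceName": "opentelemetry-backend-collector-headless",
}
without_field = {"mode": "statefulset"}

print(statefulset_service_name(with_field, DEFAULT))
# opentelemetry-backend-collector-headless
print(statefulset_service_name(without_field, DEFAULT))
# opentelemetry-backend-collector
```

Falling back to the current default when the field is unset would keep the change backward compatible for existing OpenTelemetryCollector resources.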
Describe alternatives you've considered
Manual Workaround: Deploy the StatefulSet outside the Operator (using the official Collector Helm chart) with a matching serviceName. This works, but loses the Operator's benefits.
Static Resolver: Failed due to incorrect DNS domain (plat-observe.dev vs. svc.cluster.local).
cert-manager: Issued certificates that include the Service DNS name. This resolved the TLS errors but does not fix hostname resolution.
Istio Gateway: Considered for ingress, but it doesn’t support the pod-specific routing needed for traceID routing.
Additional context
Environment: GKE, Istio 1.19.10-asm.33 (CSM auto mode), mTLS STRICT, OpenTelemetry Collector 0.126.0, Helm chart open-telemetry/opentelemetry-collector.
Setup: StatefulSet with replicas: 9, autoscaling (minReplicas: 3, maxReplicas: 10), headless Service (opentelemetry-backend-collector-headless), loadbalancing exporter (routing_key: "traceID", kubernetes resolver).
Issue: The mismatch between statefulset.serviceName (opentelemetry-backend-collector) and the headless Service name prevents DNS hostname registration, which breaks the Kubernetes resolver.
TLS: Certificates lacked the Service DNS name. This was addressed with cert-manager, but hostname resolution remains the critical gap.