Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTEL Telemetry Exporter isn't setting service.name #7791

Closed
knechtionscoding opened this issue Feb 3, 2023 · 12 comments
Closed

OTEL Telemetry Exporter isn't setting service.name #7791

knechtionscoding opened this issue Feb 3, 2023 · 12 comments
Assignees
Labels
Committed: 1.17 Good First Issue Good issue for newbies Prioritized Indicating issue prioritized to be worked on in RFE stream release/1.17 Size: S 1 - 3 days Type: Bug Something isn't working zendesk

Comments

@knechtionscoding
Copy link

knechtionscoding commented Feb 3, 2023

Gloo Edge Version

1.13.x (latest stable)

Kubernetes Version

1.23.x

Describe the bug

When using the opentelemetry configuration with gloo the resource attribute isn't having the service.name set. It's currently being set as unknown

Steps to reproduce the bug

Follow the OpenTelemetry Tracing guide: https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/otel/

And in Step 8 you will see the following in the logs:

2024-03-13T13:48:59.214Z	info	TracesExporter	{"kind": "exporter", "data_type": "traces", "name": "logging", "#spans": 1}
2024-03-13T13:48:59.214Z	info	ResourceSpans #0
Resource SchemaURL: 
Resource attributes:
     -> service.name: Str(unknown_service:envoy)
 ...

Relevant configuration:

  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef:
              name: <otel-collector>
              namespace: <namespace>

Expected Behavior

Spans and traces should have the resource attribute of service.name properly.

Additional Context

No response

┆Issue is synchronized with this Asana task by Unito

@knechtionscoding knechtionscoding added the Type: Bug Something isn't working label Feb 3, 2023
@SantoDE
Copy link
Contributor

SantoDE commented Jul 27, 2023

Potentially related to #8494

@nfuden nfuden added Good First Issue Good issue for newbies Size: S 1 - 3 days labels Feb 23, 2024
@sam-heilbron
Copy link
Contributor

Perhaps I'm over-simplifying this, so please correct me if I'm wrong.

Per https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/trace/v3/opentelemetry.proto.html:

The name for the service. This will be populated in the ResourceSpan Resource attributes. If it is not provided, it will default to “unknown_service:envoy”.

We should be populating the service.name field with the name of the gateway-proxy service that is being run, correct?

Should this be the name of the service or the name of the gateway that this is defined on?

@anessi
Copy link

anessi commented Feb 29, 2024

I'd say the gateway name fits well for this value as it's defined in the Gateway resource. In case sharing is used (https://docs.solo.io/gloo-edge/main/introduction/architecture/deployment_arch/#sharded-api-gateway), the value would be different for each Gateway resources which makes sense IMO.

@DuncanDoyle
Copy link
Contributor

@sam-heilbron Feedback from another customer is that setting this to the gateway name also fits their use-case and requirements. So I agree with @anessi, let's set this to the gateway name, keeping in mind that we need to support a shared environment with multiple gateways.

@htpvu htpvu added the Prioritized Indicating issue prioritized to be worked on in RFE stream label Mar 6, 2024
@sheidkamp
Copy link
Contributor

I want to confirm the expected behavior with a specific example, using the example from https://docs.solo.io/gloo-edge/latest/guides/observability/tracing/otel/

This is the gateway config:

apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  labels:
    app: gloo
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  httpGateway:
    options:
      httpConnectionManagerSettings:
        tracing:
          openTelemetryConfig:
            collectorUpstreamRef:
              namespace: "gloo-system"
              name: "opentelemetry-collector"

Where right now we are displaying

Resource attributes:
     -> service.name: Str(unknown_service:envoy)

We would instead display:

Resource attributes:
     -> service.name: Str(gateway-proxy)

Should namespace be included in any way?

@sheidkamp
Copy link
Contributor

I have a solution that depends upon the gateway populating some "parent" metadata with the gateway name.

This happens in the "classic" gloo gateway, and I suspect it is not happening in ggv2 (investigation in progress). I also think that since no one is on ggv2 yet, there's value in this approach, and we can add the translation to ggv2 at the appropriate time (if its not currently implemented).

I expect to have a PR for this approach ready Monday.

Also some questions for how to handle some edge cases (which shouldn't occur, but we should protect against).
The edge cases:

  1. We are not able to find the gateway name because of how the how the metadata is formatted
    a. The deprecated, opaque data format is used (deprecated in v11.0)
    b. No gateway metadata is passed
  2. Multiple gateways are defined

These edge cases wouldn't be expected to be a matter of bad user configuration (at least for classic gateway), and would be more of a matter of bad translation/application logic.

The two main approaches to handling this would be:

  • Stop translation of the plugin and return an error. This
  • Log an error/warning and use either an empty string (which would result in the current unknown_service:envoy) or a string reflecting the data situation ("deprecated_metadata", "gateway_not_defined", "multiple_gateways_defined" or "gateway_1, gateway_2...")

Any opinions on which approach to use?

@anessi
Copy link

anessi commented Mar 15, 2024

IMO a namespace is not needed as this is commonly added as generic attributes to the traces (see resource.attributes.k8s@namespace@name in the example below).

See example-traces.json for two example traces, one from rate-limit and one from gateway-proxy.

If a trace can't be clearly related to a gateway, then a more sensible values like the type of service should be used. Stop translating is not an option IMO. It should be possible to always put a sensible value there (hopefully :) ).
Maybe rate-limiting is a good example if multiple gateways are using the same rate limit server, I think it would be perfectly fine to have a value like rate-limit in the serviceName field. Basically the field is there to understand where the trace comes from.

@sheidkamp
Copy link
Contributor

@anessi , thanks for the feedback. I agree we shouldn't stop translating on these edge cases.

However, in most of these cases I don't know that there will be a way to find a more sensible value. At the point in translation where we create the envoy Open Telemetry Configuration, we have already translated Gateways and VirtualServices to Proxys and VirtualHosts, and we don't have access to the runtime information/span definitions. The internal OpenTelemetry configuration only refers to the upstream ref.

These edge cases can't be reached by updating configuration, and would only be encountered if changes to the gateway (or a new gateway) were introduced. If I wanted to create them in a running cluster, I would start by intentionally writing some bad gateway translation code, I don't think I could do it with just yaml and out-of-the-box gloo. So if the metadata for the gateway is wrong, all of our metadata is likely wrong (as in case 1a/1b). The "multiple gateways" use case would also not occur when multiple gateways refer to the same upstream collector. Current translation would create separate Otel Configurations that would each refer to a single gateway.

This might be something easier discussed in a quick call/huddle with interested parties.

@anessi
Copy link

anessi commented Mar 18, 2024

@sheidkamp : thanks for the details. I don't know the internals, so if it's not possible to set any meaningful value, setting a string as you proposed above and logging additional details is the best that can be done I guess. If those are edge cases, it's maybe not such a big deal anyway.

@sheidkamp
Copy link
Contributor

sheidkamp commented Mar 22, 2024

We're going to update the API by adding serviceNameSource to the openTelemetryConfig that will accept gatewayName: {} as a value, which will be used as the default if omitted.

  openTelemetryConfig:
    collectorUpstreamRef:
      name: opentelemetry-collector
      namespace: gloo-system
    serviceNameSource:
      gatewayName: {}

Since there was only one option, we decided to wait to introduce the API until there was a second option to choose from.

@sheidkamp
Copy link
Contributor

Merged into 1.17/main

@soloio-bot
Copy link

Zendesk ticket #3478 has been linked to this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Committed: 1.17 Good First Issue Good issue for newbies Prioritized Indicating issue prioritized to be worked on in RFE stream release/1.17 Size: S 1 - 3 days Type: Bug Something isn't working zendesk
Projects
None yet
Development

No branches or pull requests

10 participants