Update Internal Collector Telemetry Docs #7035
base: main
Conversation
There are three ways to export internal Collector metrics.

1. Self-ingesting, exporting internal metrics via the
   [Prometheus exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/exporter/prometheusexporter).
AFAIK this is not using the Prometheus exporter, but rather https://github.com/open-telemetry/opentelemetry-go/tree/main/exporters/prometheus
@jmichalek132 thanks for the clarification. How does the Go Prometheus exporter differ from the one in the Collector? I was under the impression that the Collector's one was based on the Go one.
@dashpole can answer that nicely.
There are two reasons:
First, the exporters implement different interfaces: the Go SDK exporter is a `Reader`, while the Collector's exporter implements `exporter.Metrics`.
Second, the collector's exporter is designed to aggregate metrics from multiple resources/targets together, similar to how the Prometheus server's /federate endpoint works. The Go SDK exporter is designed to only handle metrics from a single resource, more like prometheus client_golang.
@dashpole so when the `exporter` is configured to use `prometheus` to export internal metrics, it's using the Go Prometheus exporter behind the scenes?
Yes, that is correct. You can link to go.opentelemetry.io/otel/exporters/prometheus
exporter:
  prometheus:
    host: '0.0.0.0'
    port: 8888
It might also be nice to provide an example of how to get the original metric names back:
https://github.com/open-telemetry/opentelemetry-collector/blob/e1f670844604a5b119d8560bc079ceca4c92bf72/CHANGELOG.md?plain=1#L347
@jmichalek132 happy to do that. Can you elaborate on what is meant by the line "Users who do not customize the Prometheus reader should not be impacted." in the changelog? Is the "Prometheus reader" the same as the "Prometheus receiver"? Do the Prometheus receiver (`receivers::prometheus`) and/or Prometheus exporter (`exporters::prometheus`) have to be configured when the internal metrics exporter is `prometheus`?
@jmichalek132 friendly reminder for clarification on above item 😁
@avillela We just added a note about this further down the page. You could link to it from here, if you want, and I think that should handle this suggestion.
https://opentelemetry.io/docs/collector/internal-telemetry/#_total-suffix
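For reference, a sketch of the sort of example requested above, assuming the pull-based Prometheus reader in the Collector's declarative telemetry configuration supports the `without_units` and `without_type_suffix` options (not verified against the current schema):

```yaml
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: '0.0.0.0'
                port: 8888
                # Assumed options: drop the unit and type (_total) suffixes
                # to restore the original metric names.
                without_units: true
                without_type_suffix: true
```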
- [Traces](#configure-internal-traces)

{{% alert title="Who monitors the monitor?" color="info" %}} Internal Collector
metrics can be exported directly to a backend for analysis, or to the Collector
I agree that a collector shouldn't self-monitor its own telemetry, but I would be wary of suggesting the telemetry should be sent directly to a backend.
For example, imagine a common agent and gateway pattern on Kubernetes. We could have hundreds or thousands of node agents batching and shipping application telemetry to a gateway layer that could be comprised of only a few instances.
If someone tried to make connections from thousands of node agents to a vendor's backend for otelcol telemetry, there could be a lot of scaling issues. The internal telemetry would also never have a chance to be enriched with Kubernetes metadata.
I've been thinking of using a dedicated internal telemetry gateway of otelcol instances for this purpose, so every otelcol instance, regardless of whether it's a node agent, a gateway, or a load-balancing exporter layer, would send to the same collector instances dedicated to otelcol telemetry. I'm not sure what to call this pattern, but maybe we can suggest it here?
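A rough sketch of what that pattern could look like on each Collector instance, assuming a dedicated internal-telemetry gateway reachable at the hypothetical address `otelcol-telemetry-gateway:4318`, with internal metrics pushed over OTLP/HTTP:

```yaml
service:
  telemetry:
    metrics:
      readers:
        - periodic:
            exporter:
              otlp:
                protocol: http/protobuf
                # Hypothetical hostname of the dedicated internal-telemetry gateway.
                endpoint: https://otelcol-telemetry-gateway:4318
```

The same endpoint would be used by node agents, gateways, and any load-balancing exporter layer, so all otelcol telemetry lands on the collectors dedicated to it.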
@kallangerard I like that idea! I'll make the revisions.
On the same note, we just use Prometheus to scrape otel collector metrics directly; it might be worth calling that out as an option.
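For illustration, a minimal Prometheus scrape configuration for that approach, assuming the Collector exposes its internal metrics on port 8888 (the target hostname is hypothetical):

```yaml
scrape_configs:
  - job_name: 'otel-collector-internal'
    static_configs:
      # Point this at the Collector's internal Prometheus metrics endpoint.
      - targets: ['otel-collector:8888']
```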
@jmichalek132 sorry... I'm confused. Isn't that what setting the `exporter` to `prometheus` does in the first config?
For reference, we used to have a similar warning on this page, but it was removed when the self-monitoring section was removed. I'll leave it up to @codeboten if we want to add the warning again.
data), its internal telemetry won't be sent to its intended destination. This
makes it difficult to detect problems with the Collector itself. Likewise, if
the Collector generates additional telemetry related to the above issue, such
error logs, and those logs are sent into the same collector, it can create a
Do we need an example of excluding otelcol's own logs from log tailing? I know Splunk's otelcol Helm chart had some examples of this. I believe it was using custom exclude annotations, if I recall correctly.
@kallangerard can you point me in the direction of this documentation?
I just had a look and I was wrong: Splunk used path filtering in the filelog receiver to exclude their self-logs, while OTLP logs are filtered out with an exclude annotation on the pod. They seem to be excluding the filelog capability entirely in their latest examples, though.
I think if someone is using the filelog receiver, they're likely to have already come across this issue and are handling it in their own way.
For internal telemetry logs sent to an OTLP endpoint, I don't think there's any way to do it safely by sending to the Collector's own OTLP receiver endpoint. I'm not 100% sure on a safe alternative though. I believe there's some internal rate limiting in otelcol's internal logs, but I haven't tested it with a self-exporting broken logging pipeline. I've been scraping otelcol logs with Datadog for a while and haven't seen any runaway log volumes, but I guess that's not self-consumption. 🤷😅
Outside of this PR I'll try and write up some examples of a dedicated internal telemetry collector.
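As a rough illustration of the path-filtering approach mentioned above (the glob patterns are hypothetical, not taken from the Splunk chart):

```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    exclude:
      # Hypothetical pattern: skip the Collector's own pod logs
      # to avoid ingesting its self-logs.
      - /var/log/pods/*otel-collector*/*/*.log
```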
- pull:
    exporter:
      prometheus:
        host: '0.0.0.0'
Is `0.0.0.0` right here?

- I'm not sure if this will work for IPv6-only stacks.
- The default behaviour has been changed in endpoints and such from `0.0.0.0` to `localhost`. Should we just be using `localhost` if we are intending to expose for self-scraping, or be more explicit for public interfaces?

Probably not something we need to tackle here, but I'd love a chime-in from anyone with better container/Linux networking knowledge than me.
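For comparison, a sketch of the same reader bound to `localhost`, which would only be appropriate if the scraper runs on the same host or network namespace:

```yaml
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                # Binding to localhost limits exposure to the local host
                # or network namespace.
                host: 'localhost'
                port: 8888
```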
@kallangerard this was in the original docs, so I can't speak for it (and haven't used it). Hope someone else will chime in with more info. 😁
@open-telemetry/collector-approvers, PTAL. Thanks!
Thank you for the PR. I have a few questions about the new section.
{{% alert title="Internal telemetry configuration changes" %}} | ||
There are three ways to export internal Collector metrics. | ||
|
||
1. Self-ingesting, exporting internal metrics via the |
What does "self-ingesting" mean here? It doesn't look like this section has the Collector ingest its own metrics.
```

2. Self-ingesting and exporting, scraping metrics via the Collector's own
   [Prometheus Receiver](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver).
I'm not sure it makes sense to have a section about self-ingesting Prometheus-exported metrics, for two reasons:

- We used to have a section about self-ingesting OTLP-exported metrics in the past, but it was removed because we want to discourage users from doing self-ingestion, as it can introduce reliability and data-loss issues (if the Collector is unhealthy, who's going to export the metrics showing that it is?). See "remove suggestion to process internal telemetry through collector" (#5749) for precedent.
- If users are going to be doing self-ingestion anyway, I think it makes more sense to do it through OTLP rather than Prometheus. This will preserve all of the telemetry and its semantics as-is, without unintended conversions (e.g. metric names with the default config) or data loss (e.g. scope attributes until recently) that could occur due to the Prometheus exporter/receiver combo. Moreover, it's possible with the Core Collector distribution, not just Contrib.
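If an example of that ends up being wanted, a sketch of self-ingestion over OTLP might look like the following, assuming the Collector's own OTLP/HTTP receiver listens on `localhost:4318` (the reliability caveats from the first point still apply):

```yaml
service:
  telemetry:
    metrics:
      readers:
        - periodic:
            exporter:
              otlp:
                protocol: http/protobuf
                # Assumes the Collector's own OTLP/HTTP receiver listens here.
                endpoint: http://localhost:4318
```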
```

{{% alert title="WARNING" color="warning" %}} Although the above approach is
possible, it is not recommended, as it can introduce scaling issues.
What kind of scaling issues do you expect when exporting metrics through OTLP? Is this comment meant to be about self-ingestion? I'd rather we didn't discourage people from emitting internal telemetry with our own protocol.
@@ -161,9 +256,12 @@ service:
      exporter:
        otlp:
          protocol: http/protobuf
-         endpoint: https://backend:4318
+         endpoint: https://${OTLP_ENDPOINT}
Nitpick: People often mix up OTLP/gRPC and OTLP/HTTP, so I think keeping a reminder of the standard port would be good. Maybe add a line like "This will load the endpoint from the `OTLP_ENDPOINT` environment variable, which should look something like `backend:4318`".
I left a few suggestions, but I'll do a more thorough copy edit after the content is finalized based on the other suggestions. Thanks, @avillela!
- [Logs](#configure-internal-logs)
- [Traces](#configure-internal-traces)

{{% alert title="Who monitors the monitor?" color="info" %}} As a matter of best
{{% alert title="Who monitors the monitor?" color="info" %}} As a matter of best | |
{{% alert title="Who monitors the monitor?" %}} As a matter of best |
exports the telemetry to an OTLP backend for analysis.

When a Collector is responsible for handling its own telemetry through a traces,
metrics, or logs pipeline and encounters an issue (e.g. memory limiter blocking
- metrics, or logs pipeline and encounters an issue (e.g. memory limiter blocking
+ metrics, or logs pipeline and encounters an issue (for example, memory limiter blocking
metrics, or logs pipeline and encounters an issue (e.g. memory limiter blocking
data), its internal telemetry won't be sent to its intended destination. This
makes it difficult to detect problems with the Collector itself. Likewise, if
the Collector generates additional telemetry related to the above issue, such
- the Collector generates additional telemetry related to the above issue, such
+ the Collector generates additional telemetry related to the above issue, such as
This PR contains updates to the documentation on Internal Collector Telemetry to help clarify some of the approaches for exporting internal Collector metrics. It also includes an explanation of why self-ingesting telemetry is not advisable.