
Multiple prometheus_scrape endpoints not scraped in parallel #17659

Closed
wjordan opened this issue Jun 9, 2023 · 2 comments · Fixed by #18021
Labels
source: prometheus_scrape, type: enhancement

Comments

wjordan (Contributor) commented Jun 9, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I am using Vector to scrape metrics from a large number (thousands) of instances using the prometheus_scrape source component.

I expected the URLs set in the endpoints config to be scraped in parallel, but they appear to be scraped sequentially. This makes the source much less useful, since a single slow or unresponsive endpoint blocks every other endpoint from being scraped on time. Other Prometheus agents scrape their configured endpoints in parallel.

A workaround is to create a separate prometheus_scrape component for each endpoint, but instantiating a large number (thousands) of components causes heavy CPU load and seems to go against Vector's design. A sketch of this workaround follows below.
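
For reference, the workaround amounts to something like the following sketch; the component names and endpoint addresses here are placeholders, not part of the original report:

# One prometheus_scrape component per endpoint, so a slow endpoint only blocks
# itself. With thousands of endpoints this means thousands of components.
[sources.scrape_node_a]
type = "prometheus_scrape"
endpoints = ["http://10.0.0.1:9100/metrics"]

[sources.scrape_node_b]
type = "prometheus_scrape"
endpoints = ["http://10.0.0.2:9100/metrics"]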

Configuration

Here's a minimal configuration that reproduces the issue:

[sources.internal_src]
type = "internal_metrics"

[transforms.internal]
type = "filter"
inputs = ["internal_src"]
condition = '.name == "component_sent_bytes_total"'

[sinks.prom_export]
type = "prometheus_exporter"
inputs = ["internal"]
address = "0.0.0.0:9598"

[sources.prom_scrape]
type = "prometheus_scrape"
endpoints = [
    # Presumably unreachable address, standing in for a slow or broken endpoint.
    "http://172.0.0.0/metrics",
    # The local prometheus_exporter sink defined above.
    "http://127.0.0.1:9598/metrics"
]
endpoint_tag = "endpoint"

[sinks.console]
type = "console"
inputs = ["prom_scrape"]
encoding.codec = "text"

Version

vector 0.30.0 (x86_64-unknown-linux-gnu 38c3f0b 2023-05-22 17:38:48.655488673)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@wjordan added the type: bug label on Jun 9, 2023
@jszwedko added the source: prometheus_scrape and type: enhancement labels and removed the type: bug label on Jun 12, 2023
nullren (Contributor) commented Jul 13, 2023

Just came across this issue while trying to configure about 1500 nodes for prometheus_scrape; it's completely unusable in its current state. I'm happy to pick up the work started in #17660. Seems like there are just a couple of tasks to do?

This still needs some additional work:

  1. (blocking issue) Pending requests can start to pile up on the heap, causing unbounded memory growth. Scrape requests should implement a timeout shorter than the interval duration to prevent this.

    • Alternatively, skip future requests for the same endpoint if a previous scrape request still hasn't completed.

noticed there are a couple more issues related to this:

  2. Each request should spawn in a separate short-lived task to spread the request-processing load across many threads.

    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.

long-lived tasks might be a larger refactor, as you'd need to clean them up if the config changes and removes an endpoint.

  3. The timing of endpoint requests should be distributed across the scrape interval instead of all executing at the same time, to spread the scrape-request load more evenly.

"should", but letting tokio do its thing like @jszwedko mentioned seems more pragmatic.

I'm happy to submit a new PR; a rough sketch of the timeout-plus-concurrency approach is included below.
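
To make the approach above concrete, here is a minimal, hypothetical sketch of per-endpoint tasks plus a per-request timeout, written directly against tokio and reqwest. It is not Vector's actual implementation; the endpoint URLs, interval, and timeout values are placeholders taken from this issue.

// Cargo dependencies assumed: tokio = { version = "1", features = ["full"] },
// reqwest = "0.11", futures = "0.3".
use std::time::Duration;

use futures::future::join_all;
use tokio::time::{interval, timeout};

#[tokio::main]
async fn main() {
    // Placeholder endpoints from the repro config above.
    let endpoints = vec![
        "http://172.0.0.0/metrics".to_string(),
        "http://127.0.0.1:9598/metrics".to_string(),
    ];
    let scrape_interval = Duration::from_secs(15);
    // Bound each request with a timeout shorter than the scrape interval so a
    // hung endpoint cannot delay other endpoints or pile up pending requests.
    let scrape_timeout = Duration::from_secs(5);

    let client = reqwest::Client::new();
    let mut ticker = interval(scrape_interval);

    loop {
        ticker.tick().await;

        // Spawn one short-lived task per endpoint so all scrapes in a round
        // run concurrently across the runtime's worker threads.
        let handles: Vec<_> = endpoints
            .iter()
            .cloned()
            .map(|url| {
                let client = client.clone();
                tokio::spawn(async move {
                    match timeout(scrape_timeout, client.get(&url).send()).await {
                        Ok(Ok(resp)) => println!("{url}: {}", resp.status()),
                        Ok(Err(err)) => eprintln!("{url}: request error: {err}"),
                        Err(_) => eprintln!("{url}: timed out after {scrape_timeout:?}"),
                    }
                })
            })
            .collect();

        // Each task is bounded by scrape_timeout, so waiting for the round
        // cannot stall the next tick indefinitely.
        join_all(handles).await;
    }
}

The long-lived-task alternative mentioned above would instead move the loop inside each spawned task, at the cost of having to shut those tasks down when a config reload removes an endpoint.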

wjordan (Contributor, Author) commented Jul 20, 2023

Thanks for picking this up!

github-merge-queue bot pushed a commit that referenced this issue Jul 24, 2023
…timeouts (#18021)


fixes #14087 
fixes #14132 
fixes #17659

- [x] make target timeout configurable

this builds on what @wjordan did in #17660

### what's changed
- prometheus scrapes happen concurrently
- requests to targets can time out
- the timeout can be configured (user-facing change; see the example below)
- small change in how the HTTP client is instantiated
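
As a usage illustration, and assuming the new option is named scrape_timeout_secs (check the prometheus_scrape reference docs for the exact name and default), the source from the original report could bound each request like this:

[sources.prom_scrape]
type = "prometheus_scrape"
endpoints = [
    "http://172.0.0.0/metrics",
    "http://127.0.0.1:9598/metrics"
]
endpoint_tag = "endpoint"
scrape_interval_secs = 15  # how often each endpoint is scraped
scrape_timeout_secs = 5    # assumed option name: per-request timeout, shorter than the interval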

---------

Co-authored-by: Doug Smith <dsmith3197@users.noreply.github.com>
Co-authored-by: Stephen Wakely <stephen@lisp.space>