
Multiple prometheus_scrape endpoints not scraped in parallel #17659

Closed
wjordan opened this issue Jun 9, 2023 · 2 comments · Fixed by #18021
Labels
source: prometheus_scrape, type: enhancement

Comments

wjordan (Contributor) commented Jun 9, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I am using Vector to scrape metrics from a large number (thousands) of instances using the prometheus_scrape source component.

I expected the URLs set in the endpoints config to be scraped in parallel, but they appear to be scraped sequentially. This makes the source much less useful, since a single slow or unresponsive endpoint blocks every other endpoint from being scraped on time. Other Prometheus agents scrape their configured endpoints in parallel.

A workaround is to create a separate prometheus_scrape component for each endpoint, but instantiating a large number (thousands) of components causes heavy CPU load and seems to go against Vector's design. A sketch of this workaround follows below.
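
For reference, the workaround amounts to something like the following sketch; the component names and endpoint addresses here are placeholders, not part of the original report:

# One prometheus_scrape component per endpoint, so a slow endpoint only blocks
# itself. With thousands of endpoints this means thousands of components.
[sources.scrape_node_a]
type = "prometheus_scrape"
endpoints = ["http://10.0.0.1:9100/metrics"]

[sources.scrape_node_b]
type = "prometheus_scrape"
endpoints = ["http://10.0.0.2:9100/metrics"]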

Configuration

Here's a minimal configuration that reproduces the issue:

[sources.internal_src]
type = "internal_metrics"

[transforms.internal]
type = "filter"
inputs = ["internal_src"]
condition = '.name == "component_sent_bytes_total"'

[sinks.prom_export]
type = "prometheus_exporter"
inputs = ["internal"]
address = "0.0.0.0:9598"

[sources.prom_scrape]
type = "prometheus_scrape"
endpoints = [
    # Presumably unreachable address, standing in for a slow or broken endpoint.
    "http://172.0.0.0/metrics",
    # The local prometheus_exporter sink defined above.
    "http://127.0.0.1:9598/metrics"
]
endpoint_tag = "endpoint"

[sinks.console]
type = "console"
inputs = ["prom_scrape"]
encoding.codec = "text"

Version

vector 0.30.0 (x86_64-unknown-linux-gnu 38c3f0b 2023-05-22 17:38:48.655488673)

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@wjordan added the type: bug label on Jun 9, 2023
@jszwedko added the source: prometheus_scrape and type: enhancement labels and removed the type: bug label on Jun 12, 2023
nullren (Contributor) commented Jul 13, 2023

Just came across this issue while trying to configure about 1500 nodes for prometheus_scrape; it's completely unusable in its current state. I'm happy to pick up the work started in #17660. Seems like there are just a couple of tasks to do?

This still needs some additional work:

  1. (blocking issue) Pending requests can start to pile up on the heap, causing unbounded memory growth. Scrape requests should implement a timeout shorter than the interval duration to prevent this.

    • Alternatively, skip future requests for the same endpoint if a previous scrape request still hasn't completed.

noticed there are a couple more issues related to this:

  2. Each request should spawn in a separate short-lived task to spread the request-processing load across many threads.

    • Alternatively, each endpoint could spawn and reuse a single long-lived task, which could be more efficient.

long-lived tasks might be a larger refactor, as you'd need to clean them up if the config changes and removes an endpoint.

  3. The timing of endpoint requests should be distributed across the scrape interval instead of all executing at the same time, to spread the scrape-request load more evenly.

"should", but letting tokio do its thing like @jszwedko mentioned seems more pragmatic.

I'm happy to submit a new PR; a rough sketch of the timeout-plus-concurrency approach is included below.
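
To make the approach above concrete, here is a minimal, hypothetical sketch of per-endpoint tasks plus a per-request timeout, written directly against tokio and reqwest. It is not Vector's actual implementation; the endpoint URLs, interval, and timeout values are placeholders taken from this issue.

// Cargo dependencies assumed: tokio = { version = "1", features = ["full"] },
// reqwest = "0.11", futures = "0.3".
use std::time::Duration;

use futures::future::join_all;
use tokio::time::{interval, timeout};

#[tokio::main]
async fn main() {
    // Placeholder endpoints from the repro config above.
    let endpoints = vec![
        "http://172.0.0.0/metrics".to_string(),
        "http://127.0.0.1:9598/metrics".to_string(),
    ];
    let scrape_interval = Duration::from_secs(15);
    // Bound each request with a timeout shorter than the scrape interval so a
    // hung endpoint cannot delay other endpoints or pile up pending requests.
    let scrape_timeout = Duration::from_secs(5);

    let client = reqwest::Client::new();
    let mut ticker = interval(scrape_interval);

    loop {
        ticker.tick().await;

        // Spawn one short-lived task per endpoint so all scrapes in a round
        // run concurrently across the runtime's worker threads.
        let handles: Vec<_> = endpoints
            .iter()
            .cloned()
            .map(|url| {
                let client = client.clone();
                tokio::spawn(async move {
                    match timeout(scrape_timeout, client.get(&url).send()).await {
                        Ok(Ok(resp)) => println!("{url}: {}", resp.status()),
                        Ok(Err(err)) => eprintln!("{url}: request error: {err}"),
                        Err(_) => eprintln!("{url}: timed out after {scrape_timeout:?}"),
                    }
                })
            })
            .collect();

        // Each task is bounded by scrape_timeout, so waiting for the round
        // cannot stall the next tick indefinitely.
        join_all(handles).await;
    }
}

The long-lived-task alternative mentioned above would instead move the loop inside each spawned task, at the cost of having to shut those tasks down when a config reload removes an endpoint.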

wjordan (Contributor, Author) commented Jul 20, 2023

Thanks for picking this up!

github-merge-queue bot pushed a commit that referenced this issue Jul 24, 2023
…timeouts (#18021)


fixes #14087 
fixes #14132 
fixes #17659

- [x] make target timeout configurable

this builds on what @wjordan did in #17660

### what's changed
- prometheus scrapes happen concurrently
- requests to targets can time out
- the timeout can be configured (user-facing change; see the example below)
- small change in how the HTTP client is instantiated
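
As a usage illustration, and assuming the new option is named scrape_timeout_secs (check the prometheus_scrape reference docs for the exact name and default), the source from the original report could bound each request like this:

[sources.prom_scrape]
type = "prometheus_scrape"
endpoints = [
    "http://172.0.0.0/metrics",
    "http://127.0.0.1:9598/metrics"
]
endpoint_tag = "endpoint"
scrape_interval_secs = 15  # how often each endpoint is scraped
scrape_timeout_secs = 5    # assumed option name: per-request timeout, shorter than the interval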

---------

Co-authored-by: Doug Smith <dsmith3197@users.noreply.github.com>
Co-authored-by: Stephen Wakely <stephen@lisp.space>