[RFC] Host metrics #664
Comments
CC @protochron and @stevelr who have expressed interest in this RFC before |
This set of metrics looks like a nice, simple start to me that we can build on over time. A few questions:
|
There's a force driving wasmCloud to work out-of-the-box with best-of-breed cloud products like Prometheus. Understandable, and many folks interested in wasmCloud are in the cloud space, well aware of how complex and heavy-weight this has gotten. So, wasmCloud edge computing + Prometheus support is logical, lightweight, refreshing. And so too for the next OOTB integration with CloudProductXYZ. For those not in the cloud space the feeling is different. Like wasmCloud is collecting bells & whistles. I know that this Prometheus integration is probably going to be a separate optional component, a crate or something. How that fits exactly should be part of the RFC. What is the "minimum profile" wasmCloud deployment going to be? What is the default deployment that's documented best? How many moving infra parts does it include? Imho wasmCloud would be strategically best served to leave all paths open to App Developers: Start minimal, add just what you need. Cognitive overhead is growing right now, where I increasingly perceive a cloud-product-done-differently rather than a new Paradigm of Edge Computing. So many of these components being in a monorepo doesn't help alleviate that. There are many features & benefits to Wasm as it's used. What is the biggest USP of Wasm/WASI? Isn't it (contract-based) component-oriented development? Having this box of lego blocks that allows me to compose my solution with all/most infra taken care of to just add my business logic. Emphasize that USP. Right now the monorepo doesn't convey that vibe, tbh. The "minimum core" of wasmCloud supports OpenTelemetry, a de-facto open standard. That would be my selection criterion for the product. And secondarily, the already supported components / building blocks to connect to that. "Prometheus? 🤔 Hmm, yea possibly. This helps get me going faster". Or not, and choose something else. |
@aschrijver Thank you for sharing a perspective from outside the cloud space 👍 @pgray noted in his proposal that support for prometheus will be accomplished by serving a @protochron, @pgray, and others can speak more clearly to the motivations for adding metrics, but I can say as someone who at times helps with operating/managing many wasmCloud hosts, I have often wished for access to more high-level host data. I can find the answers to some questions today by searching through logs or looking for error traces, but the 3rd pillar of observability (metrics) is missing from wasmCloud today. (Re: monorepo, I would like to hear more about your concerns, but that isn't part of this RFC, so let's pick that up in a different issue) |
I think it might be better to think of prometheus => opentelemetry pretty much everywhere we're talking about it in the RFC. We should update it to make it explicit, but I believe the implementation should rely on the OpenTelemetry support we already have built in to the host to collect metrics. Exposing them is a slightly different story, since I keep going back and forth whether or not we should just export them to an opentelemetry collector like we do now with traces or if we should expose the In terms of wanting metrics at all, like @connorsmith256 said as someone who is operating a fleet of wasmcloud hosts in a cloud context, I absolutely want and need more instrumentation than we have now. For example, one of the things that having metrics on invocation counts and response times from actors tells you is whether or not you need to add additional actors to satisfy incoming requests. Without that data it is extremely difficult to determine whether or not your application is saturated. @connorsmith256 re: metric types, actor invocation counts would need to be grouped by actor ID, and probably by link name as well. We could keep track of CPU/memory, but IMHO those are better tracked by other methods or exporters. It would be interesting to see stats around the wasmcloud caches though!
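To make the grouping by actor ID and link name concrete, here is a minimal Python sketch using only the standard library. The class name `InvocationCounter`, its methods, and the sample actor IDs are hypothetical illustrations, not part of the wasmCloud host or its OpenTelemetry integration.

```python
from collections import Counter


class InvocationCounter:
    """Counts invocations per (actor_id, link_name) label pair (hypothetical)."""

    def __init__(self):
        self._counts = Counter()

    def record(self, actor_id, link_name="default"):
        # Each distinct label pair becomes its own series.
        self._counts[(actor_id, link_name)] += 1

    def get(self, actor_id, link_name="default"):
        # Counter returns 0 for label pairs that were never recorded.
        return self._counts[(actor_id, link_name)]

    def by_actor(self, actor_id):
        # Sum across all link names for one actor.
        return sum(n for (aid, _), n in self._counts.items() if aid == actor_id)


counter = InvocationCounter()
counter.record("MACTOR1", "default")
counter.record("MACTOR1", "backup")
counter.record("MACTOR1", "default")
counter.record("MACTOR2")
```

Keeping the link name as a separate label lets an operator see both the per-link breakdown and the per-actor total, which is the figure that matters when deciding whether to scale an actor.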
|
@pgray and @connorsmith256 thanks for writing this! I'm looking forward to using host metrics. @protochron where you wrote
Provider metrics:
|
I'm slightly leaning toward the idea of host exporting metrics over OTEL, even for use cases when the consumer needs a Prometheus endpoint.
Perhaps a built-in OTEL metrics capability provider would improve performance and simplify the developer's experience. It could leverage the host's ability to batch/aggregate metrics and report on fixed intervals. If an actor has to report metrics with OTEL over HTTP, it would require a synchronous network call for each method invocation. |
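The batching idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the host's implementation: `BatchingAggregator`, the `export` callback, and the 10-second interval are all hypothetical, and the interval is checked on each increment rather than by a background timer to keep the example small.

```python
import time


class BatchingAggregator:
    """Accumulates counter increments and exports them in batches (hypothetical)."""

    def __init__(self, export, interval_s=10.0, clock=time.monotonic):
        self._export = export        # callable that receives a dict of totals
        self._interval_s = interval_s
        self._clock = clock          # injectable so the example is deterministic
        self._pending = {}
        self._last_flush = clock()

    def increment(self, name, amount=1):
        # Recording is a local dict update; no network call per invocation.
        self._pending[name] = self._pending.get(name, 0) + amount
        if self._clock() - self._last_flush >= self._interval_s:
            self.flush()

    def flush(self):
        # Export one aggregated batch, then start a new interval.
        if self._pending:
            self._export(dict(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()


# Usage with a fake clock: four invocations result in a single export.
now = [0.0]
batches = []
agg = BatchingAggregator(batches.append, interval_s=10.0, clock=lambda: now[0])
for _ in range(3):
    agg.increment("actor_invocations")
now[0] = 10.0
agg.increment("actor_invocations")
```

The point of the design is that the cost of a network round trip is paid once per interval instead of once per method invocation.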
When wasmcloud host shuts down, or needs to restart to upgrade, it should do a last push of metrics to the otel collector before the host process exits. If the host only had prometheus scraping, we'd always lose data collected during the last incomplete scraping interval. |
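A minimal Python sketch of the final-push behaviour described above, assuming a push-style exporter. The `PushExporter` class and its `push` callback are hypothetical stand-ins for an exporter and an OTEL collector; they are not a real OpenTelemetry API.

```python
class PushExporter:
    """Buffers metric deltas and pushes them to a collector callback (hypothetical)."""

    def __init__(self, push):
        self._push = push
        self._buffer = {}
        self._closed = False

    def record(self, name, amount=1):
        if self._closed:
            raise RuntimeError("exporter is shut down")
        self._buffer[name] = self._buffer.get(name, 0) + amount

    def flush(self):
        # Push whatever has accumulated since the previous push.
        if self._buffer:
            self._push(dict(self._buffer))
            self._buffer.clear()

    def shutdown(self):
        # Final push so the last partial interval is not lost.
        if not self._closed:
            self.flush()
            self._closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.shutdown()
        return False


received = []
with PushExporter(received.append) as exporter:
    exporter.record("actor_invocations", 5)
    exporter.flush()                          # a regular interval push
    exporter.record("actor_invocations", 2)   # arrives just before shutdown
```

With a scrape-only model, the two invocations recorded after the last scrape would never be observed; here they arrive as a second batch when the exporter shuts down.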
…-4.3.12 build(deps): bump clap from 4.3.0 to 4.3.12
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this has been closed too eagerly, please feel free to tag a maintainer so we can keep working on the issue. Thank you for contributing to wasmCloud! |
I assume it is not stale, bot. |
Whoops, I thought the stale label had been removed 🙂 |
I closed my draft PR for the host metrics work but I'll leave the branch around in case anyone would like to reference it when they get to the work. cheers |
Hey @joonas if you're open to turning this into an ADR after your PR merges, we could assign this to you? If not, I'd be happy to try and write the ADR here to get the ticket closed. |
@vados-cosmonic You're welcome to take over the ADR. I think my availability over the next month will not be conducive to driving something for that before 1.0, if that's when we'd like the ADR to be included. |
Said PR is merged! So ADR is good to go 😄 |
RFC: Host metrics
Introduction
wasmcloud host metrics implemented using a prometheus /metrics endpoint on a configurable port
Motivation
In order to provide users insight into their running workloads, wasmcloud needs to expose metrics that reveal the traffic actors and providers have handled over a period of time.
Detailed Design
Hosts already provide basic metrics in health payloads, such as the number of actors/providers/etc. running/configured.
In order to provide insight into running workloads, a wasmcloud metrics implementation should provide the following metrics:
The main wasmcloud runtime process should instantiate a prometheus registry and register/deregister actors/providers as they are scheduled on the host.
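As a rough illustration of the register/deregister lifecycle, the following Python sketch keeps one counter per scheduled actor and renders it in the Prometheus text exposition format that a /metrics endpoint would serve. Everything named here is an assumption made for the example: the `HostRegistry` class, the metric name `wasmcloud_actor_invocations_total`, and the `host_id`/`actor_id` labels are not defined by this RFC, and label-value escaping is omitted.

```python
class HostRegistry:
    """Tracks per-actor invocation counters for one host (hypothetical)."""

    def __init__(self, host_id):
        self._host_id = host_id
        self._invocations = {}  # actor_id -> invocation count

    def register(self, actor_id):
        # Called when an actor is scheduled on the host.
        self._invocations.setdefault(actor_id, 0)

    def deregister(self, actor_id):
        # Called when an actor is removed; its series stops being exported.
        self._invocations.pop(actor_id, None)

    def record_invocation(self, actor_id):
        # Invocations for unregistered actors are ignored.
        if actor_id in self._invocations:
            self._invocations[actor_id] += 1

    def render(self):
        # Prometheus text format: a TYPE line, then one sample per series.
        lines = ["# TYPE wasmcloud_actor_invocations_total counter"]
        for actor_id in sorted(self._invocations):
            lines.append(
                'wasmcloud_actor_invocations_total{host_id="%s",actor_id="%s"} %d'
                % (self._host_id, actor_id, self._invocations[actor_id])
            )
        return "\n".join(lines) + "\n"


registry = HostRegistry("NHOST")
registry.register("MACTOR1")
registry.register("MACTOR2")
registry.record_invocation("MACTOR1")
registry.record_invocation("MACTOR1")
registry.deregister("MACTOR2")
output = registry.render()
```

A real host would more likely rely on an existing metrics library than hand-render the format; the sketch only shows how scheduling events map onto the set of exported series.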
tags
wasmcloud should tag metrics with basic identifying tags:
Backwards Compatibility
This functionality would be net-new, so no backwards-incompatible changes are planned.
Additionally, this RFC only seeks to implement metrics based on what a host can observe from its runtime behavior; we leave provider-specific metrics for a later RFC.
Alternatives Considered
A nats publish topic capable of deriving host/provider/actor metrics would provide cluster operators a simplified deployment model for getting metrics from their wasmcloud deployments. Utilizing nats' queueing capabilities, an agent could be designed to support many copies running and sharing work. Depending on the cluster topology and tenancy, a nats metric agent might require elevated privileges in order to collect metrics from all hosts. Wasmcloud hosts would need to publish metrics to a shared topic in nats in order to avoid the nats metric agent inspecting all traffic flowing on wasmcloud topics.
Unresolved Questions
Conclusion
Adding prometheus metrics to wasmcloud hosts would provide operators with a familiar and easy-to-use method for collecting metrics on their wasmcloud workloads. I hope this RFC defines enough to kick off the discussion.