
[RFC] Host metrics #664

Closed
pgray opened this issue Sep 19, 2023 · 15 comments · Fixed by #1531
Labels
RFC rfc-proposed This issue is a newly proposed RFC and is awaiting comments, questions, and approval

Comments

@pgray
Contributor

pgray commented Sep 19, 2023

RFC: Host metrics

Introduction

wasmcloud host metrics, implemented using a prometheus /metrics endpoint served on a configurable port

Motivation

In order to provide users insight into their running workloads, wasmcloud needs to expose metrics that reveal the traffic actors and providers have handled over a period of time.

Detailed Design

Hosts already provide basic metrics in health payloads, such as the number of actors and providers running or configured.

In order to provide insight into running workloads, a wasmcloud metrics implementation should provide the following metrics:

  • host uptime (seconds)
  • actors
    • invocation count
    • error count
    • latency (actor execution time)
  • providers
    • restart count
    • provider running seconds (process uptime)

The main wasmcloud runtime process should instantiate a prometheus registry and register/deregister actors/providers as they are scheduled on the host.
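
For illustration, here is a minimal sketch (Rust, using the prometheus crate) of the registry ownership and registration described above; the metric and label names are placeholders rather than a final schema:

use prometheus::{Encoder, IntCounterVec, Opts, Registry, TextEncoder};

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // One registry owned by the host process; metrics for actors/providers are
    // registered/deregistered as they are scheduled on the host.
    let registry = Registry::new();

    // Hypothetical counter; names and labels are up for discussion in this RFC.
    let invocations = IntCounterVec::new(
        Opts::new("wasmcloud_actor_invocations", "Number of actor invocations handled"),
        &["host_id", "lattice_id", "oci_ref", "version"],
    )?;
    registry.register(Box::new(invocations.clone()))?;

    // Record an invocation (label values here are placeholders).
    invocations
        .with_label_values(&["N...", "default", "azurecr.io/wasmcloud/kv", "v1.2.3"])
        .inc();

    // This is what a /metrics handler would serve: the Prometheus text exposition format.
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buffer)?;
    println!("{}", String::from_utf8(buffer)?);
    Ok(())
}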

tags

wasmcloud should tag metrics with basic identifying tags:

  • host id
  • lattice id
  • oci ref (e.g. azurecr.io/wasmcloud/kv)
  • version (e.g. v1.2.3)
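
For example, a single exposed metric carrying these tags might look like the following line in the Prometheus text format (metric name and values are illustrative):

wasmcloud_actor_invocations{host_id="N...",lattice_id="default",oci_ref="azurecr.io/wasmcloud/kv",version="v1.2.3"} 42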

Backwards Compatibility

This functionality would be net-new, so no backwards-incompatible changes are planned.

Additionally, this RFC only seeks to implement metrics based on what a host can observe from its own runtime behavior; we leave provider-specific metrics for a later RFC.

Alternatives Considered

A nats publish topic capable of deriving host/provider/actor metrics would provide cluster operators with a simplified deployment model for getting metrics from their wasmcloud deployments. Utilizing nats' queueing capabilities, an agent could be designed so that many copies run and share the work. Depending on the cluster topology and tenancy, a nats metric agent might require elevated privileges in order to collect metrics from all hosts. Wasmcloud hosts would need to publish metrics to a shared topic in nats in order to avoid the nats metric agent inspecting all traffic flowing on wasmcloud topics.
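
As a rough sketch of this alternative (Rust, using the async-nats crate; the subject layout is hypothetical and nothing like it exists in wasmcloud today), a host could publish serialized metrics to a lattice-scoped subject, and agents could share the load via a queue group:

use async_nats::Client;

// Hypothetical subject layout; agents could use client.queue_subscribe(...)
// on the same subject so that many copies share the work.
async fn publish_host_metrics(
    client: &Client,
    lattice_id: &str,
    host_id: &str,
    payload: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    let subject = format!("wasmbus.metrics.{lattice_id}.{host_id}");
    client.publish(subject, payload.into()).await?;
    Ok(())
}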

Unresolved Questions

  • how should we implement per-provider metrics? (e.g. connection count, latency, etc)
  • should we add user configurable tags?
  • what other metrics should exist by default?

Conclusion

Adding prometheus metrics to wasmcloud hosts would provide operators with a familiar and easy-to-use method for collecting metrics on their wasmcloud workloads. I hope this RFC defines enough to kick off the discussion.

@pgray pgray added RFC rfc-proposed This issue is a newly proposed RFC and is awaiting comments, questions, and approval labels Sep 19, 2023
@connorsmith256
Contributor

CC @protochron and @stevelr who have expressed interest in this RFC before

@connorsmith256
Contributor

This set of metrics looks like a nice, simple start to me that we can build on over time. A few questions:

  • for actor invocation/error count and latency, this should be split out by actor (ID?) right? Potentially also by link definition/call alias?
  • what's the motivation for provider restart count? Not sure I understand that one
  • are there any other system-level metrics that would be useful, such as CPU/memory, bytes transferred, bytes cached on disk, etc.?

@aschrijver

aschrijver commented Sep 23, 2023

There's a force driving wasmCloud to work out-of-the-box with best-of-breed cloud products like Prometheus. Understandable, and many folks interested in wasmCloud are in the cloud space, well-aware of how complex and heavy-weight this has gotten. So, wasmCloud edge computing + Prometheus support is logical, lightweight, refreshing. And so too for the next OOTB integration with CloudProductXYZ.

For those not in the cloud space the feeling is different. Like wasmCloud is collecting bells & whistles. I know that this Prometheus integration is probably going to be a separate optional component, a crate or something. How that fits exactly should be part of the RFC. What is the "minimum profile" wasmCloud deployment going to be? What is the default deployment that's documented best? How many moving infra parts does it include?

Imho wasmCloud would be strategically best served by leaving all paths open to App Developers: start minimal, add just what you need. Cognitive overhead is growing right now, to the point where I increasingly perceive a cloud-product-done-differently rather than a new Paradigm of Edge Computing. So many of these components being in a monorepo doesn't help alleviate that.

There are many features & benefits to Wasm as it's used. What is the biggest USP of Wasm/WASI? Isn't it (contract-based) component-oriented development? Having this box of lego blocks that allows me to compose my solution, with all/most infra taken care of, so I can just add my business logic. Emphasize that USP. Right now the monorepo doesn't convey that vibe, tbh.

The "minimum core" of wasmCloud supports OpenTelemetry, a de-facto open standard. That would be my selection criterium for the product. And secondary the already supported components / building blocks to connect to that.. "Prometheus? 🤔 Hmm, yea possibly. This helps get me going faster". Or not, and choose something else.

@connorsmith256
Contributor

@aschrijver Thank you for sharing a perspective from outside the cloud space 👍

@pgray noted in his proposal that support for prometheus will be accomplished by serving a /metrics endpoint over HTTP. I think it would be fine to use a metrics feature flag to disable this feature (at the slight cost of decreasing code readability). However, I'm not sure whether this helps. If an operator of wasmCloud doesn't care about metrics, or uses a system other than Prometheus, they simply wouldn't configure Prometheus to scrape the host, right? The cost of starting a web server to expose an endpoint that doesn't receive any requests should be negligible.
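
As a rough illustration, gating the endpoint behind a Cargo feature (feature name assumed here, not something that exists today) could look like this, with the non-metrics build compiling the listener out entirely:

// Assumed Cargo feature named "metrics"; when disabled, no listener and no open port.
#[cfg(feature = "metrics")]
fn start_metrics_endpoint(addr: std::net::SocketAddr) {
    // spawn the HTTP listener serving /metrics on `addr` here
    let _ = addr;
}

#[cfg(not(feature = "metrics"))]
fn start_metrics_endpoint(_addr: std::net::SocketAddr) {
    // compiled out: intentionally a no-op
}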

@protochron, @pgray, and others can speak more clearly to the motivations for adding metrics, but I can say as someone who at times helps with operating/managing many wasmCloud hosts, I have often wished for access to more high-level host data. I can find the answers to some questions today by searching through logs or looking for error traces, but the 3rd pillar of observability (metrics) is missing from wasmCloud today.

(Re: monorepo, I would like to hear more about your concerns, but that isn't part of this RFC, so let's pick that up in a different issue)

@protochron
Contributor

I think it might be better to think of prometheus => opentelemetry pretty much everywhere we're talking about it in the RFC. We should update it to make that explicit, but I believe the implementation should rely on the OpenTelemetry support we already have built into the host to collect metrics.

Exposing them is a slightly different story, since I keep going back and forth on whether we should just export them to an opentelemetry collector like we do now with traces, or expose the /metrics HTTP endpoint like @pgray originally proposed. There is a crate we could use if we want to instantiate a listener on a port (ex. 127.0.0.1:9091/metrics) to enable Prometheus scraping. The alternative, as I mentioned previously, would be to simply relay metrics to the configured collector and let the collector handle exporting metrics from there. The advantage of the latter approach is that the host does not need to know how to pull metrics from providers, since they would just be configured to export metrics to the collector, just like they export traces. In a world where we want explicit Prometheus scraping, the host would need to reach out to every provider whenever a request comes in to the metrics listener, or providers would need to stream that data back to the host, which would then need to cache it until it needed to be updated.
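
For reference, recording a counter through the host's existing OpenTelemetry support and letting the configured exporter (OTLP to a collector, or a Prometheus scrape endpoint) handle the rest could look roughly like the sketch below; exact builder methods vary by opentelemetry crate version, and the metric/label names are illustrative:

use opentelemetry::{global, KeyValue};

fn record_invocation(actor_id: &str, lattice_id: &str) {
    // In a real host the meter and instrument would be created once and reused,
    // not rebuilt on every invocation.
    let meter = global::meter("wasmcloud-host");
    let invocations = meter.u64_counter("wasmcloud_actor_invocations").init();
    invocations.add(
        1,
        &[
            KeyValue::new("actor_id", actor_id.to_string()),
            KeyValue::new("lattice_id", lattice_id.to_string()),
        ],
    );
}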

In terms of wanting metrics at all: like @connorsmith256 said, as someone who is operating a fleet of wasmcloud hosts in a cloud context, I absolutely want and need more instrumentation than we have now. For example, one of the things that metrics on invocation counts and response times from actors tell you is whether or not you need to add additional actors to satisfy incoming requests. Without that data it is extremely difficult to determine whether your application is saturated.

@connorsmith256 re: metric types, actor invocation counts would need to be grouped by actor ID, and probably by link name as well. We could keep track of CPU/memory, but IMHO those are better tracked by other methods or exporters. It would be interesting to see stats around the wasmcloud caches though!
Here's what I'm thinking for an initial set of metrics based on what @pgray outlined:

counter: wasmcloud_actor_invocations
  labels: actor_id, link_def, lattice_id, source, version

counter: wasmcloud_actor_errors
  labels: actor_id, link_def, lattice_id, source, version

gauge: wasmcloud_actor_scale
  labels: actor_id, link_def, lattice_id, source, version

gauge: wasmcloud_provider_scale
  labels: provider_id, link_def, lattice_id, source, version

histogram: wasmcloud_actor_request_latency_sec
  labels: actor_id, link_def, lattice_id, source, version

# Should be gauges if we clean anything up
counter: wasmcloud_host_cache_size_kb
counter: wasmcloud_host_oci_cache_size_kb

counter: wasmcloud_host_uptime_sec

# Always 0, only exists for exporting host labels
counter: wasmcloud_host_labels
  labels: labels (list of host labels)

# Always 0, only exists for exporting host metadata
# Probably want to think of what other information we'd want about a host here
counter: wasmcloud_host_information
  labels: os, version, architecture
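
If we do end up serving these via the prometheus crate, the latency histogram above needs explicit buckets; a minimal sketch (bucket boundaries are placeholders and should match expected actor execution times):

use prometheus::{HistogramOpts, HistogramVec, Registry};

fn register_latency(registry: &Registry) -> prometheus::Result<HistogramVec> {
    let latency = HistogramVec::new(
        HistogramOpts::new(
            "wasmcloud_actor_request_latency_sec",
            "Actor request latency in seconds",
        )
        // Placeholder buckets, in seconds.
        .buckets(vec![0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]),
        &["actor_id", "link_def", "lattice_id", "source", "version"],
    )?;
    registry.register(Box::new(latency.clone()))?;
    Ok(latency)
}

// usage: latency.with_label_values(&[...]).observe(elapsed_seconds);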

@stevelr
Contributor

stevelr commented Sep 23, 2023

@pgray and @connorsmith256 thanks for writing this! I'm looking forward to using host metrics.

@protochron where you wrote source, version in the labels... does source mean the oci ref without the version?

  • To provide insight for scaling, I think we want
# total time executing actor invocations ("wall-clock" time from start of invocation to method return)
counter: wasmcloud_actor_execution_time
  labels: actor_id, link_def, method_name, lattice_id, source, version

# total time messages wait for the actor to become available (due to instance or thread contention) before invocation
counter: wasmcloud_actor_queue_time
  labels: actor_id, link_def, method_name, lattice_id, source, version

# total time actor waits for response to wasi system call or invocation of actor/provider
counter: wasmcloud_actor_wait_time
  labels: actor_id, link_def, in_method_name, out_id, out_method_name, lattice_id, source, version

# total time executing provider invocations
counter: wasmcloud_provider_execution_time
  labels: provider_id, link_def, method_name, lattice_id, source, version

# (edit: also need count of invocations for actor & provider, in addition to total execution time)
# (edit: Disclaimer: I'm probably not using the right naming conventions for labels or fields. I leave naming to others who know better)
  • Using the above metrics, the time an actor is executing code is
    wasmcloud_actor_execution_time - wasmcloud_actor_wait_time

  • I would also be interested in some metrics around startup latency: time elapsed from when the host receives a start actor command to invocation, including OCI download, linking and/or dynamic wasifills (if any), and AOT compilation.

  • I agree with @protochron that CPU and memory metrics are better left to other collectors. If the consumer of those metrics needs another value (for example, disk io, network packets, etc.), it's easier to update the configuration file for a collector agent (grafana agent, fluent-bit, etc.) and restart it than it is to figure out the system api to get the data, change the host source, recompile, and redeploy a host.

  • There should be a way to enable or disable the host metrics endpoint with an environment variable (based on the secure-by-default principle, the default would be disabled); a minimal sketch of such a gate follows at the end of this comment. It's useful to be able to enable/disable it with environment variables even if there is also a feature flag to compile in metrics support.

Provider metrics:
Scraping provider metrics presents more complexities:
  • avoiding port conflicts with multiple provider instances requires extra configuration parameters
  • the complexity of adding scrapers to a host or external process, with a dynamically-changing list of providers to scrape
  • avoiding false alerts if a provider's /metrics endpoint is not available while the provider is starting or stopping

  • All of the extra complexities for scraping providers make me think that a simpler approach is either (a) let the host report calls and latency, as it does for actors, and use generic process metrics collectors (fluent-bit, grafana agents, etc.) as a rough approximation of performance, and/or (b) have the host push to OpenTelemetry collectors. The latter approach leverages existing infrastructure, since wasmcloud already uses otel for logs and traces.
  • In the future, when there are wasm providers, the metrics reporting will look more like it does for actors anyway. If that time is relatively near, it may not be worth the collective effort across the community to build instrumentation for providers as separate processes, only for it to become obsolete soon after.
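
A minimal sketch of the environment-variable gate mentioned above (the variable name is assumed, not an existing wasmcloud flag):

fn metrics_endpoint_enabled() -> bool {
    // Hypothetical variable name; disabled unless explicitly opted in (secure by default).
    std::env::var("WASMCLOUD_PROMETHEUS_METRICS")
        .map(|v| v == "true" || v == "1")
        .unwrap_or(false)
}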

@stevelr
Contributor

stevelr commented Sep 23, 2023

I'm slightly leaning toward the idea of the host exporting metrics over OTEL, even for use cases where the consumer needs a Prometheus endpoint.

  • This would eliminate all the complexity of configuring prometheus endpoints for hosts and providers; the "senders" of data only need to know the url of the collector. It also reduces attack surface and security-auditing complexity by having fewer total open ports. For a manager of multi-tenant wasmcloud, I would think this would be a meaningful simplification and would allow more flexible, independent scaling of the "metrics plane".
  • It's the most flexible, as consumers can use either OTEL-style metrics or prometheus-style metrics. As @protochron mentions, it would be easy enough to have an otel collector process provide the prometheus endpoint. (these already exist)
  • Since OTEL is built on a more modern and flexible set of protocols, it's less effort for a developer to combine host metrics with application metrics into a single metrics pipeline / database / dashboard. For example, if an HttpServer actor wants to report a histogram of request size, it could use an otel library to report the additional metrics (via httpclient), and those metrics would be channeled into the exact same mechanism as the other metrics (whether otel- or prometheus-style). If the host were to use Prometheus exports, the developer has to do extra work to join the host-reported data with application data (such as http request size) into the same metrics dashboards or processing pipelines.

Perhaps a built-in OTEL metrics capability provider would improve performance and simplify the developer's experience. It could leverage the host's ability to batch/aggregate metrics and report on fixed intervals. If an actor has to report metrics with OTEL over HTTP, it would require a synchronous network call for each method invocation.

@stevelr
Contributor

stevelr commented Sep 23, 2023

When a wasmcloud host shuts down, or needs to restart to upgrade, it should do a last push of metrics to the otel collector before the host process exits. If the host only had prometheus scraping, we'd always lose the data collected during the last incomplete scraping interval.
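
With the OpenTelemetry SDK this is roughly a matter of flushing and shutting down the meter provider on exit; a sketch, assuming the host holds on to the SdkMeterProvider it builds at startup (method names per recent opentelemetry_sdk versions):

use opentelemetry_sdk::metrics::SdkMeterProvider;

fn flush_metrics_on_shutdown(provider: &SdkMeterProvider) -> Result<(), Box<dyn std::error::Error>> {
    // Push any buffered metrics to the collector before the host process exits.
    provider.force_flush()?;
    provider.shutdown()?;
    Ok(())
}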

stale bot commented Nov 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this has been closed too eagerly, please feel free to tag a maintainer so we can keep working on the issue. Thank you for contributing to wasmCloud!

@stale stale bot added the stale label Nov 22, 2023
@aschrijver

I assume it is not stale, bot.

@brooksmtownsend brooksmtownsend added this to the wasmCloud 1.0.0 milestone Nov 29, 2023
@stale stale bot closed this as completed Dec 6, 2023
@connorsmith256
Contributor

Whoops, I thought the stale label had been removed 🙂

@connorsmith256 connorsmith256 reopened this Dec 6, 2023
@stale stale bot removed the stale label Dec 6, 2023
@pgray
Contributor Author

pgray commented Dec 14, 2023

I closed my draft PR for the host metrics work, but I'll leave the branch around in case anyone would like to reference it when they get to the work. Cheers

@vados-cosmonic
Contributor

Hey @joonas if you're open to turning this into an ADR after your PR merges, we could assign this to you? If not, I'd be happy to try and write the ADR here to get the ticket closed.

@joonas
Member

joonas commented Feb 15, 2024

Hey @joonas if you're open to turning this into an ADR after your PR merges, we could assign this to you? If not, I'd be happy to try and write the ADR here to get the ticket closed.

@vados-cosmonic You're welcome to take over the ADR; I think my availability over the next month will not be conducive to driving it before 1.0, if that's when we'd like to have the ADR included.

@vados-cosmonic vados-cosmonic self-assigned this Feb 15, 2024
@brooksmtownsend
Member

Said PR is merged! So ADR is good to go 😄
