
[RFC] Host metrics #664

Closed
pgray opened this issue Sep 19, 2023 · 15 comments · Fixed by #1531
Labels
RFC rfc-proposed This issue is a newly proposed RFC and is awaiting comments, questions, and approval

Comments

@pgray
Contributor

pgray commented Sep 19, 2023

RFC: Host metrics

Introduction

wasmcloud host metrics, implemented using a prometheus /metrics endpoint served on a configurable port

Motivation

In order to provide users insight into their running workloads, wasmcloud needs to expose metrics that reveal the traffic actors and providers have handled over a period of time.

Detailed Design

Hosts already provide basic metrics in health payloads, such as the number of actors and providers running or configured.

In order to provide insight into running workloads, a wasmcloud metrics implementation should provide the following metrics:

  • host uptime (seconds)
  • actors
    • invocation count
    • error count
    • latency (actor execution time)
  • providers
    • restart count
    • provider running seconds (process uptime)

The main wasmcloud runtime process should instantiate a prometheus registry and register/deregister actors/providers as they are scheduled on the host.
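
For illustration, here is a minimal sketch (Rust, using the prometheus crate) of the registry ownership and registration described above; the metric and label names are placeholders rather than a final schema:

use prometheus::{Encoder, IntCounterVec, Opts, Registry, TextEncoder};

fn example() -> Result<(), Box<dyn std::error::Error>> {
    // One registry owned by the host process; metrics for actors/providers are
    // registered/deregistered as they are scheduled on the host.
    let registry = Registry::new();

    // Hypothetical counter; names and labels are up for discussion in this RFC.
    let invocations = IntCounterVec::new(
        Opts::new("wasmcloud_actor_invocations", "Number of actor invocations handled"),
        &["host_id", "lattice_id", "oci_ref", "version"],
    )?;
    registry.register(Box::new(invocations.clone()))?;

    // Record an invocation (label values here are placeholders).
    invocations
        .with_label_values(&["N...", "default", "azurecr.io/wasmcloud/kv", "v1.2.3"])
        .inc();

    // This is what a /metrics handler would serve: the Prometheus text exposition format.
    let mut buffer = Vec::new();
    TextEncoder::new().encode(&registry.gather(), &mut buffer)?;
    println!("{}", String::from_utf8(buffer)?);
    Ok(())
}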

tags

wasmcloud should tag metrics with basic identifying tags:

  • host id
  • lattice id
  • oci ref (e.g. azurecr.io/wasmcloud/kv)
  • version (e.g. v1.2.3)
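
For example, a single exposed metric carrying these tags might look like the following line in the Prometheus text format (metric name and values are illustrative):

wasmcloud_actor_invocations{host_id="N...",lattice_id="default",oci_ref="azurecr.io/wasmcloud/kv",version="v1.2.3"} 42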

Backwards Compatibility

This functionality would be net-new, so no backwards-incompatible changes are planned.

Additionally, this RFC only seeks to implement metrics based on what a host can observe from its own runtime behavior; we leave provider-specific metrics for a later RFC.

Alternatives Considered

A nats publish topic capable of deriving host/provider/actor metrics would provide cluster operators with a simplified deployment model for getting metrics from their wasmcloud deployments. Utilizing nats' queueing capabilities, an agent could be designed so that many copies run and share the work. Depending on the cluster topology and tenancy, a nats metric agent might require elevated privileges in order to collect metrics from all hosts. Wasmcloud hosts would need to publish metrics to a shared topic in nats in order to avoid the nats metric agent inspecting all traffic flowing on wasmcloud topics.
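
As a rough sketch of this alternative (Rust, using the async-nats crate; the subject layout is hypothetical and nothing like it exists in wasmcloud today), a host could publish serialized metrics to a lattice-scoped subject, and agents could share the load via a queue group:

use async_nats::Client;

// Hypothetical subject layout; agents could use client.queue_subscribe(...)
// on the same subject so that many copies share the work.
async fn publish_host_metrics(
    client: &Client,
    lattice_id: &str,
    host_id: &str,
    payload: Vec<u8>,
) -> Result<(), Box<dyn std::error::Error>> {
    let subject = format!("wasmbus.metrics.{lattice_id}.{host_id}");
    client.publish(subject, payload.into()).await?;
    Ok(())
}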

Unresolved Questions

  • how should we implement per-provider metrics? (e.g. connection count, latency, etc)
  • should we add user configurable tags?
  • what other metrics should exist by default?

Conclusion

Adding prometheus metrics to wasmcloud hosts would provide operators with a familiar and easy-to-use method for collecting metrics on their wasmcloud workloads. I hope this RFC defines enough to kick off the discussion.

@pgray pgray added RFC rfc-proposed This issue is a newly proposed RFC and is awaiting comments, questions, and approval labels Sep 19, 2023
@connorsmith256
Contributor

CC @protochron and @stevelr who have expressed interest in this RFC before

@connorsmith256
Contributor

This set of metrics looks like a nice, simple start to me that we can build on over time. A few questions:

  • for actor invocation/error count and latency, this should be split out by actor (ID?) right? Potentially also by link definition/call alias?
  • what's the motivation for provider restart count? Not sure I understand that one
  • are there any other system-level metrics that would be useful, such as CPU/memory, bytes transferred, bytes cached on disk, etc.?

@aschrijver

aschrijver commented Sep 23, 2023

There's a force driving wasmCloud to work out-of-the-box with best-of-breed cloud products like Prometheus. Understandable, and many folks interested in wasmCloud are in the cloud space, well-aware of how complex and heavy-weight this has gotten. So, wasmCloud edge computing + Prometheus support is logical, lightweight, refreshing. And so too for the next OOTB integration with CloudProductXYZ.

For those not in the cloud space the feeling is different. Like wasmCloud is collecting bells & whistles. I know that this Prometheus integration is probably going to be a separate optional component, a crate or something. How that fits exactly should be part of the RFC. What is the "minimum profile" wasmCloud deployment going to be? What is the default deployment that's documented best? How many moving infra parts does it include?

Imho wasmCloud would be strategically best served by leaving all paths open to App Developers: start minimal, add just what you need. Cognitive overhead is growing right now, to the point where I increasingly perceive a cloud-product-done-differently rather than a new Paradigm of Edge Computing. So many of these components being in a monorepo doesn't help alleviate that.

There are many features & benefits to Wasm as it's used. What is the biggest USP of Wasm/WASI? Isn't it (contract-based) component-oriented development? Having this box of lego blocks that allows me to compose my solution, with all/most infra taken care of, so I can just add my business logic. Emphasize that USP. Right now the monorepo doesn't convey that vibe, tbh.

The "minimum core" of wasmCloud supports OpenTelemetry, a de-facto open standard. That would be my selection criterium for the product. And secondary the already supported components / building blocks to connect to that.. "Prometheus? 🤔 Hmm, yea possibly. This helps get me going faster". Or not, and choose something else.

@connorsmith256
Contributor

@aschrijver Thank you for sharing a perspective from outside the cloud space 👍

@pgray noted in his proposal that support for prometheus will be accomplished by serving a /metrics endpoint over HTTP. I think it would be fine to use a metrics feature flag to disable this feature (at the slight cost of decreasing code readability). However, I'm not sure whether this helps. If an operator of wasmCloud doesn't care about metrics, or uses a system other than Prometheus, they simply wouldn't configure Prometheus to scrape the host, right? The cost of starting a web server to expose an endpoint that doesn't receive any requests should be negligible.
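
As a rough illustration, gating the endpoint behind a Cargo feature (feature name assumed here, not something that exists today) could look like this, with the non-metrics build compiling the listener out entirely:

// Assumed Cargo feature named "metrics"; when disabled, no listener and no open port.
#[cfg(feature = "metrics")]
fn start_metrics_endpoint(addr: std::net::SocketAddr) {
    // spawn the HTTP listener serving /metrics on `addr` here
    let _ = addr;
}

#[cfg(not(feature = "metrics"))]
fn start_metrics_endpoint(_addr: std::net::SocketAddr) {
    // compiled out: intentionally a no-op
}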

@protochron, @pgray, and others can speak more clearly to the motivations for adding metrics, but I can say as someone who at times helps with operating/managing many wasmCloud hosts, I have often wished for access to more high-level host data. I can find the answers to some questions today by searching through logs or looking for error traces, but the 3rd pillar of observability (metrics) is missing from wasmCloud today.

(Re: monorepo, I would like to hear more about your concerns, but that isn't part of this RFC, so let's pick that up in a different issue)

@protochron
Contributor

I think it might be better to think of prometheus => opentelemetry pretty much everywhere we're talking about it in the RFC. We should update it to make that explicit, but I believe the implementation should rely on the OpenTelemetry support we already have built into the host to collect metrics.

Exposing them is a slightly different story, since I keep going back and forth on whether we should just export them to an opentelemetry collector like we do now with traces, or expose the /metrics HTTP endpoint like @pgray originally proposed. There is a crate we could use if we want to instantiate a listener on a port (ex. 127.0.0.1:9091/metrics) to enable Prometheus scraping. The alternative, as I mentioned previously, would be to simply relay metrics to the configured collector and let the collector handle exporting metrics from there. The advantage of the latter approach is that the host does not need to know how to pull metrics from providers, since they would just be configured to export metrics to the collector, just like they export traces. In a world where we want explicit Prometheus scraping, the host would need to reach out to every provider whenever a request comes in to the metrics listener, or providers would need to stream that data back to the host, which would then need to cache it until it needed to be updated.
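
For reference, recording a counter through the host's existing OpenTelemetry support and letting the configured exporter (OTLP to a collector, or a Prometheus scrape endpoint) handle the rest could look roughly like the sketch below; exact builder methods vary by opentelemetry crate version, and the metric/label names are illustrative:

use opentelemetry::{global, KeyValue};

fn record_invocation(actor_id: &str, lattice_id: &str) {
    // In a real host the meter and instrument would be created once and reused,
    // not rebuilt on every invocation.
    let meter = global::meter("wasmcloud-host");
    let invocations = meter.u64_counter("wasmcloud_actor_invocations").init();
    invocations.add(
        1,
        &[
            KeyValue::new("actor_id", actor_id.to_string()),
            KeyValue::new("lattice_id", lattice_id.to_string()),
        ],
    );
}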

In terms of wanting metrics at all: like @connorsmith256 said, as someone who is operating a fleet of wasmcloud hosts in a cloud context, I absolutely want and need more instrumentation than we have now. For example, one of the things that metrics on invocation counts and response times from actors tell you is whether or not you need to add additional actors to satisfy incoming requests. Without that data it is extremely difficult to determine whether your application is saturated.

@connorsmith256 re: metric types, actor invocation counts would need to be grouped by actor ID, and probably by link name as well. We could keep track of CPU/memory, but IMHO those are better tracked by other methods or exporters. It would be interesting to see stats around the wasmcloud caches though!
Here's what I'm thinking for an initial set of metrics based on what @pgray outlined:

counter: wasmcloud_actor_invocations
  labels: actor_id, link_def, lattice_id, source, version

counter: wasmcloud_actor_errors
  labels: actor_id, link_def, lattice_id, source, version

gauge: wasmcloud_actor_scale
  labels: actor_id, link_def, lattice_id, source, version

gauge: wasmcloud_provider_scale
  labels: provider_id, link_def, lattice_id, source, version

histogram: wasmcloud_actor_request_latency_sec
  labels: actor_id, link_def, lattice_id, source, version

# Should be gauges if we clean anything up
counter: wasmcloud_host_cache_size_kb
counter: wasmcloud_host_oci_cache_size_kb

counter: wasmcloud_host_uptime_sec

# Always 0, only exists for exporting host labels
counter: wasmcloud_host_labels
  labels: labels (list of host labels)

# Always 0, only exists for exporting host metadata
# Probably want to think of what other information we'd want about a host here
counter: wasmcloud_host_information
  labels: os, version, architecture
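
If we do end up serving these via the prometheus crate, the latency histogram above needs explicit buckets; a minimal sketch (bucket boundaries are placeholders and should match expected actor execution times):

use prometheus::{HistogramOpts, HistogramVec, Registry};

fn register_latency(registry: &Registry) -> prometheus::Result<HistogramVec> {
    let latency = HistogramVec::new(
        HistogramOpts::new(
            "wasmcloud_actor_request_latency_sec",
            "Actor request latency in seconds",
        )
        // Placeholder buckets, in seconds.
        .buckets(vec![0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0]),
        &["actor_id", "link_def", "lattice_id", "source", "version"],
    )?;
    registry.register(Box::new(latency.clone()))?;
    Ok(latency)
}

// usage: latency.with_label_values(&[...]).observe(elapsed_seconds);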

@stevelr
Contributor

stevelr commented Sep 23, 2023

@pgray and @connorsmith256 thanks for writing this! I'm looking forward to using host metrics.

@protochron where you wrote source, version in the labels... does source mean the oci ref without the version?

  • To provide insight for scaling, I think we want
# total time executing actor invocations ("wall-clock" time from start of invocation to method return)
counter: wasmcloud_actor_execution_time
  labels: actor_id, link_def, method_name, lattice_id, source, version

# total time messages wait for the actor to become available (due to instance or thread contention) before invocation
counter: wasmcloud_actor_queue_time
  labels: actor_id, link_def, method_name, lattice_id, source, version

# total time actor waits for response to wasi system call or invocation of actor/provider
counter: wasmcloud_actor_wait_time
  labels: actor_id, link_def, in_method_name, out_id, out_method_name, lattice_id, source, version

# total time executing provider invocations
counter: wasmcloud_provider_execution_time
  labels: provider_id, link_def, method_name, lattice_id, source, version

# (edit: also need count of invocations for actor & provider, in addition to total execution time)
# (edit: Disclaimer: I'm probably not using the right naming conventions for labels or fields. I leave naming to others who know better)
  • Using the above metrics, the time an actor is executing code is
    wasmcloud_actor_execution_time - wasmcloud_actor_wait_time

  • I would also be interested in some metrics around startup latency: time elapsed from when the host receives a start actor command to invocation, including OCI download, linking and/or dynamic wasifills (if any), and AOT compilation.

  • I agree with @protochron that CPU and memory metrics are better left to other collectors. If the consumer of those metrics needs another value (for example, disk io, network packets, etc.), it's easier to update the configuration file for a collector agent (grafana agent, fluent-bit, etc.) and restart it than it is to figure out the system api to get the data, change the host source, recompile, and redeploy a host.

  • There should be a way to enable or disable the host metrics endpoint with an environment variable (based on the secure-by-default principle, the default would be disabled); a minimal sketch of such a gate follows at the end of this comment. It's useful to be able to enable/disable it with environment variables even if there is also a feature flag to compile in metrics support.

Provider metrics:
Scraping provider metrics presents more complexities:
  • avoiding port conflicts with multiple provider instances requires extra configuration parameters
  • the complexity of adding scrapers to a host or external process, with a dynamically-changing list of providers to scrape
  • avoiding false alerts if a provider's /metrics endpoint is not available while the provider is starting or stopping

  • All of the extra complexities for scraping providers make me think that a simpler approach is either (a) let the host report calls and latency, as it does for actors, and use generic process metrics collectors (fluent-bit, grafana agents, etc.) as a rough approximation of performance, and/or (b) have the host push to OpenTelemetry collectors. The latter approach leverages existing infrastructure, since wasmcloud already uses otel for logs and traces.
  • In the future, when there are wasm providers, the metrics reporting will look more like it does for actors anyway. If that time is relatively near, it may not be worth the collective effort across the community to build instrumentation for providers as separate processes, only for it to become obsolete soon after.
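
A minimal sketch of the environment-variable gate mentioned above (the variable name is assumed, not an existing wasmcloud flag):

fn metrics_endpoint_enabled() -> bool {
    // Hypothetical variable name; disabled unless explicitly opted in (secure by default).
    std::env::var("WASMCLOUD_PROMETHEUS_METRICS")
        .map(|v| v == "true" || v == "1")
        .unwrap_or(false)
}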

@stevelr
Contributor

stevelr commented Sep 23, 2023

I'm slightly leaning toward the idea of the host exporting metrics over OTEL, even for use cases where the consumer needs a Prometheus endpoint.

  • This would eliminate all the complexity of configuring prometheus endpoints for hosts and providers; the "senders" of data only need to know the url of the collector. It also reduces attack surface and security-auditing complexity by having fewer total open ports. For a manager of multi-tenant wasmcloud, I would think this would be a meaningful simplification and would allow more flexible, independent scaling of the "metrics plane".
  • It's the most flexible, as consumers can use either OTEL-style metrics or prometheus-style metrics. As @protochron mentions, it would be easy enough to have an otel collector process provide the prometheus endpoint. (these already exist)
  • Since OTEL is built on a more modern and flexible set of protocols, it's less effort for a developer to combine host metrics with application metrics into a single metrics pipeline / database / dashboard. For example, if an HttpServer actor wants to report a histogram of request size, it could use an otel library to report the additional metrics (via httpclient), and those metrics would be channeled into the exact same mechanism as the other metrics (whether otel- or prometheus-style). If the host were to use Prometheus exports, the developer has to do extra work to join the host-reported data with application data (such as http request size) into the same metrics dashboards or processing pipelines.

Perhaps a built-in OTEL metrics capability provider would improve performance and simplify the developer's experience. It could leverage the host's ability to batch/aggregate metrics and report on fixed intervals. If an actor has to report metrics with OTEL over HTTP, it would require a synchronous network call for each method invocation.

@stevelr
Contributor

stevelr commented Sep 23, 2023

When a wasmcloud host shuts down, or needs to restart to upgrade, it should do a last push of metrics to the otel collector before the host process exits. If the host only had prometheus scraping, we'd always lose the data collected during the last incomplete scraping interval.
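
With the OpenTelemetry SDK this is roughly a matter of flushing and shutting down the meter provider on exit; a sketch, assuming the host holds on to the SdkMeterProvider it builds at startup (method names per recent opentelemetry_sdk versions):

use opentelemetry_sdk::metrics::SdkMeterProvider;

fn flush_metrics_on_shutdown(provider: &SdkMeterProvider) -> Result<(), Box<dyn std::error::Error>> {
    // Push any buffered metrics to the collector before the host process exits.
    provider.force_flush()?;
    provider.shutdown()?;
    Ok(())
}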

stale bot commented Nov 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If this has been closed too eagerly, please feel free to tag a maintainer so we can keep working on the issue. Thank you for contributing to wasmCloud!

@stale stale bot added the stale label Nov 22, 2023
@aschrijver

I assume it is not stale, bot.

@brooksmtownsend brooksmtownsend added this to the wasmCloud 1.0.0 milestone Nov 29, 2023
@stale stale bot closed this as completed Dec 6, 2023
@connorsmith256
Contributor

Whoops, I thought the stale label had been removed 🙂

@connorsmith256 connorsmith256 reopened this Dec 6, 2023
@stale stale bot removed the stale label Dec 6, 2023
@pgray
Contributor Author

pgray commented Dec 14, 2023

I closed my draft PR for the host metrics work, but I'll leave the branch around in case anyone would like to reference it when they get to the work. Cheers

@vados-cosmonic
Contributor

Hey @joonas if you're open to turning this into an ADR after your PR merges, we could assign this to you? If not, I'd be happy to try and write the ADR here to get the ticket closed.

@joonas
Member

joonas commented Feb 15, 2024

Hey @joonas if you're open to turning this into an ADR after your PR merges, we could assign this to you? If not, I'd be happy to try and write the ADR here to get the ticket closed.

@vados-cosmonic You're welcome to take over the ADR; I think my availability over the next month will not be conducive to driving it before 1.0, if that's when we'd like to have the ADR included.

@vados-cosmonic vados-cosmonic self-assigned this Feb 15, 2024
@brooksmtownsend
Member

Said PR is merged! So ADR is good to go 😄
