New top subcommand v1 #3211

Closed
binarylogic opened this issue Jul 26, 2020 · 7 comments
Labels
domain: cli (Anything related to Vector's CLI)
domain: data model (Anything related to Vector's internal data model)
domain: metrics (Anything related to Vector's metrics events)
domain: observability (Anything related to monitoring/observing Vector)
have: should (We should have this feature, but is not required. It is medium priority.)
type: feature (A value-adding code addition that introduce new functionality.)

Comments

@binarylogic
Contributor

binarylogic commented Jul 26, 2020

As part of https://github.com/timberio/vector-product/issues/24, we want to introduce lightweight CLI-based observability for Vector.

Goals

  1. In v3 of the "Observe & monitor Vector" series, we will be introducing a Web UI. We'd like for the work here to unblock that project.
  2. Provide Vector operators with CLI-based observability. This is useful in situations where an operator will not have access to a browser, such as SSH'ing onto a remote host that Vector is running on.

Out of scope

As noted in https://github.com/timberio/vector-product/issues/24:

  1. Fancy CLI-based graphs are out of scope for this project. As much as my inner-nerd wants to do this, all of the presentation details are a distraction from the purpose of this project.
  2. Full-blown metrics querying is likely out of scope for this project. We will be doing this in the future, but we'd like to use this project to learn more about the UI requirements. I use the term "likely" because it is possible that the beginnings of a query syntax might be easier.

Proposal

I propose that we introduce a vector top subcommand, taking inspiration from the glorious top command. I like this because:

  1. It's a familiar tool to anyone who has used the command line before.
  2. It's clear that this provides current, real-time insight (not historical).
  3. The requirements align tightly with the upcoming UI requirements.

Examples

To demonstrate what I'm thinking:

$ vector help top

USAGE:
    vector top [OPTIONS]

OPTIONS:
    --refresh-rate      How often the screen refreshes (default 500ms)
    --resolution        Determines the window size for each value (default 500ms)
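
A minimal sketch of how the two options above might be parsed, assuming values are written as an integer plus a unit suffix (`ms`, `s`, or `m`). `TopOptions`, `parse_duration`, and `parse_args` are hypothetical names for illustration and not part of Vector; the 500ms defaults are taken from the help text.

```rust
use std::time::Duration;

/// Hypothetical options for `vector top`; fields mirror the proposed flags.
#[derive(Debug, Clone, PartialEq)]
pub struct TopOptions {
    pub refresh_rate: Duration,
    pub resolution: Duration,
}

impl Default for TopOptions {
    fn default() -> Self {
        // Both defaults come from the proposed help output (500ms).
        Self {
            refresh_rate: Duration::from_millis(500),
            resolution: Duration::from_millis(500),
        }
    }
}

/// Parses values such as "500ms", "2s", or "1m" (assumed unit suffixes).
pub fn parse_duration(input: &str) -> Result<Duration, String> {
    let input = input.trim();
    // Digits form the amount; everything from the first non-digit on is the unit.
    let split = input
        .find(|c: char| !c.is_ascii_digit())
        .ok_or_else(|| format!("missing unit in {:?}", input))?;
    let (amount, unit) = input.split_at(split);
    let amount: u64 = amount
        .parse()
        .map_err(|_| format!("invalid amount in {:?}", input))?;
    match unit {
        "ms" => Ok(Duration::from_millis(amount)),
        "s" => Ok(Duration::from_secs(amount)),
        "m" => Ok(Duration::from_secs(amount.saturating_mul(60))),
        other => Err(format!("unknown unit {:?}", other)),
    }
}

/// Applies `--flag value` pairs on top of the defaults.
pub fn parse_args(args: &[&str]) -> Result<TopOptions, String> {
    let mut opts = TopOptions::default();
    let mut iter = args.iter();
    while let Some(flag) = iter.next() {
        let value = iter
            .next()
            .ok_or_else(|| format!("missing value for {}", flag))?;
        match *flag {
            "--refresh-rate" => opts.refresh_rate = parse_duration(value)?,
            "--resolution" => opts.resolution = parse_duration(value)?,
            other => return Err(format!("unknown option {}", other)),
        }
    }
    Ok(opts)
}
```

With this sketch, `parse_args(&["--refresh-rate", "1s"])` would refresh once per second while keeping the default 500ms resolution.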

And usage is simple:

$ vector top
ID              KIND       TYPE         THRPT   I/O      LATENCY    ERRORS
my-file-source  source     file         5.2k    10.6MiB  10.2ns     251
my-json-parser  transform  json_parser  5.0k    -        1.2ms      523
my-s3-sink      sink       s3           4.5k    5.2MiB   10.2s      12
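
To make the layout above concrete, here is a rough sketch of how such rows could be rendered with fixed-width columns and human-readable units. `ComponentRow`, `format_rate`, `format_bytes`, and `render` are illustrative names, not Vector's types, and the LATENCY column is left out to keep the sketch short.

```rust
/// One row of the proposed table. Field names are illustrative only.
pub struct ComponentRow {
    pub id: String,
    pub kind: String,
    pub component_type: String,
    pub events_per_sec: f64,
    /// `None` renders as "-", as for the transform row in the example.
    pub bytes: Option<u64>,
    pub errors: u64,
}

/// Formats an event rate, using a "k" suffix from 1,000 upwards.
pub fn format_rate(rate: f64) -> String {
    if rate >= 1000.0 {
        format!("{:.1}k", rate / 1000.0)
    } else {
        format!("{:.0}", rate)
    }
}

/// Formats a byte count using binary units (KiB, MiB, GiB).
pub fn format_bytes(bytes: u64) -> String {
    const UNITS: [&str; 4] = ["B", "KiB", "MiB", "GiB"];
    let mut value = bytes as f64;
    let mut unit = 0;
    while value >= 1024.0 && unit < UNITS.len() - 1 {
        value /= 1024.0;
        unit += 1;
    }
    if unit == 0 {
        format!("{}B", bytes)
    } else {
        format!("{:.1}{}", value, UNITS[unit])
    }
}

/// Renders a header line followed by one left-aligned line per component.
pub fn render(rows: &[ComponentRow]) -> String {
    let mut out = format!(
        "{:<16}{:<11}{:<13}{:<8}{:<9}{}\n",
        "ID", "KIND", "TYPE", "THRPT", "I/O", "ERRORS"
    );
    for row in rows {
        let io = row
            .bytes
            .map(format_bytes)
            .unwrap_or_else(|| "-".to_string());
        out.push_str(&format!(
            "{:<16}{:<11}{:<13}{:<8}{:<9}{}\n",
            row.id,
            row.kind,
            row.component_type,
            format_rate(row.events_per_sec),
            io,
            row.errors
        ));
    }
    out
}
```

The column widths are hard-coded for the example IDs; a real implementation would likely size columns from the longest value in each.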

I am very much open to suggestions/changes here. I want to start simple, but not in a way that will require rework in the future.

Outstanding questions

  1. Do we want to show host resource usage? It would be nice to communicate Vector's CPU, memory, disk, and network usage as gauges. This will be needed for the Web UI.
  2. Can we communicate resource usage on a per-component basis? I'm assuming not, but that would be very useful.
  3. How can we communicate backpressure clearly? Backpressure detection #892 touches on this.
  4. How about network errors (retries, failed transmissions, etc.)? In my example above I have all of this bucketed under a generic "errors" column, but I'm not sure that's the most helpful.
  5. Finally, the big one: how do we communicate between the Vector binary and a live, running Vector instance? This should be done in a way that will unblock the Web UI.

Future concerns

  1. Future unknowns. It is very likely we'll want to add/remove/change the data as we progress, and it should be easy to do so. If it is not easy, we should consider a syntax that makes it easy to query data.
binarylogic added the domain: cli, domain: observability, event type: metric, needs: rfc, have: should, and type: feature labels on Jul 26, 2020
binarylogic changed the title from "New top subcommand v1 RFC" to "New top subcommand v1" on Jul 26, 2020
binarylogic added the domain: data model and domain: metrics labels and removed the event type: metric label on Aug 6, 2020
leebenson removed the needs: rfc label on Sep 28, 2020
@leebenson
Member

@binarylogic - I'm wondering if vector top should default to showing a snapshot of the stats, and then exit... and instead, we could have an explicit -f to 'follow' the stats that auto-update on the supplied/default --refresh-interval?

I think the common case of top will be to get an at-a-glance view of topology + stats. An explicit -f would separate out the behavior of the console prompt not returning.

@binarylogic
Contributor Author

@leebenson I don't think so. When I think top, I think about a persistent, updating interface. If we want to print stats we can offer a vector stats command. That'll likely print different output, take a window argument to get averages, etc.

@leebenson
Member

Got it, thanks for clarifying 👍

@leebenson
Member

leebenson commented Oct 7, 2020

@binarylogic - I think I'm hitting a wall with what we're able to currently show in the console. I chatted about this a little earlier with @jamtur01. Interested to get your thoughts.

Blockers:

  1. We don't currently collect internal events/metrics against an individual source. We emit structs such as GeneratorEventProcessed to determine the type of event, but not where it happened. To aggregate stats/metrics by source, I think we'd need to modify all emit! paths to take an ID of the source/sink. This should be relatively straightforward, since SourceConfig.build() already takes a name: str -- so it should (mostly) just be a case of passing that through to the inner methods. There may be code paths where we don't have the context. I need to dig in further.

  2. Collecting stats using get_controller(), by extension, also lacks topology context. We can collect eventsProcessed or bytesProcessed -- but we don't know where they came from. Some work may be required to further split stats by ID.

  3. I can't see any obvious groundwork for some of the stats shown as examples in the task description. The only results for "latency" are in tests against certain sources. I'm only just getting acquainted with internal events, so I may have missed something, but I don't see any internal concepts for throughput, latency, etc. Is work here ongoing?

  4. There are specific events such as PrometheusParseError and PrometheusErrorResponse, but it's not clear how these should be aggregated to determine an 'errors' stat, or how we'd host these specific stats in a table where for other rows, these may not be applicable.
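
To illustrate blocker 1, here is a rough sketch of what passing a component ID through the emit path could look like. Everything in it is hypothetical: `EventProcessed`, `ComponentStats`, and `StatsRegistry` are stand-ins rather than Vector's internal event types, and the real `emit!` macro is not modelled.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an internal event; not Vector's actual type.
pub struct EventProcessed {
    pub byte_size: u64,
}

/// Totals kept per component.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ComponentStats {
    pub events_processed: u64,
    pub bytes_processed: u64,
}

/// Illustrative registry keyed by the component name given at build time.
#[derive(Default)]
pub struct StatsRegistry {
    by_component: HashMap<String, ComponentStats>,
}

impl StatsRegistry {
    /// Records an event against the component that produced it.
    pub fn emit(&mut self, component_id: &str, event: &EventProcessed) {
        let stats = self
            .by_component
            .entry(component_id.to_string())
            .or_default();
        stats.events_processed += 1;
        stats.bytes_processed += event.byte_size;
    }

    /// Stats for a single component, if it has emitted anything.
    pub fn get(&self, component_id: &str) -> Option<&ComponentStats> {
        self.by_component.get(component_id)
    }

    /// Topology-wide totals, i.e. what is available without an ID.
    pub fn totals(&self) -> ComponentStats {
        let mut total = ComponentStats::default();
        for stats in self.by_component.values() {
            total.events_processed += stats.events_processed;
            total.bytes_processed += stats.bytes_processed;
        }
        total
    }
}
```

The `totals()` method corresponds to what can be collected without topology context, while `get()` is the per-component view the table needs.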


Based on the above, I think there are a couple of potential next steps:

  1. Attempt to augment existing stats with an ID, and pull them out based on that same ID -- to retain a similar layout to the task description. We're still missing the example columns, but we should be able to pull out obvious stats like events/bytes processed, and a few others.

  2. Defer aggregation by ID, and just dump out the high-level stats. The console can still update with new data -- but it's not related to any individual source/sink.
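
Either option would still need to turn counters into a rate on each refresh. A minimal sketch, assuming the counters are cumulative and sampled at known offsets; `Sample` and `throughput` are hypothetical names, not existing Vector code.

```rust
use std::time::Duration;

/// A cumulative counter reading taken at some offset since startup.
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    pub at: Duration,
    pub events_processed: u64,
}

/// Events per second between two samples.
/// Returns `None` if no time elapsed, time went backwards, or the counter reset.
pub fn throughput(prev: Sample, next: Sample) -> Option<f64> {
    let elapsed = next.at.checked_sub(prev.at)?;
    if elapsed.is_zero() {
        return None;
    }
    let delta = next.events_processed.checked_sub(prev.events_processed)?;
    Some(delta as f64 / elapsed.as_secs_f64())
}
```

For example, two samples taken 500ms apart with 2,600 more events in the second one give 5,200 events per second.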

What do you think?

@binarylogic
Contributor Author

@leebenson

> We don't currently collect internal events/metrics against an individual source. We emit structs such as GeneratorEventProcessed to determine the type of event, but not where it happened.

#4181 should include all span context as metrics tags. This includes component_kind, component_type and component_id.

> The only results for "latency" are in tests against certain sources.

Yep, let's defer this column for now. I opened #3445 and never got a response. My hope is that we can get certain metrics for free, like how long an event spent in a component.

> it's not clear how these should be aggregated to determine an 'errors' stat

We have a processing_errors metric that tracks this.

Let me know if that helps. We'll get #4181 merged shortly.
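
Once metrics carry those tags, building the table rows could be a simple group-by. A rough sketch, assuming each sample exposes its tags as a string map: `MetricSample`, `Row`, and `rows_by_component` are hypothetical, and the `events_processed` metric name is an assumption (only `processing_errors` and the three tag names come from this thread).

```rust
use std::collections::BTreeMap;

/// Hypothetical metric sample carrying the span context as tags.
pub struct MetricSample {
    pub name: String,
    pub value: u64,
    pub tags: BTreeMap<String, String>,
}

impl MetricSample {
    pub fn new(name: &str, value: u64, tags: &[(&str, &str)]) -> Self {
        Self {
            name: name.to_string(),
            value,
            tags: tags
                .iter()
                .map(|(k, v)| (k.to_string(), v.to_string()))
                .collect(),
        }
    }
}

/// Per-component values for the table.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Row {
    pub kind: String,
    pub component_type: String,
    pub events_processed: u64,
    pub errors: u64,
}

/// Groups samples by `component_id`; samples without that tag are skipped.
pub fn rows_by_component(samples: &[MetricSample]) -> BTreeMap<String, Row> {
    let mut rows: BTreeMap<String, Row> = BTreeMap::new();
    for sample in samples {
        let id = match sample.tags.get("component_id") {
            Some(id) => id.clone(),
            None => continue,
        };
        let row = rows.entry(id).or_default();
        if let Some(kind) = sample.tags.get("component_kind") {
            row.kind = kind.clone();
        }
        if let Some(component_type) = sample.tags.get("component_type") {
            row.component_type = component_type.clone();
        }
        match sample.name.as_str() {
            // Assumed metric name for processed events.
            "events_processed" => row.events_processed += sample.value,
            // Metric named in the comment above.
            "processing_errors" => row.errors += sample.value,
            _ => {}
        }
    }
    rows
}
```

Using a `BTreeMap` keeps rows sorted by ID, so the table order stays stable between refreshes.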

@leebenson
Member

Thanks, #4181 should help a lot.

@leebenson
Member

Closing and removing the points estimate on this, since this is now partially implemented and being tracked more specifically across multiple issues.

3 participants