
feat(new source): add internal metrics source #1953

Merged: 32 commits into master from metrics-for-real on Apr 8, 2020
Conversation

@lukesteensen (Member) commented Feb 27, 2020

Closes #1761
Closes #68
Ref #230

This is a spike (i.e. still rough) implementation of a system for collecting internal metrics and running them through a "normal" vector source.

It uses the metrics crate as the API for instrumentation, not the tracing-based system we were using before. This was not my first choice, but I believe it's the best solution for right now, as well as forwards-compatible with things we'll want to do later.

A big benefit of tracing is its ability to propagate contextual information about where you are in the system. We use this extensively in our logging to tell you the type and name of the current component without having to manually pass that data around everywhere just for log messages. This same functionality would be very beneficial for naming and tagging metric data, so a tracing-based solution is what we started with and what I spent quite a bit of time exploring.

Unfortunately, the state of using tracing to collect metrics is extremely early right now. What we want to do is definitely possible, but both the exact implementation and the public API are still undergoing a lot of experimentation. This seems like an area we could jump in and help advance the state of the art, but my current read is that we need a metrics system sooner than that work will be ready, even with our help.

Therefore, I've opted to go with a more traditional metrics library (metrics) as the basis of the system for right now. This will let us get started instrumenting our components much more quickly, but it's also flexible enough that we should be able to slide in a tracing-based implementation behind the same interface down the line.

The primary disadvantage of the non-tracing system right now is that we don't have a good way to tag metrics with the configured name of the current component. Once a tracing-based implementation solidifies, however, we should be able to get that additional data with no change to the existing instrumentation sites.
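
For reference, a minimal sketch of what an instrumentation site looks like under this approach, modeled on the internal event pattern in this PR (the event name and tags below are illustrative, not the final scheme):

use metrics::counter;

// Illustrative internal event struct; the real ones live under src/internal_events/.
#[derive(Debug)]
pub struct TcpConnectionEstablished;

impl TcpConnectionEstablished {
    // Emit a counter with manually supplied tags. Until a tracing-based
    // backend exists, the configured component name can't be attached
    // automatically and has to be passed (or omitted) by hand.
    pub fn emit_metrics(&self) {
        counter!("tcp_connections_established", 1,
            "component_kind" => "sink",
            "component_type" => "tcp"
        );
    }
}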

Still to be done:

  1. Better name for the source (internal_metrics?)
  2. Don't require the user to configure it before getting anything out (e.g. inject default pipeline somewhere)
  3. Carry labels over into our metric event tags
  4. Make publish interval configurable
  5. General cleanup and polish
  6. Go through and add metrics everywhere

Sorry for the wall of text, but this turned out to be significantly more complex than I anticipated. Let me know what you think!

@lukesteensen added the type: new feature, domain: observability, and needs: approval labels on Feb 27, 2020
@lukesteensen added this to the Improve observability milestone on Feb 27, 2020
@LucioFranco (Contributor) left a comment

This looks fantastic!

Review threads (resolved): lib/tracing-metrics/src/lib.rs, src/sources/internal.rs (×3)
@binarylogic (Contributor) commented
From planning: looking for feedback around the internal API before we start collecting metrics throughout Vector.

@binarylogic assigned ghost on Mar 9, 2020
@binarylogic requested a review from a user on March 9, 2020
@binarylogic (Contributor) commented Mar 13, 2020

@Hoverbear @a-rodin - would you mind reviewing this? This is probably our #1 requested feature from customers and I want to unblock this work. This doesn't mean rush the review 😄 .

The only concern I have is around sharing context with metrics. As long as we have a clear path for that then I'm ok getting started without that. We'll just have to manually make sure all of the tags are formatted properly.

@lukesteensen marked this pull request as ready for review on March 13, 2020

@lukesteensen (Member, Author) commented
The only concern I have is around sharing context with metrics. As long as we have a clear path for that then I'm ok getting started without that. We'll just have to manually make sure all of the tags are formatted properly.

Yes, this is the major downside. As I've started instrumenting more things, the need has become clearer. I still think we should get a basic set of metrics up and running ASAP, but work on a tracing backend should probably be a priority pretty soon afterward.

@github-actions bot commented Mar 13, 2020

Great PR! Please pay attention to the following items before merging:

Files matching Cargo.lock:

  • Has at least one other team member approved the dependency changes?

This is an automatically generated QA checklist based on modified files

@binarylogic (Contributor) commented
Agree. We can still add labels manually, right? I haven't seen any in the examples you've added.

Also, we should agree on a naming and tagging scheme and then just inventory all of it. We'll have to do this for the docs anyway and I'm happy to do the heavy lifting on that. Otherwise, this will become a mess very fast. I found the Prometheus naming guidelines interesting, but I'm not sure that fits well with Vector. Notice how they recommend very general naming with the use of tags. Ex: tcp_sent_bytes{component_type="sinks.vector"}. I don't completely agree with this style for Vector, but I understand the appeal. Graphing all bytes sent over TCP, and then grouping by component_type, is trivial this way. It would not be as easy with separate names.

@lukesteensen (Member, Author) commented
Agree. We can still add labels manually, right? I haven't seen any in the examples you've added.

Yes, the examples I've added so far have been more classic statsd-style keys. As I mentioned in #2007, we can and maybe should move some of the segments to tags/labels.
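
For illustration, a hypothetical sketch of the two naming styles for the same measurement using the metrics macros (the names and tags here are made up for the example):

use metrics::counter;

// Hypothetical helper, purely to show the two styles side by side.
fn record_sent_bytes(bytes: u64) {
    // Statsd-style: everything encoded in the metric key.
    counter!("vector_tcp_sink_sent_bytes_total", bytes);

    // Prometheus-style: a general key, with the detail moved into tags.
    // Graphing all bytes sent over TCP and grouping by component_kind
    // becomes a trivial query with this shape.
    counter!("sent_bytes_total", bytes,
        "protocol" => "tcp",
        "component_kind" => "sink"
    );
}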

Also, we should agree on a naming and tagging scheme and then just inventory all of it.

I have a rough proposal here with a big checklist. Once we settle on a naming scheme, we can edit that into a big list of specific metrics and start working our way through it. So let me know what you all think!

@LucioFranco (Contributor) left a comment

looking great! I am really happy with how this is turning out :)

Review threads (resolved): src/sinks/blackhole.rs, src/sinks/util/tcp.rs

let labels = key
.labels()
.map(|label| (String::from(label.key()), String::from(label.value())))
A Contributor commented:

This is unfortunate. I feel like labels don't really change that much, so it would be nice if we could somehow do some sort of ref counting here or have some sort of pool of labels. Not a priority, just thinking.

@lukesteensen (Member, Author) replied:

Agree, I can open another issue for this.

@binarylogic modified the milestone: Improve observability on Mar 16, 2020
Review threads (resolved): Cargo.toml (×3), src/metrics.rs
@Hoverbear (Contributor) left a comment

Is the choice to use quotes on the measured metrics deliberate? I think it might be possible to support unquoted metrics if it's desirable.

lukesteensen added several commits, each signed off by Luke Steensen <luke.steensen@gmail.com>
@binarylogic self-requested a review on April 8, 2020

@binarylogic (Contributor) left a comment

This looks like a good start. I don't want to bikeshed on the metric names given that we'll address that in a separate follow-up PR.

@LucioFranco (Contributor) left a comment

LGTM, no blockers.

use metrics::counter;

#[derive(Debug)]
pub struct PrometheusRequestCompleted;
A Contributor commented:

Minor nit, but maybe consider making this PrometheusRequestCompleted {} to kinda align with our other structs. I personally find plain structs kinda confusing when reading code.

Review threads (resolved): src/internal_events/syslog.rs, src/internal_events/tcp.rs (×2)

fn emit_metrics(&self) {
counter!("tcp_connections_established", 1,
"component_kind" => "sink",
A Contributor commented:

I'm assuming these don't have a type because they can be created from many different types of sinks and we don't have that metadata here?

@lukesteensen (Member, Author) replied:

Correct. This is where we'll need to lean on tracing metadata.

.send_all(futures01::stream::iter_ok(metrics))
.compat()
.await
.expect("sending??");
A Contributor commented:

This can only fail if the topology task gets shut down while we are sending, so I think we can ignore the error here instead of panicking?
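
A rough sketch of that alternative, assuming the surrounding async context and using `out` as a placeholder name for the sink the metrics are forwarded into:

// Hypothetical rewrite of the snippet above: a send error here only means
// the topology task has already shut down, so drop it instead of panicking.
let _ = out
    .send_all(futures01::stream::iter_ok(metrics))
    .compat()
    .await;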

lukesteensen and others added 2 commits April 8, 2020 15:18
Co-Authored-By: Lucio Franco <luciofranco14@gmail.com>
Signed-off-by: Luke Steensen <luke.steensen@gmail.com>
@lukesteensen changed the title from "feat(new source): spike out internal metrics source" to "feat(new source): add internal metrics source" on Apr 8, 2020
@lukesteensen merged commit 5111f3a into master on Apr 8, 2020
@lukesteensen deleted the metrics-for-real branch on April 8, 2020
@Hoverbear unassigned ghost on May 25, 2020
@Hoverbear removed the request for review from a user on May 25, 2020
@binarylogic added the type: feature label and removed the type: new feature label on Jun 16, 2020
@binarylogic removed this from the Vector Observability milestone on Jul 20, 2020
Labels
domain: observability, needs: approval, type: feature
Development

Successfully merging this pull request may close these issues:

  • Internal observability spike
  • New vector_metrics source

4 participants