feat(new source): add internal metrics source #1953

lukesteensen · 2020-02-27T22:59:51Z

Closes #1761
Closes #68
Ref #230

This is a spike (i.e. still rough) implementation of a system for collecting internal metrics and running them through a "normal" vector source.

It uses the metrics crate as the API for instrumentation, not the tracing-based system we were using before. This was not my first choice, but I believe it's the best solution for right now, as well as forwards-compatible with things we'll want to do later.

A big benefit of tracing is its ability to propagate contextual information about where you are in the system. We use this extensively for our logging to tell you what type and the name of the current component without having to manually pass that data around everywhere just for log messages. This same functionality would be very beneficial for naming and tagging metric data, so a tracing based solution is what we started with and what I spent quite a bit of time exploring.

Unfortunately, the state of using tracing to collect metrics is extremely early right now. What we want to do is definitely possible, but both the exact implementation and the public API are still undergoing a lot of experimentation. This seems like an area we could jump in and help advance the state of the art, but my current read is that we need a metrics system sooner than that work will be ready, even with our help.

Therefore, I've opted to go with a more traditional metrics library (metrics) as the basis of the system for right now. This will let us get started instrumenting our components much more quickly, but it's also flexible enough that we should be able to slide in a tracing-based implementation behind the same interface down the line.

The primary disadvantage of the non-tracing system right now is that we don't have a good way to tag metrics with the configured name of the current component. Once a tracing-based implementation solidifies, however, we should be able to get that additional data with no change to the existing instrumentation sites.

Still to be done:

Better name for the source (internal_metrics?)
Don't require the user to configure it before getting anything out (e.g. inject default pipeline somewhere)
Carry labels over into our metric event tags
Make publish interval configurable
General cleanup and polish
Go through and add metrics everywhere

Sorry for the wall of text, but this turned out to be significantly more complex than I anticipated. Let me know what you think!

LucioFranco

This looks fantastic!

lib/tracing-metrics/src/lib.rs

src/sources/internal.rs

binarylogic · 2020-03-09T14:47:47Z

From planning: looking for feedback around the internal API before we start collecting metrics throughout Vector.

binarylogic · 2020-03-13T01:33:02Z

@Hoverbear @a-rodin - would you mind reviewing this? This is probably our #1 requested feature from customers and I want to unblock this work. This doesn't mean rush the review 😄 .

The only concern I have is around sharing context with metrics. As long as we have a clear path for that then I'm ok getting started without that. We'll just have to manually make sure all of the tags are formatted properly.

lukesteensen · 2020-03-13T02:13:27Z

The only concern I have is around sharing context with metrics. As long as we have a clear path for that then I'm ok getting started without that. We'll just have to manually make sure all of the tags are formatted properly.

Yes, this is the major downside. As I've started instrumenting more things, the need has become clearer. I still think we should get a basic set up metrics up and running ASAP, but work on a tracing backend should probably be a priority pretty soon afterward.

github-actions · 2020-03-13T03:06:26Z

Great PR! Please pay attention to the following items before merging:

Files matching Cargo.lock:

Has at least one other team member approved the dependency changes?

This is an automatically generated QA checklist based on modified files

binarylogic · 2020-03-13T03:08:03Z

Agree. We can still add labels manually, right? I haven’t see any in the examples you’ve added.

Also, we should agree on a naming and tagging scheme and then just inventory all of it. We’ll have to do this for the docs anyway and I’m happy to do the heavy lifting on that. Otherwise, this will become a mess very fast. I found the Prometheus naming guidelines interesting, but I’m not sure that fits well with Vector. Notice how they recommend very general naming with the use of tags. Ex: ‘tcp_sent_bytes{component_type=‘sinks.vector’}’. I don’t completely agree with this style for Vector, but I understand the appeal. Graphing all bytes sent over TCP, and then grouping by ‘component_type’, is trivial this way. It would not be as easy with separate names.

lukesteensen · 2020-03-13T03:56:44Z

Agree. We can still add labels manually, right? I haven’t see any in the examples you’ve added.

Yes, the examples I've added so far have been more classic statsd-style keys. As I mentioned in #2007, we can and maybe should move some of the segments to tags/labels.

Also, we should agree on a naming and tagging scheme and then just inventory all of it.

I have a rough proposal here with a big checklist. Once we settle on a naming scheme, we can edit that into a big list of specific metrics and start working our way through it. So let me know what you all think!

LucioFranco

looking great! I am really happy with how this is turning out :)

src/sinks/blackhole.rs

src/sinks/util/tcp.rs

LucioFranco · 2020-03-13T16:17:12Z

src/sources/internal_metrics.rs

+
+    let labels = key
+        .labels()
+        .map(|label| (String::from(label.key()), String::from(label.value())))


this is unfortunate, I feel like labels don't really change that much would be nice if we could somehow do some sort of ref counting here or have some sort of pool of labels. Not a priority but just thinking.

Agree, I can open another issue for this.

Cargo.toml

src/metrics.rs

Hoverbear

Is the choice to use quotes on the measured metrics deliberate? I think it might be possible to support unquoted metrics if its desirable.

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>

binarylogic

This looks like a good start. I don't want to bikeshed on the metric names given that we'll address that in a separate follow up PR.

LucioFranco

LGTM, no blockers.

LucioFranco · 2020-04-08T14:06:26Z

src/internal_events/prometheus.rs

+use metrics::counter;
+
+#[derive(Debug)]
+pub struct PrometheusRequestCompleted;


minor nit: but maybe consider making this PrometheusRequestCompleted {} to kinda align with our other structs. I personally find plain structs kinda confusing when reading code.

src/internal_events/syslog.rs

src/internal_events/tcp.rs

LucioFranco · 2020-04-08T14:28:15Z

src/internal_events/tcp.rs

+
+    fn emit_metrics(&self) {
+        counter!("tcp_connections_established", 1,
+            "component_kind" => "sink",


I'm assuming these don't have a type because they can be crated from many different types of sinks and we don't have that metadata here?

Correct. This is where we'll need to lean on tracing metadata.

LucioFranco · 2020-04-08T14:30:43Z

src/sources/internal_metrics.rs

+            .send_all(futures01::stream::iter_ok(metrics))
+            .compat()
+            .await
+            .expect("sending??");


This can only fail if the topology task gets shutdown while we are sending, so I think we can ignore that here, instead of panicking?

Co-Authored-By: Lucio Franco <luciofranco14@gmail.com>

Signed-off-by: Luke Steensen <luke.steensen@gmail.com>

lukesteensen added type: new feature domain: observability Anything related to monitoring/observing Vector needs: approval Needs review & approval before work can begin. labels Feb 27, 2020

lukesteensen added this to the Improve observability milestone Feb 27, 2020

lukesteensen requested review from Hoverbear and LucioFranco February 27, 2020 22:59

binarylogic assigned Hoverbear and LucioFranco Feb 27, 2020

LucioFranco reviewed Feb 27, 2020

View reviewed changes

lib/tracing-metrics/src/lib.rs Show resolved Hide resolved

src/sources/internal.rs Outdated Show resolved Hide resolved

src/sources/internal.rs Outdated Show resolved Hide resolved

src/sources/internal.rs Outdated Show resolved Hide resolved

binarylogic assigned ghost Mar 9, 2020

binarylogic requested a review from a user March 9, 2020 14:54

lukesteensen marked this pull request as ready for review March 13, 2020 02:09

lukesteensen force-pushed the metrics-for-real branch from 74ce4f8 to 4d48d69 Compare March 13, 2020 03:06

lukesteensen mentioned this pull request Mar 13, 2020

Specification for metrics collection #2007

Closed

LucioFranco reviewed Mar 13, 2020

View reviewed changes

binarylogic mentioned this pull request Mar 14, 2020

RFC for event driven observability #2064

Closed

binarylogic modified the milestone: Improve observability Mar 16, 2020