
enhancement(datadog_metrics sink): rewrite to the new model + add sketch support #9178

Merged: 32 commits merged into master from tobz/datadog-metrics-sink-electric-boogaloo on Oct 29, 2021

Conversation

@tobz (Contributor) commented on Sep 16, 2021

This PR is an amalgamation of a few key desired outcomes that are all inextricably tied together:

  • the desire to send aggregated histograms to Datadog (used by our internal telemetry, and sources like Prometheus)
  • the desire to rewrite the sink in the new style (performance, consistency, etc)
  • the desire to support metric sketches, specifically DDSketch, to allow for upcoming work around acting as a forwarder for all Datadog Agent traffic, representing logs, traces, and metrics in a lossless fashion

All of that, and more, happens in this PR. Because the changes are many, I've listed them below section by section.

Refactored to the new model for sinks

This one is self-explanatory. Prior to this PR, the sink was written in the old style, using BatchedHttpSink and all of that old machinery. No longer. We've also reworked the module/import paths and feature flags a bit to better separate the Datadog sinks into their own module paths, matching the other components.

Implemented a version of DDSketch based on Datadog Agent

As part of wanting to make sure we faithfully represent the data sent to us by the Datadog Agent, we wanted to make sure our DDSketch implementation would interoperate. As there is no official (yet!) Rust crate for DDSketch that mirrors the specific implementation details of the Datadog Agent's version (namely, Go-specific quirks and the fact that it diverged from the open-source versions), I decided to port it over in its entirety.

It mirrors the Agent's API, which was designed primarily for batch insertion, with support for interpolating histogram buckets. This is what unlocks being able to use it to convert an aggregated histogram into a sketch. Additionally, we have fairly comprehensive unit tests to assert that it matches the behavior of the Datadog Agent, such that any sketch coming from the agent through Vector should be emitted identically (so long as nothing modifies it in the pipeline, obviously).
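For readers unfamiliar with DDSketch, here is a minimal, self-contained toy version of the core idea (logarithmically spaced buckets giving a bounded relative error on quantiles). It is deliberately not the Agent-compatible implementation from this PR: it skips the Agent's exact bucket mapping, key offsets, interpolation, and negative/zero handling.

```rust
use std::collections::BTreeMap;

// Toy DDSketch-style structure: values map to logarithmically spaced buckets, so
// any reported quantile is within a bounded *relative* error of the true value.
struct MiniSketch {
    gamma_ln: f64,               // ln(gamma), where gamma = (1 + alpha) / (1 - alpha)
    buckets: BTreeMap<i32, u64>, // bucket key -> count (positive values only)
    count: u64,
}

impl MiniSketch {
    fn new(alpha: f64) -> Self {
        let gamma = (1.0 + alpha) / (1.0 - alpha);
        Self { gamma_ln: gamma.ln(), buckets: BTreeMap::new(), count: 0 }
    }

    fn insert(&mut self, value: f64) {
        assert!(value > 0.0, "toy sketch only handles positive values");
        // Bucket key k satisfies gamma^(k-1) < value <= gamma^k.
        let key = (value.ln() / self.gamma_ln).ceil() as i32;
        *self.buckets.entry(key).or_insert(0) += 1;
        self.count += 1;
    }

    fn quantile(&self, q: f64) -> Option<f64> {
        if self.count == 0 {
            return None;
        }
        let rank = (q * (self.count - 1) as f64).floor() as u64;
        let mut seen = 0u64;
        for (&key, &n) in &self.buckets {
            seen += n;
            if seen > rank {
                // Representative value for the bucket: gamma^key (its upper bound).
                return Some((key as f64 * self.gamma_ln).exp());
            }
        }
        None
    }
}

fn main() {
    let mut sketch = MiniSketch::new(0.01);
    for v in 1..=1000 {
        sketch.insert(v as f64);
    }
    // The reported p99 comes back within a small relative error of the true value (990).
    println!("p99 ~= {:?}", sketch.quantile(0.99));
}
```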

I had a lot of "fun" implementing this. Learning about the nitty-gritty of Go's operator precedence as I ported over math.RoundToEven was a particular high point. I assume you'll read all the code comments I left and see just how much "fun" I had. :)
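For anyone curious, round-half-to-even (the behavior Go's math.RoundToEven provides) looks roughly like this in Rust; this is just a toy illustration of the semantics, not the code from this PR:

```rust
// Round to the nearest integer, breaking exact .5 ties toward the even neighbor.
fn round_to_even(x: f64) -> f64 {
    let rounded = x.round(); // f64::round breaks ties away from zero
    // For exact halves that rounded to an odd integer, nudge back to the even one.
    if (x - x.trunc()).abs() == 0.5 && rounded % 2.0 != 0.0 {
        rounded - x.signum()
    } else {
        rounded
    }
}

fn main() {
    assert_eq!(round_to_even(0.5), 0.0);
    assert_eq!(round_to_even(1.5), 2.0);
    assert_eq!(round_to_even(2.5), 2.0);
    assert_eq!(round_to_even(-0.5), 0.0);
    println!("round-half-to-even behaves as expected");
}
```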

Added support for sketches natively in Metric

Since we want to be able to shuttle around sketches natively and in a lossless fashion, we have to be able to represent them internally as such... which necessitated adding a new variant to MetricValue to support them. Currently, we only support the AgentDDSketch variant, but this may be expanded or changed in the future to support more sketch types, or to support them generically by specifying the layout of the sketch.

OpenTelemetry takes an approach like this, since all sketches are effectively representable as histograms with variable-boundary buckets plus a set of values that define the bucket layout. This would be simpler than explicitly caring about one sketch implementation vs another, but it means having to add enough support to interoperate between the various options. We'll pay attention to how OpenTelemetry proceeds here so that we can better align ourselves in the future.
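A rough sketch of the new variant's shape is below; the names mirror the description above, but the exact field layouts in vector-core differ and are elided here.

```rust
// Illustrative only: the real definitions live in vector-core and carry more
// variants and data than shown.
pub struct AgentDDSketch {
    // bins, counts, and summary statistics (min/max/sum/avg/count) in the real type
}

pub enum MetricSketch {
    // Today only the Agent-compatible DDSketch is supported; other sketch types
    // (or a generic, layout-described sketch) could be added as new variants later.
    AgentDDSketch(AgentDDSketch),
}

pub enum MetricValue {
    Counter { value: f64 },
    Gauge { value: f64 },
    // ...other existing variants elided...
    Sketch { sketch: MetricSketch },
}

fn main() {}
```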

Add support for sending aggregated histograms to Datadog

After doing the above two things, we can now send aggregated histograms to Datadog by virtue of converting them to sketches first. We additionally added logic to convert distributions to sketches, as this is what the agent itself does. There's no distinct "sketch" type at the agent level, since the agent uses sketches to represent distributions... but since we might be shuttling these native data representations around internally, we have both a distribution and a sketch type.
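To make the histogram-to-sketch step concrete, here's a hedged illustration of the bucket-interpolation idea: each bucket contributes synthetic values spread across its bounds, and those values are then inserted into the sketch. The real Agent/Vector interpolation is more involved; the parameter shapes and the even-spread strategy here are illustrative assumptions.

```rust
// Turn (upper_bound, count) histogram buckets into synthetic sample values by
// spreading each bucket's count evenly across [lower, upper). The first bucket's
// lower bound is assumed to be 0.0 for simplicity.
fn bucket_values(buckets: &[(f64, u64)]) -> Vec<f64> {
    let mut values = Vec::new();
    let mut lower = 0.0;
    for &(upper, count) in buckets {
        for i in 0..count {
            let fraction = (i as f64 + 0.5) / count as f64;
            values.push(lower + fraction * (upper - lower));
        }
        lower = upper;
    }
    values
}

fn main() {
    // Two buckets: 3 samples in (0, 1], 2 samples in (1, 2].
    let synthetic = bucket_values(&[(1.0, 3), (2.0, 2)]);
    // Each synthetic value would then be inserted into the DDSketch.
    println!("{:?}", synthetic);
}
```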

... and now some additional bits of information that aren't terribly germane to the desired outcomes:

New IncrementalRequestBuilder for more flexible request building

While RequestBuilder provides a simple interface for defining how to build requests from a batch of events (so long as it's configured with a codec and compression type), it does not easily allow for defining constraints around request building, such as ensuring a given payload does not exceed a certain number of bytes. Datadog specifically has limits around request payload sizes when interacting with its API, which the Datadog Agent handles by doing some more advanced request building: incrementally encoding individual series/sketches until it hits the limit, then starting a new request, and so on, until all inputs have been encoded into some number of requests.

We have done roughly the same thing here. The builder itself handles more of the heavy lifting: it is essentially given all of the events via encode_events_incrementally and is expected to give back a vector of payloads that can be fed one by one to the build_request method.
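The rough shape of that interface looks something like the following; only the two method names come from the description above, while the associated type names are assumptions, not the exact trait definition in Vector.

```rust
// Hedged sketch of the incremental request builder interface.
pub trait IncrementalRequestBuilder<Input> {
    type Metadata;
    type Payload;
    type Request;
    type Error;

    /// Encode *all* of the input events up front, returning one entry per payload.
    /// Each payload respects the size constraints (e.g. limits on uncompressed
    /// and compressed request sizes).
    fn encode_events_incrementally(
        &mut self,
        input: Input,
    ) -> Vec<Result<(Self::Metadata, Self::Payload), Self::Error>>;

    /// Turn a single encoded payload (plus its metadata) into a ready-to-send request.
    fn build_request(&mut self, metadata: Self::Metadata, payload: Self::Payload) -> Self::Request;
}
```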

We've written a tailored encoder for this PR (DatadogMetricsEncoder) that is designed around incremental encoding. As the encoding process for this sink cannot (yet) be done entirely incrementally, I wasn't able to raise the builder and encoder to a level where other components could reuse them, nor provide a default implementation that handles the boilerplate. That work is left as a future improvement.

Made RetryAction support copy-on-write values

This one touches a few non-Datadog sinks, but the premise is simple: RetryAction used to require owned String values, yet in many cases the values are static strings. There's no point in allocating if we don't have to, especially since retries already involve a ton of allocation that we want to get rid of.
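A hedged sketch of the change, assuming variants along these lines (the real enum's variants and call sites differ): a Cow<'static, str> lets static reasons avoid allocation entirely while dynamic reasons still work.

```rust
use std::borrow::Cow;

// Illustrative enum shape only; the point is the payload type, not the variants.
#[derive(Debug)]
pub enum RetryAction {
    Retry(Cow<'static, str>),
    DontRetry(Cow<'static, str>),
    Successful,
}

fn main() {
    // A static reason no longer allocates:
    let fixed = RetryAction::DontRetry(Cow::Borrowed("endpoint returned 400"));
    // Dynamic reasons still work via an owned value:
    let dynamic = RetryAction::Retry(Cow::Owned(format!("server error: {}", 503)));
    println!("{:?} / {:?}", fixed, dynamic);
}
```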

Small change to AggregatedSummary: no more upper_limit

Summaries don't have buckets, so the term "upper limit" doesn't make sense for them. They do have various quantiles, though, for which a shorthand like q does make sense. I went ahead and made that change; notably, it's a breaking change because it affects the output of metric_to_log, so I added it to the upgrade guide.

Signed-off-by: Toby Lawrence toby@nuclearfurnace.com


@tobz (Contributor, Author) commented on Sep 16, 2021

Note to self: right now the internal aggregated histograms we ship do not look right in the DD metrics UI. We should be sending absolute (monotonically increasing, at that) bucket values, but currently the bucket values fluctuate a bit, as if they're not being interpreted as gauges correctly or the data itself actually is going up and down.

@tobz force-pushed the tobz/datadog-metrics-sink-electric-boogaloo branch from e5a18c6 to e55ee49 on October 15, 2021
@tobz changed the title from "enhancement(datadog_metrics sink): add support for aggregated histograms and summaries" to "enhancement(datadog_metrics sink): rewrite to the new model + add sketch support" on Oct 25, 2021
```
@@ -440,8 +444,8 @@ sources-apache_metrics = []
sources-aws_ecs_metrics = []
sources-aws_kinesis_firehose = ["base64", "infer", "sources-utils-tls", "warp", "codecs"]
sources-aws_s3 = ["rusoto", "rusoto_s3", "rusoto_sqs", "semver", "uuid", "codecs", "zstd"]
sources-datadog = ["snap", "sources-utils-tls", "warp", "sources-utils-http-error", "codecs"]
sources-dnstap = ["base64", "data-encoding", "trust-dns-proto", "dnsmsg-parser", "tonic-build", "prost-build"]
sources-datadog_agent = ["snap", "sources-utils-tls", "warp", "sources-utils-http-error", "protobuf-build", "codecs"]
```
@tobz (Contributor, Author) commented:
Why add this? Because we'll need the proto to be compiled when we accept sketches from the agent, so I'm just doing it now, knowing we're going to need it anyway.

```rust
    })
}
MetricValue::Sketch { sketch } => {
    let quantiles = [0.5, 0.75, 0.9, 0.99]
```
@tobz (Contributor, Author) commented:
These are the same quantiles that Datadog supports when you enable percentile aggregation for a given distribution, which seemed like a reasonable thing to follow.

@blt (Contributor) left a comment:
This is on the whole excellent. I have mostly small nits, and my requested change is based solely on the property test comment.

Well done.

Review comments were left on:
lib/vector-core/Cargo.toml
lib/vector-core/proto/event.proto
lib/vector-core/src/event/metric.rs
lib/vector-core/src/event/test/common.rs
lib/vector-core/src/metrics/ddsketch.rs
src/sinks/datadog/metrics/config.rs
@tobz force-pushed the tobz/datadog-metrics-sink-electric-boogaloo branch from e0a3b3f to 69b2a33 on October 28, 2021
@tobz enabled auto-merge (squash) on October 29, 2021 at 15:56
@tobz disabled auto-merge on October 29, 2021 at 19:17
@tobz merged commit 4db23b3 into master on Oct 29, 2021
@tobz deleted the tobz/datadog-metrics-sink-electric-boogaloo branch on October 29, 2021 at 19:21
lucperkins pushed a commit that referenced this pull request Nov 2, 2021
…tch support (#9178)

* enhancement(datadog_metrics sink): massive rewrite of the sink

some of the included changes:
- support for aggregated summaries (quantiles as discrete series)
- support for aggregated histograms (via conversion to sketches)
- native sketch support in Vector! and for sending to DD!
- sink rewritten in the "new" style
- many other core changes to support the work: metric-specific sink builder extension methods, incremental request builder, etc
github-merge-queue bot pushed a commit that referenced this pull request Jun 30, 2023
## Context

When support was added for encoding/sending sketches in #9178, logic was
added to handle "splitting" payloads if a metric exceeded the
(un)compressed payload limits. As we lacked (at the time) the ability to
encode sketch metrics one-by-one, we were forced to collect all of them,
and then attempt to encode them all at once, which had a tendency to
grow the resulting payload size past the (un)compressed payload limits. This
"splitting" mechanism allowed us to compensate for that.

However, in order to avoid getting stuck in pathological loops where
payloads were too big, and thus required multiple splits (after already
attempting at least one split), the logic was configured such that a
batch of metrics would only be split once, and if the two subsequent
slices couldn't be encoded without also exceeding the limits, they would
be dropped and we would give up trying to split further.

Despite the gut feeling during that work that it should be exceedingly
rare to ever need to split further, real life has shown otherwise:
#13175

## Solution

This PR introduces proper incremental encoding of sketches, which
doesn't eliminate the possibility of needing to split (more below) but
significantly reduces the likelihood that splitting will need to happen
down to a purely theoretical level.

We're taking advantage of hidden-from-docs methods in `prost` to encode
each `SketchPayload` object and append the bytes into a single buffer.
This is possible due to how Protocol Buffers functions. Additionally,
we're now generating "file descriptors" for our compiled Protocol
Buffers definitions. We use this to let us programmatically query the
field number of the "sketches" field in the `SketchPayload` message,
which is a slightly more robust way than just hardcoding it and hoping
it doesn't ever change in the future.
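For illustration, querying that field number out of a compiled file descriptor set might look roughly like the following. It assumes a `FileDescriptorSet` has already been decoded from the bytes `prost-build` can emit alongside the generated code; the message and field names come from the description above, but the surrounding plumbing is not Vector's actual code.

```rust
use prost_types::FileDescriptorSet;

// Look up the wire tag of `SketchPayload.sketches` from the compiled descriptors
// instead of hardcoding it. Illustrative sketch only; nested messages are ignored.
fn sketches_field_number(descriptors: &FileDescriptorSet) -> Option<i32> {
    descriptors
        .file
        .iter()
        .flat_map(|file| file.message_type.iter())
        .find(|message| message.name() == "SketchPayload")?
        .field
        .iter()
        .find(|field| field.name() == "sketches")
        .and_then(|field| field.number)
}
```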

In Protocol Buffers, each field in a message is written out such that
the field data is preceded by the field number. This is part and parcel
of its ability to allow for backwards-compatible changes to a
definition. Further, for repeated fields -- i.e. `Vec<Sketch>` -- the
repetition is expressed simply by writing the same field multiple
times rather than needing to write everything all together. Practically
speaking, this means that we can encode a vector of two messages, or
encode those two messages individually, and end up with the same encoded
output of `[field N][field data][field N][field data]`.
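A hedged demonstration of that property using prost; the message types here are hypothetical stand-ins, not Vector's real generated `SketchPayload`.

```rust
use prost::Message;

// Stand-in message types showing the wire-format property: encoding elements of
// a repeated field one at a time and concatenating the bytes produces the same
// output as encoding the whole message at once.
#[derive(Clone, PartialEq, ::prost::Message)]
struct Sketch {
    #[prost(string, tag = "1")]
    name: String,
}

#[derive(Clone, PartialEq, ::prost::Message)]
struct SketchPayload {
    #[prost(message, repeated, tag = "1")]
    sketches: Vec<Sketch>,
}

fn main() {
    let a = Sketch { name: "a".into() };
    let b = Sketch { name: "b".into() };

    // Encode both sketches in one message...
    let all_at_once = SketchPayload { sketches: vec![a.clone(), b.clone()] }.encode_to_vec();

    // ...or one at a time, appending into a shared buffer.
    let mut incremental = Vec::new();
    SketchPayload { sketches: vec![a] }.encode(&mut incremental).unwrap();
    SketchPayload { sketches: vec![b] }.encode(&mut incremental).unwrap();

    // [field 1][sketch a][field 1][sketch b] either way.
    assert_eq!(all_at_once, incremental);
}
```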

### Ancillary changes

We've additionally fixed a bug with the "bytes sent" metric being
reported for the `datadog_metrics` sink due to some very tangled and
miswired code around how compressed/uncompressed/event bytes/etc sizes
were being shuttled from the request builder logic down to `Driver`.

We've also reworked some of the encoder error types just to clean them
up and simplify things a bit.

## Reviewer notes

### Still needing to handle splits

The encoder still does need to care about splits, in a theoretical
sense, because while we can accurately track and avoid ever exceeding
the uncompressed payload limit, we can't know the final compressed
payload size until we finalize the builder/payload.

Currently, the encoder does a check to see if adding the current metric
would cause us to exceed the compressed payload limit, assuming the
compressor couldn't actually compress the encoded metric at all. This is
a fairly robust check, since it accounts for the worst-case
overhead of an entirely incompressible payload, and so on... but we
really want to avoid dropping events if possible, obviously, and that's
why the splitting code is still in place.
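In rough pseudocode, the check amounts to something like the following; the limit value and overhead estimate are placeholders, not Vector's actual numbers.

```rust
// Hypothetical constants and names; the real encoder derives these from the
// Datadog intake limits and the compressor in use.
const MAX_COMPRESSED_BYTES: usize = 3_200_000;

/// Would appending `next_encoded_len` more bytes risk blowing the compressed
/// payload limit? Assume the new bytes don't compress at all, plus a small
/// per-write overhead an incompressible stream can incur.
fn would_exceed_compressed_limit(
    compressed_len_so_far: usize,
    next_encoded_len: usize,
    incompressible_overhead: usize,
) -> bool {
    compressed_len_so_far + next_encoded_len + incompressible_overhead > MAX_COMPRESSED_BYTES
}

fn main() {
    // If this returns true, the metric is not added to the current payload and a
    // new payload is started instead.
    assert!(!would_exceed_compressed_limit(1_000_000, 10_000, 64));
}
```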