Skip to content

V9 Optimization Log

Mykhailo Shevchuk edited this page Jun 6, 2026 · 2 revisions

Part of the Performance section - see Benchmarks for the overview, methodology and headline results.

Tracks the effect of each allocation/throughput optimization on the v9 benchmark suite, applied one change at a time. Every change is validated against the full unit + integration test suites before its numbers are recorded here.

How to read this

  • Machine: AMD Ryzen 9 3950X, Windows 10 (22H2), .NET 8.0.27 host. Absolute numbers are machine-specific - only the deltas within this page are meaningful.
  • Regenerate: dotnet run -c Release --project benchmarks/Serilog.Sinks.Grafana.Loki.Benchmarks.V9 for v9 alone, or dotnet fsi build.fsx -- --target Benchmark for all three suites.
  • Allocated is exact (deterministic regardless of the timing job) - treat it as the primary signal.
  • Mean at 1 000 events is noisy (large StdDev - a fixed per-flush overhead dominates). The 10 000-event rows are the reliable throughput signal.
  • Target to beat - YetAnother 4.0.5: Simple·10k = 4.63 MB / 20.67 ms; Exception·10k = 70.09 MB / 126.77 ms (see Benchmarks).

Result (TL;DR)

Ten changes (#3, #7, #2, #5, #6, #1, #8, #9, #10) applied one at a time, each validated against the full test suite. End-to-end sink, 10 000 events, vs baseline and vs YetAnother:

Workload Baseline Final Δ baseline YetAnother v9 vs YA
Simple - Allocated 19.48 MB 1.36 MB −93% 4.63 MB 3.4× less
Simple - Mean 26.60 ms 17.66 ms −34% 20.67 ms 1.17× faster
Exception - Allocated 133.57 MB 66.21 MB −50% 70.09 MB 1.06× less
Exception - Mean 188.42 ms 121.03 ms −36% 126.77 ms 1.05× faster

v9 now beats YetAnother on both workloads, on both allocation and throughput. Biggest wins: the IEqualityComparer grouping (#1, removed the per-event F# Map + structural-hash boxing) and caching ex.StackTrace (#10, halved exception allocation). All output behavior preserved (WireFormatTests + Loki roundtrip integration tests unchanged), bar one deliberate refinement: DateTime/DateTimeOffset body values now render as ISO 8601 / invariant (#7).

Summary - end-to-end sink Allocated (MB), lower is better

Step Change Simple·1k Simple·10k Exc·1k Exc·10k
- Baseline (1ca0904) 1.94 19.48 13.33 133.57
1 #3 timestamp via Utf8Formatter 1.91 19.03 13.29 133.12
2 #7 typed Guid/DateTime/DTO writes 1.79 18.11 13.30 133.12
3 #2 message via RenderMessage(TextWriter) 1.69 16.86 13.23 132.21
4 #5 Events list→array, in-place sort 1.17 11.44 12.68 126.79
5 #6 metadata written direct (no Map) 1.11 11.21 12.67 126.57
6 #1 comparer grouping (no per-event Map) 0.26 2.51 11.79 117.86
7 #8 reuse body Utf8JsonWriter 0.14 1.36 11.68 116.72
8 #9 hoist formatter dispatch (CPU only) 0.14 1.36 11.68 116.72
9 #10 cache StackTrace/Source 0.14 1.36 6.61 66.21
- YetAnother 4.0.5 (target) 0.47 4.63 7.00 70.09

Final vs YetAnother: Simple·10k 1.36 MB vs 4.63 (3.4× less); Exception·10k 66.21 MB vs 70.09 (1.06× less) - v9 now wins both.

Summary - end-to-end sink Mean (ms), lower is better

Step Change Simple·1k Simple·10k Exc·1k Exc·10k
- Baseline (1ca0904) 7.60 26.60 25.02 188.42
1 #3 timestamp via Utf8Formatter 6.24 26.26 23.21 188.00
2 #7 typed Guid/DateTime/DTO writes 6.54 26.21 23.96 191.09
3 #2 message via RenderMessage(TextWriter) 6.92 26.82 28.37 191.96
4 #5 Events list→array, in-place sort 7.54 22.97 27.64 185.83
5 #6 metadata written direct (no Map) 6.38 22.95 23.29 185.15
6 #1 comparer grouping (no per-event Map) 5.52 18.41 27.74 181.35
7 #8 reuse body Utf8JsonWriter 5.44 18.37 50.50† 182.73
8 #9 hoist formatter dispatch (CPU only) 5.40 18.60 19.75 178.08
9 #10 cache StackTrace/Source 5.35 17.66 33.52 121.03
- YetAnother 4.0.5 (target) 6.65 20.67 34.95 126.77

† Exc·1k mean is a noise outlier (that case has very high StdDev across all runs); use the Exc·10k column.

Summary - per-event formatter (1 000 events)

Step Change Simple alloc Simple mean Exc alloc Exc mean
- Baseline (1ca0904) 1.84 MB 1.200 ms 13.65 MB 12.778 ms
1 #3 timestamp via Utf8Formatter 1.84 MB 1.156 ms 13.65 MB 12.551 ms
2 #7 typed Guid/DateTime/DTO writes 1.75 MB 1.152 ms 13.65 MB 13.235 ms
3 #2 message via RenderMessage(TextWriter) 1.71 MB 1.283 ms 13.65 MB 13.066 ms
4 #5 Events list→array, in-place sort 1.71 MB 1.263 ms 13.65 MB 13.220 ms
5 #6 metadata written direct (no Map) 1.71 MB 1.238 ms 13.65 MB 13.078 ms
6 #1 comparer grouping (no per-event Map) 1.71 MB 1.245 ms 13.65 MB 12.731 ms
7 #8 reuse body Utf8JsonWriter 1.71 MB 1.250 ms 13.65 MB 12.773 ms
8 #9 hoist formatter dispatch 1.71 MB 1.245 ms 13.65 MB 12.9 ms
9 #10 cache StackTrace/Source 1.71 MB 1.239 ms 8.60 MB 7.162 ms

Planned changes (low-risk → higher-risk)

  1. #3 Utf8Formatter timestamp - replace string(ns) with Utf8Formatter.TryFormat into a stack buffer.
  2. #7 Typed WriteStringValue for Guid/DateTime/DateTimeOffset in writeValue.
  3. #2 Render Message via RenderMessage(TextWriter) into a reused UTF-8 buffer (drop the StringWriter/string).
  4. #5 LokiStream.Events: F# list → .NET list/array, in-place sort (drop Seq.sortBy+List.ofSeq cons cells).
  5. #4/#6 Kill per-event Array.fold closures + the per-event metadata Map (write metadata directly).
  6. #1 Group via IEqualityComparer<LogEvent>; build the label set once per stream not per event (drop the per-event Map + structural-hash enumerators).
  7. #8 Reuse Utf8JsonWriter instances via Reset() (body + envelope).
  8. #9 Hoist the per-event formatter type-check out of the batch loop (micro).

Change log

Baseline - 1ca0904

BenchmarkDotNet v0.15.8, AMD Ryzen 9 3950X, .NET 8.0.27, fake in-process transport.

End-to-end sink (SinkBenchmarks.Push)

EventCount Payload Mean Allocated
1000 Exception 25.019 ms 13.33 MB
1000 Simple 7.599 ms 1.94 MB
10000 Exception 188.420 ms 133.57 MB
10000 Simple 26.604 ms 19.48 MB

Per-event formatter (FormatterBenchmarks, 1 000 events)

Method Mean Allocated
Format_Simple 1.200 ms 1.84 MB
Format_Exception 12.778 ms 13.65 MB

Tests: 98/98 unit · 11/11 integration ✅

Step 1 - #3 timestamp via Utf8Formatter

Serialization.fs: replaced WriteStringValue(string (toUnixNanoseconds ts)) (a per-event System.String from the F# string operator) with Utf8Formatter.TryFormat into a stack buffer (NativePtr.stackalloc<byte> 20, hoisted once per batch out of the event loop) written via WriteStringValue(ReadOnlySpan<byte>). Zero allocation, culture-invariant digits.

EventCount · Payload Alloc Δ vs base Mean Δ
Simple · 10000 19.03 MB −0.45 MB (−2.3%) 26.26 ms −0.34 ms
Simple · 1000 1.91 MB −0.03 MB 6.24 ms (noisy)
Exception · 10000 133.12 MB −0.45 MB 188.00 ms ~flat
Exception · 1000 13.29 MB −0.04 MB 23.21 ms (noisy)

~45 B/event saved (the timestamp string), as predicted. Formatter group unchanged (the timestamp is emitted by the serializer, not the formatter) - confirms the groups isolate cleanly.

Tests: 98/98 unit · 11/11 integration ✅

Step 2 - #7 typed WriteStringValue for Guid / DateTime / DateTimeOffset

LokiJsonTextFormatter.writeValue: added :? Guid, :? DateTime, :? DateTimeOffset cases using the typed Utf8JsonWriter overloads instead of the v.ToString() fallback. Guid output is byte-identical (D format). Behavior refinement: DateTime/DateTimeOffset now emit ISO 8601 / invariant (round-trippable) instead of the previous culture-dependent ToString() - consistent with how renderLabelValue already renders labels. 3 new WireFormatTests lock this (Guid exact; DateTime/DTO round-trip).

EventCount · Payload Alloc Δ vs base cumulative Δ
Simple · 10000 18.11 MB −0.92 MB −1.37 MB (−7.0%)
Simple · 1000 1.79 MB −0.12 MB −0.15 MB
Exception · 10000 133.12 MB 0 (no Guid/DateTime props) −0.45 MB

~92 B/event = the CorrelationId Guid string in the Simple workload. Formatter Simple 1.84 → 1.75 MB confirms it's formatter-side. Exception flat, as expected.

Tests: 101/101 unit (+3) · 11/11 integration ✅

Step 3 - #2 stream the Message via RenderMessage(TextWriter)

Utf8TextWriter gained WrittenSpan + Clear(). LokiJsonTextFormatter.FormatToBuffer now takes a caller-owned Utf8TextWriter, renders the message into it via LogEvent.RenderMessage(TextWriter, InvariantCulture), and emits the bytes as the JSON Message value - dropping the per-event StringWriter + StringBuilder + backing char[] + final String. The sink owns one reusable messageBuffer/messageWriter (serial use → safe); the public Format passes a throwaway pair (stays thread-safe).

EventCount · Payload Alloc Δ vs prev cumulative Δ
Simple · 10000 16.86 MB −1.25 MB −2.62 MB (−13.4%)
Simple · 1000 1.69 MB −0.10 MB −0.25 MB
Exception · 10000 132.21 MB −0.91 MB −1.36 MB

~125 B/event (Simple) - smaller than first estimated; Serilog's RenderMessage(string) wrapper is lighter than assumed and the residual cost is the per-token int.ToString Serilog does internally (unchanged, not worth chasing). The formatter group barely moves: its public path is dominated by the body Encoding.UTF8.GetString it still performs.

Tests: 101/101 unit · 11/11 integration ✅

Step 4 - #5 LokiStream.Events: F# listLogEvent[] + boxing-free sort

Grouping.fs: Events: LogEvent listLogEvent[]; evs |> Seq.sortBy (fun e -> e.Timestamp) |> List.ofSeqSeq.toArray + Array.sortInPlaceWith byTimestamp, where byTimestamp a b = a.Timestamp.CompareTo b.Timestamp.

EventCount · Payload Alloc Δ vs prev Mean Δ
Simple · 10000 11.44 MB −5.42 MB 22.97 ms −3.85 ms (−14%)
Simple · 1000 1.17 MB −0.52 MB 7.54 ms (noisy)
Exception · 10000 126.79 MB −5.42 MB 185.83 ms −6.1 ms

Far bigger than the cons-cell estimate, and the reason matters: the old Seq.sortBy (fun e -> e.Timestamp) sorted through F#'s generic structural comparison, which boxes the DateTimeOffset key on every comparison - O(n log n) boxes per stream (~8 MB of garbage at 10 000 events). The strongly-typed DateTimeOffset.CompareTo is boxing-free, so the whole pile vanished - and the boxing removal is why Mean dropped 14% too. The F# list's per-event cons cell was only a minor part. Formatter group unchanged (sort is serializer-side).

Cumulative Simple·10k: 19.48 → 11.44 MB (−41%), 26.60 → 22.97 ms (−14%).

Tests: 101/101 unit · 11/11 integration ✅

Step 5 - #6 structured metadata written directly to the writer (no per-event Map)

LokiSink gained writeMetadata: Utf8JsonWriter -> LogEvent -> unit (lazy object open, pre-encoded TraceId/SpanId names); Serialization.serialize calls it inline instead of materializing a Map<string,string> per event. buildStructuredMetadata deleted from Labels.fs.

EventCount · Payload Alloc Δ vs prev
Simple · 10000 11.21 MB −0.23 MB
Exception · 10000 126.57 MB −0.22 MB

~23 B/event - the leftover per-event Array.fold closure from the empty-metadata default path. (The bench configures no metadata, so the bigger saving - dropping the per-event Map build when metadata is enabled - isn't reflected here.) Formatter unchanged (metadata is serializer-side).

Tests: 101/101 unit · 11/11 integration ✅

Step 6 - #1 grouping via IEqualityComparer<LogEvent> (no per-event Map)

Grouping.fs: new LabelEqualityComparer (hashes Level + effective label-property values off the event via System.HashCode, boxing-free). groupIntoStreams now keys a Dictionary<LogEvent, ResizeArray<LogEvent>>(comparer) and builds each stream's LabelSet once from its head via labelOf, instead of Seq.groupBy over a freshly-built Map per event. The sink precomputes effectiveLabelProps (label property names minus reserved keys) for the comparer. buildLabelSet/LabelSet/LabelsTests are unchanged; GroupingTests rewritten to drive the comparer (+3 comparer tests).

EventCount · Payload Alloc Δ vs prev Mean Δ
Simple · 10000 2.51 MB −8.70 MB 18.41 ms −4.54 ms (−20%)
Simple · 1000 0.26 MB −0.85 MB 5.52 ms −0.86 ms
Exception · 10000 117.86 MB −8.71 MB 181.35 ms −3.8 ms
Exception · 1000 11.79 MB −0.88 MB (noisy)

~870 B/event removed on Simple - far more than estimated, because the old Seq.groupBy keyed on an F# Map paid (a) the per-event Map.add tree nodes and (b) Map's structural GetHashCode/Equals, which enumerate the tree via heap MapTreeEnumerators on every event. All gone. The −20% on Mean is the structural hashing disappearing.

Milestone: v9 now beats YetAnother on the Simple workload - 2.51 MB vs 4.63 MB (1.8× less allocation) and 18.41 ms vs 20.67 ms (~11% faster). Cumulative Simple·10k: 19.48 → 2.51 MB (−87%), 26.60 → 18.41 ms (−31%). Exception still trails YetAnother (117.9 vs 70.1 MB) - dominated by per-event ex.StackTrace materialization (both pay it); see notes below.

Tests: 104/104 unit (+3) · 11/11 integration ✅

Post-review/simplify pass - allocation-neutral (quality + one regression fix)

After step 9 the perf changes + the structured-metadata commit went through /code-review and /simplify (9 specialist agents). Applied: comment corrections; metadata property keys precomputed + pre-encoded once at construction and de-duplicated (Array.distinct, fixing a duplicate-JSON-key regression vs the committed feature); a LabelEqualityComparer.ForLabels factory + shared isReservedLabelKey predicate + null guard; a SerializationBuffers bundle (serialize 9→6 params); and 4 new tests (multi-event body correctness, multi-property comparer, non-string + duplicate metadata). Re-benchmarked to confirm no regression:

EventCount · Payload Alloc (post) vs step 9 Mean (post) vs step 9
Simple · 10000 1.36 MB identical 18.45 ms +0.8 ms (noise)
Simple · 1000 0.14 MB identical 5.39 ms ~flat
Exception · 10000 66.21 MB identical (±0.003%) 121.82 ms +0.8 ms (noise)
Exception · 1000 6.62 MB ~flat 33.56 ms ~flat

Byte-identical allocation, time within noise - the changes touch construction, param bundling, comments, and the (bench-inactive) metadata path, not the measured hot path. Final standings vs YetAnother unchanged: Simple 1.36 MB vs 4.63 (3.4×); Exception 66.2 MB vs 70.1 - v9 wins both.

Tests: 108/108 unit (+4) · 11/11 integration ✅ (all green on net8.0/net9.0/net10.0)

Step 7 - #8 reuse the per-event body Utf8JsonWriter

FormatToBuffer now takes a caller-owned Utf8JsonWriter (Reset() at start, Flush() at end) instead of new-ing one per event. The sink owns one bodyJsonWriter (over bodyBuffer, reused across events); the public Format passes a throwaway. (The envelope writer stays per-batch - one allocation, negligible.)

EventCount · Payload Alloc Δ vs prev
Simple · 10000 1.36 MB −1.17 MB
Simple · 1000 0.14 MB −0.12 MB
Exception · 10000 116.72 MB −1.14 MB

~117 B/event - the Utf8JsonWriter object, as estimated. Mean is flat (Gen0 was already 0 on Simple, so this wasn't GC-bound). Formatter group unchanged - its public path intentionally still uses throwaway writers. v9 Simple·10k is now 1.36 MB vs YetAnother 4.63 MB - 3.4× less.

Tests: 104/104 unit · 11/11 integration ✅

Step 8 - #9 resolve the formatter dispatch once per batch

Serialization.serialize now resolves the built-in-vs-custom formatter once (a typed reference, null ⇒ custom), replacing the per-event isinst + GetType() match with a single reference check. writeBody folded inline.

Allocations identical to step 7 (1.36 MB / 116.72 MB) - this is a CPU-only change. Mean moves within noise. Kept for code quality / removing per-event work; not an allocation win.

Tests: 104/104 unit · 11/11 integration ✅

Step 9 - #10 cache StackTrace/Source in the exception formatter (bonus)

Not in the original list - found while inspecting the Exception path. LokiExceptionFormatter.Write read ex.StackTrace twice (String.IsNullOrEmpty ex.StackTrace then WriteString(…, ex.StackTrace)) and ex.Source twice. Exception.StackTrace rebuilds the full stack-trace string on every access, so each exception (and every nested inner exception) materialized its stack trace twice. Now read once into a local. YetAnother reads it once - this was the bulk of the remaining Exception gap.

EventCount · Payload Alloc Δ vs prev Mean Δ
Exception · 10000 66.21 MB −50.51 MB (−43%) 121.03 ms −57 ms (−32%)
Exception · 1000 6.61 MB −5.07 MB (noisy)
Simple (both) unchanged - unchanged -

The per-event formatter group shows it cleanly: Format_Exception 13.65 → 8.60 MB, 12.77 → 7.16 ms. v9 now also beats YetAnother on Exception (66.21 MB / 121 ms vs 70.09 MB / 127 ms).

Tests: 104/104 unit · 11/11 integration ✅

Clone this wiki locally