feat: instrument mpc signing phase durations #171

Merged

mpetrunic merged 3 commits into main from feat/mpc-signing-instrumentation on Apr 20, 2026
Conversation


@GregTheGreek GregTheGreek commented Apr 18, 2026

We can revert this later, but I'm putting this up to try and really understand where some of the bottlenecks are coming from.

Summary

Instruments the signing path with phase-level OTel histograms so we can measure where time is spent before applying any optimizations. No behavior changes.

Motivation: signing rounds reportedly take 2-3s. We already have relayer.SessionTime (full session duration), but there's no breakdown by phase or visibility into libp2p send cost. This PR adds that breakdown so the next PRs (DNS caching, broadcast error handling, keccak fix, etc.) can be validated numerically.

New metrics

All are Float64Histograms in seconds, emitted via the existing MpcMetrics / OTel collector pipeline.
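
For orientation, a minimal sketch of how one of these histograms might be created and recorded with the OTel Go SDK (variable names are illustrative, not the exact `MpcMetrics` fields):

```go
package metrics

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// recordPhase is a sketch only; the real instruments live in metrics/mpc.go.
func recordPhase(ctx context.Context) error {
	meter := otel.Meter("relayer")
	initiateTime, err := meter.Float64Histogram(
		"relayer.InitiateTime",
		metric.WithDescription("coordinator handshake duration"),
	)
	if err != nil {
		return err
	}

	start := time.Now()
	// ... phase work ...
	initiateTime.Record(ctx, time.Since(start).Seconds()) // recorded in seconds
	return nil
}
```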

relayer.InitiateTime

Duration of the coordinator handshake: from when the coordinator broadcasts TssInitiateMsg to the moment enough TssReadyMsg replies arrive to satisfy threshold+1 (i.e. right before the coordinator broadcasts TssStartMsg). Only emitted on the elected coordinator. Recorded in tss/coordinator.go:initiate.

Read it as: how long the pre-TSS rendezvous took. If this is large, the bottleneck is peer readiness / re-broadcast latency (including the 1s initiatePeriod rebroadcast fallback), not the MPC crypto itself.

relayer.CommSendTime

Wall-clock duration of a single outbound libp2p message send in Libp2pCommunication.sendMessage: DNS resolve + stream acquisition + bufio.NewWriterSize + WriteStream + Flush. Labelled by target peer (peer attribute) so you can spot one slow remote.

Read it as: total per-message send cost. Multiply by the number of messages per signing session to see how much of SessionTime is spent in comm.
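
A sketch of what recording with a per-peer attribute looks like in the OTel Go SDK (assumed names; the real wrapper lives in `Libp2pCommunication.sendMessage`):

```go
package p2p

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSend sketches the defer-record wrapper around a single outbound
// send; the histogram and callback stand in for the real sendMessage body.
func recordSend(ctx context.Context, hist metric.Float64Histogram, to peer.ID, send func() error) error {
	start := time.Now()
	defer func() {
		hist.Record(ctx, time.Since(start).Seconds(),
			metric.WithAttributes(attribute.String("peer", to.String())))
	}()
	return send() // DNS resolve + stream acquisition + buffered write + flush
}
```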

relayer.CommDnsResolveTime

Duration of Libp2pCommunication.resolveDNS: madns.NewResolver() + resolver.Resolve + host.Connect. Called once per outbound sendMessage, unconditionally.

Read it as: overhead we pay per message for DNS + redundant Connect. This is the biggest suspected hotspot - the resolver is recreated every call and Connect is invoked even when the peer is already connected. If this histogram dominates CommSendTime, Phase 1 DNS caching is the right next step.
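
For context, a rough sketch of the per-message shape described above (assumptions throughout; not the repo's exact code):

```go
package p2p

import (
	"context"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
	madns "github.com/multiformats/go-multiaddr-dns"
)

// resolveAndConnect sketches the suspected hotspot: a fresh resolver per
// call and an unconditional Connect, both paid once per outbound message.
func resolveAndConnect(ctx context.Context, h host.Host, pi peer.AddrInfo) error {
	resolver, err := madns.NewResolver() // recreated on every send
	if err != nil {
		return err
	}
	resolved := make([]ma.Multiaddr, 0, len(pi.Addrs))
	for _, addr := range pi.Addrs {
		out, err := resolver.Resolve(ctx, addr)
		if err != nil {
			return err
		}
		resolved = append(resolved, out...)
	}
	pi.Addrs = resolved
	return h.Connect(ctx, pi) // invoked even if the peer is already connected
}
```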

Derived views (no new metric)

  • CommSendTime - CommDnsResolveTime per peer: pure stream write cost
  • SessionTime - InitiateTime: TSS crypto + result distribution time (non-coordinator peers only record SessionTime, since they never enter initiate)

Wiring

Libp2pCommunication now takes a local Metrics interface (sketched after this list):

  • app/app.go passes sygmaMetrics (*SprinterMetrics embeds *MpcMetrics which satisfies the interface)
  • jobs/jobs.go (health check) and comm/elector/elector.go (coordinator election) pass p2p.NoopMetrics{} - these comms are not on the signing hot path
  • Tests use p2p.NoopMetrics{}
  • tss.Metrics interface grew RecordInitiateDuration(d time.Duration); mock regenerated; CoordinatorTestSuite.SetupTest expects it with AnyTimes()
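
A plausible shape of that local interface and its no-op implementation (method names other than `RecordInitiateDuration` are assumptions):

```go
package p2p

import "time"

// Metrics is the minimal telemetry surface Libp2pCommunication depends on.
// Method names here are assumed for illustration.
type Metrics interface {
	RecordCommSendDuration(peer string, d time.Duration)
	RecordCommDNSResolveDuration(peer string, d time.Duration)
}

// NoopMetrics satisfies Metrics for comms off the signing hot path
// (health checks, coordinator election, tests).
type NoopMetrics struct{}

func (NoopMetrics) RecordCommSendDuration(string, time.Duration)       {}
func (NoopMetrics) RecordCommDNSResolveDuration(string, time.Duration) {}
```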

Test plan

  • `go build ./...` clean
  • `go test ./comm/p2p/... ./comm/elector/... ./tss/... ./metrics/...` all pass
  • Deploy to staging, run a signing round, confirm the three new histograms show up in Grafana
  • Capture p50/p95 baseline for each histogram as reference numbers for the Phase 1 optimization PR
  • One pre-existing unrelated failure: `comm/subID_test.go:32` (nanosecond collision in `NewSubscriptionID` under macOS timer resolution) - fails on unmodified main too; out of scope here

Notes for reviewers

  • No logic changes, only `defer record` wrappers and the interface plumbing. Behavior is identical on the NoopMetrics path.
  • Follow-up PR will apply the actual optimizations (cache `madns.NewResolver`, skip `h.Connect` when already connected, bounded `Connect` context, non-fatal broadcast errors, remove dead keccak hash in peer sort). Baseline histograms from this PR are the gate for that work.
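
For reference, a sketch of the Phase 1 direction (assumptions, not the follow-up PR's actual code): build the resolver once and skip `Connect` when the peer is already connected.

```go
package p2p

import (
	"context"
	"sync"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	madns "github.com/multiformats/go-multiaddr-dns"
)

var (
	resolverOnce sync.Once
	resolver     *madns.Resolver
	resolverErr  error
)

// cachedResolver builds the madns resolver once instead of per message.
func cachedResolver() (*madns.Resolver, error) {
	resolverOnce.Do(func() { resolver, resolverErr = madns.NewResolver() })
	return resolver, resolverErr
}

// connectIfNeeded skips the redundant dial when already connected.
func connectIfNeeded(ctx context.Context, h host.Host, pi peer.AddrInfo) error {
	if h.Network().Connectedness(pi.ID) == network.Connected {
		return nil
	}
	return h.Connect(ctx, pi)
}
```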

Add three histograms to measure where time goes in a signing round,
so the DNS/Connect-per-message and coordinator handshake can be
attributed numerically before we optimize.

- relayer.InitiateTime: coordinator broadcast-to-ready-threshold
- relayer.CommSendTime: per-outbound-message send (peer attr)
- relayer.CommDnsResolveTime: madns resolve + libp2p Connect per send

Libp2pCommunication now takes a Metrics interface with a NoopMetrics
default for health-check and elector comms that don't need telemetry.

Co-Authored-By: Claude
@github-actions

Go Test coverage is 53.3% ✨ ✨ ✨

Fix struct field alignment per goimports to satisfy the linter.

Co-Authored-By: Claude
Paillier key generation during TSS keygen is wall-clock heavy and
CI runner speed varies widely. Recent passing PRs on the 1.23.x
job took ~20s for this test; slower runners were hitting the 30s
context deadline and failing with missing StoreKeyshare calls
even though nothing was actually wrong. Bumping the test-local
context deadline to 60s removes the CI flake without changing
product timeouts.

Co-Authored-By: Claude
@github-actions

Go Test coverage is 53.3% ✨ ✨ ✨

@GregTheGreek GregTheGreek marked this pull request as ready for review April 18, 2026 16:29
@mpetrun5 mpetrun5 (Collaborator) left a comment


I don't think we need to revert this regardless of what happens.

One thing to consider: as I remember, histograms have a default bucket distribution, and since comm send and DNS resolve durations are fractions of a second, the default buckets might not be useful.
Something to check in the histogram docs if we need it, since you can pass custom buckets.
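
For reference, the OTel Go SDK does let an instrument declare explicit boundaries directly (a sketch with illustrative boundaries; the merged follow-up below takes the unit-keyed view route instead):

```go
package metrics

import "go.opentelemetry.io/otel/metric"

// subSecondHistogram sketches declaring sub-second buckets at instrument
// creation rather than via an SDK view.
func subSecondHistogram(meter metric.Meter) (metric.Float64Histogram, error) {
	return meter.Float64Histogram(
		"relayer.CommSendTime",
		metric.WithUnit("s"),
		metric.WithExplicitBucketBoundaries(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5),
	)
}
```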

@mpetrunic mpetrunic merged commit 6b3ab57 into main Apr 20, 2026
7 checks passed
@mpetrunic mpetrunic deleted the feat/mpc-signing-instrumentation branch April 20, 2026 10:31
mpetrun5 pushed a commit that referenced this pull request Apr 28, 2026
## Summary

Adds `metric.WithUnit("s")` to the four `Float64Histogram`s in
`MpcMetrics`:

- `relayer.SessionTime` (PR #143)
- `relayer.InitiateTime` (PR #171)
- `relayer.CommSendTime` (PR #171)
- `relayer.CommDnsResolveTime` (PR #171)

## Why

`sygma-core` registers a sub-second bucket view in
`observability.InitMetricProvider`:

```go
// observability/metrics.go (initSecondView)
sdkmetric.NewView(
    sdkmetric.Instrument{Unit: "s"},
    sdkmetric.Stream{Aggregation: aggregation.ExplicitBucketHistogram{
        Boundaries: []float64{1e-6, 1e-5, 1e-4, 1e-3, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100, 1000, 10000},
    }},
)
```

The view is keyed by `Instrument{Unit: "s"}`. The histograms in
`metrics/mpc.go` declared a description but no unit, so the view never
matched and the SDK fell back to OTel's default histogram boundaries
`[5, 10, 25, 50, ..., 10000]` — which are tuned for milliseconds.
Signing-phase durations (sub-second to a few seconds) collapse into the
`le=5` bucket, making `histogram_quantile` return values pinned to
bucket boundaries rather than real percentiles.

Values are already recorded in seconds via `d.Seconds()`, so no math
changes — only bucketing.
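
The change per instrument is a single option; representatively (the description string here is illustrative):

```go
package metrics

import "go.opentelemetry.io/otel/metric"

// sessionTimeHistogram sketches the one-line change per histogram in
// metrics/mpc.go.
func sessionTimeHistogram(meter metric.Meter) (metric.Float64Histogram, error) {
	return meter.Float64Histogram(
		"relayer.SessionTime",
		metric.WithDescription("full signing session duration"),
		metric.WithUnit("s"), // added: now matches the Unit: "s" view above
	)
}
```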

## Grafana query change

The OTLP→Prometheus exporter appends a `_seconds` suffix when the
instrument carries `Unit: "s"`. After this PR, dashboard queries change:

| Before | After |
|---|---|
| `relayer_SessionTime_bucket` | `relayer_SessionTime_seconds_bucket` |
| `relayer_InitiateTime_bucket` | `relayer_InitiateTime_seconds_bucket` |
| `relayer_CommSendTime_bucket` | `relayer_CommSendTime_seconds_bucket` |
| `relayer_CommDnsResolveTime_bucket` | `relayer_CommDnsResolveTime_seconds_bucket` |

In practice no dashboards depend on the old names yet — the OTel
collector URL was only wired into staging in #174, so historical data is
empty.

## Test plan

- [x] `go build ./...` clean
- [x] `go test ./metrics/... ./tss/... ./comm/p2p/...` pass
- [ ] After deploy, confirm `relayer_SessionTime_seconds_count` and the
three new `_seconds_count` series are non-empty in Grafana
- [ ] Confirm `histogram_quantile(0.95, ...)` returns values that vary
with workload rather than snapping to bucket boundaries