feat: instrument mpc signing phase durations #171
Merged
Conversation
Add three histograms to measure where time goes in a signing round, so the DNS/Connect-per-message and coordinator handshake can be attributed numerically before we optimize.
- relayer.InitiateTime: coordinator broadcast-to-ready-threshold
- relayer.CommSendTime: per-outbound-message send (peer attr)
- relayer.CommDnsResolveTime: madns resolve + libp2p Connect per send
Libp2pCommunication now takes a Metrics interface with a NoopMetrics default for health-check and elector comms that don't need telemetry.
Co-Authored-By: Claude
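For orientation, a minimal sketch of the comm-side seam this describes, with assumed method names (the real interface in comm/p2p may differ):

```go
package p2p

import "time"

// Metrics is the telemetry seam Libp2pCommunication records into.
// The method names below are illustrative, not the exact production signatures.
type Metrics interface {
	RecordCommSendDuration(peer string, d time.Duration)
	RecordDnsResolveDuration(d time.Duration)
}

// NoopMetrics satisfies Metrics for callers that don't need telemetry,
// such as the health-check and elector comms.
type NoopMetrics struct{}

func (NoopMetrics) RecordCommSendDuration(string, time.Duration) {}
func (NoopMetrics) RecordDnsResolveDuration(time.Duration)      {}
```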
Go Test coverage is 53.3% ✨ ✨ ✨
Align struct fields per goimports to satisfy the linter. Co-Authored-By: Claude
Paillier key generation during TSS keygen is wall-clock heavy and CI runner speed varies widely. Recent passing PRs on the 1.23.x job took ~20s for this test; slower runners were hitting the 30s context deadline and failing with missing StoreKeyshare calls even though nothing was actually wrong. Bumping the test-local context deadline to 60s removes the CI flake without changing product timeouts. Co-Authored-By: Claude
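A minimal sketch of the test-local change this describes, with assumed test names (no product timeout is touched):

```go
package tss_test

import (
	"context"
	"testing"
	"time"
)

// Sketch only: the real suite and test names differ. The point is that the
// 60s deadline is local to the test so slow CI runners can finish Paillier
// key generation, while production timeouts stay unchanged.
func TestKeygenDeadlineSketch(t *testing.T) {
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	_ = ctx // the keygen process would run under this ctx
}
```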
Go Test coverage is 53.3% ✨ ✨ ✨
mpetrun5
approved these changes
Apr 20, 2026
Collaborator
mpetrun5
left a comment
I don't think we need to revert this regardless of what happens.
One thing to consider: as I remember, histograms come with a default bucket distribution, and since comm send and DNS resolve are fractions of a second, the default buckets might not be useful.
Something to check in the histogram docs if we need it, since you can pass custom buckets.
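For reference, a sketch of one way to pass custom buckets directly on the instrument, assuming a recent OTel Go metric API; the function name and boundary values are illustrative (the follow-up commit below instead matches sygma-core's view by unit):

```go
package metrics

import "go.opentelemetry.io/otel/metric"

// newDnsResolveHistogram sketches attaching sub-second buckets directly on the
// instrument. The boundary values here are illustrative only.
func newDnsResolveHistogram(meter metric.Meter) (metric.Float64Histogram, error) {
	return meter.Float64Histogram(
		"relayer.CommDnsResolveTime",
		metric.WithDescription("madns resolve + libp2p Connect per outbound send"),
		metric.WithExplicitBucketBoundaries(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5),
	)
}
```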
mpetrunic
approved these changes
Apr 20, 2026
mpetrun5
pushed a commit
that referenced
this pull request
Apr 28, 2026
## Summary
Adds `metric.WithUnit("s")` to the four `Float64Histogram`s in
`MpcMetrics`:
- `relayer.SessionTime` (PR #143)
- `relayer.InitiateTime` (PR #171)
- `relayer.CommSendTime` (PR #171)
- `relayer.CommDnsResolveTime` (PR #171)
## Why
`sygma-core` registers a sub-second bucket view in
`observability.InitMetricProvider`:
```go
// observability/metrics.go (initSecondView)
sdkmetric.NewView(
	sdkmetric.Instrument{Unit: "s"},
	sdkmetric.Stream{Aggregation: aggregation.ExplicitBucketHistogram{
		Boundaries: []float64{1e-6, 1e-5, 1e-4, 1e-3, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100, 1000, 10000},
	}},
)
```
The view is keyed by `Instrument{Unit: "s"}`. The histograms in
`metrics/mpc.go` declared a description but no unit, so the view never
matched and the SDK fell back to OTel's default histogram boundaries
`[5, 10, 25, 50, ..., 10000]` — which are tuned for milliseconds.
Signing-phase durations (sub-second to a few seconds) collapse into the
`le=5` bucket, making `histogram_quantile` return values pinned to
bucket boundaries rather than real percentiles.
Values are already recorded in seconds via `d.Seconds()`, so no math
changes — only bucketing.
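A minimal sketch of what a declaration in `metrics/mpc.go` looks like after this change; the description text is illustrative:

```go
package metrics

import "go.opentelemetry.io/otel/metric"

// newInitiateHistogram sketches the declaration after this PR: the description
// was already present; metric.WithUnit("s") is what lets the sub-second view
// match. The description wording here is illustrative.
func newInitiateHistogram(meter metric.Meter) (metric.Float64Histogram, error) {
	return meter.Float64Histogram(
		"relayer.InitiateTime",
		metric.WithDescription("coordinator broadcast-to-ready-threshold duration"),
		metric.WithUnit("s"),
	)
}
```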
## Grafana query change
The OTLP→Prometheus exporter appends a `_seconds` suffix when the
instrument carries `Unit: "s"`. After this PR, dashboard queries change:
| Before | After |
|---|---|
| `relayer_SessionTime_bucket` | `relayer_SessionTime_seconds_bucket` |
| `relayer_InitiateTime_bucket` | `relayer_InitiateTime_seconds_bucket` |
| `relayer_CommSendTime_bucket` | `relayer_CommSendTime_seconds_bucket` |
| `relayer_CommDnsResolveTime_bucket` | `relayer_CommDnsResolveTime_seconds_bucket` |
In practice no dashboards depend on the old names yet — the OTel
collector URL was only wired into staging in #174, so historical data is
empty.
## Test plan
- [x] `go build ./...` clean
- [x] `go test ./metrics/... ./tss/... ./comm/p2p/...` pass
- [ ] After deploy, confirm `relayer_SessionTime_seconds_count` and the
three new `_seconds_count` series are non-empty in Grafana
- [ ] Confirm `histogram_quantile(0.95, ...)` returns values that vary
with workload rather than snapping to bucket boundaries
We can revert this later, but I'm putting this up to try and really understand where some of the bottlenecks are coming from.
Summary
Instruments the signing path with phase-level OTel histograms so we can measure where time is spent before applying any optimizations. No behavior changes.
Motivation: signing rounds reportedly take 2-3s. We already have `relayer.SessionTime` (full session duration), but there's no breakdown by phase or visibility into libp2p send cost. This PR adds that breakdown so the next PRs (DNS caching, broadcast error handling, keccak fix, etc.) can be validated numerically.
New metrics
All are `Float64Histogram`s in seconds, emitted via the existing `MpcMetrics` / OTel collector pipeline.
relayer.InitiateTime
Duration of the coordinator handshake: from when the coordinator broadcasts `TssInitiateMsg` to the moment enough `TssReadyMsg` replies arrive to satisfy threshold+1 (i.e. right before the coordinator broadcasts `TssStartMsg`). Only emitted on the elected coordinator. Recorded in `tss/coordinator.go:initiate`.
Read it as: how long the pre-TSS rendezvous took. If this is large, the bottleneck is peer readiness / re-broadcast latency (including the 1s `initiatePeriod` rebroadcast fallback), not the MPC crypto itself.
relayer.CommSendTime
Wall-clock duration of a single outbound libp2p message send in `Libp2pCommunication.sendMessage`: DNS resolve + stream acquisition + `bufio.NewWriterSize` + `WriteStream` + `Flush`. Labelled by target peer (`peer` attribute) so you can spot one slow remote.
Read it as: total per-message send cost. Multiply by the number of messages per signing session to see how much of `SessionTime` is spent in comm.
relayer.CommDnsResolveTime
Duration of `Libp2pCommunication.resolveDNS`: `madns.NewResolver()` + `resolver.Resolve` + `host.Connect`. Called once per outbound `sendMessage`, unconditionally.
Read it as: overhead we pay per message for DNS + redundant Connect. This is the biggest suspected hotspot - the resolver is recreated every call and `Connect` is invoked even when the peer is already connected. If this histogram dominates `CommSendTime`, Phase 1 DNS caching is the right next step.
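To make the nesting concrete, a hedged sketch of the per-send timing pattern; `sendMetrics`, `resolveDNS`, and `writeStream` are illustrative stand-ins, not the exact code in `Libp2pCommunication.sendMessage`:

```go
package p2p

import (
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
)

// sendMetrics mirrors the illustrative Metrics seam sketched earlier in this
// thread; the production method names may differ.
type sendMetrics interface {
	RecordCommSendDuration(peer string, d time.Duration)
	RecordDnsResolveDuration(d time.Duration)
}

// sendWithTiming shows how the two comm histograms relate: DNS resolve time is
// a sub-span of the total send time, so CommDnsResolveTime <= CommSendTime.
func sendWithTiming(
	m sendMetrics,
	to peer.ID,
	msg []byte,
	resolveDNS func(peer.ID) error,
	writeStream func(peer.ID, []byte) error,
) error {
	sendStart := time.Now()

	dnsStart := time.Now()
	if err := resolveDNS(to); err != nil { // madns resolve + host.Connect
		return err
	}
	m.RecordDnsResolveDuration(time.Since(dnsStart))

	// stream acquisition + buffered write + flush
	err := writeStream(to, msg)
	m.RecordCommSendDuration(to.String(), time.Since(sendStart))
	return err
}
```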
Derived views (no new metric)
- `CommSendTime - CommDnsResolveTime` per peer: pure stream write cost
- `SessionTime - InitiateTime`: TSS crypto + result distribution time (non-coordinator peers only record `SessionTime`, since they never enter `initiate`)
Wiring
- `Libp2pCommunication` now takes a local `Metrics` interface:
  - `app/app.go` passes `sygmaMetrics` (`*SprinterMetrics` embeds `*MpcMetrics`, which satisfies the interface)
  - `jobs/jobs.go` (health check) and `comm/elector/elector.go` (coordinator election) pass `p2p.NoopMetrics{}` - these comms are not on the signing hot path
- `tss.Metrics` interface grew `RecordInitiateDuration(d time.Duration)`; mock regenerated; `CoordinatorTestSuite.SetupTest` expects it with `AnyTimes()`
Test plan
Notes for reviewers