feat: instrument mpc signing phase durations #171

Merged

mpetrunic merged 3 commits into main from feat/mpc-signing-instrumentation on Apr 20, 2026
Conversation


@GregTheGreek GregTheGreek commented Apr 18, 2026

We can revert this later, but I'm putting this up to try and really understand where some of the bottlenecks are coming from.

Summary

Instruments the signing path with phase-level OTel histograms so we can measure where time is spent before applying any optimizations. No behavior changes.

Motivation: signing rounds reportedly take 2-3s. We already have relayer.SessionTime (full session duration), but there's no breakdown by phase or visibility into libp2p send cost. This PR adds that breakdown so the next PRs (DNS caching, broadcast error handling, keccak fix, etc.) can be validated numerically.

New metrics

All are Float64Histograms in seconds, emitted via the existing MpcMetrics / OTel collector pipeline.
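
For orientation, a minimal sketch of how one of these histograms might be created and recorded with the OTel Go SDK (variable names are illustrative, not the exact `MpcMetrics` fields):

```go
package metrics

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// recordPhase is a sketch only; the real instruments live in metrics/mpc.go.
func recordPhase(ctx context.Context) error {
	meter := otel.Meter("relayer")
	initiateTime, err := meter.Float64Histogram(
		"relayer.InitiateTime",
		metric.WithDescription("coordinator handshake duration"),
	)
	if err != nil {
		return err
	}

	start := time.Now()
	// ... phase work ...
	initiateTime.Record(ctx, time.Since(start).Seconds()) // recorded in seconds
	return nil
}
```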

relayer.InitiateTime

Duration of the coordinator handshake: from when the coordinator broadcasts TssInitiateMsg to the moment enough TssReadyMsg replies arrive to satisfy threshold+1 (i.e. right before the coordinator broadcasts TssStartMsg). Only emitted on the elected coordinator. Recorded in tss/coordinator.go:initiate.

Read it as: how long the pre-TSS rendezvous took. If this is large, the bottleneck is peer readiness / re-broadcast latency (including the 1s initiatePeriod rebroadcast fallback), not the MPC crypto itself.

relayer.CommSendTime

Wall-clock duration of a single outbound libp2p message send in Libp2pCommunication.sendMessage: DNS resolve + stream acquisition + bufio.NewWriterSize + WriteStream + Flush. Labelled by target peer (peer attribute) so you can spot one slow remote.

Read it as: total per-message send cost. Multiply by the number of messages per signing session to see how much of SessionTime is spent in comm.
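
A sketch of what recording with a per-peer attribute looks like in the OTel Go SDK (assumed names; the real wrapper lives in `Libp2pCommunication.sendMessage`):

```go
package p2p

import (
	"context"
	"time"

	"github.com/libp2p/go-libp2p/core/peer"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordSend sketches the defer-record wrapper around a single outbound
// send; the histogram and callback stand in for the real sendMessage body.
func recordSend(ctx context.Context, hist metric.Float64Histogram, to peer.ID, send func() error) error {
	start := time.Now()
	defer func() {
		hist.Record(ctx, time.Since(start).Seconds(),
			metric.WithAttributes(attribute.String("peer", to.String())))
	}()
	return send() // DNS resolve + stream acquisition + buffered write + flush
}
```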

relayer.CommDnsResolveTime

Duration of Libp2pCommunication.resolveDNS: madns.NewResolver() + resolver.Resolve + host.Connect. Called once per outbound sendMessage, unconditionally.

Read it as: overhead we pay per message for DNS + redundant Connect. This is the biggest suspected hotspot - the resolver is recreated every call and Connect is invoked even when the peer is already connected. If this histogram dominates CommSendTime, Phase 1 DNS caching is the right next step.
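
For context, a rough sketch of the per-message shape described above (assumptions throughout; not the repo's exact code):

```go
package p2p

import (
	"context"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
	madns "github.com/multiformats/go-multiaddr-dns"
)

// resolveAndConnect sketches the suspected hotspot: a fresh resolver per
// call and an unconditional Connect, both paid once per outbound message.
func resolveAndConnect(ctx context.Context, h host.Host, pi peer.AddrInfo) error {
	resolver, err := madns.NewResolver() // recreated on every send
	if err != nil {
		return err
	}
	resolved := make([]ma.Multiaddr, 0, len(pi.Addrs))
	for _, addr := range pi.Addrs {
		out, err := resolver.Resolve(ctx, addr)
		if err != nil {
			return err
		}
		resolved = append(resolved, out...)
	}
	pi.Addrs = resolved
	return h.Connect(ctx, pi) // invoked even if the peer is already connected
}
```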

Derived views (no new metric)

  • CommSendTime - CommDnsResolveTime per peer: pure stream write cost
  • SessionTime - InitiateTime: TSS crypto + result distribution time (non-coordinator peers only record SessionTime, since they never enter initiate)

Wiring

Libp2pCommunication now takes a local Metrics interface (sketched after this list):

  • app/app.go passes sygmaMetrics (*SprinterMetrics embeds *MpcMetrics which satisfies the interface)
  • jobs/jobs.go (health check) and comm/elector/elector.go (coordinator election) pass p2p.NoopMetrics{} - these comms are not on the signing hot path
  • Tests use p2p.NoopMetrics{}
  • tss.Metrics interface grew RecordInitiateDuration(d time.Duration); mock regenerated; CoordinatorTestSuite.SetupTest expects it with AnyTimes()
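
A plausible shape of that local interface and its no-op implementation (method names other than `RecordInitiateDuration` are assumptions):

```go
package p2p

import "time"

// Metrics is the minimal telemetry surface Libp2pCommunication depends on.
// Method names here are assumed for illustration.
type Metrics interface {
	RecordCommSendDuration(peer string, d time.Duration)
	RecordCommDNSResolveDuration(peer string, d time.Duration)
}

// NoopMetrics satisfies Metrics for comms off the signing hot path
// (health checks, coordinator election, tests).
type NoopMetrics struct{}

func (NoopMetrics) RecordCommSendDuration(string, time.Duration)       {}
func (NoopMetrics) RecordCommDNSResolveDuration(string, time.Duration) {}
```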

Test plan

  • `go build ./...` clean
  • `go test ./comm/p2p/... ./comm/elector/... ./tss/... ./metrics/...` all pass
  • Deploy to staging, run a signing round, confirm the three new histograms show up in Grafana
  • Capture p50/p95 baseline for each histogram as reference numbers for the Phase 1 optimization PR
  • One pre-existing unrelated failure: `comm/subID_test.go:32` (nanosecond collision in `NewSubscriptionID` under macOS timer resolution) - fails on unmodified main too; out of scope here

Notes for reviewers

  • No logic changes, only `defer record` wrappers and the interface plumbing. Behavior is identical on the NoopMetrics path.
  • Follow-up PR will apply the actual optimizations (cache `madns.NewResolver`, skip `h.Connect` when already connected, bounded `Connect` context, non-fatal broadcast errors, remove dead keccak hash in peer sort). Baseline histograms from this PR are the gate for that work.
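
For reference, a sketch of the Phase 1 direction (assumptions, not the follow-up PR's actual code): build the resolver once and skip `Connect` when the peer is already connected.

```go
package p2p

import (
	"context"
	"sync"

	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	madns "github.com/multiformats/go-multiaddr-dns"
)

var (
	resolverOnce sync.Once
	resolver     *madns.Resolver
	resolverErr  error
)

// cachedResolver builds the madns resolver once instead of per message.
func cachedResolver() (*madns.Resolver, error) {
	resolverOnce.Do(func() { resolver, resolverErr = madns.NewResolver() })
	return resolver, resolverErr
}

// connectIfNeeded skips the redundant dial when already connected.
func connectIfNeeded(ctx context.Context, h host.Host, pi peer.AddrInfo) error {
	if h.Network().Connectedness(pi.ID) == network.Connected {
		return nil
	}
	return h.Connect(ctx, pi)
}
```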

Add three histograms to measure where time goes in a signing round,
so the DNS/Connect-per-message and coordinator handshake can be
attributed numerically before we optimize.

- relayer.InitiateTime: coordinator broadcast-to-ready-threshold
- relayer.CommSendTime: per-outbound-message send (peer attr)
- relayer.CommDnsResolveTime: madns resolve + libp2p Connect per send

Libp2pCommunication now takes a Metrics interface with a NoopMetrics
default for health-check and elector comms that don't need telemetry.

Co-Authored-By: Claude
@github-actions

Go Test coverage is 53.3% ✨ ✨ ✨

Fix struct field alignment per goimports to satisfy the linter.

Co-Authored-By: Claude
Paillier key generation during TSS keygen is wall-clock heavy and
CI runner speed varies widely. Recent passing PRs on the 1.23.x
job took ~20s for this test; slower runners were hitting the 30s
context deadline and failing with missing StoreKeyshare calls
even though nothing was actually wrong. Bumping the test-local
context deadline to 60s removes the CI flake without changing
product timeouts.

Co-Authored-By: Claude
@github-actions

Go Test coverage is 53.3% ✨ ✨ ✨

@GregTheGreek GregTheGreek marked this pull request as ready for review April 18, 2026 16:29
@mpetrun5 mpetrun5 (Collaborator) left a comment


I don't think we need to revert this regardless of what happens.

One thing to consider: as I remember, histograms have a default bucket distribution, and since comm send and DNS resolve durations are fractions of a second, the default buckets might not be useful.
Something to check in the histogram docs if we need it, since you can pass custom buckets.
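
For reference, the OTel Go SDK does let an instrument declare explicit boundaries directly (a sketch with illustrative boundaries; the merged follow-up below takes the unit-keyed view route instead):

```go
package metrics

import "go.opentelemetry.io/otel/metric"

// subSecondHistogram sketches declaring sub-second buckets at instrument
// creation rather than via an SDK view.
func subSecondHistogram(meter metric.Meter) (metric.Float64Histogram, error) {
	return meter.Float64Histogram(
		"relayer.CommSendTime",
		metric.WithUnit("s"),
		metric.WithExplicitBucketBoundaries(0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5),
	)
}
```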

@mpetrunic mpetrunic merged commit 6b3ab57 into main Apr 20, 2026
7 checks passed
@mpetrunic mpetrunic deleted the feat/mpc-signing-instrumentation branch April 20, 2026 10:31
mpetrun5 pushed a commit that referenced this pull request Apr 28, 2026
## Summary

Adds `metric.WithUnit("s")` to the four `Float64Histogram`s in
`MpcMetrics`:

- `relayer.SessionTime` (PR #143)
- `relayer.InitiateTime` (PR #171)
- `relayer.CommSendTime` (PR #171)
- `relayer.CommDnsResolveTime` (PR #171)

## Why

`sygma-core` registers a sub-second bucket view in
`observability.InitMetricProvider`:

```go
// observability/metrics.go (initSecondView)
sdkmetric.NewView(
    sdkmetric.Instrument{Unit: "s"},
    sdkmetric.Stream{Aggregation: aggregation.ExplicitBucketHistogram{
        Boundaries: []float64{1e-6, 1e-5, 1e-4, 1e-3, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 100, 1000, 10000},
    }},
)
```

The view is keyed by `Instrument{Unit: "s"}`. The histograms in
`metrics/mpc.go` declared a description but no unit, so the view never
matched and the SDK fell back to OTel's default histogram boundaries
`[5, 10, 25, 50, ..., 10000]` — which are tuned for milliseconds.
Signing-phase durations (sub-second to a few seconds) collapse into the
`le=5` bucket, making `histogram_quantile` return values pinned to
bucket boundaries rather than real percentiles.

Values are already recorded in seconds via `d.Seconds()`, so no math
changes — only bucketing.
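
The change per instrument is a single option; representatively (the description string here is illustrative):

```go
package metrics

import "go.opentelemetry.io/otel/metric"

// sessionTimeHistogram sketches the one-line change per histogram in
// metrics/mpc.go.
func sessionTimeHistogram(meter metric.Meter) (metric.Float64Histogram, error) {
	return meter.Float64Histogram(
		"relayer.SessionTime",
		metric.WithDescription("full signing session duration"),
		metric.WithUnit("s"), // added: now matches the Unit: "s" view above
	)
}
```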

## Grafana query change

The OTLP→Prometheus exporter appends a `_seconds` suffix when the
instrument carries `Unit: "s"`. After this PR, dashboard queries change:

| Before | After |
|---|---|
| `relayer_SessionTime_bucket` | `relayer_SessionTime_seconds_bucket` |
| `relayer_InitiateTime_bucket` | `relayer_InitiateTime_seconds_bucket` |
| `relayer_CommSendTime_bucket` | `relayer_CommSendTime_seconds_bucket` |
| `relayer_CommDnsResolveTime_bucket` | `relayer_CommDnsResolveTime_seconds_bucket` |

In practice no dashboards depend on the old names yet — the OTel
collector URL was only wired into staging in #174, so historical data is
empty.

## Test plan

- [x] `go build ./...` clean
- [x] `go test ./metrics/... ./tss/... ./comm/p2p/...` pass
- [ ] After deploy, confirm `relayer_SessionTime_seconds_count` and the
three new `_seconds_count` series are non-empty in Grafana
- [ ] Confirm `histogram_quantile(0.95, ...)` returns values that vary
with workload rather than snapping to bucket boundaries