-
Notifications
You must be signed in to change notification settings - Fork 83
ENG-506 Add performance metrics tracking for key operations #3857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
piotr-roslaniec
merged 18 commits into
threshold-network:main
from
lionakhnazarov:ENG-506/signer-performance-metrics
Jan 15, 2026
Merged
ENG-506 Add performance metrics tracking for key operations #3857
piotr-roslaniec
merged 18 commits into
threshold-network:main
from
lionakhnazarov:ENG-506/signer-performance-metrics
Jan 15, 2026
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
- Introduced a new system to monitor various operations within the Keep Core node, including wallet actions, DKG processes, signing operations, coordination procedures, and network activities. - Metrics are recorded through a new interface, allowing for optional integration without impacting performance when disabled. - Updated relevant components to wire in metrics recording, ensuring comprehensive coverage of critical operations. - Added documentation detailing implemented metrics and their usage. This enhancement provides better visibility into node performance and health, facilitating monitoring and troubleshooting.
- Introduced performance metrics for deposit and redemption process, including execution and proof submission metrics. - Updated the .gitignore file to exclude new directories: data/, logs/, and storage/. - Enhanced existing code to wire in metrics recording for redemption actions, improving visibility into redemption performance and potential bottlenecks. - Added documentation outlining the new metrics and their implementation details.
jose-blockchain
suggested changes
Jan 5, 2026
Contributor
jose-blockchain
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated recommendations:
- Fix the deadlock in wallet.go before merge - this will freeze the node if triggered, is confirmed
- Add context cancellation to
monitorQueueSizes- minor resource leak, not urgent but good to fix - Document that metrics endpoint should be firewalled - standard practice, just worth noting in docs
the code doesn't introduce direct vulnerabilities like injection or auth bypass. The metrics are useful operational data that node operators need. Just ensure port 9601 isn't exposed publicly (standard practice for any metrics endpoint).
- Updated the performance metrics initialization to accept an existing instance, preventing duplicate registrations. - Improved error handling in the metrics observer to log duplicate registrations at the debug level instead of warnings. - Added a method to periodically observe gauge metrics, ensuring better monitoring capabilities.
- Updated the performance metrics registration to include a suffix for duration metrics, enhancing consistency. - Ensured that metrics recorder is set for all cached coordination executors, improving reliability in metric tracking. - Streamlined the coordination procedure execution by eliminating redundant metric recordings, relying on the executor for accurate timing and failure counts.
- Implemented a new method to periodically collect and update system metrics, including CPU utilization, memory usage, and goroutine count.
…erver - Simplified the registration of performance metrics by consolidating the source map creation and conditionally excluding the count metric for 'ping_test_duration_seconds'. - Removed the unused gauge observation method, as gauge metrics are now automatically handled by the ObserveApplicationSource function.
…r's sign function for clarity and maintainability.
…on in tbtc.go for improved clarity and maintainability.
Fix 15 issues identified by code review: Data Race Fixes: - channel_manager.go: move metricsRecorder assignment inside mutex lock - wallet.go: add RWMutex to protect metricsRecorder concurrent access - spv.go: add RWMutex for globalMetricsRecorder package-level variable - libp2p.go: use atomic.Value for metricsRecorder in notifiee callbacks - rpc_health.go: fix 3 data races by storing error locally before mutex unlock Resource Leak Fixes: - channel.go: prevent duplicate goroutines in setMetricsRecorder using sync.Once - channel.go: fix handler queue metrics to include handler index suffix - performance.go: add context cancellation for observeSystemMetrics goroutine Logic and Documentation Fixes: - start.go: add nil guards for all clientInfoRegistry usages - signing.go: clarify error metrics recording comment - performance.go: add overflow bucket for histogram durations >600s - performance.go: update NewPerformanceMetrics to accept context parameter - docs/performance-metrics.adoc: add missing coordination and relay entry count metrics All changes validated with successful build.
- Fix double mutex lock in IncrementCounter for better performance - Remove duplicate registry observer registrations (memory leak) - Extract histogram magic keys (-1, -2) to named constants - Move metrics recording outside messageHandlersMutex to prevent deadlock - Ensure all error paths in sign() record failure metrics - Replace hardcoded metric strings with constants across codebase - Add comprehensive unit tests with race detection
- Remove unused metricsRecorder field from spvMaintainer struct - Fix metric type inconsistencies in documentation (Counter→Gauge) - Add sync.Once guard to RPCHealthChecker.Start() for concurrency safety - Replace hardcoded metric name strings with defined constants - Add explanatory comment for timeout metrics accuracy - Fix flaky TestWatchCoordinationWindows test with proper synchronization
Eng 506/review feedback
- Introduced new metrics for individual wallet actions: total, success, failed, and duration. - Updated performance metrics registration to include these new metrics dynamically. - Enhanced documentation to reflect the new per-action metrics structure and examples.
piotr-roslaniec
approved these changes
Jan 15, 2026
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The Keep Core node now exposes 31+ performance metrics via the
/metricsendpoint (port 9601). These metrics provide comprehensive visibility into node operations, network health, and system performance.Integrated Metrics by Category
1. DKG (Distributed Key Generation) Metrics (6 metrics)
Counters:
performance_dkg_joined_total- Total number of DKG joins (members joined)performance_dkg_failed_total- Total number of failed DKG executionsperformance_dkg_validation_total- Total number of DKG result validations performedperformance_dkg_challenges_submitted_total- Total number of DKG challenges submitted on-chainperformance_dkg_approvals_submitted_total- Total number of DKG approvals submitted on-chainDuration Metrics:
performance_dkg_duration_seconds- Average duration of DKG operationsperformance_dkg_duration_seconds_count- Total count of DKG operationsPerformance Insights:
dkg_joined_total / (dkg_joined_total + dkg_failed_total)- Monitor DKG participation and success ratesdkg_duration_secondsexceeds 300 seconds (5 minutes) - indicates slow DKG operationsdkg_challenges_submitted_totalanddkg_approvals_submitted_totalto monitor dispute resolution activitydkg_validation_totalrelative to joins indicates active validation of DKG results2. Signing Operations Metrics (5 metrics)
Counters:
performance_signing_operations_total- Total number of signing operations attemptedperformance_signing_success_total- Total number of successful signing operationsperformance_signing_failed_total- Total number of failed signing operationsperformance_signing_timeouts_total- Total number of signing operations that timed outDuration Metrics:
performance_signing_duration_seconds- Average duration of signing operationsperformance_signing_duration_seconds_count- Total count of signing operationsPerformance Insights:
signing_success_total / signing_operations_total- Critical metric for node reliabilitysigning_failed_totalrate > 10% of total operationssigning_timeouts_total / signing_operations_total- Indicates network or coordination issuessigning_duration_secondsexceeds 60 seconds - indicates slow signing operationssigning_operations_totalrate to understand signing workload3. Wallet Dispatcher Metrics (6 metrics)
Counters:
performance_wallet_actions_total- Total number of wallet actions dispatchedperformance_wallet_action_success_total- Total number of successfully completed wallet actionsperformance_wallet_action_failed_total- Total number of failed wallet actionsperformance_wallet_dispatcher_rejected_total- Total number of wallet actions rejected (wallet busy)performance_wallet_heartbeat_failures_total- Total number of wallet heartbeat failuresGauges:
performance_wallet_dispatcher_active_actions- Current number of wallets with active actionsDuration Metrics:
performance_wallet_action_duration_seconds- Average duration of wallet actionsperformance_wallet_action_duration_seconds_count- Total count of wallet actionsPerformance Insights:
wallet_dispatcher_rejected_total / wallet_actions_total- Alert if > 5% indicates wallet saturationwallet_action_success_total / wallet_actions_total- Monitor wallet action reliabilitywallet_dispatcher_active_actionsshows current wallet workloadwallet_heartbeat_failures_totalindicates wallet connectivity issues4. Coordination Operations Metrics (4 metrics)
Counters:
performance_coordination_windows_detected_total- Total number of coordination windows detectedperformance_coordination_procedures_executed_total- Total number of coordination procedures executed successfullyperformance_coordination_failed_total- Total number of failed coordination proceduresDuration Metrics:
performance_coordination_duration_seconds- Average duration of coordination proceduresperformance_coordination_duration_seconds_count- Total count of coordination proceduresPerformance Insights:
coordination_procedures_executed_total / coordination_windows_detected_total- Success rate of coordinationcoordination_failed_totalrate > 5% of detected windowscoordination_windows_detected_totalto understand coordination frequencycoordination_duration_secondsto identify slow coordination operations5. Network Operations Metrics (10 metrics)
Peer Connection Metrics:
performance_peer_connections_total- Total number of peer connections establishedperformance_peer_disconnections_total- Total number of peer disconnectionsMessage Metrics:
performance_message_broadcast_total- Total number of messages broadcast to the networkperformance_message_received_total- Total number of messages received from the networkQueue Size Metrics (Gauges):
performance_incoming_message_queue_size- Current size of incoming message queue (withchannellabel)performance_message_handler_queue_size- Current size of message handler queues (withchannelandhandlerlabels)Ping Test Metrics:
performance_ping_test_total- Total number of ping tests performedperformance_ping_test_success_total- Total number of successful ping testsperformance_ping_test_failed_total- Total number of failed ping testsperformance_ping_test_duration_seconds- Average duration of ping testsperformance_ping_test_duration_seconds_count- Total count of ping testsPerformance Insights:
peer_connections_totalvspeer_disconnections_total- Monitor connection stabilitymessage_broadcast_totalandmessage_received_totalratesincoming_message_queue_size> 3000 (75% of 4096 capacity) - indicates message processing bottleneckmessage_handler_queue_size> 400 (75% of 512 capacity) - indicates handler saturationping_test_duration_secondsshows network round-trip timeping_test_failed_totalrate > 10% of ping tests - indicates network issues6. System level metrics
CPU Utilization: Estimated CPU utilization based on goroutine count and GC activityMemory Usage: Current allocated memory (heap) in bytesGoroutine Count: Current number of active goroutines