-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
P2P metrics #1163
P2P metrics #1163
Conversation
6482f45
to
03fb228
Compare
@@ -188,6 +200,12 @@ func New(opts ...Option) *MetricsReporter { | |||
} | |||
} | |||
|
|||
// reset metric every hour because it can grow infinitely | |||
async.Interval(ctx, 8*time.Hour, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are infinitely growing counts an issue? Should we perhaps just use rate per second on dashboards? Also, comment says every hour, but the interval is 8 hours
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nkryuchkov it's an issue because you discover more unique peers over time. After a few weeks of uptime, you potentially have seen many thousands of peers.
@AKorpusenko Rather than resetting the metrics (which might disrupt the Grafana chart), I think we can just call DeleteLabelValues
on a peer_id after we've disconnected from it. That way, you'd only have 60 (max peers) different values top per metric.
messagesReceivedTotal
does not need to be reset at all, as it has no labels and it's not growing in size (you can confirm this in your localhost:15000/metrics
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nkryuchkov Infinitely growing values (a counter) for a given metric is fine. What we are discussing here as an issue is an infinitely growing amount of labels, e.g.
metric{key="peer1"} 0
metric{key="peer2"} 0
...
Prometheus doesn't deal well with these case scenarios at large scale.
network/topics/metrics.go
Outdated
@@ -24,6 +24,12 @@ var ( | |||
Name: "ssv:p2p:pubsub:score:inspect", | |||
Help: "Gauge for negative peer scores", | |||
}, []string{"pid"}) | |||
|
|||
// score.go invalidMessageDeliveries value per topic | |||
metricPubSubPeerP4Score = promauto.NewGaugeVec(prometheus.GaugeOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "P4" mean?
@AKorpusenko @MatusKysel please include @zktaiga as a reviewer once you ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with @moshe-blox - I think it's a great use case for DeleteLabelValues()
. It allows us to have fixed and predictable cardinality.
I think where possible we should avoid the Reset()
pattern on a predefined interval, today an interval might seem reasonable but produce a lot of noise in the future.
The rest looks good to me!
…signatures verification
# Conflicts: # .gitlab-ci.yml
03fb228
to
1e4c409
Compare
…signatures verification
# Conflicts: # .gitlab-ci.yml
@moshe-blox |
@AKorpusenko Very nice, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- IMO the naming is really bad,
metricsreporter.MetricsReporter
is just super long and repetitive. we could probably go withmetrics.Reporter
- I think its a bad idea to centralize all metrics code from our codebase into this package. its ok to pass around the struct that will do the actual reporting and add all metrics to it, but I think each package should define its own metrics then register it to the passed metrics reporter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we discussed over DIscord, this will be done in a future PR to refactor metrics
) | ||
|
||
type MetricsReporter struct { | ||
type MetricsReporter interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this interface? Can't we just export metricsReporter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree, i think this interface should be split to multiple interfaces, one in each package where it belongs
and that package can just expose the implementation or maybe we can get rid of this package altogether and do the implementation inside each respective package
protocol/v2/types/crypto.go
Outdated
// VerifyByOperators verifies signature by the provided operators | ||
// This is a copy of a function with the same name from the spec, except for it's use of | ||
// DeserializeBLSPublicKey function and bounded.CGO | ||
// | ||
// TODO: rethink this function and consider moving/refactoring it. | ||
func VerifyByOperators(s spectypes.Signature, data spectypes.MessageSignature, domain spectypes.DomainType, sigType spectypes.SignatureType, operators []*spectypes.Operator) error { | ||
metricsSignaturesVerifications.WithLabelValues().Inc() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good metric, though is this the only place we verify sigs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GalRogozinski do we need this to track other signature verification methods as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also track RSA sigs in another metric
Are there more places where we do BLS verifications?
This looks like the main function to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well I guess better to add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving this regardless of all my comments, I think we need a overhaul to how we do metrics codebase-wide, so this isn't the place to treat all of that. regarding what this PR is meant to handle I think it looks OK.
* fix: stale operator ID in `p2pNetwork` (#1229) * fix: stale operator ID in `p2pNetwork` --------- Co-authored-by: Lior Rutenberg <liorr@blox.io> * fix: sufficient timeout for initial duty fetch (#1214) * fix: sufficient timeout for initial duty fetch (#1214) --------- Co-authored-by: Matus Kysel <matus@blox.io> * deployment: change bootnode for holesky-stage (#1233) * deploy new bootnode ENR for holesky stage * Voluntary exit (#1200) * Voluntary exit --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> * feat: include 'connected' in peer scores log (#1224) * feat: include 'connected' in peer scores log * P2P metrics (#1163) * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * review comments fixes * deploy * deploy * testing metrics differ * added duties created and finalized * removed extra monitoring.metricsreporter usage * removed topic from p4score. made it a sum of squares * deploy to all nodes * deploy p2p_metrics * pr review fixes * trigger ci * deploy 9-20 * Fix msg validation rsa verification counter metric * allocate p2p_metrics * inspect scores more frequently, but don't log every time * renamed duties created/finalized metrics and refactored to record role * deploy to 5--8 * deploy to 5--8 * rename metric * blank space * approve spec change (just a metric) * refactors * reset scores * metric help msg * revert gitlab * deploy * revert deploy * Add metrics for signature verifications * Update differ.config.yaml with approved changes --------- Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: moshe-blox <moshe@blox.io> * Node health (#1203) * chore: link to SSV API docs in configs & README * initial commit * add node health route to ssv API * update health route * update health route * update health route * update health route * deploy to stage * add plaintext response * lint * lint * change to good/bad * lint * lint * refactor * Revert "deploy to stage" This reverts commit 2f54f4e. * lint * add inbound/outbound count for health + deploy to stage * change ports back * update count * lint * update conns * lint * remove connected peer count * test blocked ports * Revert "remove connected peer count" This reverts commit 79e2b94. * leave only active peers count * Revert "test blocked ports" This reverts commit 6fc9282. * ci to stage * add mutex to nodes access * refactor: node health API (#1222) * refactor: node health API * added cpu_cores to healthcheck output * fix inbound/outbound stats * Remove CPU core reporting --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> * feat: subscribe to a random subnet with 0 validators (#1245) * feat: subscribe to a random subnet with 0 validators * Fix: health route host addres (#1246) * node health api route advertises the host addresses from the config. * set up listenaddress directly * feat: rate limit inbound connections by IP (#1226) * feat: rate limit inbound connections by IP * activate conngater * deploy to 5--8 * fix * fix * Refactor connection gating in p2p setup * Update ipLimiter parameters * revert gitlab * Revert "revert gitlab" This reverts commit fcc7902. * Revert "Revert "revert gitlab"" This reverts commit feb9e4e. --------- Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> --------- Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> Co-authored-by: Lior Rutenberg <liorr@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com> Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: Anton Korpusenko <antokorp@gmail.com> Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
* fix: stale operator ID in `p2pNetwork` (#1229) (#1230) * fix: stale operator ID in `p2pNetwork` --------- Co-authored-by: Lior Rutenberg <liorr@blox.io> * v1.2.2 (#1248) * fix: stale operator ID in `p2pNetwork` (#1229) * fix: stale operator ID in `p2pNetwork` --------- Co-authored-by: Lior Rutenberg <liorr@blox.io> * fix: sufficient timeout for initial duty fetch (#1214) * fix: sufficient timeout for initial duty fetch (#1214) --------- Co-authored-by: Matus Kysel <matus@blox.io> * deployment: change bootnode for holesky-stage (#1233) * deploy new bootnode ENR for holesky stage * Voluntary exit (#1200) * Voluntary exit --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> * feat: include 'connected' in peer scores log (#1224) * feat: include 'connected' in peer scores log * P2P metrics (#1163) * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * metrics: added p4 score (invalidMessageDeliveries) * metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification * deploy to the stage # Conflicts: # .gitlab-ci.yml * metrics: added counters flush once in 8h * metrics: added counters for duties * Added auto delete disconnected peers label metrics. Added RSA verifications metric * disabled ci * Updated p4 score metric update * review comments fixes * deploy * deploy * testing metrics differ * added duties created and finalized * removed extra monitoring.metricsreporter usage * removed topic from p4score. made it a sum of squares * deploy to all nodes * deploy p2p_metrics * pr review fixes * trigger ci * deploy 9-20 * Fix msg validation rsa verification counter metric * allocate p2p_metrics * inspect scores more frequently, but don't log every time * renamed duties created/finalized metrics and refactored to record role * deploy to 5--8 * deploy to 5--8 * rename metric * blank space * approve spec change (just a metric) * refactors * reset scores * metric help msg * revert gitlab * deploy * revert deploy * Add metrics for signature verifications * Update differ.config.yaml with approved changes --------- Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: moshe-blox <moshe@blox.io> * Node health (#1203) * chore: link to SSV API docs in configs & README * initial commit * add node health route to ssv API * update health route * update health route * update health route * update health route * deploy to stage * add plaintext response * lint * lint * change to good/bad * lint * lint * refactor * Revert "deploy to stage" This reverts commit 2f54f4e. * lint * add inbound/outbound count for health + deploy to stage * change ports back * update count * lint * update conns * lint * remove connected peer count * test blocked ports * Revert "remove connected peer count" This reverts commit 79e2b94. * leave only active peers count * Revert "test blocked ports" This reverts commit 6fc9282. * ci to stage * add mutex to nodes access * refactor: node health API (#1222) * refactor: node health API * added cpu_cores to healthcheck output * fix inbound/outbound stats * Remove CPU core reporting --------- Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> * feat: subscribe to a random subnet with 0 validators (#1245) * feat: subscribe to a random subnet with 0 validators * Fix: health route host addres (#1246) * node health api route advertises the host addresses from the config. * set up listenaddress directly * feat: rate limit inbound connections by IP (#1226) * feat: rate limit inbound connections by IP * activate conngater * deploy to 5--8 * fix * fix * Refactor connection gating in p2p setup * Update ipLimiter parameters * revert gitlab * Revert "revert gitlab" This reverts commit fcc7902. * Revert "Revert "revert gitlab"" This reverts commit feb9e4e. --------- Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> --------- Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> Co-authored-by: Lior Rutenberg <liorr@blox.io> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com> Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: Anton Korpusenko <antokorp@gmail.com> Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com> * Revert "v1.2.2 (#1248)" (#1263) This reverts commit ff2f594. * chore: fix some function names Signed-off-by: linghuying <1599935829@qq.com> --------- Signed-off-by: linghuying <1599935829@qq.com> Co-authored-by: Lior Rutenberg <liorr@blox.io> Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com> Co-authored-by: rehs0y <lyosher@gmail.com> Co-authored-by: Matus Kysel <matus@blox.io> Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com> Co-authored-by: moshe-blox <moshe@blox.io> Co-authored-by: olegshmuelov <oleg@blox.io> Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com> Co-authored-by: Anton Korpusenko <antokorp@gmail.com> Co-authored-by: Anton Korpusenko <anton@blox.io> Co-authored-by: Gal Rogozinski <galrogogit@gmail.com> Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
No description provided.