Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

P2P metrics #1163

Merged
merged 52 commits into from
Dec 19, 2023
Merged

P2P metrics #1163

merged 52 commits into from
Dec 19, 2023

Conversation

AKorpusenko
Copy link
Contributor

No description provided.

@AKorpusenko AKorpusenko added enhancement New feature or request DO NOT MERGE labels Oct 17, 2023
@AKorpusenko AKorpusenko self-assigned this Oct 17, 2023
@MatusKysel MatusKysel force-pushed the p2p_metrics branch 3 times, most recently from 6482f45 to 03fb228 Compare October 25, 2023 12:11
@@ -188,6 +200,12 @@ func New(opts ...Option) *MetricsReporter {
}
}

// reset metric every hour because it can grow infinitely
async.Interval(ctx, 8*time.Hour, func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are infinitely growing counts an issue? Should we perhaps just use rate per second on dashboards? Also, comment says every hour, but the interval is 8 hours

Copy link
Contributor

@moshe-blox moshe-blox Nov 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkryuchkov it's an issue because you discover more unique peers over time. After a few weeks of uptime, you potentially have seen many thousands of peers.

@AKorpusenko Rather than resetting the metrics (which might disrupt the Grafana chart), I think we can just call DeleteLabelValues on a peer_id after we've disconnected from it. That way, you'd only have 60 (max peers) different values top per metric.

messagesReceivedTotal does not need to be reset at all, as it has no labels and it's not growing in size (you can confirm this in your localhost:15000/metrics)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nkryuchkov Infinitely growing values (a counter) for a given metric is fine. What we are discussing here as an issue is an infinitely growing amount of labels, e.g.

metric{key="peer1"} 0
metric{key="peer2"} 0
...

Prometheus doesn't deal well with these case scenarios at large scale.

@@ -24,6 +24,12 @@ var (
Name: "ssv:p2p:pubsub:score:inspect",
Help: "Gauge for negative peer scores",
}, []string{"pid"})

// score.go invalidMessageDeliveries value per topic
metricPubSubPeerP4Score = promauto.NewGaugeVec(prometheus.GaugeOpts{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "P4" mean?

network/topics/metrics.go Outdated Show resolved Hide resolved
@AKorpusenko AKorpusenko changed the title Peer metrics P2P metrics Oct 27, 2023
@liorrutenberg
Copy link
Contributor

@AKorpusenko @MatusKysel please include @zktaiga as a reviewer once you ready

Copy link
Contributor

@zktaiga zktaiga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @moshe-blox - I think it's a great use case for DeleteLabelValues(). It allows us to have fixed and predictable cardinality.

I think where possible we should avoid the Reset() pattern on a predefined interval, today an interval might seem reasonable but produce a lot of noise in the future.

The rest looks good to me!

@moshe-blox
Copy link
Contributor

moshe-blox commented Dec 17, 2023

Made a few changes:

  • Inspect scores and update metrics every minute rather than 5 minutes, but still log every 5 minutes only
  • Added "role" label to instances started/decided metrics, for more granularity
  • Replaced hourly score Reset() with a quick Reset() right before copying the scores from pubsub's PeerScoreInspector
  • Renamed metrics from "duties" to "instances" (because it only tracks QBFTs and not ValidatorRegistration duties, etc..)
  • Created and refactored metrics interfaces

Peer score metrics update much more often now (left vs. right):
image

Potential issue:
The metrics with peer.ID record scores also for disconnected peers, even after Reset(), because in each PeerScoreInspector we record everything pubsub reports, and pubsub seems to remember all or nearly all peers, so this might lead to memory issues if someone attacks with lots of unique peer IDs?

MatusKysel
MatusKysel previously approved these changes Dec 18, 2023
MatusKysel
MatusKysel previously approved these changes Dec 18, 2023
@AKorpusenko
Copy link
Contributor Author

@moshe-blox
I've fixed the charts for QBFT instances. Now we can see the start / decided instances counters splitted by roles
image

@moshe-blox
Copy link
Contributor

moshe-blox commented Dec 19, 2023

@AKorpusenko Very nice, thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. IMO the naming is really bad, metricsreporter.MetricsReporter is just super long and repetitive. we could probably go with metrics.Reporter
  2. I think its a bad idea to centralize all metrics code from our codebase into this package. its ok to pass around the struct that will do the actual reporting and add all metrics to it, but I think each package should define its own metrics then register it to the passed metrics reporter.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we discussed over DIscord, this will be done in a future PR to refactor metrics

)

type MetricsReporter struct {
type MetricsReporter interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this interface? Can't we just export metricsReporter

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i agree, i think this interface should be split to multiple interfaces, one in each package where it belongs

and that package can just expose the implementation or maybe we can get rid of this package altogether and do the implementation inside each respective package

// VerifyByOperators verifies signature by the provided operators
// This is a copy of a function with the same name from the spec, except for it's use of
// DeserializeBLSPublicKey function and bounded.CGO
//
// TODO: rethink this function and consider moving/refactoring it.
func VerifyByOperators(s spectypes.Signature, data spectypes.MessageSignature, domain spectypes.DomainType, sigType spectypes.SignatureType, operators []*spectypes.Operator) error {
metricsSignaturesVerifications.WithLabelValues().Inc()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good metric, though is this the only place we verify sigs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@GalRogozinski do we need this to track other signature verification methods as well?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also track RSA sigs in another metric

Are there more places where we do BLS verifications?
This looks like the main function to me

Copy link
Contributor

@moshe-blox moshe-blox Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well I guess better to add

y0sher
y0sher previously approved these changes Dec 19, 2023
Copy link
Contributor

@y0sher y0sher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this regardless of all my comments, I think we need a overhaul to how we do metrics codebase-wide, so this isn't the place to treat all of that. regarding what this PR is meant to handle I think it looks OK.

@moshe-blox moshe-blox dismissed stale reviews from y0sher and MatusKysel via c05364c December 19, 2023 14:55
@y0sher y0sher merged commit f452a79 into stage Dec 19, 2023
5 checks passed
@AKorpusenko AKorpusenko deleted the p2p_metrics branch December 19, 2023 15:59
moshe-blox added a commit that referenced this pull request Jan 2, 2024
* fix: stale operator ID in `p2pNetwork` (#1229)

* fix: stale operator ID in `p2pNetwork`

---------

Co-authored-by: Lior Rutenberg <liorr@blox.io>

* fix: sufficient timeout for initial duty fetch (#1214)

* fix: sufficient timeout for initial duty fetch (#1214)
---------

Co-authored-by: Matus Kysel <matus@blox.io>

* deployment: change bootnode for holesky-stage (#1233)

* deploy new bootnode ENR for holesky stage

* Voluntary exit (#1200)

* Voluntary exit

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>

* feat: include 'connected' in peer scores log (#1224)

* feat: include 'connected' in peer scores log

* P2P metrics  (#1163)

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* review comments fixes

* deploy

* deploy

* testing metrics differ

* added duties created and finalized

* removed extra monitoring.metricsreporter usage

* removed topic from p4score. made it a sum of squares

* deploy to all nodes

* deploy p2p_metrics

* pr review fixes

* trigger ci

* deploy 9-20

* Fix msg validation rsa verification counter metric

* allocate p2p_metrics

* inspect scores more frequently, but don't log every time

* renamed duties created/finalized metrics and refactored to record role

* deploy to 5--8

* deploy to 5--8

* rename metric

* blank space

* approve spec change (just a metric)

* refactors

* reset scores

* metric help msg

* revert gitlab

* deploy

* revert deploy

* Add metrics for signature verifications

* Update differ.config.yaml with approved changes

---------

Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: moshe-blox <moshe@blox.io>

* Node health (#1203)

* chore: link to SSV API docs in configs & README

* initial commit

* add node health route to ssv API

* update health route

* update health route

* update health route

* update health route

* deploy to stage

* add plaintext response

* lint

* lint

* change to good/bad

* lint

* lint

* refactor

* Revert "deploy to stage"

This reverts commit 2f54f4e.

* lint

* add inbound/outbound count for health + deploy to stage

* change ports back

* update count

* lint

* update conns

* lint

* remove connected peer count

* test blocked ports

* Revert "remove connected peer count"

This reverts commit 79e2b94.

* leave only active peers count

* Revert "test blocked ports"

This reverts commit 6fc9282.

* ci to stage

* add mutex to nodes access

* refactor: node health API (#1222)

* refactor: node health API

* added cpu_cores to healthcheck output

* fix inbound/outbound stats

* Remove CPU core reporting

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>

* feat: subscribe to a random subnet with 0 validators (#1245)

* feat: subscribe to a random subnet with 0 validators

* Fix: health route host addres (#1246)

* node health api route advertises the host addresses from the config.

* set up listenaddress directly

* feat: rate limit inbound connections by IP (#1226)

* feat: rate limit inbound connections by IP

* activate conngater

* deploy to 5--8

* fix

* fix

* Refactor connection gating in p2p setup

* Update ipLimiter parameters

* revert gitlab

* Revert "revert gitlab"

This reverts commit fcc7902.

* Revert "Revert "revert gitlab""

This reverts commit feb9e4e.

---------

Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>

---------

Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>
Co-authored-by: Lior Rutenberg <liorr@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com>
Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: Anton Korpusenko <antokorp@gmail.com>
Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
y0sher added a commit that referenced this pull request Jul 28, 2024
* fix: stale operator ID in `p2pNetwork` (#1229) (#1230)

* fix: stale operator ID in `p2pNetwork`

---------

Co-authored-by: Lior Rutenberg <liorr@blox.io>

* v1.2.2 (#1248)

* fix: stale operator ID in `p2pNetwork` (#1229)

* fix: stale operator ID in `p2pNetwork`

---------

Co-authored-by: Lior Rutenberg <liorr@blox.io>

* fix: sufficient timeout for initial duty fetch (#1214)

* fix: sufficient timeout for initial duty fetch (#1214)
---------

Co-authored-by: Matus Kysel <matus@blox.io>

* deployment: change bootnode for holesky-stage (#1233)

* deploy new bootnode ENR for holesky stage

* Voluntary exit (#1200)

* Voluntary exit

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>

* feat: include 'connected' in peer scores log (#1224)

* feat: include 'connected' in peer scores log

* P2P metrics  (#1163)

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* metrics: added p4 score (invalidMessageDeliveries)

* metrics: added total msgs accepted, msgs accepted from peer with Id, signatures verification

* deploy to the stage

# Conflicts:
#	.gitlab-ci.yml

* metrics: added counters flush once in 8h

* metrics: added counters for duties

* Added auto delete disconnected peers label metrics. Added RSA verifications metric

* disabled ci

* Updated p4 score metric update

* review comments fixes

* deploy

* deploy

* testing metrics differ

* added duties created and finalized

* removed extra monitoring.metricsreporter usage

* removed topic from p4score. made it a sum of squares

* deploy to all nodes

* deploy p2p_metrics

* pr review fixes

* trigger ci

* deploy 9-20

* Fix msg validation rsa verification counter metric

* allocate p2p_metrics

* inspect scores more frequently, but don't log every time

* renamed duties created/finalized metrics and refactored to record role

* deploy to 5--8

* deploy to 5--8

* rename metric

* blank space

* approve spec change (just a metric)

* refactors

* reset scores

* metric help msg

* revert gitlab

* deploy

* revert deploy

* Add metrics for signature verifications

* Update differ.config.yaml with approved changes

---------

Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: moshe-blox <moshe@blox.io>

* Node health (#1203)

* chore: link to SSV API docs in configs & README

* initial commit

* add node health route to ssv API

* update health route

* update health route

* update health route

* update health route

* deploy to stage

* add plaintext response

* lint

* lint

* change to good/bad

* lint

* lint

* refactor

* Revert "deploy to stage"

This reverts commit 2f54f4e.

* lint

* add inbound/outbound count for health + deploy to stage

* change ports back

* update count

* lint

* update conns

* lint

* remove connected peer count

* test blocked ports

* Revert "remove connected peer count"

This reverts commit 79e2b94.

* leave only active peers count

* Revert "test blocked ports"

This reverts commit 6fc9282.

* ci to stage

* add mutex to nodes access

* refactor: node health API (#1222)

* refactor: node health API

* added cpu_cores to healthcheck output

* fix inbound/outbound stats

* Remove CPU core reporting

---------

Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>

* feat: subscribe to a random subnet with 0 validators (#1245)

* feat: subscribe to a random subnet with 0 validators

* Fix: health route host addres (#1246)

* node health api route advertises the host addresses from the config.

* set up listenaddress directly

* feat: rate limit inbound connections by IP (#1226)

* feat: rate limit inbound connections by IP

* activate conngater

* deploy to 5--8

* fix

* fix

* Refactor connection gating in p2p setup

* Update ipLimiter parameters

* revert gitlab

* Revert "revert gitlab"

This reverts commit fcc7902.

* Revert "Revert "revert gitlab""

This reverts commit feb9e4e.

---------

Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>

---------

Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>
Co-authored-by: Lior Rutenberg <liorr@blox.io>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com>
Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: Anton Korpusenko <antokorp@gmail.com>
Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>

* Revert "v1.2.2 (#1248)" (#1263)

This reverts commit ff2f594.

* chore: fix some function names

Signed-off-by: linghuying <1599935829@qq.com>

---------

Signed-off-by: linghuying <1599935829@qq.com>
Co-authored-by: Lior Rutenberg <liorr@blox.io>
Co-authored-by: moshe-blox <89339422+moshe-blox@users.noreply.github.com>
Co-authored-by: rehs0y <lyosher@gmail.com>
Co-authored-by: Matus Kysel <matus@blox.io>
Co-authored-by: Nikita Kryuchkov <nkryuchkov10@gmail.com>
Co-authored-by: moshe-blox <moshe@blox.io>
Co-authored-by: olegshmuelov <oleg@blox.io>
Co-authored-by: MatheusFranco99 <48058141+MatheusFranco99@users.noreply.github.com>
Co-authored-by: Anton Korpusenko <antokorp@gmail.com>
Co-authored-by: Anton Korpusenko <anton@blox.io>
Co-authored-by: Gal Rogozinski <galrogogit@gmail.com>
Co-authored-by: Pavel Krolevets <pavelkrolevets@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants