-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: Add additional metrics to p2p and consensus #2425
Conversation
Two things I'd like direct feedback on:
I've also deliberately avoided implementing the configurable namespacing with this PR - will do that separately so as to ease review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start! I've left a few comments that needs to be addressed though
.gitignore
Outdated
@@ -26,6 +26,7 @@ scripts/cutWALUntil/cutWALUntil | |||
*.iml | |||
|
|||
libs/pubsub/query/fuzz_test/output | |||
libs/db/test* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry about this. it should be fixed on develop as of now. please rebase and remove
config/metrics.go
Outdated
@@ -0,0 +1,3 @@ | |||
package config | |||
|
|||
const MetricsNamespace = "tendermint" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a variable to allow modifications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking I'd make it configurable in a separate PR that would add support for passed-in namespaces.
consensus/metrics.go
Outdated
@@ -38,75 +41,103 @@ type Metrics struct { | |||
BlockSizeBytes metrics.Gauge | |||
// Total number of transactions. | |||
TotalTxs metrics.Gauge | |||
// The latest block height. | |||
LatestBlockHeight metrics.Gauge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be a confusion with Height
. Maybe we should rename it to CommittedHeight
?
consensus/reactor.go
Outdated
@@ -8,7 +8,7 @@ import ( | |||
|
|||
"github.com/pkg/errors" | |||
|
|||
amino "github.com/tendermint/go-amino" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please do no remove prefix
@@ -814,6 +819,16 @@ func (conR *ConsensusReactor) StringIndented(indent string) string { | |||
return s | |||
} | |||
|
|||
func (conR *ConsensusReactor) setCatchingUp() { | |||
var catchingUp float64 | |||
if conR.fastSync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Whether or not a node is synced. 0 if syncing, 1 if synced." 1 and 0 mixed up
if you're adding this condition, you need to modify metric name and description to be
"Whether or not a node is fast synced. 1 if fast syncing, 0 if synced or syncing without fast sync."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, nice catch.
consensus/reactor.go
Outdated
} | ||
|
||
// NewConsensusReactor returns a new ConsensusReactor with the given | ||
// consensusState. | ||
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool) *ConsensusReactor { | ||
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool, csMetrics *Metrics) *ConsensusReactor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metrics should be an option. see consensus/state.go for example
// WithMetrics sets the metrics.
func WithMetrics(metrics *Metrics) CSOption {
return func(cs *ConsensusState) { cs.metrics = metrics }
}
mempool/metrics.go
Outdated
stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
"github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tmcfg to be consistent or better just cfg
p2p/peer.go
Outdated
} | ||
|
||
p.metrics.PeerPendingSendBytes.With("peer-id", string(p.ID())).Set(sendQueueSize) | ||
case <-p.quitChan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
peer already has a quit channel: p.Quit()
p2p/peer.go
Outdated
return err | ||
} | ||
|
||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please extract to a separate function
p2p/peer.go
Outdated
@@ -109,12 +116,16 @@ func newPeer( | |||
reactorsByCh map[byte]Reactor, | |||
chDescs []*tmconn.ChannelDescriptor, | |||
onPeerError func(Peer, interface{}), | |||
metrics *Metrics, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, should be an option, not a requirement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for picking this up, definitely a good set of important datapoints for increased introspectability.
While most of the new metrics seem in the right place, the changes to the Peer are awfully intrusive and have time sensistivity and own stateful setup cost. I don't think we are well served adding more stop-gap solutions to that part of the codebase. We would be better served with the conn being an interface and have the actual implementation wrapped with an instrumenting one to keep concerns isolated. Another red flag is the peer-id
label, which we have to investigate will not lead to a cardinality explosion as this can be the cause for degraded performance of the monitoring pipeline to the point of it being downright unusable if the values in that label exceed the 10s/100s.
We seem to have introduced a data race as well with this change.
CHANGELOG_PENDING.md
Outdated
@@ -19,4 +19,6 @@ FEATURES: | |||
|
|||
IMPROVEMENTS: | |||
|
|||
- Added additional metrics to p2p and consensus |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our CHANGELOG format and tense for entries looks like this: `[component] effect in present tense (#issure reference, @contributor)
So I'd propose:
[consensus] Add additional metrics
[p2p] Add addtional metrics
Alternatively expanding what metrics have been added and a reference to all issues concerning those changes.
consensus/metrics.go
Outdated
stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
tmcfg "github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect bundling of imports, we follwo this convention:
import (
// stdlib packages
// external packages
// tendermint packages
)
consensus/metrics.go
Outdated
} | ||
|
||
// PrometheusMetrics returns Metrics build using Prometheus client library. | ||
func PrometheusMetrics() *Metrics { | ||
return &Metrics{ | ||
Height: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | ||
Subsystem: "consensus", | ||
Namespace: tmcfg.MetricsNamespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are trying to move away from our package having a dependency on config, while it's tempting it leaks to much and leads to hard to reason about states. A better approach here would be to have namespace be a param fro the function.
consensus/reactor.go
Outdated
} | ||
|
||
// NewConsensusReactor returns a new ConsensusReactor with the given | ||
// consensusState. | ||
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool) *ConsensusReactor { | ||
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool, csMetrics *Metrics) *ConsensusReactor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of requiring it to be passed this would be better served as an option. We employ functional options in a couple of places for this use-case. The default would be the NopMetrics()
and only overriden if passed in. For reference:
Lines 41 to 48 in d419fff
// SocketPVOption sets an optional parameter on the SocketPV. | |
type SocketPVOption func(*SocketPV) | |
// SocketPVAcceptDeadline sets the deadline for the SocketPV listener. | |
// A zero time value disables the deadline. | |
func SocketPVAcceptDeadline(deadline time.Duration) SocketPVOption { | |
return func(sc *SocketPV) { sc.acceptDeadline = deadline } | |
} |
mempool/metrics.go
Outdated
stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
"github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment about import order.
mempool/metrics.go
Outdated
@@ -19,6 +20,7 @@ type Metrics struct { | |||
func PrometheusMetrics() *Metrics { | |||
return &Metrics{ | |||
Size: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | |||
Namespace: config.MetricsNamespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment about dependency on config and passing in of the namespace.
node/node.go
Outdated
@@ -279,6 +279,8 @@ func NewNode(config *cfg.Config, | |||
bcReactor := bc.NewBlockchainReactor(state.Copy(), blockExec, blockStore, fastSync) | |||
bcReactor.SetLogger(logger.With("module", "blockchain")) | |||
|
|||
csm := cs.WithMetrics(csMetrics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason to break it out into its own variable?
p2p/metrics.go
Outdated
stdprometheus "github.com/prometheus/client_golang/prometheus" | ||
"github.com/tendermint/tendermint/config" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment about import order.
p2p/metrics.go
Outdated
} | ||
|
||
// PrometheusMetrics returns Metrics build using Prometheus client library. | ||
func PrometheusMetrics() *Metrics { | ||
return &Metrics{ | ||
Peers: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | ||
Subsystem: "p2p", | ||
Namespace: config.MetricsNamespace, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment about depending on config and passing of namespace.
9936e82
to
9fd2344
Compare
@xla - the data race is related to the Re: separating the metrics code into an |
consensus/metrics.go
Outdated
@@ -38,75 +39,102 @@ type Metrics struct { | |||
BlockSizeBytes metrics.Gauge | |||
// Total number of transactions. | |||
TotalTxs metrics.Gauge | |||
// The latest block height. | |||
CommittedHeight metrics.Gauge | |||
// Whether or not a node is fast synced. 1 if yes, 0 if no. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
synced
-> syncing
consensus/metrics.go
Outdated
FastSyncing: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{ | ||
Namespace: namespace, | ||
Subsystem: MetricsSubsystem, | ||
Name: "catching_up", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fast_syncing
consensus/metrics.go
Outdated
Namespace: namespace, | ||
Subsystem: MetricsSubsystem, | ||
Name: "catching_up", | ||
Help: "Whether or not a node is synced. 0 if syncing, 1 if synced.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether or not a node is fast syncing. 1 if yes, 0 if no.
consensus/reactor.go
Outdated
} | ||
|
||
type ROption func(*ConsensusReactor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReactorOption
is fine too
consensus/state.go
Outdated
// WithMetrics sets the metrics. | ||
func WithMetrics(metrics *Metrics) CSOption { | ||
// StateWithMetrics sets the metrics. | ||
func StateWithMetrics(metrics *Metrics) CSOption { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StateMetrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CSOption
-> StateOption
p2p/peer.go
Outdated
@@ -13,6 +13,8 @@ import ( | |||
tmconn "github.com/tendermint/tendermint/p2p/conn" | |||
) | |||
|
|||
const MetricsTickerDuration = 10 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to expose it?
p2p/peer.go
Outdated
return err | ||
} | ||
|
||
go p.metricsCollector() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename to metricsReporter()
p2p/peer.go
Outdated
@@ -314,16 +343,41 @@ func (p *peer) String() string { | |||
return fmt.Sprintf("Peer{%v %v in}", p.mconn, p.ID()) | |||
} | |||
|
|||
func PeerWithMetrics(metrics *Metrics) PeerOption { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PeerMetrics
p2p/peer.go
Outdated
@@ -314,16 +343,41 @@ func (p *peer) String() string { | |||
return fmt.Sprintf("Peer{%v %v in}", p.mconn, p.ID()) | |||
} | |||
|
|||
func PeerWithMetrics(metrics *Metrics) PeerOption { | |||
return func(p *peer) { | |||
if metrics != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why compare?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks
Cool. I have another PR with a bunch more metrics once this guy is merged :). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the follow-up!
@xla - the data race is related to the
Status()
method. It already has a TODO indicating thatatomic
should be used: https://github.com/tendermint/tendermint/blob/master/p2p/conn/connection.go#L588. Would you like me to refactor that code to use atomic?
That would be really helpful, it seems that the recent changes surfaced it more prominently.
Re: separating the metrics code into an
instrumenting
interface - I'm happy to do this as part of this PR, but I'm not sure that the overhead of adding metrics directly to the peer is very high. From the discussion that spawned this issue (cosmos/cosmos-sdk#2169), my understanding is that Prometheus metrics are very lightweight and there isn't a common pattern in the codebase yet to expose metrics. The only 'heavy' piece of this implementation is the goroutine that polls a peer's status, but I don't think running a fairly tight loop once every 10 seconds will be a huge performance hit.
You are right, this is outside of the scope of the changes you are working on and better suited for the ongoing p2p refactor, see #2067. My original comment was not in regards to the time complexity of gathering the metrics, that's not a concern. I was referring to the added complexity and interleaving of concerns in our code, instead of aiming for clear separation.
consensus/metrics.go
Outdated
@@ -3,11 +3,12 @@ package consensus | |||
import ( | |||
"github.com/go-kit/kit/metrics" | |||
"github.com/go-kit/kit/metrics/discard" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This newline is meaningful, we group by: stdlib
, external packages
, internal packages
.
config/toml.go
Outdated
@@ -284,6 +284,9 @@ prometheus_listen_addr = "{{ .Instrumentation.PrometheusListenAddr }}" | |||
# you increase your OS limits. | |||
# 0 - unlimited. | |||
max_open_connections = {{ .Instrumentation.MaxOpenConnections }} | |||
|
|||
# Instrumentation namespace | |||
namespace = {{ .Instrumentation.Namespace }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to break the integration tests, check circle.
p2p/peer.go
Outdated
err := p.mconn.Start() | ||
return err | ||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be inlined: if err := p.mconn.Start(); err != nil {
p2p/peer.go
Outdated
ourNodeInfo NodeInfo, | ||
timeout time.Duration, | ||
ourNodeInfo NodeInfo, | ||
timeout time.Duration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non meaningful whitespace.
p2p/peer.go
Outdated
reactorsByCh map[byte]Reactor, | ||
chDescs []*tmconn.ChannelDescriptor, | ||
onPeerError func(Peer, interface{}), | ||
config tmconn.MConnConfig, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non meaningful whitespace.
p2p/peer.go
Outdated
reactorsByCh map[byte]Reactor, | ||
chDescs []*tmconn.ChannelDescriptor, | ||
onPeerError func(Peer, interface{}), | ||
options ...PeerOption, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non meaningful whitespace.
p2p/peer.go
Outdated
@@ -99,21 +101,30 @@ type peer struct { | |||
|
|||
// User data | |||
Data *cmn.CMap | |||
|
|||
metrics *Metrics | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary newline.
Codecov Report
@@ Coverage Diff @@
## develop #2425 +/- ##
==========================================
- Coverage 61.56% 61.4% -0.16%
==========================================
Files 197 197
Lines 16291 16388 +97
==========================================
+ Hits 10029 10063 +34
- Misses 5430 5489 +59
- Partials 832 836 +4
|
@mslipper Thanks for all the ammendments. Could you resolve the CHANGELOG conflict? After that the PR shouldbe good to go. |
Partially addresses cosmos/cosmos-sdk#2169.