Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: Add additional metrics to p2p and consensus #2425

Merged
merged 13 commits into from
Sep 25, 2018

Conversation

mslipper
Copy link
Contributor

@mslipper mslipper commented Sep 18, 2018

Partially addresses cosmos/cosmos-sdk#2169.

  • Updated all relevant documentation in docs
  • Updated all code comments where relevant
  • Wrote tests
  • Updated CHANGELOG_PENDING.md

@mslipper
Copy link
Contributor Author

mslipper commented Sep 18, 2018

Two things I'd like direct feedback on:

  1. It seems like the RPC API uses the value of FastSync to determine if the node is syncing or not. I've followed that model for the metrics. Let me know if this isn't correct.
  2. I'm using a goroutine to poll the Status() method on MConnection to get the state of the send queue, since it seemed expensive to calculate all the time and I didn't want to inject metrics into the MConnection struct.

I've also deliberately avoided implementing the configurable namespacing with this PR - will do that separately so as to ease review.

@xla xla self-assigned this Sep 18, 2018
@xla xla added C:p2p Component: P2P pkg C:consensus Component: Consensus T:observability Type: Observability labels Sep 18, 2018
Copy link
Contributor

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start! I've left a few comments that needs to be addressed though

.gitignore Outdated
@@ -26,6 +26,7 @@ scripts/cutWALUntil/cutWALUntil
*.iml

libs/pubsub/query/fuzz_test/output
libs/db/test*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about this. it should be fixed on develop as of now. please rebase and remove

@@ -0,0 +1,3 @@
package config

const MetricsNamespace = "tendermint"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a variable to allow modifications?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was thinking I'd make it configurable in a separate PR that would add support for passed-in namespaces.

@@ -38,75 +41,103 @@ type Metrics struct {
BlockSizeBytes metrics.Gauge
// Total number of transactions.
TotalTxs metrics.Gauge
// The latest block height.
LatestBlockHeight metrics.Gauge
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be a confusion with Height. Maybe we should rename it to CommittedHeight?

@@ -8,7 +8,7 @@ import (

"github.com/pkg/errors"

amino "github.com/tendermint/go-amino"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do no remove prefix

@@ -814,6 +819,16 @@ func (conR *ConsensusReactor) StringIndented(indent string) string {
return s
}

func (conR *ConsensusReactor) setCatchingUp() {
var catchingUp float64
if conR.fastSync {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Whether or not a node is synced. 0 if syncing, 1 if synced." 1 and 0 mixed up

if you're adding this condition, you need to modify metric name and description to be

"Whether or not a node is fast synced. 1 if fast syncing, 0 if synced or syncing without fast sync."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nice catch.

}

// NewConsensusReactor returns a new ConsensusReactor with the given
// consensusState.
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool) *ConsensusReactor {
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool, csMetrics *Metrics) *ConsensusReactor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metrics should be an option. see consensus/state.go for example


// WithMetrics sets the metrics.
func WithMetrics(metrics *Metrics) CSOption {
	return func(cs *ConsensusState) { cs.metrics = metrics }
}

stdprometheus "github.com/prometheus/client_golang/prometheus"
"github.com/tendermint/tendermint/config"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tmcfg to be consistent or better just cfg

p2p/peer.go Outdated
}

p.metrics.PeerPendingSendBytes.With("peer-id", string(p.ID())).Set(sendQueueSize)
case <-p.quitChan:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

peer already has a quit channel: p.Quit()

p2p/peer.go Outdated
return err
}

go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please extract to a separate function

p2p/peer.go Outdated
@@ -109,12 +116,16 @@ func newPeer(
reactorsByCh map[byte]Reactor,
chDescs []*tmconn.ChannelDescriptor,
onPeerError func(Peer, interface{}),
metrics *Metrics,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, should be an option, not a requirement

Copy link
Contributor

@xla xla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up, definitely a good set of important datapoints for increased introspectability.

While most of the new metrics seem in the right place, the changes to the Peer are awfully intrusive and have time sensistivity and own stateful setup cost. I don't think we are well served adding more stop-gap solutions to that part of the codebase. We would be better served with the conn being an interface and have the actual implementation wrapped with an instrumenting one to keep concerns isolated. Another red flag is the peer-id label, which we have to investigate will not lead to a cardinality explosion as this can be the cause for degraded performance of the monitoring pipeline to the point of it being downright unusable if the values in that label exceed the 10s/100s.

We seem to have introduced a data race as well with this change.

@@ -19,4 +19,6 @@ FEATURES:

IMPROVEMENTS:

- Added additional metrics to p2p and consensus
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our CHANGELOG format and tense for entries looks like this: `[component] effect in present tense (#issure reference, @contributor)

So I'd propose:

[consensus] Add additional metrics
[p2p] Add addtional metrics

Alternatively expanding what metrics have been added and a reference to all issues concerning those changes.

stdprometheus "github.com/prometheus/client_golang/prometheus"
tmcfg "github.com/tendermint/tendermint/config"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect bundling of imports, we follwo this convention:

import (
	// stdlib packages

	// external packages

	// tendermint packages
)

}

// PrometheusMetrics returns Metrics build using Prometheus client library.
func PrometheusMetrics() *Metrics {
return &Metrics{
Height: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{
Subsystem: "consensus",
Namespace: tmcfg.MetricsNamespace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are trying to move away from our package having a dependency on config, while it's tempting it leaks to much and leads to hard to reason about states. A better approach here would be to have namespace be a param fro the function.

}

// NewConsensusReactor returns a new ConsensusReactor with the given
// consensusState.
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool) *ConsensusReactor {
func NewConsensusReactor(consensusState *ConsensusState, fastSync bool, csMetrics *Metrics) *ConsensusReactor {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of requiring it to be passed this would be better served as an option. We employ functional options in a couple of places for this use-case. The default would be the NopMetrics() and only overriden if passed in. For reference:

// SocketPVOption sets an optional parameter on the SocketPV.
type SocketPVOption func(*SocketPV)
// SocketPVAcceptDeadline sets the deadline for the SocketPV listener.
// A zero time value disables the deadline.
func SocketPVAcceptDeadline(deadline time.Duration) SocketPVOption {
return func(sc *SocketPV) { sc.acceptDeadline = deadline }
}

stdprometheus "github.com/prometheus/client_golang/prometheus"
"github.com/tendermint/tendermint/config"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment about import order.

@@ -19,6 +20,7 @@ type Metrics struct {
func PrometheusMetrics() *Metrics {
return &Metrics{
Size: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{
Namespace: config.MetricsNamespace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment about dependency on config and passing in of the namespace.

node/node.go Outdated
@@ -279,6 +279,8 @@ func NewNode(config *cfg.Config,
bcReactor := bc.NewBlockchainReactor(state.Copy(), blockExec, blockStore, fastSync)
bcReactor.SetLogger(logger.With("module", "blockchain"))

csm := cs.WithMetrics(csMetrics)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason to break it out into its own variable?

p2p/metrics.go Outdated
stdprometheus "github.com/prometheus/client_golang/prometheus"
"github.com/tendermint/tendermint/config"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment about import order.

p2p/metrics.go Outdated
}

// PrometheusMetrics returns Metrics build using Prometheus client library.
func PrometheusMetrics() *Metrics {
return &Metrics{
Peers: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{
Subsystem: "p2p",
Namespace: config.MetricsNamespace,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment about depending on config and passing of namespace.

@mslipper mslipper changed the base branch from master to develop September 19, 2018 03:42
@mslipper
Copy link
Contributor Author

@xla @melekes updates in.

@xla - the data race is related to the Status() method. It already has a TODO indicating that atomic should be used: https://github.com/tendermint/tendermint/blob/master/p2p/conn/connection.go#L588. Would you like me to refactor that code to use atomic?

Re: separating the metrics code into an instrumenting interface - I'm happy to do this as part of this PR, but I'm not sure that the overhead of adding metrics directly to the peer is very high. From the discussion that spawned this issue (cosmos/cosmos-sdk#2169), my understanding is that Prometheus metrics are very lightweight and there isn't a common pattern in the codebase yet to expose metrics. The only 'heavy' piece of this implementation is the goroutine that polls a peer's status, but I don't think running a fairly tight loop once every 10 seconds will be a huge performance hit.

config/config.go Show resolved Hide resolved
@@ -38,75 +39,102 @@ type Metrics struct {
BlockSizeBytes metrics.Gauge
// Total number of transactions.
TotalTxs metrics.Gauge
// The latest block height.
CommittedHeight metrics.Gauge
// Whether or not a node is fast synced. 1 if yes, 0 if no.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synced -> syncing

FastSyncing: prometheus.NewGaugeFrom(stdprometheus.GaugeOpts{
Namespace: namespace,
Subsystem: MetricsSubsystem,
Name: "catching_up",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fast_syncing

Namespace: namespace,
Subsystem: MetricsSubsystem,
Name: "catching_up",
Help: "Whether or not a node is synced. 0 if syncing, 1 if synced.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether or not a node is fast syncing. 1 if yes, 0 if no.

}

type ROption func(*ConsensusReactor)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReactorOption is fine too

// WithMetrics sets the metrics.
func WithMetrics(metrics *Metrics) CSOption {
// StateWithMetrics sets the metrics.
func StateWithMetrics(metrics *Metrics) CSOption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StateMetrics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CSOption -> StateOption

p2p/peer.go Outdated
@@ -13,6 +13,8 @@ import (
tmconn "github.com/tendermint/tendermint/p2p/conn"
)

const MetricsTickerDuration = 10 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to expose it?

p2p/peer.go Outdated
return err
}

go p.metricsCollector()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename to metricsReporter()

p2p/peer.go Outdated
@@ -314,16 +343,41 @@ func (p *peer) String() string {
return fmt.Sprintf("Peer{%v %v in}", p.mconn, p.ID())
}

func PeerWithMetrics(metrics *Metrics) PeerOption {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PeerMetrics

p2p/peer.go Outdated
@@ -314,16 +343,41 @@ func (p *peer) String() string {
return fmt.Sprintf("Peer{%v %v in}", p.mconn, p.ID())
}

func PeerWithMetrics(metrics *Metrics) PeerOption {
return func(p *peer) {
if metrics != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why compare?

Copy link
Contributor

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks

@mslipper
Copy link
Contributor Author

Cool. I have another PR with a bunch more metrics once this guy is merged :).

Copy link
Contributor

@xla xla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up!

@xla - the data race is related to the Status() method. It already has a TODO indicating that atomic should be used: https://github.com/tendermint/tendermint/blob/master/p2p/conn/connection.go#L588. Would you like me to refactor that code to use atomic?

That would be really helpful, it seems that the recent changes surfaced it more prominently.

Re: separating the metrics code into an instrumenting interface - I'm happy to do this as part of this PR, but I'm not sure that the overhead of adding metrics directly to the peer is very high. From the discussion that spawned this issue (cosmos/cosmos-sdk#2169), my understanding is that Prometheus metrics are very lightweight and there isn't a common pattern in the codebase yet to expose metrics. The only 'heavy' piece of this implementation is the goroutine that polls a peer's status, but I don't think running a fairly tight loop once every 10 seconds will be a huge performance hit.

You are right, this is outside of the scope of the changes you are working on and better suited for the ongoing p2p refactor, see #2067. My original comment was not in regards to the time complexity of gathering the metrics, that's not a concern. I was referring to the added complexity and interleaving of concerns in our code, instead of aiming for clear separation.

@@ -3,11 +3,12 @@ package consensus
import (
"github.com/go-kit/kit/metrics"
"github.com/go-kit/kit/metrics/discard"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This newline is meaningful, we group by: stdlib, external packages, internal packages.

config/toml.go Outdated
@@ -284,6 +284,9 @@ prometheus_listen_addr = "{{ .Instrumentation.PrometheusListenAddr }}"
# you increase your OS limits.
# 0 - unlimited.
max_open_connections = {{ .Instrumentation.MaxOpenConnections }}

# Instrumentation namespace
namespace = {{ .Instrumentation.Namespace }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to break the integration tests, check circle.

p2p/peer.go Outdated
err := p.mconn.Start()
return err
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be inlined: if err := p.mconn.Start(); err != nil {

p2p/peer.go Outdated
ourNodeInfo NodeInfo,
timeout time.Duration,
ourNodeInfo NodeInfo,
timeout time.Duration,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non meaningful whitespace.

p2p/peer.go Outdated
reactorsByCh map[byte]Reactor,
chDescs []*tmconn.ChannelDescriptor,
onPeerError func(Peer, interface{}),
config tmconn.MConnConfig,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non meaningful whitespace.

p2p/peer.go Outdated
reactorsByCh map[byte]Reactor,
chDescs []*tmconn.ChannelDescriptor,
onPeerError func(Peer, interface{}),
options ...PeerOption,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non meaningful whitespace.

p2p/peer.go Outdated
@@ -99,21 +101,30 @@ type peer struct {

// User data
Data *cmn.CMap

metrics *Metrics

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary newline.

@xla xla added this to Ongoing in current iteration via automation Sep 20, 2018
@xla xla added this to the launch milestone Sep 20, 2018
@codecov-io
Copy link

Codecov Report

Merging #2425 into develop will decrease coverage by 0.15%.
The diff coverage is 39.84%.

@@            Coverage Diff             @@
##           develop   #2425      +/-   ##
==========================================
- Coverage    61.56%   61.4%   -0.16%     
==========================================
  Files          197     197              
  Lines        16291   16388      +97     
==========================================
+ Hits         10029   10063      +34     
- Misses        5430    5489      +59     
- Partials       832     836       +4
Impacted Files Coverage Δ
config/toml.go 53.19% <ø> (ø) ⬆️
mempool/metrics.go 27.27% <0%> (-2.73%) ⬇️
p2p/test_util.go 60.15% <0%> (-0.46%) ⬇️
config/config.go 82.67% <100%> (+0.08%) ⬆️
p2p/switch.go 52.49% <100%> (+0.31%) ⬆️
p2p/transport.go 80.37% <100%> (+0.09%) ⬆️
consensus/metrics.go 15.68% <11.9%> (-2.74%) ⬇️
p2p/metrics.go 18.75% <12.5%> (-11.25%) ⬇️
consensus/state.go 77.06% <50%> (+0.02%) ⬆️
p2p/peer.go 59% <66.66%> (+0.35%) ⬆️
... and 6 more

@ebuchman ebuchman modified the milestones: launch, v1.0 Sep 21, 2018
@xla xla changed the title Add additional metrics to p2p and consensus metrics: Add additional metrics to p2p and consensus Sep 25, 2018
@xla
Copy link
Contributor

xla commented Sep 25, 2018

@mslipper Thanks for all the ammendments. Could you resolve the CHANGELOG conflict? After that the PR shouldbe good to go.

@xla xla removed this from Ongoing in current iteration Sep 25, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C:consensus Component: Consensus C:p2p Component: P2P pkg T:observability Type: Observability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants