Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hare inbound gossip message processing delays #2478

Closed
lrettig opened this issue Jun 24, 2021 · 8 comments
Closed

Hare inbound gossip message processing delays #2478

lrettig opened this issue Jun 24, 2021 · 8 comments
Assignees
Milestone

Comments

@lrettig
Copy link
Member

lrettig commented Jun 24, 2021

Description

Nodes are taking as long as a minute to receive inbound Hare gossip messages and report them as valid (after which they're gossiped on to other peers).

I can't tell exactly what's going on under the hood without full debug logs. It could be an issue with the Hare broker priority queue, message validation, or something else. I suspect it's happening during the Hare broker eventLoop.

Environment

Please complete the following information:

  • OS: Linux
  • Node Version: v0.1.41

Additional Resources

image
from https://testnet-kibana.spacemesh.io/goto/29beacaf0fe3b5c3757deed8d6f8788a

@lrettig
Copy link
Member Author

lrettig commented Jun 25, 2021

Dug a bit deeper using debug logs:

image
full data:
hare backlog (1).csv
hare backlog.csv
from https://testnet-kibana.spacemesh.io/goto/91ecf641e57a09491279144a2f4d0d67

The hare broker is getting stuck somewhere in here:

go-spacemesh/hare/broker.go

Lines 232 to 269 in 3645848

msgLogger.With().Debug("broker received hare message", log.Int("queue_length", len(b.inbox)))
// TODO: fix metrics
// metrics.MessageTypeCounter.With("type_id", hareMsg.InnerMsg.Type.String(), "layer", strconv.FormatUint(uint64(msgInstID), 10), "reporter", "brokerHandler").Add(1)
msgInstID := hareMsg.InnerMsg.InstanceID
msgLogger = msgLogger.WithFields(log.FieldNamed("msg_layer_id", types.LayerID(msgInstID)))
isEarly := false
if err := b.validate(messageCtx, hareMsg); err != nil {
if err != errEarlyMsg {
// not early, validation failed
msgLogger.With().Debug("broker received a message to a consensus process that is not registered",
log.Err(err))
continue
}
msgLogger.With().Debug("early message detected", log.Err(err))
isEarly = true
}
// the msg is either early or has instance
// create msg
iMsg, err := newMsg(messageCtx, hareMsg, b.stateQuerier)
if err != nil {
msgLogger.With().Warning("message validation failed: could not construct msg", log.Err(err))
continue
}
// validate msg
if !b.eValidator.Validate(messageCtx, iMsg) {
msgLogger.With().Warning("message validation failed: eligibility validator returned false",
log.String("hare_msg", hareMsg.String()))
continue
}
// validation passed, report
msg.ReportValidation(ctx, protoName)
msgLogger.With().Debug("broker reported hare message as valid", hareMsg)

It seems like it's either the eligibility validator or the newMsg call. Both of these rely on the Hare oracle, which does use a mutex. And there's one or two other parallel hare goroutines also using the Hare oracle from the main consensus process code in algorithm.go, including:

// check participation and send message
go func() {
// check participation
if proc.shouldParticipate(ctx) {
// set pre-round InnerMsg and send
builder, err := proc.initDefaultBuilder(proc.s)
if err != nil {
logger.With().Error("init default builder failed", log.Err(err))
return
}
m := builder.SetType(pre).Sign(proc.signing).Build()
proc.sendMessage(ctx, m)
} else {
logger.With().Info("should not participate",
log.Int32("current_k", proc.k),
types.LayerID(proc.instanceID))
}
}()

But I can't find any blocking calls that would cause it to get stuck for as long as 60 secs.

@lrettig lrettig self-assigned this Jun 25, 2021
@dshulyak
Copy link
Contributor

there is a graph traversal function that may do many reads from disk and it is executed while holding the exclusive lock in the oracle

func (db *DB) CalcActiveSetSize(epoch types.EpochID, blocks map[types.BlockID]struct{}) (map[string]struct{}, error) {
if epoch == 0 {
return nil, errors.New("tried to retrieve active set for epoch 0")
}
firstLayerOfPrevEpoch := (epoch - 1).FirstLayer()
countedAtxs := make(map[string]types.ATXID)
penalties := make(map[string]struct{})
traversalFunc := db.createTraversalActiveSetCounterFunc(countedAtxs, penalties, db.LayersPerEpoch, epoch)
startTime := time.Now()
if err := db.meshDb.ForBlockInView(blocks, firstLayerOfPrevEpoch, traversalFunc); err != nil {
return nil, err
}
db.log.With().Info("done calculating active set size",
log.Int("size", len(countedAtxs)),
log.String("duration", time.Now().Sub(startTime).String()))
result := make(map[string]struct{}, len(countedAtxs))
for k := range countedAtxs {
result[k] = struct{}{}
}
return result, nil
}
. i don't understand why i can't find the log line "done calculating active set size" with timings in the elastic. the text of log has changed between 0.1.41 and current version, but i can't find either.

@moshababo moshababo added this to the June 21 Sprint milestone Jun 30, 2021
@lrettig lrettig added the BLOCKER label Jul 2, 2021
@lrettig
Copy link
Member Author

lrettig commented Jul 2, 2021

@dshulyak you were right:

done calculating active set size | atxDb | 1m13.812238934s

Here it is in the live testnet logs: https://testnet-kibana.spacemesh.io/goto/48b95795d097bdd1cabca9d825b3ca93

@lrettig
Copy link
Member Author

lrettig commented Jul 2, 2021

Each node only needs to calculate this once per epoch, but it takes a very long time (then it's cached). This explains why the issue only pops up sporadically.

As a quick workaround, we could trigger the calculation before it's needed, run it concurrently and cache it. But I think we need to understand why this is so slow (I/O, I suppose) and speed it up.

@lrettig
Copy link
Member Author

lrettig commented Jul 2, 2021

Here's a workaround: 377e908

I'll PR this into develop as well but need to rework it, as testnet and develop have diverged.

@dshulyak
Copy link
Contributor

dshulyak commented Jul 5, 2021

I would also like to profile that graph traversal function to understand how it can take 1m.

@lrettig lrettig removed their assignment Jul 17, 2021
bors bot pushed a commit that referenced this issue Jul 19, 2021
## Motivation
The exclusive lock is acquired in StoreAtx, which may lead to contention if GetAtxHeader is called concurrently many times.
The difference in benchmark is significant:

```
benchmark                                          old ns/op     new ns/op     delta
BenchmarkGetAtxHeaderWithConcurrentStoreAtx-16     68058         13280         -80.49%
```

related: #2478

## Changes
Read lock is not acquired when reading a single key from the database. 


## Test Plan
unit tests

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this issue Jul 19, 2021
## Motivation
The exclusive lock is acquired in StoreAtx, which may lead to contention if GetAtxHeader is called concurrently many times.
The difference in benchmark is significant:

```
benchmark                                          old ns/op     new ns/op     delta
BenchmarkGetAtxHeaderWithConcurrentStoreAtx-16     68058         13280         -80.49%
```

related: #2478

## Changes
Read lock is not acquired when reading a single key from the database. 


## Test Plan
unit tests

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this issue Jul 19, 2021
## Motivation
The exclusive lock is acquired in StoreAtx, which may lead to contention if GetAtxHeader is called concurrently many times.
The difference in benchmark is significant:

```
benchmark                                          old ns/op     new ns/op     delta
BenchmarkGetAtxHeaderWithConcurrentStoreAtx-16     68058         13280         -80.49%
```

related: #2478

## Changes
Read lock is not acquired when reading a single key from the database. 


## Test Plan
unit tests

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this issue Jul 22, 2021
…abase (#2562)

## Motivation

blk.Initialize function is very slow, see the difference in benchmark:

```
benchmark                      old ns/op     new ns/op     delta
BenchmarkGetBlockHeader-16     119538        2810          -97.65%
```

It computes MinerID from signature and ID from encoded block every time when we load block from database.

related: #2478

## Changes
Introduce DBBlock type that will encode ID and MinerID as a part of the block when block is saved to the local database.

## Test Plan
unit tests
## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this issue Jul 22, 2021
…abase (#2562)

## Motivation

blk.Initialize function is very slow, see the difference in benchmark:

```
benchmark                      old ns/op     new ns/op     delta
BenchmarkGetBlockHeader-16     119538        2810          -97.65%
```

It computes MinerID from signature and ID from encoded block every time when we load block from database.

related: #2478

## Changes
Introduce DBBlock type that will encode ID and MinerID as a part of the block when block is saved to the local database.

## Test Plan
unit tests
## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this issue Jul 22, 2021
…abase (#2562)

## Motivation

blk.Initialize function is very slow, see the difference in benchmark:

```
benchmark                      old ns/op     new ns/op     delta
BenchmarkGetBlockHeader-16     119538        2810          -97.65%
```

It computes MinerID from signature and ID from encoded block every time when we load block from database.

related: #2478

## Changes
Introduce DBBlock type that will encode ID and MinerID as a part of the block when block is saved to the local database.

## Test Plan
unit tests
## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this issue Jul 23, 2021
…abase (#2562)

## Motivation

blk.Initialize function is very slow, see the difference in benchmark:

```
benchmark                      old ns/op     new ns/op     delta
BenchmarkGetBlockHeader-16     119538        2810          -97.65%
```

It computes MinerID from signature and ID from encoded block every time when we load block from database.

related: #2478

## Changes
Introduce DBBlock type that will encode ID and MinerID as a part of the block when block is saved to the local database.

## Test Plan
unit tests
## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
bors bot pushed a commit that referenced this issue Jul 23, 2021
…abase (#2562)

## Motivation

blk.Initialize function is very slow, see the difference in benchmark:

```
benchmark                      old ns/op     new ns/op     delta
BenchmarkGetBlockHeader-16     119538        2810          -97.65%
```

It computes MinerID from signature and ID from encoded block every time when we load block from database.

related: #2478

## Changes
Introduce DBBlock type that will encode ID and MinerID as a part of the block when block is saved to the local database.

## Test Plan
unit tests
## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Update documentation as needed

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [x] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@dshulyak
Copy link
Contributor

dshulyak commented Aug 6, 2021

On devnet 0.2 it takes between 5-10 ms to compute this value:
Screenshot from 2021-08-06 13-11-39

@dshulyak
Copy link
Contributor

Let's close it I guess. I will make sure to monitor a larger testnet to see if latency is still acceptable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants