[Merged by Bors] - Self-healing #2400

Closed
wants to merge 251 commits into from

lrettig (Member) commented Apr 28, 2021

Motivation

See spacemeshos/SMIPS#46

Requires #2394, #2393, #2357
Closes #2203
Closes #2687

Changes

  • Adds a zdist param. hdist is the "Hare lookback" distance: the number of layers for which we consider Hare results/local input vector rather than global opinion. zdist is the "Hare result wait" distance: the number of layers we're willing to wait for Hare to finish before reverting to invalidating layers with no input vector.
  • Hare explicitly reports failed CPs to the mesh, with the results stored in memory
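
The interplay between hdist and zdist can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `classifyLayer` helper, the `voteSource` names, the exact boundary comparisons (`>` versus `>=`), and the example values are all assumptions made for the sake of the example.

```go
package main

import "fmt"

// voteSource names where the opinion about a layer's blocks comes from.
// These names are hypothetical and do not appear in the PR.
type voteSource string

const (
	fromGlobalOpinion voteSource = "global opinion"
	fromInputVector   voteSource = "local input vector"
	abstainAndWait    voteSource = "abstain (still waiting for Hare)"
	againstAll        voteSource = "against all blocks"
)

// classifyLayer is an illustrative helper (an assumption, not the PR's API).
// hasInputVector reports whether Hare produced a result for the layer.
func classifyLayer(current, layer, hdist, zdist uint32, hasInputVector bool) voteSource {
	var age uint32
	if current > layer {
		age = current - layer // guard against unsigned underflow
	}
	switch {
	case age > hdist:
		// Outside the Hare lookback window: rely on global opinion.
		return fromGlobalOpinion
	case hasInputVector:
		// Inside the window and Hare finished: use its result.
		return fromInputVector
	case age <= zdist:
		// No result yet, but still within the wait distance.
		return abstainAndWait
	default:
		// No result and older than zdist: invalidate the layer.
		return againstAll
	}
}

func main() {
	const hdist, zdist = 20, 5 // illustrative values only
	fmt.Println(classifyLayer(100, 98, hdist, zdist, false))
	fmt.Println(classifyLayer(100, 90, hdist, zdist, false))
	fmt.Println(classifyLayer(100, 90, hdist, zdist, true))
	fmt.Println(classifyLayer(100, 50, hdist, zdist, true))
}
```

The sketch assumes zdist is smaller than hdist, so that a layer first leaves the waiting period and only later leaves the Hare lookback window.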

Test Plan

TBD

TODO

  • Explain motivation or link existing issue(s)
  • Test changes and document test plan
  • Finish rescoring block goodness after healing
  • Rescoring block unit test
  • Block voting weight unit test
  • Make sure healing -> verifying tortoise handoff is working
  • App test (for decay in active set size)
  • Split-then-rejoin test (two way split, three way split w/o majority fork)
  • Late blocks unit test(s)
  • Correctly weight votes from blocks with late atxs/those with a bad beacon value (will follow up in separate issue: Weight tortoise votes based on beacon correctness #2540)
  • Multi tortoise unit tests

DevOps Notes

  • This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
  • This PR does not affect public APIs
  • This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
  • This PR does not make changes to log messages (which monitoring infrastructure may rely on)

Bring context down into verifying tortoise
Hare reports to mesh when it explicitly fails for a layer.
This allows tortoise to add votes against all blocks in such a layer.
Also adds some more minor cleanup.
Increase default hdist
Invalidate layers with no input vector older than zdist
Also clean up some interfaces
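
One way to picture the failure reporting described in these commits is a small in-memory record that Hare writes to and the tortoise reads from. The sketch below is only an illustration under that assumption: the `hareFailures` type and its `Report` and `Failed` methods are hypothetical names, not the interfaces introduced by this PR.

```go
package main

import (
	"fmt"
	"sync"
)

type layerID uint32

// hareFailures is a hypothetical in-memory record of layers for which Hare
// explicitly failed. Nothing here is persisted, so it is lost on restart.
type hareFailures struct {
	mu     sync.RWMutex
	failed map[layerID]struct{}
}

func newHareFailures() *hareFailures {
	return &hareFailures{failed: make(map[layerID]struct{})}
}

// Report is what Hare would call when it explicitly fails for a layer.
func (h *hareFailures) Report(l layerID) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failed[l] = struct{}{}
}

// Failed is what the tortoise side would query before voting on a layer.
func (h *hareFailures) Failed(l layerID) bool {
	h.mu.RLock()
	defer h.mu.RUnlock()
	_, ok := h.failed[l]
	return ok
}

func main() {
	mesh := newHareFailures()
	mesh.Report(7)
	for _, l := range []layerID{6, 7} {
		if mesh.Failed(l) {
			fmt.Printf("layer %d: Hare failed, vote against all blocks\n", l)
		} else {
			fmt.Printf("layer %d: no failure reported\n", l)
		}
	}
}
```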
@lrettig lrettig self-assigned this Apr 28, 2021
Comments
- Set local opinion to be against all blocks in a layer if Hare failed
  for that layer.
- Correctly process exception votes when processing blocks (rather than
  summing base block vote + exception vote); handle abstain votes here
- Correctly interpret the lack of an explicit vote in support of a block
  in an older layer as a vote against that block when summing global
  opinion
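
The vote handling described above can be sketched as follows. This is a simplified illustration rather than the tortoise code itself: the `vote` constants, the `effectiveVote` helper, and the map-based representation of base and exception votes are assumptions chosen for the example.

```go
package main

import "fmt"

type vote int

const (
	against vote = -1
	abstain vote = 0
	support vote = 1
)

type blockID string

// effectiveVote returns a voting block's opinion on target (hypothetical
// helper). base holds the opinions inherited from the base block, and
// exceptions holds the explicit overrides carried by the voting block.
func effectiveVote(target blockID, base, exceptions map[blockID]vote) vote {
	if v, ok := exceptions[target]; ok {
		// An exception replaces the base block's vote; it is not added
		// to it. This also covers an explicit abstain.
		return v
	}
	if v, ok := base[target]; ok {
		return v
	}
	// No explicit vote in support counts as a vote against.
	return against
}

func main() {
	base := map[blockID]vote{"a": support, "b": support}
	exceptions := map[blockID]vote{"b": against, "c": abstain}

	for _, id := range []blockID{"a", "b", "c", "d"} {
		fmt.Printf("%s: %d\n", id, effectiveVote(id, base, exceptions))
	}
}
```

Summing the base vote and the exception for block "b" would give 0 (support plus against), which reads as an abstention; replacing the base vote gives the intended vote against.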

Lots of minor cleanup and improvements of comments, logs, variable names
Change some panics to errors
@lrettig lrettig mentioned this pull request May 1, 2021
Per spacemeshos/research#33 we want to look
back to the start of the sliding window (currently defined as hdist)
when adding exceptions to a candidate base block
Make comments clearer, some minor cleanup
- Mostly finished self-healing core logic in tortoise
- Remove lots of unused code from block builder
- Adds weak coin stub to hare while waiting for #2393 to be merged
Change direction of data flow, hare -> mesh -> tortoise
Improve some log output
Remove unused variables, clean up method signatures
Regarding block and ATX weights, per Tal
@dshulyak
Contributor

bors merge

bors bot pushed a commit that referenced this pull request Aug 26, 2021
## Motivation
See spacemeshos/SMIPS#46

Requires #2394, #2393, #2357
Closes #2203
Closes #2687

## Changes
- adds zdist param: whereas hdist is the "Hare lookback" distance (the number of layers for which we consider hare results/local input vector rather than global opinion), zdist is the "Hare result wait" distance (the number of layers we're willing to wait for Hare to finish before reverting to invalidating layers with no input vector)
- Hare explicitly reports failed CPs to the mesh, with the results stored in memory

## Test Plan
TBD

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Finish rescoring block goodness after healing
- [x] Rescoring block unit test
- [x] Block voting weight unit test
- [x] Make sure healing -> verifying tortoise handoff is working
- [x] App test (for decay in active set size)
- [x] Split-then-rejoin test (two way split, three way split w/o majority fork)
- [x] Late blocks unit test(s)
- [x] Correctly weight votes from blocks with late atxs/those with a bad beacon value (will follow up in separate issue: #2540)
- [x] Multi tortoise unit tests

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [ ] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
@bors

bors bot commented Aug 26, 2021

Build failed:

@dshulyak
Contributor

weird. a different unit test is failing every time; it didn't even get to latenodes in the last 4 tries

bors merge

bors bot pushed a commit that referenced this pull request Aug 26, 2021
@bors

bors bot commented Aug 26, 2021

Build failed:

@dshulyak
Contributor

bors merge

bors bot pushed a commit that referenced this pull request Aug 26, 2021
@bors

bors bot commented Aug 26, 2021

Build failed:

@dshulyak
Contributor

i don't understand why unit tests are so flaky, there are now 3 failures in a single run:

bors merge

bors bot pushed a commit that referenced this pull request Aug 26, 2021
@bors

bors bot commented Aug 26, 2021

Build failed:

@dshulyak
Contributor

bors merge

bors bot pushed a commit that referenced this pull request Aug 26, 2021
@bors

bors bot commented Aug 26, 2021

Build failed:

@dshulyak
Contributor

There must be something in this branch that makes those tests crash consistently on CI

@lrettig
Member Author

lrettig commented Aug 26, 2021

bors merge

bors bot pushed a commit that referenced this pull request Aug 26, 2021
@bors

bors bot commented Aug 26, 2021

Pull request successfully merged into develop.

Build succeeded:

@bors bors bot changed the title Self-healing [Merged by Bors] - Self-healing Aug 26, 2021
@bors bors bot closed this Aug 26, 2021
@bors bors bot deleted the healing branch August 26, 2021 21:09
@lrettig
Member Author

lrettig commented Aug 26, 2021

> i don't understand why unit tests are so flaky, there are now 3 failures in a single run:

These failures were caused by a bugfix I made in the mesh code; the affected tests just needed to be updated. TestFindNodeProtocol_FindNode_Concurrency is flaky, but the others were not a flakiness issue.

  // test states are the same when one input is from tortoise and the other from hare
- assert.Equal(t, s.Txs, s2.Txs)
+ require.NotEqual(t, s.Txs, s2.Txs)
Contributor


@lrettig
is this correct? if we want the same result, it should be equal right?

Member Author


the comment is a little misleading (see identical comment a few lines down), but the change is intentional and this should be correct. all layers except the latest layer have been applied here, so the state should not be the same. then the latest layer is applied, and then the state should be the same. that's what this is testing.

mesh2.ValidateLayer(context.TODO(), l)

// test states are the same when one input is from tortoise and the other from hare
require.Equal(t, s.Txs, s2.Txs)
Contributor


why not do `for i := 0; i < numOfLayers; i++` in line 228 and `require.Equal(t, s.Txs, s2.Txs)` directly?
why care about this one last layer difference?

Member Author


the test is more thorough if it isolates the final layer application: the states are not equal before the final layer is applied, and they are equal after.

6 participants