[Merged by Bors] - Self-healing #2400
Conversation
Bring context down into verifying tortoise
Hare reports to mesh when it explicitly fails for a layer. This allows tortoise to add votes against all blocks in such a layer. Also adds some more minor cleanup.
Comments
- Set local opinion as opposed to all blocks in a layer if Hare failed for that layer.
- Correctly process exception votes when processing blocks (rather than summing base block vote + exception vote); handle abstain votes here.
- Correctly interpret the lack of an explicit vote in support of a block in an older layer as a vote against that block when summing global opinion.

Lots of minor cleanup and improvements of comments, logs, and variable names. Change some panics to errors.
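The vote-processing rules in this commit can be illustrated with a minimal sketch. The `Vote` type, the string block IDs, and `effectiveVote` are hypothetical names chosen for illustration and are not taken from the go-spacemesh code; the sketch only shows an exception replacing the inherited base block vote (rather than being added to it), an abstain vote being carried through, and silence being read as a vote against.

```go
package main

import "fmt"

// Vote is a hypothetical three-valued opinion, not the project's actual type.
type Vote int

const (
	Against Vote = -1
	Abstain Vote = 0
	Support Vote = 1
)

// effectiveVote returns the opinion a voting block expresses about target.
// An exception, when present, replaces the vote inherited from the base block
// instead of being summed with it. If neither the exceptions nor the base
// block mention the target, the silence is treated as a vote against it.
func effectiveVote(base, exceptions map[string]Vote, target string) Vote {
	if v, ok := exceptions[target]; ok {
		return v // covers explicit support, against, and abstain
	}
	if v, ok := base[target]; ok {
		return v
	}
	return Against
}

func main() {
	base := map[string]Vote{"a": Support, "b": Support}
	exceptions := map[string]Vote{"b": Against, "c": Abstain}
	for _, id := range []string{"a", "b", "c", "d"} {
		fmt.Println(id, effectiveVote(base, exceptions, id))
	}
}
```

Summing the base vote and the exception for block `b` would cancel to zero, which is indistinguishable from an abstention; replacing the vote keeps the three cases apart.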
Per spacemeshos/research#33 we want to look back to the start of the sliding window (currently defined as hdist) when adding exceptions to a candidate base block
Make comments clearer, some minor cleanup
- Mostly finished self-healing core logic in tortoise
- Remove lots of unused code from block builder
- Adds weak coin stub to hare while waiting for #2393 to be merged
Change direction of data flow, hare -> mesh -> tortoise
Improve some log output
Remove unused variables, clean up method signatures
Regarding block and ATX weights, per Tal
bors merge |
## Motivation
See spacemeshos/SMIPS#46

Requires #2394, #2393, #2357

Closes #2203
Closes #2687

## Changes
- adds zdist param: whereas hdist is the "Hare lookback" distance (the number of layers for which we consider hare results/local input vector rather than global opinion), zdist is the "Hare result wait" distance (the number of layers we're willing to wait for Hare to finish before reverting to invalidating layers with no input vector)
- Hare explicitly reports failed CPs to the mesh, with the results stored in memory

## Test Plan
TBD

## TODO
<!-- This section should be removed when all items are complete -->
- [x] Explain motivation or link existing issue(s)
- [x] Test changes and document test plan
- [x] Finish rescoring block goodness after healing
- [x] Rescoring block unit test
- [x] Block voting weight unit test
- [x] Make sure healing -> verifying tortoise handoff is working
- [x] App test (for decay in active set size)
- [x] Split-then-rejoin test (two way split, three way split w/o majority fork)
- [x] Late blocks unit test(s)
- [x] Correctly weight votes from blocks with late atxs/those with a bad beacon value (will follow up in separate issue: #2540)
- [x] Multi tortoise unit tests

## DevOps Notes
<!-- Please uncheck these items as applicable to make DevOps aware of changes that may affect releases -->
- [x] This PR does not require configuration changes (e.g., environment variables, GitHub secrets, VM resources)
- [x] This PR does not affect public APIs
- [x] This PR does not rely on a new version of external services (PoET, elasticsearch, etc.)
- [ ] This PR does not make changes to log messages (which monitoring infrastructure may rely on)
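To make the relationship between `hdist` and `zdist` described under Changes concrete, here is a rough sketch of how a node might pick the source of its local opinion for a layer. Everything in it is an assumption for illustration: the `hareStatus` and `opinionSource` types, the `chooseOpinionSource` function, and the exact boundary comparisons (`>` versus `>=`) are not taken from the PR.

```go
package main

import "fmt"

// hareStatus is a hypothetical summary of what Hare reported for a layer.
type hareStatus int

const (
	hareNoResult  hareStatus = iota // Hare has not reported anything yet
	hareSucceeded                   // Hare produced an input vector
	hareFailed                      // Hare explicitly reported failure
)

// opinionSource names where the local opinion on a layer comes from.
type opinionSource string

const (
	fromGlobalOpinion opinionSource = "global opinion"
	fromHareResult    opinionSource = "hare result"
	waitForHare       opinionSource = "still waiting for hare"
	againstAllBlocks  opinionSource = "against all blocks"
)

// chooseOpinionSource sketches the two distances:
//   - layers older than hdist fall back to the global opinion;
//   - within hdist, a Hare result is used when one exists;
//   - an explicit Hare failure invalidates the layer right away;
//   - with no result, the layer is only invalidated once it is older than zdist.
func chooseOpinionSource(current, layer, hdist, zdist uint32, status hareStatus) opinionSource {
	var age uint32
	if current > layer {
		age = current - layer
	}
	switch {
	case age > hdist:
		return fromGlobalOpinion
	case status == hareSucceeded:
		return fromHareResult
	case status == hareFailed:
		return againstAllBlocks
	case age <= zdist:
		return waitForHare
	default:
		return againstAllBlocks
	}
}

func main() {
	const hdist, zdist = 10, 5
	fmt.Println(chooseOpinionSource(100, 98, hdist, zdist, hareNoResult))
	fmt.Println(chooseOpinionSource(100, 92, hdist, zdist, hareNoResult))
	fmt.Println(chooseOpinionSource(100, 80, hdist, zdist, hareSucceeded))
}
```

The sketch assumes `zdist` is no larger than `hdist`, so the wait window always sits inside the Hare lookback window.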
Build failed: |
weird. a different unit test is failing every time, and it didn't even get to latenodes in the last 4 tries
bors merge |
Build failed: |
bors merge |
Build failed: |
i don't understand why unit tests are so flaky, there are now 3 failures in a single run:
bors merge |
Build failed: |
bors merge |
Build failed: |
There must be something in this branch that makes those tests crash consistently on CI |
Layer arithmetic
bors merge |
Pull request successfully merged into develop. Build succeeded: |
These failures were due to a bugfix I made in the mesh code, just needed to update those tests - |
// test states are the same when one input is from tortoise and the other from hare
assert.Equal(t, s.Txs, s2.Txs)
require.NotEqual(t, s.Txs, s2.Txs)
@lrettig
is this correct? if we want the same result, it should be equal right?
the comment is a little misleading (see identical comment a few lines down), but the change is intentional and this should be correct. all layers except the latest layer have been applied here, so the state should not be the same. then the latest layer is applied, and then the state should be the same. that's what this is testing.
mesh2.ValidateLayer(context.TODO(), l)

// test states are the same when one input is from tortoise and the other from hare
require.Equal(t, s.Txs, s2.Txs)
why not do `for i := 0; i < numOfLayers; i++` in line 228 and `require.Equal(t, s.Txs, s2.Txs)` directly? why care about this one last layer difference?
the test is more thorough if it isolates the final layer application: the states are not equal before the final layer is applied, and they are equal after.
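The idea of isolating the final layer can be shown with a small stand-alone sketch. The `state` type, `applyLayer`, `applyLayers`, and `sameTxs` are simplified stand-ins invented for this example; the real test works with mesh state and uses `require.NotEqual` / `require.Equal`.

```go
package main

import (
	"fmt"
	"reflect"
)

// state is a simplified stand-in for the state the real test compares.
type state struct{ Txs []string }

// applyLayer appends one layer's transactions to the state.
func (s *state) applyLayer(txs []string) { s.Txs = append(s.Txs, txs...) }

// applyLayers applies the given layers in order.
func applyLayers(s *state, layers [][]string) {
	for _, l := range layers {
		s.applyLayer(l)
	}
}

// sameTxs reports whether both states hold the same transactions.
func sameTxs(a, b *state) bool { return reflect.DeepEqual(a.Txs, b.Txs) }

func main() {
	layers := [][]string{{"tx1"}, {"tx2", "tx3"}, {"tx4"}}
	s, s2 := &state{}, &state{}

	applyLayers(s, layers)                  // first state: every layer applied
	applyLayers(s2, layers[:len(layers)-1]) // second state: all but the last

	// Before the final layer the states must differ ...
	fmt.Println("equal before final layer:", sameTxs(s, s2))

	// ... and applying that one layer must make them match.
	s2.applyLayer(layers[len(layers)-1])
	fmt.Println("equal after final layer:", sameTxs(s, s2))
}
```

Checking inequality first guards against a test that passes trivially, for example because neither state was ever updated.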