
docs/design: Storage Node Graceful Exit #2734

Closed
wants to merge 45 commits into from

Conversation

ethanadams (Collaborator)

What: When a Storage Node wants to leave the network but does not want to lose its escrow, we need a mechanism for it to exit the network “gracefully”.

Why: Give Storage Nodes a mechanism to leave the network while receiving their escrows, and reduce repair caused by node churn on satellites.

Please describe the tests:

  • Test 1:
  • Test 2:

Please describe the performance impact:

Code Review Checklist (to be filled out by reviewer)

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?

@ethanadams ethanadams added the Request Code Review (Code review requested) and Reviewer Can Merge (If all checks have passed, non-owner can merge PR) labels Aug 7, 2019
@ethanadams ethanadams requested a review from a team August 7, 2019 20:51
@cla-bot cla-bot bot added the cla-signed label Aug 7, 2019
@ghost ghost requested review from aleitner and aligeti and removed request for a team August 7, 2019 20:52
@ethanadams ethanadams requested a review from a team August 8, 2019 13:39
@ghost ghost requested review from kaloyan-raev and littleskunk and removed request for a team August 8, 2019 13:39
@egonelbre egonelbre changed the title Storage Node Graceful Exit Design Document design/docs: Storage Node Graceful Exit Aug 8, 2019
@egonelbre egonelbre added the Design Doc (Adding or updating a design document) label Aug 8, 2019
@egonelbre egonelbre changed the title design/docs: Storage Node Graceful Exit docs/design: Storage Node Graceful Exit Aug 8, 2019

```
field node_id blob
field path blob
field peice_info blob
```
Contributor

Nitpick: `peice` should be `piece`. Also, instead of `dt`, mention 'date'.

Collaborator Author

Changed `_dt` to `_at` to be consistent.

```
rpc Initiate(stream InitiateRequest) returns (stream InitiateResponse) {}
}

message InitiateRequest {
```
Contributor

This is not a comment, but a thought... when I first read `InitiateRequest`, I didn't think it was referring to graceful exit. So what do you think of naming it `GracefulExitInitiateRequest`?

Collaborator Author

Updated to `service GracefulExit` with `rpc InitiateExit(InitiateExitRequest) returns (InitiateExitResponse)`.

```
field completed_dt timestamp ( updateable )
)
```
- Add `PieceAction` field to `cache.FindStorageNodesRequest`. Update `cache.FindStorageNodesWithPreferences` to ignore exiting nodes for uploads and repairs.
Contributor

Instead of a piece action, can you query for nodes that have `exit_initiated_dt` null? That way there is no need to add a new piece action.

Member

The new PieceAction field indeed seems unnecessary. The caller can take advantage of the existing ExcludedNodes field to list the current nodes from the pointer.

Collaborator Author

Agreed. Not sure where I was going with this one. Document updated.
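The approach agreed on here, excluding exiting nodes via their exit timestamp rather than a new `PieceAction`, could be sketched as a filter in the node-selection query. This is a hypothetical sketch: the `exit_initiated_at` column name and the query shape are assumptions for illustration, not the actual overlay cache code.

```go
package main

import "fmt"

// buildSelectionQuery sketches how node selection could skip exiting
// nodes by requiring exit_initiated_at to be NULL, instead of adding
// a new PieceAction. Excluded node IDs (e.g. nodes already holding a
// piece of the segment) are appended as placeholder conditions.
func buildSelectionQuery(excluded []string) string {
	q := "SELECT id FROM nodes WHERE disqualified IS NULL AND exit_initiated_at IS NULL"
	for range excluded {
		q += " AND id != ?"
	}
	return q
}

func main() {
	fmt.Println(buildSelectionQuery([]string{"node-a", "node-b"}))
}
```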

- Pushes the pieces to the storage node identified in the order using `ecclient`
- Sends the signed new node response to the satellite via a new `CommitPiece` method (uses `metainfo.UpdatePieces`).
- Updates `bytes_deleted` with the number of bytes deleted and sets `completed_dt` to current time
- Execution intervals and batch sizes should be configurable

## Open Questions
Contributor

"What happens if a piece is deleted while it is in the process of being transferred from the exiting node to another node in the network?"

If I understood this correctly, the exiting storage node should only honor download requests, not delete requests or new upload requests. In fact, an exiting SN should internally be made to accept only downloads and reject new uploads and/or delete requests.

- When a Storage Node is exiting the network gracefully, I want the satellite to be able to track how much egress it used for exiting, so that we do not pay for that bandwidth.
- When a Storage Node is in the process of gracefully exiting the network, it shall continue to participate in requested audits and uptime checks.
- When a Storage Node is in the process of gracefully exiting the network, it shall continue to honor download requests.
- The Storage Node shall keep separate, detailed metrics of network bandwidth used to serve data to clients versus bandwidth used for graceful exit. The Storage Node shall NOT get paid for bandwidth used for graceful exit, but shall get paid for bandwidth used to serve data to clients (downloads, audits, uptime checks, etc.).
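The separate bandwidth accounting described in the last requirement could be sketched with two counters. The type and field names here are assumptions for illustration, not the actual storage node code:

```go
package main

import "fmt"

// BandwidthUsage keeps graceful-exit egress separate from client
// egress, so the unpaid exit traffic can be reported independently.
type BandwidthUsage struct {
	ClientEgress       int64 // serving clients (downloads, audits): paid
	GracefulExitEgress int64 // transfers to replacement nodes: unpaid
}

// Record attributes a number of egress bytes to the right counter.
func (b *BandwidthUsage) Record(bytes int64, gracefulExit bool) {
	if gracefulExit {
		b.GracefulExitEgress += bytes
	} else {
		b.ClientEgress += bytes
	}
}

func main() {
	var usage BandwidthUsage
	usage.Record(1024, false) // client download
	usage.Record(4096, true)  // graceful exit transfer
	fmt.Println(usage.ClientEgress, usage.GracefulExitEgress) // 1024 4096
}
```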
Contributor

I don't think we are paying for audits, uptime checks, etc.

Member

For sure, we are not paying for uptime checks. We should double-check if we are paying for audits.

Collaborator Author

Updated to downloads and audit egress.

Contributor

> For sure, we are not paying for uptime checks. We should double-check if we are paying for audits.

We are paying for audit traffic.

- Initiates the exit by setting `nodes.exit_initiated` to current time
- ```
  service GracefulExit {
  rpc Initiate(stream InitiateRequest) returns (stream InitiateResponse) {}
  ```
Contributor

Why stream? Do we plan to keep it open during the whole exit process?

Collaborator Author

Oversight. Fixed.

Member

@egonelbre egonelbre left a comment

There are still open questions and TBD notes, and the document is not simple enough.

- Updates `bytes_deleted` with the number of bytes deleted
- Deletes the pieces that were successfully moved
- On failure (e.g. the satellite is unavailable), successful order limits stored in `exit_orders` should be reprocessed on the next iteration
- Execution intervals and batch sizes should be configurable

## Open Questions

- What happens if we are doing a repair and a graceful exit on the same segment?
Member

There are still a lot of critical open questions.

```
field completed_exit_signature blob
)

model exit_orders {
```
Member

Why do we need this table?

@egonelbre egonelbre added the WIP (Work In Progress) label and removed the Request Code Review and Reviewer Can Merge labels Aug 20, 2019
@ethanadams (Collaborator Author)

Reworked the document into multiple documents. Closing in favor of a new pull request.

@ethanadams ethanadams closed this Aug 23, 2019
@ethanadams ethanadams mentioned this pull request Aug 23, 2019
Labels
cla-signed, Design Doc, WIP