
storagenode/gracefulexit: Implement storage node graceful exit worker - part 1 #3322

Merged
merged 19 commits into master from green/v3-2613 on Oct 22, 2019

Conversation

@ethanadams (Collaborator) commented Oct 18, 2019

What:
This worker is responsible for initiating graceful exit with the satellite and then processing piece transfer requests.

https://storjlabs.atlassian.net/browse/V3-2613

Splitting this into two parts so that we can begin end-to-end testing of all components.
Part 2 will implement concurrent processing of the transfers and better error handling.

Why:
This is needed to transfer pieces of an exiting node to a replacement node.

Please describe the tests:

  • Test 1: Updates to TestChore
  • Test 2:

Please describe the performance impact:

Code Review Checklist (to be filled out by reviewer)

  • Does the PR describe what changes are being made?
  • Does the PR describe why the changes are being made?
  • Does the code follow our style guide?
  • Does the code follow our testing guide?
  • Is the PR appropriately sized? (If it could be broken into smaller PRs it should be)
  • Does the new code have enough tests? (every PR should have tests or justification otherwise. Bug-fix PRs especially)
  • Does the new code have enough documentation that answers "how do I use it?" and "what does it do?"? (both source documentation and higher level, diagrams?)
  • Does any documentation need updating?
  • Do the database access patterns make sense?

@ethanadams ethanadams added the WIP Work In Progress label Oct 18, 2019
@ethanadams ethanadams requested a review from a team October 18, 2019 19:01
@cla-bot cla-bot bot added the cla-signed label Oct 18, 2019
@ghost ghost requested review from ifraixedes and thepaul and removed request for a team October 18, 2019 19:01
@@ -182,6 +182,12 @@ func (endpoint *Endpoint) doProcess(stream processStream) (err error) {
	pending := newPendingMap()

	var morePiecesFlag int32 = 1
	var errorFlag int32 = 0
	handleError := func(err error) error {
		atomic.StoreInt32(&errorFlag, 1)
Contributor:

I guess I'm curious what's the benefit of using the atomic store vs. just returning the wrapped error?

Collaborator Author:

I added this so I didn't have to call group.Wait() in the for loop to know whether an error occurred. But there's probably a better way that I'm just not familiar with in Go.

Contributor:

Maybe we could have an error channel? I'm not totally sure what would be better either; maybe this is fine, but I just don't think I've seen this used before.

Contributor:

I agree with having an error channel

Collaborator Author:

Updated to use a channel.
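For reference, a minimal standalone sketch of the channel-based pattern (illustrative names only; the actual change appears in the diff further down this page):

package main

import (
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func process() error {
	var group errgroup.Group
	// Buffered with capacity 1: the single send below can never block.
	errChan := make(chan error, 1)

	handleError := func(err error) error {
		errChan <- err
		close(errChan) // safe: exactly one sender, sending exactly once
		return err
	}

	group.Go(func() error {
		return handleError(errors.New("transfer failed"))
	})

	for {
		// Poll without blocking, so the loop no longer needs group.Wait()
		// just to learn whether an error occurred.
		select {
		case err := <-errChan:
			_ = group.Wait() // drain remaining goroutines before returning
			return err
		default:
			// no error yet; keep processing satellite messages
		}
	}
}

func main() {
	fmt.Println(process())
}

The close is safe here because there is exactly one sender that sends exactly once, which is also the point raised about this code later in the review.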

require.NoError(t, err)

err = exitingNode.DB.Satellites().InitiateGracefulExit(ctx, satellite1.ID(), time.Now(), 10000)
require.NoError(t, err)

// check that theh storage node is exiting
Contributor:

nit - the

Collaborator Author:

Updated

if errs.Is(err, os.ErrNotExist) {
	transferErr = pb.TransferFailed_NOT_FOUND
}
worker.log.Error("failed to get piece reader.", zap.String("satellite ID", satelliteID.String()), zap.String("piece ID", pieceID.String()), zap.Error(errs.Wrap(err)))
Contributor:

nit - I think we should use zap.Stringer instead of zap.String; then you can just pass in satelliteID instead of satelliteID.String()

Collaborator Author:

Updated
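For context, zap.Stringer accepts any fmt.Stringer and defers the String() call until the entry is actually encoded, so the conversion is skipped entirely when the entry is filtered out by log level. A minimal standalone sketch:

package main

import "go.uber.org/zap"

type pieceID [4]byte

func (id pieceID) String() string { return "piece-0000" }

func main() {
	log, _ := zap.NewDevelopment()
	defer log.Sync()

	var id pieceID
	// Equivalent to zap.String("piece ID", id.String()), except String()
	// only runs if this entry is actually written.
	log.Error("failed to get piece reader", zap.Stringer("piece ID", id))
}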

}

// verifyPieceHash verifies whether the piece hash matches the locally computed hash.
func verifyPieceHash(ctx context.Context, limit *pb.OrderLimit, hash *pb.PieceHash, expectedHash []byte) (err error) {
Contributor:

is there a reason for using this instead of the piecestore.Endpoint's exported VerifyPieceHash method?

Collaborator Author:

piecestore.Endpoint.VerifyPieceHash also checks signatures. As far as I know, we don't have the information to check the signature. I could be wrong.

Collaborator Author:

Added signature verification.
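For readers following along, a sketch of roughly what the hash-comparison half of this helper looks like (field names follow the pb messages quoted above; the body is illustrative, and per the reply the merged version additionally verifies the signature on the hash). Assumes bytes and github.com/zeebo/errs are imported:

// verifyPieceHash compares the hash reported by the receiving node against
// the hash computed locally while streaming the piece.
func verifyPieceHash(ctx context.Context, limit *pb.OrderLimit, hash *pb.PieceHash, expectedHash []byte) (err error) {
	if limit == nil || hash == nil || len(expectedHash) == 0 {
		return errs.New("invalid arguments")
	}
	if limit.PieceId != hash.PieceId {
		return errs.New("piece id mismatch")
	}
	if !bytes.Equal(hash.Hash, expectedHash) {
		return errs.New("hash does not match")
	}
	return nil
}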


deleteMsg := &pb.SatelliteMessage{
	Message: &pb.SatelliteMessage_DeletePiece{
		DeletePiece: &pb.DeletePiece{
Contributor:

Do we need to send the satellite signature with the delete message?

Collaborator Author:

I don't think so.

Contributor:

could a malicious node send a delete message though? maybe it doesn't work that way lol but just checking

Collaborator Author:

The exiting node connects to the satellite and receives the delete message. There is no way (that I know of) for another storage node to connect to a node and send the delete message.

pieceID := msg.DeletePiece.OriginalPieceId
err := worker.store.Delete(ctx, satelliteID, pieceID)
if err != nil {
worker.log.Error("failed to delete piece.", zap.Stringer("satellite ID", satelliteID), zap.Stringer("piece ID", pieceID), zap.Error(errs.Wrap(err)))
Contributor:

Should we break here to wait for the next worker execution?

Collaborator Author:

I was thinking we'd just move on to the next message, but maybe I'm missing something. In addition, the storage node should purge any remaining pieces at the end of the overall graceful exit process anyway.

uplink/ecclient/client.go (resolved)
pkg/pb/gracefulexit.proto (outdated, resolved)
@@ -743,7 +743,7 @@ func (service *Service) CreateGracefulExitPutOrderLimit(ctx context.Context, buc
		UplinkPublicKey: piecePublicKey,
		StorageNodeId:   nodeID,
		PieceId:         rootPieceID.Derive(nodeID, pieceNum),
-		Action:          pb.PieceAction_PUT_GRACEFUL_EXIT,
+		Action:          pb.PieceAction_PUT,
Member:

Why are you removing PUT_GRACEFUL_EXIT? I think it could be useful to keep it down the line so we can distinguish normal uploads from graceful exit uploads.

Collaborator Author:

Will address in another PR.

Member:

@mobyvb why would that be useful?

Member:

Also there are multiple places that need updating to make PieceAction_PUT_GRACEFUL_EXIT work.

Member:

@egonelbre I was thinking it could help to analyze problems that SNOs may notice, but it's not essential.

storagenode/gracefulexit/worker.go (outdated, resolved)
storagenode/gracefulexit/worker.go (outdated, resolved)
	}
	break
default:
	// TODO handle err
Member:

Should we return a custom error in this case? This should never happen.

storagenode/gracefulexit/chore_test.go (outdated, resolved)
storagenode/gracefulexit/chore_test.go (outdated, resolved)
exitingNode.GracefulExit.Chore.Loop.TriggerWait()

exitProgress, err = exitingNode.DB.Satellites().ListGracefulExits(ctx)
require.NoError(t, err)
for _, progress := range exitProgress {
	if progress.SatelliteID == satellite1.ID() {
		require.NotNil(t, progress.FinishedAt)
	}
}
})
Member:

Maybe we should also check to ensure that the piece that needed to be transferred has shown up on one of the remaining 8 nodes.

Member:

We can iterate over metainfo before the transfer(s), then keep track of all the paths/piece numbers associated with the exiting node. Then, after the transfers, we can check metainfo for the same segments, and expect that the same piece numbers are still in the pointers, but are associated with different nodes.

Collaborator Author:

Updated
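The shape of that check, sketched with hypothetical helper names (getNodePieces and lookupPieceNode are illustrative stand-ins, not the helpers actually added to the test):

// Before triggering the chore: record which (path, piece number) pairs
// the exiting node holds, according to the satellite's pointers.
before := getNodePieces(ctx, t, satellite1, exitingNode.ID())

exitingNode.GracefulExit.Chore.Loop.TriggerWait()

// After: the same piece numbers should still appear in the pointers,
// but attached to different node IDs.
for _, p := range before {
	newNodeID := lookupPieceNode(ctx, t, satellite1, p.Path, p.PieceNum)
	require.NotEqual(t, exitingNode.ID(), newNodeID)
}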

storagenode/gracefulexit/chore_test.go (resolved)
	return errs.Wrap(err)
}

func (worker *Worker) handleFailure(ctx context.Context, transferError pb.TransferFailed_Error, pieceID pb.PieceID, satelliteID storj.NodeID, send func(*pb.StorageNodeMessage) error) {
	failure := &pb.StorageNodeMessage{
Contributor:

Maybe we don't need the satelliteID argument here since it's on the worker struct

Collaborator Author:

Updated

@ethanadams ethanadams added Request Code Review Code review requested Reviewer Can Merge If all checks have passed, non-owner can merge PR and removed WIP Work In Progress labels Oct 21, 2019
storagenode/gracefulexit/worker.go (outdated, resolved)
uplink/ecclient/client.go (resolved)
storagenode/gracefulexit/chore_test.go (resolved)
storagenode/gracefulexit/chore_test.go (resolved)
nodePieceCounts := make(map[storj.NodeID]int)
for _, n := range planet.StorageNodes {
	node := n
	// make sure there are no more pieces on the node.
Contributor:

The comment here is a bit misleading

Collaborator Author:

Agreed. Copy/paste error. Updated

@@ -70,16 +79,16 @@ func (chore *Chore) Run(ctx context.Context) (err error) {

	for _, satellite := range satellites {
		satelliteID := satellite.SatelliteID
-		worker := NewWorker(chore.log, chore.satelliteDB, satelliteID)
+		worker := NewWorker(chore.log, chore.store, chore.satelliteDB, chore.trust, chore.dialer, satelliteID)
Contributor:

Just curious: what's the reason for passing trust down to the worker instead of getting the address here and then passing the address down to the worker?

Collaborator Author:

Yeah I agree. Only the address is necessary. Updated

} else {
worker.log.Error("failed to put piece.", zap.Stringer("satellite ID", worker.satelliteID), zap.Stringer("piece ID", pieceID), zap.Error(errs.Wrap(err)))
// TODO look at error type to decide on the transfer error
worker.handleFailure(ctx, pb.TransferFailed_STORAGE_NODE_UNAVAILABLE, pieceID, c.Send)
Contributor:

Why do we use TransferFailed_STORAGE_NODE_UNAVAILABLE instead of UNKNOWN?

errChan := make(chan error, 1)
handleError := func(err error) error {
	errChan <- err
	close(errChan)
Contributor:

Something about this immediate closing of the err channel after sending the error to it is not sitting right with me, but I'm not sure why... just going to bookmark this for now. Would group.Wait() still be called? If not, would that mean the goroutines could potentially be left running?

Collaborator Author:

There is only one sender and one receiver for the channel, so I think it's safe to close it in the sender code. group.Wait() gets checked on error and again before the method exits.


func getNodePieceCounts(ctx context.Context, planet *testplanet.Planet) (_ map[storj.NodeID]int, err error) {
	nodePieceCounts := make(map[storj.NodeID]int)
	for _, n := range planet.StorageNodes {
		node := n
Contributor:

why do you have to do this reassignment?

Collaborator Author:

The linter didn't like the former:

storagenode/gracefulexit/chore_test.go:149:21: Using the variable on range scope `node` in function literal (scopelint)

				nodePieceCounts[node.ID()]++
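This is the classic Go range-variable capture pitfall (under the loop semantics of the time; Go 1.22 later made loop variables per-iteration): every closure created in the loop shares a single loop variable unless it is re-declared. A standalone demonstration:

package main

import "fmt"

func main() {
	var funcs []func()
	for _, v := range []string{"a", "b", "c"} {
		v := v // re-declare so each closure captures its own copy
		funcs = append(funcs, func() { fmt.Print(v, " ") })
	}
	// Prints "a b c ". Without the v := v re-declaration (pre-Go 1.22),
	// every closure would have printed the final value "c".
	for _, f := range funcs {
		f()
	}
	fmt.Println()
}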

// https://storjlabs.atlassian.net/browse/V3-2613
addr, err := worker.trust.GetAddress(ctx, worker.satelliteID)
if err != nil {
	return errs.Wrap(err)
Contributor:

nit - wondering if we should have a worker-specific error class and use that to wrap instead of the general errs.Wrap.
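That would follow the zeebo/errs pattern used elsewhere in the codebase: declare a package-level error class and wrap with it, so callers can test provenance with Error.Has(err). A sketch (the class name here is an assumption):

package gracefulexit

import "github.com/zeebo/errs"

// Error is the error class for graceful exit errors on the storage node.
var Error = errs.Class("gracefulexit")

Call sites would then return Error.Wrap(err) instead of the generic errs.Wrap(err).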

	continue
}

putCtx, cancel := context.WithCancel(ctx)
Contributor:

^I think this is a good idea
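A generic sketch of why the per-put child context helps: calling cancel aborts just that one transfer without tearing down the worker's parent context (transfer below is a hypothetical stand-in for the actual upload call):

package main

import (
	"context"
	"fmt"
	"time"
)

// transfer stands in for a long-running piece upload that honors ctx.
func transfer(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	putCtx, cancel := context.WithCancel(context.Background())
	defer cancel() // always release the child context

	go func() {
		// e.g. a hash-verification failure elsewhere decides to abort:
		time.Sleep(100 * time.Millisecond)
		cancel()
	}()

	fmt.Println(transfer(putCtx)) // prints "context canceled"
}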

@mobyvb (Member) left a comment:

Looks good. We can tie up remaining loose ends in part 2 once everything is hooked up.

@VinozzZ (Contributor) left a comment:

LGTM

@ethanadams ethanadams merged commit 3e0d123 into master Oct 22, 2019
@ethanadams ethanadams deleted the green/v3-2613 branch October 22, 2019 20:42