storagenode/gracefulexit: Implement storage node graceful exit worker - part 1 #3322
Conversation
satellite/gracefulexit/endpoint.go
Outdated
@@ -182,6 +182,12 @@ func (endpoint *Endpoint) doProcess(stream processStream) (err error) {
	pending := newPendingMap()

	var morePiecesFlag int32 = 1
	var errorFlag int32 = 0
	handleError := func(err error) error {
		atomic.StoreInt32(&errorFlag, 1)

I guess I'm curious what the benefit is of using the atomic store vs. just returning the wrapped error?

I added this so I didn't have to call group.Wait() in the for loop to know if an error occurred. But there's probably a better way that I'm just not familiar with in Go.

Maybe we could have an error channel? I'm not totally sure what would be better either. Maybe this is fine, but I just don't think I'd seen it used before.

I agree with having an error channel.

Updated to use channel.
require.NoError(t, err)

err = exitingNode.DB.Satellites().InitiateGracefulExit(ctx, satellite1.ID(), time.Now(), 10000)
require.NoError(t, err)

// check that theh storage node is exiting

nit - "theh" should be "the"

Updated
storagenode/gracefulexit/worker.go
Outdated
if errs.Is(err, os.ErrNotExist) {
	transferErr = pb.TransferFailed_NOT_FOUND
}
worker.log.Error("failed to get piece reader.", zap.String("satellite ID", satelliteID.String()), zap.String("piece ID", pieceID.String()), zap.Error(errs.Wrap(err)))

nit - I think use zap.Stringer instead of zap.String - then you can just pass in satelliteID instead of satelliteID.String()

Updated
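The nit above works because the ID types implement fmt.Stringer, which is all zap.Stringer requires; zap then defers the String() call until the entry is actually encoded. A stdlib-only sketch of the mechanism, with nodeID and logField as hypothetical stand-ins for storj.NodeID and the zap field constructor:

```go
package main

import "fmt"

// nodeID is a stand-in for storj.NodeID; the real type also implements
// fmt.Stringer, which is why zap.Stringer accepts it directly.
type nodeID [4]byte

func (id nodeID) String() string { return fmt.Sprintf("%x", id[:]) }

// logField mimics the contract zap.Stringer relies on: any value with a
// String() method can be passed as-is, no explicit .String() at the call site.
func logField(key string, val fmt.Stringer) string {
	return key + "=" + val.String()
}

func main() {
	id := nodeID{0xde, 0xad, 0xbe, 0xef}
	fmt.Println(logField("satellite ID", id)) // satellite ID=deadbeef
}
```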
storagenode/gracefulexit/worker.go
Outdated
}

// verifyPieceHash verifies whether the piece hash matches the locally computed hash.
func verifyPieceHash(ctx context.Context, limit *pb.OrderLimit, hash *pb.PieceHash, expectedHash []byte) (err error) {

Is there a reason for using this instead of the piecestore.Endpoint's exported VerifyPieceHash method?

piecestore.Endpoint.VerifyPieceHash also checks signatures. As far as I know, we don't have the information to check the signature. I could be wrong.

Added signature verification.
deleteMsg := &pb.SatelliteMessage{
	Message: &pb.SatelliteMessage_DeletePiece{
		DeletePiece: &pb.DeletePiece{

Do we need to send the satellite signature with the delete message?

I don't think so.

Could a malicious node send a delete message, though? Maybe it doesn't work that way, but just checking.

The exiting node connects to the satellite and receives the delete message. There is no way (that I know of) for another storage node to connect to a node and send the delete message.
storagenode/gracefulexit/worker.go
Outdated
pieceID := msg.DeletePiece.OriginalPieceId
err := worker.store.Delete(ctx, satelliteID, pieceID)
if err != nil {
	worker.log.Error("failed to delete piece.", zap.Stringer("satellite ID", satelliteID), zap.Stringer("piece ID", pieceID), zap.Error(errs.Wrap(err)))

Should we break here to wait for the next worker execution?

I was thinking we'd just move on to the next message, but maybe I'm missing something. In addition, the storage node should purge any remaining pieces at the end of the overall graceful exit process anyway.
@@ -743,7 +743,7 @@ func (service *Service) CreateGracefulExitPutOrderLimit(ctx context.Context, buc
	UplinkPublicKey: piecePublicKey,
	StorageNodeId: nodeID,
	PieceId: rootPieceID.Derive(nodeID, pieceNum),
-	Action: pb.PieceAction_PUT_GRACEFUL_EXIT,
+	Action: pb.PieceAction_PUT,

Why are you removing PUT_GRACEFUL_EXIT? I think it could be useful to keep it down the line so we can distinguish normal uploads from graceful exit uploads.

Will address in another PR.

@mobyvb why would that be useful?

Also, there are multiple places that need updating to make PieceAction_PUT_GRACEFUL_EXIT work.

@egonelbre I was thinking it could help to analyze problems that SNOs may notice, but it's not essential.
}
break
default:
	// TODO handle err

Should we return a custom error in this case? This should never happen.
exitingNode.GracefulExit.Chore.Loop.TriggerWait()

exitProgress, err = exitingNode.DB.Satellites().ListGracefulExits(ctx)
require.NoError(t, err)
for _, progress := range exitProgress {
	if progress.SatelliteID == satellite1.ID() {
		require.NotNil(t, progress.FinishedAt)
	}
}
})

Maybe we should also check to ensure that the piece that needed to be transferred has shown up on one of the remaining 8 nodes.

We can iterate over metainfo before the transfer(s), then keep track of all the paths/piece numbers associated with the exiting node. Then, after the transfers, we can check metainfo for the same segments and expect that the same piece numbers are still in the pointers but are associated with different nodes.

Updated
return errs.Wrap(err)
}

func (worker *Worker) handleFailure(ctx context.Context, transferError pb.TransferFailed_Error, pieceID pb.PieceID, satelliteID storj.NodeID, send func(*pb.StorageNodeMessage) error) {
	failure := &pb.StorageNodeMessage{

Maybe we don't need the satelliteID argument here since it's on the worker struct.

Updated
nodePieceCounts := make(map[storj.NodeID]int)
for _, n := range planet.StorageNodes {
	node := n
	// make sure there are no more pieces on the node.

The comment here is a bit misleading.

Agreed. Copy/paste error. Updated
storagenode/gracefulexit/chore.go
Outdated
@@ -70,16 +79,16 @@ func (chore *Chore) Run(ctx context.Context) (err error) {

for _, satellite := range satellites {
	satelliteID := satellite.SatelliteID
-	worker := NewWorker(chore.log, chore.satelliteDB, satelliteID)
+	worker := NewWorker(chore.log, chore.store, chore.satelliteDB, chore.trust, chore.dialer, satelliteID)

Just curious: what's the reason for passing the trust down to the worker instead of getting the address here and passing the address down to the worker?

Yeah, I agree. Only the address is necessary. Updated
} else {
	worker.log.Error("failed to put piece.", zap.Stringer("satellite ID", worker.satelliteID), zap.Stringer("piece ID", pieceID), zap.Error(errs.Wrap(err)))
	// TODO look at error type to decide on the transfer error
	worker.handleFailure(ctx, pb.TransferFailed_STORAGE_NODE_UNAVAILABLE, pieceID, c.Send)

Why do we use TransferFailed_STORAGE_NODE_UNAVAILABLE instead of UNKNOWN?
errChan := make(chan error, 1)
handleError := func(err error) error {
	errChan <- err
	close(errChan)

Something about this immediate closing of the err channel after sending the error to it is not sitting right with me, but I'm not sure why. Just going to bookmark this for now. Would the group.Wait() still be called? If not, would that mean the goroutines could potentially be left running?

There is only 1 sender and 1 receiver for the channel. I think it's safe to close it in the sender code. The group wait gets checked on error and again before the method exits.
func getNodePieceCounts(ctx context.Context, planet *testplanet.Planet) (_ map[storj.NodeID]int, err error) {
	nodePieceCounts := make(map[storj.NodeID]int)
	for _, n := range planet.StorageNodes {
		node := n

Why do you have to do this reassignment?

The linter didn't like the former:

storagenode/gracefulexit/chore_test.go:149:21: Using the variable on range scope `node` in function literal (scopelint)
	nodePieceCounts[node.ID()]++
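The scopelint warning exists because, before Go 1.22, every closure in a loop body captured the same shared range variable; reassigning to a per-iteration local like `node := n` gives each closure its own copy. A standalone illustration (collect and the string IDs are hypothetical, unrelated to the test above):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collect launches one goroutine per ID and gathers the ID each goroutine
// observed. Before Go 1.22 the `node := n` reassignment was required so each
// closure captured its own copy rather than the shared range variable.
func collect(ids []string) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		seen []string
	)
	for _, n := range ids {
		node := n // per-iteration copy; this is what silences scopelint
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			seen = append(seen, node)
			mu.Unlock()
		}()
	}
	wg.Wait()
	sort.Strings(seen)
	return seen
}

func main() {
	fmt.Println(collect([]string{"node-a", "node-b", "node-c"})) // [node-a node-b node-c]
}
```

As of Go 1.22 the range variable is per-iteration anyway, so the reassignment is no longer strictly necessary, but it remains harmless and was required when this PR was written.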
storagenode/gracefulexit/worker.go
Outdated
// https://storjlabs.atlassian.net/browse/V3-2613
addr, err := worker.trust.GetAddress(ctx, worker.satelliteID)
if err != nil {
	return errs.Wrap(err)

nit - wondering if we should have a worker-specific error class and use that to wrap instead of the general errs.Wrap.
continue
}

putCtx, cancel := context.WithCancel(ctx)

^ I think this is a good idea.

Looks good. We can tie up remaining loose ends in part 2 once everything is hooked up.

LGTM
What:
This worker is responsible for initiating graceful exit with the satellite and then processing piece transfer requests.
https://storjlabs.atlassian.net/browse/V3-2613
We are splitting this into 2 parts so that we can begin end-to-end testing of all components.
Part 2 will implement concurrent processing of the transfers and better error handling.
Why:
This is needed to transfer the pieces of an exiting node to a replacement node.
Please describe the tests:
Please describe the performance impact:
Code Review Checklist (to be filled out by reviewer)