vFile: include timeout for ETCD request #1850

luomiao · 2017-08-29T22:45:02Z

Add DialTimeout when creating a new ETCD client. With DialTimeout set, client creation fails unless the remote ETCD endpoint is accessible.
This avoids problems when some of the swarm managers don't have the vFile plugin installed.
Add etcdRequestTimeout to all ETCD requests. An ETCD request now returns an error when the ETCD endpoint is not accessible instead of blocking indefinitely, which could leave volume operations stuck forever.

Local tests passed.

lipingxue

Overall looks good, only have some questions/comments.

What local tests did you run? How did you verify that this PR fixes the issues?

lipingxue · 2017-08-29T23:22:25Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

   checkSleepDuration:         How long to wait in any busy waiting situation
                               before checking again
   gcTicker:                   ticker for garbage collector to run a collection
   etcdClientCreateError:      Error indicating failure to create etcd client
+   swarmErrorMsg:              Message indicating swarm cluster is unhealthy


nit: better to use the name "swarmUnhealthyErrorMsg".

Will update.

lipingxue · 2017-08-29T23:23:21Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

@@ -58,10 +60,12 @@ const (
 	etcdScheme               = "http://"
 	etcdClusterStateNew      = "new"
 	etcdClusterStateExisting = "existing"
-	requestTimeout           = 5 * time.Second
+	etcdRequestTimeout       = 2 * time.Second


How do you know the timeout value used here is the right one?

So etcdUpdateTimeout is in fact the old requestTimeout.
In the previous code, requestTimeout or 2 * requestTimeout was used when waiting for a status change of the ETCD server, or for a key-value pair inside ETCD to be updated by other managers. So it basically keeps the old value.
The new etcdRequestTimeout is the timeout for a single ETCD operation to complete. I checked the example code in the ETCD codebase, and 2 seconds is used there as the request timeout.
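
The distinction between the two timeouts can be sketched as follows. This is an illustration under assumptions, not the plugin's code: `readFunc`, `waitForValue`, `readyAfter`, and `errUpdateTimeout` are hypothetical names, the 5-second `etcdUpdateTimeout` is inferred from the old `requestTimeout` in the diff, and the `checkSleepDuration` value is made up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	etcdRequestTimeout = 2 * time.Second       // bound for one ETCD operation (from this PR)
	etcdUpdateTimeout  = 5 * time.Second       // assumed: the old requestTimeout value
	checkSleepDuration = 10 * time.Millisecond // hypothetical value, not from the source
)

// errUpdateTimeout is a hypothetical error for "the awaited update never arrived".
var errUpdateTimeout = errors.New("timed out waiting for key update")

// readFunc is a hypothetical stand-in for a single ETCD read.
type readFunc func(ctx context.Context) (string, error)

// waitForValue polls read until it returns want or until wait elapses.
// Each individual read gets its own perRequest deadline.
func waitForValue(read readFunc, want string, perRequest, wait, interval time.Duration) error {
	deadline := time.Now().Add(wait)
	for {
		ctx, cancel := context.WithTimeout(context.Background(), perRequest)
		got, err := read(ctx)
		cancel()
		if err != nil {
			return err // the request itself failed or timed out
		}
		if got == want {
			return nil
		}
		if time.Now().After(deadline) {
			return errUpdateTimeout
		}
		time.Sleep(interval)
	}
}

// readyAfter returns a reader that reports "pending" for the first n
// calls and "ready" afterwards, simulating an update by another manager.
func readyAfter(n int) readFunc {
	calls := 0
	return func(ctx context.Context) (string, error) {
		calls++
		if calls > n {
			return "ready", nil
		}
		return "pending", nil
	}
}

func main() {
	err := waitForValue(readyAfter(3), "ready", etcdRequestTimeout, etcdUpdateTimeout, checkSleepDuration)
	fmt.Println("update observed, err =", err)
}
```

Keeping the two durations separate means a slow state change does not shorten or lengthen the bound on any single request.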

lipingxue · 2017-08-29T23:27:43Z

client_plugin/drivers/vfile/kvstore/etcdops/etcdops.go

@@ -670,15 +687,12 @@ func addrToEtcdClient(addr string) (*etcdClient.Client, error) {
 	s := strings.Split(addr, ":")
 	endpoint := s[0] + etcdClientPort
 	cfg := etcdClient.Config{
-		Endpoints: []string{endpoint},
+		Endpoints:   []string{endpoint},
+		DialTimeout: etcdRequestTimeout,
 	}

 	etcd, err := etcdClient.New(cfg)
 	if err != nil {


Why was the error log removed?

Because this function is also called by the garbage collector and the List function.
When some of the swarm managers don't have the plugin installed (no ETCD on them), which is still a valid environment for vFile, this message pollutes the logs, and it isn't really necessary: a plugin loops over all the managers to find a valid one with the plugin installed. We only need the error log when none of the managers are valid, which is currently printed at: https://github.com/vmware/docker-volume-vsphere/pull/1850/files/a7f1b6a45f7bf37d664018534cfed073fc3d2e24#diff-ac388eb52d1ec58c6967185c9f547fe2R674
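
A rough sketch of that pattern, for illustration only: `dialFunc` is a hypothetical stand-in for `addrToEtcdClient`, and `firstReachable` and `pluginOn` are made-up names. Per-manager failures are skipped silently and a single message is logged only when every manager fails.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// dialFunc is a hypothetical stand-in for addrToEtcdClient; it returns
// a client handle (a string here) or an error for the given manager.
type dialFunc func(addr string) (string, error)

// firstReachable tries each manager in order and returns the first
// client that can be created. Individual failures are expected (a
// manager may not have the plugin installed) and are not logged.
func firstReachable(managers []string, dial dialFunc) (string, error) {
	for _, addr := range managers {
		client, err := dial(addr)
		if err != nil {
			continue // valid environment: this manager has no ETCD
		}
		return client, nil
	}
	// Only now is the situation an actual error worth logging.
	err := errors.New("no swarm manager with a reachable ETCD endpoint")
	log.Printf("failed to create ETCD client: %v (tried %d managers)", err, len(managers))
	return "", err
}

// pluginOn simulates a swarm where only the given address runs ETCD.
func pluginOn(good string) dialFunc {
	return func(addr string) (string, error) {
		if addr == good {
			return addr, nil
		}
		return "", errors.New("dial timeout")
	}
}

func main() {
	managers := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	client, err := firstReachable(managers, pluginOn("10.0.0.3"))
	fmt.Println("client:", client, "err:", err)
}
```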

luomiao · 2017-08-30T01:03:31Z

@lipingxue
I reproduced the ETCD errors in #1792 and #1844 and made sure the PR resolves them when 1) managers without the vFile plugin installed exist in the swarm; 2) too many managers have left the cluster, resulting in quorum loss.

lipingxue

LGTM

luomiao requested review from msterin and lipingxue August 29, 2017 22:45

vmwclabot added the cla-not-required label Aug 29, 2017

lipingxue reviewed Aug 29, 2017

Address comments.

c1b2478

lipingxue approved these changes Aug 30, 2017

luomiao merged commit 772a172 into vmware-archive:master Aug 30, 2017

luomiao mentioned this pull request Sep 5, 2017

vFile plugin: volume create/rm stuck forever when ETCD is broken #1844

Closed
