vFile: include timeout for ETCD request #1850
Conversation
1. Add DialTimeout when creating new ETCD client. With DialTimeout, ETCD won't create a client unless the remote ETCD endpoint is accessible. This can avoid problems when some of the swarm managers don't have the vFile plugin installed. 2. Add etcdRequestTimeout for all ETCD requests. Thus an ETCD request will return error when the ETCD endpoint is not accessible, instead of blocking wait, which can cause volume operations stuck forever.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, only have some questions/comments.
What is the local test you have done? How do you verify that this PR has fixed the issues?
checkSleepDuration: How long to wait in any busy waiting situation | ||
before checking again | ||
gcTicker: ticker for garbage collector to run a collection | ||
etcdClientCreateError: Error indicating failure to create etcd client | ||
swarmErrorMsg: Message indicating swarm cluster is unhealthy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: better use name "swarmUnhealthyErrorMsg".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update.
@@ -58,10 +60,12 @@ const ( | |||
etcdScheme = "http://" | |||
etcdClusterStateNew = "new" | |||
etcdClusterStateExisting = "existing" | |||
requestTimeout = 5 * time.Second | |||
etcdRequestTimeout = 2 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you know the timeout value used here is the right one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the etcdUpdateTimeout is in fact the old requestTimeout now.
In the previous code, requestTimeout or 2 * requestTimeout is used for waiting status change of ETCD server, or waiting for a key-value inside ETCD getting updated by other managers. So it's basically keeping the old value.
And the new etcdRequestTimeout is timeout for waiting an ETCD operation to be done. I checked the example code in ETCD codebase and 2 seconds are used for requests timeout.
@@ -670,15 +687,12 @@ func addrToEtcdClient(addr string) (*etcdClient.Client, error) { | |||
s := strings.Split(addr, ":") | |||
endpoint := s[0] + etcdClientPort | |||
cfg := etcdClient.Config{ | |||
Endpoints: []string{endpoint}, | |||
Endpoints: []string{endpoint}, | |||
DialTimeout: etcdRequestTimeout, | |||
} | |||
|
|||
etcd, err := etcdClient.New(cfg) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the error log is removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this function is called by the garbage collector and List function too.
When some of the swarm managers don't have plugin installed (no ETCD on it), which is still a valid environment for vFile, this message will pollute the logs. However, it's not really necessary. This is because a plugin will loop all the managers to find a valid one with plugin installed. We only need the error log when none of the managers are valid, which is currently printed out at: https://github.com/vmware/docker-volume-vsphere/pull/1850/files/a7f1b6a45f7bf37d664018534cfed073fc3d2e24#diff-ac388eb52d1ec58c6967185c9f547fe2R674
@lipingxue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Fixes #1792 and #1844
Add DialTimeout when creating new ETCD client. With DialTimeout, ETCD won't create a client unless the remote ETCD endpoint is accessible.
This can avoid problems when some of the swarm managers don't have the vFile plugin installed.
Add etcdRequestTimeout for all ETCD requests. Thus an ETCD request will return error when the ETCD endpoint is not accessible, instead of blocking wait, which can cause volume operations stuck forever.
Local tests passed.