Skip to content
This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

vFile: include timeout for ETCD request #1850

Merged
merged 2 commits into from Aug 30, 2017

Conversation

luomiao
Copy link
Contributor

@luomiao luomiao commented Aug 29, 2017

Fixes #1792 and #1844

  1. Add DialTimeout when creating new ETCD client. With DialTimeout, ETCD won't create a client unless the remote ETCD endpoint is accessible.
    This can avoid problems when some of the swarm managers don't have the vFile plugin installed.

  2. Add etcdRequestTimeout for all ETCD requests. Thus an ETCD request will return error when the ETCD endpoint is not accessible, instead of blocking wait, which can cause volume operations stuck forever.

Local tests passed.

1. Add DialTimeout when creating new ETCD client. With DialTimeout,
ETCD won't create a client unless the remote ETCD endpoint is accessible.
This can avoid problems when some of the swarm managers don't have the
vFile plugin installed.

2. Add etcdRequestTimeout for all ETCD requests. Thus an ETCD request
will return error when the ETCD endpoint is not accessible, instead of
blocking wait, which can cause volume operations stuck forever.
Copy link
Contributor

@lipingxue lipingxue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, only have some questions/comments.

What is the local test you have done? How do you verify that this PR has fixed the issues?

checkSleepDuration: How long to wait in any busy waiting situation
before checking again
gcTicker: ticker for garbage collector to run a collection
etcdClientCreateError: Error indicating failure to create etcd client
swarmErrorMsg: Message indicating swarm cluster is unhealthy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: better use name "swarmUnhealthyErrorMsg".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will update.

@@ -58,10 +60,12 @@ const (
etcdScheme = "http://"
etcdClusterStateNew = "new"
etcdClusterStateExisting = "existing"
requestTimeout = 5 * time.Second
etcdRequestTimeout = 2 * time.Second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know the timeout value used here is the right one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the etcdUpdateTimeout is in fact the old requestTimeout now.
In the previous code, requestTimeout or 2 * requestTimeout is used for waiting status change of ETCD server, or waiting for a key-value inside ETCD getting updated by other managers. So it's basically keeping the old value.
And the new etcdRequestTimeout is timeout for waiting an ETCD operation to be done. I checked the example code in ETCD codebase and 2 seconds are used for requests timeout.

@@ -670,15 +687,12 @@ func addrToEtcdClient(addr string) (*etcdClient.Client, error) {
s := strings.Split(addr, ":")
endpoint := s[0] + etcdClientPort
cfg := etcdClient.Config{
Endpoints: []string{endpoint},
Endpoints: []string{endpoint},
DialTimeout: etcdRequestTimeout,
}

etcd, err := etcdClient.New(cfg)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the error log is removed?

Copy link
Contributor Author

@luomiao luomiao Aug 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this function is called by the garbage collector and List function too.
When some of the swarm managers don't have plugin installed (no ETCD on it), which is still a valid environment for vFile, this message will pollute the logs. However, it's not really necessary. This is because a plugin will loop all the managers to find a valid one with plugin installed. We only need the error log when none of the managers are valid, which is currently printed out at: https://github.com/vmware/docker-volume-vsphere/pull/1850/files/a7f1b6a45f7bf37d664018534cfed073fc3d2e24#diff-ac388eb52d1ec58c6967185c9f547fe2R674

@luomiao
Copy link
Contributor Author

luomiao commented Aug 30, 2017

@lipingxue
I reproduced the ETCD errors in #1792 and #1844 and make sure the PR resolved those errors when 1) managers without vFile plugin installed existing in the swarm; 2) too many managers left the cluster which results in quorum loss situation.

Copy link
Contributor

@lipingxue lipingxue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@luomiao luomiao merged commit 772a172 into vmware-archive:master Aug 30, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants