Skip to content
This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

Handle datastores missing in the datatore cache and special chars in datastore names. #1140

Merged
merged 2 commits into from Apr 14, 2017

Conversation

govint
Copy link
Contributor

@govint govint commented Apr 3, 2017

See issue #1116 for the root cause, there are two issues, one triggering the ##other,

  1. Issue Added WIP to Readme.md #1 - The VM used in the test was located on a datastore "local-0 (11)", the mkdir command in vmdk_ops.py:get_vol_path() fails with the space and the '(' and ')' chars in the datastore name. This is fixed in get_vol_path().

  2. Issue Installation and build fixes #2 - (1) opens up another issue. In vmdk_utils.py:get_datastore_name() the given datastore isn't found but the empty list is accessed and causes the exception.

  3. The change handles (1) and (2) by allowing vmdk_utils.py:get_datastore_name() to return None if a datastore isn't found. The same scenario may happen where a datastore is temprarily inaccessible and hence not included in the in-memory datastore list. A subsequent request for the datastore name can cause the exception.

  4. To handle the None return from vmdk_utils.py:get_datastore_name() changes are done in places where this function is called to handle the None return value.

Tests:

  1. WIth these fixes, tested with a datastore named "bigone (1)" and no issues.

04/03/17 12:13:37 1177180 [Thread-1] [DEBUG ] Running cmd /usr/lib/vmware/osfs/bin/osfs-mkdir -n '/vmfs/volumes/bigone (1)/dockvols' <---------
04/03/17 12:13:37 1177180 [Thread-1] [INFO ] Created /vmfs/volumes/bigone (1)/dockvols

04/03/17 12:13:37 1177180 [Thread-1] [DEBUG ] Running cmd /bin/mkdir '/vmfs/volumes/bigone (1)/dockvols/11111111-1111-1111-1111-111111111111' <----------
04/03/17 12:13:37 1177180 [Thread-1] [INFO ] Symlink /vmfs/volumes/bigone (1)/dockvols/_DEFAULT is created to point to path /vmfs/volumes/bigone (1)/dockvols/11111111-1111-1111-1111-111111111111
04/03/17 12:13:37 1177180 [Thread-1] [INFO ] Created /vmfs/volumes/bigone (1)/dockvols/11111111-1111-1111-1111-111111111111

  1. Tested a scenario where a datastore isn't on the datastore list, but get_datastore_name is called with the same datastore name.

Without fix:

04/03/17 06:58:09 1162246 [Thread-1] [ERROR ] Unhandled Exception:
Traceback (most recent call last):
File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 1513, in execRequestThread
opts=opts)
File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 804, in executeRequest
vm_datastore = get_datastore_name(vm_datastore_url)
File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 778, in get_datastore_name
datastore_name = vmdk_utils.get_datastore_name(datastore_url)
File "/usr/lib/vmware/vmdkops/Python/vmdk_utils.py", line 359, in get_datastore_name
return res[0]
IndexError: list index out of range
04/03/17 06:58:09 1162246 [Thread-2] [ERROR ] Unhandled Exception:
Traceback (most recent call last):
File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 1513, in execRequestThread
opts=opts)
File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 804, in executeRequest
vm_datastore = get_datastore_name(vm_datastore_url)
File "/usr/lib/vmware/vmdkops/bin/vmdk_ops.py", line 778, in get_datastore_name
datastore_name = vmdk_utils.get_datastore_name(datastore_url)
File "/usr/lib/vmware/vmdkops/Python/vmdk_utils.py", line 359, in get_datastore_name
return res[0]
IndexError: list index out of range

With fix:

administrator@hte-1s-eng-dhcp98:~/vpl-ut/docker-volume-vsphere$ docker volume create --driver=vsphere --name=TestVol1 -o size=400mb
Error response from daemon: create TestVol1: VolumeDriver.Create: Invalid datastore 'None'.
Known datastores: datastore1.

dock_vol_path = os.path.join("/vmfs/volumes", datastore, DOCK_VOLS_DIR)
path = dock_vol_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add this line? code refactor or fix some bug here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the else part thats really not needed. Its not a bug fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could be path = dock_vol_path = os.path.join("/vmfs/volumes", datastore, DOCK_VOLS_DIR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


# Its possible the datastore is None, which can happen if datastores on the
# host could not be identified or initialized with the dockvols path.
if not datastore:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you want to differentiate these two cases? And why the error message for these two cases printed out is different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had kept both cases together but the error message is indicating "default_datastore = None" (which can happen if there were errors in identifying datastores on the host and hence missing in the cache). The user doesn't know how the default_datastore is identified, what it is and why its none. I can remove default_datastore and make a single error message perhaps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think default_datastore is useful. Can we print "default_datastore" as "N/A" when it is "None"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the new condition added.

@pdhamdhere pdhamdhere requested a review from pshahzeb April 7, 2017 15:14
@@ -839,6 +839,9 @@ def generate_tenant_ls_rows(tenant_list):
else:
default_datastore = vmdk_utils.get_datastore_name(tenant.default_datastore_url)

if default_datastore is None:
return rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this happens for one tenant, the function returns. rows could be half populated. Shouldn't rows be flushed; or you should still traverse remaining tenants in tenant_list if this happens in the middle of iterating tenant_list?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it seem default_datastore could be "" (line 838) and None. Why use 2 representations for the same ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -1019,6 +1022,10 @@ def generate_tenant_access_ls_rows(privileges):
datastore = ""
else:
datastore = vmdk_utils.get_datastore_name(p.datastore_url)

if datastore is None:
return rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same. This can happen for one privilege. Need to appropriately consider the list of privileges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, using "" for the datastore if not found.

Copy link
Contributor

@pshahzeb pshahzeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.
I have a couple of comments.

Copy link
Contributor

@msterin msterin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good with a few minor nits (see inline).
One issue though - it seems a missing datastore something is indicated by "" and sometimes by None.
Python will handle the difference in the majority of cases (in if not checks) but it feels wrong. Can you fix it and resolve merge, and the fix is good to go

@@ -839,6 +839,9 @@ def generate_tenant_ls_rows(tenant_list):
else:
default_datastore = vmdk_utils.get_datastore_name(tenant.default_datastore_url)

if default_datastore is None:
return rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it seem default_datastore could be "" (line 838) and None. Why use 2 representations for the same ?

@@ -394,6 +394,8 @@ def get_datastore_name(datastore_url):
# get_datastores() return a list of tuple
# each tuple has format like (datastore_name, datastore_url, dockvol_path)
res = [d[0] for d in get_datastores() if d[1] == datastore_url]
if not res:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 397-399 could be replaced with return res[0] if res else None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done and thanks for suggesting, looks way better.

dock_vol_path = os.path.join("/vmfs/volumes", datastore, DOCK_VOLS_DIR)
path = dock_vol_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could be path = dock_vol_path = os.path.join("/vmfs/volumes", datastore, DOCK_VOLS_DIR)

default_datastore = vm_datastore
default_datastore_url = vm_datastore_url
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason for swapping these lines ? No issue, just curious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this was just editing. I'll restore.

@pdhamdhere
Copy link
Contributor

@govint You need to rebase and resolve conflicts.

Copy link
Contributor

@pdhamdhere pdhamdhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Some minor comments.

Not your change. I was expecting a one place at entry point in executerequest where we identify datastore and validate for that request. However, code is spread all over the place. We clearly have an opportunity to cleanse this code.

@@ -1019,6 +1022,10 @@ def generate_tenant_access_ls_rows(privileges):
datastore = ""
else:
datastore = vmdk_utils.get_datastore_name(p.datastore_url)

if datastore is None:
return rows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like existing code already handles datastore=None and you should not return in middle of loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -837,25 +838,31 @@ def executeRequest(vm_uuid, vm_name, config_path, cmd, full_vol_name, opts):
except ValidationError as ex:
return err(str(ex))

if datastore and not vmdk_utils.validate_datastore(datastore):
if not datastore:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L841 & L846 has same if condition, combine them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and removed the new condition added. Its not needed for now and handles the rare case that a datastore is temporarily not visible on the host.

@govint
Copy link
Contributor Author

govint commented Apr 10, 2017

I'm closing this one for now and reopen once #1155 is merged. The same code is modified in #1155 and I need to rework the logic if any based on those changes.

@govint govint closed this Apr 10, 2017
@pdhamdhere
Copy link
Contributor

@govint We have customer already running into this issue. #1155 is much different in scope. Also, there are pending comments on this PR. Can you please address comments, resolve conflicts and get this merged asap.

@pdhamdhere pdhamdhere reopened this Apr 10, 2017
@msterin
Copy link
Contributor

msterin commented Apr 10, 2017

Failing when there is a space in a name should be done asap regardless of other fixes , IMO

@govint govint force-pushed the create-fixes branch 2 times, most recently from 110eb6c to 2683c33 Compare April 11, 2017 11:34
@govint
Copy link
Contributor Author

govint commented Apr 11, 2017

CI is failing as service isn't coming up for almost 10 - 20 secs when vmdk_ops_test.py is started.,

04/11/17 11:41:39 106779 [photon.vmfs-datastore1._DEFAULT.dockerVol_1491910877] [INFO ] executeRequest 'remove' completed with ret=None
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:44 107506 [MainThread] [INFO ] Checking DB mode for /tmp/test-auth.db...
04/11/17 11:41:44 107506 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:44 107506 [MainThread] [INFO ] Auth DB /tmp/test-auth.db is missing, allowing all access
04/11/17 11:41:45 107512 [MainThread] [INFO ] Running tests. Directory used: /vmfs/volumes/datastore1/dockvols
04/11/17 11:41:45 107520 [MainThread] [INFO ] Connecting to the local Service Instance as 'dcui'
04/11/17 11:41:45 107528 [MainThread] [INFO ] Connecting to the local Service Instance as 'dcui'
04/11/17 11:41:46 107528 [MainThread] [INFO ] Checking DB mode for /etc/vmware/vmdkops/auth-db...
04/11/17 11:41:46 107528 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:46 107528 [MainThread] [INFO ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
04/11/17 11:41:46 106779 [MainThread] [WARNING] Received signal num: 10, restarting.
04/11/17 11:41:46 106779 [MainThread] [WARNING] Operations in flight will be terminated
04/11/17 11:41:46 106779 [MainThread] [INFO ] ==== Starting vmdkops service pid=106779 ====
04/11/17 11:41:46 106779 [MainThread] [INFO ] Loading VMCI server lib.
04/11/17 11:41:46 106779 [MainThread] [INFO ] Connecting to the local Service Instance as 'dcui'
04/11/17 11:41:47 107512 [MainThread] [INFO ] Checking DB mode for /etc/vmware/vmdkops/auth-db...
04/11/17 11:41:47 107512 [MainThread] [INFO ] Config DB exists and has modifications, mode SingleNode
04/11/17 11:41:52 107512 [MainThread] [INFO ] datacenter='vim.Datacenter:ha-datacenter' vm_folder='vim.Folder:ha-folder-vm' hosts=(vim.ManagedEntity) [
'vim.ComputeResource:ha-compute-res'
] resource_pool='vim.ResourcePool:ha-root-pool'

@pdhamdhere
Copy link
Contributor

CI is failing as service isn't coming up for almost 10 - 20 secs when vmdk_ops_test.py is started.,

@govint Where is 10-20Secs delay?

@shuklanirdesh82 Can you please check CI run why there is "connection closed" error.

@govint
Copy link
Contributor Author

govint commented Apr 11, 2017

Sorry, its just over a 10 second delay. Test-esx started at this time,

=> Running target test-esx Tue Apr 11 11:41:42 UTC 2017 <-------

Running unit tests in /tmp/vmdk_ops_unittest5002/vmdk_ops_test.py...
Creating new DB at /etc/vmware/vmdkops/auth-db

...........Connection to 192.168.31.62 closed by remote host. <-------
make[2]: *** [test-esx] Error 255
make[1]: *** [test-esx] Error 2
make: *** [test-esx] Error 2
=> Build + Test not successful Tue Apr 11 11:42:49 UTC 2017 <--------- Test completed with error.

ESX service comes up after this time,

....
04/11/17 11:41:46 107528 [MainThread] [INFO ] Checking DB mode for /etc/vmware/vmdkops/auth-db...
04/11/17 11:41:46 107528 [MainThread] [INFO ] Config DB does not exist. mode NotConfigured
04/11/17 11:41:46 107528 [MainThread] [INFO ] Auth DB /etc/vmware/vmdkops/auth-db is missing, allowing all access
04/11/17 11:41:46 106779 [MainThread] [WARNING] Received signal num: 10, restarting.
04/11/17 11:41:46 106779 [MainThread] [WARNING] Operations in flight will be terminated
04/11/17 11:41:46 106779 [MainThread] [INFO ] ==== Starting vmdkops service pid=106779 ====
04/11/17 11:41:46 106779 [MainThread] [INFO ] Loading VMCI server lib.
04/11/17 11:41:46 106779 [MainThread] [INFO ] Connecting to the local Service Instance as 'dcui'
04/11/17 11:41:47 107512 [MainThread] [INFO ] Checking DB mode for /etc/vmware/vmdkops/auth-db...
04/11/17 11:41:47 107512 [MainThread] [INFO ] Config DB exists and has modifications, mode SingleNode
04/11/17 11:41:52 107512 [MainThread] [INFO ] datacenter='vim.Datacenter:ha-datacenter' vm_folder='vim.Folder:ha-folder-vm' hosts=(vim.ManagedEntity) [
'vim.ComputeResource:ha-compute-res'
] resource_pool='vim.ResourcePool:ha-root-pool'


error_info, tenant_uuid, tenant_name = auth.authorize(vm_uuid, datastore_url, cmd, opts)
if error_info:
return err(error_info)

# get /vmfs/volumes/<volid>/dockvols path on ESX:
datastore = get_datastore_name(datastore_url)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this block? was it redundant or not required?

Copy link
Contributor Author

@govint govint Apr 11, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code has the same logic distributed in the first part of executeRequest. I've consolidated the datastore and datastore_url selection above.

Line 844 above sets datastore to the default_datastore (if user hasn't provided a datastore name) and above that default_datastore is set to either the VM datastore or the one found for the tenant.

@msterin
Copy link
Contributor

msterin commented Apr 11, 2017

The code change looks good to me. Will give LGTM when CI passes

@@ -920,6 +920,8 @@ def generate_tenant_ls_rows(tenant_list):
default_datastore = ""
else:
default_datastore = vmdk_utils.get_datastore_name(tenant.default_datastore_url)
if default_datastore is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Would be nice to comment... something like "use empty string to simplify line formatting down the road"

@@ -705,7 +705,7 @@ def get_vol_path(datastore, tenant_name=None):

if not os.path.isdir(dock_vol_path):
# The osfs tools are usable for DOCK_VOLS_DIR on all datastores
cmd = "{0} {1}".format(OSFS_MKDIR_CMD, dock_vol_path)
cmd = "{0} '{1}'".format(OSFS_MKDIR_CMD, dock_vol_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (since the line is touched anyways) - it could be cmd = "{} '{}'".format(OSFS_MKDIR_CMD, dock_vol_path) , no need for positionals, just sequential params are fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@shuklanirdesh82
Copy link
Contributor

#1140 (comment)
@shuklanirdesh82 Can you please check CI run why there is "connection closed" error.

As we are using snapshot, the esx vm state was gone. I had checked sshd connection and esx VM has sufficient configuration.

FYI:Restarted the failed run and it passed.

Copy link
Contributor

@msterin msterin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a race in init_datastoreCache(). Liping has the fix.
Could it be the reason for missing DB name ? Or did you actually added / renamed a datastore while vmdk_ops was in flight ?

@govint
Copy link
Contributor Author

govint commented Apr 14, 2017

I haven't changed the DS name during the test, this is the CI run that was failing.

Can I get approval to submit this change as CI also has passed?

@msterin
Copy link
Contributor

msterin commented Apr 14, 2017

If there was no change DS during the run, then the fix is for a symptom, not root cause. I think we should merge the 'spaces' fix, and get the race fixed before deciding on merging the rest .

@govint govint merged commit f4beb05 into master Apr 14, 2017
@govint govint deleted the create-fixes branch April 14, 2017 16:31
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants