This repository has been archived by the owner on Nov 9, 2020. It is now read-only.

TestConcurrency is failing: volume creation is failing against docker 1.13 on ubuntu 16.04 VMs #881

Closed
shuklanirdesh82 opened this issue Jan 30, 2017 · 10 comments

Comments

@shuklanirdesh82
Contributor

shuklanirdesh82 commented Jan 30, 2017

Steps to reproduce:

  1. Run TestConcurrency against Ubuntu 16.04 VMs and observe the reported issue.

Note: TestConcurrency is commented out with #879. Please uncomment the concurrency tests after fixing the reported issue.

=== RUN   TestConcurrency
Running concurrent tests on tcp://192.168.31.187:2375 and tcp://192.168.31.83:2375 (may take a while)...
Running create/delete concurrent test...
--- FAIL: TestConcurrency (29.75s)
	sanity_test.go:199: Successfully connected to tcp://192.168.31.187:2375
	sanity_test.go:199: Successfully connected to tcp://192.168.31.83:2375
	sanity_test.go:298: Create/delete concurrent test failed, err: Error response from daemon: create volTestP10: Post http://%2Frun%2Fdocker%2Fplugins%2Fvmdk.sock/VolumeDriver.Create: http: ContentLength=44 with Body length 0
FAIL

vmdk_ops log

01/27/17 12:47:23 558076 [Ubuntu.14.04-[vsanDatastore] dockvols/4af40ea9-ae91-4556-b6b6-641b72c85ec5/volTestP10.vmdk] [WARNING] vmci_reply returned error Broken pipe (errno=32)
01/27/17 12:47:25 558076 [photon.vsan-[vsanDatastore] dockvols/4af40ea9-ae91-4556-b6b6-641b72c85ec5/volTestP00.vmdk] [INFO   ] executeRequest 'remove' completed with ret=None
01/27/17 12:47:25 558076 [Thread-39] [INFO   ] cmd get with opts {} on tenant_uuid 4af40ea9-ae91-4556-b6b6-641b72c85ec5 datastore vsanDatastore is allowed to execute
01/27/17 12:47:25 558076 [photon.vsan-[vsanDatastore] dockvols/4af40ea9-ae91-4556-b6b6-641b72c85ec5/volTestP01.vmdk] [INFO   ] executeRequest 'get' completed with ret={'Error': 'Volume volTestP01 not found (file: /vmfs/volumes/vsanDatastore/dockvols/4af40ea9-ae91-4556-b6b6-641b72c85ec5/volTestP01.vmdk)'}

Some observations:

  1. Even though the VMs have a PVSCSI adapter, they run into the issue reported earlier for the Photon VM (Error occurred while creating docker volume and allocated capacity stays at '0' - vm was created on ESX 6.0U2 #656), now observed against an Ubuntu VM.
  2. Cosmetic: datastore_cache ends up with duplicates (https://github.com/vmware/docker-volume-vsphere/blob/c50e1180bbc1518850fbf726f420e5abf98bc0a8/esx_service/utils/vmdk_utils.py#L74).
  3. The same duplicates show up in the docker volume ls output:
root@sc-rdops-vm18-dhcp-57-89:~# docker volume ls
DRIVER              VOLUME NAME
vmdk                volTestP00@TestDatastore1
vmdk                volTestP00@TestDatastore1
vmdk                volTestP11@TestDatastore1
vmdk                volTestP11@TestDatastore1

//CC @kerneltime

@shaominchen
Contributor

Hi Ritesh, can you please take a look at this issue? What's the purpose of this test?

@shuklanirdesh82 shuklanirdesh82 modified the milestones: 0.13, v1 GA, 0.12 Feb 15, 2017
@govint
Contributor

govint commented Feb 23, 2017

I have tried the test again and it works fine locally. In CI I am seeing an error when removing the test volume, which is different from what's reported in the issue. Debugging this further.

@tusharnt tusharnt modified the milestones: 0.13, 0.12 Feb 28, 2017
@govint
Contributor

govint commented Mar 1, 2017

I've reproduced this issue several times with a single docker host, and the way it's designed there is no guarantee that the "volume not found" error will not happen. Basically, up to a default of 5 or 2 threads are created to run create and delete of volumes in parallel, and unless those are on separate datastores there is every chance that one thread will remove a volume and the other will get an error.

As long as there are no "KV file not found" or hang-on-lock type issues, the remove error reported in the test is expected.

I don't see any change needed for this issue, except maybe to make the test report errors only for genuine cases, such as a create that should succeed but fails. There can be genuine volume create/remove errors, which should be caught in other tests as well. For this test I suggest logging errors rather than failing the test.

CC @pdhamdhere @msterin

@brunotm
Contributor

brunotm commented Mar 1, 2017

Hello @govint

I've reproduced this issue several times with a single docker host, and the way it's designed there is no guarantee that the "volume not found" error will not happen. Basically, up to a default of 5 or 2 threads are created to run create and delete of volumes in parallel, and unless those are on separate datastores there is every chance that one thread will remove a volume and the other will get an error.

How could this happen if each goroutine is working with a different set of volume names?
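To make the point of the question concrete, here is a minimal sketch, assuming an in-memory stand-in for the volume driver. The volumeStore type, the volName naming scheme, and runWorkers are all hypothetical and are not the code in sanity_test.go. When every goroutine creates and removes only names from its own set, no goroutine can remove a volume that another one expects to find.

```go
package main

import (
	"fmt"
	"sync"
)

// volumeStore is a hypothetical in-memory stand-in for the volume driver.
type volumeStore struct {
	mu   sync.Mutex
	vols map[string]bool
}

func newVolumeStore() *volumeStore {
	return &volumeStore{vols: make(map[string]bool)}
}

func (s *volumeStore) Create(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.vols[name] {
		return fmt.Errorf("volume %s already exists", name)
	}
	s.vols[name] = true
	return nil
}

func (s *volumeStore) Remove(name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.vols[name] {
		return fmt.Errorf("volume %s not found", name)
	}
	delete(s.vols, name)
	return nil
}

// volName builds a name unique to a (worker, index) pair. The format is an
// assumption, loosely modeled on the volTestP* names in the logs above.
func volName(worker, idx int) string {
	return fmt.Sprintf("volTestP%d-%d", worker, idx)
}

// runWorkers starts one goroutine per worker. Each goroutine creates and
// then removes perWorker volumes from its own name set. All errors seen
// are collected and returned.
func runWorkers(s *volumeStore, workers, perWorker int) []error {
	var wg sync.WaitGroup
	errCh := make(chan error, workers*perWorker)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				name := volName(w, i)
				if err := s.Create(name); err != nil {
					errCh <- err
					continue
				}
				if err := s.Remove(name); err != nil {
					errCh <- err
				}
			}
		}(w)
	}
	wg.Wait()
	close(errCh)
	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errs
}

func main() {
	errs := runWorkers(newVolumeStore(), 5, 10)
	fmt.Printf("errors: %d\n", len(errs))
}
```

In this model the run reports zero errors, so a "not found" seen in a real run would have to come from somewhere other than overlapping names between goroutines.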

@govint
Contributor

govint commented Mar 1, 2017

Ok, let me say, I've been testing different changes and lied a bit above. No, this test works fine on a local setup (like I updated earlier), but I do get the errors in CI. And at least one looks like a repro of #954, so I'll debug that and the remove vol errors I'd seen earlier.

@govint
Contributor

govint commented Mar 2, 2017

Able to consistently repro #954 with the concurrent test with two clients. Will be debugging that exclusively.

@govint
Contributor

govint commented Mar 2, 2017

Ok, I was able to test this on a pair of Ubuntu VMs (14.04) and the test runs perfectly well. I modified the test to do just the parallel volume create and remove between two VMs, for a total of fifty volumes, to see if that many reproduces the problem. I'm unable to repro the issue at all. But in CI it's consistently reproducible, and the issue is exactly what's reported in #954 (which seems to be a duplicate of this issue).

The buf pointer returned from C looks valid, and for some reason the C.free() call seems to be getting a fault. Not always, but it's consistent in this test.
I'm testing with instrumented code to check whether C.free() is the call that's always causing the issue, and dumping the stack for the plugin when it restarts. A few more tests and I should be able to confirm the code changes.

@govint
Contributor

govint commented Mar 3, 2017

Being fixed in PR #941.

Duplicate of #954

@govint
Contributor

govint commented Mar 13, 2017

Photon OS issue - vmware/photon#614

@govint govint closed this as completed Mar 15, 2017
@govint govint reopened this Mar 15, 2017
@govint
Contributor

govint commented Mar 15, 2017

The concurrent test failures seem isolated to Photon OS, and seem to reproduce only with a specific version of Photon OS on ESX 6.0.

Photon OS 4.4.41-1 with ESX 6.0P04 and with ESX 6.5 doesn't repro the issue.

@govint govint closed this as completed Mar 15, 2017