This repository has been archived by the owner on Sep 7, 2022. It is now read-only.

Attach failing with NotAuthenticated error. #404

Closed
prashima opened this issue Dec 4, 2017 · 9 comments · Fixed by kubernetes/kubernetes#57978

Comments

@prashima

prashima commented Dec 4, 2017

The pod is not getting created successfully with a statically created disk, whether the disk is attached to the pod directly as a volume or via a PVC.

Steps to reproduce

  1. Deployed a Kubernetes cluster based off the multi-vc-upstream branch. Tried both a custom hyperkube build and 1.9.0-alpha for Project Octopus.
  2. Tried the workflows https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/kubernetes-volumes.html and https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/persistent-vols-claims.html a few minutes after cluster creation. Both failed, with the following messages in the events log:

AttachVolume.Attach failed for volume "test-volume" : NotAuthenticated

Unable to mount volumes for pod "test-vmdk_default(5467bfc6-d6f4-11e7-8b4c-005056ae23f1)": timeout expired waiting for volumes to attach/mount for pod "default"/"test-vmdk". list of unattached/unmounted volumes=[test-volume]

Will be attaching logs shortly.

======================================

Pasting 'events' for reference:

Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message


28m 28m 1 {default-scheduler } Normal Scheduled Successfully assigned test-vmdk to kubernetes-node2
28m 28m 1 {kubelet kubernetes-node2} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-b99km"
26m 5m 10 {kubelet kubernetes-node2} Warning FailedMount Unable to mount volumes for pod "test-vmdk_default(5467bfc6-d6f4-11e7-8b4c-005056ae23f1)": timeout expired waiting for volumes to attach/mount for pod "default"/"test-vmdk". list of unattached/unmounted volumes=[test-volume]
28m 5m 19 {attachdetach-controller } Warning FailedMount AttachVolume.Attach failed for volume "test-volume" : NotAuthenticated
3m 3m 1 {kubelet kubernetes-node2} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "test-volume"
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Pulling pulling image "gcr.io/google_containers/test-webserver"
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Pulled Successfully pulled image "gcr.io/google_containers/test-webserver"
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Created Created container
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Started Started container

@prashima prashima added the P0 label Dec 4, 2017
@prashima prashima added this to the Sprint - Flash milestone Dec 4, 2017
@prashima prashima self-assigned this Dec 4, 2017
@prashima
Author

I couldn't attach the logs because of the huge file size, even with tar.gz compression.

Here is the analysis so far:

  1. The VSphere.AttachDisk method is invoked.

  2. For some reason the session expires, and VSphereConnection tries to establish a new connection:
    596685 {"log":"W1202 00:04:13.541216 1 connection.go:71] Creating new client session since the existing session is not valid or not authenticated\n","stream":"stderr","time":"2017-12-02T00:04:13.541276798Z"}

  3. Inside the VSphere.AttachDisk method, VSphere.getVMFromNodeName is invoked, which returns a govmomi VirtualMachine object. I believe this VirtualMachine object is not renewed for, or made aware of, the new connection.

  4. The list of devices is then requested for the same VirtualMachine object, and this fails with a NotAuthenticated error:
    596690 {"log":"E1202 00:04:13.669325 1 virtualmachine.go:362] Failed to get the devices for VM: "". err: NotAuthenticated\n","stream":"stderr","time":"2017-12-02T00:04:13.669512777Z"}
    596691 {"log":"E1202 00:04:13.669443 1 virtualmachine.go:70] Error occurred while checking if disk is attached on VM: "". vmDiskPath: "[sharedVmfs-1] volumes/myDisk1", err: NotAuthenticated\n","stream":"stderr","time":"2017-12-02T00:04:13.669548225Z"}

  5. The VCP code rediscovers nodes and retries only on ManagedObjectNotFound errors, so an attach operation that hits a NotAuthenticated error is not retried and fails.
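To illustrate the failure mode in steps 2–4, here is a minimal, self-contained Go sketch. The types (`client`, `virtualMachine`) are hypothetical stand-ins for the govmomi client and the cached VirtualMachine object — this is not VCP's actual code, just the caching pattern that goes wrong:

```go
package main

import "fmt"

// client stands in for a govmomi client; valid reports whether its
// session is still authenticated. (Hypothetical type, not govmomi's.)
type client struct{ valid bool }

// virtualMachine caches the pointer to the client it was created with,
// mirroring how VCP caches govmomi VirtualMachine objects.
type virtualMachine struct{ c *client }

// devices fails exactly like step 4 when the cached session is stale.
func (vm *virtualMachine) devices() error {
	if !vm.c.valid {
		return fmt.Errorf("NotAuthenticated")
	}
	return nil
}

func main() {
	old := &client{valid: true}
	vm := &virtualMachine{c: old} // cached at node-discovery time

	// The session expires and VCP dials a fresh connection...
	old.valid = false
	_ = &client{valid: true} // new client exists, but vm still points at old

	// ...yet the cached VM object still uses the stale client.
	fmt.Println(vm.devices()) // NotAuthenticated
}
```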

@prashima
Author

@abrarshivani and I discussed this issue and have finalized this solution:

The VCP code caches VirtualMachine and Datacenter govmomi objects. These objects hold a reference to the govmomi client, and those references are not updated when a new vSphere connection is made. To solve this, we will add a callback to the vclib.Connect() method; in that callback we will update the connection object in all the places that cache it.
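The callback idea can be sketched as follows. This is a toy model under stated assumptions — `connection.Connect`, `onConnect`, and the stand-in `client`/`virtualMachine` types are all hypothetical names illustrating the pattern, not vclib's real API:

```go
package main

import "fmt"

type client struct{ id int }

// connection mimics the proposed vclib.Connect behavior: it holds the
// current client plus callbacks to run whenever a new session is made.
type connection struct {
	c         *client
	onConnect []func(*client)
}

// Connect replaces the session and notifies every registered cache so
// it can swap its stale client reference for the live one.
func (conn *connection) Connect() {
	conn.c = &client{id: conn.c.id + 1} // new session
	for _, cb := range conn.onConnect {
		cb(conn.c)
	}
}

type virtualMachine struct{ c *client }

func main() {
	conn := &connection{c: &client{id: 1}}
	vm := &virtualMachine{c: conn.c} // cached object

	// Register the callback so the cache tracks the live client.
	conn.onConnect = append(conn.onConnect, func(c *client) { vm.c = c })

	conn.Connect() // session renewed
	fmt.Println(vm.c.id == conn.c.id) // true: cache follows the new client
}
```

Without the registered callback, `vm.c` would keep pointing at the retired client after `Connect()`, which is precisely the stale-reference bug described above.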

@shaominchen

I ran into the same issue when creating a pvc - here are the logs:

{"log":"I1218 19:52:44.711058 1 vsphere_volume_util.go:137] VSANStorageProfileData in vsphere volume ""\n","stream":"stderr","time":"2017-12-18T19:52:44.71240654Z"}
{"log":"I1218 19:52:44.711223 1 vsphere.go:961] Starting to create a vSphere volume with volumeOptions: \u0026{CapacityKB:2097152 Tags:map[kubernetes.io/created-for/pvc/namespace:default kubernetes.io/created-for/pvc/name:thinclaim kubernetes.io/created-for/pv/name:pvc-03d10c37-e42d-11e7-8369-005056a7b5d2] Name:kubernetes-dynamic-pvc-03d10c37-e42d-11e7-8369-005056a7b5d2 DiskFormat:thin Datastore: VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType:}\n","stream":"stderr","time":"2017-12-18T19:52:44.712409965Z"}
{"log":"W1218 19:52:44.727288 1 connection.go:71] Creating new client session since the existing session is not valid or not authenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.727414286Z"}
{"log":"I1218 19:52:44.825112 1 vsphere_util.go:191] Getting accessible datastores for node kubernetes-node1\n","stream":"stderr","time":"2017-12-18T19:52:44.825276076Z"}
{"log":"E1218 19:52:44.832968 1 virtualmachine.go:207] Failed to get host system for VM: "". err: NotAuthenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.833199777Z"}
{"log":"I1218 19:52:44.833174 1 vsphere_util.go:174] error "NotAuthenticated" getting accessible datastores for node \u0026{NodeName:kubernetes-node1 vm:0xc420752290}\n","stream":"stderr","time":"2017-12-18T19:52:44.833420102Z"}
{"log":"E1218 19:52:44.833452 1 vsphere.go:1013] Failed to get shared datastore: NotAuthenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.833539568Z"}
{"log":"I1218 19:52:44.833673 1 vsphere.go:1065] The canonical volume path for the newly created vSphere volume is ""\n","stream":"stderr","time":"2017-12-18T19:52:44.833724081Z"}
{"log":"I1218 19:52:44.833884 1 pv_controller.go:1317] failed to provision volume for claim "default/thinclaim" with StorageClass "thin": NotAuthenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.833988065Z"}
{"log":"I1218 19:52:44.834574

@prashima
Author

Since this is a critical issue and needs to be fixed in the next 1.9.* release, here is the short-term solution the team has agreed on (discarding any previously discussed solution) to minimize the impact on existing code:

  1. nodemanager.go: Check and renew the vSphere connection for the cached govmomi VirtualMachine and Datacenter objects.
  2. Make sure the new vSphere connection is accessible to all node VMs that reside on that vSphere. This will be done by creating a new connection (if needed) for the VSphereInstance and then using it in WIP: Address lack of support within Kubernetes to consume vSphere managed storage #1.
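The check-and-renew step from point 1 can be sketched like this. `nodeInfo`, `renewNodeInfo`, and `client` are hypothetical stand-ins for nodemanager.go's cached per-node state and the govmomi client — a sketch of the pattern, not the actual fix:

```go
package main

import "fmt"

// client stands in for a govmomi client session (hypothetical type).
type client struct{ valid bool }

// nodeInfo mirrors a cached per-node entry holding a client reference.
type nodeInfo struct{ c *client }

// renewNodeInfo checks whether the cached session is still authenticated
// before each operation; if not, it rebuilds the cached entry against a
// freshly dialed connection instead of failing with NotAuthenticated.
func renewNodeInfo(n *nodeInfo, dial func() *client) *nodeInfo {
	if n.c.valid {
		return n // session still good; reuse the cached object
	}
	return &nodeInfo{c: dial()} // renew against a new connection
}

func main() {
	stale := &nodeInfo{c: &client{valid: false}}
	fresh := renewNodeInfo(stale, func() *client { return &client{valid: true} })
	fmt.Println(fresh.c.valid) // true
}
```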

Ideally we should not be mixing business logic with vSphere connection details. We need a bigger change to remove all connection handling and low-level failure handling (such as ManagedObjectNotFound) from vsphere.go and move it into the vclib methods closest to the failure. I will file a separate issue for that.

@prashima
Author

prashima commented Jan 2, 2018

I am testing with 1.9, but sending the PR will be blocked by #422 since our tests are failing on master.

@divyenpatel

@prashima you can try the new ignition image docker.io/cnastorage/etcd3-ignition:k8smaster for a hyperkube image built from master branch code.

@prashima
Author

prashima commented Jan 5, 2018

Sure, let me give it a try.

@prashima
Author

prashima commented Jan 8, 2018

The PR is out for review: kubernetes#57978

@divyenpatel

The cherry-pick to 1.9 is out: kubernetes#58124
