This repository has been archived by the owner on Sep 7, 2022. It is now read-only.

Attach failing with NotAuthenticated error. #404

Closed
prashima opened this issue Dec 4, 2017 · 9 comments · Fixed by kubernetes/kubernetes#57978

Comments

@prashima

prashima commented Dec 4, 2017

The pod is not getting created successfully with a statically created disk, whether the disk is attached to the pod directly as a volume or via a PVC.

Steps to reproduce

  1. Deployed a Kubernetes cluster based off the multi-vc-upstream branch. Tried both a custom hyperkube build and 1.9.0-alpha for Project Octopus.
  2. Tried the workflows https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/kubernetes-volumes.html and https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/persistent-vols-claims.html a few minutes after cluster creation. Both failed, with the following messages in the events log:

AttachVolume.Attach failed for volume "test-volume" : NotAuthenticated

Unable to mount volumes for pod "test-vmdk_default(5467bfc6-d6f4-11e7-8b4c-005056ae23f1)": timeout expired waiting for volumes to attach/mount for pod "default"/"test-vmdk". list of unattached/unmounted volumes=[test-volume]

Will be attaching logs shortly.

======================================

Pasting 'events' for reference:

Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message


28m 28m 1 {default-scheduler } Normal Scheduled Successfully assigned test-vmdk to kubernetes-node2
28m 28m 1 {kubelet kubernetes-node2} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-b99km"
26m 5m 10 {kubelet kubernetes-node2} Warning FailedMount Unable to mount volumes for pod "test-vmdk_default(5467bfc6-d6f4-11e7-8b4c-005056ae23f1)": timeout expired waiting for volumes to attach/mount for pod "default"/"test-vmdk". list of unattached/unmounted volumes=[test-volume]
28m 5m 19 {attachdetach-controller } Warning FailedMount AttachVolume.Attach failed for volume "test-volume" : NotAuthenticated
3m 3m 1 {kubelet kubernetes-node2} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "test-volume"
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Pulling pulling image "gcr.io/google_containers/test-webserver"
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Pulled Successfully pulled image "gcr.io/google_containers/test-webserver"
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Created Created container
3m 3m 1 {kubelet kubernetes-node2} spec.containers{test-container} Normal Started Started container

@prashima prashima added the P0 label Dec 4, 2017
@prashima prashima added this to the Sprint - Flash milestone Dec 4, 2017
@prashima prashima self-assigned this Dec 4, 2017
@prashima
Author

I couldn't attach the logs because of the huge file size, even with tar.gz compression.

Here is the analysis so far:

  1. The VSphere.AttachDisk method is invoked.

  2. For some reason the session expires, and VSphereConnection tries to establish a new connection:
    596685 {"log":"W1202 00:04:13.541216 1 connection.go:71] Creating new client session since the existing session is not valid or not authenticated\n","stream":"stderr","time":"2017-12-02T00:04:13.541276798Z"}

  3. Inside the VSphere.AttachDisk method, VSphere.getVMFromNodeName is invoked, which returns a govmomi VirtualMachine object. I believe this VirtualMachine object is not renewed for, or made aware of, the new connection.

  4. The list of devices is then requested for the same VirtualMachine object, and this fails with a NotAuthenticated error:
    596690 {"log":"E1202 00:04:13.669325 1 virtualmachine.go:362] Failed to get the devices for VM: "". err: NotAuthenticated\n","stream":"stderr","time":"2017-12-02T00:04:13.669512777Z"}
    596691 {"log":"E1202 00:04:13.669443 1 virtualmachine.go:70] Error occurred while checking if disk is attached on VM: "". vmDiskPath: "[sharedVmfs-1] volumes/myDisk1", err: NotAuthenticated\n","stream":"stderr","time":"2017-12-02T00:04:13.669548225Z"}

  5. The VCP code rediscovers nodes and retries only on ManagedObjectNotFound errors, so an attach operation that hits a NotAuthenticated error is not retried and fails.
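To illustrate the failure mode in steps 2–4, here is a minimal, self-contained Go sketch. The types (`client`, `virtualMachine`) are hypothetical stand-ins for the govmomi client and the cached VirtualMachine object — this is not VCP's actual code, just the caching pattern that goes wrong:

```go
package main

import "fmt"

// client stands in for a govmomi client; valid reports whether its
// session is still authenticated. (Hypothetical type, not govmomi's.)
type client struct{ valid bool }

// virtualMachine caches the pointer to the client it was created with,
// mirroring how VCP caches govmomi VirtualMachine objects.
type virtualMachine struct{ c *client }

// devices fails exactly like step 4 when the cached session is stale.
func (vm *virtualMachine) devices() error {
	if !vm.c.valid {
		return fmt.Errorf("NotAuthenticated")
	}
	return nil
}

func main() {
	old := &client{valid: true}
	vm := &virtualMachine{c: old} // cached at node-discovery time

	// The session expires and VCP dials a fresh connection...
	old.valid = false
	_ = &client{valid: true} // new client exists, but vm still points at old

	// ...yet the cached VM object still uses the stale client.
	fmt.Println(vm.devices()) // NotAuthenticated
}
```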

@prashima
Author

@abrarshivani and I discussed this issue and have finalized this solution:

The VCP code caches VirtualMachine and Datacenter govmomi objects. These objects hold a reference to the govmomi client, and those references are not updated when a new vSphere connection is made. To solve this, we will add a callback to the vclib.Connect() method; in that callback we will update the connection object in all the places that cache it.
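The callback idea can be sketched as follows. This is a toy model under stated assumptions — `connection.Connect`, `onConnect`, and the stand-in `client`/`virtualMachine` types are all hypothetical names illustrating the pattern, not vclib's real API:

```go
package main

import "fmt"

type client struct{ id int }

// connection mimics the proposed vclib.Connect behavior: it holds the
// current client plus callbacks to run whenever a new session is made.
type connection struct {
	c         *client
	onConnect []func(*client)
}

// Connect replaces the session and notifies every registered cache so
// it can swap its stale client reference for the live one.
func (conn *connection) Connect() {
	conn.c = &client{id: conn.c.id + 1} // new session
	for _, cb := range conn.onConnect {
		cb(conn.c)
	}
}

type virtualMachine struct{ c *client }

func main() {
	conn := &connection{c: &client{id: 1}}
	vm := &virtualMachine{c: conn.c} // cached object

	// Register the callback so the cache tracks the live client.
	conn.onConnect = append(conn.onConnect, func(c *client) { vm.c = c })

	conn.Connect() // session renewed
	fmt.Println(vm.c.id == conn.c.id) // true: cache follows the new client
}
```

Without the registered callback, `vm.c` would keep pointing at the retired client after `Connect()`, which is precisely the stale-reference bug described above.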

@shaominchen

I ran into the same issue when creating a pvc - here are the logs:

{"log":"I1218 19:52:44.711058 1 vsphere_volume_util.go:137] VSANStorageProfileData in vsphere volume ""\n","stream":"stderr","time":"2017-12-18T19:52:44.71240654Z"}
{"log":"I1218 19:52:44.711223 1 vsphere.go:961] Starting to create a vSphere volume with volumeOptions: \u0026{CapacityKB:2097152 Tags:map[kubernetes.io/created-for/pvc/namespace:default kubernetes.io/created-for/pvc/name:thinclaim kubernetes.io/created-for/pv/name:pvc-03d10c37-e42d-11e7-8369-005056a7b5d2] Name:kubernetes-dynamic-pvc-03d10c37-e42d-11e7-8369-005056a7b5d2 DiskFormat:thin Datastore: VSANStorageProfileData: StoragePolicyName: StoragePolicyID: SCSIControllerType:}\n","stream":"stderr","time":"2017-12-18T19:52:44.712409965Z"}
{"log":"W1218 19:52:44.727288 1 connection.go:71] Creating new client session since the existing session is not valid or not authenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.727414286Z"}
{"log":"I1218 19:52:44.825112 1 vsphere_util.go:191] Getting accessible datastores for node kubernetes-node1\n","stream":"stderr","time":"2017-12-18T19:52:44.825276076Z"}
{"log":"E1218 19:52:44.832968 1 virtualmachine.go:207] Failed to get host system for VM: "". err: NotAuthenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.833199777Z"}
{"log":"I1218 19:52:44.833174 1 vsphere_util.go:174] error "NotAuthenticated" getting accessible datastores for node \u0026{NodeName:kubernetes-node1 vm:0xc420752290}\n","stream":"stderr","time":"2017-12-18T19:52:44.833420102Z"}
{"log":"E1218 19:52:44.833452 1 vsphere.go:1013] Failed to get shared datastore: NotAuthenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.833539568Z"}
{"log":"I1218 19:52:44.833673 1 vsphere.go:1065] The canonical volume path for the newly created vSphere volume is ""\n","stream":"stderr","time":"2017-12-18T19:52:44.833724081Z"}
{"log":"I1218 19:52:44.833884 1 pv_controller.go:1317] failed to provision volume for claim "default/thinclaim" with StorageClass "thin": NotAuthenticated\n","stream":"stderr","time":"2017-12-18T19:52:44.833988065Z"}
{"log":"I1218 19:52:44.834574

@prashima
Author

Since this is a critical issue and needs to be fixed in the next 1.9.* release, here is the short-term solution the team has agreed on (discarding any previously discussed solution) to minimize the impact on existing code:

  1. nodemanager.go: Check and renew the vSphere connection for the cached govmomi VirtualMachine and Datacenter objects.
  2. Make sure the new vSphere connection is accessible to all node VMs that reside on that vSphere. This will be done by creating a new connection (if needed) for the VSphereInstance and then using it in WIP: Address lack of support within Kubernetes to consume vSphere managed storage #1.
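The check-and-renew step from point 1 can be sketched like this. `nodeInfo`, `renewNodeInfo`, and `client` are hypothetical stand-ins for nodemanager.go's cached per-node state and the govmomi client — a sketch of the pattern, not the actual fix:

```go
package main

import "fmt"

// client stands in for a govmomi client session (hypothetical type).
type client struct{ valid bool }

// nodeInfo mirrors a cached per-node entry holding a client reference.
type nodeInfo struct{ c *client }

// renewNodeInfo checks whether the cached session is still authenticated
// before each operation; if not, it rebuilds the cached entry against a
// freshly dialed connection instead of failing with NotAuthenticated.
func renewNodeInfo(n *nodeInfo, dial func() *client) *nodeInfo {
	if n.c.valid {
		return n // session still good; reuse the cached object
	}
	return &nodeInfo{c: dial()} // renew against a new connection
}

func main() {
	stale := &nodeInfo{c: &client{valid: false}}
	fresh := renewNodeInfo(stale, func() *client { return &client{valid: true} })
	fmt.Println(fresh.c.valid) // true
}
```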

Ideally we should not be mixing business logic with vSphere connection details. We need a bigger change to remove all connection handling and low-level failure handling (such as ManagedObjectNotFound) from vsphere.go and move it into the vclib methods closest to the failure. I will file a separate issue for that.

@prashima
Author

prashima commented Jan 2, 2018

I am testing with 1.9, but sending the PR will be blocked by #422 since our tests are failing on master.

@divyenpatel

@prashima you can try the new ignition image docker.io/cnastorage/etcd3-ignition:k8smaster for a hyperkube image built from master branch code.

@prashima
Author

prashima commented Jan 5, 2018

Sure, let me give it a try.

@prashima
Author

prashima commented Jan 8, 2018

The PR is out for review: kubernetes#57978

@divyenpatel

The cherry-pick to 1.9 is out: kubernetes#58124
