Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCH created by wizard docker endpoint and admin portal unavailable: docker persona hangs #6978

Closed
AngieCris opened this issue Dec 18, 2017 · 12 comments · Fixed by #6994
Closed
Assignees
Labels
kind/defect Behavior that is inconsistent with what's intended priority/p0 team/lifecycle

Comments

@AngieCris
Copy link
Contributor

AngieCris commented Dec 18, 2017

Details
Create a VCH with the wizard, wait after docker endpoint and admin portal to come up. Once they're available, connecting to docker endpoint via docker -H VCH_IP:2376 --tls info gives:
Cannot connect to the Docker daemon at tcp://10.192.124.47:2375. Is the docker daemon running?

Also, the admin portal cannot authenticate.

Deploy again with the generated CLI command, it succeeds and the VCH is up and running (docker endpoint and admin portal both work alright).

The generated CLI command:

./vic-machine-linux create --name virtual-container-host --debug 1 --compute-resource /RegionA01/host/RegionA01-COMP01 --image-store RegionA01-ISCSI01-COMP01 --base-image-size 8GB --bridge-network Bridge01-RegionA01-vDS-COMP --bridge-network-range 172.16.0.0/12 --public-network VM-RegionA01-vDS-COMP --no-tlsverify --user Administrator@CORP.LOCAL --thumbprint F7:68:F0:93:F4:EC:B7:FE:C1:02:3F:F3:AB:62:1A:50:E8:9A:0E:85 --target vcsa-01a.corp.local/RegionA01 --ops-user administrator@corp.local

Logs
From the docker-persona logs, portlayer never comes up, so the persona hangs waiting for portlayer:

time="2017-12-18T20:25:08Z" level=info msg="Launching docker personality pprof server on 127.0.0.1:6062" 
Dec 18 2017 20:25:09.331Z DEBUG log cfg: {Formatter:0xc4200730a0 Level:debug Syslog:<nil>}
Dec 18 2017 20:25:09.346Z DEBUG 296.1: [OperationFromContext] [vic/pkg/vsphere/session.(*Session).Connect:191]
Dec 18 2017 20:25:09.346Z DEBUG op=296.1: Creating VMOMI session with thumbprint F7:68:F0:93:F4:EC:B7:FE:C1:02:3F:F3:AB:62:1A:50:E8:9A:0E:85
Dec 18 2017 20:25:09.843Z DEBUG Session Environment Info: 
Dec 18 2017 20:25:09.896Z DEBUG Loaded 180 CAs for registries from system CA bundle
Dec 18 2017 20:25:09.901Z ERROR Unable to load CAs for registry access in config
Dec 18 2017 20:25:09.909Z INFO  Waiting for portlayer to come up

Portlayer gives bad request trying to connect:

Dec 18 2017 21:53:38.248Z DEBUG GuestInfoSource: key: guestinfo.vice./grant_permissions, value: "", error: %!s(<nil>)
Dec 18 2017 21:53:38.248Z DEBUG GuestInfoSource: key: guestinfo.vice./vic_machine_create_options, value: "", error: key not found
Dec 18 2017 21:53:38.248Z DEBUG GuestInfoSource: key: guestinfo.vice./vic_machine_create_options~, value: "", error: key not found
Dec 18 2017 21:53:38.248Z DEBUG op=2108.1: Creating VMOMI session with thumbprint F7:68:F0:93:F4:EC:B7:FE:C1:02:3F:F3:AB:62:1A:50:E8:9A:0E:85
Dec 18 2017 21:53:38.969Z FATAL configure_port_layer ERROR: Failed to connect to vcsa-01a.corp.local: 400 Bad Request

The log bundle:
vch_logs_20171218.zip

Since the CLI works, there's probably something in the API that caused this error, might be related to network configuration.
Cc: @zjs

@AngieCris AngieCris added kind/defect Behavior that is inconsistent with what's intended status/needs-attention The issue needs to be discussed by the team status/needs-triage The issue needs to be evaluated and metadata updated team/lifecycle labels Dec 18, 2017
@AngieCris
Copy link
Contributor Author

AngieCris commented Dec 18, 2017

Tried to re-create several more times with CLI command, some of the CLI create processes failed with EOF when uploading isos to datastore:

WARN[0101] failed an attempt to upload isos with err (Put https://sc2-rdops-vm02-dhcp-93-240.eng.vmware.com/folder/virtual-container-host-8889X/V1.3.0-RC1-15140-BE27555-appliance.iso?dcPath=%2FDatacenter&dsName=datastore1: EOF), 4 retries remain 
Dec 18 2017 16:41:53.531-06:00 WARN  Will try again in 7.142638726s. Operation failed with detected error
WARN[0161] failed an attempt to upload isos with err (Put https://sc2-rdops-vm02-dhcp-93-240.eng.vmware.com/folder/virtual-container-host-8889X/V1.3.0-RC1-15140-BE27555-appliance.iso?dcPath=%2FDatacenter&dsName=datastore1: EOF), 3 retries remain 
Dec 18 2017 16:42:53.664-06:00 WARN  Will try again in 47.58611144s. Operation failed with detected error
INFO[0314] Waiting for IP information

Then it hangs:

INFO[0314] Waiting for IP information                   
ERRO[0494] vic/pkg/trace.(*Operation).Err: Create error: context deadline exceeded
vic/cmd/vic-machine/create.(*Create).Run:741 Create
vic/cmd/vic-machine/common.NewOperation:27 vic-machine-darwin 
ERRO[0495] vic/lib/install/management.(*Dispatcher).ensureApplianceInitializes: Create error: context deadline exceeded
vic/cmd/vic-machine/create.(*Create).Run:741 Create
vic/cmd/vic-machine/common.NewOperation:27 vic-machine-darwin 
ERRO[0495] vic/pkg/trace.(*Operation).Err: Create error: context deadline exceeded
vic/cmd/vic-machine/create.(*Create).Run:741 Create
vic/cmd/vic-machine/common.NewOperation:27 vic-machine-darwin 
ERRO[0495] Unable to get vm config: Post https://sc2-rdops-vm02-dhcp-93-240.eng.vmware.com/sdk: context deadline exceeded 
ERRO[0495] vic/lib/install/management.(*Dispatcher).ensureApplianceInitializes: Create error: context deadline exceeded
vic/cmd/vic-machine/create.(*Create).Run:741 Create
vic/cmd/vic-machine/common.NewOperation:27 vic-machine-darwin 
INFO[0495]                                              
ERRO[0495] Failed to obtain IP address for client interface 
......
ERRO[0496] --------------------                         
ERRO[0496] vic-machine-darwin create failed: Creating VCH exceeded time limit of 3m0s. Please increase the timeout using --timeout to accommodate for a busy vSphere target

Have seen the same failure multiple times in nightly #6933
Not sure if the VC is just down

@AngieCris
Copy link
Contributor Author

AngieCris commented Dec 18, 2017

Video capture of the process by @pdaigle :
create a VCH using the wizard, unable to connect to docker endpoint and portal
copy & paste the generated CLI command
create a VCH using the CLI command
connect successfully to docker endpoint and portal for the VCH created by CLI
https://onevmw-my.sharepoint.com/personal/pdaigle_vmware_com/_layouts/15/guestaccess.aspx?guestaccesstoken=SC%2FKJom7K2LkVP2xM3%2FJSIHxrZAE2Rg8lBvWQGrzHso%3D&docid=2_1ff002f1188b343e296a5a9a76dedb418&rev=1&e=2f5c2c31f81f461e84f5b9a2b8fd2fea

@anchal-agrawal
Copy link
Contributor

Needs a priority - putting in Not Ready.

@jak-atx
Copy link
Contributor

jak-atx commented Dec 19, 2017

The only thing that I could think of that has changed is that we are now using the datacenter specific endpoint: {vc_ip}/datacenter/{datacenter_ip}/vch/

I tried reverting back to using the previous endpoint: {vc_ip}/vch/

And I was able to create a vch that responded to docker info:

docker -H 10.192.120.215:2376 --tls info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: v1.3.0-rc4-15373-b5ee291
Storage Driver: vSphere Integrated Containers v1.3.0-rc4-15373-b5ee291 Backend Engine
VolumeStores:
vSphere Integrated Containers v1.3.0-rc4-15373-b5ee291 Backend Engine: RUNNING
VCH CPU limit: 2681 MHz
VCH memory limit: 4.955 GiB
VCH CPU usage: 1541 MHz
VCH memory usage: 4.922 GiB
VMware Product: VMware vCenter Server
VMware OS: linux-x64
VMware OS version: 6.5.0
Registry Whitelist Mode: disabled. All registry access allowed.
Plugins:
Volume:
Network: bridge
Log:
Swarm: inactive
Operating System: linux-x64
OSType: linux-x64
Architecture: x86_64
CPUs: 2681
Total Memory: 4.955GiB
ID: vSphere Integrated Containers
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
Registry: registry.hub.docker.com
Experimental: false
Live Restore Enabled: false

Using the datacenter specific endpoint this consistently failed.

@jak-atx
Copy link
Contributor

jak-atx commented Dec 19, 2017

However, I was seeing the ‘incorrect username or password’ error consistently (same thing we were seeing with the datacenter specific endpoint initially) until i turned off secure access. So that may be something to investigate also.

@anchal-agrawal
Copy link
Contributor

The only thing that I could think of that has changed is that we are now using the datacenter specific endpoint: {vc_ip}/datacenter/{datacenter_ip}/vch/

@AngieCris @jak-atx is this issue due to a regression from a recent change? I'm not familiar with the creation wizard's code and I'm trying to understand why running the vic-machine create command through the wizard doesn't work, and copying and running the same command manually works.

From @pdaigle's video, I couldn't see that the vic-machine create command through the wizard completed successfully (the video scrolls through the vic-machine create logs in the UI but not when the command completes). The log bundle doesn't have the vic-machine.log file either.

... until i turned off secure access.

@jak-atx Do you mean the tls settings for the VCH in the wizard?

@anchal-agrawal anchal-agrawal self-assigned this Dec 19, 2017
@jak-atx
Copy link
Contributor

jak-atx commented Dec 19, 2017

@anchal-agrawal this appears to be a regression. We recently changed the api endpoint the wizard posts to. when I revert to use the previous endpoint it works (if i turn off the tls verify in the wizard settings, which may be an unrelated issue specific to that enpoint).

@AngieCris
Copy link
Contributor Author

@anchal-agrawal I'm not very familiar how portlayer works...Do you happen to have an idea for the possible reasons why portlayer is getting this error: (or have you seen it before)

Dec 18 2017 21:53:38.969Z FATAL configure_port_layer ERROR: Failed to connect to vcsa-01a.corp.local: 400 Bad Request

What could cause a 400 Bad Request? Could it be an incorrect auth (that would more possibly be a 401)?

@anchal-agrawal
Copy link
Contributor

anchal-agrawal commented Dec 19, 2017

What could cause a 400 Bad Request?

@AngieCris that error is returned while creating a govmomi session here:

sess, err := session.NewSession(sessionconfig).Create(ctx)
if err != nil {
log.Fatalf("configure_port_layer ERROR: %s", err)

We don't have more granular logs, but based on the error message I suspect the error is originating from

_, err = s.Connect(op)
if err != nil {
return nil, err

The 400 response is odd if vcsa-01a.corp.local is a valid VC/ESX target. If the wizard VCH creation fails for the same vic-machine create command that succeeds manually, I'd check for the differences in the request payload between the two cases (as you can see, we set a bunch of fields in the session.Config before the first snippet above, and a few fields before the second snippet).

Could it be an incorrect auth (that would more possibly be a 401)?

If the user/pass are invalid, we'd get the incorrect username or password error.

Some questions:

  • Is this issue reproducible on every multi-datacenter environment? Is it seen consistently on a Nimbus multi-dc VC, for example?
  • Can you confirm whether vic-machine create exits successfully when the VCH is created from the wizard and we're unable to reach the docker endpoint?

@AngieCris
Copy link
Contributor Author

AngieCris commented Dec 19, 2017

@anchal-agrawal thanks for the update.

Is this issue reproducible on every multi-datacenter environment? Is it seen consistently on a Nimbus multi-dc VC, for example?

I think so, at least in @jak-atx 's case it's reproduce-able. If you have the most recent rc4 appliance deployed, and a Nimbus multi-dc VC, and the wizard plugin installed on the VC, you can try using the create wizard to deploy a VCH with all the minimal settings, and try to do a docker info and connect to admin portal when the endpoints become available.

Can you confirm whether vic-machine create exits successfully when the VCH is created from the wizard and we're unable to reach the docker endpoint?

When the VCH is created from the wizard, it's using API create endpoint, not the CLI vic-machine create, and API create doesn't actually check if the VCH can connect to docker endpoint (it basically stops and quits successfully after uploading the image isos).
So even if wizard successfully quits and VCH is created, the persona and portlayer are still initializing in the background.

@jak-atx
Copy link
Contributor

jak-atx commented Dec 19, 2017

@anchal-agrawal we are sending the same payload that worked with RC3 so that shouldn't be an issue.

Also UI sends a vSphere SessionTicket to authenticate, that vic-machine cli does require.

There seems to have been some session related updates in this PR (related to datacenter endpoint):
https://github.com/vmware/vic/pull/6949/files

@AngieCris AngieCris removed the status/needs-triage The issue needs to be evaluated and metadata updated label Dec 19, 2017
@AngieCris
Copy link
Contributor Author

AngieCris commented Dec 20, 2017

Root cause identified with help from @hickeng :

The VC_URL variable that goes into GuestInfo config is wrong. It shouldn't contain path in the URL but it does.
From port-layer log of a the broken VCH that can't connect to docker:

"VC_URL=https://vcsa-01a.corp.local/RegionA01|DC_PATH=RegionA01|CS_PATH=/RegionA01/host/RegionA01-COMP01|POOL_PATH=/RegionA01/host/RegionA01-COMP01/Resources|DS_PATH=RegionA01-ISCSI01-COMP01", error: %!s(<nil>)

Here we see the VC_URL is VC_URL=https://vcsa-01a.corp.local/RegionA01. It should be VC_URL=https://vcsa-01a.corp.local.
This VC_URL is set to the VCH config target on appliance initialization. The target URL we feed to the VCH config should not contain path.

On the code side, the CLI provides the full target URL, both host and path, and later on validator strips away the path part of the URL. On the API side, we set the path to datacenter mo inventory path, so the path is included in the target URL and goes into VCH config.

PR up: #6994

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/defect Behavior that is inconsistent with what's intended priority/p0 team/lifecycle
Projects
None yet
3 participants