Agents Randomly Flapping since Rancher 1.5.2 Upgrade #8326

Closed
ecliptik opened this Issue Mar 28, 2017 · 71 comments

ecliptik commented Mar 28, 2017

Rancher Versions:
Server: v1.5.2
healthcheck: rancher/healthcheck:v0.2.3
ipsec: rancher/net:holder
network-services: rancher/metadata:v0.8.11, rancher/network-manager:v0.5.3
scheduler: rancher/scheduler:v0.7.5
kubernetes (if applicable): NA

Docker Version:
1.12.6

OS and where are the hosts located? (cloud, bare metal, etc):
Ubuntu 16.04 on AWS

Setup Details: (single node rancher vs. HA rancher, internal DB vs. external DB)
Single node with External RDS DB and --advertise-http-port set

Followed the SSL/ELB docs to properly set up proxy_protocol.

Environment Type: (Cattle/Kubernetes/Swarm/Mesos)
cattle

Steps to Reproduce:
Upgraded from Rancher v1.2.2 to v1.5.2; agents now randomly disconnect and try to reconnect.

Results:
Clusters should be stable and not show agents disconnecting.

ecliptik commented Mar 28, 2017

Agents flapping in UI.
screen shot 2017-03-28 at 8 27 08 am

AWS ELB setup with proxy_protocol enabled.
screen shot 2017-03-28 at 8 54 19 am

Agent logs looping

3/27/2017 11:15:45 AMINFO: Already up to date
3/27/2017 11:15:45 AMINFO: Sending python-agent applied 1-8d88d6ac405cca84313a9fb1c8b4eba2a4a0f125e4309cdb47625bafbb10ace4
3/27/2017 11:15:45 AMINFO: Sending pyagent applied 1-9e3dcab12585149a65d818082279708965cd826df1937a55b24f48428f8f0c0644fc1033b5be15c88dcf9540b1c04c67
3/27/2017 11:15:46 AMINFO: Starting agent /var/lib/cattle/pyagent/apply.sh
3/27/2017 11:15:46 AMINFO: Executing /var/lib/cattle/pyagent/agent
3/27/2017 11:15:46 AMINFO: Updating host-config
3/27/2017 11:15:46 AMINFO: Downloading https://rancher.demandbase.com/v1//configcontent//host-config current=host-config-1-4995a4a06e8cea52da7f729987a035724633b2abc04a05a443b107618c92c60b
3/27/2017 11:15:46 AMINFO: Already up to date
3/27/2017 11:15:46 AMINFO: Sending host-config applied 1-4995a4a06e8cea52da7f729987a035724633b2abc04a05a443b107618c92c60b
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Launching agent"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="https://rancher.demandbase.com/v1EA81F417CA84DC80627EW4YCWcczT4EbLS3ruqWFA7hAGFdEykDXrppRgeYh"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Listening for events on https://rancher.demandbase.com/v1"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="launching hostapi"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Starting event router."
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Initializing event router" workerCount=250
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Connection established"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Starting websocket pings"
3/27/2017 11:15:46 AMtime="2017-03-27T18:15:46Z" level=info msg="Connecting to proxy." url="wss://rancher.demandbase.com/v1/connectbackend?token=<REDACTED>
3/27/2017 11:16:16 PMINFO: Starting agent for <REDACTED>
3/27/2017 11:16:17 PMINFO: Access Key: <REDACTED>
3/27/2017 11:16:17 PMINFO: Config URL: https://rancher.demandbase.com/v1
3/27/2017 11:16:17 PMINFO: Storage URL: https://rancher.demandbase.com/v1
3/27/2017 11:16:17 PMINFO: API URL: https://rancher.demandbase.com/v1
3/27/2017 11:16:17 PMINFO: IP: 10.9.1.91
3/27/2017 11:16:17 PMINFO: Port:
3/27/2017 11:16:17 PMINFO: Required Image: rancher/agent:v1.2.1
3/27/2017 11:16:17 PMINFO: Current Image: rancher/agent:v1.2.1
3/27/2017 11:16:17 PMINFO: Using image rancher/agent:v1.2.1

Additional agent logs:

time="2017-03-23T04:12:06Z" level=info msg="Registering backend for host 0f2925d0-f6e5-4589-495b-dac751bbae39 with session ID c22c6eba-ff61-4a98-b44a-fe76877746d6." 
time="2017-03-23T04:12:10Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:10Z" level=info msg="Registering backend for host 41ba6d9a-10f9-4c46-61b2-b796a7c9dd3c with session ID 2ecfd38e-520d-4846-9795-20bd443c4611." 
time="2017-03-23T04:12:10Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:10Z" level=info msg="Registering backend for host fc90e4e2-2e23-4900-4ca5-b669bc6b4e9e with session ID 9da14b33-aad7-4556-912e-77458becf034." 
time="2017-03-23T04:12:11Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:11Z" level=info msg="Registering backend for host 56aa5a0a-fb87-464d-6c42-9a23b2e02de6 with session ID 0f4edad7-b55b-4b17-a053-e20d814eac02." 
time="2017-03-23T04:12:12Z" level=info msg="Handling backend connection request." 
time="2017-03-23T04:12:12Z" level=info msg="Registering backend for host 124b306f-6d38-4da2-6bd8-aaea8aee6f67 with session ID 5c248471-bdfe-4370-9725-32d345daf22f." 
2017-03-23 04:12:13,577 ERROR [1be01214-be8f-4778-8734-24d58532e2dd:6184117] [instance:233787] [instance.start] [] [ecutorService-5] [c.p.e.p.i.DefaultProcessInstanceImpl] Unknown exception org.jooq.exception.DataChangedException: Database record has been changed
    at org.jooq.impl.UpdatableRecordImpl.checkIfChanged(UpdatableRecordImpl.java:550) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.storeUpdate0(UpdatableRecordImpl.java:291) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.access$200(UpdatableRecordImpl.java:90) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl$3.operate(UpdatableRecordImpl.java:260) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.RecordDelegate.operate(RecordDelegate.java:123) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.storeUpdate(UpdatableRecordImpl.java:255) ~[jooq-3.3.0.jar:na]
    at org.jooq.impl.UpdatableRecordImpl.update(UpdatableRecordImpl.java:149) ~[jooq-3.3.0.jar:na]
    at io.cattle.platform.object.impl.JooqObjectManager.persistRecord(JooqObjectManager.java:228) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.object.impl.JooqObjectManager.setFieldsInternal(JooqObjectManager.java:133) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]
    at io.cattle.platform.object.impl.JooqObjectManager$3.execute(JooqObjectManager.java:121) ~[cattle-framework-object-0.5.0-SNAPSHOT.jar:na]

@StrongMonkey StrongMonkey self-assigned this Mar 28, 2017

jmreicha commented Mar 28, 2017

Seeing similar issues. Rancher 1.5.1, HA setup behind ELB and connected to RDS. Docker 1.12.6.

Last night I lost 3 of my rancher agents spontaneously. I SSH'd to the server, and everything looked fine as far as I could tell. All of the containers were still running; they just weren't connected to the cluster any longer. I cleared the rancher-agent and the other containers on the host, reran the rancher bootstrap script, and the node rejoined the cluster.

I can't figure out why the agents randomly disconnected though, which is concerning. I did find this in the logs. Any ideas why there would be websocket issues? The IP address in the TLS handshake error is 100% an ELB address.

time="2017-03-28T04:32:33Z" level=info msg="Handling backend connection request."
time="2017-03-28T04:32:33Z" level=info msg="Registering backend for host 5ca832d9-e949-4e41-7b70-5b653b41ba1f with session ID e07437cc-0b52-49d9-b34b-ce8bc8d245a6."
2017/03/28 05:03:26 http: TLS handshake error from 172.26.30.88:63316: EOF
2017/03/28 05:03:29 http: TLS handshake error from 172.26.30.88:63324: EOF
time="2017-03-28T05:38:31Z" level=info msg="Shutting down backend d11849cb-d511-4055-4001-30001f8e149d. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-28T05:38:31Z" level=info msg="Removed backend. Key: d11849cb-d511-4055-4001-30001f8e149d. Session ID f9302ce6-8634-427f-82e8-d674978b09cb ."
time="2017-03-28T05:38:44Z" level=info msg="Handling backend connection request."
time="2017-03-28T05:38:44Z" level=info msg="Registering backend for host d11849cb-d511-4055-4001-30001f8e149d with session ID 29397ff5-664a-40ce-a3b7-77d48aedc376."
2017-03-28 05:45:31,659 WARN  [:] [] [] [] [p859417998-4093] [i.c.p.a.p.s.j.WebSocketMessageWriter] Unhandled Error (closing connection) org.eclipse.jetty.websocket.api.WebSocketException: RemoteEndpoint unavailable, current state [CLOSING], expecting [OPEN or CONNECTED]
at org.eclipse.jetty.websocket.common.WebSocketSession.getRemote(WebSocketSession.java:252) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.events.AbstractEventDriver.incomingFrame(AbstractEventDriver.java:141) ~[websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.WebSocketSession.incomingFrame(WebSocketSession.java:309) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.extensions.ExtensionStack.incomingFrame(ExtensionStack.java:214) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.Parser.notifyFrame(Parser.java:220) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:258) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:628) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:476) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) [jetty-io-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635) [jetty-util-9.2.11.v20150529.jar:9.2.11.v20150529]
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555) [jetty-util-9.2.11.v20150529.jar:9.2.11.v20150529]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
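
For reference, one way to confirm that a source IP like the one in those TLS handshake errors belongs to the ELB is to look up the owning network interface (a sketch; the IP is taken from the log above and this assumes the AWS CLI with EC2 read permissions):

# For a classic ELB, the ENI description normally starts with "ELB <name>".
aws ec2 describe-network-interfaces \
  --filters Name=addresses.private-ip-address,Values=172.26.30.88 \
  --query 'NetworkInterfaces[].Description' --output text
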
ecliptik commented Mar 28, 2017

After talking with @aemneina in the Rancher Slack channel, it appears this could be database related, with our master running into resource limits due to our RDS instance size (m4.large).

screen shot 2017-03-28 at 9 33 20 am

We're upgrading the RDS instance to m4.xlarge to see if it helps.

ecliptik commented Mar 29, 2017

Seeing this in the logs repeatedly,

time="2017-03-29T05:51:13Z" level=info msg="Starting websocket pings"
time="2017-03-29T15:40:40Z" level=error msg="Received error reading from socket. Exiting." error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2017-03-29T15:40:40Z" level=error msg="websocket: close 1006 (abnormal closure): unexpected EOF"

happening in all environments on almost all the agents.

ecliptik commented Mar 29, 2017

Rancher Master logs are showing:

time="2017-03-25T18:13:57Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:13:57Z" level=info msg="Registering backend for host 80addee7-b1c3-46ad-5082-79a3d3eccea8 with session ID 11d8d8c6-7205-4167-b151-40d191fcce64."
time="2017-03-25T18:14:01Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:01Z" level=info msg="Registering backend for host 6a83381a-e309-439f-7fab-25b54e54e47e with session ID 5b920945-d810-4946-b0e6-d285d27773fb."
2017-03-25 18:14:02,082 WARN  [:] [] [] [] [p859417998-4530] [i.c.p.a.p.s.j.WebSocketMessageWriter] Unhandled Error (closing connection) org.eclipse.jetty.websocket.api.WebSocketException: RemoteEndpoint unavailable, current state [CLOSING], expecting [OPEN or CONNECTED]
        at org.eclipse.jetty.websocket.common.WebSocketSession.getRemote(WebSocketSession.java:252) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.events.AbstractEventDriver.incomingFrame(AbstractEventDriver.java:141) ~[websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.WebSocketSession.incomingFrame(WebSocketSession.java:309) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.extensions.ExtensionStack.incomingFrame(ExtensionStack.java:214) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.Parser.notifyFrame(Parser.java:220) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:258) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:628) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
        at org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:476) [websocket-common-9.2.11.v20150529.jar:9.2.11.v20150529]
time="2017-03-25T18:14:02Z" level=info msg="Shutting down backend 01a12e35-363f-41b6-6a41-819030c2c765. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:02Z" level=info msg="Removed backend. Key: 01a12e35-363f-41b6-6a41-819030c2c765. Session ID 1c369dcd-e8a9-442d-8d2e-75d610ecc7ef ."
time="2017-03-25T18:14:03Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:03Z" level=info msg="Registering backend for host 8a3f62c8-2547-4a95-5734-e0a84c328acd with session ID 4b6cf13b-669f-4c20-9955-f9a36678ad43."
time="2017-03-25T18:14:03Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:03Z" level=info msg="Registering backend for host 618f1923-255c-498c-666c-90cdee1a8c51 with session ID 767c61c5-c12f-4c90-b75e-cd0778cddbe5."
time="2017-03-25T18:14:04Z" level=info msg="Shutting down backend be5570d4-cbc2-45c5-7099-0d44591dff8c. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:04Z" level=info msg="Removed backend. Key: be5570d4-cbc2-45c5-7099-0d44591dff8c. Session ID 431ce8d5-74c1-4572-83a4-60f1c4f6fc52 ."
time="2017-03-25T18:14:05Z" level=info msg="Shutting down backend cde44bd4-6e55-487b-64ba-8f5edf338c97. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:05Z" level=info msg="Removed backend. Key: cde44bd4-6e55-487b-64ba-8f5edf338c97. Session ID 47d8b9a0-7f90-418d-b345-e9869ceb4d90 ."
time="2017-03-25T18:14:07Z" level=info msg="Shutting down backend 6f75f30c-6c84-4109-6e0c-82a1da2e57a2. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:07Z" level=info msg="Removed backend. Key: 6f75f30c-6c84-4109-6e0c-82a1da2e57a2. Session ID 1ae4a641-a2bd-4492-8fc1-a208e9bb7118 ."
time="2017-03-25T18:14:10Z" level=info msg="Shutting down backend f4dbaf2c-8e10-4174-5634-1e0f2d59de84. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:10Z" level=info msg="Removed backend. Key: f4dbaf2c-8e10-4174-5634-1e0f2d59de84. Session ID 80c8916d-b91c-4604-b58d-865bd3bc1ccd ."
time="2017-03-25T18:14:10Z" level=info msg="Shutting down backend 026ee519-0ee0-4f5e-4bd0-640fa6cd201d. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:10Z" level=info msg="Removed backend. Key: 026ee519-0ee0-4f5e-4bd0-640fa6cd201d. Session ID 27d5393f-3046-4880-8ec1-18d637130c2c ."
time="2017-03-25T18:14:11Z" level=info msg="Shutting down backend e5a881cf-b76c-44b3-4c1d-ca55c4f9d8f0. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:11Z" level=info msg="Removed backend. Key: e5a881cf-b76c-44b3-4c1d-ca55c4f9d8f0. Session ID f61ca915-e570-4750-9072-b5a4ec365f20 ."
time="2017-03-25T18:14:11Z" level=info msg="Shutting down backend 81675666-b6de-4d48-65b5-5713ce221ce9. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:11Z" level=info msg="Removed backend. Key: 81675666-b6de-4d48-65b5-5713ce221ce9. Session ID 992a5ab2-2ae9-45c8-a8b0-f47bca743969 ."
time="2017-03-25T18:14:13Z" level=info msg="Shutting down backend 2bbec463-98e8-4be8-4dfe-21faaca5a642. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:13Z" level=info msg="Removed backend. Key: 2bbec463-98e8-4be8-4dfe-21faaca5a642. Session ID 0d3de709-c5dd-4ed8-a166-b7d32d67b380 ."
time="2017-03-25T18:14:15Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:15Z" level=info msg="Registering backend for host 01a12e35-363f-41b6-6a41-819030c2c765 with session ID 45774d2d-7921-424d-b55b-23ce1f39b358."
time="2017-03-25T18:14:16Z" level=info msg="Shutting down backend 53e4a749-a2a5-40c2-7b9d-18c3f2286ac6. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:16Z" level=info msg="Removed backend. Key: 53e4a749-a2a5-40c2-7b9d-18c3f2286ac6. Session ID db4ab728-0cd0-40c2-b72f-cbf2578fd404 ."
time="2017-03-25T18:14:17Z" level=info msg="Shutting down backend 11228144-5f4a-42a0-4d3f-5e9318d6227b. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-25T18:14:17Z" level=info msg="Removed backend. Key: 11228144-5f4a-42a0-4d3f-5e9318d6227b. Session ID a4f8e0b6-64cb-49b8-98d8-f26d991307a6 ."
time="2017-03-25T18:14:17Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:17Z" level=info msg="Registering backend for host be5570d4-cbc2-45c5-7099-0d44591dff8c with session ID f65b437e-ad06-4a74-a5ef-6c2b63247675."
time="2017-03-25T18:14:17Z" level=info msg="Handling backend connection request."
time="2017-03-25T18:14:17Z" level=info msg="Registering backend for host cde44bd4-6e55-487b-64ba-8f5edf338c97 with session ID 01b4cf4b-72e9-4323-99cd-778e7ce1fc05."
time="2017-03-25T18:14:19Z" level=info msg="Shutting down backend c95ce07e-70b8-4d08-73b4-b45f03d377e1. Connection closed because: websocket: close 1006 unexpected EOF."

More errors in the logs:

2017/03/29 15:29:48 http: proxy error: net/http: request canceled
2017/03/29 15:29:49 http: proxy error: net/http: request canceled
time="2017-03-29T15:31:01Z" level=info msg="Listing webhooks"
2017-03-29 15:40:58,979 INFO    [main] [ConsoleStatus] Starting [11/86]: WebsocketProxyLauncher
time="2017-03-29T15:41:01Z" level=info msg="Downloading key from http://localhost:8081/v1/scripts/api.crt"
time="2017-03-29T15:41:01Z" level=fatal msg="Error getting config." error="Invalid key content"
time="2017-03-29T15:41:02Z" level=fatal msg="Failed to configure cattle client: Get http://localhost:8080/v1: dial tcp [::1]:8080: getsockopt: connection refused"
time="2017-03-29T15:41:02Z" level=info msg="Starting Catalog Service (port 8088, refresh interval 300 seconds)"
time="2017-03-29T15:41:03Z" level=info msg="Downloading key from http://localhost:8081/v1/scripts/api.crt"
time="2017-03-29T15:41:03Z" level=info msg="Starting websocket proxy. Listening on [:8080], Proxying to cattle API at [localhost:8081], Monitoring parent pid [9]."

websocket EOFs

2017/03/28 17:54:07 http: proxy error: net/http: request canceled
time="2017-03-29T05:50:45Z" level=info msg="Exiting go-machine-service"
time="2017-03-29T05:50:46Z" level=info msg="Setting log level" logLevel=info
time="2017-03-29T05:50:46Z" level=info msg="Starting go-machine-service..." gitcommit=v0.36.1
time="2017-03-29T05:50:46Z" level=info msg="Waiting for handler registration (1/2)"
time="2017-03-29T05:50:47Z" level=info msg="Initializing event router" workerCount=10
time="2017-03-29T05:50:47Z" level=info msg="Connection established"
time="2017-03-29T05:50:47Z" level=info msg="Waiting for handler registration (2/2)"
time="2017-03-29T05:50:47Z" level=info msg="Starting websocket pings"
time="2017-03-29T05:50:48Z" level=info msg="Initializing event router" workerCount=250
time="2017-03-29T05:50:48Z" level=info msg="Connection established"
time="2017-03-29T05:50:48Z" level=info msg="Starting websocket pings"
time="2017-03-29T05:50:48Z" level=info msg="Installing builtin drivers"
time="2017-03-29T05:50:48Z" level=info msg="Downloading all drivers"
time="2017-03-29T05:50:48Z" level=info msg="Copying /var/lib/cattle/machine-drivers/56b60d32caac27345f4f55c4c91ed8adcfe7b3d5abcadd29d0551812f222f37e-docker-machine-driver-ubiquity => /usr/local/bin/docker-machine-driver-ubiquity"
time="2017-03-29T05:50:48Z" level=info msg="Done downloading all drivers"
time="2017-03-29T05:50:47Z" level=info msg="Starting websocket pings"
time="2017-03-29T05:51:01Z" level=info msg="Shutting down backend 56aa5a0a-fb87-464d-6c42-9a23b2e02de6. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:01Z" level=info msg="Removed backend. Key: 56aa5a0a-fb87-464d-6c42-9a23b2e02de6. Session ID abd1dd6c-fe84-45c6-a1e4-d6963cf98a03 ."
time="2017-03-29T05:51:04Z" level=info msg="Shutting down backend 41ba6d9a-10f9-4c46-61b2-b796a7c9dd3c. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:04Z" level=info msg="Removed backend. Key: 41ba6d9a-10f9-4c46-61b2-b796a7c9dd3c. Session ID 30c7734b-2c72-4e4a-ac15-86e7cf08ed06 ."
time="2017-03-29T05:51:05Z" level=info msg="Shutting down backend 9461f60b-2497-4458-5387-cafdfe548cea. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:05Z" level=info msg="Removed backend. Key: 9461f60b-2497-4458-5387-cafdfe548cea. Session ID 4e5dc44e-5228-4739-810c-ab4dcecb6ea2 ."
time="2017-03-29T05:51:06Z" level=info msg="Shutting down backend fc90e4e2-2e23-4900-4ca5-b669bc6b4e9e. Connection closed because: websocket: close 1006 unexpected EOF."
time="2017-03-29T05:51:06Z" level=info msg="Removed backend. Key: fc90e4e2-2e23-4900-4ca5-b669bc6b4e9e. Session ID 80361a87-a872-4d5c-bb77-273d71ed3c6e ."
time="2017-03-29T05:51:07Z" level=info msg="Shutting down backend 618f1923-255c-498c-666c-90cdee1a8c51. Connection closed because: websocket: close 1006 unexpected EOF."
ecliptik commented Mar 29, 2017

Rebooting the EC2 instance that the master is on seems to fix the issue, but we did this the day before and overnight the agents started flapping again, so something is triggering it within 24 hours.

Master info:
EC2 instance type m4.xlarge
Memory: 16 GB (free showed 12 GB free during flapping)
CPU: 4
Disk: /dev/xvda1 126G 38G 83G 32% /

JVM process:

java -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -Xms128m -Xmx4g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/cattle/logs -Dlogback.bootstrap.level=WARN -cp /usr/share/cattle/9949e600a7bcdd427ae7bc9d33bf477c:/usr/share/cattle/9949e600a7bcdd427ae7bc9d33bf477c/etc/cattle io.cattle.platform.launcher.Main
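
For reference, a rough way to watch heap pressure on that JVM while the flapping is happening (a sketch only; it assumes the JDK's jstat and pgrep are available inside the rancher/server container and that the container is named rancher-server):

# Sample GC/heap utilization from the cattle JVM every 5 seconds, 6 samples.
docker exec rancher-server sh -c 'jstat -gcutil "$(pgrep -f io.cattle.platform.launcher.Main)" 5000 6'
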
cjellick commented Mar 29, 2017

Can you open your browser's developer tools, look at the Rancher UI, and let me know if the request to /v2-beta/subscribe (or /v1/subscribe) is succeeding or not? It's a websocket connection.

ecliptik commented Mar 29, 2017

@cjellick it looks like it's succeeding

screen shot 2017-03-29 at 10 36 32 am

In Safari I am still getting the occasional Red Box of Death mentioned in issue #7389 too.

cjellick commented Mar 29, 2017

Succeeding? It got a 404

cjellick commented Mar 29, 2017

That tells me that websockets in general seem to be not working (not just backend connections from agents).

Is this a setup that was working on a previous version of rancher and then was upgraded or is it a fresh setup? If upgrade, what version did you upgrade from?

My first reaction is to suspect a misconfiguration of the ELB. Was this particular ELB part of a previously working setup? Has any of its configuration changed? Any chance you misconfigured the proxy protocol bit?

ecliptik commented Mar 29, 2017

@cjellick here's a timeline

  • Rancher 1.0 was running behind an ELB (rancher-master) with proxy_protocol set up (this was last fall) and was stable
  • Upgraded to Rancher 1.2 and followed the recommended use of an ALB instead of an ELB
  • 1.2.2 was running for a few months behind the ALB with minor stability issues and agent disconnects (and #7389 occasionally)
  • Upgraded to 1.5.2 last week and went back to the original ELB (rancher-master) from 1.0

Here is our load balancer policy config from the aws cli, following https://docs.rancher.com/rancher/v1.5/en//installing-rancher/installing-server/basic-ssl-config/

Policies

aws elb describe-load-balancer-policies --load-balancer-name rancher-master --region=us-west-1
....
        {
            "PolicyAttributeDescriptions": [
                {
                    "AttributeName": "ProxyProtocol",
                    "AttributeValue": "true"
                }
            ],
            "PolicyName": "rancher-master-proxyprotocol-policy",
            "PolicyTypeName": "ProxyProtocolPolicyType"
        }
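
For anyone recreating this, a policy like the one above is typically created and attached along these lines (a sketch; the load balancer name and port are taken from this setup, and the exact flags should be checked against the classic ELB CLI docs):

# Create a ProxyProtocol policy and attach it to a backend instance port.
aws elb create-load-balancer-policy \
    --load-balancer-name rancher-master \
    --policy-name rancher-master-proxyprotocol-policy \
    --policy-type-name ProxyProtocolPolicyType \
    --policy-attributes AttributeName=ProxyProtocol,AttributeValue=true

aws elb set-load-balancer-policies-for-backend-server \
    --load-balancer-name rancher-master \
    --instance-port 8080 \
    --policy-names rancher-master-proxyprotocol-policy
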

Load Balancer

aws elb describe-load-balancers --load-balancer-name rancher-master --region=us-west-1
{
    "LoadBalancerDescriptions": [
        {
            "Subnets": [
                "subnet-388e5861",
                "subnet-95bb2af0"
            ],
            "CanonicalHostedZoneNameID": "REDACTED",
            "CanonicalHostedZoneName": "rancher-master-2099044364.us-west-1.elb.amazonaws.com",
            "ListenerDescriptions": [
                {
                    "Listener": {
                        "InstancePort": 8081,
                        "LoadBalancerPort": 80,
                        "Protocol": "TCP",
                        "InstanceProtocol": "TCP"
                    },
                    "PolicyNames": []
                },
                {
                    "Listener": {
                        "InstancePort": 8080,
                        "SSLCertificateId": "arn:aws:iam::REDACTED:server-certificate/GoDaddy_11_2018_demandbase.com",
                        "LoadBalancerPort": 443,
                        "Protocol": "SSL",
                        "InstanceProtocol": "TCP"
                    },
                    "PolicyNames": [
                        "AWSConsole-SSLNegotiationPolicy-rancher-master-1464896788615"
                    ]
                }
            ],
            "HealthCheck": {
                "HealthyThreshold": 5,
                "Interval": 5,
                "Target": "HTTP:8080/ping",
                "Timeout": 4,
                "UnhealthyThreshold": 5
            },
            "VPCId": "vpc-4dc16a28",
            "BackendServerDescriptions": [
                {
                    "InstancePort": 80,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                },
                {
                    "InstancePort": 443,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                },
                {
                    "InstancePort": 8080,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                },
                {
                    "InstancePort": 8081,
                    "PolicyNames": [
                        "rancher-master-proxyprotocol-policy"
                    ]
                }
            ],
            "Instances": [
                {
                    "InstanceId": "i-0bd99aa655bd7c004"
                }
            ],
            "DNSName": "rancher-master-2099044364.us-west-1.elb.amazonaws.com",
            "SecurityGroups": [
                "sg-79ec971c",
                "sg-95fd43f1",
                "sg-52f38c37",
                "sg-afb534c8"
            ],
            "Policies": {
                "LBCookieStickinessPolicies": [],
                "AppCookieStickinessPolicies": [],
                "OtherPolicies": [
                    "AWSConsole-SSLNegotiationPolicy-rancher-master-1464896788615",
                    "ELBSecurityPolicy-2015-05",
                    "AWSConsole-SSLNegotiationPolicy-rancher-master-1464892662822",
                    "rancher-master-proxyprotocol-policy"
                ]
            },
            "LoadBalancerName": "rancher-master",
            "CreatedTime": "2016-06-02T18:37:42.860Z",
            "AvailabilityZones": [
                "us-west-1a",
                "us-west-1b"
            ],
            "Scheme": "internet-facing",
            "SourceSecurityGroup": {
                "OwnerAlias": "REDACTED",
                "GroupName": "spotinst"
            }
        }
    ]
}

The one thing we do differently is that the ELB forwards port 80 -> 8081, where an nginx container is running. That container does a 301 redirect, so if you go to http://rancher.demandbase.com it redirects to https://rancher.demandbase.com and we're always going over SSL.

879b177c0a25        demandbase/docker-nginx-https-redirect:proxy-protocol   "nginx"                  58 minutes ago      Up 58 minutes       443/tcp, 0.0.0.0:8081->80/tcp      elastic_shannon

Dockerfile for that container

Nginx config with proxy_protocol support
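
A quick way to confirm the redirect behaves as described (a sketch; hostname taken from the logs earlier in this thread):

# The plain-HTTP listener should answer with a 301 pointing at the HTTPS URL.
curl -sI http://rancher.demandbase.com/ | grep -iE '^(HTTP|Location)'
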

cjellick commented Mar 29, 2017

@ecliptik could you run:

docker logs <rancher server container id> 2>&1 | grep -B1000 'Shutting down'

and post the results?

ecliptik commented Mar 29, 2017

@cjellick nothing in the logs so far. This is a newer container, though, since we replaced the older m4.xlarge instance the container was running on with a c4.2xlarge because we were spiking the CPU when things were flapping.

ecliptik commented Mar 29, 2017

@cjellick I also searched my console history, which had logs from the earlier Rancher master that was experiencing the flapping, and didn't see a "Shutting down" there either. I'm keeping a watch on it now to see if it pops up again.

cjellick commented Mar 29, 2017

@ecliptik I looked closer at your browser screenshot and I think you grabbed it from the wrong place. I meant just go into the normal Rancher UI, not the API view.

That said, since your setup is currently working, there's no point doing that now.

ecliptik commented Mar 29, 2017

@cjellick I did go to just the UI, and it seems that in the Network tab of the dev console the socket shows as pending

screen shot 2017-03-29 at 12 53 14 pm

but in the console it looks like it connected and opened,

screen shot 2017-03-29 at 12 53 22 pm

not sure if that means anything or not.

cjellick commented Mar 29, 2017

That's normal because it is a long-lived websocket connection that is streaming data.

ecliptik commented Mar 29, 2017

Makes sense. If things start flapping again I'll bring up the console and see what it says.

ecliptik commented Mar 29, 2017

Came back to the web console after lunch and saw some "Socket refusing" messages; not sure if that's normal behavior though.

screen shot 2017-03-29 at 2 38 28 pm

cjellick commented Mar 29, 2017

@ecliptik RE seeing this in the agent logs:

time="2017-03-29T05:51:13Z" level=info msg="Starting websocket pings"
time="2017-03-29T15:40:40Z" level=error msg="Received error reading from socket. Exiting." error="websocket: close 1006 (abnormal closure): unexpected EOF"
time="2017-03-29T15:40:40Z" level=error msg="websocket: close 1006 (abnormal closure): unexpected EOF"

Are you seeing that at a regular interval? Do the times line up across hosts?

cjellick commented Mar 29, 2017

What is your advertise-http-port set to?

jmreicha commented Mar 29, 2017

I just saw this happen again. This is starting to seriously affect us.

I pulled logs from one of the rancher-server hosts while this was happening: https://gist.github.com/jmreicha/4271b5343a0915e6a6ee94854b36d73a. Seeing tons of "Error dialing websocket backend." error="dial tcp 172.26.3.26:8080: getsockopt: connection refused" and other socket errors on the rancher-servers.
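
As a quick sanity check on that dial error, something like this run from the other rancher-server host can confirm whether the peer's 8080 is reachable at all (a sketch; IP and port are taken from the log line above):

# Prints whether the TCP connection to the peer's 8080 succeeds.
nc -zv 172.26.3.26 8080
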

The rancher-agent logs aren't showing much, just that they disconnect and reconnect. https://gist.github.com/jmreicha/7da0db598e7bcda170159ebddda93c17

The container reboot fixed things again eventually, but this time there was still memory available.

cjellick commented Mar 29, 2017

@jmreicha A couple follow up questions:

  1. RE: "This time there was still memory available" - Why did you say that? Have you observed this being tied to OOM issues? Apologies if you already said this in slack, but could you restate it here?
  2. How long since you last rebooted the container? And how long since you last experienced the problem?
cjellick commented Mar 29, 2017

@jmreicha in your setup, are there any additional hops between your ELB and the rancher/server container (like an nginx or haproxy container)?

jmreicha commented Mar 29, 2017

@cjellick For 1) Yes, we saw this issue crop up a few days ago and observed that the host had basically no memory available to the system (~100 MB). I have #8330 open for what I was seeing originally.

The last container reboot looks like it was about 3 weeks ago.

Here is the command I'm using to start up the rancher-server:

docker run -d --restart=unless-stopped \
        --log-driver=json-file \
        --log-opt max-size=100mb \
        --log-opt max-file=2 \
        -p 8080:8080 -p 9345:9345 \
        --name rancher-server rancher/server:${rancher_version} \
        --db-host ${db_host} \
        --db-name ${db_name} \
        --db-port ${db_port} \
        --db-user ${db_username} \
        --db-pass ${db_password} \
        --advertise-address $INTERNAL_IP
jmreicha commented Mar 29, 2017

As for extra hops, I tried to follow the HA setup guide in the docs as much as possible, so it is ELB -> rancher server, with RDS in the backend and proxy protocol enabled. If it makes a difference, things were running smoothly for about a month or two, until just a few days ago.

cjellick commented Mar 29, 2017

Did it go bad after an upgrade? If so, from what version to what version?

jmreicha commented Mar 29, 2017

I've been running on 1.5.1 for about 3 to 4 weeks now. Before that, it was 1.4.2.

cjellick commented Mar 30, 2017

@shakefu a few questions:

  1. How big is your rancher deployment? Roughly how many hosts and how many containers? Ballparked to the nearest 10 for hosts and the nearest 100 for containers is fine.
  2. Are you in AWS and behind an ELB?
  3. Do you have any other proxies (nginx, haproxy) between the rancher/server containers and the ELB?
jmreicha commented Mar 30, 2017

@cjellick (I work with shakefu)

  1. 11 hosts, 2 environments, ~330 containers total including infra stacks
  2. Yes we are behind ELB, proxy protocol enabled for 8080 and 443
  3. No other proxies involved in the Rancher server setup
cjellick commented Mar 30, 2017

@jmreicha are you at the same org as shakefu? Just wanted to clarify

shakefu commented Mar 30, 2017

Yep, we managed to reproduce it again by killing a spot instance we were using in our Staging environment via the AWS console.

The instance going into a "reconnecting" state didn't seem to cause issues, but it seemed like as soon as I manually deactivated that host (putting it in the "deactivating" state), all the other hosts, in both environments, started flapping through the reconnecting loop.

It also appeared that the rancher/server containers became nearly unresponsive (took 1min+ to load the UI) while this was going on.

We were able to recover by restarting both the rancher/server HA containers.

shakefu commented Mar 30, 2017

@cjellick Yeap, same org.

shakefu commented Mar 30, 2017

Looks like the deactivation was a spurious correlation (though it was impeccably timed). We deactivated the old host when our auto-scaling group fired up a new instance.

We've reproduced it (again... ooh, our poor infrastructure is taking a whoopin' today) by simply increasing the size of the ASG by one to add a new host, which then causes all other hosts in all environments to start flapping.

It looks like this flapping happens before any containers are actually started (all were in the Storage state), with, I think, the only exception being the Network metadata sidekick. Hard to tell for sure since I was observing through the UI, which was barely responsive. Could be a clue? Dunno.

These reproductions have all been done with the 50/150 DB connection setting and with double the RAM we had earlier. Our monitoring is showing plenty of available RAM and CPU, and we're well below the ulimits. Not really sure what else to poke at at this point.
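
For anyone wanting to double-check the ulimit side, a rough way to compare open file descriptors against the per-process limit for the cattle JVM (a sketch; the process match is based on the java command line quoted earlier in this thread):

# Compare current fd usage with the open-files limit for the cattle JVM.
PID=$(pgrep -f io.cattle.platform.launcher.Main | head -n1)
grep 'open files' /proc/$PID/limits
ls /proc/$PID/fd | wc -l
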

cjellick commented Mar 31, 2017

@shakefu Are you using RDS? If so, is it under pressure? Does CPU or IOPS seem to be spiking there? I'm told memory in RDS instances always appears fully consumed, so that one is less useful.

ecliptik commented Mar 31, 2017

@cjellick here's the 1-hour graph of our RDS instance from when the hosts were flapping; you can see the change when I rebooted the master instance and things went back to stable.

screen shot 2017-03-30 at 5 23 26 pm

jmreicha commented Mar 31, 2017

@cjellick During the flapping periods we see CPU spikes up to around 40%, write IOPS around 400-500, network throughput at around 20-25 Mbit/s.

Here's what our graphs look like as well:

image

cjellick commented Mar 31, 2017

@jmreicha or @ecliptik If you see the problem again, can you grab the syslogs from the host OS? I want to see if there is anything in there related to running out of memory.

ecliptik commented Mar 31, 2017

@cjellick I don't have the syslogs, but we do have a New Relic agent installed on the master server, and it has system stats from today when we were seeing the flapping, from just before 12:00 PM to when we rebooted around 1:30 PM.

screen shot 2017-03-30 at 9 06 58 pm

screen shot 2017-03-30 at 9 08 32 pm

jmreicha commented Mar 31, 2017

@cjellick I have systemd logs; there's not much in there unfortunately, but I can get them to you tomorrow.

Memory usage appears to be low for us as well, never reaching above 15-20% used on the host and JVM memory never maxing out according to Sysdig. The main things that I've noticed are the disk throughput on RDS, CPU spikes on the Rancher servers and connections flipping between hosts when the master server reboots and all its connections are picked up by the other server.

jmreicha commented Mar 31, 2017

@cjellick The Rancher server started doing its thing last night right at 3 am. Restarting the container was the fix. We're starting to run out of ideas.

jmreicha commented Mar 31, 2017

The only other thing I can think of that changed is that we upgraded the scheduler a little over a week ago. Could that have anything to do with the Rancher servers acting up?

cjellick commented Mar 31, 2017

@jmreicha No, problems with the scheduler would not manifest as agents disconnecting.

ecliptik commented Mar 31, 2017

Worked with @cjellick in Slack, and we can reproduce the issue on demand:

  1. Pause a host in any environment
  2. Unpause the host
  3. Hosts in all environments will start flapping

The flapping stops if, on the master, you issue sudo killall -2 websocket-proxy, or by rebooting the host.

Gave additional logs to @cjellick to look over, with goroutine dumps from doing a kill -2 <pid of websocket-proxy>. Only a killall -2 websocket-proxy seems to stop the flapping, not a kill -2 on a single pid.
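
For anyone hitting the same thing, the workaround amounts to the following on the Rancher master host (a sketch of what is described above; it only buys time until a restart or a fixed release):

# List the websocket-proxy processes, then send SIGINT to all of them;
# per the comments above, signalling a single pid was not enough.
pgrep -a websocket-proxy
sudo killall -2 websocket-proxy
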

cjellick commented Mar 31, 2017

To clarify: pause/unpause are the Rancher UI deactivate/activate functions.

ecliptik commented Mar 31, 2017

@cjellick as discussed in Slack, I created a feature request for a heartbeat network as a suggestion to help mitigate this type of flapping scenario: #8380

mcanevet commented Apr 5, 2017

I also have this issue. It looks like changing the database charset to utf8 and the collation to utf8_general_ci, as recommended, fixed it so far.
I used this command to do it:

DB="cattle"; ( echo 'SET foreign_key_checks = 0;'; echo 'ALTER DATABASE `'"$DB"'` CHARACTER SET utf8 COLLATE utf8_general_ci;'; mysql "$DB" -e "SHOW TABLES" --batch --skip-column-names | xargs -I{} echo 'ALTER TABLE `'{}'` CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;' ; echo 'SET foreign_key_checks = 1;' ) | mysql "$DB"
sangeethah commented Apr 5, 2017

Tested with rancher server version v1.5.4-rc1 with 3 hosts.
The setup had ~45 global services running.
Deactivate a host.
Activate the host.
After this, we do not see any hosts in the setup going to the "reconnecting" state.

When a similar scenario was tested on v1.5.3, all the hosts in the environment were flapping (switching between "reconnecting" and "active").

@sangeethah sangeethah closed this Apr 5, 2017

cjellick commented Apr 6, 2017

@mcanevet the UTF settings on the DB really shouldn't have anything to do with this particular bug. If changing those settings caused you to have to either restart rancher or sever connections from rancher to the DB, it may have appeared to fix it, but I don't think that actually fixed it.

Note that if you are experiencing this problem, you can disable the scheduler service as a workaround until 1.5.4 comes out. However, disabling the scheduler service will cause the following scheduling functions to not work properly:

  1. Proper even spreading of containers across hosts (spread will be random instead)
  2. Scheduling based on memory and CPU fields on containers
  3. The "Require Container Label" scheduling feature that you can see on the Edit Host screen