
ksync keeps losing connection #247

Closed
scones opened this issue Nov 22, 2018 · 31 comments

@scones commented Nov 22, 2018

Log entry from syncthing:

[PETWT] 12:29:12 INFO: Connection to SW2ODE5-ZSCWQH4-NQWRLHS-4465ZAM-N6FCE55-JU2QTUL-QPG6VL4-A6QZRQ5 at 127.0.0.1:22000-127.0.0.1:49714/tcp-server closed: reading length: EOF
[PETWT] 12:29:12 VERBOSE: Disconnected from device SW2ODE5-ZSCWQH4-NQWRLHS-4465ZAM-N6FCE55-JU2QTUL-QPG6VL4-A6QZRQ5

Log entry from the watch command:

ERRO[3458] lost connection to cluster                    ContainerName= LocalPath=/home/scones/vm/pontos LocalReadOnly=true Name=just-oryx Namespace=application-development Pod= Reload=true RemotePath=/var/www/html RemoteReadOnly=false Selector="[application_name=pontos-app]"
INFO[3458] signal loss dectected. shutting down

Relevant stuff:

ksync:
        Version:    Release
        Go Version: go1.11
        Git Commit: c6d2e34
        Git Tag:    0.3.2-hotfix
        Built:      Mon Oct  1 19:19:36 +0000 2018
        OS/Arch:    linux/amd64
service:
        Version:    Release
        Go Version: go1.11
        Git Commit: c6d2e34
        Git Tag:    0.3.2-hotfix
        Built:      Mon Oct  1 19:15:00 +0000 2018

Additional relevant stuff:
kubernetes: 1.12.2
minikube: v0.30.0 (driver: none)
just 1 pod

@kbernst30 commented Dec 3, 2018

This keeps happening to me as well. Not a blocker on my end but a definite annoyance:

ksync:
	Version:    Release
	Go Version: go1.11
	Git Commit: c6d2e34
	Git Tag:    0.3.2-hotfix
	Built:      Mon Oct  1 19:19:36 +0000 2018
	OS/Arch:    darwin/amd64
service:
	Version:    Release
	Go Version: go1.11.2
	Git Commit: c6d2e34
	Git Tag:    0.3.2-hotfix
	Built:      Tue Nov 27 22:02:35 +0000 2018

@timfallmk (Collaborator) commented Dec 11, 2018

Is this still an issue? If not, I'll close it and you can reopen it when it happens again.

@scones (Author) commented Dec 13, 2018

I will check within the next few days and get back to you.

@timfallmk (Collaborator) commented Dec 17, 2018

Feel free to reopen if needed.

timfallmk closed this Dec 17, 2018

@kbernst30 commented Dec 22, 2018

I just encountered this again, the exact same error as above. It happened 3 times today. There is definitely something amiss.

@timfallmk (Collaborator) commented Jan 3, 2019

@kbernst30 Are you seeing this intermittently or all the time? Is there a specific condition under which it usually occurs?

@kbernst30 commented Jan 7, 2019

There is no specific condition that I can tell. It happens constantly, and I have to restart several times a day. I am using the Docker for Mac Kubernetes integration, if it helps at all.

@timfallmk (Collaborator) commented Jan 9, 2019

Interesting. I can't replicate it.

Oh, if you're using a laptop and it goes to sleep, that could cause issues with the cluster. That's a known problem with how Docker Kubernetes works. Similar symptoms here: #143 (comment)

@scones (Author) commented Jan 9, 2019

It happens without sleep as well.

The counter in the logs (whatever it counts) goes up to somewhere between 1000 and 3000, and ksync stops working at a random count in that range.

As one has to restart it several times a day during a usual workday, it's obvious that system sleep is not the cause.

@timfallmk (Collaborator) commented Jan 10, 2019

What counter are you referring to? Maybe a log dump (with some sort of timeline for actions) would be helpful here.

@kbernst30 commented Jan 10, 2019

Perhaps he means the number in the INFO log brackets? I see something like this:

INFO[2856] updating                                      pod=test-pod-7db9485d8c-2cn68 spec=test
INFO[2856] update complete                               pod=test-pod-7db9485d8c-2cn68 spec=test
ERRO[2932] lost connection to cluster                    ContainerName= LocalPath={...} LocalReadOnly=false Name=test Namespace=default Pod= Reload=false RemotePath=/home/web/src RemoteReadOnly=false Selector="[app.kubernetes.io/name=test]"
INFO[2932] signal loss dectected. shutting down

@timfallmk (Collaborator) commented Jan 10, 2019

@kbernst30 Ah, good point. Those are actually timestamps coming from the log formatter. We use https://github.com/sirupsen/logrus to do logging, and this is a default option.
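
For reference, here's a minimal sketch that reproduces that prefix (assuming the default TextFormatter with colored output; if I remember right, the bracketed number is the seconds elapsed since the process started, not an event counter):

package main

import (
	"time"

	log "github.com/sirupsen/logrus"
)

func main() {
	// ForceColors makes logrus use the compact INFO[NNNN] prefix even when
	// stdout is not a terminal; NNNN is seconds since the logger started.
	log.SetFormatter(&log.TextFormatter{ForceColors: true})

	log.WithFields(log.Fields{"pod": "test-pod", "spec": "test"}).Info("updating")
	time.Sleep(2 * time.Second)
	log.WithFields(log.Fields{"pod": "test-pod", "spec": "test"}).Info("update complete")
	// Prints something like:
	//   INFO[0000] updating        pod=test-pod spec=test
	//   INFO[0002] update complete pod=test-pod spec=test
}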

@scones (Author) commented Jan 13, 2019

Yes, the number in the square brackets. I assumed those were counters.

@kbernst30 commented Jan 14, 2019

Could we potentially re-open this issue? Clearly something isn't right.

@timfallmk (Collaborator) commented Jan 14, 2019

I suspect this is a problem with the k8s distribution, but I'm happy to investigate more. In all likelihood it's a cluster interrupt problem.

Edit: Since I can't reproduce this, I'll need more information if anyone wants me to investigate.

timfallmk reopened this Jan 14, 2019

@kbernst30 commented Jan 14, 2019

Not sure what other info you need, but perhaps this will be helpful:

Docker for Mac v2.0.0.0-mac81
Docker Engine v18.09.0
Kubernetes v1.10.3
Ksync Release 0.3.2-hotfix

ksync:
	Version:    Release
	Go Version: go1.11
	Git Commit: c6d2e34
	Git Tag:    0.3.2-hotfix
	Built:      Mon Oct  1 19:19:36 +0000 2018
	OS/Arch:    darwin/amd64
service:
	Version:    Release
	Go Version: go1.11.2
	Git Commit: c6d2e34
	Git Tag:    0.3.2-hotfix
	Built:      Tue Nov 27 22:02:35 +0000 2018

@timfallmk (Collaborator) commented Jan 14, 2019

@kbernst30 Thanks. Mostly what I'm looking for is under what operating conditions this happens. Does it happen when you leave the cluster for a while, when you're changing a lot of files, when you're quickly rebooting containers, etc.?

As I said before, I think we should concentrate on the cluster itself to start, as it seems to be a problem with pod restarts, not ksync.

@kbernst30 commented Jan 14, 2019

Well, in my case I'm actually using an interpreted code base, so I've disabled reloading of the containers. I have also set the remote folder to read-only, I believe (forgive me, I have to find the settings I used to create the initial spec).

It seems to happen after a while of idle time. It does happen more often if the folder syncs occur more frequently.

@timfallmk (Collaborator) commented Jan 14, 2019

The idle time thing makes more sense to me. It's likely that Docker for Mac (which I believe just changed to "Docker for Desktop") does some resource management during downtime to enable running a cluster locally. Kubernetes is not designed to handle interruptions or restarts of the entire cluster well, so I suspect they may be trying to smooth the experience out.

Two things I would try:

  1. Deliberately idle or heavily use your sync to see if you can trigger the condition
  2. Get the logs from k8s services (and maybe docker) from around that time if you trigger the problem

Also, it might not be relevant, but I'm running the "edge" distribution of "Docker for Desktop" and it seems to have just undergone a major update (to Version 2.0.1.0 (30090)) which included a full wipe and recreation of the cluster. This could indicate some API changes happening, so perhaps trying this new version after the wipe would help.

@kbernst30 commented Jan 15, 2019

I will see what I can find today from k8s logs if/when a connection loss occurs. There is also a new Docker for Desktop update, so I will install that as well to see if it helps at all.

@kbernst30 commented Jan 15, 2019

I started the watcher this morning at 10:04am and did nothing else except keep it idle. The watcher then lost connection at 10:35am.

There was nothing specific in any log that I can see, unfortunately. The following appears constantly:

default	10:31:04.898691 -0500	com.docker.driver.amd64-linux	Cannot identify volume corresponding to path /etc/kubernetes/scheduler.conf: lstat /private/etc/kubernetes: no such file or directory
default	10:31:04.913581 -0500	com.docker.driver.amd64-linux	Cannot identify volume corresponding to path /etc/kubernetes/controller-manager.conf: lstat /private/etc/kubernetes: no such file or directory
default	10:31:04.913642 -0500	com.docker.driver.amd64-linux	Cannot identify volume corresponding to path /usr/libexec/kubernetes/kubelet-plugins/volume/exec: lstat /usr/libexec/kubernetes: no such file or directory
default	10:31:35.062963 -0500	com.docker.driver.amd64-linux	Cannot identify volume corresponding to path /etc/kubernetes/controller-manager.conf: lstat /private/etc/kubernetes: no such file or directory
default	10:31:35.063032 -0500	com.docker.driver.amd64-linux	Cannot identify volume corresponding to path /usr/libexec/kubernetes/kubelet-plugins/volume/exec: lstat /usr/libexec/kubernetes: no such file or directory
default	10:31:35.082040 -0500	com.docker.driver.amd64-linux	Cannot identify volume corresponding to path /etc/kubernetes/scheduler.conf: lstat /private/etc/kubernetes: no such file or directory

But it appears not just when the connection to the cluster is lost. I will continue to dig into other logs for something strange.

@scones (Author) commented Jan 15, 2019

As described:

I am running minikube with local read-only.
I did not shut down/idle out/suspend/hibernate the computer while this happened. In fact, I was writing code and verifying the changes.

@timfallmk (Collaborator) commented Jan 15, 2019

@kbernst30 That's very strange. None of those paths is something ksync uses. Are you using a CNI or network plugin on top of Docker k8s, or just a plain setup from scratch?

@scones Same question to you.

My suspicion is that something else is going wrong with the cluster, eventually causing a shutdown (or restart) of the DaemonSet that contains the ksync cluster-side pods. That's the only thing I can think of.

@kbernst30 commented Jan 15, 2019

@timfallmk Plain setup for me. Never needed anything else.

@grampelberg (Collaborator) commented Jan 15, 2019

Sounds like the reconnect code needs a little bit of love. Is there anything happening on the network stack? Like, wifi blipping, plugging/unplugging wires?

When this happens, what is the state of the DaemonSet? Anything in the logs there?

@scones (Author) commented Jan 16, 2019

Plain setup with Helm and some services (ES and MySQL) in addition to the core application, pontos.

Since it just works again when I re-run the watch: could you build a retry loop with exponential backoff into the watch command? That would mitigate the problem somewhat.
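
Roughly what I have in mind (a hypothetical sketch; runWatch stands in for whatever the watch command does internally, it is not ksync's real API):

package main

import (
	"context"
	"errors"
	"time"

	log "github.com/sirupsen/logrus"
)

// watchWithRetry re-runs the watch loop whenever it returns an error, waiting
// a little longer after each consecutive failure so a flapping connection
// doesn't spam the cluster.
func watchWithRetry(ctx context.Context, runWatch func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = 2 * time.Minute

	for {
		err := runWatch(ctx)
		if err == nil || ctx.Err() != nil {
			return err
		}
		log.WithError(err).Warnf("lost connection to cluster, retrying in %s", backoff)

		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}

		// Double the delay up to a cap.
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

func main() {
	// Stub that always fails, just to exercise the retry loop.
	fakeWatch := func(ctx context.Context) error { return errors.New("lost connection to cluster") }

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	_ = watchWithRetry(ctx, fakeWatch)
}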

@kbernst30 commented Jan 16, 2019

I also work with Helm, just in case that information is relevant.

@timfallmk (Collaborator) commented Jan 16, 2019

There is reconnection code already in ksync, but that only works if the network is stable. As @grampelberg said, this is probably due to a network status change.

It should be easy to test: once you have everything working, change wifi networks (or cables) and see if it happens.

@scones (Author) commented Jan 17, 2019

  • There was no log entry about reconnect attempts.
  • I am speaking of a reconnect at a higher scope (i.e. around the watch command, not around the connection layer; I am again assuming things about the structure here).
  • Also, I suggested an incremental backoff to avoid spamming the server.

These things could be optional behind command line flags.

@timfallmk (Collaborator) commented Jan 18, 2019

I should probably clarify, as it's not quite as simple as a reconnection.

There are a few connection pieces here. Broadly they fall into three parts:

  1. Local client (probably the CLI, but can be anything consuming the API). Communicates locally with the watch process (see next) via a local (gRPC) port. Usually ephemeral.
  2. watch process. This listens on a local gRPC port for commands from a client (see above) and spawns a child process running syncthing. It also establishes tunnels for:
  • gRPC connection to remote (radar) container
  • Connection to Syncthing API
  • Connection to Syncthing transfer port
  3. radar remote process. Other end of the connection for the watch process.

Since there are several moving parts, "reconnection" is not as simple as it sounds. For 1, the client connection only happens when running a command, so reconnection isn't needed. For 2, the process is connecting in two different "directions", "in" (the local connection), and "out" (the remote connections). This means that disconnections can happen in several different ways. I won't go into all the possibilities, but suffice it to say the matrix is large enough.

The way we handle most of these is to drop connection error messages, with the assumption that they are transient. Since the tunnels to the cluster remain (and are built on demand), there's no need to do any fancy reconnection here. You can see this here.

What is likely happening in this case is that the change in local network status causes the network adapter to reset all currently established tunnels. Since the local tunnel is built when starting watch, restarting the process solves the issue. If the problem were with the cluster connection, you wouldn't see the error and it would just continue trying to connect.

Further evidence of this is that the above logs show your network overlay freaking out, likely for the same reason. As I said before, it should be easy to test. I'll give it a shot if you haven't already.
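
To make the tunnel part a bit more concrete, here's a rough, hypothetical sketch of a single local tunnel built with client-go's port forwarder (the function, port, and pod names are made up for the example; this is not ksync's actual code):

package main

import (
	"fmt"
	"net/http"
	"os"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/portforward"
	"k8s.io/client-go/transport/spdy"
)

// forwardToPod opens a local tunnel to one cluster-side pod. ForwardPorts
// blocks until the tunnel breaks (network change, pod restart) or stopCh is
// closed, at which point the tunnel has to be rebuilt from scratch.
func forwardToPod(namespace, pod string, stopCh <-chan struct{}) error {
	config, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	// URL of the pod's portforward subresource.
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").Namespace(namespace).Name(pod).SubResource("portforward")

	transport, upgrader, err := spdy.RoundTripperFor(config)
	if err != nil {
		return err
	}
	dialer := spdy.NewDialer(upgrader, &http.Client{Transport: transport}, "POST", req.URL())

	// 40321 is an arbitrary local port standing in for the gRPC tunnel.
	readyCh := make(chan struct{})
	fw, err := portforward.New(dialer, []string{"40321:40321"}, stopCh, readyCh, os.Stdout, os.Stderr)
	if err != nil {
		return err
	}
	return fw.ForwardPorts()
}

func main() {
	stopCh := make(chan struct{})
	// "ksync-radar-xxxxx" is a placeholder pod name for the example.
	if err := forwardToPod("kube-system", "ksync-radar-xxxxx", stopCh); err != nil {
		fmt.Fprintln(os.Stderr, "tunnel closed:", err)
	}
}

Once ForwardPorts returns, nothing re-dials automatically in this sketch; restarting the process rebuilds the tunnel, which matches what restarting watch does for you today.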

@timfallmk (Collaborator) commented Jan 19, 2019

A quick bench test shows this is indeed the cause of the problem. I'm going to close out this issue.

timfallmk closed this Jan 19, 2019

timfallmk added a commit that referenced this issue Jan 22, 2019

timfallmk added a commit that referenced this issue Jan 24, 2019
