Second replication begins if first replication is not finished #8

woodsb02 · 2017-10-01T10:20:49Z

During the first replication of many gigabytes of data, I initially had the pull job's interval set to 10m, so the first replication had not finished by the time the second one was due to start. When I checked the status many hours later, I could see numerous ssh sessions running, which led me to believe that multiple replication jobs were running at once (which I don't think should ever happen). I expected that if another replication job was due to start before the previous one had finished, the new job would just be cancelled entirely.

I did not look into the state of my replicated data, or if the replications were proceeding ok. It was purely the fact that multiple zrepl ssh sessions were running that led me to believe this was the behaviour.

problame · 2017-10-02T17:18:20Z

This is a bug. Will fix.

The documentation describes intended behavior. Apparently, there are some bugs regarding *patient* tasks. refs #8 refs #13

problame · 2017-10-05T16:46:31Z

So I guess this was a pull job? I cannot reproduce the issue.

cmd/config_job_pull.go:94 JobStart() is strictly sequential and will not reconnect unless the previous pull finished.
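To illustrate what "strictly sequential" means here, the following is a minimal sketch of an interval loop that cannot overlap pulls. It is not the code from cmd/config_job_pull.go; `runSequential`, `demo`, and the simulated pull are hypothetical names made up for this example, and the sleep only stands in for a replication that outlasts the interval.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runSequential calls pull once per interval, but never starts a new pull
// while the previous one is still running: pull is invoked synchronously in
// this goroutine. A time.Ticker buffers at most one tick, so ticks that fire
// during a slow pull do not pile up into a backlog of queued runs.
func runSequential(ctx context.Context, interval time.Duration, pull func(context.Context)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for ctx.Err() == nil {
		pull(ctx) // blocks until this (simulated) replication has finished
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// demo runs the loop with a pull that takes longer than the interval and
// reports how many pulls ran and the highest number in flight at once.
func demo() (runs, peak int) {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	inFlight := 0
	pull := func(context.Context) {
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		runs++
		time.Sleep(20 * time.Millisecond) // hypothetical slow replication
		inFlight--
	}
	runSequential(ctx, 5*time.Millisecond, pull)
	return runs, peak
}

func main() {
	runs, peak := demo()
	fmt.Printf("pulls run: %d, max in flight: %d\n", runs, peak)
}
```

With a loop shaped like this, a too-short interval only means the next pull starts as soon as the previous one returns; it cannot produce concurrent replications on its own.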

What's still an open issue: where did all the dangling ssh sessions come from?

woodsb02 · 2017-10-05T23:14:52Z

I will try to replicate it this weekend, and will report back on my findings. I didn’t spend the time to investigate and record my findings last time... I just remember seeing numerous lines from “sudo pgrep -lf zrepl”

problame · 2017-10-16T20:00:00Z

Were you able to replicate the described behavior?

problame · 2017-11-04T10:23:06Z

OK, I was able to observe the issue on a testing system. I saw lots of defunct processes, most likely ssh processes that timed out but that zrepl never called waitpid() on.
Sadly, the logs are gone because the testing system was also used to test the TCP logger, which doesn't handle timeouts on the connection well, see #26
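For readers unfamiliar with defunct (zombie) processes: a child that has exited stays in the process table until its parent collects the exit status. In Go, `(*exec.Cmd).Wait` from `os/exec` is what does that after `Start`. The sketch below shows the general pattern of killing a child on timeout and still waiting for it. It is an illustration only, not the change that later landed in zrepl; `runWithTimeout` and `sleepTimesOut` are hypothetical names, and the POSIX `sleep` binary stands in for ssh.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout starts the command and kills it if it outlives timeout.
// After a successful Start, Wait is always called, including on the timeout
// path, so the exited child is reaped and does not linger as a defunct
// process.
func runWithTimeout(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// CommandContext kills the process when ctx is done.
	cmd := exec.CommandContext(ctx, name, args...)
	if err := cmd.Start(); err != nil {
		return err
	}

	err := cmd.Wait() // collects the exit status of the child
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		return fmt.Errorf("%s timed out after %s: %w", name, timeout, ctx.Err())
	}
	return err
}

// sleepTimesOut reports whether a long-running child (assumed: a POSIX
// `sleep` on PATH) is cut off by a much shorter timeout.
func sleepTimesOut() bool {
	err := runWithTimeout(50*time.Millisecond, "sleep", "5")
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("child timed out and was reaped:", sleepTimesOut())
}
```

The failure mode described above corresponds to calling `Start` and then returning early on a timeout without ever reaching `Wait`.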

problame · 2018-02-16T20:13:53Z

So I think I fixed the issue in 6b5bd0a --- it just landed in zrepl master.
Are you in a situation where you can just build zrepl master and check if the issue is resolved?

woodsb02 · 2018-03-04T02:52:13Z

Hi Christian,
Yes, you are correct - this issue was the same as the one reported in #56.
I have just finished testing with the latest (unreleased) version of zrepl, and can confirm this is now fixed.
Thanks for your work on fixing this!
Cheers,
Ben

problame added the bug label Oct 2, 2017

problame added this to the 0.0.2 milestone Oct 2, 2017

problame added a commit that referenced this issue Oct 3, 2017

docs: document job types

a4963ce

The documentation describes intended behavior. Apparently, there are some bugs regarding *patient* tasks. refs #8 refs #13

problame modified the milestones: 0.0.2, 0.0.3 Nov 11, 2017

problame mentioned this issue Jan 15, 2018

Reuse open ssh-channels or close used ones instead #56

Closed

problame mentioned this issue Feb 16, 2018

prune: policy: grid: exception for keep=all #6

Closed

problame closed this as completed Mar 4, 2018
