Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Folder restart can leave several folder instances running #5233

Closed
calmh opened this Issue Oct 2, 2018 · 3 comments

Comments

Projects
None yet
2 participants
@calmh
Copy link
Member

calmh commented Oct 2, 2018

Consider this short, scary snippet from a v0.14.50 instance, after sharing/unsharing a folder a couple of times;

2018-10-02 14:55:57 progress emitter: registering fldr 192.168.11.215\2018-05-26\f.csv
2018-10-02 14:55:57 sendreceive/fldr@0x13fc7a20 parent not missing 192.168.11.106\2018-05-26\14.13.31.zip
2018-10-02 14:55:57 sendreceive/fldr@0x13fc7a20 taking shortcut on 192.168.11.106\2018-05-26\14.13.31.zip
2018-10-02 14:55:57 sendreceive/fldr@0x136749a0 filesystem notification rescan
2018-10-02 14:55:57 sharedPullerState fldr 192.168.11.215\2018-05-26\f.csv copyNeeded -> 0
2018-10-02 14:55:57 sendreceive/fldr@0x1425c160 parent not missing 192.168.11.219\2018-05-25\14.38.48.csv
2018-10-02 14:55:57 sendreceive/fldr@0x13675340 filesystem notification rescan
2018-10-02 14:55:57 sendreceive/fldr@0x1425c160 closing 192.168.11.107\2018-05-26\14.38.47.zip

The three things sendreceive/fldr@0x13fc7a2, sendreceive/fldr@0x136749a0 and sendreceive/fldr@0x1425c160 should definitely not be seen at the same time here.

@calmh calmh added the bug label Oct 2, 2018

@imsodin

This comment has been minimized.

Copy link
Member

imsodin commented Oct 2, 2018

This might be due to suture supervisor level timeouts. The calls to RemoveAndWait should lock indefinitely:

A timeout value of 0 means to wait forever.

but in the next paragraph

If either the supervisor terminates or the timeout passes, ErrTimeout is returned.

I'll try to find some time to investigate this.

@calmh

This comment has been minimized.

Copy link
Member Author

calmh commented Oct 2, 2018

Yeah me too. There's some interesting API stuff happening in the background to cause this, which I need to get to the bottom of. I'm not sure I trust the restarting thing... Also possibly interesting on the same device:

[NNDTE] 13:29:21 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:21 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:22 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:22 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:22 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:22 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:23 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:26 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:26 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:27 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:27 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:27 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:27 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:27 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:27 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:27 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:27 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:27 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:27 INFO: Restarted folder "fldr" (fldr) (sendreceive)
[NNDTE] 13:29:34 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:37 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:37 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:37 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:37 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:40 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)
[NNDTE] 13:29:40 INFO: Completed initial scan of sendreceive folder "fldr" (fldr)

The runner should log starting / stopping to clearer see what happens. :/ Or, it does with tracing on, but tracing wasn't on. Will see if I can reproduce this with tracing.

@calmh

This comment has been minimized.

Copy link
Member Author

calmh commented Oct 2, 2018

Yeah I can see how this happens when there are multiple sequential restarts of the same folder.

A: (locks)
A: grabs tokens
A: asks folder to stop
A: (unlocks)
A: wait for services to stop

B: (locks)
B: grabs tokens
B: asks folder to stop
B: (unlocks)
B: wait for services to stop

---folder actually stops here---

B: (locks)
B: removes tokens and stuff
B: adds and starts the folders, keeping the new tokens
B: (unlocks)

A: (locks)
A: removes tokens and stuff
A: adds and starts the folders, keeping the new tokens
A: (unlocks)

The folder is now running twice, with the tokens for the one started by B lost in cyberspace.

@calmh calmh self-assigned this Oct 2, 2018

calmh added a commit to calmh/syncthing that referenced this issue Oct 3, 2018

lib/model, cmd/syncthing: Wait for folder restarts to complete (fixes s…
…yncthing#5233)

This is the somewhat ugly - but on the other hand clear - fix for what
is really a somewhat thorny issue. To avoid zombie folder runners a new
mutex is introduced that protects the RestartFolder operation. I hate
adding more mutexes but the alternatives I can think of are worse.

The other part of it is that the POST /rest/system/config operation now
waits for the config commit to complete. The point of this is that until
the commit has completed we should not accept another config commit. If
we did, we could end up with two separate RestartFolders queued in the
background. While they are both correct, and will run without
interfering with each other, we can't guarantee the order in which they
will run. Thus it could happen that the newer config got committed
first, and the older config commited after that, leaving us with the
wrong config running.

@calmh calmh added this to the v0.14.52 milestone Oct 5, 2018

calmh added a commit to imsodin/syncthing that referenced this issue Oct 10, 2018

Merge branch 'master' into receiveOnlyResolve
* master:
  lib/db: Rename things (ref syncthing#5198)
  lib/model: Treat failed rename like del&update (syncthing#5203)
  lib/db: Refactor: use a Lowlevel type underneath Instance (ref syncthing#5198) (syncthing#5212)
  gui: Close unclosed tag (fixes syncthing#5253)
  gui, man, authors: Update docs, translations, and contributors
  lib/model: Unflake TestFolderRestartZombies (fixes syncthing#5244)
  lib/model, cmd/syncthing: Wait for folder restarts to complete (fixes syncthing#5233) (syncthing#5239)
  lib/connections: Don't info log about LAN if there are no rate limits (syncthing#5242)
  lib/config: Add context to the home disk out of space error (syncthing#5241)
  gui: Add missing 'Close' button on remote devices' files (syncthing#5232)
  gui: Migrate to fork-awesome (fixes syncthing#5236) (syncthing#5237)
  gui, man, authors: Update docs, translations, and contributors
  lib/fs: Add fakefs (syncthing#5235)
  cmd/syncthing: Fix support bundle zip name pattern
  cmd/syncthing, gui: Implement download of "support bundle" (fixes syncthing#5142) (syncthing#5145)
  lib/model: Prevent repeat db update (syncthing#5231)
  lib/db: Improve VersionList.String (syncthing#5229)
  lib/fs: Try EvalSymlinks without '\\?\' prefix on failure (fixes syncthing#5226) (syncthing#5227)
  gui, man, authors: Update docs, translations, and contributors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.