
lib: Faster termination on exit (ref #6319) #6329

Merged: 3 commits merged into syncthing:master from imsodin:lib/fasterExit on Feb 13, 2020
Conversation

imsodin (Member) commented Feb 12, 2020:

There's a steady trickle of "panic: database closed" in sentry...
#6319 (comment)

https://sentry.syncthing.net/syncthing/syncthing/?environment=Candidate&query=panic%3A+database+is+closed

I had a look at the goroutines running at the time of the panic. They are all due to slow shutdown.
In some, scanning was blocking, where there's a logical place to check the context.
One (maybe two, I don't remember) occurred during pulling. Here I don't think there's much we can do in the way of checking contexts: it looks like pullers were simply fired up until the limit was reached, and those are then spending time waiting for model locks or doing I/O, i.e. a context check would come too late.
Then there is one that hangs in the connect loop. There are a few places where we loop over devices or addresses, doing something with them without checking the context in between.
Then there's one where everything is still running and even the contexts aren't cancelled. That just looks like a proper freeze which I can make no sense of at all. It's a single occurrence, so I am fine with that.

select {
case <-ctx.Done():
	return
default:
}

AudriusButkevicius (Member) commented Feb 12, 2020:
Lookup should probably take a context instead.

select {
case <-ctx.Done():
	return
default:
}
// Use a special key that is more than just the address, as you might have two devices connected to the same relay

AudriusButkevicius (Member) commented Feb 12, 2020:

I don't think this loop does anything lengthy; the dial is the thing that takes time.

imsodin (Author, Member) commented Feb 12, 2020:

I simply didn't check what it does exactly; I just added it because the loop internals are lengthy (as in lines of code) and I wanted to be safe.

select {
case <-f.ctx.Done():
	return f.ctx.Err()
default:
}

AudriusButkevicius (Member) commented Feb 12, 2020:

There's already a check like this in snap.WithPrefix...; does this one actually help?

imsodin (Author, Member) commented Feb 12, 2020:

Trying to spell out the reason why I added this, all arguments dissolved, and the only remaining reason is that a panic on sentry happened at this point, which is not an argument at all. Rechecking the goroutines shows the problem is with lib/connections -> removing this.

select {
case <-f.ctx.Done():
	state.fail(errors.Wrap(f.ctx.Err(), "folder stopped"))
	out <- state.sharedPullerState
default:
}

AudriusButkevicius (Member) commented Feb 12, 2020:

Again, the only lengthy procedure until the next check seems to be weak hashing, which already takes a context. Do you have evidence this helps?

calmh (Member) commented Feb 12, 2020:

I don't think this is the cause of the "database failed to stop" stuff, though. I can reproduce that by just starting Syncthing with the default config and Ctrl-C'ing it. Looking at the goroutines (adding a panic after the failed-to-stop message) I don't see anything that should have an open database transaction. I suspect we might just be leaking something in the close waitgroup.

@imsodin imsodin changed the title lib/connections, lib/model: Faster termination on exit (ref #6319) lib: Faster termination on exit (ref #6319) Feb 12, 2020
imsodin (Author, Member) commented Feb 12, 2020:

I contextified the discover.FinderService.Lookup function, which, on another look at the goroutines in the reported panics, is the main cause of delayed shutdown, together with looping over all devices and calling model.Connection in lib/connections.

> I don't think this is the cause of the "database failed to stop" stuff, though. I can reproduce that by just starting Syncthing with the default config and Ctrl-C'ing it. Looking at the goroutines (adding a panic after the failed-to-stop message) I don't see anything that should have an open database transaction. I suspect we might just be leaking something in the close waitgroup.

It definitely doesn't cover all cases (I mentioned one about pulling above), and maybe not yours either. In the reported panics (e.g. https://crash.syncthing.net/report/2193b8026662ebea725e4b73f1147e98e07e3f986f0075704213c8eda5ec4bd8) it looks like suture fails to shut down the connection service, and thus other services also stay running and then panic (in the linked example it's the index sender, in others a folder). That's what I am trying to address here.

@imsodin imsodin force-pushed the imsodin:lib/fasterExit branch from 57c830e to 636d91c Feb 12, 2020
calmh (Member) approved these changes Feb 13, 2020, leaving a comment:
lgtm

calmh (Member) commented Feb 13, 2020:

I found two other places that would prevent database shutdown: the new GC runner and the metadata recalculator. The latter runs on first startup, which explains why my shutdown doesn't work then.

@calmh calmh merged commit c3637f2 into syncthing:master Feb 13, 2020
12 of 13 checks passed:
- GolangCI: 1 issue found
- Build (Cross Compile) (Syncthing): TeamCity build finished
- Build (Debian, Snap) (Syncthing): TeamCity build finished
- Build (Discovery Server): TeamCity build finished
- Build (Linux) (Syncthing): TeamCity build finished
- Build (Linux, Go1.12) (Syncthing): TeamCity build finished
- Build (Linux, Go1.14-rc) (Syncthing): TeamCity build finished
- Build (Mac) (Syncthing): TeamCity build finished
- Build (Relay Server): TeamCity build finished
- Build (Source) (Syncthing): TeamCity build finished
- Build (Windows) (Syncthing): TeamCity build finished
- Check Correctness (Syncthing): TeamCity build finished
- Coverage (Syncthing): TeamCity build finished
@calmh calmh added this to the v1.4.0 milestone Feb 13, 2020
@imsodin imsodin deleted the imsodin:lib/fasterExit branch Feb 13, 2020