
lib/model: Clean up index handler life cycle (fixes #9021) #9038

Merged

merged 7 commits into syncthing:main on Aug 21, 2023

Conversation

@calmh commented Aug 16, 2023

This is an alternative to #9035. There's a certain amount of overengineering here, possibly, but hear me out...

  • There's a new thing called a serviceMap. It's essentially a map of key -> service plus key -> suture token, with some management to make sure that adding and removing entries does the right thing for the services in question. This reduces boilerplate, since we have this pattern in a couple of places. The serviceMap is itself a Service that runs the children (a rough sketch follows at the end of this description).
  • The index handler registry now uses one of these to manage the index handlers.
  • The model now uses one of these to manage the index handler registries.
  • I think it all works out quite nicely: stopping one thing stops all its children, and we now remove (stop) the index handler registry for a device when we close the connection.

We can also use this serviceMap for the folder runners, but that's a large refactor-only diff that we can save for later...
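
For readers unfamiliar with the construct, here is a minimal sketch of the serviceMap idea described above, assuming suture v4. The names and exact method set are illustrative and need not match the actual lib/model/service_map.go.

// A minimal sketch of a serviceMap: a map of key -> service plus
// key -> suture token, itself a suture.Service that runs the children.
package model

import (
	"context"
	"time"

	"github.com/thejerf/suture/v4"
)

type serviceMap[K comparable, S suture.Service] struct {
	services map[K]S
	tokens   map[K]suture.ServiceToken
	sup      *suture.Supervisor
}

func newServiceMap[K comparable, S suture.Service]() *serviceMap[K, S] {
	return &serviceMap[K, S]{
		services: make(map[K]S),
		tokens:   make(map[K]suture.ServiceToken),
		sup:      suture.NewSimple("serviceMap"),
	}
}

// Add starts the service and indexes it under the key. If there is already a
// service at the given key, it is removed (stopped) first.
func (s *serviceMap[K, S]) Add(k K, v S) {
	if tok, ok := s.tokens[k]; ok {
		s.sup.Remove(tok)
	}
	s.services[k] = v
	s.tokens[k] = s.sup.Add(v)
}

// RemoveAndWait stops the service at the key, waits for it to exit, and
// forgets it.
func (s *serviceMap[K, S]) RemoveAndWait(k K, timeout time.Duration) {
	if tok, ok := s.tokens[k]; ok {
		s.sup.RemoveAndWait(tok, timeout)
	}
	delete(s.services, k)
	delete(s.tokens, k)
}

// Serve makes the serviceMap itself a suture.Service; stopping it stops all
// the children.
func (s *serviceMap[K, S]) Serve(ctx context.Context) error {
	return s.sup.Serve(ctx)
}

In this PR, m.indexHandlers would be such a map keyed by device ID, and the registry's per-folder index handlers another one keyed by folder ID, matching the bullet points above.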

@@ -1792,7 +1793,7 @@ func (m *model) Closed(conn protocol.Connection, err error) {
 	delete(m.remoteFolderStates, device)
 	closed := m.closed[device]
 	delete(m.closed, device)
-	delete(m.indexHandlers, device)
+	m.indexHandlers.RemoveAndWait(device, 0)
@calmh (author):

I believe this is the relevant point that fixes #9021, as per my understanding of #9035...

@calmh (author):

... and I'm 99% sure that "...AndWait" is safe here, but not 100%

@imsodin:

Also my understanding :)

One potentially small downside of the map-and-service construct: we can't move the waiting part outside of the lock anymore - worst case, it's still writing an index update received just before the conn was closed to the db, and thus we block pmut for a while longer than necessary. Doesn't seem very relevant though; if the db is so slow that it matters, operation won't be smooth anyway.
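
To illustrate the trade-off: with the token held directly in the model (the pre-serviceMap shape), the wait could be deferred until after the lock is released. A rough sketch of that pattern, with hypothetical field names, not code from either PR:

// Hypothetical pre-serviceMap shape of (*model).Closed: grab the token under
// pmut, drop the lock, then do the potentially slow wait outside it.
m.pmut.Lock()
tok, ok := m.indexHandlerTokens[device]
delete(m.indexHandlerTokens, device)
m.pmut.Unlock()

if ok {
	// Waiting here may block on an index update that is still being written
	// to the database, but pmut is no longer held while that happens.
	m.sup.RemoveAndWait(tok, 0)
}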

// already a service at the given key, it is removed first.
func (s *serviceMap[K, S]) Add(k K, v S) {
	if tok, ok := s.tokens[k]; ok {
		// There is already a service at this key, remove it first.
@calmh (author):

Alternatively, we could call this incorrect usage and panic, since it's likely to be caused by a missing Remove at some point...

@imsodin:

Or a middle ground: Panic if it's an RC, send a failure event (and thus report) otherwise.
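
A hedged sketch of that middle ground, assuming the serviceMap sketched earlier also carries an events.Logger, and using syncthing's existing build.IsCandidate flag and events.Failure event type; the wording and wiring are illustrative, not the PR's actual code.

// Sketch only: treat a duplicate key as a bug in release candidates, report
// it as a failure event otherwise, then fall back to replacing the service.
// (Needs "fmt", lib/build and lib/events, plus an evLogger field on serviceMap.)
func (s *serviceMap[K, S]) Add(k K, v S) {
	if tok, ok := s.tokens[k]; ok {
		if build.IsCandidate {
			panic(fmt.Sprintf("bug: adding service twice under key %v", k))
		}
		s.evLogger.Log(events.Failure, fmt.Sprintf("adding service twice under key %v", k))
		s.sup.Remove(tok)
	}
	s.services[k] = v
	s.tokens[k] = s.sup.Add(v)
}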

@imsodin commented Aug 17, 2023

At first glance the service map has more lib/svcutil vibes than model vibes (haven't actually reviewed yet, so maybe it obviously isn't).

}

// Each calls the given function for each service in the map.
func (s *serviceMap[K, S]) Each(fn func(K, S)) {
@imsodin:

We do this over exposing the map and just looping over that at the caller, to prevent the caller from messing with the map?

(Also I want range over functions: golang/go#61405)

@calmh (author):

Indeed. I considered returning a copy of the map, but this seemed fine as well...
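
For completeness, the body behind the snippet above is presumably just a loop over the private map; keeping the map unexported is what prevents callers from mutating it. A sketch, not necessarily the exact code:

// Each calls the given function for each service in the map.
func (s *serviceMap[K, S]) Each(fn func(K, S)) {
	for k, v := range s.services {
		fn(k, v)
	}
}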

@calmh commented Aug 17, 2023

(lib/svcutil) Agreed, if and when it's used outside of the one package at least...

@imsodin left a comment:

just some nits, lgtm otherwise

Comment on lines +50 to +52
func (s *serviceMap[K, S]) Get(k K) (v S, ok bool) {
	v, ok = s.services[k]
	return
@imsodin:

Just curiosity: Is there a specific reason for the named return values?

@calmh (author):

I think I thought of it as documentation on what the bool is (though that's hardly necessary, as it's a super common pattern), and the method body becomes a few characters shorter; possibly it evolved this way from an earlier iteration without the ok bool...
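
The two spellings side by side, for illustration; the unnamed variant (getUnnamed) is hypothetical and not in the PR.

// As written: the named results document what the bool means and allow a
// bare return.
func (s *serviceMap[K, S]) Get(k K) (v S, ok bool) {
	v, ok = s.services[k]
	return
}

// Equivalent behaviour with unnamed results.
func (s *serviceMap[K, S]) getUnnamed(k K) (S, bool) {
	v, ok := s.services[k]
	return v, ok
}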

(resolved review comment on lib/model/service_map.go)
-		r.sup.RemoveAndWait(is.token, 0)
-		delete(r.indexHandlers, folder.ID)
-	}
+	r.indexHandlers.Remove(folder.ID)
@imsodin:

I think this is also one of the "should never happen" cases, but if it did, wouldn't we want RemoveAndWait to ensure there aren't two index handlers running at the same time?

@calmh (author):

My bug, we did ...AndWait before and should do so now as well 👍

@@ -389,48 +385,35 @@ type indexHandlerFolderState struct {
 	runner service
 }

-func newIndexHandlerRegistry(conn protocol.Connection, downloads *deviceDownloadState, closed chan struct{}, parentSup *suture.Supervisor, evLogger events.Logger) *indexHandlerRegistry {
+func newIndexHandlerRegistry(conn protocol.Connection, downloads *deviceDownloadState, closed chan struct{}, evLogger events.Logger) *indexHandlerRegistry {
@imsodin:

The closed arg isn't used anymore.


calmh and others added 2 commits August 19, 2023 08:31
Co-authored-by: Simon Frei <freisim93@gmail.com>
@calmh commented Aug 19, 2023

This doesn't fix #9021 - are you sure your attempt does?

--- FAIL: TestAutoAcceptEnc (0.00s)
panic: bug: ClusterConfig called on closed or nonexistent connection [recovered]
	panic: bug: ClusterConfig called on closed or nonexistent connection

goroutine 2639 [running]:
testing.tRunner.func1.2({0xf78000, 0x1254330})
	C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:1526 +0x24e
testing.tRunner.func1()
	C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:1529 +0x39f
panic({0xf78000, 0x1254330})
	C:/hostedtoolcache/windows/go/1.20.7/x64/src/runtime/panic.go:884 +0x213
github.com/syncthing/syncthing/lib/model.(*model).ClusterConfig(0xc00088dba0, {0x1268188?, 0xc000453800?}, {{0xc005db1900?, 0x0?, 0x0?}})
	D:/a/syncthing/syncthing/lib/model/model.go:1177 +0x100a
github.com/syncthing/syncthing/lib/model.TestAutoAcceptEnc.func2({0x2, 0x23, 0xe5, 0xbf, 0x3f, 0x57, 0x1f, 0x39, 0xd2, 0xf0, ...}, ...)
	D:/a/syncthing/syncthing/lib/model/model_test.go:1341 +0x125
github.com/syncthing/syncthing/lib/model.TestAutoAcceptEnc(0xc00088d040)
	D:/a/syncthing/syncthing/lib/model/model_test.go:1398 +0x168a
testing.tRunner(0xc00088d040, 0x11418f8)
	C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
	C:/hostedtoolcache/windows/go/1.20.7/x64/src/testing/testing.go:1629 +0x3ea
FAIL	github.com/syncthing/syncthing/lib/model	2.348s
?   	github.com/syncthing/syncthing/lib/model/mocks	[no test files]

(latest build run on Windows)

@calmh commented Aug 19, 2023

Frankly the test seems kinda odd and even has a comment related to this panic, which it tries to work around by adding a new connection for the same device ID for every test iteration... I think this change (and yours) is sound, but the test does something else unrealistic/wrong for different reasons.

@imsodin commented Aug 19, 2023

I am sure it makes it less likely at least - I get that panic within a few runs on main and not at all on my PR with go test -count 1000 -failfast -run AutoAcceptEnc ./lib/model/. I now also checked your PR out, and indeed it fails too - I don't understand yet why.

Also regarding the test being weird/potentially bogus: I don't disagree - it still shouldn't be able to trigger that panic.

I don't see the difference. I switched my PR to also make the index handler registry a service and control it from model (in an ugly quick and dirty way just for debugging), and the test still doesn't fail.
There are also a few tiny circumstantial changes in the PR here, but fixing them doesn't change anything (as expected). I'll comment inline just for completeness.

@imsodin commented Aug 20, 2023

Can't say I understand why, but waiting for the index handler to stop outside of the pmut lock makes the test no longer panic/fail for me. Patch doing that on top of your PR: imsodin@50dc429
Something similar might actually be a generally useful extension of the suture API.
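
One hedged sketch of what such an extension could look like on top of the serviceMap sketched earlier; RemoveAndWaitChan is hypothetical, not part of suture and not taken from imsodin@50dc429, and merely shows how a caller could drop its own lock before waiting.

// Hypothetical: remove the service and return a channel that is closed once
// it has fully stopped, so the caller can wait without holding its own lock.
func (s *serviceMap[K, S]) RemoveAndWaitChan(k K, timeout time.Duration) <-chan struct{} {
	done := make(chan struct{})
	tok, ok := s.tokens[k]
	delete(s.services, k)
	delete(s.tokens, k)
	if !ok {
		close(done)
		return done
	}
	go func() {
		defer close(done)
		s.sup.RemoveAndWait(tok, timeout)
	}()
	return done
}

// Caller side, e.g. in (*model).Closed:
//
//	done := m.indexHandlers.RemoveAndWaitChan(device, 0)
//	m.pmut.Unlock()
//	<-done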

@calmh commented Aug 20, 2023

That's bizarre. I would like to understand why that has any effect...

@calmh commented Aug 20, 2023

FWIW, this also "fixes" it for me:

diff --git a/lib/model/model.go b/lib/model/model.go
index 0fafd0d03..aad3a45da 100644
--- a/lib/model/model.go
+++ b/lib/model/model.go
@@ -1795,6 +1795,7 @@ func (m *model) Closed(conn protocol.Connection, err error) {
        delete(m.closed, device)
        m.indexHandlers.RemoveAndWait(device, 0)
        m.pmut.Unlock()
+       runtime.Gosched()

        m.progressEmitter.temporaryIndexUnsubscribe(conn)
        m.deviceDidClose(device, time.Since(conn.EstablishedAt()))

I'll see if I can figure this out later on; something seems to be async that shouldn't be...

@imsodin commented Aug 20, 2023

I tried to figure out the sequence of events leading to the panic, but I can't say I really have a clear picture. However, it definitely involves late effects of the previous test case (config changes). And I am now more or less convinced that this test is sufficiently messy to show that while the condition that leads to the panic holds most of the time, there's actually no guarantee it does: a connection might be closed right after a cluster config is received, so the model methods handling the two events can race on the pmut lock such that the connection gets closed/removed before the cluster config is processed. I think we should just return an error in that case, same as with the next condition that checks the configuration. That will still fail this test, but that's clearly a case of a deficient test (it should probably be a bunch of separate sub-tests that either get their own state or properly clean up after themselves).

(Also, none of this explains the interesting question of why runtime.Gosched() has the impact it does - maybe it's just a more or less random scheduling effect that leads to a benign ordering of the racing methods?)
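
A hedged sketch of the "just return an error" idea at the top of (*model).ClusterConfig; the field names and locking shown here are assumed for illustration and are not taken from the PR.

// Illustrative only: if the connection is already gone, bail out with an
// error instead of panicking. "device" is the remote device ID for conn, and
// m.conn is assumed to be the model's device-to-connection map of the time.
m.pmut.RLock()
_, ok := m.conn[device]
m.pmut.RUnlock()
if !ok {
	return fmt.Errorf("cluster config received on closed or nonexistent connection to %s", device)
}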

@calmh commented Aug 21, 2023

There, I fixed it, I believe. My interpretation is that this was a test-only failure. The thing is, we may under some circumstances try to close a connection more than once, e.g. from CommitConfiguration (device is now paused) and AddConnection (we're adding a new connection for the same device). This happens because starting a close is just calling (protocol.Connection).Close(), and the actual cleanup happens when the connection calls back into (*Model).Closed(). Now, (*Model).Closed() must only be called once per connection, because it (currently, pre-my-other-PR) just works per device and assumes any connection present for the device is the one it's closing. Our protocol code does this correctly -- it only closes once, because there's a sync.Once for the whole thing. The fakeConnection wasn't as precise about this and would happily call the closed callback more than once. Normally this wouldn't be a problem, but TestAutoAcceptEnc effectively races AddConnection against Close all the time while causing Close() to be called multiple times, so sometimes closing connection A would remove the structures for connection B.

Adding the Gosched or channel stuff or a wait gave things time to close before adding the next connection.
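
A hedged sketch of the kind of fix described: guard the test connection's close callback with a sync.Once, mirroring what the real protocol code already does. The fakeConnection fields shown are illustrative, not the actual test code.

// Illustrative shape of the fix: however often Close is called, the model's
// Closed callback fires at most once per fake connection.
type fakeConnection struct {
	closeOnce sync.Once
	closeFn   func(err error) // calls (*model).Closed for this connection
}

func (f *fakeConnection) Close(err error) {
	f.closeOnce.Do(func() {
		f.closeFn(err)
	})
}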

@imsodin commented Aug 21, 2023

Ah yes, and my statement above that ClusterConfig may race Closed is wrong: the protocol package only calls close after it has stopped all its processing, so of course there are no in-flight calls going on.

@calmh requested a review from imsodin on August 21, 2023
@calmh merged commit 40b3b9a into syncthing:main on Aug 21, 2023
23 checks passed
@calmh deleted the indexhandlerlifecycle branch on August 22, 2023
@calmh added this to the v1.24.0 milestone on Aug 23, 2023
calmh added a commit to calmh/syncthing that referenced this pull request Sep 1, 2023