Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

all: Be more stringent with timer resets/stops (ref #9417) #9422

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

calmh
Copy link
Member

@calmh calmh commented Feb 15, 2024

I found some concerning things while looking through our timer handling to see if we leaked timers. I didn't find any leaks of tickers, which is good, and most timers were stopped which is also fine. There were some missing stops that I added, but these should not be critical as a normal timer will anyway be collected when it expires. One place created time.After() in a loop which is not great, that was cleaned up.

However, there were some inconsistencies mostly around our use of resetting timers. The documentation is quite clear on what not to do (although it's less clear on the consequences if you do it anyway):

For a Timer created with NewTimer, Reset should be invoked only on stopped or expired timers with drained channels.

If a program has already received a value from t.C, the timer is known to have expired and the channel drained, so t.Reset can be used directly. If a program has not yet received a value from t.C, however, the timer must be stopped and—if Stop reports that the timer expired before being stopped—the channel explicitly drained:

if !t.Stop() {
<-t.C
}
t.Reset(d)

(https://pkg.go.dev/time#Timer.Reset)

This is impossible to guarantee when doing Reset() from outside the loop that processes the timer in question. Any place where we saved a timer in a struct, looped on a select over it, and did resets from another goroutine would run afoul of this rule.

To handle this better I've rewritten those places to use "local" timers (i.e., timer is declared in the function that loops over it, nobody else can touch it, and we make sure it gets stopped with a defer) and a channel to trigger reset from within that loop instead.

I also added a couple of convenience functions to reset a timer properly and stop a timer properly. Using the stop method is less critical, nothing bad will happen by using just a normal (*timer).Stop(), unless it's for something like a timeout where we want to make sure the timeout isn't trigger after doing the stop because there was already an undrained value in the channel. Then using the new StopAndDrain() will make sure the channel is drained.

I used the non-blocking receive in the convenience functions because it's sometimes hard to know whether the channel is already drained when calling StopAndDrain, we just want to make sure it is drained afterwards.

Copy link
Member

@imsodin imsodin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was pretty casual review (causing a few surprised grins) until I reached folder.go o.O

Comment on lines -382 to -388
if runningTests {
// Make the behavior stable when running tests to avoid randomly
// varying test coverage. This ensures, in practice if not in
// theory, that the timer fires and we take the true branch of the
// next if.
runtime.Gosched()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:)

Comment on lines 185 to 193
if (err != nil || !success) && f.pullPause < 60*f.pullBasePause() {
// Back off from retrying to pull
f.pullPause *= 2

// Pulling failed, try again later.
delay := f.pullPause + time.Since(t0)
l.Infof("Folder %v isn't making sync progress - retrying in %v.", f.Description(), stringutil.NiceDurationString(delay))
timeutil.ResetTimer(pullFailTimer, delay)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think resetting the fail timer should be inside this condition. Resp. it should be split: Reset the fail timer for (err != nil || !success), and inside that condition and before resting the timer have another if condition to increase the pause.

Also I think f.pullPause is now only used in this loop/serve function, so can be removed from the struct too.

}

case <-pullFailTimer.C:
f.SchedulePull()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afaik you also need to schedule a pull instead of calling .pull() in the next case. Resp. maybe it's safer to pass the fail timer into pull and do the resetting there, that way one can't call pull and forget to handle the timer.

Comment on lines 215 to +221
case next := <-f.scanDelay:
l.Debugln(f, "Delaying scan")
f.scanTimer.Reset(next)

case <-f.scanScheduled:
l.Debugln(f, "Scan was scheduled")
f.scanTimer.Reset(0)
if next > 0 {
l.Debugln(f, "Delaying scan")
} else {
l.Debugln(f, "Scan was scheduled")
}
timeutil.ResetTimer(scanTimer, next)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That scheduling is magic, just a tiny little bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants