-
Notifications
You must be signed in to change notification settings - Fork 651
snapstate: move refresh from a systemd timer to the internal snapstate Ensure() #2558
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
overlord/snapstate/snapmgr.go
Outdated
| } | ||
|
|
||
| // time to refresh? | ||
| if !nextRefresh.IsZero() && time.Now().Before(nextRefresh) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we shouldn't trigger a refresh before we are seeded?
it seems that if the refresh fails we will try immediately again instead of waiting a bit? because schedule-next-refresh is invoked only if everything is ok?
overlord/snapstate/snapmgr.go
Outdated
| } | ||
| } | ||
|
|
||
| updated, tasksets, err := UpdateMany(m.state, nil, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you check in daemon refresh does more than this nowadays, it uses assertmgr to refresh snap-declaration as well as a first step, we need to reorg that somehow
101f980 to
b531dc8
Compare
Also disable autorefresh during tests and update spread test.
b531dc8 to
732269c
Compare
| } | ||
|
|
||
| // and we also need to have a serial | ||
| _, err := Serial(st) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is correct long term, otoh it means developer images with custom models will not auto-refresh atm
niemeyer
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, and LGTM once these are sorted to your satisfaction.
overlord/snapstate/snapstate.go
Outdated
| // snaps on the system. In addition to that it will also refresh important | ||
| // assertions. | ||
| func AutoRefresh(st *state.State) ([]string, []*state.TaskSet, error) { | ||
| // FIXME: obtain it somehow? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed, the FIXME can be dropped here. This is doing the right thing. It's the internal implementation of userID handling in SnapSetup vs. SnapState that is broken.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I removed the comment
overlord/snapstate/snapmgr.go
Outdated
|
|
||
| // for tests | ||
| var refreshDisabled bool | ||
| err := tr.Get("core", "refresh.disabled", &refreshDisabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'll very quickly get widely used. We need to either agree that this is the direction we want to go, or we need to somehow make sure this can only possibly be used when testing, perhaps leveraging cmd.Version or something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we need a decision, for now it only works when SNAPD_DEBUG is set. I could also make that SNAPPY_TESTING or even only if we have a test build (but that will be slightly problematic in the tests we run against official builds in spread-cron).
overlord/snapstate/snapmgr.go
Outdated
| } | ||
|
|
||
| var refreshRand time.Duration | ||
| err = tr.Get("core", "refresh.rand", &refreshRand) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we please drop this? Given that the syntax in the FIXME above supports windows, we can keep this as an internal var or const while we don't finish the implementation of the full scheduling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I removed it
overlord/snapstate/snapmgr.go
Outdated
| // 2. always set the last refresh regardless of errors | ||
| // in "AutoRefresh()" | ||
| // | ||
| // This currently implements (2) because if there are e.g. store |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using (2) sounds a bit problematic now that we're giving people the option to dial down the frequency of retries. If someone has it set for once a week on Mondays, not retrying means it'll take another week until it tries again. Given how often we've been observing various kind of spurious errors, we should probably allow it to actually retry.
It should be super cheap for the store to say "go away, I have nothing for you!" if it really wants the retry to not happen. It can't easily say "Oh, come back please!", though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Fixed.
overlord/snapstate/snapmgr.go
Outdated
| var msg string | ||
| switch len(updated) { | ||
| case 0: | ||
| // check in after some hours |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is outdated now that we're allowing configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
overlord/snapstate/snapmgr.go
Outdated
| default: | ||
| quoted := strutil.Quoted(updated) | ||
| // TRANSLATORS: the %s is a comma-separated list of quoted snap names | ||
| msg = fmt.Sprintf(i18n.G("Refresh snaps %s"), quoted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably do that only in, say, "case 2,3" or similar. Otherwise it might be better to say "Refresh snaps A, B and N others." instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to Refresh %d snaps. I can add the first one or two names if you want but it seems a bit random.
3972ffe to
c3d7b96
Compare
c3d7b96 to
1b43a02
Compare
pedronis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more comments
| // allow disabling auto-refresh in the tests | ||
| if osutil.GetenvBool("SNAPD_DEBUG") { | ||
| var refreshDisabled bool | ||
| err := tr.Get("core", "refresh.disabled", &refreshDisabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we use a less transparent name for this given that is for tests only? or use a env var for the setting as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good question - the jury is still out if this is something we want to support for users as well or not. This is why this is so strangely written currently. I try to get clarification on that.
overlord/snapstate/snapmgr.go
Outdated
|
|
||
| // FIXME: this is skewed because it runs every 10min but it will | ||
| // be replaced soon by randomness based on the window | ||
| randomness := time.Duration(rand.Int63n(int64(refreshRandomness))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think (similarly to systemd) we should pick this at the start of snapd in the constructor of the manager, and log its value (to help debugging)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, added. Please note that this will soon go away for a different randomness that takes the schedule intervall length into account.
overlord/snapstate/snapmgr.go
Outdated
|
|
||
| // Do setLastRefresh() only if the store (in AutoRefresh) gave | ||
| // us no error. This means we retry on store errors every | ||
| // ~10min which will put some burden on the store. But if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we forget and we have similar problems in bits of devicestate, but 10 minutes is very untrue when we have changes on the fly, because then we trigger Ensure a lot with EnsureBefore(0). I think what we want is a minmum time between tries. So setting a lastRefreshTry on the manager itself (doesn't need to be in state) and setting it before auto refresh, and bail out if the elapsed time form lastRefreshTry is lass than 19 minutes or something like that.
As I said we have a bunch of places already that might need code like that (mostly in devicestate I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, this is a very good point. I added the code but test is currently missing.
overlord/snapstate/snapmgr.go
Outdated
| logger.Noticef(i18n.G("No snaps to auto-refresh found")) | ||
| return nil | ||
| case 1: | ||
| msg = fmt.Sprintf(i18n.G("Refresh snap %q"), updated[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should these say Auto-refresh ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that, fixed!
|
|
||
| // canAutoRefresh is a helper that checks if the device is able to | ||
| // auto-refresh | ||
| func canAutoRefresh(st *state.State) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just noting this seems not to be tested (the code is simple though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, test is missing here too.
pedronis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a couple of the last comments I think are important
pedronis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you
overlord/snapstate/snapmgr.go
Outdated
| // Check that we have reasonable delays between unsuccessful attempts. | ||
| // If the store is under stress we need to make sure we do not | ||
| // hammer it too often | ||
| if !m.lastRefreshAttempt.IsZero() && m.lastRefreshAttempt.Add(10*time.Minute).Before(time.Now()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before should be After actually?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fixed and added a test.
pedronis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after trying something similar I think there's a buglet
pedronis
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
This branch is groundwork so that we can control how to delay refreshes. It moves the automatic refresh from a systemd timer into the snapstate and tracks the refresh-time in the internal state.
An extra careful review is appreciated as this has some potential for catastrophies. If the auto-refresh logic has a bug we can not fix that bug automatically via an update. We might consider keeping the systemd job around for a bit more (but with a lower update frequency of maybe once a week) to have a way out if disaster happens.
One nice side-effect of this change is that we do no longer generate empty changes if there is nothing to update. We probably still want to log those empty updates to syslog so that we can double check that our updates are really run. The branch is doing that currently and given that the checks only happen every 4h+rand(4h) that seems to be ok.
We now also have a lot more flexibility regarding the randomization of the update time. Currently its relatively static (4h fixed + rand(4h)) but we have many options here (unlike with systemd).