Skip to content

Conversation

@mvo5
Copy link
Contributor

@mvo5 mvo5 commented Jan 5, 2017

This branch is groundwork so that we can control how to delay refreshes. It moves the automatic refresh from a systemd timer into the snapstate and tracks the refresh-time in the internal state.

An extra careful review is appreciated as this has some potential for catastrophies. If the auto-refresh logic has a bug we can not fix that bug automatically via an update. We might consider keeping the systemd job around for a bit more (but with a lower update frequency of maybe once a week) to have a way out if disaster happens.

One nice side-effect of this change is that we do no longer generate empty changes if there is nothing to update. We probably still want to log those empty updates to syslog so that we can double check that our updates are really run. The branch is doing that currently and given that the checks only happen every 4h+rand(4h) that seems to be ok.

We now also have a lot more flexibility regarding the randomization of the update time. Currently its relatively static (4h fixed + rand(4h)) but we have many options here (unlike with systemd).

}

// time to refresh?
if !nextRefresh.IsZero() && time.Now().Before(nextRefresh) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we shouldn't trigger a refresh before we are seeded?

it seems that if the refresh fails we will try immediately again instead of waiting a bit? because schedule-next-refresh is invoked only if everything is ok?

}
}

updated, tasksets, err := UpdateMany(m.state, nil, 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you check in daemon refresh does more than this nowadays, it uses assertmgr to refresh snap-declaration as well as a first step, we need to reorg that somehow

@mvo5 mvo5 reopened this Jan 31, 2017
@mvo5 mvo5 force-pushed the feature/internal-refresh branch from 101f980 to b531dc8 Compare February 1, 2017 07:01
Also disable autorefresh during tests and update spread test.
@mvo5 mvo5 force-pushed the feature/internal-refresh branch from b531dc8 to 732269c Compare February 1, 2017 07:29
}

// and we also need to have a serial
_, err := Serial(st)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is correct long term, otoh it means developer images with custom models will not auto-refresh atm

Copy link
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, and LGTM once these are sorted to your satisfaction.

// snaps on the system. In addition to that it will also refresh important
// assertions.
func AutoRefresh(st *state.State) ([]string, []*state.TaskSet, error) {
// FIXME: obtain it somehow?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed, the FIXME can be dropped here. This is doing the right thing. It's the internal implementation of userID handling in SnapSetup vs. SnapState that is broken.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I removed the comment


// for tests
var refreshDisabled bool
err := tr.Get("core", "refresh.disabled", &refreshDisabled)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'll very quickly get widely used. We need to either agree that this is the direction we want to go, or we need to somehow make sure this can only possibly be used when testing, perhaps leveraging cmd.Version or something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, we need a decision, for now it only works when SNAPD_DEBUG is set. I could also make that SNAPPY_TESTING or even only if we have a test build (but that will be slightly problematic in the tests we run against official builds in spread-cron).

}

var refreshRand time.Duration
err = tr.Get("core", "refresh.rand", &refreshRand)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please drop this? Given that the syntax in the FIXME above supports windows, we can keep this as an internal var or const while we don't finish the implementation of the full scheduling.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I removed it

// 2. always set the last refresh regardless of errors
// in "AutoRefresh()"
//
// This currently implements (2) because if there are e.g. store
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using (2) sounds a bit problematic now that we're giving people the option to dial down the frequency of retries. If someone has it set for once a week on Mondays, not retrying means it'll take another week until it tries again. Given how often we've been observing various kind of spurious errors, we should probably allow it to actually retry.

It should be super cheap for the store to say "go away, I have nothing for you!" if it really wants the retry to not happen. It can't easily say "Oh, come back please!", though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Fixed.

var msg string
switch len(updated) {
case 0:
// check in after some hours
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is outdated now that we're allowing configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

default:
quoted := strutil.Quoted(updated)
// TRANSLATORS: the %s is a comma-separated list of quoted snap names
msg = fmt.Sprintf(i18n.G("Refresh snaps %s"), quoted)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably do that only in, say, "case 2,3" or similar. Otherwise it might be better to say "Refresh snaps A, B and N others." instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to Refresh %d snaps. I can add the first one or two names if you want but it seems a bit random.

@pedronis pedronis self-assigned this Feb 2, 2017
@mvo5 mvo5 force-pushed the feature/internal-refresh branch from 3972ffe to c3d7b96 Compare February 3, 2017 07:33
@mvo5 mvo5 force-pushed the feature/internal-refresh branch from c3d7b96 to 1b43a02 Compare February 3, 2017 08:24
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more comments

// allow disabling auto-refresh in the tests
if osutil.GetenvBool("SNAPD_DEBUG") {
var refreshDisabled bool
err := tr.Get("core", "refresh.disabled", &refreshDisabled)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we use a less transparent name for this given that is for tests only? or use a env var for the setting as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question - the jury is still out if this is something we want to support for users as well or not. This is why this is so strangely written currently. I try to get clarification on that.


// FIXME: this is skewed because it runs every 10min but it will
// be replaced soon by randomness based on the window
randomness := time.Duration(rand.Int63n(int64(refreshRandomness)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (similarly to systemd) we should pick this at the start of snapd in the constructor of the manager, and log its value (to help debugging)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, added. Please note that this will soon go away for a different randomness that takes the schedule intervall length into account.


// Do setLastRefresh() only if the store (in AutoRefresh) gave
// us no error. This means we retry on store errors every
// ~10min which will put some burden on the store. But if the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we forget and we have similar problems in bits of devicestate, but 10 minutes is very untrue when we have changes on the fly, because then we trigger Ensure a lot with EnsureBefore(0). I think what we want is a minmum time between tries. So setting a lastRefreshTry on the manager itself (doesn't need to be in state) and setting it before auto refresh, and bail out if the elapsed time form lastRefreshTry is lass than 19 minutes or something like that.

As I said we have a bunch of places already that might need code like that (mostly in devicestate I think)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is a very good point. I added the code but test is currently missing.

logger.Noticef(i18n.G("No snaps to auto-refresh found"))
return nil
case 1:
msg = fmt.Sprintf(i18n.G("Refresh snap %q"), updated[0])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should these say Auto-refresh ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that, fixed!


// canAutoRefresh is a helper that checks if the device is able to
// auto-refresh
func canAutoRefresh(st *state.State) bool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noting this seems not to be tested (the code is simple though)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, test is missing here too.

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a couple of the last comments I think are important

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you

// Check that we have reasonable delays between unsuccessful attempts.
// If the store is under stress we need to make sure we do not
// hammer it too often
if !m.lastRefreshAttempt.IsZero() && m.lastRefreshAttempt.Add(10*time.Minute).Before(time.Now()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before should be After actually?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed and added a test.

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after trying something similar I think there's a buglet

@pedronis pedronis changed the title snapstate: move refresh from a systemd timer to the inernal snapstate Ensure() snapstate: move refresh from a systemd timer to the internal snapstate Ensure() Feb 4, 2017
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

@mvo5 mvo5 merged commit 92958e1 into canonical:master Feb 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants