New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
o/state: record change-update notices on change status updates #13528
o/state: record change-update notices on change status updates #13528
Conversation
Just tagging @benhoyt for awareness |
overlord/state/change.go
Outdated
@@ -423,11 +423,30 @@ func (c *Change) Status() Status { | |||
panic(fmt.Sprintf("internal error: cannot process change status: %v", statusStats)) | |||
} | |||
|
|||
// addNotice triggers a change-update notice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually use the word "records" for notices, and it's "an occurrence". Additionally, I think it would be good to document what the key is. So perhaps:
// addNotice triggers a change-update notice. | |
// addNotice records an occurrence of a change-update notice for this change. | |
// The notice key is set to the change ID. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This much better, thanks! updated!
overlord/state/change.go
Outdated
// tasks' statuses so we might be sending a lot of change-update | ||
// notices. | ||
if err := c.addNotice(); err != nil { | ||
logger.Debugf(`internal error: failed to add "change-update" notice: %v`, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should panic here, as this will only happen if we've got a programmer bug in addNotice
(State.AddNotice
only returns an error if you the notice type is invalid or the key is an empty string). So I think we should be loud about an internal error/bug here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to be careful with panics, but I agree, it makes sense here. updated, thank you!
overlord/state/change.go
Outdated
opts := &AddNoticeOptions{ | ||
Data: map[string]string{ | ||
ChangeUpdateNoticeKindDataKey: c.Kind(), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A zero value for AddNoticeOptions.RepeatAfter
means that it will "repeat" for every occurrence, that is, show up at the top of the list again when you list notices (or return immediately when you're long-polling for notices). Is that what we want? See also discussion below about alternating rapidly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to have it repeat for every occurrence.
On one hand, I think it is a bit spammy but on the other hand it makes sense for listeners that are waiting for a specific change-status like "Done", it would be bad to miss it because it was suppressed through repeat-after.
@pedronis @Meulengracht would love to hear your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a good idea to have a repeat-after as a general thing for changes. You might miss important change events. Like a change entering error status, done, or waiting.
overlord/state/change.go
Outdated
// Note: Change status alternates rapidly between “Do” and “Doing” | ||
// statuses because it gets computed based on an aggregation of its | ||
// tasks' statuses so we might be sending a lot of change-update | ||
// notices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty concerned about this. Can you give an example? If it is alternating rapidly like this, is that really an existing bug? How could we avoid it?
Also, in the notices spec JU048 we describe that a notice will only be recorded when the change is spawned or its status changes:
This notice is recorded when a change is updated: spawned, or status updated (for example, from Doing to Done or Error)
I notice that this is in notifyStatusChange
, which looks like it's called whenever a task status changes. I'm presuming the if c.lastObservedStatus == new { return }
block prevents that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not a bug, you are right that if c.lastObservedStatus == new { return }
prevents that, it is actually a side-effect of having a change's status be computed based on its tasks' statuses, which makes sense since a change is just a collection of tasks.
This is an example I captured through a local run, notice the high number of occurrences:
{
"id": "2",
"user-id": null,
"type": "change-update",
"key": "827",
"first-occurred": "2024-01-31T12:12:41.764964306Z",
"last-occurred": "2024-01-31T12:12:46.49646734Z",
"last-repeated": "2024-01-31T12:12:46.49646734Z",
"occurrences": 21, <=
"last-data": {
"kind": "remove-snap"
},
"expire-after": "168h0m0s"
}
Do we want to hide this alternating fact from listeners somehow? there is a good case for this to avoid clients spamming the daemon for information about those (not so important) notices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more a fundamental thing around changes. The Change.Status()
reflects the current status of the change based on the current status of all tasks in that change. (In a prioritised order). So when a task completes, the change goes back to 'Do' which is correct as the change is not 'Doing' before the next task spawns into 'Doing'.
However obviously this makes it a bit spammy, especially for changes with many linked tasks that wait for one other. We may either handle it at the change level (i.e a more complex status logic that assumes nothing goes wrong between the end of one task, and the start of the next, assuming one can actually start), but I'm not sure I like this. Or we can handle it specifically on a level for notices. (I.e don't track statuses that go back to 'Do' only for notices, meaning we ignore 'Doing' => 'Do' and the returning 'Do' => 'Doing')
It's a bit unelegant and annoying to have special code for, but I prefer the last solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I agree with @Meulengracht thinking here. But it seems to me that to implement that idea we need to keep some state in the change about on what status we emitted the notice last?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pedronis Exactly, we can handle it by keeping even more state on the Change which I personally would really like to avoid (or maybe we can somehow reuse the existing change status tracking in a more elegant way? then I would be more open to it), or we can add special handling code in the Notice system, which specifically deals with this behavior, since that is where the concern is, and this might be the better solution, as it would avoid us keeping more complex "previous-status" tracking on the change state.
I would really like to know what you think would make more sense in this case @pedronis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, but on the contrary you meed to pick something that makes sense whether is the initialisation or not, maybe RepeatCheckData should take both the old and the new value (from the options) instead, that was sort on my mind
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benhoyt @pedronis @Meulengracht @olivercalder This PR #13576 is attempt at stateful repeat checks as suggested by @pedronis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe RepeatCheckData should take both the old and the new value (from the options) instead, that was sort on my mind
@pedronis I don't fully understand this, do you mean doing set/get through pointer references to RepeatCheckData in options directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given the challenges we have I made a new more concrete proposal about the signature of RepeatCheck in your other PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just circling back to this thread with a link to my comment on that PR (which I'm not comfortable with): #13576 (comment). And thanks very much for your proofs-of-concept and the links to them above, @ZeyadYasser.
overlord/state/notices.go
Outdated
const ( | ||
// Data key used for holding change kind in a change-update notice. | ||
// TODO: Should we add strict type validation for notices data keys? | ||
ChangeUpdateNoticeKindDataKey string = "kind" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need strict type validation here. We can just have a test for it.
I think we should make this constant name lowercase / non-exported, i.e., changeUpdateNoticeKindDataKey
. However, at that point maybe we should just use "kind"
directly in the map? It only occurs once, and I think having the constant obscures more than it helps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated!
overlord/state/state.go
Outdated
@@ -343,6 +343,10 @@ func (s *State) NewChange(kind, summary string) *Change { | |||
id := strconv.Itoa(s.lastChangeId) | |||
chg := newChange(s, id, kind, summary) | |||
s.changes[id] = chg | |||
// Add change-update notice for newly spawned change | |||
if err := chg.addNotice(); err != nil { | |||
logger.Debugf(`internal error: failed to add "change-update" notice: %v`, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here about possibly panicking (as this shouldn't happen unless we have a programming bug).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated!
overlord/state/change.go
Outdated
func (c *Change) addNotice() error { | ||
opts := &AddNoticeOptions{ | ||
Data: map[string]string{ | ||
ChangeUpdateNoticeKindDataKey: c.Kind(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the spec we also include the change ID in the data with an "id"
field. I'm not actually sure that's necessary, because you'll always have the key (which is the change ID). But I thought I'd point it out -- I'm interested to hear your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to think of value it could add for notice listener, I am sure the spec was written with some use-case in mind but at least in snapd we don't have a good case for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to speak with someone that had a use-case in mind, for now maybe not add it and wait for someone to ask for it? It's easier to add things than remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would make sense only if we thought that we should not expose as fixed that the key is the change-id but I don't think we want not to expose that? I don't see much harm in adding it though either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to expose that the key is the change-id as part of the documented behaviour. So let's leave it out for now; as @Meulengracht said, it's much easier to add later if we need it than remove it.
a8a32eb
to
00f54b8
Compare
@benhoyt @pedronis @Meulengracht @olivercalder PR is updated (with tests) to include suggestions here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates. I really like this approach, with the decision and all stored data occurring in the change code.
One small nitpick about naming, but otherwise looks great to me!
overlord/state/change.go
Outdated
@@ -149,6 +149,7 @@ type Change struct { | |||
taskIDs []string | |||
ready chan struct{} | |||
lastObservedStatus Status | |||
lastNoticedStatus Status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: I don't love lastNoticedStatus
since "noticed" in this case is a bit of a (clever) double-entendre. I had to check to be sure "noticed" referred to "recorded notice" rather than "noticed" as in "observed". Maybe lastRecordedNoticeStatus
would be a bit clearer, even if it is more verbose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite being the one to suggest that name, I agree with this -- thanks. Clear is better than clever!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments, and a request for another test or two to test the new status-changed logic in the state
package.
c.Check(n["type"], Equals, "change-update") | ||
c.Check(n["key"], Equals, chg.ID()) | ||
c.Check(n["last-data"], DeepEquals, map[string]any{"kind": "install"}) | ||
c.Check(n["occurrences"], Equals, 1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for this test. However, it seems to me we should also have a test here (that is, in the state
package) for the "status change" code. I realize it's being exercised over in the snapstate
package, but I think we should be exercising state functionality here (and the snapstate
package won't be moved over to Pebble, so then we'll lose the test in Pebble).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, Thank you! I added additional tests under change_test.go
.
My initial thoughts was to add the tests to the state
package, but went with snapstate
instead to avoid having hand-crafted synthetic tests and have a real-world change tested instead. I think it would be a good idea to keep those in snapd for good measure.
overlord/state/change.go
Outdated
// Add change-update notice for status change | ||
if !shouldSkipChangeUpdateNotice(c.lastNoticedStatus, new) { | ||
if err := c.addNotice(); err != nil { | ||
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can we include the context here to differentiate, so if this does happen, we can differentiate this from the newly-spawned message?
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err) | |
logger.Panicf(`internal error: failed to add "change-update" notice on status change: %v`, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
overlord/state/state.go
Outdated
@@ -343,6 +343,10 @@ func (s *State) NewChange(kind, summary string) *Change { | |||
id := strconv.Itoa(s.lastChangeId) | |||
chg := newChange(s, id, kind, summary) | |||
s.changes[id] = chg | |||
// Add change-update notice for newly spawned change | |||
if err := chg.addNotice(); err != nil { | |||
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above:
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err) | |
logger.Panicf(`internal error: failed to add "change-update" notice for new change: %v`, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thank you!
overlord/state/change.go
Outdated
@@ -149,6 +149,7 @@ type Change struct { | |||
taskIDs []string | |||
ready chan struct{} | |||
lastObservedStatus Status | |||
lastNoticedStatus Status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite being the one to suggest that name, I agree with this -- thanks. Clear is better than clever!
overlord/state/change.go
Outdated
func (c *Change) notifyStatusChange(new Status) { | ||
if c.lastObservedStatus == new { | ||
return | ||
} | ||
c.state.notifyChangeStatusChangedHandlers(c, c.lastObservedStatus, new) | ||
c.lastObservedStatus = new | ||
// Add change-update notice for status change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine now, but can I suggest we change the two "skips" to the same structure, rather than having an early return above? That way if we ever tweak the lastObservedStatus
logic, we won't accidentally skip the change-update check by returning early. Suggestion:
func (c *Change) notifyStatusChange(new Status) {
if c.lastObservedStatus != new {
c.state.notifyChangeStatusChangedHandlers(c, c.lastObservedStatus, new)
c.lastObservedStatus = new
}
if !shouldSkipChangeUpdateNotice(c.lastNoticedStatus, new) {
// addNotice()
c.lastNoticedStatus = new
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, Thanks! updated
for i := 0; i < 5; i++ { | ||
addTask("fake-task") | ||
} | ||
addTask("error-trigger") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my benefit: how does this trigger an error and Undo/Abort sequence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just a mock task handler that always fails, it is setup here:
snapd/overlord/managers_test.go
Lines 487 to 491 in 8c247ea
// For triggering errors | |
erroringHandler := func(task *state.Task, _ *tomb.Tomb) error { | |
return errors.New("error out") | |
} | |
s.o.TaskRunner().AddHandler("error-trigger", erroringHandler, nil) |
overlord/snapstate/snapstate_test.go
Outdated
c.Check(n["type"], Equals, "change-update") | ||
c.Check(n["key"], Equals, chg.ID()) | ||
c.Check(n["last-data"], DeepEquals, map[string]any{"kind": "refresh"}) | ||
// Default -> Doing -> Undo -> abort -> Undo -> Error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Micro-nit:
// Default -> Doing -> Undo -> abort -> Undo -> Error | |
// Default -> Doing -> Undo -> Abort -> Undo -> Error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thank you
But overall, this is looking really nice, and is very straight-forward -- thanks! |
Codecov ReportAttention:
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #13528 +/- ##
==========================================
- Coverage 78.88% 78.87% -0.01%
==========================================
Files 1037 1037
Lines 132757 132872 +115
==========================================
+ Hits 104722 104809 +87
- Misses 21503 21523 +20
- Partials 6532 6540 +8
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks for the updates!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! LGTM now, pending the decision on whether to save lastRecordedNoticeStatus
to disk.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, couple of small comments
overlord/state/change.go
Outdated
@@ -184,6 +185,8 @@ type marshalledChange struct { | |||
|
|||
SpawnTime time.Time `json:"spawn-time"` | |||
ReadyTime *time.Time `json:"ready-time,omitempty"` | |||
|
|||
LastRecordedNoticeStatus Status `json:"last-recorded-notice-status"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add omitempty here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated, thank you!
c.lastObservedStatus = new | ||
} | ||
if !shouldSkipChangeUpdateNotice(c.lastRecordedNoticeStatus, new) { | ||
if err := c.addNotice(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment before this:
// implies State.writing()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, thanks! updated
some additions merit a re-review
* o/state: address review comments (thanks @benhoyt) * panic when addNotice fails to signal an internal programmer error * better comments * o/state: skip alternating change-update notices * o/state: address review comments (thanks @olivercalder and @benhoyt) * o/state: save change notice status to disk (thanks @pedronis) * o/state: add omitempty to last-recorded-notice-status change state (thanks @pedronis) Signed-off-by: Zeyad Gouda <zeyad.gouda@canonical.com>
59f2c60
to
adb0336
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
This is backport from snapd. snapcore/snapd#13528 Signed-off-by: Zeyad Gouda <zeyad.gouda@canonical.com>
) This PR records change-update notices based as mentioned in the JU048 spec. - `NewChange` addresses the change spawned case. - `notifyStatusChange` is for all other status changes. This is backport from snapd snapcore/snapd#13528.
This PR adds trigger points to the
change-update
notice based on the JU048 spec.NewChange
trigger point is to address thechange spawned
case.notifyStatusChange
is for all other status changes.Note: Change status alternates rapidly betweenDo
andDoing
statuses because it gets computed based on an aggregation of its tasks' statuses which might cause the listener to hit the/v2/changes
a lot hence putting pressure on snapd, no?There are no test currently, I will add unit and spread tests as soon we align on the notice trigger points.