Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

o/state: record change-update notices on change status updates #13528

Merged
merged 1 commit into from Feb 27, 2024

Conversation

ZeyadYasser
Copy link
Collaborator

@ZeyadYasser ZeyadYasser commented Jan 30, 2024

This PR adds trigger points to the change-update notice based on the JU048 spec.

  • NewChange trigger point is to address the change spawned case.
  • notifyStatusChange is for all other status changes.

Note: Change status alternates rapidly between Do and Doing statuses because it gets computed based on an aggregation of its tasks' statuses which might cause the listener to hit the /v2/changes a lot hence putting pressure on snapd, no?

There are no test currently, I will add unit and spread tests as soon we align on the notice trigger points.

@ZeyadYasser ZeyadYasser added Needs Samuele review Needs a review from Samuele before it can land Skip spread Indicate that spread job should not run labels Jan 30, 2024
@Meulengracht
Copy link
Member

Just tagging @benhoyt for awareness

@@ -423,11 +423,30 @@ func (c *Change) Status() Status {
panic(fmt.Sprintf("internal error: cannot process change status: %v", statusStats))
}

// addNotice triggers a change-update notice.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually use the word "records" for notices, and it's "an occurrence". Additionally, I think it would be good to document what the key is. So perhaps:

Suggested change
// addNotice triggers a change-update notice.
// addNotice records an occurrence of a change-update notice for this change.
// The notice key is set to the change ID.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This much better, thanks! updated!

// tasks' statuses so we might be sending a lot of change-update
// notices.
if err := c.addNotice(); err != nil {
logger.Debugf(`internal error: failed to add "change-update" notice: %v`, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should panic here, as this will only happen if we've got a programmer bug in addNotice (State.AddNotice only returns an error if you the notice type is invalid or the key is an empty string). So I think we should be loud about an internal error/bug here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to be careful with panics, but I agree, it makes sense here. updated, thank you!

opts := &AddNoticeOptions{
Data: map[string]string{
ChangeUpdateNoticeKindDataKey: c.Kind(),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A zero value for AddNoticeOptions.RepeatAfter means that it will "repeat" for every occurrence, that is, show up at the top of the list again when you list notices (or return immediately when you're long-polling for notices). Is that what we want? See also discussion below about alternating rapidly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to have it repeat for every occurrence.

On one hand, I think it is a bit spammy but on the other hand it makes sense for listeners that are waiting for a specific change-status like "Done", it would be bad to miss it because it was suppressed through repeat-after.

@pedronis @Meulengracht would love to hear your thoughts.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a good idea to have a repeat-after as a general thing for changes. You might miss important change events. Like a change entering error status, done, or waiting.

Comment on lines 443 to 446
// Note: Change status alternates rapidly between “Do” and “Doing”
// statuses because it gets computed based on an aggregation of its
// tasks' statuses so we might be sending a lot of change-update
// notices.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty concerned about this. Can you give an example? If it is alternating rapidly like this, is that really an existing bug? How could we avoid it?

Also, in the notices spec JU048 we describe that a notice will only be recorded when the change is spawned or its status changes:

This notice is recorded when a change is updated: spawned, or status updated (for example, from Doing to Done or Error)

I notice that this is in notifyStatusChange, which looks like it's called whenever a task status changes. I'm presuming the if c.lastObservedStatus == new { return } block prevents that?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a bug, you are right that if c.lastObservedStatus == new { return } prevents that, it is actually a side-effect of having a change's status be computed based on its tasks' statuses, which makes sense since a change is just a collection of tasks.

This is an example I captured through a local run, notice the high number of occurrences:

    {
      "id": "2",
      "user-id": null,
      "type": "change-update",
      "key": "827",
      "first-occurred": "2024-01-31T12:12:41.764964306Z",
      "last-occurred": "2024-01-31T12:12:46.49646734Z",
      "last-repeated": "2024-01-31T12:12:46.49646734Z",
      "occurrences": 21,                                    <=
      "last-data": {
        "kind": "remove-snap"
      },
      "expire-after": "168h0m0s"
    }

Do we want to hide this alternating fact from listeners somehow? there is a good case for this to avoid clients spamming the daemon for information about those (not so important) notices.

CC: @Meulengracht @pedronis

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more a fundamental thing around changes. The Change.Status() reflects the current status of the change based on the current status of all tasks in that change. (In a prioritised order). So when a task completes, the change goes back to 'Do' which is correct as the change is not 'Doing' before the next task spawns into 'Doing'.

However obviously this makes it a bit spammy, especially for changes with many linked tasks that wait for one other. We may either handle it at the change level (i.e a more complex status logic that assumes nothing goes wrong between the end of one task, and the start of the next, assuming one can actually start), but I'm not sure I like this. Or we can handle it specifically on a level for notices. (I.e don't track statuses that go back to 'Do' only for notices, meaning we ignore 'Doing' => 'Do' and the returning 'Do' => 'Doing')

It's a bit unelegant and annoying to have special code for, but I prefer the last solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with @Meulengracht thinking here. But it seems to me that to implement that idea we need to keep some state in the change about on what status we emitted the notice last?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pedronis Exactly, we can handle it by keeping even more state on the Change which I personally would really like to avoid (or maybe we can somehow reuse the existing change status tracking in a more elegant way? then I would be more open to it), or we can add special handling code in the Notice system, which specifically deals with this behavior, since that is where the concern is, and this might be the better solution, as it would avoid us keeping more complex "previous-status" tracking on the change state.

I would really like to know what you think would make more sense in this case @pedronis

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, but on the contrary you meed to pick something that makes sense whether is the initialisation or not, maybe RepeatCheckData should take both the old and the new value (from the options) instead, that was sort on my mind

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benhoyt @pedronis @Meulengracht @olivercalder This PR #13576 is attempt at stateful repeat checks as suggested by @pedronis.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe RepeatCheckData should take both the old and the new value (from the options) instead, that was sort on my mind

@pedronis I don't fully understand this, do you mean doing set/get through pointer references to RepeatCheckData in options directly?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the challenges we have I made a new more concrete proposal about the signature of RepeatCheck in your other PR

Copy link
Contributor

@benhoyt benhoyt Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just circling back to this thread with a link to my comment on that PR (which I'm not comfortable with): #13576 (comment). And thanks very much for your proofs-of-concept and the links to them above, @ZeyadYasser.

const (
// Data key used for holding change kind in a change-update notice.
// TODO: Should we add strict type validation for notices data keys?
ChangeUpdateNoticeKindDataKey string = "kind"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need strict type validation here. We can just have a test for it.

I think we should make this constant name lowercase / non-exported, i.e., changeUpdateNoticeKindDataKey. However, at that point maybe we should just use "kind" directly in the map? It only occurs once, and I think having the constant obscures more than it helps.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

@@ -343,6 +343,10 @@ func (s *State) NewChange(kind, summary string) *Change {
id := strconv.Itoa(s.lastChangeId)
chg := newChange(s, id, kind, summary)
s.changes[id] = chg
// Add change-update notice for newly spawned change
if err := chg.addNotice(); err != nil {
logger.Debugf(`internal error: failed to add "change-update" notice: %v`, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about possibly panicking (as this shouldn't happen unless we have a programming bug).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated!

func (c *Change) addNotice() error {
opts := &AddNoticeOptions{
Data: map[string]string{
ChangeUpdateNoticeKindDataKey: c.Kind(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the spec we also include the change ID in the data with an "id" field. I'm not actually sure that's necessary, because you'll always have the key (which is the change ID). But I thought I'd point it out -- I'm interested to hear your thoughts.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to think of value it could add for notice listener, I am sure the spec was written with some use-case in mind but at least in snapd we don't have a good case for it.

CC: @pedronis @Meulengracht

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to speak with someone that had a use-case in mind, for now maybe not add it and wait for someone to ask for it? It's easier to add things than remove them.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would make sense only if we thought that we should not expose as fixed that the key is the change-id but I don't think we want not to expose that? I don't see much harm in adding it though either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to expose that the key is the change-id as part of the documented behaviour. So let's leave it out for now; as @Meulengracht said, it's much easier to add later if we need it than remove it.

@ZeyadYasser
Copy link
Collaborator Author

@benhoyt @pedronis @Meulengracht @olivercalder PR is updated (with tests) to include suggestions here.

Copy link
Member

@olivercalder olivercalder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the updates. I really like this approach, with the decision and all stored data occurring in the change code.

One small nitpick about naming, but otherwise looks great to me!

@@ -149,6 +149,7 @@ type Change struct {
taskIDs []string
ready chan struct{}
lastObservedStatus Status
lastNoticedStatus Status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: I don't love lastNoticedStatus since "noticed" in this case is a bit of a (clever) double-entendre. I had to check to be sure "noticed" referred to "recorded notice" rather than "noticed" as in "observed". Maybe lastRecordedNoticeStatus would be a bit clearer, even if it is more verbose?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite being the one to suggest that name, I agree with this -- thanks. Clear is better than clever!

Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, and a request for another test or two to test the new status-changed logic in the state package.

c.Check(n["type"], Equals, "change-update")
c.Check(n["key"], Equals, chg.ID())
c.Check(n["last-data"], DeepEquals, map[string]any{"kind": "install"})
c.Check(n["occurrences"], Equals, 1.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for this test. However, it seems to me we should also have a test here (that is, in the state package) for the "status change" code. I realize it's being exercised over in the snapstate package, but I think we should be exercising state functionality here (and the snapstate package won't be moved over to Pebble, so then we'll lose the test in Pebble).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, Thank you! I added additional tests under change_test.go.

My initial thoughts was to add the tests to the state package, but went with snapstate instead to avoid having hand-crafted synthetic tests and have a real-world change tested instead. I think it would be a good idea to keep those in snapd for good measure.

// Add change-update notice for status change
if !shouldSkipChangeUpdateNotice(c.lastNoticedStatus, new) {
if err := c.addNotice(); err != nil {
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we include the context here to differentiate, so if this does happen, we can differentiate this from the newly-spawned message?

Suggested change
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err)
logger.Panicf(`internal error: failed to add "change-update" notice on status change: %v`, err)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -343,6 +343,10 @@ func (s *State) NewChange(kind, summary string) *Change {
id := strconv.Itoa(s.lastChangeId)
chg := newChange(s, id, kind, summary)
s.changes[id] = chg
// Add change-update notice for newly spawned change
if err := chg.addNotice(); err != nil {
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above:

Suggested change
logger.Panicf(`internal error: failed to add "change-update" notice: %v`, err)
logger.Panicf(`internal error: failed to add "change-update" notice for new change: %v`, err)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thank you!

@@ -149,6 +149,7 @@ type Change struct {
taskIDs []string
ready chan struct{}
lastObservedStatus Status
lastNoticedStatus Status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite being the one to suggest that name, I agree with this -- thanks. Clear is better than clever!

func (c *Change) notifyStatusChange(new Status) {
if c.lastObservedStatus == new {
return
}
c.state.notifyChangeStatusChangedHandlers(c, c.lastObservedStatus, new)
c.lastObservedStatus = new
// Add change-update notice for status change
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine now, but can I suggest we change the two "skips" to the same structure, rather than having an early return above? That way if we ever tweak the lastObservedStatus logic, we won't accidentally skip the change-update check by returning early. Suggestion:

func (c *Change) notifyStatusChange(new Status) {
    if c.lastObservedStatus != new {
        c.state.notifyChangeStatusChangedHandlers(c, c.lastObservedStatus, new)
        c.lastObservedStatus = new
    }
    if !shouldSkipChangeUpdateNotice(c.lastNoticedStatus, new) {
        // addNotice()
        c.lastNoticedStatus = new
    }
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, Thanks! updated

for i := 0; i < 5; i++ {
addTask("fake-task")
}
addTask("error-trigger")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my benefit: how does this trigger an error and Undo/Abort sequence?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just a mock task handler that always fails, it is setup here:

// For triggering errors
erroringHandler := func(task *state.Task, _ *tomb.Tomb) error {
return errors.New("error out")
}
s.o.TaskRunner().AddHandler("error-trigger", erroringHandler, nil)

c.Check(n["type"], Equals, "change-update")
c.Check(n["key"], Equals, chg.ID())
c.Check(n["last-data"], DeepEquals, map[string]any{"kind": "refresh"})
// Default -> Doing -> Undo -> abort -> Undo -> Error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Micro-nit:

Suggested change
// Default -> Doing -> Undo -> abort -> Undo -> Error
// Default -> Doing -> Undo -> Abort -> Undo -> Error

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thank you

@benhoyt
Copy link
Contributor

benhoyt commented Feb 22, 2024

But overall, this is looking really nice, and is very straight-forward -- thanks!

@codecov-commenter
Copy link

Codecov Report

Attention: 6 lines in your changes are missing coverage. Please review.

Comparison is base (6a7ecfe) 78.88% compared to head (a946ece) 78.87%.
Report is 3 commits behind head on master.

Files Patch % Lines
overlord/state/change.go 83.33% 2 Missing and 1 partial ⚠️
overlord/state/state.go 25.00% 2 Missing and 1 partial ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #13528      +/-   ##
==========================================
- Coverage   78.88%   78.87%   -0.01%     
==========================================
  Files        1037     1037              
  Lines      132757   132872     +115     
==========================================
+ Hits       104722   104809      +87     
- Misses      21503    21523      +20     
- Partials     6532     6540       +8     
Flag Coverage Δ
unittests 78.87% <72.72%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ZeyadYasser ZeyadYasser changed the title o/state: trigger change-update notices on change status updates o/state: record change-update notices on change status updates Feb 22, 2024
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

overlord/state/change.go Show resolved Hide resolved
Copy link
Contributor

@benhoyt benhoyt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for the updates!

olivercalder
olivercalder previously approved these changes Feb 22, 2024
Copy link
Member

@olivercalder olivercalder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! LGTM now, pending the decision on whether to save lastRecordedNoticeStatus to disk.

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, couple of small comments

@@ -184,6 +185,8 @@ type marshalledChange struct {

SpawnTime time.Time `json:"spawn-time"`
ReadyTime *time.Time `json:"ready-time,omitempty"`

LastRecordedNoticeStatus Status `json:"last-recorded-notice-status"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add omitempty here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated, thank you!

c.lastObservedStatus = new
}
if !shouldSkipChangeUpdateNotice(c.lastRecordedNoticeStatus, new) {
if err := c.addNotice(); err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment before this:

// implies State.writing()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, thanks! updated

@pedronis pedronis dismissed olivercalder’s stale review February 23, 2024 16:30

some additions merit a re-review

* o/state: address review comments (thanks @benhoyt)
	* panic when addNotice fails to signal an internal programmer error
	* better comments

* o/state: skip alternating change-update notices

* o/state: address review comments (thanks @olivercalder and @benhoyt)

* o/state: save change notice status to disk (thanks @pedronis)

* o/state: add omitempty to last-recorded-notice-status change state (thanks @pedronis)

Signed-off-by: Zeyad Gouda <zeyad.gouda@canonical.com>
Copy link
Member

@Meulengracht Meulengracht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you

@Meulengracht Meulengracht merged commit 360fec7 into snapcore:master Feb 27, 2024
39 of 43 checks passed
@ernestl ernestl added this to the 2.62 milestone Feb 29, 2024
ZeyadYasser added a commit to ZeyadYasser/pebble that referenced this pull request Mar 9, 2024
This is backport from snapd.

snapcore/snapd#13528

Signed-off-by: Zeyad Gouda <zeyad.gouda@canonical.com>
benhoyt pushed a commit to canonical/pebble that referenced this pull request Mar 11, 2024
)

This PR records change-update notices based as mentioned in the
JU048 spec.

- `NewChange` addresses the change spawned case.
- `notifyStatusChange` is for all other status changes.

This is backport from snapd snapcore/snapd#13528.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Samuele review Needs a review from Samuele before it can land
Projects
None yet
7 participants