Skip to content

Conversation

@minglumlu
Copy link
Member

(Two commits are included in this PR. The first one is only to move the position of existing code within the same file to allow the use of module Watcher in module VM in the second one.)

The xenstore watcher maintains a map from domid to VM UUID. This map is used to dispatch the xenstore events. When the VM is renamed, its UUID changes. Hence this map needs to refresh. Otherwise, the xenstore events could not be dispatched to renamed VM.

Signed-off-by: Ming Lu <ming.lu@cloud.com>
@minglumlu minglumlu force-pushed the private/mingl/CA-419908 branch 2 times, most recently from ba900d4 to ccbb428 Compare November 28, 2025 05:48
@minglumlu minglumlu changed the title CA-416410: Update xenstore watcher to refresh domains when VM is renamed CA-419908: Update xenstore watcher to refresh domains when VM is renamed Nov 28, 2025
@minglumlu
Copy link
Member Author

The diff view of the first commit is not easy for reviewers to get the change clearly as it is only for moving some existing code snippets.

Copy link
Contributor

@last-genius last-genius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this fixing a real-world issue? If so, could you describe when it happens, and if there are any symptoms of missed events?

@minglumlu
Copy link
Member Author

minglumlu commented Nov 28, 2025

Is this fixing a real-world issue? If so, could you describe when it happens, and if there are any symptoms of missed events?

The VM renaming usually happens for VM migrations during which, on the migration destination host, the domain of the VM is created with a temporary UUID like 74be1ecb-7a5c-461b-5d5c-000000000001. And at the stage of post-migration, the VM is renamed back with the normal UUID like 4be1ecb-7a5c-461b-5d5c-7bd51d71d392 and then unpaused. Since this point, if the domain mapping (domid -> UUID) is not updated, some xentore events would be lost because the temporary UUID has been replaced with the normal one. Xenopd can't dispatch it for the correct VM.

One case is the PV driver in VM writes feature-poweroff into xenstore after being migrated. The event for this xenstore writing may get lost due to the issue. This causes the migrated VM can't be cleanly shutdown due to lack of the feature in xapi. There may be other symptoms but this is the one we know surely so far.
This is not 100% reproducible as this depends on the timing when the PV driver writes. If it writes before the UPDATES.refresh_vm call in VM.pool_migrate_complete, the xenstore writing result can still be propagated to xapi correctly.

look_for_different_domains () ;
Atomic.set need_refresh_domains false
) else (
if Atomic.exchange need_refresh_domains false then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about wrap the atomic operations in functions:
set_need_refresh_domains
clear_need_refresh_domains
consume_need_refresh_domains

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are too simple that I don't think it is worth wrapping them within the module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be simpler and correct, too?

if path = _introduceDomain || path = _releaseDomain || Atomic.get need_refresh_domains  then (
  look_for_different_domains () ;
  Atomic.set need_refresh_domains false
) else ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs to call watch_fired when it is neither _introduceDomain nor _introduceDomain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to call watch_fired in this rename case? I thought just updating the mapping would be enough.

Does look_for_different_domains detect uuid changes? Looking at domain_looks_different, it may not?

Copy link
Member Author

@minglumlu minglumlu Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to call watch_fired in this rename case?

The fix doesn't make it call watch_fired in the rename case. Instead, it just marks that the watcher should refresh the domain list later when an event is popped up.

Does look_for_different_domains detect uuid changes?

The look_for_different_domains calls list_domains in which Xenctrl.domain_getinfolist is called to get the current domain list with UUIDs. The UUIDs are just the ones written in xenstore by Toolstack. And the renaming is exactly to rewrite the UUIDs in xenstore.

I thought just updating the mapping would be enough.

The map could be updated directly in rename case. But I think it would result in updating from external side on the data structure which is generated from a single source Xenctrl.domain_getinfolist within the watcher. This seems a change more than enough unnecessarily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't split exchange into get and set, that is not atomic anymore, and if some other thread sets the value inbetween you'll miss it.
E.g. while you are busy refreshing the domains another remain request could've come in, that the refresh loop already went past and won't notice.

Copy link
Contributor

@edwintorok edwintorok Dec 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be simpler and correct, too?

if path = _introduceDomain || path = _releaseDomain || Atomic.get need_refresh_domains  then (
  look_for_different_domains () ;
  Atomic.set need_refresh_domains false
) else ..

I think a variation of this might work. You have to do the Atomic.exchange first thing:

if Atomic.exchange need_refresh_domains false || path = _introduceDomain || path = _releaseDomain then
  look_for_different_domains ();
if path = _introduceDomain || path = _releaseDomain then ()
else

And no atomic or look_for_different_domains calls on the else branch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would suggest:

if Atomic.exchange need_refresh_domains false || path = _introduceDomain || path = _releaseDomain then
  look_for_different_domains ();
if path <> _introduceDomain && path <> _releaseDomain then (

)

@psafont
Copy link
Member

psafont commented Nov 28, 2025

Aside from what was described, this also might happen on the sending side where the UUID gets renamed to end in 000. But this may be less critical.

module RRDD = Rrd_client.Client
module StringSet = Set.Make (String)

module IntMap = Map.Make (struct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

module IntMap = Map.Make(Int)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blindly copied in the moving-only commit. Will append another commit for this.

look_for_different_domains () ;
Atomic.set need_refresh_domains false
) else (
if Atomic.exchange need_refresh_domains false then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be simpler and correct, too?

if path = _introduceDomain || path = _releaseDomain || Atomic.get need_refresh_domains  then (
  look_for_different_domains () ;
  Atomic.set need_refresh_domains false
) else ..

@minglumlu
Copy link
Member Author

The rerun jobs passed with the fix.

@minglumlu minglumlu force-pushed the private/mingl/CA-419908 branch from 9c0b20e to ab77d49 Compare December 1, 2025 10:12
The xenstore watcher maintains a map from domid to VM UUID. This map is
used to dispatch the xenstore events. When the VM is renamed, its UUID
changes. Hence this map needs to refresh. Otherwise, the xenstore events
could not be dispatched to renamed VM.

Signed-off-by: Ming Lu <ming.lu@cloud.com>
Signed-off-by: Ming Lu <ming.lu@cloud.com>
@minglumlu minglumlu force-pushed the private/mingl/CA-419908 branch from ab77d49 to 1b7f149 Compare December 1, 2025 11:18
@minglumlu minglumlu added this pull request to the merge queue Dec 1, 2025
Merged via the queue into xapi-project:master with commit 279f707 Dec 1, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants