
Live migrating HVM linux (and others?) with storage migration from any older release to 7.6 works but vm dead #111

Open
oallart opened this issue Dec 14, 2018 · 21 comments

5 participants
@oallart commented Dec 14, 2018

Situation:

Any OS (tested with multiple, CentOS 7.6 for reference) running in a VM on a pre-7.6 XCP-ng server using local storage works fine. When live migrating the running VM to a newer XCP-ng 7.6 host, everything seems to work as usual, but the VM is dead on arrival: no console (white screen), yet the VM is marked as running.
The migration was done in XCP-ng Center.

More detail:

  • VM is unresponsive: no console, no network
  • a "Xen tools out of date" message shows up in XenCenter
  • local storage only, no shared storage, no pool or master
  • live migrating from any version (7.1/2/3/4/5) to 7.6 has the same issue
  • live migrating from 7.6 to 7.6 works fine
  • live migrating from any version (7.1/2/3/4/5) to 7.5 works fine
  • rebooting the VM fixes the issue (which defeats the purpose of live migration)
  • no error in console logs
  • updating XS tools does not fix anything (updated, restarted, migrated, same issue)
  • cannot try on XenServer 7.6, since live migration has been restricted on the free tier since XenServer 7.3
  • source and target hardware is identical (brand new supermicro X11)

It looks like something has changed in 7.6. We haven't found any documentation or similar issue reports so far.
For the time being we have been forced to revert to 7.5.0-2. Happy to provide more detail on request.

@nicodemus commented Dec 14, 2018

I had the same issue going from XenServer 6.5 to XCP-NG 7.6. All of the PV guests migrated fine, but every single HVM VM was dead after migration. Some had 6.5 tools installed, some had 7.5, some had none at all. All VMs were Linux, various distros and versions. The VMs showed as running, but the console was non-responsive and the VMs were soft-locked. Restarting each VM was the only way to get them back.

@stormi (Member) commented Dec 14, 2018

This is interesting. I had the same issue yesterday while doing some tests related to #90 (I built a version of XAPI that would allow storage motion during rolling pool upgrade) and was wondering if it was caused by my nested virtualization setup. Turns out it wasn't.

I'll try to reproduce in XenServer.

@stormi (Member) commented Dec 14, 2018

I have reproduced it in a nested XenServer VM. Unless I botched my test, this proves that the same issue exists in XenServer. The migration was done in Xen Orchestra, so it's not specific to XCP-ng Center or XenCenter.

@stormi (Member) commented Dec 14, 2018

I have reported the issue to XenServer's team: https://bugs.xenserver.org/browse/XSO-924

@nicodemus commented Dec 14, 2018

Forgot to add: my setup uses shared NFS storage, versus the submitter's local storage.

@oallart (Author) commented Dec 16, 2018

Thanks @stormi

@stormi (Member) commented Dec 17, 2018

@nicodemus was the migration within a pool being upgraded, or did you create a separate pool for XCP-ng and migrate the VMs from the XS pool to the XCP-ng pool?

@nicodemus commented Dec 17, 2018

@nicodemus was the migration within a pool being upgraded, or did you create a separate pool for XCP-ng and migrate the VMs from the XS pool to the XCP-ng pool?

It was a pool of two servers being upgraded. I evacuated one node and upgraded it from XenServer 6.5 to XCP-NG 7.6. The issue appeared when migrating the VMs off the remaining XS 6.5 box to the XCP 7.6 box. Every HVM guest migrated 'successfully' but was dead and had to be hard reset. All PV guests migrated without a hitch.

@stormi (Member) commented Jan 14, 2019

I've made more tests, here are the results:

  • CentOS 7 64-bit, HVM: bug hit; it is migrated but fails to resume and has to be restarted
  • CentOS 6.6 64-bit, PV: migrates smoothly
  • Windows 7 32-bit, HVM: migrates smoothly

For the CentOS 7 VM, migrations were made from XS 6.5, XCP-ng 7.4 and XCP-ng 7.5 towards either XS 7.6 or XCP-ng 7.6, with the same results in every case. XS 7.6 to XCP-ng 7.6 works fine.

For CentOS 6.6 PV and Windows 7, I only tested migration from XS 6.5 to XS 7.5 and then from XS 7.5 to XS 7.6.

It does not matter whether the VM is migrated from or to local storage or shared storage. What matters is that the VDI has to be migrated, which was the case in all my tests since those were cross-pool migrations.

@olivierlambert (Member) commented Jan 14, 2019

So in short: the bug is triggered by Xen Storage Motion when moving to an XCP-ng (or XS) 7.6 host, for some HVM guests, correct? (Regardless of cross- or intra-pool?)

@stormi (Member) commented Jan 14, 2019

Yes, though intra-pool will require a modified XAPI, because otherwise it won't allow you to migrate with Xen Storage Motion during a pool upgrade (#90).

And "some HVM guests" seems to mean all Linux HVM guests for now, and possibly some others.

@stormi changed the title from "Live migrating with local storage from any release to 7.6 works but vm dead" to "Live migrating with local storage from any older release to 7.6 works but vm dead" Jan 15, 2019

@stormi changed the title from "Live migrating with local storage from any older release to 7.6 works but vm dead" to "Live migrating HVM linux (and others?) with storage migration from any older release to 7.6 works but vm dead" Jan 15, 2019

@oallart (Author) commented Jan 16, 2019

XS has reported the reason for the issue and a workaround:

Until this is fixed, to work around it, either install the VM from one of the other HVM Linux templates (e.g. CentOS 7) or, if the VM already exists, set the device id (xe vm-param-set uuid=... platform:device_id=0001) and reboot the VM before migrating it to XS 7.6.

I'm not happy with this part:

reboot the VM before migrating

Some people have also said that the issue affects them even with platform:device_id=0001 set.

Indeed, most of our systems are created with "Other install media" and the platform device ID is not set.
Digging into that platform device ID yields https://xenbits.xen.org/docs/4.6-testing/misc/pci-device-reservations.txt, which indicates it is a PCI mechanism that has been around for a while. My question is: why is this affecting us only now?
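The workaround quoted above can be sketched as a small helper. This is a dry run: since `xe` only exists in dom0, the helper prints the commands instead of executing them, and the UUID passed to it is a placeholder, not a real VM.

```shell
#!/bin/sh
# Sketch of the workaround described in XSO-924 (assumption: run from
# dom0 in a real setup). Dry run: prints the xe commands rather than
# executing them, so it can be reviewed before use.
print_device_id_workaround() {
  uuid="$1"
  # Set the platform device_id so the VM survives migration to 7.6
  echo "xe vm-param-set uuid=${uuid} platform:device_id=0001"
  # The VM must be rebooted for the new device_id to take effect
  echo "xe vm-reboot uuid=${uuid}"
}

# Placeholder UUID for illustration only
print_device_id_workaround "00000000-0000-0000-0000-000000000000"
```

In dom0 one would drop the `echo`s (or pipe the output to `sh`) after checking the printed commands.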

@stormi (Member) commented Jan 16, 2019

@oallart were your own VMs installed with the "Other install media" template?

@oallart (Author) commented Jan 17, 2019

@stormi absolutely, and all CentOS 7.x in my tests. We typically PXE-boot our VMs to start with.

@stormi added this to In Progress in Team board Feb 7, 2019

@stormi (Member) commented Mar 4, 2019

It will be fixed in the next releases of XenServer and XCP-ng.

The patch seems to be xapi-project/xenopsd@67e12a1

stormi added a commit to xcp-ng-rpms/xenopsd that referenced this issue Mar 4, 2019

@stormi (Member) commented Mar 4, 2019

I have built a version with a backport of the patch that should fix this issue for XCP-ng 7.6. I'm not sure I will push it to everyone after the tests, but at least it's available to anyone who finds this bug report, and I'm nevertheless highly interested in test results from anyone who still has a setup that allows testing it.

Testing an update candidate means basically:

  • Install it on the relevant host(s)
  • Make sure it is used. A reboot is the safest option but not always available. For this specific update, xe-toolstack-restart should be enough.
  • Ensure you see no regression due to the update.
  • If possible, test that it fixes what it's meant to fix.

The patch fixes live migration from older releases of XS or XCP-ng towards XCP-ng 7.6 for VMs that don't have platform:device_id set, that is, mostly VMs created with the "Other install media" template.
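As a rough way to spot affected VMs, one could filter for those with no platform:device_id. The sketch below assumes a captured listing of "<vm-uuid> <device_id-or-dash>" lines; in dom0 such a listing could be built by looping `xe vm-param-get` over `xe vm-list --minimal`, but the filter itself (and the helper name) is hypothetical and testable anywhere.

```shell
#!/bin/sh
# Hypothetical filter: given lines of "<vm-uuid> <device_id-or-dash>",
# print the UUIDs whose device_id is unset (represented here by "-").
filter_vms_without_device_id() {
  awk '$2 == "-" { print $1 }'
}

# Sample captured data; only the second VM lacks a device_id.
printf '%s\n' \
  "aaaa-1111 0001" \
  "bbbb-2222 -" \
  "cccc-3333 0002" | filter_vms_without_device_id
# prints: bbbb-2222
```

Those VMs would be the ones worth migrating first when testing the update candidate.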

To install it:

yum update xenopsd xenopsd-xc xenopsd-xenlight --enablerepo='xcp-ng-updates_testing'

To reinstall the previous version:

yum downgrade xenopsd xenopsd-xc xenopsd-xenlight

@stormi moved this from In Progress to Update candidate in Team board Mar 4, 2019

@Ultra2D commented Mar 7, 2019

Tested this using a VM based on template "Debian Wheezy 7.0 (64-bit)" without device_id set.

A clone of that VM could be migrated to XCP-ng 7.6 with the patch installed on the pool master. Migrating another clone without the patch on the pool master resulted in a stuck VM, so it works!

@stormi (Member) commented Mar 7, 2019

Thanks!

@olivierlambert (Member) commented Mar 7, 2019

Yay!!

@stormi (Member) commented Mar 11, 2019

A point of vigilance regarding this update (and the upcoming XCP-ng 8.0, which includes a similar fix by default): it might make the first in-pool (homogeneous pool) migration of a VM without device_id set fail.

(see https://xcp-ng.org/forum/post/9742)

@stormi (Member) commented Mar 25, 2019

Note: still interested in feedback from the community. Having migration issues is not a requirement for testing the update; installing it and continuing with normal operations is also a way to verify there is no regression. I won't consider pushing it to everyone unless there is enough testing that I can rely on.

@stormi referenced this issue Apr 30, 2019: XCP-ng 8.0 (meta-issue) #180