Migration can fail especially for loaded VMs #72

Closed
stormi opened this issue Oct 15, 2018 · 54 comments

@stormi
Member

stormi commented Oct 15, 2018

Might be related to emu-manager.

See https://xcp-ng.org/forum/topic/522/unable-to-migrate-live-vms-after-upgrading-from-xcp-ng-7-4-to-7-5

@stormi stormi added this to To Do in Team board via automation Oct 15, 2018
@stormi stormi moved this from To Do to In Progress in Team board Oct 15, 2018
@borzel
Member

borzel commented Oct 25, 2018

Did you find any clue?

Background (what I can recall):

  • we were lucky to have a big maintenance window for our upgrade from XS 7.2 to XCP-ng 7.5, so we just shut all VMs down
  • like other people, we also have different CPUs in a pool
  • we also got the message in XCP-ng Center about the change in pool CPU features (which is of course normal if you remove hosts or add hosts with different CPUs, especially in an upgrade situation where you install some of the slaves or the master from scratch).

Steps we did:

  • we upgraded the master
  • we started some VMs on that master (for DHCP/DNS and basic AD stuff)
  • we upgraded some slaves, added them to the pool (CPU pool feature change)
  • tried to migrate a VM from the master -> BOOM -> hang with 100% CPU on just one of the 2 vCPUs
  • solved this situation (kill VM, toolstack restart, ...)
  • booted the VM (now with the new pool CPU featureset)
  • tried another migration -> BOOM (same situation)

I'll try to reproduce this situation on our "3 host different CPU" testpool as soon as I can.

@borzel
Member

borzel commented Oct 28, 2018

emu-manager throws an error: https://xcp-ng.org/forum/post/4961
I could reproduce it on XCP-ng 7.5 by running stress --cpu 1 in the VM

@borzel
Member

borzel commented Oct 28, 2018

could also reproduce with XCP-ng 7.6 :-|

@stormi
Member Author

stormi commented Nov 8, 2018

A patched xcp-emu-manager is now available in the updates_testing media for XCP-ng 7.5 and 7.6.

@stormi stormi changed the title Migration randomly fails Migration can fail especially for loaded VMs Nov 8, 2018
@stormi
Member Author

stormi commented Nov 9, 2018

The patched xcp-emu-manager has been made available as an update for XCP-ng 7.6. However, it only mitigates the bug for now: instead of failing near the end of the migration, it can now wait for the VM to be less loaded so that it can resume processing the migration. It might still fail after some time if the VM's activity never decreases enough.

We're still working on a proper fix for this situation.
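
To illustrate the mitigation described above, here is a minimal Python sketch of a "wait until the guest is quiet enough, but give up eventually" loop. It is not the actual xcp-emu-manager code: the function name `wait_for_low_activity`, the `remaining_pages` callback, the threshold and the timeout are all hypothetical and only show the general pattern.

```python
import time
from typing import Callable


def wait_for_low_activity(
    remaining_pages: Callable[[], int],
    threshold: int = 50,
    timeout_s: float = 60.0,
    poll_interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the guest is quiet enough to finish the migration.

    All names and defaults are hypothetical; they only illustrate waiting
    for a loaded VM instead of failing right away.
    """
    deadline = clock() + timeout_s
    while True:
        # The guest counts as "quiet" when few pages remain to be copied.
        if remaining_pages() <= threshold:
            return True
        # A VM whose activity never decreases still fails, only later.
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

With this pattern a busy VM delays the final step rather than aborting it, which matches the observed behaviour: lightly loaded VMs migrate, while a VM that never calms down can still fail once the timeout expires.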

@phil-flex

The patched xcp-emu-manager has been made available as an update for XCP-ng 7.6. However, it only mitigates the bug for now: instead of failing near the end of the migration, it can now wait for the VM to be less loaded so that it can resume processing the migration. It might still fail after some time if the VM's activity never decreases enough.

We're still working on a proper fix for this situation.

I can confirm that the issue still exists. All VMs in use fail to migrate and hang at 100% for a few days. I have to restart the toolstack and then reboot the VM (which is still on the original host). Is there anything I can do to help, such as capturing logs for specific keywords you can give me? Our server logs contain a lot of messages, and some of them cannot be shared here.

Team board automation moved this from In Progress to Done Nov 19, 2018
@stormi
Member Author

stormi commented Nov 19, 2018

I didn't mean to close it yet. Testing still in progress.

@stormi stormi reopened this Nov 19, 2018
Team board automation moved this from Done to In Progress Nov 19, 2018
@stormi
Member Author

stormi commented Nov 20, 2018

An update candidate is being tested by the community and available in the updates_testing repository.

yum install xcp-emu-manager --enablerepo='xcp-ng-updates_testing'

@larsmaes

Still no luck with the update candidate

@olivierlambert
Member

Can you be more specific?

@borzel
Member

borzel commented Nov 20, 2018

Test with XCP-ng 7.6 (fully patched) and xcp-emu-manager from xcp-ng-updates_testing, VM executes stress --cpu 1 --vm 1 --io 1

-> VM is stuck and doesn't migrate :-/ ... waiting almost 10 minutes

Edit:
I migrated a second VM (without load!) successfully to the other host and back while the migration with load is still stuck.

@olivierlambert
Member

What's your OS? It's working for me on a Debian VM that previously failed but works now with the latest patches. That's frustrating 😞

@johnelse will add more debug output in the next iteration, so we should be able to get the details from everyone without having to reproduce it internally. This way, we'll at least detect all the potential edge cases!

@borzel
Member

borzel commented Nov 20, 2018

@olivierlambert Ubuntu 16.04.4 LTS

@olivierlambert
Member

Okay thanks, I'll try on Ubuntu.

@bplessis

bplessis commented Nov 21, 2018

Is a reboot needed after applying the emu-manager update?

Without rebooting, I tried migrating a heavily loaded VM (Elasticsearch data node on Debian), and the migration resulted in a hung VM, with no network connectivity or console access.

@olivierlambert
Member

olivierlambert commented Nov 21, 2018

Reboot shouldn't be needed.

edit: thanks for your feedback by the way, we are continuing to investigate

@bplessis

OK, for testing I replaced emu-manager with the XS one without rebooting and it worked right away, so yes, no reboot seems to be required ^^'
However, I tried to use the XS 7.6 emu-manager on XCP-ng 7.5 to finish the pool upgrade and it wasn't pretty ...

@stormi
Member Author

stormi commented Nov 22, 2018

Yes, the way emu-manager is called changed so you need the version from 7.5

@vmpr

vmpr commented Nov 27, 2018

I've also tested it on our pool with XCP-ng 7.6 (fully patched) and xcp-emu-manager (version 0.0.7) from xcp-ng-updates_testing; the VM (latest CentOS 7.5) runs stress --cpu 1 --vm 1 --io 1

The VM gets stuck at 100% and is no longer responsive.
After a toolstack restart, the VM died.

Positive: a VM without load moves smoothly.

@fabiorauber

I have experienced this bug on Kubernetes host machines (Ubuntu 18.04 LTS) on XCP-ng 7.5 and 7.6. Migration causes the VM to use too much CPU until it becomes unresponsive. A VM with less load migrates smoothly.

@olivierlambert
Member

We should have a fix on Monday or so, if we manage to finalize the patch.

@Silencer80

@stormi, could you please add an updated package to the 7.5 testing repo?

@stormi
Member Author

stormi commented Dec 21, 2018

I'll see if the fixes can be backported safely.

@stormi
Member Author

stormi commented Dec 24, 2018

The update has been pushed to the updates repository for XCP-ng 7.6. I still need to backport it to 7.5.

@vmpr

vmpr commented Dec 25, 2018

cheers guys for your effort! merry xmas btw :)

@borzel
Member

borzel commented Dec 25, 2018

I did a little test with a loaded VM (stress --cpu 1 --io 1 --vm 1) and the migration failed, but it reported that the VM was not cooperating with the required shutdown. The VM was not interrupted or damaged. All fine :-)

I have to do more tests, but it seems to not corrupt the loaded VM! 🎉

@prowebuk

prowebuk commented Jan 3, 2019

Hi, I upgraded a two-host pool from XS 7.2 to XCP-ng 7.6 and have a number of VMs that won't migrate and require a toolstack restart to recover. Both hosts were updated with yum. One particular VM (guest OS CentOS release 6.3 (Final), 4 GB of memory, no memory ballooning) is under zero load and still refuses to migrate, even after updating the tools to 7.4 from XCP, rebooting the VM and restarting both toolstacks. Daemon logs are available here:

https://www.proweb.net/xcp/daemon.log.source.2019-01-03_04-30-52.txt
https://www.proweb.net/xcp/daemon.log.target.2019-01-03_04-30-58.txt

With XS7.6 emu-manager-1.0.5-1 installed, VM migrates without issue:

https://www.proweb.net/xcp/daemon.source.xs-emu.log.2019-01-03_05-40-56.txt
https://www.proweb.net/xcp/daemon.target.xs-emu.Log.2019-01-03_05-40-57.txt

@olivierlambert
Member

Thanks for the feedback, we'll take a look ASAP :)

@stormi
Member Author

stormi commented Jan 3, 2019

I had a look: our emu-manager exits before the migration can actually start. Logs from the source host:

Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: Command line: -debug -domid 11 -controloutfd 2 -controlinfd 0 -mode listen
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: Adding client on fd 13 as (0)
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: sending {"execute":"migrate_init"}
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: On (0) got message '{"execute":"migrate_init"}'
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver:debug: replying to (0): { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: received { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: sending {"execute":"set_args","arguments":{"pv":"true"}}
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: On (0) got message '{"execute":"set_args","arguments":{"pv":"true"}}'
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver:debug: replying to (0): { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: received { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: Host disconnected fd 13 (0)
Jan  3 04:31:07 pw-im-xen-2 forkexecd: [error|pw-im-xen-2|0 ||forkexecd] 7898 (/usr/lib64/xen/bin/emu-manager -controloutfd 7 -controlinfd 8 -fd 9 -mode sav...) exited with code 2

The last two lines display the issue: xenguest loses contact with emu-manager, because the latter exited.

The problem is, there's no information about the nature of the error. We need to find a way to reproduce and get more information from xcp-emu-manager before that happens.

When you look back at your logs (/var/log/daemon.log), did the failed migrations all fail with the same message? If so, were the preceding lines of logs equivalent to the ones I'm posting above? Do you have notifications about segmentation faults in the output of dmesg?
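
One way to answer the first question is to pull every emu-manager exit out of /var/log/daemon.log and group them by mode and exit code. The Python sketch below assumes the forkexecd line format shown in this thread; `find_emu_manager_exits` and `summarize` are hypothetical helper names, not part of any XCP-ng tool.

```python
import re
from collections import Counter
from typing import Iterable, List, NamedTuple


class EmuManagerExit(NamedTuple):
    timestamp: str
    pid: int
    mode: str  # as logged; forkexecd truncates the command line ("sav...")
    exit_code: int


# Matches forkexecd error lines that report an emu-manager exit.
EXIT_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+[\d:]+)\s+\S+\s+forkexecd: .*?\]\s+"
    r"(?P<pid>\d+)\s+\((?P<cmd>.*emu-manager.*?)\)\s+"
    r"exited with code (?P<code>\d+)"
)
MODE_RE = re.compile(r"-mode\s+(\S+)")


def find_emu_manager_exits(lines: Iterable[str]) -> List[EmuManagerExit]:
    """Collect one record per 'emu-manager ... exited with code N' line."""
    exits = []
    for line in lines:
        m = EXIT_RE.search(line)
        if not m:
            continue
        mode = MODE_RE.search(m.group("cmd"))
        exits.append(
            EmuManagerExit(
                timestamp=m.group("ts"),
                pid=int(m.group("pid")),
                mode=mode.group(1) if mode else "?",
                exit_code=int(m.group("code")),
            )
        )
    return exits


def summarize(exits: Iterable[EmuManagerExit]) -> Counter:
    """Count failures per (mode, exit code) pair."""
    return Counter((e.mode, e.exit_code) for e in exits)
```

Passing an open file object for /var/log/daemon.log to `find_emu_manager_exits` and printing `summarize(...)` of the result shows at a glance whether all failures share the same exit code.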

@prowebuk

prowebuk commented Jan 3, 2019

Hi @stormi

All failed with the same issue:

Jan 2 22:12:32 pw-im-xen-2 forkexecd: [error|pw-im-xen-2|0 ||forkexecd] 25928 (/usr/lib64/xen/bin/emu-manager -controloutfd 7 -controlinfd 8 -fd 9 -mode hvm...) exited with code 2
Jan 2 22:28:20 pw-im-xen-2 forkexecd: [error|pw-im-xen-2|0 ||forkexecd] 2225 (/usr/lib64/xen/bin/emu-manager -controloutfd 7 -controlinfd 8 -fd 9 -mode sav...) exited with code 2
Jan 3 04:31:07 pw-im-xen-2 forkexecd: [error|pw-im-xen-2|0 ||forkexecd] 7898 (/usr/lib64/xen/bin/emu-manager -controloutfd 7 -controlinfd 8 -fd 9 -mode sav...) exited with code 2

No exceptions in dmesg.

@stormi
Member Author

stormi commented Jan 3, 2019

@prowebuk thanks. Did they all fail right after these messages?

Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: On (0) got message '{"execute":"set_args","arguments":{"pv":"true"}}'
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver:debug: replying to (0): { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: received { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: Host disconnected fd 13 (0)

@prowebuk

prowebuk commented Jan 3, 2019

They hung; the second-to-last migration (Jan 2 22:28:20) hung for 5 hours, until I restarted the toolstack.

@stormi
Member Author

stormi commented Jan 3, 2019

I suppose the toolstack does not anticipate a crashing emu-manager. What I wanted to know in my last comment is what the few log messages before the "emu-manager exited with code 2" message were, to check whether it failed at the same stage of the process each time, or at random.
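
Extracting those preceding messages can be scripted. The Python sketch below is one possible approach, assuming the syslog-style line layout seen in this thread; `context_before_exits` and `stage_signature` are hypothetical names. It yields the few lines before each emu-manager exit and strips timestamps, host and process tags so that failures can be compared by stage.

```python
import re
from collections import deque
from typing import Iterable, Iterator, List, Tuple

# "<Mon> <day> <time> <host> <tag>: <message>" -> keep only <message>.
MSG_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+\S+?:\s+(.*)$")


def context_before_exits(
    lines: Iterable[str], before: int = 4
) -> Iterator[List[str]]:
    """Yield the `before` lines preceding each emu-manager exit, plus the exit line."""
    recent: deque = deque(maxlen=before)
    for raw in lines:
        line = raw.rstrip("\n")
        if "emu-manager" in line and "exited with code" in line:
            yield list(recent) + [line]
            recent.clear()
        else:
            recent.append(line)


def stage_signature(block: List[str]) -> Tuple[str, ...]:
    """Messages before the exit line, without timestamp, host or process tag."""
    signature = []
    for line in block[:-1]:
        m = MSG_RE.match(line)
        signature.append(m.group(1) if m else line)
    return tuple(signature)
```

Two failures with equal signatures stopped at the same point in the exchange; differing signatures suggest a different stage.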

@prowebuk

prowebuk commented Jan 3, 2019

@stormi

Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: libempserver:debug: replying to (0): { "event" : "MIGRATION", "data": {"sent": 18645475,"remaining": -1,"iteration": 4001}}
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: progress: Frames iteration 4001: 89367 of 89367 (100%)
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: Checking live policy.  3 / 18645754 for 4002
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: libempserver:debug: replying to (0): { "event" : "MIGRATION", "data": {"sent": 18645754,"remaining": 3,"iteration": 4002}}
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: Checking live policy.  0 / 18645757 for 4003
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: libempserver:debug: replying to (0): { "event" : "MIGRATION", "data": {"sent": 18645757,"remaining": 0,"iteration": 4003}}
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: No dirty pages, finishing migration
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: waiting for suspend
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: libempserver: Host disconnected fd 13 (0)
Jan  2 22:12:32 pw-im-xen-2 xenguest-10-emp[25929]: libempserver: Host disconnected fd 13 (0)
Jan  2 22:12:32 pw-im-xen-2 forkexecd: [error|pw-im-xen-2|0 ||forkexecd] 25928 (/usr/lib64/xen/bin/emu-manager -controloutfd 7 -controlinfd 8 -fd 9 -mode hvm...) exited with code 2

--

Jan  2 22:28:20 pw-im-xen-2 xenguest-3-emp[2226]: libempserver: Adding client on fd 13 as (0)
Jan  2 22:28:20 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: sending {"execute":"migrate_init"}
Jan  2 22:28:20 pw-im-xen-2 xenguest-3-emp[2226]: libempserver: On (0) got message '{"execute":"migrate_init"}'
Jan  2 22:28:20 pw-im-xen-2 xenguest-3-emp[2226]: libempserver:debug: replying to (0): { "return" : {} }
Jan  2 22:28:20 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: received { "return" : {} }
Jan  2 22:28:20 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: sending {"execute":"set_args","arguments":{"pv":"true"}}
Jan  2 22:28:20 pw-im-xen-2 xenguest-3-emp[2226]: libempserver: On (0) got message '{"execute":"set_args","arguments":{"pv":"true"}}'
Jan  2 22:28:20 pw-im-xen-2 xenguest-3-emp[2226]: libempserver:debug: replying to (0): { "return" : {} }
Jan  2 22:28:20 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: received { "return" : {} }
Jan  2 22:28:20 pw-im-xen-2 xenguest-3-emp[2226]: libempserver: Host disconnected fd 13 (0)
Jan  2 22:28:20 pw-im-xen-2 forkexecd: [error|pw-im-xen-2|0 ||forkexecd] 2225 (/usr/lib64/xen/bin/emu-manager -controloutfd 7 -controlinfd 8 -fd 9 -mode sav...) exited with code 2

--

Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: Adding client on fd 13 as (0)
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: sending {"execute":"migrate_init"}
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: On (0) got message '{"execute":"migrate_init"}'
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver:debug: replying to (0): { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: received { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: sending {"execute":"set_args","arguments":{"pv":"true"}}
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: On (0) got message '{"execute":"set_args","arguments":{"pv":"true"}}'
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver:debug: replying to (0): { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 emu-manager: [debug|pw-im-xen-2|0 ||xcp-emu-manager] Xenguest: received { "return" : {} }
Jan  3 04:31:07 pw-im-xen-2 xenguest-11-emp[7899]: libempserver: Host disconnected fd 13 (0)
Jan  3 04:31:07 pw-im-xen-2 forkexecd: [error|pw-im-xen-2|0 ||forkexecd] 7898 (/usr/lib64/xen/bin/emu-manager -controloutfd 7 -controlinfd 8 -fd 9 -mode sav...) exited with code 2

@stormi
Member Author

stormi commented Jan 3, 2019

Thanks. The first one is interesting because it differs from the others and may be when it started breaking. I'd be interested in the full logs for that migration, if it is possible :)
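
In the first excerpt, xenguest was still reporting MIGRATION progress events right before emu-manager exited, which the other two excerpts do not show. A small Python sketch like the following could pull those events out of the full log; it assumes the "replying to (0): {json}" layout seen above, and `migration_events` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable, Iterator


def migration_events(lines: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Yield the "data" payload of each MIGRATION event that xenguest logged."""
    for line in lines:
        _, sep, payload = line.partition("replying to ")
        if not sep:
            continue
        brace = payload.find("{")
        if brace == -1:
            continue
        try:
            message = json.loads(payload[brace:])
        except json.JSONDecodeError:
            # Truncated or non-JSON log line: skip it.
            continue
        if message.get("event") == "MIGRATION":
            yield message["data"]
```

Printing the `iteration` and `remaining` fields of each event shows how far the memory copy had progressed when the process exited.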

@prowebuk

prowebuk commented Jan 3, 2019

@stormi
Member Author

stormi commented Jan 7, 2019

I have managed to reproduce the issue locally. My assumption is that it happens only for PV guests.

@stormi
Member Author

stormi commented Jan 7, 2019

Bug identified, and I think we've got a fix! I managed to live migrate a PV VM that wouldn't migrate before. We'll clean up the code and issue an update candidate. Thanks a lot for the feedback and the logs.

@stormi
Member Author

stormi commented Jan 8, 2019

@prowebuk I've pushed an update candidate, in case you are willing to test it.

Install it with:

yum update xcp-emu-manager-0.0.8-2.x86_64 --enablerepo='xcp-ng-updates_testing'
xe-toolstack-restart

@stormi
Member Author

stormi commented Jan 16, 2019

xcp-emu-manager-0.0.9-1 has been made available to all users through the updates repository for XCP-ng 7.6, and I backported the fixes to xcp-emu-manager-0.0.3-1.4 in XCP-ng 7.5. Local tests are good, waiting for community tests before pushing it too.

@olivierlambert
Member

I think we can close this, what do you think @stormi ?

@stormi
Member Author

stormi commented Feb 6, 2019

The update for 7.5 is still awaiting tests from the community, so I'll keep this open until we have released it to everyone.

@olivierlambert
Member

Understood, let's wait a week or so then :)

@bplessis

bplessis commented Feb 6, 2019

I just tested three migrations of unloaded VMs during a 7.4 => 7.6 upgrade: two VMs hung (PVHVM) and one succeeded (PV).

@olivierlambert
Member

Hi,

This is not an emu-manager issue, but another problem that also exists in XenServer, see https://bugs.xenserver.org/browse/XSO-924

@stormi stormi moved this from In Progress to Done in Team board Feb 19, 2019
@stormi
Member Author

stormi commented Feb 19, 2019

Closing now that the update for XCP-ng 7.5 has been pushed too.

@olivierlambert
Member

@stormi so we should close this one, right?

@stormi
Member Author

stormi commented Feb 27, 2019

Indeed, looks like I forgot to push the close button with my last comment :)
