
XCP-ng 8.0 / CH 8.0 coalesce issues #298

Closed
stormi opened this issue Oct 25, 2019 · 29 comments
@stormi
Member

stormi commented Oct 25, 2019

If I understood correctly, XCP-ng 8.0 inherited a regression from Citrix Hypervisor 8.0 regarding the coalesce process.

Try to backport the fixes (one to fix the army of zombies, another to fix never-ending coalesce) from the upstream sm repository to fix them in XCP-ng.

More about the issues: https://bugs.xenserver.org/browse/XSO-966

@stormi stormi self-assigned this Oct 25, 2019
stormi added a commit to xcp-ng-rpms/sm that referenced this issue Oct 25, 2019
- Fixes "army of zombies" and never ending coalesce
- xcp-ng/xcp#298
@stormi stormi modified the milestone: XCP-ng 8.1 Oct 25, 2019
@stormi
Member Author

stormi commented Oct 25, 2019

A test package is available:

yum install sm sm-rawhba --enablerepo=xcp-ng-testing

@rizaemet

Hello,
I have a never-ending coalesce issue and want to try the test package. Is a server reboot required after installation? What should I pay attention to?

@olivierlambert
Member

Please reboot, yes. Nothing else to do.

@rizaemet

Maybe it's not related to this issue, but my problem is happening as before. This block keeps looping:

Nov 13 12:31:27 XenServer-08 SMGC: [13829]   Running VHD coalesce on *ad1f957a[VHD](1800.000G//2.500G|n)
Nov 13 12:31:27 XenServer-08 SM: [27829] ['/usr/bin/vhd-util', 'coalesce', '--debug', '-n', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/VHD-ad1f957a-9f60-48d0-8833-e7b7fd19dde5']
Nov 13 12:32:14 XenServer-08 SM: [13829] ['/sbin/lvremove', '-f', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/coalesce_ad1f957a-9f60-48d0-8833-e7b7fd19dde5_1']
Nov 13 12:32:15 XenServer-08 SM: [13829] ['/sbin/dmsetup', 'status', 'VG_XenStorage--25e96f21--b8d3--f880--3df5--f71897c52429-coalesce_ad1f957a--9f60--48d0--8833--e7b7fd19dde5_1']
Nov 13 12:32:29 XenServer-08 SM: [13829] ['/sbin/lvcreate', '-n', 'coalesce_166de08f-4c3d-4b0f-9f72-2f3c1213e5cc_1', '-L', '4', 'VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429', '--addtag', 'journaler', '-W', 'n']
Nov 13 12:41:56 XenServer-08 SMGC: [13829]   Running VHD coalesce on *166de08f[VHD](1800.000G//1.910G|n)
Nov 13 12:41:56 XenServer-08 SM: [5526] ['/usr/bin/vhd-util', 'coalesce', '--debug', '-n', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/VHD-166de08f-4c3d-4b0f-9f72-2f3c1213e5cc']
Nov 13 12:42:33 XenServer-08 SM: [13829] ['/sbin/lvremove', '-f', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/coalesce_166de08f-4c3d-4b0f-9f72-2f3c1213e5cc_1']
Nov 13 12:42:33 XenServer-08 SM: [13829] ['/sbin/dmsetup', 'status', 'VG_XenStorage--25e96f21--b8d3--f880--3df5--f71897c52429-coalesce_166de08f--4c3d--4b0f--9f72--2f3c1213e5cc_1']
Nov 13 12:42:47 XenServer-08 SM: [13829] ['/sbin/lvcreate', '-n', 'coalesce_139f02bf-5a1d-4e1c-958d-68dfe8ac478c_1', '-L', '4', 'VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429', '--addtag', 'journaler', '-W', 'n']
Nov 13 12:52:13 XenServer-08 SMGC: [13829]   Running VHD coalesce on *139f02bf[VHD](1800.000G//1.785G|n)
Nov 13 12:52:13 XenServer-08 SM: [15949] ['/usr/bin/vhd-util', 'coalesce', '--debug', '-n', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/VHD-139f02bf-5a1d-4e1c-958d-68dfe8ac478c']
Nov 13 12:52:48 XenServer-08 SM: [13829] ['/sbin/lvremove', '-f', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/coalesce_139f02bf-5a1d-4e1c-958d-68dfe8ac478c_1']
Nov 13 12:52:48 XenServer-08 SM: [13829] ['/sbin/dmsetup', 'status', 'VG_XenStorage--25e96f21--b8d3--f880--3df5--f71897c52429-coalesce_139f02bf--5a1d--4e1c--958d--68dfe8ac478c_1']
Nov 13 12:52:55 XenServer-08 SMGC: [13829] Snapshot-coalesce did not help, abandoning attempts
Nov 13 12:52:55 XenServer-08 SMGC: [13829] Set leaf-coalesce = offline for deac1149[VHD](1800.000G/1.802G/1803.523G|a)
Nov 13 12:52:55 XenServer-08 SM: [13829] Raising exception [204, Gave up on leaf coalesce after leaf grew bigger than before snapshot taken [opterr=VDI=deac1149[VHD](1800.000G/1.802G/1803.523G|a)]]
Nov 13 12:52:55 XenServer-08 SMGC: [13829] Removed leaf-coalesce from deac1149[VHD](1800.000G/1.802G/1803.523G|a)
Nov 13 12:52:56 XenServer-08 SMGC: [13829] gc: EXCEPTION <class 'SR.SROSError'>, Gave up on leaf coalesce after leaf grew bigger than before snapshot taken [opterr=VDI=deac1149[VHD](1800.000G/1.802G/1803.523G|a)]
Nov 13 12:52:56 XenServer-08 SMGC: [13829]     sr.coalesceLeaf(candidate, dryRun)
Nov 13 12:52:56 XenServer-08 SMGC: [13829]   File "/opt/xensource/sm/cleanup.py", line 1587, in coalesceLeaf
Nov 13 12:52:56 XenServer-08 SMGC: [13829]     self._coalesceLeaf(vdi)
Nov 13 12:52:56 XenServer-08 SMGC: [13829]   File "/opt/xensource/sm/cleanup.py", line 1788, in _coalesceLeaf

@olivierlambert
Member

The zombie process issue is solved by the patch. There may still be a problem on the last leaf, but not at further depths (in short, it works until reaching a depth of 1; the final child can't be merged, for reasons we are investigating).

This seems to happen only on LVM based storage, not file based.

@rizaemet

So I will wait for your investigation. I'm ready to help if you need tests or logs.

@danieldemoraisgurgel

Olivier, what is XCP users' feedback on this coalesce problem?

We are evaluating migrating our hosts to XCP (given its active development and closer customer proximity). As reported in https://bugs.xenserver.org/browse/XSO-966, the backup process is a difficult task for our customers because the coalesce fails after the snapshot is created.

We would move 20 hosts from CH 8 to XCP 8; is there any recommendation after the host update? Do xenserver-tools keep working properly, or do we need to install the XCP-ng agent itself?

@nagilum99

I upgraded XenServer to XCP-ng without touching the HV tools, and it works perfectly.
It would even be a possible solution for XCP-ng to grab the XS/CH tools to install inside the guests, as the OSS drivers for Windows are a bit tricky.

@olivierlambert
Member

@danieldemoraisgurgel we have a patch on XCP-ng 8.0. Please open a support ticket if you want assistance on that. Sadly, Citrix won't make a patch on CH 8.0.

@danieldemoraisgurgel

We are updating one of our clusters (migrating from CH 8 to XCP 8). Next we will test the available update and see whether we get positive results with the coalesce process.

I believe that if we succeed, it will be the first step in migrating our Citrix infrastructure to XCP, and we will soon be signing a support contract! ;-)

@olivierlambert
Member

The patch should fix all zombie processes; we verified that with our customers. There is still the final leaf that can't coalesce in all cases, but the impact is almost invisible.

@stormi
Member Author

stormi commented Nov 22, 2019

Update pushed to XCP-ng 8.0

@stormi stormi closed this as completed Nov 22, 2019
@danieldemoraisgurgel

Thank you Stormi.
I'll be applying and validating the patch.

@danieldemoraisgurgel

@stormi the update does solve the zombie process problem, but after backup/snapshot removal every VM is still left with one disk stuck in the leaf chain.

I am also testing the update made available at https://support.citrix.com/article/CTX265619 in an XS 7.1 pool.

@BogdanRudas

I've played with LIVE_LEAF_COALESCE_MAX_SIZE in /opt/xensource/sm/cleanup.py (on XS 7.1) and it helps a bit.
There is also LIVE_LEAF_COALESCE_TIMEOUT, which I haven't explored yet.

If you want to play with this, be sure to stop coalescing gracefully first using /opt/xensource/sm/cleanup.py -a -u SR-UUID, and then run it again with xe sr-scan uuid=SR-UUID
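The stop-and-rescan procedure above can be sketched as a small wrapper. This is a minimal sketch, assuming a dom0 with the standard sm layout; the function name and the injectable `run` parameter are my own, for illustration:

```python
# Sketch: gracefully stop the GC/coalesce for an SR, then re-trigger it.
# Commands mirror the procedure described in this thread; run only in dom0.
import subprocess


def restart_coalesce(sr_uuid, run=subprocess.check_call):
    # Abort any in-progress GC/coalesce for the SR gracefully.
    abort_cmd = ["/opt/xensource/sm/cleanup.py", "-a", "-u", sr_uuid]
    # Re-scan the SR, which re-triggers the coalesce logic.
    rescan_cmd = ["xe", "sr-scan", f"uuid={sr_uuid}"]
    run(abort_cmd)
    run(rescan_cmd)
    return [abort_cmd, rescan_cmd]


# With a no-op runner we can inspect the commands without executing them:
cmds = restart_coalesce("SR-UUID", run=lambda cmd: None)
print(cmds[1])  # ['xe', 'sr-scan', 'uuid=SR-UUID']
```

Passing a custom `run` callable makes the sketch easy to dry-run without touching a host.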

@olivierlambert
Member

@danieldemoraisgurgel we know about the "last leaf". We are making XO backups less strict so they proceed even when the final leaf is uncoalesced. In the meantime, we'll experiment with @BogdanRudas's interesting suggestions. Thanks everyone!

@danieldemoraisgurgel

My perception in XCP 8 is as follows:

  • After backup, every VM was left with one disk frozen in the leaf tree.
  • Even after pausing the VM and rescanning the SR, the coalesce process does not start.

For CH 7.1 with the XS71ECU2020 update, the coalesce process completed 100% after pausing the VMs. We will now back it up again and see whether the coalesce runs 100% again.

I used the default LIVE_LEAF_COALESCE_TIMEOUT=10.
The new test will be with LIVE_LEAF_COALESCE_TIMEOUT=300.

@danieldemoraisgurgel

The strange thing is that I had to shut the VMs down, rescan the SR, and then power them on again.

The coalesce process then began on the linked VMs (in production) and completed successfully. The following values were changed in /opt/xensource/sm/cleanup.py:

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024 # bytes
LIVE_LEAF_COALESCE_TIMEOUT = 300 # seconds

Well, apparently everything is OK... we will see in our next backup whether it is necessary to shut down the VMs for the coalesce to start and complete correctly.

@olivierlambert
Member

Okay please keep us posted 👍 Thanks for your report!

@danieldemoraisgurgel

After the change described above, the backup completed with 100% success, and no disk is stuck in the coalesce chain.

We're migrating another cluster to XCP-ng 8!
Thanks for the support, the quick responses, and the attention.

@olivierlambert
Member

So to recap:

  1. XCP-ng 8.0 with latest patches (including the sm-driver fix)
  2. Changing LIVE_LEAF_COALESCE_TIMEOUT in sm/cleanup.py from 10 to 300

Do you confirm you also had to change LIVE_LEAF_COALESCE_MAX_SIZE to make it work?

@danieldemoraisgurgel

@olivierlambert I changed these values in sm/cleanup.py:

from: LIVE_LEAF_COALESCE_MAX_SIZE = 20 * 1024 * 1024 
to: LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024 
from: LIVE_LEAF_COALESCE_TIMEOUT = 10
to: LIVE_LEAF_COALESCE_TIMEOUT = 300
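For reference, a quick sanity check of what those values amount to (the constant names match cleanup.py; the arithmetic is my own, not from the thread):

```python
# Byte arithmetic for the cleanup.py tuning discussed in this thread.
MiB = 1024 * 1024

DEFAULT_LIVE_LEAF_COALESCE_MAX_SIZE = 20 * MiB    # stock value: 20 MiB
TUNED_LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * MiB    # tuned value: 1 GiB
DEFAULT_LIVE_LEAF_COALESCE_TIMEOUT = 10           # stock value, seconds
TUNED_LIVE_LEAF_COALESCE_TIMEOUT = 300            # tuned value, seconds

# The tuned size cap is about 51x the stock cap; the timeout is 30x longer.
print(TUNED_LIVE_LEAF_COALESCE_MAX_SIZE)  # 1073741824
```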

The coalesce process successfully completed in all cases, without the need to shut down the servers.

I can't tell you how much the value of LIVE_LEAF_COALESCE_MAX_SIZE affects the process, but since we spent several months with this problem, I suspect the default value wasn't large enough for the number of bytes the process needs to complete.

After the patch and these changes, I can see that the coalesce process is finally working properly (our environment has more than 350 VMs and about 98TB).

We are migrating to XCP-ng 8 :-)

@olivierlambert
Member

Nice! We'll see if we can raise those values by default in XCP-ng.

@DavorSaric

DavorSaric commented Dec 16, 2019

Hello,

I am on:
PRODUCT_VERSION='8.0.0'
INSTALLATION_DATE='2019-11-15 ...'

Local storage here (the server uses 2x SSD in an mdadm RAID 1).

I didn't install anything from testing repo so my versions are:

sm-2.2.3-1.0.2.xcpng8.0.x86_64
sm-rawhba-2.2.3-1.0.2.xcpng8.0.x86_64

4 VMs on XCP-ng with disk sizes of 500GB, 80GB, 50GB and 20GB. I got a report of a failed backup due to an SR "not enough space" error. I checked the chains with:

vhd-util scan -m "VHD-*" -f -c -l VG_XenStorage-SR-UUID -p -v

I saw chains only on the largest disk (500GB), and they were not being removed. I reconfigured with the following solution:

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024 
LIVE_LEAF_COALESCE_TIMEOUT = 300

I then re-ran the scan and all chains were deleted, so now it looks like this:

vhd=VHD-UUID1 capacity=21474836480 size=21525168128 hidden=0 parent=none
vhd=VHD-UUID2 capacity=53687091200 size=53800337408 hidden=0 parent=none
vhd=VHD-UUID3 capacity=85899345920 size=86075506688 hidden=0 parent=none
vhd=VHD-UUID4 capacity=536870912000 size=537927876608 hidden=0 parent=none

I guess I do not need sm and sm-rawhba from testing repo?
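The `vhd-util scan` output above can be checked programmatically. Below is a minimal sketch (the helper names are hypothetical, not part of sm) that flags VHDs which still have a parent, i.e. chains that have not yet coalesced:

```python
# Parse `vhd-util scan`-style lines and report VHDs still in a chain.
def parse_vhd_scan(output):
    vhds = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("vhd="):
            continue
        # Each line is space-separated key=value pairs.
        vhds.append(dict(field.split("=", 1) for field in line.split()))
    return vhds


def pending_coalesce(vhds):
    # A VHD whose parent is not "none" is part of an uncoalesced chain.
    return [v["vhd"] for v in vhds if v.get("parent", "none") != "none"]


sample = """\
vhd=VHD-UUID1 capacity=21474836480 size=21525168128 hidden=0 parent=none
vhd=VHD-UUID2 capacity=53687091200 size=53800337408 hidden=1 parent=VHD-UUID1
"""
print(pending_coalesce(parse_vhd_scan(sample)))  # ['VHD-UUID2']
```

In a healthy SR, like the one shown above after the fix, every entry has `parent=none` and the pending list is empty.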

@MrOffline77

Thanks @danieldemoraisgurgel, I had been trying to fix this for weeks 🥇.
I can confirm that the above steps worked for me.

@stormi
Member Author

stormi commented Feb 21, 2020

New logic for leaf coalesce has been backported from upstream into XCP-ng 8.1 beta: see https://xcp-ng.org/forum/post/22794

Feedback highly welcome!

@DavorSaric

Hi,

could I have any issues on XenServer 7.1 if I just change the two values below and re-run the coalesce? It works on XCP-ng 8.0 without the patches from 8.1, so could it be that it will also work on 7.1? I am in production with over 340 VMs.

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024
LIVE_LEAF_COALESCE_TIMEOUT = 300

We do not have these patches on 7.1, as we are not using paid LTS support:
https://support.citrix.com/article/CTX265619
blktap-3.5.0-xs.2+1.0_71.2.6.x86_64.rpm
sm-1.17.0-xs.2+1.0_71.2.4.x86_64.rpm
sm-rawhba-1.17.0-xs.2+1.0_71.2.4.x86_64.rpm

@olivierlambert
Member

You might try, but we obviously can't tell you more, because we have no experience with the result on 7.1. I strongly suggest you consider migrating to XCP-ng at some point 👍

@DavorSaric

Thanks 👍
