
XCP-ng 8.0 / CH 8.0 coalesce issues #298

Closed
stormi opened this issue Oct 25, 2019 · 29 comments
@stormi
Member

stormi commented Oct 25, 2019

If I understood correctly, XCP-ng 8.0 inherited a regression from Citrix Hypervisor 8.0 regarding the coalesce process.

Try to backport the fixes (one to fix the army of zombies, another to fix never-ending coalesce) from the upstream sm repository to fix them in XCP-ng.

More about the issues: https://bugs.xenserver.org/browse/XSO-966

@stormi stormi self-assigned this Oct 25, 2019
stormi added a commit to xcp-ng-rpms/sm that referenced this issue Oct 25, 2019
- Fixes "army of zombies" and never ending coalesce
- xcp-ng/xcp#298
@stormi stormi modified the milestone: XCP-ng 8.1 Oct 25, 2019
@stormi
Member Author

stormi commented Oct 25, 2019

A test package is available:

yum install sm sm-rawhba --enablerepo=xcp-ng-testing

@rizaemet

Hello,
I have a never-ending coalesce issue and want to try the test package. Is a server reboot required after installation? What should I pay attention to?

@olivierlambert
Member

Please reboot, yes. Nothing else to do.

@rizaemet

Maybe it's not related to this issue, but my problem is happening as before. This block keeps looping:

Nov 13 12:31:27 XenServer-08 SMGC: [13829]   Running VHD coalesce on *ad1f957a[VHD](1800.000G//2.500G|n)
Nov 13 12:31:27 XenServer-08 SM: [27829] ['/usr/bin/vhd-util', 'coalesce', '--debug', '-n', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/VHD-ad1f957a-9f60-48d0-8833-e7b7fd19dde5']
Nov 13 12:32:14 XenServer-08 SM: [13829] ['/sbin/lvremove', '-f', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/coalesce_ad1f957a-9f60-48d0-8833-e7b7fd19dde5_1']
Nov 13 12:32:15 XenServer-08 SM: [13829] ['/sbin/dmsetup', 'status', 'VG_XenStorage--25e96f21--b8d3--f880--3df5--f71897c52429-coalesce_ad1f957a--9f60--48d0--8833--e7b7fd19dde5_1']
Nov 13 12:32:29 XenServer-08 SM: [13829] ['/sbin/lvcreate', '-n', 'coalesce_166de08f-4c3d-4b0f-9f72-2f3c1213e5cc_1', '-L', '4', 'VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429', '--addtag', 'journaler', '-W', 'n']
Nov 13 12:41:56 XenServer-08 SMGC: [13829]   Running VHD coalesce on *166de08f[VHD](1800.000G//1.910G|n)
Nov 13 12:41:56 XenServer-08 SM: [5526] ['/usr/bin/vhd-util', 'coalesce', '--debug', '-n', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/VHD-166de08f-4c3d-4b0f-9f72-2f3c1213e5cc']
Nov 13 12:42:33 XenServer-08 SM: [13829] ['/sbin/lvremove', '-f', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/coalesce_166de08f-4c3d-4b0f-9f72-2f3c1213e5cc_1']
Nov 13 12:42:33 XenServer-08 SM: [13829] ['/sbin/dmsetup', 'status', 'VG_XenStorage--25e96f21--b8d3--f880--3df5--f71897c52429-coalesce_166de08f--4c3d--4b0f--9f72--2f3c1213e5cc_1']
Nov 13 12:42:47 XenServer-08 SM: [13829] ['/sbin/lvcreate', '-n', 'coalesce_139f02bf-5a1d-4e1c-958d-68dfe8ac478c_1', '-L', '4', 'VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429', '--addtag', 'journaler', '-W', 'n']
Nov 13 12:52:13 XenServer-08 SMGC: [13829]   Running VHD coalesce on *139f02bf[VHD](1800.000G//1.785G|n)
Nov 13 12:52:13 XenServer-08 SM: [15949] ['/usr/bin/vhd-util', 'coalesce', '--debug', '-n', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/VHD-139f02bf-5a1d-4e1c-958d-68dfe8ac478c']
Nov 13 12:52:48 XenServer-08 SM: [13829] ['/sbin/lvremove', '-f', '/dev/VG_XenStorage-25e96f21-b8d3-f880-3df5-f71897c52429/coalesce_139f02bf-5a1d-4e1c-958d-68dfe8ac478c_1']
Nov 13 12:52:48 XenServer-08 SM: [13829] ['/sbin/dmsetup', 'status', 'VG_XenStorage--25e96f21--b8d3--f880--3df5--f71897c52429-coalesce_139f02bf--5a1d--4e1c--958d--68dfe8ac478c_1']
Nov 13 12:52:55 XenServer-08 SMGC: [13829] Snapshot-coalesce did not help, abandoning attempts
Nov 13 12:52:55 XenServer-08 SMGC: [13829] Set leaf-coalesce = offline for deac1149[VHD](1800.000G/1.802G/1803.523G|a)
Nov 13 12:52:55 XenServer-08 SM: [13829] Raising exception [204, Gave up on leaf coalesce after leaf grew bigger than before snapshot taken [opterr=VDI=deac1149[VHD](1800.000G/1.802G/1803.523G|a)]]
Nov 13 12:52:55 XenServer-08 SMGC: [13829] Removed leaf-coalesce from deac1149[VHD](1800.000G/1.802G/1803.523G|a)
Nov 13 12:52:56 XenServer-08 SMGC: [13829] gc: EXCEPTION <class 'SR.SROSError'>, Gave up on leaf coalesce after leaf grew bigger than before snapshot taken [opterr=VDI=deac1149[VHD](1800.000G/1.802G/1803.523G|a)]
Nov 13 12:52:56 XenServer-08 SMGC: [13829]     sr.coalesceLeaf(candidate, dryRun)
Nov 13 12:52:56 XenServer-08 SMGC: [13829]   File "/opt/xensource/sm/cleanup.py", line 1587, in coalesceLeaf
Nov 13 12:52:56 XenServer-08 SMGC: [13829]     self._coalesceLeaf(vdi)
Nov 13 12:52:56 XenServer-08 SMGC: [13829]   File "/opt/xensource/sm/cleanup.py", line 1788, in _coalesceLeaf

@olivierlambert
Member

The zombie process issue is solved by the patch. There may still be a problem on the last leaf, but not at further depths (in short, it works until reaching a depth of 1; the final child can't be merged, for reasons we are investigating).

This seems to happen only on LVM based storage, not file based.

@rizaemet

So I will wait for your investigation. I'm ready to help if you need tests or logs.

@danieldemoraisgurgel

Olivier, what is XCP users' feedback on this coalesce problem?

We are evaluating migrating our hosts to XCP (given its active development and closer customer proximity). As reported in https://bugs.xenserver.org/browse/XSO-966, the backup process is a difficult task for our customers because the coalesce fails after the snapshot is created.

We would move 20 hosts from CH 8 to XCP 8; is there any recommendation after the host update? Do xenserver-tools keep working properly, or do we need to install the XCP-ng agent itself?

@nagilum99

I upgraded XenServer to XCP-ng without touching the HV tools, and it works perfectly.
It would even be a possible solution for XCP-ng to grab the XS/CH tools to install inside the guests, as the OSS drivers for Windows are a bit tricky.

@olivierlambert
Member

@danieldemoraisgurgel we have a patch on XCP-ng 8.0. Please open a support ticket if you want assistance on that. Sadly, Citrix won't make a patch on CH 8.0.

@danieldemoraisgurgel

We are updating one of our clusters (migrating from CH 8 to XCP 8). Next we will test the available update and see whether we get positive results with the coalesce process.

I believe that if we succeed, it will be the first step in migrating our Citrix infrastructure to XCP, and we will soon be signing a support contract! ;-)

@olivierlambert
Member

The patch should fix all zombie processes; we verified that with our customers. There is still the final leaf that can't coalesce in all cases, but the impact is almost invisible.

@stormi
Member Author

stormi commented Nov 22, 2019

Update pushed to XCP-ng 8.0

@stormi stormi closed this as completed Nov 22, 2019
@danieldemoraisgurgel

Thank you Stormi.
I'll be applying and validating the patch.

@danieldemoraisgurgel

@stormi the update does solve the zombie process problem, but after backup/snapshot removal every VM is still left with one disk stuck in the leaf chain.

I am also testing the update made available at https://support.citrix.com/article/CTX265619 in an XS 7.1 pool.

@BogdanRudas

I've played with LIVE_LEAF_COALESCE_MAX_SIZE in /opt/xensource/sm/cleanup.py (on XS 7.1) and it helps a bit.
There is also LIVE_LEAF_COALESCE_TIMEOUT, which I haven't explored yet.

If you want to play with this, be sure to stop coalescing gracefully first using /opt/xensource/sm/cleanup.py -a -u SR-UUID, and then run it again with xe sr-scan uuid=SR-UUID
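The stop-and-rescan procedure above can be sketched as a small wrapper. This is a minimal sketch, assuming a dom0 with the standard sm layout; the function name and the injectable `run` parameter are my own, for illustration:

```python
# Sketch: gracefully stop the GC/coalesce for an SR, then re-trigger it.
# Commands mirror the procedure described in this thread; run only in dom0.
import subprocess


def restart_coalesce(sr_uuid, run=subprocess.check_call):
    # Abort any in-progress GC/coalesce for the SR gracefully.
    abort_cmd = ["/opt/xensource/sm/cleanup.py", "-a", "-u", sr_uuid]
    # Re-scan the SR, which re-triggers the coalesce logic.
    rescan_cmd = ["xe", "sr-scan", f"uuid={sr_uuid}"]
    run(abort_cmd)
    run(rescan_cmd)
    return [abort_cmd, rescan_cmd]


# With a no-op runner we can inspect the commands without executing them:
cmds = restart_coalesce("SR-UUID", run=lambda cmd: None)
print(cmds[1])  # ['xe', 'sr-scan', 'uuid=SR-UUID']
```

Passing a custom `run` callable makes the sketch easy to dry-run without touching a host.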

@olivierlambert
Member

@danieldemoraisgurgel we know about the "last leaf". We are making XO backups less strict so they proceed even when the final leaf is uncoalesced. In the meantime, we'll experiment with @BogdanRudas's interesting suggestions. Thanks everyone!

@danieldemoraisgurgel

My perception in XCP 8 is as follows:

  • After backup, every VM was left with one disk frozen in the leaf tree.
  • Even after pausing the VM and rescanning the SR, the coalesce process does not start.

For CH 7.1 with the XS71ECU2020 update, the coalesce process completed 100% after pausing the VMs. We will now back it up again and see whether the coalesce runs 100% again.

I used the default LIVE_LEAF_COALESCE_TIMEOUT=10.
The new test will be with LIVE_LEAF_COALESCE_TIMEOUT=300.

@danieldemoraisgurgel

The strange thing is that I had to shut the VMs down, rescan the SR, and then power them on again.

The coalesce process then began on the linked VMs (in production) and completed successfully. The following values were changed in /opt/xensource/sm/cleanup.py:

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024 # bytes
LIVE_LEAF_COALESCE_TIMEOUT = 300 # seconds

Well, apparently everything is OK... we will see in our next backup whether it is necessary to shut down the VMs for the coalesce to start and complete correctly.

@olivierlambert
Member

Okay please keep us posted 👍 Thanks for your report!

@danieldemoraisgurgel

After the change described above, the backup completed with 100% success, and no disk is stuck in the coalesce chain.

We're migrating another cluster to XCP-ng 8!
Thanks for the support, the quick responses, and the attention.

@olivierlambert
Member

So to recap:

  1. XCP-ng 8.0 with latest patches (including the sm-driver fix)
  2. Changing LIVE_LEAF_COALESCE_TIMEOUT in sm/cleanup.py from 10 to 300

Do you confirm you also had to change LIVE_LEAF_COALESCE_MAX_SIZE to make it work?

@danieldemoraisgurgel

@olivierlambert I changed these values in sm/cleanup.py:

from: LIVE_LEAF_COALESCE_MAX_SIZE = 20 * 1024 * 1024 
to: LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024 
from: LIVE_LEAF_COALESCE_TIMEOUT = 10
to: LIVE_LEAF_COALESCE_TIMEOUT = 300
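For reference, a quick sanity check of what those values amount to (the constant names match cleanup.py; the arithmetic is my own, not from the thread):

```python
# Byte arithmetic for the cleanup.py tuning discussed in this thread.
MiB = 1024 * 1024

DEFAULT_LIVE_LEAF_COALESCE_MAX_SIZE = 20 * MiB    # stock value: 20 MiB
TUNED_LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * MiB    # tuned value: 1 GiB
DEFAULT_LIVE_LEAF_COALESCE_TIMEOUT = 10           # stock value, seconds
TUNED_LIVE_LEAF_COALESCE_TIMEOUT = 300            # tuned value, seconds

# The tuned size cap is about 51x the stock cap; the timeout is 30x longer.
print(TUNED_LIVE_LEAF_COALESCE_MAX_SIZE)  # 1073741824
```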

The coalesce process successfully completed in all cases, without the need to shut down the servers.

I can't tell you how much the value of LIVE_LEAF_COALESCE_MAX_SIZE affects the process, but since we spent several months with this problem, I suspect the default value wasn't large enough for the number of bytes the process needs to complete.

After the patch and these changes, I can see that the coalesce process is finally working properly (our environment has more than 350 VMs and about 98TB).

We are migrating to XCP-ng 8 :-)

@olivierlambert
Member

Nice! We'll see if we can raise those values by default in XCP-ng.

@DavorSaric

DavorSaric commented Dec 16, 2019

Hello,

I am on:
PRODUCT_VERSION='8.0.0'
INSTALLATION_DATE='2019-11-15 ...'

Local storage here (the server uses 2x SSD in an mdadm RAID 1).

I didn't install anything from testing repo so my versions are:

sm-2.2.3-1.0.2.xcpng8.0.x86_64
sm-rawhba-2.2.3-1.0.2.xcpng8.0.x86_64

4 VMs on XCP-ng with disk sizes of 500GB, 80GB, 50GB and 20GB. I got a report of a failed backup due to an SR "not enough space" error. I checked the chains with:

vhd-util scan -m "VHD-*" -f -c -l VG_XenStorage-SR-UUID -p -v

I saw chains only on the largest disk (500GB), and they were not being removed. I reconfigured with the following solution:

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024 
LIVE_LEAF_COALESCE_TIMEOUT = 300

I then re-ran the scan and all chains were deleted, so now it looks like this:

vhd=VHD-UUID1 capacity=21474836480 size=21525168128 hidden=0 parent=none
vhd=VHD-UUID2 capacity=53687091200 size=53800337408 hidden=0 parent=none
vhd=VHD-UUID3 capacity=85899345920 size=86075506688 hidden=0 parent=none
vhd=VHD-UUID4 capacity=536870912000 size=537927876608 hidden=0 parent=none

I guess I do not need sm and sm-rawhba from testing repo?
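The `vhd-util scan` output above can be checked programmatically. Below is a minimal sketch (the helper names are hypothetical, not part of sm) that flags VHDs which still have a parent, i.e. chains that have not yet coalesced:

```python
# Parse `vhd-util scan`-style lines and report VHDs still in a chain.
def parse_vhd_scan(output):
    vhds = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("vhd="):
            continue
        # Each line is space-separated key=value pairs.
        vhds.append(dict(field.split("=", 1) for field in line.split()))
    return vhds


def pending_coalesce(vhds):
    # A VHD whose parent is not "none" is part of an uncoalesced chain.
    return [v["vhd"] for v in vhds if v.get("parent", "none") != "none"]


sample = """\
vhd=VHD-UUID1 capacity=21474836480 size=21525168128 hidden=0 parent=none
vhd=VHD-UUID2 capacity=53687091200 size=53800337408 hidden=1 parent=VHD-UUID1
"""
print(pending_coalesce(parse_vhd_scan(sample)))  # ['VHD-UUID2']
```

In a healthy SR, like the one shown above after the fix, every entry has `parent=none` and the pending list is empty.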

@MrOffline77

Thanks @danieldemoraisgurgel, I had been trying to fix this for weeks 🥇.
I can confirm that the above steps worked for me.

@stormi
Member Author

stormi commented Feb 21, 2020

New logic for leaf coalesce has been backported from upstream into XCP-ng 8.1 beta: see https://xcp-ng.org/forum/post/22794

Feedback highly welcome!

@DavorSaric

Hi,

could I have any issues on XenServer 7.1 if I just change the two values below and re-run the coalesce? It works on XCP-ng 8.0 without the patches from 8.1, so could it be that it will also work on 7.1? I am in production with over 340 VMs.

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024
LIVE_LEAF_COALESCE_TIMEOUT = 300

We do not have these patches on 7.1, as we are not using paid LTS support:
https://support.citrix.com/article/CTX265619
blktap-3.5.0-xs.2+1.0_71.2.6.x86_64.rpm
sm-1.17.0-xs.2+1.0_71.2.4.x86_64.rpm
sm-rawhba-1.17.0-xs.2+1.0_71.2.4.x86_64.rpm

@olivierlambert
Member

You might try, but we obviously can't tell you more, because we have no experience with the result on 7.1. I strongly suggest you consider migrating to XCP-ng at some point 👍

@DavorSaric

Thanks 👍
