
Use NFS hard mount instead of soft mount to avoid RO VMs (or offer option)? #334

Open
stormi opened this issue Feb 4, 2020 · 90 comments

@stormi
Member

stormi commented Feb 4, 2020

See the proposal and testimony from a user on the forum: https://xcp-ng.org/forum/post/21940

We may also consider changing the default timeout options.

@olivierlambert
Member

I think it might be interesting to ask the Citrix storage team. We should create an XSO ticket to get their opinion and maybe the reasons behind their current choice.

@stormi stormi added this to To Do in Team board via automation Feb 4, 2020
@ghost

ghost commented Feb 26, 2020

May I suggest always using a unique fsid= export option for each exported path on the NFS server. This ought to be documented in the docs and wiki :)

@ezaton

ezaton commented Feb 29, 2020

The thing is that if NFS is served by a cluster (for example, Pacemaker), a failover event will work flawlessly if NFS is mounted with the 'hard' option on the XenServer. Otherwise, VMs will experience a (short) disk loss and the Linux ones will, by default, end up with a read-only filesystem.
The simple workaround is to edit /opt/xensource/sm/nfs.py and modify the line:

options = "soft,proto=%s,vers=%s" % (
to:
options = "hard,proto=%s,vers=%s" % (

This is an ugly workaround, but it allows VMs to live, which is more important than the beauty of the hack.
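
For illustration, here is roughly what that part of nfs.py looks like with the workaround applied. This is a hedged sketch, not the exact upstream code: the surrounding function and the variable names (transport, nfsversion) are assumptions inferred from the format string above.

def build_mount_options(transport, nfsversion, timeout=100, retrans=3):
    # Stock behaviour: 'soft' makes the client give up after `retrans`
    # retransmissions and return an I/O error to the caller.
    # options = "soft,proto=%s,vers=%s" % (transport, nfsversion)

    # The workaround: 'hard' makes the client retry indefinitely, so a
    # failover or short outage blocks I/O instead of failing it.
    options = "hard,proto=%s,vers=%s" % (transport, nfsversion)

    options += ",timeo=%d,retrans=%d" % (timeout, retrans)
    return options

Bear in mind that an in-place edit like this would likely be overwritten whenever the sm package is updated, so it has to be reapplied after upgrades.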

@ghost

ghost commented Feb 29, 2020

I believe it is possible to add custom NFS mount options when adding a new SR through XOA. Have you tested this?

@ezaton

ezaton commented Feb 29, 2020

Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.

@olivierlambert
Member

Yes, that's why it would require an XAPI modification for this. That's doable :)

I think we should keep the default behavior, but allow an override: this will let the people who want to test it do so.

In theory, we should:

  • add an extra parameter to the SR NFS create
  • add an extra variable in the NFS driver code (keeping soft as the default when no hard parameter is added)

That should be it. @ezaton do you want to contribute?
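
A rough sketch of what that override could look like in the driver, assuming the SMAPIv1 driver sees the SR's device-config dictionary as self.dconf (the key name 'hardmount' is purely hypothetical here):

# Hypothetical sketch, not actual SMAPIv1 driver code.
DEFAULT_MOUNT_MODE = 'soft'          # keep today's behaviour unless overridden

def mount_mode(dconf):
    # e.g. a hypothetical: xe sr-create ... device-config:hardmount=true
    if str(dconf.get('hardmount', 'false')).lower() in ('true', '1', 'yes'):
        return 'hard'
    return DEFAULT_MOUNT_MODE

The point is simply that the default stays soft, and nothing changes for existing SRs unless the new parameter is explicitly set.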

@ezaton

ezaton commented Feb 29, 2020

I am not sure I have the Python know-how, but I will make an effort during the next few days. This is a major thing I have been carrying with me since XS version 6.1 or so - those were my early NFS cluster days. Nowadays I have so many NFS clusters in so many locations. So - yeah, I want to contribute. I will see if I can actually do it.

Thanks!

@olivierlambert
Member

olivierlambert commented Feb 29, 2020

Okay, so IIRC you might indeed check how the NFS version is passed down to the driver (from XAPI to the NFS Python file). It's a good starting point for understanding how it works; you can then do the same for the hard/soft mount option :)

edit: @Wescoeur knows a lot about SMAPIv1, so he might assist you with this (if you have questions).

@ghost

ghost commented Feb 29, 2020

Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.

I thought subsequent mount options override previous mount options. This is how we can add nfsvers=4.1, for example, isn't it? I haven't tried, but it might be worth trying.
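
A quick way to check that precedence on a test host (the server name and export path are placeholders, not a real SR):

mount -t nfs -o soft,hard,timeo=100,retrans=3 nfs.example.com:/export/test /mnt/test
mount | grep /mnt/test    # if the later option wins, the output should list 'hard'
umount /mnt/test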

@ezaton

ezaton commented Feb 29, 2020

This is a quote from 'man 5 nfs':

  soft / hard    Determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.

                 NB: A so-called "soft" timeout can cause silent data corruption in certain cases. As such, use the soft option only when client responsiveness is more important than data integrity. Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.

Look at the comment. I believe that hard should be the default - at least for regular SRs. The ISO SR is another matter.
I have just forked the code. I will see if I can modify it without exceeding my talent :-)

@nagilum99

Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.

Maybe increasing that value could be a less intrusive option and could be supplied without being ignored?

@ezaton

ezaton commented Mar 1, 2020

These are meant to mitigate (some of) the problems caused by soft mount, instead of just mounting 'hard'. Look - when it's your virtual machine there, you do not want a momentary network disruption to kill your VMs. The safety of your virtual machines is the key requirement. Soft mount just doesn't provide it.

@ezaton

ezaton commented Mar 3, 2020

I have edited nfs.py and NFSSR.py and created a pull request here: xapi-project/sm#485

@stormi
Member Author

stormi commented Mar 3, 2020

Thanks. I think you need to add context and explain why hard would be better than soft, and what tests you did, in order to have a chance of getting it merged.

@ezaton

ezaton commented Mar 3, 2020

I will add all these details in the pull request.

@ghost

ghost commented Mar 5, 2020

Doesn't work. The hard-coded 'soft' directive in nfs.py overrides it.

I just tried in XOA to create a new SR with the "hard" mount option. It seems to stick, looking at the output from mount.


# mount
example.com:/media/nfs_ssd/3ec42c2f-552c-222f-3d46-4f98613fe2e1 on /run/sr-mount/3ec42c2f-552c-222f-3d46-4f98613fe2e1 type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,clientaddr=192.168.1.10,local_lock=none,addr=192.168.1.2)

@olivierlambert
Member

@Gatak if that's the case, it's even easier :D

Can you double-check that it's the correct hard behavior?

@ezaton

ezaton commented Mar 5, 2020

This is a change of behaviour from what I remember; however, I have just tested it, and it is true. It is consistent across reboots and across detach/reattach, so my patch is (partially) redundant.
However, I believe that 'hard' should be the default for VM NFS SRs.

@ghost

ghost commented Mar 5, 2020

I believe that 'hard' should be the default for VM NFS SRs.

Yes, based on the documentation provided it does seem the safest option.

@olivierlambert
Member

olivierlambert commented Mar 5, 2020

Yes, but you can't decide to make this change for everyone without a consensus. We'll talk more with the Citrix team to understand their original choice.

What we can do in XO: expose a menu that selects "hard" by default. This will encourage hard by default without changing it in the platform directly.

Does this sound reasonable to you?

@ghost

ghost commented Mar 5, 2020

Yes, but you can't decide to make this change for everyone without a consensus. We'll talk more with the Citrix team to understand their original choice.

Sounds good. Many use soft because you could not abort/unmount a hard-mounted NFS share. But that may be an old truth..

What we can do in XO: expose a menu that selects "hard" by default. This will encourage hard by default without changing it in the platform directly.

I think it is important to mention that the NFS export should use the fsid* option to create a stable export filesystem ID. Otherwise the ID might change on reboot, which will prevent the share from being reconnected.

* https://linux.die.net/man/5/exports
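
For example, a hypothetical /etc/exports entry that pins the export ID (the path, network and fsid value are placeholders):

/export/xcp-ng-sr  192.168.1.0/24(rw,fsid=1,no_subtree_check,no_root_squash)

After editing /etc/exports, exportfs -ra reloads the export table without restarting the NFS server.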

@olivierlambert
Member

What about NFS HA? (regarding fsid)

@ezaton

ezaton commented Mar 5, 2020

What about NFS HA? (regarding fsid)

NFS HA maintains the fsid. If you set up an NFS cluster, you manage the fsid yourself, or else it doesn't work very well. For stand-alone systems the fsid is derived from the device ID, but not for clusters.

@nackstein

I wrote some considerations on the forum thread about this issue and will report the most important ones here.
It seems that nfs.py already supports user options, and those get appended to the defaults. The mount command keeps the last option, so if the default is soft and the user appends hard: soft,hard = hard. The same goes for timeo and retrans.
The Linux VM that goes read-only is probably due to a default in Ubuntu. There is an option in the superblock of ext2/3/4 about the behavior when errors are encountered. RHEL, on the other side, does not remount read-only and will continue (retry) to perform I/O on the disk. It's to be verified whether the error is propagated to userspace or stays at the fs level inside the VM.

Using hard as the default is risky in my opinion. I have to say that on servers I usually set hard,intr in order to protect poorly written application software from receiving I/O errors, while the intr option still lets me kill the process if I need to umount the fs.
I say it's risky because if you use a lot of different NFS storage and only one goes down for a long period, you will get a semi-frozen dom0. It's to be verified what happens to xapi and normal operations: whether you are able to ignore the one broken NFS SR and continue working, or whether xapi or other daemons running on dom0 get stuck listing mount points or accessing the broken SR.
I think nobody wants to reboot a host because the NFS SR for the ISO files is down.
For short downtime, raising the NFS mount options retrans (default 3) or timeo (default 100) could be enough. The ideal solution is to have the single VM retrying on a soft mount without going read-only, so it's easy to manually recover the fs without rebooting the host for a stale NFS mount point.
It seems that Windows has a nice default behavior, and RHEL should too. The problem could be limited to Ubuntu or other distros (to be verified).
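
To put rough numbers on that last point: under the retry behaviour described in man 5 nfs (timeo is in tenths of a second, retrans is the number of retransmissions before a soft mount returns an error), a back-of-the-envelope estimate of the failure window looks like this. It ignores backoff and the TCP-specific caps, so treat it as an approximation only:

# Approximation only -- the real client applies backoff and caps the timeout.
def soft_failure_window(timeo=100, retrans=3):
    per_attempt = timeo / 10.0      # timeo is expressed in deciseconds
    attempts = 1 + retrans          # the initial request plus the retransmissions
    return per_attempt * attempts   # seconds before the client returns an I/O error

print(soft_failure_window())            # defaults soft,timeo=100,retrans=3: ~40 s
print(soft_failure_window(100, 360))    # timeo=100,retrans=360: roughly an hour

So with the defaults, a soft-mounted SR starts returning errors to VMs after well under a minute of storage unavailability.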

@ezaton

ezaton commented Mar 6, 2020

The Linux VM that goes read-only is probably due to a default in Ubuntu. There is an option in the superblock of ext2/3/4 about the behavior when errors are encountered. RHEL, on the other side, does not remount read-only and will continue (retry) to perform I/O on the disk. It's to be verified whether the error is propagated to userspace or stays at the fs level inside the VM.

This is incorrect. All Linux servers I have had the pleasure of working with - RHEL 5/6/7, CentOS, Oracle Linux, Ubuntu and some more - mount by default with errors=remount-ro behaviour. You have to explicitly change this behaviour for your Linux to not fail(!) when NFS performs a failover with a soft mount.

Xapi and SM-related tasks are handled independently per SR - check the logs. I agree that the ISO SR should remain soft (this can still crash VMs, but it is less of a problem, because the ISO is read-only in the first place), so my patch (and the proposed change to the GUI) is to have the 'hard' mount option for VM data disks and 'soft' for ISO SRs.

@ghost

ghost commented Mar 6, 2020

on servers I usually set hard,intr in order to protect poorly written application software from receiving I/O errors, while the intr option still lets me kill the process if I need to umount the fs.

According to https://linux.die.net/man/5/nfs the intr mount option is deprecated. However, it should still be possible to kill a process. In this case it would have to be one of the Xen services reading from the stale NFS share. I'm not sure how feasible it is to kill it. Is it tapdisk?

I did one test yesterday with a Windows Server VM on a hard-mounted NFS server that I took offline for ~30 minutes. The VM froze and I got NFS timeouts in the XCP-ng server's dmesg, but once I started the NFS server again the freeze stopped and things went back to normal.

This did not previously work when I used the soft mount option and had not specified the fsid export option. Then XCP-ng would not reconnect and would wait forever with a stale mount.

@nackstein

nackstein commented Mar 6, 2020

I made a test with Ubuntu Server 19.10, installed with default settings and without LVM.
The fs is mounted with the continue behavior by default (as I see on RHEL7):
root@ubuntu01:~# tune2fs -l /dev/xvda2 |grep -i behav
Errors behavior: Continue

I tested with a script that updates a file every second on the VM.
The test consists of running exportfs -uav on the NFS server to take down the share and exportfs -rv to bring it online again.
With the default SR options soft,timeo=100,retrans=3, the VM does not detect a problem for about 1 minute (I didn't precisely measure the time). After 5 minutes of downtime the root fs gets remounted read-only.
On the XCP host I see that the df command blocks for about 10-20 seconds and then returns its output.
Once the NFS share comes back, it is almost instantly mounted again.

I repeated the test with retrans=360; I expected that the client wouldn't receive an error for a long time, but I was wrong. After about 5 minutes the root fs of the VM got remounted read-only.

I investigated the timeout parameter of the disk, normally in /sys/block/sd*/device/timeout,
but it seems that the Xen disk does not export this parameter. I was confident that, not having a timeout, a default infinite wait was implemented, but now I think I was wrong.

I still have to understand what really happens:
whether the VM gets the I/O error from the dom0 and then remounts read-only earlier than I expected (timeo=100 and retrans=360 should retry for about 1 hour), or
whether the timeout is internal to the kernel of the VM and, once exceeded, the fs is remounted read-only.
The first case means that for some reason the NFS parameters are not enforced, while the second case means that even with a hard mount you should see the problem. So right now I'm missing something.

@nackstein

Some more tests. It turns out that one possible problem was how I conducted the test.
I used unexport/export, and this seems to trigger the error reporting to userspace even before the timeout expires. I tried with timeo=3000,retrans=10, but after about 50 seconds the VM remounted read-only, and an ls command on the XCP host returned an error after a few seconds instead of waiting. This was with unexport/export.

I now tried null routing as suggested on the forum: ip route add <xcp host ip/32> via 127.0.0.1 dev lo to block all traffic between the NFS server and the XCP host, and then ip route del to roll back (a sketch of the procedure follows below).
Now, after 5 minutes the VM does not get an error with timeo=3000,retrans=10, commands on the host like df block, and the NFS mount honors the configured timeout.

I'm going to retest with timeo=100,retrans=360 to be sure it works and to verify how the TCP timeouts interact.

I think this tells us two things:

  1. the xvda disk does not have timeouts
  2. in case of IP failover on the NFS server, it should be safer to create the exports first and then configure the IP, rather than vice versa. This lets the share appear from the first moment the VIP is reachable again and avoids errors being propagated to userspace
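
For reference, a sketch of that null-routing test procedure on the NFS server (the address is a placeholder for the XCP-ng host's IP):

# Black-hole traffic to the XCP-ng host instead of unexporting the share
ip route add 192.0.2.10/32 via 127.0.0.1 dev lo
sleep 300                      # simulate a ~5 minute outage
ip route del 192.0.2.10/32     # restore connectivity

Unlike exportfs -u, this keeps the export table intact, which seems to be why (per the test above) errors are not pushed to userspace early.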

@stormi
Member Author

stormi commented Mar 6, 2020

Just a quick word to say that this discussion is very interesting, whatever the outcome will be. I'm following it closely.

@ghost

ghost commented Apr 30, 2020

After reading through this and the xcp-ng forum I remounted the NFS share as hard and it got worse for me: the VMs went into a RO state, and it finally took restarting the entire cluster to get it back up again. Here is the mount output of my share; am I doing anything wrong?

10.10.10.11:/export/xcp-ng-hard/050155f5-9bcc-7578-80d8-c6088d72a216 on /run/sr-mount/050155f5-9bcc-7578-80d8-c6088d72a216 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,mountaddr=10.10.10.11,mountvers=3,mountport=43817,mountproto=tcp,local_lock=none,addr=10.10.10.11)

I think you have other issues with your OMV setup, at least based on the issues from the forum thread.

@geek-baba

After reading through this and the xcp-ng forum I remounted the NFS share as hard and it got worse for me: the VMs went into a RO state, and it finally took restarting the entire cluster to get it back up again. Here is the mount output of my share; am I doing anything wrong?

10.10.10.11:/export/xcp-ng-hard/050155f5-9bcc-7578-80d8-c6088d72a216 on /run/sr-mount/050155f5-9bcc-7578-80d8-c6088d72a216 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,mountaddr=10.10.10.11,mountvers=3,mountport=43817,mountproto=tcp,local_lock=none,addr=10.10.10.11)

I think you have other issues with your OMV setup, at least based on the issues from the forum thread.

Potentially, but how do I establish that? Also, the forum issue is with NFSv4, and the reason I started to look into NFSv4 is this issue with my existing NFSv3 share.

@StreborStrebor

StreborStrebor commented Apr 30, 2020

After reading through this and the xcp-ng forum I remounted the NFS share as hard and it got worse for me: the VMs went into a RO state, and it finally took restarting the entire cluster to get it back up again. Here is the mount output of my share; am I doing anything wrong?

10.10.10.11:/export/xcp-ng-hard/050155f5-9bcc-7578-80d8-c6088d72a216 on /run/sr-mount/050155f5-9bcc-7578-80d8-c6088d72a216 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,mountaddr=10.10.10.11,mountvers=3,mountport=43817,mountproto=tcp,local_lock=none,addr=10.10.10.11)

I assume you stopped all VMs before you remounted the VM storage? (It would not be surprising at all that the VMs went RO if you unmounted the storage and remounted it as hard while the VMs were running.)

A suggestion:
Add a Debian or Ubuntu VM on your XCP-ng server. Add NFS 4.1 to that. See if you can then mount the NFS share on the XCP-ng server. It should take you no more than 30 minutes (a minimal sketch follows below). Personally, I gave up on OMV years ago as a fileserver for XenServer. I run several Debian servers with NFS as fileservers for XCP-ng (all NFSv3, hard mounted). Very clean and very lean.

Also, DON'T experiment further with all your VMs at risk. I'd use a test VM for that, until certain that all is working robustly.
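
For anyone following this suggestion, a minimal sketch for turning a Debian/Ubuntu test VM into an NFS server (package and service names as on current Debian; the export path, network and fsid are placeholders):

apt install nfs-kernel-server
mkdir -p /export/xcp-test
echo '/export/xcp-test 192.168.1.0/24(rw,fsid=10,no_subtree_check,no_root_squash)' >> /etc/exports
exportfs -ra
systemctl restart nfs-server

Then attach it as a throwaway NFS SR and run the hard/soft failover experiments against that instead of the production storage.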

@geek-baba

After reading through this and the xcp-ng forum I remounted the NFS share as hard and it got worse for me: the VMs went into a RO state, and it finally took restarting the entire cluster to get it back up again. Here is the mount output of my share; am I doing anything wrong?

10.10.10.11:/export/xcp-ng-hard/050155f5-9bcc-7578-80d8-c6088d72a216 on /run/sr-mount/050155f5-9bcc-7578-80d8-c6088d72a216 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,mountaddr=10.10.10.11,mountvers=3,mountport=43817,mountproto=tcp,local_lock=none,addr=10.10.10.11)

I assume you stopped all VMs before you remounted the VM storage? (It would not be surprising at all that the VMs went RO if you unmounted the storage and remounted it as hard while the VMs were running.)

This is what I did:

  1. All the VMs were running on /export/xcp-ng-sr, mounted with the defaults; I stopped all the VMs
  2. Created a new share /export/xcp-ng-hard/, hard mounted using XCP-ng Center, and copied all the stopped VMs to the new hard-mounted share
  3. Started all the VMs and went on to troubleshoot the NFSv4 issue I am having with my current setup.
  4. While I was testing NFSv4 and updating the NFSv4 share, the NFS daemon would have restarted a couple of times (for no longer than 30 sec); I saw several VMs go into a RO state.

This is where the bigger issue started. With the soft mount I was able to restart the VMs with no problem. Now, hard mounted, they won't reboot and force shutdown won't work; after several toolstack restarts and force shutdowns the VMs shut down, but then none of them would restart, and I had to restart the entire cluster to start the VMs again.

@stormi
Member Author

stormi commented Apr 30, 2020

I suggest keeping to the forum to understand what is going on, and coming back here with the conclusions, to keep this issue readable.

@geek-baba

I suggest keeping to the forum to understand what is going on, and coming back here with the conclusions, to keep this issue readable.

I want to clarify the two different issues I am dealing with, both of which started with VMs going into a RO state:

  1. NFSv4 mounting issue - I had always used NFSv3 until the VMs started to go into a RO state. Before I stumbled on this issue, I started to explore NFSv4 to test whether it has better I/O performance that might alleviate the problem. This has nothing to do with this post, and I am trying to find out what's going on using the xcp-ng forum.

  2. NFSv3 hard mount - after hard mounting the NFSv3 share, my issue got worse. In another xcp-ng forum post about a similar issue another member had, where I was discussing my results, @olivierlambert suggested that I report my findings to this thread; this issue is not being actively discussed in the forum, so there is nothing to bring back here.

I am happy to do more tests or provide logs if you want, but I would caution against using hard as the default because I am not sure it gets us the desired results consistently.

@olivierlambert
Member

What is interesting is the hard behavior: it shouldn't put your VMs in RO mode. If you can clearly reproduce a different behavior, it would be interesting to share 👍

@stormi
Member Author

stormi commented Apr 30, 2020

OK, I understand better now. I don't see how it is technically possible for the VMs to go RO over a hard-mounted share, but apparently it happened in your case.

@geek-baba

geek-baba commented Apr 30, 2020

I can do more tests later today and try to reproduce the issue. Here is my hard-mounted NFS SR; let me know if I need to change anything:

10.10.10.11:/export/xcp-ng-hard/050155f5-9bcc-7578-80d8-c6088d72a216 on /run/sr-mount/050155f5-9bcc-7578-80d8-c6088d72a216 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,mountaddr=10.10.10.11,mountvers=3,mountport=43817,mountproto=tcp,local_lock=none,addr=10.10.10.11)

Also, what other logs would you like to see?

@ezaton

ezaton commented Apr 30, 2020

From my understanding of NFS (and I have a little understanding of it), this should never happen. There is no way that a hard NFS mount would result in VMs going RO. Tests here showed that you can kill your NFS server for a very long time and still the VMs would hang, but not go RO.
I believe you have some misconfiguration on the NFS server's side, although I cannot really imagine what it is. The first thought that comes to my mind is that you have bad blocks on your NFS server and that I/O errors are being forwarded to the NFS clients (this might cause the error to go all the way through the hypervisor, through the VM, and into its filesystem and block device). Can you share with us the configuration of the NFS server? The log output? Although not related directly to the issue at hand, it is an interesting edge case, which might give us better understanding and clues as to what happens in extreme cases with storage or network devices.
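
For reference, the logs that usually matter most on the XCP-ng side for this kind of SR problem (standard dom0 paths; adjust as needed):

tail -f /var/log/SMlog           # storage manager (SMAPIv1) driver activity per SR
tail -f /var/log/xensource.log   # XAPI: PBD plug/unplug and SR operations
dmesg | grep -i nfs              # kernel-side NFS timeouts ("server not responding")

Capturing these around the moment the VMs go read-only should show whether an I/O error was actually returned by the NFS client or whether something else happened.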

@geek-baba

geek-baba commented Apr 30, 2020

I doubt that the NFS server has bad blocks (possible, but...); it consists of 12 mirrored vdevs using 24 new Intel enterprise NVMe drives on ZFS, but as I said, potentially yes. I am providing the NFS export of the NAS server (OMV - it's just Debian 10 with a GUI to manage NAS settings) and what I see on the XCP-ng side:

NAS Export:

$ cat /etc/exports
/export/xcp-ng-hard (fsid=5,rw,subtree_check,insecure,crossmnt,no_root_squash,anongid=1000,anonuid=1000)

XCP-ng Mount

$ mount
10.10.10.11:/export/xcp-ng-hard/050155f5-9bcc-7578-80d8-c6088d72a216 on /run/sr-mount/050155f5-9bcc-7578-80d8-c6088d72a216 type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=100,retrans=3,sec=sys,mountaddr=10.10.10.11,mountvers=3,mountport=43817,mountproto=tcp,local_lock=none,addr=10.10.10.11)

What log files do you want me to collect for you when I am able to reproduce the RO issue?

@ghost

ghost commented Apr 30, 2020

@geek-baba IMHO I think you should fix the root cause of the failed NFSv4 mounts in your setup before continuing here, as it is difficult to know if those issues also affect this RO/hard mount problem for you. I can't explain why your NFSv4 mount does not work, and I agree with @StreborStrebor that you should set up a test VM with an NFS server to test and rule things out. It won't take more than 10 minutes to set up.

@geek-baba

@geek-baba IMHO I think you should fix the root cause of the failed NFSv4 mounts in your setup before continuing here, as it is difficult to know if those issues also affect this RO/hard mount problem for you. I can't explain why your NFSv4 mount does not work, and I agree with @StreborStrebor that you should set up a test VM with an NFS server to test and rule things out. It won't take more than 10 minutes to set up.

Did you read my earlier post?

@stormi stormi removed this from To Do in Team board Jul 15, 2020
@Forza-tng

Hello everyone. What's the status of this? Did we come to some conclusion on what the best choice is, and whether choices should be made available in XO?

@stormi
Member Author

stormi commented Jul 12, 2021

Hello everyone. What's the status of this? Did we come to some conclusion on what the best choice is, and whether choices should be made available in XO?

Not exactly. We know for sure that soft mount does not cope well with NFS server disconnections, but we haven't established 100% that hard mounts would be the best solution in all cases.

@Forza-tng

Hello everyone. What's the status of this? Did we come to some conclusion on what the best choice is, and whether choices should be made available in XO?

Not exactly. We know for sure that soft mount does not cope well with NFS server disconnections, but we haven't established 100% that hard mounts would be the best solution in all cases.

I understand. I do think it is important to solve this, or at least to provide some guiding suggestions and information to the user before they create the SR.

@Fohdeesha
Member

Bumping this, another XCP-ng customer has been bitten by this with HA NFS storage.

@olivierlambert
Member

I think we'll leave the default choice "as is" (too risky to change it), but we can help people configure hard when HA is used, for example.

@ezaton

ezaton commented Mar 1, 2023

I disagree. I think that the 'soft' default means that VMs will(!) get I/O errors in case of a network glitch or any momentary disconnection, and that the danger to VMs in this case is much more severe than blocking I/O.
As I have argued in the past - the default has to be hard, which is safer and more reliable, and for all my customers using NFS I modify this in the code myself. I hate it, but I would rather hate it than risk VM data corruption.

@olivierlambert
Member

Then, open a support ticket so we can see with you how to simplify this for your entire infrastructure. "The default has to be hard" is easy to say when you don't have thousands of users all around the world. This is a kind of breaking change, so it's out of the question for XCP-ng 8.2 at least.

@Forza-tng

I am curious, in what way is this a breaking change?

@olivierlambert
Member

olivierlambert commented Mar 1, 2023

It's changing an existing behavior for all our users. You can't fathom the potential consequences at this scale of changing how storage responds to being mounted hard vs soft. Helping people make the change on their own is acceptable, but changing the default during an LTS release is not.

@Forza-tng

Then maybe provide the hard and soft options during NFS SR creation in XOA/XO, with some explanation of the consequences of each?

@olivierlambert
Member

That's exactly what I'm saying.

@Forza-tng

That's exactly what I'm saying.

This is perfectly fine from my point of view. Thanks.

@olivierlambert
Member

@Fohdeesha how far are we on this?

@Fohdeesha
Member

@olivierlambert I've confirmed that adding hard as an option when creating an SR in XOA works just fine; the PBD for the SR gets created with the proper hard NFS option and it is retained, as explained in this post: #334 (comment)

I suppose if you wanted to go further, XOA could add a hard/soft toggle button that automatically added this, with an explainer of the pros/cons of each.
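
For completeness, the CLI equivalent at SR creation time looks roughly like the following. The device-config:options key is presumably what XOA fills in here, but treat the exact key name as an assumption, and the addresses/paths as placeholders:

xe sr-create type=nfs shared=true content-type=user name-label="NFS SR (hard)" \
  device-config:server=192.0.2.20 \
  device-config:serverpath=/export/xcp-ng-sr \
  device-config:options=hard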

@olivierlambert
Member

Okay, so let's do that in XO. Pinging @marcungeschikts so it's added to the XO backlog 👍
