Linux 4.2 compat: vfs_rename() #3653

behlendorf · 2015-07-30T23:21:12Z

The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it into place. This way a partial version
of the file can never exist; it's all or nothing.
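
For readers unfamiliar with the pattern, here is a minimal user-space sketch of the write-then-rename approach described above. It is not the kernel code: the function name write_file_atomic() and the ".tmp" suffix are assumptions made for illustration, and it uses plain POSIX calls rather than vn_rename().

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `buf` by writing a temporary
 * file and renaming it over the destination.
 * Returns 0 on success, -1 on failure.
 */
int
write_file_atomic(const char *path, const void *buf, size_t len)
{
	char tmp[4096];
	int fd;

	assert(path != NULL && buf != NULL);

	/* Hypothetical naming scheme: "<path>.tmp" in the same directory. */
	if (snprintf(tmp, sizeof (tmp), "%s.tmp", path) >= (int)sizeof (tmp))
		return (-1);

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	/* Write and flush the new contents; a short write counts as failure. */
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
		(void) close(fd);
		(void) unlink(tmp);
		return (-1);
	}
	if (close(fd) != 0) {
		(void) unlink(tmp);
		return (-1);
	}

	/* rename(2) swaps the new file in; readers see old or new, never a mix. */
	if (rename(tmp, path) != 0) {
		(void) unlink(tmp);
		return (-1);
	}
	return (0);
}
```

The key property is that the destination name only ever points at a complete file. The difficulty discussed here is that the equivalent rename is hard to perform from inside the Linux kernel.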

Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms. Even doing
basic I/O to a file system from within the kernel is strongly
discouraged. In order to support this at all the vn_rename()
implementation ends up being complex and fragile. So fragile
that recent Linux 4.2 changes have broken it.

While it is possible to update vn_rename() to work with the
latest kernels, a better long-term strategy is to stop using
vn_rename() entirely. Then all this complex, fragile code can
be removed. Achieving this is straightforward because
spa_config_write() is the only consumer of vn_rename().

This patch reworks spa_config_write() to update the cache file
in place. The file will be truncated, written out, and then
synced to disk. If an error is encountered, the file will be
unlinked, leaving the system in a consistent state.
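
The truncate, write, sync, and unlink-on-error sequence can be sketched in user space as follows. This is only an illustration under stated assumptions: write_file_in_place() is a hypothetical name, and the actual patch works through the kernel's vn_* interfaces rather than the POSIX calls used here.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Rewrite `path` in place with `len` bytes from `buf`.
 * Returns 0 on success, -1 on failure (the file is removed on failure).
 */
int
write_file_in_place(const char *path, const void *buf, size_t len)
{
	int fd, err = 0;

	assert(path != NULL && buf != NULL);

	/* O_TRUNC discards the previous contents before writing. */
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);

	/* One write call for the whole buffer, then force it to disk. */
	if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0)
		err = -1;
	if (close(fd) != 0)
		err = -1;

	/* On any failure, remove the file rather than leave a partial copy. */
	if (err != 0)
		(void) unlink(path);
	return (err);
}
```

Unlike the rename approach, a crash between the truncate and the completed sync can leave an incomplete file behind, which is the trade-off discussed in the next paragraph.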

This does expose a tiny window where a system that crashes at
exactly the wrong moment could be left with a partially written
cache file. However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.

If this were to somehow happen, it poses no risk to the pool. Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.

Bottom line: not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>


ryao · 2015-08-04T15:33:21Z

@behlendorf I do not think this will work as you expect on all filesystems. It is possible for in-place filesystems to return partially updated cachefile data following a power loss when they reuse the old data blocks. This will not occur on either ext4 (excluding data=writeback), which will update the data before metadata, or xfs, which will serve zeroes rather than inconsistent data. However, other in-place filesystems are not necessarily quite as strict and could return old, random, or partially updated data. The following page documents the XFS behavior, but states other filesystems do not do this:

You can run into more or less the same problem with any journaling filesystem; the others just don't serve zeroes. Instead, they give you the data that's physically on the medium. Imagine the situation when the corrupt /etc/motd suddenly becomes a window to your previous /etc/shadow contents... I really prefer how XFS handles that. Sometimes you do get the old data back with the other filesystems, but this is because the filesystems may reuse the blocks of the old file. So it's a trade-off, and your choice between security and, uh, convenience.

http://madduck.net/blog/2006.08.11:xfs-zeroes/

behlendorf · 2015-08-06T13:38:48Z

I'm not thrilled with it either. This definitely isn't a perfect solution, as I describe in the commit comment. It does expose a tiny window where, if you crash at exactly the wrong time and your file system doesn't handle this well, you could see a damaged cache file.

However, if this does happen simply removing the cache file will resolve the issue. And we'd still like to retire use of the cache file by default anyway. With blkid integration it doesn't buy us much.

Better solutions welcome as long as they don't overly complicate the code for this one specific case. I considered a few options, including an upcall to do the rename, but this seemed the cleanest. Frankly this just isn't an operation we should be doing in the kernel.

ryao · 2015-08-07T04:36:43Z

@behlendorf People have talked about retiring the cachefile for a while, but it serves an important purpose on multipath systems that blkid cannot. Namely, knowing which pools are imported and, more importantly, which pools not to import. If you have multiple systems seeing the pool's vdevs, you do not want to import a pool that is imported on another system because that would cause corruption. This is especially true when the hostid is the same (e.g. set to 0, which is our present default behavior). Under no circumstance should the default behavior cause pool corruption, but I see no solid way around this when multiple systems see the same disks and blkid determines what to import. If the system administrator instructs the system to import a pool imported elsewhere, that is fine, but the system should not automatically do it. If the hostid is random, pool import will fail on every boot. That might be a good thing because it would force people to fix their system configurations, but it would complicate setup more than it already is. After having thought about this over the past couple of years, I do not think retiring it is a good idea. Any painful situations caused by the cachefile are not quite as bad as the data loss that could be caused by its absence.

As for fixing the rename support, I will try to take a look at it before the weekend.

behlendorf · 2015-08-18T23:00:20Z

@ryao even on multipath systems you must never rely on the cache file to ensure the right pools are imported. It is the job of a proper high availability package such as corosync or pacemaker to make sure the pool isn't importable concurrently on two systems. The cache file's only real purpose is to speed up pool imports when there are a large number (hundreds) of devices.

Regardless, to support 4.2 kernels we either need to do this (which carries virtually zero risk), or someone needs to propose an alternative.

Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels results in an EACCES error. Rather than attempting to add and maintain more ugly compatibility code it's best to just retire this interface. As a first step the SPLAT test is disabled for Linux 4.2 and newer kernels.

vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs/zfs#3653
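
The (13) at the end of the failure message above is the raw errno value, which on Linux corresponds to EACCES. As a small user-space illustration (describe_errno() is a hypothetical helper, not part of SPL or ZFS), the number can be mapped back to its description like this:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Format an errno value as "<number> (<description>)" into `out`.
 * Returns the number of characters snprintf() would have written.
 */
int
describe_errno(int err, char *out, size_t outlen)
{
	assert(out != NULL && outlen > 0);
	return (snprintf(out, outlen, "%d (%s)", err, strerror(err)));
}
```

Calling describe_errno(EACCES, msg, sizeof (msg)) on a Linux system with the default C locale should yield a permission-denied description for value 13.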

behlendorf · 2015-08-19T23:59:48Z

Merged as:

efc412b Linux 4.2 compat: vfs_rename()


behlendorf added the "Type: Building" and "Difficulty - Easy" labels Jul 30, 2015

behlendorf added this to the 0.6.5 milestone Jul 30, 2015

behlendorf closed this in efc412b Aug 19, 2015

behlendorf deleted the linux-4.2-rename branch May 18, 2018 18:34
