Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linux 4.2 compat: vfs_rename() #3653

Closed
wants to merge 1 commit into from

Conversation

behlendorf
Copy link
Contributor

The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it in to place. This way there can never
exist a partial version of the file, it's all or nothing.

Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms. Even doing
basic I/O to a file system from within the kernel is strongly
discouraged. In order to support this at all the vn_rename()
implementation ends up being complex and fragile. So fragile
that recent Linux 4.2 changes have broken it.

While it is possible to update vn_rename() to work with the
latest kernels a better long term strategy is to stop using
vn_rename() entirely. Then all this complex, fragile code can
be removed. Achieving this is straight forward because
config_write() is the only consumer of vn_rename().

This patch reworks spa_config_write() to update the cache file
in place. The file will be truncated, written out, and then
synced to disk. If an error is encountered the file will be
unlinked leaving the system in a consistent state.

This does expose a tiny tiny tiny window where a system could
crash at exactly the wrong moment could leave a partially written
cache file. However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.

If this were to somehow happen it poses no risk to pool. Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.

Bottom line not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.

Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov

The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it in to place.  This way there can never
exist a partial version of the file, it's all or nothing.

Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms.  Even doing
basic I/O to a file system from within the kernel is strongly
discouraged.  In order to support this at all the vn_rename()
implementation ends up being complex and fragile.  So fragile
that recent Linux 4.2 changes have broken it.

While it is possible to update vn_rename() to work with the
latest kernels a better long term strategy is to stop using
vn_rename() entirely.  Then all this complex, fragile code can
be removed.  Achieving this is straight forward because
config_write() is the only consumer of vn_rename().

This patch reworks spa_config_write() to update the cache file
in place.  The file will be truncated, written out, and then
synced to disk.  If an error is encountered the file will be
unlinked leaving the system in a consistent state.

This does expose a tiny tiny tiny window where a system could
crash at exactly the wrong moment could leave a partially written
cache file.  However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.

If this were to somehow happen it poses no risk to pool.  Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.

Bottom line not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
@behlendorf behlendorf added Type: Building Indicates an issue related to building binaries Difficulty - Easy labels Jul 30, 2015
@behlendorf behlendorf added this to the 0.6.5 milestone Jul 30, 2015
@ryao
Copy link
Contributor

ryao commented Aug 4, 2015

@behlendorf I do not think this will work as you expect on all filesystems. It is possible for in-place filesystems to return partially updated cachefile data following a power loss when they reuse the old data blocks. This will not occur on either ext4 (excluding data=writeback), which will update the data before metadata, or xfs, which will serve zeroes rather than inconsistent data. However, other in-place filesystems are not necessarily quite as strict and could return old, random, or partially updated data. The following page documents the XFS behavior, but states other filesystems do not do this:

You can run into more or less the same problem with any journaling filesystem; the others just don't serve zeroes. Instead, they give you the data that's physically on the medium. Imagine the situation when the corrupt /etc/motd suddenly becomes a window to your previous /etc/shadow contents... I really prefer how XFS handles that. Sometimes you do get the old data back with the other filesystems, but this is because the filesystems may reuse the blocks of the old file. So it's a trade-off, and your choice between security and, uh, convenience.

http://madduck.net/blog/2006.08.11:xfs-zeroes/

@behlendorf
Copy link
Contributor Author

I'm not thrilled with it either. This definitely isn't a perfect solution as I describe in the commit comment. It does expose a tiny window where if you crash at exactly the wrong time and your file system doesn't handle this will you could see a damaged cache file.

However, if this does happen simply removing the cache file will resolve the issue. And we'd still like to retire use of the cache file by default anyway. With blkid integration it doesn't buy us much.

Better solutions welcome as long as they don't overly complicate the code for this one specific case. I considered a few options, including an upcall to do the rename, but this seemed the cleanest. Frankly this just isn't an operation we should be doing in the kernel.

@ryao
Copy link
Contributor

ryao commented Aug 7, 2015

@behlendorf People have talked about retiring the cacheflle for a while, but it serves an important purpose on multipath systems that blkid cannot. Namely, knowing which pools are imported and more importantly, which pools not to import. If you have multiple systems seeing the pool's vdevs, you do not want to import a pool that is imported on another system because that would cause corruption. This is especially true when the hostid is the same (e.g. set to 0, which is our present default behavior). Under no circumstance should the default behavior cause pool corruption, but I see no solid way around this when multiple systems see the same disks and blkid determines what to import. If the system administrator instructs the system to import a pool imported elsewhere, that is fine, but the systmem should not automatically do it. If the hostid is random, pool import will fail on every boot. That might a good thing because it would force people to fix their system configurations, but it would complicate setup more than it already is. After having thought about this over the past couple years, I do not think retiring it is a good idea. Any painful situations caused by the cachefile are not quite as bad as the data loss that could be caused by its absence.

As for fixing the rename support, I will try to take a look at it before the weekend.

@behlendorf
Copy link
Contributor Author

@ryao even on multipath systems you must never rely on the cache file to ensure the right pools are imported. This is the job for proper high availability package such as corosync or pacemaker to make sure the pool isn't importable concurrently on two systems. The cache files only real purpose is to speed up pool imports when there are a large number (100's) of devices.

Regardless, to support 4.2 kernels we either need to do this (which is very very very virtually zero risk). Or someone needs to propose an alternative.

behlendorf added a commit to openzfs/spl that referenced this pull request Aug 19, 2015
Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error.  Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface.  As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.

  vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs/zfs#3653
@behlendorf
Copy link
Contributor Author

Merged as:

efc412b Linux 4.2 compat: vfs_rename()

tomgarcia pushed a commit to tomgarcia/zfs that referenced this pull request Aug 25, 2015
The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it in to place.  This way there can never
exist a partial version of the file, it's all or nothing.

Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms.  Even doing
basic I/O to a file system from within the kernel is strongly
discouraged.  In order to support this at all the vn_rename()
implementation ends up being complex and fragile.  So fragile
that recent Linux 4.2 changes have broken it.

While it is possible to update vn_rename() to work with the
latest kernels a better long term strategy is to stop using
vn_rename() entirely.  Then all this complex, fragile code can
be removed.  Achieving this is straight forward because
config_write() is the only consumer of vn_rename().

This patch reworks spa_config_write() to update the cache file
in place.  The file will be truncated, written out, and then
synced to disk.  If an error is encountered the file will be
unlinked leaving the system in a consistent state.

This does expose a tiny tiny tiny window where a system could
crash at exactly the wrong moment could leave a partially written
cache file.  However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.

If this were to somehow happen it poses no risk to pool.  Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.

Bottom line not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#3653
@behlendorf behlendorf deleted the linux-4.2-rename branch May 18, 2018 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Building Indicates an issue related to building binaries
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants