tar: a.out: file changed as we read it #764

Closed
ryao opened this Issue May 30, 2012 · 14 comments

3 participants

@ryao
ZFS on Linux member

I suspect that there is a race condition in the mmap code. gcc hello.c && scanelf -Xxz -r a.out && tar -cf hello.tgz a.out works with a simple hello world program on any filesystem except ZFS, where tar prints the error message 'tar: a.out: file changed as we read it'.

There is a downstream Gentoo bug tracking this issue:

https://bugs.gentoo.org/show_bug.cgi?id=411555

@behlendorf
ZFS on Linux member

Why in particular do you suspect mmap() rather than something like the inode updates? It's true that when using mmap() ZFS keeps one copy of the data in the page cache and one in the ARC. However, we very carefully keep them in sync, and this is something that xfstests stresses quite heavily.

@ryao
ZFS on Linux member

Anthony G. Basile, one of the Gentoo Hardened developers, and I spent some time debugging this. The following program should be equivalent to scanelf -Xxz -r a.out, but if I substitute it for scanelf -Xxz -r a.out, I cannot reproduce the issue:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("a.out", O_RDWR);
	lseek(fd, 0x10, SEEK_CUR);
	write(fd, "A", 1);
	close(fd);
	return 0;
}

According to strace, the main difference is scanelf's use of mmap(), which is what led me to suspect the mmap() code. I did not consider inode updates while debugging this.

I am not able to give ZFS the attention I would like to give it at the present time. I will look at inode updates when I get a chance.

@maxximino

There's something worse: the problem persists across sync and sleep.

sigmaii tmp # echo 'int main(){}' >hello.c
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sleep 10 && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && sleep 10 && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it

These results are systematic, not the product of a single unlucky run.

The system was idle during this test. The pool has a log device on an OCZ Vertex 3, so transaction commits should complete in far less than 10 seconds. I had 13 GB of RAM free (16 GB total).

The kernel is 3.1.10 and the ZFS version is quite old, around commit zfsonlinux/zfs@ab26409, but it has been so stable and fast that I have felt no urge to upgrade.

@ryao
ZFS on Linux member

As an additional data point, gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out also suffers from this issue. :/

@behlendorf
ZFS on Linux member

This may be relatively harmless (but still a bug we need to fix). My understanding is that tar's "file changed as we read it" warning is emitted when the mtime of the source file changes between the start and end of the file read. It's possible we're not strictly updating the mtime when the file is written via mmap(2), even though the contents of the file are kept properly in sync. When you replace scanelf with your custom app, you get a proper mtime update and the warning goes away.

@maxximino

Frequently, {m,c}time goes backwards!

sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && stat a.out && tar -cf hello.tgz a.out ; stat a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
  File: 'a.out'
  Size: 7831            Blocks: 7          IO Block: 8192   regular file
Device: 10h/16d Inode: 8927439     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-31 20:25:50.305634000 +0200
Modify: 2012-05-31 20:25:50.302329865 +0200
Change: 2012-05-31 20:25:50.302329865 +0200
 Birth: -
tar: a.out: file changed as we read it
  File: 'a.out'
  Size: 7831            Blocks: 7          IO Block: 8192   regular file
Device: 10h/16d Inode: 8927439     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-31 20:25:51.755515000 +0200
Modify: 2012-05-31 20:25:50.300682000 +0200
Change: 2012-05-31 20:25:50.300773000 +0200
 Birth: -
@ryao
ZFS on Linux member

This affects the generation of gcc binary packages on Gentoo. I have known about it for a while, but until recently I could not reproduce it without building GCC on Gentoo.

@ryao
ZFS on Linux member

This is still an issue as of 0.6.0-rc11.

@ryao
ZFS on Linux member

A few observations.

  1. This issue cannot be reproduced using BSD Tar in place of GNU Tar. i.e. gcc hello.c && scanelf -Xxz -r a.out && bsdtar -cf hello.tgz a.out does not exhibit any issues.
  2. Doing a touch operation on the file before executing tar will prevent this issue from occurring. e.g. gcc hello.c && scanelf -Xxz -r a.out && touch a.out && tar -cf hello.tgz a.out
  3. The md5sum of a.out is always consistent.
  4. I cannot reproduce this issue on FreeBSD 9.1-BETA1.

This suggests that this issue is a ZFSOnLinux-specific regression in how mtime updates are handled. It seems that GNU Tar is the only software sensitive to this.

@ryao
ZFS on Linux member

Rick Farina informed me in IRC that this behavior breaks Gentoo's catalyst software when used on ZFS. That software is used by Gentoo's Release Engineering team to generate LiveCDs and stage3 tarballs and breaking it is a fairly serious thing.

Since FreeBSD is unaffected, I took a look at its code. It implements zfs_delmap() to handle this. Unfortunately, Linux's memory management code does not provide similar functionality, so that approach is not portable to Linux.

https://github.com/freebsd/freebsd-head/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c

When munmap() is called, Linux will tear down the mapping, but it will not notify the filesystem until the last open file handle is closed. It should be possible to address this by updating mtime on dirty files when release() is called.

I wrote a patch based on that idea. It made the minimal test case work:

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 89f0f60..2b4ddcd 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -222,6 +222,13 @@ zfs_close(struct inode *ip, int flag, cred_t *cr)
    ZFS_ENTER(zsb);
    ZFS_VERIFY_ZP(zp);

+   /*
+    * Update mtime.
+    * FIXME: This should only be done for dirty znodes.
+    */
+   if (zp->z_is_mapped)
+       zfs_inode_update(zp);
+
    /*
     * Zero the synchronous opens in the znode.  Under Linux the
     * zfs_close() hook is not symmetric with zfs_open(), it is

Unfortunately, I am only checking that the file is mmapped, when I also need to check that it is dirty. The consequence is that any file that is mmapped but not modified will have its mtime updated. I will look into how to check whether the file is dirty later. I would appreciate any comments people have on this.

@behlendorf
ZFS on Linux member

@ryao Your patch seems to fix the issue, but I don't think it quite gets at the heart of the problem. I thought the issue was that zfs_putpage() was only updating the znode and never the inode, which resulted in a lazy inode mtime update that upset tar.

So I tried the following debug patch in the zfs_putpage()->zfs_tstamp_update_setup() path, and to my surprise it didn't resolve the issue.

@@ -1095,6 +1095,7 @@ zfs_tstamp_update_setup(znode_t *zp, uint_t flag, uint64_t

        if (flag & ATTR_MTIME) {
                ZFS_TIME_ENCODE(&now, mtime);
+               ZFS_TIME_DECODE(&ZTOI(zp)->i_mtime, mtime);
                if (ZTOZSB(zp)->z_use_fuids) {
                        zp->z_pflags |= (ZFS_ARCHIVE |
                            ZFS_AV_MODIFIED);

However, your patch does seem to solve the problem, and I don't quite see why. It just ensures that the znode and inode mtime (and all the other attributes) are in sync at the final close. The question is why that isn't already the case with the above patch in place. We must be missing another update somewhere in the code.

@ryao Can you explain what's going on here? What am I missing?

The consequence is that any files that are mmapped, but not modified, will have their mtimes updated.

Yes and no. Their mtimes will be updated but only to reflect the mtime stored in the znode which should be the same as the current inode mtimes.

This whole znode/inode mtime issue is a lingering bit of cleanup for the Linux port. Under Solaris I believe this information wasn't stored in the generic vnode structure, so they put it in the znode structure instead. Under Linux we have an inode which should be the authoritative location for this, but we still have existing code which expects it to be in the znode. Therefore, as a medium-term fix, I added a zfs_inode_update() function to sync the znode fields with the inode fields.

What I've always meant to do is go back and unify the two locations into a single one where possible.

@ryao
ZFS on Linux member

@behlendorf zfs_putpage is invoked when the flush occurs. That happens some time after the final close, and applications expect to see the update on close. Since the known problem cases only involve a single open file handle, updating mtime at final close makes things work. @maxximino commented that the time goes backward, which implies a cached time is being set on flush. I have not spotted the code responsible for that, but it seems mtime will not move into the past if we update it on the final close, which keeps userland happy.

@ryao
ZFS on Linux member

On second thought, the backward time is likely still occurring, but my minimal test case does not detect it. That would explain a report by Rick Farina (Zero_Chaos in IRC) that my patch didn't solve the problem for him.

@ryao
ZFS on Linux member

I have verified that this patch solves the original problem that was detected in Gentoo. It appears that Zero_Chaos had applied my patch to sys-fs/zfs instead of sys-fs/zfs-kmod, which prevented it from having any effect on his system.

@behlendorf behlendorf added a commit that closed this issue Dec 5, 2012
@behlendorf behlendorf Preserve inode mtime/ctime in .writepage()
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().

But more importantly than preserving the exact time, add the missing
call to sa_bulk_update().  This ensures that the znode modifications
are written to disk as part of the transaction.  Without this the
inode may mistakenly roll back to the previous on-disk znode state.

Additionally, for mmap()'ed znodes, explicitly set the atime, mtime,
and ctime on close using the up-to-date values in the inode.  This
is critical because writepage() may occur after close, and on close
we need to ensure the values are correct.

Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
d3aa3ea
@behlendorf behlendorf closed this in d3aa3ea Dec 5, 2012
@behlendorf behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 13, 2012
@behlendorf behlendorf Update SAs when an inode is dirtied
Revert the portion of commit d3aa3ea which always resulted in the
SAs being updated when an mmap()'ed file was closed.  That change
accidentally resulted in unexpected ctime updates which upset tools
like git.  That was always a horrible hack and I'm happy it will
never make it into a tagged release.

The right fix is something I initially resisted doing because I
was worried about the additional overhead.  However, in hindsight
the overhead isn't as bad as I feared.

This patch implements the sops->dirty_inode() callback, which is
unsurprisingly called when an inode is dirtied.  We leverage this
callback to keep the znode SAs strictly in sync with the inode.

However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime.  This will cover the callpath of most concern to us.

  ->filemap_page_mkwrite->file_update_time->update_time->
      mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

Issue #764
Issue #1140

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
f4451de
@behlendorf behlendorf added a commit that referenced this issue Dec 14, 2012
@behlendorf behlendorf Update SAs when an inode is dirtied
(Same commit message as f4451de above.)
Closes #764
Closes #1140
8780c53
@unya unya added a commit to unya/zfs that referenced this issue Dec 13, 2013
@behlendorf behlendorf Preserve inode mtime/ctime in .writepage()
(Same commit message as d3aa3ea above.)
Closes #764
1cf4c51
@unya unya added a commit to unya/zfs that referenced this issue Dec 13, 2013
@behlendorf behlendorf Update SAs when an inode is dirtied
(Same commit message as f4451de above.)
Closes #764
Closes #1140
520e875