tar: a.out: file changed as we read it #764

Closed
ryao opened this Issue May 30, 2012 · 14 comments

3 participants

@ryao
ZFS on Linux member

I suspect that there is a race condition in the mmap code. gcc hello.c && scanelf -Xxz -r a.out && tar -cf hello.tgz a.out works with a simple hello world program on any filesystem except ZFS, where tar prints the error message 'tar: a.out: file changed as we read it'.

There is a downstream Gentoo bug tracking this issue:

https://bugs.gentoo.org/show_bug.cgi?id=411555

@behlendorf
ZFS on Linux member

Why in particular do you suspect mmap() rather than something like the inode updates? It's true that when using mmap() ZFS keeps one copy of the data in the page cache and one in the ARC. However, we very carefully keep them in sync, and this is something that xfstests stresses quite heavily.

@ryao
ZFS on Linux member

Anthony G. Basile, one of the Gentoo Hardened developers, and I spent some time debugging this. The following program should be equivalent to scanelf -Xxz -r a.out, but if I substitute it for scanelf -Xxz -r a.out, I cannot reproduce the issue:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("a.out", O_RDWR);
	lseek(fd, 0x10, SEEK_CUR);
	write(fd, "A", 1);
	close(fd);
	return 0;
}

According to strace, the main difference is scanelf's use of mmap(), which is what led me to suspect the mmap() code. I did not consider inode updates while debugging this.

I am not able to give ZFS the attention I would like to give it at the present time. I will look at inode updates when I get a chance.

@maxximino

There's something worse: the problem persists across sync and sleep.

sigmaii tmp # echo 'int main(){}' >hello.c
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sleep 10 && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it
sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && sleep 10 && sync && sleep 10 && tar -cf hello.tgz a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
tar: a.out: file changed as we read it

These results are systematic, not the product of a single unlucky run.

The system was idle during this test. The pool has a log device on an OCZ Vertex 3, so transaction commits should complete in far less than 10 seconds. I had 13 GB of RAM free (16 GB total).

The kernel is 3.1.10 and the ZFS version is quite old, around commit zfsonlinux/zfs@ab26409, but it has been so stable and fast that I have felt no urge to upgrade.

@ryao
ZFS on Linux member

As an additional data point, gcc hello.c && scanelf -Xxz -r a.out && sync && tar -cf hello.tgz a.out also suffers from this issue. :/

@behlendorf
ZFS on Linux member

This may be relatively harmless (but still a bug we need to fix). My understanding is that tar's "file changed as we read it" warning is emitted when the mtime of the source file changes between the start and end of the file read. It's possible we're not strictly updating the mtime when the file is written via mmap(2), even though the contents of the file are kept properly in sync. When you replace scanelf with your custom app, you get a proper mtime update and the warning goes away.

@maxximino

Frequently, {m,c}time goes backwards!

sigmaii tmp # gcc hello.c && scanelf -Xxz -r a.out && sync && stat a.out && tar -cf hello.tgz a.out ; stat a.out
 TYPE    PAX   FILE 
ET_EXEC ---xer a.out 
  File: 'a.out'
  Size: 7831            Blocks: 7          IO Block: 8192   regular file
Device: 10h/16d Inode: 8927439     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-31 20:25:50.305634000 +0200
Modify: 2012-05-31 20:25:50.302329865 +0200
Change: 2012-05-31 20:25:50.302329865 +0200
 Birth: -
tar: a.out: file changed as we read it
  File: 'a.out'
  Size: 7831            Blocks: 7          IO Block: 8192   regular file
Device: 10h/16d Inode: 8927439     Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2012-05-31 20:25:51.755515000 +0200
Modify: 2012-05-31 20:25:50.300682000 +0200
Change: 2012-05-31 20:25:50.300773000 +0200
 Birth: -
@ryao
ZFS on Linux member

This affects the generation of gcc binary packages on Gentoo. I have known about it for a while, but until recently I could not reproduce it without building GCC on Gentoo.

@ryao
ZFS on Linux member

This is still an issue as of 0.6.0-rc11.

@ryao
ZFS on Linux member

A few observations.

  1. This issue cannot be reproduced using BSD Tar in place of GNU Tar. i.e. gcc hello.c && scanelf -Xxz -r a.out && bsdtar -cf hello.tgz a.out does not exhibit any issues.
  2. Doing a touch operation on the file before executing tar will prevent this issue from occurring. e.g. gcc hello.c && scanelf -Xxz -r a.out && touch a.out && tar -cf hello.tgz a.out
  3. The md5sum of a.out is always consistent.
  4. I cannot reproduce this issue on FreeBSD 9.1-BETA1.

This suggests that this issue is a ZFSOnLinux-specific regression in how mtime updates are handled. It seems that GNU Tar is the only software sensitive to this.

@ryao
ZFS on Linux member

Rick Farina informed me in IRC that this behavior breaks Gentoo's catalyst software when used on ZFS. That software is used by Gentoo's Release Engineering team to generate LiveCDs and stage3 tarballs and breaking it is a fairly serious thing.

Since FreeBSD is unaffected, I took a look at its code. It implements zfs_delmap() to handle this. Unfortunately, Linux's memory management code does not provide similar functionality, so that approach is not portable to Linux.

https://github.com/freebsd/freebsd-head/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c

When munmap() is called, Linux will tear down the mapping, but it will not notify the filesystem until the last open file handle is closed. It should be possible to address this by updating mtime on dirty files when release() is called.

I wrote a patch based on that idea. It made the minimal test case work:

diff --git a/module/zfs/zfs_vnops.c b/module/zfs/zfs_vnops.c
index 89f0f60..2b4ddcd 100644
--- a/module/zfs/zfs_vnops.c
+++ b/module/zfs/zfs_vnops.c
@@ -222,6 +222,13 @@ zfs_close(struct inode *ip, int flag, cred_t *cr)
    ZFS_ENTER(zsb);
    ZFS_VERIFY_ZP(zp);

+   /*
+    * Update mtime.
+    * FIXME: This should only be done for dirty znodes.
+    */
+   if (zp->z_is_mapped)
+       zfs_inode_update(zp);
+
    /*
     * Zero the synchronous opens in the znode.  Under Linux the
     * zfs_close() hook is not symmetric with zfs_open(), it is

Unfortunately, I am only checking that the file is mmapped, when I also need to check that it is dirty. The consequence is that any file that is mmapped but not modified will have its mtime updated. I will look into how to check whether the file is dirty later. I would appreciate any comments people have on this.

@behlendorf
ZFS on Linux member

@ryao Your patch seems to fix the issue, but I don't think it quite gets at the heart of the problem. I thought the issue was that zfs_putpage() was only updating the znode and never the inode, which resulted in a lazy inode mtime update that upset tar.

So I tried the following debug patch in the zfs_putpage()->zfs_tstamp_update_setup() path, and to my surprise it didn't resolve the issue.

@@ -1095,6 +1095,7 @@ zfs_tstamp_update_setup(znode_t *zp, uint_t flag, uint64_t

        if (flag & ATTR_MTIME) {
                ZFS_TIME_ENCODE(&now, mtime);
+               ZFS_TIME_DECODE(&ZTOI(zp)->i_mtime, mtime);
                if (ZTOZSB(zp)->z_use_fuids) {
                        zp->z_pflags |= (ZFS_ARCHIVE |
                            ZFS_AV_MODIFIED);

However, your patch does seem to solve the problem, and I don't quite see why. It just ensures that the znode and inode mtime (and all the other attributes) are in sync at the final close. The question is why that isn't already the case with the above patch in place. We must be missing another update somewhere in the code.

@ryao Can you explain what's going on here? What am I missing?

The consequence is that any files that are mmapped, but not modified, will have their mtimes updated.

Yes and no. Their mtimes will be updated but only to reflect the mtime stored in the znode which should be the same as the current inode mtimes.

This whole znode/inode mtime issue is a lingering bit of cleanup for the Linux port. Under Solaris I believe this information wasn't stored in the generic vnode structure, so they put it in the znode structure instead. Under Linux we have an inode which should be the authoritative location for this, but we still have existing code which expects it to be in the znode. Therefore, as a medium-term fix, I added a zfs_inode_update() function to sync the znode fields with the inode fields.

What I've always meant to do is go back and unify the two locations into a single one where possible.

@ryao
ZFS on Linux member

@behlendorf zfs_putpage is invoked when the flush occurs. That happens some time after the final close, and applications expect to see the update on close. Since the known problem cases only involve a single open file handle, updating mtime at final close makes things work. @maxximino commented that the time goes backward, which implies a cached time is being set on flush. I have not spotted the code responsible for that, but it seems mtime will not move into the past if we update it on the final close, which keeps userland happy.

@ryao
ZFS on Linux member

On second thought, the backward time is likely still occurring, but my minimal test case does not detect it. That would explain a report by Rick Farina (Zero_Chaos in IRC) that my patch didn't solve the problem for him.

@ryao
ZFS on Linux member

I have verified that this patch solves the original problem that was detected in Gentoo. It appears that Zero_Chaos had applied my patch to sys-fs/zfs instead of sys-fs/zfs-kmod, which prevented it from having any effect on his system.

@behlendorf behlendorf added a commit that closed this issue Dec 5, 2012
@behlendorf behlendorf Preserve inode mtime/ctime in .writepage()
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().

But more importantly than preserving the exact time, add the missing
call to sa_bulk_update().  This ensures that the znode modifications
are written to disk as part of the transaction.  Without this the
inode may mistakenly roll back to the previous on-disk znode state.

Additionally, for mmap()'ed znodes, explicitly set the atime, mtime,
and ctime on close using the up-to-date values in the inode.  This
is critical because writepage() may occur after close, and on close
we need to ensure the values are correct.

Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
d3aa3ea
@behlendorf behlendorf closed this in d3aa3ea Dec 5, 2012
@behlendorf behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 13, 2012
@behlendorf behlendorf Update SAs when an inode is dirtied
Revert the portion of commit d3aa3ea which always resulted in the
SAs being updated when an mmap()'ed file was closed.  That change
accidentally resulted in unexpected ctime updates which upset tools
like git.  That was always a horrible hack and I'm happy it will
never make it into a tagged release.

The right fix is something I initially resisted doing because I
was worried about the additional overhead.  However, in hindsight
the overhead isn't as bad as I feared.

This patch implements the sops->dirty_inode() callback, which is
unsurprisingly called when an inode is dirtied.  We leverage this
callback to keep the znode SAs strictly in sync with the inode.

However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime.  This will cover the callpath of most concern to us.

  ->filemap_page_mkwrite->file_update_time->update_time->
      mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

Issue #764
Issue #1140

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
f4451de
@behlendorf behlendorf added a commit that referenced this issue Dec 14, 2012
@behlendorf behlendorf Update SAs when an inode is dirtied
(Same commit message as f4451de above.)
Closes #764
Closes #1140
8780c53
@unya unya added a commit to unya/zfs that referenced this issue Dec 13, 2013
@behlendorf behlendorf Preserve inode mtime/ctime in .writepage()
(Same commit message as d3aa3ea above.)
Closes #764
1cf4c51
@unya unya added a commit to unya/zfs that referenced this issue Dec 13, 2013
@behlendorf behlendorf Update SAs when an inode is dirtied
(Same commit message as f4451de above.)
Closes #764
Closes #1140
520e875