Unlistable and disappearing files #7401

Closed
vbrik opened this issue Apr 6, 2018 · 111 comments · Fixed by #7416
Labels
Type: Regression (indicates a functional regression)

Comments

@vbrik

vbrik commented Apr 6, 2018

System information

Type Version/Name
Distribution Name Scientific Linux
Distribution Version 6.8
Linux Kernel 2.6.32-696.23.1.el6.x86_64
Architecture x86_64
ZFS Version 0.7.7
SPL Version 0.7.7

Describe the problem you're observing

Data loss when copying a directory with a large-ish number of files. For example, cp -r SRC DST with 10000 files in SRC is likely to result in a couple of "cp: cannot create regular file `DST/XXX': No space left on device" error messages, and a few thousand files missing from the listing of the DST directory. (Needless to say, the filesystem actually being full is not the problem.)

The missing files are missing in the sense that they don't appear in the directory listing, but they can still be accessed by name (except for the couple of files for which cp generated the "No space left on device" error). For example:

# ls -l DST | grep FOO | wc -l
0
# ls -l DST/FOO
-rw-r--r-- 1 root root 5 Apr  6 14:59 DST/FOO

The content of DST/FOO is accessible by path (e.g. cat DST/FOO works) and is the same as SRC/FOO. If caches are dropped (echo 3 > /proc/sys/vm/drop_caches) or the machine is rebooted, opening FOO directly by path fails as well.

ls -ld DST reports N fewer hard links than SRC, where N is the number of files for which cp reported the "No space left on device" error.

The names of the missing files are mostly predictable if SRC is small.
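
A quick way to enumerate exactly which names are present in SRC but missing from DST's listing is to diff the two sorted listings; a minimal sketch, assuming the SRC/DST layout from the reproduction below:

# names that SRC lists but DST does not (i.e. the "unlistable" files)
comm -23 <(ls SRC | sort) <(ls DST | sort)
# just count them
comm -23 <(ls SRC | sort) <(ls DST | sort) | wc -l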

Scrub does not find any errors.

I think the problem appeared in 0.7.7, but I am not sure.

Describe how to reproduce the problem

# mkdir SRC
# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
# cp -r SRC DST
cp: cannot create regular file `DST/8442': No space left on device
cp: cannot create regular file `DST/2629': No space left on device
# ls -l
total 3107
drwxr-xr-x 2 root root 10000 Apr  6 15:28 DST
drwxr-xr-x 2 root root 10002 Apr  6 15:27 SRC
# find DST -type f | wc -l 
8186
# ls -l DST | grep 8445 | wc -l
0
# ls -l DST/8445
-rw-r--r-- 1 root root 5 Apr  6 15:28 DST/8445
# cat DST/8445
8445
# echo 3 > /proc/sys/vm/drop_caches
# cat DST/8445
cat: DST/8445: No such file or directory

Include any warning/errors/backtraces from the system logs

# zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 87h47m with 0 errors on Sat Mar 31 07:09:27 2018
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            wwn-0x5000c50085ac4c0f  ONLINE       0     0     0
            wwn-0x5000c50085acda77  ONLINE       0     0     0
            wwn-0x5000c500858db3d7  ONLINE       0     0     0
            wwn-0x5000c50085ac9887  ONLINE       0     0     0
            wwn-0x5000c50085aca6df  ONLINE       0     0     0
          raidz1-1                  ONLINE       0     0     0
            wwn-0x5000c500858db743  ONLINE       0     0     0
            wwn-0x5000c500858db347  ONLINE       0     0     0
            wwn-0x5000c500858db4a7  ONLINE       0     0     0
            wwn-0x5000c500858dbb0f  ONLINE       0     0     0
            wwn-0x5000c50085acaa97  ONLINE       0     0     0
          raidz1-2                  ONLINE       0     0     0
            wwn-0x5000c50085accb4b  ONLINE       0     0     0
            wwn-0x5000c50085acab9f  ONLINE       0     0     0
            wwn-0x5000c50085ace783  ONLINE       0     0     0
            wwn-0x5000c500858db67b  ONLINE       0     0     0
            wwn-0x5000c50085acb983  ONLINE       0     0     0
          raidz1-3                  ONLINE       0     0     0
            wwn-0x5000c50085ac4fd7  ONLINE       0     0     0
            wwn-0x5000c50085acb24b  ONLINE       0     0     0
            wwn-0x5000c50085ace13b  ONLINE       0     0     0
            wwn-0x5000c500858db43f  ONLINE       0     0     0
            wwn-0x5000c500858db61b  ONLINE       0     0     0
          raidz1-4                  ONLINE       0     0     0
            wwn-0x5000c500858dbbb7  ONLINE       0     0     0
            wwn-0x5000c50085acce7f  ONLINE       0     0     0
            wwn-0x5000c50085acd693  ONLINE       0     0     0
            wwn-0x5000c50085ac3d87  ONLINE       0     0     0
            wwn-0x5000c50085acc89b  ONLINE       0     0     0
          raidz1-5                  ONLINE       0     0     0
            wwn-0x5000c500858db28b  ONLINE       0     0     0
            wwn-0x5000c500858db68f  ONLINE       0     0     0
            wwn-0x5000c500858dbadf  ONLINE       0     0     0
            wwn-0x5000c500858db623  ONLINE       0     0     0
            wwn-0x5000c500858db48b  ONLINE       0     0     0
          raidz1-6                  ONLINE       0     0     0
            wwn-0x5000c500858db6ef  ONLINE       0     0     0
            wwn-0x5000c500858db39b  ONLINE       0     0     0
            wwn-0x5000c500858db47f  ONLINE       0     0     0
            wwn-0x5000c500858dbb23  ONLINE       0     0     0
            wwn-0x5000c500858db803  ONLINE       0     0     0
        logs
          zfs-slog                  ONLINE       0     0     0
        spares
          wwn-0x5000c500858db463    AVAIL   

errors: No known data errors
# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank   254T   159T  94.3T         -    27%    62%  1.00x  ONLINE  -
# zfs list -t all
NAME           USED  AVAIL  REFER  MOUNTPOINT
tank           127T  69.0T  11.5T  /mnt/tank
tank/jade      661G  69.0T   661G  /mnt/tank/jade
tank/simprod   115T  14.8T   115T  /mnt/tank/simprod
# zfs get all tank
NAME  PROPERTY              VALUE                  SOURCE
tank  type                  filesystem             -
tank  creation              Sat Jan 20 12:11 2018  -
tank  used                  127T                   -
tank  available             68.9T                  -
tank  referenced            11.6T                  -
tank  compressratio         1.00x                  -
tank  mounted               yes                    -
tank  quota                 none                   default
tank  reservation           none                   default
tank  recordsize            128K                   default
tank  mountpoint            /mnt/tank              local
tank  sharenfs              off                    default
tank  checksum              on                     default
tank  compression           off                    default
tank  atime                 off                    local
tank  devices               on                     default
tank  exec                  on                     default
tank  setuid                on                     default
tank  readonly              off                    default
tank  zoned                 off                    default
tank  snapdir               hidden                 default
tank  aclinherit            restricted             default
tank  createtxg             1                      -
tank  canmount              on                     default
tank  xattr                 sa                     local
tank  copies                1                      default
tank  version               5                      -
tank  utf8only              off                    -
tank  normalization         none                   -
tank  casesensitivity       sensitive              -
tank  vscan                 off                    default
tank  nbmand                off                    default
tank  sharesmb              off                    default
tank  refquota              none                   default
tank  refreservation        none                   default
tank  guid                  2271746520743372128    -
tank  primarycache          all                    default
tank  secondarycache        all                    default
tank  usedbysnapshots       0B                     -
tank  usedbydataset         11.6T                  -
tank  usedbychildren        116T                   -
tank  usedbyrefreservation  0B                     -
tank  logbias               latency                default
tank  dedup                 off                    default
tank  mlslabel              none                   default
tank  sync                  standard               default
tank  dnodesize             legacy                 default
tank  refcompressratio      1.00x                  -
tank  written               11.6T                  -
tank  logicalused           128T                   -
tank  logicalreferenced     11.6T                  -
tank  volmode               default                default
tank  filesystem_limit      none                   default
tank  snapshot_limit        none                   default
tank  filesystem_count      none                   default
tank  snapshot_count        none                   default
tank  snapdev               hidden                 default
tank  acltype               off                    default
tank  context               none                   default
tank  fscontext             none                   default
tank  defcontext            none                   default
tank  rootcontext           none                   default
tank  relatime              off                    default
tank  redundant_metadata    all                    default
tank  overlay               off                    default
# zpool get all tank   
NAME  PROPERTY                       VALUE                          SOURCE
tank  size                           254T                           -
tank  capacity                       62%                            -
tank  altroot                        -                              default
tank  health                         ONLINE                         -
tank  guid                           7056741522691970971            -
tank  version                        -                              default
tank  bootfs                         -                              default
tank  delegation                     on                             default
tank  autoreplace                    on                             local
tank  cachefile                      -                              default
tank  failmode                       wait                           default
tank  listsnapshots                  off                            default
tank  autoexpand                     off                            default
tank  dedupditto                     0                              default
tank  dedupratio                     1.00x                          -
tank  free                           94.2T                          -
tank  allocated                      160T                           -
tank  readonly                       off                            -
tank  ashift                         0                              default
tank  comment                        -                              default
tank  expandsize                     -                              -
tank  freeing                        0                              -
tank  fragmentation                  27%                            -
tank  leaked                         0                              -
tank  multihost                      off                            default
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local
@shodanshok
Contributor

shodanshok commented Apr 6, 2018

I can confirm the same behavior on a minimal CentOS 7.4 installation (running inside VirtualBox) with the latest ZFS 0.7.7. Please note that it does not happen when copying somewhat bigger files (e.g. the kernel source), so it looks like a race condition...
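
For anyone who wants to compare against the bigger-file case mentioned above, a sketch of that variant; the 128 KiB file size and the use of /dev/urandom are arbitrary assumptions:

# same test, but with ~128 KiB files instead of tiny ones
mkdir SRC-big
for i in $(seq 1 10000); do head -c 128K /dev/urandom > SRC-big/$i; done
cp -r SRC-big DST-big
find DST-big -type f | wc -l    # expected: 10000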

; the only changed property was xattr=sa
[root@localhost ~]# zpool list
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank  7.94G  25.3M  7.91G         -     0%     0%  1.00x  ONLINE  -
[root@localhost ~]# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
tank       24.5M  7.67G  3.36M  /tank
tank/test  21.0M  7.67G  21.0M  /tank/test

; creating the source dir on an XFS filesystem
[root@localhost ~]# cd /root/
[root@localhost ~]# mkdir test
[root@localhost ~]# cd test
[root@localhost ~]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done

; copying from XFS to ZFS: no problem at all
[root@localhost ~]# cd /tank/test
[root@localhost test]# cp -r /root/test/SRC/ DST1
[root@localhost test]# cp -r /root/test/SRC/ DST2
[root@localhost test]# cp -r /root/test/SRC/ DST3
[root@localhost test]# find DST1/ | wc -l
10001
[root@localhost test]# find DST2/ | wc -l
10001
[root@localhost test]# find DST3/ | wc -l
10001

; copying from the ZFS dataset itself: big trouble!
[root@localhost test]# rm -rf SRC DST1 DST2 DST3
[root@localhost test]# cp -r /root/test/SRC .
[root@localhost test]# cp -r SRC DST1
cp: cannot create regular file ‘DST1/8809’: No space left on device
[root@localhost test]# cp -r SRC DST2
[root@localhost test]# cp -r SRC DST3
cp: cannot create regular file ‘DST3/6507’: No space left on device
[root@localhost test]# find DST1/ | wc -l
10000
[root@localhost test]# find DST2/ | wc -l
10001
[root@localhost test]# find DST3/ | wc -l
8189

; disabling cache: nothing changes (we continue to "lose" files)
[root@localhost test]# zfs set primarycache=none tank
[root@localhost test]# zfs set primarycache=none tank/test
[root@localhost test]# echo 3 > /proc/sys/vm/drop_caches
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001
[root@localhost test]# rm -rf SRC; mkdir SRC; for i in $(seq 1 10000); do echo $i > SRC/$i ; done; find SRC | wc -l
10001

The problem does NOT appear on ZoL 0.7.6:

; creating the dataset and copying the SRC dir
[root@localhost ~]# zfs create tank/test
[root@localhost ~]# zfs set xattr=sa tank
[root@localhost ~]# zfs set xattr=sa tank/test
[root@localhost ~]# cp -r /root/test/SRC/ /tank/test/
[root@localhost ~]# cd /tank/test/
[root@localhost test]# find SRC/ | wc -l
10001

; more copies
[root@localhost test]# cp -r SRC/ DST
[root@localhost test]# cp -r SRC/ DST1
[root@localhost test]# cp -r SRC/ DST2
[root@localhost test]# cp -r SRC/ DST3
[root@localhost test]# cp -r SRC/ DST4
[root@localhost test]# cp -r SRC/ DST5
[root@localhost test]# find DST | wc -l
10001
[root@localhost test]# find DST1 | wc -l
10001
[root@localhost test]# find DST2 | wc -l
10001
[root@localhost test]# find DST3 | wc -l
10001
[root@localhost test]# find DST4 | wc -l
10001
[root@localhost test]# find DST5 | wc -l
10001

Maybe this can help. Here is the output of zdb -dddddddd tank/test 192784 (a "good" DST directory):

Dataset tank/test [ZPL], ID 74, cr_txg 13, 26.5M, 190021 objects, rootbp DVA[0]=<0:5289e00:200> DVA[1]=<0:65289e00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=123L/123P fill=190021 cksum=d622b78d2:50c053a50d0:fca8cd4455d7:2216d160ee7f7d

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    192784    2   128K    16K   909K     512  1.02M  100.00  ZFS directory (K=inherit) (Z=inherit)
                                               272   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 64
        path    /DST16
        uid     0
        gid     0
        atime   Sat Apr  7 01:11:29 2018
        mtime   Sat Apr  7 01:11:31 2018
        ctime   Sat Apr  7 01:11:31 2018
        crtime  Sat Apr  7 01:11:29 2018
        gen     97
        mode    40755
        size    10002
        parent  34
        links   2
        pflags  40800000144
        SA xattrs: 96 bytes, 1 entries

                security.selinux = unconfined_u:object_r:unlabeled_t:s0\000
        Fat ZAP stats:
                Pointer table:
                        1024 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 10
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 10000
                Leaf blocks: 64
                Total blocks: 65
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x13c18a19
                Leafs with 2^n pointers:
                          4:     64 ****************************************
                Blocks with n*5 entries:
                          9:     64 ****************************************
                Blocks n/10 full:
                          6:      4 ****
                          7:     43 ****************************************
                          8:     16 ***************
                          9:      1 *
                Entries with n chunks:
                          3:  10000 ****************************************
                Buckets with n entries:
                          0:  24119 ****************************************
                          1:   7414 *************
                          2:   1126 **
                          3:    102 *
                          4:      7 *

... and zdb -dddddddd tank/test 202785 (a "bad" DST directory):

Dataset tank/test [ZPL], ID 74, cr_txg 13, 26.5M, 190021 objects, rootbp DVA[0]=<0:5289e00:200> DVA[1]=<0:65289e00:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=123L/123P fill=190021 cksum=d622b78d2:50c053a50d0:fca8cd4455d7:2216d160ee7f7d

    Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
    202785    2   128K    16K   766K     512   896K  100.00  ZFS directory (K=inherit) (Z=inherit)
                                               272   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
        dnode maxblkid: 55
        path    /DST17
        uid     0
        gid     0
        atime   Sat Apr  7 01:12:49 2018
        mtime   Sat Apr  7 01:11:33 2018
        ctime   Sat Apr  7 01:11:33 2018
        crtime  Sat Apr  7 01:11:32 2018
        gen     98
        mode    40755
        size    10001
        parent  34
        links   2
        pflags  40800000144
        SA xattrs: 96 bytes, 1 entries

                security.selinux = unconfined_u:object_r:unlabeled_t:s0\000
        Fat ZAP stats:
                Pointer table:
                        1024 elements
                        zt_blk: 0
                        zt_numblks: 0
                        zt_shift: 10
                        zt_blks_copied: 0
                        zt_nextblk: 0
                ZAP entries: 8259
                Leaf blocks: 55
                Total blocks: 56
                zap_block_type: 0x8000000000000001
                zap_magic: 0x2f52ab2ab
                zap_salt: 0x1bf8e8a3
                Leafs with 2^n pointers:
                          4:     50 ****************************************
                          5:      3 ***
                          6:      2 **
                Blocks with n*5 entries:
                          9:     55 ****************************************
                Blocks n/10 full:
                          5:      6 ******
                          6:      7 *******
                          7:     32 ********************************
                          8:      6 ******
                          9:      4 ****
                Entries with n chunks:
                          3:   8259 ****************************************
                Buckets with n entries:
                          0:  20964 ****************************************
                          1:   6217 ************
                          2:    904 **
                          3:     66 *
                          4:      9 *
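
To run the same zdb comparison on another suspect directory, the object number can usually be taken from the directory's inode number (on ZFS on Linux they generally coincide) and passed to zdb; a sketch, assuming the tank/test dataset and the DST17 directory as above:

obj=$(stat -c %i /tank/test/DST17)    # directory inode number == ZFS object number
zdb -dddd tank/test "$obj" | grep -E 'ZAP entries|^[[:space:]]*size[[:space:]]'

In the "good" directory above the ZAP entry count (10000) matches the number of files copied, while the "bad" one reports only 8259 entries despite a size of 10001.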

@alatteri

alatteri commented Apr 6, 2018

We are also seeing similar behavior since installing 0.7.7.

@siebenmann
Contributor

I have a hand-built ZoL 0.7.7 on a stock Ubuntu 16.04 server (currently with Ubuntu kernel version '4.4.0-109-generic') and I can't reproduce this problem on it, following the reproduction here and some variants (e.g. using 'seq -w' to make all of the filenames the same length). The pool I'm testing against has a single mirrored vdev.

@loli10K added the Type: Regression label on Apr 7, 2018
@rblank
Contributor

rblank commented Apr 7, 2018

One more data point, with the hope that it helps narrow down the issue.

I cannot reproduce the issue on the few machines I have here, neither with 10k files nor with 100k or even 1M. They all have a very similar configuration: a single 2-drive mirrored vdev. The drives are Samsung SSD 950 PRO 512GB (NVMe, quite fast).

$ uname -a
Linux pat 4.9.90-gentoo #1 SMP PREEMPT Tue Mar 27 00:19:59 CEST 2018 x86_64 Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz GenuineIntel GNU/Linux

$ qlist -I -v zfs-kmod
sys-fs/zfs-kmod-0.7.7

$ qlist -I -v spl
sys-kernel/spl-0.7.7

$ zpool status
  pool: pat:pool
 state: ONLINE
  scan: scrub repaired 0B in 0h1m with 0 errors on Sat Apr  7 03:35:12 2018
config:

        NAME                                                 STATE     READ WRITE CKSUM
        pat:pool                                             ONLINE       0     0     0
          mirror-0                                           ONLINE       0     0     0
            nvme0n1p4                                        ONLINE       0     0     0
            nvme1n1p4                                        ONLINE       0     0     0
        spares
          ata-Samsung_SSD_850_EVO_1TB_S2RFNXAH118721D-part8  AVAIL   

errors: No known data errors

$ zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pat:pool   408G   110G   298G         -    18%    26%  1.00x  ONLINE  -

$ zpool get all pat:pool
NAME      PROPERTY                       VALUE                          SOURCE
pat:pool  size                           408G                           -
pat:pool  capacity                       26%                            -
pat:pool  altroot                        -                              default
pat:pool  health                         ONLINE                         -
pat:pool  guid                           16472389984482033769           -
pat:pool  version                        -                              default
pat:pool  bootfs                         -                              default
pat:pool  delegation                     on                             default
pat:pool  autoreplace                    on                             local
pat:pool  cachefile                      -                              default
pat:pool  failmode                       wait                           default
pat:pool  listsnapshots                  off                            default
pat:pool  autoexpand                     off                            default
pat:pool  dedupditto                     0                              default
pat:pool  dedupratio                     1.00x                          -
pat:pool  free                           298G                           -
pat:pool  allocated                      110G                           -
pat:pool  readonly                       off                            -
pat:pool  ashift                         12                             local
pat:pool  comment                        -                              default
pat:pool  expandsize                     -                              -
pat:pool  freeing                        0                              -
pat:pool  fragmentation                  18%                            -
pat:pool  leaked                         0                              -
pat:pool  multihost                      off                            default
pat:pool  feature@async_destroy          enabled                        local
pat:pool  feature@empty_bpobj            active                         local
pat:pool  feature@lz4_compress           active                         local
pat:pool  feature@multi_vdev_crash_dump  enabled                        local
pat:pool  feature@spacemap_histogram     active                         local
pat:pool  feature@enabled_txg            active                         local
pat:pool  feature@hole_birth             active                         local
pat:pool  feature@extensible_dataset     active                         local
pat:pool  feature@embedded_data          active                         local
pat:pool  feature@bookmarks              enabled                        local
pat:pool  feature@filesystem_limits      enabled                        local
pat:pool  feature@large_blocks           enabled                        local
pat:pool  feature@large_dnode            enabled                        local
pat:pool  feature@sha512                 enabled                        local
pat:pool  feature@skein                  enabled                        local
pat:pool  feature@edonr                  enabled                        local
pat:pool  feature@userobj_accounting     active                         local

$ zfs list
NAME                                          USED  AVAIL  REFER  MOUNTPOINT
(...)
pat:pool/home/joe/tmp                        27.9G   285G  27.9G  /home/joe/tmp
(...)

$ zfs get all pat:pool/home/joe/tmp
NAME                   PROPERTY               VALUE                  SOURCE
pat:pool/home/joe/tmp  type                   filesystem             -
pat:pool/home/joe/tmp  creation               Sat Mar 12 17:32 2016  -
pat:pool/home/joe/tmp  used                   27.9G                  -
pat:pool/home/joe/tmp  available              285G                   -
pat:pool/home/joe/tmp  referenced             27.9G                  -
pat:pool/home/joe/tmp  compressratio          1.16x                  -
pat:pool/home/joe/tmp  mounted                yes                    -
pat:pool/home/joe/tmp  quota                  none                   default
pat:pool/home/joe/tmp  reservation            none                   default
pat:pool/home/joe/tmp  recordsize             128K                   default
pat:pool/home/joe/tmp  mountpoint             /home/joe/tmp          inherited from pat:pool/home
pat:pool/home/joe/tmp  sharenfs               off                    default
pat:pool/home/joe/tmp  checksum               on                     default
pat:pool/home/joe/tmp  compression            lz4                    inherited from pat:pool
pat:pool/home/joe/tmp  atime                  off                    inherited from pat:pool
pat:pool/home/joe/tmp  devices                on                     default
pat:pool/home/joe/tmp  exec                   on                     default
pat:pool/home/joe/tmp  setuid                 on                     default
pat:pool/home/joe/tmp  readonly               off                    default
pat:pool/home/joe/tmp  zoned                  off                    default
pat:pool/home/joe/tmp  snapdir                hidden                 default
pat:pool/home/joe/tmp  aclinherit             restricted             default
pat:pool/home/joe/tmp  createtxg              507                    -
pat:pool/home/joe/tmp  canmount               on                     default
pat:pool/home/joe/tmp  xattr                  sa                     inherited from pat:pool
pat:pool/home/joe/tmp  copies                 1                      default
pat:pool/home/joe/tmp  version                5                      -
pat:pool/home/joe/tmp  utf8only               off                    -
pat:pool/home/joe/tmp  normalization          none                   -
pat:pool/home/joe/tmp  casesensitivity        sensitive              -
pat:pool/home/joe/tmp  vscan                  off                    default
pat:pool/home/joe/tmp  nbmand                 off                    default
pat:pool/home/joe/tmp  sharesmb               off                    default
pat:pool/home/joe/tmp  refquota               none                   default
pat:pool/home/joe/tmp  refreservation         none                   default
pat:pool/home/joe/tmp  guid                   10274125767907263189   -
pat:pool/home/joe/tmp  primarycache           all                    default
pat:pool/home/joe/tmp  secondarycache         all                    default
pat:pool/home/joe/tmp  usedbysnapshots        0B                     -
pat:pool/home/joe/tmp  usedbydataset          27.9G                  -
pat:pool/home/joe/tmp  usedbychildren         0B                     -
pat:pool/home/joe/tmp  usedbyrefreservation   0B                     -
pat:pool/home/joe/tmp  logbias                latency                default
pat:pool/home/joe/tmp  dedup                  off                    default
pat:pool/home/joe/tmp  mlslabel               none                   default
pat:pool/home/joe/tmp  sync                   standard               default
pat:pool/home/joe/tmp  dnodesize              legacy                 default
pat:pool/home/joe/tmp  refcompressratio       1.16x                  -
pat:pool/home/joe/tmp  written                27.9G                  -
pat:pool/home/joe/tmp  logicalused            31.6G                  -
pat:pool/home/joe/tmp  logicalreferenced      31.6G                  -
pat:pool/home/joe/tmp  volmode                default                default
pat:pool/home/joe/tmp  filesystem_limit       none                   default
pat:pool/home/joe/tmp  snapshot_limit         none                   default
pat:pool/home/joe/tmp  filesystem_count       none                   default
pat:pool/home/joe/tmp  snapshot_count         none                   default
pat:pool/home/joe/tmp  snapdev                hidden                 default
pat:pool/home/joe/tmp  acltype                posixacl               inherited from pat:pool
pat:pool/home/joe/tmp  context                none                   default
pat:pool/home/joe/tmp  fscontext              none                   default
pat:pool/home/joe/tmp  defcontext             none                   default
pat:pool/home/joe/tmp  rootcontext            none                   default
pat:pool/home/joe/tmp  relatime               off                    default
pat:pool/home/joe/tmp  redundant_metadata     all                    default
pat:pool/home/joe/tmp  overlay                off                    default
pat:pool/home/joe/tmp  net.c-space:snapshots  keep=1M                inherited from pat:pool/home/joe
pat:pool/home/joe/tmp  net.c-space:root       0                      inherited from pat:pool

@alexcrow1974

I get a worse situation on the latest CentOS 7 with kmod:

[root@zirconia test]# mkdir SRC
[root@zirconia test]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
[root@zirconia test]# cp -r SRC DST
cp: cannot create regular file ‘DST/5269’: No space left on device
cp: cannot create regular file ‘DST/9923’: No space left on device
[root@zirconia test]# cat DST/5269
cat: DST/5269: No such file or directory
[root@zirconia test]# cat DST/9923
cat: DST/9923: No such file or directory
[root@zirconia test]# cat DST/9924
9924
[root@zirconia test]# cat DST/9923
cat: DST/9923: No such file or directory
[root@zirconia test]# ls -l DST/9923
ls: cannot access DST/9923: No such file or directory

[root@zirconia test]# zpool status
pool: storage
state: ONLINE
scan: none requested
config:

NAME                                            STATE     READ WRITE CKSUM
storage                                         ONLINE       0     0     0
  raidz1-0                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30KPM0D  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJDDD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJAHD  ONLINE       0     0     0
  raidz1-1                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NGXDD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJ91D  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30LN7GD  ONLINE       0     0     0
  raidz1-2                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJM5D  ONLINE       0     0     0
    ata-HGST_HUS724020ALA640_PN2134P5GAY9PX     ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJD5D  ONLINE       0     0     0
  raidz1-3                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJD8D  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NJHVD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30K5PMD  ONLINE       0     0     0
  raidz1-4                                      ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30NLZLD  ONLINE       0     0     0
    ata-Hitachi_HDS723020BLA642_MN1220F30MVW4D  ONLINE       0     0     0
    ata-HGST_HUS724020ALA640_PN2134P5GBBL9X     ONLINE       0     0     0
logs
  mirror-5                                      ONLINE       0     0     0
    nvme0n1p1                                   ONLINE       0     0     0
    nvme1n1p1                                   ONLINE       0     0     0
cache
  nvme0n1p2                                     ONLINE       0     0     0
  nvme1n1p2                                     ONLINE       0     0     0


@shodanshok
Contributor

@rblank Did you use empty files? Please try the following:

  • cd into your ZFS dataset
  • execute mkdir SRC; for i in $(seq 1 10000); do echo -n > SRC/$i; done; find SRC | wc -l
  • now issue for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done

Thanks.
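
For convenience, the same test wrapped in a single pass/fail loop that only prints mismatches (a sketch; 10001 is the expected find count, 10000 files plus the directory itself):

mkdir SRC; for i in $(seq 1 10000); do echo -n > SRC/$i; done
for i in $(seq 1 10); do
    cp -r SRC DST$i
    n=$(find DST$i | wc -l)
    [ "$n" -eq 10001 ] || echo "DST$i: only $n entries"
done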

@rblank
Contributor

rblank commented Apr 7, 2018

I used the exact commands from the OP (which create non-empty files), only changing 10000 to 100000 and 1000000. But for completeness, I tried yours as well.

$ mkdir SRC; for i in $(seq 1 10000); do echo -n > SRC/$i; done; find SRC | wc -l
10001
$ for i in $(seq 1 10); do cp -r SRC DST$i; find DST$i | wc -l; done
10001
10001
10001
10001
10001
10001
10001
10001
10001
10001

The few data points above weakly hint at raidz, since no one was able to reproduce on mirrors so far.

@alatteri

alatteri commented Apr 7, 2018

On one of my datasets this works fine, while another exhibits the problem. Both datasets belong to the same pool.

bash-4.2$ mkdir SRC
bash-4.2$ for i in $(seq 1 10000); do echo $i > SRC/$i ; done
bash-4.2$ cp -r SRC DST
cp: cannot create regular file ‘DST/222’: No space left on device
cp: cannot create regular file ‘DST/6950’: No space left on device

On beast/engineering the above commands run without issue. On beast/dataio they fail.

bash-4.2$ zfs get all beast/engineering
NAME               PROPERTY               VALUE                  SOURCE
beast/engineering  type                   filesystem             -
beast/engineering  creation               Sun Nov  5 17:53 2017  -
beast/engineering  used                   1.85T                  -
beast/engineering  available              12.0T                  -
beast/engineering  referenced             1.85T                  -
beast/engineering  compressratio          1.04x                  -
beast/engineering  mounted                yes                    -
beast/engineering  quota                  none                   default
beast/engineering  reservation            none                   default
beast/engineering  recordsize             1M                     inherited from beast
beast/engineering  mountpoint             /beast/engineering     default
beast/engineering  sharenfs               on                     inherited from beast
beast/engineering  checksum               on                     default
beast/engineering  compression            lz4                    inherited from beast
beast/engineering  atime                  off                    inherited from beast
beast/engineering  devices                on                     default
beast/engineering  exec                   on                     default
beast/engineering  setuid                 on                     default
beast/engineering  readonly               off                    default
beast/engineering  zoned                  off                    default
beast/engineering  snapdir                hidden                 default
beast/engineering  aclinherit             restricted             default
beast/engineering  createtxg              20615173               -
beast/engineering  canmount               on                     default
beast/engineering  xattr                  sa                     inherited from beast
beast/engineering  copies                 1                      default
beast/engineering  version                5                      -
beast/engineering  utf8only               off                    -
beast/engineering  normalization          none                   -
beast/engineering  casesensitivity        sensitive              -
beast/engineering  vscan                  off                    default
beast/engineering  nbmand                 off                    default
beast/engineering  sharesmb               off                    inherited from beast
beast/engineering  refquota               none                   default
beast/engineering  refreservation         none                   default
beast/engineering  guid                   18311947624891459017   -
beast/engineering  primarycache           metadata               local
beast/engineering  secondarycache         all                    default
beast/engineering  usedbysnapshots        151M                   -
beast/engineering  usedbydataset          1.85T                  -
beast/engineering  usedbychildren         0B                     -
beast/engineering  usedbyrefreservation   0B                     -
beast/engineering  logbias                latency                default
beast/engineering  dedup                  off                    default
beast/engineering  mlslabel               none                   default
beast/engineering  sync                   disabled               inherited from beast
beast/engineering  dnodesize              auto                   inherited from beast
beast/engineering  refcompressratio       1.04x                  -
beast/engineering  written                0                      -
beast/engineering  logicalused            1.92T                  -
beast/engineering  logicalreferenced      1.92T                  -
beast/engineering  volmode                default                default
beast/engineering  filesystem_limit       none                   default
beast/engineering  snapshot_limit         none                   default
beast/engineering  filesystem_count       none                   default
beast/engineering  snapshot_count         none                   default
beast/engineering  snapdev                hidden                 default
beast/engineering  acltype                posixacl               inherited from beast
beast/engineering  context                none                   default
beast/engineering  fscontext              none                   default
beast/engineering  defcontext             none                   default
beast/engineering  rootcontext            none                   default
beast/engineering  relatime               off                    default
beast/engineering  redundant_metadata     all                    default
beast/engineering  overlay                off                    default
beast/engineering  com.sun:auto-snapshot  true                   inherited from beast
bash-4.2$ zfs get all beast/dataio
NAME          PROPERTY               VALUE                  SOURCE
beast/dataio  type                   filesystem             -
beast/dataio  creation               Fri Oct 13 11:13 2017  -
beast/dataio  used                   45.0T                  -
beast/dataio  available              12.0T                  -
beast/dataio  referenced             45.0T                  -
beast/dataio  compressratio          1.09x                  -
beast/dataio  mounted                yes                    -
beast/dataio  quota                  none                   default
beast/dataio  reservation            none                   default
beast/dataio  recordsize             1M                     inherited from beast
beast/dataio  mountpoint             /beast/dataio          default
beast/dataio  sharenfs               on                     inherited from beast
beast/dataio  checksum               on                     default
beast/dataio  compression            lz4                    inherited from beast
beast/dataio  atime                  off                    inherited from beast
beast/dataio  devices                on                     default
beast/dataio  exec                   on                     default
beast/dataio  setuid                 on                     default
beast/dataio  readonly               off                    default
beast/dataio  zoned                  off                    default
beast/dataio  snapdir                hidden                 default
beast/dataio  aclinherit             restricted             default
beast/dataio  createtxg              19156147               -
beast/dataio  canmount               on                     default
beast/dataio  xattr                  sa                     inherited from beast
beast/dataio  copies                 1                      default
beast/dataio  version                5                      -
beast/dataio  utf8only               off                    -
beast/dataio  normalization          none                   -
beast/dataio  casesensitivity        sensitive              -
beast/dataio  vscan                  off                    default
beast/dataio  nbmand                 off                    default
beast/dataio  sharesmb               off                    inherited from beast
beast/dataio  refquota               none                   default
beast/dataio  refreservation         none                   default
beast/dataio  guid                   7216940837685529084    -
beast/dataio  primarycache           all                    default
beast/dataio  secondarycache         all                    default
beast/dataio  usedbysnapshots        0B                     -
beast/dataio  usedbydataset          45.0T                  -
beast/dataio  usedbychildren         0B                     -
beast/dataio  usedbyrefreservation   0B                     -
beast/dataio  logbias                latency                default
beast/dataio  dedup                  off                    default
beast/dataio  mlslabel               none                   default
beast/dataio  sync                   disabled               inherited from beast
beast/dataio  dnodesize              auto                   inherited from beast
beast/dataio  refcompressratio       1.09x                  -
beast/dataio  written                45.0T                  -
beast/dataio  logicalused            49.3T                  -
beast/dataio  logicalreferenced      49.3T                  -
beast/dataio  volmode                default                default
beast/dataio  filesystem_limit       none                   default
beast/dataio  snapshot_limit         none                   default
beast/dataio  filesystem_count       none                   default
beast/dataio  snapshot_count         none                   default
beast/dataio  snapdev                hidden                 default
beast/dataio  acltype                posixacl               inherited from beast
beast/dataio  context                none                   default
beast/dataio  fscontext              none                   default
beast/dataio  defcontext             none                   default
beast/dataio  rootcontext            none                   default
beast/dataio  relatime               off                    default
beast/dataio  redundant_metadata     all                    default
beast/dataio  overlay                off                    default
beast/dataio  com.sun:auto-snapshot  false                  local

@alatteri

alatteri commented Apr 7, 2018

I think the issue is related to primarycache=all. If I set the dataset to primarycache=metadata, there are no errors.
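
A sketch of how to A/B test that property and then restore it afterwards; pool/fs is a placeholder for the dataset being tested:

zfs get primarycache pool/fs          # note the current value and its source
zfs set primarycache=metadata pool/fs
# ... re-run the cp -r SRC DST test here ...
zfs inherit primarycache pool/fs      # put the inherited/default value back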

@shodanshok
Contributor

@rblank I replicated the issue with a simple, single-vdev pool. I'll try a mirror and report back anyway.

@alatteri What pool/vdev layout do you use? Can you show zpool status on both machines? I tried with primarycache=none and it failed, albeit with much lower frequency (i.e. it failed after the 5th copy). I'll try with primarycache=metadata.

@alatteri

alatteri commented Apr 7, 2018

Same machine, different datasets on the same pool.

beast: /nfs/beast/home/alan % zpool status
  pool: beast
 state: ONLINE
  scan: scrub canceled on Fri Mar  2 16:47:01 2018
config:

	NAME                                   STATE     READ WRITE CKSUM
	beast                                  ONLINE       0     0     0
	  raidz2-0                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHN5M1X  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHN5NPX  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHNP9BX  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHN6M4Y  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHNPBLX  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHKY7PX  ONLINE       0     0     0
	  raidz2-1                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG1G8SL  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG1BVVL  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG13K0L  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG1GA9L  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG1G9YL  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG6D9ZS  ONLINE       0     0     0
	  raidz2-2                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG68U3S  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG2WW7S  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHMHVGY  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHKRYUX  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NAHKXMKX  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCG5ZYKS  ONLINE       0     0     0
	  raidz2-3                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGSM01S  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGSY9HS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGTHJUS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGTKV1S  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGTMN4S  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGTGTLS  ONLINE       0     0     0
	  raidz2-4                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGTKUWS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGTG3YS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGTLYZS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGSZ2GS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGSV93S  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE610_NCGT04NS  ONLINE       0     0     0
	  raidz2-5                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HHZGSB  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1GTE6HD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1GU06VD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1GS5KNF  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_NCHA3DZS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_NCHAE5JS  ONLINE       0     0     0
	  raidz2-6                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HJ21DB  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_NCH9WUXS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_NCHAXNTS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_NCHA0DLS  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HJG72B  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HHX19B  ONLINE       0     0     0
	cache
	  nvme0n1                              ONLINE       0     0     0

errors: No known data errors

  pool: pimplepaste
 state: ONLINE
  scan: scrub repaired 0B in 2h38m with 0 errors on Mon Mar 19 00:17:45 2018
config:

	NAME                                   STATE     READ WRITE CKSUM
	pimplepaste                            ONLINE       0     0     0
	  raidz2-0                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVHTBD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVHVSD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVHT1D  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HUYA5D  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVDPMD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZAZDD  ONLINE       0     0     0
	  raidz2-1                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVATKD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZB0ND  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HY6LYD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JT32KD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVAGVD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZBL5D  ONLINE       0     0     0
	  raidz2-2                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HWZ1AD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZAYJD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZ8YMD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVDN8D  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZAKPD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HWZ2ZD  ONLINE       0     0     0
	  raidz2-3                             ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZAX7D  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVHD8D  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVG6ND  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HW7VBD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1HZBHMD  ONLINE       0     0     0
	    ata-HGST_HDN726060ALE614_K1JVB2SD  ONLINE       0     0     0

errors: No known data errors

@rincebrain
Contributor

@vbrik what's the HW config of this system - how much RAM, what model of x86_64 CPU?

@tmcqueen-materials

tmcqueen-materials commented Apr 7, 2018

I can confirm this bug on a mirrored zpool. It is a production system so I didn't do much testing before downgrading to 0.7.6:

pool: ssdzfs-array
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
	still be used, but some features are unavailable. [it is at the 0.6.5.11 features level]
action: Enable all features using 'zpool upgrade'. Once this is done,
	the pool may no longer be accessible by software that does not support
	the features. See zpool-features(5) for details.
  scan: scrub repaired 0B in 0h16m with 0 errors on Sun Apr  1 01:46:59 2018
config:

	NAME                                     STATE     READ WRITE CKSUM
	ssdzfs-array                             ONLINE       0     0     0
	  mirror-0                               ONLINE       0     0     0
	    ata-XXXX-enc  ONLINE       0     0     0
	    ata-YYYY-enc  ONLINE       0     0     0
	  mirror-1                               ONLINE       0     0     0
	    ata-ZZZZ-enc  ONLINE       0     0     0
	    ata-QQQQ-enc  ONLINE       0     0     0

errors: No known data errors
$zfs create ssdzfs-array/tmp
$(run test as previously described; fails about 1/2 the time)
$uname -a
Linux MASKED 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I have attempted to reproduce the bug on 0.7.6 without success. Here is an excerpt showing one of the processors' feature levels:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz
stepping	: 5
microcode	: 0x19
cpu MHz		: 1600.000
cache size	: 8192 KB
physical id	: 0
siblings	: 4
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid dtherm ida
bogomips	: 5333.51
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:
[    1.121288] microcode: CPU3 sig=0x106a5, pf=0x2, revision=0x19

@alexcrow1974

I still get it with primarycache=metadata, on the first attempt to cp:
[root@zirconia ~]# zfs set primarycache=metadata storage/rhev
[root@zirconia ~]# cd /storage/rhev/
[root@zirconia rhev]# ls
export  test
[root@zirconia rhev]# cd test/
[root@zirconia test]# rm -rf DST
[root@zirconia test]# rm -rf SRC/*
[root@zirconia test]# for i in $(seq 1 10000); do echo $i > SRC/$i ; done
[root@zirconia test]# cp -r SRC DST
cp: cannot create regular file ‘DST/5269’: No space left on device
cp: cannot create regular file ‘DST/3759’: No space left on device

@abraunegg
Contributor

For those that have upgraded to the 0.7.7 branch - is it advisable to downgrade back to 0.7.6 until this regression is resolved?

@alatteri

alatteri commented Apr 8, 2018

What is the procedure to downgrade ZFS on CentOS 7.4?

@tmcqueen-materials

tmcqueen-materials commented Apr 8, 2018

For reverts, I usually do:

$ yum history  (identify transaction that installed 0.7.7 over 0.7.6; yum history info XXX can be used to confirm)
$ yum history undo XXX (where XXX is the transaction number identified in the previous step)

Note that with dkms installs, after reverts, I usually find I need to:

$ dkms remove zfs/0.7.6 -k `uname -r`
$ dkms remove spl/0.7.6 -k `uname -r`
$ dkms install spl/0.7.6 -k `uname -r` --force
$ dkms install zfs/0.7.6 -k `uname -r` --force

To make sure all modules are actually happy and loadable on reboot.
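
A quick sanity check of which module versions ended up installed and loaded after the downgrade (a sketch; the /sys paths assume the modules expose a version attribute, which SPL/ZFS builds of this era do):

modinfo zfs | grep -iw version
modinfo spl | grep -iw version
cat /sys/module/zfs/version /sys/module/spl/version   # versions of the currently loaded modules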

@darrenfreeman

Is this seen with rsync instead of cp?
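
For anyone who wants to check, a minimal rsync variant of the same test (assuming the SRC directory from the earlier reproductions):

rsync -a SRC/ DST-rsync/
find DST-rsync -type f | wc -l    # compare against the expected 10000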

@aerusso
Contributor

aerusso commented Apr 8, 2018

I'm not able to reproduce this, and I have several machines (Debian unstable; 0.7.7, Linux 4.15). Can people also include uname -srvmo? Maybe the kernel version is playing a role?

Linux 4.15.0-2-amd64 #1 SMP Debian 4.15.11-1 (2018-03-20) x86_64 GNU/Linux

@shodanshok
Contributor

Ok, I've done some more tests.
System is CentOS 7.4 x86-64 with latest available kernel:

  • single vdev pool: reproduced
  • mirrored pool: reproduced
  • kmod and dkms: reproduced
  • compiled from source [1]: reproduced
  • compression lz4 and off: reproduced
  • primary cache all, metadata and none: reproduced

On an Ubuntu Server 16.04 LTS with compiled 0.7.7 spl+zfs (so not using the repository version), I cannot reproduce the error. As a side note, compiling on Ubuntu does not produce any warnings.

So, the problem seems confined to CentOS/RHEL territory. To me, it looks like a timing/race problem (possibly related to the ARC): anything which increases copy time lowers the error probability/frequency. Some examples of actions which lower the failure rate:

  • cp -a (it copies file attributes)
  • disabling cache
  • copying from a SRC directory on another filesystem (e.g. the root XFS). Note: this seems to completely avoid the problem.

[1] compilation gives the following warnings:

/usr/src/zfs-0.7.7/module/zcommon/zfs_fletcher_avx512.o: warning: objtool: fletcher_4_avx512f_byteswap()+0x4e: can't find jump dest instruction at .text+0x171
/usr/src/zfs-0.7.7/module/zfs/vdev_raidz_math_avx512f.o: warning: objtool: mul_x2_2()+0x24: can't find jump dest instruction at .text+0x39
/usr/src/zfs-0.7.7/module/zfs/vdev_raidz_math_avx512bw.o: warning: objtool: raidz_zero_abd_cb()+0x33: can't find jump dest instruction at .text+0x3d

@aerusso
Contributor

aerusso commented Apr 8, 2018

@shodanshok I'm sorry, I'm having a lot of trouble tracking this piece of information down. What Linux kernel version is CentOS 7.4 on? I assume this is with kernel-3.10.0-693.21.1.el7.x86_64.

Is anyone experiencing this issue with "recent" mainline kernels (like 4.x)?

@cstackpole

cstackpole commented Apr 8, 2018

Greetings,
I have a mirrored pool with the same problem.
Scientific Linux 7.4 (fully updated)
zfs-0.7.7 from zfsonlinux.org repos

$ uname -srvmo
Linux 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 13:12:24 CST 2018 x86_64 GNU/Linux

The output of my yum install:

Running transaction
  Installing : kernel-devel-3.10.0-693.21.1.el7.x86_64                                                                         1/10 
  Installing : kernel-headers-3.10.0-693.21.1.el7.x86_64                                                                       2/10 
  Installing : glibc-headers-2.17-196.el7_4.2.x86_64                                                                           3/10 
  Installing : glibc-devel-2.17-196.el7_4.2.x86_64                                                                             4/10 
  Installing : gcc-4.8.5-16.el7_4.2.x86_64                                                                                     5/10 
  Installing : dkms-2.4.0-1.20170926git959bd74.el7.noarch                                                                      6/10 
  Installing : spl-dkms-0.7.7-1.el7_4.noarch                                                                                   7/10 
Loading new spl-0.7.7 DKMS files...
Building for 3.10.0-693.21.1.el7.x86_64
Building initial module for 3.10.0-693.21.1.el7.x86_64
Done.

spl:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/spl/spl/

splat.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/splat/splat/
Adding any weak-modules

depmod....

DKMS: install completed.
  Installing : zfs-dkms-0.7.7-1.el7_4.noarch                                                                                   8/10 
Loading new zfs-0.7.7 DKMS files...
Building for 3.10.0-693.21.1.el7.x86_64
Building initial module for 3.10.0-693.21.1.el7.x86_64
Done.

zavl:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/avl/avl/

znvpair.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/nvpair/znvpair/

zunicode.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/unicode/zunicode/

zcommon.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zcommon/zcommon/

zfs.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zfs/zfs/

zpios.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/zpios/zpios/

icp.ko:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.10.0-693.21.1.el7.x86_64/extra/icp/icp/
Adding any weak-modules

depmod....

DKMS: install completed.
  Installing : spl-0.7.7-1.el7_4.x86_64                                                                                        9/10 
  Installing : zfs-0.7.7-1.el7_4.x86_64                                                                                       10/10 
  Verifying  : dkms-2.4.0-1.20170926git959bd74.el7.noarch                                                                      1/10 
  Verifying  : zfs-dkms-0.7.7-1.el7_4.noarch                                                                                   2/10 
  Verifying  : zfs-0.7.7-1.el7_4.x86_64                                                                                        3/10 
  Verifying  : spl-0.7.7-1.el7_4.x86_64                                                                                        4/10 
  Verifying  : kernel-devel-3.10.0-693.21.1.el7.x86_64                                                                         5/10 
  Verifying  : glibc-devel-2.17-196.el7_4.2.x86_64                                                                             6/10 
  Verifying  : kernel-headers-3.10.0-693.21.1.el7.x86_64                                                                       7/10 
  Verifying  : gcc-4.8.5-16.el7_4.2.x86_64                                                                                     8/10 
  Verifying  : spl-dkms-0.7.7-1.el7_4.noarch                                                                                   9/10 
  Verifying  : glibc-headers-2.17-196.el7_4.2.x86_64                                                                          10/10 

Installed:
  zfs.x86_64 0:0.7.7-1.el7_4                                                                                                        

Dependency Installed:
  dkms.noarch 0:2.4.0-1.20170926git959bd74.el7                      gcc.x86_64 0:4.8.5-16.el7_4.2                                   
  glibc-devel.x86_64 0:2.17-196.el7_4.2                             glibc-headers.x86_64 0:2.17-196.el7_4.2                         
  kernel-devel.x86_64 0:3.10.0-693.21.1.el7                         kernel-headers.x86_64 0:3.10.0-693.21.1.el7                     
  spl.x86_64 0:0.7.7-1.el7_4                                        spl-dkms.noarch 0:0.7.7-1.el7_4                                 
  zfs-dkms.noarch 0:0.7.7-1.el7_4                                  

Complete!

I am using rsnapshot to do backups. The issue comes up when it runs the equivalent of the command below.

$ /usr/bin/cp -al /bkpfs/Rsnapshot/hourly.0 /bkpfs/Rsnapshot/hourly.1
/usr/bin/cp: cannot create hard link ‘/bkpfs/Rsnapshot/hourly.1/System/home/user/filename’ to ‘/bkpfs/Rsnapshot/hourly.0/System/home/user/filename’: No space left on device

There's plenty of space:

$ df -h /bkpfs/
Filesystem      Size  Used Avail Use% Mounted on
bkpfs           5.0T  4.2T  776G  85% /bkpfs
$ df -i /bkpfs/
Filesystem         Inodes   IUsed      IFree IUse% Mounted on
bkpfs          1631487194 5614992 1625872202    1% /bkpfs
zpool iostat -v bkpfs
                                                  capacity     operations     bandwidth 
pool                                            alloc   free   read  write   read  write
----------------------------------------------  -----  -----  -----  -----  -----  -----
bkpfs                                           4.52T   950G      9      5  25.4K   117K
  mirror                                        1.84T   912G      4      3  22.0K  94.7K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  11.2K  47.4K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  10.8K  47.4K
  mirror                                        2.68T  37.3G      4      2  3.46K  22.2K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  1.71K  11.1K
    ata-Hitachi_HUA723030ALA640                     -      -      2      1  1.75K  11.1K
cache                                               -      -      -      -      -      -
  ata-INTEL_SSDSC2BW120H6                       442M   111G     17      0  9.48K  10.0K
----------------------------------------------  -----  -----  -----  -----  -----  -----
zpool status
  pool: bkpfs
 state: ONLINE
  scan: scrub repaired 0B in 11h17m with 0 errors on Sun Apr  1 05:34:09 2018
config:

	NAME                                            STATE     READ WRITE CKSUM
	bkpfs                                           ONLINE       0     0     0
	  mirror-0                                      ONLINE       0     0     0
	    ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
	    ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
	  mirror-1                                      ONLINE       0     0     0
	    ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
	    ata-Hitachi_HUA723030ALA640                 ONLINE       0     0     0
	cache
	  ata-INTEL_SSDSC2BW120H6                      ONLINE       0     0     0

errors: No known data errors
zfs get all bkpfs
NAME   PROPERTY              VALUE                  SOURCE
bkpfs  type                  filesystem             -
bkpfs  creation              Fri Dec 22 10:34 2017  -
bkpfs  used                  4.52T                  -
bkpfs  available             776G                   -
bkpfs  referenced            4.19T                  -
bkpfs  compressratio         1.00x                  -
bkpfs  mounted               yes                    -
bkpfs  quota                 none                   default
bkpfs  reservation           none                   default
bkpfs  recordsize            128K                   default
bkpfs  mountpoint            /bkpfs                 default
bkpfs  sharenfs              off                    default
bkpfs  checksum              on                     default
bkpfs  compression           off                    default
bkpfs  atime                 on                     default
bkpfs  devices               on                     default
bkpfs  exec                  on                     default
bkpfs  setuid                on                     default
bkpfs  readonly              off                    default
bkpfs  zoned                 off                    default
bkpfs  snapdir               hidden                 default
bkpfs  aclinherit            restricted             default
bkpfs  createtxg             1                      -
bkpfs  canmount              on                     default
bkpfs  xattr                 on                     default
bkpfs  copies                1                      default
bkpfs  version               5                      -
bkpfs  utf8only              off                    -
bkpfs  normalization         none                   -
bkpfs  casesensitivity       sensitive              -
bkpfs  vscan                 off                    default
bkpfs  nbmand                off                    default
bkpfs  sharesmb              off                    default
bkpfs  refquota              none                   default
bkpfs  refreservation        none                   default
bkpfs  guid                  8662648373298485368    -
bkpfs  primarycache          all                    default
bkpfs  secondarycache        all                    default
bkpfs  usedbysnapshots       334G                   -
bkpfs  usedbydataset         4.19T                  -
bkpfs  usedbychildren        234M                   -
bkpfs  usedbyrefreservation  0B                     -
bkpfs  logbias               latency                default
bkpfs  dedup                 off                    default
bkpfs  mlslabel              none                   default
bkpfs  sync                  standard               default
bkpfs  dnodesize             legacy                 default
bkpfs  refcompressratio      1.00x                  -
bkpfs  written               1.38T                  -
bkpfs  logicalused           4.51T                  -
bkpfs  logicalreferenced     4.18T                  -
bkpfs  volmode               default                default
bkpfs  filesystem_limit      none                   default
bkpfs  snapshot_limit        none                   default
bkpfs  filesystem_count      none                   default
bkpfs  snapshot_count        none                   default
bkpfs  snapdev               hidden                 default
bkpfs  acltype               off                    default
bkpfs  context               none                   default
bkpfs  fscontext             none                   default
bkpfs  defcontext            none                   default
bkpfs  rootcontext           none                   default
bkpfs  relatime              off                    default
bkpfs  redundant_metadata    all                    default
bkpfs  overlay               off                    default

For those who want to know my hardware: the system is an AMD X2 255 processor with 8GB of memory (so far more than enough for my home backup system).

I can revert today, or I can help test if someone needs me to try something. Just let me know.

Thanks!

@rincebrain
Contributor

Can someone who can repro this try bisecting the changes between 0.7.6 and 0.7.7 so we can see which commit breaks people?
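
For anyone attempting that, a rough sketch of the bisect loop (assumptions: the release tags are named zfs-0.7.6 and zfs-0.7.7, a matching 0.7.x SPL is already built and installed, and the in-tree scripts/zfs.sh helper is used to swap modules):

$ git clone https://github.com/zfsonlinux/zfs.git && cd zfs
$ git bisect start zfs-0.7.7 zfs-0.7.6   # bad tag first, good tag second
$ ./autogen.sh && ./configure && make -s -j$(nproc)
$ sudo ./scripts/zfs.sh -u && sudo ./scripts/zfs.sh   # unload the old modules, load the freshly built ones
$ (run the cp reproducer on a scratch dataset)
$ git bisect good   # or: git bisect bad
$ (repeat the build/reload/test steps until git bisect prints the first bad commit)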

@loli10K
Contributor

loli10K commented Apr 8, 2018

Most likely cc63068; it seems to be a race condition in the mzap->fzap upgrade phase.

@rincebrain
Contributor

@loli10K this, uh, seems horrendous enough that unless someone volunteers a fix for the race Real Fast, a revert and cutting a point release for this alone seems like it would be merited, to me at least.

@cstackpole

cstackpole commented Apr 8, 2018

@rincebrain I can try later today. I'm meeting some friends for lunch and will be gone for a few hours but I'm happy to help how I can when I get back.
[Edit] To try to bisect the changes that is. :-)

@rincebrain
Contributor

@cstackpole if you do, it's probably worth trying with and without the commit @loli10K pointed to, rather than letting the bisect naturally find it.

@Ringdingcoder

From what we have seen so far it certainly seems to only affect older (by which I mean lower-versioned) kernels. I have not been able to reproduce the issue on Linux 4.15 (Fedora).

aerusso added a commit to aerusso/zfs that referenced this issue Apr 8, 2018
Issue openzfs#7401 identified data loss when many small files
are being copied. Add a test to check for this condition.

Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
behlendorf pushed a commit that referenced this issue Apr 9, 2018
This reverts commit cc63068.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401 
Closes #7416
@ryao
Contributor

ryao commented Apr 9, 2018

Our analysis is not finished. I am reopening this pending the completion of our analysis.

@ryao ryao reopened this Apr 9, 2018
@behlendorf
Contributor

Right, I didn't mean to suggest that this issue should be closed or that reverting the change was all that was needed. There's still clearly careful investigation to be done, which we can now focus on.

@ryao when possible, rolling back to a snapshot would be the cleanest way to recover these files. However, since that won't always be an option, let's investigate implementing a generic orphan recovery mechanism. Adding this functionality initially to zdb would allow us to check existing datasets, and would provide nice additional test coverage for ztest to leverage. We could potentially follow this up with support for a .zfs/lost+found directory.

tonyhutter added a commit to zfsonlinux/zfsonlinux.github.com that referenced this issue Apr 10, 2018
Remove 0.7.7 links due to a regression:
openzfs/zfs#7401

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
tonyhutter added a commit that referenced this issue Apr 10, 2018
This reverts commit cc63068.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401
Closes #7416
matthewbauer added a commit to matthewbauer/nixpkgs that referenced this issue Apr 10, 2018
@darrenfreeman

Given the improved understanding of the cause of this regression, can anything be said about the behaviour of rsync? If it reports no errors, are the data fine?

What about mv? And what if mv is from one dataset to another, on the same pool?

@rincebrain
Contributor

rincebrain commented Apr 10, 2018

@darrenfreeman The mailing list or IRC chatroom would probably be a better place to ask, but

  • rsync should be fine, since I think it should bail out on e.g. rsync -a src/ dst/ as soon as it gets ENOSPC, and not try any additional files (a quick way to double-check a finished copy is sketched after this list)
  • mv across datasets on a pool is just like mv across other filesystems (cp then rm), so I would guess it might be subject to the same caveats about version peculiarities as cp above, but I haven't tested that.

Also, one final caveat:

  • knowledge, particularly about how much vulnerability exists for files that get lost in the metaphorical shuffle after getting back ENOSPC, is incomplete, so it's safest to revert versions (or bump once 0.7.8 is cut) if at all possible, and everything above is based on incomplete information.
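
If you want to sanity-check a copy that finished without errors, a generic comparison is enough (just a sketch, nothing ZFS-specific; SRC and DST stand in for your directories):

$ diff <(cd SRC && find . -type f | sort) <(cd DST && find . -type f | sort)   # compare the lists of file names
$ diff -r SRC DST   # or compare file contents as well (slower)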

@Ringdingcoder

rsync always sorts files, so it should be fine. And as long as you don't receive errors, you should be fine.
Since data is not silently lost, this is not the worst-case catastrophic bug, just a major annoyance. The most inconvenient part is the orphaned files, but fortunately they are tied to their respective datasets, not to the entire pool, and can be gotten rid of by rolling back or re-creating the affected datasets.

@darrenfreeman

darrenfreeman commented Apr 10, 2018

Reproducibility: yes
ZoL version: git, recent commit, 10adee2
Distribution: Ubuntu 17.10
Kernel Version: 4.13.0-38-generic
Coreutils Version: 8.26-3ubuntu4
SELinux status: not installed AFAICT

Reproduced using: ./zap-collision-test.sh .

Furthermore, this didn't look good:

rm -Rf DST
Segmentation fault (core dumped)

The pool was freshly created as,

zfs create rpool/test -o recordsize=4k
truncate -s 1G /rpool/test/file
zpool create test /rpool/test/file -o ashift=12

I am trying to install the debug symbols for rm, however I am now also getting segfaults when not even touching this zpool. (apt-key is segfaulting when trying to trust the debug repo.) So I fear I better push the comment button now and reboot :/

Update: can't reproduce the segfault on rm -Rf DST, after rebooting and installing debug symbols.

@markus2120

markus2120 commented Apr 10, 2018

Thanks for the solutions and quick efforts to fix.
Are there any methods to check a complete filesystem for affected files? I do have backups - can anyone give me a one-liner to list them?

@markdesouza

Given this bug has now been listed on The Register (https://www.theregister.co.uk/2018/04/10/zfs_on_linux_data_loss_fixed/), it might be wise to have an FAQ article on the wiki page (with a link in this ticket). The FAQ article should clearly state which versions of ZoL are affected and which distros/kernel versions (similar to the hole_birth bug). This would hopefully limit any panic about the reliability of ZoL as a storage layer.

@durval-menezes

durval-menezes commented Apr 10, 2018

Given this bug has now been listed on The Register (https://www.theregister.co.uk/2018/04/10/zfs_on_linux_data_loss_fixed/)

From that article (emphasis mine):
"So even though three reviewers signed off on the cruddy commit, the speedy response may mean it’s possible to consider this a triumph of sorts for open source."

Ouch.

I agree with @markdesouza that there should be a FAQ article for this, so we ZFS apologists can point anyone who questions us about it there. I would also like to suggest that the ZFS sign-off procedure be reviewed to make it much harder (or at least far less likely) for such a "cruddy commit" to make it into a ZFS stable release, and that notice of this review also be added to that same FAQ article.

srhb pushed a commit to srhb/nixpkgs that referenced this issue Apr 10, 2018
Due to upstream data loss bug: openzfs/zfs#7401
@aerusso
Contributor

aerusso commented Apr 10, 2018

In #7411, the random_creation test looks like it may be a more robust reproducer (especially for future bugs) because it naturally relies on the ordering of the ZAP hashes. Also, if there are other reproducers, it might be a good idea to centralize discussion of them in that PR so they can be easily included.

@darrenfreeman

darrenfreeman commented Apr 10, 2018

Answering my earlier question. Debian 9.3 as above.

rsync doesn't hit the bug; it creates files in lexical order (i.e. file 999 is followed by 9990). In a very small number of tests, I didn't find a combination of switches that would fail.

So anyone who prefers rsync should have a pretty good chance of having missed the bug.

Something similar to mv /pool/dataset1/SRC /pool/dataset2/ also didn't fail. (Move between datasets within the same pool.) Although, on the same box, cp doesn't fail either, so that doesn't prove much.
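
To see the ordering difference for yourself (a quick illustration, assuming GNU coreutils): ls -U prints entries in raw readdir order, which on ZFS is ZAP hash order, while plain ls sorts the names much as rsync does:

$ ls -U SRC | head   # readdir (hash) order: the order cp -r walks the source directory
$ ls SRC | head      # sorted order: roughly the order in which rsync creates files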

@tonyhutter
Contributor

FYI - you probably all saw it already, but we released zfs-0.7.8 with the reverted patch last night.

@ryao
Contributor

ryao commented Apr 10, 2018

@ort163 We do not have a one-liner yet. People are continuing to analyze the issue and we will have a proper fix in the near future. That will include a way to detect and correct the wrong directory sizes, list the affected snapshots, and place the orphaned files in some kind of lost+found directory. I am leaning toward extending scrub to do it.
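
Until then, a very crude heuristic is possible (only a sketch, not the proper tool described above; it assumes ZFS reports a directory's entry count, plus two, as its size, and /tank/fs stands in for your mountpoint): flag directories whose reported size is larger than the number of entries readdir actually returns.

find /tank/fs -xdev -type d | while read -r dir; do
    size=$(stat -c %s "$dir")           # on ZFS the directory "size" tracks the number of entries
    entries=$(ls -1A "$dir" | wc -l)    # entries that readdir actually returns (excluding . and ..)
    [ "$size" -gt $((entries + 2)) ] && echo "suspect: $dir (size=$size, visible entries=$entries)"
done                                    # rough check only; breaks on file names containing newlines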

@ryao
Contributor

ryao commented Apr 10, 2018

@markdesouza I have spent a fair amount of time explaining things to end users on Hacker News, Reddit and Phoronix. I do not think that our understanding is sufficient to post a final FAQ yet, but we could post an interim FAQ.

I think the interim FAQ entry should advise users to upgrade ASAP to avoid having to possibly deal with orphaned files if nothing has happened yet, or more orphaned files if something has already happened; and not to change how they do things after upgrading unless they deem it necessary until we finish our analysis, make a proper fix, and issue proper instructions on how to repair the damage in the release notes. I do not think there is any harm to pools if datasets have incorrect directory sizes and orphaned files while people wait for us to release a proper fix with instructions on how to completely address the issue, so telling them to wait after upgrading should be fine. The orphan files should stay around and persist through send/recv unless snapshot rollback is done or the dataset is destroyed.

Until that is up, you could point users to my hacker news post:

https://news.ycombinator.com/item?id=16797932

Specifically, we need to nail down: whether existing files' directory entries could be lost; what, if any, other side effects happen when this is triggered on new file creation; what course of events leads to directory entries disappearing after ENOSPC; how system administrators can detect it; and how they can repair it. Then we should be able to make a proper FAQ entry.

Edit: The first 3 questions are answered satisfactorily in #7421.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Aug 15, 2018
Commit cc63068 caused ENOSPC error when copy a large amount of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries, and return ENOSPC when failed.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time. Which
means that if the leaf block in source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.

Since cc63068 has already been reverted. This patch adds it back and
removes the retry limit.

Also, as it turn out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hit any error
halfway through, the remaining entries will be lost, causing those files
to become orphan. This patch add a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes openzfs#7401 
Closes openzfs#7421
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Aug 23, 2018
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 5, 2018