Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement database to workaround misreported physical sector sizes #3

Closed
wants to merge 1 commit into from

Conversation

ryao
Copy link
Owner

@ryao ryao commented Aug 13, 2013

This implements vdev_bdev_database_check(). It alters the detected
sector size of any device listed in a database of drives known to lie
about their physical sector sizes.

This is based on "6931570 Add flash devices' VID/PID to disk table to
advertising 4K physical sector size" from Open Solaris and on
sg_simple4.c from sg3_utils. About two dozen lines are taken from
sg_simple4.c, which is GPLv2 licensed. However, sg_simple4.c is
analogous to a Hello World program and is safe for us to use. We
requested that Douglas Gilbert, the author of sg_simple4.c, confirm that
this is the case. A cutdown version of his response is as follows:

I would consider a SCSI INQUIRY example using the Linux sg
driver interface (also written by me) as the equivalent of an
"hello world" program in C.

The database was created with the help of the freenode and ZFSOnLinux
communities.

Some notes:

  1. The following drives both were confirmed to lie via reports in IRC
    and they contain capacity information in their identifiers:

INTEL SSDSA2M080
INTEL SSDSA2M160
M4-CT256M4SSD2
WDC WD15EARS-00S
WDC WD15EARS-00Z
WDC WD20EARS-00M

The identifiers for different capacity models were extrapolated and
added under the assumption that those models also lie. Google was used
to verify that the extrapolated drive identifiers existed prior to their
inclusion.

  1. The OCZ-VERTEX2 3.5 identifer applies to two drives that differ
    solely in page size (and slightly in capacity). One uses 4096-byte pages
    and the other uses 8192-byte pages. Both are set to use 8192-byte pages.
    We could detect the page size by checking the capacity, but that would
    unnecessarily complicate the code.
  2. It is possible for updated drive firmware to correctly report the
    sector size. There were reports of a few advanced format drives doing
    that. One report stated that the vendor changed the identification
    string while another was unclear on this. Both reports involved WDC
    models.
  3. Google was used to determine the size of pages in the listed flash
    devices. Reports of 8192-byte pages took precedence over reports of
    4096-byte pages.
  4. Devices behind USB adapters can have their identification strings
    altered. Identification strings obtained across USB adapters are
    omitted and no attempt is made to correct for alterations made by USB
    adapters when doing comparisons against the database. Two entries in the
    Open Solaris database that appear to have been altered by a USB
    adapter were omitted.

Signed-off-by: Richard Yao ryao@gentoo.org

This implements vdev_bdev_database_check(). It alters the detected
sector size of any device listed in a database of drives known to lie
about their physical sector sizes.

This is based on "6931570 Add flash devices' VID/PID to disk table to
advertising 4K physical sector size" from Open Solaris and on
sg_simple4.c from sg3_utils. About two dozen lines are taken from
sg_simple4.c, which is GPLv2 licensed. However, sg_simple4.c is
analogous to a Hello World program and is safe for us to use. We
requested that Douglas Gilbert, the author of sg_simple4.c, confirm that
this is the case. A cutdown version of his response is as follows:

```
I would consider a SCSI INQUIRY example using the Linux sg
driver interface (also written by me) as the equivalent of an
"hello world" program in C.
```

The database was created with the help of the freenode and ZFSOnLinux
communities.

Some notes:

1. The following drives both were confirmed to lie via reports in IRC
and they contain capacity information in their identifiers:

INTEL SSDSA2M080
INTEL SSDSA2M160
M4-CT256M4SSD2
WDC WD15EARS-00S
WDC WD15EARS-00Z
WDC WD20EARS-00M

The identifiers for different capacity models were extrapolated and
added under the assumption that those models also lie. Google was used
to verify that the extrapolated drive identifiers existed prior to their
inclusion.

2. The OCZ-VERTEX2 3.5 identifer applies to two drives that differ
solely in page size (and slightly in capacity). One uses 4096-byte pages
and the other uses 8192-byte pages. Both are set to use 8192-byte pages.
We could detect the page size by checking the capacity, but that would
unnecessarily complicate the code.

3. It is possible for updated drive firmware to correctly report the
sector size. There were reports of a few advanced format drives doing
that. One report stated that the vendor changed the identification
string while another was unclear on this. Both reports involved WDC
models.

4. Google was used to determine the size of pages in the listed flash
devices. Reports of 8192-byte pages took precedence over reports of
4096-byte pages.

5. Devices behind USB adapters can have their identification strings
altered. Identification strings obtained across USB adapters are
omitted and no attempt is made to correct for alterations made by USB
adapters when doing comparisons against the database. Two entries in the
Open Solaris database that appear to have been altered by a USB
adapter were omitted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
@ryao ryao closed this Aug 13, 2013
ryao added a commit that referenced this pull request Jul 23, 2015
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS beause there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encunter another hurdle. In
specific, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Mac OS X does not
implement O_DIRECT, but it does implement F_NOCACHE, which is similiar
to #2 in that it prevents new data from being cached. AIX relaxs #3 by
only committing the file data to disk. Metadata updates required should
the operations make the file larger are asynchronous unless O_DSYNC is
specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added a commit that referenced this pull request Jul 23, 2015
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS beause there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encunter another hurdle. In
specific, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Mac OS X does not
implement O_DIRECT, but it does implement F_NOCACHE, which is similiar
to #2 in that it prevents new data from being cached. AIX relaxes #3 by
only committing the file data to disk. Metadata updates required should
the operations make the file larger are asynchronous unless O_DSYNC is
specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added a commit that referenced this pull request Jul 23, 2015
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS beause there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encunter another hurdle. In
specific, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Other platforms
implement varying subsets of the XFS semantics. FreeBSD does not
implement #1 and might not implement others (not checked). Mac OS X does
not implement O_DIRECT, but it does implement F_NOCACHE, which is
similiar to #2 in that it prevents new data from being cached. AIX
relaxes #3 by only committing the file data to disk. Metadata updates
required should the operations make the file larger are asynchronous
unless O_DSYNC is specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added a commit that referenced this pull request Jul 23, 2015
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS beause there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encunter another hurdle. In
specific, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Other platforms
implement varying subsets of the XFS semantics. FreeBSD does not
implement #1 and might not implement others (not checked). Mac OS X does
not implement O_DIRECT, but it does implement F_NOCACHE, which is
similiar to #2 in that it prevents new data from being cached. AIX
relaxes #3 by only committing the file data to disk. Metadata updates
required should the operations make the file larger are asynchronous
unless O_DSYNC is specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added a commit that referenced this pull request Jul 24, 2015
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS beause there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encunter another hurdle. In
specific, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Other platforms
implement varying subsets of the XFS semantics. FreeBSD does not
implement #1 and might not implement others (not checked). Mac OS X does
not implement O_DIRECT, but it does implement F_NOCACHE, which is
similiar to #2 in that it prevents new data from being cached. AIX
relaxes #3 by only committing the file data to disk. Metadata updates
required should the operations make the file larger are asynchronous
unless O_DSYNC is specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added a commit that referenced this pull request Jul 24, 2015
DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS beause there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encunter another hurdle. In
specific, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Other platforms
implement varying subsets of the XFS semantics. FreeBSD does not
implement #1 and might not implement others (not checked). Mac OS X does
not implement O_DIRECT, but it does implement F_NOCACHE, which is
similiar to #2 in that it prevents new data from being cached. AIX
relaxes #3 by only committing the file data to disk. Metadata updates
required should the operations make the file larger are asynchronous
unless O_DSYNC is specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added a commit that referenced this pull request Dec 3, 2015
(gdb) bt
\#0  0x00007f31952a35d7 in raise () from /lib64/libc.so.6
\#1  0x00007f31952a4cc8 in abort () from /lib64/libc.so.6
\#2  0x00007f319529c546 in __assert_fail_base () from /lib64/libc.so.6
\#3  0x00007f319529c5f2 in __assert_fail () from /lib64/libc.so.6
\#4  0x00007f319529c66b in __assert () from /lib64/libc.so.6
\#5  0x00007f319659e024 in send_iterate_prop (zhp=zhp@entry=0x19cf500, nv=0x19da110) at libzfs_sendrecv.c:724
\openzfs#6  0x00007f319659e35d in send_iterate_fs (zhp=zhp@entry=0x19cf500, arg=arg@entry=0x7ffc515d6380) at libzfs_sendrecv.c:762
\openzfs#7  0x00007f319659f3ca in gather_nvlist (hdl=<optimized out>, fsname=fsname@entry=0x19cf250 "tpool/test", fromsnap=fromsnap@entry=0x7ffc515d7531 "snap1", tosnap=tosnap@entry=0x7ffc515d7542 "snap2", recursive=B_FALSE,
    nvlp=nvlp@entry=0x7ffc515d6470, avlp=avlp@entry=0x7ffc515d6478) at libzfs_sendrecv.c:809
\openzfs#8  0x00007f31965a408f in zfs_send (zhp=zhp@entry=0x19cf240, fromsnap=fromsnap@entry=0x7ffc515d7531 "snap1", tosnap=tosnap@entry=0x7ffc515d7542 "snap2", flags=flags@entry=0x7ffc515d6d30, outfd=outfd@entry=1,
    filter_func=filter_func@entry=0x0, cb_arg=cb_arg@entry=0x0, debugnvp=debugnvp@entry=0x0) at libzfs_sendrecv.c:1461
\openzfs#9  0x000000000040a981 in zfs_do_send (argc=<optimized out>, argv=0x7ffc515d6ff0) at zfs_main.c:3841
\openzfs#10 0x0000000000404d10 in main (argc=6, argv=0x7ffc515d6fc8) at zfs_main.c:6724
(gdb) fr 5
\#5  0x00007f319659e024 in send_iterate_prop (zhp=zhp@entry=0x19cf500, nv=0x19da110) at libzfs_sendrecv.c:724
724                             verify(nvlist_lookup_uint64(propnv,
(gdb) list
719                             verify(nvlist_lookup_string(propnv,
720                                 ZPROP_VALUE, &value) == 0);
721                             VERIFY(0 == nvlist_add_string(nv, propname, value));
722                     } else {
723                             uint64_t value;
724                             verify(nvlist_lookup_uint64(propnv,
725                                 ZPROP_VALUE, &value) == 0);
726                             VERIFY(0 == nvlist_add_uint64(nv, propname, value));
727                     }
728             }
(gdb) p prop
$1 = ZFS_PROP_RELATIME
ryao pushed a commit that referenced this pull request Sep 12, 2016
The original code will do an out-of-bound access on pl[] during last
iteration.

 ==================================================================
 BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs]
 Read of size 8 by task tmpfile/7850
 page:ffffea00017c6dc0 count:0 mapcount:0 mapping:          (null) index:0x0
 flags: 0xffff8000000000()
 page dumped because: kasan: bad access detected
 CPU: 3 PID: 7850 Comm: tmpfile Tainted: G           OE   4.6.0+ #3
  ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618
  ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8
  ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3
 Call Trace:
  [<ffffffff81635618>] dump_stack+0x63/0x8b
  [<ffffffff81313ee8>] kasan_report_error+0x528/0x560
  [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0
  [<ffffffff813144b8>] kasan_report+0x58/0x60
  [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffff81312e4e>] __asan_load8+0x5e/0x70
  [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs]
  [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs]

  [<ffffffff81353c3a>] SyS_execve+0x3a/0x50
  [<ffffffff810058ef>] do_syscall_64+0xef/0x180
  [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25
 Memory state around the buggy address:
  ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4
                                                                 ^
  ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00
  ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ==================================================================

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#4705
Issue openzfs#4708
ryao pushed a commit that referenced this pull request Sep 12, 2016
Leaks reported by using AddressSanitizer, GCC 6.1.0

Direct leak of 4097 byte(s) in 1 object(s) allocated from:
    #1 0x414f73 in process_options cmd/ztest/ztest.c:721

Direct leak of 5440 byte(s) in 17 object(s) allocated from:
    #1 0x41bfd5 in umem_alloc ../../lib/libspl/include/umem.h:88
    #2 0x41bfd5 in ztest_zap_parallel cmd/ztest/ztest.c:4659
    #3 0x4163a8 in ztest_execute cmd/ztest/ztest.c:5907

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#4896
ryao pushed a commit that referenced this pull request Jan 3, 2017
- Fix autoreplace behaviour on statechange-led.sh script.

ZED sends the following events on an auto-replace:

1. statechange: Disk goes UNAVAIL->ONLINE
2. statechange: Disk goes ONLINE->UNAVAIL
3. vdev_attach: Disk goes ONLINE

Events 1-2 happen when ZED first attempts to do an auto-online.  When that
fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.

In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE
transition to turn off the LED.  It ignored the #2 ONLINE->UNAVAIL transition,
assuming it was just the "old" VDEV going offline.  This is problematic, as
a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want
to ignore that.

This new patch correctly turns on the fault LED every time a drive becomes
UNAVAIL.  It also monitors vdev_attach events to trigger turning off the LED
when an auto-replaced disk comes online.

- Remove unnecessary libdevmapper warning with --with-config=kernel

This fixes an unnecessary libdevmapper warning when building
--with-config=kernel.  Kernel code does not use libdevmapper, so the warning
is not needed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes openzfs#2375 
Closes openzfs#5312 
Closes openzfs#5331
ryao pushed a commit that referenced this pull request Apr 8, 2017
… pool

When importing a pool with a large number of filesystems within the same
parent filesystem, we see that dmu_objset_find_dp() takes a long time.
It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
and spa_load_verify().

There are several ways to improve performance here:

    1. We don't really need to do spa_check_logs() or
       spa_ld_claim_log_blocks() if the pool was closed cleanly.

    2. spa_load_verify() uses dmu_objset_find_dp() to check that no
       datasets have too long of names.

    3. dmu_objset_find_dp() is slow because it's doing
       zap_value_search() (which is O(N sibling datasets)) to determine
       the name of each dsl_dir when it's opened. In this case we
       actually know the name when we are opening it, so we can provide
       it and avoid the lookup.

This change implements fix #3 from the above list; i.e. make
dmu_objset_find_dp() provide the name of the dataset so that we don't
have to search for it.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7606
OpenZFS-commit: openzfs/openzfs@cac6bab
Closes openzfs#5662
ryao pushed a commit that referenced this pull request Jun 29, 2020
After spa_vdev_remove_aux() is called, the config nvlist is no longer
valid, as it's been replaced by the new one (with the specified device
removed).  Therefore any pointers into the nvlist are no longer valid.
So we can't save the result of
`fnvlist_lookup_string(nv, ZPOOL_CONFIG_PATH)` (in vd_path) across the
call to spa_vdev_remove_aux().

Instead, use spa_strdup() to save a copy of the string before calling
spa_vdev_remove_aux.

Found by AddressSanitizer:

ERROR: AddressSanitizer: heap-use-after-free on address ...
READ of size 34 at 0x608000a1fcd0 thread T686
    #0 0x7fe88b0c166d  (/usr/lib/x86_64-linux-gnu/libasan.so.4+0x5166d)
    #1 0x7fe88a5acd6e in spa_strdup spa_misc.c:1447
    #2 0x7fe88a688034 in spa_vdev_remove vdev_removal.c:2259
    #3 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    #4 0x55ffbc769fba in ztest_execute ztest.c:6714
    #5 0x55ffbc779a90 in ztest_thread ztest.c:6761
    openzfs#6 0x7fe889cbc6da in start_thread
    openzfs#7 0x7fe8899e588e in __clone

0x608000a1fcd0 is located 48 bytes inside of 88-byte region
freed by thread T686 here:
    #0 0x7fe88b14e7b8 in __interceptor_free
    #1 0x7fe88ae541c5 in nvlist_free nvpair.c:874
    #2 0x7fe88ae543ba in nvpair_free nvpair.c:844
    #3 0x7fe88ae57400 in nvlist_remove_nvpair nvpair.c:978
    #4 0x7fe88a683c81 in spa_vdev_remove_aux vdev_removal.c:185
    #5 0x7fe88a68857c in spa_vdev_remove vdev_removal.c:2221
    openzfs#6 0x55ffbc7748f8 in ztest_vdev_aux_add_remove ztest.c:3229
    openzfs#7 0x55ffbc769fba in ztest_execute ztest.c:6714
    openzfs#8 0x55ffbc779a90 in ztest_thread ztest.c:6761
    openzfs#9 0x7fe889cbc6da in start_thread

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#9706
ryao pushed a commit that referenced this pull request Sep 28, 2022
`zpool_do_import()` passes `argv[0]`, (optionally) `argv[1]`, and
`pool_specified` to `import_pools()`.  If `pool_specified==FALSE`, the
`argv[]` arguments are not used.  However, these values may be off the
end of the `argv[]` array, so loading them could dereference unmapped
memory.  This error is reported by the asan build:

```
=================================================================
==6003==ERROR: AddressSanitizer: heap-buffer-overflow
READ of size 8 at 0x6030000004a8 thread T0
    #0 0x562a078b50eb in zpool_do_import zpool_main.c:3796
    #1 0x562a078858c5 in main zpool_main.c:10709
    #2 0x7f5115231bf6 in __libc_start_main
    #3 0x562a07885eb9 in _start

0x6030000004a8 is located 0 bytes to the right of 24-byte region
allocated by thread T0 here:
    #0 0x7f5116ac6b40 in __interceptor_malloc
    #1 0x562a07885770 in main zpool_main.c:10699
    #2 0x7f5115231bf6 in __libc_start_main
```

This commit passes NULL for these arguments if they are off the end
of the `argv[]` array.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#12339
ryao pushed a commit that referenced this pull request Feb 24, 2023
Under certain loads, the following panic is hit:

    panic: page fault
    KDB: stack backtrace:
    #0 0xffffffff805db025 at kdb_backtrace+0x65
    #1 0xffffffff8058e86f at vpanic+0x17f
    #2 0xffffffff8058e6e3 at panic+0x43
    #3 0xffffffff808adc15 at trap_fatal+0x385
    #4 0xffffffff808adc6f at trap_pfault+0x4f
    #5 0xffffffff80886da8 at calltrap+0x8
    openzfs#6 0xffffffff80669186 at vgonel+0x186
    openzfs#7 0xffffffff80669841 at vgone+0x31
    openzfs#8 0xffffffff8065806d at vfs_hash_insert+0x26d
    openzfs#9 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#11 0xffffffff8065a28c at lookup+0x45c
    openzfs#12 0xffffffff806594b9 at namei+0x259
    openzfs#13 0xffffffff80676a33 at kern_statat+0xf3
    openzfs#14 0xffffffff8067712f at sys_fstatat+0x2f
    openzfs#15 0xffffffff808ae50c at amd64_syscall+0x10c
    openzfs#16 0xffffffff808876bb at fast_syscall_common+0xf8

The page fault occurs because vgonel() will call VOP_CLOSE() for active
vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While
here, define vop_open for consistency.

After adding the necessary vop, the bug progresses to the following
panic:

    panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1)
    cpuid = 17
    KDB: stack backtrace:
    #0 0xffffffff805e29c5 at kdb_backtrace+0x65
    #1 0xffffffff8059620f at vpanic+0x17f
    #2 0xffffffff81a27f4a at spl_panic+0x3a
    #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40
    #4 0xffffffff8066fdee at vinactivef+0xde
    #5 0xffffffff80670b8a at vgonel+0x1ea
    openzfs#6 0xffffffff806711e1 at vgone+0x31
    openzfs#7 0xffffffff8065fa0d at vfs_hash_insert+0x26d
    openzfs#8 0xffffffff81a39069 at sfs_vgetx+0x149
    openzfs#9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4
    openzfs#10 0xffffffff80661c2c at lookup+0x45c
    openzfs#11 0xffffffff80660e59 at namei+0x259
    openzfs#12 0xffffffff8067e3d3 at kern_statat+0xf3
    openzfs#13 0xffffffff8067eacf at sys_fstatat+0x2f
    openzfs#14 0xffffffff808b5ecc at amd64_syscall+0x10c
    openzfs#15 0xffffffff8088f07b at fast_syscall_common+0xf8

This is caused by a race condition that can occur when allocating a new
vnode and adding that vnode to the vfs hash. If the newly created vnode
loses the race when being inserted into the vfs hash, it will not be
recycled as its usecount is greater than zero, hitting the above
assertion.

Fix this by dropping the assertion.

FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700
Reviewed-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@axcient.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Submitted-by: Klara, Inc.
Sponsored-by: rsync.net
Closes openzfs#14501
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants