ZFS Crypto support #494

Closed
FransUrbo opened this Issue Dec 14, 2011 · 159 comments

Member

FransUrbo commented Dec 14, 2011

As of ZFS Pool Version 30, there is support for encryption. This part is unfortunately closed source, so an open-source implementation would be required. That means it would probably not be compatible with the Solaris version, but who cares :).

Illumos is apparently working on this at https://www.illumos.org/projects/zfs-crypto. The source repository can be found at https://bitbucket.org/buffyg/illumos-zfs-crypto. Unfortunately there have been no changes since the fork from illumos-gate. Should ZoL start thinking about this, or should we just take a back seat?

I don't know how big of a problem this would be, but 'copying' the way that LUKS (Linux Unified Key Setup) does it seems like a good place to start.

Owner

behlendorf commented Dec 14, 2011

I'd like to hold off on this for the moment; we have enough other work on our plate and this is a huge change! If Illumos puts together an implementation we'll happily look at integrating it. We could even use the source from ZFS Pool Version 30 if Oracle decides to release it 6-12 months from now (unlikely, but possible).

Member

FransUrbo commented Dec 14, 2011

If you don't mind LUKS, I might have some time to look at this in a week or two.

Owner

behlendorf commented Dec 14, 2011

I'm OK with making it easier to layer ZFS on top of LUKS; that would be nice. It's just not what most people think of when they say ZFS encryption support.
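
For anyone who wants the layered setup in the meantime, a minimal sketch could look like the following (the disk names /dev/sdb and /dev/sdc, the mapping names, and the pool name are just placeholders):

# Put a LUKS container on each backing disk and open it, so the
# decrypted view shows up under /dev/mapper.
cryptsetup luksFormat /dev/sdb
cryptsetup luksFormat /dev/sdc
cryptsetup luksOpen /dev/sdb crypt-sdb
cryptsetup luksOpen /dev/sdc crypt-sdc

# Build the pool on the decrypted mappings instead of the raw disks.
zpool create tank mirror /dev/mapper/crypt-sdb /dev/mapper/crypt-sdc

The obvious downside is that every vdev needs its own LUKS container, and key management lives entirely outside ZFS.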

Member

FransUrbo commented Dec 15, 2011

I was rather thinking of 'cloning'/'copying' the way LUKS works. Or rather, using the LUKS API inside ZFS. LUKS is used by 'cryptsetup' (which configures encrypted block devices) and 'dmsetup' (the userspace tool for the Linux kernel device mapper). So it seems LUKS is an API for device encryption.

Using ZFS 'on top of' something like that would probably be easier, but that is, as you say, not the intention...

Owner

behlendorf commented Dec 15, 2011

It would be interesting to investigate whether what you're suggesting is possible. It would result in a second version of ZFS encryption which isn't compatible with the pool v30 version, but that might not be a big deal. We should be integrating the new feature flag support early next year, so it could end up as a Linux-only feature.

Member

FransUrbo commented Dec 15, 2011

I don't think we're ever going to be compatible with v30... Not any of us, not unless Oracle all of a sudden 'sees the light', and I'm not holding my breath on that! :)

Best would be if we could come up with a solution that is portable to other OSes. I don't know how Linux-specific the 'Linux Unified Key Setup' is, but it's worth a look. I'll start on that once I have a workable sharesmb patch.

baryluk commented Dec 15, 2011

How about at least reverse engineering v30 format?

Member

FransUrbo commented Dec 15, 2011

Be my guest! Reverse engineering something, especially a crypto algorithm, isn't anywhere near as simple as it sounds!

baryluk commented Dec 16, 2011

We know it is using SHA-256, and probably AES-128 in incremental mode, so there is actually nothing complicated here; only some on-disk metadata needs to be reverse engineered, like which bit is what, where the salt is, and where the information that it is AES-128 and not 192 or 256 is stored. It should be easy. Unfortunately I do not have access to Solaris right now to test it.

Member

FransUrbo commented Dec 16, 2011

That DOES sound easy :). Unfortunately, we probably have to... I've spent the day looking into LUKS, but it does not seem to fit the purpose :(.

It is intended to be placed between the device and the FS, which means it needs one device (either one physical disk or multiple disks presented as one through raid/md) where it can store data linearly... kind of. But since ZFS is both a FS and a... 'device mapper' (?) which has multiple devices, I doubt it will be possible to have LUKS split the data and its key-storage partitions over multiple physical disks. I haven't looked at the code yet, just the specs, but that's what it looks like so far.

patrykk commented Dec 25, 2011

Hi,
"Oracle Solaris 11 Kernel Source-Code Leaked"
more information:

http://www.phoronix.com/scan.php?page=news_item&px=MTAzMDE

akorn commented Dec 26, 2011

Of course, you shouldn't look at the leaked source if you work on ZFS lest Oracle accuse you of copyright infringement.

patrykk commented Dec 27, 2011

Yes, you are right.

baryluk commented Dec 27, 2011

LUKS is not an option. ZFS performs encryption on a per-dataset/volume/file basis; LUKS works at the device level. We already have crypto primitives available in the kernel, we already have an on-disk format designed, we just need to reverse engineer it (which should be slightly easier than designing it - and that, in the case of crypto stuff, is hard to do properly/securely). Probably ZIL will be the hardest part.

Of course, looking at the leaked source code is not an option at all. Not even for a second was I thinking about it.

Member

dajhorn commented Jan 3, 2012

An interim solution is ecryptfs, which can be installed on top of ZFS.

Most RPM and DEB systems have built-in management for ecryptfs, which makes it easy to configure.

For maximum performance, dedup and compression should be disabled on any ZFS dataset that hosts a crypto layer.
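
A rough sketch of that interim setup, using a made-up pool/dataset name (the ecryptfs mount will prompt interactively for the passphrase and cipher options):

# Dataset that will hold the encrypted lower directory; encrypted output
# neither compresses nor dedups well, so turn both off as noted above.
zfs create -o compression=off -o dedup=off tank/private

# Stack eCryptfs on top of the dataset's mountpoint (lower and upper
# directory may be the same path).
mount -t ecryptfs /tank/private /tank/private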

Contributor

pyavdr commented May 6, 2012

This ( http://src.opensolaris.org/source/xref/zfs-crypto ) looks very nice: CDDL and lots of ZFS crypto stuff. Maybe we should try to cooperate with Illumos on a common port to Linux. In any case, processors with AES-NI should be supported to get optimal performance.
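
A quick way to check whether a CPU actually exposes AES-NI on Linux (assuming GNU grep; the relevant CPU flag is simply called "aes"):

# Non-empty output means the CPU advertises AES-NI.
grep -m1 -wo aes /proc/cpuinfo

# With the aesni_intel module loaded, the accelerated implementations
# are registered with the kernel crypto API.
grep aesni /proc/crypto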

Contributor

maxximino commented May 6, 2012

I would like to point out that the code linked above has ZPL_VERSION = 3 and SPA_VERSION=15. That's quite old!!
(source: http://src.opensolaris.org/source/xref/zfs-crypto/gate/usr/src/uts/common/sys/fs/zfs.h#318)
Oracle didn't merge that code until version 30 (http://hub.opensolaris.org/bin/view/Community+Group+zfs/30).

Owner

behlendorf commented May 10, 2012

We should certainly work with the other ZFS implementations when any crypto work is being considered. Also, it's my understanding that the link you're referencing is to some of the early crypto work and that it was significantly reworked before being included in v30. That said, it's still probably a reasonable place to get familiar with the basic design decisions.

Member

ryao commented Jul 22, 2012

Here is Sun's design document for ZFS encryption support:

http://hub.opensolaris.org/bin/download/Project+zfs-crypto/files/zfs-crypto-design.pdf

We can check out the early code by doing hg clone ssh://anon@hg.opensolaris.org//hg/zfs-crypto/gate.

mcr-ksh commented Sep 26, 2012

I would love to see that as well. Crypto is an amazing feature.

The last post was 5 months ago. Did you guys decide on anything? What is the current state?

gua1 commented Feb 20, 2013

@FloFra
This is marked "Milestone: 1.0.0", and I think the zfsonlinux developers hope that illumos will have implemented crypto support by then, or, if illumos hasn't, I suppose they would work on it themselves. My interpretation is "to be done in the distant future".

In the ZFS on Linux area

https://groups.google.com/a/zfsonlinux.org/forum/?fromgroups#!searchin/zfs-discuss/crypto

https://groups.google.com/a/zfsonlinux.org/forum/?fromgroups#!searchin/zfs-devel/crypto leads to pool version 33, zfs-crypto (2011-12-22).

In the illumos area

Whilst https://www.illumos.org/projects/zfs-crypto is not recently updated, there's http://wiki.illumos.org/display/illumos/Project+Ideas (2012-07-18)

Device drivers

Niagara Crypto
… Re-implement the crypto acceleration drivers for the SPARC sun4v cpus.

File systems

ZFS encryption
… Import and update the work started by Darren Moffat to provide cryptographic support for ZFS.

I'll align myself with the latter.

Elsewhere

In irc://irc.freenode.net/#zfs on 2013-11-09, someone drew attention to code on GitHub. We acknowledged the need for someone to audit that code, so I didn't follow the link.

Member

ryao commented Sep 2, 2013

@grahamperrin mentioned some encryption code on github. It was determined on the mailing list that it includes code from the Solaris 11 leak and is therefore encumbered. We will not be using it.

I believe the encryption code referred to is located at https://github.com/zfsrogue/zfs-crypto. I've been able to merge and build both the SPL changes (https://github.com/zfsrogue/spl-crypto) and the ZFS branch.
Doesn't merge well with the pending ARC cache changes though, so it's not being tested on my current lab rats.
I'll post results once I free up a test cycle.

Member

FransUrbo commented Apr 28, 2014

@sempervictus I would be very, very careful using that code (IF you can get it to merge). There's a very, very (yes, yes! :) high risk that that code is the source of the loss of my pool (16TB, almost full)....
We have not been able to absolutely pinpoint exactly why, but I've been running that code on my live server for almost a year, upgrading as I went, and somehow, somewhere it (the crypto code) messed up the metadata, so I/we might be unable to fix it...

Member

FransUrbo commented Apr 28, 2014

Also, Rogue isn't really maintaining the code any more (ZoL has gone through a lot of changes since he created the repository).

Contributor

lundman commented Apr 28, 2014

Actually he is, updating the OS X one when I ask, etc. He's most likely waiting on 0.6.3 to tag and release.

You can ask him to do a merge anytime if there is something you want sooner.

Member

FransUrbo commented Apr 28, 2014

@lundman All I've seen is that he has 'come in' once a month (or 'every now and then') accepting patches, sometimes without any review. There were a couple of pulls I wanted to discuss before merge (they required other ZoL pulls I did to be accepted first, which they weren't/haven't been yet - and might not ever be).
And after the core update of ZoL (some Illumos merge), the update wasn't really up to snuff. Considering how fast ZoL is moving right now, I think it's important that he keeps a much closer eye on what's happening, not just accepting others' pulls willy-nilly. And basically, that's all he does now...
Don't misunderstand me, I want his code as much as the next guy, but the core issue is that that code somehow, somewhere messed up my pool. My take on all that is that it's because he isn't maintaining it properly. That doesn't mean (and I don't mean) that the code shouldn't be considered, just that care should be taken before using it...

Contributor

lundman commented Apr 28, 2014

You are always running a risk pulling from master, in any repo. The master is the edge of development; he tags releases that are considered stable. It is appreciated that you are brave enough to help debug master, of course. But even main ZoL has made pool incompatibilities, so you run the risk there too :) He only pulls from ZoL, not really patches 'willy-nilly'. You and I might be the only pull requests there are.

Have you considered that it is up to us to maintain it? He's just hosting it to ensure the 'sights' are off us open source developers.

Member

FransUrbo commented Apr 28, 2014

Fair enough, and I'm not blaming him (or you) in any way for my current predicament - I'm well aware that I take a chance every time I use code that isn't fully tested. I'm just saying that anyone should be very, very careful using it, because I (might - ohhh, I really hope it's "might" and not "have" :) lost my pool and you haven't. I blame myself for not being careful enough.
You and I might also be the only ones actually using and testing it thoroughly :). I don't know enough about the core code (of either ZoL or ZFS-Crypto) to be able to help in that part. All I can do is submit issues, some documentation, and simpler pull requests.
Rogue might want to wait for 0.6.3, but considering how fast ZoL is moving, a much closer eye is needed. And if he's not maintaining it, will you? I just don't have the know-how to do it... Someone needs to keep it up to date with ZoL, not just around every tagging of ZoL...
And 'someone' also needs to figure out why exactly I lost my pool, so that it can't happen again to someone else. It might be too late for me (I really hope not), but if it can happen to me, it is very possible that it can happen again...

Having done some basic digging around the issue, it looks like there are many people offering suggestions for how to get going on this, but everyone's punting to see who jumps off this cliff first.
@FransUrbo: agreed 100% for any production/client-oriented use. I've left a comment in the issues section about merging in current code + the ABD patches; it looks like we're plain missing sha-mac here. I've been doing my own merges much like you, getting a decent sense of the logic flow...
@devs: a basic crypto implementation, especially one compatible with Solaris, would be very useful from a "marketing" standpoint - to get traction we need better corporate penetration, and most companies don't want to go into new storage ventures without the crypto box checked off. I've been giving folks ZFS/dm-crypt, but it's not ideal to work with an intermediate block layer. If a slow-as-hell but stable compatibility layer is implemented - illumos, here, wherever - it would at least give clients seeking to use specific platform features of non-Solaris OSes a way to maintain consistent data stores.
@ALL: has anyone asked Oracle directly what their litigious stance on this code-leak mess is? Companies (Google) have been known to swear off present/future litigation for derived works, and if there's any chance of that here, it would be of significant help in terms of maintaining an interoperable data format.

Member

FransUrbo commented Jun 3, 2014

Having done some basic digging around the issue, it looks like there are many people offering suggestions for how to get going on this, but everyone's punting to see who jumps off this cliff first.
I think it's more that there's really not anyone competent enough to start it.

Dabbling with encryption is hard and difficult enough, but designing it!? Very few people are knowledgeable enough to dare to do this. And if there is such a person, he/she would need extensive knowledge of ZFS as well...

I seriously doubt that there is such a person, and if there is, this person is obviously way too busy at the moment with other things (like making sure the open source version of ZFS - any OS/dist - is stable and functioning properly).

I've been doing my own merges much like you, getting a decent sense of the logic flow...

Do realize that you've just cut yourself off from the 'True Open Source' version of crypto in ZFS.
@ALL: has anyone asked Oracle directly what their litigious stance on this code-leak mess is?

Doubt it. Feel free to offer yourself up on their altar :)

Asking doesn't cost anything but time. I'm just too convinced of the answer to bother...

Companies (Google) have been known to swear off present/future litigation for derived works

Yes, but they on the other hand have a company statement that literally says 'Do No Evil'. They also have quite a good and long track record of working with/for Open Source.

Oracle doesn't... :). They on the other hand have proved that their 'secret' company statement must be 'Do As Much Evil As Possible'... (no smiley on that!)

Contributor

lundman commented Jun 4, 2014

A large part of the Sun ZFS-Crypto work is in the somewhat advanced kernel key store, where you can rekey your dataset at any time and it handles that work for you. I would advocate not being Solaris-compatible; you won't be able to import pool v30 without handling the hybrid v29 first anyway.

I had an informal chat with a ZFS dev at the last conference, and the idea of having DMU-layer encryption would be a good start - something you could probably knock out in a couple of days, like at a hackathon. Perhaps that could be suggested at the next Open ZFS Day. Although I wrote the Solaris kernel crypto API layer for the Linux SPL, I don't think I have the skill to design a new DMU layer.

The ABD work is a bit of a porting hassle, mainly as it is a Linux-only feature and really should be in the SPL layer (if it can be), as it diverges greatly. It is also strange that they add a new scatterlist when ZFS's built-in UIOs already serve that purpose and are supported in the SPL.
Even the OS X port is stuck on this work; the future is uncertain for using ZoL as upstream. But it's not like we even asked for permission before using them as upstream ;)

Owner

behlendorf commented Jun 5, 2014

@lundman It's very likely some form of the ABD changes will be going back upstream to OpenZFS. The idea is to do this in as compatible a way as possible, so if you have concerns please post them in the pull request.

Contributor

lundman commented Jun 6, 2014

Ah, if they are, I will get on another merge here. It was actually nothing difficult (contrary to what I had assumed), bar the unexpected smp_rmb()/smp_wmb() and strdup() (not spa_strdup()). I withdraw my rant :)

Contributor

lundman commented Jun 8, 2014

Looks like rogue did another merge two days back; no surprises there, and it looks like it went normally. As for ABD specifically, I will wait for it to be in master before we tackle the merge.

@behlendorf behlendorf added Difficulty - Hard and removed Illumos labels Oct 6, 2014

@behlendorf behlendorf removed this from the 1.0.0 milestone Oct 6, 2014

ryao added a commit to ryao/zfs that referenced this issue Oct 9, 2014

vdev_raidz_io_start: Ignore empty writes
The below excerpt of a backtrace is from a ztest failure when running
ZoL's ztest.

/#453 0x00007f03c8060b35 in vdev_queue_io_to_issue (vq=vq@entry=0x99f8a8) at ../../module/zfs/vdev_queue.c:706
/#454 0x00007f03c806106e in vdev_queue_io (zio=zio@entry=0x7f0350003de0) at ../../module/zfs/vdev_queue.c:747
/#455 0x00007f03c80818c1 in zio_vdev_io_start (zio=0x7f0350003de0) at ../../module/zfs/zio.c:2659
/#456 0x00007f03c807f243 in __zio_execute (zio=0x7f0350003de0) at ../../module/zfs/zio.c:1399
/#457 zio_nowait (zio=0x7f0350003de0) at ../../module/zfs/zio.c:1456
/#458 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350003a10) at ../../module/zfs/vdev_mirror.c:374
/#459 0x00007f03c807f243 in __zio_execute (zio=0x7f0350003a10) at ../../module/zfs/zio.c:1399
/#460 zio_nowait (zio=0x7f0350003a10) at ../../module/zfs/zio.c:1456
/#461 0x00007f03c806464c in vdev_raidz_io_start (zio=0x7f0350003380) at ../../module/zfs/vdev_raidz.c:1607
/#462 0x00007f03c807f243 in __zio_execute (zio=0x7f0350003380) at ../../module/zfs/zio.c:1399
/#463 zio_nowait (zio=0x7f0350003380) at ../../module/zfs/zio.c:1456
/#464 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350002fb0) at ../../module/zfs/vdev_mirror.c:374
/#465 0x00007f03c807f243 in __zio_execute (zio=0x7f0350002fb0) at ../../module/zfs/zio.c:1399
/#466 zio_nowait (zio=0x7f0350002fb0) at ../../module/zfs/zio.c:1456
/#467 0x00007f03c805ed43 in vdev_mirror_io_done (zio=0x7f033957ebf0) at ../../module/zfs/vdev_mirror.c:499
/#468 0x00007f03c807a0c0 in zio_vdev_io_done (zio=0x7f033957ebf0) at ../../module/zfs/zio.c:2707
/#469 0x00007f03c808285b in __zio_execute (zio=0x7f033957ebf0) at ../../module/zfs/zio.c:1399
/#470 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f0390001330, pio=0x7f033957ebf0) at ../../module/zfs/zio.c:547
/#471 zio_done (zio=0x7f0390001330) at ../../module/zfs/zio.c:3278
/#472 0x00007f03c808285b in __zio_execute (zio=0x7f0390001330) at ../../module/zfs/zio.c:1399
/#473 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03b4013a00, pio=0x7f0390001330) at ../../module/zfs/zio.c:547
/#474 zio_done (zio=0x7f03b4013a00) at ../../module/zfs/zio.c:3278
/#475 0x00007f03c808285b in __zio_execute (zio=0x7f03b4013a00) at ../../module/zfs/zio.c:1399
/#476 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03b4014210, pio=0x7f03b4013a00) at ../../module/zfs/zio.c:547
/#477 zio_done (zio=0x7f03b4014210) at ../../module/zfs/zio.c:3278
/#478 0x00007f03c808285b in __zio_execute (zio=0x7f03b4014210) at ../../module/zfs/zio.c:1399
/#479 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03b4014620, pio=0x7f03b4014210) at ../../module/zfs/zio.c:547
/#480 zio_done (zio=0x7f03b4014620) at ../../module/zfs/zio.c:3278
/#481 0x00007f03c807a6d3 in __zio_execute (zio=0x7f03b4014620) at ../../module/zfs/zio.c:1399
/#482 zio_execute (zio=zio@entry=0x7f03b4014620) at ../../module/zfs/zio.c:1337
/#483 0x00007f03c8060b35 in vdev_queue_io_to_issue (vq=vq@entry=0x99f8a8) at ../../module/zfs/vdev_queue.c:706
/#484 0x00007f03c806106e in vdev_queue_io (zio=zio@entry=0x7f0350002be0) at ../../module/zfs/vdev_queue.c:747
/#485 0x00007f03c80818c1 in zio_vdev_io_start (zio=0x7f0350002be0) at ../../module/zfs/zio.c:2659
/#486 0x00007f03c807f243 in __zio_execute (zio=0x7f0350002be0) at ../../module/zfs/zio.c:1399
/#487 zio_nowait (zio=0x7f0350002be0) at ../../module/zfs/zio.c:1456
/#488 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350002810) at ../../module/zfs/vdev_mirror.c:374
/#489 0x00007f03c807f243 in __zio_execute (zio=0x7f0350002810) at ../../module/zfs/zio.c:1399
/#490 zio_nowait (zio=0x7f0350002810) at ../../module/zfs/zio.c:1456
/#491 0x00007f03c8064593 in vdev_raidz_io_start (zio=0x7f0350001270) at ../../module/zfs/vdev_raidz.c:1591
/#492 0x00007f03c807f243 in __zio_execute (zio=0x7f0350001270) at ../../module/zfs/zio.c:1399
/#493 zio_nowait (zio=0x7f0350001270) at ../../module/zfs/zio.c:1456
/#494 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350001e60) at ../../module/zfs/vdev_mirror.c:374
/#495 0x00007f03c807f243 in __zio_execute (zio=0x7f0350001e60) at ../../module/zfs/zio.c:1399
/#496 zio_nowait (zio=0x7f0350001e60) at ../../module/zfs/zio.c:1456
/#497 0x00007f03c805ed43 in vdev_mirror_io_done (zio=0x7f033a0c39c0) at ../../module/zfs/vdev_mirror.c:499
/#498 0x00007f03c807a0c0 in zio_vdev_io_done (zio=0x7f033a0c39c0) at ../../module/zfs/zio.c:2707
/#499 0x00007f03c808285b in __zio_execute (zio=0x7f033a0c39c0) at ../../module/zfs/zio.c:1399
/#500 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03a8003c00, pio=0x7f033a0c39c0) at ../../module/zfs/zio.c:547
/#501 zio_done (zio=0x7f03a8003c00) at ../../module/zfs/zio.c:3278
/#502 0x00007f03c808285b in __zio_execute (zio=0x7f03a8003c00) at ../../module/zfs/zio.c:1399
/#503 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f038800c400, pio=0x7f03a8003c00) at ../../module/zfs/zio.c:547
/#504 zio_done (zio=0x7f038800c400) at ../../module/zfs/zio.c:3278
/#505 0x00007f03c808285b in __zio_execute (zio=0x7f038800c400) at ../../module/zfs/zio.c:1399
/#506 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f038800da00, pio=0x7f038800c400) at ../../module/zfs/zio.c:547
/#507 zio_done (zio=0x7f038800da00) at ../../module/zfs/zio.c:3278
/#508 0x00007f03c808285b in __zio_execute (zio=0x7f038800da00) at ../../module/zfs/zio.c:1399
/#509 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f038800fd80, pio=0x7f038800da00) at ../../module/zfs/zio.c:547
/#510 zio_done (zio=0x7f038800fd80) at ../../module/zfs/zio.c:3278
/#511 0x00007f03c807a6d3 in __zio_execute (zio=0x7f038800fd80) at ../../module/zfs/zio.c:1399
/#512 zio_execute (zio=zio@entry=0x7f038800fd80) at ../../module/zfs/zio.c:1337
/#513 0x00007f03c8060b35 in vdev_queue_io_to_issue (vq=vq@entry=0x99f8a8) at ../../module/zfs/vdev_queue.c:706
/#514 0x00007f03c806119d in vdev_queue_io_done (zio=zio@entry=0x7f03a0010950) at ../../module/zfs/vdev_queue.c:775
/#515 0x00007f03c807a0e8 in zio_vdev_io_done (zio=0x7f03a0010950) at ../../module/zfs/zio.c:2686
/#516 0x00007f03c807a6d3 in __zio_execute (zio=0x7f03a0010950) at ../../module/zfs/zio.c:1399
/#517 zio_execute (zio=0x7f03a0010950) at ../../module/zfs/zio.c:1337
/#518 0x00007f03c7fcd0c4 in taskq_thread (arg=0x966d50) at ../../lib/libzpool/taskq.c:215
/#519 0x00007f03c7fc7937 in zk_thread_helper (arg=0x967e90) at ../../lib/libzpool/kernel.c:135
/#520 0x00007f03c78890a3 in start_thread (arg=0x7f03c2703700) at pthread_create.c:309
/#521 0x00007f03c75c50fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

The backtrace was an infinite loop of `vdev_queue_io_to_issue()` invoking
`zio_execute()` until it overran the stack. vdev_queue_io_to_issue() will only
invoke `zio_execute()` on raidz vdevs when aggregation I/Os are generated to
improve aggregation continuity. These I/Os do not trigger any writes. However,
it appears that they can be generated in such a way that they recurse
infinitely upon return to `vdev_queue_io_to_issue()`. As a consequence, we see
the number of parents grow by 1 each time the recursion returns to
`vdev_raidz_io_start()`.

Signed-off-by: Richard Yao <ryao@gentoo.org>

ryao added a commit to ryao/zfs that referenced this issue Oct 9, 2014

vdev_raidz_io_start: Ignore empty writes

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>

ryao added a commit to ryao/zfs that referenced this issue Oct 9, 2014

vdev_raidz_io_start: Ignore empty writes

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>

ryao added a commit to ryao/zfs that referenced this issue Oct 10, 2014

Redispatch ZIOs in deep notification call graphs
The backtrace is the same one excerpted in the commits above, from a ztest failure when running ZoL's ztest.

This occurred when ztest was simulating a scrub under heavy I/O load. Under
those circumstances, it was possible for a mix of noop I/Os for aggregation
continuity and the I/O elevator to generate arbitrarily deep recursion.

This patch modifies ZFS to propagate a recursion counter inside the zio_t
objects such that IOs will be redispatched upon reaching a given recursion
depth.  We can detect long call chains and dispatch to another ZIO taskq. We
cut in line when we do this to minimize the potential for taskq exhaustion that
can prevent a zio from notifying its parent.

Signed-off-by: Richard Yao <ryao@gentoo.org>

ryao added a commit to ryao/zfs that referenced this issue Oct 10, 2014

Redispatch ZIOs in deep call graphs

Signed-off-by: Richard Yao <ryao@gentoo.org>

ryao added a commit to ryao/zfs that referenced this issue Oct 10, 2014

Redispatch ZIOs in deep call graphs
The below excerpt of a backtrace is from a ztest failure when running ZoL's
ztest:

/#453 0x00007f03c8060b35 in vdev_queue_io_to_issue (vq=vq@entry=0x99f8a8) at ../../module/zfs/vdev_queue.c:706
/#454 0x00007f03c806106e in vdev_queue_io (zio=zio@entry=0x7f0350003de0) at ../../module/zfs/vdev_queue.c:747
/#455 0x00007f03c80818c1 in zio_vdev_io_start (zio=0x7f0350003de0) at ../../module/zfs/zio.c:2659
/#456 0x00007f03c807f243 in __zio_execute (zio=0x7f0350003de0) at ../../module/zfs/zio.c:1399
/#457 zio_nowait (zio=0x7f0350003de0) at ../../module/zfs/zio.c:1456
/#458 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350003a10) at ../../module/zfs/vdev_mirror.c:374
/#459 0x00007f03c807f243 in __zio_execute (zio=0x7f0350003a10) at ../../module/zfs/zio.c:1399
/#460 zio_nowait (zio=0x7f0350003a10) at ../../module/zfs/zio.c:1456
/#461 0x00007f03c806464c in vdev_raidz_io_start (zio=0x7f0350003380) at ../../module/zfs/vdev_raidz.c:1607
/#462 0x00007f03c807f243 in __zio_execute (zio=0x7f0350003380) at ../../module/zfs/zio.c:1399
/#463 zio_nowait (zio=0x7f0350003380) at ../../module/zfs/zio.c:1456
/#464 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350002fb0) at ../../module/zfs/vdev_mirror.c:374
/#465 0x00007f03c807f243 in __zio_execute (zio=0x7f0350002fb0) at ../../module/zfs/zio.c:1399
/#466 zio_nowait (zio=0x7f0350002fb0) at ../../module/zfs/zio.c:1456
/#467 0x00007f03c805ed43 in vdev_mirror_io_done (zio=0x7f033957ebf0) at ../../module/zfs/vdev_mirror.c:499
/#468 0x00007f03c807a0c0 in zio_vdev_io_done (zio=0x7f033957ebf0) at ../../module/zfs/zio.c:2707
/#469 0x00007f03c808285b in __zio_execute (zio=0x7f033957ebf0) at ../../module/zfs/zio.c:1399
/#470 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f0390001330, pio=0x7f033957ebf0) at ../../module/zfs/zio.c:547
/#471 zio_done (zio=0x7f0390001330) at ../../module/zfs/zio.c:3278
/#472 0x00007f03c808285b in __zio_execute (zio=0x7f0390001330) at ../../module/zfs/zio.c:1399
/#473 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03b4013a00, pio=0x7f0390001330) at ../../module/zfs/zio.c:547
/#474 zio_done (zio=0x7f03b4013a00) at ../../module/zfs/zio.c:3278
/#475 0x00007f03c808285b in __zio_execute (zio=0x7f03b4013a00) at ../../module/zfs/zio.c:1399
/#476 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03b4014210, pio=0x7f03b4013a00) at ../../module/zfs/zio.c:547
/#477 zio_done (zio=0x7f03b4014210) at ../../module/zfs/zio.c:3278
/#478 0x00007f03c808285b in __zio_execute (zio=0x7f03b4014210) at ../../module/zfs/zio.c:1399
/#479 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03b4014620, pio=0x7f03b4014210) at ../../module/zfs/zio.c:547
/#480 zio_done (zio=0x7f03b4014620) at ../../module/zfs/zio.c:3278
/#481 0x00007f03c807a6d3 in __zio_execute (zio=0x7f03b4014620) at ../../module/zfs/zio.c:1399
/#482 zio_execute (zio=zio@entry=0x7f03b4014620) at ../../module/zfs/zio.c:1337
/#483 0x00007f03c8060b35 in vdev_queue_io_to_issue (vq=vq@entry=0x99f8a8) at ../../module/zfs/vdev_queue.c:706
/#484 0x00007f03c806106e in vdev_queue_io (zio=zio@entry=0x7f0350002be0) at ../../module/zfs/vdev_queue.c:747
/#485 0x00007f03c80818c1 in zio_vdev_io_start (zio=0x7f0350002be0) at ../../module/zfs/zio.c:2659
/#486 0x00007f03c807f243 in __zio_execute (zio=0x7f0350002be0) at ../../module/zfs/zio.c:1399
/#487 zio_nowait (zio=0x7f0350002be0) at ../../module/zfs/zio.c:1456
/#488 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350002810) at ../../module/zfs/vdev_mirror.c:374
/#489 0x00007f03c807f243 in __zio_execute (zio=0x7f0350002810) at ../../module/zfs/zio.c:1399
/#490 zio_nowait (zio=0x7f0350002810) at ../../module/zfs/zio.c:1456
/#491 0x00007f03c8064593 in vdev_raidz_io_start (zio=0x7f0350001270) at ../../module/zfs/vdev_raidz.c:1591
/#492 0x00007f03c807f243 in __zio_execute (zio=0x7f0350001270) at ../../module/zfs/zio.c:1399
/#493 zio_nowait (zio=0x7f0350001270) at ../../module/zfs/zio.c:1456
/#494 0x00007f03c805f71b in vdev_mirror_io_start (zio=0x7f0350001e60) at ../../module/zfs/vdev_mirror.c:374
/#495 0x00007f03c807f243 in __zio_execute (zio=0x7f0350001e60) at ../../module/zfs/zio.c:1399
/#496 zio_nowait (zio=0x7f0350001e60) at ../../module/zfs/zio.c:1456
/#497 0x00007f03c805ed43 in vdev_mirror_io_done (zio=0x7f033a0c39c0) at ../../module/zfs/vdev_mirror.c:499
/#498 0x00007f03c807a0c0 in zio_vdev_io_done (zio=0x7f033a0c39c0) at ../../module/zfs/zio.c:2707
/#499 0x00007f03c808285b in __zio_execute (zio=0x7f033a0c39c0) at ../../module/zfs/zio.c:1399
/#500 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f03a8003c00, pio=0x7f033a0c39c0) at ../../module/zfs/zio.c:547
/#501 zio_done (zio=0x7f03a8003c00) at ../../module/zfs/zio.c:3278
/#502 0x00007f03c808285b in __zio_execute (zio=0x7f03a8003c00) at ../../module/zfs/zio.c:1399
/#503 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f038800c400, pio=0x7f03a8003c00) at ../../module/zfs/zio.c:547
/#504 zio_done (zio=0x7f038800c400) at ../../module/zfs/zio.c:3278
/#505 0x00007f03c808285b in __zio_execute (zio=0x7f038800c400) at ../../module/zfs/zio.c:1399
/#506 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f038800da00, pio=0x7f038800c400) at ../../module/zfs/zio.c:547
/#507 zio_done (zio=0x7f038800da00) at ../../module/zfs/zio.c:3278
/#508 0x00007f03c808285b in __zio_execute (zio=0x7f038800da00) at ../../module/zfs/zio.c:1399
/#509 zio_notify_parent (wait=ZIO_WAIT_DONE, zio=0x7f038800fd80, pio=0x7f038800da00) at ../../module/zfs/zio.c:547
/#510 zio_done (zio=0x7f038800fd80) at ../../module/zfs/zio.c:3278
/#511 0x00007f03c807a6d3 in __zio_execute (zio=0x7f038800fd80) at ../../module/zfs/zio.c:1399
/#512 zio_execute (zio=zio@entry=0x7f038800fd80) at ../../module/zfs/zio.c:1337
/#513 0x00007f03c8060b35 in vdev_queue_io_to_issue (vq=vq@entry=0x99f8a8) at ../../module/zfs/vdev_queue.c:706
/#514 0x00007f03c806119d in vdev_queue_io_done (zio=zio@entry=0x7f03a0010950) at ../../module/zfs/vdev_queue.c:775
/#515 0x00007f03c807a0e8 in zio_vdev_io_done (zio=0x7f03a0010950) at ../../module/zfs/zio.c:2686
/#516 0x00007f03c807a6d3 in __zio_execute (zio=0x7f03a0010950) at ../../module/zfs/zio.c:1399
/#517 zio_execute (zio=0x7f03a0010950) at ../../module/zfs/zio.c:1337
/#518 0x00007f03c7fcd0c4 in taskq_thread (arg=0x966d50) at ../../lib/libzpool/taskq.c:215
/#519 0x00007f03c7fc7937 in zk_thread_helper (arg=0x967e90) at ../../lib/libzpool/kernel.c:135
/#520 0x00007f03c78890a3 in start_thread (arg=0x7f03c2703700) at pthread_create.c:309
/#521 0x00007f03c75c50fd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

This occurred when ztest was simulating a scrub under heavy I/O load. Under
those circumstances, it was possible for a mix of noop I/Os for aggregation
continuity and the I/O elevator to generate arbitrarily deep recursion.

This patch modifies ZFS to propagate a recursion counter inside the zio_t
objects such that IOs will be redispatched upon reaching a given recursion
depth.  We can detect long call chains and dispatch to another ZIO taskq. We
cut in-line when we do this to minimize the potential for taskq exhaustion that
can prevent a zio from notifying its parent.

Signed-off-by: Richard Yao <ryao@gentoo.org>
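
As a rough illustration of the redispatch approach described in the commit message above, here is a minimal, self-contained sketch: each I/O carries a depth counter inherited from its parent, and once the counter passes a threshold the work is handed off to a worker queue instead of recursing further. The names and the cutoff value are hypothetical, not the patch's actual code.

/*
 * Sketch of the "cut the call chain past a depth limit" idea. The threshold
 * and all identifiers are illustrative assumptions, not the patch's values.
 */
#include <stdio.h>

#define ZIO_REDISPATCH_DEPTH 16      /* assumed cutoff, not the patch's value */

typedef struct zio_sketch {
    int io_depth;                    /* recursion depth inherited from parent */
} zio_sketch_t;

/* Stand-in for handing the I/O to a taskq thread instead of recursing. */
static void dispatch_to_taskq(zio_sketch_t *zio)
{
    printf("redispatched at depth %d\n", zio->io_depth);
}

static void zio_execute_sketch(zio_sketch_t *zio)
{
    if (zio->io_depth >= ZIO_REDISPATCH_DEPTH) {
        /* Cut the chain here; a fresh stack picks the I/O up later. */
        dispatch_to_taskq(zio);
        return;
    }

    /* A child I/O inherits the parent's depth plus one. */
    zio_sketch_t child = { .io_depth = zio->io_depth + 1 };
    zio_execute_sketch(&child);
}

int main(void)
{
    zio_sketch_t root = { .io_depth = 0 };
    zio_execute_sketch(&root);
    return (0);
}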

ryao added a commit to ryao/zfs that referenced this issue Oct 10, 2014

Redispatch ZIOs in deep call graphs

ryao added a commit to ryao/zfs that referenced this issue Oct 10, 2014

Redispatch ZIOs in deep call graphs

ryao added a commit to ryao/zfs that referenced this issue Oct 10, 2014

Redispatch ZIOs in deep call graphs

ryao added a commit to ryao/zfs that referenced this issue Oct 11, 2014

Redispatch ZIOs in deep call graphs

ryao added a commit to ryao/zfs that referenced this issue Oct 11, 2014

Redispatch ZIOs in deep call graphs
Member

ryao commented Nov 19, 2014

Here are some links that might be useful for this:

https://hg.openindiana.org/upstream/oracle/onnv-gate-zfscrypto/
https://blogs.oracle.com/darren/entry/zfs_encryption_what_is_on
http://www.oracle.com/technetwork/articles/servers-storage-admin/manage-zfs-encryption-1715034.html
https://docs.oracle.com/cd/E26502_01/html/E29007/gkkih.html
http://www.snia.org/sites/default/files2/sdc_archives/2008_presentations/wednesday/DarrenMoffat_ZFSEncryption.pdf
https://www.usenix.org/legacy/events/fast09/wips_posters/moffat_wip.pdf

In particular, the GRUB2 source code has a comment saying that its encryption support was implemented mostly using the above blog post as a reference. In addition, the GRUB2 source code shows that bit 62 is used to indicate a block pointer to an encrypted block and that the hidden dataset property is called "salt".

Upon some poking around, it seems that the encryption property is inheritable, but not editable. Send streams of encrypted filesystems are not encrypted, but receiving into a non-encrypted filesystem is not permitted. Also, the buffers from encrypted datasets are not written to the L2ARC. Each dataset has a key chain in ZAP that stores the actual keys used for encryption, along with information on which algorithm was used, which txgs were in effect at the time, etcetera. Decryption requires figuring out which key was used for a given txg, while encryption requires using the latest. You can read more details at that blog post.

@cancan101 cancan101 referenced this issue in ClusterHQ/flocker Jan 29, 2015

Closed

Support encrypting the ZFS volumes on disk #1108

Member

ryao commented Aug 22, 2015

Some discussion of full disk encryption in #gentoo-dev on freenode led me to poke around. I wrote an outline of the disk format changes, mostly via header changes, here:

https://github.com/ryao/zfs/tree/crypto

Member

ryao commented Aug 22, 2015

The Linux kernel crypto API is exported GPL-only, so we cannot use it without having a fairly unpleasant conversation about whether the GPL-only symbol restriction is legitimate. The OpenBSD/FreeBSD AES code is "public domain". We do not have the rights to public-domain code in jurisdictions that do not recognize the public domain, such as France, so we cannot use that either. The OpenSSL AES code is under the "OpenSSL License", a 6-clause variant of the 4-clause BSD license, which is also problematic.

That said, the Illumos AES code is under the CDDL, so anyone implementing encryption support should have no issue reusing it in ZoL:

https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/crypto/io/aes.c

Contributor

tcaputi commented Feb 20, 2016

@ilovezfs the order will be compress, then encrypt, so we will keep good compression ratios. Removing compression applies strictly to the bonus buffers of dnodes (up to 320 bytes at the tail of the structure); this will not impact file data or other non-metadata in any way.

Contributor

ilovezfs commented Feb 20, 2016

Thanks, so space isn't a big deal. However, I still think it would be significantly safer to compress there, too. Anywhere an attacker knows to expect more zeros in the plaintext is a problem, IMO.

Contributor

tcaputi commented Feb 20, 2016

@ilovezfs I don't believe it is, because the data there won't be zeros. It will be the bonus buffer, which is not zeros. The dnode itself will be in plain text for scrubbing purposes. The encryption algorithms we are using require an IV anyway, so this shouldn't be a problem.
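
To make the IV point concrete, here is a minimal, self-contained userspace sketch. It uses OpenSSL's EVP interface purely as a stand-in for the ZFS crypto code (that choice, the key size, and the omission of the auth tag are assumptions of the example, not the PR's implementation): encrypting the same all-zero plaintext twice under one key but two random 96-bit IVs yields two different ciphertexts, so predictable plaintext does not produce recognizable ciphertext.

/*
 * Sketch only: OpenSSL AES-256-GCM as a stand-in for the ZFS crypto layer.
 * The same plaintext encrypted under the same key with two different random
 * IVs produces two different ciphertexts. The GCM auth tag is omitted for
 * brevity.
 */
#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/rand.h>

static int gcm_encrypt(const unsigned char *key, const unsigned char *iv,
    const unsigned char *pt, int ptlen, unsigned char *ct)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int len, ctlen;

    EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv);
    EVP_EncryptUpdate(ctx, ct, &len, pt, ptlen);
    ctlen = len;
    EVP_EncryptFinal_ex(ctx, ct + len, &len);
    ctlen += len;
    EVP_CIPHER_CTX_free(ctx);
    return (ctlen);
}

int main(void)
{
    unsigned char key[32], iv1[12], iv2[12];
    unsigned char pt[16] = { 0 };    /* all-zero "known" plaintext */
    unsigned char ct1[16], ct2[16];

    RAND_bytes(key, sizeof (key));
    RAND_bytes(iv1, sizeof (iv1));
    RAND_bytes(iv2, sizeof (iv2));

    gcm_encrypt(key, iv1, pt, sizeof (pt), ct1);
    gcm_encrypt(key, iv2, pt, sizeof (pt), ct2);

    printf("ciphertexts %s\n",
        memcmp(ct1, ct2, sizeof (ct1)) ? "differ" : "match");
    return (0);
}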

With this whole Apple/Bureau mess going on, it dawns on me that we may want to build master key destruction into the core implementation. We patch LUKS with the nuke patch, which converts one of our key slots to contain a key which, when entered, kills the master key and renders the volume useless. Unfortunately, this only helps if someone is silly enough to try on original media with the LUKS version in the initrd/OS...

Is there any way by which we could construct the unwrap process to force a write on each attempt, such that we end up eating the key with "bad attempt" bits? Such logic would have to be intrinsic to the operation, since we are open source and can be altered by attackers (no subpoena needed). It would mean no ability to unlock datasets in RO-imported pools, but if you're using crypto, I'm thinking security wins over accessibility. Thoughts?

Contributor

tcaputi commented Feb 21, 2016

@sempervictus

Is there any way by which we could construct the unwrap process to force a
write on each attempt such that we end up eating the key with "bad attempt"
bits?

I like that idea, but I'm having a hard time thinking of how that could work in such a way that an attacker couldn't simply comment it out.

Contributor

ilovezfs commented Feb 21, 2016

Better idea: build in a big, beautiful backdoor paid for by Mexico.

Contributor

tcaputi commented Feb 21, 2016

@sempervictus Another quick thought on that matter. It should be fairly easy to trick ZFS into thinking it is read/write without altering the code at all. All you have to do is set up a dm-snapshot of each volume in the pool and then import the pool using these volumes instead of the originals. At that point, any writes ZFS would make to destroy the key would only go to the snapshot's copy-on-write data and not the original volume. As much as I like the idea, I'm not sure there's a way to implement it that isn't easily circumvented.

Any forensics cat worth their salt wouldn't use original media, so that sort of substrate recovery is always a threat. Having some sort of dependency on the on-disk structure would, however, make brute force harder and slower. If you have to reload the snap every 10 tries, it slows you down, and brute force is all about time. Just food for thought here, but it would be nice if deterrents were forcibly linked to disk ops.

Member

kpande commented Feb 24, 2016

I like that idea, but I'm having a hard time thinking of how that could work in such a way that an attacker couldn't simply comment it out.

You need to build in support for a hardware security module (HSM) or a trusted platform module (TPM), both of which serve as hardware key stores. In the case of an HSM (as I understand it), the device is capable of performing encryption and decryption using encrypted memory buses, so it is resistant to cold boot attacks as well as having built-in delays - this is similar to Apple's Secure Enclave. I would love to know more about these technologies and their implementation - I only know of them, and that they are the next logical progression for all of my own cryptographic work.

Contributor

tcaputi commented Feb 24, 2016

@kpande I can look at that for future implementations. All of that code will pretty much be in userspace, so it should be easy to patch in later. As a first implementation I am only planning on supporting raw, hex, and passphrase wrapping keys, which may be provided from a prompt or from a file. Right now, the code only accepts raw keys from the prompt. There are placeholders in the code to allow more kinds of keysources, but I still have a lot to do with the actual data encryption, zfs send, and many other bits and pieces before I can start to think about other keysources and accessories like that.
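
For the passphrase case, one common approach is to run the passphrase through a password-based KDF to obtain the raw wrapping key. The sketch below uses OpenSSL's PBKDF2-HMAC-SHA-512 only as an illustration; the choice of KDF, digest, salt handling, and iteration count are assumptions here, not necessarily what the PR implements.

/*
 * Sketch: derive a 256-bit wrapping key from a passphrase with
 * PBKDF2-HMAC-SHA-512. Salt, iteration count, and KDF choice are
 * illustrative assumptions, not the PR's actual parameters.
 */
#include <stdio.h>
#include <string.h>
#include <openssl/evp.h>
#include <openssl/rand.h>

int main(void)
{
    const char *passphrase = "correct horse battery staple";
    unsigned char salt[32];
    unsigned char wrapping_key[32];      /* 256-bit wrapping key */
    int i;

    /* In practice the salt would be generated once and stored on disk. */
    RAND_bytes(salt, sizeof (salt));

    if (PKCS5_PBKDF2_HMAC(passphrase, (int)strlen(passphrase),
        salt, sizeof (salt), 350000, EVP_sha512(),
        sizeof (wrapping_key), wrapping_key) != 1) {
        fprintf(stderr, "key derivation failed\n");
        return (1);
    }

    for (i = 0; i < (int)sizeof (wrapping_key); i++)
        printf("%02x", wrapping_key[i]);
    printf("\n");
    return (0);
}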

I've got a TPM device so I can help test it. What's the ETA on this new feature? Very exciting work.

Contributor

tcaputi commented Feb 24, 2016

@CMCDragonkai I don't really have a good ETA because (as far as I'm aware) nobody has looked at my WIP pull request to let me know how close it is to acceptable. This is partially because it is still a WIP, but also because the current work has 109 file changes, so I don't think anybody has had sufficient time to look at it yet.

@ALL That said, as another development update for those who are concerned, I am still working on data encryption. I'm hoping to have basic data encryption (as in not the L2ARC or the ZIL) implemented by the end of the week. Unfortunately I had to jump back on another project for the better part of last week, but as of today I should be back to working on this.

Contributor

lundman commented Feb 24, 2016

openzfsonosx/zfs@bc28b16

I've started work on both testing the portability and on review. The initial merge is fairly trivial: just minor Linux headers added, and highbit/lowbit was added to a different header than mine.

After that is the porting work in icp, which appears to be a stand-alone kmodule (and ZoL has a few, after all). So some files perhaps should be named with "linux" in them, like illumos-crypto.c. There is a fair bit of UIO work, which I need to replace with the UIO API calls (it's uio_offset(uio) instead of uio->offset on OS X).

But I have only started on the second phase.

Contributor

tcaputi commented Feb 24, 2016

@lundman Thanks for working on getting this to other platforms. illumos-crypto.c is the only file that is not ported from Illumos, and it effectively just initializes everything. Perhaps a better way to do it would be to move the module_init(), module_exit() and other Linux-specific module hooks to a separate Linux-specific file. However, it seems that the current OS X ZFS code just #if 0's out the Linux-specific lines: https://github.com/openzfsonosx/zfs/blob/master/module/avl/avl.c#L1069

Pardon the derailment I introduced here with the resistance-to-brute-force piece; I didn't mean to imply that we have to have TPM support baked in at day one. My primary point is that while the core interactions are being developed, we have the best opportunity to tie unwrap attempts to the execution flow with some horribly inefficient data dependency which would slow down attackers trying to brute-force the system. Agreed that TPM/HSM will be required at some point, but this is mostly a train of thought toward purposeful inefficiencies for the sake of obstruction in situations where the time increments are measured in orders of magnitude.


FreeBSD

I'm told that ZFS encryption was discussed at the recent FreeBSD Storage Summit, but I don't know whether any part of that discussion was with reference to developments here.


A few months back, in #494 (comment):

… The OpenBSD/FreeBSD AES code is "public domain". We do not have the rights to public domain code in jurisdictions that do not recognize public domain, such as France, so we cannot use that either. …

A little later:

… FreeBSD is missing a crypto API …

If it helps, I'm aware of:

  • crypto(4) – hardware crypto access driver
  • crypto(9) – API for cryptographic services in the kernel.

From https://www.freebsd.org/releases/8.2R/relnotes.html (2013-11-13) I see that the FreeBSD crypto(4) framework is also known as opencrypto.

Contributor

tcaputi commented Feb 25, 2016

@grahamperrin I wonder if any of them are aware of the development here. It would be a shame to have multiple people working on the same thing...

As for the crypto API, I have ported over the one from Illumos and (pending review/approval) it should suffice for our needs, so I think we're doing pretty well on that front already.

Contributor

ilovezfs commented Feb 25, 2016

Since the basis for the ZoL crypto PR is an illumos crypto API, is there a reason it isn't first being submitted to the openzfs repo (https://github.com/openzfs/openzfs/pulls), sans the illumos-crypto-API Linux port, before being submitted to ZoL? If it's accepted into ZoL first, and upstream illumos/openzfs and/or the other openzfs platforms then want significant changes before integration, there could end up being a bunch of ZoL pools that are incompatible with the final openzfs version. Alternatively, pressure could be exerted on upstream and the other platforms not to request any changes in order to prevent such incompatibilities, with the only recourse being to try to make everyone aware of the work going on here as soon as possible, with big neon signs, if they want to have any input.

Contributor

tcaputi commented Feb 25, 2016

@ilovezfs I am fine doing it either way. I only started the work here because this is the project I use and wasn't really aware of the other variants.

Contributor

ilovezfs commented Feb 25, 2016

@tcaputi Yeah, that makes sense. Once you're satisfied that the main cross-platform piece sans the illumos crypto API port is working for you locally, I'd recommend opening the PR there not only for the feedback from illumos land and from the other platforms, but also because their buildbot is particularly vicious and helpful for honing things.

Owner

behlendorf commented Feb 25, 2016

@ilovezfs @tcaputi we definitely need to get buy-in from the OpenZFS developers on other platforms on the core design before this can be merged to ZoL. However, that doesn't mean it won't get merged first in ZoL and then upstreamed. This is already what happens on the other platforms with new features.

Contributor

ilovezfs commented Feb 26, 2016

@behlendorf That is true strictly speaking, but the scope of this change is a little different from something like adding async_destroy or lz4.

Contributor

lundman commented Feb 26, 2016

Ah, I'd forgotten how much it sucks to work with uio on Linux, as you have to do everything yourself. We could just have separate repos, like we already do for the uio code in zfs_vnops.c.

But it could be tempting to create new standard ZFS macros for uio work, something I thought about doing for ZFS too, but I doubt upstream would be bothered with it.

The function crypto_uio_data() would essentially just become two calls, uio_update() and uio_move().

Thoughts?

#ifdef LINUX
/* Linux (ZoL): the uio segment flag tells kernel space from user space. */
#define ZUIO_ISSYSSPACE(U) ((U)->uio_segflg == UIO_SYSSPACE)
#define ZUIO_ISUSERSPACE(U) ((U)->uio_segflg != UIO_SYSSPACE)
#endif
#ifdef __APPLE__
/* OS X: uio is opaque, so go through the accessor function. */
#define ZUIO_ISUSERSPACE(U) uio_isuserspace((U))
#define ZUIO_ISSYSSPACE(U) (!uio_isuserspace((U)))
#endif
#ifdef sun
/* illumos: same struct layout as the Linux definition above. */
#define ZUIO_ISSYSSPACE(U) ((U)->uio_segflg == UIO_SYSSPACE)
#define ZUIO_ISUSERSPACE(U) ((U)->uio_segflg != UIO_SYSSPACE)
#endif

    /* The kernel crypto path only accepts kernel-space uios. */
    if (ZUIO_ISUSERSPACE(uiop)) {
        return (CRYPTO_ARGUMENTS_BAD);
    }

Contributor

lundman commented Feb 26, 2016

Hmm, actually, FreeBSD already has compat for it, so the Illumos sources just work, and ZoL has the uio struct from Illumos in there and handles the changeover elsewhere. So it's just OS X that is weird. That's a bit selfish of me, then; carry on as before :)

Contributor

tcaputi commented Feb 26, 2016

@lundman That's good to know. For the moment it's looking like most of the uio will be internal to the crypto code. The current design I'm working on uses the struct in one function only (zio_do_crypt_uio()), which will effectively assemble a set of iovecs into a uio and then encrypt/decrypt them.
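
To picture the iovec-gathering step, here is a minimal sketch that streams a message scattered across several buffers, described by a struct iovec array, through a single AES-256-GCM context. OpenSSL's EVP interface stands in for the ported Illumos crypto provider, and encrypt_iovecs() is a hypothetical name, not the PR's zio_do_crypt_uio().

/*
 * Sketch only: encrypt a scattered message (an iovec array) as one logical
 * stream, in place. OpenSSL EVP is a stand-in for the ported ICP.
 */
#include <sys/uio.h>
#include <openssl/evp.h>

static int
encrypt_iovecs(const unsigned char key[32], const unsigned char iv[12],
    struct iovec *iov, int iovcnt, unsigned char tag[16])
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    unsigned char fin[16];
    int i, len, ret = -1;

    if (ctx == NULL)
        return (-1);
    if (EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, iv) != 1)
        goto out;

    /* Feed each scattered buffer to the same context, encrypting in place. */
    for (i = 0; i < iovcnt; i++) {
        if (EVP_EncryptUpdate(ctx, iov[i].iov_base, &len,
            iov[i].iov_base, (int)iov[i].iov_len) != 1)
            goto out;
    }

    /* GCM emits no trailing block; fin[] only satisfies the API. */
    if (EVP_EncryptFinal_ex(ctx, fin, &len) != 1)
        goto out;

    /* The 128-bit authentication tag would be stored alongside the block. */
    if (EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, tag) != 1)
        goto out;
    ret = 0;
out:
    EVP_CIPHER_CTX_free(ctx);
    return (ret);
}

int main(void)
{
    unsigned char key[32] = { 0 }, iv[12] = { 0 }, tag[16];
    unsigned char buf1[8] = "block A", buf2[8] = "block B";
    struct iovec iov[2] = {
        { .iov_base = buf1, .iov_len = sizeof (buf1) },
        { .iov_base = buf2, .iov_len = sizeof (buf2) },
    };

    return (encrypt_iovecs(key, iv, iov, 2, tag) == 0 ? 0 : 1);
}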

… buy in from the openZFS developers from other platforms on the core design …

OpenZFS Office Hours, maybe?

If you like that idea, or if (at this stage) you'd prefer some other exercise towards buy-in, please kick the ball around in irc://chat.freenode.net/#openzfs and/or irc://chat.freenode.net/#freebsd … in the latter, you'll probably find people more tuned in to BSD Now podcasts, with plenty of interest in ZFS.

In any case, I'll ping Matt Ahrens.


… wonder if any of them are aware of the development here …

Back to #494 (comment) on 2016-02-06:

… Multi-platform compatibility remains highly desirable. I'll promote that thought … in the PC-BSD area I can think of at least one issue that may be a good vehicle for promotion. …

I refrained from promotion around that time mainly because for some PC-BSD developers, it was a particularly busy time of the month (recreating a set of installers for 11.0-CURRENTFEB2016; responding to testers and so on).

2016-02-23 in irc://chat.freenode.net/#pcbsd I drew the attention of Kris Moore (iXsystems) with the following comment, on an issue that affects PersonaCrypt:

I wonder whether the man hours to fix bugs such as 11595 might be better put towards something that's less OS-specific

Later in a ZFS channel, attention was drawn to #4329 … and so on – so Kris and at least one other well-connected person are aware of work here on 494.


A little background, for readers who may be unfamiliar:

Subject to buy-in, stability etc. there's the notion of redesigning PersonaCrypt to benefit from multi-platform encryption and a larger community of developers.

Contributor

tcaputi commented Feb 29, 2016

@lundman Just so you are aware, I recently realized that I somehow ported over an older version of the Illumos Crypto layer. The one I have right now will work, but I'd like to talk to someone from Illumos and get a recent and stable version to use. The port only took me a couple of days, and the ICP needs some work anyway to get the assembly to compile only on x86 systems. Hopefully you haven't gotten too far into porting it to OSX, but even if you have I don't think the changes will be drastic. I will work on that next after data encryption is (finally) finished.

Contributor

tcaputi commented Mar 5, 2016

Quick update: I apologize for the delay, but I had to jump onto another project for a bit. I now have data encryption working for plain file data. The code is written in such a way that partially plaintext object types can be easily added. Encrypting any object type that does not need to remain partially in the clear should just be a matter of setting a flag on the object type. It was tested with several different combinations of compression, checksumming, dedup, gang writes, and encryption algorithms, and with embedded blocks enabled, and everything seems to work.
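To illustrate the "flag on the object type" idea, a hypothetical sketch extending the dmu_ot[] table; the ot_encrypt field does not exist in the current code, and the entry shown is only an example:

/* Hypothetical: mark object types whose contents may be encrypted
 * wholesale, alongside the existing byteswap/metadata information. */
typedef struct dmu_object_type_info {
	dmu_object_byteswap_t	ot_byteswap;
	boolean_t		ot_metadata;
	boolean_t		ot_encrypt;	/* new, illustrative only */
	char			*ot_name;
} dmu_object_type_info_t;

const dmu_object_type_info_t dmu_ot[DMU_OT_NUMTYPES] = {
	/* ... */
	{ DMU_BSWAP_UINT8, B_FALSE, B_TRUE, "plain file contents" },
	/* ... */
};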

@behlendorf I am currently storing the MAC in the last 128 bits of bp->blk_cksum and the IV in the first 96 bits of bp->blk_pad. The code can be changed pretty easily to store these values wherever the ZFS community deems fit.
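As a rough illustration of that interim layout (blk_pad[] and blk_cksum.zc_word[] are the real blkptr_t fields; the helper name and exact byte packing are assumptions):

/* Sketch of the interim scheme: 96-bit IV in the leading pad bits,
 * 128-bit MAC in the upper half of the 256-bit checksum field. */
static void
encode_crypt_params_bp(blkptr_t *bp, const uint8_t *iv, const uint8_t *mac)
{
	/* IV: one full pad word plus the low 32 bits of the second */
	bcopy(iv, &bp->blk_pad[0], sizeof (uint64_t));
	bcopy(iv + sizeof (uint64_t), &bp->blk_pad[1], sizeof (uint32_t));

	/* MAC: the last 128 bits of blk_cksum */
	bcopy(mac, &bp->blk_cksum.zc_word[2], 2 * sizeof (uint64_t));
}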

Before I move this to my master branch and update the PR I want to fix up the ICP while I still have my rather excessive debugging in place.

Contributor

tcaputi commented Mar 7, 2016

I pushed a new commit to the PR, featuring plain file contents encryption (as I described above) and a restructured ICP that should build on any architecture, complete with assembly where it is available and generic C functions where it is not.

I have spoken with Saso Kiselkov at Illumos about the ICP and about which version of the Illumos Crypto Framework I should port. He is currently rewriting a good bit of the AES encryption code, and he told me he expects to see an order-of-magnitude performance improvement with his changes, but he isn't sure when he will have the time to finish implementing them. I was originally going to re-port the current framework from the master branch of illumos-gate so that we could get AES-NI support working, but I think I might hold off on that now, in the hope that he might finish his implementation while I am working on other things. As it stands, the current ICP works, and I am not planning on changing the structure very much once I get to the update.

Contributor

lundman commented Mar 7, 2016

openzfsonosx/zfs@0dd851d

OK, so now I compile it the whole way through and load it into the kernel.
Although I still have a handful of #if 0 // portme markers to address, mostly in the uio places.

You create quite a number of new kernel modules, which is the standard on Linux after all. On FBSD and OSX it will probably stay at 2, and on IllumOS, 0. The only annoyance there is the mixture of Linux module sources and actual crypto API sources, as in icp/io/aes.c, sha2_mod.c, modhash.c and modconf.c. Perhaps the Linux module sources could be separated out?

cmd/icp/Makefile is in configure.ac, but not in git. I assume it was left in configure.ac by mistake, referring to test cmds you use in debug.

I'm a bit surprised how far ahead of ZoL O3X is now, since there were changes in the zio_checksum API calls (the edonr and skein commit), but it is trivial to change the number of parameters and test the flags instead. Otherwise calls to zio_checksum_SHA256 will panic.

And I have ignored the assembler files; they don't work here (gcc vs. clang?).

Contributor

tcaputi commented Mar 7, 2016

@lundman

You create quite a number of new kernel modules, which is the standard on Linux after all. On FBSD and OSX it will probably stay at 2, and on IllumOS, 0. The only annoyance there is the mixture of Linux module sources and actual crypto API sources, as in icp/io/aes.c, sha2_mod.c, modhash.c and modconf.c. Perhaps the Linux module sources could be separated out?

I'm not really creating any new modules other than the ICP itself, although I considered doing that at one point (and it is still possible). What I am currently doing is simply emulating Illumos's kernel module structure within a single module. I stole the directory structure from Illumos, so that the root directory structure represents /usr/src/uts/common/crypto and the algs directory represents /usr/src/common/crypto. The only other directories I added were the asm-* folders, which I needed to get architecture-specific code working, and the os folder, which has supporting code from various parts of Illumos, including modhash.c and modconf.c as you pointed out. I wanted to keep the directory structure as close to Illumos as possible so it would be easier to maintain as Illumos gets updates that we want.

cmd/icp/Makefile is in configure.ac, but not in git. I assume it was left in configure.ac by mistake, referring to test cmds you use in debug.

I'm not sure what you mean by this... I have never had a cmd/icp directory and I don't see it in my configure.ac.

and I have ignored the assembler files, they don't work here. (gcc vs clang?)

I wonder why the assembly doesn't work. Perhaps I can borrow a Mac from someone at work and see what is going on there.

Contributor

lundman commented Mar 8, 2016

@tcaputi The dir-structure is all fine and, as you say, is like IllumOS. I meant more that any file that includes modctl.h (a Linux file, yes?) and the mod_init calls, I took out with #ifdef LINUX; but those files also contain sources for the IllumOS crypto API, so the files mix Linux-only code and IllumOS API code. No matter :)

cmd/icp/Makefile hmm, I had to remove it here, but now that I check my patch again for it, it is nowhere to be found. So I'm putting it down to my typo while resolving conflicts.

I took the assembler out for now, as I have other things to do before it can even run, so it's not worth worrying about just yet.

Also, my comments are not meant to be negative, just reporting in as I port it over to OSX. I am pleased something is happening in this area :)

Contributor

tcaputi commented Mar 8, 2016

@lundman

The dir-structure is all fine and, as you say, is like IllumOS. I meant more that any file that includes modctl.h (a Linux file, yes?) and the mod_init calls, I took out with #ifdef LINUX; but those files also contain sources for the IllumOS crypto API, so the files mix Linux-only code and IllumOS API code. No matter :)

The *_mod_init functions are from Illumos. I renamed them to avoid name conflicts (in Illumos they were all just called _init() because each one belongs to a separate module). The only Linux-specific functions are illumos_crypto_exit() and illumos_crypto_init(), which are the entry points of the module and are needed to call these init functions, since they are not separate modules in the ICP. There are no Linux headers in the ICP.
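For readers following along, a conceptual sketch of that single entry point. mod_hash_init() and kcf_init_mech_tabs() appear in the backtrace later in this thread; the aes/sha2 calls assume the *_mod_init() renaming pattern described above and are illustrative, not an exhaustive or ordered list:

int
illumos_crypto_init(void)
{
	/* shared infrastructure the former modules depend on */
	mod_hash_init();
	kcf_init_mech_tabs();

	/* the renamed per-"module" _init() routines */
	aes_mod_init();
	sha2_mod_init();

	return (0);
}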

Contributor

lundman commented Mar 8, 2016

Huh, that is interesting. Then it would be mixed Linux and Illumos code, so #ifdefs it is :)

Contributor

lundman commented Mar 8, 2016

There, I have brought all that mod* stuff back in, and it initialises as expected. I have ported everything I had missed. Currently it dies during init, here:

panic(cpu 1 caller 0xffffff7f8f655ec2): "rwlock 0xffffff8c89abeec0 not initialised\n"@spl-rwlock.c:88
(lldb) up
frame #3: 0xffffff7f8f7f9c44 zfs`mod_hash_insert(hash=0xffffff8c89abeec0, key=0xffffff7f8f93eae0, val=0xffffff7f8f93eb00) + 36 at modhash.c:601
-> 601      rw_enter(&hash->mh_contents, RW_WRITER);

(lldb) up
frame #4: 0xffffff7f8f7f089a zfs`kcf_init_mech_tabs + 1146 at kcf_mech_tabs.c:245
-> 245                  (void) mod_hash_insert(kcf_mech_hash,
   246                      (mod_hash_key_t)me_tab[i].me_name,
   247                      (mod_hash_val_t)&(me_tab[i].me_mechid));

(lldb) up
frame #5: 0xffffff7f8f7de292 zfs`illumos_crypto_init + 18 at illumos-crypto.c:91
   88       mod_hash_init();
   89
   90       /* initialize the mechanisms tables supported out-of-the-box */
-> 91       kcf_init_mech_tabs();

I have no call to initialise mh_contents in my patch, so maybe I missed something during the merge.

Contributor

lundman commented Mar 8, 2016

Hmm, actually it looks like IllumOS does not initialise it either, tsk tsk eh :)

Contributor

tcaputi commented Mar 8, 2016

I tried to catch most of those, but there were a lot of places where they rely on zeroed allocation to initialize locks and such. They also didn't have *_fini functions for all of their *_init ones, since a lot of their code is not designed to be removable. I tried to fix that as best as I could.
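As a concrete example of the kind of fix involved, a minimal sketch of the missing initialisation behind the panic above. Only mh_contents and rw_init() come from this thread; MH_SIZE()/nchains stand in for whatever the existing mod_hash allocation actually uses:

/* The hash is kmem_zalloc()'d, so on Illumos the rwlock was only ever
 * "initialised" by being zero-filled. Ports that assert on
 * uninitialised locks need an explicit init/destroy pair. */
mod_hash_t *hash;

hash = kmem_zalloc(MH_SIZE(nchains), KM_SLEEP);
rw_init(&hash->mh_contents, NULL, RW_DEFAULT, NULL);

/* ... and the matching teardown in the destroy path ... */
rw_destroy(&hash->mh_contents);
kmem_free(hash, MH_SIZE(nchains));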

Contributor

tcaputi commented Mar 8, 2016

In case anyone is watching the PR, I pushed a couple of small fixes for zvols, which are now also encrypted.

Contributor

tcaputi commented Mar 8, 2016

@behlendorf or anyone who knows:
I'm looking into getting the test builds to work. Currently they fail when building the ICP. It seems that the current build system flattens the directory structure of all the modules. Can anybody tell me why this is needed? If it is necessary (as I assume it is for some reason or another), can anyone tell me what the preferred way to structure the ICP would be? I like the current structure because it largely emulates the structure from Illumos, which I think is valuable for maintenance purposes.

While I'm asking questions, does anyone know why zil_claim() needs to call dmu_objset_own() as opposed to dmu_objset_hold()? My current implementation requires that the dataset's keychain is loaded for dmu_objset_own() to succeed, so the zil currently fails to claim encrypted datasets. Could it just do a hold and then check to see if it has an owner (under the dataset mutex)? Are there any circumstances where zil_claim() could be called while the dataset is already owned? The documentation for owning says:

Legitimate long-holders (including owners) should be long-running, cancelable tasks that should cause "zfs destroy" to fail.

Is zil_claim() really a long-running process? It certainly doesn't seem cancelable... If nobody knows, I'll just make the change and see if it breaks anything, but it would be good to know for sure why it was written like that to begin with.

Contributor

lundman commented Mar 9, 2016

I can report partial success. I cleaned up the missing rw_init with openzfsonosx/zfs@b8f2cc6,
fixed the UIO code for Apple, and made the change to spl-list.c (allowing list_insert_before to pass a NULL placement). I also corrected for the zio_checksum 3-args to 4-args change.

I can create a pool and an encrypted dataset, and copy a file to it; the file's contents do not show up in the pool image file.

The first sign of trouble is trying to import said pool again, but I will look into that too.

If any OSXers wish to play with it, it is under https://github.com/openzfsonosx/zfs/compare/wip_crypt3

Contributor

tcaputi commented Mar 14, 2016

So I've hit a bit of a roadblock and am not sure what the best solution is. If anybody knows a solution offhand, I would appreciate the input. The problem concerns partially encrypted types, particularly dnodes. Encrypted datasets currently encrypt dnodes by encrypting their bonus buffers, but not the dnode itself. Only a few bonus types need to be encrypted, and right now I'm just working with System Attribute bonus buffers. Other bonus buffers are left in the clear. This setup should hopefully allow us to scrub datasets and check for errors without the keys being loaded.

Before keys can be loaded, several functions (spa_load_verify() during zpool import, for instance) are called that need to read dnodes off the disk. These functions inevitably call arc_read() at some point, whether directly or through dnode_hold(). At this time, however, the keychains are not yet loaded into ZFS, so the zio layer returns the block of dnodes with all encrypted bonus buffers still encrypted. This is OK because nothing before this point needs the encrypted bonus buffers.

The problem arises when the dnode block is used again later, once the keys are loaded. At this point ZFS sees that the dnode block is cached in the ARC and doesn't bother to reread the data, so it comes back with the bonus buffers still encrypted. This obviously breaks any code that relies on that data.

The solutions I can think of are:

  1. Figure out all of the functions that don't need the bonus buffers and rework them to use zio_read() instead of arc_read(). This is hard because there's a lot of them, and I'm afraid I'll mess a few of them up in a way that we won't catch until much later when it breaks someone's system. This is also difficult because many of these functions go through dnode_hold(), and so I would need to add a flag parameter to that (heavily used) function.
  2. Force all encrypted bonus buffers into spill blocks. The spill blocks can be fully encrypted and leave dnodes fully unencrypted. Therefore, reading a block of dnodes should not read the bonus buffers. This is what Solaris chose, according to their article, but as @behlendorf commented above, this has performance repercussions (2 random reads for every dnode with an encrypted bonus buffer).
  3. Modify the ARC layer somehow to be able to mark buffers as encrypted / decrypted. If I could do this, then I could probably make it behave such that decrypted reads take precedence over non-decrypted reads. I can look into this, but I am not incredibly familiar with the ARC layer and I'm not sure how hard this would be to do. I also hesitate to add a flag field to arc_buf_t since (if my understanding is correct) there is one of those structs for each block in the ARC.

I apologize for the wall of text, but I would appreciate it if anyone had any input on how I should go about doing this.

@tcaputi: would it be possible to mark the bonus buffers cached in the ARC as dirtied elsewhere and thus force a re-read from disk through the decryption routines? Essentially, tell the ARC it's somehow out of sync with the on-disk data, instead of asking it to track every object as encrypted or not (which, being a boolean, is small overhead, but as you point out, that's precious space).

Contributor

tcaputi commented Mar 14, 2016

@sempervictus: Since I posted the question I found arc_buf_hdr_t, pointed to by arc_buf_t, which appears to be 1-1 mapped. This struct does have a field for flags, which solves the space problem. I'm still not sure what changes would need to be made to the ARC for this to work. Off the top of my head, I know that the ARC has a hash map of buffers, and I imagine trying to maintain 2 copies of the same buffer in that hash map could cause problems, since the second one would attempt to take the first's place. Short of any other suggestions, I plan on spending a lot of time tomorrow reading through arc.c.
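A very rough sketch of what option 3 might look like, following the existing ARC_FLAG_*/HDR_*() naming pattern; the flag, the macro, and keys_are_loaded are all invented names for illustration only:

/* Hypothetical flag in arc_buf_hdr_t->b_flags marking a cached buffer
 * whose bonus area is still ciphertext. Not actual ZoL code. */
#define	ARC_FLAG_BONUS_ENCRYPTED	(1U << 30)
#define	HDR_BONUS_ENCRYPTED(hdr)	\
	(((hdr)->b_flags & ARC_FLAG_BONUS_ENCRYPTED) != 0)

	/* on an arc_read() cache hit: */
	if (HDR_BONUS_ENCRYPTED(hdr) && keys_are_loaded) {
		/* decrypt in place (or re-issue the read through the
		 * decrypting zio path) and clear the flag before
		 * handing the buffer to the caller */
	}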

@tcaputi: you may want to reach out to @tuxoko regarding the ARC, since his ABD patch stack significantly alters its behavior and, hopefully, will soon become the de facto ARC behavior on Linux.

Member

tuxoko commented Mar 14, 2016

@tcaputi
While bonus buffers are part of the dnode, they are never operated on "in place". If you follow dmu_bonus_hold(), you'll see that it always allocates another buffer for the bonus. An interesting part is that this bonus buffer is never itself part of the ARC, and it will not be allocated during scrub.

So what you could do is not touch dnode_phys at all, and make it always carry an encrypted bonus. Then make the decryption happen when you read the bonus into the separate bonus buffer.
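A minimal sketch of that suggestion, hooking into the bonus path of dbuf_read_impl(); DN_BONUS() and the bcopy() out of the dnode are the existing mechanism, while dnode_bonus_is_encrypted() and dmu_bonus_decrypt() are hypothetical helper names:

/* Sketch only: leave dnode_phys_t encrypted on disk and in the ARC,
 * and decrypt just the private copy made for dmu_bonus_hold(). */
bcopy(DN_BONUS(dn->dn_phys), db->db.db_data, bonuslen);
if (dnode_bonus_is_encrypted(dn))				/* hypothetical */
	(void) dmu_bonus_decrypt(dn, db->db.db_data, bonuslen);	/* hypothetical */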

But of course, for other things like user data, you'll likely need to mark buffers as encrypted/decrypted in the ARC. The Illumos camp is working on compressed ARC; you might want to look at it and maybe find some useful ideas there. https://drive.google.com/file/d/0B5hUzsxe4cdmbEh2eEZDbjY3LXM/view?usp=sharing

@tuxoko: would it be feasible to create something along the lines of a negative DVA in the arc_buf_hdr_t structure for encrypted data such that repeated access to the encrypted state can still be cached, and the real DVA for decrypted data?

Contributor

tcaputi commented Mar 14, 2016

@tuxoko:
Thanks for taking the time to respond. If what you say is true and bonus buffers are not held in the ARC, then I'm not sure what is happening in my code anymore. I will need to run some more tests to see what is going on. From my previous debugging tests, I saw the dnodes getting read through the zio layer from the disk into the ARC during spa_load_verify(). After that I didn't see any other zio activity related to the encrypted dnodes. Then ZFS panicked on an SA magic number assert from one of the encrypted dnodes. Somewhere, my tests must be misleading me.

Contributor

tcaputi commented Mar 15, 2016

I think I'm approaching a solution, although there is now an additional (but hopefully easier) problem. It seems like the best place to decrypt the bonus buffers would be in dbuf_read_impl(), called by dmu_bonus_hold() (thanks very much to @tuxoko). Currently, all encrypted bonus buffers in a DMU_OT_DNODE block are encrypted together, so they collectively have 1 IV and 1 MAC, which is stored in the blkptr_t to the block (as it is for all other object types).

The issue is that decrypting a single bonus buffer requires decrypting all the bonus buffers in the block. I believe the full DMU_OT_DNODE block should be available as long as a bonus buffer within it is available, but this is still at the very least inefficient. Perhaps this wouldn't be quite as bad of an option if it also cached the other decrypted bonus buffers instead of throwing them away.

That said, I had mentioned earlier that we could fit the IV and MAC for a dnode into dnode_phys_t's dn_pad3. This, however, would use 32 bytes of the remaining 36 bytes of padding in that struct, so I would prefer not to do that.

I'm not sure which method is better here, but I'm (again) open to any input. In the meantime I will continue looking myself.

Member

kpande commented Mar 15, 2016

excellent work!

Contributor

tcaputi commented Mar 31, 2016

I had to work on another project for a few weeks, but now I am back to working on this full time. After looking at the status of large dnode support in #3542 I am now less averse to using some of the padding in dnode_phys_t. I don't think it will be possible to encrypt / decrypt bonus buffers without keeping the IV / MAC in there or being incredibly inefficient. Without storing these parameters on a per bonus buffer basis, entire dnode_phys_t blocks need to be decrypted together in order to get the contents of just one bonus buffer. The only other alternative seems to be to push all encrypted bonus buffer types into spill blocks, but that seems to work against #3542, which looks like it has the end goal of deprecating spill blocks altogether.

For now I will start working on adding encryption parameters to dnode_phys_t.
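For reference, a layout sketch of what those per-dnode parameters might look like. The struct name and exact packing are assumptions; only the sizes (96-bit IV, 128-bit MAC, 32 of the 36 spare bytes, i.e. all of dn_pad3[]) come from the discussion above:

/* Hypothetical packing of per-dnode crypto parameters into the
 * dnode_phys_t padding. Sketch only; not a final on-disk format. */
typedef struct dnode_crypt_params {
	uint8_t	dcp_iv[12];	/* 96-bit IV */
	uint8_t	dcp_pad[4];	/* keeps the MAC 8-byte aligned */
	uint8_t	dcp_mac[16];	/* 128-bit MAC */
} dnode_crypt_params_t;		/* 32 bytes == sizeof (dn_pad3) */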

Contributor

tcaputi commented May 4, 2016

@ALL ZFS encryption is (just about) ready for review (#4329). There are some issues with the buildbot's tests right now (which I will resolve shortly), but everything is otherwise implemented.

Is it possible to get this into 0.7.0?
I've built a Debian package for my local apt mirror including tcaputi's patch.
Encryption is now in production on my fileserver (raidz2, 6x4TB) with multiple datasets and different options (crypt+compr, crypt-only, compr-only).
Building my own dkms package is trivial, but continuing to test while keeping track of the current master branch is very difficult for such a big patch. Maybe we can include it but mark it as experimental for 0.7.0?

Contributor

tcaputi commented Sep 14, 2016

@cytrinox I would make the comment on the #4329 PR. I think most of the conversation about this feature is happening there at this point.

@tcaputi tcaputi referenced this issue Feb 9, 2017

Merged

ZFS Encryption #5769

Conan-Kudo added a commit to Conan-Kudo/zfs that referenced this issue Aug 17, 2017

Native Encryption for ZFS on Linux
This change incorporates three major pieces:

The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.

The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.

The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769