
Detect a slow raidz child during reads #17227


Open
robn wants to merge 1 commit into master from raidz-detect-slow-outlier

Conversation

robn (Member) commented Apr 8, 2025

Motivation and Context

Replacing #16900, which was almost finished with review updates but has stalled. I've been asked to take it over.

See original PR for details.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

robn (Member, Author) commented Apr 8, 2025

Because there will be a little bouncing between two PRs, and because there are two different authors involved, I'll be pushing fixup commits to this branch. Once everyone is happy with review, I will squash them down for merge.

I'll close out the remaining review comments on #16900, and would like it if new comments could be added here. Thanks all for your patience; I know it's a bit fiddly (it'd be nicer if GitHub would allow a PR to change branches, alas).

robn mentioned this pull request Apr 8, 2025
robn force-pushed the raidz-detect-slow-outlier branch from 905c466 to 4a1a3d3 on April 8, 2025 04:15
amotin added the "Status: Code Review Needed" label on Apr 8, 2025
tonyhutter (Contributor) commented:

I haven't looked into it, but I see all the FreeBSD runners have:

Tests with results other than PASS that are unexpected:
    SKIP events/setup (expected PASS)
    SKIP events/slow_vdev_sit_out (expected PASS)

tonyhutter (Contributor) commented May 15, 2025

@robn this fixed the FreeBSD CI errors for me: tonyhutter@a491397. Try squashing that in and rebasing.

behlendorf (Contributor) left a comment

Thanks for picking up this work! It'd be great if we can refine this a bit more and get it integrated.

@@ -1589,6 +1599,7 @@ function create_pool #pool devs_list

if is_global_zone ; then
[[ -d /$pool ]] && rm -rf /$pool
echo zpool create -f $pool $@
Contributor

Why the extra "echo" when log_must already logs this? This looks like some stray debugging which should be dropped.

for raidtype in raidz raidz2 raidz3 draid1 draid2 draid3 ; do
log_must zpool create $TESTPOOL2 $raidtype $TEST_BASE_DIR/vdev.$$.{0..9}
log_must dd if=/dev/urandom of=/$TESTPOOL2/bigfile bs=1M count=100
log_must zpool export $TESTPOOL2
Contributor

Let's add a comment above this zpool export / zpool import to indicate this is done to flush the ARC.

fi
done

log_must test "$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)" == "on"
Contributor

Let's close any potential race here and test against $sit_out from above instead of reading it from the pool again.

log_must test "$sit_out" == "on"

Moreover, it looks like all three test cases implement a version of this. It could be implemented as a function and moved to redundancy/redundancy.kshlib; at a minimum, let's make sure they're all implemented the same way.

# Wait for us to exit our sit out period
log_must wait_sit_out $TESTPOOL2 $BAD_VDEV 10

log_must test "$(get_vdev_prop sit_out $TESTPOOL2 $BAD_VDEV)" == "off"
Contributor

This check looks redundant since wait_sit_out returns an error if the timeout is reached.

save_tunable READ_SIT_OUT_SECS
set_tunable32 READ_SIT_OUT_SECS 5

log_must truncate -s 150M $TEST_BASE_DIR/vdev.$$.{0..9}
Contributor

nit: It shouldn't matter but it'd be nice to mix up the geometry here at least a little bit. Say use a vdev count in the range of 4-10 and then randomly select one of the vdevs to be bad.

@@ -2745,6 +2846,156 @@ vdev_raidz_worst_error(raidz_row_t *rr)
return (error);
}

/*
* Find the median value from a set of n values
Contributor

 * Find the median value from a set of n values.  For sets of an even size
 * and no exact middle value, the average of the two middle values is used.

/*
* Check for any latency outlier from latest set of child reads.
*
* Uses a Tukey's fence, with K = 2, for detecting extreme outliers. This
Contributor

@don-brady was K=2 experimentally determined to be a good value for the fence? K=3 is usually used to characterize extreme outliers, but perhaps that didn't catch enough? It'd be nice to have some more visibility into this so we could collect some useful statistics from real systems with problem drives.

Contributor

I think because you need to have repeated outlier incidents (50 of them) before you actually trigger a sit-out, a lower fence threshold is not unreasonable. If there were a decay mechanism for vdev_outlier_count, the disk would need to have consistently worse behavior than its siblings to trip that many times. But I don't see such a mechanism (other than zeroing when a child gets chosen to sit out), which I think means that over time you're just going to end up randomly sitting out whichever child happens to hit 50 outliers first. A higher fence value would make that take longer, but unless you set it very high it will still happen eventually even with no real disk problems.
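For context, a minimal sketch of a Tukey upper fence (hypothetical helper, not the PR's exact latency_quartiles_fence; assumes sorted input):

/*
 * Samples above Q3 + K * (Q3 - Q1) are treated as outliers.
 * A smaller K (e.g. 2) flags more samples than K = 3 would.
 */
static uint64_t
tukey_upper_fence(const uint64_t *sorted, int n, uint64_t k)
{
	uint64_t q1 = sorted[n / 4];
	uint64_t q3 = sorted[(3 * n) / 4];
	return (q3 + k * (q3 - q1));
}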


qsort((void *)lat_data, samples, sizeof (uint64_t), latency_compare);
uint64_t fence = latency_quartiles_fence(lat_data, samples);
if (lat_data[samples - 1] > fence) {
Contributor

We should be able to sit out multiple drives. What I think we want is to identify the worst outliers here (up to the parity level) and set ->vdev_read_sit_out_expire on them. Then in the vdev_*_io_start_read* functions we make the last-minute decision whether we need to issue an IO to this child. Only when issuing the IO can we fully take into account failed drives, known missing data, scrub/resilver IOs, IO errors, etc. Setting ->vdev_read_sit_out_expire should be considered additional advisory information, and we can (and should) read this device if we need to.
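A hedged sketch of that suggestion (vdev_child_latency() is a hypothetical per-child estimate; vdev_read_sit_out_expire is kept in seconds, matching the discussion below):

/* Mark up to nparity slowest children as advisory sit-out candidates. */
int outliers = 0;
for (int c = 0; c < vd->vdev_children &&
    outliers < vd->vdev_nparity; c++) {
	vdev_t *cvd = vd->vdev_child[c];
	if (vdev_child_latency(cvd) > fence) {
		cvd->vdev_read_sit_out_expire =
		    gethrestime_sec() + vdev_read_sit_out_secs;
		outliers++;
	}
}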

# Test file size in MB
count=200

for type in "raidz1" "raidz2" "raidz3" ; do
Contributor

We should be testing draid[1-3] here as well. If the test run time gets too long, one trick we use in other tests is to randomly pick one of these to test. We have enough CI test runners that we still get good coverage.

# 1. Create various raidz/draid pools
# 2. Inject delays into one of the disks
# 3. Verify disk is set to 'sit out' for a while.
# 4. Wait for READ_SIT_OUT_SECS and verify sit out state is lifted.
Contributor

We really should also include some test coverage for raidz[2-3] and draid[2-3] pools with failed drives.

pcd1193182 force-pushed the raidz-detect-slow-outlier branch 2 times, most recently from 83d0de9 to 6fa275b on July 28, 2025 23:59
pcd1193182 (Contributor) commented Aug 5, 2025

I've updated this branch to fix a few things. First, I've significantly reduced (hopefully eliminated almost entirely) the unnecessary sit-outs Brian was reporting. We reproduced them internally during our performance testing, and the updated version doesn't exhibit them at all. The changes are to use the latency histogram stats instead of the EWMA as a better source of data. We also decrease the check frequency dramatically to reduce noise, decrease the required number of outliers to compensate (along with adding a facility for extreme events to increase the outlier count more rapidly), increase the fence value significantly, and add a decay mechanism to prevent random noise from eventually causing a sit-out of healthy disks.
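As a rough illustration of the decay idea (hypothetical helper; the actual implementation may differ):

/*
 * Periodically halve each child's outlier count so that only a
 * consistently slow disk can accumulate enough outliers to trigger
 * a sit-out; healthy disks drift back toward zero.
 */
static void
vdev_outlier_counts_decay(vdev_t *vd)
{
	for (int c = 0; c < vd->vdev_children; c++)
		vd->vdev_child[c]->vdev_outlier_count /= 2;
}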

Second, I've added an autosit property to vdevs. When set to on, it activates this code. This allows users to decide if they want this logic enabled or not. This is intended to work with the next change:

Third, I've made the sit_out property writeable. This allows individual vdevs to be sat out from userland. This, in conjunction with the autosit property, allows the user to choose between no disk sit-outs, the kernel's automatic sit-outs, or something more complex. Using zpool iostat latency data, SMART stats, or any other data source they can think of, they could now create a userland daemon that monitors disk health and sits out disks it considers unhealthy.

Exposing this capability to userland has a number of advantages: easier access to high-level languages and their rich libraries, safer and more rapid iteration on complex logic, and the ability to improve the logic using new developments and advanced approaches without requiring a kernel upgrade or downtime. The kernel functionality is left in place as a simple plug-and-play approach.
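As a usage illustration (assuming the standard `zpool set <property>=<value> <pool> <vdev>` syntax for vdev properties), such a daemon could sit a suspect disk out with `zpool set sit_out=on <pool> <vdev>` and release it again by setting the property back to off.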

amotin (Member) left a comment

I agree that histograms should indeed be a better source of data compared to EWMA, except that, as I mention below, we may want to take a closer look at when we update the previous state. At the same time, I wonder if histograms may actually give us even more statistical information about the distribution curves, so that we could better estimate the confidence interval on a small number of disks.

Comment on lines +3847 to +3875
error = zap_lookup(spa->spa_meta_objset, vd->vdev_top_zap,
vdev_prop_to_name(VDEV_PROP_AUTOSIT), sizeof (autosit),
1, &autosit);
if (error == 0) {
vd->vdev_autosit = autosit == 1;
Member

This logic of read and overwrite looks odd. Is it more of "on"/unset vs "on"/"off"? BTW, I don't see it documented.

Contributor

The logic here is pretty much the same as failfast above it, the property is either on meaning that we do the autosit, or off meaning that we don't. I'm not sure what looks odd about it? I do need to update the man page for autosit and for sit_out.

Member

Ah, sorry, I've misread == 1 as = 1.

Comment on lines +6116 to +6145
/* Only expose this for a draid or raidz leaf */
if (!vd->vdev_ops->vdev_op_leaf ||
vd->vdev_top == NULL ||
(vd->vdev_top->vdev_ops != &vdev_raidz_ops &&
vd->vdev_top->vdev_ops != &vdev_draid_ops)) {
error = ENOTSUP;
Member

I wonder what happens with this in general, and with the loop below in particular, in the case of spare activation, when the leaf is not a direct child of the top-level vdev?

Contributor

Hm, that's a good question, I will test that and fix whatever parts are broken. Almost certainly some part of this logic will need correction for that case.

Contributor

Yeah, this is a little awkward. What ends up happening is that the spare or replacing vdev that sits between the raidz and the actual disks can be told to auto-sit as usual, but it can't be manually told to sit, because we don't allow setting vdev properties on those vdevs. The underlying disks can be told to sit out, but that doesn't matter because we only look at the direct children when actually considering the property.

I think the solution here is to forbid non-direct children from having the property set (both since it can't matter and because it could result in us over-sitting when the replace operation finishes), and also to allow the property to be set on the spare/replacing device. That one might be tricky to implement sanely, but I'll give it a try. We can also think about whether we want that sit-out to get passed on to the child when the spare/replacing vdev goes away.

Contributor

Setting properties is tied into having a vdev zap, which these intermediate-level vdevs don't have. I don't really want to change that just for this functionality. With that, the two choices are to try to drill down to the leaf vdevs and set the sitting properties on them directly, or to just not allow manual sitting/inspection of intermediate vdevs. I guess we could also allow a janky bypass of the vdev property code just for this property, but that seems extremely distasteful.

I'm not especially happy with any of these solutions; the code to drill down to the leaves is going to be messy, and having vdevs that are sitting out that we can't inspect is unpleasant. I'm open to suggestions on what others think would be the cleanest design here.

Contributor

I ended up going with the approach where non-direct children can have sit-out applied and it's respected properly; the code wasn't too bad, in the end.

vd->vdev_children < LAT_SAMPLES_MIN)
return;

hrtime_t now = gethrtime();
Member

I've already proposed once to switch to seconds to make this faster; that is why we have vdev_read_sit_out_expire in seconds. But here it happens again for vdev_last_latency_check, which as I understand is checked for every I/O, and I don't think we really need that. Any reason why we want vdev_raidz_outlier_check_interval_ms in milliseconds?

pcd1193182 (Contributor) commented Aug 5, 2025

This is the sort of thing I could imagine someone wanting to run more often than once per second. I don't think it's very important, though; I'm fine switching it to second granularity.

Another idea I had that might be helpful for this is to add a getlrtime function. Currently on Linux, we implement faster time functions with gethrestime_sec, which gets the number of seconds using the coarse time function. The underlying function returns time at the granularity of 10ms (or rather, 1000/CONFIG_HZ ms). That's more than fast enough for most of our purposes; we could just convert that value into nanoseconds and return it. It would give us a lower-effective-resolution version of gethrtime that still gives us better than second granularity.

I think this offers a good compromise between execution time and precision.

Member

Yeah. FreeBSD also has the get*uptime() set of functions with 1/HZ precision. I've been thinking about wrapping them for years, but it hasn't happened so far.

Comment on lines 3024 to 3028
mutex_enter(&cvd->vdev_stat_lock);
size_t size = sizeof (cvd->vdev_stat_ex.vsx_disk_histo[0]);
memcpy(cvd->vdev_prev_histo,
cvd->vdev_stat_ex.vsx_disk_histo[ZIO_TYPE_READ], size);
mutex_exit(&cvd->vdev_stat_lock);
Member

Any reason we couldn't do this in the loop above, to avoid entering/exiting the lock again? On the flip side, is it intentional that we don't do it on any of the "goto out" paths? I guess that once we've sat out the maximum number of disks, we may end up collecting the average over all the time until one expires.

Contributor

The thing I was worried about with doing it in the loop is the possibility that we update some but not all of them, since we goto out. I agree it probably makes sense to update them all even if we do goto out, though. I'll fix it so we don't need to re-grab the lock and always update all of them.
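Something along these lines, as a hedged sketch of the combined fix:

/*
 * Snapshot every child's read histogram in the same pass, while the
 * per-child stat lock is held, so all children are updated even when
 * the outlier check bails out early.
 */
for (int c = 0; c < vd->vdev_children; c++) {
	vdev_t *cvd = vd->vdev_child[c];
	mutex_enter(&cvd->vdev_stat_lock);
	/* ... existing latency sampling for the outlier check ... */
	memcpy(cvd->vdev_prev_histo,
	    cvd->vdev_stat_ex.vsx_disk_histo[ZIO_TYPE_READ],
	    sizeof (cvd->vdev_stat_ex.vsx_disk_histo[0]));
	mutex_exit(&cvd->vdev_stat_lock);
}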

pcd1193182 force-pushed the raidz-detect-slow-outlier branch 2 times, most recently from e620961 to 9138823 on August 6, 2025 23:52
behlendorf (Contributor) left a comment

Given the amount of churn in this PR it'd be nice to squash the commits the next time it's updated.

vdev_raidz_unsit_child(vdev_t *vd)
{
for (int c = 0; c < vd->vdev_children; c++) {
vdev_raidz_sit_child(vd->vdev_child[c]);
Contributor

Suggested change
vdev_raidz_sit_child(vd->vdev_child[c]);
vdev_raidz_unsit_child(vd->vdev_child[c]);

* We can't sit out more disks than we have
* parity
*/
skip = B_TRUE;
Contributor

cstyle

getlrtime(void)
{
inode_timespec_t ts;
ktime_get_coarse_real_ts64(&ts);
Contributor

I believe you want ktime_get_coarse_ts64() here. The _real_ version of this function isn't monotonic and can jump backwards due to NTP updates. By using the _coarse_ version the resolution may be as low as 10ms (CONFIG_HZ_100). Although, these days I believe most kernels set either CONFIG_HZ_250 or CONFIG_HZ_1000.

https://docs.kernel.org/core-api/timekeeping.html
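A minimal sketch of the corrected helper under that suggestion (assuming the surrounding SPL types and constants):

/*
 * Monotonic coarse time in nanoseconds.  Resolution is 1000/CONFIG_HZ
 * ms, which is plenty for a check interval measured in hundreds of
 * milliseconds.
 */
static inline hrtime_t
getlrtime(void)
{
	inode_timespec_t ts;
	ktime_get_coarse_ts64(&ts);
	return (((hrtime_t)ts.tv_sec * NANOSEC) + ts.tv_nsec);
}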

if (samples > LAT_SAMPLES_STACK)
lat_data = kmem_alloc(sizeof (uint64_t) * samples, KM_SLEEP);
else
lat_data = &data[0];
Contributor

Given that by default we're now only calling this once a second, it should be fine to always kmem_alloc/free this memory.
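That is, something like (sketch):

lat_data = kmem_alloc(sizeof (uint64_t) * samples, KM_SLEEP);
/* ... sort the samples and evaluate the fence ... */
kmem_free(lat_data, sizeof (uint64_t) * samples);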

ZFS_MODULE_PARAM(zfs_vdev, vdev_, read_sit_out_secs, ULONG, ZMOD_RW,
"Raidz/draid slow disk sit out time period in seconds");
ZFS_MODULE_PARAM(zfs_vdev, vdev_, raidz_outlier_check_interval_ms, ULONG,
ZMOD_RW, "Interval to check for slow raidz children");
Contributor

Suggested change
ZMOD_RW, "Interval to check for slow raidz children");
ZMOD_RW, "Interval to check for slow raidz/draid children");

man/man4/zfs.4 Outdated
data will be reconstructed as needed from parity.
Resilver and scrub operations will always read from a disk, even if it's
sitting out.
Only a single disk in a RAID-Z or dRAID vdev may sit out at the same time.
Contributor

Up to the number of parity devices...

pcd1193182 force-pushed the raidz-detect-slow-outlier branch from 9138823 to b4ad263 on August 7, 2025 21:39
A single slow responding disk can affect the overall read
performance of a raidz group.  When a raidz child disk is
determined to be a persistent slow outlier, then have it
sit out during reads for a period of time. The raidz group
can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its
`vdev_stat.vs_slow_ios` count is incremented and a zevent
class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the
`raid_read_sit_out_secs` module parameter.  Setting it to
zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <don.brady@klarasystems.com>
pcd1193182 force-pushed the raidz-detect-slow-outlier branch from b4ad263 to 7013cb6 on August 7, 2025 21:50