
Commit e8b96c6

ahrens authored and behlendorf committed
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work

1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithm (async, scrub classes) or FIFO (sync classes); a simplified sketch of this selection logic follows below. The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details.

2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details.

This diff has several other effects, including:

* The commonly-tuned global variable zfs_vdev_max_pending has been removed; use the per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.

* The size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is the amount of dirty data. Systems under light or medium load will now often see that a txg is always syncing, but the impact on performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.

* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPUs (on machines with at least 4 CPUs; the percentage is rounded up).

--matt
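For illustration, here is a minimal user-space sketch of the two-pass class selection described in item 1. It is not the actual vdev_queue.c code: the names pick_class, limits, and global_max_active are hypothetical, and the per-class numbers are only loosely modeled on the zfs_vdev_*_min_active / zfs_vdev_*_max_active and zfs_vdev_max_active tunables listed further down.

/*
 * Illustrative sketch only -- not the actual vdev_queue.c code.  It models
 * the rule described in item 1: first guarantee each class its minimum
 * number of in-flight i/os, then hand out remaining slots (up to a global
 * cap) in priority order.
 */
#include <stddef.h>

enum io_class { SYNC_READ, SYNC_WRITE, ASYNC_READ, ASYNC_WRITE, SCRUB, NCLASSES };

struct class_limits {
	int min_active;		/* guaranteed concurrent i/os for this class */
	int max_active;		/* per-class cap on concurrent i/os */
};

/* Hypothetical defaults, loosely modeled on the zfs_vdev_* tunables. */
static const struct class_limits limits[NCLASSES] = {
	{ 10, 10 },		/* sync read */
	{ 10, 10 },		/* sync write */
	{  1,  3 },		/* async read */
	{  1, 10 },		/* async write (max is adjusted with dirty data) */
	{  1,  2 },		/* scrub/resilver */
};

static const int global_max_active = 1000;	/* cf. zfs_vdev_max_active */

/* Return the class to issue from next, or -1 if no class may issue. */
static int
pick_class(const int active[NCLASSES], int total_active)
{
	int c;

	/* Pass 1: keep every class at or above its guaranteed minimum. */
	for (c = 0; c < NCLASSES; c++)
		if (active[c] < limits[c].min_active)
			return (c);

	/* Pass 2: fill up to per-class maximums, in priority order. */
	if (total_active < global_max_active)
		for (c = 0; c < NCLASSES; c++)
			if (active[c] < limits[c].max_active)
				return (c);

	return (-1);		/* device queue is as full as we allow */
}

The real scheduler additionally scales the async write class's maximum between its min and max according to how much dirty data is outstanding (the zfs_vdev_async_write_active_min_dirty_percent / zfs_vdev_async_write_active_max_dirty_percent tunables), which is what couples the scheduler to the new write throttle.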
APPENDIX: problems with the current i/o scheduler

The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds).

If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes).

Notes on porting to ZFS on Linux:

- zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields.

- vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree, which is now used for the same purpose.

- vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above.

- The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping on a condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaces the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic (a simplified user-space stand-in is sketched after these notes).

- These tunables are added as module parameters, with descriptions added to the zfs-module-parameters.5 man page:

    spa_asize_inflation
    zfs_deadman_synctime_ms
    zfs_vdev_max_active
    zfs_vdev_async_write_active_min_dirty_percent
    zfs_vdev_async_write_active_max_dirty_percent
    zfs_vdev_async_read_max_active
    zfs_vdev_async_read_min_active
    zfs_vdev_async_write_max_active
    zfs_vdev_async_write_min_active
    zfs_vdev_scrub_max_active
    zfs_vdev_scrub_min_active
    zfs_vdev_sync_read_max_active
    zfs_vdev_sync_read_min_active
    zfs_vdev_sync_write_max_active
    zfs_vdev_sync_write_min_active
    zfs_dirty_data_max_percent
    zfs_delay_min_dirty_percent
    zfs_dirty_data_max_max_percent
    zfs_dirty_data_max
    zfs_dirty_data_max_max
    zfs_dirty_data_sync
    zfs_delay_scale

  The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates the types supported by Linux's module_param(), but means they may overflow on 32-bit architectures. The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32, which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults.

- Fixed reversed logic in the comment above the zfs_delay_scale declaration.

- Clarified comments in vdev_queue.c regarding when per-queue minimums take effect.

- Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The former counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen).

- The original patch would have regressed the bug fixed in c418410, which prevented users from setting the zfs_vdev_aggregation_limit tunable larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate().

- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack.
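To make the dirty-data-based delay from item 2 concrete, below is a self-contained user-space sketch. It is a simplified model, not the kernel implementation: gethrtime_sketch() and zfs_sleep_until_sketch() are stand-ins for the real gethrtime() and the zfs_sleep_until() wrapper described above, the tunable values are examples only, and the real dmu_tx_delay() additionally coordinates wakeup times across transactions via dp_last_wakeup.

#include <stdint.h>
#include <time.h>

typedef int64_t hrtime_t;			/* nanoseconds, as in ZFS */

/* Example tunable values; the real ones are module parameters. */
static const uint64_t zfs_dirty_data_max = 4ULL << 30;	/* e.g. 4 GiB */
static const int zfs_delay_min_dirty_percent = 60;
static const uint64_t zfs_delay_scale = 500000;		/* nanoseconds */

static hrtime_t
gethrtime_sketch(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((hrtime_t)ts.tv_sec * 1000000000LL + ts.tv_nsec);
}

/* Stand-in for the zfs_sleep_until() wrapper described in the notes above. */
static void
zfs_sleep_until_sketch(hrtime_t wakeup)
{
	hrtime_t delta = wakeup - gethrtime_sketch();
	if (delta > 0) {
		struct timespec ts;
		ts.tv_sec = delta / 1000000000LL;
		ts.tv_nsec = delta % 1000000000LL;
		nanosleep(&ts, NULL);
	}
}

/*
 * Delay one transaction based on how much dirty data the pool holds.
 * Below the threshold there is no delay at all; above it the delay grows
 * smoothly, heading toward infinity as dirty approaches zfs_dirty_data_max.
 */
static void
delay_for_dirty(uint64_t dirty, hrtime_t tx_start)
{
	uint64_t delay_min_bytes =
	    zfs_dirty_data_max * zfs_delay_min_dirty_percent / 100;
	hrtime_t min_tx_time;

	if (dirty <= delay_min_bytes)
		return;

	/* The real code never lets dirty reach the max; avoid dividing by 0. */
	if (dirty >= zfs_dirty_data_max)
		dirty = zfs_dirty_data_max - 1;

	min_tx_time = (hrtime_t)(zfs_delay_scale *
	    (dirty - delay_min_bytes) / (zfs_dirty_data_max - dirty));

	zfs_sleep_until_sketch(tx_start + min_tx_time);
}

Because the per-transaction delay depends only on the current amount of dirty data, a steady workload sees many small, even delays rather than the old behavior of stalling every writer for seconds at txg boundaries.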
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  http://www.illumos.org/issues/4045
  illumos/illumos-gate@69962b5

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
1 parent 384f8a0 commit e8b96c6

38 files changed: +1943 −846 lines

include/sys/Makefile.am

Lines changed: 1 addition & 0 deletions

@@ -62,6 +62,7 @@ COMMON_H = \
 	$(top_srcdir)/include/sys/zfs_context.h \
 	$(top_srcdir)/include/sys/zfs_ctldir.h \
 	$(top_srcdir)/include/sys/zfs_debug.h \
+	$(top_srcdir)/include/sys/zfs_delay.h \
 	$(top_srcdir)/include/sys/zfs_dir.h \
 	$(top_srcdir)/include/sys/zfs_fuid.h \
 	$(top_srcdir)/include/sys/zfs_rlock.h \

include/sys/arc.h

Lines changed: 4 additions & 8 deletions

@@ -145,12 +145,13 @@ int arc_referenced(arc_buf_t *buf);
 #endif
 
 int arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
-    arc_done_func_t *done, void *private, int priority, int flags,
+    arc_done_func_t *done, void *private, zio_priority_t priority, int flags,
     uint32_t *arc_flags, const zbookmark_t *zb);
 zio_t *arc_write(zio_t *pio, spa_t *spa, uint64_t txg,
     blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, boolean_t l2arc_compress,
-    const zio_prop_t *zp, arc_done_func_t *ready, arc_done_func_t *done,
-    void *private, int priority, int zio_flags, const zbookmark_t *zb);
+    const zio_prop_t *zp, arc_done_func_t *ready, arc_done_func_t *physdone,
+    arc_done_func_t *done, void *private, zio_priority_t priority,
+    int zio_flags, const zbookmark_t *zb);
 
 arc_prune_t *arc_add_prune_callback(arc_prune_func_t *func, void *private);
 void arc_remove_prune_callback(arc_prune_t *p);
@@ -179,11 +180,6 @@ void l2arc_fini(void);
 void l2arc_start(void);
 void l2arc_stop(void);
 
-/* Global tunings */
-extern int zfs_write_limit_shift;
-extern unsigned long zfs_write_limit_max;
-extern kmutex_t zfs_write_limit_lock;
-
 #ifndef _KERNEL
 extern boolean_t arc_watch;
 #endif

include/sys/dbuf.h

Lines changed: 4 additions & 1 deletion

@@ -112,6 +112,9 @@ typedef struct dbuf_dirty_record {
 	/* pointer to parent dirty record */
 	struct dbuf_dirty_record *dr_parent;
 
+	/* How much space was changed to dsl_pool_dirty_space() for this? */
+	unsigned int dr_accounted;
+
 	union dirty_types {
 		struct dirty_indirect {
 
@@ -252,7 +255,7 @@ dmu_buf_impl_t *dbuf_hold_level(struct dnode *dn, int level, uint64_t blkid,
 int dbuf_hold_impl(struct dnode *dn, uint8_t level, uint64_t blkid, int create,
     void *tag, dmu_buf_impl_t **dbp);
 
-void dbuf_prefetch(struct dnode *dn, uint64_t blkid);
+void dbuf_prefetch(struct dnode *dn, uint64_t blkid, zio_priority_t prio);
 
 void dbuf_add_ref(dmu_buf_impl_t *db, void *tag);
 uint64_t dbuf_refcount(dmu_buf_impl_t *db);

include/sys/dmu.h

Lines changed: 1 addition & 0 deletions

@@ -218,6 +218,7 @@ typedef enum dmu_object_type {
 typedef enum txg_how {
 	TXG_WAIT = 1,
 	TXG_NOWAIT,
+	TXG_WAITED,
 } txg_how_t;
 
 void byteswap_uint64_array(void *buf, size_t size);
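The new TXG_WAITED value lets a caller that has already waited in dmu_tx_wait() tell dmu_tx_assign() not to throttle it a second time. As a rough illustration, here is a hypothetical helper (not an actual call site; real consumers such as zfs_write() open-code this pattern with their own dmu_tx_hold_*() calls and error handling):

#include <sys/dmu.h>
#include <sys/dmu_tx.h>

/*
 * Sketch of the assign/wait/retry loop that TXG_WAITED supports.
 * Simplified; the dmu_tx_hold_*() calls depend on the operation.
 */
static int
assign_tx_with_retry(objset_t *os, dmu_tx_t **txp)
{
	boolean_t waited = B_FALSE;
	int error;

	for (;;) {
		dmu_tx_t *tx = dmu_tx_create(os);
		/* ... dmu_tx_hold_*() calls for the operation go here ... */

		error = dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT);
		if (error == 0) {
			*txp = tx;
			return (0);
		}
		if (error != ERESTART) {
			dmu_tx_abort(tx);
			return (error);
		}
		/* Throttled: wait for dirty data to drain, then retry. */
		dmu_tx_wait(tx);
		dmu_tx_abort(tx);
		waited = B_TRUE;
	}
}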

include/sys/dmu_tx.h

Lines changed: 19 additions & 4 deletions

@@ -23,7 +23,7 @@
  * Use is subject to license terms.
  */
 /*
- * Copyright (c) 2012 by Delphix. All rights reserved.
+ * Copyright (c) 2013 by Delphix. All rights reserved.
  */
 
 #ifndef _SYS_DMU_TX_H
@@ -60,8 +60,22 @@ struct dmu_tx {
 	txg_handle_t tx_txgh;
 	void *tx_tempreserve_cookie;
 	struct dmu_tx_hold *tx_needassign_txh;
-	list_t tx_callbacks; /* list of dmu_tx_callback_t on this dmu_tx */
-	uint8_t tx_anyobj;
+
+	/* list of dmu_tx_callback_t on this dmu_tx */
+	list_t tx_callbacks;
+
+	/* placeholder for syncing context, doesn't need specific holds */
+	boolean_t tx_anyobj;
+
+	/* has this transaction already been delayed? */
+	boolean_t tx_waited;
+
+	/* time this transaction was created */
+	hrtime_t tx_start;
+
+	/* need to wait for sufficient dirty space */
+	boolean_t tx_wait_dirty;
+
 	int tx_err;
 #ifdef DEBUG_DMU_TX
 	uint64_t tx_space_towrite;
@@ -121,7 +135,8 @@ typedef struct dmu_tx_stats {
 	kstat_named_t dmu_tx_memory_reclaim;
 	kstat_named_t dmu_tx_memory_inflight;
 	kstat_named_t dmu_tx_dirty_throttle;
-	kstat_named_t dmu_tx_write_limit;
+	kstat_named_t dmu_tx_dirty_delay;
+	kstat_named_t dmu_tx_dirty_over_max;
 	kstat_named_t dmu_tx_quota;
 } dmu_tx_stats_t;
 
include/sys/dsl_dir.h

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
*/
2121
/*
2222
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
23-
* Copyright (c) 2012 by Delphix. All rights reserved.
23+
* Copyright (c) 2013 by Delphix. All rights reserved.
2424
*/
2525

2626
#ifndef _SYS_DSL_DIR_H

include/sys/dsl_pool.h

Lines changed: 21 additions & 10 deletions

@@ -20,7 +20,7 @@
  */
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright (c) 2012 by Delphix. All rights reserved.
+ * Copyright (c) 2013 by Delphix. All rights reserved.
  */
 
 #ifndef _SYS_DSL_POOL_H
@@ -51,6 +51,14 @@ struct dsl_pool;
 struct dmu_tx;
 struct dsl_scan;
 
+extern unsigned long zfs_dirty_data_max;
+extern unsigned long zfs_dirty_data_max_max;
+extern unsigned long zfs_dirty_data_sync;
+extern int zfs_dirty_data_max_percent;
+extern int zfs_dirty_data_max_max_percent;
+extern int zfs_delay_min_dirty_percent;
+extern unsigned long zfs_delay_scale;
+
 /* These macros are for indexing into the zfs_all_blkstats_t. */
 #define	DMU_OT_DEFERRED	DMU_OT_NONE
 #define	DMU_OT_OTHER	DMU_OT_NUMTYPES /* place holder for DMU_OT() types */
@@ -85,9 +93,6 @@ typedef struct dsl_pool {
 
 	/* No lock needed - sync context only */
 	blkptr_t dp_meta_rootbp;
-	hrtime_t dp_read_overhead;
-	uint64_t dp_throughput; /* bytes per millisec */
-	uint64_t dp_write_limit;
 	uint64_t dp_tmp_userrefs_obj;
 	bpobj_t dp_free_bpobj;
 	uint64_t dp_bptree_obj;
@@ -97,12 +102,19 @@ typedef struct dsl_pool {
 
 	/* Uses dp_lock */
 	kmutex_t dp_lock;
-	uint64_t dp_space_towrite[TXG_SIZE];
-	uint64_t dp_tempreserved[TXG_SIZE];
+	kcondvar_t dp_spaceavail_cv;
+	uint64_t dp_dirty_pertxg[TXG_SIZE];
+	uint64_t dp_dirty_total;
 	uint64_t dp_mos_used_delta;
 	uint64_t dp_mos_compressed_delta;
 	uint64_t dp_mos_uncompressed_delta;
 
+	/*
+	 * Time of most recently scheduled (furthest in the future)
+	 * wakeup for delayed transactions.
+	 */
+	hrtime_t dp_last_wakeup;
+
 	/* Has its own locking */
 	tx_state_t dp_tx;
 	txg_list_t dp_dirty_datasets;
@@ -131,10 +143,8 @@ void dsl_pool_sync_done(dsl_pool_t *dp, uint64_t txg);
 int dsl_pool_sync_context(dsl_pool_t *dp);
 uint64_t dsl_pool_adjustedsize(dsl_pool_t *dp, boolean_t netfree);
 uint64_t dsl_pool_adjustedfree(dsl_pool_t *dp, boolean_t netfree);
-int dsl_pool_tempreserve_space(dsl_pool_t *dp, uint64_t space, dmu_tx_t *tx);
-void dsl_pool_tempreserve_clear(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx);
-void dsl_pool_memory_pressure(dsl_pool_t *dp);
-void dsl_pool_willuse_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx);
+void dsl_pool_dirty_space(dsl_pool_t *dp, int64_t space, dmu_tx_t *tx);
+void dsl_pool_undirty_space(dsl_pool_t *dp, int64_t space, uint64_t txg);
 void dsl_free(dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp);
 void dsl_free_sync(zio_t *pio, dsl_pool_t *dp, uint64_t txg,
     const blkptr_t *bpp);
@@ -143,6 +153,7 @@ void dsl_pool_upgrade_clones(dsl_pool_t *dp, dmu_tx_t *tx);
 void dsl_pool_upgrade_dir_clones(dsl_pool_t *dp, dmu_tx_t *tx);
 void dsl_pool_mos_diduse_space(dsl_pool_t *dp,
     int64_t used, int64_t comp, int64_t uncomp);
+boolean_t dsl_pool_need_dirty_delay(dsl_pool_t *dp);
 void dsl_pool_config_enter(dsl_pool_t *dp, void *tag);
 void dsl_pool_config_exit(dsl_pool_t *dp, void *tag);
 boolean_t dsl_pool_config_held(dsl_pool_t *dp);
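The dp_dirty_pertxg[]/dp_dirty_total fields and the dsl_pool_dirty_space()/dsl_pool_undirty_space() prototypes above replace the old per-txg tempreserve accounting. A speculative sketch of bookkeeping with this shape — not the actual dsl_pool.c code, and with the hypothetical names dirty_space_sketch()/undirty_space_sketch() — might look like:

#include <sys/dsl_pool.h>
#include <sys/txg.h>

/* Add dirty data as transactions dirty buffers. */
static void
dirty_space_sketch(dsl_pool_t *dp, int64_t space, uint64_t txg)
{
	mutex_enter(&dp->dp_lock);
	dp->dp_dirty_pertxg[txg & TXG_MASK] += space;
	dp->dp_dirty_total += space;
	mutex_exit(&dp->dp_lock);
}

/* Subtract dirty data as i/o completes, not at the end of spa_sync(). */
static void
undirty_space_sketch(dsl_pool_t *dp, int64_t space, uint64_t txg)
{
	mutex_enter(&dp->dp_lock);
	ASSERT3U(dp->dp_dirty_pertxg[txg & TXG_MASK], >=, space);
	dp->dp_dirty_pertxg[txg & TXG_MASK] -= space;
	dp->dp_dirty_total -= space;
	/* Writers throttled on dirty space may now proceed. */
	cv_broadcast(&dp->dp_spaceavail_cv);
	mutex_exit(&dp->dp_lock);
}

Decrementing as writes complete is what lets the throttle ease off continuously during a sync instead of releasing all waiters at once when the txg finishes.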

include/sys/sa_impl.h

Lines changed: 2 additions & 2 deletions

@@ -20,7 +20,7 @@
  */
 /*
  * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright (c) 2012 by Delphix. All rights reserved.
+ * Copyright (c) 2013 by Delphix. All rights reserved.
  */
 
 #ifndef _SYS_SA_IMPL_H
@@ -153,7 +153,7 @@ struct sa_os {
  *
  * The header has a fixed portion with a variable number
  * of "lengths" depending on the number of variable sized
- * attribues which are determined by the "layout number"
+ * attributes which are determined by the "layout number"
 */
 
 #define	SA_MAGIC	0x2F505A /* ZFS SA */

include/sys/spa_impl.h

Lines changed: 2 additions & 2 deletions

@@ -20,7 +20,7 @@
  */
 /*
  * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
- * Copyright (c) 2012 by Delphix. All rights reserved.
+ * Copyright (c) 2013 by Delphix. All rights reserved.
  * Copyright 2011 Nexenta Systems, Inc. All rights reserved.
 */
 
@@ -234,7 +234,7 @@ struct spa {
 	uint64_t	spa_feat_desc_obj;	/* Feature descriptions */
 	taskqid_t	spa_deadman_tqid;	/* Task id */
 	uint64_t	spa_deadman_calls;	/* number of deadman calls */
-	uint64_t	spa_sync_starttime;	/* starting time fo spa_sync */
+	hrtime_t	spa_sync_starttime;	/* starting time of spa_sync */
 	uint64_t	spa_deadman_synctime;	/* deadman expiration timer */
 	spa_stats_t	spa_stats;		/* assorted spa statistics */
 

include/sys/txg.h

Lines changed: 2 additions & 1 deletion

@@ -23,7 +23,7 @@
  * Use is subject to license terms.
  */
 /*
- * Copyright (c) 2012 by Delphix. All rights reserved.
+ * Copyright (c) 2013 by Delphix. All rights reserved.
 */
 
 #ifndef _SYS_TXG_H
@@ -76,6 +76,7 @@ extern void txg_register_callbacks(txg_handle_t *txghp, list_t *tx_callbacks);
 
 extern void txg_delay(struct dsl_pool *dp, uint64_t txg, hrtime_t delta,
     hrtime_t resolution);
+extern void txg_kick(struct dsl_pool *dp);
 
 /*
 * Wait until the given transaction group has finished syncing.
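txg_kick() gives the dirty-data accounting a way to start syncing the open txg early, which is how zfs_dirty_data_sync takes the place of the old time-based txg goal. A hypothetical illustration of the intended trigger (the real trigger is more careful about pool state and whether a sync is already in progress); maybe_kick_txg() is an invented name:

#include <sys/dsl_pool.h>
#include <sys/txg.h>

/* Once enough dirty data accumulates, don't wait for the txg timeout. */
static void
maybe_kick_txg(dsl_pool_t *dp)
{
	if (dp->dp_dirty_total > zfs_dirty_data_sync)
		txg_kick(dp);
}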
