
Reduced performance for 0.6.5 in virtualized environment #4880

Closed
koplover opened this issue Jul 26, 2016 · 44 comments
Labels
  Component: ZVOL (ZFS Volumes)
  Status: Inactive (Not being actively updated)
  Type: Performance (Performance improvement or performance problem)
  Type: Regression (Indicates a functional regression)


@koplover

koplover commented Jul 26, 2016

We have been using zfs for a couple of years in our xen virtualized environment. Initially we used 0.6.3 before moving to 0.6.4.2 running within our disk driver domain.

We have been experiencing various lockup issues in these releases, triggered by memory pressure and impacting our guest workloads and end-user experience, and for the last six months we have been trying to move to the 0.6.5 stream to improve stability. Unfortunately, we hit performance issues whenever we move to 0.6.5.

On two systems that are identical in hardware and software except for the zfs version (0.6.4.2 vs 0.6.5.7), I run a benchmark (Microsoft JetStress) in a Windows Xen guest to measure disk-subsystem performance for an Exchange DB. Running this benchmark against 0.6.4.2 (Achieved Transactional I/O per Second 464.083) shows far better performance than the configuration using 0.6.5.7 (Achieved Transactional I/O per Second 226.203).

In these benchmarks an Exchange database is created and manipulated on a separate volume of the Windows guest VM, formatted as 64K-cluster NTFS and backed by a 64K-block zvol. From monitoring this underlying zvol with iostat, it seems something has changed in the zvol implementation that impairs concurrency at the disk layer (during the actual measurement stage the average queue size on the 0.6.5.7 deployment stays just below 1 in iostat).
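
For reference, a minimal sketch of how such a zvol could be created and watched - the volume name and 100G size here are illustrative; only the pool name and the zd112 device come from the samples below, and the 10-second iostat interval is an assumption:

zfs create -V 100G -o volblocksize=64K diskconvm/exchange-db
iostat -xk /dev/zd112 10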

Looking at the various stages of the benchmark and the impact on the zvol via iostat I see the following:

Database creation phase (synchronous writes, no reads)
0.6.5.7 - example sample

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd112             0.00     0.00    0.00  536.30     0.00 152144.00   567.38     0.70    1.31    0.00    1.31   1.31  70.27

0.6.4.2 - example sample

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd4640            0.00     0.00    0.00  347.87     0.00 94684.40   544.37     0.24    0.68    0.00    0.68   0.48  16.53

So write performance in this benchmark initialisation stage is significantly better for the 0.6.5.7 deployment than for 0.6.4.2.

Benchmark measurement phase (mix read / write)
0.6.5.7 - example sample

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd112             0.00     0.00  119.33  107.20  5596.00  3315.50    78.68     0.80    3.53    6.67    0.03   3.53  79.99

0.6.4.2 - example sample

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd4640            0.00     0.00  243.40  204.57  9823.20  6188.03    71.48     1.39    3.10    5.56    0.18   1.61  72.28

Read/write throughput is significantly lower here for 0.6.5.7; I suspect the read-side difference is driving the impact on writes. The average queue size for 0.6.4.2 is not huge, but it is > 1.

Checksum checking phase (read only phase, checking db integrity at the end of the benchmark)
0.6.5.7 - example sample

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd112             0.00     0.00  619.87    0.00 39671.47     0.00   128.00     0.98    1.57    1.57    0.00   1.58  98.08

0.6.4.2 - example sample

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
zd4640            0.00     0.00  817.23    0.00 52302.93     0.00   128.00    27.61   34.17   34.17    0.00   1.22  99.83

The stats here look worse for 0.6.4.2, with the exception of read KB/s, which is better. This is a (mostly) sequential read, 128 blocks at a time, through the DB to perform the verification. I assume the larger average queue size allows for greater concurrency lower down the stack - and for some reason in 0.6.5.7 there is some form of serialization keeping the queue size below 1 here?

Some further information on the deployments:
0.6.5.7
VM running zfs:
8GB, 4 vcpus

zpool status:
  pool: diskconvm
 state: ONLINE
  scan: none requested
config:


NAME                                    STATE     READ WRITE CKSUM
diskconvm                               ONLINE       0     0     0
  mirror-0                              ONLINE       0     0     0
    ata-MB4000GCVBU_WMC130027610-part6  ONLINE       0     0     0
    ata-MB4000GCVBU_WMC130027304-part6  ONLINE       0     0     0
logs
  nvme0n1p1                             ONLINE       0     0     0
cache
  nvme0n1p2                             ONLINE       0     0     0

errors: No known data errors

SLOG is 5GB
L2-cache is 40GB

arcstats (note: not captured while the benchmark was running, taken later)

cat /proc/spl/kstat/zfs/arcstats

name                            type data
hits                            4    800938102
misses                          4    95584272
demand_data_hits                4    680089930
demand_data_misses              4    24056278
demand_metadata_hits            4    98553094
demand_metadata_misses          4    31939474
prefetch_data_hits              4    22183275
prefetch_data_misses            4    39487999
prefetch_metadata_hits          4    111803
prefetch_metadata_misses        4    100521
mru_hits                        4    291564434
mru_ghost_hits                  4    7763753
mfu_hits                        4    490582956
mfu_ghost_hits                  4    4077487
deleted                         4    141253300
mutex_miss                      4    8202
evict_skip                      4    3399460
evict_not_enough                4    20995
evict_l2_cached                 4    938505652224
evict_l2_eligible               4    598995230720
evict_l2_ineligible             4    47057618944
evict_l2_skip                   4    157936
hash_elements                   4    3954384
hash_elements_max               4    11437702
hash_collisions                 4    510396727
hash_chains                     4    926524
hash_chain_max                  4    33
p                               4    3183601582
c                               4    5115232256
c_min                           4    2557616128
c_max                           4    5115232256
size                            4    5076138536
hdr_size                        4    716308416
data_size                       4    3611287552
metadata_size                   4    276021248
other_size                      4    261026608
anon_size                       4    3710976
anon_evictable_data             4    0
anon_evictable_metadata         4    0
mru_size                        4    1790918656
mru_evictable_data              4    1595752448
mru_evictable_metadata          4    6385664
mru_ghost_size                  4    3185651200
mru_ghost_evictable_data        4    2179940352
mru_ghost_evictable_metadata    4    1005710848
mfu_size                        4    2092679168
mfu_evictable_data              4    2015510528
mfu_evictable_metadata          4    18325504
mfu_ghost_size                  4    1926894080
mfu_ghost_evictable_data        4    1571565568
mfu_ghost_evictable_metadata    4    355328512
l2_hits                         4    10246487
l2_misses                       4    54510480
l2_feeds                        4    1186238
l2_rw_clash                     4    45
l2_read_bytes                   4    131037714432
l2_write_bytes                  4    628141021184
l2_writes_sent                  4    834651
l2_writes_done                  4    834651
l2_writes_error                 4    0
l2_writes_lock_retry            4    654
l2_evict_lock_retry             4    244
l2_evict_reading                4    39
l2_evict_l1cached               4    1608607
l2_free_on_write                4    829258
l2_cdata_free_on_write          4    3708
l2_abort_lowmem                 4    29
l2_cksum_bad                    4    0
l2_io_error                     4    0
l2_size                         4    39083837952
l2_asize                        4    35688606720
l2_hdr_size                     4    211494712
l2_compress_successes           4    26986228
l2_compress_zeros               4    0
l2_compress_failures            4    31411278
memory_throttle_count           4    0
duplicate_buffers               4    0
duplicate_buffers_size          4    0
duplicate_reads                 4    3675
memory_direct_count             4    67864
memory_indirect_count           4    542404
arc_no_grow                     4    0
arc_tempreserve                 4    0
arc_loaned_bytes                4    0
arc_prune                       4    0
arc_meta_used                   4    1464850984
arc_meta_limit                  4    4603709030
arc_meta_max                    4    2473087208
arc_meta_min                    4    16777216
arc_need_free                   4    0
arc_sys_free                    4    130256896

The zfs parameters are as follows:

for i in *; do echo $i $(cat $i); done

l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_nocompress 0
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_aliquot 524288
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslabs_per_vdev 200
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_verify_data 1
spa_load_verify_maxinflight 10000
spa_load_verify_metadata 1
spa_slop_shift 5
zfetch_array_rd_sz 1048576
zfetch_block_cap 256
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_admin_snapshot 0
zfs_arc_average_blocksize 8192
zfs_arc_grow_retry 0
zfs_arc_lotsfree_percent 0
zfs_arc_max 5115232256
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 4603709030
zfs_arc_meta_min 0
zfs_arc_meta_prune 10000
zfs_arc_meta_strategy 1
zfs_arc_min 2557616128
zfs_arc_min_prefetch_lifespan 0
zfs_arc_num_sublists_per_state 4
zfs_arc_p_aggressive_disable 1
zfs_arc_p_dampener_disable 1
zfs_arc_p_min_shift 0
zfs_arc_shrink_shift 0
zfs_arc_sys_free 0
zfs_autoimport_disable 1
zfs_dbgmsg_enable 1
zfs_dbgmsg_maxsize 4194304
zfs_dbuf_state_index 0
zfs_deadman_enabled 1
zfs_deadman_synctime_ms 1000000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_dirty_data_max 833645772
zfs_dirty_data_max_max 2084114432
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync 67108864
zfs_disable_dup_eviction 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_leak_on_eio 0
zfs_free_max_blocks 100000
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_max_recordsize 1048576
zfs_mdcomp_disable 0
zfs_metaslab_fragmentation_threshold 70
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_object_mutex_size 64
zfs_pd_bytes_max 52428800
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_recover 0
zfs_resilver_delay 2
zfs_resilver_min_time_ms 3000
zfs_scan_idle 50
zfs_scan_min_time_ms 1000
zfs_scrub_delay 4
zfs_send_corrupt_data 0
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_rewrite 2
zfs_top_maxinflight 32
zfs_txg_history 0
zfs_txg_timeout 5
zfs_vdev_aggregation_limit 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 1
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_max_active 1000
zfs_vdev_mirror_switch_us 10000
zfs_vdev_read_gap_limit 32768
zfs_vdev_scheduler noop
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 64
zil_replay_disable 0
zil_slog_limit 1048576
zio_delay_max 30000
zio_requeue_io_start_cut_in_line 1
zio_taskq_batch_pct 75
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_prefetch_bytes 131072
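
(A note on reproducing the dump above: the loop is presumably run from the ZFS module parameter directory, i.e. the standard sysfs path - an assumption, not something stated here:)

cd /sys/module/zfs/parameters
for i in *; do echo $i $(cat $i); done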

cat /proc/spl/kstat/zfs/dmu_tx

5 1 0x01 11 528 3239285416 1723566213675618
name                            type data
dmu_tx_assigned                 4    121905724
dmu_tx_delay                    4    0
dmu_tx_error                    4    0
dmu_tx_suspended                4    0
dmu_tx_group                    4    0
dmu_tx_memory_reserve           4    0
dmu_tx_memory_reclaim           4    0
dmu_tx_dirty_throttle           4    0
dmu_tx_dirty_delay              4    19266866
dmu_tx_dirty_over_max           4    82
dmu_tx_quota                    4    0
Note: the dmu_tx_dirty_delay counter increased during the initial database-creation write phase, as far as I could monitor, even though performance there was actually better than on 0.6.4.2.
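
(A minimal way to watch these counters move while a given benchmark phase runs - the kstat path is the standard one, the 2-second refresh is arbitrary:)

watch -d -n 2 cat /proc/spl/kstat/zfs/dmu_tx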

0.6.4.2
VM running zfs:
8GB, 4 vcpus

zpool status:
  pool: diskconvm
 state: ONLINE
  scan: none requested
config:

NAME                                    STATE     READ WRITE CKSUM
diskconvm                               ONLINE       0     0     0
  mirror-0                              ONLINE       0     0     0
    ata-MB2000GDUNV_Y597Y0GTF1BA-part6  ONLINE       0     0     0
    ata-MB2000GDUNV_Y597Y0GVF1BA-part6  ONLINE       0     0     0
logs
  6f3a40b4-173b-4aa9-90b1-17db58231105  ONLINE       0     0     0
cache
  891a8189-6020-4045-92c4-8a090b34495c  ONLINE       0     0     0


errors: No known data errors

SLOG is 5GB
L2-cache is 40GB

The zfs parameters are as follows:

for i in *; do echo $i $(cat $i); done

l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_nocompress 0
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslabs_per_vdev 200
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_verify_data 1
spa_load_verify_maxinflight 10000
spa_load_verify_metadata 1
zfetch_array_rd_sz 1048576
zfetch_block_cap 256
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_arc_average_blocksize 8192
zfs_arc_grow_retry 5
zfs_arc_max 12568722432
zfs_arc_memory_throttle_disable 1
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 11311850188
zfs_arc_meta_prune 10000
zfs_arc_min 6284361216
zfs_arc_min_prefetch_lifespan 250
zfs_arc_p_aggressive_disable 1
zfs_arc_p_dampener_disable 1
zfs_arc_shrink_shift 5
zfs_autoimport_disable 1
zfs_dbuf_state_index 0
zfs_deadman_enabled 1
zfs_deadman_synctime_ms 1000000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_dirty_data_max 1675829657
zfs_dirty_data_max_max 4189574144
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync 67108864
zfs_disable_dup_eviction 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_leak_on_eio 0
zfs_free_max_blocks 100000
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_mdcomp_disable 0
zfs_metaslab_fragmentation_threshold 70
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_pd_bytes_max 52428800
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_recover 0
zfs_resilver_delay 2
zfs_resilver_min_time_ms 3000
zfs_scan_idle 50
zfs_scan_min_time_ms 1000
zfs_scrub_delay 4
zfs_send_corrupt_data 0
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_rewrite 2
zfs_top_maxinflight 32
zfs_txg_history 0
zfs_txg_timeout 5
zfs_vdev_aggregation_limit 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 1
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_max_active 1000
zfs_vdev_mirror_switch_us 10000
zfs_vdev_read_gap_limit 32768
zfs_vdev_scheduler noop
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 64
zil_replay_disable 0
zil_slog_limit 1048576
zio_delay_max 30000
zio_requeue_io_start_cut_in_line 1
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_threads 32

Unfortunately, the machine was redeployed so I lost this information; I'll rerun the benchmark and post again when the details below are available:

cat /proc/spl/kstat/zfs/dmu_tx

arcstats

cat /proc/spl/kstat/zfs/arcstats

@kernelOfTruth
Contributor

kernelOfTruth commented Jul 26, 2016

referencing: #4512 zfs 0.6.5 and write performance problems

@koplover please take a look at that issue; there are a bunch of patches and fixes suggested there that seem to mitigate this issue

@koplover
Author

I've also demonstrated this on a separate system while trying to backport performance commits from the 0.6.5 stream (taking the merge created by ryao in issue #3888, which merged back 4 commits).

The two deployments are pretty much the same, outlined below (a different, earlier version of our software / Xen etc. than in the previous post - no SSDs in these machines):

Deployment A (ZFS 0.6.4.2)

Xen disk driver domain: 4GB, 4 vcpus

zpool status

pool: diskconvm
state: ONLINE
scan: none requested
config:

NAME                                STATE     READ WRITE CKSUM
diskconvm                           ONLINE       0     0     0
  mirror-0                          ONLINE       0     0     0
    ata-MB2000GCWDA_Z1X27P5H-part4  ONLINE       0     0     0
    ata-MB2000GCWLT_P6JZY54X-part4  ONLINE       0     0     0

These are 2 x 2TB disks

arcstats
##cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 2736915471 94176989998384
name type data
hits 4 125969307
misses 4 59757241
demand_data_hits 4 67405578
demand_data_misses 4 16535798
demand_metadata_hits 4 45380523
demand_metadata_misses 4 2987791
prefetch_data_hits 4 13083695
prefetch_data_misses 4 40184049
prefetch_metadata_hits 4 99511
prefetch_metadata_misses 4 49603
mru_hits 4 58985004
mru_ghost_hits 4 3734620
mfu_hits 4 54131981
mfu_ghost_hits 4 3101725
deleted 4 76900020
recycle_miss 4 911849
mutex_miss 4 722624
evict_skip 4 279000235
evict_l2_cached 4 0
evict_l2_eligible 4 414932482048
evict_l2_ineligible 4 60780019712
hash_elements 4 646719
hash_elements_max 4 982824
hash_collisions 4 69973937
hash_chains 4 183989
hash_chain_max 4 12
p 4 1758900736
c 4 2515021824
c_min 4 1257510912
c_max 4 2515021824
size 4 2513804216
hdr_size 4 238695576
data_size 4 1367511040
meta_size 4 801668096
other_size 4 105929504
anon_size 4 2146304
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 1756189696
mru_evict_data 4 1348640768
mru_evict_metadata 4 283821056
mru_ghost_size 4 758035456
mru_ghost_evict_data 4 725659648
mru_ghost_evict_metadata 4 32375808
mfu_size 4 410843136
mfu_evict_data 4 17805312
mfu_evict_metadata 4 365096960
mfu_ghost_size 4 1134907392
mfu_ghost_evict_data 4 114790400
mfu_ghost_evict_metadata 4 1020116992
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 150036
memory_direct_count 4 43683
memory_indirect_count 4 134449
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 1146293176
arc_meta_limit 4 2263519641
arc_meta_max 4 1529330560

##cat /proc/spl/kstat/zfs/dmu_tx
4 1 0x01 11 528 2734945681 94222409432733
name type data
dmu_tx_assigned 4 7721150
dmu_tx_delay 4 0
dmu_tx_error 4 0
dmu_tx_suspended 4 0
dmu_tx_group 4 13371
dmu_tx_memory_reserve 4 0
dmu_tx_memory_reclaim 4 0
dmu_tx_dirty_throttle 4 0
dmu_tx_dirty_delay 4 87733
dmu_tx_dirty_over_max 4 46
dmu_tx_quota 4 0

##for i in *; do echo $i $(cat $i); done
l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_nocompress 0
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslabs_per_vdev 200
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_verify_data 1
spa_load_verify_maxinflight 10000
spa_load_verify_metadata 1
zfetch_array_rd_sz 1048576
zfetch_block_cap 256
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_arc_average_blocksize 8192
zfs_arc_grow_retry 5
zfs_arc_max 2515021824
zfs_arc_memory_throttle_disable 1
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 2263519641
zfs_arc_meta_prune 10000
zfs_arc_min 1257510912
zfs_arc_min_prefetch_lifespan 250
zfs_arc_p_aggressive_disable 1
zfs_arc_p_dampener_disable 1
zfs_arc_shrink_shift 5
zfs_autoimport_disable 1
zfs_dbuf_state_index 0
zfs_deadman_enabled 1
zfs_deadman_synctime_ms 1000000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_dirty_data_max 412563456
zfs_dirty_data_max_max 1031408640
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync 67108864
zfs_disable_dup_eviction 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_leak_on_eio 0
zfs_free_max_blocks 100000
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_mdcomp_disable 0
zfs_metaslab_fragmentation_threshold 70
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_pd_bytes_max 52428800
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_recover 0
zfs_resilver_delay 2
zfs_resilver_min_time_ms 3000
zfs_scan_idle 50
zfs_scan_min_time_ms 1000
zfs_scrub_delay 4
zfs_send_corrupt_data 0
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_rewrite 2
zfs_top_maxinflight 32
zfs_txg_history 0
zfs_txg_timeout 5
zfs_vdev_aggregation_limit 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 1
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_max_active 1000
zfs_vdev_mirror_switch_us 10000
zfs_vdev_read_gap_limit 32768
zfs_vdev_scheduler noop
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 64
zil_replay_disable 0
zil_slog_limit 1048576
zio_delay_max 30000
zio_requeue_io_start_cut_in_line 1
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384
zvol_threads 32

Deployment B (ZFS 0.6.4.2, with backport)

disk driver domain: 4GB, 4vcpus

zpool status
pool: diskconvm
state: ONLINE
scan: none requested
config:

NAME                                STATE     READ WRITE CKSUM
diskconvm                           ONLINE       0     0     0
  mirror-0                          ONLINE       0     0     0
    ata-MB8000GEQUU_2EG4XXHR-part4  ONLINE       0     0     0
    ata-MB8000GEQUU_2EG4Y6ZR-part4  ONLINE       0     0     0

errors: No known data errors

These are 2 x 8TB disks

arcstats
##cat /proc/spl/kstat/zfs/arcstats
5 1 0x01 86 4128 2554722947 93845612144867
name type data
hits 4 19503108
misses 4 6148845
demand_data_hits 4 13611702
demand_data_misses 4 1486205
demand_metadata_hits 4 4421491
demand_metadata_misses 4 2544201
prefetch_data_hits 4 1467022
prefetch_data_misses 4 2118405
prefetch_metadata_hits 4 2893
prefetch_metadata_misses 4 34
mru_hits 4 7104500
mru_ghost_hits 4 125406
mfu_hits 4 10968276
mfu_ghost_hits 4 317014
deleted 4 4335955
recycle_miss 4 177467
mutex_miss 4 177
evict_skip 4 1046032
evict_l2_cached 4 0
evict_l2_eligible 4 129262052864
evict_l2_ineligible 4 5302394880
hash_elements 4 341749
hash_elements_max 4 876428
hash_collisions 4 10784965
hash_chains 4 73332
hash_chain_max 4 11
p 4 1577656320
c 4 1965079992
c_min 4 1257510912
c_max 4 2515021824
size 4 1964826328
hdr_size 4 136406040
data_size 4 1137586176
meta_size 4 605348864
other_size 4 85485248
anon_size 4 851968
anon_evict_data 4 0
anon_evict_metadata 4 0
mru_size 4 1569953280
mru_evict_data 4 1088741376
mru_evict_metadata 4 373411328
mru_ghost_size 4 353947648
mru_ghost_evict_data 4 353947648
mru_ghost_evict_metadata 4 0
mfu_size 4 172129792
mfu_evict_data 4 48402432
mfu_evict_metadata 4 108338688
mfu_ghost_size 4 1009916416
mfu_ghost_evict_data 4 864636928
mfu_ghost_evict_metadata 4 145279488
l2_hits 4 0
l2_misses 4 0
l2_feeds 4 0
l2_rw_clash 4 0
l2_read_bytes 4 0
l2_write_bytes 4 0
l2_writes_sent 4 0
l2_writes_done 4 0
l2_writes_error 4 0
l2_writes_hdr_miss 4 0
l2_evict_lock_retry 4 0
l2_evict_reading 4 0
l2_free_on_write 4 0
l2_cdata_free_on_write 4 0
l2_abort_lowmem 4 0
l2_cksum_bad 4 0
l2_io_error 4 0
l2_size 4 0
l2_asize 4 0
l2_hdr_size 4 0
l2_compress_successes 4 0
l2_compress_zeros 4 0
l2_compress_failures 4 0
memory_throttle_count 4 0
duplicate_buffers 4 0
duplicate_buffers_size 4 0
duplicate_reads 4 6638
memory_direct_count 4 6884
memory_indirect_count 4 272
arc_no_grow 4 0
arc_tempreserve 4 0
arc_loaned_bytes 4 0
arc_prune 4 0
arc_meta_used 4 827240152
arc_meta_limit 4 2263519641
arc_meta_max 4 1098895272

cat /proc/spl/kstat/zfs/dmu_tx

4 1 0x01 11 528 2552776171 93956863414260
name type data
dmu_tx_assigned 4 3965411
dmu_tx_delay 4 0
dmu_tx_error 4 0
dmu_tx_suspended 4 0
dmu_tx_group 4 0
dmu_tx_memory_reserve 4 0
dmu_tx_memory_reclaim 4 0
dmu_tx_dirty_throttle 4 0
dmu_tx_dirty_delay 4 304682
dmu_tx_dirty_over_max 4 0
dmu_tx_quota 4 0

for i in *; do echo $i $(cat $i); done

l2arc_feed_again 1
l2arc_feed_min_ms 200
l2arc_feed_secs 1
l2arc_headroom 2
l2arc_headroom_boost 200
l2arc_nocompress 0
l2arc_noprefetch 1
l2arc_norw 0
l2arc_write_boost 8388608
l2arc_write_max 8388608
metaslab_bias_enabled 1
metaslab_debug_load 0
metaslab_debug_unload 0
metaslab_fragmentation_factor_enabled 1
metaslab_lba_weighting_enabled 1
metaslab_preload_enabled 1
metaslabs_per_vdev 200
spa_asize_inflation 24
spa_config_path /etc/zfs/zpool.cache
spa_load_verify_data 1
spa_load_verify_maxinflight 10000
spa_load_verify_metadata 1
zfetch_array_rd_sz 1048576
zfetch_block_cap 256
zfetch_max_streams 8
zfetch_min_sec_reap 2
zfs_arc_average_blocksize 8192
zfs_arc_grow_retry 5
zfs_arc_max 2515021824
zfs_arc_memory_throttle_disable 1
zfs_arc_meta_adjust_restarts 4096
zfs_arc_meta_limit 2263519641
zfs_arc_meta_prune 10000
zfs_arc_min 1257510912
zfs_arc_min_prefetch_lifespan 250
zfs_arc_p_aggressive_disable 1
zfs_arc_p_dampener_disable 1
zfs_arc_shrink_shift 5
zfs_autoimport_disable 1
zfs_dbuf_state_index 0
zfs_deadman_enabled 1
zfs_deadman_synctime_ms 1000000
zfs_dedup_prefetch 0
zfs_delay_min_dirty_percent 60
zfs_delay_scale 500000
zfs_dirty_data_max 412563456
zfs_dirty_data_max_max 1031408640
zfs_dirty_data_max_max_percent 25
zfs_dirty_data_max_percent 10
zfs_dirty_data_sync 67108864
zfs_disable_dup_eviction 0
zfs_expire_snapshot 300
zfs_flags 0
zfs_free_leak_on_eio 0
zfs_free_max_blocks 100000
zfs_free_min_time_ms 1000
zfs_immediate_write_sz 32768
zfs_mdcomp_disable 0
zfs_metaslab_fragmentation_threshold 70
zfs_mg_fragmentation_threshold 85
zfs_mg_noalloc_threshold 0
zfs_nocacheflush 0
zfs_nopwrite_enabled 1
zfs_no_scrub_io 0
zfs_no_scrub_prefetch 0
zfs_pd_bytes_max 52428800
zfs_prefetch_disable 0
zfs_read_chunk_size 1048576
zfs_read_history 0
zfs_read_history_hits 0
zfs_recover 0
zfs_resilver_delay 2
zfs_resilver_min_time_ms 3000
zfs_scan_idle 50
zfs_scan_min_time_ms 1000
zfs_scrub_delay 4
zfs_send_corrupt_data 0
zfs_sync_pass_deferred_free 2
zfs_sync_pass_dont_compress 5
zfs_sync_pass_rewrite 2
zfs_top_maxinflight 32
zfs_txg_history 0
zfs_txg_timeout 5
zfs_vdev_aggregation_limit 131072
zfs_vdev_async_read_max_active 3
zfs_vdev_async_read_min_active 1
zfs_vdev_async_write_active_max_dirty_percent 60
zfs_vdev_async_write_active_min_dirty_percent 30
zfs_vdev_async_write_max_active 10
zfs_vdev_async_write_min_active 1
zfs_vdev_cache_bshift 16
zfs_vdev_cache_max 16384
zfs_vdev_cache_size 0
zfs_vdev_max_active 1000
zfs_vdev_mirror_switch_us 10000
zfs_vdev_read_gap_limit 32768
zfs_vdev_scheduler noop
zfs_vdev_scrub_max_active 2
zfs_vdev_scrub_min_active 1
zfs_vdev_sync_read_max_active 10
zfs_vdev_sync_read_min_active 10
zfs_vdev_sync_write_max_active 10
zfs_vdev_sync_write_min_active 10
zfs_vdev_write_gap_limit 4096
zfs_zevent_cols 80
zfs_zevent_console 0
zfs_zevent_len_max 64
zil_replay_disable 0
zil_slog_limit 1048576
zio_delay_max 30000
zio_requeue_io_start_cut_in_line 1
zvol_inhibit_dev 0
zvol_major 230
zvol_max_discard_blocks 16384

This followed the same general profile as the case posted above, this time comparing vanilla 0.6.4.2 against 0.6.4.2 with the backported commits, although read throughput was not unduly affected, presumably because overall performance was that little bit slower. In essence I saw:

Database monitoring phase (iostat important takeaways):

                    rKB/s           wKB/s    avgqu-sz
vanilla 0.6.4.2     ~8700           ~5600    ~3.8
0.6.4.2 backport    ~5800           ~3200    ~3.8

Verification phase:

                    rKB/s           avgqu-sz
vanilla 0.6.4.2     ~35000-43000    27
0.6.4.2 backport    ~39000-47000    0.95
So in this case the actual read throughput is not far off, with the 0.6.4.2 backport still edging ahead, presumably because the extra concurrency on the vanilla deployment could not be exploited enough for it to catch up.

The final results from the benchmark, for what they are worth were:
vanilla 0.6.4.2: Achieved Transactional I/O per Second 310.01
0.6.4.2 with back port: Achieved Transactional I/O per Second 182.968

The only differences between the deployments here were the backport of the 4 commits to zfs, and that the deployment with the backport was on 8TB disks while the other was on 2TB disks.

These are the same physical server model in this case - different from the two identical servers in the original post.

@koplover
Author

@kernelOfTruth I had looked through that issue a little while back (probably over a month ago now) and it seemed orthogonal: it deals with write performance, whereas my issue seems very much to relate to read performance.

However, giving it a quick look I see it has moved on since then, and there are references to the bio changes which, as my last post showed, also produce the slowdown when backported to 0.6.4.2.

I'll give it a proper read and try applying any patches to the 0.6.5.7 branch to see what happens to this issue.

@koplover
Author

@kernelOfTruth @dweeezil I've taken a detailed look through the #4512 issue thread - quite a catchup!

Anyway, I'm wondering if my issue is the same, as the symptoms I'm seeing are read- rather than write-centred, but it is the same sort of area: the reads in 0.6.5.7 cannot be coalesced because the queue size never gets above 1 in the read phase of the benchmark.

One participant in that thread traced the performance issue to #37f9dac - this would seem to align with my second comparison, where both deployments ran 0.6.4.2 but one had that commit backported - does this make sense?

For the first comparison (0.6.4.2 v 0.6.5.7) the kernels were:

For the second comparison (0.6.4.2 v 0.6.4.2 with backport) the kernels were:
0.6.4.2 zfs: Linux zdiskdd0000-0017-00-00 3.19.0-26-zdomu #2714.04.1 SMP Thu Nov 26 05:23:55 GMT 2015 x86_64 x86_64 x86_64 GNU/Linux
0.6.4.2 with backport: Linux zdiskdd0000-0012-00-00 3.19.0-26-zdomu #2714.04.1 SMP Thu Nov 26 05:23:55 GMT 2015 x86_64 x86_64 x86_64 GNU/Linux

Any pointers on specific commits to revert / patches to try would be appreciated - I'll try reverting the commit from the other thread, but I'm not expecting it to help here.

@koplover
Author

One more thing: all block devices in all scenarios above (all 4 systems tested) have:

cat /sys/block/sd[a-b]/queue/max_sectors_kb
32767
32767

cat /sys/block/sd[a-b]/queue/max_sectors_kb
1
1
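
(For comparing setups, a quick sketch to dump the same limit for both the physical disks and the zvols - assuming the standard sysfs layout:)

for d in /sys/block/sd[a-b] /sys/block/zd*; do
  echo "$d max_sectors_kb: $(cat $d/queue/max_sectors_kb)"
done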

@koplover
Author

I'm still trying to look into this issue but without much luck.

The second scenario above in particular - a few 0.6.5-stream commits backported onto the 0.6.4.2 version - is fascinating, as it seems to reproduce the performance drop-off we see on the 0.6.5 stream with only a small set of changes under the microscope.

If anybody wants any more details of the system / stats taken during tests, let me know - happy to do some further leg work on this

@ironMann
Contributor

ironMann commented Sep 12, 2016

Hi @koplover, this issue seems to be observed by multiple users and is related to using zvols in virtualization environments. It would be very useful if the benchmark could be boiled down to some minimal setup, reproducible across different systems and by different people. I don't know whether such a test setup already exists for ZFS. For example, Ceph has a tool called kvmrbdfio for benchmarking a block device under a kvm-like environment, which uses fio to produce VM-like load. It would be great if something similar could be used to track down this issue.
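
(As a starting point, a minimal sketch of a VM-like mixed load driven directly against a zvol with fio - the device path, block size, read/write mix and queue depth are assumptions rather than an agreed benchmark, and note that it writes to the zvol destructively:)

fio --name=zvol-vmlike --filename=/dev/zd0 --ioengine=libaio --direct=1 \
    --rw=randrw --rwmixread=60 --bs=32k --iodepth=8 --numjobs=4 \
    --runtime=300 --time_based --group_reporting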

@koplover
Author

koplover commented Oct 5, 2016

@dweeezil I intend to put my details of 0.7.0 rc1 in this issue rather than create a brand new one for essentially the same issue - let me know if it would help to create a new one for 0.7.0 RC1 as you requested in #4512 and I'll get that done

Anyway, in essence the situation seems exactly the same under 0.7.0 RC1 as for 0.6.5.x. While the benchmark is running, iostat still shows an average queue size < 1 on 0.7.0 RC1, implying the zvol cannot be fed quickly enough.

zpool iostat -r gives:

diskconvm    sync_read     sync_write    async_read    async_write    scrub
req_size     ind    agg    ind    agg    ind    agg    ind    agg     ind    agg
512            0      0      0      0      0      0      0      0       0      0
1K             0      0      0      0      0      0      0      0       0      0
2K             0      0      0      0      0      0      0      0       0      0
4K         1.12M      0  29.6K      0  23.9K      0   563K      0      10      0
8K         74.5K  8.54K  9.50K      0  33.8K  2.34K   155K   253K       0      0
16K            2  9.87K  1.49K      3      0  2.69K  1.35K  80.9K       0      0
32K            0  7.23K  1.18K      0      0  2.56K      0  47.4K       0      6
64K        1.11M  5.37K  10.1K      0   257K  3.71K  29.7K   364K       0      0
128K           0  3.29K  3.13K      1      0   185K      0  2.14M       0      0
256K           0      0      0      0      0      0      0      0       0      0
512K           0      0      0      0      0      0      0      0       0      0
1M             0      0      0      0      0      0      0      0       0      0
2M             0      0      0      0      0      0      0      0       0      0
4M             0      0      0      0      0      0      0      0       0      0
8M             0      0      0      0      0      0      0      0       0      0
16M            0      0      0      0      0      0      0      0       0      0

Unfortunately, blkparse is still not working in this RC, so I cannot provide a trace of the block actions on the zvol. However, this is an Exchange benchmark and the overlying VM is the same, so I've been trying to reproduce the general pattern with fio jobs, but it's proving tricky.

The general pattern of the disk requests in 0.6.4.2 that I can trace with blktrace is as follows (see the fio sketch after this list):

  • writes are all synchronous
  • reads and writes are interleaved (with 1 or 2 of each type before the next)
  • reads are completely random in logical disk block, always 32K requests
  • writes are of varying sizes, seemingly between 4K and 64K, in multiples of 4K
  • these writes are largely sequential, with a write to a 'random' logical block every few sequential writes (3-5 such writes)
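
To approximate this pattern outside the VM, a sketch fio job along these lines might serve (the zvol device, queue depths and runtime are assumptions, and it is only a rough stand-in for the JetStress load - it also writes destructively to the named zvol):

cat > jetstress-like.fio <<'EOF'
; rough stand-in for the traced pattern: random 32K reads interleaved with
; synchronous, mostly sequential writes of 4K-64K
[global]
ioengine=libaio
direct=1
filename=/dev/zd112
runtime=300
time_based
group_reporting

[random-32k-reads]
rw=randread
bs=32k
iodepth=2

[sync-seq-writes]
rw=write
bsrange=4k-64k
sync=1
iodepth=1
EOF
fio jetstress-like.fio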

The benchmark tool indicates an increase in the read / write latency as viewed from the VM whose database is running on the monitored zvol:
                                   0.6.4.2    0.7.0 RC1
database read avg latency (ms)       9.095       17.871
database write avg latency (ms)      1.693       30.186

I had heard that there was a patch to re-enable blktrace tracing for ZoL and was expecting it to be included in this release candidate - but it still shows only the Queue actions. Was I imagining this, as I cannot find a reference to that patch now?

I'll add the 0.6.4.2 blktrace in a gist later as that may be helpful, but the pattern is as above

@koplover
Author

koplover commented Oct 5, 2016

Gist of a small sample of the 0.6.4.2 blktrace output, captured while the benchmark was running, uploaded to:

https://gist.github.com/koplover/6af0f58220d9df0b63819eed2c9500ec

@dweeezil
Contributor

dweeezil commented Oct 5, 2016

@koplover What's your kernel version?

I've discovered a potential issue which appears to be exacerbated by the re-worked (bio based) zvol code on write-heavy workloads, particularly on systems with a lot of CPUs. There are a handful of variables involved but what seems to be happening is that without the old "zvol" taskq acting as a "buffering layer", a write-heavy workload can overload the write issue taskq to the point that the CPUs are spending most of their time contending for its tq_lock. In the harshest test case in which fio is running 32 sequential write threads [EDIT] not using direct IO to a zvol, the write issue taskq frequently has thousands of pending tasks but will occasionally have bursts in which there are hundreds of thousands of pending tasks!

Somewhat related, it's also a problem that the write issue taskq is dynamic because it increases the contention on tq_lock even further. I'll be posting a pull request for this minor issue soon. This may explain the better performance some people have reported when disabling dynamic taskqs altogether.
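
(For anyone wanting to test that, dynamic taskqs can be disabled via the SPL module option - a sketch, applied before the spl/zfs modules load:)

echo "options spl spl_taskq_thread_dynamic=0" > /etc/modprobe.d/spl.conf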

This issue largely goes away, however, with b58986e and a sufficiently new kernel with 16K stacks, but no released version of ZoL yet has that patch.

Your issue (@koplover) may have nothing to do with all this; however, it would be interesting to run perf top when the system is under heavy write load and see how much time is being spent in raw_spin_lock and friends.
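
(A sketch of capturing that, assuming perf is available in the driver domain:)

perf top -g
# or record ~30 seconds during the write-heavy phase for later inspection
perf record -ag -- sleep 30
perf report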

As to the blktrace situation, you might be referring to 8198d18, which gives accounting (iostat, etc.) but I don't think it will impact blktrace at all. Only a small number of other standard block devices in the kernel use blk_queue_make_request() (various dm targets, for example) and they also seem to be a bit sparse on blktrace output, but I've been meaning to look into this further.

@koplover Skimming your information posted above, the change in write latency between 0.6.4.2 and 0.7.0 is rather striking but it begs the question as to whether the pool really is able to deliver sub 2ms latency or whether something may have actually been broken in 0.6.4.2. Have you tried the new zpool iostat latency options yet? In particular, zpool iostat -w and zpool iostat -l might be very useful (maybe with "-v" as well) to get a better view of things.

@koplover
Author

koplover commented Oct 5, 2016

A few additional iostat dumps for this 0.7.0 RC1 build

zpool iostat -w

diskconvm    total_wait     disk_wait    sync_queue    async_queue
latency      read  write   read  write   read  write   read  write  scrub
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----
1ns             0      0      0      0      0      0      0      0      0
3ns             0      0      0      0      0      0      0      0      0
7ns             0      0      0      0      0      0      0      0      0
15ns            0      0      0      0      0      0      0      0      0
31ns            0      0      0      0      0      0      0      0      0
63ns            0      0      0      0      0      0      0      0      0
127ns           0      0      0      0      0      0      0      0      0
255ns           0      0      0      0      0      0      0      0      0
511ns           0      0      0      0      0      0      0      0      0
1us             0      0      0      0   109K  11.2K  7.98K    612      0
2us             0      0      0      0  2.04M  57.8K   167K  32.4K      4
4us             0      0      0      0   552K  44.0K  56.1K  23.5K      0
8us             0      0      0      0  28.1K  2.69K  5.52K  1.90K      0
16us            0      0      0      0  6.30K    149  1.60K  1.78K      0
32us            0  13.8K      0  16.8K  4.51K     74  1.27K  3.24K      0
65us        8.11K  39.1K  9.74K  45.3K  1.90K     30  1.36K  9.53K      0
131us        199K  35.7K   207K   181K  2.01K     54  2.22K  27.1K      0
262us       1.07M  49.0K  1.09M   392K  6.26K     66  4.79K  39.3K      1
524us        304K  62.1K   317K  1.32M  6.33K    176  8.00K  49.9K      0
1ms          193K  93.0K   199K  1.18M  5.33K    255  17.2K  70.9K      0
2ms         73.0K   142K   136K   309K  5.31K    476  7.16K   129K      0
4ms          130K   209K   305K   266K  10.3K    844  10.5K   189K      0
8ms          453K   242K   535K  1.14M  10.2K  1.33K  26.6K   223K      4
16ms         561K   220K   620K   244K  6.03K  1.99K  65.4K   203K      2
33ms         214K   208K   158K  14.5K  2.19K  2.85K   125K   198K      5
67ms         197K   283K  54.9K  19.6K  1.15K  5.79K   148K   275K      0
134ms        165K   481K  21.7K  9.86K  1.00K  9.95K   126K   469K      0
268ms       64.6K   656K  11.8K  1.80K    452  11.6K  41.4K   642K      0
536ms       22.3K   595K  5.14K    530    118  2.75K  15.6K   590K      0
1s          10.6K   542K  3.69K    596    127      0  6.26K   541K      0
2s          2.51K   673K    148    126      1      0  2.10K   672K      0
4s            760   646K      0      0      0      0    741   643K      0
8s            197  39.1K      0      0      0      0    187  38.3K      0
17s             0      0      0      0      0      0      0      0      0
34s             0      0      0      0      0      0      0      0      0
68s             0      0      0      0      0      0      0      0      0
137s            0      0      0      0      0      0      0      0      0
-------------------------------------------------------------------------

So some very high latencies here! There are other 'background' operations going on, which I'll try to turn off, but the majority of the writes in this period are from the test.

zpool iostat -l 2 gives:

              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
diskconvm    163G  1.66T    179    259  8.27M  18.3M    1us   37us  377ns  148ns   10ns    1us    2us   38us  517ns
diskconvm    163G  1.66T    113      2  6.98M  58.0K    4ms   26us    4ms   26us  848ns    1us  800ns      -      -
diskconvm    163G  1.66T     74    686  4.57M  21.5M    6ms  218ms    6ms  579us  843ns  929ns   10us  221ms      -
diskconvm    163G  1.66T     27    655  1.66M  18.2M   37ms  305ms   37ms  841us    1us    1us    1ms  308ms      -
diskconvm    163G  1.66T    125      9  7.70M   418K    4ms   46us    4ms   43us    1us    1us  103us      -      -
diskconvm    163G  1.66T     19    649  1.16M  18.7M   47ms  227ms   47ms  635us    1us    1us    1us  228ms      -
diskconvm    163G  1.66T     80    426  4.91M  8.12M   14ms  313ms   12ms    1ms  896ns  826ns    4ms  321ms      -
diskconvm    163G  1.66T     80    622  4.97M  15.0M    4ms  119ms    4ms  484us  849ns    1us   10us  120ms      -
diskconvm    163G  1.66T     17    243  1.10M  9.40M   64ms  150ms   64ms    3ms    1us  802ns  825ns  151ms      -
diskconvm    163G  1.66T     90      3  5.63M  22.0K    4ms   24us    4ms   22us    1us  986ns  872ns      -      -
diskconvm    163G  1.66T     26    764  1.43M  21.6M   31ms  107ms   30ms  926us    1us    1us  497us  108ms      -
diskconvm    163G  1.66T    105      8  6.48M  44.0K    6ms    1ms    6ms    1ms    1us    1us  113us      -      -
diskconvm    163G  1.66T     81    522  5.00M  13.4M    8ms  224ms    8ms  931us    1us  986ns  858ns  226ms      -
diskconvm    163G  1.66T     18    512  1004K  19.1M   38ms  624ms   38ms    1ms  907ns    1us   27us  651ms      -
diskconvm    163G  1.66T    114      7  7.04M   116K    5ms  267us    5ms  267us  922ns    1us    6us      -      -
diskconvm    163G  1.66T     69    341  4.37M  24.2M    7ms   46ms    7ms  558us  892ns    1us   28us   46ms      -

It's not really clear to me what these statistics mean, as they do not seem to be aggregated over the period given (2 seconds). If I run it again a little later with a smaller period (0.1) I see the output below, which seems a little strange given that the traffic is fairly constant:

              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
diskconvm    163G  1.66T    178    259  8.25M  18.2M    1us   37us  375ns  146ns   10ns    1us    2us   38us  513ns
diskconvm    163G  1.66T     29  1.93K  1.86M   240M  291ms     1s  291ms   10ms   15us      -      -     1s      -
diskconvm    163G  1.66T      9  2.48K   635K   316M     1s     1s     1s    8ms   15us      -      -     1s      -
diskconvm    163G  1.66T     59  2.61K  3.72M   197M  624ms  984ms  624ms    8ms   30us      -   22us  972ms      -
diskconvm    163G  1.66T      0    814      0  4.39M      -   58ms      -    2ms      -   22us      -   56ms      -
diskconvm    163G  1.66T      9      0   636K      0     1s      -     1s      -   30us      -      -      -      -
diskconvm    163G  1.66T     39     19  2.48M   119K     1s  488us  750ms  366us      -   22us  250ms      -      -
diskconvm    163G  1.66T      0     19      0   159K      -  609us      -  609us      -   22us      -      -      -
diskconvm    163G  1.66T      9     79   636K   318K     3s   34ms     3s   34ms   15us   17us      -      -      -
diskconvm    163G  1.66T    109      0  6.29M      0  160ms      -  160ms      -   21us      -   22us      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   82ms      -   82ms      -   27us      -   19us      -      -
diskconvm    163G  1.66T     69      0  4.35M      0   54ms      -   54ms      -   20us      -   15us      -      -
diskconvm    163G  1.66T     59      0  3.72M      0   58ms      -   58ms      -   22us      -   15us      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   80ms      -   80ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    109      0  6.83M      0  102ms      -  102ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    149      0  9.31M      0   77ms      -   77ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   75ms      -   75ms      -   18us      -   15us      -      -
diskconvm    163G  1.66T    128     19  8.05M  79.3K   98ms  487us   98ms  487us   15us   15us   15us      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   78ms      -   78ms      -   18us      -   15us      -      -
diskconvm    163G  1.66T    148      0  9.31M      0   96ms      -   96ms      -   15us      -   19us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   69ms      -   69ms      -   19us      -   15us      -      -
diskconvm    163G  1.66T     69      0  4.35M      0  116ms      -  116ms      -   26us      -      -      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   69ms      -   69ms      -   18us      -   15us      -      -
diskconvm    163G  1.66T     79      0  4.97M      0   67ms      -   67ms      -   20us      -   11us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   93ms      -   93ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    109     19  6.83M  79.5K   65ms  366us   64ms  366us   19us   22us    1ms      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   93ms      -   93ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.01M      0   45ms      -   44ms      -   15us      -    1ms      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   54ms      -   54ms      -   15us      -   19us      -      -
diskconvm    163G  1.66T    138      0  8.68M      0   75ms      -   75ms      -   16us      -   15us      -      -
diskconvm    163G  1.66T    149      0  9.31M      0   66ms      -   66ms      -   17us      -   15us      -      -
diskconvm    163G  1.66T    119     19  7.45M   675K  107ms  488us  107ms  366us   17us   15us   19us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0  102ms      -  102ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    119      0  6.91M      0  114ms      -  114ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    109      0  6.83M      0  103ms      -  103ms      -   17us      -   15us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   94ms      -   94ms      -   19us      -   13us      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   75ms      -   75ms      -   16us      -   15us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0  119ms      -  119ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   93ms      -   93ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T     69      0  3.26M      0   35ms      -   35ms      -   15us      -  338us      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   93ms      -   93ms      -   15us      -      -      -      -
diskconvm    163G  1.66T    148      0  9.31M      0   75ms      -   75ms      -   18us      -   15us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   68ms      -   68ms      -   19us      -   15us      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   78ms      -   78ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    129     19  8.07M   437K   70ms  732us   70ms  732us   16us   22us   15us      -      -
diskconvm    163G  1.66T    148      0  9.31M      0   76ms      -   76ms      -   16us      -   17us      -      -
diskconvm    163G  1.66T     49      0  3.10M      0  137ms      -  137ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   97ms      -   97ms      -   20us      -   15us      -      -
diskconvm    163G  1.66T    149      0  9.31M      0   89ms      -   89ms      -   15us      -   13us      -      -
diskconvm    163G  1.66T    138     19  8.10M   119K   91ms  365us   91ms  365us   16us   22us   15us      -      -
diskconvm    163G  1.66T      0  2.66K      0   278M      -  494ms      -    7ms      -      -      -  486ms      -
diskconvm    163G  1.66T      0  3.20K      0   309M      -     1s      -    6ms      -      -      -     1s      -
diskconvm    163G  1.66T      9   1013   636K   113M     1s     2s     1s   20ms   30us      -      -     2s      -
diskconvm    163G  1.66T      9  2.75K   636K   108M     1s  136ms     1s    3ms   15us      -      -  134ms      -
diskconvm    163G  1.66T      0      0      0      0      -      -      -      -      -      -      -      -      -
diskconvm    163G  1.66T      9     19   636K   119K     2s  732us     2s  488us   30us   22us      -      -      -
diskconvm    163G  1.66T     69     99  3.18M   437K     1s   11ms     1s   11ms   25us   19us   22us      -      -
diskconvm    163G  1.66T     69      0  4.35M      0  125ms      -  125ms      -   19us      -   20us      -      -
diskconvm    163G  1.66T     89     19  5.59M   119K   90ms  366us   90ms  366us   15us   15us   15us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   98ms      -   98ms      -   16us      -   15us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0  101ms      -  101ms      -   19us      -   13us      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   71ms      -   71ms      -   16us      -    1ms      -      -
diskconvm    163G  1.66T     79      0  4.96M      0  117ms      -  117ms      -   22us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   84ms      -   83ms      -   16us      -    1ms      -      -
diskconvm    163G  1.66T     49      0  3.10M      0  112ms      -  112ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    129     19  8.07M   119K   60ms  366us   60ms  244us   16us   22us   15us      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   71ms      -   71ms      -   17us      -   15us      -      -
diskconvm    163G  1.66T     79      0  4.97M      0  125ms      -  125ms      -   20us      -      -      -      -
diskconvm    163G  1.66T     79      9  4.38M  39.7K  132ms  488us  132ms  488us   17us   30us      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0  125ms      -  125ms      -   19us      -   22us      -      -
diskconvm    163G  1.66T     49     19  3.10M   119K   22ms  488us   22ms  366us   15us   15us   17us      -      -
diskconvm    163G  1.66T     99     19  6.21M   119K  112ms  488us  100ms  488us   22us   15us   31ms      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  124ms      -  124ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   54ms      -   54ms      -   19us      -    1ms      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   94ms      -   94ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   79ms      -   78ms      -   16us      -    1ms      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   96ms      -   96ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0  109ms      -  109ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    119     19  7.45M   715K  112ms  610us  112ms  366us   15us   22us   15us      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   85ms      -   85ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   81ms      -   81ms      -   15us      -   15us      -      -
diskconvm    163G  1.66T     39      0  2.48M      0   93ms      -   93ms      -   19us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0  122ms      -  122ms      -   23us      -   22us      -      -
diskconvm    163G  1.66T    139     19  8.11M   119K   88ms  366us   88ms  366us   21us   22us   15us      -      -
diskconvm    163G  1.66T     69     19  4.34M   119K   98ms  366us   98ms  366us   19us   30us      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   59ms      -   59ms      -   22us      -   19us      -      -
diskconvm    163G  1.66T    119      0  7.44M      0   77ms      -   77ms      -   24us      -   13us      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   74ms      -   74ms      -   25us      -   15us      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   93ms      -   93ms      -   22us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0  109ms      -  109ms      -   26us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.20M      0  106ms      -  106ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     79     19  4.97M   993K  109ms  610us  109ms  610us   19us   22us      -      -      -
diskconvm    163G  1.66T     59      0  3.73M      0   72ms      -   72ms      -   20us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0  106ms      -  106ms      -   24us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  107ms      -  107ms      -   23us      -      -      -      -
diskconvm    163G  1.66T     79     39  4.97M  3.10M   93ms    1ms   93ms    1ms   30us   22us      -      -      -
diskconvm    163G  1.66T      9      0   635K      0  124ms      -  124ms      -   30us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   90ms      -   90ms      -   25us      -      -      -      -
diskconvm    163G  1.66T     59      0  3.73M      0  104ms      -  104ms      -   20us      -      -      -      -
diskconvm    163G  1.66T     79    695  4.97M  64.6M   97ms   34ms   97ms    4ms   24us      -      -   30ms      -
diskconvm    163G  1.66T     19  1.38K  1.24M   174M  498ms  796ms  498ms   17ms   22us      -      -  774ms      -
diskconvm    163G  1.66T      9  3.60K   634K   446M  997ms     1s  997ms    4ms   30us      -      -     1s      -
diskconvm    163G  1.66T      0    456      0  55.2M      -     1s      -    3ms      -      -      -     1s      -
diskconvm    163G  1.66T     19      0  1.24M      0     1s      -     1s      -   22us      -      -      -      -
diskconvm    163G  1.66T      0  2.98K      0   105M      -  149ms      -    9ms      -   22us      -  138ms      -
diskconvm    163G  1.66T      0      0      0      0      -      -      -      -      -      -      -      -      -
diskconvm    163G  1.66T      9     19   636K   119K     1s  488us     1s  488us   30us   22us      -      -      -
diskconvm    163G  1.66T     19     79  1.24M   318K  562ms   18ms  562ms   18ms   30us   20us      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  146ms      -  146ms      -   23us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  107ms      -  107ms      -   28us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.19M      0   90ms      -   90ms      -   24us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  111ms      -  111ms      -   25us      -      -      -      -
diskconvm    163G  1.66T    118      0  7.42M      0   96ms      -   96ms      -   25us      -      -      -      -
diskconvm    163G  1.66T     69     19  4.35M   119K   89ms  488us   89ms  488us   21us   30us      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0  101ms      -  101ms      -   24us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   86ms      -   86ms      -   27us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   80ms      -   80ms      -   26us      -      -      -      -
diskconvm    163G  1.66T     59      0  3.72M      0   73ms      -   73ms      -   27us      -   15us      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   93ms      -   93ms      -   21us      -      -      -      -
diskconvm    163G  1.66T    108      0  6.81M      0   73ms      -   73ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0  112ms      -  112ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   85ms      -   85ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     59      0  3.73M      0  145ms      -  145ms      -   15us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   93ms      -   93ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   97ms      -   97ms      -   18us      -      -      -      -
diskconvm    163G  1.66T     19     19  1.24M   119K  125ms  366us  125ms  366us   15us   15us      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   90ms      -   90ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   93ms      -   93ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     69      0  4.35M      0   89ms      -   89ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     49      0  3.10M      0  112ms      -  112ms      -   24us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   83ms      -   83ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   93ms      -   93ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0  106ms      -  106ms      -   22us      -      -      -      -
diskconvm    163G  1.66T     39    149  2.48M  4.11M   78ms  520us   78ms  504us   15us   16us      -      -      -
diskconvm    163G  1.66T     39    317  2.48M  9.31M  109ms  961us  109ms  946us   15us   14us      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   90ms      -   90ms      -   20us      -      -      -      -
diskconvm    163G  1.66T    158      0  9.94M      0   72ms      -   72ms      -   16us      -      -      -      -
diskconvm    163G  1.66T    148      9  8.15M  79.5K   93ms  488us   93ms  488us   16us   15us      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0   97ms      -   97ms      -   17us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   74ms      -   74ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0   78ms      -   78ms      -   17us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.42M      0   55ms      -   55ms      -   19us      -      -      -      -
diskconvm    163G  1.66T    119     39  7.45M   278K   81ms  427us   81ms  366us   21us   22us      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0  124ms      -  124ms      -   20us      -      -      -      -
diskconvm    163G  1.66T     89     19  5.59M   159K   90ms  366us   90ms  366us   16us   15us      -      -      -
diskconvm    163G  1.66T    138      0  8.69M      0   69ms      -   69ms      -   15us      -      -      -      -
diskconvm    163G  1.66T    149     19  9.31M   159K   67ms  488us   67ms  488us   15us   22us      -      -      -
diskconvm    163G  1.66T     99     39  5.63M   596K   99ms  488us   99ms  366us   16us   19us      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   63ms      -   63ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     49  1.79K  3.10M   186M  106ms  307ms  106ms    7ms   18us   19us      -  306ms      -
diskconvm    163G  1.66T      0  3.18K      0   308M      -     1s      -    6ms      -      -      -     1s      -
diskconvm    163G  1.66T     19  1.86K  1.24M   194M     1s     1s     1s    8ms   22us      -      -     1s      -
diskconvm    163G  1.66T     19  1.75K  1.24M   117M  749ms  500ms  749ms    9ms   15us      -      -  488ms      -
diskconvm    163G  1.66T      0    645      0  4.46M      -   78ms      -    2ms      -      -      -   75ms      -
diskconvm    163G  1.66T      9      9   636K  79.5K     1s  976us     1s  976us   15us   30us      -      -      -
diskconvm    163G  1.66T     79    109  3.80M   516K  531ms    6ms  531ms    6ms   22us   12us      -      -      -
diskconvm    163G  1.66T     49      0  3.10M      0   51ms      -   51ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     29     19  1.86M   159K  166ms  366us  166ms  366us   15us   15us      -      -      -
diskconvm    163G  1.66T     69      0  4.35M      0   55ms      -   55ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     49      0  3.10M      0   93ms      -   93ms      -   15us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   68ms      -   68ms      -   16us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   77ms      -   77ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0  106ms      -  106ms      -   22us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.58M      0   97ms      -   97ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0  112ms      -  112ms      -   21us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   94ms      -   94ms      -   17us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   77ms      -   77ms      -   18us      -      -      -      -
diskconvm    163G  1.66T     89      9  5.59M  39.7K   97ms  488us   97ms  488us   18us   30us      -      -      -
diskconvm    163G  1.66T     69      0  4.33M      0   93ms      -   93ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  101ms      -  101ms      -   16us      -      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   96ms      -   96ms      -   22us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.96M      0  120ms      -  120ms      -   22us      -      -      -      -
diskconvm    163G  1.66T     59      0  3.73M      0  104ms      -  104ms      -   20us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   91ms      -   91ms      -   17us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   70ms      -   70ms      -   17us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   99ms      -   99ms      -   15us      -      -      -      -
diskconvm    163G  1.66T    148      0  9.31M      0   75ms      -   75ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0   55ms      -   55ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     59      0  3.73M      0   94ms      -   94ms      -   15us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   58ms      -   58ms      -   16us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   63ms      -   63ms      -   15us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   95ms      -   95ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   97ms      -   97ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     59      0  3.73M      0   83ms      -   83ms      -   20us      -      -      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   91ms      -   91ms      -   18us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  104ms      -  104ms      -   20us      -      -      -      -
diskconvm    163G  1.66T    148      0  9.31M      0   65ms      -   65ms      -   16us      -      -      -      -
diskconvm    163G  1.66T     69      0  4.35M      0   98ms      -   98ms      -   17us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   93ms      -   93ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   85ms      -   85ms      -   14us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0  109ms      -  109ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     99     19  5.63M   159K  112ms  488us  112ms  488us   16us   22us      -      -      -
diskconvm    163G  1.66T     69      0  4.35M      0   93ms      -   93ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     89     19  5.59M  79.5K   70ms  366us   70ms  366us   18us   22us      -      -      -
diskconvm    163G  1.66T    109     39  6.83M   199K   82ms  427us   82ms  427us   16us   19us      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   88ms      -   88ms      -   16us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   76ms      -   76ms      -   16us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.44M      0   73ms      -   73ms      -   17us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   58ms      -   58ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     29  1.25K  1.86M   153M  270ms  517ms  270ms    9ms   25us      -      -  511ms      -
diskconvm    163G  1.66T     19  3.42K  1.24M   401M  624ms     1s  624ms    7ms   22us   15us      -     1s      -
diskconvm    163G  1.66T      9    814   636K  68.1M     1s  254ms     1s    3ms   15us      -      -  252ms      -
diskconvm    163G  1.66T     19    715  1.24M  29.8M     1s  909ms     1s   31ms   30us      -      -  879ms      -
diskconvm    163G  1.66T      9  1.03K   636K  5.78M     2s   67ms     2s    2ms   15us   19us      -   68ms      -
diskconvm    163G  1.66T      0     19      0   159K      -  366us      -  366us      -   22us      -      -      -
diskconvm    163G  1.66T     39     99  2.48M   437K  640ms    9ms  640ms    9ms   19us   19us      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   86ms      -   86ms      -   18us      -      -      -      -
diskconvm    163G  1.66T     29     19  1.86M   159K  104ms  366us  104ms  366us   20us   22us      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0   93ms      -   93ms      -   17us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0   78ms      -   78ms      -   26us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  121ms      -  121ms      -   23us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.97M      0  109ms      -  109ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   88ms      -   88ms      -   21us      -      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   79ms      -   79ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0   88ms      -   88ms      -   18us      -      -      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   65ms      -   65ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   77ms      -   77ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   81ms      -   81ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  111ms      -  111ms      -   20us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   90ms      -   90ms      -   23us      -      -      -      -
diskconvm    163G  1.66T    139      0  8.69M      0   71ms      -   71ms      -   25us      -      -      -      -
diskconvm    163G  1.66T     29      0  1.86M      0   93ms      -   93ms      -   20us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   77ms      -   77ms      -   25us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   84ms      -   84ms      -   25us      -      -      -      -
diskconvm    163G  1.66T     39     19  1.90M   119K  109ms  486us  109ms  486us   30us   30us      -      -      -
diskconvm    163G  1.66T      9     19   636K   119K   31ms  125ms   31ms  125ms   15us   45us      -      -      -
diskconvm    163G  1.66T     49      0  2.52M      0   90ms      -   90ms      -   21us      -      -      -      -
diskconvm    163G  1.66T     59     19  3.71M   872K   72ms    2ms   72ms    2ms   22us   22us      -      -      -
diskconvm    163G  1.66T     49      0  3.10M      0   75ms      -   75ms      -   18us      -      -      -      -
diskconvm    163G  1.66T     29      0  1.86M      0   93ms      -   93ms      -   15us      -      -      -      -
diskconvm    163G  1.66T     49      0  3.10M      0   87ms      -   87ms      -   24us      -      -      -      -
diskconvm    163G  1.66T     99      0  6.21M      0   93ms      -   93ms      -   22us      -      -      -      -
diskconvm    163G  1.66T    108      0  6.81M      0   60ms      -   60ms      -   17us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0   75ms      -   75ms      -   20us      -      -      -      -
diskconvm    163G  1.66T    118      0  7.42M      0   75ms      -   75ms      -   19us      -      -      -      -
diskconvm    163G  1.66T     79      0  4.96M      0  109ms      -  109ms      -   24us      -      -      -      -
diskconvm    163G  1.66T    109      0  6.83M      0  107ms      -  107ms      -   47us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   86ms      -   86ms      -   27us      -      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0  100ms      -  100ms      -   27us      -      -      -      -
diskconvm    163G  1.66T    109      0  6.82M      0   82ms      -   82ms      -   24us      -      -      -      -
diskconvm    163G  1.66T    119      0  7.45M      0  109ms      -  109ms      -   22us      -      -      -      -
diskconvm    163G  1.66T     69     19  4.33M   119K   58ms  486us   58ms  486us   22us   30us      -      -      -
diskconvm    163G  1.66T     89      0  5.59M      0   97ms      -   97ms      -   33us      -      -      -      -
diskconvm    163G  1.66T    129      0  8.07M      0   87ms      -   87ms      -   21us      -      -      -      -
diskconvm    163G  1.66T    108      0  6.81M      0   72ms      -   72ms      -   24us      -      -      -      -
diskconvm    163G  1.66T     89     19  5.57M   119K   32ms  486us   32ms  364us   21us   15us      -      -      -
diskconvm    163G  1.66T     98     19  6.19M   198K   78ms  364us   78ms  364us   25us   15us      -      -      -
diskconvm    163G  1.66T     79      0  4.37M      0  155ms      -  155ms      -   28us      -      -      -      -
diskconvm    163G  1.66T     59    695  3.72M  78.7M   95ms  183ms   95ms    7ms   25us   30us      -  176ms      -
diskconvm    163G  1.66T     19  1.36K  1.24M   147M  498ms  782ms  498ms   12ms   22us   22us      -  786ms      -
diskconvm    163G  1.66T      9  2.14K   635K   220M  999ms     1s  999ms    9ms   30us      -      -     1s      -
diskconvm    163G  1.66T      0  1.15K      0   114M      -     3s      -   14ms      -      -      -     3s      -
diskconvm    163G  1.66T      0    218      0  21.0M      -     4s      -    6ms      -      -      -     4s      -
diskconvm    163G  1.66T     39  1.54K  2.48M   132M     1s     1s     1s   17ms   19us      -      -     1s      -
diskconvm    163G  1.66T      9  1.11K   636K  24.9M  999ms   45ms  999ms    2ms   30us      -      -   43ms      -
diskconvm    163G  1.66T      9     79   636K   318K     1s   34ms     1s   34ms   15us   12us      -      -      -
diskconvm    163G  1.66T     69      0  4.35M      0  339ms      -  339ms      -   30us      -      -      -      -
diskconvm    163G  1.66T     59      0  3.72M      0   25ms      -   25ms      -   17us      -      -      -      -

Looking at the queues (the new `zpool iostat -q 2` output):

              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
diskconvm    163G  1.66T     54    488  3.32M  35.4M      0      1      0      0      0      0  1.17K      1      0      0
diskconvm    163G  1.66T     50  1.07K  2.89M  46.1M      0      0      0      0      0      0     93      1      0      0
diskconvm    163G  1.66T     52    148  3.25M   787K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    125     12  7.64M   104K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    117      1  7.26M  12.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     69    802  4.26M  20.0M      0      0      0      0      0      2  1.38K      1      0      0
diskconvm    163G  1.66T     13    777   895K  33.5M      0      0      0      0      0      0    127      0      0      0
diskconvm    163G  1.66T     36    650  2.14M  13.3M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    122      3  7.68M  24.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    139     19  7.82M   164K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     68    653  4.19M  27.2M      0      0      0      0      0      2  1.03K      1      0      0
diskconvm    163G  1.66T     36    800  2.14M  19.1M      0      0      0      0      0      0    253      1      0      0
diskconvm    163G  1.66T     43    526  2.75M  8.37M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    113      1  7.06M  12.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    121      3  7.62M  32.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     60  1.23K  3.53M  24.5M      0      0      0      0      0      0  1.78K      1      0      0
diskconvm    163G  1.66T      9  1.48K   460K  34.5M      0      2      0      0      0      2    105      0      0      0
diskconvm    163G  1.66T     42    862  2.45M  8.53M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    117      7  7.26M  75.9K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    111      2  6.94M  91.9K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     52    831  3.31M  28.6M      0      0      0      0      0      0  2.06K      1      0      0
diskconvm    163G  1.66T     14  1.59K   959K  21.2M      0      1      0      0      0      0     55      0      0      0
diskconvm    163G  1.66T     26    700  1.63M  8.60M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    109      1  6.87M  16.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    108      5  6.75M   244K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     78  1.10K  4.87M  25.0M      0      0      0      0      0      0  1.67K      1      0      0
diskconvm    163G  1.66T     19  1.33K  1.19M  21.7M      0      0      0      0      0      1      0      0      0      0
diskconvm    163G  1.66T     38    776  2.50M  9.44M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    113      5  7.00M   372K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     92      1  5.70M  12.0K      0      0      0      0      0      1      0      0      0      0
diskconvm    163G  1.66T     59    772  3.63M  15.0M      0      1      0      0      0      0  1.99K      1      0      0
diskconvm    163G  1.66T     19  1.59K  1.01M  28.6M      0      0      0      0      0      0     69      0      0      0
diskconvm    163G  1.66T     14    757   899K  8.58M      0      0      0      0      0      0     34      0      0      0
diskconvm    163G  1.66T     92     34  5.81M   192K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    111      9  6.94M  91.9K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     61  1.10K  3.81M  29.0M      0      2      0      0      0      0  1.18K      1      0      0
diskconvm    163G  1.66T      7  1.39K   272K  12.8M      0      0      0      0      0      0  1.45K      1      0      0
diskconvm    163G  1.66T     29    520  1.76M  6.67M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     93      5  5.87M  36.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    118      0  7.38M      0      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     64    841  4.00M  26.8M      0      1      0      0      0      0    711      1      0      0
diskconvm    163G  1.66T      5  1.06K   324K  20.3M      0      1      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     58     16  3.39M  87.9K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     70      1  4.43M  12.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    122      3  7.63M  24.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     69  1.04K  4.37M  7.58M      0      0      0      0      0      0  1.95K      1      0      0
diskconvm    163G  1.66T     12  1.43K   652K  45.2M      0      1      0      0      0      0    155      1      0      0
diskconvm    163G  1.66T     33    303  2.01M  3.16M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     91      7  5.69M  52.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    111      0  6.88M  7.99K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     77    838  4.87M  22.5M      0      1      0      0      0      0  2.25K      1      0      0
diskconvm    163G  1.66T     11  1.52K   592K  26.2M      0      0      0      0      0      0    391      1      0      0
diskconvm    163G  1.66T      4    938   260K  12.8M      0      0      0      0      0      0     49      0      0      0
diskconvm    163G  1.66T     58    108  3.69M   600K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    102      9  6.32M   116K      0      0      0      0      0      2      0      0      0      0
diskconvm    163G  1.66T     65  1.02K  4.01M  16.1M      0      0      0      0      0      0  1.63K      1      0      0
diskconvm    163G  1.66T      6  1.38K   388K  27.7M      0      0      0      0      0      0     62      1      0      0
diskconvm    163G  1.66T     14    744   839K  9.53M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    109      3  6.64M  40.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    127      1  7.99M  99.9K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     61    678  3.87M  10.6M      0      0      0      0      0      0  2.52K      1      0      0
diskconvm    163G  1.66T      3  1.10K   136K  25.2M      0      0      0      0      0      3  1.08K      1      0      0
diskconvm    163G  1.66T      5  1.58K   324K  22.5M      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     57      9  3.56M  48.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T    106      5  6.57M  44.0K      0      0      0      0      0      0      0      0      0      0
diskconvm    163G  1.66T     57    718  3.62M  8.18M      0      0      0      0      0      0  1.74K      1      0      0
diskconvm    163G  1.66T      0    536  64.0K  8.40M      0      0      0      0      0      0  1.13K      1      0      0
diskconvm    163G  1.66T      5   1009   384K  12.6M      0      0      0      0      0      0     33      1      0      0
diskconvm    163G  1.66T     11     21   767K  2.21M      0      0      0      0      0      1      0      0      0      0

@dweeezil
Contributor

dweeezil commented Oct 5, 2016

@koplover Could you please enclose those dumps in triple backticks (```) to preserve the original formatting.

@koplover
Author

koplover commented Oct 5, 2016

@dweeezil Ah just read your post, was gathering some additional information posted rather messily above.

The kernel is 3.19.0-39
uname -a
Linux lee-ThinkPad-T530 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

The domain running zfs zvols runs with 4 vcpus, so not a large number

What remains interesting, from external monitoring with iostat of the zvol being stressed, is the relatively high queue size under 0.6.4.2 (sometimes > 20 if memory serves) versus the queue size of < 1 always seen with 0.6.5.x and now 0.7.

It feels like the newer ZFS versions are faster in themselves, but something else is stopping them from being fed quickly enough by the layers above.
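
For reference, the queue-depth figures I'm quoting come from plain iostat sampling; a minimal sketch of what I'm running (device names are examples from this box):

```
# Sample extended stats every 5 seconds for the zvol under test and the
# two mirror members; avgqu-sz is the column being discussed here.
iostat -x zd112 sda sdb 5
```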

@koplover
Author

koplover commented Oct 5, 2016

@dweeezil done - apologies, I was dumping output as the run was going and was going to tidy the formatting later.

@koplover
Author

koplover commented Oct 6, 2016

@dweeezil I've had a look at perf while the performance logging is going on, and pretty much 90% of the time is spent in the Xen scheduler.

Locking does not seem to be an issue in this scenario; the most I saw was 0.15% in the busiest spinlock routine.

It would be really great to backport the new zpool iostat options to 0.6.4.2 to compare what was going on there; I'll take a look at the feasibility.

As I say, I've been trying to nail it with fio, but have so far been unsuccessful with an array of sync/async and random/sequential read and write tests. I do have a combined read/write job which approximately matches the benchmark traffic when viewed through blktrace on a 0.6.4.2 system, and this is showing some promise. It does not reproduce the same disparity, although writes do seem faster on 0.6.4.2 (reads are faster on 0.7.0 in this combined test, however).
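
For what it's worth, the combined job is along these lines; the parameters below are approximations of the JetStress traffic as seen through blktrace rather than the exact job file:

```
# Rough fio approximation of the mixed 64K read/write benchmark phase
# against the test zvol (device path, read/write mix and depths are estimates).
cat > jetstress-like.fio <<'EOF'
[global]
filename=/dev/zd112
direct=1
ioengine=libaio
bs=64k
time_based=1
runtime=300

[mixed-rw]
rw=randrw
rwmixread=55
iodepth=16
numjobs=4
EOF
fio jetstress-like.fio
```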

@redtex

redtex commented Oct 6, 2016

@ironMann I've described a minimal setup to reproduce the zvol-related performance issue in #3871.

@dweeezil
Contributor

dweeezil commented Oct 6, 2016

@koplover Regarding your observation of short queue lengths in iostat's avgqu-sz for zvols, I think this is to be expected with a bio-based block device. Although a queue is created, from what I've read it's not actually used for bio-based devices; however, it would seem that it occasionally must have a single entry in it.

@richardelling
Contributor

NB, zpool iostat isn't the correct tool for this; it is too low in the stack. At the underlying layer (sd) you should see multiple I/Os in the queue from ZFS. Regular iostat -x will show queue depths and response times at the volume level.

@koplover do you see one I/O queued per-zvol or per-pool?
Also, fio and most other benchmark tools will not reproduce this.

@dweeezil I think b58986e is a good idea, but not related to this issue

@koplover
Author

koplover commented Oct 6, 2016

@ironMann Interesting find; however, I believe it's not strictly aligned with my issue.

The test systems I am using have just a single mirrored vdev, so there is not much scope for the unbalanced IOPS across vdevs that you describe. To be sure, and out of interest in how our system is affected by moving primarycache to metadata rather than all on the target zvol, I set it up that way on both 0.6.4.2 and 0.7.0.

[We do have other systems with 3 x mirror vdevs, so I'd assume your finding is very relevant to those and will help their performance in the future; unfortunately I do not have any to hand to test.]

On the 0.6.4.2 system it downgraded performance to the level I see when benchmarking a 0.7.0 system.

For the 0.7.0 system it cut the reported benchmark performance in half. This was also true for writes (although the benchmark is mixed read/write as discussed earlier, the initial phase is a write-only stage where the db is created). When monitoring that pure write stage, 0.6.5.x and 0.7.0 have always been much better (>40% better than 0.6.4.2); with primarycache set to metadata it was approximately 20% slower instead, which I had not expected.
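
For clarity, the only change between those runs was the caching property on the zvol being tested; something like the following (the dataset name below is a placeholder, not the actual zvol name):

```
# Limit ARC caching on the test zvol to metadata only, then revert;
# "testvol" is a placeholder for the actual zvol name.
zfs set primarycache=metadata diskconvm/testvol
zfs set primarycache=all diskconvm/testvol
```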

@koplover
Author

koplover commented Oct 6, 2016

@richardelling I'm not using zpool iostat but regular iostat on the zvol and on the underlying sd[a-b] devices in the vdev mirror. I was talking about the queue depth on the zvol being monitored, which was always < 1 for 0.6.5.x and 0.7.0, but @dweeezil indicates that this is to be expected with the move to bio-based block devices, which I was not aware of.
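
(If it's useful, I believe the bio-based behaviour can be sanity-checked from sysfs; something like the following should report no scheduler for the zvol, though I haven't dug into this deeply. Device name is an example.)

```
# A bio-based block device exposes no request-level I/O scheduler;
# this should print "none" for the zvol if that is the case.
cat /sys/block/zd112/queue/scheduler
```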

When monitoring the underlying disks I need to regather the information and confirm, but I thought I saw a significantly lower queue depth and utilization with 0.6.5.x / 0.7.0 compared to 0.6.4.2. However, I'll rerun and post some real results.

@koplover
Author

koplover commented Oct 6, 2016

Analysis during the performance benchmark: with 0.7.0 RC1 I see the following (the zpool is a single mirror vdev containing sda and sdb):


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               7.00     2.20   50.40  105.40  3516.00  8727.40   157.17     0.93    5.98   13.59    2.34   3.87  60.24
sdb               7.40     2.20   44.60  106.00  3179.20  8727.40   158.12     0.82    5.44   12.54    2.45   3.81  57.44

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               6.60     0.00   42.00   91.40  3030.40  7129.60   152.32     0.50    3.77    8.84    1.44   3.08  41.12
sdb               7.20     0.00   49.40   92.60  3507.20  7129.60   149.81     0.71    5.00   11.19    1.70   3.63  51.60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               4.80     2.00   52.00   92.00  3064.00  7193.00   142.46     0.79    5.52   11.42    2.19   4.03  58.00
sdb               5.80     2.00   56.20   93.00  3440.00  7193.00   142.53     0.77    5.18    9.99    2.28   3.74  55.76

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               4.60     0.00   57.00  102.20  3757.60  8269.80   151.10     0.65    4.10    9.21    1.24   3.20  50.96
sdb               3.60     0.00   50.20  102.00  3297.60  8269.80   152.00     0.67    4.38   10.36    1.44   3.46  52.72

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               6.20     2.20   44.60  101.40  3172.80  7866.80   151.23     0.73    5.00   10.80    2.45   3.87  56.48
sdb               4.60     2.20   46.40  101.80  3160.00  7866.80   148.81     0.68    4.56   10.29    1.95   3.60  53.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               3.20     0.00   47.20  120.00  3112.00  9575.20   151.76     0.54    3.24    8.39    1.21   2.84  47.52
sdb               3.60     0.00   58.40  119.20  3830.40  9575.20   150.96     0.74    4.19   10.03    1.34   3.41  60.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               5.00     2.40   47.80   92.00  3312.00  6709.00   143.36     0.72    5.16   10.49    2.38   3.97  55.44
sdb               6.60     2.40   49.80   91.40  3447.20  6709.00   143.86     0.65    4.61   10.12    1.61   3.64  51.44

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               4.00     1.40   49.60   97.60  3362.40  7325.60   145.22     0.70    4.73   10.55    1.77   3.54  52.16
sdb               7.00     1.40   46.80   97.40  3385.60  7325.60   148.56     0.52    3.64    7.86    1.61   3.36  48.48

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               6.40     2.60   46.60   87.80  3288.00  6328.20   143.10     0.62    4.63    8.67    2.49   3.74  50.24
sdb               4.80     2.60   47.40   88.00  3228.00  6328.20   141.16     0.73    5.39   10.97    2.38   4.01  54.32

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               5.00     0.00   53.80   95.40  3696.00  7252.00   146.76     0.59    3.98    8.01    1.71   3.35  50.00
sdb               3.00     0.00   57.00   94.00  3706.40  7252.00   145.14     0.65    4.28    9.09    1.35   3.62  54.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               7.00     2.00   51.40  117.40  3577.60  9211.40   151.53     0.66    3.93    9.79    1.37   3.17  53.44
sdb               6.20     2.00   48.80  116.80  3324.00  9211.40   151.39     0.66    3.97    9.13    1.82   3.14  52.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               6.20     0.00   51.60  115.40  3548.80  8283.40   141.70     0.57    3.44    8.67    1.11   2.96  49.36
sdb               6.00     0.00   55.00  114.60  3837.60  8283.40   142.94     0.60    3.53    8.83    0.99   3.12  52.88

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               5.20     2.60   44.40  116.80  3012.80  8821.20   146.82     0.93    5.79   15.32    2.17   3.44  55.44
sdb               6.60     2.60   50.20  116.20  3546.40  8821.20   148.65     0.71    4.25    9.48    1.99   3.50  58.24

Note, I've switched back to primarycache=all on both systems, so the write phase is back to being significantly faster on 0.7.0 RC1, and in this case the queue depths were large (8-10).

For 0.6.4.2 I see improved utilization and queue depths hovering around 2:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.80   79.20  197.80  5902.40 16199.30   159.58     2.35    8.45   22.33    2.89   2.91  80.64
sdb               0.00     0.80   64.20  208.00  4200.80 16374.50   151.18     2.35    8.62   25.37    3.45   3.08  83.84

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   51.00  275.60  3280.00 26257.20   180.88     1.75    5.39   28.06    1.19   2.48  81.12
sdb               0.00     0.00   69.20  282.20  5556.00 26081.80   180.07     2.05    5.38   22.09    1.28   2.40  84.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.20   64.80  159.80  4520.00 14024.00   165.13     1.35    5.91   17.11    1.37   3.32  74.64
sdb               0.00     2.20   74.00  156.00  5569.60 14024.20   170.38     1.49    7.05   16.49    2.57   3.49  80.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   61.00  156.40  3820.00 14177.20   165.57     1.38    6.43   19.12    1.48   3.50  76.08
sdb               0.00     0.00   60.00  156.80  3904.80 14177.20   166.81     1.17    5.55   16.53    1.34   3.42  74.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.00   55.40  197.60  3561.60 14333.40   141.46     1.93    7.63   25.85    2.52   3.14  79.36
sdb               0.60     2.00   78.40  197.20  6171.20 14333.40   148.80     2.32    8.44   21.46    3.26   2.93  80.64

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   70.80  189.00  4982.40 14251.40   148.07     1.52    5.84   17.56    1.44   3.13  81.36
sdb               0.00     0.00   78.20  186.40  5724.00 14251.60   150.99     1.39    5.26   15.49    0.97   2.99  79.12

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.80   55.00  210.80  3496.00 14049.80   132.02     1.38    5.21   20.22    1.29   2.90  77.12
sdb               0.00     2.80   57.20  208.60  3636.80 14049.60   133.08     1.35    5.09   19.43    1.15   2.87  76.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.20   66.20  167.60  4496.80 14288.50   160.70     2.16    9.24   26.49    2.42   3.48  81.44
sdb               0.20     1.20   77.00  168.40  5661.60 14288.50   162.59     2.26    9.21   25.15    1.92   3.43  84.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.80   57.60  205.00  3674.40 21205.50   189.49     1.78    6.62   20.78    2.64   3.17  83.12
sdb               0.00     0.80   82.40  230.80  6337.60 24751.90   198.53     1.98    6.14   16.04    2.61   2.79  87.28

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   50.20  238.60  3164.80 21786.00   172.79     1.75    6.20   26.66    1.90   2.83  81.84
sdb               0.20     0.00   67.80  218.40  5249.60 18239.60   164.15     1.77    6.38   20.72    1.93   2.77  79.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.40   57.40  146.40  3637.60 14044.60   173.53     1.29    6.37   18.08    1.78   3.68  74.96
sdb               0.00     2.40   59.60  140.40  3778.40 14044.60   178.23     1.30    6.52   17.64    1.81   3.84  76.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.40     0.00   60.40  238.80  4072.80 22769.60   179.43     1.60    4.61   18.72    1.04   2.68  80.24
sdb               0.20     0.00   74.80  264.20  5928.00 25391.20   184.77     1.78    4.24   15.02    1.19   2.47  83.60

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.20   63.40  221.00  5198.40 18994.80   170.14     1.96    7.68   22.97    3.30   3.07  87.36
sdb               0.00     2.20   51.80  184.40  3483.20 15934.00   164.41     1.76    8.92   28.12    3.53   3.60  85.04

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.00   59.20  177.80  3778.40 14485.10   154.12     1.54    6.49   20.45    1.85   3.39  80.40
sdb               0.00     1.00   65.20  159.80  4418.40 14924.30   171.94     1.54    6.86   18.88    1.95   3.56  80.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.80     1.80   66.20  137.00  4610.40 14078.90   183.95     1.70    8.37   20.97    2.28   3.99  81.04
sdb               0.00     1.80   73.80  137.80  5224.00 14078.90   182.45     1.77    8.37   20.26    2.01   3.86  81.68

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   72.40  194.20  5136.80 17123.40   166.99     1.29    4.81   15.09    0.98   2.87  76.64
sdb               0.00     0.00   74.60  185.00  5390.40 16441.80   168.20     1.36    5.24   15.30    1.18   3.07  79.68

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     2.60   61.80  264.60  4943.20 24035.40   177.56     2.09    6.42   24.35    2.23   2.63  85.76
sdb               0.00     2.60   55.00  262.40  3865.60 24096.20   176.19     1.86    5.77   23.52    2.05   2.69  85.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   52.80  202.40  3343.20 15113.00   144.64     1.43    5.59   21.45    1.45   3.03  77.36
sdb               0.00     0.00   54.80  205.40  3495.20 15733.80   147.80     1.38    5.43   20.60    1.38   2.96  76.96

Here, for the initial write phase (which does not contribute to the benchmark values), the queue depths were low (< 1.5).

@koplover
Author

I've been out of the loop for a little while; does anybody want any further investigation on our system to look at this zvol performance issue in a virtualized environment?

As outlined above, I last tried the 0.7.0 RC to see if it addressed any of the issues, with the same results on our benchmark.

@koplover
Author

I've built our test system with 0.7.0 RC3 with PR 5824 applied and the results are encouraging.
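
For anyone wanting to reproduce the setup, the build was essentially the standard autotools flow with the PR branch merged on top of the RC3 tag; roughly the following (a sketch from memory, with SPL built and installed beforehand in the same way, and exact refs/install method may differ):

```
# Build 0.7.0 RC3 with the PR 5824 branch merged in (sketch only).
git clone https://github.com/zfsonlinux/zfs.git
cd zfs
git checkout -b rc3-pr5824 zfs-0.7.0-rc3
git fetch origin pull/5824/head:pr-5824
git merge pr-5824
./autogen.sh && ./configure && make -j"$(nproc)"
sudo make install
```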

There is certainly a reported uplift in performance for our benchmark results (JetStress): not quite at the values reported for 0.6.4.2, but well above those reported for the various vanilla 0.6.5.x versions and for the vanilla 0.7.0 RC1 tested most recently.

For comparison, the various benchmark results were as follows:
0.6.4.2
Achieved transactional I/Os: 452.223

0.7.0 RC1
Achieved transactional I/Os: 158.279

0.7.0 RC3 with PR5824 applied
Achieved transactional I/Os: 350.435

The latest test is on the same physical server as the prior 0.7.0 RC1 test, with a single mirrored zpool (sda/sdb), the same configuration as in all the cases above.

I'll pass on the iostat analysis for this latest test, but it is no longer stuck at a < 1 average queue size.

@koplover
Author

Taking a look at iostat on sda/sdb in the same way as above, the results during the performance benchmarking stage look a lot better - aligned with the 0.6.4.2 results above, with the average queue size no longer being constrained.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.40     0.00   72.20  209.40  3952.00 14339.20   129.91     3.11   11.36   37.92    2.20   3.31  93.12
sdb               0.00     0.00   72.80  232.60  4213.60 14339.20   121.50     3.04   10.13   37.88    1.44   2.92  89.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   69.40  181.80  4387.20 14735.20   152.25     2.50    9.95   31.23    1.83   3.39  85.20
sdb               0.20     0.00   71.20  221.20  4393.60 14735.20   130.84     2.58    8.82   31.94    1.37   2.93  85.68

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.40     0.00   67.20  232.20  3920.00 17744.80   144.72     3.16   10.44   42.49    1.16   2.99  89.44
sdb               0.00     0.00   73.60  232.60  4352.00 17728.80   144.22     2.87    9.34   35.58    1.04   2.78  85.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   64.80  390.40  3369.60 26840.00   132.73     3.40    7.29   44.09    1.18   2.08  94.48
sdb               0.40     0.00   64.40  386.00  3251.20 26864.00   133.73     3.38    7.37   43.91    1.27   2.08  93.76

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.80     0.00   56.60  312.80  3521.60 16674.40   109.34     2.98    8.38   46.93    1.41   2.40  88.48
sdb               0.60     0.00   59.80  300.80  3782.40 16666.40   113.42     2.76    7.44   37.99    1.36   2.45  88.24

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00   87.80  235.20  3780.00 14860.00   115.42     3.12    9.66   31.53    1.50   2.56  82.64
sdb               0.20     0.00   87.20  233.00  3860.80 14860.00   116.93     3.10   10.12   32.73    1.66   2.65  84.88

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00   67.80  209.40  4127.20 14559.20   134.82     2.41    8.68   30.86    1.50   3.16  87.60
sdb               0.00     0.00   63.40  230.60  3902.40 14559.20   125.59     2.63    8.98   36.57    1.39   3.02  88.88

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   58.80  354.60  3608.80 25243.20   139.58     1.92    4.44   25.93    0.88   2.08  85.84
sdb               0.00     0.00   54.60  338.80  3242.40 25251.20   144.86     2.31    5.83   34.89    1.15   2.28  89.84

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   55.80  288.60  3449.60 19101.60   130.96     2.39    7.21   36.30    1.58   2.59  89.28
sdb               1.00     0.00   56.60  286.40  3449.60 19093.60   131.45     2.15    6.30   30.53    1.51   2.58  88.40

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   64.60  213.60  3920.00 14876.80   135.13     2.20    7.90   29.50    1.36   2.98  83.04
sdb               0.60     0.00   69.40  218.80  4168.80 14876.80   132.17     1.74    6.05   21.44    1.17   2.78  80.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00   63.80  255.60  4140.00 18624.00   142.54     2.51    7.76   33.14    1.43   2.72  86.96
sdb               0.00     0.00   68.80  262.20  4644.80 19249.60   144.38     2.68    8.07   32.78    1.59   2.70  89.44

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   60.40  228.20  3727.20 15622.40   134.09     2.65    9.09   37.87    1.48   2.81  81.04
sdb               0.40     0.00   57.60  257.80  3466.40 15806.40   122.21     2.94    8.79   41.51    1.47   2.71  85.52

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   52.40  404.60  2960.80 24576.80   120.51     3.56    6.66   49.92    1.06   2.06  94.08
sdb               0.20     0.00   68.40  348.80  3895.20 23767.20   132.61     3.14    6.59   35.86    0.85   2.19  91.36

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.40     0.00   68.20  229.60  4182.40 15076.00   129.34     2.89   11.58   44.18    1.90   2.93  87.36
sdb               0.40     0.00   74.80  221.60  3309.60 15076.00   124.06     3.31   13.07   45.38    2.16   3.02  89.52

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00   68.60  227.60  4323.20 14988.00   130.39     2.09    7.08   25.69    1.48   2.83  83.76
sdb               0.80     0.00   74.00  244.40  4696.00 14988.00   123.64     1.97    6.17   23.75    0.85   2.61  83.20

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00   62.00  186.80  3974.40 14732.80   150.38     2.55   10.26   35.47    1.90   3.21  79.84
sdb               0.00     0.00   66.40  236.00  4171.20 14732.80   125.03     1.96    6.49   26.90    0.75   2.58  78.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.40     0.00   55.60  343.80  3468.00 25656.80   145.84     2.64    6.56   38.79    1.35   2.26  90.32
sdb               0.60     0.00   56.40  399.00  3567.20 25656.80   128.34     2.61    5.54   37.42    1.03   1.94  88.16

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00   68.40  266.20  3752.80 18392.80   132.37     3.01    9.05   38.40    1.51   2.57  86.08
sdb               0.80     0.00   66.20  316.80  3610.40 18392.80   114.90     2.37    6.43   31.67    1.15   2.20  84.08

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.20     0.00   62.80  195.40  3258.40 14781.60   139.74     1.71    6.63   23.10    1.33   3.30  85.12
sdb               0.00     0.00   61.00  207.00  3174.40 14781.60   134.00     1.91    7.12   25.17    1.80   3.22  86.32

@koplover
Author

The various stages of the benchmark (database creation - write only phase, benchmarking - mixed read/write phase, and checksum checking - read only phase) provide some interesting numbers when compared with the original post (which compared 0.6.4.2 and 0.6.5.7 versions)

Note, although the geometry of the tested server is the same (a 2-disk mirror), the disk sizes for those original tests were 4TB, while this latest test is on 2 x 2TB.

Database creation phase (synchronous writes, no reads)
Example sample:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd720             0.00     0.00    0.00  170.40     0.00 48944.00   574.46     3.48   20.92    0.00   20.92   4.98  84.80

Benchmark measurement phase (mix read / write)
Example sample:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd720             0.00     0.00  255.20  224.20  9499.20  6891.20    68.38     2.90    6.00   11.09    0.20   1.93  92.40

Checksum checking phase (read only phase, checking db integrity at the end of the benchmark)
Example Sample:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd720             0.00     0.00  863.60    0.00 55270.40     0.00   128.00    28.45   32.72   32.72    0.00   1.16 100.00

@koplover
Author

So, in comparison, the outlier from this benchmark that still requires further investigation is the initial write phase for database creation. I'll try turning some of the knobs suggested in #5902 / #5824.
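
The kind of knobs I have in mind are the zvol module parameters discussed there; whether these exact names apply depends on the build, so treat this as a sketch with illustrative values:

```
# Show the current zvol-related module parameters on this build
grep . /sys/module/zfs/parameters/zvol_*
# (as root) persist different values for the next module load
cat > /etc/modprobe.d/zfs-zvol.conf <<'EOF'
options zfs zvol_request_sync=0 zvol_threads=32
EOF
```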

@ryao ryao self-assigned this Apr 26, 2017
@ryao
Contributor

ryao commented Apr 26, 2017

@koplover I am actively working on this as part of my new job that I started last week. Expect to hear back from me within the week.

@ryao
Contributor

ryao commented Apr 26, 2017

@koplover Just as an FYI, that patch wasn't what I meant when I said that I am actively working on this. There are multiple changes in the pipeline, including pipelining the ZIL itself.

@ryao ryao added Type: Regression Indicates a functional regression Component: ZVOL ZFS Volumes labels Apr 26, 2017
@koplover
Author

koplover commented May 9, 2017

@ryao Sounds good, looking forward to seeing the fruits of this labour - are you using zvols in this new job?

@koplover
Author

koplover commented May 9, 2017

I've found time to do a little more testing to compare 0.6.4.2 and 0.7.0 RC3 performance in our environment, to provide a more robust data point.

I realised looking back on my notes, that the 0.6.4.2 results from the start of this thread were a little unfair. We have a couple of server configurations in play here, one basic (2x2TB vdev with log and cache), and the other premium (2x4TB with log and cache device)

Although the disk size should not significantly skew the results, the SSD and log partition used in the two cases are different. For the basic server we have a 128GB SSD, and for the premium a 256GB SSD; the premium SSD's write speed is significantly faster than the 128GB part, and the partition assigned to the log vdev on the premium server is twice the size of that on the basic server.

The original tests were all against a premium server, while my latest 0.7.0 RC3 + PR results were for a basic server.

I've now got hold of another basic server and installed 0.6.4.2 on it, so the servers for the following results are exactly the same spec, and the operations performed are equivalent, carried out in step on each following the initial restart of the servers.

@koplover
Author

koplover commented May 9, 2017

Initial case: each server taken back to the best known state possible (a rough sketch of the commands involved follows the list).

  • Restart each server
  • Reformat SSD, and repartition for log and l2arc partitions
  • ARC recreated from scratch
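
A rough sketch of the reset commands, with hypothetical pool and partition names (the real device paths differ per server):

# remove the existing log and cache vdevs from the pool
zpool remove tank ata-SSD-part1     # old SLOG partition (hypothetical name)
zpool remove tank ata-SSD-part2     # old L2ARC partition (hypothetical name)

# repartition the SSD as required (sgdisk/parted), then re-add the vdevs
zpool add tank log ata-SSD-part1
zpool add tank cache ata-SSD-part2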

There remain obvious differences, in particular metaslab consumption, as the 0.7.0 system has been up for twice as long, but both are relatively low. The initial results show very close correlation between the two servers.

For this starting point, the headline benchmark results were as follows:

  1. 0.6.4.2 server: 445 transactional IOs
  2. 0.7.0 RC3 + PR server: 432 transactional IOs

This went through the full phases of the test benchmark:

  1. Database creation (pure write phase, allows ARC population) - optional first stage
  2. Performance logging (mixed read / write phase - simulates exchange db access)
  3. Database checksum verification (pure read phase)

@koplover
Author

koplover commented May 9, 2017

Detailed results for this run were as follows:

Database Creation:
0.6.4.2 example datapoint:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00    0.00  496.60     0.00 153596.80   618.59     2.80    5.63    0.00    5.63   1.49  73.84

0.7.0 RC3 + PR example datapoint

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00    0.00  379.60     0.00 114328.80   602.36     3.45    9.03    0.00    9.03   2.15  81.52

The duration for this 48GB database write was:
0.6.4.2 server: 9 minutes 5 seconds
0.7.0 RC3 server: 10 minutes 43 seconds
So the sequential write phase here is approximately 15% faster on the 0.6.4.2 server.

Performance Logging (time fixed for 30 minutes):
Example 0.6.4.2 data point:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00  276.60  218.80 10900.00  6711.30    71.10     2.62    5.27    9.32    0.16   1.80  88.96

Example 0.7.0 RC3 + PR data point:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00  269.40  216.60 10696.00  6521.60    70.85     2.82    5.99    9.78    1.27   1.86  90.24

Obviously the actual read and write values vary around this point by 10-20%, as the data points were only 5-second samples; I should probably take a longer sample period for future tests.
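
Something like the following should give steadier numbers next time - 30-second extended samples restricted to the zvol of interest (device names are the ones from this run):

iostat -xk 30 zd32      # 0.6.4.2 server
iostat -xk 30 zd4432    # 0.7.0 RC3 + PR server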

Database checksum verification (read through the database checksums)
0.6.4.2 Example datapoint

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00  663.20    0.00 42444.80     0.00   128.00    26.50   33.01   33.01    0.00   1.51  99.84

0.7.0 RC3 + PR example datapoint

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00 1212.80    0.00 77619.20     0.00   128.00    27.02   22.20   22.20    0.00   0.82 100.00

The actual duration of the operation shows again that 0.7.0 RC3 + PR is the clear winner for sequential read, as below:
0.6.4.2 server: 19 minutes 53 seconds
0.7.0 RC3 server: 10 minutes and 35 seconds

So the performance for this sequential read is roughly double for 0.7.0 RC3 over 0.6.4.2 in this test.

The ARC stats are a little strange and, at first glance, do not seem to back this read performance up.
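
For reference, these samples come from an arcstat invocation roughly like the one below - the field list is inferred from the column headers, and the exact alias I use differs slightly:

arcstat.py -f time,read,miss,miss%,dmis,dm%,mmis,mm%,l2read,l2miss,l2miss%,pmis,pm%,mtxmis,eskip,arcsz,c,l2size 1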

On 0.6.4.2 server, an arcstat sample was:

   time  read  miss  miss%  dmis  dm%  mmis  mm%  l2read  l2miss  l2miss%  pmis  pm%  mtxmis  eskip  arcsz     c  l2size   
13:13:51  2.1K  1.2K     59  1.2K   60     0    0    1.2K     952       7      0      0   4.4G  4.4G     26G   
13:14:06   437    96     21     5    1     0    0      96      95       98    91   92       0      0   4.4G  4.4G     26G   
13:14:07   759   188     24    43    8     0    0     188     184       97   145   59       0      0   4.4G  4.4G     26G   
13:14:08  4.3K   820     19     7    0     0    0     820     818       99   813   43       0      0   4.4G  4.4G     26G   
13:14:09  2.7K   604     22   121    7     0    0     604     578       95   483   45       0      0   4.4G  4.4G     26G   
13:14:10  2.2K   764     34   164   11     0    0     764     724       94   600   81       0      0   4.4G  4.4G     26G   
13:14:11  2.1K   504     23   412   27     1    0     504     376       74    92   14       0      0   4.4G  4.4G     26G   
13:14:12  1.7K   911     54   438   37     0    0     911     797       87   473   91       0      0   4.4G  4.4G     26G   
13:14:13  2.4K  1.5K     61   797   47     0    0    1.5K    1.3K       88   700   91       0      0   4.4G  4.4G     26G   
13:14:14   298    92     30     8    3     1    1      92      92      100    84  100       0      0   4.4G  4.4G     26G   
13:14:15  1.1K   388     35   122   17     0    0     388     358       92   266   72       0      0   4.4G  4.4G     26G   
13:14:16  2.4K   831     33   170   10     1    0     831     794       95   661   73       0      0   4.4G  4.4G     26G   
13:14:17  1.4K   819     57   600   51     0    0     819     690       84   219   82       0      0   4.4G  4.4G     26G   
13:14:18  2.3K   957     42   235   17     0    0     957     910       95   722   80       0      0   4.4G  4.4G     26G   
13:14:19  2.4K  1.0K     42   325   22     0    0    1.0K     941       92   691   73       0      0   4.4G  4.4G     26G   
13:14:20    45     3      6     3    6     0    0       3       3      100     0    0       0      0   4.4G  4.4G     26G   
13:14:21  3.7K   943     25   281   10     0    0     943     880       93   662   72       0      0   4.4G  4.4G     26G   
13:14:22  1.9K   502     25    28    2     0    0     502     492       98   474   58       0      0   4.4G  4.4G     26G   
13:14:23   844   240     28    57    9     0    0     240     223       92   183   70       0      0   4.4G  4.4G     26G   
13:14:24  3.3K  1.2K     35   133    6     0    0    1.2K    1.1K       97  1.0K   74       0      0   4.4G  4.4G     26G   
13:14:25  2.0K   659     33   185   15     0    0     659     601       91   474   63       0      0   4.4G  4.4G     26G   
13:14:26  1.9K   374     19    84    7     1    0     374     352       94   290   39       0      0   4.4G  4.4G     26G   
13:14:27   867   506     58   480   58     0    0     506     391       77    26   61       0      0   4.4G  4.4G     26G   
13:14:28  1.7K  1.0K     59   475   43     0    0    1.0K     895       88   536   88       0      0   4.4G  4.4G     26G   
13:14:29  1.6K   663     42   165   17     0    0     663     630       95   498   79       0      0   4.4G  4.4G     26G   
13:14:30  2.7K   890     32     4    0     0    0     890     890      100   886   67       0      0   4.4G  4.4G     26G   
13:14:31   759   268     35    19    4     3    1     268     268      100   249   75       0      0   4.4G  4.4G     26G   
13:14:32  2.9K   453     15     4    0     0    0     453     452       99   449   47       0      0   4.4G  4.4G     26G   
13:14:33  1.9K   950     49   727   44     0    0     950     774       81   223   83       0      0   4.4G  4.4G     26G   
13:14:34  2.5K  1.3K     51   528   32     0    0    1.3K    1.2K       91   737   88       0      0   4.4G  4.4G     26G   
13:14:35  2.9K   901     31   294   15     0    0     901     837       92   607   65       0      0   4.4G  4.4G     26G   
13:14:36  1.3K   537     40   273   26     4    1     537     486       90   264   84       0      0   4.4G  4.4G     26G   
13:14:37  1.7K   967     56   739   51     0    0     967     810       83   228   89       0      0   4.4G  4.4G     26G   
13:14:38  1.5K   663     44   394   42     0    0     663     565       85   269   48       0      0   4.4G  4.4G     26G   
13:14:39  4.7K   780     16    17    0     0    0     780     778       99   763   41       0      0   4.4G  4.4G     26G   
13:14:40  2.6K  1.0K     38   307   16     0    0    1.0K     958       93   715   90       0      0   4.4G  4.4G     26G   

And for 0.7.0 RC3 + PR

   time  read  miss  miss%  dmis  dm%  mmis  mm%  l2read  l2miss  l2miss%  pmis  pm%  mtxmis  eskip  arcsz     c  l2size   
13:09:36  1.7K  1.7K     96  1.6K   96    11   22    1.7K     276       16    87   98       0      0   4.4G  4.4G     43G   
13:09:37  1.2K   994     85   597   85    18   35     994     460       46   397   86       0     38   4.4G  4.4G     43G   
13:09:38  1.5K  1.3K     87  1.1K   91    15   31    1.3K     348       26   191   68       0    181   4.4G  4.4G     43G   
13:09:39   646   427     66   215   50    20   15     427     254       59   212   97       0     49   4.4G  4.4G     43G   
13:09:40  1.5K  1.2K     82   702   79    12   20    1.2K     671       55   512   86       0     13   4.4G  4.4G     43G   
13:09:41  1.1K   827     76   330   57     6   15     826     573       69   497   97       0      0   4.4G  4.4G     43G   
13:09:42  1.6K  1.5K     93  1.3K   92     8   22    1.5K     528       35   190   97       0      7   4.4G  4.4G     43G   
13:09:43  1.6K  1.5K     93  1.3K   95    17   48    1.5K     535       36   142   83       0      7   4.4G  4.4G     43G   
13:09:44   936   583     62   226   47    28   16     583     429       73   357   77       0      0   4.4G  4.4G     43G   
13:09:45  1.5K  1.4K     90  1.3K   93     9   29    1.4K     405       29    61   59       0     98   4.4G  4.4G     43G   
13:09:46  1.9K  1.7K     94  1.5K   96    24   64    1.7K     559       31   244   83       0    175   4.4G  4.4G     43G   
13:09:47  1.7K  1.6K     97  1.6K   97    17   43    1.6K     348       21    68  100       0    200   4.4G  4.4G     43G   
13:09:48  1.6K  1.4K     86  1.1K   86     5    9    1.4K     430       30   293   84       0     10   4.4G  4.4G     43G   
13:09:49  1.3K  1.2K     92  1.1K   91    25   21    1.2K     339       27   113  100       0      0   4.4G  4.4G     44G   
13:09:50  1.2K  1.0K     83   493   75    11   22    1.0K     635       62   527   92       0     57   4.4G  4.4G     44G   
13:09:51  1.6K  1.1K     69   743   66     9   19    1.1K     585       50   407   76       0    153   4.4G  4.4G     44G   
13:09:52  2.0K  1.9K     96  1.8K   96    21   47    1.9K     798       42   118  100       0     35   4.4G  4.4G     44G   
13:09:53  2.1K  2.0K     94  1.8K   94    12   14    2.0K     800       40   166  100       1     36   4.4G  4.4G     44G   
13:09:54  1.0K   788     75   515   75    39   32     788     406       51   273   76       1     62   4.4G  4.4G     44G   
13:09:55  1.5K  1.4K     97  1.4K   97    17   39    1.4K     341       23    72   96       0    325   4.4G  4.4G     44G   
13:09:56  1.3K  1.1K     86   827   84     9   18    1.1K     418       38   266   93       0     43   4.4G  4.4G     44G   
13:09:57  1.4K  1.2K     81   612   70    11   28    1.2K     678       58   540   98       0     43   4.4G  4.4G     44G   
13:09:58  1.5K  1.4K     96  1.3K   96    14   28    1.4K     296       20    89  100       1     11   4.4G  4.4G     44G   
13:09:59  1.0K   830     81   579   78    72   32     830     362       43   251   90       0      6   4.4G  4.4G     44G   

The 0.7.0 RC3 server seems to show a significantly higher number of metadata misses compared with 0.6.4.2 - looking only at these stats I'd expect read performance for 0.7.0 to be significantly worse, but that is clearly not the case given the measurements earlier - what is happening here?

@lnicola
Contributor

lnicola commented May 9, 2017

Example 0.7.0 RC3 + PR data point:

Can you double-check this?

@koplover
Author

koplover commented May 9, 2017

I reran the same test again immediately after the one above with pretty much exactly the same results.

More interesting is what happened when I then re-ran the tests omitting the first, optional database creation step (so the tests re-attached to the existing database on disk). Again, this was started only a couple of minutes after the previous set of tests completed.

The difference in benchmarks results is significant in this case:
0.6.4.2 server: 487 transactional IOs
0.7.0 RC3 server: 350 transactional IOs

@koplover
Author

koplover commented May 9, 2017

I did not capture any detailed stats for the 'attach-only' run above, so I retried the same test run the next morning, with worse results on both servers, presumably because background processes had evicted test data from the ARC.

This time the benchmark results gave:
0.6.4.2 server: 322 transactional IOs
0.7.0 RC3 + PR: 173 transactional IOs (actually failing the test's 20ms read-latency maximum - it took 27ms, compared to the 10-14ms range of the other tests, both for this server and the 0.6.4.2 server)

The details of this run are as follows:

No additional database creation - re-attached to existing database from prior tests

Performance Logging phase:
0.6.4.2 example data point:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00  177.20  136.60  7731.20  4121.60    75.54     2.84    9.06   15.90    0.19   3.07  96.32

0.7.0 RC3 example data point:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00  113.80   92.00  5657.60  2998.50    84.12     6.21   30.42   25.61   36.37   4.63  95.20

Database checksum verification phase:
0.6.4.2 example datapoint:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00  368.20    0.00 23564.80     0.00   128.00    19.84   54.14   54.14    0.00   2.72 100.00

0.7.0 RC3 + PR example datapoint:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00  535.80    0.00 34291.20     0.00   128.00    27.37   52.37   52.37    0.00   1.86  99.92

Again 0.7.0 gives better read performance, despite the arcstats seemingly implying that it is having a great deal of trouble with metadata misses, which is normally an indicator of worse performance. A sample of the 0.7.0 RC3 arcstats during this period is:

   time  read  miss  miss%  dmis  dm%  mmis  mm%  l2read  l2miss  l2miss%  pmis  pm%  mtxmis  eskip  arcsz     c  l2size   
10:54:12   984   532     54   287   50     5    9     532     249       46   245   59       0      0   4.4G  4.4G     73G   
10:54:13   407   224     55   224   55     4    5     224       1        0     0    0       0      0   4.4G  4.4G     73G   
10:54:14   815   664     81   522   79     0    0     664     144       21   142   90       0      0   4.4G  4.4G     73G   
10:54:15   856   487     56    50   17     1    2     487     438       89   437   77       0      0   4.4G  4.4G     73G   
10:54:16   630   510     80   473   79     1    5     510      49        9    37  100       0      0   4.4G  4.4G     73G   
10:54:17   739   607     82   332   78     3    8     607     275       45   275   87       0      0   4.4G  4.4G     73G   
10:54:18   524   331     63   313   61     6    5     331      25        7    18  100       1      0   4.4G  4.4G     73G   
10:54:19   603   542     89   541   89     0    0     542       1        0     1  100       0      0   4.4G  4.4G     73G   
10:54:20   848   609     71   295   69     1    3     609     314       51   314   74       0      0   4.4G  4.4G     73G   
10:54:21   838   525     62    62   25     5    9     525     464       88   463   78       0      0   4.4G  4.4G     73G   
10:54:22   778   567     72   567   73     0    0     567       4        0     0    0       0      0   4.4G  4.4G     73G   
10:54:23   550   314     57   178   46     5    4     314     143       45   136   81       0      0   4.4G  4.4G     73G   
10:54:24   989   561     56   132   42     3    5     561     430       76   429   63       0      0   4.4G  4.4G     73G   
10:54:25   736   514     69   330   61     0    0     514     190       36   184   91       0      0   4.4G  4.4G     73G   
10:54:26   962   543     56   230   45     0    0     543     314       57   313   68       0      0   4.4G  4.4G     73G   
10:54:27  1.2K   756     64   417   69     1    1     756     344       45   339   59       0      0   4.4G  4.4G     73G   
10:54:28   527   259     49   202   43     7    7     259      57       22    57   86       0      0   4.4G  4.4G     73G   
10:54:29   686   557     81   554   81     0    0     557       5        0     3  100       0      0   4.4G  4.4G     73G   
10:54:30   603   545     90   447   88     0    0     545     100       18    98  100       0      0   4.4G  4.4G     73G   
10:54:31   940   557     59   252   53     2    3     557     310       55   305   65       0      0   4.4G  4.4G     73G   
10:54:32   700   501     71   288   67     0    0     501     213       42   213   78       0      0   4.4G  4.4G     73G   
10:54:33   425   241     56   241   57     3    2     241       2        0     0    0       0      0   4.4G  4.4G     73G   
10:54:34   698   516     73   362   69     0    0     516     154       29   154   88       0      0   4.4G  4.4G     73G   
10:54:35   685   590     86   373   83     1    3     590     217       36   217   90       0      0   4.4G  4.4G     73G   
10:54:37   610   493     80   493   80     0    0     493      18        3     0    0       0      0   4.4G  4.4G     73G   
10:54:38   883   508     57   162   38     1    2     508     351       69   346   74       0      0   4.4G  4.4G     73G   
10:54:39   647   427     65   166   48     0    0     427     265       62   261   85       0      0   4.4G  4.4G     73G   
10:54:40   553   420     75   355   73     6   14     420      73       17    65   95       0      0   4.4G  4.4G     73G   
10:54:41   600   440     73   141   62     2    5     440     310       70   299   79       0      0   4.4G  4.4G     73G   
10:54:42   720   558     77   357   71     0    0     558     204       36   201   90       0      0   4.4G  4.4G     73G   
10:54:43   593   462     77   259   71     1    3     462     210       45   203   88       0      0   4.4G  4.4G     73G   
10:54:44   791   529     66   297   71    10    8     529     247       46   232   62       0      0   4.4G  4.4G     73G   
10:54:45   897   409     45   409   67     0    0     409       4        0     0    0       0      0   4.4G  4.4G     73G   
10:54:46   737   591     80   544   80     1    6     591      59        9    47   82       0      0   4.4G  4.4G     73G   
10:54:47   849   572     67   347   65     4    9     572     229       40   225   70       0      0   4.4G  4.4G     73G   
10:54:48   771   533     69   532   75     0    0     533       3        0     1    1       0      0   4.4G  4.4G     73G   
10:54:49   555   330     59    49   20    11    8     330     287       86   281   89       0      0   4.4G  4.4G     73G   
10:54:50   633   512     80   320   76     2    6     512     195       38   192   89       0      0   4.4G  4.4G     73G   
10:54:51  1.1K   549     50   288   50     0    0     549     262       47   261   50       0      0   4.4G  4.4G     73G   
10:54:52   730   537     73   382   70     0    0     537     155       28   155   82       0      0   4.4G  4.4G     73G   
10:54:53   864   705     81   546   79     1    3     705     161       22   159   90       0      0   4.4G  4.4G     73G   
10:54:54   519   227     43   122   33     4    3     227     107       47   105   67       0      0   4.4G  4.4G     73G   
10:54:55   912   690     75   319   63     2    5     690     376       54   371   90       0      0   4.4G  4.4G     73G   
10:54:56   629   494     78   291   72     1    5     494     204       41   203   89       0      0   4.4G  4.4G     73G   
10:54:57   693   534     77   359   72     0    0     534     184       34   175   88       0      0   4.4G  4.4G     73G   
10:54:58  1.2K   490     41   256   39     1    1     490     237       48   234   43       0      0   4.4G  4.4G     73G   
10:54:59   689   522     75   363   72     8   10     522     163       31   159   83       0      0   4.4G  4.4G     73G   
10:55:00   726   487     67   436   72     0    0     487      77       15    51   40       0      2   4.4G  4.4G     73G   
10:55:01   747   512     68    80   28     3    7     512     439       85   432   92       0      0   4.4G  4.4G     73G   
10:55:02   658   614     93   594   93     4   23     614      21        3    20  100       0      0   4.4G  4.4G     73G   
10:55:03   639   503     78   338   75     0    0     503     170       33   165   85       0      0   4.4G  4.4G     73G   
10:55:04   483   308     63   115   52     2    1     308     197       63   193   72       0      0   4.4G  4.4G     73G   
10:55:05   776   552     71   552   71     0    0     552      11        1     0    0       1      0   4.4G  4.4G     73G   
10:55:06   758   537     70   316   61     0    0     537     223       41   221   91       0      0   4.4G  4.4G     73G   
10:55:07   781   623     79   471   77     2    3     623     181       29   152   88       0      0   4.4G  4.4G     73G   
10:55:08   712   544     76   544   76     0    0     544       5        0     0    0       0      0   4.4G  4.4G     73G   
10:55:09   566   338     59   338   61    15   12     338       3        0     0    0       0      0   4.4G  4.4G     73G   
10:55:10   806   626     77   236   68     3    7     626     395       63   390   84       0      0   4.4G  4.4G     73G   
10:55:11   648   501     77   491   76     0    0     501      11        2    10  100       0      0   4.4G  4.4G     73G   
10:55:12   881   610     69   396   77     5   25     610     219       35   214   57       0      0   4.4G  4.4G     73G   
10:55:13   667   416     62   130   45     0    0     416     288       69   286   74       0      0   4.4G  4.4G     73G   
10:55:14   849   355     41    77   19    12    9     355     279       78   278   61       0      0   4.4G  4.4G     73G   
10:55:15  1.0K   671     64   360   61     0    0     671     314       46   311   66       0      0   4.4G  4.4G     73G   
10:55:16   674   474     70   420   67     0    0     474      56       11    54   96       0      0   4.4G  4.4G     73G   
10:55:17   825   564     68   304   58     0    0     564     263       46   260   85       0      0   4.4G  4.4G     73G   
10:55:18   856   646     75   202   54     0    0     646     445       68   444   91       0      0   4.4G  4.4G     73G

@behlendorf
The L2ARC size is also interesting. Above it is 73G, but the cache (L2ARC) partition is only 44G. What does this value now represent - is it the uncompressed size of the data stored compressed on the L2ARC, an error, or something else? If it is the former, is there any increase in the metadata that needs to be stored in the ARC to support this?

@koplover
Author

koplover commented May 9, 2017

Running the same test again immediately after this one completes raises the figures for both:

0.6.4.2 server: 481 transactional IOs
this is back to the values seen originally when the database was created as part of the test. In this case all that has happened is that the same database is being used, but it was the subject of the preceding test, so presumably the ARC has been populated with database blocks

0.7.0 RC3 + PR server: 334 transactional IOs
increased but remains roughly 33% below the original benchmark figures. It now seems to stay around this mark.

@koplover
Author

koplover commented May 9, 2017

The last data-point test I've tried, immediately after the above, was run as for the first test (including the optional database-creation sequential write phase):

0.6.4.2 server: 485 transactional IOs /s
0.7.0 RC3 + PR: 312 transactional IOs /s

Again 0.6.4.2 returns to its initial value, whereas 0.7.0 is stuck 33% below its original benchmark value.

Database creation phase:
0.6.4.2 server

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00    0.00  195.00     0.00 58178.40   596.70     3.69   15.80    0.00   15.80   4.34  84.64

0.7.0 RC3 + PR server

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00    0.00  141.60     0.00 44528.80   600.69     3.76   26.42    0.00   26.42   6.18  87.52 

0.6.4.2 duration: 23 minutes 42 seconds
0.7.0 RC3 + PR: 27 minutes 14 seconds
So again about 15% slower for 0.7.0 RC3 + PR

Performance logging phase:
0.6.4.2 server:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00  340.80  314.00 12954.40 10153.70    70.58     2.46    3.76    7.15    0.08   1.38  90.64

0.7.0 RC3 + PR server:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00  206.00  193.20  8328.00  6078.40    72.18     4.63   11.71   13.42    9.89   2.45  97.92

Database checksum verification phase
0.6.4.2 server

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd32              0.00     0.00  590.80    0.00 37811.20     0.00   128.00    24.97   39.85   39.85    0.00   1.69  99.84

0.7.0 RC3 + PR server

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util 
zd4432            0.00     0.00  868.80    0.00 55603.20     0.00   128.00    24.19   27.77   27.77    0.00   1.15 100.00

This sequential read phase is again significantly faster:
0.6.4.2 duration: 22 minutes and 1 second
0.7.0 RC3 duration: 10 minutes and 39 seconds

So again twice as fast in 0.7.0 RC3 + PR, even though the arcstats would imply issues to me:

   time  read  miss  miss%  dmis  dm%  mmis  mm%  l2read  l2miss  l2miss%  pmis  pm%  mtxmis  eskip  arcsz     c  l2size   
12:52:56  1.3K  1.1K     84   654   79    24   16    1.1K     418       39   410   92       0      2   4.4G  4.4G     99G   
12:52:57  1.7K  1.3K     80  1.0K   85    45   46    1.3K     301       22   281   66       0     29   4.4G  4.4G     99G   
12:52:58  1.3K  1.2K     93  1.2K   93    13   30    1.2K     146       12    19   86       0    134   4.4G  4.4G     99G   
12:52:59  1.3K  1.1K     87   785   85     8   15    1.1K     531       46   347   92       0      0   4.4G  4.4G     99G   
12:53:00  1.0K   885     87   546   83     6   11     885     365       41   339   93       0     41   4.4G  4.4G     99G   
12:53:01  1.2K   943     79   708   75    15   10     943     247       26   235   94       0     31   4.4G  4.4G     99G   
12:53:02   653   549     84   541   83    14   22     549      61       11     8  100       0      0   4.4G  4.4G     99G   
12:53:03  1.1K   915     84   586   81     4    4     915     468       51   329   90       0      1   4.4G  4.4G     99G   
12:53:04  1.2K  1.1K     89   914   89     3    7    1.1K     316       29   161   93       0      1   4.4G  4.4G     99G   
12:53:05   934   724     77   539   73    10   19     724     203       28   185   91       0      2   4.4G  4.4G     99G   
12:53:06  1.0K   927     91   767   90    25   45     927     198       21   160   95       0     11   4.4G  4.4G     99G   
12:53:07   532   348     65   164   48    37   17     348     200       57   184   94       1     64   4.4G  4.4G     99G   
12:53:08  1.3K  1.0K     80   817   78     4   12    1.0K     486       46   225   93       0      1   4.4G  4.4G     99G   
12:53:09  1.6K  1.4K     89  1.1K   92    10   21    1.4K     418       29   312   80       0      2   4.4G  4.4G     99G   
12:53:10  1.4K  1.2K     81  1.1K   81     4   12    1.2K     161       13   103   83       0     66   4.4G  4.4G     99G   
12:53:11  1.1K   756     66   222   48    16   12     756     537       71   534   78       0     25   4.4G  4.4G     99G   
12:53:12  1.7K  1.4K     84  1.3K   83    38   20    1.4K     148       10   145   94       0    315   4.4G  4.4G     99G   
12:53:13   690   593     85   510   85     1    3     593      83       13    83   92       0     15   4.4G  4.4G     99G   
12:53:14  1.5K  1.4K     94  1.3K   94    10   20    1.4K     204       14   168   94       0     22   4.4G  4.4G     99G   
12:53:15  1.7K  1.5K     85  1.3K   84     6    8    1.5K     204       13   198   92       0     12   4.4G  4.4G     99G   
12:53:16  1.1K   796     73   433   62     7    6     796     377       47   363   94       0     20   4.4G  4.4G     99G   
12:53:17   976   767     78   470   71     9    7     767     314       40   297   93       0     52   4.4G  4.4G     99G   
12:53:18   852   627     73   246   54    14   19     627     387       61   381   94       0     14   4.4G  4.4G     99G   
12:53:19  1.1K  1.0K     90   721   88     4   10    1.0K     318       30   316   93       0    204   4.4G  4.4G     99G   
12:53:20  1.2K  1.0K     87   836   86     5   10    1.0K     200       19   199   91       0     61   4.4G  4.4G     99G   
12:53:21  1.1K   885     83   738   82     9    8     885     154       17   147   93       0      3   4.4G  4.4G     99G   
12:53:22   707   582     82   216   68    19   28     582     367       63   366   93       0      0   4.4G  4.4G     99G   

@koplover
Author

koplover commented May 9, 2017

In summary, these are my key observations from the above tests:

  1. 0.6.4.2 is approx 15% faster than 0.7.0 RC3 + PR 5824 for sequential writes

  2. 0.7.0 RC3 + PR 5824 is roughly double the read speed of 0.6.4.2 for the sequential read operation (more data available from the ARC)

  3. 0.7.0 RC3 + PR 5824 deteriorates quickly from its 'top' benchmark value and cannot get back to that value by recreating the database from scratch, unlike 0.6.4.2. I assume this has something to do with changes in the way the ARC / L2ARC is populated. While they were empty in the first test, most of the database could be held in the ARC / L2ARC as they were being warmed. Once they have been warmed (the ARC fully populated and the L2ARC holding significant content), it is harder for the database blocks to displace the resident blocks.

What significant changes have there been to the ARC / L2ARC caching that might explain this? It's not an entirely satisfying explanation anyway, as the sequential read phase of the db checksum verification is significantly faster in 0.7.0 RC3 + PR.

  4. arcstat details for 0.7.0 RC3 + PR are misleading. The l2arc size is going above the partition size of the l2arc device - I assume this is the uncompressed value and the data is now being compressed in the l2arc - can somebody confirm?

  5. The arcstat metadata miss values imply big problems and suggest that read performance should suffer - yet read performance is double that of the 0.6.4.2 server, which looks much happier.

@behlendorf, @dweezil - does the above make sense? Anything else it would be useful to find out from these two systems?

@sempervictus I've looked at the write-improvement suggestions; FYI, I've tried:

  • setting read_ahead_kb under the zvol queue to 0kb, with no impact on results
  • I do not have a write_cache tunable available - any idea why not (Ubuntu 14.04), and what the default value would be (write through / write back)? (The paths I checked are sketched below.)
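
For reference, the paths I have been looking at are of this form (zd4432 is the zvol from these tests; as far as I can tell the write_cache queue attribute only exists on kernels newer than the stock 14.04 one, which would explain its absence):

# disable read-ahead on the zvol's block-device queue
echo 0 > /sys/block/zd4432/queue/read_ahead_kb

# cache mode attribute - present on newer kernels only, missing here
cat /sys/block/zd4432/queue/write_cache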

@koplover
Author

koplover commented May 9, 2017

But to reiterate - results for 0.7.0 RC3 are much better with this PR than without it on our system.

@koplover
Author

koplover commented May 9, 2017

@lnicola yes, thanks - that was a copy-paste error in the first set of performance logging results for 0.7.0 RC3 - I pasted in the same sample twice!

@behlendorf
Contributor

@koplover thanks for the testing results and confirming PR 5824 is a step in the right direction.

If your test system is still set up, it would be great if you could re-run the same tests using rc4, which was recently tagged. This release contains PR 5824 (enabled by default), additional zvol performance tweaks (5731140, bc17f10) and some changes to address issues observed with the ARC (9b50146, 30fffb9, b855550, 2e91c2f, 1a31dcf, 44813ae, 4149bf4, 03b60ee). You should see comparable performance to rc3 + PR 5824.
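
(As a quick sanity check after upgrading, the loaded module version can be confirmed with something like the following, assuming the packaged module exposes its version node in sysfs:)

cat /sys/module/zfs/version
dmesg | grep "ZFS: Loaded module"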

0.6.4.2 is approx 15% faster than 0.7.0 RC3 + PR 5824 for sequential writes

There's a significant amount of change between these two releases which could account for this. Compressed ARC, ABD, etc. But it's good to see we're competitive again with the older release. With @ryao focusing on further optimizing zvols I'm optimistic.

0.7.0 RC3 + PR 5824 is roughly double the read speed than 0.6.4.2 for the sequential read operation (more information available from ARC)

This may be the result of the recent compressed ARC feature. Not only is the ARC now compressed in memory, but the L2ARC is compressed on-disk and the size of the headers needed to track the L2ARC data have been reduced. The net effect is you're able to cache considerably more data which is great for reads.

0.7.0 RC3 + PR 5824 deteriorates from its 'top' benchmark value quickly and cannot get back to that value by recreating the database from scratch.

This might be related to an ARC accounting leak recently fixed ( 423e7b6 ). It will be interesting to see if you observe this same behavior in rc4 where this has been addressed.

arcstats details for 0.7.0 RC3 + PR are misleading. The l2arc size is going above the partition size of the l2arc - I assume this is the uncompressed value and it is now being compressed in the l2arc - can somebody confirm?

There are two values reported in /proc/spl/kstat/zfs/arcstats: l2_size is what's reported by arcstat, and it is the uncompressed size of what's stored on the l2arc device. There's also an l2_asize value, which is the allocated (on-disk) size.
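
For example, a quick way to see both values side by side (a minimal sketch):

grep -E '^l2_(size|asize)' /proc/spl/kstat/zfs/arcstats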

The arcstat metadata miss values imply big problems

I can see why this is misleading. The meta miss rate is clearly up, but many more I/Os are being pushed through the system, and it's making much better use of the l2arc device. Rarely do we need to go all the way back to disk, according to the l2miss%, which is down significantly.

@sempervictus
Contributor

sempervictus commented May 9, 2017 via email

@koplover
Author

@behlendorf Thanks for the explanation - my alias was using l2size, so that explains the L2ARC size; I'll use both values in future to provide a fuller picture.
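
For what it's worth, the field list I'll switch to is something like this (field names as accepted by the arcstat.py shipped with ZoL, as far as I can tell):

arcstat.py -f time,read,miss,miss%,l2read,l2miss,l2miss%,arcsz,c,l2size,l2asize 1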

In terms of trying RC4, this probably means a server redeploy, so it will take longer than just reusing the existing server, but I will try an in-place upgrade from the dailies first.

Are you intending to back-port any of these performance enhancements into the 0.6.5.x stream? I was hoping to build that next, as this stream should be more stable than a new 0.7.0 and we need to get this out onto our production servers as soon as possible. We are struggling with 0.6.4.2 instabilities in the field and want to take the stability enhancements without the performance penalties, with the least risk possible.

0.6.5.x (0.6.5.10?) gives the best compromise of risk vs. timing from our perspective - I assume 0.7.0 is still some way off and, when released, will have teething issues despite the additional testing from the community.

@gmelikov
Member

Closed as stale; the current stable branch is now 0.8.
