Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add TRIM support - replaces #5925 and #7363 #8255

Closed
wants to merge 38 commits into from

Conversation

dweeezil
Copy link
Contributor

@dweeezil dweeezil commented Jan 8, 2019

Motivation and Context

Frequently-requested feature to help performance on SSDs and on various other SAN-like storage back-ends which support UNMAP/TRIM.

Description

This is a port of Nextenta's TRIM support to the current master branch - post-device removal and post-device initialization.

How Has This Been Tested?

New ZTS tests. Also additions to ztest.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

@codecov
Copy link

codecov bot commented Jan 9, 2019

Codecov Report

Merging #8255 into master will decrease coverage by 0.04%.
The diff coverage is 88.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8255      +/-   ##
==========================================
- Coverage   78.33%   78.29%   -0.05%     
==========================================
  Files         380      382       +2     
  Lines      115719   116678     +959     
==========================================
+ Hits        90653    91350     +697     
- Misses      25066    25328     +262
Flag Coverage Δ
#kernel 78.93% <86.25%> (+0.04%) ⬆️
#user 65.8% <87.99%> (-1.21%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0a10863...3a184b8. Read the comment docs.

@dweeezil
Copy link
Contributor Author

This has been refreshed with fixes to issues discovered during testing. There are still some deadlock issues which occur mainly when the autotrim property is changed and another pool management is performed (mainly adding and possibly removing vdevs). The deadlocks are being actively investigated.

@@ -6649,6 +6700,53 @@ zpool_do_resilver(int argc, char **argv)
return (for_each_pool(argc, argv, B_TRUE, NULL, scrub_callback, &cb));
}

/*
* zpool trim [-s|-r <rate>] <pool> ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing -p flag

print_trim_status(trim_prog, (vs->vs_space -
vs->vs_alloc) + zpool_slog_space(nvroot),
trim_rate, trim_start_time, trim_stop_time);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's model getting the trim stats off the way this is handled for checkpoint, scan, and removal stats. Specifically, with a new ZPOOL_CONFIG_TRIM_STATS key containing a uint64 array. For consistency let's try and get all this code to work in basically the same way.

@@ -539,6 +539,40 @@ bio_is_fua(struct bio *bio)
#endif
}

/*
* bio_set_discard - Set the appropriate flags in a bio to indicate
* that the specific random of sectors should be discarded.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/random/range

@@ -383,6 +383,8 @@ typedef struct dmu_buf {
#define DMU_POOL_OBSOLETE_BPOBJ "com.delphix:obsolete_bpobj"
#define DMU_POOL_CONDENSING_INDIRECT "com.delphix:condensing_indirect"
#define DMU_POOL_ZPOOL_CHECKPOINT "com.delphix:zpool_checkpoint"
#define DMU_POOL_TRIM_START_TIME "trim_start_time"
#define DMU_POOL_TRIM_STOP_TIME "trim_stop_time"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This need to converted this to a single entry containing an array of uint64's for TRIM. Just like DMU_POOL_SCAN or DMU_POOL_REMOVING. Something like this (I may have missed some needed entries).

#define	DMU_POOL_TRIM		"org.open-zfs:trim"

...

/*
 * All members must be uint64_t, for byteswap purposes.
 */
typedef struct spa_trim_phys {
        uint64_t tr_state; /* dsl_scan_state_t */
        uint64_t tr_start_time;
        uint64_t tr_end_time;
        uint64_t tr_rate;
        uint64_t tr_to_trim; /* bytes that need to be trimmed */
        uint64_t tr_trimmed; /* bytes that have been trimmed */
} spa_trim_phys_t;

#define ZPOOL_CONFIG_TRIM_PROG "trim_prog"
#define ZPOOL_CONFIG_TRIM_RATE "trim_rate"
#define ZPOOL_CONFIG_TRIM_START_TIME "trim_start_time"
#define ZPOOL_CONFIG_TRIM_STOP_TIME "trim_stop_time"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be dropped since we want to move this all in to spa_trim_phys_t.

kcondvar_t spa_man_trim_done_cv; /* manual trim has completed */
/* For details on trim start/stop times see spa_get_trim_prog. */
uint64_t spa_man_trim_start_time;
uint64_t spa_man_trim_stop_time;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a spa_trim_phys_t here instead and use those values. Like spa_removing_phys above. We might also want to consider moving the manual and auto trim state in to their own structures as well.

@@ -118,6 +118,8 @@ extern "C" {
#define ESC_ZFS_BOOTFS_VDEV_ATTACH "bootfs_vdev_attach"
#define ESC_ZFS_POOL_REGUID "pool_reguid"
#define ESC_ZFS_HISTORY_EVENT "history_event"
#define ESC_ZFS_TRIM_START "trim_start"
#define ESC_ZFS_TRIM_FINISH "trim_finish"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but it reminds be that we forget to add these for removal and initialize.

kmutex_t vdev_trim_zios_lock;
kcondvar_t vdev_trim_zios_cv;
uint64_t vdev_trim_zios; /* # of in-flight async trim zios */
boolean_t vdev_trim_zios_stop; /* see zio_trim_should_bypass */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to move these up under the vdev_initialize_* members, and even slightly renaming the new fields for consistency would be help but isn't critical.

* a couple unused pointers is too low to justify that risk.
*/
void (*dfl_ck_func)(uint64_t, uint64_t, void *);
void *dfl_ck_arg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constraint really doesn't apply to us, nothing on Linux will call this, We should either drop this or wrap it and the assigned zio_trim_check functions with __sun__ to make it clear this is dead code.

zc.zc_cookie = (uintptr_t)&tci;

if (zfs_ioctl(hdl, ZFS_IOC_POOL_TRIM, &zc) == 0)
return (0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be converted to a new in/out nvlist style ioctl and use libzfs_core.c and lzc_ioctl.

@behlendorf behlendorf added the Type: Feature Feature request or new feature label Jan 18, 2019
metaslab_t *ms = msp;
boolean_t held = MUTEX_HELD(&ms->ms_lock);
if (!held)
mutex_enter(&ms->ms_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this routine is a no-op on nondebug bits, can we also make it not grab/drop the mutex on nondebug?

* ==========================================================================
*/
static uint64_t
metaslab_ff_alloc(metaslab_t *msp, uint64_t size)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are the changes to metaslab_*_alloc() related to TRIM? Could you break them out into a separate PR? (FYI, I would support this change.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're referring to the removal of metaslab_ff_alloc(), this appears to have been done as part of porting your openzfs/openzfs@8629fdc patch from the OpenZFS PR. The removal of this function does certainly seem like something that ought to be in its own standalone PR.

}
}
/*
* If there's a trim ongoing, punch out the holes that will
* be filled back in in metaslab_trim_done.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about be filled in by metaslab_trim_done?

*/
if (msp->ms_trimming_ts != NULL) {
range_tree_walk(msp->ms_trimming_ts, range_tree_remove,
msp->ms_allocatable);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we aren't able to allocate space that is free but in the process of being trimmed. Do we update ms_weight and ms_max_size to reflect that (and be consistent with ms_allocatable)? If not, I think metaslab_group_alloc_normal() will get confused.

Rather than deal with the implications of ms_trimming_ts everywhere, could we instead prevent allocations from metaslabs that we are in the middle of trimming? Like ms_initializing does? (see vdev_initialize_ms_mark())

@@ -3265,8 +3287,7 @@ metaslab_group_alloc_normal(metaslab_group_t *mg, zio_alloc_list_t *zal,
* we may end up in an infinite loop retrying the same
* metaslab.
*/
ASSERT(!metaslab_should_allocate(msp, asize));

/* ASSERT(!metaslab_should_allocate(msp, asize)); XXX */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, this is where we will get confused because some space is not allocatable, but that is not reflected in all the metadata about the metaslab.

@@ -3937,6 +3969,7 @@ metaslab_claim_concrete(vdev_t *vd, uint64_t offset, uint64_t size,
VERIFY3U(range_tree_space(msp->ms_allocatable) - size, <=,
msp->ms_size);
range_tree_remove(msp->ms_allocatable, offset, size);
metaslab_trim_remove(msp, offset, size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that the trimsets should all be empty at this point. Can we assert that, instead of trying to remove this?

* 1) Keeping the trim commands reasonably small.
* 2) Keeping the ability to rollback back for as many txgs as possible.
* 3) Waiting around too long that the user starts to get uneasy about not
* seeing any space being freed after they remove some files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that space which has been freed but not yet trimmed is in fact available for allocation (and thus "free"). That's good. However, this comment implies that this space is not counted as free in the user-visible accounting somewhere. That seems wrong to me - if the space is available for allocation then it is by definition free and should be displayed as such.

@dweeezil dweeezil force-pushed the ntrim3 branch 2 times, most recently from d70be42 to 9efe6c0 Compare January 21, 2019 14:16
@dweeezil
Copy link
Contributor Author

dweeezil commented Jan 21, 2019

Status update: This has just been refreshed with some fixes to the various deadlocks observed while running ztest (which have been left as separate commits). Reviewers should look at 1eda737 and the related follow-on commits in particular. With these changes in place, zloop can now generally survive an overnight run. I've not yet done any actual testing to make sure this plays well (i.e. actually works as opposed to simply not panicking) with allocation classes or non-trivial cases of device removal (other than those which are performed via ztest, which are plenty). We (@behlendorf and myself) will begin addressing the review feedback shortly. Also, some of the new ZTS tests need fixing.

[EDIT]: Note that there were actually 3 (force) pushes in rapid succession because there was an experimental fix attempt left in by accident and also to rebase on current master with recent changes to metaslab.c. Also, I'll note that as new fixes are added, they're commited to the dweeezil:ntrim3-next branch.

dweeezil and others added 18 commits January 21, 2019 08:31
Update dkio.h from Nexenta's version to pick up DKIOCFREE and add their
dkioc_free_util.h header for TRIM support.

Requires-builders: none
Ported by: Tim Chase <tim@chase2k.com>

Porting notes:
    The trim kstats are in zfs/<pool> along with the other per-pool stats.
    The kstats can be cleared by writing to the kstat file.

    Null format parameters to strftime() were replaced with "%c".

    Added vdev trace support.

    New dfl_alloc() function in the SPL is used to allocate arrays of
    dkioc_free_list_t objects since they may be large enough to require
    virtual memory.

Other changes:
    Suppressed kstat creation for pools with "$" names.

    The changes to vdev_raidz_map_alloc() have been minimized in order to allow
    more conflict-free merging with future changes (ABD).

Added the following module parameters:
    zfs_trim - Enable TRIM
    zfs_trim_min_ext_sz - Minimum size to trim
    zfs_txgs_per_trim - Transaction groups over which to batch trims

Requires-builders: none
The extended zpool iostat options -wlqr will display information about
automatic and manual TRIMs.

This commit also fixes a completely unrelated bug in which the IOS_LATENCY
row in the vsx_type_to_nvlist array was missing an entry for the scrub
nvlist.

Requires-builders: none
The blkdev_issue_discard() function has been available for
a long time by the kernel but it only supports synchronous
discards.  The __blkdev_issue_discard() function provides
an asynchronous interface but was added in the 4.6 kernel.

Only supporting synchronously discards can potentially limit
performance when processing a large number of small extents.
To avoid this an asynchronous discard implementation has been
added to vdev_disk.c which builds on existing functionality.

The kernel provided synchronous version remains the default
pending additional functional and performance testing.

Due to different mechamism used for submitting TRIM commands
there were not being properly accounted for in the extended
statistics.  Resolve this by allow for aggregated stats to
be returned as part of the TRIM zio.  This allows for far
better visibility in to the discard request sizes.

Minor documentation updates.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Requires-builders: none
Requires-builders: none
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Requires-builders: none
Porting Notes:
Man page changes dropped for the moment.  This can be reconsiled
when the final version is merged to OpenZFS.  They are accurate
now, only worded a little differently.

Requires-builders: none
1) Removed the first-fit allocator.
2) Moved the autotrim metaslab scheduling logic into vdev_auto_trim.
2a) As a consequence of openzfs#2, metaslab_trimset_t was rendered superfluous. New
   trimsets are simple range_tree_t's.
3) Made ms_trimming_ts remove extents it is working on from ms_tree and then
   add them back in.
3a) As a consequence of openzfs#3, undone all the direct changes to the allocators and
   removed metaslab_check_trim_conflict and range_tree_find_gap.

Porting Notes:
* Removed WITH_*_ALLOCATOR macros and aligned remaining allocations
  with OpenZFS.  Unused wariables warnings resolved with the gcc
  __attribute__ ((unused__ keyword.
* Added missing calls for ms_condensing_cv.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Requires-builders: none
Porting Notes:
* metaslab_sync changes already applied.
* resync of test cases needed

Requires-builders: none
1) Simplified the SM_FREE spacemap writing while a trim is active.
2) Simplified the range_tree_verify in metaslab_check_free.
3) Clarified comment above metaslab_trim_all.
4) Substituted 'flust out' with 'drop' in comment in metaslab_trim_all.
5) Moved ms_prev_ts clearing up to ms_cur_ts claring in metaslab_trim_all.
6) Added recomputation of metaslab weight when metaslab is loaded.
7) Moved dmu_tx_commit inside of spa_trim_update_time.
8) Made the smallest allowable manual trim rate 1/1000th of a metaslab size.
9) Switched to using hrtime_t in manual trim timing logic.
10) Changed "limited" to "preserve_spilled" in vdev_auto_trim.
11) Moved vdev_notrim setting into zio_vdev_io_assess.a

Porting Notes:
  * vdev_disk.c and zio.c hunks already applied.
  * nsec_per_tick -> MSEC2NSEC(1)

Requires-builders: none
… wanting to condense.

Requires-builders: none
Some storage backends such as large thinly-provisioned SANs are very slow
for large trims.  Manual trim now supports "zpool trim -p" (partial trim)
to skip metaslabs for which there is no spacemap.

Signed-off-by: Tim Chase <tim@chase2k.com>
Requires-builders: none
The existing test cases were split in to multiple test cases and
refactored.  There are now test cases for the following:

zpool_trim_001_pos - Verify manual TRIM
zpool_trim_002_pos - Verify manual trim can be interrupted
zpool_trim_003_pos - Verify 'zpool trim -s' rate limiting
zpool_trim_004_pos - Verify 'zpool trim -p' partial TRIM works
zpool_trim_005_neg - Verify bad parameters to 'zpool trim'
zpool_trim_006_neg - Verify bad parameters to 'zpool trim -r'
autotrim_001_pos   - Verify 'autotrim=on' pool data integrity
autotrim_002_pos   - Verify various pool geometries
manualtrim_001_pos - Verify manual trim pool data integrity
manualtrim_002_pos - Verify various pool geometries
manualtrim_003_pos - Verify 'zpool import|export'
manualtrim_004_pos - Verify 'zpool online|offline|replace'
manualtrim_005_pos - Verify TRIM and scrub run concurrently

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Requires-builders: none
* Rename TRIM taskq threads to be more concise for Linux.
* Fix divide by zero panic

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Requires-builders: none
behlendorf and others added 19 commits January 21, 2019 08:32
* Fixed missing taskq_destroy when exporting a pool which is
  being actively trimmed.
* Add auto/manual TRIM coverage to ztest.
* Temporarily disable manualtrim_004_pos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Requires-builders: none
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Requires-builders: none
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Requires-builders: none
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Requires-builders: none
Signed-off-by: Tim Chase <tim@chase2k.com>
Requires-builders: none
Requires-builders: none
Requires-builders: none
NOTE: should be squashed into a previous commit

Requires-builders: none
Also, comment-out the ASSERT(!metaslab_should_allocate(msp, asize));
in metaslab_group_alloc_normal().  It seems that the additional
metaslab_group_sort() performed by trim makes this assertion invalid.

Requires-builders: none
Requires-builders: none
Requires-builders: none
Also, stop trimming in places where initialization is stopped:
spa_vdev_remove_log() and spa_vdev_remove_top().

Requires-builders: none
NOTE: This should be its own separate PR.  It was discovered during
debugging of the TRIM work.

Requires-builders: none
In trim_stop_set() and trim_stop_wait().

Requires-builders: none
... and related locking

Don't take the spa_auto_trim_lock in the sync path: The auto trim taskq
is now stopped and started in response to changes in the "autotrim"
property asynchronously by the spa async task.  If the spa_auto-trim_lock
is taken in the sync task, the system can deadlock
as follows:

	Non-sync context task: Acquires spa_auto_trim_lock via
	    spa_vdev_enter() or some other path.
	Sync task: syncing a change to the "autotrim" property attempts
	    to take spa_auto_trim_lock and blocks.
	Non-sync context task: blocks attempting to take a spa config
	    lock which won't be released until the sync task completes.
	Deadlock.

Also, since the auto trim taskq is now started asynchronously, it
may not yet be ready yet when the sync task calls spa_auto_trim().
Modified spa_auto_trim so it will simply return if the taskq is not
available yet (it will do its thing on the next pass).

Also, avoid starting auto trim taskqs for the "import$" pseudo-pool.

Also, don't attempt to create the taskq if a spa unload is in-progress.

Requires-builders: none
The change in the previous commit to not attempt to create the taskq if
a spa unload is in-progress caused the auto trim taskq to not be started
upon import.

Requires-builders: none
Rather than using the default value for the property, we need to use
the value which might be set by "zpool create -o autotrim=".

Requires-builders: none
Long zloop runs were occasionally hitting the
"ASSERT(spa->spa_man_trim_taskq != NULL)" in spa_man_trim_taskq_destroy().
It's not clear to me how this was happening because the only place
"spa->spa_man_trim_taskq" is cleared is in spa_man_trim_taskq_destroy()
itself which is only called from the (single) spa_async_thread() or
from spa_unload().

To that end, this commit adds a non-NULL check in spa_async_thread()
analagous to the tests which were added when support for stopping/starting
the auto trim taskqs were added to spa_async_thread().

NOTE: Yes, that means I consider this to be a band-aid.
@@ -2110,6 +2159,70 @@ again.
Starts a resilver. If an existing resilver is already running it will be
restarted from the beginning. Any drives that were scheduled for a deferred
resilver will be added to the new one.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add for newlines:

Starts a resilver. If an existing resilver is already running it will be
 restarted from the beginning. Any drives that were scheduled for a deferred
 resilver will be added to the new one.
+.It Xo
+.Nm
 .Cm trim
 .Op Fl p
 .Op Fl r Ar rate | Fl s

@tonyhutter
Copy link
Contributor

Include atrim and mtrim latency histograms in the man page

--- a/man/man8/zpool.8
+++ b/man/man8/zpool.8
@@ -1820,6 +1820,10 @@ Amount of time IO spent in asynchronous priority queues.  Does not include
 disk time.
 .Ar scrub :
 Amount of time IO spent in scrub queue. Does not include disk time.
+.Ar atrim :
+Amount of time IO spent doing a trim operation during an autotrim.
+.Ar mtrim :
+Amount of time IO spent doing a trim operation during a manual trim.
 .It Fl l
 Include average latency statistics:
 .Pp

@tonyhutter
Copy link
Contributor

You'll want to add the new trim_start and trim_end events to man/man5/zfs-events.5.

# You may not use this file except in compliance with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opensolaris.org is no more. Please refer to http://opensource.org/licenses/CDDL-1.0

@pdf

This comment has been minimized.

tests = ['autotrim_001_pos', 'autotrim_002_pos', 'manualtrim_001_pos',
'manualtrim_002_pos', 'manualtrim_003_pos', 'manualtrim_004_pos',
'manualtrim_005_pos']

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add tags = ['functional', 'trim']

@RJVB

This comment has been minimized.

@dweeezil
Copy link
Contributor Author

I'm going to be closing this PR. @behlendorf has been doing some major re-working of the original TRIM work to, among other things:

  • Harmonize its operation with that of device initialization so it's possible, for example, to TRIM individual vdevs.
  • Add per-vdev trim stats
  • Re-work the underlying discard mechanism to use a new ZIO type rather than the DKIOCFREE ioctl.
  • Utilize the new "xlat" vdev method from device initialization where appropriate.
  • Switch from the legacy ioctl interface to the new nvlist-style ioctl interface.
  • Plenty of other restructuring, in particular w.r.t. the management of the trim threads.

Expect a new PR to be posted soon.

@dweeezil dweeezil closed this Feb 15, 2019
Redsandro referenced this pull request Jul 20, 2019
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Status: Code Review Needed Ready for review and testing Type: Feature Feature request or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants