OpenZFS - 6363 Add UNMAP/TRIM functionality (v2) #7363

Closed
wants to merge 23 commits into master from tuxoko:ntrim2

Conversation

@tuxoko (Member) commented Mar 28, 2018

From #5925

Rebased from @dweeezil's ntrim2-next.
Added trim-related manpage changes.

Description

Motivation and Context

How Has This Been Tested?

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.
@tuxoko (Member) commented Mar 29, 2018

This happens a lot in ztest.
WARNING: Pool 'ztest' has encountered an uncorrectable I/O failure and has been suspended.

Also manualtrim_002_pos failed with:
TRIM interrupted it was expected to complete.

@kpande (Member) commented Apr 2, 2018

er, this replaces #5925? why couldn't that PR just be updated instead?

@behlendorf (Member) commented Apr 6, 2018

@kpande, @tuxoko was nice enough to offer to help address some of the remaining issues with TRIM. Thanks! Step one is getting all the old and new tests passing.

This happens a lot in ztest.
WARNING: Pool 'ztest' has encountered an uncorrectable I/O failure and has been suspended.

This looks like one of those remaining issues which needs to be understood and fixed. In the master branch we aren't seeing anything like this, so it's likely TRIM-related.

Also manualtrim_002_pos failed with
TRIM interrupted it was expected to complete.

It also looks like several of the ZTS runs resulted in kernel panics. So there's definitely some work to do root causing these issues.


/* trim I/Os have no single meaningful offset */
if (zio->io_priority != ZIO_PRIORITY_AUTO_TRIM ||
    zio->io_priority != ZIO_PRIORITY_MAN_TRIM)

@trisk (Contributor) commented Apr 10, 2018

As per previous review, the comparison should be &&.

@tuxoko (Member) commented Apr 11, 2018

Right, || doesn't make sense.
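
To make the logic concrete, here is a small self-contained sketch (illustrative only; the enum below is a stand-in, not the real ZFS definitions). With ||, "p != A || p != B" is true for every priority, since no priority can equal both values at once, so the trim I/Os would never be filtered out; && expresses the intended "neither an automatic nor a manual trim":

/* Illustrative stand-in enum, not the real zio priority definitions. */
#include <assert.h>

enum prio { PRIO_READ, PRIO_AUTO_TRIM, PRIO_MAN_TRIM };

/* Corrected form of the check: true only for non-trim I/Os. */
static int not_a_trim(enum prio p)
{
	return (p != PRIO_AUTO_TRIM && p != PRIO_MAN_TRIM);
}

int main(void)
{
	assert(not_a_trim(PRIO_READ));
	assert(!not_a_trim(PRIO_AUTO_TRIM));
	assert(!not_a_trim(PRIO_MAN_TRIM));
	return (0);
}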

Saso Kiselkov and others added some commits Apr 20, 2015

6363 Add UNMAP/TRIM functionality to ZFS
Ported by: Tim Chase <tim@chase2k.com>

Porting notes:
    The trim kstats are in zfs/<pool> along with the other per-pool stats.
    The kstats can be cleared by writing to the kstat file.

    Null format parameters to strftime() were replaced with "%c".

    Added vdev trace support.

    New dfl_alloc() function in the SPL is used to allocate arrays of
    dkioc_free_list_t objects since they may be large enough to require
    virtual memory.

Other changes:
    Suppressed kstat creation for pools with "$" names.

    The changes to vdev_raidz_map_alloc() have been minimized in order to allow
    more conflict-free merging with future changes (ABD).

Added the following module parameters:
    zfs_trim - Enable TRIM
    zfs_trim_min_ext_sz - Minimum size to trim
    zfs_txgs_per_trim - Transaction groups over which to batch trims
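
For context, a hypothetical sketch of how tunables like these are typically exposed by a Linux kernel module; the types, defaults, and permission bits below are assumptions for illustration, not values taken from this change:

#include <linux/module.h>

/* Hypothetical declarations; types and defaults are assumed, not from this PR. */
static int zfs_trim = 1;
static unsigned long zfs_trim_min_ext_sz = 131072;
static unsigned int zfs_txgs_per_trim = 32;

module_param(zfs_trim, int, 0644);
MODULE_PARM_DESC(zfs_trim, "Enable TRIM");

module_param(zfs_trim_min_ext_sz, ulong, 0644);
MODULE_PARM_DESC(zfs_trim_min_ext_sz, "Minimum size to trim");

module_param(zfs_txgs_per_trim, uint, 0644);
MODULE_PARM_DESC(zfs_txgs_per_trim, "Transaction groups over which to batch trims");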
Refresh dkio.h and add dkioc_free_util.h
Update dkio.h from Nexenta's version to pick up DKIOCFREE and add their
dkioc_free_util.h header for TRIM support.
Want manual trim feature to skip never-allocated space
Some storage backends such as large thinly-provisioned SANs are very slow
for large trims.  Manual trim now supports "zpool trim -p" (partial trim)
to skip metaslabs for which there is no spacemap.
Matt Ahrens' review comments.
Porting Notes:
Man page changes dropped for the moment.  This can be reconciled
when the final version is merged to OpenZFS.  They are accurate
now, only worded a little differently.
Want extended zpool iostat trim support
The extended zpool iostat options -wlqr will display information about
automatic and manual TRIMs.

This commit also fixes a completely unrelated bug in which the IOS_LATENCY
row in the vsx_type_to_nvlist array was missing an entry for the scrub
nvlist.
Async TRIM, Extended Stats
The blkdev_issue_discard() function has been available in the
kernel for a long time but it only supports synchronous
discards.  The __blkdev_issue_discard() function provides
an asynchronous interface but was added in the 4.6 kernel.

Only supporting synchronous discards can potentially limit
performance when processing a large number of small extents.
To avoid this an asynchronous discard implementation has been
added to vdev_disk.c which builds on existing functionality.

The kernel provided synchronous version remains the default
pending additional functional and performance testing.
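
A rough sketch of the two kernel interfaces being contrasted here, using the 4.6-era signatures (illustrative only; this is not the vdev_disk.c code from this change, and error handling is elided):

#include <linux/blkdev.h>

/* Synchronous: submits the discard and waits for it to complete. */
static int discard_sync(struct block_device *bdev, sector_t start, sector_t nsect)
{
	return (blkdev_issue_discard(bdev, start, nsect, GFP_NOFS, 0));
}

/*
 * Asynchronous (kernel >= 4.6): builds the discard bio chain without
 * waiting; the caller submits *biop and handles completion, which lets
 * many small extents be in flight at once.
 */
static int discard_async(struct block_device *bdev, sector_t start, sector_t nsect,
    struct bio **biop)
{
	return (__blkdev_issue_discard(bdev, start, nsect, GFP_NOFS, 0, biop));
}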

Due to the different mechanism used for submitting TRIM commands
they were not being properly accounted for in the extended
statistics.  Resolve this by allowing aggregated stats to be
returned as part of the TRIM zio.  This allows for far better
visibility into the discard request sizes.

Minor documentation updates.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Matt Ahrens' review comments, round 3.
1) Removed the first-fit allocator.
2) Moved the autotrim metaslab scheduling logic into vdev_auto_trim.
2a) As a consequence of #2, metaslab_trimset_t was rendered superfluous. New
   trimsets are simple range_tree_t's.
3) Made ms_trimming_ts remove extents it is working on from ms_tree and then
   add them back in.
3a) As a consequence of #3, undone all the direct changes to the allocators and
   removed metaslab_check_trim_conflict and range_tree_find_gap.

Porting Notes:
* Removed WITH_*_ALLOCATOR macros and aligned remaining allocations
  with OpenZFS.  Unused variable warnings resolved with the gcc
  __attribute__ ((unused)) keyword.
* Added missing calls for ms_condensing_cv.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix abd_alloc_sametype() panic
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Matt Ahrens' review comments, round 4:
1) Simplified the SM_FREE spacemap writing while a trim is active.
2) Simplified the range_tree_verify in metaslab_check_free.
3) Clarified comment above metaslab_trim_all.
4) Substituted 'flush out' with 'drop' in comment in metaslab_trim_all.
5) Moved ms_prev_ts clearing up to ms_cur_ts clearing in metaslab_trim_all.
6) Added recomputation of metaslab weight when metaslab is loaded.
7) Moved dmu_tx_commit inside of spa_trim_update_time.
8) Made the smallest allowable manual trim rate 1/1000th of a metaslab size.
9) Switched to using hrtime_t in manual trim timing logic.
10) Changed "limited" to "preserve_spilled" in vdev_auto_trim.
11) Moved vdev_notrim setting into zio_vdev_io_assess.

Porting Notes:
  * vdev_disk.c and zio.c hunks already applied.
  * nsec_per_tick -> MSEC2NSEC(1)
Tim Chase's review comments, round 2.
Porting Notes:
* metaslab_sync changes already applied.
* resync of test cases needed
Update and add additional TRIM test cases
The existing test cases were split into multiple test cases and
refactored.  There are now test cases for the following:

zpool_trim_001_pos - Verify manual TRIM
zpool_trim_002_pos - Verify manual trim can be interrupted
zpool_trim_003_pos - Verify 'zpool trim -s' rate limiting
zpool_trim_004_pos - Verify 'zpool trim -p' partial TRIM works
zpool_trim_005_neg - Verify bad parameters to 'zpool trim'
zpool_trim_006_neg - Verify bad parameters to 'zpool trim -r'
autotrim_001_pos   - Verify 'autotrim=on' pool data integrity
autotrim_002_pos   - Verify various pool geometries
manualtrim_001_pos - Verify manual trim pool data integrity
manualtrim_002_pos - Verify various pool geometries
manualtrim_003_pos - Verify 'zpool import|export'
manualtrim_004_pos - Verify 'zpool online|offline|replace'
manualtrim_005_pos - Verify TRIM and scrub run concurrently

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Review feedback
* Rename TRIM taskq threads to be more concise for Linux.
* Fix divide by zero panic

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove vdev_raidz_map_alloc()
Rather than hacking `vdev_raidz_map_alloc()` to get the child
offsets, calculate the values directly.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Review feedback 2
* Fixed missing taskq_destroy when exporting a pool which is
  being actively trimmed.
* Add auto/manual TRIM coverage to ztest.
* Temporarily disable manualtrim_004_pos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add trim manpage
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Fix wrong logical operator
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Wait for 1 sec before checking trim status
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>

@tuxoko force-pushed the tuxoko:ntrim2 branch from 2968849 to aee5c52 Apr 11, 2018

@dweeezil referenced this pull request Apr 27, 2018: OpenZFS - 6363 Add UNMAP/TRIM functionality #5925 (Closed, 1 of 11 tasks complete)
@KernelKyupCom commented May 9, 2018

Hi there,
I am also experiencing the "WARNING: Pool 'ztest3' has encountered an uncorrectable I/O failure and has been suspended." message, and txg_sync_thread is blocked for more than 120 seconds (a hung task).

After some investigation it turns out that ENOSPC is returned by metaslab_alloc.
The actual call stack is:
#10 [ffffc90003c17bc0] metaslab_alloc at ffffffffa0508fdb [zfs]
#11 [ffffc90003c17cb0] zio_dva_allocate at ffffffffa058f470 [zfs]
#12 [ffffc90003c17da8] zio_execute at ffffffffa058b28a [zfs]
#13 [ffffc90003c17dd8] taskq_thread at ffffffffa04113d9 [spl]
#14 [ffffc90003c17f00] kthread at ffffffff8107ae2d
#15 [ffffc90003c17f50] ret_from_fork at ffffffff818001fa

I am creating zfs pool in a sparse file image.

I can easily reproduce it by copying two sets of small files - the Linux kernel tree (including its git) and my ccache directory.
Basically I am importing the pool, copying the first set, creating snapshot1, deleting the files and then trimming. Then the same for the second set of files in another snapshot2. Then I export the pool, reimport it, destroy both snapshots, trim again and export.
I am attaching my test script (just edit the paths): zfs_test.sh.txt
About the hung task: it seems we are AB-BA deadlocking. See the attached log:
lockdep_4.16.7.log

I hope this helps.
Any ideas? Is there any other useful data I could capture to help resolve this?

Best regards,
Angel

VERIFY(!msp->ms_condensing);
if (msp->ms_loaded) {
	range_tree_walk(msp->ms_trimming_ts, range_tree_add,
	    msp->ms_tree);

@aerusso (Contributor) commented Jun 9, 2018

Sorry if this has already been addressed elsewhere. During the time the trimming was being done (initiated by metaslab_exec_trim), ms_trimming_ts is removed from msp->ms_tree. If an allocation request occurs on such an mg, isn't it possible that mg->mg_no_free_space could be set (because of the unavailable range---the range being trimmed)?

And if so, is there code that would trigger a re-evaluation of the validity of that mg_no_free_space flag (now that said extent is again available for allocation)? If not, that could explain @KernelKyupCom's ENOSPC issue.

@KernelKyupCom commented Jul 12, 2018

Hi guys,
I've been running the code rebased to zfs-0.7.9-11 for a month. I haven't experienced the ENOSPC issue anymore, but we hit another problem: within a few days our machines run out of PIDs because of leaked threads.
I've spent some time investigating it and found a way to reproduce it relatively easily (not every time, but within several minutes).
It is just import && trim && export in an endless loop like this:

while [ true ]; do
    zpool import -d /root/ssd/zpool_test2/r ztest2 && zpool trim ztest2 && zpool export ztest2
done

The important part here is to export the pool immediately after the trim is issued. The issue occurs when the trim-complete event is raised and SPA_ASYNC_MAN_TRIM_TASKQ_DESTROY is dispatched while the spa_async_thread is already suspended (or not yet scheduled before it gets suspended) by the export ioctl. In that case the export path will not take care of destroying the man_trim taskq at all, and it remains out there forever.
Currently I am running totally out of tree, but I guess this should apply, or at least could be easily reused:

diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index 9702df0c26f3..f609d834acb1 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -4624,7 +4624,34 @@ spa_export_common(char *pool, int new_state, nvlist_t **oldconfig,
         */
        spa_open_ref(spa, FTAG);
        mutex_exit(&spa_namespace_lock);
+
+       /*
+        * Before we suspend the async thread, wait for any ongoing trim.
+        * Otherwise we may not execute the man_trim_taskq_destroy scheduled by
+        * man_trim_done because the async thread is suspended, which results
+        * in a taskq thread leak.  Drop the mutex after the trim completes and
+        * reacquire it later when checking for a still-existing taskq with the
+        * async thread suspended.  This gives the done-callback async request
+        * a chance to clean it up in the regular way if the async thread is
+        * still running.  If the request has not been executed before we
+        * suspend the async thread, explicitly clean up the man_trim taskq.
+        */
+       mutex_enter(&spa->spa_man_trim_lock);
+       while (spa->spa_num_man_trimming > 0)
+               cv_wait(&spa->spa_man_trim_done_cv, &spa->spa_man_trim_lock);
+       mutex_exit(&spa->spa_man_trim_lock);
+
        spa_async_suspend(spa);
+       /*
+        * After the suspend, check if there is still a man_trim taskq which
+        * would otherwise leak, and destroy it.  This way its work will be
+        * committed by the txg sync work flushed below.
+        */
+       mutex_enter(&spa->spa_man_trim_lock);
+       if (spa->spa_man_trim_taskq)
+               spa_man_trim_taskq_destroy(spa);
+       mutex_exit(&spa->spa_man_trim_lock);
+
        if (spa->spa_zvol_taskq) {
                zvol_remove_minors(spa, spa_name(spa), B_TRUE);
                taskq_wait(spa->spa_zvol_taskq);

Also, I suppose the same should be done to address the auto trim?
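
For the auto trim case, the handling in spa_export_common would presumably mirror the man_trim hunk above; the sketch below is purely hypothetical (in particular, spa_auto_trim_taskq_destroy() and the exact locking are assumptions based on spa_auto_trim_taskq_create() in spa_misc.c):

/* Hypothetical sketch; function and field names mirror the man_trim case. */
static void
spa_export_reap_auto_trim(spa_t *spa)
{
	/*
	 * With the async thread suspended, a pending request to destroy the
	 * auto-trim taskq would never run, so reap the taskq directly before
	 * the export proceeds.
	 */
	mutex_enter(&spa->spa_auto_trim_lock);
	if (spa->spa_auto_trim_taskq != NULL)
		spa_auto_trim_taskq_destroy(spa);
	mutex_exit(&spa->spa_auto_trim_lock);
}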

I have also changed the trim taskq thread pool name formatting in order to see more clearly what is going on. That way it is much easier to identify who is who and what is happening, since task->comm is truncated to 16 chars by default and the "_man_trim"/"_auto_trim" suffix gets truncated entirely (or at least in my case). That's just a suggestion about the naming convention.

diff --git a/module/zfs/spa_misc.c b/module/zfs/spa_misc.c
index 8a91b40510da..ab0443f5537e 100644
--- a/module/zfs/spa_misc.c
+++ b/module/zfs/spa_misc.c
@@ -2257,7 +2257,7 @@ spa_auto_trim_taskq_create(spa_t *spa)
 
        ASSERT(MUTEX_HELD(&spa->spa_auto_trim_lock));
        ASSERT(spa->spa_auto_trim_taskq == NULL);
-       (void) snprintf(name, MAXPATHLEN, "%s_auto_trim", spa->spa_name);
+       (void) snprintf(name, MAXPATHLEN, "atrim_%s", spa->spa_name);
        spa->spa_auto_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
            spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
        VERIFY(spa->spa_auto_trim_taskq != NULL);
@@ -2283,7 +2283,7 @@ spa_man_trim_taskq_create(spa_t *spa)
                 */
                return;
        }
-       (void) snprintf(name, MAXPATHLEN, "%s_man_trim", spa->spa_name);
+       (void) snprintf(name, MAXPATHLEN, "mtrim_%s", spa->spa_name);
        spa->spa_man_trim_taskq = taskq_create(name, 1, minclsyspri, 1,
            spa->spa_root_vdev->vdev_children, TASKQ_DYNAMIC);
        VERIFY(spa->spa_man_trim_taskq != NULL);

I am open to any criticism of this fix.

Best regards,
Angel Shtilianov

@aerusso (Contributor) commented Jul 15, 2018

@tuxoko Would it be possible to update this PR to @dweeezil's post-removal rebase? It might be helpful to see the test suite results.

@tuxoko (Member) commented Jul 16, 2018

@aerusso
I'll look into it. Unfortunately, my priorities have changed recently; I'll try to find time to do it.

@dweeezil (Member) commented Jul 16, 2018

@aerusso FYI, my post-removal-rebase branch, which I'd not yet widely publicized, is in fact based on @tuxoko's "v2" work. I've not had the time to test it at all yet, short of making sure it compiles. I hope to get it tested out in the next couple of days.

@kpande (Member) commented Sep 4, 2018

@dweeezil any updates?

@mailinglists35 commented Sep 25, 2018

Is there any chance this will be ready for the 0.8.0 release? Or 0.8.x?

@RJVB commented Oct 24, 2018

Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

@mailinglists35 commented Oct 25, 2018

Maybe not the best place to ask, but what alternatives are there in the meantime until the change is incorporated? How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

switch to freebsd https://blog.plein.org/2014/04/15/freebsd-9-2-supports-zfs-trimunmap/

@RJVB commented Oct 25, 2018 (comment minimized)

@drescherjm commented Oct 25, 2018

How about not giving the entire SSD to ZFS, but keeping a part of it for some other filesystem, or just swap - would that allow the kernel to TRIM the entire device or only the non-ZFS part?

It would be only the non-ZFS part.

@RJVB commented Oct 25, 2018 (comment minimized)

@mailinglists35 commented Oct 25, 2018

It would be only the non zfs part
Well, that's not a big deal if I can do the ZFS part by importing the pool under a FreeBSD clone. Would importing the pool be enough, or would I'd have to do a scrub or something similar?

I suggest we stop discussing this here and now, before a mod locks this thread. Use the mailing lists for this.

@dweeezil (Member) commented Jan 8, 2019

Replaced with #8255.

@dweeezil closed this Jan 8, 2019
