OpenZFS 9166 - zfs storage pool checkpoint #7570

Closed
wants to merge 2 commits into from

Conversation

dweeezil
Contributor

Description

See references below.

Motivation and Context

Details about the motivation of this feature and its usage can
be found in this blogpost:

https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in the big block comment of
spa_checkpoint.c.

Side-changes that are relevant to this commit but not explained
elsewhere:

  • renames metaslab trees to be shorter without losing meaning
  • space_map_{alloc,truncate}() accept a block size as a
    parameter. The reason is that in the current state all space
    maps that we allocate through the DMU use a global tunable
    (space_map_blksz) which defaults to 4KB. This is ok for
    metaslab space maps in terms of bandwidth since they are
    scattered all over the disk. But for other space maps this
    default is probably not what we want. Examples are device
    removal's vdev_obsolete_sm or vdev_checkpoint_sm from this
    review. Both of these have a 1:1 relationship with each vdev
    and could benefit from a bigger block size.
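
To make the block-size argument concrete, here is a minimal sketch (not code from this PR) that counts how many blocks a space map of a given length occupies at a given block size. The helper name `sm_block_count()` and the 128KB figure used below are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of ZFS: number of blocks needed to
 * hold sm_length bytes of space map entries at a given block size.
 */
static uint64_t
sm_block_count(uint64_t sm_length, uint64_t blksz)
{
	assert(blksz != 0);
	/* Round up so a partially filled last block is counted. */
	return ((sm_length + blksz - 1) / blksz);
}
```

For a 1MB space map, `sm_block_count(1 << 20, 4096)` is 256 while `sm_block_count(1 << 20, 131072)` is 8, so a per-vdev space map read end to end would need far fewer, larger blocks.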

Authored by: Serapheim Dimitropoulos serapheim.dimitro@delphix.com
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: John Kennedy john.kennedy@delphix.com
Reviewed by: Dan Kimmel dan.kimmel@delphix.com
Approved by: Richard Lowe richlowe@richlowe.net
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: openzfs/openzfs@7159fdb
Ported-by: Tim Chase tim@chase2k.com
Signed-off-by: Tim Chase tim@chase2k.com

Porting notes:

The part of dsl_scan_sync() which handles async destroys has
been moved into the new dsl_process_async_destroys() function.

Renamed module parameter metaslabs_per_vdev to vdev_max_ms_count.

New module parameter:

    zfs_spa_discard_memory_limit,

Desirable administrative feature, particularly to allow reverting pool configuration changes.

How Has This Been Tested?

Light manual testing so far.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@dweeezil
Contributor Author

I've only done light manual testing of this feature and it seems to work properly. Linux-specific changes to the ZTS have not yet been made. It's being submitted now to get an initial run through the buildbots to find any gross problems.

@dweeezil dweeezil force-pushed the illumos-9166 branch 8 times, most recently from 1399feb to 68ff1ff Compare May 30, 2018 16:05
@ahrens ahrens requested a review from sdimitro May 30, 2018 17:08
@dweeezil dweeezil force-pushed the illumos-9166 branch 2 times, most recently from 1615849 to 7db8ffc Compare June 1, 2018 03:11
Contributor

@sdimitro sdimitro left a comment

Hello there!

Thank you for working on this.
Overall the changes look good.
I also went through the split in the dsl_scan() code that you brought up and it seems fine too.

I found a couple of small omissions that I pointed out below and I also raised some questions (most of them are probably due to my lack of knowledge of Linux dev).

@@ -397,6 +398,8 @@ ztest_info_t ztest_info[] = {
ZTI_INIT(ztest_vdev_aux_add_remove, 1, &ztest_opts.zo_vdevtime),
ZTI_INIT(ztest_device_removal, 1, &zopt_sometimes),
ZTI_INIT(ztest_remap_blocks, 1, &zopt_sometimes),
ZTI_INIT(ztest_remap_blocks, 1, &zopt_sometimes),
Contributor

no need to duplicate ZTI_INIT(ztest_remap_blocks....),

Contributor Author

Fixed.

@@ -6977,6 +7074,7 @@ ztest_import(ztest_shared_t *zs)

(void) rwlock_destroy(&ztest_name_lock);
mutex_destroy(&ztest_vdev_lock);
mutex_destroy(&ztest_checkpoint_lock);
Contributor

Please correct me if I'm missing something here, but shouldn't we init the lock in this function before we destroy it (similarly to ztest_vdev_lock, which is destroyed here but is also initialized in this function)?

Contributor Author

Fixed.

@@ -6991,6 +7089,7 @@ ztest_init(ztest_shared_t *zs)
int i;

mutex_init(&ztest_vdev_lock, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&ztest_checkpoint_lock, NULL, USYNC_THREAD, NULL);
Contributor

Should we also destroy this lock in this function the same way we destroy ztest_vdev_lock?

Contributor Author

@dweeezil dweeezil Jun 2, 2018

Fixed. These init/destroy mis-merges were due to ztest_import() which was added by the MMP work.
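
The pairing rule discussed in this thread can be sketched as follows. This is an illustration only: it uses plain POSIX mutexes as stand-ins for the ztest locks, and `run_phase()` is a hypothetical function, not anything in ztest.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins for ztest_vdev_lock and ztest_checkpoint_lock. */
static pthread_mutex_t vdev_lock;
static pthread_mutex_t checkpoint_lock;

/*
 * Hypothetical phase function: every lock initialized on entry is
 * destroyed before returning, so the function can be called again.
 * Returns 0 on success or a pthread error code.
 */
static int
run_phase(void)
{
	int err;

	if ((err = pthread_mutex_init(&vdev_lock, NULL)) != 0)
		return (err);
	if ((err = pthread_mutex_init(&checkpoint_lock, NULL)) != 0) {
		(void) pthread_mutex_destroy(&vdev_lock);
		return (err);
	}

	/* Placeholder for work that takes the locks. */
	err = pthread_mutex_lock(&checkpoint_lock);
	assert(err == 0);
	err = pthread_mutex_unlock(&checkpoint_lock);
	assert(err == 0);

	/* Destroy in reverse order of initialization. */
	err = pthread_mutex_destroy(&checkpoint_lock);
	if (err == 0)
		err = pthread_mutex_destroy(&vdev_lock);
	else
		(void) pthread_mutex_destroy(&vdev_lock);
	return (err);
}
```

Keeping init and destroy in the same function makes a mis-merge like the one above easier to spot, since each function can be checked on its own.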

@@ -497,8 +497,6 @@ vn_open(char *path, int x1, int flags, int mode, vnode_t **vpp, int x2, int x3)
#ifdef __linux__
flags |= O_DIRECT;
#endif
/* We shouldn't be writing to block devices in userspace */
VERIFY(!(flags & FWRITE));
Contributor

Out of curiosity, what is this for? Is this related somehow to the rest of the checkpoint changes?

Contributor

This was an additional sanity check which was added a while back when we only expected zdb to access the block devices read-only from user space. When zhack was added that changed and this check wasn't dropped at the time. One of the new test cases uncovered this since it uses zhack on a pool with disks instead of files.

Contributor

Makes sense. Thanks!

@@ -574,6 +579,29 @@ dsl_pool_dirty_delta(dsl_pool_t *dp, int64_t delta)
cv_signal(&dp->dp_spaceavail_cv);
}

#ifdef ZFS_DEBUG
Contributor

Is that #ifdef really necessary? This function is only called through an ASSERT.

Contributor Author

It prevents an unused function warning in non-debug builds.

*/
#define __EXTENSIONS__

#include <libzfs_core.h>
Contributor

I may be missing something but why is libzfs_core.h included here?

Contributor Author

IIRC, it was to give access to the typedefs such as uint64_t, etc.

Contributor

Wouldn't something like stdint.h work?
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/stdint.h.html

It just seems unnecessary to me to include a whole new dependency (libzfs_core) for a small utility like randwritecomp.

Contributor Author

Yes indeed. Using libzfs_core.h was expedient at the time. Using stdint.h works perfectly well and also allows removing -I$(top_srcdir)/lib/libspl/include from Makefile.am. I'll update it accordingly and also add an include of string.h to grab the declaration of strcmp(), which we apparently need under Linux.
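
As a rough illustration of why the two standard headers suffice, the sketch below uses only `uint64_t` from stdint.h and `strcmp()` from string.h. The helpers `is_flag()` and `block_offset()` are hypothetical and are not taken from randwritecomp.c.

```c
#include <assert.h>
#include <stdint.h>	/* uint64_t */
#include <string.h>	/* strcmp() */

/* Hypothetical helper: does a command-line argument match a flag? */
static int
is_flag(const char *arg, const char *flag)
{
	return (strcmp(arg, flag) == 0);
}

/* Hypothetical helper: byte offset of a block, in 64-bit arithmetic. */
static uint64_t
block_offset(uint64_t blkno, uint64_t blksz)
{
	return (blkno * blksz);
}
```

Nothing here needs a ZFS header, so a small utility written this way would build without the libspl include path.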

Contributor

That's great! Thank you Tim

log_must truncate -s $DISKSIZE $FILEDISK1
log_must truncate -s $DISKSIZE $FILEDISK2

log_must zpool create -O sync=disabled $NESTEDPOOL $FILEDISKS
Contributor

That's probably some illumos/Linux difference that I'm missing, but why is sync disabled for these pools?

Contributor Author

This was an attempt to speed up the "setup" process, which all by itself was taking almost 10 minutes per test in a VM guest. I've not yet determined whether the sync=disabled makes any difference. I added it on both the outer and the nested pools. Speaking of which, are the nested pools really necessary? It was my understanding (and @behlendorf's too, I think) that nested pools weren't guaranteed to be deadlock-free (but maybe they are in illumos).

-I$(top_srcdir)/lib/libspl/include

pkgexec_PROGRAMS = randwritecomp
mktree_SOURCES = randwritecomp.c
Contributor

s/mktree_SOURCES/randwritecomp_SOURCES

Contributor Author

Oops, copy/paste bug. Fixed.


#ifdef DEBUG
spa_checkpoint_accounting_verify(vd->vdev_spa);
#endif
Contributor

Results in a 'spa_checkpoint_accounting_verify' defined but not used [-Wunused-function] warning.

I'd suggest wrapping the function itself with ZFS_DEBUG. You could also

#define spa_checkpoint_accounting_verify(spa) ((void) 0)

for the non-debug case to get rid of the unsightly DEBUG wrapper where it's called.

Member

Personally I find this way the least ugly:

/* ARGSUSED */
void
foo_verify(...)
{
#ifdef ZFS_DEBUG
...
#endif
}

Contributor Author

Changed to ZFS_DEBUG. Also wrapped the definition of spa_checkpoint_accounting_verify() in an #ifdef ZFS_DEBUG.
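
For comparison, the macro-based alternative raised in this thread might look like the sketch below. It is not what the PR ended up doing (the PR wraps both the definition and the call site in #ifdef ZFS_DEBUG), and `accounting_verify()` and `release()` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Build with -DZFS_DEBUG to compile the check in. */
#ifdef ZFS_DEBUG
/* Hypothetical verifier; only exists in debug builds. */
static void
accounting_verify(int allocated, int freed)
{
	assert(allocated >= freed);
}
#else
/* Function-like macro: the call site compiles away to nothing. */
#define	accounting_verify(allocated, freed)	((void) 0)
#endif

static int
release(int allocated, int freed)
{
	/* No #ifdef needed here in either build flavor. */
	accounting_verify(allocated, freed);
	return (allocated - freed);
}
```

Because the static function is only defined when it is used, neither build flavor should trigger -Wunused-function, and the caller stays free of preprocessor guards.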

@@ -0,0 +1,25 @@
#!/usr/bin/ksh -p
Contributor

This should be #!/bin/ksh for all the new test scripts. It was causing the CentOS 6 failures.

Contributor Author

Yes! I saw that in one of the tests introduced by vdev removal and totally forgot to look for it.

@dweeezil dweeezil force-pushed the illumos-9166 branch 4 times, most recently from df76923 to bbbd614 Compare June 7, 2018 03:54
@ahrens ahrens added the Type: Feature Feature request or new feature label Jun 7, 2018
@dweeezil dweeezil force-pushed the illumos-9166 branch 3 times, most recently from ef59091 to 3b69b11 Compare June 13, 2018 22:44
@codecov
codecov bot commented Jun 14, 2018

Codecov Report

Merging #7570 into master will increase coverage by 0.07%.
The diff coverage is 89.11%.

@@            Coverage Diff             @@
##           master    #7570      +/-   ##
==========================================
+ Coverage   78.04%   78.12%   +0.07%     
==========================================
  Files         366      368       +2     
  Lines      110618   111627    +1009     
==========================================
+ Hits        86329    87205     +876     
- Misses      24289    24422     +133

Flag      Coverage Δ
#kernel   78.68% <92.94%> (+0.21%) ⬆️
#user     67.05% <74.78%> (+0.09%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7637ef8...5dd9c73. Read the comment docs.

@dweeezil dweeezil force-pushed the illumos-9166 branch 3 times, most recently from 5f0c492 to cddfe80 Compare June 19, 2018 14:17
Otherwise the output is consumed by the output redirection.

Signed-off-by: Tim Chase <tim@chase2k.com>
Requires-builders: none
@dweeezil dweeezil force-pushed the illumos-9166 branch 3 times, most recently from 97c4672 to 3397340 Compare June 24, 2018 20:32
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in the big block comment of
spa_checkpoint.c.

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames the trees in "struct metaslab" to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defaults to 4KB. This is ok for metaslab
  space maps in terms of bandwidth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_checkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: openzfs/openzfs@7159fdb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

Porting notes:

    The part of dsl_scan_sync() which handles async destroys has
    been moved into the new dsl_process_async_destroys() function.

    Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
    to block device backed pools.

    ZTS:

        Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

        Don't use large dd block sizes on /dev/urandom under Linux in
        checkpoint_capacity.

        Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
        SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
        its attempts to fill the pool

        Create the base and nested pools with sync=disabled to speed up
        the "setup" phase.

        Clear labels in test pool between checkpoint tests to avoid
        duplicate pool issues.

        The import_rewind_device_replaced test has been marked as "known
        to fail" for the reasons listed in its DISCLAIMER.

    New module parameters:

        zfs_spa_discard_memory_limit,
        zfs_remove_max_bytes_pause (not documented - debugging only)
        vdev_max_ms_count (formerly metaslabs_per_vdev)
        vdev_min_ms_count
behlendorf pushed a commit that referenced this pull request Jun 26, 2018
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in the big block comment of
spa_checkpoint.c.

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames the trees in "struct metaslab" to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defaults to 4KB. This is ok for metaslab
  space maps in terms of bandwidth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_checkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Porting notes:

* The part of dsl_scan_sync() which handles async destroys has
  been moved into the new dsl_process_async_destroys() function.

* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
  to block device backed pools.

* ZTS:
  * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

  * Don't use large dd block sizes on /dev/urandom under Linux in
    checkpoint_capacity.

  * Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
    SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
    its attempts to fill the pool

  * Create the base and nested pools with sync=disabled to speed up
    the "setup" phase.

  * Clear labels in test pool between checkpoint tests to avoid
    duplicate pool issues.

  * The import_rewind_device_replaced test has been marked as "known
    to fail" for the reasons listed in its DISCLAIMER.

* New module parameters:

    zfs_spa_discard_memory_limit,
    zfs_remove_max_bytes_pause (not documented - debugging only)
    vdev_max_ms_count (formerly metaslabs_per_vdev)
    vdev_min_ms_count

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: openzfs/openzfs@7159fdb8
Closes #7570
Labels
Type: Feature Feature request or new feature
4 participants