Permalink
Browse files

OpenZFS 9075 - Improve ZFS pool import/load process and corrupted poo…

…l recovery

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

	7638 Refactor spa_load_impl into several functions
	8961 SPA load/import should tell us why it failed
	7277 zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Porting notes (ZTS):
* Fix 'make dist' target in zpool_import

* The maximum path length allowed by tar is 99 characters.  Several
  of the new test cases exceeded this limit resulting in them not
  being included in the tarball.  Shorten the names slightly.

* Set/get tunables using accessor functions.

* Get last synced txg via the "zfs_txg_history" mechanism.

* Clear zinject handlers in cleanup for import_cache_device_replaced
  and import_rewind_device_replaced in order that the zpool can be
  exported if there is an error.

* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
  ext4 file system to be created on ZFS_DISK2.  Also, there's
  no need to partition ZFS_DISK2 at all.  The partitioning had
  already been disabled for multipath devices.  Among other things,
  the partitioning steals some space from the ext4 file system,
  makes it difficult to accurately calculate the paramters to
  parted and can make some of the tests fail.

* Increase FS_SIZE and FILE_SIZE in the zpool_import test
  configuration now that FILESIZE is larger.

* Write more data in order that device evacuation take lonnger in
  a couple tests.

* Use mkdir -p to avoid errors when the directory already exists.

* Remove use of sudo in import_rewind_config_changed.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: openzfs/openzfs@619c012
Closes #7459
  • Loading branch information...
pzakha authored and behlendorf committed Jul 22, 2016
1 parent afd2f7b commit 6cb8e5306d9696d155ae7a808f56c4e46d69b64c
Showing with 2,841 additions and 553 deletions.
  1. +9 −0 cmd/zpool/zpool_main.c
  2. +1 −0 include/libzfs.h
  3. +2 −0 include/sys/fs/zfs.h
  4. +6 −0 include/sys/spa.h
  5. +13 −0 include/sys/spa_impl.h
  6. +7 −2 include/sys/vdev.h
  7. +0 −1 include/sys/vdev_impl.h
  8. +1 −0 include/sys/zfs_context.h
  9. +17 −2 lib/libzfs/libzfs_import.c
  10. +3 −2 lib/libzfs/libzfs_pool.c
  11. +10 −0 lib/libzpool/kernel.c
  12. +24 −0 man/man5/zfs-module-parameters.5
  13. +542 −360 module/zfs/spa.c
  14. +2 −1 module/zfs/spa_config.c
  15. +27 −3 module/zfs/spa_misc.c
  16. +336 −128 module/zfs/vdev.c
  17. +6 −1 module/zfs/vdev_label.c
  18. +51 −8 module/zfs/vdev_mirror.c
  19. +36 −5 module/zfs/vdev_root.c
  20. +45 −4 module/zfs/zio.c
  21. +13 −2 tests/runfiles/linux.run
  22. +11 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/Makefile.am
  23. +76 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_cache_device_added.ksh
  24. +145 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_cache_device_removed.ksh
  25. +166 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_cache_device_replaced.ksh
  26. +72 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_cache_mirror_attached.ksh
  27. +70 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_cache_mirror_detached.ksh
  28. +113 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_cache_shared_device.ksh
  29. +122 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_devices_missing.ksh
  30. +98 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_paths_changed.ksh
  31. +239 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_rewind_config_changed.ksh
  32. +186 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/import_rewind_device_replaced.ksh
  33. +6 −18 tests/zfs-tests/tests/functional/cli_root/zpool_import/setup.ksh
  34. +10 −16 tests/zfs-tests/tests/functional/cli_root/zpool_import/zpool_import.cfg
  35. +376 −0 tests/zfs-tests/tests/functional/cli_root/zpool_import/zpool_import.kshlib
View
@@ -1820,6 +1820,10 @@ print_status_config(zpool_handle_t *zhp, status_cbdata_t *cb, const char *name,
(void) printf(gettext("currently in use"));
break;
case VDEV_AUX_CHILDREN_OFFLINE:
(void) printf(gettext("all children offline"));
break;
default:
(void) printf(gettext("corrupted data"));
break;
@@ -1919,6 +1923,10 @@ print_import_config(status_cbdata_t *cb, const char *name, nvlist_t *nv,
(void) printf(gettext("currently in use"));
break;
case VDEV_AUX_CHILDREN_OFFLINE:
(void) printf(gettext("all children offline"));
break;
default:
(void) printf(gettext("corrupted data"));
break;
@@ -2752,6 +2760,7 @@ zpool_do_import(int argc, char **argv)
idata.guid = searchguid;
idata.cachefile = cachefile;
idata.scan = do_scan;
idata.policy = policy;
pools = zpool_search_import(g_zfs, &idata);
View
@@ -413,6 +413,7 @@ typedef struct importargs {
int unique : 1; /* does 'poolname' already exist? */
int exists : 1; /* set on return if pool already exists */
int scan : 1; /* prefer scanning to libblkid cache */
nvlist_t *policy; /* rewind policy (rewind txg, etc.) */
} importargs_t;
extern nvlist_t *zpool_search_import(libzfs_handle_t *, importargs_t *);
View
@@ -704,6 +704,7 @@ typedef struct zpool_rewind_policy {
#define ZPOOL_CONFIG_VDEV_TOP_ZAP "com.delphix:vdev_zap_top"
#define ZPOOL_CONFIG_VDEV_LEAF_ZAP "com.delphix:vdev_zap_leaf"
#define ZPOOL_CONFIG_HAS_PER_VDEV_ZAPS "com.delphix:has_per_vdev_zaps"
#define ZPOOL_CONFIG_CACHEFILE "cachefile" /* not stored on disk */
#define ZPOOL_CONFIG_MMP_STATE "mmp_state" /* not stored on disk */
#define ZPOOL_CONFIG_MMP_TXG "mmp_txg" /* not stored on disk */
#define ZPOOL_CONFIG_MMP_HOSTNAME "mmp_hostname" /* not stored on disk */
@@ -811,6 +812,7 @@ typedef enum vdev_aux {
VDEV_AUX_BAD_ASHIFT, /* vdev ashift is invalid */
VDEV_AUX_EXTERNAL_PERSIST, /* persistent forced fault */
VDEV_AUX_ACTIVE, /* vdev active on a different host */
VDEV_AUX_CHILDREN_OFFLINE, /* all children are offline */
} vdev_aux_t;
/*
View
@@ -410,6 +410,7 @@ typedef enum bp_embedded_type {
#define SPA_BLKPTRSHIFT 7 /* blkptr_t is 128 bytes */
#define SPA_DVAS_PER_BP 3 /* Number of DVAs in a bp */
#define SPA_SYNC_MIN_VDEVS 3 /* min vdevs to update during sync */
/*
* A block is a hole when it has either 1) never been written to, or
@@ -1015,11 +1016,16 @@ extern boolean_t spa_has_pending_synctask(spa_t *spa);
extern int spa_maxblocksize(spa_t *spa);
extern int spa_maxdnodesize(spa_t *spa);
extern void zfs_blkptr_verify(spa_t *spa, const blkptr_t *bp);
extern boolean_t zfs_dva_valid(spa_t *spa, const dva_t *dva,
const blkptr_t *bp);
typedef void (*spa_remap_cb_t)(uint64_t vdev, uint64_t offset, uint64_t size,
void *arg);
extern boolean_t spa_remap_blkptr(spa_t *spa, blkptr_t *bp,
spa_remap_cb_t callback, void *arg);
extern uint64_t spa_get_last_removal_txg(spa_t *spa);
extern boolean_t spa_trust_config(spa_t *spa);
extern uint64_t spa_missing_tvds_allowed(spa_t *spa);
extern void spa_set_missing_tvds(spa_t *spa, uint64_t missing);
extern boolean_t spa_multihost(spa_t *spa);
extern unsigned long spa_get_hostid(void);
View
@@ -184,6 +184,15 @@ typedef enum spa_all_vdev_zap_action {
AVZ_ACTION_INITIALIZE
} spa_avz_action_t;
typedef enum spa_config_source {
SPA_CONFIG_SRC_NONE = 0,
SPA_CONFIG_SRC_SCAN, /* scan of path (default: /dev/dsk) */
SPA_CONFIG_SRC_CACHEFILE, /* any cachefile */
SPA_CONFIG_SRC_TRYIMPORT, /* returned from call to tryimport */
SPA_CONFIG_SRC_SPLIT, /* new pool in a pool split */
SPA_CONFIG_SRC_MOS /* MOS, but not always from right txg */
} spa_config_source_t;
struct spa {
/*
* Fields protected by spa_namespace_lock.
@@ -202,6 +211,8 @@ struct spa {
uint8_t spa_sync_on; /* sync threads are running */
spa_load_state_t spa_load_state; /* current load operation */
boolean_t spa_indirect_vdevs_loaded; /* mappings loaded? */
boolean_t spa_trust_config; /* do we trust vdev tree? */
spa_config_source_t spa_config_source; /* where config comes from? */
uint64_t spa_import_flags; /* import specific flags */
spa_taskqs_t spa_zio_taskq[ZIO_TYPES][ZIO_TASKQ_TYPES];
dsl_pool_t *spa_dsl_pool;
@@ -263,6 +274,8 @@ struct spa {
int spa_async_suspended; /* async tasks suspended */
kcondvar_t spa_async_cv; /* wait for thread_exit() */
uint16_t spa_async_tasks; /* async task mask */
uint64_t spa_missing_tvds; /* unopenable tvds on load */
uint64_t spa_missing_tvds_allowed; /* allow loading spa? */
spa_removing_phys_t spa_removing_phys;
spa_vdev_removal_t *spa_vdev_removal;
View
@@ -48,9 +48,12 @@ typedef enum vdev_dtl_type {
extern int zfs_nocacheflush;
extern void vdev_dbgmsg(vdev_t *vd, const char *fmt, ...);
extern void vdev_dbgmsg_print_tree(vdev_t *, int);
extern int vdev_open(vdev_t *);
extern void vdev_open_children(vdev_t *);
extern int vdev_validate(vdev_t *, boolean_t);
extern int vdev_validate(vdev_t *);
extern int vdev_copy_path_strict(vdev_t *, vdev_t *);
extern void vdev_copy_path_relaxed(vdev_t *, vdev_t *);
extern void vdev_close(vdev_t *);
extern int vdev_create(vdev_t *, uint64_t txg, boolean_t isreplace);
extern void vdev_reopen(vdev_t *);
@@ -100,6 +103,7 @@ extern void vdev_scan_stat_init(vdev_t *vd);
extern void vdev_propagate_state(vdev_t *vd);
extern void vdev_set_state(vdev_t *vd, boolean_t isopen, vdev_state_t state,
vdev_aux_t aux);
extern boolean_t vdev_children_are_offline(vdev_t *vd);
extern void vdev_space_update(vdev_t *vd,
int64_t alloc_delta, int64_t defer_delta, int64_t space_delta);
@@ -145,7 +149,8 @@ typedef enum vdev_config_flag {
VDEV_CONFIG_SPARE = 1 << 0,
VDEV_CONFIG_L2CACHE = 1 << 1,
VDEV_CONFIG_REMOVING = 1 << 2,
VDEV_CONFIG_MOS = 1 << 3
VDEV_CONFIG_MOS = 1 << 3,
VDEV_CONFIG_MISSING = 1 << 4
} vdev_config_flag_t;
extern void vdev_top_config_generate(spa_t *spa, nvlist_t *config);
View
@@ -437,7 +437,6 @@ extern void vdev_remove_parent(vdev_t *cvd);
/*
* vdev sync load and sync
*/
extern void vdev_load_log_state(vdev_t *nvd, vdev_t *ovd);
extern boolean_t vdev_log_state_valid(vdev_t *vd);
extern int vdev_load(vdev_t *vd);
extern int vdev_dtl_load(vdev_t *vd);
@@ -674,6 +674,7 @@ typedef struct callb_cpr {
#define zone_dataset_visible(x, y) (1)
#define INGLOBALZONE(z) (1)
extern uint32_t zone_get_hostid(void *zonep);
extern char *kmem_vasprintf(const char *fmt, va_list adx);
extern char *kmem_asprintf(const char *fmt, ...);
View
@@ -897,7 +897,8 @@ vdev_is_hole(uint64_t *hole_array, uint_t holes, uint_t id)
* return to the user.
*/
static nvlist_t *
get_configs(libzfs_handle_t *hdl, pool_list_t *pl, boolean_t active_ok)
get_configs(libzfs_handle_t *hdl, pool_list_t *pl, boolean_t active_ok,
nvlist_t *policy)
{
pool_entry_t *pe;
vdev_entry_t *ve;
@@ -1230,6 +1231,12 @@ get_configs(libzfs_handle_t *hdl, pool_list_t *pl, boolean_t active_ok)
continue;
}
if (policy != NULL) {
if (nvlist_add_nvlist(config, ZPOOL_REWIND_POLICY,
policy) != 0)
goto nomem;
}
if ((nvl = refresh_config(hdl, config)) == NULL) {
nvlist_free(config);
config = NULL;
@@ -2080,7 +2087,7 @@ zpool_find_import_impl(libzfs_handle_t *hdl, importargs_t *iarg)
free(cache);
pthread_mutex_destroy(&lock);
ret = get_configs(hdl, &pools, iarg->can_be_active);
ret = get_configs(hdl, &pools, iarg->can_be_active, iarg->policy);
for (pe = pools.pools; pe != NULL; pe = penext) {
penext = pe->pe_next;
@@ -2209,6 +2216,14 @@ zpool_find_import_cached(libzfs_handle_t *hdl, const char *cachefile,
if (active)
continue;
if (nvlist_add_string(src, ZPOOL_CONFIG_CACHEFILE,
cachefile) != 0) {
(void) no_memory(hdl);
nvlist_free(raw);
nvlist_free(pools);
return (NULL);
}
if ((dst = refresh_config(hdl, src)) == NULL) {
nvlist_free(raw);
nvlist_free(pools);
View
@@ -1935,8 +1935,9 @@ zpool_import_props(libzfs_handle_t *hdl, nvlist_t *config, const char *newname,
nvlist_lookup_nvlist(nvinfo,
ZPOOL_CONFIG_MISSING_DEVICES, &missing) == 0) {
(void) printf(dgettext(TEXT_DOMAIN,
"The devices below are missing, use "
"'-m' to import the pool anyway:\n"));
"The devices below are missing or "
"corrupted, use '-m' to import the pool "
"anyway:\n"));
print_vdev_tree(hdl, NULL, missing, 2);
(void) printf("\n");
}
View
@@ -297,6 +297,16 @@ rw_tryenter(krwlock_t *rwlp, krw_t rw)
return (0);
}
/* ARGSUSED */
uint32_t
zone_get_hostid(void *zonep)
{
/*
* We're emulating the system's hostid in userland.
*/
return (strtoul(hw_serial, NULL, 10));
}
int
rw_tryupgrade(krwlock_t *rwlp)
{
@@ -351,6 +351,18 @@ they operate close to quota or capacity limits.
Default value: \fB24\fR.
.RE
.sp
.ne 2
.na
\fBspa_load_print_vdev_tree\fR (int)
.ad
.RS 12n
Whether to print the vdev tree in the debugging message buffer during pool import.
Use 0 to disable and 1 to enable.
.sp
Default value: \fB0\fR.
.RE
.sp
.ne 2
.na
@@ -701,6 +713,18 @@ the code that may use them. A value of \fB0\fR will default to 6000 ms.
Default value: \fB0\fR.
.RE
.sp
.ne 2
.na
\fBzfs_max_missing_tvds\fR (int)
.ad
.RS 12n
Number of missing top-level vdevs which will be allowed during
pool import (only in read-only mode).
.sp
Default value: \fB0\fR
.RE
.sp
.ne 2
.na
Oops, something went wrong.

0 comments on commit 6cb8e53

Please sign in to comment.