-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce read/write kstats per dataset #7705
Conversation
Just curious, are there any plans to expose this to userspace, ala a |
@bunder2015 No plans from my side as of now, but these stats could definitely be used as sources for a utility like that. |
module/zfs/zfs_vfsops.c
Outdated
* GUID from the pool's config which could change upona reguid). | ||
*/ | ||
char kstat_name[KSTAT_STRLEN]; | ||
int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Linux, unlike illumos, there is already a subdirectory for the pool kstats, by name. This works in Linux because the kstat namespace is heirarchical, not flat like illumos. Rather than using the load guid here, let's use the pool's subdirectory.
I'm thinking something like:
int n = snprintf(kstat_name, sizeof (kstat_name), "%s/objset-0x%llx",
spa_name(dmu_objset_spa(zfsvfs->z_os)),
(unsigned long long)dmu_objset_id(zfsvfs->z_os));
we might be able to get more clever with the objset name, too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If you read my comment in the code I do mention that (and I clarify more on my other comment on github). The reason why I do this is to keep the codebase and the format of the key for retrieval the same between OSs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to changing this to something else than what I have right now. That said, I think it would be easier to do this for now from the perspective of porting. Plus in the future if some utility ships in the zol userland that is ported to illumos and FreeBSD, the routines for formatting the key for the kstat retrieval will be the same across all platforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it (kstats) already isn't the same in Linux. There is no kstat(1), so there isn't a way other than traversing the procfs tree to get there. Use the tree, Luke!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want the code and the behavior of ZFS to be consistent and uniform between different OSes
Normally yes, but in this case I'm not convinced it's all that important. The way kstats are accessed on Linux is different enough from the other platforms that I don't really see any benefit from inheriting these old limitations. Any code that consumes the kstats is already going to need to be Linux specific unless someone ports the userspace kstat libraries to Linux too.
I think adopting @richardelling's proposed naming scheme for now would be reasonable. Since pools can't currently be renamed without re-importing the pool name is functionally as good as the load_guid. If we wanted to get fancier creating a subdirectory for each objset id would be nice, we're going to inevitably want to add additional kstats per-dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. I'll change the current design to the above naming scheme.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest preserving everything but the final name to make it easy for the other OpenZFS implementations to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review update changed the naming scheme to the one proposed above.
module/zfs/zvol.c
Outdated
* naming scheme was decided. | ||
*/ | ||
char kstat_name[KSTAT_STRLEN]; | ||
int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment on pool naming in kstat space above
module/zfs/zvol.c
Outdated
kstat_named_t zk_nread; | ||
} zvol_kstats_t; | ||
|
||
static zvol_kstats_t empty_zvol_kstats = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in theory, we could add the zvol name here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how useful would that be here.
If you were able to get the objsetID for a dataset, I assume you already had its name.
If there is a practical use-case that I'm missing though I'd be open to including this change.
module/zfs/zfs_vfsops.c
Outdated
@@ -972,6 +972,152 @@ zfs_id_overquota(zfsvfs_t *zfsvfs, uint64_t usedobj, uint64_t id) | |||
zfs_id_overobjquota(zfsvfs, usedobj, id)); | |||
} | |||
|
|||
static zfsvfs_kstats_t empty_zfsvfs_kstats = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in theory, we could add the dataset name here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review update added a field with the dataset name.
module/zfs/zfs_vfsops.c
Outdated
(unsigned long long)dmu_objset_id(zfsvfs->z_os)); | ||
|
||
/* | ||
* At the time of this writing, KSTAT_STRLEN is 31 (in illumos), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KSTAT_STRLEN = 255 for Linux
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you read the comment all the way to the end, I do mention that KSTAT_STRLEN in Linux is currently 255, but also wanted to give some background on why I used the load_guid + objset ID in hex (again, trying to keep things uniform between Linux, Illumos, and BSD - 255 vs 31 & 31)
module/zfs/zfs_vfsops.c
Outdated
* object ID should take 29 characters in total: | ||
* - pool GUID (64 bits - 16 chars in hex - 16 total chars) | ||
* - dash ( 8 bits - 1 char - 17 total chars) | ||
* - object id (48 bits - 12 chars in hex - 29 total chars) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see other comments on the kstat namespace
@@ -240,6 +240,7 @@ typedef enum { | |||
ZPOOL_PROP_MAXDNODESIZE, | |||
ZPOOL_PROP_MULTIHOST, | |||
ZPOOL_PROP_CHECKPOINT, | |||
ZPOOL_PROP_LOAD_GUID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we stick with the proposed naming scheme we should add a ZFS_PROP_ID
dataset property. Otherwise users will need to resort to zdb to map dataset names to their matching kstat entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about this?
$ zfs get -Hp -o value objsetid rpool/swap
203
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's exactly what we need! Since it's a hidden property I didn't even realize that existed, we may want to unhide it and add it to the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review update made this a normal property
module/zfs/zfs_vfsops.c
Outdated
* GUID from the pool's config which could change upona reguid). | ||
*/ | ||
char kstat_name[KSTAT_STRLEN]; | ||
int n = snprintf(kstat_name, sizeof (kstat_name), "%llx-%llx", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want the code and the behavior of ZFS to be consistent and uniform between different OSes
Normally yes, but in this case I'm not convinced it's all that important. The way kstats are accessed on Linux is different enough from the other platforms that I don't really see any benefit from inheriting these old limitations. Any code that consumes the kstats is already going to need to be Linux specific unless someone ports the userspace kstat libraries to Linux too.
I think adopting @richardelling's proposed naming scheme for now would be reasonable. Since pools can't currently be renamed without re-importing the pool name is functionally as good as the load_guid. If we wanted to get fancier creating a subdirectory for each objset id would be nice, we're going to inevitably want to add additional kstats per-dataset.
@@ -93,6 +93,8 @@ zpool_prop_init(void) | |||
ZFS_TYPE_POOL, "<size>", "CAP"); | |||
zprop_register_number(ZPOOL_PROP_GUID, "guid", 0, PROP_READONLY, | |||
ZFS_TYPE_POOL, "<guid>", "GUID"); | |||
zprop_register_number(ZPOOL_PROP_LOAD_GUID, "load_guid", 0, | |||
PROP_READONLY, ZFS_TYPE_POOL, "<load_guid>", "LOAD_GUID"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new properties will need to be added to zpool_get_002_pos
test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I'll update it
module/zfs/zfs_vfsops.c
Outdated
KM_SLEEP); | ||
bcopy(&empty_zfsvfs_kstats, zs, sizeof (empty_zfsvfs_kstats)); | ||
|
||
char *ds_name = kmem_alloc(ZFS_MAX_DATASET_NAME_LEN, KM_SLEEP); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: kmem_zalloc() is safer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
, the | ||
.Sy objsetid | ||
of a dataset is not transferred to other pools when the snapshot is copied | ||
with a send/receive operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be rarely used, outside of the case for kstats. Do you think it is worth mentioning here that the objsetid can be referenced to the kstats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might also mention that the objsetid can be reused (for a new dataset) after the dataset is deleted.
include/sys/zvol.h
Outdated
@@ -22,6 +22,7 @@ | |||
/* | |||
* Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved. | |||
* Copyright (c) 2016 Actifio, Inc. All rights reserved. | |||
* Copyright (c) 2018 by Delphix. All rights reserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There aren't any other changes to this file, so you don't need to update the copyright.
, the | ||
.Sy objsetid | ||
of a dataset is not transferred to other pools when the snapshot is copied | ||
with a send/receive operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might also mention that the objsetid can be reused (for a new dataset) after the dataset is deleted.
module/zfs/zfs_vfsops.c
Outdated
* | ||
* Thus, pool names with lengths 1 to 242 should have valid | ||
* kstats created for their datasets. We do place an assertion | ||
* below to hopefully raise an issue when and if one of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could have a valid pool name that's 250 bytes long, failing an assertion in this case seems wrong. A better behavior would be so silently not create the kstat (and maybe log a zfs_dbgmsg()).
However, the (current) maximum pool name length is 240 bytes:
int
pool_namecheck(const char *pool, namecheck_err_t *why, char *what)
{
const char *c;
/*
* Make sure the name is not too long.
* If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11)
* we need to account for additional space needed by the origin ds which
* will also be snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN".
* Play it safe and enforce this limit even if the pool version is < 11
* so it can be upgraded without issues.
*/
if (strlen(pool) >= (ZFS_MAX_DATASET_NAME_LEN - 2 -
strlen(ORIGIN_DIR_NAME) * 2)) {
if (why)
*why = NAME_ERR_TOOLONG;
return (-1);
}
So the code you have is probably good, but we should update the comment to explain why this works (i.e. pool name is up to 240 chars).
EXCEPT, the above length analysis looks wrong because it doesn't account for the length of the literal "/objset-0x"
in the name (it only counts 1 byte for the -
).
module/zfs/zfs_vfsops.c
Outdated
* pool names: | ||
* - pool name (1 to 256 chars) | ||
* - dash (1 char) | ||
* - object id (1 to 12 chars in hex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like there are 2 spaces after 12
module/zfs/zvol.c
Outdated
{ "reads", KSTAT_DATA_UINT64 }, | ||
{ "nread", KSTAT_DATA_UINT64 }, | ||
{ "dataset_name", KSTAT_DATA_STRING }, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a generic "dataset kstats" rather than separate zvol and zpl kstats? zvol vs zpl would need to call into the generic dataset kstats from different places, but I would think that we could use the same structs, and zvol_kstats_update(), zvol_kstats_create(), zvol_kstats_destroy(), and update funcs could be shared.
You could add a new .c / .h for the dataset kstats, since this doesn't seem closely tied in with e.g. objset code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
include/sys/zfs_vfsops.h
Outdated
kstat_named_t zk_nwritten; | ||
kstat_named_t zk_reads; | ||
kstat_named_t zk_nread; | ||
kstat_named_t zk_ds_name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I'd suggest putting zk_ds_name
first since we may end up adding entries to the end.
include/sys/zfs_vfsops.h
Outdated
@@ -117,6 +134,8 @@ struct zfsvfs { | |||
boolean_t z_xattr_sa; /* allow xattrs to be stores as SA */ | |||
uint64_t z_version; /* ZPL version */ | |||
uint64_t z_shares_dir; /* hidden shares dir */ | |||
kstat_t *z_iokstat; /* kstat of I/O to this zvol */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should consider changing the name to something more generic like z_kstat
. It seems inevitable we'll find more per-dataset properties or statistics that it would be useful to add.
nit: indent missing
module/zfs/zfs_vfsops.c
Outdated
if (n >= KSTAT_STRLEN) | ||
return; | ||
|
||
kstat_t *iokstat = kstat_create("zfs", 0, kstat_name, "zfsvfs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works but isn't consistent with the other entries in the "zfs/<pool-name>" kstats. They only simulate the effect of a hierarchy under /proc by creating a new ks_module
for each imported pool called "zfs/ <pool-name>" to which entries can be added. Each modules namespace is still technically flat. Take a look at spa_txg_history_init()
for an example.
module/zfs/zvol.c
Outdated
{ "reads", KSTAT_DATA_UINT64 }, | ||
{ "nread", KSTAT_DATA_UINT64 }, | ||
{ "dataset_name", KSTAT_DATA_STRING }, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
703d0dd
to
d4b54e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Just a few last things.
include/sys/dataset_kstats.h
Outdated
kstat_named_t dkv_nread; | ||
} dataset_kstat_values_t; | ||
|
||
typedef struct datset_kstats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/datset_kstats/dataset_kstats
module/zfs/dataset_kstats.c
Outdated
dataset_kstats_destroy(dataset_kstats_t *dk) | ||
{ | ||
if (dk->dk_kstats == NULL) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra tab
module/zfs/dataset_kstats.c
Outdated
ASSERT3S(nwritten, >=, 0); | ||
|
||
if (dk->dk_kstats == NULL) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra tab
module/zfs/dataset_kstats.c
Outdated
ASSERT3S(nread, >=, 0); | ||
|
||
if (dk->dk_kstats == NULL) | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra tab
module/zfs/zfs_vfsops.c
Outdated
@@ -1944,6 +1948,7 @@ zfs_umount(struct super_block *sb) | |||
dmu_objset_disown(os, B_TRUE, zfsvfs); | |||
} | |||
|
|||
dataset_kstats_destroy(&zfsvfs->z_kstat); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be called twice, once here and a second time in zfsvfs_free()
. What do you think about moving the dataset_kstats_create()
call in to the if (mounting) { }
conditional of zfsvfs_setup()
. This way you'll only need the single call in zfsvfs_free()
to cover all the cleanup paths.
module/zfs/zfs_vnops.c
Outdated
@@ -937,6 +929,8 @@ zfs_write(struct inode *ip, uio_t *uio, int ioflag, cred_t *cr) | |||
zfsvfs->z_os->os_sync == ZFS_SYNC_ALWAYS) | |||
zil_commit(zilog, zp->z_id); | |||
|
|||
dataset_kstats_update_write_kstats(&zfsvfs->z_kstat, | |||
start_resid - uio->uio_resid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the per-objset IO accounting this PR is adding there already exists per-task IO accounting. The calls for this are task_io_account_read()
and task_io_account_write()
in zpl_read_common_iovec()
and zpl_write_common_iovec()
respectively. As an aside this, is where iotop(8)
is getting its data from.
In the interests of keeping all of the IO accounting in one place let's move these calls in to zfs_write()
and zfs_read()
. This will also have the effect that non-ZPL consumers, like ZVOLS and Lustre servers will get the IO properly accounted to the issuing process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought we probably should leave the task_io_*
calls where they are or we'll end up double accounting volumes since this is already handled by submit_bio()
.
[edit] On closer inspection I see that volumes use the dmu_*
interfaces so this shouldn't be a problem.
d4b54e5
to
93dfb13
Compare
Codecov Report
@@ Coverage Diff @@
## master #7705 +/- ##
==========================================
+ Coverage 78.26% 78.48% +0.21%
==========================================
Files 373 374 +1
Lines 112791 112877 +86
==========================================
+ Hits 88279 88587 +308
+ Misses 24512 24290 -222
Continue to review full report at Codecov.
|
The following patch introduces a few statistics on reads and writes grouped by dataset. These statistics are implemented as kstats (backed by aggregate sums for performance) and can be retrieved by using the dataset objset ID number. The motivation for this change is to provide some preliminary analytics on dataset usage/performance. Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this together. LGTM.
@sdimitro did you have any plan about making something like |
FWIW, we do collect that with telegraf and node_exporter for graphing and capacity planning |
@gmelikov sorry for the late reply - no I didn't, go for it |
Description
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved
by using the dataset's objset ID number and the pool's load_guid.
As a side, the pool's load_guid is also exposed to the userland with
this change similarly to other pool properties and the man page has
been updated to describe the property.
Note: A PR for this change has been opened for OpenZFS in the
following link -> openzfs/openzfs#664
Example output:
Motivation and Context
The motivation for this change is to provide some preliminary statistics
on dataset usage/performance. These statistics can be used by higher
level tools in the userland for analysis and visualizations.
How Has This Been Tested?
The accuracy of the statistics has been tested manually to a
certain extend using both normal filesystems and volumes.
To verify that the performance of the read/write codepaths, that are
generally pretty hot, is not affected I ran the
sequential_reads_arc_cached
test from the perf-regression tests of the test suite. My setup was
an Ubuntu VM running 4.15.0-23-generic with 8 CPUs and 12 GB
of RAM, that used an eager-zeroed SSD of 800GB. The VM was
running on VMware ESX. The performance was the same, even if
I was consuming the kstats every second while the tests where
running.
Types of changes
Checklist:
Signed-off-by
.