zfs cloning takes longer to do as the number of clones increases #6372
There is some easy low-hanging fruit here which should buy you a factor of 2 in performance:

```diff
diff --git a/include/libzfs_impl.h b/include/libzfs_impl.h
index 2efd85e..e72129a 100644
--- a/include/libzfs_impl.h
+++ b/include/libzfs_impl.h
@@ -132,6 +132,7 @@ typedef enum {
 } zfs_share_type_t;
 
 #define	CONFIG_BUF_MINSIZE	262144
+#define	STATS_BUF_MINSIZE	131072
 
 int zfs_error(libzfs_handle_t *, int, const char *);
 int zfs_error_fmt(libzfs_handle_t *, int, const char *, ...);
diff --git a/lib/libzfs/libzfs_dataset.c b/lib/libzfs/libzfs_dataset.c
index d6e8502..c1762a3 100644
--- a/lib/libzfs/libzfs_dataset.c
+++ b/lib/libzfs/libzfs_dataset.c
@@ -402,7 +402,7 @@ get_stats(zfs_handle_t *zhp)
 	int rc = 0;
 	zfs_cmd_t zc = {"\0"};
 
-	if (zcmd_alloc_dst_nvlist(zhp->zfs_hdl, &zc, 0) != 0)
+	if (zcmd_alloc_dst_nvlist(zhp->zfs_hdl, &zc, STATS_BUF_MINSIZE) != 0)
		return (-1);
 	if (get_stats_ioctl(zhp, &zc) != 0)
 		rc = -1;
```

Additional improvements are possible and I think @tcaputi may already be looking into this.
I will be, but I'm not at the moment. I will probably start once encryption is merged and sequential resilvers are completely ready for review.
When creating hundreds of clones (for example using containers with LXD) cloning slows down as the number of clones increases over time. The reason for this is that fetching the clone information using a small zcmd buffer requires two ioctl calls: one to determine the size and a second to return the data. However, this requires gathering the data twice, once to determine the size and again to populate the zcmd buffer to return it to userspace. These are expensive ioctl() calls, so instead, make the default buffer size much larger: 256K. This may sound large, but on 64-bit systems running ZFS this is not a huge chunk of memory for the speed improvement we gain for large sets of clones:

```
          16K zcmd           256K zcmd
Clones  Time    Clones    Time    Clones        %
        (secs)  per sec   (secs)  per sec   improvement
  100      7     14.29       5     20.00      28.57
  200     10     20.00       9     22.22      10.00
  300     19     15.79      18     16.67       5.26
  400     22     18.18      22     18.18       0.00
  500     29     17.24      29     17.24       0.00
  600     39     15.38      39     15.38       0.00
  700     46     15.22      45     15.56       2.17
  800     58     13.79      51     15.69      12.07
  900     74     12.16      61     14.75      17.57
 1000     90     11.11      74     13.51      17.78
 1100     98     11.22      87     12.64      11.22
 1200    102     11.76      95     12.63       6.86
 1300    113     11.50     104     12.50       7.96
 1400    143      9.79     109     12.84      23.78
 1500    145     10.34     132     11.36       8.97
 1600    165      9.70     145     11.03      12.12
 1700    187      9.09     156     10.90      16.58
 1800    210      8.57     166     10.84      20.95
 1900    226      8.41     183     10.38      19.03
 2000    256      7.81     198     10.10      22.66
 2200    311      7.07     238      9.24      23.47
 2400    373      6.43     271      8.86      27.35
 2600    487      5.34     316      8.23      35.11
 3000    619      4.85     426      7.04      31.18
 3400    915      3.72     549      6.19      40.00
 4000   1332      3.00     923      4.33      30.71
```

As one can see, with > 2000 clones we get a 25-40% speed improvement. This patch was originally suggested by Brian Behlendorf (see openzfs#6372), however this is a more generic fix covering all zcmd cases.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
System information

Type | Version/Name
---- | ------------
Distribution Name | Ubuntu
Distribution Version | Artful
Linux Kernel | 4.12
Architecture | x86-64
ZFS Version | 0.6.5.9
SPL Version | 0.6.5.9
Creating clones takes longer and longer as the number of clones increases. Here are the timings I get for the time to create N clones:

Clones | Time (secs)
------ | -----------
100 | 7
200 | 10
300 | 19
400 | 22
500 | 29
600 | 39
700 | 46
800 | 58
900 | 74
1000 | 90
1100 | 98
1200 | 102
1300 | 113
1400 | 143
1500 | 145
1600 | 165
1700 | 187
1800 | 210
1900 | 226
2000 | 256
2200 | 311
2400 | 373
2600 | 487
3000 | 619
3400 | 915
4000 | 1332
Simple bash script to reproduce:
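The script itself did not survive this page capture; a minimal sketch along these lines reproduces the behaviour (the pool, dataset, and clone names are placeholders, not from the original issue):

```shell
#!/bin/bash
# Sketch of a reproducer: snapshot a base dataset once, then create N
# clones of it and report the elapsed time. Needs root and an existing
# pool; it skips cleanly where ZFS is unavailable.

POOL=${POOL:-tank}
N=${N:-1000}

if ! command -v zfs >/dev/null 2>&1; then
	echo "zfs not installed; skipping"
elif [ "$(id -u)" -ne 0 ]; then
	echo "zfs requires root; skipping"
else
	zfs create "$POOL/base"
	zfs snapshot "$POOL/base@snap"

	start=$SECONDS
	for i in $(seq 1 "$N"); do
		zfs clone "$POOL/base@snap" "$POOL/clone-$i"
	done
	echo "$N clones in $((SECONDS - start)) seconds"
fi
```

Re-running with increasing N (100, 200, ... 4000) and recording the elapsed time gives a table like the one above.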
Running strace against zfs create, I see a single ioctl() taking the bulk of the time. I believe this is an ioctl on /dev/zfs, namely ZFS_IOC_OBJSET_STATS, which is getting stats on all the ZFS file systems. This ioctl takes longer to do as the number of clones increases.
perf shows that over 99.9% of the zfs clone time is indeed spent performing this ioctl.
This is a considerable bottleneck and seems to be a deficiency in the API between userspace and the ZFS driver. Is there any way this can be optimized? This is a time-critical bottleneck when dealing with thousands of ZFS clones as backing store for containers.