zfs cloning takes longer to do as the number of clones increases #6372

Closed
ColinIanKing opened this issue Jul 19, 2017 · 3 comments · Fixed by #9084
Labels
Type: Performance Performance improvement or performance problem

Comments


ColinIanKing commented Jul 19, 2017

System information

Type | Version/Name
Distribution Name | Ubuntu
Distribution Version | Artful
Linux Kernel | 4.12
Architecture | x86-64
ZFS Version | 0.6.5.9
SPL Version | 0.6.5.9

Creating clones takes longer and longer as the number of clones increases. Here are the timings I measured for creating N clones:

Clones | Time (secs)
100 | 7
200 | 10
300 | 19
400 | 22
500 | 29
600 | 39
700 | 46
800 | 58
900 | 74
1000 | 90
1100 | 98
1200 | 102
1300 | 113
1400 | 143
1500 | 145
1600 | 165
1700 | 187
1800 | 210
1900 | 226
2000 | 256
2200 | 311
2400 | 373
2600 | 487
3000 | 619
3400 | 915
4000 | 1332

Simple bash script to reproduce:

#!/bin/bash

# Destroy any clones from the previous run, recreate the base file system
# and snapshot, then time how long it takes to create ${1} clones.
clone() {
    echo "Destroying existing clones.."
    zfs destroy -R testpool/testzfs
    rm -Rf /tmp/testzfs
    zfs create testpool/testzfs -o mountpoint=none
    zfs snapshot testpool/testzfs@base
    mkdir /tmp/testzfs
    echo "Creating ${1} clones"
    BASE=$(date +%s)
    for i in $(seq ${1}); do
        UUID=$(uuidgen)
        zfs clone testpool/testzfs@base testpool/testzfs-$UUID -o mountpoint=/tmp/testzfs/${UUID}
    done
    END=$(date +%s)
    TOTAL=$((END-BASE))
    # Clones created per second; guard against a zero-second run
    PER=$((${1} / (TOTAL > 0 ? TOTAL : 1)))
    echo "Took: $TOTAL seconds ($PER/s)"
}

N=0
while true
do
        N=$((N + 100))
        clone $N
done

Running strace against zfs create, I see the following ioctl() taking up the time:

1500475028.118005 ioctl(3, _IOC(0, 0x5a, 0x12, 0x00), 0x7ffc7c2184f0) = -1 ENOMEM (Cannot allocate memory) <0.390093>
1500475028.508153 mmap(NULL, 290816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fbfd487b000 <0.000017>
1500475028.508201 ioctl(3, _IOC(0, 0x5a, 0x12, 0x00), 0x7ffc7c2184f0) = 0 <0.382304>

I believe this is an ioctl on /dev/zfs, namely ZFS_IOC_OBJSET_STATS, which gets stats on the ZFS file systems. This ioctl takes longer as the number of clones increases.
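
If I read the trace right, the first ioctl() is issued with a fixed-size destination buffer, the kernel gathers the stats, finds they do not fit and returns ENOMEM, and userspace then maps a larger buffer (the ~284K mmap above) and repeats the call, so the expensive stats gathering runs twice. A minimal standalone sketch of that grow-and-retry shape (illustrative only, not the actual libzfs code; the names here are made up):

/*
 * Illustrative sketch only -- not the actual libzfs/kernel code.  It mimics
 * the shape of the trace above: the caller supplies a fixed-size destination
 * buffer, the "kernel" side reports ENOMEM together with the size it really
 * needs, and the caller grows the buffer and repeats the call, so the
 * expensive stats gathering runs twice.
 */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define	INITIAL_DST_SIZE	(16 * 1024)	/* analogous to the 16K default */

/* Stand-in for the ZFS_IOC_OBJSET_STATS handler; needs `needed' bytes. */
static int
fake_objset_stats(char *dst, size_t dstlen, size_t needed, size_t *out_needed)
{
	*out_needed = needed;
	if (dst == NULL || dstlen < needed)
		return (ENOMEM);		/* caller must grow and retry */
	memset(dst, 'x', needed);		/* pretend to pack the nvlist */
	return (0);
}

int
main(void)
{
	size_t dstlen = INITIAL_DST_SIZE;
	size_t needed, payload = 280 * 1024;	/* ~thousands of clone names */
	char *dst = malloc(dstlen);
	int err, calls = 0;

	/* Grow-and-retry loop, analogous to what libzfs does on ENOMEM. */
	do {
		calls++;
		err = fake_objset_stats(dst, dstlen, payload, &needed);
		if (err == ENOMEM) {
			dstlen = needed;
			dst = realloc(dst, dstlen);
		}
	} while (err == ENOMEM && dst != NULL);

	printf("stats gathered %d time(s); final buffer %zu bytes\n",
	    calls, dstlen);
	free(dst);
	return (0);
}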

perf shows that over 99% of the zfs clone time is indeed spent performing this ioctl:

-   99.39%     0.00%  zfs        [kernel.kallsyms]   [k] sys_ioctl                                                                     
     sys_ioctl                                                                                                                         
     do_vfs_ioctl                                                                                                                      
   - zfsdev_ioctl                                                                                                                      
      - 99.33% zfs_ioc_objset_stats                                                                                                    
         - 99.30% zfs_ioc_objset_stats_impl.part.20                                                                                   
            - 99.21% dmu_objset_stats                                                                                                 
               - dsl_dataset_stats                                                                                                    
                  - 99.18% get_clones_stat                                                                                             
                     - 60.46% fnvlist_add_nvlist                                                                                       
                          nvlist_add_nvlist                                                                                            
                          nvlist_add_common.part.51                                                                                    
                          nvlist_copy_embedded.isra.54                                                                                 
                          nvlist_copy_pairs.isra.52                                                                                    
                        - nvlist_add_common.part.51                                                                                    
                           - 30.35% nvlist_copy_embedded.isra.54                                                                       
                                nvlist_copy_pairs.isra.52                                                                              
                              + nvlist_add_common.part.51                                                                           
                           - 29.62% nvlist_remove_all.part.49                                                                       
                                strcmp                                                                                               
                     - 31.23% fnvlist_add_boolean                                                                                    
                        - nvlist_add_boolean                                                                                          
                        - nvlist_add_common.part.51                                                                             
                           - 30.20% nvlist_remove_all.part.49                                                                         
                                strcmp                                                                                                
                             0.94% strcmp                                                                                              
                     - 6.37% dsl_dataset_hold_obj                                                                                      
                        - 6.28% dmu_bonus_hold                                                                                         
                           - 5.89% dnode_hold                                                                                          
                              - 5.83% dnode_hold_impl                                                                                  
                                 - 5.40% dbuf_read                                                                                     
                                    - dmu_zfetch                                                                                       
                                       - 5.19% dmu_zfetch_dofetch.isra.7                                                               
                                          - 4.76% dbuf_prefetch                                                                        
                                             - 2.67% dbuf_find                                                                         
                                                  mutex_lock                                                                           
                                               0.91% mutex_unlock                                                                      
                                               0.55% dnode_block_freed   
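
Reading the profile, nearly all of the time is under get_clones_stat(), which assembles an nvlist holding the name of every clone of the snapshot. The hot frames are nvlist_add_common() and nvlist_remove_all() ending in strcmp(), which suggests each add first scans the existing pairs for a duplicate name, so building a list of N clone names costs on the order of N^2 string comparisons (and the final fnvlist_add_nvlist() then deep-copies the whole list again). A standalone sketch of that append-with-duplicate-scan pattern (illustrative only, not the ZFS code) shows the blow-up:

/*
 * Standalone sketch (not ZFS code) of the pattern the profile points at:
 * appending to a unique-name list scans every existing entry first, so
 * adding N unique clone names performs ~N*(N-1)/2 string comparisons.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long comparisons;

struct namelist {
	char **names;
	size_t count;
};

/* Append with a duplicate-name scan, mimicking NV_UNIQUE_NAME semantics. */
static void
add_unique(struct namelist *nl, const char *name)
{
	for (size_t i = 0; i < nl->count; i++) {
		comparisons++;
		if (strcmp(nl->names[i], name) == 0)
			return;			/* would replace the old pair */
	}
	nl->names = realloc(nl->names, (nl->count + 1) * sizeof (char *));
	nl->names[nl->count++] = strdup(name);
}

int
main(void)
{
	struct namelist clones = { NULL, 0 };
	char buf[64];
	int n = 4000;

	for (int i = 0; i < n; i++) {
		(void) snprintf(buf, sizeof (buf), "testpool/testzfs-%d", i);
		add_unique(&clones, buf);
	}
	/* ~8 million strcmp calls for 4000 names */
	printf("%d clones -> %ld strcmp calls\n", n, comparisons);
	return (0);
}

With 4000 clones that is roughly 8 million comparisons per stats call, and a stats call is made for every new clone, which would explain the super-linear growth in the timings above.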

This is a considerable bottleneck and seems to be a deficiency in the API between userspace and the ZFS driver. Is there any way this can be optimized? It is a time-critical bottleneck when dealing with thousands of ZFS clones as the backing store for containers.

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Jul 25, 2017
@behlendorf
Contributor

There is some easy low-hanging fruit here which should buy you a factor of 2 in performance. The reason the first ioctl() fails with ENOMEM is that we're only providing a 16K buffer to the kernel initially. By increasing the default size we can avoid needing to retry the ioctl in most cases.

diff --git a/include/libzfs_impl.h b/include/libzfs_impl.h
index 2efd85e..e72129a 100644
--- a/include/libzfs_impl.h
+++ b/include/libzfs_impl.h
@@ -132,6 +132,7 @@ typedef enum {
 } zfs_share_type_t;
 
 #define        CONFIG_BUF_MINSIZE      262144
+#define        STATS_BUF_MINSIZE       131072
 
 int zfs_error(libzfs_handle_t *, int, const char *);
 int zfs_error_fmt(libzfs_handle_t *, int, const char *, ...);
diff --git a/lib/libzfs/libzfs_dataset.c b/lib/libzfs/libzfs_dataset.c
index d6e8502..c1762a3 100644
--- a/lib/libzfs/libzfs_dataset.c
+++ b/lib/libzfs/libzfs_dataset.c
@@ -402,7 +402,7 @@ get_stats(zfs_handle_t *zhp)
        int rc = 0;
        zfs_cmd_t zc = {"\0"};
 
-       if (zcmd_alloc_dst_nvlist(zhp->zfs_hdl, &zc, 0) != 0)
+       if (zcmd_alloc_dst_nvlist(zhp->zfs_hdl, &zc, STATS_BUF_MINSIZE) != 0)
                return (-1);
        if (get_stats_ioctl(zhp, &zc) != 0)
                rc = -1;

Additional improvements are possible and I think @tcaputi may already be looking into this.
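
For scale: assuming each clone contributes one nvpair whose name is the full dataset name (here "testpool/testzfs-<uuid>", about 53 bytes) plus a guessed ~24 bytes of per-pair encoding overhead, a 16K buffer tops out at a couple of hundred clones, the 128K STATS_BUF_MINSIZE above at well over a thousand, and 256K at a few thousand, which is consistent with the ~284K buffer the strace earlier ends up mapping. A throwaway estimate (the per-entry sizes are assumptions, not measured nvlist internals):

/*
 * Rough back-of-envelope only; the per-entry sizes are assumptions, not
 * measured nvlist internals.
 */
#include <stdio.h>

int
main(void)
{
	const int name_len = 53;	/* "testpool/testzfs-" + 36-char uuid */
	const int overhead = 24;	/* assumed per-nvpair encoding overhead */
	const int per_clone = name_len + overhead;

	printf("clones fitting in  16K: ~%d\n", (16 * 1024) / per_clone);
	printf("clones fitting in 128K: ~%d\n", (128 * 1024) / per_clone);
	printf("clones fitting in 256K: ~%d\n", (256 * 1024) / per_clone);
	return (0);
}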


tcaputi commented Jul 25, 2017

I will be, but I'm not at the moment. I will probably start once encryption is merged and sequential resilvers are completely ready for review.

c0d3z3r0 added a commit to c0d3z3r0/zfs that referenced this issue Jun 22, 2019
When creating hundreds of clones (for example using containers with
LXD) cloning slows down as the number of clones increases over time.
The reason for this is that fetching the clone information using a
small zcmd buffer requires two ioctl calls: one to determine the size
and a second to return the data. In other words, the data has to be
gathered twice, once to determine the size and again to populate the
zcmd buffer to return it to userspace.

These are expensive ioctl() calls, so instead, make the default buffer
size much larger: 256K. This may sound large, but on 64-bit systems
running ZFS this is not a huge chunk of memory for the speed
improvement we gain for large sets of clones:

        16K zcmd        256K zcmd
Clones  Time    Clones  Time    Clones  % improvement
        (secs)  per sec (secs)  per sec
100	7	14.29	5	20.00	28.57
200	10	20.00	9	22.22	10.00
300	19	15.79	18	16.67	5.26
400	22	18.18	22	18.18	0.00
500	29	17.24	29	17.24	0.00
600	39	15.38	39	15.38	0.00
700	46	15.22	45	15.56	2.17
800	58	13.79	51	15.69	12.07
900	74	12.16	61	14.75	17.57
1000	90	11.11	74	13.51	17.78
1100	98	11.22	87	12.64	11.22
1200	102	11.76	95	12.63	6.86
1300	113	11.50	104	12.50	7.96
1400	143	9.79	109	12.84	23.78
1500	145	10.34	132	11.36	8.97
1600	165	9.70	145	11.03	12.12
1700	187	9.09	156	10.90	16.58
1800	210	8.57	166	10.84	20.95
1900	226	8.41	183	10.38	19.03
2000	256	7.81	198	10.10	22.66
2200	311	7.07	238	9.24	23.47
2400	373	6.43	271	8.86	27.35
2600	487	5.34	316	8.23	35.11
3000	619	4.85	426	7.04	31.18
3400	915	3.72	549	6.19	40.00
4000	1332	3.00	923	4.33	30.71

As one can see, with > 2000 clones we get 25-40% speed
improvement.

This patch was originally suggested by Brian Behlendorf
(see openzfs#6372); however, this fix is more generic and
covers all zcmd cases.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
c0d3z3r0 added a commit to c0d3z3r0/zfs that referenced this issue Jun 28, 2019
c0d3z3r0 added a commit to c0d3z3r0/zfs that referenced this issue Jun 30, 2019
c0d3z3r0 added a commit to c0d3z3r0/zfs that referenced this issue Jul 11, 2019
c0d3z3r0 added a commit to c0d3z3r0/zfs that referenced this issue Jul 13, 2019
c0d3z3r0 added a commit to c0d3z3r0/zfs that referenced this issue Jul 19, 2019
c0d3z3r0 added a commit to c0d3z3r0/zfs that referenced this issue Jul 26, 2019