
[vhost_boot]: bdev_lvol_create_lvstore times out on large device #3050

Closed

mikeBashStuff opened this issue Jun 9, 2023 · 15 comments

Labels: bug, Medium (Medium Priority Bug)

Comments

@mikeBashStuff
Contributor

https://ci.spdk.io/results/autotest-per-patch/builds/105754/archive/vhost-autotest/index.html

This error was raised after bdev_lvol_create_lvstore() hit a timeout while creating an lvstore on an 8TB NVMe drive. A likely workaround would be to increase the timeout, similarly to 487da02; however, app termination should still be handled gracefully.

00:08:06.851    11:34:10	-- vhost_boot/vhost_boot.sh@73 -- # /var/jenkins/workspace/vhost-autotest/spdk/scripts/rpc.py -s /root/vhost_test/vhost/0/rpc.sock bdev_get_bdevs -b Nvme0n1
00:08:06.851   11:34:10	-- vhost_boot/vhost_boot.sh@73 -- # nvme_bdev='[
00:08:06.851    {
00:08:06.851      "name": "Nvme0n1",
00:08:06.851      "aliases": [
00:08:06.851        "ac8ebdc7-c1fa-4819-9acf-30b045e44af6"
00:08:06.851      ],
00:08:06.851      "product_name": "NVMe disk",
00:08:06.851      "block_size": 512,
00:08:06.851      "num_blocks": 15628053168,
00:08:06.851      "uuid": "ac8ebdc7-c1fa-4819-9acf-30b045e44af6",
00:08:06.851      "assigned_rate_limits": {
00:08:06.851        "rw_ios_per_sec": 0,
00:08:06.851        "rw_mbytes_per_sec": 0,
00:08:06.851        "r_mbytes_per_sec": 0,
00:08:06.851        "w_mbytes_per_sec": 0
00:08:06.851      },
00:08:06.851      "claimed": false,
00:08:06.851      "zoned": false,
00:08:06.851      "supported_io_types": {
00:08:06.851        "read": true,
00:08:06.851        "write": true,
00:08:06.851        "unmap": true,
00:08:06.851        "write_zeroes": true,
00:08:06.851        "flush": true,
00:08:06.851        "reset": true,
00:08:06.851        "compare": false,
00:08:06.851        "compare_and_write": false,
00:08:06.851        "abort": true,
00:08:06.851        "nvme_admin": true,
00:08:06.851        "nvme_io": true
00:08:06.851      },
00:08:06.851      "driver_specific": {
00:08:06.851        "nvme": [
00:08:06.851          {
00:08:06.851            "pci_address": "0000:5f:00.0",
00:08:06.851            "trid": {
00:08:06.851              "trtype": "PCIe",
00:08:06.851              "traddr": "0000:5f:00.0"
00:08:06.851            },
00:08:06.851            "ctrlr_data": {
00:08:06.851              "cntlid": 0,
00:08:06.851              "vendor_id": "0x8086",
00:08:06.851              "model_number": "INTEL SSDPE2KX080T8",
00:08:06.851              "serial_number": "BTLJ817201BJ8P0HGN",
00:08:06.851              "firmware_revision": "VDV10184",
00:08:06.851              "oacs": {
00:08:06.851                "security": 0,
00:08:06.851                "format": 1,
00:08:06.851                "firmware": 1,
00:08:06.851                "ns_manage": 1
00:08:06.851              },
00:08:06.851              "multi_ctrlr": false,
00:08:06.851              "ana_reporting": false
00:08:06.851            },
00:08:06.851            "vs": {
00:08:06.851              "nvme_version": "1.2"
00:08:06.851            },
00:08:06.851            "ns_data": {
00:08:06.851              "id": 1,
00:08:06.851              "can_share": false
00:08:06.851            }
00:08:06.851          }
00:08:06.851        ],
00:08:06.851        "mp_policy": "active_passive"
00:08:06.851      }
00:08:06.851    }
00:08:06.851  ]'
00:08:06.852    11:34:10	-- vhost_boot/vhost_boot.sh@74 -- # jq '.[] .block_size'
00:08:06.852   11:34:10	-- vhost_boot/vhost_boot.sh@74 -- # nvme_bdev_bs=512
00:08:06.852    11:34:10	-- vhost_boot/vhost_boot.sh@75 -- # jq '.[] .name'
00:08:06.852   11:34:11	-- vhost_boot/vhost_boot.sh@75 -- # nvme_bdev_name='"Nvme0n1"'
00:08:06.852   11:34:11	-- vhost_boot/vhost_boot.sh@76 -- # [[ 512 != 512 ]]
00:08:06.852   11:34:11	-- vhost_boot/vhost_boot.sh@81 -- # lvb_size=20000
00:08:06.852    11:34:11	-- vhost_boot/vhost_boot.sh@82 -- # /var/jenkins/workspace/vhost-autotest/spdk/scripts/rpc.py -s /root/vhost_test/vhost/0/rpc.sock bdev_lvol_create_lvstore Nvme0n1 lvs0
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@82 -- # err_clean '' 82
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@18 -- # trap - ERR
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@19 -- # print_backtrace
00:09:14.575     11:35:11	-- common/autotest_common.sh@1107 -- # [[ hxBET =~ e ]]
00:09:14.575     11:35:11	-- common/autotest_common.sh@1107 -- # return 0
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@20 -- # set +e
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@21 -- # error 'Error on  82'
00:09:14.575     11:35:11	-- vhost/common.sh@82 -- # echo ===========
00:09:14.575  ===========
00:09:14.575     11:35:11	-- vhost/common.sh@83 -- # message ERROR 'Error on  82'
00:09:14.575     11:35:11	-- vhost/common.sh@60 -- # local verbose_out
00:09:14.575     11:35:11	-- vhost/common.sh@61 -- # false
00:09:14.575     11:35:11	-- vhost/common.sh@62 -- # verbose_out=
00:09:14.575     11:35:11	-- vhost/common.sh@69 -- # local msg_type=ERROR
00:09:14.575     11:35:11	-- vhost/common.sh@70 -- # shift
00:09:14.575     11:35:11	-- vhost/common.sh@71 -- # echo -e 'ERROR: Error on  82'
00:09:14.575  ERROR: Error on  82
00:09:14.575     11:35:11	-- vhost/common.sh@84 -- # echo ===========
00:09:14.575  ===========
00:09:14.575     11:35:11	-- vhost/common.sh@86 -- # false
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@22 -- # vm_kill_all
00:09:14.575     11:35:11	-- vhost/common.sh@468 -- # local vm
00:09:14.575      11:35:11	-- vhost/common.sh@469 -- # vm_list_all
00:09:14.575      11:35:11	-- vhost/common.sh@455 -- # local vms
00:09:14.575       11:35:11	-- vhost/common.sh@459 -- # shopt -s nullglob
00:09:14.575       11:35:11	-- vhost/common.sh@460 -- # echo
00:09:14.575      11:35:11	-- vhost/common.sh@459 -- # vms=
00:09:14.575      11:35:11	-- vhost/common.sh@460 -- # [[ -n '' ]]
00:09:14.575     11:35:11	-- vhost/common.sh@473 -- # rm -rf /root/vhost_test/vms
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@23 -- # /var/jenkins/workspace/vhost-autotest/spdk/scripts/rpc.py -s /root/vhost_test/vhost/0/rpc.sock vhost_scsi_controller_remove_target naa.vhost_vm.0 0
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@24 -- # /var/jenkins/workspace/vhost-autotest/spdk/scripts/rpc.py -s /root/vhost_test/vhost/0/rpc.sock vhost_delete_controller naa.vhost_vm.0
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@25 -- # /var/jenkins/workspace/vhost-autotest/spdk/scripts/rpc.py -s /root/vhost_test/vhost/0/rpc.sock bdev_lvol_delete
00:09:14.575  usage: rpc.py [options] bdev_lvol_delete [-h] name
00:09:14.575  rpc.py [options] bdev_lvol_delete: error: the following arguments are required: name
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@26 -- # /var/jenkins/workspace/vhost-autotest/spdk/scripts/rpc.py -s /root/vhost_test/vhost/0/rpc.sock bdev_lvol_delete_lvstore -u
00:09:14.575  usage: rpc.py [options] bdev_lvol_delete_lvstore [-h] [-u UUID] [-l LVS_NAME]
00:09:14.575  rpc.py [options] bdev_lvol_delete_lvstore: error: argument -u/--uuid: expected one argument
00:09:14.575     11:35:11	-- vhost_boot/vhost_boot.sh@27 -- # vhost_kill 0
00:09:14.575     11:35:11	-- vhost/common.sh@198 -- # local rc=0
00:09:14.575     11:35:11	-- vhost/common.sh@199 -- # local vhost_name=0
00:09:14.575     11:35:11	-- vhost/common.sh@201 -- # [[ -z 0 ]]
00:09:14.575     11:35:11	-- vhost/common.sh@206 -- # local vhost_dir
00:09:14.575      11:35:11	-- vhost/common.sh@207 -- # get_vhost_dir 0
00:09:14.575      11:35:11	-- vhost/common.sh@105 -- # local vhost_name=0
00:09:14.575      11:35:11	-- vhost/common.sh@107 -- # [[ -z 0 ]]
00:09:14.575      11:35:11	-- vhost/common.sh@112 -- # echo /root/vhost_test/vhost/0
00:09:14.575     11:35:11	-- vhost/common.sh@207 -- # vhost_dir=/root/vhost_test/vhost/0
00:09:14.575     11:35:11	-- vhost/common.sh@208 -- # local vhost_pid_file=/root/vhost_test/vhost/0/vhost.pid
00:09:14.575     11:35:11	-- vhost/common.sh@210 -- # [[ ! -r /root/vhost_test/vhost/0/vhost.pid ]]
00:09:14.575     11:35:11	-- vhost/common.sh@215 -- # timing_enter vhost_kill
00:09:14.575     11:35:11	-- common/autotest_common.sh@691 -- # xtrace_disable
00:09:14.575     11:35:11	-- common/autotest_common.sh@20 -- # '[' no '!=' yes ']'
00:09:14.575     11:35:11	-- common/autotest_common.sh@21 -- # PREV_BASH_OPTS=hxBET
00:09:14.575     11:35:11	-- common/autotest_common.sh@22 -- # [[ hxBET == *\x* ]]
00:09:14.575     11:35:11	-- common/autotest_common.sh@23 -- # XTRACE_DISABLED=yes
00:09:14.575     11:35:11	-- common/autotest_common.sh@25 -- # set +x
00:09:14.575     11:35:11	-- common/autotest_common.sh@693 -- # xtrace_enable
00:09:14.575     11:35:11	-- vhost/common.sh@216 -- # local vhost_pid
00:09:14.575      11:35:11	-- vhost/common.sh@217 -- # cat /root/vhost_test/vhost/0/vhost.pid
00:09:14.575     11:35:11	-- vhost/common.sh@217 -- # vhost_pid=58212
00:09:14.575     11:35:11	-- vhost/common.sh@218 -- # notice 'killing vhost (PID 58212) app'
00:09:14.575     11:35:11	-- vhost/common.sh@94 -- # message INFO 'killing vhost (PID 58212) app'
00:09:14.575     11:35:11	-- vhost/common.sh@60 -- # local verbose_out
00:09:14.576     11:35:11	-- vhost/common.sh@61 -- # false
00:09:14.576     11:35:11	-- vhost/common.sh@62 -- # verbose_out=
00:09:14.576     11:35:11	-- vhost/common.sh@69 -- # local msg_type=INFO
00:09:14.576     11:35:11	-- vhost/common.sh@70 -- # shift
00:09:14.576     11:35:11	-- vhost/common.sh@71 -- # echo -e 'INFO: killing vhost (PID 58212) app'
00:09:14.576     11:35:11	-- vhost/common.sh@220 -- # kill -INT 58212
00:09:14.576     11:35:11	-- vhost/common.sh@221 -- # notice 'sent SIGINT to vhost app - waiting 60 seconds to exit'
00:09:14.576     11:35:11	-- vhost/common.sh@94 -- # message INFO 'sent SIGINT to vhost app - waiting 60 seconds to exit'
00:09:14.576     11:35:11	-- vhost/common.sh@60 -- # local verbose_out
00:09:14.576     11:35:11	-- vhost/common.sh@61 -- # false
00:09:14.576     11:35:11	-- vhost/common.sh@62 -- # verbose_out=
00:09:14.576     11:35:11	-- vhost/common.sh@69 -- # local msg_type=INFO
00:09:14.576     11:35:11	-- vhost/common.sh@70 -- # shift
00:09:14.576     11:35:11	-- vhost/common.sh@71 -- # echo -e 'INFO: sent SIGINT to vhost app - waiting 60 seconds to exit'
00:09:14.576     11:35:11	-- vhost/common.sh@222 -- # (( i = 0 ))
00:09:14.576     11:35:11	-- vhost/common.sh@222 -- # (( i < 60 ))
00:09:14.576     11:35:11	-- vhost/common.sh@223 -- # kill -0 58212
00:09:14.576     11:35:11	-- vhost/common.sh@224 -- # echo .
00:09:14.576     11:35:11	-- vhost/common.sh@225 -- # sleep 1
00:09:14.576  =================================================================
00:09:14.576  ==58212==ERROR: AddressSanitizer: heap-use-after-free on address 0x7fd5cbcc4858 at pc 0x7fd5d0d083cb bp 0x7ffcb21c07a0 sp 0x7ffcb21c0798
00:09:14.576  READ of size 8 at 0x7fd5cbcc4858 thread T0 (reactor_0)
00:09:14.576      #0 0x7fd5d0d083ca in jsonrpc_server_send_response /var/jenkins/workspace/vhost-autotest/spdk/lib/jsonrpc/jsonrpc_server_tcp.c:313
00:09:14.576      #1 0x7fd5d0d01b34 in end_response /var/jenkins/workspace/vhost-autotest/spdk/lib/jsonrpc/jsonrpc_server.c:259
00:09:14.576      #2 0x7fd5d0d022c2 in spdk_jsonrpc_end_result /var/jenkins/workspace/vhost-autotest/spdk/lib/jsonrpc/jsonrpc_server.c:296
00:09:14.576      #3 0x7fd5d36ce06d in rpc_lvol_store_construct_cb /var/jenkins/workspace/vhost-autotest/spdk/module/bdev/lvol/vbdev_lvol_rpc.c:85
00:09:14.576      #4 0x7fd5d36ba599 in _vbdev_lvs_create_cb /var/jenkins/workspace/vhost-autotest/spdk/module/bdev/lvol/vbdev_lvol.c:207
00:09:14.576      #5 0x7fd5d32fd203 in super_create_close_cb /var/jenkins/workspace/vhost-autotest/spdk/lib/lvol/lvol.c:516
00:09:14.576      #6 0x7fd5d30d96fc in bs_call_cpl /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/request.c:31
00:09:14.576      #7 0x7fd5d30da434 in bs_request_set_complete /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/request.c:63
00:09:14.576      #8 0x7fd5d30de9b1 in bs_sequence_finish /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/request.c:295
00:09:14.576      #9 0x7fd5d30c69f6 in blob_close_cpl /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/blobstore.c:8013
00:09:14.576      #10 0x7fd5d30682ab in blob_persist /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/blobstore.c:2465
00:09:14.576      #11 0x7fd5d30c7341 in spdk_blob_close /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/blobstore.c:8066
00:09:14.576      #12 0x7fd5d32fd3d9 in super_blob_set_cb /var/jenkins/workspace/vhost-autotest/spdk/lib/lvol/lvol.c:533
00:09:14.576      #13 0x7fd5d30d96fc in bs_call_cpl /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/request.c:31
00:09:14.576      #14 0x7fd5d30da434 in bs_request_set_complete /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/request.c:63
00:09:14.576      #15 0x7fd5d30de9b1 in bs_sequence_finish /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/request.c:295
00:09:14.576      #16 0x7fd5d30c2b03 in blob_sync_md_cpl /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/blobstore.c:7765
00:09:14.576      #17 0x7fd5d305dcca in blob_persist_complete_cb /var/jenkins/workspace/vhost-autotest/spdk/lib/blob/blobstore.c:1752
00:09:14.576      #18 0x7fd5d0e78629 in msg_queue_run_batch /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:842
00:09:14.576      #19 0x7fd5d0e7ce36 in thread_poll /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:1064
00:09:14.576      #20 0x7fd5d0e7e111 in spdk_thread_poll /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:1157
00:09:14.576      #21 0x7fd5d1ad1f57 in _reactor_run /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:914
00:09:14.576      #22 0x7fd5d1ad288f in reactor_run /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:952
00:09:14.576      #23 0x7fd5d1ad37df in spdk_reactors_start /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:1068
00:09:14.576      #24 0x7fd5d1ac272b in spdk_app_start /var/jenkins/workspace/vhost-autotest/spdk/lib/event/app.c:808
00:09:14.576      #25 0x4028d2 in main /var/jenkins/workspace/vhost-autotest/spdk/app/vhost/vhost.c:77
00:09:14.576      #26 0x7fd5cfe5c50f in __libc_start_call_main (/usr/lib64/libc.so.6+0x2750f)
00:09:14.576      #27 0x7fd5cfe5c5c8 in __libc_start_main@GLIBC_2.2.5 (/usr/lib64/libc.so.6+0x275c8)
00:09:14.576      #28 0x402344 in _start (/var/jenkins/workspace/vhost-autotest/spdk/build/bin/vhost+0x402344)
00:09:14.576  
00:09:14.576  0x7fd5cbcc4858 is located 32856 bytes inside of 2102832-byte region [0x7fd5cbcbc800,0x7fd5cbebde30)
00:09:14.576  freed by thread T0 (reactor_0) here:
00:09:14.576      #0 0x7fd5d3bbe388 in __interceptor_free.part.0 (/usr/lib64/libasan.so.8+0xb9388)
00:09:14.576      #1 0x7fd5d0d0455e in spdk_jsonrpc_server_shutdown /var/jenkins/workspace/vhost-autotest/spdk/lib/jsonrpc/jsonrpc_server_tcp.c:120
00:09:14.576      #2 0x7fd5d0d46686 in spdk_rpc_close /var/jenkins/workspace/vhost-autotest/spdk/lib/rpc/rpc.c:343
00:09:14.576      #3 0x7fd5d0fa4794 in spdk_rpc_finish /var/jenkins/workspace/vhost-autotest/spdk/lib/init/rpc.c:59
00:09:14.576      #4 0x7fd5d1ac2b48 in app_stop /var/jenkins/workspace/vhost-autotest/spdk/lib/event/app.c:863
00:09:14.576      #5 0x7fd5d0e78629 in msg_queue_run_batch /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:842
00:09:14.576      #6 0x7fd5d0e7ce36 in thread_poll /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:1064
00:09:14.576      #7 0x7fd5d0e7e111 in spdk_thread_poll /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:1157
00:09:14.576      #8 0x7fd5d1ad1f57 in _reactor_run /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:914
00:09:14.576      #9 0x7fd5d1ad288f in reactor_run /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:952
00:09:14.576      #10 0x7fd5d1ad37df in spdk_reactors_start /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:1068
00:09:14.576      #11 0x7fd5d1ac272b in spdk_app_start /var/jenkins/workspace/vhost-autotest/spdk/lib/event/app.c:808
00:09:14.576      #12 0x4028d2 in main /var/jenkins/workspace/vhost-autotest/spdk/app/vhost/vhost.c:77
00:09:14.576      #13 0x7fd5cfe5c50f in __libc_start_call_main (/usr/lib64/libc.so.6+0x2750f)
00:09:14.576  
00:09:14.576  previously allocated by thread T0 (reactor_0) here:
00:09:14.576      #0 0x7fd5d3bbf097 in calloc (/usr/lib64/libasan.so.8+0xba097)
00:09:14.576      #1 0x7fd5d0d027ea in spdk_jsonrpc_server_listen /var/jenkins/workspace/vhost-autotest/spdk/lib/jsonrpc/jsonrpc_server_tcp.c:18
00:09:14.576      #2 0x7fd5d0d455c1 in spdk_rpc_listen /var/jenkins/workspace/vhost-autotest/spdk/lib/rpc/rpc.c:194
00:09:14.576      #3 0x7fd5d0fa46da in spdk_rpc_initialize /var/jenkins/workspace/vhost-autotest/spdk/lib/init/rpc.c:40
00:09:14.576      #4 0x7fd5d1abafbf in bootstrap_fn /var/jenkins/workspace/vhost-autotest/spdk/lib/event/app.c:502
00:09:14.576      #5 0x7fd5d0e78629 in msg_queue_run_batch /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:842
00:09:14.576      #6 0x7fd5d0e7ce36 in thread_poll /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:1064
00:09:14.576      #7 0x7fd5d0e7e111 in spdk_thread_poll /var/jenkins/workspace/vhost-autotest/spdk/lib/thread/thread.c:1157
00:09:14.577      #8 0x7fd5d1ad1f57 in _reactor_run /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:914
00:09:14.577      #9 0x7fd5d1ad288f in reactor_run /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:952
00:09:14.577      #10 0x7fd5d1ad37df in spdk_reactors_start /var/jenkins/workspace/vhost-autotest/spdk/lib/event/reactor.c:1068
00:09:14.577      #11 0x7fd5d1ac272b in spdk_app_start /var/jenkins/workspace/vhost-autotest/spdk/lib/event/app.c:808
00:09:14.577      #12 0x4028d2 in main /var/jenkins/workspace/vhost-autotest/spdk/app/vhost/vhost.c:77
00:09:14.577      #13 0x7fd5cfe5c50f in __libc_start_call_main (/usr/lib64/libc.so.6+0x2750f)
00:09:14.577  
00:09:14.577  SUMMARY: AddressSanitizer: heap-use-after-free /var/jenkins/workspace/vhost-autotest/spdk/lib/jsonrpc/jsonrpc_server_tcp.c:313 in jsonrpc_server_send_response
00:09:14.577  Shadow bytes around the buggy address:
00:09:14.577    0x0ffb397908b0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb397908c0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb397908d0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb397908e0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb397908f0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577  =>0x0ffb39790900: fd fd fd fd fd fd fd fd fd fd fd[fd]fd fd fd fd
00:09:14.577    0x0ffb39790910: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb39790920: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb39790930: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb39790940: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577    0x0ffb39790950: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
00:09:14.577  Shadow byte legend (one shadow byte represents 8 application bytes):
00:09:14.577    Addressable:           00
00:09:14.577    Partially addressable: 01 02 03 04 05 06 07 
00:09:14.577    Heap left redzone:       fa
00:09:14.577    Freed heap region:       fd
00:09:14.577    Stack left redzone:      f1
00:09:14.577    Stack mid redzone:       f2
00:09:14.577    Stack right redzone:     f3
00:09:14.577    Stack after return:      f5
00:09:14.577    Stack use after scope:   f8
00:09:14.577    Global redzone:          f9
00:09:14.577    Global init order:       f6
00:09:14.577    Poisoned by user:        f7
00:09:14.577    Container overflow:      fc
00:09:14.577    Array cookie:            ac
00:09:14.577    Intra object redzone:    bb
00:09:14.577    ASan internal:           fe
00:09:14.577    Left alloca redzone:     ca
00:09:14.577    Right alloca redzone:    cb
00:09:14.577  ==58212==ABORTING
00:09:14.577     11:35:12	-- vhost/common.sh@222 -- # (( i++ ))
00:09:14.577     11:35:12	-- vhost/common.sh@222 -- # (( i < 60 ))
00:09:14.577     11:35:12	-- vhost/common.sh@223 -- # kill -0 58212
00:09:14.577  /var/jenkins/workspace/vhost-autotest/spdk/test/vhost/common.sh: line 223: kill: (58212) - No such process
00:09:14.577     11:35:12	-- vhost/common.sh@227 -- # break
00:09:14.577     11:35:12	-- vhost/common.sh@230 -- # kill -0 58212
00:09:14.577  /var/jenkins/workspace/vhost-autotest/spdk/test/vhost/common.sh: line 230: kill: (58212) - No such process
00:09:14.577     11:35:12	-- vhost/common.sh@235 -- # kill -0 58212
00:09:14.577  /var/jenkins/workspace/vhost-autotest/spdk/test/vhost/common.sh: line 235: kill: (58212) - No such process
00:09:14.577     11:35:12	-- vhost/common.sh@246 -- # timing_exit vhost_kill
00:09:14.577     11:35:12	-- common/autotest_common.sh@697 -- # xtrace_disable
00:09:14.577     11:35:12	-- common/autotest_common.sh@20 -- # '[' no '!=' yes ']'
00:09:14.577     11:35:12	-- common/autotest_common.sh@21 -- # PREV_BASH_OPTS=hxBET
00:09:14.577     11:35:12	-- common/autotest_common.sh@22 -- # [[ hxBET == *\x* ]]
00:09:14.577     11:35:12	-- common/autotest_common.sh@23 -- # XTRACE_DISABLED=yes
00:09:14.577     11:35:12	-- common/autotest_common.sh@25 -- # set +x
00:09:14.577     11:35:12	-- common/autotest_common.sh@699 -- # xtrace_enable
00:09:14.577     11:35:12	-- vhost/common.sh@248 -- # rm -rf /root/vhost_test/vhost/0
00:09:14.577     11:35:12	-- vhost/common.sh@250 -- # return 0
00:09:14.577     11:35:12	-- vhost_boot/vhost_boot.sh@28 -- # exit 1
00:09:14.577   11:35:12	-- vhost_boot/vhost_boot.sh@82 -- # lvs_u='Timeout while waiting for response:
00:09:14.577  
00:09:14.577  
00:09:14.577  request:
00:09:14.577  {
00:09:14.577    "ctrlr": "naa.vhost_vm.0",
00:09:14.577    "scsi_target_num": 0,
00:09:14.577    "method": "vhost_scsi_controller_remove_target",
00:09:14.577    "req_id": 1
00:09:14.577  }
00:09:14.577  Got JSON-RPC error response
00:09:14.577  response:
00:09:14.577  {
00:09:14.577    "code": -32602,
00:09:14.577    "message": "No such device"
00:09:14.577  }
00:09:14.577  request:
00:09:14.577  {
00:09:14.577    "ctrlr": "naa.vhost_vm.0",
00:09:14.577    "method": "vhost_delete_controller",
00:09:14.577    "req_id": 1
00:09:14.577  }
00:09:14.577  Got JSON-RPC error response
00:09:14.577  response:
00:09:14.577  {
00:09:14.577    "code": -32602,
00:09:14.577    "message": "No such device"
00:09:14.577  }
00:09:14.577  INFO: killing vhost (PID 58212) app
00:09:14.577  INFO: sent SIGINT to vhost app - waiting 60 seconds to exit
00:09:14.577  .'
00:09:14.577    11:35:12	-- vhost_boot/vhost_boot.sh@82 -- # err_clean '' 82
00:09:14.577    11:35:12	-- vhost_boot/vhost_boot.sh@18 -- # trap - ERR
00:09:14.577    11:35:12	-- vhost_boot/vhost_boot.sh@19 -- # print_backtrace
00:09:14.577    11:35:12	-- common/autotest_common.sh@1107 -- # [[ ehxBET =~ e ]]
00:09:14.577    11:35:12	-- common/autotest_common.sh@1109 -- # args=('82' '' '--vm_image=/root/spdk_dependencies/spdk_test_image.qcow2')
00:09:14.577    11:35:12	-- common/autotest_common.sh@1109 -- # local args
00:09:14.577    11:35:12	-- common/autotest_common.sh@1111 -- # xtrace_disable
00:09:14.577    11:35:12	-- common/autotest_common.sh@20 -- # '[' no '!=' yes ']'
00:09:14.577    11:35:12	-- common/autotest_common.sh@21 -- # PREV_BASH_OPTS=ehxBET
00:09:14.577    11:35:12	-- common/autotest_common.sh@22 -- # [[ ehxBET == *\x* ]]
00:09:14.577    11:35:12	-- common/autotest_common.sh@23 -- # XTRACE_DISABLED=yes
00:09:14.577    11:35:12	-- common/autotest_common.sh@25 -- # set +x
00:09:14.577  ========== Backtrace start: ==========
00:09:14.577  
00:09:14.577  in /var/jenkins/workspace/vhost-autotest/spdk/test/vhost/vhost_boot/vhost_boot.sh:82 -> err_clean([""],["82"])
00:09:14.577       ...
00:09:14.577     77  		echo "ERROR: Your device $nvme_bdev_name block size is $nvme_bdev_bs, but should be 512 bytes."
00:09:14.577     78  		false
00:09:14.577     79  	fi
00:09:14.577     80  	
00:09:14.577     81  	lvb_size=20000 # MB
00:09:14.577  => 82  	lvs_u=$($rpc_py bdev_lvol_create_lvstore Nvme0n1 lvs0)
00:09:14.577     83  	lvb_u=$($rpc_py bdev_lvol_create -u $lvs_u lvb0 "$lvb_size")
00:09:14.577     84  	timing_exit create_lvol
00:09:14.577     85  	
00:09:14.577     86  	timing_enter convert_vm_image
00:09:14.577     87  	modprobe nbd
00:09:14.577       ...
00:09:14.577  in /var/jenkins/workspace/vhost-autotest/spdk/test/vhost/vhost_boot/vhost_boot.sh:82 -> main(["--vm_image=/root/spdk_dependencies/spdk_test_image.qcow2"])
00:09:14.577       ...
00:09:14.577     77  		echo "ERROR: Your device $nvme_bdev_name block size is $nvme_bdev_bs, but should be 512 bytes."
00:09:14.577     78  		false
00:09:14.577     79  	fi
00:09:14.577     80  	
00:09:14.577     81  	lvb_size=20000 # MB
00:09:14.577  => 82  	lvs_u=$($rpc_py bdev_lvol_create_lvstore Nvme0n1 lvs0)
00:09:14.578     83  	lvb_u=$($rpc_py bdev_lvol_create -u $lvs_u lvb0 "$lvb_size")
00:09:14.578     84  	timing_exit create_lvol
00:09:14.578     85  	
00:09:14.578     86  	timing_enter convert_vm_image
00:09:14.578     87  	modprobe nbd
00:09:14.578       ...
00:09:14.578  
00:09:14.578  ========== Backtrace end ==========
@spdkci

spdkci commented Jun 9, 2023

Another instance of this failure. Reported by @mikeBashStuff. log: https://ci.spdk.io/public_build/autotest-per-patch_105754.html

@tomzawadzki
Contributor

[Bug scrub] For the timeout during lvol store creation, try using the clear_method=none option to skip sending an unmap to the whole device. If it still hangs, then it must be something different.
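For reference, the suggested workaround corresponds to the clear_method parameter of the bdev_lvol_create_lvstore RPC. A minimal sketch of the JSON-RPC request it would produce (the parameter names follow the RPC discussed in this thread; the helper function itself is hypothetical, for illustration only):

```python
import json

def build_lvstore_request(bdev_name, lvs_name, clear_method=None, req_id=1):
    """Build a JSON-RPC request body for bdev_lvol_create_lvstore.

    clear_method="none" skips the initial unmap of the whole device,
    which is the workaround suggested above for very large drives.
    """
    params = {"bdev_name": bdev_name, "lvs_name": lvs_name}
    if clear_method is not None:
        params["clear_method"] = clear_method
    return {"jsonrpc": "2.0", "id": req_id,
            "method": "bdev_lvol_create_lvstore", "params": params}

print(json.dumps(build_lvstore_request("Nvme0n1", "lvs0", clear_method="none")))
```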

@tomzawadzki tomzawadzki changed the title [vhost_boot]: ERROR: AddressSanitizer: heap-use-after-free [vhost_boot]: bdev_lvol_create_lvstore times out on large device Jun 13, 2023
@tomzawadzki tomzawadzki added bug Medium Medium Priority Bug and removed Intermittent Failure labels Sep 5, 2023
@mikeBashStuff
Contributor Author

Finally got some time to test it. As expected, using --clear-method=none speeds up the whole lvol creation (and the entire test, by about 50% on a drive that big). That said, this issue has not occurred for months now (WFP43, which holds this 8TB drive, is actively used in the pool, yet no similar failures have popped up) and I am no longer able to reproduce it on demand.

With that in mind, I will close this issue. In case it resurfaces, we know the potential workaround.

@xjjx
Contributor

xjjx commented Mar 15, 2024

@xjjx xjjx reopened this Mar 15, 2024
@xjjx
Contributor

xjjx commented Mar 19, 2024

@karlatec karlatec assigned mchomnic and unassigned mikeBashStuff Mar 20, 2024
@tomzawadzki
Contributor

Just to double-check, I've run lvs creation on bdevs of the sizes specified in the issue. There does not seem to be anything out of the ordinary, and the time it takes scales roughly linearly with the size. I used the same drive here; sizes below 8TB were measured by creating a split bdev of the specified size:

1TB: 0m9.430s
7TB: 0m57.811s
8TB: 1m0.226s

Every time, all of that was handled by a single unmap, and the time is the result of the device processing that I/O operation. For example:

[sys_sgci@spdk-wfp-43 spdk]$ sudo ./scripts/rpc.py bdev_get_iostat --name Nvme0n1 | jq .bdevs[].num_unmap_ops
1

Summing up, we need to accommodate those times in the tests.
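The three measurements above are consistent with roughly linear scaling. A quick back-of-the-envelope check of the slope, and a hypothetical timeout estimate derived from it (the helper and its constants are illustrative assumptions, not anything in the test scripts):

```python
# Measured lvstore-creation times from the comment above: (size in TB, seconds)
samples = [(1, 9.430), (7, 57.811), (8, 60.226)]

# Seconds of unmap processing per TB for each measurement
rates = [secs / tb for tb, secs in samples]
print([round(r, 1) for r in rates])  # [9.4, 8.3, 7.5]

# A conservative timeout estimate for an N-TB device, with 50% headroom
# and a floor at the current 60 s default
def suggested_timeout(size_tb, secs_per_tb=10.0, floor=60.0):
    return max(floor, size_tb * secs_per_tb * 1.5)

print(suggested_timeout(8))  # 120.0
```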

The default timeout in rpc.py is 60 seconds, which could time out in worst-case scenarios.
There are suites which increase the default timeout, like the FTL suites or even lvol:

$rpc_py -t 120 bdev_lvol_delete $lvol_bdev

@mchomnic Changing clear_method is troublesome, because it would have to be done in most tests, and would have to be enforced in each new test, still leaving the possibility of missing it. It was communicated to me that the rpc timeout in tests is decreased (to 15 sec), but I don't see it in the test scripts, and the first log in this issue times out after 1 minute. If the test suites actually use the default 60 sec, maybe extending it a bit further would help alleviate this issue?

@mikeBashStuff
Contributor Author

It was communicated to me that the rpc timeout in tests is decreased (to 15 sec), but I don't see it in the test scripts, and the first log in this issue times out after 1 minute. If the test suites actually use the default 60 sec, maybe extending it a bit further would help alleviate this issue?

This seems to be referring to #3312 (comment) though I am not sure what exactly you discussed outside of these issues. There was an internal thread about what I mentioned in this linked comment though.

@mchomnic
Contributor

@mikeBashStuff, do you think we can just extend the timeout to 30 seconds? I know it's always tricky to raise the timeout whenever it's needed, but it seems that bigger disks simply need more of it.

@tomzawadzki
Contributor

This seems to be referring to #3312 (comment) though I am not sure what exactly you discussed outside of these issues. There was an internal thread about what I mentioned in this linked comment though.

This was mentioned as an offhand comment offline, but as I said, I couldn't find that to be the case. I didn't notice the rpc_cmd(), thanks.

@mikeBashStuff , do you think we can just extend the timeout to 30 seconds? I know it's always tricky to raise the timeout whenever it's needed, but it seems that bigger disks require more of it.

Please note that even 60 seconds is not sufficient, per the original report.

Rather than changing the timeout for all commands, or requiring tests to be written explicitly with a higher timeout (or the clear_method parameter), it would be best to increase the timeout specifically for lvol store creation; I'm just not sure what options we have for that.

@jimharris
Member

This seems to be referring to #3312 (comment) though I am not sure what exactly you discussed outside of these issues. There was an internal thread about what I mentioned in this linked comment though.

This was mentioned as an offhand comment offline, but as I've said I couldn't find it being the case. Didn't notice the rpc_cmd(), thanks.

@mikeBashStuff , do you think we can just extend the timeout to 30 seconds? I know it's always tricky to raise the timeout whenever it's needed, but it seems that bigger disks require more of it.

Please note that even 60 seconds is not sufficient, per the original report.

Rather than changing the timeout for all commands, or requiring tests to be written explicitly with a higher timeout (or the clear_method parameter), it would be best to increase the timeout specifically for lvol store creation; I'm just not sure what options we have for that.

You mean teach scripts/rpc.py to dynamically change the timeout for certain RPCs?

I like this solution, as long as we document it, and ensure that picking a timeout on the rpc.py command line still takes precedence.

It certainly seems possible, from looking at the JSONRPCClient python code. I think it just needs a method to set the timeout on a previously created client object.

We could just set the timeout dynamically when creating the JSONRPCClient, but that doesn't work for batching mode (where we pipe a bunch of RPCs into one rpc.py invocation). For this latter case we would want only the bdev_lvol_create_lvstore call to have the longer timeout, while the rest get the normal default.
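The per-RPC timeout idea can be prototyped without touching the transport: keep a table of per-method overrides and consult it before each call, with an explicit -t value on the rpc.py command line still taking precedence. A sketch under those assumptions (names are illustrative, not SPDK's actual client API):

```python
# Per-method timeout overrides; any method not listed uses the default.
DEFAULT_TIMEOUT = 60.0
METHOD_TIMEOUTS = {
    # Large devices: creation issues a single whole-device unmap (see above)
    "bdev_lvol_create_lvstore": 90.0,
}

def effective_timeout(method, cli_timeout=None):
    """Pick the timeout for one RPC: a CLI-provided -t value wins,
    otherwise use the per-method override if one exists, otherwise
    fall back to the default. In batching mode this is evaluated per
    request, so only the slow RPC gets the longer timeout."""
    if cli_timeout is not None:
        return cli_timeout
    return METHOD_TIMEOUTS.get(method, DEFAULT_TIMEOUT)

print(effective_timeout("bdev_lvol_create_lvstore"))         # 90.0
print(effective_timeout("bdev_get_bdevs"))                   # 60.0
print(effective_timeout("bdev_lvol_create_lvstore", 120.0))  # 120.0
```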

@spdkci

spdkci commented May 6, 2024

Another instance of this failure. Reported by @spdkci / Known Issue Detector. log: https://ci.spdk.io/public_build/autotest-spdk-v24.01-LTS-vs-dpdk-v23.11_388.html

@spdkci

spdkci commented May 6, 2024

Another instance of this failure. Reported by @spdkci / Known Issue Detector. log: https://ci.spdk.io/public_build/autotest-nightly-lts_1795.html

@spdkci

spdkci commented May 9, 2024

Another instance of this failure. Reported by @spdkci / Known Issue Detector. log: https://ci.spdk.io/public_build/autotest-nightly-lts_1806.html

@spdkci

spdkci commented May 10, 2024

Another instance of this failure. Reported by @spdkci / Known Issue Detector. log: https://ci.spdk.io/public_build/autotest-spdk-v24.01-LTS-vs-dpdk-v22.11_402.html

@spdkci

spdkci commented May 21, 2024

Another instance of this failure. Reported by @spdkci / Known Issue Detector. log: https://ci.spdk.io/public_build/autotest-nightly-lts_1838.html

spdk-bot pushed a commit that referenced this issue May 22, 2024
The function bdev_lvol_create with the default unmap clear method may
take more than the default 60.0 sec. Increase the timeout to 90.0 for
the function.

Fixes #3050

Signed-off-by: Marek Chomnicki <marek.chomnicki@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/22753 (master)

(cherry picked from commit 679c318)
Change-Id: I6b8ac214b20601e7247ba789b485f53d3545955c
Signed-off-by: Marek Chomnicki <marek.chomnicki@intel.com>
Reviewed-on: https://review.spdk.io/gerrit/c/spdk/spdk/+/23246
Reviewed-by: Jim Harris <jim.harris@samsung.com>
Reviewed-by: Konrad Sztyber <konrad.sztyber@intel.com>
Tested-by: SPDK CI Jenkins <sys_sgci@intel.com>