Scylla appears to have crashed in test test_wasm.py::test_fib_called_on_null (on aarch64) #9387

Closed
bhalevy opened this issue Sep 26, 2021 · 14 comments

@bhalevy (Member) commented Sep 26, 2021

Seen in https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/760/artifact/testlog/aarch64_release/run.9.log

test_wasm.py .F
! _pytest.outcomes.Exit: Scylla appears to have crashed in test test_wasm.py::test_fib_called_on_null !
Traceback (most recent call last):
  File "/usr/bin/pytest", line 33, in <module>
    sys.exit(load_entry_point('pytest==6.2.2', 'console_scripts', 'pytest')())
  File "/usr/lib/python3.9/site-packages/_pytest/config/__init__.py", line 185, in console_main
    code = main()
  File "/usr/lib/python3.9/site-packages/_pytest/config/__init__.py", line 162, in main
    ret: Union[ExitCode, int] = config.hook.pytest_cmdline_main(
  File "/usr/lib/python3.9/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/lib/python3.9/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/lib/python3.9/site-packages/pluggy/manager.py", line 84, in <lambda>
    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
  File "/usr/lib/python3.9/site-packages/pluggy/callers.py", line 208, in _multicall
    return outcome.get_result()
  File "/usr/lib/python3.9/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/lib/python3.9/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/lib/python3.9/site-packages/_pytest/main.py", line 316, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/usr/lib/python3.9/site-packages/_pytest/main.py", line 304, in wrap_session
    config.hook.pytest_sessionfinish(
  File "/usr/lib/python3.9/site-packages/pluggy/hooks.py", line 286, in __call__
    return self._hookexec(self, self.get_hookimpls(), kwargs)
  File "/usr/lib/python3.9/site-packages/pluggy/manager.py", line 93, in _hookexec
    return self._inner_hookexec(hook, methods, kwargs)
  File "/usr/lib/python3.9/site-packages/pluggy/manager.py", line 84, in <lambda>
    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
  File "/usr/lib/python3.9/site-packages/pluggy/callers.py", line 203, in _multicall
    gen.send(outcome)
  File "/usr/lib/python3.9/site-packages/_pytest/terminal.py", line 803, in pytest_sessionfinish
    outcome.get_result()
  File "/usr/lib/python3.9/site-packages/pluggy/callers.py", line 80, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/lib/python3.9/site-packages/pluggy/callers.py", line 187, in _multicall
    res = hook_impl.function(*args)
  File "/usr/lib/python3.9/site-packages/_pytest/runner.py", line 103, in pytest_sessionfinish
    session._setupstate.teardown_all()
  File "/usr/lib/python3.9/site-packages/_pytest/runner.py", line 412, in teardown_all
    self._pop_and_teardown()
  File "/usr/lib/python3.9/site-packages/_pytest/runner.py", line 387, in _pop_and_teardown
    self._teardown_with_finalization(colitem)
  File "/usr/lib/python3.9/site-packages/_pytest/runner.py", line 405, in _teardown_with_finalization
    self._callfinalizers(colitem)
  File "/usr/lib/python3.9/site-packages/_pytest/runner.py", line 402, in _callfinalizers
    raise exc
  File "/usr/lib/python3.9/site-packages/_pytest/runner.py", line 395, in _callfinalizers
    fin()
  File "/usr/lib/python3.9/site-packages/_pytest/fixtures.py", line 1034, in finish
    raise exc
  File "/usr/lib/python3.9/site-packages/_pytest/fixtures.py", line 1027, in finish
    func()
  File "/usr/lib/python3.9/site-packages/_pytest/fixtures.py", line 941, in _teardown_yield_fixture
    next(it)
  File "/jenkins/workspace/scylla-master/build/scylla/test/cql-pytest/test_wasm.py", line 49, in table1
    cql.execute("DROP TABLE " + table)
  File "/usr/local/lib64/python3.9/site-packages/cassandra/cluster.py", line 2627, in execute
    return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
  File "/usr/local/lib64/python3.9/site-packages/cassandra/cluster.py", line 4886, in result
    raise self._final_exception
cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {})

Subprocess output:

Scylla version 4.6.dev-0.20210926.f582bfd45 with build-id d50b4ec879233846abd8df4545ad2bf506db8329 starting ...
command used: "/jenkins/workspace/scylla-master/build/scylla/testlog/release/scylla-test-104/test_scylla --options-file /jenkins/workspace/scylla-master/build/scylla/conf/scylla.yaml --developer-mode 1 --ring-delay-ms 0 --collectd 0 --smp 2 -m 1G --overprovisioned --max-networking-io-control-blocks 100 --unsafe-bypass-fsync 1 --kernel-page-cache 1 --flush-schema-tables-after-modification false --api-address 127.1.0.104 --rpc-address 127.1.0.104 --listen-address 127.1.0.104 --prometheus-address 127.1.0.104 --seed-provider-parameters seeds=127.1.0.104 --workdir /jenkins/workspace/scylla-master/build/scylla/testlog/release/scylla-test-104 --auto-snapshot 0 --skip-wait-for-gossip-to-settle 0 --logger-log-level compaction=warn --logger-log-level migration_manager=warn --num-tokens 16 --range-request-timeout-in-ms 300000 --read-request-timeout-in-ms 300000 --counter-write-request-timeout-in-ms 300000 --cas-contention-timeout-in-ms 300000 --truncate-request-timeout-in-ms 300000 --write-request-timeout-in-ms 300000 --request-timeout-in-ms 300000 --experimental 1 --enable-user-defined-functions 1 --authenticator PasswordAuthenticator --strict-allow-filtering true"
parsed command line options: [options-file: /jenkins/workspace/scylla-master/build/scylla/conf/scylla.yaml, developer-mode: 1, ring-delay-ms: 0, collectd: 0, smp: 2, memory: 1G, overprovisioned, max-networking-io-control-blocks: 100, unsafe-bypass-fsync: 1, kernel-page-cache: 1, flush-schema-tables-after-modification: false, api-address: 127.1.0.104, rpc-address: 127.1.0.104, listen-address: 127.1.0.104, prometheus-address: 127.1.0.104, seed-provider-parameters: seeds=127.1.0.104, workdir: /jenkins/workspace/scylla-master/build/scylla/testlog/release/scylla-test-104, auto-snapshot: 0, skip-wait-for-gossip-to-settle: 0, logger-log-level: compaction=warn, logger-log-level: migration_manager=warn, num-tokens: 16, range-request-timeout-in-ms: 300000, read-request-timeout-in-ms: 300000, counter-write-request-timeout-in-ms: 300000, cas-contention-timeout-in-ms: 300000, truncate-request-timeout-in-ms: 300000, write-request-timeout-in-ms: 300000, request-timeout-in-ms: 300000, experimental: 1, enable-user-defined-functions: 1, authenticator: PasswordAuthenticator, strict-allow-filtering: true]
...
Segmentation fault on shard 0.
Backtrace:
  0x274559b
  0x27714c7
  0x2b3667f
  linux-vdso.so.1+0x79b
  /lib64/libc.so.6+0x7fed7
  0x27208d7
  0x2723f5b
  0x299738f
  0x299661b
  0x299611f
  0x2991983
  0x1441983
  0x14422b7
  0x14e37ab
  0x14f002f
  0x1827a97
  0x27554ff
  0x2756593
  0x2755a43
  0x27074b7
  0x2706ac7
  0xc8b803
  /lib64/libc.so.6+0x24b1b
  0xc8a20f
Scylla tests failure
=== TEST.PY SUMMARY START ===
Test exited with code 1

=== TEST.PY SUMMARY END ===

@bhalevy (Member, Author) commented Sep 26, 2021

Cannot decode the backtrace since the build failed and the reloc packages weren't uploaded.

@psarna (Contributor) commented Sep 27, 2021

@bhalevy can the Jenkins script be changed to upload the packages if a coredump occurs, so that next time it will be possible to decode the backtrace? Without a decoded backtrace, and given that the crash happened on aarch64, pinpointing this issue would be quite hard.

@bhalevy (Member, Author) commented Sep 27, 2021

@psarna I suggest you open an issue on scylla-pkg. I think it should be possible.
/Cc @yaronkaikov @hagitsegev

@avikivity (Member) commented

@psarna you have access to an aarch64 machine. It's a lot better to reproduce, fix, and test locally than to incur continuous integration latency.

@yaronkaikov (Contributor) commented

By default we delete the workspace after the build finishes. I am now running https://jenkins.scylladb.com/job/scylla-master/job/build/765/ with PRESERVED_WORKSPACE enabled, so it will be kept. @psarna, I will let you know once it's done.

@yaronkaikov (Contributor) commented

@psarna @bhalevy https://github.com/scylladb/scylla-pkg/pull/2416 - this is the fix

So it appears that we do upload the reloc to S3 right after we finish building Scylla; we just didn't publish the metadata file at the same stage. This PR fixes it.

@yaronkaikov (Contributor) commented

@psarna @bhalevy It looks like https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/760/artifact/testlog/aarch64_release/run.9.log was removed from Jenkins due to our history limitation, but I think https://jenkins.scylladb.com/view/master/job/scylla-master/job/build/764/ is the same (I also marked this build as keep forever). In any case, the relocatable is already in S3, under http://downloads.scylladb.com/unstable/scylla/master/relocatable/2021-09-29T01:00:45Z

Let me know if you need any more information.

@psarna (Contributor) commented Sep 30, 2021

@yaronkaikov the link to the relocatable package at downloads.scylladb.com gives me an Error: [object Object] response instead of listing the files.

@psarna (Contributor) commented Sep 30, 2021

In any case, I proceeded with working directly on the aarch64 machine and managed to reproduce the issue after running wasm tests in a tight loop for a few minutes. The backtrace actually varies, but here's one of them:

[Backtrace #0]
addr2line: DWARF error: could not find variable specification at offset cdcb4
addr2line: DWARF error: could not find variable specification at offset ef085
void seastar::backtrace<seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}>(seastar::backtrace_buffer::append_backtrace()::{lambda(seastar::frame)#1}&&) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:59
 (inlined by) seastar::backtrace_buffer::append_backtrace() at ./build/release/seastar/./seastar/src/core/reactor.cc:756
 (inlined by) seastar::print_with_backtrace(seastar::backtrace_buffer&, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:786
seastar::print_with_backtrace(char const*, bool) at ./build/release/seastar/./seastar/src/core/reactor.cc:798
 (inlined by) seastar::sigsegv_action() at ./build/release/seastar/./seastar/src/core/reactor.cc:3721
 (inlined by) operator() at ./build/release/seastar/./seastar/src/core/reactor.cc:3707
 (inlined by) __invoke at ./build/release/seastar/./seastar/src/core/reactor.cc:3703
wasmtime_runtime::traphandlers::unix::trap_handler at wasmtime_runtime.13pp8cmd-cgu.5:?
addr2line: 'linux-vdso.so.1': No such file
linux-vdso.so.1 0x79b 
~noncopyable_function at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:189
 (inlined by) ~timer at ./build/release/seastar/./seastar/src/core/reactor.cc:589
addr2line: DWARF error: could not find variable specification at offset da2f8
addr2line: DWARF error: could not find variable specification at offset fb418
addr2line: DWARF error: could not find variable specification at offset 11e3ff
addr2line: DWARF error: could not find variable specification at offset 11fa97
~connection at ./transport/server.cc:527
 (inlined by) ~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:464
 (inlined by) ~shared_ptr_count_for at ././seastar/include/seastar/core/shared_ptr.hh:464
addr2line: DWARF error: could not find variable specification at offset 1e725
~shared_ptr at ././seastar/include/seastar/core/shared_ptr.hh:537
 (inlined by) ~ at ./generic_server.cc:187
 (inlined by) ~finally_body at ././seastar/include/seastar/core/future.hh:1753
 (inlined by) ~continuation at ././seastar/include/seastar/core/future.hh:750
 (inlined by) seastar::continuation<seastar::internal::promise_base_with_type<void>, seastar::future<void>::finally_body<generic_server::server::do_accepts(int, bool, seastar::socket_address)::$_5::operator()() const::{lambda(seastar::future<seastar::accept_result>)#1}::operator()(seastar::future<seastar::accept_result>)::{lambda(seastar::future<void>)#1}::operator()(seastar::future<void>) const::{lambda()#1}, true>, seastar::future<void>::then_wrapped_nrvo<seastar::future<void>, {lambda(seastar::future<void>)#1}>({lambda(seastar::future<void>)#1}&&)::{lambda(seastar::internal::promise_base_with_type<void>&&, {lambda(seastar::future<void>)#1}&, seastar::future_state<seastar::internal::monostate>&&)#1}, void>::run_and_dispose() at ././seastar/include/seastar/core/future.hh:771
seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2383
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2792
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2961
seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:2844
addr2line: DWARF error: could not find variable specification at offset 4c61e
seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:217
seastar::app_template::run(int, char**, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:125
addr2line: DWARF error: could not find variable specification at offset 237931
addr2line: DWARF error: could not find variable specification at offset 3860a2
main at ./main.cc:504
__libc_start_main at ??:?

I'm worried by the fact that the problem arises from wasmtime_runtime::traphandlers::unix::trap_handler and that it's a segfault. That would indicate that the C binding library from wasmtime compiled for aarch64 is responsible for the segfault. And that, in turn, would mean it's probably better to simply not download wasmtime for aarch64 and move it out of the supported architectures for WebAssembly. I'll try to reproduce a few more times to see if this frame is prevalent - I've seen it 2 times out of 2 so far.

@psarna (Contributor) commented Sep 30, 2021

... and the reason it's the test_fib_called_on_null case that segfaults is that this test exercises the path where wasmtime forcefully stops the execution of a function because it has already executed too many operations - i.e., it has run out of fuel.
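
For context, this is wasmtime's "fuel" mechanism: every guest instruction consumes fuel, and when the budget hits zero the runtime forcibly traps out of the guest. A minimal standalone sketch of that path, using the wasmtime 1.x-style Rust API (method names have shifted across versions - add_fuel was later renamed set_fuel - so treat this as an illustration, not the exact API Scylla calls):

// Assumed Cargo dependencies: wasmtime = "1", anyhow = "1".
use wasmtime::{Config, Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    // Enable fuel metering: every executed wasm instruction costs fuel.
    let mut config = Config::new();
    config.consume_fuel(true);
    let engine = Engine::new(&config)?;

    // A guest export that loops forever.
    let module = Module::new(&engine, r#"(module (func (export "run") (loop (br 0))))"#)?;

    let mut store = Store::new(&engine, ());
    store.add_fuel(10_000)?; // the execution budget

    let instance = Instance::new(&mut store, &module, &[])?;
    let run = instance.get_typed_func::<(), ()>(&mut store, "run")?;

    // Instead of spinning forever, the call is forcibly stopped with an
    // "all fuel consumed" trap - the halt-on-exhaustion path this test hits.
    let err = run.call(&mut store, ()).unwrap_err();
    println!("trapped as expected: {err}");
    Ok(())
}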

@psarna (Contributor) commented Sep 30, 2021

Another error that sometimes happens is an "OOM" during wasm compilation triggering an abort(), while the memory used by the Scylla binary is nowhere near OOM conditions...

#0  0x0000ffff9f594750 in raise () from target:/lib64/libc.so.6
#1  0x0000ffff9f5807dc in abort () from target:/lib64/libc.so.6
#2  0x0000000002d33870 in std::sys::unix::abort_internal () at library/std/src/sys/unix/mod.rs:206
#3  0x0000000002d2e920 in std::process::abort () at library/std/src/process.rs:1952
#4  0x0000000002d30410 in std::alloc::rust_oom () at library/std/src/alloc.rs:330
#5  0x0000000002d44c1c in alloc::alloc::__alloc_error_handler::__rg_oom () at library/alloc/src/alloc.rs:398
#6  0x0000000002d44c10 in alloc::alloc::handle_alloc_error () at library/alloc/src/alloc.rs:367
#7  0x0000000002c97758 in alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle ()
#8  0x0000000002c48660 in cranelift_codegen::machinst::abi_impl::abisig_to_uses_and_defs ()
#9  0x0000000002c48af8 in cranelift_codegen::machinst::abi_impl::ABICallerImpl<M>::from_ptr ()
#10 0x0000000002ca0670 in cranelift_codegen::isa::aarch64::lower_inst::lower_insn_to_regs ()
#11 0x0000000002bb60a0 in cranelift_codegen::machinst::lower::Lower<I>::lower ()
#12 0x0000000002c2f134 in cranelift_codegen::machinst::compile::compile ()
#13 0x0000000002c42f9c in <cranelift_codegen::isa::aarch64::AArch64Backend as cranelift_codegen::machinst::MachBackend>::compile_function ()
#14 0x0000000002cdf520 in cranelift_codegen::context::Context::compile ()
#15 0x0000000002cdf05c in cranelift_codegen::context::Context::compile_and_emit ()
#16 0x0000000002ab1fa4 in wasmtime_jit::trampoline::build_trampoline ()
#17 0x0000000002aa85ec in wasmtime_jit::object::build_object ()
#18 0x0000000002aad044 in wasmtime_jit::compiler::Compiler::compile ()
#19 0x0000000002abea8c in <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::try_fold ()
#20 0x0000000002abc80c in alloc::vec::source_iter_marker::<impl alloc::vec::spec_from_iter::SpecFromIter<T,I> for alloc::vec::Vec<T>>::from_iter ()
#21 0x0000000002aaf8fc in core::iter::adapters::process_results ()
#22 0x0000000002aa058c in wasmtime_jit::instantiate::CompilationArtifacts::build ()
#23 0x0000000002a2eb9c in wasmtime_cache::ModuleCacheEntry::get_data ()
#24 0x0000000002a4a2f0 in wasmtime::module::Module::from_binary ()
#25 0x00000000029a25ec in wasmtime_module_new ()
#26 0x000000000228ddc4 in wasmtime::Module::compile (engine=..., wasm=...) at lang/wasmtime.hh:1446
#27 wasmtime::Module::compile (engine=..., wat=...) at lang/wasmtime.hh:1429
#28 0x000000000228d9cc in wasm::compile (ctx=..., arg_names=..., script=...) at lang/wasm.cc:60
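
The abort in frames #2-#7 is Rust's default allocation-error handling: when an allocation request fails, handle_alloc_error() aborts the process instead of returning an error, so a refused allocation inside cranelift takes down the whole binary even though Scylla itself has memory to spare. A minimal sketch of the infallible-vs-fallible distinction in plain Rust (an illustration only - the eventual fix, per the commit list later in this thread, was to make WASM allocations fallible):

fn main() {
    let mut v: Vec<u8> = Vec::new();

    // Infallible API: if this allocation were refused, Rust would call
    // handle_alloc_error() and abort - the path in frames #4-#7 above.
    // v.reserve(usize::MAX / 2);

    // Fallible API: the caller gets a Result instead of an abort.
    match v.try_reserve(usize::MAX / 2) {
        Ok(()) => println!("reserved"),
        Err(e) => println!("allocation failed without aborting: {e}"),
    }
}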

@psarna (Contributor) commented Sep 30, 2021

According to http://docs.wasmtime.dev/api/src/wasmtime_runtime/traphandlers/unix.rs.html, wasmtime traps are implemented with a longjmp out of a signal handler, which is easy to get wrong... maybe that's what we're observing on aarch64? I don't know how else to explain a segfault that happens within a trap handler.
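
To make the mechanism concrete: a trapping guest instruction (e.g. unreachable, which compiles to ud2 on x86_64 and brk on aarch64) raises a signal, wasmtime's handler checks that the faulting PC lies in jitted wasm code, and then longjmps back to the host frame that entered the guest, surfacing the trap as an ordinary error; explicit runtime checks like fuel exhaustion are unwound the same way. A minimal sketch of the host-visible side of that path, with the same version caveats as the fuel example above:

use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // Executing `unreachable` enters wasmtime's signal handler, which
    // longjmps back to the host rather than letting the process die.
    let module = Module::new(&engine, r#"(module (func (export "boom") unreachable))"#)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let boom = instance.get_typed_func::<(), ()>(&mut store, "boom")?;
    let err = boom.call(&mut store, ()).unwrap_err();
    println!("host survived the guest trap: {err}");
    Ok(())
}

If that longjmp bookkeeping is wrong for a given architecture, the handler itself can fault, which would be consistent with the segfault inside trap_handler in the decoded backtrace above.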

@psarna (Contributor) commented Sep 30, 2021

Meanwhile, I ran the same test in a loop on my x86_64 machine - it kept passing for 40+ minutes.

psarna added a commit to psarna/scylla that referenced this issue Sep 30, 2021
There seems to be a problem with libwasmtime.a dependency on aarch64,
causing occasional segfaults during tests - specifically, tests
which exercise the path for halting wasm execution due to fuel
exhaustion. As a temporary measure, wasm is disabled on this
architecture to unblock the flow.

Refs scylladb#9387
avikivity pushed a commit that referenced this issue Sep 30, 2021
There seems to be a problem with libwasmtime.a dependency on aarch64,
causing occasional segfaults during tests - specifically, tests
which exercise the path for halting wasm execution due to fuel
exhaustion. As a temporary measure, wasm is disabled on this
architecture to unblock the flow.

Refs #9387

Closes #9414
@slivne added this to the 4.8 milestone Oct 3, 2021
@DoronArazii modified the milestones: 5.2, 5.3, 5.x Nov 22, 2022
avikivity added a commit that referenced this issue Jan 8, 2023
This series adds the implementation and usage of Rust wasmtime bindings.

The WASM UDFs introduced by this patch are interruptible and use memory allocated using the seastar allocator.

This series includes #11102 (the first two commits) because #11102 required disabling wasm UDFs completely. This patch disables them in the middle of the series, and enables them again at the end.
After this patch, `libwasmtime.a` can be removed from the toolchain.
This patch also removes the workaround for #9387, but it hasn't been tested with ARM yet - if the ARM test causes issues I'll revert this part of the change.

Closes #11351

* github.com:scylladb/scylladb:
  build: remove references to unused c bindings of wasmtime
  test: assert that WASM allocations can fail without crashing
  wasm: limit memory allocated using mmap
  wasm: add configuration options for instance cache and udf execution
  test: check that wasmtime functions yield
  wasm: use the new rust bindings of wasmtime
  rust: add Wasmtime bindings
  rust: add build profiles more aligned with ninja modes
  rust: adjust build according to cxxbridge's recommendations
  tools: toolchain: dbuild: prepare for sharing cargo cache

@kostja (Contributor) commented May 2, 2024

Appears to have been addressed by 02c9968

@kostja closed this as completed May 2, 2024