Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

types_test fails intermittently #2967

Closed
hakuch opened this issue Nov 14, 2017 · 20 comments
Closed

types_test fails intermittently #2967

hakuch opened this issue Nov 14, 2017 · 20 comments
Labels
Milestone

Comments

@hakuch
Copy link
Contributor

hakuch commented Nov 14, 2017

Installation details
Scylla version (or git commit hash): 484dde6

The following failure in test_short_type_string_conversions happens roughly 10% of the time for me. I believe the warnings are expected and unrelated.

Running 48 test cases...
unknown location(0): fatal error: in "test_short_type_string_conversions": signal: SIGABRT (application abort requested)
tests/types_test.cc(103): last checkpoint
WARN  2017-11-14 13:08:18,708 [shard 0] seastar - Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE
WARN  2017-11-14 13:08:18,708 [shard 0] org.apache.cassandra.db.marshal.TimestampType - Changing from DateType to TimestampType is allowed, but be wary that they sort differently for pre-unix-epoch timestamps (negative timestamp values) and thus this change will corrupt your data if you have such negative timestamp. So unless you know that you don't have *any* pre-unix-epoch timestamp you should change back to DateType
WARN  2017-11-14 13:08:18,708 [shard 0] org.apache.cassandra.db.marshal.DateType - Changing from TimestampType to DateType is allowed, but be wary that they sort differently for pre-unix-epoch timestamps (negative timestamp values) and thus this change will corrupt your data if you have such negative timestamp. There is no reason to switch from DateType to TimestampType except if you were using DateType in the first place and switched to TimestampType by mistake.

*** 1 failure is detected in the test module "tests/types_test.cc"
@hakuch
Copy link
Contributor Author

hakuch commented Nov 14, 2017

There's nothing obvious in the source file that would trigger the error. It's possible this is another bug, or a problem with my particular machine. On the other hand, I haven't observed this failure with any other test.

I'll supply more information if/when I get it.

@tgrabiec
Copy link
Contributor

The test fails on SIGABRT, if run the test under gdb it would get trapped and we could get a backtrace.

@hakuch
Copy link
Contributor Author

hakuch commented Nov 14, 2017

With types_test_g:

Thread 1 "types_test_g" received signal SIGABRT, Aborted.
0x00007ffff2feb9fb in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff2feb9fb in raise () from /lib64/libc.so.6
#1  0x00007ffff2fed800 in abort () from /lib64/libc.so.6
#2  0x00007ffff571ce01 in ?? () from /lib64/libgcc_s.so.1
#3  0x00007ffff571d17e in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#4  0x00007ffff7245e97 in __cxa_throw () from /lib64/libstdc++.so.6
#5  0x0000000000d02bc6 in integer_type_impl<short>::parse_int (s="32768", this=0x60000016f160) at types.cc:216
#6  integer_type_impl<short>::from_string (this=0x60000016f160, s=...) at types.cc:224
#7  0x00000000006bfc23 in test_parsing_fails (type=..., str=...) at tests/types_test.cc:43
#8  0x00000000006c366a in test_short_type_string_conversions::test_method (this=this@entry=0x7fffffffc91c) at tests/types_test.cc:107
#9  0x00000000006c40fa in test_short_type_string_conversions_invoker () at tests/types_test.cc:83
#10 0x0000000001819eab in boost::function0<void>::operator() (this=<optimized out>) at /usr/include/boost/function/function_template.hpp:770
#11 boost::detail::forward::operator() (this=<optimized out>) at /usr/include/boost/test/impl/execution_monitor.ipp:1300
#12 boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke (function_obj_ptr=...) at /usr/include/boost/function/function_template.hpp:138
#13 0x00000000017fad4d in boost::function0<int>::operator() (this=0x7fffffffd640) at /usr/include/boost/function/function_template.hpp:771
#14 boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>, boost::function<int ()> >(boost::shared_ptr<boost::detail::translator_holder_base> const&, boost::function<int ()> const&) (F=..., tr=...)
    at /usr/include/boost/test/impl/execution_monitor.ipp:281
#15 boost::execution_monitor::catch_signals(boost::function<int ()> const&) (this=this@entry=0x260c940 <boost::unit_test::singleton<boost::unit_test::unit_test_monitor_t>::instance()::the_inst>, F=...)
    at /usr/include/boost/test/impl/execution_monitor.ipp:864
#16 0x00000000017fae26 in boost::execution_monitor::execute(boost::function<int ()> const&) (this=this@entry=0x260c940 <boost::unit_test::singleton<boost::unit_test::unit_test_monitor_t>::instance()::the_inst>, F=...)
    at /usr/include/boost/test/impl/execution_monitor.ipp:1203
#17 0x00000000017fb54a in boost::execution_monitor::vexecute(boost::function<void ()> const&) (this=this@entry=0x260c940 <boost::unit_test::singleton<boost::unit_test::unit_test_monitor_t>::instance()::the_inst>, F=...)
    at /usr/include/boost/test/impl/execution_monitor.ipp:1309
#18 0x0000000001800bc3 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned int) (this=0x260c940 <boost::unit_test::singleton<boost::unit_test::unit_test_monitor_t>::instance()::the_inst>, 
    func=..., timeout=timeout@entry=0) at /usr/include/boost/test/impl/unit_test_monitor.ipp:46
#19 0x000000000182a0c7 in boost::unit_test::framework::state::execute_test_tree (this=this@entry=0x260bd60 <boost::unit_test::framework::impl::(anonymous namespace)::s_frk_state()::the_inst>, tu_id=<optimized out>, timeout=0, 
    p_random_generator=p_random_generator@entry=0x0) at /usr/include/boost/test/impl/framework.ipp:717
#20 0x000000000182a2f8 in boost::unit_test::framework::state::execute_test_tree (this=0x260bd60 <boost::unit_test::framework::impl::(anonymous namespace)::s_frk_state()::the_inst>, tu_id=tu_id@entry=1, timeout=timeout@entry=0, 
    p_random_generator=p_random_generator@entry=0x0) at /usr/include/boost/test/impl/framework.ipp:666
#21 0x0000000001809242 in boost::unit_test::framework::run (id=1, id@entry=4294967295, continue_test=continue_test@entry=true) at /usr/include/boost/test/impl/framework.ipp:1442
#22 0x0000000001812bef in boost::unit_test::unit_test_main (init_func=<optimized out>, argc=<optimized out>, argv=<optimized out>) at /usr/include/boost/test/impl/unit_test_main.ipp:231
#23 0x00007ffff2fd588a in __libc_start_main () from /lib64/libc.so.6
#24 0x000000000049fa5a in _start ()

@hakuch
Copy link
Contributor Author

hakuch commented Nov 14, 2017

And, interestingly, with types_test (no debugging symbols):

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
  what():  bad lexical cast: source type value could not be interpreted as target

Thread 1 "types_test" received signal SIGABRT, Aborted.

@tgrabiec
Copy link
Contributor

Is that release mode or debug mode? What's your gcc version?

@hakuch
Copy link
Contributor Author

hakuch commented Nov 14, 2017

In release mode. In debug mode, I wasn't (quickly) able to reproduce the error.

GCC version (Fedora 26):

gcc (GCC) 7.2.1 20170915 (Red Hat 7.2.1-2)
Copyright (C) 2017 Free Software Foundation, Inc.

@glommer
Copy link
Contributor

glommer commented Nov 14, 2017 via email

@hakuch
Copy link
Contributor Author

hakuch commented Nov 14, 2017

You're right, @glommer.

Looking at the backtrace, the exception is coming from test_parsing_fails(). That code catches marshal_exception.

What's weird at first glance is that if the problem is that the failed cast is not being converted to marshal_exception, why the heck would the failure be intermittent?

@tgrabiec
Copy link
Contributor

The test calls the parser on invalid value deliberately (test_parsing_fails), and expects an exception. The problem is that exception is not propagated to the caller but rather results in abort. It shouldn't abort.

@tgrabiec
Copy link
Contributor

Maybe it has something to do with the exception scalability workaround. @gleb-cloudius @avikivity Any idea?

@hakuch
Copy link
Contributor Author

hakuch commented Nov 14, 2017

@tgrabiec, yes. I attempted to say the same thing, but did so less clearly. :-)

@avikivity
Copy link
Member

@tgrabiec could be, easily tested by disabling it. There's a seastar configuration option for it (--exception-scalability-workaround).

We have our own _Unwind_RaiseException, but it looks like it's the gcc version that's running (our version calls gcc's).

@avikivity
Copy link
Member

I think I see it, the reactor is initialized multiple times, and so is our hack, but it probably doesn't survive multiple initializations.

@tzach tzach added the bug label Nov 15, 2017
@tzach tzach added this to the 2.1 milestone Nov 15, 2017
@gleb-cloudius
Copy link
Contributor

The problem is that some tests (those that are not SEASTAR_TEST_CASE) are running in parallel with reactor initialization and hence may throw while exception initialization is running which will cause the failure we see here. Not sure what solution should be. Changing all tests to be SEASTAR_TEST_CASE sounds impractical. Another option is to disabling the hack for the tests somehow, or initialize it before any test case explicitly.

@avikivity
Copy link
Member

Why is changing everything to SEASTAR_TEST_CASE impractical? It's a search-n-replace-n-add-return make_ready_future<>(), no?

@gleb-cloudius
Copy link
Contributor

gleb-cloudius commented Nov 20, 2017 via email

@avikivity
Copy link
Member

I don't think there is a performance concern here. I think it's fine to say that if one test is a SEASTAR_TEST_CASE, then all should be as well.

We could also fix the hack, but it's probably more involved (how? check if the reactor was initialized, and make sure that we flag the reactor as initialized only after the hack is done?)

@gleb-cloudius
Copy link
Contributor

gleb-cloudius commented Nov 20, 2017 via email

@avikivity
Copy link
Member

avikivity commented Nov 20, 2017 via email

@gleb-cloudius
Copy link
Contributor

Ref scylladb/seastar#359

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants