Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guarantee failed: [gc_state.current_entry == __null] #2003

Closed
danielmewes opened this issue Feb 24, 2014 · 10 comments
Closed

Guarantee failed: [gc_state.current_entry == __null] #2003

danielmewes opened this issue Feb 24, 2014 · 10 comments
Labels
Milestone

Comments

@danielmewes
Copy link
Member

I got this crash twice when joining a new node to a cluster.

error: Error in src/serializer/log/data_block_manager.cc at line 1211:
error: Guarantee failed: [gc_state.current_entry == __null] 0x5f40ec0: 520192 garbage bytes left on the extent, 4096 index-referenced bytes, 0 token-referenced bytes, at offset 3145728.  block dump:
       [3145728..+4096)   
       [3149824..+4096)   
       [3153920..+4096)   
       [3158016..+4096)   
       [3162112..+4096)   
       [3166208..+4096)   
       [3170304..+4096)   
       [3174400..+4096)   
       [3178496..+4096)   
       [3182592..+4096)   
       [3186688..+4096)   
       [3190784..+4096)   
       [3194880..+4096)   
       [3198976..+4096)   
       [3203072..+4096)   
       [3207168..+4096)   
       [3211264..+4096)   
       [3215360..+4096)   
       [3219456..+4096)   
       [3223552..+4096)   
       [3227648..+4096)   
       [3231744..+4096)   
       [3235840..+4096)   
       [3239936..+4096)   
       [3244032..+4096)   
       [3248128..+4096)   
       [3252224..+4096)   
       [3256320..+4096)   
       [3260416..+4096)   
       [3264512..+4096)   
       [3268608..+4096)   
       [3272704..+4096)   
       [3276800..+4096)   
       [3280896..+4096)   
       [3284992..+4096)   
       [3289088..+4096)   
       [3293184..+4096)   
       [3297280..+4096)   
       [3301376..+4096)   
       [3305472..+4096)   
       [3309568..+4096)   
       [3313664..+4096)   
       [3317760..+4096)   
       [3321856..+4096)   
       [3325952..+4096)   
       [3330048..+4096)   
       [3334144..+4096)   
       [3338240..+4096)   
       [3342336..+4096)   
       [3346432..+4096)   
       [3350528..+4096)   
       [3354624..+4096)   
       [3358720..+4096)   
       [3362816..+4096)   
       [3366912..+4096)   
       [3371008..+4096)   
       [3375104..+4096)   
       [3379200..+4096)   
       [3383296..+4096)   
       [3387392..+4096)   
       [3391488..+4096)   
       [3395584..+4096)   
       [3399680..+4096)   
       [3403776..+4096)   
       [3407872..+4096)   
       [3411968..+4096)   
       [3416064..+4096)   
       [3420160..+4096)   
       [3424256..+4096)   
       [3428352..+4096)   
       [3432448..+4096)   
       [3436544..+4096)   
       [3440640..+4096)   
       [3444736..+4096)   
       [3448832..+4096)   
       [3452928..+4096)   
       [3457024..+4096)   
       [3461120..+4096)   
       [3465216..+4096)   
       [3469312..+4096)   
       [3473408..+4096)  I
       [3477504..+4096)   
       [3481600..+4096)   
       [3485696..+4096)   
       [3489792..+4096)   
       [3493888..+4096)   
       [3497984..+4096)   
       [3502080..+4096)   
       [3506176..+4096)   
       [3510272..+4096)   
       [3514368..+4096)   
       [3518464..+4096)   
       [3522560..+4096)   
       [3526656..+4096)   
       [3530752..+4096)   
       [3534848..+4096)   
       [3538944..+4096)   
       [3543040..+4096)   
       [3547136..+4096)   
       [3551232..+4096)   
       [3555328..+4096)   
       [3559424..+4096)   
       [3563520..+4096)   
       [3567616..+4096)   
       [3571712..+4096)   
       [3575808..+4096)   
       [3579904..+4096)   
       [3584000..+4096)   
       [3588096..+4096)   
       [3592192..+4096)   
       [3596288..+4096)   
       [3600384..+4096)   
       [3604480..+4096)   
       [3608576..+4096)   
       [3612672..+4096)   
       [3616768..+4096)   
       [3620864..+4096)   
       [3624960..+4096)   
       [3629056..+4096)   
       [3633152..+4096)   
       [3637248..+4096)   
       [3641344..+4096)   
       [3645440..+4096)   
       [3649536..+4096)   
       [3653632..+4096)   
       [3657728..+4096)   
       [3661824..+4096)   
       [3665920..+4096)   
error: Backtrace:
error: Sun Feb 23 21:40:36 2014

       1: rethinkdb_backtrace(void**, int) at rethinkdb_backtrace.cc:99
       2: backtrace_t::backtrace_t() at backtrace.cc:202
       3: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at backtrace.cc:282
       4: format_backtrace(bool) at backtrace.cc:197
       5: report_fatal_error(char const*, int, char const*, ...) at errors.cc:76
       6: data_block_manager_t::run_gc() at data_block_manager.cc:1201
       7: data_block_manager_t::on_gc_write_done() at data_block_manager.cc:1043
       8: data_block_manager_t::gc_writer_t::write_gcs(data_block_manager_t::gc_write_t*, unsigned long) at data_block_manager.cc:1032
       9: boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>::operator()(data_block_manager_t::gc_writer_t*, data_block_manager_t::gc_write_t*, unsigned long) const at mem_fn_template.hpp:281
       10: void boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> >::operator()<boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>&, boost::_bi::list0&, int) at bind.hpp:393
       11: boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > >::operator()() at bind_template.hpp:21
       12: callable_action_instance_t<boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > > >::run_action() at callable_action.hpp:28
       13: callable_action_wrapper_t::run() at runtime_utils.cc:43
       14: coro_t::run() at coroutines.cc:197
       15: coro_t* coro_t::spawn_later_ordered<boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > > >(boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > > const&) at coroutines.hpp:68
       16: data_block_manager_t::gc_writer_t::gc_writer_t(data_block_manager_t::gc_write_t*, unsigned long, data_block_manager_t*) at data_block_manager.cc:918
       17: data_block_manager_t::run_gc() at data_block_manager.cc:1185
       18: data_block_manager_t::gc_read_callback_t::on_io_complete() at data_block_manager.hpp:138
       19: std::_Mem_fn<void (linux_iocallback_t::*)()>::operator()(linux_iocallback_t*) const at functional:551
       20: void std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)>::__call_c<void, , 0>(std::tuple<>&&, std::_Index_tuple<0>) const at functional:1156
       21: void std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)>::operator()<, void>() const at functional:1220
       22: thread_doer_t<std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)> >::do_perform_job() at do_on_thread.hpp:37
       23: thread_doer_t<std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)> >::on_thread_switch() at do_on_thread.hpp:54
       24: linux_message_hub_t::on_event(int) at message_hub.cc:154
       25: epoll_event_queue_t::run() at epoll.cc:115
       26: linux_thread_pool_t::start_thread(void*) at thread_pool.cc:159
       27: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fbda0623e9a] at 0x7fbda0623e9a (/lib/x86_64-linux-gnu/libpthread.so.0)
       28: clone+0x6d at 0x7fbda03503fd (/lib/x86_64-linux-gnu/libc.so.6)
error: Exiting.
Trace/breakpoint trap (core dumped)

I did something like this:

  • Started two nodes. Created a table and inserted some data (140K empty documents)
  • Set the number of replicas to 2
  • Shut down the secondary. Deleted its data directory.
  • Started up a new node with a fresh data directory. Marked the missing node as dead in the cluster. Backfilling would start immediately to make the new node a secondary (remember: the replica count of the table is still set to 2).
  • Waited a minute or so. The fresh node crashed.

I might actually have done something slightly different, because I can't reproduce it right now. I also had a compilation job running in the background, so there was some disk and CPU activity.

@danielmewes
Copy link
Member Author

I have a slight suspicion that this has to do with running out of memory and some allocation failing. I previously had crashes with std::bad_alloc() (/proc/sys/vm/overcommit_ratio is set to 2 on this system). Maybe we aren't handling allocation failures correctly somewhere, and end up with memory corruption?

@danielmewes
Copy link
Member Author

Just got it again, this time on the existing node instead of the fresh one. I joined a fresh node, marked the previous node dead, waited about 10 seconds and then the existing node crashed with pretty much the same error.
I had started a few extra compiler instances to fill up memory this time.

error: Error in src/serializer/log/data_block_manager.cc at line 1211:
error: Guarantee failed: [gc_state.current_entry == __null] 0x5f68160: 520192 garbage bytes left on the extent, 4096 index-referenced bytes, 0 token-referenced bytes, at offset 3145728.  block dump:
       [3145728..+4096)   
       [3149824..+4096)   
       [3153920..+4096)   
       [3158016..+4096)   
       [3162112..+4096)   
       [3166208..+4096)   
       [3170304..+4096)   
       [3174400..+4096)   
       [3178496..+4096)   
       [3182592..+4096)   
       [3186688..+4096)   
       [3190784..+4096)   
       [3194880..+4096)   
       [3198976..+4096)   
       [3203072..+4096)   
       [3207168..+4096)   
       [3211264..+4096)   
       [3215360..+4096)   
       [3219456..+4096)   
       [3223552..+4096)   
       [3227648..+4096)   
       [3231744..+4096)   
       [3235840..+4096)   
       [3239936..+4096)   
       [3244032..+4096)   
       [3248128..+4096)   
       [3252224..+4096)   
       [3256320..+4096)   
       [3260416..+4096)   
       [3264512..+4096)   
       [3268608..+4096)   
       [3272704..+4096)   
       [3276800..+4096)   
       [3280896..+4096)   
       [3284992..+4096)   
       [3289088..+4096)   
       [3293184..+4096)   
       [3297280..+4096)   
       [3301376..+4096)   
       [3305472..+4096)   
       [3309568..+4096)   
       [3313664..+4096)   
       [3317760..+4096)   
       [3321856..+4096)   
       [3325952..+4096)   
       [3330048..+4096)   
       [3334144..+4096)   
       [3338240..+4096)   
       [3342336..+4096)   
       [3346432..+4096)   
       [3350528..+4096)   
       [3354624..+4096)   
       [3358720..+4096)   
       [3362816..+4096)   
       [3366912..+4096)   
       [3371008..+4096)   
       [3375104..+4096)   
       [3379200..+4096)   
       [3383296..+4096)   
       [3387392..+4096)   
       [3391488..+4096)   
       [3395584..+4096)   
       [3399680..+4096)   
       [3403776..+4096)   
       [3407872..+4096)   
       [3411968..+4096)   
       [3416064..+4096)   
       [3420160..+4096)   
       [3424256..+4096)   
       [3428352..+4096)   
       [3432448..+4096)   
       [3436544..+4096)   
       [3440640..+4096)   
       [3444736..+4096)   
       [3448832..+4096)   
       [3452928..+4096)   
       [3457024..+4096)   
       [3461120..+4096)   
       [3465216..+4096)   
       [3469312..+4096)   
       [3473408..+4096)   
       [3477504..+4096)   
       [3481600..+4096)   
       [3485696..+4096)   
       [3489792..+4096)   
       [3493888..+4096)   
       [3497984..+4096)   
       [3502080..+4096)   
       [3506176..+4096)   
       [3510272..+4096)   
       [3514368..+4096)   
       [3518464..+4096)   
       [3522560..+4096)   
       [3526656..+4096)   
       [3530752..+4096)   
       [3534848..+4096)   
       [3538944..+4096)   
       [3543040..+4096)   
       [3547136..+4096)   
       [3551232..+4096)   
       [3555328..+4096)   
       [3559424..+4096)   
       [3563520..+4096)   
       [3567616..+4096)   
       [3571712..+4096)   
       [3575808..+4096)   
       [3579904..+4096)   
       [3584000..+4096)   
       [3588096..+4096)   
       [3592192..+4096)   
       [3596288..+4096)   
       [3600384..+4096)   
       [3604480..+4096)   
       [3608576..+4096)   
       [3612672..+4096)   
       [3616768..+4096)   
       [3620864..+4096)   
       [3624960..+4096)   
       [3629056..+4096)   
       [3633152..+4096)   
       [3637248..+4096)   
       [3641344..+4096)   
       [3645440..+4096)   
       [3649536..+4096)   
       [3653632..+4096)   
       [3657728..+4096)   
       [3661824..+4096)   
       [3665920..+4096)  I
error: Backtrace:
error: Sun Feb 23 22:24:08 2014

       1: rethinkdb_backtrace(void**, int) at rethinkdb_backtrace.cc:99
       2: backtrace_t::backtrace_t() at backtrace.cc:202
       3: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at backtrace.cc:282
       4: format_backtrace(bool) at backtrace.cc:197
       5: report_fatal_error(char const*, int, char const*, ...) at errors.cc:76
       6: data_block_manager_t::run_gc() at data_block_manager.cc:1201
       7: data_block_manager_t::on_gc_write_done() at data_block_manager.cc:1043
       8: data_block_manager_t::gc_writer_t::write_gcs(data_block_manager_t::gc_write_t*, unsigned long) at data_block_manager.cc:1032
       9: boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>::operator()(data_block_manager_t::gc_writer_t*, data_block_manager_t::gc_write_t*, unsigned long) const at mem_fn_template.hpp:281
       10: void boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> >::operator()<boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list0>(boost::_bi::type<void>, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>&, boost::_bi::list0&, int) at bind.hpp:393
       11: boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > >::operator()() at bind_template.hpp:21
       12: callable_action_instance_t<boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > > >::run_action() at callable_action.hpp:28
       13: callable_action_wrapper_t::run() at runtime_utils.cc:43
       14: coro_t::run() at coroutines.cc:197
       15: coro_t* coro_t::spawn_later_ordered<boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > > >(boost::_bi::bind_t<void, boost::_mfi::mf2<void, data_block_manager_t::gc_writer_t, data_block_manager_t::gc_write_t*, unsigned long>, boost::_bi::list3<boost::_bi::value<data_block_manager_t::gc_writer_t*>, boost::_bi::value<data_block_manager_t::gc_write_t*>, boost::_bi::value<unsigned long> > > const&) at coroutines.hpp:68
       16: data_block_manager_t::gc_writer_t::gc_writer_t(data_block_manager_t::gc_write_t*, unsigned long, data_block_manager_t*) at data_block_manager.cc:918
       17: data_block_manager_t::run_gc() at data_block_manager.cc:1185
       18: data_block_manager_t::gc_read_callback_t::on_io_complete() at data_block_manager.hpp:138
       19: std::_Mem_fn<void (linux_iocallback_t::*)()>::operator()(linux_iocallback_t*) const at functional:551
       20: void std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)>::__call<void, , 0>(std::tuple<>&&, std::_Index_tuple<0>) at functional:1147
       21: void std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)>::operator()<, void>() at functional:1206
       22: void do_on_thread<std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)> >(threadnum_t, std::_Bind<std::_Mem_fn<void (linux_iocallback_t::*)()> (linux_iocallback_t*)>&&) at do_on_thread.hpp:81
       23: linux_disk_manager_t::done(stats_diskmgr_t::action_t*) at disk.cc:146
       24: std::_Mem_fn<void (linux_disk_manager_t::*)(stats_diskmgr_t::action_t*)>::operator()(linux_disk_manager_t*, stats_diskmgr_t::action_t*) const at functional:551
       25: void std::_Bind<std::_Mem_fn<void (linux_disk_manager_t::*)(stats_diskmgr_t::action_t*)> (linux_disk_manager_t*, std::_Placeholder<1>)>::__call<void, stats_diskmgr_t::action_t*&, 0, 1>(std::tuple<stats_diskmgr_t::action_t*&>&&, std::_Index_tuple<0, 1>) at functional:1147
       26: void std::_Bind<std::_Mem_fn<void (linux_disk_manager_t::*)(stats_diskmgr_t::action_t*)> (linux_disk_manager_t*, std::_Placeholder<1>)>::operator()<stats_diskmgr_t::action_t*&, void>(stats_diskmgr_t::action_t*&&&) at functional:1206
       27: boost::detail::function::void_function_obj_invoker1<std::_Bind<std::_Mem_fn<void (linux_disk_manager_t::*)(stats_diskmgr_t::action_t*)> (linux_disk_manager_t*, std::_Placeholder<1>)>, void, stats_diskmgr_t::action_t*>::invoke(boost::detail::function::function_buffer&, stats_diskmgr_t::action_t*) at function_template.hpp:154
       28: boost::function1<void, stats_diskmgr_t::action_t*>::operator()(stats_diskmgr_t::action_t*) const at function_template.hpp:1014
       29: stats_diskmgr_t::done(conflict_resolving_diskmgr_action_t*) at stats.cc:28
       30: std::_Mem_fn<void (stats_diskmgr_t::*)(conflict_resolving_diskmgr_action_t*)>::operator()(stats_diskmgr_t*, conflict_resolving_diskmgr_action_t*) const at functional:551
       31: void std::_Bind<std::_Mem_fn<void (stats_diskmgr_t::*)(conflict_resolving_diskmgr_action_t*)> (stats_diskmgr_t*, std::_Placeholder<1>)>::__call<void, conflict_resolving_diskmgr_action_t*&, 0, 1>(std::tuple<conflict_resolving_diskmgr_action_t*&>&&, std::_Index_tuple<0, 1>) at functional:1147
       32: void std::_Bind<std::_Mem_fn<void (stats_diskmgr_t::*)(conflict_resolving_diskmgr_action_t*)> (stats_diskmgr_t*, std::_Placeholder<1>)>::operator()<conflict_resolving_diskmgr_action_t*&, void>(conflict_resolving_diskmgr_action_t*&&&) at functional:1206
       33: boost::detail::function::void_function_obj_invoker1<std::_Bind<std::_Mem_fn<void (stats_diskmgr_t::*)(conflict_resolving_diskmgr_action_t*)> (stats_diskmgr_t*, std::_Placeholder<1>)>, void, conflict_resolving_diskmgr_action_t*>::invoke(boost::detail::function::function_buffer&, conflict_resolving_diskmgr_action_t*) at function_template.hpp:154
       34: boost::function1<void, conflict_resolving_diskmgr_action_t*>::operator()(conflict_resolving_diskmgr_action_t*) const at function_template.hpp:1014
       35: conflict_resolving_diskmgr_t::done(accounting_diskmgr_action_t*) at conflict_resolving.cc:239
       36: std::_Mem_fn<void (conflict_resolving_diskmgr_t::*)(accounting_diskmgr_action_t*)>::operator()(conflict_resolving_diskmgr_t*, accounting_diskmgr_action_t*) const at functional:551
       37: void std::_Bind<std::_Mem_fn<void (conflict_resolving_diskmgr_t::*)(accounting_diskmgr_action_t*)> (conflict_resolving_diskmgr_t*, std::_Placeholder<1>)>::__call<void, accounting_diskmgr_action_t*&, 0, 1>(std::tuple<accounting_diskmgr_action_t*&>&&, std::_Index_tuple<0, 1>) at functional:1147
       38: void std::_Bind<std::_Mem_fn<void (conflict_resolving_diskmgr_t::*)(accounting_diskmgr_action_t*)> (conflict_resolving_diskmgr_t*, std::_Placeholder<1>)>::operator()<accounting_diskmgr_action_t*&, void>(accounting_diskmgr_action_t*&&&) at functional:1206
       39: boost::detail::function::void_function_obj_invoker1<std::_Bind<std::_Mem_fn<void (conflict_resolving_diskmgr_t::*)(accounting_diskmgr_action_t*)> (conflict_resolving_diskmgr_t*, std::_Placeholder<1>)>, void, accounting_diskmgr_action_t*>::invoke(boost::detail::function::function_buffer&, accounting_diskmgr_action_t*) at function_template.hpp:154
       40: boost::function1<void, accounting_diskmgr_action_t*>::operator()(accounting_diskmgr_action_t*) const at function_template.hpp:1014
       41: accounting_diskmgr_t::done(stats_diskmgr_2_action_t*) at accounting.cc:107
       42: std::_Mem_fn<void (accounting_diskmgr_t::*)(stats_diskmgr_2_action_t*)>::operator()(accounting_diskmgr_t*, stats_diskmgr_2_action_t*) const at functional:551
       43: void std::_Bind<std::_Mem_fn<void (accounting_diskmgr_t::*)(stats_diskmgr_2_action_t*)> (accounting_diskmgr_t*, std::_Placeholder<1>)>::__call<void, stats_diskmgr_2_action_t*&, 0, 1>(std::tuple<stats_diskmgr_2_action_t*&>&&, std::_Index_tuple<0, 1>) at functional:1147
       44: void std::_Bind<std::_Mem_fn<void (accounting_diskmgr_t::*)(stats_diskmgr_2_action_t*)> (accounting_diskmgr_t*, std::_Placeholder<1>)>::operator()<stats_diskmgr_2_action_t*&, void>(stats_diskmgr_2_action_t*&&&) at functional:1206
       45: boost::detail::function::void_function_obj_invoker1<std::_Bind<std::_Mem_fn<void (accounting_diskmgr_t::*)(stats_diskmgr_2_action_t*)> (accounting_diskmgr_t*, std::_Placeholder<1>)>, void, stats_diskmgr_2_action_t*>::invoke(boost::detail::function::function_buffer&, stats_diskmgr_2_action_t*) at function_template.hpp:154
       46: boost::function1<void, stats_diskmgr_2_action_t*>::operator()(stats_diskmgr_2_action_t*) const at function_template.hpp:1014
       47: stats_diskmgr_2_t::done(pool_diskmgr_action_t*) at stats_2.cc:39
       48: std::_Mem_fn<void (stats_diskmgr_2_t::*)(pool_diskmgr_action_t*)>::operator()(stats_diskmgr_2_t*, pool_diskmgr_action_t*) const at functional:551
       49: void std::_Bind<std::_Mem_fn<void (stats_diskmgr_2_t::*)(pool_diskmgr_action_t*)> (stats_diskmgr_2_t*, std::_Placeholder<1>)>::__call<void, pool_diskmgr_action_t*&, 0, 1>(std::tuple<pool_diskmgr_action_t*&>&&, std::_Index_tuple<0, 1>) at functional:1147
       50: void std::_Bind<std::_Mem_fn<void (stats_diskmgr_2_t::*)(pool_diskmgr_action_t*)> (stats_diskmgr_2_t*, std::_Placeholder<1>)>::operator()<pool_diskmgr_action_t*&, void>(pool_diskmgr_action_t*&&&) at functional:1206
       51: boost::detail::function::void_function_obj_invoker1<std::_Bind<std::_Mem_fn<void (stats_diskmgr_2_t::*)(pool_diskmgr_action_t*)> (stats_diskmgr_2_t*, std::_Placeholder<1>)>, void, pool_diskmgr_action_t*>::invoke(boost::detail::function::function_buffer&, pool_diskmgr_action_t*) at function_template.hpp:154
       52: boost::function1<void, pool_diskmgr_action_t*>::operator()(pool_diskmgr_action_t*) const at function_template.hpp:1014
       53: pool_diskmgr_action_t::done() at pool.cc:129
       54: blocker_pool_t::on_event(int) at blocker_pool.cc:156
       55: epoll_event_queue_t::run() at epoll.cc:115
       56: linux_thread_pool_t::start_thread(void*) at thread_pool.cc:159
       57: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7ff681da2e9a] at 0x7ff681da2e9a (/lib/x86_64-linux-gnu/libpthread.so.0)
       58: clone+0x6d at 0x7ff681acf3fd (/lib/x86_64-linux-gnu/libc.so.6)

@danielmewes
Copy link
Member Author

Now when I want to restart the node, I get:

error: Error in src/clustering/administration/persist.cc at line 75:
error: Guarantee failed: [res == archive_result_t::SUCCESS] Deserialization of T (template code) failed with error archive_result_t::SOCK_EOF.
error: Backtrace:
error: Sun Feb 23 22:59:24 2014

       1: rethinkdb_backtrace(void**, int) at rethinkdb_backtrace.cc:99
       2: backtrace_t::backtrace_t() at backtrace.cc:202
       3: lazy_backtrace_formatter_t::lazy_backtrace_formatter_t() at backtrace.cc:282
       4: format_backtrace(bool) at backtrace.cc:197
       5: report_fatal_error(char const*, int, char const*, ...) at errors.cc:76
       6: ../debug/rethinkdb() [0x26770fa] at 0x26770fa ()
       7: metadata_persistence::cluster_persistent_file_t::persistent_branch_history_manager_t<mock::dummy_protocol_t>::persistent_branch_history_manager_t(metadata_persistence::cluster_persistent_file_t*, char (metadata_persistence::cluster_metadata_superblock_t::*) [500], bool) at persist.cc:322
       8: metadata_persistence::cluster_persistent_file_t::construct_branch_history_managers(bool) at persist.cc:425
       9: metadata_persistence::cluster_persistent_file_t::cluster_persistent_file_t(io_backender_t*, serializer_filepath_t const&, perfmon_collection_t*) at persist.cc:222
       10: run_rethinkdb_serve(base_path_t const&, serve_info_t const&, file_direct_io_mode_t, int, uuid_u const*, cluster_semilattice_metadata_t const*, directory_lock_t*, bool*) at command_line.cc:749
       11: run_rethinkdb_porcelain(base_path_t const&, name_string_t const&, file_direct_io_mode_t, int, bool, serve_info_t const&, directory_lock_t*, bool*) at command_line.cc:826
       12: void std::_Bind<void (*(base_path_t, name_string_t, file_direct_io_mode_t, int, bool, serve_info_t, directory_lock_t*, bool*))(base_path_t const&, name_string_t const&, file_direct_io_mode_t, int, bool, serve_info_t const&, directory_lock_t*, bool*)>::__call<void, , 0, 1, 2, 3, 4, 5, 6, 7>(std::tuple<>&&, std::_Index_tuple<0, 1, 2, 3, 4, 5, 6, 7>) at functional:1147
       13: void std::_Bind<void (*(base_path_t, name_string_t, file_direct_io_mode_t, int, bool, serve_info_t, directory_lock_t*, bool*))(base_path_t const&, name_string_t const&, file_direct_io_mode_t, int, bool, serve_info_t const&, directory_lock_t*, bool*)>::operator()<, void>() at functional:1206
       14: std::_Function_handler<void (), std::_Bind<void (*(base_path_t, name_string_t, file_direct_io_mode_t, int, bool, serve_info_t, directory_lock_t*, bool*))(base_path_t const&, name_string_t const&, file_direct_io_mode_t, int, bool, serve_info_t const&, directory_lock_t*, bool*)> >::_M_invoke(std::_Any_data const&) at functional:1780
       15: std::function<void ()>::operator()() const at functional:2162
       16: starter_t::run_wrapper(std::function<void ()> const&) at runtime.cc:59
       17: std::_Mem_fn<void (starter_t::*)(std::function<void ()> const&)>::operator()(starter_t*, std::function<void ()> const&) const at functional:551
       18: void std::_Bind<std::_Mem_fn<void (starter_t::*)(std::function<void ()> const&)> (starter_t*, std::function<void ()>)>::__call<void, , 0, 1>(std::tuple<>&&, std::_Index_tuple<0, 1>) at functional:1147
       19: void std::_Bind<std::_Mem_fn<void (starter_t::*)(std::function<void ()> const&)> (starter_t*, std::function<void ()>)>::operator()<, void>() at functional:1206
       20: std::_Function_handler<void (), std::_Bind<std::_Mem_fn<void (starter_t::*)(std::function<void ()> const&)> (starter_t*, std::function<void ()>)> >::_M_invoke(std::_Any_data const&) at functional:1780
       21: std::function<void ()>::operator()() const at functional:2162
       22: callable_action_instance_t<std::function<void ()> >::run_action() at callable_action.hpp:28
       23: callable_action_wrapper_t::run() at runtime_utils.cc:43
       24: coro_t::run() at coroutines.cc:197
       25: coro_t* coro_t::spawn_sometime<std::function<void ()> >(std::function<void ()> const&) at coroutines.hpp:58
       26: starter_t::on_thread_switch() at runtime.cc:55
       27: linux_message_hub_t::on_event(int) at message_hub.cc:154
       28: epoll_event_queue_t::run() at epoll.cc:115
       29: linux_thread_pool_t::start_thread(void*) at thread_pool.cc:159
       30: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f01e1e23e9a] at 0x7f01e1e23e9a (/lib/x86_64-linux-gnu/libpthread.so.0)
       31: clone+0x6d at 0x7f01e1b503fd (/lib/x86_64-linux-gnu/libc.so.6)
error: Exiting.

@danielmewes
Copy link
Member Author

Crashing a RethinkDB server a couple of times while it's being backfilled to by means of limiting its memory corrupted the table:

Version: rethinkdb foo (debug) (CLANG 3.2 (tags/RELEASE_32/final))
error: Error in src/btree/erase_range.cc at line 96:
error: Assertion failed: [key_in_range(k, left_excl, right_incl)] 
error: Backtrace:
Version: rethinkdb foo (debug) (CLANG 3.2 (tags/RELEASE_32/final))
error: Error in src/btree/internal_node.cc at line 267:
error: Assertion failed: [is_sorted(node->pair_offsets, node->pair_offsets+node->npairs-1, internal_key_comp(node))] Invalid key given to update_key: offsets no longer in sorted order
error: Backtrace:

I can't restart it even after removing the memory constraint.
Smells like memory corruption to me.

@danielmewes
Copy link
Member Author

Also I noticed that when a memory allocation fails, we sometimes don't seem to crash immediately (at least in debug mode). For example I got cases like this:

error: Error in src/buffer_cache/alt/evicter.cc at line 63:
error: Assertion failed: [current_bag->has_page(page)] 
error: Backtrace:
Version: rethinkdb foo (debug) (CLANG 3.2 (tags/RELEASE_32/final))
error: Error in src/errors.cc at line 116:
error: Uncaught exception of type "std::bad_alloc"
         what(): std::bad_alloc

So it ran out of memory, but still had time to run into another assertion. Somehow an allocation failing had side-effects on a different thread, which kept running before the process was eventually terminated.

@danielmewes
Copy link
Member Author

... or well I guess std::bad_alloc is sort of expected to cause side-effects because it unwinds the stack. That is bad or course.
We should probably follow @srh's suggestion and overwrite malloc() and related by something that terminates the process immediately...

@coffeemug
Copy link
Contributor

We should probably follow @srh's suggestion and overwrite malloc() and related by something that terminates the process immediately...

👍

@danielmewes
Copy link
Member Author

... or much easier: Use this function that is part of the C++ standard, to make failed calls to new() and new[]() always crash immediately: http://www.cplusplus.com/reference/new/set_new_handler/

@danielmewes
Copy link
Member Author

A fix for out-of-memory conditions handled incorrectly is in code review 1263 by @AtnNn.

However I'm still getting table corruptions. Going to open a new issue for that.

@danielmewes
Copy link
Member Author

The improved out-of-memory handling is in next as of e5b6321

Closing this since the remaining part is tracked in #2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants