Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docdb] SEGV during yb::log::Log::WaitForSafeOpIdToApply in DNS based cluster #5679

Closed
bmatican opened this issue Sep 14, 2020 · 0 comments
Closed
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority

Comments

@bmatican
Copy link
Contributor

Seems to trace back to this diff: https://phabricator.dev.yugabyte.com/D8972

Program terminated with signal 11, Segmentation fault.
#0  0x00007f4e5a46bbb9 in Get<google::protobuf::RepeatedPtrField<yb::log::LogEntryPB>::TypeHandler> (this=0x6b504d20, index=<optimized out>)
    at /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20200829090443-f431681041-centos/installed/uninstrumented/include/google/protobuf/repeated_field.h:1523
1523	/opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20200829090443-f431681041-centos/installed/uninstrumented/include/google/protobuf/repeated_field.h: No such file or directory.
(gdb) #0  0x00007f4e5a46bbb9 in Get<google::protobuf::RepeatedPtrField<yb::log::LogEntryPB>::TypeHandler> (this=0x6b504d20, index=<optimized out>)
    at /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20200829090443-f431681041-centos/installed/uninstrumented/include/google/protobuf/repeated_field.h:1523
#1  Get (index=<optimized out>, this=<optimized out>) at /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20200829090443-f431681041-centos/installed/uninstrumented/include/google/protobuf/repeated_field.h:1989
#2  entry (index=<optimized out>, this=<optimized out>) at src/yb/consensus/log.pb.h:773
#3  MaxReplicateOpId (this=0x6b504d00) at ../../src/yb/consensus/log.cc:243
#4  ToString (this=0x6b504d00) at ../../src/yb/consensus/log.cc:183
#5  ToString (this=<optimized out>, this=<optimized out>) at ../../src/yb/util/blocking_queue.h:232
#6  ToString<yb::BlockingQueue<yb::log::LogEntryBatch*, yb::DefaultLogicalSize> > (value=...) at ../../src/yb/util/tostring.h:73
#7  AsString<yb::BlockingQueue<yb::log::LogEntryBatch*, yb::DefaultLogicalSize> > (t=...) at ../../src/yb/util/tostring.h:279
#8  yb::TaskStream<yb::log::LogEntryBatch>::ToString (this=<optimized out>) at ../../src/yb/util/taskstream.h:80
#9  0x00007f4e5a45e1b3 in ToString (this=<optimized out>) at ../../src/yb/consensus/log.cc:322
#10 yb::log::Log::WaitForSafeOpIdToApply (this=0x17f3a300, min_allowed=..., duration=...) at ../../src/yb/consensus/log.cc:1089
#11 0x00007f4e5a7203bb in yb::consensus::RaftConsensus::WaitForSafeOpIdToApply (this=0xe6a94b0, op_id=...) at ../../src/yb/consensus/raft_consensus.cc:3105
#12 0x00007f4e5a7495df in yb::consensus::ReplicaState::ApplyPendingOperationsUnlocked (this=this@entry=0x176498c0, committed_op_id=..., could_stop=..., could_stop@entry=...) at ../../src/yb/consensus/replica_state.cc:903
#13 0x00007f4e5a74a23f in yb::consensus::ReplicaState::AdvanceCommittedOpIdUnlocked (this=0x176498c0, committed_op_id=..., could_stop=...) at ../../src/yb/consensus/replica_state.cc:852
#14 0x00007f4e5a730551 in yb::consensus::RaftConsensus::MarkOperationsAsCommittedUnlocked (this=this@entry=0xe6a94b0, request=..., deduped_req=..., last_from_leader=...) at ../../src/yb/consensus/raft_consensus.cc:1975
#15 0x00007f4e5a736a27 in yb::consensus::RaftConsensus::UpdateReplica (this=this@entry=0xe6a94b0, request=request@entry=0x6aed5ba0, response=response@entry=0x2b9ad3f0) at ../../src/yb/consensus/raft_consensus.cc:1748
#16 0x00007f4e5a737973 in yb::consensus::RaftConsensus::Update (this=this@entry=0xe6a94b0, request=request@entry=0x6aed5ba0, response=response@entry=0x2b9ad3f0, deadline=...) at ../../src/yb/consensus/raft_consensus.cc:1304
#17 0x00007f4e5b3c2c01 in yb::tserver::ConsensusServiceImpl::UpdateConsensus (this=this@entry=0x1feef00, req=req@entry=0x6aed5ba0, resp=resp@entry=0x2b9ad3f0, context=...) at ../../src/yb/tserver/tablet_service.cc:2111
#18 0x00007f4e57c4608a in yb::consensus::ConsensusServiceIf::Handle (this=0x1feef00, call=...) at src/yb/consensus/consensus.service.cc:100
#19 0x00007f4e53a13d09 in yb::rpc::ServicePoolImpl::Handle (this=0x2672240, incoming=...) at ../../src/yb/rpc/service_pool.cc:262
#20 0x00007f4e539bab34 in yb::rpc::InboundCall::InboundCallTask::Run (this=<optimized out>) at ../../src/yb/rpc/inbound_call.cc:212
#21 0x00007f4e53a262f8 in yb::rpc::(anonymous namespace)::Worker::Execute (this=<optimized out>) at ../../src/yb/rpc/thread_pool.cc:105
#22 0x00007f4e5225110f in operator() (this=0x4e0d018) at /home/yugabyte/yb-software/yugabyte-2.3.0.0-b167-centos-x86_64/linuxbrew-xxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#23 yb::Thread::SuperviseThread (arg=0x4e0cfc0) at ../../src/yb/util/thread.cc:760
#24 0x00007f4e4ca6c694 in start_thread (arg=0x7f4cc4a67700) at pthread_create.c:333
#25 0x00007f4e4c1a941d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Since these long waits are something we DFATAL, this codepath is likely not exercised much. I suspect the slowdown from DNS queries was contributing to this. I was seeing these in the log

E0910 21:41:22.224635 20822 log.cc:1083] T 92ad09106cfa4c8cbd9e89d4a07410e8 P 3846416ec2d24a8a901c1247599bb15f: Appender stack:     @     0x7facb467f11e  (unknown)
    @     0x7facb4ffcb39  __lll_lock_wait
    @     0x7facb4ff76e1  __GI___pthread_mutex_lock
    @     0x7facba7e16b6  yb::ThreadPool::DoSubmit()
    @     0x7facba7e1faa  yb::ThreadPoolToken::Submit()
    @     0x7facba7e22e8  yb::ThreadPoolToken::SubmitClosure()
    @     0x7facc2c8ca10  yb::consensus::PeerMessageQueue::NotifyObserversOfMajorityReplOpChange()
    @     0x7facc2c8d801  yb::consensus::PeerMessageQueue::ResponseFromPeer()
    @     0x7facc2c8e6ab  yb::consensus::PeerMessageQueue::LocalPeerAppendFinished()
    @     0x7facc2c9e044  yb::consensus::LogCache::LogCallback()
    @     0x7facc29e8a2b  yb::log::Log::Appender::GroupWork()
    @     0x7facc29ee1cb  yb::log::Log::Appender::ProcessBatch()
    @     0x7facc29f2cda  yb::TaskStream<>::Run()
    @     0x7facba7dd932  yb::ThreadPool::DispatchThread()
    @     0x7facba7da10d  yb::Thread::SuperviseThread()
    @     0x7facb4ff5692  start_thread
E0910 21:41:22.224735 20822 log.cc:1084] T 92ad09106cfa4c8cbd9e89d4a07410e8 P 3846416ec2d24a8a901c1247599bb15f: Long wait for safe op id: { term: 78 index: 3145487 }, current: { term: 77 index: 3145486 }, last appended: { term: 77 index: 3145486 }, las
t submitted: { term: 78 index: 3145487 }, appender: { queue: { type: 1 state: 2 max_op_id: term: 78 index: 3145487 }
 run_state: kComplete stopped: 0 stop_requested: 0 }, passed: 38.361s
@bmatican bmatican added kind/bug This issue is a bug area/docdb YugabyteDB core features priority/high High Priority labels Sep 14, 2020
spolitov added a commit that referenced this issue Sep 15, 2020
Summary:
When WaitForSafeOpIdToApply takes too long time we dump appender state.
If it contains empty batch it would crash the process.

This diff updates MaxReplicateOpId to handle empty batch, so process would not crash in above scenario.

Test Plan: Jenkins

Reviewers: bogdan

Reviewed By: bogdan

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D9374
@bmatican bmatican added this to Done in YBase features Oct 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/high High Priority
Projects
Development

No branches or pull requests

2 participants