Crash in OutboundCall::SetFinished() during stress test involving TServer restarts & reads from followers #451

Closed
kmuthukk opened this issue Aug 24, 2018 · 1 comment
Labels
kind/bug This issue is a bug

Comments

@kmuthukk
Collaborator

Stack Trace:

(gdb) where
#0  yb::rpc::OutboundCall::SetFinished (this=0x0) at ../../src/yb/rpc/outbound_call.cc:402
#1  0x00007fe82fb0250d in yb::rpc::LocalYBInboundCall::Respond (this=0x75cf9cd0, response=..., is_success=<optimized out>) at ../../src/yb/rpc/local_call.cc:81
#2  0x00007fe82fb3ce79 in yb::rpc::YBInboundCall::RespondSuccess (this=0x75cf9cd0, response=...) at ../../src/yb/rpc/yb_rpc.cc:329
#3  0x00007fe82fb1f39a in yb::rpc::RpcContext::RespondSuccess (this=0x7fe7578c8440) at ../../src/yb/rpc/rpc_context.cc:138
#4  0x00007fe834bfee21 in yb::tserver::TabletServiceImpl::Read (this=this@entry=0x1cbac00, req=req@entry=0x1f5a11f8, resp=resp@entry=0x1f5a1288, context=...) at ../../src/yb/tserver/tablet_service.cc:932
#5  0x00007fe8323269f0 in yb::tserver::TabletServerServiceIf::Handle (this=0x1cbac00, call=...) at src/yb/tserver/tserver_service.service.cc:141
#6  0x00007fe82fb33aba in yb::rpc::ServicePoolImpl::Handle (this=0x1cd0380, incoming=...) at ../../src/yb/rpc/service_pool.cc:214
#7  0x00007fe82fb3204a in Run (this=<optimized out>) at ../../src/yb/rpc/service_pool.cc:252
#8  yb::rpc::TasksPool<yb::rpc::(anonymous namespace)::InboundCallTask>::WrappedTask::Run (this=<optimized out>) at ../../src/yb/rpc/tasks_pool.h:70
#9  0x00007fe82fb39f99 in yb::rpc::(anonymous namespace)::Worker::Execute (this=0x675547e0) at ../../src/yb/rpc/thread_pool.cc:98
#10 0x00007fe82db3e4c6 in operator() (this=0xc4c92d8) at /n/jenkins/linuxbrew/linuxbrew_2018-03-16T16_38_10/Cellar/gcc/5.5.0/include/c++/5.5.0/functional:2267
#11 yb::Thread::SuperviseThread (arg=<optimized out>) at ../../src/yb/util/thread.cc:606
#12 0x00007fe82938a694 in start_thread (arg=0x7fe7578c9700) at pthread_create.c:333
#13 0x00007fe828ac83cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

So perhaps 2bf6be4 did not fully fix the issue we saw in this area.

@mbautin wrote:

It looks like the segmentation fault happens as a result of reading the outbound_call_metrics_ field of OutboundCall, meaning the OutboundCall instance has probably already been deleted.

outbound_call.cc
void OutboundCall::SetFinished() {
  // Track time taken to be responded.
  if (outbound_call_metrics_) {
    outbound_call_metrics_->time_to_response->Increment(
        MonoTime::Now().GetDeltaSince(start_).ToMicroseconds());
  }
  set_state(FINISHED_SUCCESS);
  CallCallback();
  TRACE_TO(trace_, "Callback called.");
}

And @spolitov added:

yb::rpc::OutboundCall::SetFinished (this=0x0) at ../../src/yb/rpc/outbound_call.cc:402

The this pointer is null. The call is made here:

outbound_call()->SetFinished();

where

std::shared_ptr<LocalOutboundCall> outbound_call() const { return outbound_call_.lock(); } 
....
// Weak pointer back to the outbound call owning this inbound call to avoid circular reference.
std::weak_ptr<LocalOutboundCall> outbound_call_;

This means the outbound call was already deleted. A quick fix would be to check the result of outbound_call() before calling through it.
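
The quick fix suggested above can be sketched as follows. This is a minimal illustration, not the actual YugaByte code: FakeOutboundCall, FakeInboundCall, and Demo are hypothetical stand-ins for LocalOutboundCall and LocalYBInboundCall. It only shows that std::weak_ptr::lock() returns an empty shared_ptr once the owner is gone, so the result has to be checked before use.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for LocalOutboundCall.
struct FakeOutboundCall {
  bool finished = false;
  void SetFinished() { finished = true; }
};

// Hypothetical stand-in for LocalYBInboundCall.
class FakeInboundCall {
 public:
  explicit FakeInboundCall(const std::shared_ptr<FakeOutboundCall>& call)
      : outbound_call_(call) {}

  // Returns true if the outbound call was still alive and was notified.
  bool Respond() {
    // lock() yields an empty shared_ptr if the outbound call was destroyed.
    std::shared_ptr<FakeOutboundCall> call = outbound_call_.lock();
    if (!call) {
      return false;  // Nothing to notify; do not call through a null pointer.
    }
    call->SetFinished();
    return true;
  }

 private:
  // Weak pointer back to the outbound call, to avoid a circular reference.
  std::weak_ptr<FakeOutboundCall> outbound_call_;
};

// Responds after the outbound call has been destroyed; returns Respond()'s result.
bool RespondAfterOwnerDestroyed() {
  auto outbound = std::make_shared<FakeOutboundCall>();
  FakeInboundCall inbound(outbound);
  outbound.reset();  // Simulate the outbound call being deleted first.
  return inbound.Respond();
}

// Exercises both the live and the expired case.
void Demo() {
  auto outbound = std::make_shared<FakeOutboundCall>();
  FakeInboundCall inbound(outbound);
  const bool notified = inbound.Respond();
  assert(notified && outbound->finished);
  assert(!RespondAfterOwnerDestroyed());
}
```

Calling Demo() runs both cases. With the null check in place, the expired case returns false instead of invoking SetFinished() on a null pointer.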

@kmuthukk kmuthukk added this to To Do in YBase features via automation Aug 24, 2018
@kmuthukk kmuthukk added the kind/bug This issue is a bug label Aug 24, 2018
yugabyte-ci pushed a commit that referenced this issue Aug 24, 2018
…ll in LocalYBInboundCall

Summary:
Because of another issue still under investigation, there could be a case where we invoke LocalYBInboundCall::Respond after the
original outbound call has already been destroyed.
To handle this case correctly, we should check that the result returned by the weak_ptr lock is not null.

Test Plan: Jenkins

Reviewers: robert, mikhail

Reviewed By: mikhail

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5388
yugabyte-ci pushed a commit that referenced this issue Aug 24, 2018
… startup

Summary:
When we receive a read and the tablet is not running, we reply from TabletServiceImpl::DoGetTabletOrRespond.
But if the read was issued with non-strong consistency, we also reply from TabletServiceImpl::Read, so the same call is answered twice.

Changed the handling of error cases during reads from followers to address this issue.
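
The double reply described in this summary can be illustrated with a small sketch. Everything here is hypothetical (FakeContext, GetTabletOrRespond, Consistency, and both Read variants are invented for illustration) and it is not a copy of the change in D5390. It shows only one possible shape of the problem and of a fix: once a helper has already replied with an error, every code path must return without replying again.

```cpp
// Hypothetical stand-in for RpcContext that counts how many replies were sent.
struct FakeContext {
  int responses = 0;
  void RespondFailure() { ++responses; }
  void RespondSuccess() { ++responses; }
};

// Hypothetical consistency levels for a read request.
enum class Consistency { kStrong, kFollowerRead };

// Hypothetical lookup helper: replies with an error itself when the tablet is
// not running, and reports that through its return value.
bool GetTabletOrRespond(bool tablet_running, FakeContext* context) {
  if (!tablet_running) {
    context->RespondFailure();
    return false;
  }
  return true;
}

// The problematic pattern: the follower-read path ignores the failed lookup
// and replies a second time.
void ReadWithDoubleReply(bool tablet_running, Consistency consistency,
                         FakeContext* context) {
  const bool ok = GetTabletOrRespond(tablet_running, context);
  if (!ok && consistency == Consistency::kStrong) {
    return;
  }
  context->RespondSuccess();
}

// One possible fix: if the helper already replied, return on every path.
void ReadWithSingleReply(bool tablet_running, Consistency /*consistency*/,
                         FakeContext* context) {
  if (!GetTabletOrRespond(tablet_running, context)) {
    return;
  }
  context->RespondSuccess();
}

using ReadFn = void (*)(bool, Consistency, FakeContext*);

// Runs one read against a fresh context and returns the number of replies.
int CountReplies(ReadFn read, bool tablet_running, Consistency consistency) {
  FakeContext context;
  read(tablet_running, consistency, &context);
  return context.responses;
}
```

In this sketch, a follower read against a tablet that is not running produces two replies with ReadWithDoubleReply and one with ReadWithSingleReply.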

Test Plan: Jenkins

Reviewers: mikhail, robert, hector

Reviewed By: hector

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D5390
@kmuthukk
Collaborator Author

Fixed via 3bd4f37 & e4a55b9

YBase features automation moved this from To Do to Done Aug 26, 2018