Prevent low likelihood Database job queue crash #2052
Codecov Report
@@ Coverage Diff @@
## develop #2052 +/- ##
===========================================
+ Coverage 67.7% 67.71% +<.01%
===========================================
Files 680 680
Lines 49219 49234 +15
===========================================
+ Hits 33323 33338 +15
Misses 15896 15896
LGTM 👍
👍
src/ripple/core/Stoppable.h
@@ -339,14 +339,15 @@ class RootStoppable : public Stoppable
/* Notify a root stoppable and children to stop, without waiting.
Has no effect if the stoppable was already notified.

Returns true on the first call to stopAsync(), false otherwise.
@returns true if this is the first call to this function, false otherwise.
src/ripple/nodestore/Database.h
@param parent The parent Stoppable.
*/
Database (std::string name, Stoppable& parent)
: Stoppable (std::move (name), parent) { }
Indentation
// crash during shutdown when its members are accessed by one of
// these threads after the derived class is destroyed but before
// this base class is destroyed.
stopThreads();
I wondered if there's a way we can force this, but I couldn't come up with anything lean and clean.
Yeah, I thought about this too and didn't come up with anything that didn't require a bunch of restructuring. I didn't think it was worth a lot of effort since it feels like we're unlikely to derive yet another kind of Database from DatabaseImp.
RootStoppable was using two separate flags to identify that it was stopping. LoadManager was being notified when one flag was set, but checking the other flag (not yet set) to see if we were stopping. There is no strong motivation for two flags. The timing window is closed by removing one flag and moving around a chunk of code.
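A minimal sketch of the single-flag idea, not the actual RootStoppable code: the class name `RootStopFlag` and its members are hypothetical. It assumes one atomic flag that is set before anyone is notified, so a child that checks the flag from inside its notification always sees that stopping has begun.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for RootStoppable; not the rippled class.
class RootStopFlag
{
public:
    // Returns true on the first call, false on every later call.
    bool stopAsync()
    {
        // exchange() sets the flag and returns the previous value in one
        // atomic step, so exactly one caller observes "false" here.
        if (stopping_.exchange(true))
            return false;

        notifyChildren();
        return true;
    }

    bool isStopping() const { return stopping_.load(); }

    int notifications() const { return notifications_.load(); }

private:
    void notifyChildren()
    {
        // The flag is already set when children are notified, so there is
        // no window in which a notified child sees "not stopping".
        assert(isStopping());
        ++notifications_;
    }

    std::atomic<bool> stopping_{false};
    std::atomic<int> notifications_{0};
};
```

With a single flag there is nothing to keep in sync: the same variable answers both "has stop been requested?" and "should I notify?".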
The DatabaseImp holds threads that access DatabaseRotateImp. But the DatabaseRotateImp's destructor runs before the DatabaseImp destructor. The DatabaseRotateImp now assures that the DatabaseImp threads are stopped before the DatabaseRotateImp destructor completes.
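The destruction-order problem can be sketched as follows. This is an illustration under assumed names (`WorkerBase`, `RotatingWorker`, `runAndDestroy` are all hypothetical), not the rippled classes: a base class owns a thread that calls into the derived class, so the derived destructor joins that thread before its own members go away.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a base class that owns a worker thread.
class WorkerBase
{
public:
    virtual ~WorkerBase()
    {
        // Harmless if the derived class already stopped the thread.
        stopThreads();
        assert(!thread_.joinable());
    }

protected:
    void startThread()
    {
        thread_ = std::thread([this] {
            while (!stop_.load())
            {
                poll();  // virtual call into the derived class
                std::this_thread::yield();
            }
        });
    }

    // Idempotent: sets the stop flag and joins the thread if it is running.
    void stopThreads()
    {
        stop_.store(true);
        if (thread_.joinable())
            thread_.join();
    }

    virtual void poll() = 0;

private:
    std::atomic<bool> stop_{false};
    std::thread thread_;
};

// Hypothetical stand-in for the derived class whose members the thread uses.
class RotatingWorker : public WorkerBase
{
public:
    explicit RotatingWorker(int& finalCount) : finalCount_(finalCount)
    {
        startThread();
    }

    ~RotatingWorker() override
    {
        // Join the base class thread while polls_ is still alive.
        stopThreads();
        finalCount_ = polls_.load();
    }

    int polls() const { return polls_.load(); }

private:
    void poll() override { ++polls_; }

    std::atomic<int> polls_{0};
    int& finalCount_;
};

// Usage: run the worker until it has polled at least once, then destroy it.
int runAndDestroy()
{
    int finalCount = -1;
    {
        RotatingWorker worker(finalCount);
        while (worker.polls() == 0)
            std::this_thread::yield();
    }
    return finalCount;
}
```

Without the `stopThreads()` call in the derived destructor, the thread could still call `poll()` after `polls_` has been destroyed but before the base destructor joins it.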
Calling OverlayImpl::list_[].second->stop() may cause list_ to be modified (OverlayImpl::remove() may be called on this same thread). So iterating directly over OverlayImpl::list_ to call OverlayImpl::list_[].second->stop() could give undefined behavior. On macOS that undefined behavior manifested as a hang. Therefore we copy all of the weak/shared ptrs out of OverlayImpl::list_ before we start calling stop() on them. That guarantees OverlayImpl::remove() won't be called until OverlayImpl::stop() completes.
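The copy-before-iterate approach might look like the sketch below. All names (`Registry`, `Child`, `keepAlive`, `stopAllAndCount`) are hypothetical and locking is omitted; the sketch assumes a child unregisters itself from its destructor and that `stop()` may drop the last owning reference to the child.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>
#include <vector>

class Registry;

// Hypothetical child that unregisters itself when it is destroyed.
class Child
{
public:
    Child(Registry& registry, int id) : registry_(registry), id_(id) {}
    ~Child();

    // Simulates pending asynchronous work that keeps the child alive.
    void keepAlive(std::shared_ptr<Child> self) { self_ = std::move(self); }

    // Dropping the self-reference may release the last owner.
    void stop() { self_.reset(); }

private:
    Registry& registry_;
    int id_;
    std::shared_ptr<Child> self_;
};

// Hypothetical registry holding weak references to its children.
class Registry
{
public:
    void add(int id, std::shared_ptr<Child> const& child) { list_[id] = child; }
    void remove(int id) { list_.erase(id); }
    std::size_t size() const { return list_.size(); }

    void stop()
    {
        // Snapshot strong references first. While this vector is alive no
        // child can be destroyed, so remove() cannot modify list_ during
        // either loop below.
        std::vector<std::shared_ptr<Child>> children;
        children.reserve(list_.size());
        for (auto const& entry : list_)
            children.push_back(entry.second.lock());

        for (auto const& child : children)
            if (child)
                child->stop();
        // children goes out of scope here; destructors now call remove().
    }

private:
    std::map<int, std::weak_ptr<Child>> list_;
};

inline Child::~Child()
{
    registry_.remove(id_);
}

// Usage: register three children, stop them all, report what is left.
std::size_t stopAllAndCount()
{
    Registry registry;
    for (int id = 0; id < 3; ++id)
    {
        auto child = std::make_shared<Child>(registry, id);
        child->keepAlive(child);
        registry.add(id, child);
    }
    assert(registry.size() == 3);
    registry.stop();
    return registry.size();
}
```

Iterating over `list_` directly and calling `stop()` on each entry would let a child's destructor erase from the map in the middle of the loop, invalidating the iterator in use.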
The DatabaseImp has threads that asynchronously call JobQueue to perform database reads. Formerly these threads had the same lifespan as Database, which was until the end-of-life of ApplicationImp. During shutdown these threads could call JobQueue after JobQueue had already stopped. Or, even worse, occasionally call JobQueue after JobQueue's destructor had run. To avoid these shutdown conditions, Database is made a Stoppable, with JobQueue as its parent. When Database stops, it shuts down its asynchronous read threads. This prevents Database from accessing JobQueue after JobQueue has stopped, but allows Database to perform stores for the remainder of shutdown. During development it was noted that the Database::close() method was never called. So that method is removed from Database and all derived classes. Stoppable is also adjusted so it can be constructed using either a char const* or a std::string. For those files touched for other reasons, unneeded #includes are removed.
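To illustrate the intended behavior (asynchronous reads end at stop, synchronous stores keep working), here is a small sketch. `MiniDatabase`, `asyncFetch`, and `storeAfterStop` are hypothetical names and this is not the rippled Database or Stoppable API; it only assumes that `onStop()` is the hook invoked when the parent begins stopping.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical store with asynchronous read threads; not the rippled Database.
class MiniDatabase
{
public:
    explicit MiniDatabase(int readThreads)
    {
        for (int i = 0; i < readThreads; ++i)
            readers_.emplace_back([this] { readLoop(); });
    }

    ~MiniDatabase() { onStop(); }

    // Assumed stop hook: ends the read threads. Safe to call more than once.
    void onStop()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopped_ = true;
        }
        wake_.notify_all();
        for (auto& reader : readers_)
            if (reader.joinable())
                reader.join();
    }

    // Queues an asynchronous read; refused once stopped.
    bool asyncFetch(std::string key)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (stopped_)
            return false;
        pending_.push_back(std::move(key));
        wake_.notify_one();
        return true;
    }

    // Synchronous store; deliberately does not check stopped_.
    void store(std::string const& key, std::string const& value)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        data_[key] = value;
    }

    bool isStopped() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return stopped_;
    }

    std::size_t size() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return data_.size();
    }

private:
    void readLoop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;)
        {
            wake_.wait(lock, [this] { return stopped_ || !pending_.empty(); });
            if (stopped_)
                return;
            pending_.pop_front();  // a real reader would fetch and hand off
        }
    }

    mutable std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<std::string> pending_;
    std::map<std::string, std::string> data_;
    std::vector<std::thread> readers_;
    bool stopped_ = false;
};

// Usage: a store issued after the read threads have stopped still succeeds.
std::size_t storeAfterStop()
{
    MiniDatabase db(2);
    db.asyncFetch("early");
    db.onStop();
    assert(db.isStopped());
    db.store("key", "value");
    return db.size();
}
```

Once `onStop()` returns, no read thread exists that could call back into an already stopped job queue, while `store()` remains usable for the rest of shutdown.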
ef52798 to 19857b7
Addressed the last few comments, squashed, and rebased.
The first three commits have already passed review, so they do not need to be audited. However they are small, so there's not a lot of savings there. The interesting commit is the top-most one.
Regarding the top-most commit, the `DatabaseImp` has threads that asynchronously call `JobQueue` to perform database reads. Formerly these threads had the same lifespan as `DatabaseImp`, which was until the end-of-life of `ApplicationImp`. During shutdown these threads could call `JobQueue` after `JobQueue` had already stopped. Or, even worse, occasionally call `JobQueue` after `JobQueue`'s destructor had run.
To avoid these shutdown conditions, `Database` is made a `Stoppable`, with `JobQueue` as its parent. When `Database` stops, it shuts down `DatabaseImp`'s asynchronous read threads. This prevents `Database` from accessing `JobQueue` after `JobQueue` has stopped, but allows `Database` to perform stores for the remainder of shutdown.
During development it was noted that the `Database::close()` method was never called. So that method is removed from `Database` and all derived classes.
`Stoppable` is also adjusted so it can be constructed using either a `char const*` or a `std::string`.
For those files touched for other reasons, unneeded `#includes` are removed.
I ran over 10,000 15-second start / stop cycles of rippled over the weekend with these changes and had no crashes or hangs. I also verified (using the debugger) that `DatabaseImp::storeInternal()` can still succeed when `DatabaseImp::isStopped` is `true`.
Reviewers: @JoelKatz, @mellery451