Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent low likelihood Database job queue crash #2052

Closed

Conversation

scottschurr
Copy link
Collaborator

The first three commits have already passed review, so they do not need to be audited. However they are small, so there's not a lot of savings there. The interesting commit is the top-most one.

Regarding the top-most commit, the DatabaseImp has threads that asynchronously call JobQueue to perform database reads. Formerly these threads had the same lifespan as DatabaseImp, which was until the end-of-life of ApplicationImp. During shutdown these threads could call JobQueue after JobQueue had already stopped. Or, even worse, occasionally call JobQueue after JobQueue's destructor had run.

To avoid these shutdown conditions, Database is made a Stoppable, with JobQueue as its parent. When Database stops, it shuts down DatabaseImp's asynchronous read threads. This prevents Database from accessing JobQueue after JobQueue has stopped, but allows Database to perform stores for the remainder of shutdown.

During development it was noted that the Database::close() method was never called. So that method is removed from Database and all derived classes.

Stoppable is also adjusted so it can be constructed using either a char const* or a std::string.

For those files touched for other reasons, unneeded #includes are removed.

I ran over 10,000 15 second start / stop cycles of rippled over the weekend with theses changes and had no crashes or hangs. I also verified (using the debugger) that DatabaseImp::storeInternal() can still succeed when DatabaseImp::isStopped is true.

Reviewers: @JoelKatz, @mellery451

@codecov-io
Copy link

codecov-io commented Mar 14, 2017

Codecov Report

Merging #2052 into develop will increase coverage by <.01%.
The diff coverage is 92.3%.

@@             Coverage Diff             @@
##           develop    #2052      +/-   ##
===========================================
+ Coverage     67.7%   67.71%   +<.01%     
===========================================
  Files          680      680              
  Lines        49219    49234      +15     
===========================================
+ Hits         33323    33338      +15     
  Misses       15896    15896
Impacted Files Coverage Δ
src/ripple/nodestore/Manager.h 100% <ø> (ø)
src/ripple/app/misc/SHAMapStoreImp.h 97.05% <ø> (ø)
src/ripple/app/misc/SHAMapStore.h 100% <ø> (ø)
src/ripple/core/JobQueue.h 100% <ø> (ø)
src/ripple/nodestore/impl/ManagerImp.cpp 84.61% <ø> (ø)
src/ripple/nodestore/impl/DatabaseImp.h 93.02% <100%> (+3.08%)
src/ripple/nodestore/Database.h 100% <100%> (ø)
src/ripple/app/misc/SHAMapStoreImp.cpp 79.89% <100%> (+0.05%)
src/ripple/overlay/impl/OverlayImpl.cpp 30.9% <100%> (+0.37%)
src/ripple/nodestore/impl/DatabaseRotatingImp.h 60.52% <100%> (+6.47%)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0df1b09...19857b7. Read the comment docs.

@mellery451
Copy link
Contributor

LGTM 👍

Copy link
Contributor

@mellery451 mellery451 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -339,14 +339,15 @@ class RootStoppable : public Stoppable
/* Notify a root stoppable and children to stop, without waiting.
Has no effect if the stoppable was already notified.

Returns true on the first call to stopAsync(), false otherwise.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@returns true if this is the first call to this function, false otherwise.

@param parent The parent Stoppable.
*/
Database (std::string name, Stoppable& parent)
: Stoppable (std::move (name), parent) { }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation

// crash during shutdown when its members are accessed by one of
// these threads after the derived class is destroyed but before
// this base class is destroyed.
stopThreads();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wondered if there's a way we can force this, but I couldn't come up with anything lean and clean.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought about this too and didn't come up with anything that didn't require a bunch of restructuring. I didn't think it was worth a lot of effort since it feels like we're unlikely to derive yet another kind of Database from DatabaseImp.

@nbougalis nbougalis added the Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required. label Mar 17, 2017
RootStoppable was using two separate flags to identify that it
was stopping.  LoadManager was being notified when one flag was
set, but checking the other flag (not yet set) to see if we were
stopping.  There is no strong motivation for two flags.  The
timing window is closed by removing one flag and moving around
a chunk of code.
The DatabaseImp holds threads that access DatabaseRotateImp.  But
the DatabaseRotateImp's destructor runs before the DatabaseImp
destructor.  The DatabaseRotateImp now assures that the
DatabaseImp threads are stopped before the DatabaseRotateImp
destructor completes.
Calling OverlayImpl::list_[].second->stop() may cause list_ to be
modified (OverlayImpl::remove() may be called on this same thread).
So iterating directly over OverlayImpl::list_ to call
OverlayImpl::list_[].second->stop() could give undefined behavior.
On MacOS that undefined behavior exhibited as a hang.

Therefore we copy all of the weak/shared ptrs out of
OverlayImpl::list_ before we start calling stop() on them.  That
guarantees OverlayImpl::remove() won't be called until
OverlayImpl::stop() completes.
The DatabaseImp has threads that asynchronously call JobQueue to
perform database reads.  Formerly these threads had the same
lifespan as Database, which was until the end-of-life of
ApplicationImp.  During shutdown these threads could call JobQueue
after JobQueue had already stopped.  Or, even worse, occasionally
call JobQueue after JobQueue's destructor had run.

To avoid these shutdown conditions, Database is made a Stoppable,
with JobQueue as its parent.  When Database stops, it shuts down
its asynchronous read threads.  This prevents Database from
accessing JobQueue after JobQueue has stopped, but allows
Database to perform stores for the remainder of shutdown.

During development it was noted that the Database::close()
method was never called.  So that method is removed from Database
and all derived classes.

Stoppable is also adjusted so it can be constructed using either
a char const* or a std::string.

For those files touched for other reasons, unneeded #includes
are removed.
@scottschurr
Copy link
Collaborator Author

Addressed the last few comments, squashed, and rebased.

@scottschurr scottschurr mentioned this pull request Mar 21, 2017
@scottschurr
Copy link
Collaborator Author

Merged to develop as b4e7653, 1d482ee, 9ff9fa0, and 9d4500c.

@scottschurr scottschurr deleted the database-job-queue-crash branch March 21, 2017 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Passed Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants