New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent low likelihood Database job queue crash #2052

Closed
wants to merge 4 commits into
base: develop
from

Conversation

Projects
None yet
5 participants
@scottschurr
Contributor

scottschurr commented Mar 14, 2017

The first three commits have already passed review, so they do not need to be audited. However they are small, so there's not a lot of savings there. The interesting commit is the top-most one.

Regarding the top-most commit, the DatabaseImp has threads that asynchronously call JobQueue to perform database reads. Formerly these threads had the same lifespan as DatabaseImp, which was until the end-of-life of ApplicationImp. During shutdown these threads could call JobQueue after JobQueue had already stopped. Or, even worse, occasionally call JobQueue after JobQueue's destructor had run.

To avoid these shutdown conditions, Database is made a Stoppable, with JobQueue as its parent. When Database stops, it shuts down DatabaseImp's asynchronous read threads. This prevents Database from accessing JobQueue after JobQueue has stopped, but allows Database to perform stores for the remainder of shutdown.

During development it was noted that the Database::close() method was never called. So that method is removed from Database and all derived classes.

Stoppable is also adjusted so it can be constructed using either a char const* or a std::string.

For those files touched for other reasons, unneeded #includes are removed.

I ran over 10,000 15 second start / stop cycles of rippled over the weekend with theses changes and had no crashes or hangs. I also verified (using the debugger) that DatabaseImp::storeInternal() can still succeed when DatabaseImp::isStopped is true.

Reviewers: @JoelKatz, @mellery451

@codecov-io

This comment has been minimized.

Show comment
Hide comment
@codecov-io

codecov-io Mar 14, 2017

Codecov Report

Merging #2052 into develop will increase coverage by <.01%.
The diff coverage is 92.3%.

@@             Coverage Diff             @@
##           develop    #2052      +/-   ##
===========================================
+ Coverage     67.7%   67.71%   +<.01%     
===========================================
  Files          680      680              
  Lines        49219    49234      +15     
===========================================
+ Hits         33323    33338      +15     
  Misses       15896    15896
Impacted Files Coverage Δ
src/ripple/nodestore/Manager.h 100% <ø> (ø)
src/ripple/app/misc/SHAMapStoreImp.h 97.05% <ø> (ø)
src/ripple/app/misc/SHAMapStore.h 100% <ø> (ø)
src/ripple/core/JobQueue.h 100% <ø> (ø)
src/ripple/nodestore/impl/ManagerImp.cpp 84.61% <ø> (ø)
src/ripple/nodestore/impl/DatabaseImp.h 93.02% <100%> (+3.08%)
src/ripple/nodestore/Database.h 100% <100%> (ø)
src/ripple/app/misc/SHAMapStoreImp.cpp 79.89% <100%> (+0.05%)
src/ripple/overlay/impl/OverlayImpl.cpp 30.9% <100%> (+0.37%)
src/ripple/nodestore/impl/DatabaseRotatingImp.h 60.52% <100%> (+6.47%)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0df1b09...19857b7. Read the comment docs.

codecov-io commented Mar 14, 2017

Codecov Report

Merging #2052 into develop will increase coverage by <.01%.
The diff coverage is 92.3%.

@@             Coverage Diff             @@
##           develop    #2052      +/-   ##
===========================================
+ Coverage     67.7%   67.71%   +<.01%     
===========================================
  Files          680      680              
  Lines        49219    49234      +15     
===========================================
+ Hits         33323    33338      +15     
  Misses       15896    15896
Impacted Files Coverage Δ
src/ripple/nodestore/Manager.h 100% <ø> (ø)
src/ripple/app/misc/SHAMapStoreImp.h 97.05% <ø> (ø)
src/ripple/app/misc/SHAMapStore.h 100% <ø> (ø)
src/ripple/core/JobQueue.h 100% <ø> (ø)
src/ripple/nodestore/impl/ManagerImp.cpp 84.61% <ø> (ø)
src/ripple/nodestore/impl/DatabaseImp.h 93.02% <100%> (+3.08%)
src/ripple/nodestore/Database.h 100% <100%> (ø)
src/ripple/app/misc/SHAMapStoreImp.cpp 79.89% <100%> (+0.05%)
src/ripple/overlay/impl/OverlayImpl.cpp 30.9% <100%> (+0.37%)
src/ripple/nodestore/impl/DatabaseRotatingImp.h 60.52% <100%> (+6.47%)
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0df1b09...19857b7. Read the comment docs.

@mellery451

This comment has been minimized.

Show comment
Hide comment
@mellery451

mellery451 Mar 14, 2017

Contributor

LGTM 👍

Contributor

mellery451 commented Mar 14, 2017

LGTM 👍

@scottschurr scottschurr requested review from mellery451 and JoelKatz Mar 16, 2017

@mellery451

👍

Show outdated Hide outdated src/ripple/core/Stoppable.h
Show outdated Hide outdated src/ripple/nodestore/Database.h
// crash during shutdown when its members are accessed by one of
// these threads after the derived class is destroyed but before
// this base class is destroyed.
stopThreads();

This comment has been minimized.

@nbougalis

nbougalis Mar 17, 2017

Member

I wondered if there's a way we can force this, but I couldn't come up with anything lean and clean.

@nbougalis

nbougalis Mar 17, 2017

Member

I wondered if there's a way we can force this, but I couldn't come up with anything lean and clean.

This comment has been minimized.

@scottschurr

scottschurr Mar 17, 2017

Contributor

Yeah, I thought about this too and didn't come up with anything that didn't require a bunch of restructuring. I didn't think it was worth a lot of effort since it feels like we're unlikely to derive yet another kind of Database from DatabaseImp.

@scottschurr

scottschurr Mar 17, 2017

Contributor

Yeah, I thought about this too and didn't come up with anything that didn't require a bunch of restructuring. I didn't think it was worth a lot of effort since it feels like we're unlikely to derive yet another kind of Database from DatabaseImp.

@nbougalis nbougalis added the Passed label Mar 17, 2017

scottschurr added some commits Feb 13, 2017

Remove timing window from RootStoppable (RIPD-1392):
RootStoppable was using two separate flags to identify that it
was stopping.  LoadManager was being notified when one flag was
set, but checking the other flag (not yet set) to see if we were
stopping.  There is no strong motivation for two flags.  The
timing window is closed by removing one flag and moving around
a chunk of code.
Prevent DatabaseRotateImp crash on shutdown (RIPD-1392):
The DatabaseImp holds threads that access DatabaseRotateImp.  But
the DatabaseRotateImp's destructor runs before the DatabaseImp
destructor.  The DatabaseRotateImp now assures that the
DatabaseImp threads are stopped before the DatabaseRotateImp
destructor completes.
Prevent low-likelihood hang on shutdown (RIPD-1392):
Calling OverlayImpl::list_[].second->stop() may cause list_ to be
modified (OverlayImpl::remove() may be called on this same thread).
So iterating directly over OverlayImpl::list_ to call
OverlayImpl::list_[].second->stop() could give undefined behavior.
On MacOS that undefined behavior exhibited as a hang.

Therefore we copy all of the weak/shared ptrs out of
OverlayImpl::list_ before we start calling stop() on them.  That
guarantees OverlayImpl::remove() won't be called until
OverlayImpl::stop() completes.
Prevent low-likelihood crash on shutdown (RIPD-1392):
The DatabaseImp has threads that asynchronously call JobQueue to
perform database reads.  Formerly these threads had the same
lifespan as Database, which was until the end-of-life of
ApplicationImp.  During shutdown these threads could call JobQueue
after JobQueue had already stopped.  Or, even worse, occasionally
call JobQueue after JobQueue's destructor had run.

To avoid these shutdown conditions, Database is made a Stoppable,
with JobQueue as its parent.  When Database stops, it shuts down
its asynchronous read threads.  This prevents Database from
accessing JobQueue after JobQueue has stopped, but allows
Database to perform stores for the remainder of shutdown.

During development it was noted that the Database::close()
method was never called.  So that method is removed from Database
and all derived classes.

Stoppable is also adjusted so it can be constructed using either
a char const* or a std::string.

For those files touched for other reasons, unneeded #includes
are removed.
@scottschurr

This comment has been minimized.

Show comment
Hide comment
@scottschurr

scottschurr Mar 17, 2017

Contributor

Addressed the last few comments, squashed, and rebased.

Contributor

scottschurr commented Mar 17, 2017

Addressed the last few comments, squashed, and rebased.

@scottschurr scottschurr referenced this pull request Mar 21, 2017

Merged

Develop next #2059

@scottschurr

This comment has been minimized.

Show comment
Hide comment
@scottschurr

scottschurr Mar 21, 2017

Contributor

Merged to develop as b4e7653, 1d482ee, 9ff9fa0, and 9d4500c.

Contributor

scottschurr commented Mar 21, 2017

Merged to develop as b4e7653, 1d482ee, 9ff9fa0, and 9d4500c.

@scottschurr scottschurr deleted the scottschurr:database-job-queue-crash branch Mar 21, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment