
Migrate thread safety to RCLConsensus (RIPD-1389) #2106

Closed
wants to merge 2 commits

Conversation

@bachase (Collaborator) commented May 8, 2017

This PR moves the consensus thread-safety logic from the generic implementation in Consensus into the RCL-adapted version, RCLConsensus. This has the advantage of placing the concurrency responsibilities in one spot in the code, and it additionally improves the performance of the consensus simulation framework by eliminating the unneeded concurrency overhead. The two primary changes are:

  1. Switch from a CRTP design to an Adaptor design for generic Consensus. Previously, CRTP was used to allow the specializing class to call into the generic Consensus code while servicing some of its required callback members (e.g. onClose could call some other member of Consensus). The Adaptor design makes the communication from Consensus to the specializations in the Adaptor more explicit by passing any dependencies as arguments to the callback functions (see the sketch after this list).

  2. Move the thread-safety logic from Consensus to RCLConsensus. I now use atomics to hold status-related information at the RCLConsensus level to avoid acquiring a mutex just to read a value. There is still room for improving the design, particularly when dispatching the onAccept call to the job queue, but I think this is still an improvement. Although not strictly necessary, I still use a recursive_mutex instead of a regular mutex in RCLConsensus to protect any future refactoring from breaking things.
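
For orientation, here is a minimal sketch of the two shapes described in item 1; the class and member names are illustrative only, not the actual rippled interfaces.

// CRTP shape: the generic code calls into the derived class, and the derived
// class can freely reach back into the generic members.
template <class Derived>
class ConsensusCRTP
{
public:
    void
    timerEntry()
    {
        // Derived::onClose may itself call other ConsensusCRTP members.
        static_cast<Derived*>(this)->onClose();
    }
};

// Adaptor shape: the generic code holds a reference to an adaptor and passes
// whatever a callback needs as explicit arguments.
template <class Adaptor>
class ConsensusGeneric
{
public:
    explicit ConsensusGeneric(Adaptor& a) : adaptor_(a)
    {
    }

    void
    timerEntry(typename Adaptor::Ledger_t const& prevLedger)
    {
        // The adaptor sees only what it is handed; it cannot silently depend
        // on this class's internal state.
        adaptor_.onClose(prevLedger);
    }

private:
    Adaptor& adaptor_;
};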

This is the last planned substantial refactor of the consensus code before focusing fully on the simulation framework.

Doc preview: http://bachase.github.io/ripd1389/

@bachase (Collaborator, Author) commented May 8, 2017

Some points I may consider addressing if reviewers think worthwhile:

  • Join the RCLConsensus::Consensus<Adaptor> member with its mutex via a scoped_value-style helper to ensure we always take the lock before accessing the member.
  • Consider switching to a fixed-size ring buffer instead of the deque currently used to store the 10 most recent peer proposals (a sketch appears below, after @HowardHinnant's note).
  • Use pimpl or a virtual interface for RCLConsensus.
  • Consider std::memory_order_relaxed for the atomic members of RCLConsensus. These are primarily status values that can tolerate being slightly stale (a sketch follows this list).
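
On the last bullet, a minimal sketch of the relaxed-ordering idea, assuming a status flag published by the consensus thread and read elsewhere; the names are illustrative, not actual RCLConsensus members.

#include <atomic>

class ConsensusStatus
{
public:
    void
    setProposing(bool p)
    {
        // The writer publishes the latest value; no ordering with other data
        // is required for these report-only fields.
        proposing_.store(p, std::memory_order_relaxed);
    }

    bool
    proposing() const
    {
        // A slightly stale read here only affects status reporting.
        return proposing_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<bool> proposing_{false};
};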

@HowardHinnant (Contributor) commented:

On the ring buffer, I saw this this morning: https://github.com/martinmoene/ring-span-lite

I haven't checked it out and don't know how good it is. Just thought I would mention it in case it is helpful.
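
Along those lines, a minimal fixed-capacity "most recent N" buffer for the proposal-history bullet might look like the following; the name and capacity are illustrative, and this is neither ring-span-lite nor code from this PR.

#include <array>
#include <cstddef>

template <class T, std::size_t N>
class RecentBuffer
{
public:
    // Overwrite the oldest element once the buffer is full.
    void
    push(T const& t)
    {
        buf_[next_] = t;
        next_ = (next_ + 1) % N;
        if (size_ < N)
            ++size_;
    }

    std::size_t
    size() const
    {
        return size_;
    }

    // i == 0 is the oldest retained element, size() - 1 the newest.
    T const&
    operator[](std::size_t i) const
    {
        std::size_t const first = (next_ + N - size_) % N;
        return buf_[(first + i) % N];
    }

private:
    std::array<T, N> buf_{};
    std::size_t next_ = 0;
    std::size_t size_ = 0;
};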

@codecov-io commented May 9, 2017

Codecov Report

Merging #2106 into develop will increase coverage by 0.07%.
The diff coverage is 71.02%.


@@             Coverage Diff             @@
##           develop    #2106      +/-   ##
===========================================
+ Coverage    69.48%   69.55%   +0.07%     
===========================================
  Files          685      689       +4     
  Lines        50520    50565      +45     
===========================================
+ Hits         35105    35173      +68     
+ Misses       15415    15392      -23
Impacted Files Coverage Δ
src/ripple/app/misc/NetworkOPs.h 100% <ø> (ø) ⬆️
src/ripple/app/main/Application.h 100% <ø> (ø) ⬆️
src/ripple/overlay/impl/PeerImp.h 0% <ø> (ø) ⬆️
src/ripple/app/consensus/RCLCxPeerPos.h 0% <0%> (ø) ⬆️
src/ripple/overlay/impl/PeerImp.cpp 0% <0%> (ø) ⬆️
src/ripple/app/consensus/RCLCxPeerPos.cpp 0% <0%> (ø) ⬆️
src/ripple/protocol/PublicKey.h 100% <100%> (ø) ⬆️
src/ripple/app/misc/ValidatorKeys.h 100% <100%> (ø)
src/ripple/consensus/Consensus.cpp 82.69% <100%> (ø) ⬆️
src/ripple/app/misc/NetworkOPs.cpp 61.39% <38.88%> (+0.09%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87742a5...95af8f8.

@bachase (Collaborator, Author) commented May 18, 2017

Rebased on 0.70.0-b6

@bachase (Collaborator, Author) commented Jun 26, 2017

Tagging @wilsonianb for 16589d4

//==============================================================================

#ifndef RIPPLE_APP_MISC_VALIDATOR_IDENTITY_H_INCLUDED
#define RIPPLE_APP_MISC_VALIDATOR_IDENTITY_H_INCLUDED
Contributor:

Why IDENTITY instead of KEYS?

Collaborator (Author):

Good catch, I had renamed the file a few times.

class Config;

/** Validator keys and manifest as set in configuration file. Values will be
emtpy if not configured as a validator or not configured with a manifest.
Contributor:

emtpy -> empty

{
configInvalid_ = true;
JLOG(j.fatal())
<< "Invalid entry in validator token configuration.";
Contributor:

We could modify this to resemble the message for an invalid seed below
"Invalid seed specified in [" SECTION_VALIDATOR_TOKEN "]"

// No config -> no key but valid
Config c;
ValidatorKeys k{c, j};
BEAST_EXPECT(k.publicKey.size() == 0);
Contributor:

You could also check secretKey to be extra safe. Same with a few other tests below.

Collaborator (Author):

Actually, I missed this when adding the [FOLD] commit, but secretKey has a static size. How would you recommend I test it?

Contributor:

Ah, I forgot about that. I'm not sure how you could test that.

}

{
// Token takes precedence over seed
Contributor:

We actually don't allow the config to have both a token and a seed, but the check happens here:
https://github.com/bachase/rippled/blob/consensus-lock3/src/ripple/core/impl/Config.cpp#L361-L364
Do you think that should be moved into ValidatorKeys?

Collaborator (Author):

Good question, I clearly missed that since I added this test. From a code standpoint, it would be nice to move that check over, but I like the idea of validating as early as possible, which would be where the config currently does the work. What would you prefer?

Contributor:

I also like keeping that check with the other config validations in Config::loadFromString, so I'd vote for either leaving it as is or checking again in ValidatorKeys (and marking configInvalid_).

@@ -40,6 +40,7 @@ namespace ripple {
class InboundTransactions;
class LocalTxs;
class LedgerMaster;
class ValidatorKeys;

/** Manges the generic consensus algorithm for use by the RCL.
Contributor:

Manges --> Manages

@bachase (Collaborator, Author) commented Jul 5, 2017

Rebased on 0.70.0, squashed a few commits, and addressed @wilsonianb's comments.

@nbougalis (Contributor) left a comment

Mostly minor things


// Called when consensus operating mode changes
void onModeChange(ConsensuMode before, ConsensusMode after);
Contributor:

ConsensuMode is missing an s


// Encapsulates the result of consensus.
template <class Adaptor>
struct ConsensusResult
Contributor:

So... ConsensusResult, ConsensusMode, ConsensusCloseTimes... it's starting to be a bit much. Should we consider a separate consensus namespace for all this?

@bachase (Collaborator, Author) commented Jul 5, 2017

I'm all for it! There were some downvotes last time the topic came up. I could put them inside the Consensus class, but not all of them depend on the Adaptor template type, so it's convenient to specify them independently.
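
For illustration only, the grouping under discussion might look something like this; the namespace and exact members are hypothetical, and it is not what this PR does.

namespace ripple {
namespace consensus {

// The free-standing Consensus* helpers gathered in one place.
enum class Mode { proposing, observing, wrongLedger, switchedLedger };

struct CloseTimes;

template <class Adaptor>
struct Result;

}  // namespace consensus
}  // namespace ripple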

@@ -449,7 +466,7 @@ RCLConsensus::doAccept(
JLOG(j_.info()) << "CNF buildLCL " << newLCLHash;

// See if we can accept a ledger as fully-validated
ledgerMaster_.consensusBuilt(sharedLCL.ledger_, getJson(true));
ledgerMaster_.consensusBuilt(sharedLCL.ledger_, consensusJson);
Contributor:

I think we can do std::move(consensusJson) here for a potential micro-optimization

// Since Consensus does not provide intrinsic thread-safety, this mutex
// guards all calls to consensus_. adaptor_ uses atomics internally
// to allow concurrent access of its data members that have getters.
mutable std::recursive_mutex mutex_;
Contributor:

Can we easily avoid std::recursive_mutex in favor of a non-recursive one? Recursive locking = code smell.

Collaborator (Author):

I'm fairly confident the current version does not require the recursive_mutex. In fact, that was the primary motivation for this refactor. But @scottschurr has done a good job convincing me that unless the code structurally prevents recursive calls while under lock, future refactors can break this assumption in non-obvious ways.

In particular, we may call into support code in NetworkOps or InboundLedgers while the mutex is under lock. Currently those won't call back into RCLConsensus, but it's tough to prevent that from changing in a clear way.

Collaborator:

Since my name came up, I'll mention that it should be possible to add instrumentation that would fire an assert or LogicError if a mutex were about to be used recursively when it is not intended for such use. I haven't thought hard about the particular scenario in question, but here's an example where we know that recursive calls are possible, but we didn't want them to be recursive at a particular place: https://github.com/ripple/rippled/blob/develop/src/ripple/core/impl/DeadlineTimer.cpp#L36-L41

Another possibility would be that one or more of our (non-recursive) mutex library implementations may shriek in an appropriate fashion (:scream:) if called recursively. @HowardHinnant would be a good resource to ask about that. I know Howard has strong opinions about recursive_mutex.
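
A minimal sketch of that kind of instrumentation, assuming a wrapper around std::mutex with a hypothetical name; this is not the DeadlineTimer code linked above.

#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

class CheckedMutex
{
public:
    void
    lock()
    {
        // Shriek (assert) if the calling thread already holds the mutex.
        assert(owner_.load() != std::this_thread::get_id());
        m_.lock();
        owner_.store(std::this_thread::get_id());
    }

    void
    unlock()
    {
        owner_.store(std::thread::id{});
        m_.unlock();
    }

private:
    std::mutex m_;
    std::atomic<std::thread::id> owner_{std::thread::id{}};
};

// Usable with the standard lock helpers, e.g. std::lock_guard<CheckedMutex>.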

Contributor:

It is generally best to avoid recursive_mutex if possible. However, I don't lose sleep over the use of recursive_mutex (I put it in the standard ;-) ). I can whip up a shrieking mutex if we decide that's what we want. :-)

Collaborator (Author):

I'm open to using a shrieking mutex 😱 if that is the preference.

@@ -707,7 +700,7 @@ void NetworkOPsImp::processHeartbeatTimer ()

Contributor:

In the above code-path we can call setMode(omCONNECTED) back to back: if we're in omDISCONNECTED (line 687) we set omCONNECTED, then in 698, if we're omCONNECTED, we again set omCONNECTED.

This may be harmless (except for the extra work), but it could cause some "flapping" of the network mode (disconnected -> connected -> syncing -> connected)

Collaborator (Author):

I agree this is confusing but would prefer modifying it as a separate change, as I'm less sure of the consequences.

On first pass, I'm not seeing how flapping would happen; I would think both calls to setMode(omCONNECTED) would generally end up setting the same mode, unless we are close to the 1 minute ledger age boundary.


static std::string
to_string(Mode m);
//Revoke our outstanding proposal, if any, and cease proposing
Contributor:

Micronit: leading space missing.

We enter the round proposing or observing. If we detect we are working
on the wrong prior ledger, we go to wrongLedger and attempt to acquire
the right one. Once we acquire the right one, we go to the switchedLedger
mode. If we again detect the wrong ledger before this round ends, we go
Contributor:

This reads weird:

Once we acquire the right one [...] If we again detect the wrong ledger [...] we go back [...] until we acquire the right one.

If we acquired the right one, how can we have the wrong one?

@bachase (Collaborator, Author) commented Jul 5, 2017

If we are really, really slow, it's possible we receive new validations that change our choice of the right ledger. So:

  1. Detect wrong ledger Id:12 and attempt to acquire it.
  2. Time passes.
  3. We acquire ledger Id:12.
  4. Receive new validations for some other ledger Id:13 and attempt to acquire it.

I admit this is an outlandish case, but it is a possible sequence of events currently.

@@ -1094,7 +1094,7 @@ PeerImp::onMessage (std::shared_ptr <protocol::TMTransaction> const& m)
flags |= SF_TRUSTED;
}

if (! app_.getOPs().getValidationPublicKey().size())
if (! app_.getValidationPublicKey().size())
Contributor:

I'm wondering if having bool Application::validating() const makes sense.

Checking the size of the public key is a bit like checking whether your car radio is working by connecting a multimeter to the antenna input, instead of turning the radio on and seeing if sound comes out of the speakers.
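
A minimal sketch of what such an accessor could look like, with stand-in types so the snippet is self-contained; the member names are hypothetical, not the actual Application interface.

#include <cstddef>
#include <vector>

// Stand-ins for the real types; sketch only.
struct PublicKey
{
    std::vector<unsigned char> data;

    std::size_t
    size() const
    {
        return data.size();
    }
};

struct ValidatorKeys
{
    PublicKey publicKey;
};

class Application
{
public:
    // True when this node was configured with validator keys.
    bool
    validating() const
    {
        return validatorKeys_.publicKey.size() != 0;
    }

private:
    ValidatorKeys validatorKeys_;
};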

Collaborator (Author):

Agreed, I'll try to make that explicit.

@wilsonianb (Contributor) left a comment

👍 ValidatorKeys LGTM

@bachase (Collaborator, Author) commented Jul 10, 2017

Rebased on 0.70.1

@bachase force-pushed the consensus-lock3 branch 2 times, most recently from 4f3961e to 556d6f3 on July 12, 2017 16:14
@bachase (Collaborator, Author) commented Jul 12, 2017

Rebased on 0.80.0-b1

@JoelKatz (Collaborator) left a comment

:neckbeard:👍

@bachase (Collaborator, Author) commented Jul 13, 2017

@JoelKatz @wilsonianb can I get a second look at the latest commit 1f5917b? Thanks!

@wilsonianb (Contributor) left a comment

If there isn't one already, could you add an explicit test that consensus is reached with 4 of 5 in agreement?

// additional timerEntry call for the updated peer 0 and
// peer 1 positions to arrive. Once they do, now peers 2-5
// see complete agreement and declare consensus
if (p.id > 1)
BEAST_EXPECT(
p.prevRoundTime() > sim.peers[0].prevRoundTime());
}
else // peer 0 is not participating
Contributor:

Update this comment and the one on line 217 to include peer 1.

p.prevRoundTime() == sim.peers[0].prevRoundTime());
}
BEAST_EXPECT(p.prevProposers() == sim.peers.size() - 1);
BEAST_EXPECT(p.prevRoundTime() == sim.peers[0].prevRoundTime());
Contributor:

When I run this test after reverting the agreement comparison from >= to >, the only check that fails is the prevRoundTime. Is that expected (is that the only thing here indicating that consensus was reached)?

Collaborator (Author):

Yes, that is the only change for now. The nodes still reach consensus, it just takes peers 1-4 longer.

As part of building out the framework, there is now a better model for writing an explicit checker that would verify the number of proposals each node sent. I'll target this spot for an upgrade when it gets released.

@wilsonianb (Contributor) left a comment

👍 Change to consensus check LGTM

@bachase added the "Passed" label (Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.) on Jul 17, 2017
Moves thread safety from generic Consensus to RCLConsensus and switches generic
Consensus to the adaptor design.
@seelabs (Collaborator) commented Jul 20, 2017

In 0.80.0-b2

@seelabs closed this on Jul 20, 2017
Labels
Passed: Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
8 participants