
Migrate thread safety to RCLConsensus (RIPD-1389) #2106

Closed
wants to merge 2 commits

Conversation

@bachase (Collaborator) commented May 8, 2017

This PR moves the consensus thread-safety logic from the generic implementation in Consensus into the RCL-adapted version, RCLConsensus. This has the advantage of placing the concurrency responsibilities in one spot in the code, and it additionally improves the performance of the consensus simulation framework by eliminating the unneeded concurrency overhead. The two primary changes are:

  1. Switch from a CRTP design to an Adaptor design for generic Consensus. Previously, CRTP was used to allow the specializing class to call into the generic Consensus code while servicing some of its required callback members (e.g. onClose could call some other member of Consensus). The Adaptor design makes the communication from Consensus to the specializations in the Adaptor more explicit by passing any dependencies as arguments to the callback functions (see the sketch after this list).

  2. Move the thread-safety logic from Consensus to RCLConsensus. I now use atomics to hold status-related information at the RCLConsensus level to avoid acquiring a mutex just to read a value. There is still room for improving the design, particularly when dispatching the onAccept call to the job queue, but I think this is still an improvement. Although not strictly necessary, I still use a recursive_mutex instead of a regular mutex in RCLConsensus to protect any future refactoring from breaking things.
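
For orientation, here is a minimal sketch of the two shapes described in item 1; the class and member names are illustrative only, not the actual rippled interfaces.

// CRTP shape: the generic code calls into the derived class, and the derived
// class can freely reach back into the generic members.
template <class Derived>
class ConsensusCRTP
{
public:
    void
    timerEntry()
    {
        // Derived::onClose may itself call other ConsensusCRTP members.
        static_cast<Derived*>(this)->onClose();
    }
};

// Adaptor shape: the generic code holds a reference to an adaptor and passes
// whatever a callback needs as explicit arguments.
template <class Adaptor>
class ConsensusGeneric
{
public:
    explicit ConsensusGeneric(Adaptor& a) : adaptor_(a)
    {
    }

    void
    timerEntry(typename Adaptor::Ledger_t const& prevLedger)
    {
        // The adaptor sees only what it is handed; it cannot silently depend
        // on this class's internal state.
        adaptor_.onClose(prevLedger);
    }

private:
    Adaptor& adaptor_;
};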

This is the last planned substantial refactor of the consensus code before focusing fully on the simulation framework.

Doc preview: http://bachase.github.io/ripd1389/

@bachase (Collaborator, Author) commented May 8, 2017

Some points I may consider addressing if reviewers think worthwhile:

  • Join the RCLConsensus::Consensus<Adaptor> member with its mutex via a scoped_value-style helper to ensure we always take the lock before accessing the member.
  • Consider switching to a fixed-size ring buffer instead of the deque currently used to store the 10 most recent peer proposals (a sketch appears below, after @HowardHinnant's note).
  • Use pimpl or a virtual interface for RCLConsensus.
  • Consider std::memory_order_relaxed for the atomic members of RCLConsensus. These are primarily status values that can tolerate being slightly stale (a sketch follows this list).
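
On the last bullet, a minimal sketch of the relaxed-ordering idea, assuming a status flag published by the consensus thread and read elsewhere; the names are illustrative, not actual RCLConsensus members.

#include <atomic>

class ConsensusStatus
{
public:
    void
    setProposing(bool p)
    {
        // The writer publishes the latest value; no ordering with other data
        // is required for these report-only fields.
        proposing_.store(p, std::memory_order_relaxed);
    }

    bool
    proposing() const
    {
        // A slightly stale read here only affects status reporting.
        return proposing_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<bool> proposing_{false};
};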

@HowardHinnant (Contributor) commented:

On the ring buffer, I saw this this morning: https://github.com/martinmoene/ring-span-lite

I haven't checked it out and don't know how good it is. Just thought I would mention it in case it is helpful.
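
Along those lines, a minimal fixed-capacity "most recent N" buffer for the proposal-history bullet might look like the following; the name and capacity are illustrative, and this is neither ring-span-lite nor code from this PR.

#include <array>
#include <cstddef>

template <class T, std::size_t N>
class RecentBuffer
{
public:
    // Overwrite the oldest element once the buffer is full.
    void
    push(T const& t)
    {
        buf_[next_] = t;
        next_ = (next_ + 1) % N;
        if (size_ < N)
            ++size_;
    }

    std::size_t
    size() const
    {
        return size_;
    }

    // i == 0 is the oldest retained element, size() - 1 the newest.
    T const&
    operator[](std::size_t i) const
    {
        std::size_t const first = (next_ + N - size_) % N;
        return buf_[(first + i) % N];
    }

private:
    std::array<T, N> buf_{};
    std::size_t next_ = 0;
    std::size_t size_ = 0;
};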

@codecov-io commented May 9, 2017

Codecov Report

Merging #2106 into develop will increase coverage by 0.07%.
The diff coverage is 71.02%.


@@             Coverage Diff             @@
##           develop    #2106      +/-   ##
===========================================
+ Coverage    69.48%   69.55%   +0.07%     
===========================================
  Files          685      689       +4     
  Lines        50520    50565      +45     
===========================================
+ Hits         35105    35173      +68     
+ Misses       15415    15392      -23
Impacted Files Coverage Δ
src/ripple/app/misc/NetworkOPs.h 100% <ø> (ø) ⬆️
src/ripple/app/main/Application.h 100% <ø> (ø) ⬆️
src/ripple/overlay/impl/PeerImp.h 0% <ø> (ø) ⬆️
src/ripple/app/consensus/RCLCxPeerPos.h 0% <0%> (ø) ⬆️
src/ripple/overlay/impl/PeerImp.cpp 0% <0%> (ø) ⬆️
src/ripple/app/consensus/RCLCxPeerPos.cpp 0% <0%> (ø) ⬆️
src/ripple/protocol/PublicKey.h 100% <100%> (ø) ⬆️
src/ripple/app/misc/ValidatorKeys.h 100% <100%> (ø)
src/ripple/consensus/Consensus.cpp 82.69% <100%> (ø) ⬆️
src/ripple/app/misc/NetworkOPs.cpp 61.39% <38.88%> (+0.09%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 87742a5...95af8f8.

@bachase (Collaborator, Author) commented May 18, 2017

Rebased on 0.70.0-b6

@bachase (Collaborator, Author) commented Jun 26, 2017

Tagging @wilsonianb for 16589d4

//==============================================================================

#ifndef RIPPLE_APP_MISC_VALIDATOR_IDENTITY_H_INCLUDED
#define RIPPLE_APP_MISC_VALIDATOR_IDENTITY_H_INCLUDED
Contributor:

Why IDENTITY instead of KEYS?

Collaborator (Author):

Good catch, I had renamed the file a few times.

class Config;

/** Validator keys and manifest as set in configuration file. Values will be
emtpy if not configured as a validator or not configured with a manifest.
Contributor:

emtpy -> empty

{
configInvalid_ = true;
JLOG(j.fatal())
<< "Invalid entry in validator token configuration.";
Contributor:

We could modify this to resemble the message for an invalid seed below
"Invalid seed specified in [" SECTION_VALIDATOR_TOKEN "]"

// No config -> no key but valid
Config c;
ValidatorKeys k{c, j};
BEAST_EXPECT(k.publicKey.size() == 0);
Contributor:

You could also check secretKey to be extra safe. Same with a few other tests below.

Collaborator (Author):

Actually, I missed this when adding the [FOLD] commit, but secretKey has a static size. How would you recommend I test it?

Contributor:

Ah, I forgot about that. I'm not sure how you could test that.

}

{
// Token takes precedence over seed
Contributor:

We actually don't allow the config to have both a token and a seed, but the check happens here:
https://github.com/bachase/rippled/blob/consensus-lock3/src/ripple/core/impl/Config.cpp#L361-L364
Do you think that should be moved into ValidatorKeys?

Collaborator (Author):

Good question, I clearly missed that since I added this test. From a code standpoint, it would be nice to move that check over, but I like the idea of validating as early as possible, which would be where the config currently does the work. What would you prefer?

Contributor:

I also like keeping that check with the other config validations in Config::loadFromString, so I'd vote for either leaving it as is or checking again in ValidatorKeys (and marking configInvalid_).

@@ -40,6 +40,7 @@ namespace ripple {
class InboundTransactions;
class LocalTxs;
class LedgerMaster;
class ValidatorKeys;

/** Manges the generic consensus algorithm for use by the RCL.
Contributor:

Manges --> Manages

@bachase (Collaborator, Author) commented Jul 5, 2017

Rebased on 0.70.0, squashed a few commits, and addressed @wilsonianb's comments.

@nbougalis (Contributor) left a comment

Mostly minor things


// Called when consensus operating mode changes
void onModeChange(ConsensuMode before, ConsensusMode after);
Contributor:

ConsensuMode is missing an s


// Encapsulates the result of consensus.
template <class Adaptor>
struct ConsensusResult
Contributor:

So... ConsensusResult, ConsensusMode, ConsensusCloseTimes... it's starting to be a bit much. Should we consider a separate consensus namespace for all this?

@bachase (Collaborator, Author) commented Jul 5, 2017

I'm all for it! There were some downvotes last time the topic came up. I could put them inside the Consensus class, but not all of them depend on the Adaptor template type, so it's convenient to specify them independently.
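
For illustration only, the grouping under discussion might look something like this; the namespace and exact members are hypothetical, and it is not what this PR does.

namespace ripple {
namespace consensus {

// The free-standing Consensus* helpers gathered in one place.
enum class Mode { proposing, observing, wrongLedger, switchedLedger };

struct CloseTimes;

template <class Adaptor>
struct Result;

}  // namespace consensus
}  // namespace ripple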

@@ -449,7 +466,7 @@ RCLConsensus::doAccept(
JLOG(j_.info()) << "CNF buildLCL " << newLCLHash;

// See if we can accept a ledger as fully-validated
ledgerMaster_.consensusBuilt(sharedLCL.ledger_, getJson(true));
ledgerMaster_.consensusBuilt(sharedLCL.ledger_, consensusJson);
Contributor:

I think we can do std::move(consensusJson) here for a potential micro-optimization

// Since Consensus does not provide intrinsic thread-safety, this mutex
// guards all calls to consensus_. adaptor_ uses atomics internally
// to allow concurrent access of its data members that have getters.
mutable std::recursive_mutex mutex_;
Contributor:

Can we easily avoid std::recursive_mutex in favor of a non-recursive one? Recursive locking = code smell.

Collaborator (Author):

I'm fairly confident the current version does not require the recursive_mutex. In fact, that was the primary motivation for this refactor. But @scottschurr has done a good job convincing me that unless the code structurally prevents recursive calls while under lock, future refactors can break this assumption in non-obvious ways.

In particular, we may call into support code in NetworkOps or InboundLedgers while the mutex is under lock. Currently those won't call back into RCLConsensus, but it's tough to prevent that from changing in a clear way.

Collaborator:

Since my name came up, I'll mention that it should be possible to add instrumentation that would fire an assert or LogicError if a mutex were about to be used recursively when it is not intended for such use. I haven't thought hard about the particular scenario in question, but here's an example where we know that recursive calls are possible, but we didn't want them to be recursive at a particular place: https://github.com/ripple/rippled/blob/develop/src/ripple/core/impl/DeadlineTimer.cpp#L36-L41

Another possibility would be that one or more of our (non-recursive) mutex library implementations may shriek in an appropriate fashion (:scream:) if called recursively. @HowardHinnant would be a good resource to ask about that. I know Howard has strong opinions about recursive_mutex.
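
A minimal sketch of that kind of instrumentation, assuming a wrapper around std::mutex with a hypothetical name; this is not the DeadlineTimer code linked above.

#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

class CheckedMutex
{
public:
    void
    lock()
    {
        // Shriek (assert) if the calling thread already holds the mutex.
        assert(owner_.load() != std::this_thread::get_id());
        m_.lock();
        owner_.store(std::this_thread::get_id());
    }

    void
    unlock()
    {
        owner_.store(std::thread::id{});
        m_.unlock();
    }

private:
    std::mutex m_;
    std::atomic<std::thread::id> owner_{std::thread::id{}};
};

// Usable with the standard lock helpers, e.g. std::lock_guard<CheckedMutex>.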

Contributor:

It is generally best to avoid recursive_mutex if possible. However, I don't lose sleep over the use of recursive_mutex (I put it in the standard ;-) ). I can whip up a shrieking mutex if we decide that's what we want. :-)

Collaborator (Author):

I'm open to using a shrieking mutex 😱 if that is the preference.

@@ -707,7 +700,7 @@ void NetworkOPsImp::processHeartbeatTimer ()

Contributor:

In the above code-path we can call setMode(omCONNECTED) back to back: if we're in omDISCONNECTED (line 687) we set omCONNECTED, then in 698, if we're omCONNECTED, we again set omCONNECTED.

This may be harmless (except for the extra work), but it could cause some "flapping" of the network mode (disconnected -> connected -> syncing -> connected)

Collaborator (Author):

I agree this is confusing but would prefer modifying it as a separate change, as I'm less sure of the consequences.

On first pass, I'm not seeing how flapping would happen; I would think both calls to setMode(omCONNECTED) would generally end up setting the same mode, unless we are close to the 1 minute ledger age boundary.


static std::string
to_string(Mode m);
//Revoke our outstanding proposal, if any, and cease proposing
Contributor:

Micronit: leading space missing.

We enter the round proposing or observing. If we detect we are working
on the wrong prior ledger, we go to wrongLedger and attempt to acquire
the right one. Once we acquire the right one, we go to the switchedLedger
mode. If we again detect the wrong ledger before this round ends, we go
Contributor:

This reads weird:

Once we acquire the right one [...] If we again detect the wrong ledger [...] we go back [...] until we acquire the right one.

If we acquired the right one, how can we have the wrong one?

@bachase (Collaborator, Author) commented Jul 5, 2017

If we are really, really slow, it's possible we receive new validations that change our choice of the right ledger. So:

  1. Detect wrong ledger Id:12 and attempt to acquire it.
  2. Time passes.
  3. We acquire ledger Id:12.
  4. Receive new validations for some other ledger Id:13 and attempt to acquire it.

I admit this is an outlandish case, but it is a possible sequence of events currently.

@@ -1094,7 +1094,7 @@ PeerImp::onMessage (std::shared_ptr <protocol::TMTransaction> const& m)
flags |= SF_TRUSTED;
}

if (! app_.getOPs().getValidationPublicKey().size())
if (! app_.getValidationPublicKey().size())
Contributor:

I'm wondering if having bool Application::validating() const makes sense.

Checking the size of the public key is a bit like checking whether your car radio is working by connecting a multimeter to the antenna input, instead of turning the radio on and seeing if sound comes out of the speakers.
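
A minimal sketch of what such an accessor could look like, with stand-in types so the snippet is self-contained; the member names are hypothetical, not the actual Application interface.

#include <cstddef>
#include <vector>

// Stand-ins for the real types; sketch only.
struct PublicKey
{
    std::vector<unsigned char> data;

    std::size_t
    size() const
    {
        return data.size();
    }
};

struct ValidatorKeys
{
    PublicKey publicKey;
};

class Application
{
public:
    // True when this node was configured with validator keys.
    bool
    validating() const
    {
        return validatorKeys_.publicKey.size() != 0;
    }

private:
    ValidatorKeys validatorKeys_;
};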

Collaborator (Author):

Agreed, I'll try to make that explicit.

@wilsonianb (Contributor) left a comment

👍 ValidatorKeys LGTM

@bachase (Collaborator, Author) commented Jul 10, 2017

Rebased on 0.70.1

@bachase force-pushed the consensus-lock3 branch 2 times, most recently from 4f3961e to 556d6f3 on July 12, 2017 16:14
@bachase (Collaborator, Author) commented Jul 12, 2017

Rebased on 0.80.0-b1

@JoelKatz (Collaborator) left a comment

:neckbeard:👍

@bachase (Collaborator, Author) commented Jul 13, 2017

@JoelKatz @wilsonianb can I get a second look at the latest commit 1f5917b? Thanks!

@wilsonianb (Contributor) left a comment

If there isn't one already, could you add an explicit test that consensus is reached with 4 of 5 in agreement?

// additional timerEntry call for the updated peer 0 and
// peer 1 positions to arrive. Once they do, now peers 2-5
// see complete agreement and declare consensus
if (p.id > 1)
BEAST_EXPECT(
p.prevRoundTime() > sim.peers[0].prevRoundTime());
}
else // peer 0 is not participating
Contributor:

Update this comment and the one on line 217 to include peer 1.

p.prevRoundTime() == sim.peers[0].prevRoundTime());
}
BEAST_EXPECT(p.prevProposers() == sim.peers.size() - 1);
BEAST_EXPECT(p.prevRoundTime() == sim.peers[0].prevRoundTime());
Contributor:

When I run this test after reverting the agreement comparison from >= to >, the only check that fails is the prevRoundTime. Is that expected (is that the only thing here indicating that consensus was reached)?

Collaborator (Author):

Yes, that is the only change for now. The nodes still reach consensus, it just takes peers 1-4 longer.

As part of building out the framework, there is now a better model for writing an explicit checker that would verify the number of proposals each node sent. I'll target this spot for an upgrade when it gets released.

@wilsonianb (Contributor) left a comment

👍 Change to consensus check LGTM

@bachase added the "Passed" label (Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.) on Jul 17, 2017
Moves thread safety from generic Consensus to RCLConsensus and switches generic
Consensus to the adaptor design.
@seelabs (Collaborator) commented Jul 20, 2017

In 0.80.0-b2

@seelabs closed this on Jul 20, 2017
Labels
Passed: Passed code review & PR owner thinks it's ready to merge. Perf sign-off may still be required.
8 participants