
Evaluate performance of PeX in a test setting #8

Closed
6 tasks done
lthibault opened this issue Aug 26, 2021 · 7 comments
Assignees
Labels
tests Tests, benchmarks and simulations

@lthibault (Contributor) commented Aug 26, 2021

As the PeX implementation nears completion, I think our priority should be to evaluate its performance along the following dimensions:

  1. Correctness
  2. Resilience to partitions
  3. Resilience to partial failure
  4. Time complexity

I've used Matrix to run simulations inside of unit tests (see TestPeerExchange_Simulation) and am so far very pleased with the results. It seems like a good starting point for evaluating and optimizing PeX.

In contrast to Testground, Matrix is simpler to configure, runs directly in unit tests and benchmarks, requires no setup on the dev machine, and needs neither a separate repository nor any fiddling with go.mod. The trade-off is that it uses an in-process transport, so we can't use it to test layer-3 networking behavior. I don't think we care, though.

Below is a list of questions I have about PeX; these should dictate the tests/simulations that need doing. Please feel free to append to it.

  • How fast does PeX converge on a uniform distribution of records?
  • How resilient is the overlay to partial failure without partitions?
  • How long does a partition "remember" records from another partition?
  • How does increasing the fanout during each gossip round affect the convergence rate?
  • How does a pure-rand strategy affect the convergence rate relative to the current (hybrid) strategy?
  • How does #7 (Improve PeX stabilization time and partition-resistance) compare to the rand/hybrid strategies?

At this point, I think we should be going for low-hanging fruit. How can we apply the 80/20 rule here? I would ideally like to be confident in PeX by mid-September.

@aratz-lasa Thoughts? Can you look into this?

@lthibault lthibault added the tests Tests, benchmarks and simulations label Aug 26, 2021
@aratz-lasa (Collaborator)

Testing plan

How fast does PeX converge on a uniform distribution of records?

Check whether the record distribution is uniform using a Kolmogorov-Smirnov test

How resilient is the overlay to partial failure without partitions?

Disconnect nodes until there is a partition

How long does a partition "remember" records from another partition?

Force a partition by making nodes unreachable at the network level, then check how long nodes from the other partition remain in local views

How does increasing the fanout during each gossip round affect the convergence rate?

Increase fanout and check convergence speed (Kolmogorov-Smirnov Test)

How does a pure-rand strategy affect the convergence rate relative to the current (hybrid) strategy?

Use pure rand strategy and check convergence speed (Kolmogorov-Smirnov Test)
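The Kolmogorov-Smirnov checks above could be sketched roughly as follows with SciPy. The sample data is hypothetical; in a real run the samples would be record positions harvested from the simulated views, normalized to [0, 1]:

```python
# Hedged sketch of the uniformity check: are the records observed in a
# node's view spread uniformly across the keyspace? We test hypothetical
# normalized positions against the uniform distribution on [0, 1] with a
# one-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for record positions harvested from a view.
samples = rng.uniform(0.0, 1.0, size=500)

statistic, p_value = stats.kstest(samples, "uniform")
print(f"KS statistic={statistic:.4f}  p-value={p_value:.4f}")

# Failing to reject uniformity at the 5% level would count as converged.
looks_uniform = p_value > 0.05
```

Running the same test each tick and recording the first tick at which uniformity is no longer rejected gives a convergence-speed measurement.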

Questions

  • What is the easiest way to check for partitions or distribution characteristics? There are two main approaches:
  1. Run offline tests, where topological changes are stored in an (InfluxDB-like) database and results are later extracted from it.
  2. Run online tests, where results are analyzed continuously, on every X events/time units.
  • Do Matrix or Testground support online tests?

@lthibault (Contributor, Author)

@aratz-lasa Looks good and sensible. Carry on! 👍

Regarding your questions, I think offline analysis is probably a better idea. It's going to help separate test logic from analysis, and it will also help us share datasets, which is going to be helpful when debugging protocol issues.

Matrix works well for quick-and-dirty tests, but (1) I have yet to implement any traffic-shaping facilities, and (2) data collection is out of scope, I think. Testground might be a better fit here, despite its annoyances: it stores data in an output directory in JSON format, so we can just load that into a Python script or something. You'll need to use the Docker backend in order to do traffic shaping.

However, let's keep the Testground code and the analyses in a separate repo from CASM/Wetware.

@aratz-lasa (Collaborator)

Okay @lthibault , I will start right away!

@lthibault (Contributor, Author) commented Nov 4, 2021

Re:

How fast does PeX converge on a uniform distribution of records?

We want to quantify the convergence time for a PeX cluster as a function of the number of nodes in a cluster.

The output of this should be:

  • a graph of number of nodes vs. convergence time
  • a rough big-O formula for time complexity (e.g. O(n))

Operationalization of convergence:

  • let t = threshold in [0, 1] (n.b.: assumes a normalized y-axis)
  • let ymin = min(y...)
  • convergence C_t holds when ymin > t
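A minimal sketch of this criterion in Python (the function names and the sample run are mine, not from the codebase):

```python
# Sketch of the convergence criterion: given, per tick, one normalized
# metric value per node (in [0, 1]), the cluster has converged at
# threshold t once the minimum across nodes exceeds t.

def converged(y_values, t):
    """C_t: true when min(y...) > t (assumes y is normalized to [0, 1])."""
    return min(y_values) > t

def convergence_tick(series, t):
    """First tick at which the cluster satisfies C_t, or None if it never does.

    `series` is a hypothetical list of per-tick lists of node values."""
    for tick, y_values in enumerate(series):
        if converged(y_values, t):
            return tick
    return None

# Hypothetical run: three nodes approaching 1.0 over four ticks.
run = [
    [0.20, 0.30, 0.10],
    [0.60, 0.70, 0.50],
    [0.96, 0.97, 0.99],
    [0.99, 0.995, 0.999],
]
print(convergence_tick(run, 0.95))  # tick 2 in this made-up run
print(convergence_tick(run, 0.99))  # None: never strictly exceeds 0.99
```

Note the second call illustrates the caveat above: with t=.99 a run may never converge.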

TODO:

  • Convergence time with t=.95
  • Convergence time with t=.99 (n.b. may never converge for small n)

@aratz-lasa (Collaborator)

Convergence test parameters:

  • Nodes: 3-64 (every 2 steps)
  • Ticks: 40
  • Convergence threshold: 0.99, 0.95, 0.8
  • Repetitions: 2-4
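As a side note, a sweep like this is easy to enumerate mechanically; a sketch (the dict keys and the choice of 4 repetitions are mine):

```python
# Sketch: enumerate the convergence-test parameter grid listed above.
from itertools import product

nodes = range(3, 65, 2)          # 3-64, every 2 steps (31 node counts)
ticks = 40
thresholds = (0.99, 0.95, 0.8)
repetitions = range(4)           # "2-4 repetitions"; 4 assumed here

runs = [
    {"n": n, "ticks": ticks, "t": t, "rep": r}
    for n, t, r in product(nodes, thresholds, repetitions)
]
print(len(runs))  # 31 node counts x 3 thresholds x 4 reps = 372
```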

@aratz-lasa (Collaborator)

Convergence results, for the following metrics:

  • Network average records
  • Network average node degree
  • Network average shortest path length
  • Network clustering coefficient
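Three of these are standard graph statistics and could be computed with NetworkX on a snapshot of the overlay. A sketch on a hypothetical stand-in topology (the generator and its parameters are mine, not the actual PeX overlay):

```python
# Sketch: compute average node degree, average shortest path length, and
# clustering coefficient on a hypothetical graph standing in for a PeX
# overlay snapshot (32 nodes, ~4 links each).
import networkx as nx

g = nx.connected_watts_strogatz_graph(n=32, k=4, p=0.3, seed=1)

avg_degree = 2 * g.number_of_edges() / g.number_of_nodes()
avg_path = nx.average_shortest_path_length(g)
clustering = nx.average_clustering(g)

print(f"avg degree={avg_degree:.1f}  "
      f"avg shortest path={avg_path:.2f}  "
      f"clustering={clustering:.3f}")
```

In a real analysis, `g` would be rebuilt from each node's local view at every sampled tick.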

@lthibault (Contributor, Author) commented Dec 16, 2021

Re:

How does increasing the fanout during each gossip round affect the convergence rate?

It occurs to me that this is exactly the same as increasing the frequency of gossip rounds: contacting k times as many peers per round spreads records at the same expected rate as gossiping k times as often. So I'm considering this question answered.
