MTP Audit and Implementation Bounty Submissions

This is a legacy article. It may no longer be accurate or up to date.

This Wiki is to publish the submissions to the MTP Audit and Implementation Bounty.

Further details on this bounty can be found on our blog post

The bounty was concluded which led to the publication of MTPv1.2 on January 2018.

Submissions under Verification/Classification

Verifying Fabien Coelho and Doubloon Skunworks joint submission: Evaluation of the MTP-Argon2 PoW Scheme Submitted on 28 September 2017.

MTP Audit Bounty (10,000 USD)

Submission 1 by mbevand (accepted)

Time-memory trade-off with 1/16th the memory, 2.88× the time

I found a time-memory trade-off attack against MTP-Argon2 that requires approximately 1/16th the memory, increases the time complexity by only ~2.88×, while requiring negligible precomputation work. The overall cost for a cheating prover, measured in ASIC area-time complexity, is hence reduced by a factor ~5.5×.

The prover computes the first 128 MB Argon2 segment of lane 0 honestly. This segment can be reused by lanes 1-3 because verifiers don’t verify the first 2 blocks of the first segments are initialized as per the Argon2 spec, see attack 1.

For the 2nd segment of lane 0, the prover chooses an arbitrary (aka non-consistent) 1st block that references any of the other 3 lanes (2nd 32-bit word, J_2 ≠ current_lane). He then computes the 2nd block honestly. If J_2 in the 2nd block references the current lane (probability = 0.25), the prover tries again by changing some bytes in the 1st block. If J_2 references one of the other 3 lanes (probability = 0.75), the prover progresses further and honestly computes the 3rd block. If J_2 in the 3rd block references the current lane, the provers goes back to the beginning, changes some bytes in the 1st block, so that both the 2nd and 3rd blocks reference any of the other 3 lanes. The prover continues in the same fashion for N blocks: he brute forces the 1st block until he honestly computes N subsequent blocks that all reference any of the other 3 lanes. The probability that an arbitrary 1st block fulfills this condition is 0.75^N. I suggest N = 49 for an efficient attack. On average the prover will have to try 1/(0.75^N) = ~1,324,336 candidate values for the 1st block.

So far the 2nd segment consists of:

block 1: brute forced
blocks 2-50: honestly computed

The prover stores these 50 blocks (memory cost: 50 kB.) The prover makes block 51 non-consistent by setting it to the same content as block 1. Then blocks 52-100, if computed, would generate the same data as blocks 2-50. There is therefore no need to generate or store the remainder of the brute forced segment: it consists of blocks 1-50 repeated over and over.

Repeat these steps to generate the 2nd segments of lanes 1-3, then the 3rd, then the 4th segments of all lanes (12 brute forced segments in total.)

The crux of the attack that is important to understand is that the annoying part of Argon2, from an attacker’s viewpoint, is that if J_2 references the current lane, the reference set R of blocks is constantly changing from block to block. However by forcing J_2 to reference other lanes, the sets R of the other 3 lanes remain constant for the whole segment, so a given block value X[i] will generate the same X[i+1] regardless of the index i.

In total, on average, the prover will have to try 12 × 1,324,336 = ~15.9 million candidate values for the brute forced blocks. This precomputation must be done once for every new challenge (eg. newly mined Zcoin block), and should take just a few seconds of GPU time as it is an embarrassingly parallel task.

The total memory used by the prover is 128 MB (first segments) + 12 * 50 kB (brute forced segments) = ~128.6 MB, or approximately 1/16th the amount used by honest provers.

In the 1st segment, all blocks are consistent. In the other 3 segments, 49 out of 50 blocks are consistent. So on average 197 out of 200 blocks are consistent, which means the proof-of-work’s random selection of L = 70 blocks has probability (197/200)^70 = ~0.347 of being valid. Therefore this attack only increases the time complexity by a factor ~2.88.

Overall, this time-memory trade-off attack is quite attractive: it requires little precomputation, reduces memory use to ~1/16th, and increases runtime by only ~2.88×, which translates to an ASIC area-time advantage of ~5.5× for a cheating prover over an honest prover. The most efficient N value for this attack remains to be determined but is likely around 30-50.

There is no easy and cheap way to detect the attack. The only viable fix that I see is to use Argon2 with 1 lane instead of 4. This forces the reference set R of blocks to change from block to block, so a given block X[i] will generate a block X[i+1] that varies depending on the exact value of the index i.

Submission 2 by mbevand (accepted)

Argon2 Segment Sharing

MTP-Argon2 suggests parameters designed to force provers to use 2048 MiB of RAM. However a flaw allows a malicious prover to reduce his RAM usage by 18.75% down to 1664 MiB by sharing Argon2 segments amongst lanes.

Argon2 is parametrized with 4 lanes of 512 MiB. Each lane is divided in 4 segments of 128 MiB. The first two X[i] blocks of a lane are normally initialized with respectively H’(H_0 ‖ 0 ‖ i) and H’(H_0 ‖ 1 ‖ i).

However an oddity of MTP is that verifiers never check that these first two blocks are initialized this way. So a prover can share the same first segment across all lanes, allowing him to store 3 fewer segments in RAM, hence saving 3 × 128 = 384 MiB of RAM. This brings his RAM usage from 2048 MiB down to 1664 MiB, and makes MTP less memory-hard than it is supposed to.

(Note that even though the first segment of each lane are identical, the second and subsequent segments will be different from other lanes because the reference set R of blocks will be different for different lanes.)

One pragmatic fix for MTP is to simply allow, even encourage, the use of this memory-saving technique, giving the same advantage to malicious and honest provers. Argon2 could be parametrized to use 2581120 KiB (2520.625 MiB) of RAM, so that the actual RAM usage with the memory-saving technique would be close to the 2048 MiB originally intended.

However a cleaner fix is to use Argon2 with 1 lane instead of 4. This would also fix the more serious attack 4.

One might think that an alternative fix would be to require provers to include the first 2 blocks of each lane (and their openings) in the proof, and to require verifiers to ensure that these blocks are unique. However fundamentally MTP allows some inconsistent blocks, so the prover could supply 8 unique blocks, and break the consistency of Argon2 by sharing the same remaining blocks amongst lanes.

Submission 3 by Fabien Coelho and Hidetoshi (accepted)

Parallel Searches

In the context of a probabilistic PoW the bottleneck of which is memory bandwidth, the prover is interested in performing more computation per loaded data, or with data already available in cache. In order to do that, the search paradigm can be inverted so that instead of fetching Array X elements needed for updating the Y value for nonces, it rather fetches the Y values for which the array elements are available, and scans the array in a round-robin fashion so as to make all parallel searches progress. This algorithmic change is significantly profitable if the search state for a nonce is small.

The proposals below add a step to enlarge this search state beyond the (possibly) 28 bytes in [7]. This does not impact Array X memory requirement, but reduces the efficiency of a memory bandwidth limited implementation, and makes it harder to amortize the 2 GiB memory cost over several solvers.

A possible hardware implementation of such a solver is outlined below:

Transposed Search Hardware

We now investigate the transposed search algorithm outlined above. Its main benefit is to reduce the bandwidth requirement which is the bottleneck of both the standard and compressed implementations. As noted before, the search state size is 28 bytes, to which we add 4 bytes to manage a data structure which would allow to order searches per next array element to process, using some simple array and linked-list scheme. We will dedicate area for 576 MiB of search states on die, accomodating on average 9 search states for each 2^21 elements, leaving 8.5% of die area available for up to 4,429 BLAKE2 cores.

When sending one array element, 9 search states on average can be incremented using 9 BLAKE2 operations each, thus with available cores we can handle a throughput of about 4429/9.9 = 54.7 elements per tick, which translate to 1.53 TB/s bandwith to fetch them. This is yet again above the bandwidth limit, but for a much larger number of cores, which are thus active at about 65.5%. The overall throughput is about 137.8 MΩ/s, as BLAKE2 cores are quite actives and the memory bandwidth, although still the bottleneck, is much less so. Throughput could be improved a little by allocating remainder transistors to caching memory. The critical bandwidth efficiency is improved as 9 search states share loading one array element. Moreover, as the memory access pattern is known in advance, data can be prefetched and deep threading is not needed.

This design allows to feed thousands of hash cores with the available bandwidth. Although it is much more efficient than previous implementations, most of the area is still dedicated to cache, and under 10% is really used for computing.

Solution

Size S = We suggest S = 64 which is the maximum for one invocation of BLAKE2. Back sweep Step 6 makes one search state size at least LS bytes and adds a signicant memory requirement (a few KB per nonce) for parallel searches. It somehow ensures that the memory requirement is proportional to the computation, although some trade-off is always possible: this memory is not strictly necessary, as the Y values could still be (re)computed from available data, but as keeping these values takes less space than their dependencies, recomputations are not worth it. If some parallel pipelining takes place, the average number of values needed is L/2 + 1 per search and should be considered to dimension the memory requirements.

MTP Implementation Bounty (2,500 USD)

2500 USD in BTC Bounty paid to Marc Bevand https://blockchain.info/tx/1fa33b6e2cf801e45dc37f1a0ac092229b0f8d37f43051d45b9dfd12a578e5df

Submission 1 by mbevand with fix (Accepted)

https://github.com/zcoinofficial/zcoin/blob/dfb9a0b57ebcc706ae0278b5698792a02d1b66f6/src/mtp.cpp#L609 "i < L * 2" should be replaced with "i < L * 3" or else the verifier will only verify the opening of 2/3rds of the blocks in the merkle tree. Failing to verify the remaining 1/3rd means a malicious prover would have to spend the expected computing resources (2GB) only once to compute 1 potential PoW solution. Then he would simply "grind" one block in the remaining 1/3rd by spending minimal resources, completely defeating the memory-hardness of MTP. For example he would calculate Y_j = H(Y_(j−1), X[i_j]) for L-1 blocks, then would grind the last block, potentially on many machines in parallel, by sending the Y_(L-1) value to the machines, each grinding unique random last blocks, and each needing to allocate only 1kB to hold this block.

Submission 2 by mbevand with fix (Accepted)

Judge's comments: Although this was submitted under the MTP Audit bounty classification, this particular submission arose from Alex Biryukov's (one of MTP's co-authors) explanation to mbevand in an e-mail on the 10 July 2017 on how Merkle tree openings should be validated when mbevand was discussing MTP with them. MTP's authors had assumed that it wasn't necessary to describe this in the paper and do not see it as a flaw in the paper itself, rather a point that was left silent. We had also sought the opinion of our cryptography advisor Tim Ruffing who came to the same conclusion who did not see the issue as a fundamental problem of MTP but more an issue of documentation.

Zcoin's implementation of merkle tree validation however did not address this and therefore we are accepting this submission as an implementation bug.

MTP as described in its paper https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_biryukov.pdf is flawed. Algorithm 2 (verifier's algorithm), step 2, verifies the opening but not the location of the Argon2 blocks in the tree. Their location is not part of the proof. Their location can only be computed at the next step, step 3, when Y_j is calculated. As a result, the prover can pretend the blocks selected through each of the L=70 steps are always X[3] (MTP does not allow to select X[1] or X[2]) and he can effectively run algorithm 1 (prover's algorithm) by storing only 3 kB. I'll explain later how to further reduce memory usage down to 1 kB.

More specifically an attack against MTP would work as follow: in the prover's algorithm, step 1, the prover computes and stores X[1] through X[3] in memory, and stops here. The prover assumes the rest of the blocks X[4] through X[T] are all zeros. In step 5, the prover doesn't bother computing i_j but assumes i_j is always 3. The rest of the algorithm is run normally. Whenever a nonce N results in a satisfying Y_L, the prover's proof will be the merkle root, the nonce N, and the set Z of L=70 elements where each element is the same:

(opening of X[3], opening of its two input blocks X[2] and X[1], input block X[2], and input block X[1])

On a side note, the Zcoin implementation of MTP would also store X[3] in the proof, but this is unnecessary.

The verifier's algorithm blindly accepts this proof because all the openings are valid and the algorithm doesn't verify whether the location (Y_j mod T) equals 3 or not. Since the location isn't verified or even known by the verifier's algorithm, it seems the paper implies a "canonical" merkle tree is computed: a tree where the hash of the "left" and "right" hashes are computed by first ordering the 2 hashes.

On a side note, the Zcoin implementation of MTP gets around the problem of not knowing the location not by computing a "canonical" merkle tree, but by storing both the left and right hashes in the proof. But it too fails to verify the location, so it doesn't prevent the attack.

In order to fix the verifier's algorithm, step 2 should be removed, and step 3 should be changed to:

Compute from Z for 1 ≤ j ≤ L: X[i_j] = F(X[i_j - 1], X[φ(i_j)]) Verify opening of X[i_j] and its location in tree: (Y_(j-1) mod T) Verify opening of X[i_j - 1] and its location in tree: (Y_(j-1) mod T) - 1 Verify opening of X[φ(i_j)] and its location in tree (*) Y_j = G(Y_j - 1, X[i_j])

(*) The location of X[φ(i_j)] is specified by the first two 32-bit words of X[i_j - 1] as documented in the Argon2 specification (J_1 and J_2 words.)

Since an opening consists of 21 hashes and no additional information, some of these hashes will be "left" hashes, others "right" hashes. So verifying both the opening and location of a block is done by hashing the hash in the proof and the hash computed on the fly while verifying the opening in the order determined by the block's location in the tree.

In the Zcoin implementation, it stores 21*3 hashes per opening (parent, left, right) so verifying the location of the block would consist of verifying the hash is at the expected place (either left or right) instead of freely allowing it to be at either the left or right: https://github.com/zcoinofficial/zcoin/blob/dfb9a0b57ebcc706ae0278b5698792a02d1b66f6/src/libmerkletree/merkletree.cpp#L79

The memory usage of this attack against MTP can be reduced from 3 kB to 1 kB. Because the verifier's algorithm allows any value for X[1] and X[2], the prover can actually assume they are all zeros and not even store them in memory.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly