
Possible ASIC design #11

Closed
tevador opened this issue Dec 24, 2018 · 169 comments


@tevador tevador commented Dec 24, 2018

EDIT: This design is outdated and no longer applies to the current RandomX version.

A similar idea was originally proposed by @cjdelisle for a GPU miner, but I think it's more applicable to an ASIC design.

RandomX ASIC miner:

  • ~66 MiB of SRAM on-chip
  • 4 GiB of HBM memory
  • 1 dedicated core for dataset expansion
  • 256 small decoder/scheduler cores
  • 28 worker cores

During dataset expansion, the SRAM is used to store the 64 MiB cache, while dataset blocks are loaded into HBM memory.

When mining, the ASIC runs 256 programs in parallel.
Memory allocation:

  • 256 * 256 KiB scratchpad = 64 MiB of SRAM for scratchpads
  • 256 * 8 KiB = 2 MiB of SRAM for program buffers
  • 32 KiB for register files
  • ~256 KiB for program stack (1 KiB per program) = maximum recursion depth of 64
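
As a quick sanity check, these figures add up to the ~66 MiB of on-chip SRAM mentioned above (just the arithmetic on the numbers in the list, nothing more):

```cpp
#include <cstdio>

int main() {
    // SRAM budget during mining, using the figures from the list above (all in KiB).
    const double scratchpads    = 256 * 256.0; // 256 programs * 256 KiB scratchpad
    const double programBuffers = 256 * 8.0;   // 256 programs * 8 KiB program buffer
    const double registerFiles  = 32.0;
    const double programStacks  = 256 * 1.0;   // 1 KiB stack per program
    const double totalMiB = (scratchpads + programBuffers + registerFiles + programStacks) / 1024.0;
    std::printf("on-chip SRAM: ~%.1f MiB\n", totalMiB); // prints ~66.3 MiB
}
```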

Instructions are loaded into the decoder/scheduler cores (each core has its own program buffer, program counter and register file). The scheduler cores handle only CALL and RET instructions and pass the rest to one of the 28 worker cores.

Each of the 28 worker cores implements exactly one RandomX instruction, pipelined for a throughput that matches the instruction's weight (for example, the MUL_64 worker can handle 21 times more instructions per clock than the DIV_64 worker). Each worker has an instruction queue fed by the scheduler cores.

The speed of the decoder/scheduler cores would be designed to keep every worker core 100% utilized.
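
To illustrate the split of responsibilities, here is a software-only sketch (the opcode names, `Worker` and `SchedulerCore` are invented for illustration, not taken from the spec): a scheduler core looks only at the opcode, keeps CALL/RET for itself, and pushes everything else into the queue of the worker dedicated to that opcode.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hypothetical opcodes for illustration; the real set is defined by the RandomX spec.
enum class Op : std::uint8_t { ADD_64, MUL_64, DIV_64, FPROUND, CALL, RET };

struct Instr { Op op; std::uint8_t dst, src; std::uint32_t imm; };

// One queue per worker core. Each worker implements exactly one instruction and is
// pipelined so that its throughput matches that instruction's weight in the program mix.
struct Worker {
    std::deque<Instr> queue;
    unsigned instructionsPerClock; // e.g. the MUL_64 worker ~21x the DIV_64 worker
};

struct SchedulerCore {
    std::unordered_map<Op, Worker*> workers; // the 28 shared worker cores

    void dispatch(const Instr& ins) {
        if (ins.op == Op::CALL || ins.op == Op::RET) {
            handleControlFlow(ins); // handled locally: program counter, 1 KiB stack
        } else {
            workers.at(ins.op)->queue.push_back(ins); // forwarded to the dedicated worker
        }
    }
    void handleControlFlow(const Instr&) { /* push/pop return address, update PC */ }
};
```

The complications listed below are exactly the points where this simple push model breaks down.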

Some complications of the design:

  • CALL/RET will stall the program if there is a dependency on an instruction which is still in a worker queue
  • FPROUND will stall all subsequent floating point instructions
  • instructions which use the same address register will create a dependency chain

There could be also some dedicated load/store ports for loading instruction operands.

If limited only by HBM bandwidth, this ASIC could do around 120 000 programs per second (480 GiB/s memory read rate), or roughly the same as 30 Ryzen 1700 CPUs. This assumes that sufficient compute power can fit on a single chip. If we estimate power consumption at 300 W (= Vega 64), this ASIC would be around 9 times more power efficient than a CPU.
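
As a sanity check, the 120 000 figure follows directly from the read rate if each program reads roughly 4 MiB from the dataset (an assumption here; the same per-program figure appears in the improved estimate linked below):

```cpp
#include <cstdio>

int main() {
    const double readRateGiB = 480.0;          // assumed HBM read rate, GiB/s
    const double datasetMiBPerProgram = 4.0;   // assumption: ~4 MiB of dataset reads per program
    std::printf("~%.0f programs/s\n", readRateGiB * 1024.0 / datasetMiBPerProgram); // ~122880
}
```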

Improved estimate: #11 (comment)

Disclaimer: I'm not an ASIC designer.

@tevador tevador added the discussion label Dec 24, 2018

@cjdelisle cjdelisle commented Dec 24, 2018

The 66MiB of SRAM is probably a killer, but I suspect it can be avoided by doing async instructions and thousands or hundreds of thousands of threads.

BTW credit for this idea should go to: https://github.com/aggregate/MOG/


@tevador tevador commented Dec 24, 2018

@cjdelisle 100 000 parallel threads would require ~24 GiB of memory just for scratchpads.
Also, if you store the scratchpad in HBM, your memory bandwidth usage will increase roughly 3 times.

You could also compromise by storing only the first 16 KiB of each scratchpad in SRAM, which would decrease the required amount of on-chip memory to a more realistic ~6 MiB.


@cjdelisle cjdelisle commented Dec 24, 2018

The idea is to use threads to replace the latency bottleneck with a memory bandwidth bottleneck. This is basically the ring-processor idea, but I think the von Neumann bottleneck eventually degrades into a sorting problem: you need to sort the opcodes and operands to be next to each other, then gather the completed operand/opcode bundles into a queue which is fed to the processor, creating more opcodes which need more operands sorted...


@SChernykh SChernykh commented Dec 24, 2018

66 MiB of SRAM is nothing unusual for an ASIC; the speedup from SRAM will outweigh the bigger chip area by a large margin. Scrypt ASICs had 144 MiB of SRAM per core, IIRC.


@SChernykh SChernykh commented Dec 24, 2018

Anyway, HBM memory + memory controller + 256 CPU-like cores with 66 MiB of SRAM on chip sounds very similar to a GPU. It'll still be limited by computing power, not memory bandwidth. Power efficiency (H/s/W) will be maybe 2-3 times better than a CPU.


@tevador tevador commented Dec 24, 2018

@SChernykh A GPU would be compute-bound because it cannot run RandomX efficiently. Most 64-bit instructions have to be emulated on GPUs, and double precision runs 16 times slower than single precision. An ASIC, on the other hand, can execute 1 instruction per cycle in the ideal case.

@cjdelisle RandomX was not designed to be latency-bound. That's why the dataset is accessed sequentially. Only scratchpad access is latency-bound, but it can fit into SRAM.

Also the instructions executed in RandomX are not independent. There will be random dependency chains, typically due to instructions using the same register (very rarely using the same scratchpad word). This would slightly complicate the design.

I have contacted @timolson to help us assess the viability and possible performance
of a RandomX ASIC.


@timolson timolson commented Dec 24, 2018

On first glance it looks much improved from RandomJS, being a much closer model of the underlying hardware.

However, if I understand correctly, you can run many programs in parallel on chip, assuming enough scratchpad space. This is similar to CryptoNight and favors ASIC development. The limiting factor will probably be this scratchpad area, not the logic, and as you pointed out, 66 MiB for a bunch of cores is no problem, being only about 41 mm² in a 16 nm process. If we conservatively assume logic doubles the area, then an 80 mm² chip might cost around $10-15 packaged. You're gonna crush CPUs with this.

One way to prevent this parallelism is to make the large DRAM table read/write. Then you need a memcontroller and DRAM set for every core, which is closer to the setup of consumer hardware. Being able to isolate the running program to just the chip die makes for a nice, efficient ASIC. Where ASIC's fall down is when they have to wait for external IO. Once you hit the logic board, it's like a Ferrari in rush hour traffic.

Another option is to somehow force nonces to be tried in serial instead of parallel. Then an ASIC can't beat CPU's by merely adding cores. An ASIC design could still be cost-efficient by eliminating the excess cores on a CPU and all the gates for cache control. Or maybe there's a way to limit parallelism to exactly 8 or some chosen number of cores. This would make CPU's closer to the optimal configuration for the problem.

I didn't look closely enough, but wanted to point out potential "serialization" attacks. If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM. Also, parallelism may be discovered and exploited in the programs if there's not enough register overlap. You might consider using fewer registers that are updated more frequently to address this. Again, I'm not sure if it's an actual problem with your design, because I didn't look closely enough, but it should be mentioned.

A couple other miscellaneous comments:

  • I wouldn't waste die area or ASIC development time on the large table expansion. It can be done offline with a CPU... A miner will have an SoC controller anyway which would handle this.
  • HBM is a specialized process that's rather expensive. GDDR5 is probably a more efficient choice. GPU manufacturers live by "better performance" not "hashes-per-dollar," so they have moved to HBM in search of peak performance.
  • GPUs will be limited by their SRAM, which is 2MB per Nvidia SM, with only 64kB accessible to any one thread block. AMD's are less limiting but still problematic. Both CPUs and ASICs can outperform GPU's in this area. The scratchpad sizes you're using effectively prevent GPUs from being efficient.
  • Double precision float math on GPUs is 2-32x slower than single precision, but CPU's handle 64-bit math very well. Most GPUs also suck at all integer multiplication (addition is fine), although the GPGPU movement is changing this somewhat. The new 2000 series is somewhat better, but they still emphasize float performance (deep learning training is primarily floating point multiplication). So integer multiplication is another way CPU's and ASIC's can outperform GPU's. Here you can see the instruction throughput numbers for Nvidia: notice how integer multiplication is so slow they don't even give a number! They just say "multiple instructions." Bit operations are also slow on GPU's. GPU's are really focused on float performance, and many games use half-precision 16-bit floats. It looks "good enough."

I'm currently writing a GPU miner for Grin, and since the genesis block is January 15th, I don't have much time to look deeper until later in January or February, sorry. I can quickly address specific concerns if you want to point something out, or if I overlooked something critical in my very brief review.


@timolson timolson commented Dec 24, 2018

One more comment: An ASIC can have the "correct ratio" of logic handlers based on the frequency and latency of the various instructions used in RandomX, which may be different from a CPU's. As a simple example, let's assume only two instructions, int multiply and int add, randomly selected 50/50. If multiply takes 4 cycles and add takes 1, then an ASIC will have 4 mul units for every 1 add unit, whereas a CPU gets one of each. That may not be strictly true, but you should tune your probabilities such that probability_of_instruction × latency_in_cpu is the same for all instructions. In the above case, you want adds to be 4x more frequent than multiplies (assuming the CPU has one of each).
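
Put differently, for a reference CPU with one unit per instruction type, each instruction's frequency should be proportional to that unit's throughput, so every unit stays busy. A tiny sketch with made-up costs (the names and cycle counts here are illustrative, not RandomX's actual weights):

```cpp
#include <cstdio>

int main() {
    // Hypothetical per-instruction cost in cycles (reciprocal throughput on the reference CPU).
    const struct { const char* name; double cost; } instrs[] = {
        {"ADD", 1.0}, {"MUL", 4.0}, {"DIV", 20.0},
    };
    double norm = 0;
    for (const auto& i : instrs) norm += 1.0 / i.cost;
    // Frequency proportional to 1/cost keeps each single-unit pipeline equally utilized,
    // e.g. ADD ends up 4x more frequent than MUL, matching the example above.
    for (const auto& i : instrs)
        std::printf("%-3s: %.1f%% of the mix\n", i.name, 100.0 * (1.0 / i.cost) / norm);
}
```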


@timolson timolson commented Dec 24, 2018

Aaaaand one more thing... Although the 66 MiB chip you proposed would be $10-15, it's only gonna clock around 1GHz. Intel and AMD definitely do have lots of IP and scale to do efficient layouts and squeeze the maximum speeds out of their designs. If you can fix the multi-core problem running lots of programs in parallel, then a startup ASIC maker, even Bitmain, will not get close to the CPU performance of the incumbents. But you probably need to get within a factor of 3 or so.


@tevador tevador commented Dec 24, 2018

@timolson Thanks for the review.

However, if I understand correctly, you can run many programs in parallel on chip, assuming enough scratchpad space.

Yes and assuming you can read from the dataset quickly enough. The dataset is too big to be stored on-chip, so external memory is unavoidable.

One way to prevent this parallelism is to make the large DRAM table read/write.

That is difficult to do while also allowing hash verification for light clients who might not have enough RAM. But perhaps a 1 GB read/write buffer would be possible. Even phones have at least 2 GB nowadays.

Another option is to somehow force nonces to be tried in serial instead of parallel.

I'm not aware of any way to achieve this. I don't think it's possible without some central authority handing out nonces.

Currently, parallelism is limited only by DRAM bandwidth.

If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM.

I don't think this is possible. Scratchpad read/write addresses are calculated from register values, which depend on previous results. The dataset is already being read sequentially and the reads cannot be reordered.

There are only 8 registers for address generation and register values change every instruction, so the sequence of independent instructions will be very short, perhaps 2-3 instructions.

Regarding GPU performance, I already suspected most of what you wrote.


@SChernykh SChernykh commented Dec 24, 2018

But perhaps a 1 GB read/write buffer would be possible.

Per thread? Or one buffer for all threads? How are you going to synchronize them?


@timolson timolson commented Dec 24, 2018

Currently, parallelism is limited only by DRAM bandwidth.

Make sure it's tuned such that typical DDR4 SODIMM speeds align with 8-core parallelism and I'd say you're getting close. However, GDDR5 and HBM crush DDR4 for bandwidth-per-dollar, so if you pin the PoW to memory bandwidth, CPUs will lose out. One way to address that dilemma may be to use random DRAM access instead of sequential. Some of GDDR's improved speed comes from using wider rows and some comes from a wider bus, but DDR4 is competitive for random access patterns. If you're only reading a single word at a time, it doesn't matter that GDDR grabs 32 words while DDR only gets 8, or whatever. They both have random access latencies of 40-45 ns.


@SChernykh SChernykh commented Dec 24, 2018

The best way would be to read random 64 bytes at a time. DDR4/CPU cache is optimized for this burst read size.


@timolson timolson commented Dec 24, 2018

The best way would be to read random 64 bytes at a time. DDR4/CPU cache is optimized for this burst read size.

Longer reads will favor GDDR & HBM because they have wider busses and also can push more bits-per-pin-per-cycle. I would suggest something smaller than 64 bytes, which would need 512 pin-cycles in DDR4 and only 128 pin-cycles in GDDR5. This is wider than consumer SODIMMs. 16 bytes is probably safe.


@hyc hyc commented Dec 24, 2018

A 64-byte burst is optimal for a typical 64-bit memory channel; DDR4 is designed for 8-transfer bursts per pin. While a GPU can grab that in a single cycle with a 512-bit bus, it won't be able to burst any of its accesses.


@SChernykh SChernykh commented Dec 24, 2018

I think the ideal case would be 64-byte random accesses that also saturate dual-channel DDR4 bandwidth on 8-core CPU.


@tevador tevador commented Dec 24, 2018

@timolson If we make random memory accesses (latency-bound), the CPU cores will be basically idle and you can make an ASIC with 1% of power consumption of a CPU.


@tevador tevador commented Dec 24, 2018

@SChernykh

Per thread? Or one buffer for all threads? How are you going to synchronize them?

One buffer per thread. So 8 GiB of memory for 8 parallel threads.


@timolson timolson commented Dec 24, 2018

It depends on the frequency of reads also, right? If you're reading often, then yes. But what about semi-infrequent random DRAM reads? You can tune the number of computations-per-DRAM read to match what a CPU core can do. In this way, it is not DRAM-bound by either latency or bandwidth. The idea is similar to ProgPoW in this regard, where they tuned the number of random math ops per memory access to match GPU capabilities.


@tevador tevador commented Dec 24, 2018

Fair enough.

Currently, it takes ~90 W of compute power (14 nm Ryzen CPU with 16 threads) to achieve ~16 GiB/s of DRAM read speed, which is about half of what dual channel DDR4 can do.

If you use GDDR5/HBM, you can easily read 20x faster, but how are you going to match the compute speed? Even if you improve efficiency by a factor of 2 over a CPU, that's 900 W of compute power.

RandomX uses primitive operations (add, sub, mul, div, floating point), so I don't think you can make an ASIC much more power efficient than that. At most you can cut out some of the CPU parts like the TLB, L3 cache, memory controller and IO.


@cjdelisle cjdelisle commented Dec 24, 2018

I didn't look closely enough, but wanted to point out potential "serialization" attacks. If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM.

This could be a significant risk if it is feasible to run hundreds or thousands of threads with DRAM because one need not do any dependency analysis, just schedule each thread for one instruction only.

Creating large datasets is a good solution, but it is not really possible to require the dataset to be used by one thread only, because for a large dataset to be worth creating and storing in the first place, it needs to be reusable. Serial re-usability would require the verifier to perform the whole series of operations, which is probably a non-starter, so you end up having to allow at least a few hundred parallel executions to use the same buffer...


@tevador tevador commented Dec 24, 2018

BTW, random reads already happen in RandomX. There is (on average) one random read per 2^13 (8192) sequential reads.


@cjdelisle cjdelisle commented Dec 26, 2018

I made this little drawing of what I think a high latency high parallelism processor could look like: https://pixelfed.social/p/cjd/24845
Perhaps I'm wrong and smarter people will point out the error, but my intuition is that there's no way to win against this type of design. Even if your mining algorithm was "compile linux" (my informal benchmark for general purpose compute performance), I see no reason that one couldn't run 16k or more GCC processes with this type of architecture, given enough DRAM (especially with hardware scatter/gather support)

AFAICT there is no way to de-parallelize the mining beyond requiring memory for the verification process. If you require a different 50MB dataset per-nonce then the verifier needs 50MB and the solver using this architecture can run (AVAILABLE_DRAM / 50MB) parallel threads.

The method of requiring a precomputed dataset which is reusable for more than one nonce falls down because either the solver parallelizes all allowable permutations for one precomputed dataset or (if the number of allowed permutations is too low) he doesn't bother to precompute it at all and simply tests the same way as the verifier.


@tevador tevador commented Dec 26, 2018

I improved the ASIC design estimate based on comments from @timolson.

Let's start with memory. We need 4 GiB of GDDR5 for maximum bandwidth. At least 4 memory chips are required since the capacity is 8 Gb per chip. Each chip has a 32-bit interface, so our maximum memory bandwidth will be 4 * 32 * 8 Gb/s = 128 GiB/s, assuming 2000 MHz memory.

The memory can support up to 128 GiB / 4 MiB = 32 768 programs per second. Now let's try to make a chip that has enough compute capability to actually push out 32 thousand programs per second.

I started with the AMD Zen die, which has an area of 7 mm². If we remove all cache, we have around 4 mm². Let's say we can optimize the design down to 2 mm² per core.

We know that the Ryzen core can do ~500 programs per second at 3350 MHz. Since our ASIC will run only at 1 GHz, our optimized core can do only ~150 programs per second.

We need ~218 such cores to saturate the memory bus. This amounts to about 436 mm².

Additionally, we will need ~40 mm² of SRAM and a GDDR5 memory controller. The DDR4 memory controller in Ryzen is ~15 mm², so let's say we can make a minimal controller with just 5 mm².

In total, we have a huge ~480 mm² die, which is about the same size as a Vega 64 GPU.

Price estimate:
~$150 for the die
~$75 for memory (based on prices from digikey.com)
~$30 PCB + power delivery
~$25 cooling

Total per mining board: ~$280. This doesn't include any R&D or IP licensing costs.

We can safely assume a power consumption of around 300 W per board, same as a Vega 64 at full load.

Hashes per Joule:

  • Ryzen 1700: ~40
  • ASIC board: ~100

So about 2.5 times more efficient. And this is the best case scenario.
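
For anyone who wants to reproduce the arithmetic, the whole estimate reduces to a few lines (a sketch using only the assumptions stated above: ~4 MiB of dataset reads per program, ~150 programs/s and 2 mm² per core at 1 GHz, 300 W per board):

```cpp
#include <cstdio>

int main() {
    // Assumptions restated from the estimate above (nothing new here).
    const double bandwidthGiB      = 128.0;  // 4 GDDR5 chips * 32-bit * 8 Gb/s
    const double datasetMiBPerProg = 4.0;    // dataset bytes read per program
    const double progPerCore       = 150.0;  // stripped-down Zen-like core at 1 GHz
    const double mm2PerCore        = 2.0;
    const double sramMm2 = 40.0, memCtrlMm2 = 5.0;
    const double boardWatts = 300.0;

    const double progPerSec = bandwidthGiB * 1024.0 / datasetMiBPerProg;   // 32768
    const double cores      = progPerSec / progPerCore;                    // ~218
    const double dieMm2     = cores * mm2PerCore + sramMm2 + memCtrlMm2;   // ~482

    std::printf("%.0f programs/s, %.0f cores, ~%.0f mm^2 die, ~%.0f programs/J\n",
                progPerSec, cores, dieMm2, progPerSec / boardWatts);
    // ~109 programs/J for the board vs ~40 for a Ryzen 1700 (treating one program
    // as one hash for this comparison) gives roughly the 2.5x figure above.
}
```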


@hyc hyc commented Dec 27, 2018

Zen+ should be about 10% more efficient than Zen. Has anyone tested a Ryzen 2700 yet?

Your math assumes 32768 cores saturating GDDR5 all performing sequential accesses. The throughput will be much less with random accesses. I think your area estimates for the ASIC are overly optimistic, as there'll be quite a complicated interconnect network to attach all those cores to the memory etc.


@tevador tevador commented Dec 27, 2018

Your math assumes 32768 cores saturating GDDR5 all performing sequential accesses

Actually, there are only 218 cores. The design above assumes scratchpads are stored in SRAM, so a maximum of 256 programs can be run in parallel. The GDDR5 memory is just for the dataset, which is read mostly sequentially.

If you wanted to run thousands of programs in parallel, you'd have to store scratchpads in GDDR5 and use the design by @cjdelisle to hide random access latencies. However, in this case you would need 12 GDDR5 chips per board to get enough capacity (12 GiB) and bandwidth (384 GiB/s). The cost of the memory chips alone would be over $200 per board. Power consumption would probably also increase because GDDR5 is power hungry.

I think your area estimates for the ASIC are overly optimistic,

Yes, it's maybe too optimistic. The 2.5x efficiency figure is an upper estimate for an ASIC.

I still think a bandwidth-limited design is the way to go. If the design was purely latency-bound, an ASIC could use much cheaper DDR3 memory. This can be seen in Antminer E3.


@cjdelisle cjdelisle commented Dec 27, 2018

This seems like a reasonable design for the tech, consider that you can eliminate caches, registers and even split the components of the ALU into circuits (direct add insns to the adders, mul insns to the multipliers, etc). You need SRAM mostly for router-like buffers because the chip is basically a network.

Generally speaking, I think your approach of focusing on power consumption is a good heuristic to fit the problem to the hardware you have (though it might be worth also watching int ops and float ops to make sure there are no shortcuts). I'm hoping to fit the problem to the hardware I want to have so my design will be slightly different, focusing more on branching / prediction and use of lots of instructions with somewhat less power consumption.

That said, my whole design falls down if it turns out that the high bandwidth wiring is prohibitively expensive.


@tevador tevador commented Dec 28, 2018

I'm experimenting with doubling the memory bandwidth requirements by increasing dataset read size from 8 to 16 bytes.

The performance drop for CPUs depends on the available memory bandwidth. On Ryzen 1700, it seemed to hit a bandwidth bottleneck with dual channel DDR4-2400, so I upgraded to DDR4-2933.

Here are the performance numbers:

dataset read | threads | memory    | programs/s | read rate
-------------|---------|-----------|------------|----------
8 bytes      | 8       | DDR4-2400 | 3200       | 13 GB/s
8 bytes      | 16      | DDR4-2400 | 4450       | 18 GB/s
16 bytes     | 16      | DDR4-2400 | 3400       | 28 GB/s
16 bytes     | 8       | DDR4-2933 | 3200       | 27 GB/s
16 bytes     | 16      | DDR4-2933 | 3900       | 33 GB/s

With 16 threads, it's still slightly bandwidth-limited even with 2933 MHz memory. It seems that 3200 or 3466 MHz memory might be needed.
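
The read-rate column lines up with each program pulling ~4 MiB from the dataset with 8-byte reads and ~8 MiB with 16-byte reads (these per-program figures are inferred from the numbers above, not measured; the remaining gap is presumably other memory traffic):

```cpp
#include <cstdio>

int main() {
    // read rate ~= programs/s * dataset traffic per program
    // (~4 MiB with 8-byte reads, ~8 MiB with 16-byte reads -- inferred, not measured)
    const struct { const char* cfg; double programsPerSec, mibPerProgram; } rows[] = {
        {"8-byte reads,   8 threads, DDR4-2400", 3200, 4.0}, // ~12.5 GiB/s (measured 13 GB/s)
        {"8-byte reads,  16 threads, DDR4-2400", 4450, 4.0}, // ~17.4 GiB/s (measured 18 GB/s)
        {"16-byte reads, 16 threads, DDR4-2400", 3400, 8.0}, // ~26.6 GiB/s (measured 28 GB/s)
        {"16-byte reads,  8 threads, DDR4-2933", 3200, 8.0}, // ~25.0 GiB/s (measured 27 GB/s)
        {"16-byte reads, 16 threads, DDR4-2933", 3900, 8.0}, // ~30.5 GiB/s (measured 33 GB/s)
    };
    for (const auto& r : rows)
        std::printf("%s: ~%.1f GiB/s\n", r.cfg, r.programsPerSec * r.mibPerProgram / 1024.0);
}
```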

For the ASIC design, this would mean either halving the performance to ~16K programs per second per board (with corresponding halving of die area) or a forced upgrade to 256-bit GDDR5 interface with 8 memory chips, which would double the memory cost to ~$150 per board and put more strain on inter-core bandwidth.

One drawback of this change is that the execution units of CPUs would be slightly underutilized, which would make it easier for an ASIC to match the compute requirements. This could be solved by adding more compute per VM instruction.

What do you think about this change?


@SChernykh SChernykh commented Dec 30, 2018

3200-3466 MHz memory is becoming more common now. We should aim for maxing out available dual-channel DDR4 bandwidth and compensate with more computing to load the CPU if needed.


@tevador tevador commented Jun 5, 2019

Are you sure there’s actually that much entropy in the instruction set permutations?

Have you actually read the specification? Programs are generated from a 512-bit seed and the total entropy of all instruction words in the program is considerably more than 512 bits. The size of the RandomX program buffer is 2048 bytes.

Why would hardwired circuits not be orders-of-magnitude more efficient?

Because your hardwired circuit would only execute a tiny tiny fraction of programs. The rest would have to go to a more general-purpose chip with efficiency close to a CPU.


@shelby3 shelby3 commented Jun 5, 2019

@hyc

The only person assassinating your character is you, spouting all your nonsense instead of doing or saying anything intelligent.

You wouldn’t know intelligence if it bit you in the ass.

@tevador

Are you sure there’s actually that much entropy in the instruction set permutations?

Have you actually read the specification? Programs are generated from a 512-bit seed and the total entropy of all instruction words in the program is considerably more than 512 bits. The size of the RandomX program buffer is 2048 bytes.

I did not write ‘seed’. I wrote in the “instruction set permutations”. There’s a distinction.

I understand you’re generating that many permutations. But permutations don’t necessarily equal entropy. I am confident you understand that.

Why would hardwired circuits not be orders-of-magnitude more efficient?

Because your hardwired circuit would only execute a tiny tiny fraction of programs. The rest would have to go to a more general-purpose chip with efficiency close to a CPU.

No. Every ASIC variant would receive its portion so they would all always be busy, presuming the actual entropy is much lower than you assume it is.


@hyc hyc commented Jun 5, 2019

@shelby3

blah blah blah

Still handwaving and blowing hot air. Not a single real-world consideration - chip area, number of bits/second of communication overhead, circuit complexity, etc. etc. etc. Just garbage.

"People who are dead don't know they're dead; only the people around them suffer. Same for people who are stupid."


@shelby3 shelby3 commented Jun 5, 2019

Still handwaving and blowing hot air. Not a single real-world consideration - chip area, number of bits/second of communication overhead, circuit complexity, etc. etc. etc. Just garbage.

Lol you’re not mathematical enough to know the distinction between entropy and permutations.


@tevador tevador commented Jun 5, 2019

I did not write ‘seed’. I wrote in the “instruction set permutations”. There’s a distinction.

The entropy of the program encoding is much higher than the entropy of the seed. That's why I'm talking about the seed: not all encodable programs can be generated by the generator. So the total number of different RandomX programs is exactly 2^512, although the instruction set allows for a much higher number.

But permutations don’t necessarily equal entropy. I am confident you understand that.

In this case it is. If the entropy was lower, then you could change some seed bits without changing the semantics of the program. Which is not the case (see the specification).

presuming the actual entropy is much lower than you assume it is

The mathematical description of RandomX is very detailed, so it shouldn't be so hard to prove your point without assumptions.


@hyc hyc commented Jun 5, 2019

...crickets...


@shelby3 shelby3 commented Jun 5, 2019

If the entropy was lower, then you could change some seed bits without changing the semantics of the program.

Disagree. Semantics doesn’t correlate to entropy.

Think out-of-the-box more.

Also the entropy is the entropy of that when the instruction set is implemented on the CPU.

EDIT: surely the auditors are going to find this flaw. I will be shocked if they miss it. I see many cryptographers listed on the set of auditors.


@tevador tevador commented Jun 5, 2019

Disagree.

The nice thing about mathematics is that you don't have to agree with something. Either you can rigorously prove it - then it becomes the irrevocable truth - or you cannot, in which case your opinion is irrelevant.


@shelby3 shelby3 commented Jun 5, 2019

The nice thing about mathematics is that you don't have to agree with something. Either you can rigorously prove it - then it becomes the irrevocable truth - or you cannot, in which case your opinion is irrelevant.

Right. So let’s see how smart your auditors are. Then I will come back if necessary.

I hope you understand/envision that the proof is a merging of knowledge of math and hardware. The entropy involves (is relative to) what the CPU is capable of randomizing w.r.t. what the ASIC capable of hardwiring with on-chip multiplexing. It’s more complex than a simple mathematical conceptualization.

I suspect you’re going to figure it out on your own now, perhaps after you sleep on it.

P.S. I still think it is hilarious that people think they can defeat ASICs for computation-based proof-of-work. It took me a long time to convince @tromp that Cuckoo would not be ASIC resistant. But I am willing to give you props for doing extensive work and getting it audited.

@shelby3 shelby3 mentioned this issue Jun 5, 2019

@tevador tevador commented Jun 5, 2019

The entropy involves (is relative to)

No, entropy is related to information content (I suggest you read the relevant pages on Wikipedia at least). Since you need 512 bits to encode a RandomX program, the entropy is 512 bits and the number of different RandomX programs is 2^512. I'm assuming you agree with this since you were unable to provide proof to the contrary.

I think you are (wrongly) assuming that the 2^512 different programs can be split into a relatively few groups of 'similar' programs.

Go ahead and clone this repository:

git clone https://github.com/tevador/RandomX.git

Then build the project:

cd RandomX && make

Then use the code-generator executable and try to generate a few programs, for example:

bin/code-generator --genNative --nonce 1

Nonce can go from 1 to 2147483647.

Come back when you find 2 nonces that give 'similar' programs. Then we can continue this conversation, because I will be able to see what you consider to be similar programs.

If you find it hard to find 2 similar programs, then we can both finally agree that the number of distinct programs is too large for any specific on-chip optimizations.

PS: Don't try to wriggle out of this. I want to see some concrete examples. Otherwise don't expect any further responses in this discussion.


@SChernykh SChernykh commented Jun 5, 2019

Speaking of entropy, it might be less than 512 bits, because the initial block template is 76 bytes (608 bits) but it's not totally random. So the initial Blake2 hash is enough to capture all the entropy of the block template.
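
In other words, whatever entropy the 76-byte template carries is already condensed into the 512-bit digest that seeds the generator. A minimal sketch of that step, using libsodium's BLAKE2b-based crypto_generichash as a stand-in for the exact hashing RandomX performs:

```cpp
#include <sodium.h>
#include <cstdio>

int main() {
    if (sodium_init() < 0) return 1;

    unsigned char blockTemplate[76] = { /* 76-byte block template (example input) */ };
    unsigned char seed[64]; // 512-bit program seed

    // BLAKE2b digest of the template; the program generator never sees more entropy
    // than this 512-bit value, regardless of how the template itself is built.
    crypto_generichash(seed, sizeof seed, blockTemplate, sizeof blockTemplate, nullptr, 0);

    for (unsigned char b : seed) std::printf("%02x", b);
    std::printf("\n");
}
```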


@tevador tevador commented Jun 5, 2019

@SChernykh

RandomX is a general PoW algorithm. The use of a Monero block template is a specific use case.

However, the hash of the previous block and the transaction root hash are together exactly 512 bits and their sources have a higher entropy than that. But it's a good observation and proves that there is no need for more than 512 bits for program entropy.


@SChernykh SChernykh commented Jun 5, 2019

I've read that "ah-ha" post again, and it seems it's shelby3 who's suffering from Dunning-Kruger effect, not me.

Ah now I remember one of my other concerns would be that the ASIC could do program analysis and then send the programs to specialized ASICs for each program variant. Maybe not a variant for each permutation, but variants which are more optimized than others for sets of permutations.

  1. All programs are the same (from an optimization perspective). 256 instructions are enough to have all RandomX features in every program almost all the time. The one notable exception is rounding modes (36.7% probability that the rounding mode will be fixed in a random program), but removing them won't make an ASIC faster. I won't explain why; I assume you know how IEEE rounding works and how it's implemented in hardware.

  2. What would your optimized circuits do when there are no optimized programs for them to run (see point 1)? Right, they'll stay idle and waste chip space and power. The overall performance (throughput) will be worse compared to an ASIC of the same chip size that can execute all the programs.


@shelby3 shelby3 commented Jun 5, 2019

Firstly I will just interject a tangential point. If you had to put a $10 million performance bond on your claim of an at-best 2.5X ASIC resistance ratio, would you still stick so adamantly to your claims of expertise on this subject? Funny how people become more circumspect when they need to back up their words with their money.

@tevador

The entropy involves (is relative to)

No, entropy is related to information content (I suggest you to read the relevant pages on wikipedia at least).

Sorry but you’re not smart enough to understand information theory holistically if you believe what you wrote is comprehensive. Please do not talk down to me and pretend to yourself that I don’t understand Shannon’s work. Heck I even helped write the Wikipedia article you are referring to as well as the one on Nyquist–Shannon sampling theorem.

If you find it hard to find 2 similar programs, then we can both finally agree that the number of distinct programs is too large for any specific on-chip optimizations.

It’s as if you think you’ve written anything relevant at all to the issue at hand.

I told you already that I never disputed that you can achieve the permutations that you claim to achieve. I told you that you need to think more out-of-the-box. Come on.

Otherwise don't expect any further responses in this discussion.

It doesn’t matter whether you reply or not. I can post the proof of the flaw at any time if your auditors don’t find it. I can post it at the time that is most opportune for me. As I said, I really hope RandomX is widely adopted. 😁


@SChernykh SChernykh commented Jun 5, 2019

I can post the proof of the flaw at any time if your auditors don’t find it. I can post it at the time that is most opportune for me. As I said, I really hope RandomX is widely adopted. 😁

Oh, how convenient. "A hev da proof, but I wont show ya fools".


@shelby3 shelby3 commented Jun 5, 2019

Oh, how convenient. "A hev da proof, but I wont show ya fools".

Don’t you think you deserved that? Just look at the way you behaved in this thread. I always get the final word. Because I do my homework.


@shelby3 shelby3 commented Jun 5, 2019

Speaking of entropy, it might be less than 512 bits because the initial block template is 76 bytes (608 bits), but it's not totally random. So initial Blake2 hash is enough to use all entropy of the block template.

That’s not where the entropy that I am referring to lies. Analogous to side-channel or timing analysis attacks, information leaks into other dimensions of the problem. You are thinking too linearly.

Your complexity budget was too high and it came back to bite you where you were not looking.


@SChernykh SChernykh commented Jun 5, 2019

Your "proof" is worth nothing I can assure you. If you know names like"Nyquist" and "Shannon" and even (!!!) helped edit the wiki article (omg what an achievement 😜) then you should know how to post a rigorous mathematical proof. I'm looking forward to it. But I know I won't see it because you don't have it.


@shelby3 shelby3 commented Jun 5, 2019

Your "proof" is worth nothing I can assure you.

Post the performance bond providing the assurance.


@SChernykh SChernykh commented Jun 5, 2019

Post the proof and then we can talk.


@shelby3 shelby3 commented Jun 5, 2019

All programs are the same (from optimization perspective). 256 instructions are enough to have all RandomX features in every program almost all the time […]

That’s a tautology or circular argument, because it’s only true if the entropy is as high as you presume it is.

What your optimized circuits would do when there are no optimized programs for them to run (see point 1)? Right, they'll stay idle and waste chip space and power. The overall performance (throughput) will be worse compared to an ASIC of the same chip size that can execute all the programs.

The segmentation of the specialization will be designed mathematically (i.e. statistically) so they’re always busy.


@SChernykh SChernykh commented Jun 5, 2019

The segmentation of the specialization will be designed mathematically (i.e. statistically) so they’re always busy.

Can you even math? "Average of X is Y" doesn't mean that X is always equal to Y. Specialized circuits will NOT always be busy unless they're heavily outnumbered by the programs to run. They'll either be a bottleneck in this case or they'll be idle some % of time. The ratio of different types of programs is not constant over time.


@shelby3 shelby3 commented Jun 5, 2019

Specialized circuits will NOT always be busy unless they're heavily outnumbered by the programs to run.

Did you already forget where upthread I corrected @hyc’s incorrect analogy to pixels and framebuffers.

There can be as many hashes in flight as we want, once we segment the programs across numerous ASIC dies. I already pointed out that there’s no significant bandwidth limitation to prevent that cross-chip handoff between program chains.

Can you even math?

Do you enjoy talking to other people this way?


@SChernykh SChernykh commented Jun 5, 2019

There can be as many hashes in flight as we want, once we segment the programs across numerous ASIC dies.

No, there won't be. You're limited by the memory needed by scratchpads. All the assumptions that work with infinite in-flight programs stop working when you only have < 512 programs in flight. And you start encountering cases when some of execution units are idle and others are clogged.


@SChernykh SChernykh commented Jun 5, 2019

Having said all that, it's still possible to have specialized units busy 80% (maybe more) of the time with proper distribution across the chip. Another question is what exactly they will specialize in? What subsets of programs can get a significant boost with specialized hardware?


@shelby3 shelby3 commented Jun 5, 2019

No, there won't be. You're limited by the memory needed by scratchpads.

This does not have to be moved for every program. Only when the program goes off chip. I posit that statistically it can be made a small fraction, but this will depend on the precise calculation of the entropy issue I allege. If I am correct about the lower entropy, there’s potentially orders-of-magnitude power saving to offset that posited fractional cost.

EDIT: additionally even if the entropy + scratchbuffer can be made prohibitive, if I am correct about being at least near the margins, you’ll end up with the same flaw ProgPow has which is that you need to change the parameters as ASIC die density increases.

Another question is what exactly they will specialize in? What subsets of programs can get a significant boost with specialized hardware?

Answer that and also you answer why I think the entropy isn’t as high as you presume it is.

You’re presumably looking at the entropy from the standpoint of permutations of bits, which is relevant to the security of the hash. But I posit that does not speak entirely to the performance issue. Instead think about it from a hardware level. What does that permutation of bits actually mean at the hardware level in terms of the impact that has on the supposition that all bit permutations are statistically equivalent, especially when you factor what can the ASIC do about those permutations. And must the ASIC specialize on the program’s entire string of instructions or can it chop it up and multiplex? Play with the probability math for that.


@hyc hyc commented Jun 5, 2019

@shelby3
For the record

Did you already forget where upthread I corrected @hyc’s incorrect analogy to pixels and framebuffers.

That was not my analogy.

I always get the final word. Because I do my homework.

LOL. Multiple instances disproving that, just within the past 24 hours.


@shelby3 shelby3 commented Jun 5, 2019

That was not my analogy.

Okay but you employed the analogy to argue that massive parallelisation isn’t possible for RandomX. Whether that ends up being true or not, I made the point that some algorithms are not serially bounded or at least not as tightly bounded as someone might claim.

And I presume we all agree now that the CPU is only optimal for algorithms that can be sufficiently bounded in terms of parallelism.


@shelby3 shelby3 commented Jun 6, 2019

Another question is what exactly they will specialize in? What subsets of programs can get a significant boost with specialized hardware?

Answer that and also you answer why I think the entropy isn’t as high as you presume it is.

You’re presumably looking at the entropy from the standpoint of permutations of bits, which is relevant to the security of the hash. But I posit that does not speak entirely to the performance issue. Instead think about it from a hardware level. What does that permutation of bits actually mean at the hardware level in terms of the impact that has on the supposition that all bit permutations are statistically equivalent, especially when you factor what can the ASIC do about those permutations. And must the ASIC specialize on the program’s entire string of instructions or can it chop it up and multiplex? Play with the probability math for that.

Just in case that was not entirely clear, readers should note that RandomX randomizes the bits for the instructions of an idealized virtual machine (VM). This randomized idealized instruction set must be JIT-compiled to the actual CPU hardware. By 'idealized', I mean the supposition that every bit has a statistical egalitarianism with every other bit. But that's a fallacy. In fact, the hardware has to prioritize resources, and thus not every bit is statistically equivalent from a resources standpoint.

The generative essence of the myopia here is that humans refuse to accept that resources can’t be equably distributed because otherwise our existence wouldn’t proceed inexorably to maximum entropy (c.f. near the end of the linked post). Hopefully this will help you understand why computationally bounded proof-of-work will never be ASIC resistant.

Having said that from a high-level perspective, the actual math and hardware data needs to be crunched. I will leave that to the auditors. If I feel they have been negligent, then I will apply some more work to it and come back.
