Possible ASIC design #11

Closed
tevador opened this issue Dec 24, 2018 · 211 comments

Comments

@tevador
Owner

tevador commented Dec 24, 2018

EDIT: This design is outdated and no longer applies to the current RandomX version.

A similar idea was originally proposed by @cjdelisle for a GPU miner, but I think it's more applicable to an ASIC design.

RandomX ASIC miner:

  • ~66 MiB of SRAM on-chip
  • 4 GiB of HBM memory
  • 1 dedicated core for dataset expansion
  • 256 small decoder/scheduler cores
  • 28 worker cores

During dataset expansion, the SRAM is used to store the 64 MiB cache, while dataset blocks are loaded into HBM memory.

When mining, the ASIC runs 256 programs in parallel.
Memory allocation:

  • 256 * 256 KiB scratchpad = 64 MiB of SRAM for scratchpads
  • 256 * 8 KiB = 2 MiB of SRAM for program buffers
  • 32 KiB for register files
  • ~256 KiB for program stack (1 KiB per program) = maximum recursion depth of 64

Instructions are loaded into the decoder/scheduler core (each core has its own program buffer, program counter and register file). The scheduler cores handle only CALL and RET instructions and pass the rest to one of the 28 worker cores.

Each of the 28 worker cores implements exactly one RandomX instruction, pipelined for a throughput that matches the instruction weight (for example, the MUL_64 worker can handle 21 times more instructions per clock than the DIV_64 worker). Each worker has an instruction queue fed by the scheduler cores.

The speed of the decoder/scheduler cores would be designed to keep every worker core 100% utilized.

Some complications of the design:

  • CALL/RET will stall the program if there is a dependency on an instruction which is still in a worker queue
  • FPROUND will stall all subsequent floating point instructions
  • instructions which use the same address register will create a dependency chain

There could also be some dedicated load/store ports for loading instruction operands.

If limited only by HBM bandwidth, this ASIC could do around 120 000 programs per second (480 GiB/s memory read rate), or roughly the same as 30 Ryzen 1700 CPUs. This assumes that sufficient compute power can fit on a single chip. If we estimate power consumption at 300 W (= Vega 64), this ASIC would be around 9 times more power efficient than a CPU.
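
For reference, a back-of-envelope check of the figures above (a sketch only; it assumes each program reads ~4 MiB from the dataset, the figure implied by the 480 GiB/s to 120,000 programs/s ratio):

```python
# Back-of-envelope check of the figures above (not a simulation).
# Assumes each program reads ~4 MiB from the dataset, a figure implied
# by the 480 GiB/s : 120,000 programs/s ratio in the comment.

GiB = 1024**3
MiB = 1024**2

# On-chip SRAM budget for 256 parallel programs
scratchpads    = 256 * 256 * 1024     # 64 MiB
program_bufs   = 256 * 8 * 1024       # 2 MiB
register_files = 32 * 1024
stacks         = 256 * 1024
sram_total = scratchpads + program_bufs + register_files + stacks
print(f"SRAM total: {sram_total / MiB:.1f} MiB")   # ~66.3 MiB

# Bandwidth-limited throughput
hbm_bandwidth = 480 * GiB             # bytes/s
dataset_read_per_program = 4 * MiB    # assumption (see above)
print(f"programs/s: {hbm_bandwidth / dataset_read_per_program:.0f}")  # ~122880, i.e. "around 120 000"
```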

Improved estimate: #11 (comment)

Disclaimer: I'm not an ASIC designer.

@cjdelisle
Contributor

The 66 MiB of SRAM is probably a killer, but I suspect it can be avoided by doing async instructions and thousands or hundreds of thousands of threads.

BTW credit for this idea should go to: https://github.com/aggregate/MOG/

@tevador
Owner Author

tevador commented Dec 24, 2018

@cjdelisle 100 000 parallel threads would require ~24 GiB of memory just for scratchpads.
Also, if you store the scratchpad in HBM, your memory bandwidth usage will increase roughly 3 times.

You could also compromise by storing only the first 16 KiB of each scratchpad in SRAM, which would decrease the required amount of on-chip memory to ~6 MiB. That is probably more realistic.
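
A quick sketch of the memory math behind both points, assuming the 256 KiB scratchpad, the 16 KiB on-chip compromise and the 2 MiB of program buffers from the design above:

```python
# Rough memory math behind the two points above (assumptions: 256 KiB
# scratchpad per program, 16 KiB "hot" portion, 2 MiB of program buffers).

KiB, MiB, GiB = 1024, 1024**2, 1024**3

threads = 100_000
print(f"{threads * 256 * KiB / GiB:.1f} GiB of scratchpads")   # ~24.4 GiB

# Compromise: keep only the first 16 KiB of each scratchpad on-chip
programs = 256
on_chip = programs * 16 * KiB + 2 * MiB   # scratchpad heads + program buffers
print(f"{on_chip / MiB:.0f} MiB of SRAM")                      # ~6 MiB
```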

@cjdelisle
Contributor

The idea is to use threads to replace the latency bottleneck with a memory bandwidth bottleneck. This is basically the ring-processor idea, but I think the von Neumann bottleneck eventually degrades to a sorting problem: you need to sort the opcodes and operands to be next to each other, then gather the completed operand/opcode bundles into a queue which is then fed to the processor, creating more opcodes which need more operands sorted...

@SChernykh
Collaborator

66 MiB of SRAM is nothing unusual for an ASIC; the speedup from SRAM will outweigh the bigger chip area by a large margin. Scrypt ASICs had 144 MiB of SRAM per core, IIRC.

@SChernykh
Collaborator

Anyway, HBM memory + memory controller + 256 CPU-like cores with 66 MiB SRAM on chip sounds very similar to a GPU. It'll still be limited by computing power, not memory bandwidth. Power efficiency (H/s/W) will be maybe 2-3 times better than a CPU.

@tevador
Owner Author

tevador commented Dec 24, 2018

@SChernykh A GPU would be compute-bound because it cannot run RandomX efficiently. Most 64-bit instructions have to be emulated on GPUs and double precision runs 16 times slower than single precision. An ASIC, on the other hand, can execute 1 instruction per cycle in the ideal case.

@cjdelisle RandomX was not designed to be latency-bound. That's why the dataset is accessed sequentially. Only scratchpad access is latency-bound, but it can fit into SRAM.

Also the instructions executed in RandomX are not independent. There will be random dependency chains, typically due to instructions using the same register (very rarely using the same scratchpad word). This would slightly complicate the design.

I have contacted @timolson to help us assess the viability and possible performance
of a RandomX ASIC.

@timolson

At first glance it looks much improved over RandomJS, being a much closer model of the underlying hardware.

However, if I understand correctly, you can run many programs in parallel on chip, assuming enough scratchpad space. This is similar to CryptoNight and favors ASIC development. The limiting factor will probably be this scratchpad area, not the logic, and as you pointed out, 66MiB for a bunch of cores is no problem, being only about 41 mm2 in a 16nm process. If we conservatively assume logic doubles the area then an 80 mm2 chip might cost around $10-15 packaged. You're gonna crush CPUs with this.
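
For reference, a rough sketch of that area estimate; the ~0.62 mm²/MiB SRAM density is back-solved from the figures quoted above, not taken from any process datasheet:

```python
# Back-solving the area figure above. The ~0.62 mm²/MiB SRAM density is
# inferred from "66 MiB ≈ 41 mm² in 16 nm"; it is not a process datasheet number.

sram_mib = 66
mm2_per_mib = 41 / 66          # ≈ 0.62 mm²/MiB (assumption, back-solved)
sram_area = sram_mib * mm2_per_mib
die_area = 2 * sram_area       # "logic doubles the area"
print(f"SRAM ≈ {sram_area:.0f} mm², die ≈ {die_area:.0f} mm²")  # ~41 / ~82 mm²
```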

One way to prevent this parallelism is to make the large DRAM table read/write. Then you need a memory controller and DRAM set for every core, which is closer to the setup of consumer hardware. Being able to isolate the running program to just the chip die makes for a nice, efficient ASIC. Where ASICs fall down is when they have to wait for external IO. Once you hit the logic board, it's like a Ferrari in rush hour traffic.

Another option is to somehow force nonces to be tried in serial instead of parallel. Then an ASIC can't beat CPUs by merely adding cores. An ASIC design could still be cost-efficient by eliminating the excess cores on a CPU and all the gates for cache control. Or maybe there's a way to limit parallelism to exactly 8 or some chosen number of cores. This would make CPUs closer to the optimal configuration for the problem.

I didn't look closely enough, but wanted to point out potential "serialization" attacks. If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM. Also, parallelism may be discovered and exploited in the programs if there's not enough register overlap. You might consider using fewer registers that are updated more frequently to address this. Again, I'm not sure if it's an actual problem with your design, because I didn't look closely enough, but it should be mentioned.

A couple other miscellaneous comments:

  • I wouldn't waste die area or ASIC development time on the large table expansion. It can be done offline with a CPU... A miner will have an SoC controller anyway which would handle this.
  • HBM is a specialized process that's rather expensive. GDDR5 is probably a more efficient choice. GPU manufacturers live by "better performance" not "hashes-per-dollar," so they have moved to HBM in search of peak performance.
  • GPUs will be limited by their SRAM, which is 2MB per Nvidia SM, with only 64kB accessible to any one thread block. AMD's are less limiting but still problematic. Both CPUs and ASICs can outperform GPU's in this area. The scratchpad sizes you're using effectively prevent GPUs from being efficient.
  • Double precision float math on GPUs is 2-32x slower than single precision, but CPUs handle 64-bit math very well. Most GPUs also suck at all integer multiplication (addition is fine), although the GPGPU movement is changing this somewhat. The new 2000 series is somewhat better, but they still emphasize float performance (deep learning training is primarily floating point multiplication). So integer multiplication is another way CPUs and ASICs can outperform GPUs. Here you can see instruction throughput numbers for Nvidia. Notice how integer multiplication is so slow they don't even give a number! They just say "multiple instructions." Bit operations are also slow on GPUs. GPUs are really focused on float performance, and many games use half-precision 16-bit floats. It looks "good enough."

I'm currently writing a GPU miner for Grin, and since the genesis block is January 15th, I don't have much time to look deeper until later in January or February, sorry. I can quickly address specific concerns if you want to point something out, or if I overlooked something critical in my very brief review.

@timolson

One more comment: An ASIC can have the "correct ratio" of logic handlers based on the frequency and latency of various instructions used in RandomX, which may be different from CPUs. As a simple example, let's assume only two instructions, int multiply and int add, randomly selected 50/50. If multiply takes 4 cycles and add takes 1, then an ASIC will have 4 mul units for every 1 add unit, whereas a CPU gets one of each. That may not be strictly true, but you should tune your probabilities such that probability_of_instruction × latency_in_cpu is the same for all instructions (equivalently, make each instruction's frequency proportional to its throughput). In the above case, you want adds to be 4x more frequent than multiplies (assuming the CPU has one of each).
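
A minimal sketch of that tuning rule, using only the illustrative latencies from the example above (not real CPU figures): to keep one unit of each type equally busy, pick frequencies proportional to throughput, i.e. inversely proportional to latency.

```python
# Sketch of the tuning rule: to keep one execution unit of each type equally
# busy, pick instruction frequencies proportional to throughput (1/latency).
# Latencies here are the illustrative ones from the comment, not CPU data.

latencies = {"IADD": 1, "IMUL": 4}          # cycles (example values)
throughput = {op: 1 / lat for op, lat in latencies.items()}
total = sum(throughput.values())
freq = {op: t / total for op, t in throughput.items()}
print(freq)   # {'IADD': 0.8, 'IMUL': 0.2}  -> adds 4x more frequent than muls
```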

@timolson

Aaaaand one more thing... Although the 66 MiB chip you proposed would be $10-15, it's only gonna clock around 1GHz. Intel and AMD definitely do have lots of IP and scale to do efficient layouts and squeeze the maximum speeds out of their designs. If you can fix the multi-core problem running lots of programs in parallel, then a startup ASIC maker, even Bitmain, will not get close to the CPU performance of the incumbents. But you probably need to get within a factor of 3 or so.

@tevador
Owner Author

tevador commented Dec 24, 2018

@timolson Thanks for the review.

However, if I understand correctly, you can run many programs in parallel on chip, assuming enough scratchpad space.

Yes and assuming you can read from the dataset quickly enough. The dataset is too big to be stored on-chip, so external memory is unavoidable.

One way to prevent this parallelism is to make the large DRAM table read/write.

That is difficult to do while also allowing hash verification for light clients who might not have enough RAM. But perhaps a 1 GB read/write buffer would be possible. Even phones have at least 2 GB nowadays.

Another option is to somehow force nonces to be tried in serial instead of parallel.

I'm not aware of any way to achieve this. I don't think it's possible without some central authority handing out nonces.

Currently, parallelism is limited only by DRAM bandwidth.

If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM.

I don't think this is possible. Scratchpad read/write addresses are calculated from register values, which depend on previous results. The dataset is already read sequentially and the reads cannot be reordered.

There are only 8 registers for address generation and register values change every instruction, so the sequence of independent instructions will be very short, perhaps 2-3 instructions.

Regarding GPU performance, I already suspected most of what you wrote.

@SChernykh
Collaborator

But perhaps a 1 GB read/write buffer would be possible.

Per thread? Or one buffer for all threads? How are you going to synchronize them?

@timolson

Currently, parallelism is limited only by DRAM bandwidth.

Make sure it's tuned such that typical DDR4 SODIMM speeds align with 8-core parallelism and I'd say you're getting close. However, GDDR5 and HBM crush DDR4 for bandwidth-per-dollar, so if you pin the PoW to memory bandwidth, CPUs will lose out. One way to address that dilemma may be to use random DRAM access instead of sequential. Some of GDDR's improved speed comes from using wider rows and some comes from a wider bus, but DDR4 is competitive for random access patterns. If you're only reading a single word at a time, it doesn't matter that GDDR grabs 32 words while DDR only gets 8, or whatever. They both have random access latencies of 40-45 ns.

@SChernykh
Collaborator

The best way would be to read random 64 bytes at a time. DDR4/CPU cache is optimized for this burst read size.

@timolson

timolson commented Dec 24, 2018

The best way would be to read random 64 bytes at a time. DDR4/CPU cache is optimized for this burst read size.

Longer reads will favor GDDR & HBM because they have wider busses and also can push more bits-per-pin-per-cycle. I would suggest something smaller than 64 bytes, which would need 512 pin-cycles in DDR4 and only 128 pin-cycles in GDDR5. This is wider than consumer SODIMMs. 16 bytes is probably safe.

@hyc
Collaborator

hyc commented Dec 24, 2018

A 64-byte burst is optimal for a typical 64-bit memory channel; DDR4 is designed for bursts of 8 transfers per pin. While a GPU can grab that in a single cycle with a 512-bit bus, it won't be able to burst any of its accesses.
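
A quick sketch of why 64 bytes lines up with DDR4: a standard channel is 64 bits wide and DDR4 bursts 8 transfers (BL8), so one burst moves exactly 64 bytes.

```python
# Why 64 bytes lines up with DDR4: a standard channel is 64 bits wide and
# DDR4 always bursts 8 transfers (BL8), so one burst moves exactly 64 bytes.

bus_width_bits = 64
burst_length = 8                    # DDR4 fixed burst length
bytes_per_burst = bus_width_bits // 8 * burst_length
print(bytes_per_burst)              # 64
```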

@SChernykh
Collaborator

SChernykh commented Dec 24, 2018

I think the ideal case would be 64-byte random accesses that also saturate dual-channel DDR4 bandwidth on 8-core CPU.

@tevador
Owner Author

tevador commented Dec 24, 2018

@timolson If we make random memory accesses (latency-bound), the CPU cores will be basically idle and you can make an ASIC with 1% of the power consumption of a CPU.

@tevador
Owner Author

tevador commented Dec 24, 2018

@SChernykh

Per thread? Or one buffer for all threads? How are you going to synchronize them?

One buffer per thread. So 8 GiB of memory for 8 parallel threads.

@timolson

It depends on the frequency of reads also, right? If you're reading often, then yes. But what about semi-infrequent random DRAM reads? You can tune the number of computations-per-DRAM read to match what a CPU core can do. In this way, it is not DRAM-bound by either latency or bandwidth. The idea is similar to ProgPoW in this regard, where they tuned the number of random math ops per memory access to match GPU capabilities.

@tevador
Owner Author

tevador commented Dec 24, 2018

Fair enough.

Currently, it takes ~90 W of compute power (14 nm Ryzen CPU with 16 threads) to achieve ~16 GiB/s of DRAM read speed, which is about half of what dual channel DDR4 can do.

If you use GDDR5/HBM, you can easily read 20x faster, but how are you going to match the compute speed? Even if you improve efficiency by a factor of 2 over a CPU, that's 900 W of compute power.
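
The scaling argument above, spelled out as a sketch; all inputs are the rough figures from this comment, not measurements of any actual ASIC:

```python
# The scaling argument above, spelled out. All inputs are the rough figures
# from this comment, not measurements of any ASIC.

cpu_compute_power_w = 90      # ~90 W of compute for ~16 GiB/s of dataset reads
bandwidth_ratio = 20          # GDDR5/HBM vs. dual-channel DDR4 (rough)
asic_efficiency_gain = 2      # assumed 2x better compute efficiency than a CPU

asic_compute_power_w = cpu_compute_power_w * bandwidth_ratio / asic_efficiency_gain
print(asic_compute_power_w)   # 900.0 W of compute just to keep up with the memory
```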

RandomX uses primitive operations (add, sub, mul, div, floating point), so I don't think you can make an ASIC much more power efficient than that. At most you can cut out some of the CPU parts like TLB, L3 cache, memory controller and IO.

@cjdelisle
Contributor

I didn't look closely enough, but wanted to point out potential "serialization" attacks. If the program generator can (quickly) do a dependency analysis and sort the instructions such that the scratchpad is r/w in sequential order, then you can replace SRAM with DRAM.

This could be a significant risk if it is feasible to run hundreds or thousands of threads with DRAM because one need not do any dependency analysis, just schedule each thread for one instruction only.

Creating large datasets is a good solution, but it is not really possible to require the dataset to be used by one thread only, because in order for a large dataset to be worth creating and storing in the first place, it needs to be reusable. Serial re-usability is going to require that your verifier performs the whole series of operations, which is probably a non-starter, so you end up having to allow at least a few hundred parallel executions to use the same buffer...

@tevador
Owner Author

tevador commented Dec 24, 2018

BTW, random reads already happen in RandomX. There is (on average) one random read per 2¹³ (8192) sequential reads.

@cjdelisle
Contributor

cjdelisle commented Dec 26, 2018

I made this little drawing of what I think a high latency high parallelism processor could look like: https://pixelfed.social/p/cjd/24845
Perhaps I'm wrong and smarter people will point out the error, but my intuition is that there's no way to win against this type of design. Even if your mining algorithm was "compile linux" (my informal benchmark for general purpose compute performance), I see no reason that one couldn't run 16k or more GCC processes with this type of architecture, given enough DRAM (especially with hardware scatter/gather support)

AFAICT there is no way to de-parallelize the mining beyond requiring memory for the verification process. If you require a different 50MB dataset per-nonce then the verifier needs 50MB and the solver using this architecture can run (AVAILABLE_DRAM / 50MB) parallel threads.

The method of requiring a precomputed dataset which is reusable for more than one nonce falls down because either the solver parallelizes all allowable permutations for one precomputed dataset or (if the number of allowed permutations is too low) he doesn't bother to precompute it at all and simply tests the same way as the verifier.

@tevador
Owner Author

tevador commented Dec 26, 2018

I improved the ASIC design estimate based on comments from @timolson.

Let's start with memory. We need 4 GiB of GDDR5 for maximum bandwidth. At least 4 memory chips are required since the capacity is 8 Gb per chip. Each chip has a 32-bit interface, so our maximum memory bandwidth will be 4 * 32 * 8 Gb/s = 128 GiB/s, assuming 2000 MHz memory.

The memory can support up to 128 GiB / 4 MiB = 32 768 programs per second. Now let's try to make a chip that has enough compute capability to actually push out 32 thousand programs per second.

I started with the AMD Zen core, which has an area of 7 mm2. If we remove all cache, we have around 4 mm2. Let's say we can optimize the design down to 2 mm2 per core.

We know that the Ryzen core can do ~500 programs per second at 3350 MHz. Since our ASIC will run only at 1 GHz, our optimized core can do only ~150 programs per second.

We need ~218 such cores to saturate the memory bus. This amounts to about ~436 mm2.

Additionally, we will need ~40 mm2 of SRAM and a GDDR5 memory controller. The DDR4 memory controller in Ryzen is ~15 mm2, so let's say we can make a minimal controller with just 5 mm2.

In total, we have a huge ~480 mm2 die, which is about the same size as a Vega 64 GPU.

Price estimate:
~$150 for the die
~$75 for memory (based on prices from digikey.com)
~$30 PCB + power delivery
~$25 cooling

Total per mining board: ~$280. This doesn't include any R&D or IP licensing costs.

We can safely assume a power consumption of around 300 W per board, same as a Vega 64 at full load.

Hashes per Joule:

  • Ryzen 1700: ~40
  • ASIC board: ~100

So about 2.5 times more efficient. And this is the best case scenario.
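
For reference, a sketch reproducing the arithmetic of this estimate; every input is taken from the comment itself (GDDR5 pin rate, Zen-derived core area, 300 W board power):

```python
# Reproduction of the arithmetic in this estimate; every number is taken
# from the comment itself (GDDR5 pin rate, Zen-derived core area, etc.).

GiB, MiB = 1024**3, 1024**2

bandwidth = 4 * 32 * 8 / 8 * GiB            # 4 chips x 32 bits x 8 Gb/s -> ~128 GiB/s
programs_per_s = bandwidth / (4 * MiB)      # 4 MiB of dataset reads per program
print(int(programs_per_s))                   # 32768

core_pps = 150                               # optimized core at 1 GHz
cores = programs_per_s / core_pps
die_area = cores * 2 + 40 + 5                # cores + SRAM + memory controller, mm²
print(round(cores), round(die_area))         # ~218 cores, ~482 mm²

board_power = 300                            # W, Vega 64 class
print(round(programs_per_s / board_power))   # ~109 programs/J vs ~40 for Ryzen 1700
```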

@hyc
Collaborator

hyc commented Dec 27, 2018

Zen+ should be about 10% more efficient than Zen. Has anyone tested a Ryzen 2700 yet?

Your math assumes 32768 cores saturating GDDR5 all performing sequential accesses. The throughput will be much less with random accesses. I think your area estimates for the ASIC are overly optimistic, as there'll be quite a complicated interconnect network to attach all those cores to the memory etc.

@tevador
Owner Author

tevador commented Dec 27, 2018

Your math assumes 32768 cores saturating GDDR5 all performing sequential accesses

Actually, there are only 218 cores. The design above assumes scratchpads are stored in SRAM, so a maximum of 256 programs can be run in parallel. The GDDR5 memory is just for the dataset, which is read mostly sequentially.

If you wanted to run thousands of programs in parallel, you'd have to store scratchpads in GDDR5 and use the design by @cjdelisle to hide random access latencies. However, in this case you would need 12 GDDR5 chips per board to get enough capacity (12 GiB) and bandwidth (384 GiB/s). The cost of the memory chips alone would be over $200 per board. Power consumption would probably also increase because GDDR5 is power hungry.

I think your area estimates for the ASIC are overly optimistic,

Yes, maybe it's too optimistic. The 2.5x efficiency figure is an upper estimate for an ASIC.

I still think a bandwidth-limited design is the way to go. If the design was purely latency-bound, an ASIC could use much cheaper DDR3 memory. This can be seen in Antminer E3.

@cjdelisle
Contributor

This seems like a reasonable design for the tech; consider that you can eliminate caches and registers and even split the components of the ALU into separate circuits (direct add insns to the adders, mul insns to the multipliers, etc.). You need SRAM mostly for router-like buffers, because the chip is basically a network.

Generally speaking, I think your approach of focusing on power consumption is a good heuristic to fit the problem to the hardware you have (though it might be worth also watching int ops and float ops to make sure there are no shortcuts). I'm hoping to fit the problem to the hardware I want to have so my design will be slightly different, focusing more on branching / prediction and use of lots of instructions with somewhat less power consumption.

That said, my whole design falls down if it turns out that the high bandwidth wiring is prohibitively expensive.

@tevador
Owner Author

tevador commented Dec 28, 2018

I'm experimenting with doubling the memory bandwidth requirements by increasing dataset read size from 8 to 16 bytes.

The performance drop for CPUs depends on the available memory bandwidth. On Ryzen 1700, it seemed to hit a bandwidth bottleneck with dual channel DDR4-2400, so I upgraded to DDR4-2933.

Here are the performance numbers:

dataset read | threads | memory    | programs/s | read rate
8 bytes      | 8       | DDR4-2400 | 3200       | 13 GB/s
8 bytes      | 16      | DDR4-2400 | 4450       | 18 GB/s
16 bytes     | 16      | DDR4-2400 | 3400       | 28 GB/s
16 bytes     | 8       | DDR4-2933 | 3200       | 27 GB/s
16 bytes     | 16      | DDR4-2933 | 3900       | 33 GB/s

With 16 threads, it's still slightly bandwidth-limited even with 2933 MHz memory. It seems that 3200 or 3466 MHz memory might be needed.
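
As a cross-check, a small sketch of the dataset traffic per program implied by the table above (read rate divided by program rate):

```python
# Cross-check of the table: implied dataset traffic per program
# (read rate divided by program rate), using the measured rows above.

rows = [  # (label, programs/s, read rate in GB/s)
    ("8B reads,  16T, DDR4-2400", 4450, 18),
    ("16B reads, 16T, DDR4-2400", 3400, 28),
    ("16B reads, 16T, DDR4-2933", 3900, 33),
]
for label, pps, gbps in rows:
    print(f"{label}: {gbps * 1e9 / pps / 1e6:.1f} MB per program")
# ~4 MB/program with 8-byte reads, ~8.2-8.5 MB/program with 16-byte reads
```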

For the ASIC design, this would mean either halving the performance to ~16K programs per second per board (with corresponding halving of die area) or a forced upgrade to 256-bit GDDR5 interface with 8 memory chips, which would double the memory cost to ~$150 per board and put more strain on inter-core bandwidth.

One drawback of this change is that the execution units of CPUs would be slightly underutilized, which would make it easier for an ASIC to match the compute requirements. This could be solved by adding more compute per VM instruction.

What do you think about this change?

@SChernykh
Collaborator

3200-3466 MHz memory is becoming more common now. We should aim for maxing out the available DDR4 dual-channel bandwidth and compensate with more computation to load the CPU if needed.

@hyc
Collaborator

hyc commented Jan 26, 2022

Nobody needs to be reminded how bad it would be for a PoW algorithm to be broken. Please constrain your comments to the actual topic and refrain from going off on wild tangents, if you wish to be taken seriously.

@shelby3

shelby3 commented Jan 27, 2022

@SChernykh

It's actually a good point regarding 7 nm availability. Mainstream CPUs will always be one step ahead of ASICs.

[…]

If you think that crypto prices and profits would go so high that ASIC manufacturers get a priority on 7 nm, 5 nm nodes in the future, what do you think AMD/Intel/NVIDIA would do? Yes, their own ASICs. It's actually a good outcome when you can buy an ASIC from AMD/Intel just like a GPU in any store.

Unfortunately Intel is not selling their new ASICs retail.

@timolson

Although the 66 MiB chip you proposed would be $10-15, it's only gonna clock around 1GHz. Intel and AMD definitely do have lots of IP and scale to do efficient layouts and squeeze the maximum speeds out of their designs.

Intel has just entered the proof-of-work ASIC business. Intel claims significant performance-efficiency-product advantage compared to Bitmain.

EDIT: I read that Intel is ostensibly venturing into proof-of-work ASICs because — at least in the case of SHA256 — the smaller wafer area provides higher yields (e.g. than their CPUs and other customers' large-area designs), thus giving them more options when ramping up new process yields and possibly incorporating multi-project wafers, in the context of their recent strategic shift to compete with TSMC, Samsung and GlobalFoundries in offering fab services.

@hyc

Ask yourself this: if a small design house can build an ASIC that can outperform AMD, Intel, and ARM, then why are they only a small design house?

An incorrect assumption, written during a cryptowinter before the fledgling onboarding of institutions that foreshadows Bitcoin becoming a world reserve currency with a $100T market cap, perhaps within a decade.

@tevador

There will be random dependency chains, typically due to instructions using the same register (very rarely using the same scratchpad word).

There are only 8 registers for address generation and register values change every instruction, so the sequence of independent instructions will be very short, perhaps 2-3 instructions.

Granted, the superscalar pipeline in modern non-embedded-market CPUs will exploit that instruction-level parallelism (ILP). Yet the dynamic dependencies at which dynamic superscalar excels apply to memory read/write dependencies, which you stated ‘very rarely’ occur. Could the static register-independence ILP be statically compiled as out-of-order in a VLIW architecture? Itanium failed, but ostensibly this was because real-world programs have ILP that is too dynamic to gain sufficiently from static analysis — yet RandomX seems to fail to duplicate that facet of real-world programs, as alluded to in this thread last year. 😉

Apparently ²⁵⁵⁄₂₅₆ths of the performance benefit of speculative execution from §Superscalar execution in the Design document is obtained by simply predicting every jump instruction as not taken, rendering a CPU’s speculative execution an electricity-wasting appendage (as mentioned both in the §2.6.1 Branch prediction documentation and in the audits). I’m struggling to parse what appears to be the sophistry of “1. Non-speculative - when a branch is encountered, the pipeline is stalled. This typically adds a 3-cycle penalty for each branch.” Why would an ASIC statically set to predict every branch as not taken stall the pipeline for any of the ²⁵⁵⁄₂₅₆ of occurrences, as has ostensibly been (disingenuously?) modeled? EDIT: I suppose the intent could be a comparison to a non-speculative, general-purpose CPU that isn’t designed specifically to optimize RandomX — in which case that could be clarified so that unwitting readers aren’t misled, given the presumption that RandomX’s design intent is to be ASIC resistant.

Thus I conclude the dynamic out-of-order and speculative appendages of modern CPUs waste some electricity and wafer area as compared to what could be implemented on an ASIC. As quantified it may or may not be that significant of an advantage for the ASIC but it will be one of many advantages.
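
A quick sketch of the expected stall cost of a static never-taken design, assuming the 1/256 taken-branch rate and the 3-cycle penalty cited above:

```python
# Expected stall cost of a static "never taken" design, assuming the figures
# cited above: branches taken with probability 1/256 and a 3-cycle stall
# only on the (rare) taken branch.

p_taken = 1 / 256
stall_cycles = 3
expected_penalty = p_taken * stall_cycles
print(f"{expected_penalty:.4f} cycles per branch on average")   # ~0.0117
```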

RandomX uses primitive operations (add, sub, mul, div, floating point), so I don't think you can make an ASIC much more power efficient than that. At most you can cut out some of the CPU parts like TLB, L3 cache, memory controller and IO.

New RISC-V CPU claims recordbreaking performance per watt

How about 10 times more power efficient?

Such CPUs targeted at embedded applications are not likely to be employed where real-time interaction is required. Nobody wants to wait on their smartphone to finish a task, and we all know there’s an Amdahl’s-law limitation on the parallelization of real-world programs, which contain an inherent serial, contention and synchronization component. A RandomX ASIC may be able to remove the energy- and wafer-area-wasting facets of modern CPUs that exist to minimize latency.

Yet we were all told this already:

Developers of ProgPoW expressed their opinion on RandomX: https://imgur.com/a/WUe87rQ

They make the throughput-efficiency-product point, but didn’t mention throughput-cost-product, for which higher yields on smaller wafer area could possibly be a significant advantage compared to cutting edge modern CPUs.

@hyc

One problem more naive people seem to have in discussing these things is treating bandwidth and latency as independent variables, when they are in fact inseparable. I suppose you have to have studied network engineering to really appreciate that fact, but it plays a huge factor in parallel computing […] In the real world, you need individual cores to be fast, you can't just get away with thousands of slow cores. Aside from the actual computation, the control of sequencing is extremely demanding; the communication overhead starts to overshadow the computation cost.

In the real world they’re inseparable because of inherent serial, contention and synchronization overhead. But afaics RandomX lacks the extent of nondeterminism in the real world of I/O.

@shelby3

shelby3 commented Jan 27, 2022

I realize the following contemplated design was considered deprecated in this issue thread, ostensibly because RandomX was changed during the discussion to have a random-access latency bound on the Dataset. Yet correcting the following is relevant for forthcoming estimates of an ASIC advantage for a new contemplated design I will posit.

@tevador

In total, we have a huge ~480 mm2 die, which is about the same size as a Vega 64 GPU.

[…]

So about 2.5 times more efficient. And this is the best case scenario.

Keeping with the point that we can no longer assume that proof-of-work ASICs will not receive top design effort from the likes of Intel, the comparable Nvidia GPUs of that era consumed ~77% of the power with ~36% faster base clock rate when scaled proportionally by process size and area (c.f. also).

We know that the Ryzen core can do ~500 programs per second at 3350 MHz. Since our ASIC will run only at 1 GHz, our optimized core can do only ~150 programs per second.

The base clock speed of said Nvidia scaled to 14nm would be only half of the Ryzen’s, thus each optimized core could do ~250 programs per second instead of ~150.

Thus the area and power required are nearly halved (a ~0.6 factor), whilst the power efficiency is ~77%, thus 0.77 × 295 W × 0.6 ≈ 137 W.

A Ryzen 7 1700X operating at those frequencies generates ~6200 programs per second at 95 W TDP. Maybe it generates ~4000 programs per second at 65 W TDP, which doesn’t change the conclusion.

Thus for this rough estimate and design, the ASIC would have a ~3.7 times power-efficiency advantage.

The contemplated design has significantly more cache than the GPU so that would be slightly less compute intensive per unit area, so maybe that bumps the power efficiency advantage closer to 4.

Also I see no reason why if volumes are significant enough that Intel or AMD couldn’t be motivated to produce cores that operate at the same frequencies as their CPUs, thus the custom RandomX chip advantage could approach 8 times for this contemplated design example.
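
A minimal reproduction of the ~3.7x arithmetic above; the 32,768 programs/s board throughput comes from the earlier estimate in this thread, and the 137 W and Ryzen figures are the ones quoted in this comment:

```python
# The ~3.7x figure above, reproduced. Board throughput (32768 programs/s)
# comes from the earlier estimate in this thread; the 137 W and the Ryzen
# numbers are the ones quoted in this comment.

asic_pps, asic_w = 32768, 137
cpu_pps, cpu_w = 6200, 95          # Ryzen 7 1700X figures quoted above

ratio = (asic_pps / asic_w) / (cpu_pps / cpu_w)
print(f"{ratio:.1f}x power-efficiency advantage")   # ~3.7x
```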

Price estimate:
~$150 for the die

Shrinking the wafer area by half presumably lowers the cost non-linearly (due to increased yields and more wafer area utilization) so less than $75.

The Ryzen 7 1700 was retailing for 4–5 times that cost in 2017, when the 14nm process was prevalent.

Also the end user had (and still has for newer CPUs) no way to produce a Ryzen 7 1700 computer system for less than ~$600+ and no option for amortizing system components over multiple CPUs because there’s no dual or quadruple CPU motherboards for non-server CPUs. Server CPUs and motherboards cost several thousand dollars and would be underutilized as a personal computer for most users. If they’re buying a device specifically for mining then they should purchase an ASIC.

@hyc

I think your area estimates for the ASIC are overly optimistic, as there'll be quite a complicated interconnect network to attach all those cores to the memory etc.

I bet it will be less than a 10% difference. Hub-and-spoke topology[1] is efficient; the spoke ends have low bandwidth in this example, which is why we don’t run an independent water main from the pumping station directly to every home.

[1] I.e. a scale-free network power-law phenomenon that dominates resource management in nature, even for wealth. Also related to the Pareto principle.

@shelby3

shelby3 commented Jan 30, 2022

@hyc

Partial answers: https://www.reddit.com/r/Monero/comments/8bshrx/what_we_need_to_know_about_proof_of_work_pow/

Some excerpts:

ASICs and GPUs outrun CPUs because they have hundreds or thousands of small/simple compute nodes running in parallel. The more complicated[non-deterministic] the computation, the larger[the less relative shrink for] a compute node you need to successfully execute it - which means, the more expensive the ASIC, and the fewer compute nodes can fit on a chip[the less advantageous the ASIC’s throughput-efficiency-product and throughput-cost-product].

Ftfy.

Going down the path of code complexity has already been explored, and not to burst your bubble

[…]

Not on a per-core basis. Maybe in aggregate, with multiple cores on a chip. But when an ASIC is forced to use its transistors to implement an interpreter that itself must process software, instead of just executing hardwired functions, it's going to have timing and memory constraints just like a CPU. It's going to have to do dynamic memory management, just like a CPU. All of these constraints will erode its advantage.

Ostensibly other than the rare dynamic (i.e. register runtime values) random memory contention and cache lines (which can perhaps be obviated by a holistic threaded masking of latency), the only illusion of non-determinism (c.f. also) added by RandomX as compared to all previous failed attempts at ASIC resistance is the static-per-program randomization of VM instructions. Being static thus deterministic, it can perhaps be somewhat obviated as limiting relative advantage for an ASIC with VLIW as I previously posited, presumably with an increasing trade-off of lower throughput-cost-product (i.e. the often idle specialized circuits) for higher throughput-efficiency-product (i.e. more efficient specialized circuits) as the design is pushed to the limits of efficacy.

The relatively miserly wafer area of the non-multiplicative VM instructions, sans the orders-of-magnitude more wafer-area-gobbling multiplicative ones, would occur 44% of the time paired, 29% as triplets and 19% as quadruplets. For integer VM instructions that’s only 100 combinations paired, 1000 as triplets and 10,000 as quadruplets. The combinations are even fewer for floating point. However, the number of combinations will be significantly larger if we want to hardwire all possible register combinations for even more efficiency of the n-tuples, although this still may be a worthwhile trade-off of wafer cost for greater efficiency.

We must assume that CPU designs prioritize throughput over electrical efficiency for the otherwise too slow multiplicative instructions so in that case throughput could be sacrificed for efficiency if latency is masked in another facet of the ASIC’s design.

But transistor speeds have flatlined.

Incorrect. Transistor switching speed scales inversely with gate capacitance, which scales down in proportion to the die shrink factor.

I recently replied to your 2017 claim that Moore’s law was dead:

@hyc you probably know by now that TSMC is pursuing GAAFET as the successor to FinFET for die shrinks below 5nm. Intel’s mismanagement likely contributed to the stall or slowdown, so TSMC displaced Intel and now Intel is reinventing itself to compete. Carbon nanotubes (CNTFETs) may continue Moore’s Law beyond GAAFET.

smooth_xmr wrote in your thread:

From the recent experience with Monero it looks like even factors of 3-4 or possibly lower are enough to dominate the network (slightly higher factors might be needed to do to so and also make it worthwhile).

@shelby3

shelby3 commented Jan 30, 2022

@SChernykh

All the assumptions that work with infinite in-flight programs stop working when you only have < 512 programs in flight.

@shelby3

This does not have to be moved for every program. Only when the program goes off chip.

Specialized circuits will idle less often if we share them between threads, but this incurs at least the extra latency of a more distant scratchpad cache. If that’s a positive tradeoff[1], then presumably the L1 cache is eliminated (leaving only L2 to serve its function) because the latency will be masked by the additional threads required to mask the slower L1. This holistic masking would also provide some leeway for other latency hiccups that the more complex CPU might handle in stride with its higher cache set associativity, OoOE, etc.

Yet if the sharing of ALU resources is compartmentalized per group of threads, and threads can be moved to a different group by moving only the register file for each new program (with cache distance irrelevant due to latency masking), the binomial distribution of (perhaps separately) integer and floating point multiplicative operations could come into play. At n=256, the standard deviation of the normal approximation indicates, for example, a ~31.7% occurrence of more or fewer than 38 ± ~6 integer multiplicative operations per program (i.e. more than ±15% from the mean). So some ALU groups could have more or fewer of these multiplicative resources to match throughput-efficiency-product and/or throughput-cost-product to programs relatively better than the CPU.
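
A quick check of the binomial spread, assuming the ~38-per-program mean of multiplicative instructions used above (a sketch, not a statement about the actual RandomX instruction frequencies):

```python
# Spread of multiplicative instructions per 256-instruction program, assuming
# the ~38-per-program mean used above (i.e. p ≈ 38/256). Normal approximation:
# about 31.7% of programs fall outside ±1 sigma of the mean.

n = 256
p = 38 / n
mean = n * p
sigma = (n * p * (1 - p)) ** 0.5
print(f"mean ≈ {mean:.0f}, one sigma ≈ {sigma:.1f}")   # ~38 ± ~5.7
```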

@hyc

You cannot optimize for randomness. Just like you cannot design an optimal compression algorithm for random data streams.

Random bits do not represent the entropy of the system when those bits evoke non-fungible resources. Your analogy is vacuous and #NotEvenWrong.

In other words this and the prior post are examples that I was correct that the entropy is not the 2⁵¹² seed of the random generator. The entropy is reduced by the complex analysis of the interactions of the said non-fungible resources. There’s now some math in this and my prior post for you to attempt to refute. If the entropy was solely determined by the random bits that comprise the program then said math could find no optimizations involving anything other than the random bits. My posited optimizations leverage information which subsumes some of the information (Shannon entropy) in the random bits.

...crickets...

😛

[1] Low-power SRAM can have static power consumption 1% of dynamic (hot) and static power consumption per 256KB can be less than 10μW — thus 10,000+ static scratchpads for less than ⅒W (not that we’ll need anywhere near that many). I will propose moving L3 off-die in my next post. I do not know if that document is representative of the reality that applies to the intended context. SRAM seems to have many variants and much ongoing research.

@timolson

@shelby3
I maintain that it’s folly to replace the proof-of-work in Monero with a very complex design if it has not been exhaustively analyzed for potential exploits, because it’s much more difficult to know whether the proof-of-work has been surreptitiously compromised. Newness is badness for proof-of-work, especially when the design and potential exploits are very complex.

Shelby I completely agree with you. Complexity in a PoW is bad bad bad, increasing the attack surface and providing plenty of opportunity for unobvious optimizations by private parties who hide their trade secrets.

My assessment three years ago was that an ASIC could indeed outperform CPUs running RandomX, but I no longer have the time or motivation to really get into it. IMO, there are a few reasons why there is not an obvious RandomX ASIC yet, but none of them have to do with the "ASIC resistance" of the algorithm:

  1. Chip shortage. It has been impossible to get a new project into a foundry for over a year now.
  2. Threat of PoW switch: if an ASIC appears, the dev team will just switch PoW's again. This is the main reason I didn't myself pursue building a RandomX ASIC
  3. Complexity: this makes it more annoying to design an ASIC, although it also provides much opportunity to optimize vs stock CPU's. But the up-front design cost for RandomX is far higher than any other PoW I know of. Not a good thing, IMO.

For some reason the Monero team decided to hand the keys of the kingdom to AMD and Intel, a duopoly whose headquarters are across the street from each other, both of whom produce chips with backdoors for the NSA. I don't get it. I thought Monero was against that kind of centralized control under the thumb of a single government. Now that Intel will be explicitly producing mining chips, it seems an even stranger decision.

If you are looking for a PoW for a new project, why not just use Keccak? It's perhaps the most extensively attacked and reviewed hash we have. It's simple, and it's hella fast in hardware. It also has uses outside PoW.

@hyc
Collaborator

hyc commented Jan 30, 2022

...crickets...

Eh, I've been waiting for you to finish editing all your mistakes.

For some reason the Monero team decided to hand the keys of the kingdom to AMD and Intel,

Nonsense. You continue to ignore ARM, which wins on power efficiency and sheer volume in terms of installed base. And RISC-V is still evolving. There is nothing preferential or advantageous to x86 in RandomX.

@hyc
Collaborator

hyc commented Jan 30, 2022

As for Intel entering the SHA256 ASIC market - as I wrote in August 2019: http://highlandsun.com/hyc/monero-pow-12.txt

(06:57:53 AM) hyc: IMO the decision to move to SHA3 is still stupid
(06:58:11 AM) hyc: claims that it will put all hardware on an even playing field are wrong
(06:58:36 AM) hyc: even now, if you google search SHA2 designs, there are still new optimizations bein published.
(06:59:22 AM) hyc: and the same will be true of SHA3 - there will always be someone finding an edge to exploit. 
(07:01:07 AM) hyc: so the dream of "making it ASIC friendly will promote commoditization" is pure fantasy

@timolson

You continue to ignore ARM, which wins on power efficiency and sheer volume in terms of installed base. And RISC-V is still evolving. There is nothing preferential or advantageous to x86 in RandomX.

Tri-opoly now with the M1. Cupertino is just a few miles from Santa Clara. We can argue about ARM chips being competitive, but at least they're owned by the non-US entity SoftBank.

My point is that when you create a complex design and try to tie it to CPU's, you limit competition to only a few very large corporations, almost all of whom are under the thumb of the US government. SHA3 chips could be produced by just about anyone, including small teams, and I urge you to look at the algorithm. Your claims of new tricks do not seem well-founded to me. It is far simpler than SHA2.

And if you think SHA2 has too many tricks, why would you design a PoW as complex as RandomX? Do you think there are not a ton of tricks for optimizing RandomX? I don't understand how you can claim SHA2 has too many tricks and yet think that RandomX is solid, or that it can't be made into an ASIC.

But we are replaying an old fight. 🤷 Shelby can read the history, except for the IRC talk.

@hyc
Collaborator

hyc commented Jan 30, 2022

But we are replaying an old fight.

Agreed, there's nothing new here.

@timolson

timolson commented Jan 30, 2022

@shelby3

For viewpoints critical of RandomX's ASIC resistance, look at comments from myself and also "Linzhi" which is a fabless ASIC company from Shenzhen.

@shelby3

shelby3 commented Jan 31, 2022

@timolson

but at least they're owned by the non-US entity SoftBank.

My Telegram Chinese friend wrote, “the biggest shareholder of sin0vac is softbank from japan.”

EDIT: What do you think of RISC-V as a potential way for us to have CPUs which are free from malware? Do you think it will be possible to get these manufactured, and what would be the volumes and capital one would need to be taken seriously by TSMC or another foundry? Could these foundries insert exploits that we’d be unable to find?

Tangentially I wrote on my Telegram group:

Well if ever we need to build our own ancient CPU out of discrete transistors:

https://monster6502.com/

That was the second CPU I learned to program. This discrete version costs $4000 to build and consumes 10 W, compared to the actual 6502 IC part which costs $10, runs 300 times faster and consumes 8 mW.

I don’t know if people comprehend the rate at which Moore’s law (no not my law, haha) has altered the human species.

Even an early 1990s era CPU such as the Motorola 68000 (which is the CPU I was programming for most of my early significant accomplishments in my 20s) would occupy 8.8 hectares of land if built with discrete components.

The human species has difficulty comprehending exponential growth and scale.

IMO, there are a few reasons why there is not an obvious RandomX ASIC yet, but none of them have to do with the "ASIC resistance" of the algorithm

I want to help emphasize that those are relevant points to why some observers do not conclude with utmost confidence that RandomX is highly ASIC resistant.

  1. I agree with the proponents of RandomX that this is a reason ASICs are bad if one is depending on the proof-of-work for a blockchain’s security and censorship resistance. And in the case of Monero arguably its anonymity. Seems only the well connected will get any supply or at least first dibs before the economic life of each ASIC generation has been sufficiently depleted.

  2. I believe (and some in their community including apparently @hyc and smooth also seem to agree) that they can’t continue this and expect to have any viability going forward. Community forks are an obfuscation of reality as is democracy. Forks are decided by leaders and that means centralization and probably makes it a regulated security under the Howey test of U.S. Securities law. I believe the powers-that-be have been biding their time waiting to pounce when it meets other objectives of this premeditated Great Reset. There is new law being proposed in the U.S. to allow U.S. Treasury to shut down exchanges without any notice. Also Biden will apparently issue a national security order this coming week to accelerate such plans. Gensler said he is going to focus first on exchanges. I doubt they ever crackdown severely with securities law, e.g. EOS only settled for a $24 million fine (but they were ostensibly buying the ICO from themselves so maybe they’re not as cash rich as we might have thought).

    Perhaps more saliently is that when they change the proof-of-work they are very vulnerable to botnets and the main protagonist of this Great Reset (Klaus Schwab) has been telling us that cyberattacks are next on the agenda after the plandemic. Also war of course.

    Agreed it’s probably not worth your time to attempt to design it, because we don’t know how much longer the cryptocosm will exist before the powers-that-be destroy all the sh8tcoins and resurrect the currently soft forked legacy Bitcoin in a hail of ANYONECANSPEND fire and brimstone hard f**king. Or Monero could change it again out of desperation despite members of their community recognizing that’s futile. And also this bull cycle probably has less than a year to run before another multi-year cryptowinter. Probably numerous other reasons it’s not likely someone will attempt a RandomX ASIC unless they have some ulterior motives that pay it back, in which case they may have already done so and how would we know?

  3. Agree but it seems like an accepted tradeoff they made to try to get the most ASIC resistance they could. So it depends what the objective is, as elaborated below. If we accept that powers-that-be are always going to end up with the ASICs regardless of what algorithm we choose (c.f. # 1 above), then one ends up looking elsewhere to solve the problem as I did. I moved on from proof-of-work in ~2015ish or so. Well it was a process of failed ideas and further contemplation. But proof-of-stake is not the solution for reasons I won’t elaborate here.

    Vitalik noted tradeoffs of ASIC resistance:

    Ethash has proven remarkably successful at ASIC resistance; after three years and billions of dollars of block rewards, ASICs do exist but are at best 2-5 times more power and cost-efficient than GPUs. ProgPoW has been proposed as an alternative, but there is a growing consensus that ASIC-resistant algorithms will inevitably have a limited lifespan, and that ASIC resistance has downsides because it makes 51% attacks cheaper (eg. see the 51% attack on Ethereum Classic).

    I believe that PoW algorithms that provide a medium level of ASIC resistance can be created, but such resistance is limited-term and both ASIC and non-ASIC PoW have disadvantages;

For viewpoints critical of RandomX's ASIC resistance, look at comments from myself and also "Linzhi" which is a fabless ASIC company from Shenzhen.

I did read that other RandomX issues thread. I appreciate everyone who has shared their knowledge, including the authors and contributors to RandomX, because integrated circuits, from the design and manufacturing perspective, are new to me. I had studied the boolean logic design of, for example, early microprocessors (at age 13 actually) and had built analog electronic and digital circuits when I interned at Rockwell Science Center in Thousand Oaks, but then I exited the field to launch a software company in the 1980s.

And if you think SHA2 has too many tricks, why would you design a PoW as complex as RandomX? Do you think there are not a ton of tricks for optimizing RandomX? I don't understand how you can claim SHA2 has too many tricks and yet think that RandomX is solid, or that it can't be made into an ASIC.

It does come across as disingenuous. That makes it difficult to trust that they have been objective[unbiased and thorough] in their analysis [of potential exploits].

If you are looking for a PoW for a new project, why not just use Keccak? It's perhaps the most extensively attacked and reviewed hash we have. It's simple, and it's hella fast in hardware. It also has uses outside PoW.

I noted Mircea was also raving about Keccak in the past. I think that might be a good one to consider transitioning to after onboarding, perhaps hardcoded transition schedule in the protocol. My interest in RandomX is so people can onboard to transaction fees for Dapps[EDIT: found a better solution] for my contemplated vaporware alt[sh8t]coin without needing to buy crypto. They only need to be able to mine a fraction of a penny or so in value within a reasonably short period of time, preferably within an hour or at most overnight. I want to make sure the network difficulty won’t run away from them making even that implausible as it is on Bitcoin.

Also, posit that I won’t have to lie to users about the fact that the proof-of-work will always be entirely centralized in the end game. I have contemplated a consensus design which (hopefully) obviates a 50+% attack in any form, including transaction censorship. An ISTJ tells me I lack fecundity in my bit-twiddling, and I retort that his bit-twiddling pursuits indicate he doesn’t conceptualize why the power-law distribution of resources is inviolable. Banging one’s head against a brick wall is not a very productive activity, but hey, never interrupt the antagonist when they’re busy destroying themselves — Sun Tzu.

Also, the other reason to consider RandomX is that if there are users willing to mine at a loss (they do not care about losing a fraction of a penny if they are onboarding), then the coming destruction of all the altcoins described below might not apply to mine (but I still think it would be vulnerable if not for the key change I made in the consensus design):

http://trilema.com/2014/the-woes-of-altcoin-or-why-there-is-no-such-thing-as-cryptocurrencies/

In short, I expect Monero to be destroyed whenever the powers-that-be are ready to do so. Ditto Litecoin, Dash, etc.

@hyc

For some reason the Monero team decided to hand the keys of the kingdom to AMD and Intel,

Nonsense. You continue to ignore ARM, which wins on power efficiency and sheer volume in terms of installed base. And RISC-V is still evolving. There is nothing preferential or advantageous to x86 in RandomX.

I interpreted his comment differently: to truly compete with what is posited to be a state-of-the-art RandomX ASIC, it might be necessary to have Intel’s or AMD’s intellectual property. Such an intentional asymmetrical design choice could possibly be a major blunder unless the actual ASIC resistance (if we even have a way to reliably estimate it) is sufficient to meet some aims. I posit that for Monero’s raison d'être (i.e. anonymity) such an asymmetrical unknown deletes the assurance of anonymity. Smooth argued to me that routine privacy (e.g. your neighbor can’t track your spending) doesn’t necessarily require anonymity against powerful adversaries such as three-letter agencies.

Btw, so far I am pleased with the decision to make RandomX compatible with ARM because it might match my use case if I can nail down the level of ASIC resistance to within an order-of-magnitude.

I haven’t yet finished my study and exposition yet I am already leaning towards the likelihood that a state-of-the-art RandomX ASIC will be at least an order-of-magnitude more efficient and/or less costly. Anyone else want to share any additional thoughts about such an estimate? I suppose I am thinking right now that three orders-of-magnitude is unlikely.

I really wanted help on refining such estimates which I presume is beneficial to any project that wishes to employ RandomX. I am not sure if anyone but “antagonists”[1] to RandomX are going to do their best to be objective[dig for exploits] in making such estimates. I am still open to learning more.

I think @tevador and perhaps also @SChernykh were trying to be somewhat unbiased and that was appreciated. I probably dropped the ball by not applying more effort last year to explain myself, but I grew weary of the discussion at that time, had other pressing (health[2]) matters to attend to, necessitated more focused study/thought but didn’t have the free time to do it properly, and at the time I had no immediate interest in using RandomX.

https://www.reddit.com/r/Monero/comments/8bshrx/what_we_need_to_know_about_proof_of_work_pow/dx9uxns/

Autistic
A misunderstood developmental disability that interferes with a person's ability to properly and/or appropriately communicate with others.

Maybe we’re only worthy of your elbows because you’re so on top of your field as to be outside our imbecile communication range.

[1] My impression/intuition is they (especially @timolson) are genuinely trying to help by unselfishly offering their time and expertise, but it seems perhaps some of the RandomX contributors in this thread are skeptical and think there’s some hidden subterfuge agenda, at least w.r.t. myself and "Linzhi". I offer no firm opinion on "Linzhi" other than to cite factual statements such as the relative number of logic gates between different ALU operations.

[2] The reason for not trying to explain what I might mean about entropy before was because it would have dragged me into another debate, perhaps to be misunderstood if I hadn’t taken the time to carefully study and contemplate. I was in a rush to travel from that location where I was to escape the lockdowns, closed exercise facilities, masking (and climate) that I got stranded in by the sudden turn of events in 2020, which were so deleterious to my chronic health issues. Obviously these guys here will ridicule anyone who has not extensively contemplated one’s own ideas. So no one signs up for that unless they can dedicate the time to do it reasonably well and thoroughly. Also I am weary of the MoAnero trolls, as it has always been the same throughout the years when interacting with some of them, so I just decided to find a way to shut down the discussion last year and move on to more pleasant activities. Seems to be an ISTJ personality type issue — they’re incompatible with ENTPs. ENTPs are visionaries and work with ideas. I?TJs are detail freak experts. Technology is supposed to be fun, at least that is why I got into it. I learned a lot from discussions with smooth, ArticMine and some others.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

@hyc

As for Intel entering the SHA256 ASIC market - as I wrote in August 2019: http://highlandsun.com/hyc/monero-pow-12.txt

The effort that was applied to attempt to achieve maximum ASIC resistance is admirable. Whether the tradeoffs of that goal are a net positive is debatable. What is not admirable is the dearth of intensive ASIC designs attempted by the devs to validate their work.

When one comes to this project, one wants to read about all the attempts to break the ASIC resistance and the detailed expositions. One wants to learn from the experts by reading, instead of having to become an expert oneself to do work one isn’t really qualified to do.

Instead of being on the defensive in discussions of RandomX, why not try to attack your own design more aggressively? The best programmers want to break their own programs. I always expended an order of magnitude more effort breaking and fixing my programs than designing and coding them. As for IQ, which you seem to be harping on, I remember 152–160 IQ Eric Raymond being schooled by some Rust devs when he got into a debate about some of their design choices. I may not have the Mensa-level IQ I once did (after head trauma such as being struck with a hammer, liver disease and type 2 diabetes), but I did have instances in my youth where I wrote out verbatim from (photographic?) memory several hundred lines of code after a power outage. I doubt I could do that now, being also blind in one eye.

Also you should be proud that so many people are interested in discussing your work.

Shouldn’t this be a labor of love, to talk shop with others and educate them about your work? The circle of people who even have the ability and interest to converse here is presumably quite small.

Not everyone who comes around is going to have the time to be as expert as the original authors (at least not initially) and it may take them some time to come up to speed and even confirm their interest in investing the effort to do so.

Why do you expect that other people owe you their utmost perfection? Nobody forced you to continue replying on this project if it pisses you off so much to have to entertain the people you think are worthless clowns.

Note I did appreciate your help on my recent question.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

@timolson

I thought Monero was against that kind of centralized control under the thumb of a single government. Now that Intel will be explicitly producing mining chips, it seems an even stranger decision.

When they assume our motives are competing financial interests, why should we assume anything less about their motivations?

Why should we not assume some guys mined the heck out of Monero in the early days when mining difficulty was low and are now defending their tokens? They need ever greater fools to buy in so they can cash out at higher profits.

If so, then the most important priority is to maintain the illusion of virtue, and thus ASIC resistance, so as not to admit that proof-of-work is always going to become more and more centralized over time. Westerners are living under an illusion that they are not fully enslaved, so the objective could be to sell that hopium to them.

OTOH, I observe that some in the Monero community are very idealistic, maybe to such an extreme that they refuse to accept that they can’t defeat the centralization of proof-of-work mining, even if it entails lying to themselves by making overly optimistic assumptions.

And then there appears to be another facet of the Monero community: they want to be known as the highest-IQ, most technologically advanced project in crypto. Yet they never discarded that Rube Goldberg ring-signature anonymity, which is conceptually flawed. They have intensive engineering, but viewed holistically it’s incoherent. This is, I suppose, what happens when you bring together a lot of huge egos[IQs] who have sub-specialties that they each want to show off.

Then there are several very level-headed members of that community also. So we can’t generalize.

Maybe I am entirely wrong but that is my attempt to try to understand the project.

Also I have enjoyed learning about technology from the Monero project. And delighted that RandomX has been field tested for flaws unrelated to provable ASIC resistance. So all-in-all I think everything happens for a reason. Not really complaining. Just trying to throw some shade on any expectation of pure virtue in the cryptocosm.

P.S. Anyway I discovered a new way to do anonymity which is far superior to anything out there now, because it renders any action we do on the Internet untraceable, not just cryptocurrency transactions. Not onion-routing, nor even random-latency mixnets; I was the one arguing back in 2014/5 that Tor was a honeypot, back when some in the Monero community were espousing I2P.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

Eh, I've been waiting for you to finish editing all your mistakes.

Done. Check that post again now. 😉

@fluffypony
Copy link

P.S. Anyway I discovered a new way to do anonymity which is far superior to anything out there now, because it renders any action we do on the Internet untraceable, not just cryptocurrency transactions. Not onion-routing, nor even random-latency mixnets; I was the one arguing back in 2014/5 that Tor was a honeypot, back when some in the Monero community were espousing I2P.

Oh boy this again. If you took just half the effort you put in arguing on Bitcointalk and GitHub, and put it into actually building something, you might release it before this decade is over.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

Oh boy this again. If you took just half the effort you put in arguing on Bitcointalk and GitHub, and put it into actually building something, you might release it before this decade is over.

Pay the very large sum of money back to my friend that you ostensibly stole from him. I know things.

EDIT: Ad hominem declined. Will not sway me into revealing my intellectual property secrets before I’m prepared to launch something if ever.

@fluffypony
Copy link

Pay the very large sum of money back to my friend that you ostensibly stole from him. I know things.

I literally have no idea what you're talking about. You'll have to be extremely clear if you're going to sling around public accusations.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

I literally have no idea what you're talking about. You'll have to be extremely clear if you're going to sling around public accusations.

You know exactly what I am talking about.

@fluffypony
Copy link

You know exactly what I am talking about.

I haven't stolen money from anyone, so I absolutely do not.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

I haven't stolen money from anyone, so I absolutely do not.

It is all going to catch up to you someday. Just continue the big lie. I see you managed to escape from the USA [criminal extradition request].

EDIT: I’m aware how corrupt S. Africa is from the YouTube channel of an expat South African who grew up and still has family there. So who knows what’s really going on with that. Extortion against you, perhaps, would be my leaning if I didn’t have other information indicating you might be unscrupulous. Maybe I should assume instead that someone in S. Africa was bought off. In any case I trust my friend because I know he is virtuous. It’s my prerogative not to trust you. You injected non-factual allusions into a technical discussion — it has been explained many times that I was in no way connected with BCX and was in 2014 merely pontificating about whether his threats could have any technological realism. I was curious and learning about the technology — no malice involved. I was also 8 years younger and probably more aggressive, energetic, excited, naive and bewildered. Yet this was blown out of proportion by those who want to attach some ad hominem to my reputation. I was responding to @timolson, who seemed bewildered by motivations, so I proffered an explanatory hypothesis. The main thrust was to temper his expectations of virtue in the cryptocosm; just like every other facet of life, there are vested interests, situations, etc. that can explain behavior which would otherwise seem irrational.

@fluffypony
Copy link

It is all going to catch up to you someday. Just continue the big lie. I see you managed to escape from the USA.

Nope, still in the USA. Again - if you're going to make baseless accusations you should back them up, otherwise it's BCX's Monero attack all over again.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

There is no baseless accusation. Go sue me (EDIT: for defamation if the accusation is false), then I will reveal the name of the person I promised not to reveal.

@fluffypony
Copy link

There is no baseless accusation. Go sue me, then I will reveal the name of the person I promised not to reveal.

lol why would I sue you? You're welcome to make as many baseless accusations as you want.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

Promising to hold $millions in XMR for people and then pretending you were hacked. You must have learned that from Bruce Wanker?

And never filing a police report, lol.

@fluffypony
Copy link

Promising to hold $millions in XMR for people and then pretending you were hacked. You must have learned that from Bruce Wanker?

The only time I promised to hold XMR for people was when I held MintPal's post-exit scam Monero to refund depositors, but ok.

@hyc
Copy link
Collaborator

hyc commented Jan 31, 2022

@shelby3 need I remind you of this from a mere 5 days ago
#11 (comment)

Please constrain your comments to the actual topic and refrain from going off on wild tangents, if you wish to be taken seriously.

Since it appears you're unable to maintain a coherent discussion I'm inclined to block you.

@shelby3
Copy link

shelby3 commented Jan 31, 2022

Before I go I will dump the other technological information I had dug up on the latency bound.

For one, it appears that 40 ns was the memory latency assumption being referenced upthread, but this excellent document educated me about the meaning of the key timing parameters for DDR4.

I remember there was a reference upthread to 40 cycles + 90 ns for an L3 cache miss, and this presumably includes all the latency of the various facets of the memory system, including the memory controller. The memory controllers in modern CPUs are ostensibly complex, and they interact with complex caching as well. An ASIC may not need all those features.

The optimal case of tFAW for some DDR4 appears to be ~20 ns for four reads, thus 5 ns per read. This Dataset access-rate bound limits absolute throughput and can’t be masked with threads, although in the optimal case it is close enough to the L3 latency of ~40 cycles (~10 ns) that we could probably move the L3 off-die, given enough of the aforementioned threads to holistically mask the latency, a performant memory controller, and enough reads in flight.
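
To make that arithmetic concrete, here is a minimal back-of-envelope sketch (my own, not something stated elsewhere in the thread) of how a ~20 ns tFAW bounds fully random reads per DDR4 rank. The 20 ns window and 64-byte read size are assumptions taken from the figures above.

```cpp
// Back-of-envelope sketch, not a memory simulator. Assumes a 20 ns tFAW
// (four-activate window) and 64-byte random reads, i.e. every read lands
// in a new row and needs its own activation.
#include <cstdio>

int main() {
    const double tFAW_ns    = 20.0; // assumed optimal-case four-activate window
    const int    activates  = 4;    // row activations allowed per tFAW window per rank
    const double read_bytes = 64.0; // assumed size of one random Dataset read

    double reads_per_sec = activates / (tFAW_ns * 1e-9);     // ~200 M random reads/s
    double gbytes_per_s  = reads_per_sec * read_bytes / 1e9; // ~12.8 GB/s per rank

    std::printf("random reads/s per rank: %.0f\n", reads_per_sec);
    std::printf("random-read bandwidth:   %.1f GB/s per rank\n", gbytes_per_s);
}
```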

Of course we then have to consider the latency added by the memory controller reordering requests to optimize DDR timing, which is not a throughput limitation if we can mask it with threads.

We can increase the number of memory banks to increase throughput, but if we have to use redundant Datasets then power consumption increases. Still, power consumption for DDR4 is reasonably low at ~1.5 W per 4 GB.
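
As a rough illustration of that tradeoff (again my own sketch), the following assumes the ~12.8 GB/s-per-rank limit from the previous sketch, a hypothetical 100 GB/s aggregate read target, an approximate ~2.1 GB Dataset, and the ~1.5 W per 4 GB figure quoted above.

```cpp
// Rough scaling sketch under the assumptions stated above: each rank is
// limited to ~12.8 GB/s of fully random reads and must hold its own
// redundant copy of the Dataset.
#include <cmath>
#include <cstdio>

int main() {
    const double per_rank_gbs  = 12.8;  // random-read limit per rank (see previous sketch)
    const double target_gbs    = 100.0; // hypothetical aggregate Dataset read rate
    const double dataset_gb    = 2.1;   // approximate Dataset size (assumption)
    const double watts_per_4gb = 1.5;   // DDR4 power figure quoted in this post

    int    ranks      = static_cast<int>(std::ceil(target_gbs / per_rank_gbs));
    double dram_gb    = ranks * dataset_gb;              // redundant copy per rank
    double dram_watts = dram_gb / 4.0 * watts_per_4gb;   // e.g. ~6 W for 8 ranks

    std::printf("ranks: %d, DRAM: %.1f GB, DRAM power: ~%.1f W\n",
                ranks, dram_gb, dram_watts);
}
```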

This is a hardware feature I had never studied in depth.

EDIT: the number of threads per computational core group is not contemplated to be massive as in @cjdelisle’s proposal. Rather, just enough to mask the various latencies I’ve mentioned in this and prior recent posts.
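
For what it’s worth, a Little’s-law style estimate (my framing, not from the thread) suggests how many threads per core group are needed just to hide a given memory latency, assuming each thread keeps one Dataset read in flight at a time.

```cpp
// Little's-law sketch: outstanding requests needed = latency / inter-issue time.
// The 100 ns latency and 5 ns issue interval are assumptions, the latter being
// the tFAW-limited rate from the earlier sketch.
#include <cstdio>

int main() {
    const double latency_ns = 100.0; // assumed end-to-end DRAM read latency
    const double issue_ns   = 5.0;   // one Dataset read issued every 5 ns

    double threads = latency_ns / issue_ns; // ~20 threads in flight per core group
    std::printf("~%.0f threads per core group to hide %.0f ns of latency\n",
                threads, latency_ns);
}
```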

Since it appears you're unable to maintain a coherent discussion I'm inclined to block you.

Is baseless ad hominem the acute specialization[affliction] you share with @fluffypony?
