
NUMA support #22

Open
lilyanatia opened this issue Feb 12, 2019 · 13 comments
Labels
enhancement New feature or request

Comments

@lilyanatia

on a quad-socket server (4x Xeon E5-4640), a single process with 64 threads maxes out at about 3700 H/s.

running 4 processes at once (one per physical CPU) with 8 threads each, I get about 2525 H/s per process for a total of 10100 H/s.

since real mining software will almost certainly have NUMA support, it would probably be good to implement it here so people get a more accurate idea of actual mining hashrates.

@tevador
Owner

tevador commented Feb 12, 2019

Interesting. Thanks for the test.

Actually, NUMA is only part of the story here: DDR3 is limited to about 1500 H/s per channel, so even if this were a 64-core machine with uniform access to 4 channels, it would still be limited to ~6000 H/s. Machines like this definitely need multiple copies of the dataset for maximum performance. I'll keep it in mind.

BTW, DDR4 is noticeably less limiting due to its multiple internal banks (> 3000 H/s per channel).

@tevador
Owner

tevador commented Feb 12, 2019

I noticed the CPU has 20 MB of L3 cache, so for the best performance, you should be using 10 threads per CPU or 40 threads total.

bin/randomx --mine --largePages --threads 10 --nonces 100000
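
The 10 comes from dividing the 20 MB of shared L3 by RandomX's ~2 MiB per-thread L3 requirement. A minimal sketch for deriving the same number on any Linux box (getconf is standard glibc; the value in the comment is what the E5-4640 should report):

echo $(( $(getconf LEVEL3_CACHE_SIZE) / (2 * 1024 * 1024) ))   # 20971520 / 2097152 = 10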

@lilyanatia
Author

it looks like the best performance should be with 10 threads per CPU, but it isn't.

with a single process, I tested all numbers from 32 to 80 and got the highest hashrate at 64 threads. 64 threads did about 3700 H/s, 32 threads did about 3100 H/s, and 40 threads did about 2900 H/s.

with 4 processes, I tested from 8 to 16 and got the highest hashrate at 8 threads per process. on a single CPU, 8 threads did about 2525 H/s, 16 threads did about 2450 H/s, and 10 threads did about 1950 H/s.

with cryptonight, this machine does get the best performance at 10 threads per CPU.

@tevador
Owner

tevador commented Feb 12, 2019

So it seems the L2 cache (256 KiB per core) is the limiting factor. RandomX needs 16 KiB of L1D, 256 KiB of L2 and 2 MiB of L3 per thread.
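
The per-core figures can be read the same way. A sketch, assuming Linux getconf (the commented values are what a Xeon E5-4640 core should report):

getconf LEVEL1_DCACHE_SIZE   # 32768: room for two threads' 16 KiB of L1D per core
getconf LEVEL2_CACHE_SIZE    # 262144: exactly one thread's 256 KiB of L2 per core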

@MoneroChan

interesting... if hotaru2k3 isn't using a DDR4 server board, that means non-uniform memory access can overcome the DDR3 ~6000 H/s limitation by ~40% in some configurations, but sacrifices ~40% from the maximum possible H/s...

(This reminds me of data interleaving / XMR-Stak's interleave function)

@lilyanatia
Author

I'm using a DDR3 server board with 4 sockets and 4 memory channels per socket. the DDR3 ~6000 H/s max is for 4 channels, not 16.

@MoneroChan

ahh, it's per socket. so it's similar to 4 motherboards joined together with 4 channels each. So 16 channels in total. That makes sense now.

So you've got 50% spare RAM bandwidth before your RAM becomes a bottleneck, and the theoretical max on your system is 20000 to 24000 H/s if you upgrade the CPUs?

@lilyanatia
Author

the best CPUs for this board have 12 cores (a 50% increase over what I have), so it'd probably max out around 15000 H/s. L2 cache would still be the bottleneck.

@lilyanatia
Author

lilyanatia commented Jun 22, 2019

How exactly is this done?

numactl -s | grep \^nodebind:\ | cut -c 11- | sed s/\ /\\n/g | xargs -P 0 -I node numactl -N node ./randomx-benchmark --mine --jit --largePages --init $(numactl -H | grep 'node 0 cpus: ' | cut -c 14- | wc -w) --threads $((echo $(lstopo --restrict 0xff --only L2 | grep -o '[0-9]\+[KM]B') | sed 's/ /+/g;s/KB//g;s/MB/*1024/g;s/^/(/;s/$/)\/256/' | bc; echo $(lstopo --restrict 0xff --only L3 | grep -o '[0-9]\+[KM]B') | sed 's/ /+/g;s/KB//g;s/MB/*1024/g;s/^/(/;s/$/)\/2048/' | bc) | sort -n | head -1)
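
For readability, a rough unrolled equivalent of that one-liner (a sketch only: it hardcodes 4 nodes, 8 cores and 8 threads per node instead of deriving them from numactl/lstopo, and adds --membind so each copy of the dataset is also allocated from local memory):

for node in 0 1 2 3; do
    numactl -N $node --membind=$node \
        ./randomx-benchmark --mine --jit --largePages --init 8 --threads 8 &
done
wait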

@kio3i0j9024vkoenio

kio3i0j9024vkoenio commented Jun 24, 2019

NUMA support really needs to be implemented. Having to access memory through another processor causes a drop of about 52% in performance.

In many forums I have read that lots of users who will be mining with RandomX are buying, or already own, servers with two or more processors.

My test system is an HP DL580 with four eight-core Xeon E7-8837 processors and 8 GB of memory per processor, or 32 GB of memory total.

The benchmark tests show that when RandomX allocates the entire dataset in a single processor's memory, it runs about 52% slower than when the dataset is spread across the processors.

sudo sysctl -w vm.nr_hugepages=1200
./benchmark --mine --largePages --jit --threads 28 --nonces 100000
RandomX benchmark
Performance: 7193.61 hashes per second

Now allocate the dataset to only one processor:

sudo sysctl -w vm.nr_hugepages=4800
./benchmark --mine --largePages --jit --threads 28 --nonces 100000
RandomX benchmark
Performance: 3471.44 hashes per second (a slowdown of 51.7%)

Screenshots show that the 28 threads are spread over the four processors in both tests, so when the dataset is only in one processor's local memory, the other three need to go through that processor to access it. That causes the massive slowdown.
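
One way to confirm where the huge pages (and with them the dataset) actually landed is to check the per-node counters in sysfs, or numastat for the running process. A sketch, assuming 2 MiB huge pages and the numactl/numastat tools (the pgrep pattern is illustrative):

grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages
grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/free_hugepages
numastat -p "$(pgrep -f benchmark | head -n 1)"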

@kio3i0j9024vkoenio

Long story short: since NUMA support is still not in the RandomX benchmark, you need to benchmark using these commands:

sudo sysctl -w vm.nr_hugepages=4800
seq 0 3 | xargs -P 0 -I node numactl -N node ./benchmark --mine --largePages --jit --nonces 100000 --init 8 --threads 8

The second command runs four benchmarks, each bound to a single processor, so each processor only uses its local memory.

These are the results I have obtained:

Running benchmark (100000 nonces) ...
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2319.77 hashes per second
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2296.53 hashes per second
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2289.64 hashes per second
Calculated result: 9b22794882187000d62c6d2b228fab5e585767aaaa5eb74905b0c7c00fcbdad8
Performance: 2280.22 hashes per second

That is a total of 9186 H/s for the four processors or an average of 2296 H/s for each Xeon E7-8837.

@tevador tevador added the enhancement New feature or request label Jun 30, 2019
@kio3i0j9024vkoenio

kio3i0j9024vkoenio commented Aug 16, 2019

I just wanted to point out that XMRig v3.1.0 has implemented NUMA support for RandomX (tested on the testnet) and it works flawlessly.

https://github.com/xmrig/xmrig/releases/tag/v3.1.0

This is how I test on Ubuntu:

RandomX testing

wget https://github.com/xmrig/xmrig/releases/download/v3.1.0/xmrig-3.1.0-xenial-x64.tar.gz
tar xvzf xmrig-3.1.0-xenial-x64.tar.gz

cd xmrig-3.1.0

edit config.json:
   "asm": "bulldozer", - change to this if Opteron's 6200 or 6300 otherwise leave it alone
   "donate-level": 1,
   "algo": "rx/test",
   "pools": "randomx-benchmark.xmrig.com:7777",

./xmrig

@jtgrassie
Contributor

I created a NUMA patch for the benchmark some time ago (just rebased it now). Quite honestly though, numactl works fine as well.
