Additional CPU threads occupy significant memory space #81

Closed
kachind opened this issue Jan 20, 2019 · 13 comments

Comments

kachind commented Jan 20, 2019

When running grin-miner with the c31 plugin, a single thread uses ~11 GB of memory, as expected. However, running 64 threads consumes 15.5 GB, and going much beyond that causes a memory allocation crash.

This system is a Xeon Phi 7210, which has 256 logical cores and 16 GB of high-bandwidth memory. Is a ~70 MB allocation per thread actually necessary? Are there any mitigating actions that could be taken here?
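(Working backwards from the figures above: (15.5 GB - 11 GB) / 64 threads ≈ 70 MB of additional memory per thread, which is where that estimate comes from.)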

tromp (Owner) commented Jan 20, 2019

Each thread needs an additional 42 MB, which is mostly sizeof(yzbucket).
I don't know why yours would have 70 MB per thread.
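For readers unfamiliar with the solver internals, here is a generic illustration of why a bucketed edge-trimming solver needs extra memory per worker thread. This is not the actual yzbucket layout; every constant below is a made-up placeholder chosen only to land in roughly the right ballpark.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Placeholder constants -- NOT the values used in the real solver.
constexpr std::size_t NUM_BUCKETS      = 1 << 12;  // destination buckets per pass
constexpr std::size_t SLOTS_PER_BUFFER = 1 << 11;  // edges staged before a flush
constexpr std::size_t BYTES_PER_SLOT   = 5;        // one packed edge entry

// Each worker thread keeps its own small staging buffer per destination
// bucket, so edges can be flushed to the shared bucket arrays in large,
// mostly sequential bursts instead of one random DRAM write per edge.
struct ThreadLocalStaging {
    std::vector<std::uint8_t> bytes;
    ThreadLocalStaging() : bytes(NUM_BUCKETS * SLOTS_PER_BUFFER * BYTES_PER_SLOT) {}
};

int main() {
    // The big bucket arrays are shared; only the staging area is replicated,
    // so total memory grows linearly with the thread count.
    const ThreadLocalStaging staging;
    const std::size_t per_thread = staging.bytes.size();
    const int thread_counts[] = {1, 32, 64, 256};
    for (int nthreads : thread_counts)
        std::printf("%3d threads -> %zu MB of staging buffers\n",
                    nthreads, (nthreads * per_thread) >> 20);
    return 0;
}

With these placeholder numbers each thread adds roughly 40 MB, the same order as the 42 MB above; the only point is that the staging area is replicated per thread while the large bucket arrays are shared.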

kachind (Author) commented Jan 20, 2019

Thanks, that's helpful. How do CUDA / OpenCL implementations avoid multiple 42 MB allocations?

I've noticed some other bizarre behavior at high thread counts, particularly that around 32 threads the GPS stops increasing. If I start two 32-thread instances, the GPS is roughly double that of a single 64-thread instance. I can open a new issue for that if you like.

tromp (Owner) commented Jan 20, 2019

You have limited memory bandwidth, so once it is exhausted, additional threads just slow things down through increased contention for that limited bandwidth.

kachind (Author) commented Jan 20, 2019

That doesn't explain why two separate instances break the bottleneck, though; it's the same total number of threads.

The Xeon Phi has approximately 500 GiB/sec of memory bandwidth to the MCDRAM, so I'm skeptical that's the issue.

tromp (Owner) commented Jan 20, 2019

That's sequential bandwidth. But your accesses are not all that sequential, causing a lot of memory latency as well. So I think of it as exhausting the latency bandwidth.

kachind (Author) commented Jan 20, 2019

Does this AVX implementation produce much less sequential memory access than the OpenCL/CUDA implementations do? Otherwise I would expect GPUs to suffer from the same bandwidth issues, but they clearly are achieving significantly better performance.

Sorry for all the questions, I'm not familiar with the details of this algorithm.

timolson commented Jan 20, 2019

> 500 GiB/sec of memory bandwidth

Bandwidth numbers are completely misleading unless you adjust for CAS latency. You can only get 500 GiB/s writing straight down DRAM, with zero random access. If the implementation of Cuckoo Cycle is caching partition writes into blocks of 16 edges at a time, then typical DDR4-3000 will slow down more than 50%. Using MCDRAM instead of DDR DIMMs does not help with latency at all.
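To put rough numbers on the latency argument, here is a back-of-the-envelope sketch; the 64-byte transfer and 80 ns round-trip latency are round illustrative figures, not measured Phi numbers.

#include <cstdio>

int main() {
    // Illustrative figures only: one 64-byte cache line per fully random
    // access, with ~80 ns of end-to-end DRAM latency per access.
    const double bytes_per_access = 64.0;
    const double latency_s        = 80e-9;

    // With accesses fully serialized (nothing in flight), the achievable
    // rate is tiny compared with the ~500 GiB/s streaming figure above.
    const double serialized = bytes_per_access / latency_s;  // ~0.8 GB/s
    std::printf("serialized random access: %.2f GB/s\n", serialized / 1e9);

    // Getting back toward peak requires keeping many independent accesses
    // in flight, which is what adding threads tries to do -- and where the
    // contention discussed in this thread comes from.
    const int queue_depths[] = {8, 64, 512};
    for (int in_flight : queue_depths)
        std::printf("%4d accesses in flight: %6.1f GB/s\n",
                    in_flight, in_flight * serialized / 1e9);
    return 0;
}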

That being said, I find it dubious that running two solvers simultaneously would help this situation, unless they are writing to the same pages in RAM. If you allocate separate memory for each solver, that would only exacerbate the latency problem, so something else must be going on.

timolson commented Jan 20, 2019

> Otherwise I would expect GPUs to suffer from the same bandwidth issues

They absolutely do suffer from this problem. Nvidia GPUs can only batch up to 32 words of 32 bits each. GDDR5 has a much higher transfer rate than DDR4, however, and GPUs compute siphash far faster than a CPU: 2^29 siphashes on a 1080 Ti take about 30-35 ms. I don't have a Xeon Phi, but I'm guessing it takes seconds.
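(For scale: 2^29 ≈ 5.4 × 10^8 siphashes in ~32 ms works out to roughly 17 billion siphashes per second.)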

tromp (Owner) commented Jan 20, 2019

Siphashes on a Xeon Phi would benefit from AVX-512 instructions, but those are not supported in my solvers, and they would still not make the Phi competitive with GPUs.

kachind (Author) commented Jan 20, 2019

Yeah, here are the test details. I have to use the c29 algorithm, since otherwise I run out of memory.

(Figures are plus or minus a couple of percent, since the solve time varies slightly between graphs.)

One 64-thread instance:

[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 64

GPS: 0.24

Two 32-thread instances:

[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 0

[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 1

Device 1 GPS: 0.18
Device 2 GPS: 0.18


I was hoping for a simple solution but it sounds like that's unlikely.

In your opinion, if the CUDA implementation were ported over to OpenMP or similar, would the performance still be poor? The Xeon Phi KNL architecture has L1 and L2 caches, which I had hoped might mitigate the MCDRAM access latency, but if all of the accesses are totally unpredictable/random then it's obviously hopeless, and there's no practical motivation to fix this issue.

kachind (Author) commented Jan 20, 2019

As an aside... KNL does support a family of 512-bit gather/scatter prefetch instructions as part of AVX512-PF, and its cores are out-of-order with 4 hardware threads per physical core.

https://www.felixcloutier.com/x86/vgatherpf1dps:vgatherpf1qps:vgatherpf1dpd:vgatherpf1qpd

This would permit a thread to assemble a full cache line from non-sequential data, provided it's all located in the same 4 KiB page, and the processor would not stall since it could move to a different hyperthread. Is that relevant here? Is data generally dispersed within pages, between pages, or both?
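For concreteness, here is a minimal sketch of what those gather-prefetch intrinsics look like in use. It assumes a plain array of 32-bit floats and a compiler targeting AVX-512F plus AVX-512PF; it is only an illustration of the instruction's shape, not code from the solver.

#include <immintrin.h>
#include <cstdint>

// Illustrative only. Build with -mavx512f -mavx512pf (KNL-class hardware).
float gather_sum(const float* table, const std::int32_t idx[16]) {
    __m512i vindex = _mm512_loadu_si512(idx);

    // VGATHERPF1DPS: hint the cache lines holding table[idx[i]] into L2.
    _mm512_prefetch_i32gather_ps(vindex, table, /*scale=*/4, _MM_HINT_T1);

    // ... independent work (or another hardware thread) runs here while the
    // prefetched lines arrive ...

    __m512 v = _mm512_i32gather_ps(vindex, table, /*scale=*/4);  // actual loads

    float tmp[16];
    _mm512_storeu_ps(tmp, v);
    float sum = 0.0f;
    for (float x : tmp) sum += x;
    return sum;
}

Whether this hides latency in practice depends on having enough independent work between computing the indices and consuming the data; if each lookup depends on the previous one, there is nothing to overlap.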

tromp (Owner) commented Jan 20, 2019

So the Phi performs worse than an i7 on the AVX2 solver, and even if a Phi expert were to devote significant effort to optimizing the solver, they might quadruple the performance and still fall far short of a GPU...

kachind (Author) commented Jan 20, 2019

Fair enough, closing issue.

kachind closed this as completed Jan 20, 2019