Thunder run on GPU cluster: parameter, benchmark, scratch #14

Closed
sunny1226 opened this issue Jun 3, 2019 · 3 comments

Comments

@sunny1226

sunny1226 commented Jun 3, 2019

We now have a GPU cluster of four nodes, each with four 2080 Ti GPUs. Each node has E5-2650 CPUs (2 x 12 cores) and 256 GB of physical memory. We have installed NVIDIA driver 410.93, CUDA 9.2, and NCCL 2.4.2 for THUNDER, and we are trying to run THUNDER on this cluster. We have several questions.

  1. How should we specify processes and threads? We read that on a cluster we should run one process per node and set the thread count to the number of CPU cores. However, when we wanted to run THUNDER on only 2 nodes (the other two were in use by others), we found that mpirun -np 2 does not work. Would more processes speed up the job, or only more threads? Please also suggest thread settings for running THUNDER on 2 nodes.
  2. We want to run a benchmark to check that our THUNDER installation works. So far I have used the RELION benchmark data (EMPIAR 10028 ribosome, 51 GB, ~10K particles). How long should THUNDER take to process such a dataset on one of our nodes? I also found a THUNDER benchmark dataset on GitHub, but I cannot download it. Should I use that dataset for the THUNDER benchmark?
  3. Our cluster has an SSD scratch disk on each machine, not shared. In RELION and cryoSPARC v2, we can point the scratch directory at local scratch. However, I could not find where to set or enable local scratch in THUNDER. Can I use local scratch? If so, how?
  4. I found that THUNDER copies my benchmark data into physical memory. However, with millions of particles it will run out of physical memory and the job will fail (this happened on our old workstation, which has only 128 GB, when processing the EMPIAR 10028 particles). If I do not want the particles loaded into physical memory, what should I do?
    Thanks!
@Zarrathustra
Collaborator

  1. For two nodes, we recommend putting two processes on the first node and one process on the second one.
  2. Running speed depends on many environmental factors, such as memory, GPUs, CPUs, network bandwidth, and disk bandwidth, so I am sorry that I cannot give you an estimate. However, there are some benchmark tests in https://www.nature.com/articles/s41592-018-0223-8. Moreover, you can use https://github.com/thuem/THUNDER-demo-datasets as benchmarks. The download failure is due to missing Git LFS support; see https://git-lfs.github.com.
  3. Sorry, there is currently no local scratch support in THUNDER.
  4. Currently, THUNDER reads all particles into physical memory. We are working on a memory buffer system that will load particles into physical memory only when needed, and we hope to release this feature soon. A present workaround for this "out of memory" issue is to use swap: configuring a larger swap partition should resolve it.
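
To judge how much swap the workaround in item 4 might require, a back-of-the-envelope estimate can help. The sketch below is only an illustration and is not part of THUNDER: it assumes each particle is a square image stored as 4-byte floats, and it ignores any additional buffers THUNDER allocates, so treat the result as a rough lower bound. The function names and the example numbers are hypothetical.

```python
def estimate_particle_memory_gib(n_particles, box_size, bytes_per_pixel=4):
    """Rough size in GiB of a particle stack held entirely in RAM.

    Assumes square images of box_size x box_size pixels and a fixed
    number of bytes per pixel (4 for single-precision floats).
    """
    total_bytes = n_particles * box_size * box_size * bytes_per_pixel
    return total_bytes / 2**30


def swap_shortfall_gib(required_gib, physical_gib):
    """Memory in GiB that would spill to swap; 0.0 if the stack fits."""
    return max(0.0, required_gib - physical_gib)


if __name__ == "__main__":
    # Hypothetical dataset: one million particles in 256 x 256 pixel boxes.
    required = estimate_particle_memory_gib(1_000_000, 256)
    shortfall = swap_shortfall_gib(required, physical_gib=128)
    print(f"particle stack: {required:.1f} GiB")
    print(f"beyond 128 GiB of RAM: {shortfall:.1f} GiB")
```

If the shortfall is greater than zero, the swap partition would need to be at least that large, plus headroom for the rest of the program.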

Best regards,

Mingxu
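
The placement recommended in item 1 (two processes on the first node, one on the second) can be expressed as an MPI hostfile. The sketch below assumes Open MPI, whose hostfile format is `hostname slots=N`; the node names `node1` and `node2` and the helper functions are hypothetical, and extending the same pattern beyond two nodes is an assumption rather than something stated in this thread.

```python
def build_hostfile(nodes, extra_on_first=1):
    """Return Open MPI hostfile text.

    Every node gets one slot; the first node gets extra_on_first
    additional slots (one extra matches the two-node advice above).
    """
    lines = []
    for index, node in enumerate(nodes):
        slots = 1 + (extra_on_first if index == 0 else 0)
        lines.append(f"{node} slots={slots}")
    return "\n".join(lines) + "\n"


def total_processes(nodes, extra_on_first=1):
    """Total MPI process count implied by the hostfile."""
    return len(nodes) + extra_on_first


if __name__ == "__main__":
    nodes = ["node1", "node2"]  # hypothetical host names
    print(build_hostfile(nodes), end="")
    print(f"launch with -np {total_processes(nodes)}")
```

With the output saved as `hosts.txt`, Open MPI can be launched with `mpirun --hostfile hosts.txt -np 3` followed by the usual THUNDER command line, which is not shown here because it depends on the installation.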

@sunny1226
Author

Thanks. Also, does THUNDER support LFS clusters?

@Zarrathustra
Collaborator

Sure.

At Tsinghua, we use LFS and SLURM as job managers.

If you have any problems running THUNDER with the LFS job manager, please contact us.
