Is there any difference about using 'OpenMPI'? #15

Closed
seongkyun opened this issue Feb 25, 2019 · 4 comments

seongkyun commented Feb 25, 2019

Hello.
I tried to run this code on Ubuntu 16.04 LTS with a GeForce TITAN X GPU and PyTorch 0.4.1.
When I run it with
mpirun -n 4 python plot_surface.py --mpi --cuda --model vgg9 --x=-0.5:1.5:401 --dir_type states --model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 --model_file2 cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=8192_wd=0.0_save_epoch=1/model_300.t7 --plot --show
nothing happens.

But if I remove mpirun -n 4, the code starts running. I think it is going through the training process,
and once that finishes I can see the plotted results.

Can I run this code without OpenMPI?
As far as I know, OpenMPI is only used for parallel computation.
So can I run the code with the following command?
python plot_surface.py --mpi --cuda --model vgg9 --x=-0.5:1.5:401 --dir_type states --model_file cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=128_wd=0.0_save_epoch=1/model_300.t7 --model_file2 cifar10/trained_nets/vgg9_sgd_lr=0.1_bs=8192_wd=0.0_save_epoch=1/model_300.t7 --plot --show

seongkyun changed the title from "How to run this code?" to "Is there any difference about using 'OpenMPI'?" on Feb 25, 2019

ljk628 (Collaborator) commented Feb 25, 2019

@seongkyun, yes, you can run the code without MPI. Running the code with MPI splits the independent calculation jobs across different GPUs; the results computed by each worker are then collected and written to disk.
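
To illustrate the split-and-collect pattern described above, here is a minimal sketch. It is not the repository's actual implementation: `split_indices`, `toy_loss`, and `run` are hypothetical names, the quadratic stands in for a real loss evaluation, and reading `--x=-0.5:1.5:401` as min:max:number of points is an assumption. The script falls back to a single process when mpi4py is not importable, which mirrors the point that MPI is optional.

```python
# Sketch only: split independent evaluations across MPI ranks, then gather on rank 0.
try:
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    rank, nproc = comm.Get_rank(), comm.Get_size()
except ImportError:
    # No mpi4py available: behave like a single worker.
    comm, rank, nproc = None, 0, 1


def split_indices(n_jobs, rank, nproc):
    """Round-robin assignment: rank r takes jobs r, r + nproc, r + 2*nproc, ..."""
    return list(range(rank, n_jobs, nproc))


def toy_loss(x):
    """Hypothetical stand-in for evaluating the loss at coordinate x."""
    return (x - 0.5) ** 2


def run(n_points=401, xmin=-0.5, xmax=1.5):
    # Evenly spaced coordinates (assumed meaning of min:max:num).
    xs = [xmin + i * (xmax - xmin) / (n_points - 1) for i in range(n_points)]

    # Each worker evaluates only its own share of the points.
    mine = split_indices(n_points, rank, nproc)
    partial = [(i, toy_loss(xs[i])) for i in mine]

    # Collect every worker's (index, value) pairs on rank 0.
    gathered = comm.gather(partial, root=0) if comm is not None else [partial]
    if rank != 0:
        return None

    losses = [None] * n_points
    for chunk in gathered:
        for i, value in chunk:
            losses[i] = value
    return losses  # rank 0 would write these to disk


if __name__ == "__main__":
    result = run()
    if result is not None:
        print("collected %d values on rank 0" % len(result))
```

Launched as `python sketch.py` this does all the work in one process; launched under `mpirun -n 4` each rank handles roughly a quarter of the points, and only rank 0 ends up with the full list.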

I have seen the same hanging issue on a newly configured machine, while the same code works well on my own machine. Could you test the simple command mpirun -n 4 python test.py? test.py contains the following code:

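# test.py: minimal check that mpirun starts several MPI processes
from mpi4py import MPI

comm = MPI.COMM_WORLD                            # communicator spanning all launched processes
rank, nproc = comm.Get_rank(), comm.Get_size()   # this process's id and the total process count
print('rank: %d' % rank)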

seongkyun (Author) commented Feb 26, 2019

@ljk628, thank you for your reply.
I just ran your test.py and the result is below:
rank: 0
Is it okay?

The installed requirements are:
pytorch 0.4.1
torchvision 0.2.1
openmpi 3.1.2
mpi4py 2.0.0
numpy 1.12.1
h5py 2.8.0
matplotlib 2.2.3
scipy 1.1.0

ljk628 (Collaborator) commented Feb 26, 2019

It should print four lines (in any order) if you use mpirun -n 4, e.g.,

rank: 0
rank: 1
rank: 3
rank: 2
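
As a rough aid for reading that test output, the check can be sketched as a small helper. `check_rank_output` is a hypothetical name, not part of this repo or of mpi4py; it only assumes the `rank: N` line format printed by test.py and ignores the order of the lines.

```python
import re


def check_rank_output(output, expected_nproc):
    """Return (ok, missing): ok is True when every rank 0..expected_nproc-1
    appears exactly once in the captured output; missing lists absent ranks."""
    ranks = []
    for line in output.splitlines():
        match = re.fullmatch(r"rank:\s*(\d+)", line.strip())
        if match:
            ranks.append(int(match.group(1)))
    missing = sorted(set(range(expected_nproc)) - set(ranks))
    ok = len(ranks) == expected_nproc and not missing
    return ok, missing


if __name__ == "__main__":
    # A healthy run with four processes, lines in arbitrary order.
    print(check_rank_output("rank: 0\nrank: 1\nrank: 3\nrank: 2\n", 4))
    # The single-line output reported earlier in this thread.
    print(check_rank_output("rank: 0\n", 4))
```

A result of `(False, [1, 2, 3])` means mpirun did not produce four distinct ranks, so the MPI setup should be checked before running plot_surface.py with --mpi.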

h5py 2.8.0 does not work with this repo. Please check #4 and #12 for related issues.

seongkyun (Author) commented

Thanks. I'll try that.
