
Continuous Integration of an existing MPI application with Pytorch MPI Backend #33943

Closed
vibhatha opened this issue Feb 28, 2020 · 7 comments
Labels
module: mpi - Problems related to MPI support
small - We think this is a small issue to fix. Consider knocking off high priority small issues
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

vibhatha commented Feb 28, 2020

🐛 Bug

I am using the PyTorch MPI backend together with an existing MPI application: the data pre-processing is done with MPI (via mpi4py) and the processed data is then fed to PyTorch.
Doing this with the distributed package and the MPI backend fails as described below.

To Reproduce

To reproduce the error, simply run the following script (e.g. with mpirun).

import mpi4py

# Keep mpi4py from initializing or finalizing MPI on import; the
# application manages the MPI lifecycle explicitly below.
mpi4py.rc(initialize=False, finalize=False)

from mpi4py import MPI

import torch.distributed as dist1

MPI.Init()  # first initialization, done by the application

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# The MPI backend initializes MPI again internally, which triggers the error.
dist1.init_process_group("mpi", rank=rank, world_size=size)

print("Rank {} Size {}".format(rank, size))

MPI.Finalize()

The error is the following:

--------------------------------------------------------------------------
Open MPI has detected that this process has attempted to initialize
MPI (via MPI_INIT or MPI_INIT_THREAD) more than once.  This is
erroneous.
--------------------------------------------------------------------------
[vibhatha:19276] *** An error occurred in MPI_Init_thread
[vibhatha:19276] *** reported by process [3787915265,1]
[vibhatha:19276] *** on a NULL communicator
[vibhatha:19276] *** Unknown error
[vibhatha:19276] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[vibhatha:19276] ***    and potentially your MPI job)
[vibhatha:19268] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[vibhatha:19268] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2079
[vibhatha:19268] 3 more processes have sent help message help-mpi-runtime.txt / mpi_init: invoked multiple times
[vibhatha:19268] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[vibhatha:19268] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

Expected behaviour

My guess is that the dist backend initializes MPI without first checking whether MPI is already initialized. If it performed that check before initializing, I think this error would not occur.
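
In the meantime, a workaround that should avoid the double initialization is to let torch.distributed perform the single MPI initialization and only query MPI through mpi4py afterwards. The sketch below assumes the MPI backend derives rank and world size from the MPI runtime when they are not passed, and that it also takes care of finalization:

import mpi4py

# Keep mpi4py from initializing or finalizing MPI on import.
mpi4py.rc(initialize=False, finalize=False)

from mpi4py import MPI

import torch.distributed as dist1

# Let the MPI backend perform the one MPI initialization; rank and world
# size are taken from the MPI runtime (assumption).
dist1.init_process_group("mpi")

# MPI is initialized at this point, so mpi4py can be used for the
# pre-processing logic on the same COMM_WORLD.
comm = MPI.COMM_WORLD
print("Rank {} Size {}".format(comm.Get_rank(), comm.Get_size()))

# No explicit MPI.Finalize() here; finalization is left to the backend
# (assumption about its at-exit behaviour).
dist1.destroy_process_group()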

Environment

Please copy and paste the output from our
environment collection script
(or fill out the checklist below manually).

You can get the script and run it with:

wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
  • PyTorch Version (e.g., 1.0): 1.14 (1.4.0a0+5375cea)
  • OS (e.g., Linux): Ubuntu 19.10
  • How you installed PyTorch (conda, pip, source): Installed from source
  • Build command you used (if compiling from source): python3 setup.py install
  • Python version: 3.7.5
  • CUDA/cuDNN version: None
  • GPU models and configuration: None
  • Any other relevant information: None

Additional context

ezyang added the module: mpi, small, and triaged labels on Mar 2, 2020
ezyang (Contributor) commented Mar 2, 2020

I would agree with your diagnosis. Do you think you'd be able to submit a PR fixing this?

vibhatha (Author) commented Mar 2, 2020

Sure, I will do that. Thanks a lot for checking this.

vibhatha (Author) commented Mar 4, 2020

@ezyang

I am following the contributor's guide. As I understand it, I need to modify the places in the code where MPI is initialized without first checking whether it is already initialized.

CPP Tests

Is there a developer guide for running cpp test cases?

For instance, the following test:

https://github.com/pytorch/pytorch/blob/master/caffe2/mpi/mpi_test.cc

Python Tests

I want to set up a few test cases for the distributed module

python3 test/distributed/test_distributed.py

I guess this is the best place to test this behaviour. Is this the correct place to work on the test cases?
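
Roughly, the regression test I have in mind would look like the sketch below (names and exact placement in test_distributed.py are still tentative); it would have to be launched with mpirun:

import mpi4py
mpi4py.rc(initialize=False, finalize=False)
from mpi4py import MPI

import torch.distributed as dist

# The application initializes MPI first, as in the bug report above.
MPI.Init()
assert MPI.Is_initialized()

comm = MPI.COMM_WORLD

# With the fix, the backend should notice that MPI is already initialized
# instead of initializing it a second time.
dist.init_process_group("mpi", rank=comm.Get_rank(), world_size=comm.Get_size())

assert dist.get_rank() == comm.Get_rank()
assert dist.get_world_size() == comm.Get_size()

dist.destroy_process_group()

# Whether the application or the backend finalizes MPI in this scenario is
# part of what the fix has to define; this mirrors the original repro.
MPI.Finalize()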

ezyang (Contributor) commented Mar 4, 2020

@vibhatha The Python location looks reasonable. I wouldn't worry too much about the C++ tests; they'll run on our CI when you open a PR, and we can talk about how to run them if they're specifically failing. (BTW, the caffe2 tests aren't applicable, so don't worry about those.)

vibhatha (Author) commented Mar 5, 2020

Great :)

I will look into the code and evaluate the possible changes, and then add tests at the Python test level.

Keep you posted :)

pyrito commented Sep 23, 2021

Has there been an update on this issue by any chance? I may be running into a similar situation.

vibhatha (Author) commented
I couldn't work on this when I reported it, but I think I can allocate some time for this now.
