Correct way to spawn a subprocess when running SCE-UA using multiprocess MPI #222
Update: cheers, iacopo
Hi iacopo, obviously we are using subprocess calls to start external models as well.
But we never saw this error before. The error must stem from mpi4py, because it is not programmed by us, yet it knows about Python; using mpi4py without spotpy will not solve that problem. After your update to OpenMPI 3, have you recompiled mpi4py? For debugging, I would suggest putting print (or other logging) statements into the simulation function around the os.fork call, so that you can see what is happening, or at least where it hangs.
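A minimal sketch of such debug logging, assuming the model is started with subprocess.run from inside the simulation method (the command itself and the surrounding wrapper are placeholders, not the actual VIC setup):

```python
import os
import subprocess

def run_model(cmd):
    # "cmd" stands for whatever command line the wrapper uses to start VIC.
    print(f"[PID {os.getpid()}] before fork/exec: {cmd}", flush=True)
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(f"[PID {os.getpid()}] after fork/exec, return code {proc.returncode}",
          flush=True)
    if proc.returncode != 0:
        print(proc.stderr, flush=True)
    return proc
```

With flush=True the messages appear immediately, so a hang can be located between the last printed line and the first missing one.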
An alternative (not tested) would be to use the Python driver of VIC. It uses VIC as a CPython extension, so the model is executed in the same process. I am sure that with this approach it would be possible to optimize I/O by loading the driver data at model-wrapper initialization instead of in each single model run (see https://github.com/thouska/spotpy/blob/master/spotpy/examples/spot_setup_hymod_python.py). For HBVlight we got more than 10 times faster execution with this optimization.
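A sketch of that idea in the usual spotpy setup layout (untested; load_forcing, load_observations and run_vic_in_process are hypothetical helpers standing in for the VIC Python driver calls):

```python
import spotpy

class spot_setup(object):
    def __init__(self):
        # Expensive I/O happens exactly once, at wrapper initialization ...
        self.forcing = load_forcing("forcing.nc")          # hypothetical helper
        self.trueObs = load_observations("discharge.csv")  # hypothetical helper
        self.params = [spotpy.parameter.Uniform("infilt", 0.001, 0.4)]

    def parameters(self):
        return spotpy.parameter.generate(self.params)

    def simulation(self, x):
        # ... so each model run only changes parameters and runs in-process.
        return run_vic_in_process(self.forcing, x)         # hypothetical helper

    def evaluation(self):
        return self.trueObs

    def objectivefunction(self, simulation, evaluation):
        return -spotpy.objectivefunctions.rmse(evaluation, simulation)
```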
Philipp, thanks for your answers. Yes, I have recompiled it, but still no improvement. I agree with you that the culprit here is either mpi4py or, more probably, VIC. I might test the VIC Python driver, although it is still in its development phase! I will post updates here; I think it could be useful for others who are using VIC 5.0.1. iacopo
Hi iacopo, I think it would be great to find the source of the problem:
a) in mpi4py: after a close look at our parallel hymod example, I doubt we ever tested it with MPI. @thouska: can you share your code for starting ldndc on the HPC? Do you also use os.system? Perhaps he can share that part of his model wrapper, since that works in parallel on MPI (OpenMPI 3.0.0 in our case);
b) in the way VIC is started from Python (os.system vs. subprocess).
Regarding point b), I have tried os.system() but without success, so I am not sure what the issue is, but it seems VIC is causing the problem. I have also recompiled VIC 5.0.1 with OpenMPI 3.0.0 (even though I am not using VIC's built-in parallelisation), but no improvement. I was wondering: does the SCE-UA algorithm run with multiprocessing backends other than OpenMPI? thanks
I am thinking that the problem could be multiple instances of VIC reading and writing from/to the same netCDF files!
Hi @iacopoff, |
Hi, I have tried again, following your example: under the simulation method I create a folder for each CPU (if I interpreted your example correctly) and then copy all the input files into the new folder.
Then I change the working directory, rename the files (they get a new reference given by the "call" and "parall_dir" arguments) and update the config file with the correct references to the new files.
Within self.vicmodel.run() (which should now execute inside the new folder owned by a CPU) I am:
Does that sound correct?
Apart from the same error about forking, I get only one new folder, and within it the inputs correctly renamed and the config file with the correct references. However, the process just hangs there. If I run with 4 CPUs:
I get new folders nested within other folders. Do you have an idea of what is going on? I am about to give up :) thanks! iacopo
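For reference, a minimal sketch of the per-process working directory idea (the parall_dir naming follows the description above; detecting the process via mpi4py's rank is an assumption, since spotpy itself may hand over a different identifier):

```python
import os
import shutil
from mpi4py import MPI

def make_rank_dir(base_dir, input_files):
    # One working directory per MPI process; rank 0 = master, >0 = workers.
    rank = MPI.COMM_WORLD.Get_rank()
    work_dir = os.path.join(os.path.abspath(base_dir), f"parall_dir_{rank}")
    if not os.path.isdir(work_dir):
        os.makedirs(work_dir)
        for f in input_files:              # copy the VIC inputs only once
            shutil.copy(f, work_dir)
    return work_dir
```

Building work_dir from an absolute base_dir could also explain the "folders within folders" symptom: if the path is relative and the process has already changed into a previous run directory with os.chdir, each new folder is created inside the last one.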
Hi @iacopoff, |
Thanks @thouska and @philippkraft for your help. iacopo
Hi, OK, VIC is running now. I found that OpenMPI has issues when Python calls fork() via os.system or subprocess, and this is the reference to the workaround I applied:
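For reference, one commonly used mitigation (an assumption here, not necessarily the exact workaround that was applied) is to set OpenMPI's mpi_warn_on_fork MCA parameter to 0, which can also be done from Python before mpi4py initializes MPI:

```python
# Sketch, assuming OpenMPI: disable the fork warning via an MCA parameter.
# The environment variable must be set before MPI is initialized, i.e. before
# mpi4py (or spotpy's mpi backend) is imported anywhere in calibration.py.
import os
os.environ["OMPI_MCA_mpi_warn_on_fork"] = "0"

from mpi4py import MPI  # MPI_Init now runs with the warning disabled
```

The same parameter can be passed on the command line with `mpiexec --mca mpi_warn_on_fork 0 ...`; note that this only silences the warning, it does not make fork() under MPI any safer.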
I have another issue, unfortunately. It seems that the master is not always aware of whether a process on a worker has finished, so it starts doing its work before it has received a message from the worker that completed its job. I have added a printout from a calibration test run with these specs:
Basically, it prints whenever a worker or the master is doing something (initializing the spotpy setup class, generating parameters, getting observations, simulating and calculating the objective functions), together with the process rank (so 0 = master, > 0 = workers). My comments are in bold.
mpiexec -np 3 python vic_cal_spotpy_parallel.py
****** START CALIBRATION
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
I hope you can read the text, and that what I have reproduced here and the issue I am encountering make sense. The master is looking for the simulation output from worker 2 before it has finished; in fact the master fails because it reads a NoneType. Do you have any idea of what could cause that? thanks!
Hi @iacopoff,
the print messages of the master and the workers are not necessarily flushed to the screen in the order in which they are executed, so a master message can appear before the corresponding worker message even if the worker did finish in time.
You can check this by including e.g. something like this:
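A minimal sketch of such a check, assuming mpi4py is available (the exact snippet is an assumption, not the original suggestion):

```python
import time
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()   # 0 = master, >0 = workers

def log(msg):
    # Tag every message with the rank and a timestamp and flush immediately,
    # so that buffered stdout cannot reorder the messages on screen.
    print(f"[rank {rank}] {time.time():.3f} {msg}", flush=True)

log("entering simulation")
time.sleep(0.1)   # small pause to separate the output of different processes
log("leaving simulation")
```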
PS: The issue #226 might or might not be linked to this issue here |
Hi @thouska,
I guess it expects an array of simulation results from the simulation method of worker 2, which it does not find, and therefore it crashes. If it were only a problem of printing delay on the screen, it would not crash. Or am I totally wrong? I have tried with MC but I got the same error, and the time.sleep() does not resolve the issue with the printing order on screen either. Thanks a lot for your time!
Have you tested our example script on your cluster computer? If it runs, the problem is located in the starting routine of the VIC model; if it does not run, it would be a general MPI problem.
Hi, thanks. What kind of executable is HYMODsilent.exe? I cannot execute it.
Hi @thouska, any progress with this? Should I compile it myself? thanks
Hi @iacopoff, Would you like to test whether this works on your machine and whether your reported issue remains? Then we would have a good basis to search for the reasons for this issue.
Hi @thouska, thanks a lot! I have tested it today and it did not work with either hymod_3.6 or hymod_3.7:
I then compiled hymod myself (thanks for the *.sh file!), and this is the terminal output:
I have seen that in hymod.pyx there is a file call... maybe it is related to this. The important thing is that I did not get OpenMPI errors of any sort so far. thanks
Hi @iacopoff, nice, the *.sh runs 😄. However, did the ./hymod call produce a Q.out file? The other question is: which Linux do you use? Maybe I can try to reproduce the error. Otherwise, I can try to work around the need of having the Q.out file. Regards
Hi @bees4ever, thanks. It does not produce the Q.out file, actually. Was it working for you with the file constant? If you don't have time I am happy to try myself; maybe you already have some hints?!
Some info on the Linux OS I am using:
NAME="CentOS Linux"
CENTOS_MANTISBT_PROJECT="CentOS-7"
uname -a: Linux mgmt01 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
lscpu: Architecture: x86_64
cheers!
Hi @iacopoff, yes, for me it worked with the Q.out file. Thanks
@iacopoff I just got an idea: I provide a zip here, so you can test much more easily.
@bees4ever, thanks, everything worked well! Also, VIC now runs fine on many cores and the results look good, even if I still get the error about forking. I guess at this point I should just ignore it! There is one point that might be worth noting about the script https://github.com/thouska/spotpy/blob/master/spotpy/examples/spot_setup_hymod_unix.py: the placement of the call line (see below). Thanks for your patience and your help! iacopo
Hi @iacopoff, |
@thouska yes, the hymod exe for unix is still writing a Q.out file
@bees4ever thanks for the quick answer! @iacopoff this means that you need this Q.out file
@bees4ever and @thouska, sorry, maybe I was not clear: in the hymod exe for unix calibration, the call line under the def simulation part works fine. In the VIC calibration I had to move it into the init, using a self.call class variable instead. I may share the code I wrote with you sometime in the future, if you like, as an example. Thanks again.
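A rough sketch of the difference (names are placeholders, not the actual VIC wrapper; the rank lookup via mpi4py is an assumption): the per-process identifier is determined once in __init__ and stored as self.call, and simulation() only reuses it:

```python
class spot_setup(object):
    def __init__(self, parallel="seq"):
        if parallel == "mpi":
            from mpi4py import MPI
            self.call = str(MPI.COMM_WORLD.Get_rank())
        else:
            self.call = "0"
        self.parall_dir = f"run_{self.call}"   # hypothetical per-process folder

    def simulation(self, x):
        # Reuse self.call / self.parall_dir set up once in __init__,
        # write the parameter set, run VIC there and read the results back.
        return run_vic_in(self.parall_dir, x)  # hypothetical helper
```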
I hope this issue is mainly solved. @iacopoff, if you want to share your implementation, I would be very interested to see it. If anything is still not working, feel free to reopen this issue.
Hi, I am calibrating a hydrological model (the VIC model) using SCE-UA.
Below is a short description of the way the model is run from the calibration.py script:
Within the spotpy_setup class, under the simulation method, I call a model class which then uses the subprocess.run() function to run the VIC executable.
This works fine when I am running the script on a single core.
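Roughly, the simulation method looks like this (a sketch only; the executable path, the -g flag and the helper functions are placeholders for the actual VIC call and output parsing):

```python
import subprocess

def simulation(self, x):  # method of the spot_setup class described above
    write_vic_parameters(x)                       # hypothetical helper
    subprocess.run([self.vic_exe, "-g", self.global_param],  # placeholder command
                   check=True, capture_output=True, text=True)
    return read_simulated_discharge()             # hypothetical helper
```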
However, when I try to run the SCE-UA algorithm in parallel (argument parallel="mpi"), launching it from the terminal with mpiexec -np <number of processes> python calibration.py, I get the following error:
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.
The process that invoked fork was:
Local host: mgmt01 (PID 23174)
MPI_COMM_WORLD rank: 1
If you are absolutely sure that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
I reckon this is because I am using subprocess.run().
Has anyone encountered the same issue and would like to share a solution? Maybe using mpi4py itself (I am not familiar with multiprocessing, but I am about to dive into it)?
thanks!
iacopo