
correct way to spawn a subprocess when running SCE-UA using multiprocess mpi #222

Closed
iacopoff opened this issue Jun 17, 2019 · 32 comments

@iacopoff
Contributor

Hi, I am calibrating a hydrological model (the VIC model) using SCE-UA.

Below is a short description of how the model is run from the calibration.py script:
Within the spotpy_setup class, under the simulation method, I call a model class which then uses the subprocess.run() function to run the VIC executable.

This works fine when I am running the script on a single core.
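
For context, a minimal sketch of that pattern (every class, file, and executable name below is a placeholder, not my actual calibration code):

import subprocess
import numpy as np
import spotpy

class spotpy_setup(object):
    def __init__(self):
        # parameter definitions and paths (placeholders)
        self.params = [spotpy.parameter.Uniform('binfilt', 0.001, 0.5)]
        self.vic_exe = './vic_image.exe'        # hypothetical VIC executable
        self.global_file = 'global_param.txt'   # hypothetical VIC config file

    def parameters(self):
        return spotpy.parameter.generate(self.params)

    def simulation(self, vector):
        # ...write the sampled values from `vector` into the VIC config/parameter files...
        # launch VIC as a child process (this is the subprocess.run() call mentioned above)
        subprocess.run([self.vic_exe, '-g', self.global_file], check=True)
        # read the simulated series back from the model output (placeholder file name)
        return np.loadtxt('vic_discharge.txt').tolist()

    def evaluation(self):
        return np.loadtxt('observed_discharge.txt').tolist()   # placeholder file name

    def objectivefunction(self, simulation, evaluation):
        return -spotpy.objectivefunctions.kge(evaluation, simulation)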

However, when I try to run the sceua algorithm in parallel (argument parallel="mpi") by running mpiexec -np python calibration.py in the terminal, I get the following error:

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

Local host: mgmt01 (PID 23174)
MPI_COMM_WORLD rank: 1

If you are absolutely sure that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

I reckon this is because I am using subprocess.run().

Has anyone encountered the same issue and would like to share a solution? Maybe using mpi4py itself (I am not familiar with multiprocessing, but I am about to dive into it)?

thanks!

iacopo

@iacopoff
Contributor Author

Update:
Using OpenMPI 3 (I was using an older version) does avoid the forking error; however, it hangs indefinitely without doing much (CPUs barely used, ~3-4%) and never runs the model.

cheers,

iacopo

@philippkraft
Collaborator

Hi iacopo,

Obviously we are using os.system, which, according to your error message, should have the same problem:

os.system('HYMODsilent.exe')

But we never saw this error before. The error must stem from mpi4py: it is not produced by our code, yet it knows about Python, so using mpi4py without spotpy will not solve the problem. After your update to OpenMPI 3, have you recompiled mpi4py?

For debugging, I would suggest putting print (or other logging) statements into the simulation function around the call that forks the child process, so that you can see what is happening, or at least where it hangs.
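
Something along these lines, for example (the command and file names are placeholders; the rank comes from the Open MPI environment variable used later in this thread):

import os
import subprocess

def run_vic_logged():
    # tag every message with the MPI rank so you can see which worker hangs
    rank = os.environ.get('OMPI_COMM_WORLD_RANK', 'seq')
    print('[rank %s] launching VIC...' % rank, flush=True)
    result = subprocess.run(['./vic_image.exe', '-g', 'global_param.txt'],   # placeholder command
                            capture_output=True, text=True)
    print('[rank %s] VIC finished with return code %s' % (rank, result.returncode), flush=True)
    if result.returncode != 0:
        print(result.stderr, flush=True)
    return result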

@philippkraft
Collaborator

An alternative would be (not tested) to use the Python driver of VIC. It uses VIC as a CPython extension, which is then executed in the same process. I am sure that with this approach it would be possible to optimize I/O usage by loading driver data at model wrapper initialization instead of at each single model run (see https://github.com/thouska/spotpy/blob/master/spotpy/examples/spot_setup_hymod_python.py).

For HBVlight we achieved a more than 10-fold speed-up with this optimization.
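
The pattern behind that suggestion, sketched generically (this is not the VIC Python driver API; the file names and the model call are stand-ins):

import numpy as np

class spot_setup_cached_io(object):
    def __init__(self):
        # heavy input reads happen exactly once, when the wrapper is initialized
        self.forcing = np.loadtxt('forcing.txt')        # placeholder driver/forcing data
        self.observations = np.loadtxt('observed.txt')  # placeholder observations

    def simulation(self, vector):
        # each run only applies the sampled parameters; no repeated input I/O
        simulated = self.forcing * vector[0]            # stand-in for the in-process model call
        return simulated.tolist()

    def evaluation(self):
        return self.observations.tolist()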

@iacopoff
Contributor Author

Philipp, thanks for your answers. Yes I have recompiled it, but still no improvement.

I agree with you that the culprit here is either mpi4py or, more likely, VIC. I might test the VIC Python driver, although it is still in the development phase!

I will post updates here; I think it could be useful for others who are using VIC 5.0.1!

iacopo

@philippkraft
Collaborator

philippkraft commented Jun 24, 2019

Hi iacopo,

I think it would be great to find the source of the problem:

a) in MPI4PY
b) in the way you are using subprocess.run (my bet)
c) in VIC

After a closer look at our parallel hymod example, I doubt we ever tested it with MPI.

@thouska: can you share the code you use to start ldndc on the HPC? Do you also use os.system?

Perhaps he can share that part of his model wrapper, since it works in parallel with MPI (OpenMPI 3.0.0 in our case).

@iacopoff
Contributor Author

Regarding point b), I have tried os.system() but without success.
Also, after the subprocess.run() call that runs VIC, I use another subprocess.run() to call the routing model (a separate executable from VIC), and that one runs without problems.

Therefore I am not sure what the issue is, but it seems VIC is causing the problem. I have also recompiled VIC 5.0.1 with OpenMPI 3.0.0 (even though I am not using VIC's built-in parallelisation), but no improvement.

I was wondering, does the SCE-UA algorithm run with multiprocessing backends other than OpenMPI?
I am not planning at the moment to run VIC on a cluster....

thanks

@iacopoff
Contributor Author

I am thinking that the problem could be that multiple instances of VIC are reading and writing from/to the same netCDFs!

@thouska
Owner

thouska commented Jun 27, 2019

Hi @iacopoff,
OK, that sounds like a very likely source for this error. Under MPI it needs to be guaranteed that every CPU core uses different files. Is it possible to tell VIC to use different netCDFs? That would be the most efficient way. Otherwise, you might want to use separate copies of VIC and have each CPU access its own folder. I provide one example where this copy is made for each run, which does the trick but is a bit demanding on the hard drive. A better solution would be to make n copies of your VIC model, where n is the number of CPU cores.

@iacopoff
Contributor Author

iacopoff commented Jun 27, 2019

Hi, I have tried again following your example:

Under the simulation method I am creating a folder for each CPU (if I interpreted your example correctly), and then copying all the input files into the new folder.

def simulation(self, vector):
    if self.parallel == 'mpi':
        # copytree and ignore_patterns come from shutil; os.environ gives the MPI rank
        call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2)
        parall_dir = self.DirFolder + "/run_" + call
        copytree(self.DirFolder, parall_dir, ignore=ignore_patterns('*.csv', 'tests', '*.py'))

Then I change the working directory, rename the files (they get a new reference given by the "call" and "parall_dir" variables), update the config file with the correct references to the new files, and run self.vicmodel.run():

os.chdir(parall_dir)

# ... rename files ...

simulations = self.vicmodel.run_vic(dir_path=parall_dir, global_file=globalFile_new, par_file=paramFile_new,
                                    binfilt=vector[0], Ws=vector[1], Ds=vector[2], c=vector[3],
                                    soil_2d=vector[4], soil_3d=vector[5])

Within self.vicmodel.run() (which should now be executing inside the new folder owned by a CPU) I am:

  • updating the netcdf parameter file
  • running VIC
  • running the routing exe

Does that sound correct?
I have run the code above from the terminal using 2 CPUs:

mpiexec -n 2 python vic_calibration_multi.py

Apart from the same forking error, I get only 1 new folder, and within it the input files correctly renamed and the config file with the correct references. However, the process just hangs there.
So why only 1 folder?

if I run with 4 CPUs:

mpiexec -n 4 python vic_calibration_multi.py

I get new folders within other folders.

Do you have an idea of what is going on? I am about to give up :)

thanks!

iacopo

@thouska
Owner

thouska commented Jun 27, 2019

Hi @iacopoff,
if you use parallel computing, one CPU will be selected to handle all the result collection, writing of spotpy files, etc. (the so-called master). All other available CPUs will be used to start whatever happens inside "def simulation"; these CPUs are the workers. So in your case with n=2 CPUs you will get only one folder, which makes sense so far.
That you get "new folders within other folders" indicates that something is probably messed up with the use of os.chdir(...). You need to make sure that you always return to your working directory. This is done in the example by:
os.chdir(self.hymod_path+call)
and
os.chdir(self.curdir)
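
One robust way to guarantee that return, whatever happens inside the model run (a sketch, not the exact example code):

import os
import subprocess

def run_in_folder(run_dir, cmd):
    # change into the per-core run folder, launch the model, and always change back
    curdir = os.getcwd()
    os.chdir(run_dir)
    try:
        subprocess.run(cmd, check=True)   # e.g. the VIC or routing call
    finally:
        os.chdir(curdir)                  # restored even if the model call raises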

@iacopoff
Contributor Author

Thanks @thouska and @philippkraft for your help.
I think I have almost resolved the issue: the IS team had wrongly compiled the various libraries upon which VIC depends (netCDF, HDF5, OpenMPI...).
I have recompiled everything, although now I have to switch to other projects for some time; when I come back to this I will post here any solution I have found.
thanks

iacopo

@iacopoff
Contributor Author

iacopoff commented Aug 7, 2019

Hi, OK, VIC is running now.

I have found that OpenMPI has issues when Python forks a child process via os.system or subprocess:
https://www.open-mpi.org/faq/?category=tuning#fork-warning
https://opus.nci.org.au/pages/viewpage.action?pageId=13141354

and this is the reference to the workaround I have applied:

https://github.com/open-mpi/ompi/issues/3158
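
For completeness: the warning quoted at the top of this issue already names the knob that silences it. If you are confident the fork is safe in your case, it can be set on the command line, for example (this only suppresses the warning; the actual workaround I applied is the one described in the linked issue):

mpiexec --mca mpi_warn_on_fork 0 -np 4 python calibration.py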

@iacopoff
Contributor Author

iacopoff commented Aug 7, 2019

I have another issue, unfortunately.

It seems that the master is not always aware of whether a worker process has finished or not, so it starts its own work before it receives the message from a worker that has completed its job.

I have added a printout from a calibration test run with these specs:

  • it uses SCE-UA
  • it runs with just three cores
  • SPOTPY version is 1.5.2

Basically it prints when a worker or the master is doing something (initializing the spotpy setup class, generating parameters, reading observations, simulating, and calculating the objective function), with a reference to the process rank (so 0 = master, > 0 = workers). My comments follow the # marks.

mpiexec -np 3 python vic_cal_spotpy_parallel.py

****** START CALIBRATION
****** START CALIBRATION
****** START CALIBRATION
***** i am INITIALISING spotpy setup in process rank: 2
***** i am GENERATING PARAMETERS in process rank: 2
***** i am INITIALISING spotpy setup in process rank: 0
***** i am GENERATING PARAMETERS in process rank: 0
***** i am reading OBSERVATION in process rank: 2
***** i am reading OBSERVATION in process rank: 0
***** i am INITIALISING spotpy setup in process rank: 1
***** i am GENERATING PARAMETERS in process rank: 1
***** i am reading OBSERVATION in process rank: 1
----- finished reading obervation in process rank: 0 observation file read?: True
----- finished reading obervation in process rank: 2 observation file read?: True
----- finished reading obervation in process rank: 1 observation file read?: True
Initializing the Shuffled Complex Evolution (SCE-UA) algorithm with 100 repetitions
The objective function will be minimized
Initializing the Shuffled Complex Evolution (SCE-UA) algorithm with 100 repetitions
The objective function will be minimized
Initializing the Shuffled Complex Evolution (SCE-UA) algorithm with 100 repetitions
The objective function will be minimized
***** i am GENERATING PARAMETERS in process rank: 0
Starting burn-in sampling...
***** i am RUNNING SIMULATION in process rank: 1 # here it starts simulation in worker 1
***** i am RUNNING SIMULATION in process rank: 2 # here it starts simulation in worker 2
executing VIC model...
executing VIC model...
executing routing...
executing routing...
----- finished simulation in process rank: 1, simulation file produced?: True # here it finishes simulation in worker 1
***** calc OBJECT FUNCTION in process rank: 0, simulation produced?: True observation produced?: True # and here the master is calculating the obj function from results of worker 1, everything is fine..
obj function result: -0.28935350204454646
1 of 100, minimal objective function=-0.289354, time remaining: 00:16:10
Initialize database...
['csv', 'hdf5', 'ram', 'sql', 'custom', 'noData']
Database file '/projects/mas1261/wp3/VIC/ba_bsn_025/run_test1/cal_kge_log_multiprocess.csv' created.
***** i am RUNNING SIMULATION in process rank: 1
***** calc OBJECT FUNCTION in process rank: 0, simulation produced?: False observation produced?: True # here the master is trying to calculate the obj function for worker 2 that has not finished yet.
Traceback (most recent call last):
File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/_algorithm.py", line 412, in getfitness
return self.setup.objectivefunction(evaluation=self.evaluation, simulation=simulation, params = (params,self.parnames))
TypeError: objectivefunction() got an unexpected keyword argument 'params'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/projects/home/iff/Dmoss/vic_calibration/research/vic_calibration/vic_cal_spotpy_parallel.py", line 265, in
startTime,endTime,timeStep,params,weights,cal_output_name,rep=n_of_runs)
File "/projects/home/iff/Dmoss/vic_calibration/research/vic_calibration/vic_cal_spotpy_parallel.py", line 232, in calibrate
sampler.sample(rep, ngs=10,kstop=100,pcento=0.0001,peps=0.0001)
File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/sceua.py", line 189, in sample
like = self.postprocessing(icall, randompar, simulations,chains=0)
File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/_algorithm.py", line 390, in postprocessing
like = self.getfitness(simulation=simulation, params=params)
File "/projects/home/iff/.local/src/python/spotpy/spotpy/algorithms/_algorithm.py", line 416, in getfitness
return self.setup.objectivefunction(evaluation=self.evaluation, simulation=simulation)
File "/projects/home/iff/Dmoss/vic_calibration/research/vic_calibration/vic_cal_spotpy_parallel.py", line 213, in objectivefunction
objectivefunction = -spotpy.objectivefunctions.kge(np.log(evaluation+1), np.log(simulation+1))
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
----- finished simulation in process rank: 2, simulation file produced?: True # here the worker 2 has finished the simulation, too late!

I hope you can read the text, and that the trace I have reproduced here and the issue I am encountering make sense.

The master is looking for the simulation output from worker 2 before the worker has finished. In fact, the master fails because it reads a NoneType.

Do you have any idea what could cause that?

thanks!

@thouska
Owner

thouska commented Aug 8, 2019

Hi @iacopoff,
thank you for your detailed error description. We will have a closer look at the MPI implementation of SCE-UA in the upcoming days to check for any new potential errors.
Meanwhile you might want to check two things:

  1. Start another sampler (e.g. Monte Carlo) to test whether your problem persists. The MPI implementation in that sampler is much simpler, i.e. there are fewer potential sources of error and they are easier to track down (see the sketch at the end of this comment).

  2. On the cluster computer I am using, printed messages do not always appear in the order in which they were produced. Therefore you might get the message:

***** calc OBJECT FUNCTION in process rank: 0, simulation produced?: False observation produced?: True # here the master is trying to calculate the obj function for worker 2 that has not finished yet.

before

----- finished simulation in process rank: 2, simulation file produced?: True # here the worker 2 has finished the simulation, too late!

You can check this by including, e.g., something like this:

import time
 
# Wait for 5 seconds
time.sleep(5)

PS: Issue #226 might or might not be linked to this issue.
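
A sketch of what switching the sampler looks like (setup class and database names are placeholders; the call pattern is the same as for sceua):

import spotpy

spot_setup_instance = spot_setup()            # your existing spotpy setup class (placeholder)
sampler = spotpy.algorithms.mc(spot_setup_instance,
                               dbname='vic_mc_test',   # placeholder database name
                               dbformat='csv',
                               parallel='mpi')
sampler.sample(100)                           # number of repetitions

launched the same way as before, e.g. mpiexec -np 3 python vic_cal_spotpy_parallel.py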

@iacopoff
Contributor Author

iacopoff commented Aug 8, 2019

Hi @thouska,
I understand that printing to the screen is not a great diagnostic tool for inferring the temporal execution of the underlying code when using multiple processes. However, when the process from the master hits the objectivefunction method, which is what the text below represents:

***** calc OBJECT FUNCTION in process rank: 0, simulation produced?: False observation produced?: True # here the master is trying to calculate the obj function for worker 2 that has not finished yet

I guess it expects an array of simulation results from the simulation method of worker 2, which it does not find, and therefore it crashes.

If it were only a problem of printing delay on the screen, then it would not crash. Or am I totally wrong?

I have tried with MC but got the same error. Also, time.sleep() does not resolve the issue with the order of the printed output.

thanks a lot for your time!

@thouska
Owner

thouska commented Aug 12, 2019

Have you tested our example script on your cluster computer? If it runs, the problem is located in the starting routine of the VIC model; if it does not run, it would be a general MPI problem.

@iacopoff
Contributor Author

Hi, thanks. What kind of executable is HYMODsilent.exe? I cannot execute it.
I mean, are you using Windows?
thanks
iacopo

@thouska
Owner

thouska commented Aug 30, 2019

Hi @iacopoff, sorry for the late response. Indeed, unfortunately our Hymod executable is only for Windows. We are now working on a Unix standalone version in #230. I will keep you updated as soon as this is finished.

@iacopoff
Contributor Author

Hi @thouska, any progress with this? Should I compile it myself?

thanks

@thouska
Owner

thouska commented Sep 18, 2019

Hi @iacopoff,
yes, @bees4ever was working on this to provide a hymod version for Unix. I just had no time to review it (#231). I have done this now and tested it on my local Unix machine, where it worked perfectly using:
mpirun -c 4 python3 tutorial_parallel_computing_hymod.py
The corresponding spotpy_setup class can be found in spotpy/examples/spot_setup_hymod_unix.py, which uses a standalone version of hymod for Unix in spotpy/examples/hymod_unix.
I merged pull request #231 and uploaded the new corresponding spotpy version (1.5.6) to PyPI.

Would you like to test whether this works on your machine and whether your reported issue remains? Then we would have a good basis to search for the causes of this issue.

@iacopoff
Contributor Author

Hi @thouska , thanks a lot!

I have tested it today and it did not work with either hymod_3.6 or hymod_3.7:

./hymod_3.6: error while loading shared libraries: libpython3.6m.so.1.0: cannot open shared object file: No such file or directory

./hymod_3.7: /lib64/libc.so.6: version `GLIBC_2.26' not found (required by ./hymod_3.7)

I have then compiled hymod myself (thanks for the *.sh file!) and this is the terminal output:

Hymod reads in from file called 'Param.in'
NameError: name '__file__' is not defined

I have seen that in hymod.pyx there is a __file__ call... maybe it is related to this:
https://stackoverflow.com/questions/51733318/cython-cannot-find-file

The important thing is that I did not get OpenMPI errors of any sort so far.

thanks

@bees4ever
Contributor

Hi @iacopoff,

Nice, the *.sh runs 😄. However, did the ./hymod call produce a Q.out file?

The other question is: which Linux do you use? Maybe I can try to reproduce the error. Otherwise I can try to work around the need for the __file__ constant.

Regards

@iacopoff
Contributor Author

Hi @bees4ever, thanks.

It does not actually produce the Q.out file.

Was it working for you with the __file__ constant? If you don't have time I am happy to try it myself; maybe you already have some hints?!

Some info on the Linux OS I am using:
cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"


uname -a

Linux mgmt01 3.10.0-514.el7.x86_64 #1 SMP Tue Nov 22 16:42:41 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux


lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Model name: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
Stepping: 2
CPU MHz: 2394.005
BogoMIPS: 4788.01
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-11


cheers!

@bees4ever
Contributor

Hi @iacopoff,

Yes, for me it worked with the __file__ constant.
I have now worked around it so that it is no longer needed. If you have some time, you could apply the patch https://patch-diff.githubusercontent.com/raw/thouska/spotpy/pull/235.patch and test whether you can now successfully build the hymod exe...

Thanks

@bees4ever
Contributor

@iacopoff I just had the idea to provide a zip here, so you can test much more easily:
hymod_cython.zip

@iacopoff
Contributor Author

@bees4ever, thanks, everything worked well!

Also, VIC now runs fine on many cores and the results look good, even if I still get the forking error. I guess that at this point I should just ignore it!

There is one point that might be worth noting: in the script https://github.com/thouska/spotpy/blob/master/spotpy/examples/spot_setup_hymod_unix.py
I had to move this line
call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2)
under the spotpy __init__ method. Otherwise it was not working.

thanks for your patience and your help!

iacopo

@bees4ever
Contributor

@iacopoff thanks for your feedback. Regarding this moved line, maybe @thouska can look into it...

@thouska
Owner

thouska commented Oct 2, 2019

Hi @iacopoff,
do you need the cpu_id (named call) set in the line
call = str(int(os.environ['OMPI_COMM_WORLD_RANK']) + 2)
If so, you should not move it to __init__, as that is called only once at the beginning of the sampling.
The idea was to use this line in order to have an individual output filename, or to generate individual model folders (one for each CPU core). If this is not done, different CPU cores might/will access/write the same file on the hard drive, which will mess up the results.
If you solve this problem somehow differently, you can ignore the call line. However, for the Hymod_exe example it is definitely needed. @bees4ever: How did you solve this in the Unix variant, is the model still writing a Q.out file? If so, we still need the call line.

@bees4ever
Contributor

@thouska yes, the hymod exe for Unix is still writing a Q.out file.

@thouska
Owner

thouska commented Oct 3, 2019

@bees4ever thanks for the quick answer! @iacopoff this means that you need this call line in the def simulation part. What kind of error message do you get there? However, as mentioned above, you can also take another route to get individual model folders, like a very exact timestamp as a string or a random number out of a high range (see the sketch below).
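
A sketch of those alternatives next to the rank-based line; whichever is used just has to be unique per parallel model run (the base directory is a placeholder):

import os
import random
import time

base_dir = '.'   # placeholder for the model directory

# option 1: MPI rank (as in the hymod example), unique per CPU core
call = str(int(os.environ.get('OMPI_COMM_WORLD_RANK', 0)) + 2)

# option 2: a very exact timestamp as a string, unique per model run
call = time.strftime('%Y%m%d_%H%M%S_') + str(time.perf_counter_ns())

# option 3: a random number out of a high range, unique with very high probability
call = str(random.randrange(10**9))

parall_dir = os.path.join(base_dir, 'run_' + call)   # one individual folder per run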

@iacopoff
Contributor Author

iacopoff commented Oct 3, 2019

@bees4ever and @thouska, sorry, maybe I was not clear:

In the hymod exe for Unix calibration, the call line under the def simulation part works fine.

In the VIC calibration I had to move it into __init__, but then used a self.call class variable. I can share the code I wrote with you sometime in the future as an example, if you like.

thanks again

@thouska
Owner

thouska commented Apr 1, 2020

I hope this issue is mainly solved. @iacopoff, if you want to share your implementation, I would be very interested to see it. If anything is still not working, feel free to reopen this issue.

@thouska thouska closed this as completed Apr 1, 2020