Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to enable run fmask with Rios parallel process #6

Closed
gillins opened this issue Nov 3, 2016 · 11 comments
Closed

How to enable run fmask with Rios parallel process #6

gillins opened this issue Nov 3, 2016 · 11 comments
Labels
bug Something isn't working major

Comments

@gillins
Copy link
Member

gillins commented Nov 3, 2016

Original report by Xavier Corredor Llano (Bitbucket: XavierCLL, GitHub: XavierCLL).


HI,

I want to process fmask in parallel for improve the time of run. I tried implement the parallel process with jobmanager in Rios library described in here

I search the "controls = applier.ApplierControls()" in python-fmask code and change with:

#!python
controls = applier.ApplierControls()
controls.setNumThreads(multiprocessing.cpu_count())
controls.setJobManagerType("multiprocessing")

For some process work fine, but in others not:

In landsatTOA.py changed in line 167, produce the following error:

#!python
  File "/usr/bin/fmask_usgsLandsatTOA.py", line 48, in mainRoutine
    landsatTOA.makeTOAReflectance(cmdargs.infile, cmdargs.mtl, cmdargs.anglesfile, cmdargs.output)
  File "/usr/lib/python2.7/site-packages/fmask/landsatTOA.py", line 175, in makeTOAReflectance
    applier.apply(riosTOA, inputs, outputs, otherinputs, controls=controls)
  File "/usr/lib/python2.7/site-packages/rios/applier.py", line 674, in apply
    outBlocksList = jobmgr.runSubJobs(userFunction, jobInputs)
  File "/usr/lib/python2.7/site-packages/rios/parallel/jobmanager.py", line 209, in runSubJobs
    outputBlocksList = self.gatherAllOutputs(jobIDlist)
  File "/usr/lib/python2.7/site-packages/rios/parallel/jobmanager.py", line 808, in gatherAllOutputs
    output = job.get(timeout=None)            
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
KeyError: 139830344649824

Happens the same if tried enable parallel rios in fmask.py line 217 and line 1189.

Any idea of this issue?

Thanks

@gillins
Copy link
Member Author

gillins commented Nov 4, 2016

Original comment by Neil Flood (Bitbucket: neilflood, GitHub: neilflood).


Hi Xavier,

Oh dear. That seems bad.

It is quite possible that you have run into a problem with the way rios handles things when using the multiprocessing option. That option has never been very well tested, as we use some of the other parallel options instead.

However, before we get too far into that, I should say that I would be surprised if Fmask benefits much from running in parallel. Most of it is (I think) largely I/O limited, rather than compute limited. The one part which is strongly compute limited is the matching of clouds to shadows, and this is not written so be able to benefit from parallel processing, and runs across the whole image at once (because it does not know how far it will have to search to find the shadow). So, I would expect that running parallel in the way yo are suggesting will not actually help much.

Do you think that is correct?

It is very brave of you to try this, by the way. :-)

Neil

@gillins
Copy link
Member Author

gillins commented Nov 4, 2016

Original comment by Xavier Corredor Llano (Bitbucket: XavierCLL, GitHub: XavierCLL).


Hi Neil

Yeah you are right, maybe some process need calculate in whole image and this makes it impossible run in parallel. But, I would know which is the best option for run in parallel in Rios library (the most tested). The parallel process works for some parts of code and it increase the velocity of run (somewhat), I recomment it (maybe this can a new feature, that the user can choose if want run in parallel)

On other hand, thinking in improve the time of run, I saw some process that maybe are unnecessary like as "Computing Statistics" and "Computing Pyramid Layers" is really neccesay for fmask process? can I disabled this? how?

thanks a lot :)

@gillins
Copy link
Member Author

gillins commented Nov 4, 2016

Original comment by Sam Gillingham (Bitbucket: gillins, GitHub: gillins).


Hi Xavier,

It turns out that landsatTOA.py calls info.getNoDataValueFor() which does not work when running in parallel. I've added a paragraph to the RIOS docstring about this: https://bitbucket.org/chchrsc/rios/commits/2131b9e9381e6621978c0a7dc00d39a9f3b4f0c7.

There is no reason why RIOS could not cache the no data value in the ReaderInfo object rather than calling through to GDAL each time, and this would make it run in parallel. Feel free to open a PR against RIOS and we can discuss the finer details.

However, I would echo Neil's sentiments about speed. Try removing the call to getNoDataValueFor and see what happens. I tend to find that for simple calculations like this things tend to run slower, not faster, due to the extra communication overhead.

Regarding pyramid layers and statistics, you can disable them with this call on the controls object: http://rioshome.org/en/latest/rios_applier.html#rios.applier.ApplierControls.setCalcStats

Sam.

@gillins
Copy link
Member Author

gillins commented Nov 4, 2016

Original comment by Sam Gillingham (Bitbucket: gillins, GitHub: gillins).


I should have said that the reason for the weird error was that Python 2.x can't seem to report the exception in the sub process properly, if you run with Python 3.x you will see the exception raised when calling getNoDataValueFor() in the sub process. See http://stackoverflow.com/questions/29080650/debugging-errors-in-python-multiprocessing.

@gillins
Copy link
Member Author

gillins commented Nov 7, 2016

Original comment by Xavier Corredor Llano (Bitbucket: XavierCLL, GitHub: XavierCLL).


Hi Sam, thanks for your explanation,

I tried set info.getNoDataValueFor() to None in landsatTOA.py and works fine!, but in fmask.py in doPotentialCloudFirstPass for enable multiprocessing I set the info.getNoDataValueFor() to None in potentialCloudFirstPass, the result is different compared with not parallel version. The error persist too in finalizeAll in fmask.py changing the ApplierControls.

On the other hand, I set False to setCalcStats in all controls Rios instance for disable "Computing Statistics" and "Computing Pyramid Layers", improve the run of process, now required the 50%!, but the result compared with normal run is so different (mainly for the shadow) is weird, I thinked that this process was independent, is setCalcStats part of the fmask process?

Thanks a lot

@gillins
Copy link
Member Author

gillins commented Nov 7, 2016

Original comment by Sam Gillingham (Bitbucket: gillins, GitHub: gillins).


Hi Xavier

Neil may comment more when he is back from holidays, but getting and seeing the no data (which is what calculating the stats also does) is very important for fmask operation so no surprise you are getting different answers.

My reason for suggesting that you do this was to show that the speed is actually worse when running in parallel. I'm guessing this is what you found, correct?

Sam

@gillins
Copy link
Member Author

gillins commented Nov 7, 2016

Original comment by Xavier Corredor Llano (Bitbucket: XavierCLL, GitHub: XavierCLL).


Hi Sam,

Thanks, It good idea know more about that, maybe Neil help me about it.

With some test with some image, the partial parallel improve the time of run it's not much, but for me is better implement the parallel even if it is partial.

Thanks

@gillins
Copy link
Member Author

gillins commented Nov 13, 2016

Original comment by Sam Gillingham (Bitbucket: gillins, GitHub: gillins).


Hi Xavier,

I'm interested that you seem to get an improvement in speed. On my AMD machine running Linux Mint running fmask_usgsLandsatTOA.py with 4 CPUs is actually 6% slower than one. So multiprocessing seems a waste of time in this case.

Are you getting different results from this? And if so, what sort of hardware do you have?

Sam.

@gillins
Copy link
Member Author

gillins commented Nov 14, 2016

Original comment by Neil Flood (Bitbucket: neilflood, GitHub: neilflood).


Hi Xavier,

I have taken a moment to look into all this. There are several points to discuss.

I have modified the null value handling in fmask.py, so that it never uses getNoDataValueFor(), and this should allow it to be run with parallel, if you wish. However, parallel can only be used on the third pass and the final pass sections. Both the first and second passes are accumulating information between processing blocks, so these blocks cannot be run in parallel. I have tested this, and found no useful speed-up, with up to 30 threads. I believe that this is because the processing is I/O limited.

I modified landsatTOA.py to avoid calculating the stats and pyramids, and instead to set the null value explicitly on the output file. It is important to have the null value set on the TOA image file, as this is used to exclude the null area from consideration in the Fmask algorithm (very important to do this). Doing this made the TOA calculation neary 50% faster, as you said, but it is only about 10 seconds on my hardware, so I had not really felt it to be important. However, I have committed this change, too, if you think it is helpful. I am guessing that the lack of the null value would have been the reason for the difference in results which you found. I get identical results with these changes.

The stats and pyramids are not calculated in any of the intermediate images from fmask.py - I already had setCalcStats(False) in all cases. The stats and pyramids are calculated for the final output cloud image, but I believe that this is useful, because it makes the image easier to view. It takes only 4 seconds on my hardware, so I am going to leave that in place.

Thanks for being so keen on this project. It is really nice to have a strong interest from users. I hope that this discussion has been helpful to you.

Neil

@gillins
Copy link
Member Author

gillins commented Nov 27, 2016

Original comment by Xavier Corredor Llano (Bitbucket: XavierCLL, GitHub: XavierCLL).


Hi Neil and Sam,

Sorry for the delay, I was very busy the last week.

Neil you are right and thanks for your explanation. I made a little test, first I ran normally the fmask, after I enabled the parallel process and finally disable stats for some process (no for the final result), the result are (in my little laptop with 8 cores, for a image of ladsat 8, time for angle, saturation, toa and stacked function):

  • time run normal (without parallel): 4:34
  • time run (+ parallel): 4:18
  • time run (+ parallel and - stats): 4:09

And yeah, the parallel process is not useful for improve the speed-up, as you said.

I made a clone with this changes maybe is useful for you or for someone interested about this topic: https://bitbucket.org/XavierCLL/python-fmask

Thanks a lot Neil ans Sam,

Regards

@gillins
Copy link
Member Author

gillins commented Nov 27, 2016

Original comment by Neil Flood (Bitbucket: neilflood, GitHub: neilflood).


Hi Xavier,

thanks for all that. Good to know that my intuitions on this were correct. Thanks for investigating it thoroughly.

cheers
Neil

@gillins gillins closed this as completed Nov 27, 2016
@gillins gillins added major bug Something isn't working labels Nov 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working major
Projects
None yet
Development

No branches or pull requests

1 participant