
revise parallelisation of area2catena #16

Open
TillF opened this issue Mar 23, 2015 · 6 comments
TillF commented Mar 23, 2015

In its current form, the parallelisation of area2catena seems to require replicating the large grids for every worker. Instead, parallelising the calls using only the data required for each single EHA could improve performance significantly.

tpilz added a commit that referenced this issue Jul 28, 2016
Tried to avoid replication of complete grids during parallel processing within area2catena(). However, there are problems when trying to access one GRASS location in parallel. So far, I could not find a suitable solution but this commit can be used as a starting point for later development. See also https://grasswiki.osgeo.org/wiki/Parallel_GRASS_jobs

TillF commented Apr 24, 2017

Alternatively, we could pass only parts of the entire grids to function eha_calc(): either rectangular parts of the scene (e.g. quadrants) or even only the extents of single EHAs. I could look into that, but not at the moment.
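A minimal sketch of that idea (a hypothetical helper, not part of lumpR; it assumes the grids are held as plain R matrices and that eha_calc() would accept the cropped layers):

```r
# Hypothetical helper: extract only the bounding box of a single EHA from all
# layers, so a parallel worker receives a small sub-matrix per layer instead
# of a copy of the complete grids.
crop_layers_to_eha <- function(eha_id, eha_grid, layers) {
  idx <- which(eha_grid == eha_id, arr.ind = TRUE)  # cells of this EHA
  rr  <- range(idx[, 1])                            # row extent
  cc  <- range(idx[, 2])                            # column extent
  lapply(layers, function(m) m[rr[1]:rr[2], cc[1]:cc[2], drop = FALSE])
}

# each parallel task would then do something like:
#   eha_calc(crop_layers_to_eha(id, eha_grid, layers), ...)
```

Each worker would then hold only its EHA's bounding box in memory, at the cost of one extra scan of the EHA grid per task.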


lbergi commented Apr 24, 2017

Dear Tobias and Till,
we found out something about study-area size and RAM limitations while working with lumpR that might be interesting for you:
Our study area is quite large: at 90 m x 90 m resolution it has 75,545,028 cells (details further down). area2catena() works with 6 layers, which means 75,545,028 cells x 6 layers = 453,270,168 cells are transferred and processed in the function. That exhausts the RAM of my computer (8 GB). Using the recommended parallelisation (more cores) leads to a kind of overflow of the RAM requirements: watching the task manager, you can see the CPU drop to 1-5 % while the RAM is full (95 %). The function never finishes (I interrupted it after 5 days of computing over the long Easter weekend); you have to force-shutdown the whole machine.
The solution that worked for us is to use only 1 core. With that, the function completes successfully. But the computation takes about 12 h and the RAM is at its limit (nothing else works any more, the R session gives the dubious "cannot allocate memory" message for any action you try, and you need to restart the computer).
A possible explanation is that, when more RAM is needed than the computer offers, data are moved to swap. Transferring data to swap and back takes a lot of time, which can explain the long computation duration.

Summary: for study areas larger than ours you should use a server or a computer with more than 8 GB of RAM.

Details:
EHAs: ca. 28,000 (with the chosen parameter settings, nearly all of them are included)

Region:
projection: 1 (UTM)
zone: -23
datum: wgs84
ellipsoid: wgs84
north: 8394608.94608946
south: 7573260.73260733
west: 150000
east: 895015.88956751
nsres: 90.00090001
ewres: 89.99950345
rows: 9126
cols: 8278
cells: 75545028
Mask:
Type of Map: raster Number of Categories: 1
Data Type: CELL
Rows: 9126
Columns: 8278
Total Cells: 75545028
Projection: UTM (zone -23)
N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001
E: 895015.88956751 W: 150000 Res: 89.99950345
Range of data: min = 1 max = 1
Data Description:
generated by r.mapcalc
Comments:
if(isnull(elev_riv), null(), 1)

DATA:

Digital Elevation Model:

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: FCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 299 max = 2076 |
| |
| Data Description: |
| generated by r.mapcalc |
| |
| Comments: |
| if(mask_with_dam == 100, dem_shrink + 100, dem_shrink) |
| |
+----------------------------------------------------------------------------+

Flow Accum

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: DCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 1 max = 7048270 |
| |
| Data Description: |
| generated by r.mapcalc |
| |
| Comments: |
| abs(flow_accum_t) |
| |
+----------------------------------------------------------------------------+

Eha

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 0 |
| Data Type: CELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 21189 max = 53450 |
| |
| Data Description: |
| generated by r.grow |
| |
| Comments: |
| r.grow input="eha_t2" output="eha" radius=100 metric="euclidean" |
| |
+----------------------------------------------------------------------------+

dist_riv

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: DCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 0 max = 197.989052746315 |
| |
| Data Description: |
| generated by r.mapcalc |
| |
| Comments: |
| dist_riv_t / 90.000202 |
| |
+----------------------------------------------------------------------------+

elev_riv

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: FCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = -78 max = 669 |
| |
| Data Description: |
| generated by r.stream.distance |
| |
| |
+----------------------------------------------------------------------------+

svc

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 960 |
| Data Type: CELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 0 max = 960 |
| |
| Data Description: |
| generated by r.cross |
| |
| |
+----------------------------------------------------------------------------+

Kind regards,
Lisa and Josee


tpilz commented Apr 24, 2017

Hi Lisa and José,

yes, the problem with very large study areas is known to us. So far we have not succeeded in making the function more efficient in terms of memory handling (see comments above).

In the meantime, here are some general suggestions to improve your calculations:

Use integer values for raster data (where it makes sense)
As I see it, many of the rasters you use in the calculations are of type FCELL or DCELL. Converting (at least some of) them to type CELL will make them consume much less memory when imported into R (which is what function area2catena() does). To convert a raster, consider r.mapcalc rast_cell="round(rast_dcell)" in GRASS.

Shrink your region as much as possible
Before the calculations, make sure you only use the region of actual interest and do not import raster cells that are not needed. To do so, before running area2catena() (and essentially any GRASS operation) consider g.region zoom=region_mask.
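From within R, via the GRASS interface, this could look as follows (a sketch; "region_mask" is a placeholder raster name, and a running GRASS session initialised with initGRASS() is assumed):

```r
library(spgrass6)  # GRASS 6 interface; use rgrass7 with GRASS 7

# restrict the current region to the non-null extent of the mask raster so
# that subsequent reads into R only transfer the cells of actual interest
execGRASS("g.region", parameters = list(zoom = "region_mask"))

# print the shrunken region settings to verify
execGRASS("g.region", flags = "p")
```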

Adjust lumpR parameters
I suggest not using too many spatial units, i.e. generating fewer EHAs by setting the parameter 'eha_thres' of function lump_grass_prep() to a larger value. There always has to be a compromise between level of detail and feasibility.
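For illustration only (a hypothetical call: 'eha_thres' is the parameter in question, the value is made up, and the comment stands in for your existing lump_grass_prep() arguments):

```r
# a larger threshold (given in cells) merges small units and yields fewer EHAs
lump_grass_prep(
  # ... keep your existing arguments here ...
  eha_thres = 1000  # illustrative value; increase until the EHA count is feasible
)
```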

Break down the problem
Maybe you should break the problem down into smaller watersheds that are processed individually. Currently you are processing an area of about 600,000 km^2, and I fear lumpR will not be the only problem regarding computability. If you use WASA-SED, I fear the computation time will be very high as well (though this also depends on how many LUs and TCs you generate with prof_class()).

I hope these suggestions help. Otherwise, feel free to discuss this issue further.

Tobias


lbergi commented Apr 24, 2017

Hi Tobias!
Thank you very much for your fast reply. I will definitely try to convert my raster data to CELL.
We did some research: there is a package called bigmemory. With it, you can "store a matrix in memory, restart R, and gain access to the matrix without reloading data. Great for big data." and "can share the matrix among multiple R instances or sessions." (presentation: http://bit.ly/2oEpvrV, more details: http://bit.ly/2psGzF9)
As a raster map is basically a matrix, this might be helpful, though I think a lot of code would have to be rewritten. We wanted to share the idea with you anyway.
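A rough sketch of the idea (illustrative only, not lumpR code; file names and the random grid are made up):

```r
library(bigmemory)

# stand-in for one raster layer read from GRASS
grid <- matrix(runif(1e4), nrow = 100)

# file-backed big.matrix: the data live in 'grid.bin' on disk, outside R's heap
big <- as.big.matrix(grid, backingfile = "grid.bin",
                     descriptorfile = "grid.desc")
desc <- describe(big)  # small descriptor object, cheap to pass to workers

# another R process (e.g. a parallel worker) attaches to the same data
# instead of receiving its own copy:
worker_view <- attach.big.matrix("grid.desc")
mean(worker_view[, 1])
```

Workers then share one on-disk copy of each grid rather than replicating it per core, which is exactly the memory problem described above.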


lbergi commented Apr 28, 2017

A comment on the region settings:

I scaled down the region in my R template with
execGRASS("g.region", ...)
to avoid unnecessary computation time. When I checked my data in GRASS after Tobias' remark about the size of the data, I found that the region had somehow disappeared/was wrong: all the data and masks were too large. As we had already observed that masks are sometimes wrong, I suppose that with big data sets lazy loading prevents masks and regions from working properly.
This is what we did: in GRASS 6.4 we set up the correct region again, but via the GUI in mapset LUMP. Then I imported all my raw input data from mapset PERMANENT into mapset LUMP with
r.mapcalc "Data_mask = Data_raw@PERMANENT"
and used this in my lumpR template.
That way I can be sure that no unnecessary data is loaded and processed.
We thought it would be worth sharing, and perhaps implementing in the lumpR code.
Kind regards, Lisa


tpilz commented Apr 28, 2017

Hi Lisa,
I rather think that setting the GRASS location and region appropriately should be the user's task before applying lumpR. The software should not interfere too much here, to avoid confusing the user and causing unexpected behaviour.

By the way, I also suggest always considering g.region -s before doing any processing (within GRASS or via lumpR).
