
revise parallelisation of area2catena #16

Open
TillF opened this issue Mar 23, 2015 · 6 comments
TillF commented Mar 23, 2015

In its current form, the parallelisation of area2catena seems to require replicating the large grids for every worker. Instead, parallelising the calls using only the data required for each single EHA could improve performance significantly.

tpilz added a commit that referenced this issue Jul 28, 2016
Tried to avoid replication of complete grids during parallel processing within area2catena(). However, there are problems when trying to access one GRASS location in parallel. So far, I could not find a suitable solution but this commit can be used as a starting point for later development. See also https://grasswiki.osgeo.org/wiki/Parallel_GRASS_jobs

TillF commented Apr 24, 2017

Alternatively, we could pass only parts of the entire grids to function eha_calc(): either rectangular parts of the scene (e.g. quadrants) or even only the extents of single EHAs. I could look into that, but not at the moment.
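A minimal sketch of that idea (a hypothetical helper, not part of lumpR; it assumes the grids are held as plain R matrices and that eha_calc() would accept the cropped layers):

```r
# Hypothetical helper: extract only the bounding box of a single EHA from all
# layers, so a parallel worker receives a small sub-matrix per layer instead
# of a copy of the complete grids.
crop_layers_to_eha <- function(eha_id, eha_grid, layers) {
  idx <- which(eha_grid == eha_id, arr.ind = TRUE)  # cells of this EHA
  rr  <- range(idx[, 1])                            # row extent
  cc  <- range(idx[, 2])                            # column extent
  lapply(layers, function(m) m[rr[1]:rr[2], cc[1]:cc[2], drop = FALSE])
}

# each parallel task would then do something like:
#   eha_calc(crop_layers_to_eha(id, eha_grid, layers), ...)
```

Each worker would then hold only its EHA's bounding box in memory, at the cost of one extra scan of the EHA grid per task.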


lbergi commented Apr 24, 2017

Dear Tobias and Till,
we found out something about study-area size and RAM limitations while working with lumpR that might be interesting for you:
Our study area is quite large: at 90 m x 90 m resolution it has 75,545,028 cells (details further down). area2catena() works with 6 layers, which means 75,545,028 cells x 6 layers = 453,270,168 cells are transferred and processed in the function. That exhausts the RAM of my computer (8 GB). Using the recommended parallelisation (more cores) leads to a kind of overflow of the RAM requirements: watching the task manager, you can see the CPU drop to 1-5 % while the RAM is full (95 %). The function never finishes (I interrupted it after 5 days of computing over the long Easter weekend); you have to force-shutdown the whole machine.
The solution that worked for us is to use only 1 core. With that, the function completes successfully. But the computation takes about 12 h and the RAM is at its limit (nothing else works any more, the R session gives the dubious "cannot allocate memory" message for any action you try, and you need to restart the computer).
A possible explanation is that, when more RAM is needed than the computer offers, data are moved to swap. Transferring data to swap and back takes a lot of time, which can explain the long computation duration.

Summary: for study areas larger than ours you should use a server or a computer with more than 8 GB of RAM.

Details:
EHAs: ca. 28,000 (with the chosen parameter settings, nearly all of them are included)

Region:
projection: 1 (UTM)
zone: -23
datum: wgs84
ellipsoid: wgs84
north: 8394608.94608946
south: 7573260.73260733
west: 150000
east: 895015.88956751
nsres: 90.00090001
ewres: 89.99950345
rows: 9126
cols: 8278
cells: 75545028
Mask:
Type of Map: raster Number of Categories: 1
Data Type: CELL
Rows: 9126
Columns: 8278
Total Cells: 75545028
Projection: UTM (zone -23)
N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001
E: 895015.88956751 W: 150000 Res: 89.99950345
Range of data: min = 1 max = 1
Data Description:
generated by r.mapcalc
Comments:
if(isnull(elev_riv), null(), 1)

DATA:

Digital Elevation Model:

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: FCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 299 max = 2076 |
| |
| Data Description: |
| generated by r.mapcalc |
| |
| Comments: |
| if(mask_with_dam == 100, dem_shrink + 100, dem_shrink) |
| |
+----------------------------------------------------------------------------+

Flow Accum

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: DCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 1 max = 7048270 |
| |
| Data Description: |
| generated by r.mapcalc |
| |
| Comments: |
| abs(flow_accum_t) |
| |
+----------------------------------------------------------------------------+

Eha

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 0 |
| Data Type: CELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 21189 max = 53450 |
| |
| Data Description: |
| generated by r.grow |
| |
| Comments: |
| r.grow input="eha_t2" output="eha" radius=100 metric="euclidean" |
| |
+----------------------------------------------------------------------------+

dist_riv

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: DCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 0 max = 197.989052746315 |
| |
| Data Description: |
| generated by r.mapcalc |
| |
| Comments: |
| dist_riv_t / 90.000202 |
| |
+----------------------------------------------------------------------------+

elev_riv

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 255 |
| Data Type: FCELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = -78 max = 669 |
| |
| Data Description: |
| generated by r.stream.distance |
| |
| |
+----------------------------------------------------------------------------+

svc

|----------------------------------------------------------------------------|
| |
| Type of Map: raster Number of Categories: 960 |
| Data Type: CELL |
| Rows: 9126 |
| Columns: 8278 |
| Total Cells: 75545028 |
| Projection: UTM (zone -23) |
| N: 8394608.94608946 S: 7573260.73260733 Res: 90.00090001 |
| E: 895015.88956751 W: 150000 Res: 89.99950345 |
| Range of data: min = 0 max = 960 |
| |
| Data Description: |
| generated by r.cross |
| |
| |
+----------------------------------------------------------------------------+

Kind regards,
Lisa and Josee


tpilz commented Apr 24, 2017

Hi Lisa and José,

yes, the problem with very large study areas is known to us. So far we have not succeeded in making the function more efficient in terms of memory handling (see comments above).

In the meantime, here are some general suggestions to improve your calculations:

Use integer values for raster data (where it makes sense)
As I see it, many of the rasters you use in the calculations are of type FCELL or DCELL. Converting (at least some of) them to type CELL will make them consume much less memory when imported into R (which is what function area2catena() does). To convert a raster, consider r.mapcalc rast_cell="round(rast_dcell)" in GRASS.

Shrink your region as much as possible
Before the calculations, make sure you only use the region of actual interest and do not import raster cells that are not needed. To do so, before running area2catena() (and essentially any GRASS operation) consider g.region zoom=region_mask.
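From within R, via the GRASS interface, this could look as follows (a sketch; "region_mask" is a placeholder raster name, and a running GRASS session initialised with initGRASS() is assumed):

```r
library(spgrass6)  # GRASS 6 interface; use rgrass7 with GRASS 7

# restrict the current region to the non-null extent of the mask raster so
# that subsequent reads into R only transfer the cells of actual interest
execGRASS("g.region", parameters = list(zoom = "region_mask"))

# print the shrunken region settings to verify
execGRASS("g.region", flags = "p")
```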

Adjust lumpR parameters
I suggest not using too many spatial units, i.e. generating fewer EHAs by setting the parameter 'eha_thres' of function lump_grass_prep() to a larger value. There always has to be a compromise between level of detail and feasibility.
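For illustration only (a hypothetical call: 'eha_thres' is the parameter in question, the value is made up, and the comment stands in for your existing lump_grass_prep() arguments):

```r
# a larger threshold (given in cells) merges small units and yields fewer EHAs
lump_grass_prep(
  # ... keep your existing arguments here ...
  eha_thres = 1000  # illustrative value; increase until the EHA count is feasible
)
```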

Break down the problem
Maybe you should break the problem down into smaller watersheds that are processed individually. Currently you are processing an area of about 600,000 km^2, and I fear lumpR will not be the only problem regarding computability. If you use WASA-SED, I fear the computation time will be very high as well (though this also depends on how many LUs and TCs you generate with prof_class()).

I hope these suggestions help. Otherwise, feel free to discuss this issue further.

Tobias


lbergi commented Apr 24, 2017

Hi Tobias!
Thank you very much for your fast reply. I will definitely try to convert my raster data to CELL.
We did some research: there is a package called bigmemory. With it, you can "store a matrix in memory, restart R, and gain access to the matrix without reloading data. Great for big data." and "can share the matrix among multiple R instances or sessions." (presentation: http://bit.ly/2oEpvrV, more details: http://bit.ly/2psGzF9)
As a raster map is basically a matrix, this might be helpful, though I think a lot of code would have to be rewritten. We wanted to share the idea with you anyway.
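A rough sketch of the idea (illustrative only, not lumpR code; file names and the random grid are made up):

```r
library(bigmemory)

# stand-in for one raster layer read from GRASS
grid <- matrix(runif(1e4), nrow = 100)

# file-backed big.matrix: the data live in 'grid.bin' on disk, outside R's heap
big <- as.big.matrix(grid, backingfile = "grid.bin",
                     descriptorfile = "grid.desc")
desc <- describe(big)  # small descriptor object, cheap to pass to workers

# another R process (e.g. a parallel worker) attaches to the same data
# instead of receiving its own copy:
worker_view <- attach.big.matrix("grid.desc")
mean(worker_view[, 1])
```

Workers then share one on-disk copy of each grid rather than replicating it per core, which is exactly the memory problem described above.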


lbergi commented Apr 28, 2017

A comment on the region settings:

I scaled down the region in my R template with
execGRASS("g.region", ...)
to avoid unnecessary computation time. When I checked my data in GRASS after Tobias' remark about the size of the data, I found that the region had somehow disappeared/was wrong: all the data and masks were too large. As we had already observed that masks are sometimes wrong, I suppose that with big data sets lazy loading prevents masks and regions from working properly.
This is what we did: in GRASS 6.4 we set up the correct region again, but via the GUI in mapset LUMP. Then I imported all my raw input data from mapset PERMANENT into mapset LUMP with
r.mapcalc "Data_mask = Data_raw@PERMANENT"
and used this in my lumpR template.
That way I can be sure that no unnecessary data is loaded and processed.
We thought it would be worth sharing, and perhaps implementing in the lumpR code.
Kind regards, Lisa


tpilz commented Apr 28, 2017

Hi Lisa,
I rather think that setting the GRASS location and region appropriately should be the user's task before applying lumpR. The software should not interfere too much here, to avoid confusing the user and causing unexpected behaviour.

By the way, I also suggest always considering g.region -s before doing any processing (within GRASS or via lumpR).
