
rasterize_features uses too much memory in xcube 0.10.2 #666

Closed
pont-us opened this issue Apr 14, 2022 · 6 comments · Fixed by #672
Assignees: forman
Labels: bug (Something isn't working), urgent (High external pressure to address this ASAP)

Comments

@pont-us (Member) commented Apr 14, 2022

Describe the bug
In xcube 0.10.2, the memory usage of rasterize_features increases rapidly with the number of polygons. In the attached test case, 10 polygons require around 2.6 GB, 200 polygons require over 10 GB, and with 300 or more polygons my local machine (13.6 GiB RAM + 16.8 GiB swap) runs out of memory. By comparison, with xcube 0.9.2, RAM usage stays constant at around 6.8 GB for up to 500 polygons.

To Reproduce
Steps to reproduce the behavior:

Download the attached test script and run it with xcube 0.10.2. It creates an xarray Dataset and a geopandas GeoDataFrame and calls rasterize_features on them. The script takes the number of polygons for the GeoDataFrame as a command-line parameter. Run it with e.g. 10, 200, and 500 polygons, measure the maximum memory usage during execution, and compare with the results for xcube 0.9.2.
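
For orientation, a minimal sketch of the kind of script described above; the grid size, coordinate names, property name, and the exact rasterize_features arguments are assumptions here, and the attached test_rasterize.py is authoritative.

```python
import sys

import geopandas as gpd
import numpy as np
import xarray as xr
from shapely.geometry import box
from xcube.core.geom import rasterize_features

n_polygons = int(sys.argv[1])  # e.g. 10, 200, 500

# Target grid onto which the features are rasterized (sizes are illustrative;
# the real script uses the dimensions/resolution of the AVL Sentinel-2 cube).
dataset = xr.Dataset(coords={
    "lat": np.linspace(50.0, 51.0, 2000),
    "lon": np.linspace(10.0, 11.0, 2000),
})

# n_polygons small rectangles, each carrying one numeric property to rasterize.
features = gpd.GeoDataFrame(
    {"value": np.arange(n_polygons, dtype=np.float64)},
    geometry=[box(10.0 + 0.001 * i, 50.0, 10.0005 + 0.001 * i, 50.5)
              for i in range(n_polygons)],
    crs="EPSG:4326",
)

rasterized = rasterize_features(dataset, features, ["value"])
# Force evaluation so that peak memory usage can be observed,
# e.g. with /usr/bin/time -v.
rasterized["value"].values
```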

Expected behavior
Memory usage should not increase without limit as more polygons are rasterized (as it did not with xcube 0.9.2).

Screenshots
Figure_1: Memory usage of the test script with different xcube versions and numbers of polygons; y-axis in tens of GB. Testing with 300 or more polygons under xcube 0.10.2 was not possible due to memory exhaustion.

Additional context
The problem appeared during testing of an AVL use case when rasterizing LPIS features into a cube of Sentinel-2 data. The test script was originally derived from this use case (and retains the original dimensions and resolution of the data cube) but has been simplified to a bare minimum.

The relevant changes in the xcube code were introduced in PR #594.

@pont-us (Member, Author) commented Apr 14, 2022

test_rasterize.py.zip

Test script (zipped because GitHub doesn’t support uploading Python scripts).

Also at https://gist.github.com/pont-us/533556e95363c25daf0c4b716b0660f0

@forman (Member) commented Apr 16, 2022

The logic in rasterize_features() has not changed since xcube 0.9; however, we now use lazy dask arrays instead of numpy arrays (dask.array.full() instead of numpy.full()). As these arrays are used only temporarily as intermediate results (the xcube implementation does not keep references to them), I believe the issue is very likely a memory leak in either dask or xarray.
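
Roughly, the change in question looks like this (shape and chunking here are illustrative, not the actual xcube code): with numpy the temporary array is allocated eagerly, with dask it is only a lazy recipe.

```python
import dask.array as da
import numpy as np

shape = (10980, 10980)  # e.g. one Sentinel-2 tile at 10 m resolution

mask_eager = np.full(shape, np.nan)                      # ~1 GB allocated immediately
mask_lazy = da.full(shape, np.nan, chunks=(1024, 1024))  # no data allocated yet
```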

@pont-us (Member, Author) commented Apr 19, 2022

I fear that you're right. The results above were obtained with

dask                      2022.3.0           pyhd8ed1ab_0    conda-forge
dask-core                 2022.3.0           pyhd8ed1ab_0    conda-forge
dask-image                2021.12.0          pyhd8ed1ab_0    conda-forge
xarray                    2022.3.0           pyhd8ed1ab_0    conda-forge

I'll experiment with older versions, and with substituting numpy.full() in the critical places in the xcube 0.10.2 code, and see if I can find the cause. I can't find anything relevant in the dask and xarray issue trackers, but of course the bug may be unreported so far.

@pont-us (Member, Author) commented Apr 20, 2022

Results of experiments so far, now using the current head of the master branch (e7e9709):

  • Replacing the two calls to dask.array.full with numpy.full in xcube.core.geom avoids the memory leak, reverting to the behaviour seen in 0.9.2: memory usage is considerable at around 7 GB, but remains constant as the number of polygons is increased.
  • Retaining the dask.array.full calls, and upgrading dask, dask-core, and distributed from 2022.3.0 to 2022.4.1, doesn't fix the memory leak. Neither does downgrading them to 2021.6.2.
  • Downgrading xarray from 2022.3.0 to 0.19.0 doesn't help either.

So apparently the bug, wherever it is, is not a new one.

@forman (Member) commented Apr 22, 2022

Ok, let me try avoiding the use of xarray in the process. I have the feeling that the problem may be caused by xarray accidentally keeping references to the temporary dask arrays. This may happen because xarray thinks they form a graph or something similar.

forman added a commit that referenced this issue Apr 22, 2022
@forman (Member) commented Apr 22, 2022

The current dask-based implementation generates one node of a dask graph for each polygon. The input of each node is the node for the previous polygon. The resulting graph "paints" all polygons sequentially into the feature variable. That's logically correct, but the implementation is conceivably unfavorable, because the intermediate arrays cannot be released while the chain is being computed. This is not a memory leak, but simply a big, heavy graph that needs to remember all intermediate results.
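
In outline, the graph grows like this (a simplified, self-contained sketch; the per-polygon painting step is a stand-in, not the actual xcube code):

```python
import dask.array as da
import numpy as np

shape = (10980, 10980)
chunks = (1024, 1024)

arr = da.full(shape, np.nan, chunks=chunks)
for value in range(500):                       # one graph layer per polygon
    # stand-in for painting one polygon into the grid
    mask = da.zeros(shape, chunks=chunks, dtype=bool)
    arr = da.where(mask, float(value), arr)    # depends on the previous arr
# Computing `arr` now walks a 500-deep chain of whole-array intermediates,
# which is what blows up the memory footprint as the polygon count grows.
```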

We may reimplement the algorithm using da.map_blocks(), looping through the polygons within each chunk.
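
A rough sketch of that idea, assuming rasterio.features.rasterize as a stand-in for the actual painting routine (function and variable names are illustrative, not the eventual xcube implementation): each chunk rasterizes all polygons into its own window only, so chunks become independent graph nodes that can be computed and released separately.

```python
import dask.array as da
import numpy as np
from rasterio import features
from rasterio.transform import from_origin
from shapely.geometry import box

def _rasterize_chunk(block, shapes=None, x_min=0.0, y_max=0.0, res=1.0,
                     block_info=None):
    # block_info[None] describes the output chunk currently being produced.
    (row0, _), (col0, _) = block_info[None]["array-location"]
    # Affine transform for this chunk's window of the global grid.
    chunk_transform = from_origin(x_min + col0 * res, y_max - row0 * res,
                                  res, res)
    # Paint *all* polygons, but only into this chunk's window.
    return features.rasterize(shapes, out_shape=block.shape,
                              transform=chunk_transform, fill=0,
                              dtype="float64")

# One (geometry, value) pair per polygon.
shapes = [(box(100.0 * i, 0.0, 100.0 * i + 80.0, 80.0), float(i + 1))
          for i in range(500)]

template = da.zeros((10980, 10980), chunks=(1024, 1024))
feature_var = template.map_blocks(
    _rasterize_chunk, shapes=shapes,
    x_min=0.0, y_max=109800.0, res=10.0,
    dtype="float64",
)
feature_var[:1024, :1024].compute()  # each chunk is computed independently
```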

@forman forman self-assigned this Apr 22, 2022
@forman forman added the bug and urgent labels Apr 22, 2022