Deleting temporary files #608

Closed
plvoit opened this issue Dec 9, 2022 · 6 comments

Comments

@plvoit
Contributor

plvoit commented Dec 9, 2022

Functions like VectorSource and ZonalDataPoly create temporary files that are not deleted after a script completes. With large multiprocessing jobs this can crash the system because the tmp directory fills up.

MCVE Code Sample

If one follows this tutorial several temporary folders get created:

https://docs.wradlib.org/en/stable/notebooks/zonalstats/wradlib_zonalstats_quickstart.html

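import wradlib as wrl

# shpfile, grdverts, proj_utm and proj_stereo are defined as in the linked tutorial
trg = wrl.io.VectorSource(shpfile, srs=proj_utm, name="trg")
src = wrl.io.VectorSource(grdverts, srs=proj_utm, name="src", projection_source=proj_stereo)
zd = wrl.zonalstats.ZonalDataPoly(src, trg, srs=proj_utm)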

Expected Output

Delete unnecessary tmp files, or at least return the tmp file names to the user.

Problem Description

When running a large job that processes many polygons and rasters, the stored temporary files can cause the system to crash.
It would be nice if the user had the option to remove the files that get created when, e.g., reading shapefiles with VectorSource and similar functions. For this, one would need to know the names of the temporary directories that get created.
The name is created in gdal.py in the method _check_src and stored in the variable tmpfile.
If these tmpfile names were stored somewhere and returned to the user, one could manually delete the files once they are no longer needed.

Version

Output of wrl.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:45:29)
[GCC 10.4.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-56-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.8.0
xarray: 0.20.2
pandas: 1.3.4
numpy: 1.21.4
scipy: 1.7.3
netCDF4: 1.5.7
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: None
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.2.8
cfgrib: 0.9.8.5
iris: None
bottleneck: None
dask: 2022.04.0
distributed: 2022.4.0
matplotlib: 3.5.1
cartopy: 0.20.0
seaborn: 0.11.2
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: None
setuptools: 65.5.1
pip: 22.3.1
conda: None
pytest: 6.2.5
IPython: 8.7.0
sphinx: 5.3.0
wradlib: 1.18.0

@kmuehlbauer
Member

kmuehlbauer commented Dec 9, 2022

Thanks @plvoit for taking the time to create this issue and providing the information.

As you are referring to those files which are created like this:

tmpfile = tempfile.NamedTemporaryFile(mode="w+b").name
ogr_src = gdal_create_dataset(
    "ESRI Shapefile", os.path.join("/vsimem", tmpfile), gdal_type=gdal.OF_VECTOR
)

Those files are created inside the /tmp folder (at least on Linux). Normally older files get purged there, depending on system settings. I can imagine that if you are running large batches of files, the available disk space might fill up at some point.

It is hard to know when the user no longer needs that file. We could assume it is safe to delete once the object goes out of scope. If you are up for it, I'll gladly accept a pull request adding that functionality.

But as you've also asked about getting back the filenames, there is already machinery for that. You can use the GDAL Dataset to retrieve the filename to delete manually:

tmp_filename = src.ds.GetDescription()
print(tmp_filename)
'/tmp/tmpjervo49i'
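
Once the name is known, manual cleanup could look roughly like the sketch below. `remove_tmp_path` is a hypothetical helper, not part of wradlib, and it assumes the path sits inside the system temp directory, as in the example output above. It handles both a single file and a directory, and refuses to touch anything outside the temp root.

```python
import os
import shutil
import tempfile


def remove_tmp_path(path, tmp_root=None):
    """Delete a temporary file or directory; return True if something was removed.

    Hypothetical helper for illustration only, not a wradlib function.
    """
    tmp_root = os.path.realpath(tmp_root or tempfile.gettempdir())
    target = os.path.realpath(path)
    # Refuse to delete the temp root itself or anything outside of it.
    if target == tmp_root or os.path.commonpath([tmp_root, target]) != tmp_root:
        raise ValueError(f"{path!r} is not inside {tmp_root!r}")
    if os.path.isdir(target):
        shutil.rmtree(target)
        return True
    if os.path.exists(target):
        os.remove(target)
        return True
    return False
```

With the objects from the issue, this would be called as `remove_tmp_path(src.ds.GetDescription())` once `src` is no longer in use.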

@plvoit
Contributor Author

plvoit commented Dec 9, 2022

Hello kmuehlbauer,
first of all, thanks for the quick reply, and thanks for the great package that you and your team are maintaining. I updated the code so that the destructor of VectorSource removes these temporary directories, and I will open a pull request now. I hope this is an acceptable solution.
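
For readers who want to see the general pattern, here is a minimal sketch of cleanup tied to object lifetime. It is not the code from the pull request: `TempBackedSource` is a hypothetical stand-in class, and it uses `weakref.finalize` rather than a `__del__` method, since a finalizer also runs at interpreter exit and can be triggered explicitly.

```python
import os
import shutil
import tempfile
import weakref


class TempBackedSource:
    """Hypothetical stand-in for an object that owns a temporary directory."""

    def __init__(self):
        self.tmpdir = tempfile.mkdtemp(prefix="src_")
        # The callback gets the path, not `self`; referencing `self` here
        # would keep the object alive and the cleanup would never run.
        self._finalizer = weakref.finalize(
            self, shutil.rmtree, self.tmpdir, ignore_errors=True
        )

    def exists(self):
        """Return True while the temporary directory is still on disk."""
        return os.path.isdir(self.tmpdir)

    def close(self):
        """Remove the temporary directory now; safe to call more than once."""
        self._finalizer()
```

Calling `close()` removes the directory immediately. If it is never called, the finalizer runs when the object is garbage collected or when the interpreter exits.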

@kmuehlbauer
Member

@plvoit You're welcome. I've added some comments and suggestions in #609.

@kmuehlbauer
Member

Good work @plvoit, hope to see you around more!

@plvoit
Contributor Author

plvoit commented Dec 9, 2022

Thank you, glad I could contribute a little!
PS: Can I ask a question that is unrelated to this issue?
Is it possible to manipulate a VectorSource so that it is shifted along x and y? How could I access the polygon data and manipulate it to achieve this? I am computing a lot of zonal operations with shifting rasters. This way I could possibly speed up the process quite a bit.

@kmuehlbauer
Member

@plvoit I'd recommend asking this question over at https://openradar.discourse.group. You'll reach a much wider audience there. I'll have to think about this a bit and will let you know over there. In general, it should be possible, also with the power of GDAL.
