Hangs while saving netcdf file opened using xr.open_mfdataset with lock=None #3961

Open
jessicaaustin opened this issue Apr 10, 2020 · 13 comments

jessicaaustin commented Apr 10, 2020

I am testing out code that uses xarray to process netcdf files, in particular to join multiple netcdf files into one along shared dimensions. This was working well, except sometimes when saving the netcdf file the process would hang.

I was able to whittle it down to this simple example: https://github.com/jessicaaustin/xarray_netcdf_hanging_issue

This is the code snippet at the core of the example:

 import os
 import uuid

 import xarray as xr

 # If you set lock=False then this runs fine every time.
 # Setting lock=None causes it to intermittently hang on mfd.to_netcdf.
 with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
     p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
     print(f"Writing data to {p}")
     mfd.to_netcdf(p)
     print("complete")

If you run this once, it's typically fine. But run it over and over again in a loop, and it'll eventually hang on mfd.to_netcdf. However if I set lock=False then it runs fine every time.
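
For reference, a minimal sketch of such a reproduction loop (assuming the snippet above, with 'dataset.nc' present in the working directory and the tmp/ output directory created up front):

 import os
 import uuid

 import xarray as xr

 os.makedirs('tmp', exist_ok=True)

 for i in range(100):
     # Open and immediately re-save the same file, as in the snippet above.
     with xr.open_mfdataset(['dataset.nc'], combine='by_coords', lock=None) as mfd:
         p = os.path.join('tmp', 'xarray_{}.nc'.format(uuid.uuid4().hex))
         print(f"iteration {i}: writing {p}")
         mfd.to_netcdf(p)  # intermittently hangs here with lock=None
         print("complete")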

I've seen this with the following combos:

  • xarray=0.14.1
  • dask=2.9.1
  • netcdf4=1.5.3

and

  • xarray=0.15.1
  • dask=2.14.0
  • netcdf4=1.5.3

And I've tried it with different netcdf files and different computers.

Versions

Output of `xr.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)
[GCC 7.3.0]
python-bits: 64
OS: Linux
OS-release: 4.15.0-20-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.5
libnetcdf: 4.7.4

xarray: 0.15.1
pandas: 1.0.3
numpy: 1.18.1
scipy: None
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: None
cftime: 1.1.1.2
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2.14.0
distributed: 2.14.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
setuptools: 46.1.3.post20200325
pip: 20.0.2
conda: None
pytest: None
IPython: None
sphinx: None

markelg (Contributor) commented May 4, 2020

Thanks @jessicaaustin. We have run into the same issue. Setting lock=False works, but as hdf5 is not thread safe, we are not sure if this could have unexpected consequences.

Edit: Actually, I have checked, and the hdf5 version we are using (from conda-forge) is built in thread-safe mode. This means that concurrent reads are possible, and that lock=False in open_mfdataset would be safe. In fact it is more efficient, as it does not make sense to handle locks if hdf5 is already thread safe. Am I right?

@sjsmith757

Using:

  • xarray=0.15.1
  • dask=2.14.0
  • netcdf4=1.5.3

I have experienced this issue as well when writing netCDF files using xr.save_mfdataset on a dataset opened using xr.open_mfdataset. As described by the OP, it hangs when using lock=None (the default) in xr.open_mfdataset(), but works fine when using lock=False.
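
For illustration, a minimal sketch of that pattern (the input/output file names and the year-based split are hypothetical, and the combined dataset is assumed to have a time coordinate; only the lock argument matters here):

 import xarray as xr

 # Opening with lock=False avoids the hang; lock=None (the default) can stall.
 ds = xr.open_mfdataset(['in1.nc', 'in2.nc'], combine='by_coords', lock=False)

 # Split into per-year groups and write one output file per group.
 years, datasets = zip(*ds.groupby('time.year'))
 paths = [f'out_{year}.nc' for year in years]
 xr.save_mfdataset(datasets, paths)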


bekatd commented Sep 10, 2020

Using:

  • xarray=0.16.0
  • dask=2.25.0
  • netcdf4=1.5.4

I am experiencing the same when trying to write a netCDF file using to_netcdf() on files opened via xr.open_mfdataset with lock=None.

Then I tried the OP's suggestion and it worked like a charm.

BUT

Now I am facing a different issue. It seems that hdf5 is NOT thread safe, since I encounter `NetCDF: HDF error` when applying a different function to a netCDF file that was previously processed by another function with lock=False.
The script just terminates without even reaching any calculation step in the code. It seems like lock=False works the opposite way and the file is left in a corrupted state?

This is the BIGGEST issue and needs to be resolved ASAP.

@shadowleaves

I have the same issue as well, and it appears to me that Ubuntu systems are more prone to it than CentOS. Wondering if anyone else has had a similar experience.

jklymak (Contributor) commented Nov 18, 2020

I have the same behaviour with MacOS (10.15). xarray=0.16.1, dask=2.30.0, netcdf4=1.5.4. Sometimes saves, sometimes doesn't. lock=False seems to work.


bekatd commented Nov 19, 2020

> I have the same behaviour with MacOS (10.15). xarray=0.16.1, dask=2.30.0, netcdf4=1.5.4. Sometimes saves, sometimes doesn't. lock=False seems to work.

lock=False sometimes throws an HDF5 error. No clear solution.

The only workaround I have found is a sleep of 1 second.

jklymak (Contributor) commented Nov 20, 2020

> lock=False sometimes throws an HDF5 error. No clear solution.

I haven't seen that yet, but I'd still far prefer an occasional error to a hung process.

@fmaussion (Member)

Just adding my +1 here, and also mentioning that (if memory allows) ds.load() also helps (related: #4710).
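
For context, a minimal sketch of that workaround (the output file name is hypothetical); it pulls the whole dataset into memory before writing, so it only helps if the data fits in RAM:

 import xarray as xr

 with xr.open_mfdataset(['dataset.nc'], combine='by_coords') as mfd:
     mfd.load()               # force all lazy (dask-backed) variables into memory
     mfd.to_netcdf('out.nc')  # the write no longer reads lazily from the source file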


heerad commented Feb 14, 2021

Also seeing this as of version 0.16.1.

In some cases, I need lock=False, otherwise I'll run into hung processes a certain percentage of the time; ds.load() prior to to_netcdf() does not solve the problem.

In other cases, I need lock=None, otherwise I'll consistently get `RuntimeError: NetCDF: Not a valid ID`.

Is the current recommended solution to set lock=False and retry until success? Or, is it to keep lock=None and use zarr instead? @dcherian
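
For concreteness, a minimal sketch of the "retry until success" approach being asked about (the write_with_retries helper is hypothetical, not an endorsed fix):

 import xarray as xr

 def write_with_retries(ds, path, retries=3):
     # Retry the write a few times; with lock=False the occasional failure
     # shows up as a RuntimeError rather than a hang.
     for attempt in range(retries):
         try:
             ds.to_netcdf(path)
             return
         except RuntimeError:
             if attempt == retries - 1:
                 raise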


bekatd commented Feb 14, 2021

> Is the current recommended solution to set lock=False and retry until success? Or, is it to keep lock=None and use zarr instead? @dcherian

Or alternatively you can try to add a sleep between openings.

When you open the same file from different functions performing different operations, it is better to wrap the file-opening function with a 1-second delay/sleep rather than opening it directly.


heerad commented Feb 14, 2021

> Or alternatively you can try to add a sleep between openings.

To clarify, do you mean adding a sleep of e.g. 1 second prior to your preprocess function (and setting preprocess to just sleep then return ds if you're not doing any preprocessing)? Or, are you instead sleeping before the entire open_mfdataset call?

Is this solution only addressing the issue of opening the same ds multiple times within a python process, or would it also address multiple processes opening the same ds?


bekatd commented Feb 15, 2021

Please run some dummy tests. I added a time.sleep prior to every operation; this was the only workaround that really worked.
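
For illustration, a minimal sketch of that workaround (the open_with_delay helper and file names are hypothetical; this is a mitigation, not a fix):

 import time

 import xarray as xr

 def open_with_delay(paths, delay=1.0, **kwargs):
     # Sleep briefly before opening, as suggested above.
     time.sleep(delay)
     return xr.open_mfdataset(paths, combine='by_coords', **kwargs)

 ds = open_with_delay(['dataset.nc'])
 time.sleep(1.0)         # pause again before the write
 ds.to_netcdf('out.nc')
 ds.close()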

engeir added commits to engeir/cesm-helper-scripts that referenced this issue Jul 25, 2023:

* test(gen_agg): create dummy files and execute stuff

* test(gen_agg): stuff is working

Next: add feat to gen_agg so that files can be appended.

* test(gen_agg): split into many generated files

The time span attribute is moved to the global attribute of the dataset,
which makes it so it is transferred over to the history field of the new
concatenated file when using `ncrcat`.

* perf(close): close all opened datasets in gen_agg (#15)

* fix(nc_savefile): open in lock=False mode (#16)

Issue also described and tracked in
pydata/xarray#3961.

* test(gen_agg): test against different netCDF formats

All file formats supported by xarray now works in the tests. Good stuff.

* ci: set up workflows for labeler and release draft (#19)
@szwang1990

Any progress in solving this problem? I am using:

  • xarray 0.20.1
  • netcdf4 1.6.2

None of the above suggestions (lock=False, time.sleep(1)) works for me.
