open_mfdataset usage and limitations. #2501
Comments
To help us help you debug, please provide more information about the files you are opening. Specifically, please run:

from glob import glob
import xarray as xr

all_files = glob('*1002*.nc')
display(xr.open_dataset(all_files[0]))
display(xr.open_dataset(all_files[1]))
^ I'm assuming you're in a notebook. If not, call print instead of display.
Thank you for looking into this. I just want to point out that I'm not that much concerned with the "slow performance" but much more with the memory consumption and the limitation it implies.

from glob import glob
import xarray as xr

all_files = glob('...*TP110*.nc')
display(xr.open_dataset(all_files[0]))
display(xr.open_dataset(all_files[1]))
Try writing a preprocessor function that drops all coordinates:

def drop_coords(ds):
    return ds.reset_coords(drop=True)
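For context, the preprocess function is applied to each file's dataset right after it is opened and before the datasets are combined. A minimal sketch of wiring it in (the file pattern here is hypothetical):

```python
from glob import glob
import xarray as xr

def drop_coords(ds):
    # Applied to every per-file dataset before the combine step,
    # so the combine never sees the dropped (non-index) coordinates.
    return ds.reset_coords(drop=True)

files = sorted(glob('*.nc'))  # hypothetical pattern
ds = xr.open_mfdataset(files, preprocess=drop_coords)
```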
I tried this, and either I didn't apply it right, or it didn't work. The memory use kept growing until the process died. My code to process the 8760 NetCDF files is:

import xarray as xr
import pandas as pd
from dask.distributed import Client, progress, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

dates = pd.date_range(start='2009-01-01 00:00', end='2009-12-31 23:00', freq='1h')
files = ['./nc/{}/{}.CHRTOUT_DOMAIN1.comp'.format(date.strftime('%Y'), date.strftime('%Y%m%d%H%M')) for date in dates]

def drop_coords(ds):
    return ds.reset_coords(drop=True)

ds = xr.open_mfdataset(files, preprocess=drop_coords, autoclose=True, parallel=True)
ds1 = ds.chunk(chunks={'time': 168, 'feature_id': 209929})

import numcodecs
numcodecs.blosc.use_threads = False
ds1.to_zarr('zarr/2009', mode='w', consolidated=True)

I transferred the NetCDF files from AWS S3 to my local disk to run this, using this command:

@TomAugspurger, if you could take a look, that would be great, and if you have any ideas of how to make this example simpler/more easily reproducible, please let me know.
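As a side note, before calling to_zarr it can help to sanity-check how large each dask chunk is, since every in-flight chunk has to fit in worker memory. A rough sketch, reusing ds1 from the snippet above:

```python
# Rough per-chunk memory estimate for each variable in the rechunked dataset.
for name, var in ds1.data_vars.items():
    if var.chunks is None:
        continue  # skip variables that are not dask-backed
    chunk_elems = 1
    for block_sizes in var.chunks:
        chunk_elems *= max(block_sizes)
    mb = chunk_elems * var.dtype.itemsize / 1e6
    print(f"{name}: {var.dtype}, ~{mb:.1f} MB per chunk")
```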
Thanks, will take a look this afternoon. Are there any datasets on https://pangeo-data.github.io/pangeo-datastore/ that would exhibit this poor behavior? I may not have access to the bucket (or I'm misusing it).
The datasets in our cloud datastore are designed explicitly to avoid this problem!
Can you post the xarray repr of two sample files after applying the pre-processing function?
Good to know! FYI, #2501 (comment) was user error (I can access it, but need to specify the us-east-1 region). Taking a look now.
@TomAugspurger, I'm back from vacation now and ready to attack this again. Any updates on your end?
I'm looking into it today. Can you clarify: by "process" do you mean a dask worker process, or just the main Python process executing the script?
@TomAugspurger, okay, I just ran the above code again and here's what happens. Despite the tasks showing on the dashboard as completed, after about 10 more minutes I get these warnings and then these errors:

distributed.client - WARNING - Couldn't gather 17520 keys, rescheduling {'getattr-fd038834-befa-4a9b-b78f-51f9aa2b28e5': ('tcp://127.0.0.1:45640',), 'drop_coords-39be9e52-59de-4e1f-b6d8-27e7d931b5af': ('tcp://127.0.0.1:55881',), 'drop_coords-8bd07037-9ca4-4f97-83fb-8b02d7ad0333': ('tcp://127.0.0.1:56164',), 'drop_coords-ca3dd72b-e5af-4099-b593-89dc97717718': ('tcp://127.0.0.1:59961',), 'getattr-c0af8992-e928-4d42-9e64-340303143454': ('tcp://127.0.0.1:42989',), 'drop_coords-8cdfe5fb-7a29-4606-8692-efa747be5bc1': ('tcp://127.0.0.1:35445',), 'getattr-03669206-0d26-46a1-988d-690fe830e52f': ...

Full error listing here: Does this help? I'd be happy to screenshare if that would be useful.
@rabernat, to answer your question, if I open just two files:

the resulting dataset is:
@rsignell-usgs very helpful, thanks. I'd noticed that there was a pause after the open_dataset tasks finish, indicating that either the scheduler or (more likely) the client was doing work rather than the cluster. Most likely @rabernat's guess is correct. Verifying all that now, and looking into if/how that can be done on the workers.
@TomAugspurger, I thought @rabernat's suggestion of implementing

def drop_coords(ds):
    return ds.reset_coords(drop=True)

would avoid this checking. Did I understand or implement this incorrectly?
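A quick way to see why that alone might not be enough: reset_coords only drops non-index coordinates, so a dimension coordinate like feature_id survives and still gets compared across files. A small illustrative sketch with toy data (not the real files):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {'flow': (('time', 'feature_id'), np.zeros((2, 3)))},
    coords={
        'time': [0, 1],
        'feature_id': [10, 20, 30],                     # dimension (index) coordinate
        'reference_time': np.datetime64('2009-01-01'),  # scalar, non-index coordinate
    },
)
# Only 'reference_time' disappears; 'time' and 'feature_id' remain as index coords.
print(ds.reset_coords(drop=True).coords)
```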
@TomAugspurger, I sat down here at SciPy with @rabernat and he instantly realized that we needed to drop the feature_id coordinate to prevent open_mfdataset from trying to harmonize that coordinate from all the chunks. So if I use this code, the open_mfdataset command finishes:

def drop_coords(ds):
    ds = ds.drop(['reference_time', 'feature_id'])
    return ds.reset_coords(drop=True)

and I can then add back in the dropped coordinate values at the end:

dsets = [xr.open_dataset(f) for f in files[:3]]
ds.coords['feature_id'] = dsets[0].coords['feature_id']

I'm now running into memory issues when I write the zarr data -- but I should raise that as a new issue, right?
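Since open_mfdataset is no longer checking that coordinate, a cheap sanity check before re-attaching it from the first file is to confirm it really is identical across a few files. A sketch, reusing the files list from the earlier snippet:

```python
import numpy as np
import xarray as xr

# Spot-check that 'feature_id' is the same in a few of the files before
# copying it from the first one onto the combined dataset.
sample = [xr.open_dataset(f) for f in files[:3]]
reference = sample[0]['feature_id'].values
for other in sample[1:]:
    assert np.array_equal(reference, other['feature_id'].values)
```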
Great, thanks. I’ll look into the memory issue when writing. We may already have an issue for it.
I believe that the memory issue is basically the same as dask/distributed#2602. The graphs look like: read --> rechunk --> write. Reading and rechunking increase memory consumption; writing relieves it. In Rich's case, the workers just load too much data before they write it. Eventually they run out of memory.
Yep, that’s my suspicion as well. I’m still plugging away at it. Currently the pausing logic isn’t quite working well.
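One possible workaround for that read -> rechunk -> write memory pressure (not what was tried in this thread, just a sketch): write the zarr store in batches along time so only a slice of the rechunked data is in flight at once. This assumes the names from the earlier snippet (ds1, the 'zarr/2009' store) and an xarray version with append_dim support:

```python
import zarr

# Write a few chunks' worth of time steps at a time, appending along 'time'.
batch = 168 * 4  # batch size is an assumption; tune to worker memory
n_time = ds1.sizes['time']
for start in range(0, n_time, batch):
    piece = ds1.isel(time=slice(start, start + batch))
    if start == 0:
        piece.to_zarr('zarr/2009', mode='w')
    else:
        piece.to_zarr('zarr/2009', mode='a', append_dim='time')

# Consolidate metadata once, after all the appends.
zarr.consolidate_metadata('zarr/2009')
```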
Hi guys, I'm having an issue that looks similar to @rsignell-usgs's: I'm trying to open 413 NetCDF files with open_mfdataset.

Trying to read them in a standard Python session gives me a core dump:

Trying to read them on a dask cluster I get:

Is there anything obviously wrong with what I'm trying here, please?
I think this is stale now. See https://xarray.pydata.org/en/stable/io.html#reading-multi-file-datasets for the latest guidance on reading such datasets. Please open a new issue if you are still having trouble with open_mfdataset.
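For reference, the pattern recommended on that docs page boils down to skipping the compatibility checks on variables and coordinates that don't vary along the concatenation dimension. A sketch, assuming files that concatenate along time and a recent xarray:

```python
from glob import glob
import xarray as xr

files = sorted(glob('*.nc'))  # hypothetical pattern

# 'minimal' avoids concatenating variables/coords that lack a time dimension,
# and compat='override' skips comparing them across files: they are taken from
# the first file instead of being loaded from every file.
ds = xr.open_mfdataset(
    files,
    combine='nested',
    concat_dim='time',
    data_vars='minimal',
    coords='minimal',
    compat='override',
    parallel=True,
)
```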
I'm trying to understand and use the open_mfdataset function to open a huge number of files.
I thought this function would be quite similar to dask.dataframe.from_delayed and would allow one to "load" and work on an amount of data limited only by the number of Dask workers (or "unlimited", considering it could be lazily loaded).
But my tests showed something quite different.
It seems xarray requires the index to be copied back to the Dask client in order to "auto_combine" the data.
Doing some tests on a small portion of my data, I get something like this.
Each file has these dimensions: time: ~2871, xx_ind: 40, yy_ind: 128.
The concatenation of these files is made on the time dimension, and my understanding is that only the time coordinate is loaded and brought back to the client (the other dimensions are constant).
Parallel tests are made with 200 dask workers.
As you can see, the amount of memory used for this operation is significant and I won't be able to do this on many more files.
When using the parallel option, the loading of the files takes a few seconds (judging from what the Dask dashboard is showing) and I'm guessing the rest of the time is spent in the "auto_combine".
So I'm wondering if I'm doing something wrong, if there is another way to load the data, or if I cannot use xarray directly for this quantity of data and have to use Dask directly.
Thanks in advance.
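For concreteness, a rough sketch of the kind of call described above (the file pattern is hypothetical, and the exact arguments used in the real tests may have differed):

```python
from glob import glob

import xarray as xr
from dask.distributed import Client

client = Client()  # the parallel tests described above used ~200 dask workers

files = sorted(glob('*.nc'))  # hypothetical pattern for the per-file datasets
# Files are concatenated along 'time'; xx_ind and yy_ind are identical in every file.
# (Older xarray accepted concat_dim alone; newer versions also need combine='nested'.)
ds = xr.open_mfdataset(files, combine='nested', concat_dim='time', parallel=True)
```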
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8
LOCALE: fr_FR.UTF-8
xarray: 0.10.9+32.g9f4474d.dirty
pandas: 0.23.4
numpy: 1.15.2
scipy: 1.1.0
netCDF4: 1.4.1
h5netcdf: 0.6.2
h5py: 2.8.0
Nio: None
zarr: 2.2.0
cftime: 1.0.1
PseudonetCDF: None
rasterio: None
iris: None
bottleneck: None
cyordereddict: None
dask: 0.19.4
distributed: 1.23.3
matplotlib: 3.0.0
cartopy: None
seaborn: None
setuptools: 40.4.3
pip: 18.1
conda: None
pytest: 3.9.1
IPython: 7.0.1
sphinx: None