
Writing netCDF after running xarray.Dataset.reindex to fill gaps in a time series fails due to memory allocation error #7018

Open · lassiterdc opened this issue Sep 10, 2022 · 3 comments

lassiterdc commented Sep 10, 2022

Problem Summary

I am attempting to convert a .grib2 file, representing a single day of gridded radar rainfall data spanning the continental US, into a netCDF. When the .grib2 is missing timesteps, I fill them in with NA values using xarray.Dataset.reindex before calling xarray.Dataset.to_netcdf. However, after reindexing, the script fails with a memory allocation error; it succeeds if I skip the reindex. One clue: the dataset chunks are set to (70, 3500, 7000), but when ds.to_netcdf is called, the script fails while attempting to load a chunk with dimensions (210, 3500, 7000).
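
As background, the gap-filling pattern itself is simple; here is a minimal, self-contained illustration with synthetic data (not my actual dataset), just to show what I mean by filling gaps via reindex:

import numpy as np
import pandas as pd
import xarray as xr

# a 5-minute series with the 00:10 timestep missing
times = pd.to_datetime(["2001-01-01 00:00", "2001-01-01 00:05", "2001-01-01 00:15"])
da = xr.DataArray(np.arange(3.0), coords={"time": times}, dims="time")

# reindex onto the complete 5-minute index; the missing step becomes NaN
full = pd.date_range("2001-01-01", periods=4, freq="5min")
print(da.reindex(time=full))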

Accessing Full Reproducible Example

The code and data to reproduce my results can be downloaded from this Dropbox link. The code is also shown below, followed by the outputs and potentially relevant OS and environment information.

Code

#%% Import libraries
import time
start_time = time.time()
import xarray as xr
import cfgrib
from glob import glob
import pandas as pd
import dask
dask.config.set(**{'array.slicing.split_large_chunks': False}) # silence warnings about creating large chunks when slicing
dask.config.set(scheduler='synchronous') # force single-threaded computation (netCDF files can only be written serially)
#%% parameters
chnk_sz = "7000MB"
fl_out_nc = "out_netcdfs/20010101.nc"
fldr_in_grib = "in_gribs/20010101.grib2"

#%% loading and exporting dataset
ds = xr.open_dataset(fldr_in_grib, engine="cfgrib", chunks={"time":chnk_sz},
                    backend_kwargs={'indexpath': ''})

# reindex
start_date = pd.to_datetime('2001-01-01')
tstep = pd.Timedelta('0 days 00:05:00')
new_index = pd.date_range(start=start_date, end=start_date + pd.Timedelta(1, "day"),
                          freq=tstep, inclusive='left')

ds = ds.reindex(indexers={"time":new_index})
ds = ds.unify_chunks()
ds = ds.chunk(chunks={'time':chnk_sz})

print("######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########")
print(ds)
print(' ')
print("######## ERROR MESSAGE ########")
ds.to_netcdf(fl_out_nc, encoding={"unknown": {"zlib": True}})

Outputs

######## INSPECTING DATASET PRIOR TO WRITING TO NETCDF ########
<xarray.Dataset>
Dimensions:     (time: 288, latitude: 3500, longitude: 7000)
Coordinates:
  * time        (time) datetime64[ns] 2001-01-01 ... 2001-01-01T23:55:00
  * latitude    (latitude) float64 54.99 54.98 54.98 54.97 ... 20.03 20.02 20.01
  * longitude   (longitude) float64 230.0 230.0 230.0 ... 300.0 300.0 300.0
    step        timedelta64[ns] ...
    surface     float64 ...
    valid_time  (time) datetime64[ns] dask.array<chunksize=(288,), meta=np.ndarray>
Data variables:
    unknown     (time, latitude, longitude) float32 dask.array<chunksize=(70, 3500, 7000), meta=np.ndarray>
Attributes:
    GRIB_edition:            2
    GRIB_centre:             161
    GRIB_centreDescription:  161
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             161
    history:                 2022-09-10T14:50 GRIB to CDM+CF via cfgrib-0.9.1...
 
######## ERROR MESSAGE ########
Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
d:\Dropbox\_Sharing\reprex\2022-9-9_writing_ncdf_fails\reprex\exporting_netcdfs_reduced.py in <cell line: 22>()
     160 print(' ')
     161 print("######## ERROR MESSAGE ########")
---> 162 ds.to_netcdf(fl_out_nc, encoding= {"unknown":{"zlib":True}})

File c:\Users\xxxxx\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\core\dataset.py:1882, in Dataset.to_netcdf(self, path, mode, format, group, engine, encoding, unlimited_dims, compute, invalid_netcdf)
   1879     encoding = {}
   1880 from ..backends.api import to_netcdf
-> 1882 return to_netcdf(  # type: ignore  # mypy cannot resolve the overloads:(
   1883     self,
   1884     path,
   1885     mode=mode,
   1886     format=format,
   1887     group=group,
   1888     engine=engine,
   1889     encoding=encoding,
   1890     unlimited_dims=unlimited_dims,
   1891     compute=compute,
   1892     multifile=False,
   1893     invalid_netcdf=invalid_netcdf,
   1894 )

File c:\Users\xxxxx\anaconda3\envs\weather_gen_3\lib\site-packages\xarray\backends\api.py:1219, in to_netcdf(dataset, path_or_file, mode, format, group, engine, encoding, unlimited_dims, compute, multifile, invalid_netcdf)
...
    121     return arg

File <__array_function__ internals>:180, in where(*args, **kwargs)

MemoryError: Unable to allocate 19.2 GiB for an array with shape (210, 3500, 7000) and data type float32
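
(For reference, the reported allocation checks out: 210 × 3500 × 7000 float32 values × 4 bytes ≈ 20.6 GB ≈ 19.2 GiB. Note also that 210 is exactly three of the 70-step time chunks, which suggests several chunks are being fused along the time dimension somewhere before the write.)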

Environment

Windows 11 Home
xarray 2022.3.0
cfgrib 0.9.10.1
dask 2022.7.0
lassiterdc (Author) commented
I found that a functional workaround is to chunk along one of the spatial dimensions instead. I'd still like to know why the code above fails, though. I assume dask schedules some task that runs before to_netcdf, but I haven't been able to figure out what it is.

ds = xr.open_dataset(fldr_in_grib, engine="cfgrib", chunks={"latitude":875},
                    backend_kwargs={'indexpath': ''})
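
For completeness, a sketch of the rest of the pipeline, which is unchanged from the failing version above (the latitude chunk size of 875 is just the value I happened to use):

# reindex as before; chunks are now split along latitude rather than time
ds = ds.reindex(indexers={"time": new_index})
ds.to_netcdf(fl_out_nc, encoding={"unknown": {"zlib": True}})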

JamiePringle commented
I think #7028 might help you -- I was running into a similar problem. In short, try keeping your time variable as float64 instead of as a datetime (or converting it before you try to save).
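
A minimal sketch of the conversion I mean, assuming the dataset from the example above (the epoch and units here are illustrative, not required values):

import numpy as np

# convert the datetime64 time coordinate to float64 minutes since an epoch
epoch = np.datetime64("2001-01-01")
minutes = (ds["time"].values - epoch) / np.timedelta64(1, "m")
ds = ds.assign_coords(time=("time", minutes))
ds["time"].attrs["units"] = "minutes since 2001-01-01 00:00:00"  # CF-style units
ds.to_netcdf(fl_out_nc, encoding={"unknown": {"zlib": True}})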

lassiterdc (Author) commented
I tried your suggestion and still ran into a memory allocation error, but it sounds like you're onto something. I also found another thread about reindex causing memory allocation errors (#2745), but it doesn't look like a solution was found there either.
