# Open NWM 1km dataset as DFReferenceFileSystem 

Open the dataset as an fsspec `DFReferenceFileSystem` by reading references from a collection of Parquet files: one file containing the global metadata and coordinate variable references, and one file for each data variable.

The main benefits are lazy loading of the references for each variable and faster construction of the virtual fsspec filesystem from the Parquet files (JSON is slow to decode).
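
To illustrate the lazy-loading idea in isolation, here is a minimal sketch that uses only the standard library. It is not how fsspec implements it: `LazyReferenceStore`, `fake_loader`, the bucket name, and the reference tuples are all hypothetical, and the loader merely stands in for decoding one variable's Parquet file.

```python
from typing import Callable, Dict, List, Tuple

# A reference points a chunk key at a byte range of some remote file:
# (url, offset, length). The values used below are made up.
Reference = Tuple[str, int, int]


class LazyReferenceStore:
    """Toy store that loads one variable's references on first access."""

    def __init__(self, loader: Callable[[str], Dict[str, Reference]]):
        self._loader = loader          # stands in for reading one Parquet file
        self._cache: Dict[str, Dict[str, Reference]] = {}
        self.loads: List[str] = []     # records which variables were loaded

    def __getitem__(self, key: str) -> Reference:
        # Keys look like "<variable>/<chunk index>", e.g. "TRAD/0.0.0".
        variable, _, chunk = key.partition("/")
        if variable not in self._cache:
            self.loads.append(variable)
            self._cache[variable] = self._loader(variable)
        return self._cache[variable][chunk]


def fake_loader(variable: str) -> Dict[str, Reference]:
    # Hypothetical references; a real loader would decode a Parquet file.
    return {
        "0.0.0": (f"s3://example-bucket/{variable}.nc", 0, 100),
        "0.0.1": (f"s3://example-bucket/{variable}.nc", 100, 100),
    }


store = LazyReferenceStore(fake_loader)
first = store["TRAD/0.0.0"]    # triggers a load for TRAD
second = store["TRAD/0.0.1"]   # served from the cache, no second load
print(store.loads)             # ['TRAD']
```

Only the variables that are actually touched pay the cost of loading their references, which is why opening the dataset can stay fast even when the full reference set is large.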

In [2]:
import fsspec
from fsspec.implementations.reference import DFReferenceFileSystem
import xarray as xr

In [3]:
fs = fsspec.filesystem('s3', anon=True, 
                        client_kwargs={'endpoint_url':'https://ncsa.osn.xsede.org'})

In [4]:
s3_lazy_refs = 's3://esip/noaa/nwm/lazy_refs'

In [8]:
print(f'Number of Parquet files: {len(fs.ls(s3_lazy_refs))}')
print(f'Total size of Parquet references: {fs.du(s3_lazy_refs)/1e9} GB')

Number of Parquet files: 21
Total size of Parquet references: 0.492091486 GB


In [9]:
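# Options for the filesystem that holds the referenced data chunks (passed as remote_options)
r_opts = {'anon': True}
# Options for the filesystem that holds the Parquet reference files (passed as target_options)
t_opts = {'anon': True, 'client_kwargs':{'endpoint_url':'https://ncsa.osn.xsede.org'}}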

In [10]:
%%time
fs2 = DFReferenceFileSystem(s3_lazy_refs, lazy=True, target_options=t_opts,
                        remote_protocol='s3', remote_options=r_opts)
m = fs2.get_mapper("")
ds = xr.open_dataset(m, engine="zarr", chunks={}, backend_kwargs=dict(consolidated=False))

CPU times: user 5.03 s, sys: 404 ms, total: 5.43 s
Wall time: 12 s


In [11]:
ds

[xarray.Dataset repr: 20 Dask-backed data variables, all float64 numpy.ndarray, each with 2 graph layers]

Variables,Array shape,Chunk shape,Array bytes (each),Chunk bytes,Chunks (each)
16,"(116631, 3840, 4608)","(1, 768, 922)",15.02 TiB,5.40 MiB,2915775
2,"(116631, 3840, 2, 4608)","(1, 960, 1, 1152)",30.03 TiB,8.44 MiB,3732192
2,"(116631, 3840, 4, 4608)","(1, 768, 1, 922)",60.06 TiB,5.40 MiB,11663100


Examine a specific variable:

In [12]:
ds.TRAD

,Array,Chunk
Bytes,15.02 TiB,5.40 MiB
Shape,"(116631, 3840, 4608)","(1, 768, 922)"
Dask graph,2915775 chunks in 2 graph layers,2915775 chunks in 2 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray
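
The numbers in this repr can be checked by hand. The sketch below, which uses only the standard library, recomputes the chunk grid, the chunk count, and the sizes from the shape and chunk shape shown above. It assumes 8 bytes per value (float64).

```python
import math

shape = (116631, 3840, 4608)   # array shape from the repr above
chunks = (1, 768, 922)         # chunk shape from the repr above
itemsize = 8                   # bytes per float64 value

# Number of chunks along each dimension; the last chunk may be partial.
grid = tuple(math.ceil(n / c) for n, c in zip(shape, chunks))
n_chunks = math.prod(grid)
chunk_mib = math.prod(chunks) * itemsize / 2**20
array_tib = math.prod(shape) * itemsize / 2**40

print(grid)                    # (116631, 5, 5)
print(n_chunks)                # 2915775
print(round(chunk_mib, 2))     # 5.4
print(round(array_tib, 2))     # 15.02
```

Each time step is therefore split into a 5 x 5 grid of chunks, so reading one time step of `TRAD` touches 25 chunks.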


What would the uncompressed size of the whole dataset be?

In [13]:
ds.nbytes/1e12  #TB

462.28064798432
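
As a cross-check, the same figure can be approximated from the shapes in the dataset repr above. This sketch assumes 16 variables with three dimensions, 2 with an extra dimension of length 2, and 2 with an extra dimension of length 4, all float64; those counts are read off the repr, not taken from the dataset's metadata.

```python
import math

ITEMSIZE = 8  # bytes per float64 value

# (number of variables, shape); the counts are read off the dataset repr.
groups = [
    (16, (116631, 3840, 4608)),
    (2, (116631, 3840, 2, 4608)),
    (2, (116631, 3840, 4, 4608)),
]

data_bytes = sum(n * math.prod(shape) * ITEMSIZE for n, shape in groups)
print(data_bytes / 1e12)    # about 462.28 (TB)
print(data_bytes / 2**40)   # about 420.44 (TiB)
```

The result agrees with `ds.nbytes` to within about 1 MB; the small remainder presumably comes from the coordinate variables, which this sketch does not count.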

Load data for a specific time step. The first access to a variable takes longer because its references must be loaded first.

In [14]:
%%time 
da = ds.TRAD.sel(time='1990-01-01 00:00').load()

CPU times: user 7.12 s, sys: 1.26 s, total: 8.39 s
Wall time: 13.6 s


Loading data for another time step is faster (6.31 s versus 13.6 s wall time in this run) because the references are already loaded:

In [15]:
%%time
da = ds.TRAD.sel(time='2015-01-01 00:00').load()

CPU times: user 4.04 s, sys: 531 ms, total: 4.57 s
Wall time: 6.31 s
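
To put the two wall times in perspective, the sketch below estimates the uncompressed size of one time step and the resulting effective read rate. It assumes a 3840 x 4608 float64 slice, as in the `TRAD` repr. The rates describe uncompressed bytes delivered per second; the amount actually transferred from object storage is smaller if the chunks are compressed.

```python
slice_bytes = 3840 * 4608 * 8      # one time step of one float64 variable
slice_mib = slice_bytes / 2**20    # 135.0 MiB uncompressed

# Wall times reported by the two %%time cells above.
wall_times = {"first access": 13.6, "second access": 6.31}

rates = {}
for label, wall_s in wall_times.items():
    rates[label] = slice_bytes / 1e6 / wall_s   # MB of array data per second
    print(f"{label}: {rates[label]:.1f} MB/s")
```

On these numbers the second access delivers roughly twice the effective rate of the first, consistent with the reference loading being a one-time cost per variable.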


Compute the mean over the domain:

In [16]:
da.mean().data

array(266.92635398)

In [None]:
da.plot()