# Tutorial

Intake-thredds provides an interface that combines functionality from [`siphon`](https://github.com/Unidata/siphon) and `intake` to retrieve data from THREDDS data servers. This tutorial provides an introduction to the API and features of intake-thredds. Let's begin by importing `intake`. 

In [1]:
import intake

## Loading a catalog

You can load data from a THREDDS catalog by providing the URL to a valid THREDDS catalog: 

In [2]:
cat_url = 'https://psl.noaa.gov/thredds/catalog/Datasets/noaa.ersst/catalog.xml'

In [3]:
catalog = intake.open_thredds_cat(cat_url, name='noaa-ersst-catalog')
print(catalog)
print(type(catalog))

<Intake catalog: noaa-ersst-catalog>
<class 'intake_thredds.cat.ThreddsCatalog'>


## Using the catalog

Once you've loaded a catalog, you can display its contents by iterating over its entries:

In [4]:
list(catalog)

['err.mnmean.v3.nc',
 'sst.mnmean.v3.nc',
 'sst.mnmean.v4.nc',
 'sst.mon.1971-2000.ltm.v4.nc',
 'sst.mon.19712000.ltm.v3.nc',
 'sst.mon.1981-2010.ltm.v3.nc',
 'sst.mon.1981-2010.ltm.v4.nc']

Once you've identified a dataset of interest, you can access it as follows:

In [5]:
source = catalog['err.mnmean.v3.nc']
print(source)

sources:
  err.mnmean.v3.nc:
    args:
      chunks: {}
      urlpath: https://psl.noaa.gov/thredds/dodsC/Datasets/noaa.ersst/err.mnmean.v3.nc
    description: THREDDS data
    driver: intake_xarray.opendap.OpenDapSource
    metadata:
      catalog_dir: null



In [6]:
print(type(source))

<class 'intake_xarray.opendap.OpenDapSource'>


## Loading a dataset

To load a dataset of interest, call the `to_dask()` method of the **source** object:

In [7]:
%%time
ds = source().to_dask()
ds

CPU times: user 694 ms, sys: 223 ms, total: 917 ms
Wall time: 11.5 s


|  | Array | Chunk |
|---|---|---|
| Bytes | 31.90 kB | 31.90 kB |
| Shape | (1994, 2) | (1994, 2) |
| Count | 2 Tasks | 1 Chunks |
| Type | float64 | numpy.ndarray |

|  | Array | Chunk |
|---|---|---|
| Bytes | 127.78 MB | 127.78 MB |
| Shape | (1994, 89, 180) | (1994, 89, 180) |
| Count | 2 Tasks | 1 Chunks |
| Type | float32 | numpy.ndarray |


The `to_dask()` method reads only the metadata needed to construct an ``xarray.Dataset``. The actual data are streamed over the network only when a computation is invoked on the dataset.
By default, `intake-thredds` uses ``chunks={}``, which loads the dataset with dask using a single chunk per array. You can choose a different chunking scheme by providing a custom value of `chunks` before calling `.to_dask()`:

In [8]:
%%time
# Use a custom chunking scheme
ds = source(chunks={'time': 100, 'lon': 90}).to_dask()
ds

CPU times: user 210 ms, sys: 18.6 ms, total: 229 ms
Wall time: 7.33 s


|  | Array | Chunk |
|---|---|---|
| Bytes | 31.90 kB | 1.60 kB |
| Shape | (1994, 2) | (100, 2) |
| Count | 21 Tasks | 20 Chunks |
| Type | float64 | numpy.ndarray |

|  | Array | Chunk |
|---|---|---|
| Bytes | 127.78 MB | 3.20 MB |
| Shape | (1994, 89, 180) | (100, 89, 90) |
| Count | 41 Tasks | 40 Chunks |
| Type | float32 | numpy.ndarray |
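
The chunk counts and chunk sizes reported for the custom scheme follow from the array shapes and the requested chunk lengths. The sketch below reproduces that arithmetic. The helper name `chunk_summary` is hypothetical, and the dimension order `(time, lat, lon)` for the 3-D array is an assumption; only `time` and `lon` are named in the call above.

```python
import math


def chunk_summary(shape, chunks, itemsize):
    """Return (number of chunks, bytes in a full-size chunk) for a regular chunking.

    `chunks` holds the requested chunk length per dimension; None means
    the dimension is left unchunked.
    """
    n_chunks = 1
    chunk_elems = 1
    for dim_len, chunk_len in zip(shape, chunks):
        chunk_len = dim_len if chunk_len is None else min(chunk_len, dim_len)
        n_chunks *= math.ceil(dim_len / chunk_len)
        chunk_elems *= chunk_len
    return n_chunks, chunk_elems * itemsize


# float32 array (4 bytes per element), chunked as {'time': 100, 'lon': 90};
# the middle dimension is assumed to be latitude and is left unchunked.
print(chunk_summary((1994, 89, 180), (100, None, 90), itemsize=4))  # (40, 3204000)

# float64 array (8 bytes per element), chunked along time only
print(chunk_summary((1994, 2), (100, None), itemsize=8))  # (20, 1600)
```

The results, 40 chunks of about 3.20 MB and 20 chunks of 1.60 kB, agree with the values in the output. The last chunk along `time` is smaller than the others because 1994 is not a multiple of 100.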


## Working with nested catalogs

In some scenarios, a THREDDS catalog can reference another THREDDS catalog. This results in a nested structure consisting of a parent catalog and child catalogs:

In [9]:
cat_url = 'https://psl.noaa.gov/thredds/catalog.xml'
catalog = intake.open_thredds_cat(cat_url)
list(catalog)

['Datasets', 'Aggregations']

In [10]:
print(list(catalog['Datasets']))

['20thC_ReanV2', '20thC_ReanV2c', '20thC_ReanV3', 'ATOMIC', 'COBE', 'COBE2', 'CarbonTracker', 'E3SM', 'E3SM_LE', 'LIM', 'NARR', 'S2S', 'SERDP_regimeshifts', 'Timeseries', 'cmap', 'coads', 'cpc_global_precip', 'cpc_global_temp', 'cpc_us_hour_precip', 'cpc_us_precip', 'cpcsoil', 'cru', 'dai_pdsi', 'ghcncams', 'ghcngridded', 'gistemp', 'godas', 'gpcc', 'gpcp', 'icoads', 'icoads2.5', 'interp_OLR', 'jmatemp', 'kaplan_sst', 'livneh', 'mlost', 'mlostv3b', 'ncep', 'ncep.marine', 'ncep.pac.ocean', 'ncep.reanalysis', 'ncep.reanalysis.dailyavgs', 'ncep.reanalysis.derived', 'ncep.reanalysis2', 'ncep.reanalysis2.dailyavgs', 'ncep.reanalysis2.derived', 'noaa.ersst', 'noaa.ersst.v3', 'noaa.ersst.v4', 'noaa.ersst.v5', 'noaa.oisst.v2', 'noaa.oisst.v2.derived', 'noaa.oisst.v2.highres', 'noaa_hrc', 'noaaglobaltemp', 'noaamergedtemp', 'nodc.woa94', 'nodc.woa98', 'olrcdr', 'prec', 'precl', 'snowcover', 'udel.airt.precip', 'uninterp_OLR']


In [11]:
print(list(catalog['Datasets']['ncep.reanalysis.dailyavgs']))

['other_gauss', 'pressure', 'surface', 'surface_gauss', 'tropopause']


In [12]:
print(list(catalog['Datasets']['ncep.reanalysis.dailyavgs']['surface'])[:10])

['air.sig995.1948.nc', 'air.sig995.1949.nc', 'air.sig995.1950.nc', 'air.sig995.1951.nc', 'air.sig995.1952.nc', 'air.sig995.1953.nc', 'air.sig995.1954.nc', 'air.sig995.1955.nc', 'air.sig995.1956.nc', 'air.sig995.1957.nc']


To load data from such a nested catalog, `intake-thredds` provides a special source object, {py:class}`~intake_thredds.source.THREDDSMergedSource`, accessible via the `.open_thredds_merged()` function. The inputs to this function consist of:

- `url`: top-level URL of the THREDDS catalog
- `path`: a list of path components, one per catalog level, to descend through. Each component can include glob characters (`*`, `?`), which are matched against the entry names at that level.
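
As a rough illustration of how such glob patterns select entries, the sketch below applies Python's standard `fnmatch` module to a few of the file names listed earlier. That intake-thredds matches names in exactly this way is an assumption; the snippet only demonstrates the `*` and `?` wildcards.

```python
from fnmatch import fnmatch

# A few entry names copied from the 'surface' catalog listing above
names = [
    'air.sig995.1948.nc',
    'air.sig995.1949.nc',
    'air.sig995.1950.nc',
    'air.sig995.1951.nc',
]

# '*' matches any run of characters, '?' matches exactly one character
matched_star = [n for n in names if fnmatch(n, 'air*sig995*194*.nc')]
matched_qmark = [n for n in names if fnmatch(n, 'air.sig995.195?.nc')]

print(matched_star)   # ['air.sig995.1948.nc', 'air.sig995.1949.nc']
print(matched_qmark)  # ['air.sig995.1950.nc', 'air.sig995.1951.nc']
```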

In [13]:
source = intake.open_thredds_merged(
    cat_url, path=['Datasets', 'ncep.reanalysis.dailyavgs', 'surface', 'air*sig995*194*.nc']
)
print(source)

sources:
  thredds_merged:
    args:
      path:
      - Datasets
      - ncep.reanalysis.dailyavgs
      - surface
      - air*sig995*194*.nc
      url: https://psl.noaa.gov/thredds/catalog.xml
    description: ''
    driver: intake_thredds.source.THREDDSMergedSource
    metadata: {}



To load the data into an xarray {py:class}`~xarray.Dataset`, you can invoke the `.to_dask()` method.
Internally, {py:class}`~intake_thredds.source.THREDDSMergedSource` does the following:

- descends through the given paths and collects all matching datasets.
- loads each matching dataset.
- combines all loaded datasets into a single dataset.
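
The descend-and-collect step can be sketched without network access by letting a nested dictionary stand in for the catalog tree. Everything here is hypothetical: the `toy_catalog` contents, the `collect` helper, and the use of `fnmatch` for matching. It illustrates the idea only and is not the library's implementation; the load and combine steps are indicated in a closing comment.

```python
from fnmatch import fnmatch

# Stand-in for a nested THREDDS catalog: dicts are child catalogs,
# strings are dataset URLs (both invented for this illustration).
toy_catalog = {
    'Datasets': {
        'ncep.reanalysis.dailyavgs': {
            'surface': {
                'air.sig995.1948.nc': 'url-1948',
                'air.sig995.1949.nc': 'url-1949',
                'air.sig995.1950.nc': 'url-1950',
            },
            'pressure': {'air.1948.nc': 'url-p1948'},
        },
    },
    'Aggregations': {},
}


def collect(node, path):
    """Descend `node` along `path`, treating each path element as a glob pattern."""
    if not path:
        # Path exhausted: keep the node only if it is a dataset (a leaf)
        return [] if isinstance(node, dict) else [node]
    if not isinstance(node, dict):
        return []  # reached a dataset before the path was used up
    pattern, rest = path[0], path[1:]
    found = []
    for name in sorted(node):
        if fnmatch(name, pattern):
            found.extend(collect(node[name], rest))
    return found


urls = collect(
    toy_catalog,
    ['Datasets', 'ncep.reanalysis.dailyavgs', 'surface', 'air*sig995*194*.nc'],
)
print(urls)  # ['url-1948', 'url-1949']
# A real implementation would now open each URL and combine the results.
```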

In [14]:
%%time
ds = source.to_dask()
ds

Dataset(s): 100%|████████████████████████████████| 2/2 [00:15<00:00,  7.77s/it]

CPU times: user 907 ms, sys: 70.5 ms, total: 977 ms
Wall time: 20.9 s





|  | Array | Chunk |
|---|---|---|
| Bytes | 30.74 MB | 15.39 MB |
| Shape | (731, 73, 144) | (366, 73, 144) |
| Count | 6 Tasks | 2 Chunks |
| Type | float32 | numpy.ndarray |


## Caching

Under the hood, `intake-thredds` uses `driver='opendap'` from `intake-xarray` by default. You can also choose
`driver='netcdf'`, which, in combination with `fsspec`, caches files when you prepend `simplecache::` to the URL;
see https://filesystem-spec.readthedocs.io/en/latest/features.html#remote-write-caching.

In [15]:
import os

import fsspec

# Specify the cache location; same_names=True keeps the original file names
fsspec.config.conf['simplecache'] = {'cache_storage': 'my_caching_folder', 'same_names': True}

cat_url = 'https://psl.noaa.gov/thredds/catalog.xml'
source = intake.open_thredds_merged(
    f'simplecache::{cat_url}',
    path=['Datasets', 'ncep.reanalysis.dailyavgs', 'surface', 'air.sig995.194*.nc'],
    driver='netcdf',  # use the netcdf driver to open files via HTTPServer
)
print(source)

sources:
  thredds_merged:
    args:
      driver: netcdf
      path:
      - Datasets
      - ncep.reanalysis.dailyavgs
      - surface
      - air.sig995.194*.nc
      url: simplecache::https://psl.noaa.gov/thredds/catalog.xml
    description: ''
    driver: intake_thredds.source.THREDDSMergedSource
    metadata:
      fsspec_pre_url: 'simplecache::'



In [16]:
%time ds = source.to_dask()

Dataset(s): 100%|████████████████████████████████| 2/2 [00:10<00:00,  5.44s/it]

CPU times: user 875 ms, sys: 186 ms, total: 1.06 s
Wall time: 19.1 s





In [17]:
assert os.path.exists('my_caching_folder/air.sig995.1949.nc')
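
With `same_names` set to `True`, fsspec's `simplecache` stores each file under the base name of its remote path, which is why the check above looks for `air.sig995.1949.nc` directly inside the cache folder. A small sketch of that mapping follows. The remote URL is a made-up example, and `expected_cache_path` is an illustrative helper, not part of fsspec.

```python
import posixpath
from urllib.parse import urlparse


def expected_cache_path(remote_url, cache_storage):
    """Local path for `remote_url` when the cache keeps original file names."""
    basename = posixpath.basename(urlparse(remote_url).path)
    return posixpath.join(cache_storage, basename)


# Hypothetical remote URL, for illustration only
url = 'https://example.org/thredds/fileServer/surface/air.sig995.1949.nc'
print(expected_cache_path(url, 'my_caching_folder'))
# my_caching_folder/air.sig995.1949.nc
```

One consequence of keeping original names is that two remote files with the same base name in different directories would map to the same cached file.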

In [18]:
# After caching, loading is very fast
%time ds = source.to_dask()

CPU times: user 10 µs, sys: 1e+03 ns, total: 11 µs
Wall time: 12.9 µs
