# 5 Minute Tutorial

## OPeNDAP - the vision
The original vision of [OPeNDAP](https://www.opendap.org/) ([Cornillon et al., 1993](https://zenodo.org/records/10610992)) was to democratize remote data access by establishing the equivalences:

$ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \boxed{\text{URL} \approx \text{Remote Dataset} }$
and
$ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \boxed{\text{URL + Constraints} \approx \text{Subset of Remote Dataset}} $

That led to the development of the `DAP2` protocol (formerly known as `DODS`). Currently, <span style='color:#ff6666'>**OPeNDAP**<span style='color:black'> and Unidata servers implement the <span style='color:#0066cc'>**DAP4**<span style='color:black'> protocol, which is more modern and broader in scope, to continue enabling the original vision of OPeNDAP.

## What pydap enables:

The internal logic of `PyDAP` constructs a constraint expression for each URL, realizing the original vision of <span style='color:#ff6666'>**OPeNDAP**<span style='color:black'> above. Because `PyDAP` is a [backend engine](https://docs.xarray.dev/en/stable/user-guide/io.html#opendap) for `Xarray`, that vision can be scaled with parallelism. Even so, a basic understanding of constraint expressions comes in handy when aggregating multiple files or downloading only a handful of variables.


### Objectives:


- Demonstrate how to use the <span style='color:#0066cc'>**DAP4**<span style='color:black'> protocol.
- Use `Xarray` with `pydap` as the backend engine to download data from two remote sources: `a)` an `NcML` aggregation, and `b)` two individual files.
- Demonstrate the use of constraint expressions and how they are passed down to the remote server so that <span style='color:#0066cc'>**subsetting is done by the server**<span style='color:black'>.


### Requirements

- Datasets behind a <span style='color:#0066cc'>**DAP4**<span style='color:black'> implementing server. For example, the test server: http://test.opendap.org/opendap/
- pydap>=3.5.8
- xarray>=2025.0
- numpy>=2.0

Here, we demonstrate this. The remote dataset that will be used in this tutorial can be inspected via the browser [HERE](http://test.opendap.org:8080/opendap/tutorials/20220531090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc.dmr.html)


In [None]:
from pydap.client import open_url, consolidate_metadata, create_session
import xarray as xr
import numpy as np

In [None]:
# create a session to inspect downloads. cache_name must have `debug`
session = create_session(use_cache=True, cache_kwargs={"cache_name":'debug_case1'})

## Case 1) Subsetting an NcML file

The file is an NcML file representing a virtually aggregated dataset. It is hosted on the test server under the name [aggExisting.ncml](http://test.opendap.org/opendap/data/ncml/agg/aggExisting.ncml.dmr.html).

An `NcML` file represents a virtual aggregation of individual NetCDF files, and OPeNDAP servers can be configured to produce these. With a single OPeNDAP URL, a user has access to an entire collection of files, from which to subset.


In [None]:
ncml_url = "http://test.opendap.org/opendap/data/ncml/agg/aggExisting.ncml"
dap4_ncml_url = ncml_url.replace("http",  "dap4")
print("=============================================================\n Remote DAP4 URL: \n", dap4_ncml_url, "\n=============================================================")
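
The scheme swap above relies on `str.replace`, which rewrites every occurrence of the substring `http`; an `https` URL, for instance, would become `dap4s://...`. As a minimal sketch, assuming only that the goal is to change the URL scheme to `dap4`, the hypothetical helper `to_dap4` below (not part of `pydap`) touches the scheme and nothing else:

```python
from urllib.parse import urlsplit, urlunsplit


def to_dap4(url: str) -> str:
    """Return `url` with only its scheme replaced by `dap4` (illustrative helper)."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https", "dap4"):
        raise ValueError(f"unexpected URL scheme: {parts.scheme!r}")
    # SplitResult is a namedtuple, so _replace swaps a single field
    return urlunsplit(parts._replace(scheme="dap4"))


print(to_dap4("http://test.opendap.org/opendap/data/ncml/agg/aggExisting.ncml"))
```

For the URL used in this tutorial the result is the same string that `ncml_url.replace("http", "dap4")` produces.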

In [None]:
ds = xr.open_dataset(
    dap4_ncml_url, 
    engine='pydap',
    session=session,
)
ds

### What happens if we download a single data point?

In [None]:
ds['T']

```{note}
The chunking of `T` implies the entire array is a single chunk! This is a standard interpretation that `Xarray` makes of `OPeNDAP` URLs. What happens if we download a simple subset?
```


In [None]:
# clear the cache to inspect what is being downloaded
session.cache.clear() 

In [None]:
%%time
ds['T'].isel(time=1, lon=0).load()

In [None]:
print("====================================== \n Request sent to the Remote Server:\n ", session.cache.urls()[0].split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/"), "\n====================================== ")

The constraint expression is built from the `.isel` Xarray method and passed to the server, which does all the work.
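
The chain of `.split` and `.replace` calls used to print the request can also be written with the standard library's URL tools. The sketch below is one possible alternative; the helper name `constraint_expression` and the sample URL are made up for illustration and are not output captured from the server.

```python
from urllib.parse import unquote, urlsplit


def constraint_expression(request_url: str) -> str:
    """Return the percent-decoded value of the `dap4.ce` query parameter."""
    query = urlsplit(request_url).query
    # Split on "&" only: ";" separates variables inside a DAP4 constraint expression
    for param in query.split("&"):
        key, _, value = param.partition("=")
        if unquote(key) == "dap4.ce":
            return unquote(value)
    return ""


# Hypothetical request URL, shaped like the cached URLs inspected above
sample = (
    "http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dap"
    "?dap4.ce=%2FSST%5B0%5D%5B0%5D%5B0%5D&dap4.checksum=true"
)
print(constraint_expression(sample))
```

In the notebook, the same function could be applied to `session.cache.urls()[0]`.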

## Case 2) Subsetting across two separate files.

The two files are hosted on the test server and are named [coads_climatology](http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dmr.html) and [coads_climatology2](http://test.opendap.org/opendap/data/nc/coads_climatology2.nc.dmr.html). These two datasets share identical spatial dimensions, can be aggregated in time, and have almost all of their variables in common.

```{note}
It is important to always check whether datasets can be aggregated. `PyDAP` and `Xarray` have internal logic to check whether two or more datasets can be concatenated, but these safety checks only take dimensions and coordinates into account.
```

An important step will be the use of constraint expressions (CEs) to ensure that only the same variables of interest are concatenated.

```{warning}
One of these files has extra variables that we will discard by the use of CEs.
```
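
To illustrate the idea, the sketch below builds a constraint expression that keeps only the variables present in every dataset. The helper `shared_variables_ce` and the name `EXTRA_VAR` are hypothetical; in practice the variable lists would come from inspecting each remote dataset.

```python
def shared_variables_ce(*variable_lists):
    """Build a `?dap4.ce=` suffix selecting only variables common to all lists."""
    shared = set(variable_lists[0]).intersection(*variable_lists[1:])
    # Preserve the order of the first list so the result is deterministic
    ordered = [name for name in variable_lists[0] if name in shared]
    return "?dap4.ce=" + ";".join("/" + name for name in ordered)


# Hypothetical variable listings for two files; EXTRA_VAR is a placeholder name
file_a = ["SST", "COADSX", "COADSY", "TIME"]
file_b = ["SST", "EXTRA_VAR", "COADSX", "COADSY", "TIME"]
print(shared_variables_ce(file_a, file_b))
```

Appending the resulting suffix to each URL means every file exposes the same set of variables before concatenation.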


In [None]:
urls = ["http://test.opendap.org/opendap/data/nc/coads_climatology.nc", "http://test.opendap.org/opendap/data/nc/coads_climatology2.nc"]
dap4_urls = [url.replace("http","dap4") for url in urls]

# constraint expression
dap4_CE = "?dap4.ce=" + ";".join(["/SST", "/COADSX", "/COADSY", "/TIME"])

# Final list of OPeNDAP URLs
dap4ce_urls = [url + dap4_CE for url in dap4_urls]
print("====================================================\nThe following are the DAP4 OPeNDAP URLs \n", dap4ce_urls)

In [None]:
consolidate_metadata(dap4ce_urls, session=session, concat_dim="TIME")

```{note}
`consolidate_metadata(dap4_urls, concat_dim='...', session=session)` downloads the dimensions of the remote files and stores them in a SQLite database for reuse. The session object becomes both a way to authenticate and a database manager! This practice can make workflows roughly 10-100 times faster!
```

### Use Xarray logic to download data


In [None]:
ds = xr.open_mfdataset(
    dap4ce_urls, 
    engine='pydap',
    concat_dim='TIME',
    session=session,
    combine="nested",
    parallel=True,
    decode_times=False,
)
ds

In [None]:
ds['SST']

```{note}
The chunking of `SST` implies the entire array is a single chunk! This is a standard interpretation that `Xarray` makes of `OPeNDAP` URLs. What if we download a single spatial point?
```


In [None]:
session.cache.clear()

In [None]:
%%time
ds['SST'].isel(TIME=0, COADSX=0, COADSY=0).load()

In [None]:
print("====================================== \n Request sent to the Remote Server:\n ", session.cache.urls()[0].split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/"), "\n====================================== ")

### The entire variable is unnecessarily downloaded!!

Ideally, we would want to see the following request (in the constraint expression) sent to the remote server:

```python
dap4.ce=/SST[0][0][0]
```
It seems that `xr.open_mfdataset` does not pass the slice argument to the server for each remote dataset. Instead, it downloads all the data in a single request, subsets it, and then aggregates the data.
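
To make the target request concrete, the sketch below shows how integer indices and Python slices could be turned into a DAP4 hyperslab. It is an illustration only, not the code `pydap` or `Xarray` actually run; the helper `hyperslab` is hypothetical. It assumes the `[start:step:stop]` index form with an inclusive `stop`, which is why Python's exclusive `stop` is reduced by one, and it handles only non-negative indices and positive steps.

```python
def hyperslab(name, indexers):
    """Format a DAP4 constraint expression for one variable (illustrative only)."""
    parts = []
    for index in indexers:
        if isinstance(index, slice):
            if index.stop is None:
                raise ValueError("an explicit slice stop is required in this sketch")
            start = 0 if index.start is None else index.start
            step = 1 if index.step is None else index.step
            # Python slices exclude `stop`; the DAP4 index range includes it
            parts.append(f"[{start}:{step}:{index.stop - 1}]")
        else:
            parts.append(f"[{index}]")
    return f"dap4.ce=/{name}" + "".join(parts)


print(hyperslab("SST", [0, 0, 0]))
print(hyperslab("SST", [0, slice(0, 90), slice(0, 180)]))
```

The first call reproduces the ideal request shown above; the second shows the kind of request produced when whole dimensions are fetched for a single time step.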


### How to send the slice to the Remote Server:


The answer is to chunk the dataset when creating it. The chunk **should match the expected size of your subset**. That way, for each remote file, the subset will be processed within a single request.

```{warning}
If you chunk the dataset with a size smaller than your expected download, you will trigger many downloads per remote file, forcing `Xarray` to do extra work to assemble the data.
```




In [None]:
consolidate_metadata(dap4ce_urls, session=session, concat_dim="TIME")

In [None]:
# For a single element in all dimensions, the expected size is all unity
expected_sizes = {"TIME":1, "COADSX":1, "COADSY":1}

In [None]:
%%time
ds = xr.open_mfdataset(
    dap4ce_urls, 
    engine='pydap',
    concat_dim='TIME',
    session=session,
    combine="nested",
    parallel=True,
    decode_times=False,
    chunks=expected_sizes,
)
session.cache.clear()

In [None]:
ds['SST']

In [None]:
%%time
ds['SST'].isel(TIME=0, COADSX=0, COADSY=0).load()

In [None]:
print("====================================== \n Request sent to the Remote Server:\n ", session.cache.urls()[0].split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/"), "\n====================================== ")

### Warning: Be cautious about chunking

We have now downloaded exactly what we requested! However, the download was about 10x slower than when we requested more data!! The slowdown can be attributed to the number of chunks in the dask graph.


* `No chunking. Download all the array in the file. 2 chunks in 5 dask graphs (one per file).`
* `Chunking. Download only the desired element of a file. 388800 chunks in 5 dask graphs`. 
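
The chunk counts can be reproduced with simple arithmetic: along each dimension, the number of chunks is the dimension size divided by the chunk size, rounded up, and the total is the product over dimensions. The sketch below assumes combined dimension sizes of 24 x 90 x 180 (two files with 12 time steps each); these sizes are an assumption chosen to be consistent with the 388800 figure, and `count_chunks` is a hypothetical helper.

```python
import math


def count_chunks(dim_sizes, chunk_sizes):
    """Total number of chunks; a dimension missing from `chunk_sizes` is one chunk."""
    total = 1
    for dim, size in dim_sizes.items():
        chunk = chunk_sizes.get(dim, size)
        total *= math.ceil(size / chunk)
    return total


# Assumed sizes of the concatenated SST array (not read from the server)
sizes = {"TIME": 24, "COADSY": 90, "COADSX": 180}

print(count_chunks(sizes, {"TIME": 12}))                           # one chunk per file -> 2
print(count_chunks(sizes, {"TIME": 1, "COADSY": 1, "COADSX": 1}))  # -> 388800
print(count_chunks(sizes, {"TIME": 1, "COADSY": 1}))               # -> 2160
```

Under these assumptions, leaving `COADSX` unchunked cuts the number of chunks by a factor of 180.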


In the scenario above, we went to extremes. It is better to find a compromise chunk size. We demonstrate that below.


In [None]:
consolidate_metadata(dap4ce_urls, session=session, concat_dim="TIME")

In [None]:
download_sizes = {"TIME":1, "COADSY":1}

In [None]:
%%time
ds = xr.open_mfdataset(
    dap4ce_urls, 
    engine='pydap',
    concat_dim='TIME',
    session=session,
    combine="nested",
    parallel=True,
    decode_times=False,
    chunks=download_sizes,
)
session.cache.clear()

In [None]:
ds['SST']

In [None]:
%%time
ds['SST'].isel(TIME=0, COADSX=0, COADSY=0).load()

In [None]:
print("====================================== \n Request sent to the Remote Server:\n ", session.cache.urls()[0].split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/"), "\n====================================== ")

### Success! Similar timings, but a much smaller download!
