# 5 Minute Tutorial

## OPeNDAP - the vision
The original vision of [OPeNDAP](https://www.opendap.org/) ([Cornillon et al., 1993](https://zenodo.org/records/10610992)) was to democratize remote data access by establishing the equivalences

$ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \boxed{\text{URL} \approx \text{Remote Dataset} }$

$ \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \boxed{\text{URL + Constraints} \approx \text{Subset of Remote Dataset}} $

That vision led to the development of the `DAP2` protocol (formerly known as `DODS`). Currently, <span style='color:#ff6666'>**OPeNDAP**</span> and Unidata servers implement the modern and broader <span style='color:#0066cc'>**DAP4**</span> protocol (see [DAP4 specification](https://opendap.github.io/dap4-specification/DAP4.html#_how_dap4_differs_from_dap2)), which continues to enable the original vision of OPeNDAP.
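
As a minimal sketch of the second equivalence, the snippet below composes a constrained URL from a dataset URL and a list of variable names. The helper name `constrained_url` is hypothetical (it is not part of `PyDAP`); the `dap4` scheme and the `dap4.ce` query key are the ones used later in this tutorial.

```python
def constrained_url(url, variables):
    """Return a DAP4 URL that asks the server for only `variables`.

    Hypothetical helper for illustration; not part of PyDAP.
    """
    scheme, sep, rest = url.partition("://")
    if not sep or scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL: {url!r}")
    # Variables are written as absolute paths and separated by ';'
    ce = ";".join("/" + name.lstrip("/") for name in variables)
    return f"dap4://{rest}?dap4.ce={ce}"


url = "http://test.opendap.org/opendap/data/nc/coads_climatology.nc"
print(constrained_url(url, ["SST", "COADSX", "COADSY", "TIME"]))
```

The base URL names the remote dataset, and everything after `?dap4.ce=` names the subset the server is asked to return.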

## What pydap enables:

The internal logic of `PyDAP` builds the constraint expression for each URL interactively, hiding that detail from the user. Furthermore, with `PyDAP` as a [backend engine](https://docs.xarray.dev/en/stable/user-guide/io.html#opendap) for `Xarray`, the original <span style='color:#ff6666'>**OPeNDAP**</span> vision can be scaled with multi-core parallelism. Nonetheless, a basic understanding of Constraint Expressions comes in handy when aggregating multiple files, and can lead to more efficient workflows.


### Objectives:


- Demonstrate how to specify the <span style='color:#0066cc'>**DAP4**</span> protocol to the remote server.
- Use `Xarray` with `PyDAP` as the backend engine to download a subset of remote data in two use-case scenarios: `a)` an `NcML` aggregation file (virtual dataset), and `b)` across two NetCDF files.
- Demonstrate distinct ways to use Constraint Expressions (`CE`s), and how these are passed down to the remote server so that <span style='color:#0066cc'>**subsetting is done by the server**</span>, in a `data-proximate` way, without performance loss on the client side.


### Requirements

- Datasets behind a server that implements <span style='color:#0066cc'>**DAP4**</span>. For example, the test server: http://test.opendap.org/opendap/.
- pydap>=3.5.8
- xarray>=2025.0
- numpy>=2.0

```{note}
The vast majority of NASA's OPeNDAP servers implement the DAP4 protocol.
```


In [None]:
from pydap.client import open_url, consolidate_metadata, create_session
import xarray as xr
import numpy as np

In [None]:
# create a session to inspect downloads. cache_name must have `debug`
session = create_session(use_cache=True, cache_kwargs={"cache_name":'debug_case1'})
session.cache.clear()

## Case 1) Subsetting an NcML file

The file is an NcML file representing a virtually aggregated dataset. It is hosted on the test server under the name [aggExisting.ncml](http://test.opendap.org/opendap/data/ncml/agg/aggExisting.ncml.dmr.html).

<span style='color:#ff6666'>**OPeNDAP**</span> servers can be configured to produce NcML virtual datasets. Their advantage is that with an individual <span style='color:#ff6666'>**OPeNDAP**</span> url, a user has access to an entire collection of files from which to subset.


In [None]:
ncml_url = "http://test.opendap.org/opendap/data/ncml/agg/aggExisting.ncml"
dap4_ncml_url = ncml_url.replace("http",  "dap4")
print("=============================================================\n Remote DAP4 URL: \n", dap4_ncml_url, "\n=============================================================")

In [None]:
ds = xr.open_dataset(
    dap4_ncml_url, 
    engine='pydap',
    session = session,
    chunks={},
)
ds

### What happens if we download a single data point?

In [None]:
ds['T']

```{note}
The chunking information for `T` implies the entire array is treated as a single chunk! This is the standard interpretation that `Xarray` makes of `OPeNDAP` URLs. What happens if we download a subset of the data?
```


In [None]:
# clear the cache to inspect what is being downloaded
session.cache.clear() 

In [None]:
%%time
ds['T'].isel(time=1, lon=0).load()

In [None]:
print("====================================== \n Request sent to the Remote Server:\n ", session.cache.urls()[0].split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/"), "\n====================================== ")

<span style='color:#0066cc'>**The constraint expression is built from the**</span>
`.isel` `Xarray` <span style='color:#0066cc'>**method and correctly passed to the server, which does all the subsetting work!**</span>
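
The chain of `.replace` calls in the `print` statement above undoes percent-encoding by hand. A shorter sketch of the same idea uses `urllib.parse.unquote` from the standard library. The helper name `sent_constraint` and the example request URL are made up for illustration.

```python
from urllib.parse import unquote


def sent_constraint(request_url):
    """Return the percent-decoded query of a request, without the checksum flag."""
    query = request_url.partition("?")[2]
    query = query.split("&dap4.checksum")[0]
    return unquote(query)


# Made-up request URL, used only to exercise the helper
example = (
    "http://test.opendap.org/opendap/data/ncml/agg/aggExisting.ncml.dap"
    "?dap4.ce=%2FT%5B1%3A1%3A1%5D&dap4.checksum=true"
)
print(sent_constraint(example))
```

With the session from this tutorial, it would be called as `sent_constraint(session.cache.urls()[0])`.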

## Case 2) Subsetting across two separate files.

The two files can be found on the test server, named [coads_climatology](http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dmr.html) and [coads_climatology2](http://test.opendap.org/opendap/data/nc/coads_climatology2.nc.dmr.html). These two datasets share identical spatial dimensions, can be aggregated in time, and have almost all of their variables in common.

```{note}
It is important to always check that datasets can be aggregated. `PyDAP` and `Xarray` have internal logic to check whether two or more datasets can be concatenated, but these safety checks only take into account dimensions and coordinates.
```

<span style='color:#0066cc'>**An important step will be the use of Constraint Expressions (CEs) to ensure that only the variables of interest are concatenated.**</span>

```{warning}
One of these files has extra variables that are not present in the other file. We will discard them by using CEs.
```
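
One way to decide which variables to keep is to compare the variable names of the files. The sketch below does this on plain Python lists. The listings are placeholders (`EXTRA_VAR` stands in for whatever only one file contains) and are not the actual contents of the test files; `split_variables` is a hypothetical helper.

```python
def split_variables(names_a, names_b):
    """Return (shared, extra): names in both lists, and names in only one."""
    in_b = set(names_b)
    shared = [name for name in names_a if name in in_b]
    extra = sorted(set(names_a) ^ in_b)
    return shared, extra


# Placeholder listings; EXTRA_VAR stands in for whatever only one file has
file_a = ["SST", "COADSX", "COADSY", "TIME"]
file_b = ["SST", "EXTRA_VAR", "COADSX", "COADSY", "TIME"]
shared, extra = split_variables(file_a, file_b)
print(shared)
print(extra)
```

Only names in `shared` are candidates for the constraint expression; anything in `extra` would break or pollute the concatenation.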


In [None]:
urls = ["http://test.opendap.org/opendap/data/nc/coads_climatology.nc", "http://test.opendap.org/opendap/data/nc/coads_climatology2.nc"]
dap4_urls = [url.replace("http","dap4") for url in urls]

# constraint expression
dap4_CE = "?dap4.ce=" + ";".join(["/SST", "/COADSX", "/COADSY", "/TIME"])

# Final list of OPeNDAP URLs
dap4ce_urls =[url+dap4_CE for url in dap4_urls]
print("====================================================\nThe following are the DAP4 OPeNDAP URLs \n", dap4ce_urls)

```{note}
**Q: Why use `CE`s when `Xarray` has a `.drop_variables` method?** Because `Xarray` must first parse the entirety of the remote metadata before it can drop the variables. Some files contain 1000 variables; `Xarray` would parse all of them and then drop them. With the `CE`, the server sends constrained metadata that covers only the desired variables.
```


```{warning}
`Xarray` expects the dimensions to be present in the metadata. When constructing `CE`s, make sure to include all the dimensions associated with the variables of interest. In the example above, `COADSX`, `COADSY`, and `TIME` are the dimensions of `SST`.
```
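
A small sketch of the rule in the warning above: given the dimensions of each variable, build a `CE` that always carries them along. The helper `ce_for` and the `dims_of` mapping are hypothetical, and the order of the dimensions is an assumption.

```python
def ce_for(variables, dims_of):
    """Build a `?dap4.ce=` suffix listing each variable followed by its dimensions."""
    names = []
    for var in variables:
        for name in (var, *dims_of[var]):
            if name not in names:  # keep the first occurrence, preserve order
                names.append(name)
    return "?dap4.ce=" + ";".join("/" + name for name in names)


# Dimension names taken from the warning above; their order is an assumption
dims_of = {"SST": ("TIME", "COADSY", "COADSX")}
print(ce_for(["SST"], dims_of))
```

Requesting several variables that share dimensions lists each dimension only once.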



### <span style='color:#0066cc'>**Consolidate Metadata speeds up the Dataset generation**</span>


In [None]:
consolidate_metadata(dap4ce_urls, session=session, concat_dim="TIME")

```{note}
`consolidate_metadata(dap4_urls, concat_dim='...', session=session)` downloads the dimensions of the remote files and stores them in a SQLite database, to be reused. The session object thus serves both to authenticate and to act as a database manager! This practice can make workflows roughly 10-100 times faster!
```

### Use Xarray logic to download data.


In [None]:
ds = xr.open_mfdataset(
    dap4ce_urls, 
    engine='pydap',
    concat_dim='TIME',
    session=session,
    combine="nested",
    parallel=True,
    decode_times=False,
)
ds

In [None]:
ds['SST']

```{note}
The chunking of `SST` implies the entire array within each file is a single chunk! This is the standard interpretation that `Xarray` makes of `OPeNDAP` URLs. What if we download a single spatial point from a single remote file?
```


In [None]:
session.cache.clear()

In [None]:
%%time
ds['SST'].isel(TIME=0, COADSX=0, COADSY=0).load() # this should download a single point from one of the files

In [None]:
print("====================================== \n Request sent to the Remote Server:\n ", session.cache.urls()[0].split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/"), "\n====================================== ")

### <span style='color:#0066cc'>**The entire variable is unnecessarily downloaded!**</span>

Ideally, we would want to see the following request (in the constraint expression) sent to the Remote Server:

```python
dap4.ce=/SST[0][0][0]
```
It seems that `xr.open_mfdataset` does not pass the slice argument to the server for each remote dataset. Instead, it downloads the whole chunk (i.e. the data array) in a single request, subsets it, and then aggregates the data.
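
To make the ideal request concrete, the sketch below renders Python-style indexers in the inclusive-stop `[start:step:stop]` notation used by DAP constraint expressions. It is an illustration only, not how `PyDAP` or `Xarray` build their requests; the helper names and the `(12, 90, 180)` shape are assumptions.

```python
def hyperslab(index, size):
    """Render one Python index or slice as `[i]` or `[start:step:stop]`.

    The stop is inclusive here, unlike a Python slice.
    """
    if isinstance(index, int):
        if not -size <= index < size:
            raise IndexError(f"index {index} out of range for size {size}")
        return f"[{index % size}]"
    start, stop, step = index.indices(size)
    if step < 1 or stop <= start:
        raise ValueError("only non-empty slices with a positive step are handled")
    return f"[{start}:{step}:{stop - 1}]"


def constraint(name, indexers, shape):
    """Join the per-dimension hyperslabs into a `dap4.ce=` expression."""
    slabs = "".join(hyperslab(i, n) for i, n in zip(indexers, shape))
    return f"dap4.ce=/{name}{slabs}"


shape = (12, 90, 180)  # assumed (TIME, COADSY, COADSX) sizes of one file
print(constraint("SST", (0, 0, 0), shape))
print(constraint("SST", (slice(None), 0, 0), shape))
```

The first call reproduces the single-point request quoted above; the second shows what a request for every time step at one grid point would look like under the same assumptions.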


### <span style='color:#0066cc'>**How to pass the slice from Xarray to the Remote Server**</span>


**The answer is to `chunk` the dataset when creating it**. The chunk **should match the expected size of your subset**. That way the subset will be processed within a single request per remote file.

```{warning}
If you chunk the dataset with a size smaller than your expected download, you will trigger many downloads per remote file, forcing `Xarray` to do extra work to assemble the data.
```
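
The effect described in the warning can be estimated with simple arithmetic. The sketch below counts how many chunks a selection overlaps, assuming one request per overlapped chunk; the helper `chunks_touched` and the sizes used are hypothetical.

```python
from math import prod


def chunks_touched(selection, chunks):
    """Count the chunks overlapped by a selection.

    `selection` maps each dimension to a half-open (start, stop) index range,
    `chunks` maps each dimension to its chunk length.
    """
    return prod(
        (stop - 1) // chunks[dim] - start // chunks[dim] + 1
        for dim, (start, stop) in selection.items()
    )


# All 12 time steps at one grid point of one file (sizes are assumptions)
selection = {"TIME": (0, 12), "COADSY": (0, 1), "COADSX": (0, 1)}
too_small = {"TIME": 1, "COADSY": 1, "COADSX": 1}
matched = {"TIME": 12, "COADSY": 1, "COADSX": 1}
print(chunks_touched(selection, too_small))
print(chunks_touched(selection, matched))
```

When the chunk shape matches the subset, the selection falls inside a single chunk; with smaller chunks the same selection is split across several.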




In [None]:
# consolidate metadata again, since the cached metadata was cleared before
consolidate_metadata(dap4ce_urls, session=session, concat_dim="TIME")


In [None]:
# For a single element in all dimensions, the expected size of the download is:
expected_sizes = {"TIME":1, "COADSX":1, "COADSY":1}

In [None]:
%%time
ds = xr.open_mfdataset(
    dap4ce_urls, 
    engine='pydap',
    concat_dim='TIME',
    session=session,
    combine="nested",
    parallel=True,
    decode_times=False,
    chunks=expected_sizes,
)
session.cache.clear()

In [None]:
ds['SST'] # inspect chunks before download

In [None]:
%%time
ds['SST'].isel(TIME=0, COADSX=0, COADSY=0).load() # triggers download of an individual chunk

In [None]:
print("====================================== \n Request sent to the Remote Server:\n ", session.cache.urls()[0].split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/"), "\n====================================== ")

### Warning: Be cautious about chunking

We have now downloaded exactly what we requested! However, in some scenarios the download can be 10x slower than when we requested more data. The slowdown can sometimes be attributed to the number of chunks in the generated dask graph.


* `No chunking. Download all the array in the file. 2 chunks in 5 dask graphs (one per file).`
* `Chunking. Download only the desired element of a file. 388800 chunks in 5 dask graphs`. 

Ideally, the chunk manager should only trigger the download of a single chunk. However, `388800` chunks were created to ensure that the slice is passed to the server. This can sometimes lead to slowdowns on the client side.
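
The chunk count is consistent with simple arithmetic, assuming each of the two files holds an `SST` array with sizes `TIME=12`, `COADSY=90`, `COADSX=180` (an assumption, not read from the server). The helper `count_chunks` is hypothetical and treats any dimension missing from `chunks` as one whole chunk.

```python
from math import ceil, prod


def count_chunks(sizes, chunks):
    """Number of chunks for one array; unlisted dimensions stay whole."""
    return prod(ceil(n / chunks.get(dim, n)) for dim, n in sizes.items())


# Assumed per-file sizes of SST; there are two files
sizes = {"TIME": 12, "COADSY": 90, "COADSX": 180}
n_files = 2

unchunked = n_files * count_chunks(sizes, {})
per_element = n_files * count_chunks(sizes, {"TIME": 1, "COADSX": 1, "COADSY": 1})
per_row = n_files * count_chunks(sizes, {"COADSY": 1})
print(unchunked, per_element, per_row)
```

Chunking along a single dimension keeps the graph several orders of magnitude smaller than one chunk per element.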

In the scenario above, we went to an extreme. It is better to find a compromise in chunk size. We demonstrate that below, but <span style='color:#0066cc'>**now subsetting across all time (across both files)**</span>.


In [None]:
consolidate_metadata(dap4ce_urls, session=session, concat_dim="TIME")

In [None]:
download_sizes = {"COADSY":1} # note that we will subset across all time

In [None]:
%%time
ds = xr.open_mfdataset(
    dap4ce_urls, 
    engine='pydap',
    concat_dim='TIME',
    session=session,
    combine="nested",
    parallel=True,
    decode_times=False,
    chunks=download_sizes,
)
session.cache.clear()

In [None]:
ds['SST']

In [None]:
%%time
ds['SST'].isel(COADSX=0, COADSY=0).load()

In [None]:
print("====================================== \n Parallel Requests sent to the Remote Server:\n ", [url.split("?")[-1].split("&dap4.checksum")[0].replace("%5B","[").replace("%5D","]").replace("%3A",":").replace("%2F","/") for url in session.cache.urls()], "\n====================================== ")

### Success! Similar timings but a much smaller download!
