# Download Data

### Purpose
This Jupyter notebook highlights two methods for accessing and downloading data from Ocean Observatories Initiative (OOI) Carbon System instruments. The first method uses OOI's API to perform M2M (machine-to-machine) queries for data from the OOI THREDDS data server. The second method requests data from OOI's Data Explorer ERDDAP server.

#### THREDDS Data
The data served via OPeNDAP on OOI THREDDS servers are the same datasets that can be accessed via OOI's Data Portal at https://ooinet.oceanobservatories.org/. This is the source for real-time or near-real-time data from OOI.


#### Data Explorer
Data Explorer is the new tool for exploring, discovering, and downloading data from OOI. It can be accessed via the web at https://dataexplorer.oceanobservatories.org/. Data Explorer hosts "gold copy" versions of OOI datasets, with all the relevant data streams merged into a single unified file. These datasets are hosted on the Data Explorer ERDDAP server at http://erddap.dataexplorer.oceanobservatories.org/erddap. However, some Data Explorer content is currently available only from the Data Explorer website and can't be downloaded from the ERDDAP server.

---
## OOINet/THREDDS
First, we are going to access and download data from OOI's Data Portal. Then we will do some dataset reprocessing to make the resulting data easier and more intuitive to work with. This portion of the notebook relies on two community tools, developed by members of OOI's Data Team, that simplify interacting with OOI's API: the OOINet tool (https://github.com/reedan88/OOINet) and the Data Explorations Modules (https://github.com/oceanobservatories/ooi-data-explorations).

This notebook provides an example of how to use the OOINet download tool to perform the following functions:
* Search for datasets
* Identify desired reference designator
* Get the associated metadata for a given reference designator
* Request netCDF datasets for a reference designator
* Download the netCDF dataset to your local machine

The key parameter that the OOI API requires is the "reference designator." A reference designator may be thought of as a type of instrument located at a fixed location and depth. It is split into the following three pieces:
1. Subsite - the part of the array where the instrument is located (e.g. Coastal Pioneer Inshore Surface Mooring, CP03ISSM)
2. Node - the part of the subsite that the instrument is attached to (e.g. the Surface Buoy on CP03ISSM, SBD12)
3. Sensor - the number-letter combination that designates a particular class and series of instrument (e.g. the Pro-Oceanus CO2-Pro Atmosphere, 04-PCO2AA000)
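
As a small illustration of the three pieces listed above, the sketch below splits a full reference designator string into its subsite, node, and sensor. The helper name `split_refdes` and the `RefDes` tuple are hypothetical and are not part of the OOINet toolbox.

```python
from typing import NamedTuple


class RefDes(NamedTuple):
    """The three pieces of an OOI reference designator."""
    subsite: str
    node: str
    sensor: str


def split_refdes(refdes: str) -> RefDes:
    """Split 'SUBSITE-NODE-SENSOR' into its pieces (hypothetical helper)."""
    # Split on the first two hyphens only, because the sensor piece
    # (e.g. "04-PCO2AA000") contains a hyphen of its own
    subsite, node, sensor = refdes.split("-", 2)
    return RefDes(subsite, node, sensor)


parts = split_refdes("CP03ISSM-SBD12-04-PCO2AA000")
print(parts.subsite, parts.node, parts.sensor)
```

Joining the pieces back together with hyphens recovers the full reference designator used in the requests later in this notebook.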

In [None]:
# Import some standard python data processing and analysis packages
import os, sys, datetime, pytz, re
import dateutil.parser as parser
import pandas as pd
import numpy as np
import xarray as xr
import warnings
import gc
import json
warnings.filterwarnings("ignore")

In [None]:
# Import dask to make use of parallel computing to significantly speed up processing speed
from dask.diagnostics import ProgressBar

#### Import the ```ooinet``` M2M toolbox
This toolbox is publicly available at https://github.com/reedan88/OOINet. It should be cloned onto your machine and the setup instructions followed before use.

In [None]:
sys.path.append("/home/areed/Documents/OOI/reedan88/ooinet/")
from ooinet import M2M
from ooinet.utils import convert_time, ntp_seconds_to_datetime, unix_epoch_time
from ooinet.Instrument.common import process_file, add_annotation_qc_flag

#### Import ```ooi_data_explorations``` toolbox
This toolbox is publicly available at https://github.com/oceanobservatories/ooi-data-explorations. Similarly to the ```ooinet``` toolbox above, it should be installed onto your machine following the setup instructions before use.

In [None]:
sys.path.append("/home/areed/Documents/OOI/oceanobservatories/ooi-data-explorations/python/")
from ooi_data_explorations.common import get_annotations, get_vocabulary, load_gc_thredds
from ooi_data_explorations.combine_data import combine_datasets
from ooi_data_explorations.uncabled.process_pco2a import pco2a_datalogger 
from ooi_data_explorations.qartod.qc_processing import identify_blocks, create_annotations, process_gross_range, \
    process_climatology, woa_standard_bins, inputs, ANNO_HEADER, CLM_HEADER, GR_HEADER

---
## Search Datasets
First, we can search the available OOI Reference Designators ("refdes" for short) on the following keys: **array**, **node**, **instrument**. Additionally, we can request "**English_names**", which returns the descriptive name for the associated array, node, and instrument. Below, we will search for all of the available instruments on the Coastal Pioneer Inshore Surface Mooring (CP03ISSM).

The major caveat with the search is, similar to searching on ERDDAP datasets, the search terms must be partial or full match based on OOI nomenclature. For example, we have to search for "PCO2", "PCO2AA", or the full instrument name "04-PCO2AA" if we are searching for the sea-surface pCO2 sensor. We can't search "pco2", "carbon dioxide" or other instrument terms.
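
To make the matching caveat concrete, here is a minimal sketch of a case-sensitive substring match over a short list of reference designators. It only illustrates the behavior described above; it is not the implementation used by `M2M.search_datasets`, and the second and third list entries are hypothetical placeholders.

```python
# Illustrative reference designators: only the first appears in this notebook,
# the other two are hypothetical placeholders for the demonstration
refdes_list = [
    "CP03ISSM-SBD12-04-PCO2AA000",
    "CP03ISSM-XXX00-01-CTDEXA000",
    "CP03ISSM-XXX00-02-PHEXAM000",
]


def search(term, names):
    """Return the names that contain `term` as a case-sensitive substring."""
    return [name for name in names if term in name]


terms = ("PCO2", "PCO2AA", "04-PCO2AA", "pco2", "carbon dioxide")
hits = {term: search(term, refdes_list) for term in terms}
for term, found in hits.items():
    print(f"{term!r}: {found}")
```

Under this assumption, the three terms written in OOI nomenclature find the pCO2 sensor, while "pco2" and "carbon dioxide" return nothing.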

The gold copy THREDDS catalog is located at `http://thredds.dataexplorer.oceanobservatories.org/thredds/catalog/ooigoldcopy/public/`.

In [None]:
instruments = M2M.search_datasets(array="CP03ISSM", English_names=True)
instruments

From the returned list of available instruments above, we can select a particular instrument using its **reference designator** (refdes for short):

In [None]:
refdes = "CP03ISSM-SBD12-04-PCO2AA000"

---
## Metadata
Next, we can query OOINet for the metadata associated with the selected reference designator. The metadata contains valuable information such as the available methods and streams (which are required to download the data), the particleKeys (the data variable names), and the associated units.

In [None]:
metadata = M2M.get_metadata(refdes)
metadata

#### Sensor Parameters
Each instrument returns multiple parameters containing a variety of low-level instrument output and metadata. However, we are interested in science-relevant parameters. We can identify the science parameters based on the preload database, which designates the science parameters with a "data level" of L1 or L2. 

Consequently, we will want to filter and group the metadata for a given reference designator to identify the relevant parameters. First, we query the preload database with the relevant metadata for a reference designator. Then, we filter the metadata for the science-relevant data streams based on the preload information. Then, we reduce the results by grouping by the stream parameter to get the stream-by-stream data, which will be useful when requesting data from OOINet for download. 

In [None]:
data_levels = M2M.get_parameter_data_levels(metadata)
data_levels

Filter the metadata based on the data levels for **L1** & **L2** data

In [None]:
def filter_parameter_ids(pdId, pid_dict):
    """Return True if the parameter has a known data level greater than 0 (L1 or L2)."""
    data_level = pid_dict.get(pdId)
    return data_level is not None and data_level > 0

In [None]:
mask = metadata["pdId"].apply(lambda x: filter_parameter_ids(x, data_levels))
metadata = metadata[mask]

Groupby based on the reference designator - method - stream to get the unique values for each data stream

In [None]:
metadata = metadata.groupby(by=["refdes","method","stream"]).agg(lambda x: pd.unique(x.values.ravel()).tolist())
metadata = metadata.reset_index()
metadata = metadata.applymap(lambda x: x[0] if len(x) == 1 else x)
metadata.head()

This returns all of the methods and streams that have scientific data. For some datasets, such as the PCO2W or METBK datasets, we need to do further cleaning to remove engineering and other metadata streams that do not contain relevant science data.

In [None]:
mask = metadata["stream"].apply(lambda x: "blank" not in x)
metadata = metadata[mask]
metadata

---
## Deployment Information
When we searched for datasets, it returned a table which listed the available deployment numbers for each of the datasets. We can get much more detailed information on the deployments for a particular reference designator by requesting the deployment information from OOINet.

In [None]:
deployments = M2M.get_deployments(refdes=refdes)
deployments

We'll go ahead and save the deployment data as a csv since it might be useful when working with the data.

In [None]:
deployments.to_csv(f"../data/{refdes}_deployments.csv", index=False)

---
## Vocab Information
Additionally, if we are interested in more detailed information on the location that the reference designator is assigned to, we can request the vocab information for the given reference designator. The vocab information includes some of the "**English_names**" info we requested when searching for datasets, as well as instrument model, manufacturer, and the descriptive names for the reference designator location.

In [None]:
vocab = M2M.get_vocab(refdes=refdes)
vocab

---
## Calibration Information
We can also request the calibration information for a given reference designator. Since individual instruments are swapped during each mooring deployment & recovery, the calibration coefficients for a reference designator are different for each deployment. The way OOI operates is that it loads all the available calibration coefficients for a given reference designator. Then, for each deployment, it finds the calibration coefficients with the most recent calibration date which most closely _precedes_ the start of the deployment. The result is a table, sorted by deployment number for a reference designator, with the uid of the specific instrument, its calibration coefficients, when the instrument was calibrated, and the source of the calibration coefficients.
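
The matching rule described above (for each deployment, take the calibration whose date most closely precedes the deployment start) can be sketched with `pandas.merge_asof`. This is an illustration of the rule only, not the OOINet implementation; the tables, column names (`deployStart`, `calDate`, `coefficient`), and values are made up for the example.

```python
import pandas as pd

# Hypothetical deployment start times (column names are assumptions)
deployments_demo = pd.DataFrame({
    "deploymentNumber": [1, 2, 3],
    "deployStart": pd.to_datetime(["2019-04-01", "2019-10-01", "2020-04-01"]),
})

# Hypothetical calibration records with made-up coefficients
calibrations_demo = pd.DataFrame({
    "calDate": pd.to_datetime(["2019-01-15", "2019-03-10", "2019-08-20", "2020-06-01"]),
    "coefficient": [1.01, 1.02, 0.98, 1.05],
})

# direction="backward" picks, for each deployment, the latest calibration
# dated on or before the deployment start; both tables must be sorted by key
matched = pd.merge_asof(
    deployments_demo.sort_values("deployStart"),
    calibrations_demo.sort_values("calDate"),
    left_on="deployStart",
    right_on="calDate",
    direction="backward",
)
print(matched)
```

In this toy table the calibration dated 2020-06-01 is never selected, because it falls after the start of every deployment.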

The ```PCO2A``` happens not to require calibration information for OOI to process and deliver its data, so no calibration data are available from OOINet for this instrument.

In [None]:
calibrations = M2M.get_calibrations_by_refdes(refdes, deployments)
calibrations

It is also possible to request the calibration history for a specific instrument by utilizing the **uid** of the instrument.

In [None]:
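# `uid` was not defined above; as an assumption, take it from a "uid" column of the deployments table
uid = deployments["uid"].iloc[0]
uid_calibrations = M2M.get_calibrations_by_uid(uid)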
uid_calibrations

---
## Download Datasets
The ultimate goal of the queries above was to identify the data stream(s) we are interested in, along with the supporting metadata and calibration information, in order to request the data for download. Now we want to request those data streams and get the associated netCDF files. This process involves the following steps:
1. Identify the methods and data streams for the selected reference designator
2. Request the THREDDS server url for the data sets
3. Get the catalog of datasets on the THREDDS server
4. Parse the catalog for the desired netCDF files
5. Download the identified netCDF files to a local directory

Below, we script the above steps in order to download all of the available datasets. In the following section we will combine the data delivered via different methods (e.g. telemetered, recovered_host, recovered_inst) to generate a single combined dataset with the most complete data record available.
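
Consecutive deployments can overlap in time, which is why the loading step in the next cell trims each deployment's record before the files are combined. The sketch below shows the trimming idea on a made-up deployments table: wherever a deployment ends after the next one has started, its end time is pulled back to that next start. The table values are hypothetical; the column names follow those used in this notebook.

```python
import pandas as pd

# Hypothetical deployment table: deployment 1 ends after deployment 2 begins
demo = pd.DataFrame({
    "deploymentNumber": [1, 2, 3],
    "deployStart": pd.to_datetime(["2020-01-01", "2020-06-25", "2021-01-05"]),
    "deployEnd": pd.to_datetime(["2020-07-01", "2021-01-01", "2021-07-01"]),
}).set_index("deploymentNumber")

# Each row now holds the start time of the NEXT deployment (NaT for the last)
next_start = demo["deployStart"].shift(-1)

# True where a deployment is still running when the next one starts;
# comparisons against NaT evaluate to False, so the last row is untouched
overlap = demo["deployEnd"] > next_start

# Pull the overlapping end times back to the next deployment's start
demo.loc[overlap, "deployEnd"] = next_start[overlap]
print(demo)
```

After trimming, the time ranges no longer overlap, so the per-deployment files can be concatenated along time without duplicated timestamps.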

In [None]:
def trim_overlaps(ds, deployments):
    """Trim overlapping deployment data (necessary to use xr.open_mfdataset)"""
    # --------------------------------
    # First, get the deployment times
    deployments = deployments.sort_values(by="deploymentNumber")
    deployments = deployments.set_index(keys="deploymentNumber")
    # Shift the start times by (-1) so each row holds the NEXT deployment's start time
    nextStart = deployments["deployStart"].shift(-1)
    # Find where a deployment ends after the next deployment has already started
    mask = deployments["deployEnd"] > nextStart
    # For those overlaps, replace deployEnd with the next deployment's start time
    # (.loc avoids pandas chained-assignment problems)
    deployments.loc[mask, "deployEnd"] = nextStart[mask]
    deployments["deployEnd"] = deployments["deployEnd"].apply(lambda x: pd.to_datetime(x))
    
    
    # ---------------------------------
    # With the deployments info, can write a preprocess function to filter 
    # the data based on the deployment number
    depNum = np.unique(ds["deployment"])
    deployInfo = deployments.loc[depNum]
    deployStart = deployInfo["deployStart"].values[0]
    deployEnd = deployInfo["deployEnd"].values[0]
    
    # Select the dataset data which falls within the specified time range
    ds = ds.sel(time=slice(deployStart, deployEnd))
    
    return ds

In [None]:
def preprocess_datalogger(ds):
    ds = process_file(ds)
    ds = trim_overlaps(ds, deployments)
    ds = pco2a_datalogger(ds)
    gc.collect()
    return ds

Filter out the "metadata" datastreams; use only the regular dataset and the water dataset

In [None]:
datastreams = M2M.get_datastreams(refdes)
datastreams

In [None]:
exclude = ("metadata", "blank", "power", "air")
mask = datastreams["stream"].apply(lambda x: not any(key in x for key in exclude))
datastreams = datastreams[mask]
datastreams

---
## Download Data
To access data, there are two applicable methods. The first is to download the data and save the netCDF files locally. The second is to access and process the files remotely on the THREDDS server, without having to download the data.
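
For the remote-access route, each catalog entry has to be turned into an OPeNDAP URL before it can be opened. The sketch below shows that string transformation in isolation. The base URL and catalog entries are hypothetical placeholders (in this notebook the real base comes from `M2M.URLS`), and `to_opendap_urls` is an illustrative helper, not part of the toolbox.

```python
import re

# Hypothetical placeholder values, for illustration only
DODSC_BASE = "https://example.org/thredds/dodsC/"
catalog_demo = [
    "catalog.html?dataset=demo/deployment0001_pco2a_example.nc",
    "catalog.html?dataset=demo/deployment0001_pco2a_blank_example.nc",
]

# Raw string so that "\." and "\?" are regex escapes, not string escapes
CATALOG_PREFIX = re.compile(r"catalog\.html\?dataset=")


def to_opendap_urls(catalog, base):
    """Swap the catalog prefix for the OPeNDAP base and drop 'blank' files."""
    # A function replacement inserts `base` literally, with no escape handling
    urls = [CATALOG_PREFIX.sub(lambda match: base, entry) for entry in catalog]
    return [url for url in urls if "blank" not in url]


opendap_urls = to_opendap_urls(catalog_demo, DODSC_BASE)
print(opendap_urls)
```

A list like this is what gets passed to `xr.open_mfdataset` in the loading cell that follows.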

In [None]:
# Get the available datasets
for index in datastreams.index:
    # Get the method and stream
    method = datastreams.loc[index]["method"]
    stream = datastreams.loc[index]["stream"]

    # Get the URL - first try the goldCopy thredds server
    thredds_url = M2M.get_thredds_url(refdes, method, stream, goldCopy=True)

    # Get the catalog
    catalog = M2M.get_thredds_catalog(thredds_url)

    # Clean the catalog
    catalog = M2M.clean_catalog(catalog, stream, deployments)
    
    # Get the links to the THREDDs server and load the data
    dodsC = M2M.URLS["goldCopy_dodsC"]
    
    # Not all datasets have made it into the goldCopy THREDDS - in that case, need to request
    # from OOINet
    if len(catalog) == 0:
        # Get the URL - fall back to the OOINet (non-goldCopy) THREDDS server
        thredds_url = M2M.get_thredds_url(refdes, method, stream, goldCopy=False)

        # Get the catalog
        catalog = M2M.get_thredds_catalog(thredds_url)

        # Clean the catalog
        catalog = M2M.clean_catalog(catalog, stream, deployments)

        # Get the links to the THREDDs server and load the data
        dodsC = M2M.URLS["dodsC"]
    
    # Now load the data
    if method == "telemetered":
        tele_files = [re.sub(r"catalog\.html\?dataset=", dodsC, file) for file in catalog]
        tele_files = [f for f in tele_files if "blank" not in f]
        print(f"----- Load {method}-{stream} data -----")
        with ProgressBar():
            tele_data = xr.open_mfdataset(tele_files, preprocess=preprocess_datalogger, parallel=True)
    elif method == "recovered_host":
        host_files = [re.sub(r"catalog\.html\?dataset=", dodsC, file) for file in catalog]
        host_files = [f for f in host_files if "blank" not in f]
        print(f"----- Load {method}-{stream} data -----")
        with ProgressBar():
            host_data = xr.open_mfdataset(host_files, preprocess=preprocess_datalogger, parallel=True)
    else:
        pass

**Combine the datasets into a single dataset**

In [None]:
data = combine_datasets(tele_data, host_data, None, None)
data

**Clean up workspace variables and free up memory**

In [None]:
host_data.close()
tele_data.close()
del tele_data, host_data
gc.collect()

#### Save the results

In [None]:
data.to_netcdf(f"../data/{refdes}_combined.nc", engine="h5netcdf")
data.close()

---
## Annotations
Annotations contain important qualitative assessments of data quality from the instrument operators. They may range from explanations for why data is missing for a given time period to information about biofouling or other data quality issues. Annotations can be downloaded from OOINet for a particular reference designator.
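
One common use of annotations is to mark which observations fall inside an annotated time window. The sketch below does this for a made-up annotations table. It assumes the begin and end times are stored as milliseconds since the Unix epoch in columns named `beginDT` and `endDT`, and that a missing end time means the annotation is still open; these are assumptions about the table layout, so check them against the columns actually returned by the annotations request.

```python
import pandas as pd

# Hypothetical annotations; column names and units are assumptions
annotations_demo = pd.DataFrame({
    "beginDT": [1577836800000, 1580515200000],  # 2020-01-01 and 2020-02-01 (UTC)
    "endDT": [1578009600000, None],             # 2020-01-03; second is open-ended
    "qcFlag": ["suspect", "fail"],
})

# Convert epoch milliseconds to timestamps; a missing end becomes NaT
begin = pd.to_datetime(annotations_demo["beginDT"], unit="ms")
end = pd.to_datetime(annotations_demo["endDT"], unit="ms")
# Treat a missing end time as "still ongoing"
end = end.fillna(pd.Timestamp.max)

# Hypothetical daily observation times
times = pd.date_range("2019-12-31", "2020-02-02", freq="D")

# Mark every observation that falls inside at least one annotation window
annotated = pd.Series(False, index=times)
for window_start, window_end in zip(begin, end):
    annotated = annotated | ((times >= window_start) & (times <= window_end))

print(annotated[annotated].index.strftime("%Y-%m-%d").tolist())
```

The resulting boolean series can be used to mask or flag the matching rows of a dataset that shares the same time index.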

In [None]:
# Download the annotations for the selected reference designator
annotations = M2M.get_annotations(refdes)
annotations

Save the annotations to local directory

In [None]:
annotations.to_csv(f"../data/{refdes}_annotations.csv")

---
## Data Explorer
---
The data from Data Explorer are hosted via ERDDAP. To interact with Data Explorer's ERDDAP, we'll utilize the python package ```erddapy```.

When using the Data Explorer ERDDAP server, the other metadata we accessed via M2M above, such as the sensor vocab, deployment info, calibration information, etc. is NOT available. That metadata may currently only be accessed via the OOINet M2M API.

In [None]:
from erddapy import ERDDAP

**Data Explorer ERDDAP url**

In [None]:
dataExplorer = "http://erddap.dataexplorer.oceanobservatories.org/erddap"

Connect to the Data Explorer ERDDAP

In [None]:
erd = ERDDAP(server=dataExplorer)

Search for ```PCO2A``` on the Coastal Pioneer Inshore Surface Mooring (CP03ISSM)

In [None]:
search_url = erd.get_search_url(search_for="cp03issm pco2a", 
                                protocol="tabledap",
                                response="csv")

Get the dataset ids for the available PCO2A datasets on the Coastal Pioneer Inshore Surface Mooring

In [None]:
dataset_ids = pd.read_csv(search_url)["Dataset ID"]
dataset_ids

Download the dataset from ERDDAP

In [None]:
# Select the dataset id of the instrument you want to download
dataset_id = "ooi-cp03issm-sbd12-04-pco2aa000"

# Get the download url
download_url = erd.get_download_url(dataset_id=dataset_id, 
                                    protocol="tabledap",
                                    response="opendap")

# Set up the parameters for the dataset request from the ERDDAP server
erd.dataset_id = dataset_id
erd.response = "nc"
erd.protocol = "tabledap"

Open the requested dataset using ```xarray```

In [None]:
ds = erd.to_xarray()
ds = ds.swap_dims({"obs":"time"})
ds = ds.sortby("time")
ds
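
The request above pulls the full record for the dataset. ERDDAP tabledap URLs can also carry a variable list and constraints, which is useful for limiting a download to a time range. The sketch below builds such a URL by hand using only the standard library, to show what the request looks like. `tabledap_url` is a hypothetical helper, and the variable list and time constraint are illustrative choices.

```python
from urllib.parse import quote


def tabledap_url(server, dataset_id, response="csv", variables=(), constraints=None):
    """Build an ERDDAP tabledap request URL (hypothetical helper)."""
    query = ",".join(variables)
    for key, value in (constraints or {}).items():
        # Keys look like "time>="; percent-encode the operator and the value
        query += "&" + quote(key, safe="=") + quote(str(value), safe="")
    return f"{server}/tabledap/{dataset_id}.{response}?{query}"


url = tabledap_url(
    server="http://erddap.dataexplorer.oceanobservatories.org/erddap",
    dataset_id="ooi-cp03issm-sbd12-04-pco2aa000",
    response="csv",
    variables=["time"],
    constraints={"time>=": "2020-01-01T00:00:00Z"},
)
print(url)
```

With ```erddapy```, the same restriction can be expressed by setting the `variables` and `constraints` attributes on the `ERDDAP` object before requesting the data.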