# Using M2M: Programmatic Interaction with OOI Data & Metadata
Author: Andrew Reed

### Purpose
This notebook is the companion to the presentation "**Using M2M: Programmatic Interaction with OOI Data & Metadata**", given as part of the Ocean Observatories Facilities Board **NE Pacific Community Workshop** in Portland, Oregon, from June 7 - 9, 2022. The goal is to walk through how to use the Ocean Observatories Initiative's API system, named Machine-2-Machine (M2M), and to examine what data can be queried and how to manipulate it to extract the desired information. The table below outlines the different categories of data which can be queried, their access points in the API, and a short description.

| Category | Access Point | Description |
| -------- | ------------ | ----------- |
| Deployment | 12587/events/deployment/inv/ | Access deployment numbers as well as the asset & calibration info for a specified instrument & deployment, and deployment times/cruises |
| Deployment | 12587/asset/deployments | Asset & calibration info for all deployments for the specified UID |
| Calibration | 12587/asset/cal?uid= OR ?assetid= | Return all calibration info for a given uid or assetId |
| Calibration | 12587/asset/cal?refdes= | Return list of deployments with calibrations for a given reference designator |
| Asset | 12587/asset?uid= OR ?serialnumber= | Asset information by unique id or instrument serial number |
| Preload | 12575/parameter/ | Retrieve information for a parameter (i.e. variable) given its ID number |
| Preload | 12575/stream/byname/ | Retrieve information for a stream given its name |
| Annotations | 12580/anno/find?= | Retrieve annotations for a specific time period and for a given reference designator (optional: stream and method) |
| Vocab | 12586/vocab/inv/ | Get the vocabulary (descriptions) for a sensor |
| Data | 12576/sensor/inv/ | Access the data from OOI using either a synchronous request (returns JSON; limited to 20000 data points) or an asynchronous request (returns netCDF, CSV, or JSON; no data limit) |

The tutorial and presentation are based on similar work developed by Sage Lichtenwalder (github: @seagrinch) for the 2018 OOI Data Workshops.

### Setup
First, please go to https://ooinet.oceanobservatories.org/ and make an account for yourself. Once you have registered and logged in, navigate to your account settings by clicking on "User Profile" under your email in the top right corner of your screen. Once at your Profile, record your API Username and API Token. These are necessary if you wish to access and download data from the Ocean Observatories API.
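
M2M requests authenticate with HTTP Basic auth, using the API Username and API Token as the username/password pair. The following is a minimal sketch of keeping those credentials out of the notebook itself. It assumes they are stored in two environment variables; the variable names `OOI_API_USERNAME` and `OOI_API_TOKEN` and both helper functions are made up for this example and are not part of pyOOI.

```python
import base64
import os


def load_credentials(env=os.environ):
    """Read the API username and token from (hypothetical) environment variables."""
    username = env.get("OOI_API_USERNAME")
    token = env.get("OOI_API_TOKEN")
    if not username or not token:
        raise RuntimeError("Set OOI_API_USERNAME and OOI_API_TOKEN first")
    return username, token


def basic_auth_header(username, token):
    """Build the Authorization header that HTTP Basic auth sends."""
    raw = f"{username}:{token}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


# Placeholder values only, not real credentials
example_env = {"OOI_API_USERNAME": "OOIAPI-EXAMPLE", "OOI_API_TOKEN": "EXAMPLETOKEN"}
username, token = load_credentials(example_env)
headers = basic_auth_header(username, token)
```

With the `requests` library, passing `auth=(username, token)` to a request sends the same header, so the explicit header is only shown here to make the mechanism visible.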

Additionally, this notebook makes use of the code contained in the partner package pyOOI. A stripped-down version of this package has been included as a module in the repository with this tutorial to allow for direct import. Further code scripts and functions may also be found on github.com/oceanobservatories/ooi-data-explorations. 

In [None]:
import os, sys
import re
import yaml
import datetime
import requests
import pandas as pd
import xarray as xr
import warnings
warnings.filterwarnings("ignore")

In [None]:
from IPython.display import display, HTML
#display(HTML("<style>.container { width:100% !important; }</style>"))

Import the M2M module from the pyOOI package if it is downloaded locally:

In [None]:
sys.path.append("../src/")
from pyOOI.M2M import *

In [None]:
# Print each data type alongside the endpoint URL it maps to
for dtype, url in URLS.items():
    print(dtype + " :: " + url)

---
## Navigating the API
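Navigating the OOI M2M endpoints can be confusing, and the OOI API cheat-sheet is a helpful reference. There are also several quirks in how OOI delivers data. First, timestamps come back in more than one format: data requests return time as seconds since the NTP epoch (1900-01-01), while deployment, calibration, and annotation records use unix epoch milliseconds. Below are some basic functions needed to interoperate with the OOI data: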

In [None]:
def ntp_seconds_to_datetime(ntp_seconds):
    """Convert OOINet timestamps to unix-convertable timestamps."""
    # Specify some constant needed for timestamp conversions
    ntp_epoch = datetime.datetime(1900, 1, 1)
    unix_epoch = datetime.datetime(1970, 1, 1)
    ntp_delta = (unix_epoch - ntp_epoch).total_seconds()

    return datetime.datetime.utcfromtimestamp(ntp_seconds - ntp_delta)

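def convert_time(ms):
    """Convert unix epoch milliseconds to a datetime (None passes through)."""
    if ms is None:
        return None
    return datetime.datetime.utcfromtimestamp(ms/1000)

def unix_epoch_time(date_time):
    """Convert a datetime (treated as UTC if naive) to unix epoch milliseconds."""
    # strftime("%s") is platform-dependent and uses the local timezone,
    # so compute the epoch value from a UTC timestamp instead
    timestamp = pd.to_datetime(date_time, utc=True)
    return int(round(timestamp.timestamp() * 1000))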
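
The conversions above can be checked by hand using only the standard library. This sketch is an alternative formulation with timezone-aware datetimes (the function names are this example's own, not part of pyOOI); it avoids `utcfromtimestamp`, which is deprecated as of Python 3.12.

```python
from datetime import datetime, timedelta, timezone

NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ntp_to_utc(ntp_seconds):
    """Seconds since 1900-01-01 -> timezone-aware UTC datetime."""
    return NTP_EPOCH + timedelta(seconds=ntp_seconds)


def epoch_ms_to_utc(ms):
    """Unix epoch milliseconds -> timezone-aware UTC datetime (None passes through)."""
    return None if ms is None else UNIX_EPOCH + timedelta(milliseconds=ms)


def utc_to_epoch_ms(dt):
    """Timezone-aware datetime -> whole unix epoch milliseconds."""
    return (dt - UNIX_EPOCH) // timedelta(milliseconds=1)


# Offset between the two epochs, in seconds
ntp_delta = int((UNIX_EPOCH - NTP_EPOCH).total_seconds())
start_2015 = datetime(2015, 1, 1, tzinfo=timezone.utc)

print(ntp_delta)                    # 2208988800
print(ntp_to_utc(3629059200))       # 2015-01-01 00:00:00+00:00
print(utc_to_epoch_ms(start_2015))  # 1420070400000
```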

---
## Finding Data

The first step in downloading data from the OOI M2M is to find the datasets that you want to download. We can do this by querying the "data" API through its various endpoints until we have the sensor that we are interested in.

In [None]:
# Start with the basics: a request to the bare data endpoint returns the list of available sites, which is
# OOI parlance for the different moorings (e.g. Global Station Papa Flanking Mooring A = GP03FLMA)
print("API Endpoint: " + URLS["data"] + "\n")

print("Returns the following list of sites: \n")
sites = get_api(URLS["data"])
print(sites)

In [None]:
# Select a site: Global Flanking Mooring A
site = "GP03FLMA"

# Can further narrow down the search - adding in the site will generate a list of the "nodes" on the mooring
print("API Endpoint: " + URLS["data"] + "/" + site + "\n")

print(f"Returns the following list of nodes on {site}: \n")
nodes = get_api(URLS["data"] + "/" + site)
print(nodes)

In [None]:
# Select a node - in this case the Mooring Riser
node = "RIS01"

# Can further narrow the search - adding in the node will generate a list of the sensors on the given platform and node
print("API Endpoint: " + URLS["data"] + "/" + site + "/" + node + "\n")

# Next, we can get all of the sensors on a given mooring node
print(f"Returns the following list of sensors on {site}-{node}: \n")
sensors = get_api(URLS["data"] + "/" + site + "/" + node)
print(sensors)

Here we can see there are four sensors on the Global Flanking Mooring A - Riser. One of the sensors (00-SIOENG000) returns engineering/operations data and does not have useful science data. The other three sensors are:
* 03-DOSTAD000 - oxygen sensor
* 04-PHSENF000 - pH sensor
* 05-FLORTD000 - chlorophyll/turbidity sensor

In [None]:
# Select a sensor - in this case the dissolved oxygen sensor
sensor = "03-DOSTAD000"

# With the site-node-sensors we can construct a "reference designator" or refdes for short
print(f"Site: {site}")
print(f"Node: {node}")
print(f"Sensor: {sensor}")

With the **site**, **node**, and **sensor** we can construct the **reference designator** or **refdes** for short. The **reference designator** identifies a particular instrument that has been deployed as part of a site.

In [None]:
refdes = "-".join((site, node, sensor))
print(f"Reference Designator: {refdes}")
print(f"Site: {site}")
print(f"Node: {node}")
print(f"Sensor: {sensor}")
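
Going the other way, from a reference designator back to its parts, needs a little care because the sensor portion itself contains a hyphen. A small sketch of both directions, with basic validation (the helper names are this example's own):

```python
def join_refdes(site, node, sensor):
    """Build a reference designator from its three parts."""
    return "-".join((site, node, sensor))


def split_refdes(refdes):
    """Split a reference designator into (site, node, sensor).

    Only the first two hyphens separate the parts; the sensor keeps its own hyphen.
    """
    parts = refdes.split("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Not a site-node-sensor reference designator: {refdes!r}")
    return tuple(parts)


example = join_refdes("GP03FLMA", "RIS01", "03-DOSTAD000")
print(example)                # GP03FLMA-RIS01-03-DOSTAD000
print(split_refdes(example))  # ('GP03FLMA', 'RIS01', '03-DOSTAD000')
```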

In [None]:
# With a sensor selected, we can see what data delivery methods are available
print("API Endpoint: " + URLS["data"] + "/" + site + "/" + node + "/" + sensor + "\n")

# Request the data delivery methods available for the selected sensor
print(f"Returns the following list of data delivery methods on {site}-{node}-{sensor}: \n")
methods = get_api(URLS["data"] + "/" + site + "/" + node + "/" + sensor)
print(methods)

The **Data Delivery Method** specifies how the data was either transmitted or recorded. In this case, we have the options:
* **recovered_host**: Data downloaded directly from the computer on the mooring or asset which logs the data from the attached instruments
* **telemetered**: Data received through wireless transmission, e.g. surface buoy to satellite, glider to satellite, etc. Telemetered data is frequently truncated or decimated to reduce size for transmission.
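
Because telemetered data is frequently truncated or decimated, one reasonable default is to prefer a recovered method whenever one is available. A small sketch of that preference; the helper name and the fallback policy are assumptions of this example, not OOI rules:

```python
def pick_method(available, preferred_prefix="recovered", fallback="telemetered"):
    """Pick a data delivery method, preferring recovered data over telemetered."""
    recovered = sorted(m for m in available if m.startswith(preferred_prefix))
    if recovered:
        return recovered[0]
    if fallback in available:
        return fallback
    raise ValueError(f"No usable delivery method in {list(available)!r}")


print(pick_method(["telemetered", "recovered_host"]))  # recovered_host
print(pick_method(["telemetered"]))                    # telemetered
```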

In [None]:
# Select a data delivery method
method = "recovered_host"

# Can further narrow the search - adding in the method will generate a list of the streams for the given sensor
print("API Endpoint: " + URLS["data"] + "/" + site + "/" + node + "/" + sensor + "/" + method + "\n")

# Next, we can get all of the streams available for the selected sensor and method
print(f"Returns the following list of streams for {site}-{node}-{sensor} {method}: \n")
streams = get_api(URLS["data"] + "/" + site + "/" + node + "/" + sensor + "/" + method)
print(streams)

The **Data Streams** are generated from parsing the sensor raw data and separating it based on content (e.g. science, engineering, metadata, etc.). In this case, we have the options:
* **dosta_abcdjm_sio_metadata_recovered**: this is the stream which contains metadata and sensor engineering data
* **dosta_abcdjm_sio_instrument_recovered**: this stream contains the science-relevant data we are interested in getting

In [None]:
# Select the stream
stream = "dosta_abcdjm_sio_instrument_recovered"

#### Search Datasets

For your convenience, the function ```search_datasets``` included in the tutorial package can search the available OOI Reference Designators ("refdes" for short) on the following keys: **array**, **node**, **instrument**. Additionally, you can request "**English_names**", which will return the descriptive names for the associated array, node, and instrument. The function uses the knowledge of the sensor endpoints outlined above to crawl through the endpoints looking for available datasets which fit the search keys. Adding "**English_names**" makes use of the **vocab** url, which we'll explore in a section further down in this notebook.

The major caveat with the search is, similar to when searching on ERDDAP datasets, the search terms must be partial or full match based on OOI nomenclature. For example, if we were looking for CTDs, we would have to search for "CTD", "CTDMO", or the full instrument name "02-CTDMOH051". We can't search "conductivity", "temperature" or other CTD-related instrument terms.
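
To see why a word like "oxygen" finds nothing, here is a simplified stand-in for the matching step. It is not the actual ```search_datasets``` implementation, only a sketch that assumes plain substring matching against each part of the reference designator, using the four riser sensors listed earlier:

```python
def matches(refdes, array=None, node=None, instrument=None):
    """Return True if every supplied key is a substring of its refdes part."""
    site_part, node_part, sensor_part = refdes.split("-", 2)
    checks = ((array, site_part), (node, node_part), (instrument, sensor_part))
    return all(key is None or key in part for key, part in checks)


# Reference designators seen earlier on the GP03FLMA riser
known = [
    "GP03FLMA-RIS01-00-SIOENG000",
    "GP03FLMA-RIS01-03-DOSTAD000",
    "GP03FLMA-RIS01-04-PHSENF000",
    "GP03FLMA-RIS01-05-FLORTD000",
]

by_code = [r for r in known if matches(r, array="GP03FLMA", instrument="DO")]
by_word = [r for r in known if matches(r, instrument="oxygen")]

print(by_code)  # ['GP03FLMA-RIS01-03-DOSTAD000']
print(by_word)  # []
```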

We'll search the Global Ocean Station Papa Flanking Mooring A Datasets for any oxygen sensors, all of which will start with "DO".

In [None]:
papa_datasets = search_datasets(array="GP03FLMA", instrument="DO", English_names=True)
papa_datasets

You will still need to query the M2M API to get the available methods and data streams for the reference designator that you choose.

Now, we could go ahead and request data for the Global Ocean Station Papa Mooring Riser Dissolved Oxygen Sensor by using the API Endpoint https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv/GP03FLMA/RIS01/03-DOSTAD000/recovered_host/dosta_abcdjm_sio_instrument_recovered. However, this request will return _all_ of the available data and _all_ of the parameters for the sensor, including a lot of engineering or unprocessed data.
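
The endpoint above is just the data URL with each level of the hierarchy appended in order. A short sketch of that composition, with the base URL copied from the endpoint above (the helper name is this example's own):

```python
BASE_DATA_URL = "https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv"


def build_data_url(site, node, sensor, method=None, stream=None, base=BASE_DATA_URL):
    """Join the hierarchy levels into a data endpoint; method and stream are optional."""
    if stream and not method:
        raise ValueError("A stream can only be given together with a method")
    parts = [base, site, node, sensor]
    if method:
        parts.append(method)
    if stream:
        parts.append(stream)
    return "/".join(parts)


full_url = build_data_url(
    "GP03FLMA", "RIS01", "03-DOSTAD000",
    method="recovered_host",
    stream="dosta_abcdjm_sio_instrument_recovered",
)
print(full_url)
```

Leaving off `stream` or `method` gives the shorter endpoints used earlier to list the available streams and methods.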

Instead, we can interrogate the OOI M2M system to get information on when the sensor has been deployed, what parameters are available on what data streams, and start to narrow our data request to only those parameters, time periods, and/or deployments that we may be interested in.

---
## Deployment Information
A deployment is defined as the span of time between when a mooring or instrument was deployed and when it was recovered. When we searched for the dissolved oxygen sensor on the Global Ocean Station Papa Flanking Mooring A, it returned a table which listed the available deployment numbers for each of the datasets. We can get much more detailed information on the deployments for a particular reference designator by requesting the deployment information from OOINet.

We can start by using the deployment endpoint and reference designator to get a list of the available deployments for the reference designator:

In [None]:
refdes = "GP03FLMA-RIS01-03-DOSTAD000"
site, node, sensor = refdes.split("-", 2)

In [None]:
# First, we can request the deployment numbers for the given site-node-sensor
print("API Endpoint: " + "/".join((URLS["deploy"], site, node, sensor)) + "\n")
print(get_api("/".join((URLS["deploy"], site, node, sensor))))

If we want more detailed information about a given deployment, we need to add in the deployment number:

In [None]:
# Returns asset and calibration information for deployment
deployment_number = "5"
data = get_api("/".join((URLS["deploy"], site, node, sensor, deployment_number)))
data

This returns a JSON object with a dictionary that contains a lot of very detailed information. Depending on what you want, this has to be parsed out. Since we are interested in deployment information, let's parse out the relevant info such as deployment start/end times, the unique ID of the instrument deployed, what cruise it was deployed on, etc:

In [None]:
rows = []
fmt = "%Y-%m-%dT%H:%M:%SZ"
for d in data: # If you requested more than one deployment
    start = convert_time(d.get("eventStartTime"))
    stop = convert_time(d.get("eventStopTime"))  # May be None, e.g. if no recovery is recorded
    rows.append({
        "deploymentNumber": d.get("deploymentNumber"),
        "referenceDesignator": d.get("referenceDesignator"),
        "mooring": d.get("mooring").get("description"),
        "sensor": d.get("sensor").get("description"),
        "UID": d.get("sensor").get("uid"),
        "deployDateTime": start.strftime(fmt) if start else None,
        "recoverDateTime": stop.strftime(fmt) if stop else None,
        "deployCruise": (d.get("deployCruiseInfo") or {}).get("eventName"),
        "recoverCruise": (d.get("recoverCruiseInfo") or {}).get("eventName"),
    })

# Build the frame in one step (DataFrame.append was removed in pandas 2.0)
deploymentInfo = pd.DataFrame(rows)
deploymentInfo

The function ```get_deployments``` included in the tutorial package takes a reference designator, e.g. GP03FLMA-RIS01-03-DOSTAD000, parses it into the {site}/{node}/{sensor} information, fetches all of the available info on the deployments for that reference designator, and returns it as a pandas dataframe.

In [None]:
deployments = get_deployments(refdes)
deployments

---
## Vocab Information
Additionally, if we are interested in more detailed information on the location that the reference designator is assigned to, we can request the vocab information for the given reference designator. The request returns a JSON object with details on the instrument such as the descriptive names for the reference designator location, the nominal depths, the manufacturer as well as the instrument model, etc. The vocab information includes some of the "**English_names**" info we requested when searching for datasets.

In [None]:
vocab_url = "/".join((URLS["vocab"], site, node, sensor))
print(vocab_url + "\n") 

vocab = get_api(vocab_url)
vocab

The included function ```get_vocab``` performs the request and reformats the JSON object into a pandas dataframe:

In [None]:
vocab = get_vocab(refdes)
vocab

---
## Calibration Information
We can also request the calibration information for a given reference designator. Since individual instruments are swapped during each mooring deployment & recovery, the calibration coefficients for a reference designator are different for each deployment. The way OOI operates is that it loads all the available calibration coefficients for a given reference designator. Then, for each deployment, it finds the calibration coefficients with the most recent calibration date which most closely _precedes_ the start of the deployment. The result is a table, sorted by deployment number for a reference designator, with the uid of the specific instrument, its calibration coefficients, when the instrument was calibrated, and the source of the calibration coefficients.

In [None]:
cal_url = URLS["cal"] + "?refdes=" + refdes
print(cal_url + "\n")

cal_data = get_api(cal_url)
cal_data

In [None]:
len(cal_data)

We can see that the returned JSON object has multiple entries, one per deployment of the reference designator. As described above, OOI applies to each deployment the calibration coefficients whose calibration date most closely _precedes_ the start of that deployment.
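
The selection rule (use the most recent calibration that precedes the deployment start) can be sketched in a few lines. The dates below are made up purely for illustration, and treating a calibration dated exactly at the deployment start as valid is an assumption of this sketch:

```python
from datetime import datetime


def select_calibration(cal_dates, deploy_start):
    """Return the latest calibration date not after deploy_start, or None if there is none."""
    preceding = [d for d in cal_dates if d <= deploy_start]
    return max(preceding) if preceding else None


# Hypothetical calibration dates for one instrument
cal_dates = [datetime(2016, 3, 1), datetime(2017, 11, 15), datetime(2019, 2, 10)]
deploy_start = datetime(2018, 8, 29)

print(select_calibration(cal_dates, deploy_start))          # 2017-11-15 00:00:00
print(select_calibration(cal_dates, datetime(2015, 1, 1)))  # None
```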

However, when going through each entry, you'll notice that each deployment entry has all of the calibrations in the system entered for that given instrument UID. This makes parsing it very confusing. A better approach is to limit your request to a single deployment or even a single day by adding in **beginDT** and **endDT** to the request. 

In [None]:
# Limit the request to a single day within Deployment 6 (summer of 2018 to summer of 2019)
beginDT = "2018-08-29T22:54:00.000Z"
endDT = "2018-08-30T22:54:00.000Z"

# Add the time window to the calibration request
cal_url = URLS["cal"] + "?refdes=" + refdes + "&beginDT=" + beginDT + "&endDT=" + endDT
print("API Endpoint: " + cal_url + "\n")

cal_data = get_api(cal_url)
cal_data

Now we can parse the JSON file to get the relevant calibration information for the DOSTA instrument deployed at Global Ocean Station Papa for deployment number 6:

In [None]:
rows = []
for c in cal_data:
    deploymentNumber = c.get("deploymentNumber")
    uid = c.get("sensor").get("uid")
    for cc in c.get("sensor").get("calibration"):
        for ccc in cc.get("calData"):
            # One row per calibration coefficient
            rows.append({
                "deploymentNumber": deploymentNumber,
                "uid": uid,
                "calCoef": ccc.get("eventName"),
                "value": ccc.get("value"),
                "calFile": ccc.get("dataSource"),
            })

# Build the frame in one step (DataFrame.append was removed in pandas 2.0)
calibrations = pd.DataFrame(rows)
calibrations

The function ```get_calibrations_by_refdes``` included in this tutorial drastically simplifies the requests. However, it does need the **deployment** information fetched using the ```get_deployments``` function. What is nice about the function is that it returns, for each deployment, only the calibrations applicable to that deployment.

In [None]:
calibrations = get_calibrations_by_refdes(refdes, deployments)
calibrations

It is much easier to query for calibration data by the UID, or unique ID, for an instrument. For example, if we are looking at the oxygen optode deployed for Deployment 6, it has a UID of CGINS-DOSTAD-00228. We can request the calibration coefficients for that particular instrument.

In [None]:
# Set up the calibration url and arguments to pass to the request
uid = "CGINS-DOSTAD-00228" # This is unique to each instrument
cal_url = URLS["cal"] + "?uid=" + uid
print("API Endpoint: " + cal_url)

# Make the request
calInfo = get_api(cal_url)

# Put the data into a pandas dataframe, sorted by calibration date and coefficient name
columns = ["uid", "calCoef", "calDate", "value", "calFile"]
rows = []
for c in calInfo["calibration"]:
    for cc in c["calData"]:
        rows.append({
            "uid": cc["assetUid"],
            "calCoef": cc["eventName"],
            "calDate": convert_time(cc["eventStartTime"]),
            "value": cc["value"],
            "calFile": cc["dataSource"]
        })
# Build the frame in one step (DataFrame.append was removed in pandas 2.0)
instrumentCals = pd.DataFrame(rows, columns=columns)
instrumentCals.sort_values(by=["calDate", "calCoef"], inplace=True)
instrumentCals

Similarly, there is an included function ```get_calibrations_by_uid```:

In [None]:
get_calibrations_by_uid("CGINS-DOSTAD-00228")

---
## Metadata
Next, we can query OOINet for the metadata associated with the selected reference designator. The metadata contains valuable information such as the available **methods** and **streams** (which are required to download the data), the **particleKeys** (the data variable names), and the associated **units**.

In [None]:
metadata = get_metadata(refdes)
metadata

Now, there are a lot of different variables returned in the metadata for the Oxygen sensor. Unless we want to reprocess the raw data ourselves, we really just want the scientifically relevant parameters. But which ones are those?

Level 1, or L1 Data Products, are derived from L0 data, and provide data that has been calibrated using vendor-provided values or values derived from pre-deployment procedures, and that is in scientific units.

Example: Data from Aanderaa oxygen (DOSTA) sensors are converted from the L0 data (DCONCS_L0) to the L1 dissolved oxygen concentration data product (DCONCS_L1) internally using the manufacturer's conversion factors. While this is done onboard the oxygen optode, the L0 products, such as the amplitude, phase, etc., are available if desired.

Level 2, or L2 Data Products, are derived quantities created via an algorithm that draws on multiple L1 Data Products. L2 data products may be based on data from the same instrument or a combination of separate instruments.

Example: Level 1 temperature (TEMPWAT_L1) and salinity (PRACSAL_L1) data products are used in conjunction with the Level 1 dissolved oxygen concentration data product (DCONCS_L1) to produce a temperature and salinity corrected Level 2 dissolved oxygen concentration data product (DOXYGEN_L2).

**We recommend that end users work with Level 2 Data Products for analysis**, and use Level 0 and Level 1 products only in cases where the end user has a specific reason requiring these earlier-stage data products for their own data processing needs.

We can query for the relevant data product level from the "Preload Database" using the **pdId** field from the returned metadata. One note of caution: the data levels for some variables such as **time** are set as **None** even though time is a fundamental parameter. 

In [None]:
# Let's query "time" for its data level
preload_url = URLS["preload"] + "/" + "7"
print("API Endpoint: " + preload_url)

get_api(preload_url)

In [None]:
# Query the dissolved oxygen
preload_url = URLS["preload"] + "/" + "14"
print("API Endpoint: " + preload_url)

get_api(preload_url)

From above, we can see that **time** does not have a data level, whereas **dissolved_oxygen** is a **L2** level product. 

The function ```get_parameter_data_levels``` will take in the metadata you requested and return the relevant data level for each parameter:

In [None]:
data_levels = get_parameter_data_levels(metadata)
data_levels

With the returned parameter IDs, we can now filter for the L1 and L2 data levels:

In [None]:
def filter_science_parameters(metadata, data_levels):
    """This function returns the science parameters for each datastream"""
    
    def filter_parameter_ids(pdId, pid_dict):
        # Keep only parameters with a defined data level above L0
        data_level = pid_dict.get(pdId)
        return data_level is not None and data_level > 0
    
    # Filter the parameters for processed science parameters
    mask = metadata["pdId"].apply(lambda x: filter_parameter_ids(x, data_levels))
    metadata = metadata[mask]

    return metadata

In [None]:
science_variables = filter_science_parameters(metadata, data_levels)
science_variables

Notice that the timestamps and similar variables were filtered out. This is because they don't have a defined data product level. An additional wrinkle is that **```time```** is NOT the default dimension of delivered netCDF files, which means it must be explicitly included in data requests. This is something to be aware of when requesting only specific data variables.
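
One way to avoid forgetting **time** is to build the `parameters` string with a helper that always puts it first. This is a sketch under two assumptions: that parameter IDs may appear either as bare numbers or with a "PD" prefix (e.g. PD14), and that **time** has parameter ID 7 as shown above. The helper names are this example's own.

```python
def normalize_pdid(pdid):
    """Turn 14, "14", or "PD14" into the bare numeric string "14"."""
    text = str(pdid).strip()
    if text.upper().startswith("PD"):
        text = text[2:]
    return str(int(text))


def build_parameters(pdids, time_pdid=7):
    """Comma-separated parameter IDs, with time first and duplicates removed."""
    ids = [normalize_pdid(time_pdid)] + [normalize_pdid(p) for p in pdids]
    # dict.fromkeys drops duplicates while preserving order
    return ",".join(dict.fromkeys(ids))


print(build_parameters(["PD14"]))  # 7,14
print(build_parameters([14, 7]))   # 7,14
```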

---
## Requesting Data

### Synchronous Data
The fast(er) way to get data is via a **Synchronous** request. A synchronous request can accept the following specifications:
* limit (required): specifies number of data points with a maximum of 20000 
* beginDT (optional): start date as YYYY-mm-ddTHH:MM:SS.fffZ format
* endDT (optional): end date in same format as beginDT
* parameters (optional): numeric IDs of which parameters to get 

If you do not specify the limit, the request defaults to an **asynchronous** request which is covered below.
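
The specifications above can be assembled and sanity-checked before any request is sent. The sketch below only builds the parameter dictionary; the helper names are this example's own, and it expects timezone-aware datetimes:

```python
from datetime import datetime, timezone

MAX_SYNC_LIMIT = 20000  # maximum number of data points for a synchronous request


def format_dt(dt):
    """Format a timezone-aware datetime as YYYY-mm-ddTHH:MM:SS.fffZ (UTC)."""
    dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%S.") + f"{dt.microsecond // 1000:03d}Z"


def build_sync_params(limit, begin=None, end=None, parameters=None):
    """Build the query parameters for a synchronous request."""
    if not 0 < limit <= MAX_SYNC_LIMIT:
        raise ValueError(f"limit must be between 1 and {MAX_SYNC_LIMIT}")
    params = {"limit": str(limit)}
    if begin is not None:
        params["beginDT"] = format_dt(begin)
    if end is not None:
        params["endDT"] = format_dt(end)
    if parameters:
        params["parameters"] = parameters
    return params


sync_params = build_sync_params(
    20000,
    begin=datetime(2015, 1, 1, tzinfo=timezone.utc),
    end=datetime(2016, 1, 1, tzinfo=timezone.utc),
    parameters="7,14",
)
print(sync_params)
```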

We'll go through the steps to narrow down to get just the oxygen data for the year 2015 from the Global Ocean Station Papa Flanking Mooring A oxygen sensor. The request we would build would have the following specifications:
* limit: 20000
* beginDT: 2015-01-01T00:00:01.000Z
* endDT: 2016-01-01T00:00:01.000Z
* parameters: 7 (time), 14 (dissolved oxygen)


In [None]:
method = "recovered_host"
stream = "dosta_abcdjm_sio_instrument_recovered"

# Build the url to request the oxygen data from the selected stream
data_url = "/".join((URLS["data"], site, node, sensor, method, stream))

params = {
    "beginDT": "2015-01-01T00:00:01.000Z",
    "endDT": "2016-01-01T00:00:01.000Z",
    "limit": "20000"
}

data = get_api(data_url, params)
data

We'll put the data, which is in a JSON object, into a pandas dataframe for easier reading and parsing:

In [None]:
df = pd.DataFrame(data)
df.head(10)

Query all of the different column names which are the different parameters returned in the data request we just made:

In [None]:
df.columns

That is a lot of data that was returned that we aren't necessarily interested in. Let's narrow our request to just the **dissolved_oxygen**, which from above we know has a parameter ID of PD14:

In [None]:
# Select the method and stream
method = "recovered_host"
stream = "dosta_abcdjm_sio_instrument_recovered"

# Build the base url to request the data
data_url = "/".join((URLS["data"], site, node, sensor, method, stream))

# Add in the limits to the data request
params = {
    "beginDT": "2015-01-01T00:00:01.000Z",
    "endDT": "2016-01-01T00:00:01.000Z",
    "limit": "20000",
    "parameters": "14"
}

# Request the data
data = get_api(data_url, params)
data

Put the data, which is in a JSON object, into a pandas dataframe:

In [None]:
df = pd.DataFrame(data)
df.head()

We requested just the dissolved oxygen data, but we also got **ctdmo_ghqr_sio_mule_instrument-ctdmo_seawater_temperature** and **ctdmo_ghqr_sio_mule_instrument-practical_salinity** even though we didn't request them. So what are they? They are the seawater temperature and practical salinity needed to calculate the dissolved oxygen concentration from the DO measured by the instrument. The part before the **-** in each name tells you the data stream the parameter comes from.
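
That naming convention makes it easy to separate the source stream from the parameter name. A minimal sketch (the helper name is this example's own):

```python
def split_column(name):
    """Split "stream-parameter" into (stream, parameter); stream is None if absent."""
    stream, sep, parameter = name.partition("-")
    if not sep:
        return None, name
    return stream, parameter


print(split_column("ctdmo_ghqr_sio_mule_instrument-ctdmo_seawater_temperature"))
# ('ctdmo_ghqr_sio_mule_instrument', 'ctdmo_seawater_temperature')
print(split_column("dissolved_oxygen"))
# (None, 'dissolved_oxygen')
```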

However, this time we don't have any **time** parameter! That's because we forgot to request it. So let's try this once more, including the **time** parameter id of **7**:

In [None]:
# Select the method and stream
method = "recovered_host"
stream = "dosta_abcdjm_sio_instrument_recovered"

# Build the base url to request the data
data_url = "/".join((URLS["data"], site, node, sensor, method, stream))

# Add in the limits to the data request, this time remembering to include "time" parameter "7"
params = {
    "beginDT": "2015-01-01T00:00:01.000Z",
    "endDT": "2016-01-01T00:00:01.000Z",
    "limit": "20000",
    "parameters": "7,14"
}

# Request the data
data = get_api(data_url, params)
data

In [None]:
# Put the data into a dataframe
df = pd.DataFrame(data)

# Convert the time stamps from NTP seconds (seconds since 1900-01-01) into datetime objects
df["time"] = df["time"].apply(ntp_seconds_to_datetime)

# Get the deployment number from the "pk" dictionary
df["deployment"] = df["pk"].apply(lambda x: int(x["deployment"]))    
df.head(10)

Now we have everything we could want in order to begin visualizing and analyzing the oxygen data we requested.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(21,15))

ax[0].plot(df["time"], df["dissolved_oxygen"], marker=".", linestyle="", color="tab:blue")
ax[0].set_ylabel("Dissolved Oxygen \n[umol/kg]", fontsize=16, weight="bold")
ax[0].set_ylim((250, 400))
ax[0].grid()

ax[1].plot(df["time"], df["ctdmo_ghqr_sio_mule_instrument-ctdmo_seawater_temperature"], marker=".", linestyle="", color="tab:red")
ax[1].set_ylabel("Seawater Temperature \n[deg_C]", fontsize=16, weight="bold")
ax[1].grid()

ax[2].plot(df["time"], df["ctdmo_ghqr_sio_mule_instrument-practical_salinity"], marker=".", linestyle="", color="tab:green")
ax[2].set_ylabel("Practical Salinity", fontsize=16, weight="bold")
ax[2].set_ylim((31.5, 33.5))
ax[2].grid()

Looking at the data, there seems to be a lot of noise in the oxygen data from the first 5 months of 2015, while there is a lot of noise from June - October in the salinity and temperature data. Wonder what could be going on? The first place to check is the **annotations**.

---
## Annotations
Annotations are technical notes or qualitative data assessments of the instrument added by staff from the institutions operating the sensors. They represent the first human-in-the-loop (HITL) quality control review of the data coming from the sensor, and may contain important information about the state of the instrument, such as the presence of biofouling, power issues, communications disruptions, and other such issues. 

Annotations are ideal for removing known and identified bad data from a dataset before further processing. While it is not within the purview of OOI to comprehensively flag all such issues, such that additional end user QA/QC is required, existing annotations will provide valuable and time-saving information to support end user analysis.

An annotation downloaded from the OOI Data Portal is associated with a particular reference designator and not an individual instrument. It may also be further limited to a particular stream for a given reference designator, such as the pco2w_abc_dcl_instrument_recovered stream for recovered data from the SAMI-pCO2 instrument, as well as further limited to particular parameters. Annotations are either open-ended, with a start time (beginDT) and no end time (endDT), or may have both a start and end time. Times are returned in unix epoch milliseconds. Lastly, a qcFlag may be assigned to a particular annotation following the QARTOD flagging conventions.

In [None]:
print(f"The annotation endpoint is: {URLS['anno']}")

Now construct the annotation request. Once we get the annotations, we have to convert the beginDT and endDT from the unix epoch time milliseconds into a readable time stamp:

In [None]:
params = {
    "beginDT": "2015-01-01T00:00:01.000Z",
    "endDT": "2016-01-01T00:00:01.000Z",
}

annotations = get_annotations(refdes, **params)
# Convert beginDT and endDT from unix epoch milliseconds into datetimes
annotations["beginDT"] = annotations["beginDT"].apply(convert_time)
annotations["endDT"] = annotations["endDT"].apply(convert_time)
annotations

Print out the annotation text for each row in the table above:

In [None]:
for index in annotations.index:
    start, stop, anno = annotations.loc[index, "beginDT"], annotations.loc[index, "endDT"], annotations.loc[index, "annotation"]
    print(f"{start} to {stop}: {anno}")

So now the noise in the oxygen data makes sense. There was biofouling! 
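
Once the affected time spans are known, they can be used to drop the flagged samples before further analysis. The sketch below works on plain tuples with made-up numbers (times in unix epoch milliseconds) and treats an annotation with no end time as open-ended. Which annotations justify removing data is left to the user; the helper names are this example's own.

```python
def is_annotated(t_ms, intervals):
    """True if t_ms falls inside any (begin, end) interval; end=None means open-ended."""
    return any(
        begin <= t_ms and (end is None or t_ms <= end)
        for begin, end in intervals
    )


def drop_annotated(samples, intervals):
    """Keep only the (time, value) samples outside every annotated interval."""
    return [(t, v) for t, v in samples if not is_annotated(t, intervals)]


# Made-up intervals and samples, for illustration only
intervals = [(1000, 2000), (5000, None)]
samples = [(500, 1.0), (1500, 2.0), (3000, 3.0), (6000, 4.0)]

kept = drop_annotated(samples, intervals)
print(kept)  # [(500, 1.0), (3000, 3.0)]
```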

---
## Asynchronous Data Request

Asynchronous data requests are not limited in the number of data points that you can request. Additionally, they allow you to request **netCDF** and **csv** data formats as well as **JSON**. However, they are slower than synchronous data requests and, depending on the dataset, can be very large. The available request specifications include:
* format (optional): output data format; if not specified, defaults to netCDF
* beginDT (optional): start date as YYYY-mm-ddTHH:MM:SS.fffZ format
* endDT (optional): end date in same format as beginDT
* parameters (optional): numeric IDs of which parameters to get 
* include_provenance (optional, default False): include a provenance file which specifies data processing paths
* include_annotations (optional, default False): include a file with data annotations 

For the example we walked through above with the synchronous request, we can similarly request the asynchronous version to get netCDF datasets. Our specifications will be:
* beginDT: 2015-01-01T00:00:01.000Z
* endDT: 2016-01-01T00:00:01.000Z
* parameters: 7 (time), 14 (dissolved oxygen)
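
To see how these specifications end up in a request, the sketch below encodes them as a URL query string using only the standard library. The base URL and the reference designator pieces are placeholders (assumptions for illustration, not values from this notebook). In the notebook, the params dictionary is simply passed along with the request.

```python
from urllib.parse import urlencode

# Placeholder base URL and reference designator parts (assumptions for illustration)
base_url = "https://example.org/api/m2m/12576/sensor/inv"
site, node, sensor = "SITE", "NODE", "SENSOR"
method, stream = "method", "stream"

params = {
    "beginDT": "2015-01-01T00:00:01.000Z",
    "endDT": "2016-01-01T00:00:01.000Z",
    "parameters": "7,14",
}

# The request path is the reference designator, method, and stream joined by "/"
data_url = "/".join((base_url, site, node, sensor, method, stream))
# urlencode percent-encodes reserved characters such as ":" and ","
query = urlencode(params)
request_url = f"{data_url}?{query}"
print(request_url)
```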


In [None]:
# Asynchronous 
data_url = "/".join((URLS["data"], site, node, sensor, method, stream))

params = {
    "beginDT": "2015-01-01T00:00:01.000Z",
    "endDT": "2016-01-01T00:00:01.000Z",
    "parameters": "7,14"
}

In [None]:
from halo import HaloNotebook
import time

In [None]:
with HaloNotebook(text="Waiting for request to process", spinner="clock"):
    # Get the urls
    urls = get_api(data_url, params=params)
    # Check the status of the dataset preparation
    status_url = [url for url in urls["allURLs"] if re.match(r'.*async_results.*', url)][0]
    status_url = status_url + "/status.txt"
    status = SESSION.get(status_url)
    # Hold until the dataset construction is finished
    while status.status_code != requests.codes.ok:
        time.sleep(2)
        status = SESSION.get(status_url)
        
# Now fetch the thredds_url from the list of returned URLs
for d in urls['allURLs']:
    if 'thredds' in d:
        thredds_url = d

thredds_url
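
The loop above waits indefinitely, which can hang the notebook if a request never completes. As a hedged alternative, here is a small polling helper with a timeout. The function name wait_until_ready and its arguments are hypothetical; the check it polls can be any callable that returns True once the status file is available.

```python
import time


def wait_until_ready(is_ready, timeout=600, interval=2):
    """Poll is_ready() until it returns True or timeout seconds have elapsed.

    Hypothetical helper: returns True if the check succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for the status check: reports "ready" on the third poll
polls = {"count": 0}


def fake_status_check():
    polls["count"] += 1
    return polls["count"] >= 3


# In the notebook, the check could instead be:
#   lambda: SESSION.get(status_url).status_code == requests.codes.ok
print(wait_until_ready(fake_status_check, timeout=5, interval=0.01))  # True
```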

The URL we want is the THREDDS catalog entry in the 'allURLs' list above. We then have to parse that catalog page for the netCDF files.

In [None]:
from bs4 import BeautifulSoup

In [None]:
# Fetch the catalog page and collect the links whose text ends in ".nc"
page = requests.get(thredds_url).text
soup = BeautifulSoup(page, "html.parser")
pattern = re.compile(r".*\.nc$")
catalog = sorted([node.get("href") for node in soup.find_all("a", string=pattern)])
catalog

However, notice that some of the datasets are NOT oxygen data. These include CTDMOG040, which supplies the practical salinity and temperature data needed to calculate the oxygen concentration, and a FLORT datastream. We'll filter those out of the catalog, leaving just the DOSTA datasets.

In [None]:
# Get rid of the unwanted datasets
catalog = [x for x in catalog if refdes in x.split("/")[-1]] 
catalog

In order to download the data, we want to get the catalog files from the **fileServer** url. Then we can download the netCDF files to whatever directory we want.

In [None]:
from urllib.request import urlretrieve

In [None]:
# To download, we need the fileServer
fileServer = URLS["fileServer"]
# Swap the catalog prefix for the fileServer address (raw string keeps the regex escapes valid)
netCDF_files = [re.sub(r"catalog\.html\?dataset=", fileServer, file) for file in catalog]

# Make a save directory
saveDir = f"../data/{refdes}/"
os.makedirs(saveDir, exist_ok=True)

for file in netCDF_files:
    filename = file.split("/")[-1]
    saveFile = os.path.join(saveDir, filename)
    print(f"Saving {filename} to {saveFile} \n")
    urlretrieve(file, saveFile)
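
To make the substitution concrete, here is a sketch on a single made-up catalog entry. The href, the file name, and the fileServer address below are illustrative assumptions, not values returned by this request; the actual fileServer address comes from URLS["fileServer"].

```python
import re

# Made-up example values (assumptions for illustration only)
file_server = "https://example.org/thredds/fileServer/"
href = "catalog.html?dataset=ooi/user/request-id/deployment0001_example.nc"

# Replace the catalog prefix with the fileServer address to get a direct download link
download_url = re.sub(r"catalog\.html\?dataset=", file_server, href)
# The last path component is the name the file is saved under
filename = download_url.split("/")[-1]

print(download_url)
print(filename)
```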

#### Packaged functions
The steps outlined above have been wrapped into several easier-to-use functions in the package that accompanies this tutorial.

In [None]:
# First, get the thredds_url
thredds_url = get_thredds_url(refdes, method, stream, goldCopy=False, beginDT=params["beginDT"], endDT=params["endDT"], parameters=params["parameters"])

# Second, access the catalog
catalog = get_thredds_catalog(thredds_url)

# Next, get rid of the unwanted datasets
catalog = [x for x in catalog if refdes in x.split("/")[-1]] 

# Lastly, download the datasets
saveDir = f"../data/{refdes}/"
download_netCDF_files(catalog, goldCopy=False, saveDir=saveDir)

There are several ways to open the data. You might be tempted to use the ```xarray.open_mfdataset``` function to open all of the datasets at once as a single dataset. This will fail because moorings are deployed such that the new mooring goes into the water before the old mooring is recovered, leading to overlapping time periods, and ```open_mfdataset``` requires a monotonically increasing primary dimension. You can avoid this by using a ```preprocess``` routine to trim the overlapping portions of the datasets, but then you potentially lose valuable data from when two instruments are in the water at the same time!
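
The overlap is easy to picture with plain time ranges. The sketch below uses made-up deployment dates (assumptions, not the actual deployment history of this instrument) and flags consecutive deployments where the next one starts before the previous one ends:

```python
from datetime import datetime

# Made-up (start, end) times for three consecutive deployments
deployments = [
    (datetime(2014, 9, 10), datetime(2015, 8, 20)),
    (datetime(2015, 8, 15), datetime(2016, 7, 18)),
    (datetime(2016, 7, 20), datetime(2017, 8, 1)),
]

# Sort by start time, then compare each deployment with the one after it
deployments.sort(key=lambda d: d[0])
overlaps = []
for (start_a, end_a), (start_b, end_b) in zip(deployments, deployments[1:]):
    if start_b < end_a:
        # Both instruments were in the water from start_b until the earlier end time
        overlaps.append((start_b, min(end_a, end_b)))

for start, end in overlaps:
    print(f"Overlap from {start:%Y-%m-%d} to {end:%Y-%m-%d}")
```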

Instead, we can concatenate the datasets together. However, this is recommended only for smaller datasets; long time series or large datasets, especially from profilers, can cause you to run out of working memory.

In [None]:
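netCDF_files = [os.path.join(saveDir, x) for x in sorted(os.listdir(saveDir))]
datasets = []
for file in netCDF_files:
    ds = xr.open_dataset(file)
    # Make time the primary dimension so the files can be joined along it
    ds = ds.swap_dims({"obs": "time"})
    datasets.append(ds)

# Concatenate all of the deployments along the time dimension
new_ds = xr.concat(datasets, dim="time")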

In [None]:
new_ds = new_ds.sortby("time")
new_ds