# Downloading pH glider data from the IOOS Glider Data Assembly Center

_Written by Lori Garzio, June 13, 2023_

[Rutgers Center for Ocean Observing Leadership](https://rucool.marine.rutgers.edu/) (RUCOOL)

This notebook demonstrates how to download NetCDF files from the [IOOS Glider DAC](https://gliders.ioos.us/) to a folder on your local machine using the [erddapy package](https://pypi.org/project/erddapy/). We will specify which variables to include in the file and download the delayed-mode, quality-controlled dataset from the [Glider DAC ERDDAP server](https://gliders.ioos.us/erddap/index.html). We use the delayed-mode datasets because the real-time datasets do not include calculated pH: pH is calculated during post-processing and QA/QC, and the resulting delayed-mode files are then submitted to the DAC for public access.

In this example, we will be downloading delayed-mode data from the [summer 2021 deployment of ru30](https://gliders.ioos.us/erddap/tabledap/ru30-20210716T1804-delayed.html), which is the deployment ID used in the code below. You can browse this ERDDAP data page to get information about the deployment, make some graphs, and download the dataset in a variety of formats. However, instead of using this form to manually download the file, we are going to use Python to download the dataset.

In [1]:
# import the required packages for data download
from erddapy import ERDDAP
import re
import os
import numpy as np
import xarray as xr

The first step is to get a list of variables that are included in this dataset. The function below returns a list of variables for a user-specified ERDDAP server and dataset ID.

In [2]:
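def get_dataset_variables(server, ds_id):
    """Return the names of all variables in an ERDDAP tabledap dataset."""
    e = ERDDAP(server=server,
               protocol='tabledap',
               response='nc')
    # _get_variables is a private erddapy helper (leading underscore),
    # so its behavior may change between erddapy releases
    var_dict = e._get_variables(dataset_id=ds_id)

    return list(var_dict.keys())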

In [3]:
ioos_server = 'https://gliders.ioos.us/erddap'
deployment_id = 'ru30-20210716T1804-delayed'

ds_vars = get_dataset_variables(ioos_server, deployment_id)
print(ds_vars)

['density_lag_shifted', 'profile_id', 'NC_GLOBAL', 'conductivity_lag_shifted', 'pH_corrected', 'v_qc', 'pressure', 'lon_uv', 'depth_qc', 'density_qc', 'salinity', 'instrument_flbbcd', 'beta_700nm', 'chlorophyll_a', 'oxygen_concentration_corrected', 'oxygen_saturation_corrected', 'depth', 'pH_reference_voltage_corrected', 'depth_interpolated', 'precise_lon', 'precise_time', 'salinity_interpolated', 'lat_uv_qc', 'temperature_interpolated', 'trajectory', 'time', 'oxygen_saturation_raw', 'conductivity_qc', 'pressure_qc', 'longitude', 'instrument_optode', 'lon_uv_qc', 'instrument_ctd', 'u_qc', 'pressure_interpolated', 'temperature_lag_shifted', 'latitude_qc', 'temperature', 'platform_meta', 'latitude', 'time_qc', 'pH_reference_voltage_raw', 'conductivity', 'aragonite_saturation_state', 'longitude_qc', 'precise_lat_qc', 'precise_lon_qc', 'pH_raw', 'precise_time_qc', 'cdom', 'salinity_lag_shifted', 'salinity_qc', 'v', 'total_alkalinity', 'density', 'lat_uv', 'oxygen_concentration_raw', 'wmo_i
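Because `_get_variables` is a private method, it may be useful to know that the same variable names are also listed in the dataset's ERDDAP info table. The sketch below is one possible way to read that table, assuming it is available as CSV text with `Row Type` and `Variable Name` columns. The helper name `variables_from_info_csv` and the sample rows are made up for illustration and are not copied from the real dataset.

```python
import csv
import io


def variables_from_info_csv(info_csv_text):
    """Return variable names listed in an ERDDAP dataset info table (CSV text)."""
    reader = csv.DictReader(io.StringIO(info_csv_text))
    # Rows whose "Row Type" is "variable" name one dataset variable each;
    # "attribute" rows (including the NC_GLOBAL ones) are skipped.
    return [row["Variable Name"] for row in reader if row["Row Type"] == "variable"]


# Tiny hand-written sample in the info-table layout (illustrative rows only)
sample_info = """Row Type,Variable Name,Attribute Name,Data Type,Value
attribute,NC_GLOBAL,title,String,example deployment
variable,time,,double,
attribute,time,units,String,seconds since 1970-01-01T00:00:00Z
variable,pH_corrected,,float,
"""

print(variables_from_info_csv(sample_info))
```

For a real dataset, erddapy's `get_info_url(dataset_id=..., response='csv')` should return the URL of this table. Unlike the list printed above, this approach leaves out the `NC_GLOBAL` entry, which holds global attributes rather than data.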

Next, we'll filter this list down to the variables that we actually want to download, so the file doesn't include variables we don't need.

In [4]:
search_vars = ['latitude', 'longitude', 'depth', 'aragonite_saturation_state', 'chlorophyll', 
               'density', 'instrument_', 'oxygen', 'pH', 'pressure', 'salinity', 'temperature',
               'total_alkalinity']

In [5]:
# join the search terms into one pattern; r.match only tests the start of each name
r = re.compile('|'.join(search_vars))
new_list = list(filter(r.match, ds_vars))
print(new_list)

['density_lag_shifted', 'pH_corrected', 'pressure', 'depth_qc', 'density_qc', 'salinity', 'instrument_flbbcd', 'chlorophyll_a', 'oxygen_concentration_corrected', 'oxygen_saturation_corrected', 'depth', 'pH_reference_voltage_corrected', 'depth_interpolated', 'salinity_interpolated', 'temperature_interpolated', 'oxygen_saturation_raw', 'pressure_qc', 'longitude', 'instrument_optode', 'instrument_ctd', 'pressure_interpolated', 'temperature_lag_shifted', 'latitude_qc', 'temperature', 'latitude', 'pH_reference_voltage_raw', 'aragonite_saturation_state', 'longitude_qc', 'pH_raw', 'salinity_lag_shifted', 'salinity_qc', 'total_alkalinity', 'density', 'oxygen_concentration_raw', 'temperature_qc', 'instrument_ph']
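The filter above relies on `re.match`, which only tests the beginning of each variable name. The small sketch below contrasts it with `re.search`, which looks anywhere in the name. The name `sci_temperature` is hypothetical and is only there to show the difference.

```python
import re

names = ['temperature', 'temperature_qc', 'sci_temperature', 'salinity']
pattern = re.compile('|'.join(['temperature', 'pH']))

# match() only succeeds when a search term starts the variable name
starts_with = [n for n in names if pattern.match(n)]
# search() succeeds when a search term appears anywhere in the name
contains = [n for n in names if pattern.search(n)]

print(starts_with)  # ['temperature', 'temperature_qc']
print(contains)     # ['temperature', 'temperature_qc', 'sci_temperature']
```

If a variable you expect is missing from the filtered list, check whether its name starts with one of the search terms.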


The new list is shorter than the full list of dataset variables, but it still contains QC variables. This dataset has already been quality controlled, so we don't need to apply any further QC and can remove those variables. We also want to add the 'time' variable, which our search terms did not pick up.

In [6]:
var_list = [x for x in new_list if not x.endswith('_qc')]
print(var_list)

['density_lag_shifted', 'pH_corrected', 'pressure', 'salinity', 'instrument_flbbcd', 'chlorophyll_a', 'oxygen_concentration_corrected', 'oxygen_saturation_corrected', 'depth', 'pH_reference_voltage_corrected', 'depth_interpolated', 'salinity_interpolated', 'temperature_interpolated', 'oxygen_saturation_raw', 'longitude', 'instrument_optode', 'instrument_ctd', 'pressure_interpolated', 'temperature_lag_shifted', 'temperature', 'latitude', 'pH_reference_voltage_raw', 'aragonite_saturation_state', 'pH_raw', 'salinity_lag_shifted', 'total_alkalinity', 'density', 'oxygen_concentration_raw', 'instrument_ph']


In [7]:
var_list.append('time')
print(var_list)

['density_lag_shifted', 'pH_corrected', 'pressure', 'salinity', 'instrument_flbbcd', 'chlorophyll_a', 'oxygen_concentration_corrected', 'oxygen_saturation_corrected', 'depth', 'pH_reference_voltage_corrected', 'depth_interpolated', 'salinity_interpolated', 'temperature_interpolated', 'oxygen_saturation_raw', 'longitude', 'instrument_optode', 'instrument_ctd', 'pressure_interpolated', 'temperature_lag_shifted', 'temperature', 'latitude', 'pH_reference_voltage_raw', 'aragonite_saturation_state', 'pH_raw', 'salinity_lag_shifted', 'total_alkalinity', 'density', 'oxygen_concentration_raw', 'instrument_ph', 'time']
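The three filtering steps above could also be wrapped in a single helper. This is only a sketch: the function name `select_variables` and its arguments are our own, and it uses `str.startswith` with a tuple of prefixes instead of a regular expression, which behaves the same way for plain search terms like the ones used here. It also avoids appending 'time' twice if the cell is run more than once.

```python
def select_variables(all_vars, wanted_prefixes, always_include=('time',)):
    """Keep names starting with a wanted prefix, drop *_qc, add required names once."""
    prefixes = tuple(wanted_prefixes)
    keep = [v for v in all_vars if v.startswith(prefixes) and not v.endswith('_qc')]
    for name in always_include:
        if name not in keep:  # guard against duplicates when a cell is rerun
            keep.append(name)
    return keep


# Short example list built from variable names that appear in this dataset
example_vars = ['pH_corrected', 'depth', 'depth_qc', 'time', 'time_qc', 'profile_id']
print(select_variables(example_vars, ['pH', 'depth']))  # ['pH_corrected', 'depth', 'time']
```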


Now that we have the full list of variables that we want to include in our file, we can download our dataset. The function below will generate an xarray dataset for a user-specified ERDDAP server and dataset ID, with options to return selected variables and other constraints. In this example, we'll only be specifying the variables we want, not other constraints (which could be a specific time range, for example). You can find examples of specifying additional constraints in the [erddapy documentation](https://pypi.org/project/erddapy/).

In [8]:
def get_erddap_dataset(server, ds_id, variables=None, constraints=None):
    e = ERDDAP(server=server,
               protocol='tabledap',
               response='nc')
    
    e.dataset_id = ds_id
    
    if constraints:
        e.constraints = constraints
    if variables:
        e.variables = variables
        
    ds = e.to_xarray()
    ds = ds.sortby(ds.time)
    
    return ds
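As a rough illustration of the `constraints` argument, erddapy expects a dictionary whose keys are a variable name followed by a comparison operator. The helper name `time_window_constraints` and the dates below are our own choices for the example, not values taken from the deployment metadata.

```python
def time_window_constraints(start, end):
    """Build an erddapy-style constraints dict limiting the request to a time range."""
    # Each key is the variable name followed by a comparison operator
    return {'time>=': start, 'time<=': end}


# Illustrative one-day window (assumed to fall within the deployment)
constraints = time_window_constraints('2021-07-20T00:00:00Z', '2021-07-21T00:00:00Z')
print(constraints)

# Passing it to the function defined above (needs network access, so left commented out):
# ds_subset = get_erddap_dataset(ioos_server, deployment_id,
#                                variables=var_list, constraints=constraints)
```

Requesting a shorter time window returns a smaller file, which may also make read timeouts less likely.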

**Please Note:** The code in the next cell accesses the DAC ERDDAP server and tries to return all of the data you requested. Because we're requesting a high-resolution delayed-mode dataset, the request sometimes fails with a very long error message. If the top of the error reads "TimeoutError" and the bottom reads "ReadTimeout: The read operation timed out", run the cell again. You might need to run it several times before the download succeeds.

In [15]:
# request only the variables selected above
ds = get_erddap_dataset(ioos_server, deployment_id, variables=var_list)
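Instead of rerunning the cell by hand after a read timeout, the retries could be automated. The sketch below is one possible approach and not part of erddapy: `call_with_retries` is a hypothetical helper. Which exception classes to retry on is an assumption, because the `ReadTimeout` error is raised by the HTTP library that erddapy uses and may not be a subclass of the built-in `TimeoutError`. Pass the appropriate class in `retry_on` if needed.

```python
import time


def call_with_retries(func, attempts=5, wait_seconds=10, retry_on=(TimeoutError,)):
    """Call func() until it succeeds, retrying on the listed exception types."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as err:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            print(f'Attempt {attempt} failed ({type(err).__name__}); '
                  f'retrying in {wait_seconds} s')
            time.sleep(wait_seconds)


# Usage with the download function from this notebook (commented out: needs network):
# ds = call_with_retries(
#     lambda: get_erddap_dataset(ioos_server, deployment_id, variables=var_list))
```

Waiting a few seconds between attempts gives the server time to recover before the next request.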

In [16]:
ds

We're going to format the dataset a little differently than what's provided, to remove the trajectory and profile dimensions. Because time, latitude, and longitude are stored once per profile, we're going to repeat each value so these variables are on the obs dimension rather than the profile dimension. Then we'll swap the obs dimension for time. This may take a few seconds to run if you have a large dataset, but it will make the dataset easier to work with for analyses.

In [17]:
# expand the time/lat/lon variables
ds = ds.drop_dims({'trajectory'})
profile_time = np.array([], dtype='datetime64[ns]')
profile_lat = np.array([])
profile_lon = np.array([])

for rs in ds.rowSize:
    new_time = np.repeat(rs.time.values, rs.values)
    new_lat = np.repeat(rs.latitude.values, rs.values)
    new_lon = np.repeat(rs.longitude.values, rs.values)
    profile_time = np.append(profile_time, new_time)
    profile_lat = np.append(profile_lat, new_lat)
    profile_lon = np.append(profile_lon, new_lon)
    
# add the variables to the dataset
attrs = {
    'comment': 'profile time',
}
da = xr.DataArray(profile_time, coords=ds.temperature.coords, dims=ds.temperature.dims,
                  name='time', attrs=attrs)
ds['time'] = da

attrs = {
    'comment': 'profile lat',
}
da = xr.DataArray(profile_lat, coords=ds.temperature.coords, dims=ds.temperature.dims,
                  name='latitude', attrs=attrs)
ds['latitude'] = da

attrs = {
    'comment': 'profile lon',
}
da = xr.DataArray(profile_lon, coords=ds.temperature.coords, dims=ds.temperature.dims,
                  name='longitude', attrs=attrs)
ds['longitude'] = da

# drop the profile dimension
ds = ds.drop_dims({'profile'})
ds = ds.swap_dims({'obs': 'time'})

ds
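To make the expansion step easier to follow, here is a toy version of the same idea with made-up values. Each profile stores one time and one latitude, and `rowSize` says how many observations belong to that profile. This sketch passes the whole array of counts to `np.repeat` in one call rather than looping over profiles. It assumes the counts are integers, which may require a cast for some datasets.

```python
import numpy as np

# Toy ragged array: 3 profiles holding 2, 3 and 1 observations (made-up values)
row_size = np.array([2, 3, 1])
profile_time = np.array(['2021-07-16T18:00', '2021-07-16T18:30', '2021-07-16T19:00'],
                        dtype='datetime64[ns]')
profile_lat = np.array([39.5, 39.6, 39.7])

# np.repeat accepts one repeat count per element, so no Python loop is needed
obs_time = np.repeat(profile_time, row_size)
obs_lat = np.repeat(profile_lat, row_size)

print(obs_lat)
print(len(obs_time) == row_size.sum())
```

The expanded arrays have one entry per observation, which is what allows them to be placed on the obs dimension.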

Now specify a location on your local machine to save the NetCDF file. The lines of code below will create a new folder named after the deployment_id in your specified save location, and will save the NetCDF file in that folder.

In [18]:
save_dir = '/Users/garzio/Documents/rucool/gliderdata'
save_dir = os.path.join(save_dir, deployment_id)
os.makedirs(save_dir, exist_ok=True)
fname = f'{deployment_id}.nc'
ds.to_netcdf(os.path.join(save_dir, fname))
print(f'File saved to: {os.path.join(save_dir, fname)}')

File saved to: /Users/garzio/Documents/rucool/gliderdata/ru30-20210716T1804-delayed/ru30-20210716T1804-delayed.nc
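The save location above is specific to one computer. As a sketch of a more portable alternative, the path could be built with `pathlib` relative to your home directory. The helper name `build_output_path` and the `~/gliderdata` folder are hypothetical.

```python
from pathlib import Path


def build_output_path(base_dir, deployment_id):
    """Create <base_dir>/<deployment_id>/ if needed and return the .nc file path inside it."""
    out_dir = Path(base_dir).expanduser() / deployment_id
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f'{deployment_id}.nc'


# Hypothetical base folder; '~' expands to the current user's home directory
# out_file = build_output_path('~/gliderdata', deployment_id)
# ds.to_netcdf(out_file)
```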
