## Objective

* Download OOI Array CTD data
* Transform related variables for machine learning QC

## Method

* Take [Global Station Papa](http://ooi.visualocean.net/regions/view/GP) data as a testbed
* Download major CTD variables (e.g., temperature, salinity, depth) and save them as a csv file
* Output QC flags (normal or suspicious)



Here, I download CTD data from [OOI Station PAPA](https://ooinet.oceanobservatories.org/data_access/?search=Global%20Station%20Papa).

In [35]:
import warnings
warnings.filterwarnings('ignore')
import requests
import time
import numpy as np
from thredds_crawler.crawl import Crawl
import os
import xarray as xr
import matplotlib.pyplot as plt
import pandas as pd


# OOI data access settings
# OOI data team account
username = 'OOIAPI-D8S960UXPK4K03'
token = 'IXL48EQ2XY'
base_url = 'https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv'

We only need to request the data once. The requested data are downloaded to a THREDDS server and stay there for at least six months. The next time we need the same data, we can simply download them from the THREDDS server.
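One way to avoid repeating the request is to keep the JSON response (which holds the THREDDS addresses in `allURLs`) in a local file. The following is a minimal sketch, not part of the original workflow: the helper name `load_or_request` and the cache file are hypothetical, and `make_request` stands for whatever callable performs the actual M2M request.

```python
import json
import os


def load_or_request(cache_path, make_request):
    """Return the cached response if it exists; otherwise request and cache it.

    cache_path   -- local JSON file (hypothetical name chosen by the caller)
    make_request -- zero-argument callable returning the decoded JSON response
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    data = make_request()
    with open(cache_path, 'w') as f:
        json.dump(data, f)
    return data


# Possible use with the request made later in this notebook:
# data = load_or_request(
#     'papa_ctd_request.json',
#     lambda: requests.get(data_request_url, params=params,
#                          auth=(username, token)).json())
```

The cached addresses are only useful while the files remain on the THREDDS server, so the cache file should be deleted once they expire.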

In [41]:
# Find device address. Follow: https://github.com/ooi-data-review/m2m_demo/blob/master/notebooks/netcdf_data_request.ipynb
# You need to change this if you're looking for other instruments.
# See http://ooi.visualocean.net/instruments/view/GP02HYPM-WFP02-04-CTDPFL000
array_name = 'Global-Station-Papa'
refdes = 'GP02HYPM-RIM01-02-CTDMOG039'  # CTD
method = 'telemetered'
stream = 'ctdmo_ghqr_sio_mule_instrument'
beginDT = '2015-06-04T16:30:01.000Z'
endDT = '2016-08-25T04:00:01.000Z'

# Build the full data request URL.
data_request_url = '/'.join((base_url, refdes[:8], refdes[9:14], refdes[15:], method, stream))
params = {
    'beginDT': beginDT,
    'endDT': endDT,
}
print(data_request_url)
r = requests.get(data_request_url, params=params, auth=(username, token))
data = r.json()

https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv/GP02HYPM/RIM01/02-CTDMOG039/telemetered/ctdmo_ghqr_sio_mule_instrument
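The fixed slice positions above work because the reference designator has the form `subsite-node-sensor`. As a sketch, the same URL can be built by splitting on the first two hyphens instead, which avoids counting characters; the function name `build_request_url` is hypothetical.

```python
def build_request_url(base_url, refdes, method, stream):
    """Build the M2M request URL from a reference designator.

    The designator is split into subsite, node and sensor at the first
    two hyphens, e.g. 'GP02HYPM-RIM01-02-CTDMOG039' becomes
    'GP02HYPM', 'RIM01' and '02-CTDMOG039'.
    """
    subsite, node, sensor = refdes.split('-', 2)
    return '/'.join((base_url, subsite, node, sensor, method, stream))


print(build_request_url(
    'https://ooinet.oceanobservatories.org/api/m2m/12576/sensor/inv',
    'GP02HYPM-RIM01-02-CTDMOG039',
    'telemetered',
    'ctdmo_ghqr_sio_mule_instrument'))
```

This prints the same address as the cell above.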


In [42]:
# Check whether the data download is complete.
%time

check_complete  = data['allURLs'][1] + '/status.txt'
for i in range(1000): 
    r = requests.get(check_complete)
    if r.status_code == requests.codes.ok:
        print('request completed!')
        break
    else:
        time.sleep(.5)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 13.6 µs
request completed!
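The loop above ends silently if `status.txt` never appears within the 1000 attempts. A possible variant, sketched here with a hypothetical helper `wait_until`, returns a boolean so the caller can tell success from a timeout. The check itself is passed in as a callable, so the helper does not depend on `requests`.

```python
import time


def wait_until(check, attempts=1000, delay=0.5, sleep=time.sleep):
    """Call check() until it returns True or the attempts run out.

    Returns True as soon as check() succeeds, or False if it never does.
    The sleep argument exists only so the wait can be replaced in tests.
    """
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # pause before the next attempt
    return False


# Possible use with the status file from the cell above:
# done = wait_until(
#     lambda: requests.get(check_complete).status_code == requests.codes.ok)
# print('request completed!' if done else 'request still pending')
```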


Once the data request has completed successfully, we can extract the data from the THREDDS server.

In [27]:
# Get the data URL for the NetCDF file dataset from THREDDS server.
url = data['allURLs'][0]  # This is the THREDDS server address.
print('THREDDS server: ' + url)
url = url.replace('.html', '.xml')
tds_url = 'https://opendap.oceanobservatories.org/thredds/dodsC'
c = Crawl(url, select=[r".*\.nc$"], debug=False)  # raw string keeps the regex escape intact
datasets = [os.path.join(tds_url, x.id) for x in c.datasets]
print(datasets)

THREDDS server: https://opendap.oceanobservatories.org/thredds/catalog/ooi/ooidatateam@gmail.com/20180823T175922-GP02HYPM-WFP02-04-CTDPFL000-telemetered-ctdpf_ckl_wfp_instrument/catalog.html
['https://opendap.oceanobservatories.org/thredds/dodsC/ooi/ooidatateam@gmail.com/20180823T175922-GP02HYPM-WFP02-04-CTDPFL000-telemetered-ctdpf_ckl_wfp_instrument/deployment0003_GP02HYPM-WFP02-04-CTDPFL000-telemetered-ctdpf_ckl_wfp_instrument_20150605T040003-20160104T190808.954545.nc', 'https://opendap.oceanobservatories.org/thredds/dodsC/ooi/ooidatateam@gmail.com/20180823T175922-GP02HYPM-WFP02-04-CTDPFL000-telemetered-ctdpf_ckl_wfp_instrument/deployment0002_GP02HYPM-WFP02-04-CTDPFL000-telemetered-ctdpf_ckl_wfp_instrument_20140622T050003-20141018T190930.571428.nc', 'https://opendap.oceanobservatories.org/thredds/dodsC/ooi/ooidatateam@gmail.com/20180823T175922-GP02HYPM-WFP02-04-CTDPFL000-telemetered-ctdpf_ckl_wfp_instrument/deployment0001_GP02HYPM-WFP02-04-CTDPFL000-telemetered-ctdpf_ckl_wfp_instrume
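Note that `os.path.join` uses the operating system's path separator, so on Windows it would put a backslash into the URL. A minimal sketch of an alternative that always joins with `/` and can also apply a filename filter follows; the helper name `opendap_urls` and the dataset ids in the example are hypothetical.

```python
import re


def opendap_urls(tds_url, dataset_ids, pattern=r'.*\.nc$'):
    """Turn THREDDS dataset ids into OPeNDAP URLs, keeping ids that match pattern."""
    regex = re.compile(pattern)
    return [tds_url.rstrip('/') + '/' + ds_id.lstrip('/')
            for ds_id in dataset_ids
            if regex.match(ds_id)]


# Hypothetical ids, shaped like the ones returned by the crawler.
example_ids = ['ooi/user/request/deployment0001_example.nc',
               'ooi/user/request/deployment0001_example.ncml']
print(opendap_urls('https://opendap.oceanobservatories.org/thredds/dodsC', example_ids))
```

With the crawler result from the cell above, the call would be `opendap_urls(tds_url, [x.id for x in c.datasets])`.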

In [43]:
# Do some filtering if needed.
# sci_datasets = list(filter(lambda x: 'ENG000000' in x, datasets))
# Read the data using xarray.
ds = xr.open_mfdataset(datasets)
#ds = ds.swap_dims({'obs': 'time'})
ds

<xarray.Dataset>
Dimensions:                                      (obs: 28919)
Coordinates:
  * obs                                          (obs) int64 0 1 2 3 4 5 6 7 ...
    time                                         (obs) datetime64[ns] dask.array<shape=(28919,), chunksize=(5166,)>
    pressure                                     (obs) float64 dask.array<shape=(28919,), chunksize=(5166,)>
    lat                                          (obs) float64 dask.array<shape=(28919,), chunksize=(5166,)>
    lon                                          (obs) float64 dask.array<shape=(28919,), chunksize=(5166,)>
Data variables:
    deployment                                   (obs) int32 dask.array<shape=(28919,), chunksize=(5166,)>
    id                                           (obs) |S64 dask.array<shape=(28919,), chunksize=(5166,)>
    conductivity                                 (obs) float64 dask.array<shape=(28919,), chunksize=(5166,)>
    driver_timestamp                          

In [44]:
# Useful variables. Use L1 data and QC flags.
select_var = ['time', 'lon', 'lat',
              'ctdpf_ckl_seawater_temperature', 'ctdpf_ckl_seawater_conductivity', 'ctdpf_ckl_seawater_pressure',
              'practical_salinity', 'density', 'density_qc_executed', 'density_qc_results',
              'practical_salinity_qc_executed', 'practical_salinity_qc_results',
              'ctdpf_ckl_seawater_pressure_qc_executed', 'ctdpf_ckl_seawater_pressure_qc_results',
              'ctdpf_ckl_seawater_temperature_qc_executed', 'ctdpf_ckl_seawater_temperature_qc_results',
              'ctdpf_ckl_seawater_conductivity_qc_executed', 'ctdpf_ckl_seawater_conductivity_qc_results']
df = ds[select_var].to_dataframe()
# Drop the 'pressure' coordinate column; pressure is kept via 'ctdpf_ckl_seawater_pressure'.
df.drop(columns=['pressure'], inplace=True)
#df.head()
# Rename the columns to shorter names (same order as select_var).
df.columns = ['time', 'lon', 'lat', 'sea_water_temperature', 'sea_water_conductivity', 'sea_water_pressure',
              'sea_water_salinity', 'sea_water_density', 'density_qc_executed', 'density_qc_results',
              'salinity_qc_executed', 'salinity_qc_results', 'pressure_qc_executed', 'pressure_qc_results',
              'temperature_qc_executed', 'temperature_qc_results', 'conductivity_qc_executed', 'conductivity_qc_results']

QC table
```
Test name              Bit position
                         15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
global_range_test         0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
dataqc_localrangetest     0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
dataqc_spiketest          0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0
dataqc_polytrendtest      0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
dataqc_stuckvaluetest     0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0
dataqc_gradienttest       0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
dataqc_propagateflags     0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
```

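Each bit in `qc_executed` marks a test that was run, and the same bit in `qc_results` marks that the test passed. If `qc_executed` is not equal to `qc_results` for a record, at least one executed test did not pass, so we mark the record as `suspicious`.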
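To see which tests lie behind a value such as `29`, the bit positions from the QC table can be decoded by name. This is a minimal sketch with hypothetical helper names (`decode_qc`, `failed_tests`). It assumes that a set bit in `qc_results` means the corresponding test passed, and that the inputs are not NaN.

```python
# Bit positions copied from the QC table above.
QC_TESTS = {
    0: 'global_range_test',
    1: 'dataqc_localrangetest',
    2: 'dataqc_spiketest',
    3: 'dataqc_polytrendtest',
    4: 'dataqc_stuckvaluetest',
    5: 'dataqc_gradienttest',
    7: 'dataqc_propagateflags',
}


def decode_qc(value):
    """Return the names of the tests whose bits are set in a QC value."""
    bits = int(value)
    return [name for bit, name in sorted(QC_TESTS.items()) if bits & (1 << bit)]


def failed_tests(executed, results):
    """Return tests that were executed but whose result bit is not set.

    Assumption: a set bit in the results value means the test passed.
    """
    return decode_qc(int(executed) & ~int(results))


print(decode_qc(29))         # tests encoded by the value 29
print(failed_tests(29, 28))  # executed tests missing from the results
```

For example, `29` is `0b11101`, so it covers the global range, spike, polynomial trend, and stuck value tests.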

In [45]:
# Set a flag for each variable: True = normal, False = suspicious.
df['pressure_flag'] = df['pressure_qc_executed']==df['pressure_qc_results']
df['temperature_flag'] = df['temperature_qc_executed']==df['temperature_qc_results']
df['conductivity_flag'] = df['conductivity_qc_executed']==df['conductivity_qc_results']
df['density_flag'] = df['density_qc_executed']==df['density_qc_results']
df['salinity_flag'] = df['salinity_qc_executed']==df['salinity_qc_results']

df.head()

| obs | time | lon | lat | sea_water_temperature | sea_water_conductivity | sea_water_pressure | sea_water_salinity | sea_water_density | density_qc_executed | density_qc_results | ... | pressure_qc_results | temperature_qc_executed | temperature_qc_results | conductivity_qc_executed | conductivity_qc_results | pressure_flag | temperature_flag | conductivity_flag | density_flag | salinity_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-06-05 04:00:03.000000000 | -144.803 | 50.07983 | 4.2321 | 3.18064 | 206.58 | 33.760779 | 1027.74527 | 29.0 | 29.0 | ... | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | True | True | True | True | True |
| 1 | 2015-06-05 04:06:07.523809280 | -144.803 | 50.07983 | 3.9922 | 3.17126 | 280.74 | 33.857878 | 1028.193595 | 29.0 | 29.0 | ... | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | True | True | True | True | True |
| 2 | 2015-06-05 04:12:12.047619072 | -144.803 | 50.07983 | 3.7849 | 3.16747 | 387.67 | 33.971107 | 1028.803508 | 29.0 | 29.0 | ... | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | True | True | True | True | True |
| 3 | 2015-06-05 04:18:16.571428352 | -144.803 | 50.07983 | 3.6757 | 3.17085 | 493.93 | 34.069008 | 1029.38644 | 29.0 | 29.0 | ... | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | True | True | True | True | True |
| 4 | 2015-06-05 04:24:21.095238144 | -144.803 | 50.07983 | 3.5903 | 3.17542 | 599.93 | 34.157043 | 1029.956567 | 29.0 | 29.0 | ... | 29.0 | 29.0 | 29.0 | 29.0 | 29.0 | True | True | True | True | True |


In [33]:
df.to_csv(array_name+'_'+refdes+'_'+method+'_'+stream+'.csv')

Take a look at how many suspicious records we have in this dataset.

In [46]:

variables = ['temperature', 'conductivity', 'pressure', 'salinity', 'density']
# A flag of False marks a suspicious record, so count the False values per variable.
susp_number = [int((~df[v + '_flag']).sum()) for v in variables]
susp_record = {'suspicious_number': susp_number,
               'suspicious_rate': [n / df.shape[0] for n in susp_number]}
sp_df = pd.DataFrame(susp_record, index=variables)
sp_df

| | suspicious_number | suspicious_rate |
|---|---|---|
| temperature | 22 | 0.000761 |
| conductivity | 14219 | 0.491684 |
| pressure | 22 | 0.000761 |
| salinity | 22 | 0.000761 |
| density | 22 | 0.000761 |
