# Introduction


This notebook describes a set of technical tasks and explains why each one is necessary.


## Technical topics covered

- Working with shallow profiler ascent/descent/rest cycles
- bash, text editor, git, GitHub
- running a Jupyter notebook server (code and markdown)
- Ordering, retrieving and cleaning datasets from OOI
- Deconstructing datasets into usable chunks
- Basic Python as the baseline layer of the code
    - Start by putting all code in notebooks
    - Notebooks can become over-long or very code-dense so...
        - Be prepared to break notebooks apart into smaller notebooks
        - Be prepared to migrate blocks of working code to module files
- Install Python data science libraries for plotting, opening datasets, slicing
    - starting with `matplotlib`, `numpy`, `pandas`, and `xarray`
- As needed bring in Python extension libraries
    - interactive widgets, maps, animation, color maps
- Pulling other data (besides shallow profiler) from the OOI data system
- Pulling datasets from other programs: ARGO, MODIS, GLODAP, ROMS, MSLA, etcetera
- Using binder as an ephemeral executable sandbox
- Working from larger extra-repo datasets


### Working with shallow profiler ascent/descent/rest cycles


This topic is out of sequence intentionally. The topics that follow are in more of a logical order.




The issue at hand is that the shallow profiler ascends and descends and rests about nine times per
day; but the time of day when these events happen is not perfectly fixed. As a result we need 
a means to identify the start and end times of (say) an ascent so that we can be confident that
the data were in fact acquired as the profiler ascended through the water column. This is also 
useful for comparing ascent to descent data or comparing profiler-at-rest data to platform data
(since the profiler is at rest *on* the platform).



To restate the task: From a conceptual { time window } we would like very specific { metadata }
for time windows when the profiler ascended while collecting data. 
That is, we want accurate subsidiary time windows for successive profiles within our conceptual
time window; per site and year.
We can then use these specific { time window } boundaries to select data
subsets from corresponding profiling runs. 



The first step in this process is to get CTD data for the shallow profiler since it will have a
record of depth over time. This record is scanned in one-year chunks to identify the UTC start
times of each successive profile. Also determined: the start times of descents and the start times of rests.



From these three sets of timestamps we can make the assumption that the end of an 
ascent corresponds to the start of a descent. Likewise the end of a descent is the start of 
a rest; and the start of an ascent is the end of the previous rest. Each ascent / descent / rest
interval is considered as one profile (in that order). The results are written to a CSV file
that has one row of timing metadata per profile. 
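
The chaining rule just described can be sketched in a few lines of pandas. This is a minimal illustration, not the scanning code used in this repo: the DataFrame name `starts` and the timestamps in it are made up for the example, and only the column names follow the convention used in this notebook.

```python
import pandas as pd

# Hypothetical start times for two consecutive profiles (illustrative values only)
starts = pd.DataFrame({
    "ascent_start":  pd.to_datetime(["2019-07-04 15:47", "2019-07-04 18:27"]),
    "descent_start": pd.to_datetime(["2019-07-04 16:50", "2019-07-04 19:30"]),
    "rest_start":    pd.to_datetime(["2019-07-04 17:10", "2019-07-04 19:50"]),
})

profiles = starts.copy()
profiles["ascent_end"] = profiles["descent_start"]    # an ascent ends where the descent begins
profiles["descent_end"] = profiles["rest_start"]      # a descent ends where the rest begins
# A rest ends where the *next* ascent begins; the last row has no successor, so it is NaT
profiles["rest_end"] = profiles["ascent_start"].shift(-1)

# Put the columns in ascent / descent / rest order, one row per profile
profiles = profiles[["ascent_start", "ascent_end", "descent_start",
                     "descent_end", "rest_start", "rest_end"]]
print(profiles)
```

The final profile in any scanned chunk ends up with an undefined `rest_end`; how that edge case is handled in the actual CSV files is not covered here.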



Now suppose the goal is to create a sequence of temperature plots for July 2019 for the Axial 
Base shallow profiler. First we would identify the pre-existing CSV file for Axial Base for the
year 2019 and read that file into a pandas DataFrame. Let's suppose it is read into a profile
DataFrame called `p` and that we have labeled the six columns that correspond to
ascent start/end, descent start/end and rest start/end. Here is example code from `BioOpticsModule.py`.


```
import pandas as pd

# metadata_filename: path to the profile CSV file for one site and one year
p = pd.read_csv(metadata_filename, usecols=["1", "3", "5", "7", "9", "11"])
p.columns          = ['ascent_start', 'ascent_end', 'descent_start', 'descent_end', 'rest_start', 'rest_end']

# convert each column from text to pandas Timestamps
p['ascent_start']  = pd.to_datetime(p['ascent_start'])
p['ascent_end']    = pd.to_datetime(p['ascent_end'])
p['descent_start'] = pd.to_datetime(p['descent_start'])
p['descent_end']   = pd.to_datetime(p['descent_end'])
p['rest_start']    = pd.to_datetime(p['rest_start'])
p['rest_end']      = pd.to_datetime(p['rest_end'])
```



Let's examine two rows of this Dataframe:



```
print(p['ascent_start'][0])

2019-01-01 00:27:00

print(p['ascent_start'][1600])

2019-07-04 15:47:00
```


That is, row 0 corresponds to the start of 2019, January 1, and row 1600 occurs on July 4.


For a 365 day year with no
missed profiles (9 profiles per day) this file would contain 365 * 9 = 3285 profiles. In practice
there will be fewer owing to storms or other factors that interrupt data acquisition. 
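
As a quick check on how complete a given year is, the row count of the metadata file can be compared with this nominal figure. The sketch below assumes nine profiles per day as stated above; the function names and the row count of 3000 are hypothetical.

```python
def expected_profiles(days_in_year=365, profiles_per_day=9):
    """Nominal profile count for a year with no interruptions."""
    return days_in_year * profiles_per_day


def coverage(n_rows, days_in_year=365, profiles_per_day=9):
    """Fraction of the nominal profile count that is present in the metadata."""
    return n_rows / expected_profiles(days_in_year, profiles_per_day)


print(expected_profiles())           # nominal count for a 365 day year
print(round(coverage(3000), 3))      # 3000 is a made-up row count, e.g. len(p)
```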


Each row of this dataframe corresponds to a profile run (ascent, descent, rest) of the shallow
profiler. Consequently we could use the time boundaries of one such row to select data that was
acquired *during the ascent period of that profile*. Suppose a temperature dataset for the month of July 
is called `T`. `T` is constructed as an xarray Dataset with dimension `time`. 
We can use the xarray *select* method `.sel`, as in `T.sel(time=slice(time0, time1))`, to
produce a Dataset with only times 
that fall within a profile ascent window.  


```
time0    = p['ascent_start'][1600]
time1    = p['ascent_end'][1600]
T_ascent = T.sel(time=slice(time0, time1))
```


Now `T_ascent` will contain about 60 minutes worth of data. 



This demonstrates loading time boundaries from the metadata `p`. 
The metadata informs the small time box. Now we need the other direction 
as well: Suppose the interval of interest is the first four days of July 2019.
We have no idea which rows of the metadata `p` this corresponds to. We need
a list of row indices for `p` in that time window. For this we 
have a utility function.


```
from numpy import datetime64 as dt64, timedelta64 as td64

def GenerateTimeWindowIndices(pDf, date0, date1, time0, time1):
    '''
    Given two day boundaries and a time window (UTC) within a day: Return a list
    of indices of profiles that start within both the day and time bounds. The
    day of date1 is included. time0 and time1 are offsets from midnight. This 
    works from the passed dataframe of profile times.
    '''
    nprofiles = len(pDf)
    pIndices = []
    for i in range(nprofiles):
        a0 = pDf["ascent_start"][i]
        if a0 >= date0 and a0 <= date1 + td64(1, 'D'):
            delta_t = a0 - dt64(a0.date())
            if delta_t >= time0 and delta_t <= time1: pIndices.append(i)
    return pIndices
```

This function has both a date range and a time-of-day range. The resulting row index list corresponds
to profiles that satisfy both time window constraints: Date and time of day. 


The end-result is this: We can go from a conceptual { time window } to a list of { metadata rows }, i.e. a
list of integer row numbers, using the above utility function. Within the metadata structure `p` we can 
use these rows to look up ascent / descent / rest times for profiles.
At that point we have very specific { time window } boundaries for selecting data
from individual profiles. 
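
The same selection can be written without an explicit loop. The sketch below is a vectorized variant that mirrors the logic of the utility function. It is an illustration under stated assumptions rather than the code in this repo: the name `time_window_indices` is hypothetical, and the metadata `p` here is synthetic (one ascent every 160 minutes, each lasting 60 minutes).

```python
import pandas as pd

# Synthetic metadata: nine ascents per day for the first six days of July 2019
p = pd.DataFrame({"ascent_start": pd.date_range("2019-07-01 00:27", periods=54, freq="160min")})
p["ascent_end"] = p["ascent_start"] + pd.Timedelta(minutes=60)


def time_window_indices(pdf, date0, date1, time0, time1):
    """Row indices of profiles whose ascent starts between date0 and the end of
    date1, and whose time of day lies between time0 and time1 (offsets from midnight)."""
    a0 = pdf["ascent_start"]
    in_days = (a0 >= date0) & (a0 <= date1 + pd.Timedelta(days=1))
    time_of_day = a0 - a0.dt.normalize()          # elapsed time since midnight
    in_hours = (time_of_day >= time0) & (time_of_day <= time1)
    return pdf.index[(in_days & in_hours).to_numpy()].tolist()


d0, d1 = pd.Timestamp("2019-07-01"), pd.Timestamp("2019-07-04")

# Every profile in the first four days of July
all_rows = time_window_indices(p, d0, d1, pd.Timedelta(0), pd.Timedelta(hours=24))

# Only profiles that start between 10:00 and 14:00 UTC on those days
midday_rows = time_window_indices(p, d0, d1, pd.Timedelta(hours=10), pd.Timedelta(hours=14))

for i in midday_rows:
    t0, t1 = p.loc[i, "ascent_start"], p.loc[i, "ascent_end"]
    print(i, t0, t1)      # with an xarray Dataset T: T.sel(time=slice(t0, t1))
```

Each `(t0, t1)` pair is a specific time window for one ascent, ready to pass to `.sel` as shown earlier.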






### bash, text editor, git, GitHub
### running a Jupyter notebook server (code and markdown)


- I learn the basic commands of the `bash` shell; including how to use a text editor like `nano` or `vim`
- I create an account at `github.com` and learn to use the basic `git` commands
    - `git pull`, `git add`, `git commit`, `git push`, `git clone`, `git stash`
    - I plan to spend a couple of hours learning `git`; I find good YouTube tutorials
- I create my own GitHub repository with a `README.md` file describing my research goals
- I set up a Jupyter notebook server on my local machine
    - As I am using a PC I install WSL-2 (Windows Subsystem for Linux v2)...
        - ...and install Miniconda plus some Python libraries
- I clone my "empty" repository from GitHub to my local Linux environment
- I start my Jupyter notebook server, navigate to my repo, and create a first notebook
- I save my notebook and use `git add, commit, push` to save it safely on GitHub
- On GitHub: Add and test a **`binder`** badge
    - Once that works, be sure to `git pull` the modified GitHub repo back into the local copy



### Ordering, retrieving and cleaning datasets from OOI


At this point we do not have any data; so let's get some. There are two important considerations. 
First: If the data volume exceeds 100MB, that is too much to keep in a GitHub repository. The
data must be staged "nearby" in the local environment; outside the repository but accessible by
the repository code, as in:


```
               ------------- /repo directory
              /
/home --------
              \
               -------------- /data directory

```
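
One way to keep notebook code independent of where this layout sits on disk is to build data paths relative to the repository folder. The following is a minimal sketch assuming the data folder is a sibling of the repo, as in the diagram; the helper name `data_path` and the folder and file names are hypothetical.

```python
from pathlib import Path


def data_path(repo_dir, *parts, data_dirname="data"):
    """Build a path inside a data folder that sits beside (not inside) the repo."""
    repo_dir = Path(repo_dir).resolve()
    return repo_dir.parent / data_dirname / Path(*parts)


# Example: the repo lives at /home/repo, so data is looked for under /home/data
print(data_path("/home/repo", "rca", "ctd.nc"))
```

In a notebook the first argument would typically be the repo root, for example `Path.cwd()` if the notebook server was started there.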


Second: Suppose the repo *does* contain (smaller) datasets, to be read by the code. 
If the intent is to use `binder` to make a sandbox version of the repo
available, all significant changes to this code should be tested: First locally
and then (after a `push` to GitHub) ***in `binder`***. This ensures that not too 
many changes pile up, breaking binder in mysterious and hard-to-debug ways.




Now that we have a dataset let's open it up and examine it within a Notebook.
The data are presumed to be in NetCDF format; so we follow common practice of
reading the data into an `xarray Dataset` which is a composition of `xarray
DataArrays`. There is a certain amount of learning here, particularly as this
library shares some Python DNA with `pandas` and `numpy`. Deconstructing an
`xarray Dataset` can be very challenging; so a certain amount of ink is devoted
to that process in this repo.

#### Open and subset a NetCDF data file via the `xarray Dataset`  


Data provided by OOI tends to be "not ready for use". There are several steps needed; and
these are not automated. They require some interactive thought and refinement. 


- Convert the principal dimension from `obs` or `row` to `time` 
    - `obs/row` are generic terms with values running 1, 2, 3... (hinders combining files into longer time series)
- Re-name certain data variables for easier use; and delete anything that is not of interest
- Identify the time range of interest
- Write a specific subset file
    - For example: Subset files that are small can live within the repo


```
# This code runs 'one line at a time' (not as a block) to iteratively streamline the data

#   Suggestion: Pay particular attention to the construct ds = ds.some_operation(). This ensures 
#     that the results of some_operation() are retained in the new version of the Dataset. 

ds = xr.open_dataset(filename)
ds                                         # notice the output will show dimension as "row" and "time" as a data variable


ds = ds.swap_dims({'row': 'time'})         # moves 'time' into the dimension slot
ds = ds.rename({'some_ridiculously_long_data_variable_name':'temperature'})
ds = ds.drop_vars('some_data_variable_that_has_no_interest_at_this_point')


ds = ds.dropna('time')                     # if any data variable value == 'NaN' this entry is deleted: Includes all
                                           #   corresponding data variable values, corresponding coordinates and 
                                           #   the corresponding dimension value. This enables plotting data such
                                           #   as pH that happens to be rife with NaNs. 

ds.z.plot()                                # this produces a simple chart showing gaps in the data record
ds.somedata.mean()                         # prints the mean of the given data variable

ta0 = dt64_from_doy(2021, 60)              # these time boundaries are set iteratively...
ta1 = dt64_from_doy(2021, 91)              #   ...to focus in on a particular time range with known data...
ds.sel(time=slice(ta0,  ta1)).z.plot()     #   ...where this plot is the proof


ds.sel(time=slice(ta0,  ta1)).to_netcdf(outputfile)           # writes a time-bounded data subset to a new NetCDF file
```
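
The helper `dt64_from_doy` used above is not defined in this section. A minimal sketch of what such a helper could look like follows, assuming it takes a year and a 1-based day of year (day 1 is January 1) and returns a NumPy `datetime64`. The actual helper in the repo may differ.

```python
import numpy as np


def dt64_from_doy(year, doy):
    """Return the datetime64 for day-of-year doy (1 = January 1) in the given year."""
    return np.datetime64(f"{year:04d}-01-01") + np.timedelta64(doy - 1, "D")


print(dt64_from_doy(2021, 60), dt64_from_doy(2021, 91))
```

Under that assumption, days 60 and 91 of 2021 fall on March 1 and April 1, so the slice above spans the month of March.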

#### Depth and time


Datasets have a depth attribute `z` and a time dimension `time`. These are derived by the data 
system and permit showing sensor values (like temperature) either in terms of depth below the 
surface; or in time relative to some benchmark. 

#### Some complicating data features


- Some signals may have dropouts: Missing data is usually flagged as `NaN`
    - See the section above on using the xarray `.dropna(dimension)` feature to clean this up
- Nitrate data also features ***dark sample*** data
- Spectrophotometer instruments measure both ***optical absorption*** and ***beam attenuation***
    - For both of these about 82 individual channel values are recorded
        - Each channel is centered at a unique wavelength in the visible spectrum
        - The wavelength channels are separated by about 4 nm
        - The data are noisy
        - Some channels contain no data
    - Sampling frequency needed
- Spectral irradiance carries seven channels (wavelengths) of data
- Current measurements give three axis results: north, east, up
    - ADCP details needed





#### wget


`wget` can be used recursively to copy files from the web to local copies.
`wget` is used in the **Global Ocean** notebook to get 500MB of data from the 
cloud that would otherwise make the repository too bulky for GitHub.  


Example usage, typically run from the command line, run from a Jupyter notebook
cell, or placed in a `bash` script:


```
wget -q https://kilroybackup.s3.us-west-2.amazonaws.com/glodap/GLODAPv2.2016b.NO3.nc -O glodap/NO3.nc
```

The `-q` flag suppresses output ('quiet') and `-O` specifies the local name of the data file.
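
Where `wget` is not available, a similar download can be done from Python using only the standard library. This is a sketch, not what the **Global Ocean** notebook does: the function name `fetch` and its skip-if-present behavior are assumptions made for the example.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch(url, local_name, overwrite=False):
    """Download url to local_name, creating parent folders as needed.
    An existing local file is kept unless overwrite is True."""
    target = Path(local_name)
    if target.exists() and not overwrite:
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return target


# Equivalent of the wget line above (uncomment to download the file):
# fetch("https://kilroybackup.s3.us-west-2.amazonaws.com/glodap/GLODAPv2.2016b.NO3.nc", "glodap/NO3.nc")
```

Skipping files that already exist keeps a notebook cell cheap to re-run, at the cost of not noticing when the remote file changes.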



### Basic Python as the baseline layer of the code



#### Start by putting all code in notebooks



#### Notebooks can become over-long or very code-dense so...



##### Be prepared to break notebooks apart into smaller notebooks



##### Be prepared to migrate blocks of working code to module files



### Install Python data science libraries for plotting, opening datasets, slicing



#### starting with `matplotlib`, `numpy`, `pandas`, and `xarray`



### As needed bring in Python extension libraries



#### interactive widgets, maps, animation, color maps



### Pulling other data (besides shallow profiler) from the OOI data system



### Pulling datasets from other programs: ARGO, MODIS, GLODAP, ROMS, MSLA, etcetera



### Using binder as an ephemeral executable sandbox



### Working from larger extra-repo datasets





    






Let's make a time subset of the dataset and plot the data.




Let's focus on a profiler.




Let's animate a time series.





### Data product levels


The 
[OOI Data Catalog Documentation](https://dataexplorer.oceanobservatories.org/help/overview.html#data-products) 
describes three levels of data product, summarized: 


* Level 1 ***Instrument deployment***: Unprocessed, parsed data parameter that is in instrument/sensor 
units and resolution. See note below defining a *deployment*. This is not data we are interested in using, as a rule.


* Level 1+ ***Full-instrument time series***: A join of recovered and telemetered 
streams for non-cabled instrument deployments. For high-resolution cabled and recovered data, this product is 
binned to 1-minute resolution to allow for efficient visualization and downloads for users that do not need 
the full-resolution, gold copy (Level 2) time series. We'd like to hold out for 'gold standard'.


* Level 2 ***Full-resolution, gold standard time series***: The calibrated full-resolution dataset 
(scientific units). L2 data have been processed, pre-built, and served 
from the OOI system to the 
[OOI Data Explorer](https://dataexplorer.oceanobservatories.org/)
and to Users. The mechanisms are THREDDS and ERDDAP; file format  
NetCDF-CF. There is one file for every instrument, stream, and deployment.  For more refer to this
[Data Download](https://dataexplorer.oceanobservatories.org/help/overview.html#download-data-map-overview) link.



## OOI terminology



- **instrument**: A physical device with one or more sensors.
- **stream**: Sensor data.
- **deployment**: The act of putting infrastructure in the water, or the length of 
time between a platform going in the water and being recovered and brought back to shore. There are 
multiple deployment files per instrument. 




# Retained from prior iterations



The sequence of events so far:


* order data
* clean the data to regular 1-minute samples
* scan the data for profiles; write these to CSV files
* load in a profile list for a particular site and year


Now we start charting this data. We'll begin with six signals, three each from the CTD and the fluorometer. 
Always we have two possible axes: Depth and time. Most often we chart against depth using the y-axis and 
measuring from a depth of 200 meters at the bottom to the surface at the top of the chart. 
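
The depth-on-the-y-axis convention comes down to reversing the y limits. A minimal matplotlib sketch follows; the temperature curve is synthetic and only there to give the axes something to draw.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic profile for illustration only: temperature decreasing with depth
depth_m = np.linspace(0.0, 200.0, 101)
temperature_c = 8.0 + 6.0 * np.exp(-depth_m / 40.0)

fig, ax = plt.subplots(figsize=(4, 6))
ax.plot(temperature_c, depth_m, color="r")
ax.set(xlabel="temperature (deg C)", ylabel="depth (m)", title="synthetic profile")
ax.set_ylim(200.0, 0.0)       # 200 m at the bottom of the chart, surface at the top
```

Passing the larger value first to `set_ylim` is what flips the axis; the data themselves are left untouched.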


CTD signals


* Temperature
* Salinity
* Dissolved oxygen


Fluorometer signals


* CDOM: Colored Dissolved Organic Matter
* chlor-a: Chlorophyll pigment A
* scatt: Backscatter


The other sensor signals will be introduced subsequently. These include nitrate concentration,
pH, pCO2, PAR, spectral irradiance, local current and water density. 


Some profile pandas dataframe code; and filtering: 


```
# Create a pandas DataFrame: Six columns of datetimes for a particular year and site
#   The six columns are start/end for, in order: ascent, descent, rest: See column labels below.
def ReadProfiles(fnm):
    """
    Profiles are saved by site and year as 12-tuples. Here we read only
    the datetimes (not the indices) so there are only six values. These
    are converted to Timestamps. They correspond to ascend start/end, 
    descend start/end and rest start/end.
    """
    df = pd.read_csv(fnm, usecols=["1", "3", "5", "7", "9", "11"])
    df.columns=['ascent_start', 'ascent_end', 'descent_start', 'descent_end', 'rest_start', 'rest_end']
    df['ascent_start'] = pd.to_datetime(df['ascent_start'])
    df['ascent_end'] = pd.to_datetime(df['ascent_end'])
    df['descent_start'] = pd.to_datetime(df['descent_start'])
    df['descent_end'] = pd.to_datetime(df['descent_end'])
    df['rest_start'] = pd.to_datetime(df['rest_start'])
    df['rest_end'] = pd.to_datetime(df['rest_end'])
    return df


# FilterSignal() operates on a time series DataArray passed in as 'v'. It is set up to point to multiple possible
#   smoothing kernels but has just one at the moment, called 'hat'.
def FilterSignal(v, ftype='hat', control1=3):
    """Operate on an XArray data array (with some checks) to produce a filtered version"""
    # pre-checks
    if not v.dims[0] == 'time': return v

    if ftype == 'hat': 
        n_passes = control1        # should be a kwarg
        len_v = len(v)
        for n in range(n_passes):
            source_data = np.copy(v) if n == 0 else np.copy(smooth_data)
            smooth_data = [source_data[i] if i == 0 or i == len_v - 1 else \
                0.5 * source_data[i] + 0.25 * (source_data[i-1] + source_data[i + 1]) \
                for i in range(len_v)]
        return smooth_data
    return v

```
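
For long time series the list comprehension in `FilterSignal()` can be slow. The 'hat' kernel can also be applied with NumPy slicing, as in the sketch below. This is an alternative written for illustration, not the repo's implementation: the name `hat_filter` is hypothetical, it returns a NumPy array rather than a list, and any `NaN` in the input spreads to its neighbors on each pass.

```python
import numpy as np


def hat_filter(values, n_passes=3):
    """Apply the (0.25, 0.5, 0.25) kernel n_passes times; both endpoints are left unchanged."""
    smooth = np.asarray(values, dtype=float).copy()
    for _ in range(n_passes):
        # The right-hand side is evaluated from the previous pass before assignment
        smooth[1:-1] = 0.5 * smooth[1:-1] + 0.25 * (smooth[:-2] + smooth[2:])
    return smooth


spike = [0.0, 0.0, 4.0, 0.0, 0.0]
print(hat_filter(spike, n_passes=1))
print(hat_filter(spike, n_passes=2))
```

Each pass spreads the spike further and lowers its peak, which is the behavior the `control1` argument tunes in `FilterSignal()`.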

# Spectral Irradiance






> ***If we shadows have offended,***<BR>
> ***Think but this, and all is mended,***<BR>
> ***That you have but slumber'd here***<BR>
> ***While these visions did appear.***<BR>


# Introduction

    
This notebook should run in **`binder`**. It uses small datasets stored within this repo.

The notebook charts
CTD data, dissolved oxygen, nitrate, PAR, spectral irradiance, fluorescence and pH in relation
to pressure/depth. The focus is
shallow (photic zone) profilers from the Regional Cabled Array component of OOI,
specifically the Oregon Slope Base site in 2019. Oregon Slope Base is an instrumentation
site off the continental shelf west of the state of Oregon.



# Visions of the Photic Zone: CTD and other low data rate sensors


The 'photic zone' is the upper layer of the ocean regularly illuminated by sunlight. This set of photic zone 
notebooks concerns sensor data from the surface to about 200 meters depth. Data are acquired from two to nine
times per day by shallow profilers. This notebook covers CTD (salinity 
and temperature), dissolved oxygen, nitrate, pH, spectral irradiance, fluorometry and photosynthetically 
available radiation (PAR).  


Data are first taken from the Regional Cabled Array shallow profilers and platforms. A word of explanation here: The
profilers rise and then fall over the course of about 80 minutes, nine times per day, from a depth of 200 meters
to within about 10 meters of the surface. As they ascend and descend they record data. The resting location in
between these excursions is a platform 200 meters below the surface that is anchored to the sea floor. The platform
also carries sensors that measure basic ocean water properties.


<BR>
<img src="./images/vessels/revelle.jpg" style="float: left;" alt="ship and iceberg photo" width="900"/>
<div style="clear: left"><BR>


Research ship Revelle in the southern ocean: 100 meters in length. 
Note: Ninety percent of this iceberg is beneath the surface. 


More on the Regional Cabled Array oceanography program [here](https://interactiveoceans.washington.edu).
    
    
### Study site locations
    

We begin with three sites in the northeast Pacific: 
    

```
Site name               Lat               Lon
------------------      ---               ---
Oregon Offshore         44.37415          -124.95648
Oregon Slope Base       44.52897          -125.38966 
Axial Base              45.83049          -129.75326
```   


## Spectral Irradiance

* The data variable is `spkir_downwelling_vector` x 7 wavelengths per below
* 9 months continuous operation at about 4 samples per second gives 91 million samples
* DataSet includes `int_ctd_pressure` and `time` Coordinates; Dimensions are `spectra` (0--6) and `time`
* Oregon Slope Base `node : SF01A`, `id : RS01SBPS-SF01A-3D-SPKIRA101-streamed-spkir_data_record`
* A better approach would be to plot these as a sequence of rainbow plots with depth, etc

See [Interactive Oceans](https://interactiveoceans.washington.edu/instruments/spectral-irradiance-sensor/): 


> The Spectral Irradiance sensor (Satlantic OCR-507 multispectral radiometer) measures the amount of 
> downwelling radiation (light energy) per unit area that reaches a surface. Radiation is measured 
> and reported separately for a series of seven wavelength bands (412, 443, 490, 510, 555, 620, 
> and 683 nm), each between 10-20 nm wide. These measurements depend on the natural illumination 
> conditions of sunlight and measure apparent optical properties. These measurements also are used 
> as proxy measurements of important biogeochemical variables in the ocean.
>
> Spectral Irradiance sensors are installed on the Science Pods on the Shallow Profiler Moorings 
> at Axial Base (SF01A), Slope Base (SF01A), and at the Endurance Array Offshore (SF01B) sites. 
> Instruments on the Cabled Array are provided by Satlantic – OCR-507. 


```
import xarray as xr
import matplotlib.pyplot as plt
from numpy import datetime64 as dt64

spectral_irradiance_source = './data/rca/irradiance/'
ds_irradiance = [xr.open_dataset(spectral_irradiance_source + 'osb_sp_irr_spec' + str(i) + '.nc') for i in range(7)]

# Early attempt at using log crashed the kernel

day_of_month_start = '25'
day_of_month_end = '27'
time0 = dt64('2019-01-' + day_of_month_start)
time1 = dt64('2019-01-' + day_of_month_end)

spectral_irradiance_upper_bound = 10.
spectral_irradiance_lower_bound = 0.
ds_irr_time_slice = [ds_irradiance[i].sel(time = slice(time0, time1)) for i in range(7)]

fig, axs = plt.subplots(figsize=(12,8), tight_layout=True)
colorwheel = ['k', 'r', 'y', 'g', 'c', 'b', 'm']
for i in range(7):
    axs.plot(ds_irr_time_slice[i].spkir_downwelling_vector, \
                ds_irr_time_slice[i].int_ctd_pressure, marker='.', markersize = 4., color=colorwheel[i])
    
axs.set(xlim = (spectral_irradiance_lower_bound, spectral_irradiance_upper_bound), \
        ylim = (60., 0.), title='multiple profiles: spectral irradiance')


plt.show()
```

In [None]:
"""
Stand-alone code to plot a user-specified mooring extraction.
"""
from pathlib import Path
moor_fn = Path('/Users/pm8/Documents/LO_output/extract/cas6_v3_lo8b/'
    +'moor/ooi/CE02_2018.01.01_2018.12.31.nc')

import xarray as xr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# load everything using xarray
ds = xr.load_dataset(moor_fn)
ot = ds.ocean_time.values
ot_dt = pd.to_datetime(ot)
t = (ot_dt - ot_dt[0]).total_seconds().to_numpy()
T = t/86400 # time in days from start
print('time step of mooring'.center(60,'-'))
print(t[1])
print('time limits'.center(60,'-'))
print('start ' + str(ot_dt[0]))
print('end   ' + str(ot_dt[-1]))
print('info'.center(60,'-'))
VN_list = []
for vn in ds.data_vars:
    print('%s %s' % (vn, ds[vn].shape))
    VN_list.append(vn)
    
# populate lists of variables to plot
vn2_list = ['zeta']
if 'shflux' in VN_list:
    vn2_list += ['shflux', 'swrad']
vn3_list = []
if 'salt' in VN_list:
    vn3_list += ['salt', 'temp']
if 'oxygen' in VN_list:
    vn3_list += ['oxygen']

# plot time series using a pandas DataFrame
df = pd.DataFrame(index=ot)
for vn in vn2_list:
    df[vn] = ds[vn].values
for vn in vn3_list:
    # the -1 means surface values
    df[vn] = ds[vn][:, -1].values

plt.close('all')
df.plot(subplots=True, figsize=(16,10))
plt.show()