# Restructuring FS2009 bottle files (first attempt)

- Parse the relevant data from the FS bottle file `fs2009bottle+CDOM.mat`
    - To a NetCDF-compatible xarray *Dataset* by way of a Python dictionary.  
- Massage into the (presumably) desired structure.
    - Sorting by Niskin number to get a unified dataset with dimensions `STATION, NISKIN_BOTTLE`.
    - (This is *not* how the data in the mat file are organized.)
- Option to wrap CDOM absorption coefficients into a single 3D variable.
- Export to (temporary) NetCDF file.

In [466]:
import xarray as xr
import numpy as np
import pymatreader
from matplotlib import pyplot as plt

## Read the matfile

- Using the `pymatreader` library, which is pretty robust across MATLAB versions.
- Note that `pymatreader.read_mat()` will give some warnings - they can be ignored in this case.

In [467]:
fn = '/home/oyvindl/work/data/cruise_data/fram_strait/bottle_data_2025/data/bottle_data/fs2009bottle+CDOM.mat'

In [468]:
dm = pymatreader.read_mat(fn)
N_btl, N_station = dm['bot_nisk'].shape

### Toggle whether to wrap absorption variables
Whether or not to wrap the absorption coefficient variables into a single variable with a wavelength dimension, as per CF (presumably).

In [444]:
use_wl_dimension = False

## Build the xr dataset

Create an xarray dataset `ds` that we will populate with the data and massage into the shape we want.

Initial dimensions:
- `STATION, BTL_TEMPORARY` (if `use_wl_dimension=False`)
- `STATION, BTL_TEMPORARY, WAVELENGTH` (if `use_wl_dimension=True`)


**Note:** Each station has a different Niskin ordering (in `bot_nisk`) -> **We will have to reindex, reordering by Niskin number**.

Initializing `ds` with a dimension `BTL_TEMPORARY`, which we will eventually replace with `NISKIN_BOTTLE`.


In [471]:
if not use_wl_dimension:
    ds = xr.Dataset(coords = {
        'STATION' : np.int_(dm['bot_stn']),
        'BTL_TEMPORARY': np.arange(N_btl)},)
else:
    ds = xr.Dataset(coords = {
        'STATION' : np.int_(dm['bot_stn']),
        'BTL_TEMPORARY': np.arange(N_btl),
        'WAVELENGTH':[254.0, 350.0, 375.0]})

### Parse data into `ds`

#### Define in and out names

Read variables to new names. Input variable selection (roughly) based on [`Framstrait_CTD_parameter_overview.xlsx`](https://npolar.sharepoint.com/:x:/r/sites/FramStraitdatadescriptorpaper-DTUdatasharing/Shared%20Documents/DTU%20data%20sharing/Framstrait_CTD_parameter_overview.xlsx?d=wffc5157539f7454caba994817b7dc30c&csf=1&web=1&e=qTmw8l).

- *NOTE*: this is just a temporary dictionary; it will conform with the names in the Excel sheet eventually!

In [475]:
var_dict = {
    'bot_nisk': 'NISKIN_NUM_STATION',
    'bot_ctdsal':'PSAL_CTD',
    'bot_temp1':'TEMP_CTD',
    'bot_press':'PRES_CTD',
    'bot_labsal': 'PSAL_LAB',
    'bot_d18o': 'DO18',
    'bot_doc': 'DOC',
    'bot_no2': 'NO2',
    'bot_no3': 'NO3',
    'bot_po4': 'PO4',
    'bot_sio4': 'SIO4',
    'bot_lat':'LATITUDE', 
    'bot_lon':'LONGITUDE', 
    'ex350em450_lab': 'CDOM_EX350EM450_LAB',
    'ex370em460_lab': 'CDOM_EX370EM460_LAB',
    'ex370em460_ctd': 'CDOM_EX370EM460_CTD',
    'S': 'CDOM_SLOPE_300-650_LAB',
    'S275_295': 'CDOM_SLOPE_275-295_LAB',
    'S350_400': 'CDOM_SLOPE_350-400_LAB',
    'a254': 'CDOM_ACOEF254_LAB',
    'a350': 'CDOM_ACOEF350_LAB',
    'a375': 'CDOM_ACOEF375_LAB',
}

#### Parse data from the matfile to the xr Dataset

In [477]:
# Look through primary level variables:

for old_name in dm.keys():
    if old_name in var_dict:
        new_name = var_dict[old_name]

        # Look at 2D variables
        if dm[old_name].shape == (N_btl, N_station):
            ds[new_name] = (('BTL_TEMPORARY', 'STATION'), dm[old_name],) 
            
            # Grab flags associated with the variable
            flag_name_old = f'{old_name}_flag'
            if flag_name_old in dm.keys():
                ds[f'{new_name}_FLAG'] = (('BTL_TEMPORARY', 'STATION'), dm[flag_name_old],) 

            # TBD here: Do the appropriate cross-referencing in metadata e.g.

        # Look at 1D variables
        elif dm[old_name].shape == (N_station,):
            ds[new_name] = (('STATION'), dm[old_name],)             

# Look through variables nested in `bot_cdom`:

for old_name in dm['bot_cdom'].keys():
    if old_name in var_dict:
        new_name = var_dict[old_name]
        if dm['bot_cdom'][old_name].shape == (N_btl, N_station):
            ds[new_name] = (('BTL_TEMPORARY', 'STATION'), dm['bot_cdom'][old_name],) 
        elif dm['bot_cdom'][old_name].shape == (N_station,):
            ds[new_name] = (('STATION'), dm['bot_cdom'][old_name],)   



#### Set lat/lon/pressure/niskin bottle number as non-dimensional coordinates


In [479]:
ds = ds.set_coords(['LATITUDE', 'LONGITUDE', 'NISKIN_NUM_STATION', 'PRES_CTD'])

### Combining the CDOM ACOEF variables

If `use_wl_dimension=True`, combine the `CDOM_ACOEFXXX_LAB` variables into one 3D variable, `CDOM_ACOEF_LAB`.

In [480]:
if use_wl_dimension:
    As = []
    WLs = [254, 350, 375]
    for WL in WLs:
        WL_varnm = f'CDOM_ACOEF{WL}_LAB'
        A_WL = ds[WL_varnm]
        A_WL['WAVELENGTH'] = ((), float(WL))
        As += [A_WL]
        ds = ds.drop_vars(WL_varnm)
    
    A_combined = xr.concat(As, 'WAVELENGTH')
    
    ds['CDOM_ACOEF_LAB'] = A_combined
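
For reference, here is a minimal standalone sketch of the same idea on made-up data (the toy dataset and its values are assumptions, not taken from the mat file). It passes a `pandas.Index` as the concat dimension, which lets `xr.concat` create the `WAVELENGTH` coordinate in one step instead of attaching a scalar coordinate to each array first. This is a variant of the cell above, not a replacement for it.

```python
import numpy as np
import pandas as pd
import xarray as xr

wavelengths = [254, 350, 375]

# Toy stand-in for `ds`: each ACOEF variable is filled with its own wavelength
toy = xr.Dataset(
    {
        f"CDOM_ACOEF{wl}_LAB": (
            ("BTL_TEMPORARY", "STATION"),
            np.full((2, 3), float(wl)),
        )
        for wl in wavelengths
    },
    coords={"STATION": [1, 2, 3], "BTL_TEMPORARY": [0, 1]},
)

# The named index becomes the new WAVELENGTH dimension and coordinate
wl_index = pd.Index([float(wl) for wl in wavelengths], name="WAVELENGTH")

a_combined = xr.concat(
    [toy[f"CDOM_ACOEF{wl}_LAB"] for wl in wavelengths],
    dim=wl_index,
).rename("CDOM_ACOEF_LAB")

print(a_combined.dims)
```

Selecting a single wavelength afterwards is then just `a_combined.sel(WAVELENGTH=350.0)`.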

### Reorder variables

This is mostly aesthetic: reorder so that the variables appear in a nice order.

In [481]:
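# Reorder so that the flags end up last (a bit hacky, but it works).
# Each non-flag group excludes '_FLAG' so that no variable is listed twice.
varlist = []

# Non-CTD, non-CDOM variables first
for key in list(ds.keys()):
    if ('_FLAG' not in key
        and '_CTD' not in key
        and 'CDOM_' not in key):
        varlist += [key]

# CTD variables
for key in list(ds.keys()):
    if '_CTD' in key and 'CDOM' not in key and '_FLAG' not in key:
        varlist += [key]

# CDOM absorption coefficients
for key in list(ds.keys()):
    if 'CDOM_ACOEF' in key and '_FLAG' not in key:
        varlist += [key]

# Remaining CDOM variables
for key in list(ds.keys()):
    if 'CDOM_' in key and '_FLAG' not in key and key not in varlist:
        varlist += [key]

# Flags last
for key in list(ds.keys()):
    if '_FLAG' in key:
        varlist += [key]

ds = ds[varlist]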
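
A more compact way to express this kind of grouping is a sort key. The sketch below is only an illustration: `group_rank` is a hypothetical helper (not part of the notebook), and the list of names is a made-up subset. Because Python's `sorted` is stable, names keep their original relative order within each group.

```python
def group_rank(name: str) -> int:
    """Return the display group of a variable name (lower sorts first)."""
    if "_FLAG" in name:
        return 4  # quality flags go last
    if "CDOM_ACOEF" in name:
        return 2  # CDOM absorption coefficients
    if "CDOM_" in name:
        return 3  # remaining CDOM variables
    if "_CTD" in name:
        return 1  # CTD sensor variables
    return 0      # everything else


# Made-up subset of variable names, in arbitrary order
names = [
    "PSAL_LAB",
    "PSAL_LAB_FLAG",
    "PSAL_CTD",
    "CDOM_SLOPE_275-295_LAB",
    "CDOM_ACOEF254_LAB",
    "DOC",
    "TEMP_CTD",
]

ordered = sorted(names, key=group_rank)
print(ordered)
```

Applied to the dataset, this would presumably read `ds = ds[sorted(ds.data_vars, key=group_rank)]`.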

### Reindex to a `NISKIN_BOTTLE` dimension

For each station:

- Remove entries with NaN Niskin bottle number
- Sort by Niskin bottle number

Then concatenate the resulting datasets to get a dataset `ds_reindexed` with coordinates (`STATION, NISKIN_BOTTLE`).

In [482]:
ds_stations = []
for station in ds.STATION.data:
    
    # Select one station 
    ds_station = ds.sel(STATION=station)

    # Swap dimension from BTL_TEMPORARY to NISKIN_BOTTLE
    ds_station = ds_station.assign_coords(NISKIN_BOTTLE=("BTL_TEMPORARY", ds_station["NISKIN_NUM_STATION"].data))
    ds_station = ds_station.swap_dims({"BTL_TEMPORARY": "NISKIN_BOTTLE"})

    # Remove rows where Niskin number is NaN
    valid_nisk = ~np.isnan(ds_station['NISKIN_BOTTLE'])
    ds_station_nna = ds_station.isel(NISKIN_BOTTLE=valid_nisk)

    # Sort by Niskin number
    ds_station_nna_sort = ds_station_nna.sortby('NISKIN_BOTTLE')

    # Collect
    ds_stations += [ds_station_nna_sort]

ds_reindexed = xr.concat(ds_stations, dim = 'STATION')
# Change from float to int
ds_reindexed['NISKIN_BOTTLE'] = ds_reindexed['NISKIN_BOTTLE'].astype(int)

# Remove unused/redundant
ds_reindexed = ds_reindexed.drop_vars(['BTL_TEMPORARY', 'NISKIN_NUM_STATION'])
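
To make the effect of the reindexing easier to see, here is a small standalone sketch on made-up data: two stations, three bottle slots, a different Niskin order per station and one missing bottle. All names and values in the toy dataset are assumptions for illustration. Unlike the cell above, it drops the helper coordinates before concatenating and passes `join="outer"` explicitly, so that it does not rely on the default alignment behaviour of `xr.concat`.

```python
import numpy as np
import xarray as xr

# Toy data, shape (bottle slot, station); NaN marks an unused slot
nisk = np.array([[3.0, 1.0],
                 [1.0, np.nan],
                 [2.0, 4.0]])
doc = np.array([[30.0, 10.0],
                [10.0, np.nan],
                [20.0, 40.0]])

toy = xr.Dataset(
    {"DOC": (("BTL_TEMPORARY", "STATION"), doc)},
    coords={
        "STATION": [101, 102],
        "BTL_TEMPORARY": np.arange(3),
        "NISKIN_NUM_STATION": (("BTL_TEMPORARY", "STATION"), nisk),
    },
)

per_station = []
for station in toy.STATION.data:
    one = toy.sel(STATION=station)
    # Use the Niskin number as the new dimension coordinate
    one = one.assign_coords(
        NISKIN_BOTTLE=("BTL_TEMPORARY", one["NISKIN_NUM_STATION"].data)
    )
    one = one.swap_dims({"BTL_TEMPORARY": "NISKIN_BOTTLE"})
    # Drop unused slots, sort by Niskin number, remove helper coordinates
    one = one.isel(NISKIN_BOTTLE=~np.isnan(one["NISKIN_BOTTLE"].data))
    one = one.sortby("NISKIN_BOTTLE")
    per_station.append(one.drop_vars(["BTL_TEMPORARY", "NISKIN_NUM_STATION"]))

# Outer join: the result covers the union of all Niskin numbers
toy_reindexed = xr.concat(per_station, dim="STATION", join="outer")
print(toy_reindexed)
```

Bottles that a station did not fire show up as NaN in that station's row. As far as I can tell, the outer join also assumes that Niskin numbers are unique within each station; duplicates would likely make the alignment fail.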