# Using EcoFOCIpy to process raw field data

## DY2209 (Fall Arctic Cruise - Dyson)

## CTD / BTL Data

Basic workflow for each instrument grouping is *(initial archive level)*:
- SBE workflow must happen first
- Parse data from btl files into pandas dataframe

Convert to an xarray dataset for all following work *(working or final data level)*:
- Add metadata from cruise yaml files and/or header info
- ingest metadata from cruise / cast logs
- process data beyond simple file translate
- apply any calibrations or corrections
    + field corrections
    + offsets
    + instrument compensations
    + some QC where available... this would mostly be old-school simple bounds checks
- adjust time bounds and sample frequency (xarray dataset)
- save as CF netcdf via xarray; many of the steps above are optional, so the saved file is one of:
    + **ERDDAP NRT** if no corrections, offsets or time bounds are applied but some metadata is
    + **Working and awaiting QC** has no ERDDAP representation and is a holding spot
    + **ERDDAP Final** fully calibrated, qc'd and populated with meta information
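
The "simple bounds" QC mentioned in the list above can be sketched as follows. This is a minimal illustration in plain Python, not EcoFOCIpy's implementation; the function name `flag_out_of_bounds` and the bounds used are hypothetical and chosen only for illustration.

```python
import math

def flag_out_of_bounds(values, lower, upper):
    """Return a list of booleans, True where a value fails the bounds check.

    Missing values (None or NaN) are flagged as well.
    """
    flags = []
    for v in values:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        flags.append(missing or not (lower <= v <= upper))
    return flags

# Hypothetical bounds for sea temperature in degree_C (illustrative only)
temperature = [6.655, 4.9063, -99.9, float("nan")]
flags = flag_out_of_bounds(temperature, lower=-2.0, upper=35.0)
print(flags)  # [False, False, True, True]
```

Flags like these would normally be stored alongside the data rather than used to delete values, so that a later manual QC pass can review them.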

Plot for preview and QC
- preview images (indiv and/or collectively)
- manual qc process
- automated qc process ML/AI

Further refinements for ERDDAP hosting:

## Differences from EPIC standard
- Previously, btl files had coordinates of lat/lon/time/depth, with bottle position / firing order stored as a variable. If multiple bottles were fired at the same depth, the file was not uniquely indexed and the bottle variable had to be conflated across the multiple firings.
- In the new version, we index on lat/lon/time/bottle_num (bottle number is a sequential unique value, often representing the position on the rosette). Merging with CTD downcast data will require maintaining a pressure/depth variable in the bottle data that can be rounded to the nearest 1m bin.  This does not solve the problem of multiple discrete samples taken at a single depth from a single niskin, though.
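
The indexing difference described above can be illustrated with a small sketch. The bottle records below are hypothetical, and `nearest_1m_bin` is an illustrative helper, not part of EcoFOCIpy. It rounds halves up explicitly because Python's built-in `round` rounds halves to the nearest even number.

```python
import math

def nearest_1m_bin(pressure_db):
    """Round a pressure (dbar) to the nearest whole-meter bin, halves rounding up."""
    return int(math.floor(pressure_db + 0.5))

# Hypothetical bottle records: (bottle_num, pressure); bottles 1 and 2 share a depth
bottles = [(1, 33.251), (2, 33.012), (3, 17.841), (4, 3.089)]

# Old scheme: keyed by depth bin, so bottle 2 silently overwrites bottle 1
by_depth = {nearest_1m_bin(p): num for num, p in bottles}

# New scheme: keyed by bottle number, with the depth bin kept as a variable
by_bottle = {num: {"pressure": p, "depth_bin": nearest_1m_bin(p)} for num, p in bottles}

print(len(by_depth), len(by_bottle))  # 3 4
```

Keeping `depth_bin` as a variable is what allows a later merge against the 1m-binned downcast data without giving up the unique bottle index.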

## Example below is for SBE 9/11+ V2 but the workflow is similar for any SBE instruments.

Future processing of this instrument can be a simplified (no markdown) process which can be archived so that the procedure can be traced or updated

We process each cast as an individual file so this example will not loop over all profiles.  See `example/all_casts.py` example for processing an entire cruise at once.

Adding discrete samples such as oxygen, chlorophyll, and salinity to BTL data is covered in `example/discrete_castdata.py`.  Its purpose is to match niskin/bottle information to depth for the discrete data.
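
As a rough sketch of the matching idea (not the code in `example/discrete_castdata.py`), the cast logs record niskin numbers and sample identifiers as parallel semicolon-delimited strings, for example `OxygenBtlNiskinNo` and `OxygenBtlNumbers`. The helper name `pair_niskin_samples` is hypothetical. It pairs entries by position and skips empty slots.

```python
def pair_niskin_samples(niskin_field, sample_field, sep=";"):
    """Pair niskin numbers with sample IDs by position, skipping empty slots."""
    niskins = niskin_field.split(sep)
    samples = sample_field.split(sep)
    if len(niskins) != len(samples):
        raise ValueError("niskin and sample lists differ in length")
    return [(int(n), s) for n, s in zip(niskins, samples) if n and s]

pairs = pair_niskin_samples("1;1;2;3;;5", "227;239;241;242;;243")
print(pairs)
# [(1, '227'), (1, '239'), (2, '241'), (3, '242'), (5, '243')]
```

Niskin 1 appears twice here, which is the "multiple discrete samples from a single niskin" case noted earlier. Raising on a length mismatch is a design choice in this sketch, so that inconsistent log entries are caught rather than silently misaligned.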

In [1]:
import yaml
import glob

import EcoFOCIpy.io.sbe_ctd_parser as sbe_ctd_parser #<- instrument specific
import EcoFOCIpy.io.ncCFsave as ncCFsave
import EcoFOCIpy.metaconfig.load_config as load_config

The sample_data_dir should be included in the GitHub package but may not be included in the pip install of the package.

## Simple Processing - first step

In [2]:
sample_data_dir = '/Users/bell/ecoraid/2022/CTDcasts/dy2209/' #root path to cruise directory
ecofocipy_dir = '/Users/bell/Programs/EcoFOCIpy/'

In [3]:
###############################################################
# edit to point to {cruise specific} raw datafiles
datafile = sample_data_dir+'rawconverted/' #<- point to cruise and process all files within
cruise_name = 'DY2209' #no hyphens
cruise_meta_file = sample_data_dir+'logs/DY2209.yaml'
inst_meta_file = sample_data_dir+'logs/FOCI_standard_CTDplusrinko.yaml'
group_meta_file = ecofocipy_dir+'staticdata/institutional_meta_example.yaml'
###############################################################

#init and load data
cruise = sbe_ctd_parser.sbe_btl()
filename_list = sorted(glob.glob(datafile + '*.btl'))

cruise_data = cruise.manual_parse(filename_list)

Processing /Users/bell/ecoraid/2022/CTDcasts/dy2209/rawconverted/CTD001.btl
Processing /Users/bell/ecoraid/2022/CTDcasts/dy2209/rawconverted/CTD002.btl
Processing /Users/bell/ecoraid/2022/CTDcasts/dy2209/rawconverted/CTD003.btl
Processing /Users/bell/ecoraid/2022/CTDcasts/dy2209/rawconverted/CTD004.btl




In [4]:
#quick statistical look at the distribution of data for a cast
# #preview a dataframe
cruise_data['CTD001.btl'].describe()

Unnamed: 0,sal11,sal00,sbeox0ml/l,sbeox0ps,sbox0mm/kg,sbeox1ml/l,sbeox1ps,sbox1mm/kg,sigma-t00,sigma-t11,c0ms/cm,c1ms/cm,fleco-afl,t090c,t190c,turbwetntu0,par,scan,prdm,datetime
count,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6.0,6
mean,31.642167,31.6439,7.1362,101.810333,310.973167,6.741633,96.182833,293.778833,24.85935,24.85785,31.660529,31.659865,1.308267,6.328517,6.329617,1.954083,3.114163,6224.833333,12.1395,2022-09-20 21:13:01.000000256
min,31.5945,31.5954,7.0501,97.652,307.222,6.6627,92.006,290.253,24.7824,24.7817,30.63259,30.634116,0.8815,4.9063,4.9114,1.7434,0.28329,2919.0,3.012,2022-09-20 21:10:43
25%,31.5959,31.597125,7.095325,101.38825,309.129,6.690725,95.8235,291.5645,24.78355,24.782925,31.802512,31.804358,0.991875,6.49735,6.502075,1.80295,0.337663,5156.0,3.1075,2022-09-20 21:12:16.500000
50%,31.59745,31.59885,7.16965,103.0245,312.4525,6.777,97.383,295.3415,24.7854,24.78445,31.888205,31.886035,1.3274,6.65245,6.65215,1.9289,3.02378,6910.0,6.482,2022-09-20 21:13:29.500000
75%,31.640325,31.642575,7.174075,103.09175,312.647,6.78385,97.4875,295.64075,24.839525,24.837125,31.889793,31.886561,1.535275,6.656125,6.65425,2.1141,5.8781,7782.75,17.841,2022-09-20 21:14:05.750000128
max,31.8136,31.8166,7.1817,103.204,312.979,6.785,97.499,295.691,25.1624,25.1594,31.889905,31.889134,1.8287,6.6569,6.655,2.1865,6.0823,7965.0,33.251,2022-09-20 21:14:14
std,0.087119,0.087856,0.058988,2.224872,2.601972,0.060803,2.213479,2.683811,0.151303,0.150427,0.505669,0.504429,0.373846,0.701699,0.699412,0.190505,3.028188,2022.717421,12.387202,


## Create BTL report file

In [22]:
# btl report file: write the header with the first cast, then append the rest
for i, cast in enumerate(cruise_data.keys()):
    try:
        df = cruise_data[cast]
        df['cast'] = cast.lower().split('.')[0]
        if i == 0:
            df.to_csv(f'{cruise_name}.report_btl')
        else:
            df.to_csv(f'{cruise_name}.report_btl', mode='a', header=False)
    except Exception as err:
        print(f'some issue in {cast}: {err}')

## Time Properties

Time is not traditionally adjusted for CTD files, as the time-stamps are likely updated dynamically via a GPS feed.  However, FOCI tends to label the date/time with the ***at depth*** time-stamp.

## Depth Properties and other assumptions

- Currently, all processing and binning (1m for FOCI) is done via Sea-Bird routines and the Windows software.  This may change with the python ctd package for a few tasks.
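
For orientation only, 1m bin averaging can be sketched in plain Python as below. This is not the Sea-Bird routine. It assumes bins are centered on whole meters, so that bin `k` covers `[k - 0.5, k + 0.5)`, and the sample values are hypothetical.

```python
import math
from collections import defaultdict

def bin_average_1m(pressures, values):
    """Average values into 1 m bins centered on whole meters.

    Assumes bin k covers [k - 0.5, k + 0.5); returns {bin_center: mean}.
    """
    grouped = defaultdict(list)
    for p, v in zip(pressures, values):
        grouped[int(math.floor(p + 0.5))].append(v)
    return {k: sum(vs) / len(vs) for k, vs in sorted(grouped.items())}

# Hypothetical raw samples: pressure (dbar) and temperature (degree_C)
pressures = [2.6, 3.1, 3.4, 3.6, 4.2]
temps = [6.0, 6.2, 6.4, 5.0, 5.2]
binned = bin_average_1m(pressures, temps)
print(binned)
```

The first three samples fall in the 3 m bin and the last two in the 4 m bin.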

## Add Deployment meta information

In [6]:
#just a dictionary of dictionaries - simple
with open(cruise_meta_file) as file:
    cruise_config = yaml.full_load(file)
cruise_config[cruise_name]

{'CruiseID': 'DY2209',
 'CruiseID_Historic': '',
 'CruiseID_Alternates': '',
 'Project_Leg': '',
 'Vessel': 'NOAAS Oscar Dyson',
 'ShipID': 'DY',
 'StartDate': datetime.date(2022, 9, 19),
 'EndDate': datetime.date(2022, 9, 23),
 'Project': 'FOCI',
 'ChiefScientist': 'Ryan McCabe',
 'StartPort': 'Dutch Harbor, AK',
 'EndPort': 'Kodiak, AK',
 'CruiseLocation': 'Bering Sea',
 'Description': 'FOCI Fall Arctic/DBO Survey',
 'CruiseYear': 2022,
 'ctdlogs_pdf_name': 'DY2209_CTDCastLogs.pdf'}

In [7]:
#and if you want a cast from the cruise, just use the consecutive cast number
cruise_config['CTDCasts']['CTD001']

{'id': 53912,
 'Vessel': 'NOAAS Oscar Dyson',
 'CruiseID': 'DY2209',
 'Project_Leg': '',
 'UniqueCruiseID': 'DY2209',
 'Project': 'FOCI Fall Cruise',
 'StationNo_altname': '1',
 'ConsecutiveCastNo': 'CTD001',
 'LatitudeDeg': 64,
 'LatitudeMin': 0.23,
 'LongitudeDeg': 163,
 'LongitudeMin': 3.97,
 'GMTDay': 21,
 'GMTMonth': 'Sep',
 'GMTYear': 2022,
 'GMTTime': 18600,
 'DryBulb': 4.9,
 'RelativeHumidity': 92,
 'WetBulb': -99.9,
 'Pressure': 1012,
 'SeaState': '',
 'Visibility': '',
 'WindDir': 213,
 'WindSpd': 12.8,
 'CloudAmt': '',
 'CloudType': '',
 'Weather': '',
 'SurfaceTemp': 6.75,
 'BottomDepth': 38,
 'StationNameID': 'M14',
 'MaxDepth': 33,
 'InstrumentSerialNos': '',
 'Notes': '',
 'NutrientBtlNiskinNo': '1;2;3;;5;6',
 'NutrientBtlNumbers': '1;2;3;;4',
 'OxygenBtlNiskinNo': '1;1;2;3;;5',
 'OxygenBtlNumbers': '227;239;241;242;;243',
 'SalinityBtlNiskinNo': '1',
 'SalinityBtlNumbers': '632',
 'ChlorophyllBtlNiskinNo': '1;2;3;;5',
 'ChlorophyllBtlVolumes': '285;281;283;;283',
 'Inst
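
The cast log stores position as whole degrees plus decimal minutes. A minimal sketch of converting those fields to signed decimal degrees follows. The helper name `to_decimal_degrees` is hypothetical, and treating the logged longitude as positive degrees West is an assumption (it is consistent with a Bering Sea cruise, but check your own logs).

```python
def to_decimal_degrees(degrees, minutes, positive=True):
    """Convert degrees and decimal minutes to signed decimal degrees."""
    value = degrees + minutes / 60.0
    return value if positive else -value

# Values taken from the CTD001 cast log entry above
lat = to_decimal_degrees(64, 0.23)
# Assumption: the log stores longitude as positive degrees West,
# so it is negated to get degrees_east
lon = to_decimal_degrees(163, 3.97, positive=False)
print(round(lat, 4), round(lon, 4))  # 64.0038 -163.0662
```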

## Add Instrument meta information

Time, depth, lat, and lon should be added regardless (they are always our coordinates), but for a mooring site it's going to be a (1,1,1,t) dataset.

The variables of interest should be read from the data file and matched to a key for naming.  That key is in the inst_config file seen below and should represent common conversion names in the raw data.
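
The key-matching idea can be sketched as a simple lookup: each column name is checked against the keys of the instrument config, and anything without an entry is reported. The helper `split_by_metadata` is hypothetical and the dictionaries are trimmed-down stand-ins, not the real yaml content.

```python
def split_by_metadata(columns, inst_config):
    """Split column names into those with and without a metadata entry."""
    described = [c for c in columns if c in inst_config]
    missing = [c for c in columns if c not in inst_config]
    return described, missing

# Hypothetical, trimmed-down stand-ins for the renamed columns and inst_config
columns = ["temperature_ch1", "temperature_ch2", "scan"]
inst_config_demo = {
    "temperature_ch1": {"epic_key": "T_28"},
    "temperature_ch2": {"epic_key": "T2_35"},
}
described, missing = split_by_metadata(columns, inst_config_demo)
print(described)  # ['temperature_ch1', 'temperature_ch2']
print(missing)    # ['scan']
```

Listing the unmatched names before building the dataset makes it easier to decide whether to add an entry to the yaml file or accept that the variable carries no metadata.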

In [8]:
with open(inst_meta_file) as file:
    inst_config = yaml.full_load(file)
inst_config

{'time': {'epic_key': 'TIM_601',
  'name': 'time',
  'generic_name': 'time',
  'standard_name': 'time',
  'long_name': 'date and time since reference time',
  'time_origin': '1900-01-01 00:00:00',
  'units': 'days since 1900-01-01T00:00:00Z'},
 'depth': {'epic_key': 'D_3',
  'generic_name': 'depth',
  'units': 'meter',
  'long_name': 'depth below surface (meters)',
  'standard_name': 'depth'},
 'latitude': {'epic_key': 'LON_501',
  'name': 'latitude',
  'generic_name': 'latitude',
  'units': 'degrees_north',
  'long_name': 'latitude',
  'standard_name': 'latitude'},
 'longitude': {'epic_key': 'LAT_500',
  'name': 'longitude',
  'generic_name': 'longitude',
  'units': 'degrees_east',
  'long_name': 'longitude',
  'standard_name': 'longitude'},
 'temperature_ch1': {'epic_key': 'T_28',
  'generic_name': 'temp channel 1',
  'long_name': 'Sea temperature in-situ ITS-90 scale',
  'standard_name': 'sea_water_temperature',
  'units': 'degree_C'},
 'temperature_ch2': {'epic_key': 'T2_35',
  'ge
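
The `time` entry above uses units of `days since 1900-01-01T00:00:00Z`. As a hedged illustration of what that encoding means (xarray handles this when writing the file), a time-stamp can be converted by hand as below. The helper names are hypothetical and timezone handling is left out.

```python
from datetime import datetime, timedelta

TIME_ORIGIN = datetime(1900, 1, 1)  # matches 'days since 1900-01-01T00:00:00Z'

def to_days_since_origin(timestamp):
    """Convert a naive datetime to fractional days since TIME_ORIGIN."""
    return (timestamp - TIME_ORIGIN) / timedelta(days=1)

def from_days_since_origin(days):
    """Inverse of to_days_since_origin."""
    return TIME_ORIGIN + timedelta(days=days)

stamp = datetime(2022, 9, 20, 21, 13, 47)  # a bottle time-stamp from the CTD001 preview
days = to_days_since_origin(stamp)
print(int(days))  # 44822
print(from_days_since_origin(days))
```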

In [9]:
#sbe data uses header info to name variables... but we want standard names from the dictionary I've created, so we need to rename column variables appropriately
#rename values to appropriate names, if a value isn't in the .yaml file, you can add it

#*** biggest *** difference between moored and profile data is there may be multiple instruments with the same datatype (e.g. temperature)
# on the same platform.  We _used_ to use the phrases primary and secondary, but will now only refer to them as ch1, ch2 etc
cruise_data['CTD001.btl'] = cruise_data['CTD001.btl'].rename(columns={
                        't090c':'temperature_ch1',
                        't190c':'temperature_ch2',
                        'sal00':'salinity_ch1',
                        'sal11':'salinity_ch2',
                        'sbox0mm/kg':'oxy_conc_ch1',
                        'sbeox0ml/l':'oxy_concM_ch1',
                        'sbox1mm/kg':'oxy_conc_ch2',
                        'sbeox1ml/l':'oxy_concM_ch2',
                        'sbeox0ps':'oxy_percentsat_ch1',
                        'sbeox1ps':'oxy_percentsat_ch2',
                        'sigma-t00':'sigma_t_ch1',
                        'sigma-t11':'sigma_t_ch2',
                        'cstarat0':'Attenuation',
                        'cstartr0':'Transmittance',
                        'fleco-afl':'chlor_fluorescence',
                        'turbwetntu0':'turbidity',
                        'empty':'empty', #this will be ignored
                        'prdm':'pressure',
                        'flag':'flag'})

cruise_data['CTD001.btl'].sample()

bottle,salinity_ch2,salinity_ch1,oxy_concM_ch1,oxy_percentsat_ch1,oxy_conc_ch1,oxy_concM_ch2,oxy_percentsat_ch2,oxy_conc_ch2,sigma_t_ch1,sigma_t_ch2,...,c1ms/cm,chlor_fluorescence,temperature_ch1,temperature_ch2,turbidity,par,scan,pressure,datetime,cast
4.0,31.5945,31.5954,7.1746,103.098,312.67,6.785,97.499,295.691,24.7824,24.7817,...,31.885879,1.8287,6.655,6.655,1.7677,5.9884,7335.0,3.089,2022-09-20 21:13:47,ctd001


In [10]:
cruise_data['CTD001.btl'].columns

Index(['salinity_ch2', 'salinity_ch1', 'oxy_concM_ch1', 'oxy_percentsat_ch1',
       'oxy_conc_ch1', 'oxy_concM_ch2', 'oxy_percentsat_ch2', 'oxy_conc_ch2',
       'sigma_t_ch1', 'sigma_t_ch2', 'c0ms/cm', 'c1ms/cm',
       'chlor_fluorescence', 'temperature_ch1', 'temperature_ch2', 'turbidity',
       'par', 'scan', 'pressure', 'datetime', 'cast'],
      dtype='object')

## Add institutional meta-information


In [11]:
with open(group_meta_file) as file:
    group_config = yaml.full_load(file)
group_config

{'source_documents': 'http://www.oceansites.org/docs/oceansites_data_format_reference_manual.pdf',
 'institution': 'Pacific Marine Environmental Lab (PMEL)',
 'project': 'EcoFOCI',
 'project_url': 'https://www.ecofoci.noaa.gov',
 'principal_investigator': 'Phyllis Stabeno',
 'principal_investigator_email': 'phyllis.stabeno (at) noaa.gov',
 'creator_name': 'Shaun Bell',
 'creator_email': 'shaun.bell (at) noaa.gov',
 'creator_institution': 'PMEL',
 'keywords': 'Mooring, Oceanographic',
 'comment': 'Provisional data',
 'sea_area': 'Bering Sea (BS)',
 'featureType': 'timeSeries',
 'conventions': '”CF-1.6, ~OceanSITES-1.5, ACDD-1.2”',
 'license': '',
 'references': '',
 'citation': '',
 'acknowledgement': ''}

In [12]:
cruise_data['CTD001.btl'].columns

Index(['salinity_ch2', 'salinity_ch1', 'oxy_concM_ch1', 'oxy_percentsat_ch1',
       'oxy_conc_ch1', 'oxy_concM_ch2', 'oxy_percentsat_ch2', 'oxy_conc_ch2',
       'sigma_t_ch1', 'sigma_t_ch2', 'c0ms/cm', 'c1ms/cm',
       'chlor_fluorescence', 'temperature_ch1', 'temperature_ch2', 'turbidity',
       'par', 'scan', 'pressure', 'datetime', 'cast'],
      dtype='object')

In [13]:
# Add meta data and prelim processing based on meta data
# Convert to xarray and add meta information - save as CF netcdf file
# pass -> data, instmeta, depmeta
cruise_data_nc = ncCFsave.EcoFOCI_CFnc(df=cruise_data['CTD001.btl'], 
                                instrument_yaml=inst_config, 
                                operation_yaml=cruise_config,
                                operation_type='ctd')
cruise_data_nc

<EcoFOCIpy.io.ncCFsave.EcoFOCI_CFnc at 0x154829810>

In [14]:
cruise_data_nc.get_xdf()

At this point, you could save your file with the `.xarray2netcdf_save()` method and have a functioning dataset, but it would be very simple, with no additional QC, metadata, or tuned parameters for optimizing software like Ferret or ERDDAP.

In [15]:
# expand the dimensions and coordinate variables
# renames them appropriately and prepares them for meta-filled values
cruise_data_nc.expand_dimensions(dim_names=['latitude','longitude','time'],geophys_sort=False)

In [16]:
#build list from columns in data - if a variable isn't in the yaml file, it will be dropped from the final data fields
cruise_data_nc.variable_meta_data(variable_keys=list(cruise_data['CTD001.btl'].columns.values),drop_missing=False)
#adding dimension meta needs to come after updating the dimension values... BUG?
cruise_data_nc.dimension_meta_data(variable_keys=['time','latitude','longitude'])

The following steps can happen in just about any order and are all metadata driven.  They are therefore not required for a functioning dataset, but they are required for a well-described dataset.

In [17]:
cruise_data_nc.get_xdf()

In [18]:
#add global attributes
cruise_data_nc.deployment_meta_add(conscastno='CTD001')

#add institutional global attributes
cruise_data_nc.institution_meta_add(group_config)

#add creation date/time - provenance data
cruise_data_nc.provinance_meta_add()

#provide initial qc status field
cruise_data_nc.qc_status(qc_status='unqcd') #<- options are unknown, excellent, probably good, mixed, unqcd

cruise_data_nc.get_xdf()

## Rare Bottle File Edits

<div class="warning" style='background-color:#ffcccb; color: #FF0000; border-left: solid #805AD5 4px; border-radius: 4px; padding:0.7em;'>
<span>
<p style='margin-top:1em; text-align:center'>
<b>WARNING</b></p>
<p style='margin-left:1em;'>The bottle/niskin number and the rosette position should be the same but can differ (for example, bottles are labeled sequentially but a rosette position is skipped due to balancing or other instruments). On this cruise the following rosette positions were fired, while the bottles were labeled differently.</p>

cruise_data[cast]

<br>
The following happened on a Dyson Cruise in 2021/2022

<style type="text/css">
.tg  {border-collapse:collapse;border-spacing:0;}
.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  overflow:hidden;padding:10px 5px;word-break:normal;}
.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
  font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
.tg .tg-0lax{text-align:left;vertical-align:top}
</style>
<table class="tg">
<thead>
  <tr>
    <th class="tg-0lax"><span style="font-weight:bold">Rosette</span></th>
    <th class="tg-0lax">1</th>
    <th class="tg-0lax">12</th>
    <th class="tg-0lax">11</th>
    <th class="tg-0lax">10</th>
    <th class="tg-0lax">9</th>
    <th class="tg-0lax">8</th>
    <th class="tg-0lax">7<br></th>
    <th class="tg-0lax">6</th>
    <th class="tg-0lax">5</th>
    <th class="tg-0lax">4</th>
    <th class="tg-0lax">3</th>
    <th class="tg-0lax">2</th>
  </tr>
</thead>
<tbody>
  <tr>
    <td class="tg-0lax"><span style="font-weight:bold">Niskin</span><br></td>
    <td class="tg-0lax">1</td>
    <td class="tg-0lax">2</td>
    <td class="tg-0lax">3</td>
    <td class="tg-0lax">4</td>
    <td class="tg-0lax">5</td>
    <td class="tg-0lax">6</td>
    <td class="tg-0lax">7</td>
    <td class="tg-0lax">8</td>
    <td class="tg-0lax">9</td>
    <td class="tg-0lax">10<br></td>
    <td class="tg-0lax">11</td>
    <td class="tg-0lax">12</td>
  </tr>
</tbody>
</table>
</div>
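
When such a mismatch occurs, the fired rosette positions need to be translated into niskin labels before the bottle data is indexed. A minimal sketch using the table above is shown here. `relabel_bottles` is a hypothetical helper and not an EcoFOCIpy function.

```python
# Rosette positions and niskin labels, in the order given in the table above
rosette_positions = [1, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2]
niskin_labels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

rosette_to_niskin = dict(zip(rosette_positions, niskin_labels))

def relabel_bottles(fired_positions):
    """Translate fired rosette positions into niskin labels."""
    return [rosette_to_niskin[int(pos)] for pos in fired_positions]

print(relabel_bottles([1.0, 12.0, 2.0]))  # [1, 2, 12]
```

The `int()` call is there because the bottle index parsed from the btl file is a float (for example `4.0`).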

## Save CF Netcdf files

We currently stick to netcdf3 classic, but migrating to netcdf4 (the default) should cause no problems for most modern purposes.  It's easy enough to pass the `format` kwarg through to the netcdf API of xarray.

In [19]:
#loop over all casts and perform tasks shown above

for cast in cruise_data.keys():
    try:
        cruise_data[cast] = cruise_data[cast].drop(columns=['c0ms/cm','c1ms/cm'])
        cruise_data[cast] = cruise_data[cast].rename(columns={
                            't090c':'temperature_ch1',
                            't190c':'temperature_ch2',
                            'sal00':'salinity_ch1',
                            'sal11':'salinity_ch2',
                            'sbox0mm/kg':'oxy_conc_ch1',
                            'sbeox0ml/l':'oxy_concM_ch1',
                            'sbox1mm/kg':'oxy_conc_ch2',
                            'sbeox1ml/l':'oxy_concM_ch2',
                            'sbeox0ps':'oxy_percentsat_ch1',
                            'sbeox1ps':'oxy_percentsat_ch2',
                            'sigma-t00':'sigma_t_ch1',
                            'sigma-t11':'sigma_t_ch2',
                            'cstarat0':'Attenuation',
                            'cstartr0':'Transmittance',
                            'fleco-afl':'chlor_fluorescence',
                            'turbwetntu0':'turbidity',
                            'empty':'empty', #this will be ignored
                            'prdm':'pressure',
                            'flag':'flag'})

        cruise_data_nc = ncCFsave.EcoFOCI_CFnc(df=cruise_data[cast], 
                                    instrument_yaml=inst_config, 
                                    operation_yaml=cruise_config,
                                    operation_type='ctd')

        cruise_data_nc.expand_dimensions(dim_names=['latitude','longitude','time'],geophys_sort=False)

        cruise_data_nc.variable_meta_data(variable_keys=list(cruise_data[cast].columns.values),drop_missing=False)
        #adding dimension meta needs to come after updating the dimension values... BUG?
        cruise_data_nc.dimension_meta_data(variable_keys=['time','latitude','longitude'])
        cruise_data_nc.temporal_geospatioal_meta_data_ctd(positiveE=False,conscastno=cast.split('.')[0])

        #add global attributes
        cruise_data_nc.deployment_meta_add(conscastno=cast.split('.')[0].upper())

        #add institutional global attributes
        cruise_data_nc.institution_meta_add(group_config)

        #add creation date/time - provenance data
        cruise_data_nc.provinance_meta_add()

        #provide initial qc status field
        cruise_data_nc.qc_status(qc_status='unqcd') #<- options are unknown, excellent, probably good, mixed, unqcd

        cast_label = cast.lower().split('d')[-1].split('.')[0]
        cruise_data_nc.xarray2netcdf_save(xdf = cruise_data_nc.get_xdf(),
                                   filename=cruise_name+'c'+cast_label.zfill(3)+'_btl.nc',format="NETCDF3_CLASSIC")

  
    except Exception as err:
        print(f'Skipping {cast}: {err}')
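
The loop above builds the output name by splitting the raw file name on the letter `d`, which works for names like `CTD001.btl`. One possible alternative, shown only as a sketch, extracts the cast number with a regular expression. `btl_filename` is a hypothetical helper, not part of EcoFOCIpy.

```python
import re

def btl_filename(cruise_name, cast_file):
    """Build '<cruise>c<NNN>_btl.nc' from a raw file name such as 'CTD001.btl'."""
    match = re.search(r"(\d+)", cast_file)
    if match is None:
        raise ValueError(f"no cast number found in {cast_file!r}")
    return f"{cruise_name}c{match.group(1).zfill(3)}_btl.nc"

print(btl_filename("DY2209", "CTD001.btl"))  # DY2209c001_btl.nc
```

Raising when no digits are found makes a badly named input file visible instead of producing a malformed output name.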



## Next Steps

QC of data (plot parameters with other instruments)
- be sure to update the qc_status and the history