# Reformatting of haddock annotation datasets made by the passive acoustics group at NEFSC


## Purpose of this notebook
This notebook describes the steps involved in cleaning up and reorganizing the manual annotations of haddock sounds that were made by the passive acoustics group at NEFSC.

The specific objectives of this notebook are:

- Reorganize the audio data so they are sorted by deployment and location.
- Convert the Xbat annotations to Raven annotation tables.
- Add metadata to all annotations (e.g., coordinates, depths, location, dates).
- Make annotation labels consistent (i.e., 'HK' for haddock, 'NN' for noise).


## Description of the data and manual annotations

This dataset uses manual annotations made with Xbat on data from Stellwagen Bank. Two days of data were annotated manually: 30 Oct 2009 at NOPP8 and 4 Apr 2008 at NOPP2. The first step of this analysis consisted of converting the Xbat annotations (mat files) to Raven format. This was done with Matlab, Excel, and Raven. This stage is not documented in the notebook.

The audio files originally have 10 channels. The software SoX was used to extract the annotated channel into a separate file.
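
The channel extraction can be scripted. The sketch below builds a SoX command that uses the `remix` effect to keep a single channel. The file names and the channel number are hypothetical placeholders, and the exact command used for this dataset is not recorded in the notebook.

```python
import subprocess


def build_sox_command(infile, outfile, channel):
    """Build a SoX command that keeps one channel (1-based) of a multichannel file."""
    if channel < 1:
        raise ValueError("SoX channel numbers start at 1")
    # 'remix N' selects input channel N as the only output channel
    return ["sox", infile, outfile, "remix", str(channel)]


# Hypothetical file names and channel number, for illustration only
cmd = build_sox_command("recording_10ch.aif", "recording_ch08.aif", 8)
print(" ".join(cmd))

# Uncomment to run the extraction (requires SoX to be installed and on the PATH)
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids quoting problems when file paths contain spaces.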


## Adding annotation fields in Raven

Since the times of the original annotations are relative to the first file of each subset, we need to add additional fields so we can recalculate annotation times relative to the start of each audio file. This will make the manipulation of annotations much easier and cleaner.

1. Open all files from the same subset in Raven
2. Load the annotation table
3. Right-click on the header section of the annotation table in Raven, then select "Choose Measurements..."
4. Add the measurements "File Offset (s)", "Begin File", and "Begin Path".
5. Save the updated annotation table (File > Save Selection Table). In the rest of this notebook, the name of this new table file ends with "_modified.txt".
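
To illustrate why these extra fields help, the sketch below recomputes the start and end offsets of each annotation relative to its own audio file, using the same column names as the Raven table. The values and file names are made up for illustration.

```python
import pandas as pd

# Toy Raven selection table (values are made up for illustration)
df = pd.DataFrame({
    "Begin Time (s)": [10.0, 905.5],   # relative to the start of the first file
    "End Time (s)": [10.4, 906.0],
    "File Offset (s)": [10.0, 5.5],    # relative to the start of "Begin File"
    "Begin File": ["file_A.aif", "file_B.aif"],
})

# The duration does not depend on the time reference
duration = df["End Time (s)"] - df["Begin Time (s)"]
# Offsets relative to the start of the file that contains each annotation
time_min_offset = df["File Offset (s)"]
time_max_offset = time_min_offset + duration

print(pd.DataFrame({
    "file": df["Begin File"],
    "start_s": time_min_offset,
    "end_s": time_max_offset,
}))
```

The second annotation starts 905.5 s into the subset but only 5.5 s into its own file, which is the value needed to locate it in the audio.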

The data and their respective Raven annotations were placed in the folders USA-NEFSC-SBNMS-200803-NOPP2_CH08 and USA-NEFSC-SBNMS-200910-NOPP8_CH07.

## Import libraries and define functions used throughout

In [22]:
import os
from ecosound.core.annotation import Annotation
from ecosound.core.metadata import DeploymentInfo
from ecosound.core.audiotools import Sound
import pandas as pd
from datetime import datetime
import re
import uuid

def get_datetime_from_filename(filename):
    """Return the start date/time encoded in an audio file name."""
    time_format = "%Y%m%d_%H%M%S"
    # Date/time stamp of the original audio file: "_YYYYMMDD_HHMMSS"
    # followed by any one character (typically the "." of the extension)
    file_orig_regex = r"_[0-9]{8}_[0-9]{6}."
    datestr = re.search(file_orig_regex, filename)
    if datestr is None:
        raise ValueError("No date/time stamp found in file name: " + str(filename))
    # Strip the leading "_" and the trailing character before parsing
    date = datetime.strptime(datestr[0][1:-1], time_format)
    return date

def load_raven_table(root_dir,audio_dir,annotation_file,deployment_file):
    ## load Raven annotations
    df = pd.read_csv(os.path.join(root_dir, annotation_file), sep='\t')
    df = df[df['View']== 'Spectrogram 1'] # remove all "waveform" rows (redundant with the "Spectrogram" ones)
    df = df.reset_index(drop=True)    
    ## find out start date/time for each audio file
    files_date=df['Begin File'].apply(get_datetime_from_filename)
    # Start and stop time offsets of annotations (relative to the start of each audio file)
    duration = df['End Time (s)']-df['Begin Time (s)']
    start_offset = df['File Offset (s)']
    end_offset = start_offset + duration
    ## Populate annotation object
    annot = Annotation()
    annot.data['audio_file_start_date'] = files_date
    annot.data['audio_channel'] = df['Channel']-1
    annot.data['audio_file_name'] = df['Begin File'].apply(lambda x: os.path.splitext(os.path.basename(x))[0])
    annot.data['audio_file_dir'] = audio_dir
    annot.data['audio_file_extension'] = df['Begin Path'].apply(lambda x: os.path.splitext(x)[1])
    annot.data['time_min_offset'] = start_offset
    annot.data['time_max_offset'] = end_offset
    annot.data['time_min_date'] = pd.to_datetime(annot.data['audio_file_start_date'] + pd.to_timedelta(annot.data['time_min_offset'], unit='s'))
    annot.data['time_max_date'] = pd.to_datetime(annot.data['audio_file_start_date'] + pd.to_timedelta(annot.data['time_max_offset'], unit='s'))
    annot.data['frequency_min'] = df['Low Freq (Hz)']
    annot.data['frequency_max'] = df['High Freq (Hz)']    
    annot.data['label_class'] = df['tags']
    annot.data['from_detector'] = False
    annot.data['software_name'] = 'raven'
    annot.data['uuid'] = annot.data.apply(lambda _: str(uuid.uuid4()), axis=1)
    annot.data['duration'] = annot.data['time_max_offset'] - annot.data['time_min_offset']
    annot.insert_metadata(os.path.join(root_dir, deployment_file)) # insert metadata
    annot.check_integrity(verbose=True, ignore_frequency_duplicates=True) # check integrity
    print(len(annot), 'annotations imported.')
    return annot
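
As a standalone illustration of the date handling above, the sketch below extracts a `_YYYYMMDD_HHMMSS` stamp from a file name and adds a file offset to obtain an absolute annotation time. The function name `parse_file_start`, the file name, and the offset are hypothetical and only serve the example.

```python
import re
from datetime import datetime, timedelta


def parse_file_start(filename):
    """Extract the _YYYYMMDD_HHMMSS stamp from a file name and return a datetime."""
    match = re.search(r"_([0-9]{8}_[0-9]{6})", filename)
    if match is None:
        raise ValueError(f"No date/time stamp found in: {filename}")
    return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")


# Hypothetical file name and offset, for illustration only
file_start = parse_file_start("NOPP8_CH07_20091030_001500.aif")
annotation_start = file_start + timedelta(seconds=12.5)
print(file_start, annotation_start)
```

This mirrors how `time_min_date` and `time_max_date` are obtained in `load_raven_table`: the start time of the audio file plus the offset of the annotation within that file.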


## Create deployment info files with metadata for each deployment

Instantiate a DeploymentInfo object to handle metadata for the deployment, and create an empty deployment info file.

In [23]:
# Instantiate
Deployment = DeploymentInfo()
# write empty file to fill in (do once only)
#Deployment.write_template(os.path.join(root_dir, deployment_file))

A CSV file "deployment_info.csv" has now been created in the root_dir. It is empty apart from the column headers, which are the following fields:

* audio_channel_number
* UTC_offset
* sampling_frequency (in Hz)
* bit_depth 
* mooring_platform_name
* recorder_type
* recorder_SN
* hydrophone_model
* hydrophone_SN
* hydrophone_depth
* location_name
* location_lat
* location_lon
* location_water_depth
* deployment_ID
* deployment_date
* recovery_date

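This file needs to be filled in by the user with the appropriate deployment information. Once filled in, the file can be loaded using the Deployment object. In this notebook, its path is passed to `annot.insert_metadata` inside `load_raven_table`, which attaches the deployment metadata to every annotation.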
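
Before using the file, it can help to check that no field was left empty. The sketch below is a plain pandas check, independent of ecosound; the function name `find_incomplete_fields` is hypothetical, and an in-memory empty template stands in for the real file.

```python
import pandas as pd

# Fields listed in the deployment info template
REQUIRED_FIELDS = [
    "audio_channel_number", "UTC_offset", "sampling_frequency", "bit_depth",
    "mooring_platform_name", "recorder_type", "recorder_SN",
    "hydrophone_model", "hydrophone_SN", "hydrophone_depth",
    "location_name", "location_lat", "location_lon", "location_water_depth",
    "deployment_ID", "deployment_date", "recovery_date",
]


def find_incomplete_fields(deployment_df):
    """Return the required fields that are absent or contain empty cells."""
    incomplete = []
    for field in REQUIRED_FIELDS:
        if field not in deployment_df.columns or deployment_df[field].isna().any():
            incomplete.append(field)
    return incomplete


# One blank row with all headers, standing in for an unfilled template
template = pd.DataFrame([dict.fromkeys(REQUIRED_FIELDS)])
print(find_incomplete_fields(template))
```

For a real file, the table would be read with `pd.read_csv(os.path.join(root_dir, deployment_file))` before running the check.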

## Cleaning annotations

Now we go through the annotations for each deployment/dataset, add the associated metadata, correct inconsistencies in annotation labels, and save the result as a NetCDF and Raven file.

### Dataset 1a: USA-NEFSC-SBNMS-200803-NOPP2_CH08 - single pulses

First, define the paths to the folders containing the raw annotation and audio files for this deployment.

In [24]:
root_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200803-NOPP2_CH08'
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200803-NOPP2_CH08'
deployment_file = r'deployment_info.csv' 
annotation_file = r'NOPP2_20090404_haddock_singles_raven_table_modified.txt'

Now we can load and format the manual annotations for this dataset and add the metadata. 

In [25]:
annot = load_raven_table(root_dir,audio_dir,annotation_file,deployment_file)

Duplicate entries removed: 0
Integrity test succesfull
182 annotations imported.


Let's look at the different annotation labels that were used:

In [26]:
print(annot.get_labels_class())

[nan]


Let's insert the label "HK" for haddock and "P" (pulse) as the second label:

In [30]:
annot.insert_values(label_class= 'HK')
annot.insert_values(label_subclass= 'P')
print(annot.get_labels_class())

['HK']


Now, let's have a look at a summary of all the annotations available in this dataset.

In [31]:
# print summary (pivot table)
print(annot.summary())

label_class                         HK  Total
deployment_ID                                
USA-NEFSC-SBNMS-200803-NOPP2_CH08  182    182
Total                              182    182
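
A comparable count table can be built with plain pandas, which is handy for checking a dataset outside of ecosound. The sketch below uses `pd.crosstab` on a made-up table; it is not ecosound's implementation of `summary()`, and the deployment names and rows are placeholders.

```python
import pandas as pd

# Toy annotation table (made-up rows, not the real dataset)
data = pd.DataFrame({
    "deployment_ID": ["DEPLOYMENT_A", "DEPLOYMENT_A", "DEPLOYMENT_B"],
    "label_class": ["HK", "HK", "NN"],
})

# Count annotations per deployment and label, with row and column totals
summary = pd.crosstab(
    data["deployment_ID"],
    data["label_class"],
    margins=True,
    margins_name="Total",
)
print(summary)
```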


The dataset can now be saved as a NetCDF4 file (the export to a Raven annotation file is left commented out):

In [32]:
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +'_singlepulse_annotations.nc'))
#annot.to_raven(root_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)



### Dataset 1b: USA-NEFSC-SBNMS-200803-NOPP2_CH08 - pulse trains
Now we can repeat the steps above for all the other datasets:

In [33]:
root_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200803-NOPP2_CH08'
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200803-NOPP2_CH08'
deployment_file = r'deployment_info.csv' 
annotation_file = r'NOPP2_20080404_haddock_pulsetrains_concatenated_raven_table_modified.txt'

In [34]:
annot = load_raven_table(root_dir,audio_dir,annotation_file,deployment_file)

Duplicate entries removed: 0
Integrity test succesfull
2753 annotations imported.


In [36]:
print(annot.get_labels_class())

[nan]


In [37]:
annot.insert_values(label_class= 'HK')
annot.insert_values(label_subclass= 'PT')
print(annot.summary())

label_class                          HK  Total
deployment_ID                                 
USA-NEFSC-SBNMS-200803-NOPP2_CH08  2753   2753
Total                              2753   2753


In [38]:
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +'_pulsetrain_annotations.nc'))
#annot.to_raven(root_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)



### Dataset 2a: USA-NEFSC-SBNMS-200910-NOPP8_CH07 - single pulses

In [45]:
root_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200910-NOPP8_CH07'
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200910-NOPP8_CH07'
deployment_file = r'deployment_info.csv' 
annotation_file = r'NOPP8_20091030_haddock_singles_raven_table_modified.txt'

In [46]:
annot = load_raven_table(root_dir,audio_dir,annotation_file,deployment_file)

Duplicate entries removed: 0
Integrity test succesfull
282 annotations imported.


In [47]:
print(annot.get_labels_class())

[nan]


In [48]:
annot.insert_values(label_class= 'HK')
annot.insert_values(label_subclass= 'P')
print(annot.summary())

label_class                         HK  Total
deployment_ID                                
USA-NEFSC-SBNMS-200910-NOPP8_CH07  282    282
Total                              282    282


In [49]:
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +'_singlepulse_annotations.nc'))
#annot.to_raven(root_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)



### Dataset 2b: USA-NEFSC-SBNMS-200910-NOPP8_CH07 - pulse trains

In [50]:
root_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200910-NOPP8_CH07'
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-SBNMS-200910-NOPP8_CH07'
deployment_file = r'deployment_info.csv' 
annotation_file = r'NOPP8_20091030_haddock_pulsetrains_concatenated_raven_table_modified.txt'

In [51]:
annot = load_raven_table(root_dir,audio_dir,annotation_file,deployment_file)

Duplicate entries removed: 0
Integrity test succesfull
9446 annotations imported.


In [52]:
print(annot.get_labels_class())

[nan]


In [53]:
annot.insert_values(label_class= 'HK')
annot.insert_values(label_subclass= 'PT')
print(annot.summary())

label_class                          HK  Total
deployment_ID                                 
USA-NEFSC-SBNMS-200910-NOPP8_CH07  9446   9446
Total                              9446   9446


In [54]:
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +'_pulsetrain_annotations.nc'))
#annot.to_raven(root_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)
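
Since the same steps are repeated for each annotation table, they could also be driven from a single list. The sketch below only gathers the four configurations used in this notebook and rebuilds the output file names. It assumes that `deployment_ID` equals the folder name, as shown in the summary tables; the `DATASETS` list and the `output_filename` helper are hypothetical, and the calls to `load_raven_table` are left as comments because they need ecosound and the data on disk.

```python
# (deployment folder, annotation file, label_subclass, output suffix)
DATASETS = [
    ("USA-NEFSC-SBNMS-200803-NOPP2_CH08",
     "NOPP2_20090404_haddock_singles_raven_table_modified.txt", "P", "singlepulse"),
    ("USA-NEFSC-SBNMS-200803-NOPP2_CH08",
     "NOPP2_20080404_haddock_pulsetrains_concatenated_raven_table_modified.txt", "PT", "pulsetrain"),
    ("USA-NEFSC-SBNMS-200910-NOPP8_CH07",
     "NOPP8_20091030_haddock_singles_raven_table_modified.txt", "P", "singlepulse"),
    ("USA-NEFSC-SBNMS-200910-NOPP8_CH07",
     "NOPP8_20091030_haddock_pulsetrains_concatenated_raven_table_modified.txt", "PT", "pulsetrain"),
]


def output_filename(deployment_id, suffix):
    """Rebuild the NetCDF file name used in the cells above."""
    return "Annotations_dataset_" + deployment_id + "_" + suffix + "_annotations.nc"


for folder, annotation_file, subclass, suffix in DATASETS:
    # With ecosound and the data available, each iteration would run:
    # annot = load_raven_table(root_dir, audio_dir, annotation_file, 'deployment_info.csv')
    # annot.insert_values(label_class='HK')
    # annot.insert_values(label_subclass=subclass)
    # annot.to_netcdf(os.path.join(root_dir, output_filename(folder, suffix)))
    print(annotation_file, "->", output_filename(folder, suffix))
```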

