# Reformatting of the Brunswick minke annotations


## Purpose of this notebook
This notebook describes the steps involved in gathering, cleaning up, and reorganizing the manual annotations of minke whale pulse trains that were made by a PA-group intern. Data were fully analysed from 17-Dec-2016 to 27-Dec-2016 (channel 6 of deployment USA-NEFSC-GA-201612). This fully annotated dataset can be used to assess the performance of the detector (i.e., to quantify precision, recall, and the number of false alarms per day).

The specific objectives of this notebook are:

- Convert the Raven annotation tables to have annotation times relative to the beginning of each audio file.
- Add metadata to all annotations (i.e., coordinates, depths, location, dates, etc.)
- Make annotation labels consistent (i.e., 'MW' for minke)
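
As a rough illustration of the detector metrics mentioned above, the sketch below computes precision, recall, and the number of false alarms per day from counts of true positives, false positives, and false negatives. The function name and the counts are hypothetical; they are not results from this dataset and are not part of ecosound.

```python
def detector_metrics(n_true_pos, n_false_pos, n_false_neg, n_days):
    """Return (precision, recall, false alarms per day) from detection counts."""
    n_detections = n_true_pos + n_false_pos   # everything the detector reported
    n_annotations = n_true_pos + n_false_neg  # everything the analyst annotated
    if n_detections == 0 or n_annotations == 0 or n_days <= 0:
        raise ValueError("Counts and number of days must be positive.")
    precision = n_true_pos / n_detections
    recall = n_true_pos / n_annotations
    false_alarms_per_day = n_false_pos / n_days
    return precision, recall, false_alarms_per_day


# Hypothetical counts, for illustration only
print(detector_metrics(n_true_pos=80, n_false_pos=20, n_false_neg=20, n_days=10))
# (0.8, 0.8, 2.0)
```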


## Adding annotation fields in Raven

Since the times of the original annotations are relative to the first file of each day, we need to add fields to the tables so that annotation times can be recalculated relative to the start of each audio file. This makes the annotations much easier and cleaner to manipulate.

1. Open all files for an entire day in Raven
2. Load the annotation table "...masterlog.txt" corresponding to that day
3. Right-click the header section of the annotation table in Raven, then select "Choose Measurements..."
4. Add the measurements "File Offset (s)", "Begin File", and "Begin Path".
5. Save the updated annotation table (File > Save Selection Table). The modified Raven tables are all saved in .\USA-NEFSC-GA-201612-CH6\old_format\tables_with_added_fields\.
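
To make the relationship between day-relative times and file offsets concrete, here is a minimal sketch of the conversion that the "File Offset (s)" measurement provides. It assumes that the files of a day are contiguous (no gaps) and that their durations are known; the function name and the durations are hypothetical. In this notebook the offsets are read directly from Raven rather than recomputed.

```python
from bisect import bisect_right
from itertools import accumulate


def to_file_offset(begin_time_s, file_durations_s):
    """Map a time relative to the first file of the day to (file index, offset in s)."""
    if begin_time_s < 0:
        raise ValueError("begin_time_s must be non-negative.")
    # Start of each file relative to the first file: 0, d0, d0 + d1, ...
    starts = [0.0] + list(accumulate(file_durations_s))[:-1]
    # Index of the last file that starts at or before begin_time_s
    idx = bisect_right(starts, begin_time_s) - 1
    return idx, begin_time_s - starts[idx]


# Three hypothetical 300 s files opened together in Raven
durations = [300.0, 300.0, 300.0]
print(to_file_offset(650.0, durations))  # (2, 50.0)
```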

## Import libraries and define functions used throughout

In [40]:
import os
from ecosound.core.annotation import Annotation
from ecosound.core.metadata import DeploymentInfo
from ecosound.core.tools import list_files, filename_to_datetime
from ecosound.core.audiotools import Sound
import pandas as pd
from datetime import datetime
import re
import uuid

def get_datetime_from_filename(filename):
    # Extract the start date/time of the original audio file from its name,
    # which is expected to contain "_YYYYMMDD_HHMMSS".
    time_format = "%Y%m%d_%H%M%S"
    file_orig_regex = "_[0-9]{8}_[0-9]{6}"
    match = re.search(file_orig_regex, filename)
    if match is None:
        raise ValueError("No date/time stamp found in file name: " + filename)
    # drop the leading underscore before parsing
    date = datetime.strptime(match[0][1:], time_format)
    return date

def load_raven_table(root_dir,audio_dir,annotation_file,deployment_file):
    ## load Raven annotations
    df = pd.read_csv(os.path.join(root_dir, annotation_file), sep='\t')
    df = df[df['View']== 'Spectrogram 1'] # remove all "waveform" rows (redundant with the "Spectrogram" ones)
    df = df.reset_index(drop=True)    
    ## find out start date/time for each audio file
    files_date=df['Begin File'].apply(get_datetime_from_filename)
    # Define start and stop time offsets of annotations (relative to the start of each audio file)
    duration = df['End Time (s)']-df['Begin Time (s)']
    start_offset = df['File Offset (s)']
    end_offset = start_offset + duration
    ## Populate annotation object
    annot = Annotation()
    annot.data['audio_file_start_date'] = files_date
    annot.data['audio_channel'] = df['Channel']-1
    annot.data['audio_file_name'] = df['Begin File'].apply(lambda x: os.path.splitext(os.path.basename(x))[0])
    annot.data['audio_file_dir'] = audio_dir
    annot.data['audio_file_extension'] = df['Begin Path'].apply(lambda x: os.path.splitext(x)[1])
    annot.data['time_min_offset'] = start_offset
    annot.data['time_max_offset'] = end_offset
    annot.data['time_min_date'] = pd.to_datetime(annot.data['audio_file_start_date'] + pd.to_timedelta(annot.data['time_min_offset'], unit='s'))
    annot.data['time_max_date'] = pd.to_datetime(annot.data['audio_file_start_date'] + pd.to_timedelta(annot.data['time_max_offset'], unit='s'))
    annot.data['frequency_min'] = df['Low Freq (Hz)']
    annot.data['frequency_max'] = df['High Freq (Hz)']    
    annot.data['label_class'] = 'MW'
    annot.data['label_subclass'] = df['calltype']
    annot.data['from_detector'] = False
    annot.data['software_name'] = 'raven'
    annot.data['uuid'] = annot.data.apply(lambda _: str(uuid.uuid4()), axis=1)
    annot.data['duration'] = annot.data['time_max_offset'] - annot.data['time_min_offset']
    annot.insert_metadata(os.path.join(root_dir, deployment_file)) # insert metadata
    annot.check_integrity(verbose=True, ignore_frequency_duplicates=True) # check integrity
    print(len(annot), 'annotations imported.')
    return annot
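
The time conversion performed in load_raven_table can be illustrated with the standard library alone: the start date of the audio file is parsed from its name, and the file offset is added to obtain the absolute time of the annotation. This is only a sketch. The file name and offset below are hypothetical, and the only assumption about the naming scheme is that it contains "_YYYYMMDD_HHMMSS".

```python
import re
from datetime import datetime, timedelta


def parse_file_start(filename):
    """Parse the '_YYYYMMDD_HHMMSS' stamp of an audio file name into a datetime."""
    match = re.search(r"_(\d{8}_\d{6})", filename)
    if match is None:
        raise ValueError("No date/time stamp found in: " + filename)
    return datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")


# Hypothetical file name and file offset, for illustration only
file_name = "RECORDER_20161217_000000.wav"
file_offset_s = 12.5

file_start = parse_file_start(file_name)
annotation_start = file_start + timedelta(seconds=file_offset_s)
print(annotation_start.isoformat())  # 2016-12-17T00:00:12.500000
```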



## Create deployment info files with metadata for each deployment

Instantiate a DeploymentInfo object to handle metadata for the deployment, and create an empty deployment info file.

In [2]:
# Instantiate
Deployment = DeploymentInfo()
# write empty file to fill in (do once only)
#Deployment.write_template(os.path.join(root_dir, deployment_file))

A CSV file "deployment_info.csv" has now been created in the root_dir. It is empty except for the column headers, which include the following fields:

* audio_channel_number
* UTC_offset
* sampling_frequency (in Hz)
* bit_depth 
* mooring_platform_name
* recorder_type
* recorder_SN
* hydrophone_model
* hydrophone_SN
* hydrophone_depth
* location_name
* location_lat
* location_lon
* location_water_depth
* deployment_ID
* deployment_date
* recovery_date
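
Because the file is filled in by hand, it can help to check it before use. The sketch below is not part of ecosound: it uses only the csv module to report columns that are missing and fields that were left empty. The function name and the placeholder values are hypothetical, and the column for the sampling frequency is assumed to be named sampling_frequency (without the unit).

```python
import csv
import io

EXPECTED_FIELDS = [
    "audio_channel_number", "UTC_offset", "sampling_frequency", "bit_depth",
    "mooring_platform_name", "recorder_type", "recorder_SN",
    "hydrophone_model", "hydrophone_SN", "hydrophone_depth",
    "location_name", "location_lat", "location_lon", "location_water_depth",
    "deployment_ID", "deployment_date", "recovery_date",
]


def find_problems(csv_text):
    """Return (missing columns, list of (row number, empty field))."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = reader.fieldnames or []
    missing = [f for f in EXPECTED_FIELDS if f not in present]
    empty = []
    for row_number, row in enumerate(reader, start=1):
        for field in EXPECTED_FIELDS:
            if field in present and not (row[field] or "").strip():
                empty.append((row_number, field))
    return missing, empty


# Placeholder row with one field deliberately left empty
values = ["placeholder"] * len(EXPECTED_FIELDS)
values[EXPECTED_FIELDS.index("location_lat")] = ""
csv_text = ",".join(EXPECTED_FIELDS) + "\n" + ",".join(values) + "\n"
print(find_problems(csv_text))  # ([], [(1, 'location_lat')])
```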

This file needs to be filled in by the user with the appropriate deployment information. Once filled in, it is read by annot.insert_metadata in the load_raven_table function defined above, which adds the deployment metadata to the annotations.

## Cleaning annotations

Now we go through the modified annotations, add the associated metadata, correct inconsistencies in the annotation labels, and save the result as a NetCDF file and a Raven file.

First, define the paths to the folders containing the raw annotations and the audio files for this deployment:

In [42]:
root_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-GA-201612-CH6'
annotation_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-NEFSC-GA-201612-CH6\old_format\tables_with_added_fields'
audio_dir = r'Z:\ACOUSTIC_DATA\BOTTOM_MOUNTED\NEFSC_GA\NEFSC_GA_201611\Brunswick'
deployment_file = r'deployment_info.csv' 


Now we can load and format the manual annotations for this dataset and add the metadata. 

In [44]:
annot = Annotation()
annot_files = list_files(annotation_dir,'.txt',recursive=False,case_sensitive=True)
for annot_file in annot_files:
    annot_tmp = load_raven_table(root_dir,audio_dir,annot_file,deployment_file)
    annot = annot + annot_tmp
annot.check_integrity()
annot

Duplicate entries removed: 0
Integrity test succesfull
274 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
257 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
219 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
198 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
417 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
309 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
326 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
309 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
393 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
293 annotations imported.


Annotation object (2995)
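
As a quick sanity check, the per-table counts printed above should add up to the size of the combined Annotation object. The snippet below simply redoes that arithmetic with the ten counts copied from the output.

```python
# Number of annotations imported from each Raven table (copied from the output above)
counts_per_table = [274, 257, 219, 198, 417, 309, 326, 309, 393, 293]

total = sum(counts_per_table)
print(len(counts_per_table), total)  # 10 2995
```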

Let's look at the different annotation labels that were used:

In [45]:
print(annot.get_labels_class())

['MW']


Now let's look at a summary of all the annotations available in this dataset.

In [46]:
# print summary (pivot table)
print(annot.summary())

label_class                MW  Total
deployment_ID                       
USA-NEFSC-GA-201611-CH6  2995   2995
Total                    2995   2995


The dataset can now be saved as a Raven annotation file and a NetCDF4 file:

In [47]:
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(root_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)