# Creation of the annotation dataset
This notebook describes the steps involved in gathering, cleaning, and merging all manual annotations, made by various analysts from various groups using different software, into a single large dataset.

Objectives of this notebook:

- make annotation labels consistent (standard name)
- add metadata for proper repartition of train/test/evaluation datasets



## Description of the data and manual annotations

-- TO DO -- 

Describe dataset and annotations

- ref to Popescu's paper
- screenshot of folder

## Adding annotation fields in Raven

-- TO DO --

- screenshot of Raven with entire dataset loaded
- screenshot of added fields: file offset, Begin file, Begin path 

First, ecosound and the other libraries need to be imported.

## Import libraries and define functions used throughout

In [1]:
import os
from ecosound.core.annotation import Annotation
from ecosound.core.metadata import DeploymentInfo
import pandas as pd
from datetime import datetime
import re
import uuid

def get_datetime_from_filename(filename):
    time_format = "%Y%m%d_%H%M%S"
    file_offset_s_regex = "_[0-9]+s"
    file_offset_ms_regex = "_[0-9]+s"
    file_orig_regex = "_[0-9]{8}_[0-9]{6}_"    
    # first part - date/time of original audio file
    p1 = re.compile(file_orig_regex)
    datestr_1 = p1.search(filename)
    date = datetime.strptime(datestr_1[0][1:-1],time_format)    
    ## second part - nb of seconds
    #p1 = re.compile(file_orig_regex)
    #datestr_1 = p1.search(df['Begin File'].iloc[0])
    #date = datetime.strptime(datestr_1[0][1:-1],time_format)    
    return date   

def load_raven_table(root_dir,audio_dir,annotation_file,deployment_file):
    ## load Raven annotations
    df = pd.read_csv(os.path.join(root_dir, annotation_file), sep='\t')
    df = df[df['View']== 'Spectrogram 1'] # remove all "waveform" rows (redundant with the "Spectrogram" ones)
    df = df.reset_index(drop=True)    
    ## find out start date/time for each audio file
    files_date=df['Begin File'].apply(get_datetime_from_filename)
    # Definition of start and stop time offsets of annotations (relative to start of each audio file)
    duration = df['End Time (s)']-df['Begin Time (s)']
    start_offset = df['File Offset (s)']
    end_offset = start_offset + duration
    ## Populate annotation object
    annot = Annotation()
    annot.data['audio_file_start_date'] = files_date
    annot.data['audio_channel'] = df['Channel']-1
    annot.data['audio_file_name'] = df['Begin File'].apply(lambda x: os.path.splitext(os.path.basename(x))[0])
    annot.data['audio_file_dir'] = audio_dir
    annot.data['audio_file_extension'] = df['Begin Path'].apply(lambda x: os.path.splitext(x)[1])
    annot.data['time_min_offset'] = start_offset
    annot.data['time_max_offset'] = end_offset
    annot.data['time_min_date'] = pd.to_datetime(annot.data['audio_file_start_date'] + pd.to_timedelta(annot.data['time_min_offset'], unit='s'))
    annot.data['time_max_date'] = pd.to_datetime(annot.data['audio_file_start_date'] + pd.to_timedelta(annot.data['time_max_offset'], unit='s'))
    annot.data['frequency_min'] = df['Low Freq (Hz)']
    annot.data['frequency_max'] = df['High Freq (Hz)']    
    annot.data['label_class'] = df['tags']
    annot.data['from_detector'] = False
    annot.data['software_name'] = 'raven'
    annot.data['uuid'] = annot.data.apply(lambda _: str(uuid.uuid4()), axis=1)
    annot.data['duration'] = annot.data['time_max_offset'] - annot.data['time_min_offset']
    annot.insert_metadata(os.path.join(root_dir, deployment_file)) # insert metadata
    annot.check_integrity(verbose=True, ignore_frequency_duplicates=True) # check integrity
    print(len(annot), 'annotations imported.')
    return annot
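
To illustrate how `get_datetime_from_filename` and the offset arithmetic in `load_raven_table` work, here is a minimal, standalone sketch using only the standard library. The file name and the offset values are made up for illustration; only the `_YYYYMMDD_HHMMSS_` pattern is taken from the function above. The helper name `parse_file_start` is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Same pattern as in get_datetime_from_filename: _YYYYMMDD_HHMMSS_
FILE_ORIG_REGEX = re.compile(r"_[0-9]{8}_[0-9]{6}_")
TIME_FORMAT = "%Y%m%d_%H%M%S"


def parse_file_start(filename):
    """Return the start date/time encoded in an audio file name."""
    match = FILE_ORIG_REGEX.search(filename)
    if match is None:
        raise ValueError(f"No timestamp found in file name: {filename}")
    # Strip the leading and trailing underscores before parsing
    return datetime.strptime(match[0][1:-1], TIME_FORMAT)


# Hypothetical file name and annotation values (illustration only)
example_file = "NOPP4_20080917_120000_ch01.aif"
file_start = parse_file_start(example_file)
file_offset_s = 12.5  # 'File Offset (s)': annotation start within the file
duration_s = 3.0      # 'End Time (s)' minus 'Begin Time (s)'

# Absolute start/end times of the annotation
time_min_date = file_start + timedelta(seconds=file_offset_s)
time_max_date = time_min_date + timedelta(seconds=duration_s)
print(file_start, time_min_date, time_max_date)
```

Unlike the notebook function, this sketch raises an explicit `ValueError` when a file name does not contain the expected timestamp, which can make malformed file names easier to spot.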

## Create deployment info files with metadata for each deployment

Instantiate a DeploymentInfo object to handle metadata for the deployment, and create an empty deployment info file.

In [2]:
# Instantiate
Deployment = DeploymentInfo()
# write empty file to fill in (do once only)
#Deployment.write_template(os.path.join(root_dir, deployment_file))

A CSV file "deployment_info.csv" has now been created in the root_dir. It contains only column headers, with the following fields:

* audio_channel_number
* UTC_offset
* sampling_frequency (in Hz)
* bit_depth 
* mooring_platform_name
* recorder_type
* recorder_SN
* hydrophone_model
* hydrophone_SN
* hydrophone_depth
* location_name
* location_lat
* location_lon
* location_water_depth
* deployment_ID
* deployment_date
* recovery_date

This file needs to be filled in by the user with the appropriate deployment information. Once filled in, the file is read by the `annot.insert_metadata` call in `load_raven_table`, which adds the deployment metadata to each annotation.
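
Before loading the annotations, it can help to check that no field of the deployment file was left empty. The sketch below is not part of ecosound: `find_unfilled_fields` is a hypothetical helper, and it assumes that the CSV header names match the list above (with the "(in Hz)" note dropped) and that the values sit in the first row under the header. The example values are made up.

```python
import csv
import io

# Field names listed above for deployment_info.csv (assumed header names)
EXPECTED_FIELDS = [
    "audio_channel_number", "UTC_offset", "sampling_frequency", "bit_depth",
    "mooring_platform_name", "recorder_type", "recorder_SN",
    "hydrophone_model", "hydrophone_SN", "hydrophone_depth",
    "location_name", "location_lat", "location_lon", "location_water_depth",
    "deployment_ID", "deployment_date", "recovery_date",
]


def find_unfilled_fields(csv_text):
    """Return expected fields that are missing or empty in the first data row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    row = next(reader, None)
    if row is None:
        # Header-only template: nothing has been filled in yet
        return list(EXPECTED_FIELDS)
    return [name for name in EXPECTED_FIELDS if not (row.get(name) or "").strip()]


# Made-up examples: an empty template and a partly filled file
header = ",".join(EXPECTED_FIELDS)
template_only = header + "\n"
partly_filled = header + "\n" + ",".join(
    ["1", "-4", "2000"] + [""] * (len(EXPECTED_FIELDS) - 3)
) + "\n"

print(find_unfilled_fields(template_only))
print(find_unfilled_fields(partly_filled))
```

For a real file, the text could be read with `open(os.path.join(root_dir, deployment_file)).read()` and passed to the helper.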

## Data cleaning

Annotations for each dataset are loaded, relabelled, and rewritten to a NetCDF file and a Raven selection table.

### Dataset 1: USA-SBNMS-NOPP4-20080917

Define the paths to the folders containing the raw annotation and audio files for this deployment.

In [3]:
root_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-SBNMS-NOPP4-20080917'
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-SBNMS-NOPP4-20080917'
deployment_file = r'deployment_info.csv' 
annotation_file = r'SET01_masterlog_new.txt'

Now we can load and format the manual annotations for this dataset and add the metadata. 

In [4]:
annot = load_raven_table(root_dir,audio_dir,annotation_file,deployment_file)

Duplicate entries removed: 0
Integrity test succesfull
1420 annotations imported.


Let's look at the different annotation labels that were used:

In [5]:
print(annot.get_labels_class())

['Bac_3100', nan]


Let's change 'Bac_3100' to 'MW' (minke whale) and nan to 'NN' (noise):

In [6]:
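# assign the result back instead of using inplace=True on a column selection,
# which may not modify annot.data in recent pandas versions
annot.data['label_class'] = annot.data['label_class'].replace(to_replace=['Bac_3100'], value='MW')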
annot.data.loc[annot.data['label_class'].isnull(),'label_class'] = 'NN'
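
Since the same relabelling is repeated for every dataset, the mapping could be kept in one place. The following is a minimal sketch in plain Python, not the approach used in this notebook: `LABEL_MAP`, `NOISE_LABEL`, and `standardize_label` are hypothetical names, and the raw labels are the ones found in the Raven tables of this notebook.

```python
import math

# Raw labels found in the Raven tables and their standard names
LABEL_MAP = {
    "Bac_3100": "MW",  # minke whale
    "minke_pt": "MW",  # minke whale
}
NOISE_LABEL = "NN"     # used when no label was entered


def standardize_label(raw_label):
    """Map a raw annotation label to its standard name."""
    # Missing labels show up as None or NaN
    if raw_label is None or (isinstance(raw_label, float) and math.isnan(raw_label)):
        return NOISE_LABEL
    if raw_label not in LABEL_MAP:
        # Fail loudly rather than silently keeping an unexpected label
        raise ValueError(f"Unknown label: {raw_label!r}")
    return LABEL_MAP[raw_label]


raw_labels = ["Bac_3100", float("nan"), "minke_pt", None]
standard_labels = [standardize_label(label) for label in raw_labels]
print(standard_labels)
```

With pandas, such a function could be applied to the whole column with `annot.data['label_class'].map(standardize_label)`. Raising an error on unknown labels is a design choice of this sketch: it forces any new label to be added to the mapping explicitly.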

Now let's look at a summary of all the annotations available in this dataset.

In [7]:
# print summary (pivot table)
print(annot.summary())

label_class                 MW   NN  Total
deployment_ID                             
USA-SBNMS-NOPP4-20080917  1139  281   1420
Total                     1139  281   1420


The dataset can now be saved as a Raven annotation file and netcdf4 file:

In [8]:
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(root_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)



### Dataset 2: USA-SBNMS-NOPP5-20090215
Now we can repeat the steps above for all the other datasets:

In [9]:
root_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-SBNMS-NOPP5-20090215'
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\USA-SBNMS-NOPP5-20090215'
deployment_file = r'deployment_info.csv' 
annotation_file = r'SET02_masterlog_new.txt'

In [10]:
annot = load_raven_table(root_dir,audio_dir,annotation_file,deployment_file)

Duplicate entries removed: 0
Integrity test succesfull
591 annotations imported.


In [11]:
print(annot.get_labels_class())

['minke_pt', 'Bac_3100', nan]


In [12]:
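# assign the result back instead of using inplace=True on a column selection,
# which may not modify annot.data in recent pandas versions
annot.data['label_class'] = annot.data['label_class'].replace(to_replace=['Bac_3100', 'minke_pt'], value='MW')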
annot.data.loc[annot.data['label_class'].isnull(),'label_class'] = 'NN'

In [13]:
# print summary (pivot table)
print(annot.summary())

label_class                MW   NN  Total
deployment_ID                            
USA-SBNMS-NOPP5-20090215  432  159    591
Total                     432  159    591


In [14]:
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(root_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)



# Merging all datasets together

Now that all our datasets are cleaned up, we can merge them into a single master annotation dataset.

Defining the path of each dataset:

In [None]:
root_dir = r'C:\Users\xavier.mouy\Documents\PhD\Projects\Dectector\datasets'
dataset_files = [r'UVIC_mill-bay_2019\Annotations_dataset_06-MILL.nc',  # raw strings avoid invalid escape sequences such as '\A'
                 r'UVIC_hornby-island_2019\Annotations_dataset_07-HI.nc',
                 r'ONC_delta-node_2014\Annotations_dataset_ONC-Delta-2014.nc',
                 r'DFO_snake-island_rca-in_20181017\Annotations_dataset_SI-RCAIn-20181017.nc',
                 r'DFO_snake-island_rca-out_20181015\Annotations_dataset_SI-RCAOut-20181015.nc',
                ]

Looping through each dataset and merging into a master dataset:

In [None]:
# # load all annotations
annot = Annotation()
for file in dataset_files:
    tmp = Annotation()
    tmp.from_netcdf(os.path.join(root_dir, file), verbose=True)
    annot = annot + tmp

Now we can see a summary of all the annotations we have:

In [None]:
# print summary (pivot table)
print(annot.summary())

We can also look at the contribution from each analyst:

In [None]:
print(annot.summary(rows='operator_name'))

Finally, we can save our master annotation dataset. It will be used for training and evaluating classification models.

In [None]:
#annot.to_parquet(os.path.join(root_dir, 'Master_annotations_dataset.parquet'))
annot.to_netcdf(os.path.join(root_dir, 'Master_annotations_dataset.nc'))