# Creation of the annotation dataset

This notebook describes the steps involved in gathering, cleaning, and merging all manual annotations made by various analysist from various groups and using different software in a single large dataset.

## Data cleaning

Annotations are for each datasets are loaded, sorted, and re-writen into a parquet file. First, ecosound and the other linbraries need to be imported.

In [1]:
import sys
sys.path.append("..")  # Adds higher directory to python modules path.
import os
from ecosound.core.annotation import Annotation
from ecosound.core.metadata import DeploymentInfo

### Dataset 1: DFO - Snake Island RCA-In

Definition of all the paths of all folders with the raw annotation and audio files for this deployment.

In [2]:
root_dir = r'C:\Users\xavier.mouy\Documents\PhD\Projects\Dectector\datasets\DFO_snake-island_rca-in_20181017'
deployment_file = r'deployment_info.csv' 
annotation_dir = r'manual_annotations'
data_dir = r'audio_data'

Instantiate a DeploymentInfo object to handle metadata for the deployment, and create an empty deployment info file.

In [3]:
# Instantiate
Deployment = DeploymentInfo()

In [None]:
# write empty file to fill in (do once only)
Deployment.write_template(os.path.join(root_dir, deployment_file))

A csv file "deployment_info.csv" has now been created in the root_dir. It is empty and only has column headers, and includes teh following fiilds:

* audio_channel_number
* UTC_offset
* sampling_frequency (in Hz)
* bit_depth 
* mooring_platform_name
* recorder_type
* recorder_SN
* hydrophone_model
* hydrophone_SN
* hydrophone_depth
* location_name
* location_lat
* location_lon
* location_water_depth
* deployment_ID
* deployment_date
* recovery_date

This file needs to be filled in by the user with teh appropriate deployment information. Once fileld in, the file can be loaded using the Deployment object:

In [4]:
# load deployment file
deployment_info = Deployment.read(os.path.join(root_dir, deployment_file))
deployment_info

Unnamed: 0,audio_channel_number,UTC_offset,sampling_frequency,bit_depth,mooring_platform_name,recorder_type,recorder_SN,hydrophone_model,hydrophone_SN,hydrophone_depth,location_name,location_lat,location_lon,location_water_depth,deployment_ID,deployment_date,recovery_date
0,1,-8,48000,24,bottom weight,SoundTrap 300,67674121,SoundTrap 300,67674121,13.4,Snake Island RCA-In,49.211667,-123.88405,13.4,SI-RCAIn-20181017,20181016T103806,20181203T120816


Now we can load the manual annotations for this dataset. Here annotatiosn were performed with Raven:

In [5]:
# load all annotations
annot = Annotation()
annot.from_raven(os.path.join(root_dir, annotation_dir),
                 class_header='Class',
                 subclass_header='Sound type',
                 verbose=True)

28 annotation files found.
Duplicate entries removed: 0
Integrity test succesfull
13016 annotations imported.


Now we can fill in all the missing information in teh annotations field with the deployment information:

In [6]:
# Manually fill in missing information
annot.insert_values(software_version='1.5',
                    operator_name='Stephanie Archer',
                    audio_channel=deployment_info.audio_channel_number[0],
                    UTC_offset=deployment_info.UTC_offset[0],
                    audio_file_dir=os.path.join(root_dir, data_dir),
                    audio_sampling_frequency=deployment_info.sampling_frequency[0],
                    audio_bit_depth=deployment_info.bit_depth[0],
                    mooring_platform_name=deployment_info.mooring_platform_name[0],
                    recorder_type=deployment_info.recorder_type[0],
                    recorder_SN=deployment_info.recorder_SN[0],
                    hydrophone_model=deployment_info.hydrophone_model[0],
                    hydrophone_SN=deployment_info.hydrophone_SN[0],
                    hydrophone_depth=deployment_info.hydrophone_depth[0],
                    location_name = deployment_info.location_name[0],
                    location_lat = deployment_info.location_lat[0],
                    location_lon = deployment_info.location_lon[0],
                    location_water_depth = deployment_info.location_water_depth[0],
                    deployment_ID=deployment_info.deployment_ID[0],
                    )

Let's look at the different annotation labels that were used:

In [7]:
print(annot.get_labels_class())

['unkown_invert', 'unknown_invert', 'fish', 'fish?', 'unknown', 'unkown', nan, 'whale?', '?', 'sea lion?', 'airplane', 'mammal']


It is clear that there are some inconsistencies in the label names (e.g. 'unknown', 'unkown'). Let's rename the class labels so everything has a consistent same name. We'll use teh following convention:
* 'FS' for fish
* 'UN' for unknown sound
* 'KW' for killer whale
* 'ANT' for anthropogenic sound
* 'HS' for harbor seal

In [8]:
annot.data['label_class'].replace(to_replace=['fish'], value='FS', inplace=True)
annot.data['label_class'].replace(to_replace=['fish?','unkown_invert','unknown_invert','fish?','unknown','unkown','whale?','?','sea lion?','mammal'], value='UN', inplace=True)
annot.data['label_class'].replace(to_replace=['airplane'], value='ANT', inplace=True)
annot.data['label_class'].dropna(axis=0, inplace=True)

Let's check that the class label are now all consistent.

In [9]:
print(annot.get_labels_class())

['UN', 'FS', 'ANT']


Now, having a look a summary of all the annotations available in this dataset.

In [10]:
# print summary (pivot table)
print(annot.summary())

label_class        ANT     FS   UN  Total
deployment_ID                            
SI-RCAIn-20181017    2  12337  672  13011
Total                2  12337  672  13011


Now that all the metadata (deployment information) are filled in the annotation fields and that all labels have been "cleaned up", we can save the dataset as a parquet file.

In [11]:
#annot.to_parquet(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.parquet'))
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.nc'))

The dataset can also be save as a Raven or PAMlab annotation file.

In [12]:
annot.to_pamlab(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +' annotations.log', single_file=True)
annot.to_raven(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +'.Table.1.selections.txt', single_file=True)

### Dataset 2: DFO - Snake Island RCA-Out

Now we can repeat the step above for all the other datasets:

In [13]:
root_dir = r'C:\Users\xavier.mouy\Documents\PhD\Projects\Dectector\datasets\DFO_snake-island_rca-out_20181015'
deployment_file = r'deployment_info.csv' 
annotation_dir = r'manual_annotations'
data_dir = r'audio_data'

# Instantiate
Deployment = DeploymentInfo()

# write empty file to fill in (do once only)
#Deployment.write_template(os.path.join(root_dir, deployment_file))

# load deployment file
deployment_info = Deployment.read(os.path.join(root_dir, deployment_file))

# load all annotations
annot = Annotation()
annot.from_raven(os.path.join(root_dir, annotation_dir),
                  class_header='Class',
                  subclass_header='Sound type',
                  verbose=True)

# Manually fill in missing information
annot.insert_values(software_version='1.5',
                    operator_name='Stephanie Archer',
                    audio_channel=deployment_info.audio_channel_number[0],
                    UTC_offset=deployment_info.UTC_offset[0],
                    audio_file_dir=os.path.join(root_dir, data_dir),
                    audio_sampling_frequency=deployment_info.sampling_frequency[0],
                    audio_bit_depth=deployment_info.bit_depth[0],
                    mooring_platform_name=deployment_info.mooring_platform_name[0],
                    recorder_type=deployment_info.recorder_type[0],
                    recorder_SN=deployment_info.recorder_SN[0],
                    hydrophone_model=deployment_info.hydrophone_model[0],
                    hydrophone_SN=deployment_info.hydrophone_SN[0],
                    hydrophone_depth=deployment_info.hydrophone_depth[0],
                    location_name = deployment_info.location_name[0],
                    location_lat = deployment_info.location_lat[0],
                    location_lon = deployment_info.location_lon[0],
                    location_water_depth = deployment_info.location_water_depth[0],
                    deployment_ID=deployment_info.deployment_ID[0],
                    )

28 annotation files found.
Duplicate entries removed: 0
Integrity test succesfull
1932 annotations imported.


Some inconsistent class labels here as well:

In [14]:
print(annot.get_labels_class())

['fish', 'fish?', 'fush', '?']


Fixing labels according to our naming convention:

In [15]:
annot.data['label_class'].replace(to_replace=['fish','fish?','fush'], value='FS', inplace=True)
annot.data['label_class'].replace(to_replace=['?'], value='UN', inplace=True)
annot.data['label_class'].dropna(axis=0, inplace=True)
print(annot.get_labels_class())

['FS', 'UN']


Summary:

In [16]:
# print summary (pivot table)
print(annot.summary())

label_class           FS  UN  Total
deployment_ID                      
SI-RCAOut-20181015  1909  23   1932
Total               1909  23   1932


Saving the cleaned up dataset:

In [17]:
# save as parquet file
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.nc'))
#annot.to_parquet(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.parquet'))
annot.to_pamlab(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +' annotations.log', single_file=True)
annot.to_raven(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +'.Table.1.selections.txt', single_file=True)

### Dataset 3: ONC - Delta Node 2014

Repeating the same steps as teh prvious dataset. The difference here is that the annotations were performed with PAMlab instead of Raven.

In [18]:
root_dir = r'C:\Users\xavier.mouy\Documents\PhD\Projects\Dectector\datasets\ONC_delta-node_2014'
deployment_file = r'deployment_info.csv' 
annotation_dir = r'manual_annotations'
data_dir = r'audio_data'

# Instantiate
Deployment = DeploymentInfo()

# write empty file to fill in (do once only)
#Deployment.write_template(os.path.join(root_dir, deployment_file))

# # load deployment file
deployment_info = Deployment.read(os.path.join(root_dir, deployment_file))

# # load all annotations
annot = Annotation()
annot.from_pamlab(os.path.join(root_dir, annotation_dir), verbose=True)

# Mnaually fill in missing information
annot.insert_values(software_version='6.2.2',
                    operator_name='Xavier Mouy',
                    audio_channel=deployment_info.audio_channel_number[0],
                    UTC_offset=deployment_info.UTC_offset[0],
                    audio_file_dir=os.path.join(root_dir, data_dir),
                    audio_sampling_frequency=deployment_info.sampling_frequency[0],
                    audio_bit_depth=deployment_info.bit_depth[0],
                    mooring_platform_name=deployment_info.mooring_platform_name[0],
                    recorder_type=deployment_info.recorder_type[0],
                    recorder_SN=deployment_info.recorder_SN[0],
                    hydrophone_model=deployment_info.hydrophone_model[0],
                    hydrophone_SN=deployment_info.hydrophone_SN[0],
                    hydrophone_depth=deployment_info.hydrophone_depth[0],
                    location_name=deployment_info.location_name[0],
                    location_lat=deployment_info.location_lat[0],
                    location_lon=deployment_info.location_lon[0],
                    location_water_depth=deployment_info.location_water_depth[0],
                    deployment_ID=deployment_info.deployment_ID[0],
                    )

47 annotation files found.
Duplicate entries removed: 0
Integrity test succesfull
857 annotations imported.


No inconsistent class labels this time:

In [19]:
print(annot.get_labels_class())

['FS']


Summary:

In [20]:
# print summary (pivot table)
print(annot.summary())

label_class      FS  Total
deployment_ID             
ONC-Delta-2014  857    857
Total           857    857


Saving the cleaned up dataset:

In [21]:
# save as parquet file
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.nc'))
#annot.to_parquet(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.parquet'))
annot.to_pamlab(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +' annotations.log', single_file=True)
annot.to_raven(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +'.Table.1.selections.txt', single_file=True)

### Dataset 4: UVIC - Hornby Island

We can repeat the step above for all the other datasets:

In [22]:
root_dir = r'C:\Users\xavier.mouy\Documents\PhD\Projects\Dectector\datasets\UVIC_hornby-island_2019'
deployment_file = r'deployment_info.csv' 
annotation_dir = r'manual_annotations'
data_dir = r'audio_data'

# Instantiate
Deployment = DeploymentInfo()

# write empty file to fill in (do once only)
#Deployment.write_template(os.path.join(root_dir, deployment_file))

# load deployment file
deployment_info = Deployment.read(os.path.join(root_dir, deployment_file))

# load all annotations
annot = Annotation()
annot.from_raven(os.path.join(root_dir, annotation_dir), verbose=True)

# Mnaually fill in missing information
annot.insert_values(software_version='1.5',
                    operator_name='Emie Woodburn',
                    audio_channel=deployment_info.audio_channel_number[0],
                    UTC_offset=deployment_info.UTC_offset[0],
                    audio_file_dir=os.path.join(root_dir, data_dir),
                    audio_sampling_frequency=deployment_info.sampling_frequency[0],
                    audio_bit_depth=deployment_info.bit_depth[0],
                    mooring_platform_name=deployment_info.mooring_platform_name[0],
                    recorder_type=deployment_info.recorder_type[0],
                    recorder_SN=deployment_info.recorder_SN[0],
                    hydrophone_model=deployment_info.hydrophone_model[0],
                    hydrophone_SN=deployment_info.hydrophone_SN[0],
                    hydrophone_depth=deployment_info.hydrophone_depth[0],
                    location_name = deployment_info.location_name[0],
                    location_lat = deployment_info.location_lat[0],
                    location_lon = deployment_info.location_lon[0],
                    location_water_depth = deployment_info.location_water_depth[0],
                    deployment_ID=deployment_info.deployment_ID[0],
                    )

47 annotation files found.
Duplicate entries removed: 21162
Integrity test succesfull
21162 annotations imported.


Some inconsistent class labels :

In [23]:
print(annot.get_labels_class())

['FS', 'Seal', 'Unknown', nan, ' FS', 'KW', 'KW ', 'Seal\\', ' ', 'FSFS', 'Chirp', '  ']


Fixing labels according to our naming convention:

In [24]:
annot.data['label_class'].replace(to_replace=['FSFS',' FS'], value='FS', inplace=True)
annot.data['label_class'].replace(to_replace=['KW '], value='KW', inplace=True)
annot.data['label_class'].replace(to_replace=['Seal','Seal\\'], value='HS', inplace=True)
annot.data['label_class'].replace(to_replace=['Unknown','Chirp',' ','  '], value='UN', inplace=True)
annot.data['label_class'].dropna(axis=0, inplace=True)
print(annot.get_labels_class())

['FS', 'HS', 'UN', 'KW']


Summary:

In [25]:
# print summary (pivot table)
print(annot.summary())

label_class       FS  HS  KW  UN  Total
deployment_ID                          
07-HI          21002  33  27  93  21155
Total          21002  33  27  93  21155


Saving the cleaned up dataset:

In [26]:
# save as parquet file
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.nc'))
#annot.to_parquet(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.parquet'))
annot.to_pamlab(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +' annotations.log', single_file=True)
annot.to_raven(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +'.Table.1.selections.txt', single_file=True)

### Dataset 5: UVIC - Mill Bay

We can repeat the step above for all the other datasets:

In [27]:
root_dir = r'C:\Users\xavier.mouy\Documents\PhD\Projects\Dectector\datasets\UVIC_mill-bay_2019'
deployment_file = r'deployment_info.csv' 
annotation_dir = r'manual_annotations'
data_dir = r'audio_data'

# Instantiate
Deployment = DeploymentInfo()

# write empty file to fill in (do once only)
#Deployment.write_template(os.path.join(root_dir, deployment_file))

# load deployment file
deployment_info = Deployment.read(os.path.join(root_dir, deployment_file))

# load all annotations
annot = Annotation()
annot.from_raven(os.path.join(root_dir, annotation_dir),
                 class_header='Sound Type',
                 verbose=True)

# Mnaually fill in missing information
annot.insert_values(software_version='1.5',
                    operator_name='Courtney Evin',
                    audio_channel=deployment_info.audio_channel_number[0],
                    UTC_offset=deployment_info.UTC_offset[0],
                    audio_file_dir=os.path.join(root_dir, data_dir),
                    audio_sampling_frequency=deployment_info.sampling_frequency[0],
                    audio_bit_depth=deployment_info.bit_depth[0],
                    mooring_platform_name=deployment_info.mooring_platform_name[0],
                    recorder_type=deployment_info.recorder_type[0],
                    recorder_SN=deployment_info.recorder_SN[0],
                    hydrophone_model=deployment_info.hydrophone_model[0],
                    hydrophone_SN=deployment_info.hydrophone_SN[0],
                    hydrophone_depth=deployment_info.hydrophone_depth[0],
                    location_name = deployment_info.location_name[0],
                    location_lat = deployment_info.location_lat[0],
                    location_lon = deployment_info.location_lon[0],
                    location_water_depth = deployment_info.location_water_depth[0],
                    deployment_ID=deployment_info.deployment_ID[0],
                    )

48 annotation files found.
Duplicate entries removed: 4058
Integrity test succesfull
4058 annotations imported.


Some inconsistent class labels :

In [28]:
print(annot.get_labels_class())

['FS', 'HS', 'unknown-mammal?', 'unknown', nan, 'unknown-invert', 'fs', 'F', 'SF']


Fixing labels according to our naming convention:

In [29]:
annot.data['label_class'].replace(to_replace=['fs','F','SF'], value='FS', inplace=True)
annot.data['label_class'].replace(to_replace=['unknown-mammal?','unknown','unknown-invert'], value='UN', inplace=True)
annot.data['label_class'].dropna(axis=0, inplace=True)
print(annot.get_labels_class())

['FS', 'HS', 'UN']


Summary:

In [30]:
# print summary (pivot table)
print(annot.summary())

label_class      FS  HS  UN  Total
deployment_ID                     
06-MILL        3987  49  17   4053
Total          3987  49  17   4053


Saving the cleaned up dataset:

In [31]:
# save as parquet file
annot.to_netcdf(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.nc'))
#annot.to_parquet(os.path.join(root_dir, 'Annotations_dataset_' + deployment_info.deployment_ID[0] + '.parquet'))
annot.to_pamlab(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +' annotations.log', single_file=True)
annot.to_raven(root_dir, outfile='Annotations_dataset_' + deployment_info.deployment_ID[0] +'.Table.1.selections.txt', single_file=True)

# Merging all datasets together

Now that all our datasets are cleaned up, we can merge them all in a single Master annotation dataset.

Defining the path of each dataset:

In [32]:
root_dir = r'C:\Users\xavier.mouy\Documents\PhD\Projects\Dectector\datasets'
dataset_files = ['UVIC_mill-bay_2019\Annotations_dataset_06-MILL.nc',
                 'UVIC_hornby-island_2019\Annotations_dataset_07-HI.nc',
                 'ONC_delta-node_2014\Annotations_dataset_ONC-Delta-2014.nc',
                 'DFO_snake-island_rca-in_20181017\Annotations_dataset_SI-RCAIn-20181017.nc',
                 'DFO_snake-island_rca-out_20181015\Annotations_dataset_SI-RCAOut-20181015.nc',
                ]

Looping through each dataset and merging in to a master dataset:

In [33]:
# # load all annotations
annot = Annotation()
for file in dataset_files:
    tmp = Annotation()
    tmp.from_netcdf(os.path.join(root_dir, file), verbose=True)
    annot = annot + tmp

Duplicate entries removed: 0
Integrity test succesfull
4058 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
21162 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
857 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
13016 annotations imported.
Duplicate entries removed: 0
Integrity test succesfull
1932 annotations imported.


Now we can see a summary of all the annotatiosn we have:

In [34]:
# print summary (pivot table)
print(annot.summary())

label_class             ANT     FS  HS  KW   UN  Total
deployment_ID                                         
06-MILL              5    0   3987  49   0   17   4058
07-HI                7    0  21002  33  27   93  21162
ONC-Delta-2014       0    0    857   0   0    0    857
SI-RCAIn-20181017    5    2  12337   0   0  672  13016
SI-RCAOut-20181015   0    0   1909   0   0   23   1932
Total               17    2  40092  82  27  805  41025


We can also look at the contribution from each analyst:

In [35]:
print(annot.summary(rows='operator_name'))

label_class           ANT     FS  HS  KW   UN  Total
operator_name                                       
Courtney Evin      5    0   3987  49   0   17   4058
Emie Woodburn      7    0  21002  33  27   93  21162
Stephanie Archer   5    2  14246   0   0  695  14948
Xavier Mouy        0    0    857   0   0    0    857
Total             17    2  40092  82  27  805  41025


Finally we can save our Master annotation dataset. It will be used for trainning and evealuation classification models.

In [36]:
#annot.to_parquet(os.path.join(root_dir, 'Master_annotations_dataset.parquet'))
annot.to_netcdf(os.path.join(root_dir, 'Master_annotations_dataset.nc'))