# Creation of annotations from unannotated noise recordings


## Purpose of this notebook
This notebook describes how to automatically create noise annotations from unannotated noise recordings. It is applied here to data provided by the University of Aberdeen in Scotland.

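Annotations are made by splitting each recording into adjacent annotations of a given duration until the end of the file. If the remainder at the end of the file is shorter than that duration, it is merged into the last annotation. The minimum and maximum frequencies of each annotation are 0 Hz and the Nyquist frequency, respectively.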
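
The splitting rule can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation, which works on NumPy arrays inside `create_noise_annot`; the function name `segment_bounds` is hypothetical. It assumes that a trailing remainder shorter than the requested duration is folded into the previous annotation, and that a file shorter than one annotation yields a single short annotation.

```python
import math


def segment_bounds(file_dur, annot_dur_sec):
    """Return (start, end) offsets in seconds for adjacent annotations."""
    file_dur = float(file_dur)
    n_starts = math.ceil(file_dur / annot_dur_sec)
    starts = [float(i * annot_dur_sec) for i in range(n_starts)]
    ends = starts[1:] + [file_dur]
    # Fold a trailing remainder shorter than annot_dur_sec into the previous annotation
    if len(starts) > 1 and ends[-1] - starts[-1] < annot_dur_sec:
        starts.pop()   # drop the start of the short last annotation
        del ends[-2]   # drop the end that used to close the previous annotation
    return list(zip(starts, ends))


# A 150 s file split into 60 s annotations: the last 30 s are merged into the second one
print(segment_bounds(150.0, 60))  # [(0.0, 60.0), (60.0, 150.0)]
```

With this rule, every annotation lasts at least `annot_dur_sec` seconds whenever the file itself is at least that long.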

## Deployment folders

The data provided were separated into one folder per deployment, which gives 7 folders:

- UK-UAberdeen-MorayFirth-201904_986-110
- UK-UAberdeen-MorayFirth-201904_1027-235
- UK-UAberdeen-MorayFirth-201904_1029-237
- UK-UAberdeen-MorayFirth-202001_1092-112
- UK-UAberdeen-MorayFirth-202001_1093-164
- UK-UAberdeen-MorayFirth-202101_1136-164
- UK-UAberdeen-MorayFirth-202101_1137-112

A deployment_info.csv file was created in each of these folders and contains the metadata for that deployment.
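
Before running the cells below, it can help to confirm that every deployment folder holds its metadata file. The following is a minimal sketch using only the standard library; the helper name `missing_deployment_files` and the `datasets` root folder are placeholders, not part of the notebook.

```python
import os


def missing_deployment_files(root_dir, folders, file_name="deployment_info.csv"):
    """Return the folders under root_dir that do not contain file_name."""
    return [
        folder for folder in folders
        if not os.path.isfile(os.path.join(root_dir, folder, file_name))
    ]


# Two of the deployment folders listed above; "datasets" is a placeholder root folder
folders = [
    "UK-UAberdeen-MorayFirth-201904_986-110",
    "UK-UAberdeen-MorayFirth-201904_1027-235",
]
print(missing_deployment_files("datasets", folders))
```

An empty list means every folder has its metadata file.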

![noise_scotland_folders](attachment:img/noise_scotland_folders.png)


## Import libraries and define functions used throughout

In [13]:
from ecosound.core.annotation import Annotation
from ecosound.core.metadata import DeploymentInfo
from ecosound.core.audiotools import Sound
from ecosound.core.tools import filename_to_datetime
import os
import pandas as pd
import numpy as np
import uuid
from datetime import datetime

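def create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass):
    """Create adjacent noise annotations of annot_dur_sec seconds for every audio file in audio_dir."""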
    files_list = os.listdir(audio_dir)
    annot_stack = []
    for file in files_list:
        if file.endswith(file_ext):
            print(file)
            # retrieve file start date and time
            file_timestamp = filename_to_datetime(file)

            # retrieve file duration
            audio = Sound(os.path.join(audio_dir, file))
            file_dur = audio.file_duration_sec

            # define annotation start and end times (relative to the beginning of the audio file)
            t1 = np.arange(0, file_dur, annot_dur_sec)
            t2 = t1[1:]
            t2 = np.append(t2, file_dur)
            # if the last annotation is shorter than annot_dur_sec, merge it into the previous one
            # (skipped when the file holds a single annotation, since there is nothing to merge into)
            if len(t1) > 1 and t2[-1]-t1[-1] < annot_dur_sec:
                t1 = np.delete(t1, -1)
                t2 = np.delete(t2, -2)

            # create the annotation object
            annot = Annotation()

            annot.data['time_min_offset'] = t1
            annot.data['time_max_offset'] = t2
            annot.insert_values(audio_file_start_date=file_timestamp[0])
            annot.data['time_min_date'] = pd.to_datetime(
                annot.data['audio_file_start_date'] + pd.to_timedelta(
                    annot.data['time_min_offset'], unit='s'))
            annot.data['time_max_date'] = pd.to_datetime(
                annot.data['audio_file_start_date'] +
                pd.to_timedelta(annot.data['time_max_offset'], unit='s'))
            annot.insert_values(audio_channel=1)
            annot.insert_values(audio_file_name=os.path.splitext(os.path.basename(file))[0])
            annot.insert_values(audio_file_dir=audio_dir)
            annot.insert_values(audio_file_extension=os.path.splitext(file)[1])
            annot.insert_values(frequency_min=0)
            annot.insert_values(software_version=0)
            annot.insert_values(operator_name='xavier')
            annot.insert_values(entry_date=datetime.now())
            annot.insert_values(frequency_max=audio.file_sampling_frequency/2)
            annot.insert_values(label_class=label_class)
            annot.insert_values(label_subclass=label_subclass)
            annot.insert_values(from_detector=False)
            annot.insert_values(software_name='custom_python')
            annot.data['uuid'] = annot.data.apply(lambda _: str(uuid.uuid4()), axis=1)
            annot.data['duration'] = annot.data['time_max_offset'] - annot.data['time_min_offset']
            # add metadata
            annot.insert_metadata(os.path.join(audio_dir, deployment_file))
            # stack annotations for each file
            annot_stack.append(annot)
            # check that everything looks fine
            annot.check_integrity(verbose=False, ignore_frequency_duplicates=True)

    # concatenate all annotations
    annot_concat = annot_stack[0]
    for an_idx in range(1, len(annot_stack)):
        annot_concat = annot_concat + annot_stack[an_idx]
    annot_concat.check_integrity(verbose=False, ignore_frequency_duplicates=True)
    return annot_concat

### Dataset 1: UK-UAberdeen-MorayFirth-201904_986-110

Define the audio folder, the deployment metadata file, and the annotation parameters for this deployment.

In [31]:
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\UK-UAberdeen-MorayFirth-201904_986-110'
deployment_file = r'deployment_info.csv' 
file_ext = 'wav'

annot_dur_sec = 60  # duration of the noise annotations in seconds
label_class = 'NN'  # label to use for the noise class
label_subclass = '' # label to use for the noise subclass (if needed, e.g. S for seismic airguns)

In [32]:
annot = create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass)

Depl986_1678036995.190402110017.wav
Depl986_1678036995.190406225930.wav
Depl986_1678036995.190410165901.wav
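
The file names printed above appear to end with a 12-digit `yymmddHHMMSS` timestamp. In this notebook the start time of each file is obtained with `filename_to_datetime` from ecosound; the sketch below only illustrates the assumed naming layout with the standard library. The helper name `parse_file_timestamp` and the regular expression are assumptions, not part of ecosound.

```python
import re
from datetime import datetime


def parse_file_timestamp(file_name):
    """Parse the trailing yymmddHHMMSS field of a file name (assumed layout)."""
    match = re.search(r"\.(\d{12})\.wav$", file_name)
    if match is None:
        raise ValueError(f"No 12-digit timestamp found in {file_name!r}")
    return datetime.strptime(match.group(1), "%y%m%d%H%M%S")


print(parse_file_timestamp("Depl986_1678036995.190402110017.wav"))  # 2019-04-02 11:00:17
```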


Let's look at the summary of annotations that were created:

In [33]:
annot.summary()

| deployment_ID | label_class NN | Total |
|---|---|---|
| UK-UAberdeen-MorayFirth-201904_986-110 | 90 | 90 |
| Total | 90 | 90 |


The dataset can now be saved as a NetCDF4 file and a Raven annotation file:

In [None]:
annot.to_netcdf(os.path.join(audio_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(audio_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)

Here is what the annotations look like in Raven:

![noiseScotland.png](attachment:noiseScotland.png)


### Dataset 2: UK-UAberdeen-MorayFirth-201904_1027-235

Define the audio folder, the deployment metadata file, and the annotation parameters for this deployment.

In [34]:
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\UK-UAberdeen-MorayFirth-201904_1027-235'
deployment_file = r'deployment_info.csv' 
file_ext = 'wav'

annot_dur_sec = 60  # duration of the noise annotations in seconds
label_class = 'NN'  # label to use for the noise class
label_subclass = '' # label to use for the noise subclass (if needed, e.g. S for seismic airguns)

In [35]:
annot = create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass)

Depl1027_1677725722.190403115956.wav
Depl1027_1677725722.190411055855.wav
Depl1027_1677725722.190415235822.wav


Let's look at the summary of annotations that were created:

In [36]:
annot.summary()

| deployment_ID | label_class NN | Total |
|---|---|---|
| UK-UAberdeen-MorayFirth-201904_1027-235 | 90 | 90 |
| Total | 90 | 90 |


The dataset can now be saved as a NetCDF4 file and a Raven annotation file:

In [6]:
annot.to_netcdf(os.path.join(audio_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(audio_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)

### Dataset 3: UK-UAberdeen-MorayFirth-201904_1029-237

Define the audio folder, the deployment metadata file, and the annotation parameters for this deployment.

In [7]:
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\UK-UAberdeen-MorayFirth-201904_1029-237'
deployment_file = r'deployment_info.csv' 
file_ext = 'wav'

annot_dur_sec = 60  # duration of the noise annotations in seconds
label_class = 'NN'  # label to use for the noise class
label_subclass = '' # label to use for the noise subclass (if needed, e.g. S for seismic airguns)

In [8]:
annot = create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass)

Depl1029_134541352.190403235927.wav
Depl1029_134541352.190404175922.wav
Depl1029_134541352.190409115847.wav


Let's look at the summary of annotations that were created:

In [9]:
annot.summary()

| deployment_ID | label_class NN | Total |
|---|---|---|
| UK-UAberdeen-MorayFirth-201904_1029-237 | 90 | 90 |
| Total | 90 | 90 |


The dataset can now be saved as a NetCDF4 file and a Raven annotation file:

In [10]:
annot.to_netcdf(os.path.join(audio_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(audio_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)

### Dataset 4: UK-UAberdeen-MorayFirth-202001_1092-112 (seismic)

Define the audio folder, the deployment metadata file, and the annotation parameters for this deployment. The noise subclass is set to S because these recordings contain seismic airguns.

In [11]:
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\UK-UAberdeen-MorayFirth-202001_1092-112'
deployment_file = r'deployment_info.csv' 
file_ext = 'wav'

annot_dur_sec = 60  # duration of the noise annotations in seconds
label_class = 'NN'  # label to use for the noise class
label_subclass = 'S' # label to use for the noise subclass (if needed, e.g. S for seismic airguns)

In [14]:
annot = create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass)

Depl1092_1678036995.200101014914.wav
Depl1092_1678036995.200104224914.wav
Depl1092_1678036995.200104234914.wav
Depl1092_1678036995.200111084914.wav
Depl1092_1678036995.200119004914.wav
Depl1092_1678036995.200119034914.wav
Depl1092_1678036995.200121014914.wav
Depl1092_1678036995.200121214914.wav
Depl1092_1678036995.200124014914.wav
Depl1092_1678036995.200124164914.wav
Depl1092_1678036995.200125184914.wav
Depl1092_1678036995.200125214914.wav
Depl1092_1678036995.200128064914.wav
Depl1092_1678036995.200128134914.wav
Depl1092_1678036995.200128144914.wav
Depl1092_1678036995.200201214914.wav
Depl1092_1678036995.200204224914.wav
Depl1092_1678036995.200206004914.wav
Depl1092_1678036995.200213004914.wav
Depl1092_1678036995.200213024914.wav
Depl1092_1678036995.200213084914.wav
Depl1092_1678036995.200226104914.wav
Depl1092_1678036995.200226124914.wav
Depl1092_1678036995.200227004914.wav


Let's look at the summary of annotations that were created:

In [15]:
annot.summary()

| deployment_ID | label_class NN | Total |
|---|---|---|
| UK-UAberdeen-MorayFirth-202001_1092-112 | 216 | 216 |
| Total | 216 | 216 |


The dataset can now be saved as a NetCDF4 file and a Raven annotation file:

In [16]:
annot.to_netcdf(os.path.join(audio_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(audio_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)

### Dataset 5: UK-UAberdeen-MorayFirth-202001_1093-164 (seismic)

Define the audio folder, the deployment metadata file, and the annotation parameters for this deployment. The noise subclass is set to S because these recordings contain seismic airguns.

In [17]:
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\UK-UAberdeen-MorayFirth-202001_1093-164'
deployment_file = r'deployment_info.csv' 
file_ext = 'wav'

annot_dur_sec = 60  # duration of the noise annotations in seconds
label_class = 'NN'  # label to use for the noise class
label_subclass = 'S' # label to use for the noise subclass (if needed, e.g. S for seismic airguns)

In [18]:
annot = create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass)

Depl1093_1677725722.200104205913.wav
Depl1093_1677725722.200110095913.wav
Depl1093_1677725722.200110115913.wav
Depl1093_1677725722.200111205913.wav
Depl1093_1677725722.200119035913.wav
Depl1093_1677725722.200121195913.wav
Depl1093_1677725722.200121235913.wav
Depl1093_1677725722.200123235913.wav
Depl1093_1677725722.200124025913.wav
Depl1093_1677725722.200124165913.wav
Depl1093_1677725722.200126065913.wav
Depl1093_1677725722.200126095913.wav
Depl1093_1677725722.200128135913.wav
Depl1093_1677725722.200130015913.wav
Depl1093_1677725722.200131095913.wav
Depl1093_1677725722.200201185913.wav
Depl1093_1677725722.200202025913.wav
Depl1093_1677725722.200204195913.wav
Depl1093_1677725722.200205085913.wav
Depl1093_1677725722.200205095913.wav
Depl1093_1677725722.200205235913.wav
Depl1093_1677725722.200206015913.wav
Depl1093_1677725722.200213005913.wav
Depl1093_1677725722.200213015913.wav
Depl1093_1677725722.200226235913.wav


Let's look at the summary of annotations that were created:

In [19]:
annot.summary()

| deployment_ID | label_class NN | Total |
|---|---|---|
| UK-UAberdeen-MorayFirth-202001_1093-164 | 225 | 225 |
| Total | 225 | 225 |


The dataset can now be saved as a NetCDF4 file and a Raven annotation file:

In [20]:
annot.to_netcdf(os.path.join(audio_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(audio_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)

### Dataset 6: UK-UAberdeen-MorayFirth-202101_1136-164

Define the audio folder, the deployment metadata file, and the annotation parameters for this deployment.

In [22]:
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\UK-UAberdeen-MorayFirth-202101_1136-164'
deployment_file = r'deployment_info.csv' 
file_ext = 'wav'

annot_dur_sec = 60  # duration of the noise annotations in seconds
label_class = 'NN'  # label to use for the noise class
label_subclass = '' # label to use for the noise subclass (if needed, e.g. S for seismic airguns)

In [23]:
annot = create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass)

Depl1136_1677725722.210102130002.wav
Depl1136_1677725722.210103230002.wav
Depl1136_1677725722.210105030002.wav
Depl1136_1677725722.210105110002.wav
Depl1136_1677725722.210119110002.wav
Depl1136_1677725722.210119180002.wav
Depl1136_1677725722.210208180002.wav
Depl1136_1677725722.210216140002.wav
Depl1136_1677725722.210216170002.wav
Depl1136_1677725722.210217150002.wav
Depl1136_1677725722.210220090002.wav
Depl1136_1677725722.210221010002.wav


Let's look at the summary of annotations that were created:

In [24]:
annot.summary()

| deployment_ID | label_class NN | Total |
|---|---|---|
| UK-UAberdeen-MorayFirth-202101_1136-164 | 108 | 108 |
| Total | 108 | 108 |


The dataset can now be saved as a NetCDF4 file and a Raven annotation file:

In [25]:
annot.to_netcdf(os.path.join(audio_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(audio_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)

### Dataset 7: UK-UAberdeen-MorayFirth-202101_1137-112

Define the audio folder, the deployment metadata file, and the annotation parameters for this deployment.

In [27]:
audio_dir = r'C:\Users\xavier.mouy\Documents\GitHub\minke-whale-dataset\datasets\UK-UAberdeen-MorayFirth-202101_1137-112'
deployment_file = r'deployment_info.csv' 
file_ext = 'wav'

annot_dur_sec = 60  # duration of the noise annotations in seconds
label_class = 'NN'  # label to use for the noise class
label_subclass = '' # label to use for the noise subclass (if needed, e.g. S for seismic airguns)

In [28]:
annot = create_noise_annot(audio_dir, deployment_file, file_ext, annot_dur_sec, label_class, label_subclass)

Depl1137_1678508072.210107040002.wav
Depl1137_1678508072.210108160002.wav
Depl1137_1678508072.210113150002.wav
Depl1137_1678508072.210114040002.wav
Depl1137_1678508072.210116170002.wav
Depl1137_1678508072.210119040002.wav
Depl1137_1678508072.210122000002.wav
Depl1137_1678508072.210123040002.wav
Depl1137_1678508072.210123120002.wav
Depl1137_1678508072.210208160002.wav
Depl1137_1678508072.210211200002.wav
Depl1137_1678508072.210213110002.wav


Let's look at the summary of annotations that were created:

In [29]:
annot.summary()

| deployment_ID | label_class NN | Total |
|---|---|---|
| UK-UAberdeen-MorayFirth-202101_1137-112 | 108 | 108 |
| Total | 108 | 108 |


The dataset can now be saved as a NetCDF4 file and a Raven annotation file:

In [30]:
annot.to_netcdf(os.path.join(audio_dir, 'Annotations_dataset_' + annot.data['deployment_ID'][0] +' annotations.nc'))
annot.to_raven(audio_dir, outfile='Annotations_dataset_' + annot.data['deployment_ID'][0] +'.Table.1.selections.txt', single_file=True)
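
The seven dataset sections above run the same cells and differ only in the folder name and the noise subclass. As a possible alternative, the per-deployment settings could be collected in one table and processed in a loop. The sketch below only builds the argument sets; the names `DEPLOYMENTS` and `build_jobs` and the `datasets` root folder are hypothetical, and the commented lines show how the result might be passed to `create_noise_annot`.

```python
import os

# Deployment folders used in this notebook, mapped to their noise subclass
# ('S' marks the two 202001 deployments flagged as seismic)
DEPLOYMENTS = {
    "UK-UAberdeen-MorayFirth-201904_986-110": "",
    "UK-UAberdeen-MorayFirth-201904_1027-235": "",
    "UK-UAberdeen-MorayFirth-201904_1029-237": "",
    "UK-UAberdeen-MorayFirth-202001_1092-112": "S",
    "UK-UAberdeen-MorayFirth-202001_1093-164": "S",
    "UK-UAberdeen-MorayFirth-202101_1136-164": "",
    "UK-UAberdeen-MorayFirth-202101_1137-112": "",
}


def build_jobs(root_dir, annot_dur_sec=60, label_class="NN"):
    """Return one dict of create_noise_annot keyword arguments per deployment."""
    return [
        {
            "audio_dir": os.path.join(root_dir, name),
            "deployment_file": "deployment_info.csv",
            "file_ext": "wav",
            "annot_dur_sec": annot_dur_sec,
            "label_class": label_class,
            "label_subclass": subclass,
        }
        for name, subclass in DEPLOYMENTS.items()
    ]


jobs = build_jobs("datasets")  # "datasets" is a placeholder root folder
print(len(jobs))  # 7

# Possible use with the function defined in this notebook:
# for job in jobs:
#     annot = create_noise_annot(**job)
#     annot.summary()
```

Keeping the settings in one place makes it harder for a parameter such as the subclass label to drift between deployments.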