# **Creating embeddings of a deployment** 🐠🐦

<a name=software-requirements></a>
## Software Requirements

Most of the code lives in a publicly accessible repository hosted on GitHub: https://github.com/google-research/perch.

**Note:** The Perch repository was formerly named 'chirp', so any reference to 'chirp' in the code below refers to code in the Perch repository.

<a name="methodology"></a>
# Methodology 💻

## Overview

Our method performs a **vector search**. We start from a **labeled** vocalization, which we'll refer to as the **query** (an audio clip with a known species vocalizing), and an unlabeled audio dataset, which we'll refer to as our **search corpus**. We slice the search corpus into a collection of clips, then search over those clips to find "matches" with the query.

To do this, we'll follow these high level steps:

**Step 1:** Set up by importing the relevant modules, choosing a dataset and testing the model.

**Step 2:** Use the "off the shelf" SurfPerch model to generate high quality embeddings for the unlabeled **search corpus**.

**Step 3:** Obtain a small number of **query** samples as target sounds, which are labeled samples of unknown fish sounds provided by the user.

**Step 4:** Select a query sample and generate the embedding(s) from this.

**Step 5:** Search within the set of embeddings generated from the search corpus (the raw audio data) for points that are "nearby" the embedding generated from the query sample. This yields a set of audio snippets from the audio data that should sound similar to the query.

**Step 6:** Manually audit the results of step 5. This involves listening to a small number of samples and manually labeling them as a match to our target query or not.

--------------------
**Note:** Aim to repeat steps 5 and 6 until we have 20-30 samples for our target sound. We are "bootstrapping" a training set for a simple linear model.

--------------------
**Step 7:** Train a simple linear model based on the bootstrapped samples that we just generated.

At this point, we should have a high quality classifier that can detect a given target sound in the broader dataset. If the model is not performing as well as we'd like, we can generate more training data, either by using outputs from the linear model to label more data or by repeating steps 5 and 6 above.
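The search described in step 5 can be illustrated with a minimal NumPy sketch. It uses random vectors in place of real SurfPerch embeddings and ranks corpus clips by Euclidean distance to the query. The array shapes, the `top_k_nearest` helper and the choice of metric are assumptions made for illustration only; this is not the Perch search implementation.

```python
import numpy as np


def top_k_nearest(corpus: np.ndarray, query: np.ndarray, k: int = 5):
    """Return indices and distances of the k corpus rows closest to the query."""
    # Euclidean distance between the query and every corpus embedding.
    distances = np.linalg.norm(corpus - query, axis=1)
    # Indices of the k smallest distances, ordered from nearest to farthest.
    order = np.argsort(distances)[:k]
    return order, distances[order]


rng = np.random.default_rng(0)
# Stand-in for the search corpus: 1000 clips with 16-dimensional embeddings.
corpus_embeddings = rng.normal(size=(1000, 16))
# Build a query that is a slightly perturbed copy of clip 42, so that clip 42
# should come back as the best match.
query_embedding = corpus_embeddings[42] + 0.01 * rng.normal(size=16)

indices, dists = top_k_nearest(corpus_embeddings, query_embedding, k=5)
print(indices, dists)
```

In the real workflow, the clips behind the returned indices are the ones you would listen to and label in step 6.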




<a name="pipeline-config"></a>
## Setup and Configuration

In this section, we set the configuration parameters we'll be using to process and embed the audio data, along with input and output paths for reading in the data, pre-trained model, and writing the results we produce to file.

In [1]:
# Import various dependencies, including the relevant modules from the Perch
# repository. Note that "chirp" is the old name that the Perch team used, so any
# chirp modules imported here were installed as part of the Perch repository in
# one of the previous cells.

import collections
from collections import Counter
from etils import epath
import ipywidgets as widgets
from IPython.display import display as ipy_display
import matplotlib.pyplot as plt
from ml_collections import config_dict
import numpy as np
import pandas as pd
from scipy.io import wavfile
import shutil
import tensorflow as tf
import tqdm
import os

from chirp import audio_utils
from chirp.inference import embed_lib
from chirp.inference import tf_examples
from chirp.inference.search import bootstrap
from chirp.inference.search import search
from chirp.inference.search import display
from chirp.inference.classify import classify
from chirp.inference.classify import data_lib

2025-02-11 10:25:13.131862: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-11 10:25:13.143751: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1739265913.158243    6392 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1739265913.163053    6392 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-11 10:25:13.177951: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

In [2]:
# Check GPU is used
tf.config.list_physical_devices('GPU')


[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

### Set the configuration to use throughout the tutorial **[do not change]**

The Perch codebase provides a general framework for agile modeling, which naturally involves many different configuration options. We highlight some of the relevant parameters below, and refer the reader to the Perch codebase for more detail.

For this tutorial using the provided pre-trained model, the required values for the following parameters are fixed and should not be changed.

#### Relevant Parameters
* `window_size_s`: The size in seconds of each "chunk" of audio. Each chunk of audio is treated as a single data point. Note that the model architecture depends on this value, so once selected it cannot be changed.

* `hop_size_s`: The hop size (aka model [*stride*](https://medium.com/machine-learning-algorithms/what-is-stride-in-convolutional-neural-network-e3b4ae9baedb)) is the offset in seconds between successive chunks of audio. When `hop_size_s` is equal to `window_size_s`, the chunks of audio do not overlap at all. Choosing a smaller hop size (a common choice is half of the window size) may be useful for capturing interesting data points that correspond to audio on the boundary between two windows. However, a smaller hop size also leads to a larger embedding dataset, because each instant of audio is now present in multiple windows. As a consequence, you might need to "de-dupe" your matches, since multiple embedded data points may correspond to the same snippet of raw audio.

* `sample_rate`: We use a uniform sample rate of 32 kHz. All audio used for training the base model and generating embeddings is (re)sampled at 32 kHz. Together with a `window_size_s` of 5, this means that each snippet of audio is represented as a vector of length 5 s × 32,000 Hz = 160,000 samples. The value 160,000 must be compatible with your model architecture.
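To make the relationship between these three parameters concrete, here is a small sketch in plain Python. The `window_offsets` helper is hypothetical, and it assumes that only full windows are counted; how the model handles a trailing partial window is not covered here.

```python
def window_offsets(audio_length_s: float, window_size_s: float, hop_size_s: float):
    """Start times (in seconds) of every full window in a recording."""
    offsets = []
    start = 0.0
    while start + window_size_s <= audio_length_s:
        offsets.append(start)
        start += hop_size_s
    return offsets


sample_rate = 32_000   # Hz, as used in this tutorial
window_size_s = 5.0    # seconds

# Number of audio samples fed to the model for each window.
samples_per_window = int(window_size_s * sample_rate)

# A 30-minute file with non-overlapping windows (hop == window).
no_overlap = window_offsets(30 * 60, window_size_s, hop_size_s=5.0)
# The same file with 50% overlap (hop == window / 2).
half_overlap = window_offsets(30 * 60, window_size_s, hop_size_s=2.5)

print(samples_per_window)   # 160000
print(len(no_overlap))      # 360
print(len(half_overlap))    # 719
```

Halving the hop size roughly doubles the number of embeddings per file, which is the storage and de-duplication cost mentioned above.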

In [3]:
sample_data_folder = "./Shimoni pilot Data/"
output_directory = "./Shimoni pilot Outputs/"

# Enter the name of the folder containing the dataset below
dataset_folder = '(NT-R-2)' + '/'

# Models folder
model_folder = "./models/"

# Specify a glob pattern matching any number of wave files.
# Use [wW][aA][vV] to match .wav or .WAV files
unlabeled_audio_pattern = os.path.join(sample_data_folder, dataset_folder, 'raw_audio/*.[wW][aA][vV]')

In [4]:
from utils_agile_model import choose_embedding_model

# Test model choosing
model_name = "surfperch"
embed_fn, config = choose_embedding_model(model_name)

# For readability later in the code
sample_rate = config.embed_fn_config.model_config.sample_rate
hop_size_s = config.embed_fn_config.model_config.hop_size_s
window_size_s = config.embed_fn_config.model_config.window_size_s

print(f"Ready to create embeddings for deployment '{dataset_folder}' using '{model_name}' model")
print(f"Sampling rate:{sample_rate}Hz, Hop size:{hop_size_s}sec, Window size:{window_size_s}sec")



Loading surfperch...


I0000 00:00:1739265915.902711    6392 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9167 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:15:00.0, compute capability: 7.5


Test-run of model...


I0000 00:00:1739265918.841822    6392 service.cc:148] XLA service 0x38c23b30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1739265918.841851    6392 service.cc:156]   StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
2025-02-11 10:25:19.075347: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
W0000 00:00:1739265919.088430    6392 assert_op.cc:38] Ignoring Assert operator jax2tf_infer_fn_/assert_equal_1/Assert/AssertGuard/Assert
I0000 00:00:1739265919.529137    6392 cuda_dnn.cc:529] Loaded cuDNN version 90701
E0000 00:00:1739265920.349282    6392 gpu_timer.cc:82] Delay kernel timed out: measured time has sub-optimal accuracy. There may be a missing warmup execution, please investigate in Nsight Systems.
E0000 00:00:1739265920.468211    6392 gpu_timer.cc:82] Delay kernel timed out: measured time has s


Setup complete!
Ready to create embeddings for deployment '(NT-R-2)/' using 'surfperch' model
Sampling rate:32000Hz, Hop size:5.0sec, Window size:5.0sec


I0000 00:00:1739265924.396525    6392 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


### Specify the data (inputs) and results (outputs) directories
The name of the folder containing the dataset is set in the `dataset_folder` variable above.

To execute this pipeline, we need the following paths. No modifications are needed if the expected folder structure was respected:
- `unlabeled_audio_pattern`: files where the unlabeled audio dataset is stored.
- `embedding_output_dir`: the directory where the embedded audio will be written.
- `labeled_data_path`: the directory where the labeled samples will be placed post-search and active learning loop.
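Before embedding, it can help to confirm that the glob pattern actually matches the audio files. The sketch below recreates a folder layout like the one used in this notebook inside a temporary directory and checks what the `[wW][aA][vV]` pattern picks up. The file names are hypothetical and the files are empty placeholders, not real audio.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout mirroring the one used in this notebook:
    # <data folder>/<deployment>/raw_audio/<audio files>
    raw_audio_dir = os.path.join(root, "(NT-R-2)", "raw_audio")
    os.makedirs(raw_audio_dir)
    for name in ("rec_001.wav", "rec_002.WAV", "notes.txt"):
        # Create empty placeholder files; only their names matter here.
        open(os.path.join(raw_audio_dir, name), "w").close()

    # Same pattern style as `unlabeled_audio_pattern`: matches .wav and .WAV.
    pattern = os.path.join(root, "(NT-R-2)", "raw_audio", "*.[wW][aA][vV]")
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matched)
```

If the printed list is empty for your own data, the most likely causes are a misspelled `dataset_folder` or audio files stored outside the `raw_audio` subfolder.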

In [5]:
# Specify a directory where the embeddings will be written.
embedding_output_dir = os.path.join(output_directory, dataset_folder, model_name, 'raw_embeddings/')
if not os.path.exists(embedding_output_dir):
  os.makedirs(embedding_output_dir, exist_ok=True)

config.output_dir = embedding_output_dir
config.source_file_patterns = [unlabeled_audio_pattern]
config.shard_len_s = 30*60 # ENTER DEPLOYMENT FILELENGTH HERE (in sec)
config.num_shards_per_file = -1
config.tf_record_shards = 10

# Create output directory and write the configuration.
output_dir = epath.Path(config.output_dir)
output_dir.mkdir(exist_ok=True, parents=True)

### Write the configuration to JSON to ensure consistency with later stages of the pipeline

In [6]:
# This dumps a config json file next to the embeddings that allows us to reuse
# the same embeddings and ensure that we have the correct config that was used
# to generate them.
embed_lib.maybe_write_config(config, output_dir)

# Create SourceInfos configuration, used in sharded computation when computing
# embeddings. These source_infos contain metadata about how we're going to
# partition the search corpus.  In particular, we're splitting the deployment
# audio into hundreds of 5s chunks, and the source_infos help us keep track of
# which chunk came from which raw audio file.
source_infos = embed_lib.create_source_infos(
    config.source_file_patterns,
    config.num_shards_per_file,
    config.shard_len_s)
# Should match the number of files in the raw audio folder.
print(f'Constructed {len(source_infos)} source infos.')

if len(source_infos) == 0:
    print('No audio files found. Please check the path and try again.')
    print(f'Path: {unlabeled_audio_pattern}')

Constructed 30 source infos.


<a name=embed_data></a>
## Generate Embeddings ⏳

In this section we'll generate the **embeddings** corresponding to both our search corpus (the chosen dataset) as well as our query audio (the target sound chosen shortly).

Recall that the embeddings are new representations of the original data that are generated by the pretrained model. It is important to remember that we are **not** using our pretrained model to classify the new reef dataset. Rather, we're using the model's **learned features** to map our data into a new representation that is more amenable to simpler classification techniques. The pretrained model was very computationally costly to train, so the idea is that the heavy lifting of learning salient features has already been done during its development. We can use this pretrained model to extract these features from new marine bioacoustic data, then train a much lighter-weight machine learning model on top of these features. This is the concept of **transfer learning**, i.e., re-using the features learned by a model, but in a novel setting.
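As a rough illustration of the "lighter-weight model on top of frozen features" idea, the sketch below trains a logistic regression classifier with plain gradient descent on synthetic vectors. The synthetic clusters, the embedding dimension and the training settings are all assumptions chosen for the example; they stand in for real embeddings and are not the classifier used by the Perch codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Synthetic stand-ins for frozen embeddings: two well-separated clusters.
positives = rng.normal(loc=1.0, size=(30, dim))    # clips matching the target sound
negatives = rng.normal(loc=-1.0, size=(30, dim))   # everything else
features = np.vstack([positives, negatives])
labels = np.concatenate([np.ones(30), np.zeros(30)])

# Logistic regression trained with plain gradient descent.
weights = np.zeros(dim)
bias = 0.0
learning_rate = 0.1
for _ in range(200):
    logits = features @ weights + bias
    probs = 1.0 / (1.0 + np.exp(-logits))
    error = probs - labels
    weights -= learning_rate * (features.T @ error) / len(labels)
    bias -= learning_rate * error.mean()

# A positive logit means the model predicts "target sound present".
predictions = (features @ weights + bias) > 0
accuracy = (predictions == labels.astype(bool)).mean()
print(f"Training accuracy: {accuracy:.2f}")
```

Only `weights` and `bias` are learned here. The embedding model itself is never updated, which is why a few dozen labeled samples can be enough.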


### Embed the search dataset
We are ready to generate the embeddings for the raw audio. The cell below builds an `audio_iterator` over the raw audio files, iterates over it, and creates a point (vector) in *embedding space* for each 5 second chunk of raw audio. We write these embeddings to TFRecord files in the `embedding_output_dir` directory that we specified above, so that they can be loaded again later.

Writing the embeddings to file is useful because for large datasets, this embedding step can take minutes or hours, and we don't want to have to repeatedly regenerate the embeddings.

**GPU usage**: This component will benefit greatly from using a GPU.

❗If you have already computed the embeddings for this dataset, you do not need to run this cell again❗
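The idea of skipping work that has already been done can be sketched in a few lines of plain Python. The `select_new_sources` helper and the identifier format below are hypothetical and only illustrate the principle; in the cell that follows, the equivalent filtering is delegated to `embed_lib`.

```python
def select_new_sources(all_source_ids, existing_ids):
    """Keep only the sources that do not yet have an embedding on disk."""
    existing = set(existing_ids)
    return [source_id for source_id in all_source_ids if source_id not in existing]


# Hypothetical identifiers of the form "<file name>:<offset in seconds>".
all_sources = ["rec_001.wav:0", "rec_002.wav:0", "rec_003.wav:0"]
already_embedded = ["rec_001.wav:0"]

to_process = select_new_sources(all_sources, already_embedded)
print(to_process)
```

Filtering on identifiers before loading any audio keeps a re-run cheap, because files that were already embedded are never read again.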

In [9]:
#@title { vertical-output: true }

# RUN ONLY ONCE (per dataset)
# Embed! This step may take several minutes to run depending on the size of the search corpus
embed_fn.min_audio_s = 1.0
record_file = (output_dir / 'embeddings.tfrecord').as_posix()
succ, fail = 0, 0

existing_embedding_ids = embed_lib.get_existing_source_ids(
    output_dir, 'embeddings-*')

new_source_infos = embed_lib.get_new_source_infos(
    source_infos, existing_embedding_ids, config.embed_fn_config.file_id_depth)

print(f'Found {len(existing_embedding_ids)} existing embedding ids. \n'
      f'Processing {len(new_source_infos)} new source infos. ')

try:
  audio_loader = lambda fp, offset: audio_utils.load_audio_window(
      fp, offset, sample_rate=config.embed_fn_config.model_config.sample_rate,
      window_size_s=config.get('shard_len_s', -1.0))
  # Only load and embed the sources that do not already have embeddings,
  # matching the "new source infos" count printed above.
  audio_iterator = audio_utils.multi_load_audio_window(
      filepaths=[s.filepath for s in new_source_infos],
      offsets=[s.shard_num * s.shard_len_s for s in new_source_infos],
      audio_loader=audio_loader,
  )
  with tf_examples.EmbeddingsTFRecordMultiWriter(
      output_dir=output_dir, num_files=config.get('tf_record_shards', 1)) as file_writer:
    for source_info, audio in tqdm.tqdm(
        zip(new_source_infos, audio_iterator), total=len(new_source_infos)):
      if not embed_fn.validate_audio(source_info, audio):
        continue
      file_id = source_info.file_id(config.embed_fn_config.file_id_depth)
      offset_s = source_info.shard_num * source_info.shard_len_s
      example = embed_fn.audio_to_example(file_id, offset_s, audio)
      if example is None:
        fail += 1
        continue
      file_writer.write(example.SerializeToString())
      succ += 1
    file_writer.flush()
finally:
  del(audio_iterator)
print(f'\n\nSuccessfully processed {succ} source_infos, failed {fail} times.')

# This can take a few moments to get started
print(f"\n\n Embeddings of the deployment '{dataset_folder}' from '{model_name}' have been successfully created in the folder {output_dir}")

2025-02-11 10:41:09.393207: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous is aborting with status: OUT_OF_RANGE: End of sequence


Found 0 existing embedding ids. 
Processing 30 new source infos. 


100%|██████████| 30/30 [00:40<00:00,  1.35s/it]



Successfully processed 30 source_infos, failed 0 times.


 Embeddings of the deployment '(NT-R-2)/' from 'surfperch' have been successfully created in the folder Shimoni pilot Outputs/(NT-R-2)/surfperch/raw_embeddings





## Release GPU resources (if needed)