## Documentation of 'Lemurcalls'

### Project Background
This code is part of my master's thesis, *Center-Based Segmentation of Lemur Vocalizations Using the Whisper Audio Foundation Model* (submitted on 08.12.2025), but it was not part of the formal evaluation of this thesis or any other university module.

The code for WhisperSeg—including `util`, `utils.py`, `datautils.py`, and `audio_utils.py`—was adapted from https://github.com/nianlonggu/WhisperSeg for the new dataset. For WhisperFormer, `losses.py` was adapted from https://github.com/happyharrycn/actionformer_release. I used ChatGPT and Cursor for debugging, code and text drafting, and brainstorming.

For more details on the project see `thesis.pdf`.

### Introduction

Consistently detecting, segmenting, and classifying animal vocalizations is crucial for the study and conservation of wildlife. Manual annotation, however, is time-consuming and labor-intensive, highlighting the need for reliable automated approaches. Deep learning methods—especially those leveraging transfer learning—have achieved promising results in Sound Event Detection in recent years. Yet, performance often declines when models are applied to small datasets with diverse and high background noise, as is typical for primate vocalizations. 

With this code, you can train two types of models that both detect, segment, and classify lemur calls in long audio recordings. The library also lets you assess the performance of the models.
To further improve results, the model predictions can be filtered by applying SNR and maximal amplitude thresholds. To determine those thresholds, the SNR and amplitude values of the calls in the training set can be plotted. To analyze model performance, the points can be colored by the predicted confidence scores.
Further, you can analyze and compare the model predictions on the test set by plotting the spectrograms as calculated by Whisper, the final predicted calls and, for WhisperFormer, the frame-wise confidence scores.

### Library Structure:

lemurcalls is organized into two main model subpackages, lemurcalls.whisperseg and lemurcalls.whisperformer, each covering training, inference, and evaluation workflows.
Shared data handling and utility logic is centralized in common helper modules so both pipelines use consistent preprocessing and label handling.
In addition, the library provides visualization and postprocessing tools (e.g., SNR/amplitude filtering and precision-recall analysis) to systematically inspect and improve model outputs. A more detailed visualization of the library structure can be seen below:


```text
lemurcalls/
├── README.md
├── pyproject.toml
├── presentation_notebook/
│   └── presentation.ipynb
└── lemurcalls/
    ├── __init__.py
    ├── datautils.py
    ├── audio_utils.py
    ├── utils.py
    ├── visualize_predictions.py
    ├── download_whipser.py
    ├── whisperseg/
    │   ├── __init__.py
    │   ├── model.py
    │   ├── train.py
    │   ├── infer.py
    │   ├── infer_folder.py
    │   ├── evaluate.py
    │   ├── evaluate_metrics.py
    │   ├── training_utils.py
    │   ├── datautils_ben.py
    │   ├── utils.py
    │   └── convert_hf_to_ct2.py
    └── whisperformer/
        ├── __init__.py
        ├── model.py
        ├── dataset.py
        ├── datautils.py
        ├── losses.py
        ├── train.py
        ├── infer.py
        ├── postprocessing/
        │   ├── prec_rec.py
        │   └── filter_labels_by_snr.py
        └── visualization/
            └── scatterplot_ampl_snr_score.py
```

### Biological and Mathematical Background

#### Datasets:
The data are audio recordings in the form of .wav files collected at the Affenwald Straußpark in Thüringen. To obtain the data, ring-tailed lemurs were fitted with collars carrying microphones.
The training data were manually annotated with the Raven Pro software: for each detected call, an onset and offset as well as a class label were assigned. Only calls belonging to three specific call types ('moan', 'wail' and 'hmm') were labeled.

The recordings are affected by diverse background noise (e.g. visitors of the park or traffic from a nearby road) and by the fact that microphones worn by one individual often capture calls from nearby conspecifics. To account for this variation, each call was manually classified into one of three quality classes:
1. Quality 1: Loud, high-quality calls that almost certainly originate from the focal individual.
2. Quality 2: Calls of moderate quality that probably do not originate from the focal individual.
3. Quality 3: Low-quality background calls, including very quiet or distant vocalizations.

The annotated JSON files are expanded to include quality labels and have the following structure:
`{onset:[], offset:[], cluster:[], quality:[]}`.
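
A minimal sketch of how an annotation file with this structure could be turned into per-call records and filtered by quality class. The function names are hypothetical and not part of lemurcalls. It is also an assumption that the four lists are aligned (one entry per call), that onsets and offsets are given in seconds, and that quality values may be stored either as numbers or as strings.

```python
import json


def calls_from_annotation(data):
    """Convert the column-wise annotation dict into a list of per-call dicts."""
    keys = ("onset", "offset", "cluster", "quality")
    # Assumption: all four lists are aligned, one entry per call.
    if len({len(data[k]) for k in keys}) != 1:
        raise ValueError("annotation fields have unequal lengths")
    columns = [data[k] for k in keys]
    return [dict(zip(keys, values)) for values in zip(*columns)]


def filter_by_quality(calls, allowed=(1,)):
    """Keep only calls whose quality class is in `allowed`."""
    # Compare as strings so that 1 and "1" are treated alike (assumption).
    allowed = {str(q) for q in allowed}
    return [c for c in calls if str(c["quality"]) in allowed]


def load_annotation_file(path):
    """Read one annotation JSON file from disk (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        return calls_from_annotation(json.load(f))
```

For example, `filter_by_quality(load_annotation_file(path), allowed=(1, 2))` would keep only the calls of quality classes 1 and 2.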





#### Explanation of the Models:

WhisperSeg: [WhisperSeg] by Gu et al. leverages the pretrained [Whisper] Transformer, an automatic speech recognition system pretrained on 680,000 hours of multilingual and multitask supervised human speech. They showed that it can be effectively adapted to detect and classify animal vocalizations across multiple species and built a multi-species checkpoint. As a sequence-to-sequence model, WhisperSeg outputs the onset, offset, and class of each call.

WhisperFormer: Instead of performing next-token prediction as in WhisperSeg, WhisperFormer directly predicts the call centers and regresses onsets and offsets. This center-based approach is inspired by object detection and temporal action localization (TAL) methods such as CenterNet and ActionFormer. To achieve this, WhisperFormer combines the original Whisper encoder with a lightweight decoder followed by a regression head and a classification head. These heads are inspired by [ActionFormer]. The training loss consists of two parts, one for each of the two model heads. As in [ActionFormer], I use sigmoid focal loss (Lin et al., 2017) as class loss $\mathcal{L}_{class}$ and DIoU loss (Zheng et al., 2020) as regression loss $\mathcal{L}_{reg}$.
The total loss is the weighted sum of class and regression loss, normalized by the total number of positive samples $T_{+}$. The regression loss is only evaluated for samples $t \in \mathcal{T}_+$, i.e. those with a positive ground truth value, and $\lambda$ weights the regression term. Let $CL$ be the ground truth clusters, $\tilde{CL}$ the predicted clusters, $S$ the ground truth segments and $\tilde{S}$ the predicted segments. Then the total loss can be described as
$$ \mathcal{L}_{total}(CL, S, \tilde{CL}, \tilde{S})  = \frac{1}{T_{+}} \left( \mathcal{L}_{class}(CL, \tilde{CL})  + \lambda \cdot \mathbb{1}_{\mathcal{T}_+} \cdot \mathcal{L}_{reg}(S, \tilde{S})\right).$$
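
To make the structure of this loss concrete, here is a small pure-Python sketch. It is not the code in `losses.py`: the function names, the focal-loss defaults `alpha=0.25` and `gamma=2.0`, and the representation of a segment as (left, right) distances from a center are assumptions chosen for illustration.

```python
import math


def sigmoid_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for a single binary target (target is 0.0 or 1.0)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1.0 else 1.0 - p          # probability of the true label
    alpha_t = alpha if target == 1.0 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def diou_loss_1d(pred, gt, eps=1e-8):
    """DIoU loss for 1D segments given as (left, right) distances from a center."""
    lp, rp = pred
    lg, rg = gt
    inter = min(lp, lg) + min(rp, rg)
    union = (lp + rp) + (lg + rg) - inter
    iou = inter / max(union, eps)
    enclosing = max(lp, lg) + max(rp, rg)           # smallest enclosing segment
    rho = 0.5 * ((rp - lp) - (rg - lg))             # distance between segment centers
    return 1.0 - iou + (rho / max(enclosing, eps)) ** 2


def total_loss(cls_logits, cls_targets, reg_preds, reg_targets, lam=1.0):
    """Sum both losses over time steps and normalize by the number of positives.

    cls_logits / cls_targets: one list of per-class values per time step.
    reg_preds / reg_targets: one (left, right) pair per time step.
    """
    cls_sum, reg_sum, num_pos = 0.0, 0.0, 0
    for logits_t, targets_t, pred_t, gt_t in zip(
        cls_logits, cls_targets, reg_preds, reg_targets
    ):
        cls_sum += sum(sigmoid_focal_loss(z, y) for z, y in zip(logits_t, targets_t))
        if any(y == 1.0 for y in targets_t):        # indicator: t is a positive sample
            reg_sum += diou_loss_1d(pred_t, gt_t)
            num_pos += 1
    return (cls_sum + lam * reg_sum) / max(num_pos, 1)
```

The classification term is summed over all time steps, while the regression term only contributes at positive time steps, which mirrors the indicator in the formula.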

For each spectrogram column and each of the three classes, WhisperFormer outputs a confidence score for a call being centered at that column, together with the onset and offset relative to that column.
Thus, to obtain the final predictions, non-maximum suppression ([NMS]) is performed over all calls and a confidence threshold is applied.
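
The following sketch illustrates this postprocessing step with plain greedy NMS over 1D segments followed by a score threshold. It is an illustration under assumptions, not the implementation in `lemurcalls.whisperformer.infer`: the function names, the tuple layout `(onset, offset, score, label)` and the default thresholds are hypothetical, and the actual code may use a different NMS variant.

```python
def iou_1d(a, b):
    """Temporal IoU of two (onset, offset) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(candidates, iou_threshold=0.4):
    """Greedy NMS: visit candidates by descending score and drop any candidate
    that overlaps an already kept one by more than `iou_threshold`."""
    ordered = sorted(candidates, key=lambda c: c[2], reverse=True)
    kept = []
    for cand in ordered:
        if all(iou_1d(cand[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(cand)
    return kept


def final_predictions(candidates, score_threshold=0.3, iou_threshold=0.4):
    """Apply NMS over all candidates, then keep calls above the score threshold."""
    kept = nms_1d(candidates, iou_threshold)
    return sorted((c for c in kept if c[2] >= score_threshold), key=lambda c: c[0])


candidates = [
    (0.0, 1.0, 0.9, "m"),
    (0.1, 1.1, 0.8, "w"),   # overlaps the first candidate strongly
    (2.0, 3.0, 0.5, "m"),
    (5.0, 6.0, 0.1, "h"),   # survives NMS but is below the score threshold
]
print(final_predictions(candidates))
```

In this sketch suppression is class-agnostic, so a lower-scoring candidate of a different class at the same position is removed as well.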


#### Evaluation Metrics:

In line with standard practice in detection tasks, I use the F1 score as the primary evaluation metric for all calls. Let $TP$ denote the number of true positives, $FP$ the number of false positives, $FN$ the number of false negatives, and $FC$ the number of predictions that match the temporal location of a call but are assigned an incorrect class. Then, precision and recall are defined as:
$$precision = \frac{TP}{TP + FP + FC}$$
and 
$$recall = \frac{TP}{TP+FN+FC}.$$
Precision measures the accuracy of the predicted calls, while recall quantifies the proportion of ground truth calls correctly detected. The F1 score is the harmonic mean of precision and recall, balancing these two aspects:
$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.$$
Following standard practice in multiclass classification, a FC prediction contributes both a FP (for the incorrectly predicted class) and a FN (for the true class). 
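
As a worked example of these definitions, the short function below computes precision, recall and F1 from the four counts. The function name and the example counts are made up for illustration and are not taken from `evaluate_metrics.py`.

```python
def precision_recall_f1(tp, fp, fn, fc):
    """Precision, recall and F1 where FC counts as both a FP and a FN."""
    precision = tp / (tp + fp + fc) if (tp + fp + fc) > 0 else 0.0
    recall = tp / (tp + fn + fc) if (tp + fn + fc) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct calls, 1 spurious call, 3 missed calls,
# 1 call found at the right time but with the wrong class.
precision, recall, f1 = precision_recall_f1(tp=8, fp=1, fn=3, fc=1)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With these counts, precision is 8/10 and recall is 8/12, so the single wrongly classified call lowers both values.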

##### Evaluation by Quality Classes
To account for differences in call quality, I calculate F1 scores with respect to the ground truth quality classes. I distinguish the following metrics:
1. $F1_{Q1,Q2,Q3}$: F1 score calculated with respect to the ground truth labels of all quality classes 1, 2 and 3.
2. $F1_{Q1,Q2}$: F1 score calculated with respect to the ground truth labels of calls from quality classes 1 and 2.
3. $F1_{Q1}$: F1 score calculated with respect to the ground truth labels of quality class 1 only.

Formally, for any subset of quality classes, the F1 score is computed as defined above, with precision and recall restricted to the selected quality classes.

A limitation of this approach is that, when focusing on detecting high-quality calls (quality class 1), false positives that match lower-quality calls (Q2 or Q3) may be less severe than detections that match no call at all. To address this, I define adjusted F1 metrics:
1. $F1_{Q1,(Q2,Q3)}$: F1 score calculated with respect to the ground truth labels of quality class 1, but predictions that match Q2 or Q3 calls are counted as neither TPs nor FPs.
2. $F1_{Q1,Q2,(Q3)}$: F1 score calculated with respect to the ground truth labels of quality classes 1 and 2, but predictions that match Q3 calls are counted as neither TPs nor FPs.

In these adjusted metrics, $TP$, $FN$, and $FC$ remain unchanged compared to $F1_{Q1}$ or $F1_{Q1,Q2}$, but the number of false positives may decrease.
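
The counting rule behind the adjusted metrics can be sketched as follows. This is an illustration only: the function name and the dictionary keys `matched_quality` and `class_correct` are hypothetical, and the sketch assumes that each prediction has already been matched to at most one ground truth call.

```python
def adjusted_counts(predictions, num_target_calls, target=(1,), ignored=(2, 3)):
    """Count TP, FP, FN and FC for the quality classes in `target`.

    predictions: list of dicts with the hypothetical keys
      'matched_quality': quality class of the matched ground truth call, or None
      'class_correct':   True if the predicted class equals the matched label
    num_target_calls: number of ground truth calls whose quality is in `target`
    """
    tp = fp = fc = 0
    for pred in predictions:
        quality = pred["matched_quality"]
        if quality in target:
            if pred["class_correct"]:
                tp += 1
            else:
                fc += 1
        elif quality in ignored:
            continue              # counted as neither TP nor FP
        else:
            fp += 1               # unmatched, or matched to a non-ignored class
    fn = num_target_calls - tp - fc   # target calls without any matching prediction
    return tp, fp, fn, fc


predictions = [
    {"matched_quality": 1, "class_correct": True},
    {"matched_quality": 1, "class_correct": False},
    {"matched_quality": 2, "class_correct": True},
    {"matched_quality": 3, "class_correct": True},
    {"matched_quality": None, "class_correct": False},
]
print(adjusted_counts(predictions, num_target_calls=3))              # Q2/Q3 matches ignored
print(adjusted_counts(predictions, num_target_calls=3, ignored=()))  # plain F1_Q1 counting
```

Comparing the two calls shows the intended effect: only the number of false positives changes, while TP, FN and FC stay the same.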

## Getting started
First, run the code below to install lemurcalls as an editable package with the dependencies from `pyproject.toml`.

In [None]:
pip install -e . 

The library includes two subpackages, lemurcalls.whisperseg and lemurcalls.whisperformer, as well as a tool to visualize and compare the predictions of the trained models.
In the following, we set some demo paths to demonstrate the functionality of the library. Since training large models requires a GPU, I used a small maximum number of epochs and a single audio file for inference for demonstration purposes.

In [None]:
PROJECT_ROOT = "/projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls"
AUDIO_TEST_DIR = "/mnt/lustre-grete/usr/u17327/final/audios_test"
AUDIO_SINGLE_DIR = "/mnt/lustre-grete/usr/u17327/final/audios_single"
LABEL_TEST_DIR = "/mnt/lustre-grete/usr/u17327/final/jsons_test"

WHISPER_BASE_PATH = f"{PROJECT_ROOT}/whisper_models/whisper_base"
WHISPERSEG_TRAIN_OUT = f"{PROJECT_ROOT}/lemurcalls/whisperseg_models"
WHISPERSEG_MODEL_DIR = f"{PROJECT_ROOT}/lemurcalls/model_folder_ben/final_checkpoint_20251116_163404_ct2"

WHISPERFORMER_CKPT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/best_model.pth"
WHISPERFORMER_PRED_DIR = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/sc/2026-02-20_11-48-55"
WHISPERFORMER_VIS_OUT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/visualization"
WHISPERFORMER_SNR_OUT = "/mnt/lustre-grete/usr/u17327/final/jsons_test_filtered"

## The WhisperSeg Subpackage

To train a WhisperSeg model, run:

In [5]:
!python -m lemurcalls.whisperseg.train \
  --initial_model_path "{WHISPER_BASE_PATH}" \
  --model_folder "{WHISPERSEG_TRAIN_OUT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --num_classes 3 \
  --batch_size 4 \
  --n_threads 1 \
  --num_workers 1 \
  --max_num_epochs 1

Using fixed codebook for 3 class(es): {'m': 0, 't': 1, 'w': 2, 'lt': 1, 'h': 1}
Created 609 training samples after slicing
epoch-000:  26%|████████▏                      | 40/152 [03:37<10:05,  5.41s/it]^C


For inference, run:

In [6]:
!python -m lemurcalls.whisperseg.infer \
  -d "{AUDIO_SINGLE_DIR}" \
  -m "{WHISPERSEG_MODEL_DIR}" \
  -o "{WHISPERSEG_MODEL_DIR}"

Model loaded successfully.
Found 1 wav files in: /mnt/lustre-grete/usr/u17327/final/audios_single
INFO:root:Current file: [  0] /mnt/lustre-grete/usr/u17327/final/audios_single/U2024_09_03_10_03_14_799-U2024_09_03_10_04_42_175.UBN_v2.WAV


For evaluation, you can use either the original WhisperSeg `evaluate.py` or `evaluate_metrics.py`, which is also applicable to WhisperFormer and can be run via:

In [None]:
!python -m lemurcalls.whisperseg.evaluate_metrics \
  --labels "/mnt/lustre-grete/usr/u17327/final/jsons_test/U2024_09_03_10_03_14_799-U2024_09_03_10_04_42_175.UBN_v2.json" \
  --predictions "/path/to/your_prediction_file.json" \
  --overlap_tolerance 0.1

## The WhisperFormer Subpackage


To train a WhisperFormer model, run:

In [None]:
!python -m lemurcalls.whisperformer.train \
  --checkpoint_path /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/best_model.pth \
  --model_folder /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new \
  --audio_folder /mnt/lustre-grete/usr/u17327/final/audios_test \
  --label_folder /mnt/lustre-grete/usr/u17327/final/jsons_test \
  --num_classes 3 \
  --batch_size 4 \
  --max_num_epochs 1 \
  --whisper_size large

For inference with a set confidence score threshold, run:

In [None]:
!python -m lemurcalls.whisperformer.infer \
  --checkpoint_path /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/best_model.pth \
  --audio_folder /mnt/lustre-grete/usr/u17327/final/audios_single \
  --output_dir /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/sc \
  --batch_size 4 \
  --iou_threshold 0.4

To identify a suitable confidence threshold and assess overall model behavior, you can compute a precision-recall curve across multiple score thresholds.  
This helps you choose a threshold that best matches your objective (e.g., higher precision to reduce false positives, or higher recall to miss fewer calls).

You can also control which label quality classes are considered during evaluation via `eval_mode`:

- `standard`: computes $F1_{Q1}$, $F1_{Q1,Q2}$ or $F1_{Q1,Q2,Q3}$.
- `q3_q2`: applies the quality-aware evaluation strategy, where $F1_{Q1,Q2,Q3}$ is calculated.

In [None]:
WHISPERFORMER_PREC_REC_OUT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/prec_rec"

!python -m lemurcalls.whisperformer.postprocessing.prec_rec \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --pred_folder "{WHISPERFORMER_PRED_DIR}" \
  --output_dir "{WHISPERFORMER_PREC_REC_OUT}" \
  --overlap_tolerance 0.3 \
  --allowed_qualities 1 2 3 \
  --eval_mode q3_q2 \
  --thresholds 0.1 0.15 0.2 0.25 0.3 0.35
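
The idea of sweeping score thresholds to obtain a precision-recall curve can be sketched in a few lines. This is a simplified illustration, not the logic of `prec_rec.py`: the function names and the `(score, is_match)` input format are hypothetical, wrongly classified calls (FC) are left out, and each prediction is assumed to be already marked as matching a ground truth call or not.

```python
def sweep_thresholds(predictions, num_gt_calls, thresholds):
    """Return one (threshold, precision, recall, f1) row per score threshold.

    predictions: list of (score, is_match) pairs, where is_match says whether
    the prediction matches a ground truth call (assumed to be precomputed).
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for score, match in predictions if score >= t and match)
        fp = sum(1 for score, match in predictions if score >= t and not match)
        fn = num_gt_calls - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((t, precision, recall, f1))
    return rows


def best_threshold(rows):
    """Pick the threshold with the highest F1 score."""
    return max(rows, key=lambda row: row[3])[0]


# Made-up example scores, not results from the thesis.
predictions = [(0.9, True), (0.8, True), (0.4, False), (0.3, True), (0.2, False)]
rows = sweep_thresholds(predictions, num_gt_calls=3, thresholds=[0.1, 0.35, 0.5])
for t, p, r, f in rows:
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f} F1={f:.2f}")
print("best threshold by F1:", best_threshold(rows))
```

Choosing by F1 is only one option; if false positives are more costly than missed calls, the threshold with the highest precision at an acceptable recall may be preferable.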

## Thresholds and visualization

If the aim is to detect only high-quality calls from the focal animal, it can be useful to apply postprocessing filters, such as signal-to-noise ratio (SNR) and amplitude filters.


To determine appropriate thresholds for your dataset, you can plot the SNR and maximal amplitudes of the calls in your training set and color them by quality class. Additionally, you can color the calls according to the confidence score assigned by a final model.

In [None]:
!python -m lemurcalls.whisperformer.visualization.scatterplot_ampl_snr_score \
  --checkpoint_path "{WHISPERFORMER_CKPT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --output_dir "{WHISPERFORMER_VIS_OUT}" \
  --num_classes 3 \
  --batch_size 4

Arguments saved to: /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/visualization/2026-02-27_11-17-53/run_arguments.json
Checkpoint: whisper_size=large, num_decoder_layers=1, num_head_layers=2, num_classes=3
^C                               


![SNR vs Amplitude (Quality)](scatter_snr_vs_amplitude.png)

When we plot SNR and maximal amplitude of the labeled calls in our dataset, we can see that quality class 2 and quality class 3 cannot easily be separated (linearly). Quality class 1, on the other hand, can be separated rather well from the other quality classes with the chosen thresholds.

![SNR vs Amplitude (Model Score)](scatter_snr_vs_amplitude_model_score.png)

To filter existing predictions by SNR and maximal amplitude use:

In [None]:
!python -m lemurcalls.whisperformer.postprocessing.filter_labels_by_snr \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --output_dir "{WHISPERFORMER_SNR_OUT}" \
  --snr_threshold -1 \
  --amplitude_threshold 0.035
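
Conceptually, such a filter keeps a call only if both its SNR and its maximal amplitude reach the chosen thresholds. The sketch below shows one way this could look. It is not the code of `filter_labels_by_snr.py`: the function names are hypothetical, and it is an assumption that SNR is expressed in dB as the ratio of mean call power to mean noise power, that the noise is estimated from a separate window of samples, and that the waveform is normalized to the range [-1, 1].

```python
import math


def max_amplitude(samples):
    """Largest absolute sample value within the call."""
    return max(abs(x) for x in samples) if samples else 0.0


def snr_db(call_samples, noise_samples, eps=1e-12):
    """SNR in dB: 10 * log10(mean call power / mean noise power) (assumed definition)."""
    p_signal = sum(x * x for x in call_samples) / max(len(call_samples), 1)
    p_noise = sum(x * x for x in noise_samples) / max(len(noise_samples), 1)
    return 10.0 * math.log10((p_signal + eps) / (p_noise + eps))


def keep_call(call_samples, noise_samples, snr_threshold=-1.0, amplitude_threshold=0.035):
    """Keep a call only if it passes both the SNR and the amplitude threshold."""
    return (
        snr_db(call_samples, noise_samples) >= snr_threshold
        and max_amplitude(call_samples) >= amplitude_threshold
    )


# Synthetic example values, not real recordings.
noise = [0.01, -0.01, 0.01, -0.01]
loud_call = [0.1, -0.1, 0.1, -0.1]
quiet_call = [0.02, -0.02, 0.02, -0.02]
print(keep_call(loud_call, noise))    # passes both thresholds
print(keep_call(quiet_call, noise))   # fails the amplitude threshold
```

The default values in `keep_call` mirror the thresholds used in the command above, but their interpretation here depends on the assumed SNR definition and waveform scaling.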

## Visualization of model outputs
To visualize, analyze and compare the predictions of two trained models, you can use the following code:

In [None]:
!python -m lemurcalls.visualize_predictions \
  --checkpoint_path "{WHISPERFORMER_CKPT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --pred_label_folder "{WHISPERFORMER_PRED_DIR}" \
  --output_dir "{WHISPERFORMER_VIS_OUT}"

Arguments saved to: /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/visualization/2026-02-26_22-51-18/run_arguments.json
Checkpoint: num_decoder_layers=1, num_head_layers=2, num_classes=3
Detected Whisper size: large
Model loaded (Whisper large)
                                 

The audio recordings are split into parts of 30 seconds duration, and for each part an output with the following four rows is created:
1. The spectrogram as calculated by the original Whisper encoder.
2. For each spectrogram column and each call class, the confidence scores output by the WhisperFormer model. The final predictions above the specified threshold (indicated as red dotted line) are 


![Spec1](U2025_09_24_07_58_20_714-U2025_09_24_07_59_48_091.UBN_v4_segment_00_spectrogram_scores_gt.png)

![Spec2](U2025_09_24_07_58_20_714-U2025_09_24_07_59_48_091.UBN_v4_segment_01_spectrogram_scores_gt copy.png)


![Spec3](U2025_09_24_07_58_20_714-U2025_09_24_07_59_48_091.UBN_v4_segment_01_spectrogram_scores_gt.png)

![Spec4](U2025_09_24_07_58_20_714-U2025_09_24_07_59_48_091.UBN_v4_segment_02_spectrogram_scores_gt copy.png)

![Spec5](U2025_09_24_07_58_20_714-U2025_09_24_07_59_48_091.UBN_v4_segment_02_spectrogram_scores_gt.png)

## Summary



### References

[ActionFormer] 