## Documentation of 'Lemurcalls'

### Introduction

Consistently detecting, segmenting, and classifying animal vocalizations is crucial for the study and conservation of wildlife. Manual annotation, however, is time-consuming and labor-intensive, highlighting the need for reliable automated approaches. Deep learning methods, especially those leveraging transfer learning, have achieved promising results in sound event detection in recent years. Yet performance often declines when models are applied to small datasets with diverse and strong background noise, as is typical for primate vocalizations.

With this code, you can train two types of models that both detect, segment, and classify lemur calls in long audio recordings. The library also lets you assess the performance of the models.
To further improve results, the model predictions can be filtered by applying SNR and maximum-amplitude thresholds. To determine those thresholds, the SNR and amplitude values of the calls in the training set can be plotted. To analyze model performance, the points can be colored by the predicted confidence scores.
Further, you can analyze and compare the model predictions on the test set by plotting the spectrograms as computed by Whisper, the final predicted calls, and, for WhisperFormer, the frame-wise confidence scores.

### Library Structure:

lemurcalls is organized into two main model subpackages, lemurcalls.whisperseg and lemurcalls.whisperformer, each covering training, inference, and evaluation workflows.
Shared data handling and utility logic is centralized in common helper modules so both pipelines use consistent preprocessing and label handling.
In addition, the library provides visualization and postprocessing tools (e.g., SNR/amplitude filtering and precision-recall analysis) to systematically inspect and improve model outputs. A more detailed visualization of the library structure can be seen below:


```text
lemurcalls/
├── README.md
├── pyproject.toml
├── presentation_notebook/
│   └── presentation.ipynb
└── lemurcalls/
    ├── __init__.py
    ├── datautils.py
    ├── audio_utils.py
    ├── utils.py
    ├── visualize_predictions.py
    ├── download_whipser.py
    ├── whisperseg/
    │   ├── __init__.py
    │   ├── model.py
    │   ├── train.py
    │   ├── infer.py
    │   ├── infer_folder.py
    │   ├── evaluate.py
    │   ├── evaluate_metrics.py
    │   ├── training_utils.py
    │   ├── datautils_ben.py
    │   ├── utils.py
    │   └── convert_hf_to_ct2.py
    └── whisperformer/
        ├── __init__.py
        ├── model.py
        ├── dataset.py
        ├── datautils.py
        ├── losses.py
        ├── train.py
        ├── infer.py
        ├── postprocessing/
        │   ├── prec_rec.py
        │   └── filter_labels_by_snr.py
        └── visualization/
            └── scatterplot_ampl_snr_score.py
```

#### Datasets:
The data are audio recordings in the form of .wav files collected at the Affenwald Straußpark in Thüringen. To obtain the data, ring-tailed lemurs were fitted with collars carrying microphones.
The training data was manually annotated with the Raven Pro software: for each detected call, an onset, an offset, and a class label were assigned. Only calls belonging to three specific call types ('moan', 'wail' and 'hmm') were labeled.

The recordings are affected by diverse background noise (e.g., park visitors and traffic from a nearby road) and by the fact that microphones worn by one individual often capture calls from nearby conspecifics. To account for this variation, each call was manually classified into one of three quality classes:
1. Quality 1: Loud, high-quality calls that almost certainly originate from the focal individual.
2. Quality 2: Calls of moderate quality that probably do not originate from the focal individual.
3. Quality 3: Low-quality background calls, including very quiet or distant vocalizations.

The annotated JSON files are expanded to include quality labels and have the following structure:
`{onset:[], offset:[], cluster:[], quality:[]}`.
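
As an illustration of this structure, the following sketch loads one annotation file and checks that the four lists are aligned, so that entry `i` of each list describes the same call. The helper name `load_annotations` and the example values are hypothetical and not part of lemurcalls. It is also an assumption that onsets and offsets are given in seconds and that quality is stored as an integer.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("onset", "offset", "cluster", "quality")


def load_annotations(path):
    """Load one annotation JSON and return a list of per-call dicts."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise KeyError(f"missing keys: {missing}")
    # All four lists must have the same length, one entry per call.
    lengths = {len(data[k]) for k in REQUIRED_KEYS}
    if len(lengths) != 1:
        raise ValueError("onset, offset, cluster and quality must have equal length")
    return [
        {"onset": on, "offset": off, "cluster": c, "quality": q}
        for on, off, c, q in zip(*(data[k] for k in REQUIRED_KEYS))
    ]


# Hypothetical example values, written to a temporary file for the demo.
example = {
    "onset": [0.52, 3.10],
    "offset": [0.91, 3.84],
    "cluster": ["m", "w"],
    "quality": [1, 3],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example, f)
    calls = load_annotations(path)

for call in calls:
    print(call)
```

Working with one dict per call makes it straightforward to filter by quality class before training or evaluation.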

## Explanation of the Models:

WhisperSeg:

WhisperFormer:

## Getting started
First, run the code below to install lemurcalls as an editable package with the dependencies from `pyproject.toml`.

In [None]:
pip install -e . 

The library includes two subpackages, lemurcalls.whisperseg and lemurcalls.whisperformer, as well as a tool to visualize and compare the predictions of the trained models.
In the following, we set some demo paths to demonstrate the functionality of the library. Since training large models requires a GPU, I used a small maximum number of epochs and a single audio file for inference for demonstration purposes.

In [None]:
PROJECT_ROOT = "/projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls"
AUDIO_TEST_DIR = "/mnt/lustre-grete/usr/u17327/final/audios_test"
AUDIO_SINGLE_DIR = "/mnt/lustre-grete/usr/u17327/final/audios_single"
LABEL_TEST_DIR = "/mnt/lustre-grete/usr/u17327/final/jsons_test"

WHISPER_BASE_PATH = f"{PROJECT_ROOT}/whisper_models/whisper_base"
WHISPERSEG_TRAIN_OUT = f"{PROJECT_ROOT}/lemurcalls/whisperseg_models"
WHISPERSEG_MODEL_DIR = f"{PROJECT_ROOT}/lemurcalls/model_folder_ben/final_checkpoint_20251116_163404_ct2"

WHISPERFORMER_CKPT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/best_model.pth"
WHISPERFORMER_PRED_DIR = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/sc/2026-02-20_11-48-55"
WHISPERFORMER_VIS_OUT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/visualization"
WHISPERFORMER_SNR_OUT = "/mnt/lustre-grete/usr/u17327/final/jsons_test_filtered"

## The WhisperSeg Subpackage

To train a WhisperSeg model, run:

In [5]:
!python -m lemurcalls.whisperseg.train \
  --initial_model_path "{WHISPER_BASE_PATH}" \
  --model_folder "{WHISPERSEG_TRAIN_OUT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --num_classes 3 \
  --batch_size 4 \
  --n_threads 1 \
  --num_workers 1 \
  --max_num_epochs 1

Using fixed codebook for 3 class(es): {'m': 0, 't': 1, 'w': 2, 'lt': 1, 'h': 1}
Created 609 training samples after slicing
epoch-000:  26%|████████▏                      | 40/152 [03:37<10:05,  5.41s/it]^C


For inference, run:

In [6]:
!python -m lemurcalls.whisperseg.infer \
  -d "{AUDIO_SINGLE_DIR}" \
  -m "{WHISPERSEG_MODEL_DIR}" \
  -o "{WHISPERSEG_MODEL_DIR}"

Model loaded successfully.
Found 1 wav files in: /mnt/lustre-grete/usr/u17327/final/audios_single
INFO:root:Current file: [  0] /mnt/lustre-grete/usr/u17327/final/audios_single/U2024_09_03_10_03_14_799-U2024_09_03_10_04_42_175.UBN_v2.WAV


For evaluation, run:

## The WhisperFormer Subpackage


To train a WhisperFormer model, run:

In [None]:
!python -m lemurcalls.whisperformer.train \
  --checkpoint_path /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/best_model.pth \
  --model_folder /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new \
  --audio_folder /mnt/lustre-grete/usr/u17327/final/audios_test \
  --label_folder /mnt/lustre-grete/usr/u17327/final/jsons_test \
  --num_classes 3 \
  --batch_size 4 \
  --max_num_epochs 1 \
  --whisper_size large

For inference with a set confidence score threshold, run:

In [None]:
!python -m lemurcalls.whisperformer.infer \
  --checkpoint_path /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/best_model.pth \
  --audio_folder /mnt/lustre-grete/usr/u17327/final/audios_single \
  --output_dir /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/sc \
  --batch_size 4 \
  --iou_threshold 0.4
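
The `--iou_threshold` argument is not explained in this document. Assuming it controls how much two predicted segments may overlap before the lower-scoring one is discarded (non-maximum suppression), the idea can be sketched as follows. The function names and example segments are hypothetical, and the actual logic in `lemurcalls.whisperformer.infer` may differ.

```python
def segment_iou(a, b):
    """Temporal intersection-over-union of two (onset, offset) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(segments, scores, iou_threshold=0.4):
    """Return indices of segments kept after greedy non-maximum suppression."""
    # Visit segments from highest to lowest confidence score.
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a segment only if it does not overlap a kept one too strongly.
        if all(segment_iou(segments[i], segments[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return sorted(keep)


# Hypothetical predictions: the first two segments overlap heavily.
segments = [(1.0, 2.0), (1.1, 2.1), (5.0, 5.5)]
scores = [0.9, 0.6, 0.8]
kept = nms_1d(segments, scores, iou_threshold=0.4)
print([segments[i] for i in kept])
```

Under this assumption, a lower threshold removes more overlapping detections, while a higher threshold keeps more near-duplicates.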

To identify a suitable confidence threshold and assess overall model behavior, you can compute a precision-recall curve across multiple score thresholds.  
This helps you choose a threshold that best matches your objective (e.g., higher precision to reduce false positives, or higher recall to miss fewer calls).

You can also control which label quality classes are considered during evaluation via `eval_mode`:

- `standard`: evaluates with the default quality handling.
- `q3_q2`: applies the quality-aware evaluation strategy where quality classes 2 and 3 are treated differently from class 1.

Choose the mode based on your analysis goal.  
For example, if you mainly care about high-quality focal calls, use `q3_q2`; for a more general benchmark, use `standard`.

In [None]:
WHISPERFORMER_PREC_REC_OUT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/prec_rec"

!python -m lemurcalls.whisperformer.postprocessing.prec_rec \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --pred_folder "{WHISPERFORMER_PRED_DIR}" \
  --output_dir "{WHISPERFORMER_PREC_REC_OUT}" \
  --overlap_tolerance 0.3 \
  --allowed_qualities 1 2 3 \
  --eval_mode q3_q2 \
  --thresholds 0.1 0.15 0.2 0.25 0.3 0.35
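
To make the precision-recall trade-off concrete, here is a minimal sketch of how precision and recall change with the score threshold. It is not the implementation in `prec_rec.py`: the matching rule (greedy one-to-one matching with a minimum temporal IoU, `min_iou`), the function names, and the example values are assumptions chosen for illustration, and class labels and quality classes are ignored.

```python
def iou(a, b):
    """Temporal intersection-over-union of two (onset, offset) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, labels, score_threshold, min_iou=0.5):
    """preds: list of (onset, offset, score); labels: list of (onset, offset)."""
    kept = [p for p in preds if p[2] >= score_threshold]
    matched = set()
    tp = 0
    for p in sorted(kept, key=lambda p: p[2], reverse=True):
        # Greedily match each prediction to the best still-unmatched label.
        best, best_iou = None, min_iou
        for i, lab in enumerate(labels):
            if i in matched:
                continue
            v = iou(p[:2], lab)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    # Convention used here: precision is 1.0 if nothing is predicted.
    precision = tp / len(kept) if kept else 1.0
    recall = tp / len(labels) if labels else 1.0
    return precision, recall


# Hypothetical labels and predictions for one recording.
labels = [(1.0, 2.0), (4.0, 5.0)]
preds = [(1.0, 2.0, 0.9), (4.1, 5.0, 0.3), (7.0, 7.5, 0.2)]
for t in (0.1, 0.25, 0.35):
    p, r = precision_recall(preds, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

In this toy example, raising the threshold first removes the false positive (precision rises) and then removes a correct but low-scoring detection (recall drops), which is the trade-off the curve visualizes.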

## Thresholds and visualization

If the aim is to detect only high-quality calls from the focal animal, it can be useful to apply postprocessing filters, such as SNR and amplitude filters.


To determine appropriate thresholds for your dataset, you can plot the SNR and maximum amplitude of the calls in your test set and color them by quality class. Additionally, you can color the calls according to the confidence score assigned by a final model.

In [None]:
!python -m lemurcalls.whisperformer.visualization.scatterplot_ampl_snr_score \
  --checkpoint_path "{WHISPERFORMER_CKPT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --output_dir "{WHISPERFORMER_VIS_OUT}" \
  --num_classes 3 \
  --batch_size 4

Arguments saved to: /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/visualization/2026-02-27_11-17-53/run_arguments.json
Checkpoint: whisper_size=large, num_decoder_layers=1, num_head_layers=2, num_classes=3
^C                               


![SNR vs Amplitude (Quality)](scatter_snr_vs_amplitude.png)

In our dataset, plotting the SNR against the maximum amplitude of the labeled calls shows that quality classes 2 and 3 cannot easily be separated linearly. Quality class 1, on the other hand, can be separated rather well from the other quality classes with the chosen thresholds.

![SNR vs Amplitude (Model Score)](scatter_snr_vs_amplitude_model_score.png)

To filter existing outputs, use:

In [None]:
!python -m lemurcalls.whisperformer.postprocessing.filter_labels_by_snr \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --output_dir "{WHISPERFORMER_SNR_OUT}" \
  --snr_threshold -1 \
  --amplitude_threshold 0.035
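
The filtering idea can be sketched in a few lines. The definitions below are assumptions, not the ones used by `filter_labels_by_snr`: SNR is taken as the ratio of the mean power of the call to the mean power of a noise-only excerpt (in dB), the amplitude is the peak absolute sample value of a waveform normalized to [-1, 1], and a call is kept only if both values reach their thresholds. All function names and signals are hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(call, noise):
    """SNR in dB: call power relative to a noise-only excerpt (noise must be non-silent)."""
    return 10.0 * math.log10(mean_power(call) / mean_power(noise))


def passes_filters(call, noise, snr_threshold=-1.0, amplitude_threshold=0.035):
    """Keep a call only if both its SNR and its peak amplitude reach the thresholds."""
    peak = max(abs(s) for s in call)
    return snr_db(call, noise) >= snr_threshold and peak >= amplitude_threshold


sr = 16000  # assumed sampling rate for the synthetic demo signals
# Deterministic synthetic signals: low-level "noise" and two "calls".
noise = [0.01 * math.sin(2 * math.pi * 50 * n / sr) for n in range(sr)]
loud_call = [0.2 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
quiet_call = [0.005 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]

print(passes_filters(loud_call, noise))   # loud call is kept
print(passes_filters(quiet_call, noise))  # quiet call is filtered out
```

Requiring both conditions corresponds to keeping only the upper-right region of the SNR vs. amplitude scatterplot.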

## Visualization of model outputs
To visualize the output of the model, you can use the following code:

In [None]:
!python -m lemurcalls.visualize_predictions \
  --checkpoint_path "{WHISPERFORMER_CKPT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --pred_label_folder "{WHISPERFORMER_PRED_DIR}" \
  --output_dir "{WHISPERFORMER_VIS_OUT}"

Arguments saved to: /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/visualization/2026-02-26_22-51-18/run_arguments.json
Checkpoint: num_decoder_layers=1, num_head_layers=2, num_classes=3
Detected Whisper size: large
Model loaded (Whisper large)
                                 

## Summary



### Project Background
This code is part of my master's thesis, *Center-Based Segmentation of Lemur Vocalizations Using the Whisper Audio Foundation Model* (submitted on 08.12.2025), but it was not part of the formal evaluation of this thesis or any other university module.

The code for WhisperSeg (including `util`, `utils.py`, `datautils.py`, and `audio_utils.py`) was adapted from https://github.com/nianlonggu/WhisperSeg for the new dataset. For WhisperFormer, `losses.py` was adapted from https://github.com/happyharrycn/actionformer_release. I used ChatGPT and Cursor for debugging, code and text drafting, and brainstorming.