## Documentation of 'Lemurcalls'

### Project Background
This code is part of my master's thesis, *Center-Based Segmentation of Lemur Vocalizations Using the Whisper Audio Foundation Model* (submitted on 08.12.2025), but it was not part of the formal evaluation of this thesis or any other university module.

The code for [WhisperSeg]—including `util`, `utils.py`, `datautils.py`, and `audio_utils.py`—was adapted from https://github.com/nianlonggu/WhisperSeg for the new dataset. For WhisperFormer, `losses.py` was adapted from https://github.com/happyharrycn/actionformer_release. I used ChatGPT and Cursor for debugging, code and text drafting, and brainstorming.

For more details on the project see `thesis.pdf`.

### Introduction

Consistently detecting, segmenting, and classifying animal vocalizations is crucial for the study and conservation of wildlife. Manual annotation, however, is time-consuming and labor-intensive, highlighting the need for reliable automated approaches. Deep learning methods—especially those leveraging transfer learning—have achieved promising results in Sound Event Detection in recent years. Yet, performance often declines when models are applied to small datasets with diverse and high background noise, as is typical for primate vocalizations. 

With this code, you can train two types of models that detect, segment, and classify lemur calls in long audio recordings. The library also lets you assess the performance of the models.
To further improve results, the model predictions can be filtered by applying SNR and maximum-amplitude thresholds. To determine those thresholds, the SNR and amplitude values of the calls in the training set can be plotted. To analyze model performance, the points can be colored by the predicted confidence scores.
Further, you can analyze and compare the model predictions on the test set by plotting the spectrograms as calculated by Whisper, the final predicted calls and, for WhisperFormer, the frame-wise confidence scores.

### Library Structure:

`lemurcalls` is organized into two main model subpackages, `lemurcalls.whisperseg` and `lemurcalls.whisperformer`, each covering training, inference, and evaluation workflows.
Shared data handling and utility logic is centralized in common helper modules, so both pipelines use consistent preprocessing and label handling.
In addition, the library provides visualization and postprocessing tools (e.g., SNR/amplitude filtering and precision-recall analysis) to systematically inspect and improve model outputs. The project also includes two Jupyter notebooks for easy, web-application-friendly usage (e.g., for biology bachelor students). Unit tests are available in `tests/`.
A more detailed overview of the library structure is shown below.


```text
lemurcalls/
├── README.md
├── pyproject.toml
├── presentation_notebook/
│   └── presentation.ipynb
└── lemurcalls/
    ├── __init__.py
    ├── datautils.py
    ├── audio_utils.py
    ├── utils.py
    ├── visualize_predictions.py
    ├── download_whipser.py
    ├── whisperseg/
    │   ├── __init__.py
    │   ├── model.py
    │   ├── train.py
    │   ├── infer.py
    │   ├── infer_folder.py
    │   ├── evaluate.py
    │   ├── evaluate_metrics.py
    │   ├── training_utils.py
    │   ├── datautils_ben.py
    │   ├── utils.py
    │   └── convert_hf_to_ct2.py
    └── whisperformer/
        ├── __init__.py
        ├── model.py
        ├── dataset.py
        ├── datautils.py
        ├── losses.py
        ├── train.py
        ├── infer.py
        ├── postprocessing/
        │   ├── prec_rec.py
        │   └── filter_labels_by_snr.py
        └── visualization/
            └── scatterplot_ampl_snr_score.py
```

#### Datasets

The dataset consists of `.wav` audio recordings collected at the Affenwald Strausberg Park (Thuringia, Germany) with a total length of 4.8 hours. Each recording was resampled to 16 kHz.
For data acquisition, ring-tailed lemurs were equipped with collars containing microphones.

Training data were manually annotated in Raven Pro. For each detected call, annotators assigned onset time, offset time, and a class label.  
Only three call types similar to the target call `moan` (see [Macedonia]) were included: `moan`, `wail`, and `hmm`.

The recordings contain substantial background noise (e.g., visitor voices, nearby road traffic). In addition, microphones worn by one individual often capture calls from nearby conspecifics.  
To account for this variability, each labeled call was assigned one of three quality classes:

1. **Quality 1**: loud, high-quality calls that very likely originate from the focal individual.  
2. **Quality 2**: medium-quality calls that likely originate from non-focal individuals.  
3. **Quality 3**: low-quality background calls, including very quiet or distant vocalizations.

Annotated labels are stored as `.json` files with the structure:  
`{ "onset": [], "offset": [], "cluster": [], "quality": [] }`.
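A minimal sketch of how such a label file could be read and sanity-checked. It assumes that the four lists are aligned by index, that onsets and offsets are given in seconds, and that quality is stored as an integer; none of this is confirmed here. `load_labels` is a hypothetical helper and not part of `lemurcalls`, and the example values are made up.

```python
import json

REQUIRED_KEYS = ("onset", "offset", "cluster", "quality")


def load_labels(text):
    """Parse a label JSON string and return one dict per annotated call."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # All four lists are assumed to describe the same calls, index by index.
    lengths = {len(data[key]) for key in REQUIRED_KEYS}
    if len(lengths) != 1:
        raise ValueError("all four lists must have the same length")
    calls = []
    for onset, offset, cluster, quality in zip(
        data["onset"], data["offset"], data["cluster"], data["quality"]
    ):
        if offset <= onset:
            raise ValueError(f"offset {offset} is not after onset {onset}")
        calls.append(
            {"onset": onset, "offset": offset, "cluster": cluster, "quality": quality}
        )
    return calls


# Made-up example values, for illustration only.
example = '{"onset": [0.5, 2.0], "offset": [1.1, 2.4], "cluster": ["m", "w"], "quality": [1, 3]}'
calls = load_labels(example)
print(len(calls), calls[0]["cluster"])
```

Turning the parallel lists into one record per call makes later steps, such as filtering by quality class, easier to write.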

### Explanation of the Models

#### WhisperSeg

[WhisperSeg] (Gu et al.) builds on the pretrained [Whisper] Transformer, an automatic speech recognition model trained on 680,000 hours of multilingual supervised speech data.  
The authors show that Whisper can be adapted effectively for animal sound event detection and classification across multiple species, and they provide a multi-species checkpoint.  
As a sequence-to-sequence model, WhisperSeg predicts onset, offset, and class labels for detected calls.

#### WhisperFormer

In contrast to WhisperSeg’s token-generation objective, WhisperFormer directly predicts call centers and regresses onset/offset boundaries.  
This center-based formulation is inspired by object detection and temporal action localization (TAL), especially [CenterNet] and [ActionFormer].

Architecturally, WhisperFormer uses the Whisper encoder, followed by a lightweight decoder and two task-specific heads:
- a **classification head** for class confidence,
- a **regression head** for temporal boundaries.

Following [ActionFormer], training combines [sigmoid] focal loss for classification and [DIoU] loss for regression:

$$
\mathcal{L}_{\text{total}}(CL, S, \tilde{CL}, \tilde{S})
= \frac{1}{T_{+}}
\left(
\mathcal{L}_{\text{class}}(CL, \tilde{CL})
+ \lambda \,\mathbf{1}_{\mathcal{T}_{+}} \,\mathcal{L}_{\text{reg}}(S, \tilde{S})
\right).
$$

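Here, $CL$ and $\tilde{CL}$ denote ground-truth and predicted classes, and $S$ and $\tilde{S}$ denote ground-truth and predicted temporal segments. As in [ActionFormer], $T_{+}$ is the number of positive time steps (those that lie inside a call), the indicator $\mathbf{1}_{\mathcal{T}_{+}}$ restricts the regression loss to these positive time steps, and $\lambda$ balances the two loss terms.  
At inference time, WhisperFormer outputs per-frame class confidence scores and relative onset/offset values; final detections are produced using confidence thresholding and non-maximum suppression ([NMS]).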
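The thresholding and NMS step can be illustrated with a small sketch. This is a generic greedy 1D NMS applied separately per class, not the code in `lemurcalls.whisperformer.infer`, whose decoding may differ. The function names, the dictionary layout, and the score threshold of 0.5 are assumptions. The IoU threshold of 0.4 is only an illustrative value.

```python
def temporal_iou(a, b):
    """IoU of two (onset, offset) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_per_class(detections, score_threshold=0.5, iou_threshold=0.4):
    """Greedy 1D NMS, run separately for each class.

    detections: list of dicts with keys onset, offset, score, cluster.
    """
    kept = []
    for cls in sorted({d["cluster"] for d in detections}):
        # Confidence thresholding, then highest score first.
        candidates = sorted(
            (d for d in detections if d["cluster"] == cls and d["score"] >= score_threshold),
            key=lambda d: d["score"],
            reverse=True,
        )
        selected = []
        for cand in candidates:
            span = (cand["onset"], cand["offset"])
            # Drop a candidate that overlaps an already selected, higher-scoring one.
            if all(temporal_iou(span, (s["onset"], s["offset"])) <= iou_threshold for s in selected):
                selected.append(cand)
        kept.extend(selected)
    return sorted(kept, key=lambda d: d["onset"])


# Made-up candidate detections around one call.
detections = [
    {"onset": 1.00, "offset": 1.60, "score": 0.9, "cluster": "m"},
    {"onset": 1.05, "offset": 1.62, "score": 0.7, "cluster": "m"},
    {"onset": 1.02, "offset": 1.58, "score": 0.6, "cluster": "w"},
    {"onset": 3.00, "offset": 3.40, "score": 0.3, "cluster": "m"},
]
kept = nms_per_class(detections)
print([(d["cluster"], d["onset"]) for d in kept])
```

Because suppression is done per class, the overlapping `w` candidate survives next to the `m` detection, while the second `m` candidate is suppressed and the low-scoring one is removed by the threshold.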


#### Evaluation Metrics:

In line with standard practice in detection tasks, we use the F1 score as the primary evaluation metric for all calls. A prediction is matched with a ground-truth call when their temporal intersection over union (IoU) exceeds a set threshold. Let $TP$ denote the number of true positives, $FP$ the number of false positives, $FN$ the number of false negatives, and $FC$ the number of predictions that match the temporal location of a call but are assigned an incorrect class. Then, precision and recall are defined as:
$$precision = \frac{TP}{TP + FP + FC}$$
and 
$$recall = \frac{TP}{TP+FN+FC}.$$
Precision measures the accuracy of the predicted calls, while recall quantifies the proportion of ground truth calls correctly detected. The F1 score is the harmonic mean of precision and recall, balancing these two aspects:
$$F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}.$$
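The three formulas translate directly into code. The sketch below uses a hypothetical helper, `detection_metrics`, which is not part of the library. It is evaluated on the counts from the example `evaluate_metrics.py` output shown later in this document (TP = 38, FP = 2, FN = 3, FC = 1).

```python
def detection_metrics(tp, fp, fn, fc):
    """Precision, recall, and F1 with class confusions (FC) counted against both."""
    pred_total = tp + fp + fc  # all predictions
    gt_total = tp + fn + fc    # all ground-truth calls
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gt_total if gt_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = detection_metrics(tp=38, fp=2, fn=3, fc=1)
print(f"{precision:.4f} {recall:.4f} {f1:.4f}")  # prints 0.9268 0.9048 0.9157
```

Note that $FC$ appears in both denominators: a prediction with the wrong class counts as an incorrect prediction and as a ground-truth call that was not correctly recovered.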


##### Evaluation by Quality Classes
To account for differences in call quality, we calculate F1 scores with respect to the ground-truth quality classes. We distinguish the following metrics:
1. $F1_{Q1,Q2,Q3}$: F1 score calculated with respect to the ground-truth labels of all quality classes (1, 2, and 3).
2. $F1_{Q1,Q2}$: F1 score calculated with respect to the ground-truth labels of quality classes 1 and 2.
3. $F1_{Q1}$: F1 score calculated with respect to the ground-truth labels of quality class 1 only.

Formally, for any subset of quality classes, the F1 score is computed using the above formula, with precision and recall restricted to the selected quality classes.

A limitation of this approach is that, when the focus is on detecting high-quality calls (quality class 1), a false positive that corresponds to a lower-quality call (Q2 or Q3) may be less severe than a detection that matches no call at all. To address this, we define adjusted F1 metrics:
1. $F1_{Q1,(Q2,Q3)}$: F1 score calculated with respect to the ground-truth labels of quality class 1, but predictions that match Q2 or Q3 calls are counted as neither TPs nor FPs.
2. $F1_{Q1,Q2,(Q3)}$: F1 score calculated with respect to the ground-truth labels of quality classes 1 and 2, but predictions that match Q3 calls are counted as neither TPs nor FPs.

In these adjusted metrics, $TP$, $FN$, and $FC$ remain unchanged compared to $F1_{Q1}$ or $F1_{Q1,Q2}$, but the number of false positives may decrease.
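The effect on the false-positive count can be sketched as follows. This is a simplified illustration, not the logic of `evaluate_metrics.py` or `prec_rec.py`: it ignores class labels and one-to-one matching, all function names are hypothetical, the example intervals are made up, and the IoU threshold of 0.3 is only an illustrative value.

```python
def temporal_iou(a, b):
    """IoU of two (onset, offset) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def matches_any(pred, ground_truth, qualities, iou_threshold):
    """True if the prediction overlaps a ground-truth call of one of the given qualities."""
    span = (pred["onset"], pred["offset"])
    return any(
        gt["quality"] in qualities
        and temporal_iou(span, (gt["onset"], gt["offset"])) > iou_threshold
        for gt in ground_truth
    )


def count_false_positives(predictions, ground_truth, target, ignored=(), iou_threshold=0.3):
    """Count predictions that match no target-quality call.

    Predictions that match a call of an ignored quality are skipped entirely.
    """
    false_positives = 0
    for pred in predictions:
        if matches_any(pred, ground_truth, target, iou_threshold):
            continue  # scored as TP or FC elsewhere
        if matches_any(pred, ground_truth, ignored, iou_threshold):
            continue  # neither TP nor FP in the adjusted metric
        false_positives += 1
    return false_positives


ground_truth = [
    {"onset": 1.0, "offset": 1.5, "quality": 1},
    {"onset": 4.0, "offset": 4.3, "quality": 3},
]
predictions = [
    {"onset": 1.02, "offset": 1.48},  # matches the Q1 call
    {"onset": 4.05, "offset": 4.30},  # matches the Q3 call
    {"onset": 7.00, "offset": 7.40},  # matches nothing
]
fp_standard = count_false_positives(predictions, ground_truth, target={1})
fp_adjusted = count_false_positives(predictions, ground_truth, target={1}, ignored={2, 3})
print(fp_standard, fp_adjusted)  # prints 2 1
```

In the standard Q1 evaluation the prediction on the Q3 call is a false positive; in the adjusted variant it is dropped from the count, so only the prediction without any matching call remains.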

### Getting started
First, run the code below to install `lemurcalls` as an editable package with the dependencies from `pyproject.toml`, plus testing and linting tools.

```python
!pip install -e ".[dev]"
```


The library includes two subpackages, `lemurcalls.whisperseg` and `lemurcalls.whisperformer`, as well as tools to visualize and compare predictions produced by trained models.  
In the following, we define demo paths to showcase the main functionality of the Python package. Since training large models typically requires a GPU, the demonstration uses a small maximum number of epochs and a single audio file for inference.

```python
PROJECT_ROOT = "/projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls"
AUDIO_TEST_DIR = "/mnt/lustre-grete/usr/u17327/final/audios_test"
AUDIO_SINGLE_DIR = "/mnt/lustre-grete/usr/u17327/final/audios_single"
LABEL_TEST_DIR = "/mnt/lustre-grete/usr/u17327/final/jsons_test"

WHISPER_BASE_PATH = f"{PROJECT_ROOT}/whisper_models/whisper_base"
WHISPERSEG_TRAIN_OUT = f"{PROJECT_ROOT}/lemurcalls/whisperseg_models"
WHISPERSEG_MODEL_DIR = (
    f"{PROJECT_ROOT}/lemurcalls/model_folder_ben/final_checkpoint_20251116_163404_ct2"
)

WHISPERFORMER_CKPT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/best_model.pth"
WHISPERFORMER_PRED_DIR = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/sc/2026-02-20_11-48-55"
WHISPERFORMER_VIS_OUT = f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/visualization"
WHISPERFORMER_SNR_OUT = "/mnt/lustre-grete/usr/u17327/final/jsons_test_filtered"

WHISPERFORMER_PREC_REC_OUT = (
    f"{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/prec_rec"
)
```

### The WhisperSeg Subpackage

To train a WhisperSeg model, run:

```python
!python -m lemurcalls.whisperseg.train \
  --initial_model_path "{WHISPER_BASE_PATH}" \
  --model_folder "{WHISPERSEG_TRAIN_OUT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --num_classes 3 \
  --batch_size 4 \
  --n_threads 1 \
  --num_workers 1 \
  --max_num_epochs 1
```

```
Using fixed codebook for 3 class(es): {'m': 0, 't': 1, 'w': 2, 'lt': 1, 'h': 1}
Created 609 training samples after slicing
epoch-000:  99%|█████████████████████████████▊| 151/152 [14:24<00:05,  5.73s/it]
The best checkpoint on validation set is: /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/whisperseg_models/checkpoint-152,
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Saved loss curve to: /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/whisperseg_models/final_checkpoint_20260228_002955_ct2/loss_curve.png
All Done!
```


For inference, run:

```python
!python -m lemurcalls.whisperseg.infer \
  -d "{AUDIO_SINGLE_DIR}" \
  -m "{WHISPERSEG_MODEL_DIR}" \
  -o "{WHISPERSEG_MODEL_DIR}"
```

```
Model loaded successfully.
Found 1 wav files in: /mnt/lustre-grete/usr/u17327/final/audios_single
INFO:root:Current file: [  0] /mnt/lustre-grete/usr/u17327/final/audios_single/U2024_09_03_10_03_14_799-U2024_09_03_10_04_42_175.UBN_v2.WAV
```


For evaluation, you can use either the original WhisperSeg `evaluate.py` or `evaluate_metrics.py`, which also works for WhisperFormer predictions. The latter can be run via:

```python
!python -m lemurcalls.whisperseg.evaluate_metrics \
  --labels "/mnt/lustre-grete/usr/u17327/final/jsons_test/U2024_09_03_10_03_14_799-U2024_09_03_10_04_42_175.UBN_v2.json" \
  --predictions "/path/to/your_prediction_file.json" \
  --overlap_tolerance 0.1
```

Example output of `evaluate_metrics.py` for WhisperSeg model trained on high-quality data:
```
TP: 38
FP: 2
FN: 3
FC: 1
num gt positives: 42
num predicted positives: 41
Precision: 0.9268
Recall:    0.9048
F1-Score:  0.9157
```

### The WhisperFormer Subpackage


To train a WhisperFormer model, run:

```python
!python -m lemurcalls.whisperformer.train \
  --model_folder "{PROJECT_ROOT}/lemurcalls/model_folder_new" \
  --audio_folder "{AUDIO_SINGLE_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --num_classes 3 \
  --batch_size 1 \
  --max_num_epochs 1 \
  --whisper_size large
```

```
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00,  3.35it/s]
Using codebook for 3 class(es): {'m': 0, 't': 1, 'w': 2, 'lt': 1, 'h': 1}
ID to cluster mapping: {0: 'm', 1: 'h', 2: 'w'}
Created 3 training samples after slicing

=== Starting Epoch 0 ===
epoch-000:   0%|                                          | 0/3 [00:00<?, ?it/s]
Epoch 0, Step 0, Training Total Loss: 8.8727
Epoch 0, Step 0, Training Class Loss: 7.7938
Epoch 0, Step 0, Training Regression Loss: 1.0789
epoch-000: 100%|██████████████████████████████████| 3/3 [01:36<00:00, 32.20s/it]
=== End of Epoch 0 ===
Epoch 0, Step 2, Epoch Training Loss: 8.0226
val_ratio = 0, will run validation: False
No validation set (val_ratio = 0)

=== Starting Epoch 1 ===
epoch-001:   0%|                                          | 0/3 [00:00<?, ?it/s]
Epoch 1, Step 0, Training Total Loss: 0.0000
Epoch 1, Step 0, Training Class Loss: 0.0000
Epoch 1, Step 0, Training Regression Loss: 0.0000
epoch-001: 100%|███████████████████████
```

For inference with a fixed confidence score threshold, run:

```python
!python -m lemurcalls.whisperformer.infer \
  --checkpoint_path "{WHISPERFORMER_CKPT}" \
  --audio_folder "{AUDIO_SINGLE_DIR}" \
  --output_dir "{PROJECT_ROOT}/lemurcalls/model_folder_new/final_model_20251205_030535/sc" \
  --batch_size 4 \
  --iou_threshold 0.4
```

```
Checkpoint: whisper_size=large, num_decoder_layers=1, num_head_layers=2, num_classes=3

===== Processing U2024_09_03_10_03_14_799-U2024_09_03_10_04_42_175.UBN_v2.WAV =====
[Run 1] Created 3 slices with offset 0
[Run 2] Created 3 slices with offset 1000
[Run 3] Created 3 slices with offset 2000
✅ Predictions saved to /projects/extern/CIDAS/cidas_digitalisierung_lehre/mthesis_sophie_dierks/dir.project/lemurcalls/lemurcalls/model_folder_new/final_model_20251205_030535/sc/2026-02-28_00-56-05/U2024_09_03_10_03_14_799-U2024_09_03_10_04_42_175.UBN_v2_preds.json
```


To identify a suitable confidence threshold and assess overall model behavior, compute a precision-recall curve across multiple score thresholds.  
This helps you select a threshold that matches your objective (e.g., higher precision to reduce false positives, or higher recall to miss fewer calls).  
The plot also highlights the threshold that achieves the highest $F_1$ score.

You can control which label quality classes are considered via `eval_mode`:

- `standard`: evaluates metrics for the selected quality set (e.g., $F1_{Q1}$, $F1_{Q1,Q2}$, or $F1_{Q1,Q2,Q3}$).
- `q3_q2`: applies the quality-aware evaluation strategy, where quality classes 2 and 3 are treated differently from class 1 (e.g., $F1_{Q1,(Q2,Q3)}$).

```python
!python -m lemurcalls.whisperformer.postprocessing.prec_rec \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --pred_folder "{WHISPERFORMER_PRED_DIR}" \
  --output_dir "{WHISPERFORMER_PREC_REC_OUT}" \
  --overlap_tolerance 0.3 \
  --allowed_qualities 1 \
  --eval_mode standard
```

![PrecRec](presentation_notebook/precision_recall_curve.png)
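The idea behind the threshold sweep can be sketched in a few lines. This is a simplified illustration rather than the implementation in `prec_rec.py`: it assumes each prediction has already been marked as correct or not, omits class confusions, and uses made-up scores. `sweep_thresholds` is a hypothetical helper.

```python
def sweep_thresholds(scored_predictions, num_ground_truth, thresholds):
    """Return (threshold, precision, recall, f1) for every score threshold.

    scored_predictions: list of (score, is_true_positive) pairs.
    """
    rows = []
    for thr in thresholds:
        kept = [is_tp for score, is_tp in scored_predictions if score >= thr]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_ground_truth - tp
        precision = tp / (tp + fp) if kept else 0.0
        recall = tp / (tp + fn) if num_ground_truth else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((thr, precision, recall, f1))
    return rows


# Made-up scores; True marks a prediction that matches a ground-truth call.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.40, False), (0.30, True), (0.20, False),
]
rows = sweep_thresholds(scored, num_ground_truth=5, thresholds=[0.1, 0.3, 0.5, 0.7, 0.9])
best = max(rows, key=lambda row: row[3])  # row with the highest F1
for thr, precision, recall, f1 in rows:
    print(f"thr={thr:.1f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print("best threshold:", best[0])
```

Raising the threshold removes low-scoring predictions, which tends to increase precision and lower recall; the F1-optimal threshold is the point where this trade-off is best balanced for the data at hand.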

### Thresholds and visualization

If the aim is to detect only high-quality calls from the focal animal (as in our application, where the future goal is to equip each individual with its own collar), it can be useful to apply postprocessing filters based on the signal-to-noise ratio (SNR) and the maximum amplitude.

To determine suitable thresholds for your dataset, you can plot the SNR and maximum amplitude of calls in the training set and color the points by quality class.
You can also color the points by the confidence scores produced by a final trained model.


```python
!python -m lemurcalls.whisperformer.visualization.scatterplot_ampl_snr_score \
  --checkpoint_path "{WHISPERFORMER_CKPT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --output_dir "{WHISPERFORMER_VIS_OUT}" \
  --num_classes 3 \
  --batch_size 4
```

![SNR vs Amplitude (Quality)](presentation_notebook/scatter_snr_vs_amplitude.png)

When we plot the SNR and maximum amplitude of the labeled calls in our dataset, quality classes 2 and 3 cannot easily be separated (linearly). Quality class 1, on the other hand, can be separated rather well from the other quality classes with the chosen thresholds.

![SNR vs Amplitude (Model Score)](presentation_notebook/scatter_snr_vs_amplitude_model_score.png)

To filter existing predictions by SNR and maximum amplitude, use:

```python
!python -m lemurcalls.whisperformer.postprocessing.filter_labels_by_snr \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --output_dir "{WHISPERFORMER_SNR_OUT}" \
  --snr_threshold -1 \
  --amplitude_threshold 0.035
```
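The filtering rule behind such thresholds can be sketched as follows. How `filter_labels_by_snr.py` actually computes SNR is not specified here, so this sketch makes several assumptions: SNR is measured in dB as the ratio of call power to the power of a noise-only reference, a call is kept only if both values reach their thresholds, and amplitudes are floats in the range -1 to 1. All function names are hypothetical and the audio is synthetic.

```python
import math

SAMPLE_RATE = 16_000  # Hz, the sampling rate stated for the dataset


def mean_power(samples):
    """Average squared amplitude of a list of float samples."""
    return sum(x * x for x in samples) / len(samples)


def snr_db(call_samples, noise_samples, eps=1e-12):
    """SNR in dB: call power relative to the power of a noise-only reference."""
    ratio = (mean_power(call_samples) + eps) / (mean_power(noise_samples) + eps)
    return 10.0 * math.log10(ratio)


def passes_filters(call_samples, noise_samples, snr_threshold=-1.0, amplitude_threshold=0.035):
    """Keep a call only if both its SNR and its peak amplitude reach the thresholds."""
    peak = max(abs(x) for x in call_samples)
    return snr_db(call_samples, noise_samples) >= snr_threshold and peak >= amplitude_threshold


def tone(amplitude, frequency, seconds=0.1):
    """Synthetic sine tone used as stand-in audio for this example."""
    n = int(SAMPLE_RATE * seconds)
    return [amplitude * math.sin(2 * math.pi * frequency * i / SAMPLE_RATE) for i in range(n)]


noise = tone(0.01, 120)
loud_call = tone(0.2, 440)
faint_call = tone(0.02, 440)

print(passes_filters(loud_call, noise))   # loud call reaches both thresholds
print(passes_filters(faint_call, noise))  # faint call stays below the amplitude threshold
```

The default values mirror the `--snr_threshold -1` and `--amplitude_threshold 0.035` arguments from the command above; suitable values for another dataset would need to be read off its own scatter plot.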

### Visualization of model outputs
To visualize, analyze and compare the predictions of two trained models, you can use the following code:

```python
!python -m lemurcalls.visualize_predictions \
  --checkpoint_path "{WHISPERFORMER_CKPT}" \
  --audio_folder "{AUDIO_TEST_DIR}" \
  --label_folder "{LABEL_TEST_DIR}" \
  --pred_label_folder "{WHISPERFORMER_PRED_DIR}" \
  --output_dir "{WHISPERFORMER_VIS_OUT}"
```

The audio recordings are split into 30-second segments. For each segment, a figure with the following four rows is generated:

1. Mel spectrogram computed by the original Whisper encoder.  
2. WhisperFormer confidence scores for each spectrogram column and call class; final predictions above the selected threshold (red dashed line) are shown as rectangles colored by predicted call class.  
3. Ground-truth labels colored by call class, together with their assigned quality classes.  
4. WhisperSeg predictions colored by predicted call class.
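The 30-second segmentation mentioned above can be sketched as a simple computation of sample ranges at the dataset's 16 kHz sampling rate. This is an illustration only: `segment_bounds` is a hypothetical helper, and keeping a shorter final segment (rather than padding or dropping it) is an assumption about behavior that is not described here.

```python
SAMPLE_RATE = 16_000   # Hz
SEGMENT_SECONDS = 30   # segment length used for the figures


def segment_bounds(num_samples, sample_rate=SAMPLE_RATE, segment_seconds=SEGMENT_SECONDS):
    """Return (start, end) sample indices of consecutive segments."""
    seg_len = sample_rate * segment_seconds
    return [
        (start, min(start + seg_len, num_samples))
        for start in range(0, num_samples, seg_len)
    ]


# A made-up 87-second recording yields two full segments and one shorter one.
bounds = segment_bounds(87 * SAMPLE_RATE)
print(bounds)  # prints [(0, 480000), (480000, 960000), (960000, 1392000)]
```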

#### Output examples generated using models trained only on high-quality calls (Q1)

![Spec2](presentation_notebook/U2025_09_24_07_58_20_714-U2025_09_24_07_59_48_091.UBN_v4_segment_01_spectrogram_scores_gt.png)

The visualization above highlights a common WhisperFormer issue: because NMS is applied per class, and because `moan` and `wail` are inherently difficult to distinguish, both class scores may exceed the threshold for the same event. This can lead to duplicate predictions for a single ground-truth call.  
For WhisperSeg, this example shows that very short calls can also be detected even when they are not present in the ground-truth labels, likely due to the higher temporal resolution of WhisperSeg spectrograms.

![Spec4](presentation_notebook/U2025_09_24_07_58_20_714-U2025_09_24_07_59_48_091.UBN_v4_segment_02_spectrogram_scores_gt.png)

The visualization above shows strong performance by both models. WhisperSeg produces one false positive wail; based on visual inspection of the spectrogram, this may indicate an annotation error and should be manually re-evaluated.

#### Output examples generated using models trained on all quality classes

![Spec6](presentation_notebook/U2024_09_24_12_06_37_702-U2024_09_24_12_08_05_079.UBN_v2_segment_00_spectrogram_scores_gt.png)

Here, we can see that when trained on all data, WhisperFormer is more sensitive than WhisperSeg. Further, calls labeled as quality classes 1 and 2 receive higher WhisperFormer confidence scores than calls from quality class 3.

![Spec9](presentation_notebook/U2024_09_24_12_24_06_562-U2024_09_24_12_25_33_938.UBN_v2_segment_00_spectrogram_scores_gt_all.png)

The visualization above shows that both models struggle to correctly detect highly overlapping calls. The WhisperFormer confidence scores appear to contain additional information that could potentially be used to refine the final onset and offset predictions.

![Spec5](presentation_notebook/U2025_09_24_12_59_26_005-U2025_09_24_13_00_54_382.UBN_v1_segment_02_spectrogram_scores_gt.png)

Here, we can see that the selected confidence threshold for WhisperFormer may be too low, as many unlabeled calls are detected. This example also highlights how difficult it is to consistently annotate faint, low-quality background calls.

### Summary

With this codebase, we showed that both WhisperSeg and WhisperFormer can be trained to automatically detect, segment, and classify lemur calls in long audio recordings, yielding promising results—especially when trained on high-quality data only.  
Using dedicated visualizations, we identified suitable SNR and amplitude thresholds to improve model performance and used precision-recall curves to select an appropriate confidence threshold for WhisperFormer. Additional visual analyses were used to compare model behavior: when trained on data from all quality classes, WhisperFormer appears more sensitive than WhisperSeg, but it also tends to detect very faint calls and occasional background noise.  
These visualizations also helped identify specific failure modes, such as duplicate predictions of two call classes for the same ground-truth call.

### References

- [WhisperSeg] Gu, N., Lee, K., Basha, M., Ram, S. K., You, G., & Hahnloser, R. H. (2024, April). Positive transfer of the whisper speech transformer to human and animal voice activity detection. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7505-7509). IEEE.
- [ActionFormer] Zhang, C. L., Wu, J., & Li, Y. (2022, October). Actionformer: Localizing moments of actions with transformers. In European Conference on Computer Vision (pp. 492-510). Cham: Springer Nature Switzerland.
- [Whisper] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.
- [NMS] Hosang, J., Benenson, R., & Schiele, B. (2017). Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4507-4515).
- [CenterNet] Zhou, X., Wang, D., & Krahenbuhl, P. (2019). Objects as points. arXiv preprint arXiv:1904.07850.
- [Macedonia] Macedonia, J. M. (1993). The vocal repertoire of the ring-tailed lemur (Lemur catta). Folia Primatologica, 61(4), 186-217.
- [sigmoid] Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980-2988).
- [DIoU] Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., & Ren, D. (2020, April). Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 07, pp. 12993-13000).