Release v3.6.0: Audio Anomaly Detection Modality #694
New AudioFeatureEncoder (74-dim handcrafted acoustic features: 20 MFCC, 12 chroma, and 5 spectral descriptors, each summarized as mean and std over frames, via librosa), registered as the 'audio-mfcc' encoder and exposed through EmbeddingOD.for_audio(). New AudioAE detector (DCASE-style log-mel reconstruction autoencoder reusing the PyOD AutoEncoder with clip-level aggregation). ADEngine profiles and routes audio file paths; the knowledge base gains an AudioAE entry plus audio support on EmbeddingOD and MultiModalOD. New optional extra pyod[audio] (librosa, soundfile). Buildable detector count rises from 60 to 61. The implementation references public methods (DCASE 2020 Task 2 baseline; MFCC, chroma, and spectral features via librosa), not any specific manuscript. Reviewed via /implement-review (Codex, no High findings). 18 new audio tests; the KB count check and pyod info confirm 61 detectors (3 audio).
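The 74-dim figure follows from pooling 37 per-frame descriptors (20 + 12 + 5) with two statistics each. A minimal sketch of that pooling step, assuming the per-frame descriptors are stacked as rows of one matrix; `pool_frames` is a hypothetical name and the input is random stand-in data, not real librosa output or the PR's code:

```python
import numpy as np

# Per-frame descriptor counts taken from the PR description.
N_MFCC, N_CHROMA, N_SPECTRAL = 20, 12, 5


def pool_frames(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a (n_descriptors, n_frames) matrix into one clip-level vector.

    Hypothetical helper: concatenates each descriptor's mean and std over frames.
    """
    mean = frame_features.mean(axis=1)
    std = frame_features.std(axis=1)
    return np.concatenate([mean, std]).astype(np.float32)


# Stand-in for the stacked MFCC + chroma + spectral rows of one clip.
rng = np.random.default_rng(0)
frames = rng.normal(size=(N_MFCC + N_CHROMA + N_SPECTRAL, 120))
clip_vector = pool_frames(frames)
print(clip_vector.shape)  # (74,)
```

Because the pooling runs over the frame axis, clips of different durations all map to the same fixed-length vector.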
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ccb8021b1
pad = np.zeros((n_mels, context - n_frames), dtype=spec.dtype)
spec = np.concatenate([spec, pad], axis=1)
Pad short spectrograms at the silence floor
For clips shorter than context, this pads after librosa.power_to_db with 0 dB columns. In the resulting log-mel representation, 0 dB is the reference/max level rather than silence, so short audio clips get artificial high-energy frames during both training and scoring. This affects the documented short-clip path; pad before dB conversion or fill with the spectrogram floor/minimum instead.
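One way to apply the second suggestion (fill with the spectrogram minimum) might look like the sketch below. It assumes a max-referenced dB spectrogram of shape (n_mels, n_frames) with at least one frame; `pad_to_context` is a hypothetical helper, not the function in the PR:

```python
import numpy as np


def pad_to_context(spec_db: np.ndarray, context: int) -> np.ndarray:
    """Right-pad a (n_mels, n_frames) dB spectrogram to `context` frames.

    Pads with the spectrogram's own minimum (its quietest bin) instead of 0 dB,
    so the padded frames read as silence rather than as peak energy.
    """
    n_mels, n_frames = spec_db.shape
    if n_frames >= context:
        return spec_db
    floor = spec_db.min()  # assumption: the clip minimum is an acceptable silence floor
    pad = np.full((n_mels, context - n_frames), floor, dtype=spec_db.dtype)
    return np.concatenate([spec_db, pad], axis=1)


# Toy dB spectrogram with values in [-80, 0], i.e. 0 dB is the loudest bin.
short = np.linspace(-80.0, 0.0, 12, dtype=np.float32).reshape(4, 3)
padded = pad_to_context(short, context=5)
print(padded.shape)  # (4, 5)
```

Padding the power spectrogram before dB conversion, the other option named in the review, would avoid depending on the clip's own minimum.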
return np.stack(windows).astype(np.float32)

class AudioAE(BaseDetector):
Add a list-safe predict_proba override
AudioAE documents and accepts list inputs such as waveforms and file paths, but it inherits BaseDetector.predict_proba, which allocates probabilities with X.shape[0]. After AudioAE().fit(clips), calling predict_proba(clips) on those documented list inputs raises AttributeError; this class needs the same kind of list-aware override that EmbeddingOD provides.
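The core of such an override is to size the output from the computed scores rather than from `X.shape[0]`. The sketch below illustrates this on a toy class; `ToyClipDetector` and its scoring rule are hypothetical, and the min-max scaling against training scores is an assumed conversion for illustration, not a copy of the PyOD implementation:

```python
import numpy as np


class ToyClipDetector:
    """Hypothetical stand-in for a detector that accepts lists of clips."""

    def fit(self, clips):
        self.decision_scores_ = self.decision_function(clips)
        return self

    def decision_function(self, clips):
        # Toy score: mean absolute amplitude per clip (higher = more anomalous).
        return np.array([float(np.mean(np.abs(c))) for c in clips])

    def predict_proba(self, clips):
        # Size the output from the scores, not from clips.shape[0],
        # so plain Python lists (which have no .shape) are handled.
        scores = self.decision_function(clips)
        lo, hi = self.decision_scores_.min(), self.decision_scores_.max()
        span = hi - lo if hi > lo else 1.0  # guard against constant training scores
        outlier = np.clip((scores - lo) / span, 0.0, 1.0)
        probs = np.zeros((len(scores), 2))
        probs[:, 1] = outlier        # column 1: outlier probability
        probs[:, 0] = 1.0 - outlier  # column 0: inlier probability
        return probs


# Clips of unequal length, passed as a plain list.
clips = [np.full(100, 0.1), np.full(250, 0.2), np.full(80, 0.9)]
proba = ToyClipDetector().fit(clips).predict_proba(clips)
print(proba.shape)  # (3, 2)
```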
if self._looks_like_audio_paths(sample[:5]):
    return 'audio'
Detect waveform audio before tabular fallback
This new audio sniffing only runs inside the all-strings branch, so the other documented audio inputs added here—lists of waveform arrays or (waveform, sample_rate) tuples accepted by AudioFeatureEncoder and AudioAE—still fall through to tabular. In ADEngine's default flow, profile_data([waveform1, waveform2, ...]) therefore plans tabular detectors instead of EmbeddingOD.for_audio/AudioAE, and unequal-length clips can fail during the numeric np.asarray profiling step; add a conservative waveform/tuple check before the tabular fallback.
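A conservative check of the kind suggested here could be sketched as follows. `looks_like_waveforms` and its `min_samples` threshold are hypothetical; the threshold is an assumed heuristic to avoid mistaking short numeric rows for audio, not a value from the PR:

```python
import numpy as np


def looks_like_waveforms(sample, min_samples: int = 256) -> bool:
    """Return True only if every item is a 1-D float array, or a
    (1-D float array, integer sample_rate) tuple, with at least
    `min_samples` samples. Anything else falls back to other sniffers.
    """
    if not isinstance(sample, (list, tuple)) or len(sample) == 0:
        return False
    for item in sample:
        if isinstance(item, tuple):
            # Expect exactly (waveform, sample_rate).
            if len(item) != 2 or not isinstance(item[1], (int, np.integer)):
                return False
            item = item[0]
        if not isinstance(item, np.ndarray):
            return False
        if item.ndim != 1 or not np.issubdtype(item.dtype, np.floating):
            return False
        if item.shape[0] < min_samples:
            return False
    return True


waveforms = [np.zeros(1000, dtype=np.float32), np.zeros(1500, dtype=np.float32)]
print(looks_like_waveforms(waveforms))                 # True
print(looks_like_waveforms([[1.0, 2.0], [3.0, 4.0]]))  # False
```

Equal-length float arrays remain ambiguous with tabular rows, which is why the sketch rejects anything short and accepts only `np.ndarray` items; a stricter variant could additionally require unequal lengths.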
Coverage Report for CI Build 26982041698 (Coveralls)
Coverage decreased (-1.2%) to 92.647%.
Coverage Regressions: 1 previously-covered line in 1 file lost coverage.
v3.6.0: Audio Anomaly Detection Modality
Adds audio as a first-class modality on the agentic and multimodal line, entirely additively (no change to existing tabular, text, or image paths).
New in v3.6.0
- AudioFeatureEncoder (pyod/utils/encoders/audio.py): each clip becomes a 74-dim handcrafted acoustic vector (20 MFCC, 12 chroma, 5 spectral descriptors, each as mean and std over frames, via librosa). Registered as the audio-mfcc encoder.
- AudioAE (pyod/models/audio_ae.py): DCASE-style log-mel reconstruction autoencoder that reuses the PyOD AutoEncoder with per-clip mean reconstruction error. Torch-gated.
- ADEngine: audio profiling (_sniff_data_type, profile_data) and routing (for_audio as default, AudioAE as the deep alternative).
- Knowledge base: AudioAE entry; audio added to EmbeddingOD and MultiModalOD.
- Optional extra: pyod[audio] (librosa, soundfile).
Counts
Buildable detector count rises from 60 to 61.
pyod info: 61 total (43 tabular, 7 time-series, 8 graph, 2 text, 2 image, 1 multimodal, 3 audio).
Tests and Review
18 new audio tests (synthetic waveforms; torch-gated deep tests skip without torch). KB count-consistency checks and regen_skill --check pass. Reviewed via /implement-review (Codex, no High findings; one Medium and two Low fixed). References the public methods (DCASE 2020 Task 2 baseline; MFCC, chroma, and spectral features via librosa).
No breaking API changes.