A pipeline tool for processing raw audio into structured datasets suitable for training speech language models. Handles the full workflow: audio discovery, segmentation, transcription, feature extraction, QA generation, and export.
Training speech LLMs requires large-scale instruction tuning datasets pairing audio with text annotations (transcripts, QA pairs, audio descriptions). Building these datasets from scratch involves a lot of boilerplate: finding audio, cleaning it, segmenting long recordings, running ASR, generating diverse QA, and formatting everything. audiodataforge automates this.
```shell
pip install -r requirements.txt
```

```shell
# 1. Download some evaluation data
python scripts/download_librispeech.py --subset test-clean --max-samples 100

# 2. Run the pipeline
python cli.py run ./raw_data/librispeech/test-clean -c configs/default.yaml

# 3. Check the results
python cli.py stats ./output/train/metadata.jsonl
python cli.py preview ./output/train/metadata.jsonl
```

To scan a directory of audio files, filtering by duration:

```shell
python cli.py scan /path/to/audio --min-dur 2 --max-dur 20
```

- Load — Discover audio files, resample to target SR, filter by duration
- Segment — Split long recordings using VAD (Silero) or silence detection
- Quality Filter — Remove clipped, noisy, or too-short segments
- Annotate — Run ASR (Whisper), extract pitch/energy/rate features
- QA Generation — Create diverse question-answer pairs from templates
- Export — Save as HuggingFace Dataset or JSONL manifest
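The load and QA stages above can be sketched in a few lines of Python. This is an illustrative toy, not audiodataforge's actual API: the `Sample` class and `run_pipeline` function here are hypothetical.

```python
# Toy sketch of the load + QA-generation stages; names are
# illustrative placeholders, not audiodataforge's real API.
from dataclasses import dataclass, field

@dataclass
class Sample:
    path: str
    duration: float
    transcript: str = ""
    qa_pairs: list = field(default_factory=list)

def run_pipeline(samples, min_dur=1.0, max_dur=30.0):
    # Load: keep only samples within the configured duration range
    kept = [s for s in samples if min_dur <= s.duration <= max_dur]
    # QA generation: attach a template question per sample
    for s in kept:
        s.qa_pairs.append({
            "question": "How long is this audio?",
            "answer": f"About {s.duration} seconds.",
            "type": "duration",
        })
    return kept

out = run_pipeline([Sample("a.wav", 5.2), Sample("b.wav", 0.4)])
```

The 0.4 s sample is dropped by the duration filter; the surviving sample carries one template QA pair, mirroring the real pipeline's flow at a much smaller scale.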
Edit configs/default.yaml or create your own:
```yaml
audio:
  target_sr: 16000
  min_duration: 1.0
  max_duration: 30.0
segment:
  method: vad  # vad | silence | fixed
annotation:
  asr_model: openai/whisper-small
  generate_qa: true
  max_qa_per_sample: 5
export:
  format: huggingface  # huggingface | jsonl
  output_dir: ./output
  train_split: 0.9
```

HuggingFace export creates:
```
output/
├── train/
│   ├── audio/
│   │   ├── 000000.wav
│   │   └── ...
│   └── metadata.jsonl
├── test/
│   └── ...
└── dataset_info.json
```
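Since `file_name` in each metadata record is relative to its split directory, a downstream loader can resolve full audio paths directly. A minimal sketch (`load_manifest` is a hypothetical helper, not part of the tool):

```python
import json
import os

def load_manifest(split_dir):
    # Each line of metadata.jsonl is one JSON record whose "file_name"
    # is relative to the split directory (e.g. output/train).
    records = []
    with open(os.path.join(split_dir, "metadata.jsonl")) as f:
        for line in f:
            rec = json.loads(line)
            rec["audio_path"] = os.path.join(split_dir, rec["file_name"])
            records.append(rec)
    return records
```

From the returned records, aggregate statistics follow directly, e.g. total audio hours as `sum(r["duration"] for r in records) / 3600`.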
Each line in metadata.jsonl:
```json
{
  "file_name": "audio/000000.wav",
  "duration": 5.2,
  "transcript": "the quick brown fox",
  "qa_pairs": [
    {"question": "What is being said?", "answer": "the quick brown fox", "type": "transcription"},
    {"question": "How long is this audio?", "answer": "About 5.2 seconds.", "type": "duration"}
  ],
  "features": {"pitch_mean": 142.3, "energy_mean": 0.032}
}
```

Built during my time at the SJTU X-LANCE lab. Inspired by the need for standardized speech instruction data in our lab projects.
MIT