A pipeline tool for processing raw audio into structured datasets suitable for training speech language models. Handles the full workflow: audio discovery, segmentation, transcription, feature extraction, QA generation, and export.
Training speech LLMs requires large-scale instruction tuning datasets pairing audio with text annotations (transcripts, QA pairs, audio descriptions). Building these datasets from scratch involves a lot of boilerplate: finding audio, cleaning it, segmenting long recordings, running ASR, generating diverse QA, and formatting everything. audiodataforge automates this.
```shell
pip install -r requirements.txt
```

```shell
# 1. Download some evaluation data
python scripts/download_librispeech.py --subset test-clean --max-samples 100

# 2. Run the pipeline
python cli.py run ./raw_data/librispeech/test-clean -c configs/default.yaml

# 3. Check the results
python cli.py stats ./output/train/metadata.jsonl
python cli.py preview ./output/train/metadata.jsonl
```

To scan a directory of audio files, filtering by duration:

```shell
python cli.py scan /path/to/audio --min-dur 2 --max-dur 20
```

- Load — Discover audio files, resample to target SR, filter by duration
- Segment — Split long recordings using VAD (Silero) or silence detection
- Quality Filter — Remove clipped, noisy, or too-short segments
- Annotate — Run ASR (Whisper), extract pitch/energy/rate features
- QA Generation — Create diverse question-answer pairs from templates
- Export — Save as HuggingFace Dataset or JSONL manifest
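The load and QA stages above can be sketched in a few lines of Python. This is an illustrative toy, not audiodataforge's actual API: the `Sample` class and `run_pipeline` function here are hypothetical.

```python
# Toy sketch of the load + QA-generation stages; names are
# illustrative placeholders, not audiodataforge's real API.
from dataclasses import dataclass, field

@dataclass
class Sample:
    path: str
    duration: float
    transcript: str = ""
    qa_pairs: list = field(default_factory=list)

def run_pipeline(samples, min_dur=1.0, max_dur=30.0):
    # Load: keep only samples within the configured duration range
    kept = [s for s in samples if min_dur <= s.duration <= max_dur]
    # QA generation: attach a template question per sample
    for s in kept:
        s.qa_pairs.append({
            "question": "How long is this audio?",
            "answer": f"About {s.duration} seconds.",
            "type": "duration",
        })
    return kept

out = run_pipeline([Sample("a.wav", 5.2), Sample("b.wav", 0.4)])
```

The 0.4 s sample is dropped by the duration filter; the surviving sample carries one template QA pair, mirroring the real pipeline's flow at a much smaller scale.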
Edit configs/default.yaml or create your own:
```yaml
audio:
  target_sr: 16000
  min_duration: 1.0
  max_duration: 30.0
segment:
  method: vad  # vad | silence | fixed
annotation:
  asr_model: openai/whisper-small
  generate_qa: true
  max_qa_per_sample: 5
export:
  format: huggingface  # huggingface | jsonl
  output_dir: ./output
  train_split: 0.9
```

HuggingFace export creates:
```
output/
├── train/
│   ├── audio/
│   │   ├── 000000.wav
│   │   └── ...
│   └── metadata.jsonl
├── test/
│   └── ...
└── dataset_info.json
```
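Since `file_name` in each metadata record is relative to its split directory, a downstream loader can resolve full audio paths directly. A minimal sketch (`load_manifest` is a hypothetical helper, not part of the tool):

```python
import json
import os

def load_manifest(split_dir):
    # Each line of metadata.jsonl is one JSON record whose "file_name"
    # is relative to the split directory (e.g. output/train).
    records = []
    with open(os.path.join(split_dir, "metadata.jsonl")) as f:
        for line in f:
            rec = json.loads(line)
            rec["audio_path"] = os.path.join(split_dir, rec["file_name"])
            records.append(rec)
    return records
```

From the returned records, aggregate statistics follow directly, e.g. total audio hours as `sum(r["duration"] for r in records) / 3600`.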
Each line in metadata.jsonl:
```json
{
  "file_name": "audio/000000.wav",
  "duration": 5.2,
  "transcript": "the quick brown fox",
  "qa_pairs": [
    {"question": "What is being said?", "answer": "the quick brown fox", "type": "transcription"},
    {"question": "How long is this audio?", "answer": "About 5.2 seconds.", "type": "duration"}
  ],
  "features": {"pitch_mean": 142.3, "energy_mean": 0.032}
}
```

Built during my time at the SJTU X-LANCE lab. Inspired by the need for standardized speech instruction data in our lab projects.
MIT