audiodataforge — build speech instruction tuning datasets

A pipeline tool for processing raw audio into structured datasets suitable for training speech language models. Handles the full workflow: audio discovery, segmentation, transcription, feature extraction, QA generation, and export.

Why?

Training speech LLMs requires large-scale instruction tuning datasets pairing audio with text annotations (transcripts, QA pairs, audio descriptions). Building these datasets from scratch involves a lot of boilerplate: finding audio, cleaning it, segmenting long recordings, running ASR, generating diverse QA, and formatting everything. audiodataforge automates this.

Install & Run

pip install -r requirements.txt

Quick start

# 1. Download some evaluation data
python scripts/download_librispeech.py --subset test-clean --max-samples 100

# 2. Run the pipeline
python cli.py run ./raw_data/librispeech/test-clean -c configs/default.yaml

# 3. Check the results
python cli.py stats ./output/train/metadata.jsonl
python cli.py preview ./output/train/metadata.jsonl

Scan audio directory

python cli.py scan /path/to/audio --min-dur 2 --max-dur 20
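The scan step's duration filter can be approximated with the standard library alone. This is a minimal sketch, not the actual `cli.py scan` implementation (which likely uses a proper audio library to support formats beyond WAV); the function name `scan_wavs` is hypothetical:

```python
import wave
from pathlib import Path

def scan_wavs(root, min_dur=2.0, max_dur=20.0):
    """Yield (path, duration) for WAV files whose length is in range."""
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wf:
            # duration = frame count / sample rate
            duration = wf.getnframes() / wf.getframerate()
        if min_dur <= duration <= max_dur:
            yield path, duration
```

The `--min-dur 2 --max-dur 20` flags in the command above correspond to the `min_dur`/`max_dur` bounds here.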

Pipeline Stages

  1. Load — Discover audio files, resample to target SR, filter by duration
  2. Segment — Split long recordings using VAD (Silero) or silence detection
  3. Quality Filter — Remove clipped, noisy, or too-short segments
  4. Annotate — Run ASR (Whisper), extract pitch/energy/rate features
  5. QA Generation — Create diverse question-answer pairs from templates
  6. Export — Save as HuggingFace Dataset or JSONL manifest
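The six stages above compose naturally as list-to-list transforms. A hedged sketch of that pattern (the real pipeline's internals are not shown in this README; `run_pipeline` and `quality_filter` are illustrative names, not the project's API):

```python
def run_pipeline(samples, stages):
    """Run a list of sample dicts through each stage in order.

    Each stage takes a list and returns a (possibly shorter or
    longer) list: filters drop samples, the segmenter may emit
    several segments per input recording.
    """
    for stage in stages:
        samples = stage(samples)
    return samples

# Hypothetical filter stage: drop segments below a minimum duration.
def quality_filter(samples, min_dur=1.0):
    return [s for s in samples if s["duration"] >= min_dur]
```

Ordering matters: filtering before annotation avoids spending ASR compute on segments that would be discarded anyway.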

Configuration

Edit configs/default.yaml or create your own:

audio:
  target_sr: 16000
  min_duration: 1.0
  max_duration: 30.0

segment:
  method: vad  # vad | silence | fixed

annotation:
  asr_model: openai/whisper-small
  generate_qa: true
  max_qa_per_sample: 5

export:
  format: huggingface  # huggingface | jsonl
  output_dir: ./output
  train_split: 0.9
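The YAML above parses to nested dicts, so a custom config only needs the keys that differ from the defaults. A sketch of the overlay logic, assuming recursive merge semantics (the `merge_config` helper and `DEFAULTS` literal are illustrative; key names are taken from the config above):

```python
def merge_config(defaults, overrides):
    """Recursively overlay user settings on the default config."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

DEFAULTS = {
    "audio": {"target_sr": 16000, "min_duration": 1.0, "max_duration": 30.0},
    "export": {"format": "huggingface", "output_dir": "./output", "train_split": 0.9},
}

# A user config that only switches the export format:
cfg = merge_config(DEFAULTS, {"export": {"format": "jsonl"}})
```

Untouched keys (`train_split`, the whole `audio` section) keep their default values.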

Output Format

HuggingFace export creates:

output/
├── train/
│   ├── audio/
│   │   ├── 000000.wav
│   │   └── ...
│   └── metadata.jsonl
├── test/
│   └── ...
└── dataset_info.json

Each line in metadata.jsonl:

{
  "file_name": "audio/000000.wav",
  "duration": 5.2,
  "transcript": "the quick brown fox",
  "qa_pairs": [
    {"question": "What is being said?", "answer": "the quick brown fox", "type": "transcription"},
    {"question": "How long is this audio?", "answer": "About 5.2 seconds.", "type": "duration"}
  ],
  "features": {"pitch_mean": 142.3, "energy_mean": 0.032}
}
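Because the manifest is one JSON object per line, it can be consumed without loading the whole dataset into memory. A sketch of the kind of summary `cli.py stats` might report, using only the fields shown above (the exact output of the real command is not documented here):

```python
import json

def dataset_stats(jsonl_path):
    """Summarize a metadata.jsonl manifest: sample count, total
    audio hours, and average QA pairs per sample."""
    n = 0
    total_dur = 0.0
    total_qa = 0
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            n += 1
            total_dur += rec["duration"]
            total_qa += len(rec.get("qa_pairs", []))
    return {
        "samples": n,
        "hours": total_dur / 3600,
        "avg_qa": total_qa / n if n else 0.0,
    }
```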

Acknowledgements

Built during my time at SJTU X-LANCE lab. Inspired by the need for standardized speech instruction data in our lab projects.

License

MIT
