Release v0.3.0 · XqFeng-Josie/VoxKitchen

Added

Published the voxkitchen launcher to PyPI. Users install with
pipx install voxkitchen (or pip install voxkitchen) instead of fetching
a GitHub archive zip. A publish GitHub Actions workflow builds wheel +
sdist on every v* tag and uploads via PyPI Trusted Publishing (OIDC, no
stored API tokens). The workflow also exposes a manual workflow_dispatch
for TestPyPI dry-runs.
PyPI version badge in README.md.
Pipeline YAML interpolation gained POSIX-style fallbacks. ${env:VAR:-foo}
uses foo when VAR is unset or empty; ${env:VAR:?msg} raises a
PipelineLoadError with msg when VAR is missing. The original
${env:VAR} form still fails loudly on unset variables and remains the
default for required tokens.
vkit operators gained two new ways to navigate the 51-operator catalog:
vkit operators --category <cat> shows a single section (audio, segment,
augment, annotate, quality, synthesize, pack, noop) and
vkit operators search <keyword> lists operators whose name or first-line
docstring contains <keyword> (case-insensitive). Empty matches exit with
code 1 so scripts can branch on no-result.
vkit download and vkit ingest --source recipe now warn host users when
invoked outside a managed runtime, mirroring the existing vkit run
warning. The recipe-side dependencies (e.g. datasets) live in the Docker
images, not in the PyPI launcher, so the previous silent failure mode is
replaced with a pointer to vkit docker download <recipe>. vkit ingest --source dir and --source manifest stay quiet — they work on the host
with only the lightweight launcher deps.
vkit schema export writes a JSON Schema for pipeline.yaml files,
derived from PipelineSpec.model_json_schema() plus the registered
operators. A snapshot is committed at docs/schemas/pipeline.schema.json
and served via raw.githubusercontent.com. vkit init now writes a
# yaml-language-server: $schema=… directive at the top of every
scaffolded pipeline.yaml, so VS Code, Neovim, and JetBrains users get
autocompletion on operator names and inline validation of the spec
structure out of the box. See docs/reference/schema.md for editor setup.
TTS tutorials split into three focused pages so each can grow
independently as new engines land. tts-data-prep.md renamed to
tts-training-data.md (quality gate for raw recordings used to train
a TTS model). The combined tts-synthesis.md was split into
tts-speaker.md (built-in / seed-sampled voices — tts_kokoro,
tts_chattts, tts_cosyvoice sft mode) and tts-voice-cloning.md
(short-reference cloning — tts_cosyvoice zero_shot /
cross_lingual, tts_fish_speech). Each tutorial carries its own
capability matrix filtered to the relevant engines, a Quick Start in
both Python and YAML, per-engine config snippets, and a decision flow.
mkdocs.yml, docs/index.md, README.md "What You Can Build", and
the skill operator notes were updated to match.
Five new ingest recipes complete the "common dataset" coverage,
bringing the total registered to 9:
- ljspeech — single-speaker English TTS baseline (24 h, 13.1k
  utterances), downloaded from data.keithito.com. Prefers
  normalized text over raw; preserves the raw form only when
  normalization changed it.
- aishell3 — multi-speaker Mandarin TTS (218 speakers, ~85 h),
  downloaded from OpenSLR/93. Splits the interleaved character +
  pinyin content.txt into supervision text (chars) and
  cut.custom["pinyin"]; enriches gender from spk-info.txt.
- libritts — multi-speaker English TTS derived from LibriSpeech
  (OpenSLR/60). Prefers *.normalized.txt over *.original.txt;
  enriches gender from speakers.tsv. Same seven-subset
  partitioning as LibriSpeech.
- cnceleb — CN-Celeb 1, Chinese speaker recognition (~130k
  utterances, 1000 speakers, 11 genres), from OpenSLR/82.
  Empty-text Supervisions carry speaker / language tags. Subsets
  data / dev / eval follow the canonical splits; overlapping
  subsets deduplicate.
- musan — MUSAN augmentation corpus (~10 GB of non-transcribed
  noise / music / speech), from OpenSLR/17. Closes the loop with
  the existing noise_augment operator. Subsets pick which of the
  three top-level categories to ingest; sub-categories are
  preserved as cut.custom["musan_subcategory"].
Every auto-downloadable recipe ships HEAD-probed compressed-size
metadata in a new download_sizes dict on the Recipe class. The
size is surfaced in three places: a new Size column in
vkit recipes, a per-subset "downloading X GB" log line inside
Recipe.download(), and the table in docs/reference/recipes.md.
Multi-subset recipes (LibriSpeech / LibriTTS / AISHELL-1) show a
range like 299 MB - 28.5 GB. Manual / HuggingFace recipes render
as a dash.

Changed

The TTS training-data tutorial (tts-training-data.md) now opens
with an explicit "quality gate" framing and ends with a Quality
Checklist summarizing the five thresholds (sample rate, duration,
SNR, text present, alignment present).
The Download column of vkit recipes now derives the source label
from each recipe's URL host (keithito / openslr / huggingface /
bare hostname) instead of hard-coding "openslr" for every recipe
with download_urls.
Docker image builds now reuse wheel and layer caches across runs.
docker/Dockerfile mounts a BuildKit cache at /root/.cache/uv on
every RUN uv pip install, so torch / transformers / funasr wheels
download once instead of being refetched in each of the five venvs
(UV_NO_CACHE=1 removed). scripts/release.sh now builds through
docker buildx against a dedicated voxkitchen-builder (auto-created
with the docker-container driver), and pushes layer descriptors to
ghcr.io/xqfeng-josie/voxkitchen:buildcache-<target> via
--cache-to type=registry,mode=max. First build from a cold cache
still takes 1-2 h; subsequent rebuilds where only application source
changed land in single-digit minutes per target. New "Faster
rebuilds" section in docs/docker-build.md walks through both cache
layers, the manual docker buildx command to consume the public
registry cache from a fork, expected timings per scenario, and which
changes invalidate which layers.

Fixed

Wheel no longer ships duplicate copies of
voxkitchen/templates/pipelines/*.yaml. The redundant
[tool.hatch.build.targets.wheel.force-include] block was removed;
hatchling already includes non-Python files inside the package
directory.
Untagged dev builds (e.g. the publish workflow's manual TestPyPI
dispatch) now produce PEP 440-compliant versions like
0.2.1.dev5 instead of 0.2.1.dev5+g<sha>. PyPI and TestPyPI both
reject the latter form. Set via local_scheme = "no-local-version"
in [tool.hatch.version].raw-options.
Removed [wenet], [speaker], and [tts-fish-speech]
optional-dependency groups from pyproject.toml because they
declared pkg @ git+... direct references, which PyPI rejects on
upload. The Docker images now install these from git inside their
respective RUN lines (fish-speech joins wenet's existing
pattern in docker/Dockerfile). Operators that declare
required_extras = ["wenet"|"tts-fish-speech"] still route
correctly via EXTRA_TO_ENV in
voxkitchen/runtime/env_resolver.py.
vkit run now exits with code 1 (runtime failure) when a stage
raises StageFailedError, matching the rest of the CLI's exit-code
convention. Previously it returned code 2, which the codebase
reserves for invocation errors (unknown flag, missing docker
binary, unknown operator category).
vkit ingest inline error messages now use the same error:
prefix the rest of the CLI prints, instead of rendering the whole
line in red without context.
vkit inspect cuts <missing> no longer dumps a full Python
traceback; it prints a one-line
error: manifest does not exist: … and exits 1. Corrupt or empty
manifests are reported the same way.
vkit inspect run|errors|trace <missing> now exit with code 1 on
a missing work directory or an unknown cut id. They previously
printed an error message but returned exit 0, so shell scripts
treated the failure as success.
docs/reference/tools-api.md now documents the full
voxkitchen.tools API surface. compute_speaker_similarity and
tokenize_audio had shipped but were missing from the reference
page — added their import line, usage section, and runtime-image
hint to bring the doc in line with the code.
README image and documentation links now use absolute
raw.githubusercontent.com / github.com URLs. PyPI's README
renderer leaves relative paths intact and resolves them against
https://pypi.org/project/voxkitchen/, which 404s; the project
page's logo and pipeline diagram were broken on the first publish.
voxkitchen.utils.download.download_file is now atomic and
retryable. The body streams into <dest>.partial and is renamed
into place only on success, so an aborted transfer can no longer
be mistaken for a complete download on the next call. Up to three
attempts are made on transient errors (ConnectionResetError /
OSError / …) with exponential backoff (2s, 4s). This came out of
real OpenSLR mid-stream RSTs hit while end-to-end-verifying the
new AISHELL-3 and LibriTTS recipes.
cnceleb recipe rewritten to match the real corpus layout.
Verified against the live 22 GB tarball, the previous
implementation was wrong about dev.lst (it lists speaker IDs,
not paths) and about eval (audio lives in separate
eval/enroll and eval/test flat directories, not as pointers
into data/). Counts now match the paper: 126,532 cuts / 997
speakers in data, 107,953 cuts / 797 speakers in dev, 17,973 cuts
/ 200 speakers in eval.
Local release/push checks now run the same fast lint, format,
typecheck, and pytest gate as CI via scripts/check-ci.sh.

Removed

The tedlium3 recipe is removed entirely. The canonical
openslr.org/resources/51/ mirror was de-listed by the project
upstream — every probe returns 404 and www.openslr.org/51/
reports "Resource not found". Without a working auto-download URL,
the recipe was effectively manual-only, and shipping a registered
recipe whose vkit docker download is a guaranteed no-op was a
UX cost without a corresponding benefit. The STM-parsing and
slice layout logic remain in git history (commit 15e6d19 and its
subsequent corrections); reintroduce via a HuggingFace-streaming
recipe (modelled on fleurs) when a real data path is available.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.3.0

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Added

Changed

Fixed

Removed

Uh oh!