Skip to content

v0.3.0

Choose a tag to compare

@XqFeng-Josie XqFeng-Josie released this 23 May 04:57
· 122 commits to main since this release

Added

  • Published the voxkitchen launcher to PyPI. Users install with
    pipx install voxkitchen (or pip install voxkitchen) instead of fetching
    a GitHub archive zip. A publish GitHub Actions workflow builds wheel +
    sdist on every v* tag and uploads via PyPI Trusted Publishing (OIDC, no
    stored API tokens). The workflow also exposes a manual workflow_dispatch
    for TestPyPI dry-runs.
  • PyPI version badge in README.md.
  • Pipeline YAML interpolation gained POSIX-style fallbacks. ${env:VAR:-foo}
    uses foo when VAR is unset or empty; ${env:VAR:?msg} raises a
    PipelineLoadError with msg when VAR is missing. The original
    ${env:VAR} form still fails loudly on unset variables and remains the
    default for required tokens.
  • vkit operators gained two new ways to navigate the 51-operator catalog:
    vkit operators --category <cat> shows a single section (audio, segment,
    augment, annotate, quality, synthesize, pack, noop) and
    vkit operators search <keyword> lists operators whose name or first-line
    docstring contains <keyword> (case-insensitive). Empty matches exit with
    code 1 so scripts can branch on no-result.
  • vkit download and vkit ingest --source recipe now warn host users when
    invoked outside a managed runtime, mirroring the existing vkit run
    warning. The recipe-side dependencies (e.g. datasets) live in the Docker
    images, not in the PyPI launcher, so the previous silent failure mode is
    replaced with a pointer to vkit docker download <recipe>. vkit ingest --source dir and --source manifest stay quiet — they work on the host
    with only the lightweight launcher deps.
  • vkit schema export writes a JSON Schema for pipeline.yaml files,
    derived from PipelineSpec.model_json_schema() plus the registered
    operators. A snapshot is committed at docs/schemas/pipeline.schema.json
    and served via raw.githubusercontent.com. vkit init now writes a
    # yaml-language-server: $schema=… directive at the top of every
    scaffolded pipeline.yaml, so VS Code, Neovim, and JetBrains users get
    autocompletion on operator names and inline validation of the spec
    structure out of the box. See docs/reference/schema.md for editor setup.
  • TTS tutorials split into three focused pages so each can grow
    independently as new engines land. tts-data-prep.md renamed to
    tts-training-data.md (quality gate for raw recordings used to train
    a TTS model). The combined tts-synthesis.md was split into
    tts-speaker.md (built-in / seed-sampled voices — tts_kokoro,
    tts_chattts, tts_cosyvoice sft mode) and tts-voice-cloning.md
    (short-reference cloning — tts_cosyvoice zero_shot /
    cross_lingual, tts_fish_speech). Each tutorial carries its own
    capability matrix filtered to the relevant engines, a Quick Start in
    both Python and YAML, per-engine config snippets, and a decision flow.
    mkdocs.yml, docs/index.md, README.md "What You Can Build", and
    the skill operator notes were updated to match.
  • Five new ingest recipes complete the "common dataset" coverage,
    bringing the total registered to 9:
    • ljspeech — single-speaker English TTS baseline (24 h, 13.1k
      utterances), downloaded from data.keithito.com. Prefers
      normalized text over raw; preserves the raw form only when
      normalization changed it.
    • aishell3 — multi-speaker Mandarin TTS (218 speakers, ~85 h),
      downloaded from OpenSLR/93. Splits the interleaved character +
      pinyin content.txt into supervision text (chars) and
      cut.custom["pinyin"]; enriches gender from spk-info.txt.
    • libritts — multi-speaker English TTS derived from LibriSpeech
      (OpenSLR/60). Prefers *.normalized.txt over *.original.txt;
      enriches gender from speakers.tsv. Same seven-subset
      partitioning as LibriSpeech.
    • cnceleb — CN-Celeb 1, Chinese speaker recognition (~130k
      utterances, 1000 speakers, 11 genres), from OpenSLR/82.
      Empty-text Supervisions carry speaker / language tags. Subsets
      data / dev / eval follow the canonical splits; overlapping
      subsets deduplicate.
    • musan — MUSAN augmentation corpus (~10 GB of non-transcribed
      noise / music / speech), from OpenSLR/17. Closes the loop with
      the existing noise_augment operator. Subsets pick which of the
      three top-level categories to ingest; sub-categories are
      preserved as cut.custom["musan_subcategory"].
  • Every auto-downloadable recipe ships HEAD-probed compressed-size
    metadata in a new download_sizes dict on the Recipe class. The
    size is surfaced in three places: a new Size column in
    vkit recipes, a per-subset "downloading X GB" log line inside
    Recipe.download(), and the table in docs/reference/recipes.md.
    Multi-subset recipes (LibriSpeech / LibriTTS / AISHELL-1) show a
    range like 299 MB - 28.5 GB. Manual / HuggingFace recipes render
    as a dash.

Changed

  • The TTS training-data tutorial (tts-training-data.md) now opens
    with an explicit "quality gate" framing and ends with a Quality
    Checklist summarizing the five thresholds (sample rate, duration,
    SNR, text present, alignment present).
  • The Download column of vkit recipes now derives the source label
    from each recipe's URL host (keithito / openslr / huggingface /
    bare hostname) instead of hard-coding "openslr" for every recipe
    with download_urls.
  • Docker image builds now reuse wheel and layer caches across runs.
    docker/Dockerfile mounts a BuildKit cache at /root/.cache/uv on
    every RUN uv pip install, so torch / transformers / funasr wheels
    download once instead of being refetched in each of the five venvs
    (UV_NO_CACHE=1 removed). scripts/release.sh now builds through
    docker buildx against a dedicated voxkitchen-builder (auto-created
    with the docker-container driver), and pushes layer descriptors to
    ghcr.io/xqfeng-josie/voxkitchen:buildcache-<target> via
    --cache-to type=registry,mode=max. First build from a cold cache
    still takes 1-2 h; subsequent rebuilds where only application source
    changed land in single-digit minutes per target. New "Faster
    rebuilds" section in docs/docker-build.md walks through both cache
    layers, the manual docker buildx command to consume the public
    registry cache from a fork, expected timings per scenario, and which
    changes invalidate which layers.

Fixed

  • Wheel no longer ships duplicate copies of
    voxkitchen/templates/pipelines/*.yaml. The redundant
    [tool.hatch.build.targets.wheel.force-include] block was removed;
    hatchling already includes non-Python files inside the package
    directory.
  • Untagged dev builds (e.g. the publish workflow's manual TestPyPI
    dispatch) now produce PEP 440-compliant versions like
    0.2.1.dev5 instead of 0.2.1.dev5+g<sha>. PyPI and TestPyPI both
    reject the latter form. Set via local_scheme = "no-local-version"
    in [tool.hatch.version].raw-options.
  • Removed [wenet], [speaker], and [tts-fish-speech]
    optional-dependency groups from pyproject.toml because they
    declared pkg @ git+... direct references, which PyPI rejects on
    upload. The Docker images now install these from git inside their
    respective RUN lines (fish-speech joins wenet's existing
    pattern in docker/Dockerfile). Operators that declare
    required_extras = ["wenet"|"tts-fish-speech"] still route
    correctly via EXTRA_TO_ENV in
    voxkitchen/runtime/env_resolver.py.
  • vkit run now exits with code 1 (runtime failure) when a stage
    raises StageFailedError, matching the rest of the CLI's exit-code
    convention. Previously it returned code 2, which the codebase
    reserves for invocation errors (unknown flag, missing docker
    binary, unknown operator category).
  • vkit ingest inline error messages now use the same error:
    prefix the rest of the CLI prints, instead of rendering the whole
    line in red without context.
  • vkit inspect cuts <missing> no longer dumps a full Python
    traceback; it prints a one-line
    error: manifest does not exist: … and exits 1. Corrupt or empty
    manifests are reported the same way.
  • vkit inspect run|errors|trace <missing> now exit with code 1 on
    a missing work directory or an unknown cut id. They previously
    printed an error message but returned exit 0, so shell scripts
    treated the failure as success.
  • docs/reference/tools-api.md now documents the full
    voxkitchen.tools API surface. compute_speaker_similarity and
    tokenize_audio had shipped but were missing from the reference
    page — added their import line, usage section, and runtime-image
    hint to bring the doc in line with the code.
  • README image and documentation links now use absolute
    raw.githubusercontent.com / github.com URLs. PyPI's README
    renderer leaves relative paths intact and resolves them against
    https://pypi.org/project/voxkitchen/, which 404s; the project
    page's logo and pipeline diagram were broken on the first publish.
  • voxkitchen.utils.download.download_file is now atomic and
    retryable. The body streams into <dest>.partial and is renamed
    into place only on success, so an aborted transfer can no longer
    be mistaken for a complete download on the next call. Up to three
    attempts are made on transient errors (ConnectionResetError /
    OSError / …) with exponential backoff (2s, 4s). This came out of
    real OpenSLR mid-stream RSTs hit while end-to-end-verifying the
    new AISHELL-3 and LibriTTS recipes.
  • cnceleb recipe rewritten to match the real corpus layout.
    Verified against the live 22 GB tarball, the previous
    implementation was wrong about dev.lst (it lists speaker IDs,
    not paths) and about eval (audio lives in separate
    eval/enroll and eval/test flat directories, not as pointers
    into data/). Counts now match the paper: 126,532 cuts / 997
    speakers in data, 107,953 cuts / 797 speakers in dev, 17,973 cuts
    / 200 speakers in eval.
  • Local release/push checks now run the same fast lint, format,
    typecheck, and pytest gate as CI via scripts/check-ci.sh.

Removed

  • The tedlium3 recipe is removed entirely. The canonical
    openslr.org/resources/51/ mirror was de-listed by the project
    upstream — every probe returns 404 and www.openslr.org/51/
    reports "Resource not found". Without a working auto-download URL,
    the recipe was effectively manual-only, and shipping a registered
    recipe whose vkit docker download is a guaranteed no-op was a
    UX cost without a corresponding benefit. The STM-parsing and
    slice layout logic remain in git history (commit 15e6d19 and its
    subsequent corrections); reintroduce via a HuggingFace-streaming
    recipe (modelled on fleurs) when a real data path is available.