<a href="https://colab.research.google.com/github/roixiao/x-research-skill/blob/main/whisperf_colab_FINAL_20260210_225215.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Whisperf (Colab): Batch Download + Batch Transcribe (FINAL)

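This notebook is designed specifically for **Google Colab / Colab Pro**:

- Outputs are organized into one directory per run: `/content/whisperf/runs/<RUN_ID>/...`
- Input sources:
  - Step 1: batch download with `yt-dlp` (public videos usually need no cookies; only members-only or private videos do)
  - Step 1-ALT: upload audio files or import them from a Google Drive path
- Outputs: `txt / srt / json` + `task_log.txt` + `download_logs/`, packaged into a zip and downloaded at the end

## Notes
- Members-only / private videos require your own `cookies.txt` (Netscape format).
- Even public videos can be blocked by the site's anti-bot checks, leaving no downloadable formats (only storyboards). This notebook records diagnostic logs but cannot guarantee that every download succeeds.
- Multi-GPU: Colab usually provides a single GPU. When several GPUs are available, this notebook can transcribe files in parallel, one file per GPU at a time.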

---


In [None]:
# ============================================================
# Step 0: Setup (Colab)
# ============================================================

# YouTube's anti-bot measures keep changing: install yt-dlp[default] + yt-dlp-ejs and make sure node is available.

!pip -q install -U --pre "yt-dlp[default]" yt-dlp-ejs openai-whisper tqdm

!apt-get -qq update > /dev/null
!apt-get -qq install -y ffmpeg nodejs npm > /dev/null 2>&1 || true

# Ubuntu sometimes ships only `nodejs` without a `node` binary, so add a compatibility symlink.
!bash -lc 'set -e; if ! command -v node >/dev/null 2>&1; then if command -v nodejs >/dev/null 2>&1; then ln -sf "$(command -v nodejs)" /usr/local/bin/node; fi; fi; true'

# Self-check
!bash -lc 'echo "node: $(command -v node || true)"; node -v || true'
!python -m yt_dlp --version

print('Setup complete')


W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
node: /usr/bin/node
v12.22.9
2026.02.09.233747
Setup complete
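
The shell self-check above can also be done from Python, which makes the result available to later cells. The following is a minimal sketch using only the standard library; the helper name `find_tools` is hypothetical and not part of the notebook.

```python
import shutil


def find_tools(names):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# The tool names mirror what Step 0 installs; adjust the list as needed.
tools = find_tools(['node', 'nodejs', 'ffmpeg'])
for name, path in tools.items():
    print(f'{name}: {path or "NOT FOUND"}')
```

A missing `ffmpeg` would break audio extraction in Step 1 and decoding in Step 2, so it is worth checking before starting a long batch.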


In [None]:
print(f"Source URL: {urls[0]}")
print(f"Local Path: {ok[0]['path']}")

Source URL: https://youtu.be/dbV1sMfv9TQ?si=v2YlYUGoovCJTiRn
Local Path: /content/whisperf/runs/20260211_041443/audio/youtube_20260211_041922/Joe Tsai, Co-Founder and Chairman, Alibaba： Find Your People_dbV1sMfv9TQ.mp4


In [None]:
# ============================================================
# Step 0b: Run Directories + Logging + GPU Info
# ============================================================

from __future__ import annotations

import os, re, sys, json, glob, time, shutil, zipfile, subprocess, hashlib
from pathlib import Path
from datetime import datetime
import traceback

import torch

RUN_ID = datetime.now().strftime('%Y%m%d_%H%M%S')  #@param {type:"string"}
BASE_DIR = '/content/whisperf'  #@param {type:"string"}

RUN_DIR = Path(BASE_DIR) / 'runs' / RUN_ID
AUDIO_DIR = RUN_DIR / 'audio'
OUT_DIR = RUN_DIR / 'output'
DL_LOG_DIR = RUN_DIR / 'download_logs'

for d in [RUN_DIR, AUDIO_DIR, OUT_DIR, DL_LOG_DIR]:
    d.mkdir(parents=True, exist_ok=True)

COOKIE_FILE = RUN_DIR / 'cookies.txt'
TASK_LOG_PATH = RUN_DIR / 'task_log.txt'

AUDIO_EXTS = {'.mp3', '.m4a', '.wav', '.flac', '.opus', '.ogg', '.aac', '.webm', '.mp4', '.mkv'}


def _ts():
    return datetime.now().strftime('%Y%m%d_%H%M%S')


def safe_name(name: str) -> str:
    name = re.sub(r'[\/:*?"<>|]+', '_', name)
    return re.sub(r'\s+', ' ', name).strip()[:180]


def _log_write(line: str):
    try:
        with open(TASK_LOG_PATH, 'a', encoding='utf-8') as f:
            f.write(line + '\n')
    except Exception:
        pass


def log_event(step: str, message: str, level: str = 'INFO'):
    ts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    line = f'[{ts}] [{level}] [{step}] {message}'
    _log_write(line)
    print(line)


def log_exception(step: str, where: str, exc: Exception):
    ts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    header = f'[{ts}] [ERROR] [{step}] {where}: {repr(exc)}'
    _log_write(header)
    _log_write(traceback.format_exc())
    print(header)


NUM_GPUS = torch.cuda.device_count()
log_event('init', f'run_id={RUN_ID} run_dir={RUN_DIR} gpus={NUM_GPUS}')
print('RUN_DIR:', RUN_DIR)
print('AUDIO_DIR:', AUDIO_DIR)
print('OUT_DIR:', OUT_DIR)
print('DL_LOG_DIR:', DL_LOG_DIR)
print('TASK_LOG:', TASK_LOG_PATH)

if NUM_GPUS:
    for i in range(NUM_GPUS):
        print(f'GPU {i}: {torch.cuda.get_device_name(i)}')
else:
    print('No GPU detected (CPU mode will be very slow)')

[2026-02-11 04:24:12] [INFO] [init] run_id=20260211_042412 run_dir=/content/whisperf/runs/20260211_042412 gpus=1
RUN_DIR: /content/whisperf/runs/20260211_042412
AUDIO_DIR: /content/whisperf/runs/20260211_042412/audio
OUT_DIR: /content/whisperf/runs/20260211_042412/output
DL_LOG_DIR: /content/whisperf/runs/20260211_042412/download_logs
TASK_LOG: /content/whisperf/runs/20260211_042412/task_log.txt
GPU 0: NVIDIA A100-SXM4-80GB
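
To make the behavior of `safe_name` concrete, here is a small standalone sketch that copies the function from the cell above and runs it on a few inputs. The sample strings are made up for illustration.

```python
import re


def safe_name(name: str) -> str:
    # Same logic as the notebook cell: replace characters that are awkward in
    # file names, collapse whitespace, and cap the length at 180 characters.
    name = re.sub(r'[\/:*?"<>|]+', '_', name)
    return re.sub(r'\s+', ' ', name).strip()[:180]


print(safe_name('a/b: c?  d'))      # a_b_ c_ d
print(len(safe_name('x' * 500)))    # 180
print(safe_name('Talk： Part 1'))    # the full-width colon is left untouched
```

Only ASCII punctuation is replaced, which is why the downloaded file name in the outputs of this notebook still contains a full-width `：`.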


In [None]:
# ============================================================
# (Optional) Mount Google Drive (Persistence)
# ============================================================

# To keep results after the runtime shuts down or disconnects, copy the final zip to Drive.
# Mounting opens an authorization prompt; this is the normal Colab flow.

MOUNT_DRIVE = True  #@param {type:"boolean"}
DRIVE_BASE_DIR = '/content/drive/MyDrive/whisperf_runs'  #@param {type:"string"}

DRIVE_RUN_DIR = None

if MOUNT_DRIVE:
    from google.colab import drive
    drive.mount('/content/drive')
    DRIVE_RUN_DIR = Path(DRIVE_BASE_DIR) / RUN_ID
    DRIVE_RUN_DIR.mkdir(parents=True, exist_ok=True)
    print('Drive dir:', DRIVE_RUN_DIR)


In [None]:
# ============================================================
# Step Cookie (Optional): Provide cookies.txt for members-only
# ============================================================

# When USE_COOKIES=False, --cookies is not passed to yt-dlp.
# COOKIE_PATH may be a Drive path, for example:
#   /content/drive/MyDrive/whisperf/cookies.txt
# If it is left empty while USE_COOKIES=True, an upload dialog opens.

USE_COOKIES = False  #@param {type:"boolean"}
COOKIE_PATH = ''  #@param {type:"string"}

if USE_COOKIES:
    from google.colab import files

    src = COOKIE_PATH.strip()
    if src:
        p = Path(src)
        if not p.exists():
            raise FileNotFoundError(f'COOKIE_PATH not found: {p}')
        COOKIE_FILE.write_text(p.read_text(encoding='utf-8', errors='replace'), encoding='utf-8')
        log_event('cookie', f'copied from {p} -> {COOKIE_FILE}')
    else:
        up = files.upload()
        if not up:
            raise RuntimeError('no cookie file uploaded')
        name, data = next(iter(up.items()))
        COOKIE_FILE.write_bytes(data)
        log_event('cookie', f'uploaded {name} -> {COOKIE_FILE}')

    n = sum(1 for l in COOKIE_FILE.read_text(encoding='utf-8', errors='replace').splitlines() if l.strip() and not l.startswith('#'))
    print(f'cookie lines: {n}  path: {COOKIE_FILE}')
else:
    print('cookies disabled')


cookies disabled
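
The cell above only counts non-comment lines. A slightly stricter check is sketched below, assuming the usual Netscape layout of seven tab-separated fields per cookie and the `#HttpOnly_` prefix that some exporters put in front of cookie rows. The function name `count_cookie_rows` and the sample data are hypothetical.

```python
def count_cookie_rows(text: str) -> int:
    """Count lines that look like Netscape cookie rows (7 tab-separated fields)."""
    rows = 0
    for raw in text.splitlines():
        line = raw.rstrip('\r\n')
        if line.startswith('#HttpOnly_'):
            # Some exporters mark HttpOnly cookies with this prefix; the rest is a normal row.
            line = line[len('#HttpOnly_'):]
        elif not line.strip() or line.startswith('#'):
            continue  # blank line or ordinary comment
        if len(line.split('\t')) == 7:
            rows += 1
    return rows


# Made-up sample content, for illustration only.
sample = '\n'.join([
    '# Netscape HTTP Cookie File',
    '.example.com\tTRUE\t/\tTRUE\t0\tname\tvalue',
    '#HttpOnly_.example.com\tTRUE\t/\tTRUE\t0\tsid\tabc',
    'not a cookie line',
])
print(count_cookie_rows(sample))  # 2
```

A count of zero on a file you expected to work usually means the export is in a different format (for example JSON) rather than Netscape text.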


In [None]:
# ============================================================
# Step 1: Batch Download (yt-dlp)
# ============================================================

DO_DOWNLOAD = True  #@param {type:"boolean"}

URLS = 'https://youtu.be/dbV1sMfv9TQ?si=v2YlYUGoovCJTiRn'  #@param {type:"string"}

OUT_MODE = 'direct'  #@param ['direct', 'extract']
OUT_FMT = 'mp3'  #@param ['mp3', 'm4a', 'wav', 'flac', 'opus']
OUT_QUALITY = 5  #@param {type:"integer"}
URL_PARALLEL = 2  #@param {type:"integer"}
N_CONNECTIONS = 8  #@param {type:"integer"}
FORCE_IPV4 = True  #@param {type:"boolean"}
SHOW_CMD = True  #@param {type:"boolean"}

PLAYER_CLIENT = 'auto'  #@param ['auto', 'web,android', 'android', 'ios', 'tv_embedded,web', 'default']

DIAG_FORMATS_ONLY = False  #@param {type:"boolean"}


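def _js_runtime_auto() -> str:
    # yt-dlp names this runtime 'node'. If only a `nodejs` binary exists,
    # pass its path explicitly (RUNTIME:PATH form) instead of the name 'nodejs'.
    if shutil.which('node'):
        return 'node'
    alt = shutil.which('nodejs')
    return f'node:{alt}' if alt else ''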


def _looks_like_blocked(log: str) -> bool:
    s = (log or '').lower()
    needles = [
        'n challenge',
        'only images are available',
        'requested format is not available',
        'sign in',
        'members-only',
        'join this channel',
        'http error 403',
    ]
    return any(n in s for n in needles)


def _run_yt_dlp(cmd: list[str], log_path: Path) -> tuple[int, str]:
    p = subprocess.run(cmd, capture_output=True, text=True)
    combined = (p.stdout or '') + (p.stderr or '')
    try:
        # Save the command line and the combined stdout/stderr for later diagnosis.
        log_path.write_text('CMD:\n' + ' '.join(cmd) + '\n\n' + combined, encoding='utf-8')
    except Exception:
        pass
    return p.returncode, combined


def _collect_urls(text: str) -> list[str]:
    out = []
    seen = set()
    for line in (text or '').splitlines():
        u = line.strip()
        if not u or u.startswith('#'):
            continue
        if u not in seen:
            out.append(u)
            seen.add(u)
    return out


def _is_youtube(url: str) -> bool:
    u = url.lower()
    return 'youtube.com' in u or 'youtu.be' in u


def download_one(url: str, out_dir: Path, cookie_file: str) -> dict:
    out_dir.mkdir(parents=True, exist_ok=True)

    base_cmd = [
        sys.executable, '-m', 'yt_dlp',
        '--no-playlist',
        '--newline',
        '--socket-timeout', '30',
        '--retries', '15',
        '--fragment-retries', '15',
        '--extractor-retries', '5',
        '--retry-sleep', 'http:exp=1:20',
        '--http-chunk-size', '10M',
        '-N', str(int(N_CONNECTIONS)),
        '-f', 'bestaudio/best',
        '-o', str(out_dir / '%(title).150s_%(id)s.%(ext)s'),
    ]

    if FORCE_IPV4:
        base_cmd += ['--force-ipv4']

    if cookie_file:
        base_cmd += ['--cookies', cookie_file]

    jsrt = _js_runtime_auto()
    if jsrt:
        base_cmd += ['--js-runtimes', jsrt]

    def cmd_for(client: str | None):
        cmd = list(base_cmd)
        if _is_youtube(url) and client and client != 'default':
            cmd += ['--extractor-args', f'youtube:player_client={client}']
        if OUT_MODE == 'extract':
            cmd += ['-x', '--audio-format', OUT_FMT]
            if OUT_FMT.lower() in {'mp3','m4a','aac','opus','vorbis'}:
                cmd += ['--audio-quality', str(int(OUT_QUALITY))]
        cmd += [url]
        return cmd

    if PLAYER_CLIENT == 'auto':
        clients = ['web,android', 'android', 'ios', 'tv_embedded,web', 'default']
    else:
        clients = [PLAYER_CLIENT]

    last = {'url': url, 'ok': False, 'error': 'unknown'}

    for attempt, client in enumerate(clients, 1):
        log_path = DL_LOG_DIR / f'{_ts()}_{hashlib.md5(url.encode("utf-8")).hexdigest()[:10]}_{attempt}_{safe_name(client)}.txt'
        cmd = cmd_for(client)
        if SHOW_CMD:
            print('CMD:', ' '.join(cmd))
        code, log = _run_yt_dlp(cmd, log_path)

        if OUT_MODE == 'extract':
            want_ext = '.' + OUT_FMT.lower()
            files = sorted([p for p in out_dir.glob(f'*{want_ext}')], key=lambda p: p.stat().st_mtime, reverse=True)
        else:
            files = sorted([p for p in out_dir.glob('*') if p.is_file() and p.suffix.lower() in AUDIO_EXTS], key=lambda p: p.stat().st_mtime, reverse=True)

        if code == 0 and files:
            fp = files[0]
            return {'url': url, 'ok': True, 'path': str(fp), 'client': client, 'log': str(log_path)}

        last = {'url': url, 'ok': False, 'error': (log[-1000:] if log else f'exit={code}'), 'client': client, 'log': str(log_path)}

        if not _looks_like_blocked(log):
            break

    return last


if not DO_DOWNLOAD:
    print('download skipped')
else:
    urls = _collect_urls(URLS)
    if not urls:
        print('URLS is empty')
    else:
        cookie_file = str(COOKIE_FILE) if USE_COOKIES and COOKIE_FILE.exists() else ''
        log_event('download', f'urls={len(urls)} mode={OUT_MODE} fmt={OUT_FMT} q={OUT_QUALITY} par={URL_PARALLEL} N={N_CONNECTIONS} cookie={bool(cookie_file)}')

        if DIAG_FORMATS_ONLY:
            url = urls[0]
            cmd = [sys.executable, '-m', 'yt_dlp', '--no-playlist']
            if cookie_file:
                cmd += ['--cookies', cookie_file]
            jsrt = _js_runtime_auto()
            if jsrt:
                cmd += ['--js-runtimes', jsrt]
            if FORCE_IPV4:
                cmd += ['--force-ipv4']
            cmd += ['-F', url]
            lp = DL_LOG_DIR / f'{_ts()}_formats_{hashlib.md5(url.encode("utf-8")).hexdigest()[:10]}.txt'
            code, log = _run_yt_dlp(cmd, lp)
            print('Exit=', code)
            print('Saved:', lp)
            print('\n'.join((log or '').splitlines()[-80:]))
        else:
            from concurrent.futures import ThreadPoolExecutor, as_completed

            batch_dir = AUDIO_DIR / f'youtube_{_ts()}'
            batch_dir.mkdir(parents=True, exist_ok=True)
            print('output:', batch_dir)

            results = []
            par = max(1, int(URL_PARALLEL))
            if par <= 1:
                for u in urls:
                    r = download_one(u, batch_dir, cookie_file)
                    results.append(r)
                    print('OK' if r.get('ok') else 'FAIL', u)
            else:
                with ThreadPoolExecutor(max_workers=par) as pool:
                    futs = {pool.submit(download_one, u, batch_dir, cookie_file): u for u in urls}
                    for fut in as_completed(futs):
                        r = fut.result()
                        results.append(r)
                        print('OK' if r.get('ok') else 'FAIL', r.get('url'))

            ok = [r for r in results if r.get('ok')]
            fail = [r for r in results if not r.get('ok')]
            print(f'\nDone: {len(ok)}/{len(results)} ok')
            for r in ok:
                print('  OK:', r.get('path'))
            for r in fail[:10]:
                print('  FAIL:', r.get('url'))
                print('        client:', r.get('client'), 'log:', r.get('log'))

            log_event('download', f'done ok={len(ok)} total={len(results)} out={batch_dir}')

[2026-02-11 04:37:00] [INFO] [download] urls=1 mode=direct fmt=mp3 q=5 par=2 N=8 cookie=False
output: /content/whisperf/runs/20260211_042412/audio/youtube_20260211_043700
CMD: /usr/bin/python3 -m yt_dlp --no-playlist --newline --socket-timeout 30 --retries 15 --fragment-retries 15 --extractor-retries 5 --retry-sleep http:exp=1:20 --http-chunk-size 10M -N 8 -f bestaudio/best -o /content/whisperf/runs/20260211_042412/audio/youtube_20260211_043700/%(title).150s_%(id)s.%(ext)s --force-ipv4 --js-runtimes node --extractor-args youtube:player_client=web,android https://youtu.be/dbV1sMfv9TQ?si=v2YlYUGoovCJTiRn
OK https://youtu.be/dbV1sMfv9TQ?si=v2YlYUGoovCJTiRn

Done: 1/1 ok
  OK: /content/whisperf/runs/20260211_042412/audio/youtube_20260211_043700/Joe Tsai, Co-Founder and Chairman, Alibaba： Find Your People_dbV1sMfv9TQ.mp4
[2026-02-11 04:37:06] [INFO] [download] done ok=1 total=1 out=/content/whisperf/runs/20260211_042412/audio/youtube_20260211_043700
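
The retry logic inside `download_one` can be hard to follow among the command-building details. The sketch below isolates it: try each player client in order, stop on the first success, and stop early when the failure does not look like a block. `try_clients` and `fake_run` are hypothetical stand-ins; no real `yt-dlp` call is made.

```python
def try_clients(clients, run, looks_blocked):
    """Try each client in order; stop on success or on a failure that is not a block."""
    last = None
    for client in clients:
        code, log = run(client)
        if code == 0:
            return {'ok': True, 'client': client}
        last = {'ok': False, 'client': client, 'error': log[-1000:]}
        if not looks_blocked(log):
            break  # switching clients is unlikely to help with other errors
    return last


def fake_run(client):
    # Hypothetical stand-in for a yt-dlp invocation: only 'ios' succeeds here.
    if client == 'ios':
        return 0, 'done'
    return 1, 'ERROR: HTTP Error 403: Forbidden'


result = try_clients(
    ['web,android', 'android', 'ios'],
    fake_run,
    lambda log: 'http error 403' in log.lower(),
)
print(result)  # {'ok': True, 'client': 'ios'}
```

Keeping the "looks blocked" test separate from the loop makes it easy to extend the list of error markers without touching the retry order.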



In [None]:
# ============================================================
# Step 1-ALT: Import External Audio (skip download)
# ============================================================

DO_IMPORT = False  #@param {type:"boolean"}
IMPORT_MODE = 'upload'  #@param ['upload', 'path']
IMPORT_PATH = ''  #@param {type:"string"}

IMPORT_DIR = AUDIO_DIR / 'imported'
IMPORT_DIR.mkdir(parents=True, exist_ok=True)


def _copy_one(src: Path, dst_dir: Path) -> tuple[bool, str]:
    if not src.exists() or not src.is_file():
        return False, f'not_found: {src}'
    if src.suffix.lower() not in AUDIO_EXTS:
        return False, f'unsupported: {src.name}'
    dst = dst_dir / src.name
    if dst.exists():
        return False, f'exists: {dst.name}'
    shutil.copy2(src, dst)
    return True, dst.name


if not DO_IMPORT:
    print('import skipped')
else:
    from google.colab import files

    imported = []
    skipped = []

    if IMPORT_MODE == 'upload':
        up = files.upload()
        if not up:
            print('no file uploaded')
        for name, data in up.items():
            p = IMPORT_DIR / Path(name).name
            if p.suffix.lower() not in AUDIO_EXTS:
                skipped.append(f'unsupported: {p.name}')
                continue
            p.write_bytes(data)
            imported.append(p.name)

    else:
        p = IMPORT_PATH.strip()
        if not p:
            print('IMPORT_PATH empty')
        else:
            src = Path(p)
            if src.exists() and src.is_file() and src.suffix.lower() == '.zip':
                import zipfile
                with zipfile.ZipFile(src, 'r') as zf:
                    zf.extractall(IMPORT_DIR)
                imported.append(f'zip_extract: {src.name}')
            elif src.exists() and src.is_file():
                ok, msg = _copy_one(src, IMPORT_DIR)
                (imported if ok else skipped).append(msg)
            elif src.exists() and src.is_dir():
                files2 = sorted([f for f in src.rglob('*') if f.is_file() and f.suffix.lower() in AUDIO_EXTS])
                for f in files2:
                    ok, msg = _copy_one(f, IMPORT_DIR)
                    (imported if ok else skipped).append(msg)
            else:
                matches = sorted(glob.glob(p, recursive=True))
                if matches:
                    for m in matches:
                        ok, msg = _copy_one(Path(m), IMPORT_DIR)
                        (imported if ok else skipped).append(msg)
                else:
                    skipped.append(f'not_found: {p}')

    print(f'imported: {len(imported)}')
    for n in imported[:50]:
        print('  +', n)
    if skipped:
        print('skipped:')
        for s in skipped[:50]:
            print('  -', s)

    log_event('import', f'done imported={len(imported)} skipped={len(skipped)} dir={IMPORT_DIR}')

import skipped
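
In `path` mode the cell above extracts every member of a zip archive into `IMPORT_DIR`, whatever its extension. If you would rather keep only audio files, one possible variation is sketched below. The helper `extract_audio_only` and the shortened extension set are assumptions for illustration, and the archive is built on the fly in a temporary directory.

```python
import tempfile
import zipfile
from pathlib import Path

AUDIO_EXTS = {'.mp3', '.m4a', '.wav', '.flac'}  # shortened set, for illustration


def extract_audio_only(zip_path: Path, dst_dir: Path) -> list:
    """Extract only members whose extension is in AUDIO_EXTS; return their names."""
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if Path(info.filename).suffix.lower() in AUDIO_EXTS:
                zf.extract(info, dst_dir)
                extracted.append(info.filename)
    return extracted


with tempfile.TemporaryDirectory() as tmp:
    zp = Path(tmp) / 'bundle.zip'
    with zipfile.ZipFile(zp, 'w') as zf:
        zf.writestr('talks/a.mp3', b'fake audio bytes')
        zf.writestr('README.txt', 'notes')
    names = extract_audio_only(zp, Path(tmp) / 'out')
    exists = (Path(tmp) / 'out' / 'talks' / 'a.mp3').exists()

print(names)  # ['talks/a.mp3']
```

Step 2 scans `AUDIO_DIR` recursively, so files extracted into subfolders such as `talks/` would still be picked up.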


In [None]:
# ============================================================
# Step 2: Whisper Batch Transcribe
# ============================================================

DO_TRANSCRIBE = True  #@param {type:"boolean"}

MODEL_NAME = 'small'  #@param ['tiny', 'base', 'small', 'medium', 'large']
LANGUAGE = 'en'  #@param ['auto', 'zh', 'en', 'ja', 'ko', 'es', 'fr', 'de']
TASK = 'transcribe'  #@param ['transcribe', 'translate']
BEAM_SIZE = 5  #@param {type:"slider", min:1, max:10, step:1}
TEMPERATURE = 0.0  #@param {type:"slider", min:0.0, max:1.0, step:0.1}

USE_MULTI_GPU = False  #@param {type:"boolean"}
GPU_WORKERS = 1  #@param {type:"integer"}


def _list_audio() -> list[Path]:
    return sorted([p for p in AUDIO_DIR.rglob('*') if p.is_file() and p.suffix.lower() in AUDIO_EXTS])


def _write_outputs(result: dict, audio_path: Path, out_root: Path):
    import json as _json
    from whisper.utils import get_writer

    base = safe_name(audio_path.stem)
    one_out = out_root / base
    one_out.mkdir(parents=True, exist_ok=True)

    get_writer('txt', str(one_out))(result, base)
    get_writer('srt', str(one_out))(result, base)

    (one_out / f'{base}.json').write_text(_json.dumps(result, ensure_ascii=False, indent=2), encoding='utf-8')


if not DO_TRANSCRIBE:
    print('transcribe skipped')
else:
    audio_files = _list_audio()
    if not audio_files:
        print('No audio found. Run Step 1 or Step 1-ALT first.')
        raise SystemExit(1)

    log_event('transcribe', f'files={len(audio_files)} model={MODEL_NAME} lang={LANGUAGE} task={TASK} beam={BEAM_SIZE} temp={TEMPERATURE}')

    n_gpus = torch.cuda.device_count()
    use_multi = bool(USE_MULTI_GPU) and n_gpus > 1
    workers = max(1, int(GPU_WORKERS))
    if use_multi:
        workers = min(workers, n_gpus, len(audio_files))
    else:
        workers = 1

    print(f'GPU: {n_gpus} | use_multi={use_multi} | workers={workers}')

    if workers == 1:
        import whisper
        from tqdm.auto import tqdm

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        model = whisper.load_model(MODEL_NAME, device=device)
        lang_arg = None if LANGUAGE == 'auto' else LANGUAGE

        ok = 0
        for ap in tqdm(audio_files, desc='Transcribing', unit='file'):
            base = safe_name(ap.stem)
            out_dir = OUT_DIR / base
            out_json = out_dir / f'{base}.json'
            if out_json.exists():
                continue

            t0 = time.time()
            try:
                result = model.transcribe(
                    str(ap),
                    language=lang_arg,
                    task=TASK,
                    beam_size=int(BEAM_SIZE),
                    temperature=float(TEMPERATURE),
                    fp16=(device == 'cuda'),
                    verbose=False,
                )
                _write_outputs(result, ap, OUT_DIR)
                ok += 1
                log_event('transcribe', f'OK file={ap.name} seconds={time.time()-t0:.1f} lang={result.get("language")}')
            except Exception as e:
                log_exception('transcribe', f'file={ap.name}', e)

        print('done. ok=', ok, 'total=', len(audio_files))

    else:
        import textwrap as _tw
        from tqdm.auto import tqdm

        worker_py = RUN_DIR / '_whisper_worker.py'
        worker_code = r'''
import argparse, json, os, re, sys, time
from pathlib import Path
import torch
import whisper
from whisper.utils import get_writer

def safe_name(name):
    name = re.sub(r'[\/:*?"<>|]+', '_', name)
    return re.sub(r'\s+', ' ', name).strip()[:180]

def write_outputs(result: dict, audio_path: Path, out_root: Path):
    base = safe_name(audio_path.stem)
    one_out = out_root / base
    one_out.mkdir(parents=True, exist_ok=True)
    get_writer('txt', str(one_out))(result, base)
    get_writer('srt', str(one_out))(result, base)
    (one_out / f'{base}.json').write_text(json.dumps(result, ensure_ascii=False, indent=2), encoding='utf-8')

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('--files-json', required=True)
    ap.add_argument('--out-dir', required=True)
    ap.add_argument('--model', required=True)
    ap.add_argument('--language', default='auto')
    ap.add_argument('--task', default='transcribe')
    ap.add_argument('--beam-size', type=int, default=5)
    ap.add_argument('--temperature', type=float, default=0.0)
    ap.add_argument('--label', default='')
    args = ap.parse_args()

    out_dir = Path(args.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    files = json.loads(Path(args.files_json).read_text('utf-8'))
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    label = args.label or device

    print(f'[{label}] loading {args.model} on {device}, {len(files)} files')
    sys.stdout.flush()
    model = whisper.load_model(args.model, device=device)
    print(f'[{label}] model loaded')
    sys.stdout.flush()

    lang_arg = None if args.language == 'auto' else args.language

    for idx, fpath in enumerate(files, 1):
        apath = Path(fpath)
        base = safe_name(apath.stem)
        out_json = out_dir / base / f'{base}.json'
        if out_json.exists():
            print(f'[{label}] ({idx}/{len(files)}) SKIP {apath.name}')
            continue

        print(f'[{label}] ({idx}/{len(files)}) transcribing {apath.name}')
        sys.stdout.flush()
        t0 = time.time()
        try:
            result = model.transcribe(
                str(apath),
                language=lang_arg,
                task=args.task,
                beam_size=args.beam_size,
                temperature=args.temperature,
                fp16=(device == 'cuda'),
                verbose=False,
            )
            write_outputs(result, apath, out_dir)
            print(f'[{label}] ({idx}/{len(files)}) DONE {time.time()-t0:.1f}s')
        except Exception as e:
            print(f'[{label}] ({idx}/{len(files)}) FAIL {time.time()-t0:.1f}s: {e}')
        sys.stdout.flush()

    return 0

if __name__ == '__main__':
    raise SystemExit(main())
'''
        worker_py.write_text(_tw.dedent(worker_code), encoding='utf-8')

        chunks = [[] for _ in range(workers)]
        for i, ap in enumerate(audio_files):
            chunks[i % workers].append(str(ap))

        procs = []
        log_paths = []
        t_start = time.time()

        for slot in range(workers):
            files_json = RUN_DIR / f'files_gpu{slot}.json'
            files_json.write_text(json.dumps(chunks[slot], ensure_ascii=False), encoding='utf-8')
            lp = RUN_DIR / f'log_gpu{slot}.txt'
            log_paths.append(lp)

            env = os.environ.copy()
            env['CUDA_VISIBLE_DEVICES'] = str(slot)

            cmd = [
                sys.executable, str(worker_py),
                '--files-json', str(files_json),
                '--out-dir', str(OUT_DIR),
                '--model', MODEL_NAME,
                '--language', LANGUAGE,
                '--task', TASK,
                '--beam-size', str(int(BEAM_SIZE)),
                '--temperature', str(float(TEMPERATURE)),
                '--label', f'gpu{slot}',
            ]

            f = open(lp, 'w', encoding='utf-8')
            procs.append((subprocess.Popen(cmd, env=env, stdout=f, stderr=subprocess.STDOUT), f))

        def count_done():
            n = 0
            for p in OUT_DIR.rglob('*.json'):
                if p.name.startswith('manifest_'):
                    continue
                n += 1
            return n

        pbar = tqdm(total=len(audio_files), desc='Transcribing', unit='file')
        last = count_done()
        pbar.update(min(last, pbar.total))

        while any(proc.poll() is None for proc, _f in procs):
            time.sleep(2)
            cur = count_done()
            if cur > last:
                pbar.update(cur - last)
                last = cur

        exit_codes = []
        for proc, f in procs:
            exit_codes.append(proc.wait())
            f.close()

        cur = count_done()
        if cur > last:
            pbar.update(cur - last)
        pbar.close()

        log_event('transcribe', f'done seconds={time.time()-t_start:.1f} exit_codes={exit_codes} done={cur}/{len(audio_files)}')

        for lp in log_paths:
            if lp.exists():
                print(f'\n--- {lp.name} ---')
                txt = lp.read_text(encoding='utf-8', errors='replace').splitlines()
                print('\n'.join(txt[-40:]))

[2026-02-11 04:37:12] [INFO] [transcribe] files=1 model=small lang=en task=transcribe beam=5 temp=0.0
GPU: 1 | use_multi=False | workers=1


100%|████████████████████████████████████████| 461M/461M [00:01<00:00, 365MiB/s]


Transcribing:   0%|          | 0/1 [00:00<?, ?file/s]



[2026-02-11 04:42:06] [INFO] [transcribe] OK file=Joe Tsai, Co-Founder and Chairman, Alibaba： Find Your People_dbV1sMfv9TQ.mp4 seconds=286.1 lang=en
done. ok= 1 total= 1
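
In the multi-GPU branch, files are assigned to workers round-robin (`chunks[i % workers]`) and each worker process sees a single GPU through `CUDA_VISIBLE_DEVICES`. The assignment step on its own looks like this; the function name `round_robin` and the file names are illustrative.

```python
def round_robin(items, workers):
    """Distribute items across `workers` lists, one at a time in turn."""
    chunks = [[] for _ in range(workers)]
    for i, item in enumerate(items):
        chunks[i % workers].append(item)
    return chunks


chunks = round_robin(['a.mp3', 'b.mp3', 'c.mp3', 'd.mp3', 'e.mp3'], 2)
print(chunks)  # [['a.mp3', 'c.mp3', 'e.mp3'], ['b.mp3', 'd.mp3']]
```

Round-robin balances the number of files per GPU, not their duration, so one worker can still finish much later if it happens to receive the longest recordings.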


In [None]:
# ============================================================
# Step 3: Package + Download (zip)
# ============================================================

from google.colab import files

log_event('package', 'start')

zip_name = f'whisperf_run_{RUN_ID}.zip'
zip_path = RUN_DIR / zip_name

payload = []
payload += [p for p in OUT_DIR.rglob('*') if p.is_file()]
if TASK_LOG_PATH.exists():
    payload.append(TASK_LOG_PATH)
payload += [p for p in DL_LOG_DIR.rglob('*.txt') if p.is_file()]

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
    for fp in payload:
        try:
            if fp == TASK_LOG_PATH:
                zf.write(fp, arcname='task_log.txt')
            elif str(fp).startswith(str(OUT_DIR)):
                zf.write(fp, arcname=str(fp.relative_to(OUT_DIR)))
            elif str(fp).startswith(str(DL_LOG_DIR)):
                zf.write(fp, arcname=str(Path('download_logs') / fp.name))
            else:
                zf.write(fp, arcname=fp.name)
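        except Exception as e:
            # Record files that could not be added instead of dropping them silently.
            log_exception('package', f'file={fp}', e)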

size_mb = zip_path.stat().st_size / (1024 * 1024)
log_event('package', f'zip={zip_path} size_mb={size_mb:.1f} files={len(payload)}')
print('zip:', zip_path, f'({size_mb:.1f} MB)')

files.download(str(zip_path))

if DRIVE_RUN_DIR is not None:
    dst = DRIVE_RUN_DIR / zip_name
    shutil.copy2(zip_path, dst)
    print('saved to drive:', dst)


[2026-02-11 04:42:37] [INFO] [package] start
[2026-02-11 04:42:37] [INFO] [package] zip=/content/whisperf/runs/20260211_042412/whisperf_run_20260211_042412.zip size_mb=0.1 files=5
zip: /content/whisperf/runs/20260211_042412/whisperf_run_20260211_042412.zip (0.1 MB)
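
The `arcname` arguments in Step 3 decide how paths appear inside the zip: transcripts are stored relative to `OUT_DIR` and the task log sits at the top level. The sketch below reproduces that layout in a temporary directory so you can see the resulting member names. All file names are made up.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    out_root = run_dir / 'output'
    (out_root / 'talk').mkdir(parents=True)
    (out_root / 'talk' / 'talk.txt').write_text('hello', encoding='utf-8')
    log = run_dir / 'task_log.txt'
    log.write_text('[INFO] demo\n', encoding='utf-8')

    zip_path = run_dir / 'run.zip'
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for fp in out_root.rglob('*'):
            if fp.is_file():
                # Store transcripts relative to the output root, as Step 3 does.
                zf.write(fp, arcname=fp.relative_to(out_root).as_posix())
        zf.write(log, arcname='task_log.txt')

    with zipfile.ZipFile(zip_path) as zf:
        names = sorted(zf.namelist())

print(names)  # ['talk/talk.txt', 'task_log.txt']
```

Listing `namelist()` on the real archive before downloading is a quick way to confirm that every expected transcript was packaged.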

