
transcribe-cli

Local audio/video transcription with speaker diarization. No API keys. No cloud. One command.

Quickstart

npm (Node.js API + CLI):

npm install transcribe-cli

Shell (standalone CLI):

curl -sSL https://raw.githubusercontent.com/robit-man/transcribe-cli/main/install.sh | bash

Then:

transcribe audio.mp3
transcribe meeting.wav --model medium --diarize --format json
transcribe batch ./recordings --recursive --format srt

Features

  • 100% Local — Runs on your machine via faster-whisper (CTranslate2). No API keys, no cloud, no data leaves your system.
  • Speaker Diarization — Identify who said what with --diarize (via pyannote.audio)
  • Word-Level Timestamps — Precise per-word timing with --word-timestamps
  • Live Audio Streaming — Real-time transcription via Node.js streams
  • 4 Output Formats — txt, srt (with speaker labels), vtt (with W3C voice tags), json (full metadata)
  • Audio + Video — MP3, WAV, FLAC, AAC, M4A, OGG, WMA, MP4, MKV, AVI, MOV, WebM, FLV
  • Batch Processing — Process entire directories with configurable concurrency
  • 5 Model Sizes — tiny, base, small, medium, large-v3 (auto-downloads on first use)
  • Auto Audio Extraction — Videos are automatically handled via FFmpeg
  • Dual Interface — Use as CLI tool or Node.js API with full TypeScript types
  • Cross-Platform — Linux and macOS

Requirements

  • Python 3.9+
  • FFmpeg 4.0+
  • ~1 GB disk (for base model; large-v3 needs ~3 GB)

The install script handles all dependencies automatically.

Installation

One-Line Install (Recommended)

curl -sSL https://raw.githubusercontent.com/robit-man/transcribe-cli/main/install.sh | bash

This will:

  1. Install system dependencies (Python, FFmpeg, git) if missing
  2. Clone the repository to ~/.local/share/transcribe-cli
  3. Create a Python virtual environment with all packages
  4. Pre-download the default Whisper model (base)
  5. Create transcribe and transcribe-cli commands in ~/.local/bin
  6. Add ~/.local/bin to your PATH if needed

Environment variables (optional):

  • TRANSCRIBE_INSTALL_DIR — Custom install location (default: ~/.local/share/transcribe-cli)
  • TRANSCRIBE_MODEL — Model to pre-download (default: base)

Manual Install

git clone https://github.com/robit-man/transcribe-cli.git
cd transcribe-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

With Speaker Diarization

pip install -e ".[diarization]"

Usage

Transcribe a Single File

transcribe audio.mp3
transcribe video.mkv --format srt
transcribe recording.wav --output-dir ./transcripts
transcribe lecture.mp3 --model medium --language en

Speaker Diarization

transcribe meeting.wav --diarize --format srt
transcribe interview.mp3 --diarize --format json
transcribe podcast.mp3 --diarize --word-timestamps --format vtt

Batch Processing

transcribe batch ./recordings
transcribe batch ./videos --format srt --concurrency 3
transcribe batch ./media --recursive --dry-run
transcribe batch ./meetings --model medium --diarize --format json

Audio Extraction

transcribe extract video.mkv
transcribe extract video.mp4 --output audio.mp3
transcribe extract video.avi --format wav

Configuration

transcribe config --show        # Show current settings
transcribe config --init        # Create transcribe.toml in current directory
transcribe config --locations   # Show config file search paths

Dependency Check

transcribe setup --check

Node.js API

Install

npm install transcribe-cli

On install, the package automatically:

  1. Creates a Python virtual environment
  2. Installs faster-whisper and all Python dependencies
  3. Downloads the default Whisper model (base)

Set TRANSCRIBE_VERBOSE=1 to see setup progress. Set TRANSCRIBE_MODEL=medium to pre-download a different model.

File Transcription

const { transcribe, transcribeBatch, shutdownBridge } = require('transcribe-cli');

// Transcribe a single file
const result = await transcribe('meeting.mp3', {
  model: 'base',           // tiny, base, small, medium, large-v3
  diarize: true,           // speaker identification
  wordTimestamps: true,    // per-word timing
  format: 'json',          // txt, srt, vtt, json
  language: 'auto',        // or 'en', 'es', etc.
});

console.log(result.text);
console.log(result.speakers);    // ['SPEAKER_00', 'SPEAKER_01']
console.log(result.segments);    // [{id, start, end, text, speaker, words}]

// Batch transcribe a directory
const batch = await transcribeBatch('./recordings', {
  recursive: true,
  concurrency: 3,
  format: 'srt',
});
console.log(`${batch.successful}/${batch.totalFiles} files transcribed`);

// Clean up when done
await shutdownBridge();

Live Audio Streaming

const { TranscribeLive } = require('transcribe-cli');

const live = new TranscribeLive({
  model: 'base',
  sampleRate: 16000,    // Hz
  channels: 1,          // mono
  sampleWidth: 2,       // 16-bit
  chunkDuration: 5,     // seconds per chunk
  wordTimestamps: true,
});

live.on('ready', () => {
  console.log('Model loaded, streaming...');
});

live.on('transcript', (event) => {
  console.log(`[${event.isFinal ? 'FINAL' : 'partial'}] ${event.text}`);
  // event.segments has full timing + speaker info
});

// Feed raw PCM audio buffers
live.write(pcmBuffer);

// Or pipe from any readable stream (microphone, file, etc.)
audioSource.pipe(live.stream);

// Finish and flush remaining audio
await live.finish();

TypeScript

Full type definitions included:

import { transcribe, TranscribeLive, TranscriptionResult, LiveTranscriptEvent } from 'transcribe-cli';

const result: TranscriptionResult = await transcribe('audio.mp3', { diarize: true });

CLI Reference

transcribe <file> [OPTIONS]

Option             Short  Description                                 Default
--output-dir       -o     Output directory                            Current dir
--format           -f     Output format: txt, srt, vtt, json          txt
--language         -l     Language code or auto                       auto
--model            -m     Model: tiny, base, small, medium, large-v3  base
--diarize                 Enable speaker diarization                  Off
--word-timestamps         Enable word-level timestamps                Off
--verbose                 Verbose output                              Off

transcribe batch <directory> [OPTIONS]

All options from transcribe plus:

Option         Short  Description                       Default
--concurrency  -c     Max concurrent jobs (1-20)        5
--recursive    -r     Scan subdirectories               Off
--dry-run             Preview files without processing  Off
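The --concurrency cap can be pictured as a small promise pool: at most N jobs are in flight at once, and each finished job immediately pulls the next one. A minimal sketch of that idea (illustrative only, not the tool's actual implementation):

```javascript
// Run async job functions with at most `limit` in flight at a time.
// Illustrative sketch of the idea behind `--concurrency`.
async function runPool(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0;
  async function worker() {
    // Each worker pulls the next unclaimed job until none remain.
    while (next < jobs.length) {
      const i = next++;
      results[i] = await jobs[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, jobs.length) }, worker);
  await Promise.all(workers);
  return results;
}
```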

transcribe extract <file> [OPTIONS]

Option    Short  Description             Default
--output  -o     Output file path        Auto-generated
--format  -f     Audio format: mp3, wav  mp3

Output Formats

TXT — Plain text

Hello, welcome to the meeting. Today we'll discuss the quarterly results.

SRT — SubRip subtitles (with speaker labels when diarized)

1
00:00:00,000 --> 00:00:03,500
[SPEAKER_00] Hello, welcome to the meeting.

2
00:00:03,500 --> 00:00:07,200
[SPEAKER_01] Thanks for having me.

VTT — WebVTT (with W3C voice tags when diarized)

WEBVTT

00:00:00.000 --> 00:00:03.500
<v SPEAKER_00>Hello, welcome to the meeting.</v>

00:00:03.500 --> 00:00:07.200
<v SPEAKER_01>Thanks for having me.</v>
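Comparing the two subtitle samples: the timestamps differ only in the millisecond separator (comma in SRT, dot in VTT), and the speaker labels differ in syntax ([SPEAKER_xx] vs. a <v> voice tag). Two tiny helpers capture the mapping (illustrative, not part of the package):

```javascript
// Convert an SRT cue timestamp to WebVTT form: comma → dot.
function srtTimeToVtt(t) {
  return t.replace(',', '.');
}

// Convert an SRT speaker-labelled line to a WebVTT voice tag.
function srtLineToVtt(line) {
  const m = line.match(/^\[(\w+)\]\s*(.*)$/);
  return m ? `<v ${m[1]}>${m[2]}</v>` : line;
}
```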

JSON — Full metadata

{
  "text": "Hello, welcome to the meeting...",
  "language": "en",
  "duration": 120.5,
  "speakers": ["SPEAKER_00", "SPEAKER_01"],
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "SPEAKER_00",
      "words": [
        {"word": "Hello,", "start": 0.1, "end": 0.5},
        {"word": "welcome", "start": 0.6, "end": 1.0}
      ]
    }
  ]
}
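Given the JSON shape above, a common post-processing step is collecting each speaker's lines. A small sketch that assumes the result object shown (e.g. loaded from a --format json output):

```javascript
// Group segment text by speaker, given the JSON output shape shown above.
function textBySpeaker(result) {
  const bySpeaker = {};
  for (const seg of result.segments) {
    const key = seg.speaker ?? 'UNKNOWN'; // segments may lack a speaker without --diarize
    bySpeaker[key] = (bySpeaker[key] ? bySpeaker[key] + ' ' : '') + seg.text;
  }
  return bySpeaker;
}
```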

Configuration File

Create with transcribe config --init:

[output]
format = "txt"

[processing]
concurrency = 5
language = "auto"
recursive = false

[model]
size = "base"
device = "auto"
compute_type = "auto"

[features]
diarize = false
word_timestamps = false

Config files are searched in order:

  1. ./transcribe.toml
  2. ./.transcriberc
  3. ~/.config/transcribe/config.toml
  4. ~/.transcriberc

Environment Variables

Variable                    Description                               Default
TRANSCRIBE_MODEL_SIZE       Whisper model size                        base
TRANSCRIBE_DEVICE           Compute device (auto/cpu/cuda)            auto
TRANSCRIBE_COMPUTE_TYPE     Compute type (auto/int8/float16/float32)  auto
TRANSCRIBE_CONCURRENCY      Max concurrent batch jobs                 5
TRANSCRIBE_LANGUAGE         Default language                          auto
TRANSCRIBE_DIARIZE          Enable diarization by default             false
TRANSCRIBE_WORD_TIMESTAMPS  Enable word timestamps by default         false
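Reading these defaults typically means coercing strings to the right types (booleans for the flags, a number for concurrency). A hypothetical helper showing one way to do that; the package's own parsing may differ:

```javascript
// Read configuration defaults from the documented environment variables.
// Hypothetical sketch for illustration only.
function envDefaults(env = process.env) {
  const truthy = (v) => ['1', 'true', 'yes'].includes(String(v).toLowerCase());
  return {
    model: env.TRANSCRIBE_MODEL_SIZE || 'base',
    device: env.TRANSCRIBE_DEVICE || 'auto',
    concurrency: Number(env.TRANSCRIBE_CONCURRENCY) || 5,
    language: env.TRANSCRIBE_LANGUAGE || 'auto',
    diarize: truthy(env.TRANSCRIBE_DIARIZE),
    wordTimestamps: truthy(env.TRANSCRIBE_WORD_TIMESTAMPS),
  };
}
```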

Model Sizes

Model     Size     English    Multilingual  Speed
tiny      ~75 MB   Good       Fair          Fastest
base      ~150 MB  Better     Good          Fast
small     ~500 MB  Great      Great         Moderate
medium    ~1.5 GB  Excellent  Excellent     Slower
large-v3  ~3 GB    Best       Best          Slowest

Models are auto-downloaded on first use and cached locally.

Development

git clone https://github.com/robit-man/transcribe-cli.git
cd transcribe-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest

# Run tests without coverage
pytest tests/unit/ -v --no-cov

Uninstall

rm -rf ~/.local/share/transcribe-cli
rm -f ~/.local/bin/transcribe ~/.local/bin/transcribe-cli

License

MIT

Acknowledgments

Built on faster-whisper (CTranslate2), pyannote.audio, and FFmpeg.

About

CLI tool for transcribing audio and video files locally, with speaker diarization, using faster-whisper and pyannote.audio
