Local audio/video transcription with speaker diarization. No API keys. No cloud. One command.
npm (Node.js API + CLI):

```shell
npm install transcribe-cli
```

Shell (standalone CLI):

```shell
curl -sSL https://raw.githubusercontent.com/robit-man/transcribe-cli/main/install.sh | bash
```

Then:

```shell
transcribe audio.mp3
transcribe meeting.wav --model medium --diarize --format json
transcribe batch ./recordings --recursive --format srt
```

- 100% Local — Runs on your machine via faster-whisper (CTranslate2). No API keys, no cloud, no data leaves your system.
- Speaker Diarization — Identify who said what with `--diarize` (via pyannote.audio)
- Word-Level Timestamps — Precise per-word timing with `--word-timestamps`
- Live Audio Streaming — Real-time transcription via Node.js streams
- 4 Output Formats — `txt`, `srt` (with speaker labels), `vtt` (with W3C voice tags), `json` (full metadata)
- Audio + Video — MP3, WAV, FLAC, AAC, M4A, OGG, WMA, MP4, MKV, AVI, MOV, WebM, FLV
- Batch Processing — Process entire directories with configurable concurrency
- 5 Model Sizes — `tiny`, `base`, `small`, `medium`, `large-v3` (auto-downloads on first use)
- Auto Audio Extraction — Videos are automatically handled via FFmpeg
- Dual Interface — Use as a CLI tool or Node.js API with full TypeScript types
- Cross-Platform — Linux and macOS
Requirements:

- Python 3.9+
- FFmpeg 4.0+
- ~1 GB disk (for the `base` model; `large-v3` needs ~3 GB)
The install script handles all dependencies automatically.
```shell
curl -sSL https://raw.githubusercontent.com/robit-man/transcribe-cli/main/install.sh | bash
```

This will:

- Install system dependencies (Python, FFmpeg, git) if missing
- Clone the repository to `~/.local/share/transcribe-cli`
- Create a Python virtual environment with all packages
- Pre-download the default Whisper model (`base`)
- Create `transcribe` and `transcribe-cli` commands in `~/.local/bin`
- Add `~/.local/bin` to your PATH if needed

Environment variables (optional):

- `TRANSCRIBE_INSTALL_DIR` — Custom install location (default: `~/.local/share/transcribe-cli`)
- `TRANSCRIBE_MODEL` — Model to pre-download (default: `base`)
```shell
git clone https://github.com/robit-man/transcribe-cli.git
cd transcribe-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

For speaker diarization support:

```shell
pip install -e ".[diarization]"
```

Basic transcription:

```shell
transcribe audio.mp3
transcribe video.mkv --format srt
transcribe recording.wav --output-dir ./transcripts
transcribe lecture.mp3 --model medium --language en
```

Speaker diarization:

```shell
transcribe meeting.wav --diarize --format srt
transcribe interview.mp3 --diarize --format json
transcribe podcast.mp3 --diarize --word-timestamps --format vtt
```

Batch processing:

```shell
transcribe batch ./recordings
transcribe batch ./videos --format srt --concurrency 3
transcribe batch ./media --recursive --dry-run
transcribe batch ./meetings --model medium --diarize --format json
```

Audio extraction:

```shell
transcribe extract video.mkv
transcribe extract video.mp4 --output audio.mp3
transcribe extract video.avi --format wav
```

Configuration:

```shell
transcribe config --show       # Show current settings
transcribe config --init       # Create transcribe.toml in current directory
transcribe config --locations  # Show config file search paths
```

Verify the installation:

```shell
transcribe setup --check
```

Or install the npm package (Node.js API + CLI):

```shell
npm install transcribe-cli
```

On install, the package automatically:
- Creates a Python virtual environment
- Installs faster-whisper and all Python dependencies
- Downloads the default Whisper model (`base`)
Set `TRANSCRIBE_VERBOSE=1` to see setup progress; set `TRANSCRIBE_MODEL=medium` to pre-download a different model.
```javascript
const { transcribe, transcribeBatch, shutdownBridge } = require('transcribe-cli');

// Transcribe a single file
const result = await transcribe('meeting.mp3', {
  model: 'base',        // tiny, base, small, medium, large-v3
  diarize: true,        // speaker identification
  wordTimestamps: true, // per-word timing
  format: 'json',       // txt, srt, vtt, json
  language: 'auto',     // or 'en', 'es', etc.
});

console.log(result.text);
console.log(result.speakers); // ['SPEAKER_00', 'SPEAKER_01']
console.log(result.segments); // [{id, start, end, text, speaker, words}]

// Batch transcribe a directory
const batch = await transcribeBatch('./recordings', {
  recursive: true,
  concurrency: 3,
  format: 'srt',
});
console.log(`${batch.successful}/${batch.totalFiles} files transcribed`);

// Clean up when done
await shutdownBridge();
```

Live streaming:

```javascript
const { TranscribeLive } = require('transcribe-cli');

const live = new TranscribeLive({
  model: 'base',
  sampleRate: 16000,   // Hz
  channels: 1,         // mono
  sampleWidth: 2,      // 16-bit
  chunkDuration: 5,    // seconds per chunk
  wordTimestamps: true,
});

live.on('ready', () => {
  console.log('Model loaded, streaming...');
});

live.on('transcript', (event) => {
  console.log(`[${event.isFinal ? 'FINAL' : 'partial'}] ${event.text}`);
  // event.segments has full timing + speaker info
});

// Feed raw PCM audio buffers
// (with these settings, each chunk is 16000 Hz * 1 channel * 2 bytes * 5 s = 160,000 bytes)
live.write(pcmBuffer);

// Or pipe from any readable stream (microphone, file, etc.)
audioSource.pipe(live.stream);

// Finish and flush remaining audio
await live.finish();
```

Full type definitions included:
```typescript
import { transcribe, TranscribeLive, TranscriptionResult, LiveTranscriptEvent } from 'transcribe-cli';

const result: TranscriptionResult = await transcribe('audio.mp3', { diarize: true });
```

`transcribe` options:

| Option | Short | Description | Default |
|---|---|---|---|
| `--output-dir` | `-o` | Output directory | Current dir |
| `--format` | `-f` | Output format: `txt`, `srt`, `vtt`, `json` | `txt` |
| `--language` | `-l` | Language code or `auto` | `auto` |
| `--model` | `-m` | Model: `tiny`, `base`, `small`, `medium`, `large-v3` | `base` |
| `--diarize` | | Enable speaker diarization | Off |
| `--word-timestamps` | | Enable word-level timestamps | Off |
| `--verbose` | | Verbose output | Off |
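For reference, these flags correspond to the Node.js API options used earlier. The mapping below is inferred from the API example in this README, not an exhaustive list:

```javascript
// CLI:  transcribe lecture.mp3 --model medium --language en --diarize --word-timestamps --format json
// Equivalent API options (names as shown in the API example above):
const options = {
  model: 'medium',
  language: 'en',
  diarize: true,
  wordTimestamps: true,
  format: 'json',
};
console.log(options.model); // medium
```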
All options from `transcribe` plus:

| Option | Short | Description | Default |
|---|---|---|---|
| `--concurrency` | `-c` | Max concurrent jobs (1-20) | 5 |
| `--recursive` | `-r` | Scan subdirectories | Off |
| `--dry-run` | | Preview files without processing | Off |
`transcribe extract` options:

| Option | Short | Description | Default |
|---|---|---|---|
| `--output` | `-o` | Output file path | Auto-generated |
| `--format` | `-f` | Audio format: `mp3`, `wav` | `mp3` |
`txt`:

```
Hello, welcome to the meeting. Today we'll discuss the quarterly results.
```

`srt` (with `--diarize`):

```
1
00:00:00,000 --> 00:00:03,500
[SPEAKER_00] Hello, welcome to the meeting.

2
00:00:03,500 --> 00:00:07,200
[SPEAKER_01] Thanks for having me.
```
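Note that SRT cue times use a comma before the milliseconds, while WebVTT uses a dot. A minimal sketch of producing either form from a segment time in seconds (this helper is illustrative, not part of the package):

```javascript
// Format a time in seconds as an SRT ("00:00:03,500") or VTT ("00:00:03.500") timestamp.
function formatTimestamp(seconds, style = 'srt') {
  const ms = Math.round(seconds * 1000);
  const h = String(Math.floor(ms / 3600000)).padStart(2, '0');
  const m = String(Math.floor((ms % 3600000) / 60000)).padStart(2, '0');
  const s = String(Math.floor((ms % 60000) / 1000)).padStart(2, '0');
  const frac = String(ms % 1000).padStart(3, '0');
  return `${h}:${m}:${s}${style === 'srt' ? ',' : '.'}${frac}`;
}

console.log(formatTimestamp(3.5));        // 00:00:03,500
console.log(formatTimestamp(3.5, 'vtt')); // 00:00:03.500
```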
`vtt`:

```
WEBVTT

00:00:00.000 --> 00:00:03.500
<v SPEAKER_00>Hello, welcome to the meeting.</v>

00:00:03.500 --> 00:00:07.200
<v SPEAKER_01>Thanks for having me.</v>
```
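The `json` format (shown next) is the easiest to post-process. As an illustrative sketch, totalling speaking time per speaker from `segments` shaped like the sample below (`speakerTotals` is a hypothetical helper, not part of the package):

```javascript
// Sum per-speaker speaking time from `segments` as produced by --format json.
// The segment shape (start, end, speaker) matches the JSON sample in this README.
function speakerTotals(segments) {
  const totals = {};
  for (const { start, end, speaker } of segments) {
    totals[speaker] = (totals[speaker] ?? 0) + (end - start);
  }
  return totals;
}

const segments = [
  { id: 0, start: 0.0, end: 3.5, speaker: 'SPEAKER_00' },
  { id: 1, start: 3.5, end: 7.2, speaker: 'SPEAKER_01' },
];
console.log(speakerTotals(segments));
```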
`json`:

```json
{
  "text": "Hello, welcome to the meeting...",
  "language": "en",
  "duration": 120.5,
  "speakers": ["SPEAKER_00", "SPEAKER_01"],
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, welcome to the meeting.",
      "speaker": "SPEAKER_00",
      "words": [
        {"word": "Hello,", "start": 0.1, "end": 0.5},
        {"word": "welcome", "start": 0.6, "end": 1.0}
      ]
    }
  ]
}
```

Create with `transcribe config --init`:
```toml
[output]
format = "txt"

[processing]
concurrency = 5
language = "auto"
recursive = false

[model]
size = "base"
device = "auto"
compute_type = "auto"

[features]
diarize = false
word_timestamps = false
```

Config files are searched in order:

1. `./transcribe.toml`
2. `./.transcriberc`
3. `~/.config/transcribe/config.toml`
4. `~/.transcriberc`
| Variable | Description | Default |
|---|---|---|
| `TRANSCRIBE_MODEL_SIZE` | Whisper model size | `base` |
| `TRANSCRIBE_DEVICE` | Compute device (`auto`/`cpu`/`cuda`) | `auto` |
| `TRANSCRIBE_COMPUTE_TYPE` | Compute type (`auto`/`int8`/`float16`/`float32`) | `auto` |
| `TRANSCRIBE_CONCURRENCY` | Max concurrent batch jobs | 5 |
| `TRANSCRIBE_LANGUAGE` | Default language | `auto` |
| `TRANSCRIBE_DIARIZE` | Enable diarization by default | `false` |
| `TRANSCRIBE_WORD_TIMESTAMPS` | Enable word timestamps by default | `false` |
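Each variable falls back to its listed default when unset. The resolution sketch below is an assumption for illustration (`envOr` is hypothetical; the package resolves these internally):

```javascript
// Hypothetical illustration of environment-variable defaulting.
function envOr(name, fallback) {
  const value = process.env[name];
  return value === undefined || value === '' ? fallback : value;
}

const modelSize = envOr('TRANSCRIBE_MODEL_SIZE', 'base'); // 'base' unless overridden
const device = envOr('TRANSCRIBE_DEVICE', 'auto');
console.log(modelSize, device);
```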
| Model | Size | English | Multilingual | Speed |
|---|---|---|---|---|
| `tiny` | ~75 MB | Good | Fair | Fastest |
| `base` | ~150 MB | Better | Good | Fast |
| `small` | ~500 MB | Great | Great | Moderate |
| `medium` | ~1.5 GB | Excellent | Excellent | Slower |
| `large-v3` | ~3 GB | Best | Best | Slowest |
Models are auto-downloaded on first use and cached locally.
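Using the approximate sizes above, you could pick the largest model that fits a disk budget. This helper and its thresholds are illustrative assumptions, not part of the CLI:

```javascript
// Approximate on-disk model sizes in MB, taken from the table above.
const MODEL_SIZES_MB = { tiny: 75, base: 150, small: 500, medium: 1500, 'large-v3': 3000 };

// Return the name of the largest model that fits within budgetMb, or null.
function largestModelUnder(budgetMb) {
  const fitting = Object.entries(MODEL_SIZES_MB)
    .filter(([, size]) => size <= budgetMb)
    .sort((a, b) => b[1] - a[1]);
  return fitting.length ? fitting[0][0] : null;
}

console.log(largestModelUnder(1000)); // small
```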
```shell
git clone https://github.com/robit-man/transcribe-cli.git
cd transcribe-cli
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
pytest

# Run tests without coverage
pytest tests/unit/ -v --no-cov
```

Uninstall:

```shell
rm -rf ~/.local/share/transcribe-cli
rm -f ~/.local/bin/transcribe ~/.local/bin/transcribe-cli
```

License: MIT
- faster-whisper — CTranslate2 Whisper implementation
- pyannote.audio — Speaker diarization
- FFmpeg — Audio/video processing
- Typer — CLI framework
- Rich — Terminal formatting