100% Local AI Transcription with Speaker Diarization
No API key. No cloud. No cost. Runs completely offline on your machine.
| | Feature | Description |
|---|---|---|
| ⚡ | Dual Engine | faster-whisper (CPU) + mlx-whisper (Apple Silicon GPU, 2-5x faster) |
| 🗣️ | Speaker Diarization | Auto-identifies Speaker 1, 2, 3... using MFCC + Agglomerative Clustering |
| 🌍 | 99+ Languages | Indonesian, English, and 99+ languages with auto-detection |
| 📄 | Multi-Format | Input: MP3, MP4, WAV, M4A, OGG, FLAC, WEBM → Output: SRT, TXT, DOCX |
| 🧠 | 5 AI Models | tiny (39M) → large-v3 (1.5B): choose speed vs. accuracy |
| 📊 | Smart Progress | 5 stages: Upload → Model → Transcription → Speaker ID → Export |
| 💾 | Auto Cache | Downloads each model once, then loads instantly from the local cache |
| 🌙 | Dark Theme UI | Professional web UI with audio player, search, and drag-and-drop |
| 🔒 | 100% Offline | Zero data leaves your machine. Your audio stays yours. |
**macOS (Apple Silicon)**

```bash
git clone https://github.com/romizone/transcribeAI.git
cd transcribeAI

# ⚙️ Setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install mlx-whisper   # 🔥 GPU acceleration

# ▶️ Run
python3 app.py
```

**macOS (Intel) / Linux**

```bash
git clone https://github.com/romizone/transcribeAI.git
cd transcribeAI
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 app.py
```

**Windows**

```bash
git clone https://github.com/romizone/transcribeAI.git
cd transcribeAI
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python app.py
```

🌐 Open http://localhost:8080 in your browser.

💡 Or use the setup scripts: `./setup.sh` (macOS/Linux) or `setup.bat` (Windows).
Download models ahead of time so transcription starts instantly:
```bash
source venv/bin/activate
python3 download_models.py small   # 📦 Download the recommended model
python3 download_models.py all     # 📦 Download all models
python3 download_models.py         # 🔍 Check download status
```

🍎 On Apple Silicon, MLX models are auto-downloaded for GPU acceleration.
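Since faster-whisper fetches models through huggingface_hub, cached snapshots typically land under `~/.cache/huggingface/hub`. A minimal sketch of checking what is already downloaded; the cache path and the `downloaded_whisper_models` helper are assumptions for illustration, not the project's actual code:

```python
from pathlib import Path

# Assumed default cache location used by huggingface_hub
# (which faster-whisper builds on for model downloads).
CACHE = Path.home() / ".cache" / "huggingface" / "hub"

def downloaded_whisper_models(cache_dir: Path = CACHE) -> list[str]:
    """List locally cached Whisper model snapshot directories, if any."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if "whisper" in p.name.lower())

print(downloaded_whisper_models())
```

An empty list simply means the first transcription will trigger the one-time download.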
| Model | Params | Size | Speed | Best For |
|---|---|---|---|---|
| tiny | 39M | ~75 MB | ⚡⚡⚡⚡⚡ | Quick drafts, short clips |
| base | 74M | ~145 MB | ⚡⚡⚡⚡ | Casual transcription |
| small ⭐ | 244M | ~465 MB | ⚡⚡⚡ | Recommended: best balance |
| medium | 769M | ~1.5 GB | ⚡⚡ | Higher accuracy needed |
| large-v3 | 1550M | ~2.9 GB | ⚡ | Maximum accuracy |
| Engine | Device | Speed | Install |
|---|---|---|---|
| 🔥 mlx-whisper | Apple Silicon GPU | 2-5x faster | `pip install mlx-whisper` |
| 🖥️ faster-whisper | CPU (all platforms) | Baseline | Included in requirements.txt |

🤖 The app auto-detects Apple Silicon and defaults to mlx-whisper when available.
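The auto-detection above can be sketched roughly like this; `pick_engine` is a hypothetical helper for illustration, not the app's actual function:

```python
import importlib.util
import platform

def pick_engine() -> str:
    """Prefer mlx-whisper on Apple Silicon when the package is installed."""
    apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    if apple_silicon and importlib.util.find_spec("mlx_whisper") is not None:
        return "mlx-whisper"
    return "faster-whisper"  # CPU fallback, works on all platforms

print(pick_engine())
```

Checking `find_spec` rather than importing keeps startup cheap when the optional GPU package is absent.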
Transcribe directly from the terminal, no browser needed:
```bash
# 🎵 Simple transcription
python3 transcribe_cli.py audio.mp3

# 🇮🇩 Indonesian, medium model, 3 speakers
python3 transcribe_cli.py video.mp4 --language id --model medium --speakers 3

# 📁 Custom output folder + multiple formats
python3 transcribe_cli.py audio.wav --output ./results --format srt txt docx
```

```
transcribeAI/
├── 🐍 app.py                 # Flask backend (dual engine, diarization, API)
├── 📁 templates/
│   └── index.html            # Web UI (dark theme, progress, audio player)
├── 🖥️ transcribe_cli.py      # CLI version
├── 📥 download_models.py     # Pre-download models for offline use
├── 📋 requirements.txt       # Python dependencies
├── ⚙️ setup.sh / setup.bat   # Setup scripts
├── ▶️ run.sh / run.bat       # Run scripts
└── 🔧 .env.example           # Configuration template
```
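The `srt` output format uses standard `HH:MM:SS,mmm` subtitle timestamps. A minimal sketch of a cue formatter; `srt_timestamp` and `srt_block` are hypothetical helpers, not the project's actual exporter:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT HH:MM:SS,mmm timestamp."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_block(index: int, start: float, end: float, speaker: str, text: str) -> str:
    """One numbered SRT cue, with the speaker label prefixed to the text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{speaker}: {text}\n"

print(srt_block(1, 0.0, 2.5, "Speaker 1", "Hello there."))
```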
```
🎤 Audio Input
   │
   ▼
🧠 Whisper Transcription
   │  faster-whisper (CTranslate2 INT8)
   │  mlx-whisper (Apple MLX GPU)
   │
   ▼
🔇 VAD Filter
   │  Silero VAD removes silence
   │
   ▼
🗣️ Speaker Diarization
   │  MFCC (20 coeff) + Delta + Spectral + Pitch
   │  → StandardScaler → Agglomerative Clustering
   │
   ▼
📤 Export
   ├── 🎬 SRT (subtitles)
   ├── 📝 TXT (readable transcript)
   └── 📄 DOCX (formatted document)
```
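The diarization stage above (feature scaling followed by agglomerative clustering) can be sketched as follows. This is a simplified illustration: synthetic per-segment vectors stand in for the real MFCC/delta/spectral/pitch features, and `cluster_speakers` is a hypothetical helper, not the app's actual code:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

def cluster_speakers(features: np.ndarray, n_speakers: int) -> np.ndarray:
    """Assign a 1-based speaker label to each segment's feature vector."""
    scaled = StandardScaler().fit_transform(features)
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(scaled)
    return labels + 1  # Speaker 1, Speaker 2, ...

# Synthetic demo: ten "segments" from two well-separated voices.
rng = np.random.default_rng(0)
voice_a = rng.normal(0.0, 0.1, size=(5, 20))
voice_b = rng.normal(5.0, 0.1, size=(5, 20))
labels = cluster_speakers(np.vstack([voice_a, voice_b]), n_speakers=2)
print(labels)
```

Agglomerative clustering needs the speaker count up front, which is why the CLI exposes a `--speakers` flag.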
| Problem | Solution |
|---|---|
| 🚫 Port 5000 in use (macOS) | AirPlay uses port 5000. TranscribeAI uses port 8080 by default |
| ❌ ModuleNotFoundError | Activate the venv first: `source venv/bin/activate` |
| ❌ `python3` aliased wrong | Use the venv directly: `./venv/bin/python3 app.py` |
| ⏳ Stuck at "Memuat model..." ("Loading model...") | First run downloads a ~465 MB model (one-time). Pre-download: `python3 download_models.py small` |
| Layer | Technology |
|---|---|
| 🖥️ Backend | Flask, faster-whisper, mlx-whisper |
| 🎵 Audio | librosa, numpy, pydub |
| 🗣️ Speaker ID | scikit-learn (Agglomerative Clustering) |
| 📄 Export | python-docx |
| 🎨 Frontend | Vanilla HTML/CSS/JS (zero framework dependencies) |
MIT License: free for personal and commercial use.

🇮🇩 Made in Jakarta, Indonesia

Built with ❤️ by Romi Nur Ismanto · @romizone

⭐ Star this repo if you find it useful!