🎙️ 100% Local AI Transcription with Speaker Diarization. No API key, no cloud, no cost. Supports 99+ languages, dual engine (CPU + Apple Silicon GPU), exports to SRT/TXT/DOCX.

romizone/transcribeAI

🎙️ TranscribeAI

100% Local AI Transcription with Speaker Diarization
No API key. No cloud. No cost. Runs completely offline on your machine.


✨ Features

| Feature | Description |
| --- | --- |
| ⚡ Dual Engine | faster-whisper (CPU) + mlx-whisper (Apple Silicon GPU, 2-5x faster) |
| 🗣️ Speaker Diarization | Auto-identifies Speaker 1, 2, 3... using MFCC + Agglomerative Clustering |
| 🌍 99+ Languages | 99+ languages, including Indonesian and English, with auto-detection |
| 📁 Multi-Format | Input: MP3, MP4, WAV, M4A, OGG, FLAC, WEBM → Output: SRT, TXT, DOCX |
| 🧠 5 AI Models | tiny (39M) → large-v3 (1.5B): choose speed vs accuracy |
| 📊 Smart Progress | 5-stage: Upload → Model → Transcription → Speaker ID → Export |
| 💾 Auto Cache | Downloads each model once, then loads it from the local cache |
| 🌙 Dark Theme UI | Professional web UI with audio player, search & drag-drop |
| 🔒 100% Offline | Zero data leaves your machine. Your audio stays yours. |

🚀 Quick Start

🍎 macOS (Apple Silicon: M1/M2/M3/M4)

git clone https://github.com/romizone/transcribeAI.git
cd transcribeAI

# ⚙️ Setup
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install mlx-whisper          # 🔥 GPU acceleration

# ▶️ Run
python3 app.py

🌐 Open http://localhost:8080 in your browser

🍎 macOS (Intel) / 🐧 Linux

git clone https://github.com/romizone/transcribeAI.git
cd transcribeAI

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

python3 app.py

🪟 Windows

git clone https://github.com/romizone/transcribeAI.git
cd transcribeAI

python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

python app.py

💡 Or use the setup scripts: ./setup.sh (macOS/Linux) or setup.bat (Windows)


📥 Pre-download Models

Download models ahead of time so transcription starts instantly:

source venv/bin/activate

python3 download_models.py small    # 📦 Download recommended model
python3 download_models.py all      # 📦 Download all models
python3 download_models.py          # 📋 Check download status

🍎 On Apple Silicon, MLX models are auto-downloaded for GPU acceleration


🧠 Models

| Model | Params | Size | Speed | Best For |
| --- | --- | --- | --- | --- |
| tiny | 39M | ~75 MB | ⚡⚡⚡⚡⚡ | Quick drafts, short clips |
| base | 74M | ~145 MB | ⚡⚡⚡⚡ | Casual transcription |
| small ⭐ | 244M | ~465 MB | ⚡⚡⚡ | Recommended: best balance |
| medium | 769M | ~1.5 GB | ⚡⚡ | Higher accuracy needed |
| large-v3 | 1550M | ~2.9 GB | ⚡ | Maximum accuracy |
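
As a rough guide to disk usage, the approximate sizes in the table above can be summed to estimate what `python3 download_models.py all` would fetch. This is only a sketch: `MODEL_SIZES_MB` and `total_download_mb` are hypothetical names, the figures are the table's approximations (1 GB taken as 1000 MB), and any additional MLX copies on Apple Silicon are not counted.

```python
# Approximate download sizes from the Models table, in MB (1 GB taken as 1000 MB).
MODEL_SIZES_MB = {
    "tiny": 75,
    "base": 145,
    "small": 465,
    "medium": 1500,
    "large-v3": 2900,
}


def total_download_mb(models):
    """Sum the approximate sizes of the named models."""
    return sum(MODEL_SIZES_MB[name] for name in models)


if __name__ == "__main__":
    print(total_download_mb(["small"]))       # recommended model only
    print(total_download_mb(MODEL_SIZES_MB))  # every model in the table
```

By these figures, downloading every model needs roughly 5 GB of free space, while the recommended small model needs under 0.5 GB.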

🔧 Engines

| Engine | Device | Speed | Install |
| --- | --- | --- | --- |
| 🔥 mlx-whisper | Apple Silicon GPU | 2-5x faster | pip install mlx-whisper |
| 🖥️ faster-whisper | CPU (all platforms) | Baseline | Included in requirements.txt |

🤖 The app auto-detects Apple Silicon and defaults to mlx-whisper when available
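
One way such auto-detection could work is sketched below. It is not taken from app.py: `is_apple_silicon` and `pick_engine` are hypothetical helpers, and the check assumes a native arm64 Python (under Rosetta, `platform.machine()` reports `x86_64`, so the sketch would fall back to faster-whisper).

```python
import importlib.util
import platform


def is_apple_silicon(system=None, machine=None):
    """Return True for macOS running natively on an arm64 (M-series) CPU."""
    system = system or platform.system()
    machine = machine or platform.machine()
    return system == "Darwin" and machine == "arm64"


def pick_engine(system=None, machine=None, mlx_installed=None):
    """Prefer mlx-whisper on Apple Silicon when it is importable, else faster-whisper."""
    if mlx_installed is None:
        # find_spec returns None when the package is not installed
        mlx_installed = importlib.util.find_spec("mlx_whisper") is not None
    if is_apple_silicon(system, machine) and mlx_installed:
        return "mlx-whisper"
    return "faster-whisper"


if __name__ == "__main__":
    print(pick_engine())
```

The optional arguments exist only so the decision can be exercised on any machine; calling `pick_engine()` with no arguments inspects the current host.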


💻 CLI Usage

Transcribe directly from the terminal, no browser needed:

# 🎵 Simple transcription
python3 transcribe_cli.py audio.mp3

# 🇮🇩 Indonesian, medium model, 3 speakers
python3 transcribe_cli.py video.mp4 --language id --model medium --speakers 3

# 📂 Custom output folder + multiple formats
python3 transcribe_cli.py audio.wav --output ./results --format srt txt docx
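
The flags used above can be modelled with `argparse` as in the following sketch. It illustrates the command-line shape only and is not the actual contents of transcribe_cli.py: `build_parser`, the default values, and the help strings are all assumptions.

```python
import argparse

MODELS = ["tiny", "base", "small", "medium", "large-v3"]
FORMATS = ["srt", "txt", "docx"]


def build_parser():
    """Build a parser accepting the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="transcribe_cli.py")
    parser.add_argument("input", help="audio or video file to transcribe")
    # Assumed default: no language given means auto-detect.
    parser.add_argument("--language", default=None, help="language code, e.g. 'id'")
    # Assumed default: the recommended 'small' model.
    parser.add_argument("--model", choices=MODELS, default="small")
    parser.add_argument("--speakers", type=int, default=None, help="expected speaker count")
    parser.add_argument("--output", default=".", help="output folder")
    # nargs="+" allows several formats after one flag: --format srt txt docx
    parser.add_argument("--format", nargs="+", choices=FORMATS, default=["srt"], dest="formats")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Using `nargs="+"` is what lets a single `--format` flag take several values, matching the third example above.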

📂 Project Structure

transcribeAI/
├── 🐍 app.py               # Flask backend (dual engine, diarization, API)
├── 📄 templates/
│   └── index.html           # Web UI (dark theme, progress, audio player)
├── 🖥️ transcribe_cli.py     # CLI version
├── 📥 download_models.py    # Pre-download models for offline use
├── 📋 requirements.txt      # Python dependencies
├── ⚙️ setup.sh / setup.bat  # Setup scripts
├── ▶️ run.sh / run.bat      # Run scripts
└── 🔧 .env.example          # Configuration template

⚙️ How It Works

🎤 Audio Input
    │
    ▼
🧠 Whisper Transcription
    │  faster-whisper (CTranslate2 INT8)
    │  mlx-whisper (Apple MLX GPU)
    │
    ▼
🔇 VAD Filter
    │  Silero VAD removes silence
    │
    ▼
🗣️ Speaker Diarization
    │  MFCC (20 coeff) + Delta + Spectral + Pitch
    │  → StandardScaler → Agglomerative Clustering
    │
    ▼
📄 Export
    ├── 🎬 SRT (subtitles)
    ├── 📝 TXT (readable transcript)
    └── 📑 DOCX (formatted document)
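
To make the Export stage concrete, here is a minimal sketch of how speaker-labelled segments could be rendered as SRT using only the standard library. The function names, the `(start, end, speaker, text)` tuple layout, and the `Speaker N: text` cue style are assumptions for illustration, not the project's actual exporter.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, speaker, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)


if __name__ == "__main__":
    demo = [(0.0, 2.5, "Speaker 1", "Hello there."), (2.5, 4.0, "Speaker 2", "Hi!")]
    print(to_srt(demo))
```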

🛠️ Troubleshooting

| Problem | Solution |
| --- | --- |
| 🚫 Port 5000 in use (macOS) | AirPlay uses port 5000. TranscribeAI uses port 8080 by default, so open http://localhost:8080 |
| ❌ ModuleNotFoundError | Activate the venv first: source venv/bin/activate |
| ⚠️ python3 points to the wrong interpreter | Use the venv's interpreter directly: ./venv/bin/python3 app.py |
| ⏳ Stuck at "Memuat model..." (Indonesian for "Loading model...") | The first run downloads a ~465 MB model (one-time). Pre-download it with: python3 download_models.py small |

🏗️ Tech Stack

| Layer | Technology |
| --- | --- |
| 🖥️ Backend | Flask, faster-whisper, mlx-whisper |
| 🎵 Audio | librosa, numpy, pydub |
| 🗣️ Speaker ID | scikit-learn (Agglomerative Clustering) |
| 📄 Export | python-docx |
| 🎨 Frontend | Vanilla HTML/CSS/JS (zero framework dependencies) |
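
The Audio and Speaker ID rows combine roughly as in the sketch below: librosa summarizes each transcribed segment as a feature vector, and scikit-learn scales and clusters those vectors into speakers. This is a simplified illustration, not the code in app.py: `segment_features` and `assign_speakers` are hypothetical names, only MFCC and delta features are used (the spectral and pitch features are omitted), and the number of speakers is assumed to be known.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler


def segment_features(y, sr):
    """Summarize one audio segment as mean MFCC + mean delta-MFCC (40 values)."""
    import librosa  # imported lazily so the clustering step runs without it

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    # mode="nearest" avoids the minimum-length check of the default "interp" mode
    delta = librosa.feature.delta(mfcc, mode="nearest")
    return np.concatenate([mfcc.mean(axis=1), delta.mean(axis=1)])


def assign_speakers(features, n_speakers):
    """Scale per-segment features, cluster them, and return 'Speaker N' labels."""
    scaled = StandardScaler().fit_transform(features)
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(scaled)
    return [f"Speaker {label + 1}" for label in labels]


if __name__ == "__main__":
    # Synthetic stand-in for real features: two well-separated groups of segments.
    rng = np.random.default_rng(0)
    fake = np.vstack([rng.normal(0, 0.1, (5, 40)), rng.normal(5, 0.1, (5, 40))])
    print(assign_speakers(fake, n_speakers=2))
```

Cluster numbers are arbitrary, so "Speaker 1" is simply whichever cluster receives label 0, not necessarily the first person to speak.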

📜 License

MIT License: free for personal and commercial use.


🇮🇩 Made in Jakarta, Indonesia
Built with ❤️ by Romi Nur Ismanto · @romizone

⭐ Star this repo if you find it useful!
