A fast, cross-platform, CPU-first, English-only video/audio transcriber that generates caption files with Whisper and CTranslate2, hosted on Hugging Face Spaces. A
pip-installable offline CLI tool with CUDA support is also provided. Voice Activity Detection (VAD) preprocessing is always enabled.
- Python 3.11
- 4 GB RAM
Simply cURL the endpoint as shown below. Currently, the only available caption formats are srt, vtt and txt.
curl "https://winstxnhdw-CapGen.hf.space/api/v1/transcribe?caption_format=$CAPTION_FORMAT" \
-F "file=@$AUDIO_FILE_PATH"
You can also redirect the output to a file.
curl "https://winstxnhdw-CapGen.hf.space/api/v1/transcribe?caption_format=$CAPTION_FORMAT" \
-F "file=@$AUDIO_FILE_PATH" | jq -r ".result" > result.srt
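The same request can be made from Python. The following is a minimal sketch using only the standard library, not an official client. It assumes the endpoint accepts a single multipart field named file (as in the cURL call) and returns JSON with a result key (as the jq filter suggests). The helper names build_url, encode_multipart and transcribe are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request
import uuid

BASE_URL = "https://winstxnhdw-CapGen.hf.space/api/v1/transcribe"
CAPTION_FORMATS = ("srt", "vtt", "txt")


def build_url(caption_format: str) -> str:
    """Return the transcribe URL for one of the supported caption formats."""
    if caption_format not in CAPTION_FORMATS:
        raise ValueError(f"unsupported caption format: {caption_format!r}")
    query = urllib.parse.urlencode({"caption_format": caption_format})
    return f"{BASE_URL}?{query}"


def encode_multipart(filename: str, payload: bytes) -> tuple[bytes, str]:
    """Encode one form field named "file", mirroring curl's -F "file=@..."."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path: str, caption_format: str, output_path: str) -> None:
    """Upload the file and write the "result" field of the response to disk."""
    with open(audio_path, "rb") as audio:
        body, content_type = encode_multipart(os.path.basename(audio_path), audio.read())
    request = urllib.request.Request(
        build_url(caption_format),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body is JSON with a "result" key.
        captions = json.load(response)["result"]
    with open(output_path, "w", encoding="utf-8") as output:
        output.write(captions)
```

Calling transcribe("audio.mp3", "srt", "result.srt") would then be roughly equivalent to the cURL and jq pipeline above.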
You can stream the captions in real time with the following.
curl -N "https://winstxnhdw-CapGen.hf.space/api/v1/transcribe/stream?caption_format=$CAPTION_FORMAT" \
-F "file=@$AUDIO_FILE_PATH"
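For a Python equivalent of the unbuffered cURL call, the response can be read line by line as it arrives. This is only a sketch: the wire format of the stream endpoint is not described here, so the code assumes nothing beyond UTF-8 text separated by newlines. The names iter_caption_lines and stream_captions are hypothetical, and body and content_type stand for a multipart/form-data upload prepared by the caller.

```python
import urllib.parse
import urllib.request
from typing import BinaryIO, Iterator

STREAM_URL = "https://winstxnhdw-CapGen.hf.space/api/v1/transcribe/stream"


def iter_caption_lines(stream: BinaryIO) -> Iterator[str]:
    """Yield decoded lines from a binary stream as soon as each one is read."""
    for raw_line in stream:
        yield raw_line.decode("utf-8", errors="replace").rstrip("\r\n")


def stream_captions(caption_format: str, body: bytes, content_type: str) -> None:
    """POST a prepared multipart body and print captions as they arrive."""
    query = urllib.parse.urlencode({"caption_format": caption_format})
    request = urllib.request.Request(
        f"{STREAM_URL}?{query}",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumption: the server sends newline-delimited text.
        for line in iter_caption_lines(response):
            print(line, flush=True)
```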
CapGen is available as a CLI tool with CUDA support. You can install it with pip.
pip install git+https://github.com/winstxnhdw/CapGen
You may also install CapGen with the necessary CUDA binaries.
pip install "capgen[cuda] @ git+https://github.com/winstxnhdw/CapGen"
Now, you can run the CLI tool with the following command.
capgen -c srt -o ./result.srt --cuda < ~/Downloads/audio.mp3
usage: capgen [-h] [-g] [-t] [-w] -c -o [file]

transcribe a compatible audio/video file into a chosen caption file format

positional arguments:
  file            the file path to a compatible audio/video

options:
  -h, --help      show this help message and exit
  -g, --cuda      whether to use CUDA for inference
  -t, --threads   the number of CPU threads
  -w, --workers   the number of CPU workers
  -c, --caption   the chosen caption file format
  -o, --output    the output file path
You can install the required dependencies for your editor with the following.
poetry install
You can spin the server up locally with the following. You can access the Swagger UI at localhost:7860/api/docs.
docker build -f Dockerfile.build -t capgen .
docker run --rm -e SERVER_PORT=7860 -p 7860:7860 capgen