AutoAudioBook ingests PDF or DOCX books, extracts the text, creates a reviewable annotation artifact with Gemini inline TTS tags, previews chunking, and can generate per-chunk WAV audio with Gemini 3.1 Flash TTS. I run this on an Ubuntu 24.04 VM with two cores and 4 GB of RAM on a Beelink ME Pro.
This project was developed in VS Code with GitHub Copilot using GPT-5.4.
- Clone the repository:
git clone https://github.com/tronba/AutoAudioBook.git
cd AutoAudioBook
- Run the installer:
sudo bash install_ubuntu_24.sh
- Open the app in a browser:
http://<server-ip>:8000
The installer will:
- install Ubuntu packages
- create or reuse .venv
- install Python dependencies
- securely prompt for the Gemini API key
- save the key to /etc/autoaudiobook/autoaudiobook.env with restricted permissions
- install and enable the systemd service
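The installer itself is a shell script, so the following is only a minimal Python sketch of what saving a key with restricted permissions can look like. The function name `write_env_file`, the `0o600` mode, and the `KEY=value` file layout are assumptions for illustration, not details confirmed by the installer.

```python
import os

def write_env_file(path: str, api_key: str) -> None:
    """Write the API key to `path` so that only the owner can read or write it."""
    # Create the file with owner-only permissions from the start (assumed mode 0o600).
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # Assumed layout: one KEY=value pair per line.
        handle.write(f"GEMINI_API_KEY={api_key}\n")
    # Tighten the mode as well in case the file already existed with looser permissions.
    os.chmod(path, 0o600)
```

Setting the mode at creation time avoids a short window in which the key would be readable by other users.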
To manage the service:
sudo systemctl status autoaudiobook
sudo systemctl restart autoaudiobook
sudo journalctl -u autoaudiobook -n 100 --no-pager
- Imports PDF or DOCX books
- Generates a reviewable annotated DOCX with inline TTS tags
- Lets you upload an approved annotation for audio generation
- Previews chunking before synthesis
- Generates WAV audio with Gemini TTS
- DOCX input files should mark chapter headings with the Word style Header 1
- Text before the first Chapter 1 marker is included in Chapter 1 by default
- That opening text can be split out instead if you enable the separate pre-chapter text option during generation
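The pre-chapter behaviour described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the app's actual code: paragraphs are represented as hypothetical `(style_name, text)` pairs, the function name `group_chapters` is made up, and the "Pre-chapter text" title is a placeholder.

```python
def group_chapters(paragraphs, separate_pre_chapter=False):
    """Group (style_name, text) pairs into chapters, in document order."""
    chapters = []     # each item: {"title": str, "text": [str, ...]}
    pre_chapter = []  # text seen before the first chapter heading
    for style, text in paragraphs:
        if style == "Header 1":
            # A heading paragraph starts a new chapter.
            chapters.append({"title": text, "text": []})
        elif chapters:
            chapters[-1]["text"].append(text)
        else:
            pre_chapter.append(text)
    if pre_chapter:
        if separate_pre_chapter or not chapters:
            # Optional behaviour: keep the opening text as its own section.
            chapters.insert(0, {"title": "Pre-chapter text", "text": pre_chapter})
        else:
            # Default behaviour: fold the opening text into the first chapter.
            chapters[0]["text"] = pre_chapter + chapters[0]["text"]
    return chapters
```

With the default setting a dedication page would be read as the start of Chapter 1; with `separate_pre_chapter=True` it becomes its own section ahead of it.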
Editable inline tag vocabulary lives in tts_tags.toml.
[[expressive_tags]]
tag = "[angry]"
min_mode = "expressive"
Each tag belongs to either expressive_tags or vocalization_tags, and each entry includes a min_mode of conservative, balanced, or expressive.
- storage/app.db - SQLite database
- storage/uploads/ - original uploaded source files
- storage/extracted/ - normalized extracted JSON
- storage/annotated/ - draft and approved annotation DOCX files
- storage/audio/ - reserved for future chunk and chapter audio output
Set these environment variables on the server:
- GEMINI_API_KEY
- GEMINI_TEXT_MODEL optional, defaults to gemini-2.5-flash
- GEMINI_TTS_MODEL optional, defaults to gemini-3.1-flash-tts-preview
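A minimal sketch of reading these variables in Python, with the documented defaults applied, might look like this. The `load_settings` function and the returned dictionary keys are hypothetical; only the variable names and default model names come from the list above.

```python
import os

def load_settings(env=None) -> dict:
    """Read the Gemini settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    api_key = env.get("GEMINI_API_KEY")
    if not api_key:
        # The key has no default, so fail early with a clear message.
        raise RuntimeError("GEMINI_API_KEY is required")
    return {
        "api_key": api_key,
        # Optional variables fall back to the documented defaults.
        "text_model": env.get("GEMINI_TEXT_MODEL", "gemini-2.5-flash"),
        "tts_model": env.get("GEMINI_TTS_MODEL", "gemini-3.1-flash-tts-preview"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise without touching the real environment.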