Autiobooks generates .m4b audiobooks from regular .epub e-books, using Kokoro's high-quality speech synthesis.
Kokoro is an open-weight text-to-speech model with 82 million parameters. It yields natural-sounding output and runs on consumer hardware.
Kokoro supports multiple languages, and Autiobooks exposes all available voices across 9 languages: English (US/GB), Spanish, French, Hindi, Italian, Japanese, Portuguese (BR), and Chinese (Mandarin).
PRs are welcome!
- High-quality TTS — powered by Kokoro, an 82M parameter open-weight model
- Multiple voices — 54 voices across 9 languages: English (US/GB), Spanish, French, Hindi, Italian, Japanese, Portuguese (BR), and Chinese (Mandarin)
- CLI mode — headless conversion for scripting and automation (`python -m autiobooks convert book.epub`)
- Hierarchical chapter tree — browse chapters in a tree view that follows the epub's table of contents structure, with checkboxes, parent-child propagation, auto-select, duplicate detection, and a full-text preview panel
- PDF support — open PDF files directly (pypdf, bundled as a required dependency)
- Multiple output formats — M4B (with chapters), MP3, FLAC, Opus, or WAV
- Word substitutions — user-defined find/replace pairs for fixing TTS mispronunciations
- Drag and drop — drag an epub or PDF file directly onto the window to open it (install with `pip install "autiobooks[dnd]"`)
- Chapter title detection — automatically extracts chapter titles from the epub's table of contents or headings (can be toggled off)
- Voice preview — listen to a sample of any chapter before converting the full book
- Resume support — if a conversion is cancelled or fails, previously completed chapters are kept so you can resume without re-converting them
- GPU acceleration — CUDA support for significantly faster conversion on NVIDIA GPUs
- Adjustable settings — reading speed, chapter gap duration, bitrate (64/128/192k), VBR mode, output format, and starting chapter number
- Editable metadata — correct the title and author before converting
- Settings persistence — voice, speed, gap, bitrate, theme, and other preferences are saved between sessions
- Cover art — embeds the epub's cover image into the output `.m4b` file
- Append M4B — concatenate two `.m4b` files with merged chapter markers via the Tools menu
- Dark/light theme — switchable via Settings > Theme
- Docker support — run in a container with X11 forwarding
- Python 3.10–3.12 (3.13 is not supported due to dependency constraints)
- ffmpeg — required for audio encoding and m4b creation
- tkinter — required for the GUI (included with most Python installations)
- espeak-ng (optional) — improves pronunciation of uncommon words. Without it, Kokoro handles most text well, but espeak-ng provides a fallback for words the model hasn't seen
- NVIDIA GPU (optional) — enables CUDA acceleration for faster conversion. Works with any CUDA-capable GPU. Without a GPU, conversion runs on CPU
Pronunciation fixes:
- `blaise → /bleɪz/` — added to the built-in proper-noun overrides; the given name was absent from misaki's gold/silver lexicons so it fell through to espeak's letter-rule G2P, which is unreliable for non-English etymology
- `dives → /daɪvz/` — fixes a misaki silver lexicon bug: the lowercase entry shipped with the biblical proper-noun pronunciation `/ˈdaɪvˌiːz/` (two syllables, stressed first-then-second), turning every everyday verb/plural ("she dives in", "five dives") into the Luke-16 character. The override is case-insensitive so capitalized `Dives` also gets the verb reading — acceptable trade-off because the biblical character is vanishingly rare in modern reading material
- `résumé` (CV/noun) preserved before diacritic strip — misaki gold has only the verb pronunciation `/ɹəzˈum/`, so `résumé` was previously read as the verb "to continue" after `strip_diacritics` erased the accent cue. New `_RESUME_NOUN_RE` runs inside `normalize_unicode` before the strip and wraps any accented spelling (`résumé`, `Résumé`, `resumé`, `résume`, plus their `-s` plurals) as `[resume](/ˈɹɛzəmeɪ/)` markdown; plain unaccented `resume` is untouched and keeps misaki's verb default
Custom voices (beta):
- Drop-in `.pt` voice packs — place PyTorch voice tensors in `~/.autiobooks/voices/` and they appear in the voice dropdown alongside Kokoro's 54 built-in voices, marked with a ✨ sparkle (e.g. `✨ 🇬🇧 bm_steve`). Compatible with files produced by kvoicewalk. Tensors are loaded with `weights_only=True`, kept on CPU regardless of GPU setting, and validated as shape `(N, 1, 256)` at load time so a malformed file fails with a named error instead of crashing inside synthesis
- Tools → Open Voices Folder… — creates the directory if missing and opens it in the system file browser; the voice dropdown re-scans on next click so newly added files appear without restart
- Voice names follow Kokoro's `<lang><gender>_<name>` convention (e.g. `bm_steve.pt` is treated as British male) so the language-flag emoji and language routing work without extra configuration
- Beta: API and discovery behavior may change before the feature graduates; no in-app voice-pack training UI yet — `.pt` files must be produced externally (e.g. with kvoicewalk)
Test additions:
- 13 new cases in `tests/test_text_processing.py`: `TestResumeNounPreservation` (9 cases covering accented variants, plain-word passthrough, verb inflections, non-English language code, full-pipeline survival) and four additions to `TestBuiltinPhonemeOverrides` (blaise/dives wrapping, case-insensitive `Dives`, user override preempts builtin)
Pronunciation control:
- Pronunciation Overrides (Tools → Pronunciation Overrides) — user-defined word→IPA mappings; matches are wrapped as `[word](/IPA/)` markdown so misaki assigns rating 5 (beats gold/silver/espeak); auto-drops the `\b` word-boundary anchor for words containing apostrophes or hyphens (handles `O'Brien`, `Anne-Marie`); per-entry case-sensitivity and enable/disable toggle; English-only
- Import / Export JSON for pronunciation overrides — share override sets between machines or seed a new install from `scripts/sample_overrides.json` (~250 high-confidence inflected-form fixes generated by the new audit script)
- Auto-acronym spellout (Preferences → Spell out unknown acronyms, off by default) — rewrites unknown all-caps tokens (`NATO`, `CIA`) as letter-spelled phonemes (`N. A. T. O.`); skips a hard stoplist of roman numerals (II–XII) plus pronounceable acronyms misaki only stores lowercase (SCUBA, LASER, RADAR, etc.); runs after word substitutions so a rewrite like `NATO → North Atlantic Treaty Organization` suppresses spellout
- Contextual heteronym overrides — words whose pronunciation depends on collocation cues that POS alone can't resolve: `bow`/`bows`/`bowed`/`bowing` (bowing-gesture vs archery/violin), `content` (predicate adjective vs noun), `minute` (adjectival "tiny" vs time-unit), `lead` (metal/material vs verb), `bass` (musical instrument vs fish), `row` (argument vs line/boat), `tearing` (cry sense vs rip). Each rule fires only on positive context evidence so misaki's POS-aware gold lookup still wins for ambiguous tokens.
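
The override wrapping described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `wrap_override` and its parameters are hypothetical names, and skipping the `\b` anchors simply means matching the escaped word as-is.

```python
import re


def wrap_override(text: str, word: str, ipa: str, case_sensitive: bool = False) -> str:
    """Wrap each match of `word` as [word](/IPA/) markdown (hypothetical helper)."""
    escaped = re.escape(word)
    # Assumption: words with apostrophes or hyphens are matched without \b anchors.
    if "'" in word or "-" in word:
        pattern = escaped
    else:
        pattern = rf"\b{escaped}\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    # Re-emit the matched spelling so capitalization in the text is preserved.
    return re.sub(pattern, lambda m: f"[{m.group(0)}](/{ipa}/)", text, flags=flags)


print(wrap_override("She dives in.", "dives", "daɪvz"))
# "IPA" below is a placeholder string, not a real transcription.
print(wrap_override("Ask O'Brien.", "O'Brien", "IPA"))
```

Using a function as the replacement avoids any backslash handling in the IPA string and keeps the original casing of the matched word.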
Pronunciation audit harness:
- `scripts/audit_pronunciations.py` — diffs misaki's emitted phonemes against cmudict for the top-N English words and writes `suggested_overrides.json` (HIGH confidence) and `pronunciation_audit.csv` (full report) so override sets can be regenerated from authoritative source
- `scripts/audit_heteronyms.py` — runs ~120 (sentence, target word, expected IPA) cases through the full normalize_text → misaki pipeline; passes 121/123 today; the 2 failures (`contract` and `does` in rare contexts) are documented inline as known POS-tagger limits
- `scripts/sample_overrides.json` — bundled, ready to import via Tools → Pronunciation Overrides → Import JSON…
Batch queue improvements:
- Live treeview update — adding a job from the main window while a batch is running now appears in the open Batch Queue immediately; previously it was deferred until the running job finished
- Reorder pending jobs mid-run — Move Up / Move Down / Remove stay enabled during a batch but only act on jobs past the running one; the running row is marked with `▶` and bold text so it's visually distinct
- Clear All becomes "Clear Pending" mid-run — confirmation dialog says how many pending jobs will be removed; the currently-running job continues unaffected
Critical bug fixes (audio quality):
- IPA alphabet folding for Kokoro — Kokoro's phoneme vocab indexes character-by-character and only carries single-letter tokens for the five English diphthongs (`A`=eɪ, `I`=aɪ, `O`=oʊ, `W`=aʊ, `Y`=ɔɪ; `Q`=əʊ for GB). Sending raw canonical IPA (`bˈaʊd`) made Kokoro read `a` (vocab id 43) and `ʊ` (id 135) as two unrelated phonemes instead of the diphthong `W` (id 39); the duration predictor destabilized and the override audio bled onto neighbouring words ("baud" appeared where it didn't belong, the wrapped word came out as "boud"). New `_to_misaki_phonemes()` folds canonical diphthongs to misaki's single-letter codes before emitting markdown — both contextual rules and user-entered overrides now feed Kokoro the alphabet it was trained on
- Misaki preprocess whitespace patch — `misaki.en.G2P.preprocess` used `text.split()` to build its source-token list, discarding `\n` and every other whitespace run. spaCy's tokenizer keeps those as separate tokens, so on multi-paragraph text the source-token list ended up several hundred tokens shorter than the spaCy mutable list and `Alignment.from_strings` drifted further with every paragraph break. By mid-chapter, every `[word](/IPA/)` markdown wrapping attached its rating-5 IPA to a `?`/`\n`/unrelated-word mutable_token instead of the actual word — the override silently fell back to misaki's gold (e.g. `bowed → bˈOd` archery sense) AND the IPA leaked audibly onto whatever wrong token absorbed it. `engine.py:_patch_misaki_preprocess()` monkey-patches the system misaki at import time with a `re.split(r'(\s+)', s)` variant that keeps whitespace runs as source tokens; bundled `autiobooks/misaki/en.py` carries the same fix in-place for PyInstaller builds
- Hash-derived chapter WAV filenames — resume now uses `{stem}_chapter_{md5_8}.wav` keyed on the chapter's text, so reshuffling or shrinking the selected chapter set between runs can't feed one chapter's audio into another chapter's slot. Two chapters with identical text deliberately share a wav path (the audio is identical, so re-using is correct).
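
The three fixes above can be sketched in a few lines. This is a minimal illustration, not the project's code: `fold_diphthongs`, `split_keep_whitespace`, and `chapter_wav_name` are hypothetical names, the diphthong table is copied from the list above, and taking the first 8 hex digits of the MD5 digest is an assumption about what `md5_8` means.

```python
import hashlib
import re

# Single-letter diphthong codes quoted in the changelog (GB additionally uses Q=əʊ).
DIPHTHONGS = {"eɪ": "A", "aɪ": "I", "oʊ": "O", "aʊ": "W", "ɔɪ": "Y"}


def fold_diphthongs(ipa: str, british: bool = False) -> str:
    """Replace two-character canonical diphthongs with single-letter codes."""
    table = dict(DIPHTHONGS)
    if british:
        table["əʊ"] = "Q"
    for canonical, code in table.items():
        ipa = ipa.replace(canonical, code)
    return ipa


def split_keep_whitespace(text: str) -> list[str]:
    """Tokenize on whitespace but keep each whitespace run as its own token."""
    # The capturing group makes re.split return the separators as well.
    return [tok for tok in re.split(r"(\s+)", text) if tok]


def chapter_wav_name(stem: str, chapter_text: str) -> str:
    """Derive a resume-safe WAV name from the chapter's text (assumed 8 hex digits)."""
    digest = hashlib.md5(chapter_text.encode("utf-8")).hexdigest()[:8]
    return f"{stem}_chapter_{digest}.wav"


print(fold_diphthongs("bˈaʊd"))              # bˈWd
print("one\n\ntwo three".split())            # ['one', 'two', 'three']
print(split_keep_whitespace("one\n\ntwo three"))
print(chapter_wav_name("book", "Chapter text"))
```

The plain `split()` call yields three tokens while the whitespace-preserving variant yields five, which is the kind of length mismatch that let the alignment drift across paragraph breaks.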
Other fixes:
- macOS-specific GUI fixes (Mac bug fixes 1 + 2)
- Book Info panel now renders plain text for EPUB `dc:description` metadata that's stored as HTML (`<p>…</p>`, `<br/>`, entities) instead of showing raw HTML tags
- GUI cleanup for PDF input
- Windows build script fix
Test infrastructure:
- pytest suite (`tests/test_text_processing.py`, `tests/test_cli.py`) — 207 cases covering normalization, diacritics, fractions, abbreviations (including the context-aware `St.` Saint/Street and `No.` Number resolvers), roman numerals, heteronyms, special characters, substitutions, phoneme overrides (user + built-in proper-noun defaults), acronym spellout, the diphthong-folding helper, and the misaki preprocess whitespace patch (multi-paragraph regression suite that catches alignment drift the single-sentence audit harness cannot detect)
New features:
- CLI mode — headless conversion for scripting and automation: `python -m autiobooks convert book.epub -o book.m4b --voice af_heart`; also `list-chapters` and `list-voices` subcommands; auto-selects non-empty non-duplicate chapters; supports resume, ETA display, and all format/quality options
- Multi-language voices — all 54 Kokoro voices now available across 9 languages: English (US/GB), Spanish, French, Hindi, Italian, Japanese, Portuguese (BR), and Chinese (Mandarin); text normalization automatically skips English-specific steps for non-English voices
Dependency upgrades:
- Kokoro 0.9.4 — upgraded from 0.7.9; scipy dependency removed (smaller install), numpy unpinned, MPS (macOS GPU) support added
- Misaki 0.9.4 — upgraded from 0.7.17; improved pronunciation dictionaries, restructured MToken dataclass
- phonemizer-fork — replaces phonemizer as the espeak backend dependency
Text normalization improvements:
- Diacritics stripping — accented words (café, naïve, résumé) are normalized to ASCII so they match the TTS lexicon instead of being silently dropped
- Fraction expansion — Unicode fractions (½, ¼, ¾, ⅓, etc.) are expanded to spoken English ("one half", "one quarter")
- Expanded abbreviations — 22 new entries: military ranks (Pvt., Cpl., Maj., Brig.), ecclesiastical (Fr.), geographical (Rd., Ln., Hwy., Mt., Ft.), publishing (Ch., pp., Vol., No., Ed., Fig., Pt.)
- Expanded symbol handling — §, ∞, ≈, ≠, ≤, ≥ expanded to English words; decorative symbols (¶, †, ‡, arrows, stars) removed
- Heteronym fix — removed wind, tear, wound overrides that conflicted with misaki's native POS-aware pronunciation; kept read/lead disambiguation
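
Diacritics stripping and fraction expansion from the list above might look roughly like this. It is a sketch under assumptions rather than the project's code: `expand_fractions` and `strip_accents` are hypothetical names, only "one half" and "one quarter" come from the changelog, and the handling of a digit directly before a fraction is an illustrative choice.

```python
import re
import unicodedata

# "one half" and "one quarter" are quoted in the changelog; the rest are assumed wordings.
FRACTIONS = {"½": "one half", "¼": "one quarter", "¾": "three quarters", "⅓": "one third"}
_FRACTION_RE = re.compile(r"(\d)?([" + "".join(FRACTIONS) + r"])")


def expand_fractions(text: str) -> str:
    """Replace Unicode fraction characters with spoken English."""
    def spoken(match: re.Match) -> str:
        words = FRACTIONS[match.group(2)]
        # Assumption: "2½" reads as "2 and one half"; a bare "½" reads as "one half".
        return f"{match.group(1)} and {words}" if match.group(1) else words
    return _FRACTION_RE.sub(spoken, text)


def strip_accents(text: str) -> str:
    """Drop combining marks so accented words fall back to plain letters."""
    # NFD splits "é" into "e" plus a combining accent, which is then filtered out.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(expand_fractions("Add ½ cup, then wait 2½ hours."))
print(strip_accents("café naïve résumé"))  # cafe naive resume
```

NFD is used instead of NFKD here because compatibility decomposition would also break "½" apart into digits and a fraction slash before the fraction table gets a chance to match it.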
Dark mode improvements:
- Fixed button text, combobox arrows, dropdown lists, tooltips, radio buttons, label frames, and paned windows not being properly themed in dark mode
- New dialogs (Preferences, Word Substitutions) now automatically inherit dark theme colors via Tk option database defaults
- Combobox dropdown lists use dark theme colors
Bug fixes:
- Fixed checkboxes in chapter tree not responding to clicks on some Tk versions (dual `identify_region`/`identify_element` detection)
- Fixed expand/collapse arrows not working in chapter tree view
- Widened Preferences and Word Substitutions dialogs to prevent text cutoff
- Fixed GPU acceleration not resetting to CPU when unchecked — `set_gpu_acceleration(False)` previously left torch's default device stuck on CUDA/MPS from a prior run
- Fixed duplicate chapter detection being inconsistent between sessions — now uses stable `hashlib.md5` instead of Python's randomized built-in `hash()` (which is re-seeded per interpreter run)
- Fixed `prevent_sleep()` context manager leaking sleep-inhibit state on exceptions and re-yielding on caller errors; now cleans up correctly regardless of how the `with` block exits
- Fixed CUDA download leaving corrupt whl files on disk when interrupted — downloads are now atomic (`.part` → rename) with zip validation and a one-shot retry on truncation or corruption
- Fixed voice preview polling loop spawning overlapping `root.after()` chains when the user clicked play rapidly — a single tracked after ID is now cancelled before each new preview
- Fixed CLI progress output using `\r` carriage returns when stderr isn't a TTY — piped and redirected logs now get plain newlines
- Fixed `get_chapter_titles()` returning `None` entries when the TOC had no title and no heading could be extracted — now always returns strings
- Fixed cover image tempfile leaking if the write failed mid-way — path is now recorded before the write so `finally` cleanup always runs
- Added `timeout=30` to all `ffprobe` calls (`probe_duration`, `_probe_chapters`, `_probe_format_tags`) so a hung ffprobe can no longer freeze the caller indefinitely
- Config loader now validates numeric fields (`speed`, `chapter_gap`, `starting_chapter`) before inserting them into entry widgets so a corrupt config does not crash conversion at float/int cast time
- FFmpeg stderr-drain threads now guard against read failures and record the error in the stderr buffer instead of dropping it silently
- Append dialog now logs the underlying ffprobe exception to stderr and shows the exception type in the status label instead of a generic "could not read file"
- Append dialog now validates that the output directory exists and is writable before starting, so long appends don't fail partway through due to permissions
- Batch queue WAV/M4A cleanup now logs unlink failures to stderr instead of giving up silently on the last retry
- CLI PDF metadata path now uses the shared `get_title`/`get_author` helpers (`PdfBook` already implements `get_metadata`), eliminating a duplicate try/except block
- Fixed CLI crash at the end of every successful conversion — `cmd_convert` referenced a bare `start_time` at module scope instead of `state['start_time']`, raising `NameError` after the final chapter
- Main conversion thread now wraps `run_conversion` in a top-level try/except/finally so an unexpected exception inside `prevent_sleep()` or the conversion loop can no longer leave the UI stuck with Convert disabled and Cancel enabled — the error surfaces in a dialog and controls always re-enable
- `save_config` now prints a warning to stderr on `OSError` instead of silently swallowing it, so users can tell when settings aren't being persisted (disk full, read-only home, permissions)
- Fixed macOS GPU acceleration being effectively broken. Even with "Enable GPU acceleration" ticked, Kokoro ignored `torch.set_default_device('mps')` because its constructor only auto-detects cuda-or-cpu — the model loaded on CPU while intermediate tensors went to MPS, causing a `RuntimeError: Expected all tensors to be on the same device` crash during TTS. `create_pipeline` now passes `device=…` to `KPipeline` explicitly, and the pipeline cache is invalidated whenever the device changes so toggling GPU mid-session actually takes effect
- Fixed the "Enable GPU acceleration" checkbox staying greyed out on Apple Silicon Macs. The visibility logic only enabled the checkbox when `torch.cuda.is_available()`, so an MPS-capable Mac fell into the "needs CUDA-enabled build" tooltip branch even though `set_gpu_acceleration(True)` would have routed correctly to MPS. Now enables for either CUDA or MPS
- Config restore in `start_gui()` now calls `set_gpu_acceleration(...)` immediately after syncing the tk var, so the first preview before the first Convert click respects the saved GPU preference instead of running on CPU
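
The atomic download fix in the list above (write to `.part`, validate the zip, rename, retry once) can be sketched as below. This is an assumed shape, not the project's code: `save_atomically` is a hypothetical name, and the `fetch` callable stands in for whatever actually downloads the bytes.

```python
import os
import zipfile
from pathlib import Path
from typing import Callable


def save_atomically(fetch: Callable[[], bytes], dest: Path, retries: int = 1) -> Path:
    """Write fetched bytes to dest only after they validate as a zip archive."""
    part = dest.with_name(dest.name + ".part")
    for attempt in range(retries + 1):
        try:
            part.write_bytes(fetch())
            # A wheel is a zip archive; testzip() returns the first bad member or None.
            with zipfile.ZipFile(part) as archive:
                if archive.testzip() is not None:
                    raise zipfile.BadZipFile("corrupt member in archive")
            os.replace(part, dest)  # rename is atomic within one filesystem
            return dest
        except (zipfile.BadZipFile, OSError):
            # Never leave a partial file behind, then retry or give up.
            part.unlink(missing_ok=True)
            if attempt == retries:
                raise
    raise AssertionError("unreachable")
```

Because the final name only ever appears via `os.replace`, an interrupted or corrupt download leaves at most a `.part` file, which the error path removes.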
Batch queue improvements:
- Batch queue now shares the single conversion loop (`engine.convert_chapters_to_wav`) used by the GUI and CLI, so jobs get per-chapter progress/ETA, chapter-error handling, and the same TTS pipeline as the main window
- Batch queue now supports resume — cancelling or hitting an error keeps partial WAVs so the next batch run picks up where it left off
- Batch queue now disables Move Up / Move Down / Remove / Clear All buttons while a run is active so the queue can't be mutated mid-iteration
- Batch queue now snapshots the main-window GPU preference at the start of the run and restores it in a `finally` block after the last job, instead of leaving `torch`'s default device on whatever the final job was configured with
- Batch queue now detects output filename collisions between queued jobs (e.g. two EPUBs with the same stem exporting to the same folder) and appends `(2)`, `(3)`, … to later writes; comparison is case-insensitive so Windows/macOS don't overwrite on case
- Batch cleanup now reads the real encoded intermediate paths out of `encode_futures` instead of re-deriving them from the job stem — formerly drifted if `safe_stem` truncation kicked in
- Batch worker now has a top-level try/except/finally that catches any unexpected crash in the loop, logs a traceback, shows an error dialog, and always re-enables the queue mutation buttons and Start button — a crash mid-queue can no longer leave the window half-disabled
- Batch Treeview now shows Format and Bitrate columns for each job
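
The output filename collision handling mentioned above could be approximated like this. The helper name `unique_output_path`, the `taken` set, and the space before the `(2)` suffix are all assumptions made for illustration.

```python
from pathlib import Path


def unique_output_path(path: Path, taken: set[str]) -> Path:
    """Return path, or a '(2)', '(3)', ... variant if it collides case-insensitively."""
    candidate = path
    counter = 2
    # casefold() makes "Book.m4b" and "book.m4b" count as the same output.
    while str(candidate).casefold() in taken:
        candidate = path.with_name(f"{path.stem} ({counter}){path.suffix}")
        counter += 1
    taken.add(str(candidate).casefold())
    return candidate


taken: set[str] = set()
for job in ("Book.m4b", "book.m4b", "BOOK.m4b"):
    print(unique_output_path(Path("out") / job, taken).name)
```

Run against the three jobs above, this prints `Book.m4b`, `book (2).m4b`, and `BOOK (3).m4b`, so no later job overwrites an earlier one on a case-insensitive filesystem.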
Refactoring:
- Extracted the chapter conversion loop (~80 lines) shared between the CLI and GUI conversion paths into `engine.convert_chapters_to_wav()`; callers pass a prepared text list and callbacks (`on_chapter_start`, `on_segment`, `on_chapter_done`, `on_chapter_error`, `cancel_check`) to drive their own progress UI
- Split the ~2000-line `start_gui()` god function into focused modules: `theme.py` (themes and `apply_theme`), `dialogs.py` (append/preferences/substitutions dialogs), `batch_window.py` (batch queue window and conversion loop); metadata helpers (`get_publisher`, `get_publication_year`, `get_description`) moved to `epub_parser.py`. `autiobooks.py` dropped from 2038 to 1289 lines; extracted modules take state via explicit parameters instead of closing over `start_gui`'s scope.
CLI improvements:
- Added `--no-resume` flag to force re-conversion of all chapters even when cached WAV files exist
Chapter tree view (inspired by abogen):
- Hierarchical chapter selector — flat checkbox list replaced with a `ttk.Treeview` that follows the epub's TOC structure (Part > Chapter > Section)
- Image-based checkboxes — checked, unchecked, and half-checked states drawn with PIL (ttk.Treeview has no native checkboxes)
- Parent-child propagation — checking a parent section checks all its children; parent state updates automatically when children change
- Content preview panel — right-side pane shows the full text of the selected chapter, or book metadata (title, author, publisher, year, description) when no chapter is selected
- Auto-select — non-empty, non-duplicate chapters are automatically selected when a book is loaded
- Duplicate detection — chapters with identical content are marked "(Duplicate)" and excluded from auto-select
- Expand/Collapse All — toolbar buttons for navigating large TOC hierarchies
- Content caching — parsed epub data is cached in memory keyed on (path, mtime, resize); reopening the same file skips re-parsing
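
Parent-child propagation with a half-checked state, as described in the list above, can be modelled without any GUI code. The sketch below is illustrative only: the `Node` class, the state constants, and `set_checked` are hypothetical and not taken from the Autiobooks source.

```python
CHECKED, UNCHECKED, PARTIAL = "checked", "unchecked", "partial"


class Node:
    """One row of a chapter tree with a tri-state checkbox."""

    def __init__(self, title: str):
        self.title = title
        self.state = UNCHECKED
        self.parent = None
        self.children = []

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def _set_subtree(node: Node, state: str) -> None:
    node.state = state
    for child in node.children:
        _set_subtree(child, state)


def set_checked(node: Node, checked: bool) -> None:
    """Check or uncheck a node, push the state down, then refresh ancestors."""
    _set_subtree(node, CHECKED if checked else UNCHECKED)
    ancestor = node.parent
    while ancestor is not None:
        states = {child.state for child in ancestor.children}
        # Uniform children give a definite state; any mix shows as half-checked.
        ancestor.state = states.pop() if len(states) == 1 else PARTIAL
        ancestor = ancestor.parent


part = Node("Part I")
ch1, ch2 = part.add(Node("Chapter 1")), part.add(Node("Chapter 2"))
set_checked(ch1, True)
print(part.state)  # partial
```

Keeping the state logic separate from the widget means the same rules can drive whichever checkbox image the tree view displays.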
New features:
- PDF input — open PDF files directly; text extracted page-by-page via pypdf (BSD-3-Clause, now a required dependency), chapter structure from PDF bookmarks/outline
- Multiple output formats — choose M4B, MP3, FLAC, Opus, or WAV from the format dropdown; M4B retains chapter markers, other formats concatenate into a single file
- Word substitutions — user-defined find/replace pairs (Tools > Word Substitutions) for fixing recurring TTS mispronunciations of names, places, or terms; supports case-sensitive and whole-word matching; saved between sessions
- Heteronym disambiguation — spaCy POS tagging resolves ambiguous words like "read" (reed/red) and "lead" based on grammatical context
- Contraction resolution — spaCy-based expansion of ambiguous contractions ("'s" → is/has, "'d" → would/had) using surrounding context
- Prevent system sleep — OS-level sleep inhibition during conversions (Windows, macOS, Linux) so long books don't fail because the machine went to sleep
- Dark/light theme — switchable via Settings > Theme; preference saved between sessions
Windows Builds:
- Two standalone Windows executables via PyInstaller:
  - CPU build (`dist/autiobooks/autiobooks.exe`): CPU-only torch, GPU checkbox disabled (grayed out)
  - CUDA build (`dist-cuda/autiobooks-cuda/autiobooks-cuda.exe`): Full GPU acceleration, checkbox enabled by default
- Bundled ffmpeg and espeak-ng (downloaded at build time)
- Bundled spacy + en_core_web_sm for proper NLP tokenization (both builds)
- GPU checkbox shows but is disabled on CPU build with tooltip explaining CUDA build is needed
- "Don't ask again" preference for CUDA prompt (saved to config)
- Tools > Download CUDA Support... for manual download (bypasses "Don't ask again")
New features:
- Batch queue system — "Add to Batch" button captures the current epub with all its settings (selected chapters, voice, speed, gap, detect titles, starting chapter) into a queue
- Batch Queue window (Tools > Batch Queue...) — view, reorder, remove queued jobs, select output directory, and start batch conversion
- Sequential batch conversion with per-job progress tracking and ETA
- Per-file error handling — failures don't stop the batch, summary shown on completion
New features:
- Configurable bitrate — choose 64k, 128k, or 192k AAC output (default 64k); setting is saved between sessions
- VBR mode — new VBR checkbox uses AAC variable bitrate (`-q:a 2`, ~96–128 kbps) for better quality-to-size ratio; disables the bitrate dropdown when active
- Editable metadata — a dialog before conversion lets you correct the title and author extracted from the epub
- Clear WAVs button — new button in the chapter list toolbar deletes leftover `_chapter_*.wav` files for the current book without navigating to the filesystem
- Chapter numbers — chapter list now shows a sequence number (1, 2, 3…) before each title, counting only non-empty chapters
GUI improvements:
- Chapter list footer shows total selected chapters, word count, and estimated listening duration (updates live as checkboxes or speed change)
- Save As dialog remembers the last-used output directory separately from the epub input directory
- Append M4B dialog shows chapter count and duration for each selected file after browsing
- Append M4B dialog validates that input files exist and are `.m4b` before starting
Bug fixes:
- Cancelling a conversion now also cancels any queued background AAC encoding jobs, not just the TTS loop
New features:
- Append M4B — new Tools menu with "Append M4B files..." dialog to concatenate two m4b files; chapter markers from both files are merged with correct timestamps, cover art and metadata are taken from the base file
GUI improvements:
- Starting Chapter # field moved next to the Detect chapter titles checkbox
- Starting Chapter # field is disabled while Detect chapter titles is checked (the two are mutually exclusive)
Bug fixes:
- Fixed chapter markers being silently truncated at 255 in the output m4b file (caused by the Nero `chpl` atom's 8-bit chapter count limit; now suppressed in favour of the standard MP4 chapter track)
Performance:
- Each chapter is now encoded to AAC in a background thread immediately after TTS completes, overlapping encoding with TTS generation for subsequent chapters
- The final m4b assembly step is now a fast stream copy (remux only) instead of a full re-encode, making the "Creating m4b file" step near-instant
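
The overlap between TTS and background encoding described above can be sketched with a thread pool. The function name and the `synthesize` and `encode` callables are placeholders for illustration, and the single worker thread is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def convert_with_background_encoding(
    chapters: Iterable[str],
    synthesize: Callable[[str], str],
    encode: Callable[[str], str],
) -> list[str]:
    """Run TTS on the calling thread while finished chapters encode in the background."""
    futures = []
    # Assumption: one worker is enough, since encoding is much faster than TTS.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for text in chapters:
            wav_path = synthesize(text)
            # Submit and move on: encoding overlaps with the next chapter's TTS.
            futures.append(pool.submit(encode, wav_path))
    # Leaving the with-block waits for all encodes; results keep chapter order.
    return [future.result() for future in futures]


encoded = convert_with_background_encoding(
    ["chapter one", "chapter two"],
    synthesize=lambda text: text.replace(" ", "_") + ".wav",
    encode=lambda wav: wav[:-4] + ".m4a",
)
print(encoded)  # ['chapter_one.m4a', 'chapter_two.m4a']
```

Collecting the futures in submission order is what lets the final assembly step concatenate already-encoded chapters without re-encoding them.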
GUI improvements:
- Version number shown in the title bar
Performance:
- m4b creation no longer runs ffprobe for freshly converted chapters — duration is captured directly from the TTS output, which is exact and avoids the subprocess overhead entirely
- Remaining ffprobe calls (resumed chapters) now run in parallel instead of sequentially
GUI improvements:
- Progress percentage shown during m4b encoding (Creating m4b file... 42%)
Bug fixes:
- Temp wav cleanup on success now tracks all chapter files, including any that were created on disk but not used (e.g. a chapter that produced no audio) — previously those could be left behind
- Added a short delay and retry loop before deleting temp wav files to handle cases where the OS still has a file handle open
Bug fixes:
- All GUI progress/status updates now routed through the main thread (fixes rare Tkinter crashes during conversion)
- FFmpeg stderr no longer decoded with text=True — prevents UnicodeDecodeError from leaving WAV temp files behind after a successful conversion
- FFmpeg concat file now correctly escapes single quotes in file paths (fixes conversions failing for epubs with apostrophes in their filename)
- Preview playback polling loop now exits cleanly when the user manually stops playback (previously leaked a polling loop per stopped preview)
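
For the concat-file fix above, FFmpeg's concat list format expects a literal single quote inside a quoted path to be written as `'\''`. A minimal sketch of that escaping follows; `concat_entry` is a hypothetical helper name, not the project's function.

```python
def concat_entry(path: str) -> str:
    """Format one line of an FFmpeg concat list, escaping single quotes."""
    # Close the quote, emit an escaped quote, then reopen: ' becomes '\''
    escaped = path.replace("'", "'\\''")
    return f"file '{escaped}'"


print(concat_entry("Alice's Adventures_chapter_1.wav"))
```

The printed line reads `file 'Alice'\''s Adventures_chapter_1.wav'`, which FFmpeg parses back to the original filename.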
Refactoring:
- Split `engine.py` into `epub_parser.py`, `text_processing.py`, `config.py`, and a slimmed `engine.py`
Epub parsing:
- Expanded HTML tag handling from 7 to 30+ block-level tags
- No duplication from nested blocks
- Handles `<br>`, `<img>` alt text, `<hr>`, footnote removal, script/style/nav stripping
Text normalization:
- Unicode cleanup (smart quotes, em-dash, en-dash, ligatures)
- Abbreviation expansion (30+ common book abbreviations)
- Context-aware Roman numeral conversion
- Special character/symbol replacement, URL/email removal
- Scene break marker removal (`***`, `---`, etc.)
GUI improvements:
- Bottom controls in fixed frame (never cut off)
- Compact two-row settings layout
- Chapter titles from epub TOC instead of filenames (with toggle to disable)
- Mouse wheel scrolling on chapter list
- Resizable progress bar with per-chapter progress and ETA
- Threaded preview (no GUI freeze)
- Cancel button for conversions
- Error dialogs instead of terminal-only errors
- Select all / clear all buttons for chapter selection
Performance:
- TTS pipeline cached and reused across chapters (model loads once)
- `torch.inference_mode()` for faster TTS inference
- Chapter durations calculated from sample count instead of spawning ffprobe per chapter
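
Computing chapter durations from sample counts is simple arithmetic: duration equals samples divided by sample rate. The sketch below assumes a 24 kHz output rate and ignores any gap between chapters; `chapter_spans_ms` is a hypothetical name.

```python
SAMPLE_RATE = 24_000  # assumption: the TTS output sample rate in Hz


def chapter_spans_ms(sample_counts: list[int], sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start_ms, end_ms) for each chapter from its number of audio samples."""
    spans = []
    cursor = 0  # running total of samples before the current chapter
    for count in sample_counts:
        start_ms = round(cursor * 1000 / sample_rate)
        end_ms = round((cursor + count) * 1000 / sample_rate)
        spans.append((start_ms, end_ms))
        cursor += count
    return spans


print(chapter_spans_ms([24_000, 48_000]))  # [(0, 1000), (1000, 3000)]
```

Accumulating in samples and converting to milliseconds only at the end keeps rounding error from building up across many chapters.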
Docker:
- Added Dockerfile, docker-compose.yml, and .dockerignore
- X11 forwarding for GUI display
- NVIDIA GPU support
- Volume mounts for books and persistent settings
- Updated devcontainer to match
Bug fixes & polish:
- Save As dialog for output location
- Speed validation blocks conversion
- ffmpeg `-y` flag prevents interactive prompts
- Temp file cleanup (wav, chapters.txt, preview audio)
- Cover image temp file leak fixed
- M4b overwrite handling
- ffmpeg error capture with clear error messages
- Input validation for chapter number and gap fields
- Defensive metadata extraction for malformed epubs
- Warning suppression (ebooklib, torch, Kokoro)
- Replaced `exit(1)` with proper exceptions
- Added `lxml` as explicit dependency
- Fix race condition - @Thabian
- Fix issue with output file containing multiple audio stream 10 - @tomhense
- Add an entrypoint for pipx - @tomhense
- Uptick kokoro package
- Fix chapter index - @tomhense
- Fix pip installs
- Fix bug causing errors on some linux installs
- Read epub files with chapters not marked as ITEM_DOCUMENT
- Select all chapters if none are selected
- Window can be resized
- Initial release
Requires Python 3.10–3.12 (3.13 is not supported).
Linux:
```bash
sudo apt install ffmpeg python3-tk espeak-ng
```

macOS:

```bash
brew install ffmpeg espeak-ng
brew install python-tk@3.12  # match your Python version: @3.10, @3.11, or @3.12
```

Homebrew's Python does not bundle tkinter — it's a separate formula, and the generic python-tk may not match your interpreter. Pin the version explicitly (python-tk@3.12 for Python 3.12). Verify with:

```bash
python3.12 -c "import tkinter; print(tkinter.TkVersion)"
```

If you use pyenv, install tcl-tk via brew before building Python, otherwise pyenv silently compiles without tkinter support. The python.org installer bundles tkinter and needs no extra step.
Windows:
- Install ffmpeg and add it to your PATH
- tkinter is included with the standard Python installer
- espeak-ng is optional but recommended
```bash
git clone https://github.com/plusuncold/autiobooks.git
cd autiobooks
pip install .
```

To also enable drag-and-drop support:

```bash
pip install ".[dnd]"
```

GUI mode:

```bash
python -m autiobooks
```

CLI mode (headless):

```bash
# Convert with default settings (auto-selects chapters, af_heart voice)
python -m autiobooks convert book.epub
# Specify voice, speed, and output format
python -m autiobooks convert book.epub --voice bm_daniel --speed 1.2 --format mp3
# Convert specific chapters only
python -m autiobooks convert book.epub --chapters 1,3-5,8
# List available chapters
python -m autiobooks list-chapters book.epub
# List all available voices
python -m autiobooks list-voices
```

The program creates .wav files for each chapter, then combines them into a .m4b file for playing using an audiobook player.
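
The `--chapters 1,3-5,8` argument in the CLI examples above takes a mix of single numbers and ranges. How Autiobooks parses it is not shown here; the sketch below is one plausible reading that assumes 1-based chapter numbers and inclusive ranges, with `parse_chapter_spec` as a hypothetical name.

```python
def parse_chapter_spec(spec: str) -> list[int]:
    """Expand a spec like '1,3-5,8' into a sorted list of chapter numbers."""
    selected: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            # Assumption: ranges include both endpoints.
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))
    return sorted(selected)


print(parse_chapter_spec("1,3-5,8"))  # [1, 3, 4, 5, 8]
```

Using a set means overlapping entries such as `3-5,4` select each chapter only once.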
If you have an NVIDIA GPU with CUDA support, check the "Enable GPU acceleration" option in the app to significantly speed up conversion. No additional setup is needed beyond having CUDA-compatible drivers installed.
You can run Autiobooks in a Docker container. Since it's a GUI application, you'll need X11 forwarding for display.
```bash
docker compose up --build
```

Place your .epub files in the ./books/ directory — this is mounted as the working directory inside the container.

Linux / WSL2:

```bash
xhost +local:docker
docker compose up --build
```

Windows (with VcXsrv or similar X server):

```bash
# Start VcXsrv with "Disable access control" checked
DISPLAY=host.docker.internal:0 docker compose up --build
```

macOS (with XQuartz):

```bash
xhost +localhost
DISPLAY=host.docker.internal:0 docker compose up --build
```

If you have an NVIDIA GPU and nvidia-container-toolkit installed, the deploy section in docker-compose.yml enables CUDA acceleration. If you don't have a GPU, comment out or remove the deploy section to avoid errors.
| Volume | Purpose |
|---|---|
| `./books` | Epub input and audiobook output |
| `autiobooks-config` | Persisted settings between runs |
Pre-built Windows executables are available for download from the releases page. Two variants are provided:
| Build | File | Description |
|---|---|---|
| CPU | `autiobooks.exe` | CPU-only, smaller (~1.5GB total), GPU checkbox disabled |
| CUDA | `autiobooks-cuda.exe` | Full GPU acceleration (~5.5GB total), requires NVIDIA GPU |
Both builds include bundled ffmpeg and espeak-ng, so no additional installation is required.
If you need to rebuild the Windows executables, you'll need:
- Python 3.12 (64-bit)
- Windows 10/11
- Git for cloning the repository
Build tools (installed automatically by the scripts):
- scoop or chocolatey for espeak-ng
- ~2-6GB free disk space depending on build type
CPU Build:
```
cd windows
build.bat
```

Output: `windows/dist/autiobooks/autiobooks.exe`

CUDA Build:

```
cd windows
build-cuda.bat
```

Output: `windows/dist-cuda/autiobooks-cuda/autiobooks-cuda.exe`
Both scripts will:
- Create a Python virtual environment
- Install all dependencies
- Download ffmpeg and espeak-ng
- Run PyInstaller
- Copy executables and DLLs to the output folder
Using the builds:
- Run the appropriate exe for your hardware
- On the CPU build, the GPU checkbox is disabled with a tooltip explaining a CUDA build is needed
- On first run with a CUDA build, you'll be prompted to download CUDA runtime (~2.5GB) if not already present
- Use Tools > Download CUDA Support to manually download CUDA (bypasses the "Don't ask again" preference)
by David Nesbitt, distributed under MIT license.
