Add gemini-tts extension#27612
Conversation
- Convert extension to Gemini TTS
- Honor external stop in voice picker playback
- Clarify Raycast AI workflow fit
- Add Nursery Teacher voice preset
- Add English paper reading voice presets
- Add paper reading voice presets
- Merge pull request raycast#1 from xwzhangSZU/codex/fix-speed-control-bugs
- Fix speed control state handling
- Add speed controls and surface custom voice IDs in pickers
- Improve playback visibility and recover-from-error UX
- Add voice cloning and tighten MiniMax auth modes
- Remove unused Raycast utils dependency
- Document MiniMax setup and advantages
- Add second MiniMax store screenshot
- Add MiniMax store screenshot
- Update extension icon
- Add Quick Read voice picker
- Improve medium text reading workflow
- Initial MiniMax TTS Raycast extension
Congratulations on your new Raycast extension! 🚀 We're currently experiencing a high volume of incoming requests. As a result, the initial review may take up to 10-15 business days. Once the PR is approved and merged, the extension will be available on our Store.
Greptile Summary
This PR adds a new Gemini TTS Raycast extension that reads selected (or clipboard) text aloud using the Gemini REST API, with resume/restart/speed controls, a producer-consumer prefetch pipeline, an LRU audio cache, and a menu-bar status item.
Confidence Score: 4/5
Safe to merge with one fix: the session-lock race in read-with-voice needs attention before users on slow networks repeatedly switch voices.
The Quick Read, resume, and restart commands are solid and their session-lock usage is correct. The race only manifests inside read-with-voice.tsx when the same view process calls handleRead a second time while the first invocation's async cleanup is still pending, a realistic scenario for anyone browsing voices on a slow connection. Without a fix, switching voices can silently drop the session lock, allowing concurrent TTS playback.
extensions/gemini-tts/src/read-with-voice.tsx and extensions/gemini-tts/src/utils/session-lock.ts: the lock's PID-only identity needs to be extended with a per-invocation token to be safe within a single view process.
Important Files Changed
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
extensions/gemini-tts/src/read-with-voice.tsx:246-256
**Session lock released by a previous `handleRead` invocation**
All calls to `handleRead` share the same `process.pid`. When the user quickly clicks voice B while voice A is still synthesizing, B's setup (lines 97–122) releases A's lock and immediately acquires a new one — but A's `finally` block runs asynchronously (after B's first `await`), sees `pid === process.pid`, and deletes the lock file B just wrote. From that point on B holds no session lock, so a concurrent Quick Read or Resume command will not detect an active session and can start a parallel reader — resulting in two TTS streams and audio overlap.
The root cause is that `releaseSessionLock` can't distinguish "this process's lock from an earlier invocation" from "this process's lock from the current invocation". A lightweight fix is to track the owner with a per-invocation token: write a `${process.pid}:${token}` string to the lock file and only release when the on-disk token still matches the local token.
### Issue 2 of 3
extensions/gemini-tts/src/utils/audio-player.ts:1-6
Duplicate import from `"child_process"` — `spawn`, `ChildProcess`, and `execSync` can be grouped into a single import statement.
```suggestion
import { spawn, ChildProcess, execSync } from "child_process";
import { writeFileSync, unlinkSync, existsSync, readFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";
import { randomUUID } from "crypto";
```
### Issue 3 of 3
extensions/gemini-tts/CHANGELOG.md:3
The date should use the `{PR_MERGE_DATE}` template placeholder rather than a hardcoded date. Raycast automatically substitutes the actual merge date when the PR is merged.
```suggestion
## [Initial Version] - {PR_MERGE_DATE}
```
Reviews (12): Last reviewed commit: "Update CHANGELOG.md and optimise images"
      "description": "Increase reading speed by 0.25× for the next segment.",
      "mode": "no-view"
    },
    {
Title case violation on command titles
"Speed up Reading" should be "Speed Up Reading" — all words in a command title should use title case per the Raycast convention. Similarly, "Read with Voice Selection" should be "Read With Voice Selection".
    {
      "title": "Speed Up Reading",
Rule Used: What: Use title case for titles in package.json.... (source)
Not applying this one — Raycast's official ray lint warns against this form and expects "Read with Voice Selection" / "Speed up Reading" (AP-style: short prepositions like "with" / "up" stay lowercase). Reverted the title-cased variants I briefly had locally so ray lint stays clean. Heads-up in case the lint rule used here is older than the current Raycast convention.
Hide Gemini TTS's per-request synthesis latency behind smarter scheduling and a content-addressed audio cache. Gemini has no streaming endpoint, so the only levers are TTFA reduction and inter-chunk gap removal.
- Lead chunk (~60-260 chars at the nearest sentence/clause boundary) shrinks first-audio latency from ~4-8s to ~1-2s on long inputs.
- Producer/consumer pipeline (depth-1 prefetch) synthesizes chunk N+1 while chunk N plays, so users only ever wait once.
- SHA-256 audio cache makes Restart Reading, Resume, voice preview, and paragraph re-reads instant. LRU sweep at 200 MB; speed excluded from the key so afplay rate changes hit cache.
- Static director profile moved to systemInstruction so per-chunk requests carry only the transcript, meaning fewer tokens per chunk on long reads.
- Menu-bar status refreshes within ~1s of phase transitions via background launchCommand, throttled to 750ms.
Applied to read-with-voice's command-internal loop too, not just the shared reading-runner, so all read paths get the smoothness win.
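The lead-chunk idea in this commit message can be illustrated with a minimal sketch. This is not the extension's code: `splitLeadChunk`, the constants, and the "last boundary inside the window" policy are assumptions chosen to match the ~60-260 character range described above.

```typescript
// Hypothetical helper: names, limits, and fallback policy are assumptions.
const LEAD_MIN = 60;
const LEAD_MAX = 260;

// Prefer sentence enders, then clause separators (ASCII and CJK punctuation).
const SENTENCE_END = /[.!?。！？]/;
const CLAUSE_END = /[,;:，；：、]/;

// Returns the index just past the last matching character that sits
// at or beyond LEAD_MIN characters into the window, or null if none exists.
function lastBoundary(window: string, pattern: RegExp): number | null {
  for (let i = window.length - 1; i >= LEAD_MIN - 1; i--) {
    if (pattern.test(window[i])) return i + 1;
  }
  return null;
}

// Splits off a short first chunk so the first synthesis request stays small.
export function splitLeadChunk(text: string): { lead: string; rest: string } {
  const trimmed = text.trim();
  if (trimmed.length <= LEAD_MAX) return { lead: trimmed, rest: "" };

  const window = trimmed.slice(0, LEAD_MAX);
  // Fall back to a hard cut at LEAD_MAX when no boundary is found.
  const cut =
    lastBoundary(window, SENTENCE_END) ??
    lastBoundary(window, CLAUSE_END) ??
    LEAD_MAX;
  return { lead: trimmed.slice(0, cut).trim(), rest: trimmed.slice(cut).trim() };
}
```

The remainder returned in `rest` would then go through whatever regular chunking the reader uses; only the first request is kept deliberately short.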
Two pre-existing correctness gaps surfaced by the latency audit:
- Concurrent-instance race: a Quick Read trigger during the lead chunk's synthesis (before any afplay process exists) used to launch a parallel reader instead of toggle-stopping. Adds a session lock file held across synth+play, and extends stopExternalPlayback to signal stop even when only synthesis (no afplay) is running.
- Voice preview now writes playback state, so menu-bar Stop Reading can interrupt it and the menu bar reflects in-progress previews.
- "Audio Cache" row in the menu bar shows current size + entry count and clears the cache on click; repopulates lazily as the user reads.
- Menu refresh now fires only on phase transitions (synthesizing / playing / stopped / completed) instead of also on a 750ms time tick. Bounded at ~2 background launches per chunk.
Surface the new performance and stop semantics in the README so the extension's listing reflects what the code actually does now.
Addresses Greptile P1 from the automated review on this PR. The two "Nothing to read" hints were swapped: the message offering "you can also resume your last reading" was shown when no paused session existed, and vice versa. Swap them so the Resume hint surfaces only when there is something to resume.
Live request against gemini-3.1-flash-tts-preview returned: HTTP 400: "Developer instruction is not enabled for this model" so every Quick Read in the previous push was failing on the wire. Move the entire director prompt back inline in `contents`. Bump cache version to invalidate any stale entries. Verified rolled-back shape returns HTTP 200 with audio.
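A versioned, content-addressed cache key of the kind described in these commits could look like the sketch below. The `SynthesisRequest` shape, the `CACHE_VERSION` value, and the function name are assumptions for illustration; the source only states that the cache is SHA-256 based, that playback speed is excluded from the key, and that the version was bumped to invalidate stale entries.

```typescript
import { createHash } from "crypto";

// Hypothetical request shape; field names are assumptions.
interface SynthesisRequest {
  model: string;
  voiceId: string;
  languageMode: string;
  prompt: string; // director prompt plus transcript, as sent inline in `contents`
  speed?: number; // playback rate; deliberately NOT part of the key
}

// Assumed value. Bumping it changes every key, so old entries are never hit again.
const CACHE_VERSION = 2;

export function audioCacheKey(req: SynthesisRequest, version: number = CACHE_VERSION): string {
  // Serialize only the fields that change the synthesized audio.
  const payload = JSON.stringify([version, req.model, req.voiceId, req.languageMode, req.prompt]);
  return createHash("sha256").update(payload, "utf8").digest("hex");
}
```

Because `speed` never enters the hash, changing the playback rate reuses the same cached audio, while a version bump orphans entries produced with an older request shape and leaves them for the LRU sweep.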
Fire up to 3 parallel synthesis requests ahead of playback so audio buffers are ready before the current chunk finishes. Paid API rate limits easily accommodate this; eliminates audible gaps on long texts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
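One way to picture the bounded lookahead described here is the sketch below. It is an assumption-laden illustration, not the extension's reading loop: `readWithPrefetch`, `Synthesize`, and `Play` are hypothetical names, and stop handling, retries, and caching are left out.

```typescript
type Synthesize = (chunk: string) => Promise<Uint8Array>;
type Play = (audio: Uint8Array) => Promise<void>;

// Plays chunks in order while keeping up to `lookahead` synthesis
// requests started ahead of the chunk currently being played.
export async function readWithPrefetch(
  chunks: string[],
  synthesize: Synthesize,
  play: Play,
  lookahead = 3,
): Promise<void> {
  const inFlight: Promise<Uint8Array>[] = [];
  let next = 0;

  const fill = (): void => {
    while (inFlight.length < lookahead && next < chunks.length) {
      const pending = synthesize(chunks[next++]);
      // Silence unhandled-rejection warnings; the error still surfaces
      // when this promise is awaited in the loop below.
      pending.catch(() => undefined);
      inFlight.push(pending);
    }
  };

  fill();
  while (inFlight.length > 0) {
    const audio = await inFlight.shift()!;
    fill(); // top up before playing so synthesis overlaps playback
    await play(audio);
  }
}
```

Setting `lookahead` to 1 gives the earlier depth-1 behaviour; a larger value trades more concurrent API requests for a lower chance of an audible gap between chunks.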
nitpick: unused SVG file
This looks like it may no longer be needed. If it's safe to remove, deleting it could help keep it cleaner.
Done in 50cd3c84 — the manifest and README only reference command-icon.png, the source SVG was an editable artifact from the design pass.
suggestion (non-blocking): Would it make sense to use the same extension icon for both dark and light mode here?
This seems like the two icons are identical, so keeping both may not add much value.
Consolidated in 50cd3c84 — the dark variant only differed by one or two background shades, so I removed it and Raycast falls back to command-icon.png for both modes.
nitpick: unused SVG file
This looks like it may no longer be needed. If it's safe to remove, deleting it could help keep it cleaner.
Done in 50cd3c84 — removed alongside the light source SVG.
  | "gemini-3.1-flash-tts-preview"
  | "gemini-2.5-flash-preview-tts"
  | "gemini-2.5-pro-preview-tts";
export type GeminiLanguageMode = "auto" | "cmn" | "en" | "mixed-cmn-en";
question: multiple language support
Is it possible to support multiple languages here?
The extension already handles multiple languages. languageMode defaults to auto, which passes text straight through to Gemini 3.1 Flash TTS Preview (the extension's default model) — that model supports 70+ languages with automatic input-language detection, so pasted text in any supported language is spoken correctly without extra config.
The cmn / en / mixed-cmn-en values aren't a language whitelist — they're optional delivery hints that bias pronunciation and pacing for the extension's primary Chinese + English legal-academic use case (citations, mixed-script handling, code-switching). I kept the explicit list narrow on purpose since auto plus the model's built-in detection already covers the rest. Happy to broaden it if per-language hints would be valuable.
0xdhrv
left a comment
Hey @xwzhangSZU 👋
I have added a few comments for you to address.
I'm looking forward to testing this extension again 🔥
Feel free to contact me here or at Slack if you have any questions.
I converted this PR into a draft until it's ready for the review, please press the button Ready for review when it's ready and we'll have a look 😊
- Remove unused source SVGs and the near-identical @dark icon variant; Raycast falls back to command-icon.png for dark mode.
- Replace MiniMax-fork leftover metadata screenshots with real Gemini TTS captures: Quick Read picker (Active Configuration line shows gemini-3.1-flash-tts-preview / Mixed CN/EN) and Read-with-Voice picker (voices grouped by personality).
Addresses 0xdhrv review on raycast/extensions#27612. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
read-with-voice.tsx and select-voice.tsx acquire a session lock for the first read, then on the next read reach stopExternalPlayback(), whose case-2 path sees hasActiveSession() true (the live PID is us), writes STOP_FILE, and returns true. The follow-up waitForSessionLockRelease then waits for us to release a lock we hold while sitting blocked in the wait, surfacing as a permanent "Stopping previous reading" toast.
- Release our own session lock before stopExternalPlayback (no-op when another process owns the lock).
- Clear STOP_FILE in the cleanup finally blocks (also Greptile #1): without it a leftover STOP_FILE makes the next session exit on its first iteration.
Addresses 0xdhrv review on raycast/extensions#27612. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hi @0xdhrv — addressed your review and ready for another look.
Inline replies above. Thanks for testing!
Three real bugs Greptile flagged in earlier review rounds that weren't caught by the on-PR back-and-forth:
- restart-reading.tsx / resume-reading.tsx: stop signal was erased before the running session could observe it. Added the `waitForSessionLockRelease()` guard that `read-with-voice.tsx` already uses so STOP_FILE outlives the synchronous tick and the old session actually exits, instead of the new command silently failing on lock contention.
- reading-runner.ts: `writePlaybackSpeed()` and the stopPoll `setInterval` ran *before* the `try` that wraps `releaseSessionLock()` in a `finally`. If startup threw, the session lock leaked and blocked every subsequent reading until Raycast restarts. Hoisted resources and moved the try boundary up so cleanup is guaranteed.
- gemini-tts.ts: `shouldRetry()` only handled `TTSApiError`, so transient network failures from `fetch` (TypeError, ECONNRESET, EAI_AGAIN, undici socket errors) fell through and aborted long reading sessions on a single dropped packet. Added a NETWORK_ERROR_CODES set + cause-walking so they retry alongside HTTP 429/5xx.
Also drops the misleading "Removed" section from CHANGELOG: the extension never had MiniMax features for end-users, so a removal note in an initial-release changelog is confusing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
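The retry change in the third bullet can be sketched roughly as follows. The `TTSApiError` class here is a stand-in, the exact contents of `NETWORK_ERROR_CODES` are assumed (the commit names only ECONNRESET, EAI_AGAIN, and undici socket errors), and the depth limit is an arbitrary guard against cyclic `cause` chains.

```typescript
// Stand-in for the extension's error class; the real shape may differ.
class TTSApiError extends Error {
  status: number;
  constructor(message: string, status: number) {
    super(message);
    this.status = status;
  }
}

// Assumed membership: transient network failures worth retrying.
const NETWORK_ERROR_CODES = new Set(["ECONNRESET", "EAI_AGAIN", "ETIMEDOUT", "UND_ERR_SOCKET"]);

export function shouldRetry(error: unknown): boolean {
  let current: unknown = error;
  // fetch failures often arrive as a TypeError whose `cause` carries
  // the underlying network code, so walk the cause chain.
  for (let depth = 0; depth < 5 && current != null; depth++) {
    if (current instanceof TTSApiError) {
      return current.status === 429 || current.status >= 500;
    }
    const code = (current as { code?: unknown }).code;
    if (typeof code === "string" && NETWORK_ERROR_CODES.has(code)) return true;
    current = (current as { cause?: unknown }).cause;
  }
  return false;
}
```

This sketch retries only when a known code appears somewhere in the chain; a bare `TypeError` with no `cause` is treated as non-retryable, which may be stricter than what the extension does.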
Pushed
One Greptile suggestion intentionally not applied: the "title case" P2 asking to rewrite
0xdhrv
left a comment
Looks good to me, approved ✅
Published to the Raycast Store:
🎉 🎉 🎉 We've rewarded your Raycast account with some credits. You will soon be able to exchange them for some swag.
      }
    } finally {
      synthesisController.abort();
      clearInterval(stopPoll);
      if (synthesisAbortRef.current === synthesisController) {
        synthesisAbortRef.current = null;
      }
      setProgress((current) => (current?.voiceId === voice.id ? null : current));
      releaseSessionLock();
      clearExternalStopRequest();
    }
Session lock released by a previous handleRead invocation
All calls to handleRead share the same process.pid. When the user quickly clicks voice B while voice A is still synthesizing, B's setup (lines 97–122) releases A's lock and immediately acquires a new one — but A's finally block runs asynchronously (after B's first await), sees pid === process.pid, and deletes the lock file B just wrote. From that point on B holds no session lock, so a concurrent Quick Read or Resume command will not detect an active session and can start a parallel reader — resulting in two TTS streams and audio overlap.
The root cause is that releaseSessionLock can't distinguish "this process's lock from an earlier invocation" from "this process's lock from the current invocation". A lightweight fix is to track the owner with a per-invocation token: write a ${process.pid}:${token} string to the lock file and only release when the on-disk token still matches the local token.
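The per-invocation token suggested above could be sketched like this. The lock file path and the function signatures are assumptions (the extension's `releaseSessionLock` takes no arguments in the snippet shown), and the sketch does not make acquisition atomic; it only shows the ownership check on release.

```typescript
import { existsSync, readFileSync, unlinkSync, writeFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";
import { randomUUID } from "crypto";

// Hypothetical location; the extension's real lock path is not shown in the PR.
const LOCK_FILE = join(tmpdir(), "gemini-tts-session.lock.example");

// Writes `${pid}:${token}` and hands the token back to the caller,
// which must keep it for the matching release.
export function acquireSessionLock(lockFile: string = LOCK_FILE): string {
  const token = randomUUID();
  writeFileSync(lockFile, `${process.pid}:${token}`, "utf8");
  return token;
}

// Deletes the lock only if the on-disk owner is still this invocation.
export function releaseSessionLock(token: string, lockFile: string = LOCK_FILE): boolean {
  try {
    if (readFileSync(lockFile, "utf8") !== `${process.pid}:${token}`) return false;
    unlinkSync(lockFile);
    return true;
  } catch {
    return false; // missing or unreadable lock file: nothing to release
  }
}

export function isLocked(lockFile: string = LOCK_FILE): boolean {
  return existsSync(lockFile);
}
```

With this shape, invocation A's late `finally` calls `releaseSessionLock(tokenA)`, finds invocation B's token on disk, and leaves the file alone, so B keeps its lock.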
Description
Adds Gemini TTS, a Raycast extension for reading selected macOS text aloud with Gemini text-to-speech.
This extension is optimized for academic and long-form listening rather than generic one-shot TTS:
- gemini-3.1-flash-tts-preview and gemini-2.5-flash-preview-tts
- [short pause] audio tags between paragraphs

Voice cloning is intentionally not included because the Gemini TTS API currently provides prebuilt voices rather than a voice-clone endpoint.
Screencast
No screencast is included because the extension requires a user-provided Gemini API key. The UI is built from standard Raycast commands and lists. I validated the extension locally in Raycast development mode and through the Raycast CLI checks below.
Validation
- npm run build
- npm run lint
- npx tsc --noEmit
- git diff --check
- npm run dev / ray develop locally and confirmed the extension builds and loads in Raycast.

Checklist
- npm run build and tested this distribution build in Raycast
- assets folder are used by the extension itself
- README are placed outside of the metadata folder