
Add gemini-tts extension #27612

Merged
raycastbot merged 15 commits into raycast:main from xwzhangSZU:ext/gemini-tts
May 27, 2026

Conversation

@xwzhangSZU
Contributor

@xwzhangSZU xwzhangSZU commented May 3, 2026

Description

Adds Gemini TTS, a Raycast extension for reading selected macOS text aloud with Gemini text-to-speech.

This extension is optimized for academic and long-form listening rather than generic one-shot TTS:

  • Direct Gemini REST API integration with gemini-3.1-flash-tts-preview and gemini-2.5-flash-preview-tts
  • Quick Read from selected text, with clipboard fallback
  • Resume, restart, stop, playback speed controls, and menu-bar reading status
  • Gemini-native prompt structure using Audio Profile, Scene, Director's Notes, and Transcript sections
  • Smart Auto reading mode that chooses Legal Text Mode, English Paper Reader, Mandarin Lecture, or Bilingual Academic Reader based on selected text
  • Mixed Chinese / English guidance so Chinese is read as Mandarin while embedded English terms, citations, names, and acronyms remain English
  • Smart Academic Pauses using safe [short pause] audio tags between paragraphs
  • Voice picker for Gemini's 30 prebuilt voices, with academic recommendations highlighted
  • README guidance for Google AI Studio API key creation, free-tier quota, AI Studio Billing, and the distinction between Google AI Pro benefits and Gemini API project billing
  • Rainbow scholarly-speaker icon with light/dark assets

Voice cloning is intentionally not included because the Gemini TTS API currently provides prebuilt voices rather than a voice-clone endpoint.

Screencast

No screencast is included because the extension requires a user-provided Gemini API key. The UI is built from standard Raycast commands and lists. I validated the extension locally in Raycast development mode and through the Raycast CLI checks below.

Validation

  • npm run build
  • npm run lint
  • npx tsc --noEmit
  • git diff --check
  • Ran npm run dev / ray develop locally and confirmed the extension builds and loads in Raycast.

Checklist

- Convert extension to Gemini TTS
- Honor external stop in voice picker playback
- Clarify Raycast AI workflow fit
- Add Nursery Teacher voice preset
- Add English paper reading voice presets
- Add paper reading voice presets
- Merge pull request raycast#1 from xwzhangSZU/codex/fix-speed-control-bugs
- Fix speed control state handling
- Add speed controls and surface custom voice IDs in pickers
- Improve playback visibility and recover-from-error UX
- Add voice cloning and tighten MiniMax auth modes
- Remove unused Raycast utils dependency
- Document MiniMax setup and advantages
- Add second MiniMax store screenshot
- Add MiniMax store screenshot
- Update extension icon
- Add Quick Read voice picker
- Improve medium text reading workflow
- Initial MiniMax TTS Raycast extension
@raycastbot raycastbot added the new extension and platform: macOS labels May 3, 2026
@raycastbot
Collaborator

Congratulations on your new Raycast extension! 🚀

We're currently experiencing a high volume of incoming requests. As a result, the initial review may take up to 10-15 business days.

Once the PR is approved and merged, the extension will be available on our Store.

@xwzhangSZU xwzhangSZU marked this pull request as ready for review May 3, 2026 00:06
@greptile-apps
Contributor

greptile-apps Bot commented May 3, 2026

Greptile Summary

This PR adds a new Gemini TTS Raycast extension that reads selected (or clipboard) text aloud using the Gemini REST API, with resume/restart/speed controls, a producer-consumer prefetch pipeline, an LRU audio cache, and a menu-bar status item.

  • The core API integration, chunking pipeline, cache, and session-lock machinery for no-view commands are well-implemented with careful cancellation and finally-block cleanup.
  • The view-based read-with-voice.tsx command has a session-lock race condition: when the user switches voices quickly, the finally block of the previous handleRead invocation releases the lock the new invocation just acquired (both share the same process.pid), leaving the active reading without a lock and allowing a parallel reader to start.
  • The CHANGELOG uses a hardcoded date instead of the {PR_MERGE_DATE} template placeholder required by the Raycast convention.

Confidence Score: 4/5

Safe to merge with one fix: the session-lock race in read-with-voice needs attention before users on slow networks repeatedly switch voices.

The Quick Read, resume, and restart commands are solid and their session-lock usage is correct. The race only manifests inside read-with-voice.tsx when the same view process calls handleRead a second time while the first invocation's async cleanup is still pending — a realistic scenario for anyone browsing voices on a slow connection. Without a fix, switching voices can silently drop the session lock, allowing concurrent TTS playback.

extensions/gemini-tts/src/read-with-voice.tsx and extensions/gemini-tts/src/utils/session-lock.ts — the lock's PID-only identity needs to be extended with a per-invocation token to be safe within a single view process.

Important Files Changed

| Filename | Overview |
| --- | --- |
| extensions/gemini-tts/src/read-with-voice.tsx | View command for browsing voices and reading text; contains a session-lock race condition where a completed prior handleRead's finally block can delete the lock acquired by the new invocation on rapid voice switching. |
| extensions/gemini-tts/src/utils/session-lock.ts | PID-file session lock implementation; works correctly for cross-command exclusion but is vulnerable when multiple invocations share the same PID (same Raycast view process). |
| extensions/gemini-tts/src/utils/audio-player.ts | Audio playback via afplay with PID-file and STOP_FILE cross-command signalling; has a duplicate child_process import (style only). |
| extensions/gemini-tts/src/api/gemini-tts.ts | Core Gemini REST API integration with retry logic, WAV construction, caching, prompt building, and language/experience inference; well-structured and handles cancellation correctly. |
| extensions/gemini-tts/src/utils/reading-runner.ts | Producer/consumer pipeline for multi-chunk TTS playback with prefetch, session lock, stop-poll, and proper finally-block cleanup; logic is sound. |
| extensions/gemini-tts/CHANGELOG.md | Uses a hardcoded date (2026-05-27) instead of the required {PR_MERGE_DATE} template placeholder. |
| extensions/gemini-tts/package.json | Well-formed extension manifest with $schema, correct category, macOS platform, password-type API key preference, and all required fields present. |
| extensions/gemini-tts/src/utils/audio-cache.ts | LRU disk cache for synthesized WAV files with atomic write (tmp+rename), mtime-based eviction, and size cap; implementation is correct. |
| extensions/gemini-tts/src/quick-read.tsx | Toggle-to-stop Quick Read command; messages and resume hints are correctly conditioned on whether a prior session exists. |
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
extensions/gemini-tts/src/read-with-voice.tsx:246-256
**Session lock released by a previous `handleRead` invocation**

All calls to `handleRead` share the same `process.pid`. When the user quickly clicks voice B while voice A is still synthesizing, B's setup (lines 97–122) releases A's lock and immediately acquires a new one — but A's `finally` block runs asynchronously (after B's first `await`), sees `pid === process.pid`, and deletes the lock file B just wrote. From that point on B holds no session lock, so a concurrent Quick Read or Resume command will not detect an active session and can start a parallel reader — resulting in two TTS streams and audio overlap.

The root cause is that `releaseSessionLock` can't distinguish "this process's lock from an earlier invocation" from "this process's lock from the current invocation". A lightweight fix is to track the owner with a per-invocation token: write a `${process.pid}:${token}` string to the lock file and only release when the on-disk token still matches the local token.

### Issue 2 of 3
extensions/gemini-tts/src/utils/audio-player.ts:1-6
Duplicate import from `"child_process"`: `spawn`, `ChildProcess`, and `execSync` can be grouped into a single import statement.

```suggestion
import { spawn, ChildProcess, execSync } from "child_process";
import { writeFileSync, unlinkSync, existsSync, readFileSync } from "fs";
import { tmpdir } from "os";
import { join } from "path";
import { randomUUID } from "crypto";
```

### Issue 3 of 3
extensions/gemini-tts/CHANGELOG.md:3
The date should use the `{PR_MERGE_DATE}` template placeholder rather than a hardcoded date. Raycast automatically substitutes the actual merge date when the PR is merged.

```suggestion
## [Initial Version] - {PR_MERGE_DATE}
```

Reviews (12): Last reviewed commit: "Update CHANGELOG.md and optimise images"

Comment thread extensions/gemini-tts/CHANGELOG.md Outdated
"description": "Increase reading speed by 0.25× for the next segment.",
"mode": "no-view"
},
{
Contributor

P2 Title case violation on command titles

"Speed up Reading" should be "Speed Up Reading" — all words in a command title should use title case per the Raycast convention. Similarly, "Read with Voice Selection" should be "Read With Voice Selection".

Suggested change
{
"title": "Speed Up Reading",

Rule Used: What: Use title case for titles in package.json.... (source)

Contributor Author

@xwzhangSZU xwzhangSZU May 26, 2026

Not applying this one — Raycast's official ray lint warns against this form and expects "Read with Voice Selection" / "Speed up Reading" (AP-style: short prepositions like "with" / "up" stay lowercase). Reverted the title-cased variants I briefly had locally so ray lint stays clean. Heads-up in case the lint rule used here is older than the current Raycast convention.

Hide Gemini TTS's per-request synthesis latency behind smarter scheduling
and a content-addressed audio cache. Gemini has no streaming endpoint,
so the only levers are TTFA reduction and inter-chunk gap removal.

- Lead chunk (~60-260 chars at the nearest sentence/clause boundary)
  shrinks first-audio latency from ~4-8s to ~1-2s on long inputs.
- Producer/consumer pipeline (depth-1 prefetch) synthesizes chunk N+1
  while chunk N plays, so users only ever wait once.
- SHA-256 audio cache makes Restart Reading, Resume, voice preview, and
  paragraph re-reads instant. LRU sweep at 200 MB; speed excluded from
  the key so afplay rate changes hit cache.
- Static director profile moved to systemInstruction so per-chunk
  requests carry only the transcript — fewer tokens per chunk on long
  reads.
- Menu-bar status refreshes within ~1s of phase transitions via
  background launchCommand, throttled to 750ms.

Applied to read-with-voice's command-internal loop too, not just the
shared reading-runner, so all read paths get the smoothness win.
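The content-addressed cache described in this commit message can be illustrated with a minimal sketch. The `SynthesisInputs` shape, its field names, and the `cacheVersion` field are assumptions made for illustration; the PR does not show the real key layout. The point is only that playback speed stays out of the hash, so a rate change reuses the same audio file.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of the inputs that affect the synthesized audio.
interface SynthesisInputs {
  model: string;
  voice: string;
  prompt: string; // director prompt plus transcript chunk
  cacheVersion: number; // bump to invalidate older entries
}

// Playback speed is deliberately absent: it is applied by the player,
// so the same WAV can be reused at any rate.
export function cacheKey(inputs: SynthesisInputs): string {
  // A JSON array keeps field boundaries unambiguous before hashing.
  const canonical = JSON.stringify([
    inputs.cacheVersion,
    inputs.model,
    inputs.voice,
    inputs.prompt,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}
```

Serializing the fields as a JSON array rather than concatenating them avoids two different input sets collapsing into the same string before hashing.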

Two pre-existing correctness gaps surfaced by the latency audit:

- Concurrent-instance race: a Quick Read trigger during the lead
  chunk's synthesis (before any afplay process exists) used to launch
  a parallel reader instead of toggle-stopping. Adds a session lock
  file held across synth+play, and extends stopExternalPlayback to
  signal stop even when only synthesis (no afplay) is running.
- Voice preview now writes playback state, so menu-bar Stop Reading
  can interrupt it and the menu bar reflects in-progress previews.
Comment thread extensions/gemini-tts/src/quick-read.tsx
xwzhangSZU added 3 commits May 3, 2026 18:16
- "Audio Cache" row in the menu bar shows current size + entry count
  and clears the cache on click; repopulates lazily as the user reads.
- Menu refresh now fires only on phase transitions (synthesizing /
  playing / stopped / completed) instead of also on a 750ms time
  tick. Bounded at ~2 background launches per chunk.

Surface the new performance and stop semantics in the README so the
extension's listing reflects what the code actually does now.

Addresses Greptile P1 from the automated review on this PR. The two
"Nothing to read" hints were swapped: the message offering "you can
also resume your last reading" was shown when no paused session
existed, and vice versa. Swap them so the Resume hint surfaces only
when there is something to resume.
Comment thread extensions/gemini-tts/src/read-with-voice.tsx Outdated
xwzhangSZU and others added 3 commits May 3, 2026 18:35
Live request against gemini-3.1-flash-tts-preview returned:

  HTTP 400: "Developer instruction is not enabled for this model"

so every Quick Read in the previous push was failing on the wire. Move
the entire director prompt back inline in `contents`. Bump cache
version to invalidate any stale entries. Verified rolled-back shape
returns HTTP 200 with audio.

Fire up to 3 parallel synthesis requests ahead of playback so audio
buffers are ready before the current chunk finishes. Paid API rate
limits easily accommodate this; eliminates audible gaps on long texts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
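A bounded prefetch window like the one this commit describes could look roughly like the sketch below. The `synthesize` and `play` callbacks and the `readWithPrefetch` name are hypothetical stand-ins, not the extension's actual functions; the sketch only shows how up to `depth` synthesis requests can stay in flight while chunks are still played strictly in order.

```typescript
// Hypothetical signatures: synthesize() returns audio for one chunk,
// play() resolves when playback of that audio has finished.
type Synthesize<A> = (chunk: string, index: number) => Promise<A>;
type Play<A> = (audio: A, index: number) => Promise<void>;

export async function readWithPrefetch<A>(
  chunks: string[],
  synthesize: Synthesize<A>,
  play: Play<A>,
  depth = 3,
): Promise<void> {
  const windowSize = Math.max(1, depth);
  const inFlight: Promise<A>[] = [];
  let next = 0;

  // Start synthesis for upcoming chunks until the window is full.
  const fill = () => {
    while (next < chunks.length && inFlight.length < windowSize) {
      const pending = synthesize(chunks[next], next);
      pending.catch(() => {}); // avoid an unhandled rejection before it is awaited
      inFlight.push(pending);
      next++;
    }
  };

  fill();
  for (let i = 0; i < chunks.length; i++) {
    const audio = await inFlight.shift()!; // oldest request first, so order is kept
    fill(); // top the window up while this chunk plays
    await play(audio, i);
  }
}
```

A failed synthesis still surfaces when its turn comes, because the original promise is awaited in the loop; the empty `catch` only silences the early warning.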
@0xdhrv 0xdhrv self-assigned this May 22, 2026
Contributor

nitpick: unused SVG file

This looks like it may no longer be needed. If it's safe to remove, deleting it could help keep it cleaner.

Contributor Author

Done in 50cd3c84 — the manifest and README only reference command-icon.png, the source SVG was an editable artifact from the design pass.

Contributor

suggestion (non-blocking): Would it make sense to use the same extension icon for both dark and light mode here?

This seems like the two icons are identical, so keeping both may not add much value.

Contributor Author

Consolidated in 50cd3c84 — the dark variant only differed by one or two background shades, so I removed it and Raycast falls back to command-icon.png for both modes.

Contributor

nitpick: unused SVG file

This looks like it may no longer be needed. If it's safe to remove, deleting it could help keep it cleaner.

Contributor Author

Done in 50cd3c84 — removed alongside the light source SVG.

| "gemini-3.1-flash-tts-preview"
| "gemini-2.5-flash-preview-tts"
| "gemini-2.5-pro-preview-tts";
export type GeminiLanguageMode = "auto" | "cmn" | "en" | "mixed-cmn-en";
Contributor

question: multiple language support

Is it possible to support multiple languages here?

Contributor Author

The extension already handles multiple languages. languageMode defaults to auto, which passes text straight through to Gemini 3.1 Flash TTS Preview (the extension's default model) — that model supports 70+ languages with automatic input-language detection, so pasted text in any supported language is spoken correctly without extra config.

The cmn / en / mixed-cmn-en values aren't a language whitelist — they're optional delivery hints that bias pronunciation and pacing for the extension's primary Chinese + English legal-academic use case (citations, mixed-script handling, code-switching). I kept the explicit list narrow on purpose since auto plus the model's built-in detection already covers the rest. Happy to broaden it if per-language hints would be valuable.

Contributor

@0xdhrv 0xdhrv left a comment

Hey @xwzhangSZU 👋

I have added a few comments for you to address.

I'm looking forward to testing this extension again 🔥

Feel free to contact me here or at Slack if you have any questions.


I converted this PR into a draft until it's ready for the review, please press the button Ready for review when it's ready and we'll have a look 😊

@0xdhrv 0xdhrv marked this pull request as draft May 22, 2026 18:13
xwzhangSZU added a commit to xwzhangSZU/Raycast-Gemini-TTS that referenced this pull request May 22, 2026
- Remove unused source SVGs and the near-identical @dark icon variant;
  Raycast falls back to command-icon.png for dark mode.
- Replace MiniMax-fork leftover metadata screenshots with real Gemini
  TTS captures: Quick Read picker (Active Configuration line shows
  gemini-3.1-flash-tts-preview / Mixed CN/EN) and Read-with-Voice
  picker (voices grouped by personality).

Addresses 0xdhrv review on raycast/extensions#27612.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xwzhangSZU added a commit to xwzhangSZU/Raycast-Gemini-TTS that referenced this pull request May 22, 2026
read-with-voice.tsx and select-voice.tsx acquire a session lock for the
first read, then on the next read reach stopExternalPlayback() — whose
case-2 path sees hasActiveSession() true (the live PID is us), writes
STOP_FILE, and returns true. The follow-up waitForSessionLockRelease
then waits for us to release a lock we hold while sitting blocked in
the wait, surfacing as a permanent "Stopping previous reading" toast.

- Release our own session lock before stopExternalPlayback (no-op when
  another process owns the lock).
- Clear STOP_FILE in the cleanup finally blocks (also Greptile #1):
  without it a leftover STOP_FILE makes the next session exit on its
  first iteration.

Addresses 0xdhrv review on raycast/extensions#27612.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
xwzhangSZU and others added 2 commits May 23, 2026 07:34
- Remove unused source SVGs and the near-identical @dark icon variant;
  Raycast falls back to command-icon.png for dark mode.
- Replace MiniMax-fork leftover metadata screenshots with real Gemini
  TTS captures: Quick Read picker (Active Configuration line shows
  gemini-3.1-flash-tts-preview / Mixed CN/EN) and Read-with-Voice
  picker (voices grouped by personality).

Addresses 0xdhrv review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
read-with-voice.tsx and select-voice.tsx acquire a session lock for the
first read, then on the next read reach stopExternalPlayback() — whose
case-2 path sees hasActiveSession() true (the live PID is us), writes
STOP_FILE, and returns true. The follow-up waitForSessionLockRelease
then waits for us to release a lock we hold while sitting blocked in
the wait, surfacing as a permanent "Stopping previous reading" toast.

- Release our own session lock before stopExternalPlayback (no-op when
  another process owns the lock).
- Clear STOP_FILE in the cleanup finally blocks (also Greptile raycast#1):
  without it a leftover STOP_FILE makes the next session exit on its
  first iteration.

Addresses 0xdhrv review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@xwzhangSZU
Contributor Author

Hi @0xdhrv — addressed your review and ready for another look.

  • 50cd3c84 — assets cleanup (your 3 nitpicks/suggestion) + replaced the MiniMax-leftover metadata screenshots with real Gemini TTS captures (the latter was also Greptile-flagged on earlier rounds)
  • 20c863fc — separate fix discovered while regenerating the screenshots: Read with Voice Selection could deadlock on a same-view re-read because stopExternalPlayback treated this process's own session lock as an external holder, so waitForSessionLockRelease waited on a lock we held → permanent "Stopping previous reading" toast. Released our own lock first; also cleared STOP_FILE in the cleanup finally (Greptile's prior point).

Inline replies above. Thanks for testing!

@xwzhangSZU xwzhangSZU marked this pull request as ready for review May 22, 2026 23:36
Three real bugs Greptile flagged in earlier review rounds that
weren't caught by the on-PR back-and-forth:

- restart-reading.tsx / resume-reading.tsx: stop signal was erased
  before the running session could observe it. Added the
  `waitForSessionLockRelease()` guard that `read-with-voice.tsx`
  already uses so STOP_FILE outlives the synchronous tick and the
  old session actually exits, instead of the new command silently
  failing on lock contention.

- reading-runner.ts: `writePlaybackSpeed()` and the stopPoll
  `setInterval` ran *before* the `try` that wraps
  `releaseSessionLock()` in a `finally`. If startup threw, the
  session lock leaked and blocked every subsequent reading until
  Raycast restarts. Hoisted resources and moved the try boundary
  up so cleanup is guaranteed.

- gemini-tts.ts: `shouldRetry()` only handled `TTSApiError`, so
  transient network failures from `fetch` (TypeError, ECONNRESET,
  EAI_AGAIN, undici socket errors) fell through and aborted long
  reading sessions on a single dropped packet. Added a
  NETWORK_ERROR_CODES set + cause-walking so they retry alongside
  HTTP 429/5xx.

Also drops the misleading "Removed" section from CHANGELOG — the
extension never had MiniMax features for end-users, so a removal
note in an initial-release changelog is confusing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
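The reading-runner fix in this commit message comes down to where the `try` boundary sits. The sketch below is a simplified illustration under assumed names (`SessionDeps`, `runSession`, and the 250 ms poll interval are all hypothetical); it shows startup work placed inside `try` so that a throw still reaches the `finally` that releases the lock.

```typescript
// Hypothetical helpers, passed in so the sketch stays self-contained.
interface SessionDeps {
  acquireLock: () => boolean;
  releaseLock: () => void;
  start: () => void; // e.g. writing the playback speed; may throw
  pollStop: () => void;
  play: () => Promise<void>;
}

export async function runSession(deps: SessionDeps): Promise<boolean> {
  if (!deps.acquireLock()) return false; // another session is active
  let stopPoll: ReturnType<typeof setInterval> | undefined;
  try {
    deps.start(); // inside try: a throw here still reaches finally
    stopPoll = setInterval(deps.pollStop, 250); // interval value is an assumption
    await deps.play();
    return true;
  } finally {
    if (stopPoll !== undefined) clearInterval(stopPoll);
    deps.releaseLock(); // always runs once the lock was acquired
  }
}
```

Declaring `stopPoll` before the `try` lets the `finally` block clean it up whether or not the interval was ever started.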
@xwzhangSZU
Contributor Author

Pushed d60e8c83 — pure follow-up that closes the remaining Greptile P1 items the earlier rounds left open. No reviewer-facing logic changed, just hardening:

  • restart-reading.tsx / resume-reading.tsx — added the waitForSessionLockRelease() guard that read-with-voice.tsx already uses, so the running session actually observes STOP_FILE before the new command clears it. Previously the stop signal lived ~0 ms (back-to-back sync calls) and the new playReadingSession returned "Another reading is already in progress".
  • reading-runner.ts — moved writePlaybackSpeed() and the stopPoll setInterval inside the try block, so an early throw at session start can't leak the lock and block every subsequent reading until Raycast restarts.
  • gemini-tts.ts: shouldRetry() now also retries transient fetch failures (ECONNRESET / ETIMEDOUT / EAI_AGAIN / ENOTFOUND / EPIPE / undici socket errors / TypeError: fetch failed), not just HTTP 429/5xx. A single dropped packet during a long reading no longer aborts the session.
  • CHANGELOG.md — dropped the misleading "Removed" section (Greptile P2) since the extension never shipped MiniMax features to end users; documented the three fixes above.
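As an illustration of the retry change in the list above, here is a minimal sketch of a `shouldRetry` that walks the error's `cause` chain. The `TTSApiError` class is a hypothetical stand-in for the extension's own error type, and the code set simply mirrors the codes named in the comment; the set in the PR may differ, and undici-specific codes are left out.

```typescript
// Hypothetical stand-in for the extension's TTSApiError.
class TTSApiError extends Error {
  constructor(message: string, readonly status: number) {
    super(message);
    // Keeps instanceof working if the code is compiled down to ES5.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Codes named in the comment above; the real set is an assumption.
const NETWORK_ERROR_CODES = new Set([
  "ECONNRESET",
  "ETIMEDOUT",
  "EAI_AGAIN",
  "ENOTFOUND",
  "EPIPE",
]);

export function shouldRetry(error: unknown): boolean {
  if (error instanceof TTSApiError) {
    return error.status === 429 || error.status >= 500;
  }
  // fetch tends to wrap the socket error, so walk a few levels of `cause`.
  let current: unknown = error;
  for (let depth = 0; depth < 5 && current instanceof Error; depth++) {
    const code: unknown = Reflect.get(current, "code");
    if (typeof code === "string" && NETWORK_ERROR_CODES.has(code)) return true;
    if (current instanceof TypeError && current.message === "fetch failed") return true;
    current = Reflect.get(current, "cause");
  }
  return false;
}
```

The depth cap is a defensive choice so that a self-referencing `cause` chain cannot loop forever.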

One Greptile suggestion intentionally not applied: the "title case" P2 asking to rewrite "Read with Voice Selection" to "Read With Voice Selection" and "Speed up Reading" to "Speed Up Reading". Raycast's own linter (ray lint) actively warns against that form and expects the AP-style casing (short prepositions stay lowercase), so I left the titles as-is to keep ray lint clean.

ray build is green on the new HEAD. Ready for another look @0xdhrv — thanks!

Contributor

@0xdhrv 0xdhrv left a comment

Looks good to me, approved ✅

@raycastbot raycastbot merged commit 3a9bc77 into raycast:main May 27, 2026
3 checks passed
@github-actions
Contributor

Published to the Raycast Store:
https://raycast.com/xianwei_zhang/gemini-tts

@raycastbot
Collaborator

🎉 🎉 🎉

We've rewarded your Raycast account with some credits. You will soon be able to exchange them for some swag.

Comment on lines +246 to +256
}
} finally {
synthesisController.abort();
clearInterval(stopPoll);
if (synthesisAbortRef.current === synthesisController) {
synthesisAbortRef.current = null;
}
setProgress((current) => (current?.voiceId === voice.id ? null : current));
releaseSessionLock();
clearExternalStopRequest();
}
Contributor

P1 Session lock released by a previous handleRead invocation

All calls to handleRead share the same process.pid. When the user quickly clicks voice B while voice A is still synthesizing, B's setup (lines 97–122) releases A's lock and immediately acquires a new one — but A's finally block runs asynchronously (after B's first await), sees pid === process.pid, and deletes the lock file B just wrote. From that point on B holds no session lock, so a concurrent Quick Read or Resume command will not detect an active session and can start a parallel reader — resulting in two TTS streams and audio overlap.

The root cause is that releaseSessionLock can't distinguish "this process's lock from an earlier invocation" from "this process's lock from the current invocation". A lightweight fix is to track the owner with a per-invocation token: write a ${process.pid}:${token} string to the lock file and only release when the on-disk token still matches the local token.
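The per-invocation token suggested here might be sketched as follows. The lock path, function signatures, and return values are assumptions for illustration and are not the extension's actual session-lock API; contention checks against other processes are omitted to keep the focus on the token comparison.

```typescript
import { existsSync, readFileSync, unlinkSync, writeFileSync } from "node:fs";
import { randomUUID } from "node:crypto";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical lock path; the extension's real location is not shown in this thread.
const LOCK_FILE = join(tmpdir(), "gemini-tts-session-sketch.lock");

// Writes "<pid>:<token>" and hands the token back to the caller.
export function acquireSessionLock(): string {
  const token = randomUUID();
  writeFileSync(LOCK_FILE, `${process.pid}:${token}`);
  return token;
}

// Deletes the lock only if it still belongs to this invocation.
export function releaseSessionLock(token: string): boolean {
  if (!existsSync(LOCK_FILE)) return false;
  const onDisk = readFileSync(LOCK_FILE, "utf8").trim();
  if (onDisk !== `${process.pid}:${token}`) return false; // a newer invocation owns it
  unlinkSync(LOCK_FILE);
  return true;
}
```

With this shape, invocation A's `finally` block would call `releaseSessionLock(tokenA)` and leave the file alone once invocation B has written its own token. The read-then-unlink pair is not atomic, so a stricter design would need more than this sketch provides.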
