feat(android/voice): on-device STT and LLM via SpeechRecognizer + Gemini Nano #27
- Fix Android microphone runtime permission — was silently denied with no prompt; now uses ActivityResultContracts.RequestPermission wired through MainActivity
- Add AndroidSpeechRecognizerProvider: wraps android.speech.SpeechRecognizer as a DirectSpeechProvider (combined record+transcribe, no audio upload, works offline) with EXTRA_PREFER_OFFLINE=true and RMS amplitude for waveform animation
- Add MlKitLlmFormatterProvider: on-device LLM formatting via ML Kit Prompt API (Gemini Nano through AICore); handles AVAILABLE/DOWNLOADABLE/DOWNLOADING states; returns a user-friendly error while the model downloads rather than blocking
- Add DirectSpeechProvider interface to commonMain to support integrated pipelines that bypass the two-step record→STT path
- Both on-device options are configurable via Settings → Voice Capture toggles; toggles only appear on devices that report availability
- Pipeline priority: device LLM > Anthropic Claude > OpenAI > no-op
- Bump minSdk 24→26 (required by com.google.mlkit:genai-prompt:1.0.0-beta2)
- Add genai-prompt dependency to kmp androidMain

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
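The pipeline priority above (device LLM > Anthropic Claude > OpenAI > no-op) amounts to a first-non-null selection. A minimal sketch of that rule, where the interface and names are illustrative assumptions rather than the actual VoicePipelineFactory code:

```kotlin
// Sketch of the formatter-priority rule only, not the actual
// VoicePipelineFactory implementation; names here are assumptions.
interface LlmFormatter { suspend fun format(transcript: String): String }

object NoOpFormatter : LlmFormatter {
    // Fallback when no formatter is available: pass the transcript through.
    override suspend fun format(transcript: String) = transcript
}

// Each slot is null when the option is unavailable on the device or
// disabled in settings; the first non-null provider wins.
fun selectFormatter(
    deviceLlm: LlmFormatter?,   // ML Kit Prompt API / Gemini Nano
    anthropic: LlmFormatter?,   // Anthropic Claude (cloud)
    openAi: LlmFormatter?,      // OpenAI (cloud)
): LlmFormatter = deviceLlm ?: anthropic ?: openAi ?: NoOpFormatter
```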
Pull request overview
Adds an Android on-device voice pipeline by introducing a “direct” speech path (record+transcribe in one component) and an optional on-device LLM formatter, with settings-driven routing and Android-specific permission handling.
Changes:
- Introduces `DirectSpeechProvider` + pipeline/view-model branching to support integrated listen→transcript providers (`AndroidSpeechRecognizer`).
- Adds an on-device LLM formatting provider (ML Kit Prompt API / Gemini Nano) and settings UI toggles gated by device capability.
- Wires Android mic permission prompting via `ActivityResultContracts.RequestPermission`, and updates build config for the new ML Kit dependency / minSdk bump.
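The permission wiring bridges the callback-style launcher into a suspend function. A minimal sketch of that pattern, assuming class and property names that are not necessarily those used in MainActivity:

```kotlin
// Illustrative sketch of bridging the one-shot permission launcher into
// a suspend function; not the actual MainActivity code, names assumed.
import android.Manifest
import androidx.activity.ComponentActivity
import androidx.activity.result.contract.ActivityResultContracts
import kotlinx.coroutines.CompletableDeferred

class MicPermissionRequester(activity: ComponentActivity) {
    private var pending: CompletableDeferred<Boolean>? = null

    // Must be registered before the activity reaches STARTED,
    // e.g. during onCreate.
    private val launcher = activity.registerForActivityResult(
        ActivityResultContracts.RequestPermission()
    ) { granted -> pending?.complete(granted) }

    // Suspends until the user answers the system permission dialog.
    suspend fun request(): Boolean {
        val deferred = CompletableDeferred<Boolean>()
        pending = deferred
        launcher.launch(Manifest.permission.RECORD_AUDIO)
        return deferred.await()
    }
}
```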
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/voice/VoiceSettings.kt | Adds persisted toggles for on-device STT/LLM. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/voice/VoicePipelineFactory.kt | Extends pipeline factory to accept direct speech + device LLM providers and prioritize them. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/voice/VoicePipelineConfig.kt | Adds directSpeechProvider and effectiveAmplitudeFlow for waveform routing. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/voice/VoiceCaptureViewModel.kt | Branches capture flow to use direct speech provider when present. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/voice/DirectSpeechProvider.kt | New interface for integrated listen/transcribe providers (+ optional amplitude). |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/components/settings/VoiceCaptureSettings.kt | Adds on-device STT/LLM toggles and hides cloud key inputs when enabled. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/components/settings/SettingsDialog.kt | Plumbs device capability flags into voice settings UI. |
| kmp/src/commonMain/kotlin/dev/stapler/stelekit/ui/App.kt | Plumbs device capability flags through app UI and switches waveform to effectiveAmplitudeFlow. |
| kmp/src/androidMain/kotlin/dev/stapler/stelekit/voice/MlKitLlmFormatterProvider.kt | New ML Kit Prompt API formatter provider with eligibility/status handling. |
| kmp/src/androidMain/kotlin/dev/stapler/stelekit/voice/AndroidSpeechRecognizerProvider.kt | New SpeechRecognizer-backed DirectSpeechProvider with RMS amplitude mapping. |
| kmp/src/androidMain/kotlin/dev/stapler/stelekit/voice/AndroidAudioRecorder.kt | Adds suspendable mic permission request hook before recording starts. |
| kmp/build.gradle.kts | Adds ML Kit genai-prompt dependency to androidMain. |
| androidApp/src/main/kotlin/dev/stapler/stelekit/MainActivity.kt | Implements mic permission flow, device capability detection, and new pipeline construction. |
| androidApp/build.gradle.kts | Bumps app minSdk to 26. |
```kotlin
    if (deviceSttAvailable && voiceSettings.getUseDeviceStt()) deviceSttProvider else null,
    if (deviceLlmAvailable && voiceSettings.getUseDeviceLlm()) mlKitProvider else null,
)
var voicePipeline by remember { mutableStateOf(buildPipeline()) }
```
deviceLlmAvailable is resolved asynchronously in LaunchedEffect, but voicePipeline is only built once and then rebuilt only via onRebuildVoicePipeline. If a user already has “Use on-device LLM” enabled, they’ll still start with the cloud/no-op pipeline until they manually open Settings and hit Save. Consider rebuilding the pipeline automatically when deviceLlmAvailable flips to true (or pass mlKitProvider regardless of deviceLlmAvailable and let the provider handle unsupported/downloading states).
Suggested change:

```kotlin
var voicePipeline by remember { mutableStateOf(buildPipeline()) }
LaunchedEffect(deviceLlmAvailable) {
    voicePipeline = buildPipeline()
}
```
Fixed in commit edbf46c: added LaunchedEffect(deviceLlmAvailable) in MainActivity that calls onRebuildVoicePipeline() whenever the flag flips, so the pipeline is rebuilt automatically without requiring a manual settings round-trip.
```kotlin
if (requestMicPermission != null && !requestMicPermission()) {
    return@withContext PlatformAudioFile("")
}

stopRequested = false
pauseRequested = false
```
The mic permission await happens before stopRequested is reset to false. If the user taps the mic, the permission dialog shows, and they tap again to stop/cancel while the dialog is up, stopRecording() can set stopRequested=true but it will be overwritten after permission is granted, causing an unexpected recording session to start. Consider checking stopRequested/cancellation before resetting flags, or moving the flag reset earlier and short-circuiting after permission if a stop was requested.
Suggested change:

```kotlin
stopRequested = false
pauseRequested = false
if (requestMicPermission != null && !requestMicPermission()) {
    return@withContext PlatformAudioFile("")
}
if (stopRequested) {
    return@withContext PlatformAudioFile("")
}
```
Fixed in commit edbf46c: flag resets (isRecording = false, stopRequested = false) moved to before the permission await, and a stopRequested guard was added immediately after permission is granted so that cancel-during-permission-dialog races are handled correctly.
```kotlin
mainHandler.post {
    val recognizer = SpeechRecognizer.createSpeechRecognizer(context)
    activeRecognizer = recognizer

    recognizer.setRecognitionListener(object : RecognitionListener {
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
        override fun onPartialResults(partialResults: Bundle?) {}

        override fun onRmsChanged(rmsdB: Float) {
            // Map roughly -2..10 dB → 0..1
            _amplitudeFlow.value = ((rmsdB + 2f) / 12f).coerceIn(0f, 1f)
        }

        override fun onResults(results: Bundle?) {
            _amplitudeFlow.value = 0f
            activeRecognizer = null
            recognizer.destroy()
            if (!cont.isActive) return
            val text = results
                ?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                ?.firstOrNull()
            Log.d(TAG, "onResults: text=${text?.take(80)}")
            if (text.isNullOrBlank()) cont.resume(TranscriptResult.Empty)
            else cont.resume(TranscriptResult.Success(text))
        }

        override fun onError(error: Int) {
            _amplitudeFlow.value = 0f
            activeRecognizer = null
            recognizer.destroy()
            if (!cont.isActive) return
            Log.w(TAG, "onError: code=$error")
            cont.resume(mapError(error))
        }
    })

    val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
        putExtra(RecognizerIntent.EXTRA_PREFER_OFFLINE, true)
        putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1)
        putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS, 3_000L)
        putExtra(RecognizerIntent.EXTRA_SPEECH_INPUT_POSSIBLY_COMPLETE_SILENCE_LENGTH_MILLIS, 1_500L)
    }
    recognizer.startListening(intent)

    cont.invokeOnCancellation {
        mainHandler.post {
            activeRecognizer?.let {
                it.cancel()
                it.destroy()
                activeRecognizer = null
            }
            _amplitudeFlow.value = 0f
        }
    }
}
```
listen() creates and starts SpeechRecognizer inside a Handler.post without any try/catch. If createSpeechRecognizer(...) or startListening(...) throws (e.g., service not available), the coroutine will never be resumed/cancelled, leaving the UI stuck in Recording. Wrap creation/start in a try/catch and resume with a failure TranscriptResult (and ensure activeRecognizer is cleared/destroyed).
Suggested change:

```kotlin
cont.invokeOnCancellation {
    mainHandler.post {
        activeRecognizer?.let {
            it.cancel()
            it.destroy()
            activeRecognizer = null
        }
        _amplitudeFlow.value = 0f
    }
}
mainHandler.post {
    var recognizer: SpeechRecognizer? = null
    try {
        recognizer = SpeechRecognizer.createSpeechRecognizer(context)
        activeRecognizer = recognizer
        recognizer.setRecognitionListener(object : RecognitionListener {
            // ... listener callbacks unchanged from the hunk above ...
        })
        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            // ... intent extras unchanged from the hunk above ...
        }
        recognizer.startListening(intent)
    } catch (t: Throwable) {
        _amplitudeFlow.value = 0f
        activeRecognizer = null
        recognizer?.destroy()
        Log.w(TAG, "Failed to start speech recognition", t)
        if (cont.isActive) {
            cont.resume(mapError(SpeechRecognizer.ERROR_CLIENT))
        }
    }
}
```
```kotlin
_state.value = VoiceCaptureState.Recording
val transcriptResult = if (pipeline.directSpeechProvider != null) {
    pipeline.directSpeechProvider.listen()
} else {
    recordAndTranscribe()
}
```
VoiceCaptureViewModel now has a new direct-speech branch (directSpeechProvider.listen() / stopListening()), but the existing VoiceCaptureViewModelTest suite doesn’t cover it. Adding tests for success + permission denied/empty cases via a fake DirectSpeechProvider would help prevent regressions in the new pipeline path.
Fixed in commit edbf46c: added 4 tests in VoiceCaptureViewModelTest covering the DirectSpeechProvider path — success, PermissionDenied, Empty, and cancel.
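A fake-provider test for the success case might look roughly like the sketch below; the helper names (`testPipeline`, the transcript state property) are assumptions, not the actual VoiceCaptureViewModelTest code:

```kotlin
// Sketch only; helper and property names are assumptions, not the
// actual VoiceCaptureViewModelTest code.
class FakeDirectSpeechProvider(
    private val result: TranscriptResult,
) : DirectSpeechProvider {
    override suspend fun listen(): TranscriptResult = result
    override fun stopListening() {}
}

@Test
fun directSpeechSuccessUsesProviderResult() = runTest {
    val pipeline = testPipeline(
        directSpeechProvider =
            FakeDirectSpeechProvider(TranscriptResult.Success("hello"))
    )
    val viewModel = VoiceCaptureViewModel(pipeline)

    viewModel.startCapture()

    // The direct path should bypass recordAndTranscribe() entirely.
    assertEquals("hello", viewModel.transcript.value)
}
```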
```kotlin
// Encrypted SharedPreferences for API key storage
implementation("androidx.security:security-crypto:1.1.0-alpha06")

// On-device LLM via Gemini Nano (Pixel 9+ and AICore-enabled OEM flagships)
implementation("com.google.mlkit:genai-prompt:1.0.0-beta2")
}
```
Adding com.google.mlkit:genai-prompt to the kmp Android source set likely requires raising this module’s android.defaultConfig.minSdk as well. kmp/build.gradle.kts still declares minSdk = 24 (near the bottom of the file), which can cause a Gradle manifest merge failure if genai-prompt has minSdk >= 26 even though the app module is already bumped. Please align the kmp module minSdk with the new requirement.
Fixed in commit b72c420: kmp/build.gradle.kts android { defaultConfig { minSdk } } raised to 26 to satisfy the genai-prompt:1.0.0-beta2 manifest requirement.
genai-prompt:1.0.0-beta2 declares minSdkVersion 26 in its manifest.
The kmp library module had its own android { defaultConfig { minSdk } }
block at 24, causing processDebugUnitTestManifest to fail.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
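The alignment amounts to the library module's own Android block matching the dependency's floor, roughly (a sketch of the relevant section, not the exact file contents):

```kotlin
// kmp/build.gradle.kts, sketch of the relevant section only.
android {
    defaultConfig {
        // genai-prompt:1.0.0-beta2 declares minSdkVersion 26 in its
        // manifest; the library module must match or exceed it, or the
        // manifest merger fails.
        minSdk = 26
    }
}
```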
…Main PerformanceDashboard.kt used Dispatchers.IO directly in commonMain, which is a JVM-only symbol. compileCommonMainKotlinMetadata fails when it encounters it because Dispatchers.IO is absent from the multiplatform metadata API surface. queryPercentiles() is a blocking SQLite call so PlatformDispatcher.DB is the correct dispatcher per the project's own dispatcher matrix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Kotlin incremental compilation stores per-project metadata in ~/.gradle/caches/kotlin-build-*/. When a previous CI run compiled PlatformSettings.kt with `: Settings` (commit dc1b51b), the IC state was saved in the Gradle home cache restored by setup-gradle. Subsequent runs restore that stale metadata even though the source no longer has the supertype, causing a spurious expect/actual mismatch:

    expect: PlatformSettings : Settings
    actual: PlatformSettings : Any

Excluding caches/kotlin-build-* forces a clean IC state per run, eliminating cross-branch metadata pollution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@copilot Please address all of the comments that you left
- MainActivity: rebuild voice pipeline automatically when deviceLlmAvailable flips via LaunchedEffect(deviceLlmAvailable)
- AndroidAudioRecorder: move flag resets before permission check and add stopRequested guard after permission grant to prevent spurious recordings
- AndroidSpeechRecognizerProvider: move invokeOnCancellation before mainHandler.post and wrap recognizer creation/start in try/catch so thrown exceptions don't leave the coroutine stuck forever
- VoiceCaptureViewModelTest: add 4 tests covering DirectSpeechProvider path (success, PermissionDenied, Empty, cancel)
- kmp/build.gradle.kts: minSdk=26 already aligned (no change needed)

Agent-Logs-Url: https://github.com/tstapler/stelekit/sessions/c6a0afad-d76c-48da-90b0-3565a04aeb3f
Co-authored-by: tstapler <3860386+tstapler@users.noreply.github.com>
Add an isActive check inside mainHandler.post after setting activeRecognizer, so that if cancellation fires before the post runs (when cleanup would have been a no-op), the recognizer is destroyed immediately rather than starting a zombie session.

Agent-Logs-Url: https://github.com/tstapler/stelekit/sessions/c6a0afad-d76c-48da-90b0-3565a04aeb3f
Co-authored-by: tstapler <3860386+tstapler@users.noreply.github.com>
All four review comments addressed in commits
Brings in:
- fix(resilience): loading screen hang + Settings interface (dc1b51b)
- test(ci): JVM test job + loading state regression tests (d036af1)
- ci: Gradle cache / smoke-test improvements
- chore: v0.9.4 release

Conflict resolutions:
- PerformanceDashboard.kt: keep PlatformDispatcher.DB (correct for SQLite reads, and required for iOS compileCommonMainKotlinMetadata)
- PlatformSettings.android.kt: add `: Settings` and `override` modifiers to match the expect class, which now implements the Settings interface
- FakeRepositories.kt: mark FakeFileSystem `open` so StelekitViewModelLoadingTest can subclass it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in:
- perf(android): decompose Phase 3 chunk writes (b5a51ab)
- chore(main): release 0.9.5 (#30)

Conflict resolution:
- build.gradle.kts: keep minSdk=26 (required by genai-prompt:1.0.0-beta2) and take testInstrumentationRunner from main

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in:
- feat(android): SteleKitApplication class + shared GraphManager (0e7915e)
- feat(fdroid): fix repo version history + version in Settings (9bb8a77)
- fix(search): create-page in link picker now inserts link (2a85c6d)
- ci: benchmark retry with rebase (962b647)
- chore(main): release 0.10.0 (#34)

Conflict resolutions:
- build.gradle.kts: keep genai-prompt + add glance dependencies from main
- MainActivity.kt: keep our fileSystem init with SAF folder picker and AudioRecorder with mic permission callback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary

- Wires `ActivityResultContracts.RequestPermission` through `MainActivity` using the same `CompletableDeferred` pattern as the folder picker — first tap shows the system dialog.
- On-device STT (`AndroidSpeechRecognizerProvider`): wraps `android.speech.SpeechRecognizer` as a `DirectSpeechProvider` that handles both recording and transcription in one step. Uses `EXTRA_PREFER_OFFLINE=true` (Pixel on-device model, no network), 3s silence timeout, and feeds RMS to the existing waveform animation. No Whisper API key needed.
- On-device LLM (`MlKitLlmFormatterProvider`): formats transcripts using Gemini Nano via the ML Kit Prompt API (`genai-prompt:1.0.0-beta2`). Handles all four `FeatureStatus` states; returns a friendly "downloading, try again" message instead of blocking during model download. Requires Pixel 9+ or AICore-enabled OEM flagships.
- `minSdk` bumped 24→26 (required by `genai-prompt`).

Architecture

- New `DirectSpeechProvider` interface in `commonMain` supports pipelines where a single component handles both recording and transcription (bypassing the two-step record→STT path).
- `VoiceCaptureViewModel` branches on `pipeline.directSpeechProvider` before falling back to the existing `AudioRecorder` + `SpeechToTextProvider` path.

Test plan

- `NoOpAudioRecorder` and no device toggles shown

🤖 Generated with Claude Code