A multimodal visual assistant for the visually impaired, powered by the adaptive reasoning of Gemini 3 Pro.
EchoSight is a mobile-first web application designed to act as a digital set of eyes for blind and low-vision users. By leveraging the advanced multimodal reasoning of Google Gemini 3 Pro, the application provides context-aware audio descriptions of the user's surroundings.
Unlike generic image-to-text tools, EchoSight features a novel "Hybrid Thinking Architecture." It dynamically adjusts the AI's reasoning depth based on the user's immediate need: a high-speed Safety Mode for immediate obstacle detection and a deep-reasoning Explore Mode for environmental appreciation. It utilizes native browser APIs for haptics, synthesized audio cues (earcons), and text-to-speech to ensure a completely accessible, eyes-free experience.
Visually impaired individuals often face two distinct challenges that require different cognitive approaches:
- Safety (Speed Critical): "Is there a hurdle, wet floor sign, or staircase in front of me right now?"
- Context (Depth Critical): "What does this sunset look like? What is the vibe of this room?"
Existing tools often treat these two problems identically, producing descriptions that are either too slow and verbose for safe navigation or too clinical for enjoyment.
- Function: Instantly identifies trip hazards, obstacles, or urgent text.
- AI Model: `gemini-3-pro-preview` (Optimized Configuration).
- Engineering Strategy: We use the "Low Thinking" configuration (`includeThoughts: false`) to suppress the model's internal monologue, showing that a heavy reasoning model can be tuned for sub-second, safety-critical tasks without switching models.
- Example Output: "Caution: Open manhole cover and traffic cones ahead."
- Function: Provides a poetic and atmospheric description of the scene.
- AI Model: `gemini-3-pro-preview` (Standard Configuration).
- Engineering Strategy: We unlock the model's full chain-of-thought capabilities (`includeThoughts: true`), allowing it to reason about the mood, lighting, and cultural context of the scene before generating a description.
- Example Output: "Warm golden sunlight hits the brick wall, casting long shadows that suggest a quiet late afternoon."
- Function: Allows specific user questions about the visual field.
- Interaction: User speaks "Where are my keys?"; App captures image + audio transcript.
- Response: Gemini reasons spatially to locate specific objects within the frame.
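Assuming Ask Mode pairs the captured JPEG frame with the voice transcript in a single multimodal request, the request parts could be assembled roughly as below. This is an illustrative sketch: the helper name and prompt wording are assumptions, not the app's actual code.

```typescript
// Hypothetical sketch of an Ask Mode request builder (not the app's real code).
interface InlineData {
  mimeType: string;
  data: string; // base64-encoded JPEG captured from the off-screen canvas
}

type Part = { text: string } | { inlineData: InlineData };

// Pair the captured frame with the transcribed voice question so the model
// can reason spatially about objects in the frame.
function buildAskContents(question: string, jpegBase64: string): Part[] {
  return [
    { inlineData: { mimeType: "image/jpeg", data: jpegBase64 } },
    { text: `Answer briefly for a blind user: ${question}` },
  ];
}

// The resulting array would be passed as the `contents` of a single
// generateContent call against gemini-3-pro-preview.
```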
- Framework: React 19 (ES Modules) & TypeScript.
- AI Engine: Google GenAI SDK (`@google/genai`).
- Styling: Tailwind CSS.
Instead of switching between different models (e.g., Flash vs. Pro), we dynamically adjust the `thinkingConfig` parameter of Gemini 3 Pro at runtime:
- Safety Requests: `includeThoughts: false` (minimizes token usage for speed).
- Explore Requests: `includeThoughts: true` (maximizes context understanding).
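A minimal sketch of this runtime switch, assuming a pure helper that picks the thinking configuration per mode. The function name and prompt strings below are illustrative, not the project's actual `services/gemini.ts` code:

```typescript
// Sketch of the hybrid thinking switch: one model, two reasoning depths.
// Names and prompt text are assumptions for illustration.
type Mode = "safety" | "explore";

interface RequestConfig {
  thinkingConfig: { includeThoughts: boolean };
  systemInstruction: string;
}

function buildRequestConfig(mode: Mode): RequestConfig {
  if (mode === "safety") {
    return {
      // Suppress the internal monologue for sub-second hazard alerts.
      thinkingConfig: { includeThoughts: false },
      systemInstruction: "Describe immediate hazards in under 10 words.",
    };
  }
  return {
    // Full chain-of-thought for rich, atmospheric descriptions.
    thinkingConfig: { includeThoughts: true },
    systemInstruction: "Describe the mood, lighting, and atmosphere of the scene.",
  };
}

// The result would be passed as `config` to a generateContent call, e.g.
// ai.models.generateContent({ model: "gemini-3-pro-preview", contents, config }).
```

Because only the config object changes, mode switching costs nothing at runtime: no second model is loaded or warmed up.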
- `CameraView.tsx`: Manages the `navigator.mediaDevices.getUserMedia` stream. Captures high-quality JPEG frames via an off-screen HTML5 canvas. Optimized for mobile back-facing cameras (`facingMode: 'environment'`).
- `services/gemini.ts`: A stateless service that implements a context-aware prompt system. It forces the Pro model to be concise (under 10 words) for safety, preventing the verbosity common in large reasoning models.
- `services/sound.ts` (the audio engine): Uses the Web Audio API (`AudioContext`) to generate sounds programmatically with oscillators, removing the need to load external MP3 assets and reducing latency.
- `services/speech.ts`: Wraps `window.speechSynthesis` for text-to-speech (TTS) and `window.SpeechRecognition` for voice commands, with logic to prioritize natural-sounding "Google" English voices.
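The oscillator approach described for `services/sound.ts` can be sketched as follows. The tone frequencies, durations, and function names are assumptions for illustration, not the app's real earcons; the pattern data is pure, and only `playEarcon` touches browser audio:

```typescript
// Sketch of programmatic earcons: tone patterns as pure data, played via
// Web Audio oscillators. Frequencies/durations here are hypothetical.
interface Tone {
  freqHz: number;
  durationMs: number;
}

// Falling two-tone for safety alerts, rising two-tone for explore results.
function earconPattern(kind: "safety" | "explore"): Tone[] {
  return kind === "safety"
    ? [{ freqHz: 880, durationMs: 120 }, { freqHz: 440, durationMs: 180 }]
    : [{ freqHz: 440, durationMs: 120 }, { freqHz: 660, durationMs: 180 }];
}

// Browser-only: schedule the tones back-to-back on an AudioContext.
function playEarcon(kind: "safety" | "explore"): void {
  const ctx = new (globalThis as any).AudioContext();
  let t = ctx.currentTime;
  for (const tone of earconPattern(kind)) {
    const osc = ctx.createOscillator();
    osc.frequency.value = tone.freqHz;
    osc.connect(ctx.destination);
    osc.start(t);
    t += tone.durationMs / 1000;
    osc.stop(t);
  }
}
```

Because the sounds are synthesized on demand, there are no audio files to fetch, which is what keeps the feedback latency low.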
The app is built from the ground up for "Eyes-Free" usage:
- Haptic Feedback: The Vibration API (`navigator.vibrate`) provides tactile confirmation for every button press and scan completion.
- High Contrast UI:
  - Safety Button: Deep Red background.
  - Explore Button: Deep Blue background.
  - Overlay: Bright Yellow text on semi-transparent Black.
- Touch Targets: The bottom 25% of the screen consists entirely of the two main buttons, making them impossible to miss.
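The haptic side of this can be sketched as a small guarded wrapper around `navigator.vibrate`. The pattern values and event names below are hypothetical, and the guard matters because the API is unavailable on iOS Safari:

```typescript
// Hypothetical haptic patterns (milliseconds): vibrate, pause, vibrate, ...
const HAPTIC_PATTERNS: Record<string, number[]> = {
  buttonPress: [50],            // short tap acknowledgment
  scanComplete: [100, 50, 100], // double pulse when a description is ready
};

// Returns false where the Vibration API is unsupported (e.g., iOS Safari).
function vibrate(event: keyof typeof HAPTIC_PATTERNS): boolean {
  const nav = (globalThis as any).navigator;
  if (!nav || typeof nav.vibrate !== "function") return false;
  return nav.vibrate(HAPTIC_PATTERNS[event]);
}
```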
| Criteria | Implementation |
|---|---|
| Innovation | Splits visual assistance into "Safety" vs "Aesthetics" using a Hybrid Thinking strategy within a single model. |
| Use of Tech | Leverages Gemini 3 Pro's adaptable parameters. We demonstrate how one model can be tuned for both sub-second alerts and deep poetic reasoning. |
| User Experience | Fully multimodal (Video + Audio + Haptics + Voice). Zero visual reading required. |
| Performance | Zero asset loading (no images/sounds to download). Lightweight React build optimized for mobile networks. |
- Continuous Stream: Use Gemini 3's high throughput to analyze video frames continuously (e.g., every 500ms) without manual triggers.
- Spatial Audio: Use Web Audio API to indicate where an object is in the frame (left/right ear panning).
- Offline Mode: Cache identifying features of known objects locally using Gemini Nano.
Prerequisites: Node.js

- Install dependencies: `npm install`
- Set `GEMINI_API_KEY` in `.env.local` to your Gemini API key.
- Run the app: `npm run dev`
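For reference, `.env.local` would contain a single line of this shape (the value shown is a placeholder, not a real key):

```shell
# .env.local (placeholder value — substitute your own Gemini API key)
GEMINI_API_KEY=your_api_key_here
```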
