sp19-bel/echosight

👁️ EchoSight: Project Report

A multimodal visual assistant for the visually impaired, powered by the adaptive reasoning of Gemini 3 Pro.

1. Executive Summary

EchoSight is a mobile-first web application designed to act as a digital set of eyes for blind and low-vision users. By leveraging the advanced multimodal reasoning of Google Gemini 3 Pro, the application provides context-aware audio descriptions of the user's surroundings.

Unlike generic image-to-text tools, EchoSight features a novel "Hybrid Thinking Architecture." It dynamically adjusts the AI's reasoning depth based on the user's immediate need: a high-speed Safety Mode for immediate obstacle detection and a deep-reasoning Explore Mode for environmental appreciation. It utilizes native browser APIs for haptics, synthesized audio cues (earcons), and text-to-speech to ensure a completely accessible, eyes-free experience.

2. Problem Statement

Visually impaired individuals often face two distinct challenges that require different cognitive approaches:

  • Safety (Speed Critical): "Is there a hurdle, wet floor sign, or staircase in front of me right now?"
  • Context (Depth Critical): "What does this sunset look like? What is the vibe of this room?"

Existing tools often treat these two problems identically, producing descriptions that are too slow and verbose for safe navigation, yet too clinical for genuine appreciation.

3. Solution & Core Features

🛡️ Safety Mode (Priority: Latency via Optimization)

  • Function: Instantly identifies trip hazards, obstacles, or urgent text.
  • AI Model: gemini-3-pro-preview (Optimized Configuration).
  • Engineering Strategy: We use a low-thinking configuration (includeThoughts: false) to suppress the model's internal monologue in the response. This demonstrates that a heavy reasoning model can be tuned for sub-second, safety-critical tasks without switching models.
  • Example Output: "Caution: Open manhole cover and traffic cones ahead."

🌍 Explore Mode (Priority: Deep Reasoning)

  • Function: Provides a poetic and atmospheric description of the scene.
  • AI Model: gemini-3-pro-preview (Standard Configuration).
  • Engineering Strategy: We unlock the full Chain-of-Thought capabilities (includeThoughts: true), allowing the model to "reason" about the mood, lighting, and cultural context of the scene before generating a description.
  • Example Output: "Warm golden sunlight hits the brick wall, casting long shadows that suggest a quiet late afternoon."

🎙️ Voice Query (Multimodal Agent)

  • Function: Allows specific user questions about the visual field.
  • Interaction: User speaks "Where are my keys?"; App captures image + audio transcript.
  • Response: Gemini reasons spatially to locate specific objects within the frame.
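One plausible shape for the combined image + question request is sketched below. The part layout follows the @google/genai inline-data convention for multimodal content; the helper name and types are ours, not the project's actual code.

```typescript
// Hypothetical builder for a Voice Query request: one captured JPEG
// frame plus the transcribed spoken question, sent as a single
// multimodal user message.
type Part = {
  inlineData?: { mimeType: string; data: string };
  text?: string;
};

function buildVoiceQuery(jpegBase64: string, transcript: string) {
  const parts: Part[] = [
    { inlineData: { mimeType: "image/jpeg", data: jpegBase64 } },
    { text: transcript },
  ];
  return {
    model: "gemini-3-pro-preview",
    contents: [{ role: "user", parts }],
  };
}
```

The resulting object would then be passed to the SDK's generateContent call, letting the model reason over the frame and the question together.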

4. Technical Architecture

Tech Stack

  • Framework: React 19 (ES Modules) & TypeScript.
  • AI Engine: Google GenAI SDK (@google/genai).
  • Styling: Tailwind CSS.

Model Strategy: Hybrid Thinking Architecture

Instead of switching between different models (e.g., Flash vs Pro), we dynamically adjust the thinkingConfig parameter of Gemini 3 Pro at runtime:

  • Safety Requests: includeThoughts: false (Minimizes token usage for speed).
  • Explore Requests: includeThoughts: true (Maximizes context understanding).
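The runtime switch above can be sketched as a small helper. The thinkingConfig field name follows the @google/genai SDK; the buildRequest wrapper is an illustrative assumption, not the project's actual code.

```typescript
// Sketch of the Hybrid Thinking switch: one model, two configurations.
type AssistMode = "safety" | "explore";

interface ThinkingConfig {
  includeThoughts: boolean;
}

function thinkingConfigFor(mode: AssistMode): ThinkingConfig {
  // Safety: suppress the internal monologue for minimum latency.
  // Explore: surface the full chain of thought for richer descriptions.
  return { includeThoughts: mode === "explore" };
}

// Hypothetical request builder showing where the config is attached.
function buildRequest(mode: AssistMode, prompt: string) {
  return {
    model: "gemini-3-pro-preview",
    contents: prompt,
    config: { thinkingConfig: thinkingConfigFor(mode) },
  };
}
```

A call would then look like ai.models.generateContent(buildRequest("safety", prompt)), with only the config changing between modes.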

Key Components

  • CameraView.tsx: Manages the navigator.mediaDevices.getUserMedia stream. Captures high-quality JPEG frames via an off-screen HTML5 Canvas. Optimized for mobile back-facing cameras (facingMode: 'environment').
  • services/gemini.ts: A stateless service that implements a "Context-Aware" prompt system. It forces the Pro model to be concise (<10 words) for safety, preventing the verbosity common in large reasoning models.
  • services/sound.ts (The Audio Engine): Uses the Web Audio API (AudioContext) to generate sounds programmatically (Oscillators). This removes the need to load external MP3 assets, reducing latency.
  • services/speech.ts: Wraps window.speechSynthesis for Text-to-Speech (TTS) and window.SpeechRecognition for Voice Commands. Logic included to prioritize natural-sounding "Google" English voices.
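The earcon approach in services/sound.ts might look roughly like the sketch below. The frequencies and durations are invented for illustration, and playEarcon only runs in a browser with an AudioContext (typed as any here so the sketch also compiles outside the DOM lib).

```typescript
// Illustrative earcon table: pure data, so the mapping is testable
// without a browser. All values are assumptions.
type Earcon = "scanStart" | "scanDone" | "alert";

const EARCONS: Record<Earcon, { freqHz: number; durationMs: number }> = {
  scanStart: { freqHz: 660, durationMs: 120 },
  scanDone: { freqHz: 880, durationMs: 180 },
  alert: { freqHz: 440, durationMs: 300 },
};

// Browser-only: synthesize the tone with an OscillatorNode instead of
// loading an external MP3 asset.
function playEarcon(name: Earcon, ctx: any): void {
  const { freqHz, durationMs } = EARCONS[name];
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();
  osc.frequency.value = freqHz;
  osc.connect(gain).connect(ctx.destination);
  // A short exponential fade-out avoids an audible click at the end.
  gain.gain.setValueAtTime(0.3, ctx.currentTime);
  gain.gain.exponentialRampToValueAtTime(0.001, ctx.currentTime + durationMs / 1000);
  osc.start();
  osc.stop(ctx.currentTime + durationMs / 1000);
}
```

Because the tones are generated programmatically, there is nothing to download before the first cue can play.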

5. Accessibility & UX Design

The app is built from the ground up for "Eyes-Free" usage:

  • Haptic Feedback: The Vibration API (navigator.vibrate) provides tactile confirmation for every button press and scan completion.
  • High Contrast UI:
    • Safety Button: Deep Red background.
    • Explore Button: Deep Blue background.
    • Overlay: Bright Yellow text on semi-transparent Black.
  • Touch Targets: The bottom 25% of the screen consists entirely of the two main buttons, making them impossible to miss.
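A minimal sketch of how distinct haptic cues could be encoded: navigator.vibrate takes an array of millisecond on/off durations, and each event type gets its own pattern. The specific patterns are our assumptions, not the project's actual values.

```typescript
// Illustrative haptic vocabulary (ms on, ms off, ms on, ...).
type HapticEvent = "buttonPress" | "scanComplete" | "hazardAlert";

const HAPTIC_PATTERNS: Record<HapticEvent, number[]> = {
  buttonPress: [30],           // single short tap
  scanComplete: [50, 50, 50],  // double pulse
  hazardAlert: [200, 100, 200] // long-short-long warning
};

function vibrate(event: HapticEvent): boolean {
  // The Vibration API is unavailable in some environments
  // (e.g. iOS Safari), so fail quietly rather than throwing.
  const nav = (globalThis as any).navigator;
  if (!nav || typeof nav.vibrate !== "function") return false;
  return nav.vibrate(HAPTIC_PATTERNS[event]);
}
```

Keeping the patterns short and distinct lets users tell a confirmation from a warning without looking at the screen.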

6. How it Meets "Gemini 3" Hackathon Criteria

| Criteria | Implementation |
| --- | --- |
| Innovation | Splits visual assistance into "Safety" vs "Aesthetics" using a Hybrid Thinking strategy within a single model. |
| Use of Tech | Leverages Gemini 3 Pro's adaptable parameters. We demonstrate how one model can be tuned for both sub-second alerts and deep poetic reasoning. |
| User Experience | Fully multimodal (Video + Audio + Haptics + Voice). Zero visual reading required. |
| Performance | Zero asset loading (no images/sounds to download). Lightweight React build optimized for mobile networks. |

7. Future Roadmap

  • Continuous Stream: Use Gemini 3's high throughput to analyze video frames continuously (e.g., every 500ms) without manual triggers.
  • Spatial Audio: Use Web Audio API to indicate where an object is in the frame (left/right ear panning).
  • Offline Mode: Cache identifying features of known objects locally using Gemini Nano.
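The left/right panning idea reduces to a pure mapping from an object's horizontal position in the frame to a Web Audio StereoPannerNode pan value. The linear mapping and clamping below are our assumptions for how such a cue could work.

```typescript
// Map an object's horizontal position in the frame (0 = far left,
// 1 = far right) to a Web Audio pan value (-1 = left ear, +1 = right).
function frameXToPan(x: number): number {
  const clamped = Math.min(1, Math.max(0, x));
  return clamped * 2 - 1;
}

// Browser-only: route a short cue through a StereoPannerNode positioned
// at the object's location. ctx is an AudioContext (typed as any so
// this sketch also compiles outside the DOM lib).
function playPannedCue(ctx: any, x: number): void {
  const osc = ctx.createOscillator();
  const panner = ctx.createStereoPanner();
  panner.pan.value = frameXToPan(x);
  osc.connect(panner).connect(ctx.destination);
  osc.start();
  osc.stop(ctx.currentTime + 0.15);
}
```

An object detected at the left edge of the frame would then sound entirely in the left ear, and one at the center would sound equally in both.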

Run Locally

Prerequisites: Node.js

  1. Install dependencies: npm install
  2. Set the GEMINI_API_KEY in .env.local to your Gemini API key
  3. Run the app: npm run dev
