sp19-bel/echosight

👁️ EchoSight: Project Report

A multimodal visual assistant for the visually impaired, powered by the adaptive reasoning of Gemini 3 Pro.

1. Executive Summary

EchoSight is a mobile-first web application designed to act as a digital set of eyes for blind and low-vision users. By leveraging the advanced multimodal reasoning of Google Gemini 3 Pro, the application provides context-aware audio descriptions of the user's surroundings.

Unlike generic image-to-text tools, EchoSight features a novel "Hybrid Thinking Architecture." It dynamically adjusts the AI's reasoning depth based on the user's immediate need: a high-speed Safety Mode for immediate obstacle detection and a deep-reasoning Explore Mode for environmental appreciation. It utilizes native browser APIs for haptics, synthesized audio cues (earcons), and text-to-speech to ensure a completely accessible, eyes-free experience.

2. Problem Statement

Visually impaired individuals often face two distinct challenges that require different cognitive approaches:

  • Safety (Speed Critical): "Is there a hurdle, wet floor sign, or staircase in front of me right now?"
  • Context (Depth Critical): "What does this sunset look like? What is the vibe of this room?"

Existing tools often treat these two problems identically, producing descriptions that are too slow and verbose for safe navigation, yet too clinical for genuine appreciation.

3. Solution & Core Features

🛡️ Safety Mode (Priority: Latency via Optimization)

  • Function: Instantly identifies trip hazards, obstacles, or urgent text.
  • AI Model: gemini-3-pro-preview (Optimized Configuration).
  • Engineering Strategy: We use a low-thinking configuration (includeThoughts: false) to suppress the model's internal monologue in the response. This demonstrates that a heavy reasoning model can be tuned for sub-second, safety-critical tasks without switching models.
  • Example Output: "Caution: Open manhole cover and traffic cones ahead."

🌍 Explore Mode (Priority: Deep Reasoning)

  • Function: Provides a poetic and atmospheric description of the scene.
  • AI Model: gemini-3-pro-preview (Standard Configuration).
  • Engineering Strategy: We unlock the full Chain-of-Thought capabilities (includeThoughts: true), allowing the model to "reason" about the mood, lighting, and cultural context of the scene before generating a description.
  • Example Output: "Warm golden sunlight hits the brick wall, casting long shadows that suggest a quiet late afternoon."

🎙️ Voice Query (Multimodal Agent)

  • Function: Allows specific user questions about the visual field.
  • Interaction: User speaks "Where are my keys?"; App captures image + audio transcript.
  • Response: Gemini reasons spatially to locate specific objects within the frame.
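One plausible shape for the combined image + question request is sketched below. The part layout follows the @google/genai inline-data convention for multimodal content; the helper name and types are ours, not the project's actual code.

```typescript
// Hypothetical builder for a Voice Query request: one captured JPEG
// frame plus the transcribed spoken question, sent as a single
// multimodal user message.
type Part = {
  inlineData?: { mimeType: string; data: string };
  text?: string;
};

function buildVoiceQuery(jpegBase64: string, transcript: string) {
  const parts: Part[] = [
    { inlineData: { mimeType: "image/jpeg", data: jpegBase64 } },
    { text: transcript },
  ];
  return {
    model: "gemini-3-pro-preview",
    contents: [{ role: "user", parts }],
  };
}
```

The resulting object would then be passed to the SDK's generateContent call, letting the model reason over the frame and the question together.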

4. Technical Architecture

Tech Stack

  • Framework: React 19 (ES Modules) & TypeScript.
  • AI Engine: Google GenAI SDK (@google/genai).
  • Styling: Tailwind CSS.

Model Strategy: Hybrid Thinking Architecture

Instead of switching between different models (e.g., Flash vs Pro), we dynamically adjust the thinkingConfig parameter of Gemini 3 Pro at runtime:

  • Safety Requests: includeThoughts: false (Minimizes token usage for speed).
  • Explore Requests: includeThoughts: true (Maximizes context understanding).
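The runtime switch above can be sketched as a small helper. The thinkingConfig field name follows the @google/genai SDK; the buildRequest wrapper is an illustrative assumption, not the project's actual code.

```typescript
// Sketch of the Hybrid Thinking switch: one model, two configurations.
type AssistMode = "safety" | "explore";

interface ThinkingConfig {
  includeThoughts: boolean;
}

function thinkingConfigFor(mode: AssistMode): ThinkingConfig {
  // Safety: suppress the internal monologue for minimum latency.
  // Explore: surface the full chain of thought for richer descriptions.
  return { includeThoughts: mode === "explore" };
}

// Hypothetical request builder showing where the config is attached.
function buildRequest(mode: AssistMode, prompt: string) {
  return {
    model: "gemini-3-pro-preview",
    contents: prompt,
    config: { thinkingConfig: thinkingConfigFor(mode) },
  };
}
```

A call would then look like ai.models.generateContent(buildRequest("safety", prompt)), with only the config changing between modes.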

Key Components

  • CameraView.tsx: Manages the navigator.mediaDevices.getUserMedia stream. Captures high-quality JPEG frames via an off-screen HTML5 Canvas. Optimized for mobile back-facing cameras (facingMode: 'environment').
  • services/gemini.ts: A stateless service that implements a "Context-Aware" prompt system. It forces the Pro model to be concise (<10 words) for safety, preventing the verbosity common in large reasoning models.
  • services/sound.ts (The Audio Engine): Uses the Web Audio API (AudioContext) to generate sounds programmatically (Oscillators). This removes the need to load external MP3 assets, reducing latency.
  • services/speech.ts: Wraps window.speechSynthesis for Text-to-Speech (TTS) and window.SpeechRecognition for Voice Commands. Logic included to prioritize natural-sounding "Google" English voices.
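The earcon approach in services/sound.ts might look roughly like the sketch below. The frequencies and durations are invented for illustration, and playEarcon only runs in a browser with an AudioContext (typed as any here so the sketch also compiles outside the DOM lib).

```typescript
// Illustrative earcon table: pure data, so the mapping is testable
// without a browser. All values are assumptions.
type Earcon = "scanStart" | "scanDone" | "alert";

const EARCONS: Record<Earcon, { freqHz: number; durationMs: number }> = {
  scanStart: { freqHz: 660, durationMs: 120 },
  scanDone: { freqHz: 880, durationMs: 180 },
  alert: { freqHz: 440, durationMs: 300 },
};

// Browser-only: synthesize the tone with an OscillatorNode instead of
// loading an external MP3 asset.
function playEarcon(name: Earcon, ctx: any): void {
  const { freqHz, durationMs } = EARCONS[name];
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();
  osc.frequency.value = freqHz;
  osc.connect(gain).connect(ctx.destination);
  // A short exponential fade-out avoids an audible click at the end.
  gain.gain.setValueAtTime(0.3, ctx.currentTime);
  gain.gain.exponentialRampToValueAtTime(0.001, ctx.currentTime + durationMs / 1000);
  osc.start();
  osc.stop(ctx.currentTime + durationMs / 1000);
}
```

Because the tones are generated programmatically, there is nothing to download before the first cue can play.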

5. Accessibility & UX Design

The app is built from the ground up for "Eyes-Free" usage:

  • Haptic Feedback: The Vibration API (navigator.vibrate) provides tactile confirmation for every button press and scan completion.
  • High Contrast UI:
    • Safety Button: Deep Red background.
    • Explore Button: Deep Blue background.
    • Overlay: Bright Yellow text on semi-transparent Black.
  • Touch Targets: The bottom 25% of the screen consists entirely of the two main buttons, making them impossible to miss.
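A minimal sketch of how distinct haptic cues could be encoded: navigator.vibrate takes an array of millisecond on/off durations, and each event type gets its own pattern. The specific patterns are our assumptions, not the project's actual values.

```typescript
// Illustrative haptic vocabulary (ms on, ms off, ms on, ...).
type HapticEvent = "buttonPress" | "scanComplete" | "hazardAlert";

const HAPTIC_PATTERNS: Record<HapticEvent, number[]> = {
  buttonPress: [30],           // single short tap
  scanComplete: [50, 50, 50],  // double pulse
  hazardAlert: [200, 100, 200] // long-short-long warning
};

function vibrate(event: HapticEvent): boolean {
  // The Vibration API is unavailable in some environments
  // (e.g. iOS Safari), so fail quietly rather than throwing.
  const nav = (globalThis as any).navigator;
  if (!nav || typeof nav.vibrate !== "function") return false;
  return nav.vibrate(HAPTIC_PATTERNS[event]);
}
```

Keeping the patterns short and distinct lets users tell a confirmation from a warning without looking at the screen.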

6. How it Meets "Gemini 3" Hackathon Criteria

| Criteria | Implementation |
| --- | --- |
| Innovation | Splits visual assistance into "Safety" vs "Aesthetics" using a Hybrid Thinking strategy within a single model. |
| Use of Tech | Leverages Gemini 3 Pro's adaptable parameters. We demonstrate how one model can be tuned for both sub-second alerts and deep poetic reasoning. |
| User Experience | Fully multimodal (Video + Audio + Haptics + Voice). Zero visual reading required. |
| Performance | Zero asset loading (no images/sounds to download). Lightweight React build optimized for mobile networks. |

7. Future Roadmap

  • Continuous Stream: Use Gemini 3's high throughput to analyze video frames continuously (e.g., every 500ms) without manual triggers.
  • Spatial Audio: Use Web Audio API to indicate where an object is in the frame (left/right ear panning).
  • Offline Mode: Cache identifying features of known objects locally using Gemini Nano.
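The left/right panning idea reduces to a pure mapping from an object's horizontal position in the frame to a Web Audio StereoPannerNode pan value. The linear mapping and clamping below are our assumptions for how such a cue could work.

```typescript
// Map an object's horizontal position in the frame (0 = far left,
// 1 = far right) to a Web Audio pan value (-1 = left ear, +1 = right).
function frameXToPan(x: number): number {
  const clamped = Math.min(1, Math.max(0, x));
  return clamped * 2 - 1;
}

// Browser-only: route a short cue through a StereoPannerNode positioned
// at the object's location. ctx is an AudioContext (typed as any so
// this sketch also compiles outside the DOM lib).
function playPannedCue(ctx: any, x: number): void {
  const osc = ctx.createOscillator();
  const panner = ctx.createStereoPanner();
  panner.pan.value = frameXToPan(x);
  osc.connect(panner).connect(ctx.destination);
  osc.start();
  osc.stop(ctx.currentTime + 0.15);
}
```

An object detected at the left edge of the frame would then sound entirely in the left ear, and one at the center would sound equally in both.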

Run Locally

Prerequisites: Node.js

  1. Install dependencies: npm install
  2. Set the GEMINI_API_KEY in .env.local to your Gemini API key
  3. Run the app: npm run dev
