A Meta Ray-Ban first-aid guidance system. Point the glasses (or your phone/laptop camera) at a patient - the backend runs two real-time CV models and an AI voice agent that talks you through what it sees:
- Facial droop: detects stroke-related asymmetry using MediaPipe landmarks + EfficientNet-B0
- Heart rate: contactless remote photoplethysmography (rPPG) from skin colour (YOLOR head detection + CHROM/POS ensemble, no contact needed)
- Voice guidance: Gemini agent backed by JRCALC 2022 clinical guidelines GraphRAG
Supported platforms: browser webcam, iOS (iPhone), Android, Meta Ray-Ban glasses.
3rd Place - UnicornMafia "To The Americas" Hackathon 2026
Sponsored by Pydantic AI · Render · MUBIT · Lovable · Cognition · Expedite · The Residency
Point the glasses at someone's face. FirstSight runs a MediaPipe landmark extractor and an EfficientNet-B0 model on every frame, scoring mouth, eye, and brow asymmetry in real time. Asymmetric faces (a key FAST indicator) are flagged immediately, with severity graded none / mild / severe. AUROC 0.985 on held-out test data.
The remote photoplethysmography (rPPG) pipeline picks up the ~1% colour change skin makes with each heartbeat. No sensor. No contact. Works across skin tones - a CHROM/POS ensemble automatically selects the algorithm with the stronger signal, so darker Fitzpatrick types aren't misread. Alerts fire for bradycardia, tachycardia, and critical ranges (<40 or >180 BPM) after a sustained window to suppress false alarms.
A Gemini voice agent backed by JRCALC 2022 clinical guidelines listens, watches, and speaks - walking you step by step through stroke assessment, CPR, choking response, and more. It correlates what the CV models see with what you say to surface the right protocol at the right moment.
Browser webcam, iPhone, Android phone, or Meta Ray-Ban glasses via the DAT SDK. The backend accepts JPEG frames over WebSocket from any source - swap the input without changing a line of server code.
A React debug dashboard shows the live annotated frame, processor signal cards, Gemini transcript, and full event trace - so a judge, operator, or paramedic supervisor can see exactly what the system detected and why.
Live on Meta Ray-Ban glasses - Stroke FAST Check playbook running, facial droop detected (likelihood 0.98), Voice Agent active.
Three frames from a live webcam session - same person, same room, models running in real time.
The system correctly reads 0.00 on a symmetric face and jumps to 0.86 on an exaggerated droop - with heart rate running contactlessly in parallel the whole time.
If you are joining this repo as a teammate, start here:
- ARCHITECTURE.md for the backend data-flow design
- backend/ for the Python service
- viewer/ for the React debug dashboard
- mobile/CameraAccess/ for the iOS prototype
- mobile/CameraAccessAndroid/ for the Android prototype
The fastest way to see both models running - no mobile hardware required.
Prerequisites: Python 3.11–3.13, Node 18+, a Gemini API key.
git clone https://github.com/safishamsi/FirstSight.git
cd FirstSight
cp backend/.env.example backend/.env   # then set GEMINI_API_KEY

Model files required for droop detection. The trained weights are not in the repo. Place these files before starting the backend:

- model/droop_model.onnx – EfficientNet-B0 ONNX weights
- model/face_landmarker.task – MediaPipe face landmarker (download from MediaPipe)
- checkpoints/threshold.json – calibrated detection threshold

Heart rate works out of the box - the BlazeFace model downloads automatically on first run.
cd backend
make setup # creates .venv, installs deps
make dev     # starts uvicorn on :8000

Health check:
curl http://127.0.0.1:8000/health

cd viewer
npm install
npm run dev    # starts on http://localhost:5174

- Open http://localhost:5174
- Go to the STREAMS tab
- Click Start Camera: your browser webcam streams to the backend at 10 fps
- A 10-second warmup bar appears while the processors initialise
- After warmup, two signal cards update in real time:
- Face droop: probability and severity, updated each frame
- Heart rate: BPM reading once the 150-frame rPPG buffer fills (~15 s)
See the mobile quick-start sections below to stream from Meta Ray-Ban glasses or a phone camera instead of the browser webcam.
Useful commands:
make backend-test
make backend-restart
make backend-stop

Start by copying the root inventory file:
cp .env.example .env

That root .env is the teammate-facing checklist for all keys used in this repo. The apps do not all read it directly, so use it as the place you collect values, then copy them into the runtime-specific files below.
| Surface | File | What goes there |
|---|---|---|
| Root inventory | `.env` | Shared local checklist for Gemini, OpenAI, Stream, and Meta/DAT values |
| Backend | `backend/.env` | Vision Agents / FastAPI runtime config |
| iOS sample app | `mobile/CameraAccess/CameraAccess/Secrets.swift` | `geminiAPIKey`, optional WebRTC signaling URL |
| Android sample app | `mobile/CameraAccessAndroid/app/src/main/java/com/meta/wearable/dat/externalsampleapps/cameraaccess/Secrets.kt` | `geminiAPIKey`, optional WebRTC signaling URL |
| Android DAT SDK | `mobile/CameraAccessAndroid/local.properties` | `github_token`, `mwdat_application_id`, `mwdat_client_token` |
The backend already has a runnable template:
cp backend/.env.example backend/.env

Or use the Make target:
make backend-setup

Fill in these keys for the backend as needed:
- `GEMINI_API_KEY` for Gemini realtime
- `OPENAI_API_KEY` for OpenAI realtime
- `ELEVENLABS_API_KEY` if you want backend-generated PCM speech instead of Android local TTS fallback
- `STREAM_API_KEY` and `STREAM_API_SECRET` for the Vision Agents transport layer
The current Vision Agent backend mode defaults to:
SPEECH_PIPELINE=fast_whisper_pipeline
FAST_WHISPER_MODEL_SIZE=base
GEMINI_LLM_MODEL=gemini-3-flash-preview
BACKEND_TTS_ENABLED=true
If ELEVENLABS_API_KEY is absent, the backend still runs Fast-Whisper + Gemini and the Android app falls back to local TTS playback.
These values can also be overridden per session from the Android app Settings screen.
The Android sample needs two different things:
- A GitHub Packages token so Gradle can download the DAT Android SDK.
- Meta Wearables app registration values when you are not relying on Developer Mode.
Create the Android local properties file:
cp mobile/CameraAccessAndroid/local.properties.example mobile/CameraAccessAndroid/local.properties

Then fill in:
- `github_token` - create a GitHub Personal Access Token with `read:packages` scope
  - GitHub path: Settings -> Developer settings -> Personal access tokens
- `mwdat_application_id` - use `0` in Developer Mode; for production, get the real value from the Wearables Developer Center
- `mwdat_client_token` - empty in simple Developer Mode workflows if not required by your current setup; for production, get the real value from the Wearables Developer Center
Important:
- `mobile/CameraAccessAndroid/settings.gradle.kts` reads `github_token`
- `mobile/CameraAccessAndroid/app/build.gradle.kts` reads `mwdat_application_id` and `mwdat_client_token`
- `mobile/CameraAccessAndroid/app/src/main/AndroidManifest.xml` injects those into the DAT manifest metadata
Create the sample app secrets files:
cp mobile/CameraAccess/CameraAccess/Secrets.swift.example mobile/CameraAccess/CameraAccess/Secrets.swift
cp mobile/CameraAccessAndroid/app/src/main/java/com/meta/wearable/dat/externalsampleapps/cameraaccess/Secrets.kt.example mobile/CameraAccessAndroid/app/src/main/java/com/meta/wearable/dat/externalsampleapps/cameraaccess/Secrets.kt

At minimum, set:
geminiAPIKey
Optional:
- WebRTC signaling URL
| Path | Purpose |
|---|---|
| `mobile/CameraAccess/` | Current iOS Meta Ray-Ban / iPhone prototype |
| `mobile/CameraAccessAndroid/` | Current Android Meta Ray-Ban / phone prototype |
| `mobile/CameraAccess/server/` | Current WebRTC signaling server for the existing browser viewer |
| `backend/` | FastAPI + Vision Agents backend - face droop, heart rate, GraphRAG |
| `viewer/` | React debug dashboard for backend session state, transcripts, and processor signals |
| `ARCHITECTURE.md` | Backend system architecture and data-flow design |
The canonical Android app path is mobile/CameraAccessAndroid/.
The long-term product direction for this repo is a first-aid guidance system:
- the glasses wearer streams live video/audio from their point of view
- the backend runs custom vision models and other integrations through processors and tools
- private medical / first-aid knowledge can be retrieved through GraphRAG
- the wearer receives voice guidance
- judges, developers, and operators can inspect the augmented video and agent traces in a debug dashboard
The backend-first system is built and running. The mobile apps can stream camera and audio to the backend for real-time CV and voice guidance, or connect directly to Gemini Live as a standalone fallback.
When running without the backend, the iOS and Android apps connect directly to Gemini Live for voice and vision:
- "What am I looking at?" -- Gemini sees through your glasses camera and describes the scene
- "What do I do next?" -- voice guidance from Gemini's built-in knowledge
The glasses camera streams at ~1fps to Gemini for visual context, while audio flows bidirectionally in real-time.
flowchart TD
A("πΆοΈ Meta Ray-Ban Glasses\nor phone camera") -->|video frames + mic audio| B("π± iOS / Android App")
B -->|"JPEG frames @ 10fps\nWebSocket"| C("β‘ FastAPI Backend\n:8000")
C --> D["ποΈ Face Droop\nMediaPipe + EfficientNet-B0"]
C --> E["β€οΈ Heart Rate\nrPPG CHROM/POS"]
D --> F["π€ Gemini Agent\n+ JRCALC 2022 GraphRAG"]
E --> F
F -->|"PCM audio 24kHz"| G("π Glasses speaker")
B -.->|"standalone fallback\n(no backend)"| H("β¨ Gemini Live API")
H -.->|"PCM audio 24kHz"| G
Key pieces:
- FastAPI backend -- runs CV processors and a Gemini voice agent with JRCALC clinical GraphRAG
- Face droop -- MediaPipe landmarks + EfficientNet-B0, asymmetry-gated
- Heart rate -- contactless rPPG via CHROM/POS ensemble
- Phone / glasses mode -- test with your phone camera instead of Meta Ray-Ban glasses
- Standalone fallback -- iOS/Android can also connect directly to Gemini Live when no backend is available
For architecture details, see ARCHITECTURE.md.
flowchart TD
IN("🕶️ Meta Glasses / 📱 Phone / 💻 Browser webcam")
IN -->|"JPEG frames @ 10fps - WebSocket"| VF
subgraph backend["⚡ FastAPI Backend"]
VF["VideoForwarder\nfan-out to all processors"]
subgraph droop["👁️ Face Droop Processor"]
D1["MediaPipe\n468 landmarks"] --> D2["Asymmetry gate\nmouth · eye · brow"]
D2 -->|"score > 0.030"| D3["EfficientNet-B0\nONNX - droop probability"]
end
subgraph hr["❤️ Heart Rate Processor"]
H1["MediaPipe\nface detect"] --> H2["Forehead ROI crop\n150-frame buffer"]
H2 --> H3["CHROM / POS ensemble\nBVP extraction"]
H3 --> H4["FFT peak → BPM\n+ alert classifier"]
end
VF --> D1
VF --> H1
D3 --> SIG["processor_signals"]
H4 --> SIG
SIG --> GEM["🤖 Gemini voice agent\n+ JRCALC 2022 GraphRAG"]
end
SIG -->|"JSON polling - REST"| VIEW["🖥️ React Viewer\nlocalhost:5174"]
GEM -->|"voice guidance"| SPK["🔊 Glasses speaker"]
- Each frame is passed through MediaPipe face landmarker to extract 468 landmarks.
- Mouth/eye/brow asymmetry is computed from left–right landmark differences.
- Asymmetry gate: if the face is symmetric (combined score < 0.030) the CNN is skipped and the frame is logged as not drooping. This eliminates false positives on resting faces.
- If asymmetry exceeds the gate, an EfficientNet-B0 ONNX model runs on the forehead crop. The final probability is the CNN output scaled by the asymmetry weight.
- Severity bands: none / mild / severe, based on distance from the calibrated threshold.
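A minimal sketch of the gate-then-classify flow described above. The 0.030 gate matches the text; the helper names, severity margins, and scaling factor are illustrative, not the exact logic in backend/app/processors/droop_inference.py:

```python
import numpy as np

ASYMMETRY_GATE = 0.030                    # value loaded from checkpoints/threshold.json in the real processor
SEVERE_MARGIN, MILD_MARGIN = 0.30, 0.10   # illustrative distances from the calibrated threshold

def mirror_asymmetry(left_pts: np.ndarray, right_pts: np.ndarray, mid_x: float) -> float:
    """Mean distance between left landmarks and the right landmarks mirrored across the face midline."""
    mirrored = right_pts.copy()
    mirrored[:, 0] = 2.0 * mid_x - mirrored[:, 0]
    return float(np.linalg.norm(left_pts - mirrored, axis=1).mean())

def classify(asymmetry: float, cnn_prob: float, threshold: float = 0.5) -> dict:
    """Skip the CNN on symmetric faces, otherwise scale its output and band the severity."""
    if asymmetry < ASYMMETRY_GATE:
        return {"probability": 0.0, "severity": "none", "cnn_ran": False}
    prob = cnn_prob * min(1.0, asymmetry / (2 * ASYMMETRY_GATE))  # asymmetry-weighted probability
    margin = prob - threshold
    severity = "severe" if margin >= SEVERE_MARGIN else "mild" if margin >= MILD_MARGIN else "none"
    return {"probability": prob, "severity": severity, "cnn_ran": True}
```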
- YOLOR detects heads/faces with a MediaPipe BlazeFace fallback for non-frontal angles (top-down crib cameras, people lying down). DeepSort tracks identities across frames so each person gets an independent BPM reading.
- A forehead ROI is cropped (top 40% of face height) - the flattest skin region with the strongest pulse signal and fewest expression artefacts.
- ROIs accumulate in a 150-frame rolling buffer (~15 s at 10 fps).
- When the buffer is full, CHROM and POS colour-space BVP estimators run and their spectra are averaged. The dominant peak in the physiological band (60–120 BPM) gives the heart rate.
- Frame-diff motion rejection discards blurry or high-motion frames before they enter the buffer.
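A simplified sketch of the ROI-and-buffer stage: mean RGB of the top 40% of the face accumulates in a 150-frame rolling buffer, with a frame-diff motion check in front. The class name and motion threshold are illustrative:

```python
from collections import deque
import numpy as np

BUFFER_LEN = 150          # ~15 s at 10 fps
MOTION_THRESHOLD = 8.0    # illustrative mean-absolute-difference cutoff

class ForeheadBuffer:
    """Accumulates mean RGB of the forehead ROI, dropping high-motion frames."""

    def __init__(self) -> None:
        self.rgb = deque(maxlen=BUFFER_LEN)
        self._prev_roi = None

    def add_frame(self, frame: np.ndarray, face_box: tuple[int, int, int, int]) -> bool:
        x, y, w, h = face_box
        roi = frame[y : y + int(0.4 * h), x : x + w]      # top 40% of the face = forehead
        if self._prev_roi is not None and self._prev_roi.shape == roi.shape:
            diff = np.abs(roi.astype(float) - self._prev_roi.astype(float)).mean()
            if diff > MOTION_THRESHOLD:                   # frame-diff motion rejection
                self._prev_roi = roi
                return False
        self._prev_roi = roi
        self.rgb.append(roi.reshape(-1, 3).mean(axis=0))  # spatially averaged R, G, B
        return True

    @property
    def full(self) -> bool:
        return len(self.rgb) == BUFFER_LEN
```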
Every heartbeat pumps a bolus of blood into the capillary bed just beneath the skin. Oxyhaemoglobin (HbO₂) and deoxyhaemoglobin (Hb) absorb light at different wavelengths - HbO₂ absorbs strongly in the blue–green band (~420–580 nm) and transmits red, while Hb absorbs more broadly. As blood volume in the capillaries rises and falls with each cardiac cycle, the fraction of incident light absorbed changes accordingly.
This is remote photoplethysmography (rPPG): the same physical principle as a pulse oximeter, but measured optically at a distance using ambient or screen light instead of an LED clipped to a finger.
The change is tiny - roughly 0.5–2% of the mean pixel intensity in the green channel, and even smaller in red and blue. The human visual system cannot perceive it, but a camera accumulating photons over millions of pixels can.
Let R(t), G(t), B(t) be the spatially-averaged pixel values over the forehead ROI at frame t. Each channel carries:
C(t) = C₀ · [ 1 + α_C · p(t) ] · l(t)

where C₀ is the DC skin reflectance, α_C is the haemoglobin absorption coefficient in that channel, p(t) is the normalised blood volume pulse, and l(t) is multiplicative illumination drift. The goal is to recover p(t) while cancelling l(t).
CHROM exploits the fact that illumination changes project along the skin-tone direction in RGB space. Normalise each channel by its mean to remove DC:
Rn = R/μR, Gn = G/μG, Bn = B/μB
Two chrominance signals are formed that are orthogonal to the illumination direction:
Xs = 3·Rn − 2·Gn
Ys = 1.5·Rn + Gn − 1.5·Bn
The BVP is extracted by rotating in the (Xs, Ys) plane to maximise the pulse-to-noise ratio:
BVP_CHROM = Xs − α·Ys,   α = std(Xs) / std(Ys)
The rotation angle α is computed per window so it adapts to changing lighting. CHROM works best when the skin-tone prior (fixed coefficients 3, 2, 1.5) holds - i.e. lighter Fitzpatrick types under broadband illumination.
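A compact NumPy sketch of the CHROM equations above, operating on a (T, 3) array of spatially averaged forehead RGB values (a sketch of the math, not the repo's exact implementation):

```python
import numpy as np

def chrom_bvp(rgb: np.ndarray) -> np.ndarray:
    """CHROM blood-volume-pulse estimate from a (T, 3) array of mean ROI RGB values."""
    norm = rgb / rgb.mean(axis=0)              # Rn, Gn, Bn: remove the DC component per channel
    xs = 3.0 * norm[:, 0] - 2.0 * norm[:, 1]
    ys = 1.5 * norm[:, 0] + norm[:, 1] - 1.5 * norm[:, 2]
    alpha = xs.std() / (ys.std() + 1e-9)       # per-window rotation toward the pulse direction
    return xs - alpha * ys
```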
POS makes no fixed assumption about skin tone. Instead it estimates the skin-colour vector directly from the data, then projects the signal onto the plane orthogonal to it where pulse lives.
Build the 3×T matrix C of normalised RGB traces. Compute the unit skin-colour vector u (the dominant PCA direction, or the temporal mean). Project:
P = (I − u·uᵀ) · C
The two rows of P are orthogonal colour channels free of illumination. The BVP is:
BVP_POS = P[0] + (std(P[0]) / std(P[1])) Β· P[1]
Because u is estimated from the actual pixels rather than a population prior, POS self-calibrates to darker skin tones, coloured lighting, and non-standard cameras where CHROM's fixed coefficients would misfire.
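A sketch of the projection described above. It uses the temporal mean as the skin-colour vector and an SVD null-space basis to obtain the two orthogonal channels; the repo's signal_processor.py may instead use the standard fixed-coefficient POS formulation:

```python
import numpy as np

def pos_bvp(rgb: np.ndarray) -> np.ndarray:
    """POS-style BVP: project normalised RGB onto the plane orthogonal to the skin-colour vector."""
    c = (rgb / rgb.mean(axis=0)).T                     # 3 x T matrix of normalised traces
    u = c.mean(axis=1)
    u = u / np.linalg.norm(u)                          # unit skin-colour vector (temporal mean here)
    _, _, vt = np.linalg.svd(u.reshape(1, 3))          # vt[1:] spans the plane orthogonal to u
    p = vt[1:] @ c                                     # 2 x T illumination-free colour channels
    return p[0] + (p[0].std() / (p[1].std() + 1e-9)) * p[1]
```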
Both estimators run on every window. For each, the power spectrum in the physiological band (0.67–3.33 Hz, i.e. 40–200 BPM) is computed via zero-padded FFT. The signal-to-noise ratio is:
SNR = peak_power / (band_power − peak_power)
The estimator with the higher SNR is selected for that window. On lighter skin under stable lighting, CHROM typically wins. On darker skin or under LED flicker, POS wins. Confidence reported to the API is SNR / (SNR + 1) scaled to [0, 1].
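A sketch of the per-window selection, reusing the `chrom_bvp` / `pos_bvp` helpers sketched above (function names and structure are illustrative):

```python
import numpy as np

def band_snr(bvp: np.ndarray, fps: float, lo: float = 0.67, hi: float = 3.33) -> float:
    """Peak power over the remaining power inside the 40-200 BPM physiological band."""
    spectrum = np.abs(np.fft.rfft(bvp - bvp.mean())) ** 2
    freqs = np.fft.rfftfreq(len(bvp), d=1.0 / fps)
    band_power = spectrum[(freqs >= lo) & (freqs <= hi)]
    peak = band_power.max()
    return float(peak / (band_power.sum() - peak + 1e-9))

def select_estimator(rgb: np.ndarray, fps: float):
    """Run both estimators and keep the one with the stronger in-band signal."""
    candidates = {"chrom": chrom_bvp(rgb), "pos": pos_bvp(rgb)}
    scored = {name: band_snr(bvp, fps) for name, bvp in candidates.items()}
    best = max(scored, key=scored.get)
    confidence = scored[best] / (scored[best] + 1.0)   # SNR mapped into [0, 1)
    return best, candidates[best], confidence
```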
The BVP signal p(t) sampled at fps Hz is zero-padded to 4× its length before FFT to increase frequency resolution. The peak in the physiological band is found, and:
BPM = peak_frequency_Hz × 60
A harmonic-weighted search is used: the algorithm checks the fundamental and its first two harmonics (f, 2f, 3f), preferring a candidate whose harmonics also show power, reducing octave errors.
The BPM estimate is smoothed with an exponential moving average (EMA) with α = 0.2 to suppress frame-to-frame noise, and a sustained-consistency gate (≥1 s of readings within ±5 BPM) is required before alerts fire.
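A sketch of the zero-padded FFT peak search with the harmonic check and EMA smoothing described above (the top-5 candidate count and harmonic weights are illustrative):

```python
import numpy as np

def estimate_bpm(bvp: np.ndarray, fps: float, prev_bpm: float | None = None,
                 ema_alpha: float = 0.2) -> float:
    """Zero-padded FFT peak search with a simple harmonic check and EMA smoothing."""
    n = len(bvp)
    power = np.abs(np.fft.rfft(bvp - bvp.mean(), n=4 * n)) ** 2   # 4x zero padding for finer bins
    freqs = np.fft.rfftfreq(4 * n, d=1.0 / fps)
    band = (freqs >= 40 / 60) & (freqs <= 200 / 60)
    band_freqs, band_power = freqs[band], power[band]

    def harmonic_score(f: float) -> float:
        # Prefer a fundamental whose 2f and 3f harmonics also carry power (reduces octave errors).
        return sum(w * power[np.argmin(np.abs(freqs - k * f))]
                   for k, w in ((1, 1.0), (2, 0.5), (3, 0.25)))

    candidates = band_freqs[np.argsort(band_power)[-5:]]          # top-5 in-band peaks
    bpm = max(candidates, key=harmonic_score) * 60.0
    if prev_bpm is not None:
        bpm = ema_alpha * bpm + (1 - ema_alpha) * prev_bpm        # EMA, alpha = 0.2
    return float(bpm)
```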
The forehead is chosen as the region of interest for three reasons:
- High capillary density: the supraorbital and frontal branches of the ophthalmic artery run close to the surface, giving a stronger pulsatile signal than cheeks or the neck.
- Low melanin variation: the forehead has fewer active melanocytes than the cheeks in most individuals, reducing between-person variation in DC reflectance.
- Flat geometry: minimal specular highlights from curved surfaces (nose bridge, cheekbones) that corrupt colour measurements.
| Endpoint | Purpose |
|---|---|
| `POST /sessions` | Create ingest session |
| `WS /sessions/{id}/stream` | Stream JPEG frames + audio |
| `GET /sessions` | List active sessions |
| `GET /sessions/{id}` | Session state + processor signals |
| `GET /sessions/{id}/frame` | Latest annotated preview JPEG |
| `GET /health` | Liveness check |
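For quick testing, here is a minimal Python client sketch against these endpoints. The paths come from the table above; the session-id field name and the one-binary-message-per-JPEG framing are assumptions, so check the backend handlers and viewer/src/CameraStream.tsx for the exact wire format.

```python
# Minimal streaming client sketch: create a session, then push webcam frames at 10 fps.
import asyncio
import cv2, requests, websockets

BASE = "http://127.0.0.1:8000"

async def stream_webcam(fps: int = 10) -> None:
    session = requests.post(f"{BASE}/sessions").json()
    session_id = session["id"]                      # assumed response field name
    cap = cv2.VideoCapture(0)
    async with websockets.connect(f"ws://127.0.0.1:8000/sessions/{session_id}/stream") as ws:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            _, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 70])
            await ws.send(jpeg.tobytes())           # assumed: one binary message per JPEG frame
            await asyncio.sleep(1 / fps)

if __name__ == "__main__":
    asyncio.run(stream_webcam())
```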
| File | Purpose |
|---|---|
| `backend/app/processors/face_droop.py` | Droop processor |
| `backend/app/processors/droop_inference.py` | ONNX wrapper + asymmetry gate |
| `backend/app/preprocess.py` | MediaPipe landmark extraction |
| `backend/app/processors/heart_rate/processor.py` | rPPG processor |
| `backend/app/processors/heart_rate/signal_processor.py` | CHROM/POS BVP estimators |
| `backend/app/agent_factory.py` | Wires processors + LLM + GraphRAG |
| `viewer/src/CameraStream.tsx` | Webcam streaming + signal cards |
The backend includes a GraphRAG pipeline that searches the JRCALC 2022 Clinical Practice Guidelines (the definitive UK paramedic reference) and surfaces relevant guidance in real time during a session.
When the glasses wearer speaks, the backend automatically searches the JRCALC document and injects matching guidance into the AI's context before it responds. No special command is needed - just ask naturally:
- "What's the adrenaline dose for cardiac arrest?" - the AI receives the exact dose table row from the JRCALC drug doses section
- "Walk me through stroke assessment" - the AI receives the FAST assessment protocol and management steps
- "Is the patient breathing?" - semantic search finds airway and breathing management sections
The search uses a knowledge graph (NetworkX) on top of a FAISS vector index. Each piece of the document (chapters, sections, paragraphs, and table rows) is a node. When a query matches a node, the graph traversal also pulls in related nodes: the parent section for context, sibling chunks, and every other place the same drug or condition is mentioned across different chapters. This means a query for "adrenaline" returns the cardiac arrest dose, the anaphylaxis dose, and the paediatric dose in a single retrieval pass.
The retrieved guidance is prepended to the AI's prompt, labelled Retrieved clinical guidance (JRCALC 2022), so the AI can cite and reason over it while also using what it sees through the camera.
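A simplified sketch of the graph-expanded retrieval described above, assuming a FAISS index whose row order matches a list of node ids and a NetworkX graph with parent/sibling/entity edges. Function names and the expansion policy are illustrative, not the repo's actual API:

```python
import faiss, networkx as nx, numpy as np

def retrieve(query_vec: np.ndarray, index: faiss.Index, node_ids: list[str],
             graph: nx.Graph, top_k: int = 6) -> list[str]:
    """Vector search for the top-k nodes, then pull in their graph neighbours."""
    _, hits = index.search(query_vec.reshape(1, -1).astype(np.float32), top_k)
    selected: list[str] = []
    for i in hits[0]:
        node = node_ids[i]
        selected.append(node)
        selected.extend(graph.neighbors(node))   # parent section, siblings, shared-entity rows
    # De-duplicate while keeping retrieval order; the caller trims to the token budget.
    seen, ordered = set(), []
    for node in selected:
        if node not in seen:
            seen.add(node)
            ordered.append(node)
    return ordered
```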
Step 1 - Get the EPUB
Place the JRCALC 2022 EPUB at:
data/jrcalc-clinical-guidelines-2022.epub
Step 2 - Build the index (one-time, ~5–8 minutes)
cd backend
python -m scripts.build_rag_index \
--epub data/jrcalc-clinical-guidelines-2022.epub \
--out rag_index \
--gemini-api-key $GEMINI_API_KEY

This produces a rag_index/ directory containing the FAISS vector index and the NetworkX graph. Expected output:
Ingesting EPUB: data/jrcalc-clinical-guidelines-2022.epub
Ingestion done in 12.3s
Nodes total: 7842
chapter: 48
section: 312
text_chunk: 4201
table: 198
table_row: 2841
entity: 242
Edges total: 18934
Building NetworkX graph...
Graph saved in 0.4s → rag_index/graph.pkl
Building FAISS index (embedding via Google text-embedding-004)...
FAISS index saved in 287.1s → rag_index/faiss.index (7594 vectors)
Done. Total time: 299.8s
Step 3 - Enable in .env
Add to backend/.env:
RAG_ENABLED=true

Optional tuning:
RAG_TOP_K=6 # number of nodes returned per query (default: 6)
RAG_MAX_CONTEXT_TOKENS=1200 # max tokens of guidance injected per turn (default: 1200)
RAG_INDEX_DIR=rag_index         # path to the built index (default: rag_index)

Step 4 - Restart the backend
make backend-restart

The backend logs confirm the index loaded:
rag retriever loaded index_dir=rag_index nodes=7842 vectors=7594 top_k=6
The JRCALC document is indexed at four levels of granularity:
| Node type | Example |
|---|---|
| Chapter | "Cardiac Arrest" |
| Section | "Management - Adult" |
| Paragraph | "CPR should be initiated immediately..." |
| Table row | `Drug: Adrenaline …` |
Drug and condition names become entity hub nodes. A query for "adrenaline" finds the entity hub and follows its connections to every dose table row across all conditions (cardiac arrest, anaphylaxis, bradycardia) in one retrieval pass.
Set RAG_ENABLED=false (or omit it entirely - it defaults to false). The rest of the pipeline is completely unaffected.
git clone https://github.com/safishamsi/FirstSight.git
cd FirstSight/mobile/CameraAccess
open CameraAccess.xcodeproj

Copy the example file and fill in your values:
cp CameraAccess/Secrets.swift.example CameraAccess/Secrets.swift

Edit Secrets.swift with your Gemini API key (required) and optional WebRTC config.
Select your iPhone as the target device and hit Run (Cmd+R).
Without glasses (iPhone mode):
- Tap "Start on iPhone" -- uses your iPhone's back camera
- Tap the AI button to start a Gemini Live session
- Talk to the AI -- it can see through your iPhone camera
With Meta Ray-Ban glasses:
First, enable Developer Mode in the Meta AI app:
- Open the Meta AI app on your iPhone
- Go to Settings (gear icon, bottom left)
- Tap App Info
- Tap the App version number 5 times -- this unlocks Developer Mode
- Go back to Settings -- you'll now see a Developer Mode toggle. Turn it on.
Then in the iOS app:
- Tap "Start Streaming" in the app
- Tap the AI button for voice + vision conversation
git clone https://github.com/safishamsi/FirstSight.git

Open mobile/CameraAccessAndroid/ in Android Studio.
The Meta DAT Android SDK is distributed via GitHub Packages. You need a GitHub Personal Access Token with read:packages scope.
- Go to GitHub > Settings > Developer Settings > Personal Access Tokens and create a classic token with `read:packages` scope
- In `mobile/CameraAccessAndroid/local.properties`, add:

github_token=YOUR_GITHUB_TOKEN

Tip: If you have the `gh` CLI installed, you can run `gh auth token` to get a valid token. Make sure it has `read:packages` scope -- if not, run `gh auth refresh -s read:packages`.

Note: GitHub Packages requires authentication even for public repositories. The 401 error means your token is missing or invalid.
cd mobile/CameraAccessAndroid/app/src/main/java/com/meta/wearable/dat/externalsampleapps/cameraaccess/
cp Secrets.kt.example Secrets.kt

Edit Secrets.kt with your Gemini API key (required) and optional WebRTC config.
- Let Gradle sync in Android Studio (it will download the DAT SDK from GitHub Packages)
- Select your Android phone as the target device
- Click Run (Shift+F10)
Wireless debugging: You can also install via ADB wirelessly. Enable Wireless debugging in your phone's Developer Options, then pair with `adb pair <ip>:<port>`.
Without glasses (Phone mode):
- Tap "Start on Phone" -- uses your phone's back camera
- Tap the AI button (sparkle icon) to start a Gemini Live session
- Talk to the AI -- it can see through your phone camera
With Meta Ray-Ban glasses:
Enable Developer Mode in the Meta AI app (same steps as iOS above), then:
- Tap "Start Streaming" in the app
- Tap the AI button for voice + vision conversation
All source code is in mobile/CameraAccess/CameraAccess/:
| File | Purpose |
|---|---|
| `Gemini/GeminiConfig.swift` | API keys, model config, system prompt |
| `Gemini/GeminiLiveService.swift` | WebSocket client for Gemini Live API |
| `Gemini/AudioManager.swift` | Mic capture (PCM 16kHz) + audio playback (PCM 24kHz) |
| `Gemini/GeminiSessionViewModel.swift` | Session lifecycle, transcript state |
| `iPhone/IPhoneCameraManager.swift` | AVCaptureSession wrapper for iPhone camera mode |
| `WebRTC/WebRTCClient.swift` | WebRTC peer connection + SDP negotiation |
| `WebRTC/SignalingClient.swift` | WebSocket signaling for WebRTC rooms |
All source code is in mobile/CameraAccessAndroid/app/src/main/java/.../cameraaccess/:
| File | Purpose |
|---|---|
| `gemini/GeminiConfig.kt` | API keys, model config, system prompt |
| `gemini/GeminiLiveService.kt` | OkHttp WebSocket client for Gemini Live API |
| `gemini/AudioManager.kt` | AudioRecord (16kHz) + AudioTrack (24kHz) |
| `gemini/GeminiSessionViewModel.kt` | Session lifecycle, UI state |
| `phone/PhoneCameraManager.kt` | CameraX wrapper for phone camera mode |
| `webrtc/WebRTCClient.kt` | WebRTC peer connection (stream-webrtc-android) |
| `webrtc/SignalingClient.kt` | OkHttp WebSocket signaling for WebRTC rooms |
| `settings/SettingsManager.kt` | SharedPreferences with Secrets.kt fallback |
- Input: Phone mic -> AudioManager (PCM Int16, 16kHz mono, 100ms chunks) -> Gemini WebSocket
- Output: Gemini WebSocket -> AudioManager playback queue -> Phone speaker
- iOS iPhone mode: Uses `.voiceChat` audio session for echo cancellation + mic gating during AI speech
- iOS Glasses mode: Uses `.videoChat` audio session (mic is on glasses, speaker is on phone -- no echo)
- Android: Uses `VOICE_COMMUNICATION` audio source for built-in acoustic echo cancellation
- Glasses: DAT SDK video stream (24fps) -> throttle to ~1fps -> JPEG (50% quality) -> Gemini
- Phone: Camera capture (30fps) -> throttle to ~1fps -> JPEG -> Gemini
Share your glasses POV in real-time to a browser viewer with bidirectional audio and video.
- Tap the Live button in the app
- The app connects to a signaling server and gets a 6-character room code
- Share the code -- the viewer opens the server URL in a browser and enters it
- WebRTC peer connection is established (SDP + ICE via the signaling server)
- Media flows peer-to-peer: glasses video to browser, browser camera back to iOS PiP
Key details:
- Signaling server: Node.js + WebSocket, located at `mobile/CameraAccess/server/` -- serves the browser viewer and relays SDP/ICE
- NAT traversal: Google STUN servers + ExpressTURN relay (fetched from `/api/turn`)
/api/turn) - Video: 24 fps, 2.5 Mbps max bitrate
- Background handling: 60-second grace period for iOS app backgrounding -- room stays alive for reconnection
- Constraint: Cannot run simultaneously with Gemini Live (audio device conflict)
For full details, see mobile/CameraAccess/CameraAccess/WebRTC/README.md.
- iOS 17.0+
- Xcode 15.0+
- Gemini API key (get one free)
- Meta Ray-Ban glasses (optional -- use iPhone mode for testing)
- Android 14+ (API 34+)
- Android Studio Ladybug or newer
- GitHub account with `read:packages` token (for DAT SDK)
- Gemini API key (get one free)
- Meta Ray-Ban glasses (optional -- use Phone mode for testing)
Gemini doesn't hear me -- Check that microphone permission is granted. The app uses aggressive voice activity detection -- speak clearly and at normal volume.
"Gemini API key not configured" -- Add your API key in Secrets.swift or in the in-app Settings.
Echo/feedback in iPhone mode -- The app mutes the mic while the AI is speaking. If you still hear echo, try turning down the volume.
Gradle sync fails with 401 Unauthorized -- Your GitHub token is missing or doesn't have read:packages scope. Check mobile/CameraAccessAndroid/local.properties for github_token, or set GITHUB_TOKEN in your environment. Generate a token at github.com/settings/tokens.
Gemini WebSocket times out -- The Gemini Live API sends binary WebSocket frames. If you're building a custom client, make sure to handle both text and binary frame types.
Audio not working -- Ensure RECORD_AUDIO permission is granted. On Android 13+, you may need to grant this permission manually in Settings > Apps.
Phone camera not starting -- Ensure CAMERA permission is granted. CameraX requires both the permission and a valid lifecycle.
For DAT SDK issues, see the developer documentation or the discussions forum.
This source code is licensed under the license found in the LICENSE file in the root directory of this source tree.