A modern web application powered by Meta's SAM 3 (Segment Anything Model 3) and Claude for grounded visual reasoning. IRIS lets the LLM iteratively query a segmentation model to verify visual facts instead of hallucinating them, with an intuitive web interface for interactive analysis.
The Problem: When you send an image directly to Claude Vision or other multimodal LLMs, they can hallucinate counts, positions, and relationships.
The Solution: IRIS forces Claude to "show its work" using actual computer vision tools.
| Scenario | Standard Vision LLM | IRIS (Grounded) |
|---|---|---|
| Counting | "I see approximately 7 people" ❌ Could be wrong | Calls segment_concept("person") → {"count": 5} ✅ Exact |
| Spatial Reasoning | "The car appears to be in the parking space" ❌ Approximation | Calculates IoU between masks → "89% overlap" ✅ Precise |
| Verification | "Most workers appear to be wearing hard hats" ❌ Uncertain | Segments 8 people, 7 hard hats → "Worker #4 missing hard hat" ✅ Specific |
| Video Tracking | "Crowd density increases" ❌ Vague | Frame 0: 3 people, Frame 5: 12 people, Frame 9: 5 people ✅ Precise temporal data |
| Proof | Text description only | Visual masks + bounding boxes + confidence scores ✅ Verifiable |
Question: "Are all workers wearing proper PPE?"
Standard Vision LLM Response:
"I can see several workers. Most appear to be wearing hard hats,
though one in the back may not be."
IRIS Response:
1. segment_concept("person") → 8 workers detected
2. segment_concept("hard hat") → 7 hard hats detected
3. analysis_spatial("person", "hard hat") → 7 overlapping pairs
4. Result: Worker #4 at position [245, 180] has no hard hat
"7 out of 8 workers are wearing hard hats. Worker #4 is not compliant."
+ Visual overlay showing worker #4 circled without hard hat detection
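The compliance check above can be sketched as box-overlap matching between person and hard-hat detections. This is an illustrative reconstruction, not IRIS's actual implementation; the function names, box format, and threshold are assumptions:

```python
# Hypothetical sketch of the worker/hard-hat matching step. Boxes are
# [x1, y1, x2, y2]; real IRIS results also carry masks and confidences.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def find_noncompliant(workers, hats, threshold=0.05):
    """Return indices of worker boxes with no overlapping hard-hat box."""
    return [i for i, w in enumerate(workers)
            if all(iou(w, h) < threshold for h in hats)]

# Toy scene: the second worker has no hat box overlapping theirs.
workers = [[0, 0, 50, 100], [60, 0, 110, 100]]
hats = [[10, 0, 40, 20]]
print(find_noncompliant(workers, hats))  # -> [1]
```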
Choose IRIS when you need:
- ✅ Exact counts (not "several" or "many")
- ✅ Spatial measurements (distance, overlap %, containment)
- ✅ Verification with proof (show me where X is detected)
- ✅ Video object tracking (how does count change frame-by-frame?)
- ✅ ML dataset export (COCO, YOLO annotations for training)
- ✅ Audit trails (visual evidence of detections)
Standard vision LLMs are fine for:
- General scene descriptions
- OCR / text reading
- Creative/artistic analysis
- When approximate answers are acceptable
LLM receives an image and calls segmentation tools to verify facts:
segment_concept("red traffic light") # -> 1 instance at (x1, y1, x2, y2)
segment_concept("pedestrian") # -> 3 instances
segment_concept("crosswalk") # -> 1 instance
# Claude can accurately answer: "Is this car running a red light?"

Claude analyzes video frames with SAM 3 segmentation tools:
# Claude extracts frames and tracks objects across time
segment_concept_in_frame(0, "person") # -> 5 people in frame 0
segment_concept_all_frames("person") # -> Track count changes over time
# Claude answers: "How does crowd density change throughout the video?"

Save segmentation results with visual overlays:
- Semi-transparent colored masks
- Corner bracket-style bounding boxes
- Indexed labels ([01], [02], etc.)
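The semi-transparent mask overlay can be sketched as a simple alpha blend. This is an assumed approach for illustration; the actual renderer in viz.py (colors, alpha, corner brackets, labels) may differ:

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 120, 0), alpha=0.4):
    """Alpha-blend a boolean HxW mask onto an HxWx3 uint8 image."""
    out = image.astype(np.float32).copy()
    color_arr = np.array(color, dtype=np.float32)
    # Blend only the masked pixels toward the overlay color.
    out[mask] = (1 - alpha) * out[mask] + alpha * color_arr
    return out.astype(np.uint8)

# Tiny example: a 2x2 masked region on a black image.
img = np.zeros((4, 4, 3), dtype=np.uint8)
m = np.zeros((4, 4), dtype=bool)
m[1:3, 1:3] = True
result = overlay_mask(img, m)
```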
cd sam3_vision_tools
# Create virtual environment
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
# For RTX 50 series GPUs (Blackwell architecture):
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu130

Copy .env.example to .env and fill in your actual values:
cp .env.example .env

Then edit .env with your credentials:
ANTHROPIC_API_KEY=sk-ant-... # Required for Claude API
HF_TOKEN=hf_... # Required for the gated SAM 3 model

- Python 3.9+
- Node.js 18+ (for web frontend)
- PyTorch 2.9.0+ (for RTX 50 series) or PyTorch 2.0+ (older GPUs)
- CUDA 13.0 (RTX 50 series) or CUDA 12.x (older GPUs)
- ~2GB disk space for SAM 3 model weights
- ffmpeg (for video processing and downsampling)
- Start the backend server:
# Using Python directly (development)
python server.py # Runs on http://localhost:8000
# Or using uvicorn (production-ready)
uvicorn server:app --reload --port 8000

- Start the frontend (in a new terminal):
cd web
npm install # First time only
npm run dev # Runs on http://localhost:3000

- Open http://localhost:3000 in your browser
# Run the grounded describer demo
python examples.py --demo grounded
# Direct tool usage demo
python examples.py --demo tools

from src.agents.grounded_describer import GroundedDescriberAgent
agent = GroundedDescriberAgent(model="claude-sonnet-4-5-20250929")
result = agent.analyze(
image_path="traffic_scene.jpg",
question="Is this car running a red light?",
candidate_concepts=["red traffic light", "green traffic light", "car"]
)
print(result["answer"])

IRIS includes a modern web UI built with Next.js 16 and React 19 for interactive visual analysis with real-time feedback.
- Real-time streaming chat with Claude using Server-Sent Events (SSE)
- Drag-and-drop image/video upload with instant preview
- Live visualization overlay with TensorPoint design system styling (dark theme with orange accents)
- Tools execution panel showing active segmentations and results
- Lightbox view for detailed inspection (click image to expand, ESC to close)
- Video timeline with formatted timestamps (MM:SS.mmm) and progress indicators
- Performance settings for video processing (frame skip, resolution, processing mode)
- Responsive design with dark theme optimized for visual analysis
The FastAPI backend (server.py) exposes the following endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/health` | GET | Server status and model loaded state |
| `/api/upload/image` | POST | Upload image; returns dimensions and storage URL |
| `/api/upload/video` | POST | Upload video with configurable processing mode (`frame_extraction` or `whole_video`), frame sampling, and resolution settings |
| `/api/preload` | POST | Preload SAM3 model (warm start) with SSE progress updates |
| `/api/chat` | POST | Streaming chat with Claude via SSE (returns events: `tool_call`, `tool_result`, `visualization`, `text`, `done`, `error`) |
| `/api/media/current` | GET | Retrieve current media as base64 |
| `/visualizations/{file}` | GET | Static file serving of generated mask visualizations |
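As a sketch, the health and image-upload endpoints above could be called from Python like this. The `requests` library and the multipart field name `file` are assumptions, not verified against server.py:

```python
import requests

BASE = "http://localhost:8000"  # default port used by server.py

def check_health(base=BASE):
    """GET /api/health -> server status and whether the model is loaded."""
    return requests.get(f"{base}/api/health", timeout=5).json()

def upload_image(path, base=BASE):
    """POST an image to /api/upload/image; 'file' field name is assumed."""
    with open(path, "rb") as f:
        resp = requests.post(f"{base}/api/upload/image",
                             files={"file": f}, timeout=60)
    resp.raise_for_status()
    return resp.json()
```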
When streaming chat responses, the following event types are emitted:
- `status`: connection status update
- `tool_call`: Claude invoked a tool (includes tool name and input parameters)
- `tool_result`: tool execution completed (includes result data)
- `visualization`: mask visualization image generated (includes URL to fetch)
- `frame_visualization`: per-frame video visualization (for video analysis)
- `text`: Claude response text chunk (streaming)
- `done`: chat turn complete
- `error`: error occurred during processing
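A minimal consumer for this stream can be sketched as follows. The payload shape (JSON with a `type` field on each `data:` line) is an assumption inferred from the event names above:

```python
import json

def parse_sse(lines):
    """Yield decoded event dicts from an iterable of SSE text lines."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Toy stream: in practice these lines come from the /api/chat response body.
stream = [
    'data: {"type": "tool_call", "tool": "segment_concept"}',
    "",  # SSE events are separated by blank lines
    'data: {"type": "text", "delta": "7 of 8 workers"}',
    'data: {"type": "done"}',
]
events = list(parse_sse(stream))
print([e["type"] for e in events])  # -> ['tool_call', 'text', 'done']
```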
sam3_vision_tools/
├── server.py # FastAPI backend with SSE streaming (uvicorn)
├── examples.py # CLI demos (grounded describer, tool usage)
├── src/
│ ├── __init__.py
│ ├── sam3_engine.py # Core SAM 3 wrapper (image & video models)
│ ├── claude_tools.py # Tool definitions for Claude
│ ├── viz.py # Mask visualization system (TensorPoint design)
│ ├── video_utils.py # Video trimming and metadata utilities
│ └── agents/
│ └── grounded_describer.py # Grounded visual Q&A agent
├── web/ # Next.js 16 frontend (React 19)
│ ├── src/
│ │ ├── app/ # Next.js app router pages
│ │ │ ├── layout.tsx
│ │ │ └── page.tsx
│ │ ├── components/ # React components
│ │ │ ├── chat-panel.tsx
│ │ │ ├── preview-panel.tsx
│ │ │ ├── tools-panel.tsx
│ │ │ ├── settings-modal.tsx # Performance settings for video
│ │ │ ├── media-upload.tsx
│ │ │ └── ui/ # shadcn/ui components
│ │ ├── contexts/ # React contexts
│ │ │ └── settings-context.tsx
│ │ └── lib/ # API client and utilities
│ │ └── api.ts
│ ├── package.json
│ └── tailwind.config.ts
└── requirements.txt
| Tool | Description |
|---|---|
| `segment_concept` | Segment all instances of a text-described concept |
| `segment_multiple_concepts` | Segment multiple concepts in one call |
| `segment_with_box` | Segment using a bounding-box constraint |
| `segment_with_point` | Segment the object at a specific point |
| `compute_mask_overlap` | Compare two segmentation results (IoU) |
| `get_image_dimensions` | Get image width/height |
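The IoU that `compute_mask_overlap` reports can be illustrated on boolean pixel masks. This is the standard intersection-over-union definition; the tool's exact output format is not shown here:

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two boolean masks: overlapping pixels / union of pixels."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union else 0.0

# Two 60-pixel vertical bands on a 10x10 grid, overlapping in 20 pixels.
a = np.zeros((10, 10), dtype=bool); a[:, :6] = True
b = np.zeros((10, 10), dtype=bool); b[:, 4:] = True
print(round(mask_iou(a, b), 3))  # -> 0.2  (20 / 100)
```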
| Tool | Description |
|---|---|
| `segment_concept_in_frame` | Segment a concept in a specific frame, with timestamp |
| `segment_concept_all_frames` | Track a concept across all frames with a temporal summary. Supports both frame-extraction mode (sampled frames) and whole-video mode (native SAM3VideoModel tracking with temporal consistency) |
| `get_video_info` | Get frame count, timestamps, dimensions, duration |
| Tool | Description |
|---|---|
| `analysis_summarize` | Generate a comprehensive segmentation summary with statistics (confidence breakdown, size distribution, spatial clustering) |
| `analysis_spatial` | Analyze spatial relationships between concepts (overlapping, nearby, and far pairs with IoU and distance metrics) |
| `analysis_compare_concepts` | Compare multiple concepts by count, total_area, or avg_confidence, with ranking |
| `export_dataset` | Export segmentation annotations in COCO JSON, YOLO txt, or Pascal VOC XML format for ML training |
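A minimal COCO-style export can be sketched as follows. The detection input shape is hypothetical, and the real `export_dataset` tool likely emits additional fields (segmentation polygons, scores, licenses):

```python
import json

def to_coco(image_path, width, height, detections):
    """detections: list of {'concept': str, 'box': [x, y, w, h]} (assumed shape)."""
    categories = sorted({d["concept"] for d in detections})
    cat_id = {name: i + 1 for i, name in enumerate(categories)}
    return {
        "images": [{"id": 1, "file_name": image_path,
                    "width": width, "height": height}],
        "categories": [{"id": i, "name": n} for n, i in cat_id.items()],
        "annotations": [
            {"id": j + 1, "image_id": 1,
             "category_id": cat_id[d["concept"]],
             "bbox": d["box"],  # COCO convention: [x, y, width, height]
             "area": d["box"][2] * d["box"][3], "iscrowd": 0}
            for j, d in enumerate(detections)
        ],
    }

coco = to_coco("site.jpg", 640, 480,
               [{"concept": "person", "box": [10, 20, 50, 100]}])
print(json.dumps(coco, indent=2))
```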
| Tool | Description |
|---|---|
| `video_track_changes` | Temporal change detection: compare specific frames, analyze count timelines, track object entry/exit events |
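The entry/exit logic described for `video_track_changes` can be sketched from per-frame counts alone. This is illustrative only: net deltas cannot distinguish simultaneous entries and exits, which true tracking can:

```python
def count_events(counts):
    """Given per-frame counts, report net entries/exits between frames."""
    events = []
    for i in range(1, len(counts)):
        delta = counts[i] - counts[i - 1]
        if delta > 0:
            events.append((i, f"+{delta} entered"))
        elif delta < 0:
            events.append((i, f"{-delta} exited"))
    return events

# Counts echoing the crowd example earlier: 3 -> 12 -> 5 people.
print(count_events([3, 12, 5]))  # -> [(1, '+9 entered'), (2, '7 exited')]
```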
# Grounded describer demo (interactive Q&A with segmentation)
python examples.py --demo grounded
# Direct tool usage demo (programmatic API)
python examples.py --demo tools

For advanced usage, see examples.py, which demonstrates:
- Using the GroundedDescriberAgent for visual Q&A
- Direct SAM3Engine API calls
- Tool integration with Claude
Tested on:
- NVIDIA RTX 5070 (Blackwell, sm_120) - PyTorch 2.9.0 + CUDA 13.0
- NVIDIA RTX 30/40 series - PyTorch 2.0+ with CUDA 12.x
- GPU Acceleration: SAM 3 runs much faster on GPU (CUDA recommended)
- Batch Concepts: Use `segment_multiple_concepts` for efficiency
- Caching: Segmentation results are cached per session to avoid redundant computation
- Video Processing Modes:
- Frame Extraction Mode: Samples N frames evenly (default: 15/min, configurable)
- Whole Video Mode: Uses SAM3VideoModel for temporal tracking with frame_skip (default: 2x speedup)
- Video Resolution: Downsample to 720p or 480p for faster processing (configurable in settings)
- Context Management: Message history is automatically truncated to last 20 messages to prevent unbounded context growth
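The context-management note above can be sketched as keeping the most recent N messages (N=20 per the note; the actual server.py logic may be more involved, e.g. to keep tool calls and results paired):

```python
def truncate_history(messages, max_messages=20):
    """Keep only the most recent messages to bound context growth."""
    return messages[-max_messages:]

# 30 messages in, only the last 20 survive.
history = [{"role": "user", "content": f"msg {i}"} for i in range(30)]
trimmed = truncate_history(history)
print(len(trimmed), trimmed[0]["content"])  # -> 20 msg 10
```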
- SAM 3 model (Meta AI license)
- Claude API (Anthropic terms of service)

