A modern web application powered by Meta's SAM 3 (Segment Anything Model 3) and Claude for grounded visual reasoning. IRIS lets the LLM iteratively query a segmentation model to verify visual facts instead of hallucinating them, with an intuitive web interface for interactive analysis.
The Problem: When you send an image directly to Claude Vision or other multimodal LLMs, they can hallucinate counts, positions, and relationships.
The Solution: IRIS forces Claude to "show its work" using actual computer vision tools.
| Scenario | Standard Vision LLM | IRIS (Grounded) |
|---|---|---|
| Counting | "I see approximately 7 people" ❌ Could be wrong | Calls segment_concept("person") → {"count": 5} ✅ Exact |
| Spatial Reasoning | "The car appears to be in the parking space" ❌ Approximation | Calculates IoU between masks → "89% overlap" ✅ Precise |
| Verification | "Most workers appear to be wearing hard hats" ❌ Uncertain | Segments 8 people, 7 hard hats → "Worker #4 missing hard hat" ✅ Specific |
| Video Tracking | "Crowd density increases" ❌ Vague | Frame 0: 3 people, Frame 5: 12 people, Frame 9: 5 people ✅ Precise temporal data |
| Proof | Text description only | Visual masks + bounding boxes + confidence scores ✅ Verifiable |
Question: "Are all workers wearing proper PPE?"
Standard Vision LLM Response:
"I can see several workers. Most appear to be wearing hard hats,
though one in the back may not be."
IRIS Response:
1. segment_concept("person") → 8 workers detected
2. segment_concept("hard hat") → 7 hard hats detected
3. analysis_spatial("person", "hard hat") → 7 overlapping pairs
4. Result: Worker #4 at position [245, 180] has no hard hat
"7 out of 8 workers are wearing hard hats. Worker #4 is not compliant."
+ Visual overlay showing worker #4 circled without hard hat detection
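The compliance check above can be sketched as box-overlap matching between person and hard-hat detections. This is an illustrative reconstruction, not IRIS's actual implementation; the function names, box format, and threshold are assumptions:

```python
# Hypothetical sketch of the worker/hard-hat matching step. Boxes are
# [x1, y1, x2, y2]; real IRIS results also carry masks and confidences.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def find_noncompliant(workers, hats, threshold=0.05):
    """Return indices of worker boxes with no overlapping hard-hat box."""
    return [i for i, w in enumerate(workers)
            if all(iou(w, h) < threshold for h in hats)]

# Toy scene: the second worker has no hat box overlapping theirs.
workers = [[0, 0, 50, 100], [60, 0, 110, 100]]
hats = [[10, 0, 40, 20]]
print(find_noncompliant(workers, hats))  # -> [1]
```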
Choose IRIS when you need:
- ✅ Exact counts (not "several" or "many")
- ✅ Spatial measurements (distance, overlap %, containment)
- ✅ Verification with proof (show me where X is detected)
- ✅ Video object tracking (how does count change frame-by-frame?)
- ✅ ML dataset export (COCO, YOLO annotations for training)
- ✅ Audit trails (visual evidence of detections)
Standard vision LLMs are fine for:
- General scene descriptions
- OCR / text reading
- Creative/artistic analysis
- When approximate answers are acceptable
LLM receives an image and calls segmentation tools to verify facts:
segment_concept("red traffic light") # -> 1 instance at (x1, y1, x2, y2)
segment_concept("pedestrian") # -> 3 instances
segment_concept("crosswalk") # -> 1 instance
# Claude can accurately answer: "Is this car running a red light?"

Claude analyzes video frames with SAM 3 segmentation tools:
# Claude extracts frames and tracks objects across time
segment_concept_in_frame(0, "person") # -> 5 people in frame 0
segment_concept_all_frames("person") # -> Track count changes over time
# Claude answers: "How does crowd density change throughout the video?"

Save segmentation results with visual overlays:
- Semi-transparent colored masks
- Corner bracket-style bounding boxes
- Indexed labels ([01], [02], etc.)
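The semi-transparent mask overlay can be sketched as a simple alpha blend. This is an assumed approach for illustration; the actual renderer in viz.py (colors, alpha, corner brackets, labels) may differ:

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 120, 0), alpha=0.4):
    """Alpha-blend a boolean HxW mask onto an HxWx3 uint8 image."""
    out = image.astype(np.float32).copy()
    color_arr = np.array(color, dtype=np.float32)
    # Blend only the masked pixels toward the overlay color.
    out[mask] = (1 - alpha) * out[mask] + alpha * color_arr
    return out.astype(np.uint8)

# Tiny example: a 2x2 masked region on a black image.
img = np.zeros((4, 4, 3), dtype=np.uint8)
m = np.zeros((4, 4), dtype=bool)
m[1:3, 1:3] = True
result = overlay_mask(img, m)
```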
cd sam3_vision_tools
# Create virtual environment
python -m venv venv
venv\Scripts\activate # Windows
# source venv/bin/activate # Linux/Mac
# Install dependencies
pip install -r requirements.txt
# For RTX 50 series GPUs (Blackwell architecture):
pip install torch==2.9.0 torchvision==0.24.0 --index-url https://download.pytorch.org/whl/cu130

Copy .env.example to .env and fill in your actual values:
cp .env.example .env

Then edit .env with your credentials:
ANTHROPIC_API_KEY=sk-ant-... # Required for Claude API
HF_TOKEN=hf_... # Required for the gated SAM 3 model

- Python 3.9+
- Node.js 18+ (for web frontend)
- PyTorch 2.9.0+ (for RTX 50 series) or PyTorch 2.0+ (older GPUs)
- CUDA 13.0 (RTX 50 series) or CUDA 12.x (older GPUs)
- ~2GB disk space for SAM 3 model weights
- ffmpeg (for video processing and downsampling)
- Start the backend server:
# Using Python directly (development)
python server.py # Runs on http://localhost:8000
# Or using uvicorn (production-ready)
uvicorn server:app --reload --port 8000

- Start the frontend (in a new terminal):
cd web
npm install # First time only
npm run dev # Runs on http://localhost:3000

- Open http://localhost:3000 in your browser
# Run the grounded describer demo
python examples.py --demo grounded
# Direct tool usage demo
python examples.py --demo tools

from src.agents.grounded_describer import GroundedDescriberAgent
agent = GroundedDescriberAgent(model="claude-sonnet-4-5-20250929")
result = agent.analyze(
image_path="traffic_scene.jpg",
question="Is this car running a red light?",
candidate_concepts=["red traffic light", "green traffic light", "car"]
)
print(result["answer"])

IRIS includes a modern web UI built with Next.js 16 and React 19 for interactive visual analysis with real-time feedback.
- Real-time streaming chat with Claude using Server-Sent Events (SSE)
- Drag-and-drop image/video upload with instant preview
- Live visualization overlay with TensorPoint design system styling (dark theme with orange accents)
- Tools execution panel showing active segmentations and results
- Lightbox view for detailed inspection (click image to expand, ESC to close)
- Video timeline with formatted timestamps (MM:SS.mmm) and progress indicators
- Performance settings for video processing (frame skip, resolution, processing mode)
- Responsive design with dark theme optimized for visual analysis
The FastAPI backend (server.py) exposes the following endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| `/api/health` | GET | Server status and model loaded state |
| `/api/upload/image` | POST | Upload image; returns dimensions and storage URL |
| `/api/upload/video` | POST | Upload video with configurable processing mode (`frame_extraction` or `whole_video`), frame sampling, and resolution settings |
| `/api/preload` | POST | Preload SAM3 model (warm start) with SSE progress updates |
| `/api/chat` | POST | Streaming chat with Claude via SSE (returns events: `tool_call`, `tool_result`, `visualization`, `text`, `done`, `error`) |
| `/api/media/current` | GET | Retrieve current media as base64 |
| `/visualizations/{file}` | GET | Static file serving of generated mask visualizations |
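As a sketch, the health and image-upload endpoints above could be called from Python like this. The `requests` library and the multipart field name `file` are assumptions, not verified against server.py:

```python
import requests

BASE = "http://localhost:8000"  # default port used by server.py

def check_health(base=BASE):
    """GET /api/health -> server status and whether the model is loaded."""
    return requests.get(f"{base}/api/health", timeout=5).json()

def upload_image(path, base=BASE):
    """POST an image to /api/upload/image; 'file' field name is assumed."""
    with open(path, "rb") as f:
        resp = requests.post(f"{base}/api/upload/image",
                             files={"file": f}, timeout=60)
    resp.raise_for_status()
    return resp.json()
```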
When streaming chat responses, the following event types are emitted:
- `status`: connection status update
- `tool_call`: Claude invoked a tool (includes tool name and input parameters)
- `tool_result`: tool execution completed (includes result data)
- `visualization`: mask visualization image generated (includes URL to fetch)
- `frame_visualization`: per-frame video visualization (for video analysis)
- `text`: Claude response text chunk (streaming)
- `done`: chat turn complete
- `error`: error occurred during processing
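A minimal consumer for this stream can be sketched as follows. The payload shape (JSON with a `type` field on each `data:` line) is an assumption inferred from the event names above:

```python
import json

def parse_sse(lines):
    """Yield decoded event dicts from an iterable of SSE text lines."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Toy stream: in practice these lines come from the /api/chat response body.
stream = [
    'data: {"type": "tool_call", "tool": "segment_concept"}',
    "",  # SSE events are separated by blank lines
    'data: {"type": "text", "delta": "7 of 8 workers"}',
    'data: {"type": "done"}',
]
events = list(parse_sse(stream))
print([e["type"] for e in events])  # -> ['tool_call', 'text', 'done']
```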
sam3_vision_tools/
├── server.py # FastAPI backend with SSE streaming (uvicorn)
├── examples.py # CLI demos (grounded describer, tool usage)
├── src/
│ ├── __init__.py
│ ├── sam3_engine.py # Core SAM 3 wrapper (image & video models)
│ ├── claude_tools.py # Tool definitions for Claude
│ ├── viz.py # Mask visualization system (TensorPoint design)
│ ├── video_utils.py # Video trimming and metadata utilities
│ └── agents/
│ └── grounded_describer.py # Grounded visual Q&A agent
├── web/ # Next.js 16 frontend (React 19)
│ ├── src/
│ │ ├── app/ # Next.js app router pages
│ │ │ ├── layout.tsx
│ │ │ └── page.tsx
│ │ ├── components/ # React components
│ │ │ ├── chat-panel.tsx
│ │ │ ├── preview-panel.tsx
│ │ │ ├── tools-panel.tsx
│ │ │ ├── settings-modal.tsx # Performance settings for video
│ │ │ ├── media-upload.tsx
│ │ │ └── ui/ # shadcn/ui components
│ │ ├── contexts/ # React contexts
│ │ │ └── settings-context.tsx
│ │ └── lib/ # API client and utilities
│ │ └── api.ts
│ ├── package.json
│ └── tailwind.config.ts
└── requirements.txt
| Tool | Description |
|---|---|
| `segment_concept` | Segment all instances of a text-described concept |
| `segment_multiple_concepts` | Segment multiple concepts in one call |
| `segment_with_box` | Segment using a bounding-box constraint |
| `segment_with_point` | Segment the object at a specific point |
| `compute_mask_overlap` | Compare two segmentation results (IoU) |
| `get_image_dimensions` | Get image width/height |
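The IoU that `compute_mask_overlap` reports can be illustrated on boolean pixel masks. This is the standard intersection-over-union definition; the tool's exact output format is not shown here:

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two boolean masks: overlapping pixels / union of pixels."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union else 0.0

# Two 60-pixel vertical bands on a 10x10 grid, overlapping in 20 pixels.
a = np.zeros((10, 10), dtype=bool); a[:, :6] = True
b = np.zeros((10, 10), dtype=bool); b[:, 4:] = True
print(round(mask_iou(a, b), 3))  # -> 0.2  (20 / 100)
```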
| Tool | Description |
|---|---|
| `segment_concept_in_frame` | Segment a concept in a specific frame, with timestamp |
| `segment_concept_all_frames` | Track a concept across all frames with a temporal summary. Supports both frame-extraction mode (sampled frames) and whole-video mode (native SAM3VideoModel tracking with temporal consistency) |
| `get_video_info` | Get frame count, timestamps, dimensions, duration |
| Tool | Description |
|---|---|
| `analysis_summarize` | Generate a comprehensive segmentation summary with statistics (confidence breakdown, size distribution, spatial clustering) |
| `analysis_spatial` | Analyze spatial relationships between concepts (overlapping, nearby, and far pairs with IoU and distance metrics) |
| `analysis_compare_concepts` | Compare multiple concepts by count, total_area, or avg_confidence, with ranking |
| `export_dataset` | Export segmentation annotations in COCO JSON, YOLO txt, or Pascal VOC XML format for ML training |
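A minimal COCO-style export can be sketched as follows. The detection input shape is hypothetical, and the real `export_dataset` tool likely emits additional fields (segmentation polygons, scores, licenses):

```python
import json

def to_coco(image_path, width, height, detections):
    """detections: list of {'concept': str, 'box': [x, y, w, h]} (assumed shape)."""
    categories = sorted({d["concept"] for d in detections})
    cat_id = {name: i + 1 for i, name in enumerate(categories)}
    return {
        "images": [{"id": 1, "file_name": image_path,
                    "width": width, "height": height}],
        "categories": [{"id": i, "name": n} for n, i in cat_id.items()],
        "annotations": [
            {"id": j + 1, "image_id": 1,
             "category_id": cat_id[d["concept"]],
             "bbox": d["box"],  # COCO convention: [x, y, width, height]
             "area": d["box"][2] * d["box"][3], "iscrowd": 0}
            for j, d in enumerate(detections)
        ],
    }

coco = to_coco("site.jpg", 640, 480,
               [{"concept": "person", "box": [10, 20, 50, 100]}])
print(json.dumps(coco, indent=2))
```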
| Tool | Description |
|---|---|
| `video_track_changes` | Temporal change detection: compare specific frames, analyze count timelines, track object entry/exit events |
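The entry/exit logic described for `video_track_changes` can be sketched from per-frame counts alone. This is illustrative only: net deltas cannot distinguish simultaneous entries and exits, which true tracking can:

```python
def count_events(counts):
    """Given per-frame counts, report net entries/exits between frames."""
    events = []
    for i in range(1, len(counts)):
        delta = counts[i] - counts[i - 1]
        if delta > 0:
            events.append((i, f"+{delta} entered"))
        elif delta < 0:
            events.append((i, f"{-delta} exited"))
    return events

# Counts echoing the crowd example earlier: 3 -> 12 -> 5 people.
print(count_events([3, 12, 5]))  # -> [(1, '+9 entered'), (2, '7 exited')]
```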
# Grounded describer demo (interactive Q&A with segmentation)
python examples.py --demo grounded
# Direct tool usage demo (programmatic API)
python examples.py --demo tools

For advanced usage, see examples.py, which demonstrates:
- Using the GroundedDescriberAgent for visual Q&A
- Direct SAM3Engine API calls
- Tool integration with Claude
Tested on:
- NVIDIA RTX 5070 (Blackwell, sm_120) - PyTorch 2.9.0 + CUDA 13.0
- NVIDIA RTX 30/40 series - PyTorch 2.0+ with CUDA 12.x
- GPU Acceleration: SAM 3 runs much faster on GPU (CUDA recommended)
- Batch Concepts: Use `segment_multiple_concepts` for efficiency
- Caching: Segmentation results are cached per session to avoid redundant computation
- Video Processing Modes:
- Frame Extraction Mode: Samples N frames evenly (default: 15/min, configurable)
- Whole Video Mode: Uses SAM3VideoModel for temporal tracking with frame_skip (default: 2x speedup)
- Video Resolution: Downsample to 720p or 480p for faster processing (configurable in settings)
- Context Management: Message history is automatically truncated to last 20 messages to prevent unbounded context growth
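The context-management note above can be sketched as keeping the most recent N messages (N=20 per the note; the actual server.py logic may be more involved, e.g. to keep tool calls and results paired):

```python
def truncate_history(messages, max_messages=20):
    """Keep only the most recent messages to bound context growth."""
    return messages[-max_messages:]

# 30 messages in, only the last 20 survive.
history = [{"role": "user", "content": f"msg {i}"} for i in range(30)]
trimmed = truncate_history(history)
print(len(trimmed), trimmed[0]["content"])  # -> 20 msg 10
```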
- SAM 3 model (Meta AI license)
- Claude API (Anthropic terms of service)

