Skip to content

TheFloatingString/sam3-on-modal

Repository files navigation

SAM3 Inference API on Modal

img

Running SAM-3 by AI at Meta on Modal GPUs.

Features

  • Image Segmentation: Text-prompted image segmentation
  • Video Segmentation: Text-prompted video frame segmentation across sequences

Prerequisites

  1. Modal account and installed CLI: https://modal.com
  2. Python 3.9+
  3. Dependencies (see pyproject.toml)

Installation

pip install uv
uv sync

Environment Keys

In a .env file:

IMAGE_ENDPOINT="<endpoint>"
VIDEO_ENDPOINT="<endpoint>"

And set up HuggingFace token secret on Modal:

uv run modal secret create huggingface HF_TOKEN="<HF TOKEN>"

Deployment

Deploy to Modal:

uv run modal run modal_app.py

Or for a persistent deployment:

uv run modal deploy modal_app.py

Example inference:

uv run infer_golden_gate.py

API Endpoints

Health Check

GET /health_check

Simple health check endpoint.

curl https://your-workspace--sam3-inference-health-check.modal.run

Response:

{
  "status": "healthy",
  "service": "sam3-inference"
}

Image Inference

POST /infer_image

Perform segmentation on an image with a text prompt.

Request body:

{
  "image_base64": "base64_encoded_image_data",
  "prompt": "dog"
}

Response:

{
  "success": true,
  "data": {
    "masks": [...],
    "boxes": [...],
    "scores": [...]
  }
}

Example using Python:

import base64
import requests

image_path = "path/to/image.jpg"
with open(image_path, "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    "https://your-workspace--sam3-inference-infer-image.modal.run",
    json={
        "image_base64": image_base64,
        "prompt": "cat",
    }
)

print(response.json())

Example using cURL:

# First, encode your image
IMAGE_B64=$(base64 -w 0 image.jpg)

# Then POST
curl -X POST https://your-workspace--sam3-inference-infer-image.modal.run \
  -H "Content-Type: application/json" \
  -d '{
    "image_base64": "'$IMAGE_B64'",
    "prompt": "dog"
  }'

Video Inference

POST /infer_video

Perform segmentation on video frames with text prompts.

Two actions are supported:

1. Start a video session

Request body:

{
  "action": "start_session",
  "video_path": "/path/to/video.mp4"
}

Response:

{
  "success": true,
  "data": {
    "session_id": "session_123456"
  }
}

2. Add a prompt to the session

Request body:

{
  "action": "add_prompt",
  "session_id": "session_123456",
  "frame_index": 0,
  "prompt": "person walking"
}

Response:

{
  "success": true,
  "data": {
    "outputs": {...},
    ...
  }
}

Example using Python:

import requests

endpoint = "https://your-workspace--sam3-inference-infer-video.modal.run"

# Start session
session_response = requests.post(
    endpoint,
    json={
        "action": "start_session",
        "video_path": "/path/to/video.mp4",
    }
)

session_id = session_response.json()["data"]["session_id"]

# Add prompt
prompt_response = requests.post(
    endpoint,
    json={
        "action": "add_prompt",
        "session_id": session_id,
        "frame_index": 0,
        "prompt": "dog",
    }
)

print(prompt_response.json())

Usage Examples

See client_example.py for complete client implementations.

Local Development

For local testing without Modal:

from PIL import Image
from sam3.model_builder import build_sam3_image_model
from sam3.model.sam3_image_processor import Sam3Processor

# Load model
model = build_sam3_image_model()
processor = Sam3Processor(model)

# Load image
image = Image.open("image.jpg")

# Inference
inference_state = processor.set_image(image)
output = processor.set_text_prompt(state=inference_state, prompt="dog")

masks, boxes, scores = output["masks"], output["boxes"], output["scores"]

Project Structure

.
├── modal_app.py           # Main Modal app with endpoints
├── client_example.py      # Example client code
├── pyproject.toml         # Dependencies
├── README.md              # This file
└── main.py                # Placeholder entry point

Configuration

The app uses GPU "l40s" by default. To use a different GPU, modify modal_app.py:

@app.cls(image=image, gpu="h100")  # Change "l40s" to desired GPU
class SAM3ImagePredictor:
    ...

Troubleshooting

Model download issues

If SAM3 model download fails, ensure you have sufficient disk space and network connectivity.

GPU out of memory

Reduce batch sizes or use a larger GPU (h100 vs l40s).

Endpoint timeout

For large images/videos, increase the function timeout:

@app.function(timeout=600)  # 10 minutes
@modal.web_endpoint(method="POST")
async def infer_image(request_dict: dict) -> dict:
    ...

License

MIT

References

About

Running SAM-3 inference on Modal GPUs.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages