Add SAM2 streaming video tracker (inference_models + workflow block)#2245

Open
hansent wants to merge 8 commits into main from claude/sam-video-tracking-inference-models
Conversation


hansent (Collaborator) commented Apr 20, 2026

Summary

  • New HF-backed SAM2Video model in inference_models/inference_models/models/sam2_video/ — inherits from the new shared HFStreamingVideoBase so the streaming (prompt / track) contract is defined once and reused by future HF video trackers.
  • Registered under model architecture sam2video (instance-segmentation, HF backend). Four variants shipped: sam2video/tiny, sam2video/small (default), sam2video/base-plus, sam2video/large — all four weight zips uploaded to rf-platform-models/ and registered on staging via model-registry-sdk#5.
  • New workflow block roboflow_core/segment_anything_2_video@v1 (LOCAL-only — raises NotImplementedError on remote execution since per-video session state can't cross a process boundary). Multiplexes a single SAM2 predictor across videos by keying state on WorkflowImageData.video_metadata.video_identifier.
  • Three prompt modes on the block: first_frame (prompt once, then track), every_n_frames (re-seed every N frames), every_frame (re-seed every frame).
  • Fixes a real bug in the HF input_boxes nesting ([image, boxes, coords] — 3 levels, not 4) that the mocked unit tests missed; it only surfaced under real weights.
  • The legacy SegmentAnything2VideoRT / sam2_rt.SAM2ForStream (Meta's sam2 package camera predictor) is untouched by this PR; both remain registered concurrently.
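The three prompt modes boil down to a small per-frame decision. A minimal sketch of that logic, with the caveat that `PromptMode` and `should_prompt` are illustrative names, not the block's actual helpers:

```python
from enum import Enum

class PromptMode(str, Enum):
    FIRST_FRAME = "first_frame"
    EVERY_N_FRAMES = "every_n_frames"
    EVERY_FRAME = "every_frame"

def should_prompt(mode: PromptMode, frame_number: int, n: int = 30) -> bool:
    """True when this frame should (re-)seed the tracker with boxes."""
    if mode is PromptMode.EVERY_FRAME:
        return True
    if mode is PromptMode.EVERY_N_FRAMES:
        return frame_number % n == 0
    return frame_number == 0  # FIRST_FRAME: prompt once, then track
```

Frames where this returns False go down the plain `track` path instead.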

Out of scope for this PR

SAM3 video (streaming) was implemented on this branch alongside SAM2 but stripped in the final commit. The facebook/sam3 checkpoints are gated on HuggingFace and we're still waiting on access. HFStreamingVideoBase is deliberately left in models/common/ so the SAM3 follow-up can inherit it unchanged.

Test plan

  • 25 unit tests pass — block logic (tests/workflows/unit_tests/core_steps/models/foundation/test_segment_anything2_video.py, test_streaming_video_common.py) + SAM2 HF model (inference_models/tests/unit_tests/models/test_sam2_video.py)
  • Staging registration verified: sam2video/small loads via AutoModel.from_pretrained + runs prompt + track end-to-end against api.roboflow.one
  • Registration script for all four variants lives in roboflow/model-registry-sdk#5
  • Run integration tests on a GPU runner once any uploaded variant is sealed
  • Smoke test the workflow block via InferencePipeline + a short MP4
  • After staging is fully verified: register against production

hansent added 6 commits April 20, 2026 11:50
Introduces a SAM3 streaming tracker that mirrors the existing
SAM2ForStream interface (prompt / track returning (masks, object_ids,
state_dict)) so both can be used interchangeably by upstream code.

- inference_models/models/sam3_rt/sam3_pytorch.py: SAM3ForStream
  backed by HuggingFace transformers' Sam3VideoModel /
  Sam3VideoProcessor.  The native sam3 package's video predictor
  requires a full video resource upfront; the transformers port
  exposes init_video_session + per-frame model(frame=...), which is
  the shape we need for InferencePipeline-style streaming.
- Accepts bbox and/or text prompts; state_dict is opaque (wraps the
  HF Sam3VideoInferenceSession) and must be kept in memory by the
  caller — it's not serializable across processes.
- Register (segment-anything-3-rt, INSTANCE_SEGMENTATION_TASK,
  BackendType.HF) in models_registry alongside the existing
  SAM2-RT entry.

Tests:
- inference_models/tests/unit_tests/models/test_sam3_rt.py — 24
  unit tests covering helpers (_normalise_bboxes,
  _unpack_processed_outputs, etc.) plus class behaviour using
  MagicMock model/processor.  No weights required.
- inference_models/tests/integration_tests/models/test_sam2_rt_predictions.py
  — new integration suite for the existing SAM2ForStream
  (prompt -> track on synthetic frames, centroid-moves assertion,
  track-without-prompt raises, torch.Tensor input).
- inference_models/tests/integration_tests/models/test_sam3_rt_predictions.py
  — analogous suite for SAM3ForStream.
- conftest.py: sam2_rt_package and sam3_rt_package fixtures
  download zips from rf-platform-models; docstrings list the
  expected file contents for upload.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
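The prompt/track contract described in this commit can be shown with a stub. `StubStreamTracker` below is purely illustrative; it only mirrors the `(masks, object_ids, state_dict)` return shape, since the real classes need downloaded weights:

```python
class StubStreamTracker:
    """Stands in for SAM2ForStream / SAM3ForStream in this sketch only."""

    def prompt(self, frame, boxes):
        # Seed tracking from boxes; the returned state is opaque to callers.
        state = {"seeded": True, "frames_seen": 1}
        object_ids = list(range(len(boxes)))
        masks = [[[False]] for _ in object_ids]  # placeholder masks
        return masks, object_ids, state

    def track(self, frame, state):
        # Continue tracking using (and mutating) the caller-held state.
        if not state.get("seeded"):
            raise RuntimeError("track() called before prompt()")
        state["frames_seen"] += 1
        return [[[False]]], [0], state

frame = object()  # a real caller passes a decoded video frame here
tracker = StubStreamTracker()
masks, object_ids, state = tracker.prompt(frame, boxes=[[10, 10, 30, 30]])
for _ in range(3):  # later frames: track only, threading state through
    masks, object_ids, state = tracker.track(frame, state)
```

The key point is the last line: the caller keeps `state` in memory between frames, which is why it cannot cross a process boundary.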
Two new LOCAL-only workflow blocks that drive the inference_models
streaming trackers (SAM2ForStream / SAM3ForStream) from workflows
powered by InferencePipeline.  Both blocks multiplex a single model
instance across many videos by keying state_dicts on
video_metadata.video_identifier, reset sessions when frame_number
rolls back, and support three prompt modes: first_frame,
every_n_frames, every_frame.

- inference/core/workflows/core_steps/models/foundation/
  _streaming_video_common.py: shared helpers (state bookkeeping,
  prompt-vs-track decision logic, sv.Detections assembly with
  SAM-assigned tracker_ids).
- segment_anything2_video/v1.py: SAM2VideoTrackerBlockV1
  (type: roboflow_core/segment_anything_2_video@v1,
   default model_id: segment-anything-2-rt).
- segment_anything3_video/v1.py: SAM3VideoTrackerBlockV1
  (type: roboflow_core/sam3_video@v1,
   default model_id: segment-anything-3-rt).  Additionally accepts
  text prompts via class_names; boxes win when both are supplied.
- Both raise NotImplementedError on REMOTE step execution — per-video
  session state cannot survive a remote boundary.
- Models are loaded via inference_models.AutoModel.from_pretrained
  so backend negotiation / package download / caching flow through
  the standard inference_models pipeline.
- Registered in core_steps/loader.py.

Tests (30 total, all passing, no weights required):
- test_segment_anything2_video.py — 10 tests covering manifest,
  REMOTE rejection, first_frame/every_n_frames/every_frame modes,
  state threading across track calls, multi-stream isolation,
  stream-restart detection.
- test_segment_anything3_video.py — 9 tests with similar coverage
  plus text-vs-box prompt routing.
- test_streaming_video_common.py — 11 tests for the shared helpers.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
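The multiplexing this commit describes amounts to a session table keyed on `video_identifier`, with a reset when `frame_number` rolls back. A hedged sketch (these are not the actual `_streaming_video_common` helpers; `get_session` is an illustrative name):

```python
sessions: dict[str, dict] = {}  # video_identifier -> per-stream session

def get_session(video_identifier: str, frame_number: int) -> dict:
    session = sessions.get(video_identifier)
    if session is None or frame_number < session["last_frame"]:
        # First frame of a new stream, or the stream restarted
        # (frame counter rolled back): start a fresh session.
        session = {"state_dict": None, "last_frame": -1}
        sessions[video_identifier] = session
    session["last_frame"] = frame_number
    return session
```

Each stream's `state_dict` lives only in its own session entry, which is what gives the multi-stream isolation the tests below assert on.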
Refactors the streaming trackers into a shared HuggingFace
transformers base and adds a SAM2Video counterpart to the existing
SAM3Video.  The older sam2_rt (SAM2ForStream using Meta's sam2 camera
predictor) is kept untouched — per the feedback it hasn't been
exercised much in practice.

Model classes
-------------
- inference_models/models/common/hf_streaming_video.py:
  HFStreamingVideoBase containing all the HF streaming boilerplate —
  session init, prompt/track methods, mask/obj_id extraction, opaque
  state_dict contract.
- inference_models/models/sam2_video/sam2_video_hf.py: SAM2Video
  (lazy-imports transformers.Sam2VideoModel / Sam2VideoProcessor;
   rejects text prompts).
- inference_models/models/sam3_video/sam3_video_hf.py: SAM3Video,
  moved from the previous sam3_rt path; now a thin ~25-line subclass
  after the shared base absorbed the helpers (lazy-imports
  transformers.Sam3VideoModel / Sam3VideoProcessor; accepts both
  text and box prompts).

Registry
--------
- sam2video: (INSTANCE_SEGMENTATION_TASK, BackendType.HF) -> SAM2Video
- sam3video: (INSTANCE_SEGMENTATION_TASK, BackendType.HF) -> SAM3Video
- segment-anything-2-rt stays registered against SAM2ForStream.
- segment-anything-3-rt entry dropped (never released).

Workflow blocks now default to these ids:
- roboflow_core/segment_anything_2_video@v1 -> "sam2video"
- roboflow_core/sam3_video@v1               -> "sam3video"

Tests
-----
- Unit tests: added test_sam2_video.py (4 SAM2-specific), renamed
  test_sam3_rt.py -> test_sam3_video.py and updated imports (24 tests
  covering helpers on the shared base plus SAM3 class behaviour).
- Integration tests: renamed SAM3 file, added SAM2 counterpart.  New
  fixtures sam2_video_package / sam3_video_package (expected zips at
  rf-platform-models/sam2video.zip and rf-platform-models/sam3video.zip).
- Workflow block tests updated to use sam2video / sam3video ids.
- All 58 non-integration tests pass locally.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
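The base/subclass split this commit describes can be sketched as follows. All names on the base are illustrative (this is not the real `HFStreamingVideoBase` API); only the transformers class names and the "SAM2 rejects text prompts" rule come from the commit above:

```python
import importlib

class HFStreamingVideoBase:
    """Shared HF streaming boilerplate; subclasses name their HF classes."""
    hf_model_cls = ""
    hf_processor_cls = ""
    supports_text_prompts = True

    def _load_hf_classes(self):
        # Lazy import so transformers is only required when a model loads.
        transformers = importlib.import_module("transformers")
        return (getattr(transformers, self.hf_model_cls),
                getattr(transformers, self.hf_processor_cls))

    def validate_prompt(self, boxes=None, text=None):
        if text is not None and not self.supports_text_prompts:
            raise ValueError(f"{type(self).__name__} rejects text prompts")

class SAM2Video(HFStreamingVideoBase):
    hf_model_cls = "Sam2VideoModel"
    hf_processor_cls = "Sam2VideoProcessor"
    supports_text_prompts = False  # SAM2 is box/point-prompted only
```

With the boilerplate on the base, a future SAM3 subclass would only flip `supports_text_prompts` and swap the class names, which is roughly why the real subclass shrank to ~25 lines.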
Removes the integration tests and fixture I'd added for the existing
SAM2ForStream (sam2_rt) — keeping them would require uploading a
segment-anything-2-rt.zip to the test assets bucket, but the goal for
this PR is to leave that untested path alone.

- Deleted tests/integration_tests/models/test_sam2_rt_predictions.py
- Removed SAM2_RT_PACKAGE_URL + sam2_rt_package fixture from conftest.py
- Fixed two docstring references that still said SAM2ForStream when they
  now point at the new SAM2Video.

The SAM2ForStream registry entry itself stays — it's the legacy model
that existed before this branch and we're not touching it.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
Captures the state of this branch while session context is fresh —
what's where, how to test, what needs uploading, known gotchas,
and a sketch of the follow-up "add a model" Claude skill.  Doc is
temporary and should be deleted before the PR merges.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
- Default workflow block to sam2video/small, advertise all four
  Hiera backbones (tiny / small / base-plus / large) via examples
  and get_supported_model_variants.
- Fix Sam2VideoProcessor.add_inputs_to_inference_session call: the
  processor expects input_boxes with 3 nesting levels ([image,
  boxes, coords]); we were passing 4, which raised ValueError on the
  first real-weights prompt. Unit tests missed it because they mock
  the processor — surfaced by end-to-end verify against the uploaded
  sam2video-small.zip.
- Point the inference_models integration fixture URL at
  sam2video-small.zip (the variant that matches the new default).
- Update sam2video workflow-block unit tests to pass the new
  sam2video/small default through the mocks.
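The nesting fix is easiest to see as a shape illustration. `nesting_depth` is a throwaway helper for this sketch, not anything in the codebase:

```python
box = [10.0, 20.0, 110.0, 220.0]  # (x1, y1, x2, y2) in pixels

right = [[box]]    # 3 levels: [image][boxes][coords], what the processor expects
wrong = [[[box]]]  # 4 levels: the extra wrap that raised ValueError

def nesting_depth(x):
    """Count how many list levels deep the first element path goes."""
    depth = 0
    while isinstance(x, list):
        depth, x = depth + 1, x[0]
    return depth
```

Because the mocked processor in the unit tests never checked shapes, only a real-weights call to `add_inputs_to_inference_session` caught the 4-level version.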
hansent force-pushed the claude/sam-video-tracking-inference-models branch from cb663b6 to d7e6945 on April 20, 2026 16:50
roboflow deleted a comment from CLAassistant on Apr 20, 2026
Descope the SAM3 streaming-video work to a follow-up so this PR can
ship SAM2 video alone. SAM3's HF port requires the gated facebook/sam3
checkpoints, which aren't available yet.

Removed:
- inference_models SAM3 model class, registry entry, unit + integration
  tests, and test fixture
- sam3_video workflow block + loader registration + unit tests
- SAM_VIDEO_HANDOFF.md (served its purpose during the branch work)

Left in place:
- HFStreamingVideoBase in inference_models/models/common — reusable,
  SAM2 uses it today and a future SAM3 port can inherit unchanged
- _streaming_video_common workflow helpers — still used by the SAM2
  video block
- SAM2 video class, registry entry, workflow block, and all SAM2 tests

Comments that referenced SAM3Video / test_sam3_video.py have been
generalised or trimmed so nothing dangles.
hansent changed the title from "Claude/sam video tracking inference models" to "Add SAM2 streaming video tracker (inference_models + workflow block)" on Apr 20, 2026
