Add SAM2 streaming video tracker (inference_models + workflow block)#2245
Open
Introduces a SAM3 streaming tracker that mirrors the existing `SAM2ForStream` interface (`prompt` / `track` returning `(masks, object_ids, state_dict)`) so both can be used interchangeably by upstream code.

- `inference_models/models/sam3_rt/sam3_pytorch.py`: `SAM3ForStream` backed by HuggingFace transformers' `Sam3VideoModel` / `Sam3VideoProcessor`. The native `sam3` package's video predictor requires a full video resource upfront; the transformers port exposes `init_video_session` + per-frame `model(frame=...)`, which is the shape we need for InferencePipeline-style streaming.
- Accepts bbox and/or text prompts; `state_dict` is opaque (wraps the HF `Sam3VideoInferenceSession`) and must be kept in memory by the caller — it's not serializable across processes.
- Registers (`segment-anything-3-rt`, `INSTANCE_SEGMENTATION_TASK`, `BackendType.HF`) in `models_registry` alongside the existing SAM2-RT entry.

Tests:

- `inference_models/tests/unit_tests/models/test_sam3_rt.py` — 24 unit tests covering helpers (`_normalise_bboxes`, `_unpack_processed_outputs`, etc.) plus class behaviour using MagicMock model/processor. No weights required.
- `inference_models/tests/integration_tests/models/test_sam2_rt_predictions.py` — new integration suite for the existing `SAM2ForStream` (prompt -> track on synthetic frames, centroid-moves assertion, track-without-prompt raises, `torch.Tensor` input).
- `inference_models/tests/integration_tests/models/test_sam3_rt_predictions.py` — analogous suite for `SAM3ForStream`.
- `conftest.py`: `sam2_rt_package` and `sam3_rt_package` fixtures download zips from rf-platform-models; docstrings list the expected file contents for upload.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
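The `prompt` / `track` contract described above can be sketched as a minimal stand-in. All names below are hypothetical illustrations of the shape of the interface, not the actual `SAM3ForStream` implementation:

```python
from typing import Optional

import numpy as np


class StreamingTrackerSketch:
    """Illustrative stand-in for the prompt/track streaming contract:
    both methods return (masks, object_ids, state_dict), and the caller
    must thread the opaque state_dict back in on every later frame."""

    def prompt(self, frame: np.ndarray, bboxes: Optional[list] = None):
        # Seed a new session from bbox prompts (text prompts omitted here).
        num_objects = len(bboxes or [])
        state = {"frame_idx": 0, "num_objects": num_objects}
        masks = np.zeros((num_objects, *frame.shape[:2]), dtype=bool)
        return masks, list(range(num_objects)), state

    def track(self, frame: np.ndarray, state_dict: dict):
        # Propagate existing objects to the next frame; no new prompts.
        if state_dict is None:
            raise ValueError("track() called before prompt()")
        state_dict["frame_idx"] += 1
        n = state_dict["num_objects"]
        masks = np.zeros((n, *frame.shape[:2]), dtype=bool)
        return masks, list(range(n)), state_dict
```

The key point of the contract is that the session lives entirely in the returned `state_dict`, which is why it cannot cross a process boundary.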
Two new LOCAL-only workflow blocks that drive the inference_models streaming trackers (`SAM2ForStream` / `SAM3ForStream`) from workflows powered by InferencePipeline. Both blocks multiplex a single model instance across many videos by keying state_dicts on `video_metadata.video_identifier`, reset sessions when `frame_number` rolls back, and support three prompt modes: `first_frame`, `every_n_frames`, `every_frame`.

- `inference/core/workflows/core_steps/models/foundation/_streaming_video_common.py`: shared helpers (state bookkeeping, prompt-vs-track decision logic, `sv.Detections` assembly with SAM-assigned tracker_ids).
- `segment_anything2_video/v1.py`: `SAM2VideoTrackerBlockV1` (type: `roboflow_core/segment_anything_2_video@v1`, default model_id: `segment-anything-2-rt`).
- `segment_anything3_video/v1.py`: `SAM3VideoTrackerBlockV1` (type: `roboflow_core/sam3_video@v1`, default model_id: `segment-anything-3-rt`). Additionally accepts text prompts via `class_names`; boxes win when both are supplied.
- Both raise `NotImplementedError` on REMOTE step execution — per-video session state cannot survive a remote boundary.
- Models are loaded via `inference_models.AutoModel.from_pretrained` so backend negotiation / package download / caching flow through the standard inference_models pipeline.
- Registered in `core_steps/loader.py`.

Tests (30 total, all passing, no weights required):

- `test_segment_anything2_video.py` — 10 tests covering manifest, REMOTE rejection, first_frame/every_n_frames/every_frame modes, state threading across track calls, multi-stream isolation, stream-restart detection.
- `test_segment_anything3_video.py` — 9 tests with similar coverage plus text-vs-box prompt routing.
- `test_streaming_video_common.py` — 11 tests for the shared helpers.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
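The prompt-vs-track decision and stream-restart handling described above might look roughly like the following. This is a hypothetical sketch of the logic, not the actual `_streaming_video_common` helpers; the function names are made up:

```python
def should_prompt(mode: str, frame_number: int, n: int = 10) -> bool:
    """Return True when this frame should (re-)seed the tracker instead
    of just propagating existing objects. frame_number assumed 0-based."""
    if mode == "first_frame":
        return frame_number == 0
    if mode == "every_frame":
        return True
    if mode == "every_n_frames":
        return frame_number % n == 0
    raise ValueError(f"unknown prompt mode: {mode}")


def resolve_session(sessions: dict, video_id: str, frame_number: int):
    """Fetch per-video state keyed on the video identifier, discarding it
    when the frame counter rolls back (i.e. the stream restarted)."""
    session = sessions.get(video_id)
    if session is not None and frame_number < session.get("last_frame", -1):
        sessions.pop(video_id)  # stale session from before the restart
        session = None
    return session
```

Keying the dict on the video identifier is what lets one model instance serve many concurrent streams without their sessions bleeding into each other.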
Refactors the streaming trackers into a shared HuggingFace transformers base and adds a `SAM2Video` counterpart to the existing `SAM3Video`. The older sam2_rt (`SAM2ForStream` using Meta's `sam2` camera predictor) is kept untouched — per the feedback it hasn't been exercised much in practice.

Model classes
-------------
- `inference_models/models/common/hf_streaming_video.py`: `HFStreamingVideoBase` containing all the HF streaming boilerplate — session init, prompt/track methods, mask/obj_id extraction, opaque state_dict contract.
- `inference_models/models/sam2_video/sam2_video_hf.py`: `SAM2Video` (lazy-imports `transformers.Sam2VideoModel` / `Sam2VideoProcessor`; rejects text prompts).
- `inference_models/models/sam3_video/sam3_video_hf.py`: `SAM3Video`, moved from the previous sam3_rt path; now a thin ~25-line subclass after the shared base absorbed the helpers (lazy-imports `transformers.Sam3VideoModel` / `Sam3VideoProcessor`; accepts both text and box prompts).

Registry
--------
- `sam2video`: (`INSTANCE_SEGMENTATION_TASK`, `BackendType.HF`) -> `SAM2Video`
- `sam3video`: (`INSTANCE_SEGMENTATION_TASK`, `BackendType.HF`) -> `SAM3Video`
- `segment-anything-2-rt` stays registered against `SAM2ForStream`.
- `segment-anything-3-rt` entry dropped (never released).

Workflow blocks now default to these ids:
- `roboflow_core/segment_anything_2_video@v1` -> `"sam2video"`
- `roboflow_core/sam3_video@v1` -> `"sam3video"`

Tests
-----
- Unit tests: added `test_sam2_video.py` (4 SAM2-specific), renamed `test_sam3_rt.py` -> `test_sam3_video.py` and updated imports (24 tests covering helpers on the shared base plus SAM3 class behaviour).
- Integration tests: renamed SAM3 file, added SAM2 counterpart. New fixtures `sam2_video_package` / `sam3_video_package` (expected zips at `rf-platform-models/sam2video.zip` and `rf-platform-models/sam3video.zip`).
- Workflow block tests updated to use `sam2video` / `sam3video` ids.
- All 58 non-integration tests pass locally.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
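The lazy-import pattern mentioned for `SAM2Video` / `SAM3Video` — deferring the heavyweight `transformers` import until the model is actually constructed — can be illustrated with the stdlib. Here `json` stands in for `transformers`, and the helper name is made up:

```python
import importlib


def lazy_class(module_name: str, class_name: str):
    """Return a zero-arg loader that imports the module only when called,
    so merely defining/registering a model class never pays the import
    cost and never fails on machines missing the optional dependency."""
    def loader():
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    return loader


# Nothing is imported until the loader is invoked:
load_decoder = lazy_class("json", "JSONDecoder")
decoder_cls = load_decoder()  # the import happens here
```

This is why the registry can list HF-backed models without forcing every install to pull in transformers.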
Removes the integration tests and fixture I'd added for the existing `SAM2ForStream` (sam2_rt) — keeping them would require uploading a segment-anything-2-rt.zip to the test assets bucket, but the goal for this PR is to leave that untested path alone.

- Deleted `tests/integration_tests/models/test_sam2_rt_predictions.py`
- Removed `SAM2_RT_PACKAGE_URL` + `sam2_rt_package` fixture from `conftest.py`
- Fixed two docstring references that still said `SAM2ForStream` when they now point at the new `SAM2Video`.

The `SAM2ForStream` registry entry itself stays — it's the legacy model that existed before this branch and we're not touching it.

https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
Captures the state of this branch while session context is fresh — what's where, how to test, what needs uploading, known gotchas, and a sketch of the follow-up "add a model" Claude skill. Doc is temporary and should be deleted before the PR merges. https://claude.ai/code/session_01T3k4sfbkaV3warwV7MHRZN
- Default the workflow block to `sam2video/small`; advertise all four Hiera backbones (tiny / small / base-plus / large) via examples and `get_supported_model_variants`.
- Fix the `Sam2VideoProcessor.add_inputs_to_inference_session` call: the processor expects `input_boxes` with 3 nesting levels (`[image, boxes, coords]`); we were passing 4, which raised `ValueError` on the first real-weights prompt. Unit tests missed it because they mock the processor — surfaced by end-to-end verification against the uploaded sam2video-small.zip.
- Point the inference_models integration fixture URL at sam2video-small.zip (the variant that matches the new default).
- Update sam2video workflow-block unit tests to pass the new `sam2video/small` default through the mocks.
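The nesting fix above comes down to the shape of `input_boxes`. A quick illustration of 3-level versus 4-level nesting (coordinate values are made up; `nesting_depth` is a throwaway helper, not project code):

```python
def nesting_depth(x) -> int:
    """Count how many list levels wrap the innermost scalar."""
    depth = 0
    while isinstance(x, list):
        depth += 1
        x = x[0]
    return depth


# Correct: [images][boxes per image][x1, y1, x2, y2] — 3 levels.
input_boxes = [
    [
        [100, 150, 300, 400],
        [50, 60, 120, 200],
    ]
]

# Wrong: each box wrapped in one extra list — 4 levels, which the
# processor rejected with a ValueError under real weights.
over_nested = [[[[100, 150, 300, 400]]]]
```

Because the unit tests mock the processor, only an end-to-end run with real weights could catch a shape mismatch like this.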
Force-pushed from cb663b6 to d7e6945.
Descope the SAM3 streaming-video work to a follow-up so this PR can ship SAM2 video alone. SAM3's HF port requires the gated `facebook/sam3` checkpoints, which aren't available yet.

Removed:
- inference_models SAM3 model class, registry entry, unit + integration tests, and test fixture
- sam3_video workflow block + loader registration + unit tests
- SAM_VIDEO_HANDOFF.md (served its purpose during the branch work)

Left in place:
- `HFStreamingVideoBase` in `inference_models/models/common` — reusable, SAM2 uses it today and a future SAM3 port can inherit unchanged
- `_streaming_video_common` workflow helpers — still used by the SAM2 video block
- SAM2 video class, registry entry, workflow block, and all SAM2 tests

Comments that referenced `SAM3Video` / test_sam3_video.py have been generalised or trimmed so nothing dangles.
Summary
- `SAM2Video` model in `inference_models/inference_models/models/sam2_video/` — inherits from the new shared `HFStreamingVideoBase` so the streaming (prompt/track) contract is defined once and reused by future HF video trackers.
- Registered as `sam2video` (instance-segmentation, HF backend). Four variants shipped: `sam2video/tiny`, `sam2video/small` (default), `sam2video/base-plus`, `sam2video/large` — all four weight zips uploaded to `rf-platform-models/` and registered on staging via model-registry-sdk#5.
- `roboflow_core/segment_anything_2_video@v1` workflow block (LOCAL-only — raises `NotImplementedError` on remote execution since per-video session state can't cross a process boundary). Multiplexes a single SAM2 predictor across videos by keying state on `WorkflowImageData.video_metadata.video_identifier`.
- Prompt modes: `first_frame` (prompt once, then track), `every_n_frames` (re-seed every N frames), `every_frame` (re-seed every frame).
- Fixed the `input_boxes` nesting (`[image, boxes, coords]` — 3 levels, not 4) that the mocked unit tests missed and only surfaced under real weights.
- The legacy `SegmentAnything2VideoRT` / sam2_rt `SAM2ForStream` (Meta's `sam2` package camera predictor) is untouched by this PR; both remain registered concurrently.

Out of scope for this PR
SAM3 video (streaming) was implemented on this branch alongside SAM2 but stripped in the final commit. The `facebook/sam3` checkpoints are gated on HuggingFace and we're still waiting on access. `HFStreamingVideoBase` is deliberately left in `models/common/` so the SAM3 follow-up can inherit it unchanged.

Test plan
- Unit tests: workflow block (`tests/workflows/unit_tests/core_steps/models/foundation/test_segment_anything2_video.py`, `test_streaming_video_common.py`) + SAM2 HF model (`inference_models/tests/unit_tests/models/test_sam2_video.py`)
- Integration: `sam2video/small` loads via `AutoModel.from_pretrained` + runs `prompt` + `track` end-to-end against api.roboflow.one
- `roboflow/model-registry-sdk` PR "Include `class_id` and `class_list` in inference response" #5
- `InferencePipeline` + a short MP4