
feat(workflows): add Roboflow text-image-pairs model block#2261

Open
joaomarcoscrs wants to merge 5 commits into main from feat/roboflow-text-image-pairs-block

Conversation

@joaomarcoscrs
Contributor

@joaomarcoscrs joaomarcoscrs commented Apr 23, 2026

What

New Workflow block roboflow_core/roboflow_text_image_pairs_model@v1 ("Multimodal Model") — the missing sibling to the other Roboflow project-type blocks. Dispatches hosted (or local) inference for fine-tuned text-image-pairs models by model_id. Supported architectures server-side: PaliGemma 2, Florence 2, Qwen 2.5 VL, Qwen 3 VL, Qwen 3.5, SmolVLM2, SmolVLM 256M.


Why

/infer/lmm/{model_id} already serves these models end-to-end. The only gap was a Workflow block wrapper — these project types couldn't be used in workflows despite the backend being ready.

Shape

  • Inputs: images, model_id, optional prompt, disable_active_learning, active_learning_target_dataset
  • Output: single response of kind LANGUAGE_MODEL_OUTPUT_KIND — raw pass-through, no parsing
  • Local path: LMMInferenceRequest (AL controls honored)
  • Remote path: InferenceHTTPClient.infer_lmm(model_id_in_path=True) with InferenceConfiguration(source="workflow-execution", ...)
  • Composes with vlm_as_detector / vlm_as_classifier downstream for Florence-2 fine-tunes today
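The local/remote dispatch described above can be sketched as follows. The manager and client classes here are simplified stand-ins, not the real `ModelManager` / `InferenceHTTPClient` APIs; only the control flow (per-image dispatch by execution mode, raw pass-through of `response`) is taken from the PR description:

```python
# Simplified stand-ins for the real ModelManager and InferenceHTTPClient;
# class and method bodies are illustrative, not the inference SDK API.
class LocalModelManager:
    def infer_from_request_sync(self, model_id, request):
        return {"response": f"local:{model_id}:{request['prompt']}"}


class RemoteClient:
    def infer_lmm(self, inference_input, model_id):
        return {"response": f"remote:{model_id}"}


def run_block(step_execution_mode, images, model_id, prompt,
              manager=None, client=None):
    """Mirror of the block's dispatch: one raw response per image,
    passed through unparsed as LANGUAGE_MODEL_OUTPUT_KIND."""
    predictions = []
    for image in images:
        if step_execution_mode == "local":
            result = manager.infer_from_request_sync(
                model_id=model_id,
                request={"image": image, "prompt": prompt or ""},
            )
        else:
            result = client.infer_lmm(inference_input=image, model_id=model_id)
        predictions.append({"response": result["response"]})
    return predictions


preds = run_block("local", ["img-0"], "project/3", "describe",
                  manager=LocalModelManager())
```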

Known limitations (documented in code)

inference_sdk.infer_lmm doesn't yet accept AL kwargs or honor max_batch_size / max_concurrent_requests. AL controls therefore propagate only on the local path; the remote path still calls client.configure(...) so that the source field tags telemetry. SDK-side follow-up planned.

Tested

Follow-ups (separate PRs)

  • Extend inference_sdk.infer_lmm to accept AL kwargs + forward InferenceConfiguration.source
  • Add paligemma / qwen / smolvlm strategies to vlm_as_detector / vlm_as_classifier so non-Florence-2 fine-tunes get structured parsing

Note

Medium Risk
Adds a new workflow model block that issues local or remote LMM/VLM inference requests and exposes active-learning controls, which could affect runtime behavior and external API usage if misconfigured.

Overview
Adds a new Workflow block, roboflow_core/roboflow_text_image_pairs_model@v1, to run Roboflow text-image-pairs (multimodal/VLM) models by model_id and return the raw response as LANGUAGE_MODEL_OUTPUT_KIND.

The block supports both local execution (via ModelManager + LMMInferenceRequest, including active-learning flags) and remote execution (via InferenceHTTPClient.infer_lmm with workflow-specific InferenceConfiguration), and is registered in the workflow loader so it becomes available to workflows. Unit tests were added to validate the new block manifest and output contract.

Reviewed by Cursor Bugbot for commit e5010b2.

@joaomarcoscrs joaomarcoscrs self-assigned this Apr 23, 2026
prediction = self._model_manager.infer_from_request_sync(
    model_id=model_id, request=request
)
predictions.append({"response": prediction.response})

Response may be dict, breaking downstream string-expecting blocks

Medium Severity

LMMInferenceResponse.response is typed Union[str, dict], and the block passes it through raw as the response output. However, the declared output kind LANGUAGE_MODEL_OUTPUT_KIND has internal_data_type="str", and downstream blocks like vlm_as_detector call string2json(raw_json=vlm_output) which runs a regex .findall() on the value — this will fail if response is a dict (e.g. Florence-2 fine-tunes return structured dicts). The Florence-2 block avoids this by wrapping with json.dumps(). Both the local path (prediction.response) and remote path (result.get("response")) are affected.
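A minimal sketch of the fix this review implies, mirroring the json.dumps() wrapping the Florence-2 block already uses (the helper name is hypothetical):

```python
import json


def normalize_lmm_response(response):
    """Coerce a Union[str, dict] LMM response into the str form that
    LANGUAGE_MODEL_OUTPUT_KIND declares (internal_data_type="str"),
    so downstream string2json-style consumers can regex-parse it."""
    if isinstance(response, str):
        return response
    # Florence-2 fine-tunes return structured dicts; serialize them
    # the same way the Florence-2 block does.
    return json.dumps(response)
```

Applying this on both the local path (`prediction.response`) and the remote path (`result.get("response")`) would close the gap.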


Reviewed by Cursor Bugbot for commit af1d82f.

@hansent
Collaborator

hansent commented Apr 23, 2026

don't we have individual blocks for most of those?

@joaomarcoscrs
Contributor Author

@hansent we have them for the base models, not fine-tunes


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).



Reviewed by Cursor Bugbot for commit e5010b2.

    prompt=prompt or "",
    disable_active_learning=disable_active_learning,
    active_learning_target_dataset=active_learning_target_dataset,
)

AL controls silently dropped by LMMInferenceRequest

Medium Severity

LMMInferenceRequest inherits from CVInferenceRequest, which does not define disable_active_learning or active_learning_target_dataset fields. Those fields live on ObjectDetectionInferenceRequest and ClassificationInferenceRequest instead. Because Pydantic v2 defaults to extra='ignore', the two AL kwargs passed to the constructor are silently discarded. The local path therefore does not actually honor active learning controls, despite the code, comments, and PR description claiming it does.
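The silent drop can be reproduced with a minimal Pydantic v2 model. The classes below are simplified stand-ins for the real request types, not the inference codebase itself:

```python
from pydantic import BaseModel, ConfigDict


class CVInferenceRequest(BaseModel):
    # protected_namespaces=() silences the "model_" field-name warning;
    # extra="ignore" is the Pydantic v2 default, made explicit here.
    model_config = ConfigDict(extra="ignore", protected_namespaces=())
    model_id: str


class LMMInferenceRequest(CVInferenceRequest):
    prompt: str = ""
    # Note: no disable_active_learning / active_learning_target_dataset fields.


request = LMMInferenceRequest(
    model_id="project/1",
    prompt="describe",
    disable_active_learning=True,              # silently discarded
    active_learning_target_dataset="dataset",  # silently discarded
)
```

`request.model_dump()` contains only `model_id` and `prompt`; the two AL kwargs leave no trace, which is exactly the behavior the review describes.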


Reviewed by Cursor Bugbot for commit e5010b2.

@hansent
Collaborator

hansent commented Apr 23, 2026

> @hansent we have them for the base models, not fine-tunes

I think on those blocks you can set the model id to your fine-tuned model.

not opposed to having a general block like we do for object detection / classification, but I think the parameter / config on the LMM models is pretty different from model to model sometimes?
