Add Llama4VisionModel for multimodal decoding #1809
Description
This PR implements a complete `Llama4VisionModel` class by integrating all of the basic Llama4 vision components. It enables Llama4 multimodal decoding: given an input image, the model can describe it. Joint work with @hengtaoguo and @aireenmei.

- `Llama4VisionModel` converts image tiles of shape (batch_size, num_tiles, C, H, W) into feature activations of shape (batch_size, num_tiles, num_patches, vision_output_dim_for_vit). `Llama4MultiModalProjector` then projects them to (batch_size, num_tiles, num_patches, base_emb_dim). Example: (8, 5, 3, 336, 336) -> (8, 5, 144, 4096) -> (8, 5, 144, 5120). A hedged shape-flow sketch follows this list.
- `get_dummy_image_shape_for_init()` provides the desired dummy-image shape for different models, used to create dummy images for jit initialization.
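To make the shape flow above concrete, here is a minimal, self-contained JAX/Flax sketch. Everything in it is an illustrative assumption: `ToyMultiModalProjector`, `dummy_image_shape_for_init`, and the hard-coded shape table are stand-ins for this sketch only, not the actual `Llama4VisionModel`, `Llama4MultiModalProjector`, or `get_dummy_image_shape_for_init()` added in this PR.

```python
# Illustrative-only sketch of the tile -> vision-feature -> embedding shape flow.
import jax
import jax.numpy as jnp
from flax import linen as nn

# Hypothetical per-model dummy tile shape (num_tiles, channels, height, width),
# mirroring the idea behind get_dummy_image_shape_for_init().
_DUMMY_TILE_SHAPE = {"llama4": (5, 3, 336, 336)}


def dummy_image_shape_for_init(model_name: str, batch_size: int):
  """Returns a full dummy-image shape so jit init can trace the vision path."""
  return (batch_size,) + _DUMMY_TILE_SHAPE[model_name]


class ToyMultiModalProjector(nn.Module):
  """Stand-in projector: maps vision features to the decoder embedding width."""
  base_emb_dim: int = 5120

  @nn.compact
  def __call__(self, features):
    # (batch, num_tiles, num_patches, vision_output_dim_for_vit)
    #   -> (batch, num_tiles, num_patches, base_emb_dim)
    return nn.Dense(self.base_emb_dim, use_bias=False)(features)


# Dummy image tiles, e.g. (8, 5, 3, 336, 336).
image_tiles = jnp.zeros(dummy_image_shape_for_init("llama4", batch_size=8))

# Stand-in for the vision model's output; in the real model these activations
# come from the vision encoder applied to `image_tiles`.
num_patches, vision_output_dim_for_vit = 144, 4096
vision_features = jnp.zeros(
    image_tiles.shape[:2] + (num_patches, vision_output_dim_for_vit))

projector = ToyMultiModalProjector()
params = projector.init(jax.random.PRNGKey(0), vision_features)
embeddings = projector.apply(params, vision_features)

print(image_tiles.shape)      # (8, 5, 3, 336, 336)
print(vision_features.shape)  # (8, 5, 144, 4096)
print(embeddings.shape)       # (8, 5, 144, 5120)
```

The dummy-image helper is what lets `model.init` trace the vision path under jit without real image data; the projected embeddings are then concatenated with the text-token embeddings at `base_emb_dim` width for decoding.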
Tests
Tested full multimodal decode on a v5p-16 cluster with this command; a screenshot of the running workload is attached.

Checklist
Before submitting this PR, please make sure (put X in square brackets):