Add Phi-3.5-vision #36036

Dahlbomii · 2025-02-04T22:20:42Z

Draft PR for now, still need to add tests and convert to modular

cc @Rocketknight1

Fixes #36071

Rocketknight1 · 2025-02-05T17:19:10Z

Looks good so far! The TODO list from here is:

Make the transformers imports relative imports so that we stop getting circular import errors in the CI
Add the phi3_5.md doc (you can copy the layout from a similar model and copy the text from the model card)
pip install transformers[quality] and make fixup or make style to get the repo consistency checks green

Once everything is green, then:

Add tests (you can copy them from another similar VLM and just change model names etc.)

Once the new tests are green as well, then:

Convert the modeling and configuration files to modular, then regenerating the modeling/config files and confirm everything passes

At that point, the PR should be finished!

ArthurZucker · 2025-02-13T15:40:39Z

Fixes #36166!

Dahlbomii · 2025-02-24T00:21:41Z

I've run into an issue when trying to run the model. From Modeling_phi3_v.py I'm getting "AttributeError: 'DynamicCache' object has no attribute 'get_max_length'" This might be due to a change to the DynamicCache class, but I really can't tell!

zucchini-nlp · 2025-02-24T08:39:50Z

@Dahlbomii hey! Yes, we deprecated get_max_length from cache a few versions ago, now it has to be Cache.get_max_cache_shape(layer_idx)

Rocketknight1 · 2025-02-24T17:18:48Z

Woah - this means the remote code version of Phi-3.5-vision-instruct is broken on main right now and requires a fixed older version of transformers. Definitely makes this PR more urgent/important!

Rocketknight1 · 2025-02-28T15:48:17Z

cc @zucchini-nlp @gante I can't figure this one out either! The test_generate_continue_from_inputs_embeds test is failing for Phi-3V but I don't really understand how generate() handles inputs_embeds - when I added some breakpoints it seems to generate an input_ids tensor with shape (batch_size, 0), which then causes errors in Phi.

zucchini-nlp · 2025-02-28T15:56:22Z

Seems like Phi-3.5V was not updated for the latest changes we had in generation, probably it has its own custom generation loop. For this test, we can call super().prepare_input_for_generation(kwargs) to get all the inputs processed correctly. Then image related kwargs can be set to None if needed. You can take a look on how we did it in other VLMs, hope it solves the issue

Rocketknight1 · 2025-02-28T16:18:54Z

@zucchini-nlp yes, that was it, sorry! They don't override generate() but they do override prepare_input_for_generation() and discard inputs_embeds after the first step.

Dahlbomii · 2025-03-14T08:49:20Z

@Rocketknight1 for reasons that escape me, when the test processor tries to call self.get_component for the tokenizer, it breaks. Any insight as to why?

Dahlbomii · 2025-04-01T09:10:58Z

@Rocketknight1 ALRIGHT I got most of the tests into the green, but I have no idea what I'm doing with the chat template stuff!

Rocketknight1 · 2025-04-01T14:51:36Z

test_apply_chat_template() isn't really necessary - I think it just copied over from smolvlm! You can remove it.

…s fix

Dahlbomii · 2025-04-07T21:46:40Z

@Rocketknight1 With that I think it might be ready for review

Rocketknight1

Congrats on the PR! This looks almost ready, and the modular bit looks like it was really annoying. There are a lot of classes that look like they should be inheritable from somewhere else in the codebase, but they're just different enough that you can't.

cc @zucchini-nlp I made some comments, but you're more familiar with VLMs than me, can you review and see if anything else should be changed before we ping a core maintainer?

Rocketknight1 · 2025-04-08T16:04:49Z

src/transformers/models/phi3_v/modular_phi3_v.py

+class Phi3RMSNorm(nn.Module):
+    def __init__(self, hidden_size, eps=1e-6):
+        """
+        Phi3RMSNorm is equivalent to T5LayerNorm
+        """
+        super().__init__()
+        self.weight = nn.Parameter(torch.ones(hidden_size))
+        self.variance_epsilon = eps
+
+    def forward(self, hidden_states):
+        input_dtype = hidden_states.dtype
+        hidden_states = hidden_states.to(torch.float32)
+        variance = hidden_states.pow(2).mean(-1, keepdim=True)
+        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+        return self.weight * hidden_states.to(input_dtype)


I think this can just be inherited from a class like T5LayerNorm!

Rocketknight1 · 2025-04-08T16:06:36Z

src/transformers/models/phi3_v/modular_phi3_v.py

+class Phi3RotaryEmbedding(nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+        super().__init__()
+
+        self.dim = dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+        self.register_buffer("inv_freq", None, persistent=False)
+
+    @torch.no_grad()
+    def forward(self, x, position_ids, seq_len=None):
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        if self.inv_freq is None:
+            self.inv_freq = 1.0 / (
+                self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
+            )
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # Force float32 since bfloat16 loses precision on long contexts
+        # See https://github.com/huggingface/transformers/pull/29285
+        device_type = x.device.type
+        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos()
+            sin = emb.sin()
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


Although the code is a little different, I think this could be inherited from another RotaryEmbedding class in the library, with the same outputs. Maybe open_llama?

I think it is pretty much same as phi3 or phi3-MoE. We also support rope scaling and dynamic rope with unified API, adding a decorator will do the thing

Rocketknight1 · 2025-04-08T16:08:06Z

src/transformers/models/phi3_v/modular_phi3_v.py

+class Phi3MLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+
+        self.config = config
+        self.gate_up_proj = nn.Linear(config.hidden_size, 2 * config.intermediate_size, bias=False)
+        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
+
+        self.activation_fn = ACT2FN[config.hidden_act]
+
+    def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+        up_states = self.gate_up_proj(hidden_states)
+
+        gate, up_states = up_states.chunk(2, dim=-1)
+        up_states = up_states * self.activation_fn(gate)
+
+        return self.down_proj(up_states)


I think this could be inherited from Phi3!

Rocketknight1 · 2025-04-08T16:09:17Z

src/transformers/models/phi3_v/modular_phi3_v.py

+        return self.down_proj(up_states)
+
+
+class Phi3Attention(nn.Module):


Although the code is a little different, I think you could maybe inherit the layer from Phi3 and not need this code here! (But you'd have to test to be sure)

+1, seems pretty much same. The qkv can be split in conversion script if needed and we can inherit from Phi3, which also adds new attention interface (easier TGI, vLLM integrations)

Rocketknight1 · 2025-04-08T16:11:22Z

src/transformers/models/phi3_v/modular_phi3_v.py

+}
+
+
+class Phi3DecoderLayer(nn.Module):


Although the code is a little different, I think this could inherit from the equivalent class in Phi3 without output changes (maybe!)

Rocketknight1 · 2025-04-08T16:12:44Z

src/transformers/models/phi3_v/modular_phi3_v.py

+    _supports_sdpa = False
+    _supports_cache_class = True
+
+    _version = "0.0.5"


Suggested change

_version = "0.0.5"

Probably unnecessary!

zucchini-nlp

@Dahlbomii thanks a lot for working on this!

The PR seems to be adapted from custom code in the hub mostly, and I realize that custom code is outdated and doesn't follow transformers standards. Would be nice to do little bit more clean up before merging, by unifying text and vision backbones as AutoModel (I guess it is identical to CLIP and Phi3) and leave only the Base/ConditionalLM multimodal class in modular

I left a few comments below on how we can do that. LMK if that makes sense

zucchini-nlp · 2025-04-08T16:31:29Z

docs/source/en/model_doc/phi3_v.md

@@ -0,0 +1,59 @@
+<!--Copyright 2024 The HuggingFace Team. All rights reserved.


zucchini-nlp · 2025-04-08T16:32:09Z

docs/source/en/model_doc/phi3_v.md

+
+
+


Let's add the abstract and finalize docs before pinging core maintainer

zucchini-nlp · 2025-04-08T16:33:25Z

src/transformers/models/auto/modeling_auto.py

@@ -574,6 +575,7 @@
        ("persimmon", "PersimmonForCausalLM"),
        ("phi", "PhiForCausalLM"),
        ("phi3", "Phi3ForCausalLM"),
+        ("phi3_v", "Phi3VForCausalLM"),


has to be in ImageTextToText mapping when it works with "image+text". AutoCausalLM is currently reserved for text modality only

zucchini-nlp · 2025-04-08T16:44:15Z

src/transformers/models/phi3_v/modular_phi3_v.py

+class Phi3RotaryEmbedding(nn.Module):
+    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
+        super().__init__()
+
+        self.dim = dim
+        self.max_position_embeddings = max_position_embeddings
+        self.base = base
+        self.register_buffer("inv_freq", None, persistent=False)
+
+    @torch.no_grad()
+    def forward(self, x, position_ids, seq_len=None):
+        # x: [bs, num_attention_heads, seq_len, head_size]
+        if self.inv_freq is None:
+            self.inv_freq = 1.0 / (
+                self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64, device=x.device).float() / self.dim)
+            )
+        inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
+        position_ids_expanded = position_ids[:, None, :].float()
+        # Force float32 since bfloat16 loses precision on long contexts
+        # See https://github.com/huggingface/transformers/pull/29285
+        device_type = x.device.type
+        device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
+        with torch.autocast(device_type=device_type, enabled=False):
+            freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
+            emb = torch.cat((freqs, freqs), dim=-1)
+            cos = emb.cos()
+            sin = emb.sin()
+        return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)


I think it is pretty much same as phi3 or phi3-MoE. We also support rope scaling and dynamic rope with unified API, adding a decorator will do the thing

zucchini-nlp · 2025-04-08T16:45:29Z

src/transformers/models/phi3_v/modular_phi3_v.py

+    def _init_rope(self):
+        if self.rope_scaling is None:
+            self.rotary_emb = Phi3RotaryEmbedding(
+                self.head_dim,
+                max_position_embeddings=self.max_position_embeddings,
+                base=self.rope_theta,
+            )
+        else:
+            scaling_type = self.config.rope_scaling["type"]
+            if scaling_type == "su":
+                self.rotary_emb = Phi3SuScaledRotaryEmbedding(self.head_dim, self.config)
+            elif scaling_type == "yarn":
+                self.rotary_emb = Phi3YarnScaledRotaryEmbedding(self.head_dim, self.config)
+            else:
+                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")


we init RoPE once per model in base class, this is an old way and has to be removed

zucchini-nlp · 2025-04-08T17:10:01Z

src/transformers/models/phi3_v/processing_phi3_v.py

+        images: ImageInput = None,
+        padding: Union[bool, str, PaddingStrategy] = False,
+        truncation: Union[bool, str, TruncationStrategy] = None,
+        max_length=None,
+        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
+        add_special_tokens: bool = True,
+    ) -> BatchFeature:


let's add kwargs with standard processing API. For ex in llava

transformers/src/transformers/models/llava_next/processing_llava_next.py

Lines 32 to 41 in 08e3217

class LlavaNextProcessorKwargs(ProcessingKwargs, total=False):

_defaults = {

"text_kwargs": {

"padding": False,

},

"images_kwargs": {

"do_pad": True,

},

}

zucchini-nlp · 2025-04-08T17:11:09Z

src/transformers/models/phi3_v/processing_phi3_v.py

+    def calc_num_image_tokens(self, images: ImageInput):
+        """Calculate the number of image tokens for each image.
+        Args:
+            images (`ImageInput`):
+                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
+                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
+        """
+        return self.image_processor.calc_num_image_tokens(images)
+
+    def calc_num_image_tokens_from_image_size(self, width, height):
+        """Calculate the number of image token for an image with given width and height.
+        Args:
+            width (`int`):
+                Width of the image.
+            height (`int`):
+                Height of the image.
+        """
+        return self.image_processor.calc_num_image_tokens_from_image_size(width, height)
+


can we move these all from image processor to processor?

zucchini-nlp · 2025-04-08T17:15:48Z

src/transformers/models/phi3_v/processing_phi3_v.py

+        input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)
+        attention_mask = (input_ids > -1000000).to(torch.long)


inputs are converted to tensor only if asked by users, so we have to put it in Batchfeature and let it handle all type conversion

zucchini-nlp · 2025-04-08T17:19:31Z

src/transformers/models/phi3_v/processing_phi3_v.py

+
+        pattern = r"<\|image_\d+\|>"
+        prompt_chunks = [self.tokenizer(chunk).input_ids for chunk in re.split(pattern, texts)]
+
+        if "num_img_tokens" in images:
+            num_img_tokens = images["num_img_tokens"]
+        else:
+            assert "num_crops" in images, "num_crops must be provided in images if num_img_tokens is not provided"
+            num_crops = images["num_crops"]
+            num_img_tokens = [_num_crops * self.num_img_tokens for _num_crops in num_crops]
+
+        images, image_sizes = images["pixel_values"], images["image_sizes"]
+
+        # image_tags needs to start from 1 to n
+        image_tags = re.findall(pattern, texts)
+        # image_ids = [int(s.split("|")[1].split("_")[-1]) * -1 for s in image_tags]
+        # image_ids_pad = [[iid]*num_img_tokens[i] for i, iid in enumerate(image_ids)]
+        image_ids = [int(s.split("|")[1].split("_")[-1]) for s in image_tags]
+        unique_image_ids = sorted(set(image_ids))
+        # image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be [1, 4, 5]
+        # check the condition
+        assert unique_image_ids == list(range(1, len(unique_image_ids) + 1)), (
+            f"image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be {unique_image_ids}"
+        )
+        # total images must be the same as the number of image tags
+        assert len(unique_image_ids) == len(images), (
+            f"total images must be the same as the number of image tags, got {len(unique_image_ids)} image tags and {len(images)} images"


this looks too complicated! From what I see adding special number on image tokens is not needed, since what happens here is

text = "<|image|> What is this?" text_after_separator = " <placeholder> <placeholder> [<placeholder> .... ] What is this?""

This is same as in many other VLMs, and the simplest case is LLaVA. Let's clean up a bit and make nececssary changes to processor config when needed

zucchini-nlp · 2025-04-08T17:21:02Z

tests/models/phi3_v/test_modeling_phi3_v.py

+            token_labels,
+            choice_labels,
+        ) = config_and_inputs
+        inputs_dict = {"input_ids": input_ids, "attention_mask": input_mask}


pixel_values have to be in inputs, so we test same way as model will be used by users

Dahlbomii force-pushed the add_phi3-5_vision branch from 4911ef9 to 7e16056 Compare February 11, 2025 23:02

Rocketknight1 mentioned this pull request Feb 13, 2025

Add phi3 Vision #36166

Open

2 tasks

Dahlbomii force-pushed the add_phi3-5_vision branch from 1f31710 to 0848b01 Compare March 3, 2025 23:06

Rocketknight1 mentioned this pull request Mar 10, 2025

modeling_phi3 errors with AttributeError: 'DynamicCache' object has no attribute 'get_max_length' #36071

Closed

4 tasks

Rocketknight1 marked this pull request as ready for review March 10, 2025 19:09

github-actions bot requested review from ArthurZucker and Rocketknight1 March 10, 2025 19:09

Rocketknight1 force-pushed the add_phi3-5_vision branch from 52f6b48 to 56c8b9f Compare March 10, 2025 19:11

Dahlbomii force-pushed the add_phi3-5_vision branch from 404d2a7 to f4684f7 Compare April 2, 2025 21:48

Dahlbomii and others added 10 commits April 7, 2025 18:35

added phi3_5 to transformers\models with .py files from original

1947f6e

added boilerplate code for phi3V

bfa6093

CI fixes

53d1c3b

CI fixes, round two

26eff93

fixed various import issues

e49678e

additional import fixes

ea89cb6

hopefully the last of the import fixes

5b5193e

Test commit to fix import structure

9fbc77b

More init cleanup

3cc825e

Now that I see the actual problem, here's the REAL final import error…

6ea81b3

…s fix

Dahlbomii and others added 26 commits April 7, 2025 18:35

changed _attn_implementation to eager

168f164

Push a half-hearted image processor test

989269f

various changes attempting to make ground on the image processor tests

0e9abe7

All tests passing except test_call_pytorch, still working on that!

b953d95

Nevermind, all green

7d27a1e

Cleaned up some frass and remembered to run make fix-copies

2277b76

Pushing an equally half-hearted processor test

7955076

make style

14e1876

fill docstring

94b8dcf

Add chat template init

ee71564

Small adjustments to test processor

964bac6

Stop trying to use a slow processor bc it's cursed

3859260

test processor updates and slight changes to processing

6bad26a

removed unnecessary tests

014a6d7

maked fix-copies

3ef5840

maked fix-copies

ceea689

Modular file added

6cff2b9

Minor adjustments to modular and test processor

f62b912

Minor adjustment to modular it's FINE

203b674

removed vestigial import

2717d14

More import changes

9a735a2

missing variable

fd36d79

various fixes

d8d6423

import adjustments

c0b024d

_flash_supports_window_size only in flashattn contexts

a721e00

Move things around to stop annoying modular

c26b108

Rocketknight1 force-pushed the add_phi3-5_vision branch from 08109e0 to c26b108 Compare April 7, 2025 17:35

Rocketknight1 reviewed Apr 8, 2025

View reviewed changes

zucchini-nlp reviewed Apr 9, 2025

View reviewed changes

		return self.down_proj(up_states)


		class Phi3Attention(nn.Module):

		@@ -0,0 +1,59 @@
		<!--Copyright 2024 The HuggingFace Team. All rights reserved.

	class LlavaNextProcessorKwargs(ProcessingKwargs, total=False):
	_defaults = {
	"text_kwargs": {
	"padding": False,
	},
	"images_kwargs": {
	"do_pad": True,
	},
	}

		input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)
		attention_mask = (input_ids > -1000000).to(torch.long)

Add Phi-3.5-vision #36036

Are you sure you want to change the base?

Add Phi-3.5-vision #36036

Uh oh!

Conversation

Dahlbomii commented Feb 4, 2025 • edited by Rocketknight1 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Rocketknight1 commented Feb 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ArthurZucker commented Feb 13, 2025

Uh oh!

Dahlbomii commented Feb 24, 2025

Uh oh!

zucchini-nlp commented Feb 24, 2025

Uh oh!

Rocketknight1 commented Feb 24, 2025

Uh oh!

Rocketknight1 commented Feb 28, 2025

Uh oh!

zucchini-nlp commented Feb 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Rocketknight1 commented Feb 28, 2025

Uh oh!

Dahlbomii commented Mar 14, 2025

Uh oh!

Dahlbomii commented Apr 1, 2025

Uh oh!

Rocketknight1 commented Apr 1, 2025

Uh oh!

Dahlbomii commented Apr 7, 2025

Uh oh!

Rocketknight1 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zucchini-nlp left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Dahlbomii commented Feb 4, 2025 •

edited by Rocketknight1

Loading

Rocketknight1 commented Feb 5, 2025 •

edited

Loading

zucchini-nlp commented Feb 28, 2025 •

edited

Loading

zucchini-nlp left a comment •

edited

Loading