Clarification on `dynamic_img_size` and `img_size` parameters in timm models #2414

vadori · 2025-01-17T17:54:20Z

vadori
Jan 17, 2025

Hi!

I have some questions regarding the dynamic_img_size and img_size parameters when creating a timm model and loading pre-trained weights. From my understanding, setting patch_size and img_size interpolates the patch embeddings and the conv2d projection layer, respectively, to match the specified values if they differ from those used during the model's pretraining.

However, I’m a bit unclear on the specific role of enabling dynamic_img_size.

Does this option allow the model to handle varying input sizes dynamically during training and inference?
Is this achieved through interpolation, similar to what happens when specifying a different img_size?
Would it be correct to say that a use case for setting both img_size and dynamic_img_size=True is to train the model on a fixed img_size while allowing inference on images of varying sizes?
Alternatively, could another use case involve initializing the model with a different img_size (compared to pretraining) and then allowing flexibility to process various sizes during training?
Lastly, if all processed images are of the same size, could enabling dynamic_img_size=True introduce any performance drawbacks?

Thank you in advance for your insights!

Answered by rwightman

rwightman · 2025-01-17T21:39:54Z

rwightman
Jan 17, 2025
Maintainer

@vadori so, yeah a bit confusing...

Changing img_size or patch_size is a step change that permanently changes the corresponding value for the model, interpolating the original pretrained values once at load time. These values change the model configuration so that the dimensions of some parameters differ; each combination of values is essentially a different variation of the model architecture.

Just like the original model, once resized the model expects all inputs to match the new size.

To load and use the model correctly after it has been trained or fine-tuned with values that differ from the original pretrained weights, you have to keep using the model with those same values (either by defining a new model or by using an existing model definition with matching sizes). You can of course treat those weights as 'pretrained' again and load them into a model with yet another img and/or patch size, but I don't think I've ever found a need for that...
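To make the "different parameter dimensions" point concrete, here is a small arithmetic sketch. It assumes a plain ViT layout (one class token, a conv patch projection with a square kernel equal to the patch size, embedding width 768); the helper `vit_param_shapes` is hypothetical and is not part of timm.

```python
def vit_param_shapes(img_size, patch_size, embed_dim=768, in_chans=3, num_prefix_tokens=1):
    """Shapes of the two size-dependent parameters in a plain ViT (assumed layout)."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    grid = img_size // patch_size  # tokens per side
    return {
        # Conv projection: one kernel of patch_size x patch_size per output channel.
        "patch_proj_weight": (embed_dim, in_chans, patch_size, patch_size),
        # One position vector per patch token, plus the prefix (class) token.
        "pos_embed": (1, grid * grid + num_prefix_tokens, embed_dim),
    }


base = vit_param_shapes(224, 16)    # original pretrained configuration
larger = vit_param_shapes(384, 16)  # img_size changed: pos_embed grows
finer = vit_param_shapes(224, 8)    # patch_size changed: both shapes change

for name, shapes in [("base", base), ("larger", larger), ("finer", finer)]:
    print(name, shapes)
```

Because the shapes differ, a checkpoint saved from one configuration only loads cleanly into a model built with the same img_size and patch_size.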

Now, dynamic_img_size (and the related dynamic_img_pad) do not alter the sizes of any parameters; they remain as they were. It's the same model, and the weights are compatible without any interpolation or adjustment. Aside from having the functionality enabled, the weights remain the same. Every input can be a different size, and the parts of the model that need to match the input size (the position embedding) are scaled to match it. So for each batch at train or inference time you can use a different size.
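The idea of "stored parameter stays fixed, a resized copy is computed per input" can be sketched in pure Python. This is only an illustration: it uses scalar values instead of embedding vectors and a simple corner-aligned bilinear scheme as a stand-in for the PyTorch interpolation a real model would use. `resize_grid` and the tiny 2x2 `stored` grid are hypothetical.

```python
def resize_grid(grid, out_h, out_w):
    """Bilinearly resample a 2D list of floats to (out_h, out_w), corners aligned."""
    in_h, in_w = len(grid), len(grid[0])

    def src_coord(i, n_out, n_in):
        # Map an output index to a fractional input index, keeping corners fixed.
        return 0.0 if n_out == 1 else i * (n_in - 1) / (n_out - 1)

    out = []
    for r in range(out_h):
        y = src_coord(r, out_h, in_h)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            x = src_coord(c, out_w, in_w)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Stands in for a learned position grid; it is never modified below.
stored = [[0.0, 1.0], [2.0, 3.0]]

# Each "batch" may need a different token grid; a fresh resized copy is built each time.
for grid_h, grid_w in [(2, 2), (3, 3), (2, 5)]:
    resized = resize_grid(stored, grid_h, grid_w)
    print((grid_h, grid_w), resized)
```

The per-batch resampling is the extra work done at run time; the stored grid, and therefore the checkpoint, is untouched.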

There is a slight runtime hit for dynamic image size because of the interpolation, but for most models, especially larger ones, I feel it's negligible.

I do feel there is another tradeoff to consider: there is an optimal range over which a given absolute position embedding can be interpolated to a different resolution. With dynamic you are always interpolating from the original size. Take a patch16 224x224 ViT: your pos embed maps to a 14x14 grid (if you unflatten it), and any input size you throw at it is handled by interpolating between those points, so the pos embed remains quite coarse compared to, say, the 32x32 feature grid you get if you pass in a 512x512 image.
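The grid sizes quoted above follow from dividing the image side by the patch size. The short sketch below reproduces them and computes how far the stored grid has to be stretched per side; the 448 comparison is a hypothetical value chosen for illustration and does not come from the discussion.

```python
def token_grid(img_size, patch_size):
    """Number of patch tokens along one side of a square image."""
    return img_size // patch_size


def stretch_factor(model_img_size, input_img_size, patch_size=16):
    """How much the stored pos-embed grid is stretched per side for a given input."""
    return token_grid(input_img_size, patch_size) / token_grid(model_img_size, patch_size)


print(token_grid(224, 16), token_grid(512, 16))  # 14 32

# Model kept at 224, fed 512x512 inputs via dynamic sizing.
print(round(stretch_factor(224, 512), 2))  # 2.29

# Hypothetical: model first re-sized and fine-tuned at 448, then fed 512x512.
print(round(stretch_factor(448, 512), 2))  # 1.14
```

A factor near 1 means the learned grid is already close to the resolution being used, which is the motivation for re-sizing the model before relying on dynamic sizing.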

If you have considerable data to fine-tune or train a model at a different resolution (or range of resolutions), I feel it's better to adjust img_size / patch_size to fit your range and then fine-tune or train, so you have a finer-grained pos embed. After that, consider using dynamic_img_size if you need the flexibility to use different sizes at train (or especially inference) time around that new size.

vadori · 2025-01-22T13:59:43Z

vadori
Jan 22, 2025
Author

Great, thank you very much for your response, @rwightman!
