Hi! I have some questions regarding the `dynamic_img_size` and `img_size` parameters when creating a timm model and loading pre-trained weights. From my understanding, setting `patch_size` and `img_size` interpolates the patch embeddings and the conv2d projection layer, respectively, to match the specified values if they differ from those used during the model's pretraining. However, I'm a bit unclear on the specific role of enabling `dynamic_img_size`.
Thank you in advance for your insights!
Replies: 2 comments
@vadori so, yeah, a bit confusing...

Changing `img_size` and `patch_size` is a step change that permanently changes the corresponding value for the model, interpolating the original pretrained values once at load time. Those values change the model configuration such that the dimensions of some parameters are different; it is essentially a different model architecture variation for each combination of those values.

Just like the original model, once resized the model expects all inputs to match the new size.

To load and use the model correctly again after it's been trained or fine-tuned with values different from the original pretrained weights, you have to keep using the model with those same values (or define a new model, or use an existing model definition, with matching sizing). You can of course map those weights as 'pretrained' again and load them into a model with yet another img and/or patch size, but I don't think I've ever found a need for that...

Now, with dynamic image size there is a slight runtime hit to do the interpolation, but for most models, esp. larger ones, I feel it's negligible.

I do feel there is another tradeoff to consider. I feel there is an optimal range over which a given absolute position embedding can be interpolated for a different resolution. With dynamic you are always using the original size, so for, say, a patch16 224x224 ViT, your pos-embed maps to a 14x14 grid (if you unflatten it), and any input size you throw at it will be interpolating those points; the pos embed will remain quite coarse compared to, say, the 32x32 feature grid if you pass in a 512x512 image.

If you have considerable data to fine-tune or train a model on at a different resolution (range), I feel it's better to adjust the `img_size` / `patch_size` to fit your range and fine-tune or train so you have a finer-grained pos embed, and then consider using …
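To make the coarse-versus-fine point concrete, here is a minimal sketch in plain Python. It is not timm's actual resampling code (which works on tensors); the helpers `grid_size` and `resize_grid_bilinear` are hypothetical names made up for illustration, and a single scalar per grid cell stands in for a full embedding vector. It only shows the arithmetic: a patch16 224x224 model stores 14x14 learned positions, and a 512x512 input needs 32x32, so the extra positions have to come from interpolation.

```python
def grid_size(img_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    return img_size // patch_size


def resize_grid_bilinear(grid, new_size):
    """Bilinearly resample a square grid (list of lists) to new_size x new_size.

    Illustrative only: corner-aligned sampling, both sizes must be >= 2.
    """
    old = len(grid)
    if old < 2 or new_size < 2:
        raise ValueError("both grid sizes must be at least 2")
    scale = (old - 1) / (new_size - 1)
    out = []
    for i in range(new_size):
        y = i * scale
        y0 = min(int(y), old - 2)  # clamp so y0 + 1 stays in range
        ty = y - y0
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = min(int(x), old - 2)
            tx = x - x0
            top = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
            bottom = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out


# Pretraining grid: patch16 at 224x224.
pretrain_side = grid_size(224, 16)
# Grid needed for a 512x512 input with the same patch size.
target_side = grid_size(512, 16)

# Stand-in for the learned pos embed: one scalar per position.
coarse = [[r * pretrain_side + c for c in range(pretrain_side)]
          for r in range(pretrain_side)]
fine = resize_grid_bilinear(coarse, target_side)

print("learned positions:", pretrain_side * pretrain_side)
print("positions needed: ", target_side * target_side)
```

Every value in `fine` is a blend of its neighbours in `coarse`, so no new positional information is created. That is the sense in which the pos embed stays coarse unless you fine-tune at the larger `img_size`.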
Great, thank you very much for your response, @rwightman!