[Contributions Welcome] Add Fast Image Processors #36978

## Community contributions: Add Fast Image Processors

Fast image processors have been rolling out progressively for a while. Now that the [BaseImageProcessorFast](https://github.com/huggingface/transformers/blob/main/src/transformers/image_processing_utils_fast.py#L292), from which all fast image processors inherit, is in a more stable state, I'm opening this issue to encourage contributors to add fast image processors for models that still only have a "slow" image processor.

### How to implement a Fast Image Processor

The core principle of fast image processors is to use `torch` and `torchvision` functions for image transformations instead of `PIL` or `numpy`. Among other performance benefits, this enables processing images on GPU, significantly improving inference speed.

Another key difference from slow image processors is that, unlike `BaseImageProcessor`, which provides only a minimal skeleton, `BaseImageProcessorFast` includes all the fundamental functionality needed for a basic image processor. This allows optimizations made in `BaseImageProcessorFast` to propagate to its subclasses. Most of the repetitive logic for image loading and argument handling also lives in `BaseImageProcessorFast`: except in rare cases, subclasses do not need to handle image loading, conversion, or retrieving arguments from class attributes in the call/preprocess function.

#### Getting Started

Run the following command:
```bash
transformers-cli add-fast-image-processor --model-name model_name
```
where `model_name` is the name of the model (as found in its folder under `transformers/src/transformers/models`) for which you're adding the fast image processor.

This command will handle all necessary imports and generate a basic fast image processor, which will look similar to this example for Beit:

```python
# coding=utf-8
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast Image processor class for Beit."""

from ...image_processing_utils_fast import BASE_IMAGE_PROCESSOR_FAST_DOCSTRING, BaseImageProcessorFast
from ...image_utils import IMAGENET_STANDARD_MEAN, IMAGENET_STANDARD_STD, PILImageResampling
from ...utils import add_start_docstrings


@add_start_docstrings(
    "Constructs a fast Beit image processor.",
    BASE_IMAGE_PROCESSOR_FAST_DOCSTRING,
)
class BeitImageProcessorFast(BaseImageProcessorFast):
    # This generated class can be used as a starting point for the fast image processor.
    # if the image processor is only used for simple augmentations, such as resizing, center cropping, rescaling, or normalizing,
    # only the default values should be set in the class.
    # If the image processor requires more complex augmentations, methods from BaseImageProcessorFast can be overridden.
    # In most cases, only the `_preprocess` method should be overridden.

    # For an example of a fast image processor requiring more complex augmentations, see `LlavaNextImageProcessorFast`.

    # Default values should be checked against the slow image processor
    # None values left after checking can be removed
    resample = PILImageResampling.BICUBIC
    image_mean = IMAGENET_STANDARD_MEAN
    image_std = IMAGENET_STANDARD_STD
    size = {"height": 256, "width": 256}
    default_to_square = None
    crop_size = {"height": 224, "width": 224}
    do_resize = True
    do_center_crop = True
    do_rescale = True
    do_normalize = True
    do_convert_rgb = None


__all__ = ["BeitImageProcessorFast"]
```

As explained in the generated file, if the image processor only performs basic augmentations such as resizing, center cropping, rescaling, and normalizing, the generated file might be sufficient for a working fast image processor. The class attributes, such as `resample` and `image_mean`, are automatically parsed from the slow image processor when running the script above. However, you should verify their correctness and check for any missing or incorrectly assigned values.

### Customizing the Image Processor

If the image processor requires additional functionalities beyond the basic augmentations, you will need to override the `_preprocess` function in `BaseImageProcessorFast`. Check the `_preprocess` implementation in `BaseImageProcessorFast` for reference. Notably, it leverages `group_images_by_shape` and `reorder_images` to enable batch processing, significantly increasing processing speed, particularly on GPUs. If you create new image processing functions, ensure they support batch processing by utilizing `group_images_by_shape` and `reorder_images` where possible.
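
To make the grouping idea concrete, here is a minimal pure-Python sketch of the pattern. It is not the library implementation: `group_by_shape_sketch`, `reorder_sketch`, and `rescale_group` are hypothetical stand-ins, and images are modelled as nested lists rather than tensors. In the real processor, each same-shape group would be handled as one batched tensor operation.

```python
from collections import defaultdict


def shape_of(image):
    """Return (height, width) of an image stored as a list of rows."""
    return (len(image), len(image[0]))


def group_by_shape_sketch(images):
    """Bucket images by shape and remember where each one came from."""
    grouped = defaultdict(list)
    positions = []  # positions[i] = (shape, index inside that shape's group)
    for image in images:
        shape = shape_of(image)
        positions.append((shape, len(grouped[shape])))
        grouped[shape].append(image)
    return dict(grouped), positions


def reorder_sketch(processed, positions):
    """Rebuild the original ordering from the processed groups."""
    return [processed[shape][i] for shape, i in positions]


def rescale_group(group):
    """Stand-in for a batched op: rescale every pixel of every image in the group."""
    return [[[pixel / 255 for pixel in row] for row in image] for image in group]


# Three toy images: two of shape (1, 2) and one of shape (2, 1).
images = [[[0, 255]], [[255], [0]], [[51, 102]]]

grouped, positions = group_by_shape_sketch(images)
processed = {shape: rescale_group(group) for shape, group in grouped.items()}
result = reorder_sketch(processed, positions)
```

The point of the pattern is that the expensive operation runs once per distinct shape instead of once per image, while callers still get results back in the order they passed the images in.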

If your image processor requires additional kwargs not present in [`DefaultFastImageProcessorKwargs`](https://github.com/huggingface/transformers/blob/main/src/transformers/image_processing_utils_fast.py#L172), you must create a `ModelNameFastImageProcessorKwargs` class that inherits from `DefaultFastImageProcessorKwargs` and defines the new kwargs. Additionally, you should document the added kwargs in the class and the `preprocess` function using `add_start_docstrings`. (This documentation process may be simplified soon, but it is necessary for now to get correct documentation.)

For an example of handling custom kwargs and documentation, refer to [LlavaNextImageProcessorFast](https://github.com/huggingface/transformers/blob/main/src/transformers/models/llava_next/image_processing_llava_next_fast.py).
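
As a rough illustration of the inheritance pattern only, the sketch below uses a local stand-in for the base kwargs class, assuming it behaves like a `TypedDict` declared with `total=False`. `DefaultKwargsSketch`, `MyModelFastImageProcessorKwargs`, and the `do_pad` kwarg are all hypothetical names chosen for this example; check the linked `LlavaNextImageProcessorFast` file for the real thing.

```python
from typing import Optional, TypedDict


class DefaultKwargsSketch(TypedDict, total=False):
    """Local stand-in for the shared kwargs class (assumption, not the real one)."""

    do_resize: Optional[bool]
    size: Optional[dict]
    do_normalize: Optional[bool]


class MyModelFastImageProcessorKwargs(DefaultKwargsSketch, total=False):
    """Adds one model-specific kwarg on top of the shared ones."""

    do_pad: Optional[bool]  # hypothetical extra kwarg


# All keys are optional, so callers may pass any subset of them.
kwargs: MyModelFastImageProcessorKwargs = {"do_resize": True, "do_pad": False}
```

Note that `total=False` has to be repeated on the subclass; otherwise the newly added keys would be treated as required.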

### Important Notes

- In nearly all cases, `_preprocess` is the only function in `BaseImageProcessorFast` that needs to be overridden.
- The `_preprocess` function does not require default values for its arguments, as they are automatically derived from class attributes if not explicitly provided.
- Even if `PIL` images or `numpy` arrays are passed to the image processor, the `images` argument in `_preprocess` will always be a list of tensors, with the channel dimension first.
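
The second note can be pictured with the simplified sketch below. It is not the actual `BaseImageProcessorFast` code: `BaseSketch`, `MyProcessorSketch`, and `valid_kwargs` are hypothetical names, and the only goal is to show why `_preprocess` can declare its arguments without defaults.

```python
class BaseSketch:
    """Toy base class: fills missing kwargs from class attributes."""

    valid_kwargs = ("do_resize", "size", "do_rescale")
    do_resize = None
    size = None
    do_rescale = None

    def preprocess(self, images, **kwargs):
        # Any kwarg the caller did not pass falls back to the class attribute.
        for name in self.valid_kwargs:
            kwargs.setdefault(name, getattr(self, name, None))
        return self._preprocess(images, **kwargs)

    def _preprocess(self, images, do_resize, size, do_rescale):
        raise NotImplementedError


class MyProcessorSketch(BaseSketch):
    # Defaults live here, as class attributes.
    do_resize = True
    size = {"height": 224, "width": 224}
    do_rescale = True

    # No default values needed: the base class always supplies every argument.
    def _preprocess(self, images, do_resize, size, do_rescale):
        return {
            "num_images": len(images),
            "do_resize": do_resize,
            "size": size,
            "do_rescale": do_rescale,
        }


# The explicit do_rescale=False wins; everything else comes from the class.
out = MyProcessorSketch().preprocess(["img_a", "img_b"], do_rescale=False)
```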

### Handling Edge Cases

- **Nested Images:** If images are provided as nested lists (e.g., `[[image1, image2], [image3]]`), they will be flattened to `[image1, image2, image3]` by default before being passed to `_preprocess`. This behavior can be modified by overriding `_prepare_images_structure`, though flattening is generally recommended.
- **Formatting Custom Kwargs:** If any custom kwargs require formatting before `_preprocess`, override `_further_process_kwargs`.
- **Validating Custom Kwargs:** If additional validation is needed for custom kwargs or existing ones, override `_validate_preprocess_kwargs`.
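
For the nested-images case, the default flattening behaves roughly like the sketch below. `flatten_nested_images` is a hypothetical helper written for this example, not a function from the library, and it uses strings in place of real images.

```python
def flatten_nested_images(images):
    """Flatten one level of nesting; leave an already flat list unchanged."""
    if images and isinstance(images[0], (list, tuple)):
        return [image for sublist in images for image in sublist]
    return list(images)


nested = [["image1", "image2"], ["image3"]]
flat = flatten_nested_images(nested)

# Keeping the per-sample counts makes it possible to rebuild the nesting later.
group_sizes = [len(sublist) for sublist in nested]
```

If your model needs the nested structure downstream, recording something like `group_sizes` before flattening is one simple way to restore it after processing.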

### Testing

If the model already has a `test_image_processing_model_name.py` file under `transformers/tests/models/model_name`, the command you ran earlier should have imported the fast image processor into that file and added it as a `fast_image_processing_class` class attribute to the `ModelNameImageProcessingTest` class.
However, this is not enough to get all the tests to run on the fast image processor. In every test function under `ModelNameImageProcessingTest`, you need to replace `image_processing = self.image_processing_class(**self.image_processor_dict)` with a loop over `self.image_processor_list`.

For example, the `test_image_processor_properties` test in `test_image_processing_beit.py`, which looks like this:

```python
    def test_image_processor_properties(self):
        image_processing = self.image_processing_class(**self.image_processor_dict)
        self.assertTrue(hasattr(image_processing, "do_resize"))
        self.assertTrue(hasattr(image_processing, "size"))
        self.assertTrue(hasattr(image_processing, "do_center_crop"))
        self.assertTrue(hasattr(image_processing, "center_crop"))
        self.assertTrue(hasattr(image_processing, "do_normalize"))
        self.assertTrue(hasattr(image_processing, "image_mean"))
        self.assertTrue(hasattr(image_processing, "image_std"))
        self.assertTrue(hasattr(image_processing, "do_reduce_labels"))
```
should be changed to this:

```python
    def test_image_processor_properties(self):
        for image_processing_class in self.image_processor_list:
            image_processing = image_processing_class(**self.image_processor_dict)
            self.assertTrue(hasattr(image_processing, "do_resize"))
            self.assertTrue(hasattr(image_processing, "size"))
            self.assertTrue(hasattr(image_processing, "do_center_crop"))
            self.assertTrue(hasattr(image_processing, "center_crop"))
            self.assertTrue(hasattr(image_processing, "do_normalize"))
            self.assertTrue(hasattr(image_processing, "image_mean"))
            self.assertTrue(hasattr(image_processing, "image_std"))
            self.assertTrue(hasattr(image_processing, "do_reduce_labels"))
```

If no image processing test file is present, now is a great time to add one! You can have a look at the [CLIP image processing test file](https://github.com/huggingface/transformers/blob/ad5d40de9c4d4899d5b79243f63e22c72e8b3669/tests/models/clip/test_image_processing_clip.py) to use as a simple starting point.

Don't hesitate to add model-specific tests if you feel like there are some non-standard image processing techniques in the processor :).

To run the tests, use this command:
```bash
RUN_SLOW=1 python -m pytest tests/models/model_name/test_image_processing_model_name.py
```


### Choosing an Image Processor to Implement

The difficulty of implementing a fast image processor varies by model. If this is your first issue, consider starting with an easier one!

Happy coding!

Here is the list of fast image processors left to implement:
- [x] BEiT -> https://github.com/huggingface/transformers/pull/37005
- [x] BiT -> https://github.com/huggingface/transformers/pull/37180
- [x] Blip
- [x] BridgeTower -> https://github.com/huggingface/transformers/pull/37373
- [ ] Chameleon -> https://github.com/huggingface/transformers/pull/37140
- [x] Chinese-CLIP -> https://github.com/huggingface/transformers/pull/37012
- [x] CLIP
- [x] Conditional-DETR -> https://github.com/huggingface/transformers/pull/37071
- [x] ConvNext
- [x] Deformable-DETR
- [x] Deit
- [x] DepthPro
- [x] ~~Deta~~ (deprecated)
- [x] DETR
- [x] Donut -> https://github.com/huggingface/transformers/pull/37081
- [x] DPT -> https://github.com/huggingface/transformers/pull/37481
- [x] ~~EfficientFormer~~ (deprecated)
- [x] EfficientNet -> https://github.com/huggingface/transformers/pull/37055
- [x] Flava -> https://github.com/huggingface/transformers/pull/37135
- [ ] Fuyu -> https://github.com/huggingface/transformers/pull/37410
- [x] Gemma3
- [ ] GLPN -> https://github.com/huggingface/transformers/pull/38461
- [x] GotOcr2
- [x] Grounding Dino -> https://github.com/huggingface/transformers/pull/37108
- [ ] Idefics 2 -> https://github.com/huggingface/transformers/pull/38157
- [ ] Idefics3 -> https://github.com/huggingface/transformers/pull/38157
- [ ] ImageGPT -> https://github.com/huggingface/transformers/pull/37320
- [x] LayoutLMv2 -> https://github.com/huggingface/transformers/pull/37203
- [x] LayoutLMv3 -> https://github.com/huggingface/transformers/pull/37201
- [x] LeViT -> https://github.com/huggingface/transformers/pull/37154
- [x] LLava
- [x] LLaVa-NeXT
- [ ] LLaVa-NeXT-Video -> https://github.com/huggingface/transformers/pull/37297
- [x] LLaVa-Onevision
- [ ] Mask2Former -> https://github.com/huggingface/transformers/pull/35685
- [ ] MaskFormer -> https://github.com/huggingface/transformers/pull/35685
- [ ] MLlama -> https://github.com/huggingface/transformers/pull/37539
- [x] MobileNetV1 -> https://github.com/huggingface/transformers/pull/37111
- [x] MobileNetV2 -> https://github.com/huggingface/transformers/pull/37113
- [x] MobileViT -> https://github.com/huggingface/transformers/pull/37143
- [x] Nougat -> https://github.com/huggingface/transformers/pull/37661
- [ ] OneFormer -> https://github.com/huggingface/transformers/pull/38343
- [ ] OWLv2 -> https://github.com/huggingface/transformers/pull/37289 / https://github.com/huggingface/transformers/pull/39041
- [x] OwlViT -> https://github.com/huggingface/transformers/pull/37164
- [x] Perceiver -> https://github.com/huggingface/transformers/pull/37176
- [ ] Pix2Struct -> https://github.com/huggingface/transformers/pull/37210
- [x] Pixtral
- [x] PoolFormer -> https://github.com/huggingface/transformers/pull/37182
- [x] Pvt -> https://github.com/huggingface/transformers/pull/37204
- [x] Qwen2-VL (Not standard as it also handles videos, don't use it as an example :) )
- [x] RT-DETR
- [ ] SAM -> https://github.com/huggingface/transformers/pull/36999
- [ ] Segformer -> https://github.com/huggingface/transformers/pull/37024
- [x] SigLIP
- [x] SigLIP2
- [ ] SmolVLM -> https://github.com/huggingface/transformers/pull/38157
- [ ] SuperPoint -> https://github.com/huggingface/transformers/pull/37804
- [x] Swin2SR -> https://github.com/huggingface/transformers/pull/37169
- [x] ~~TVLT~~ (deprecated)
- [ ] TVP
- [ ] Video-LLaVA -> https://github.com/huggingface/transformers/pull/37023
- [ ] VideoMAE -> https://github.com/huggingface/transformers/pull/37191
- [x] Vilt -> https://github.com/huggingface/transformers/pull/37304
- [x] ViT
- [x] ~~ViT hybrid~~ (deprecated)
- [x] ViTMatte -> https://github.com/huggingface/transformers/pull/37616
- [ ] VitPose -> https://github.com/huggingface/transformers/pull/38502
- [ ] Vivit
- [x] YOLOS -> https://github.com/huggingface/transformers/pull/37292
- [x] ZoeDepth -> https://github.com/huggingface/transformers/pull/38515
