[![Roboflow Notebooks](https://media.roboflow.com/notebooks/template/bannertest2-2.png?ik-sdk-version=javascript-1.4.3&updatedAt=1672932710194)](https://github.com/roboflow/notebooks)

# Fine-tune PaliGemma2 on Object Detection Dataset

---

[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md)
[![arXiv](https://img.shields.io/badge/arXiv-2412.03555-b31b1b.svg)](https://arxiv.org/abs/2412.03555)

PaliGemma 2 is built by combining the SigLIP-So400m vision encoder with the more recent and capable language models from the Gemma 2 family.

![PaliGemma2 Figure.1](https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/paligemma2-1.png)

The authors use a 3-stage training approach similar to the original PaliGemma. In stage 1, they combine the pretrained vision and language model components and train them jointly on a multimodal task mixture. In stage 2, they train the models at higher resolutions of 448×448 and 896×896 pixels. In stage 3, they fine-tune the models on the target transfer tasks.

PaliGemma 2 models outperform the original PaliGemma at the same resolution and model size. Increasing the model size and resolution generally improves performance across a wide range of tasks, but the benefits differ depending on the task. Some tasks benefit more from increased resolution, while others benefit more from a larger language model.

![PaliGemma2 Figure.2](https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/paligemma2-2.png)

This notebook requires an A100 GPU with 40 GB of VRAM for training.

## Setup

### Configure your API keys

To fine-tune PaliGemma2, you need to provide your HuggingFace Token and Roboflow API key. Follow these steps:

- Open your [`HuggingFace Settings`](https://huggingface.co/settings) page. Click `Access Tokens`, then `New Token`, to generate a new token.
- Go to your [`Roboflow Settings`](https://app.roboflow.com/settings/api) page. Click `Copy`. This will place your private key in the clipboard.
- In Colab, go to the left pane and click on `Secrets` (🔑).
    - Store HuggingFace Access Token under the name `HF_TOKEN`.
    - Store Roboflow API Key under the name `ROBOFLOW_API_KEY`.
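
The stored secrets can then be read from code. The helper below is a minimal sketch, not part of the original notebook: `get_secret` is a hypothetical name, and the fallback to environment variables is an assumption that makes the same cell usable outside Colab.

```python
import os


def get_secret(name: str) -> str:
    """Return a secret from Colab's Secrets panel, or from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
        value = userdata.get(name)
    except ImportError:
        # Outside Colab: fall back to an environment variable of the same name.
        value = os.environ.get(name)
    if not value:
        raise KeyError(f"Secret {name!r} is not set")
    return value
```

A typical use would be `os.environ["HF_TOKEN"] = get_secret("HF_TOKEN")` before any model download.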

### Select the runtime

Let's make sure that we have access to a GPU. We can use the `nvidia-smi` command to do that. In case of any problems, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `A100 GPU`, and then click `Save`.
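
The same check can be scripted, which is handy when a later cell should react to the available memory. This is a sketch under the assumption that `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_list` and `list_gpus` are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text: str) -> list[tuple[str, int]]:
    """Parse `name, memory.total` CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, _, memory = line.rpartition(",")
        try:
            gpus.append((name.strip(), int(memory.strip())))
        except ValueError:
            # Skip rows whose memory field is not a plain integer.
            continue
    return gpus


def list_gpus() -> list[tuple[str, int]]:
    """Query nvidia-smi; return an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    return parse_gpu_list(result.stdout) if result.returncode == 0 else []


for name, mib in list_gpus():
    print(f"{name}: {mib / 1024:.0f} GiB")
```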

In [1]:
!nvidia-smi
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose both GPUs to PyTorch

Wed Oct 22 04:05:30 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:81:00.0 Off |                  N/A |
| 30%   31C    P8             30W /  350W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00

In [2]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Download dataset from Roboflow Universe

To fine-tune PaliGemma2, prepare your dataset in JSONL format. You can use Roboflow to easily convert any dataset into this format.
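
As an illustration of that format, the sketch below writes and reads a small JSONL file. It assumes the commonly used PaliGemma layout with `image`, `prefix`, and `suffix` keys, where a detection suffix lists `<locYYYY><locXXXX><locYYYY><locXXXX> label` entries separated by ` ; `. The file names and annotations are made up for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical annotations: one JSON object per line, one line per image.
records = [
    {"image": "frame_0001.jpg", "prefix": "detect box",
     "suffix": "<loc0100><loc0200><loc0300><loc0400> box"},
    {"image": "frame_0002.jpg", "prefix": "detect box ; cup",
     "suffix": "<loc0050><loc0060><loc0500><loc0600> box ; "
               "<loc0700><loc0100><loc0900><loc0300> cup"},
]


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    annotation_file = Path(tmp) / "annotations.jsonl"
    write_jsonl(annotation_file, records)
    loaded = read_jsonl(annotation_file)

print(loaded[0]["prefix"], "->", loaded[0]["suffix"])
```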

In [3]:
#!pip install -q peft bitsandbytes transformers==4.47.0 tf-keras
!rsync -a --progress /data/lmbraid19/argusm/datasets/indoorCVPR_09.tar /tmp/ && mkdir -p /tmp/indoorCVPR && tar -xf /tmp/indoorCVPR_09.tar -C /tmp/indoorCVPR
!rsync -a --progress /work/dlclarge2/zhangj-zhangj-CFM/data/training2 /tmp/
!file /tmp/indoorCVPR
!file /tmp/training2

sending incremental file list
sending incremental file list
/tmp/indoorCVPR: directory
/tmp/training2: directory


**NOTE:** Let's load the training set and check how many samples it contains.

In [40]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
from cvla.data_loader_h5 import H5Dataset
from cvla.data_loader_jsonl import JSONLDataset
from cvla.data_augmentations import augment_image_rgb, RandomizeBackgrounds
from cvla.data_augmentations import complexify_text, DepthAugmentation
from cvla.data_loader_images import ImageFolderDataset
from torchvision import transforms


model_location = Path("/data/lmbraid19/argusm/models")
dataset_location = Path("/tmp/training2")

bg_image_dataset = ImageFolderDataset("/tmp/indoorCVPR/Images", transform=transforms.RandomResizedCrop((448,448)))
randomize_background = RandomizeBackgrounds(p=0.2, background_images=bg_image_dataset)
augment_depth = DepthAugmentation(depth_range=(25, 100), max_delta_depth=35)
train_dataset = H5Dataset(dataset_location, augment_rgb=augment_image_rgb, augment_text=complexify_text,
                          augment_depth=augment_depth, return_depth=True, action_encoder="xyzrotvec-cam-512xy")
# Background randomization is disabled; pass augment_rgbds=randomize_background to enable it

print("dataset_location:", dataset_location,"samples:", len(train_dataset))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
dataset_location: /tmp/training2 samples: 88244


In [41]:
image, entry = train_dataset[0]
print("Image shape from dataset:", image[1].shape)


Image shape from dataset: (448, 448, 3)


### Set up and test data loaders

In [42]:
from cvla.utils_vis import render_example
import matplotlib.pyplot as plt
from cvla.utils_traj_tokens import getActionEncInstance

enc = getActionEncInstance("xyzrotvec-cam-512xy")
num_samples = 3*2
html_imgs = ""
for i in range(num_samples):
    image, sample = train_dataset[i]
    prefix = sample["prefix"]
    html_imgs += render_example(image[0], label=sample["suffix"], enc=enc, text=prefix, camera=sample["camera"])
    html_imgs += render_example(image[1], label=sample["suffix"], enc=enc, text=prefix, camera=sample["camera"])

plot_images = True
if plot_images:
    from IPython.display import display, HTML
    display(HTML(html_imgs))
    



### Load PaliGemma2 model

**NOTE:** PaliGemma2 offers 9 pre-trained models, covering sizes of `3B`, `10B`, and `28B` parameters and resolutions of `224`, `448`, and `896` pixels. This tutorial was written for the [`google/paligemma2-3b-pt-448`](https://huggingface.co/google/paligemma2-3b-pt-448) checkpoint, but the code below loads `google/paligemma2-3b-pt-224`. Resolution has a key impact on the mAP of the trained model, and `448` appears to offer the best balance between performance and the compute resources required for training.
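
To make the size/resolution grid concrete, the sketch below enumerates the nine checkpoint names and the number of image tokens each resolution produces. It assumes the `google/paligemma2-{size}-pt-{resolution}` naming pattern and a 14-pixel patch size for the SigLIP-So400m encoder; the helper names are hypothetical.

```python
from itertools import product

SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448, 896)
PATCH_SIZE = 14  # assumed SigLIP-So400m/14 patch size


def checkpoint_name(size: str, resolution: int) -> str:
    """Build the Hugging Face repository name for a pre-trained checkpoint."""
    return f"google/paligemma2-{size}-pt-{resolution}"


def image_tokens(resolution: int) -> int:
    """Number of image tokens: one per patch on a square grid."""
    return (resolution // PATCH_SIZE) ** 2


checkpoints = [checkpoint_name(s, r) for s, r in product(SIZES, RESOLUTIONS)]
for name in checkpoints:
    resolution = int(name.rsplit("-", 1)[1])
    print(f"{name}: {image_tokens(resolution)} image tokens")
```

Each doubling of the resolution quadruples the number of image tokens, which is the main reason higher resolutions need more memory.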

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

In [31]:
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import torch

# Use the GPUs exposed through CUDA_VISIBLE_DEVICES in the first cell
print("cuda visible devices:", os.environ["CUDA_VISIBLE_DEVICES"])
devices_good = sorted((int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(",")))
DEVICE = torch.device('cuda')
print(DEVICE)
print('Using device:', DEVICE)
print("Good devices", devices_good)

TORCH_DTYPE = torch.bfloat16
# Optionally start from a local checkpoint instead of the base model
#LOCAL_CHECKPOINT = "/data/lmbraid19/argusm/models/_text_lr3e-05xyzrotvec-cam-512xy256d_2025-04-23_12-03-48/checkpoint-4687"

# Fine-tune directly from the PaliGemma2 base checkpoint
MODEL_NAME = "google/paligemma2-3b-pt-224"

processor = PaliGemmaProcessor.from_pretrained(MODEL_NAME)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    #LOCAL_CHECKPOINT,
    MODEL_NAME,
    torch_dtype=TORCH_DTYPE,
    device_map="auto",  # let accelerate place the layers on the visible GPUs
    attn_implementation='eager'
)



cuda visible devices: 0,1
cuda
Using device: cuda
Good devices [0, 1]


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [32]:
import random
def augment_suffix(suffix):
    parts = suffix.split(' ; ')
    random.shuffle(parts)
    return ' ; '.join(parts)

def collate_fn(batch):
    images, labels = zip(*batch)
    prefixes = ["<image>" + label["prefix"] for label in labels]
    suffixes = [augment_suffix(label["suffix"]) for label in labels]

    inputs = processor(
        text=prefixes,
        images=images,
        return_tensors="pt",
        suffix=suffixes,
        padding="longest"
    ).to(TORCH_DTYPE).to(DEVICE)

    return inputs

In [33]:
def collate_fn(batch):
    images, labels = zip(*batch)
    prefixes = [label["prefix"] for label in labels]
    suffixes = [augment_suffix(label["suffix"]) for label in labels]
    inputs = processor(
        text=prefixes,
        images=list(images),
        return_tensors="pt",
        suffix=suffixes,
        padding="longest"
    ).to(TORCH_DTYPE)#.to(DEVICE)
    print("prefixes", prefixes)
    return inputs

batch = [train_dataset[i] for i in range(3)]
inputs = collate_fn(batch)
for x in inputs:
    print(x, inputs[x].shape)



You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.


prefixes ['pick up the white ribbed shell and put it in the graffiti volkswagen van <loc0258><loc0511><loc0044><seg039><seg068><seg070>', 'put the vibrant hexagonal dreidel in the red can <loc0308><loc0355><loc0053><seg045><seg079><seg089>', 'put the stylish purple eyeglasses in the rustic wooden bracket <loc0271><loc0361><loc0049><seg039><seg076><seg078>']
input_ids torch.Size([3, 548])
token_type_ids torch.Size([3, 548])
attention_mask torch.Size([3, 548])
pixel_values torch.Size([6, 3, 224, 224])
labels torch.Size([3, 548])


In [17]:
inputs["input_ids"]

tensor([[257152, 257152, 257152,  ..., 257088, 257088,      1],
        [257152, 257152, 257152,  ..., 257104, 257114,      1],
        [257152, 257152, 257152,  ..., 257116, 257123,      1]])
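
The large IDs in this output can be read back as special tokens. The sketch below assumes the standard PaliGemma vocabulary layout (1024 `<loc>` tokens starting at ID 256000, 128 `<seg>` tokens starting at 257024, `<image>` at 257152, and EOS at 1); `describe_token` is a hypothetical helper. If in doubt, `processor.tokenizer.convert_ids_to_tokens` gives the authoritative mapping.

```python
# Assumed PaliGemma vocabulary layout (see the lead-in above).
LOC_START, NUM_LOC = 256000, 1024
SEG_START, NUM_SEG = 257024, 128
IMAGE_TOKEN_ID = 257152
EOS_TOKEN_ID = 1


def describe_token(token_id: int) -> str:
    """Map a token ID to a readable name for the special-token ranges."""
    if token_id == IMAGE_TOKEN_ID:
        return "<image>"
    if token_id == EOS_TOKEN_ID:
        return "<eos>"
    if LOC_START <= token_id < LOC_START + NUM_LOC:
        return f"<loc{token_id - LOC_START:04d}>"
    if SEG_START <= token_id < SEG_START + NUM_SEG:
        return f"<seg{token_id - SEG_START:03d}>"
    return f"<text:{token_id}>"


# First and last few IDs of the first row printed above.
print([describe_token(t) for t in (257152, 257152, 257088, 257088, 1)])
```

Under these assumptions, each row starts with image placeholder tokens and ends with `<seg>` tokens from the suffix followed by EOS.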

In [43]:
def collate_fn(batch):
    images, labels = zip(*batch)
    prefixes = ["<image><image>" + label["prefix"] for label in labels]
    suffixes = [augment_suffix(label["suffix"]) for label in labels]
    images_flat = [img for img_list_x in images for img in img_list_x]
    inputs = processor(
        text=prefixes,
        images=images_flat,
        return_tensors="pt",
        suffix=suffixes,
        padding="longest"
    ).to(TORCH_DTYPE)
    return inputs

batch = [train_dataset[i] for i in range(3)]
inputs = collate_fn(batch)
for x in inputs:
    print(x, inputs[x].shape)

input_ids torch.Size([3, 547])
token_type_ids torch.Size([3, 547])
attention_mask torch.Size([3, 547])
pixel_values torch.Size([6, 3, 224, 224])
labels torch.Size([3, 547])
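
The bookkeeping in this version of `collate_fn` is: one `<image>` placeholder per image in the prefix, and a flat image list whose length equals the total number of placeholders. The sketch below reproduces that logic without the processor, using strings as stand-ins for images. The helper names are hypothetical, and the figure of 256 tokens per image assumes the 224-pixel checkpoint.

```python
def build_prefix(text: str, num_images: int) -> str:
    """Prepend one <image> placeholder per image to the prompt text."""
    return "<image>" * num_images + text


def flatten_images(batch_images: list[list[str]]) -> list[str]:
    """Flatten per-sample image lists into the flat list the processor expects."""
    return [img for sample_images in batch_images for img in sample_images]


def check_batch(prefixes: list[str], images_flat: list[str]) -> bool:
    """True if the number of placeholders matches the number of images."""
    return sum(p.count("<image>") for p in prefixes) == len(images_flat)


# Stand-in batch: 3 samples with 2 views each, as in the dataset above.
batch_images = [["rgb_0", "depth_0"], ["rgb_1", "depth_1"], ["rgb_2", "depth_2"]]
prefixes = [build_prefix("put the apple in the cup", len(v)) for v in batch_images]
images_flat = flatten_images(batch_images)

TOKENS_PER_IMAGE = 256  # assumed for a 224-pixel checkpoint
print(len(images_flat), "images,", 2 * TOKENS_PER_IMAGE, "image tokens per sample")
print("consistent:", check_batch(prefixes, images_flat))
```

This matches the shapes printed above: `pixel_values` holds 6 images for 3 samples, and with two images per sample most of the 547 sequence positions are image tokens.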


In [19]:
inputs["input_ids"]

tensor([[257152, 257152, 257152,  ..., 257088, 257088,      1],
        [257152, 257152, 257152,  ..., 257104, 257114,      1],
        [257152, 257152, 257152,  ..., 257116, 257123,      1]])

In [44]:
import numpy as np
def compute_metrics(eval_pred):
    # Debug helper: decodes and prints predictions next to labels.
    # SEQLEN is a global defined in the training cell below.
    predictions, label_tokens = eval_pred  # Extract predictions and labels
    if isinstance(predictions, tuple):  # Some models return tuples
        predictions = predictions[0]

    # Convert to token indices if necessary (e.g., for text generation models)
    pred_tokens = np.argmax(predictions, axis=-1)  # Assuming logits, take argmax

    # Only the last SEQLEN + 1 positions hold the suffix
    pred_texts = processor.tokenizer.batch_decode(pred_tokens[:,-SEQLEN-1:], skip_special_tokens=True)
    label_text = processor.tokenizer.batch_decode(label_tokens[:,-SEQLEN-1:], skip_special_tokens=True)

    print(pred_tokens[:,-SEQLEN-1:])
    print(label_tokens[:,-SEQLEN-1:])
    print(label_text)
    print(pred_texts)
    print()
    return {"accuracy": 0}  # placeholder; no real metric is computed yet

## Fine-tune with JAX settings

In [22]:
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

# Freeze the vision encoder
for param in model.vision_tower.parameters():
    param.requires_grad = False

# Freeze the projector between the vision encoder and the language model
for param in model.multi_modal_projector.parameters():
    param.requires_grad = False

# Of the remaining trainable parameters, keep only the attention layers trainable
for name, param in model.named_parameters():
    if param.requires_grad:
        param.requires_grad = "self_attn" in name

TRAIN_EXAMPLES = len(train_dataset)
BATCH_SIZE = 32
BATCH_SIZE_DEV = 2  # per-device batch size; was 8 on an L40
GRAD_ACCUM = int(round(BATCH_SIZE / BATCH_SIZE_DEV))
TRAIN_STEPS = (TRAIN_EXAMPLES // BATCH_SIZE)
SEQLEN = 12
SAVE_STEPS = int(TRAIN_STEPS / 15)
SAVE_LIMIT = 5

run_name = "_topview_70000_baseline_depth"
new_model_location = Path("/work/dlclarge2/zhangj-zhangj-CFM/models")
save_path = new_model_location / (str(Path(dataset_location).stem) + run_name)
print("save_path", save_path)
print("TRAIN_STEPS",TRAIN_STEPS)
print("GRAD_ACCUM", GRAD_ACCUM)

args_jax = Seq2SeqTrainingArguments(
    max_steps=TRAIN_STEPS,
    remove_unused_columns=False,
    per_device_train_batch_size=BATCH_SIZE_DEV,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=3e-5,  # 1e-5, 2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=.05,
    generation_max_length=SEQLEN,
    logging_steps=10,
    optim="adafactor",
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    save_total_limit=SAVE_LIMIT,
    output_dir=save_path,
    bf16=True,
    report_to=["tensorboard"],
    dataloader_pin_memory=False,
    dataloader_num_workers=4,  
    #dataloader_prefetch_factor=2,
    #eval_strategy="steps",
    #eval_steps=4,
    #per_device_eval_batch_size=BATCH_SIZE_DEV,
    #eval_accumulation_steps=GRAD_ACCUM
)
#gradient_checkpointing=True,
#weight_decay=3e-7,

trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=train_dataset,
    #eval_dataset=train_dataset_small,
    data_collator=collate_fn,
    args=args_jax,
    #compute_metrics=compute_metrics
)

save_path /work/dlclarge2/zhangj-zhangj-CFM/models/training2_topview_70000_baseline_depth
TRAIN_STEPS 2757
GRAD_ACCUM 16


TypeError: Seq2SeqTrainingArguments.__init__() got an unexpected keyword argument 'place_model_on_device'
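
The schedule values printed above follow directly from the dataset size. The sketch below redoes the arithmetic in isolation; `training_schedule` is a hypothetical helper, and it assumes the effective batch size is the per-device batch size times the number of accumulation steps (a single data-parallel process).

```python
def training_schedule(num_examples: int, batch_size: int,
                      per_device_batch: int, num_checkpoints: int = 15) -> dict:
    """Derive accumulation steps, optimizer steps, and the checkpoint interval."""
    grad_accum = int(round(batch_size / per_device_batch))
    train_steps = num_examples // batch_size          # one pass over the data
    save_steps = int(train_steps / num_checkpoints)   # about 15 checkpoints per run
    return {"grad_accum": grad_accum, "train_steps": train_steps,
            "save_steps": save_steps}


sched = training_schedule(num_examples=88244, batch_size=32, per_device_batch=2)
print(sched)
```

With 88244 samples this gives 16 accumulation steps, 2757 optimizer steps, and a checkpoint every 183 steps, which is why the first saved checkpoint directory is named `checkpoint-183`.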

In [None]:
# Only needed when resuming from a previous training run
#last_checkpoint = "/work/dlclarge2/zhangj-zhangj-CFM/models/training2_topview_70000_based/checkpoint-183"
#trainer.train(resume_from_checkpoint=last_checkpoint)


In [11]:
for key, value in inputs.items():
    if torch.is_tensor(value):
        inputs[key] = value.to(DEVICE)

In [12]:
trainer.train()



You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.

prefixes ['put the vibrant blue toy in the round shiny apple <loc0374><loc0522><loc0059><seg042><seg062><seg060>', 'put the pixelated toy in the halved green avocado <loc0309><loc0711><loc0050><seg046><seg075><seg084>']
prefixes ['put the fiddler crab in the conical seashell <loc0398><loc0707><loc0054><seg051><seg084><seg101>', 'put the cylindrical metal can in the cylindrical plastic object <loc0641><loc0169><loc0045><seg050><seg083><seg097>']
prefixes ['put the red folder in the red ball <loc0543><loc0662><loc0049><seg045><seg087><seg094>', 'put the pyramid shaped keycap inside the cylindrical dark flashlight <loc0343><loc0099><loc0036><seg048><seg080><seg094>']
prefixes ['put the spiky creamy seashell in the clay pipe <loc0196><loc0509><loc0040><seg056><seg048><seg020>', 'put the ceiling medallion in the rectangular black box <loc0437><loc0362><loc0056><seg050><seg090><seg101>']
prefixes ['put the orange juice carton in the white spiral seashell <loc0448><loc0387><loc0059><seg044><seg057><seg051>', 'put the square body teapot in the miniature gate <loc0360><loc0102><loc0051><seg059><seg039><seg013>']
prefixes ['place the gray fabric sneaker in the arrowhead <loc0002><loc0389><loc0032><seg059><seg046><seg013>', 'put the glossy pink button in the small l-shaped bracket <loc0009><loc0477><loc0043><seg053><seg059><seg045>']
tensor([1024, 1024], dtype=torch.int32)
prefixes ['put the football in the ceramic pink mug <loc0541><loc0562><loc0053><seg045><seg075><seg083>', 'put the toy figure in the small green container <loc0426><loc0002><loc0035><seg056><seg086><seg110>']
prefixes ['put the cube inside the silver dome screw <loc0469><loc0146><loc0049><seg059><seg035><seg014>', 'put the turquoise bracket in the sleek maroon car <loc0450><loc0901><loc0050><seg046><seg088><seg096>']
tensor([1024, 1024], dtype=torch.int32)
prefixes ['put the realistic cupcake model in the glossy yellow pear <loc0321><loc0537><loc0048><seg061><seg090><seg117>', 'put the apple in the stylized chess king <loc0529><loc0567><loc0049><seg050><seg090><seg101>']
prefixes ['put the magenta candy stick in the red lenses frame <loc0211><loc0629><loc0051><seg056><seg043><seg019>', 'put the detailed mech model in the pink pig <loc0475><loc0410><loc0046><seg044><seg079><seg086>']
prefixes ['put the smooth red onion in the coral <loc0359><loc1023><loc0028><seg040><seg064><seg061>', 'put the crab in the seated figure statue <loc0465><loc0258><loc0046><seg047><seg091><seg098>']
tensor([1024, 1024], dtype=torch.int32)
prefixes ['pick the light-colored rustic bowl and put it in the small egyptian figure <loc0501><loc0677><loc0054><seg038><seg073><seg075>', 'place the melted yellow candle inside the miniature red torii <loc0527><loc0002><loc0034><seg050><seg087><seg101>']
tensor([1024, 1024], dtype=torch.int32)
prefixes ['put the grenade inside the sporty blue cap <loc0518><loc0406><loc0056><seg040><seg064><seg065>', 'put the crouching animal sculpture in the green cup <loc0183><loc0507><loc0052><seg044><seg060><seg055>']
tensor([1024, 1024], dtype=torch.int32)
prefixes ['put the cylindrical black grenade inside the colorful mug <loc0541><loc0368><loc0042><seg053><seg033><seg021>', 'put the light grey connector in the stacked cigarettes <loc0323><loc0208><loc0054><seg064><seg035><seg008>']
prefixes ['put the black shotgun shell in the ammonite <loc0457><loc0312><loc0043><seg051><seg090><seg103>', 'put the old rusty object in the wooden metal pistol <loc0474><loc0141><loc0044><seg047><seg091><seg096>']
prefixes ['put the black gemstone in the small animal skull <loc0522><loc0848><loc0038><seg045><seg085><seg093>', 'place the yellow block inside the milk carton blue <loc0522><loc0567><loc0052><seg054><seg095><seg106>']
tensor([1024, 1024], dtype=torch.int32)
prefixes ['put the mango drink inside the ivory-colored dice <loc0639><loc0346><loc0050><seg047><seg078><seg089>', 'put the pluto in the reddish-brown teapot <loc0498><loc0750><loc0042><seg058><seg091><seg114>']
prefixes ['pick the smart speaker and put it in the decorative bird sculpture <loc0602><loc0061><loc0046><seg061><seg096><seg117>', 'put the black tie fighter in the rustic wooden bracket <loc0419><loc0888><loc0038><seg041><seg070><seg074>']
prefixes ['put the gray transformer in the whimsical gingerbread house <loc0480><loc0173><loc0041><seg057><seg093><seg111>', 'put the colorful round pumpkin in the bark-textured rustic bowl <loc0272><loc0189><loc0052><seg063><seg037><seg011>']
prefixes ['put the sleek black laptop in the compact black revolver <loc0488><loc0686><loc0053><seg044><seg076><seg083>', 'pick up the small green crocodile and put it in the red construction toy <loc0505><loc0843><loc0046><seg047><seg077><seg090>']
tensor([1024, 1024], dtype=torch.int32)
prefixes ['put the pinecone in the human dentures <loc0358><loc0414><loc0048><seg049><seg071><seg080>', 'put the whimsical toy bomb in the grenade <loc0508><loc0002><loc0043><seg051><seg087><seg103>']
prefixes ['put the rough mineral garnet in the pastel rosebud <loc0631><loc0993><loc0035><seg043><seg080><seg086>', 'put the rusty die in the ceramic lion <loc0449><loc0210><loc0054><seg060><seg033><seg012>']


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 10.57 GiB of which 15.12 MiB is free. Including non-PyTorch memory, this process has 10.55 GiB memory in use. Of the allocated memory 10.04 GiB is allocated by PyTorch, and 330.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
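
The error message itself suggests one mitigation: enabling expandable segments in the CUDA allocator. The sketch below sets that variable and adds a back-of-the-envelope estimate of the weight memory. It assumes a nominal 3 billion parameters stored in bfloat16 and ignores activations, gradients, optimizer state, and how `device_map="auto"` splits layers across GPUs; `weights_gib` is a hypothetical helper.

```python
import os

# Must be set before the first CUDA allocation (simplest: before importing torch).
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the weights alone, in GiB (bfloat16 = 2 bytes)."""
    return num_params * bytes_per_param / 2**30


print(f"3B parameters in bfloat16: about {weights_gib(3e9):.1f} GiB")
```

On a GPU with roughly 10.5 GiB of capacity, as reported in the error, the weights alone can take up a large share of the memory, so lowering the per-device batch size or using a larger GPU is likely to matter more than allocator settings.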

In [None]:
print("Model device:", next(model.parameters()).device)
for k, v in inputs.items():
    if torch.is_tensor(v):
        print(f"  {k}: {v.device}")


In [None]:
print(next(model.parameters()).device)
print({k: v.device for k, v in inputs.items() if torch.is_tensor(v)})