[![Roboflow Notebooks](https://media.roboflow.com/notebooks/template/bannertest2-2.png?ik-sdk-version=javascript-1.4.3&updatedAt=1672932710194)](https://github.com/roboflow/notebooks)

# Fine-tune PaliGemma2 on Object Detection Dataset

---

[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md)
[![arXiv](https://img.shields.io/badge/arXiv-2412.03555-b31b1b.svg)](https://arxiv.org/abs/2412.03555)

PaliGemma 2 is built by combining the SigLIP-So400m vision encoder with the more recent and capable language models from the Gemma 2 family.

![PaliGemma2 Figure.1](https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/paligemma2-1.png)

The authors use a 3-stage training approach similar to the original PaliGemma. In stage 1, they combine the pretrained vision and language model components and train them jointly on a multimodal task mixture. In stage 2, they train the models at higher resolutions of 448px^2 and 896px^2. In stage 3, they fine-tune the models on the target transfer tasks.

PaliGemma 2 models outperform the original PaliGemma at the same resolution and model size. Increasing the model size and resolution generally improves performance across a wide range of tasks, but the benefits differ depending on the task. Some tasks benefit more from increased resolution, while others benefit more from a larger language model.

![PaliGemma2 Figure.2](https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/paligemma2-2.png)

This notebook requires an A100 GPU with 40 GB of VRAM to train.

## Setup

### Configure your API keys

To fine-tune PaliGemma2, you need to provide your HuggingFace Token and Roboflow API key. Follow these steps:

- Open your [`HuggingFace Settings`](https://huggingface.co/settings) page. Click `Access Tokens`, then `New Token`, to generate a new token.
- Go to your [`Roboflow Settings`](https://app.roboflow.com/settings/api) page. Click `Copy`. This copies your private key to the clipboard.
- In Colab, go to the left pane and click on `Secrets` (🔑).
    - Store the HuggingFace access token under the name `HF_TOKEN`.
    - Store the Roboflow API key under the name `ROBOFLOW_API_KEY`.
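
Once the secrets are stored, a cell has to read them back. The sketch below assumes the keys are exposed as environment variables; in Colab they can instead be read with `google.colab.userdata.get`. The helper name `require_secret` is hypothetical and not part of the original notebook.

```python
import os


def require_secret(name, env=None):
    """Return the secret stored under `name`, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing secret {name!r}; add it under Secrets or export it.")
    return value


# Demo with an explicit mapping so the sketch runs without real credentials.
demo_env = {"HF_TOKEN": "hf_example", "ROBOFLOW_API_KEY": "rf_example"}
hf_token = require_secret("HF_TOKEN", demo_env)
roboflow_key = require_secret("ROBOFLOW_API_KEY", demo_env)
print("secrets loaded:", bool(hf_token and roboflow_key))
```

Failing early with a named secret tends to be easier to debug than an authentication error raised later during model download.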

### Select the runtime

Let's make sure that we have access to a GPU. We can use the `nvidia-smi` command to check. If no GPU is listed, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, select a GPU (the training cells need the A100 noted above, not a `T4 GPU`), and then click `Save`.
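
The same check can be scripted. This is a minimal sketch that uses the `nvidia-smi --query-gpu` interface and returns an empty list when the tool is missing or fails; the helper names `parse_gpu_csv` and `list_gpus` are hypothetical.

```python
import shutil
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory.total' CSV rows into (name, MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        if "," not in line:
            continue  # skip anything that is not a 'name, memory' row
        name, mem = (field.strip() for field in line.rsplit(",", 1))
        try:
            gpus.append((name, int(mem.split()[0])))
        except (ValueError, IndexError):
            continue  # memory reported as e.g. [N/A]
    return gpus


def list_gpus():
    """Return the visible GPUs, or an empty list when nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=False, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    return parse_gpu_csv(result.stdout) if result.returncode == 0 else []


print(list_gpus())
```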

In [6]:
!nvidia-smi
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # make GPUs 0 and 1 visible to this process

Fri Oct 24 01:25:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.82.07              Driver Version: 580.82.07      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:E1:00.0 Off |                  N/A |
| 30%   30C    P8             22W /  350W |   13586MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+----------------------------------------------

In [8]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

### Download dataset from Roboflow Universe

To fine-tune PaliGemma2, prepare your dataset in JSONL format. You can use Roboflow to easily convert any dataset into this format.
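
For reference, a PaliGemma-style JSONL annotation file holds one JSON object per line. The sketch below assumes the `image`, `prefix`, and `suffix` keys used by the big_vision fine-tuning examples; the file names, class names, and box tokens are made up for illustration.

```python
import json

# Two illustrative records; real files are produced by the dataset export.
sample_jsonl = "\n".join([
    json.dumps({"image": "img_0001.jpg", "prefix": "detect cat",
                "suffix": "<loc0100><loc0200><loc0300><loc0400> cat"}),
    json.dumps({"image": "img_0002.jpg", "prefix": "detect dog",
                "suffix": "<loc0050><loc0060><loc0500><loc0600> dog"}),
])


def read_jsonl(text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = read_jsonl(sample_jsonl)
for record in records:
    print(record["image"], "|", record["prefix"], "->", record["suffix"])
```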

In [9]:
#!pip install -q peft bitsandbytes transformers==4.47.0 tf-keras
!rsync -a --progress /data/lmbraid19/argusm/datasets/indoorCVPR_09.tar /tmp/ && mkdir -p /tmp/indoorCVPR && tar -xf /tmp/indoorCVPR_09.tar -C /tmp/indoorCVPR
!rsync -a --progress /work/dlclarge2/zhangj-zhangj-CFM/data/training2 /tmp/
!file /tmp/indoorCVPR
!file /tmp/training2

sending incremental file list
sending incremental file list
/tmp/indoorCVPR: directory
/tmp/training2: directory


**NOTE:** Let's read the first few lines of the annotation file and examine the dataset format.

In [10]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
from cvla.data_loader_h5 import H5Dataset
from cvla.data_loader_jsonl import JSONLDataset
from cvla.data_augmentations import augment_image_rgb, RandomizeBackgrounds
from cvla.data_augmentations import complexify_text, DepthAugmentation
from cvla.data_loader_images import ImageFolderDataset
from torchvision import transforms
from torch.utils.data import random_split
import torch
import random

model_location = Path("/data/lmbraid19/argusm/models")
dataset_location = Path("/tmp/training2")

bg_image_dataset = ImageFolderDataset("/tmp/indoorCVPR/Images", transform=transforms.RandomResizedCrop((448,448)))
randomize_background = RandomizeBackgrounds(p=0.2, background_images=bg_image_dataset)
augment_depth = DepthAugmentation(depth_range=(25, 100), max_delta_depth=35)

full_dataset = H5Dataset(
    dataset_location,
    augment_rgb=augment_image_rgb,
    augment_text=complexify_text,
    augment_depth=augment_depth,
    return_depth=False,
    action_encoder="xyzrotvec-cam-512xy",
)

# Manually define the validation set size
val_size = 1000  # fixed at 1000 samples
train_size = len(full_dataset) - val_size

generator = torch.Generator().manual_seed(42)
train_dataset, val_dataset = random_split(full_dataset, [train_size, val_size], generator=generator)

val_indices_small = random.sample(range(len(val_dataset)), 200)
val_dataset_small = torch.utils.data.Subset(val_dataset, val_indices_small)

print(f"Total samples: {len(full_dataset)} | Train: {len(train_dataset)} | Val: {len(val_dataset)}| Smallv:{len(val_dataset_small)}")


'''
train_dataset = H5Dataset(dataset_location, augment_rgb=augment_image_rgb, augment_text=complexify_text,
                          augment_depth=augment_depth, return_depth=True,action_encoder="xyzrotvec-cam-512xy")
#, augment_rgbds=randomize_background

print("dataset_location:", dataset_location,"samples:", len(train_dataset))
'''

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Total samples: 88244 | Train: 87244 | Val: 1000| Smallv:200



### Set up and test data loaders

In [5]:
from cvla.utils_vis import render_example
import matplotlib.pyplot as plt
from cvla.utils_traj_tokens import getActionEncInstance

enc = getActionEncInstance("xyzrotvec-cam-512xy")
num_samples = 3*2
html_imgs = ""
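for i in range(num_samples):
    image, sample = train_dataset[i]
    # Assumption: the dataset can return a list of images per sample, which
    # render_example rejects with "ValueError: image was <class 'list'>".
    # In that case, render only the first image.
    if isinstance(image, list):
        image = image[0]
    prefix = sample["prefix"]
    html_imgs += render_example(image, label=sample["suffix"], enc=enc, text=prefix, camera=sample["camera"])

plot_images = True
if plot_images:
    from IPython.display import display, HTML
    display(HTML(html_imgs))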

### Load PaliGemma2 model

**NOTE:** PaliGemma2 offers 9 pre-trained models, with sizes of `3B`, `10B`, and `28B` parameters and resolutions of `224`, `448`, and `896` pixels. Resolution has a key impact on the mAP of the trained model, and `448` appears to offer the best balance between performance and the compute required for training, which makes [`google/paligemma2-3b-pt-448`](https://huggingface.co/google/paligemma2-3b-pt-448) the natural checkpoint for that setting. The cell below loads the smaller `google/paligemma2-3b-pt-224` checkpoint instead (see `MODEL_NAME`).
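
The nine checkpoints follow a regular naming scheme, so switching size or resolution is a one-line change. The sketch below assumes the `google/paligemma2-{size}-pt-{resolution}` pattern seen in the checkpoint name above; `checkpoint_id` is a hypothetical helper.

```python
from itertools import product

SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448, 896)


def checkpoint_id(size, resolution):
    """Build a Hugging Face model id for a pre-trained PaliGemma2 checkpoint."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution {resolution!r}, expected one of {RESOLUTIONS}")
    return f"google/paligemma2-{size}-pt-{resolution}"


all_ids = [checkpoint_id(size, res) for size, res in product(SIZES, RESOLUTIONS)]
print(len(all_ids), "checkpoints")
print(checkpoint_id("3b", 224))
```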

In [None]:
# from huggingface_hub import notebook_login
# notebook_login()

In [11]:
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration
import torch
import transformers

#transformers.utils.logging.set_verbosity_error()

# setting device on GPU if available, else CPU
print("cuda visible devices:", os.environ["CUDA_VISIBLE_DEVICES"])
devices_good = sorted((int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(",")))
DEVICE = torch.device('cuda')
print(DEVICE)
print('Using device:', DEVICE)
print("Good devices", devices_good)

TORCH_DTYPE = torch.bfloat16
# use checkpoint
#LOCAL_CHECKPOINT = "/data/lmbraid19/argusm/models/_text_lr3e-05xyzrotvec-cam-512xy256d_2025-04-23_12-03-48/checkpoint-4687"

#fine-tune directly on paligemma2
MODEL_NAME = "google/paligemma2-3b-pt-224"

processor = PaliGemmaProcessor.from_pretrained(MODEL_NAME)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    #LOCAL_CHECKPOINT,
    MODEL_NAME,
    torch_dtype=TORCH_DTYPE,
    device_map=None,
    attn_implementation='eager'
)
#.to("cuda") 
tokenizer = processor.tokenizer


cuda visible devices: 0,1
cuda
Using device: cuda
Good devices [0, 1]


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
import random
def augment_suffix(suffix):
    parts = suffix.split(' ; ')
    random.shuffle(parts)
    return ' ; '.join(parts)
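
A quick check of what `augment_suffix` does: it reorders the ` ; `-separated parts of a suffix and leaves a suffix without that separator unchanged. The sample suffix below is hypothetical.

```python
import random


def augment_suffix(suffix):
    """Shuffle the ' ; '-separated parts of a suffix string."""
    parts = suffix.split(" ; ")
    random.shuffle(parts)
    return " ; ".join(parts)


random.seed(0)  # fixed seed so repeated runs give the same order
original = "<loc0100><loc0200> cat ; <loc0300><loc0400> dog ; <loc0500><loc0600> bird"
shuffled = augment_suffix(original)
print(shuffled)

# A suffix with a single part has nothing to reorder.
print(augment_suffix("<loc0271><loc0331><loc0049>"))
```

The set of parts is preserved and only their order changes, so the model is not rewarded for memorizing one particular ordering of targets.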

In [19]:
def collate_fn(batch):
    images, labels = zip(*batch)
    prefixes = ["<image>" + label["prefix"] for label in labels]
    suffixes = [augment_suffix(label["suffix"]) for label in labels]

    inputs = processor(
        text=prefixes,
        images=images,
        return_tensors="pt",
        suffix=suffixes,
        padding="longest"
    ).to(TORCH_DTYPE)

    return inputs

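# Debug variant: this redefinition overrides the collate_fn above. It omits the
# explicit "<image>" token (hence the processor warning below) and prints the
# prefixes of every batch.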
def collate_fn(batch):
    images, labels = zip(*batch)
    prefixes = [label["prefix"] for label in labels]
    suffixes = [augment_suffix(label["suffix"]) for label in labels]
    inputs = processor(
        text=prefixes,
        images=list(images),
        return_tensors="pt",
        suffix=suffixes,
        padding="longest"
    ).to(TORCH_DTYPE)#.to(DEVICE)
    print("prefixes", prefixes)
    return inputs

batch = [train_dataset[i] for i in range(3)]
inputs = collate_fn(batch)
for x in inputs:
    print(x, inputs[x].shape)

You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.


prefixes ['put the red folder inside the red ball <loc0271><loc0331><loc0049><seg045><seg087><seg094>', 'put the pyramid shaped keycap in the cylindrical dark flashlight <loc0171><loc0049><loc0036><seg048><seg080><seg094>', 'put the fiddler crab in the conical seashell <loc0199><loc0353><loc0054><seg051><seg084><seg101>']
input_ids torch.Size([3, 545])
token_type_ids torch.Size([3, 545])
attention_mask torch.Size([3, 545])
pixel_values torch.Size([6, 3, 224, 224])
labels torch.Size([3, 545])
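
The shapes above can be sanity-checked by hand. This is a rough sketch, assuming the SigLIP encoder's 14-pixel patches (so a 224-pixel image becomes 256 image tokens) and two images per sample, which is what 6 `pixel_values` entries for 3 samples suggest. Under those assumptions the remaining positions are text tokens and padding.

```python
PATCH_SIZE = 14  # assumed SigLIP patch size


def image_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of image tokens for a square input of the given resolution."""
    side = resolution // patch_size
    return side * side


batch_size = 3    # first dimension of input_ids
num_images = 6    # first dimension of pixel_values
seq_len = 545     # second dimension of input_ids

images_per_sample = num_images // batch_size
image_positions = images_per_sample * image_tokens(224)
text_positions = seq_len - image_positions

print(images_per_sample, image_positions, text_positions)  # 2 512 33
```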


In [None]:
inputs["input_ids"]

In [20]:
from cvla.utils_eval import Evaluator
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer
import torch
from torch.utils.tensorboard import SummaryWriter
from tqdm import tqdm
from math import ceil



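TRAIN_EXAMPLES = len(train_dataset)
BATCH_SIZE = 32      # target effective batch size
BATCH_SIZE_DEV = 2   # per-device batch size
GRAD_ACCUM = BATCH_SIZE // BATCH_SIZE_DEV   # accumulation steps needed to reach BATCH_SIZE
TRAIN_STEPS = TRAIN_EXAMPLES // BATCH_SIZE  # about one pass over the training set
SEQLEN = 12          # passed to the trainer as generation_max_length
#EVAL_STEPS = 200
EVAL_STEPS = 2       # evaluate and checkpoint every 2 optimizer steps
SAVE_LIMIT = 5
LOGGING_STEPS = 10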


run_name = "test"
new_model_location = Path("/work/dlclarge2/zhangj-zhangj-CFM/models")
save_path = new_model_location / (str(Path(dataset_location).stem) + run_name)
print("save_path", save_path)
print("TRAIN_STEPS",TRAIN_STEPS)
print("GRAD_ACCUM", GRAD_ACCUM)

writer = SummaryWriter(log_dir=str(save_path / "tb_logs"))

class CustomTrainer(Seq2SeqTrainer):
    """
    Trainer that:
      - uses normal loss for training
      - runs model.generate() for evaluation
      - uses Evaluator to compute real-world metrics
    """

    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
    
        outputs = model(**inputs)
        loss = getattr(outputs, "loss", None)
        if loss is None:
            raise ValueError("Model outputs do not contain a 'loss' field.")

        if self.state.global_step % self.args.logging_steps == 0:
            writer.add_scalar("train/loss_total", loss.item(), self.state.global_step)
            writer.add_scalar("train/lr", self.optimizer.param_groups[0]["lr"], self.state.global_step)

        return (loss, outputs) if return_outputs else loss

    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix="eval"):
        """
        Overridden evaluation that generates predictions textually
        and computes spatial metrics via Evaluator.
        """
        self.model.eval()
        dataset = eval_dataset or self.eval_dataset
        # helper: unwrap nested Subsets to access H5Dataset
        def unwrap_dataset(dset):
            while hasattr(dset, "dataset"):
                dset = dset.dataset
            return dset

        base_dataset = unwrap_dataset(dataset)
        camera = dataset[0][1]["camera"]

        evaluator = Evaluator(
            getActionEncInstance("xyzrotvec-cam-512xy"),
            camera_fixed=camera,
            encoder_labels=base_dataset.action_encoder,  # read from the unwrapped H5Dataset
        )
        # sample limited subset for speed
        eval_batch_size = self.args.per_device_eval_batch_size
        test_samples = min(len(dataset), 200)
        device = next(self.model.parameters()).device
        
        for start_idx in tqdm(range(0, test_samples, eval_batch_size), total=ceil(test_samples / eval_batch_size)):
            batch_i = range(start_idx, min(start_idx + eval_batch_size, test_samples))
            batch = [dataset[i] for i in batch_i]
            images, labels = zip(*batch)
            # Encode the prefixes only: the training collate_fn also appends the
            # suffix to input_ids, which would put the target into the prompt.
            inputs = processor(
                text=[label["prefix"] for label in labels],
                images=list(images),
                return_tensors="pt",
                padding="longest",
            ).to(TORCH_DTYPE)
            inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
            prefix_length = inputs["input_ids"].shape[-1]

            with torch.inference_mode():
                generation = self.model.generate(**inputs, max_new_tokens=13, do_sample=False, use_cache=False)
            decoded = [
                self.processing_class.decode(x[prefix_length:], skip_special_tokens=True) for x in generation
            ]
            # Ground-truth suffixes are compared as raw strings.
            decoded_labels = [label["suffix"] for label in labels]
            if start_idx == 0:
                print("decoded[0]:", decoded[0] if decoded else None)
                print("decoded_label[0]:", decoded_labels[0] if decoded_labels else None)

            for pred, label in zip(decoded, decoded_labels):
                evaluator.evaluate(pred, label, camera=camera)

        stats = evaluator.report_stats()
        metrics = {f"{metric_key_prefix}_{k}": v for k, v in stats.items()}

        # log to TensorBoard
        for k, v in metrics.items():
            writer.add_scalar(k, v, self.state.global_step)

        self.log(metrics)
        return metrics

save_path /work/dlclarge2/zhangj-zhangj-CFM/models/training2test
TRAIN_STEPS 2726
GRAD_ACCUM 16
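
The printed values follow directly from the configuration. The sketch below reproduces the arithmetic with the dataset size reported earlier (87,244 training samples). It also shows the effective batch size for one and two devices, on the assumption that the trainer multiplies the per-device batch by the number of visible GPUs.

```python
train_examples = 87244
batch_size = 32
per_device_batch = 2

grad_accum = batch_size // per_device_batch   # 16
train_steps = train_examples // batch_size    # 2726


def effective_batch(per_device, accum, num_devices=1):
    """Samples contributing to one optimizer step."""
    return per_device * accum * num_devices


print("GRAD_ACCUM", grad_accum)
print("TRAIN_STEPS", train_steps)
print("effective batch, 1 GPU:", effective_batch(per_device_batch, grad_accum, 1))  # 32
print("effective batch, 2 GPUs:", effective_batch(per_device_batch, grad_accum, 2))  # 64
```

If both GPUs listed in `CUDA_VISIBLE_DEVICES` are actually used, halving `GRAD_ACCUM` would keep the effective batch at 32.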


## Fine-tune with JAX settings

In [21]:

# Freeze the vision encoder and the multimodal projector.
for param in model.vision_tower.parameters():
    param.requires_grad = False

for param in model.multi_modal_projector.parameters():
    param.requires_grad = False

# Of the parameters that are still trainable, keep only the self-attention weights.
for name, param in model.named_parameters():
    if param.requires_grad:
        param.requires_grad = "self_attn" in name

args_jax = Seq2SeqTrainingArguments(
    max_steps=TRAIN_STEPS,
    per_device_train_batch_size=BATCH_SIZE_DEV,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    generation_max_length=SEQLEN,
    logging_steps=LOGGING_STEPS,
    optim="adafactor",
    evaluation_strategy="steps",
    eval_steps=EVAL_STEPS,
    save_strategy="steps",
    save_steps=EVAL_STEPS,
    save_total_limit=SAVE_LIMIT,
    load_best_model_at_end=True,
    metric_for_best_model="cart_l1",
    greater_is_better=False,
    bf16=True,
    output_dir=save_path,
    report_to=["tensorboard"],
    dataloader_num_workers=4,
    remove_unused_columns=False,
)

trainer = CustomTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset_small,
    data_collator=collate_fn,
    args=args_jax,
)

  trainer = CustomTrainer(
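
The freezing logic in this cell amounts to a name-based rule: nothing in the vision tower or projector is trained, and elsewhere only parameters whose name contains `self_attn` are. The sketch below applies that rule to a few parameter names; the names are illustrative and may not match the exact names in the loaded model.

```python
FROZEN_PREFIXES = ("vision_tower.", "multi_modal_projector.")


def is_trainable(name):
    """Mirror the freezing rule: only language-model self-attention weights train."""
    if name.startswith(FROZEN_PREFIXES):
        return False
    return "self_attn" in name


# Hypothetical parameter names for illustration.
names = [
    "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight",
    "multi_modal_projector.linear.weight",
    "language_model.model.layers.0.self_attn.q_proj.weight",
    "language_model.model.layers.0.mlp.up_proj.weight",
]

for name in names:
    print(f"{'train ' if is_trainable(name) else 'frozen'}  {name}")
```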


In [13]:
# Only needed when resuming from a previous training run
#last_checkpoint = "/work/dlclarge2/zhangj-zhangj-CFM/models/training2_topview_70000_based/checkpoint-183"
#trainer.train(resume_from_checkpoint=last_checkpoint)


In [22]:
trainer.train()
trainer.save_model(str(save_path / "final_checkpoint"))
writer.close()
print("✅ Training completed successfully with Evaluator-based validation.")

You are passing both `text` and `images` to `PaliGemmaProcessor`. The processor expects special image tokens in the text, as many tokens as there are images per each text. It is recommended to add `<image>` tokens in the very beginning of your text. For this call, we will infer how many images each text has and add special tokens.

prefixes ['put the cursive nameplate in the ring-shaped chocolate pastry <loc0242><loc0169><loc0054><seg044><seg090><seg095>', 'put the blocky blue figure inside the floral porcelain teacup <loc0288><loc0250><loc0056><seg049><seg090><seg100>']
prefixes ['put the glossy smooth mango in the curved object <loc0323><loc0086><loc0046><seg056><seg091><seg109>', 'put the compact gray spacecraft inside the turquoise model <loc0252><loc0266><loc0047><seg045><seg089><seg094>']
prefixes ['put the diverse succulent mix in the off-white bone shape <loc0223><loc0456><loc0037><seg052><seg084><seg102>', 'put the glossy blue orb in the irregular stone-like slab <loc0301><loc0511><loc0030><seg041><seg067><seg070>']
prefixes ['put the vintage brass candleholder in the ceramic bottle stopper <loc0214><loc0253><loc0048><seg063><seg098><seg115>', 'put the donut in the aluminum can <loc0242><loc0421><loc0053><seg045><seg086><seg094>']



prefixes ['put the shiny golden ornament in the rustic metal mug <loc0246><loc0511><loc0032><seg039><seg070><seg072>', 'put the flaky almond croissant in the mandible <loc0302><loc0065><loc0046><seg049><seg084><seg098>']
prefixes ['put the cooling fan inside the realistic button mushroom <loc0002><loc0081><loc0037><seg060><seg049><seg011>', 'put the sleek maroon car in the cartoonish red tomato <loc0260><loc0511><loc0040><seg040><seg071><seg075>']
prefixes ['put the blue ornament inside the clownfish <loc0183><loc0511><loc0034><seg047><seg062><seg060>', 'put the rectangular crumpled package in the colorful model <loc0275><loc0407><loc0051><seg040><seg074><seg078>']
prefixes ['put the crenellated chess piece in the metal rods <loc0295><loc0374><loc0049><seg045><seg086><seg094>', 'put the glossy brown donut inside the pixelated pink glasses <loc0235><loc0368><loc0041><seg040><seg072><seg076>']




prefixes ['put the ghostly toy figure in the model cicada <loc0293><loc0069><loc0045><seg046><seg081><seg092>', 'put the cube in the red cube apple <loc0191><loc0262><loc0038><seg047><seg078><seg089>']
prefixes ['place the orange-yellow can design in the circular base terrain <loc0233><loc0261><loc0052><seg050><seg089><seg102>', 'put the 20-sided red die in the vintage toy car <loc0218><loc0163><loc0054><seg043><seg081><seg087>']
prefixes ['put the aged crescent pendant in the mineral <loc0171><loc0326><loc0051><seg048><seg066><seg072>', 'put the white box inside the colorful dna structure <loc0249><loc0086><loc0040><seg045><seg070><seg079>']
prefixes ['pick the red bracket and put it in the yellow toy car <loc0232><loc0340><loc0042><seg050><seg083><seg099>', 'pick the cartoon head and put it in the jawbone <loc0152><loc0269><loc0048><seg057><seg039><seg016>']



prefixes ['put the cylindrical bullet in the purple heart <loc0233><loc0262><loc0052><seg048><seg092><seg099>', 'put the classic light bulb in the mickey shapes <loc0277><loc0115><loc0052><seg055><seg097><seg108>']
prefixes ['put the pink character in the dark bowl spoon <loc0291><loc0009><loc0044><seg059><seg099><seg111>', 'put the bumpy cherimoya fruit in the mineral <loc0085><loc0022><loc0043><seg066><seg039><seg006>']
prefixes ['put the turquoise mug in the iron <loc0269><loc0099><loc0050><seg053><seg092><seg105>', 'put the wireless black mouse in the brown paper cup <loc0069><loc0277><loc0042><seg051><seg065><seg064>']
prefixes ['put the textured greenish-yellow melon in the lifelike crab model <loc0268><loc0152><loc0053><seg055><seg095><seg109>', 'pick the spherical red gray and put it in the yellow frame sunglasses <loc0226><loc0026><loc0048><seg061><seg101><seg113>']



prefixes ['put the textured bird figure in the colorful caps <loc0267><loc0169><loc0053><seg045><seg078><seg086>', 'put the choker with heart in the square white mug <loc0187><loc0259><loc0053><seg058><seg039><seg016>']
prefixes ['put the brown seashell in the topographic model <loc0199><loc0254><loc0045><seg046><seg086><seg094>', 'put the molecular model display in the rectangular brown platform <loc0254><loc0488><loc0043><seg042><seg082><seg088>']

prefixes ['put the realistic brown pinecone inside the tartan mug <loc0097><loc0332><loc0042><seg043><seg060><seg055>', 'place the industrial check valve inside the yellow bat <loc0139><loc0093><loc0037><seg050><seg057><seg047>']
prefixes ['put the temple model in the red mushroom <loc0303><loc0441><loc0043><seg040><seg065><seg067>', 'put the modern t-shaped pipe in the blue bracket <loc0040><loc0181><loc0048><seg053><seg057><seg045>']



prefixes ['put the bone replica in the gray beige fossil <loc0268><loc0511><loc0032><seg045><seg070><seg077>', 'put the vibrant green pear in the mug <loc0244><loc0482><loc0038><seg051><seg089><seg102>']
prefixes ['pick up the faceted magenta ball and put it in the stylized black bat <loc0175><loc0388><loc0039><seg042><seg059><seg057>', 'put the gold dual-ring structure inside the small gold container <loc0291><loc0001><loc0040><seg061><seg100><seg111>']
prefixes ['put the creamy dessert cup in the natural decorative piece <loc0148><loc0511><loc0034><seg046><seg065><seg065>', 'put the roll of tape in the stylized beige rook <loc0247><loc0167><loc0048><seg062><seg026><seg015>']
prefixes ['put the rusty metal nut inside the small beige gazebo <loc0274><loc0243><loc0062><seg046><seg080><seg088>', 'put the small bird model in the model insect <loc0232><loc0110><loc0046><seg056><seg095><seg110>']




prefixes ['put the miniature house in the ceramic mug <loc0255><loc0119><loc0046><seg051><seg094><seg102>', 'put the smooth wooden block in the thin oval lenses <loc0228><loc0463><loc0051><seg039><seg072><seg074>']


OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 23.56 GiB of which 17.69 MiB is free. Process 691545 has 13.26 GiB memory in use. Including non-PyTorch memory, this process has 10.27 GiB memory in use. Of the allocated memory 9.57 GiB is allocated by PyTorch, and 395.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
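
The error message itself accounts for the memory: GPU 0 has 23.56 GiB, another process already holds 13.26 GiB, and this run had reached 10.27 GiB. The sketch below redoes that arithmetic and sets the allocator option the message suggests. The option reduces fragmentation only and does not create memory, so freeing the GPU or picking an idle device is likely the real fix here.

```python
import os

# Allocator setting suggested by the error message. It has to be set before
# the first CUDA allocation in the process to take effect.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


def headroom_gib(total, used_by_others, used_by_this):
    """GiB left on the device after both processes' usage is subtracted."""
    return total - used_by_others - used_by_this


# Figures copied from the OutOfMemoryError above.
left = headroom_gib(total=23.56, used_by_others=13.26, used_by_this=10.27)
print(f"headroom: {left:.2f} GiB")
```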

In [None]:
for key, value in inputs.items():
    if torch.is_tensor(value):
        inputs[key] = value.to(DEVICE)

In [None]:
print("Model device:", next(model.parameters()).device)
for k, v in inputs.items():
    if torch.is_tensor(v):
        print(f"  {k}: {v.device}")


In [None]:
print(next(model.parameters()).device)
print({k: v.device for k, v in inputs.items() if torch.is_tensor(v)})