[![Roboflow Notebooks](https://media.roboflow.com/notebooks/template/bannertest2-2.png?ik-sdk-version=javascript-1.4.3&updatedAt=1672932710194)](https://github.com/roboflow/notebooks)

# Fine-tune PaliGemma2 on Object Detection Dataset

---

[![GitHub](https://badges.aleen42.com/src/github.svg)](https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/README.md)
[![arXiv](https://img.shields.io/badge/arXiv-2412.03555-b31b1b.svg)](https://arxiv.org/abs/2412.03555)

PaliGemma 2 is built by combining the SigLIP-So400m vision encoder with the more recent and capable language models from the Gemma 2 family.

![PaliGemma2 Figure.1](https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/paligemma2-1.png)

The authors use a 3-stage training approach similar to the original PaliGemma. In stage 1, they combine the pretrained vision and language model components and train them jointly on a multimodal task mixture. In stage 2, they train the models at higher resolutions of 448px^2 and 896px^2. In stage 3, they fine-tune the models on the target transfer tasks.

PaliGemma 2 models outperform the original PaliGemma at the same resolution and model size. Increasing the model size and resolution generally improves performance across a wide range of tasks, but the benefits differ depending on the task. Some tasks benefit more from increased resolution, while others benefit more from a larger language model.

![PaliGemma2 Figure.2](https://storage.googleapis.com/com-roboflow-marketing/notebooks/examples/paligemma2-2.png)

This notebook requires an A100 GPU with 40 GB of VRAM to train.

## Setup

### Configure your API keys

To fine-tune PaliGemma2, you need to provide your HuggingFace Token and Roboflow API key. Follow these steps:

- Open your [`HuggingFace Settings`](https://huggingface.co/settings) page. Click `Access Tokens`, then `New Token` to generate a new token.
- Go to your [`Roboflow Settings`](https://app.roboflow.com/settings/api) page. Click `Copy`. This will place your private key in the clipboard.
- In Colab, go to the left pane and click on `Secrets` (🔑).
    - Store HuggingFace Access Token under the name `HF_TOKEN`.
    - Store Roboflow API Key under the name `ROBOFLOW_API_KEY`.
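
Once the secrets are stored, they can be read from Python. The helper below is a minimal sketch: `read_secret` is a hypothetical name, and the fallback to environment variables is an assumption for running outside Colab, where the `google.colab` module is not available.

```python
import os
from typing import Optional


def read_secret(name: str) -> Optional[str]:
    """Return a secret from Colab's Secrets panel, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        return os.environ.get(name)
    try:
        return userdata.get(name)
    except Exception:
        # Secret missing or notebook access not granted: fall back to env vars.
        return os.environ.get(name)


for key in ("HF_TOKEN", "ROBOFLOW_API_KEY"):
    print(key, "is set" if read_secret(key) else "is missing")
```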

### Select the runtime

Let's make sure that we have access to a GPU. We can use the `nvidia-smi` command to do that. In case of any problems, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `A100 GPU`, and then click `Save`.
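
As a simpler alternative to parsing the full `nvidia-smi` table, the sketch below queries only the GPU name and total memory. `list_gpus` is a hypothetical helper; it returns an empty list when `nvidia-smi` is not installed or fails.

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one 'name, memory' line per visible GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
print(gpus if gpus else "No GPU visible; check the runtime settings.")
```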

In [None]:
!(nvidia-smi |tr -s ' '|grep -Eo "| [0123456789]+ N/A N/A [0-9]{3,} .*"|awk -F' ' '{system("s=$(cat /proc/"$4"/cmdline| tr \"\\0\" \" \");u=$(ps -o uname= -p "$4");echo "$1"sep"$4"sep$u sep"$7"sep" ) }'|sed 's/sep/\t/g')

### Download dataset from Roboflow Universe

To fine-tune PaliGemma2, prepare your dataset in JSONL format. You can use Roboflow to easily convert any dataset into this format.

In [None]:
#%pip install transformers==4.48.0 accelerate==1.3.0 (this is what I currently use)
#%pip install -q peft bitsandbytes transformers==4.47.0 tf-keras
#%pip install git+https://github.com/huggingface/transformers
#%pip install git+https://github.com/huggingface/accelerate

**NOTE:** Let's read the first few lines of the annotation file and examine the dataset format.
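
A minimal sketch of that inspection step follows. The records are illustrative placeholders, and the `image` / `prefix` / `suffix` field names are an assumption based on the `prefix` and `suffix` keys used later in this notebook; a real export may differ.

```python
import json
import tempfile
from pathlib import Path

# Illustrative records only: real image names, prompts, and labels will differ.
records = [
    {"image": "img_0001.jpg", "prefix": "detect cat ; dog",
     "suffix": "<loc0100><loc0200><loc0600><loc0700> cat"},
    {"image": "img_0002.jpg", "prefix": "detect cat ; dog",
     "suffix": "<loc0050><loc0080><loc0400><loc0500> dog"},
]

# Write the records to a temporary annotation file, one JSON object per line.
annotation_path = Path(tempfile.mkdtemp()) / "annotations.jsonl"
annotation_path.write_text("\n".join(json.dumps(r) for r in records) + "\n")


def head_jsonl(path, n=5):
    """Parse and return the first n records of a JSONL file."""
    rows = []
    with open(path) as f:
        for line in f:
            if len(rows) >= n:
                break
            if line.strip():
                rows.append(json.loads(line))
    return rows


for row in head_jsonl(annotation_path, n=2):
    print(row["image"], "|", row["prefix"], "->", row["suffix"])
```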

### Set up and test data loaders

In [None]:
%load_ext autoreload
%autoreload 2
import os
import json
import shutil
import random
from pathlib import Path
from torchvision import transforms

from cvla.utils_vis import render_example
from cvla.data_loader_h5 import H5Dataset
from cvla.data_loader_images import ImageFolderDataset
from cvla.data_augmentations import RandomizeBackgrounds, augment_image_rgb, complexify_text, DepthAugmentation

os.environ["CUDA_VISIBLE_DEVICES"]="0,1"


#dataset_location = Path("/tmp/cvla-7-obja")
#dataset_location = Path("/tmp/clevr-act-7-depth")
dataset_location = "/tmp/cvla-obja-camRF-sceneR-9"

model_location = Path("/data/lmbraid19/argusm/models")
save_path = model_location / (str(Path(dataset_location).stem) + "_e512s_depth")

return_depth = True
action_encoder = "xyzrotvec-cam-512xy128d"
bg_image_dataset = ImageFolderDataset("/tmp/indoorCVPR/Images", transform=transforms.RandomResizedCrop((448, 448)))
randomize_background = RandomizeBackgrounds(p=0.2, background_images = bg_image_dataset)
#augment_depth = DepthAugmentation(depth_range=(25, 100), max_delta_depth=30)
train_dataset = H5Dataset(dataset_location, augment_rgbds=randomize_background, augment_rgb=augment_image_rgb, augment_text=complexify_text,
                          action_encoder=action_encoder, return_depth=return_depth)
decc_data = train_dataset.action_encoder.decode_caption

print("dataset_location", dataset_location)
print("save_path", save_path)

cur_path = Path("hf_finetune_paligemma2.ipynb").resolve()
os.makedirs(save_path, exist_ok=True)
shutil.copy(cur_path, save_path / ("train_" + str(cur_path.name)))
# Use a context manager so the file is flushed and closed right away.
with open(save_path / "cvla_info.json", "w") as f:
    json.dump(dict(return_depth=return_depth, action_encoder=action_encoder), f)

num_samples = 3*4
html_imgs = ""
for i in range(num_samples):
    image, sample = train_dataset[i]
    image = image[1] if len(image) > 1 else image
    prefix = sample["prefix"]
    html_imgs += render_example(image, label=sample["suffix"], text=prefix, camera=sample["camera"], enc=train_dataset.action_encoder)

plot_images = True
if plot_images:
    from IPython.display import display, HTML
    display(HTML(html_imgs))

### Load PaliGemma2 model

**NOTE:** PaliGemma2 offers 9 pre-trained models with sizes of `3B`, `10B`, and `28B` parameters, and resolutions of `224`, `448`, and `896` pixels. This tutorial was written around the [`google/paligemma2-3b-pt-448`](https://huggingface.co/google/paligemma2-3b-pt-448) checkpoint; note that the loading cell below currently sets `MODEL_ID` to `google/paligemma2-3b-pt-224`. Resolution has a key impact on the mAP of the trained model, and `448` seems to offer the best balance between performance and the compute resources required to train the model.
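
The nine size and resolution combinations can be enumerated as below. This sketch assumes every checkpoint follows the same naming pattern as the `google/paligemma2-3b-pt-448` ID above; verify a name on the Hub before loading it.

```python
from itertools import product

SIZES = ("3b", "10b", "28b")
RESOLUTIONS = (224, 448, 896)

# Assumed naming pattern, extrapolated from the checkpoint ID used in this notebook.
checkpoints = [
    f"google/paligemma2-{size}-pt-{res}" for size, res in product(SIZES, RESOLUTIONS)
]

for name in checkpoints:
    print(name)
```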

In [None]:
#from huggingface_hub import notebook_login
#notebook_login()

In [None]:
import torch
from transformers import PaliGemmaProcessor, PaliGemmaForConditionalGeneration

# setting device on GPU if available, else CPU
print("cuda visible devices:", os.environ["CUDA_VISIBLE_DEVICES"])
devices_good = sorted((int(x) for x in os.environ["CUDA_VISIBLE_DEVICES"].split(",")))
DEVICE = torch.device('cuda')
print(DEVICE)
print('Using device:', DEVICE)
print("Good devices", devices_good)

TORCH_DTYPE = torch.bfloat16
MODEL_ID ="google/paligemma2-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(MODEL_ID)
model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=TORCH_DTYPE, device_map="auto", attn_implementation='eager')

In [None]:
# import requests
# from PIL import Image
# url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
# image = Image.open(requests.get(url, stream=True).raw)

# # Instruct the model to create a caption in English
# prompt = "caption en"
# model_inputs = processor(text=prompt, images=image, return_tensors="pt")
# input_len = model_inputs["input_ids"].shape[-1]

# with torch.inference_mode():
#     generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
#     generation = generation[0][input_len:]
#     decoded = processor.decode(generation, skip_special_tokens=True)
#     print(decoded)
# assert decoded == "automobile model is a classic car ."

In [None]:
def augment_suffix(suffix):
    parts = suffix.split(' ; ')
    random.shuffle(parts)
    return ' ; '.join(parts)

def collate_fn(batch):
    images, labels = zip(*batch)
    prefixes = ["<image>" + label["prefix"] for label in labels]
    suffixes = [augment_suffix(label["suffix"]) for label in labels]

    inputs = processor(
        text=prefixes,
        images=images,
        return_tensors="pt",
        suffix=suffixes,
        padding="longest"
    ).to(TORCH_DTYPE)#.to(DEVICE)

    return inputs

if return_depth:
    def collate_fn(batch):
        images, labels = zip(*batch)
        prefixes = ["<image><image>" + label["prefix"] for label in labels]
        suffixes = [augment_suffix(label["suffix"]) for label in labels]
        images_flat = [img for img_list_x in images for img in img_list_x]
        inputs = processor(
            text=prefixes,
            images=images_flat,
            return_tensors="pt",
            suffix=suffixes,
            padding="longest"
        ).to(TORCH_DTYPE)
        return inputs
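
To see what `augment_suffix` does on its own, here is a small standalone sketch. The suffix string is a placeholder, not a real label. The shuffle only reorders the ` ; `-separated entries, so the set of entries is unchanged.

```python
import random


def augment_suffix(suffix):
    # Same logic as in the cell above: shuffle the ' ; '-separated entries.
    parts = suffix.split(" ; ")
    random.shuffle(parts)
    return " ; ".join(parts)


random.seed(0)
original = "a ; b ; c"  # placeholder entries, not real labels
shuffled = augment_suffix(original)
print(shuffled)

# The order may differ, but the entries themselves are preserved.
assert sorted(shuffled.split(" ; ")) == sorted(original.split(" ; "))
```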


## Fine-tune with JAX settings

In [None]:
# import numpy as np
# def compute_metrics(eval_pred):
#     predictions, label_tokens = eval_pred  # Extract predictions and labels
#     if isinstance(predictions, tuple):  # Some models return tuples
#         predictions = predictions[0]

#     # Convert to token indices if necessary (e.g., for text generation models)
#     pred_tokens = np.argmax(predictions, axis=-1)  # Assuming logits, take argmax

#     pred_texts = processor.tokenizer.batch_decode(pred_tokens[:,-SEQLEN-1:], skip_special_tokens=True)
#     label_text = processor.tokenizer.batch_decode(label_tokens[:,-SEQLEN-1:], skip_special_tokens=True)

#     print(pred_tokens[:,-SEQLEN-1:])
#     print(label_tokens[:,-SEQLEN-1:])
#     print(label_text)
#     print(pred_texts)
#     print()
#     return {"accuracy": 0}

In [None]:
%reload_ext autoreload
%autoreload 2
from transformers import Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainer

for param in model.vision_tower.parameters():
    param.requires_grad = False

for param in model.multi_modal_projector.parameters():
    param.requires_grad = False
    
# Of the parameters that are still trainable, keep only the self-attention weights.
for name, param in model.named_parameters():
    if param.requires_grad and "self_attn" not in name:
        param.requires_grad = False

TRAIN_EXAMPLES = len(train_dataset)
BATCH_SIZE = 32
BATCH_SIZE_DEV = 8
if return_depth:
    BATCH_SIZE_DEV = 2
GRAD_ACCUM = int(round(BATCH_SIZE / BATCH_SIZE_DEV))
TRAIN_STEPS = (TRAIN_EXAMPLES // BATCH_SIZE)
SEQLEN = 12
SAVE_STEPS = max(1, TRAIN_STEPS // 15)  # at least 1, so small datasets do not yield save_steps=0
SAVE_LIMIT = 5


print("TRAIN_STEPS",TRAIN_STEPS)
print("GRAD_ACCUM", GRAD_ACCUM)

args_jax = Seq2SeqTrainingArguments(
    max_steps=TRAIN_STEPS,
    remove_unused_columns=False,
    per_device_train_batch_size=BATCH_SIZE_DEV,
    gradient_accumulation_steps=GRAD_ACCUM,
    learning_rate=3e-5,  # 1e-5, 2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=.05,
    generation_max_length=SEQLEN,
    logging_steps=10,
    optim="adafactor",
    save_strategy="steps",
    save_steps=SAVE_STEPS,
    save_total_limit=SAVE_LIMIT,
    output_dir=save_path,
    bf16=True,
    report_to=["tensorboard"],
    dataloader_pin_memory=False,
    dataloader_num_workers=4,
    #dataloader_prefetch_factor=2,
    #eval_strategy="steps",
    #eval_steps=4,
    #per_device_eval_batch_size=BATCH_SIZE_DEV,
    #eval_accumulation_steps=GRAD_ACCUM
)
#gradient_checkpointing=True,
#weight_decay=3e-7,
#     
trainer = Seq2SeqTrainer(
    model=model,
    train_dataset=train_dataset,
    #eval_dataset=train_dataset,
    data_collator=collate_fn,
    args=args_jax,
    #compute_metrics=compute_metrics
)
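
The batch-size arithmetic in the cell above can be checked in isolation. The sketch below reproduces it for a hypothetical dataset of 100,000 examples with depth enabled (per-device batch of 2). It assumes a single data-parallel process, so the effective batch size is the per-device batch times the accumulation steps.

```python
def training_schedule(num_examples, batch_size=32, per_device_batch=2, num_checkpoints=15):
    """Mirror the step arithmetic used when building the training arguments."""
    grad_accum = int(round(batch_size / per_device_batch))
    train_steps = num_examples // batch_size
    save_steps = max(1, train_steps // num_checkpoints)
    return {
        "grad_accum": grad_accum,
        "effective_batch": per_device_batch * grad_accum,
        "train_steps": train_steps,
        "save_steps": save_steps,
    }


plan = training_schedule(num_examples=100_000)  # hypothetical dataset size
print(plan)
```

With these numbers, one pass over the data takes 3125 optimizer steps, each accumulating 16 micro-batches of 2 examples, and a checkpoint is written every 208 steps.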

In [None]:
trainer.train()
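
For intuition about the learning-rate settings used above (`lr_scheduler_type="cosine"`, `warmup_ratio=.05`, `learning_rate=3e-5`), the sketch below computes a linear warmup followed by a half-cosine decay. It is a plain-Python approximation of that kind of schedule, not the library's code, and `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(step, total_steps, base_lr=3e-5, warmup_ratio=0.05):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # illustrative number of training steps
for step in (0, 25, 50, 500, 1000):
    print(step, lr_at_step(step, total))
```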

In [None]:

import sys
import torch
import transformers
import tokenizers
import accelerate

print('Python Version:', sys.version)
print('Torch Version:', torch.__version__)
print('CUDA Available:', torch.cuda.is_available())
print('CUDA Device Count:', torch.cuda.device_count())
print('GPU Name:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU')
print('Transformers Version:', transformers.__version__)
print('Tokenizers Version:', tokenizers.__version__)
print('Accelerate Version:', accelerate.__version__)

In [None]:
#print(save_path)
#

In [None]:
# from tqdm.notebook import tqdm
# test_samples = 5
# decode_dataset = [None, ]*test_samples
# for i in tqdm(range(test_samples), total=test_samples):
#     image, label = train_dataset[i]
#     prefix = "<image>" + label["prefix"]
#     suffix = label["suffix"]
#     inputs = processor(
#         text=prefix,
#         images=image,
#         return_tensors="pt",
#         suffix = [augment_suffix(suffix)]
#     ).to(TORCH_DTYPE).to(DEVICE)
#     prefix_length = inputs["input_ids"].shape[-1]

#     with torch.inference_mode():
#         generation = model.generate(**inputs, max_new_tokens=12, do_sample=False, use_cache=False)
#         generation = generation[0][prefix_length:]
#         decoded = processor.decode(generation, skip_special_tokens=True)
#     print(suffix)
#     print(decoded)
#     print()

### Run inference with fine-tuned PaliGemma2 model

In [None]:
# Load files
#model = PaliGemmaForConditionalGeneration.from_pretrained(save_path)

In [None]:
from tqdm.notebook import tqdm

def augment_suffix(suffix):
    parts = suffix.split(' ; ')
    random.shuffle(parts)
    return ' ; '.join(parts)

test_samples = 25
decode_dataset = [None, ]*test_samples
for i in tqdm(range(test_samples), total=test_samples):
    image, label = test_dataset[i]
    prefix = "<image>" + label["prefix"]
    suffix = label["suffix"]
    inputs = processor(
        text=prefix,
        images=image,
        return_tensors="pt",
        suffix = [augment_suffix(suffix)]
    ).to(TORCH_DTYPE).to(DEVICE)
    prefix_length = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=12, do_sample=False, use_cache=False)
        generation = generation[0][prefix_length:]
        decoded = processor.decode(generation, skip_special_tokens=True)
    decode_dataset[i] = decoded

print(decode_dataset)

In [None]:
import numpy as np
from PIL import Image

class ModelWrapper:
    def __init__(self, transformers_model=model):
        self.model = transformers_model

    def make_predictions(self, image, prefix):
        prefix = "<image>" + prefix
        image = Image.fromarray(image)
        inputs = processor(text=prefix,
                           images=image,
                           return_tensors="pt").to(TORCH_DTYPE).to(DEVICE)
        prefix_length = inputs["input_ids"].shape[-1]
        with torch.inference_mode():
            # Use the wrapped model rather than the global `model`.
            generation = self.model.generate(**inputs, max_new_tokens=12, do_sample=False, use_cache=False)
            generation = generation[0][prefix_length:]
            decoded = processor.decode(generation, skip_special_tokens=True)
        return None, None, None, decoded
model_wrapped = ModelWrapper(model)

i = 0
image, label = test_dataset[i]
print(image)
print(label["prefix"])
res = model_wrapped.make_predictions(np.asarray(image), label["prefix"])
print(res)


In [None]:
%reload_ext autoreload
%autoreload 2
import json
from PIL import Image
from mani_skill.examples.run_env import Args, iterate_env, save_dataset

        
parsed_args = Args()
parsed_args.env_id = "ClevrMove-v1"
parsed_args.render_mode = "rgb_array"
parsed_args.control_mode = "pd_joint_pos"

env_iter = iterate_env(parsed_args, vis=False, model=model_wrapped)

In [None]:
for i in range(50):
    next(env_iter)

## Some code to figure out inputs

Looks like the first 256 tokens (i.e. 16x16) will get replaced with the outputs of the image encoder.
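
The 256 figure matches a 16x16 patch grid. Assuming a patch size of 14 pixels for the SigLIP-So400m encoder, a square input yields the image-token counts computed below; `num_image_tokens` is a hypothetical helper.

```python
PATCH_SIZE = 14  # assumed patch size of the vision encoder


def num_image_tokens(resolution, patch_size=PATCH_SIZE):
    """Number of patch tokens for a square image of the given side length."""
    side = resolution // patch_size
    return side * side


for res in (224, 448, 896):
    print(res, "->", num_image_tokens(res), "image tokens")
```

Under that assumption, `224` gives 256 tokens, `448` gives 1024, and `896` gives 4096, which is one reason higher resolutions need much more memory.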

In [None]:
image, label = test_dataset[1]
prefix = "<image>" + label["prefix"]
suffix = label["suffix"]


inputs = processor(
    text=prefix,
    images=image,
    return_tensors="pt",
    #suffix = [augment_suffix(suffix)]
)

#print(label["suffix"])
#print(decoded)
for input_name, input_value in inputs.items():  # avoid shadowing the built-in `input`
    print(input_name, input_value.shape)
extra = 273 - 256 
print(extra, extra**.5)
print(inputs["input_ids"][:, 256:])


print(processor.decode(inputs["input_ids"][0, 256:]))
tmp = processor.decode([108])
#print(processor.tokenizer.eos_token)
#print(processor.image_token_id)

In [None]:
# Find where parameters are located

from collections import defaultdict
param_locations = defaultdict(list)
for name, param in model.named_parameters():
    #print(f"{name} -> {param.device}")
    # Append, so each device maps to the list of parameter names placed on it.
    param_locations[str(param.device)].append(name)

for k, v in param_locations.items():
    print(k, len(v))

#print(DEVICE)

In [None]:
# #Additional Info when using cuda
# import torch
# for i in range(torch.cuda.device_count()):
#    print('Allocated:', round(torch.cuda.memory_allocated(i)/1024**3,1), 'GB')
#    print('Cached:   ', round(torch.cuda.memory_reserved(i)/1024**3,1), 'GB')
#    print(torch.cuda.get_device_properties(i).name)


In [None]:
print("cuda visible devices:", os.environ["CUDA_VISIBLE_DEVICES"])
DEVICE = torch.device('cuda')
model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=TORCH_DTYPE, device_map="auto")

batch = [valid_dataset[i] for i in range(8)]
inputs = collate_fn(batch)
#generate_ids = model.generate(**inputs, max_length=286+30)
trainer.model.train()
trainer.compute_loss(model, inputs, return_outputs=False, num_items_in_batch=416)
print("works")
trainer.model.train(False)
trainer.compute_loss(model, inputs, return_outputs=False, num_items_in_batch=416)
print("fails.")

In [None]:

#raise ValueError
#pass
# orig_context_manager = trainer.compute_loss_context_manager
# class TempTrainContext(object):
#     def __init__(self, trainer):
#         self.trainer = trainer
#         self.orig_context_manager = trainer.compute_loss_context_manager
#     def __enter__(self):
#         self.orig_context_inst = self.orig_context_manager()
#         self.orig_context_inst.__enter__()
#         self.training_enter = self.trainer.model.training
#         self.trainer.model.train()
#     def __exit__(self, type, value, traceback):
#         self.trainer.model.train(self.training_enter)
#         self.orig_context_inst.__exit__(type, value, traceback)
#     def __call__(self):
#         return self
# trainer.compute_loss_context_manager = TempTrainContext(trainer)

## Fine-tune PaliGemma2 using LoRA

In [None]:
# # @title Freeze the image encoder

# TORCH_DTYPE = torch.bfloat16
# #model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=TORCH_DTYPE).to(DEVICE)
# model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype=TORCH_DTYPE, device_map="auto")## max_memory={1:"25GB",})# 5:"25GB", 6:"25GB"})  # was auto

# for param in model.vision_tower.parameters():
#     param.requires_grad = False

# for param in model.multi_modal_projector.parameters():
#     param.requires_grad = False

In [None]:
# # @title Fine-tune the entire model with LoRA and QLoRA
# from transformers import BitsAndBytesConfig
# from peft import get_peft_model, LoraConfig

# bnb_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.bfloat16
# )

# lora_config = LoraConfig(
#     r=8,
#     target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
#     task_type="CAUSAL_LM",
# )
# model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
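
To get a feel for why a rank of `r=8` keeps the trainable parameter count small, the sketch below counts the extra parameters LoRA adds to a single linear layer: two low-rank matrices of shapes `(d_in, r)` and `(r, d_out)`. The layer dimensions are illustrative and are not taken from the PaliGemma2 configuration.

```python
def lora_extra_params(d_in, d_out, r=8):
    """Parameters added by one LoRA adapter: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)


d_in = d_out = 2048  # illustrative layer size, not the real model dimensions
full = d_in * d_out
extra = lora_extra_params(d_in, d_out, r=8)

print("full weight matrix:", full)
print("LoRA adapter:      ", extra)
print("ratio:             ", extra / full)
```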

In [None]:
# from transformers import Trainer, TrainingArguments, Seq2SeqTrainingArguments

# TRAIN_EXAMPLES = len(train_dataset.entries)
# BATCH_SIZE = 18
# BATCH_SIZE_DEV = 6
# GRAD_ACCUM = BATCH_SIZE // BATCH_SIZE_DEV

# TRAIN_STEPS = TRAIN_EXAMPLES // BATCH_SIZE
# SAVE_STEPS = TRAIN_STEPS // 8
# SEQLEN = 32

# args_lora = Seq2SeqTrainingArguments(
#     num_train_epochs=1,
#     remove_unused_columns=False,
#     per_device_train_batch_size=BATCH_SIZE_DEV,
#     gradient_accumulation_steps=GRAD_ACCUM,
#     #gradient_checkpointing=True, use_cache=False,
#     generation_max_length=SEQLEN,
#     warmup_steps=2,
#     learning_rate=.005,  # 2e-5
#     weight_decay=1e-6,
#     adam_beta2=0.999,
#     logging_steps=10,
#     optim="adamw_hf",
#     save_strategy="steps",
#     save_steps=1000,
#     save_total_limit=1,
#     output_dir=save_path,
#     bf16=True,
#     report_to=["tensorboard"],
#     dataloader_pin_memory=False
# )

# trainer = Trainer(
#     model=model,
#     train_dataset=train_dataset,
#     #eval_dataset=valid_dataset,
#     data_collator=collate_fn,
#     args=args_lora
# )

# trainer.train()
#print(save_path)
#trainer.save_model(save_path)

In [None]:
from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast

class CustomProcessor:
    """
    A wrapper around a Hugging Face Processor (e.g., PaliGemmaProcessor) that allows
    overriding or adding token mappings according to a given dictionary of new tokens.

    Args:
        processor: A Hugging Face Processor object with a .tokenizer attribute.
        new_tokens: Dict[str, int] mapping token strings (e.g. "<my_token>") to desired token IDs.
    """
    def __init__(self, processor, new_tokens: dict):
        self.processor = processor
        self.tokenizer = processor.tokenizer
        self._override_tokens(new_tokens)

    def _override_tokens(self, new_tokens: dict):
        # Update tokenizer mappings
        # Supports both Python and Rust tokenizers
        enc = getattr(self.tokenizer, 'encoder', None)
        dec = getattr(self.tokenizer, 'decoder', None)
        vocab = getattr(self.tokenizer, 'vocab', None)
        ids_to_tokens = getattr(self.tokenizer, 'ids_to_tokens', None)

        for token, token_id in new_tokens.items():
            # Update encoder (token -> id)
            if enc is not None:
                enc[token] = token_id
            # Update vocab for Python tokenizers
            if vocab is not None:
                vocab[token] = token_id
            # Update decoder (id -> token)
            if dec is not None:
                dec[token_id] = token
            # Update fast tokenizer ids_to_tokens
            if ids_to_tokens is not None:
                ids_to_tokens[token_id] = token

        # If using a fast tokenizer, ensure the tokenizer knows about special tokens
        # so they get recognized during tokenization and decoding
        self.tokenizer.special_tokens_map_extended = {
            **getattr(self.tokenizer, 'special_tokens_map_extended', {}),
            **{token: token for token in new_tokens.keys()}
        }

    def __call__(self, *args, **kwargs):
        # Delegate tokenization to the underlying processor
        return self.processor(*args, **kwargs)

    def decode(self, token_ids, **kwargs):
        # Delegate decoding, using the possibly overridden mappings
        return self.processor.decode(token_ids, **kwargs)

    def save_pretrained(self, save_directory):
        # Save both processor and tokenizer adjustments
        self.processor.save_pretrained(save_directory)
        self.tokenizer.save_pretrained(save_directory)

def get_last_n_tokens(processor, n=100):
    tok = processor.tokenizer
    try:
        vocab = tok.get_vocab()
    except AttributeError:
        vocab = tok.vocab

    id_to_token = {id_: t for t, id_ in vocab.items()}
    last_ids = sorted(id_to_token)[-n:]
    return [(i, id_to_token[i]) for i in last_ids]

# Example usage:
# from transformers import PaliGemmaProcessor
# base_processor = PaliGemmaProcessor.from_pretrained('...')
# new_tokens = {'<my_token_1>': 10000, '<my_token_2>': 10001}
# processor = CustomProcessor(base_processor, new_tokens)
# encoded = processor('some text <my_token_1> more text')
# decoded = processor.decode(encoded['input_ids'])
#print("special tokens", processor.tokenizer.all_special_tokens)
#get_last_n_tokens(processor, 100+1024+128)


my_tokens = [f"<pos{x:03d}>" for x in range(512)] + [f"<dep{x:03d}>" for x in range(128)] + [f"<rot{x:03d}>" for x in range(128)]
print(len(my_tokens))
print(my_tokens[-1])
print(my_tokens)
last_token = 255967 # the last token to use
new_tokens = {token: last_token - i for i, token in enumerate(reversed(my_tokens))}

# Quick check
print(f"Total tokens: {len(new_tokens)}")
print(f"{my_tokens[0]} -> {new_tokens[my_tokens[0]]}")
print(f"{my_tokens[-1]} -> {new_tokens[my_tokens[-1]]}")
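
The sketch below rebuilds the same mapping standalone and checks its layout: 768 tokens occupy a contiguous ID range ending at `last_token`. The `to_pos_token` helper is hypothetical and assumes uniform binning of a value in `[0, 1]` into 512 position bins; the actual action encoder may quantize differently.

```python
my_tokens = (
    [f"<pos{x:03d}>" for x in range(512)]
    + [f"<dep{x:03d}>" for x in range(128)]
    + [f"<rot{x:03d}>" for x in range(128)]
)
last_token = 255967  # the last token ID to use
new_tokens = {token: last_token - i for i, token in enumerate(reversed(my_tokens))}

first_id = min(new_tokens.values())
print("ID range:", first_id, "to", max(new_tokens.values()))
for name in ("<pos000>", "<dep000>", "<rot000>", "<rot127>"):
    print(name, "->", new_tokens[name])


def to_pos_token(value, num_bins=512):
    """Map a value in [0, 1] to a <posNNN> token (assumed uniform binning)."""
    idx = min(num_bins - 1, max(0, int(value * num_bins)))
    return f"<pos{idx:03d}>"


token = to_pos_token(0.5)
print(token, "->", new_tokens[token])
```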

In [None]:
#<points x1="33.0" y1="63.7" x2="34.2" y2="67.1" alt="move">text</points>
#<eetraj x1="33.0" y1="63.7" d1="12.3" ra1="12.3" rb="12.3" rc="12.3" ... alt="object_name">object_name</eetraj>