## Chart Gemma (pre-trained)

In [2]:
from transformers import AutoProcessor, AutoModelForImageTextToText
from PIL import Image

processor = AutoProcessor.from_pretrained("ahmed-masry/chartgemma")
model = AutoModelForImageTextToText.from_pretrained("ahmed-masry/chartgemma")

image_path = "../ImageList/7.png"
image = Image.open(image_path)

text_input = "<image> <bos>"  

inputs = processor(images=image, text=text_input, return_tensors="pt")
outputs = model.generate(**inputs,max_new_tokens=200)
description = processor.decode(outputs[0], skip_special_tokens=True)

print("Generated Description:", description)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Generated Description:  
The chart shows the scaling required on model-simulated signals as a function of the 21st century warming (in degrees Celsius) for different climate models. The scaling is measured in units of standard deviation from the global mean, with the black line representing the mean scaling. The chart also shows the estimated contributions to the 21st century warming for different climate models. The contributions are measured in units of standard deviation from the global mean, with the black line representing the mean contribution. 

The chart shows that the scaling required for the 21st century warming is generally positive, meaning that the models tend to predict higher warming than the global mean. However, there is significant variation in the scaling required across different models. The chart also shows that the estimated contributions to the 21st century warming are generally positive, but the magnitude of the contribution varies across different models.

Some

## Chart Gemma (fine-tuned)

In [5]:
import os
import pandas as pd
from PIL import Image
import torch
import re
from torch.utils.data import Dataset, DataLoader
from transformers import AutoProcessor, AutoModelForImageTextToText, Seq2SeqTrainer, Seq2SeqTrainingArguments


data = pd.read_csv("../captions.csv")

def clean_caption(caption):
    caption = caption.lower()
    caption = re.sub(r"figure\s+\d+(\.\d+)*", "", caption)
    caption = re.sub(r"[^a-z0-9A-Z\s]", "", caption)
    caption = re.sub(r"\s+", " ", caption).strip()
    return caption
    

class ChartCaptionDataset(Dataset):
    def __init__(self, csv_file, img_folder, processor, max_length=128):
        self.data = pd.read_csv(csv_file)
        self.data['full_caption'] = self.data['full_caption'].apply(clean_caption)
        self.img_folder = img_folder
        self.processor = processor
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_id = self.data.iloc[idx]['imageid']
        caption = self.data.iloc[idx]['full_caption']
        img_path = os.path.join(self.img_folder, f"{img_id}.png")

        image = Image.open(img_path).convert('RGB')
        image = image.resize((128,128))
        text_input = "<image> <bos> " + caption
        inputs = self.processor(images=image, text=text_input, padding="max_length", truncation=True, return_tensors="pt")
        inputs = {k: v.squeeze() for k, v in inputs.items()}

        labels = inputs['input_ids'].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        inputs['labels'] = labels

        return inputs


processor = AutoProcessor.from_pretrained("ahmed-masry/chartgemma")
model = AutoModelForImageTextToText.from_pretrained("ahmed-masry/chartgemma")
model.gradient_checkpointing_enable()

# model = model.to("cpu")

dataset = ChartCaptionDataset(
    csv_file="../captions.csv",
    img_folder="../ImageList",
    processor=processor
)

dataloader = DataLoader(dataset, batch_size=1, shuffle=True)


training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    evaluation_strategy="no",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    logging_dir="./logs",
    output_dir="./chartgemma_finetune",
    num_train_epochs=3,
    save_steps=500,
    save_total_limit=2,
    fp16=True,
    gradient_accumulation_steps=8,
    dataloader_num_workers = 2,
)

def collate_fn(batch):
    input_ids = torch.nn.utils.rnn.pad_sequence([item['input_ids'] for item in batch], batch_first=True, padding_value=processor.tokenizer.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence([item['attention_mask'] for item in batch], batch_first=True, padding_value=0)
    pixel_values = torch.stack([item['pixel_values'] for item in batch])
    labels = torch.nn.utils.rnn.pad_sequence([item['labels'] for item in batch], batch_first=True, padding_value=-100)
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'pixel_values': pixel_values,
        'labels': labels
    }

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collate_fn,
)

for epoch in range(training_args.num_train_epochs):
    # torch.cuda.empty_cache()  
    trainer.train()

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokeniz

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 8.00 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 21.87 GiB is allocated by PyTorch, and 304.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)