# LongT5 training with the large crawled data set

2024-11-26 9:32 CET

This is the first full run on the large dataset of crawled web pages, including some with markup size reduction.

The results are not great. The perceptual loss is not really going down, and while there's some improvement in the generated text (see the examples in TensorBoard), it's all over the place. It seems like a dead end. More epochs might improve things somewhat, but it's alarming that the perceptual loss shows no improvement throughout the training.
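To put "throughout the training" in perspective, here is a rough back-of-the-envelope calculation of how much of the dataset the run covers. It uses values from the cells further down (202,664 training rows, `per_device_train_batch_size=1`, `gradient_accumulation_steps=16`) and assumes a single GPU, which the notebook doesn't state. `epoch_budget` is a helper name made up for this note, and the Trainer's own step count may differ by one depending on how it handles the last partial accumulation window.

```python
import math


def epoch_budget(num_rows, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Return (effective batch size, approximate optimizer steps per epoch)."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_rows / effective_batch)
    return effective_batch, steps_per_epoch


# Values taken from this notebook; num_devices=1 is an assumption.
effective_batch, steps_per_epoch = epoch_budget(202_664, 1, 16)

# Share of one epoch seen after 1000 optimizer steps
# (the last step shown in the results table).
seen_fraction = 1000 * effective_batch / 202_664

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: ~{steps_per_epoch}")
print(f"epoch covered at step 1000: {seen_fraction:.1%}")
```

Under these assumptions the first 1000 steps cover only about 8% of a single epoch.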

We might need two things here:
1. More data with individual, simpler elements and corresponding style sheets
2. Use a pretrained model, like Codestral. Perhaps the pretrained LongT5 would work too, but its tokenizer lacks many syntactically important tokens, so it can't be used as is. We would have to add all those tokens and resize the embeddings, which defeats the purpose of using a pretrained model.

In [1]:
# Check CUDA version! 

# !pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu125

!pip install -U git+https://github.com/huggingface/transformers.git@13493215abceafc1653af88b045120014fb4c1fc
!pip install -U git+https://github.com/huggingface/accelerate.git@8ade23cc6aec7c3bd3d80fef6378cafaade75bbe
!pip install -U git+https://github.com/huggingface/datasets.git@01f91bae037c98f2e05456287bab21470adb8f07
!pip install -U git+https://github.com/huggingface/evaluate.git@55f1bc6e072b05c2d9db1589a07e20f38902b1ec
!pip install -U git+https://github.com/huggingface/safetensors.git@f5839b6aee407652aa3078d91206b618dd84e3c2

!pip install pytest-playwright
!playwright install
!pip install matplotlib
!pip install pillow
!pip install torchvision
!pip install lpips

!playwright install-deps  

!pip install -U numpy
!pip install tensorboard

Collecting git+https://github.com/huggingface/transformers.git@13493215abceafc1653af88b045120014fb4c1fc
  Cloning https://github.com/huggingface/transformers.git (to revision 13493215abceafc1653af88b045120014fb4c1fc) to /tmp/pip-req-build-oebdabwp
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-oebdabwp
  Running command git rev-parse -q --verify 'sha^13493215abceafc1653af88b045120014fb4c1fc'
  Running command git fetch -q https://github.com/huggingface/transformers.git 13493215abceafc1653af88b045120014fb4c1fc
  Running command git checkout -q 13493215abceafc1653af88b045120014fb4c1fc
  Resolved https://github.com/huggingface/transformers.git to commit 13493215abceafc1653af88b045120014fb4c1fc
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.24.0 (from transformers==

In [2]:
!apt install zip -y
!rm -rf data-rb-large
!mkdir -p data-rb-large
!wget "https://www.dropbox.com/scl/fi/wifnx1vkuuet8tup5nnuw/data-rb-large-filtered.zip?rlkey=gut77ydbidjbf9qcrmyk55uic&st=8lt2w0tb&dl=1" -O model.zip
!unzip model.zip -d data-rb-large

!wget "https://www.dropbox.com/scl/fi/bx38ass59j6my2g2m01pb/tokenizer.json?rlkey=62zub5qk6ow7xa2vpa9gyglho&dl=1" -O data-rb-large/tokenizer.json

!rm -rf data-rb-validate
!mkdir -p data-rb-validate
!wget "https://www.dropbox.com/scl/fi/5szml8y5l248mcabj9rqg/verify-dataset.zip?rlkey=se33rwtxgngn0ts1i0pc8f6wk&st=1d68x9zt&dl=1" -O validate.zip
!unzip validate.zip -d data-rb-validate

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  unzip
The following NEW packages will be installed:
  unzip zip
0 upgraded, 2 newly installed, 0 to remove and 32 not upgraded.
Need to get 350 kB of archives.
After this operation, 930 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 unzip amd64 6.0-26ubuntu3.2 [175 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 zip amd64 3.0-12build2 [176 kB]
Fetched 350 kB in 0s (1585 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package unzip.
(Reading database ... 34390 files and directories currently installed.)
Preparing to unpack .../unzip_6.0-26ubuntu3.2_amd64.deb ...

In [1]:
from datasets import load_from_disk

filtered_data = load_from_disk('data-rb-large')

Loading dataset from disk:   0%|          | 0/17 [00:00<?, ?it/s]

In [2]:
filtered_data

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'input_length', 'labels', 'label_length'],
        num_rows: 202664
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'input_length', 'labels', 'label_length'],
        num_rows: 4
    })
})

In [3]:
from transformers import T5TokenizerFast
from tokenizers.processors import TemplateProcessing

tokenizer = T5TokenizerFast(
    tokenizer_file="data-rb-large/tokenizer.json",
    padding=True,
    eos_token="</s>",
    unk_token="<unk>",
    pad_token="<pad>",
)

post_processor = TemplateProcessing(
    single="$A </s>",
    pair="$A </s> $B </s>",
    special_tokens=[("</s>", tokenizer.eos_token_id)]
)

tokenizer._tokenizer.post_processor = post_processor

In [4]:
import numpy as np
from utils.similarity import calculate_metrics
from torch.utils.tensorboard import SummaryWriter
from PIL import Image
import torch

def add_image_to_tensorboard(name, step, img_path):
    image = Image.open(img_path)
    image = image.convert('RGB')
    image_array = np.array(image)
    image_tensor = torch.from_numpy(image_array)
    image_tensor = image_tensor.permute(2, 0, 1)
    image_tensor = image_tensor.float() / 255.0
    
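    # `log_dir` is defined in a later cell. The context manager flushes and closes the event file.
    with SummaryWriter(log_dir=log_dir) as writer:
        writer.add_image(name, image_tensor, step)

def add_text_to_tensorboard(name, step, text):
    with SummaryWriter(log_dir=log_dir) as writer:
        writer.add_text(name, text, step)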

def postprocess_text(preds, labels):
    preds = [pred.strip().replace('<unk>', '') for pred in preds]
    labels = [[label.strip().replace('<unk>', '')] for label in labels]

    return preds, labels

def create_compute_metrics(trainer):
    def compute_metrics(eval_predictions):
        try:
            predictions, labels = eval_predictions
            if isinstance(predictions, tuple):
                predictions = predictions[0]

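            # Only the last four samples are decoded and rendered.
            predictions = predictions[-4:]
            labels = labels[-4:]

            # -100 marks ignored/padded positions and cannot be decoded,
            # so map it to the pad token in both predictions and labels.
            predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
            decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)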

            decoded_predictions, decoded_labels = postprocess_text(decoded_predictions, decoded_labels)

            similarity_scores = []
            perceptual_losses = []
            index = 1
            for prediction, label in zip(decoded_predictions, decoded_labels):
                add_text_to_tensorboard(f'valid_{index}_label_text', trainer.state.global_step, label[0])
                add_text_to_tensorboard(f'valid_{index}_prediction_text', trainer.state.global_step, prediction)
                
                metrics = calculate_metrics(prediction, label[0])
                
                if metrics is not None:
                    similarity_scores.append(metrics['similarity'])
                    perceptual_losses.append(metrics['perceptual_loss'])
                    
                    add_image_to_tensorboard(f'valid_{index}_expectation', trainer.state.global_step, metrics['expected_screenshot_path'])
                    add_image_to_tensorboard(f'valid_{index}_prediction', trainer.state.global_step, metrics['predicted_screenshot_path'])
                
                index += 1

            result = {
                'similarity': float(np.mean(similarity_scores)),
                'perceptual_loss': float(np.mean(perceptual_losses)),
            }

            result = {k: round(v, 4) for k, v in result.items()}
            return result
        except Exception as e:
            print("Error during evaluation: ", str(e))
            print("Eval predictions: ", eval_predictions)
            
            return {
                'similarity': np.nan,
                'perceptual_loss': np.nan,
            }

    return compute_metrics

In [5]:
from transformers import DataCollatorForSeq2Seq

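# The `model` argument expects a model instance, not a checkpoint name, so it is omitted here.
# LongT5 derives decoder_input_ids from the labels on its own.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer)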

In [6]:
!pip install torch

In [7]:
from transformers import AutoModelForSeq2SeqLM, AutoConfig

config = AutoConfig.from_pretrained(
    "google/long-t5-tglobal-base",
    vocab_size=len(tokenizer),
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=tokenizer.pad_token_id,
)

model = AutoModelForSeq2SeqLM.from_config(config)
model.resize_token_embeddings(len(tokenizer))

Embedding(32228, 768)

In [8]:
config

LongT5Config {
  "_name_or_path": "google/long-t5-tglobal-base",
  "architectures": [
    "LongT5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 3,
  "dense_act_fn": "gelu_new",
  "dropout_rate": 0.1,
  "encoder_attention_type": "transient-global",
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "global_block_size": 16,
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "is_gated_act": true,
  "layer_norm_epsilon": 1e-06,
  "local_radius": 127,
  "model_type": "longt5",
  "n_positions": 4096,
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 3,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.47.0.dev0",
  "use_cache": true,
  "vocab_size": 32228
}

In [9]:
from transformers import GenerationConfig

generation_config = GenerationConfig.from_model_config(model.config)
generation_config._from_model_config = False
generation_config.max_new_tokens = 16_384

generation_config

GenerationConfig {
  "decoder_start_token_id": 3,
  "eos_token_id": 1,
  "max_new_tokens": 16384,
  "pad_token_id": 3
}

In [10]:
from datetime import datetime
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
model_output_dir = f"model-{timestamp}"

In [11]:
log_dir = f'{model_output_dir}/runs'

%load_ext tensorboard
%tensorboard --logdir '{log_dir}' --port=6006

In [12]:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, EarlyStoppingCallback

training_args = Seq2SeqTrainingArguments(
    output_dir=model_output_dir,
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    learning_rate=2e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,
    weight_decay=0.01,
    num_train_epochs=1,
    logging_strategy="steps",
    logging_steps=25,
    logging_dir=log_dir,
    predict_with_generate=True,
    bf16=True,
    generation_config=generation_config,
    report_to="tensorboard",
    save_safetensors=False,
    load_best_model_at_end=True,
    metric_for_best_model="perceptual_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=filtered_data["train"],
    eval_dataset=filtered_data["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=None,
    # callbacks = [EarlyStoppingCallback(early_stopping_patience=3, early_stopping_threshold=0.0)]
)

trainer.compute_metrics = create_compute_metrics(trainer)

trainer.train()

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]


| Step | Training Loss | Validation Loss | Similarity | Perceptual Loss | Runtime (s) | Samples Per Second | Steps Per Second |
|---|---|---|---|---|---|---|---|
| 100 | 62.9281 | 46.00621 | 0.3771 | 0.707 | 733.6331 | 0.005 | 0.005 |
| 200 | 59.5656 | 31.334991 | | | 539.5232 | 0.007 | 0.007 |
| 300 | 54.1582 | 29.095015 | 0.406 | 0.8417 | 733.2576 | 0.005 | 0.005 |
| 400 | 52.0824 | 28.818523 | 0.3771 | 0.706 | 557.4893 | 0.007 | 0.007 |
| 500 | 50.1403 | 25.55298 | 0.3771 | 0.706 | 559.4431 | 0.007 | 0.007 |
| 600 | 47.3848 | 24.87969 | | | 540.7399 | 0.007 | 0.007 |
| 700 | 45.4758 | 24.728485 | | | 369.3203 | 0.011 | 0.011 |
| 800 | 43.7241 | 23.391459 | | | 540.0625 | 0.007 | 0.007 |
| 900 | 42.7701 | 25.108578 | | | 538.5293 | 0.007 | 0.007 |
| 1000 | 41.9129 | 22.438965 | | | 537.3286 | 0.007 | 0.007 |
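A quick way to read the results table above is to compare the first and last recorded value of each column. The sketch below does that with the numbers copied from the table; `None` marks evaluation steps where no perceptual loss was recorded. `relative_change` is a helper name made up for this note.

```python
def relative_change(values):
    """Relative change between the first and last recorded (non-None) value."""
    recorded = [v for v in values if v is not None]
    if len(recorded) < 2:
        raise ValueError("need at least two recorded values")
    first, last = recorded[0], recorded[-1]
    return (last - first) / first


# Columns copied from the results table (steps 100 to 1000).
training_loss = [62.9281, 59.5656, 54.1582, 52.0824, 50.1403,
                 47.3848, 45.4758, 43.7241, 42.7701, 41.9129]
validation_loss = [46.00621, 31.334991, 29.095015, 28.818523, 25.55298,
                   24.87969, 24.728485, 23.391459, 25.108578, 22.438965]
perceptual_loss = [0.707, None, 0.8417, 0.706, 0.706,
                   None, None, None, None, None]

for name, series in (
    ("training loss", training_loss),
    ("validation loss", validation_loss),
    ("perceptual loss", perceptual_loss),
):
    print(f"{name}: {relative_change(series):+.1%}")
```

On these numbers the token-level training and validation losses fall by roughly a third and a half respectively, while the perceptual loss is essentially flat over the few steps where it was measured.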


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1d5be10>
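This error message is what the fast tokenizer typically raises when it is asked to decode a negative id. A likely cause here (an assumption, the log alone doesn't prove it) is that the generated predictions are padded with `-100` when the Trainer gathers them across evaluation batches, while `compute_metrics` only replaces `-100` in the labels. Below is a minimal sketch of the usual workaround, using small NumPy arrays as stand-ins for real predictions and the pad id 3 from the config above. `replace_ignore_index` is a hypothetical helper, not part of any library.

```python
import numpy as np

IGNORE_INDEX = -100  # value used to mark positions that should be ignored


def replace_ignore_index(token_ids, pad_token_id):
    """Swap every IGNORE_INDEX entry for the pad token so the ids can be decoded."""
    token_ids = np.asarray(token_ids)
    return np.where(token_ids != IGNORE_INDEX, token_ids, pad_token_id)


# Stand-in for gathered predictions: shorter sequences are padded with -100.
predictions = np.array([
    [3, 17, 42, 1, -100, -100],
    [3, 8, 1, -100, -100, -100],
])

cleaned = replace_ignore_index(predictions, pad_token_id=3)
print(cleaned)
```

After this step every id is non-negative, so `tokenizer.batch_decode(cleaned, skip_special_tokens=True)` should no longer hit the conversion error.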


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6ecee18d0>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6ecee2fd0>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6eca63810>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1c38710>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1bf3e90>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6eca40410>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6eceec790>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6ecad3390>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1bf14d0>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1bf0d10>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6ecad3f90>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1a229d0>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1a5bb90>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1cd4e10>


  with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):  # type: ignore[attr-defined]
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Trainer.to

Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6eca40e90>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1a10f90>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6eca608d0>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e131d3d0>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e13c1790>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1a22450>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e0f6a350>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e13fa190>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e0f6a050>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e13e6b90>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e1a772d0>



Error during evaluation:  out of range integral type conversion attempted
Eval predictions:  <transformers.trainer_utils.EvalPrediction object at 0x7af6e064e750>



TrainOutput(global_step=12666, training_loss=25.203807231469984, metrics={'train_runtime': 225425.6099, 'train_samples_per_second': 0.899, 'train_steps_per_second': 0.056, 'total_flos': 8.186587381801728e+17, 'train_loss': 25.203807231469984, 'epoch': 0.999960525796392})
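
A note on the repeated "Error during evaluation: out of range integral type conversion attempted" lines in the log above. This message is commonly reported when token id arrays that still contain the ignore index -100 (or other negative values) are passed to a fast tokenizer's decode call. That is one plausible cause here, not something this log confirms. A minimal sketch of the usual workaround follows, assuming the metric function receives NumPy arrays of token ids. The helper name `sanitize_token_ids` and the toy values are hypothetical and not taken from this notebook.

```python
import numpy as np

# Value the Hugging Face data collators conventionally use to mask out label positions.
IGNORE_INDEX = -100


def sanitize_token_ids(token_ids, pad_token_id):
    """Replace negative ids (such as -100) with pad_token_id.

    Hypothetical helper: negative ids cannot be converted to the unsigned
    integers a fast tokenizer expects, so they are mapped to the pad id,
    which skip_special_tokens=True would then drop during decoding.
    """
    token_ids = np.asarray(token_ids)
    return np.where(token_ids < 0, pad_token_id, token_ids)


# Toy batch: the first row was padded with -100 to match the longer second row.
predictions = np.array([
    [5, 17, 1, IGNORE_INDEX, IGNORE_INDEX],
    [8, 9, 10, 11, 1],
])

clean = sanitize_token_ids(predictions, pad_token_id=0)
print(clean)
```

Under this assumption, the metric function would apply the helper to both the predictions and the label ids (with `tokenizer.pad_token_id`) before calling `tokenizer.batch_decode`. If the error persists after that, the cause lies elsewhere, for example in ids outside the vocabulary range.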

In [15]:
import os

trainer.save_model(os.path.join(model_output_dir, "final"))
tokenizer.save_pretrained(os.path.join(model_output_dir, "final"))

('model-2024-11-23_08-14-52/final/tokenizer_config.json',
 'model-2024-11-23_08-14-52/final/special_tokens_map.json',
 'model-2024-11-23_08-14-52/final/tokenizer.json')

In [21]:
from transformers import LongT5ForConditionalGeneration

text = '<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="393" height="852" viewBox="0 0 393 852"><g id="html1"><g data-tag="head" id="head1" data-z-index="auto" data-stacking-context="true" aria-owns="script1"><g data-tag="script" id="script1" data-z-index="auto" data-stacking-context="true"/></g><g data-tag="body" id="body1" data-z-index="auto" data-stacking-context="true" role="document" aria-owns="style1"><g data-stacking-layer="rootBackgroundAndBorders"><rect width="377" height="836" x="8" y="8" fill="rgb(226, 168, 110)"/></g><g data-tag="style" id="style1" data-z-index="auto" data-stacking-context="true"/></g></g></svg>'

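# Tokenize the SVG input and move the token ids to the GPU
inputs = tokenizer(text, return_tensors="pt").input_ids
inputs = inputs.to('cuda:0')

# Load the saved final checkpoint from disk
model2 = LongT5ForConditionalGeneration.from_pretrained(os.path.join(model_output_dir, "final"))
model2.to('cuda:0')
# Reuse the generation settings defined earlier in the notebook
model2.generation_config = generation_config

outputs = model2.generate(inputs)

# Decode the first (and only) generated sequence back to markup
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translation)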

<body><div id="wrapper"><div id="wrapper"><div id="app"><div id="wrapper"><div id="page"><div id="wrapper"><div id="page"><div id="wrapper"><div id="page"><div id="wrapper"><div id="page"><img alt="b" src="/body> <style> </style>
