# Fine-tuning Llama-2 Model with DeepSpeed on Intel Gaudi

In this Jupyter notebook, we fine-tune a [Llama-2-70b](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) model with DeepSpeed on Intel Gaudi. We use PyTorch for model training, Ray for distributed training, and the [tatsu-lab/alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset.

[Intel Gaudi AI Processors (HPUs)](https://habana.ai) are AI hardware accelerators designed by Habana Labs. For more information, see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/index.html) and [Gaudi Developer Docs](https://developer.habana.ai/).

Basic features for this fine-tuning example are:
- Runs on HPUs and supports three execution modes: ["lazy", "eager", "eager.compile"](https://docs.habana.ai/en/latest/PyTorch/Reference/PyTorch_Gaudi_Theory_of_Operations.html).
- DeepSpeed integration and LoRA training.
- [`GaudiTrainer`](https://github.com/huggingface/optimum-habana/blob/main/optimum/habana/transformers/trainer.py)-based training.
- Llama-2-70b model.
- Ray-based scheduling and management.

## Prepare environment
This example runs on a single node with 8 HPUs.

We recommend using a prebuilt container to run these examples. To run a container, you need Docker. See [Install Docker Engine](https://docs.docker.com/engine/install/) for installation instructions.

Next, follow [Run Using Containers](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html?highlight=installer#run-using-containers) to install the Habana drivers and container runtime.

### Get docker image
``` bash
docker pull vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
```
### Run docker image
``` bash
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
# optionally, mount your workspace directory into the container as a volume
```
### Install dependency
``` bash
# for execution mode "eager" or "eager.compile", please install "optimum-habana>1.11.1"
pip install ray[train] notebook transformers datasets evaluate peft accelerate scikit-learn optimum-habana
# install deepspeed
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.15.1

# this notebook was verified with the following package versions:
# transformers==4.38.2
# datasets==2.19.1
# evaluate==0.4.2
# peft==0.4.0
# accelerate==0.27.2
# scikit-learn==1.4.2
# optimum-habana==1.11.1
# deepspeed==0.12.4+hpu.synapse.v1.15.0
```
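
After installation, it can help to compare the installed packages against the versions listed above. The following is a minimal sketch, not part of the original notebook: the helper name `check_versions` is hypothetical, and it relies only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata

# Versions this notebook was verified with (copied from the install step above).
EXPECTED = {
    "transformers": "4.38.2",
    "datasets": "2.19.1",
    "evaluate": "0.4.2",
    "peft": "0.4.0",
    "accelerate": "0.27.2",
    "scikit-learn": "1.4.2",
    "optimum-habana": "1.11.1",
}


def check_versions(expected):
    """Return {package: (installed_or_None, expected)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (installed, wanted) in check_versions(EXPECTED).items():
        print(f"{name}: installed={installed}, verified with {wanted}")
```

A mismatch is not necessarily a problem; it only indicates that the environment differs from the one the notebook was verified with.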

In [1]:
import os
import copy
import time
from typing import Dict

import torch
from torch import nn
from torch.utils.data import DataLoader

import datasets
import transformers
from transformers import (
    Trainer,
    TrainingArguments,
    AutoConfig,
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    AutoModelForSequenceClassification,
)

from tqdm import tqdm

import peft

from optimum.habana import GaudiTrainer, GaudiConfig, GaudiTrainingArguments
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi



## Prepare Dataset Function

Format each record of the raw dataset into a prompt (source) and a response (target) using a fixed prompt template.

In [2]:

def preprocess_dataset(raw_datasets):

    PROMPT_DICT = {
        "prompt_with_input": (
            "Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
        ),
        "prompt_without_input": (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:"
        ),
    }

    def create_prompts(examples):
        prompts = {}
        prompts["source"] = []
        prompts["target"] = []
        for example in examples:
            prompt_template = (
                PROMPT_DICT["prompt_with_input"] if example["input"] != "" else PROMPT_DICT["prompt_without_input"]
            )
            source = prompt_template.format_map(example)
            prompts["source"].append(source)
            prompts["target"].append(example["output"])
        return prompts

    # Preprocessing the datasets.
    for key in raw_datasets:
        prompts = create_prompts(raw_datasets[key])
        columns_to_be_removed = list(raw_datasets[key].features.keys())
        raw_datasets[key] = raw_datasets[key].add_column("prompt_sources", prompts["source"])
        raw_datasets[key] = raw_datasets[key].add_column("prompt_targets", prompts["target"])
        raw_datasets[key] = raw_datasets[key].remove_columns(columns_to_be_removed)
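
To see what the template selection above produces, here is a small standalone sketch that applies the same two templates to one made-up record. The helper name `build_prompt` and the sample record are hypothetical; only the template text and the `instruction`/`input`/`output` keys come from the function above.

```python
# Same template text as PROMPT_DICT in preprocess_dataset.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_WITHOUT_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(example):
    """Pick the template based on whether the record has a non-empty input."""
    template = PROMPT_WITH_INPUT if example["input"] != "" else PROMPT_WITHOUT_INPUT
    # format_map ignores keys the template does not use (such as "output").
    return template.format_map(example)


# Hypothetical record with the same keys the preprocessing code reads.
sample = {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
print(build_prompt(sample))
```

The prompt becomes the `prompt_sources` column and the record's `output` becomes `prompt_targets`; the two are concatenated later, before tokenization.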

## Dataset to Tokenizer Function

Tokenize each record in the dataset with the model's tokenizer.

In the example code, the tokenized records are concatenated and split into fixed-length blocks of `max_seq_length` tokens to speed up training.

The whole dataset is used as the "train" split; no evaluation split is sampled from `raw_datasets`.

In [3]:

def preprocess_dataset_to_tokenizer(raw_datasets, tokenizer):
    max_seq_length = 512
    tokenizer.pad_token_id = 0
    tokenizer.eos_token_id = 1
    tokenizer.bos_token_id = 2

    def tokenize(prompt, add_eos_token=True):
        results = tokenizer(
            prompt,
            truncation=True,
            max_length=max_seq_length,
            padding=False,
            return_tensors=None,
        )
        for i in range(len(results["input_ids"])):
            if (
                results["input_ids"][i][-1] != tokenizer.eos_token_id
                and len(results["input_ids"][i]) < max_seq_length
                and add_eos_token
            ):
                results["input_ids"][i].append(tokenizer.eos_token_id)
                results["attention_mask"][i].append(1)

        results["labels"] = copy.deepcopy(results["input_ids"])
        results["input_id_len"] = [len(result) for result in results["input_ids"]]
        return results

    def preprocess_function(examples):
        keys = list(examples.data.keys())
        if len(keys) != 2:
            raise ValueError("Unsupported dataset format")

        st = [s + t for s, t in zip(examples[keys[0]], examples[keys[1]])]

        examples_tokenized = tokenize(st)
        input_ids = examples_tokenized["input_ids"]
        labels = examples_tokenized["labels"]
        return {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": examples_tokenized["attention_mask"],
        }

    tokenized_datasets = raw_datasets.map(
        preprocess_function,
        batched=True,
        load_from_cache_file=True,
    )

    def concatenate_data(dataset, max_seq_length):
        concatenated_dataset = {}
        for column in dataset.features:
            concatenated_data = [item for sample in dataset[column] for item in sample]
            reshaped_data = [
                concatenated_data[i * max_seq_length : (i + 1) * max_seq_length]
                for i in range(len(concatenated_data) // max_seq_length)
            ]
            concatenated_dataset[column] = reshaped_data
        return datasets.Dataset.from_dict(concatenated_dataset)

    tokenized_datasets_ = tokenized_datasets["train"].remove_columns(["prompt_sources", "prompt_targets"])
    tokenized_datasets["train"] = concatenate_data(tokenized_datasets_, max_seq_length)

    return tokenized_datasets
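
The concatenation step in `concatenate_data` can be illustrated on toy data. This is a minimal sketch with a hypothetical helper name (`pack_sequences`) and a tiny block size in place of `max_seq_length = 512`; it follows the same logic of flattening all samples and keeping only full blocks.

```python
def pack_sequences(sequences, block_size):
    """Concatenate token lists and split the result into full blocks of block_size tokens."""
    flat = [token for seq in sequences for token in seq]
    n_blocks = len(flat) // block_size  # the trailing partial block is dropped
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Three "tokenized samples" of different lengths (9 tokens in total).
toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = pack_sequences(toy, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Because `input_ids`, `labels`, and `attention_mask` have the same length for every sample, packing each column this way keeps them aligned. Every block is full, so no padding tokens are needed, at the cost of dropping the last few tokens and letting sample boundaries fall inside a block.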

## Prepare training arguments

Creating a `GaudiTrainingArguments` instance also triggers the essential HPU initialization, such as selecting the HPU device.

In [4]:

def prepare_training_args(config: Dict):
    # prepare execution mode config
    execution_mode = config["execution_mode"]
    use_lazy_mode = execution_mode == "lazy"
    torch_compile_backend = "hpu_backend" if execution_mode == "eager.compile" else None

    return GaudiTrainingArguments(deepspeed=config["deepspeed"],
                                  output_dir=config["output"],
                                  do_train=True,
                                  do_eval=False,
                                  per_device_train_batch_size=config["batch_size_per_worker"],
                                  bf16=True,
                                  learning_rate=config["lr"],
                                  save_strategy="no",
                                  torch_compile_backend=torch_compile_backend,
                                  evaluation_strategy="no",
                                  lr_scheduler_type="cosine",
                                  num_train_epochs=config["epochs"],
                                  use_lazy_mode=use_lazy_mode,
                                  use_habana=True,
                                  pipelining_fwd_bwd=True,
                                  save_only_model=True,
                                  gradient_checkpointing=True,
                                  warmup_ratio=0.03,
                                  throughput_warmup_steps=3,
                                  logging_steps=5)
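
The mapping from execution mode to settings is spread across this notebook (`use_lazy_mode` and `torch_compile_backend` here, `PT_HPU_LAZY_MODE` in the last cell). As a sketch only, the hypothetical helper below collects that mapping in one place and rejects unknown mode names; it is not part of `optimum-habana`.

```python
VALID_MODES = ("lazy", "eager", "eager.compile")


def resolve_execution_mode(mode):
    """Map an execution mode name to the settings this notebook derives from it."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unsupported execution mode {mode!r}; expected one of {VALID_MODES}")
    return {
        # passed to GaudiTrainingArguments(use_lazy_mode=...)
        "use_lazy_mode": mode == "lazy",
        # passed to GaudiTrainingArguments(torch_compile_backend=...)
        "torch_compile_backend": "hpu_backend" if mode == "eager.compile" else None,
        # value for the PT_HPU_LAZY_MODE environment variable
        "PT_HPU_LAZY_MODE": "1" if mode == "lazy" else "0",
    }


for mode in VALID_MODES:
    print(mode, resolve_execution_mode(mode))
```

Validating the mode name up front avoids silently falling back to eager execution when the string contains a typo.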

## Training Function

This function is executed by each worker during training and performs the following steps:

- Load and preprocess the dataset, using only the first 4096 records for training.
- Load the tokenizer of the pretrained model and tokenize the dataset.
- Load the pretrained model, convert it to a LoRA model, and move it to the HPU device.
- Prepare the data collator.
- Prepare the training arguments, an instance of `GaudiTrainingArguments`.
- Create a `GaudiTrainer` instance.
- Call `train()` to train the model.
- Save the trained model.


Compared to a training function for GPU, no changes are needed to port to HPU. Internally, Ray Train does these things:

- Detects the HPU and sets the device.
- Initializes the Habana PyTorch backend.
- Initializes the Habana distributed backend.

In [5]:

def train_func_per_worker(config: Dict):
    # adapt transformers to gaudi
    adapt_transformers_to_gaudi()

    # prepare training arguments
    training_args = prepare_training_args(config)

    # prepare datasets
    # here we use dataset "tatsu-lab/alpaca" from huggingface
    # and sample some part
    raw_datasets = datasets.DatasetDict({"train": datasets.load_dataset("tatsu-lab/alpaca", split='train[0:4096]')})
    preprocess_dataset(raw_datasets)

    # prepare tokenizer
    tokenizer = AutoTokenizer.from_pretrained(config["model"])
    tokenized_datasets = preprocess_dataset_to_tokenizer(raw_datasets, tokenizer)

    # prepare model
    if config["deepspeed"] is not None:
        auto_config = AutoConfig.from_pretrained(config["model"], use_cache=False, revision="main", use_auth_token=None, trust_remote_code=None)
        model = AutoModelForCausalLM.from_pretrained(config["model"], config=auto_config, **config["model_config"])
        model.generation_config.attn_softmax_bf16 = True
        model.generation_config.use_flash_attention = True
    else:
        model = AutoModelForCausalLM.from_pretrained(config["model"], **config["model_config"])

    peft_config = peft.LoraConfig(**config["lora_config"])
    model.enable_input_require_grads()
    model = peft.get_peft_model(model, peft_config)
    device = training_args.device
    model.to(dtype=config["model_config"]["torch_dtype"], device=device)

    # prepare data collator
    data_collator = DataCollatorForLanguageModeling(tokenizer, pad_to_multiple_of=8, return_tensors="pt", mlm=False)

    gaudi_config = GaudiConfig()
    gaudi_config.use_fused_adam = True
    gaudi_config.use_fused_clip_norm = True

    trainer = GaudiTrainer(
        model=model,
        gaudi_config=gaudi_config,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=None,
        tokenizer=tokenizer,
        data_collator=data_collator,
        compute_metrics=None,
        preprocess_logits_for_metrics=None,
    )

    train_result = trainer.train()
    print(f"train_result = {train_result}")
    trainer.save_model()

    return train_result
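
To get a feel for how small the LoRA adapter created by `peft.get_peft_model` is, the sketch below estimates its trainable parameter count. It assumes the usual LoRA parameterization (each adapted linear layer gains two matrices with `r * (in_features + out_features)` parameters in total) and the published Llama-2-70b dimensions (hidden size 8192, 80 layers, 64 attention heads, 8 key/value heads). These dimensions are assumptions, not values read from this notebook; the rank and target modules match the `lora_config` used later.

```python
def lora_param_count(r, shapes, num_layers):
    """Trainable LoRA parameters: r * (in_features + out_features) per adapted layer."""
    per_layer = sum(r * (in_f + out_f) for in_f, out_f in shapes.values())
    return per_layer * num_layers


# Assumed Llama-2-70b dimensions (not taken from this notebook).
hidden = 8192
head_dim = hidden // 64   # 64 attention heads -> head dimension 128
kv_dim = 8 * head_dim     # 8 key/value heads (grouped-query attention) -> 1024

# (in_features, out_features) of the four projections targeted by lora_config.
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
}

total = lora_param_count(r=4, shapes=shapes, num_layers=80)
print(f"{total:,} trainable LoRA parameters")  # 16,384,000
# 68.98e9 is the element count DeepSpeed reports in the training log further down.
print(f"fraction of the base model: {total / 68.98e9:.6%}")
```

Under these assumptions only a tiny fraction of the weights receives gradients, which is what makes fine-tuning a 70B model on a single 8-HPU node practical.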

## Main Training Function
The `train_llama` function sets up the distributed training environment using Ray and starts the training process. To enable training using HPU, we only need to make the following changes:
- Set the execution mode for training. The supported execution modes are:

    - "lazy": Deferred execution of graphs, comprising of ops delivered from script op by op similar to Eager mode. It gives the Eager mode experience with performance on Gaudi. Unlike Eager Mode with torch.compile, graph is analyzed in each iteration leading to a higher CPU usage.
    - "eager": Op-by-op execution as defined in standard PyTorch Eager mode scripts.
    - "eager.compile": Eager mode extended with `torch.compile` - Similar to Eager mode but extended with wrapping complete or part of model (such as a function) into a graph. Parts that are not wrapped are executed eagerly.

    More details on the theory of operations can be found [here](https://docs.habana.ai/en/latest/PyTorch/Reference/PyTorch_Gaudi_Theory_of_Operations.html), and detailed performance results can be found [here](https://developer.habana.ai/get-started/habana-models-performance/).
- Require an HPU for each worker in `ScalingConfig`.
- Set the backend to `hccl` in `TorchConfig`.

In [6]:

def train_llama(num_workers, execution_mode):
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer, TorchConfig

    # DeepSpeed config; it can also be stored in a separate config file
    deepspeed_config = {
        "steps_per_print": 64,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "bf16": {
            "enabled": True
        },
        "gradient_clipping": 1.0,
        "zero_optimization": {
            "stage": 3,
            "overlap_comm": False,
            "contiguous_gradients": False,
            "stage3_gather_16bit_weights_on_model_save": True
        }
    }

    # Preparing train configurations
    train_config = {
        "execution_mode": execution_mode,
        "model": "/root/models/models--meta-llama--Llama-2-70b-chat-hf/snapshots/e9149a12809580e8602995856f8098ce973d1080/",
        "model_config": {"torch_dtype": torch.bfloat16, "trust_remote_code": None, "use_auth_token": None},
        "lora_config": {"task_type": "CAUSAL_LM", "r": 4, "lora_dropout": 0.1, "target_modules": ["q_proj", "v_proj", "k_proj", "o_proj"]},
        "lr": 0.0018,
        "epochs": 2,
        "batch_size_per_worker": 10,
        "output": "/tmp/ray/",
        "deepspeed": deepspeed_config,
    }

    # Configure computation resources
    # In ScalingConfig, require an HPU for each worker
    scaling_config = ScalingConfig(num_workers=num_workers,
                                   use_gpu=False,
                                   resources_per_worker={"CPU": 1, "HPU": 1})
    # Set backend to hccl in TorchConfig
    torch_config = TorchConfig(backend="hccl")

    # start your ray cluster
    ray.init()

    # Initialize a Ray TorchTrainer
    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        train_loop_config=train_config,
        torch_config=torch_config,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")
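
The DeepSpeed config above leaves the batch-size fields as `"auto"`. As a sketch of the arithmetic, assuming the Hugging Face Trainer integration fills these fields from the training arguments and that `gradient_accumulation_steps` keeps its default of 1 (it is not set in `prepare_training_args`), the effective values for this notebook would be as follows. The helper name is hypothetical.

```python
def resolve_auto_batch_sizes(per_device_batch_size, world_size, gradient_accumulation_steps=1):
    """Compute the values that would replace the "auto" batch-size entries."""
    return {
        "train_micro_batch_size_per_gpu": per_device_batch_size,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        # global batch = micro batch * accumulation steps * number of workers
        "train_batch_size": per_device_batch_size * gradient_accumulation_steps * world_size,
    }


# batch_size_per_worker=10 from train_config, 8 workers (one per HPU).
resolved = resolve_auto_batch_sizes(per_device_batch_size=10, world_size=8)
print(resolved)
# {'train_micro_batch_size_per_gpu': 10, 'gradient_accumulation_steps': 1, 'train_batch_size': 80}
```

Keeping these fields as `"auto"` means the batch size has to be changed in only one place, `batch_size_per_worker`.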

## Start Training

Finally, we call the `train_llama` function to start the training process. You can adjust the number of workers to use, and the execution mode for HPU.

In [7]:
# set some environment variables
os.environ["RAY_EXPERIMENTAL_NOSET_HABANA_VISIBLE_MODULES"] = "0"
os.environ["PT_HPU_MAX_COMPOUND_OP_SIZE"] = "10"
os.environ["DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED"] = "1"
# execution_mode is one of ["lazy", "eager", "eager.compile"]
execution_mode = "lazy"
os.environ["PT_HPU_LAZY_MODE"] = "1" if execution_mode == "lazy" else "0"
train_llama(num_workers=8, execution_mode=execution_mode)

2024-05-08 01:35:23,594	INFO worker.py:1749 -- Started a local Ray instance.
2024-05-08 01:35:24,779	INFO tune.py:614 -- [output] This uses the legacy output and progress reporter, as Jupyter notebooks are not supported by the new engine, yet. For more information, please see https://github.com/ray-project/ray/issues/36949


== Status ==
Current time: 2024-05-08 01:35:24 (running for 00:00:00.11)
Using FIFO scheduling algorithm.
Logical resource usage: 0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 0.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 PENDING)




(pid=699623)   _torch_pytree._register_pytree_node(
(TrainTrainable pid=699623)   _torch_pytree._register_pytree_node(
(TrainTrainable pid=699623)   _torch_pytree._register_pytree_node(


== Status ==
Current time: 2024-05-08 01:35:29 (running for 00:00:05.15)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 PENDING)




(RayTrainWorker pid=700358)   _torch_pytree._register_pytree_node(
(RayTrainWorker pid=700363)   _torch_pytree._register_pytree_node(
(RayTrainWorker pid=700357) Setting up process group for: env:// [rank=0, world_size=8]


== Status ==
Current time: 2024-05-08 01:35:34 (running for 00:00:10.16)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




(TorchTrainer pid=699623) Started distributed worker processes:
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700357) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700358) world_rank=1, local_rank=1, node_rank=0
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700359) world_rank=2, local_rank=2, node_rank=0
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700360) world_rank=3, local_rank=3, node_rank=0
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700361) world_rank=4, local_rank=4, node_rank=0
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700362) world_rank=5, local_rank=5, node_rank=0
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700363) world_rank=6, local_rank=6, node_rank=0
(TorchTrainer pid=699623) - (ip=100.83.111.228, pid=700364) world_rank=7, local_rank=7, node_rank=0


== Status ==
Current time: 2024-05-08 01:35:39 (running for 00:00:15.18)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




(RayTrainWorker pid=700362)   _torch_pytree._register_pytree_node( [repeated 22x across cluster]


(RayTrainWorker pid=700363) [2024-05-08 01:35:42,394] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect)
(RayTrainWorker pid=700363) [2024-05-08 01:35:42,573] [INFO] [comm.py:637:init_distributed] cdb=None


Map:   0%|          | 0/4096 [00:00<?, ? examples/s]
Map:  24%|██▍       | 1000/4096 [00:00<00:00, 7068.02 examples/s]


== Status ==
Current time: 2024-05-08 01:35:44 (running for 00:00:20.20)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




Map:  98%|█████████▊| 4000/4096 [00:00<00:00, 5056.81 examples/s]
Map: 100%|██████████| 4096/4096 [00:00<00:00, 5347.28 examples/s]


== Status ==
Current time: 2024-05-08 01:35:50 (running for 00:00:25.22)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




(RayTrainWorker pid=700358)   tensor: Tensor = fn(*args, **kwargs)
Map:   0%|          | 0/4096 [00:00<?, ? examples/s] [repeated 6x across cluster]
Map:  73%|███████▎  | 3000/4096 [00:00<00:00, 7674.50 examples/s] [repeated 20x across cluster]
Map: 100%|██████████| 4096/4096 [00:00<00:00, 5443.58 examples/s] [repeated 5x across cluster]
Map: 100%|██████████| 4096/4096 [00:00<00:00, 5321.53 examples/s] [repeated 3x across cluster]
(RayTrainWorker pid=700357)  PT_HPU_LAZY_MODE = 1
(RayTrainWorker pid=700357)  PT_RECIPE_CACHE_PATH = 
(RayTrainWorker pid=700357)  PT_CACHE_FOLDER_DELETE = 0
(RayTrainWorker pid=700357)  PT_HPU_RECIPE_CACHE_CONFIG = 
(RayTrainWorker pid=700357)  PT_HPU_MAX_COMPOUND_OP_SIZE = 10
(RayTrainWorker pid=700357)  PT_HPU_LAZY_ACC_PAR_MODE = 1
(RayTrainWorker pid=700357)  PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
(RayTrainWorker pid=700357) -------------

(RayTrainWorker pid=700357) [2024-05-08 01:35:53,804] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 723, num_elems = 68.98B
(RayTrainWorker pid=700358) [2024-05-08 01:35:42,394] [INFO] [real_accelerator.py:178:get_accelerator] Setting ds_accelerator to hpu (auto detect) [repeated 7x across cluster]
(RayTrainWorker pid=700358) [2024-05-08 01:35:42,572] [INFO] [comm.py:637:init_distributed] cdb=None [repeated 7x across cluster]


Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]


Loading checkpoint shards:   7%|▋         | 1/15 [00:11<02:44, 11.73s/it]
(RayTrainWorker pid=700364)   tensor: Tensor = fn(*args, **kwargs) [repeated 6x across cluster]
Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s] [repeated 7x across cluster]


Loading checkpoint shards:  13%|█▎        | 2/15 [00:23<02:29, 11.52s/it] [repeated 8x across cluster]


Loading checkpoint shards:  20%|██        | 3/15 [00:34<02:18, 11.53s/it] [repeated 8x across cluster]


Loading checkpoint shards:  27%|██▋       | 4/15 [00:45<02:05, 11.42s/it] [repeated 8x across cluster]


Loading checkpoint shards:  33%|███▎      | 5/15 [00:57<01:53, 11.37s/it] [repeated 8x across cluster]


Loading checkpoint shards:  40%|████      | 6/15 [01:08<01:42, 11.39s/it] [repeated 8x across cluster]


Loading checkpoint shards:  47%|████▋     | 7/15 [01:20<01:31, 11.45s/it] [repeated 8x across cluster]


Loading checkpoint shards:  53%|█████▎    | 8/15 [01:31<01:20, 11.44s/it] [repeated 8x across cluster]


Loading checkpoint shards:  60%|██████    | 9/15 [01:43<01:08, 11.44s/it] [repeated 8x across cluster]


Loading checkpoint shards:  67%|██████▋   | 10/15 [01:54<00:57, 11.45s/it] [repeated 8x across cluster]


Loading checkpoint shards:  73%|███████▎  | 11/15 [02:06<00:45, 11.49s/it] [repeated 8x across cluster]


== Status ==
Current time: 2024-05-08 01:38:00 (running for 00:02:35.67)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:38:05 (running for 00:02:40.68)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:38:10 (running for 00:02:45.70)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/

Loading checkpoint shards:  80%|████████  | 12/15 [02:17<00:34, 11.50s/it] [repeated 8x across cluster]


== Status ==
Current time: 2024-05-08 01:38:15 (running for 00:02:50.71)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:38:20 (running for 00:02:55.73)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




Loading checkpoint shards:  87%|████████▋ | 13/15 [02:28<00:22, 11.46s/it] [repeated 8x across cluster]


== Status ==
Current time: 2024-05-08 01:38:25 (running for 00:03:00.74)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:38:30 (running for 00:03:05.76)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




Loading checkpoint shards:  93%|█████████▎| 14/15 [02:40<00:11, 11.33s/it]
Loading checkpoint shards:  87%|████████▋ | 13/15 [02:29<00:22, 11.46s/it] [repeated 7x across cluster]
Loading checkpoint shards: 100%|██████████| 15/15 [02:40<00:00, 10.73s/it]


== Status ==
Current time: 2024-05-08 01:38:35 (running for 00:03:10.78)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


(RayTrainWorker pid=700357) Parameter Offload: Total persistent parameters: 17702912 in 801 params
== Status ==
Current time: 2024-05-08 01:38:40 (running for 00:03:15.79)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




  0%|          | 0/26 [00:00<?, ?it/s]
Loading checkpoint shards:  93%|█████████▎| 14/15 [02:40<00:11, 11.33s/it] [repeated 7x across cluster]
Loading checkpoint shards: 100%|██████████| 15/15 [02:40<00:00, 10.73s/it] [repeated 7x across cluster]


== Status ==
Current time: 2024-05-08 01:38:45 (running for 00:03:20.81)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:38:50 (running for 00:03:25.83)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:38:55 (running for 00:03:30.85)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/

  4%|▍         | 1/26 [02:07<53:15, 127.81s/it]


== Status ==
Current time: 2024-05-08 01:40:51 (running for 00:05:26.29)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:40:56 (running for 00:05:31.31)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:41:01 (running for 00:05:36.33)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/

  8%|▊         | 2/26 [03:57<46:46, 116.94s/it]


== Status ==
Current time: 2024-05-08 01:42:41 (running for 00:07:16.69)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:42:46 (running for 00:07:21.71)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 12%|█▏        | 3/26 [04:07<26:05, 68.06s/it] 


== Status ==
Current time: 2024-05-08 01:42:51 (running for 00:07:26.72)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:42:56 (running for 00:07:31.74)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 15%|█▌        | 4/26 [04:16<16:24, 44.75s/it]


== Status ==
Current time: 2024-05-08 01:43:01 (running for 00:07:36.75)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:43:06 (running for 00:07:41.77)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 19%|█▉        | 5/26 [04:25<11:11, 31.96s/it]


(RayTrainWorker pid=700357) {'loss': 1.5577, 'grad_norm': 0.8229730129241943, 'learning_rate': 0.0016886760120394771, 'epoch': 0.38, 'memory_allocated (GB)': 16.85, 'max_memory_allocated (GB)': 29.34, 'total_memory_available (GB)': 94.62}
== Status ==
Current time: 2024-05-08 01:43:11 (running for 00:07:46.79)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:43:16 (running for 00:07:51.80)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 23%|██▎       | 6/26 [04:34<08:03, 24.18s/it]


== Status ==
Current time: 2024-05-08 01:43:21 (running for 00:07:56.82)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 27%|██▋       | 7/26 [04:43<06:06, 19.29s/it]


== Status ==
Current time: 2024-05-08 01:43:26 (running for 00:08:01.84)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:43:31 (running for 00:08:06.85)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 31%|███       | 8/26 [04:52<04:48, 16.01s/it]


== Status ==
Current time: 2024-05-08 01:43:36 (running for 00:08:11.87)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:43:41 (running for 00:08:16.88)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 35%|███▍      | 9/26 [05:01<03:54, 13.80s/it]


== Status ==
Current time: 2024-05-08 01:43:46 (running for 00:08:21.90)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:43:51 (running for 00:08:26.91)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 38%|███▊      | 10/26 [05:10<03:16, 12.29s/it]


(RayTrainWorker pid=700357) {'loss': 1.1295, 'grad_norm': 0.18815693259239197, 'learning_rate': 0.0012832013624085653, 'epoch': 0.77, 'memory_allocated (GB)': 16.85, 'max_memory_allocated (GB)': 29.39, 'total_memory_available (GB)': 94.62}
== Status ==
Current time: 2024-05-08 01:43:56 (running for 00:08:31.93)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:44:01 (running for 00:08:36.94)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 42%|████▏     | 11/26 [05:19<02:49, 11.29s/it]


== Status ==
Current time: 2024-05-08 01:44:06 (running for 00:08:41.96)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 46%|████▌     | 12/26 [05:28<02:28, 10.58s/it]


== Status ==
Current time: 2024-05-08 01:44:11 (running for 00:08:46.98)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:44:16 (running for 00:08:52.00)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 50%|█████     | 13/26 [05:37<02:12, 10.16s/it]


== Status ==
Current time: 2024-05-08 01:44:21 (running for 00:08:57.02)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:44:26 (running for 00:09:02.03)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 54%|█████▍    | 14/26 [05:46<01:57,  9.83s/it]


== Status ==
Current time: 2024-05-08 01:44:31 (running for 00:09:07.05)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:44:36 (running for 00:09:12.06)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 58%|█████▊    | 15/26 [05:55<01:45,  9.55s/it]


(RayTrainWorker pid=700357) {'loss': 0.9853, 'grad_norm': 0.1367674320936203, 'learning_rate': 0.0007313568168728476, 'epoch': 1.15, 'memory_allocated (GB)': 16.85, 'max_memory_allocated (GB)': 29.6, 'total_memory_available (GB)': 94.62}
== Status ==
Current time: 2024-05-08 01:44:41 (running for 00:09:17.08)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:44:46 (running for 00:09:22.10)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 62%|██████▏   | 16/26 [06:04<01:34,  9.44s/it]


== Status ==
Current time: 2024-05-08 01:44:51 (running for 00:09:27.11)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 65%|██████▌   | 17/26 [06:13<01:24,  9.37s/it]


== Status ==
Current time: 2024-05-08 01:44:56 (running for 00:09:32.13)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:45:01 (running for 00:09:37.14)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 69%|██████▉   | 18/26 [06:23<01:14,  9.30s/it]


== Status ==
Current time: 2024-05-08 01:45:06 (running for 00:09:42.16)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:45:11 (running for 00:09:47.17)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 73%|███████▎  | 19/26 [06:32<01:04,  9.25s/it]


== Status ==
Current time: 2024-05-08 01:45:16 (running for 00:09:52.19)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:45:21 (running for 00:09:57.21)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 77%|███████▋  | 20/26 [06:41<00:55,  9.24s/it]


(RayTrainWorker pid=700357) {'loss': 0.9102, 'grad_norm': 0.07150674611330032, 'learning_rate': 0.0002439282353207298, 'epoch': 1.54, 'memory_allocated (GB)': 16.85, 'max_memory_allocated (GB)': 29.6, 'total_memory_available (GB)': 94.62}
== Status ==
Current time: 2024-05-08 01:45:27 (running for 00:10:02.22)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:45:32 (running for 00:10:07.24)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 81%|████████  | 21/26 [06:50<00:45,  9.20s/it]


== Status ==
Current time: 2024-05-08 01:45:37 (running for 00:10:12.26)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:45:42 (running for 00:10:17.27)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 85%|████████▍ | 22/26 [06:59<00:36,  9.13s/it]


== Status ==
Current time: 2024-05-08 01:45:47 (running for 00:10:22.29)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 88%|████████▊ | 23/26 [07:08<00:27,  9.08s/it]


== Status ==
Current time: 2024-05-08 01:45:52 (running for 00:10:27.30)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (0.0/1.0 TPU, 8.0/8.0 HPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:45:57 (running for 00:10:32.33)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 92%|█████████▏| 24/26 [07:17<00:18,  9.10s/it]


== Status ==
Current time: 2024-05-08 01:46:02 (running for 00:10:37.34)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:46:07 (running for 00:10:42.36)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




 96%|█████████▌| 25/26 [07:26<00:09,  9.08s/it]


(RayTrainWorker pid=700357) {'loss': 0.8973, 'grad_norm': 0.07026992738246918, 'learning_rate': 7.096768816970011e-06, 'epoch': 1.92, 'memory_allocated (GB)': 16.85, 'max_memory_allocated (GB)': 29.6, 'total_memory_available (GB)': 94.62}
== Status ==
Current time: 2024-05-08 01:46:12 (running for 00:10:47.38)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)


== Status ==
Current time: 2024-05-08 01:46:17 (running for 00:10:52.40)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 RUNNING)




100%|██████████| 26/26 [07:35<00:00,  9.04s/it]
100%|██████████| 26/26 [07:35<00:00, 17.53s/it]


(RayTrainWorker pid=700363) train_result = TrainOutput(global_step=26, training_loss=1.0912420657964854, metrics={'train_runtime': 455.8151, 'train_samples_per_second': 8.352, 'train_steps_per_second': 0.11, 'total_flos': 113117358981120.0, 'train_loss': 1.0912420657964854, 'epoch': 2.0, 'memory_allocated (GB)': 16.85, 'max_memory_allocated (GB)': 29.51, 'total_memory_available (GB)': 94.62})
(RayTrainWorker pid=700357) {'train_runtime': 455.6666, 'train_samples_per_second': 8.352, 'train_steps_per_second': 0.11, 'train_loss': 1.0917885830769172, 'epoch': 2.0, 'memory_allocated (GB)': 16.85, 'max_memory_allocated (GB)': 29.6, 'total_memory_available (GB)': 94.62}
(RayTrainWorker pid=700357) train_result = TrainOutput(global_step=26, training_loss=1.0917885830769172, metrics={'train_runtime': 455.6666, 'train_samples_per_second': 8.352, 'train_steps_per_second': 0.11, 'train_loss': 1.0917885830769172, 'epoch': 2.0, 'memory_allocated (GB)': 16.85, 'max_memory_a

You may want to consider increasing the `CheckpointConfig(num_to_keep)` or decreasing the frequency of saving checkpoints.
You can suppress this error by setting the environment variable TUNE_WARN_EXCESSIVE_EXPERIMENT_CHECKPOINT_SYNC_THRESHOLD_S to a smaller value than the current threshold (5.0).
2024-05-08 01:47:39,968	INFO tune.py:1007 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/TorchTrainer_2024-05-08_01-35-24' in 0.0025s.
2024-05-08 01:47:39,970	INFO tune.py:1039 -- Total run time: 735.19 seconds (735.17 seconds for the tuning loop).


Trial TorchTrainer_3de91_00000 completed. Last result: 
== Status ==
Current time: 2024-05-08 01:47:39 (running for 00:12:15.18)
Using FIFO scheduling algorithm.
Logical resource usage: 9.0/152 CPUs, 0/0 GPUs (8.0/8.0 HPU, 0.0/1.0 TPU)
Result logdir: /tmp/ray/session_2024-05-08_01-35-20_406541_689603/artifacts/2024-05-08_01-35-24/TorchTrainer_2024-05-08_01-35-24/driver_artifacts
Number of trials: 1/1 (1 TERMINATED)


Training result: Result(
  metrics={},
  path='/root/ray_results/TorchTrainer_2024-05-08_01-35-24/TorchTrainer_3de91_00000_0_2024-05-08_01-35-24',
  filesystem='local',
  checkpoint=None
)
