# GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed

In this example, we will showcase how to use Ray AIR for **GPT-J fine-tuning**. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, see the [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/gptj).

We will use Ray AIR (with the 🤗 Transformers integration) and a pretrained model from the Hugging Face Hub. Note that you can easily adapt this example to use other similar models.

This example focuses more on the performance and distributed computing aspects of Ray AIR. If you are looking for a more beginner-friendly introduction to the Ray AIR 🤗 Transformers integration, see {doc}`this example </ray-air/examples/huggingface_text_classification>`.

It is highly recommended to read [Ray AIR Key Concepts](air-key-concepts) and [Ray Data Key Concepts](data_key_concepts) before starting this example.

```{note}
To run this example, make sure your Ray cluster has access to at least one GPU with 16 GB or more of memory. The amount of memory needed depends on the model. This notebook is tested with 16 g4dn.4xlarge instances.
```

In this notebook, we will:
1. [Set up Ray](#setup)
2. [Load the dataset](#load)
3. [Preprocess the dataset with Ray AIR](#preprocess)
4. [Run the training with Ray AIR](#train)
5. [Generate text from prompt with Ray AIR](#predict)

Uncomment and run the following line to install all the necessary dependencies (this notebook is tested with `transformers==4.26.0`):

In [1]:
#! pip install "datasets" "evaluate" "accelerate>=0.16.0" "transformers>=4.26.0" "torch>=1.12.0" "deepspeed"

In [2]:
import numpy as np
import pandas as pd
import os

## Set up Ray <a name="setup"></a>

First, let's set some global variables. We will use 16 workers, each being assigned 1 GPU and 8 CPUs.

In [4]:
model_name = "EleutherAI/gpt-j-6B"
use_gpu = True
num_workers = 16
cpus_per_worker = 8
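
As a quick sanity check, the cluster-wide resources implied by these settings can be tallied before launching anything. The helper below is a hypothetical illustration (the name `required_resources` is not part of Ray); it only multiplies the per-worker request by the number of workers, assuming one GPU per worker as in this example.

```python
def required_resources(num_workers: int, cpus_per_worker: int, use_gpu: bool) -> dict:
    """Total CPUs and GPUs requested by all training workers combined."""
    total = {"CPU": num_workers * cpus_per_worker}
    if use_gpu:
        # Assumption: exactly one GPU per worker, as in this example.
        total["GPU"] = num_workers
    return total


# Values match the global variables defined above.
print(required_resources(num_workers=16, cpus_per_worker=8, use_gpu=True))
```

Once connected to a cluster, these totals can be compared against `ray.cluster_resources()`. Keep in mind that the trainer process itself may need some resources in addition to the workers.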

We will use `ray.init()` to initialize a local cluster. By default, this cluster will consist of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

We define a {ref}`runtime environment <runtime-environments>` to ensure that the Ray workers have access to all the necessary packages. You can omit the `runtime_env` argument if you have all of the packages already installed on each node in your cluster.

In [5]:
import ray

ray.init(
    runtime_env={
        "pip": [
            "datasets",
            "evaluate",
            "accelerate>=0.16.0",
            "transformers>=4.26.0",
            "torch>=1.12.0",
            "deepspeed",
        ]
    }
)

2023-03-06 16:35:03,964	INFO worker.py:1360 -- Connecting to existing Ray cluster at address: 10.0.30.196:6379...
2023-03-06 16:35:03,973	INFO worker.py:1548 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_sedlspnpy16naa5lm9kf2cmi2y/services?redirect_to=dashboard
2023-03-06 16:35:04,548	INFO packaging.py:503 -- Creating a file package for local directory '/tmp/ray_tmp_module/ray'.
2023-03-06 16:35:05,467	INFO packaging.py:330 -- Pushing file package 'gcs://_ray_pkg_f864ba6869d6802c.zip' (145.05MiB) to Ray cluster...
2023-03-06 16:35:07,789	INFO packaging.py:343 -- Successfully pushed file package 'gcs://_ray_pkg_f864ba6869d6802c.zip'.
2023-03-06 16:35:08,306	INFO packaging.py:330 -- Pushing file package 'gcs://_ray_pkg_9628256c2f3f4cb7c4a2b90d6cdc5bef.zip' (162.95MiB) to Ray cluster...
2023-03-06 16:35:10,727	INFO packaging.py:343 -- Successfully pushed file package 'gcs://_ray_pkg_9628256c2f3f4cb7c4a2b90d6

| | |
|---|---|
| Python version: | 3.8.16 |
| Ray version: | 3.0.0.dev0 |
| Dashboard: | http://console.anyscale-staging.com/api/v2/sessions/ses_sedlspnpy16naa5lm9kf2cmi2y/services?redirect_to=dashboard |


In [6]:
# THIS SHOULD BE HIDDEN IN DOCS AND ONLY RUN IN CI
# Download the model from our S3 mirror as it's faster

import ray
import subprocess
import ray.util.scheduling_strategies


def force_on_node(node_id: str, remote_func_or_actor_class):
    scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
        node_id=node_id, soft=False
    )
    options = {"scheduling_strategy": scheduling_strategy}
    return remote_func_or_actor_class.options(**options)


def run_on_every_node(remote_func_or_actor_class, **remote_kwargs):
    refs = []
    for node in ray.nodes():
        if node["Alive"] and node["Resources"].get("GPU", None):
            refs.append(
                force_on_node(node["NodeID"], remote_func_or_actor_class).remote(
                    **remote_kwargs
                )
            )
    return ray.get(refs)


@ray.remote(num_gpus=1)
def download_model():
    from transformers.utils.hub import TRANSFORMERS_CACHE

    path = os.path.expanduser(
        os.path.join(TRANSFORMERS_CACHE, "models--EleutherAI--gpt-j-6B")
    )
    subprocess.run(["mkdir", "-p", os.path.join(path, "snapshots", "main")])
    subprocess.run(["mkdir", "-p", os.path.join(path, "refs")])
    if os.path.exists(os.path.join(path, "refs", "main")):
        return
    subprocess.run(
        [
            "aws",
            "s3",
            "sync",
            "--quiet",
            "s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/",
            os.path.join(path, "snapshots", "main"),
        ]
    )
    with open(os.path.join(path, "snapshots", "main", "hash"), "r") as f:
        f_hash = f.read().strip()
    with open(os.path.join(path, "refs", "main"), "w") as f:
        f.write(f_hash)
    os.rename(
        os.path.join(path, "snapshots", "main"), os.path.join(path, "snapshots", f_hash)
    )


_ = run_on_every_node(download_model)

## Loading the dataset <a name="load"></a>

We will be fine-tuning the model on the [`tiny_shakespeare` dataset](https://huggingface.co/datasets/tiny_shakespeare), which consists of 40,000 lines from a variety of Shakespeare's plays. The aim is to make the GPT-J model better at generating text in the style of Shakespeare.

In [7]:
from datasets import load_dataset

print("Loading tiny_shakespeare dataset")
current_dataset = load_dataset("tiny_shakespeare")
current_dataset

Loading tiny_shakespeare dataset


Found cached dataset tiny_shakespeare (/home/ray/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

We will use [Ray Data](datasets) for distributed preprocessing and data ingestion. We can easily convert the dataset obtained from the Hugging Face Hub to Ray Data by using {meth}`ray.data.read_api.from_huggingface`.

In [8]:
import ray.data

ray_datasets = ray.data.from_huggingface(current_dataset)
ray_datasets

{'train': Dataset(num_blocks=1, num_rows=1, schema={text: string}),
 'validation': Dataset(num_blocks=1, num_rows=1, schema={text: string}),
 'test': Dataset(num_blocks=1, num_rows=1, schema={text: string})}

Because the dataset is represented by a single large string, we will need to do some preprocessing. For that, we will define two [Ray AIR Preprocessors](air-preprocessors) using the {class}`~ray.data.preprocessors.BatchMapper` API, allowing us to define functions that will be applied on batches of data.

The `split_text` function will take the single string and split it into separate lines, removing empty lines and character names ending with ':' (e.g. 'ROMEO:'). The `tokenize` function will take the lines and tokenize them using the 🤗 Tokenizer associated with the model, ensuring each entry has the same length (`block_size`) by padding and truncating. This is necessary for training.
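
To make the splitting rule concrete, here is a minimal pure-Python sketch of the same filtering logic applied to a short string. The function name `split_lines` and the sample text are illustrative only; the notebook's actual preprocessor operates on pandas batches.

```python
def split_lines(text: str) -> list:
    """Split text into stripped lines, dropping blank lines and speaker labels."""
    lines = []
    for raw in text.split("\n"):
        line = raw.strip()
        # Skip empty lines and character names such as "ROMEO:".
        if line and not line.endswith(":"):
            lines.append(line)
    return lines


sample = (
    "ROMEO:\n"
    "But, soft! what light through yonder window breaks?\n"
    "\n"
    "JULIET:\n"
    "O Romeo, Romeo!"
)
print(split_lines(sample))
```

Only the two spoken lines remain; the speaker labels and the blank line are dropped.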

```{note}
This preprocessing can be done in other ways. A common pattern is to tokenize first, and then split the obtained tokens into equally-sized blocks.
```
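
The alternative pattern mentioned in the note can be sketched as follows. This is a simplified illustration that works on a plain list of token IDs; the name `group_into_blocks` is hypothetical, and dropping the incomplete trailing block is one possible design choice, not something this notebook prescribes.

```python
def group_into_blocks(token_ids: list, block_size: int) -> list:
    """Split a flat token list into consecutive blocks of exactly block_size.

    Trailing tokens that do not fill a whole block are dropped.
    """
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i : i + block_size] for i in range(0, usable, block_size)]


# Ten stand-in token IDs grouped into blocks of four; the last two are dropped.
print(group_into_blocks(list(range(10)), block_size=4))
```

Because every block is full, this approach needs no padding, at the cost of blocks that may start or end mid-sentence.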

We will use the `splitter` and `tokenizer` Preprocessors below.

In [9]:
block_size = 512

In [10]:
from transformers import AutoTokenizer
from datasets import Dataset as HFDataset

from ray.data.preprocessors import BatchMapper


def split_text(batch: pd.DataFrame) -> pd.DataFrame:
    # Join all rows, then split into lines, dropping empty lines and
    # character names such as "ROMEO:".
    text = list(batch["text"])
    flat_text = "".join(text)
    lines = [
        x.strip()
        for x in flat_text.split("\n")
        if x.strip() and not x.strip().endswith(":")
    ]
    return pd.DataFrame(lines, columns=["text"])


def tokenize(batch: pd.DataFrame) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    tokenizer.pad_token = tokenizer.eos_token
    ret = tokenizer(
        list(batch["text"]),
        truncation=True,
        max_length=block_size,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)


splitter = BatchMapper(split_text, batch_format="pandas")
tokenizer = BatchMapper(tokenize, batch_format="pandas")
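
The effect of `truncation=True` together with `padding="max_length"` can be illustrated without a tokenizer. The sketch below is an assumption-laden stand-in: `pad_or_truncate` is a hypothetical helper, the pad ID `0` is a placeholder rather than the model's real end-of-sequence ID, and padding is assumed to be added on the right.

```python
def pad_or_truncate(token_ids: list, block_size: int, pad_id: int) -> dict:
    """Return fixed-length input_ids, an attention mask, and labels."""
    kept = token_ids[:block_size]  # truncate anything beyond block_size
    n_pad = block_size - len(kept)
    input_ids = kept + [pad_id] * n_pad  # right-pad up to block_size
    attention_mask = [1] * len(kept) + [0] * n_pad
    # For causal language modeling, labels start out as a copy of input_ids.
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": list(input_ids),
    }


print(pad_or_truncate([5, 6, 7], block_size=5, pad_id=0))
print(pad_or_truncate([1, 2, 3, 4, 5, 6], block_size=5, pad_id=0))
```

Every entry ends up with exactly `block_size` tokens, which is what allows the entries to be stacked into batches during training.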

## Fine-tuning the model with Ray AIR <a name="train"></a>

We can now configure Ray AIR's {class}`~ray.train.huggingface.huggingface_trainer.HuggingFaceTrainer` to perform distributed fine-tuning of the model. To do that, we specify a `trainer_init_per_worker` function, which creates a 🤗 Transformers `Trainer` that Ray distributes using Distributed Data Parallelism (DDP), with the PyTorch Distributed backend used internally. This means that each worker has its own copy of the model but operates on different data. At the end of each step, all the workers sync gradients.
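
The gradient synchronization step can be pictured with a toy, single-process simulation. This is only a conceptual sketch: the function names are hypothetical, plain lists stand in for tensors, and a simple SGD update replaces the real optimizer. In practice, PyTorch Distributed performs this averaging across processes.

```python
def all_reduce_mean(per_worker_grads: list) -> list:
    """Average gradients element-wise across workers."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


def sgd_step(weights: list, grads: list, lr: float) -> list:
    """Apply one plain SGD update (a stand-in for the real optimizer)."""
    return [w - lr * g for w, g in zip(weights, grads)]


weights = [1.0, 1.0]
# Each worker computed gradients on a different shard of the data.
grads = [[0.5, 1.0], [1.5, 3.0]]
synced = all_reduce_mean(grads)

# Every replica applies the same averaged gradient, so the copies stay identical.
replicas = [sgd_step(weights, synced, lr=0.5) for _ in grads]
print(synced, replicas)
```

Because all replicas start from the same weights and apply the same averaged gradient, they remain in sync after every step.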

Because GPT-J is a relatively large model, it may not fit on smaller GPU types (16 GB of GPU memory or less). To deal with that issue, we can use [DeepSpeed](https://github.com/microsoft/DeepSpeed), a library that optimizes the training process and allows us to (among other things) offload and partition optimizer and parameter states, reducing GPU memory usage. Furthermore, DeepSpeed ZeRO Stage 3 allows us to load large models without running out of memory.
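
To get a rough sense of why a 6-billion-parameter model strains a 16 GB GPU, the sketch below applies a per-parameter accounting commonly used for mixed-precision Adam training: 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 master weights plus the two Adam moments. These byte counts are an assumption for illustration, not measurements from this notebook, and the estimate ignores activations, temporary buffers, and CPU offloading.

```python
# Assumed bytes per parameter: fp16 weights + fp16 gradients + fp32 optimizer states.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params: float, num_gpus: int = 1) -> float:
    """Estimate model-state memory per GPU in GB (1 GB = 1e9 bytes).

    With num_gpus > 1, the states are assumed to be partitioned evenly,
    as ZeRO Stage 3 does for parameters, gradients, and optimizer states.
    """
    return num_params * BYTES_PER_PARAM / num_gpus / 1e9


print(f"Single GPU, no partitioning: {model_state_gb(6e9):.0f} GB")
print(f"Partitioned across 16 GPUs:  {model_state_gb(6e9, num_gpus=16):.0f} GB per GPU")
```

Under these assumptions, the unpartitioned model states alone are several times larger than a 16 GB GPU, while partitioning them across 16 workers brings the per-GPU share within reach.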

🤗 Transformers and Ray AIR's integration (`HuggingFaceTrainer`) allow you to easily configure and use DDP and DeepSpeed. All you need to do is specify the DeepSpeed configuration in the [`TrainingArguments`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) object.
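
Several entries in the DeepSpeed configuration used in this example are set to `"auto"`, which lets the 🤗 Transformers integration fill them in from `TrainingArguments`. The sketch below is a simplified illustration of how the batch-related values relate to each other; `resolve_auto_batch_settings` is a hypothetical helper, not the library's implementation, and a gradient accumulation of 1 is assumed.

```python
def resolve_auto_batch_settings(
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    world_size: int,
) -> dict:
    """Illustrate how the batch-related "auto" values fit together."""
    return {
        "train_micro_batch_size_per_gpu": per_device_batch_size,
        "gradient_accumulation_steps": gradient_accumulation_steps,
        # Global batch = micro batch x accumulation steps x number of workers.
        "train_batch_size": per_device_batch_size
        * gradient_accumulation_steps
        * world_size,
    }


# 16 workers with a per-device batch size of 16, as configured in this example.
print(resolve_auto_batch_settings(16, gradient_accumulation_steps=1, world_size=16))
```

Leaving these values as `"auto"` avoids having to keep the DeepSpeed configuration and `TrainingArguments` consistent by hand.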

```{tip}
There are many DeepSpeed settings that let you trade off speed for memory usage. The settings used below are tailored to the cluster setup used (16 g4dn.4xlarge nodes) and a per-device batch size of 16. Some things to keep in mind:
- If your GPUs support bfloat16, it is recommended to use that instead of float16 mixed precision, as it gives better performance and prevents overflows. Simply replace `fp16=True` with `bf16=True` in `TrainingArguments`.
- If you are running out of GPU memory: try reducing the batch size (defined in the cell below the next one) or setting `"overlap_comm": False` in the DeepSpeed config.
- If you are running out of RAM: try adding more nodes to your cluster, using nodes with more RAM, setting `"pin_memory": False` in the DeepSpeed config, reducing the batch size, or removing `"offload_param"` from the DeepSpeed config.

For more information on DeepSpeed configuration, refer to [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/deepspeed) and [DeepSpeed documentation](https://www.deepspeed.ai/docs/config-json/).
```
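
The memory-related tips above can be expressed as small edits to the DeepSpeed configuration dictionary. The helper below is a hypothetical sketch (`adjust_for_memory` is not part of DeepSpeed or Ray); it only toggles the keys named in the tip and leaves batch size and cluster changes to you.

```python
import copy


def adjust_for_memory(
    ds_config: dict, low_gpu_memory: bool = False, low_ram: bool = False
) -> dict:
    """Return a modified copy of a DeepSpeed config following the tips above."""
    config = copy.deepcopy(ds_config)  # never mutate the caller's config
    zero = config.setdefault("zero_optimization", {})
    if low_gpu_memory:
        zero["overlap_comm"] = False
    if low_ram:
        # Stop offloading parameters to CPU and do not pin host memory.
        zero.pop("offload_param", None)
        if "offload_optimizer" in zero:
            zero["offload_optimizer"]["pin_memory"] = False
    return config


# A trimmed-down version of the ZeRO section used in this example.
base_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
    }
}
tuned = adjust_for_memory(base_config, low_gpu_memory=True, low_ram=True)
print(tuned)
```

Working on a copy keeps the original settings available, which makes it easier to compare the speed and memory usage of different variants.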

In [11]:
import evaluate
from transformers import Trainer, TrainingArguments
from transformers import (
    GPTJForCausalLM,
    AutoTokenizer,
    default_data_collator,
)
from transformers.utils.logging import disable_progress_bar, enable_progress_bar
import torch

from ray.air import session


def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    # Use the actual number of CPUs assigned by Ray
    os.environ["OMP_NUM_THREADS"] = str(
        session.get_trial_resources().bundles[-1].get("CPU", 1)
    )
    # Enable tf32 for better performance
    torch.backends.cuda.matmul.allow_tf32 = True

    batch_size = config.get("batch_size", 4)
    epochs = config.get("epochs", 2)
    warmup_steps = config.get("warmup_steps", 0)
    learning_rate = config.get("learning_rate", 0.00002)
    weight_decay = config.get("weight_decay", 0.01)

    deepspeed = {
        "fp16": {
            "enabled": "auto",
            "initial_scale_power": 8,
        },
        "bf16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
            },
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": True,
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "gather_16bit_weights_on_model_save": True,
            "round_robin_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 10,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }

    print("Preparing training arguments")
    training_args = TrainingArguments(
        "output",
        per_device_train_batch_size=batch_size,
        logging_steps=1,
        save_strategy="no",
        per_device_eval_batch_size=batch_size,
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_steps=warmup_steps,
        label_names=["input_ids", "attention_mask"],
        num_train_epochs=epochs,
        push_to_hub=False,
        disable_tqdm=True,  # declutter the output a little
        fp16=True,
        gradient_checkpointing=True,
        deepspeed=deepspeed,
    )
    disable_progress_bar()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    print("Loading model")

    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    model.resize_token_embeddings(len(tokenizer))

    print("Model loaded")

    enable_progress_bar()

    metric = evaluate.load("accuracy")

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )
    return trainer



With our `trainer_init_per_worker` complete, we can now instantiate the `HuggingFaceTrainer`. Aside from the function, we set the `scaling_config`, controlling the number of workers and resources used, and the `datasets` we will use for training and evaluation.

We pass the preprocessors we have defined earlier as an argument, wrapped in a {class}`~ray.data.preprocessors.chain.Chain`. The preprocessor will be included with the returned `Checkpoint`, meaning it will also be applied during inference.
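
Conceptually, chaining preprocessors means applying them one after another, with each step receiving the output of the previous one. The sketch below illustrates that idea with plain functions; `chain`, `split_into_lines`, and `count_words` are hypothetical stand-ins and do not reflect how Ray's `Chain` is implemented.

```python
def chain(*steps):
    """Compose steps so that each one receives the previous step's output."""

    def run(data):
        for step in steps:
            data = step(data)
        return data

    return run


def split_into_lines(text: str) -> list:
    """Stand-in for the splitting step: one entry per non-empty line."""
    return [line.strip() for line in text.split("\n") if line.strip()]


def count_words(lines: list) -> list:
    """Stand-in for the tokenizing step: one number per line."""
    return [len(line.split()) for line in lines]


pipeline = chain(split_into_lines, count_words)
print(pipeline("to be\nor not to be"))
```

Bundling the steps into a single object is what allows the same transformations to be replayed on new inputs at inference time.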

```{note}
If you want to upload checkpoints to cloud storage (e.g. S3), use {class}`ray.tune.syncer.SyncConfig` - see {ref}`train-config-sync` for an example. Using cloud storage is highly recommended, especially for production.
```

In [12]:
from ray.train.huggingface import HuggingFaceTrainer
from ray.air.config import ScalingConfig
from ray.data.preprocessors import Chain


trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={
        "batch_size": 16,  # per device
        "epochs": 1,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets={"train": ray_datasets["train"], "evaluation": ray_datasets["validation"]},
    preprocessor=Chain(splitter, tokenizer),
)

Finally, we call the `fit` method to start training with Ray AIR. We will save the `Result` object to a variable so we can access metrics and checkpoints.

In [13]:
results = trainer.fit()

| | |
|---|---|
| Current time: | 2023-03-06 17:18:41 |
| Running for: | 00:43:11.46 |
| Memory: | 31.9/62.0 GiB |

| Trial name | status | loc | iter | total time (s) | loss | learning_rate | epoch |
|---|---|---|---|---|---|---|---|
| HuggingFaceTrainer_f623d_00000 | TERMINATED | 10.0.30.196:30861 | 85 | 2579.3 | 0.0715 | 4.70588e-07 | 1 |
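
The trial summary above reports 85 iterations in 2579.3 seconds. A back-of-the-envelope throughput estimate can be derived from those numbers, as sketched below. It assumes that each reported iteration corresponds to one optimizer step and that gradient accumulation is 1; since the total time also includes setup such as model loading, the result is a conservative figure.

```python
def throughput(total_time_s: float, iterations: int, global_batch_size: int) -> dict:
    """Derive rough timing figures from a trial summary."""
    return {
        "seconds_per_iteration": total_time_s / iterations,
        "sequences_per_second": iterations * global_batch_size / total_time_s,
    }


num_workers = 16
per_device_batch_size = 16
gradient_accumulation_steps = 1  # assumption: not set explicitly in this example
global_batch_size = num_workers * per_device_batch_size * gradient_accumulation_steps

stats = throughput(total_time_s=2579.3, iterations=85, global_batch_size=global_batch_size)
print(f"{stats['seconds_per_iteration']:.1f} s per iteration")
print(f"{stats['sequences_per_second']:.2f} sequences per second")
```

Since every sequence is padded to `block_size` tokens, the number of sequences per second includes padding and overstates the amount of real text processed.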


(pid=30861)   from pandas import MultiIndex, Int64Index
(pid=30861) comet_ml is installed but `COMET_API_KEY` is not set.
(HuggingFaceTrainer pid=30861) 2023-03-06 16:35:39,040	INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] -> AllToAllOperator[randomize_block_order]
(HuggingFaceTrainer pid=30861) 2023-03-06 16:35:40,878	INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper] -> AllToAllOperator[randomize_block_order]
(RayTrainWorker pid=31281) 2023-03-06 16:35:44,877	INFO config.py:86 -- Setting up process group for: env:// [rank=0, world_size=16]
(HuggingFaceTrainer pid=30861) 2023-03-06 16:35:45,497	INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[BatchMapper]
(RayTrainWorker pid=1942, ip=10.0.57.85)   from pandas import MultiIndex, Int64Index
(RayTrainWorker pid=1942, ip=10.0.35.70)   from pandas import MultiIndex, Int64Index
(RayTrainWorker pid=3

(RayTrainWorker pid=31281) Preparing training arguments
(RayTrainWorker pid=31281) Loading model
(RayTrainWorker pid=2334, ip=10.0.53.213) Preparing training arguments
(RayTrainWorker pid=2334, ip=10.0.53.213) Loading model
(RayTrainWorker pid=1954, ip=10.0.15.115) Preparing training arguments
(RayTrainWorker pid=1942, ip=10.0.51.113) Preparing training arguments
(RayTrainWorker pid=1943, ip=10.0.24.217) Preparing training arguments
(RayTrainWorker pid=1942, ip=10.0.35.70) Preparing training arguments
(RayTrainWorker pid=1956, ip=10.0.47.149) Preparing training arguments
(RayTrainWorker pid=1964, ip=10.0.26.83) Preparing training arguments
(RayTrainWorker pid=1963, ip=10.0.54.163) Preparing training arguments
(RayTrainWorker pid=1955, ip=10.0.58.255) Preparing training arguments
(RayTrainWorker pid=1942, ip=10.0.57.85) Preparing training arguments
(RayTrainWorker pid=1954, ip=10.0.25.154) Preparing training arguments
(RayTrainWorker pid=2623, ip=10.0.4.206) Preparing training arguments



(RayTrainWorker pid=2623, ip=10.0.4.206) Model loaded
(RayTrainWorker pid=1956, ip=10.0.47.149) Model loaded
(RayTrainWorker pid=1943, ip=10.0.37.101) Model loaded
(RayTrainWorker pid=2334, ip=10.0.53.213) Model loaded
(RayTrainWorker pid=31281) Model loaded
(RayTrainWorker pid=1942, ip=10.0.35.70) Model loaded
(RayTrainWorker pid=1942, ip=10.0.57.85) Model loaded
(RayTrainWorker pid=1955, ip=10.0.58.255) Model loaded
(RayTrainWorker pid=1954, ip=10.0.15.115) Model loaded
(RayTrainWorker pid=1942, ip=10.0.51.113) Model loaded
(RayTrainWorker pid=1964, ip=10.0.26.83) Model loaded
(RayTrainWorker pid=1954, ip=10.0.25.154) Model loaded
(RayTrainWorker pid=1963, ip=10.0.29.205) Model loaded
(RayTrainWorker pid=1943, ip=10.0.14.60) Model loaded
(RayTrainWorker pid=1943, ip=10.0.24.217) Model loaded
(RayTrainWorker pid=1963, ip=10.0.54.163) Model loaded


(RayTrainWorker pid=1963, ip=10.0.29.205) Using cuda_amp half precision backend
(RayTrainWorker pid=1943, ip=10.0.14.60) Using cuda_amp half precision backend
(RayTrainWorker pid=1943, ip=10.0.24.217) Using cuda_amp half precision backend
(RayTrainWorker pid=1943, ip=10.0.24.217) 2023-03-06 16:38:03,416	INFO distributed_c10d.py:319 -- Added key: store_based_barrier_key:2 to store for rank: 11
(RayTrainWorker pid=1963, ip=10.0.54.163) Using cuda_amp half precision backend
(RayTrainWorker pid=1963, ip=10.0.54.163) 2023-03-06 16:38:03,434	INFO distributed_c10d.py:319 -- Added key: store_based_barrier_key:2 to store for rank: 8
(RayTrainWorker pid=2623, ip=10.0.4.206) Using cuda_amp half precision backend
(RayTrainWorker pid=2623, ip=10.0.4.206) 2023-03-06 16:38:03,423	INFO distributed_c10d.py:319 -- Added key: store_based_barrier_key:2 to store for rank: 10
(RayTrainWorker pid=1956, ip=10.0.47.149) Using cuda_amp half precision backend
(RayTrainWorker pid=1956, ip=10.0.47.149) 2023-03-06 

(RayTrainWorker pid=31281) [2023-03-06 16:38:03,431] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed info: version=0.8.1, git-hash=unknown, git-branch=unknown
(RayTrainWorker pid=31281) [2023-03-06 16:38:03,450] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False


(RayTrainWorker pid=1942, ip=10.0.35.70) Using cuda_amp half precision backend
(RayTrainWorker pid=1942, ip=10.0.35.70) 2023-03-06 16:38:03,428	INFO distributed_c10d.py:319 -- Added key: store_based_barrier_key:2 to store for rank: 5
(RayTrainWorker pid=1942, ip=10.0.35.70) 2023-03-06 16:38:03,449	INFO distributed_c10d.py:353 -- Rank 5: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes.
(RayTrainWorker pid=1942, ip=10.0.57.85) Using cuda_amp half precision backend
(RayTrainWorker pid=1942, ip=10.0.57.85) 2023-03-06 16:38:03,428	INFO distributed_c10d.py:319 -- Added key: store_based_barrier_key:2 to store for rank: 1
(RayTrainWorker pid=1942, ip=10.0.57.85) 2023-03-06 16:38:03,449	INFO distributed_c10d.py:353 -- Rank 1: Completed store-based barrier for key:store_based_barrier_key:2 with 16 nodes.
(RayTrainWorker pid=1955, ip=10.0.58.255) Using cuda_amp half precision backend
(RayTrainWorker pid=1955, ip=10.0.58.255) 2023-03-06 16:38:03,412	INFO distributed_c

(RayTrainWorker pid=2334, ip=10.0.53.213) ninja: no work to do.


(RayTrainWorker pid=2623, ip=10.0.4.206) Detected CUDA files, patching ldflags
(RayTrainWorker pid=2623, ip=10.0.4.206) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=2623, ip=10.0.4.206) Building extension module cpu_adam...
(RayTrainWorker pid=2623, ip=10.0.4.206) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1942, ip=10.0.35.70) ninja: no work to do.
(RayTrainWorker pid=1942, ip=10.0.35.70) Time to load cpu_adam op: 2.6751821041107178 seconds


(RayTrainWorker pid=1942, ip=10.0.35.70) Loading extension module cpu_adam...
(RayTrainWorker pid=1942, ip=10.0.57.85) Detected CUDA files, patching ldflags
(RayTrainWorker pid=1942, ip=10.0.57.85) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=1942, ip=10.0.57.85) Building extension module cpu_adam...
(RayTrainWorker pid=1942, ip=10.0.57.85) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=1955, ip=10.0.58.255) Detected CUDA files, patching ldflags
(RayTrainWorker pid=1955, ip=10.0.58.255) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=1955, ip=10.0.58.255) Building extension module cpu_adam...
(RayTrainWorker pid=1955, ip=10.0.58.255) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1954, ip=10.0.15.115) ninja: no work to do.


(RayTrainWorker pid=1954, ip=10.0.15.115) Detected CUDA files, patching ldflags
(RayTrainWorker pid=1954, ip=10.0.15.115) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=1954, ip=10.0.15.115) Building extension module cpu_adam...
(RayTrainWorker pid=1954, ip=10.0.15.115) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=1954, ip=10.0.15.115) Loading extension module cpu_adam...
(RayTrainWorker pid=1942, ip=10.0.51.113) Loading extension module cpu_adam...


(RayTrainWorker pid=1942, ip=10.0.51.113) ninja: no work to do.
(RayTrainWorker pid=1942, ip=10.0.51.113) Time to load cpu_adam op: 2.6925859451293945 seconds


(RayTrainWorker pid=1964, ip=10.0.26.83) Detected CUDA files, patching ldflags
(RayTrainWorker pid=1964, ip=10.0.26.83) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=1964, ip=10.0.26.83) Building extension module cpu_adam...
(RayTrainWorker pid=1964, ip=10.0.26.83) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1964, ip=10.0.26.83) ninja: no work to do.
(RayTrainWorker pid=1954, ip=10.0.25.154) ninja: no work to do.


(RayTrainWorker pid=1954, ip=10.0.25.154) Detected CUDA files, patching ldflags
(RayTrainWorker pid=1954, ip=10.0.25.154) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=1954, ip=10.0.25.154) Building extension module cpu_adam...
(RayTrainWorker pid=1954, ip=10.0.25.154) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=1954, ip=10.0.25.154) Loading extension module cpu_adam...
(RayTrainWorker pid=1963, ip=10.0.29.205) Detected CUDA files, patching ldflags
(RayTrainWorker pid=1963, ip=10.0.29.205) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=1963, ip=10.0.29.205) Building extension module cpu_adam...
(RayTrainWorker pid=1963, ip=10.0.29.205) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker

(RayTrainWorker pid=1943, ip=10.0.24.217) ninja: no work to do.
(RayTrainWorker pid=1943, ip=10.0.24.217) Time to load cpu_adam op: 2.7105295658111572 seconds


(RayTrainWorker pid=1963, ip=10.0.54.163) Loading extension module cpu_adam...


(RayTrainWorker pid=1963, ip=10.0.54.163) ninja: no work to do.
(RayTrainWorker pid=1963, ip=10.0.54.163) Time to load cpu_adam op: 2.7104923725128174 seconds
(RayTrainWorker pid=1956, ip=10.0.47.149) ninja: no work to do.
(RayTrainWorker pid=1956, ip=10.0.47.149) Time to load cpu_adam op: 2.7040586471557617 seconds


(RayTrainWorker pid=1956, ip=10.0.47.149) Loading extension module cpu_adam...


(RayTrainWorker pid=1943, ip=10.0.37.101) ninja: no work to do.
(RayTrainWorker pid=1943, ip=10.0.37.101) Time to load cpu_adam op: 2.718742609024048 seconds


(RayTrainWorker pid=1943, ip=10.0.37.101) Loading extension module cpu_adam...


(RayTrainWorker pid=2334, ip=10.0.53.213) Time to load cpu_adam op: 2.683342456817627 seconds


(RayTrainWorker pid=2623, ip=10.0.4.206) Loading extension module cpu_adam...


(RayTrainWorker pid=2623, ip=10.0.4.206) ninja: no work to do.
(RayTrainWorker pid=2623, ip=10.0.4.206) Time to load cpu_adam op: 2.7268447875976562 seconds


(RayTrainWorker pid=31281) Detected CUDA files, patching ldflags
(RayTrainWorker pid=31281) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/cpu_adam/build.ninja...
(RayTrainWorker pid=31281) Building extension module cpu_adam...
(RayTrainWorker pid=31281) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=31281) Loading extension module cpu_adam...


(RayTrainWorker pid=31281) ninja: no work to do.


(RayTrainWorker pid=1942, ip=10.0.57.85) Loading extension module cpu_adam...


(RayTrainWorker pid=1942, ip=10.0.57.85) ninja: no work to do.
(RayTrainWorker pid=1942, ip=10.0.57.85) Time to load cpu_adam op: 2.714007616043091 seconds
(RayTrainWorker pid=1955, ip=10.0.58.255) ninja: no work to do.
(RayTrainWorker pid=1955, ip=10.0.58.255) Time to load cpu_adam op: 2.712510347366333 seconds


(RayTrainWorker pid=1955, ip=10.0.58.255) Loading extension module cpu_adam...


(RayTrainWorker pid=1954, ip=10.0.15.115) Time to load cpu_adam op: 2.7184810638427734 seconds


(RayTrainWorker pid=1964, ip=10.0.26.83) Loading extension module cpu_adam...


(RayTrainWorker pid=1964, ip=10.0.26.83) Time to load cpu_adam op: 2.719329595565796 seconds
(RayTrainWorker pid=1954, ip=10.0.25.154) Time to load cpu_adam op: 2.7163612842559814 seconds


(RayTrainWorker pid=1963, ip=10.0.29.205) Loading extension module cpu_adam...


(RayTrainWorker pid=1963, ip=10.0.29.205) ninja: no work to do.
(RayTrainWorker pid=1943, ip=10.0.14.60) ninja: no work to do.
(RayTrainWorker pid=1943, ip=10.0.14.60) Time to load cpu_adam op: 2.725243091583252 seconds


(RayTrainWorker pid=1943, ip=10.0.14.60) Loading extension module cpu_adam...


(RayTrainWorker pid=31281) Time to load cpu_adam op: 2.75288987159729 seconds
(RayTrainWorker pid=1963, ip=10.0.29.205) Time to load cpu_adam op: 2.7566170692443848 seconds


(RayTrainWorker pid=2623, ip=10.0.4.206) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=2623, ip=10.0.4.206) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=2623, ip=10.0.4.206) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,767] [INFO] [logging.py:75:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,782] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,782] [INFO] [utils.py:53:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,782] [INFO] [logging.py:75:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 3 optimizer
(RayTrainWorker pid=31281) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=31281) Co

(RayTrainWorker pid=2623, ip=10.0.4.206) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=2623, ip=10.0.4.206) Building extension module utils...
(RayTrainWorker pid=2623, ip=10.0.4.206) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=31281) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=31281) [2023-03-06 16:38:09,979] [INFO] [utils.py:825:see_memory_usage] Stage 3 initialize beginning
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,980] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 1.26 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,980] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 11.88 GB, percent = 19.1%
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,982] [INFO] [stage3.py:114:__init__] Reduce bucket size 16777216
(RayTrainWorker pid=31281) [2023-03-06 16:38:09,982] [INFO] [stage3.py:115:__init__] Prefetch bucket size 15099494


(RayTrainWorker pid=2623, ip=10.0.4.206) Loading extension module utils...


(RayTrainWorker pid=2623, ip=10.0.4.206) ninja: no work to do.
(RayTrainWorker pid=2623, ip=10.0.4.206) Time to load utils op: 0.33064842224121094 seconds


(RayTrainWorker pid=31281) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=31281) Building extension module utils...
(RayTrainWorker pid=31281) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=1942, ip=10.0.35.70) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1942, ip=10.0.51.113) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1943, ip=10.0.24.217) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1963, ip=10.0.54.163) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=1956, ip=10.0.47.149) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1956, ip=10.0.47.149) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(RayTrainWorker pid=1963, ip=10.0.54.163) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1963, ip=10.0.54.163) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1956, ip=10.0.47.149) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=2334, ip=10.0.53.213) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=31281) Loading extension module utils...


(RayTrainWorker pid=31281) ninja: no work to do.
(RayTrainWorker pid=31281) Time to load utils op: 0.34462642669677734 seconds
(RayTrainWorker pid=1942, ip=10.0.35.70) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1942, ip=10.0.35.70) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1942, ip=10.0.57.85) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=1955, ip=10.0.58.255) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1955, ip=10.0.58.255) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1955, ip=10.0.58.255) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=1954, ip=10.0.15.115) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1954, ip=10.0.15.115) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1954, ip=10.0.15.115) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=1942, ip=10.0.51.113) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1942, ip=10.0.51.113) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1964, ip=10.0.26.83) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=1954, ip=10.0.25.154) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1954, ip=10.0.25.154) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1954, ip=10.0.25.154) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1943, ip=10.0.14.60) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=1943, ip=10.0.14.60) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1943, ip=10.0.14.60) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(RayTrainWorker pid=1943, ip=10.0.24.217) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1943, ip=10.0.24.217) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(RayTrainWorker pid=1943, ip=10.0.37.101) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1943, ip=10.0.37.101) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1943, ip=10.0.37.101) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...


(RayTrainWorker pid=2334, ip=10.0.53.213) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=2334, ip=10.0.53.213) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1942, ip=10.0.35.70) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1942, ip=10.0.35.70) Building extension module utils...
(RayTrainWorker pid=1942, ip=10.0.35.70) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1942, ip=10.0.57.85) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1942, ip=10.0.57.85) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(RayTrainWorker pid=1964, ip=10.0.26.83) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1964, ip=10.0.26.83) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1963, ip=10.0.29.205) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1956, ip=10.0.47.149) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1956, ip=10.0.47.149) Building extension module utils...
(RayTrainWorker pid=1956, ip=10.0.47.149) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=2334, ip=10.0.53.213) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=2334, ip=10.0.53.213) Building extension module utils...
(RayTrainWorker pid=2334, ip=10.0.53.213) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=31281) [2023-03-06 16:38:10,526] [INFO] [utils.py:825:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
(RayTrainWorker pid=31281) [2023-03-06 16:38:10,526] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=31281) [2023-03-06 16:38:10,527] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 11.89 GB, percent = 19.2%
(RayTrainWorker pid=31281) Parameter Offload: Total persistent parameters: 811008 in 114 params
(RayTrainWorker pid=1942, ip=10.0.35.70) ninja: no work to do.
(RayTrainWorker pid=1942, ip=10.0.35.70) Time to load utils op: 0.29694080352783203 seconds


(RayTrainWorker pid=1942, ip=10.0.35.70) Loading extension module utils...
(RayTrainWorker pid=1942, ip=10.0.57.85) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1942, ip=10.0.57.85) Building extension module utils...
(RayTrainWorker pid=1942, ip=10.0.57.85) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=1955, ip=10.0.58.255) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1955, ip=10.0.58.255) Building extension module utils...
(RayTrainWorker pid=1955, ip=10.0.58.255) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=1954, ip=10.0.15.115) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1954, ip=10.0.15.115) Building extension

(RayTrainWorker pid=1942, ip=10.0.51.113) ninja: no work to do.
(RayTrainWorker pid=1942, ip=10.0.51.113) Time to load utils op: 0.30550432205200195 seconds


(RayTrainWorker pid=1964, ip=10.0.26.83) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1964, ip=10.0.26.83) Building extension module utils...
(RayTrainWorker pid=1964, ip=10.0.26.83) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=1954, ip=10.0.25.154) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1954, ip=10.0.25.154) Building extension module utils...
(RayTrainWorker pid=1954, ip=10.0.25.154) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1963, ip=10.0.29.205) Adam Optimizer #0 is created with AVX512 arithmetic capability.
(RayTrainWorker pid=1963, ip=10.0.29.205) Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1


(RayTrainWorker pid=1943, ip=10.0.24.217) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1943, ip=10.0.24.217) Building extension module utils...
(RayTrainWorker pid=1943, ip=10.0.24.217) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1943, ip=10.0.24.217) ninja: no work to do.


(RayTrainWorker pid=1943, ip=10.0.24.217) Loading extension module utils...
(RayTrainWorker pid=1963, ip=10.0.54.163) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1963, ip=10.0.54.163) Building extension module utils...
(RayTrainWorker pid=1963, ip=10.0.54.163) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1956, ip=10.0.47.149) ninja: no work to do.
(RayTrainWorker pid=1956, ip=10.0.47.149) Time to load utils op: 0.30418896675109863 seconds


(RayTrainWorker pid=1956, ip=10.0.47.149) Loading extension module utils...
(RayTrainWorker pid=1943, ip=10.0.37.101) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1943, ip=10.0.37.101) Building extension module utils...
(RayTrainWorker pid=1943, ip=10.0.37.101) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
(RayTrainWorker pid=2334, ip=10.0.53.213) Loading extension module utils...


(RayTrainWorker pid=2334, ip=10.0.53.213) ninja: no work to do.
(RayTrainWorker pid=2334, ip=10.0.53.213) Time to load utils op: 0.3006570339202881 seconds
(RayTrainWorker pid=31281) [2023-03-06 16:38:10,702] [INFO] [utils.py:825:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
(RayTrainWorker pid=31281) [2023-03-06 16:38:10,702] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=31281) [2023-03-06 16:38:10,703] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 11.89 GB, percent = 19.2%


(RayTrainWorker pid=1942, ip=10.0.57.85) Loading extension module utils...


(RayTrainWorker pid=1942, ip=10.0.57.85) ninja: no work to do.
(RayTrainWorker pid=1942, ip=10.0.57.85) Time to load utils op: 0.30536675453186035 seconds
(RayTrainWorker pid=1955, ip=10.0.58.255) ninja: no work to do.
(RayTrainWorker pid=1955, ip=10.0.58.255) Time to load utils op: 0.30983710289001465 seconds


(RayTrainWorker pid=1955, ip=10.0.58.255) Loading extension module utils...


(RayTrainWorker pid=1954, ip=10.0.15.115) ninja: no work to do.
(RayTrainWorker pid=1954, ip=10.0.15.115) Time to load utils op: 0.3104853630065918 seconds


(RayTrainWorker pid=1954, ip=10.0.15.115) Loading extension module utils...
(RayTrainWorker pid=1964, ip=10.0.26.83) Loading extension module utils...


(RayTrainWorker pid=1964, ip=10.0.26.83) ninja: no work to do.
(RayTrainWorker pid=1964, ip=10.0.26.83) Time to load utils op: 0.31006431579589844 seconds
(RayTrainWorker pid=1954, ip=10.0.25.154) ninja: no work to do.
(RayTrainWorker pid=1954, ip=10.0.25.154) Time to load utils op: 0.3110191822052002 seconds


(RayTrainWorker pid=1954, ip=10.0.25.154) Loading extension module utils...
(RayTrainWorker pid=1943, ip=10.0.14.60) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1943, ip=10.0.14.60) Building extension module utils...
(RayTrainWorker pid=1943, ip=10.0.14.60) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1943, ip=10.0.24.217) Time to load utils op: 0.30796074867248535 seconds


(RayTrainWorker pid=1963, ip=10.0.54.163) Loading extension module utils...


(RayTrainWorker pid=1963, ip=10.0.54.163) ninja: no work to do.
(RayTrainWorker pid=1963, ip=10.0.54.163) Time to load utils op: 0.3120288848876953 seconds
(RayTrainWorker pid=1943, ip=10.0.37.101) ninja: no work to do.
(RayTrainWorker pid=1943, ip=10.0.37.101) Time to load utils op: 0.3079547882080078 seconds


(RayTrainWorker pid=1943, ip=10.0.37.101) Loading extension module utils...
(RayTrainWorker pid=1963, ip=10.0.29.205) Emitting ninja build file /home/ray/.cache/torch_extensions/py38_cu116/utils/build.ninja...
(RayTrainWorker pid=1963, ip=10.0.29.205) Building extension module utils...
(RayTrainWorker pid=1963, ip=10.0.29.205) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(RayTrainWorker pid=1943, ip=10.0.14.60) ninja: no work to do.
(RayTrainWorker pid=1943, ip=10.0.14.60) Time to load utils op: 0.31665754318237305 seconds


(RayTrainWorker pid=1943, ip=10.0.14.60) Loading extension module utils...


(RayTrainWorker pid=31281) [2023-03-06 16:38:10,862] [INFO] [utils.py:825:see_memory_usage] Before creating fp16 partitions
(RayTrainWorker pid=31281) [2023-03-06 16:38:10,863] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=31281) [2023-03-06 16:38:10,863] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 11.89 GB, percent = 19.2%


(RayTrainWorker pid=1963, ip=10.0.29.205) Loading extension module utils...


(RayTrainWorker pid=1963, ip=10.0.29.205) ninja: no work to do.
(RayTrainWorker pid=1963, ip=10.0.29.205) Time to load utils op: 0.33661627769470215 seconds
(RayTrainWorker pid=31281) [2023-03-06 16:38:11,921] [INFO] [utils.py:825:see_memory_usage] After creating fp16 partitions: 1
(RayTrainWorker pid=31281) [2023-03-06 16:38:11,922] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=31281) [2023-03-06 16:38:11,922] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 12.91 GB, percent = 20.8%
(RayTrainWorker pid=31281) [2023-03-06 16:38:12,072] [INFO] [utils.py:825:see_memory_usage] Before creating fp32 partitions
(RayTrainWorker pid=31281) [2023-03-06 16:38:12,072] [INFO] [utils.py:826:see_memory_usage] MA 0.11 GB         Max_MA 0.11 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=31281) [2023-03-06 16:38:12,072] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  us

(RayTrainWorker pid=1964, ip=10.0.26.83) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1964, ip=10.0.26.83) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1964, ip=10.0.26.83) Loading extension module utils...
(RayTrainWorker pid=1964, ip=10.0.26.83) ***** Running training *****
(RayTrainWorker pid=1964, ip=10.0.26.83)   Num examples = 1348
(RayTrainWorker pid=1964, ip=10.0.26.83)   Num Epochs = 1
(RayTrainWorker pid=1964, ip=10.0.26.83)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1964, ip=10.0.26.83)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1964, ip=10.0.26.83)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1964, ip=10.0.26.83)   Total optimization steps = 85
(RayTrainWorker pid=1964, ip=10.0.26.83)   Number of trainable parameters = 0


(RayTrainWorker pid=1964, ip=10.0.26.83) Time to load utils op: 0.0005335807800292969 seconds
(RayTrainWorker pid=1954, ip=10.0.25.154) Time to load utils op: 0.0005166530609130859 seconds


(RayTrainWorker pid=1954, ip=10.0.25.154) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1954, ip=10.0.25.154) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1954, ip=10.0.25.154) Loading extension module utils...
(RayTrainWorker pid=1954, ip=10.0.25.154) ***** Running training *****
(RayTrainWorker pid=1954, ip=10.0.25.154)   Num examples = 1348
(RayTrainWorker pid=1954, ip=10.0.25.154)   Num Epochs = 1
(RayTrainWorker pid=1954, ip=10.0.25.154)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1954, ip=10.0.25.154)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1954, ip=10.0.25.154)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1954, ip=10.0.25.154)   Total optimization steps = 85
(RayTrainWorker pid=1954, ip=10.0.25.154)   Number of trainable parameters = 0
(RayTrainWorker pid=2623, ip=10.0.4.206) Using /h

(RayTrainWorker pid=2623, ip=10.0.4.206) Time to load utils op: 0.0005464553833007812 seconds
(RayTrainWorker pid=1943, ip=10.0.14.60) Time to load utils op: 0.0005373954772949219 seconds


(RayTrainWorker pid=1943, ip=10.0.14.60) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1943, ip=10.0.14.60) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1943, ip=10.0.14.60) Loading extension module utils...
(RayTrainWorker pid=1943, ip=10.0.14.60) ***** Running training *****
(RayTrainWorker pid=1943, ip=10.0.14.60)   Num examples = 1348
(RayTrainWorker pid=1943, ip=10.0.14.60)   Num Epochs = 1
(RayTrainWorker pid=1943, ip=10.0.14.60)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1943, ip=10.0.14.60)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1943, ip=10.0.14.60)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1943, ip=10.0.14.60)   Total optimization steps = 85
(RayTrainWorker pid=1943, ip=10.0.14.60)   Number of trainable parameters = 0
(RayTrainWorker pid=1963, ip=10.0.29.205) Using /home/ray/.c

(RayTrainWorker pid=1963, ip=10.0.29.205) Time to load utils op: 0.0005829334259033203 seconds


(RayTrainWorker pid=1943, ip=10.0.24.217) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1943, ip=10.0.24.217) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1943, ip=10.0.24.217) Loading extension module utils...
(RayTrainWorker pid=1943, ip=10.0.24.217) ***** Running training *****
(RayTrainWorker pid=1943, ip=10.0.24.217)   Num examples = 1348
(RayTrainWorker pid=1943, ip=10.0.24.217)   Num Epochs = 1
(RayTrainWorker pid=1943, ip=10.0.24.217)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1943, ip=10.0.24.217)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1943, ip=10.0.24.217)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1943, ip=10.0.24.217)   Total optimization steps = 85
(RayTrainWorker pid=1943, ip=10.0.24.217)   Number of trainable parameters = 0


(RayTrainWorker pid=1943, ip=10.0.24.217) Time to load utils op: 0.0005500316619873047 seconds


(RayTrainWorker pid=1963, ip=10.0.54.163) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1963, ip=10.0.54.163) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1963, ip=10.0.54.163) Loading extension module utils...
(RayTrainWorker pid=1963, ip=10.0.54.163) ***** Running training *****
(RayTrainWorker pid=1963, ip=10.0.54.163)   Num examples = 1348
(RayTrainWorker pid=1963, ip=10.0.54.163)   Num Epochs = 1
(RayTrainWorker pid=1963, ip=10.0.54.163)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1963, ip=10.0.54.163)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1963, ip=10.0.54.163)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1963, ip=10.0.54.163)   Total optimization steps = 85
(RayTrainWorker pid=1963, ip=10.0.54.163)   Number of trainable parameters = 0


(RayTrainWorker pid=1963, ip=10.0.54.163) Time to load utils op: 0.000522613525390625 seconds
(RayTrainWorker pid=1956, ip=10.0.47.149) Time to load utils op: 0.0005176067352294922 seconds


(RayTrainWorker pid=1956, ip=10.0.47.149) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1956, ip=10.0.47.149) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1956, ip=10.0.47.149) Loading extension module utils...
(RayTrainWorker pid=1956, ip=10.0.47.149) ***** Running training *****
(RayTrainWorker pid=1956, ip=10.0.47.149)   Num examples = 1348
(RayTrainWorker pid=1956, ip=10.0.47.149)   Num Epochs = 1
(RayTrainWorker pid=1956, ip=10.0.47.149)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1956, ip=10.0.47.149)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1956, ip=10.0.47.149)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1956, ip=10.0.47.149)   Total optimization steps = 85
(RayTrainWorker pid=1956, ip=10.0.47.149)   Number of trainable parameters = 0


(RayTrainWorker pid=1943, ip=10.0.37.101) Time to load utils op: 0.0005319118499755859 seconds


(RayTrainWorker pid=1943, ip=10.0.37.101) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1943, ip=10.0.37.101) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1943, ip=10.0.37.101) Loading extension module utils...
(RayTrainWorker pid=1943, ip=10.0.37.101) ***** Running training *****
(RayTrainWorker pid=1943, ip=10.0.37.101)   Num examples = 1348
(RayTrainWorker pid=1943, ip=10.0.37.101)   Num Epochs = 1
(RayTrainWorker pid=1943, ip=10.0.37.101)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1943, ip=10.0.37.101)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1943, ip=10.0.37.101)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1943, ip=10.0.37.101)   Total optimization steps = 85
(RayTrainWorker pid=1943, ip=10.0.37.101)   Number of trainable parameters = 0
(RayTrainWorker pid=2334, ip=10.0.53.213) Using /

(RayTrainWorker pid=2334, ip=10.0.53.213) Time to load utils op: 0.000518798828125 seconds
(RayTrainWorker pid=1942, ip=10.0.35.70) Time to load utils op: 0.0005497932434082031 seconds


(RayTrainWorker pid=1942, ip=10.0.35.70) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1942, ip=10.0.35.70) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1942, ip=10.0.35.70) Loading extension module utils...
(RayTrainWorker pid=1942, ip=10.0.35.70) ***** Running training *****
(RayTrainWorker pid=1942, ip=10.0.35.70)   Num examples = 1348
(RayTrainWorker pid=1942, ip=10.0.35.70)   Num Epochs = 1
(RayTrainWorker pid=1942, ip=10.0.35.70)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1942, ip=10.0.35.70)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1942, ip=10.0.35.70)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1942, ip=10.0.35.70)   Total optimization steps = 85
(RayTrainWorker pid=1942, ip=10.0.35.70)   Number of trainable parameters = 0


(RayTrainWorker pid=1955, ip=10.0.58.255) Time to load utils op: 0.0005505084991455078 seconds


(RayTrainWorker pid=1955, ip=10.0.58.255) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1955, ip=10.0.58.255) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1955, ip=10.0.58.255) Loading extension module utils...
(RayTrainWorker pid=1955, ip=10.0.58.255) ***** Running training *****
(RayTrainWorker pid=1955, ip=10.0.58.255)   Num examples = 1348
(RayTrainWorker pid=1955, ip=10.0.58.255)   Num Epochs = 1
(RayTrainWorker pid=1955, ip=10.0.58.255)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1955, ip=10.0.58.255)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1955, ip=10.0.58.255)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1955, ip=10.0.58.255)   Total optimization steps = 85
(RayTrainWorker pid=1955, ip=10.0.58.255)   Number of trainable parameters = 0
(RayTrainWorker pid=1942, ip=10.0.57.85) Using /h

(RayTrainWorker pid=1942, ip=10.0.57.85) Time to load utils op: 0.0005173683166503906 seconds
(RayTrainWorker pid=1954, ip=10.0.15.115) Time to load utils op: 0.0005285739898681641 seconds
(RayTrainWorker pid=1942, ip=10.0.51.113) Time to load utils op: 0.0005178451538085938 seconds


(RayTrainWorker pid=1954, ip=10.0.15.115) Using /home/ray/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
(RayTrainWorker pid=1954, ip=10.0.15.115) No modifications detected for re-loaded extension module utils, skipping build step...
(RayTrainWorker pid=1954, ip=10.0.15.115) Loading extension module utils...
(RayTrainWorker pid=1954, ip=10.0.15.115) ***** Running training *****
(RayTrainWorker pid=1954, ip=10.0.15.115)   Num examples = 1348
(RayTrainWorker pid=1954, ip=10.0.15.115)   Num Epochs = 1
(RayTrainWorker pid=1954, ip=10.0.15.115)   Instantaneous batch size per device = 16
(RayTrainWorker pid=1954, ip=10.0.15.115)   Total train batch size (w. parallel, distributed & accumulation) = 256
(RayTrainWorker pid=1954, ip=10.0.15.115)   Gradient Accumulation steps = 1
(RayTrainWorker pid=1954, ip=10.0.15.115)   Total optimization steps = 85
(RayTrainWorker pid=1954, ip=10.0.15.115)   Number of trainable parameters = 0
(RayTrainWorker pid=31281) Using /home/ray/.cache

(RayTrainWorker pid=31281) [2023-03-06 16:38:25,023] [INFO] [utils.py:825:see_memory_usage] After initializing ZeRO optimizer
(RayTrainWorker pid=31281) [2023-03-06 16:38:25,024] [INFO] [utils.py:826:see_memory_usage] MA 0.14 GB         Max_MA 0.91 GB         CA 1.54 GB         Max_CA 2 GB 
(RayTrainWorker pid=31281) [2023-03-06 16:38:25,024] [INFO] [utils.py:834:see_memory_usage] CPU Virtual Memory:  used = 20.25 GB, percent = 32.7%
(RayTrainWorker pid=31281) [2023-03-06 16:38:25,024] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
(RayTrainWorker pid=31281) [2023-03-06 16:38:25,024] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
(RayTrainWorker pid=31281) [2023-03-06 16:38:25,025] [INFO] [logging.py:75:log_dist] [Rank 0] DeepSpeed LR Scheduler = <torch.optim.lr_scheduler.LambdaLR object at 0x7f10a01d7ee0>
(RayTrainWorker pid=31281) [2023-03-06 16:38:25,025] [INFO] [logging.py:75:log_dist] [Rank 0] step=0, ski



(RayTrainWorker pid=1963, ip=10.0.54.163) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1943, ip=10.0.24.217) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1963, ip=10.0.29.205) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=31281) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}


| Trial name | date | done | epoch | experiment_tag | hostname | iterations_since_restore | learning_rate | loss | node_ip | pid | should_checkpoint | step | time_since_restore | time_this_iter_s | time_total_s | timestamp | train_loss | train_runtime | train_samples_per_second | train_steps_per_second | training_iteration | trial_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HuggingFaceTrainer_f623d_00000 | 2023-03-06_17-18-38 | True | 1 | 0 | ip-10-0-30-196 | 85 | 4.70588e-07 | 0.0715 | 10.0.30.196 | 30861 | True | 85 | 2579.3 | 75.785 | 2579.3 | 1678151918 | 0.324921 | 2413.12 | 0.559 | 0.035 | 85 | f623d_00000 |
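
The figures reported in the logs and in the trial results above can be cross-checked with a little arithmetic. The sketch below is not part of the training code; it only copies values from the output (`num_workers` comes from the setup section of this notebook). It assumes that `Num examples = 1348` refers to the shard of data seen by a single worker rather than the whole dataset. That reading is our interpretation, but it is consistent with the reported 85 optimization steps and with the throughput numbers.

```python
import math

# Values copied from the log output and the trial results above.
num_workers = 16                 # set in the "Set up Ray" section
per_device_batch_size = 16       # "Instantaneous batch size per device"
gradient_accumulation_steps = 1  # "Gradient Accumulation steps"
examples_per_worker = 1348       # "Num examples" (assumed: one worker's shard)
train_runtime_s = 2413.12        # "train_runtime"

# Effective global batch size across all workers.
total_batch_size = (
    num_workers * per_device_batch_size * gradient_accumulation_steps
)

# Optimizer steps needed for one epoch over a worker's shard.
steps_per_epoch = math.ceil(
    examples_per_worker / (per_device_batch_size * gradient_accumulation_steps)
)

# Throughput derived from the reported runtime.
steps_per_second = steps_per_epoch / train_runtime_s
samples_per_second = examples_per_worker / train_runtime_s

print(f"total train batch size:   {total_batch_size}")
print(f"total optimization steps: {steps_per_epoch}")
print(f"train_steps_per_second:   {steps_per_second:.3f}")
print(f"train_samples_per_second: {samples_per_second:.3f}")
```

Under that assumption, `train_samples_per_second` is a per-worker figure, so the aggregate throughput of the cluster would be roughly `num_workers` times higher.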


(RayTrainWorker pid=1956, ip=10.0.47.149) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1943, ip=10.0.37.101) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1942, ip=10.0.35.70) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=2334, ip=10.0.53.213) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1955, ip=10.0.58.255) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1954, ip=10.0.15.115) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1942, ip=10.0.51.113) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=1964, ip=10.0.26.83) {'loss': 12.1235, 'learning_rate': 1.9764705882352945e-05, 'epoch': 0.01}
(RayTrainWorker pid=2623, ip=10.0.4.206) {'loss': 12.1235, 'learning_rate'

(RayTrainWorker pid=31281) Saving model checkpoint to output/checkpoint-85
(RayTrainWorker pid=31281) Configuration saved in output/checkpoint-85/config.json
(RayTrainWorker pid=31281) Configuration saved in output/checkpoint-85/generation_config.json


(RayTrainWorker pid=31281) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=1954, ip=10.0.15.115) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=1943, ip=10.0.14.60) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=1963, ip=10.0.54.163) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=1943, ip=10.0.37.101) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=1942, ip=10.0.57.85) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=1956, ip=10.0.47.149) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=2623, ip=10.0.4.206) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch': 1.0}
(RayTrainWorker pid=1954, ip=10.0.25.154) {'loss': 0.0715, 'learning_rate': 4.7058823529411767e-07, 'epoch

(RayTrainWorker pid=31281) Model weights saved in output/checkpoint-85/pytorch_model.bin
(RayTrainWorker pid=31281) tokenizer config file saved in output/checkpoint-85/tokenizer_config.json
(RayTrainWorker pid=31281) Special tokens file saved in output/checkpoint-85/special_tokens_map.json


(RayTrainWorker pid=31281) [2023-03-06 17:18:13,320] [INFO] [engine.py:3516:save_16bit_model] Saving model weights to output/checkpoint-85/pytorch_model.bin
(RayTrainWorker pid=31281) [2023-03-06 17:18:13,320] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/pytorch_model.bin...




(RayTrainWorker pid=1942, ip=10.0.57.85) [2023-03-06 17:18:29,095] [INFO] [logging.py:75:log_dist] [Rank 1] Saving model checkpoint: output/checkpoint-85/global_step85/zero_pp_rank_1_mp_rank_00_model_states.pt
(RayTrainWorker pid=1942, ip=10.0.57.85) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_1_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1943, ip=10.0.24.217) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_11_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1942, ip=10.0.51.113) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_15_mp_rank_00_model_states.pt...
(RayTrainWorker pid=1954, ip=10.0.25.154) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_14_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1964, ip=10.0.26.83) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_13_mp_rank_00_model_states.pt...
(RayTrainWorker pid=1955, ip=10.0.58.255) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_12_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1942, ip=10.0.35.70) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_5_mp_rank_00_model_states.pt...




(RayTrainWorker pid=2334, ip=10.0.53.213) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_3_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1963, ip=10.0.29.205) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_4_mp_rank_00_model_states.pt...




(RayTrainWorker pid=31281) [2023-03-06 17:18:29,075] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/pytorch_model.bin.
(RayTrainWorker pid=31281) [2023-03-06 17:18:29,087] [INFO] [logging.py:75:log_dist] [Rank 0] [Torch] Checkpoint global_step85 is begin to save!
(RayTrainWorker pid=31281) [2023-03-06 17:18:29,109] [INFO] [logging.py:75:log_dist] [Rank 0] Saving model checkpoint: output/checkpoint-85/global_step85/zero_pp_rank_0_mp_rank_00_model_states.pt
(RayTrainWorker pid=31281) [2023-03-06 17:18:29,109] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_0_mp_rank_00_model_states.pt...
(RayTrainWorker pid=1954, ip=10.0.15.115) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_2_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1956, ip=10.0.47.149) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_7_mp_rank_00_model_states.pt...




(RayTrainWorker pid=2623, ip=10.0.4.206) [2023-03-06 17:18:29,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_10_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1963, ip=10.0.54.163) [2023-03-06 17:18:29,094] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_8_mp_rank_00_model_states.pt...
(RayTrainWorker pid=1943, ip=10.0.37.101) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_9_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1943, ip=10.0.14.60) [2023-03-06 17:18:29,095] [INFO] [torch_checkpoint_engine.py:15:save] [Torch] Saving output/checkpoint-85/global_step85/zero_pp_rank_6_mp_rank_00_model_states.pt...




(RayTrainWorker pid=1954, ip=10.0.15.115) [2023-03-06 17:18:29,249] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/global_step85/zero_pp_rank_2_mp_rank_00_model_states.pt.
(RayTrainWorker pid=1956, ip=10.0.47.149) [2023-03-06 17:18:29,251] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/global_step85/zero_pp_rank_7_mp_rank_00_model_states.pt.
(RayTrainWorker pid=1963, ip=10.0.54.163) [2023-03-06 17:18:29,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/global_step85/zero_pp_rank_8_mp_rank_00_model_states.pt.
(RayTrainWorker pid=1943, ip=10.0.37.101) [2023-03-06 17:18:29,256] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/global_step85/zero_pp_rank_9_mp_rank_00_model_states.pt.
(RayTrainWorker pid=1943, ip=10.0.14.60) [2023-03-06 17:18:29,252] [INFO] [torch_checkpoint_engine.py:17:save] [Torch] Saved output/checkpoint-85/global_step85/zero_pp_rank_6_mp_rank_0

(RayTrainWorker pid=1943, ip=10.0.24.217) 
(RayTrainWorker pid=1943, ip=10.0.24.217) 
(RayTrainWorker pid=1943, ip=10.0.24.217) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1943, ip=10.0.24.217) 
(RayTrainWorker pid=1943, ip=10.0.24.217) 


(RayTrainWorker pid=1943, ip=10.0.24.217) [2023-03-06 17:18:38,142] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!


(RayTrainWorker pid=1942, ip=10.0.57.85) 
(RayTrainWorker pid=1942, ip=10.0.57.85) 
(RayTrainWorker pid=1942, ip=10.0.57.85) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1942, ip=10.0.57.85) 
(RayTrainWorker pid=1942, ip=10.0.57.85) 


(RayTrainWorker pid=1942, ip=10.0.57.85) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!


(RayTrainWorker pid=1942, ip=10.0.51.113) 
(RayTrainWorker pid=1942, ip=10.0.51.113) 
(RayTrainWorker pid=1942, ip=10.0.51.113) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1942, ip=10.0.51.113) 
(RayTrainWorker pid=1942, ip=10.0.51.113) 


(RayTrainWorker pid=1942, ip=10.0.51.113) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1942, ip=10.0.51.113) {'train_runtime': 2413.2956, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
(RayTrainWorker pid=1954, ip=10.0.25.154) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1954, ip=10.0.25.154) {'train_runtime': 2413.2957, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1954, ip=10.0.25.154) 
(RayTrainWorker pid=1954, ip=10.0.25.154) 
(RayTrainWorker pid=1954, ip=10.0.25.154) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1954, ip=10.0.25.154) 
(RayTrainWorker pid=1954, ip=10.0.25.154) 
(RayTrainWorker pid=1964, ip=10.0.26.83) 
(RayTrainWorker pid=1964, ip=10.0.26.83) 
(RayTrainWorker pid=1964, ip=10.0.26.83) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1964, ip=10.0.26.83) 
(RayTrainWorker pid=1964, ip=10.0.26.83) 


(RayTrainWorker pid=1964, ip=10.0.26.83) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1964, ip=10.0.26.83) {'train_runtime': 2413.2955, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
(RayTrainWorker pid=1955, ip=10.0.58.255) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1955, ip=10.0.58.255) {'train_runtime': 2413.2954, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1955, ip=10.0.58.255) 
(RayTrainWorker pid=1955, ip=10.0.58.255) 
(RayTrainWorker pid=1955, ip=10.0.58.255) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1955, ip=10.0.58.255) 
(RayTrainWorker pid=1955, ip=10.0.58.255) 


(RayTrainWorker pid=1942, ip=10.0.35.70) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1942, ip=10.0.35.70) {'train_runtime': 2413.2964, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1942, ip=10.0.35.70) 
(RayTrainWorker pid=1942, ip=10.0.35.70) 
(RayTrainWorker pid=1942, ip=10.0.35.70) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1942, ip=10.0.35.70) 
(RayTrainWorker pid=1942, ip=10.0.35.70) 
(RayTrainWorker pid=2334, ip=10.0.53.213) 
(RayTrainWorker pid=2334, ip=10.0.53.213) 
(RayTrainWorker pid=2334, ip=10.0.53.213) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=2334, ip=10.0.53.213) 
(RayTrainWorker pid=2334, ip=10.0.53.213) 


(RayTrainWorker pid=2334, ip=10.0.53.213) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=2334, ip=10.0.53.213) {'train_runtime': 2413.2958, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1963, ip=10.0.29.205) 
(RayTrainWorker pid=1963, ip=10.0.29.205) 
(RayTrainWorker pid=1963, ip=10.0.29.205) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1963, ip=10.0.29.205) 
(RayTrainWorker pid=1963, ip=10.0.29.205) 


(RayTrainWorker pid=1963, ip=10.0.29.205) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1963, ip=10.0.29.205) {'train_runtime': 2413.2962, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
(RayTrainWorker pid=1954, ip=10.0.15.115) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1954, ip=10.0.15.115) {'train_runtime': 2413.2961, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1954, ip=10.0.15.115) 
(RayTrainWorker pid=1954, ip=10.0.15.115) 
(RayTrainWorker pid=1954, ip=10.0.15.115) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1954, ip=10.0.15.115) 
(RayTrainWorker pid=1954, ip=10.0.15.115) 


(RayTrainWorker pid=1956, ip=10.0.47.149) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1956, ip=10.0.47.149) {'train_runtime': 2413.2961, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1956, ip=10.0.47.149) 
(RayTrainWorker pid=1956, ip=10.0.47.149) 
(RayTrainWorker pid=1956, ip=10.0.47.149) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1956, ip=10.0.47.149) 
(RayTrainWorker pid=1956, ip=10.0.47.149) 
(RayTrainWorker pid=1963, ip=10.0.54.163) 
(RayTrainWorker pid=1963, ip=10.0.54.163) 
(RayTrainWorker pid=1963, ip=10.0.54.163) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1963, ip=10.0.54.163) 
(RayTrainWorker pid=1963, ip=10.0.54.163) 


(RayTrainWorker pid=1963, ip=10.0.54.163) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1963, ip=10.0.54.163) {'train_runtime': 2413.2956, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
(RayTrainWorker pid=1943, ip=10.0.37.101) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=1943, ip=10.0.37.101) {'train_runtime': 2413.2963, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1943, ip=10.0.37.101) 
(RayTrainWorker pid=1943, ip=10.0.37.101) 
(RayTrainWorker pid=1943, ip=10.0.37.101) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1943, ip=10.0.37.101) 
(RayTrainWorker pid=1943, ip=10.0.37.101) 
(RayTrainWorker pid=2623, ip=10.0.4.206) 
(RayTrainWorker pid=2623, ip=10.0.4.206) 
(RayTrainWorker pid=2623, ip=10.0.4.206) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=2623, ip=10.0.4.206) 
(RayTrainWorker pid=2623, ip=10.0.4.206) 
(RayTrainWorker pid=31281) 
(RayTrainWorker pid=31281) 
(RayTrainWorker pid=31281) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=31281) 
(RayTrainWorker pid=31281) 


(RayTrainWorker pid=2623, ip=10.0.4.206) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=2623, ip=10.0.4.206) {'train_runtime': 2413.2958, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
(RayTrainWorker pid=31281) [2023-03-06 17:18:38,143] [INFO] [torch_checkpoint_engine.py:27:commit] [Torch] Checkpoint global_step85 is ready now!
(RayTrainWorker pid=31281) {'train_runtime': 2413.1243, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
(RayTrainWorker pid=1943, ip=10.0.14.60) {'train_runtime': 2413.2961, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


(RayTrainWorker pid=1943, ip=10.0.14.60) 
(RayTrainWorker pid=1943, ip=10.0.14.60) 
(RayTrainWorker pid=1943, ip=10.0.14.60) Training completed. Do not forget to share your model on huggingface.co/models =)
(RayTrainWorker pid=1943, ip=10.0.14.60) 
(RayTrainWorker pid=1943, ip=10.0.14.60) 


(RayTrainWorker pid=1943, ip=10.0.24.217) {'train_runtime': 2413.2958, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}
(RayTrainWorker pid=1942, ip=10.0.57.85) {'train_runtime': 2413.2959, 'train_samples_per_second': 0.559, 'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, 'epoch': 1.0}


2023-03-06 17:18:41,018	INFO tune.py:825 -- Total run time: 2591.59 seconds (2591.46 seconds for the tuning loop).


You can use the returned `Result` object to access metrics and the Ray AIR `Checkpoint` associated with the last iteration.

In [18]:
checkpoint = results.checkpoint
checkpoint

HuggingFaceCheckpoint(local_path=/home/ray/ray_results/HuggingFaceTrainer_2023-03-06_16-35-29/HuggingFaceTrainer_f623d_00000_0_2023-03-06_16-35-30/checkpoint_000000)
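
As a quick sanity check on the training summary that each worker logged above, the sketch below recomputes the throughput figures from one of those summary lines. It is not part of the original notebook. The step count of 85 is taken from the `global_step85` checkpoint name in the logs, and the variable names are illustrative only.

```python
import ast

# Summary dict copied from one worker's final log line above
# (the "(RayTrainWorker ...)" prefix is stripped).
summary_line = (
    "{'train_runtime': 2413.2956, 'train_samples_per_second': 0.559, "
    "'train_steps_per_second': 0.035, 'train_loss': 0.32492108064539293, "
    "'epoch': 1.0}"
)
summary = ast.literal_eval(summary_line)

# Assumption: the total number of optimizer steps equals the step number
# in the checkpoint name "global_step85" seen in the logs.
global_steps = 85

# Steps per second should match the logged 'train_steps_per_second'.
steps_per_second = global_steps / summary["train_runtime"]

# The logged samples/s is rounded, so these are only approximations.
approx_samples = summary["train_samples_per_second"] * summary["train_runtime"]
samples_per_step = approx_samples / global_steps

print(f"steps/s recomputed:        {steps_per_second:.3f}")
print(f"approx. samples processed: {approx_samples:.0f}")
print(f"approx. samples per step:  {samples_per_step:.1f}")
```

The recomputed steps per second agrees with the logged value of 0.035. The implied number of samples per step comes out just under 16, which is consistent with 16 workers each handling about one sample per step. The per-device batch size is not shown in this section, so treat that reading as an inference rather than a confirmed setting.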

### Generate text from prompt

We can use the {class}`~ray.train.huggingface.huggingface_predictor.HuggingFacePredictor` to generate predictions from our fine-tuned model.

```{tip}
For large-scale batch inference, consider configuring cloud checkpointing and then passing the cloud-backed `Checkpoint` to {class}`~ray.train.batch_predictor.BatchPredictor`. More information [here](air-predictors).
```

Because the `HuggingFacePredictor` uses a 🤗 Transformers [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) under the hood, we disable the tokenizer AIR Preprocessor that we used for training and let the `pipeline` tokenize the data itself.

In [2]:
checkpoint.set_preprocessor(None)

We also set `device_map="auto"` so that the model is automatically placed on the right device and set the `task` to `"text-generation"`. The `predict` method passes the arguments to a 🤗 Transformers `pipeline` call.

In [4]:
from ray.train.huggingface import HuggingFacePredictor
import pandas as pd
import torch  # needed for torch.float16 below

prompts = pd.DataFrame(["Romeo and Juliet", "Romeo", "Juliet"], columns=["text"])

# Predict on the head node.
predictor = HuggingFacePredictor.from_checkpoint(
    checkpoint=checkpoint,
    task="text-generation",
    torch_dtype=torch.float16 if use_gpu else None,
    device_map="auto",
    use_gpu=use_gpu,
)
prediction = predictor.predict(
    prompts,
    do_sample=True,
    temperature=0.9,
    min_length=32,
    max_length=128,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [5]:
prediction

|   | generated_text |
|---|----------------|
| 0 | Romeo and Juliet, they are married: and it is ... |
| 1 | Romeo, thou art Romeo and a Montague; for only... |
| 2 | Juliet's name; but I do not sound an ear to na... |
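
The output above is cut off with `...` because pandas shortens long strings when it displays a DataFrame. The following is a minimal sketch of two ways to see the full generations. It uses a stand-in DataFrame named `prediction_demo` (a hypothetical name) that has the same `generated_text` column as `prediction`. The strings in it are placeholders, not actual model output.

```python
import pandas as pd

# Stand-in for the `prediction` DataFrame returned by `predictor.predict`.
# The texts are placeholders invented for illustration only.
placeholder_tail = " <placeholder continuation>" * 5
prediction_demo = pd.DataFrame(
    {
        "generated_text": [
            "Romeo and Juliet" + placeholder_tail,
            "Romeo" + placeholder_tail,
            "Juliet" + placeholder_tail,
        ]
    }
)

# Option 1: temporarily lift the column-width limit when printing.
with pd.option_context("display.max_colwidth", None):
    print(prediction_demo)

# Option 2: pull the strings out of the column and print them one by one.
full_texts = prediction_demo["generated_text"].tolist()
for i, text in enumerate(full_texts):
    print(f"[{i}] {text}")
```

With the real `prediction` object, the same two approaches should apply unchanged, assuming it is a pandas DataFrame with a `generated_text` column as displayed above.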
