(gptj_deepspeed_finetune)=

# GPT-J-6B Fine-Tuning with Ray AIR and DeepSpeed

In this example, we will showcase how to use the Ray AIR for **GPT-J fine-tuning**. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters. For more information on GPT-J, click [here](https://huggingface.co/docs/transformers/model_doc/gptj).

We will use Ray AIR (with the 🤗 Transformers integration) and a pretrained model from Hugging Face hub. Note that you can easily adapt this example to use other similar models.

This example focuses more on the performance and distributed computing aspects of Ray AIR. If you are looking for a more beginner-friendly introduction to Ray AIR 🤗 Transformers integration, see {doc}`this example </ray-air/examples/huggingface_text_classification>`.

It is highly recommended to read [Ray Train Key Concepts](train-key-concepts) and [Ray Data Key Concepts](data_key_concepts) before starting this example.

```{note}
To run this example, make sure your Ray cluster has access to at least one GPU with 16 or more GBs of memory. The required amount of memory depends on the model. This notebook is tested with 16 g4dn.4xlarge instances (including the head node). If you wish to use a CPU head node, turn on [cloud checkpointing](tune-cloud-checkpointing) to avoid OOM errors that may happen due to the default behavior of syncing the checkpoint files to the head node.
```

In this notebook, we will:
1. [Set up Ray](#setup)
2. [Load the dataset](#load)
3. [Preprocess the dataset with Ray AIR](#preprocess)
4. [Run the training with Ray AIR](#train)
5. [Generate text from prompt with Ray AIR](#predict)

Uncomment and run the following line in order to install all the necessary dependencies (this notebook is being tested with `transformers==4.26.0`):

In [1]:
#! pip install "datasets" "evaluate" "accelerate==0.18.0" "transformers>=4.26.0" "torch>=1.12.0" "deepspeed==0.8.3"

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
os.environ["RAY_AIR_NEW_PERSISTENCE_MODE"] = "1"

## Set up Ray <a name="setup"></a>

First, let's set some global variables. We will use 16 workers, each being assigned 1 GPU and 8 CPUs.

In [3]:
model_name = "EleutherAI/gpt-j-6B"
use_gpu = True
num_workers = 16
cpus_per_worker = 8

We will use `ray.init()` to initialize a local cluster. By default, this cluster will be comprised of only the machine you are running this notebook on. You can also run this notebook on an Anyscale cluster.

We define a {ref}`runtime environment <runtime-environments>` to ensure that the Ray workers have access to all the necessary packages. You can omit the `runtime_env` argument if you have all of the packages already installed on each node in your cluster.

In [4]:
import ray

ray.init(
    runtime_env={
        "pip": [
            "datasets",
            "evaluate",
            # Latest combination of accelerate==0.19.0 and transformers==4.29.0
            # seems to have issues with DeepSpeed process group initialization,
            # and will result in a batch_size validation problem.
            # TODO(jungong) : get rid of the pins once the issue is fixed.
            "accelerate==0.16.0",
            "transformers==4.26.0",
            "torch>=1.12.0",
            "deepspeed==0.9.2",
        ],
        "env_vars": {"RAY_AIR_NEW_PERSISTENCE_MODE": "1"}
    },
)

2023-08-17 17:59:12,836	INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.0.14.237:6379...
2023-08-17 17:59:12,890	INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at [1m[32mhttps://session-fqkrgwwj25xw22p19t55bi21tn.i.anyscaleuserdata-staging.com [39m[22m
2023-08-17 17:59:12,904	INFO packaging.py:346 -- Pushing file package 'gcs://_ray_pkg_99bcdb90137840ddbb1bd97c3f1a88f7.zip' (4.01MiB) to Ray cluster...
2023-08-17 17:59:12,916	INFO packaging.py:359 -- Successfully pushed file package 'gcs://_ray_pkg_99bcdb90137840ddbb1bd97c3f1a88f7.zip'.


0,1
Python version:,3.9.15
Ray version:,3.0.0.dev0
Dashboard:,http://session-fqkrgwwj25xw22p19t55bi21tn.i.anyscaleuserdata-staging.com


In [5]:
# THIS SHOULD BE HIDDEN IN DOCS AND ONLY RAN IN CI
# Download the model from our S3 mirror as it's faster

import ray
import subprocess
import ray.util.scheduling_strategies


def force_on_node(node_id: str, remote_func_or_actor_class):
    scheduling_strategy = ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
        node_id=node_id, soft=False
    )
    options = {"scheduling_strategy": scheduling_strategy}
    return remote_func_or_actor_class.options(**options)


def run_on_every_node(remote_func_or_actor_class, **remote_kwargs):
    refs = []
    for node in ray.nodes():
        if node["Alive"] and node["Resources"].get("GPU", None):
            refs.append(
                force_on_node(node["NodeID"], remote_func_or_actor_class).remote(
                    **remote_kwargs
                )
            )
    return ray.get(refs)


@ray.remote(num_gpus=1)
def download_model():
    from transformers.utils.hub import TRANSFORMERS_CACHE

    path = os.path.expanduser(
        os.path.join(TRANSFORMERS_CACHE, "models--EleutherAI--gpt-j-6B")
    )
    subprocess.run(["mkdir", "-p", os.path.join(path, "snapshots", "main")])
    subprocess.run(["mkdir", "-p", os.path.join(path, "refs")])
    if os.path.exists(os.path.join(path, "refs", "main")):
        return
    subprocess.run(
        [
            "aws",
            "s3",
            "sync",
            "--no-sign-request",
            "s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/",
            os.path.join(path, "snapshots", "main"),
        ]
    )
    with open(os.path.join(path, "snapshots", "main", "hash"), "r") as f:
        f_hash = f.read().strip()
    with open(os.path.join(path, "refs", "main"), "w") as f:
        f.write(f_hash)
    os.rename(
        os.path.join(path, "snapshots", "main"), os.path.join(path, "snapshots", f_hash)
    )


_ = run_on_every_node(download_model)

[2m[36m(download_model pid=4509, ip=10.0.4.91)[0m 2023-08-17 15:27:14.116714: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
[2m[36m(download_model pid=4509, ip=10.0.4.91)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(download_model pid=4509, ip=10.0.4.91)[0m 2023-08-17 15:27:14.335144: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2m[36m(download_model pid=4509, ip=10.0.4.91)[0m 2023-08-17 15:27:15.656059: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic libr

download: s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/added_tokens.json to ../../../../../../home/ray/.cache/huggingface/hub/models--EleutherAI--gpt-j-6B/snapshots/main/added_tokens.json
[2m[36m(download_model pid=31622, ip=10.0.51.3)[0m Completed 3.9 KiB/22.5 GiB (71.3 KiB/s) with 8 file(s) remaining
download: s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/special_tokens_map.json to ../../../../../../home/ray/.cache/huggingface/hub/models--EleutherAI--gpt-j-6B/snapshots/main/special_tokens_map.json
download: s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/config.json to ../../../../../../home/ray/.cache/huggingface/hub/models--EleutherAI--gpt-j-6B/snapshots/main/config.json
download: s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/main/tokenizer.json to ../../../../../../home/ray/.cache/huggingface/hub/models--EleutherAI--gpt-j-6B/snapshots/main/tokenizer.json
download: s3://large-dl-models-mirror/models--EleutherAI--gpt-j-6B/

## Loading the dataset <a name="load"></a>

We will be fine-tuning the model on the [`tiny_shakespeare` dataset](https://huggingface.co/datasets/tiny_shakespeare), comprised of 40,000 lines of Shakespeare from a variety of Shakespeare's plays. The aim will be to make the GPT-J model better at generating text in the style of Shakespeare.

In [11]:
from datasets import load_dataset

print("Loading tiny_shakespeare dataset")
current_dataset = load_dataset("tiny_shakespeare")
current_dataset

Loading tiny_shakespeare dataset


Using custom data configuration default
Reusing dataset tiny_shakespeare (/home/ray/.cache/huggingface/datasets/tiny_shakespeare/default/1.0.0/b5b13969f09fe8707337f6cb296314fbe06960bd9a868dca39e713e163d27b5e)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1
    })
})

We will use [Ray Data](data) for distributed preprocessing and data ingestion. We can easily convert the dataset obtained from Hugging Face Hub to Ray Data by using {meth}`ray.data.from_huggingface`.

In [12]:
import ray.data

ray_datasets = {
    "train": ray.data.from_huggingface(current_dataset["train"]),
    "validation": ray.data.from_huggingface(current_dataset["validation"])
}

ray_datasets

{'train': MaterializedDataset(num_blocks=1, num_rows=1, schema={text: string}),
 'validation': MaterializedDataset(num_blocks=1, num_rows=1, schema={text: string})}

Because the dataset is represented by a single large string, we will need to do some preprocessing. For that, we will define two [Ray AIR Preprocessors](air-preprocessors) using the {class}`~ray.data.preprocessors.BatchMapper` API, allowing us to define functions that will be applied on batches of data.

The `split_text` function will take the single string and split it into separate lines, removing empty lines and character names ending with ':' (eg. 'ROMEO:'). The `tokenize` function will take the lines and tokenize them using the 🤗 Tokenizer associated with the model, ensuring each entry has the same length (`block_size`) by padding and truncating. This is necessary for training.

```{note}
This preprocessing can be done in other ways. A common pattern is to tokenize first, and then split the obtained tokens into equally-sized blocks.
```

We will use the `splitter` and `tokenizer` Preprocessors below.

In [13]:
block_size = 512

In [14]:
from transformers import AutoTokenizer

from ray.data.preprocessors import BatchMapper


def split_text(batch: pd.DataFrame) -> pd.DataFrame:
    text = list(batch["text"])
    flat_text = "".join(text)
    split_text = [
        x.strip()
        for x in flat_text.split("\n")
        if x.strip() and not x.strip()[-1] == ":"
    ]
    return pd.DataFrame(split_text, columns=["text"])


def tokenize(batch: pd.DataFrame) -> dict:
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    tokenizer.pad_token = tokenizer.eos_token
    ret = tokenizer(
        list(batch["text"]),
        truncation=True,
        max_length=block_size,
        padding="max_length",
        return_tensors="np",
    )
    ret["labels"] = ret["input_ids"].copy()
    return dict(ret)

processed_datasets = {
    key: ds.map_batches(split_text, batch_format="pandas").map_batches(tokenize, batch_format="pandas")
    for key, ds in ray_datasets.items()
}
processed_datasets

{'train': MapBatches(tokenize)
 +- MapBatches(split_text)
    +- Dataset(num_blocks=1, num_rows=1, schema={text: string}),
 'validation': MapBatches(tokenize)
 +- MapBatches(split_text)
    +- Dataset(num_blocks=1, num_rows=1, schema={text: string})}

### Fine-tuning the model with Ray AIR <a name="train"></a>

We can now configure Ray AIR's {class}`~ray.train.huggingface.TransformersTrainer` to perform distributed fine-tuning of the model. In order to do that, we specify a `trainer_init_per_worker` function, which creates a 🤗 Transformers `Trainer` that will be distributed by Ray using Distributed Data Parallelism (using PyTorch Distributed backend internally). This means that each worker will have its own copy of the model, but operate on different data, At the end of each step, all the workers will sync gradients.

Because GPT-J is a relatively large model, it may not be possible to fit it on smaller GPU types (<=16 GB GRAM). To deal with that issue, we can use [DeepSpeed](https://github.com/microsoft/DeepSpeed), a library to optimize the training process and allow us to (among other things) offload and partition optimizer and parameter states, reducing GRAM usage. Furthermore, DeepSpeed ZeRO Stage 3 allows us to load large models without running out of memory.

🤗 Transformers and Ray AIR's integration ({class}`~ray.train.huggingface.TransformersTrainer`) allow you to easily configure and use DDP and DeepSpeed. All you need to do is specify the DeepSpeed configuration in the [`TrainingArguments`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) object.

```{tip}
There are many DeepSpeed settings that allow you to trade-off speed for memory usage. The settings used below are tailored to the cluster setup used (16 g4dn.4xlarge nodes) and per device batch size of 16. Some things to keep in mind:
- If your GPUs support bfloat16, use that instead of float16 mixed precision to get better performance and prevent overflows. Replace `fp16=True` with `bf16=True` in `TrainingArguments`.
- If you are running out of GRAM: try reducing batch size (defined in the cell below the next one), set `"overlap_comm": False` in DeepSpeed config.
- If you are running out of RAM, add more nodes to your cluster, use nodes with more RAM, set `"pin_memory": False` in the DeepSpeed config, reduce the batch size, and remove `"offload_param"` from the DeepSpeed config.

For more information on DeepSpeed configuration, refer to [Hugging Face documentation](https://huggingface.co/docs/transformers/main_classes/deepspeed) and [DeepSpeed documentation](https://www.deepspeed.ai/docs/config-json/).

Additionally, if you prefer a lower-level API, the logic below can be expressed as an [Accelerate training loop](https://github.com/huggingface/accelerate/blob/main/examples/by_feature/deepspeed_with_config_support.py) distributed by a Ray AIR {class}`~ray.train.torch.torch_trainer.TorchTrainer`.
```

#### Training speed

As we are using data parallelism, each worker operates on its own shard of the data. The batch size set in `TrainingArguments` is the **per device batch size** (per worker batch size). By changing the number of workers, we can change the **effective batch size** and thus the time needed for training to complete. The effective batch size is then calculated as `per device batch size * number of workers * number of gradient accumulation steps`. As we add more workers, the effective batch size rises and thus we need less time to complete a full epoch. While the speedup is not exactly linear due to extra communication overheads, in many cases it can be close to linear.

The preprocessed dataset has 1348 examples. We have set per device batch size to 16.

* With 16 g4dn.4xlarge nodes, the effective batch size was 256, which equals to 85 steps per epoch. One epoch took **~2440 seconds** (including initialization time).

* With 32 g4dn.4xlarge nodes, the effective batch size was 512, which equals to 43 steps per epoch. One epoch took **~1280 seconds** (including initialization time).

In [15]:
import evaluate
from transformers import Trainer, TrainingArguments
from transformers import (
    GPTJForCausalLM,
    AutoTokenizer,
    default_data_collator,
)
from transformers.utils.logging import disable_progress_bar, enable_progress_bar
import torch

from ray import train
from ray.train.huggingface.transformers import (
    prepare_trainer,
    RayTrainReportCallback
)


def train_func(config):
    # Use the actual number of CPUs assigned by Ray
    os.environ["OMP_NUM_THREADS"] = str(
        train.get_context().get_trial_resources().bundles[-1].get("CPU", 1)
    )
    # Enable tf32 for better performance
    torch.backends.cuda.matmul.allow_tf32 = True

    batch_size = config.get("batch_size", 4)
    epochs = config.get("epochs", 2)
    warmup_steps = config.get("warmup_steps", 0)
    learning_rate = config.get("learning_rate", 0.00002)
    weight_decay = config.get("weight_decay", 0.01)
    steps_per_epoch = config.get("steps_per_epoch")

    deepspeed = {
        "fp16": {
            "enabled": "auto",
            "initial_scale_power": 8,
        },
        "bf16": {"enabled": "auto"},
        "optimizer": {
            "type": "AdamW",
            "params": {
                "lr": "auto",
                "betas": "auto",
                "eps": "auto",
            },
        },
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {
                "device": "cpu",
                "pin_memory": True,
            },
            "offload_param": {
                "device": "cpu",
                "pin_memory": True,
            },
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_bucket_size": "auto",
            "stage3_prefetch_bucket_size": "auto",
            "stage3_param_persistence_threshold": "auto",
            "gather_16bit_weights_on_model_save": True,
            "round_robin_gradients": True,
        },
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "steps_per_print": 10,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "wall_clock_breakdown": False,
    }

    print("Preparing training arguments")
    training_args = TrainingArguments(
        "output",
        logging_steps=1,
        max_steps=steps_per_epoch, # DEBUG
        save_strategy="steps",
        save_steps=steps_per_epoch, # Checkpointing for every epoch
        learning_rate=learning_rate,
        weight_decay=weight_decay,
        warmup_steps=warmup_steps,
        label_names=["input_ids", "attention_mask"],
        push_to_hub=False,
        report_to="none",
        disable_tqdm=True,  # declutter the output a little
        fp16=True,
        gradient_checkpointing=True,
        deepspeed=deepspeed,
    )
    disable_progress_bar()

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    print("Loading model")

    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    model.resize_token_embeddings(len(tokenizer))

    print("Model loaded")

    enable_progress_bar()

    metric = evaluate.load("accuracy")

    train_ds = train.get_dataset_shard("train")
    eval_ds = train.get_dataset_shard("validation")

    train_ds_iterable = train_ds.iter_torch_batches(batch_size=batch_size)
    eval_ds_iterable = eval_ds.iter_torch_batches(batch_size=batch_size)

    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds_iterable,
        eval_dataset=eval_ds_iterable,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )

    # Add callback to report checkpoints to Ray Train
    trainer.add_callback(RayTrainReportCallback())
    trainer = prepare_trainer(trainer)
    trainer.train()

With our `trainer_init_per_worker` complete, we can now instantiate the {class}`~ray.train.huggingface.TransformersTrainer`. Aside from the function, we set the `scaling_config`, controlling the amount of workers and resources used, and the `datasets` we will use for training and evaluation.

We pass the preprocessors we have defined earlier as an argument, wrapped in a {class}`~ray.data.preprocessors.chain.Chain`. The preprocessor will be included with the returned {class}`~ray.train.Checkpoint`, meaning it will also be applied during inference.

```{note}
Since this example runs with multiple nodes, we need to persist checkpoints
and other outputs to some external storage for access after training has completed.
**You should set up cloud storage or NFS, then replace `storage_path` with your own cloud bucket URI or NFS path.**

See the [storage guide](tune-storage-options) for more details.
```

In [None]:
storage_path="s3://your-bucket-here"  # TODO: Set up cloud storage
# storage_path="/mnt/path/to/nfs"     # TODO: Alternatively, set up NFS

In [16]:
import os, re
artifact_storage = os.environ.get("ANYSCALE_ARTIFACT_STORAGE", "artifact_storage")
user_name = re.sub(r"\s+", "__", os.environ.get("ANYSCALE_USERNAME", "user"))
storage_path = (f"{artifact_storage}/{user_name}/gptj-deepspeed-finetune")

In [17]:
import s3fs
import pyarrow.fs

s3_additional_kwargs = {
    'MaxKeys': 32,  # equivalent to s3.max_concurrent_requests
    'TransferClient': 'crt',  # equivalent to default.s3.preferred_transfer_client
    'TargetBandwidth': 100 * 10**9,  # 100Gb/s, equivalent to default.s3.target_bandwidth
    'MultipartChunksize': 8 * 10**6  # 8MB, equivalent to default.s3.multipart_chunksize
}
s3_fs = s3fs.S3FileSystem(s3_additional_kwargs=s3_additional_kwargs)
fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3_fs))

In [21]:
from ray.train.torch import TorchTrainer
from ray.train import RunConfig, ScalingConfig
from ray.data.preprocessors import Chain

# total_steps = processed_datasets["train"].count() // 

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    train_loop_config={
        "batch_size": 16,  # per device
        "epochs": 1,
        "steps_per_epoch": 5
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets=processed_datasets,
    run_config=RunConfig(storage_path=storage_path.replace("s3://", ""), storage_filesystem=fs),
)

Finally, we call the {meth}`~ray.train.huggingface.TransformersTrainer.fit` method to start training with Ray AIR. We will save the {class}`~ray.train.Result` object to a variable so we can access metrics and checkpoints.

In [23]:
results = trainer.fit()

0,1
Current time:,2023-08-17 18:15:00
Running for:,00:08:14.56
Memory:,11.0/62.0 GiB

Trial name,# failures,error file
TorchTrainer_8052c_00000,1,/home/ray/ray_results/TorchTrainer_2023-08-17_18-06-45/TorchTrainer_8052c_00000_0_2023-08-17_18-06-45/error.txt

Trial name,status,loc
TorchTrainer_8052c_00000,ERROR,10.0.57.155:67601


[2m[36m(TrainTrainable pid=67601, ip=10.0.57.155)[0m 2023-08-17 18:06:49.780412: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
[2m[36m(TrainTrainable pid=67601, ip=10.0.57.155)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(TrainTrainable pid=67601, ip=10.0.57.155)[0m 2023-08-17 18:06:49.930195: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2m[36m(TrainTrainable pid=67601, ip=10.0.57.155)[0m 2023-08-17 18:06:50.709495: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load 

[2m[36m(RayTrainWorker pid=141374)[0m Preparing training arguments
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Loading model
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:11,581] [INFO] [partition_parameters.py:454:__exit__] finished initializing model with 6.05B parameters
[2m[36m(RayTrainWorker pid=67537, ip=10.0.46.63)[0m Preparing training arguments[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m Loading model[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Model loaded


[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m max_steps is given, it will override any value given in num_train_epochs
[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m Using cuda_amp half precision backend
[2m[36m(SplitCoordinator pid=67735, ip=10.0.57.155)[0m 2023-08-17 18:07:00.268483: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64[32m [repeated 32x across cluster][0m
[2m[36m(RayTrainWorker pid=67903, ip=10.0.18.21)[0m comet_ml is installed but `COMET_API_KEY` is not set.[32m [repeated 15x across cluster][0m


[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:47,774] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.9.2, git-hash=unknown, git-branch=unknown
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:47,787] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False


[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch extensions root...
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m max_steps is given, it will override any value given in num_train_epochs[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m Using cuda_amp half precision backend[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Detected CUDA files, patching ldflags
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu118/cpu_adam/build.ninja...
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Building extension module cpu_adam...
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[

[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m ninja: no work to do.
[2m[36m(RayTrainWorker pid=67903, ip=10.0.18.21)[0m Model loaded[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Time to load cpu_adam op: 2.673201560974121 seconds
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Adam Optimizer #0 is created with AVX512 arithmetic capability.
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Config: alpha=0.000020, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:53,638] [INFO] [logging.py:96:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:53,655] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:53,655] [INFO] [util

[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Building extension module utils...
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Loading extension module utils...


[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m Time to load utils op: 0.25148725509643555 seconds
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:54,010] [INFO] [utils.py:785:see_memory_usage] Stage 3 initialize beginning
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:54,011] [INFO] [utils.py:786:see_memory_usage] MA 0.11 GB         Max_MA 1.26 GB         CA 1.54 GB         Max_CA 2 GB 
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:54,011] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 9.2 GB, percent = 14.8%
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:54,013] [INFO] [stage3.py:113:__init__] Reduce bucket size 16777216
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:54,013] [INFO] [stage3.py:114:__init__] Prefetch bucket size 15099494
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:08:54,642] [INFO] 

[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m No modifications detected for re-loaded extension module utils, skipping build step...
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m ***** Running training *****
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m   Num examples = 640
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m   Num Epochs = 9223372036854775807
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m   Instantaneous batch size per device = 8
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m   Total train batch size (w. parallel, distributed & accumulation) = 128
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m   Gradient Accumulation steps = 1
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m   Total optimization steps = 5
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m   Number of trainable parameters = 0
[2m[36m(RayTrainWorker pid=67407, ip=10.0.10.55)[0m Using /home/ray/.cache/torch_extensions/py39_cu118 as PyTorch

[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:09:05,421] [INFO] [utils.py:785:see_memory_usage] After initializing ZeRO optimizer
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:09:05,422] [INFO] [utils.py:786:see_memory_usage] MA 0.14 GB         Max_MA 0.91 GB         CA 1.54 GB         Max_CA 2 GB 
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:09:05,422] [INFO] [utils.py:793:see_memory_usage] CPU Virtual Memory:  used = 17.52 GB, percent = 28.3%
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:09:05,422] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = adamw
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:09:05,423] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client callable to create LR scheduler
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:09:05,423] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Sche

[2m[36m(SplitCoordinator pid=67735, ip=10.0.57.155)[0m Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(split_text)->MapBatches(tokenize)] -> OutputSplitter[split(16, equal=True)]
[2m[36m(SplitCoordinator pid=67735, ip=10.0.57.155)[0m Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=2000000000.0), locality_with_output=['deec88dafdebef24aa3eb967a799ccbb7d5d88e4c87cbd6ead7a81b7', '8db6bea78fd4443c6ebc218aa3df9737bdcb6a3b6391a65b285d0bef', 'dba6653fc4def3239b91b0beb6367a5030440ca552c7d8093749e008', '4d41c0302e37bc6dee160197448bb471f9fd538c9a2de31274aa245d', 'cd8574f82b87900a11b3a8138ea6d36ae60465919beacd69017d6125', 'cac1c399f6701b72858114b4043709cb8f4c4ea3a1b4e8aadc5d2cb0', 'ad42a7376f0a219a683685c8dbb049b072957f8d68b06b174015b6bb', 'c2c06cc26415ee5e6deebb8e1abe08bdcb6ab227923c642a5c143486', '0b525a31769291a3979840630e49225ebe51c9337ac0618cefb318f7', '5ae6fe5e162e460fdb74a62a0cad31da52dfc7ae28c13aead

(pid=67735, ip=10.0.57.155) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]

[2m[36m(MapBatches(split_text)->MapBatches(tokenize) pid=68345, ip=10.0.57.155)[0m 2023-08-17 18:09:07.029287: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
[2m[36m(MapBatches(split_text)->MapBatches(tokenize) pid=68345, ip=10.0.57.155)[0m To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[2m[36m(MapBatches(split_text)->MapBatches(tokenize) pid=68345, ip=10.0.57.155)[0m 2023-08-17 18:09:07.172331: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2m[36m(MapBatches(split_text)->MapBatches(tokenize) pid=68345, ip=10.0.57.155)[0m 

[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m Time to load utils op: 0.0004165172576904297 seconds[32m [repeated 14x across cluster][0m
[2m[36m(RayTrainWorker pid=68172, ip=10.0.51.206)[0m {'loss': 12.1235, 'learning_rate': 1.6000000000000003e-05, 'epoch': 0.2}


[2m[33m(raylet)[0m [2023-08-17 18:10:14,162 E 2847 2864] (raylet) file_system_monitor.cc:111: /tmp/ray is over 95% full, available space: 7776915456; capacity: 155897610240. Object creation will fail if spilling is required.


[2m[36m(RayTrainWorker pid=67725, ip=10.0.4.91)[0m {'loss': 6.7834, 'learning_rate': 1.2e-05, 'epoch': 0.4}[32m [repeated 16x across cluster][0m


[2m[33m(raylet)[0m [2023-08-17 18:10:24,171 E 2847 2864] (raylet) file_system_monitor.cc:111: /tmp/ray is over 95% full, available space: 7535624192; capacity: 155897610240. Object creation will fail if spilling is required.
[2m[33m(raylet)[0m [2023-08-17 18:10:34,179 E 2847 2864] (raylet) file_system_monitor.cc:111: /tmp/ray is over 95% full, available space: 7535603712; capacity: 155897610240. Object creation will fail if spilling is required.


[2m[36m(RayTrainWorker pid=141374)[0m {'loss': 2.6553, 'learning_rate': 8.000000000000001e-06, 'epoch': 0.6}[32m [repeated 16x across cluster][0m
[2m[36m(RayTrainWorker pid=67782, ip=10.0.56.166)[0m {'loss': 0.3044, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.8}[32m [repeated 16x across cluster][0m
[2m[36m(RayTrainWorker pid=67820, ip=10.0.51.3)[0m {'loss': 0.1634, 'learning_rate': 0.0, 'epoch': 1.0}[32m [repeated 16x across cluster][0m


[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m Saving model checkpoint to output/checkpoint-5
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m Configuration saved in output/checkpoint-5/config.json
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m Configuration saved in output/checkpoint-5/generation_config.json
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m No modifications detected for re-loaded extension module utils, skipping build step...[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m ***** Running training *****[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m   Num examples = 640[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m   Num Epochs = 9223372036854775807[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m   Instantaneous batch size per device = 8[32m [repeated

[2m[36m(RayTrainWorker pid=67820, ip=10.0.51.3)[0m [2023-08-17 18:12:07,720] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step5 is ready now!
[2m[36m(RayTrainWorker pid=141374)[0m {'loss': 0.1634, 'learning_rate': 0.0, 'epoch': 1.0}[32m [repeated 15x across cluster][0m
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:12:07,720] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step5 is about to be saved!
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:12:07,721] [INFO] [engine.py:3337:save_16bit_model] Saving model weights to output/checkpoint-5/pytorch_model.bin, tag: global_step5
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:12:07,721] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-5/pytorch_model.bin...




[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m [2023-08-17 18:12:07,720] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step5 is ready now![32m [repeated 14x across cluster][0m
[2m[36m(RayTrainWorker pid=67782, ip=10.0.56.166)[0m [2023-08-17 18:12:22,170] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-5/global_step5/zero_pp_rank_3_mp_rank_00_model_states.pt...
[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m [2023-08-17 18:12:22,170] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-5/global_step5/zero_pp_rank_7_mp_rank_00_model_states.pt...
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:12:22,146] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved output/checkpoint-5/pytorch_model.bin.
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:12:22,160] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint global_step5 is about to be 

[2m[33m(raylet)[0m [2023-08-17 18:12:24,335 E 2847 2864] (raylet) file_system_monitor.cc:111: /tmp/ray is over 95% full, available space: 7680335872; capacity: 155897610240. Object creation will fail if spilling is required.


[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m [2023-08-17 18:12:27,291] [INFO] [engine.py:3228:_save_zero_checkpoint] zero checkpoint saved output/checkpoint-5/global_step5/zero_pp_rank_7_mp_rank_00_optim_states.pt
[2m[36m(RayTrainWorker pid=67671, ip=10.0.57.155)[0m [2023-08-17 18:12:22,146] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint global_step5 is ready now!
[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m [2023-08-17 18:12:22,806] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving output/checkpoint-5/global_step5/zero_pp_rank_7_mp_rank_00_optim_states.pt...[32m [repeated 30x across cluster][0m
[2m[36m(RayTrainWorker pid=68145, ip=10.0.52.197)[0m [2023-08-17 18:12:27,291] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved output/checkpoint-5/global_step5/zero_pp_rank_7_mp_rank_00_optim_states.pt.[32m [repeated 17x across cluster][0m
[2m[36m(RayTrainWorker pid=67248, ip=10.0.34.94)[0m [2023-08-17 18:12:22,170] [INFO] 

2023-08-17 18:12:34,341	ERROR tune_controller.py:1506 -- Trial task failed for trial TorchTrainer_8052c_00000
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/_private/worker.py", line 2526, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(Error): [36mray::_Inner.train()[39m (pid=67601, ip=10.0.57.155, actor_id=85b245b6ee648068ada8eada07000000, repr=TorchTrainer)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 392, in trai

You can use the returned {class}`~ray.train.Result` object to access metrics and the Ray AIR {class}`~ray.train.Checkpoint` associated with the last iteration.

In [13]:
checkpoint = results.checkpoint
checkpoint

Checkpoint(filesystem=<pyarrow._s3fs.S3FileSystem object at 0x7f9dbd7241b0>, path=anyscale-staging-data-cld-kvedzwag2qa8i5bjxuevf5i7/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/yunxuan__xiao/gptj-deepspeed-finetune/TorchTrainer_2023-08-17_16-09-41/TorchTrainer_25f7a_00000_0_2023-08-17_16-09-42/checkpoint_000000)

### Generate text from prompt

We can use the {class}`~ray.train.huggingface.huggingface_predictor.TransformersPredictor` to generate predictions from our fine-tuned model.

```{tip}
For large scale batch inference, see {ref}`End-to-end: Offline Batch Inference <batch_inference_home>`.
```

Because the {class}`~ray.train.huggingface.huggingface_predictor.TransformersPredictor` uses a 🤗 Transformers [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) under the hood, we disable the tokenizer AIR Preprocessor we have used for training and let the `pipeline` to tokenize the data itself.

We also set `device_map="auto"` so that the model is automatically placed on the right device and set the `task` to `"text-generation"`. The `predict` method passes the arguments to a 🤗 Transformers `pipeline` call.

In [4]:
import pandas as pd

prompts = pd.DataFrame(["Romeo and Juliet", "Romeo", "Juliet"], columns=["text"])

# Predict on the head node.
predictor = TransformersPredictor.from_checkpoint(
    checkpoint=checkpoint,
    task="text-generation",
    torch_dtype=torch.float16 if use_gpu else None,
    device_map="auto",
    use_gpu=use_gpu,
)
prediction = predictor.predict(
    prompts,
    do_sample=True,
    temperature=0.9,
    min_length=32,
    max_length=128,
)

In [5]:
prediction

Unnamed: 0,generated_text
0,"Romeo and Juliet, they are married: and it is ..."
1,"Romeo, thou art Romeo and a Montague; for only..."
2,Juliet's name; but I do not sound an ear to na...
