# Batch Inference with OPT 30B and Ray Data

This notebook was tested on a single p3.16xlarge instance with 8 V100 GPUs.

## Set Up
Initialize Ray and a runtime environment to ensure that all dependent packages are available.

In [2]:
import ray

ray.init(
    runtime_env={
        "pip": [
            "numpy==1.23",
            "protobuf==3.20.0",
            "transformers==4.27.2",
            "accelerate==0.17.1",
            "deepspeed==0.8.3",
        ],
        "env_vars": {
            "HF_HUB_DISABLE_PROGRESS_BARS": "1",
        }
    }
)

2023-04-22 11:12:15,071	INFO worker.py:1314 -- Using address localhost:9031 set in the environment variable RAY_ADDRESS
fatal: not a git repository (or any parent up to mount point /home/ray)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
2023-04-22 11:12:15,676	INFO worker.py:1432 -- Connecting to existing Ray cluster at address: 172.31.244.129:9031...
2023-04-22 11:12:15,724	INFO worker.py:1607 -- Connected to Ray cluster. View the dashboard at https://console.anyscale.com/api/v2/sessions/ses_jgkdnu2723aleytwqqhebr12vs/services?redirect_to=dashboard 
2023-04-22 11:12:15,732	INFO packaging.py:347 -- Pushing file package 'gcs://_ray_pkg_7ad665e3661cefc8f8037daeb0b5ba6e.zip' (0.03MiB) to Ray cluster...
2023-04-22 11:12:15,733	INFO packaging.py:360 -- Successfully pushed file package 'gcs://_ray_pkg_7ad665e3661cefc8f8037daeb0b5ba6e.zip'.


Python version: 3.9.15
Ray version: 3.0.0.dev0
Dashboard: http://console.anyscale.com/api/v2/sessions/ses_jgkdnu2723aleytwqqhebr12vs/services?redirect_to=dashboard


## Define Hyperparameters

Define the hyperparameters as fields of a dataclass, and create one global instance of it.

Refer to https://deepspeed.readthedocs.io/en/stable/inference-init.html#deepspeed.inference.config.DeepSpeedInferenceConfig for more details about the configurations of a DeepSpeed inference job.

In [2]:
from dataclasses import dataclass
from typing import Optional


@dataclass
class Config:
    model_name: str = "facebook/opt-30b"
    # Path to HuggingFace cache directory. Default is ~/.cache/huggingface/.
    cache_dir: Optional[str] = None
    # Path to the directory that actually holds model files.
    # e.g., ~/.cache/huggingface/models--facebook--opt-30b/snapshots/xxx/
    # If this path is not None, we skip downloading models from HuggingFace.
    repo_root: Optional[str] = None
    # This is how many DeepSpeed-inference replicas to run for
    # this batch inference job.
    num_worker_groups: int = 1
    # Number of DeepSpeed workers per group.
    num_workers_per_group: int = 8

    batch_size: int = 1
    dtype: str = "float16"
    # Maximum number of tokens DeepSpeed inference-engine can work with,
    # including the input and output tokens.
    max_tokens: int = 1024
    # Use meta tensors to initialize model.
    use_meta_tensor: bool = True
    # Use cache for generation.
    use_cache: bool = True
    # The path for which we want to save the loaded model with a checkpoint.
    save_mp_checkpoint_path: Optional[str] = None


config = Config()
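
To try a different layout without editing the defaults, one option is to derive a variant with `dataclasses.replace`. The sketch below is a minimal illustration: it uses a trimmed copy of `Config` so that it runs on its own, and `gpus_required` is a hypothetical helper that assumes each DeepSpeed worker occupies exactly one GPU.

```python
from dataclasses import dataclass, replace


@dataclass
class Config:
    # Trimmed copy of the notebook's Config, kept only so this sketch is standalone.
    model_name: str = "facebook/opt-30b"
    num_worker_groups: int = 1
    num_workers_per_group: int = 8
    batch_size: int = 1


def gpus_required(cfg: Config) -> int:
    # Hypothetical helper. Assumption: one GPU per DeepSpeed worker.
    return cfg.num_worker_groups * cfg.num_workers_per_group


base = Config()
# Hypothetical variant with two inference replicas of 8 workers each.
two_groups = replace(base, num_worker_groups=2)

print(gpus_required(base), gpus_required(two_groups))
```

Under that assumption, the default layout fits the single 8-GPU instance this notebook was tested on, while the two-replica variant would need a larger cluster.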

## Download and Cache Model

Next, we will download and cache model files on all instances of the cluster before we run the job.

Notice that when we download model snapshots from HuggingFace, we skip files with the safetensors, msgpack, and h5 extensions. The msgpack and h5 files hold Flax (JAX) and TensorFlow weights, and safetensors is an alternative weight serialization format. We only need the PyTorch `.bin` weights for this example.
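
To see which files such a filter would keep, here is a small sketch that applies the same three patterns with Python's `fnmatch`. It assumes glob-style matching on file names, and the file listing is hypothetical, not the actual contents of the model repository.

```python
from fnmatch import fnmatch

# Same patterns that are passed as ignore_patterns in this notebook.
ignore_patterns = ["*.safetensors", "*.msgpack", "*.h5"]

# Hypothetical file listing, for illustration only.
repo_files = [
    "config.json",
    "pytorch_model-00001-of-00007.bin",
    "flax_model.msgpack",
    "tf_model.h5",
    "model.safetensors",
]


def is_ignored(name: str) -> bool:
    # A file is skipped if it matches any of the ignore patterns.
    return any(fnmatch(name, pattern) for pattern in ignore_patterns)


kept = [name for name in repo_files if not is_ignored(name)]
print(kept)
```

Only the configuration file and the PyTorch shard survive the filter in this listing.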

We execute the ``download_model()`` function on every node of the cluster by using a ``NodeAffinitySchedulingStrategy`` from Ray Core.

In [3]:

from huggingface_hub import snapshot_download
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote
def download_model(config: Config):
    # This function downloads the specified HF model into a local directory.
    # This can also download models from cloud storages like S3.
    return snapshot_download(
        repo_id=config.model_name,
        cache_dir=config.cache_dir,
        allow_patterns=["*"],
        # Skip TF, Flax, and safetensors weight files; only PyTorch weights are needed.
        ignore_patterns=["*.safetensors", "*.msgpack", "*.h5"],
        revision=None,
    )

if config.repo_root is None:
    # Download model files to all GPU nodes, and set correct repo_root.
    refs = []
    for node in ray.nodes():
        if node["Alive"] and node["Resources"].get("GPU", None):
            node_id = node["NodeID"]
            scheduling_strategy = NodeAffinitySchedulingStrategy(
                node_id=node_id, soft=False
            )
            refs.append(
                download_model.options(
                    scheduling_strategy=scheduling_strategy
                ).remote(config)
            )

    print("Caching model locally ...")

    # Wait for models to finish downloading.
    config.repo_root = ray.get(refs)[0]

    print(f"Done. Model saved in {config.repo_root}")
else:
    print(f"Using existing model saved in {config.repo_root}")

Caching model locally ...
Done. Model saved in /home/ray/.cache/huggingface/hub/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546


## Define DeepSpeed Utility Classes

Next, we define a few utility classes and functions that are useful for setting up and running the DeepSpeed inference job.

Note that the Pipeline is modeled after https://github.com/microsoft/DeepSpeedExamples/tree/efacebb3ddbea86bb20c3af30fd060be0fa41ac8/inference/huggingface/text-generation.

In [4]:
import gc
import io
import json
import math
import os
from pathlib import Path
from typing import List

import deepspeed
import torch
from deepspeed.runtime.utils import see_memory_usage
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer


class DSPipeline:
    """
    Example helper class for comprehending DeepSpeed Meta Tensors, meant to mimic HF pipelines.
    The DSPipeline can run with and without meta tensors.
    """

    def __init__(
        self,
        model_name,
        dtype=torch.float16,
        is_meta=True,
        device=-1,
        repo_root=None,
    ):
        self.model_name = model_name
        self.dtype = dtype

        if isinstance(device, torch.device):
            self.device = device
        elif isinstance(device, str):
            self.device = torch.device(device)
        elif device < 0:
            self.device = torch.device("cpu")
        else:
            self.device = torch.device(f"cuda:{device}")

        self.tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="right")
        self.tokenizer.pad_token = self.tokenizer.eos_token

        if is_meta:
            # When meta tensors are enabled, load weights from the checkpoint files.
            self.config = AutoConfig.from_pretrained(self.model_name)
            self.checkpoints_json = self._generate_json(repo_root)

            with deepspeed.OnDevice(dtype=dtype, device="meta"):
                self.model = AutoModelForCausalLM.from_config(self.config)
        else:
            self.model = AutoModelForCausalLM.from_pretrained(self.model_name)

        self.model.eval()

    def __call__(self, inputs, **kwargs):
        input_list = [inputs] if isinstance(inputs, str) else inputs
        outputs = self.generate_outputs(input_list, **kwargs)
        return outputs

    def _generate_json(self, repo_root):
        if os.path.exists(os.path.join(repo_root, "ds_inference_config.json")):
            # Simply use the available inference config.
            return os.path.join(repo_root, "ds_inference_config.json")

        # Write a checkpoints config file in local directory.
        checkpoints_json = "checkpoints.json"

        with io.open(checkpoints_json, "w", encoding="utf-8") as f:
            file_list = [
                str(entry).split("/")[-1]
                for entry in Path(repo_root).rglob("*.[bp][it][n]")
                if entry.is_file()
            ]
            data = {
                # Hardcode bloom for now.
                # Possible choices are "bloom", "ds_model", "Megatron".
                "type": "bloom",
                "checkpoints": file_list,
                "version": 1.0
            }
            json.dump(data, f)

        return checkpoints_json

    def generate_outputs(self, inputs, **generate_kwargs):
        input_tokens = self.tokenizer.batch_encode_plus(
            inputs, return_tensors="pt", padding=True
        )
        for t in input_tokens:
            if torch.is_tensor(input_tokens[t]):
                input_tokens[t] = input_tokens[t].to(self.device)

        self.model.cuda().to(self.device)

        outputs = self.model.generate(**input_tokens, **generate_kwargs)
        outputs = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)

        return outputs


def _memory_usage(gpu_id: int, msg: str):
    """Print memory usage."""
    if gpu_id != 0:
        return
    see_memory_usage(msg, True)


def init_model(config: Config, world_size: int, gpu_id: int) -> DSPipeline:
    """Initialize the deepspeed model."""
    data_type = getattr(torch, config.dtype)

    _memory_usage(gpu_id, "before init")
    pipe = DSPipeline(
        model_name=config.model_name,
        dtype=data_type,
        is_meta=config.use_meta_tensor,
        device=gpu_id,
        repo_root=config.repo_root,
    )
    _memory_usage(gpu_id, "after init")

    if config.use_meta_tensor:
        ds_kwargs = dict(
            base_dir=config.repo_root, checkpoint=pipe.checkpoints_json
        )
    else:
        ds_kwargs = dict()

    gc.collect()

    pipe.model = deepspeed.init_inference(
        pipe.model,
        dtype=data_type,
        mp_size=world_size,
        replace_with_kernel_inject=True,
        replace_method=True,
        max_tokens=config.max_tokens,
        save_mp_checkpoint_path=config.save_mp_checkpoint_path,
        **ds_kwargs,
    )
    _memory_usage(gpu_id, "after init_inference")

    return pipe


def generate(
    input_sentences: List[str], pipe: DSPipeline, batch_size: int, **generate_kwargs
) -> List[str]:
    """Generate predictions using a DSPipeline."""
    if batch_size > len(input_sentences):
        # Dynamically extend to support larger bs by repetition.
        input_sentences *= math.ceil(batch_size / len(input_sentences))

    inputs = input_sentences[:batch_size]
    outputs = pipe(inputs, **generate_kwargs)
    return outputs
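
To make the checkpoint index written by `_generate_json` concrete, the sketch below rebuilds the same dictionary for a temporary directory that contains empty placeholder files. The file names are hypothetical, and the list is sorted here only to keep the printed output stable (the notebook code does not sort it).

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as repo_root:
    # Hypothetical snapshot contents: two weight shards and one config file.
    for name in [
        "pytorch_model-00001-of-00002.bin",
        "pytorch_model-00002-of-00002.bin",
        "config.json",
    ]:
        (Path(repo_root) / name).touch()

    # Same glob as in DSPipeline._generate_json; it matches "*.bin" files
    # (among a few other three-letter suffixes) and skips "config.json".
    file_list = sorted(
        entry.name
        for entry in Path(repo_root).rglob("*.[bp][it][n]")
        if entry.is_file()
    )
    index = {"type": "bloom", "checkpoints": file_list, "version": 1.0}

print(json.dumps(index, indent=2))
```

The resulting JSON lists only the weight shards, which is the file list that the `checkpoint` argument of `deepspeed.init_inference` receives in `init_model`.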



## Define a DeepSpeed Predictor

Define an AIR Predictor to be instantiated by the Dataset pipeline below.

Each DeepSpeedPredictor is a stateful Ray actor that understands how to process the input prompt using a group of DeepSpeed inference workers.

More specifically, each DeepSpeedPredictor spins up multiple PredictionWorkers, joins them into a PyTorch distributed (torch.distributed) process group, and then initializes the model on each of them. Since the model is loaded using the DeepSpeed inference framework, each PredictionWorker handles a shard of the entire DeepSpeed inference model.
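
A back-of-the-envelope calculation shows why the model has to be sharded. The sketch below assumes roughly 30 billion parameters (an approximation taken from the model name), 2 bytes per parameter for float16, and an even split across 8 workers; it ignores activations, caches, and any tensors that are not sharded.

```python
GIB = 1024 ** 3

approx_params = 30e9   # assumption: about 30B parameters
bytes_per_param = 2    # float16
num_shards = 8         # one shard per PredictionWorker

total_gib = approx_params * bytes_per_param / GIB
per_gpu_gib = total_gib / num_shards

print(f"{total_gib:.1f} GiB total, {per_gpu_gib:.1f} GiB per GPU")
```

The weights alone are far larger than a single GPU's memory, while one eighth of them fits. The `see_memory_usage` output later in this notebook reports about 7.69 GB allocated on a worker after `init_inference`, which is in the same range as this estimate.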


In [5]:
from typing import List

import pandas as pd
import ray
import ray.util
from ray.air import Checkpoint, ScalingConfig
from ray.air.util.torch_dist import (
    TorchDistributedWorker,
    init_torch_dist_process_group,
    shutdown_torch_dist_process_group,
)
from ray.train.predictor import Predictor
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy


@ray.remote
class PredictionWorker(TorchDistributedWorker):
    """A PredictionWorker is a Ray remote actor that runs a single shard of a DeepSpeed job.
    
    Multiple PredictionWorkers of the same WorkerGroup form a PyTorch distributed
    process group and work together under the orchestration of DeepSpeed.
    """
    def __init__(self, config: Config, world_size: int):
        self.config = config
        self.world_size = world_size

    def init_model(self, local_rank: int):
        """Initialize model for inference."""
        # Note: We have to provide the local_rank that was used to initiate
        # the torch.distributed process group here. e.g., a PredictionWorker
        # may be the rank 0 worker of a group, but occupies gpu 7.
        self.generator = init_model(self.config, self.world_size, local_rank)

    def generate(self, data: pd.DataFrame, column: str, **kwargs) -> List[str]:
        return generate(
            list(data[column]), self.generator, self.config.batch_size, **kwargs
        )


# TODO: This Predictor should be part of Ray AIR.
class DeepSpeedPredictor(Predictor):
    def __init__(self, checkpoint: Checkpoint, scaling_config: ScalingConfig) -> None:
        self.checkpoint = checkpoint
        self.scaling_config = scaling_config
        self.init_worker_group(scaling_config)

    def __del__(self):
        shutdown_torch_dist_process_group(self.prediction_workers)

    def init_worker_group(self, scaling_config: ScalingConfig):
        """Create the worker group.

        Each worker in the group communicates with other workers through the
        torch distributed backend. The worker group is inelastic (a failure of
        one worker destroys the entire group). Each worker in the group
        receives the same input data and outputs the same generated text.
        """
        config = self.checkpoint.to_dict()["config"]

        # Start a placement group for the workers.
        self.pg = scaling_config.as_placement_group_factory().to_placement_group()
        prediction_worker_cls = PredictionWorker.options(
            num_cpus=scaling_config.num_cpus_per_worker,
            num_gpus=scaling_config.num_gpus_per_worker,
            resources=scaling_config.additional_resources_per_worker,
            scheduling_strategy=PlacementGroupSchedulingStrategy(
                placement_group=self.pg, placement_group_capture_child_tasks=True
            ),
        )
        # Create the prediction workers.
        self.prediction_workers = [
            prediction_worker_cls.remote(config, scaling_config.num_workers)
            for _ in range(scaling_config.num_workers)
        ]

        # Initialize torch distributed process group for the workers.
        local_ranks = init_torch_dist_process_group(self.prediction_workers, backend="nccl")

        # Initialize the model on each worker.
        ray.get([
            worker.init_model.remote(local_rank)
            for worker, local_rank in zip(self.prediction_workers, local_ranks)
        ])

    def _predict_pandas(
        self,
        data: pd.DataFrame,
        input_column: str = "prompt",
        output_column: str = "output",
        **kwargs
    ) -> pd.DataFrame:
        data_ref = ray.put(data)
        prediction = ray.get(
            [
                worker.generate.remote(data_ref, column=input_column, **kwargs)
                for worker in self.prediction_workers
            ]
        )[0]

        return pd.DataFrame(prediction, columns=[output_column])

    @classmethod
    def from_checkpoint(cls, checkpoint: Checkpoint, **kwargs) -> "Predictor":
        return cls(checkpoint=checkpoint, **kwargs)
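
The fan-out in `_predict_pandas` can be illustrated without Ray or GPUs. In the sketch below, `FakeShardWorker` is a hypothetical stand-in for a PredictionWorker, and a thread pool plays the role of the remote calls; the point is only that every worker gets the same batch, all of them return the same result, and the caller keeps the first one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class FakeShardWorker:
    """Hypothetical stand-in for a PredictionWorker; no model or GPU involved."""

    def __init__(self, rank: int):
        self.rank = rank

    def generate(self, prompts: List[str]) -> List[str]:
        # Every rank derives the same output from the same input.
        return [prompt + " ..." for prompt in prompts]


workers = [FakeShardWorker(rank) for rank in range(8)]
batch = ["DeepSpeed is", "Test"]

# Send the same batch to every worker, then wait for all of them.
with ThreadPoolExecutor(max_workers=len(workers)) as pool:
    futures = [pool.submit(worker.generate, batch) for worker in workers]
    results = [future.result() for future in futures]

# All results are identical, so the first one is returned to the caller.
prediction = results[0]
print(prediction)
```

In the real predictor, all workers must take part in each call because each one holds only a shard of the model.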


## Create a Dataset Pipeline

Finally, we connect all these pieces together, and use a BatchPredictor to run multiple copies of the DeepSpeedPredictor actors.

This step helps parallelize our batch inference job and utilize all available resources in the cluster.
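
As a rough guide to the numbers used in the next cell, the sketch below works out how the prompts spread over blocks and batches. It assumes that repartitioning splits rows evenly and that each batch holds exactly `batch_size` rows; the actual block sizes are up to Ray Data.

```python
import math

num_prompts = 4 * 16            # four distinct prompts, repeated 16 times
num_workers_per_group = 8       # same value as config.num_workers_per_group
num_blocks = num_workers_per_group * 2
batch_size = 1

# Assumption: rows are split evenly across blocks.
rows_per_block = num_prompts / num_blocks
num_batches = math.ceil(num_prompts / batch_size)

print(f"{num_blocks} blocks, {rows_per_block:.0f} rows per block, {num_batches} batches")
```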

In [6]:
import pandas as pd
import ray
from ray.air import Checkpoint, ScalingConfig
from ray.train.batch_predictor import BatchPredictor

# Disable terminal progress bar for notebook environments.
ray.data.set_progress_bars(False)

# Prompts.
# For testing purposes, we create 64 prompts in total.
df = pd.DataFrame(
    [
        "DeepSpeed is",
        "Test",
        "Please complete",
        "How can you"
    ] * 16,
    columns=["prompt"]
)
ds = (
    ray.data.from_pandas(df)
    # Make sure there are enough blocks for parallelized execution.
    .repartition(config.num_workers_per_group * 2)
    .random_shuffle()
    .fully_executed()
)

# Scaling config for one worker group.
group_scaling_config = ScalingConfig(
    use_gpu=True,
    num_workers=config.num_workers_per_group,
    # Should not be necessary after we switch to the new API.
    trainer_resources={"CPU": 0},
)
batch_predictor = BatchPredictor.from_checkpoint(
    # TODO: Use HuggingFaceDeepSpeedCheckpoint when it's available.
    Checkpoint.from_dict({"config": config}),
    DeepSpeedPredictor,
    scaling_config=group_scaling_config,
)

# Batch prediction.
pred = batch_predictor.predict(
    ds,
    batch_size=1,
    num_cpus_per_worker=0,
    min_scoring_workers=config.num_worker_groups,
    max_scoring_workers=config.num_worker_groups,
    # Kwargs passed to model.generate()
    do_sample=True,
    temperature=0.9,
    max_length=100,
)

# Let's see the generated texts.
print(pred.to_pandas())

2023-04-22 11:14:12,079	INFO streaming_executor.py:87 -- Executing DAG InputDataBuffer[Input] -> AllToAllOperator[Repartition] -> AllToAllOperator[RandomShuffle]
2023-04-22 11:14:12,081	INFO streaming_executor.py:88 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-04-22 11:14:12,082	INFO streaming_executor.py:90 -- Tip: To enable per-operator progress reporting, set RAY_DATA_VERBOSE_PROGRESS=1.


- Repartition 1:   0%|          | 0/16 [00:00<?, ?it/s]

- RandomShuffle 3:   0%|          | 0/16 [00:00<?, ?it/s]

2023-04-22 11:14:12,680	INFO streaming_executor.py:87 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(ScoringWrapper)]
2023-04-22 11:14:12,682	INFO streaming_executor.py:88 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
2023-04-22 11:14:12,683	INFO streaming_executor.py:90 -- Tip: To enable per-operator progress reporting, set RAY_DATA_VERBOSE_PROGRESS=1.
2023-04-22 11:14:12,785	INFO actor_pool_map_operator.py:114 -- MapBatches(ScoringWrapper): Waiting for 1 pool actors to start...
(_MapWorker pid=7005) The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


(PredictionWorker pid=10038) [2023-04-22 11:14:30,762] [INFO] [utils.py:829:see_memory_usage] before init
(PredictionWorker pid=10038) [2023-04-22 11:14:30,762] [INFO] [utils.py:830:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
(PredictionWorker pid=10038) [2023-04-22 11:14:30,762] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 11.63 GB, percent = 2.4%


(PredictionWorker pid=10040) --------------------------------------------------------------------------
(PredictionWorker pid=10040)                  Aim collects anonymous usage analytics.                 
(PredictionWorker pid=10040)                         Read how to opt-out here:                         
(PredictionWorker pid=10040)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html    
(PredictionWorker pid=10040) --------------------------------------------------------------------------


(PredictionWorker pid=10045) [2023-04-22 11:14:33,061] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
(PredictionWorker pid=10045) [2023-04-22 11:14:33,062] [INFO] [logging.py:93:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
(PredictionWorker pid=10038) [2023-04-22 11:14:33,074] [INFO] [utils.py:829:see_memory_usage] after init
(PredictionWorker pid=10038) [2023-04-22 11:14:33,075] [INFO] [utils.py:830:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
(PredictionWorker pid=10038) [2023-04-22 11:14:33,075] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 12.25 GB, percent = 2.6%


(PredictionWorker pid=10040) Using /home/ray/.cache/torch_extensions/py39_cu116 as PyTorch extensions root...
(PredictionWorker pid=10038) Creating extension directory /home/ray/.cache/torch_extensions/py39_cu116/transformer_inference...
(PredictionWorker pid=10038) Detected CUDA files, patching ldflags
(PredictionWorker pid=10038) Emitting ninja build file /home/ray/.cache/torch_extensions/py39_cu116/transformer_inference/build.ninja...
(PredictionWorker pid=10038) Building extension module transformer_inference...
(PredictionWorker pid=10038) Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


(PredictionWorker pid=10038) [1/9] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=transformer_inference -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/transformer/inference/includes -I/home/ray/anaconda3/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/TH -isystem /home/ray/anaconda3/lib/python3.9/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /home/ray/anaconda3/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=

(PredictionWorker pid=10038) Loading extension module transformer_inference...
(PredictionWorker pid=10041) -------------------------------------------------------------------------- [repeated 14x across cluster]
(PredictionWorker pid=10041)                  Aim collects anonymous usage analytics.                  [repeated 7x across cluster]
(PredictionWorker pid=10041)                         Read how to opt-out here:                          [repeated 7x across cluster]
(PredictionWorker pid=10041)     https://aimstack.readthedocs.io/en/latest/community/telemetry.html     [repeated 7x across cluster]
(PredictionWorker pid=10041) Using /home/ray/.cache/torch_extensions/py39_cu116 as PyTorch extensions root... [repeated 7x across cluster]


(PredictionWorker pid=10038) [9/9] c++ pt_binding.o gelu.cuda.o relu.cuda.o layer_norm.cuda.o softmax.cuda.o dequantize.cuda.o apply_rotary_pos_emb.cuda.o transform.cuda.o -shared -lcurand -L/home/ray/anaconda3/lib/python3.9/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o transformer_inference.so
(PredictionWorker pid=10038) Time to load transformer_inference op: 46.834928035736084 seconds
(PredictionWorker pid=10038) [2023-04-22 11:15:21,799] [INFO] [logging.py:93:log_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 7168, 'intermediate_size': 28672, 'heads': 56, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_t

(PredictionWorker pid=10040) No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading 7 checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]
Loading 7 checkpoint shards:  14%|█▍        | 1/7 [00:39<03:57, 39.57s/it]
(PredictionWorker pid=10041) Loading extension module transformer_inference... [repeated 15x across cluster]
(PredictionWorker pid=10041) Using /home/ray/.cache/torch_extensions/py39_cu116 as PyTorch extensions root... [repeated 8x across cluster]
(PredictionWorker pid=10041) No modifications detected for re-loaded extension module transformer_inference, skipping build step... [repeated 7x across cluster]
Loading 7 checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s] [repeated 7x across cluster]
Loading 7 checkpoint shards:  29%|██▊       | 2/7 [01:15<03:06, 37.25s/it] [repeated 8x across cluster]
Loading 7 checkpoint shards:  29%|██▊       | 2/7 [01:28<03:42, 44.58s/it] [repeated 7x across cluster]

(PredictionWorker pid=10044) checkpoint loading time at rank 6: 216.07904958724976 sec
(PredictionWorker pid=10040) Time to load transformer_inference op: 0.03857231140136719 seconds [repeated 15x across cluster]


Loading 7 checkpoint shards: 100%|██████████| 7/7 [03:36<00:00, 30.87s/it]
Loading 7 checkpoint shards: 100%|██████████| 7/7 [03:36<00:00, 30.87s/it]
Loading 7 checkpoint shards: 100%|██████████| 7/7 [03:43<00:00, 31.88s/it] [repeated 6x across cluster]


(PredictionWorker pid=10040) checkpoint loading time at rank 1: 223.18208837509155 sec [repeated 6x across cluster]
(PredictionWorker pid=10038) [2023-04-22 11:19:13,839] [INFO] [utils.py:829:see_memory_usage] after init_inference
(PredictionWorker pid=10038) [2023-04-22 11:19:13,840] [INFO] [utils.py:830:see_memory_usage] MA 7.69 GB         Max_MA 7.69 GB         CA 7.83 GB         Max_CA 8 GB 
(PredictionWorker pid=10038) [2023-04-22 11:19:13,840] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 22.22 GB, percent = 4.6%
(PredictionWorker pid=10039) [2023-04-22 11:19:13,840] [INFO] [utils.py:838:see_memory_usage] CPU Virtual Memory:  used = 22.22 GB, percent = 4.6%
(PredictionWorker pid=10038) ------------------------------------------------------
(PredictionWorker pid=10038) Free memory : 6.587830 (GigaBytes)  
(PredictionWorker pid=10038) Total memory: 15.781921 (GigaBytes)  
(PredictionWorker pid=10038) Requested memory: 0.601562 (GigaBytes) 

(PredictionWorker pid=10040) 2023-04-22 11:19:26.855845: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(PredictionWorker pid=10040) 2023-04-22 11:19:26.856002: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64


                                               output
0   DeepSpeed is the one to go with. No need for a...
1   Testimonials:\n\nG. SACCHIOULAS (TX)\n\n"We bo...
2   Testimonials\n\nI received my order today, I'm...
3   Testimonials\n\nWhat do our clients say about ...
4   How can you make them that high?\nI edited the...
..                                                ...
59  Please complete the form below to request more...
60  DeepSpeed is the most popular way of dealing t...
61  How can you not tell that's not a real tweet?\...
62  Testimonials\n\n"The staff and community of H....
63  DeepSpeed is an independent, privately held co...

[64 rows x 1 columns]
(autoscaler +12m27s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +12m27s) Resized to 64 CPUs, 8 GPUs.
