# GPT-J-6B Batch Prediction with Ray AIR

In this example, we showcase how to use Ray AIR for **GPT-J batch inference**. GPT-J is a GPT-2-like causal language model trained on the Pile dataset. This particular model has 6 billion parameters.

We will use Ray Data together with a pretrained model from the Hugging Face Hub. You can adapt this example to other similar models.

```{note}
To run this example, make sure your Ray cluster has access to at least one GPU with 16 GB or more of memory. The amount of memory needed depends on the model.
```
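As a rough back-of-the-envelope check on the 16 GB figure, the sketch below estimates the memory taken by the model weights alone. It assumes exactly 6 billion parameters (the real count differs slightly) and ignores activations, generation caches, and framework overhead. The helper name `weight_memory_gib` is hypothetical and is not part of Ray or Transformers.

```python
# Rough estimate of weight memory only; activations and overhead are ignored.
NUM_PARAMS = 6_000_000_000  # assumption: "6 billion" taken literally

BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the approximate size of the weights in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the float16 weights come to roughly 11 GiB and the float32 weights to roughly 22 GiB, which is consistent with loading half-precision weights on a 16 GB GPU.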

In [1]:
model_id = "EleutherAI/gpt-j-6B"
revision = "float16"  # use float16 weights to fit in 16GB GPUs
prompt = (
    "In a shocking finding, scientists discovered a herd of unicorns living in a remote, "
    "previously unexplored valley, in the Andes Mountains. Even more surprising to the "
    "researchers was the fact that the unicorns spoke perfect English."
)

In [2]:
import ray

We define a runtime environment to ensure that the Ray workers have access to all the necessary packages.

In [3]:
ray.init(
    runtime_env={
        "pip": [
            "accelerate>=0.16.0",
            "transformers>=4.26.0",
        ]
    }
)

2023-02-28 10:40:41,426	INFO worker.py:1360 -- Connecting to existing Ray cluster at address: 10.0.8.73:6379...
2023-02-28 10:40:41,436	INFO worker.py:1548 -- Connected to Ray cluster. View the dashboard at https://console.anyscale-staging.com/api/v2/sessions/ses_sedlspnpy16naa5lm9kf2cmi2y/services?redirect_to=dashboard
2023-02-28 10:40:41,464	INFO packaging.py:330 -- Pushing file package 'gcs://_ray_pkg_2bd3385b273c689360603318a30b38a4.zip' (8.55MiB) to Ray cluster...
2023-02-28 10:40:41,570	INFO packaging.py:343 -- Successfully pushed file package 'gcs://_ray_pkg_2bd3385b273c689360603318a30b38a4.zip'.


Python version: 3.8.16
Ray version: 3.0.0.dev0
Dashboard: http://console.anyscale-staging.com/api/v2/sessions/ses_sedlspnpy16naa5lm9kf2cmi2y/services?redirect_to=dashboard


For the purposes of this example, we will use a very small toy dataset composed of multiple copies of our prompt. Ray Data can handle much bigger datasets with ease.

In [4]:
import ray.data
import pandas as pd

ds = ray.data.from_pandas(pd.DataFrame([prompt] * 10, columns=["prompt"]))

You may notice that we are not using an AIR {class}`Predictor <ray.train.predictor.Predictor>` here. Predictors are mainly intended to be used with AIR {class}`Checkpoints <ray.air.Checkpoint>`. Since we will be using a pretrained model from the Hugging Face Hub, it is simpler to use {meth}`map_batches <ray.data.Dataset.map_batches>` with a [callable class UDF](transform_datasets_callable_classes). This saves time because the model is initialized just once and then fed multiple batches of data.
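The callable class pattern can be illustrated without Ray or a GPU. The sketch below is a single-process imitation: `UppercaseCallable` and `run_batches` are hypothetical names made up for illustration, and `run_batches` only mimics the idea of constructing the class once and reusing the instance. It is not how Ray Data is implemented.

```python
from typing import Iterable, List


class UppercaseCallable:
    """Toy stand-in for a model wrapper: costly setup in __init__, cheap work in __call__."""

    init_count = 0  # class-level counter so we can see how often setup runs

    def __init__(self, suffix: str):
        UppercaseCallable.init_count += 1  # pretend this is a slow model load
        self.suffix = suffix

    def __call__(self, batch: List[str]) -> List[str]:
        return [text.upper() + self.suffix for text in batch]


def run_batches(fn_cls, batches: Iterable[List[str]], **constructor_kwargs) -> List[str]:
    """Hypothetical helper: build the callable once, then reuse it for every batch."""
    fn = fn_cls(**constructor_kwargs)  # constructed a single time
    results: List[str] = []
    for batch in batches:
        results.extend(fn(batch))  # same instance handles each batch
    return results


outputs = run_batches(UppercaseCallable, [["a", "b"], ["c"]], suffix="!")
print(outputs)
print(UppercaseCallable.init_count)
```

The counter stays at one no matter how many batches are processed, which is the property that makes the pattern attractive when the setup step is loading a multi-gigabyte model.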

In [5]:
class PredictCallable:
    def __init__(self, model_id: str, revision: str = None):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            revision=revision,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            device_map="auto",
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
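        # No padding is applied: this works here because every prompt in the
        # batch is identical. Prompts of different lengths would need padding.
        input_ids = self.tokenizer(
            list(batch["prompt"]), return_tensors="pt"
        ).input_ids.to(self.model.device)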

        gen_tokens = self.model.generate(
            input_ids,
            do_sample=True,
            temperature=0.9,
            max_length=100,
        )
        return pd.DataFrame(
            self.tokenizer.batch_decode(gen_tokens), columns=["responses"]
        )
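Because a causal language model continues its input, the decoded text returned by `__call__` begins with the prompt itself. If only the continuation is wanted, a small post-processing step could remove the prefix. The sketch below is one possible approach; `strip_prompt` is a hypothetical helper, the example strings are illustrative only, and it assumes the decoded response reproduces the prompt character for character.

```python
def strip_prompt(response: str, prompt: str) -> str:
    """Return only the continuation if the response starts with the prompt."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    return response  # leave responses that do not match the assumption untouched


# Illustrative strings, not real model output.
example_prompt = "The unicorns spoke perfect English."
example_response = example_prompt + "\n\nThe finding comes from a team of researchers."
print(strip_prompt(example_response, example_prompt))
```

Falling back to the unmodified response keeps the helper safe when decoding does not round-trip the prompt exactly.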

All that is left is to run the `map_batches` method on the dataset. We request one GPU for each Ray actor that runs our callable class.

```{tip}
If you have access to large GPUs, you may want to increase the batch size to better saturate them.
```

In [6]:
preds = ds.map_batches(
    PredictCallable,
    batch_size=4,
    fn_constructor_kwargs=dict(model_id=model_id, revision=revision),
    compute="actors",
    num_gpus=1,
)
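To get a feel for what `batch_size` controls, the sketch below splits ten rows into consecutive batches in plain Python. The helper `iter_batches` here is a hypothetical stand-in written for illustration; the actual batch boundaries produced by Ray Data may differ depending on how the dataset is divided internally.

```python
from typing import Iterator, List, Sequence


def iter_batches(rows: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


rows = [f"prompt {i}" for i in range(10)]  # ten rows, like the toy dataset
for batch_size in (4, 8):
    sizes = [len(batch) for batch in iter_batches(rows, batch_size)]
    print(batch_size, sizes)
```

A larger batch size means fewer, bigger calls to the model, which is generally what helps keep a large GPU busy, at the cost of more memory per call.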

After `map_batches` is done, we can view our generated text.

In [7]:
preds.take_all()

2023-02-28 10:40:50,530	INFO bulk_executor.py:41 -- Executing DAG InputDataBuffer[Input] -> ActorPoolMapOperator[MapBatches(PredictCallable)]
MapBatches(PredictCallable):   0%|          | 0/1 [00:00<?, ?it/s]
(_MapWorker pid=739, ip=10.0.4.6) 2023-02-28 10:40:52.454038: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
(_MapWorker pid=739, ip=10.0.4.6) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(_MapWorker pid=739, ip=10.0.4.6) 2023-02-28 10:40:52.722496: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(_MapWorker pid=739, ip=10.0.4.6) 

[{'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nThe finding comes from the team of researchers, which includes Dr. Michael Goldberg, a professor and chair of the Zoology Department at the University of Maryland. Dr. Goldberg spent a year collecting and conducting research in the Ecuadorian Andes, including the Pinchahu'},
 {'responses': 'In a shocking finding, scientists discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.\n\nThe team of British, Argentine and Chilean scientists found that the elusive unicorns had been living in the valley for at least 50 years, and had even interacted with humans.\n\nThe team’s findings published in the journa