# Batch Inference with LoRA Adapters

In this example, we show how to perform batch inference using Ray Data LLM with a base LLM and a LoRA adapter.

To run this example, install the following dependencies:

```bash
pip install -qU "ray[data,llm]"
```

```python
import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

# First construct a vLLM processor config.
processor_config = vLLMEngineProcessorConfig(
    # The base model.
    model="unsloth/Llama-3.2-1B-Instruct",
    # vLLM engine config.
    engine_kwargs=dict(
        # Enable LoRA in the vLLM engine; otherwise you won't be able to
        # process requests with LoRA adapters.
        enable_lora=True,
        # You need to set the LoRA rank for the adapter.
        # The LoRA rank is the value of "r" in the LoRA config.
        # If you want to use multiple LoRA adapters in this pipeline,
        # specify the maximum LoRA rank among all of them.
        max_lora_rank=32,
        # The maximum number of LoRA adapters that vLLM caches. "1" means
        # vLLM only caches one LoRA adapter at a time, so if your dataset
        # needs more than one LoRA adapter, vLLM has to switch between
        # them (context switching). Increasing max_loras reduces the
        # context switching but increases the memory footprint.
        max_loras=1,
    ),
    # The batch size used in Ray Data.
    batch_size=16,
    # Use one GPU in this example.
    concurrency=1,
)

# Then construct a processor using the processor config.
processor = build_llm_processor(
    processor_config,
    # Convert the input data to the OpenAI chat form.
    preprocess=lambda row: dict(
        # If you specify "model" in a request, and the model is different
        # from the model you specify in the processor config, then this
        # is the LoRA adapter. The "model" here can be a LoRA adapter
        # available in the HuggingFace Hub or a local path.
        model="EdBergJr/Llama32_Baha_3",
        messages=[
            {"role": "system",
             "content": "You are a calculator. Please only output the answer "
                "of the given equation."},
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=20,
            detokenize=False,
        ),
    ),
    # Only keep the generated text in the output dataset.
    postprocess=lambda row: {
        "resp": row["generated_text"],
    },
)

# Synthesize a dataset with 30 rows.
ds = ray.data.range(30)
ds = ds.map(lambda x: {"id": x["id"]})

# Apply the processor to the dataset. Note that this line doesn't kick off
# any execution because the processor is executed lazily.
ds = processor(ds)
# Materialization kicks off the pipeline execution.
ds = ds.materialize()

# Print all outputs.
for out in ds.take_all():
    print(out)
    print("==========")
```
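The comments on `max_lora_rank` above say it should be the largest `r` among all adapters in the pipeline. As a minimal sketch, assuming the adapters are available locally in PEFT format (each directory containing an `adapter_config.json` with an `"r"` field), the value could be derived rather than hard-coded. The helper name `max_rank_from_adapters` and the directory layout are assumptions for illustration, not part of Ray Data LLM or vLLM.

```python
import json
from pathlib import Path


def max_rank_from_adapters(adapter_dirs):
    """Return the largest LoRA rank ("r") across local adapter directories.

    Hypothetical helper: assumes each directory holds a PEFT-style
    adapter_config.json file with an integer "r" field.
    """
    ranks = []
    for adapter_dir in adapter_dirs:
        config_path = Path(adapter_dir) / "adapter_config.json"
        with open(config_path) as f:
            ranks.append(int(json.load(f)["r"]))
    if not ranks:
        raise ValueError("No adapter directories were given.")
    return max(ranks)
```

The result could then be passed as `max_lora_rank` in `engine_kwargs`. Whether vLLM accepts an arbitrary rank value is worth checking against the vLLM version in use.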

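The example above sends every row to the same adapter. The comments on `model` and `max_loras` suggest that different rows may name different adapters, so the following is a sketch of a named `preprocess` function that routes rows between two adapters. The adapter names are hypothetical placeholders, and the even/odd routing rule is only there to make the idea concrete.

```python
# Hypothetical adapter identifiers, for illustration only.
ADAPTERS = ["example-org/adapter-even", "example-org/adapter-odd"]


def preprocess(row):
    """Build an OpenAI-chat-style request, picking a LoRA adapter per row."""
    # Route by row id: even ids use the first adapter, odd ids the second.
    adapter = ADAPTERS[row["id"] % len(ADAPTERS)]
    return dict(
        model=adapter,
        messages=[
            {"role": "system",
             "content": "You are a calculator. Please only output the answer "
                        "of the given equation."},
            {"role": "user", "content": f"{row['id']} ** 3 = ?"},
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=20,
            detokenize=False,
        ),
    )
```

Such a function would be passed as `preprocess=preprocess` to `build_llm_processor`. With two adapters in play, setting `max_loras=len(ADAPTERS)` in `engine_kwargs` should reduce adapter switching at the cost of a larger memory footprint, as the config comments describe.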

INFO 02-21 15:58:15 __init__.py:190] Automatically detected platform cuda.


2025-02-21 15:58:17,900	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.0.126.121:6379...
2025-02-21 15:58:17,910	INFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at https://session-6aga4yn8zhbb2tsdr587kwe46n.i.anyscaleuserdata-staging.com
2025-02-21 15:58:17,911	INFO packaging.py:367 -- Pushing file package 'gcs://_ray_pkg_cb9f39504b588559e56f00112d1f2e92ee33dbce.zip' (0.00MiB) to Ray cluster...
2025-02-21 15:58:17,912	INFO packaging.py:380 -- Successfully pushed file package 'gcs://_ray_pkg_cb9f39504b588559e56f00112d1f2e92ee33dbce.zip'.
2025-02-21 15:58:17,942	INFO streaming_executor.py:108 -- Starting execution of Dataset. Full logs are in /tmp/ray/session_2025-02-21_09-38-35_001765_2789/logs/ray-data
2025-02-21 15:58:17,942	INFO streaming_executor.py:109 -- Execution plan of Dataset: InputDataBuffer[Input] -> ActorPoolMapOperator[ReadRange->Map(<lambda>)->Map(_preprocess)->MapBatches(ChatTemplateUDF)] -> ActorPoolMapOp

Running 0: 0.00 row [00:00, ? row/s]

(_MapWorker pid=154950) INFO 02-21 15:58:25 __init__.py:190] Automatically detected platform cuda.
(_MapWorker pid=155088) INFO 02-21 15:58:35 __init__.py:190] Automatically detected platform cuda.
(_MapWorker pid=155187) INFO 02-21 15:58:44 __init__.py:190] Automatically detected platform cuda.


(_MapWorker pid=155187) Max pending requests is set to 141


[36m(_MapWorker pid=155187)[0m INFO 02-21 15:58:55 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='unsloth/Llama-3.2-1B-Instruct', speculative_config=None, tokenizer='unsloth/Llama-3.2-1B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=unsloth/Llama-3.2-1B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=Fal

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


[36m(_MapWorker pid=155187)[0m INFO 02-21 15:58:58 weight_utils.py:297] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.39it/s]


[36m(_MapWorker pid=155187)[0m INFO 02-21 15:58:58 model_runner.py:1115] Loading model weights took 2.3185 GB
[36m(_MapWorker pid=155187)[0m INFO 02-21 15:58:58 punica_selector.py:18] Using PunicaWrapperGPU.
[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:02 worker.py:267] Memory profiling takes 3.16 seconds
[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:02 worker.py:267] the current vLLM instance can use total_gpu_memory (44.53GiB) x gpu_memory_utilization (0.90) = 40.07GiB
[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:02 worker.py:267] model weights take 2.32GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 7.51GiB; the rest of the memory reserved for KV Cache is 30.16GiB.
[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:02 executor_base.py:110] # CUDA blocks: 61774, # CPU blocks: 8192
[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:02 executor_base.py:115] Maximum concurrency for 131072 tokens per request: 7.54x


Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]


[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:05 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.


Capturing CUDA graph shapes:   3%|▎         | 1/35 [00:00<00:28,  1.21it/s]
Capturing CUDA graph shapes:   6%|▌         | 2/35 [00:01<00:26,  1.27it/s]
Capturing CUDA graph shapes:   9%|▊         | 3/35 [00:02<00:22,  1.43it/s]
Capturing CUDA graph shapes:  11%|█▏        | 4/35 [00:02<00:20,  1.55it/s]
Capturing CUDA graph shapes:  14%|█▍        | 5/35 [00:03<00:17,  1.70it/s]
Capturing CUDA graph shapes:  17%|█▋        | 6/35 [00:03<00:16,  1.80it/s]
Capturing CUDA graph shapes:  20%|██        | 7/35 [00:04<00:15,  1.83it/s]
Capturing CUDA graph shapes:  23%|██▎       | 8/35 [00:04<00:14,  1.89it/s]
Capturing CUDA graph shapes:  26%|██▌       | 9/35 [00:05<00:13,  1.95it/s]
Capturing CUDA graph shapes:  29%|██▊       | 10/35 [00:05<00:12,  1.96it/s]
Capturing CUDA graph shapes:  31%|███▏      | 11/35 [00:06<00:12,  1.99it/s]
Capturing CUDA graph shapes:  34%|███▍      | 12/35 [00:06<00:11,  2.01it/s]
Capturing CUDA graph shapes:  37%|███▋      | 13/35 [00:07<00:10,  2.03it/s]
Capturin

[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:23 model_runner.py:1562] Graph capturing finished in 18 secs, took 1.12 GiB
[36m(_MapWorker pid=155187)[0m INFO 02-21 15:59:23 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 24.21 seconds
[36m(_MapWorker pid=155539)[0m INFO 02-21 15:59:32 __init__.py:190] Automatically detected platform cuda.


- ReadRange->Map(<lambda>)->Map(_preprocess)->MapBatches(ChatTemplateUDF) 1: 0.00 row [00:00, ? row/s]

- MapBatches(TokenizeUDF) 2: 0.00 row [00:00, ? row/s]

- MapBatches(vLLMEngineStageUDF) 3: 0.00 row [00:00, ? row/s]

- MapBatches(DetokenizeUDF) 4: 0.00 row [00:00, ? row/s]

- Map(_postprocess) 5: 0.00 row [00:00, ? row/s]

Fetching 15 files: 100%|██████████| 15/15 [00:00<00:00, 19070.80it/s]


[36m(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=155187)[0m INFO 02-21 15:59:35 metrics.py:455] Avg prompt throughput: 150.2 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 30 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.


[36m(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=155187)[0m [vLLM] Elapsed time for batch 2eba76b6bd0d4125a75466be157196e6 with size 14: 1.1405171180012985
[36m(MapWorker(MapBatches(vLLMEngineStageUDF)) pid=155187)[0m [vLLM] Elapsed time for batch 5171ea7cc1d247d089c86ff5455f4bfe with size 16: 1.1826866699993843


{'resp': '21.'}
{'resp': 'The answer is 27.'}
{'resp': 'The answer is: 30.'}
{'resp': 'The answer is: 36.'}
{'resp': 'The answer is  39.'}
{'resp': 'The answer is  24.'}
{'resp': '0 ** 3 = 0.'}
{'resp': '11 × 3 = 33.'}
{'resp': '15 × 3 = 45.'}
{'resp': '5 × 3 = 15.'}
{'resp': '1  * 3 = 3.'}
{'resp': 'The answer to 3 × 3 is 9.'}
{'resp': 'The answer is  6 **  3 = 18.'}
{'resp': 'The answer is: 42. For in this number there is hidden a mystery that none, except'}
{'resp': 'God bless Thee, O Thou the Most Merciful! The answer is 6. For if'}
{'resp': 'I am unable to provide the answer to the question 4 × 3 = 12. If'}
{'resp': '961.'}
{'resp': 'The answer is 6.'}
{'resp': 'The answer is 60.'}
{'resp': 'The answer is  51.'}
{'resp': 'The answer is  69.'}
{'resp': 'The answer is  66.'}
{'resp': 'The answer is  69.'}
{'resp': 'The answer is:  8'}
{'resp': 'The answer is  75.'}
{'resp': 'The answer is  9.'}
{'resp': '16 × 3 = 48.'}
{'resp': 'The answer to 2 × 8 is 16.'}
{'resp': '  26  multiplie

(autoscaler +30m24s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
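The `[vLLM] Elapsed time for batch ...` lines in the output report two batches, of sizes 14 and 16. That is consistent with 30 rows being split under `batch_size=16`. The sketch below only reproduces that arithmetic; the helper `batch_sizes` is hypothetical, and how Ray Data actually forms batches can depend on how the dataset is divided into blocks.

```python
def batch_sizes(num_rows, batch_size):
    """Sizes of the batches produced by splitting num_rows into chunks
    of at most batch_size rows (pure arithmetic, not a Ray Data API)."""
    full_batches, remainder = divmod(num_rows, batch_size)
    sizes = [batch_size] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes


print(batch_sizes(30, 16))  # [16, 14]
```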
