# Llama 3.1 Batch Inference with Ray Data LLM

This notebook demonstrates how to use Ray Data LLM for batch inference with the Meta Llama 3.1 model. The workflow requires significant GPU resources, so check that your system meets the minimum requirements listed in the documentation before you start.

## 1. Initialize Ray

In [None]:
import ray

# Connect to an already-running Ray cluster (start one with `ray start --head`).
# address="auto" raises a ConnectionError if no running cluster is found.
ray.init(address="auto")
print("Ray initialized successfully!")

## 2. Configure the vLLM Processor

For Llama models, we need to configure the vLLM processor with appropriate parameters to manage memory efficiently. The reference system used for this project has an RTX 2080 with 8GB VRAM. The 16-bit weights of the 8B model alone take roughly 16GB, so on a card of this size expect to need a quantized checkpoint or a smaller model. The settings below reduce the remaining memory pressure (KV cache, batched tokens, batch size) for hardware with similar specifications.

In [None]:
from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

# These parameters target an RTX 2080 with 8GB VRAM.
# IMPORTANT: They shrink the KV cache and batch size; they do not shrink the model weights.
config = vLLMEngineProcessorConfig(
    # Smallest Llama 3.1 model. Newer Ray releases name this argument `model_source`.
    model="meta-llama/Llama-3.1-8B-Instruct",
    engine_kwargs={
        "enable_chunked_prefill": True,  # Helps with memory efficiency
        "max_num_batched_tokens": 2048,  # Reduced from 4096 to save memory
        "max_model_len": 8192,           # Reduced from 16384 to save memory
        "gpu_memory_utilization": 0.85,  # Control memory utilization
        "tensor_parallel_size": 1        # No model parallelism for single GPU
    },
    concurrency=1,  # Single concurrent worker to avoid memory issues
    batch_size=16   # Reduced batch size to avoid OOM errors (64 is default)
)
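
As a rough check on these settings, the arithmetic below estimates the two main consumers of VRAM: the model weights and the KV cache. It is a back-of-the-envelope sketch, not a measurement. The architecture figures (about 8.03 billion parameters, 32 layers, 8 key/value heads, head dimension 128) come from the published Llama 3.1 8B configuration and are assumptions as far as this notebook is concerned; the helper names are hypothetical.

```python
def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int) -> int:
    """Bytes of KV cache one token occupies: a key and a value per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 8B figures (not stated in this notebook)
N_PARAMS = 8.03e9
per_token = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128,
                                     bytes_per_value=2)

print(f"16-bit weights: {weight_gb(N_PARAMS, 2):.1f} GB")
print(f"4-bit weights:  {weight_gb(N_PARAMS, 0.5):.1f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for one 8192-token sequence: {per_token * 8192 / 2**30:.1f} GiB")
```

Under these assumptions the 16-bit weights come to about 16 GB, and a single sequence at the full `max_model_len` of 8192 tokens needs about 1 GiB of KV cache, which is why lowering `max_model_len` matters on small GPUs.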

## 3. Hugging Face Authentication

To download the Llama model, you'll need to authenticate with Hugging Face.

Important: You need to have requested and been granted access to Meta's Llama models on Hugging Face before proceeding. The model won't download without that access.

In [None]:
# You need a Hugging Face token with access to Meta Llama models
# Get your token at https://huggingface.co/settings/tokens
# Set this before running the notebook
import os
from huggingface_hub import login

# Option 1: Set in notebook (not recommended for sharing)
# os.environ["HF_TOKEN"] = "your_token_here"

# Option 2: Use existing token (preferred)
# HF_TOKEN is the current variable name; HUGGING_FACE_HUB_TOKEN is the legacy one
token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN")
if token:
    login(token=token)
    print("Logged in to Hugging Face Hub using environment token")
else:
    print("WARNING: No Hugging Face token found in environment")
    print("Please add your token to environment with:")
    print("export HF_TOKEN='your_token_here'")
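
On a multi-node cluster, logging in on the driver does not make the token available to workers on other machines. One option is to forward it through the `runtime_env` argument of `vLLMEngineProcessorConfig`. The sketch below only builds the environment dictionary; the helper name `hf_env_vars` is hypothetical, and the commented usage is an assumption about how it would be wired into the config from section 2.

```python
import os


def hf_env_vars(environ=os.environ) -> dict:
    """Return env vars that forward a Hugging Face token to Ray workers."""
    token = environ.get("HF_TOKEN") or environ.get("HUGGING_FACE_HUB_TOKEN")
    return {"HF_TOKEN": token} if token else {}


# Assumed usage with the config from section 2:
# config = vLLMEngineProcessorConfig(
#     model="meta-llama/Llama-3.1-8B-Instruct",
#     runtime_env={"env_vars": hf_env_vars()},
#     ...
# )
print(sorted(hf_env_vars({"HF_TOKEN": "hf_example"})))
```

Returning an empty dictionary when no token is set keeps the config valid for models that are not gated.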

## 4. Build the LLM Processor

Now we'll build the processor with pre- and post-processing functions. The preprocess function turns each dataset row into chat messages and sampling parameters for the model, and the postprocess function selects which columns appear in the output.

In [None]:
processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": row["prompt"] if "prompt" in row else row["question"]}
        ],
        sampling_params=dict(
            temperature=0.7,
            max_tokens=256,  # Limited for memory efficiency
            top_p=0.9
        )
    ),
    postprocess=lambda row: dict(
        response=row["generated_text"],
        **row  # This will return all the original columns in the dataset
    ),
)
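
The two lambdas can also be written as named functions, which makes them easy to test on a plain dictionary before any GPU work starts. This is a sketch that mirrors the cell above; the function names are hypothetical, and the `generated_text` column is taken from the postprocess lambda rather than verified against a specific Ray release.

```python
def preprocess(row: dict) -> dict:
    """Turn a dataset row into chat messages plus sampling parameters."""
    text = row["prompt"] if "prompt" in row else row["question"]
    return {
        "messages": [
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": text},
        ],
        "sampling_params": {"temperature": 0.7, "max_tokens": 256, "top_p": 0.9},
    }


def postprocess(row: dict) -> dict:
    """Keep every column and expose the generated text as `response`."""
    return {**row, "response": row["generated_text"]}


# Dry run on plain dictionaries, no model required
print(preprocess({"question": "What is Ray?"})["messages"][1])
print(postprocess({"prompt": "Hi", "generated_text": "Hello!"}))
```

These could then be passed as `build_llm_processor(config, preprocess=preprocess, postprocess=postprocess)`. Building the output with `{**row, ...}` also avoids a duplicate-keyword error if a row already has a `response` column.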

## 5. Prepare Test Data

Let's create a dataset with diverse prompts to test the model. We're using a small number of prompts to avoid memory issues during batch processing.

In [None]:
# Create a dataset with a variety of prompts
prompts = [
    "Explain the concept of batch inference with LLMs in simple terms.",
    "Write a haiku about artificial intelligence.",
    "What are the main advantages of using Ray for distributed computing?",
    "Summarize the key features of vLLM in three bullet points.",
    "How does tensor parallelism improve LLM inference?"
]

# Create a Ray dataset
ds = ray.data.from_items([{"prompt": p} for p in prompts])
print(f"Created dataset with {ds.count()} prompts")
ds.show()

## 6. Run Batch Inference

Now we'll process the dataset through the Llama model. Ray Data is lazy, so `processor(ds)` only builds the pipeline; calling `materialize()` runs inference once and keeps the results for the next section. This step may take some time and requires significant GPU resources, so the cell includes error handling in case of memory issues.

Warning: The first run downloads the model weights, which may take significant time depending on your internet connection. The 16-bit weights of the 8B model are roughly 16GB.

In [None]:
try:
    print("Starting batch inference with Llama 3.1-8B...")
    print("This may take a while depending on your hardware.")
    print("First run will download model weights (~16GB)")
    
    # Build the pipeline and execute it once; materialize() keeps the results
    # so later cells do not re-run inference
    result_ds = processor(ds).materialize()
    
    print("Batch inference completed successfully!")
    result_ds.show()
    
except Exception as e:
    print(f"Error during batch inference: {e}")
    print("\nThis could be due to insufficient GPU memory or other resource constraints.")
    print("Consider these troubleshooting steps:")
    print("1. Further reduce batch_size (try 8 or 4)")
    print("2. Lower max_num_batched_tokens to 1024")
    print("3. Reduce gpu_memory_utilization to 0.7")
    print("4. Close other applications using GPU memory")
    print("5. Check if your WSL GPU passthrough is properly configured")
    print("6. Consider trying a smaller model like TinyLlama/TinyLlama-1.1B-Chat-v1.0")
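
The troubleshooting steps printed above can be applied one at a time rather than all at once. The sketch below only generates the sequence of settings to try; `fallback_settings` is a hypothetical helper, and rebuilding the config and processor with each set of values is left to the caller. Note that `batch_size` belongs to the processor config, while the other two keys go into `engine_kwargs`.

```python
def fallback_settings(base: dict):
    """Yield progressively smaller settings, one troubleshooting step at a time."""
    current = dict(base)
    yield dict(current)
    for key, value in (
        ("batch_size", 8),
        ("batch_size", 4),
        ("max_num_batched_tokens", 1024),
        ("gpu_memory_utilization", 0.7),
    ):
        current[key] = value
        yield dict(current)


base = {"batch_size": 16, "max_num_batched_tokens": 2048, "gpu_memory_utilization": 0.85}
for attempt, settings in enumerate(fallback_settings(base), start=1):
    print(attempt, settings)
```

Changing one value per attempt makes it clear which reduction actually resolved the failure.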

## 7. Examine the Results

Let's fetch and display the results in a more readable format.

In [None]:
try:
    # Take all results and display them
    results = result_ds.take_all()
    
    for i, item in enumerate(results):
        print(f"Prompt {i+1}: {item['prompt']}")
        print(f"\nResponse:\n{item['response']}")
        print("-" * 80)
        print()
except NameError:
    print("No results to display. Batch inference may have failed.")
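
To keep the responses for later comparison, the rows can be written out as JSON Lines. This is a minimal sketch: `to_jsonl` is a hypothetical helper, and it selects only the `prompt` and `response` fields because other columns in a result row may not be JSON-serializable.

```python
import json


def to_jsonl(results, fields=("prompt", "response")) -> str:
    """Serialize selected fields of each result row as JSON Lines."""
    return "\n".join(
        json.dumps({name: row[name] for name in fields}, ensure_ascii=False)
        for row in results
    )


# Stand-in rows; in the notebook, pass the `results` list from the cell above
sample = [{"prompt": "Say hi.", "response": "Hi!", "generated_text": "Hi!"}]
print(to_jsonl(sample))
```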

## 8. Memory and Performance Analysis

Let's check GPU memory usage and other resources. Monitoring memory usage is critical when working with large models on systems with limited VRAM.

In [None]:
import shutil
import subprocess

# `!nvidia-smi` never raises a Python exception, so check for the binary explicitly
if shutil.which("nvidia-smi"):
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("nvidia-smi not available or not accessible from the notebook")

    # Alternative memory check using PyTorch. The allocated/reserved figures
    # cover only this process, not the vLLM workers started by Ray.
    import torch
    if torch.cuda.is_available():
        print(f"CUDA available: {torch.cuda.is_available()}")
        print(f"Current GPU: {torch.cuda.current_device()}")
        print(f"GPU Name: {torch.cuda.get_device_name()}")
        print(f"Total GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
        print(f"Allocated memory: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        print(f"Reserved memory: {torch.cuda.memory_reserved() / 1e9:.2f} GB")
    else:
        print("CUDA not available")
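
For logging memory over time, `nvidia-smi` can emit machine-readable values through its `--query-gpu` and `--format` flags. The sketch below parses that CSV output; the function name is hypothetical and the sample string is illustrative, not a measurement from the reference system.

```python
def parse_gpu_memory(csv_text: str) -> list:
    """Parse `memory.used, memory.total` rows (MiB) into (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


# Sample text in the shape produced by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
sample_output = "6912, 8192\n"
for used, total in parse_gpu_memory(sample_output):
    print(f"{used} MiB of {total} MiB in use ({used / total:.0%})")
```

Each output line corresponds to one GPU, so the same parser works unchanged on multi-GPU machines.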

## 9. Cleanup

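Always shut down Ray when you're done to free up resources. This is especially important when working with limited GPU memory. Because this notebook connected to an existing cluster, `ray.shutdown()` only disconnects the notebook; run `ray stop` on the host to stop the cluster itself.

In [None]:
# Disconnect this notebook from the Ray cluster
ray.shutdown()
print("Ray has been shut down.")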

## 10. Additional Recommendations for Limited Hardware

The reference hardware specification (Intel Core i7-10875H, 32GB RAM, RTX 2080 with 8GB VRAM) presents some specific challenges:

1. 8GB of VRAM is less than the roughly 16GB that the 16-bit Llama-3.1-8B weights occupy, so a quantized checkpoint or a smaller model is likely needed
2. WSL adds overhead to GPU performance and memory management
3. System processes may compete for GPU resources

If encountering persistent memory issues:

1. Tune WSL memory allocation by creating a `.wslconfig` file in the Windows user directory with:
   ```
   [wsl2]
   memory=24GB
   swap=8GB
   ```
   Restart WSL with `wsl --shutdown` for the change to take effect.
2. Consider using a smaller model first to test the workflow
3. Open the Ray Dashboard (http://localhost:8265) to monitor cluster resources
4. Pass vLLM's `cpu_offload_gb` option through `engine_kwargs` to keep part of the model weights in CPU memory (slower but helps with OOM)

Remember that batch inference is memory-intensive. The reduced configuration in this notebook minimizes KV-cache and batch memory, but on systems with 8GB VRAM the size of the model weights remains the limiting factor and needs careful management.