# Model Fine-Tuning and Batch Inference

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Generic/ray_logo.png" width="20%" loading="lazy">

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_pipeline_full.png" width="100%" loading="lazy">

Welcome to this tutorial notebook, where you'll explore how to leverage [Ray AI Runtime (AIR)](https://docs.ray.io/en/latest/ray-air/getting-started.html) to perform distributed data preprocessing, fine-tuning, hyperparameter tuning, and batch inference using the [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model applied to the [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) dataset.

[FLAN-T5](https://arxiv.org/pdf/2210.11416.pdf) is a transformer-based language model based on [Google's T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) architecture and fine-tuned on instruction data. You will be further training this model on [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html), a set of 52k instructions and demonstrations. Through Ray AIR's integration with the Hugging Face hub, these components are easily accessible, and this example can be adapted for use with other similar models.

By the end of this tutorial, you'll have a comprehensive understanding of how to harness Ray AIR to efficiently distribute complex machine learning tasks, allowing you to scale your projects easily.

## Getting started

### Set up imports and utilities

In [None]:
import random
import torch
import transformers
import warnings

import numpy as np
import pandas as pd

from IPython.display import display, HTML
from typing import Any, Dict, List, Optional

transformers.set_seed(42)
warnings.simplefilter("ignore")

### Initialize Ray runtime

In [None]:
import ray

In [None]:
ray.init()

Calling `ray.init()` starts a local Ray instance, or connects to an existing cluster if one is available. Follow the dashboard link printed in the cell output above to open the Ray Dashboard, a vital observability tool for understanding your infrastructure and application.

## Data ingest

### Load the dataset

You will be fine-tuning the model on [Alpaca](https://huggingface.co/datasets/tatsu-lab/alpaca) with the aim of further refining the question-answering and text-generation ability of the original model.

In [None]:
from datasets import load_dataset
from utils import get_random_elements

In [None]:
hf_dataset = load_dataset("tatsu-lab/alpaca", split="train").train_test_split(
    test_size=0.2, seed=57
)
hf_dataset

### Display sample data

In [None]:
df = get_random_elements(dataset=hf_dataset["train"], num_examples=3)
display(HTML(df.to_html()))

Notice that there are four feature columns in the dataset:

* `instruction` - The original prompt or query such as "How do we reduce air pollution?"
* `input` - Any additional context that wasn't provided by the instruction.
* `output` - A sample response generated by [OpenAI's](https://platform.openai.com/docs/models/gpt-3-5) `text-davinci-003`.
* `text` - The instruction, input, output, along with an [instructional prefix](https://github.com/tatsu-lab/stanford_alpaca#data-release).
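
To make the relationship between the four columns concrete, here is a minimal sketch of how a `text`-style string could be assembled from the other three. The helper `build_text` is hypothetical, and the template wording is an assumption modeled on the Stanford Alpaca data release linked above, so treat it as illustrative rather than as the exact string stored in the dataset.

```python
# Hypothetical templates approximating the Alpaca instructional prefix (assumed wording).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_text(row: dict) -> str:
    """Assemble a `text`-style string from one row's instruction, input, and output."""
    # Rows without extra context use the shorter template.
    template = PROMPT_WITH_INPUT if row["input"].strip() else PROMPT_NO_INPUT
    prompt = template.format(instruction=row["instruction"], input=row["input"])
    return prompt + row["output"]


# Illustrative row; the output string is invented for this example.
example = {
    "instruction": "How do we reduce air pollution?",
    "input": "",
    "output": "Use public transport and cleaner energy sources.",
}
print(build_text(example))
```

Rows with an empty `input` skip the "### Input:" section entirely, which is why the sketch keeps two templates.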

### Convert to Ray Dataset

In [None]:
ray_dataset = ray.data.from_huggingface(hf_dataset)
ray_dataset

[Ray Datasets](https://docs.ray.io/en/master/data/dataset.html#datasets) are the standard method for loading and exchanging data in Ray AIR libraries. They are specifically designed for easy distributed batch preprocessing, and you can easily convert from a Hugging Face dataset to Ray by using [`ray.data.from_huggingface()`](https://docs.ray.io/en/master/data/api/doc/ray.data.from_huggingface.html#ray.data.from_huggingface).

### Set up train and validation Ray datasets

In [None]:
SMALL_DATA = True

if SMALL_DATA:
    train_dataset = ray_dataset["train"].limit(100)
    validation_dataset = ray_dataset["test"].limit(100)
else:
    train_dataset = ray_dataset["train"]
    validation_dataset = ray_dataset["test"]

Note the `SMALL_DATA` flag which, when `True`, limits the number of samples used for downstream steps. This is to reduce training time for demonstration purposes. However, if you have more time, it is advised to set this flag to `False` to utilize the full dataset.

## Distributed preprocessing

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_pipeline_data.png" width="100%" loading="lazy">

### Implement preprocessing function

In [None]:
from ray.data.preprocessors import BatchMapper
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [None]:
def preprocess_function(batch: Dict[str, Any]) -> Dict[str, Any]:
    """
    Tokenizes the instruction and input pairs in a batch using the T5 tokenizer
    from the google/flan-t5-base model, and returns a dictionary containing the
    encoded inputs and labels.

    Args:
        batch: A batch with at least two columns, "instruction" and "input",
        each holding strings (a pandas DataFrame when batch_format="pandas").

    Returns:
        A dictionary containing the encoded inputs and labels, as returned by
        the T5 tokenizer.
    """
    model_name = "google/flan-t5-base"
    tokenizer = T5Tokenizer.from_pretrained(model_name)

    encoded_inputs = tokenizer(
        list(batch["instruction"]),
        list(batch["input"]),
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )

    # Reuse the encoded input ids as the training labels.
    encoded_inputs["labels"] = encoded_inputs["input_ids"].copy()

    return dict(encoded_inputs)

In [None]:
batch_preprocessor = BatchMapper(preprocess_function, batch_format="pandas", batch_size=4096)

You need to define a preprocessing function to convert a batch of data from Alpaca to a format that the FLAN-T5 model can accept. [Ray AIR's `BatchMapper`](https://docs.ray.io/en/latest/ray-air/api/doc/ray.data.preprocessors.BatchMapper.html#ray-data-preprocessors-batchmapper) will then map this function onto each incoming batch during the fine-tuning step.

Unpacking this function a bit, the most important component is the [tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer), which is a Hugging Face component associated with the FLAN-T5 model that turns natural language into formatted tokens with the right padding and truncation necessary for training.
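
As a rough illustration of what `padding="max_length"` and `truncation=True` do to each sequence, the sketch below forces a list of token ids to a fixed length and builds the matching attention mask. The helper `pad_and_truncate` and the token ids are hypothetical; a real tokenizer also handles special tokens and sentence pairs, which this sketch ignores.

```python
from typing import Dict, List

PAD_ID = 0  # T5 tokenizers use id 0 for the padding token


def pad_and_truncate(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Force a token sequence to exactly max_length and build its attention mask."""
    kept = token_ids[:max_length]        # truncation: drop anything past max_length
    n_pad = max_length - len(kept)       # padding: fill the remainder with PAD_ID
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


# Made-up token ids, one sequence shorter and one longer than max_length.
short = pad_and_truncate([37, 1782, 19, 1], max_length=6)
long = pad_and_truncate(list(range(1, 10)), max_length=6)
print(short)
print(long)
```

Because every sequence ends up the same length, a batch can be stacked into a single rectangular array, and the attention mask tells the model which positions are padding.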

## Distributed fine-tuning

Now that you have the dataset prepared and a batch preprocessor initialized, it is time to configure [Ray AIR's `HuggingFaceTrainer`](https://docs.ray.io/en/master/train/api/doc/ray.train.huggingface.HuggingFaceTrainer.html#ray.train.huggingface.HuggingFaceTrainer) to distribute FLAN-T5 fine-tuning on Alpaca.

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_pipeline_finetune.png" width="100%" loading="lazy">

### Ray AIR Distributed Fine-Tuning Flow

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_train.png" width="100%" loading="lazy">|
|:--|
|Each worker node houses a preprocessor copy to process partitioned batches of the Ray Dataset, and then individual model copies train on these batches. PyTorch DDP synchronizes their weights, resulting in an integrated, fine-tuned model.|

### Initialize training logic for each worker

In [None]:
from transformers import TrainingArguments, Trainer

In [None]:
batch_size = 2
use_gpu = True

Before getting started, set the batch size (use a small number here since training requires a large amount of compute) and specify use of GPUs.

In [None]:
def trainer_init_per_worker(
    train_dataset: ray.data.Dataset,
    eval_dataset: Optional[ray.data.Dataset] = None,
    **config,
) -> Trainer:
    """
    Initializes a Hugging Face Trainer for training a T5 text generation model.

    Args:
        train_dataset (ray.data.Dataset): The dataset for training the model.
        eval_dataset (ray.data.Dataset, optional): The dataset for evaluating
        the model.
            Defaults to None.
        config: Additional arguments to configure the Trainer.

    Returns:
        Trainer: A Hugging Face Trainer for training the T5 model.
    """
    device = torch.device("cuda" if use_gpu and torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    model_name = "google/flan-t5-base"

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    training_args = TrainingArguments(
        "flan-t5-base-finetuned-alpaca",
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        learning_rate=config.get("learning_rate", 2e-5),
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=config.get("epochs", 4),
        weight_decay=config.get("weight_decay", 0.01),
        push_to_hub=False,
        disable_tqdm=True,
    )

    hf_trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )

    print("Starting training...")
    return hf_trainer

The `trainer_init_per_worker` function creates a Hugging Face Transformers Trainer that will be distributed by Ray using Distributed Data Parallelism (using PyTorch Distributed backend internally). This means that each worker will have its own copy of the model, but operate on different data. At the end of each step, all the workers will sync gradients.
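
The gradient synchronization step can be illustrated with a toy example in plain Python. This is a conceptual sketch only, not how PyTorch DDP is implemented: it uses a hypothetical one-parameter model and invented data to show that averaging per-worker gradients over equal-sized shards matches the gradient over the full batch.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient of mean squared error for the one-parameter model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
shards = [data[:2], data[2:]]   # each worker sees a different, equal-sized shard
w = 0.5                         # every worker starts from the same weights

# Each worker computes a gradient on its own shard.
local_grads = [grad_mse(w, shard) for shard in shards]

# Averaging across workers stands in for the all-reduce step.
synced_grad = sum(local_grads) / len(local_grads)

# With equal shard sizes this equals the gradient over all the data at once.
full_grad = grad_mse(w, data)

# Every worker applies the same update, so the model copies stay identical.
w_next = w - 0.01 * synced_grad
print(local_grads, synced_grad, full_grad, w_next)
```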

Note: The Hugging Face hub offers different versions of [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) with increasing size. Here, the model and associated tokenizer are [`google/flan-t5-base`](https://huggingface.co/google/flan-t5-base), one of the smaller variants, in order to expedite fine-tuning for demonstration purposes. You can try this notebook with larger models, and you might find [this related tutorial](https://docs.ray.io/en/master/ray-air/examples/gptj_deepspeed_fine_tuning.html#train) helpful if the model does not fit on a single GPU.

### Define Trainer

In [None]:
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig
from ray.train.huggingface import HuggingFaceTrainer

In [None]:
num_workers = 2

Since you have access to two GPUs, set the number of workers to match in order to utilize the full cluster for fine-tuning.
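
For a sense of scale, the arithmetic below works out the effective batch size and steps per epoch for the settings used so far. It assumes the training rows are split evenly across workers and that each worker takes one optimizer step per local batch (no gradient accumulation); the variable names are illustrative.

```python
import math

per_device_batch_size = 2   # `batch_size` set earlier
num_workers = 2             # one worker per GPU
num_train_rows = 100        # rows kept when SMALL_DATA is True

# Each step processes one local batch on every worker.
global_batch_size = per_device_batch_size * num_workers

# Assumption: the dataset is sharded evenly across workers.
rows_per_worker = num_train_rows // num_workers
steps_per_epoch = math.ceil(rows_per_worker / per_device_batch_size)

print(global_batch_size, rows_per_worker, steps_per_epoch)
```

Under these assumptions each epoch takes 25 steps with 4 examples processed per step across the cluster.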

In [None]:
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": train_dataset,
        "evaluation": validation_dataset,
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
    preprocessor=batch_preprocessor,
)

[Ray AIR's HuggingFaceTrainer](https://docs.ray.io/en/latest/train/api/doc/ray.train.huggingface.HuggingFaceTrainer.html?highlight=ray%20air%20hugging%20face%20trainer) integrates with the Hugging Face Transformers library to scale training and fine-tuning across multiple workers, each with its own copy of the Hugging Face `transformers.Trainer` set up in the previous step.

Here, you specify the following:
* `trainer_init_per_worker` - Training logic copied onto each worker node.
* `scaling_config` - Specify how to scale and the hardware to run on.
* `datasets` - Which datasets to run training and evaluation on.
* `run_config` - Specify checkpointing behavior (how many checkpoints to keep and which metric to use when ranking them).
* `preprocessor` - The same [Ray AIR preprocessor](https://docs.ray.io/en/latest/ray-air/preprocessors.html) defined above used to transform raw data into tokenized batches.
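
The checkpointing settings can be read as a "keep the best N" policy. The sketch below mimics that idea in plain Python with a hypothetical `keep_best` helper and invented loss values; it is a conceptual illustration and not Ray's internal implementation.

```python
from typing import Dict, List


def keep_best(
    checkpoints: List[Dict], num_to_keep: int, attribute: str, order: str
) -> List[Dict]:
    """Return the checkpoints that survive a keep-the-best-N retention policy."""
    # order="min" ranks lower scores first; order="max" ranks higher scores first.
    ranked = sorted(checkpoints, key=lambda c: c[attribute], reverse=(order == "max"))
    return ranked[:num_to_keep]


# Hypothetical per-epoch results; the loss values are invented for illustration.
history = [
    {"epoch": 1, "eval_loss": 1.90},
    {"epoch": 2, "eval_loss": 1.42},
    {"epoch": 3, "eval_loss": 1.55},
    {"epoch": 4, "eval_loss": 1.61},
]
kept = keep_best(history, num_to_keep=1, attribute="eval_loss", order="min")
print(kept)
```

With `num_to_keep=1` and `order="min"`, only the checkpoint with the lowest `eval_loss` is retained, which keeps disk usage low at the cost of discarding intermediate checkpoints.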

### Run fine-tuning

In [None]:
result = trainer.fit()

### Try the fine-tuned model

Now that you have a fine-tuned model stored in a Checkpoint, you can retrieve it and test out your own instructions. In a later section, you will implement inference at scale.

In [None]:
model_name = "google/flan-t5-base"

In [None]:
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

In [None]:
checkpoint = result.checkpoint
finetuned_model = checkpoint.get_model(model)

Note: You are fetching the fine-tuned FLAN-T5 from the saved [checkpoint object](https://docs.ray.io/en/latest/ray-air/api/doc/ray.air.checkpoint.Checkpoint.html#ray.air.checkpoint.Checkpoint), which requires passing in what kind of model you expect to receive.

In [None]:
instruction = "How many bees do I have?"  # Enter your own instruction here.
input_query = (
    "I don't have enough bees."  # Write additional context for the model here.
)

inputs = tokenizer(instruction, input_query, return_tensors="pt")
outputs = finetuned_model.generate(**inputs)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

## [Optional] Distributed hyperparameter tuning

If you would like to tune hyperparameters in pursuit of a better performing model, you can pass the previous `HuggingFaceTrainer` into a [Ray AIR `Tuner`](https://docs.ray.io/en/latest/ray-air/tuner.html) and define the parameter search space to conduct experiments.

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_pipeline_tune.png" width="100%" loading="lazy">

### Ray AIR Distributed Hyperparameter Tuning Flow

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_tune.png" width="100%" loading="lazy">|
|:--|
|To achieve the best configuration for the fine-tuned model, define a Tuner object with a customized search space and behavioral settings for scheduling, scaling, and checkpointing. Running multiple trial experiments using this approach can help converge on the optimal configuration.|

In [None]:
from ray import tune
from ray.tune import Tuner
from ray.tune.schedulers.async_hyperband import ASHAScheduler

In [None]:
total_num_trials = 4
max_tune_epochs = 16

In [None]:
num_workers = 1
use_gpu = True

Set the number of workers to 1 for each `Trainer` so that several trials can run in parallel, each on its own GPU, rather than sequentially with every trial occupying all of the cluster's resources.
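
The trade-off can be estimated with a quick calculation. The sketch below uses a hypothetical `tuning_waves` helper that counts only GPUs; it ignores CPU requirements and any early stopping by the scheduler, so the numbers are an upper-level approximation.

```python
import math
from typing import Tuple


def tuning_waves(total_gpus: int, gpus_per_trial: int, num_trials: int) -> Tuple[int, int]:
    """Return (trials that fit at once, number of sequential waves needed)."""
    concurrent = total_gpus // gpus_per_trial
    return concurrent, math.ceil(num_trials / concurrent)


# Two workers per trial: each trial takes both GPUs, so trials run one at a time.
print(tuning_waves(total_gpus=2, gpus_per_trial=2, num_trials=4))

# One worker per trial: two trials run side by side.
print(tuning_waves(total_gpus=2, gpus_per_trial=1, num_trials=4))
```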

In [None]:
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": train_dataset,
        "evaluation": validation_dataset,
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
    preprocessor=batch_preprocessor,
)

This is the same `HuggingFaceTrainer` created previously, just with a different number of workers for fine-tuning.

In [None]:
tuner = Tuner(
    trainer,
    param_space={
        "trainer_init_config": {
            "learning_rate": tune.choice([2e-5, 2e-4, 2e-3, 2e-2]),
            "epochs": tune.choice([2, 4, 8, max_tune_epochs]),
            "weight_decay": tune.choice([0.01, 0.1, 1.0, 10.0]),
        }
    },
    tune_config=tune.TuneConfig(
        metric="eval_loss",
        mode="min",
        num_samples=total_num_trials,
        scheduler=ASHAScheduler(
            max_t=max_tune_epochs,
        ),
    ),
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        )
    ),
)

There are four major components passed into the Tuner:
1. `trainer` - The `HuggingFaceTrainer` with scaling, preprocessing, and fine-tuning logic from earlier.
2. `param_space` - The [possibilities of hyperparameters](https://docs.ray.io/en/latest/ray-air/tuner.html#how-to-configure-a-search-space) to tune and search for any given trial.
3. `tune_config` - Specify how to compare different experiments, the number of trials, as well as any advanced [search algorithms](https://docs.ray.io/en/latest/tune/key-concepts.html#search-alg-ref) and [schedulers](https://docs.ray.io/en/latest/tune/key-concepts.html#schedulers-ref) like [ASHA](https://openreview.net/forum?id=S1Y7OOlRZ).
4. `run_config` - Used to specify checkpointing behavior, custom callbacks, failure/retry configurations, [and more.](https://docs.ray.io/en/latest/ray-air/api/doc/ray.air.RunConfig.html#ray.air.RunConfig)
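
To build intuition for the search space and the scheduler, here is a simplified plain-Python sketch. It samples one value per hyperparameter uniformly at random, then drops the worse half of the trials after a first evaluation. This is an assumption-laden simplification: ASHA itself is asynchronous and uses a configurable reduction factor, and the helper names and loss values below are invented for illustration.

```python
import random

search_space = {
    "learning_rate": [2e-5, 2e-4, 2e-3, 2e-2],
    "epochs": [2, 4, 8, 16],
    "weight_decay": [0.01, 0.1, 1.0, 10.0],
}


def sample_config(rng: random.Random) -> dict:
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(options) for name, options in search_space.items()}


rng = random.Random(0)
trials = [sample_config(rng) for _ in range(4)]   # four trials, like num_samples=4


def successive_halving(losses_by_trial: dict, keep_fraction: float = 0.5) -> list:
    """Keep the best-scoring fraction of trials (lower loss is better)."""
    ranked = sorted(losses_by_trial, key=losses_by_trial.get)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Invented eval_loss values after the first epoch, keyed by trial index.
first_rung = {0: 1.8, 1: 2.6, 2: 1.5, 3: 3.1}
survivors = successive_halving(first_rung)
print(trials)
print(survivors)
```

Stopping weak trials early frees their GPUs for the remaining candidates, which is the main reason to pair a scheduler with a fixed trial budget.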

In [None]:
result_grid = tuner.fit()

## Distributed batch inference

Once you have a fine-tuned model, you can apply it to batches of inputs to generate predictions at scale, which is exactly what [Ray AIR's `BatchPredictor`](https://docs.ray.io/en/latest/ray-air/predictors.html#batch-prediction) is designed to facilitate.

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_pipeline_inference.png" width="100%" loading="lazy">

### Ray AIR Distributed Batch Inference Flow

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/NLP_workloads/Text_generation/nlp_batchpredict.png" width="100%" loading="lazy">|
|:--|
|Using the best fine-tuned model stored in a Checkpoint object, apply BatchPredictor to new batches of data to generate predictions.|

In [None]:
from ray.train.predictor import Predictor
from ray.train.batch_predictor import BatchPredictor
from transformers import AutoTokenizer

In [None]:
class HuggingFaceModelPredictor(Predictor):
    """
    A Ray Predictor for Hugging Face models that generates text given input data.

    Args:
        model (transformers.PreTrainedModel): A trained Hugging Face model.
        tokenizer (Optional[transformers.PreTrainedTokenizerBase]): A tokenizer
        that can tokenize input text.
        preprocessor (Optional[Callable]): A function that takes raw input data
        and returns tokenized input data.
        use_gpu (bool): Whether to use a GPU or CPU for prediction.
    """

    def __init__(
        self,
        model: Any,
        tokenizer: Optional[Any] = None,
        preprocessor: Optional[Any] = None,
        use_gpu: bool = False,
    ) -> None:
        super().__init__(preprocessor)
        self.model = model
        self.use_gpu = use_gpu
        self.tokenizer = tokenizer

    @classmethod
    def from_checkpoint(
        cls,
        checkpoint: Any,
        model_cls: Any,
        *,
        tokenizer: Optional[Any] = None,
        use_gpu: bool = False,
        **get_model_kwargs: Any,
    ) -> "HuggingFaceModelPredictor":
        """
        Create a HuggingFaceModelPredictor from a checkpoint.

        Args:
            checkpoint (Any): A checkpoint containing a trained Hugging Face model.
            model_cls (Any): The type of Hugging Face model to load from the checkpoint.
            tokenizer (Optional[Any]): A tokenizer that can tokenize input text.
            use_gpu (bool): Whether to use a GPU or CPU for prediction.
            **get_model_kwargs (Any): Additional keyword arguments for loading
            the Hugging Face model.

        Returns:
            HuggingFaceModelPredictor: A Ray Predictor for the Hugging Face model.
        """
        if not tokenizer:
            tokenizer = AutoTokenizer
        if isinstance(tokenizer, type):
            tokenizer = checkpoint.get_tokenizer(tokenizer)
        return cls(
            checkpoint.get_model(model_cls, **get_model_kwargs),
            tokenizer=tokenizer,
            preprocessor=checkpoint.get_preprocessor(),
            use_gpu=use_gpu,
        )

    def _predict_numpy(
        self,
        data: Dict[str, Any],
        feature_columns: Optional[List[str]] = None,
        **generate_kwargs: Any,
    ) -> pd.DataFrame:
        """
        Generates text given input data.

        Args:
            data (Dict[str, Any]): A dictionary of input data.
            feature_columns (Optional[List[str]]): A list of feature column names
            to use for prediction.
            **generate_kwargs (Any): Additional keyword arguments for generating text.

        Returns:
            pd.DataFrame: A Pandas DataFrame with a single column "generated_output"
            containing the generated text.
        """
        # The text arrives already tokenized because the tokenizer runs as an AIR preprocessor.
        if feature_columns:
            data = {k: v for k, v in data.items() if k in feature_columns}

        data = {
            k: torch.from_numpy(v).to(device=self.model.device) for k, v in data.items()
        }
        generate_kwargs = {**data, **generate_kwargs}

        outputs = self.model.generate(**generate_kwargs)
        return pd.DataFrame(
            self.tokenizer.batch_decode(outputs, skip_special_tokens=True),
            columns=["generated_output"],
        )

Establish a custom class for prediction, `HuggingFaceModelPredictor`, which extends the base Ray AIR [`Predictor`](https://docs.ray.io/en/latest/ray-air/api/doc/ray.train.predictor.Predictor.html?highlight=ray%20air%20predictor) to generate text responses to input instructions:

* The predictor takes a trained Hugging Face model, a tokenizer, and a preprocessor (which can be a function that takes raw input data and returns tokenized input data). 

* `from_checkpoint` creates a `HuggingFaceModelPredictor` from a checkpoint containing a trained Hugging Face model. 

* `_predict_numpy` generates text given input data in the form of a dictionary, and returns a Pandas DataFrame with a single column "generated_output" containing the generated text. 
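
The dictionary handling inside `_predict_numpy` can be examined in isolation. The sketch below reproduces only the column filtering and keyword merging, using a hypothetical `build_generate_kwargs` helper and plain lists in place of tensors; no model is involved.

```python
from typing import Any, Dict, List, Optional


def build_generate_kwargs(
    data: Dict[str, Any],
    feature_columns: Optional[List[str]] = None,
    **generate_kwargs: Any,
) -> Dict[str, Any]:
    """Mirror the dictionary handling in `_predict_numpy`, minus tensors and the model."""
    if feature_columns:
        # Keep only the columns the caller asked for.
        data = {k: v for k, v in data.items() if k in feature_columns}
    # Later keys win on a name clash, so explicit generate kwargs override batch columns.
    return {**data, **generate_kwargs}


# Made-up tokenized batch with one row.
batch = {"input_ids": [[37, 1]], "attention_mask": [[1, 1]], "labels": [[37, 1]]}
kwargs = build_generate_kwargs(
    batch, feature_columns=["input_ids", "attention_mask"], max_new_tokens=128
)
print(sorted(kwargs))
```

Passing `feature_columns` is one way to keep columns such as `labels` out of the generation call; when it is omitted, every column in the batch is forwarded.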

In [None]:
predictor = BatchPredictor.from_checkpoint(
    checkpoint=result.checkpoint,
    predictor_cls=HuggingFaceModelPredictor,
    model_cls=T5ForConditionalGeneration,
    tokenizer=T5Tokenizer,
    use_gpu=use_gpu,
    device_map="auto",
    torch_dtype=torch.float16,
)

Create a Ray AIR `BatchPredictor` from a [Checkpoint](https://docs.ray.io/en/latest/ray-air/api/checkpoint.html?highlight=checkpoint) and specify the custom predictor, model class, tokenizer, as well as any additional arguments.

### Run batch inference

In [None]:
prediction = predictor.predict(
    validation_dataset,
    num_gpus_per_worker=int(use_gpu),
    batch_size=256,
    max_new_tokens=128,
)
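
As a rough picture of how `batch_size` affects the work, the sketch below chunks a list of rows into fixed-size batches with a hypothetical `iter_batches` helper. Ray's actual batching also depends on how the dataset is divided into blocks, so the counts here are an approximation under the assumption of one contiguous block.

```python
from typing import Iterator, List


def iter_batches(rows: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [{"id": i} for i in range(100)]   # stand-in for the 100 validation rows

sizes_256 = [len(b) for b in iter_batches(rows, 256)]
sizes_32 = [len(b) for b in iter_batches(rows, 32)]
print(sizes_256)
print(sizes_32)
```

With only 100 rows, a batch size of 256 yields a single batch; a smaller batch size trades throughput for lower peak GPU memory.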

### Inspect predictions

In [None]:
# Display inputs and generated outputs side by side.
input_data_pd = validation_dataset.to_pandas()
prediction_pd = prediction.to_pandas()

input_data_pd.join(prediction_pd, how='inner')

# Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [**Ray documentation**](https://docs.ray.io/en/latest)

* [**Official Ray site**](https://www.ray.io/)  
Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.

* [**Join the community on Slack**](https://forms.gle/9TSdDYUgxYs8SA9e8)  
Find friends to discuss your new learnings in our Slack space.

* [**Use the discussion board**](https://discuss.ray.io/)  
Ask questions, follow topics, and view announcements on this community forum.

* [**Join a meetup group**](https://www.meetup.com/Bay-Area-Ray-Meetup/)  
Tune in on meet-ups to listen to compelling talks, get to know other users, and meet the team behind Ray.

* [**Open an issue**](https://github.com/ray-project/ray/issues/new/choose)  
Ray is constantly evolving to improve developer experience. Submit feature requests, bug-reports, and get help via GitHub issues.

* [**Become a Ray contributor**](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html)  
We welcome community contributions to improve our documentation and Ray framework.
