# Scaling model training

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Generic/ray_logo.png" width="20%" loading="lazy">

## About this notebook

### Is this module right for you?

This module guides you through distributed model training with Ray. Through fine-tuning a transformer for a computer vision task, ML practitioners will learn how to scale training workloads using deep learning models on large datasets.

### Prerequisites

For this notebook, you should satisfy the following minimum requirements:
-   Practical Python knowledge.
-   Familiarity with training deep learning models.
-   Experience with Ray equivalent to completing the following training modules:
    -   [Overview of Ray](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb)
    -   [Introduction to Ray AIR](https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Introduction_to_Ray_AIR.ipynb)
    
### Learning objectives

-   Understand the challenges associated with distributing model training across multiple GPUs.
-   Implement the data parallelism design pattern using Ray Datasets.
-   Fine-tune a transformer model on an image dataset using Ray Train.
-   Evaluate the trained model by performing inference on the test set.

### What will you do?

-   Distributed model training overview
    -   Learn about why training large machine learning models requires a distributed solution.
    -   Refresh your knowledge of the data parallelism design pattern.
-   Example: Fine-tuning a model for image segmentation.
    -   Background
        -   Data - MIT ADE20K benchmark dataset of scene images.
        -   Model - SegFormer transformer for semantic segmentation.
    -   Getting started
        -   Start Ray cluster and set up environment.
    -   Data ingest
        -   Batch and transform raw data into training inputs using Ray Data.
    -   Distributed training
        -   Fine-tune transformer model on benchmark dataset using Ray Train.
    -   Evaluation
        -   Assess performance on an evaluation set by computing mean IoU during training.
-   Conclusion
    -   Summarize the distributed training approach as well as the Ray components at each stage of the pipeline.

## Distributed model training

As the development of machine learning models advances, their [size continues to balloon](https://epochai.org/blog/machine-learning-model-sizes-and-the-parameter-gap). Training these large neural networks can take a prohibitively long time and requires an increasingly [massive amount of compute](https://www.hyro.ai/glossary/gpt-3#:~:text=To%20be%20exact%2C%20GPT%2D3,amount%20of%20time%20is%20unimaginable.).

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Overview_of_Ray/ai_compute_annotated.png" width="70%" loading="lazy">|
|:--|
|OpenAI's blog post ["AI and compute"](https://openai.com/research/ai-and-compute) reports that the amount of compute used to train the largest models has doubled roughly every 3.4 months since 2012, with no signs of this trend slowing down. The original chart is annotated here with trend lines overlaid.|

Distributing this workload means orchestrating multiple machines so that they produce one synchronized result, which is challenging in its own right. The problem compounds when working with heterogeneous resources, multiple tuning experiments, or a model that can't fit on a single GPU. To address these issues, machine learning practitioners have developed a variety of techniques to parallelize training across nodes, one of which is data parallelism.

### Data parallelism

<div class="alert alert-info">
  <strong>Data parallelism:</strong> a design pattern that trains replicas of the model on different subsets of a large dataset, periodically synchronizing weights to produce a fully trained result. This method requires that a model's parameters, or weights, fit in a single GPU's memory.
</div>

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Scaling_model_training/data_parallelism.png" width="70%" loading="lazy">|
|:--|
|A large dataset is sharded across multiple worker nodes each containing a model copy. Gradients calculated on independent nodes are continuously synchronized with others to produce a final trained model.|
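
The pattern in the figure can be sketched without any GPUs. The snippet below is a minimal, pure-Python illustration and not how Ray Train or PyTorch implement it: the helper names (`gradient`, `shard_dataset`, `data_parallel_step`) and the one-parameter model `y = w * x` are assumptions made for the example. Each "worker" computes a gradient on its own shard, the gradients are averaged (the synchronization step), and every replica applies the same update.

```python
# Minimal data-parallel sketch in pure Python (no Ray, no GPUs).
# Model: y = w * x, loss: mean squared error. All names are illustrative.

def gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def shard_dataset(data, num_workers):
    """Split data into num_workers equally sized, contiguous shards.

    Assumes len(data) is divisible by num_workers; leftovers would be dropped.
    """
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

def data_parallel_step(w, shards, lr):
    """One synchronized step: local gradients, then an all-reduce style average."""
    local_grads = [gradient(w, shard) for shard in shards]  # one gradient per worker
    avg_grad = sum(local_grads) / len(local_grads)          # synchronize across workers
    return w - lr * avg_grad                                # same update on every replica

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples of y = 3x
shards = shard_dataset(data, num_workers=4)

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, lr=0.01)

print(f"learned weight: {w:.4f}")
```

With equally sized shards, the averaged gradient equals the gradient computed over the full dataset, which is why the replicas stay in lockstep with what a single machine would have computed.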

Ray Train provides distributed data parallel training capabilities. Its integration with Ray AIR also allows for convenient parallelization of data ingestion and pre-processing, hyperparameter tuning, batch inference, and serving. This provides a unified compute layer for the machine learning pipeline, eliminating the need to stitch together independent scaling solutions at each stage. In the next section, you will implement this design pattern using a transformer model and scene images to accomplish a computer vision task.

Note: There are other techniques for distributed training, such as model parallelism, which divides the model itself across multiple GPUs. However, this module focuses on implementing data parallelism.

## Background on semantic segmentation

<div class="alert alert-info">
  <strong>Semantic segmentation:</strong> a computer vision task that assigns labels to object regions in a scene, pixel-by-pixel. Similar to object detection, this approach involves dividing an image into multiple semantic categories such as couch, person, car, or sky.
</div>
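
To make "pixel-by-pixel" concrete, the sketch below represents a segmentation result as a small grid of integer class ids and reports how much of the image each category covers. The 4x6 mask, the three-entry `ID2LABEL` mapping, and the `pixel_share` helper are all hypothetical and are not taken from ADE20K.

```python
from collections import Counter

# Illustrative ids only; this is not the ADE20K label mapping.
ID2LABEL = {0: "sky", 1: "person", 2: "car"}

# Hypothetical 4x6 label mask: one integer class id per pixel.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [2, 2, 1, 1, 0, 0],
]

def pixel_share(mask, id2label):
    """Return the fraction of pixels assigned to each category."""
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    return {id2label[i]: n / total for i, n in counts.items()}

shares = pixel_share(mask, ID2LABEL)
for name, share in sorted(shares.items()):
    print(f"{name}: {share:.0%}")
```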

In this hands-on example, you will implement the data parallelism design pattern by fine-tuning a pretrained transformer model on scene image data.

### Data

#### MIT ADE20K - scene parsing benchmark

The [MIT ADE20K Dataset](http://sceneparsing.csail.mit.edu/) (also known as "SceneParse150") provides the largest open source dataset for scene parsing. It is often used as a standard for assessing semantic segmentation model performance due to its high-quality annotations.

You will fine-tune on samples from the training set. Because the test set is unlabeled, the evaluation metrics in this notebook are also computed on samples drawn from the training set.

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Scaling_inference/scene.png" width="70%" loading="lazy">|
|:--|
|Test image on the left vs. predicted result. ([Source](https://github.com/CSAILVision/semantic-segmentation-pytorch))|

Dataset highlights

-   20k annotated, scene-centric training images
-   3.3k unlabeled test images
-   150 [semantic categories](https://docs.google.com/spreadsheets/d/1se8YEtb2detS7OuPE86fXGyD269pMycAWe2mtKUj2W8/edit?usp=sharing) (such as person, car, bed, sky, etc.)

### Model

#### SegFormer - transformer-based framework for semantic segmentation

[SegFormer](https://arxiv.org/pdf/2105.15203.pdf) is an effective semantic segmentation method based on a transformer architecture. [Transformers](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)) are a type of deep learning architecture that process sequential data via a series of self-attention layers and then transform them via a feedforward neural network.
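
As a rough illustration of what a self-attention layer computes, here is a single-head sketch in pure Python. It is a simplification under stated assumptions: queries, keys, and values are the raw token vectors (no learned projections), and the `self_attention` and `softmax` helpers are names invented for this example. Real layers, including SegFormer's, add learned projections and other refinements.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections (Q = K = V = tokens)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens)
```

Because the attention weights for each token sum to one, every output vector is a convex combination of the inputs, so each token's new representation mixes in information from the whole sequence.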

What sets SegFormer apart from previous transformer-based approaches are two key features:

1.  A hierarchically structured transformer encoder that does not depend on positional encoding, which avoids interpolating positional codes when training and testing resolutions differ.
2.  A lightweight MLP layer that avoids complex decoders.

|<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Scaling_inference/segformer_architecture.png" width="70%" loading="lazy">|
|:--|
|SegFormer architecture illustrated in the [original paper](https://arxiv.org/pdf/2105.15203.pdf).|

You will fine-tune a general, pretrained SegFormer model on [MIT ADE20K](http://sceneparsing.csail.mit.edu/) image data.

## Getting started

### Set up necessary imports and utilities

In [None]:
import torch
import numpy as np
import pandas as pd
from PIL import Image
from PIL.JpegImagePlugin import JpegImageFile

# Set the seed to a fixed value for reproducibility.
torch.manual_seed(201)

### Initialize Ray runtime

In [None]:
import ray

In [None]:
ray.init()

### Load the model components from the Hugging Face Hub

From the [Hugging Face Hub](https://huggingface.co/docs/hub/index), retrieve the pretrained SegFormer model by specifying the model name and [label files](https://huggingface.co/datasets/huggingface/label-files/blob/main/ade20k-id2label.json) which map indices to semantic categories.

#### Load label mappings

In [None]:
from utils import get_labels

In [None]:
id2label, label2id = get_labels()
num_labels = len(id2label)

print(f"Total number of labels: {len(id2label)}")
print(f"Example labels: {list(id2label.values())[:5]}")

The utility function `get_labels` fetches two dictionary mappings from [Hugging Face](https://huggingface.co/datasets/huggingface/label-files/blob/main/ade20k-id2label.json), `id2label` and `label2id`, which are used to convert between numerical and string labels for the 150 available [semantic categories](https://docs.google.com/spreadsheets/d/1se8YEtb2detS7OuPE86fXGyD269pMycAWe2mtKUj2W8/edit#gid=0) of objects.
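
The source of `get_labels` is not shown here, so the following is only a sketch of the kind of conversion such a helper might perform. It assumes the label file is JSON with string keys (as JSON requires) and uses a three-entry excerpt for illustration; the `sample_` names are hypothetical and chosen so they do not overwrite the notebook's own `id2label` and `label2id`.

```python
import json

# Illustrative excerpt in the shape of a label file: JSON object keys are strings.
raw = json.loads('{"0": "wall", "1": "building", "2": "sky"}')

# Convert keys to integers so model output indices can be looked up directly.
sample_id2label = {int(idx): name for idx, name in raw.items()}

# Invert the mapping to go from category name back to index.
sample_label2id = {name: idx for idx, name in sample_id2label.items()}

print(sample_id2label[2], sample_label2id["wall"])
```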

#### Load SegFormer

In [None]:
from transformers import SegformerForSemanticSegmentation

In [None]:
# "nvidia/mit-b0"                              https://huggingface.co/nvidia/mit-b0
# "nvidia/segformer-b0-finetuned-ade-512-512"  https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512

MODEL_NAME = "nvidia/mit-b0"

segformer = SegformerForSemanticSegmentation.from_pretrained(
    MODEL_NAME, id2label=id2label, label2id=label2id
)

print(f"Number of model parameters: {segformer.num_parameters()/(10**6):.2f} M")

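The [Hugging Face Hub](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) makes available many variations on SegFormer. Here, you load `nvidia/mit-b0`, which contains only the pretrained encoder, and fine-tune it on MIT ADE20K (SceneParse150) yourself. The commented-out alternative, `nvidia/segformer-b0-finetuned-ade-512-512`, is already fine-tuned on that dataset at a 512 x 512 resolution.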

Note: This "b0" model is the smallest, with [other options](https://huggingface.co/nvidia/segformer-b5-finetuned-ade-640-640) ranging up to and including "b5". Keep this in mind as something to experiment with when comparing different batch inference architectures later on.

## Data ingest

### Load dataset from Hugging Face Hub

In [None]:
from datasets import load_dataset
from utils import convert_image_to_rgb

In [None]:
SMALL_DATA = True

<div class="alert alert-warning">
  <strong>SMALL_DATA</strong>: a flag to load a subset (160 images) of the available training data. Defaults to True. Set to False (recommended) to work with the full training set (about 20k images).
</div>

If you set `SMALL_DATA` to `False`, expect loading to take some time (depending on your connection's download speed) because you are downloading all training images to your local machine or cluster.

#### Load SceneParse150

In [None]:
DATASET_NAME = "scene_parse_150"

# Load data from the Hugging Face datasets repository.
if SMALL_DATA:
    train_dataset = load_dataset(DATASET_NAME, split="train[:160]")
else:
    train_dataset = load_dataset(DATASET_NAME, split="train")

In [None]:
train_dataset = train_dataset.map(convert_image_to_rgb)
train_dataset

Each sample contains three components:
* **`image`** 
    * The PIL image.
* **`annotation`**  
    * Human annotations of image regions (annotation mask is `None` in testing set).
* **`category`**  
    * Category of the scene generally (e.g. driveway, voting booth, dairy_outdoor).

#### Display example images

In [None]:
from utils import display_example_images

In [None]:
# Try running this multiple times!
display_example_images(train_dataset)

### Create train and eval Ray Datasets with `BATCH_SIZE * N_BATCHES` images each

In [None]:
BATCH_SIZE = 6
N_BATCHES = 1

#### Create train dataset

In [None]:
from utils import get_image_indices

In [None]:
# Get BATCH_SIZE * N_BATCHES randomly shuffled image IDs from the train dataset.
image_indices = get_image_indices(dataset=train_dataset, n=BATCH_SIZE * N_BATCHES)

# Create a list of tuples (image, label) for the indices sampled from the train dataset.
data = [
    (train_dataset[i]["image"], train_dataset[i]["annotation"]) for i in image_indices
]

# Create a Ray Dataset from the list of images to use in Ray AIR.
train_ds = ray.data.from_items(data)
train_ds = train_ds.map_batches(
    lambda x: pd.DataFrame(x, columns=["image", "annotation"])
)

In [None]:
train_ds

In [None]:
# Display example image
train_ds.take(1)[0]["image"]

In [None]:
# Display example annotation
train_ds.take(1)[0]["annotation"]

#### Create eval dataset

In [None]:
# Get BATCH_SIZE * N_BATCHES randomly shuffled image IDs from the train dataset.
image_indices = get_image_indices(dataset=train_dataset, n=BATCH_SIZE * N_BATCHES)

# Create a list of tuples (image, label) for the indices sampled from the train dataset.
data = [
    (train_dataset[i]["image"], train_dataset[i]["annotation"]) for i in image_indices
]

# Create a Ray Dataset from the list of images to use in Ray AIR.
eval_ds = ray.data.from_items(data)
eval_ds = eval_ds.map_batches(
    lambda x: pd.DataFrame(x, columns=["image", "annotation"])
)

In [None]:
eval_ds

### Create preprocessor for distributed data loading

In [None]:
from ray.data.preprocessors import BatchMapper
from transformers import SegformerFeatureExtractor

In [None]:
def images_preprocessor(batch):
    segformer_feature_extractor = SegformerFeatureExtractor.from_pretrained(
        MODEL_NAME, reduce_labels=True
    )

    # inputs are `transformers.image_processing_utils.BatchFeature`
    inputs = segformer_feature_extractor(
        images=list(batch["image"]),
        segmentation_maps=list(batch["annotation"]),
        return_tensors="np",
    )

    return dict(inputs)  # {"pixel_values": array, "labels": array}

[Feature extractors](https://huggingface.co/docs/transformers/main_classes/feature_extractor) preprocess input features (e.g. image data) by normalizing, resizing, padding, and converting raw images into the shape expected by SegFormer.

The [`reduce_labels`](https://huggingface.co/docs/transformers/model_doc/segformer#segformer) flag ensures that the background of an image (anything that is not explicitly an object) isn't included when computing loss. 
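
The remapping behind `reduce_labels` can be sketched in a few lines. This is an illustration of the behavior described in the Hugging Face SegFormer documentation, not the library's code: the standalone `reduce_labels` function and the tiny `annotation` grid are hypothetical, and the real feature extractor operates on image arrays. Background pixels (id 0) are mapped to 255, and every other id shifts down by one so that the 150 categories occupy ids 0 to 149.

```python
IGNORE_INDEX = 255  # pixels with this id are skipped by the loss and the metric

def reduce_labels(mask):
    """Map background (0) to the ignore index and shift other ids down by one."""
    return [
        [IGNORE_INDEX if label in (0, IGNORE_INDEX) else label - 1 for label in row]
        for row in mask
    ]

# Hypothetical 2x3 annotation with background (0) and object ids 1, 3, and 150.
annotation = [
    [0, 1, 1],
    [0, 3, 150],
]
reduced = reduce_labels(annotation)
print(reduced)  # [[255, 0, 0], [255, 2, 149]]
```

The value 255 matches the `ignore_index=255` passed to the mean IoU metric in the evaluation setup, so background pixels affect neither the loss nor the reported score.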

In [None]:
batch_preprocessor = BatchMapper(fn=images_preprocessor, batch_format="pandas")

## Distributed Training

To run distributed training with SegFormer from Hugging Face, you need to:

* set up the batch preprocessor
* set up the Hugging Face Trainer configuration for all workers
* create a `HuggingFaceTrainer`, the Ray Train object that handles distributed training

### Set up Hugging Face Trainer per worker

In [None]:
from transformers import Trainer, TrainingArguments
from transformers.utils.logging import set_verbosity_error

In [None]:
def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    set_verbosity_error()
    name = "segformer-finetuned"

    segformer = SegformerForSemanticSegmentation.from_pretrained(
        MODEL_NAME, id2label=id2label, label2id=label2id
    )

    optimizer = torch.optim.AdamW(params=segformer.parameters(), lr=1e-4)
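    # Keep the learning rate constant: LambdaLR multiplies the base rate by the
    # returned factor, so `lambda x: x` would start at zero and grow every step.
    lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer=optimizer, lr_lambda=lambda step: 1.0
    )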

    training_args = TrainingArguments(
        name,
        num_train_epochs=5,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        save_total_limit=3,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="epoch",
        eval_accumulation_steps=16,
        remove_unused_columns=False,
        push_to_hub=False,
        disable_tqdm=True,  # declutter the output a little
        no_cuda=not torch.cuda.is_available(),
    )

    trainer = Trainer(
        model=segformer,
        optimizers=(optimizer, lr_scheduler),
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    print("Starting training...")
    return trainer

### Set up evaluation

In [None]:
import evaluate
from torch.nn.functional import interpolate

In [None]:
mean_iou_metric = evaluate.load("mean_iou")


def compute_metrics(eval_pred):
    with torch.no_grad():
        logits, labels = eval_pred
        logits_tensor = torch.from_numpy(logits)
        logits_tensor = interpolate(
            logits_tensor,
            size=labels.shape[-2:],
            mode="bilinear",
            align_corners=False,
        ).argmax(dim=1)

        pred_labels = logits_tensor.detach().cpu().numpy()

        metrics = mean_iou_metric.compute(
            predictions=pred_labels,
            references=labels,
            num_labels=num_labels,
            ignore_index=255,
            reduce_labels=False,
        )

        for key, value in metrics.items():
            if isinstance(value, np.ndarray):
                metrics[key] = value.tolist()

        return metrics
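
For intuition about the number that `compute_metrics` reports, the sketch below computes mean intersection-over-union by hand on a tiny example. It is a simplified stand-in, not the `evaluate` implementation: the `mean_iou` function, the 2x3 grids, and the choice to skip classes absent from both prediction and reference are assumptions made for illustration.

```python
def mean_iou(pred, ref, num_labels, ignore_index=255):
    """Mean intersection-over-union over classes that appear in pred or ref."""
    ious = []
    for cls in range(num_labels):
        intersection = union = 0
        for p_row, r_row in zip(pred, ref):
            for p, r in zip(p_row, r_row):
                if r == ignore_index:
                    continue  # ignored pixels do not count for any class
                intersection += int(p == cls and r == cls)
                union += int(p == cls or r == cls)
        if union:  # skip classes absent from both prediction and reference
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")

# Hypothetical predictions and references; 255 marks an ignored pixel.
pred = [[0, 0, 1], [1, 1, 2]]
ref = [[0, 1, 1], [1, 255, 2]]

score = mean_iou(pred, ref, num_labels=3)
print(f"mean IoU: {score:.4f}")
```

Here the per-class IoU values are 1/2, 2/3, and 1, so the mean is about 0.72. The ignored pixel is excluded entirely, which is the role `ignore_index=255` plays in the metric call above.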

### Create HuggingFaceTrainer

In [None]:
from ray.train.huggingface import HuggingFaceTrainer
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig

In [None]:
# config
num_workers = 1

In [None]:
# create HuggingFaceTrainer
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(
        num_workers=num_workers, use_gpu=torch.cuda.is_available()
    ),
    datasets={
        "train": train_ds,
        "evaluation": eval_ds,
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",
        ),
    ),
    preprocessor=batch_preprocessor,
)

In [None]:
# run model training
result = trainer.fit()
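
When you change `num_workers`, it helps to reason about how the effective batch size changes. Under data parallelism, each worker processes its own batch of `per_device_train_batch_size` samples per step, so the global batch is that size times the number of workers. The helper below is a back-of-the-envelope sketch: `steps_per_epoch` is a hypothetical name, and it assumes equally sized shards and that the final partial batch is kept.

```python
import math

def steps_per_epoch(num_samples, per_device_batch_size, num_workers):
    """Optimizer steps per epoch when each worker trains on its own shard."""
    global_batch_size = per_device_batch_size * num_workers
    return math.ceil(num_samples / global_batch_size)

# Values from this notebook: 6 images, batch size 6, 1 worker.
single = steps_per_epoch(num_samples=6, per_device_batch_size=6, num_workers=1)

# Hypothetical scale-up: 160 images spread across 4 workers.
scaled = steps_per_epoch(num_samples=160, per_device_batch_size=6, num_workers=4)

print(single, scaled)  # 1 7
```

Because a larger global batch means fewer optimizer steps per epoch, you may need to revisit the learning rate or the number of epochs when scaling out.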

## Conclusion

Congratulations! You have successfully gained experience in using Ray Train to fine-tune a vision transformer model for semantic segmentation. In the upcoming module, you will build on this example by conducting a series of hyperparameter tuning experiments using Ray Tune.

### Summary

-   Distributed model training
    -   Training and fine-tuning large neural networks requires a massive amount of compute, so distributing this workload is often the practical solution.
    -   Data parallelism offers a pattern for sharding a large dataset across multiple machines for training and gradient synchronization.
    -   This orchestration and maintenance is challenging, and Ray AIR offers a unified compute solution to scale this workload that integrates well with other stages in the pipeline.
-   Fine-tuning SegFormer on MIT ADE20K
    -   Data ingest
        -   Ray Data can be used to ingest and preprocess training images. These same transformations can be applied during tuning, inference, and serving.
    -   Distributed training
        -   Ray Train can fine-tune a transformer model, in this case implementing the data parallel design pattern by running PyTorch's [Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) as the backend.
    -   Evaluation
        -   You assessed the fine-tuned model during training by computing mean IoU on an evaluation dataset at the end of each epoch.


# Connect with the Ray community

You can learn and get more involved with the Ray community of developers and researchers:

* [**Ray documentation**](https://docs.ray.io/en/latest)

* [**Official Ray site**](https://www.ray.io/)  
Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.

* [**Join the community on Slack**](https://forms.gle/9TSdDYUgxYs8SA9e8)  
Find friends to discuss your new learnings in our Slack space.

* [**Use the discussion board**](https://discuss.ray.io/)  
Ask questions, follow topics, and view announcements on this community forum.

* [**Join a meetup group**](https://www.meetup.com/Bay-Area-Ray-Meetup/)  
Attend meetups to listen to compelling talks, get to know other users, and meet the team behind Ray.

* [**Open an issue**](https://github.com/ray-project/ray/issues/new/choose)  
Ray is constantly evolving to improve developer experience. Submit feature requests and bug reports, and get help via GitHub issues.

* [**Become a Ray contributor**](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html)  
We welcome community contributions to improve our documentation and Ray framework.
