# Scaling model training

<img src="https://technical-training-assets.s3.us-west-2.amazonaws.com/Generic/ray_logo.png" width="20%" loading="lazy">

## Learning objectives

-   Understand the challenges associated with distributing model training across multiple GPUs.
-   Implement the data parallelism design pattern using Ray Datasets.
-   Fine-tune a transformer model on an image dataset using Ray Train.
-   Evaluate the trained model by performing inference on the test set.

## Getting started

### Set up necessary imports and utilities

In [None]:
import warnings

import torch
import numpy as np
import pandas as pd
from PIL import Image
from PIL.JpegImagePlugin import JpegImageFile

# Set the seed to a fixed value for reproducibility.
torch.manual_seed(201)

warnings.simplefilter("ignore")

### Initialize Ray runtime

In [None]:
import ray

In [None]:
ray.init()

### Load the model components from the Hugging Face Hub

From the [Hugging Face Hub](https://huggingface.co/docs/hub/index), retrieve the pretrained SegFormer model by specifying the model name and [label files](https://huggingface.co/datasets/huggingface/label-files/blob/main/ade20k-id2label.json) which map indices to semantic categories.

#### Load label mappings

In [None]:
from utils import get_labels

In [None]:
id2label, label2id = get_labels()
num_labels = len(id2label)

print(f"Total number of labels: {len(id2label)}")
print(f"Example labels: {list(id2label.values())[:5]}")

The utility function `get_labels` fetches two dictionary mappings from [Hugging Face](https://huggingface.co/datasets/huggingface/label-files/blob/main/ade20k-id2label.json), `id2label` and `label2id`, which are used to convert between numerical and string labels for the 150 available [semantic categories](https://docs.google.com/spreadsheets/d/1se8YEtb2detS7OuPE86fXGyD269pMycAWe2mtKUj2W8/edit#gid=0) of objects.
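
To make the two mappings concrete, here is a minimal sketch of how such a pair could be built from a label file. It doesn't reproduce `get_labels`, whose implementation isn't shown in this notebook. `build_label_maps` is a hypothetical helper, and the three entries are only an illustrative excerpt in the style of the label file, where JSON keys are strings.

```python
import json

# Illustrative excerpt only: a few entries in the style of the label file.
# JSON object keys are always strings, so they are converted to int below.
raw_json = '{"0": "wall", "1": "building", "2": "sky"}'


def build_label_maps(raw):
    """Return (id2label, label2id) from a {str_id: label} mapping."""
    id2label = {int(idx): label for idx, label in raw.items()}
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label_demo, label2id_demo = build_label_maps(json.loads(raw_json))
print(id2label_demo[2])            # numerical -> string
print(label2id_demo["building"])   # string -> numerical
```

Keeping both directions around avoids a linear search whenever a predicted index has to be shown as a category name, or the reverse.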

#### Load SegFormer

In [None]:
from transformers import SegformerForSemanticSegmentation

In [None]:
# "nvidia/mit-b0"                              https://huggingface.co/nvidia/mit-b0
# "nvidia/segformer-b0-finetuned-ade-512-512"  https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512

MODEL_NAME = "nvidia/mit-b0"

segformer = SegformerForSemanticSegmentation.from_pretrained(
    MODEL_NAME, id2label=id2label, label2id=label2id
)

print(f"Number of model parameters: {segformer.num_parameters()/(10**6):.2f} M")

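The [Hugging Face Hub](https://huggingface.co/nvidia/segformer-b0-finetuned-ade-512-512) makes many SegFormer variants available. Here, you load `nvidia/mit-b0`, the pretrained SegFormer encoder, as the starting point for fine-tuning on the MIT ADE20K (SceneParse150) dataset. The commented-out alternative, `nvidia/segformer-b0-finetuned-ade-512-512`, is a checkpoint already fine-tuned on that dataset with images at a 512 x 512 resolution.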

Note: This "b0" model is the smallest, with [other options](https://huggingface.co/nvidia/segformer-b5-finetuned-ade-640-640) ranging up to and including "b5". Keep this in mind as something to experiment with when comparing different batch inference architectures later on.

## Data ingest

### Load dataset from Hugging Face Hub

In [None]:
from datasets import load_dataset
from utils import convert_image_to_rgb

In [None]:
SMALL_DATA = True

<div class="alert alert-warning">
  <strong>SMALL_DATA</strong>: a flag to download a subset (160 images) of the available data. Defaults to True. Set to False (recommended) to work with the full train data (20k images).
</div>

If you set `SMALL_DATA` to `False`, expect the download to take some time (depending on your connection speed) because you are fetching the full set of training images to your local machine or cluster.

#### Load SceneParse150

In [None]:
DATASET_NAME = "scene_parse_150"

# Load data from the Hugging Face datasets repository.
if SMALL_DATA:
    train_dataset = load_dataset(DATASET_NAME, split="train[:160]")
else:
    train_dataset = load_dataset(DATASET_NAME, split="train")

In [None]:
train_dataset = train_dataset.map(convert_image_to_rgb)
train_dataset

Each sample contains three components:
* **`image`**
    * The PIL image.
* **`annotation`**
    * Human annotations of image regions (the annotation mask is `None` in the test set).
* **`category`**
    * General category of the scene (e.g. driveway, voting booth, dairy_outdoor).

#### Display example images

In [None]:
from utils import display_example_images

In [None]:
# Try running this multiple times!
display_example_images(train_dataset)

---
## Pause: Switch to Slido to answer a quiz question.
---

### Create Ray Dataset for training

In [None]:
BATCH_SIZE = 8
N_BATCHES = 1

In [None]:
from utils import get_image_indices

In [None]:
# Get BATCH_SIZE * N_BATCHES randomly shuffled image IDs from the train dataset.
image_indices = get_image_indices(dataset=train_dataset, n=BATCH_SIZE * N_BATCHES)

# Create a list of tuples (image, label) for the indices sampled from the train dataset.
data = [
    (train_dataset[i]["image"], train_dataset[i]["annotation"]) for i in image_indices
]

# Create a Ray Dataset from the list of images to use in Ray AIR.
train_ds = ray.data.from_items(data)
train_ds = train_ds.map_batches(
    lambda x: pd.DataFrame(x, columns=["image", "annotation"])
)
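
`get_image_indices` comes from the workshop's `utils` module and isn't shown here. Below is a minimal sketch of what a helper like it might do, assuming it samples distinct indices without replacement. `sample_indices` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random


def sample_indices(dataset_size, n, seed=None):
    """Return n distinct indices in random order (hypothetical helper)."""
    if n > dataset_size:
        raise ValueError("cannot sample more indices than the dataset holds")
    rng = random.Random(seed)  # local generator, leaves global state untouched
    return rng.sample(range(dataset_size), n)


# 160 images in the small split, BATCH_SIZE * N_BATCHES = 8 indices requested.
indices = sample_indices(dataset_size=160, n=8, seed=201)
print(indices)
```

Sampling without replacement guarantees that no image appears twice in the small training set.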

In [None]:
train_ds.schema()

In [None]:
# Display example image
train_ds.take(1)[0]["image"]

In [None]:
# Display example annotation
train_ds.take(1)[0]["annotation"]

### Create preprocessor for distributed data loading

In [None]:
from transformers import SegformerImageProcessor

In [None]:
def images_preprocessor(batch):
    warnings.simplefilter("ignore")
    segformer_image_processor = SegformerImageProcessor.from_pretrained(
        MODEL_NAME, do_reduce_labels=True
    )

    # inputs are `transformers.image_processing_utils.BatchFeature`
    inputs = segformer_image_processor(
        images=list(batch["image"]),
        segmentation_maps=list(batch["annotation"]),
        return_tensors="np",
    )

    return dict(inputs)  # {"pixel_values": array, "labels": array}

[Feature extractors](https://huggingface.co/docs/transformers/main_classes/feature_extractor) preprocess input features (e.g. image data) by normalizing, resizing, padding, and converting raw images into the shape expected by SegFormer.

The [`do_reduce_labels`](https://huggingface.co/docs/transformers/model_doc/segformer#segformer) flag maps the background label of an image (anything that is not explicitly an object) to the ignore index and shifts the remaining labels down by one, so background pixels aren't included when computing the loss.
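
The effect of label reduction on an annotation mask can be sketched in plain Python. This is an illustration of the idea, not the image processor's own code. It assumes the background is labeled `0` and that `255` is the index the loss ignores, and `reduce_labels` is a name introduced here for the example.

```python
IGNORE_INDEX = 255  # assumed value that the loss skips


def reduce_labels(mask):
    """Map background (0) to IGNORE_INDEX and shift other classes down by one."""
    return [
        [IGNORE_INDEX if v in (0, IGNORE_INDEX) else v - 1 for v in row]
        for row in mask
    ]


# A tiny 2 x 3 mask: 0 is background, 1 and 150 are object classes.
mask = [[0, 1, 1], [150, 0, 255]]
print(reduce_labels(mask))  # [[255, 0, 0], [149, 255, 255]]
```

After reduction, the 150 object classes occupy indices 0 to 149, which matches the number of output channels of the segmentation head.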

In [None]:
from ray.data.preprocessors import BatchMapper

In [None]:
batch_preprocessor = BatchMapper(
    fn=images_preprocessor, batch_format="pandas", batch_size=2
)
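
Conceptually, a batch mapper slices the dataset into groups of `batch_size` rows and calls the function once per group. The following plain-Python sketch shows that pattern. It is not Ray's implementation, and `iter_batches`, `apply_per_batch`, and `double` are hypothetical names.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def apply_per_batch(rows, fn, batch_size):
    """Call fn once per batch and concatenate the results."""
    out = []
    for batch in iter_batches(rows, batch_size):
        out.extend(fn(batch))
    return out


batch_sizes_seen = []


def double(batch):
    """Toy batch function that records how many rows it received."""
    batch_sizes_seen.append(len(batch))
    return [2 * x for x in batch]


doubled = apply_per_batch(list(range(5)), double, batch_size=2)
print(doubled)           # [0, 2, 4, 6, 8]
print(batch_sizes_seen)  # [2, 2, 1]
```

A small batch size such as 2 keeps the memory needed per call low, which matters when each row is a full-resolution image.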

---
## Pause: Switch to Slido to answer a quiz question.
---

## Distributed Training

To run distributed training with SegFormer from Hugging Face, you need to:

* Set up a batch preprocessor.
* Set up the Hugging Face `Trainer` configuration for all workers.
* Create a `HuggingFaceTrainer`, the Ray Train object that handles distributed training.

### Set up the Hugging Face Trainer per worker

In [None]:
from transformers import Trainer, TrainingArguments

In [None]:
def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    warnings.simplefilter("ignore")

    name = "segformer-finetuned"

    # Setup model
    segformer = SegformerForSemanticSegmentation.from_pretrained(
        MODEL_NAME, id2label=id2label, label2id=label2id
    )

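    # Set up optimizer and LR scheduler.
    # Note: LambdaLR multiplies the base LR by lr_lambda(step), so `lambda x: x`
    # makes the LR grow linearly with the step count, starting from 0.
    optimizer = torch.optim.AdamW(params=segformer.parameters(), lr=1e-4)
    lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer=optimizer, lr_lambda=lambda x: x
    )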

    # Setup HF Training Arguments
    training_args = TrainingArguments(
        name,
        num_train_epochs=5,
        per_device_train_batch_size=BATCH_SIZE,
        save_total_limit=3,
        save_strategy="epoch",
        logging_strategy="epoch",
        eval_accumulation_steps=2,
        log_level="error",
        log_level_replica="error",
        log_on_each_node=False,
        remove_unused_columns=False,
        push_to_hub=False,
        disable_tqdm=True,  # declutter the output a little
        no_cuda=True,
    )

    # Setup HF Trainer
    trainer = Trainer(
        model=segformer,
        optimizers=(optimizer, lr_scheduler),
        args=training_args,
        train_dataset=train_dataset,
    )

    print("Starting training...")
    return trainer
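
The scheduler above is configured with `lr_lambda=lambda x: x`. A lambda-based schedule multiplies the base learning rate by the value the function returns for the current step, which the following plain-Python sketch makes explicit. `lr_at_step`, `identity`, and `constant` are hypothetical names, and the constant schedule is shown only for comparison.

```python
BASE_LR = 1e-4  # same base rate as the AdamW optimizer above


def lr_at_step(step, lr_lambda, base_lr=BASE_LR):
    """Learning rate a lambda-based schedule would use at a given step."""
    return base_lr * lr_lambda(step)


def identity(step):
    """Multiplier equivalent to `lambda x: x`."""
    return step


def constant(step):
    """Alternative multiplier that keeps the base rate unchanged."""
    return 1.0


identity_schedule = [lr_at_step(s, identity) for s in range(4)]
constant_schedule = [lr_at_step(s, constant) for s in range(4)]
print(identity_schedule)
print(constant_schedule)
```

With the identity multiplier, the first step uses a learning rate of 0 and the rate then rises by `BASE_LR` per step, so it is worth checking that this is the behavior you want before training for many steps.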

### Create the Ray Train HuggingFaceTrainer

In [None]:
from ray.train.huggingface import HuggingFaceTrainer
from ray.air.config import RunConfig, ScalingConfig, CheckpointConfig

In [None]:
# setup parameters for the ScalingConfig
num_workers = 2
use_gpu = False
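
With `num_workers = 2`, each worker trains on its own shard of the data. The sketch below works through the arithmetic for this notebook's 8 sampled images, assuming the training set is split into contiguous, near-equal shards. How Ray actually assigns rows to workers may differ, and `split_evenly` is a hypothetical helper.

```python
import math


def split_evenly(items, num_workers):
    """Give each worker a contiguous, near-equal shard of items."""
    base, extra = divmod(len(items), num_workers)
    shards, start = [], 0
    for rank in range(num_workers):
        size = base + (1 if rank < extra else 0)  # first `extra` workers get one more
        shards.append(items[start:start + size])
        start += size
    return shards


samples = list(range(8))  # BATCH_SIZE * N_BATCHES = 8 * 1 images
shards = split_evenly(samples, num_workers=2)

per_device_batch_size = 8  # per_device_train_batch_size=BATCH_SIZE
steps_per_epoch = math.ceil(len(shards[0]) / per_device_batch_size)

print(shards)           # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(steps_per_epoch)  # 1
```

Under these assumptions each worker sees only 4 images per epoch, so increasing `N_BATCHES` is the simplest way to give every worker at least one full batch.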

In [None]:
# Setup Ray's HF Trainer
trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    datasets={
        "train": train_ds,
    },
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_score_attribute="loss",
            checkpoint_score_order="min",
        ),
    ),
    preprocessor=batch_preprocessor,
)
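
The `CheckpointConfig` above asks to keep a single checkpoint, ranked by the lowest `loss`. The retention rule can be sketched in plain Python as follows. This is not Ray's implementation, `keep_best` is a hypothetical function, and the loss values are made-up numbers for illustration.

```python
def keep_best(checkpoints, num_to_keep, score_attribute, order):
    """Return the num_to_keep best checkpoints, best first."""
    reverse = order == "max"  # "min" means a lower score is better
    ranked = sorted(checkpoints, key=lambda c: c[score_attribute], reverse=reverse)
    return ranked[:num_to_keep]


# Made-up per-epoch results, one checkpoint per epoch.
history = [
    {"epoch": 1, "loss": 4.1},
    {"epoch": 2, "loss": 3.2},
    {"epoch": 3, "loss": 3.6},
]

kept = keep_best(history, num_to_keep=1, score_attribute="loss", order="min")
print(kept)  # [{'epoch': 2, 'loss': 3.2}]
```

Ranking by score rather than by recency means the checkpoint that survives is the best one seen, not necessarily the last one written.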

In [None]:
# run model training
result = trainer.fit()

---
## Pause: Switch to Slido to answer a quiz question.
---

## Conclusion

Congratulations! You have used Ray Train to fine-tune a vision transformer model for semantic segmentation. In the upcoming module, you will build on this example by running a series of hyperparameter tuning experiments with Ray Tune.

### Summary

-   Distributed model training
    -   Training and fine-tuning large neural networks requires a massive amount of compute, so distributing the workload across multiple devices is often the only practical option.
    -   Data parallelism offers a pattern for sharding a large dataset across multiple machines for training and gradient synchronization.
    -   This orchestration and maintenance is challenging, and Ray AIR offers a unified compute solution to scale this workload that integrates well with other stages in the pipeline.
-   Fine-tuning SegFormer on MIT ADE20K
    -   Data ingest
        -   Ray Data can be used to ingest and preprocess training images. These same transformations can be applied during tuning, inference, and serving.
    -   Distributed training
        -   Ray Train can fine-tune a transformer model, in this case implementing the data parallel design pattern by running PyTorch's [Distributed Data Parallel](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) as the backend.
    -   Evaluation
        -   You used Ray AIR's BatchPredictor to assess performance of the fine-tuned model by running inference.
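
The data parallel pattern from the summary (shard the data, compute gradients locally, synchronize, apply the same update everywhere) can be illustrated with a toy one-parameter model in plain Python. This is a sketch of the idea rather than what PyTorch's Distributed Data Parallel does internally. `grad_mse` and `all_reduce_mean` are hypothetical names, and the data points are made up.

```python
def grad_mse(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for gradient synchronization: average across workers."""
    return sum(values) / len(values)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # points on y = 2x
shards = [data[:2], data[2:]]                             # one shard per worker

w, lr = 0.0, 0.05
for _ in range(50):
    # Each worker computes a gradient on its own shard (in parallel in practice).
    local_grads = [grad_mse(w, shard) for shard in shards]
    # Every replica applies the same averaged gradient, so weights stay in sync.
    w -= lr * all_reduce_mean(local_grads)

print(round(w, 3))
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, which is why the replicas converge to the same weights as single-process training would.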


# Connect with the Ray community

You can learn more and get involved with the Ray community of developers and researchers:

* [**Ray documentation**](https://docs.ray.io/en/latest)

* [**Official Ray site**](https://www.ray.io/)  
Browse the ecosystem and use this site as a hub to get the information that you need to get going and building with Ray.

* [**Join the community on Slack**](https://forms.gle/9TSdDYUgxYs8SA9e8)  
Find friends to discuss your new learnings in our Slack space.

* [**Use the discussion board**](https://discuss.ray.io/)  
Ask questions, follow topics, and view announcements on this community forum.

* [**Join a meetup group**](https://www.meetup.com/Bay-Area-Ray-Meetup/)  
Tune in to meetups to listen to compelling talks, get to know other users, and meet the team behind Ray.

* [**Open an issue**](https://github.com/ray-project/ray/issues/new/choose)  
Ray is constantly evolving to improve developer experience. Submit feature requests and bug reports, and get help via GitHub issues.

* [**Become a Ray contributor**](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html)  
We welcome community contributions to improve our documentation and the Ray framework.
