(hpu_resnet_training)=
# ResNet Model Training with Intel Gaudi

<a id="try-anyscale-quickstart-intel_gaudi-resnet" href="https://console.anyscale.com/register/ha?render_flow=ray&utm_source=ray_docs&utm_medium=docs&utm_campaign=intel_gaudi-resnet">
    <img src="../../../_static/img/run-on-anyscale.svg" alt="try-anyscale-quickstart">
</a>
<br></br>

In this Jupyter notebook, we will train a ResNet-50 model to classify images of ants and bees using HPU. We will use PyTorch for model training and Ray for distributed training. The dataset will be downloaded and processed using torchvision's datasets and transforms.

[Intel Gaudi AI Processors (HPUs)](https://habana.ai) are AI hardware accelerators designed by Intel Habana Labs. For more information, see [Gaudi Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/index.html) and [Gaudi Developer Docs](https://developer.habana.ai/).

## Configuration

A node with Gaudi or Gaudi2 installed is required to run this example. Both Gaudi and Gaudi2 nodes have 8 HPUs. We will use 2 workers to train the model, each using 1 HPU.

We recommend using a prebuilt container to run these examples. To run a container, you need Docker. See [Install Docker Engine](https://docs.docker.com/engine/install/) for installation instructions.

Next, follow [Run Using Containers](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html?highlight=installer#run-using-containers) to install the Gaudi drivers and container runtime.

Next, start the Gaudi container:
```bash
docker pull vault.habana.ai/gaudi-docker/1.22.1/ubuntu24.04/habanalabs/pytorch-installer-2.7.1:latest
docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.22.1/ubuntu24.04/habanalabs/pytorch-installer-2.7.1:latest
```

Inside the container, install Ray and Jupyter to run this notebook.
```bash
pip install ray[train] notebook
```

In [None]:
import os

import torch
from torch import nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

import ray
import ray.train as train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig

# Load the Intel Gaudi PyTorch bridge so that the "hpu" device is available
import habana_frameworks.torch.core as htcore

## Define Data Transforms

We will set up the data transforms for preprocessing images for training and validation. This includes random cropping, flipping, and normalization for the training set, and resizing and normalization for the validation set.

In [None]:
# Data augmentation and normalization for training
# Just normalization for validation
data_transforms = {
    "train": transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
    "val": transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]),
}
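
To make the normalization step concrete, the following minimal sketch reproduces the per-channel arithmetic `(value - mean) / std` on a single RGB pixel in plain Python. The helper name `normalize_pixel` is hypothetical and exists only for illustration; the notebook itself relies on `transforms.Normalize`, which applies the same formula to whole image tensors.

```python
# ImageNet channel statistics, the same values passed to transforms.Normalize above.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1].

    Hypothetical helper for illustration only.
    """
    return [(value - m) / s for value, m, s in zip(rgb, mean, std)]


# A pixel whose channels equal the channel means maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))

# A pure white pixel: ToTensor scales the 8-bit value 255 to 1.0.
print(normalize_pixel([1.0, 1.0, 1.0]))
```

After normalization, channel values are centered around zero, which matches the input statistics the pretrained ResNet-50 weights were trained on.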

## Dataset Download Function

We will define a function to download the Hymenoptera dataset. This dataset contains images of ants and bees for a binary classification problem.

In [None]:
def download_datasets():
    os.system("wget https://download.pytorch.org/tutorial/hymenoptera_data.zip >/dev/null 2>&1")
    os.system("unzip hymenoptera_data.zip >/dev/null 2>&1")

## Dataset Preparation Function

After downloading the dataset, we need to build PyTorch datasets for training and validation. The `build_datasets` function will apply the previously defined transforms and create the datasets.

In [None]:
def build_datasets():
    torch_datasets = {}
    for split in ["train", "val"]:
        torch_datasets[split] = datasets.ImageFolder(
            os.path.join("./hymenoptera_data", split), data_transforms[split]
        )
    return torch_datasets
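
`datasets.ImageFolder` derives labels from the directory layout: each subdirectory of the split folder is one class, and class indices follow the sorted subdirectory names. The sketch below mimics that rule with the standard library on a throwaway directory tree. The helper `infer_class_to_idx` is hypothetical, and the `ants` and `bees` folder names are assumed to match the unpacked dataset.

```python
import os
import tempfile


def infer_class_to_idx(root):
    """Map each class subdirectory of root to an index, in sorted name order.

    Hypothetical helper that mirrors how ImageFolder assigns labels.
    """
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: index for index, name in enumerate(classes)}


# Build a temporary tree shaped like hymenoptera_data/train (assumed layout).
with tempfile.TemporaryDirectory() as root:
    for class_name in ["bees", "ants"]:
        os.makedirs(os.path.join(root, class_name))
    print(infer_class_to_idx(root))  # {'ants': 0, 'bees': 1}
```

Under this rule, label `0` corresponds to ants and label `1` to bees, regardless of the order in which the folders were created.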

## Model Initialization Functions

We will define an `initialize_model` function that loads a pre-trained ResNet-50 model and replaces the final classification layer for our binary classification task.

In [None]:
def initialize_model():
    # Load ImageNet-pretrained weights (same weights as the deprecated `pretrained=True`)
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

    # Replace the original classifier with a new Linear layer
    num_features = model.fc.in_features
    model.fc = nn.Linear(num_features, 2)

    # Ensure all params get updated during finetuning
    for param in model.parameters():
        param.requires_grad = True
    return model

## Evaluation Function

To assess the performance of our model during training, we define an `evaluate` function. This function computes the number of correct predictions by comparing the predicted labels with the true labels.

In [None]:
def evaluate(logits, labels):
    _, preds = torch.max(logits, 1)
    corrects = torch.sum(preds == labels).item()
    return corrects
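
The counting logic in `evaluate` can be sketched without PyTorch: take the index of the largest logit in each row as the prediction, then count how many predictions equal the labels. The function name `count_correct` and the sample values are illustrative assumptions.

```python
def count_correct(logits, labels):
    """Count rows whose highest-scoring class index equals the label.

    Hypothetical pure-Python counterpart of `evaluate`.
    """
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score, like torch.max(logits, 1) for one row.
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        if predicted == label:
            correct += 1
    return correct


# Three samples with two classes each (illustrative values).
logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]]
labels = [1, 0, 0]
print(count_correct(logits, labels))  # 2 of the 3 predictions match
```

The function returns a count rather than a ratio, so the caller divides by the number of samples to obtain accuracy, as the training loop does.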

## Training Loop Function

This function defines the training loop that each worker executes. It downloads the dataset, prepares the data loaders, initializes the model, and runs the training and validation phases. Compared to a training function for GPU, no changes are needed to port it to HPU. Internally, Ray Train does the following:

* Detects the HPU and sets the device.

* Initializes the Habana PyTorch backend.

* Initializes the Habana distributed backend.

In [None]:
def train_loop_per_worker(configs):
    import warnings

    warnings.filterwarnings("ignore")

    # Calculate the batch size for a single worker
    worker_batch_size = configs["batch_size"] // train.get_context().get_world_size()

    # Download dataset once on local rank 0 worker
    if train.get_context().get_local_rank() == 0:
        download_datasets()
    torch.distributed.barrier()

    # Build datasets on each worker
    torch_datasets = build_datasets()

    # Prepare dataloader for each worker
    dataloaders = dict()
    dataloaders["train"] = DataLoader(
        torch_datasets["train"], batch_size=worker_batch_size, shuffle=True
    )
    dataloaders["val"] = DataLoader(
        torch_datasets["val"], batch_size=worker_batch_size, shuffle=False
    )

    # Distribute
    dataloaders["train"] = train.torch.prepare_data_loader(dataloaders["train"])
    dataloaders["val"] = train.torch.prepare_data_loader(dataloaders["val"])

    # Obtain HPU device automatically
    device = train.torch.get_device()

    # Move the model to the HPU, then create the optimizer and loss function
    model = initialize_model()
    model = model.to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=configs["lr"], momentum=configs["momentum"]
    )
    criterion = nn.CrossEntropyLoss()

    # Start training loops
    for epoch in range(configs["num_epochs"]):
        # Each epoch has a training and validation phase
        for phase in ["train", "val"]:
            if phase == "train":
                model.train()  # Set model to training mode
            else:
                model.eval()  # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                with torch.set_grad_enabled(phase == "train"):
                    # Get model outputs and calculate loss
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)

                    # backward + optimize only if in training phase
                    if phase == "train":
                        loss.backward()
                        optimizer.step()

                # calculate statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += evaluate(outputs, labels)

            size = len(torch_datasets[phase]) // train.get_context().get_world_size()
            epoch_loss = running_loss / size
            epoch_acc = running_corrects / size

            if train.get_context().get_world_rank() == 0:
                print(
                    "Epoch {}-{} Loss: {:.4f} Acc: {:.4f}".format(
                        epoch, phase, epoch_loss, epoch_acc
                    )
                )

            # Report validation metrics at the end of every epoch
            if phase == "val":
                train.report(
                    metrics={"loss": epoch_loss, "acc": epoch_acc},
                )
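
The per-worker arithmetic in the loop above is easy to check by hand. The sketch below restates it in plain Python: the global batch size is split with floor division, and the epoch metrics are divided by the worker's approximate share of the dataset. The helper names are hypothetical, and the 244-image split size and running totals are illustrative assumptions, not values taken from the notebook.

```python
def per_worker_batch_size(global_batch_size, world_size):
    """Split the global batch size across workers using floor division."""
    return global_batch_size // world_size


def epoch_metrics(running_loss, running_corrects, dataset_size, world_size):
    """Average the running totals over one worker's share of the dataset."""
    shard_size = dataset_size // world_size
    return running_loss / shard_size, running_corrects / shard_size


# With the notebook's configuration: batch size 32 and 2 workers.
print(per_worker_batch_size(32, 2))

# Illustrative totals for one worker, assuming a 244-image split and 2 workers.
loss, acc = epoch_metrics(
    running_loss=80.0, running_corrects=76, dataset_size=244, world_size=2
)
print(f"Loss: {loss:.4f} Acc: {acc:.4f}")
```

When the dataset size is not divisible by the number of workers, the floor division makes the denominator an approximation of the number of samples a worker actually processed.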

## Main Training Function

The `train_resnet` function sets up the distributed training environment using Ray and starts the training process. It specifies the batch size, number of epochs, learning rate, and momentum for the SGD optimizer. To enable training on HPUs, we only need to make the following changes:

* Require an HPU for each worker in `ScalingConfig`
* Set the backend to `"hccl"` in `TorchConfig`

In [None]:
def train_resnet(num_workers=2):
    train_loop_config = {
        "input_size": 224,  # Input image size (224 x 224)
        "batch_size": 32,  # Global batch size, split evenly across workers
        "num_epochs": 10,  # Number of epochs to train for
        "lr": 0.001,  # Learning Rate
        "momentum": 0.9,  # SGD optimizer momentum
    }
    # Configure computation resources
    # In ScalingConfig, require an HPU for each worker
    scaling_config = ScalingConfig(num_workers=num_workers, resources_per_worker={"CPU": 1, "HPU": 1})
    # Set backend to hccl in TorchConfig
    torch_config = TorchConfig(backend="hccl")

    # Work around https://github.com/ray-project/ray/issues/45302 by explicitly setting the HPU resource
    ray.init(resources={"HPU": 8})
    
    # Initialize a Ray TorchTrainer
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        train_loop_config=train_loop_config,
        torch_config=torch_config,
        scaling_config=scaling_config,
    )

    result = trainer.fit()
    print(f"Training result: {result}")

## Start Training

Finally, we call the `train_resnet` function to start the training process. You can adjust the number of workers to use. Before running this cell, ensure that Ray is properly set up in your environment to handle distributed training.

In [None]:
%env PT_HPU_LAZY_MODE=1
train_resnet(num_workers=2) 
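
The `%env` magic only works inside IPython or Jupyter. If you run the same code as a plain Python script, a minimal equivalent is to set the variable through `os.environ` before the training code starts. This sketch assumes the local Ray instance is started afterwards by the same process, so that the processes it launches can inherit the variable.

```python
import os

# Script equivalent of the `%env PT_HPU_LAZY_MODE=1` notebook magic.
# Set it before Ray starts so that processes launched later can inherit it
# (assumption: Ray is started locally from this same process).
os.environ["PT_HPU_LAZY_MODE"] = "1"

print(os.environ["PT_HPU_LAZY_MODE"])
```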

## Example Output

``` text
env: PT_HPU_LAZY_MODE=1
2025-11-19 23:26:40,364	INFO worker.py:2012 -- Started a local Ray instance.
/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
(TrainController pid=10012) Attempting to start training worker group of size 2 with the following resources: [{'CPU': 1, 'HPU': 1}] * 2
(RayTrainWorker pid=10466) Setting up process group for: env:// [rank=0, world_size=2]
(TrainController pid=10012) Started training worker group of size 2: 
(TrainController pid=10012) - (ip=100.83.67.100, pid=10466) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=10012) - (ip=100.83.67.100, pid=10465) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=10466) ============================= HPU PT BRIDGE CONFIGURATION ON RANK = 0 ============= 
(RayTrainWorker pid=10466)  PT_HPU_LAZY_MODE = 1
(RayTrainWorker pid=10466)  PT_HPU_RECIPE_CACHE_CONFIG = ,false,1024,false
(RayTrainWorker pid=10466)  PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
(RayTrainWorker pid=10466)  PT_HPU_LAZY_ACC_PAR_MODE = 1
(RayTrainWorker pid=10466)  PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
(RayTrainWorker pid=10466)  PT_HPU_EAGER_PIPELINE_ENABLE = 1
(RayTrainWorker pid=10466)  PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
(RayTrainWorker pid=10466)  PT_HPU_ENABLE_LAZY_COLLECTIVES = 0
(RayTrainWorker pid=10466) ---------------------------: System Configuration :---------------------------
(RayTrainWorker pid=10466) Num CPU Cores : 160
(RayTrainWorker pid=10466) CPU RAM       : 1007 GB
(RayTrainWorker pid=10466) ------------------------------------------------------------------------------
  0%|          | 0.00/97.8M [00:00<?, ?B/s]
 13%|█▎        | 12.5M/97.8M [00:00<00:00, 126MB/s]
(RayTrainWorker pid=10465) Downloading: "https://download.pytorch.org/models/resnet50-0676ba61.pth" to /root/.cache/torch/hub/checkpoints/resnet50-0676ba61.pth
100%|██████████| 97.8M/97.8M [00:00<00:00, 251MB/s]
100%|██████████| 97.8M/97.8M [00:00<00:00, 247MB/s]
(pid=gcs_server) [2025-11-19 23:27:08,208 E 135 135] (gcs_server) gcs_server.cc:302: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(raylet) [2025-11-19 23:27:10,279 E 436 436] (raylet) main.cc:975: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(pid=571) [2025-11-19 23:27:14,208 E 571 1215] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
  0%|          | 0.00/97.8M [00:00<?, ?B/s]
 55%|█████▌    | 53.9M/97.8M [00:00<00:00, 222MB/s]  [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
[2025-11-19 23:27:15,493 E 52 563] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(RayTrainWorker pid=10466) Epoch 0-train Loss: 0.6561 Acc: 0.6230
(RayTrainWorker pid=10466) Epoch 0-val Loss: 0.5295 Acc: 0.6447
(RayTrainWorker pid=10465) Reporting training result 1: TrainingReport(checkpoint=None, metrics={'loss': 0.6530329294894871, 'acc': 0.5394736842105263}, validation_spec=None)
(RayTrainWorker pid=10466) Epoch 1-train Loss: 0.5024 Acc: 0.7459
(RayTrainWorker pid=10466) Epoch 1-val Loss: 0.3141 Acc: 0.9474
(RayTrainWorker pid=10466) Epoch 2-train Loss: 0.3223 Acc: 0.9344
(TrainController pid=10012) [2025-11-19 23:27:18,259 E 10012 10052] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 157x across cluster]
(RayTrainWorker pid=10466) Epoch 2-val Loss: 0.2315 Acc: 0.9474
(RayTrainWorker pid=10466) Epoch 3-train Loss: 0.2607 Acc: 0.9426
(RayTrainWorker pid=10466) Epoch 3-val Loss: 0.1912 Acc: 0.9605
(RayTrainWorker pid=10466) Epoch 4-train Loss: 0.1582 Acc: 0.9672
(RayTrainWorker pid=10466) Reporting training result 4: TrainingReport(checkpoint=None, metrics={'loss': 0.19123821666366175, 'acc': 0.9605263157894737}, validation_spec=None) [repeated 7x across cluster]
(RayTrainWorker pid=10466) Epoch 4-val Loss: 0.1681 Acc: 0.9474
(bundle_reservation_check_func pid=10281) [2025-11-19 23:27:24,784 E 10281 10400] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(RayTrainWorker pid=10466) Epoch 5-train Loss: 0.1454 Acc: 0.9590
(RayTrainWorker pid=10466) Epoch 5-val Loss: 0.1510 Acc: 0.9737
(RayTrainWorker pid=10466) Epoch 6-train Loss: 0.1103 Acc: 0.9836
(RayTrainWorker pid=10466) Epoch 6-val Loss: 0.1436 Acc: 0.9605
(RayTrainWorker pid=10465) [2025-11-19 23:27:27,666 E 10465 10664] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(RayTrainWorker pid=10466) Epoch 7-train Loss: 0.1021 Acc: 0.9836
(RayTrainWorker pid=10466) Reporting training result 7: TrainingReport(checkpoint=None, metrics={'loss': 0.14356592101486107, 'acc': 0.9605263157894737}, validation_spec=None) [repeated 6x across cluster]
(RayTrainWorker pid=10466) Epoch 7-val Loss: 0.1382 Acc: 0.9605
(RayTrainWorker pid=10466) Epoch 8-train Loss: 0.0715 Acc: 0.9918
(RayTrainWorker pid=10466) Epoch 8-val Loss: 0.1396 Acc: 0.9737
(RayTrainWorker pid=10466) Epoch 9-train Loss: 0.0874 Acc: 0.9672
(RayTrainWorker pid=10466) [2025-11-19 23:27:27,646 E 10466 10580] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 2x across cluster]
(RayTrainWorker pid=10466) Epoch 9-val Loss: 0.1374 Acc: 0.9737
Training result: Result(metrics=None, checkpoint=None, error=None, path='/root/ray_results/ray_train_run-2025-11-19_23-26-45', metrics_dataframe=None, best_checkpoints=[], _storage_filesystem=<pyarrow._fs.LocalFileSystem object at 0x7f6e6e587c70>)
```
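
The last line of the log prints a result object with `metrics`, `checkpoint`, `error`, and `path` fields. The following sketch shows one way to turn such an object into a short summary. The helper `summarize_result` is hypothetical, and the stand-in object and its values are illustrative assumptions that reuse only the field names visible in the output above.

```python
from types import SimpleNamespace


def summarize_result(result):
    """Return a one-line summary of a training result (hypothetical helper)."""
    if result.error is not None:
        return f"Training failed: {result.error}"

    # `metrics` can be None, as in the log above, so fall back to an empty dict.
    metrics = result.metrics or {}
    parts = [
        f"{name}={value:.4f}"
        for name, value in sorted(metrics.items())
        if isinstance(value, float)
    ]
    summary = ", ".join(parts) if parts else "no metrics recorded"
    return f"{summary} (results in {result.path})"


# Stand-in object with the same field names; the values are illustrative.
fake_result = SimpleNamespace(
    metrics={"loss": 0.1374, "acc": 0.9737},
    checkpoint=None,
    error=None,
    path="/tmp/ray_results/example_run",
)
print(summarize_result(fake_result))
```

In `train_resnet`, the same helper could be applied to the value returned by `trainer.fit()` instead of printing the raw object.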