# Distributed training with Ray Train, PyTorch and Hugging Face
© 2025, Anyscale. All Rights Reserved

💻 **Launch Locally**: You can run this notebook locally.

🚀 **Launch on Cloud**: You can also run this notebook on a Ray cluster (click [here](http://console.anyscale.com/register) to start a Ray cluster on Anyscale).


This notebook demonstrates distributed training of a BERT model for sequence classification using Ray Train, PyTorch, and Hugging Face libraries. The goal is to classify Yelp reviews into categories. Ray Train runs the same training function on several workers, so you can spread the training of large models across multiple CPUs or GPUs.

The notebook starts by importing all the necessary libraries, including PyTorch for deep learning, Hugging Face Transformers for model and tokenizer utilities, and Ray Train for distributed training. It then sets up the evaluation metric (accuracy) and defines a function to compute this metric during model evaluation.
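As a rough illustration of the metric step, the sketch below computes accuracy from a `(logits, labels)` pair, which is the shape Hugging Face evaluation code commonly passes around. It is a dependency-free stand-in written for this overview, not the notebook's own implementation; the function name `compute_metrics` and the plain-list inputs are assumptions.

```python
from typing import Sequence, Tuple


def compute_metrics(
    eval_pred: Tuple[Sequence[Sequence[float]], Sequence[int]]
) -> dict:
    """Return accuracy for a batch given as (logits, labels).

    Hypothetical helper: the name and input types are assumptions.
    """
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    # Count how many predictions match their reference label.
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    # Guard against an empty batch to avoid division by zero.
    return {"accuracy": correct / len(labels) if len(labels) else 0.0}
```

Returning a dictionary keyed by metric name makes it straightforward to log the value or to add further metrics later.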

A key part of the notebook is the training function, which is executed by each worker in the distributed setup. This function handles loading the Yelp review dataset, tokenizing the text data, preparing data loaders for batching, and setting up the BERT model for training. The function is designed to automatically use the best available hardware, whether that's a CPU, GPU, or Apple Silicon's MPS.
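The hardware fallback described above might look like the following minimal sketch. The preference order (CUDA, then MPS, then CPU) and the helper names `pick_device_name` and `get_device` are assumptions made for illustration. `torch` is imported lazily so the pure decision logic can be read and tested on its own.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple Silicon MPS, then fall back to CPU.

    Hypothetical helper: the preference order is an assumption.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def get_device():
    """Query PyTorch for available backends and return a torch.device."""
    import torch  # imported here so pick_device_name has no dependencies

    # torch.backends.mps is missing in older PyTorch builds, so look it up safely.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_available = mps_backend is not None and mps_backend.is_available()
    return torch.device(pick_device_name(torch.cuda.is_available(), mps_available))
```

Inside the per-worker function, the model and each batch would then be moved to the returned device with `.to(device)`.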

The main training function, `train_bert`, configures the distributed environment using Ray, sets up the training parameters, and launches the training process across multiple workers. This approach allows you to scale up your training easily, making it suitable for both local machines and cloud platforms. After training, Ray is properly shut down to free up resources.
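A minimal sketch of what a launcher like `train_bert` could look like with Ray Train's `TorchTrainer` and `ScalingConfig` follows. The parameters `num_workers` and `use_gpu`, the placeholder `train_func_per_worker`, and the values in `train_loop_config` are assumptions for illustration; the notebook's actual function may differ.

```python
def train_bert(num_workers: int = 2, use_gpu: bool = False):
    """Launch distributed training on `num_workers` Ray Train workers.

    Sketch only: the signature and config values are assumptions.
    """
    # Imported lazily so this definition loads even where Ray is not installed.
    import ray
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    def train_func_per_worker(config: dict) -> None:
        # Hypothetical placeholder: the real per-worker function would load
        # the dataset, tokenize it, build data loaders, and run the
        # training and evaluation loop.
        pass

    trainer = TorchTrainer(
        train_loop_per_worker=train_func_per_worker,
        # Passed to every worker as the `config` argument (values are made up).
        train_loop_config={"lr": 2e-5, "epochs": 1},
        # Controls how many workers run and whether each one gets a GPU.
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
    try:
        return trainer.fit()
    finally:
        # Release cluster resources even if training raises an error.
        ray.shutdown()
```

Calling `train_bert(num_workers=2)` would start two workers, each executing the per-worker function. Changing `num_workers` or `use_gpu` is enough to move between a local machine and a larger cluster.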

Overall, this notebook is a practical introduction to distributed deep learning with Ray Train, PyTorch, and Hugging Face, aimed at machine learning engineers who need to train large models on large datasets.

### Outline
<div class="alert alert-block alert-info">
<ol>
    <li>Architecture Diagram
    <li>Library Imports
        <ul>
            <li>Importing PyTorch, Hugging Face Transformers, Ray Train, and other dependencies
        </ul>
    <li>Metrics Setup
        <ul>
            <li>Defining accuracy as the evaluation metric
            <li>Function to compute metrics during evaluation
        </ul>
    <li>Training Function Per Worker
        <ul>
            <li>Data loading and preprocessing (tokenization)
            <li>Preparing data loaders for batching
            <li>Model initialization (BERT for sequence classification)
            <li>Device selection (CPU, GPU, or MPS)
            <li>Training and evaluation loop
        </ul>
    <li>Main Training Function
        <ul>
            <li>Setting up distributed training configuration with Ray
            <li>Scaling configuration for CPUs/GPUs
            <li>Initializing and running the Ray TorchTrainer
        </ul>
    <li>Running the Training
        <ul>
            <li>Executing the main training function with a specified number of workers
        </ul>
    <li>Shutdown Ray Cluster
    <li>Summary
</ol>
</div>
