# Ray Train - A Library for Distributed Deep Learning

[Ray Train](https://docs.ray.io/en/latest/train/train.html) is a lightweight library for distributed deep learning. It provides thin wrappers around [PyTorch](https://pytorch.org) and [TensorFlow](https://tensorflow.org) native modules for data parallel training.

> **NOTE**: Ray SGD has been renamed to Ray Train.

## About Ray Train

The main features of Ray Train are:
 * **Ease of use:** You can scale PyTorch’s native `DistributedDataParallel` and TensorFlow’s `tf.distribute.MirroredStrategy` without having to monitor individual nodes yourself.
 * **Composability:** Ray Train is built on top of the [Ray Actor](https://docs.ray.io/en/latest/actors.html) API, enabling seamless integration with existing Ray applications such as RLlib, Tune, and Serve.
 * **Scale up and down:** You can start on a single CPU, then scale up to multi-node, multi-CPU, or multi-GPU clusters when needed. All it takes is changing two lines of code.

This [Ray blog post](https://medium.com/distributed-computing-with-ray/faster-and-cheaper-pytorch-with-raysgd-a5a44d4fd220) explains the motivation for Ray Train (formerly Ray SGD): the many setup steps you would otherwise perform by hand, and how Ray Train removes them.

## Example - Distributed Training for PyTorch 

This example is adapted and modified from the [Ray Train documentation](https://docs.ray.io/en/latest/train/examples/train_linear_example.html). 

First, do the necessary imports, as before.

In [1]:
import numpy as np
import tqdm
import torch
import torch.nn as nn
import ray.train as train
from ray.train import Trainer
from ray.train.torch import TorchConfig
from ray.train.callbacks import JsonLoggerCallback, TBXLoggerCallback

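from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster
import ray

# Start a Ray cluster on Spark with 2 worker nodes and 4 CPUs per node
setup_ray_cluster(
    num_worker_nodes=2,
    num_cpus_per_node=4,
    collect_log_to_path="/dbfs/path/to/ray_collected_logs"
)
# Connect this notebook to the Ray cluster
ray.init()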

### Step 1: Define the PyTorch dataset

Now define the classes and functions we'll need.
This tutorial fits a linear regression model to data generated from `y = ax + b`.

We implement our `LinearDataset` class as a PyTorch Dataset by subclassing `torch.utils.data.Dataset`.

In [2]:
class LinearDataset(torch.utils.data.Dataset):
    """y = a * x + b"""

    def __init__(self, a, b, size=1000):
        x = np.arange(0, 10, 10 / size, dtype=np.float32)
        self.X = torch.from_numpy(x)
        self.y = torch.from_numpy(a * x + b)

    def __getitem__(self, index):
        return self.X[index, None], self.y[index, None]

    def __len__(self):
        return len(self.X)
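
The `[index, None]` indexing in `__getitem__` is easy to miss: it returns each sample with shape `(1,)`, so a batch has shape `(batch_size, 1)`, which is the input shape a one-feature linear layer expects. Below is a minimal sketch that uses NumPy arrays as a stand-in for the tensors (an assumption for illustration; the indexing rule is the same in PyTorch), with `size=4` so the values are easy to check by hand.

```python
import numpy as np

# Same construction as LinearDataset(2, 5, size=4), using plain NumPy arrays
size = 4
x = np.arange(0, 10, 10 / size, dtype=np.float32)  # [0.0, 2.5, 5.0, 7.5]
y = 2 * x + 5                                      # [5.0, 10.0, 15.0, 20.0]

# Indexing with [index, None] keeps a trailing feature dimension of length 1
sample_x, sample_y = x[1, None], y[1, None]
print(sample_x.shape, sample_y.shape)  # (1,) (1,)

# Stacking samples, as a DataLoader does when batching, gives (batch_size, 1)
batch_x = np.stack([x[i, None] for i in range(size)])
print(batch_x.shape)  # (4, 1)
```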

### Step 2: Define per-epoch training and validation functions

Define the function that trains the model for one epoch.

In [3]:
def train_epoch(dataloader, model, loss_fn, optimizer):
    # validate_epoch() leaves the model in eval mode, so switch back to training mode
    model.train()
    for X, y in dataloader:
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
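
To make the three optimizer calls concrete, here is a hand-checkable sketch of a single SGD step on the mean squared error for `y = w * x + b`, written in plain NumPy rather than PyTorch. The helper name `mse_loss_and_grads`, the two data points, and the learning rate of `0.1` are assumptions chosen for illustration; they are not part of the tutorial's code.

```python
import numpy as np

def mse_loss_and_grads(w, b, x, y):
    """Return the MSE loss and its gradients with respect to w and b."""
    err = (w * x + b) - y
    loss = np.mean(err ** 2)
    grad_w = 2 * np.mean(err * x)  # d(loss)/dw
    grad_b = 2 * np.mean(err)      # d(loss)/db
    return loss, grad_w, grad_b

# Two points on the line y = 2x + 5
x = np.array([1.0, 2.0])
y = 2 * x + 5

w, b, lr = 0.0, 0.0, 0.1
loss_before, grad_w, grad_b = mse_loss_and_grads(w, b, x, y)

# The update that optimizer.step() performs for plain SGD
w, b = w - lr * grad_w, b - lr * grad_b
loss_after, _, _ = mse_loss_and_grads(w, b, x, y)

print(loss_before, grad_w, grad_b)  # 65.0 -25.0 -16.0
print(w, b)                         # 2.5 1.6
```

After this one step the loss drops from 65.0 to roughly 7.085, and `train_epoch` repeats the same update once per batch.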

Define the function that validates the model after each epoch.

In [4]:
def validate_epoch(dataloader, model, loss_fn):
    num_batches = len(dataloader)
    model.eval()
    loss = 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X)
            loss += loss_fn(pred, y).item()
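    # Average the per-batch losses
    loss /= num_batches
    # Snapshot the weights on CPU so they can be reported alongside the loss
    import copy
    model_copy = copy.deepcopy(model)
    result = {"model": model_copy.cpu().state_dict(), "loss": loss}
    return result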

### Step 3: Define our training function to pass to the Trainer

In [5]:
def train_func(config):
    # Fetch all the configs
    data_size = config.get("data_size", 1000)
    val_size = config.get("val_size", 400)
    batch_size = config.get("batch_size", 32)
    hidden_size = config.get("hidden_size", 1)
    lr = config.get("lr", 1e-2)
    epochs = config.get("epochs", [20, 40, 60])

    # Get the Training and validation dataset 
    train_dataset = LinearDataset(2, 5, size=data_size)
    val_dataset = LinearDataset(2, 5, size=val_size)
    
    # Wrap the datasets in regular PyTorch DataLoaders, then prepare them
    # for Ray Train distributed training with train.torch.prepare_data_loader,
    # a wrapper around PyTorch's distributed data loading
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size)
    validation_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size)

    train_loader = train.torch.prepare_data_loader(train_loader)
    validation_loader = train.torch.prepare_data_loader(validation_loader)

    # Create our simple PyTorch linear model and prepare it for PyTorch DDP
    model = nn.Linear(1, hidden_size)
    model = train.torch.prepare_model(model)

    # Use MSE for our loss function
    loss_fn = nn.MSELoss()

    # Use SGD for the optimizer
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    results = []

    for epoch in epochs:
        for e in range(epoch):
            train_epoch(train_loader, model, loss_fn, optimizer)
            result = validate_epoch(validation_loader, model, loss_fn)

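            # Ray Train, like Ray Tune, lets us report results back
            # to the main Trainer
            train.report(**result)
            results.append(result)
            # e % epoch == 0 only when e == 0, so this prints once per round,
            # after the first epoch of that round; the printed `epoch` is the
            # round length, not a running epoch count
            if e % epoch == 0:
                od = result.get('model')        # an ordered dictionary
                loss = result.get('loss')
                # keys carry a 'module.' prefix because the model is wrapped in DDP
                m_weight = od.get('module.weight').item()
                m_bias = od.get('module.bias').item()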
                
                print(f"epoch {epoch}, loss: {loss:.3f}, model.weight: {m_weight:.3f}, model.bias: {m_bias:.3f}")

    return results
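
The data loaders prepared in `train_func` give each worker its own slice of the dataset. The sketch below shows how a non-shuffling distributed sampler typically splits indices: pad the index list so it divides evenly, then give each rank every `world_size`-th index. `shard_indices` is a hypothetical helper written for illustration. It mirrors, to the best of my understanding, what PyTorch's `DistributedSampler` does with `shuffle=False` and `drop_last=False`, and it is not Ray Train's own code.

```python
import math

def shard_indices(dataset_len, world_size, rank):
    """Indices one worker would see under strided, non-shuffled sharding."""
    indices = list(range(dataset_len))
    per_worker = math.ceil(dataset_len / world_size)
    total = per_worker * world_size
    # Pad by wrapping around so every worker gets the same number of samples
    indices += indices[: total - dataset_len]
    return indices[rank:total:world_size]

# 10 samples over 4 workers: padded to 12, so 3 indices per worker
for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6, 0]
# 3 [3, 7, 1]

# The tutorial's default val_size of 400 splits evenly across 4 workers
val_shards = [shard_indices(400, 4, rank) for rank in range(4)]
print([len(s) for s in val_shards])  # [100, 100, 100, 100]
```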

### Step 4: Wrap the Trainer in a main driver function

Note that we use the PyTorch backend and pass a [TorchConfig](https://docs.ray.io/en/latest/train/api.html?highlight=TorchConfig#torchconfig) that selects [gloo](https://pytorch.org/docs/stable/distributed.html) as the communication backend.

In [6]:
def train_linear(num_workers=2, use_gpu=False, epochs=(20, 40)):
    trainer = Trainer(
        backend=TorchConfig(backend="gloo"),
        num_workers=num_workers,
        use_gpu=use_gpu)
    config = {"lr": 1e-2, "hidden_size": 1, "batch_size": 4, "epochs": epochs}
    trainer.start()
    results = trainer.run(
        train_func,
        config,
        callbacks=[JsonLoggerCallback(),
                   TBXLoggerCallback()])
    trainer.shutdown()
    return results
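
The `epochs` setting is a list of round lengths, not a single epoch count: the nested loop in `train_func` runs every round in turn, so `[20, 40, 60]` trains for 120 epochs in total. The pure-Python sketch below replays that loop structure without any training to show when the progress message fires. `print_schedule` is a hypothetical helper introduced only for this illustration.

```python
def print_schedule(epochs):
    """Replay the nested loop in train_func and record when it would print.

    Returns (round_length, epochs_completed_so_far) pairs.
    """
    fired = []
    completed = 0
    for epoch in epochs:
        for e in range(epoch):
            completed += 1          # one train + validate pass
            if e % epoch == 0:      # same condition as in train_func
                fired.append((epoch, completed))
    return fired, completed

fired, total_epochs = print_schedule([20, 40, 60])
print(fired)         # [(20, 1), (40, 21), (60, 61)]
print(total_epochs)  # 120
```

So a line labelled `epoch 40` is printed after 21 completed epochs, because the label is the length of the current round.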

In [7]:
results = train_linear(
            num_workers=4,
            epochs=[20, 40, 60])

2022-03-16 16:14:37,498	INFO services.py:1412 -- View the Ray dashboard at http://127.0.0.1:8267
2022-03-16 16:14:39,998	INFO trainer.py:199 -- Trainer logs will be logged in: /Users/jules/ray_results/train_2022-03-16_16-14-39
2022-03-16 16:14:41,731	INFO trainer.py:205 -- Run results will be logged in: /Users/jules/ray_results/train_2022-03-16_16-14-39/run_001
(BaseWorkerMixin pid=60974) 2022-03-16 16:14:41,698	INFO torch.py:66 -- Setting up process group for: env:// [rank=0, world_size=4]
(BaseWorkerMixin pid=60975) 2022-03-16 16:14:41,698	INFO torch.py:66 -- Setting up process group for: env:// [rank=3, world_size=4]
(BaseWorkerMixin pid=60980) 2022-03-16 16:14:41,699	INFO torch.py:66 -- Setting up process group for: env:// [rank=2, world_size=4]
(BaseWorkerMixin pid=60981) 2022-03-16 16:14:41,698	INFO torch.py:66 -- Setting up process group for: env:// [rank=1, world_size=4]
(BaseWorkerMixin pid=60974)

(BaseWorkerMixin pid=60974) epoch 20, loss: 3.587, model.weight: 2.326, model.bias: 1.743
(BaseWorkerMixin pid=60975) epoch 20, loss: 3.507, model.weight: 2.326, model.bias: 1.743
(BaseWorkerMixin pid=60981) epoch 20, loss: 3.560, model.weight: 2.326, model.bias: 1.743
(BaseWorkerMixin pid=60980) epoch 20, loss: 3.534, model.weight: 2.326, model.bias: 1.743
(BaseWorkerMixin pid=60974) epoch 40, loss: 0.004, model.weight: 2.011, model.bias: 4.886
(BaseWorkerMixin pid=60975) epoch 40, loss: 0.004, model.weight: 2.011, model.bias: 4.886
(BaseWorkerMixin pid=60981) epoch 40, loss: 0.004, model.weight: 2.011, model.bias: 4.886
(BaseWorkerMixin pid=60980) epoch 40, loss: 0.004, model.weight: 2.011, model.bias: 4.886
(BaseWorkerMixin pid=60974) epoch 60, loss: 0.000, model.weight: 2.000, model.bias: 5.000
(BaseWorkerMixin pid=60975) epoch 60, loss: 0.000, model.wei

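In the log above, all four workers report the same weight and bias, while their losses differ slightly. That is what data parallel training would lead us to expect: each worker presumably validates on its own shard of the data, but gradients are averaged across workers before every update, so the parameters stay in sync. The NumPy sketch below checks the averaging step by hand for two simulated workers. `mse_gradient` and the four data points are assumptions for illustration, and the real averaging is done by PyTorch DDP over the gloo backend, not by this code.

```python
import numpy as np

def mse_gradient(w, b, x, y):
    """Gradient of mean((w*x + b - y)**2) with respect to [w, b]."""
    err = (w * x + b) - y
    return np.array([2 * np.mean(err * x), 2 * np.mean(err)])

# Four points on y = 2x + 5, split across 2 simulated workers
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 5
w, b = 0.0, 0.0
world_size = 2

# Each worker computes a gradient on its own equally sized shard
shard_grads = [
    mse_gradient(w, b, x[rank::world_size], y[rank::world_size])
    for rank in range(world_size)
]
print(shard_grads[0], shard_grads[1])  # [-40. -18.] [-70. -22.]

# Averaging the shard gradients (the all-reduce step) matches the
# gradient computed on the full batch
averaged = np.mean(shard_grads, axis=0)
full = mse_gradient(w, b, x, y)
print(averaged, full)  # [-55. -20.] [-55. -20.]
```

The equality holds here because the shards have equal size. Every worker then applies the same averaged gradient, which keeps the model replicas identical.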
### Logging callbacks

The `JsonLoggerCallback` and `TBXLoggerCallback` passed to `trainer.run` store their results under `~/ray_results/train_*`.

In [None]:
!ls -l ~/ray_results/train_*

Launch TensorBoard to view the training results. Replace the directory below with the `train_*` directory created by your own run.

In [9]:
!tensorboard --logdir ~/ray_results/train_2021-12-12_16-20-46/


NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.8.0 at http://localhost:6006/ (Press CTRL+C to quit)
^C


<img src="images/ray_train_tensorboard.png" width="80%" height="80%">

### Exercises

Have a go at these in your spare time and observe the results:

 1. Change the learning rate and batch size in `config`
 2. Try changing the number of workers to half the number of cores on your localhost or laptop
 3. Change the `data_size` and `val_size`
 4. Modify the linear equation: `y = 2x + 5 --> y = 4x + 10`. This will require you to modify `LinearDataset(2, 5, size=data_size)`

In [None]:
shutdown_ray_cluster()