# Ray Train - A Library for Distributed Deep Learning

[Ray Train](https://docs.ray.io/en/latest/train/train.html) is a lightweight library for distributed deep learning. It provides thin wrappers around the native [PyTorch](https://pytorch.org), [TensorFlow](https://tensorflow.org), and [Horovod](https://horovod.ai/) modules for data-parallel training.

> **NOTE**: Ray SGD has been renamed to Ray Train.

### Quick Start: single machine, single worker with PyTorch

Let's work through a typical non-distributed PyTorch training example, where we use only a single machine and a single worker process.

In [1]:
import os

import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

### Step 1: Define constants, input and output variables

In [2]:
NUM_SAMPLES = 20             # number of samples in our training dataset
INPUT_SIZE = 20              # number of input features fed to the first layer
LAYER_SIZE = 15              # number of neurons in the hidden layer
OUTPUT_SIZE = 5              # number of outputs from the last layer

# In this example we use a randomly generated dataset.
input = torch.randn(NUM_SAMPLES, INPUT_SIZE)         # In normal ML parlance, X
labels = torch.randn(NUM_SAMPLES, OUTPUT_SIZE)       # In normal ML parlance, y
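
As a quick sanity check on the data defined above, the sketch below builds tensors of the same sizes and prints their shapes. The names `X` and `y` and the call to `torch.manual_seed` are assumptions added for illustration; the original notebook does not seed the random generator, so its values differ from run to run.

```python
import torch

# Same sizes as in the cell above, repeated so this snippet runs on its own.
NUM_SAMPLES, INPUT_SIZE, OUTPUT_SIZE = 20, 20, 5

# Assumption: a fixed seed, used here only to make the random data repeatable.
torch.manual_seed(0)

X = torch.randn(NUM_SAMPLES, INPUT_SIZE)    # one row per sample, one column per feature
y = torch.randn(NUM_SAMPLES, OUTPUT_SIZE)   # one row per sample, one column per target

print(X.shape)  # torch.Size([20, 20])
print(y.shape)  # torch.Size([20, 5])
```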

### Step 2: Define a simple PyTorch neural network

In [3]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.layer1 = nn.Linear(in_features=INPUT_SIZE, out_features=LAYER_SIZE)
        # Our activation function
        self.relu = nn.ReLU()           
        self.layer2 = nn.Linear(in_features=LAYER_SIZE, out_features=OUTPUT_SIZE)

    def forward(self, input):
        return self.layer2(self.relu(self.layer1(input)))
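
To see what this network does to a batch of inputs, the following sketch redefines the same two-layer architecture so that it runs on its own, pushes a small random batch through it, and counts the trainable parameters. The batch size of 4 and the variable names `batch`, `out`, and `num_params` are illustrative assumptions.

```python
import torch
import torch.nn as nn

INPUT_SIZE, LAYER_SIZE, OUTPUT_SIZE = 20, 15, 5

class NeuralNetwork(nn.Module):
    """Same architecture as above: Linear -> ReLU -> Linear."""
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(INPUT_SIZE, LAYER_SIZE)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(LAYER_SIZE, OUTPUT_SIZE)

    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))

model = NeuralNetwork()
batch = torch.randn(4, INPUT_SIZE)      # hypothetical batch of 4 samples

with torch.no_grad():                   # inference only, no gradients needed
    out = model(batch)

# layer1: 20*15 weights + 15 biases = 315; layer2: 15*5 weights + 5 biases = 80
num_params = sum(p.numel() for p in model.parameters())

print(out.shape)    # torch.Size([4, 5])
print(num_params)   # 395
```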

### Step 3: Define your training function
A simple function that iterates over the epochs and performs the standard PyTorch training steps:
 * Invoke the callable model with the input
 * Calculate the loss
 * Zero out the gradients
 * Do backward propagation
 * Take an optimizer step
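
The "zero out the gradients" step matters because PyTorch accumulates gradients across calls to `backward()`. The sketch below illustrates this on a tiny linear layer; the layer sizes, the seed, and the variable names are assumptions chosen for illustration and are not part of the original example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)                     # assumption: seed for repeatability
layer = nn.Linear(3, 1)                  # hypothetical tiny model
x = torch.randn(8, 3)
target = torch.randn(8, 1)
loss_fn = nn.MSELoss()

# First backward pass: .grad holds the gradient of the loss.
loss_fn(layer(x), target).backward()
grad_once = layer.weight.grad.clone()

# Second backward pass WITHOUT zeroing: the new gradient is added to .grad.
loss_fn(layer(x), target).backward()
grad_twice = layer.weight.grad.clone()
print(torch.allclose(grad_twice, 2 * grad_once))        # True

# After clearing the gradients, backward() yields the single-pass gradient again.
layer.zero_grad()
loss_fn(layer(x), target).backward()
grad_after_reset = layer.weight.grad.clone()
print(torch.allclose(grad_after_reset, grad_once))      # True
```

Without the reset, each optimizer step would use the sum of all gradients computed so far rather than the gradient of the current iteration.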

In [4]:
def train_func(configs):
    model = NeuralNetwork()
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

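    # NUM_EPOCHS is a list of epoch counts; we train for each count in turn.
    # The model is not re-initialized between runs, so training is cumulative.
    epochs = configs.get('NUM_EPOCHS', [20, 40, 60])
    for epoch in epochs:
        for e in tqdm(range(epoch)):
            output = model(input)
            loss = loss_fn(output, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # e % epoch == 0 holds only for e == 0, so this prints the loss
            # once per run, at its first iteration.
            if e % epoch == 0:
                print(f'epoch {epoch}, loss: {loss.item():.3f}')
    # Return anything you want; here we return the PID of the process running
    # this function (the Ray worker process, when run under Ray Train).
    return os.getpid()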

### Step 4: Train the model

In [5]:
result = train_func({'NUM_EPOCHS': [20, 40, 60]})
print(f'pid: {result}')

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 1870.54it/s]


epoch 20, loss: 0.923


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 2809.22it/s]


epoch 40, loss: 0.599


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:00<00:00, 3168.26it/s]

epoch 60, loss: 0.277
pid: 60743

### Exercises

Have a go at these in your spare time and observe the results:

 1. Change the NUM_EPOCHS list to **[200, 400, 600]**
 2. Do you see the loss approaching zero?
 3. Try changing sample sizes. Do you need more epochs to train and minimize loss?
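
One possible way to approach these exercises is sketched below. The helper `final_loss` is hypothetical and not part of the original notebook: it trains a fresh model for a given number of steps and returns the resulting loss, so that different step counts and sample sizes can be compared. It uses `nn.Sequential` as a stand-in for the `NeuralNetwork` class with the same layer sizes, and a fixed seed (also an assumption) so that runs are comparable.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def final_loss(num_steps, num_samples=20, seed=0):
    """Train a fresh model for num_steps full-batch SGD steps; return the loss."""
    torch.manual_seed(seed)              # fixes both the data and the initial weights
    X = torch.randn(num_samples, 20)
    y = torch.randn(num_samples, 5)
    # Stand-in for NeuralNetwork: Linear(20, 15) -> ReLU -> Linear(15, 5)
    model = nn.Sequential(nn.Linear(20, 15), nn.ReLU(), nn.Linear(15, 5))
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for _ in range(num_steps):
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Evaluate once more so that num_steps=0 returns the untrained loss.
    with torch.no_grad():
        return loss_fn(model(X), y).item()

for steps in (20, 200, 600):
    print(f"steps={steps}, final loss={final_loss(steps):.3f}")
```

Passing a larger `num_samples` lets you check whether more steps are needed to reach a similar loss.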

In [None]:
shutdown_ray_cluster()