References 
- [Ray Train](https://docs.ray.io/en/latest/train/train.html#)
- [TensorBoard & PyTorch](https://pytorch.org/docs/stable/tensorboard.html)

First, set up your dataset and model.

In [1]:
import torch
import torch.nn as nn

num_samples = 20
input_size = 10
layer_size = 15
output_size = 5

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(input_size, layer_size)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(layer_size, output_size)

    def forward(self, x):
        # Linear -> ReLU -> Linear
        return self.layer2(self.relu(self.layer1(x)))

# In this example we use a randomly generated dataset.
input = torch.randn(num_samples, input_size)
labels = torch.randn(num_samples, output_size)
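
Before training, it can help to confirm that the layer sizes fit together. The following is a minimal sketch, not part of the original notebook: it rebuilds the same two-layer architecture with `nn.Sequential` (the names `check_model` and `batch` are illustrative) and checks the output shape and parameter count for the sizes used above.

```python
import torch
import torch.nn as nn

# Sizes copied from the setup cell above.
num_samples, input_size, layer_size, output_size = 20, 10, 15, 5

# Same Linear -> ReLU -> Linear stack as NeuralNetwork, written with
# nn.Sequential for brevity (illustrative stand-in, not the class itself).
check_model = nn.Sequential(
    nn.Linear(input_size, layer_size),
    nn.ReLU(),
    nn.Linear(layer_size, output_size),
)

batch = torch.randn(num_samples, input_size)
with torch.no_grad():
    prediction = check_model(batch)

# Weights plus biases: (10*15 + 15) + (15*5 + 5) = 245
num_params = sum(p.numel() for p in check_model.parameters())

print(tuple(prediction.shape))  # (20, 5)
print(num_params)               # 245
```

The output has one row per sample and one column per target value, which matches the shape of `labels` and is what `nn.MSELoss` expects.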


Now define your single-worker PyTorch training function.

In [2]:
import os
import torch.optim as optim

from torch.utils.tensorboard import SummaryWriter
# Writer will output to ./runs/ directory by default
writer = SummaryWriter()


def train_func():
    num_epochs = 30
    
    ckpt_dir = "ckpts"
    os.makedirs(ckpt_dir, exist_ok=True)

    model = NeuralNetwork()
    writer.add_graph(model, input)  # Log the model graph to TensorBoard (written under ./runs by default)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        output = model(input)  # Forward pass
        loss = loss_fn(output, labels)
        optimizer.zero_grad()  # Clear gradients from the previous step
        loss.backward()  # Backpropagate
        optimizer.step()  # Update the parameters
        writer.add_scalar('Loss/train', loss.item(), epoch)  # Log the training loss to TensorBoard
        
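        # Save a checkpoint of the model weights after every epoch
        ckpt_path = os.path.join(ckpt_dir, f"model-{epoch}.pt")
        torch.save(model.state_dict(), ckpt_path)

        print(f"epoch: {epoch}, loss: {loss.item()}")

    # Flush pending events to disk so TensorBoard can read them right away
    writer.flush()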

This training function can be executed with:

In [3]:
train_func()

epoch: 0, loss: 1.2137645483016968
epoch: 1, loss: 1.1585373878479004
epoch: 2, loss: 1.1137664318084717
epoch: 3, loss: 1.0762763023376465
epoch: 4, loss: 1.0443921089172363
epoch: 5, loss: 1.0168265104293823
epoch: 6, loss: 0.9926542639732361
epoch: 7, loss: 0.9713320136070251
epoch: 8, loss: 0.952486515045166
epoch: 9, loss: 0.9356498122215271
epoch: 10, loss: 0.9199298024177551
epoch: 11, loss: 0.9052792191505432
epoch: 12, loss: 0.8915045857429504
epoch: 13, loss: 0.878430187702179
epoch: 14, loss: 0.8659675121307373
epoch: 15, loss: 0.8541174530982971
epoch: 16, loss: 0.8426482677459717
epoch: 17, loss: 0.8315061330795288
epoch: 18, loss: 0.8206733465194702
epoch: 19, loss: 0.810304582118988
epoch: 20, loss: 0.8003292679786682
epoch: 21, loss: 0.7905774712562561
epoch: 22, loss: 0.7809650301933289
epoch: 23, loss: 0.7715209126472473
epoch: 24, loss: 0.7621709704399109
epoch: 25, loss: 0.7529958486557007
epoch: 26, loss: 0.7437600493431091
epoch: 27, loss: 0.7343636155128479
epoch
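
The training loop saves a `state_dict` checkpoint every epoch, so a natural follow-up is restoring one. The sketch below is illustrative rather than part of the original notebook: it repeats the `NeuralNetwork` definition with hard-coded sizes so that it runs on its own, saves to a temporary directory instead of `ckpts`, and uses hypothetical names (`trained`, `restored`, `probe`).

```python
import os
import tempfile

import torch
import torch.nn as nn


class NeuralNetwork(nn.Module):
    """Copy of the model above, with sizes hard-coded (10 -> 15 -> 5)."""

    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(10, 15)
        self.relu = nn.ReLU()
        self.layer2 = nn.Linear(15, 5)

    def forward(self, x):
        return self.layer2(self.relu(self.layer1(x)))


trained = NeuralNetwork()
probe = torch.randn(4, 10)  # arbitrary inputs used to compare the two models

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save in the same way as the training loop: weights only, as a state_dict.
    ckpt_path = os.path.join(tmp_dir, "model-0.pt")
    torch.save(trained.state_dict(), ckpt_path)

    # Restore: build a fresh model, then load the saved weights into it.
    restored = NeuralNetwork()
    restored.load_state_dict(torch.load(ckpt_path))
    restored.eval()

with torch.no_grad():
    same_output = torch.allclose(trained(probe), restored(probe))

print(same_output)  # True
```

Because only the weights are stored, the class definition must be available when loading, and the attribute names (`layer1`, `layer2`) must match the keys in the checkpoint.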

Open TensorBoard to inspect the loss curve and the network graph.

In [4]:
! tensorboard --logdir=runs

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

I1123 01:27:46.953557 140317629609728 plugin.py:346] Monitor runs begin
Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.7.0 at http://localhost:6007/ (Press CTRL+C to quit)
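
If the dashboard shows no data, it can be useful to check the event files directly. The following is a minimal sketch, assuming the `tensorboard` package is installed (it is already required by `torch.utils.tensorboard`). It writes a few made-up loss values to a temporary log directory and reads them back with TensorBoard's `EventAccumulator`; the names `demo_writer` and `accumulator` and the values are illustrative only.

```python
import tempfile

from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
from torch.utils.tensorboard import SummaryWriter

with tempfile.TemporaryDirectory() as log_dir:
    # Write three made-up loss values under the same tag as the training loop.
    demo_writer = SummaryWriter(log_dir=log_dir)
    for step, loss in enumerate([1.0, 0.5, 0.25]):
        demo_writer.add_scalar("Loss/train", loss, step)
    demo_writer.close()  # close() flushes pending events to disk

    # Read the event file back without starting the TensorBoard server.
    accumulator = EventAccumulator(log_dir)
    accumulator.Reload()
    scalar_tags = accumulator.Tags()["scalars"]
    events = accumulator.Scalars("Loss/train")

steps = [event.step for event in events]
values = [event.value for event in events]

print(scalar_tags)  # ['Loss/train']
print(steps)        # [0, 1, 2]
print(values)       # [1.0, 0.5, 0.25]
```

Pointing `EventAccumulator` at a run directory under `runs` instead of the temporary directory would list the scalars logged by the training function above.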
