# Ray Train User Testing

In this notebook, you will learn how to use Ray Train for distributed model training. By the end, you will have scaled an existing training workload from a single node to multiple GPU nodes.

- Task 0: Inspect your data
- Task 1: Run training with 1 CPU worker
- Task 2: Run training with 1 GPU worker
- Task 3: Run distributed training with 4 GPU workers

You can refer to the Ray Documentation for user guides and APIs: https://docs.ray.io/en/master/index.html


## Task 0: Inspect your data

In this section, you will calculate some basic statistics of your data. We have already provided the data loading code below. Please fill in the blanks in the table below the code block.


In [None]:
import os
import wandb
import mlflow
import torch
import torch.nn.functional as F
from datasets import load_dataset
from functools import partial
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm


# Data Preprocessing
def preprocess(batch, tokenizer):
    sentences = [case["text"] for case in batch]
    labels = torch.LongTensor([case["label"] for case in batch])

    encoded_sent = tokenizer(
        sentences,
        max_length=256,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )

    out = {}
    out["input_ids"] = encoded_sent["input_ids"]
    out["attention_mask"] = encoded_sent["attention_mask"]
    out["label"] = labels
    return out


BATCH_SIZE = 32

hf_ds = load_dataset("tweet_eval", "irony", keep_in_memory=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
collate_fn = partial(preprocess, tokenizer=tokenizer)

train_ds = hf_ds["train"]
val_ds = hf_ds["test"]
train_dataloader = DataLoader(train_ds, batch_size=BATCH_SIZE, collate_fn=collate_fn)
val_dataloader = DataLoader(val_ds, batch_size=BATCH_SIZE, collate_fn=collate_fn)

In [None]:
# Your code here



Please fill in the blanks below:

| Metric              | Value                |
|---------------------|----------------------|
| Total train samples | ___x___ (samples)    |
| Total train batches | ___x___ (batches)    |
| Total eval samples  | ___x___ (samples)    |
| Total eval batches  | ___x___ (batches)    |
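
The batch counts follow from the sample counts and `BATCH_SIZE`. As a minimal sketch, assuming the `DataLoader` default of `drop_last=False` (so a final partial batch still counts), the helper below shows the arithmetic. The name `num_batches` and the sample count of 100 are hypothetical and chosen only for illustration; `len(train_ds)` and `len(train_dataloader)` give the real values.

```python
import math


def num_batches(num_samples, batch_size):
    """Number of batches a DataLoader yields when drop_last=False."""
    # A trailing partial batch still counts as one batch, hence the ceiling.
    return math.ceil(num_samples / batch_size)


# Hypothetical sample count, not the real dataset size.
print(num_batches(100, 32))  # 4
```

You can cross-check your table entries by comparing this value against `len()` of the corresponding dataloader.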

## Task 1: Train with one CPU worker

In this section, we'll first convert your code to Ray Train, and test the correctness of your code with 1 CPU worker.

Success Criteria for this section:
- Specify correct configurations for Ray Train
- Successfully run one training epoch and one evaluation epoch
- Successfully save a checkpoint and report it to Ray Train
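
One possible shape for the converted code is sketched below. It is a hedged outline, not the intended solution: the names `train_func` and `build_trainer` are hypothetical, the `torch.nn.Linear` model is only a placeholder for the notebook's model, and the training and evaluation steps are left out. It assumes a recent Ray 2.x release where `ray.train.report`, `ray.train.Checkpoint`, `ray.train.ScalingConfig`, and `ray.train.torch.TorchTrainer` are available.

```python
import os
import tempfile


def train_func(config):
    """Per-worker training loop; Ray Train runs one copy on each worker."""
    # Imports live inside the function so they resolve on the worker process.
    import torch
    import ray.train
    import ray.train.torch
    from ray.train import Checkpoint

    # Placeholder model for illustration only; swap in the notebook's model.
    model = torch.nn.Linear(4, 2)
    # Moves the model to the worker's device; with more than one worker it
    # also wraps the model in DistributedDataParallel.
    model = ray.train.torch.prepare_model(model)

    for epoch in range(config["num_epochs"]):
        # ... one training epoch and one evaluation epoch go here ...

        # Save a checkpoint to a temporary directory and report it.
        with tempfile.TemporaryDirectory() as tmpdir:
            torch.save(model.state_dict(), os.path.join(tmpdir, "model.pt"))
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmpdir),
            )


def build_trainer(num_workers=1, use_gpu=False, num_epochs=1):
    """Build a TorchTrainer; the defaults describe one CPU worker."""
    from ray.train import ScalingConfig
    from ray.train.torch import TorchTrainer

    return TorchTrainer(
        train_func,
        train_loop_config={"num_epochs": num_epochs},
        scaling_config=ScalingConfig(num_workers=num_workers, use_gpu=use_gpu),
    )
```

Under these assumptions, `result = build_trainer().fit()` would launch the run, and `result.checkpoint` would point at the last reported checkpoint. The same two `ScalingConfig` arguments are the knobs that change when moving to GPU workers or to more workers.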


In [None]:
# Estimated Difficulty Level (1-7) [1=very easy 7=very difficult]:

In [None]:
# Your code here






Please fill in the blanks below:

> Hint: You can find relevant information from the progress bar.

| Metric                   | Value                 |
|--------------------------|-----------------------|
| Num iterations per epoch | ___x___ (iters/epoch) |
| Training speed           | ___x___ (s/iter)      |

In [None]:
# Actual Difficulty Level (1-7) [1=very easy 7=very difficult]:

## Task 2: Train with one GPU worker

Next, let's modify your code to train on 1 GPU.


In [None]:
# Estimated Difficulty Level (1-7) [1=very easy 7=very difficult]:

In [None]:
# Your code here







Please fill in the blanks below:

| Metric                   | Value                 |
|--------------------------|-----------------------|
| Num iterations per epoch | ___x___ (iters/epoch) |
| Training speed           | ___x___ (s/iter)      |

In [None]:
# Actual Difficulty Level (1-7) [1=very easy 7=very difficult]:

## Task 3: Distributed training with multiple nodes and multiple GPUs

Next, let's scale up your training to 4 GPUs. You can use the Ray Dashboard to check GPU utilization, training status, and other system metrics.
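
When the data is sharded across workers, the number of iterations each worker runs per epoch shrinks. The sketch below shows the arithmetic under two assumptions: the dataloader is wrapped with `ray.train.torch.prepare_data_loader` (which adds a `DistributedSampler` that gives each worker an equal, padded share of the samples), and the `DataLoader` batch size is treated as a per-worker batch size. The function name and the sample count of 1000 are hypothetical.

```python
import math


def iters_per_epoch_per_worker(num_samples, per_worker_batch_size, num_workers):
    """Iterations each worker runs per epoch with evenly sharded data."""
    # Each worker receives an equal share, rounded up (the sampler pads).
    shard_size = math.ceil(num_samples / num_workers)
    # A trailing partial batch still counts as one iteration.
    return math.ceil(shard_size / per_worker_batch_size)


# Hypothetical sample count, not the real dataset size.
print(iters_per_epoch_per_worker(1000, 32, 1))  # 32
print(iters_per_epoch_per_worker(1000, 32, 4))  # 8
```

Comparing the value you observe in the progress bar against this estimate is a quick way to confirm that each worker is processing only its own shard.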

In [None]:
# Estimated Difficulty Level (1-7) [1=very easy 7=very difficult]:

In [None]:
# Your code here







Please fill in the blanks below:

| Metric                   | Value                 |
|--------------------------|-----------------------|
| Num iterations per epoch | ___x___ (iters/epoch) |
| Training speed           | ___x___ (iters/s)     |
| Checkpoint path          | ___x___               |

In [None]:
# Actual Difficulty Level (1-7) [1=very easy 7=very difficult]: