# 4. Hyperparameter Tuning for GNNs using Ray Tune
This notebook tunes the hyperparameters of a graph neural network (GNN) with Ray Tune. It defines a GCN model, a training function, and a hyperparameter search space, then uses Ray Tune to search for the configuration that maximizes a chosen metric. Results are written to the storage path defined in the notebook and can be visualized with TensorBoard.
### 4.1 Setup instructions
Before starting the tuning process, make sure the dataset exists at the path assigned to `dataset_path` in the next cell (`datasets/data.pt`).

In [1]:
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

import os
import ray
from ray import tune
from ray.tune.tune_config import TuneConfig
from ray.air.config import RunConfig
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler

from ray.air import session
from ray.air.checkpoint import Checkpoint

Start by defining the dataset location and the storage path for Ray Tune results. The experiment name gets a numeric suffix that starts at 0 and is incremented until no directory with that name exists under the storage path (e.g. `tune_analyzing_results_1`). Keeping the experiment name unique makes it easy to find the results of a given run in the storage path and to visualize them in TensorBoard. The cell prints the experiment name and the storage path.

In [2]:
# Define the storage path for Ray Tune results
storage_path = "./ray_results"
exp_name = "tune_analyzing_results"
dataset_path = "datasets/data.pt"

# if experiment path exists, increment number on experiment name (e.g. tune_analyzing_results_1)
i = 0
new_exp_name = f"{exp_name}_{i}"
while os.path.exists(f"{storage_path}/{new_exp_name}"):
    i += 1
    new_exp_name = f"{exp_name}_{i}"

exp_name = new_exp_name
experiment_path = f"{storage_path}/{exp_name}"

print("CUDA:", torch.cuda.is_available())
print("Experiment name:", exp_name)
print("Storage path:", storage_path)

CUDA: True
Experiment name: tune_analyzing_results_2
Storage path: ./ray_results
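
The naming loop above can also be wrapped in a small helper. This is a minimal sketch, not part of the notebook: the function name `next_experiment_name` is hypothetical, and it assumes the same convention of appending `_0`, `_1`, ... until an unused directory name is found.

```python
import os
import tempfile


def next_experiment_name(storage_path, base_name):
    """Return base_name with the first numeric suffix not yet used under storage_path."""
    i = 0
    while os.path.exists(os.path.join(storage_path, f"{base_name}_{i}")):
        i += 1
    return f"{base_name}_{i}"


# Demonstrate on a throwaway directory that already holds runs 0 and 1
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "tune_analyzing_results_0"))
    os.makedirs(os.path.join(tmp, "tune_analyzing_results_1"))
    demo_name = next_experiment_name(tmp, "tune_analyzing_results")
    print(demo_name)
```

Using `os.path.join` instead of an f-string with `/` keeps the check portable across operating systems.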


### 4.2 Define the GCN model
See section 5.2 for more details on the GCN model. Ideally the class would be imported from that notebook, but importing from another notebook fails when used with Ray Tune, so the class is redefined here.
To reproduce the bug, use the following code instead of the GCN class defined below:
```python
import import_ipynb
from _5_Surface_Feature_GNN import GCN
```

In [3]:
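class GCN(torch.nn.Module):
    """Stack of GCNConv layers: one input layer, `network_size - 1` hidden layers, one output layer."""

    def __init__(self, num_features, num_classes, num_neurons, network_size):
        super().__init__()

        self.conv_layers = torch.nn.ModuleList()
        # Input layer: node features -> hidden width
        self.conv_layers.append(GCNConv(num_features, num_neurons))
        # Hidden layers keep the width at num_neurons
        for _ in range(network_size - 1):
            self.conv_layers.append(GCNConv(num_neurons, num_neurons))
        # Output layer: hidden width -> class scores
        self.conv_layers.append(GCNConv(num_neurons, num_classes))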

    def forward(self, data):
        x, edge_index, edge_weight = data.x, data.edge_index, data.edge_attr

        for i, conv_layer in enumerate(self.conv_layers):
            if i != len(self.conv_layers) - 1:
                x = F.relu(conv_layer(x, edge_index, edge_weight))
            else:
                x = conv_layer(x, edge_index, edge_weight)

        return F.log_softmax(x, dim=1)
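
To make the effect of `network_size` concrete, the sketch below lists the (input, output) width of each `GCNConv` layer that the constructor above creates. The helper `gcn_layer_dims` is hypothetical and only mirrors the constructor's arithmetic; it does not build a model. The value of 4 output classes is an arbitrary assumption for illustration.

```python
def gcn_layer_dims(num_features, num_classes, num_neurons, network_size):
    """Mirror the GCN constructor: return (in, out) widths of each conv layer."""
    dims = [(num_features, num_neurons)]                       # input layer
    dims += [(num_neurons, num_neurons)] * (network_size - 1)  # hidden layers
    dims.append((num_neurons, num_classes))                    # output layer
    return dims


# Example: 3 node features, 64 neurons, network_size=2,
# and a hypothetical 4 output classes
layer_dims = gcn_layer_dims(num_features=3, num_classes=4, num_neurons=64, network_size=2)
print(layer_dims)       # [(3, 64), (64, 64), (64, 4)]
print(len(layer_dims))  # 3, i.e. network_size + 1 conv layers
```

Note that a `network_size` of `n` therefore yields `n + 1` convolution layers in total.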

### 4.3 Define the training function
The training function trains the GCN model on the provided dataset with one set of hyperparameters. It takes the hyperparameter configuration and the dataset as input. After every epoch it reports the test accuracy, the training loss, and the training accuracy to Ray Tune and saves a checkpoint of the model and optimizer state. The function is defined as follows:

In [4]:
def train_gcn(config, data):
    # `config` holds the hyperparameter values Ray Tune sampled for this trial
    num_neurons = int(config["num_neurons"])
    network_size = int(config["network_size"])
    lr = float(config["lr"])
    weight_decay = float(config["weight_decay"])
    # epoch_num = int(config["epoch_num"])

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = GCN(data.num_node_features, int(data.y.max() + 2), num_neurons, network_size).to(device)
    data = data.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)

    epochs = 500
    start = 0

    # To restore a checkpoint, use `session.get_checkpoint()`.
    loaded_checkpoint = session.get_checkpoint()
    if loaded_checkpoint:
        with loaded_checkpoint.as_directory() as loaded_checkpoint_dir:
            model_state, optimizer_state = torch.load(os.path.join(loaded_checkpoint_dir, "checkpoint.pt"))
        model.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)


    checkpoint_freq = 50
    for epoch in range(start, epochs):
        # Training step on the training nodes
        model.train()
        optimizer.zero_grad()
        out = model(data)
        loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()

        # Evaluate on the full graph without tracking gradients
        model.eval()
        with torch.no_grad():
            pred = model(data).argmax(dim=1)

        train_correct = (pred[data.train_mask] == data.y[data.train_mask]).sum()
        train_acc = int(train_correct) / int(data.train_mask.sum())

        test_correct = (pred[data.test_mask] == data.y[data.test_mask]).sum()
        test_acc = int(test_correct) / int(data.test_mask.sum())

        metrics = {"accuracy": test_acc, "loss": loss.item(), "train_accuracy": train_acc}
        # if epoch % checkpoint_freq == 0:
        #     checkpoint = Checkpoint.from_dict({"epoch": epoch})
        #     session.report(metrics, checkpoint=checkpoint)
        # else:
        #     session.report(metrics)

        # Save model and optimizer state, then report metrics with the checkpoint
        os.makedirs("./models/tune", exist_ok=True)
        torch.save(
            (model.state_dict(), optimizer.state_dict()), "./models/tune/checkpoint.pt")
        checkpoint = Checkpoint.from_directory("./models/tune")
        session.report(metrics, checkpoint=checkpoint)

In [5]:
# Load dataset
data = torch.load(dataset_path)

# Define the search space for hyperparameters
config = {
    "num_neurons": tune.choice([32, 64, 128]),
    "network_size": tune.choice([2, 3, 4]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "weight_decay": tune.loguniform(1e-6, 1e-3),
    # "epoch_num": tune.choice([100, 250, 500])
}

# Define the trainable function for Ray Tune.
# Tune calls it with the trial's sampled config. The search space is nested
# under "params" in `param_space`, so pass that sub-dict to train_gcn.
trainable_with_resources = tune.with_resources(
    lambda trial_config: train_gcn(trial_config["params"], data), {"cpu": 8, "gpu": 1})

print(data)
print("Number of Nodes:", data.num_nodes)
print("Percent Train (Split):", str(round(float(sum(data.train_mask) / len(data.train_mask)) * 10000) / 100) + "%")

DataBatch(edge_index=[2, 55086], rot=[3975, 3], size=[3975], x=[3975, 3], edge_attr=[55086, 1], y=[3975], train_mask=[3975], test_mask=[3975], batch=[3975], ptr=[17])
Number of Nodes: 3975
Percent Train (Split): 79.87%
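
`tune.loguniform(low, high)` draws values whose logarithm is uniformly distributed, so each order of magnitude between the bounds is equally likely. The sketch below imitates that behaviour with the standard library only; `sample_loguniform` is a hypothetical helper, not Ray Tune's implementation.

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw a value whose natural log is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
lr_samples = [sample_loguniform(1e-4, 1e-1, rng) for _ in range(1000)]

# Roughly a third of the draws should land in each decade of [1e-4, 1e-1]
decade_counts = [
    sum(1 for v in lr_samples if 10.0 ** e <= v < 10.0 ** (e + 1))
    for e in (-4, -3, -2)
]
print(decade_counts)
```

This is why a log-uniform range suits the learning rate and weight decay: small values such as 1e-4 are explored as often as large ones such as 1e-2.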


### 4.4 Perform hyperparameter tuning
The hyperparameter tuning process using Ray Tune consists of the following steps:

1. Define the dataset location and storage path for Ray Tune results.
2. Check if the storage path exists and increment the experiment name if necessary.
3. Define the search space for hyperparameters.
4. Load the dataset.
5. Create a trainable function with resources to train the GCN model with the provided hyperparameters and dataset.
6. Initialize Ray Tune and specify the metric and mode for the ASHAScheduler.
7. Define the reporter and scheduler for Ray Tune.
8. Create a tuner object and specify the trainable function, search space, tune configuration, and run configuration.
9. Perform the hyperparameter tuning using the tuner object.
10. Retrieve the best hyperparameters and corresponding accuracy.

In [6]:
if __name__ == "__main__":
    ray.shutdown()
    ray.init()

    # Define the metric and mode for the ASHAScheduler
    metric = "accuracy"
    mode = "max"

    # Define the progress reporter and the scheduler for Ray Tune
    reporter = CLIReporter(metric_columns=["accuracy"])
    scheduler = ASHAScheduler(metric=metric, mode=mode, max_t=500, grace_period=20)

    # store results using tune.Tuner
    tuner = tune.Tuner(
        trainable_with_resources,
        param_space= {
            "params": config
        },
        tune_config=TuneConfig(
            num_samples=50,
            # time_budget_s=600.0,
            scheduler=scheduler,
        ),
        run_config=RunConfig(
            name=exp_name,
            storage_path=storage_path,  # Specify a directory to store results
            progress_reporter=reporter,
        ),
    )
    result = tuner.fit()
    best_trial = result.get_best_result("accuracy", mode="max", scope="last")
    best_hyperparameters = best_trial.config
    best_accuracy = best_trial.metrics["accuracy"]  # metrics is a dict; pick the tuned metric

    print("Best hyperparameters found:")
    print(best_hyperparameters)
    print("Best accuracy found:", best_accuracy)

    ray.shutdown()

2023-07-10 18:26:55,633	INFO worker.py:1636 -- Started a local Ray instance.


== Status ==
Current time: 2023-07-10 18:26:59 (running for 00:00:00.64)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 320.000: None | Iter 80.000: None | Iter 20.000: None
Logical resource usage: 8.0/8 CPUs, 1.0/1 GPUs
Result logdir: C:\Users\Shamit\ray_results\tune_analyzing_results_2
Number of trials: 16/50 (16 PENDING)
+--------------------+----------+-------+-------------+-----------------------+----------------------+-----------------------+
| Trial name         | status   | loc   |   params/lr |   params/network_size |   params/num_neurons |   params/weight_decay |
|--------------------+----------+-------+-------------+-----------------------+----------------------+-----------------------|
| lambda_4404e_00000 | PENDING  |       | 0.0194737   |                     4 |                  128 |           3.51474e-06 |
| lambda_4404e_00001 | PENDING  |       | 0.000259929 |                     4 |                   64 |           9.48672e-06 |
| lambda_4404e_00002 | PENDING  |   

Trial name,accuracy,date,done,hostname,iterations_since_restore,loss,node_ip,pid,should_checkpoint,time_since_restore,time_this_iter_s,time_total_s,timestamp,train_accuracy,training_iteration,trial_id
lambda_4404e_00000,0.44875,2023-07-10_18-27-07,False,BasementPC,1,0.816747,127.0.0.1,12988,True,2.31425,2.31425,2.31425,1689031627,0.434646,1,4404e_00000


== Status ==
Current time: 2023-07-10 18:27:09 (running for 00:00:10.66)
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 320.000: None | Iter 80.000: None | Iter 20.000: 0.55125
Logical resource usage: 8.0/8 CPUs, 1.0/1 GPUs
Result logdir: C:\Users\Shamit\ray_results\tune_analyzing_results_2
Number of trials: 17/50 (16 PENDING, 1 RUNNING)
+--------------------+----------+-----------------+-------------+-----------------------+----------------------+-----------------------+------------+
| Trial name         | status   | loc             |   params/lr |   params/network_size |   params/num_neurons |   params/weight_decay |   accuracy |
|--------------------+----------+-----------------+-------------+-----------------------+----------------------+-----------------------+------------|
| lambda_4404e_00000 | RUNNING  | 127.0.0.1:12988 | 0.0194737   |                     4 |                  128 |           3.51474e-06 |      0.605 |
| lambda_4404e_00001 | PENDING  |                 | 0.000

2023-07-10 18:33:52,584	INFO tune.py:1111 -- Total run time: 413.82 seconds (412.72 seconds for the tuning loop).


== Status ==
Current time: 2023-07-10 18:33:52 (running for 00:06:53.75)
Using AsyncHyperBand: num_stopped=50
Bracket: Iter 320.000: 0.81125 | Iter 80.000: 0.6171875 | Iter 20.000: 0.55125
Logical resource usage: 8.0/8 CPUs, 1.0/1 GPUs
Result logdir: C:\Users\Shamit\ray_results\tune_analyzing_results_2
Number of trials: 50/50 (50 TERMINATED)
+--------------------+------------+-----------------+-------------+-----------------------+----------------------+-----------------------+------------+
| Trial name         | status     | loc             |   params/lr |   params/network_size |   params/num_neurons |   params/weight_decay |   accuracy |
|--------------------+------------+-----------------+-------------+-----------------------+----------------------+-----------------------+------------|
| lambda_4404e_00000 | TERMINATED | 127.0.0.1:12988 | 0.0194737   |                     4 |                  128 |           3.51474e-06 |    0.845   |
| lambda_4404e_00001 | TERMINATED | 127.0.0.1:12
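
The `Bracket` line in the status output lists the iterations (rungs) at which ASHA compares trials and stops the weaker ones. With `grace_period=20` and `max_t=500`, the milestones 20, 80, and 320 are consistent with multiplying by a reduction factor of 4 at each rung. The sketch below reproduces that arithmetic; it assumes a reduction factor of 4 (not set explicitly in this notebook), the helper `asha_rungs` is hypothetical, and the loop is a simplification rather than Ray Tune's scheduler code.

```python
def asha_rungs(grace_period, max_t, reduction_factor=4):
    """Return the iteration milestones at which trials are compared."""
    rungs = []
    milestone = grace_period
    while milestone <= max_t:
        rungs.append(milestone)
        milestone *= reduction_factor
    return rungs


rungs = asha_rungs(grace_period=20, max_t=500)
print(rungs)  # [20, 80, 320]
```

A trial that is still running at iteration 320 has therefore survived three comparisons against the other trials.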

In [7]:
# restored_tuner = tune.Tuner.restore(experiment_path, trainable=trainable_with_resources)
# restored_tuner.get_results()

tuner.get_results()

ResultGrid<[
  Result(
    metrics={'accuracy': 0.845, 'loss': 0.40148797631263733, 'train_accuracy': 0.8324409448818898, 'should_checkpoint': True, 'done': True, 'trial_id': '4404e_00000', 'experiment_tag': '0_lr=0.0195,network_size=4,num_neurons=128,weight_decay=0.0000'},
    path='e://\\OneDrive\\UWM\\William_Musinski__Surana\\Research_Code\\Research\\notebooks\\ray_results\\tune_analyzing_results_2\\lambda_4404e_00000_0_lr=0.0195,network_size=4,num_neurons=128,weight_decay=0.0000_2023-07-10_18-26-59',
    checkpoint=Checkpoint(uri=e://\OneDrive\UWM\William_Musinski__Surana\Research_Code\Research\notebooks\ray_results\tune_analyzing_results_2\lambda_4404e_00000_0_lr=0.0195,network_size=4,num_neurons=128,weight_decay=0.0000_2023-07-10_18-26-59\checkpoint_000499)
  ),
  Result(
    metrics={'accuracy': 0.50875, 'loss': 1.1151809692382812, 'train_accuracy': 0.5250393700787401, 'should_checkpoint': True, 'done': True, 'trial_id': '4404e_00001', 'experiment_tag': '1_lr=0.0003,network_siz
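
`get_best_result("accuracy", mode="max", scope="last")` picks the trial whose last reported accuracy is highest. The plain-Python sketch below mimics that selection on the two results visible in the truncated output above; the list `last_results` and the helper `pick_best` are hypothetical stand-ins for the `ResultGrid`, not part of Ray Tune.

```python
def pick_best(results, metric, mode="max"):
    """Return the entry with the best last-reported value of `metric`."""
    chooser = max if mode == "max" else min
    return chooser(results, key=lambda r: r["metrics"][metric])


# Last reported metrics of the two trials shown in the output above
last_results = [
    {"trial_id": "4404e_00000", "metrics": {"accuracy": 0.845, "loss": 0.40148797631263733}},
    {"trial_id": "4404e_00001", "metrics": {"accuracy": 0.50875, "loss": 1.1151809692382812}},
]

best = pick_best(last_results, metric="accuracy", mode="max")
print(best["trial_id"], best["metrics"]["accuracy"])
```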

### 4.5 Visualize the results
The results of the tuning run are stored in the storage path defined earlier and can be loaded and visualized with TensorBoard, as the cell below shows.

In [11]:
print(f"Loading results from {experiment_path}...")
%reload_ext tensorboard

%tensorboard --logdir $experiment_path

Loading results from ./ray_results/tune_analyzing_results_2...


Launching TensorBoard...

### 4.6 Analyze the results
In TensorBoard, you can watch the accuracy increase and the loss decrease over the training iterations. The test accuracy (labeled `accuracy` in TensorBoard) should track the training accuracy (labeled `train_accuracy`). If the training accuracy keeps rising while the test accuracy stalls or drops, the model may be overfitting.

To compare hyperparameters, open the HPARAMS tab. It lists the hyperparameter values of each trial next to the resulting accuracy, which makes it possible to pick out the best-performing configuration.
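
The overfitting check described above can also be done numerically by comparing the two accuracies. The sketch below is illustrative only: `accuracy_gap` and `may_be_overfitting` are hypothetical helpers, and the 0.05 threshold is an arbitrary assumption, not a value from this notebook.

```python
def accuracy_gap(train_accuracy, test_accuracy):
    """Return how far training accuracy exceeds test accuracy."""
    return train_accuracy - test_accuracy


def may_be_overfitting(train_accuracy, test_accuracy, threshold=0.05):
    """Flag a trial whose training accuracy exceeds test accuracy by more than threshold."""
    return accuracy_gap(train_accuracy, test_accuracy) > threshold


# Final metrics of trial 4404e_00000 from the ResultGrid output
gap = accuracy_gap(train_accuracy=0.8324409448818898, test_accuracy=0.845)
flag = may_be_overfitting(0.8324409448818898, 0.845)
print(round(gap, 4), flag)
```

For this trial the gap is negative (test accuracy is slightly higher than training accuracy), so the check does not flag it.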
