In [1]:
# Update imports when files change
%load_ext autoreload
%autoreload 2

This notebook demonstrates a relatively minimal example of using Ray Tune and PyTorch Lightning to tune the hyperparameters of a fully connected network trained on the Iris dataset.

The `hpo` function kicks off an example training process. It is called several times below to help test wandb logging within a Jupyter notebook.

The `logger` argument selects which of the two logging methods is used in this environment.

- `logger="lightning"` uses the `lightning.pytorch.loggers.WandbLogger` class to log results from within each HPO trial.
- `logger="ray"` uses the `ray.air.integrations.wandb.WandbLoggerCallback` class to log results as part of the HPO Tuner.
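
The two options above attach wandb at different levels: inside each trial, or once on the Tuner. The `hpo` implementation itself is not shown in this notebook, so the following is only a sketch of how such a `logger` argument could be dispatched. The names `LOGGER_BACKENDS`, `resolve_backend`, and `make_wandb_logger` are hypothetical, and the `project` value is whatever the caller passes in. The two dotted class paths are the ones named in the list above.

```python
import importlib

# Hypothetical mapping from the `logger` argument to
# (where the logger attaches, dotted path of the class that does the logging).
LOGGER_BACKENDS = {
    "lightning": ("trial", "lightning.pytorch.loggers.WandbLogger"),
    "ray": ("tuner", "ray.air.integrations.wandb.WandbLoggerCallback"),
}


def resolve_backend(logger):
    """Return (scope, dotted_class_path) for a `logger` value, or raise ValueError."""
    try:
        return LOGGER_BACKENDS[logger]
    except KeyError:
        valid = ", ".join(sorted(LOGGER_BACKENDS))
        raise ValueError(f"unknown logger {logger!r}; expected one of: {valid}") from None


def make_wandb_logger(logger, project):
    """Import the chosen class lazily and instantiate it for the given wandb project.

    Assumption: both classes accept a `project` keyword argument.
    The import only happens here, so the mapping above can be inspected
    without lightning or ray installed.
    """
    scope, dotted_path = resolve_backend(logger)
    module_name, _, class_name = dotted_path.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return scope, cls(project=project)
```

With a helper like this, a `"trial"` result would be passed to the Lightning `Trainer` inside each trial, while a `"tuner"` result would be registered once as a callback on the Tuner run.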


In [2]:
from hpo import hpo

## First run with `logger="lightning"`

In [None]:
hpo(num_samples=10, num_epochs=500, logger="lightning")

Current time: 2024-12-11 19:18:29
Running for: 00:00:52.18
Memory: 13.7/18.0 GiB

Trial name,status,loc,batch_size,hidden_dim,learning_rate,iter,total time (s),train_loss,train_accuracy,val_loss
train_tune_9546a_00000,TERMINATED,127.0.0.1:14930,16,128,0.0680161,500,44.2425,0.0470041,1.0,0.114881
train_tune_9546a_00001,TERMINATED,127.0.0.1:14927,64,128,0.00268882,500,27.4551,6.05167e-05,1.0,0.0143013
train_tune_9546a_00002,TERMINATED,127.0.0.1:14924,64,128,0.000519428,500,27.6522,0.00806381,1.0,0.00319099
train_tune_9546a_00003,TERMINATED,127.0.0.1:14929,32,64,0.024383,500,34.0813,6.03477e-06,1.0,8.73751e-05
train_tune_9546a_00004,TERMINATED,127.0.0.1:14933,16,16,0.00226016,500,44.3061,0.000136771,1.0,0.0766294
train_tune_9546a_00005,TERMINATED,127.0.0.1:14925,32,32,0.0913462,500,33.7082,4.96705e-09,1.0,0.0141029
train_tune_9546a_00006,TERMINATED,127.0.0.1:14928,64,16,0.00195429,500,27.1909,0.0354798,0.982143,0.00949431
train_tune_9546a_00007,TERMINATED,127.0.0.1:14926,32,16,0.0227947,500,33.9517,7.47587e-05,1.0,0.142688
train_tune_9546a_00008,TERMINATED,127.0.0.1:14932,16,16,0.0731784,500,44.3466,0.0684356,1.0,0.0295348
train_tune_9546a_00009,TERMINATED,127.0.0.1:14931,16,64,0.0181125,500,44.1321,0.0,1.0,0.0250322


(train_tune pid=14927) GPU available: True (mps), used: True
(train_tune pid=14927) TPU available: False, using: 0 TPU cores
(train_tune pid=14927) HPU available: False, using: 0 HPUs
(train_tune pid=14927) wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
(train_tune pid=14927) wandb: Currently logged in as: robbie-leap (leap-labs). Use `wandb login --relogin` to force relogin
(train_tune pid=14924) wandb: Tracking run with wandb version 0.18.7
(train_tune pid=14924) wandb: Run data is saved locally in ./wandb/run-20241211_191743-ayb53whj
(train_tune pid=14924) wandb: Run `wandb offline` to turn off syncing.
(train_tune pid=14924) wandb: Syncing run stellar-jazz-196
(train_tune pid=14924) wandb: ⭐️ View project at https://wandb.ai/leap-labs/hanging-runs-test
(train_tune pid=14924) wandb: 🚀 View run at https://wandb.ai/leap-labs

(train_tune pid=14927) wandb: 🚀 View run peachy-spaceship-201 at: https://wandb.ai/leap-labs/hanging-runs-test/runs/6efrjjcz


(train_tune pid=14930) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-17-36/train_tune_9546a_00000_0_batch_size=16,hidden_dim=128,learning_rate=0.0680_2024-12-11_19-17-37/checkpoint_000279) [repeated 662x across cluster]
(train_tune pid=14924) `Trainer.fit` stopped: `max_epochs=500` reached. [repeated 2x across cluster]
(train_tune pid=14930) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-17-36/train_tune_9546a_00000_0_batch_size=16,hidden_dim=128,learning_rate=0.0680_2024-12-11_19-17-37/checkpoint_000350) [repeated 510x across cluster]


(train_tune pid=14926) wandb: 🚀 View run lilac-totem-201 at: https://wandb.ai/leap-labs/hanging-runs-test/runs/vtams9ya [repeated 2x across cluster]


(train_tune pid=14929) `Trainer.fit` stopped: `max_epochs=500` reached. [repeated 3x across cluster]
(train_tune pid=14930) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-17-36/train_tune_9546a_00000_0_batch_size=16,hidden_dim=128,learning_rate=0.0680_2024-12-11_19-17-37/checkpoint_000443) [repeated 373x across cluster]
(train_tune pid=14931) `Trainer.fit` stopped: `max_epochs=500` reached.
2024-12-11 19:18:29,803	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-17-36' in 0.0415s.
2024-12-11 19:18:29,806	INFO tune.py:1041 -- Total run time: 52.20 seconds (52.14 seconds for the tuning loop).


Best hyperparameters found were:  {'hidden_dim': 64, 'learning_rate': 0.024382960114596945, 'batch_size': 32}
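
In the trial table above, the reported best configuration belongs to the trial with the lowest `val_loss` (`train_tune_9546a_00003`). That suggests, though the notebook does not state it, that the best result is selected by minimising `val_loss`. The sketch below reproduces that selection in plain Python on three rows copied from the table, with the columns trimmed. The helper name `best_trial` is hypothetical and is not part of `hpo` or Ray Tune.

```python
import csv
import io

# Three rows copied from the first run's trial table (only the relevant columns).
TRIALS_CSV = """\
Trial name,batch_size,hidden_dim,learning_rate,val_loss
train_tune_9546a_00001,64,128,0.00268882,0.0143013
train_tune_9546a_00002,64,128,0.000519428,0.00319099
train_tune_9546a_00003,32,64,0.024383,8.73751e-05
"""


def best_trial(csv_text, metric="val_loss"):
    """Return the row (as a dict of strings) with the smallest value of `metric`."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return min(rows, key=lambda row: float(row[metric]))


best = best_trial(TRIALS_CSV)
print(best["Trial name"], best["hidden_dim"], best["batch_size"], best["learning_rate"])
```

The selected row has `hidden_dim=64` and `batch_size=32`, which agrees with the best hyperparameters printed for this run.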


## Second run with `logger="lightning"`

In [None]:
hpo(num_samples=10, num_epochs=500, logger="lightning")

Current time: 2024-12-11 19:19:19
Running for: 00:00:49.84
Memory: 13.1/18.0 GiB

Trial name,status,loc,batch_size,hidden_dim,learning_rate,iter,total time (s),train_loss,train_accuracy,val_loss
train_tune_b477e_00000,TERMINATED,127.0.0.1:15643,16,32,0.000853058,500,42.0809,0.00274658,1,0.0304653
train_tune_b477e_00001,TERMINATED,127.0.0.1:15647,32,128,0.0388364,500,33.0497,2.90565e-06,1,1.06325e-05
train_tune_b477e_00002,TERMINATED,127.0.0.1:15646,64,32,0.0119395,500,26.3473,0.000199908,1,0.151583
train_tune_b477e_00003,TERMINATED,127.0.0.1:15640,16,16,0.00185389,500,41.9073,0.00798562,1,0.0153228
train_tune_b477e_00004,TERMINATED,127.0.0.1:15642,64,128,0.0620237,500,26.3267,1.23071e-05,1,1.15235e-06
train_tune_b477e_00005,TERMINATED,127.0.0.1:15644,64,64,0.037667,500,26.0372,1.2986e-05,1,0.248536
train_tune_b477e_00006,TERMINATED,127.0.0.1:15645,32,64,0.0777279,500,32.5205,3.12923e-07,1,4.06976e-05
train_tune_b477e_00007,TERMINATED,127.0.0.1:15637,32,128,0.00310119,500,32.4846,4.00185e-05,1,0.0367832
train_tune_b477e_00008,TERMINATED,127.0.0.1:15639,16,16,0.000449586,500,42.1623,0.0161911,1,0.0170677
train_tune_b477e_00009,TERMINATED,127.0.0.1:15638,32,16,0.0280822,500,32.6407,5.04631e-06,1,0.745489


(train_tune pid=15642) GPU available: True (mps), used: True
(train_tune pid=15642) TPU available: False, using: 0 TPU cores
(train_tune pid=15642) HPU available: False, using: 0 HPUs
(train_tune pid=15637) wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
(train_tune pid=15642) wandb: Currently logged in as: robbie-leap (leap-labs). Use `wandb login --relogin` to force relogin
(train_tune pid=15644) wandb: Tracking run with wandb version 0.18.7
(train_tune pid=15644) wandb: Run data is saved locally in ./wandb/run-20241211_191834-lnoh1djj
(train_tune pid=15644) wandb: Run `wandb offline` to turn off syncing.
(train_tune pid=15644) wandb: Syncing run gallant-sun-205
(train_tune pid=15644) wandb: ⭐️ View project at https://wandb.ai/leap-labs/hanging-runs-test
(train_tune pid=15644) wandb: 🚀 View run at https://wandb.ai/leap-labs/

(train_tune pid=15642) wandb: 🚀 View run silvery-sun-211 at: https://wandb.ai/leap-labs/hanging-runs-test/runs/n4mfrbwc


(train_tune pid=15637) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-18-29/train_tune_b477e_00007_7_batch_size=32,hidden_dim=128,learning_rate=0.0031_2024-12-11_19-18-29/checkpoint_000428) [repeated 653x across cluster]
(train_tune pid=15642) `Trainer.fit` stopped: `max_epochs=500` reached. [repeated 2x across cluster]


(train_tune pid=15637) wandb: 🚀 View run dark-darkness-207 at: https://wandb.ai/leap-labs/hanging-runs-test/runs/g2kv9old


(train_tune pid=15640) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-18-29/train_tune_b477e_00003_3_batch_size=16,hidden_dim=16,learning_rate=0.0019_2024-12-11_19-18-29/checkpoint_000357) [repeated 543x across cluster]
(train_tune pid=15647) `Trainer.fit` stopped: `max_epochs=500` reached. [repeated 4x across cluster]
(train_tune pid=15640) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-18-29/train_tune_b477e_00003_3_batch_size=16,hidden_dim=16,learning_rate=0.0019_2024-12-11_19-18-29/checkpoint_000455) [repeated 294x across cluster]
(train_tune pid=15640) `Trainer.fit` stopped: `max_epochs=500` reached.
(train_tune pid=16606) GPU available: True (mps), used: True
(train_tune pid=16606) TPU available: False, using: 0 TPU cores
(train_tun

2024-12-11 19:19:19,789	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-18-29' in 0.0722s.
2024-12-11 19:19:19,793	INFO tune.py:1041 -- Total run time: 49.85 seconds (49.77 seconds for the tuning loop).


Best hyperparameters found were:  {'hidden_dim': 128, 'learning_rate': 0.06202366177593322, 'batch_size': 64}


## Run with `logger="ray"`

In [5]:
hpo(num_samples=10, num_epochs=500, logger="ray")

Current time: 2024-12-11 19:20:52
Running for: 00:01:00.42
Memory: 12.3/18.0 GiB

Trial name,status,loc,batch_size,hidden_dim,learning_rate,iter,total time (s),train_loss,train_accuracy,val_loss
train_tune_e5244_00000,TERMINATED,127.0.0.1:16604,32,128,0.0240247,500,34.18,9.4374e-08,1.0,0.0188939
train_tune_e5244_00001,TERMINATED,127.0.0.1:16605,16,16,0.000218536,500,43.79,0.0625065,1.0,0.0544614
train_tune_e5244_00002,TERMINATED,127.0.0.1:16597,32,32,0.00390696,500,33.0788,0.000197692,1.0,0.152746
train_tune_e5244_00003,TERMINATED,127.0.0.1:16600,64,32,0.00707671,500,25.8919,0.000588747,1.0,0.0413282
train_tune_e5244_00004,TERMINATED,127.0.0.1:16601,32,128,0.000107147,500,34.1203,0.0284889,1.0,0.019909
train_tune_e5244_00005,TERMINATED,127.0.0.1:16599,32,16,0.0592108,500,33.503,0.000104843,1.0,0.0950233
train_tune_e5244_00006,TERMINATED,127.0.0.1:16603,32,64,0.00364605,500,33.5074,1.40698e-05,1.0,3.85253e-05
train_tune_e5244_00007,TERMINATED,127.0.0.1:16602,16,16,0.000215408,500,44.5109,0.0248592,1.0,0.0623124
train_tune_e5244_00008,TERMINATED,127.0.0.1:16606,32,32,0.00854498,500,34.0636,1.81295e-06,1.0,0.182212
train_tune_e5244_00009,TERMINATED,127.0.0.1:16598,64,32,0.000192548,500,25.6031,0.120659,0.964286,0.110292


2024-12-11 19:19:51,613	INFO wandb.py:319 -- Already logged into W&B.
2024-12-11 19:20:52,032	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/robbiemccorkell/ray_results/train_tune_2024-12-11_19-19-51' in 0.0937s.
2024-12-11 19:20:53,685	INFO tune.py:1041 -- Total run time: 62.08 seconds (60.32 seconds for the tuning loop).


Best hyperparameters found were:  {'hidden_dim': 64, 'learning_rate': 0.003646051886383178, 'batch_size': 32}
