# Scaling Multi-Objective Bayesian Optimisation using RayTune

This notebook will explore various Multi-Objective Bayesian Optimisation algorithms and evaluate their distributed schedulers for fast, efficient scaling. The task at hand is Neural Architecture Search (NAS), and the following auxiliary frameworks will be used:

1. **Ax**: Abstraction on top of the popular BoTorch library
2. **Optuna**: Uses tree-structured Parzen Estimator (TPE) for Bayesian optimisation

**RayTune** is the scalable optimisation framework that serves as the distributed backbone for these libraries and lets them run across many workers. The notebook compares their performance on NAS tasks of varying difficulty and, in particular, probes the scheduling algorithms that control execution and search speed.

In [1]:
import time
import ray
from ray import train, tune
from ray.tune.schedulers import AsyncHyperBandScheduler
from ray.tune.search.optuna import OptunaSearch

## Neural Architecture Search (NAS)

NAS is a high-level optimisation problem that aims to find the best neural network architecture for a specific task. Designing a network involves several competing considerations, such as the trade-off between model size and predictive performance, and the computational cost of each layer. Because these goals conflict, the search is naturally a *multi-objective* optimisation problem. Bayesian methods are particularly useful here because the search space is large and it is infeasible to train and evaluate every possible architecture.
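
To make the multi-objective idea concrete, the following is a minimal sketch of Pareto dominance for two objectives: accuracy (maximised) and parameter count (minimised). The helper names `dominates` and `pareto_front` and the candidate values are hypothetical and chosen for illustration only; they are not results from this notebook.

```python
from typing import List, Tuple

# Each candidate is (validation_accuracy, parameter_count).
# Accuracy is maximised, parameter count is minimised.
Candidate = Tuple[float, int]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if `a` is no worse than `b` on both objectives
    and strictly better on at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Hypothetical (accuracy, parameter count) pairs, for illustration only.
hypothetical = [(0.97, 135_000), (0.96, 65_000), (0.95, 101_000), (0.98, 400_000)]
front = pareto_front(hypothetical)
print(front)
```

A multi-objective search returns this whole front rather than a single winner, so the final choice between a smaller and a more accurate network is left to the user.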

### With Ax

In [2]:
from dataclasses import dataclass
import tempfile
from filelock import FileLock
import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from ray.train.lightning import (
    RayDDPStrategy,
    RayLightningEnvironment,
    RayTrainReportCallback,
    prepare_trainer,
)
from torchmetrics import Accuracy


@dataclass
class ModelConfig:
    learning_rate: float
    batch_size: int
    layer_1_size: int
    layer_2_size: int
    layer_3_size: int
    dropout: float


class MNISTClassifier(pl.LightningModule):
    def __init__(self, config):
        super().__init__()
        self.config = config

        self.accuracy = Accuracy(task="multiclass", num_classes=10, top_k=1)

        # model parameters
        self.layer1 = nn.Linear(28 * 28, config.layer_1_size)
        self.layer2 = nn.Linear(config.layer_1_size, config.layer_2_size)
        self.layer3 = nn.Linear(config.layer_2_size, config.layer_3_size)
        self.layer4 = nn.Linear(config.layer_3_size, 10)
        self.dropout = nn.Dropout(config.dropout)

        # training parameters
        self.learning_rate = config.learning_rate

        self.eval_loss = []
        self.eval_accuracy = []

    def forward(self, x):
        # flatten (batch, channels, height, width) images into vectors
        x = x.view(x.size(0), -1)

        x = F.relu(self.layer1(x))
        x = self.dropout(x)

        x = F.relu(self.layer2(x))
        x = self.dropout(x)

        x = F.relu(self.layer3(x))
        x = self.dropout(x)

        x = self.layer4(x)
        x = torch.log_softmax(x, dim=1)
        return x

    def cross_entropy_loss(self, logits, labels):
        """
        Apply NLL loss, because forward() already returns log-softmax outputs
        """
        return F.nll_loss(logits, labels)

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)

        self.log("ptl/train_loss", loss)
        self.log("ptl/train_accuracy", accuracy)

        return loss

    def validation_step(self, val_batch, batch_idx):
        x, y = val_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        accuracy = self.accuracy(logits, y)

        self.eval_loss.append(loss)
        self.eval_accuracy.append(accuracy)

        return {"val_loss": loss, "val_accuracy": accuracy}

    def on_validation_epoch_end(self):
        avg_loss = torch.stack(self.eval_loss).mean()
        avg_acc = torch.stack(self.eval_accuracy).mean()
        self.log("ptl/val_loss", avg_loss, sync_dist=True)
        self.log("ptl/val_accuracy", avg_acc, sync_dist=True)
        self.eval_loss.clear()
        self.eval_accuracy.clear()

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, batch_size=128):
        super().__init__()
        self.data_dir = tempfile.mkdtemp()
        self.batch_size = batch_size
        self.transform = transforms.Compose(
            [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
        )

    def setup(self, stage=None):
        with FileLock(f"{self.data_dir}.lock"):
            mnist = MNIST(
                self.data_dir, train=True, download=True, transform=self.transform
            )
            self.mnist_train, self.mnist_val = random_split(mnist, [55000, 5000])
            self.mnist_test = MNIST(
                self.data_dir, train=False, download=True, transform=self.transform
            )

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=self.batch_size, num_workers=4)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=self.batch_size, num_workers=4)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=self.batch_size, num_workers=4)


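def train_func(config):
    """Per-worker training loop: build the data module and model from a sampled config."""
    # log the sampled configuration to stdout and to a local file for debugging
    print("printing config", config)
    with open("config.txt", "w") as f:
        f.write(str(config))
    data_module = MNISTDataModule(batch_size=config["batch_size"])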

    # instantiate config object
    config = ModelConfig(
        learning_rate=config["learning_rate"],
        batch_size=config["batch_size"],
        layer_1_size=config["layer_1_size"],
        layer_2_size=config["layer_2_size"],
        layer_3_size=config["layer_3_size"],
        dropout=config["dropout"],
    )

    model = MNISTClassifier(config=config)

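    # max_epochs is left unset, so Lightning defaults to 1000 epochs;
    # the Tune scheduler's max_t is what stops each trial early
    trainer = pl.Trainer(
        devices=1,
        accelerator="cpu",
        strategy=RayDDPStrategy(),
        callbacks=[RayTrainReportCallback()],
        plugins=[RayLightningEnvironment()],
        enable_progress_bar=False,
    )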

    trainer = prepare_trainer(trainer)
    trainer.fit(model, datamodule=data_module)

In [3]:
search_space = {
    "layer_1_size": tune.choice([32, 64, 128]),
    "layer_2_size": tune.choice([64, 128, 256]),
    "layer_3_size": tune.choice([128, 256, 512]),
    "dropout": tune.uniform(0.1, 0.3),
    "batch_size": tune.choice([32, 64, 128]),
    "learning_rate": tune.loguniform(1e-4, 1e-1),
}
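
Since model size is one of the objectives discussed earlier, it can help to see how the layer-size choices translate into a parameter count. The sketch below assumes the four fully connected layers of `MNISTClassifier`; `count_mlp_parameters` is a hypothetical helper that does not appear elsewhere in this notebook.

```python
from typing import Dict


def count_mlp_parameters(
    config: Dict[str, int], n_inputs: int = 28 * 28, n_classes: int = 10
) -> int:
    """Count the weights and biases of the four fully connected layers."""
    widths = [
        n_inputs,
        config["layer_1_size"],
        config["layer_2_size"],
        config["layer_3_size"],
        n_classes,
    ]
    # each nn.Linear(a, b) holds a * b weights plus b biases
    return sum(a * b + b for a, b in zip(widths, widths[1:]))


# one combination of layer sizes from the search space
example = {"layer_1_size": 64, "layer_2_size": 256, "layer_3_size": 256}
n_params = count_mlp_parameters(example)
print(n_params)  # 135242
```

For this combination the per-layer counts are 50,240, 16,640, 65,792 and 2,570, which agrees with the rounded figures (50.2 K, 16.6 K, 65.8 K, 2.6 K) in the Lightning model summary printed in the run output.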

In [4]:
from ray.train import RunConfig, ScalingConfig, CheckpointConfig

# scaling_config = ScalingConfig(
#     num_workers=3,
#     use_gpu=True,
#     resources_per_worker={"CPU": 1, "GPU": 1}
# )

run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        checkpoint_score_attribute="ptl/val_accuracy",
        checkpoint_score_order="max"
    )
)
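
The intended effect of `num_to_keep=2` with `checkpoint_score_order="max"` can be pictured as keeping the two checkpoints with the highest validation accuracy. The sketch below only illustrates that selection rule; `checkpoints_to_keep` and the scores are hypothetical, and this is not Ray's internal bookkeeping.

```python
import heapq
from typing import Dict, List


def checkpoints_to_keep(scores: Dict[str, float], num_to_keep: int) -> List[str]:
    """Return the names of the `num_to_keep` highest-scoring checkpoints."""
    return heapq.nlargest(num_to_keep, scores, key=scores.get)


# hypothetical validation accuracies, for illustration only
hypothetical_scores = {
    "checkpoint_000000": 0.91,
    "checkpoint_000001": 0.95,
    "checkpoint_000002": 0.93,
}
kept = checkpoints_to_keep(hypothetical_scores, num_to_keep=2)
print(kept)
```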

In [5]:
from ray.train.torch import TorchTrainer

ray_trainer = TorchTrainer(
    train_func,
    # scaling_config=scaling_config,
    run_config=run_config,
)

In [6]:
from ray.tune.schedulers import ASHAScheduler

num_epochs = 5
num_samples = 100

scheduler = ASHAScheduler(
    max_t=num_epochs,
    grace_period=1,
    reduction_factor=2
)
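
ASHA evaluates trials at a series of rungs and only lets the better-performing fraction continue past each one. As a rough sketch, assuming rungs are placed at `grace_period * reduction_factor**k` up to `max_t`, the settings above give rungs at epochs 1, 2 and 4. `rung_milestones` is a hypothetical helper written for illustration, not Ray's implementation.

```python
from typing import List


def rung_milestones(max_t: int, grace_period: int, reduction_factor: int) -> List[int]:
    """List the iteration counts at which trials are compared and possibly stopped."""
    if reduction_factor <= 1:
        raise ValueError("reduction_factor must be greater than 1")
    milestones = []
    t = grace_period
    while t <= max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


milestones = rung_milestones(max_t=5, grace_period=1, reduction_factor=2)
print(milestones)  # [1, 2, 4]
```

With a reduction factor of 2, roughly half of the trials that reach a rung are allowed to continue, so most of the 100 samples are stopped after only one or two epochs.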

In [7]:
from ray.tune.search.ax import AxSearch

algorithm = AxSearch()

# restrict to 4 concurrent trials
algorithm = tune.search.ConcurrencyLimiter(algorithm, max_concurrent=4)



In [8]:
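tuner = tune.Tuner(
    ray_trainer,
    param_space={"train_loop_config": search_space},
    tune_config=tune.TuneConfig(
        metric="ptl/val_accuracy",
        mode="max",
        num_samples=num_samples,
        # use the Ax searcher defined above; without search_alg,
        # Tune falls back to random sampling of the search space
        search_alg=algorithm,
        scheduler=scheduler,
    ),
)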

tuner.fit()

Current time: 2024-12-24 17:40:21
Running for: 00:06:00.62
Memory: 25.4/32.0 GiB

| Trial name | status | loc | train_loop_config/batch_size | train_loop_config/dropout | train_loop_config/layer_1_size | train_loop_config/layer_2_size | train_loop_config/layer_3_size | train_loop_config/learning_rate | iter | total time (s) | ptl/train_loss | ptl/train_accuracy | ptl/val_loss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TorchTrainer_4f4b8_00034 | RUNNING | 127.0.0.1:56482 | 128 | 0.23891 | 128 | 64 | 128 | 0.00245667 | 3.0 | 46.6424 | 0.154768 | 0.965909 | 0.103128 |
| TorchTrainer_4f4b8_00035 | RUNNING | 127.0.0.1:56577 | 64 | 0.151442 | 64 | 64 | 256 | 0.00297387 | 1.0 | 31.2849 | 0.300558 | 0.875 | 0.213082 |
| TorchTrainer_4f4b8_00037 | RUNNING | 127.0.0.1:56933 | 64 | 0.238714 | 128 | 256 | 256 | 0.000252594 | | | | | |
| TorchTrainer_4f4b8_00038 | RUNNING | 127.0.0.1:57080 | 64 | 0.213271 | 64 | 128 | 512 | 0.00654806 | | | | | |
| TorchTrainer_4f4b8_00039 | PENDING | | 32 | 0.100271 | 64 | 128 | 128 | 0.000578316 | | | | | |
| TorchTrainer_4f4b8_00040 | PENDING | | 128 | 0.208088 | 32 | 64 | 128 | 0.00178802 | | | | | |
| TorchTrainer_4f4b8_00041 | PENDING | | 32 | 0.174195 | 32 | 128 | 128 | 0.000138052 | | | | | |
| TorchTrainer_4f4b8_00042 | PENDING | | 32 | 0.210235 | 128 | 128 | 512 | 0.00460865 | | | | | |
| TorchTrainer_4f4b8_00043 | PENDING | | 128 | 0.177099 | 32 | 256 | 512 | 0.000119454 | | | | | |
| TorchTrainer_4f4b8_00044 | PENDING | | 128 | 0.152518 | 64 | 64 | 512 | 0.082316 | | | | | |


(TorchTrainer pid=51189) Started distributed worker processes:
(TorchTrainer pid=51189) - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=51246) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=51248) Setting up process group for: env:// [rank=0, world_size=1]
(RayTrainWorker pid=51247) printing config {'layer_1_size': 64, 'layer_2_size': 256, 'layer_3_size': 256, 'dropout': 0.16465667111048166, 'batch_size': 128, 'learning_rate': 0.0012649991136170205}
(RayTrainWorker pid=51247) Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
(RayTrainWorker pid=51248) GPU available: True (mps), used: False
(RayTrainWorker pid=51248) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=51248) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=51248) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
(RayTrainWorker pid=51248) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
(RayTrainWorker pid=51247) Failed to download (trying next):
(RayTrainWorker pid=51247) HTTP Error 403: Forbidden
(RayTrainWorker pid=51247) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpoka2j6og/MNIST/raw/train-images-idx3-ubyte.gz
(RayTrainWorker pid=51247) Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpoka2j6og/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpoka2j6og/MNIST/raw
(TorchTrainer pid=51188) Started distributed worker processes: [repeated 4x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(TorchTrainer pid=51188) - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=51245) world_rank=0, local_rank=0, node_rank=0 [repeated 4x across cluster]
(RayTrainWorker pid=51245) Setting up process group for: env:// [rank=0, world_size=1] [repeated 4x across cluster]
(RayTrainWorker pid=51245) GPU available: True (mps), used: False [repeated 4x across cluster]
(RayTrainWorker pid=51245) TPU available: False, using: 0 TPU cores [repeated 4x across cluster]
(RayTrainWorker pid=51245) HPU available: False, using: 0 HPUs [repeated 4x across cluster]

(RayTrainWorker pid=51245) printing config {'layer_1_size': 64, 'layer_2_size': 64, 'layer_3_size': 512, 'dropout': 0.24417686558658022, 'batch_size': 32, 'learning_rate': 0.0012850365068080222} [repeated 4x across cluster]
(RayTrainWorker pid=51247) Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz [repeated 30x across cluster]
(RayTrainWorker pid=51249) Failed to download (trying next): [repeated 14x across cluster]
(RayTrainWorker pid=51249) HTTP Error 403: Forbidden [repeated 14x across cluster]
(RayTrainWorker pid=51249) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpccl95zix/MNIST/raw/t10k-images-idx3-ubyte.gz [repeated 14x across cluster]
(RayTrainWorker pid=51247)   | Name     | Type               | Params | Mode
(RayTrainWorker pid=51247) --------------------------------------------------------
(RayTrainWorker pid=51247) 0 | accuracy | MulticlassAccuracy | 0      | train
(RayTrainWorker pid=51247) 1 | layer1   | Linear             | 50.2 K | train
(RayTrainWorker pid=51247) 2 | layer2   | Linear             | 16.6 K | train
(RayTrainWorker pid=51247) 3 | layer3   | Linear             | 65.8 K | train
(RayTrainWorker pid=51247) 4 | layer4   | Linear             | 2.6 K  | train
(RayTrainWorker pid=51247) 5 | dropout  | Dropout            | 0      | train
(RayTrainWorker pid=51247) -----------------------------------------
(RayTrainWorker pid=51247) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
(RayTrainWorker pid=51249)   | Name     | Type               | Params | Mode [repeated 4x across cluster]
(RayTrainWorker pid=51249) -------------------------------------------------------- [repeated 8x across cluster]
(RayTrainWorker pid=51249) 0 | accuracy | MulticlassAccuracy | 0      | train [repeated 4x across cluster]

(RayTrainWorker pid=51249) Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpccl95zix/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpccl95zix/MNIST/raw [repeated 19x across cluster]
(RayTrainWorker pid=51907) printing config {'layer_1_size': 32, 'layer_2_size': 128, 'layer_3_size': 256, 'dropout': 0.21873339805254258, 'batch_size': 64, 'learning_rate': 0.0017592276854866433}
(RayTrainWorker pid=51249) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz [repeated 9x across cluster]
(RayTrainWorker pid=51249) Failed to download (trying next): [repeated 5x across cluster]
(RayTrainWorker pid=51249) HTTP Error 403: Forbidden [repeated 5x across cluster]
(RayTrainWorker pid=51249) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/

(RayTrainWorker pid=51907) GPU available: True (mps), used: False
(RayTrainWorker pid=51907) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=51907) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=51907) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
(RayTrainWorker pid=51907) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
(RayTrainWorker pid=51246) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00002_2_batch_size=64,dropout=0.1395,layer_1_size=64,layer_2_size=64,layer_3_size=256,learning_rate=0.0017_2024-12-24_17-34-21/checkpoint_000001)
(RayTrainWorker pid=52006) Setting up process group for: env:// [rank=0, world_size=1]
(TorchTrainer pid=51894) Started distributed worker processes:
(TorchTrainer pid=51894) - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=52006) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=52006) printing config {'layer_1_size': 64, 'layer_2_size': 256, 'layer_3_size': 128, 'dropout': 0.21735158263460483, 'batch_size': 64, 'learning_rate': 0.008840953012473073}
(RayTrainWorker pid=51248) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00003_3_batch_size=128,dropout=0.2106,layer_1_size=128,layer_2_size=64,layer_3_size=128,learning_rate=0.0230_2024-12-24_17-34-21/checkpoint_000002)
(RayTrainWorker pid=52006) GPU available: True (mps), used: False
(RayTrainWorker pid=52006) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=52006) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=52006) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(acc

(RayTrainWorker pid=51907) Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0c_zbc36/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0c_zbc36/MNIST/raw [repeated 3x across cluster]
(RayTrainWorker pid=51907) Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz [repeated 8x across cluster]
(RayTrainWorker pid=51907) Failed to download (trying next): [repeated 3x across cluster]
(RayTrainWorker pid=51907) HTTP Error 403: Forbidden [repeated 3x across cluster]
(RayTrainWorker pid=51907) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0c_zbc36/MNIST/raw/t10k-images-idx3-ubyte.gz [repeated 3x across cluster]

(RayTrainWorker pid=51907)   | Name     | Type               | Params | Mode
(RayTrainWorker pid=51907) --------------------------------------------------------
(RayTrainWorker pid=51907) 0 | accuracy | MulticlassAccuracy | 0      | train
(RayTrainWorker pid=51907) 1 | layer1   | Linear             | 25.1 K | train
(RayTrainWorker pid=51907) 2 | layer2   | Linear             | 4.2 K  | train
(RayTrainWorker pid=51907) 3 | layer3   | Linear             | 33.0 K | train
(RayTrainWorker pid=51907) 4 | layer4   | Linear             | 2.6 K  | train
(RayTrainWorker pid=51907) 5 | dropout  | Dropout            | 0      | train
(RayTrainWorker pid=51907) --------------------------------------------------------
(RayTrainWorker pid=51907) 64.9 K    Trainable params
(RayTrainWorker pid=51907) 0         Non-trainable params
(RayTrainWorker pid=51907) 64

(RayTrainWorker pid=51907) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
(RayTrainWorker pid=52006) Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmplt8lwpzy/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmplt8lwpzy/MNIST/raw [repeated 4x across cluster]
(RayTrainWorker pid=52006) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz [repeated 8x across cluster]
(RayTrainWorker pid=52006) Failed to download (trying next): [repeated 5x across cluster]
(RayTrainWorker pid=52006) HTTP Error 403: Forbidden [repeated 5x across cluster]
(RayTrainWorker pid=52006) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmplt8lwpzy/MNIST/raw/t10k-images-idx3-ubyte.gz [repeated 4x across cluster]
(RayTrainWorker pid=51247) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00000_0_batch_size=128,dropout=0.1647,layer_1_size=64,layer_2_size=256,layer_3_size=256,learning_rate=0.0013_2024-12-24_17-34-21/checkpoint_000002)
(RayTrainWorker pid=51246) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00002_2_batch_size=64,dropout=0.1395,layer_1_size=64,layer_2_size=64,layer_3_size=256,learning_rate=0.0017_2024-12-24_17-34-21/checkpoint_000002)
(RayTrainWorker pid=52006)   | Name     | Type               | Params | Mode
(RayTrainWorker pid=52006) -------------------------------------------------------- [repeated 2x across cluster]
(RayTrainWorker pid=52006) 0 | accuracy | MulticlassAccuracy | 0      | train
(RayTrainWorker pid=52006) 4 | layer4   | Linear             | 1.3 K  | train [repeated 4x across cluster]
(RayTrainWorker pid=52006) 5 | dropout  | Dropout            | 0      | train
(RayTrainWorker pid=52006) 101 K     Trainable params
(RayTrainWorker pid=52006) 0         Non-trainable params

(RayTrainWorker pid=52659) printing config {'layer_1_size': 128, 'layer_2_size': 256, 'layer_3_size': 512, 'dropout': 0.19069275345499645, 'batch_size': 64, 'learning_rate': 0.011887118405969704}
(RayTrainWorker pid=52006) Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmplt8lwpzy/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmplt8lwpzy/MNIST/raw
(RayTrainWorker pid=52006) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmplt8lwpzy/MNIST/raw/t10k-labels-idx1-ubyte.gz
(RayTrainWorker pid=52659) Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
(RayTrainWorker pid=52659) GPU available: True (mps), used: False
(RayTrainWorker pid=52659) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=52659) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=52659) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
(RayTrainWorker pid=52659) /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
(RayTrainWorker pid=52659) Failed to download (trying next):
(RayTrainWorker pid=52659) HTTP Error 403: Forbidden
(RayTrainWorker pid=52659) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
(RayTrainWorker pid=52659) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsdtbszzl/MNIST/raw/train-images-idx3-ubyte.gz
(RayTrainWorker pid=52659) Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsdtbszzl/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsdtbszzl/MNIST/raw
(RayTrainWorker pid=52659) Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
(RayTrainWorker pid=52659) Failed to download (trying next):
(RayTrainWorker pid=52659) HTTP Error 403: Forbidden
(RayTrainWorker pid=52659) Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsdtbszzl/MNIST/raw/train-labels-idx1-ubyte.gz
(RayTrainWorker pid=52659) Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsdtbszzl/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p

(RayTrainWorker pid=51246) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00002_2_batch_size=64,dropout=0.1395,layer_1_size=64,layer_2_size=64,layer_3_size=256,learning_rate=0.0017_2024-12-24_17-34-21/checkpoint_000004)
(RayTrainWorker pid=51246) Traceback (most recent call last):
(RayTrainWorker pid=51246)   File "<string>", line 1, in <module>
(RayTrainWorker pid=51246)   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
(RayTrainWorker pid=51246)     exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=51246)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=51246)   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in

[36m(RayTrainWorker pid=52720)[0m 
[36m(RayTrainWorker pid=52659)[0m 


[36m(TorchTrainer pid=52656)[0m Started distributed worker processes: [32m [repeated 3x across cluster][0m
[36m(TorchTrainer pid=52656)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=52720) world_rank=0, local_rank=0, node_rank=0[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52720)[0m Setting up process group for: env:// [rank=0, world_size=1][32m [repeated 3x across cluster][0m
100%|██████████| 9.91M/9.91M [00:00<00:00, 11.1MB/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 10.6MB/s]


[36m(RayTrainWorker pid=52694)[0m 
[36m(RayTrainWorker pid=52695)[0m 
[36m(RayTrainWorker pid=52659)[0m 
[36m(RayTrainWorker pid=52694)[0m 
[36m(RayTrainWorker pid=52720)[0m printing config {'layer_1_size': 128, 'layer_2_size': 128, 'layer_3_size': 256, 'dropout': 0.24886063751267073, 'batch_size': 64, 'learning_rate': 0.0008074760893481581}[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52694)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz[32m [repeated 13x across cluster][0m


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.19MB/s]
[36m(RayTrainWorker pid=52720)[0m GPU available: True (mps), used: False[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52720)[0m TPU available: False, using: 0 TPU cores[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52720)[0m HPU available: False, using: 0 HPUs[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52720)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52720)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.[32m [repeated 3x across cluster][0m
 96%|█████████▌| 9.50M/9.91M [00:00<00:00, 16.3MB/s]
100%|██████████| 9.91M/

[36m(RayTrainWorker pid=52659)[0m 
[36m(RayTrainWorker pid=52720)[0m 
[36m(RayTrainWorker pid=52659)[0m Failed to download (trying next):[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=52659)[0m HTTP Error 403: Forbidden[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=52695)[0m 
[36m(RayTrainWorker pid=52694)[0m 


100%|██████████| 28.9k/28.9k [00:00<00:00, 340kB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 342kB/s]
[36m(RayTrainWorker pid=52659)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsdtbszzl/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=52659)[0m 


100%|██████████| 4.54k/4.54k [00:00<00:00, 3.72MB/s]
[36m(RayTrainWorker pid=52659)[0m 
[36m(RayTrainWorker pid=52659)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=52659)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=52659)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=52659)[0m 1 | layer1   | Linear             | 100 K  | train
[36m(RayTrainWorker pid=52659)[0m 2 | layer2   | Linear             | 33.0 K | train
[36m(RayTrainWorker pid=52659)[0m 3 | layer3   | Linear             | 131 K  | train
[36m(RayTrainWorker pid=52659)[0m 4 | layer4   | Linear             | 5.1 K  | train
[36m(RayTrainWorker pid=52659)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=52659)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=52659)[0m 270 K     Trainable params
[36m(RayTrainWorker pid=52659)[0m 0         Non-tr

[36m(RayTrainWorker pid=52694)[0m 
[36m(RayTrainWorker pid=52695)[0m 


  0%|          | 0.00/28.9k [00:00<?, ?B/s][32m [repeated 8x across cluster][0m
 52%|█████▏    | 5.14M/9.91M [00:00<00:00, 8.05MB/s][32m [repeated 20x across cluster][0m
100%|██████████| 28.9k/28.9k [00:00<00:00, 344kB/s]
[36m(RayTrainWorker pid=52720)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpmu6vn7s8/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpmu6vn7s8/MNIST/raw[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=52694)[0m 


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.39MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.26MB/s]
[36m(RayTrainWorker pid=52659)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=52720)[0m 


[36m(RayTrainWorker pid=52694)[0m 
[36m(RayTrainWorker pid=52695)[0m 


[36m(RayTrainWorker pid=52694)[0m 
[36m(RayTrainWorker pid=52695)[0m 
[36m(RayTrainWorker pid=52720)[0m 
[36m(RayTrainWorker pid=52720)[0m 


[36m(RayTrainWorker pid=52720)[0m 
[36m(TorchTrainer pid=52788)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=52788)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=52814) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=52814)[0m Setting up process group for: env:// [rank=0, world_size=1]


[36m(RayTrainWorker pid=52814)[0m printing config {'layer_1_size': 128, 'layer_2_size': 64, 'layer_3_size': 128, 'dropout': 0.1835654440926341, 'batch_size': 32, 'learning_rate': 0.004610715013900017}
[36m(RayTrainWorker pid=52814)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 17x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m 
[36m(RayTrainWorker pid=52814)[0m Failed to download (trying next):[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m HTTP Error 403: Forbidden[32m [repeated 8x across cluster][0m


[36m(RayTrainWorker pid=52814)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=52814)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=52814)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=52814)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=52814)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
[36m(RayTrainWorker pid=52720)[0m   | Name     | Type               | Params | Mode [32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52720)[0m --------------------------------------------------------[32m [repeated 6x across cluster][0m
[36m(RayT

[36m(RayTrainWorker pid=52814)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_9qus576/MNIST/raw/train-images-idx3-ubyte.gz[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_9qus576/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_9qus576/MNIST/raw[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m 
[36m(RayTrainWorker pid=52814)[0m 


[36m(RayTrainWorker pid=52720)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.[32m [repeated 3x across cluster][0m
100%|██████████| 28.9k/28.9k [00:00<00:00, 348kB/s]


[36m(RayTrainWorker pid=52814)[0m 
[36m(RayTrainWorker pid=52814)[0m 
[36m(RayTrainWorker pid=52814)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m 


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.74MB/s]


[36m(RayTrainWorker pid=52814)[0m 
[36m(RayTrainWorker pid=52814)[0m Failed to download (trying next):[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m HTTP Error 403: Forbidden[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m 


[36m(RayTrainWorker pid=52814)[0m 
[36m(RayTrainWorker pid=52814)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=52814)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=52814)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=52814)[0m 1 | layer1   | Linear             | 100 K  | train
[36m(RayTrainWorker pid=52814)[0m 2 | layer2   | Linear             | 8.3 K  | train
[36m(RayTrainWorker pid=52814)[0m 3 | layer3   | Linear             | 8.3 K  | train
[36m(RayTrainWorker pid=52814)[0m 4 | layer4   | Linear             | 1.3 K  | train
[36m(RayTrainWorker pid=52814)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=52814)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=52814)[0m 118 K     Trainable params
[36m(RayTrainWorker pid=52814)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=52814)[0m 11

[36m(RayTrainWorker pid=53239)[0m printing config {'layer_1_size': 128, 'layer_2_size': 128, 'layer_3_size': 512, 'dropout': 0.14513634236469966, 'batch_size': 64, 'learning_rate': 0.08377024358898488}
[36m(RayTrainWorker pid=52814)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_9qus576/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=52814)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_9qus576/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_9qus576/MNIST/raw[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=53239)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m


[36m(RayTrainWorker pid=53239)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=53239)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=53239)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=53239)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=53239)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=53239)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=53239)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=53239)[0m 
[36m(RayTrainWorker pid=53244)[0m 


  0%|          | 0.00/9.91M [00:00<?, ?B/s]
  1%|          | 98.3k/9.91M [00:00<00:16, 597kB/s]
  4%|▍         | 426k/9.91M [00:00<00:06, 1.39MB/s]
 17%|█▋        | 1.64M/9.91M [00:00<00:01, 4.98MB/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 11.9MB/s]
100%|██████████| 9.91M/9.91M [00:02<00:00, 3.50MB/s]
[36m(TorchTrainer pid=53139)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=53139)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=53244) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=53244)[0m Setting up process group for: env:// [rank=0, world_size=1]


[36m(RayTrainWorker pid=53244)[0m 


[36m(RayTrainWorker pid=52720)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00010_10_batch_size=64,dropout=0.2489,layer_1_size=128,layer_2_size=128,layer_3_size=256,learning_rate=0.0008_2024-12-24_17-34-21/checkpoint_000001)


[36m(RayTrainWorker pid=53244)[0m 
[36m(RayTrainWorker pid=53244)[0m printing config {'layer_1_size': 128, 'layer_2_size': 128, 'layer_3_size': 256, 'dropout': 0.19768331624522156, 'batch_size': 64, 'learning_rate': 0.027985409353988275}
[36m(RayTrainWorker pid=53239)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_5bdtq8h/MNIST/raw/t10k-images-idx3-ubyte.gz[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=53239)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_5bdtq8h/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp_5bdtq8h/MNIST/raw[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=53244)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz[32m [repeated 9x across cluster][0m
[36m(RayTrainWorker pid=53239)[0m 


[36m(RayTrainWorker pid=53244)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=53244)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=53244)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=53244)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=53244)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
100%|██████████| 28.9k/28.9k [00:00<00:00, 337kB/s]


[36m(RayTrainWorker pid=53244)[0m 
[36m(RayTrainWorker pid=53239)[0m 
[36m(RayTrainWorker pid=53239)[0m Failed to download (trying next):[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=53239)[0m HTTP Error 403: Forbidden[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=53244)[0m 
[36m(RayTrainWorker pid=53239)[0m 


100%|██████████| 4.54k/4.54k [00:00<00:00, 2.07MB/s][32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=53239)[0m 
[36m(RayTrainWorker pid=53239)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=53239)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=53239)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=53239)[0m 1 | layer1   | Linear             | 100 K  | train
[36m(RayTrainWorker pid=53239)[0m 2 | layer2   | Linear             | 16.5 K | train
[36m(RayTrainWorker pid=53239)[0m 3 | layer3   | Linear             | 66.0 K | train
[36m(RayTrainWorker pid=53239)[0m 4 | layer4   | Linear             | 5.1 K  | train
[36m(RayTrainWorker pid=53239)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=53239)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=53239)[0m 188 K     Trainable params
[36m(RayTrain

[36m(RayTrainWorker pid=53244)[0m 
[36m(RayTrainWorker pid=53244)[0m 


[36m(RayTrainWorker pid=53244)[0m 


[36m(RayTrainWorker pid=53244)[0m 


[36m(RayTrainWorker pid=52695)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00009_9_batch_size=128,dropout=0.2985,layer_1_size=128,layer_2_size=128,layer_3_size=256,learning_rate=0.0009_2024-12-24_17-34-21/checkpoint_000002)[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=53239)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
100%|██████████| 4.54k/4.54k [00:00<00:00, 3.76MB/s][32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=53244)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=53244)[0m --------------------------------------------------------[32m [repeated 2x across cluster][0m
[36m(RayTrain

[36m(RayTrainWorker pid=53951)[0m printing config {'layer_1_size': 128, 'layer_2_size': 128, 'layer_3_size': 256, 'dropout': 0.15949214766748768, 'batch_size': 64, 'learning_rate': 0.06906314226038607}
[36m(RayTrainWorker pid=53244)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmps9k_n8cb/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=53244)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmps9k_n8cb/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmps9k_n8cb/MNIST/raw[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=53244)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=53244)[0m Failed to download (trying next):[32m [repeated 2x across cluster][0m
[36m(RayTrainWo

[36m(RayTrainWorker pid=53951)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=53951)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=53951)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=53951)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=53951)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
[36m(RayTrainWorker pid=52720)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00010_10_batch_size=64,dropout=0.2489,layer_1_size=128,layer_2_size=128

[36m(RayTrainWorker pid=53951)[0m 


  0%|          | 0.00/9.91M [00:00<?, ?B/s]
  1%|          | 98.3k/9.91M [00:00<00:16, 588kB/s]
  4%|▍         | 426k/9.91M [00:00<00:06, 1.39MB/s]
 13%|█▎        | 1.28M/9.91M [00:00<00:02, 3.13MB/s]
 35%|███▍      | 3.44M/9.91M [00:00<00:00, 8.37MB/s]


[36m(RayTrainWorker pid=53988)[0m 
[36m(RayTrainWorker pid=53989)[0m 


 66%|██████▌   | 6.55M/9.91M [00:00<00:00, 14.9MB/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 11.3MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 237kB/s]
100%|██████████| 9.91M/9.91M [00:01<00:00, 8.56MB/s]
[36m(TorchTrainer pid=53946)[0m Started distributed worker processes: [32m [repeated 3x across cluster][0m
[36m(TorchTrainer pid=53946)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=54019) world_rank=0, local_rank=0, node_rank=0[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m Setting up process group for: env:// [rank=0, world_size=1][32m [repeated 3x across cluster][0m


[36m(RayTrainWorker pid=53988)[0m 
[36m(RayTrainWorker pid=54019)[0m 
[36m(RayTrainWorker pid=53951)[0m 
[36m(RayTrainWorker pid=53988)[0m 


[36m(RayTrainWorker pid=52720)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00010_10_batch_size=64,dropout=0.2489,layer_1_size=128,layer_2_size=128,layer_3_size=256,learning_rate=0.0008_2024-12-24_17-34-21/checkpoint_000004)


[36m(RayTrainWorker pid=53988)[0m 


100%|██████████| 28.9k/28.9k [00:00<00:00, 331kB/s]
 94%|█████████▍| 9.34M/9.91M [00:02<00:00, 4.68MB/s]


[36m(RayTrainWorker pid=53951)[0m 
[36m(RayTrainWorker pid=54019)[0m printing config {'layer_1_size': 128, 'layer_2_size': 128, 'layer_3_size': 128, 'dropout': 0.20081277124647182, 'batch_size': 128, 'learning_rate': 0.0002177538522592566}[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=53988)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp7z21mrqp/MNIST/raw/train-labels-idx1-ubyte.gz[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=53951)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp71w5gs3r/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp71w5gs3r/MNIST/raw[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=53951)[0m Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 16x across cluster][0m
[36m(RayTrainWorker pid=53988)[0m Failed to download (

[36m(RayTrainWorker pid=52720)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=52720)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=52720)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
[36m(RayTrainWorker pid=52720)[0m     exitcode = _main(fd, parent_sentinel)
[36m(RayTrainWorker pid=52720)[0m                ^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=52720)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
[36m(RayTrainWorker pid=52720)[0m     self = reduction.pickle.load(from_parent)
[36m(RayTrainWorker pid=52720)[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=52720)[0m _pickle.UnpicklingError: pickle data was truncated
[36m(RayTrainWorker pid=54019)[0m GPU available: True (mps), used: False[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pi

[36m(RayTrainWorker pid=53989)[0m 


100%|██████████| 9.91M/9.91M [00:00<00:00, 10.2MB/s]
[36m(RayTrainWorker pid=53951)[0m 
[36m(RayTrainWorker pid=53951)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=53951)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=53951)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=53951)[0m 1 | layer1   | Linear             | 100 K  | train
[36m(RayTrainWorker pid=53951)[0m 2 | layer2   | Linear             | 16.5 K | train
[36m(RayTrainWorker pid=53951)[0m 3 | layer3   | Linear             | 33.0 K | train
[36m(RayTrainWorker pid=53951)[0m 4 | layer4   | Linear             | 2.6 K  | train
[36m(RayTrainWorker pid=53951)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=53951)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=53951)[0m 152 K     Trainable params
[36m(RayTrainWorker pid=53951)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=53951)[0m 15

[36m(RayTrainWorker pid=53989)[0m 
[36m(RayTrainWorker pid=53988)[0m 


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.72MB/s][32m [repeated 37x across cluster][0m


[36m(RayTrainWorker pid=54019)[0m 


100%|██████████| 28.9k/28.9k [00:00<00:00, 348kB/s]
[36m(RayTrainWorker pid=53951)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=54019)[0m 


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.63MB/s]
[36m(RayTrainWorker pid=54019)[0m   | Name     | Type               | Params | Mode [32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m --------------------------------------------------------[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m 0 | accuracy | MulticlassAccuracy | 0      | train[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m 4 | layer4   | Linear             | 1.3 K  | train[32m [repeated 12x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m 5 | dropout  | Dropout            | 0      | train[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m 134 K     Trainable params[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m 0         Non-trainable params[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m 134 K     Total params[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=540

[36m(RayTrainWorker pid=54161)[0m printing config {'layer_1_size': 32, 'layer_2_size': 64, 'layer_3_size': 256, 'dropout': 0.16431648687476777, 'batch_size': 128, 'learning_rate': 0.0029890994732554837}
[36m(RayTrainWorker pid=54019)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxbfgduw4/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 9x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxbfgduw4/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxbfgduw4/MNIST/raw[32m [repeated 11x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 16x across cluster][0m
[36m(RayTrainWorker pid=54019)[0m Failed to download (trying next):[32m [repeated 9x across cluster][0m
[36m(RayTrai

 95%|█████████▍| 9.40M/9.91M [00:00<00:00, 15.1MB/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 10.2MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 336kB/s][32m [repeated 2x across cluster][0m
 77%|███████▋  | 7.67M/9.91M [00:00<00:00, 15.0MB/s][32m [repeated 6x across cluster][0m


[36m(RayTrainWorker pid=54161)[0m 
[36m(RayTrainWorker pid=54161)[0m 
[36m(RayTrainWorker pid=54161)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxmbqtipr/MNIST/raw/t10k-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54161)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxmbqtipr/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxmbqtipr/MNIST/raw[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54161)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=54161)[0m Failed to download (trying next):[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54161)[0m HTTP Error 403: Forbidden[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54161)[0m 
[36m(Ray

[36m(RayTrainWorker pid=53988)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00015_15_batch_size=128,dropout=0.1031,layer_1_size=64,layer_2_size=128,layer_3_size=128,learning_rate=0.0036_2024-12-24_17-34-21/checkpoint_000000)
[36m(RayTrainWorker pid=54161)[0m 
[36m(RayTrainWorker pid=54161)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=54161)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=54161)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=54161)[0m 1 | layer1   | Linear             | 25.1 K | train
[36m(RayTrainWorker pid=54161)[0m 2 | layer2   | Linear             | 2.1 K  | train
[36m(RayTrainWorker pid=54161)[0m 3 | layer3   | Linear             | 16.6 K | train
[36m(RayTrainWorker pid=54161)[0m 4 | layer4   | Linear             | 2.6 K  | train
[36

[36m(RayTrainWorker pid=54161)[0m 


[36m(RayTrainWorker pid=53951)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=53951)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=53951)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
[36m(RayTrainWorker pid=53951)[0m     exitcode = _main(fd, parent_sentinel)
[36m(RayTrainWorker pid=53951)[0m                ^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=53951)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
[36m(RayTrainWorker pid=53951)[0m     self = reduction.pickle.load(from_parent)
[36m(RayTrainWorker pid=53951)[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=53951)[0m _pickle.UnpicklingError: pickle data was truncated
100%|██████████| 4.54k/4.54k [00:00<00:00, 2.83MB/s][32m [repeated 2x across cluster][0m
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.

[36m(RayTrainWorker pid=54566)[0m printing config {'layer_1_size': 32, 'layer_2_size': 256, 'layer_3_size': 256, 'dropout': 0.1423789011822376, 'batch_size': 32, 'learning_rate': 0.05657467324573046}
[36m(RayTrainWorker pid=54161)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxmbqtipr/MNIST/raw/t10k-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=54161)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxmbqtipr/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpxmbqtipr/MNIST/raw
[36m(RayTrainWorker pid=54566)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54161)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=54161)[0m HTTP Error 403: Forbidden


[36m(RayTrainWorker pid=54566)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=54566)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=54566)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=54566)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=54566)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=54566)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=54566)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=54566)[0m 
[36m(RayTrainWorker pid=54571)[0m 
[36m(RayTrainWorker pid=54566)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpf8716_rj/MNIST/raw/train-images-idx3-ubyte.gz


[36m(RayTrainWorker pid=54566)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpf8716_rj/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpf8716_rj/MNIST/raw
[36m(RayTrainWorker pid=54566)[0m 
[36m(RayTrainWorker pid=54617)[0m 
[36m(RayTrainWorker pid=54566)[0m 


[36m(RayTrainWorker pid=54161)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00018_18_batch_size=128,dropout=0.1643,layer_1_size=32,layer_2_size=64,layer_3_size=256,learning_rate=0.0030_2024-12-24_17-34-21/checkpoint_000000)
[36m(TorchTrainer pid=54525)[0m Started distributed worker processes: [32m [repeated 2x across cluster][0m
[36m(TorchTrainer pid=54525)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=54617) world_rank=0, local_rank=0, node_rank=0[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m Setting up process group for: env:// [rank=0, world_size=1][32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m printing config {'layer_1_size': 32, 'layer_2_size': 64, 'layer_3_size': 256, 'dropout': 0.20123555802351, 'batch_size': 32, 'learning_rate': 0.00037028326952746643}[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54571)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 15x across cluster][0m
[36m(RayTrainWorker pid=54566)[0m 


[36m(RayTrainWorker pid=54617)[0m GPU available: True (mps), used: False[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m TPU available: False, using: 0 TPU cores[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m HPU available: False, using: 0 HPUs[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54566)[0m Failed to download (trying next):[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=54566)[0m HTTP Error 403: Forbidden[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=54571)[0m 
[36m(RayTrainWorker pid=54617)[0m 
[36m(RayTrainWorker pid=54566)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpf8716_rj/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=54566)[0m 
[36m(RayTrainWorker pid=54571)[0m 


[36m(RayTrainWorker pid=54566)[0m 
[36m(RayTrainWorker pid=54566)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=54566)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=54566)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=54566)[0m 1 | layer1   | Linear             | 25.1 K | train
[36m(RayTrainWorker pid=54566)[0m 2 | layer2   | Linear             | 8.4 K  | train
[36m(RayTrainWorker pid=54566)[0m 3 | layer3   | Linear             | 65.8 K | train
[36m(RayTrainWorker pid=54566)[0m 4 | layer4   | Linear             | 2.6 K  | train
[36m(RayTrainWorker pid=54566)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=54566)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=54566)[0m 101 K     Trainable params
[36m(RayTrainWorker pid=54566)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=54566)[0m 10

[36m(RayTrainWorker pid=54711)[0m 


[36m(RayTrainWorker pid=54161)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=54161)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=54161)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
[36m(RayTrainWorker pid=54161)[0m     exitcode = _main(fd, parent_sentinel)
[36m(RayTrainWorker pid=54161)[0m                ^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=54161)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
[36m(RayTrainWorker pid=54161)[0m     self = reduction.pickle.load(from_parent)
[36m(RayTrainWorker pid=54161)[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=54161)[0m _pickle.UnpicklingError: pickle data was truncated
[36m(RayTrainWorker pid=54617)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpe8qn2ir6/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpe8qn2ir6/MNIST/raw[32m [repeated 10x across cluster][0m


[36m(RayTrainWorker pid=54566)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=54711)[0m 
[36m(RayTrainWorker pid=54711)[0m 


[36m(TorchTrainer pid=54647)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=54647)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=54711) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=54711)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=54711)[0m printing config {'layer_1_size': 32, 'layer_2_size': 64, 'layer_3_size': 256, 'dropout': 0.10494902861257592, 'batch_size': 32, 'learning_rate': 0.017061156114948565}
[36m(RayTrainWorker pid=54711)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 14x across cluster][0m


[36m(RayTrainWorker pid=54711)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=54711)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=54711)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=54711)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=54711)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=54711)[0m 
[36m(RayTrainWorker pid=54711)[0m Failed to download (trying next):[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=54711)[0m HTTP Error 403: Forbidden[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=54711)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpahmuu0ch/MNIST/raw/t10k-images-idx3-ubyte.gz[32m [repeated 6x across cluster][0m


[36m(RayTrainWorker pid=54617)[0m   | Name     | Type               | Params | Mode [32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m --------------------------------------------------------[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m 0 | accuracy | MulticlassAccuracy | 0      | train[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m 4 | layer4   | Linear             | 2.6 K  | train[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m 5 | dropout  | Dropout            | 0      | train[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m 46.4 K    Trainable params[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m 0         Non-trainable params[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m 46.4 K    Total params[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54617)[0m 0.186     Total estimated mode

[36m(RayTrainWorker pid=54711)[0m 


[36m(RayTrainWorker pid=54711)[0m 
[36m(RayTrainWorker pid=54711)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.[32m [repeated 3x across cluster][0m
[36m(TorchTrainer pid=54781)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=54781)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=54864) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=54864)[0m Setting up process group for: env:// [rank=0, world_size=1]


[36m(RayTrainWorker pid=54864)[0m printing config {'layer_1_size': 32, 'layer_2_size': 128, 'layer_3_size': 512, 'dropout': 0.23261068720157155, 'batch_size': 128, 'learning_rate': 0.01129419972728505}
[36m(RayTrainWorker pid=54711)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpahmuu0ch/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpahmuu0ch/MNIST/raw[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=54864)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54711)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=54711)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=54711)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpahmuu0ch/MNIST/raw/t10k-labels-idx1-ubyte.gz


[36m(RayTrainWorker pid=54864)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=54864)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=54864)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=54864)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=54864)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=54864)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=54864)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=54864)[0m 
[36m(RayTrainWorker pid=54864)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw/train-images-idx3-ubyte.gz


[36m(RayTrainWorker pid=54711)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=54711)[0m --------------------------------------------------------[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=54711)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=54711)[0m 4 | layer4   | Linear             | 2.6 K  | train[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=54711)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=54711)[0m 46.4 K    Trainable params
[36m(RayTrainWorker pid=54711)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=54711)[0m 46.4 K    Total params
[36m(RayTrainWorker pid=54711)[0m 0.186     Total estimated model params size (MB)
[36m(RayTrainWorker pid=54711)[0m 6         Modules in train mode
[36m(RayTrainWorker pid=54711)[0m 0         Modules in eval mode
[36m(RayTrainWorker pid=54711)[0m /Users/sidharrthnagappan/.virtualenvs/

[36m(RayTrainWorker pid=54864)[0m 
[36m(RayTrainWorker pid=54864)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=54864)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=54864)[0m 
[36m(RayTrainWorker pid=54864)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw/train-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=54864)[0m 


[36m(RayTrainWorker pid=54571)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00020_20_batch_size=64,dropout=0.1971,layer_1_size=128,layer_2_size=64,layer_3_size=512,learning_rate=0.0003_2024-12-24_17-34-21/checkpoint_000000)


[36m(RayTrainWorker pid=54864)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=54864)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=54864)[0m 
[36m(RayTrainWorker pid=54864)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw/t10k-images-idx3-ubyte.gz


[36m(RayTrainWorker pid=54864)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=54864)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=54864)[0m 
[36m(RayTrainWorker pid=54864)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=54864)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=54864)[0m 
[36m(RayTrainWorker pid=54864)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw/t10k-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=54864)[0m 


[36m(RayTrainWorker pid=54864)[0m 
[36m(RayTrainWorker pid=54864)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=54864)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=54864)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=54864)[0m 100 K     Trainable params
[36m(RayTrainWorker pid=54864)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=54864)[0m 100 K     Total params
[36m(RayTrainWorker pid=54864)[0m 0.402     Total estimated model params size (MB)
[36m(RayTrainWorker pid=54864)[0m 6         Modules in train mode
[36m(RayTrainWorker pid=54864)[0m 0         Modules in eval mode
[36m(RayTrainWorker pid=54864)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=55197)[0m printing config {'layer_1_size': 128, 'layer_2_size': 256, 'layer_3_size': 512, 'dropout': 0.17949978964635208, 'batch_size': 64, 'learning_rate': 0.03878742860492825}
[36m(RayTrainWorker pid=54864)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpj31g1h3j/MNIST/raw
[36m(RayTrainWorker pid=54864)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 2x across cluster][0m


[36m(RayTrainWorker pid=55197)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=55197)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=55197)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=55197)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=55197)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
[36m(RayTrainWorker pid=54864)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=54864)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=54864)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessi

[36m(RayTrainWorker pid=55197)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=55197)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=55197)[0m 
[36m(RayTrainWorker pid=55197)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsrgbtdsm/MNIST/raw/train-images-idx3-ubyte.gz


[36m(RayTrainWorker pid=55197)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsrgbtdsm/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsrgbtdsm/MNIST/raw
[36m(RayTrainWorker pid=55197)[0m 


[36m(RayTrainWorker pid=54864)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00023_23_batch_size=128,dropout=0.2326,layer_1_size=32,layer_2_size=128,layer_3_size=512,learning_rate=0.0113_2024-12-24_17-34-21/checkpoint_000000)


[36m(RayTrainWorker pid=55197)[0m 


[36m(TorchTrainer pid=55205)[0m Started distributed worker processes: [32m [repeated 2x across cluster][0m
[36m(TorchTrainer pid=55205)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=55302) world_rank=0, local_rank=0, node_rank=0[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m Setting up process group for: env:// [rank=0, world_size=1][32m [repeated 2x across cluster][0m


[36m(RayTrainWorker pid=55302)[0m printing config 
[36m(RayTrainWorker pid=55226)[0m 
[36m(RayTrainWorker pid=55302)[0m {'layer_1_size': 128, 'layer_2_size': 128, 'layer_3_size': 128, 'dropout': 0.25725225591832507, 'batch_size': 32, 'learning_rate': 0.0009903775404096573}


[36m(RayTrainWorker pid=55302)[0m GPU available: True (mps), used: False[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m TPU available: False, using: 0 TPU cores[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m HPU available: False, using: 0 HPUs[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train w

[36m(RayTrainWorker pid=55226)[0m 
[36m(RayTrainWorker pid=55226)[0m printing config {'layer_1_size': 32, 'layer_2_size': 128, 'layer_3_size': 128, 'dropout': 0.14285853280267355, 'batch_size': 128, 'learning_rate': 0.015160436906754329}
[36m(RayTrainWorker pid=55226)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 14x across cluster][0m
[36m(RayTrainWorker pid=55197)[0m 


[36m(RayTrainWorker pid=54571)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00020_20_batch_size=64,dropout=0.1971,layer_1_size=128,layer_2_size=64,layer_3_size=512,learning_rate=0.0003_2024-12-24_17-34-21/checkpoint_000002)


[36m(RayTrainWorker pid=55197)[0m 
[36m(RayTrainWorker pid=55197)[0m Failed to download (trying next):[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=55197)[0m HTTP Error 403: Forbidden[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=55226)[0m 
[36m(RayTrainWorker pid=55197)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpsrgbtdsm/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m 


[36m(RayTrainWorker pid=55197)[0m 
[36m(RayTrainWorker pid=55197)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=55197)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=55197)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=55197)[0m 1 | layer1   | Linear             | 100 K  | train
[36m(RayTrainWorker pid=55197)[0m 2 | layer2   | Linear             | 33.0 K | train
[36m(RayTrainWorker pid=55197)[0m 3 | layer3   | Linear             | 131 K  | train
[36m(RayTrainWorker pid=55197)[0m 4 | layer4   | Linear             | 5.1 K  | train
[36m(RayTrainWorker pid=55197)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=55197)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=55197)[0m 270 K     Trainable params
[36m(RayTrainWorker pid=55197)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=55197)[0m 27

[36m(RayTrainWorker pid=55197)[0m 
[36m(RayTrainWorker pid=55226)[0m 
[36m(RayTrainWorker pid=55302)[0m 


[36m(RayTrainWorker pid=55226)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpuoy2tt6h/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpuoy2tt6h/MNIST/raw[32m [repeated 9x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m 


[36m(RayTrainWorker pid=55226)[0m 
[36m(TorchTrainer pid=55341)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=55341)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=55402) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=55402)[0m Setting up process group for: env:// [rank=0, world_size=1]


[36m(RayTrainWorker pid=55402)[0m printing config {'layer_1_size': 32, 'layer_2_size': 64, 'layer_3_size': 512, 'dropout': 0.28714111837162914, 'batch_size': 32, 'learning_rate': 0.03835455856042801}


[36m(RayTrainWorker pid=55197)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=55402)[0m 


[36m(RayTrainWorker pid=55402)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=55402)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=55402)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=55402)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=55402)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=55402)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz[32m [repeated 10x across cluster][0m
[36m(RayTrainWorker pid=55402)[0m Failed to download (trying next):[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55402)[0m HTTP Error 403: Forbidden[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55402)[0m 
[36m(RayTrainWorker pid=55402)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmppf8qrx2a/MNIST/raw/train-images-idx3-ubyte.gz[32m [repeated 4x across cluster][0m


[36m(RayTrainWorker pid=55226)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=55226)[0m --------------------------------------------------------[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55226)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=55226)[0m 4 | layer4   | Linear             | 1.3 K  | train[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55226)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=55226)[0m 47.1 K    Trainable params
[36m(RayTrainWorker pid=55226)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=55226)[0m 47.1 K    Total params
[36m(RayTrainWorker pid=55226)[0m 0.189     Total estimated model params size (MB)
[36m(RayTrainWorker pid=55226)[0m 6         Modules in train mode
[36m(RayTrainWorker pid=55226)[0m 0         Modules in eval mode
[36m(RayTrainWorker 

[36m(RayTrainWorker pid=55402)[0m 
[36m(RayTrainWorker pid=55402)[0m 
[36m(RayTrainWorker pid=55402)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmppf8qrx2a/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmppf8qrx2a/MNIST/raw[32m [repeated 2x across cluster][0m


[36m(RayTrainWorker pid=55302)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=55302)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=55302)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=55302)[0m 134 K     Trainable params
[36m(RayTrainWorker pid=55302)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=55302)[0m 134 K     Total params
[36m(RayTrainWorker pid=55302)[0m 0.539     Total estimated model params size (MB)
[36m(RayTrainWorker pid=55302)[0m 6         Modules in train mode
[36m(RayTrainWorker pid=55302)[0m 0         Modules in eval mode
[36m(RayTrainWorker pid=55302)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'val_dataloade

[36m(RayTrainWorker pid=55402)[0m 
[36m(RayTrainWorker pid=55402)[0m 


[36m(RayTrainWorker pid=55226)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
[36m(RayTrainWorker pid=55402)[0m 
[36m(RayTrainWorker pid=54571)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00020_20_batch_size=64,dropout=0.1971,layer_1_size=128,layer_2_size=64,layer_3_size=512,learning_rate=0.0003_2024-12-24_17-34-21/checkpoint_000003)
[36m(RayTrainWorker pid=54571)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=54571)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=54571)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
[36m(RayTrainWorker 

[36m(RayTrainWorker pid=55748)[0m printing config {'layer_1_size': 64, 'layer_2_size': 256, 'layer_3_size': 512, 'dropout': 0.11186246065003927, 'batch_size': 32, 'learning_rate': 0.013959371950979267}
[36m(RayTrainWorker pid=55748)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 9x across cluster][0m
[36m(RayTrainWorker pid=55402)[0m Failed to download (trying next):[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55402)[0m HTTP Error 403: Forbidden[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55402)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmppf8qrx2a/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55402)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmppf8qrx2a/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9j



[36m(RayTrainWorker pid=55748)[0m 
[36m(RayTrainWorker pid=55749)[0m 


  0%|          | 0.00/9.91M [00:00<?, ?B/s]
  1%|          | 98.3k/9.91M [00:00<00:16, 590kB/s]
  4%|▍         | 393k/9.91M [00:00<00:08, 1.15MB/s]
 14%|█▍        | 1.38M/9.91M [00:00<00:02, 3.83MB/s]
 21%|██        | 2.10M/9.91M [00:00<00:01, 4.87MB/s]
 96%|█████████▌| 9.54M/9.91M [00:01<00:00, 7.69MB/s]
100%|██████████| 9.91M/9.91M [00:01<00:00, 6.59MB/s]


[36m(RayTrainWorker pid=55748)[0m 


100%|██████████| 9.91M/9.91M [00:01<00:00, 8.05MB/s]


[36m(RayTrainWorker pid=55749)[0m 
[36m(RayTrainWorker pid=55748)[0m 


[36m(RayTrainWorker pid=55302)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00026_26_batch_size=32,dropout=0.2573,layer_1_size=128,layer_2_size=128,layer_3_size=128,learning_rate=0.0010_2024-12-24_17-34-21/checkpoint_000000)


[36m(RayTrainWorker pid=55749)[0m 


100%|██████████| 28.9k/28.9k [00:00<00:00, 348kB/s]
[36m(RayTrainWorker pid=55402)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=55402)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=55402)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
[36m(RayTrainWorker pid=55402)[0m     exitcode = _main(fd, parent_sentinel)
[36m(RayTrainWorker pid=55402)[0m                ^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=55402)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
[36m(RayTrainWorker pid=55402)[0m     self = reduction.pickle.load(from_parent)
[36m(RayTrainWorker pid=55402)[0m            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=55402)[0m _pickle.UnpicklingError: pickle data was truncated


[36m(RayTrainWorker pid=55748)[0m 
[36m(RayTrainWorker pid=55774)[0m 


100%|██████████| 28.9k/28.9k [00:00<00:00, 345kB/s]


[36m(RayTrainWorker pid=55749)[0m 
[36m(RayTrainWorker pid=55748)[0m 
[36m(RayTrainWorker pid=55749)[0m 


[36m(TorchTrainer pid=55740)[0m Started distributed worker processes: [32m [repeated 2x across cluster][0m
[36m(TorchTrainer pid=55740)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=55774) world_rank=0, local_rank=0, node_rank=0[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m Setting up process group for: env:// [rank=0, world_size=1][32m [repeated 2x across cluster][0m


[36m(RayTrainWorker pid=55774)[0m printing config {'layer_1_size': 64, 'layer_2_size': 256, 'layer_3_size': 128, 'dropout': 0.12668409320735985, 'batch_size': 128, 'learning_rate': 0.007232709332948}[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55749)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 13x across cluster][0m
[36m(RayTrainWorker pid=55749)[0m Failed to download (trying next):[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=55749)[0m HTTP Error 403: Forbidden[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=55749)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmphkymwgko/MNIST/raw/t10k-images-idx3-ubyte.gz[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=55749)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmphkymwgko/MNIST/raw/train-labels-idx1

[36m(RayTrainWorker pid=55774)[0m GPU available: True (mps), used: False[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m TPU available: False, using: 0 TPU cores[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m HPU available: False, using: 0 HPUs[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.[32m [repeated 2x across cluster][0m
 50%|████▉     | 819k/1.65M [00:00<00:00, 2.22MB/s]
100%|██████████| 1.65M/1

[36m(RayTrainWorker pid=55748)[0m 
[36m(RayTrainWorker pid=55774)[0m 
[36m(RayTrainWorker pid=55748)[0m 
[36m(RayTrainWorker pid=55774)[0m 


  0%|          | 0.00/1.65M [00:00<?, ?B/s][32m [repeated 6x across cluster][0m
 72%|███████▏  | 1.18M/1.65M [00:01<00:00, 1.14MB/s][32m [repeated 28x across cluster][0m
[36m(RayTrainWorker pid=55748)[0m 
[36m(RayTrainWorker pid=55748)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=55748)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=55748)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=55748)[0m 1 | layer1   | Linear             | 50.2 K | train
[36m(RayTrainWorker pid=55748)[0m 2 | layer2   | Linear             | 16.6 K | train
[36m(RayTrainWorker pid=55748)[0m 3 | layer3   | Linear             | 131 K  | train
[36m(RayTrainWorker pid=55748)[0m 4 | layer4   | Linear             | 5.1 K  | train
[36m(RayTrainWorker pid=55748)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=55748)[0m ---------------------------------------------------

[36m(RayTrainWorker pid=55748)[0m 
[36m(RayTrainWorker pid=55749)[0m 


100%|██████████| 28.9k/28.9k [00:00<00:00, 335kB/s]


[36m(RayTrainWorker pid=55774)[0m 
[36m(RayTrainWorker pid=55749)[0m 
[36m(RayTrainWorker pid=55774)[0m 


[36m(RayTrainWorker pid=55749)[0m 


[36m(RayTrainWorker pid=55749)[0m 
[36m(RayTrainWorker pid=55774)[0m 
[36m(RayTrainWorker pid=55774)[0m 


[36m(RayTrainWorker pid=55748)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
[36m(RayTrainWorker pid=55402)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00027_27_batch_size=32,dropout=0.2871,layer_1_size=32,layer_2_size=64,layer_3_size=512,learning_rate=0.0384_2024-12-24_17-34-21/checkpoint_000000)


[36m(RayTrainWorker pid=55774)[0m 


[36m(RayTrainWorker pid=55774)[0m 
[36m(TorchTrainer pid=55850)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=55850)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=55909) world_rank=0, local_rank=0, node_rank=0
100%|██████████| 4.54k/4.54k [00:00<00:00, 1.88MB/s][32m [repeated 5x across cluster][0m
100%|██████████| 1.65M/1.65M [00:00<00:00, 3.67MB/s][32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m   | Name     | Type               | Params | Mode [32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m --------------------------------------------------------[32m [repeated 4x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m 0 | accuracy | MulticlassAccuracy | 0      | train[32m [repeated 2x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m 4 | layer4   | Linear             | 1.3 K  | train[32m [repeated 8x across cluster][0m
[36m(RayTrainWorker pid=55774)[0

[36m(RayTrainWorker pid=55909)[0m printing config {'layer_1_size': 32, 'layer_2_size': 256, 'layer_3_size': 512, 'dropout': 0.11993337335730632, 'batch_size': 64, 'learning_rate': 0.042333702941545745}
[36m(RayTrainWorker pid=55774)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 10x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m Failed to download (trying next):[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m HTTP Error 403: Forbidden[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp3op2msxx/MNIST/raw/t10k-labels-idx1-ubyte.gz[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=55774)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp3op2msxx/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p1

[36m(RayTrainWorker pid=55909)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=55909)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=55909)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=55909)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=55909)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=55909)[0m 


[36m(RayTrainWorker pid=55774)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.[32m [repeated 2x across cluster][0m


[36m(RayTrainWorker pid=55909)[0m 
[36m(RayTrainWorker pid=55909)[0m 
[36m(RayTrainWorker pid=55909)[0m 


100%|██████████| 28.9k/28.9k [00:00<00:00, 334kB/s][32m [repeated 2x across cluster][0m
100%|██████████| 9.91M/9.91M [00:00<00:00, 10.7MB/s][32m [repeated 7x across cluster][0m


[36m(RayTrainWorker pid=55909)[0m 


[36m(RayTrainWorker pid=55302)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00026_26_batch_size=32,dropout=0.2573,layer_1_size=128,layer_2_size=128,layer_3_size=128,learning_rate=0.0010_2024-12-24_17-34-21/checkpoint_000001)


[36m(RayTrainWorker pid=55909)[0m 
[36m(RayTrainWorker pid=55909)[0m Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=55909)[0m Failed to download (trying next):[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=55909)[0m HTTP Error 403: Forbidden[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=55909)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0xrl8j88/MNIST/raw/t10k-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=55909)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0xrl8j88/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0xrl8j88/MNIST/raw[32m [repeated 3x across cluster][0m


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.60MB/s]


[36m(RayTrainWorker pid=55909)[0m 
[36m(RayTrainWorker pid=55909)[0m 


[36m(RayTrainWorker pid=55909)[0m 
[36m(RayTrainWorker pid=55909)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=55909)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=55909)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=55909)[0m 1 | layer1   | Linear             | 25.1 K | train
[36m(RayTrainWorker pid=55909)[0m 2 | layer2   | Linear             | 8.4 K  | train
[36m(RayTrainWorker pid=55909)[0m 3 | layer3   | Linear             | 131 K  | train
[36m(RayTrainWorker pid=55909)[0m 4 | layer4   | Linear             | 5.1 K  | train
[36m(RayTrainWorker pid=55909)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=55909)[0m --------------------------------------------------------
[36m(RayTrainWorker pid=55909)[0m 170 K     Trainable params
[36m(RayTrainWorker pid=55909)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=55909)[0m 17

[36m(RayTrainWorker pid=56410)[0m printing config {'layer_1_size': 128, 'layer_2_size': 64, 'layer_3_size': 512, 'dropout': 0.11368157681297185, 'batch_size': 64, 'learning_rate': 0.029065376897065908}
[36m(RayTrainWorker pid=55909)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=55909)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=55909)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=55909)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0xrl8j88/MNIST/raw/t10k-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=55909)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0xrl8j88/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp0xrl8j88/MNIST/raw
[36m(RayTrainWorker pid=56410)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.g

[36m(RayTrainWorker pid=56410)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=56410)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=56410)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=56410)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=56410)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=56410)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56410)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56410)[0m 
[36m(RayTrainWorker pid=56410)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
[36m(RayTrainWorker pid=56410)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpurwzkfl9/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0.00/9.91M [00:00<?, ?B/s]
  1%|          | 65.5k/9.91M [00:00<00:24, 395kB/s]
  3%|▎         | 262k/9.91M [00:00<00:11, 853kB/s] 
 11%|█         | 1.05M/9.91M [00:00<00:03, 2.58MB/s]
 15%|█▍        | 1.44M/9.91M [00:00<00:02, 2.93MB/s]
 42%|████▏     | 4.13M/9.91M [00:00<00:00, 8.30MB/s]
 54%|█████▍    | 5.34M/9.91M [00:00<00:00, 9.28MB/s]
 70%|██████▉   | 6.91M/9.91M [00:00<00:00, 11.0MB/s]
 84%|████████▎ | 8.29M/9.91M [00:01<00:00, 10.6MB/s]
 96%|█████████▌| 9.50M/9.91M [00:01<00:00, 10.9MB/s]
100%|██████████| 9.91M/9.91M [00:01<00:00, 8.05MB/s]


[36m(RayTrainWorker pid=56410)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpurwzkfl9/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpurwzkfl9/MNIST/raw
[36m(RayTrainWorker pid=56410)[0m 
[36m(RayTrainWorker pid=56410)[0m Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=56410)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56410)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56410)[0m 
[36m(RayTrainWorker pid=56410)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpurwzkfl9/MNIST/raw/train-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=56481)[0m 


  0%|          | 0.00/28.9k [00:00<?, ?B/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 341kB/s]


[36m(RayTrainWorker pid=56410)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpurwzkfl9/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpurwzkfl9/MNIST/raw
[36m(RayTrainWorker pid=56410)[0m 


[36m(TorchTrainer pid=56369)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=56369)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=56481) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=56481)[0m Setting up process group for: env:// [rank=0, world_size=1]


[36m(RayTrainWorker pid=56410)[0m 


[36m(RayTrainWorker pid=55749)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00029_29_batch_size=32,dropout=0.2170,layer_1_size=64,layer_2_size=64,layer_3_size=256,learning_rate=0.0003_2024-12-24_17-34-21/checkpoint_000001)
[36m(RayTrainWorker pid=55749)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=55749)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=55749)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
[36m(RayTrainWorker pid=55749)[0m     exitcode = _main(fd, parent_sentinel)
[36m(RayTrainWorker pid=55749)[0m                ^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pid=55749)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
[36m(RayTrainWorker pid=55749)[0m     self 

[36m(RayTrainWorker pid=56481)[0m 
[36m(RayTrainWorker pid=56410)[0m 
[36m(RayTrainWorker pid=56481)[0m printing config {'layer_1_size': 32, 'layer_2_size': 64, 'layer_3_size': 512, 'dropout': 0.21635566512546026, 'batch_size': 128, 'learning_rate': 0.0009346762863871827}
[36m(RayTrainWorker pid=56410)[0m Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=56481)[0m 


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.23MB/s]
[36m(TorchTrainer pid=56482)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=56482)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=56542) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=56542)[0m Setting up process group for: env:// [rank=0, world_size=1]


[36m(RayTrainWorker pid=56481)[0m Failed to download (trying next):[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=56481)[0m HTTP Error 403: Forbidden[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=56410)[0m 
[36m(RayTrainWorker pid=56481)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp6xuf7bsb/MNIST/raw/train-labels-idx1-ubyte.gz[32m [repeated 3x across cluster][0m


100%|██████████| 28.9k/28.9k [00:00<00:00, 332kB/s]


[36m(RayTrainWorker pid=56481)[0m 


[36m(RayTrainWorker pid=56542)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=56542)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=56542)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=56542)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=56542)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.
  0%|          | 0.00/28.9k [00:00<?, ?B/s][32m [repeated 3x across cluster][0m
 22%|██▏       | 360k/1.65M [00:00<00:01, 1.19MB/s][32m [repeated 9x across cluster][0m
[36m(RayTrainWorker pid=56410)[0m 
[36m(RayTrainWorker pid=56410)[0m   | Name     | Type  

[36m(RayTrainWorker pid=56542)[0m printing config {'layer_1_size': 128, 'layer_2_size': 64, 'layer_3_size': 128, 'dropout': 0.2389097471313951, 'batch_size': 128, 'learning_rate': 0.00245666823916289}
[36m(RayTrainWorker pid=56410)[0m 
[36m(RayTrainWorker pid=56481)[0m 
[36m(RayTrainWorker pid=56542)[0m 


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.18MB/s]


[36m(RayTrainWorker pid=56481)[0m 
[36m(RayTrainWorker pid=56481)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp6xuf7bsb/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmp6xuf7bsb/MNIST/raw[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=56481)[0m 


100%|██████████| 9.91M/9.91M [00:01<00:00, 8.53MB/s]


[36m(RayTrainWorker pid=56542)[0m 
[36m(RayTrainWorker pid=56481)[0m 


[36m(RayTrainWorker pid=56481)[0m 


[36m(RayTrainWorker pid=56542)[0m 


[36m(RayTrainWorker pid=56410)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=56542)[0m 
[36m(RayTrainWorker pid=56542)[0m 
[36m(RayTrainWorker pid=56542)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 12x across cluster][0m




[36m(RayTrainWorker pid=56542)[0m Failed to download (trying next):[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=56542)[0m HTTP Error 403: Forbidden[32m [repeated 6x across cluster][0m


[36m(RayTrainWorker pid=55302)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00026_26_batch_size=32,dropout=0.2573,layer_1_size=128,layer_2_size=128,layer_3_size=128,learning_rate=0.0010_2024-12-24_17-34-21/checkpoint_000003)
  0%|          | 0.00/1.65M [00:00<?, ?B/s][32m [repeated 6x across cluster][0m
 46%|████▌     | 754k/1.65M [00:00<00:00, 2.23MB/s][32m [repeated 13x across cluster][0m
[36m(RayTrainWorker pid=55302)[0m Traceback (most recent call last):
[36m(RayTrainWorker pid=55302)[0m   File "<string>", line 1, in <module>
[36m(RayTrainWorker pid=55302)[0m   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
[36m(RayTrainWorker pid=55302)[0m     exitcode = _main(fd, parent_sentinel)
[36m(RayTrainWorker pid=55302)[0m                ^^^^^^^^^^^^^^^^^^^^^^^^^^
[36m(RayTrainWorker pi

[36m(RayTrainWorker pid=56542)[0m 
[36m(RayTrainWorker pid=56542)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpr16z8p18/MNIST/raw/t10k-images-idx3-ubyte.gz[32m [repeated 6x across cluster][0m
[36m(RayTrainWorker pid=56542)[0m 


[36m(RayTrainWorker pid=56542)[0m 
[36m(RayTrainWorker pid=56542)[0m   | Name     | Type               | Params | Mode 
[36m(RayTrainWorker pid=56542)[0m 0 | accuracy | MulticlassAccuracy | 0      | train
[36m(RayTrainWorker pid=56542)[0m 5 | dropout  | Dropout            | 0      | train
[36m(RayTrainWorker pid=56542)[0m 118 K     Trainable params
[36m(RayTrainWorker pid=56542)[0m 0         Non-trainable params
[36m(RayTrainWorker pid=56542)[0m 118 K     Total params
[36m(RayTrainWorker pid=56542)[0m 0.473     Total estimated model params size (MB)
[36m(RayTrainWorker pid=56542)[0m 6         Modules in train mode
[36m(RayTrainWorker pid=56542)[0m 0         Modules in eval mode
[36m(RayTrainWorker pid=56542)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'val_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=56542)[0m 
[36m(RayTrainWorker pid=56542)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpr16z8p18/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpr16z8p18/MNIST/raw[32m [repeated 5x across cluster][0m


[36m(RayTrainWorker pid=56664)[0m Setting up process group for: env:// [rank=0, world_size=1]
[36m(RayTrainWorker pid=56481)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
[36m(TorchTrainer pid=56577)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=56577)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=56664) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=56542)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
100%|██████████| 4.54k/4.54k [00:00<00:00, 4.18MB/s]
[36m(RayTrainWorker

[36m(RayTrainWorker pid=56664)[0m printing config {'layer_1_size': 64, 'layer_2_size': 64, 'layer_3_size': 256, 'dropout': 0.15144247022590526, 'batch_size': 64, 'learning_rate': 0.002973866475711854}
[36m(RayTrainWorker pid=56664)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m
[36m(RayTrainWorker pid=56542)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56542)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56542)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpr16z8p18/MNIST/raw/t10k-labels-idx1-ubyte.gz


[36m(RayTrainWorker pid=56664)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=56664)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=56664)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=56664)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=56664)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=56664)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56664)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56664)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0.00/9.91M [00:00<?, ?B/s]
  1%|          | 65.5k/9.91M [00:00<00:16, 592kB/s]
  2%|▏         | 197k/9.91M [00:00<00:11, 831kB/s] 
  8%|▊         | 819k/9.91M [00:00<00:03, 2.84MB/s]
 17%|█▋        | 1.70M/9.91M [00:00<00:01, 4.31MB/s]
 23%|██▎       | 2.33M/9.91M [00:00<00:01, 4.78MB/s]
 60%|██████    | 6.00M/9.91M [00:00<00:00, 11.5MB/s]
 71%|███████   | 7.01M/9.91M [00:00<00:00, 11.0MB/s]
 90%|█████████ | 8.95M/9.91M [00:01<00:00, 11.7MB/s]
100%|██████████| 9.91M/9.91M [00:01<00:00, 8.73MB/s]


[36m(RayTrainWorker pid=56664)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56664)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56664)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56664)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0.00/28.9k [00:00<?, ?B/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 340kB/s]


[36m(RayTrainWorker pid=56664)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56664)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56664)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56664)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0.00/1.65M [00:00<?, ?B/s]
  6%|▌         | 98.3k/1.65M [00:00<00:02, 588kB/s]
 26%|██▌       | 426k/1.65M [00:00<00:00, 1.40MB/s]


[36m(RayTrainWorker pid=56664)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56664)[0m Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 6x across cluster][0m


100%|██████████| 1.65M/1.65M [00:00<00:00, 3.74MB/s]
[36m(TorchTrainer pid=56678)[0m Started distributed worker processes: 
[36m(TorchTrainer pid=56678)[0m - (node_id=7cc2334a334327f410131b140fb9ad2ae54820f537299598f92261b4, ip=127.0.0.1, pid=56794) world_rank=0, local_rank=0, node_rank=0
[36m(RayTrainWorker pid=56794)[0m Setting up process group for: env:// [rank=0, world_size=1]


[36m(RayTrainWorker pid=56664)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56664)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56664)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/t10k-labels-idx1-ubyte.gz


[36m(RayTrainWorker pid=56481)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00033_33_batch_size=128,dropout=0.2164,layer_1_size=32,layer_2_size=64,layer_3_size=512,learning_rate=0.0009_2024-12-24_17-34-21/checkpoint_000000)
[36m(RayTrainWorker pid=56794)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=56794)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=56794)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=56794)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=56794)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Sett

[36m(RayTrainWorker pid=56794)[0m printing config {'layer_1_size': 32, 'layer_2_size': 128, 'layer_3_size': 512, 'dropout': 0.14029510397823686, 'batch_size': 64, 'learning_rate': 0.039443335175837424}
[36m(RayTrainWorker pid=56664)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmph5dp9y5w/MNIST/raw
[36m(RayTrainWorker pid=56664)[0m 
[36m(RayTrainWorker pid=56794)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56794)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56794)[0m 
[36m(RayTrainWorker pid=56794)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/train-images-idx3-ubyte.gz


(RayTrainWorker pid=56410) Traceback (most recent call last):
(RayTrainWorker pid=56410)   File "<string>", line 1, in <module>
(RayTrainWorker pid=56410)   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
(RayTrainWorker pid=56410)     exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=56410)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=56410)   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
(RayTrainWorker pid=56410)     self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=56410)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=56410) _pickle.UnpicklingError: pickle data was truncated

[36m(RayTrainWorker pid=56794)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw
[36m(RayTrainWorker pid=56794)[0m 
[36m(RayTrainWorker pid=56794)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56794)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56794)[0m 
[36m(RayTrainWorker pid=56794)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/train-labels-idx1-ubyte.gz




[36m(RayTrainWorker pid=56794)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw
[36m(RayTrainWorker pid=56794)[0m 


[36m(RayTrainWorker pid=56664)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.


[36m(RayTrainWorker pid=56794)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56794)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56794)[0m 
[36m(RayTrainWorker pid=56794)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 7x across cluster][0m
[36m(RayTrainWorker pid=56794)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/t10k-images-idx3-ubyte.gz


[36m(RayTrainWorker pid=56542)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00034_34_batch_size=128,dropout=0.2389,layer_1_size=128,layer_2_size=64,layer_3_size=128,learning_rate=0.0025_2024-12-24_17-34-21/checkpoint_000000)[32m [repeated 2x across cluster][0m


[36m(RayTrainWorker pid=56794)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw
[36m(RayTrainWorker pid=56794)[0m 
[36m(RayTrainWorker pid=56794)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=56794)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=56794)[0m 
[36m(RayTrainWorker pid=56794)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/t10k-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=56794)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpzyr0ze4a/MNIST/raw
[36m(RayTrainWorker pid=56794)[0m 


[36m(RayTrainWorker pid=56794)[0m 
(RayTrainWorker pid=56794)   | Name     | Type               | Params | Mode
(RayTrainWorker pid=56794) --------------------------------------------------------
(RayTrainWorker pid=56794) 0 | accuracy | MulticlassAccuracy | 0      | train
(RayTrainWorker pid=56794) 1 | layer1   | Linear             | 25.1 K | train
(RayTrainWorker pid=56794) 2 | layer2   | Linear             | 4.2 K  | train
(RayTrainWorker pid=56794) 3 | layer3   | Linear             | 66.0 K | train
(RayTrainWorker pid=56794) 4 | layer4   | Linear             | 5.1 K  | train
(RayTrainWorker pid=56794) 5 | dropout  | Dropout            | 0      | train
(RayTrainWorker pid=56794) --------------------------------------------------------
(RayTrainWorker pid=56794) 100 K     Trainable params
(RayTrainWorker pid=56794) 0         Non-trainable params
[36m(RayTrainWorker pid=56794)[0m 10

[36m(RayTrainWorker pid=57010)[0m printing config {'layer_1_size': 128, 'layer_2_size': 256, 'layer_3_size': 256, 'dropout': 0.23871449076395343, 'batch_size': 64, 'learning_rate': 0.00025259435896183397}
[36m(RayTrainWorker pid=57010)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m


[36m(RayTrainWorker pid=57010)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=57010)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=57010)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=57010)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=57010)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=57010)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57010)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57010)[0m 
[36m(RayTrainWorker pid=57010)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/train-images-idx3-ubyte.gz


[36m(RayTrainWorker pid=56542)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00034_34_batch_size=128,dropout=0.2389,layer_1_size=128,layer_2_size=64,layer_3_size=128,learning_rate=0.0025_2024-12-24_17-34-21/checkpoint_000001)


[36m(RayTrainWorker pid=57010)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw
[36m(RayTrainWorker pid=57010)[0m 
[36m(RayTrainWorker pid=57010)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57010)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57010)[0m 




[36m(RayTrainWorker pid=57010)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/train-labels-idx1-ubyte.gz




[36m(RayTrainWorker pid=57010)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw
[36m(RayTrainWorker pid=57010)[0m 
[36m(RayTrainWorker pid=57010)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57010)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57010)[0m 
[36m(RayTrainWorker pid=57010)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/t10k-images-idx3-ubyte.gz




[36m(RayTrainWorker pid=57010)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw
[36m(RayTrainWorker pid=57010)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz[32m [repeated 5x across cluster][0m
[36m(RayTrainWorker pid=57010)[0m 
[36m(RayTrainWorker pid=57010)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57010)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57010)[0m 
[36m(RayTrainWorker pid=57010)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/t10k-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=57010)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpl9d_ugit/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tm

[36m(RayTrainWorker pid=57010)[0m 
(RayTrainWorker pid=57010)   | Name     | Type               | Params | Mode
(RayTrainWorker pid=57010) --------------------------------------------------------
(RayTrainWorker pid=57010) 0 | accuracy | MulticlassAccuracy | 0      | train
(RayTrainWorker pid=57010) 1 | layer1   | Linear             | 100 K  | train
(RayTrainWorker pid=57010) 2 | layer2   | Linear             | 33.0 K | train
(RayTrainWorker pid=57010) 3 | layer3   | Linear             | 65.8 K | train
(RayTrainWorker pid=57010) 4 | layer4   | Linear             | 2.6 K  | train
(RayTrainWorker pid=57010) 5 | dropout  | Dropout            | 0      | train
(RayTrainWorker pid=57010) --------------------------------------------------------
(RayTrainWorker pid=57010) 201 K     Trainable params
[36m(RayTrainWorker pid=57010)[0m 0         Non-tr

[36m(RayTrainWorker pid=57177)[0m printing config {'layer_1_size': 64, 'layer_2_size': 128, 'layer_3_size': 512, 'dropout': 0.21327135716305037, 'batch_size': 64, 'learning_rate': 0.00654805944708387}
[36m(RayTrainWorker pid=57177)[0m Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz[32m [repeated 3x across cluster][0m


[36m(RayTrainWorker pid=57177)[0m GPU available: True (mps), used: False
[36m(RayTrainWorker pid=57177)[0m TPU available: False, using: 0 TPU cores
[36m(RayTrainWorker pid=57177)[0m HPU available: False, using: 0 HPUs
[36m(RayTrainWorker pid=57177)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
[36m(RayTrainWorker pid=57177)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/loops/utilities.py:73: `max_epochs` was not set. Setting it to 1000 epochs. To train without an epoch limit, set `max_epochs=-1`.


[36m(RayTrainWorker pid=57177)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57177)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57177)[0m 


[36m(RayTrainWorker pid=57010)[0m /Users/sidharrthnagappan/.virtualenvs/lsdp_miniproject/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:420: Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.
[36m(RayTrainWorker pid=56794)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00036_36_batch_size=64,dropout=0.1403,layer_1_size=32,layer_2_size=128,layer_3_size=512,learning_rate=0.0394_2024-12-24_17-34-21/checkpoint_000000)


[36m(RayTrainWorker pid=57177)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/train-images-idx3-ubyte.gz


(RayTrainWorker pid=56794) Traceback (most recent call last):
(RayTrainWorker pid=56794)   File "<string>", line 1, in <module>
(RayTrainWorker pid=56794)   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
(RayTrainWorker pid=56794)     exitcode = _main(fd, parent_sentinel)
(RayTrainWorker pid=56794)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=56794)   File "/Users/sidharrthnagappan/.pyenv/versions/3.11.0/lib/python3.11/multiprocessing/spawn.py", line 130, in _main
(RayTrainWorker pid=56794)     self = reduction.pickle.load(from_parent)
(RayTrainWorker pid=56794)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayTrainWorker pid=56794) _pickle.UnpicklingError: pickle data was truncated

[36m(RayTrainWorker pid=57177)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/train-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw
[36m(RayTrainWorker pid=57177)[0m 


2024-12-24 17:40:21,832	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19' in 0.0255s.


[36m(RayTrainWorker pid=57177)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57177)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57177)[0m 
[36m(RayTrainWorker pid=57177)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/train-labels-idx1-ubyte.gz




[36m(RayTrainWorker pid=57177)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/train-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw
[36m(RayTrainWorker pid=57177)[0m 
[36m(RayTrainWorker pid=57177)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57177)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57177)[0m 
[36m(RayTrainWorker pid=57177)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/t10k-images-idx3-ubyte.gz




[36m(RayTrainWorker pid=57177)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/t10k-images-idx3-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw
[36m(RayTrainWorker pid=57177)[0m 
[36m(RayTrainWorker pid=57177)[0m Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz[32m [repeated 6x across cluster][0m




[36m(RayTrainWorker pid=57177)[0m Failed to download (trying next):
[36m(RayTrainWorker pid=57177)[0m HTTP Error 403: Forbidden
[36m(RayTrainWorker pid=57177)[0m 
[36m(RayTrainWorker pid=57177)[0m Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/t10k-labels-idx1-ubyte.gz
[36m(RayTrainWorker pid=57177)[0m Extracting /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw/t10k-labels-idx1-ubyte.gz to /var/folders/mf/cfzv46p15017stkscsq9jw9h0000gn/T/tmpynugre8n/MNIST/raw
[36m(RayTrainWorker pid=57177)[0m 


[36m(RayTrainWorker pid=56542)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00034_34_batch_size=128,dropout=0.2389,layer_1_size=128,layer_2_size=64,layer_3_size=128,learning_rate=0.0025_2024-12-24_17-34-21/checkpoint_000002)
[36m(RayTrainWorker pid=57177)[0m 
(RayTrainWorker pid=57177)   | Name     | Type               | Params | Mode
(RayTrainWorker pid=57177) --------------------------------------------------------
(RayTrainWorker pid=57177) 0 | accuracy | MulticlassAccuracy | 0      | train
(RayTrainWorker pid=57177) 1 | layer1   | Linear             | 50.2 K | train
(RayTrainWorker pid=57177) 2 | layer2   | Linear             | 8.3 K  | train
(RayTrainWorker pid=57177) 3 | layer3   | Linear             | 66.0 K | train
[36m(RayTrainWorker pid=57177)[0m 4 

ResultGrid<[
  Result(
    metrics={'ptl/train_loss': 0.14144454896450043, 'ptl/train_accuracy': 0.9659090638160706, 'ptl/val_loss': 0.09280373156070709, 'ptl/val_accuracy': 0.970507800579071, 'epoch': 4, 'step': 2150},
    path='/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00000_0_batch_size=128,dropout=0.1647,layer_1_size=64,layer_2_size=256,layer_3_size=256,learning_rate=0.0013_2024-12-24_17-34-21',
    filesystem='local',
    checkpoint=Checkpoint(filesystem=local, path=/Users/sidharrthnagappan/ray_results/TorchTrainer_2024-12-24_17-34-19/TorchTrainer_4f4b8_00000_0_batch_size=128,dropout=0.1647,layer_1_size=64,layer_2_size=256,layer_3_size=256,learning_rate=0.0013_2024-12-24_17-34-21/checkpoint_000004)
  ),
  Result(
    metrics={'ptl/train_loss': 0.24482308328151703, 'ptl/train_accuracy': 0.9583333134651184, 'ptl/val_loss': 0.1930762678384781, 'ptl/val_accuracy': 0.9432357549667358, 'epoch': 0, 'step': 860},
    path='/Users/sidharrthnag



In [None]:
# One row per trial, holding the last reported metrics and the trial config.
tuner.get_results().get_dataframe()

In [9]:
from axsearch_multiobjective import AxSearchMultiObjective

In [1]:
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties
from ray import train, tune
from ray.train import CheckpointConfig, RunConfig
from ray.tune.search import ConcurrencyLimiter

# AxSearchMultiObjective is imported from the notebook's own
# axsearch_multiobjective module rather than from ray.tune.search.ax.
from axsearch_multiobjective import AxSearchMultiObjective


def evaluate(parameter: dict, checkpoint_dir=None):
    # Toy objective: echo the sampled parameters back as the two metrics,
    # so minimising "a" and "b" should push both parameters towards 0.
    train.report(
        {
            "a": parameter["a"],
            "b": parameter["b"],
        }
    )


ax_client = AxClient(
    verbose_logging=False,
    # enforce_sequential_optimization=False,
)
ax_client.create_experiment(
    name="test",
    parameters=[
        {
            "name": "a",
            "type": "range",
            "value_type": "float",
            "bounds": [0, 1.0],
        },
        {
            "name": "b",
            "type": "range",
            "value_type": "float",
            "bounds": [0, 1.0],
        },
    ],
    objectives={
        "a": ObjectiveProperties(minimize=True, threshold=0.5),
        "b": ObjectiveProperties(minimize=True, threshold=0.5),
    },
    overwrite_existing_experiment=True,
    is_test=False,
)

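# Limit the searcher to one trial at a time, so each new suggestion is made
# after the previous trial has reported its result.
algo = AxSearchMultiObjective(ax_client=ax_client)
algo = ConcurrencyLimiter(algo, max_concurrent=1)
tuner = tune.Tuner(
    tune.with_resources(evaluate, resources={"cpu": 1}),
    tune_config=tune.TuneConfig(search_alg=algo, num_samples=60),
    run_config=RunConfig(
        # Carried over from the Lightning experiment: `evaluate` reports neither
        # checkpoints nor "ptl/val_accuracy", so this scoring rule is not exercised.
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="ptl/val_accuracy",
            checkpoint_score_order="max",
        )
    ),
)
tuner.fit()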

0,1
Current time:,2024-12-24 23:24:27
Running for:,00:00:11.78
Memory:,22.9/32.0 GiB

Trial name,status,loc,a,b,iter,total time (s),a.1,b.1
evaluate_be1f910c,PENDING,,0.0,0.0,,,,
evaluate_3851fd8b,TERMINATED,127.0.0.1:89948,0.205193,0.946294,1.0,0.000143051,0.205193,0.946294
evaluate_9396dabf,TERMINATED,127.0.0.1:89950,0.965835,0.362051,1.0,0.000108004,0.965835,0.362051
evaluate_cda636d3,TERMINATED,127.0.0.1:89976,0.599389,0.695756,1.0,0.000109911,0.599389,0.695756
evaluate_e747b542,TERMINATED,127.0.0.1:89977,0.35434,0.113071,1.0,0.000101805,0.35434,0.113071
evaluate_ce00982e,TERMINATED,127.0.0.1:89983,0.380549,0.585214,1.0,0.000183105,0.380549,0.585214
evaluate_b6258c52,TERMINATED,127.0.0.1:90005,0.0,0.0,1.0,0.000118971,0.0,0.0
evaluate_34172aa9,TERMINATED,127.0.0.1:90027,0.0,0.22987,1.0,0.000115156,0.0,0.22987
evaluate_05b4a7d3,TERMINATED,127.0.0.1:90028,0.0,0.0,1.0,0.000101089,0.0,0.0
evaluate_b64ec550,TERMINATED,127.0.0.1:90052,0.0,0.0,1.0,0.000108004,0.0,0.0


observations []
observation_features []
Suggested config: {'a': 0.20519250631332397, 'b': 0.9462937712669373}


  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'a': 0.20519250631332397, 'b': 0.9462937712669373, 'timestamp': 1735082657, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': '3851fd8b', 'date': '2024-12-24_23-24-17', 'time_this_iter_s': 0.0001430511474609375, 'time_total_s': 0.0001430511474609375, 'pid': 89948, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 0.0001430511474609375, 'iterations_since_restore': 1, 'experiment_tag': '1_a=0.2052,b=0.9463', 'config/a': 0.20519250631332397, 'config/b': 0.9462937712669373}
Metrics to include: ['a', 'b']
Metric dict after trial: {'a': (0.20519250631332397, None), 'b': (0.9462937712669373, None)}
observations [<ax.core.observation.Observation object at 0x165f1d050>]
observation_features [ObservationFeatures(parameters={'a': 0.20519250631332397, 'b': 0.9462937712669373}, trial_index=0)]
Suggested config: {'a': 0.9658348821103573, 'b': 0.3620511395856738}


  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'a': 0.9658348821103573, 'b': 0.3620511395856738, 'timestamp': 1735082658, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': '9396dabf', 'date': '2024-12-24_23-24-18', 'time_this_iter_s': 0.00010800361633300781, 'time_total_s': 0.00010800361633300781, 'pid': 89950, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 0.00010800361633300781, 'iterations_since_restore': 1, 'experiment_tag': '2_a=0.9658,b=0.3621', 'config/a': 0.9658348821103573, 'config/b': 0.3620511395856738}
Metrics to include: ['a', 'b']
Metric dict after trial: {'a': (0.9658348821103573, None), 'b': (0.3620511395856738, None)}
observations [<ax.core.observation.Observation object at 0x17f31b790>, <ax.core.observation.Observation object at 0x17f34f350>]
observation_features [ObservationFeatures(parameters={'a': 0.20519250631332397, 'b': 0.9462937712669373}, trial_index=0), ObservationFeatures(parameters={'a': 0.9658348821103573, 'b': 0.3620

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'a': 0.5993889393284917, 'b': 0.6957563571631908, 'timestamp': 1735082658, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': 'cda636d3', 'date': '2024-12-24_23-24-18', 'time_this_iter_s': 0.00010991096496582031, 'time_total_s': 0.00010991096496582031, 'pid': 89976, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 0.00010991096496582031, 'iterations_since_restore': 1, 'experiment_tag': '3_a=0.5994,b=0.6958', 'config/a': 0.5993889393284917, 'config/b': 0.6957563571631908}
Metrics to include: ['a', 'b']
Metric dict after trial: {'a': (0.5993889393284917, None), 'b': (0.6957563571631908, None)}
observations [<ax.core.observation.Observation object at 0x16a873e10>, <ax.core.observation.Observation object at 0x17f34d410>, <ax.core.observation.Observation object at 0x3867b1590>]
observation_features [ObservationFeatures(parameters={'a': 0.20519250631332397, 'b': 0.9462937712669373}, trial_index=0), Observation

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'a': 0.3543400028720498, 'b': 0.11307110730558634, 'timestamp': 1735082659, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': 'e747b542', 'date': '2024-12-24_23-24-19', 'time_this_iter_s': 0.00010180473327636719, 'time_total_s': 0.00010180473327636719, 'pid': 89977, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 0.00010180473327636719, 'iterations_since_restore': 1, 'experiment_tag': '4_a=0.3543,b=0.1131', 'config/a': 0.3543400028720498, 'config/b': 0.11307110730558634}
Metrics to include: ['a', 'b']
Metric dict after trial: {'a': (0.3543400028720498, None), 'b': (0.11307110730558634, None)}
observations [<ax.core.observation.Observation object at 0x386795d90>, <ax.core.observation.Observation object at 0x3867df650>, <ax.core.observation.Observation object at 0x17f304f10>, <ax.core.observation.Observation object at 0x3867b1f90>]
observation_features [ObservationFeatures(parameters={'a': 0.205192506313

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'a': 0.38054878637194633, 'b': 0.5852137189358473, 'timestamp': 1735082660, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': 'ce00982e', 'date': '2024-12-24_23-24-20', 'time_this_iter_s': 0.00018310546875, 'time_total_s': 0.00018310546875, 'pid': 89983, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 0.00018310546875, 'iterations_since_restore': 1, 'experiment_tag': '5_a=0.3805,b=0.5852', 'config/a': 0.38054878637194633, 'config/b': 0.5852137189358473}
Metrics to include: ['a', 'b']
Metric dict after trial: {'a': (0.38054878637194633, None), 'b': (0.5852137189358473, None)}
observations [<ax.core.observation.Observation object at 0x3867f7b90>, <ax.core.observation.Observation object at 0x17f318250>, <ax.core.observation.Observation object at 0x3867d0ed0>, <ax.core.observation.Observation object at 0x3867df310>, <ax.core.observation.Observation object at 0x3867df2d0>]
observation_features [ObservationF



observations [<ax.core.observation.Observation object at 0x17f33c210>, <ax.core.observation.Observation object at 0x38681b690>, <ax.core.observation.Observation object at 0x3867a0750>, <ax.core.observation.Observation object at 0x386832890>, <ax.core.observation.Observation object at 0x386842990>, <ax.core.observation.Observation object at 0x386843d90>]
observation_features [ObservationFeatures(parameters={'a': 0.20519250631332397, 'b': 0.9462937712669373}, trial_index=0), ObservationFeatures(parameters={'a': 0.9658348821103573, 'b': 0.3620511395856738}, trial_index=1), ObservationFeatures(parameters={'a': 0.5993889393284917, 'b': 0.6957563571631908}, trial_index=2), ObservationFeatures(parameters={'a': 0.3543400028720498, 'b': 0.11307110730558634}, trial_index=3), ObservationFeatures(parameters={'a': 0.38054878637194633, 'b': 0.5852137189358473}, trial_index=4), ObservationFeatures(parameters={'a': 0.0, 'b': 0.0}, trial_index=5)]
Suggested config: {'a': 0.0, 'b': 0.22987031718489753}




observations [<ax.core.observation.Observation object at 0x17f31a450>, <ax.core.observation.Observation object at 0x3867d06d0>, <ax.core.observation.Observation object at 0x3868706d0>, <ax.core.observation.Observation object at 0x3867ddcd0>, <ax.core.observation.Observation object at 0x386829b50>, <ax.core.observation.Observation object at 0x38687fd10>, <ax.core.observation.Observation object at 0x38688e490>]
observation_features [ObservationFeatures(parameters={'a': 0.20519250631332397, 'b': 0.9462937712669373}, trial_index=0), ObservationFeatures(parameters={'a': 0.9658348821103573, 'b': 0.3620511395856738}, trial_index=1), ObservationFeatures(parameters={'a': 0.5993889393284917, 'b': 0.6957563571631908}, trial_index=2), ObservationFeatures(parameters={'a': 0.3543400028720498, 'b': 0.11307110730558634}, trial_index=3), ObservationFeatures(parameters={'a': 0.38054878637194633, 'b': 0.5852137189358473}, trial_index=4), ObservationFeatures(parameters={'a': 0.0, 'b': 0.0}, trial_index=5)



observations [<ax.core.observation.Observation object at 0x38679aa10>, <ax.core.observation.Observation object at 0x3867f5350>, <ax.core.observation.Observation object at 0x38681a650>, <ax.core.observation.Observation object at 0x386842350>, <ax.core.observation.Observation object at 0x386872690>, <ax.core.observation.Observation object at 0x387c1e6d0>, <ax.core.observation.Observation object at 0x387c2bf90>, <ax.core.observation.Observation object at 0x3867a2210>]
observation_features [ObservationFeatures(parameters={'a': 0.20519250631332397, 'b': 0.9462937712669373}, trial_index=0), ObservationFeatures(parameters={'a': 0.9658348821103573, 'b': 0.3620511395856738}, trial_index=1), ObservationFeatures(parameters={'a': 0.5993889393284917, 'b': 0.6957563571631908}, trial_index=2), ObservationFeatures(parameters={'a': 0.3543400028720498, 'b': 0.11307110730558634}, trial_index=3), ObservationFeatures(parameters={'a': 0.38054878637194633, 'b': 0.5852137189358473}, trial_index=4), Observatio

2024-12-24 23:24:27,972	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-24-14' in 0.0093s.
2024-12-24 23:24:28,349	INFO tune.py:1041 -- Total run time: 12.18 seconds (11.77 seconds for the tuning loop).
Resume experiment with: Tuner.restore(path="/Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-24-14", trainable=...)
- evaluate_be1f910c: FileNotFoundError('Could not fetch metrics for evaluate_be1f910c: both result.json and progress.csv were not found at /Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-24-14/evaluate_be1f910c_11_a=0.0000,b=0.0000_2024-12-24_23-24-27')


ResultGrid<[
  Result(
    metrics={'a': 0.20519250631332397, 'b': 0.9462937712669373},
    path='/Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-24-14/evaluate_3851fd8b_1_a=0.2052,b=0.9463_2024-12-24_23-24-16',
    filesystem='local',
    checkpoint=None
  ),
  Result(
    metrics={'a': 0.9658348821103573, 'b': 0.3620511395856738},
    path='/Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-24-14/evaluate_9396dabf_2_a=0.9658,b=0.3621_2024-12-24_23-24-17',
    filesystem='local',
    checkpoint=None
  ),
  Result(
    metrics={'a': 0.5993889393284917, 'b': 0.6957563571631908},
    path='/Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-24-14/evaluate_cda636d3_3_a=0.5994,b=0.6958_2024-12-24_23-24-18',
    filesystem='local',
    checkpoint=None
  ),
  Result(
    metrics={'a': 0.3543400028720498, 'b': 0.11307110730558634},
    path='/Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-24-14/evaluate_e747b542_4_a=0.3543,b=0.1131_2024-12-24_23-24-18'

In [1]:
from ax.service.ax_client import AxClient
from ax.service.utils.instantiation import ObjectiveProperties
from ray import tune
from ray.air import session
from ray.tune.search import ConcurrencyLimiter
from axsearch_multiobjective import AxSearchMultiObjective
from ray.train import RunConfig, CheckpointConfig
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

# Load MNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST("./data", train=True, download=True, transform=transform),
    batch_size=64,
    shuffle=True,
)
val_loader = torch.utils.data.DataLoader(
    datasets.MNIST("./data", train=False, transform=transform), batch_size=64
)

# Define the evaluation function
def evaluate(parameter: dict, checkpoint_dir=None):
    """
    Evaluate a neural architecture based on given parameters.
    This function trains and validates the model, returning accuracy and latency.
    """
    # Extract parameters
    num_layers = int(parameter["num_layers"])
    hidden_size = int(parameter["hidden_size"])
    kernel_size = int(parameter["kernel_size"])

    # Define a simple CNN architecture
    class SimpleCNN(nn.Module):
        def __init__(self, num_layers, hidden_size, kernel_size):
            super().__init__()
            layers = []
            in_channels = 1  # Input channels for MNIST (grayscale images)
            for _ in range(num_layers):
                layers.append(
                    nn.Conv2d(in_channels, hidden_size, kernel_size, stride=1, padding=1)
                )
                layers.append(nn.ReLU())
                in_channels = hidden_size

            self.conv = nn.Sequential(*layers)

            # Track the spatial size through the conv stack: with stride 1 and
            # padding 1, each layer maps n to n - kernel_size + 2 * 1 + 1
            conv_output_size = 28  # MNIST images are 28x28
            for _ in range(num_layers):
                conv_output_size = (conv_output_size - kernel_size + 2 * 1) + 1

            flattened_size = hidden_size * (conv_output_size ** 2)

            self.fc = nn.Linear(flattened_size, 10)  # Fully connected layer for 10 classes

        def forward(self, x):
            x = self.conv(x)
            x = x.view(x.size(0), -1)  # Flatten for the Linear layer
            x = self.fc(x)
            return x

    # Initialize model, criterion, and optimizer
    model = SimpleCNN(num_layers, hidden_size, kernel_size)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train on at most 51 mini-batches (a fraction of one epoch) to keep each trial fast
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx > 50:  # Stop once batches 0..50 have been processed
            break
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()

    # Evaluate on validation data
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for data, target in val_loader:
            output = model(data)
            _, predicted = output.max(1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    accuracy = correct / total
    latency = num_layers * hidden_size * kernel_size  # Mock latency (use real metric in production)

    # Report results to Ray Tune
    session.report({"accuracy": accuracy, "latency": latency})


# AxClient for NAS with MOBO
ax_client = AxClient(verbose_logging=False)
ax_client.create_experiment(
    name="nas_mobo",
    parameters=[
        {"name": "num_layers", "type": "range", "value_type": "int", "bounds": [1, 5]},
        {"name": "hidden_size", "type": "range", "value_type": "int", "bounds": [16, 128]},
        {"name": "kernel_size", "type": "choice", "values": [3, 5, 7]},
    ],
    objectives={
        "accuracy": ObjectiveProperties(minimize=False, threshold=0.8),
        "latency": ObjectiveProperties(minimize=True, threshold=1000),
    },
    overwrite_existing_experiment=True,
    is_test=False,
)

# Use AxSearchMultiObjective for search
algo = AxSearchMultiObjective(ax_client=ax_client)
algo = ConcurrencyLimiter(algo, max_concurrent=1)  # Limit to sequential trials

# Tuner for NAS
num_samples = 20  # Number of trials
tuner = tune.Tuner(
    tune.with_resources(evaluate, resources={"cpu": 2, "gpu": 0}),
    tune_config=tune.TuneConfig(search_alg=algo, num_samples=num_samples),
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=2,
            checkpoint_score_attribute="accuracy",
            checkpoint_score_order="max",
        )
    ),
)

# Run NAS tuning
result_grid = tuner.fit()

Current time:,2024-12-24 23:45:16
Running for:,00:04:58.20
Memory:,23.9/32.0 GiB

Trial name,status,loc,hidden_size,kernel_size,num_layers,iter,total time (s),accuracy,latency
evaluate_eec15173,TERMINATED,127.0.0.1:1499,87,3,1,1,3.46956,0.8761,261
evaluate_29ae04fb,TERMINATED,127.0.0.1:1599,21,5,5,1,10.4258,0.898,525
evaluate_58fa2603,TERMINATED,127.0.0.1:1742,127,5,4,1,75.9225,0.9125,2540
evaluate_dc80ba9d,TERMINATED,127.0.0.1:2620,54,7,2,1,12.9419,0.9276,756
evaluate_305570eb,TERMINATED,127.0.0.1:2809,103,7,3,1,43.5916,0.8984,2163
evaluate_c3ba019f,TERMINATED,127.0.0.1:3378,65,3,4,1,25.0104,0.9171,780
evaluate_7e725cc6,TERMINATED,127.0.0.1:3721,51,3,1,1,2.03818,0.8709,153
evaluate_14c4df33,TERMINATED,127.0.0.1:3793,16,3,5,1,7.11735,0.8688,240
evaluate_81dd0df6,TERMINATED,127.0.0.1:3917,16,7,1,1,1.35454,0.9152,112
evaluate_5a76f217,TERMINATED,127.0.0.1:3977,16,3,3,1,4.16296,0.887,144


observations []
observation_features []
Suggested config: {'num_layers': 1, 'hidden_size': 87, 'kernel_size': 3}


  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'accuracy': 0.8761, 'latency': 261, 'timestamp': 1735083625, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': 'eec15173', 'date': '2024-12-24_23-40-25', 'time_this_iter_s': 3.469562292098999, 'time_total_s': 3.469562292098999, 'pid': 1499, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 3.469562292098999, 'iterations_since_restore': 1, 'experiment_tag': '1_hidden_size=87,kernel_size=3,num_layers=1', 'config/num_layers': 1, 'config/hidden_size': 87, 'config/kernel_size': 3}
Metrics to include: ['accuracy', 'latency']
Metric dict after trial: {'accuracy': (0.8761, None), 'latency': (261, None)}
observations [<ax.core.observation.Observation object at 0x31eecf0d0>]
observation_features [ObservationFeatures(parameters={'num_layers': 0.09999839999359994, 'hidden_size': 0.632743386326263, 'kernel_size': 0.16666444442962952}, trial_index=0)]
Suggested config: {'num_layers': 5, 'hidden_size': 21, 'kernel_size

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'accuracy': 0.898, 'latency': 525, 'timestamp': 1735083638, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': '29ae04fb', 'date': '2024-12-24_23-40-38', 'time_this_iter_s': 10.425843000411987, 'time_total_s': 10.425843000411987, 'pid': 1599, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 10.425843000411987, 'iterations_since_restore': 1, 'experiment_tag': '2_hidden_size=21,kernel_size=5,num_layers=5', 'config/num_layers': 5, 'config/hidden_size': 21, 'config/kernel_size': 5}
Metrics to include: ['accuracy', 'latency']
Metric dict after trial: {'accuracy': (0.898, None), 'latency': (525, None)}
observations [<ax.core.observation.Observation object at 0x31eb10fd0>, <ax.core.observation.Observation object at 0x3cb47c810>]
observation_features [ObservationFeatures(parameters={'num_layers': 0.09999839999359994, 'hidden_size': 0.632743386326263, 'kernel_size': 0.16666444442962952}, trial_index=0), Observati

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'accuracy': 0.9125, 'latency': 2540, 'timestamp': 1735083717, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': '58fa2603', 'date': '2024-12-24_23-41-57', 'time_this_iter_s': 75.92250084877014, 'time_total_s': 75.92250084877014, 'pid': 1742, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 75.92250084877014, 'iterations_since_restore': 1, 'experiment_tag': '3_hidden_size=127,kernel_size=5,num_layers=4', 'config/num_layers': 4, 'config/hidden_size': 127, 'config/kernel_size': 5}
Metrics to include: ['accuracy', 'latency']
Metric dict after trial: {'accuracy': (0.9125, None), 'latency': (2540, None)}
observations [<ax.core.observation.Observation object at 0x31ef62bd0>, <ax.core.observation.Observation object at 0x3cb477a10>, <ax.core.observation.Observation object at 0x3cb43ff10>]
observation_features [ObservationFeatures(parameters={'num_layers': 0.09999839999359994, 'hidden_size': 0.632743386326263, 'k

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'accuracy': 0.9276, 'latency': 756, 'timestamp': 1735083733, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': 'dc80ba9d', 'date': '2024-12-24_23-42-13', 'time_this_iter_s': 12.94192385673523, 'time_total_s': 12.94192385673523, 'pid': 2620, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 12.94192385673523, 'iterations_since_restore': 1, 'experiment_tag': '4_hidden_size=54,kernel_size=7,num_layers=2', 'config/num_layers': 2, 'config/hidden_size': 54, 'config/kernel_size': 7}
Metrics to include: ['accuracy', 'latency']
Metric dict after trial: {'accuracy': (0.9276, None), 'latency': (756, None)}
observations [<ax.core.observation.Observation object at 0x3cb493390>, <ax.core.observation.Observation object at 0x3cb492bd0>, <ax.core.observation.Observation object at 0x3cb43c190>, <ax.core.observation.Observation object at 0x3cb462290>]
observation_features [ObservationFeatures(parameters={'num_layers': 0.09

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'accuracy': 0.8984, 'latency': 2163, 'timestamp': 1735083779, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': '305570eb', 'date': '2024-12-24_23-42-59', 'time_this_iter_s': 43.59163475036621, 'time_total_s': 43.59163475036621, 'pid': 2809, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 43.59163475036621, 'iterations_since_restore': 1, 'experiment_tag': '5_hidden_size=103,kernel_size=7,num_layers=3', 'config/num_layers': 3, 'config/hidden_size': 103, 'config/kernel_size': 7}
Metrics to include: ['accuracy', 'latency']
Metric dict after trial: {'accuracy': (0.8984, None), 'latency': (2163, None)}
observations [<ax.core.observation.Observation object at 0x3cb43fa10>, <ax.core.observation.Observation object at 0x3cb4ca8d0>, <ax.core.observation.Observation object at 0x3cb4d47d0>, <ax.core.observation.Observation object at 0x3cb4d7810>, <ax.core.observation.Observation object at 0x3cb4916d0>]
observation

  warn("Encountered exception in computing model fit quality: " + str(e))


Result: {'accuracy': 0.9171, 'latency': 780, 'timestamp': 1735083807, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': 'c3ba019f', 'date': '2024-12-24_23-43-27', 'time_this_iter_s': 25.010422945022583, 'time_total_s': 25.010422945022583, 'pid': 3378, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 25.010422945022583, 'iterations_since_restore': 1, 'experiment_tag': '6_hidden_size=65,kernel_size=3,num_layers=4', 'config/num_layers': 4, 'config/hidden_size': 65, 'config/kernel_size': 3}
Metrics to include: ['accuracy', 'latency']
Metric dict after trial: {'accuracy': (0.9171, None), 'latency': (780, None)}

2024-12-24 23:45:16,131	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/sidharrthnagappan/ray_results/evaluate_2024-12-24_23-40-16' in 0.0172s.


Result: {'accuracy': 0.9285, 'latency': 504, 'timestamp': 1735083916, 'checkpoint_dir_name': None, 'done': True, 'training_iteration': 1, 'trial_id': '01c12111', 'date': '2024-12-24_23-45-16', 'time_this_iter_s': 8.429319858551025, 'time_total_s': 8.429319858551025, 'pid': 4895, 'hostname': 'Sidharrths-MacBook-Pro-142.local', 'node_ip': '127.0.0.1', 'time_since_restore': 8.429319858551025, 'iterations_since_restore': 1, 'experiment_tag': '20_hidden_size=36,kernel_size=7,num_layers=2', 'config/num_layers': 2, 'config/hidden_size': 36, 'config/kernel_size': 7}
Metrics to include: ['accuracy', 'latency']
Metric dict after trial: {'accuracy': (0.9285, None), 'latency': (504, None)}


2024-12-24 23:45:16,136	INFO tune.py:1041 -- Total run time: 298.48 seconds (298.18 seconds for the tuning loop).
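
The size bookkeeping inside `SimpleCNN` and the mock latency reported by `evaluate` can be checked by hand without training anything. The sketch below repeats the same arithmetic in plain Python for a few configurations taken from the trial table above. The helper names `conv_output_size`, `flattened_size` and `mock_latency` are made up for this illustration; they are not part of the notebook's code or of any library.

```python
def conv_output_size(num_layers: int, kernel_size: int,
                     image_size: int = 28, padding: int = 1) -> int:
    """Spatial size after `num_layers` stride-1 convolutions (hypothetical helper)."""
    size = image_size
    for _ in range(num_layers):
        # Same update as in SimpleCNN: n -> n - kernel_size + 2 * padding + 1
        size = (size - kernel_size + 2 * padding) + 1
    return size


def flattened_size(num_layers: int, hidden_size: int, kernel_size: int) -> int:
    """Number of input features of the final Linear layer (hypothetical helper)."""
    side = conv_output_size(num_layers, kernel_size)
    return hidden_size * side ** 2


def mock_latency(num_layers: int, hidden_size: int, kernel_size: int) -> int:
    """The proxy used in `evaluate`; it is not a measured latency."""
    return num_layers * hidden_size * kernel_size


# (num_layers, hidden_size, kernel_size) for three trials from the table above
configs = [(1, 87, 3), (5, 21, 5), (2, 54, 7)]
for num_layers, hidden_size, kernel_size in configs:
    print(
        (num_layers, hidden_size, kernel_size),
        "side:", conv_output_size(num_layers, kernel_size),
        "fc inputs:", flattened_size(num_layers, hidden_size, kernel_size),
        "mock latency:", mock_latency(num_layers, hidden_size, kernel_size),
    )
```

With padding fixed at 1, a 3x3 kernel preserves the 28x28 size, while 5x5 and 7x7 kernels shrink each side by 2 and 4 pixels per layer respectively, so the input width of the final linear layer depends on both `kernel_size` and `num_layers`.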


In [4]:
result_grid.get_dataframe()

Unnamed: 0,accuracy,latency,timestamp,checkpoint_dir_name,done,training_iteration,trial_id,date,time_this_iter_s,time_total_s,pid,hostname,node_ip,time_since_restore,iterations_since_restore,config/num_layers,config/hidden_size,config/kernel_size,logdir
0,0.8761,261,1735083625,,False,1,eec15173,2024-12-24_23-40-25,3.469562,3.469562,1499,Sidharrths-MacBook-Pro-142.local,127.0.0.1,3.469562,1,1,87,3,eec15173
1,0.898,525,1735083638,,False,1,29ae04fb,2024-12-24_23-40-38,10.425843,10.425843,1599,Sidharrths-MacBook-Pro-142.local,127.0.0.1,10.425843,1,5,21,5,29ae04fb
2,0.9125,2540,1735083717,,False,1,58fa2603,2024-12-24_23-41-57,75.922501,75.922501,1742,Sidharrths-MacBook-Pro-142.local,127.0.0.1,75.922501,1,4,127,5,58fa2603
3,0.9276,756,1735083733,,False,1,dc80ba9d,2024-12-24_23-42-13,12.941924,12.941924,2620,Sidharrths-MacBook-Pro-142.local,127.0.0.1,12.941924,1,2,54,7,dc80ba9d
4,0.8984,2163,1735083779,,False,1,305570eb,2024-12-24_23-42-59,43.591635,43.591635,2809,Sidharrths-MacBook-Pro-142.local,127.0.0.1,43.591635,1,3,103,7,305570eb
5,0.9171,780,1735083807,,False,1,c3ba019f,2024-12-24_23-43-27,25.010423,25.010423,3378,Sidharrths-MacBook-Pro-142.local,127.0.0.1,25.010423,1,4,65,3,c3ba019f
6,0.8709,153,1735083813,,False,1,7e725cc6,2024-12-24_23-43-33,2.038176,2.038176,3721,Sidharrths-MacBook-Pro-142.local,127.0.0.1,2.038176,1,1,51,3,7e725cc6
7,0.8688,240,1735083824,,False,1,14c4df33,2024-12-24_23-43-44,7.117346,7.117346,3793,Sidharrths-MacBook-Pro-142.local,127.0.0.1,7.117346,1,5,16,3,14c4df33
8,0.9152,112,1735083830,,False,1,81dd0df6,2024-12-24_23-43-50,1.354536,1.354536,3917,Sidharrths-MacBook-Pro-142.local,127.0.0.1,1.354536,1,1,16,7,81dd0df6
9,0.887,144,1735083837,,False,1,5a76f217,2024-12-24_23-43-57,4.162961,4.162961,3977,Sidharrths-MacBook-Pro-142.local,127.0.0.1,4.162961,1,3,16,3,5a76f217
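
Since the goal is to trade accuracy against latency, it helps to see which of these trials are non-dominated. The following is a minimal sketch in plain Python, assuming only the ten rows displayed above (the run had 20 trials, so the full front may differ). The names `dominates` and `pareto_front` are hypothetical helpers written for this illustration, not Ax or Ray Tune functions.

```python
# (trial_id, accuracy, latency) copied from the ten rows displayed above
rows = [
    ("eec15173", 0.8761, 261),
    ("29ae04fb", 0.8980, 525),
    ("58fa2603", 0.9125, 2540),
    ("dc80ba9d", 0.9276, 756),
    ("305570eb", 0.8984, 2163),
    ("c3ba019f", 0.9171, 780),
    ("7e725cc6", 0.8709, 153),
    ("14c4df33", 0.8688, 240),
    ("81dd0df6", 0.9152, 112),
    ("5a76f217", 0.8870, 144),
]


def dominates(p, q):
    """True if p is at least as good as q on both objectives and strictly
    better on one (accuracy is maximised, latency is minimised)."""
    no_worse = p[1] >= q[1] and p[2] <= q[2]
    strictly_better = p[1] > q[1] or p[2] < q[2]
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = pareto_front(rows)
for trial_id, accuracy, latency in front:
    print(trial_id, accuracy, latency)
```

Among the displayed rows, only `dc80ba9d` (highest accuracy) and `81dd0df6` (lowest latency) survive; every other trial is beaten on both objectives by at least one of them.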


In [7]:
from ax.service.utils.report_utils import _pareto_frontier_scatter_2d_plotly

_pareto_frontier_scatter_2d_plotly(ax_client.experiment)



ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [6]:
!pip install --upgrade nbformat

Collecting nbformat
  Using cached nbformat-5.10.4-py3-none-any.whl.metadata (3.6 kB)
Collecting fastjsonschema>=2.15 (from nbformat)
  Using cached fastjsonschema-2.21.1-py3-none-any.whl.metadata (2.2 kB)
Using cached nbformat-5.10.4-py3-none-any.whl (78 kB)
Using cached fastjsonschema-2.21.1-py3-none-any.whl (23 kB)
Installing collected packages: fastjsonschema, nbformat
  Attempting uninstall: nbformat
    Found existing installation: nbformat 4.2.0
    Uninstalling nbformat-4.2.0:
      Successfully uninstalled nbformat-4.2.0
Successfully installed fastjsonschema-2.21.1 nbformat-5.10.4
