
[Tune] Bug in the BOHP. #37394

Closed
yap231995 opened this issue Jul 13, 2023 · 2 comments
Labels
bug Something that is supposed to be working; but isn't P2 Important issue, but not time-critical tune Tune-related issues

Comments

@yap231995

yap231995 commented Jul 13, 2023

What happened + What you expected to happen

The error below pops up; it says there is an issue with Ray Tune's execution logic. I was applying BOHB to my training runs and simply increased the number of samples to 50.

Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 1149, in _on_trial_reset
self._actor_started(tracked_actor, log="REUSED")
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 719, in _actor_started
self._unstage_trial_with_resources(trial)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 646, in _unstage_trial_with_resources
raise RuntimeError(
RuntimeError: Started a trial with resources requested by a different trial, but this trial was lost. This is an error in Ray Tune's execution logic. Please raise a GitHub issue at https://github.com/ray-project/ray/issues

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tuner.py", line 347, in fit
return self._local_tuner.fit()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 712, in _fit_internal
analysis = run(
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tune.py", line 1070, in run
runner.step()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 256, in step
if not self._actor_manager.next(timeout=0.1):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 224, in next
self._actor_task_events.resolve_future(future)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 118, in resolve_future
on_result(result)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 752, in on_result
self._actor_task_resolved(
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 300, in _actor_task_resolved
tracked_actor_task._on_result(tracked_actor, result)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 824, in _on_result
raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 1149, in _on_trial_reset
self._actor_started(tracked_actor, log="REUSED")
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 719, in _actor_started
self._unstage_trial_with_resources(trial)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 646, in _unstage_trial_with_resources
raise RuntimeError(
RuntimeError: Started a trial with resources requested by a different trial, but this trial was lost. This is an error in Ray Tune's execution logic. Please raise a GitHub issue at https://github.com/ray-project/ray/issues

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/trevor/Hyperparameter_search/main.py", line 149, in
results = tuner.fit()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tuner.py", line 349, in fit
raise TuneError(
ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use tuner = Tuner.restore("/home/trevor/ray_results/bohb_test", trainable=...).
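
For completeness, a minimal restore sketch based on the hint in that last message (assuming the underlying issue is fixed first; the path is the one from my run and train_mnist is the same trainable passed to the original Tuner):

from ray import tune

# Sketch only: resume the interrupted BOHB run after fixing the underlying issue.
# The path and the trainable must match the original run.
tuner = tune.Tuner.restore(
    "/home/trevor/ray_results/bohb_test",
    trainable=train_mnist,
)
results = tuner.fit()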

Versions / Dependencies

ray, version 2.5.1
hpbandster 0.7.4
ConfigSpace 0.7.1

Reproduction script

import torch
import torch.optim as optim
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F
import ray
from ray import air, tune
from ray.tune.schedulers.hb_bohb import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB
from ray.air import session
import ConfigSpace as CS

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # In this example, we don't change the model architecture
        # due to simplicity.
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)


EPOCH_SIZE = 512
TEST_SIZE = 256

def train(model, optimizer, train_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # We set this just for the example to run quickly.
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()


def test(model, data_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            # We set this just for the example to run quickly.
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    return correct / total

def train_mnist(config):
    # Data Setup
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307, ), (0.3081, ))])

    train_loader = DataLoader(
        datasets.MNIST("~/data", train=True, download=True, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)
    test_loader = DataLoader(
        datasets.MNIST("~/data", train=False, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # print("config: HELLO", config)
    model = ConvNet()
    model.to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])
    for i in range(100):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)

        # Send the current training result back to Tune
        session.report({"mean_accuracy": acc})

        if i % 5 == 0:
            # This saves the model to the trial directory
            torch.save(model.state_dict(), "./model.pth")




"""This example demonstrates the usage of BOHB with Ray Tune.

Requires the HpBandSter and ConfigSpace libraries to be installed
(`pip install hpbandster ConfigSpace`).
"""



if __name__ == "__main__":
    ray.init(num_cpus=8)

    # Optional: Pass the parameter space yourself

    config_space = CS.ConfigurationSpace()
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("lr", lower=0.0005, upper=0.1))
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("momentum", lower=0.1, upper=0.9))
    # config_space.add_hyperparameter(
    #     CS.CategoricalHyperparameter(
    #         "activation", choices=["relu", "tanh"]))

    bohb_hyperband = HyperBandForBOHB(
        time_attr="training_iteration",
        max_t=100,
        reduction_factor=4,
        stop_last_trials=False,
    )

    bohb_search = TuneBOHB(
        space=config_space,  # If you want to set the space manually
        metric="mean_accuracy",
        mode="max",
    )
    bohb_search = tune.search.ConcurrencyLimiter(bohb_search, max_concurrent=4)

    tuner = tune.Tuner(
        train_mnist,
        run_config=air.RunConfig(name="bohb_test", stop={"training_iteration": 100}),
        tune_config=tune.TuneConfig(
            metric="mean_accuracy",
            mode="max",
            scheduler=bohb_hyperband,
            search_alg=bohb_search,
            num_samples=50,
        )
    )
    results = tuner.fit()

    print("Best hyperparameters found were: ", results.get_best_result().config)
    # best_result = results.get_best_result()  # Get best result object
    # best_config = best_result.config  # Get best trial's hyperparameters
    # best_logdir = best_result.log_dir  # Get best trial's logdir
    # best_checkpoint = best_result.checkpoint  # Get best trial's best checkpoint
    # best_metrics = best_result.metrics  # Get best trial's last results
    # best_result_df = best_result.metrics_dataframe  # Get best result as pandas dataframe
    # Get a dataframe with the last results for each trial
    df_results = results.get_dataframe()

    # Get a dataframe of results for a specific score or mode
    df = results.get_dataframe(filter_metric="mean_accuracy", filter_mode="max")
    # print("results in df_results: ", df_results)
    # print("results in df: ", df)

Issue Severity

High: It blocks my task from running.

@yap231995 yap231995 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 13, 2023
@jjyao jjyao added the tune Tune-related issues label Jul 17, 2023
@hora-anyscale hora-anyscale changed the title Bug in the BOHP. [Tune] Bug in the BOHP. Jul 19, 2023
@xwjiang2010
Contributor

@krfricke Another one related to execution logic.

@xwjiang2010 xwjiang2010 added P2 Important issue, but not time-critical air and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 2, 2023
@krfricke
Contributor

krfricke commented Aug 2, 2023

This should be fixed in 2.6.0 (#36951)
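
If you are still on 2.5.x, upgrading Ray should pick up the fix, for example (adjust the extras to your environment):

pip install -U "ray[tune]>=2.6.0"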

@krfricke krfricke closed this as completed Aug 2, 2023