
[Tune] Bug in the BOHP. #37394

Closed
yap231995 opened this issue Jul 13, 2023 · 2 comments
Labels
bug Something that is supposed to be working; but isn't P2 Important issue, but not time-critical tune Tune-related issues

Comments

@yap231995

yap231995 commented Jul 13, 2023

What happened + What you expected to happen

The error below pops up; it says there is an issue with Ray Tune's execution logic. I was applying BOHB to my training runs and simply increased the number of samples to 50.

Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 1149, in _on_trial_reset
self._actor_started(tracked_actor, log="REUSED")
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 719, in _actor_started
self._unstage_trial_with_resources(trial)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 646, in _unstage_trial_with_resources
raise RuntimeError(
RuntimeError: Started a trial with resources requested by a different trial, but this trial was lost. This is an error in Ray Tune's execution logic. Please raise a GitHub issue at https://github.com/ray-project/ray/issues

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tuner.py", line 347, in fit
return self._local_tuner.fit()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 712, in _fit_internal
analysis = run(
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tune.py", line 1070, in run
runner.step()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 256, in step
if not self._actor_manager.next(timeout=0.1):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 224, in next
self._actor_task_events.resolve_future(future)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 118, in resolve_future
on_result(result)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 752, in on_result
self._actor_task_resolved(
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 300, in _actor_task_resolved
tracked_actor_task._on_result(tracked_actor, result)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 824, in _on_result
raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 1149, in _on_trial_reset
self._actor_started(tracked_actor, log="REUSED")
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 719, in _actor_started
self._unstage_trial_with_resources(trial)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 646, in _unstage_trial_with_resources
raise RuntimeError(
RuntimeError: Started a trial with resources requested by a different trial, but this trial was lost. This is an error in Ray Tune's execution logic. Please raise a GitHub issue at https://github.com/ray-project/ray/issues

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/trevor/Hyperparameter_search/main.py", line 149, in
results = tuner.fit()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tuner.py", line 349, in fit
raise TuneError(
ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use tuner = Tuner.restore("/home/trevor/ray_results/bohb_test", trainable=...).
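
For completeness, a minimal restore sketch based on the hint in that last message (assuming the underlying issue is fixed first; the path is the one from my run and train_mnist is the same trainable passed to the original Tuner):

from ray import tune

# Sketch only: resume the interrupted BOHB run after fixing the underlying issue.
# The path and the trainable must match the original run.
tuner = tune.Tuner.restore(
    "/home/trevor/ray_results/bohb_test",
    trainable=train_mnist,
)
results = tuner.fit()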

Versions / Dependencies

ray, version 2.5.1
hpbandster 0.7.4
ConfigSpace 0.7.1

Reproduction script

import torch
import torch.optim as optim
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F
import ray
from ray import air, tune
from ray.tune.schedulers.hb_bohb import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB
from ray.air import session
import ConfigSpace as CS

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # In this example, we don't change the model architecture
        # due to simplicity.
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)


EPOCH_SIZE = 512
TEST_SIZE = 256

def train(model, optimizer, train_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # We set this just for the example to run quickly.
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()


def test(model, data_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            # We set this just for the example to run quickly.
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    return correct / total

def train_mnist(config):
    # Data Setup
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307, ), (0.3081, ))])

    train_loader = DataLoader(
        datasets.MNIST("~/data", train=True, download=True, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)
    test_loader = DataLoader(
        datasets.MNIST("~/data", train=False, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # print("config: HELLO", config)
    model = ConvNet()
    model.to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])
    for i in range(100):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)

        # Send the current training result back to Tune
        session.report({"mean_accuracy": acc})

        if i % 5 == 0:
            # This saves the model to the trial directory
            torch.save(model.state_dict(), "./model.pth")




"""This example demonstrates the usage of BOHB with Ray Tune.

Requires the HpBandSter and ConfigSpace libraries to be installed
(`pip install hpbandster ConfigSpace`).
"""



if __name__ == "__main__":
    ray.init(num_cpus=8)

    # Optional: Pass the parameter space yourself

    config_space = CS.ConfigurationSpace()
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("lr", lower=0.0005, upper=0.1))
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("momentum", lower=0.1, upper=0.9))
    # config_space.add_hyperparameter(
    #     CS.CategoricalHyperparameter(
    #         "activation", choices=["relu", "tanh"]))

    bohb_hyperband = HyperBandForBOHB(
        time_attr="training_iteration",
        max_t=100,
        reduction_factor=4,
        stop_last_trials=False,
    )

    bohb_search = TuneBOHB(
        space=config_space,  # If you want to set the space manually
        metric="mean_accuracy",
        mode="max",
    )
    bohb_search = tune.search.ConcurrencyLimiter(bohb_search, max_concurrent=4)

    tuner = tune.Tuner(
        train_mnist,
        run_config=air.RunConfig(name="bohb_test", stop={"training_iteration": 100}),
        tune_config=tune.TuneConfig(
            metric="mean_accuracy",
            mode="max",
            scheduler=bohb_hyperband,
            search_alg=bohb_search,
            num_samples=50,
        )
    )
    results = tuner.fit()

    print("Best hyperparameters found were: ", results.get_best_result().config)
    # best_result = results.get_best_result()  # Get best result object
    # best_config = best_result.config  # Get best trial's hyperparameters
    # best_logdir = best_result.log_dir  # Get best trial's logdir
    # best_checkpoint = best_result.checkpoint  # Get best trial's best checkpoint
    # best_metrics = best_result.metrics  # Get best trial's last results
    # best_result_df = best_result.metrics_dataframe  # Get best result as pandas dataframe
    # Get a dataframe with the last results for each trial
    df_results = results.get_dataframe()

    # Get a dataframe of results for a specific score or mode
    df = results.get_dataframe(filter_metric="mean_accuracy", filter_mode="max")
    # print("results in df_results: ", df_results)
    # print("results in df: ", df)

Issue Severity

High: It blocks my task from running.

@yap231995 yap231995 added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 13, 2023
@jjyao jjyao added the tune Tune-related issues label Jul 17, 2023
@hora-anyscale hora-anyscale changed the title Bug in the BOHP. [Tune] Bug in the BOHP. Jul 19, 2023
@xwjiang2010
Contributor

@krfricke Another one related to execution logic.

@xwjiang2010 xwjiang2010 added P2 Important issue, but not time-critical air and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Aug 2, 2023
@krfricke
Contributor

krfricke commented Aug 2, 2023

This should be fixed in 2.6.0 (#36951)
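
If you are still on 2.5.x, upgrading Ray should pick up the fix, for example (adjust the extras to your environment):

pip install -U "ray[tune]>=2.6.0"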

@krfricke krfricke closed this as completed Aug 2, 2023