What happened + What you expected to happen
I was trying to apply BOHB to my training runs. The only change I made was increasing the number of samples to 50, and the error below popped up; it says there is an issue with Ray Tune's execution logic.
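For context, the change amounts to a single field in the TuneConfig of the reproduction script below; a minimal sketch (the earlier, working value of num_samples is an assumption):

tune_config = tune.TuneConfig(
    metric="mean_accuracy",
    mode="max",
    scheduler=bohb_hyperband,  # defined in the reproduction script below
    search_alg=bohb_search,    # defined in the reproduction script below
    num_samples=50,            # raised from a smaller value (assumption: 10); this triggers the crash
)

The full traceback: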
Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 1149, in _on_trial_reset
self._actor_started(tracked_actor, log="REUSED")
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 719, in _actor_started
self._unstage_trial_with_resources(trial)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 646, in _unstage_trial_with_resources
raise RuntimeError(
RuntimeError: Started a trial with resources requested by a different trial, but this trial was lost. This is an error in Ray Tune's execution logic. Please raise a GitHub issue at https://github.com/ray-project/ray/issues
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tuner.py", line 347, in fit
return self._local_tuner.fit()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 588, in fit
analysis = self._fit_internal(trainable, param_space)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/impl/tuner_internal.py", line 712, in _fit_internal
analysis = run(
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tune.py", line 1070, in run
runner.step()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 256, in step
if not self._actor_manager.next(timeout=0.1):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 224, in next
self._actor_task_events.resolve_future(future)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 118, in resolve_future
on_result(result)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 752, in on_result
self._actor_task_resolved(
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/air/execution/_internal/actor_manager.py", line 300, in _actor_task_resolved
tracked_actor_task._on_result(tracked_actor, result)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 824, in _on_result
raise TuneError(traceback.format_exc())
ray.tune.error.TuneError: Traceback (most recent call last):
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 1149, in _on_trial_reset
self._actor_started(tracked_actor, log="REUSED")
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 719, in _actor_started
self._unstage_trial_with_resources(trial)
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/execution/tune_controller.py", line 646, in _unstage_trial_with_resources
raise RuntimeError(
RuntimeError: Started a trial with resources requested by a different trial, but this trial was lost. This is an error in Ray Tune's execution logic. Please raise a GitHub issue at https://github.com/ray-project/ray/issues
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/trevor/Hyperparameter_search/main.py", line 149, in
results = tuner.fit()
File "/home/trevor/anaconda3/envs/pytorch_2/lib/python3.9/site-packages/ray/tune/tuner.py", line 349, in fit
raise TuneError(
ray.tune.error.TuneError: The Ray Tune run failed. Please inspect the previous error messages for a cause. After fixing the issue, you can restart the run from scratch or continue this run. To continue this run, you can use tuner = Tuner.restore("/home/trevor/ray_results/bohb_test", trainable=...).
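As the error message suggests, the run can be resumed rather than restarted from scratch. A minimal sketch of the restore call, assuming the same train_mnist trainable from the reproduction script below (resume_errored is optional and used here to retry the trials that errored out):

from ray import tune

# Resume the interrupted BOHB run from its experiment directory.
# `train_mnist` must be the same trainable that produced the run.
tuner = tune.Tuner.restore(
    "/home/trevor/ray_results/bohb_test",
    trainable=train_mnist,
    resume_errored=True,  # assumption: retrying errored trials is desired here
)
results = tuner.fit()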
Versions / Dependencies
ray, version 2.5.1
hpbandster 0.7.4
ConfigSpace 0.7.1
Reproduction script
import torch  # needed for torch.device / torch.no_grad / torch.save below
import torch.optim as optim
import torch.nn as nn
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch.nn.functional as F
import ray
from ray import air, tune
from ray.tune.schedulers.hb_bohb import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB
from ray.air import session
import ConfigSpace as CS
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        # In this example, we don't change the model architecture
        # for simplicity.
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)
EPOCH_SIZE = 512
TEST_SIZE = 256
def train(model, optimizer, train_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        # We set this just for the example to run quickly.
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
def test(model, data_loader):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            # We set this just for the example to run quickly.
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    return correct / total
def train_mnist(config):
    # Data Setup
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307,), (0.3081,))])
    train_loader = DataLoader(
        datasets.MNIST("~/data", train=True, download=True, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)
    test_loader = DataLoader(
        datasets.MNIST("~/data", train=False, transform=mnist_transforms),
        batch_size=64,
        shuffle=True)

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # print("config: HELLO", config)
    model = ConvNet()
    model.to(device)

    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])

    for i in range(100):
        train(model, optimizer, train_loader)
        acc = test(model, test_loader)
        # Send the current training result back to Tune
        session.report({"mean_accuracy": acc})
        if i % 5 == 0:
            # This saves the model to the trial directory
            torch.save(model.state_dict(), "./model.pth")
"""This example demonstrates the usage of BOHB with Ray Tune.
Requires the HpBandSter and ConfigSpace libraries to be installed
(`pip install hpbandster ConfigSpace`).
"""
if __name__ == "__main__":
    ray.init(num_cpus=8)

    # Optional: Pass the parameter space yourself
    config_space = CS.ConfigurationSpace()
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("lr", lower=0.0005, upper=0.1))
    config_space.add_hyperparameter(
        CS.UniformFloatHyperparameter("momentum", lower=0.1, upper=0.9))
    # config_space.add_hyperparameter(
    #     CS.CategoricalHyperparameter(
    #         "activation", choices=["relu", "tanh"]))

    bohb_hyperband = HyperBandForBOHB(
        time_attr="training_iteration",
        max_t=100,
        reduction_factor=4,
        stop_last_trials=False,
    )
    bohb_search = TuneBOHB(
        space=config_space,  # If you want to set the space manually
        metric="mean_accuracy",
        mode="max",
    )
    bohb_search = tune.search.ConcurrencyLimiter(bohb_search, max_concurrent=4)

    tuner = tune.Tuner(
        train_mnist,
        run_config=air.RunConfig(name="bohb_test", stop={"training_iteration": 100}),
        tune_config=tune.TuneConfig(
            metric="mean_accuracy",
            mode="max",
            scheduler=bohb_hyperband,
            search_alg=bohb_search,
            num_samples=50,
        ),
    )
    results = tuner.fit()
    print("Best hyperparameters found were: ", results.get_best_result().config)

    # best_result = results.get_best_result()  # Get best result object
    # best_config = best_result.config  # Get best trial's hyperparameters
    # best_logdir = best_result.log_dir  # Get best trial's logdir
    # best_checkpoint = best_result.checkpoint  # Get best trial's best checkpoint
    # best_metrics = best_result.metrics  # Get best trial's last results
    # best_result_df = best_result.metrics_dataframe  # Get best result as pandas dataframe

    # Get a dataframe with the last results for each trial
    df_results = results.get_dataframe()

    # Get a dataframe of results for a specific score or mode
    df = results.get_dataframe(filter_metric="mean_accuracy", filter_mode="max")

    # print("results in df_results: ", df_results)
    # print("results in df: ", df)
Issue Severity
High: It blocks my task from running.