Occasional OOM Errors after automatic batch size search #129
Tagging @lvermue as he is most familiar with it. One question beforehand:
There is no manual allocation of GPU memory outside of pykeen. I simply instantiate a model and three triple factories (using the pykeen API). The training factory is passed to the model and the validation factory is given to the stopper. The test factory is not passed to pykeen at this point, so no GPU memory should have been allocated for it.
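For reference, a minimal sketch of the kind of setup described above, using the small built-in Nations dataset as a stand-in for my data and PyKEEN-1.0-era constructor signatures (exact keyword arguments differ between PyKEEN versions):

```python
import torch
from pykeen.datasets import Nations
from pykeen.evaluation import RankBasedEvaluator
from pykeen.models import TransE
from pykeen.stoppers import EarlyStopper
from pykeen.training import SLCWATrainingLoop

# the dataset bundles the three triples factories: training, validation, testing
dataset = Nations()

# only the training factory is handed to the model
model = TransE(triples_factory=dataset.training)

# the validation factory is only given to the early stopper
evaluator = RankBasedEvaluator()
stopper = EarlyStopper(
    model=model,
    evaluator=evaluator,
    evaluation_triples_factory=dataset.validation,
)

optimizer = torch.optim.Adam(params=model.get_grad_params())
training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)
training_loop.train(num_epochs=100, stopper=stopper)

# dataset.testing is not used here at all, so no GPU memory
# should be allocated for it before the final evaluation
```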
Hi @kantholtz, based on your comment and the log you provided, it looks like this always happens after the first time the early stopper ran, but not at later stages.
Yes, I have a test task running to try to reproduce this error. So far it has crashed twice (with exactly the same stack trace):
I have added a log statement here: https://github.com/pykeen/pykeen/blob/22d381c79f425b60c21f8247a0634ddd6c48f1e0/src/pykeen/training/training_loop.py#L341
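The log statement itself is nothing special; it looks roughly like this (a hypothetical reconstruction, not the exact line), reporting the allocator state via the standard torch.cuda introspection calls:

```python
import logging

import torch

log = logging.getLogger(__name__)

# hypothetical reconstruction of the added log statement:
# report how much memory the CUDA caching allocator currently holds
log.debug(
    'cuda memory: %.1f MiB allocated, %.1f MiB reserved',
    torch.cuda.memory_allocated() / 1024 ** 2,
    torch.cuda.memory_reserved() / 1024 ** 2,
)
```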
In between there have been runs that succeeded. I leave it running for now - maybe it will crash at some other point - but I doubt it will.
Disclaimer: this code is not using the internal pykeen methods for freeing memory - is there an API endpoint for doing this? Looking at the training_loop file, it does not seem like there is. It just happened for an evaluation on the test data, without any training:
The code I have used for that:

```python
metrics = evaluator.evaluate(
    model=train_result.model,
    mapped_triples=keen_dataset.testing.mapped_triples,
    tqdm_kwargs=dict(
        position=1,
        ncols=80,
        leave=False,
    )
)
```

The pykeen proposal:
The embeddings are of size … (note for me: …)
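For reference, a generic way to release cached GPU memory between phases without a dedicated pykeen endpoint is ordinary garbage collection plus releasing the torch allocator cache, roughly:

```python
import gc

import torch

# best-effort cleanup between training and evaluation:
# drop dangling Python references, then release cached CUDA blocks
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```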
Can you share the entire script that launches the training and evaluation? Also, just to exclude the obvious mistakes: is this the only Python process running on that GPU, i.e. you didn't start a second training run and nobody else did either on the GPU at hand?
Yes, it is definitely the only python process running there.
Mhh, that would be a bit too much as it is not a single script but embedded in a bigger project. I can, however, assure you that I do no manual memory allocation, do not call torch directly in any way, and rely only on pykeen calls. Some of the code around the error looks like this:

```python
# excerpt from a larger project; project-internal imports
# (ryn, helper, Config, split, keen, ...) are omitted here

@helper.notnone
def single(
        *,
        config: Config = None,
        split_dataset: split.Dataset = None,
        keen_dataset: keen.Dataset = None,
) -> TrainingResult:

    # TODO https://github.com/pykeen/pykeen/issues/129
    BATCH_SIZE = 250

    # preparation

    if not config.general.seed:
        # choice of range is arbitrary
        config.general.seed = np.random.randint(10**5, 10**7)
        log.info(f'setting seed to {config.general.seed}')

    helper.seed(config.general.seed)

    # initialization

    result_tracker = config.resolve(config.tracker)
    result_tracker.start_run()
    result_tracker.log_params(dataclasses.asdict(config))

    device = resolve_device(
        device_name=config.model.preferred_device)

    # target filtering for ranking losses is enabled by default
    loss = config.resolve(
        config.loss,
    )

    regularizer = config.resolve(
        config.regularizer,
        device=device,
    )

    model = config.resolve(
        config.model,
        loss=loss,
        regularizer=regularizer,
        random_seed=config.general.seed,
        triples_factory=keen_dataset.training,
        preferred_device=device,
    )

    evaluator = config.resolve(
        config.evaluator,
        batch_size=BATCH_SIZE,
    )

    optimizer = config.resolve(
        config.optimizer,
        params=model.get_grad_params(),
    )

    stopper = config.resolve(
        config.stopper,
        model=model,
        evaluator=evaluator,
        evaluation_triples_factory=keen_dataset.validation,
        result_tracker=result_tracker,
        evaluation_batch_size=BATCH_SIZE,
    )

    training_loop = config.resolve(
        config.training_loop,
        model=model,
        optimizer=optimizer,
        negative_sampler_cls=config.sampler.constructor,
        negative_sampler_kwargs=config.sampler.kwargs,
    )

    # training

    ts = datetime.now()
    try:
        losses = training_loop.train(**{
            **dataclasses.asdict(config.training),
            **dict(
                stopper=stopper,
                result_tracker=result_tracker,
                clear_optimizer=True,
            )
        })
    except RuntimeError as exc:
        log.error(f'training error: "{exc}"')
        log.error('sweeping training loop memory up under the rug')

        # not working although documented?
        # result_tracker.wandb.alert(title='RuntimeError', text=msg)
        result_tracker.run.finish(exit_code=1)

        gc.collect()
        training_loop.optimizer.zero_grad()
        training_loop._free_graph_and_cache()
        raise exc

    training_time = Time(start=ts, end=datetime.now())

    result_tracker.log_metrics(
        prefix='validation',
        metrics=dict(best=stopper.best_metric, metric=stopper.metric),
        step=stopper.best_epoch)

    # aggregation
    return TrainingResult(
        created=datetime.now(),
        git_hash=helper.git_hash(),
        config=config,
        # metrics
        training_time=training_time,
        losses=losses,
        # instances
        model=model,
        stopper=stopper,
        result_tracker=result_tracker,
    )


@helper.notnone
def _create_study(
        *,
        config: Config = None,
        out: pathlib.Path = None,
) -> optuna.Study:

    out.mkdir(parents=True, exist_ok=True)
    db_path = out / 'optuna.db'

    timestamp = datetime.now().strftime('%Y.%m.%d-%H.%M')
    study_name = f'{config.model.cls}-sweep-{timestamp}'
    log.info(f'create optuna study "{study_name}"')

    # TODO use direction="maximise"
    study = optuna.create_study(
        study_name=study_name,
        storage=f'sqlite:///{db_path}',
    )

    # if there are any initial values to be set,
    # create and enqueue a custom trial
    params = {
        k: v.initial for k, v in config.suggestions.items()
        if v.initial is not None}

    if params:
        log.info('setting initial study params: ' + ', '.join(
            f'{k}={v}' for k, v in params.items()))
        study.enqueue_trial(params)

    return study


@helper.notnone
def multi(
        *,
        base: Config = None,
        out: pathlib.Path = None,
        split_dataset: split.Dataset = None,
        **kwargs,
) -> None:
    # Optuna lingo:
    #   Trial: a single call of the objective function
    #   Study: an optimization session, which is a set of trials
    #   Parameter: a variable whose value is to be optimized
    assert base.optuna, 'no optuna config found'

    def objective(trial):
        # obtain optuna suggestions
        config = base.suggest(trial)
        name = f'{split_dataset.name}-{config.model.cls}-{trial.number}'
        path = out / f'trial-{trial.number:04d}'

        # update configuration
        tracker = dataclasses.replace(config.tracker, experiment=name)
        config = dataclasses.replace(config, tracker=tracker)

        # run training
        try:
            result = single(
                config=config,
                split_dataset=split_dataset,
                **kwargs)
        except RuntimeError as exc:
            msg = f'objective: got runtime error "{exc}"'
            log.error(msg)

            # post mortem (TODO last model checkpoint)
            config.save(path)
            raise ryn.RynError(msg)

        best_metric = result.stopper.best_metric
        log.info(f'! trial {trial.number} finished: '
                 f'best metric = {best_metric}')

        # min optimization
        result.save(path)
        return -best_metric if base.optuna.maximise else best_metric

    study = _create_study(config=base, out=out)
    study.optimize(
        objective,
        n_trials=base.optuna.trials,
        gc_after_trial=True,
        catch=(ryn.RynError, ),
    )

    log.info('finished study')


@helper.notnone
def train(
        *,
        config: Config = None,
        split_dataset: split.Dataset = None,
        keen_dataset: keen.Dataset = None,
        offline: bool = False,
) -> None:

    time = str(datetime.now()).replace(' ', '_')
    out = ryn.ENV.KGC_DIR / split_dataset.name / f'{config.model.cls}-{time}'
    config.save(out)

    multi(
        out=out,
        base=config,
        split_dataset=split_dataset,
        keen_dataset=keen_dataset)


@helper.notnone
def train_from_kwargs(
        *,
        config: str = None,
        split_dataset: str = None,
        offline: bool = False):

    log.info('running training from cli')
    if offline:
        log.warning('offline run!')

    split_dataset, keen_dataset = _load_datasets(path=split_dataset)
    print(f'\n{split_dataset}\n{keen_dataset}\n')

    config = Config.load(config)
    config.general.dataset = split_dataset.name

    train(
        config=config,
        split_dataset=split_dataset,
        keen_dataset=keen_dataset,
        offline=offline)
```
@kantholtz I assume that #433 and/or #419 also fix this issue. Can you confirm?
Coincidentally, I am testing this at the moment :) Up to this point, no problem has occurred. I wanted to report whether it works at some point, but I'll close this issue now and just re-open it if I encounter the problem again. Thanks!
I will add that I have the same issue, albeit on an Apple silicon (MPS) device. It crashes with OOM in eval. This happens with RotatE and TransD with early stopping, at the least.
The crash happens in CPU-only mode as well as on MPS, and with automatic batch size search disabled. The crash happens in the eval stage.
Describe the bug
Sometimes, with identical configuration, training aborts because of CUDA out of memory errors. It seems as if it does not abort while the evaluator runs, but after evaluation when training is resumed.
To Reproduce
I have yet to find a way to reproduce this.
Expected behavior
It should not run into OOM errors ;)
Environment (please complete the following information):
I have observed this on three different types of cards (with 8, 11 and 24 GB of memory) and two different systems. Unfortunately, I have no control over the installed driver version. As such, I cannot upgrade to a newer CUDA version.
System 1
System 2 (the one I am currently working on)
Additional information
You can circumvent this by setting a fixed batch size. This requires the change described in #125 so that config.training offers a fixed batch size (this is part of my hpo).
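A sketch of that circumvention: pass an explicit batch size to the training loop so the automatic search never runs (the num_epochs field name and the value 256 are placeholders, not pykeen defaults):

```python
# circumvention: skip the automatic batch size search entirely by
# passing an explicit batch size (pick a value that fits your card)
losses = training_loop.train(
    num_epochs=config.training.epochs,  # assumed field name on config.training
    batch_size=256,
    stopper=stopper,
    result_tracker=result_tracker,
    clear_optimizer=True,
)
```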
These are some of the latest errors freshly grepped from my log files (on System 2):