Occasional OOM Errors after automatic batch size search #129
Tagging @lvermue as he is most familiar with it. One question beforehand:
There is no manual allocation of GPU memory outside of pykeen. I simply instantiate a model and three triple factories (using the pykeen API). The training factory is passed to the model and the validation factory is given to the stopper. The test factory is not passed to pykeen at this point, so no GPU memory should have been allocated for it.
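For reference, a minimal sketch of the kind of setup described above, using the small built-in Nations dataset as a stand-in for my data and PyKEEN-1.0-era constructor signatures (exact keyword arguments differ between PyKEEN versions):

```python
import torch
from pykeen.datasets import Nations
from pykeen.evaluation import RankBasedEvaluator
from pykeen.models import TransE
from pykeen.stoppers import EarlyStopper
from pykeen.training import SLCWATrainingLoop

# the dataset bundles the three triples factories: training, validation, testing
dataset = Nations()

# only the training factory is handed to the model
model = TransE(triples_factory=dataset.training)

# the validation factory is only given to the early stopper
evaluator = RankBasedEvaluator()
stopper = EarlyStopper(
    model=model,
    evaluator=evaluator,
    evaluation_triples_factory=dataset.validation,
)

optimizer = torch.optim.Adam(params=model.get_grad_params())
training_loop = SLCWATrainingLoop(model=model, optimizer=optimizer)
training_loop.train(num_epochs=100, stopper=stopper)

# dataset.testing is not used here at all, so no GPU memory
# should be allocated for it before the final evaluation
```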
Hi @kantholtz, based on your comment and the log you provided, it looks like this always happens after the first time the early stopper ran, but not at later stages.
Yes, I have a test task running to try to reproduce this error. So far it has crashed twice (with exactly the same stack trace):
I have added a log statement here: https://github.com/pykeen/pykeen/blob/22d381c79f425b60c21f8247a0634ddd6c48f1e0/src/pykeen/training/training_loop.py#L341
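The log statement itself is nothing special; it looks roughly like this (a hypothetical reconstruction, not the exact line), reporting the allocator state via the standard torch.cuda introspection calls:

```python
import logging

import torch

log = logging.getLogger(__name__)

# hypothetical reconstruction of the added log statement:
# report how much memory the CUDA caching allocator currently holds
log.debug(
    'cuda memory: %.1f MiB allocated, %.1f MiB reserved',
    torch.cuda.memory_allocated() / 1024 ** 2,
    torch.cuda.memory_reserved() / 1024 ** 2,
)
```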
In between there have been runs that succeeded. I leave it running for now - maybe it will crash at some other point - but I doubt it will.
Disclaimer: this code is not using the internal pykeen methods for freeing memory - is there an API endpoint for doing this? Looking at the training_loop file, it does not seem like there is. It just happened for an evaluation on the test data, without any training:
The code I have used for that:

```python
metrics = evaluator.evaluate(
    model=train_result.model,
    mapped_triples=keen_dataset.testing.mapped_triples,
    tqdm_kwargs=dict(
        position=1,
        ncols=80,
        leave=False,
    )
)
```

The pykeen proposal:
The embeddings are of size … (note for me: …)
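For reference, a generic way to release cached GPU memory between phases without a dedicated pykeen endpoint is ordinary garbage collection plus releasing the torch allocator cache, roughly:

```python
import gc

import torch

# best-effort cleanup between training and evaluation:
# drop dangling Python references, then release cached CUDA blocks
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```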
Can you share the entire script that launches the training and evaluation? Also, just to exclude the obvious mistakes: is this the only Python process running on that GPU, i.e. you didn't start a second training run and nobody else did either on the GPU at hand?
Yes, it is definitely the only python process running there.
Mhh, that would be a bit too much as it is not a single script but embedded in a bigger project. I can, however, assure you that I do no manual memory allocation, do not call torch directly in any way, and rely only on pykeen calls. Some of the code around the error looks like this:

```python
# excerpt from a larger project; project-internal imports
# (ryn, helper, Config, split, keen, ...) are omitted here

@helper.notnone
def single(
        *,
        config: Config = None,
        split_dataset: split.Dataset = None,
        keen_dataset: keen.Dataset = None,
) -> TrainingResult:

    # TODO https://github.com/pykeen/pykeen/issues/129
    BATCH_SIZE = 250

    # preparation

    if not config.general.seed:
        # choice of range is arbitrary
        config.general.seed = np.random.randint(10**5, 10**7)
        log.info(f'setting seed to {config.general.seed}')

    helper.seed(config.general.seed)

    # initialization

    result_tracker = config.resolve(config.tracker)
    result_tracker.start_run()
    result_tracker.log_params(dataclasses.asdict(config))

    device = resolve_device(
        device_name=config.model.preferred_device)

    # target filtering for ranking losses is enabled by default
    loss = config.resolve(
        config.loss,
    )

    regularizer = config.resolve(
        config.regularizer,
        device=device,
    )

    model = config.resolve(
        config.model,
        loss=loss,
        regularizer=regularizer,
        random_seed=config.general.seed,
        triples_factory=keen_dataset.training,
        preferred_device=device,
    )

    evaluator = config.resolve(
        config.evaluator,
        batch_size=BATCH_SIZE,
    )

    optimizer = config.resolve(
        config.optimizer,
        params=model.get_grad_params(),
    )

    stopper = config.resolve(
        config.stopper,
        model=model,
        evaluator=evaluator,
        evaluation_triples_factory=keen_dataset.validation,
        result_tracker=result_tracker,
        evaluation_batch_size=BATCH_SIZE,
    )

    training_loop = config.resolve(
        config.training_loop,
        model=model,
        optimizer=optimizer,
        negative_sampler_cls=config.sampler.constructor,
        negative_sampler_kwargs=config.sampler.kwargs,
    )

    # training

    ts = datetime.now()
    try:
        losses = training_loop.train(**{
            **dataclasses.asdict(config.training),
            **dict(
                stopper=stopper,
                result_tracker=result_tracker,
                clear_optimizer=True,
            )
        })
    except RuntimeError as exc:
        log.error(f'training error: "{exc}"')
        log.error('sweeping training loop memory up under the rug')

        # not working although documented?
        # result_tracker.wandb.alert(title='RuntimeError', text=msg)
        result_tracker.run.finish(exit_code=1)

        gc.collect()
        training_loop.optimizer.zero_grad()
        training_loop._free_graph_and_cache()
        raise exc

    training_time = Time(start=ts, end=datetime.now())

    result_tracker.log_metrics(
        prefix='validation',
        metrics=dict(best=stopper.best_metric, metric=stopper.metric),
        step=stopper.best_epoch)

    # aggregation
    return TrainingResult(
        created=datetime.now(),
        git_hash=helper.git_hash(),
        config=config,
        # metrics
        training_time=training_time,
        losses=losses,
        # instances
        model=model,
        stopper=stopper,
        result_tracker=result_tracker,
    )


@helper.notnone
def _create_study(
        *,
        config: Config = None,
        out: pathlib.Path = None,
) -> optuna.Study:

    out.mkdir(parents=True, exist_ok=True)
    db_path = out / 'optuna.db'

    timestamp = datetime.now().strftime('%Y.%m.%d-%H.%M')
    study_name = f'{config.model.cls}-sweep-{timestamp}'
    log.info(f'create optuna study "{study_name}"')

    # TODO use direction="maximise"
    study = optuna.create_study(
        study_name=study_name,
        storage=f'sqlite:///{db_path}',
    )

    # if there are any initial values to be set,
    # create and enqueue a custom trial
    params = {
        k: v.initial for k, v in config.suggestions.items()
        if v.initial is not None}

    if params:
        log.info('setting initial study params: ' + ', '.join(
            f'{k}={v}' for k, v in params.items()))
        study.enqueue_trial(params)

    return study


@helper.notnone
def multi(
        *,
        base: Config = None,
        out: pathlib.Path = None,
        split_dataset: split.Dataset = None,
        **kwargs,
) -> None:
    # Optuna lingo:
    #   Trial: a single call of the objective function
    #   Study: an optimization session, which is a set of trials
    #   Parameter: a variable whose value is to be optimized
    assert base.optuna, 'no optuna config found'

    def objective(trial):
        # obtain optuna suggestions
        config = base.suggest(trial)
        name = f'{split_dataset.name}-{config.model.cls}-{trial.number}'
        path = out / f'trial-{trial.number:04d}'

        # update configuration
        tracker = dataclasses.replace(config.tracker, experiment=name)
        config = dataclasses.replace(config, tracker=tracker)

        # run training
        try:
            result = single(
                config=config,
                split_dataset=split_dataset,
                **kwargs)
        except RuntimeError as exc:
            msg = f'objective: got runtime error "{exc}"'
            log.error(msg)

            # post mortem (TODO last model checkpoint)
            config.save(path)
            raise ryn.RynError(msg)

        best_metric = result.stopper.best_metric
        log.info(f'! trial {trial.number} finished: '
                 f'best metric = {best_metric}')

        # min optimization
        result.save(path)
        return -best_metric if base.optuna.maximise else best_metric

    study = _create_study(config=base, out=out)
    study.optimize(
        objective,
        n_trials=base.optuna.trials,
        gc_after_trial=True,
        catch=(ryn.RynError, ),
    )

    log.info('finished study')


@helper.notnone
def train(
        *,
        config: Config = None,
        split_dataset: split.Dataset = None,
        keen_dataset: keen.Dataset = None,
        offline: bool = False,
) -> None:

    time = str(datetime.now()).replace(' ', '_')
    out = ryn.ENV.KGC_DIR / split_dataset.name / f'{config.model.cls}-{time}'
    config.save(out)

    multi(
        out=out,
        base=config,
        split_dataset=split_dataset,
        keen_dataset=keen_dataset)


@helper.notnone
def train_from_kwargs(
        *,
        config: str = None,
        split_dataset: str = None,
        offline: bool = False):

    log.info('running training from cli')
    if offline:
        log.warning('offline run!')

    split_dataset, keen_dataset = _load_datasets(path=split_dataset)
    print(f'\n{split_dataset}\n{keen_dataset}\n')

    config = Config.load(config)
    config.general.dataset = split_dataset.name

    train(
        config=config,
        split_dataset=split_dataset,
        keen_dataset=keen_dataset,
        offline=offline)
```
@kantholtz I assume that #433 and/or #419 also fix this issue. Can you confirm?
Coincidentally, I am testing this at the moment :) Up to this point, no problem has occurred. I wanted to report whether it works at some point, but I'll close this issue now and just re-open it if I encounter the problem again. Thanks!
I will add that I have the same issue, albeit on an Apple silicon (MPS) device. It crashes with OOM in eval. This happens with RotatE and TransD with early stopping, at the least.
The crash happens in CPU-only mode as well as on MPS, and with automatic batch size search disabled. The crash happens in the eval stage.
Describe the bug
Sometimes, with identical configuration, training aborts because of CUDA out of memory errors. It seems as if it does not abort while the evaluator runs, but after evaluation when training is resumed.
To Reproduce
I have yet to find a way to reproduce this.
Expected behavior
It should not run into OOM errors ;)
Environment (please complete the following information):
I have observed this on three different types of cards (with 8, 11 and 24 GB of memory) and two different systems. Unfortunately, I have no control over the installed driver version. As such, I cannot upgrade to a newer CUDA version.
System 1
System 2 (the one I am currently working on)
Additional information
You can circumvent this by setting a fixed batch size. This requires the change described in #125 so that config.training offers a fixed batch size (this is part of my hpo).
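A sketch of that circumvention: pass an explicit batch size to the training loop so the automatic search never runs (the num_epochs field name and the value 256 are placeholders, not pykeen defaults):

```python
# circumvention: skip the automatic batch size search entirely by
# passing an explicit batch size (pick a value that fits your card)
losses = training_loop.train(
    num_epochs=config.training.epochs,  # assumed field name on config.training
    batch_size=256,
    stopper=stopper,
    result_tracker=result_tracker,
    clear_optimizer=True,
)
```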
These are some of the latest errors freshly grepped from my log files (on System 2):